Unreliable coverage report uploads

Description

I’m seeing frequent network errors when attempting to upload a coverage report to codecov.io. Typically one or two test environments fail per CI run, and which ones fail is random.

==> Uploading
    .url https://codecov.io
    .query slug=lukeburden%2Fdjango-zengo&commit=8ab188e0d06fad32df380ba18e4a2d0b68111a57&job=579.0&flags=py34dj111&branch=master&package=py2.0.15&service=circleci&build=579.0
    Pinging Codecov...
Error: HTTPSConnectionPool(host='codecov.io', port=443): Max retries exceeded with url: /upload/v4?slug=lukeburden%2Fdjango-zengo&commit=8ab188e0d06fad32df380ba18e4a2d0b68111a57&job=579.0&flags=py34dj111&branch=master&package=py2.0.15&service=circleci&build=579.0 (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7fd7d7e72fd0>: Failed to establish a new connection: [Errno 110] Connection timed out',))

https://circleci.com/gh/lukeburden/django-zengo/579

Repository

Steps to Reproduce

  1. Run CI for the above mentioned project

Expected behavior:
All tests should pass - they do locally, and which one fails is random, depending on the network reliability of codecov’s upload API.

Actual behavior:
1 or more environments fails per CI run.

Flakiness? [Does this happen all the time or only sometimes?]
Definitely flakey - which one fails is seemingly random.

Versions

Using python codecov 2.0.15

there’s a few issues filed on this on the python codecov uploader client.

the solution (beyond fixing the python uploader) seems to be switching to the codecov bash uploader.

1 Like

As @kapilt said, @lukeburden, please try the bash uploader. That will help us narrow down if the issue is in the python form, or something lower down with the coverage report itself.

Thanks, I’ll try that next time it hits a flaky patch.

1 Like

A follow-up on this - the bash method does prove to be more reliable, simply because it has a retry mechanism:

This is a nice work-around, but it seems to me there are still some reliability issues with the receiving endpoint that need to be addressed.

We’ve dug into this issue a bit more and think we have a handle on it. The fix is in testing, I’ll let you know when it’s deployed.

In case it is helpful, I have custom uploaders that use curl and the Powershell Net.WebClient and run into this issue on at least one of the CI matrix environments for every build.

Out of say 30 uploads, usually at least one or two fail. This is after I added a bunch of code to retry the upload up to 6 times with a 5 second timeout in between.