When performing a backup to DuraCloud using the Replication Task Suite, larger files (>400MB) sometimes fail at random with "Caused by: java.net.SocketException: Connection reset" errors from Amazon S3 storage. In the DSpace logs, these errors look like:
Could not add content ITEM@123456-789.zip with type application/zip and size 466096426 to S3 bucket akiajpoktiep72aase4a.my-backup due to error: Encountered an exception and couldn't reset the stream to retry
Unfortunately, when this error is encountered (from the command line or the Admin UI), the entire backup to DuraCloud halts and must be restarted from the beginning.
After talking with the DuraCloud team, it sounds like these are issues in Amazon S3 itself. They are essentially temporary timeouts: if you try the upload again, it almost always succeeds the second time.
The recommended resolution is to catch the error and automatically retry the upload to DuraCloud, up to a set number of times.
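A minimal sketch of the retry idea in Java follows. It is an illustration under assumptions, not the Replication Task Suite's actual code: the class `RetryingUpload`, the method `withRetry`, and its parameters are hypothetical names, and catching `IOException` (the parent of `java.net.SocketException`) is a guess, since the DuraCloud client may wrap the failure in its own exception type. Because the log message says the stream could not be reset, the `Callable` should probably reopen the file on every attempt rather than reuse a stream.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper; not part of DSpace or DuraCloud.
final class RetryingUpload {

    /**
     * Runs the upload, retrying on IOException until it succeeds or
     * maxAttempts attempts have been made. Waits delayMillis between attempts.
     * Assumption: the transient S3 failure surfaces as an IOException.
     */
    static <T> T withRetry(Callable<T> upload, int maxAttempts, long delayMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // The callable is expected to open a fresh stream each time.
                return upload.call();
            } catch (IOException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis);
                }
            }
        }
        // Every attempt failed: rethrow the most recent error.
        throw last;
    }
}
```

A caller would wrap the single-file upload in a lambda, for example `RetryingUpload.withRetry(() -> uploadOnce(file), 3, 5000L)`, where `uploadOnce` is a placeholder for whatever method performs one upload. Any exception other than `IOException` propagates immediately without a retry.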
In addition, we should enhance the error handling in the Replication Task Suite so that it can report individual backup failures but continue the backup process. We should not always return a complete failure if a single error is encountered. Instead, we should back up what content we can and report which content failed to be backed up.
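The "continue and report" behavior could look roughly like the sketch below. All names here (`BackupRun`, `Uploader`, `backupAll`) are hypothetical and do not come from the Replication Task Suite, and the sketch again assumes upload failures surface as `IOException`.

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not part of DSpace or DuraCloud.
final class BackupRun {

    /** Stand-in for whatever call uploads one package to DuraCloud. */
    interface Uploader {
        void upload(String packageName) throws IOException;
    }

    /**
     * Attempts to upload every package and keeps going after a failure.
     * Returns a map of failed package name to error message, in the order
     * the failures occurred. An empty map means everything was backed up.
     */
    static Map<String, String> backupAll(List<String> packages, Uploader uploader) {
        Map<String, String> failures = new LinkedHashMap<>();
        for (String name : packages) {
            try {
                uploader.upload(name);
            } catch (IOException e) {
                // Record the failure and move on to the next package.
                failures.put(name, String.valueOf(e.getMessage()));
            }
        }
        return failures;
    }
}
```

The caller can then decide how to report the outcome, for instance by treating a non-empty map as a partial failure and listing the affected packages so that only those need to be retried.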