  DSpace / DS-2000

Replication Task Suite backup to DuraCloud fails if a single upload fails


    Details

    • Type: Bug
    • Status: Accepted / Claimed
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: Replication Task Suite
    • Labels:
      None
    • Attachments:
      0
    • Comments:
      0
    • Documentation Status:
      Needed

      Description

      When performing a backup to DuraCloud using the Replication Task Suite, larger files (>400MB) will sometimes fail with random "Caused by: java.net.SocketException: Connection reset" errors from Amazon S3 storage. In the DSpace logs, these errors look like:

      Could not add content ITEM@123456-789.zip with type application/zip and size 466096426 to S3 bucket akiajpoktiep72aase4a.my-backup due to error: Encountered an exception and couldn't reset the stream to retry
      at org.dspace.ctask.replicate.store.DuraCloudObjectStore.uploadReplica(DuraCloudObjectStore.java:193)
      at org.dspace.ctask.replicate.store.DuraCloudObjectStore.transferObject(DuraCloudObjectStore.java:159)
      at org.dspace.ctask.replicate.ReplicaManager.transferObject(ReplicaManager.java:259)
      at org.dspace.ctask.replicate.TransmitAIP.perform(TransmitAIP.java:68)
      at org.dspace.curate.ResolvedTask.perform(ResolvedTask.java:88)
      at org.dspace.curate.Curator$TaskRunner.run(Curator.java:563)

      Unfortunately, when this error is encountered (whether the backup is run from the command line or the Admin UI), the entire backup to DuraCloud halts and must be restarted from the beginning.

      After talking with the DuraCloud team, it sounds like these are issues in Amazon S3 itself: essentially temporary timeouts (if you retry the upload, it will almost always succeed the second time).

      The recommended resolution is to catch this error and automatically retry the upload to DuraCloud (a set number of times).
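      A minimal sketch of that retry approach is below. The names doUpload, MAX_RETRIES, and RetryingUploader are hypothetical stand-ins; the real upload logic lives in DuraCloudObjectStore.uploadReplica(), and the retry count would presumably come from the replication configuration rather than a constant.

      import java.io.File;
      import java.io.IOException;

      // Sketch only: doUpload() is a hypothetical stand-in for the upload logic
      // currently in DuraCloudObjectStore.uploadReplica().
      public class RetryingUploader
      {
          private static final int MAX_RETRIES = 3; // assumed configurable

          public void uploadWithRetry(File file, String group) throws IOException
          {
              IOException lastError = null;
              for (int attempt = 1; attempt <= MAX_RETRIES; attempt++)
              {
                  try
                  {
                      doUpload(file, group); // the actual DuraCloud upload
                      return;                // success: no further attempts needed
                  }
                  catch (IOException e)
                  {
                      lastError = e;
                      System.err.println("Upload of " + file.getName() + " failed (attempt "
                              + attempt + " of " + MAX_RETRIES + "), retrying");
                  }
              }
              throw lastError; // every attempt failed: surface the last error to the caller
          }

          private void doUpload(File file, String group) throws IOException
          {
              // placeholder for the real DuraCloud transfer
          }
      }

      In practice the retry would wrap only the upload call inside uploadReplica(), so a transient SocketException triggers another attempt before any exception propagates up to the curation task.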

      In addition, we should enhance the error handling in the Replication Task Suite so that individual backup failures can be reported while the backup process continues. A single error should not always result in a complete failure; instead, we should back up what content we can and report which content failed to be backed up.
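      A rough sketch of that "continue and report" behavior follows. Here transferObject and BackupRunner are hypothetical placeholders standing in for the per-object transfer in ReplicaManager.transferObject(); the exact reporting mechanism would depend on how curation task results are surfaced.

      import java.util.ArrayList;
      import java.util.List;

      // Sketch only: collect per-object failures and report them at the end of
      // the run instead of halting the whole backup on the first error.
      public class BackupRunner
      {
          public void backupAll(List<String> objectIds)
          {
              List<String> failed = new ArrayList<String>();

              for (String id : objectIds)
              {
                  try
                  {
                      transferObject(id); // hypothetical per-object upload
                  }
                  catch (Exception e)
                  {
                      // Record the failure and continue with the remaining objects
                      failed.add(id + ": " + e.getMessage());
                  }
              }

              if (!failed.isEmpty())
              {
                  System.err.println("Backup finished with " + failed.size() + " failure(s):");
                  for (String f : failed)
                  {
                      System.err.println("  " + f);
                  }
              }
          }

          private void transferObject(String id) throws Exception
          {
              // placeholder for the real transfer via the object store
          }
      }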


            People

            Assignee:
            tdonohue Tim Donohue
            Reporter:
            tdonohue Tim Donohue
            Votes:
            0
            Watchers:
            1
