DSpace / DS-3559

OAI Indexing extremely slow and memory inefficient for bigger amount of items



    • Type: Bug
    • Status: Code Review Needed
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 6.0
    • Fix Version/s: None
    • Component/s: OAI-PMH


      When building the OAI index in 6.x

      sudo JAVA_OPTS="-Xmx2048M -Xms2048M" /dspace/bin/dspace oai import -c

      The process becomes really slow after a while and consumes a lot of memory.

      It is the same problem as described in DS-2965. The solution found for that bug, clearing the Hibernate session, works well for my problem too.

      But pull request #1276 was changed in a later commit.

      I think the problem was that Session#clear() in the DBConnection not only evicts all entities from the cache but also cancels all pending saves/updates/deletes.

      However, it should be safe to use in a read-only use case, such as reading all items from the database so they can be indexed.
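      The read-only pattern described above can be sketched as follows. This is a plain-Java simulation rather than actual DSpace or Hibernate code: the `sessionCache` map, `loadItem`, and the batch size of 100 are hypothetical stand-ins for Hibernate's first-level cache, the OAI item loader, and a clear interval.

```java
import java.util.HashMap;
import java.util.Map;

public class ReadOnlyIndexSketch {
    // Hypothetical stand-in for Hibernate's first-level (session) cache.
    static Map<Integer, String> sessionCache = new HashMap<>();
    static int maxCacheSize = 0;

    static String loadItem(int id) {
        // Simulates an entity load: every loaded object stays in the cache.
        return sessionCache.computeIfAbsent(id, i -> "item-" + i);
    }

    public static void main(String[] args) {
        int batchSize = 100; // clear interval, chosen arbitrarily
        for (int id = 0; id < 10_000; id++) {
            String item = loadItem(id);
            // ... index the item here ...
            maxCacheSize = Math.max(maxCacheSize, sessionCache.size());
            if ((id + 1) % batchSize == 0) {
                // Analogous to Session#clear(): safe in this loop because it
                // never queues saves/updates/deletes, so nothing is lost.
                sessionCache.clear();
            }
        }
        System.out.println("max cache size: " + maxCacheSize);
    }
}
```

      Because the cache is emptied every 100 items, its size is bounded by the batch size instead of growing with the total number of items, which is the effect observed in the measurements below.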


      Some numbers to help understand the problem:

      Using the current code to index 33013 items:

      After 1463 minutes and 26400 items, the 2 GB of memory were completely in use, with 1043967 entities in the Hibernate cache. Indexing one item took about 5 seconds (instead of ~5 milliseconds).

      Trying to evict every object touched by the code, similar to the discovery indexing code:

      It is hard to find every place where something is loaded into the cache (collections, communities, items, metadata values, bundles, bitstreams, ...), so the result is:

      Finished after 175 minutes, with 107695 entities in the Hibernate cache. Indexing one item took about 0.4 seconds (instead of ~5 milliseconds).

      Using Session#clear() on the Hibernate Session object:

      Finished after 9 minutes. As expected, 0 entities in the cache, and about 4 milliseconds to index an item. This should scale to any number of items because the cache size stays constant.


      So I am going to add a pull request that adds the possibility to clear the whole cache, which is great for batch reading of items.

      I could add the flush method as well, to make batch creation of a large number of items possible, as described here:
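      For the batch-creation case, the usual pattern is to flush pending writes and only then clear the session, at a fixed interval. Again a plain-Java simulation rather than the actual DBConnection API: `pendingWrites`, `database`, `save`, `flush`, and the interval of 50 are illustrative stand-ins.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchCreateSketch {
    // Hypothetical stand-ins for the session's pending-write queue
    // and the underlying database.
    static List<String> pendingWrites = new ArrayList<>();
    static List<String> database = new ArrayList<>();
    static int maxPending = 0;

    static void save(String item) {
        pendingWrites.add(item); // queued in the session, not yet written
    }

    static void flush() {
        database.addAll(pendingWrites); // push queued writes to the database
        pendingWrites.clear();          // now clearing the session loses nothing
    }

    public static void main(String[] args) {
        int interval = 50; // flush/clear interval, chosen arbitrarily
        for (int i = 0; i < 1_000; i++) {
            save("item-" + i);
            maxPending = Math.max(maxPending, pendingWrites.size());
            if ((i + 1) % interval == 0) {
                flush(); // flush BEFORE clearing, so no pending writes are lost
            }
        }
        flush(); // final flush for any remainder
        System.out.println("written: " + database.size());
    }
}
```

      Flushing before each clear is the key ordering: it keeps the session-side state bounded by the interval while still persisting every queued write, avoiding the data loss that a bare clear would cause in a write path.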




              Assignee: Unassigned
              Reporter: Christian Scheible