As Tom Desair noted in
DS-3086, all batch operations in DSpace (those which employ long-running database transactions) are susceptible to memory over-use due to Hibernate caching. This manifests as a severe slowdown after a few hundred items have been processed in a batch operation.
I have confirmed that this is the case with the dspace ingest command, which ingests a directory full of items in Simple Archive Format. The command currently performs all of its work in one giant database transaction. After a couple hundred items there is a very noticeable slowdown in processing, to the point where ingesting several thousand items would be impractical.
I have also confirmed that using the new Context.enableBatchMode, Context.getCacheSize, and Context.commit methods (commit clears the Hibernate cache in addition to committing the underlying database connection) makes the problem go away for ingest.
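The pattern described above can be sketched as a loop that commits (and thereby clears the cache) whenever the cache grows past a threshold. The Context class below is a minimal stand-in written for this sketch so it runs self-contained; it is not the real org.dspace.core.Context, and the CACHE_LIMIT threshold is a hypothetical value, not one taken from DSpace.

```java
import java.util.ArrayList;
import java.util.List;

public class BatchIngestSketch {

    /** Minimal stand-in mimicking the Context methods named in this issue. */
    static class Context {
        private final List<Object> cache = new ArrayList<>();
        private int commits = 0;

        void enableBatchMode(boolean enable) { /* no-op in this stand-in */ }

        int getCacheSize() { return cache.size(); }

        /** Commits the connection and clears the entity cache (as the real commit does). */
        void commit() {
            cache.clear();
            commits++;
        }

        void cacheEntity(Object entity) { cache.add(entity); }

        int getCommitCount() { return commits; }
    }

    /** Hypothetical threshold; tune to heap size and item complexity. */
    static final int CACHE_LIMIT = 100;

    static Context ingestAll(int itemCount) {
        Context context = new Context();
        context.enableBatchMode(true);
        for (int i = 0; i < itemCount; i++) {
            // Stands in for ingesting one Simple Archive Format item.
            context.cacheEntity("item-" + i);
            // Commit periodically so the Hibernate cache cannot grow unbounded.
            if (context.getCacheSize() >= CACHE_LIMIT) {
                context.commit();
            }
        }
        context.commit();   // final commit for the remainder
        return context;
    }

    public static void main(String[] args) {
        Context ctx = ingestAll(250);
        System.out.println("commits=" + ctx.getCommitCount());
        System.out.println("cacheSize=" + ctx.getCacheSize());
    }
}
```

With 250 items and a limit of 100, the loop commits twice mid-run and once at the end, and the cache is empty afterward, which is the behavior that keeps memory flat during a long ingest.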
It is very likely that this problem also exists for several, if not all of the following:
- Discovery reindexing (except the -i option, which already employs the new methods to avoid memory over-use; see the PR)
- Curation tasks
- Item import/export
- CSV batch export and modification
- Possibly others