Currently, Islandora Solr displays every page object found in a result set as part of the search results. This is often undesirable: because each page carries its own OCR and other metadata, hundreds or thousands of pages can show up in a result set, obscuring more useful results.
The purpose of this ticket is to:
1) Make it possible to filter page objects from Solr results queries in a simple manner similar to the Islandora Compound Solr filtering query,
2) Make it possible to store consolidated OCR at the book level so that OCR search still works when page objects are filtered out,
3) Provide ways of consolidating the OCR during batch ingest (Drush and interface),
4) Provide a way of updating that consolidated OCR if necessary, and
5) Provide a Drush method for updating the OCR of existing objects for sites that wish to use this feature.
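As a rough illustration of goal 1, the sketch below builds a negative Solr filter query that drops page objects from a result set. It is not the Islandora implementation: the field name `RELS_EXT_hasModel_uri_ms` is an assumption about how the content model is indexed (the real name depends on the site's Solr configuration), and `page_filter_query`, `search_params`, and the `hide_pages` switch are hypothetical names used only for this example.

```python
from urllib.parse import urlencode

# Assumption: the content model is indexed in this Solr field; the actual
# field name depends on the site's Solr/GSearch configuration.
MODEL_FIELD = "RELS_EXT_hasModel_uri_ms"
PAGE_MODEL_URI = "info:fedora/islandora:pageCModel"


def page_filter_query(field=MODEL_FIELD, model_uri=PAGE_MODEL_URI):
    """Return a negative Solr filter query that excludes page objects."""
    # Escape backslashes and quotes so the URI stays a single quoted phrase.
    escaped = model_uri.replace("\\", "\\\\").replace('"', '\\"')
    return '-{}:"{}"'.format(field, escaped)


def search_params(user_query, hide_pages=False):
    """Build Solr request parameters; the page filter is added only on request."""
    params = [("q", user_query)]
    if hide_pages:  # hypothetical switch, off by default
        params.append(("fq", page_filter_query()))
    return params


if __name__ == "__main__":
    # Print the URL-encoded parameters for a filtered search.
    print(urlencode(search_params("whaling", hide_pages=True)))
```

Keeping the filter in `fq` rather than in `q` leaves the user's query untouched, which is the same general shape as a compound-object filter.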
- Sites that store a large number of books with high page counts may want to show only paged content objects in Solr search results, and not the pages themselves.
- Sites that wish to use this feature may want to update existing paged content objects to this method as well.
The biggest problem here is figuring out how to make sure that OCR can be searched even if page objects are being filtered out. Ideally we don't want to do this in a way that has to read individual OCR datastreams from page objects out of Fedora (e.g., after ingest or during the index process).
The least intensive approach seems to be attaching a consolidated OCR datastream to each book object while its pages are being created: maintain a consolidated OCR file and append each page's OCR to it as that page is made. We have precedent for this; it's how aggregated PDFs are currently done.
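The append-as-you-go idea can be sketched as follows. This is a minimal illustration using plain files, assuming each page's OCR text is already in hand when that page is created; `append_page_ocr` and `consolidate_book_ocr` are hypothetical helper names, and attaching the resulting file to the book object as a datastream is left out.

```python
import os
import tempfile


def append_page_ocr(consolidated_path, page_ocr_text):
    """Append one page's OCR text to the running book-level OCR file."""
    with open(consolidated_path, "a", encoding="utf-8") as consolidated:
        # Normalise trailing newlines so every page ends with exactly one.
        consolidated.write(page_ocr_text.rstrip("\n"))
        consolidated.write("\n")


def consolidate_book_ocr(page_ocr_texts, consolidated_path):
    """Append OCR for each page in order and return the consolidated text."""
    for text in page_ocr_texts:
        append_page_ocr(consolidated_path, text)
    with open(consolidated_path, encoding="utf-8") as consolidated:
        return consolidated.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "book_ocr.txt")
        print(consolidate_book_ocr(["Page one text.", "Page two text.\n"], path))
```

Because each page is appended at the moment it is made, nothing has to be read back out of Fedora later to build the book-level OCR.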
This touches many parts of the Paged Content functionality, so any part of Paged Content that creates, updates, or otherwise uses OCR should be checked. We should also check that this doesn't impact existing or new sites that don't want to use this functionality.
Because this will bake functionality into batch ingest (Drush and interface) and into OCR regeneration, we should ensure that existing functionality is maintained wherever possible. In short, we shouldn't force anyone to append OCR to paged content objects, and the default should preserve current behaviour.
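One way to picture the opt-in requirement is a setting that defaults to off, so an ingest that does not ask for consolidation produces exactly what it does today. The names below (`IngestOptions`, `aggregate_ocr`, `datastreams_for_book`) and the use of an `OCR` datastream ID on the book are all hypothetical, chosen only for this sketch.

```python
from dataclasses import dataclass


@dataclass
class IngestOptions:
    # Hypothetical option name; off by default so current behaviour is kept.
    aggregate_ocr: bool = False


def datastreams_for_book(page_ocr_texts, options=None):
    """Return the extra book-level datastreams an ingest would add."""
    options = options or IngestOptions()
    datastreams = {}
    if options.aggregate_ocr:
        # Only sites that opt in get a consolidated OCR datastream.
        datastreams["OCR"] = "\n".join(page_ocr_texts)
    return datastreams
```

With no options passed, the function returns an empty mapping, which mirrors the requirement that sites not using the feature see no change.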
discoverygarden inc. | Managing Digital Content