DSpace / DS-2952

SOLR: Full text indexing only includes the text on the last bitstream

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.4
    • Fix Version/s: 6.1
    • Component/s: Discovery, Solr
    • Labels: None
    • Environment: Windows Server 2012 R2 (amd64) version 6.3
    • Attachments: 0
    • Comments: 9
    • Documentation Status: Not Required

      Description

      As discussed in this thread (https://groups.google.com/forum/#!topic/dspace-tech/l4Rzo4Pajoo), SOLR appears to preserve only the full-text indexing results that `dspace filter-media` produces for the last bitstream processed.

      As discussed in the forum, the same behavior appears on the demo site at http://demo.dspace.org/xmlui/discover. A search for "test word document" (including the quotes) should return the handle for Test PDF Document (http://demo.dspace.org/xmlui/handle/10673/5), but it does not, because the index preserved only the full text of the last bitstream on that handle. This mirrors the behavior of our 5.4 installation.

      When an item has multiple bitstreams, only the last one is included in the index, as seen through the SOLR viewer (http://localhost:8080/solr/search/select?q=handle:...). If there are 4 bitstreams, the text of the first 3 is not preserved in the index.
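
To inspect what the index holds for one item, the viewer URL above can be built programmatically. This is a minimal sketch, assuming the standard Solr `q`, `fl`, and `wt` query parameters and the `handle` and `fulltext` field names mentioned in this report. The handle value in the usage line is a hypothetical placeholder, not a real item.

```python
from urllib.parse import urlencode

# Base URL taken from the report; adjust host and port for your installation.
SOLR_SELECT = "http://localhost:8080/solr/search/select"


def build_fulltext_query(handle: str) -> str:
    """Return a Solr select URL that fetches the fulltext field for one handle."""
    params = {
        "q": f'handle:"{handle}"',   # match a single item by its handle
        "fl": "handle,fulltext",     # return only the fields of interest
        "wt": "json",                # ask for a JSON response
    }
    return f"{SOLR_SELECT}?{urlencode(params)}"


# Hypothetical handle, for illustration only.
print(build_fulltext_query("123456789/1"))
```

Opening the printed URL in a browser should show whether `fulltext` contains text from every bitstream or only from the last one.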

      I discovered that if I manually override the order in the DSpace table bundle2bitstream (field bitstream_order), whichever bitstream has the greatest integer is the one that is retained.
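
The observation about bitstream_order is consistent with a "last write wins" pattern. The sketch below only models that symptom and is not DSpace's actual indexing code. The function name and the sample data are hypothetical.

```python
def index_last_wins(bitstreams):
    """Model of the symptom: bitstreams is a list of (bitstream_order, text) pairs."""
    doc = {}
    for _order, text in sorted(bitstreams, key=lambda b: b[0]):
        # Each assignment replaces the previous value, so only the
        # bitstream with the greatest bitstream_order survives.
        doc["fulltext"] = text
    return doc


# Hypothetical sample data, deliberately out of order.
bitstreams = [(3, "text of part C"), (1, "text of part A"), (2, "text of part B")]
doc = index_last_wins(bitstreams)
print(doc["fulltext"])  # text of part C
```

Changing which entry carries the greatest order value changes which text is retained, which matches what was seen after editing bundle2bitstream by hand.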

      The `fulltext` field in the SOLR index ought to account for multiple bitstreams, or it could be expanded to hold multiple fulltext entries. I don't see anything in the SOLR view that alludes to multiple streams being available (e.g. only the last bitstream is mentioned in stream_name), so hopefully this isn't a SOLR limitation. We often have multiple bitstreams and would like them all indexed and full-text searchable.
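
The requested behavior can be sketched as a field that holds one entry per bitstream. This is an illustration of the idea only, assuming a multi-valued `fulltext` field. It is not the fix that was applied in DSpace, and the helper names and sample texts are hypothetical.

```python
def index_all(bitstreams):
    """Keep the extracted text of every bitstream, in bitstream_order."""
    doc = {"fulltext": []}
    for _order, text in sorted(bitstreams, key=lambda b: b[0]):
        doc["fulltext"].append(text)  # append instead of overwrite
    return doc


def matches(doc, phrase):
    """Case-insensitive phrase search across all fulltext entries."""
    return any(phrase.lower() in value.lower() for value in doc["fulltext"])


# Hypothetical texts standing in for a Word file and a PDF on the same item.
item = [(1, "This is a test Word document"), (2, "This is a test PDF document")]
doc = index_all(item)
print(matches(doc, "test word document"))  # True
```

With every entry retained, a phrase that occurs only in the first bitstream still matches the item.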

      This is my first JIRA posting so feel free to administratively update it as appropriate.

              People

              Assignee: Unassigned
              Reporter: vtrain vtown
              Votes: 2
              Watchers: 8