Per the DSpace meeting, this issue will require restructuring the full-text components in Solr. Cases to consider:
1. Item with no bitstreams
2. Item with one public bitstream
3. Item with multiple public bitstreams
4. Item with one restricted bitstream
5. Item with multiple restricted bitstreams
6. Item with public and restricted bitstreams
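
The six cases above can be told apart mechanically, which may help when building test fixtures. The following is a minimal sketch in Python, not DSpace code: the `classify` function and the idea of modelling each bitstream as a single "is public" flag are assumptions made for illustration only.

```python
from typing import List


def classify(bitstreams_public: List[bool]) -> int:
    """Return the case number (1-6) for an item.

    Each entry is True for a public bitstream and False for a
    restricted one. This boolean model is a hypothetical
    simplification, not the DSpace data model.
    """
    if not bitstreams_public:
        return 1  # no bitstreams
    n_public = sum(bitstreams_public)
    n_restricted = len(bitstreams_public) - n_public
    if n_restricted == 0:
        return 2 if n_public == 1 else 3  # public only
    if n_public == 0:
        return 4 if n_restricted == 1 else 5  # restricted only
    return 6  # mixed public and restricted
```

For example, `classify([True, False])` falls into case 6, the case where merging all full text into one index record is most likely to lose permission detail.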
Per DSpace meeting discussion on 2/15/2017:
[12:38:40] ‹kompewter› [ https://jira.duraspace.org/browse/DS-2952 ] - DS-2952 SOLR: Full text indexing only includes the text on the last bitstream - DuraSpace JIRA
[12:38:50] ‹tdonohue› DSPR#1595
[12:38:52] ‹kompewter› [ https://github.com/DSpace/DSpace/pull/1595 ] - DS-2952 SOLR full text indexing multiple bitstreams by tomdesair
[12:39:47] ‹tdonohue› terry-b: so, you tested the PR and had an outstanding question here. Was that a concern?
[12:40:24] ‹terry-b› I did not see a reply from Tom. What is your opinion of that question?
[12:40:48] ‹terry-b› I can't say that I have tested this carefully in my own instance
[12:42:15] ‹tdonohue› Either it should (a) not index non-public bitstreams (which is slightly less than ideal if you want to search inside files as an Admin), OR (b) the indexed copy should end up with the same permissions as the original bitstream (so only admins can search within admin restricted files)
[12:42:48] ‹tdonohue› I'm not entirely sure whether this PR checks permissions at all...if it doesn't that'd be problematic (as you don't want private files indexed as public)
[12:43:17] ‹terry-b› With this implementation, all full text is merged into the item record in solr, so nuanced permissions would be lost
[12:43:33] ‹terry-b› (when multiple bitstreams exist)
[12:44:15] ‹terry-b› I have just assumed that full text bitstreams had permissions mirroring their originals, but that is impossible with the solr representation
[12:45:42] ‹tdonohue› Hmm...this worries me a bit. I would really like to see a Unit Test which shows what happens when restricted bitstreams are encountered. It's nice this has UnitTests, but I see nothing in this code that looks at permissions (at all)
[12:46:11] ‹tdonohue› So, I'm worried this might index private bitstreams as public....or it might just fail on private bitstreams (not sure)
[12:46:20] ‹terry-b› It sounds like we should make it a requirement to look at permissions.
[12:46:44] ‹terry-b› It would also be good to verify that items with a single restricted bitstream do not end up in the full text index
[12:46:49] ‹tdonohue› I'll add a new review to the PR requiring that we have the code updated for that.
[12:46:50] ‹mhwood› Do we get one index record per bitstream, or one per item? If the latter, we have a real problem.
[12:47:27] ‹mhwood› I.e. what permissions do we put on the index record?
[12:47:32] ‹terry-b› One per item with this PR.
[12:47:57] ‹mhwood› Then the index record should have the union of all the restrictions of the bitstreams that it represents.
[12:47:59] ‹terry-b› I think the full text in solr would have the permissions of the item itself
[12:48:06] ‹tdonohue› terry-b++ This seems to append all extracted text as one index record
[12:48:38] ‹terry-b› I will take a todo to test existing behavior and provide you all with an update
[12:48:52] ‹mhwood› Thanks!
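
To make the two approaches from the discussion concrete, here is a minimal sketch in Python. It is not DSpace or Solr code: `Bitstream`, `index_public_only`, `index_with_restrictions`, and `may_search` are hypothetical names, and the permission model (a set of groups allowed to read, empty meaning public) is an assumption. The first function follows option (a), indexing only public bitstreams. The second is one possible reading of "the union of all the restrictions": the single per-item record keeps every bitstream's restriction, and a user may search it only if they satisfy all of them.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Bitstream:
    text: str
    # Groups allowed to read; an empty set means the bitstream is public.
    # (Assumed model for illustration, not DSpace's resource policies.)
    restricted_to: FrozenSet[str] = frozenset()


def index_public_only(bitstreams: List[Bitstream]) -> List[str]:
    """Option (a): merge only text extracted from public bitstreams."""
    return [b.text for b in bitstreams if not b.restricted_to]


def index_with_restrictions(
    bitstreams: List[Bitstream],
) -> Tuple[List[str], List[FrozenSet[str]]]:
    """One record per item, carrying every bitstream's restriction."""
    texts = [b.text for b in bitstreams]
    restrictions = [b.restricted_to for b in bitstreams if b.restricted_to]
    return texts, restrictions


def may_search(restrictions: List[FrozenSet[str]],
               user_groups: FrozenSet[str]) -> bool:
    """A user must belong to an allowed group for every restriction."""
    return all(user_groups & allowed for allowed in restrictions)
```

Under this reading, an item with one public and one admin-only bitstream yields a record whose full text is searchable only by admins, so the public text also becomes unsearchable for anonymous users. That trade-off is what a per-item record forces; a per-bitstream record would avoid it.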