Terry Brady [11:29 AM]
We run the PDF Text Extractor. For items with no extractable text, we end up with something like the following in SOLR. Have others worked around this issue?
Sample Solr FullText
" \n \nstream_source_info whistleblowing.pdf.txt \nstream_content_type text/plain \nstream_size 222 \nContent-Encoding ISO-8859-1 \nstream_name whistleblowing.pdf.txt \nContent-Type text/plain; charset=ISO-8859-1 \n \n \n \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n \n "
The "content-type" can sometimes appear in search snippets.
Tom Desair [11:34 AM]
The "content-type" part is added by SOLR, so I'm not sure you can exclude that.
You can adjust the `FullTextContentStreams.buildFullTextList` method to ignore extracted text bitstreams with a size < 500 or 1000.
That class is passed to SOLR when indexing:
Terry Brady [11:36 AM]
Thanks @tom_desair. I like that suggestion!
I will try it out locally. If I like the result, I will post a PR as an option.
Tom Desair [11:38 AM]
Maybe you can make the "minimal required size" for extracted text bitstreams configurable, so that only bitstreams above it get indexed.
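A minimal sketch of the size-threshold idea discussed above, written as standalone Java so it runs without DSpace on the classpath. It does not reproduce the real `FullTextContentStreams.buildFullTextList` code. The `MinSizeFilter` and `ExtractedText` names, the 500 default, and the idea of reading the threshold from a string setting are all assumptions for illustration, and sizes are assumed to be in bytes (matching `stream_size 222` in the sample).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: decides which extracted-text bitstreams are large
// enough to be worth sending to the search index.
class MinSizeFilter {

    // Hypothetical stand-in for an extracted-text bitstream (name + size in bytes).
    static class ExtractedText {
        final String name;
        final long sizeBytes;

        ExtractedText(String name, long sizeBytes) {
            this.name = name;
            this.sizeBytes = sizeBytes;
        }
    }

    // Assumed default, taken from the lower figure mentioned in the discussion.
    static final long DEFAULT_MIN_SIZE = 500L;

    // Parse a configured threshold; fall back to the default when the
    // setting is missing or not a valid number.
    static long parseMinSize(String configured) {
        if (configured == null || configured.trim().isEmpty()) {
            return DEFAULT_MIN_SIZE;
        }
        try {
            return Long.parseLong(configured.trim());
        } catch (NumberFormatException e) {
            return DEFAULT_MIN_SIZE;
        }
    }

    // Keep only the streams whose size is at least minSize bytes.
    static List<ExtractedText> keepIndexable(List<ExtractedText> streams, long minSize) {
        List<ExtractedText> kept = new ArrayList<>();
        for (ExtractedText stream : streams) {
            if (stream.sizeBytes >= minSize) {
                kept.add(stream);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<ExtractedText> streams = new ArrayList<>();
        streams.add(new ExtractedText("whistleblowing.pdf.txt", 222L)); // near-empty extraction
        streams.add(new ExtractedText("report.pdf.txt", 4096L));

        long minSize = parseMinSize(null); // no setting supplied, so the default applies
        for (ExtractedText stream : keepIndexable(streams, minSize)) {
            System.out.println(stream.name); // prints "report.pdf.txt"
        }
    }
}
```

With a threshold of 500, the 222-byte `whistleblowing.pdf.txt` from the sample would be skipped. A fixed cutoff can also drop short documents that do contain real text, which is one reason to keep the value configurable.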