This request is from Google Scholar (Anurag Acharya and Darcy Darpa), and relates back to the decisions made in
As of DSpace 4, in the DSpace API (org.dspace.app.util.GoogleMetadata), our logic for identifying the file to link in the citation_pdf_url is now:
- If an Item has only one file (i.e. bitstream) and it's publicly available, link to it
- Else, if an Item has multiple files and one is specified as the "primary bitstream" (and it's publicly available), link to it
- Otherwise, just link to the first publicly available file (in the ORIGINAL bundle). In public items, this often is the file that appears first in the file listing on the "View Item" page.
Direct link to logic in the 5.x codebase: https://github.com/DSpace/DSpace/blob/dspace-5_x/dspace-api/src/main/java/org/dspace/app/util/GoogleMetadata.java#L1046
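For illustration, the selection order described above can be sketched roughly as follows. This is not the actual GoogleMetadata code: FileEntry and CitationPdfSelector are hypothetical stand-ins for the real DSpace classes, and the input list is assumed to be the contents of the ORIGINAL bundle in display order.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for a DSpace bitstream; NOT the real DSpace API.
final class FileEntry {
    final String name;
    final boolean isPublic;   // assumed: readable by anonymous users
    final boolean isPrimary;  // assumed: flagged as the "primary bitstream"

    FileEntry(String name, boolean isPublic, boolean isPrimary) {
        this.name = name;
        this.isPublic = isPublic;
        this.isPrimary = isPrimary;
    }
}

// Hypothetical helper mirroring the three rules listed in this ticket.
final class CitationPdfSelector {
    // files: assumed to be the ORIGINAL bundle's files, in display order.
    static Optional<FileEntry> select(List<FileEntry> files) {
        // Rule 1: a single file is linked only if it is publicly available.
        if (files.size() == 1) {
            FileEntry only = files.get(0);
            return only.isPublic ? Optional.of(only) : Optional.empty();
        }
        // Rule 2: a publicly available primary bitstream wins.
        for (FileEntry f : files) {
            if (f.isPrimary && f.isPublic) {
                return Optional.of(f);
            }
        }
        // Rule 3: otherwise fall back to the first publicly available file.
        for (FileEntry f : files) {
            if (f.isPublic) {
                return Optional.of(f);
            }
        }
        return Optional.empty();
    }
}
```

Note that rule 3 never looks at the file type, which is why a JPEG listed ahead of a PDF can end up in citation_pdf_url.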
While this logic is an improvement over the past (DSpace 3.x and below), it may still produce odd results in specific scenarios, including:
- If an Item has multiple files, one textual and the other image-based, and neither is flagged as "primary", then the image-based file will appear in the citation_pdf_url if it is first in the list.
- Google Scholar has found this scenario in some sites where they have a JPEG image of the first page (e.g. a thumbnail) uploaded alongside a PDF copy of the document. If neither is flagged as "primary" and the JPEG is first, then citation_pdf_url will contain a link to the JPEG instead of the (much more appropriate) PDF.
The recommendation would be to create a basic "whitelist" of common textual formats which are valid for the citation_pdf_url field. This whitelist would simply help ensure that non-textual documents are NOT referenced in citation_pdf_url even if they are listed first.
- Initial recommended whitelist from Google Scholar is: PDF, PS, DOC/DOCX, RTF, EPUB.
- Anurag estimates this should cover 99.99% or more of sites for SEO.
- Optionally, we could allow administrators to update/modify this whitelist, but it may not be necessary (or even recommended). If we went this route, we'd need to warn that updating it may affect SEO.
- Additional recommendation from Anurag: If there are multiple whitelist-format files and none of them are marked "primary", pick the longest (largest) one. This would handle the case where an institution includes the whole dissertation as well as individual chapters (some institutions do).
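A minimal sketch of how the whitelist and the "largest file" tie-breaker might fit together is shown below. It rests on several assumptions that are not settled in this ticket: formats are matched by file extension (a real implementation would more likely check the bitstream format or MIME type), file size in bytes stands in for "longest", and the whitelist is also applied to the primary bitstream. CandidateFile and WhitelistSelector are hypothetical names, not existing DSpace classes.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

// Hypothetical stand-in for a DSpace bitstream; NOT the real DSpace API.
final class CandidateFile {
    final String name;
    final long sizeBytes;
    final boolean isPublic;   // assumed: readable by anonymous users
    final boolean isPrimary;  // assumed: flagged as the "primary bitstream"

    CandidateFile(String name, long sizeBytes, boolean isPublic, boolean isPrimary) {
        this.name = name;
        this.sizeBytes = sizeBytes;
        this.isPublic = isPublic;
        this.isPrimary = isPrimary;
    }
}

final class WhitelistSelector {
    // Initial list from the ticket: PDF, PS, DOC/DOCX, RTF, EPUB.
    // Matching on extensions is a simplification made for this sketch.
    static final Set<String> TEXT_EXTENSIONS =
        new HashSet<>(Arrays.asList("pdf", "ps", "doc", "docx", "rtf", "epub"));

    static boolean isWhitelisted(CandidateFile f) {
        int dot = f.name.lastIndexOf('.');
        if (dot < 0) {
            return false; // no extension, so the format cannot be recognized here
        }
        String ext = f.name.substring(dot + 1).toLowerCase(Locale.ROOT);
        return TEXT_EXTENSIONS.contains(ext);
    }

    static Optional<CandidateFile> select(List<CandidateFile> files) {
        CandidateFile largest = null;
        for (CandidateFile f : files) {
            if (!f.isPublic || !isWhitelisted(f)) {
                continue; // never link private or non-textual files
            }
            if (f.isPrimary) {
                return Optional.of(f); // a whitelisted primary file wins outright
            }
            if (largest == null || f.sizeBytes > largest.sizeBytes) {
                largest = f; // otherwise remember the largest candidate so far
            }
        }
        return Optional.ofNullable(largest);
    }
}
```

With this approach, an Item holding page1.jpg, chapter1.pdf, and a larger thesis.pdf (none flagged primary) would link thesis.pdf regardless of file order, and an Item holding only images would get no citation_pdf_url at all.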
In the meantime, there is a "quick fix" or workaround for sites which encounter this issue. Unfortunately, it requires editing individual items one by one:
- Either reorder files in the DSpace item so that a particular (textual) file is listed first (Edit Item -> Item Bitstreams -> reorder them)
- Or, flag a particular bitstream as being the "primary bitstream", which will cause it to be linked in the citation_pdf_url. (Edit Item -> Item Bitstreams -> Click on bitstream -> Specify "primary bitstream" flag)