  1. DSpace
  2. DS-1140

Update MSWord Media Filter to use Apache POI (like PPT Filter) and also support .docx

    Details

    • Attachments:
      0
    • Comments:
      10
    • Documentation Status:
      Complete or Committed

      Description

      The Microsoft Word Media Filter (org.dspace.app.mediafilter.WordFilter) relies on obsolete third-party software, specifically the "text-mining" tools at: http://code.google.com/p/text-mining/

      However, better options are now available, most notably Apache POI:

      http://poi.apache.org/text-extraction.html

      Apache POI can also extract text from docx, xls, xlsx, and even Publisher and Visio files.

      We may even be able to create a single "MSFilter" that extracts text from doc, docx, ppt, pptx, xls, xlsx, etc., all using POI.
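
As a rough sketch of how a combined filter might pick a code path before handing the stream to POI: the legacy formats (doc, ppt, xls) are OLE2 compound files, while docx, pptx and xlsx are ZIP packages, so the first few bytes of the stream are enough to tell the two families apart. The names below (MsFormatSniffer, Container, detect) are hypothetical and not part of DSpace or POI; the sketch uses only the JDK and leaves the actual POI extraction calls out.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of DSpace or Apache POI. */
class MsFormatSniffer {

    /** Container families a combined Office filter would need to distinguish. */
    enum Container { OLE2, ZIP_PACKAGE, UNKNOWN }

    // OLE2 compound-file signature, used by the binary .doc, .ppt and .xls formats.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // ZIP local-file-header signature; .docx, .pptx and .xlsx are ZIP packages.
    // Any other ZIP file matches too, so this is only a hint, not proof of OOXML.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    /** Classifies a file from its leading bytes. */
    static Container detect(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Container.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Container.ZIP_PACKAGE;
        }
        return Container.UNKNOWN;
    }

    /**
     * Peeks at the first bytes of a stream and rewinds it, so the same stream
     * can afterwards be passed to whichever extractor is chosen.
     */
    static Container detect(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] header = new byte[OLE2_MAGIC.length];
        in.mark(header.length);
        int n = 0;
        while (n < header.length) {
            int r = in.read(header, n, header.length - n);
            if (r < 0) {
                break; // stream shorter than the signature
            }
            n += r;
        }
        in.reset();
        return detect(Arrays.copyOf(header, n));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java MsFormatSniffer <file>");
            return;
        }
        try (InputStream in = new BufferedInputStream(Files.newInputStream(Paths.get(args[0])))) {
            System.out.println(detect(in));
        }
    }
}
```

Since a ZIP signature alone does not prove a file is OOXML, a real filter would still need to cope with extraction failures after dispatching.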

      Any volunteers to implement this? It looks like we could implement it much like the current PPT Filter (org.dspace.app.mediafilter.PowerPointFilter), which already uses POI. See also DS-714.

              People

              Assignee:
              mwood Mark H. Wood
              Reporter:
              tdonohue Tim Donohue
              Votes:
              1
              Watchers:
              9
