The Microsoft Word Media Filter (org.dspace.app.mediafilter.WordFilter) relies on obsolete third-party software, specifically the "text-mining" tools at: http://code.google.com/p/text-mining/
Better options now exist, notably Apache POI.
Apache POI can also extract text from docx, xls, and xlsx files, and even from Publisher and Visio files.
We may even be able to create a single "MSFilter" that extracts text from doc, docx, ppt, pptx, xls, xlsx, etc., all using POI.
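To make the single-filter idea a bit more concrete, here is a rough sketch of the shape it might take. Everything in it is hypothetical: the class name MSFilterSketch, the TextExtractor interface, and the supports/extractText methods are illustration only and are not part of DSpace or POI. The extractor is passed in so that the sketch compiles without POI on the classpath; a real filter would presumably back it with POI's text extractors.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch of a single filter for MS Office formats; not DSpace code. */
public class MSFilterSketch {

    /**
     * Hypothetical seam where a POI-backed extractor would be plugged in.
     * Assumption: an implementation could wrap something like
     * ExtractorFactory.createExtractor(in).getText(); check the exact
     * signature and checked exceptions against the POI version in use.
     */
    public interface TextExtractor {
        String extract(InputStream in) throws IOException;
    }

    // Extensions listed in the proposal; the "etc." is left open on purpose.
    private static final Set<String> SUPPORTED = new HashSet<>(
            Arrays.asList("doc", "docx", "ppt", "pptx", "xls", "xlsx"));

    private final TextExtractor extractor;

    public MSFilterSketch(TextExtractor extractor) {
        this.extractor = extractor;
    }

    /** Returns true if the file extension is one this filter claims to handle. */
    public static boolean supports(String filename) {
        int dot = filename.lastIndexOf('.');
        if (dot < 0 || dot == filename.length() - 1) {
            return false; // no extension at all
        }
        String ext = filename.substring(dot + 1).toLowerCase(Locale.ROOT);
        return SUPPORTED.contains(ext);
    }

    /** Delegates to the one extractor regardless of which Office format came in. */
    public String extractText(String filename, InputStream source) throws IOException {
        if (!supports(filename)) {
            throw new IllegalArgumentException("Unsupported file type: " + filename);
        }
        return extractor.extract(source);
    }
}
```

The point of the sketch is only that one class can claim all of the listed extensions and hand every stream to the same extraction call, instead of keeping one filter class per format.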
Any volunteers to implement this? It looks like we could model it on the current PPT Filter (org.dspace.app.mediafilter.PowerPointFilter), which already uses POI. See also