Wednesday 20 September 2006 10:35:00 pm
Many people have experienced difficulties with indexing multiple types of binary files. Large binary files present a unique problem. Mindshare Interactive Campaigns has created a custom solution that we are happy to share with the eZ publish community. It involves installing a few third-party indexing tools as well as writing your own binary file handling plugin - but we have provided all the code to get you started.
Our client needed to index PDFs, PowerPoint presentations, Excel spreadsheets, and Word documents. They also indicated they might want to add other file types in the future. The native eZ publish indexing functionality (using pstotext and wvware) handled PDFs and Word documents, but at the time (May 2005) we did not know of a good way to handle Excel or PowerPoint files, so we had to do some discovery work.
We also didn't like the idea of having a separate parsing script (e.g. ezpdfparser.php, ezwordparser.php) for each file type - it seemed more extensible to keep all the parsing code in one file, where we could add a new condition to our case statement whenever we added a new file type.
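To illustrate the single-file idea, here is a minimal sketch of a dispatcher that picks a parsing tool from the file extension. It is not the plugin's actual code: the function name parserToolFor is hypothetical, and the xls2csv and catppt entries are assumptions (they are tools from the catdoc package, not necessarily the ones we ended up using). Command-line arguments are left out on purpose, since they differ from tool to tool.

```php
<?php
// Hypothetical sketch: one place that decides which external tool
// handles which binary file type. Returns null for unsupported types.
function parserToolFor($path)
{
    // Normalise the extension so "REPORT.PDF" and "report.pdf" match.
    $ext = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    switch ($ext) {
        case 'pdf':
            return 'pstotext';   // named in the text above
        case 'doc':
            return 'wvWare';     // named in the text above
        case 'xls':
            return 'xls2csv';    // assumption, for illustration only
        case 'ppt':
            return 'catppt';     // assumption, for illustration only
        default:
            return null;         // no parser registered for this type
    }
}
```

Supporting a new file type then means adding one more case to the switch rather than writing and registering another parser script.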
In addition, we found that both pstotext and xpdf led to the same problem with very large PDF files - the SQL INSERT statement grew too large (this is a limitation of the eZ publish indexer, not of the parsing tools) and crashed the indexer, so no further content was indexed. Addressing that, in addition to handling multiple binary types in one file, led us to write our own custom parsing plugin.
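One generic way to keep any single statement bounded is to split the extracted text into size-limited pieces before handing it to the indexer. The sketch below shows that general technique only; it is an assumption for illustration and not necessarily how our plugin solves the problem. The function name chunkText and the byte limit are hypothetical.

```php
<?php
// Hypothetical sketch: split extracted text on whitespace into chunks
// of at most $maxBytes bytes each, so no single chunk grows unbounded.
// A single word longer than $maxBytes is kept whole in its own chunk.
function chunkText($text, $maxBytes)
{
    $chunks  = array();
    $current = '';

    foreach (preg_split('/\s+/', $text, -1, PREG_SPLIT_NO_EMPTY) as $word) {
        // Start a new chunk when adding this word (plus one space)
        // would push the current chunk past the limit.
        if ($current !== '' && strlen($current) + 1 + strlen($word) > $maxBytes) {
            $chunks[] = $current;
            $current  = $word;
        } else {
            $current = ($current === '') ? $word : $current . ' ' . $word;
        }
    }

    // Flush whatever is left over.
    if ($current !== '') {
        $chunks[] = $current;
    }

    return $chunks;
}
```

Each chunk could then be indexed separately, at the cost of more, smaller statements.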