The Apache POI team is doing an amazing job of making Microsoft Office file formats more accessible to the open source Java world. One of the projects that benefits from their work is Apache Tika, which uses POI to extract text content and metadata from all sorts of Office documents.
However, there’s one problem with POI that I’d like to see fixed: It’s too big.
More specifically, the ooxml-schemas jar, which POI uses for its pre-generated XMLBeans bindings for the Office Open XML schemas, takes up over 50% of the 25MB size of the current Tika application. The pie chart below illustrates the relative sizes of the different parser library dependencies of Tika:
Both PDF and the Microsoft Office formats are pretty big and complex, so one can expect the relevant parser libraries to be large. But the 14MB size of the ooxml-schemas jar seems excessive, especially since the standard OOXML schema package from which the ooxml-schemas jar is built is only 220KB in size.
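To see where those megabytes actually go, a rough first step is to total the compressed entry sizes inside the jar by package prefix. The sketch below is only an illustration: the `JarSizeReport` class, its method names and the grouping depth are my own hypothetical choices, not anything shipped with POI or Tika. It relies only on the standard `java.util.zip` API and assumes that the compressed entry sizes approximate each package's share of the jar on disk.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.Enumeration;
import java.util.Map;
import java.util.TreeMap;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper: reports which package prefixes dominate a jar's size.
class JarSizeReport {

    // Returns the first `depth` directory segments of an entry name,
    // or "(root)" when the entry sits at the top level of the archive.
    static String prefixOf(String entryName, int depth) {
        String[] parts = entryName.split("/");
        int n = Math.min(depth, parts.length - 1); // last part is the file name
        if (n <= 0) {
            return "(root)";
        }
        return String.join("/", Arrays.copyOfRange(parts, 0, n));
    }

    // Sums the compressed size of every file entry, grouped by prefix.
    static Map<String, Long> compressedSizeByPrefix(String jarPath, int depth)
            throws IOException {
        Map<String, Long> totals = new TreeMap<>();
        try (ZipFile zip = new ZipFile(jarPath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                long size = entry.getCompressedSize();
                if (entry.isDirectory() || size < 0) {
                    continue; // skip directories and entries of unknown size
                }
                totals.merge(prefixOf(entry.getName(), depth), size, Long::sum);
            }
        }
        return totals;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: java JarSizeReport <path-to-jar> [depth]");
            return;
        }
        int depth = args.length > 1 ? Integer.parseInt(args[1]) : 4;
        // Print the 20 largest prefixes, biggest first.
        compressedSizeByPrefix(args[0], depth).entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(20)
                .forEach(e -> System.out.printf("%10d KB  %s%n",
                        e.getValue() / 1024, e.getKey()));
    }
}
```

Running it against a local copy of the ooxml-schemas jar, with the path passed as the first argument, should show whether a handful of packages account for most of the size. That would be a natural place to start trimming.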
Does anyone have good ideas on how to best trim down this OOXML dependency?