Putting POI on a diet

In General on 2009-10-16 by Jukka Zitting

The Apache POI team is doing an amazing job at making Microsoft Office file formats more accessible to the open source Java world. One of the projects that benefits from their work is Apache Tika, which uses POI to extract text content and metadata from all sorts of Office documents.

However, there’s one problem with POI that I’d like to see fixed: It’s too big.

More specifically, the ooxml-schemas jar, which POI uses for the pre-generated XMLBeans bindings for the Office Open XML schemas, takes up over 50% of the 25MB size of the current Tika application. The pie chart below illustrates the relative sizes of the different parser library dependencies of Tika:

[Figure: Relative sizes of Tika parser dependencies]

Both PDF and the Microsoft Office formats are pretty big and complex, so one can expect the relevant parser libraries to be large. But the 14MB size of the ooxml-schemas jar seems excessive, especially since the standard OOXML schema package from which the ooxml-schemas jar is built is only 220KB in size.
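
For anyone who wants to reproduce this kind of size breakdown for their own application, here is a minimal sketch in Java. It is not the tool used for the chart above. The class name `JarSizeReport`, its helper methods, and the default `lib` directory are all hypothetical; the sketch simply assumes the dependency jars have been copied into a single directory.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: reports how much each jar contributes to the total size.
public class JarSizeReport {

    /** Returns each entry's share of the summed sizes, in percent. */
    static Map<String, Double> shares(Map<String, Long> sizes) {
        long total = 0;
        for (long size : sizes.values()) {
            total += size;
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : sizes.entrySet()) {
            // Guard against an empty directory (total of zero).
            result.put(e.getKey(), total == 0 ? 0.0 : 100.0 * e.getValue() / total);
        }
        return result;
    }

    /** Collects the on-disk size of every *.jar file directly inside dir. */
    static Map<String, Long> jarSizes(Path dir) throws IOException {
        Map<String, Long> sizes = new LinkedHashMap<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(dir, "*.jar")) {
            for (Path jar : jars) {
                sizes.put(jar.getFileName().toString(), Files.size(jar));
            }
        }
        return sizes;
    }

    public static void main(String[] args) throws IOException {
        // "lib" is an assumed default; pass the real directory as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : "lib");
        if (!Files.isDirectory(dir)) {
            System.err.println("Not a directory: " + dir);
            return;
        }
        for (Map.Entry<String, Double> e : shares(jarSizes(dir)).entrySet()) {
            System.out.printf("%-40s %5.1f%%%n", e.getKey(), e.getValue());
        }
    }
}
```

Plugging in the figures quoted above (14MB for ooxml-schemas out of a 25MB total) gives a 56% share, which matches the "over 50%" claim.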

Does anyone have good ideas on how to best trim down this OOXML dependency?

3 Responses to “Putting POI on a diet”

  1. Why do you even need the schemas at runtime? Couldn’t you autogen the parser from the schema?

    Also, does POI have multiple XML parsers in use? Why not cut down to a single one?

  2. Good questions. I’ve yet to look deeper under the hood, but I’m sure something like what you propose could be done to streamline things.

    I’m following up on the dev@poi mailing list, see http://markmail.org/message/u5csdmq4t2wvjgtd for more details.

  3. Hi,

    Do you mind posting here any concrete developments if you hear of any? I’ve seen that there’s not much on the ML for the time being. This 1.4MB => 15MB change when upgrading from POI 3.2 to 3.5 is a problem for us too :/

    Thanks!
