“Just like Arthur Dent, who after inserting a Babel fish in his ear could understand Vogon poetry, a computer program that uses Tika can understand Microsoft Word documents.” This is how Tika in Action, our book on Apache Tika, introduces it’s subject. Download the freely available first chapter to read the the full introduction.
Chris Mattmann and I started writing the Tika in Actionbook for Manning at the beginning of this year, and we’re now well past the half-way post. If we keep up this pace, the book should be out in print by next Summer! And thanks to the Manning Early Access Program (MEAP), you can already pre-order and access an early access edition of the book at the Tika in Action MEAP page.
If you’re interested, use the “tika50” code to get a 50% early access discount when purchasing the MEAP book. You’ll still receive updates on all new chapters and of course the full book when it’s finished. Note that this discount code is valid only until December 17th, 2010.
We’re also very interested in all comments and other feedback you may have about the book. Use the online forum or contact us directly, and we’ll do our best to make the book more useful to you!
Update: The book is out in print now! Use the “tika37com” code for a 37% discount on the final book.
The news is just in about Adobe being set to acquire Day Software (see also the FAQ). Assuming the deal goes through, it looks like I’ll be working for Adobe by the end of this year. I’m an open source developer, so I’m looking forward to finding out how committed Adobe is in supporting the open development model we’re using for many parts of Day products.
The first comments from Erik Larson, a senior director of product management and strategy at Adobe, seem promising and he also asked what the deal should mean for open source. This is my response from the perspective of the open source projects I’m involved in.
First and foremost I’m looking forward to continuing the open and standards-based development of our key technologies like Apache Jackrabbit and Apache Sling. There’s no way we’d be able to maintain the current level of innovation and productivity in these key parts of our product infrastructure without our symbiotic relationship with the open source community.
Second, I’m hoping that our experience and involvement with open source projects will help Adobe better interact with the various open source efforts that leverage Adobe standards and technologies like XMP, PDF and Flash. The Apache Software Foundation is a home to a growing collection of digital media projects like PDFBox, FOP, Tika, Batik and Sanselan, all of which are in one way or another related to Adobe’s business. For example as a committer and release manager of the Apache PDFBox project I’d much appreciate better access to Adobe’s deep technical PDF know-how. Similarly, in Apache Tika we’re considering using XMP as our metadata standard, and better access to and co-operation with the people behind Adobe’s XMP toolkit SDK (see more below) would be highly valuable.
It would be great to see Adobe becoming more proactive in reaching out and supporting such grass-roots efforts that leverage their technologies. I’ve dealt with Adobe lawyers on such cases before with good results but it did take some time before I found the correct people to contact. Another area of improvement would be to make freely redistributable Adobe IP more easily accessible for external developers by pushing them out to central repositories like Maven Central, RubyGems or CPAN, for example like I did when making PDF core font information available on Maven Central.
Finally, it would be great to see Adobe going further in embracing an open development model for some of their codebases like the XMP toolkit SDK that they already release under open source licenses. I’d love to champion or mentor the effort, should Adobe be willing to bring the XMP toolkit to the Apache Incubator!
The thread model of Java is pretty good and works well for many use cases, but every now and then you need a separate process for better isolation of certain computations. For example in Apache Tika we’re looking for a way to avoid OutOfMemoryErrors or JVM crashes caused by faulty libraries or troublesome input data.
In C and many other programming languages the straightforward way to achieve this is to fork separate processes for such tasks. Unfortunately Java doesn’t support the concept of a fork (i.e. creating a copy of a running process). Instead, all you can do is to start up a completely new process. To create a mirror copy of your current process you’d need to start a new JVM instance with a recreated classpath and make sure that the new process reaches a state where you can get useful results from it. This is quite complicated and typically depends on predefined knowledge of what your classpath looks like. Certainly not something for a simple library to do when deployed somewhere inside a complex application server.
But there’s another way! The latest Tika trunk now contains an early version of a fork feature that allows you to start a new JVM for running computations with the classes and data that you have in your current JVM instance. This is achieved by copying a few supporting class files to a temporary directory and starting the “child JVM” with only those classes. Once started, the supporting code in the child JVM establishes a simple communication protocol with the parent JVM using the standard input and output streams. You can then send serialized data and processing agents to the child JVM, where they will be deserialized using a special class loader that uses the communication link to access classes and other resources from the parent JVM.
My code is still far from production-ready, but I believe I’ve already solved all the tricky parts and everything seems to work as expected. Perhaps this code should go into an Apache Commons component, since it seems like it would be useful also to other projects beyond Tika. Initial searching didn’t bring up other implementations of the same idea, but I wouldn’t be surprised if there are some out there. Pointers welcome.
This year there is no ApacheCon Europe, but a number of more focused events related to projects at Apache and elsewhere are showing up to fill the space.
The first one is Apache Lucene EuroCon, a dedicated Lucene and Solr user conference on 18-21 May in Prague. That’s the place to be if you’re in Europe and interested in Lucene-based search technology (or want to stop by for the beer festival). I’ll be there presenting Apache Tika, and the abstract of my presentation is:
Apache Tika is a toolkit for extracting text and metadata from digital documents. It’s the perfect companion to search engines and any other applications where it’s useful to know more than just the name and size of a file. Powered by parser libraries like Apache POI and PDFBox, Tika offers a simple and unified way to access content in dozens of document formats.
This presentation introduces Apache Tika and shows how it’s being used in projects like Apache Solr and Apache Jackrabbit. You will learn how to integrate Tika with your application and how to configure and extend Tika to best suit your needs. The presentation also summarizes the key characteristics of the more widely used file formats and metadata standards, and shows how Tika can help deal with that complexity.
The rest of the conference program is also now available. See you there!
The Apache POI team is doing an amazing job at making Microsoft Office file formats more accessible to the open source Java world. One of the projects that benefits from their work is Apache Tika that uses POI to extract text content and metadata from all sorts of Office documents.
However, there’s one problem with POI that I’d like to see fixed: It’s too big.
More specifically, the ooxml-schemas jar used by POI for the pre-generated XMLBeans bindings for the Office Open XML schemas is taking up over 50% of the 25MB size of the current Tika application. The pie chart below illustrates the relative sizes of the different parser library dependencies of Tika:
Both PDF and the Microsoft Office formats are pretty big and complex, so one can expect the relevant parser libraries to be large. But the 14MB size of the ooxml-schemas jar seems excessive, especially since the standard OOXML schema package from which the ooxml-schemas jar is built is only 220KB in size.
Does anyone have good ideas on how to best trim down this OOXML dependency?
The PDFBox project is a well known and widely used Java library for reading and writing documents in the Portable Document Format (PDF). Here’s my perspective on the recent developments of the project.
The project was quite dormant when it entered the Apache Incubator about a year ago after we had discussed the idea first at the ApacheCon US 2007 and then on the Incubator mailing list. For a while it looked like project would remain quiet, but in the past few months we’ve seen a clear increase in project activity. Thanks for that goes especially to the contributions of the two new committers, Andreas Lehmkühler and Brian Carrier.
My main focus in Apache PDFBox has recently been the thorough license review that I’ve been conducting. Before entering the Incubator, the PDFBox library was liberally licensed under a BSD License. However, the copyright or licensing status of many external components included in PDFBox was neither well documented nor well understood by downstream projects. For example, PDFBox used to contain parts of the Java Advanced Imaging (JAI) library that is only available under the Sun Binary Code License, a license that is not compatible with Apache policies.
The license review has taken me through a number of legal issues, put me in contact with the Adobe legal team, and made me solve some followup issues. And we also took care of proper export control notifications needed for the PDF encryption support in PDFBox. Luckily the end is finally in sight, and I’m optimistic about having all the remaining open issues closed within a month or so. Altogether it’s been a very interesting and educational process.
With the license review nearing completion and lots of unreleased fixes and improvements accumulating in the project trunk, it is time to start preparing for the first incubating PDFBox release. This release will be called Apache PDFBox 0.8.0-incubating, and will be a major improvement over the 0.7.3 release from over two years ago. All downstream projects should seriously consider upgrading as soon as the release becomes available. It would be really great if the release was out by the ApacheCon Europe at the end of March.
As a mentor and champion of the project I am really happy with the current status. It seems reasonable to expect PDFBox to graduate from the Incubator sometime later this year.
In the Apache Jackrabbit project we’ve decided to create a new JCR Commons subproject for developing and managing the set of generic JCR tools that has grown over time around the core Jackrabbit content repository implementation.
The JCR Commons subproject will to some extent resemble the Apache Commons project, and I’m hoping to use some of the ideas put forward by Henri in his blog post about a “federated commons”.
I’m hoping to flesh out the details of this new subproject over the next month or two. It would be nice to have releases of all the new JCR Commons components ready to be used as dependencies for the upcoming Jackrabbit 1.6 release.