Archive for the ‘ASF’ Category


Forking a JVM

In ASF,Java on 2010-05-27 by Jukka Zitting

The thread model of Java is pretty good and works well for many use cases, but every now and then you need a separate process for better isolation of certain computations. For example, in Apache Tika we’re looking for a way to avoid OutOfMemoryErrors or JVM crashes caused by faulty libraries or troublesome input data.

In C and many other programming languages the straightforward way to achieve this is to fork separate processes for such tasks. Unfortunately Java doesn’t support the concept of a fork (i.e. creating a copy of a running process). Instead, all you can do is to start up a completely new process. To create a mirror copy of your current process you’d need to start a new JVM instance with a recreated classpath and make sure that the new process reaches a state where you can get useful results from it. This is quite complicated and typically depends on predefined knowledge of what your classpath looks like. Certainly not something for a simple library to do when deployed somewhere inside a complex application server.
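To make the “recreated classpath” problem concrete, here is a minimal sketch of starting a completely new JVM from Java. It assumes that the `java.home` and `java.class.path` system properties describe everything the new process needs, which is exactly the assumption that tends to break inside an application server. The `NewJvmCommand` class and the `com.example.Worker` main class are hypothetical names used only for illustration.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

class NewJvmCommand {

    /**
     * Builds the command line for a fresh JVM that reuses the launcher
     * classpath of the current JVM. Classes that were loaded through
     * custom class loaders are NOT covered by java.class.path.
     */
    static List<String> build(String mainClass, String... args) {
        String java = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        List<String> command = new ArrayList<>();
        command.add(java);
        command.add("-cp");
        command.add(System.getProperty("java.class.path"));
        command.add(mainClass);
        for (String arg : args) {
            command.add(arg);
        }
        return command;
    }

    public static void main(String[] args) {
        // com.example.Worker is a hypothetical main class
        List<String> command = build("com.example.Worker", "input.pdf");
        System.out.println(command);
        // To actually launch the process:
        // new ProcessBuilder(command).inheritIO().start().waitFor();
    }
}
```

The command only works if the named main class and all of its dependencies can be found on that one classpath string, which a library deployed deep inside a container usually cannot guarantee.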

But there’s another way! The latest Tika trunk now contains an early version of a fork feature that allows you to start a new JVM for running computations with the classes and data that you have in your current JVM instance. This is achieved by copying a few supporting class files to a temporary directory and starting the “child JVM” with only those classes. Once started, the supporting code in the child JVM establishes a simple communication protocol with the parent JVM using the standard input and output streams. You can then send serialized data and processing agents to the child JVM, where they will be deserialized using a special class loader that uses the communication link to access classes and other resources from the parent JVM.
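The deserialization step can be illustrated with a small sketch. This is not the Tika code; it only shows the general technique of overriding `ObjectInputStream.resolveClass` so that classes are looked up through a chosen class loader. In the setup described above that loader would fetch class files from the parent JVM over the communication link; here an ordinary `ClassLoader` and in-memory byte arrays stand in for the link, and the `ForkSketch` and `LoaderObjectInputStream` names are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.util.ArrayList;
import java.util.Arrays;

class ForkSketch {

    /** Resolves classes through a given loader instead of the default one. */
    static class LoaderObjectInputStream extends ObjectInputStream {

        private final ClassLoader loader;

        LoaderObjectInputStream(InputStream in, ClassLoader loader)
                throws IOException {
            super(in);
            this.loader = loader;
        }

        @Override
        protected Class<?> resolveClass(ObjectStreamClass desc)
                throws IOException, ClassNotFoundException {
            try {
                return Class.forName(desc.getName(), false, loader);
            } catch (ClassNotFoundException e) {
                // Fall back to the default lookup, e.g. for primitive types
                return super.resolveClass(desc);
            }
        }
    }

    /** Sending side: bytes that could be written to the child's stdin. */
    static byte[] serialize(Object object) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(object);
        }
        return buffer.toByteArray();
    }

    /** Receiving side: rebuild the object, loading classes via the loader. */
    static Object deserialize(byte[] data, ClassLoader loader)
            throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new LoaderObjectInputStream(
                new ByteArrayInputStream(data), loader)) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] message = serialize(
                new ArrayList<>(Arrays.asList("hello", "world")));
        Object copy = deserialize(message, ForkSketch.class.getClassLoader());
        System.out.println(copy);
    }
}
```

Keeping the class lookup behind a single `ClassLoader` parameter means the receiving side needs no classes of its own beyond the supporting code; everything else can be resolved on demand by whatever loader is plugged in.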

My code is still far from production-ready, but I believe I’ve already solved all the tricky parts and everything seems to work as expected. Perhaps this code should go into an Apache Commons component, since it seems like it would be useful also to other projects beyond Tika. Initial searching didn’t bring up other implementations of the same idea, but I wouldn’t be surprised if there are some out there. Pointers welcome.


Release time

In ASF on 2009-09-19 by Jukka Zitting

There’s lots of upcoming release activity at the Apache projects I’m more or less involved with:

  • The incubating Apache PDFBox project is just about to release the eagerly anticipated 0.8.0 release. I’m expecting to see the release announcement on Tuesday next week. PDFBox is a Java library for working with PDF documents.
  • Another incubating project, Apache UIMA, is working towards the 2.3.0 release. I’m looking forward to seeing both UIMA and PDFBox graduating from the Apache Incubator shortly after the respective releases. UIMA is a framework and a set of components for analyzing large volumes of unstructured information.
  • The Apache Sling project is a component-based project like Apache Felix, so there is no clear project-wide release cycle. Instead Sling is about to start releasing new versions of most of the components changed since the all-inclusive incubator releases. Sling is a JCR-based web framework.
  • Apache Tika uses PDFBox for extracting text content from PDF documents. I’m hoping to see a Tika 0.5 release soon with the latest PDFBox dependency and the design improvements I’ve been working on. Tika is a toolkit for extracting text and metadata from all kinds of documents.
  • Apache Solr is about to enter code freeze in preparation for the 1.4 release that will include the “Solr Cell” feature based on Tika. Solr is a search server based on Lucene.
  • The Commons IO project has been upgraded to use Java 5 features and I’m starting to push it towards a 2.0 release. Commons IO is a library of Java IO utilities.
  • Lucene Java is gearing up for the 2.9 release, and will soon follow up with the 3.0 release. The trie range feature is an especially welcome addition for many use cases. Lucene is a feature-rich high performance search engine.
  • And last but not least, Apache Jackrabbit is getting ready to release the 2.0 version based on the recently approved JCR 2.0 standard. Jackrabbit is a feature-complete JCR content repository implementation.

I’m hoping to see most of these releases happening in time for the ApacheCon US 2009 conference in early November.


Commits per weekday and hour

In ASF on 2009-06-04 by Jukka Zitting

The punchcard graphs at GitHub are a nice way to quickly detect the rough geographical distribution (or nighttime coding habits) of the key contributors of an open source project. Here are a few selected examples from the ASF.

[Punchcard graph: Apache HTTP Server]

[Punchcard graph: Apache Maven (core)]

[Punchcard graph: Apache Jackrabbit]


Maven meetup report

In ASF on 2009-03-26 by Jukka Zitting

A few days late, here’s a quick report on what I managed to do this Monday here at the ApacheCon EU. As mentioned earlier, I arrived at the conference hotel on Monday evening and headed straight for the Maven meetup.

Maven meetup

The meetup was already in progress when I arrived, but I managed to catch a part of a presentation about the Eclipse integration that just keeps getting better. Nowadays it’s so easy to import and manage Maven projects in Eclipse that I get really annoyed every time I need to manually set things up for projects with Ant builds.

Other interesting topics covered were Maven archetypes and the release plugin. I’ve long been thinking about creating some archetypes to help set up new JCR client applications. We should probably also do something similar for setting up new Sling bundles.

The release plugin demo was interesting, though I’m not so sure if I agree with all the conventions and assumptions that the plugin makes. On a related note, we should configure the GPG plugin for the Maven build in Jackrabbit.

We talked a bit about Maven 2.1.0 and the upcoming 3.0 release. I’m already pretty happy with the recent Maven 2.0.x releases, so we’ll probably take a while before upgrading, but it’s good to hear that things are progressing on multiple fronts. We also briefly touched on the differences between the Maven and OSGi dependency models and the ways to better bridge the two worlds.

In summary, the meetup was really interesting and served well in giving me a better idea of what’s up in the Maven land. Thanks to everyone involved!

Chops, ribs and beer

After the meetup a few of us headed out to Amsterdam city center for some food and drinks. Monday evening wasn’t perhaps the best time to go out as we needed to wander around looking for places that would be open long enough. Anyway, we found some “interesting” places to visit before returning to the hotel in the early hours. Good times.


ApacheCon plans

In ASF on 2009-03-23 by Jukka Zitting

It’s ApacheCon time again. I’ll be flying to Amsterdam later today, and will probably be pretty busy for the entire week. Some highlights:


  • Maven meetup. I’ll probably arrive at the conference hotel just in time for the Maven meetup, where I’m hoping to catch up with the latest news from the Maven land.
  • Git hacking. During the Hackathon on Tuesday I hope to get together with Grzegorz and anyone else interested in setting up
  • Commons Compress. There’s some useful code in the Commons Compress component that I hope to use in Apache Tika. If I have time during the Hackathon I want to help push the component towards its first release.
  • CMIS / Chemistry update. I’ve been meaning to check out the CMIS code that Florent Guillaume has been working on recently. I’d love to get the effort better integrated into Jackrabbit.
  • Commons XML. I’ve been gathering some JAXP utility code to a new XML library in the Commons sandbox. I hope to spend some time pushing more code there and perhaps discussing the concept with some interested people.
  • Juuso lab. I have lots of new ideas about RDF processing and Prolog. Hoping to turn those into working code.
  • Lucene meetup. Catching up with the latest in Lucene and telling people about Tika and the Lucene integration we have in Jackrabbit. Unfortunately I only have one hour to spend here before the JCR meetup starts.
  • JCR meetup. Starting at 8pm, the JCR meetup is one of the key highlights of the conference for me. We’ll be covering stuff related to the Jackrabbit and Sling projects. You’re welcome to join us (sign up here) if you’re interested in the latest news from the content repository world.


And lots of other stuff, too much to keep track of…


Apache PDFBox status update

In ASF on 2009-01-23 by Jukka Zitting

The PDFBox project is a well known and widely used Java library for reading and writing documents in the Portable Document Format (PDF). Here’s my perspective on the recent developments of the project.

Project activity

The project was quite dormant when it entered the Apache Incubator about a year ago after we had discussed the idea first at the ApacheCon US 2007 and then on the Incubator mailing list. For a while it looked like the project would remain quiet, but in the past few months we’ve seen a clear increase in project activity. Thanks for that goes especially to the contributions of the two new committers, Andreas Lehmkühler and Brian Carrier.

License review

My main focus in Apache PDFBox has recently been the thorough license review that I’ve been conducting. Before entering the Incubator, the PDFBox library was liberally licensed under a BSD License. However, the copyright or licensing status of many external components included in PDFBox was neither well documented nor well understood by downstream projects. For example, PDFBox used to contain parts of the Java Advanced Imaging (JAI) library that is only available under the Sun Binary Code License, a license that is not compatible with Apache policies.

The license review has taken me through a number of legal issues, put me in contact with the Adobe legal team, and made me solve some followup issues. And we also took care of proper export control notifications needed for the PDF encryption support in PDFBox. Luckily the end is finally in sight, and I’m optimistic about having all the remaining open issues closed within a month or so. Altogether it’s been a very interesting and educational process.

Next release

With the license review nearing completion and lots of unreleased fixes and improvements accumulating in the project trunk, it is time to start preparing for the first incubating PDFBox release. This release will be called Apache PDFBox 0.8.0-incubating, and will be a major improvement over the 0.7.3 release from over two years ago. All downstream projects should seriously consider upgrading as soon as the release becomes available. It would be really great if the release was out by the ApacheCon Europe at the end of March.


As a mentor and champion of the project I am really happy with the current status. It seems reasonable to expect PDFBox to graduate from the Incubator sometime later this year.


Apache JCR Commons

In ASF,Jackrabbit on 2009-01-23 by Jukka Zitting

In the Apache Jackrabbit project we’ve decided to create a new JCR Commons subproject for developing and managing the set of generic JCR tools that has grown over time around the core Jackrabbit content repository implementation.

The JCR Commons subproject will to some extent resemble the Apache Commons project, and I’m hoping to use some of the ideas put forward by Henri in his blog post about a “federated commons”.

I’m hoping to flesh out the details of this new subproject over the next month or two. It would be nice to have releases of all the new JCR Commons components ready to be used as dependencies for the upcoming Jackrabbit 1.6 release.


File system on steroids

In ASF,Jackrabbit,JCR,Technology on 2008-04-16 by Jukka Zitting

Last week at ApacheCon EU I made a case for content repositories as a general solution for applications that are currently forced to fragment their storage needs due to the different limitations of traditional storage methods, mostly file systems and databases plus more recently cloud services on the network. See below for the presentation:

It seems like the message was well received. After the presentation I got a lot of positive feedback from people who had previously thought of content repositories as something you’d only use for storing content in a content management system. Instead I see a content repository as a unifying storage layer that can be used for almost anything ranging from traditional content and data to configuration files, user account information, preferences, templates and scripts, source code and binaries, ad-hoc annotations, etc.


ApacheCon EU next week

In ASF on 2008-04-04 by Jukka Zitting

Like many others, next week I’ll be in Amsterdam attending ApacheCon EU. This is my fourth ApacheCon, and it seems like every time I get myself involved with more stuff to do and more people to meet. I’m still trying to piece together my schedule for the week, but here’s what I’ll be doing already on Monday and Tuesday.

Monday: Media training, BarCode

On Monday I’ll be attending Sally Khudairi’s Media & Analyst Training and Intermediate Media & Analyst Training classes to improve my media skills and knowledge. I already attended the first class once during ApacheCon US 2006 in Austin, and found it very interesting and useful. Now I’m hoping to rehash the things I learned earlier and learn something new during the second class in the afternoon. It’s really cool and very useful to have something like this as a part of the conference.

The downside of attending the classes is that I miss much of the Hackathon action during Monday. But luckily there’s the BarCode event where (if not sooner) I hope to catch up with Day people and other fellow hackers. The conference networking site has already helped me hook up with some new people I’d like to meet.

Tuesday: JCR meetup, WebDAV, hacking, dinner

Tuesday is the first big JCR day of the week. The JCR meetup starts at 9am and continues to 1pm with lots of cool topics and attendees. The event is free for everyone and takes place right next to the ApacheCon venue, so feel free to drop by if you’re around and interested in JCR and related technologies!

I’m especially looking forward to the JCR content modeling workshop that we’ll likely organize at the end of the meetup. Besides ideas and questions from existing JCR users, I’ve invited other Apache projects like Abdera, James, JSPWiki, Lenya, and Roller to participate if they’re interested in evaluating how a JCR content repository could best support their content models.

After the JCR meetup I was thinking of perhaps putting up a short informal WebDAV library BOF with anyone interested who’s around. There’s a long-standing need for a good WebDAV protocol library but with Slide gone, there currently is no clear development community that could adopt the various pieces of WebDAV code in different Apache projects. It would be cool if there were enough interest for example to bootstrap a new incubating WebDAV project.

If I have time (and energy!) beyond those activities on Tuesday afternoon, you’ll probably find me in the Hackathon working on some prototype integrations like the ones we’ve done in previous Hackathons. Ping me if you’re interested in hooking some code up especially with Jackrabbit or Tika!

There’s an initial plan of having a dinner with other JCR people on Tuesday evening. Let me know if you’re interested in joining us.


Apache Jackrabbit 1.4 is available!

In ASF,Jackrabbit on 2008-01-16 by Jukka Zitting

I just announced the release of Apache Jackrabbit 1.4. The release is the result of about nine months of development since the 1.3 release, and contains 220 new features, improvements, and bug fixes (plus the 75 bug fixes that had already been backported to 1.3.x patch releases). This is by far the biggest Jackrabbit release to date.

The 1.4 release contains some cool new features:

  • Friendlier Jackrabbit webapp. The jackrabbit-webapp component now comes with a more polished user interface, better error handling, and improved repository connectivity for local and remote clients.
  • Object/content mapping framework. The jackrabbit-ocm component maps Java objects to JCR nodes and vice versa, making it possible to persist normal Java objects in a content repository.
  • Service provider interface for JCR. The jackrabbit-spi component defines an architectural layer below the JCR API. The SPI layer is designed specifically for remote access and outlines a way for us to avoid the performance limitations of JCR-RMI that works on top of JCR.
  • Optimized storage for binary content. The new DataStore feature in jackrabbit-core avoids all unnecessary copying of binary content and promises huge performance increases for versioning and copying operations. DataStore is a beta-level feature in Jackrabbit 1.4 and disabled by default.
  • Improved query engine. The jackrabbit-core component has been extended with new features like configurable indexing, synonym and similarity queries, and spell checking. Many typical queries are now noticeably faster than before thanks to numerous performance improvements.

Many thanks to the Jackrabbit development team and the entire community! I’m really proud and excited to be a member of the Apache Jackrabbit project.

PS. Interestingly enough, I built the final 1.4 release candidate exactly two years after I first volunteered to be the release manager for Apache Jackrabbit. The past two years have certainly been an interesting time. :-)

