Post

NoSQL interests

In General on 2009-10-27 by Jukka Zitting

We’re organizing a NoSQL meetup in Oakland on Monday next week. In addition to helping set the meetup agenda, the “Topics you are interested in” question in the sign-up form provides some interesting insight into the current interests of the NoSQL community. Here’s a quick breakdown of the key terms distilled from the 88 signups we’ve received so far.

Note that the data is biased towards Apache projects due to the meetup being organized at ApacheCon US 2009.

Projects

The following open source projects were mentioned. The list is in alphabetical order, as the data set is too small to make any reasonable ordering by popularity.

Topics

Many responses were about the “big data” aspect of the NoSQL movement. Some frequent keywords: distributed storage, large transactional data, consistency, failover, availability, reliability, stability, failure detection, failed node replacement, (petabyte) scalability, consistency levels, storage technology, performance, benchmarks, optimization, backup and recovery, map/reduce

Another common theme was the various database types and the NoSQL “development model”. Keywords: document stores, key/value stores, consistent hashing, graph databases, object databases, persistent queues, content modeling, migration from the relational model, social graphs, streaming, software as a service, offline applications, full text search, natural language processing

Beyond the above big themes, I found it interesting that the following technologies were specifically named: Erlang, Java, WebSimpleDB, WebDAV

In addition to specific topics, many people asked for case studies or “lessons learned”-type presentations.

Post

Putting POI on a diet

In General on 2009-10-16 by Jukka Zitting

The Apache POI team is doing an amazing job at making Microsoft Office file formats more accessible to the open source Java world. One of the projects that benefits from their work is Apache Tika, which uses POI to extract text content and metadata from all sorts of Office documents.

Apache POI

However, there’s one problem with POI that I’d like to see fixed: It’s too big.

More specifically, the ooxml-schemas jar used by POI for the pre-generated XMLBeans bindings for the Office Open XML schemas is taking up over 50% of the 25MB size of the current Tika application. The pie chart below illustrates the relative sizes of the different parser library dependencies of Tika:

Relative sizes of Tika parser dependencies

Both PDF and the Microsoft Office formats are pretty big and complex, so one can expect the relevant parser libraries to be large. But the 14MB size of the ooxml-schemas jar seems excessive, especially since the standard OOXML schema package from which the ooxml-schemas jar is built is only 220KB in size.

Does anyone have good ideas on how to best trim down this OOXML dependency?
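As a starting point for that kind of investigation, it helps to see which packages inside a jar account for most of its size. The following is a minimal sketch, not a Tika or POI tool: `size_by_package`, `build_demo_jar` and the `org/example/...` entries are hypothetical names invented for illustration, and the demo archive contains made-up data rather than the real ooxml-schemas contents.

```python
import io
import zipfile
from collections import Counter


def size_by_package(jar, depth=2):
    """Total the compressed size of jar entries, grouped by package prefix.

    `jar` is a path or a file-like object. `depth` is the number of leading
    path segments used as the grouping key (an arbitrary default).
    """
    totals = Counter()
    with zipfile.ZipFile(jar) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            # The last path segment is the file name; the rest is the package.
            package = info.filename.split("/")[:-1]
            key = "/".join(package[:depth]) or "(root)"
            totals[key] += info.compress_size
    return totals


def build_demo_jar():
    """Build a tiny in-memory archive with made-up entries for illustration."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
        archive.writestr("org/example/schemas/A.class", b"x" * 300)
        archive.writestr("org/example/schemas/B.class", b"x" * 200)
        archive.writestr("org/example/util/C.class", b"x" * 100)
        archive.writestr("META-INF/MANIFEST.MF", b"x" * 10)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    totals = size_by_package(build_demo_jar(), depth=3)
    overall = sum(totals.values())
    # Largest packages first, with their share of the whole archive.
    for package, size in totals.most_common():
        print(f"{package:30} {size:6d} bytes  {100 * size / overall:5.1f}%")
```

Pointing `size_by_package` at a real jar path instead of the demo archive would give a per-package breakdown that could guide which generated bindings are candidates for removal.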

Post

Some graphics work for a change

In Jackrabbit on 2009-09-23 by Jukka Zitting

I’ve recently spent some effort in improving the look of the Apache Jackrabbit website. I’m no designer, so the results aren’t that great, but it’s been a nice break from the regular project work. And I got to brush up my Photoshop and Gimp skills.

One part of the effort was creating an icon for the site. Previously the site used the feather icon that is the default on all Apache project sites, but I wanted a Jackrabbit-specific icon that helps me quickly identify and access Jackrabbit pages among the numerous tabs I usually have open in my browser. The work is a good example of incremental improvements in action:

Jackrabbit icon steps

I started with a copy of the Jackrabbit logo with a nice alpha-layered transparent background. It looked great until I noticed that some browsers dropped the smooth alpha layer and instead rendered the rather badly aliased icon seen above.

The straightforward solution was to add a white background as can be seen in step 2. That already worked pretty well in all browsers.

After a few days of watching the icon I found it a bit too blocky for my taste, so I tried to restore some of the nice transparency effect by rounding the corners a bit. I’m pretty happy with the result.

Of course, if you have design talent and think you can do better, go for it!

Post

Release time

In ASF on 2009-09-19 by Jukka Zitting

There’s lots of upcoming release activity at the Apache projects I’m more or less involved with:

  • The incubating Apache PDFBox project is just about to publish the eagerly anticipated 0.8.0 release. I’m expecting to see the release announcement on Tuesday next week. PDFBox is a Java library for working with PDF documents.
  • Another incubating project, Apache UIMA, is working towards the 2.3.0 release. I’m looking forward to seeing both UIMA and PDFBox graduating from the Apache Incubator shortly after the respective releases. UIMA is a framework and a set of components for analyzing large volumes of unstructured information.
  • The Apache Sling project is a component-based project like Apache Felix, so there is no clear project-wide release cycle. Instead, Sling is about to start releasing new versions of most of the components changed since the all-inclusive incubator releases. Sling is a JCR-based web framework.
  • Apache Tika uses PDFBox for extracting text content from PDF documents. I’m hoping to see a Tika 0.5 release soon with the latest PDFBox dependency and the design improvements I’ve been working on. Tika is a toolkit for extracting text and metadata from all kinds of documents.
  • Apache Solr is about to enter code freeze in preparation for the 1.4 release that will include the “Solr Cell” feature based on Tika. Solr is a search server based on Lucene.
  • The Commons IO project has been upgraded to use Java 5 features and I’m starting to push it towards a 2.0 release. Commons IO is a library of Java IO utilities.
  • Lucene Java is gearing up for the 2.9 release, and will soon follow up with the 3.0 release. The trie range feature is an especially welcome addition for many use cases. Lucene is a feature-rich, high-performance search engine.
  • And last but not least, Apache Jackrabbit is getting ready to release the 2.0 version based on the recently approved JCR 2.0 standard. Jackrabbit is a feature-complete JCR content repository implementation.

I’m hoping to see most of these releases happening in time for the ApacheCon US 2009 conference in early November.

Post

Apache Jackrabbit 1.6.0 released

In JCR, Jackrabbit on 2009-08-11 by Jukka Zitting

The Apache Jackrabbit project has just released Jackrabbit version 1.6.0. This release will most likely be the last JCR 1.0-based Jackrabbit 1.x minor release before the upcoming Jackrabbit 2.0 and the upgrade to JCR version 2.0. The goal of this release is to push out as many of the recent Jackrabbit trunk improvements as possible so that the number of new things in Jackrabbit 2.0 remains manageable.

Download Apache Jackrabbit 1.6.0

The most notable changes and new features in this release are:

  • The RepositoryCopier tool makes it easy to back up and migrate repositories (JCR-442). There is also improved support for selectively copying content and version histories between repositories (JCR-1972).
  • A new WebDAV-based JCR remoting layer has been added to complement the existing JCR-RMI layer (JCR-1877, JCR-1958).
  • Query performance has been further optimized (JCR-1820, JCR-1855 and JCR-2025).
  • Added support for Ingres and MaxDB/SapDB databases (JCR-1960, JCR-1527).
  • Session.refresh() can now be used to synchronize a cluster node with changes from the other nodes in the cluster (JCR-1753).
  • Unreferenced version histories are now automatically removed once all the contained versions have been removed (JCR-134).
  • Standalone components like the JCR-RMI layer and the OCM framework have been moved to a separate JCR Commons subproject of Jackrabbit, and are not included in this release. Updates to those components will be distributed as separate releases.
  • Development preview: There are even more JSR 283 features in Jackrabbit 1.6 than were included in the 1.5 version. These new features are accessible through special “jsr283” interfaces in the Jackrabbit API. Note that none of these features are ready for production use, and will be replaced with final JCR 2.0 versions in Jackrabbit 2.0.

This release is the result of contributions from quite a few people. Thanks to everyone involved. This is open source in action!

Post

JCR 2.0 implementation progress

In JCR, Jackrabbit on 2009-07-18 by Jukka Zitting

The JCR 2.0 API specified by JSR 283 has been in Proposed Final Draft (PFD) stage since March, and Apache Jackrabbit developers have been busy implementing all the specified new features and adding compliance test cases for them.

Apache Jackrabbit

Both the Reference Implementation (RI) and the Technology Compatibility Kit (TCK) of JSR 283 will be based on Jackrabbit code, and we expect the final version of the specification to be released shortly after Jackrabbit trunk becomes feature-complete and the API coverage of the TCK reaches 100%. The following two graphs illustrate our progress on both these fronts.

First, a track of all the JCR 2.0 implementation tasks we’ve filed under the JCR-1104 collection issue. The amount of work per sub-task is not uniform, so this graph only shows the general trend and does not suggest any specific completion date.

jcr-20-implementation

The second graph tracks the TCK API coverage. We started with the JCR 1.0 TCK, so the first 300-400 method signatures were already covered with few changes to existing test code. Based on Julian’s API coverage reports in JCR-2085, this graph tracks progress in covering the 100+ new method signatures introduced in JCR 2.0. Again, the graph is meant to show just a general trend and should not be used to extrapolate future progress.

JCR 2.0 TCK API coverage

Want to see JCR 2.0 in action? The latest Jackrabbit 2.0 alpha releases are available for download!

Post

Commits per weekday and hour

In ASF on 2009-06-04 by Jukka Zitting

The punchcard graphs at GitHub are a nice way to quickly detect the rough geographical distribution (or nighttime coding habits) of the key contributors of an open source project. Here are a few selected examples from the ASF.
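For readers who want to build a similar view from their own commit logs, here is a minimal sketch of the underlying idea: count commits per (weekday, hour) bucket. It is not GitHub's implementation. The `punchcard` and `render` helpers and the sample timestamps are made up for illustration, and the sketch assumes each timestamp carries a UTC offset so that commits are bucketed by the committer's local wall-clock time.

```python
from collections import Counter
from datetime import datetime

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Made-up sample data, not taken from any real project history.
SAMPLE_COMMITS = [
    "2009-06-01T23:15:00+03:00",
    "2009-06-01T23:40:00+03:00",
    "2009-06-02T09:05:00-07:00",
    "2009-06-06T14:30:00+02:00",
]


def punchcard(timestamps):
    """Count commits per (weekday, hour) bucket, with Monday as weekday 0."""
    counts = Counter()
    for stamp in timestamps:
        # The hour is taken as written, i.e. in the timestamp's own offset.
        moment = datetime.fromisoformat(stamp)
        counts[(moment.weekday(), moment.hour)] += 1
    return counts


def render(counts):
    """Render the counts as a 7 x 24 text grid, one row per weekday."""
    rows = []
    for day, name in enumerate(WEEKDAYS):
        cells = []
        for hour in range(24):
            count = counts.get((day, hour), 0)
            # Empty buckets are shown as a dot to keep the grid readable.
            cells.append((str(count) if count else ".").rjust(3))
        rows.append(name + "".join(cells))
    return "\n".join(rows)


if __name__ == "__main__":
    print(render(punchcard(SAMPLE_COMMITS)))
```

Timestamps in this form can be produced with, for example, `git log --pretty=format:%aI` on reasonably recent Git versions. A cluster of late-evening buckets in one offset and office-hours buckets in another is the kind of pattern the graphs above make visible.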

Apache HTTP Server

Apache HTTP Server

Apache Maven (core)

Apache Maven

Apache Jackrabbit

Apache Jackrabbit

Post

Would you trust a pirate?

In General on 2009-05-17 by Jukka Zitting

Apparently they’re now also setting up a Pirate Party in Finland. I guess it’s good to have a political force that questions the appropriateness of traditional copyright in the digital world. However, as a knowledge worker I’m not that excited about drastic changes in the protection of intellectual property rights.

Anyway, my appreciation for the movement in Finland went down considerably when I saw their spokesman in the news today. When asked about the main goals of the new party he only mentioned freedom of speech and protection of privacy. Did he just forget the massive overhaul of copyright and patent laws that they’re primarily after?

Post

Midgard: Where it all began

In Midgard on 2009-05-10 by Jukka Zitting

On Friday we celebrated the tenth anniversary of the Midgard project. The celebration took the form of a very nice gala evening with good food and drinks, live music, a show and, of course, some speeches. I was asked to deliver a few words about how it all began for Midgard.

Here’s my speech, reconstructed from my draft notes and edited for the web audience:

We were a group of teenagers and young adults doing historical re-enactment and live action role playing games. One evening in early ’97 we were sitting in a bus, returning from the woods with all our viking gear on. Bergie said to me: “Hey Yaro”, as I was known as Yaroslav at the time. “Hey Yaro”, he said, “you’re over 18 and you have a driver’s license. Would you like to take a dozen teenagers on a trip to Norway and back?” Even back then Bergie was the one with big dreams and the power to inspire people. I had the skills required to make those dreams happen but not yet enough experience to tell that we perhaps should think twice. So I just answered: “Sounds cool, let’s do it!” That’s pretty much what also happened with Midgard.

The trip to Norway went well for us and was followed by a number of other adventures. One of them was our quest to build a better web site for our group. It was ’97 and the web was booming. The de facto web publishing technology was FTP, which people used to push static HTML to a web server. Geocities was a major cool thing as it allowed you to publish your static HTML for free. We, however, had bigger plans and our own server running in the closet of a friendly internet company. And we were publishing lots of stuff: news, photos, articles, etc. Quite a few people were actively contributing new content to the web site.

Our first serious attempt at better managing the site was based on technologies called SGML and DSSSL. For the technically minded: nowadays you’d use XML and XSLT for similar tasks. We used this system to “cook” our content into nicely formatted HTML that was then served to the world. It worked pretty well, but was hopelessly complex for almost all of our contributors. This was a time when people were only just discovering the Internet. Most of our contributors were teenagers who were using the net from libraries or schools. Internet connections with modems were only just finding their way into normal households. Even FTP was often out of the question, so there was little hope of making the heavy SGML tooling work as well as we’d like.

We wanted a system that could be managed entirely through the browser. Not just the content you saw on the web site, but the layout templates and even the functional code used to list pages or to handle the forms for adding or modifying content. The system should allow you to build an entire web site, including all the administration interfaces, without any other tooling than a web browser. Such systems simply didn’t exist at the time and in fact they’re pretty rare even today.

So we had to build our own system. We looked at a number of potential platforms for something like this, and the LAMP stack seemed like a good fit. Our server already ran Linux and, like pretty much everyone, we used the Apache web server. We hadn’t used PHP or MySQL before, but they were getting some good press and were easy enough to get started with. In fact we hadn’t done much of anything when we started: we hadn’t done Apache modules, we hadn’t extended (or even written!) PHP, and at the time I had only read about relational databases. As we used to say: “How hard can it be?” We didn’t know, and so we just did it.

The result of our efforts was called Midgard. We had used it to power our web site for about a year when Bergie was hired to build a new web site for a Finnish tech company. Midgard seemed like a good fit for that need, and we figured that other people might also find the system useful. Open source was cool and we wanted to join the movement, so we decided to publish Midgard as open source. After nights spent researching licensing options, writing press releases, creating the project web site and setting up mailing lists and public CVS access, we were finally ready to publish Midgard 1.0 to the world. That happened exactly ten years ago.

The 1.0 release was like the Land Rover it was named for. The magnificent car from ’62, which we used on many of our trips, was really cool and when it worked, it did so very well. However, every now and then it required some “manual help” to get it started or to keep it going. This was also the case for Midgard 1.0. The first external installation that I know of was done on a Solaris platform and required a few days’ worth of help and patches delivered over the mailing list before it was up and running. Much of that early feedback and experience was reflected in Midgard 1.1, which was the first release that people actually managed to install and run without direct assistance. That started the growth of the Midgard community.

Meanwhile I had also been hired by the same company where Bergie worked, and much of our work there resulted in improvements to Midgard. Together with the feedback and early contributions we were getting from the mailing lists, this made Midgard 1.2 already a pretty solid piece of software. It was fairly straightforward to install (by the standards of the time), it performed well and it had most of the functionality that you’d need to run a moderately complex web site.

And the results were showing. We were getting increasing traffic on the mailing lists, some companies started offering Midgard support and the number of Midgard-based sites around the world was growing. One of my earliest concrete rewards for doing open source was a bottle of quality whiskey that some Midgard user from Germany sent me with a note saying: “Thanks for Midgard!” The whiskey is long gone, but I still treasure the memory. A few years later Bergie and a few other friends and Midgard developers went on to start their own company based on Midgard. I was tempted to join them, but at the time my life was taking a different route and I gradually left Midgard to pursue other things.

Seeing the Midgard project take off and build a life of its own has been a very inspiring process for me. Having your first open source project become so successful is pretty amazing and also quite humbling. Looking at all the things Midgard is today fills me with pride, not in what I’ve done, but in what you, the Midgard community, have accomplished. Thank you for that. I’d especially like to thank my long-time friend and co-conspirator in starting the Midgard project. Bergie, without your dreams and refusal to take “no” for an answer we wouldn’t be here today. Thank you.

Post

Content Technology at the ApacheCon US 2009

In General on 2009-04-27 by Jukka Zitting

I’m putting together a plan for a Content Technology track at the ApacheCon US 2009 in Oakland later this year. The original plan for the track was focused on JCR and related stuff, but there’s some interest in expanding the scope to cover a wider range of things related to content management and web publishing.

The track proposal has been discussed on the Jackrabbit and Sling mailing lists, and people from POI and Lenya have chimed in with interest. I also contacted Wicket, Cocoon, JSPWiki and Roller about their interest, and the initial feedback seems good. Any other projects I should be contacting?

I’m not sure how this works for the conference planners, who are probably facing some real deadlines in terms of fixing the conference schedule and contacting selected speakers. Let’s see how it all plays out.

Update: Added JSPWiki and Roller.