Post

The new BASIC

In Technology on 2010-01-29 by Jukka Zitting

I’m seeing many posts that worry about computing devices like iPhones and the new iPad preventing people from having direct control over the hardware. Mark is telling us about a Ctrl+Reset and a BASIC prompt. Nowadays you get started with the following on an HTML page:

    <script type="text/javascript">
    document.write("Hello, World!");
    </script>

And you can do anything! Don’t tell me the days of tinkering are over.

Post

Daily Shoot, week 3

In General on 2009-12-07 by Jukka Zitting

Another week of @dailyshoot:

PS. Check out the updated dailyshoot.com web site.

Post

Daily Shoot, week 2

In General on 2009-11-29 by Jukka Zitting

As I mentioned last week, I’ve been following @dailyshoot for a series of daily photo assignments. Here’s what I shot this week:

Post

Sling over HTTP

In General on 2009-11-28 by Jukka Zitting

A few days ago I posted about Jackrabbit, and now it’s time to follow up with Sling as a means of accessing a content repository over HTTP. Apache Sling is a web framework based on JCR content repositories like Jackrabbit, and among other things it adds some pretty nice ways of accessing and manipulating content over HTTP.

The easiest way to get started with Sling is to download the “Sling Standalone Application” from the Sling downloads page. Unpack the distribution package and start the Sling application with “java -jar org.apache.sling.launchpad.app-5-incubator.jar”. Like Jackrabbit, Sling can by default be accessed at http://localhost:8080/. There’s a 15 minute tutorial that you can check out to learn more about Sling.

Since Sling comes with an embedded Jackrabbit repository, it also supports much of the WebDAV functionality covered in my previous post. Instead of rehashing those points, this post takes a look at the additional HTTP content access features in Sling.

CR1: Create a document

As with Jackrabbit, all documents in Sling have a path that is used to identify and locate the document. Sling solves the problem of having to come up with the document name by supporting a virtual “star resource” that automatically generates a unique name for a new document. Thus instead of having to think of a URL like “http://localhost:8080/hello” in advance, the new document can be created by simply posting to the star resource at “http://localhost:8080/*”.

The Sling POST servlet is a pretty versatile tool, and can be used to perform many content manipulation operations using normal HTTP POST requests and the application/x-www-form-urlencoded format used by normal HTML forms. With the POST servlet, the example document can be created like this:

$ curl --data 'title=Hello, World!' --data 'date=2009-11-17T12:00:00.000Z' \
       --data 'date@TypeHint=Date' --user admin:admin \
       http://localhost:8080/*

The 201 Created response will contain a Location header that points to the newly created document. In this case the returned URL is “http://localhost:8080/hello_world_” based on some document title heuristics included in Sling. If you run the command again you’ll get a different URL since the Sling star resource will automatically avoid overwriting existing content.
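For clients that aren’t shell scripts, the same request can be assembled with pretty much any HTTP library. Here’s a minimal Python sketch that uses only the standard library. The function name build_create_request is made up for this illustration, and the request is only constructed here, not sent:

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def build_create_request(base_url, properties, type_hints, user, password):
    """Build (but do not send) a form POST to the Sling star resource.

    Hypothetical helper for illustration; not part of Sling itself.
    """
    fields = list(properties.items())
    # Type hints are ordinary form fields named "<property>@TypeHint".
    fields += [(name + "@TypeHint", hint) for name, hint in type_hints.items()]
    body = urlencode(fields).encode("ascii")
    credentials = base64.b64encode(f"{user}:{password}".encode("utf-8"))
    return Request(
        base_url.rstrip("/") + "/*",
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Authorization": "Basic " + credentials.decode("ascii"),
        },
        method="POST",
    )


request = build_create_request(
    "http://localhost:8080",
    {"title": "Hello, World!", "date": "2009-11-17T12:00:00.000Z"},
    {"date": "Date"},
    "admin",
    "admin",
)
print(request.get_method(), request.full_url)
print(request.data.decode("ascii"))
```

Passing the request to urllib.request.urlopen against a running Sling instance should then produce the 201 response described above, with the new document URL in the Location header.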

Pros:

  • A single standard POST request is enough
  • The HTML form format is used for the POST body
  • Automatically generated clean and readable document URL

Cons:

  • The star resource URL pattern is fixed and creates an unnecessarily tight binding between the client and the server

CR2: Read a document

Sling contains multiple ways of accessing the document content in different renderings. In fact much of the power of Sling comes from the extensive support for rendering underlying content in various different and easily customizable ways.

Unfortunately at least the latest 5-incubator version of the Sling Application doesn’t support any reasonable default rendering at the previously returned document URL. The client needs to explicitly know to add a “.json” or “.xml” suffix to the document URL to get a JSON or XML rendering of the document.

$ curl http://localhost:8080/hello_world_.json
{
  "title":           "Hello, World!",
  "date":            "Tue Nov 17 2009 12:00:00 GMT+0100",
  "jcr:primaryType": "nt:unstructured"
}
$ curl http://localhost:8080/hello_world_.xml
<?xml version="1.0" encoding="UTF-8"?>
<hello_world_ xmlns:fn="http://www.w3.org/2005/xpath-functions"
              xmlns:fn_old="http://www.w3.org/2004/10/xpath-functions"
              xmlns:xs="http://www.w3.org/2001/XMLSchema"
              xmlns:jcr="http://www.jcp.org/jcr/1.0"
              xmlns:mix="http://www.jcp.org/jcr/mix/1.0"
              xmlns:sv="http://www.jcp.org/jcr/sv/1.0"
              xmlns:sling="http://sling.apache.org/jcr/sling/1.0"
              xmlns:rep="internal"
              xmlns:nt="http://www.jcp.org/jcr/nt/1.0"
              jcr:primaryType="nt:unstructured"
              date="2009-11-17T12:00:00.000+01:00"
              title="Hello, World!"/>

The JCR document view format is used for the XML rendering.

Pros:

  • A single GET request is enough
  • Both the JSON and XML formats are easy to consume

Cons:

  • Simply GETting the document URL doesn’t return anything useful
  • The “.json” and “.xml” URL patterns create an unnecessary binding between the client and the server
  • Neither rendering contains property type information
  • The XML rendering contains unnecessary namespace declarations
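To illustrate the missing type information, here’s a small Python sketch that consumes the JSON rendering shown above. Since the rendering doesn’t say which values are dates, the client has to know that out of band. The DATE_PROPERTIES set and the date format string are my assumptions based on the sample output, not something Sling documents:

```python
import json
from datetime import datetime

SAMPLE = """{
  "title": "Hello, World!",
  "date": "Tue Nov 17 2009 12:00:00 GMT+0100",
  "jcr:primaryType": "nt:unstructured"
}"""

# Assumption: the client knows in advance which properties hold dates,
# because the JSON rendering itself carries no type information.
DATE_PROPERTIES = {"date"}
# Assumption: format inferred from the sample output above. The %a and %b
# directives expect English day and month names (the default C locale).
DATE_FORMAT = "%a %b %d %Y %H:%M:%S GMT%z"


def parse_document(text):
    """Parse a JSON rendering and convert the known date properties."""
    document = json.loads(text)
    for name in DATE_PROPERTIES:
        if name in document:
            document[name] = datetime.strptime(document[name], DATE_FORMAT)
    return document


document = parse_document(SAMPLE)
print(document["title"])
print(document["date"].isoformat())
```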

CR3: Update a document

The Sling POST servlet also supports document updates, so we can just POST the updated properties to the document URL:

$ curl --data 'history=Document date updated' \
       --data 'date=2009-11-18T12:00:00.000Z' \
       --data 'date@TypeHint=Date' --user admin:admin \
       http://localhost:8080/hello_world_

Pros:

  • A single standard POST request is enough
  • The HTML form format is used for the POST body

Cons:

  • None.

CR4: Delete a document

You can either use the special “:operation=delete” feature of the Sling POST servlet or a standard DELETE request to delete a document:

$ curl --data ':operation=delete' --user admin:admin \
       http://localhost:8080/hello_world_

$ curl --request DELETE --user admin:admin \
       http://localhost:8080/hello_world_

Pros:

  • A standard DELETE or POST request is all that’s needed

Cons:

  • None.
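For comparison, the two delete variants from this section look like this when built with Python’s standard library. This is only a sketch: the requests are constructed but not sent, and the admin credentials used in the curl commands are left out for brevity:

```python
from urllib.parse import urlencode
from urllib.request import Request

DOCUMENT_URL = "http://localhost:8080/hello_world_"

# Variant 1: a form POST carrying the special ":operation" field.
post_delete = Request(
    DOCUMENT_URL,
    data=urlencode({":operation": "delete"}).encode("ascii"),
    headers={"Content-Type": "application/x-www-form-urlencoded"},
    method="POST",
)

# Variant 2: a plain HTTP DELETE request without a body.
plain_delete = Request(DOCUMENT_URL, method="DELETE")

for request in (post_delete, plain_delete):
    print(request.get_method(), request.full_url, request.data)
```

The POST variant is mostly useful for clients like HTML forms that can’t issue DELETE requests.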

Post

Jackrabbit over HTTP

In General on 2009-11-24 by Jukka Zitting

Last week I posted a simple set of operations that a “RESTful content repository” should support over HTTP. Here’s a quick look at how Apache Jackrabbit meets this challenge.

To get started I first downloaded the standalone jar file from the Jackrabbit downloads page, and started it with “java -jar jackrabbit-standalone-1.6.0.jar”. This is a quick and easy way to get a Jackrabbit repository up and running. Just point your browser to http://localhost:8080/ to check that the repository is there.

Jackrabbit comes with a built-in advanced WebDAV feature that gives you pretty good control over your content. The root URL for the default workspace is http://localhost:8080/server/default/jcr:root/ and by default Jackrabbit grants full write access if you specify any username and password.

Note that Jackrabbit also has another, filesystem-oriented WebDAV feature that you can access at http://localhost:8080/repository/default/. This entry point is great for dealing with simple things like normal files and folders, but for more fine-grained content you’ll want to use the advanced WebDAV feature as outlined below.

CR1: Create a document

All documents (nodes) in Jackrabbit have a pathname just like files in a normal file system. Thus to create a new document, we first need to come up with a name and a location for it. Let’s call the example document “hello” and place it at the root of the default workspace, so we can later address it at the path “/hello”. The related WebDAV URL is http://localhost:8080/server/default/jcr:root/hello/.

You can use the MKCOL method to create a new node in Jackrabbit. An MKCOL request without a body will create a new empty node, but you can specify the initial contents of the node by including a snippet of JCR system view XML that describes your content. In our case we want to specify the “title” and “date” properties. Note that JCR does not support date-only properties, so we need to store the date value as a more accurate timestamp.

The full request looks like this:

$ curl --request MKCOL --data @- --user name:pass \
       http://localhost:8080/server/default/jcr:root/hello/ <<END
<sv:node sv:name="hello" xmlns:sv="http://www.jcp.org/jcr/sv/1.0">
  <sv:property sv:name="message" sv:type="String">
    <sv:value>Hello, World!</sv:value>
  </sv:property>
  <sv:property sv:name="date" sv:type="Date">
    <sv:value>2009-11-17T12:00:00.000Z</sv:value>
  </sv:property>
</sv:node>
END

The resulting document is available at the URL we already constructed above, i.e. http://localhost:8080/server/default/jcr:root/hello/.
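The system view body is easy enough to generate programmatically. Here’s a Python sketch using the standard xml.etree.ElementTree module. The helper names system_view_node and mkcol_target are made up for this example. Deriving both the URL and the sv:name attribute from a single name at least keeps the duplicated name consistent:

```python
import xml.etree.ElementTree as ET

SV_NS = "http://www.jcp.org/jcr/sv/1.0"
ET.register_namespace("sv", SV_NS)


def system_view_node(name, properties):
    """Serialize a node as JCR system view XML.

    `properties` maps a property name to a (JCR type, value) pair.
    Hypothetical helper for illustration only.
    """
    node = ET.Element(f"{{{SV_NS}}}node", {f"{{{SV_NS}}}name": name})
    for prop_name, (prop_type, value) in properties.items():
        prop = ET.SubElement(node, f"{{{SV_NS}}}property", {
            f"{{{SV_NS}}}name": prop_name,
            f"{{{SV_NS}}}type": prop_type,
        })
        ET.SubElement(prop, f"{{{SV_NS}}}value").text = value
    return ET.tostring(node, encoding="unicode")


def mkcol_target(workspace_root, name):
    """Derive the request URL from the same name that goes into sv:name."""
    return workspace_root.rstrip("/") + "/" + name + "/"


name = "hello"
url = mkcol_target("http://localhost:8080/server/default/jcr:root/", name)
body = system_view_node(name, {
    "message": ("String", "Hello, World!"),
    "date": ("Date", "2009-11-17T12:00:00.000Z"),
})
print(url)
print(body)
```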

Pros:

  • A single standard WebDAV MKCOL request is enough
  • The standard JCR system view XML format is used for the MKCOL body
  • The XML format is easy to produce

Cons:

  • We need to decide the name and location of the document before it can be created
  • The name of the document is duplicated, once in the URL and once in the sv:name attribute
  • The date property must be specified down to the millisecond
  • While standardized, the MKCOL method is not as well known as PUT or POST
  • While standardized, the JCR system view format is not as well known as JSON, Atom or generic XML
  • The system view XML format is quite verbose

CR2: Read a document

Now that the document is created, we can read it with a standard GET request:

$ curl --user name:pass http://localhost:8080/server/default/jcr:root/hello/
<?xml version="1.0" encoding="UTF-8"?>
<sv:node sv:name="hello"
         xmlns:fn="http://www.w3.org/2005/xpath-functions"
         xmlns:fn_old="http://www.w3.org/2004/10/xpath-functions"
         xmlns:xs="http://www.w3.org/2001/XMLSchema"
         xmlns:jcr="http://www.jcp.org/jcr/1.0"
         xmlns:mix="http://www.jcp.org/jcr/mix/1.0"
         xmlns:sv="http://www.jcp.org/jcr/sv/1.0"
         xmlns:rep="internal"
         xmlns:nt="http://www.jcp.org/jcr/nt/1.0">
  <sv:property sv:name="jcr:primaryType" sv:type="Name">
    <sv:value>nt:unstructured</sv:value>
  </sv:property>
  <sv:property sv:name="date" sv:type="Date">
    <sv:value>2009-11-17T12:00:00.000Z</sv:value>
  </sv:property>
  <sv:property sv:name="message" sv:type="String">
    <sv:value>Hello, World!</sv:value>
  </sv:property>
</sv:node>

Note that the result includes the standard jcr:primaryType property that is always included in all JCR nodes. Also all namespaces registered in the repository are included even though strictly speaking they add little value to the response.
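Consuming the response takes only a few lines with a namespace-aware XML parser. The following Python sketch uses an abbreviated copy of the response above (only the sv namespace declaration is kept) and a made-up helper called read_properties:

```python
import xml.etree.ElementTree as ET

SV = "{http://www.jcp.org/jcr/sv/1.0}"

# Abbreviated copy of the response; unused namespace declarations dropped.
SAMPLE = """<sv:node sv:name="hello" xmlns:sv="http://www.jcp.org/jcr/sv/1.0">
  <sv:property sv:name="jcr:primaryType" sv:type="Name">
    <sv:value>nt:unstructured</sv:value>
  </sv:property>
  <sv:property sv:name="date" sv:type="Date">
    <sv:value>2009-11-17T12:00:00.000Z</sv:value>
  </sv:property>
  <sv:property sv:name="message" sv:type="String">
    <sv:value>Hello, World!</sv:value>
  </sv:property>
</sv:node>"""


def read_properties(xml_text):
    """Return the node name and a {property: (type, [values])} mapping."""
    root = ET.fromstring(xml_text)
    properties = {}
    for prop in root.findall(SV + "property"):
        values = [value.text for value in prop.findall(SV + "value")]
        properties[prop.get(SV + "name")] = (prop.get(SV + "type"), values)
    return root.get(SV + "name"), properties


node_name, properties = read_properties(SAMPLE)
print(node_name)
for prop_name, (prop_type, values) in properties.items():
    print(prop_name, prop_type, values)
```

Unlike the Sling renderings, the system view keeps the type of each property, so the client doesn’t need to guess which values are dates.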

Pros:

  • A single GET request is enough
  • The XML format is easy to consume

Cons:

  • The system view format is a bit verbose and generally not that well known

CR3: Update a document

The WebDAV feature in Jackrabbit does not support setting multiple properties in a single request, so we need to use separate requests for each property change. The easiest way to update a property is to PUT the new value to the property URL. The only tricky part is that unless the node type explicitly says otherwise the new value is by default stored as a binary stream. You need to specify a custom jcr-value/… content type to override that default.

$ curl --request PUT --header "Content-Type: jcr-value/date" \
       --data "2009-11-18T12:00:00.000Z" --user name:pass \
       http://localhost:8080/server/default/jcr:root/hello/date

$ curl --request PUT --header "Content-Type: jcr-value/string" \
       --data "Document date updated" --user name:pass \
       http://localhost:8080/server/default/jcr:root/hello/history

GETting the document after these changes will give you the updated property values.
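Since each property needs its own request, a client will typically want a small loop for updates. Here’s a Python sketch that builds (but doesn’t send) one PUT request per changed property. The helper name property_put_requests is hypothetical, and authentication is left out for brevity:

```python
from urllib.request import Request

NODE_URL = "http://localhost:8080/server/default/jcr:root/hello"


def property_put_requests(node_url, changes):
    """Build one PUT request per property.

    `changes` maps a property name to a (jcr-value subtype, value) pair.
    Hypothetical helper for illustration only.
    """
    requests = []
    for name, (subtype, value) in changes.items():
        requests.append(Request(
            node_url.rstrip("/") + "/" + name,
            data=value.encode("utf-8"),
            # Without this header the value would be stored as binary.
            headers={"Content-Type": "jcr-value/" + subtype},
            method="PUT",
        ))
    return requests


requests = property_put_requests(NODE_URL, {
    "date": ("date", "2009-11-18T12:00:00.000Z"),
    "history": ("string", "Document date updated"),
})
for request in requests:
    print(request.get_method(), request.full_url,
          request.get_header("Content-type"))
```

Note that the updates are not atomic: if the second request fails, the first change has already been stored.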

Pros:

  • Standard PUT requests are used
  • No XML or other wrapper format needed, just send the raw value as the request body

Cons:

  • More than one request needed
  • Need to use non-standard jcr-value/… media types for non-binary values

CR4: Delete a document

Deleting a document is easy with the DELETE method:

$ curl --request DELETE --user name:pass \
       http://localhost:8080/server/default/jcr:root/hello/

That’s it. Trying to GET the document after it’s been deleted gives a 404 response, just as expected.

Pros:

  • A standard DELETE request is all that’s needed

Cons:

  • None.

Post

Daily Shoot, week 1

In General on 2009-11-23 by Jukka Zitting

A week ago James Duncan Davidson and Mike Clark launched @dailyshoot, a Twitter feed that posts daily photo assignments. The idea is to encourage people who want to learn photography to practice it every day with the help of a simple assignment that fits a single tweet. I’m following Duncan’s blog, so I found out about Daily Shoot the day it was launched.

So far I’ve completed all the assignments and I’ve already learned quite a bit doing so. It’s very interesting to see how other people interpret the same assignments. I avoid looking at other responses before completing an assignment so that I don’t end up just copying someone else’s approach. Once I’m done I look at what others have done for some nice insight on what I could have done differently. The process is quite educational.

Here’s what I’ve shot this week:

You can click on the pictures for more background on each assignment and how I approached it. For more information on Daily Shoot, see the recently launched website.

Post

Content Repository over HTTP

In General on 2009-11-18 by Jukka Zitting

Two weeks ago during the BarCamp at the ApacheCon US I chaired a short session titled “The RESTful Content Repository”. The idea of the session was to discuss the various ways that existing content repositories support RESTful access over HTTP and to perhaps find some common ground from which a generic content repository protocol could be formulated.

The REST architectural style was generally accepted as a useful set of constraints for the architecture of distributed content-based applications, but as an architectural style it doesn’t define what the bits on the wire should look like. This is what we set out to define with the HTTP protocol as a baseline. We didn’t get too far, but see below for some collected thoughts and a useful set of “test cases” that I hope to use to further investigate this idea.

Existing solutions

Many existing content repositories and related products already support one or more HTTP-based access patterns: Apache Jackrabbit exposes two slightly different WebDAV-based access points. Apache Sling adds the SlingPostServlet and default JSON and XML renderings of content. Apache CouchDB uses JSON over HTTP as the primary access protocol. Apache Solr uses XML over HTTP. Midgard doesn’t have a built-in HTTP binding for content, but makes it very easy to implement such bindings. This list just scratches the surface…

There are even existing generic protocols that match at least parts of what we wanted to achieve. WebDAV has been around for ten years already, but the way it extends HTTP with extra methods makes it harder to use with existing HTTP clients and libraries. The AtomPub protocol solves that issue, but being based on the Atom format and leaving much of the server behaviour undefined, AtomPub may not be the best solution for generic content repositories.

Content repository operations over HTTP

To better understand the needs and capabilities of existing solutions, we should come up with a simple set of content operations and find out if and how different systems support those operations over HTTP. The most basic such set of operations is CRUD, i.e. how to create, read, update, and delete a document, so let’s start with that. I’m giving each operation a key (CRn, as in “Content Repository operation N”) and a brief description of what’s expected. In later posts I hope to explore how these operations can be implemented with curl or some other simple HTTP client accessing various kinds of content repositories. I’m also planning to extend the set of required operations to cover features like search, linking, versioning, transactions, etc.

CR1: Create a document

Documents with simple properties like strings and dates are basic building blocks of all content applications. How can I create a new document with the following properties?

  • title = “Hello, World!” (string)
  • date = 2009-11-17 (date)

At the end of this operation I should have a URL that I can use to access the created document.

CR2: Read a document

Given the URL of a document (see CR1), how do I read the properties of that document?

The retrieved property values should match the values given when the document was created.

CR3: Update a document

Given the URL of a document (see CR1), how do I update the properties of that document? For example, I want to update the existing date property and add a new string property:

  • date = 2009-11-18 (date)
  • history = “Document date updated” (string)

When the document is read (see CR2) after this update, the retrieved information should contain the original title and the above updated date and history values.

CR4: Delete a document

Given the URL of a document (see CR1), how do I delete that document?

Once deleted, it should no longer be possible to read (see CR2) or update (see CR3) the document.
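To make the expected behaviour concrete, the four operations can be written down as an executable checklist. The Python sketch below is only an illustration: InMemoryRepository is a made-up stand-in for a real HTTP client, the “/documents/N” URLs are invented, and a missing document is signalled with KeyError where a real server would return a 404 response:

```python
from datetime import date


class InMemoryRepository:
    """Hypothetical stand-in for an HTTP client of a real repository."""

    def __init__(self):
        self._documents = {}
        self._counter = 0

    def create(self, properties):
        self._counter += 1
        url = f"/documents/{self._counter}"  # invented URL scheme
        self._documents[url] = dict(properties)
        return url

    def read(self, url):
        return dict(self._documents[url])  # KeyError if missing

    def update(self, url, changes):
        self._documents[url].update(changes)  # KeyError if missing

    def delete(self, url):
        del self._documents[url]


def check_crud(repository):
    """Run CR1-CR4 in order; raise AssertionError on the first mismatch."""
    original = {"title": "Hello, World!", "date": date(2009, 11, 17)}
    url = repository.create(original)                      # CR1
    assert repository.read(url) == original                # CR2
    changes = {"date": date(2009, 11, 18),
               "history": "Document date updated"}
    repository.update(url, changes)                        # CR3
    assert repository.read(url) == {**original, **changes}
    repository.delete(url)                                 # CR4
    # After deletion neither reading nor updating may succeed.
    for operation in (lambda: repository.read(url),
                      lambda: repository.update(url, changes)):
        try:
            operation()
        except KeyError:
            continue
        raise AssertionError("deleted document is still accessible")
    return True


print(check_crud(InMemoryRepository()))
```

Any real repository binding that offers the same four methods could be passed to check_crud in place of the in-memory stand-in.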

Post

NoSQL interests

In General on 2009-10-27 by Jukka Zitting

We’re organizing a NoSQL meetup in Oakland on Monday next week. In addition to helping set the meetup agenda, the “Topics you are interested in” question in the sign up form provides some interesting insight on the current interests of the NoSQL community. Here’s a quick breakdown of the key terms distilled from the 88 signups we’ve received so far.

Note that the data is biased towards Apache projects due to the meetup being organized at ApacheCon US 2009.

Projects

The following open source projects were mentioned. The list is in alphabetical order, as the data set is too small to make any reasonable ordering by popularity.

Topics

Many responses were about the “big data” aspect of the NoSQL movement. Some frequent keywords: distributed storage, large transactional data, consistency, failover, availability, reliability, stability, failure detection, failed node replacement, (petabyte) scalability, consistency levels, storage technology, performance, benchmarks, optimization, backup and recovery, map/reduce

Another common theme was the various database types and the NoSQL “development model”. Keywords: document stores, key/value stores, consistent hashing, graph databases, object databases, persistent queues, content modeling, migration from the relational model, social graphs, streaming, software as a service, offline applications, full text search, natural language processing

Beyond the above big themes, I found it interesting that the following technologies were specifically named: Erlang, Java, WebSimpleDB, WebDAV

In addition to specific topics, many people were asking for case studies or “lessons learned”-type presentations.

Post

Putting POI on a diet

In General on 2009-10-16 by Jukka Zitting

The Apache POI team is doing an amazing job at making Microsoft Office file formats more accessible to the open source Java world. One of the projects that benefits from their work is Apache Tika that uses POI to extract text content and metadata from all sorts of Office documents.


However, there’s one problem with POI that I’d like to see fixed: It’s too big.

More specifically, the ooxml-schemas jar used by POI for the pre-generated XMLBeans bindings for the Office Open XML schemas is taking up over 50% of the 25MB size of the current Tika application. The pie chart below illustrates the relative sizes of the different parser library dependencies of Tika:

Relative sizes of Tika parser dependencies

Both PDF and the Microsoft Office formats are pretty big and complex, so one can expect the relevant parser libraries to be large. But the 14MB size of the ooxml-schemas jar seems excessive, especially since the standard OOXML schema package from which the ooxml-schemas jar is built is only 220KB in size.
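The proportions are easy to check with a few lines of Python. The sizes are the rounded figures quoted above, and I’m assuming 1MB = 1024KB for the conversion:

```python
schemas_jar_mb = 14      # ooxml-schemas jar
tika_app_mb = 25         # complete Tika application
schema_package_kb = 220  # standard OOXML schema package

# Share of the Tika application taken up by the ooxml-schemas jar.
share = schemas_jar_mb / tika_app_mb
# Growth from the schema package to the generated bindings (1MB = 1024KB).
growth = schemas_jar_mb * 1024 / schema_package_kb

print(f"{share:.0%} of the application")
print(f"about {growth:.0f} times the size of the schema package")
```

So the generated bindings are some 65 times larger than the schemas they are generated from.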

Does anyone have good ideas on how to best trim down this OOXML dependency?

Post

Some graphics work for a change

In Jackrabbit on 2009-09-23 by Jukka Zitting

I’ve recently spent some effort in improving the look of the Apache Jackrabbit website. I’m no designer, so the results aren’t that great, but it’s been a nice break from the regular project work. And I got to brush up my Photoshop and Gimp skills.

One part of the effort was creating an icon for the site. Previously the site used the feather icon used as the default on all Apache project sites, but I wanted a Jackrabbit-specific icon that helps me to quickly identify and access Jackrabbit pages among the numerous tabs I usually have open in my browser. The work is a good example of incremental improvements in action:

Jackrabbit icon steps

I started with a copy of the Jackrabbit logo with a nice alpha-layered transparent background. It looked great until I noticed that some browsers lost the smooth alpha layer and instead rendered the rather badly aliased icon seen above.

The straightforward solution was to add a white background as can be seen in step 2. That already worked pretty well in all browsers.

After a few days of watching the icon I found it a bit too blocky for my taste, so I tried to restore some of the nice transparency effect by rounding the corners a bit. I’m pretty happy with the result.

Of course, if you have design talent and think you can do better, go for it!