NoSQL Databases

NoSQL databases have appeared on the radar of web application developers lately. While relational database management systems (RDBMS) have been powering almost every web application on the Internet for more than a decade, this is beginning to change. No longer is the selection of persistence technology a no-brainer. You have additional choices. Besides the old friend RDBMS, there are object-oriented databases, graph-oriented databases, key-value stores, column-oriented databases, and other options. Many of the newer products in this area are known as NoSQL databases. NoSQL is a movement that promotes persistence technologies that break with the conventional relational model. NoSQL databases typically have neither table schemas nor SQL support, and they are designed to scale horizontally.

For those of you old enough to remember Dbase, the NoSQL moniker may not be much of an attention grabber, because after all, products like Dbase, FoxPro, Clipper and similar DB systems never had SQL support either. With these systems, relations had to be expressed implicitly in the application and “queries” had to be coded as retrieval sequences. By contrast, modern NoSQL systems depart from the relational model and in many cases also from the tabular data structure, in order to serve use cases where traditional RDBMS fall short in one way or another. A typical example would be a sparsely populated table, in which most cells are empty. If such a table grows to a large size, it presents an efficiency problem to most RDBMS, with a resulting loss of performance. In the remainder of this article, we will look at a few selected NoSQL databases and see which use cases they cater to.

CouchDB

Apache CouchDB is a document-oriented database that represents documents as JSON objects. CouchDB supports all data types supported by JSON, and thus by JavaScript. The JSON objects are not required to comply with schemas and can therefore be defined freely, which means that each JSON object can have a different structure. CouchDB supports queries by views. Views are aggregate functions and filters programmed in JavaScript that follow the MapReduce pattern. Views are stored and indexed in the database. CouchDB provides a RESTful API where every object (and any other item) in the database can be retrieved by a URL. It uses the HTTP POST, GET, PUT, and DELETE methods for CRUD operations. Other features include ACID semantics on the basis of multi-version concurrency control, similar to RDBMS, which is optimised for a high number of concurrent reads, and a distributed architecture that allows for easy bidirectional replication and offline usage. CouchDB is thus designed from the ground up for Internet use.
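To make the idea of a view more tangible, here is a minimal sketch of the map and reduce steps over schemaless documents. It is written in plain Python purely for illustration; real CouchDB views are JavaScript functions stored in the database, and the document fields and function names below are invented for this example.

```python
from collections import defaultdict

# Schemaless documents: each dict may have a different set of fields.
docs = [
    {"_id": "a1", "type": "post", "author": "anna", "tags": ["nosql", "couchdb"]},
    {"_id": "a2", "type": "post", "author": "ben", "tags": ["nosql"]},
    {"_id": "c1", "type": "comment", "author": "anna", "text": "Nice!"},
]

def map_tags(doc):
    """Map step: emit one (key, value) pair per tag of every post."""
    if doc.get("type") == "post":
        for tag in doc.get("tags", []):
            yield tag, 1

def reduce_sum(values):
    """Reduce step: combine all values that were emitted for one key."""
    return sum(values)

def run_view(documents, map_fn, reduce_fn):
    """Group the emitted pairs by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in sorted(grouped.items())}

tag_counts = run_view(docs, map_tags, reduce_sum)
print(tag_counts)
```

The comment document has no tags field at all, which is fine: the map function simply emits nothing for it.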

Neo4J

Neo4J is a graph database. As the name suggests, it is intended for use with the Java platform, which includes any language that runs on the JVM. Neo4J stores information in nodes and edges; the latter are called relationships in Neo4J. Relationships are always of a defined type. Both nodes and relationships can store properties, i.e. data. The Neo4J database is thus optimised for representing complex graph and network structures, such as a hierarchical object repository or a social network. It offers high-performance graph traversal operations for data access. Nodes can also be indexed and retrieved by key, which enables more conventional queries. Additional features include ACID transactions and transaction recovery, based on the Java Transaction API (JTA). Optional libraries can expose a Neo4J database as an RDF store where the node space can be queried using SPARQL. Neo4J is an embedded database with a small footprint that runs in the same JVM as the application.
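The data model described above (nodes, typed relationships, properties on both, and traversal as the main access path) can be sketched in a few lines of Python. This is an in-memory toy and not the Neo4J API; the node data, the relationship types and the function names are made up for illustration.

```python
from collections import deque

# Nodes carry properties.
nodes = {
    1: {"name": "Alice"},
    2: {"name": "Bob"},
    3: {"name": "Carol"},
    4: {"name": "Dave"},
}

# Relationships are typed and may carry properties as well:
# (start node, relationship type, end node, properties)
relationships = [
    (1, "KNOWS", 2, {"since": 2008}),
    (2, "KNOWS", 3, {"since": 2009}),
    (1, "WORKS_WITH", 4, {}),
]

def neighbours(node_id, rel_type):
    """Return the end nodes of all outgoing relationships of the given type."""
    return [end for start, rtype, end, _props in relationships
            if start == node_id and rtype == rel_type]

def traverse(start, rel_type, max_depth):
    """Breadth-first traversal along relationships of a single type."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node_id, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand nodes beyond the depth limit
        for nxt in neighbours(node_id, rel_type):
            if nxt not in seen:
                seen.add(nxt)
                found.append(nodes[nxt]["name"])
                queue.append((nxt, depth + 1))
    return found

friends_of_friends = traverse(1, "KNOWS", max_depth=2)
print(friends_of_friends)
```

A "friends of friends" question like this one is awkward to express with joins in a relational schema, whereas in a graph model it is just a traversal with a depth limit.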

Redis

Redis is a modern implementation of a persistent key-value store for general purpose use. Key-value store is a name for a simple key-based access mechanism that basically implements a dictionary (or map) data structure. Traditionally, such systems were used for caching. Redis holds its entire database in memory, which makes it ideal for applications that require ultra-fast data access. Redis stores not just plain strings, but also sets and lists of strings. The system offers a number of special commands, such as atomic push/pop and add/remove operations for lists, and set operations such as union, intersection, and difference. Redis persists data either by asynchronously writing memory to disk, or by appending to a journal file as data is written by clients. Additional features include easy master-slave replication and rudimentary sharding. Redis offers support for various languages, such as C/C++, Java, Scala, PHP and others through native drivers and APIs.
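The following sketch mimics the data structures mentioned above with a small in-memory Python class. It is not a Redis client and does not talk to a server; the class name TinyStore is hypothetical, and the method names merely borrow the names of the corresponding Redis commands (LPUSH, RPOP, SADD, SINTER, SUNION, SDIFF).

```python
class TinyStore:
    """Toy in-memory key-value store holding strings, lists and sets."""

    def __init__(self):
        self._data = {}

    def set(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

    def lpush(self, key, value):
        """Insert a value at the head of the list stored at key."""
        self._data.setdefault(key, []).insert(0, value)
        return len(self._data[key])

    def rpop(self, key):
        """Remove and return the tail of the list, or None if it is empty."""
        items = self._data.get(key, [])
        return items.pop() if items else None

    def sadd(self, key, *members):
        self._data.setdefault(key, set()).update(members)

    def _sets(self, keys):
        return [self._data.get(key, set()) for key in keys]

    def sinter(self, first, *others):
        return set(self._data.get(first, set())).intersection(*self._sets(others))

    def sunion(self, first, *others):
        return set(self._data.get(first, set())).union(*self._sets(others))

    def sdiff(self, first, *others):
        return set(self._data.get(first, set())).difference(*self._sets(others))


store = TinyStore()

# A list used as a simple queue: push at the head, pop from the tail.
store.lpush("jobs", "resize-image")
store.lpush("jobs", "send-mail")
first_job = store.rpop("jobs")

# Sets of tags per post, combined with set operations.
store.sadd("tags:post1", "nosql", "redis")
store.sadd("tags:post2", "nosql", "hbase")
common = store.sinter("tags:post1", "tags:post2")
only_post1 = store.sdiff("tags:post1", "tags:post2")
print(first_job, common, only_post1)
```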

HBase

HBase is a free implementation of Google’s BigTable written in Java. It is not the type of database you would use for a blog or forum software. HBase is a tabular data store designed for massive tables in the petabyte range with billions of rows distributed over a number of physical machines, and it is thus optimised for horizontal scaling. HBase is part of the Apache Hadoop project, a framework for data-intensive distributed applications, inspired by Google’s MapReduce and GFS technologies. Hadoop supports the database through its distributed filesystem HDFS, which provides built-in replication and MapReduce traversal for HBase tables of arbitrary size. Features include optimised query push down via server-side scan and get filters, a high performance Thrift gateway, an XML-based RESTful web service gateway, Hadoop cascading, per-column probabilistic Bloom filters, as well as data warehousing and data analysis modules. Since HBase saves column families rather than columns and since empty columns are not stored, it is ideal for sparse tables with semi-structured data. Typical use cases are cloud computing and applications that require massive storage using cheap commodity hardware.
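Why this layout suits sparse tables becomes clearer with a small model. The sketch below stores cells as nested dictionaries (row key, then column family, then column qualifier), so a cell that was never written occupies no space at all. It is a simplified Python illustration, not the HBase API; the class and its method names are hypothetical, and real HBase additionally keeps timestamps and versions per cell.

```python
from collections import defaultdict

class SparseTable:
    """Toy model of a table organised by column families."""

    def __init__(self, families):
        # Column families are fixed up front; qualifiers inside them are free-form.
        self.families = set(families)
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, qualifier, value):
        if family not in self.families:
            raise KeyError("unknown column family: " + family)
        self.rows[row_key][family][qualifier] = value

    def get(self, row_key, family, qualifier, default=None):
        # .get() does not create entries, so reads never allocate empty cells.
        return self.rows.get(row_key, {}).get(family, {}).get(qualifier, default)

    def cell_count(self):
        """Number of cells that are actually stored."""
        return sum(len(columns)
                   for families in self.rows.values()
                   for columns in families.values())


table = SparseTable(["info", "stats"])
table.put("user1", "info", "name", "Anna")
table.put("user1", "info", "email", "anna@example.org")
table.put("user2", "info", "name", "Ben")
table.put("user2", "stats", "logins", 42)

stored_cells = table.cell_count()
missing_email = table.get("user2", "info", "email")
print(stored_cells, missing_email)
```

A fixed-width relational table with the columns name, email and logins would hold six cells for these two rows; here only the four cells that were written exist.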

Db4o

Db4o is an open-source object-oriented database system targeted at OOP developers. The idea behind Db4o is to enable programmers to create and persist a representation of the application object model directly in the database without the need for an object-relational mapping layer. Object instances can then be stored and retrieved with a single line of code. Db4o provides a query mechanism called Native Query (NQ). This allows querying data with native OOP language constructs, thus offering type safety for query expressions while eliminating the need for building query strings. Db4o is available for the Java and .NET platforms. If used with .NET languages, data can alternatively be queried with LINQ (language integrated query). The Db4o database is embeddable and has a small footprint, which makes it suitable for deployment on mobile devices. Additional features include semi-automatic schema versioning, transaction support with ACID semantics, and synchronisation/replication mechanisms that allow synchronisation between different Db4o instances and data export into SQL databases.
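The idea behind native queries, namely that a query is an ordinary predicate written in the programming language itself rather than a query string, can be illustrated with a short Python sketch. This is not the Db4o API (Db4o targets Java and .NET); the ObjectStore class and the Pilot example class are invented for this illustration, and nothing is persisted to disk.

```python
from dataclasses import dataclass

@dataclass
class Pilot:
    name: str
    points: int

class ObjectStore:
    """Toy object container: stores instances as they are, without a mapping layer."""

    def __init__(self):
        self._objects = []

    def store(self, obj):
        self._objects.append(obj)

    def query(self, cls, predicate):
        """Return all stored instances of cls for which the predicate holds."""
        return [obj for obj in self._objects
                if isinstance(obj, cls) and predicate(obj)]


db = ObjectStore()
db.store(Pilot("Anna", 100))
db.store(Pilot("Ben", 42))
db.store(Pilot("Carol", 99))

# The query is plain code: no query string to build, and attribute
# names can be checked by the usual language tooling.
top_pilots = db.query(Pilot, lambda pilot: pilot.points > 90)
print([pilot.name for pilot in top_pilots])
```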

Onward To Lucid Lynx

Ubuntu 10.04 alias Lucid Lynx has arrived, and because this is a long-term support (LTS) version, many users are bound to upgrade within the next few weeks. It seems like the GUI people from Canonical were unusually daring this time. Not only is this the first Ubuntu version that sports a graphical interface that is NOT BROWN (shock!), but the window control buttons are on the wrong side, namely on the left (double shock!). Apparently, Mac OS X Leopard served as the godfather here. Well, I am not going to get used to window controls on the left side, so I applied a quick fix which is amply documented on the Internet, as many people seem to feel the same way. Otherwise, the new look is a welcome change, as the permutations of brown and orange seemed to have been exhausted.

The only thing that turned out to be slightly trickier was the Tomcat upgrade to 6.0.24. A surreptitious installation of Apache 2 (the purpose of which eluded me) took possession of port 80, which on my machine was previously occupied by the system-wide Tomcat installation. This was rather easy to solve by disabling Apache on boot with the command: sudo update-rc.d -f apache2 remove. It turned out, however, that the application launcher jsvc was removed in Ubuntu 10.04. Since Tomcat previously relied on jsvc to bind to privileged ports, Tomcat was not able to bind to port 80 any longer. I was able to solve this by setting the AUTHBIND variable in /etc/default/tomcat to ‘yes’. After that Tomcat started up on port 80 without complaints.

[Image: Ubuntu 10.04 default theme]

During the upgrade, the system politely asked whether to replace or keep manually changed system configuration files. I chose to replace most files, because the upgrade manager is kind enough to create a copy of the existing configuration with the *.dpkg-old extension during the upgrade. That way I was able to diff configuration files later and incorporate any customisations into the new files. This method is superior to keeping the old files, because it allows for upgrading the configuration files in sync with the latest program versions, though of course it takes a bit of work to diff and patch those files manually if you happen to have numerous customisations. You can alternatively keep the old files and then diff and patch the new files created by the upgrade manager with the *.dpkg-dist extension. In summary, the upgrade was painless and took less than 90 minutes per machine.
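The comparison step can of course be done with the diff command. For those who prefer to script it, here is a small sketch using Python’s standard difflib module. The two configuration texts and the file names example.conf and example.conf.dpkg-old are invented for illustration; in practice you would read the real files from /etc instead.

```python
import difflib

# Contents of the customised old file (hypothetical example).
old_config = """port = 80
max_threads = 150
log_level = debug
"""

# Contents of the freshly installed file (hypothetical example).
new_config = """port = 8080
max_threads = 150
log_level = info
compression = on
"""

def config_diff(old_text, new_text, old_name, new_name):
    """Return the unified diff between two configuration texts as a list of lines."""
    return list(difflib.unified_diff(
        old_text.splitlines(),
        new_text.splitlines(),
        fromfile=old_name,
        tofile=new_name,
        lineterm="",
    ))

diff_lines = config_diff(old_config, new_config,
                         "example.conf.dpkg-old", "example.conf")
print("\n".join(diff_lines))

# Keep only the changed lines, without the two file header lines.
changed = [line for line in diff_lines
           if line.startswith(("+", "-"))
           and not line.startswith(("+++", "---"))]
```

Lines starting with a minus sign exist only in the old file and are therefore candidates for customisations that need to be carried over.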

Scala Pages Released

I am glad to announce the first release of the Scala Pages (SCP) lightweight web framework, which I wrote for a personal web application project. Although it’s still at a humble stage of development, I believe it is useful enough to be shared. I am planning to extend it in future and I hope that it will contribute to increased diversity in the Scala web development area. Software and manual can be downloaded from this page:

http://www.thomasknierim.com/scala-pages-web-framework/

The source code is included in the distribution zip file. Comments, suggestions, and constructive criticism are always welcome.

Addendum: I was asked to make a brief comparison to the Lift framework, so here’s my answer:

The approach that SCP takes is different. To put it in a nutshell, Lift is Rails-inspired, XML-oriented, and it abstracts from the Servlet API and the request/response model. SCP is inspired by traditional Java MVC frameworks; it is text-oriented, and it builds directly on the request/response model and the Servlet API.

Lift processes templates with a SAX parser and presents Scala XML data structures to the programmer. By contrast, SCP reads templates as plain text from top to bottom, performs variable replacement and executes embedded instructions. Template processing with SCP thus consumes fewer resources. Lift uses prefixed XML tags; SCP uses processing instructions.
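To illustrate what text-oriented template processing with variable replacement means in general, here is a tiny Python sketch based on the standard string.Template class. Note that the $name placeholder syntax is Python’s, not SCP’s, and the template content is made up; the sketch only shows the principle of treating a template as plain text rather than as a parsed XML tree.

```python
from string import Template

# The template is just text; no XML parsing takes place.
page = Template("<h1>$title</h1>\n<p>Hello, $user!</p>")

# Variable replacement fills in the placeholders.
html = page.substitute(title="Welcome", user="Thomas")
print(html)
```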

Lift wraps much of the Servlet API and presents a number of abstracted objects to the programmer, such as the S object, the SHtml object, LiftRules, and so on. SCP doesn’t do that. With SCP you get the familiar JEE objects such as HttpServletRequest, ServletContext, etc. and you use these directly from the Scala code. So, it’s easy to use if you’ve done Java web development.

The consequence is that you deal with the request/response model directly, which I personally prefer, because it keeps the control flow simple and clear. The only disadvantage of the MVC request/response model is that controllers tend to become bulky and difficult to reuse. The SCP solution to this problem is the same that Lift offers: snippets.

In SCP, snippets are processing instructions powered by a custom Scala class that the programmer provides. This means you can encapsulate application logic and/or complicated display logic in a snippet class which, if properly coded, is perfectly reusable. SCP also keeps template expressions very simple, because the more complex a template language becomes (many are even Turing-complete), the more likely it is for application logic to sneak into the view. Currently, iteration is the only supported type of control flow.

Last but not least, Lift is huge compared to SCP. It has a lot of functionality that SCP does not offer, at least not at this time.

Serve PHP with Tomcat

As you can gather from the title of this website, I create software in Java, Scala, and PHP. While Java and Scala compile to bytecode that runs on the same virtual machine, PHP is executed by a separate interpreter. The most efficient way to run PHP scripts is to integrate the interpreter directly into the web server. Hence, most PHP developers use a local Apache httpd server with mod_php for development. If you also do Java programming, this raises the problem that you need two different web servers, namely Tomcat (or another web container or application server) for Java development and Apache for PHP development.

Running two servers is a bit of a nuisance. Two servers consume more resources than one, and you cannot run both on the same port. This problem can be solved in three different ways: you can run only one server at a time, you can use a different port number for one server, which then has to be included in the URLs, or you can integrate the two servers. There are again at least three different ways to accomplish the latter: you can proxy requests from Apache to Tomcat, you can proxy requests from Tomcat to Apache, or you can use a connector module, such as mod_jk. Of course, maintaining two servers is more complicated than maintaining one, and the integration adds additional complexity.

Fortunately, there is an easier way to integrate PHP and Java web applications. PHP/Java Bridge is a free open source product for the integration of the native PHP interpreter with the Java VM. It is designed with web applications in mind: Java servlets can “talk” to PHP scripts and vice versa. The official website describes it as an “implementation of a streaming, XML-based network protocol which is up to 50 times faster than local RPC via SOAP.” PHP/Java Bridge requires no additional components to invoke Java procedures from PHP or vice versa.

Although there are a number of different use cases, I am going to describe a particular one in this article, namely how to configure Tomcat with PHP/Java Bridge in order to have Tomcat serve PHP web pages. Let’s start with the software requirements. We need the following software packages: a JVM, Tomcat, PHP, and PHP/Java Bridge.

Follow the standard installation procedures for the JVM, Tomcat, and PHP. On Linux, you can use the standard packages for your distribution, and on Windows you can use the regular installers. Make sure that both Tomcat and PHP are installed properly, which means that you should see Tomcat’s welcome page at http://localhost:8080 and you should be able to execute a PHP script from the command line by invoking the standalone “php” command. The PHP/Java Bridge product does not use the regular executable, however, but FastCGI. The FastCGI executable is called php-cgi (or php-cgi.exe on Windows), so you must make sure that your PHP installation contains it. Then you are all set to install and configure the PHP/Java Bridge.

The PHP/Java Bridge package comes with a sample web application named JavaBridge.war. Deploy the application in Tomcat, point your browser to http://localhost:8080/JavaBridge and try out the examples. If this works, you are halfway there. To provide the capability to execute PHP scripts server-wide, not just in a single web application, you need to make some changes to the Tomcat configuration. Find the three jar files named JavaBridge.jar, php-servlet.jar and php-script.jar (look in WEB-INF/lib) and move them to Tomcat’s shared library directory. This is usually found in $CATALINA_HOME/lib (or $CATALINA_HOME/shared in older Tomcat installations). Then edit Tomcat’s conf/web.xml configuration file and add the following lines:

<listener>
  <listener-class>
    php.java.servlet.ContextLoaderListener
  </listener-class>
</listener>

<servlet>
  <servlet-name>PhpJavaServlet</servlet-name>
  <servlet-class>
    php.java.servlet.PhpJavaServlet
  </servlet-class>
</servlet>

<servlet>
  <servlet-name>PhpCGIServlet</servlet-name>
  <servlet-class>
    php.java.servlet.PhpCGIServlet
  </servlet-class>
  <init-param>
    <param-name>prefer_system_php_exec</param-name>
    <param-value>On</param-value>
  </init-param>
  <init-param>
    <param-name>php_include_java</param-name>
    <param-value>On</param-value>
  </init-param>
</servlet>

<servlet-mapping>
  <servlet-name>PhpJavaServlet</servlet-name>
  <url-pattern>*.phpjavabridge</url-pattern>
</servlet-mapping>

<servlet-mapping>
  <servlet-name>PhpCGIServlet</servlet-name>
  <url-pattern>*.php</url-pattern>
</servlet-mapping>

This adds the listeners and servlets required for PHP script execution to all web applications. While you are at it, you might also want to enable index.php files to display when a directory URL is requested. Simply add it to the list of welcome files in conf/web.xml. My list looks like this:

<welcome-file-list>
    <welcome-file>index.html</welcome-file>
    <welcome-file>index.htm</welcome-file>
    <welcome-file>index.jsp</welcome-file>
    <welcome-file>index.php</welcome-file>
</welcome-file-list>
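Since a typo in conf/web.xml can prevent Tomcat from starting, it may be worth checking the edited file before restarting the server. The following Python sketch parses a web.xml-style document and lists which servlet handles which URL pattern. The inline fragment is a shortened stand-in for the real file, and the helper names are hypothetical; to check an actual installation, you would read the text of your own conf/web.xml instead.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for conf/web.xml (assumption: the real file is much longer).
web_xml_fragment = """
<web-app>
  <servlet-mapping>
    <servlet-name>PhpJavaServlet</servlet-name>
    <url-pattern>*.phpjavabridge</url-pattern>
  </servlet-mapping>
  <servlet-mapping>
    <servlet-name>PhpCGIServlet</servlet-name>
    <url-pattern>*.php</url-pattern>
  </servlet-mapping>
</web-app>
"""

def local_name(tag):
    """Strip an XML namespace such as '{http://...}servlet-mapping'."""
    return tag.rsplit("}", 1)[-1]

def servlet_mappings(xml_text):
    """Return a dict that maps each URL pattern to its servlet name."""
    root = ET.fromstring(xml_text.strip())  # raises ParseError if not well-formed
    mappings = {}
    for element in root.iter():
        if local_name(element.tag) == "servlet-mapping":
            fields = {local_name(child.tag): (child.text or "").strip()
                      for child in element}
            mappings[fields["url-pattern"]] = fields["servlet-name"]
    return mappings

mappings = servlet_mappings(web_xml_fragment)
print(mappings)
```

If the file is not well-formed, for example because of a missing opening or closing tag, ET.fromstring raises a ParseError that names the offending line.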

Now you can copy PHP scripts into the context root directory of any web application and type the script URL into your browser. I suggest you try a script with phpinfo(). It gives you plenty of useful configuration info. If this doesn’t work and you are on Unix, the problem might be file permissions. On my machine, I had to copy the contents of the “java” directory in the JavaBridge webapp manually to the context root directory where PHP applications were installed. This directory contains two files, Java.inc and JavaProxy.php. Normally, the PHP/Java Bridge software copies them automatically, but it might not be able to do so if it does not have proper permissions:

~$ ls -lh /var/lib/tomcat6/webapps/ROOT/java
total 136K
-rw-rw-r-- 1 root root 64K 2009-12-30 14:14 Java.inc
-rw-rw-r-- 1 root root 64K 2009-12-30 14:14 JavaProxy.php
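The same check can be scripted. The sketch below reports for each of the two files whether it is present and readable. The function name is hypothetical, and the demonstration runs in a throwaway temporary directory; for a real check you would pass the “java” directory of your context root. Keep in mind that os.access tests the permissions of the user running the script, so it is most meaningful when run as the user that Tomcat runs under.

```python
import os
import tempfile

def check_bridge_files(directory, names=("Java.inc", "JavaProxy.php")):
    """Return a dict mapping each file name to 'ok', 'missing' or 'unreadable'."""
    status = {}
    for name in names:
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            status[name] = "missing"
        elif not os.access(path, os.R_OK):
            status[name] = "unreadable"
        else:
            status[name] = "ok"
    return status

# Demonstration: only one of the two files exists in the temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "Java.inc"), "w").close()
    demo_status = check_bridge_files(tmp)

print(demo_status)
```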

Now try calling a PHP script. For example, a script containing the phpinfo() command displays information about the server.

I have configured my machine to host all PHP web applications in Tomcat’s ROOT context. This eliminates the extra path component of the webapp context, since the ROOT’s context path is “/”. Then I softlinked the folder that contains all my PHP projects into the ROOT webapp directory, so that the actual source files are kept separate from the Tomcat installation. In order to enable Tomcat to follow symlinks, you need to edit the context.xml of the respective web application, in this case ROOT, and add the line: <Context path="/" allowLinking="true" />.

Another possible gotcha is Tomcat’s security manager, which is enabled by default on Ubuntu, but not on Windows. Although a security manager is not necessary for most development scenarios, it is highly recommended for production. I consider it good practice to enable the security manager on the development machine, because it allows me to recognise security problems early during development, before the application is deployed on the production server. The downside is that additional configuration may be required for PHP applications to function properly. The respective configuration files are located in $CATALINA_BASE/conf/policy.d. Most likely, you need to grant PHP web applications write access to files in the document root and possibly other permissions, such as opening sockets. It’s probably safest to do this on a per-application basis.

Ubuntu Newbie Tips

I’ve been using Linux on servers in various flavours since 1997, but I am relatively new to Ubuntu and I have just started using Ubuntu as a desktop OS. Despite some installation problems, the overall experience was very positive. I had made earlier attempts to switch over to Linux, but for one reason or another these were thwarted, mostly because of the professional necessity of testing software under Windows. Since I am now working on cross-platform applications, that particular constraint has evaporated. I spend most of my day developing software and writing documentation. Before installing Ubuntu, I was slightly concerned that there would be a temporary decrease in productivity due to having to learn new software. However, this concern turned out to be largely unfounded.

Most of the key applications like Eclipse, Firefox, Thunderbird, and OpenOffice work exactly the same under Linux as they do under Windows. The only major change was replacing Notepad++ (which only runs on Windows) with vi/vim. These editors are suitable for programming in situations where you don’t want to fire up an IDE. Furthermore, I have made some customisations to ease the transition, which I’d like to share with you. If you are new to Linux, you might find one or two of them useful for your own work. The following list is by no means exhaustive or even comprehensive, just a number of things I stumbled across during my first two weeks with desktop Ubuntu.

Repositories and download servers
Ubuntu maintains software packages with the Synaptic package manager. Because as a new user you are likely to make frequent use of this tool, one of the most useful things to do is to optimise its usage. This involves defining the repositories and the download server. Choose System/Administration/Software Sources from the main menu. In the first tab “Ubuntu Software”, select the four items marked with “main”, “universe”, “restricted” and “multiverse” for the widest choice of software packages. Next, optimise the download server. I wasted a whole day downloading the 9.04->9.10 update because of a slow server. Ubuntu can find the fastest server for you. Select “Other…” in the “Download from” dropdown box. A dialogue with a list of servers appears. Click on “Select Best Server” to let Ubuntu test all available servers for their response time and select the fastest one.

Keyboard and language customisations
If you are, like me, frequently typing text in different languages, chances are that the default language and keyboard settings will not suit you. Fortunately, Ubuntu is easy to configure for international use, possibly even superior to Windows in this regard. First, I added Thai language support in System/Administration/Language Support. Then I configured two additional keyboard layouts, German and Thai, in System/Preferences/Keyboard/Layout. As I am using a Thai/English keyboard, I have to know the German key mapping by heart, so the German layout is only of limited use. On Windows I got used to producing international characters by typing ALT+num key sequences. On Linux, this is even easier thanks to the concept of the compose key. In the keyboard layout dialogue, click on “Layout Options”, which will show you a number of intricate keyboard customisation options. Click on “Compose key position” and pick a key, for instance “Right Alt”. Now you can use this key to compose international characters. For example, type right Alt, double quotation marks, and the letter ‘u’ to produce the German umlaut ‘ü’. Type right Alt, backtick and the letter ‘a’ to produce ‘à’ with a grave accent. Voilà!

Customising Nautilus
Nautilus is the Linux/Gnome equivalent of Windows Explorer. In fact, I find it to be superior to the latter, because it supports protocols for remote access (such as ftp/sftp) and it offers better search capability and better support for compressed files. If you prefer to work with a GUI rather than the command line, you will probably want to customise Nautilus in some way. The most obvious candidates for customisation are file associations. These can be defined by right-clicking on a file, selecting “Properties” from the context menu and switching to the “Open With” tab in the property dialogue. Here you can define alternative applications to use for opening a file, as well as the default application that is started upon double-click. If you need even more customisation options, install the package named “nautilus-actions”. This package lets you define custom actions for file entries in Nautilus which can be incorporated into the context menu. Predefined Nautilus extensions (aka shell extensions) for various file display and transformation purposes are also available.

Command line and terminal customisations
Ubuntu comes with bash (the Bourne-again shell) and the Gnome Terminal as command line defaults. These are fine for me. However, there is one feature which I found missing in the terminal application: it is not possible to search the output buffer. For example, when I run applications that produce a large amount of diagnostic output, there is no intuitive way to search through this data, other than piping it into a command like “less”. I have found a little program named “screen” which appears to solve this problem. After “screen” is started, virtual sessions can be created within the same terminal window, each with its own searchable buffer. “Screen” involves remembering some arcane keyboard commands, but that’s the best I could find so far. Another command line annoyance is that the “vi” editor runs in compatible mode by default. In this mode, the cursor keys produce character output in insert mode; in other words, the cursor keys are broken. There is an easy fix for this, however. Put a file named .vimrc in your home directory that contains a single line saying “set nocompatible” and the cursor keys will work again.

Backup and antivirus software
Surprisingly, neither backup nor antivirus software packages are included in the default Ubuntu installation. Although viruses are probably not an immediate threat on a Linux system, I would rather not breed any of them on my machine. There is the open source software ClamAV as well as a number of free-for-private-use commercial offerings for Linux. I am still evaluating antivirus software. So far I have found ClamAV and AVG quite usable, but not quite as convenient as under Windows. Backup software is an absolute necessity in my opinion, and I am surprised that it isn’t integrated in the original Ubuntu installation. Of course, individual backup needs differ, but a simple mirroring and archiving facility is probably required for even the most basic usage. Initially, I planned to hack together a script based on rsync for that purpose, but I have found something much nicer. The “backintime” package lets you create incremental backups with great ease and minimal storage requirements. Backintime revolves around the concept of snapshots; it is a GUI front end for rsync, diff, and cron. I highly recommend it.