NoSQL Databases

NoSQL DatabasesNoSQL databases have entered the radar of web application developers lately. While relational database management systems (RDBMS) have been powering almost every web application on the Internet for more than a decade, this is beginning to change. No longer is the selection of persistence technology a no-brainer. You have additional choices. Besides the old friend RDBMS, there are object-oriented databases, graph-oriented databases, key-value stores, column-oriented databases, and other options. Many of the newer products in this area are known as NoSQL databases. NoSQL is a movement that promotes persistence technologies that break with the conventional relational model. NoSQL databases typically don’t have tables schemas, SQL support, and are designed to scale horizontally.

For those of you old enough to remember Dbase, the NoSQL moniker may not be much of an attention grabber, because after all, products like Dbase, FoxPro, Clipper and similar DB systems never had SQL support either. With these systems, relations had to be expressed implicitly in the application and “queries” had to be coded as retrieval sequences. By contrast, modern NoSQL systems depart from the relational model and in many cases also from the tabular data structure, in order to serve use cases where traditional RDBMS fail in one or another way. A typical example would be a sparsely populated table that contains very few data in rows and columns. Such a table -if it grows to a large size- presents an efficiency problem to most RDBMS with resulting performance loss. In the remainder of this article, we will look at a few selected NoSQL databases and see which use cases they cater to.

CouchDB

Apache CouchDB is a document-oriented database that represents documents as JSON objects. CouchDB supports all data types supported by JSON, or respectively by Javascript. The JSON objects are not required to comply with schemas and can therefore be defined freely, which means that each JSON object can have a different structure. CouchDB supports queries by views. Views are aggregate functions and filters programmed in Javascript that follow the MapReduce algorithm. Views are stored and indexed in the database. CouchDB provides a RESTful API where every object (and any other item) in the database can be retrieved by an URL. It uses the HTTP POST, GET, PUT, and DELETE methods for CRUD operations. Other features include ACID semantics on basis of multi-version concurrency control, similar to RDBMS, which is optimised for a high number of concurrent reads, and a distributed architecture that allows for easy bidirectional replication and offline usage. CouchDB is thus designed from ground up for Internet use.

Neo4J

Neo4J is a graph database. As the name suggests, it is intended for use with the Java platform, which includes any language that runs on the JVM. Neo4J stores information in nodes and edges; the latter are called relationships in case of Neo4J. Relationships are always of a defined type. Both nodes and relationships can store properties, i.e. data. The Neo4J database is thus optimised for representing complex graph and network structures, such as a hierarchical object repository or a social network. It offers high-performance graph traversal operations for data access. Nodes can also be indexed and retrieved by key which enables more conventional style queries. Additional features include ACID transactions and transaction recovery, based on the Java Transaction API (JTA). Optional libraries can expose a Neo4J database as an RDF store where the node space can be queried using SPARQL. Neo4J is an embedded database with a small footprint that runs in the same JVM as the application.

Redis

Redis is a modern implementation of a persistent key-value store for general purpose use. Key-value store is a name for a simple key-based access mechanism that basically implements a dictionary (or map) data structure. Traditionally, such systems were used for caching and Redis holds its entire database in memory, which makes it ideal for applications that require ultra-fast data access. Redis allows not just plain string data but also allows sets and lists of strings in the data space. The system offers a number of special commands, such as atomic push/pop and add/remove operations for lists and set operations such as building union, intersection, and difference. Redis persists data either by asynchronously writing memory to disk, or by appending to a journalling file as data is written by clients. Additional features include easy master-slave replication and rudimentary sharding. Redis offers support for various languages, such as C/C++, Java, Scala, PHP and others through native drivers and APIs.

HBase

HBase is a free implementation of Google’s BigTable written in Java. It is not the type of database you would use for a blog or a forum software. HBase is a tabular data storage designed for massive tables in the Petabyte range with billions of rows distributed over a number of physical machines and thus optimised for horizontal scaling. HBase is part of the Apache Hadoop project, a framework for data-intensive distributed applications, inspired by Google’s MapReduce and GFS technologies. Hadoop supports the database through its distributed filesystem HDFS which provides built-in replication and MapReduce traversal for HBase tables of arbitrary size. Features include optimised query push down via server-side scan and get filters, a high performance Thrift gateway, an XLM-based RESTful Webservice gateway, Hadoop cascading, per-column probabilistic Bloom filters, as well as data warehousing and data analysis modules. Since HBase saves column families rather than columns and since empty columns are not stored, it is ideal for sparse tables with semi-structured data. Typical use cases are cloud computing and applications that require massive storage using cheap commodity hardware.

Db4o

Db4o is an open-source object-oriented database system targeted at OOP developers. The idea behind Db4o is to enable programmers to create and persist a representation of the application object model directly in the database without the need for an object-relational mapping software layer. Object instances can then be stored and retrieved with a single line of code. Db4o provides a query mechanism called Native Query (NQ). This allows querying data with native OOP language constructs thus offering type safety for query expressions while eliminating the need for building query strings. Db4o is available for the Java and .NET platforms. If used with .NET languages, data can alternatively be queried with LINQ (language integrated query). The Db4o database is embeddable with a small footprint suitable to be deployed on mobile devices. Additional features include semi-automatic schema versioning, transaction support with ACID semantics, and synchronisation/replication mechanisms that allow synchronisation between different Db4o instances and data export into SQL databases.

Frameworkless Architecture

Perhaps suggesting to eschew web frameworks for web application development is playing the devil’s advocate. Perhaps it is even foolish. To renounce the productivity boost one gets with a properly designed framework does not sound like sensible advice. Only ignorant script kiddies entertain such ideas. Well, for the most part that is true. A web framework does indeed simplify application development if it is chosen well. It does even more if it is designed well. It can provide architectural support for building maintainable applications. It can help with the plumbing and provide conceptual structure to guide the development process.

So, what speaks against using a web framework? Plenty actually, especially at the lower end of the spectrum and especially with dynamic languages. The main problem with web frameworks is that they add overhead. This means that the added functionality and structure is bought at the cost of performance degradation. The graveness of this problem depends on the system architecture. One  needs to keep in mind, that dynamic languages are interpreted at runtime, which makes them CPU-intensive and relatively slow. Because the life cycle of a script is essentially stateless and single-step, classes and data structures need to be rebuild and reloaded (in theory) at each request.

In practice, this does not happen, because servers are designed to provide at least rudimentary caching. However, the runtime performance of interpreted languages is typically several magnitudes smaller than that of a compiled language, which magnifies the problem. To illustrate my point, consider these benchmarks for PHP frameworks kindly provided by Paul M. Jones. According to these figures, a trivial PHP page is served by Apache 2 at a performance reduction of 43% compared to static HTML. The use of various PHP web frameworks further reduces performance by 85% – 95% compared to a PHP page that merely echoes content. Although it can be expected that these figures develop inverse logarithmically with increasing application code complexity, the slowdown is significant.

PHP offers a number of remedies, such as  opcode caching, object caching, and products such as Zend Server, APC, and MCache, yet performance is unlikely to get even close to that of a compiled language. Furthermore, there is the question whether the complexity of the project justifies the complexity introduced by a web framework. Would you use a web framework for building a guestbook script? Probably not. What about a blog software? A photo gallery? A bulletin board? These types of applications are the mainstay of dynamic languages, such as PHP. It is the area where PHP really shines. Think of WordPress, phpBB, Mediawiki, Drupal, osCommerce, Coppermine and other popular applications. They all have one thing in common: they don’t use a framework.

Hence, before choosing a web framework for PHP development, it may be worth pondering if any is required. This suggestion may sound a bit contradictory, having just reviewed the Zend framework in a previous article. However, in my own practice I haven’t come across many complex PHP projects. The commercial PHP projects I worked on during the last 10 years can roughly be divided into three categories: 1. extensions and customisations of open source packages, 2. intranet information systems, and 3. e-commerce systems and “catalogware”.

Although the latter two may be considered candidates for web frameworks, the size of these projects was almost always small enough to do without. On several occasions, I chose to implement an “ultralight” MVC architecture by hand instead of using an out-of-the-box framework. The main reason for this was again performance. The “ultralight” approach is defined by implementing only the required functionality, which results in highly specialised design.

In practice, this means slimming the controller, reducing DB abstraction to a thin wrapper around the native library, and foregoing a templating system in favour of embedded PHP. The advantage of this approach is that you get separation of presentation and business logic, componentisation, and customisable control flow without the performance cost of full-blown framework. The disadvantage is that it is slightly more laborious to implement and less flexible. Don’t get me wrong. I have no problems imagining scenarios where I would want to use a PHP web framework such as the Zend framework. However, in these cases I’d probably be drawn towards using Java or (hopefully) Scala in the first place. In summary, I have found myself using PHP mostly in situations where a web framework seemed dispensable, while I have been using Java mostly in situations where a web framework seemed essential.

CSS Grid Layouts Brittle

Recently I changed parts of the HTML template for this blog from CSS divs to tables. Gasp, tables? That's so nineties. Indeed, it is. However, the CSS floating divs were just too brittle. An occasional wide image or wide block of <pre> text would mess up the sidebar badly. Also, the visual results were different in different browsers. The problem puppy was a browser whose name shall not be mentioned (but I can tell you it starts with “I” and ends with “6.0”). Call me old-fashioned, but I think that a table-based design often beats CSS in terms of robustness. Why spend hours testing a complex CSS design if the same job can be accomplished with tables in a few minutes? Tables are especially handy with multiple columns, nested columns and rows, and elastic designs. I would still use CSS in most situations, but you can't beat tables for robust grid layouts.

HTML 5 Preview

Because HTML is at the very core of the World Wide Web, you would expect it to be a mature and refined technology. You would also expect it to provide a flexible platform for Web application development and deployment. As most web developers know, the reality is a bit different. HTML started out as a rather simple SGML application for creating hyperlinked documents. It originally provided a basic set of elements for data viewing, data input, and formatting, whereas it did a little bit of all, yet nothing quite right. While this was practical for whipping up quick-and-dirty websites, it proved to be inadequate for more demanding presentation tasks and fine-tuned user interaction. Thus a whole bunch of supplemental technologies came into being, including CSS, JavaScript, Flash and finally AJAX. You know the story. All of this was quite a messy affair and unfortunately it still is.

While the HTML 4.01 specification has ruled the Web since 1999, the fifth incarnation of HTML was released by the W3C as a working draft earlier this year and is constantly updated since then. The HTML 5 specification is supposed to pave the way for future Web standards. It contains an older draft of W3C dubbed “Web Forms 2.0”, which is W3C’s answer to Web 2.0 and the World Wide Web becoming a platform for distributed applications. Don’t expect anything too radical, though. It neither delivers the hailed “rich GUI” for the Internet, nor will it replace current technologies like AJAX. It is rather designed as a natural extension of the former. It provides good backward compatibility while smoothing some of the rough edges of HTML. No more no less. Let’s have a look at the new features in more detail.

HTML 5 mends the split between the preceding HTML 4 and XHTML 1.0 specifications. Rather than being defined in terms of syntactical rules, it makes the DOM tree its conceptual basis. Thus HTML 5 can be expressed in two similar syntaxes, the “traditional” one and the XML syntax, which both result in the same DOM tree. It goes far beyond the scope of previous specifications, for example by spelling out how markup errors are handled, rather than leaving it to browser vendors, and by specifying APIs for new and old elements. These APIs describe how scripting languages interact with HTML. So, what’s new? The following elements have been dropped from the specification:

  • <acronym>
  • <applet>
  • <basefont>
  • <center>
  • <dir>
  • <font>
  • <frame>
  • <frameset>
  • <isindex>
  • <noframes>
  • <s>
  • <small>
  • <strike>
  • <tt>
  • <u>
  • <xmp>

The following attributes are also goners:

    abbr, accesskey, align, alink, axis, background, bgcolor, border, cellpadding and cellspacing, char, charoff, charset, classid, clear, compact, codebase, codetype, coords, declare, frame, frameborder, headers, height, hspace, language, link, marginheight and marginwidth, name, nohref, noshade, nowrap, profile, rules, rev, scope, scrolling, shape, scheme, size, standby, summary, target, text, type, valuetype, valign, version, vlink, width.

Some of these elements and attributes are quite obscure, so perhaps they won’t be missed. Others like <center>, align, background, and <u> were heavily used in the past, although most of these were already deprecated in HTML 4. The message here is clear: get rid of presentational markup and use CSS instead. The <b>, <i>, <em> and <strong> tags have miraculously survived, however. Although primarily used for text formatting in the past, these tags have been assigned new (non-presentational) semantics to make them respectable. Another conspicuous omission are frames. Yes, frames are gone! But you might breath a sigh of relief to know that <iframe> is still there. Speaking presentational versus semantic HTML, there are quite a few additions to HTML 5 in the latter category. The new semantic tags are designed to aid HTML authors in structuring text and to make it easier for search engine crawlers to parse information in web pages. Here they are (explanations provided by W3C):

  • <section> represents a generic document or application section. It can be used together with h1-h6 to indicate the document structure.
  • <article> represents an independent piece of content of a document, such as a blog entry or newspaper article.
  • <aside> represents a piece of content that is only slightly related to the rest of the page.
  • <header> represents the header of a section.
  • <footer> represents a footer for a section and can contain information about the author, copyright information, et cetera.
  • <nav> represents a section of the document intended for navigation.
  • <dialog> can be used to mark up a conversation in conjunction with the <dt> and <dd> elements.
  • <figure> can be used to associate a caption together with some embedded content, such as a graphic or video.
  • <details> represents additional information or controls which the user can obtain on demand.

Most of these, except the last two, behave like the <div> element, which means their primary use is to identify a block of content that belongs together. Unlike <div> special semantics are associated with each of these elements. Not very exciting? HTML 5 also introduces the following new elements (explanations again from the W3C document):

  • <audio> and <video> for multimedia content. Both provide an API so application authors can script their own user interface, but there is also a way to trigger a user interface provided by the user agent. Source elements are used together with these elements if there are multiple streams available of different types.
  • <embed> is used for plugin content.
  • <mark> represents a run of marked (highlighted) text.
  • <meter> represents a measurement, such as disk usage.
  • <time> represents a date and/or time.
  • <canvas> is used for rendering dynamic bitmap graphics on the fly, such as graphs, games, et cetera.

The <embed> tag supersedes the <applet> and <object> tags. It defines some sort of embedded content that doesn’t expose its internal structure to the DOM tree. The content is typically rendered by a browser plugin. The <audio> and <video> tags are perhaps more interesting, because they make it possible to include multimedia files or streams directly into the HTML document without having to specify a vendor-specific plugin for playing the content. Granted, this could previously be done with the <embed> tag, but the <embed> tag was never a W3C standard and it isn’t supported by all browsers. Obviously, W3C has decided not to follow the mainstream browser implementations and added the <audio> and <video> tags instead, while reserving the <embed> tag for the above named purpose.

Arguably the most exciting additions to HTML 5 -at least from the perspective of a web developer- are the extensions to form processing and data rendering, and the related APIs, such as the editing API or the drag-and-drop API. These additions have previously evolved as a separate standard under the term Web Forms 2.0 and are now incorporated into HTML 5. The <input> element has been enhanced to support several new data types. New elements for user interface components have been defined, similar to those that can be found in GUI applications. For example, HTML 5 finally features the long awaited combo box, a combination of text input and drop-down list, which is a standard component in GUIs for decades. A new <datagrid> element for the interactive/editable representation of data in tabular, list, or tree form, is also present. Here are the new <input> types:

  • type=”datetime”- a date and time (year, month, day, hour, minute, second, fraction of a second) with the time zone set to UTC.
  • type=”datetime-local”- a date and time (year, month, day, hour, minute, second, fraction of a second) with no time zone.
  • type=”date” – a date (year, month, day) with no time zone.
  • type=”month” – a date consisting of a year and a month with no time zone.
  • type=”week” – a date consisting of a year and a week number with no time zone.
  • type=”time”- a time (hour, minute, seconds, fractional seconds) with no time zone.
  • type=”number” – a numerical value.
  • type=”range” – a numerical value, with the extra semantic that the exact value is not important.
  • type=”email”- an e-mail address.
  • type=”url” – an internationalised resource identifier.

The input element also has several new attributes in HTML 5 that enhance its functionality (many of these also apply to other form controls such as <select>, <textarea>, etc.):

  • list=”listname” – used in conjunction with the <datalist> element to create a combobox.
  • required – indicates that the user must provide an input value.
  • autofocus – automatically focuses the control upon page load.
  • form – allows a single control to be associated with multiple forms.
  • inputmode – gives a hint to the user interface as to what kind of input is expected.
  • autocomplete – tells the browser to remember the value when the user returns to the page.
  • min – minimum value constraint.
  • max – maximum value constraint.
  • pattern – specifies pattern constraint.
  • step – specifies step constraint.

The following new elements provide additional user interface components for web applications. The last three are actually not themselves UI components, but components used for scripting the UI through a server side language:

  • <command> represents a command the user can invoke (e.g. toolbar button or icon).
  • <datalist> together with the a new list attribute for input is used to create comboboxes.
  • <output> represents some type of output, such as from a calculation done through scripting.
  • <progress> represents a completion of a task, such as downloading or when performing a series of expensive operations.
  • <menu> represents a menu. The element has three new attributes: type, label and autosubmit. They allow the element to transform into a menu as found in typical user interfaces as well as providing for context menus in conjunction with the global contextmenu attribute.
  • <datagrid> represents an interactive representation of a tree list or tabular data.
  • <ruby>, <rt> and <rb> allow for marking up Ruby annotations.
  • <eventsource> represents a target that “catches” remote server events.
  • <datatemplate>, <rule> and <nest> provide a templating mechanism for HTML.

Let’s briefly look at the new <datagrid> element. <datagrid> usually has a <table> child element, although <select> and <datalist> are also possible to create a tree control. The columns in the datagrid can have clickable captions for sorting. Columns, rows, and cells can each have specific flags, known as classes, which affect the functionality of the datagrid element. Rows are selectable and single cells (or all cells) can be made editable. A cell can contain a checkbox or values that can be cycled. Rows can also be separator rows. Datagrids have a DOM API for updating, inserting, and deleting rows or columns. They also have a data provider API that controls grid data content and editing.

I hope you found this brief overview useful. Please note that the features mentioned here don’t cover everything that is new in HTML 5, but hopefully they catch the essence. The HTML 5 specification is a work in progress; it is still changing and evolving. You can find the latest editor’s draft at http://www.w3.org/html/wg/html5/. An overview of the changes from HTML 4 is available at http://www.w3.org/TR/html5-diff/.

Web Standards: OpenID

OpenID appears to be red hot right now. The adoption of this emerging standard has accelerated in the first half of 2008 as it has entered the radar screen of web developers. Many large organisations, such as Google, Yahoo, IBM, Microsoft and AOL provide OpenID servers. Popular Internet sites, such as LiveJournal, Blogger, Jabber, Drupal and Wikitravel support OpenID logins, and the list is growing. Browser support for OpenID is just around the corner (it’s a feature in Firefox 3 for example). But we are getting ahead of ourselves. What is OpenID and why is it good? Put simply, OpenID solves two common problems; that of having to manage multiple accounts on different websites and that of storing sensitive account information on websites you don’t control. With a single OpenID account you can log into hundreds of different websites. Best of it, you -the user- manage the account information, not the website owner. In more technical terms, OpenID is an open, decentralised, user-centric digital identity framework. I’ll explain this in some more detail.

openid.pngOpenID is an open standard, because nobody owns it and because it’s free of patents and commercial licensing. The standard is maintained by the OpenID foundation; free open source implementations are available in many languages, including Java and PHP. It is decentralised, because it does not depend on a specific domain server. An existing OpenID provider can be rerouted very easily, as we shall see. It is user-centric, because it allows users to manage and control their identity information. Users can identify themselves with a URL they own. While traditional authentication relies on a combination of either a name or an email address and a password, OpenID just requires one item which is either a URL or an XRI (extensible resource identifier). To understand how this works, let’s look at the OpenID protocol and see what an OpenID login procedure actually does.

Let’s assume you already have an OpenID. You can use the same OpenID with any OpenID-enabled website (called the “relying party”) by typing it into the OpenID login field or by letting your browser fill out the field automatically. When you click Submit, the relying party performs a “discovery” procedure to retrieve an authentication URL and subsequently performs an “association” procedure for secure information interchange with the OpenID provider. You are then transported to the authentication URL (called the “OpenID provider”). Normally this is a site like yahoo.com or myopenid.com, but nothing keeps you from running your own OpenID server. After authenticating at the OpenID provider’s secure login page, you are redirected back to the relying party. If the relying party has requested identity information (name, gender, birth of date, etc.), you are prompted which information should be sent to the relying party. Often this information is used to fill in a registration form at the relying party. This information isn’t retrieved for a normal login, but the OpenID protocol supports it. Once you are back at the relying party’s website, the relying party checks whether the authentication was approved and verifies that the information is received correctly from the OpenID provider.

It sounds slightly complicated and by looking at the OpenID specifications you will find that the protocol is indeed quite involved. However, from the users point of view, it is really simple. The user only sees the OpenID login screen. If the user has enabled automatic login at the OpenID provider via a certificate or cookie, the only screen the user sees is the “approve/deny” screen. Logging into a website could not be easier. Only one password needs to be remembered. Registration forms can be pre-filled. Login into specific sites can be fully automated. The best thing is that the user has full control over the OpenID provider thanks to the discovery process. During discovery, the relying party looks for two fields in the header of the web page that it finds at the OpenID URL. In HTML Discovery, there are two fields named openid.server and openid2.provider.Example:

<link rel="openid.server" href="http://www.myopenid.com/server" />

 <link rel="openid2.provider" href="http://www.myopenid.com/server" />

These two entries commonly point to the same end point (the OpenID provider) and are used by version 1 and version 2 of the OpenID protocol. If you have a website, you could simply edit the HTML of your site to add these entries into the HTML header. You could then use the URL of that page as your OpenID. The advantage of using your own web page is that you control the OpenID end point. Hence, you can switch OpenID providers while retaining your OpenID simply by editing your site’s HTML code.

If you are going to incorporate OpenID into your existing website, you might want to think twice about implementing the protocol yourself. It isn’t trivial, and there are already several open source libraries that can be used, e.g. Openid4java if you program in Java, or the JanRain PHP OpenID library which works with PHP 4.3 up. Additional libraries for these two languages, as well as Ruby, Python, C#, C++, and other languages can be found at http://wiki.openid.net/Libraries.