NoSQL Databases

NoSQL DatabasesNoSQL databases have entered the radar of web application developers lately. While relational database management systems (RDBMS) have been powering almost every web application on the Internet for more than a decade, this is beginning to change. No longer is the selection of persistence technology a no-brainer. You have additional choices. Besides the old friend RDBMS, there are object-oriented databases, graph-oriented databases, key-value stores, column-oriented databases, and other options. Many of the newer products in this area are known as NoSQL databases. NoSQL is a movement that promotes persistence technologies that break with the conventional relational model. NoSQL databases typically don’t have tables schemas, SQL support, and are designed to scale horizontally.

For those of you old enough to remember Dbase, the NoSQL moniker may not be much of an attention grabber, because after all, products like Dbase, FoxPro, Clipper and similar DB systems never had SQL support either. With these systems, relations had to be expressed implicitly in the application and “queries” had to be coded as retrieval sequences. By contrast, modern NoSQL systems depart from the relational model and in many cases also from the tabular data structure, in order to serve use cases where traditional RDBMS fail in one or another way. A typical example would be a sparsely populated table that contains very few data in rows and columns. Such a table -if it grows to a large size- presents an efficiency problem to most RDBMS with resulting performance loss. In the remainder of this article, we will look at a few selected NoSQL databases and see which use cases they cater to.

CouchDB

Apache CouchDB is a document-oriented database that represents documents as JSON objects. CouchDB supports all data types supported by JSON, or respectively by Javascript. The JSON objects are not required to comply with schemas and can therefore be defined freely, which means that each JSON object can have a different structure. CouchDB supports queries by views. Views are aggregate functions and filters programmed in Javascript that follow the MapReduce algorithm. Views are stored and indexed in the database. CouchDB provides a RESTful API where every object (and any other item) in the database can be retrieved by an URL. It uses the HTTP POST, GET, PUT, and DELETE methods for CRUD operations. Other features include ACID semantics on basis of multi-version concurrency control, similar to RDBMS, which is optimised for a high number of concurrent reads, and a distributed architecture that allows for easy bidirectional replication and offline usage. CouchDB is thus designed from ground up for Internet use.

Neo4J

Neo4J is a graph database. As the name suggests, it is intended for use with the Java platform, which includes any language that runs on the JVM. Neo4J stores information in nodes and edges; the latter are called relationships in case of Neo4J. Relationships are always of a defined type. Both nodes and relationships can store properties, i.e. data. The Neo4J database is thus optimised for representing complex graph and network structures, such as a hierarchical object repository or a social network. It offers high-performance graph traversal operations for data access. Nodes can also be indexed and retrieved by key which enables more conventional style queries. Additional features include ACID transactions and transaction recovery, based on the Java Transaction API (JTA). Optional libraries can expose a Neo4J database as an RDF store where the node space can be queried using SPARQL. Neo4J is an embedded database with a small footprint that runs in the same JVM as the application.

Redis

Redis is a modern implementation of a persistent key-value store for general purpose use. Key-value store is a name for a simple key-based access mechanism that basically implements a dictionary (or map) data structure. Traditionally, such systems were used for caching and Redis holds its entire database in memory, which makes it ideal for applications that require ultra-fast data access. Redis allows not just plain string data but also allows sets and lists of strings in the data space. The system offers a number of special commands, such as atomic push/pop and add/remove operations for lists and set operations such as building union, intersection, and difference. Redis persists data either by asynchronously writing memory to disk, or by appending to a journalling file as data is written by clients. Additional features include easy master-slave replication and rudimentary sharding. Redis offers support for various languages, such as C/C++, Java, Scala, PHP and others through native drivers and APIs.

HBase

HBase is a free implementation of Google’s BigTable written in Java. It is not the type of database you would use for a blog or a forum software. HBase is a tabular data storage designed for massive tables in the Petabyte range with billions of rows distributed over a number of physical machines and thus optimised for horizontal scaling. HBase is part of the Apache Hadoop project, a framework for data-intensive distributed applications, inspired by Google’s MapReduce and GFS technologies. Hadoop supports the database through its distributed filesystem HDFS which provides built-in replication and MapReduce traversal for HBase tables of arbitrary size. Features include optimised query push down via server-side scan and get filters, a high performance Thrift gateway, an XLM-based RESTful Webservice gateway, Hadoop cascading, per-column probabilistic Bloom filters, as well as data warehousing and data analysis modules. Since HBase saves column families rather than columns and since empty columns are not stored, it is ideal for sparse tables with semi-structured data. Typical use cases are cloud computing and applications that require massive storage using cheap commodity hardware.

Db4o

Db4o is an open-source object-oriented database system targeted at OOP developers. The idea behind Db4o is to enable programmers to create and persist a representation of the application object model directly in the database without the need for an object-relational mapping software layer. Object instances can then be stored and retrieved with a single line of code. Db4o provides a query mechanism called Native Query (NQ). This allows querying data with native OOP language constructs thus offering type safety for query expressions while eliminating the need for building query strings. Db4o is available for the Java and .NET platforms. If used with .NET languages, data can alternatively be queried with LINQ (language integrated query). The Db4o database is embeddable with a small footprint suitable to be deployed on mobile devices. Additional features include semi-automatic schema versioning, transaction support with ACID semantics, and synchronisation/replication mechanisms that allow synchronisation between different Db4o instances and data export into SQL databases.

Oracle buys Sun

Generally I don’t comment on events in the business world, since this blog is about web development and software engineering. However, the acquisition of Sun by Oracle, which was officially announced yesterday, is so large-scale that it is likely to affect the engineering halls in many subtle ways. Quick facts: the deal is $ 7.4 billion worth, it was unanimously approved by Sun’s board and it will be closed this summer. Sun and Oracle published identical press statements yesterday which sing the praises of the acquisition.

I am not sure whether this is good news. While it was apparent to most observers that Sun was past its zenith, one wonders what will happen to its employees and its innovations. Granted, an acquisition by IBM would have tipped the scales even more in favour of Big Blue’s dominance in the enterprise market and that might have distorted competition. But one may doubt that Oracle will uphold Sun’s commitment to the open source community. Sun’s market was driven by innovation and open source products. Oracle’s market is clearly not.

In particular, one wonders what will happen to MySQL which was bought by Sun earlier last year and which competes with Oracle’s core products. Pessimistic observers have already called it MyToast. Will Larry Ellison allow MySQL to compete in the enterprise market? Probably not. Other items in Sun’s portfolio once considered crown jewels, such Solaris and Glassfish, might also be on the endangered list. But that is pure speculation at this moment. Whether this acquisition will turn out to be a good move for Oracle is currently debated by the industry experts. Whether it is a positive turn for open source community may be reasonably doubted. Time will show.

Sun acquires MySQL

It’s been on the news wire for two or three weeks already, but I just learned today that Sun is going to buy MySQL. My first thought was: “Oh, that’s great news.” Now MySQL can put a Sun logo on their product. That will finally allow them to enter the Fortune 500 stratosphere. Wow! MySQL really came a long way. Who would have thought so in the late nineties. My Kudos to Michael Widenius (Monty), the programmer who started this thing, and who is currently serving as CTO at MySQL AB. I hope some of the one billion dollars, which Sun is ready to pay, will go to Monty. This would prove that you can actually get rich from giving away software. It would also prove that a company’s major assets are its people and its innovation rather than bricks and mortar. The execs at both MySQL AB and Sun seem to be quite upbeat about the deal (see Jonathan Schwartz’s blog for example) and are generous with praise (who would be surprised)? I wonder what will happen to the Dolphin logo. It’s sort of cute – a bit like Sun’s Glassfish logo.

Database duel – MySQL vs. PostgreSQL

Almost all non-trivial applications need to store data of some kind. If the data has the form of records, or n-tuples, it is typically handled by a relational database management system (RDBMS). Relational databases are conceptually founded on set theory and predicate logic. Data in an RDBMS is arranged in tables whose elements can be linked to each other. Today almost all RDBMS use SQL (structured query language) to implement the relational model. RDBMS with SQL have been in use since the late 1970s. Previously an expensive corporate technology, the first open source RDBMS became available during the late 1990s. Presently PostgreSQL and MySQL are the most popular open source RDBMS.

MySql LogoPostgreSql LogoBoth database systems are widely used for web applications. Although MySQL has a much larger user base (est. 6 million installations by 2005), the growth of PostgreSQL has recently accelerated. The latter came initially out of an academic environment. PostgreSQL was developed at the Berkeley University as a successor of the proprietary INGRES database. Until 1995, it used QUEL instead of SQL. Since version 6.0, the software is maintained and advanced by a team of volunteers and released free under the BSD license. In contrast, MySQL was developed in a commercial environment by the Swedish company TCX Dataconsult, and later by MySQL AB. It started out as a rewrite of the mSQL database and began to acquire more and better features. MySQL is released under a dual licensing scheme (GPL and paid commercial license).

Since the PostgreSQL developers had a head start of almost 10 years, the PostgreSQL database had hitherto more features than MySQL, especially more advanced features, which are desirable in an “enterprise” computing environment. These include advanced database storage, data management tools, information replication, and backup tools. MySQL, on the other hand, used to have an edge over PostgreSQL in terms of speed. It offered better performance for concurrent database access. Lately, this gap is closing, however. PostgreSQL is getting faster while MySQL acquires more enterprise features. The crucial 5.0 release of MySQL in October 2005 has added stored procedures, triggers, and views.

Let’s look at the commonalities first. Both systems are fully relational, using SQL for data definition, data manipulation, and data retrieval. They run on Windows, Linux, and a number of Unices. MySQL also runs on MacOS. Both databases come with a graphical GUI and query builder, backup, repair, and optimisation tools. They offer standard connectors such as ODBC and JDBC, as well as APIs for all major programming languages. Both systems support foreign keys and data integrity, subselects, transactions, unions, views, stored procedures, and triggers. Among the high-end features that both RDBMS offer are ACID-compliant transaction processing, multiple isolation levels, procedural languages, schemas (metadata), hot backups, data loading, replication (as an add-on in PostgreSQL), table spaces for disk storage layout, terabyte scalability, and SSL. MySQL and PostgreSQL also both support storage of geographic information (GIS). PostgreSQL additionally has network-aware data types that recognize Ipv4 and Ipv6 data types.

Now, let’s look at the differences. PostgreSQL is an object-relational database which means that it has object-oriented features, such as user-definable database objects and inheritance. Users can define data types, indexes, operators (which can be overloaded), aggregates, domains, casts, and conversions. PostgreSQL supports array data types. Inheritance in PostgreSQL allows to inherit table characteristics from a parent table. PostgreSQL also has very advanced programming features. In addition to its native procedural language, PL/pgSQL (which resembles Oracle’s PL/SQL), PostgreSQL procedures can be written in scripting languages, such as Perl, PHP. Python, etc., or compiled languages, such as C++ and Java. In contrast, MySQL (since version 5.0) only supports a native scripting language that follows the ANSI 2003 standard.

PostgreSQL/MySQL Comparison Chart

MySql PostgreSql Comparison Chart

The most evident advantage that MySQL offers –in terms of features– are its so-called pluggable storage engines. One may choose from a number of different data storage models, which allows the database administrator to optimise databases for the intended application. For example, a web application that makes heavy use of concurrent reads with few write operations may use the MyISAM storage engine to achieve top performance, while an online booking system may use the InnoDB storage engine for ACID-compliant transactions. Another interesting characteristic of MySQL not found in PostgreSQL is its support for distributed databases, which goes beyond mere database replication. Functionality for distributed data storage is offered through the NDB and FEDERATED storage engines, supporting clustered and remote databases respectively.

There are further differences, of course. MySQL is generally faster than PostgreSQL. It maintains a single process to accept new connections, instead of spawning a new process for each connection like PostgreSQL. This is a great advantage for web applications that connect on each page view. In addition, the MyISAM storage engine provides tremendous performance for both simple and complex SELECT statements. Stability is another advantage of MySQL. Due to its larger user base, MySQL has been tested more intensively, and it has historically been more stable than PostgreSQL.

PostgreSQL has a slight advantage over MySQL/InnoDB for concurrent transactions, because it makes use of Multiversioning Concurrency Control (MVCC), a mechanism found only in enterprise-grade commercial RDBMS. Another advantage of PostgreSQL is its relatively strict compliance with the ANSI 92/99 SQL standards, especially in view of data types. The ANSI SQL implementation of MySQL is more incomplete by comparison. However, MySQL has a special ANSI mode that disregards proprietary extensions.

In view of backup/restore capabilities, MySQL provides somewhat less convenience than PostgreSQL and commercial enterprise RDBMS. Nevertheless, hot backup and restore operations can be performed with both systems. Both PostgreSQL and MySQL/InnoDB allow transactional tables to be backed up simply by using a single transaction that copies all relevant tables. The disadvantage of this method is that it uses a lot of resources, which might compromise system performance.

With MySQL, a better solution is to use the replication mechanism for a continuous backup. PostgreSQL allows recovery from disk failure through point-in-time recovery (PiTR). This method combines file system level backups with a write ahead log, that records all changes to the database. Thus it is possible to recreate snapshots of the database of any point in time. In most cases, a crashed databases can be recovered up to the last transaction before the crash. The PiTR is also convenient for large databases, since it preserves resources.

MySQL Strengths

  • Excellent code stability
  • Excellent performance, fast CONNECT and SELECT
  • Multiple storage engines to choose from
  • Larger user base (thus larger number of applications and libraries)
  • Support for distributed databases
  • Many high-quality GUI tools available
  • Commercial support widely offered

PostgreSQL Strengths

  • Object-oriented features
  • Advanced programming concepts
  • Supports multiple programming languages
  • High ANSI SQL conformance
  • Mature high-end features
  • Robust online backups
  • Very liberal BSD license

In summary, PostgreSQL and MySQL are both mature products with many enterprise level features. They are both catching on with the best commercial RDBMS and are presently making inroads into the high-end market. The philosophy of both RDBMS differs in several ways. Roughly speaking, MySQL is targeted at developers who expect a workhorse database with proven performance, while PostgreSQL is suitable for developers who expect advanced features and programming concepts. MySQL offers more deployment options, whereas PostgreSQL offers more flexibility for developers.

The triumph of the lamp

It is human to feel satisfaction when one’s predictions come true. To predict the success of LAMP in 1998 wasn’t that difficult, but neither was it a no-brainer.At the time, the acronym had just been coined by the German c’t magazine and it wasn’t widely known in the corporate world. LAMP stands for Linux-Apache-MySQL-PHP, a set of open-source software that powers web servers with dynamic content. Occasionally, the ‘P’ in LAMP is switched for Perl or Python, although PHP is now by far the most popular scripting language.

carbidelamp.jpgI remember having suggested a LAMP architecture for the implementation of a geo information system and extranet for a large government organisation in 1998. This project was on a tight budget, so I proposed to invest the resources into software development rather than into hardware and licenses. LAMP seemed ideal for it. However, the committee was utterly surprised that the word “Microsoft” did not appear in the proposal and they did not seem to put too much trust into any of the letters of L-A-M-P.

Luckily, another of my then customers was more open to the suggestion. A mid-sized logistics company was looking for new way to do business on the web. Since the company wanted to run their own servers, the LAMP stack offered a perfect solution to do this cost efficiently. It turned out to be foresighted decision. LAMP quickly gathered momentum on the Internet and soon became one of the mainstream web development platforms.

One has to keep in mind that the four pieces of software that make up LAMP have not been designed as a unified platform. On the contrary, they have different histories, and they were not specifically developed to work together. This is what distinguishes them from their competitor platforms ASP/.NET and Java/J2EE.

Let’s briefly go back to the year 1998. It was the time of the browser war and the dot.com boom. The Internet then consisted of about 5 million websites, which is less than one tenth of today (2005). Linux was at kernel version 2.8.x, Apache was at 1.3.0 and the Apache Software Foundation was not yet founded; MySQL was at version 3.21, and Andi Gutmans and Zeev Suraski just released the crucial PHP3 release.

Linux and Apache were already strong at the time. Linux was fairly mature OS with a userbase of 7.5 million in 1998. Torvalds had just trademarked the Linux name and the corporate world started to take notice. The Apache web server -originally developed as an extension of NCSA httpd server- already commanded 50% market share. MySQL and PHP, on the other hand, were new kids on the block. PHP had a userbase of only several ten thousand users and MySQL was widely considered a toy database.

The combination of these products, however, did one thing extremely well: powering dynamic websites. Plus they were free. Anyone who wanted to run a web server could use them without paying a single cent for license fees. Although MySQL and PHP had several limitations at the time, it did not really matter for the purpose of serving web pages. MySQL delivered impressive performance and PHP 3 was “cleaner” and easier to program than Perl. Thus the LAMP stack offered a killer platform for web applications.

Today, the quartet is more successful than ever. Apache powers 70% of all web servers. MySQL is the most widely used database on the Internet. It has matured into a full-featured RDBMS with transactions, replication, clustering, and (as of 5.0) stored procedures, views and triggers. PHP is now installed on 20% of all Internet sites. It fully supports the OOP paradigm and it has grown an extremely large function library that makes programmers feel like boys in a candy store.

What is behind the success of LAMP? In the case of Linux and Apache this is fairly easy to tell. They are both free and they offer excellent performance. You can easily pack 50 to 100 low traffic web sites onto a commodity PC. This makes Linux + Apache very popular with hosting companies who take advantage of the low cost of ownership. Furthermore, it allows service providers to customize the server’s configuration and administration model and apply it to an arbitrary number of cheap boxes which translates to arbitrary scalability. Hosting companies love it.

The case is slightly different with MySQL and PHP, because their growth is driven mainly by developers rather than by hosting companies. From the start, MySQL was geared towards web applications, which means its main strength is concurrent reads. Besides, it is easy to use and administer. PHP has become the scripting language of choice because its learning curve is almost zero, which means that programmers can be productive with PHP from day one. In addition, PHP is integrated very well with Apache from a very early time. It uses resources efficiently and avoids the CGI model and its known security problems.

All of the LAMP components originated around 1995 or before, which means that all of them recently passed their 10th birthday. LAMP can now be considered a mature architecture. The case of LAMP proves that using an open source platform is not like buying a pig in a poke. It proves that open source gets the job done, and -in the case of web servers- that it does the job better than anything else.