Jul 20

NoSQL Databases

NoSQL databases have recently appeared on the radar of web application developers. While relational database management systems (RDBMS) have powered almost every web application on the Internet for more than a decade, this is beginning to change. The selection of persistence technology is no longer a no-brainer; there are additional choices. Besides the old friend RDBMS, there are object-oriented databases, graph-oriented databases, key-value stores, column-oriented databases, and other options. Many of the newer products in this area are known as NoSQL databases. NoSQL is a movement that promotes persistence technologies that break with the conventional relational model. NoSQL databases typically have neither table schemas nor SQL support, and they are designed to scale horizontally.

For those of you old enough to remember dBase, the NoSQL moniker may not be much of an attention grabber. After all, products like dBase, FoxPro, Clipper and similar DB systems never had SQL support either. With these systems, relations had to be expressed implicitly in the application and “queries” had to be coded as retrieval sequences. By contrast, modern NoSQL systems depart from the relational model, and in many cases also from the tabular data structure, in order to serve use cases where traditional RDBMS fall short in one way or another. A typical example would be a sparsely populated table, in which most cells are empty. If such a table grows to a large size, most RDBMS handle it inefficiently and performance suffers. In the remainder of this article, we will look at a few selected NoSQL databases and see which use cases they cater to.

CouchDB

Apache CouchDB is a document-oriented database that represents documents as JSON objects. CouchDB supports all data types that JSON supports. The JSON objects are not required to comply with a schema and can therefore be defined freely, which means that each JSON object can have a different structure. CouchDB supports queries through views. Views are filters and aggregate functions written in JavaScript that follow the MapReduce model. Views are stored and indexed in the database. CouchDB provides a RESTful API where every object (and any other item) in the database can be retrieved by a URL. It uses the HTTP POST, GET, PUT, and DELETE methods for CRUD operations. Other features include ACID semantics on the basis of multi-version concurrency control, similar to RDBMS, which is optimised for a high number of concurrent reads, and a distributed architecture that allows for easy bidirectional replication and offline usage. CouchDB is thus designed from the ground up for Internet use.
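To make the view idea more concrete, here is a minimal sketch in plain Java that mimics how a map function emits key/value pairs from schema-free documents and how a reduce function aggregates them. It is only an in-memory analogy: real CouchDB views are written in JavaScript and stored in the database, and the class name, document fields, and helper methods below are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory analogy of a MapReduce view; this is NOT the CouchDB API.
class ViewSketch {

    // Map step: emit (author, 1) for every document that has an "author" field.
    // Documents without that field are skipped, since no schema is enforced.
    static List<Map.Entry<String, Integer>> map(List<Map<String, Object>> docs) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (Map<String, Object> doc : docs) {
            Object author = doc.get("author");
            if (author instanceof String) {
                emitted.add(Map.entry((String) author, 1));
            }
        }
        return emitted;
    }

    // Reduce step: sum the emitted values per key, keeping the keys sorted.
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> emitted) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : emitted) {
            totals.merge(entry.getKey(), entry.getValue(), Integer::sum);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical documents; each one has a different structure.
        List<Map<String, Object>> docs = List.of(
            Map.of("type", "post", "author", "alice", "title", "Hello"),
            Map.of("type", "post", "author", "bob"),
            Map.of("type", "comment", "text", "no author here"),
            Map.of("type", "post", "author", "alice", "tags", List.of("nosql")));
        System.out.println(reduce(map(docs))); // prints {alice=2, bob=1}
    }
}
```

The point of the sketch is that the query logic lives in two small functions rather than in a query string, and that documents lacking the relevant field simply do not contribute to the result.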

Neo4J

Neo4J is a graph database. As the name suggests, it is intended for use with the Java platform, which includes any language that runs on the JVM. Neo4J stores information in nodes and edges; the latter are called relationships in Neo4J. Relationships are always of a defined type. Both nodes and relationships can store properties, i.e. data. The Neo4J database is thus optimised for representing complex graph and network structures, such as a hierarchical object repository or a social network. It offers high-performance graph traversal operations for data access. Nodes can also be indexed and retrieved by key, which enables more conventional queries. Additional features include ACID transactions and transaction recovery, based on the Java Transaction API (JTA). Optional libraries can expose a Neo4J database as an RDF store where the node space can be queried using SPARQL. Neo4J is an embedded database with a small footprint that runs in the same JVM as the application.
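The data model described above (nodes, typed relationships, properties on both, and traversal as the main access path) can be sketched in a few lines of plain Java. This is a toy model for illustration only and does not use the Neo4J API; all class names, method names, and relationship types are made up.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy property graph in plain Java; this is NOT the Neo4J API.
class GraphSketch {

    static class Node {
        final Map<String, Object> properties = new HashMap<>();
        final List<Relationship> outgoing = new ArrayList<>();

        Node(String name) {
            properties.put("name", name);
        }

        // Create a typed, directed relationship from this node to another one.
        Relationship relateTo(Node target, String type) {
            Relationship relationship = new Relationship(type, target);
            outgoing.add(relationship);
            return relationship;
        }
    }

    static class Relationship {
        final String type;
        final Node end;
        final Map<String, Object> properties = new HashMap<>();

        Relationship(String type, Node end) {
            this.type = type;
            this.end = end;
        }
    }

    // Breadth-first traversal that follows only relationships of the given type
    // and returns the names of all reachable nodes, excluding the start node.
    static Set<String> reachable(Node start, String type) {
        Set<Node> visited = new LinkedHashSet<>();
        Deque<Node> queue = new ArrayDeque<>();
        visited.add(start);
        queue.add(start);
        while (!queue.isEmpty()) {
            Node current = queue.remove();
            for (Relationship relationship : current.outgoing) {
                if (relationship.type.equals(type) && visited.add(relationship.end)) {
                    queue.add(relationship.end);
                }
            }
        }
        Set<String> names = new LinkedHashSet<>();
        for (Node node : visited) {
            if (node != start) {
                names.add((String) node.properties.get("name"));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        Node alice = new Node("Alice");
        Node bob = new Node("Bob");
        Node carol = new Node("Carol");
        alice.relateTo(bob, "KNOWS").properties.put("since", 2007);
        bob.relateTo(carol, "KNOWS");
        alice.relateTo(carol, "BLOCKS");
        System.out.println(reachable(alice, "KNOWS")); // prints [Bob, Carol]
    }
}
```

Answering "whom does Alice know, directly or indirectly?" is a simple walk along the relationships here, whereas a relational schema would typically need one self-join per hop.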

Redis

Redis is a modern implementation of a persistent key-value store for general-purpose use. A key-value store is a simple key-based access mechanism that essentially implements a dictionary (or map) data structure. Traditionally, such systems were used for caching. Redis holds its entire database in memory, which makes it well suited to applications that require very fast data access. In addition to plain strings, Redis supports sets and lists of strings as values. The system offers a number of special commands, such as atomic push/pop and add/remove operations for lists, and set operations such as union, intersection, and difference. Redis persists data either by asynchronously writing memory to disk, or by appending to a journal file as data is written by clients. Additional features include easy master-slave replication and rudimentary sharding. Redis can be used from various languages, such as C/C++, Java, Scala, PHP and others, through native drivers and APIs.
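The list and set operations mentioned above can be imitated with standard Java collections, which may help to picture what the server does with its values. The sketch below does not talk to Redis and does not use any Redis client library; the class and method names are hypothetical, and the comments merely point to the Redis commands with comparable semantics.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// In-memory imitation of a few key-value operations; this does NOT talk to Redis.
class KeyValueSketch {
    private final Map<String, Deque<String>> lists = new HashMap<>();
    private final Map<String, Set<String>> sets = new HashMap<>();

    // Push a value onto the head of the list stored at key (compare LPUSH).
    synchronized void leftPush(String key, String value) {
        lists.computeIfAbsent(key, k -> new ArrayDeque<>()).addFirst(value);
    }

    // Pop a value from the tail of the list, or return null if there is none (compare RPOP).
    synchronized String rightPop(String key) {
        Deque<String> list = lists.get(key);
        return list == null ? null : list.pollLast();
    }

    // Add a member to the set stored at key (compare SADD).
    synchronized void addToSet(String key, String member) {
        sets.computeIfAbsent(key, k -> new TreeSet<>()).add(member);
    }

    // Members that are in either set (compare SUNION).
    synchronized Set<String> union(String firstKey, String secondKey) {
        Set<String> result = new TreeSet<>(members(firstKey));
        result.addAll(members(secondKey));
        return result;
    }

    // Members that are in both sets (compare SINTER).
    synchronized Set<String> intersection(String firstKey, String secondKey) {
        Set<String> result = new TreeSet<>(members(firstKey));
        result.retainAll(members(secondKey));
        return result;
    }

    // Members of the first set that are not in the second (compare SDIFF).
    synchronized Set<String> difference(String firstKey, String secondKey) {
        Set<String> result = new TreeSet<>(members(firstKey));
        result.removeAll(members(secondKey));
        return result;
    }

    private Set<String> members(String key) {
        return sets.getOrDefault(key, Set.of());
    }

    public static void main(String[] args) {
        KeyValueSketch store = new KeyValueSketch();
        store.leftPush("jobs", "first");
        store.leftPush("jobs", "second");
        System.out.println(store.rightPop("jobs")); // prints first

        store.addToSet("backend", "java");
        store.addToSet("backend", "php");
        store.addToSet("web", "php");
        store.addToSet("web", "javascript");
        System.out.println(store.union("backend", "web"));        // prints [java, javascript, php]
        System.out.println(store.intersection("backend", "web")); // prints [php]
        System.out.println(store.difference("backend", "web"));   // prints [java]
    }
}
```

Pushing on one end and popping from the other turns a list into a simple work queue, which is one common use of such atomic list operations.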

HBase

HBase is a free implementation of Google’s BigTable written in Java. It is not the type of database you would use for blog or forum software. HBase is a tabular data store designed for massive tables in the petabyte range, with billions of rows distributed over a number of physical machines, and is thus optimised for horizontal scaling. HBase is part of the Apache Hadoop project, a framework for data-intensive distributed applications, inspired by Google’s MapReduce and GFS technologies. Hadoop supports the database through its distributed filesystem HDFS, which provides built-in replication and MapReduce traversal for HBase tables of arbitrary size. Features include optimised query push down via server-side scan and get filters, a high performance Thrift gateway, an XML-based RESTful web service gateway, Hadoop cascading, per-column probabilistic Bloom filters, as well as data warehousing and data analysis modules. Since HBase stores column families rather than individual columns, and since empty columns are not stored, it is ideal for sparse tables with semi-structured data. Typical use cases are cloud computing and applications that require massive storage using cheap commodity hardware.
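Why sparse tables are cheap in such a design can be illustrated with a toy model in which a table is a sorted map of row keys, each row holds column families, and each family holds only the cells that were actually written. The sketch below is plain Java under that assumption; it is not the HBase client API, and the class name, method names, and row keys are invented for the example.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy model of a sparse, column-family-oriented table; this is NOT the HBase client API.
class SparseTableSketch {
    // row key -> column family -> column qualifier -> value, all kept sorted by key
    private final TreeMap<String, TreeMap<String, TreeMap<String, String>>> rows = new TreeMap<>();

    // Store a single cell; nothing is allocated for columns that are never written.
    void put(String rowKey, String family, String qualifier, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(family, k -> new TreeMap<>())
            .put(qualifier, value);
    }

    // Read a single cell, or return null if that cell was never written.
    String get(String rowKey, String family, String qualifier) {
        Map<String, TreeMap<String, String>> families = rows.get(rowKey);
        if (families == null) {
            return null;
        }
        Map<String, String> columns = families.get(family);
        return columns == null ? null : columns.get(qualifier);
    }

    // Count the cells that are actually stored across all rows.
    int storedCells() {
        int count = 0;
        for (TreeMap<String, TreeMap<String, String>> families : rows.values()) {
            for (TreeMap<String, String> columns : families.values()) {
                count += columns.size();
            }
        }
        return count;
    }

    // Return the row keys in [startRow, stopRow), relying on the sorted row order.
    Set<String> scanRowKeys(String startRow, String stopRow) {
        return rows.subMap(startRow, stopRow).keySet();
    }

    public static void main(String[] args) {
        SparseTableSketch table = new SparseTableSketch();
        table.put("user#1001", "info", "name", "Alice");
        table.put("user#1001", "info", "email", "alice@example.org");
        table.put("user#1002", "info", "name", "Bob");
        table.put("user#1002", "stats", "logins", "42");

        System.out.println(table.get("user#1002", "info", "email"));      // prints null
        System.out.println(table.storedCells());                          // prints 4
        System.out.println(table.scanRowKeys("user#1000", "user#1002"));  // prints [user#1001]
    }
}
```

Two rows with three possible columns each would occupy six cells in a fixed-width table; here only the four written cells exist, and a missing cell costs nothing.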

Db4o

Db4o is an open-source object-oriented database system targeted at OOP developers. The idea behind Db4o is to enable programmers to create and persist a representation of the application object model directly in the database, without the need for an object-relational mapping layer. Object instances can then be stored and retrieved with a single line of code. Db4o provides a query mechanism called Native Query (NQ). This allows querying data with native OOP language constructs, which offers type safety for query expressions and eliminates the need to build query strings. Db4o is available for the Java and .NET platforms. If used with .NET languages, data can alternatively be queried with LINQ (language integrated query). The Db4o database is embeddable and has a footprint small enough for deployment on mobile devices. Additional features include semi-automatic schema versioning, transaction support with ACID semantics, and synchronisation/replication mechanisms that allow synchronisation between different Db4o instances and data export into SQL databases.
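The appeal of querying with native language constructs is easiest to see in code. The following sketch stores arbitrary objects in an in-memory list and queries them with a typed predicate, so the compiler checks field names and types in the query expression. It only imitates the idea; it is not the Db4o API, and ObjectStore, Customer, and their members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Imitation of type-safe object queries over an in-memory store; this is NOT the Db4o API.
class NativeQuerySketch {

    // Hypothetical domain class, stored as-is without any mapping layer.
    static class Customer {
        final String name;
        final String city;
        final int numOrders;

        Customer(String name, String city, int numOrders) {
            this.name = name;
            this.city = city;
            this.numOrders = numOrders;
        }
    }

    static class ObjectStore {
        private final List<Object> objects = new ArrayList<>();

        // Storing an object takes a single call.
        void store(Object object) {
            objects.add(object);
        }

        // Return all stored objects of the given type that satisfy the condition.
        <T> List<T> query(Class<T> type, Predicate<T> condition) {
            List<T> result = new ArrayList<>();
            for (Object candidate : objects) {
                if (type.isInstance(candidate)) {
                    T typed = type.cast(candidate);
                    if (condition.test(typed)) {
                        result.add(typed);
                    }
                }
            }
            return result;
        }
    }

    public static void main(String[] args) {
        ObjectStore store = new ObjectStore();
        store.store(new Customer("Ada", "Berlin", 5));
        store.store(new Customer("Ben", "Berlin", 1));
        store.store(new Customer("Cleo", "Vienna", 7));
        store.store("an unrelated object");

        // The query is ordinary Java; a misspelt field name would not compile.
        List<Customer> regulars =
            store.query(Customer.class, c -> c.city.equals("Berlin") && c.numOrders > 2);
        for (Customer customer : regulars) {
            System.out.println(customer.name); // prints Ada
        }
    }
}
```

Compared with a query string, a refactoring that renames numOrders would update the query automatically, which is the type-safety benefit the paragraph above refers to.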

Jun 14

Today I am going to talk about best practices in identifier naming. Before thinking about concrete strategies for devising identifier names, one should answer the following three questions: (1) which (human) language to use, (2) whether or not to use a naming scheme, and (3) how platform and language choices affect identifier naming. While I generally don’t like to use naming schemes for the reasons mentioned in the last article, it’s almost always a good idea to adopt existing conventions for the specific platform, language and domain one is working in. Certain languages have established strong conventions about how identifiers are formed. For example, nothing keeps you from assigning lower-case names to classes in Java, but it surely raises eyebrows among Java programmers. Likewise, Java enthusiasts would probably look down their noses at code that uses underscores instead of lower CamelCase. Other programming languages, like PHP or Perl, have less strict conventions for identifier names. Whatever the case may be, it’s invariably useful to honour widely used conventions and apply them consistently, as this avoids misunderstandings and thus reduces the chance of errors.

Apart from consistency and conventions, the most important point to get right is semantic precision. The identifier name should be stated as clearly and intelligibly as possible, without being overly wordy. Ambiguity is the greatest enemy here, especially when there are several identifiers floating around that have similar, yet subtly different meanings. Consider the following example:

for (Event event : events) {
  trainees += event.trainees.size();
  if (event.status == 1 && event.end >= time) {
    for (Person train : event.trainees) {
      if (train.passed) {
        trainee.add(train);
        trained++;
      } else {
        training.add(train);
      }
    }
  } else if (event.status == 0 || event.end < time) {
    training.addAll(event.trainees);
  }
}

This code looks convoluted because of the overuse of the word “train”. What it does is this: it iterates over all workshop events and creates a collection of trainees that have successfully completed a workshop and another of trainees that did not complete a workshop or are still being trained. It also counts the total number of trainees and the number of successful ones. Apart from the ambiguous “train…”, readability suffers because of non-descriptive variable names and the use of constant integer literals. Here is a more readable version:

for (Event workshop : allWorkshops) {
  numOfTrainees += workshop.trainees.size();
  if (workshop.status == COMPLETED && workshop.end >= today) {
    // Completed workshop: separate graduates from those who did not pass.
    for (Person trainee : workshop.trainees) {
      if (trainee.passed) {
        graduates.add(trainee);
        numOfGraduates++;
      } else {
        peopleInTraining.add(trainee);
      }
    }
  } else if (workshop.status == CANCELLED || workshop.end < today) {
    // Nobody completed this workshop, so everyone is still in training.
    peopleInTraining.addAll(workshop.trainees);
  }
}

This brief example shows how to transform an ambiguous piece of code into one that can be read and understood quite easily. The question is: what are the underlying principles? Unfortunately, the answer to this question is complicated. Rather than attempting an analysis, I suggest looking at existing best practices that are widely adopted. Steve McConnell describes such a set of practices in his acclaimed book Code Complete (2nd Edition) in Chapter 11, The Power of Variable Names. He writes: “An effective technique for coming up with a good name is to state in words what the variable represents. Often that statement itself is the best variable name. It’s easy to read because it doesn’t contain cryptic abbreviations, and it’s unambiguous.” In the remainder of this article, I will try to present a summary of Steve McConnell’s set of best practices.

Describe the “what”, not the “how”. A good identifier name relates to the problem domain rather than to the solution approach or algorithm. For example, the name tmpStateArray might hold an array of communication states of different chat users in a chat application. The name is clearly computerish and algorithm-oriented; a better name would probably be chatStates or communicationStates. Likewise, names such as bitFlag, statusFlag, optionString, etc. should be avoided unless it is totally clear what they mean.

Aim for optimum identifier length. Identifier length is important. It is almost always a compromise between conveying sufficient meaning and maintaining readability; in other words, between precision and brevity. Consider the number of graduated trainees in the sales department. The name numberOfGraduatedTraineesInSalesDepartment is the least ambiguous, but rather unwieldy. The names trainees, salesTrainees, and gradSalTrn, on the other hand, are more convenient but too short and too ambiguous. A reasonable compromise would be numGradTraineesInSales.

Qualify computed values. If a variable is used for a computed numeric value, such as a total or an average, qualify the variable name accordingly. Examples are: num…, count…, sum…, average…, max…, min… and so on.

Use common opposite names for variables that express the boundaries of an interval, or an operation that involves opposites, such as begin/end, first/last, min/max, old/new, lowest/highest, next/previous, source/target, sender/recipient and so on.
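As a small illustration of qualified names for computed values and of paired opposites for interval bounds, consider the following sketch. The class ScoreSummary and its fields are hypothetical; the only point is how the qualifiers num, sum, min, max and average, and the pair first/last, make each value’s role obvious.

```java
// Hypothetical example class; the names illustrate the qualifiers, nothing more.
class ScoreSummary {
    final int numScores;
    final int sumScores;
    final int minScore;
    final int maxScore;
    final double averageScore;

    // Summarises scores[firstIndex..lastIndex], with both bounds inclusive.
    ScoreSummary(int[] scores, int firstIndex, int lastIndex) {
        if (firstIndex > lastIndex) {
            throw new IllegalArgumentException("firstIndex must not exceed lastIndex");
        }
        int sum = 0;
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int i = firstIndex; i <= lastIndex; i++) {
            sum += scores[i];
            min = Math.min(min, scores[i]);
            max = Math.max(max, scores[i]);
        }
        numScores = lastIndex - firstIndex + 1;
        sumScores = sum;
        minScore = min;
        maxScore = max;
        averageScore = (double) sum / numScores;
    }

    public static void main(String[] args) {
        int[] scores = {70, 85, 90, 60};
        ScoreSummary summary = new ScoreSummary(scores, 1, 3);
        System.out.println(summary.numScores); // prints 3
        System.out.println(summary.sumScores); // prints 235
        System.out.println(summary.minScore);  // prints 60
        System.out.println(summary.maxScore);  // prints 90
    }
}
```

A reader who sees firstIndex expects a lastIndex, and a reader who sees minScore expects a maxScore; using the customary counterpart saves them from checking the declaration.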

Use letters i, j, k for loop index variables and iterators. This convention is probably as old as C programming and it is a good practice, since there is no point in inventing fancy names for loop index variables. As always, there is an exception to the rule. If the index is used outside the loop, for example to count records, then a more descriptive name such as recordCount makes the reference outside the loop more readable.

Use is… or has… prefixes for boolean variables and methods that return boolean values. For example, the variable name isCustomer expresses boolean semantics much more clearly than just customer. Exceptions to this rule are words that clearly express a yes/no status, such as ready, done, failed, found, etc.
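A short sketch shows how such prefixes read at the call site. The Visitor class and its members are hypothetical and exist only to demonstrate the naming.

```java
// Hypothetical example; the class exists only to demonstrate boolean naming.
class BooleanNamingSketch {

    static class Visitor {
        private final boolean isCustomer;
        private final int numOpenOrders;

        Visitor(boolean isCustomer, int numOpenOrders) {
            this.isCustomer = isCustomer;
            this.numOpenOrders = numOpenOrders;
        }

        // The is... prefix signals a yes/no answer.
        boolean isCustomer() {
            return isCustomer;
        }

        // The has... prefix signals a yes/no answer derived from other state.
        boolean hasOpenOrders() {
            return numOpenOrders > 0;
        }
    }

    // Conditions built from such names read almost like a sentence.
    static boolean needsOrderReminder(Visitor visitor) {
        return visitor.isCustomer() && visitor.hasOpenOrders();
    }

    public static void main(String[] args) {
        System.out.println(needsOrderReminder(new Visitor(true, 2)));  // prints true
        System.out.println(needsOrderReminder(new Visitor(true, 0)));  // prints false
        System.out.println(needsOrderReminder(new Visitor(false, 3))); // prints false
    }
}
```

With a plain name such as customer, a reader could not tell from the call site whether the method returns a flag or a customer object.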

In an object-oriented language, use nouns for classes, objects and interfaces. Use verbs for class methods and functions. Sometimes, it is also meaningful to use adjectives for interfaces or methods, for example Quadratic, Sharable, isQuadratic(), or isShared().

Avoid scope prefixes unless your language forces you to do so. For example, it was customary in the past to prefix class names or functions with module names, for example: Payroll_EmployeeRecord, Payroll_Allowances, Payroll_Reimbursements, and so on. This practice battles namespace pollution, but it is a very poor replacement for actual namespaces, because it leads to incredibly long and cluttered identifiers. Use this only if your language does not support the notions of modules, packages, or namespaces.

Avoid shadowing. If you use identifiers in nested scopes, give them distinct names so that an inner declaration does not shadow an outer one. For example, if you declare a local variable with the same name as a field/member of the class, then the former shadows the latter. This is not a neat programming trick, but a recipe for disaster. It is quite confusing to read, and even more confusing to debug. An exception to this rule would be getters and setters, where the field/member is explicitly qualified with ‘this’.
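The following sketch demonstrates the kind of bug that shadowing tends to produce. The Counter class is hypothetical; the first method compiles without error but silently updates a local variable instead of the field.

```java
// Hypothetical example of a shadowing bug and its fix.
class ShadowingSketch {

    static class Counter {
        private int total = 0;

        // Bug: the local variable shadows the field, so the field never changes.
        void addShadowed(int amount) {
            int total = 0;   // shadows this.total
            total += amount; // updates only the local variable
        }

        // Fix: no local declaration, so the name refers to the field.
        void add(int amount) {
            total += amount;
        }

        // The accepted exception: a setter that qualifies the field with 'this'.
        void setTotal(int total) {
            this.total = total;
        }

        int getTotal() {
            return total;
        }
    }

    public static void main(String[] args) {
        Counter counter = new Counter();
        counter.addShadowed(5);
        System.out.println(counter.getTotal()); // prints 0
        counter.add(5);
        System.out.println(counter.getTotal()); // prints 5
        counter.setTotal(12);
        System.out.println(counter.getTotal()); // prints 12
    }
}
```

Nothing in addShadowed looks wrong at a glance, which is exactly why this kind of error is hard to find while debugging.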

Avoid reusing variables, unless the semantics are maintained. Imagine you have a variable named count which is used in one loop to count records, and in another to count errors. It may be tempting to reuse the integer already declared, but it is far more readable and maintainable to use two separate integer variables appropriately named recordCount and errorCount.
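A minimal sketch of the count example might look as follows. The class name, the method, and the convention that error lines start with "ERROR" are assumptions made purely for illustration.

```java
import java.util.List;

// Hypothetical example: two counters with distinct meanings get distinct names.
class ImportReportSketch {

    // Returns {recordCount, errorCount} for the given log lines.
    static int[] countRecordsAndErrors(List<String> lines) {
        // First loop: count non-empty lines as records.
        int recordCount = 0;
        for (String line : lines) {
            if (!line.isEmpty()) {
                recordCount++;
            }
        }

        // Second loop: a separate variable rather than a recycled "count".
        int errorCount = 0;
        for (String line : lines) {
            if (line.startsWith("ERROR")) {
                errorCount++;
            }
        }
        return new int[] {recordCount, errorCount};
    }

    public static void main(String[] args) {
        List<String> lines = List.of("OK 1", "ERROR 2", "", "OK 3", "ERROR 4");
        int[] counts = countRecordsAndErrors(lines);
        System.out.println(counts[0]); // prints 4
        System.out.println(counts[1]); // prints 2
    }
}
```

Had a single count variable been reused, the return statement would need a temporary copy of the first result, and every later reader would have to work out which meaning the variable has at each point.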

Make abbreviations readable and unambiguous. For example, the abbreviated prefix rnd is often used to denote a random value, but it could also mean rounded. If it isn’t clear from the context, write it out; for instance, write randomAmount instead of rndAmount. As a general rule, source code is read more often than it is written, thus readability is more important than writing convenience. With this in mind, if an abbreviation just saves one or two characters, always write out the word.

Create abbreviations that can be pronounced. For example, use stringPos rather than stringPstn and recCount rather than rcrdCount. It should be possible to read program code aloud without twisting your tongue. Apart from the advantage of being easier to communicate, pronounceable identifiers are also easier to remember.

Apr 7

Photo by Phil Jackson

My journey through the world of programming languages began in 1987 with the blinking cursor on the black-and-white screen of an Atari 1040 ST computer. After a few hours of playing with the GFA BASIC interpreter, I was hooked. The graphical capabilities of the Atari computer made it possible to program Mandelbrot fractals, the Towers of Hanoi, the Breakout game, and all those things which newbie programmers like to entertain themselves with. Quite a few of these programs looked peculiarly similar to what people programmed ten years later when the first Java applets appeared. But I am getting ahead of myself. Back in 1987, BASIC was the beginner language. The GFA BASIC dialect was considered quite modern at the time, since it didn’t have line numbers and it was a full-featured procedural language, at least in principle. Yet, it was still a toy. After about six months I felt like writing more ambitious projects and I realised that I had outgrown BASIC. Someone gave me a copy of a C compiler, so I started learning the C language. This was a good decision as it turned out later, because I was able to use C throughout the first five years of my career. I found Kernighan and Ritchie style C to be conceptually very close to the GFA BASIC I had started with, except for pointers, which were completely new.

The study of C led me to Unix. I began writing clones of Unix tools and utilities for my own use. This was the late 80s, before the GNU and Linux phenomena appeared. One such project was a text editor that I enhanced with optimised scrolling routines in 386 assembly language. I wrote the editor after I had exchanged my Atari computer for a PC. After a few months I had a number of common Unix tools and a nice text editor at my disposal which I could use under MS-DOS. Then I read Andrew Tanenbaum’s Minix book and I got into system programming. I wrote a micro-kernel task scheduler for the 386 in assembly language. Multi-tasking was a fascinating thing that seemed to be out of reach for an average personal computer. At the time, I briefly considered expanding the micro-kernel into a more complete OS by adding memory management and file management. However, I soon realised the immensity of this task. I had just started studying informatics, and I figured that I wouldn’t be able to accomplish it while still attending lectures and doing homework. At university, we were taught Pascal as a “first” programming language and Lisp as a second. Pascal was very easy, of course; it seemed like a verbose dialect of C. Lisp, on the other hand, I found quite repulsive. I could appreciate the underlying mathematical idea, the lambda calculus, but the syntax was just awful. I believe it was IEEE Scheme. The language seemed great for graph-theoretical problems, but unsuitable for expressing common algorithms in a natural way. In other words, I found it to be a language for eggheads.

At the time, the imperative programming paradigm was predominant. It seemed the best way to get things done, as development tools and libraries for imperative procedural languages were readily available. The next language I learned at the university was Modula 2. I thought of it as an elaboration of Pascal with emphasis on data abstraction and encapsulation. From Modula 2 I learned the importance of encapsulation. Although I didn’t use Modula 2 for practical applications, I was able to apply the conceptual foundation in my work, which revolved around C programming. After university, I worked in systems programming. I designed and implemented drivers for a company that manufactured proprietary hardware. Then I moved to another company in the field of machine translation and computer-based training. After 5 years of coding in C, I thought it was time for a change. This was the early nineties, so I turned my attention to application programming with RAD tools, which had just hit the market. I learned SQL inside out and created data-driven programs. Visual Basic 3.0 was the killer application in 1993, as it made the construction of Windows GUIs extremely easy. I was able to tie in with my prior Basic experience. Customers liked the productivity that comes with RAD. After about a year, I dropped VB in favour of Delphi, which was superior for this purpose. Likewise, I could tie in with my previous Pascal experience. I learned the rudiments of object-oriented programming with Object Pascal, which is odd given that C++ would have been the more natural path to object orientation after having programmed in C for many years. However, Object Pascal taught me proper componentisation.

This was the mid-nineties and a lot of amazing things happened in the IT industry. The most important change was the commercial breakthrough of the Internet. Almost simultaneously, the Linux phenomenon happened. The IT industry boomed and technological progress was fast-paced. The Internet connected everybody everywhere and Linux brought corporate computing horsepower to the desktop. As a result of these changes, I began coding HTML in 1996 and I learned JavaScript and Perl in 1997. The next year brought even more changes, as I decided to gear my business towards web development. Perl seemed like an idiosyncratic Unix solution born out of necessity. It was certainly practical for server-side programming, but it was also rather painful and hackish. Fortunately, PHP appeared at around the same time and it offered a much cleaner solution for server programming. Soon I found myself programming web applications in PHP most of the time. The number of LAMP-based applications on the Internet exploded between 1998 and 2003. During this period, I also learned the rudiments of Java, C++, and C#. I was responsible for the management of projects implemented in all of these languages. Object-oriented programming had become the mainstream paradigm in the late nineties. I decided that I needed to take on one of these languages more seriously. The obvious choice was Java, since it was general purpose, but still very strong in the field of web development.

So I fully immersed myself in Java when the language made the transition from 1.4 to 1.5. At that point, Java was already mature and mainstream. As a latecomer to Java, I found the platform huge, certainly larger than anything I had looked at before, including .NET. The sheer number of APIs was just unbelievable. It required a sustained effort of two years during which I read a shelf of Java books and began moving from trivial programming exercises to small projects and then to larger projects. Since the mid-2000s, Java has been my mainstay. There are two reasons why I like Java. First, there is a fantastic eco-system connected to the platform. It ranges from best-of-breed IDEs, VMs, and app-servers to a gazillion libraries and frameworks, and (almost) everything is free. Second, Java is extremely scalable and robust. It is neither the purest object-oriented language nor the richest, but Java is probably the one language that transforms average programmers into software engineers. I argue that this is so because of the high level of standardisation and the endorsement of best practices in the Java community. I know that there are quite a few people who dispute that. However, there’s a reason why universities teach Java to freshmen and why corporations use Java for enterprise development. It offers the largest and possibly the most robust platform for developing industrial-strength software.

Of course, not everything is hunky-dory in the Java department. I perceive that the main problem is the language itself. It’s aging. Although (or perhaps because) it forces programmers to write tidy code and relinquish dirty C tricks, it tends to be tedious, as it involves generous amounts of boilerplate code. It also lacks good paradigms for fine-grained concurrency control. Fortunately, with the Scala language I discovered a possible solution for these problems. At this point (early 2009) I haven’t yet done any larger projects in Scala, but my eagerness to do so is growing. Adding the functional paradigm to my programming instruments is very beneficial. It even flows over into my Java work, since it has changed the way I phrase algorithms in Java. The only negative effect is that by learning Scala, the limitations of the Java language became more evident and thus more painful. While functional programming will probably grow in the near future, Java has such a strong position that it won’t just fade away. Many large systems have been created in Java, so there will be maintenance work for decades to come. Meanwhile, it will be interesting to see how fast the industry embraces functional programming.

