The Agile Samurai

The Agile Samurai
by Jonathan Rasmusson
1st edition, 280 pages
Pragmatic Bookshelf

Book Review

Over the last ten years, I've worked with teams whose commitment to the agile process ranged from non-existent to quite strong. I was looking for a text that summarises agile methodology to help me formalise and articulate my own experiences, and of course to enhance my knowledge of some of the finer points of agile practices. I have to admit that this book did not meet my expectations. The first eighty pages, up to chapter six, are mostly about project inception and read like a prolonged introduction. From chapter six onwards, the author finally comes to the point and discusses the core concepts of agile processes, so the book does get better as it progresses. Unfortunately, Scrum isn't discussed at all; instead, Kanban is introduced in chapter eight. The discussion of typical technical practices, such as refactoring, TDD, and continuous integration, is compressed into several brief chapters at the end of the book.

The writing style is very informal; the author uses a conversational tone throughout the book. Almost every page contains illustrations, which makes it an easy and quick read. The style of the book is comparable to the Head First books. It left me with the impression that I had sat in an all-day meeting where someone said a lot of intelligent things with which everyone else agreed. Unfortunately, not many of these things seemed radically new or thought-provoking, so I fear I won't remember many of them next month. Of course, this may be entirely my own fault. I prefer a more formal, concise, old-school language. I also prefer dense and meaty textbooks with lots of diagrams, numbers and formulas. In return, I can dispense with stick figures, pictograms, and even with Master Sensei (a guru character used in the book). I feel that a lot of the deeper and more complex issues of agile project management have simply been left out.

To be fair, it must be mentioned that I probably do not fall into the target group for which this book was written. It is more appropriate as an introductory text for people who are new to agile project management, or even new to the entire business of project management. Think "trial lesson" and "starter course".

NoSQL Databases

NoSQL databases have entered the radar of web application developers lately. While relational database management systems (RDBMS) have been powering almost every web application on the Internet for more than a decade, this is beginning to change. No longer is the selection of persistence technology a no-brainer. You have additional choices. Besides the old friend RDBMS, there are object-oriented databases, graph-oriented databases, key-value stores, column-oriented databases, and other options. Many of the newer products in this area are known as NoSQL databases. NoSQL is a movement that promotes persistence technologies that break with the conventional relational model. NoSQL databases typically have neither table schemas nor SQL support, and they are designed to scale horizontally.

For those of you old enough to remember dBase, the NoSQL moniker may not be much of an attention grabber, because after all, products like dBase, FoxPro, Clipper and similar DB systems never had SQL support either. With these systems, relations had to be expressed implicitly in the application and “queries” had to be coded as retrieval sequences. By contrast, modern NoSQL systems depart from the relational model and in many cases also from the tabular data structure, in order to serve use cases where traditional RDBMS fail in one way or another. A typical example would be a sparsely populated table that contains very little data in its rows and columns. Such a table, if it grows to a large size, presents an efficiency problem to most RDBMS, with a resulting loss of performance. In the remainder of this article, we will look at a few selected NoSQL databases and see which use cases they cater to.

CouchDB

Apache CouchDB is a document-oriented database that represents documents as JSON objects. CouchDB supports all data types supported by JSON, and thus by JavaScript. The JSON objects are not required to comply with schemas and can therefore be defined freely, which means that each JSON object can have a different structure. CouchDB supports queries by views. Views are aggregate functions and filters programmed in JavaScript that follow the MapReduce model. Views are stored and indexed in the database. CouchDB provides a RESTful API where every object (and any other item) in the database can be retrieved by a URL. It uses the HTTP POST, GET, PUT, and DELETE methods for CRUD operations. Other features include ACID semantics on the basis of multi-version concurrency control, similar to RDBMS, which is optimised for a high number of concurrent reads, and a distributed architecture that allows for easy bidirectional replication and offline usage. CouchDB is thus designed from the ground up for Internet use.
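To make the URL-per-document idea concrete, here is a minimal sketch that builds (but does not send) the HTTP requests for one document's life cycle, using the JDK's own HTTP classes. The server address, the database name "articles", the document ids, and the class name CouchDbRequests are assumptions for illustration only.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.net.http.HttpRequest.BodyPublishers;

// Builds the requests for creating, reading, and deleting one document.
// Nothing is sent; the server address and database name are hypothetical.
class CouchDbRequests {
    static final String BASE = "http://localhost:5984/articles"; // assumed local server and database

    // Create or update: PUT the JSON body to the document's own URL.
    static HttpRequest put(String id, String json) {
        return HttpRequest.newBuilder(URI.create(BASE + "/" + id))
                .header("Content-Type", "application/json")
                .PUT(BodyPublishers.ofString(json))
                .build();
    }

    // Read: GET the same URL.
    static HttpRequest get(String id) {
        return HttpRequest.newBuilder(URI.create(BASE + "/" + id)).GET().build();
    }

    // Delete: DELETE the URL, naming the revision that should be removed.
    static HttpRequest delete(String id, String rev) {
        return HttpRequest.newBuilder(URI.create(BASE + "/" + id + "?rev=" + rev))
                .DELETE()
                .build();
    }
}
```

Sending one of these requests would take one more call, such as HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()), and of course a running server.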

Neo4J

Neo4J is a graph database. As the name suggests, it is intended for use with the Java platform, which includes any language that runs on the JVM. Neo4J stores information in nodes and edges; the latter are called relationships in Neo4J. Relationships are always of a defined type. Both nodes and relationships can store properties, i.e. data. The Neo4J database is thus optimised for representing complex graph and network structures, such as a hierarchical object repository or a social network. It offers high-performance graph traversal operations for data access. Nodes can also be indexed and retrieved by key, which enables more conventional queries. Additional features include ACID transactions and transaction recovery, based on the Java Transaction API (JTA). Optional libraries can expose a Neo4J database as an RDF store where the node space can be queried using SPARQL. Neo4J is an embedded database with a small footprint that runs in the same JVM as the application.
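The data model described above (nodes, typed relationships, properties on both, and traversal as the main access path) can be sketched in plain Java. This is deliberately not the Neo4J API; the class PropertyGraph and all of its members are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java illustration of the property graph model; NOT the Neo4J API.
class PropertyGraph {
    static class Node {
        final Map<String, Object> properties = new HashMap<>();
        final List<Relationship> outgoing = new ArrayList<>();
        Node(String name) { properties.put("name", name); }
    }

    static class Relationship {
        final String type;   // every relationship has a defined type
        final Node target;
        final Map<String, Object> properties = new HashMap<>();
        Relationship(String type, Node target) { this.type = type; this.target = target; }
    }

    // Connects two nodes with a directed, typed relationship.
    static Relationship relate(Node from, String type, Node to) {
        Relationship relationship = new Relationship(type, to);
        from.outgoing.add(relationship);
        return relationship;
    }

    // Breadth-first traversal that follows only relationships of the given type.
    static Set<Node> reachable(Node start, String type) {
        Set<Node> seen = new LinkedHashSet<>();
        Deque<Node> queue = new ArrayDeque<>();
        queue.add(start);
        while (!queue.isEmpty()) {
            Node current = queue.remove();
            for (Relationship r : current.outgoing) {
                if (r.type.equals(type) && r.target != start && seen.add(r.target)) {
                    queue.add(r.target);
                }
            }
        }
        return seen;
    }
}
```

In a social network, reachable(alice, "KNOWS") would collect friends and friends of friends while ignoring relationships of any other type.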

Redis

Redis is a modern implementation of a persistent key-value store for general purpose use. Key-value store is a name for a simple key-based access mechanism that basically implements a dictionary (or map) data structure. Traditionally, such systems were used for caching. Redis holds its entire database in memory, which makes it ideal for applications that require ultra-fast data access. Redis allows not just plain string data but also sets and lists of strings in the data space. The system offers a number of special commands, such as atomic push/pop and add/remove operations for lists, and set operations such as building union, intersection, and difference. Redis persists data either by asynchronously writing memory to disk, or by appending to a journalling file as data is written by clients. Additional features include easy master-slave replication and rudimentary sharding. Redis offers support for various languages, such as C/C++, Java, Scala, PHP and others through native drivers and APIs.
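The list and set operations mentioned above are easy to picture with an in-memory imitation. The sketch below is not a Redis client and talks to no server; TinyKeyValueStore and its method names are hypothetical, and the comments name the Redis commands each method loosely corresponds to.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// In-memory imitation of Redis-style list and set operations; NOT a Redis client.
class TinyKeyValueStore {
    private final Map<String, Deque<String>> lists = new HashMap<>();
    private final Map<String, Set<String>> sets = new HashMap<>();

    // synchronized stands in for the per-command atomicity of the real system
    synchronized void pushLeft(String key, String value) {          // cf. LPUSH
        lists.computeIfAbsent(key, k -> new ArrayDeque<>()).addFirst(value);
    }

    synchronized String popRight(String key) {                      // cf. RPOP
        Deque<String> list = lists.get(key);
        return list == null ? null : list.pollLast();
    }

    synchronized void addToSet(String key, String member) {         // cf. SADD
        sets.computeIfAbsent(key, k -> new TreeSet<>()).add(member);
    }

    synchronized Set<String> union(String a, String b) {            // cf. SUNION
        Set<String> result = new TreeSet<>(members(a));
        result.addAll(members(b));
        return result;
    }

    synchronized Set<String> intersection(String a, String b) {     // cf. SINTER
        Set<String> result = new TreeSet<>(members(a));
        result.retainAll(members(b));
        return result;
    }

    synchronized Set<String> difference(String a, String b) {       // cf. SDIFF
        Set<String> result = new TreeSet<>(members(a));
        result.removeAll(members(b));
        return result;
    }

    // A missing key behaves like an empty set.
    private Set<String> members(String key) {
        return sets.getOrDefault(key, Set.of());
    }
}
```

Pushing on the left and popping on the right turns a list into a simple work queue, which is one common reason to reach for a store of this kind.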

HBase

HBase is a free implementation of Google’s BigTable written in Java. It is not the type of database you would use for a blog or forum software. HBase is a tabular data store designed for massive tables in the petabyte range with billions of rows distributed over a number of physical machines, and thus optimised for horizontal scaling. HBase is part of the Apache Hadoop project, a framework for data-intensive distributed applications, inspired by Google’s MapReduce and GFS technologies. Hadoop supports the database through its distributed filesystem HDFS, which provides built-in replication and MapReduce traversal for HBase tables of arbitrary size. Features include optimised query push down via server-side scan and get filters, a high performance Thrift gateway, an XML-based RESTful web service gateway, Hadoop cascading, per-column probabilistic Bloom filters, as well as data warehousing and data analysis modules. Since HBase saves column families rather than columns and since empty columns are not stored, it is ideal for sparse tables with semi-structured data. Typical use cases are cloud computing and applications that require massive storage using cheap commodity hardware.
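Why unset columns cost nothing in such a layout becomes clearer with a toy model. The sketch below is not the HBase client API; SparseTable is a hypothetical class that simply nests sorted maps (row key, then column family, then column qualifier), so a cell exists only once a value has been written to it.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model of sparse, column-family-oriented storage; NOT the HBase client API.
class SparseTable {
    // row key -> column family -> column qualifier -> value
    private final Map<String, Map<String, Map<String, String>>> rows = new TreeMap<>();

    // Writing a value creates the cell; nothing is reserved for columns never written.
    void put(String rowKey, String family, String qualifier, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(family, k -> new TreeMap<>())
            .put(qualifier, value);
    }

    // Returns null for any cell that was never written.
    String get(String rowKey, String family, String qualifier) {
        return rows.getOrDefault(rowKey, Map.of())
                   .getOrDefault(family, Map.of())
                   .get(qualifier);
    }

    // Counts the cells that actually hold a value.
    int storedCells() {
        int cells = 0;
        for (Map<String, Map<String, String>> families : rows.values()) {
            for (Map<String, String> columns : families.values()) {
                cells += columns.size();
            }
        }
        return cells;
    }
}
```

A table with a million possible column qualifiers but three written values holds three cells in this model, which is the property that makes the approach attractive for sparse data.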

Db4o

Db4o is an open-source object-oriented database system targeted at OOP developers. The idea behind Db4o is to enable programmers to create and persist a representation of the application object model directly in the database without the need for an object-relational mapping software layer. Object instances can then be stored and retrieved with a single line of code. Db4o provides a query mechanism called Native Query (NQ). This allows querying data with native OOP language constructs, thus offering type safety for query expressions while eliminating the need for building query strings. Db4o is available for the Java and .NET platforms. If used with .NET languages, data can alternatively be queried with LINQ (language integrated query). The Db4o database is embeddable and has a small footprint, which makes it suitable for deployment on mobile devices. Additional features include semi-automatic schema versioning, transaction support with ACID semantics, and synchronisation/replication mechanisms that allow synchronisation between different Db4o instances and data export into SQL databases.

Frameworkless Architecture

Perhaps suggesting to eschew web frameworks for web application development is playing the devil’s advocate. Perhaps it is even foolish. To renounce the productivity boost one gets with a properly designed framework does not sound like sensible advice. Only ignorant script kiddies entertain such ideas. Well, for the most part that is true. A web framework does indeed simplify application development if it is chosen well. It does even more if it is designed well. It can provide architectural support for building maintainable applications. It can help with the plumbing and provide conceptual structure to guide the development process.

So, what speaks against using a web framework? Plenty actually, especially at the lower end of the spectrum and especially with dynamic languages. The main problem with web frameworks is that they add overhead. This means that the added functionality and structure is bought at the cost of performance degradation. The severity of this problem depends on the system architecture. One needs to keep in mind that dynamic languages are interpreted at runtime, which makes them CPU-intensive and relatively slow. Because the life cycle of a script is essentially stateless and single-step, classes and data structures need to be rebuilt and reloaded (in theory) at each request.

In practice, this does not happen, because servers are designed to provide at least rudimentary caching. However, the runtime performance of interpreted languages is typically several orders of magnitude lower than that of a compiled language, which magnifies the problem. To illustrate my point, consider these benchmarks for PHP frameworks kindly provided by Paul M. Jones. According to these figures, a trivial PHP page is served by Apache 2 at a performance reduction of 43% compared to static HTML. The use of various PHP web frameworks further reduces performance by 85% – 95% compared to a PHP page that merely echoes content. Although these figures can be expected to develop inverse-logarithmically with increasing application code complexity, the slowdown is significant.

PHP offers a number of remedies, such as opcode caching, object caching, and products such as Zend Server, APC, and MCache, yet performance is unlikely to get even close to that of a compiled language. Furthermore, there is the question of whether the complexity of the project justifies the complexity introduced by a web framework. Would you use a web framework for building a guestbook script? Probably not. What about blog software? A photo gallery? A bulletin board? These types of applications are the mainstay of dynamic languages, such as PHP. It is the area where PHP really shines. Think of WordPress, phpBB, MediaWiki, Drupal, osCommerce, Coppermine and other popular applications. They all have one thing in common: they don’t use a framework.

Hence, before choosing a web framework for PHP development, it may be worth pondering whether one is required at all. This suggestion may sound a bit contradictory, given that I reviewed the Zend Framework in a previous article. However, in my own practice I haven’t come across many complex PHP projects. The commercial PHP projects I worked on during the last 10 years can roughly be divided into three categories: 1. extensions and customisations of open source packages, 2. intranet information systems, and 3. e-commerce systems and “catalogware”.

Although the latter two may be considered candidates for web frameworks, the size of these projects was almost always small enough to do without. On several occasions, I chose to implement an “ultralight” MVC architecture by hand instead of using an out-of-the-box framework. The main reason for this was again performance. The “ultralight” approach is defined by implementing only the required functionality, which results in a highly specialised design.

In practice, this means slimming the controller, reducing DB abstraction to a thin wrapper around the native library, and forgoing a templating system in favour of embedded PHP. The advantage of this approach is that you get separation of presentation and business logic, componentisation, and customisable control flow without the performance cost of a full-blown framework. The disadvantage is that it is slightly more laborious to implement and less flexible. Don’t get me wrong. I have no problems imagining scenarios where I would want to use a PHP web framework such as the Zend Framework. However, in these cases I’d probably be drawn towards using Java or (hopefully) Scala in the first place. In summary, I have found myself using PHP mostly in situations where a web framework seemed dispensable, while I have been using Java mostly in situations where a web framework seemed essential.

Galileo Troubles

Another year has passed in the Eclipse universe, and this means another minor release number and another Jupiter moon. Eclipse has moved from 3.4 to 3.5, or from Ganymede to Galileo. Using a small gap in my busy development schedule, I decided to install the latest version this morning. Thanks to broadband Internet, the 180 MB JEE package was downloaded in a breeze and installed in a few minutes. Unfortunately, that’s where things stopped being easy.

When I downloaded the PDT plugin for PHP development, I found a bug in it that prevented Eclipse from creating a PHP project from existing sources. After some research on the Internet, I found that this was a well-documented bug which had been fixed in the meantime. I tried installing the latest PDT release via the Eclipse install & update feature, but the process came to a crashing halt with a message that demanded some Mylyn jars that could not be found. Although I had no idea why PDT required those particular jars, I dutifully installed the Mylyn plugins with the required version number.

Unfortunately, this did not impress Galileo, as it now demanded other jars when installing the PDT update. – Perhaps a case of workspace pollution, I thought. – Clearly, it was time for a fresh start. I scrapped the installation and started anew with a blank workspace and a new install location. This time, everything seemed to install fine. I was able to create Java and PHP projects. However, Galileo suddenly wouldn’t open *.xml, *.xsl, or *.html files any more. It complained that there was no editor for this content type, which appeared fishy since both web tools (WTP) and PDT were installed. I tried to solve the problem by playing around with the configuration, but to no avail.

After several fresh attempts and considerable time spent looking up error messages on the Internet, I decided to stay with Ganymede. Since I had wasted my entire morning and since I had some real work to do as well, this seemed to be the best course of action. Maybe I will give Galileo another go when an updated distro package becomes available. With Ganymede I never ran into this sort of trouble, despite having PDT, WTP, the Scala plugin and JBoss tools installed. I am still clueless as to what went wrong and I wonder if anybody else had a similar experience.

Naming Conventions (3)

Today I am going to talk about best practices in identifier naming. Before thinking about concrete strategies for devising identifier names, one should answer the following three questions: 1. What (human) language to use. 2. Whether to use a naming scheme or not. 3. How platform and language choices affect identifier naming. While I generally don’t like to use naming schemes for the reasons mentioned in the last article, it’s almost always a good idea to adopt existing conventions for the specific platform, language and domain one is working in. Certain languages have established strong conventions about how identifiers are formed. For example, nothing keeps you from assigning lower-case names to classes in Java, but it surely raises eyebrows among Java programmers. Likewise, Java enthusiasts would probably look down their noses at code that uses underscores instead of lower CamelCase.

Other programming languages, like PHP or Perl, have less strict conventions for identifier names. Whatever the case may be, it’s invariably useful to honour widely used conventions and apply them consistently, as it avoids misunderstandings and thus reduces the chance of errors. Apart from consistency and conventions, the most important point to get right is semantic precision. The identifier name should be stated as clearly and as intelligibly as possible, without being overly wordy. Ambiguity is the greatest enemy here, especially when there are several identifiers floating around that have similar, yet subtly different meanings. Consider the following example:

for (Event event: events) {
  trainees += event.trainees.size();
  if (event.status == 1 && event.end >= time) {
    for (Person train : event.trainees)
      if (train.passed) {
        trainee.add(train);
        trained++;
      }
      else training.add(train);
  }
  else if (event.status == 0 || event.end < time) {
    training.addAll(event.trainees);
  }
}

This code looks convoluted because of the overuse of the word “train”. What it does is this: it iterates over all workshop events and creates a collection of trainees that have successfully completed a workshop and another of trainees that did not complete a workshop or are still being trained. It also counts the total number of trainees and the number of successful ones. Apart from the ambiguous “train…”, readability suffers because of non-descriptive variable names and the use of constant integer literals. Here is a more readable version:

for (Event workshop: allWorkshops) {
  numOfTrainees += workshop.trainees.size();
  if (workshop.status == COMPLETED && workshop.end >= today) {
    for (Person trainee : workshop.trainees)
      if (trainee.passed) {
        graduates.add(trainee);
        numOfGraduates++;
      }
      else peopleInTraining.add(trainee);
  }
  else if (workshop.status == CANCELLED || workshop.end < today) {
    peopleInTraining.addAll(workshop.trainees);
  }
}
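For readers who want to run the more readable loop, here is a self-contained sketch. The Person and Event classes, the numeric values of COMPLETED and CANCELLED (taken from the literals 1 and 0 in the first version), the use of a plain day number for end and today, and the wrapper class WorkshopStats are assumptions made for illustration; the snippet above does not show them.

```java
import java.util.ArrayList;
import java.util.List;

// Self-contained wrapper around the renamed loop; the surrounding types are assumed.
class WorkshopStats {
    static final int CANCELLED = 0;   // replaces the literal 0
    static final int COMPLETED = 1;   // replaces the literal 1

    static class Person {
        final boolean passed;
        Person(boolean passed) { this.passed = passed; }
    }

    static class Event {
        final int status;
        final long end;               // assumed: a day number comparable with "today"
        final List<Person> trainees;
        Event(int status, long end, List<Person> trainees) {
            this.status = status;
            this.end = end;
            this.trainees = trainees;
        }
    }

    int numOfTrainees = 0;
    int numOfGraduates = 0;
    final List<Person> graduates = new ArrayList<>();
    final List<Person> peopleInTraining = new ArrayList<>();

    // Same control flow as the loop above, with braces added around the inner branches.
    void collect(List<Event> allWorkshops, long today) {
        for (Event workshop : allWorkshops) {
            numOfTrainees += workshop.trainees.size();
            if (workshop.status == COMPLETED && workshop.end >= today) {
                for (Person trainee : workshop.trainees) {
                    if (trainee.passed) {
                        graduates.add(trainee);
                        numOfGraduates++;
                    } else {
                        peopleInTraining.add(trainee);
                    }
                }
            } else if (workshop.status == CANCELLED || workshop.end < today) {
                peopleInTraining.addAll(workshop.trainees);
            }
        }
    }
}
```

Wrapping the loop in a method also gives the four result variables an obvious home, which the bare snippet leaves open.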

This brief example shows how to transform an ambiguous piece of code into one that can be read and understood quite easily. The question is: what are the underlying principles? Unfortunately, the answer to this question is complicated. Rather than attempting an analysis, I suggest looking at existing best practices that are widely adopted. Steve McConnell describes such a set of practices in his acclaimed book Code Complete (2nd Edition) in Chapter 11, The Power Of Variable Names. He writes: “An effective technique for coming up with a good name is to state in words what the variable represents. Often that statement itself is the best variable name. It’s easy to read because it doesn’t contain cryptic abbreviations, and it’s unambiguous.”

In the remainder of this article, I will try to present a summary of Steve McConnell’s set of best practices.

Describe the “what”, not the “how”. A good identifier name relates to the problem domain rather than to the solution approach or algorithm. For example, a variable named tmpStateArray might hold an array of communication states of different chat users in a chat application. The name is clearly computerish and algorithm-oriented; a better name would probably be chatStates or communicationStates. Likewise, names such as bitFlag, statusFlag, optionString, etc. should be avoided unless it is totally clear what they mean.

Aim for optimum identifier length. Identifier length is important. It is almost always a compromise between conveying sufficient meaning and maintaining readability; in other words, it’s always a compromise between ambiguity and length.

Consider the number of graduated trainees in the sales department. The name numberOfGraduatedTraineesInSalesDepartment is the least ambiguous, but rather unwieldy. The names trainees, salesTrainees, and gradSalTrn, on the other hand, are more convenient but too short and too ambiguous. A reasonable compromise would be numGradTraineesInSales.

Qualify computed values. If a variable holds a computed numeric value, such as a total or an average, qualify the variable name accordingly. Examples are: num…, count…, sum…, average…, max…, min… and so on.

Use common opposite names for variables that express the boundaries of an interval, or for operations that involve opposites, such as begin/end, first/last, min/max, old/new, lowest/highest, next/previous, source/target, sender/recipient and so on.
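A short sketch may illustrate both conventions together. The class TemperatureLog and its method are invented for this example; the point is only the naming of the qualified values (min…, max…, sum…, average…, num…) and of the interval boundaries (first/last).

```java
// Hypothetical example: names carry their qualifier and their opposite.
class TemperatureLog {
    // Returns {minReading, maxReading, averageReading} for the inclusive index range.
    static double[] summarize(double[] readings, int firstIndex, int lastIndex) {
        double minReading = Double.POSITIVE_INFINITY;
        double maxReading = Double.NEGATIVE_INFINITY;
        double sumOfReadings = 0.0;
        int numReadings = lastIndex - firstIndex + 1;

        for (int i = firstIndex; i <= lastIndex; i++) {
            minReading = Math.min(minReading, readings[i]);
            maxReading = Math.max(maxReading, readings[i]);
            sumOfReadings += readings[i];
        }

        double averageReading = sumOfReadings / numReadings;
        return new double[] {minReading, maxReading, averageReading};
    }
}
```

Because firstIndex and lastIndex form a familiar pair, a reader can guess that the range is inclusive at both ends before reading the loop condition.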

Use letters i, j, k for loop index variables and iterators. This convention is probably as old as C programming and it is a good practice, since there is no point in inventing fancy names for loop index variables. As always, there is an exception to the rule. If the index is used outside the loop, for example to count records, then a more descriptive name such as recordCount makes the reference outside the loop more readable.
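The rule and its exception can be shown side by side. In this sketch, the class RecordScan and both methods are made up: the first method uses throwaway indices i and j, while the second keeps using its counter after the loop and therefore gives it a descriptive name.

```java
// Hypothetical example contrasting throwaway indices with an index that outlives its loop.
class RecordScan {
    // i and j never leave the loops, so short names are fine.
    static int sumMatrix(int[][] matrix) {
        int sum = 0;
        for (int i = 0; i < matrix.length; i++) {
            for (int j = 0; j < matrix[i].length; j++) {
                sum += matrix[i][j];
            }
        }
        return sum;
    }

    // The counter is read after the loop, so it gets a descriptive name.
    static int countRecordsBeforeBlank(String[] lines) {
        int recordCount;
        for (recordCount = 0; recordCount < lines.length; recordCount++) {
            if (lines[recordCount].isEmpty()) {
                break;
            }
        }
        return recordCount;
    }
}
```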

Use is… or has… prefixes for boolean variables and methods that return boolean values. For example, the variable name isCustomer expresses boolean semantics much more clearly than just customer. Exceptions to this rule are words that clearly express a yes/no status, such as ready, done, failed, found, etc.

In an object-oriented language, use nouns for classes, objects and interfaces. Use verbs for class methods and functions. Sometimes, it is also meaningful to use adjectives for interfaces or methods, for example Quadratic, Sharable, isQuadratic(), or isShared().
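As a small illustration of these conventions, the sketch below uses an adjective for the interface, a noun for the class, verbs for actions, and is…/has… for boolean results. Sharable and isShared() are taken from the examples above; the Report class and the remaining members are invented.

```java
import java.util.ArrayList;
import java.util.List;

// Adjective: names a capability rather than a thing.
interface Sharable {
    void shareWith(String reader);   // verb: performs an action
    boolean isShared();              // is... prefix: answers a yes/no question
}

// Noun: names the thing itself.
class Report implements Sharable {
    private final List<String> readers = new ArrayList<>();
    private boolean done = false;    // "done" already reads as a yes/no status

    public void shareWith(String reader) { readers.add(reader); }

    public boolean isShared() { return !readers.isEmpty(); }

    public boolean hasReader(String reader) { return readers.contains(reader); }

    public void finish() { done = true; }

    public boolean isDone() { return done; }
}
```

Call sites then read almost like sentences, for example: if (report.isShared() && report.hasReader("ann")).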

Avoid scope prefixes unless your language forces you to do so. For example, it was customary in the past to prefix class names or functions with module names, for example: Payroll_EmployeeRecord, Payroll_Allowances, Payroll_Reimbursements, and so on. This practice battles namespace pollution, but it is a very poor replacement for actual namespaces, because it leads to incredibly long and cluttered identifiers. Use this only if your language does not support the notions of modules, packages, or namespaces.

Avoid shadowing. If you use identifiers in nested scopes, give them distinct names so that none shadows another. For example, if you declare a local variable with the same name as a field/member of the class, then the former shadows the latter. This is not a neat programming trick, but a recipe for disaster. It is quite confusing to read, and even more confusing to debug. An exception to this rule would be getters and setters, where the field/member is explicitly qualified with ‘this’.
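The following sketch shows how quietly shadowing can go wrong in Java. The class Counter is hypothetical; addShadowed() compiles without complaint but never touches the field, whereas the setter shows the accepted exception with an explicit this.

```java
// Hypothetical example of a local variable shadowing a field.
class Counter {
    private int count = 0;

    // Bug: the local declaration shadows the field, so only the local changes.
    void addShadowed(int amount) {
        int count = 0;
        count += amount;
    }

    // Correct: no local of the same name, so the field is updated.
    void add(int amount) {
        count += amount;
    }

    // Accepted exception: the parameter shadows the field, and 'this' disambiguates.
    void setCount(int count) {
        this.count = count;
    }

    int getCount() {
        return count;
    }
}
```

After new Counter().addShadowed(5), getCount() still returns 0, and nothing in the method signature hints at the reason.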

Avoid reusing variables, unless the semantics are maintained. Imagine you have a variable named count which is used in one loop to count records, and in another to count errors. It may be tempting to reuse the integer already declared, but it is far more readable and maintainable to use two separate integer variables appropriately named recordCount and errorCount.
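A minimal sketch of the count example, with two separately named counters instead of one recycled variable. The class ImportReport, the method name, and the rule that error lines start with "ERR" are assumptions made for illustration.

```java
// Hypothetical example: one purpose per variable, each with its own name.
class ImportReport {
    // Returns {recordCount, errorCount}.
    static int[] countRecordsAndErrors(String[] lines) {
        int recordCount = 0;
        for (String line : lines) {
            if (!line.isEmpty()) {
                recordCount++;
            }
        }

        // A second variable instead of resetting and reusing the first one.
        int errorCount = 0;
        for (String line : lines) {
            if (line.startsWith("ERR")) {   // assumed marker for error lines
                errorCount++;
            }
        }

        return new int[] {recordCount, errorCount};
    }
}
```

With a single reused count, the return statement would have needed a temporary copy of the first result anyway, so the reuse saves nothing.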

Make abbreviations readable and unambiguous. For example, the abbreviated prefix rnd is often used to denote a random value, but it could also mean rounded. If it isn’t clear from the context, write it out; for instance, write randomAmount instead of rndAmount. As a general rule, source code is read more often than it is written, thus readability is more important than writing convenience. With this in mind, if an abbreviation just saves one or two characters, always write out the word.

Create abbreviations that can be pronounced. For example, use stringPos rather than stringPstn and recCount rather than rcrdCount. It should be possible to read program code aloud without twisting your tongue. Apart from the advantage of being easier to communicate, pronounceable identifiers are also easier to remember.