The code is the documentation

The first time I heard someone say “the code is the documentation”, I thought it sounded completely wrong, like a lazy excuse for not producing documentation. But the phrase stayed with me, and I came to realise that there is also truth in it. This paradoxical, thought-provoking quality makes it a proper mantra for agile practitioners, because it expresses a fundamental agile value: “working software over comprehensive documentation”.

Before we go into further detail, I should dismiss the notion that agile developers don’t write documentation or misjudge its worth. Nope. We still produce documentation. However, we also apply the following agile principle to it: “Simplicity, the art of maximising the amount of work not done, is essential.” It is expedient to minimise documentation by adhering to practices that reduce the need for it. In agile development, documentation plays a supporting role, whereas primary attention is given to the codebase. Or, more succinctly:

“Truth can only be found in one place: the code.” (Robert C. Martin, Clean Code)

The codebase is the ultimate source of truth. It is referenced in case of doubt, when a question is too detailed for the documentation to answer, or when the documentation is out of date. The code provides insight where no other method of reference is available. With this in mind, it is evident that code should be written in a way that is well-structured and understandable. Self-documenting code reduces or removes the need for external documentation: knowledge is represented in a single artefact, and there is no need to synchronise multiple sources. Unfortunately, most real-world codebases do not have these ideal qualities. There can be many reasons for that, but the most common is that software entropy has taken its toll over time and that too little attention was paid to refactoring.

So code quality, self-documenting code, and continuous refactoring go hand in hand. It is important to understand that the highest code quality is achieved through conceptual clarity, and conceptual clarity comes from good naming and good structure. This is far more important than coding style, naming conventions, formatting, and other surface features, although the latter do of course contribute to code quality. Naming and structure cannot be tested automatically; unlike code style, conventions, and formatting, they require human perception and intelligence. That is one more reason to adopt practices such as code reviews and pair programming. The sketch below shows what conceptual clarity buys the reader.
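
An illustrative Java sketch, with all names invented for this article: the same predicate written twice, once opaquely and once documenting itself through naming and structure alone.

```java
// The same age check, written twice. Only naming and structure differ,
// yet the second version needs no comment to be understood.
public class EligibilityCheck {

    // Opaque: the reader must reverse-engineer the intent from magic numbers.
    static boolean check(int d) {
        return d >= 18 && d <= 65;
    }

    // Self-documenting: the names alone answer what is checked and why.
    static final int MINIMUM_WORKING_AGE = 18;
    static final int RETIREMENT_AGE = 65;

    static boolean isOfWorkingAge(int ageInYears) {
        return ageInYears >= MINIMUM_WORKING_AGE && ageInYears <= RETIREMENT_AGE;
    }

    public static void main(String[] args) {
        System.out.println(check(40));          // true, but true of what?
        System.out.println(isOfWorkingAge(40)); // true, and self-explanatory
    }
}
```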

Coming back to “the code is the documentation”, I think the best way to understand this phrase is as an abstract ideal to be worked towards. In an ideal world, code needs no additional documentation, because it is so beautifully clear that it can be understood by anyone without any prior knowledge. It answers all questions that might arise about the software, and it clarifies the intentions of the programmer, which also makes it easy to maintain and change. Obviously, this is very difficult to achieve in the real world, especially across a large codebase, but a codebase’s self-documenting properties are quite likely the best measure of its quality.

Code Review Antipatterns

Don’t you love code reviews? Does having your grand solution scrutinised, criticised, and optimised by your colleagues make you feel awkward? It doesn’t have to be that way. Code reviews are actually fun if they are done right and if the participants are aware of the pitfalls. They are not only conducive to a high level of code quality, but also an ongoing learning experience.

There are three basic forms of code review: informal reviews (“hey, can you have a look at this?”), formal reviews (a review task formally assigned to one or more peers), and pair programming. The latter has the advantage that it happens at the time the code is created. It is also the only form in which two individuals engage in a direct conversation, with no time lag. This makes pair programming the preferred form of review for high-priority tasks.

Agile teams working with git (or any other version control system) often encounter code reviews as a step in their merging workflow. In this setting, the typical unit of code being reviewed is a merge request, aka pull request. So, here are a number of things to avoid when doing a code review.

Not doing code reviews – Let’s begin with the obvious. Four eyes (or six, or eight…) see more than two. Errors, problems, vulnerabilities, and bad design get spotted early on, while corrective measures are still inexpensive. The benefit is twofold: code quality increases and knowledge is shared. At least two people on the team are familiar with each feature or function, namely the author and the reviewer.

Confusing code reviews with functional reviews – Code reviews focus on code. Duh! The review is primarily (but not exclusively) concerned with the non-functional properties of the software. Functional reviews, on the other hand, ensure that functional requirements are met. The latter activity is closer to testing, so it makes sense to perform functional reviews in a separate procedure, possibly as part of testing.

Bitching, nitpicking, fault-finding, patronising – It goes without saying that treating colleagues with kindness and respect is necessary for productive collaboration. Code reviews are no exception. Offensive behaviour only derails the process, while courteous and supportive comments go a long way. The reviewer must guard against coming across as a know-it-all and keep office politics out of the dialogue. In written reviews, annotators should aim to be helpful without being wordy.

Discussing code style conventions – Code reviews should not be about lexical code style, formatting, or programming conventions; there are automated tools for that. If you care about formatting and code style, use a linter. More importantly, decide on a set of style conventions as a team and configure the linter accordingly. It is totally OK if the team agrees to leave certain options open to individual preference. The formatting and style of code that passes the linter should not be argued over in a review; that is simply a waste of time.

Ignoring naming – Programming conventions may or may not encompass identifier naming conventions. However, a linter cannot determine whether a given name is good or bad; machines don’t care about names, only humans do. Phil Karlton famously described naming as one of the two hard things in computer science. So, by all means, do review identifier naming. As a rule of thumb, if an identifier name leaves you puzzled the first time you read it, it is probably not well chosen. The sketch below shows the kind of naming problem no linter will ever flag.
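
A Java sketch, with the class invented for this example: the method name promises a side-effect-free lookup, but the implementation also mutates state.

```java
import java.util.*;

// A naming problem only a human reviewer will catch: "get" implies a
// read-only operation, yet this method silently removes expired entries.
public class SessionRegistry {
    private static final long TIMEOUT_MILLIS = 30 * 60 * 1000L;
    private final Map<String, Long> lastSeen = new HashMap<String, Long>();

    // Misleading: callers reading "get" will not expect entries to be deleted.
    // Better: split the removal into its own, honestly named method, so that
    // each name tells the whole story.
    public Set<String> getActiveUsers() {
        long now = System.currentTimeMillis();
        Iterator<Map.Entry<String, Long>> it = lastSeen.entrySet().iterator();
        while (it.hasNext()) {
            if (now - it.next().getValue() > TIMEOUT_MILLIS) {
                it.remove();
            }
        }
        return lastSeen.keySet();
    }

    public static void main(String[] args) {
        SessionRegistry registry = new SessionRegistry();
        registry.lastSeen.put("alice", System.currentTimeMillis());
        System.out.println(registry.getActiveUsers()); // [alice]
    }
}
```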

Debate – Code reviews aren’t meant as a podium for debate. The objective is to improve the resulting artefact, not to find out who has the better arguments. Sometimes people genuinely disagree about practices and solutions. It may help to resolve such disagreements by giving examples and references to commonly accepted best practices. If that is to no avail, a third party (or fourth, or fifth…) should be involved. In a written review, consider involving a third party whenever the dialogue drifts into debate.

Not getting a third opinion – Involving a third party in the review is not only useful when two people cannot agree. See the first antipattern: four eyes are better than two, and six are better than four. If an important architectural question is at stake, why not involve the whole team? See: mob programming.

Reviewing huge chunks of code – Code reviews should be limited in scope, ideally to a few pages of code that can be read and understood completely in twenty minutes. Studies have shown that thoroughness and the number of issues spotted are inversely proportional to the amount of code under review; results therefore improve when smaller units of code are reviewed. Including time for reflection and annotations or discussion, the entire review should not take longer than 45 minutes. Anything exceeding one hour is questionable.

Unidirectional reviewing – The general idea is that everyone reviews everyone’s code. Most teams have both senior and junior developers on board, and perhaps a team lead or an architect. This often leads to a situation where seniors review juniors’ code, but not vice versa. This is wrong. First, seniors make mistakes too. Second, juniors can learn from reading code written by senior developers. Third, varying fields of specialisation can be leveraged; a junior developer might have special knowledge or skills in a certain narrow area.

Rewriting code – It’s called code review, not code rewrite! Rewriting someone’s code is just not appropriate in most situations. It’s akin to pair programming where the navigator just grabs the keyboard from the driver whenever he feels like it. It’s simply rude. If the reviewer has a better idea, he should annotate the merge request with code examples.

Overdoing it – Programming is not a beauty contest. Code does not need to be perfect to be functional and maintainable. For example, even though I might prefer a functional loop construct over an imperative one, it’s just fine if the imperative one works. Perfectionism sometimes gets in the way and leads to other antipatterns, such as premature optimisation and presumptive features (YAGNI). Also, a piece of code written for a one-time migration or ops procedure does not require the same high standards as production code.

Underdoing it – Being too forgiving is likewise unhelpful. If the code review is executed as a mere approval formality, it’s a waste of time. Code should always be inspected for things like security holes, logical errors, code smells, technical debt and feature creep. It is important to strike a balance between clean maintainable code and reasonable effort.

Not reviewing tests – Unit tests are often mandatory, and sometimes functional and acceptance tests are also part of the artefact. These should be reviewed as well, to ensure that the tests are present, logically correct, and meet the agreed standards. It is also a good idea to verify an agreed-upon code coverage percentage as part of the review. The sketch below shows the kind of test a coverage tool accepts but a reviewer should not.
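
An illustrative JUnit 4 sketch, with the ShoppingCart class invented for this example: both tests raise the coverage number, but only the second verifies any actual behaviour.

```java
import static org.junit.Assert.*;
import java.util.*;
import org.junit.Test;

// Two tests of the same hypothetical class. Both execute the code under
// test (and thus count towards coverage), but only the second would
// survive a careful review.
public class ShoppingCartTest {

    // Hypothetical class under test: fixed unit price of 10.00 per item.
    static class ShoppingCart {
        private final Map<String, Integer> items = new HashMap<String, Integer>();
        void add(String name, int quantity) { items.put(name, quantity); }
        double total() {
            double sum = 0.0;
            for (int quantity : items.values()) {
                sum += quantity * 10.00;
            }
            return sum;
        }
    }

    @Test
    public void weakTest() {
        ShoppingCart cart = new ShoppingCart();
        cart.add("book", 2);
        assertTrue(cart.total() >= 0); // runs the code, checks almost nothing
    }

    @Test
    public void meaningfulTest() {
        ShoppingCart cart = new ShoppingCart();
        cart.add("book", 2);
        assertEquals(20.00, cart.total(), 0.001); // verifies the arithmetic
    }
}
```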

Not using a diff utility – Most teams probably use a tool set with built-in diff and annotation capabilities. If yours doesn’t, you are making your work harder than it has to be.

The Agile Samurai

by Jonathan Rasmusson
1st edition, 280 pages
Pragmatic Bookshelf

Book Review

Over the last ten years, I have been working with teams with varying degrees of commitment to the agile process, ranging from non-existent to quite strong. I was looking for a text that summarises agile methodology, to help me formalise and articulate my own experiences and, of course, to enhance my knowledge of some of the finer points of agile practice. I have to admit that this book did not meet my expectations. The first eighty pages, up to chapter six, are mostly about project inception and read like a prolonged introduction. From chapter six onwards, the author finally comes to the point and discusses the core concepts of agile processes, so the book does get better as the page numbers increase. Unfortunately, Scrum isn’t discussed at all; instead, Kanban is introduced in chapter eight. The discussion of typical technical practices, such as refactoring, TDD, and continuous integration, is compressed into a few brief chapters at the end of the book.

The writing style is very informal; the author uses a conversational tone throughout the book. Almost every page contains illustrations, which makes it an easy and quick read. The style is comparable to that of the Head First books. It left me with the impression of having sat in an all-day meeting where someone said a lot of intelligent things to which everyone else agreed. Unfortunately, not many of these things seemed radically new or thought-provoking, so I fear I won’t remember many of them next month. Of course, this may be entirely my own fault. I prefer a more formal, concise, old-school style, and I like dense, meaty textbooks with lots of diagrams, numbers, and formulas. In return, I can dispense with stick figures, pictograms, and even with Master Sensei (a guru character used in the book). I feel that a lot of the deeper and more complex issues of agile project management have simply been left out.

To be fair, it must be mentioned that I probably do not fall into the target group for which this book was written. It is more appropriate as an introductory text for people who are new to agile project management, or even new to the entire business of project management. Think “trial lesson” and “starter course”.

NoSQL Databases

NoSQL databases have lately appeared on the radar of web application developers. While relational database management systems (RDBMS) have been powering almost every web application on the Internet for more than a decade, this is beginning to change. No longer is the selection of a persistence technology a no-brainer; there are additional choices. Besides the old friend RDBMS, there are object-oriented databases, graph-oriented databases, key-value stores, column-oriented databases, and other options. Many of the newer products in this area are known as NoSQL databases. NoSQL is a movement that promotes persistence technologies that break with the conventional relational model. NoSQL databases typically have no table schemas, offer no SQL support, and are designed to scale horizontally.

For those of you old enough to remember dBase, the NoSQL moniker may not be much of an attention grabber; after all, products like dBase, FoxPro, Clipper, and similar DB systems never had SQL support either. With these systems, relations had to be expressed implicitly in the application, and “queries” had to be coded as retrieval sequences. By contrast, modern NoSQL systems depart from the relational model, and in many cases also from the tabular data structure, in order to serve use cases where traditional RDBMS fail in one way or another. A typical example is a sparsely populated table, one in which most rows leave most columns empty. Such a table, if it grows large, presents an efficiency problem to most RDBMS, with a resulting loss of performance. In the remainder of this article, we will look at a few selected NoSQL databases and see which use cases they cater to.

CouchDB

Apache CouchDB is a document-oriented database that represents documents as JSON objects. CouchDB supports all data types that JSON, and hence Javascript, supports. The JSON objects are not required to comply with a schema and can be structured freely, which means that each document can have a different structure. CouchDB supports queries through views: aggregate functions and filters programmed in Javascript that follow the MapReduce pattern. Views are stored and indexed in the database. CouchDB provides a RESTful API in which every document (and any other item) in the database can be retrieved via a URL, using the HTTP POST, GET, PUT, and DELETE methods for CRUD operations. Other features include ACID semantics on the basis of multi-version concurrency control, similar to RDBMS but optimised for a high number of concurrent reads, and a distributed architecture that allows for easy bidirectional replication and offline usage. CouchDB is thus designed from the ground up for Internet use.
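
As a minimal illustration of the RESTful interface, the following Java sketch stores and fetches one document using only the JDK’s HTTP support. The host and port are CouchDB’s defaults; the database name “articles” and the document ID are invented for this example, and the database is assumed to exist.

```java
import java.io.*;
import java.net.*;

// Store a JSON document in CouchDB with an HTTP PUT, then read it back
// with an HTTP GET. Every document is addressable by a URL.
public class CouchDbExample {
    public static void main(String[] args) throws IOException {
        URL url = new URL("http://localhost:5984/articles/first-post");

        // Create (or update) the document.
        HttpURLConnection put = (HttpURLConnection) url.openConnection();
        put.setRequestMethod("PUT");
        put.setDoOutput(true);
        put.setRequestProperty("Content-Type", "application/json");
        Writer out = new OutputStreamWriter(put.getOutputStream(), "UTF-8");
        out.write("{\"title\": \"Hello CouchDB\", \"tags\": [\"nosql\", \"json\"]}");
        out.close();
        System.out.println("PUT status: " + put.getResponseCode()); // 201 on success

        // Retrieve the document; the body includes CouchDB's _id and _rev fields.
        HttpURLConnection get = (HttpURLConnection) url.openConnection();
        BufferedReader in = new BufferedReader(
                new InputStreamReader(get.getInputStream(), "UTF-8"));
        for (String line; (line = in.readLine()) != null; ) {
            System.out.println(line);
        }
        in.close();
    }
}
```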

Neo4J

Neo4J is a graph database. As the name suggests, it is intended for use with the Java platform, which includes any language that runs on the JVM. Neo4J stores information in nodes and edges; the latter are called relationships in Neo4J parlance. Relationships are always of a defined type, and both nodes and relationships can store properties, i.e. data. The Neo4J database is thus optimised for representing complex graph and network structures, such as a hierarchical object repository or a social network, and it offers high-performance graph traversal operations for data access. Nodes can also be indexed and retrieved by key, which enables queries in a more conventional style. Additional features include ACID transactions and transaction recovery based on the Java Transaction API (JTA). Optional libraries can expose a Neo4J database as an RDF store whose node space can be queried using SPARQL. Neo4J is an embedded database with a small footprint that runs in the same JVM as the application.
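
A minimal sketch of the embedded usage described above, written against the 1.x-era Java API; the database path, property names, and relationship type are invented for this example.

```java
import org.neo4j.graphdb.*;
import org.neo4j.kernel.EmbeddedGraphDatabase;

// Two nodes connected by a typed relationship, created inside a
// transaction against an embedded Neo4J instance.
public class Neo4jExample {
    private enum RelTypes implements RelationshipType { KNOWS }

    public static void main(String[] args) {
        GraphDatabaseService graphDb = new EmbeddedGraphDatabase("/tmp/neo4j-db");
        Transaction tx = graphDb.beginTx();
        try {
            Node alice = graphDb.createNode();
            alice.setProperty("name", "Alice");
            Node bob = graphDb.createNode();
            bob.setProperty("name", "Bob");

            // Relationships are typed and can carry properties of their own.
            Relationship knows = alice.createRelationshipTo(bob, RelTypes.KNOWS);
            knows.setProperty("since", 2010);

            tx.success(); // mark the transaction for commit
        } finally {
            tx.finish(); // commit, or roll back if success() was not called
        }
        graphDb.shutdown();
    }
}
```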

Redis

Redis is a modern implementation of a persistent key-value store for general-purpose use. A key-value store is a simple key-based access mechanism that essentially implements a dictionary (or map) data structure. Traditionally, such systems were used for caching, and Redis holds its entire database in memory, which makes it ideal for applications that require ultra-fast data access. Redis stores not just plain strings but also sets and lists of strings. The system offers a number of special commands, such as atomic push/pop and add/remove operations for lists, and set operations such as union, intersection, and difference. Redis persists data either by asynchronously writing memory to disk or by appending to a journalling file as data is written by clients. Additional features include easy master-slave replication and rudimentary sharding. Redis offers support for various languages, such as C/C++, Java, Scala, PHP, and others through native drivers and APIs.
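
A minimal sketch using Jedis, one of the native Java clients for Redis; a local server is assumed, and all key names are invented for this example.

```java
import java.util.Set;
import redis.clients.jedis.Jedis;

// Strings, lists, and sets via the Jedis client. The operations map
// one-to-one onto Redis commands (SET, LPUSH, RPOP, SADD, SINTER).
public class RedisExample {
    public static void main(String[] args) {
        Jedis jedis = new Jedis("localhost");

        // Plain string keys behave like a persistent map.
        jedis.set("greeting", "hello");
        System.out.println(jedis.get("greeting")); // hello

        // Lists support atomic push/pop operations.
        jedis.lpush("queue", "job1");
        jedis.lpush("queue", "job2");
        System.out.println(jedis.rpop("queue")); // job1 (FIFO via LPUSH/RPOP)

        // Sets support union, intersection, and difference.
        jedis.sadd("tags:a", "x");
        jedis.sadd("tags:a", "y");
        jedis.sadd("tags:b", "y");
        Set<String> common = jedis.sinter("tags:a", "tags:b");
        System.out.println(common); // [y]

        jedis.disconnect();
    }
}
```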

HBase

HBase is a free implementation of Google’s BigTable written in Java. It is not the type of database you would use for a blog or a forum. HBase is a tabular data store designed for massive tables in the petabyte range, with billions of rows distributed over a number of physical machines, and it is thus optimised for horizontal scaling. HBase is part of the Apache Hadoop project, a framework for data-intensive distributed applications inspired by Google’s MapReduce and GFS technologies. Hadoop supports the database through its distributed filesystem HDFS, which provides built-in replication and MapReduce traversal for HBase tables of arbitrary size. Features include optimised query push-down via server-side scan and get filters, a high-performance Thrift gateway, an XML-based RESTful web service gateway, Hadoop Cascading integration, per-column probabilistic Bloom filters, as well as data warehousing and data analysis modules. Since HBase stores column families rather than individual columns, and since empty columns are not stored, it is ideal for sparse tables with semi-structured data. Typical use cases are cloud computing and applications that require massive storage on cheap commodity hardware.
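
A minimal sketch against the classic HBase Java client API of that era; the table is assumed to exist already, and the table, column family, and qualifier names are invented for this example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

// One write and one read: every cell is addressed by row key, column
// family, and column qualifier, and all values are raw byte arrays.
public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "weblogs");

        // Write a single cell.
        Put put = new Put(Bytes.toBytes("row-2011-01-01"));
        put.add(Bytes.toBytes("stats"), Bytes.toBytes("hits"), Bytes.toBytes("42"));
        table.put(put);

        // Read it back.
        Get get = new Get(Bytes.toBytes("row-2011-01-01"));
        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("stats"), Bytes.toBytes("hits"));
        System.out.println(Bytes.toString(value)); // 42

        table.close();
    }
}
```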

Db4o

Db4o is an open-source object-oriented database system targeted at OOP developers. The idea behind Db4o is to enable programmers to create and persist a representation of the application object model directly in the database, without the need for an object-relational mapping layer. Object instances can then be stored and retrieved with a single line of code. Db4o provides a query mechanism called Native Queries (NQ), which allows querying data with native OOP language constructs, thus offering type safety for query expressions while eliminating the need to build query strings. Db4o is available for the Java and .NET platforms. With the .NET languages, data can alternatively be queried using LINQ (Language Integrated Query). The Db4o database is embeddable, with a footprint small enough for deployment on mobile devices. Additional features include semi-automatic schema versioning, transaction support with ACID semantics, and synchronisation/replication mechanisms that allow synchronisation between different Db4o instances and data export into SQL databases.
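
A minimal sketch of both ideas, single-call persistence and a Native Query, using the Java API; the Person class and file name are invented for this example.

```java
import com.db4o.Db4oEmbedded;
import com.db4o.ObjectContainer;
import com.db4o.ObjectSet;
import com.db4o.query.Predicate;

// Plain objects are stored directly; a Native Query is an ordinary typed
// predicate, so no query strings are involved.
public class Db4oExample {
    static class Person {
        String name;
        int age;
        Person(String name, int age) { this.name = name; this.age = age; }
    }

    public static void main(String[] args) {
        ObjectContainer db = Db4oEmbedded.openFile(
                Db4oEmbedded.newConfiguration(), "people.db4o");
        try {
            // Persisting an object model element is a single call.
            db.store(new Person("Alice", 34));

            // Native Query: type-safe and expressed in plain Java.
            ObjectSet<Person> adults = db.query(new Predicate<Person>() {
                public boolean match(Person p) { return p.age >= 18; }
            });
            for (Person p : adults) {
                System.out.println(p.name + " (" + p.age + ")");
            }
        } finally {
            db.close();
        }
    }
}
```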

Frameworkless Architecture

Perhaps suggesting that one eschew web frameworks for web application development is playing devil’s advocate. Perhaps it is even foolish. To renounce the productivity boost one gets from a properly designed framework does not sound like sensible advice; only ignorant script kiddies entertain such ideas. Well, for the most part that is true. A web framework does indeed simplify application development if it is chosen well, and even more so if it is designed well. It can provide architectural support for building maintainable applications, help with the plumbing, and provide conceptual structure to guide the development process.

So, what speaks against using a web framework? Plenty, actually, especially at the lower end of the spectrum and especially with dynamic languages. The main problem with web frameworks is that they add overhead: the added functionality and structure is bought at the cost of performance. How grave this problem is depends on the system architecture. One needs to keep in mind that dynamic languages are interpreted at runtime, which makes them CPU-intensive and relatively slow. Because the life cycle of a script is essentially stateless and single-step, classes and data structures need to be rebuilt and reloaded (in theory) at each request.

In practice, this does not happen, because servers are designed to provide at least rudimentary caching. However, the runtime performance of interpreted languages is typically several orders of magnitude lower than that of compiled languages, which magnifies the problem. To illustrate the point, consider the benchmarks for PHP frameworks kindly provided by Paul M. Jones. According to these figures, a trivial PHP page is served by Apache 2 at a performance reduction of 43% compared to static HTML. The use of various PHP web frameworks reduces performance by a further 85%–95% compared to a PHP page that merely echoes content. Although the relative overhead can be expected to shrink as application code grows in complexity, the slowdown is significant.

PHP offers a number of remedies, such as opcode caching, object caching, and products such as Zend Server, APC, and MCache, yet performance is unlikely to get even close to that of a compiled language. Furthermore, there is the question of whether the complexity of the project justifies the complexity introduced by a web framework. Would you use a web framework to build a guestbook script? Probably not. What about a blog software? A photo gallery? A bulletin board? These types of applications are the mainstay of dynamic languages such as PHP; they are the area where PHP really shines. Think of WordPress, phpBB, MediaWiki, Drupal, osCommerce, Coppermine, and other popular applications. They all have one thing in common: they don’t use a framework.

Hence, before choosing a web framework for PHP development, it may be worth pondering whether one is required at all. This suggestion may sound a bit contradictory, having just reviewed the Zend Framework in a previous article. However, in my own practice I haven’t come across many complex PHP projects. The commercial PHP projects I have worked on during the last ten years can roughly be divided into three categories: 1. extensions and customisations of open-source packages, 2. intranet information systems, and 3. e-commerce systems and “catalogware”.

Although the latter two may be considered candidates for web frameworks, the size of these projects was almost always small enough to do without one. On several occasions, I chose to implement an “ultralight” MVC architecture by hand instead of using an out-of-the-box framework. The main reason for this was, again, performance. The “ultralight” approach consists of implementing only the required functionality, which results in a highly specialised design.

In practice, this means slimming down the controller, reducing DB abstraction to a thin wrapper around the native library, and foregoing a templating system in favour of embedded PHP. The advantage of this approach is that you get separation of presentation and business logic, componentisation, and customisable control flow without the performance cost of a full-blown framework. The disadvantage is that it is slightly more laborious to implement and less flexible. Don’t get me wrong: I have no problem imagining scenarios where I would want to use a PHP web framework such as the Zend Framework. However, in those cases I’d probably be drawn towards using Java or (hopefully) Scala in the first place. In summary, I have found myself using PHP mostly in situations where a web framework seemed dispensable, and Java mostly in situations where one seemed essential.