Frameworkless Architecture

Perhaps suggesting to eschew web frameworks for web application development is playing the devil’s advocate. Perhaps it is even foolish. To renounce the productivity boost one gets with a properly designed framework does not sound like sensible advice. Only ignorant script kiddies entertain such ideas. Well, for the most part that is true. A web framework does indeed simplify application development if it is chosen well. It does even more if it is designed well. It can provide architectural support for building maintainable applications. It can help with the plumbing and provide conceptual structure to guide the development process.

So, what speaks against using a web framework? Plenty actually, especially at the lower end of the spectrum and especially with dynamic languages. The main problem with web frameworks is that they add overhead. This means that the added functionality and structure are bought at the cost of performance degradation. The severity of this problem depends on the system architecture. One needs to keep in mind that dynamic languages are interpreted at runtime, which makes them CPU-intensive and relatively slow. Because the life cycle of a script is essentially stateless and single-step, classes and data structures need to be rebuilt and reloaded (in theory) at each request.

In practice, this does not happen, because servers are designed to provide at least rudimentary caching. However, the runtime performance of interpreted languages is typically several orders of magnitude lower than that of a compiled language, which magnifies the problem. To illustrate my point, consider these benchmarks for PHP frameworks kindly provided by Paul M. Jones. According to these figures, a trivial PHP page is served by Apache 2 at a performance reduction of 43% compared to static HTML. The use of various PHP web frameworks further reduces performance by 85% – 95% compared to a PHP page that merely echoes content. Although it can be expected that these figures develop inverse logarithmically with increasing application code complexity, the slowdown is significant.

PHP offers a number of remedies, such as opcode caching, object caching, and products such as Zend Server, APC, and MCache, yet performance is unlikely to get even close to that of a compiled language. Furthermore, there is the question of whether the complexity of the project justifies the complexity introduced by a web framework. Would you use a web framework for building a guestbook script? Probably not. What about blog software? A photo gallery? A bulletin board? These types of applications are the mainstay of dynamic languages, such as PHP. It is the area where PHP really shines. Think of WordPress, phpBB, MediaWiki, Drupal, osCommerce, Coppermine and other popular applications. They all have one thing in common: they don’t use a framework.

Hence, before choosing a web framework for PHP development, it may be worth pondering whether one is required at all. This suggestion may sound a bit contradictory, given that I reviewed the Zend Framework in a previous article. However, in my own practice I haven’t come across many complex PHP projects. The commercial PHP projects I worked on during the last 10 years can roughly be divided into three categories: 1. extensions and customisations of open source packages, 2. intranet information systems, and 3. e-commerce systems and “catalogware”.

Although the latter two may be considered candidates for web frameworks, the size of these projects was almost always small enough to do without. On several occasions, I chose to implement an “ultralight” MVC architecture by hand instead of using an out-of-the-box framework. The main reason for this was again performance. The “ultralight” approach is defined by implementing only the required functionality, which results in highly specialised design.

In practice, this means slimming the controller, reducing DB abstraction to a thin wrapper around the native library, and forgoing a templating system in favour of embedded PHP. The advantage of this approach is that you get separation of presentation and business logic, componentisation, and customisable control flow without the performance cost of a full-blown framework. The disadvantage is that it is slightly more laborious to implement and less flexible. Don’t get me wrong. I have no problems imagining scenarios where I would want to use a PHP web framework such as the Zend Framework. However, in these cases I’d probably be drawn towards using Java or (hopefully) Scala in the first place. In summary, I have found myself using PHP mostly in situations where a web framework seemed dispensable, while I have been using Java mostly in situations where a web framework seemed essential.

Galileo Troubles

Another year has passed in the Eclipse universe, and this means another minor release number and another Jupiter moon. Eclipse has moved from 3.4 to 3.5, or from Ganymede to Galileo. Using a small gap in my busy development schedule, I decided to install the latest version this morning. Thanks to broadband Internet, the 180 MB JEE package was downloaded in a breeze and installed in a few minutes. Unfortunately, that’s where things stopped being easy.

When I downloaded the PDT plugin for PHP development, I found a bug in it that prevented Eclipse from creating a PHP project from existing sources. After some research on the Internet, I found that this was a well-documented bug which had been fixed in the meantime. I tried installing the latest PDT release via the Eclipse install & update feature, but the process came to a crashing halt with a message that demanded some Mylyn jars that could not be found. Although I had no idea why PDT required those particular jars, I dutifully installed the Mylyn plugins with the required version number.

Unfortunately, this did not impress Galileo, as it now demanded other jars when installing the PDT update. – Perhaps a case of workspace pollution, I thought. – Clearly, it was time for a fresh start. I scrapped the installation and started anew with a blank workspace and a new install location. This time, everything seemed to install fine. I was able to create Java and PHP projects. However, Galileo suddenly wouldn’t open *.xml, *.xsl, or *.html files any more. It complained that there was no editor for this content type, which appeared fishy since both web tools (WTP) and PDT were installed. I tried to solve the problem by playing around with the configuration, but to no avail.

After several fresh attempts and considerable time spent looking up error messages on the Internet, I decided to stay with Ganymede. Since I had wasted my entire morning and since I had some real work to do as well, this seemed to be the best course of action. Maybe I will give Galileo another go when an updated distro package becomes available. With Ganymede I never ran into this sort of trouble, despite having PDT, WTP, the Scala plugin and JBoss Tools installed. I am still clueless as to what went wrong and I wonder if anybody else had a similar experience.

Naming Conventions (3)

Today I am going to talk about best practices in identifier naming. Before thinking about concrete strategies for devising identifier names, one should answer the following three questions: 1. What (human) language to use. 2. Whether to use a naming scheme or not. 3. How platform and language choices affect identifier naming. While I generally don’t like to use naming schemes for the reasons mentioned in the last article, it’s almost always a good idea to adopt existing conventions for the specific platform, language and domain one is working in. Certain languages have established strong conventions about how identifiers are formed. For example, nothing keeps you from assigning lower-case names to classes in Java, but it surely raises eyebrows among Java programmers. Likewise, Java enthusiasts would probably look down their noses at code that uses underscores instead of lower CamelCase.

Other programming languages, like PHP or Perl, have less strict conventions for identifier names. Whatever the case may be, it’s invariably useful to honour widely used conventions and apply them consistently, as it avoids misunderstandings and thus reduces the chance for errors. Apart from consistency and conventions, the most important point to get right is semantic precision. The identifier name should be stated as clearly and as intelligibly as possible, without being overly wordy. Ambiguity is the greatest enemy here, especially when there are several identifiers floating around that have similar, yet subtly different meanings. Consider the following example:

for (Event event: events) {
  trainees += event.trainees.size();
  if (event.status == 1 && event.end >= time) {
    for (Person train : event.trainees)
      if (train.passed) {
        trainee.add(train);
        trained++;
      }
      else training.add(train);
  }
  else if (event.status == 0 || event.end < time) {
    training.addAll(event.trainees);
  }
}

This code looks convoluted because of the overuse of the word “train”. What it does is this: it iterates over all workshop events and creates a collection of trainees that have successfully completed a workshop and another of trainees that did not complete a workshop or are still being trained. It also counts the total number of trainees and the number of successful ones. Apart from the ambiguous “train…”, readability suffers because of non-descriptive variable names and the use of constant integer literals. Here is a more readable version:

for (Event workshop: allWorkshops) {
  numOfTrainees += workshop.trainees.size();
  if (workshop.status == COMPLETED && workshop.end >= today) {
    for (Person trainee : workshop.trainees)
      if (trainee.passed) {
        graduates.add(trainee);
        numOfGraduates++;
      }
      else peopleInTraining.add(trainee);
  }
  else if (workshop.status == CANCELLED || workshop.end < today) {
    peopleInTraining.addAll(workshop.trainees);
  }
}

This brief example shows how to transform an ambiguous piece of code into one that can be read and understood quite easily. The question is: what are the underlying principles? Unfortunately, the answer to this question is complicated. Rather than attempting an analysis, I suggest looking at existing best practices that are widely adopted. Steve McConnell describes such a set of practices in his acclaimed book Code Complete (2nd Edition) in Chapter 11, The Power Of Variable Names. He writes: “An effective technique for coming up with a good name is to state in words what the variable represents. Often that statement itself is the best variable name. It’s easy to read because it doesn’t contain cryptic abbreviations, and it’s unambiguous.”
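The readable loop above depends on declarations that the article does not show. As a rough sketch of how it could be made runnable, the following fills them in with hypothetical types: the enum WorkshopStatus, the classes Person, Workshop and TrainingReport, and the use of a plain int for dates are all my assumptions, chosen only to keep the example short. The control flow is copied from the loop unchanged.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical supporting types; the article shows only the loop itself.
enum WorkshopStatus { CANCELLED, COMPLETED }

class Person {
    final String name;
    final boolean passed;

    Person(String name, boolean passed) {
        this.name = name;
        this.passed = passed;
    }
}

class Workshop {
    final WorkshopStatus status;
    final int end;                  // end day as a plain number (assumption, for brevity)
    final List<Person> trainees;

    Workshop(WorkshopStatus status, int end, List<Person> trainees) {
        this.status = status;
        this.end = end;
        this.trainees = trainees;
    }
}

class TrainingReport {
    int numOfTrainees = 0;
    int numOfGraduates = 0;
    final List<Person> graduates = new ArrayList<>();
    final List<Person> peopleInTraining = new ArrayList<>();

    // Same control flow as the readable loop, with named constants instead of 0 and 1.
    static TrainingReport of(List<Workshop> allWorkshops, int today) {
        TrainingReport report = new TrainingReport();
        for (Workshop workshop : allWorkshops) {
            report.numOfTrainees += workshop.trainees.size();
            if (workshop.status == WorkshopStatus.COMPLETED && workshop.end >= today) {
                for (Person trainee : workshop.trainees) {
                    if (trainee.passed) {
                        report.graduates.add(trainee);
                        report.numOfGraduates++;
                    } else {
                        report.peopleInTraining.add(trainee);
                    }
                }
            } else if (workshop.status == WorkshopStatus.CANCELLED || workshop.end < today) {
                report.peopleInTraining.addAll(workshop.trainees);
            }
        }
        return report;
    }
}
```

Replacing the integer literals with an enum also lets the compiler reject a misspelled status, which a bare 0 or 1 never could.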

In the remainder of this article, I will try to present a summary of Steve McConnell’s set of best practices.

Describe the “what”, not the “how”. A good identifier name relates to the problem domain rather than to the solution approach or algorithm. For example, the name tmpStateArray might hold an array of communication states of different chat users in a chat application. The name is clearly computerish and algorithm-oriented; a better name would probably be chatStates or communicationStates. Likewise, names such as bitFlag, statusFlag, optionString, etc. should be avoided unless it is totally clear what they mean.

Aim for optimum identifier length. Identifier length is important. It is almost always a compromise between conveying sufficient meaning and maintaining readability; in other words, it’s always a compromise between ambiguity and length.

Consider the number of graduated trainees in the sales department. The name numberOfGraduatedTraineesInSalesDepartment is the least ambiguous, but rather unwieldy. The names trainees, salesTrainees, and gradSalTrn, on the other hand, are more convenient but too short and too ambiguous. A reasonable compromise would be numGradTraineesInSales.

Qualify computed values. – If a variable is used for a computed numeric value, such as a total or an average, qualify the variable name accordingly. Examples are: num…, count…, sum…, average…, max…, min… and so on.

Use common opposite names for variables that express the boundaries of an interval, or an operation that involves opposites, such as begin/end, first/last, min/max, old/new, lowest/highest, next/previous, source/target, sender/recipient and so on.
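To make these two practices concrete, here is a small sketch. The class AmountSummary and all of its members are invented for illustration; it assumes that firstIndex is not greater than lastIndex and that both lie within the array.

```java
// Hypothetical example: summarises a slice of an int array.
class AmountSummary {
    // Qualified computed values: each prefix says which computation produced the number.
    final int numAmounts;
    final int sumAmounts;
    final int minAmount;
    final int maxAmount;
    final double averageAmount;

    // firstIndex/lastIndex are a common opposite pair marking an inclusive interval.
    AmountSummary(int[] amounts, int firstIndex, int lastIndex) {
        int sum = 0;
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int i = firstIndex; i <= lastIndex; i++) {
            sum += amounts[i];
            min = Math.min(min, amounts[i]);
            max = Math.max(max, amounts[i]);
        }
        numAmounts = lastIndex - firstIndex + 1;
        sumAmounts = sum;
        minAmount = min;
        maxAmount = max;
        averageAmount = (double) sum / numAmounts;
    }
}
```

A reader who meets minAmount and maxAmount later in the code does not need to look up how they were computed, and the first/last pair makes it clear that both bounds are inclusive.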

Use letters i, j, k for loop index variables and iterators. – This convention is probably as old as C-programming and it is a good practice, since there is no point in inventing fancy names for loop index variables. As always, there is an exception to the rule. If the index is used outside the loop, for example to count records, then a more descriptive name such as recordCount makes the reference outside the loop more readable.
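A short sketch of both the rule and its exception follows. The class LoopNames and its two methods are hypothetical; the second method assumes that the records of interest sit at the front of the array and that the first null entry marks their end.

```java
// Hypothetical example contrasting throwaway indices with an index that is read after the loop.
class LoopNames {
    // i and j never leave the loops, so short conventional names are enough.
    static int sumOfCells(int[][] grid) {
        int sum = 0;
        for (int i = 0; i < grid.length; i++) {
            for (int j = 0; j < grid[i].length; j++) {
                sum += grid[i][j];
            }
        }
        return sum;
    }

    // The index is used after the loop, so it is named for what it means there.
    static int countLeadingRecords(String[] records) {
        int recordCount = 0;
        while (recordCount < records.length && records[recordCount] != null) {
            recordCount++;
        }
        return recordCount;
    }
}
```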

Use is… or has… prefixes for boolean variables and methods that return boolean values. For example, the variable name isCustomer expresses boolean semantics much more clearly than just customer. Exceptions to this rule are words that clearly express a yes/no status, such as ready, done, failed, found, etc.

In an object-oriented language, use nouns for classes, objects and interfaces. Use verbs for class methods and functions. Sometimes, it is also meaningful to use adjectives for interfaces or methods, for example Quadratic, Sharable, isQuadratic(), or isShared().
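The following sketch puts these naming rules side by side. Apart from isCustomer and found, which come from the text, every name here (Membership, hasFunds, withdraw, contains) is made up for the example.

```java
// Hypothetical example: a noun for the class, verbs for actions, is/has for booleans.
class Membership {
    private final boolean isCustomer;
    private int balance;

    Membership(boolean isCustomer, int balance) {
        this.isCustomer = isCustomer;
        this.balance = balance;
    }

    // Boolean queries read like yes/no questions.
    boolean isCustomer() {
        return isCustomer;
    }

    boolean hasFunds() {
        return balance > 0;
    }

    // A verb, because the method changes state.
    void withdraw(int amount) {
        balance -= amount;
    }

    static boolean contains(int[] values, int wanted) {
        boolean found = false;   // "found" already reads as yes/no, so no prefix is needed
        for (int value : values) {
            if (value == wanted) {
                found = true;
            }
        }
        return found;
    }
}
```

Call sites then read almost like sentences, for example if (membership.hasFunds()) followed by membership.withdraw(amount).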

Avoid scope prefixes unless your language forces you to do so. For example, it was customary in the past to prefix class names or functions with module names, for example: Payroll_EmployeeRecord, Payroll_Allowances, Payroll_Reimbursements, and so on. This practice battles namespace pollution, but it is a very poor replacement for actual namespaces, because it leads to incredibly long and cluttered identifiers. Use this only if your language does not support the notions of modules, packages, or namespaces.

Avoid shadowing. If you use identifiers in nested scopes, don’t use the same names to avoid shadowing. For example, if you declare a local variable with the same name as a field/member of the class, then the former shadows the latter. This is not a neat programming trick, but a recipe for disaster. It is quite confusing to read, and even more confusing to debug. An exception to this rule would be getters and setters, where the field/member is explicitly qualified with ‘this’.
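Here is a minimal sketch of how shadowing goes wrong in Java. The class Counter and its methods are hypothetical and exist only to show the effect.

```java
// Hypothetical example of a local variable shadowing a field.
class Counter {
    private int count = 0;

    // Looks like it increments the counter, but the local 'count' hides the field.
    void incrementShadowed() {
        int count = this.count;
        count++;                 // changes only the local copy; the field stays as it was
    }

    // No local of the same name, so the field itself is updated.
    void increment() {
        count++;
    }

    // The accepted exception: a setter that qualifies the field with 'this'.
    void setCount(int count) {
        this.count = count;
    }

    int getCount() {
        return count;
    }
}
```

The compiler accepts incrementShadowed() without complaint, which is what makes this kind of bug so tedious to track down.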

Avoid reusing variables, unless the semantics are maintained. Imagine you have a variable named count which is used in one loop to count records, and in another to count errors. It may be tempting to reuse the integer already declared, but it is far more readable and maintainable to use two separate integer variables appropriately named recordCount and errorCount.
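As a sketch of the two-counter version, consider the following. The class ImportSummary is hypothetical, and so are its rules: a non-null entry counts as a record, and an empty string counts as an erroneous record.

```java
// Hypothetical example: two counters with two meanings get two names.
class ImportSummary {
    // Returns { recordCount, errorCount } for the given input.
    static int[] summarise(String[] records) {
        int recordCount = 0;
        for (String record : records) {
            if (record != null) {
                recordCount++;
            }
        }

        // A separate variable rather than resetting and reusing the first one.
        int errorCount = 0;
        for (String record : records) {
            if (record != null && record.isEmpty()) {
                errorCount++;
            }
        }
        return new int[] { recordCount, errorCount };
    }
}
```

With a single reused count variable, the final line would have to capture the first result in yet another variable anyway, so nothing is gained by the reuse.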

Make abbreviations readable and unambiguous. For example, the abbreviated prefix rnd is often used to denote a random value, but it could also mean rounded. If it isn’t clear from the context, write it out; for instance, write randomAmount instead of rndAmount. As a general rule, source code is read more often than it is written, thus readability is more important than writing convenience. With this in mind, if an abbreviation just saves one or two characters, always write out the word.

Create abbreviations that can be pronounced. For example, use stringPos rather than stringPstn and recCount rather than rcrdCount. It should be possible to read program code aloud without twisting your tongue. Apart from the advantage of being easier to communicate, pronounceable identifiers are also easier to remember.

My Journey Through the World of Programming Languages

Photo by Phil Jackson

My journey through the world of programming languages began in 1987 with the blinking cursor on a black-and-white computer screen of an Atari ST 1040 computer. After a few hours of playing with the GFA BASIC interpreter, I was hooked. The graphical capabilities of the Atari computer made it possible to program Mandelbrot fractals, the Towers of Hanoi, the Breakout game, and all those things which newbie programmers like to entertain themselves with.

Quite a few of these programs looked peculiarly similar to what people programmed ten years later when the first Java applets appeared. But I am getting ahead of myself. Back in 1987, BASIC was the beginner language. The GFA BASIC dialect was considered quite modern at the time, since it didn’t have line numbers and it was a full featured procedural language, at least in principle. Yet, it was still a toy. After about six months I felt like writing more ambitious projects and I realised that I had outgrown BASIC. Someone gave me a copy of a C-compiler, so I started learning the C-language.

This was a good decision as it turned out later, because I was able to use C throughout the first five years of my career. I found Kernighan-Ritchie style C to be conceptually very close to the GFA BASIC I had started with except for pointers, which were completely new. The study of C led me to Unix. I began writing clones of Unix tools and utilities for my own use. This was the late 80s before the GNU and Linux phenomena appeared.

One such project was a text editor that I enhanced with optimised scrolling routines in 386 Assembler language. I wrote the editor after I had exchanged my Atari computer for a PC. After a few months I had a number of common Unix tools and a nice text editor at my disposal which I could use under MS-DOS. Then I read Andrew Tanenbaum’s Minix book and I got into system programming. I wrote a micro-kernel task scheduler for the 386 in Assembler. Multi-tasking was a fascinating thing that seemed to be out of reach for an average personal computer. At the time, I briefly considered expanding the micro-kernel into a more complete OS by adding memory management and file management. However, I soon realised the immensity of this task. I had just started studying informatics, and I figured that I wouldn’t be able to accomplish it while still attending lectures and doing homework.

At university, we were taught Pascal as a “first” programming language and Lisp as a second. Pascal was very easy, of course; it seemed like a verbose dialect of C. – Lisp, on the other hand, I found quite repulsive. – I could appreciate the underlying mathematical idea, the lambda calculus, but the syntax was just awful. I believe it was IEEE Scheme. The language seemed great for graph-theoretical problems, but unsuitable to express common algorithms in a natural way. In other words, I found it to be a language for eggheads.

At the time, the imperative programming paradigm was predominant. It seemed the best way to get things done, as development tools and libraries for imperative procedural languages were readily available. The next language I learned at the university was Modula 2. I thought of it as an elaboration of Pascal with emphasis on data abstraction and encapsulation. From Modula 2 I learned the importance of encapsulation. Although I didn’t use Modula 2 for practical applications, I was able to apply the conceptual foundation in my work that revolved around C programming.

After university, I worked in systems programming. I designed and implemented drivers for a company that manufactured proprietary hardware. Then I changed to work with another company in the field of machine translation and computer based training. After 5 years of coding in C, I thought it was time for a change. This was the early nineties, so I turned my attention to application programming with RAD tools which had just hit the market. I learned SQL inside out and created data-driven programs. Visual Basic 3.0 was the killer application in 1993, as it made the construction of Windows GUIs extremely easy. I was able to tie in with my prior Basic experience. Customers liked the productivity that comes with RAD.

After about a year, I dropped VB in favour of Delphi, which was superior for this purpose. Likewise, I could tie in with my previous Pascal experience. I learned the rudiments of object oriented programming with Object Pascal, which is odd given that C++ would have been the more natural path to object orientation after having programmed in C for many years. However, Object Pascal taught me proper componentisation. This was the mid-nineties and a lot of amazing things happened in the IT industry. The most important change was the commercial breakthrough of the Internet. Almost simultaneously, the Linux phenomenon happened. The IT industry boomed and technological progress was fast-paced. The Internet connected everybody everywhere and Linux brought corporate computing horsepower to the desktop.

As a result of these changes, I began coding HTML in 1996 and I learned JavaScript and Perl in 1997. The next year brought even more changes, as I decided to gear my business towards web development. Perl seemed like an idiosyncratic Unix solution born out of necessity. It was certainly practical for server side programming, but it was also rather painful and hackish. Fortunately, PHP appeared at around the same time and it offered a much cleaner solution for server programming.

Soon I found myself programming web applications in PHP most of the time. LAMP-based applications literally exploded on the Internet between 1998 and 2003. During this period, I also learned the rudiments of Java, C++, and C#. I was responsible for the management of projects implemented in all of these languages. Object-oriented programming had become the mainstream paradigm in the late nineties. I decided that I needed to take on one of these languages more seriously.

The obvious choice was Java, since it was general purpose, but still very strong in the field of web development. So I fully immersed myself in Java when the language made the transition from 1.4 to 1.5. At that point, Java was already mature and mainstream. As a latecomer to Java, the platform seemed huge to me, certainly larger than anything I had looked at before, including .NET. The sheer number of APIs was just unbelievable. It required a sustained effort of two years during which I read a shelf of Java books and began moving from trivial programming exercises to small projects and then to larger projects. Since the mid-2000s, Java has become my mainstay.

There are two reasons why I like Java. First, there is a fantastic eco-system connected to the platform. It ranges from best-of-breed IDEs, VMs, and app-servers to a gazillion libraries and frameworks, and (almost) everything is free. Second, Java is extremely scalable and robust. It is neither the purest object-oriented language nor the richest, but Java is probably the one language that transforms average programmers into software engineers. I argue that this is so because of the high level of standardisation and best practices endorsement in the Java community.

I know that there are quite a few people who debate that. However, there’s a reason why universities teach Java to freshmen and why corporations use Java for enterprise development. It offers the largest and possibly the most robust platform for developing industrial-strength software. Of course, not everything is hunky-dory in the Java department. I perceive that the main problem is the language itself. – It’s aging. – Although (or perhaps because) it forces programmers to write tidy code and relinquish dirty C tricks, it tends to be tedious, as it involves generous amounts of boilerplate code. It also lacks good paradigms for fine-grained concurrency control.

Fortunately, with the Scala language I discovered a possible solution for these problems. At this point (early 2009) I haven’t yet done any larger projects in Scala, but my eagerness to do so is growing. Adding the functional paradigm to my programming instruments is very beneficial. It even flows over into my Java work, since it has changed the way I phrase algorithms in Java. The only negative effect is that by learning Scala, the limitations of the Java language became more evident and thus more painful. While functional programming will probably grow in the near future, Java has such a strong position that it won’t just fade away. Many large systems have been created in Java, so there will be maintenance work for decades to come. Meanwhile, it will be interesting to see how fast the industry embraces functional programming.

Naming Conventions (2)

Language conventions are often omitted from formal programming conventions, which is a bit odd considering their importance. The most obvious language convention is the selection of the language itself. This point is so obvious that it’s often missed. Mind you, not every programmer is fluent in English. Just a fraction are native English speakers. Hence, English should be chosen only if all members of the development team are comfortable with it. Otherwise, it is more appropriate to choose the native language of the developer team. If the developer team is international, English is likely to be used as the common basis. If the conventions prescribe English and if some team members are less fluent in English, identifier names must be treated with special care. I recently came across a piece of software developed in Germany where identifier names for an organisational structure were assigned as follows:

German Term          | What it means       | Identifier Name Used
Organisationseinheit | organisational unit | department
Unternehmen          | branch/sector       | division
Unternehmensbereich  | division            | divisionArea
Abteilung            | department          | section

The meaning of these identifiers doesn’t quite correspond to what an English speaker would expect them to mean, which is rather confusing. Yet, if you look up these terms in a dictionary, you find that all of them, with the exception of divisionArea, are valid translations of the original German term. This is one example where it would have been better to use German identifiers instead. These terms appeared in source code that had English identifiers and German JavaDoc comments.

Perhaps as a rule of thumb, it’s better not to use English identifiers if the team is not comfortable with English documentation as well. Again, in an international environment this may not be an option. Therefore, it might be worthwhile to have an English speaker refactor code written by non-native speakers for the sake of clarity. Another aspect that pertains to language conventions is grammar. You might think that I am nitpicking when mentioning grammar in computer programs. The point is that proper grammar facilitates understanding. This is true for natural language as well as program code. The grammatical rules for identifier names are few and simple. Use nouns for class names and object references, because classes and objects abstract real world objects. In some cases, objects directly correspond to entities of the problem domain, such as Customer, Order, Department, User, etc. In other cases they do not.

It is important to qualify class names further if a single noun does not describe it sufficiently. Examples of qualified nouns are TransferredFunds, AppraisalStatistics, AvailableOptions. Use verbs for methods and functions, because methods abstract behaviour. Good examples for method names are addTax(), computeCRC(), checkAvailability(), getDepartment().

Avoid using nouns as method identifiers, such as department() instead of getDepartment() or adjectives such as available() instead of getAvailability(). An exception to this rule would be languages where methods directly represent properties, because property names should be either adjectives or nouns. Likewise, interfaces should be nouns, or in some special cases adjectives, such as Triangular, Centred, Deductible, etc. depending on their purpose. In the latter case, the adjective is derived from the verb deduct and by convention the interface defines a method with the same name, i.e. deduct(). Other examples for similarly named interfaces are Readable, Comparable, or Sortable.
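The Deductible convention can be sketched as follows. The signature of deduct(), the implementing class Allowance, and its getBalance() method are my assumptions; the text only fixes the interface name and the method name.

```java
// The adjective names the capability; the verb it derives from names the method.
interface Deductible {
    // Hypothetical signature: subtracts the amount and returns the remaining balance.
    int deduct(int amount);
}

// Hypothetical implementation: a noun for the class, verbs for its methods.
class Allowance implements Deductible {
    private int balance;

    Allowance(int balance) {
        this.balance = balance;
    }

    @Override
    public int deduct(int amount) {
        balance -= amount;
        return balance;
    }

    int getBalance() {
        return balance;
    }
}
```

Code that accepts a Deductible parameter then documents itself: whatever is passed in can have something deducted from it, and the method to call is deduct().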

Finally, we have identifiers for local variables, parameters, and fields. What should these look like? There are no hard rules for these program elements, since local variables, parameters and fields can represent anything from objects to properties, primitives, and even methods. It’s best if grammar conventions follow those of the type the identifier refers to.

Next I want to talk about identifier naming schemes. These are fixed sets of conventions for assigning names. Two widely used schemes are positional notation and Hungarian notation. Positional notation is typically used in legacy languages where identifier names are very short, as for example in older Fortran and Cobol programs. Another example is the 8.3 notation of the MS-DOS file system. Identifiers that use positional notation are usually unintelligible without a key. An example would be APRPOCT1, where the first two characters AP stand for the main program module “Accounts Payable”, RP stands for the sub-module “reporting” and OCT1 means “operating costs total 1”.

Needless to say, this notation defies easy readability and should not be used unless absolutely necessary. A more widely used scheme is Hungarian notation. Hungarian notation is characterised by prefixes, and sometimes postfixes, which are added to the variable name and which carry type information. For example, in the original Hungarian notation suggested by Charles Simonyi for the C language, b stands for boolean, i for integer, p for pointer, sz for zero-terminated strings, and so on. These prefixes can be combined. The name arrpszMessages, for example, refers to an array of pointers to zero-terminated strings.

Sometimes Hungarian notation is applied to conceptual types rather than implementation types. For instance, the identifier arrstrMessages refers to an array of strings without saying anything about pointers or zero-termination. Hungarian notation is still used today with certain languages, such as C, C++ and Delphi. In C++ the following prefixes are often used to denote scope: g_ for global, m_ for class members (fields), s_ for static members, and _ for local variables. Delphi is unusual in the sense that Hungarian notation denotes object types. The type names themselves are always prefixed with the letter T in Delphi, so a button class would be defined as TButton and an instance of that type would be prefixed with Btn or something similar, such as BtnOK, BtnCancel, BtnRetry, etc.

There is a lot of debate whether Hungarian notation is generally useful or not. The main criticism is that implementation types are somewhat irrelevant for the programmer writing code in a statically typed language, since types are checked by the compiler. When the programmer needs to know about the type of a variable or an object, modern IDEs can usually resolve it automatically. In this case, the prefixes just add unnecessary visual clutter. Hungarian notation still has its place in C, but it’s definitely not that useful with contemporary languages and development tools. The same can be said in principle about most other identifier naming schemes. Naming schemes generally attempt to add metadata to identifiers, or create artificial namespaces. Most modern languages provide better means to both ends.
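To illustrate the clutter argument, here is the same method written twice in Java. The prefixes arrstr, str and i follow the conceptual-type style described above; the class MessageLog and both method names are invented for the comparison.

```java
// Hypothetical comparison: identical logic with and without type prefixes.
class MessageLog {
    // Hungarian style: each prefix repeats what the declaration already states.
    static int countNonEmptyHungarian(String[] arrstrMessages) {
        int iCount = 0;
        for (String strMessage : arrstrMessages) {
            if (!strMessage.isEmpty()) {
                iCount++;
            }
        }
        return iCount;
    }

    // Plain style: the compiler and the IDE supply the type information.
    static int countNonEmpty(String[] messages) {
        int count = 0;
        for (String message : messages) {
            if (!message.isEmpty()) {
                count++;
            }
        }
        return count;
    }
}
```

Both versions behave identically. Should the parameter later become a List of strings, only the first version would need its identifiers renamed to stay truthful.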

For example, the Java language has annotations for metadata and packages for namespaces. There is usually no need to employ naming schemes with most modern languages. It’s probably more worthwhile to spend effort on finding intelligible and descriptive names rather than inventing clever naming schemes. Next time I will talk about the practical considerations in choosing good identifier names and provide some examples to illustrate best practices.
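As a small sketch of metadata carried by an annotation instead of a name prefix, consider the following. The annotation BelongsTo, the class EmployeeRecord and the helper ModuleLookup are all hypothetical. In a real project the module would more likely be expressed through a package declaration, which I have left out so that the example fits into a single file.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Hypothetical annotation that records which module a class belongs to.
@Retention(RetentionPolicy.RUNTIME)
@interface BelongsTo {
    String value();
}

// The class name stays short; the module lives in the annotation, not in a prefix.
@BelongsTo("payroll")
class EmployeeRecord {
}

class ModuleLookup {
    // Reads the annotation via reflection; returns null if the class carries none.
    static String moduleOf(Class<?> type) {
        BelongsTo belongsTo = type.getAnnotation(BelongsTo.class);
        return belongsTo == null ? null : belongsTo.value();
    }
}
```

Unlike a name prefix, the annotation can be queried by tools and does not lengthen every reference to the class.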