Digital Attrition

Naively, one might assume that digital artifacts, such as software, are not subject to the decay and attrition that affect physical objects. After all, any digital artifact reduces to a sequence of ones and zeros that, given durable storage, remains completely unaltered and would therefore function in exactly the same way in ten, a hundred, or a thousand years. However, this notion disregards an important aspect of digital products: they don’t exist on their own. A digital artifact almost always exists as part of a digital ecosystem and requires other components to be available in order to fulfil its function. At the very least, it depends on a set of conventions and standards. Even a simple text file, for example, requires a standard for how to encode letters.

This became painfully clear to me once again when the WordPress software on which this blog runs suddenly started behaving erratically a few weeks ago. It produced 404 page-not-found errors that were impossible to diagnose and fix. I had not changed anything, being happy with the look and functionality of the blog, so the WordPress installation had reached the ripe old age of three and a half years. The cause had to be sought somewhere in the operating platform, which in this case means the web server configuration. When I contacted the hosting provider, I was told that this problem had been diagnosed with older versions of WordPress and could only be cured by an upgrade.

[Image: previous blog theme]
I had no choice but to upgrade WordPress, and the result is before you. Since the old theme, which can still be seen in the thumbnail image, is not compatible with the latest WordPress version, I derived a new theme from the included twentyeleven package. It takes into account that screen resolutions have increased over the last few years, and it also provides a display theme for mobile devices. Curiously, while still offering the same set of functions and features as version 2.3.1, WordPress 3.2.1 has grown significantly in complexity. A quick sloccount analysis told me that its codebase has increased from 36,895 lines to 92,141 lines, not counting plugins and themes, and the average theme has roughly doubled in code size.

I am sure this phenomenon is not unfamiliar to anyone who has worked with computers for a number of years. Remember how MS Office 97 contained every feature you would ever need? Since text processing and spreadsheets reached maturity quite early in the game, some people would even say the same about earlier versions of Office. Yet Microsoft has successfully marketed five successor versions of MS Office since then, the latest being Office 2010. Needless to say, the more recent versions have gained significantly in complexity and size. But who needs all that? Studies have shown that most people only use a small core set of features. Unless you are a Visual Basic programmer or have specific, uncommon requirements, you would probably still do well with Office 97. Or would you?

On closer inspection, you probably would not, and this is where the attrition factor comes into play. In the case of Microsoft, it is safe to say that this effect has been engineered for the sake of continued profits. Not only are older versions no longer supported, they actually become incompatible with current versions. The change of file formats is a case in point. For example, do you know the differences and advantages of the Office x-formats (such as .docx and .xlsx) over the older .doc/.xls formats? The new ones are zipped, XML-based files and as such easier to process automatically. However, most people using older versions of Office or competing products cannot read these formats and are thus forced to upgrade or to obtain software extensions for compatibility.

This does not only apply to Microsoft products but, as previously mentioned, to digital artifacts in general. Remember floppy disks? Not long ago I found a box of them in the storage room. They contained sundry programs and files reaching back into the Atari and MS-DOS era. Not only do I no longer own a floppy drive, but even if I had one, I could not read these files. To access my earliest attempts at digital art and programming, for instance, I would have to read .PC2 and .GFA files on the GEMDOS file system, which would constitute a major archival effort. Perhaps I should keep them until I am retired and find some time for such projects. The surprising thing is how fast attrition has rendered digital works useless over the past decades. While I can still find ways to play an old vinyl record from the eighties, for example, it is almost impossible to access my digital records from the same era.

Naming Conventions (2)

Language conventions are often omitted from formal programming conventions, which is a bit odd considering their importance. The most obvious language convention is the choice of the language itself. This point is so obvious that it’s often missed. Mind you, not every programmer is fluent in English, and only a fraction are native English speakers. Hence, English should be chosen only if all members of the development team are comfortable with it; otherwise, it is more appropriate to choose the native language of the development team. If the team is international, English is likely to be used as the common basis. If the conventions prescribe English and some team members are less fluent in it, identifier names must be treated with special care. I recently came across a piece of software developed in Germany where identifier names for an organisational structure were assigned as follows:

German Term          | What it means       | Identifier Name Used
---------------------|---------------------|---------------------
Organisationseinheit | organisational unit | department
Unternehmen          | branch/sector       | division
Unternehmensbereich  | division            | divisionArea
Abteilung            | department          | section

The meaning of these identifiers does not quite correspond to what an English speaker would expect, which is rather confusing. Yet, if you look up these terms in a dictionary, you find that all of them, with the exception of divisionArea, are valid translations of the original German terms. The identifiers appeared in source code with English names and German JavaDoc comments; this is one example where it would have been better to use German identifiers instead.

Perhaps, as a rule of thumb, it’s better not to use English identifiers if the team is not comfortable with English documentation as well. Again, in an international environment this may not be an option. In that case, it might be worthwhile to have an English speaker refactor code written by non-native speakers for the sake of clarity. Another aspect of language conventions is grammar. You might think I am nitpicking when I mention grammar in computer programs, but the point is that proper grammar facilitates understanding. This is true for natural language as well as for program code. The grammatical rules for identifier names are few and simple. Use nouns for class names and object references, because classes and objects abstract real-world objects. In some cases, objects directly correspond to entities of the problem domain, such as Customer, Order, Department, or User; in other cases they do not.

It is important to qualify class names further if a single noun does not describe them sufficiently. Examples of qualified nouns are TransferredFunds, AppraisalStatistics, and AvailableOptions. Use verbs for methods and functions, because methods abstract behaviour. Good examples of method names are addTax(), computeCRC(), checkAvailability(), and getDepartment().
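
To make this concrete, here is a minimal Java sketch that follows these rules; the class, its field, and its methods are invented for illustration, not taken from any real codebase:

    // A qualified noun names the class.
    public class AppraisalStatistics {

        private final java.util.List<Double> appraisals = new java.util.ArrayList<>();

        // Verbs name the methods, because methods abstract behaviour.
        public void addAppraisal(double value) {
            appraisals.add(value);
        }

        public double computeAverage() {
            return appraisals.stream()
                             .mapToDouble(Double::doubleValue)
                             .average()
                             .orElse(0.0);
        }
    }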

Avoid using nouns as method identifiers, such as department() instead of getDepartment(), or adjectives, such as available() instead of getAvailability(). An exception to this rule are languages in which methods directly represent properties, because property names should be either adjectives or nouns. Likewise, interfaces should be nouns or, in some special cases, adjectives, such as Triangular, Centred, or Deductible, depending on their purpose. In the last case, the adjective is derived from the verb deduct, and by convention the interface defines a method with the same name, i.e. deduct(). Other examples of similarly named interfaces are Readable, Comparable, and Sortable.
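
A small Java sketch of this interface convention might look as follows; Deductible and deduct() follow the convention described above, while ExpenseAccount and its BigDecimal balance are details I made up for the example:

    // Adjective-named interface derived from the verb "deduct"; by convention it
    // defines a method of the same name.
    public interface Deductible {
        void deduct(java.math.BigDecimal amount);
    }

    // A noun-named class implementing the interface.
    class ExpenseAccount implements Deductible {

        private java.math.BigDecimal balance = java.math.BigDecimal.ZERO;

        @Override
        public void deduct(java.math.BigDecimal amount) {
            balance = balance.subtract(amount);
        }
    }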

Finally, we have identifiers for local variables, parameters, and fields. What should these look like? There are no hard rules for these program elements, since local variables, parameters, and fields can represent anything from objects to properties, primitives, and even methods. It’s best if their grammar follows that of the type the identifier refers to, as sketched below.
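
Here is a small, hypothetical Java illustration of that guideline: each field name borrows its grammar from what the field represents.

    class Customer { }                      // stub type, only needed for this example

    class Order {
        private Customer customer;          // object reference: noun
        private boolean payable;            // boolean property: adjective
        private int itemCount;              // primitive quantity: qualified noun
    }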

Next, I want to talk about identifier naming schemes. These are fixed sets of conventions for assigning names. Two widely used schemes are positional notation and Hungarian notation. Positional notation is typically used in legacy languages where identifier names are very short, as for example in older Fortran and Cobol programs. Another example is the 8.3 notation of the MS-DOS file system. Identifiers that use positional notation are usually unintelligible without a key. An example would be APRPOCTT, where the first two characters AP stand for the main program module “Accounts Payable”, RP stands for the sub-module “reporting”, and OCTT means “operating costs total”.
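
Put into a (hypothetical) Java snippet, the difference between a positional name and a descriptive one looks like this:

    class PositionalNotationExample {
        // AP = accounts payable, RP = reporting, OCTT = operating costs total;
        // unreadable without that key.
        double aprpoctt;

        // A self-describing alternative.
        double accountsPayableReportingOperatingCostsTotal;
    }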

Needless to say, this notation defies easy readability and should not be used unless absolutely necessary. A more widely used scheme is Hungarian notation. Hungarian notation is characterised by prefixes, and sometimes postfixes, which are added to the variable name and which carry type information. For example, in the original Hungarian notation suggested by Charles Simonyi for the C language, b stands for boolean, i for integer, p for pointer, sz for zero-terminated strings, and so on. These prefixes can be combined: the name arrpszMessages, for example, refers to an array of pointers to zero-terminated strings.

Sometimes Hungarian notation is applied to conceptual types rather than implementation types. For instance, the identifier arrstrMessages refers to an array of strings without saying anything about pointers or zero-termination. Hungarian notation is still used today with certain languages, such as C, C++, and Delphi. In C++ the following prefixes are often used to denote scope: g_ for globals, m_ for class members (fields), s_ for static members, and _ for local variables. Delphi is unusual in the sense that Hungarian notation is applied to object types. Type names themselves are always prefixed with the letter T in Delphi, so a button class would be defined as TButton, and an instance of that type would be prefixed with Btn or something similar, such as BtnOK, BtnCancel, or BtnRetry.
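
Transferred to Java, Hungarian notation on conceptual types would look roughly like this (the field names are invented); the unprefixed versions underneath show what the argument in the next paragraph is about:

    class MessageBuffer {
        // Hungarian-style prefixes on conceptual types.
        String[] arrstrMessages;   // arr = array, str = string
        boolean bVisible;          // b = boolean
        int iRetryCount;           // i = integer

        // The same fields without prefixes; the declared types already carry
        // this information.
        String[] messages;
        boolean visible;
        int retryCount;
    }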

There is a lot of debate about whether Hungarian notation is generally useful. The main criticism is that implementation types are somewhat irrelevant to the programmer writing code in a statically typed language, since types are checked by the compiler. When the programmer needs to know the type of a variable or an object, modern IDEs can usually resolve it automatically. In that case, the prefixes just add unnecessary visual clutter. Hungarian notation still has its place in C, but it is definitely not that useful with contemporary languages and development tools. The same can be said, in principle, about most other identifier naming schemes. Naming schemes generally attempt to add metadata to identifiers or to create artificial namespaces, and most modern languages provide better means to both ends. The Java language, for example, has annotations for metadata and packages for namespaces.
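
As a rough sketch (the package and class names are made up), this is how Java covers both needs without any identifier prefixes:

    // The package declaration provides the namespace.
    package com.example.billing;

    // The annotation carries metadata that would otherwise end up in a name prefix.
    @Deprecated
    public class LegacyInvoiceFormatter {

        // Formats an invoice number for display.
        public String format(long invoiceNumber) {
            return String.format("INV-%08d", invoiceNumber);
        }
    }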

With such facilities available, there is usually no need to employ naming schemes in modern languages. It’s probably more worthwhile to spend effort on finding intelligible and descriptive names than on inventing clever naming schemes. Next time I will talk about practical considerations in choosing good identifier names and provide some examples to illustrate best practices.