Language conventions are often omitted from formal programming conventions, which is a bit odd considering their importance. The most obvious language convention is the selection of the language itself. This point is so obvious that it’s often missed. Mind you, not every programmer is fluent in English, and only a fraction are native English speakers. Hence, English should be chosen only if all members of the development team are comfortable with it. Otherwise, it is more appropriate to choose the native language of the development team. If the development team is international, English is likely to be used as the common basis. If the conventions prescribe English and some team members are less fluent in English, identifier names must be treated with special care. I recently came across a piece of software developed in Germany where identifier names for an organisational structure were assigned as follows:
| German Term | What it means | Identifier Name Used |
|---|---|---|
| Organisationseinheit | organisational unit | department |
| Unternehmen | branch/sector | division |
| Unternehmensbereich | division | divisionArea |
| Abteilung | department | section |
The meaning of these identifiers doesn’t quite correspond to what an English speaker would expect them to mean, which is rather confusing. Yet, if you look up these terms in a dictionary, you find that all of them, with the exception of divisionArea, are valid translations of the original German term. This is one example where it would have been better to use German identifiers instead. These terms appeared in source code that had English identifiers and German JavaDoc comments. Perhaps as a rule of thumb, it’s better not to use English identifiers if the team is not comfortable with English documentation as well. Again, in an international environment this may not be an option. In that case, it might be worthwhile to have an English speaker refactor code written by non-native speakers for the sake of clarity.
Another aspect that pertains to language conventions is grammar. You might think that I am nitpicking when mentioning grammar in computer programs. The point is that proper grammar facilitates understanding. This is true for natural language as well as program code. The grammatical rules for identifier names are few and simple. Use nouns for class names and object references, because classes and objects abstract real world objects. In some cases, objects directly correspond to entities of the problem domain, such as Customer, Order, Department, User, etc. In other cases they do not. It is important to qualify a class name further if a single noun does not describe the class sufficiently. Examples of qualified nouns are TransferredFunds, AppraisalStatistics, AvailableOptions. Use verbs for methods and functions, because methods abstract behaviour. Good examples of method names are addTax(), computeCRC(), checkAvailability(), getDepartment(). Avoid using nouns as method identifiers, such as department() instead of getDepartment(), or adjectives, such as available() instead of getAvailability(). An exception to this rule would be languages where methods directly represent properties, because property names should be either adjectives or nouns. Likewise, interfaces should be nouns, or in some special cases adjectives, such as Triangular, Centred, Deductible, etc. depending on their purpose. In the case of Deductible, the adjective is derived from the verb deduct and by convention the interface defines a method with the same name, i.e. deduct(). Other examples of similarly named interfaces are Readable, Comparable, or Sortable. Finally, we have identifiers for local variables, parameters, and fields. What should these look like? There are no hard rules for these program elements, since local variables, parameters and fields can represent anything from objects to properties, primitives, and even methods. It’s best if grammar conventions follow those of the type the identifier refers to.
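The grammar rules above can be sketched in Java roughly as follows. Only the identifier names Deductible, deduct(), Department, TransferredFunds and getDepartment() are taken from the text; the fields, the getName() accessor, the amounts in cents and the GrammarDemo class are assumptions made up for the illustration.

```java
// Hypothetical illustration: only the identifier names come from the text,
// the fields and method bodies are invented for this example.
class GrammarDemo {
    public static void main(String[] args) {
        TransferredFunds funds = new TransferredFunds(10_000L, new Department("Sales"));
        System.out.println(funds.deduct(2_500L));            // remaining balance in cents
        System.out.println(funds.getDepartment().getName()); // name of the owning department
    }
}

// Adjective interface derived from the verb "deduct".
interface Deductible {
    long deduct(long amountInCents);
}

// Plain noun: corresponds directly to an entity of the problem domain.
class Department {
    private final String name;

    Department(String name) {
        this.name = name;
    }

    // Verb phrase: getName() rather than the bare noun name().
    String getName() {
        return name;
    }
}

// Qualified noun: "Funds" alone would not describe the class sufficiently.
class TransferredFunds implements Deductible {
    private long balanceInCents;
    private final Department department;

    TransferredFunds(long balanceInCents, Department department) {
        this.balanceInCents = balanceInCents;
        this.department = department;
    }

    // Verb: the method abstracts behaviour and returns the new balance.
    @Override
    public long deduct(long amountInCents) {
        balanceInCents -= amountInCents;
        return balanceInCents;
    }

    // Verb phrase instead of the bare noun department().
    Department getDepartment() {
        return department;
    }
}
```

Read aloud, a call such as funds.deduct(amount) forms something close to a sentence, which is the practical benefit of keeping nouns and verbs in their places.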
Next I want to talk about identifier naming schemes. These are fixed sets of conventions for assigning names. Two widely used schemes are positional notation and Hungarian notation. Positional notation is typically used in legacy languages where identifier names are very short, as for example in older Fortran and COBOL programs. Another example is the 8.3 notation of the MS-DOS file system. Identifiers that use positional notation are usually unintelligible without a key. An example would be APRPOCT1, where the first two characters AP stand for the main program module “Accounts Payable”, RP stands for the sub-module “reporting” and OCT1 means “operating costs total 1”. Needless to say, this notation defies easy readability and should not be used unless absolutely necessary. A more widely used scheme is Hungarian notation. Hungarian notation is characterised by prefixes, and sometimes postfixes, which are added to the variable name and which carry type information. For example, in the original Hungarian notation suggested by Charles Simonyi for the C language, b stands for boolean, i for integer, p for pointer, sz for zero-terminated strings, and so on. These prefixes can be combined. The name arrpszMessages, for example, refers to an array of pointers to zero-terminated strings. Sometimes Hungarian notation is applied to conceptual types rather than implementation types. For instance, the identifier arrstrMessages refers to an array of strings without saying anything about pointers or zero-termination.
Hungarian notation is still used today with certain languages, such as C, C++ and Delphi. In C++ the following prefixes are often used to denote scope: g_ for global, m_ for class members (fields), s_ for static members, and _ for local variables. Delphi is unusual in the sense that Hungarian notation denotes object types. The type names themselves are always prefixed with the letter T in Delphi, so a button class would be defined as TButton and an instance of that type would be prefixed with Btn or something similar, such as BtnOK, BtnCancel, BtnRetry, etc. There is a lot of debate about whether Hungarian notation is generally useful or not. The main criticism is that implementation types are somewhat irrelevant for the programmer writing code in a statically typed language, since types are checked by the compiler. When the programmer needs to know about the type of a variable or an object, modern IDEs can usually resolve it automatically. In this case, the prefixes just add unnecessary visual clutter. Hungarian notation still has its place in C, but it’s definitely not that useful with contemporary languages and development tools. The same can be said in principle about most other identifier naming schemes. Naming schemes generally attempt to add metadata to identifiers, or create artificial namespaces. Most modern languages provide better means to both ends. For example, the Java language has annotations for metadata and packages for namespaces. There is usually no need to employ naming schemes with most modern languages. It’s probably more worthwhile to spend effort on finding intelligible and descriptive names rather than inventing clever naming schemes.
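To make the clutter argument concrete, here is a small Java sketch that writes the same method twice, once with Hungarian-style prefixes and once with descriptive names. The method, its parameters and the prefixes lst, str, i and cnt are hypothetical; they are not taken from any particular coding standard.

```java
import java.util.Arrays;
import java.util.List;

class NamingSchemeDemo {
    // Hungarian style: the prefixes repeat what the declared types already say.
    static int countLongMessagesHungarian(List<String> lstStrMessages, int iMinLen) {
        int iCnt = 0;
        for (String strMsg : lstStrMessages) {
            if (strMsg.length() >= iMinLen) {
                iCnt++;
            }
        }
        return iCnt;
    }

    // Descriptive style: the names say what the values mean, the compiler checks the types.
    static int countLongMessages(List<String> messages, int minimumLength) {
        int longMessageCount = 0;
        for (String message : messages) {
            if (message.length() >= minimumLength) {
                longMessageCount++;
            }
        }
        return longMessageCount;
    }

    public static void main(String[] args) {
        List<String> messages = Arrays.asList("ok", "retry later", "cancelled");
        System.out.println(countLongMessagesHungarian(messages, 5)); // prints 2
        System.out.println(countLongMessages(messages, 5));          // prints 2
    }
}
```

Both versions behave identically. The difference lies only in how much the reader has to decode before seeing what is being counted.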
Next time I will talk about the practical considerations in choosing good identifier names and provide some examples to illustrate best practices.
Nomen est omen. This old Latin proverb means something like “the name says it all”. The ancients were superstitious, and they believed that names carry special powers. Names were thought to predispose their bearers to certain fortunes or to certain qualities. Today, we have largely done away with such superstitions. In the scientific worldview, names are nothing but symbolic artefacts without intrinsic powers. However, there is one field where this Latin proverb still applies, and where it is indeed more true than ever. Oddly, this field wasn’t even known to the Romans. I am talking about software development, of course. Names have a special importance to software, or perhaps better, the practice of naming does. The first thing I do when looking at a piece of software written by somebody else is to look at the names given to variables and other program elements. My experience has shown that the quality of the identifier names corresponds directly to the overall quality of the program code.
Identifier names are a crucial part of any program. They provide clues about semantics and program logic. They make or break code readability. They determine whether code is self-documenting or not. So, the old Latin proverb “nomen est omen” can be applied as follows: If you read through a piece of code for the first time and you have no idea what the variables are supposed to represent, or what the methods are supposed to accomplish, then this is a bad omen. It suggests that the author was not quite sure how to formulate the problem (or didn’t care) and it can be expected that the other aspects of the program are at least as confusing. If you read through a piece of code and the identifier names are easily comprehensible and fit together like the pieces of a jigsaw puzzle, then this is a good omen. It suggests that the author had a clear idea of the task at hand. Naturally, there are many intermediate levels between these two opposites.
While contemporary code editors and IDEs are very powerful, identifier naming is one of the things that cannot currently be automated by these tools. It is up to the programmer to choose identifier names. Since good naming practice is essential for code maintainability, we will first define what makes a good naming practice and then look at some concrete examples of good and bad strategies. There are three basic ingredients for a good naming practice: (1) semantic precision, (2) consistency, (3) the right amount of verbosity. The first aspect is by far the hardest to get right. Semantic precision means that the chosen name is appropriate, unambiguous, well defined, and compliant with conventions. Consistency means that names are formed according to common patterns and that terms are used consistently throughout the program. The right amount of verbosity relates to identifier length. It means that names do not leave anything open to guesswork while avoiding redundancy.
An identifier usually consists of a single word or a combination of words. In case of the latter, the individual words are often set apart by using CamelCase or the “_” underscore character. One of the most commonly found questionable practices is the use of abbreviations instead of written-out words, for example rptCount instead of repeatCount. The word count indicates that this variable is a counter, but what is counted? Repetitions, receptors, recipients, red points, or something else? Reptiles perhaps? By adding a mere three characters and writing out the first word, the ambiguity is eliminated. This doesn’t mean that abbreviations are always bad. For example, nothing speaks against widely used acronyms like URI for UniformResourceIdentifier, or LCD instead of LiquidCrystalDisplay. Likewise, domain-specific abbreviations are acceptable if the program is written within that domain, for example FOB (free on board) in the shipping domain, or VAT (value added tax) in the accounting domain. By definition this also includes acronyms in the software domain, such as i18n for internationalisation, or ftp for file transfer protocol. In addition, there are a number of pre- and postfixes used ubiquitously in programming, such as min, max, fmt, pos, len, num, cnt, etc. which every programmer understands. Generally speaking, abbreviations and acronyms should be used sparingly and only when they are common and free of ambiguity.
This also means that one-letter or two-letter variable names, such as a, b, c, f1, x2, etc. are generally a bad idea, because they say nothing about the content of the variable. There is one exception to this rule: loop indices. Since loop indices (or iterator variables) are only used to iterate through a collection of values, they don’t have any intrinsic meaning. So one might as well give them one-letter names. By convention, the letters i, j, k, etc. are used, where the alphabetic order corresponds to the loop nesting level. This means i is used for the outermost loop, j for the second nested loop, k for the third, and so on. This is standard practice for loop indices. In other cases, however, the index position carries semantics of its own, so the indices do have meaning. For example, one might define an array of counters, where counter[0] contains the number of students, counter[1] contains the number of passed tests and counter[2] contains the number of failed tests. Since the index numbers themselves don’t communicate any meaning, it is appropriate to define an enumerable type or a set of integer constants that conveys this meaning, for example STUDENTS=0, PASSED_TESTS=1, FAILED_TESTS=2, and so on.
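A minimal Java sketch of both conventions follows: i and j as plain loop indices, and an enumerated type that names the positions of the counter array. The constant names come from the example above; the Counter type name, the tally() helper and the results[i][j] data layout are assumptions for the illustration.

```java
class CounterDemo {
    // Names the meaning of each index position; ordinal() yields 0, 1, 2.
    enum Counter { STUDENTS, PASSED_TESTS, FAILED_TESTS }

    // Hypothetical data layout: results[i][j] is true if student i passed test j.
    static int[] tally(boolean[][] results) {
        int[] counter = new int[Counter.values().length];
        for (int i = 0; i < results.length; i++) {            // outermost loop: students
            counter[Counter.STUDENTS.ordinal()]++;
            for (int j = 0; j < results[i].length; j++) {     // nested loop: tests
                if (results[i][j]) {
                    counter[Counter.PASSED_TESTS.ordinal()]++;
                } else {
                    counter[Counter.FAILED_TESTS.ordinal()]++;
                }
            }
        }
        return counter;
    }

    public static void main(String[] args) {
        boolean[][] results = { {true, false}, {true, true}, {true, false} };
        int[] counter = tally(results);
        for (Counter kind : Counter.values()) {
            System.out.println(kind + " = " + counter[kind.ordinal()]);
        }
    }
}
```

In Java an EnumMap would avoid the ordinal() calls altogether. The array form is kept here only because it mirrors the counter[0], counter[1], counter[2] example.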
This is all pretty much standard programming practice. Next time we will look at common identifier naming schemes, their merits and demerits, as well as language conventions.
First I should explain what I mean by cup typing. When you buy a cup of coffee, you have the choice of a short, tall, or grande sized cup. Sometimes you can also choose decaf or regular. When you declare an integer variable in Java, you have the choice of byte, short, int, and long. Sometimes (in languages like C++) you can also choose between signed and unsigned. The similarity is obvious. And it doesn’t end with integers. Floating point numbers come in two different flavours, namely as regular “float” values (32-bit) and as “double” values (64-bit). Characters come in the form of 7-bit, 8-bit and 16-bit encodings. In statically typed programming languages, multiplicity is the rule rather than the exception. While Fortran and Pascal offer a moderate choice of two different integers, Java offers four plus a BigInteger implementation (“extra grande”) for really large numbers. However, it’s C# that takes the biscuit in cup typing with 9 different integer types and 3 different real types. Database systems are keeping up with this trend. For example, the popular MySQL RDBMS offers 5 different integer types and 3 different real types. Seeing the evolution from Fortran to C#, it almost appears as if type plurality has increased over time. We must ask two things: How did this come about and is it useful? We appreciate the fact that we can buy coffee in different cup sizes to match our appetite, but does the same advantage apply to data types?
The first question is easy to answer. Graduated types result from the fact that computer architectures have evolved in powers of two. Over several decades, the register width of the CPU of an average PC has expanded from 8 to 16 to 32 to 64 bits. Each step facilitated the use of larger types, and numeric types in particular were closely matched to register width. Expressing data types in a machine-oriented way appears to be a C legacy, and quite a few newer programming languages have been strongly influenced by C. It is my contention that while curly braces and ternary operators are an acceptable C-language tradition, graduated types are definitely not. Why not? Because they run counter to abstraction. They hinder rather than serve the natural expression of mathematical constructs. Have you ever wondered whether you should index an array with byte- or short-sized integers? Whether you should calculate an offset using int or long values? Whether method calls comply with type widening rules? Whether an arithmetic operation might overflow? Whether a type cast may lose significant bits or not? All of this is a complete waste of time in my view. Wouldn’t it be better to let the virtual machine worry about such low-level questions, or the library if a VM is not present? Cup typing gets positively annoying when you have to write an API that is flexible enough to deal with parameters of different widths. If there’s no type hierarchy, you inevitably end up with multiple overloaded constructors and methods (one for each type) which add unnecessary bulk. The Java APIs are full of such examples and the valueOf() method is a case in point; it’s really ugly.
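The API bulk is easy to reproduce. The following Java sketch uses a hypothetical describe() method with one overload per integer width, in the style of String.valueOf(), and shows how the widening rules decide which overload is actually called.

```java
class OverloadDemo {
    // One overload per integer width; describe() is a made-up example method.
    static String describe(byte value)  { return "byte:"  + value; }
    static String describe(short value) { return "short:" + value; }
    static String describe(int value)   { return "int:"   + value; }
    static String describe(long value)  { return "long:"  + value; }

    public static void main(String[] args) {
        byte small = 42;
        System.out.println(describe(small));      // prints byte:42
        System.out.println(describe(small + 1));  // prints int:43, because byte + int is promoted to int
        System.out.println(describe(42L));        // prints long:42
    }
}
```

Four methods do the work of one, and the second call silently lands in a different overload than a casual reader might expect.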
However, graduated types are beyond ugly; they are outright evil. They cause an enormous number of bugs and the small numeric types are the prime offenders. I wonder how many times a signed or unsigned byte has caused erratic program behaviour by silently overflowing. Such bugs can be hard to find and, worse, they often don’t show until certain boundary conditions are reached. Casts that shorten types also belong to the usual suspects. I shall not even mention the insidious floating point operations that regularly unsettle newbie programmers with funny looking computation results. What numeric types does one really need? Integer numbers and real numbers. One of each and not more. If you want to be generous as a language designer, you can throw in an optimised implementation of a complex number type and a rational number type. However, in an object-oriented language with operator overloading, it’s fairly easy to express these in a library. The fixed-point type (sometimes called decimal type) is the subset of the rational type where the denominator is always a power of ten. So, that’s really all you need: a clean representation of the basic mathematical number systems.
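The bugs mentioned above take only a few lines of Java to reproduce. This is a minimal sketch; the class and the helper names increment() and narrow() are made up for the illustration.

```java
class OverflowDemo {
    // Increments a byte counter; the compound assignment hides a narrowing cast.
    static byte increment(byte counter) {
        counter += 1;   // equivalent to counter = (byte) (counter + 1)
        return counter;
    }

    // Narrowing cast: keeps only the low 8 bits of the int value.
    static byte narrow(int value) {
        return (byte) value;
    }

    public static void main(String[] args) {
        System.out.println(increment((byte) 127));   // prints -128: silent overflow
        System.out.println(narrow(300));             // prints 44: significant bits are lost
        System.out.println(Integer.MAX_VALUE + 1);   // prints -2147483648: int wraps around as well
        System.out.println(0.1 + 0.2);               // prints 0.30000000000000004
    }
}
```

None of these lines produces a compiler error or a runtime exception, which is exactly why such bugs tend to surface late.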
At this point, you might object: “but the CPU register is only x bits wide,” or “how do I allocate an array of fifty thousand short values?”, or “can I still have 8-bit chars?” Unfortunately, there is no simple answer to these questions. The natural way to represent integers is to always use the machine’s native word width, but unfortunately that doesn’t solve the problem. First of all, the word width is architecture dependent. Second, it might be wasteful for large arrays that hold small numbers and on the other hand it would still be too small for applications that need big integers. The solution is of course a variable size type, i.e. an integer representation that can grow from byte size to multiple word lengths. We have variable length strings, so why shouldn’t we have variable length numbers? It seems perfectly natural. There is certainly some overhead involved, because variable length types need special encoding. The overhead will most likely be due to loading a descriptor value and/or to bit shifting operations. Variable length numbers don’t come for free, but they do offer tremendous advantages. They relieve the programmer of making type width decisions, of documenting these decisions and, worse, of changing the type width later if a decision turns out to be inadequate. Furthermore, they eliminate the above-mentioned bugs resulting from silent overflows and type cast errors, not to mention API proliferation due to type plurality. Thus variable length numbers are generally preferable to common fixed width types.
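Java’s BigInteger is a library-level approximation of such a variable length integer, so it can serve for a rough sketch of the difference. The factorial functions below are merely a convenient, made-up workload that outgrows 64 bits quickly.

```java
import java.math.BigInteger;

class VariableLengthDemo {
    // Variable length integer: the representation grows as needed.
    static BigInteger factorial(int n) {
        BigInteger result = BigInteger.ONE;
        for (int i = 2; i <= n; i++) {
            result = result.multiply(BigInteger.valueOf(i));
        }
        return result;
    }

    // Fixed 64-bit integer: overflows silently once n exceeds 20.
    static long factorialFixed(int n) {
        long result = 1;
        for (int i = 2; i <= n; i++) {
            result *= i;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(factorial(20) + " vs " + factorialFixed(20)); // both agree
        System.out.println(factorial(21) + " vs " + factorialFixed(21)); // the long result is wrong
        System.out.println(factorial(21).bitLength() + " bits needed for 21!");
    }
}
```

The caller of factorial() never has to decide on a width. The price is the overhead described above: object allocation and multi-word arithmetic instead of a single machine instruction.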
Of course, there are situations where you know that you will never need more than a byte. There are also situations where performance is paramount. In addition, APIs and libraries based on multiple fixed types are not going to disappear overnight. To provide backward compatibility and to offer optimisation pathways to the programmer, a language could present these as subsets of the mathematical type. For example, if a language defines the keyword “int” for variable length integer numbers, then “int(8)” could mean a traditional byte, “int(16)” could mean a short word, and so on. Now, this is a bit like reintroducing cup typing through the back door. Therefore the use of subtypes for general purpose computations should be discouraged. However, it’s always better to have a choice of fixed and variable types than having no variable types at all.
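Since no mainstream language offers an “int(8)” notation of this kind, the following Java sketch only approximates the idea. It treats a fixed width as a range-checked subset of a variable length integer, with BigInteger standing in for the mathematical type; fits() and constrain() are hypothetical helpers.

```java
import java.math.BigInteger;

class SubsetTypeDemo {
    // True if the value fits into a signed two's-complement integer of the given width.
    // bitLength() excludes the sign bit, so an n-bit type holds values with bitLength() < n.
    static boolean fits(BigInteger value, int bits) {
        return value.bitLength() < bits;
    }

    // Hypothetical stand-in for assigning to "int(bits)": fails loudly instead of wrapping.
    static BigInteger constrain(BigInteger value, int bits) {
        if (!fits(value, bits)) {
            throw new ArithmeticException(value + " does not fit into int(" + bits + ")");
        }
        return value;
    }

    public static void main(String[] args) {
        System.out.println(constrain(BigInteger.valueOf(127), 8));   // prints 127
        System.out.println(constrain(BigInteger.valueOf(128), 16));  // prints 128
        try {
            constrain(BigInteger.valueOf(128), 8);
        } catch (ArithmeticException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The important difference to a traditional byte is that an out-of-range value is rejected rather than silently truncated. For the built-in widths, BigInteger already offers this behaviour through byteValueExact(), shortValueExact(), intValueExact() and longValueExact().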