Naming Conventions (3)

Today I am going to talk about best practices in identifier naming. Before thinking about concrete strategies for devising identifier names, one should answer the following three questions: 1. What (human) language to use. 2. Whether to use a naming scheme or not. 3. How platform and language choices affect identifier naming. While I generally don’t like to use naming schemes for the reasons mentioned in the last article, it’s almost always a good idea to adopt existing conventions for the specific platform, language and domain one is working in. Certain languages have established strong conventions about how identifiers are formed. For example, nothing keeps you from assigning lower-case names to classes in Java, but it surely raises eyebrows among Java programmers. Likewise, Java enthusiasts would probably look down their noses at code that uses underscores instead of lower CamelCase.

Other programming languages, like PHP or Perl, have less strict conventions for identifier names. Whatever the case may be, it’s invariably useful to honour  widely used conventions and apply them consistently, as it avoids misunderstandings and thus reduces the chance for errors. Apart from consistency and conventions, the most important point to get right is semantic precision. The identifier name should be stated as clearly and as intelligible as possible, without being overly wordy. Ambiguity is the greatest enemy here, especially when there are several identifiers floating around that have similar, yet subtly different meanings. Consider the following example:

for (Event event: events) {
  trainees += event.trainees.size();
  if (event.status == 1 && event.end >= time) {
    for (Person train : event.trainees)
      if (train.passed) {
        trainee.add(train);
        trained++;
     }
     else training.add(train);
  }
  else if (event.status == 0 || event.end < time) {
    training.addAll(event.trainees);
}

This code looks convoluted because of the overuse of the word “train”. What it does is this: it iterates over all workshop events and creates a collection of trainees that have successfully completed a workshop and another of trainees that did not complete a workshop or are still being trained. It also counts the total number of trainees and the number of successful ones. Apart from the ambiguous “train…”, readability suffers because of non-descriptive variable names and the use of constant integer literals. Here is a more readable version:

for (Event workshop: allWorkshops) {
  numOfTrainees += workshop.trainees.size();
  if (workshop.status == COMPLETED && workshop.end >= today) {
    for (Person trainee : workshop.trainees)
      if (trainee.passed) {
        graduates.add(trainee);
        numOfGraduates++;
     }
     else peopleInTraining.add(trainee);
  }
  else if (workshop.status == CANCELLED || workshop.end < today) {
    peopleInTraining.addAll(event.trainees);
}

This brief example shows how to transform an ambiguous piece of code into one that can be read and understood quite easily. The question is: what are the underlying principles? Unfortunately, the answer to this question is complicated. Rather than attempting an analysis, I suggest to look at existing best practices that are widely adopted. Steven McConnel describes such a set of practices in his acclaimed book Code Complete (2nd Edition) in Chapter 11, The Power Of Variable Names. He writes: “An effective technique for coming up with a good name is to state in words what the variable represents. Often that statement itself is the best variable name. It’s easy to read because it doesn’t contain cryptic abbreviations, and it’s unambiguous.”

In the remainder of this article, I will try to present a summary of Steve McConnell’s set of best practices. Describe the “what”, not the “how”. A good identifier name relates to the problem domain rather than to the solution approach or algorithm. For example, the name tmpStateArray might hold an array of communication states of different chat users in a chat application. The name is clearly computerish and algorithm-oriented; a better name would probably be chatStates or communicationStates. Likewise, names such as bitFlag, statusFlag, optionString, etc. should be avoided unless it is totally clear what they mean. Aim for optimum identifier length. Identifier length is important. It is almost always a compromise between conveying sufficient meaning and maintaining readability; in other words, it’s always a compromise between ambiguity and length.

Consider the number of graduated trainees in the sales department. The name numberOfGraduatedTraineesInSalesDepartment is most unambiguous, but rather unwieldy. The names trainees, salesTrainees, gradSalTrn  on the other hand, are more convenient but too short and too ambiguous. A reasonable compromise would be numGradTraineesInSales.

Qualify computed values. – If a variable is used for a computed numeric value, such as totals, averages, etc. qualify the variable name accordingly. Examples are: num…, count…, sum…, average…, max…, min… and so on. Use common opposite names for variables that express the boundaries of an interval, or an operation that involves opposites, such as begin/end, first/last, min/max, old/new, lowest/highest, next/previous, source/target, sender/recipient and so on.

Use letters i, j, k for loop index variables and iterators. – This convention is probably as old as C-programming and it is a good practice, since there is no point in inventing fancy names for loop index variables. As always, there is an exception to the rule. If the index is used outside the loop, for example to count records, then a more descriptive name such as recordCount makes the reference outside the loop more readable.

Use is… or has… prefixes for boolean variables and methods that return boolean values. For example, the variable name isCustomer expresses boolean semantics much clearer than just customer. An exception to this rule are words that clearly express a yes/no status, such as ready, done, failed, found, etc. In an object-oriented language, use nouns for classes, objects and interfaces. Use verbs for class methods and functions. Sometimes, it is also meaningful to use adjectives for interfaces or methods, for example Quadratic, Sharable, isQuadratic(), or isShared().

Avoid scope prefixes unless your language forces you to do so. For example, it was customary in the past to prefix class names or functions with module names, for example: Payroll_EmployeeRecord, Payroll_Allowances, Payroll_Reimbursements, and so on. This practice battles namespace pollution, but it is a very poor replacement for actual namespaces, because it leads to incredibly long and cluttered identifiers. Use this only if your language does not support the notions of modules, packages, or namespaces.

Avoid shadowing. If you use identifiers in nested scopes, don’t use the same names to avoid shadowing. For example, if you declare a local variable with the same name as a field/member of the class, then the former shadows the latter. This is not a neat programming trick, but a recipe for disaster. It is quite confusing to read, and even more confusing to debug. An exception to this rule would be getters and setters, where the field/member is explicitly qualified with ‘this’.

Avoid reusing variables, unless the semantics are maintained. Imagine you have a variable named count which is used in one loop to count records, and in another to count errors. It may be tempting to reuse the integer already declared, but it is far more readable and maintainable to use two separate integer variables appropriately named recordCount and errorCount.

Make abbreviations readable and unambiguous. For example, the abbreviated prefix rnd is often used to denote a random value, but it could also mean rounded. If it isn’t clear from the context, write it out; for instance, write randomAmount instead of rndAmount. As a general rule, source code is read more often than it is written, thus readability is more important than writing convenience. With this in mind, if an abbreviation just saves one or two characters, always write out the word. Create abbreviations that can be pronounced. For example, use stringPos rather than stringPstn and recCount rather than rcrdCount. It should be possible to read program code aloud without twisting your tongue. Apart from the advantage of being easier to communicate, pronounceable identifiers are also easier to remember.

Naming Conventions (2)

Language conventions are often omitted from formal programming conventions, which is a bit odd considering their importance. The most obvious language convention is the selection of the language itself. This point is so obvious, that it’s often missed. Mind you, not every programmer is fluent in English. Just a fraction are native English speakers. Hence, English should be chosen only if all members of the development team are comfortable with it. Otherwise, it is more appropriate to choose the native language of the developer team. If the developer team is international, English is likely to be used as the common basis. If the conventions prescribe English and if some team members are less fluent in English, identifier names must be treated with special care. I recently came across a piece of software developed in Germany where identifier names for an organisational structure where assigned as follows:

German Term What it means Identifier Name Used
Organisationseinheit organisational unit department
Unternehmen branch/sector division
Unternehmensbereich division divisionArea
Abteilung department section

The meaning of these identifiers doesn’t quite correspond to what an English speaker would expect them to mean, which is rather confusing. Yet, if you look up these terms in a dictionary, you find that all of them, with exception of divisionArea, are valid translations of the original German term. This is one example where it would have been better to use German identifiers instead. These terms appeared in source code that had English identifiers and German JavaDoc comments.

Perhaps as a rule of thumb, it’s better not to use English identifiers if the team is not comfortable with English documentation as well. Again, in an international environment this may not be an option. Therefore, it might be worthwhile to have an English speaker refactor code written by non-native speakers for the sake of clarity. Another aspect that pertains to language conventions is grammar. You might think that I am nitpicking when mentioning grammar in computer programs. The point is that proper grammar facilitates understanding. This is true for natural language as well as program code. The grammatical rules for identifier names are few and simple. Use nouns for class names and object references, because classes and objects abstract real world objects. In some cases, objects directly correspond to entities of the problem domain, such as Customer, Order, Department, User, etc. In other cases they do not.

It is important to qualify class names further if a single noun does not describe it sufficiently. Examples of qualified nouns are TransferredFunds, AppraisalStatistics, AvailableOptions. Use verbs for methods and functions, because methods abstract behaviour. Good examples for method names are addTax(), computeCRC(), checkAvailability(), getDepartment().

Avoid using nouns as method identifiers, such as department() instead of getDepartment() or adjectives such as available() instead of getAvailability(). An exception to this rule would be languages where methods directly represent properties, because property names should be either adjectives or nouns. Likewise, interfaces should be nouns, or in some special cases adjectives, such as Triangular, Centred, Deductible, etc. depending on their purpose. In the latter case, the adjective is derived from the verb deduct and by convention the interface defines a method with the same name, i.e. deduct(). Other examples for similarly named interfaces are Readable, Comparable, or Sortable.

Finally, we have identifiers for local variables, parameters, and fields. What should these look like? There are no hard rules for these program elements, since local variables, parameters and fields can represent anything from objects to properties, primitives, and even methods. It’s best if grammar conventions follow those of the type the identifier refers to.

Next I want to talk about identifier naming schemes. These are fixed sets of conventions for assigning names. Two widely used schemes are positional notation and Hungarian notation. Positional notation is typically used in legacy languages where identifier names are very short, as for example in older Fortran and Cobol programs. Another example is the 8.3 notation of MS-DOS file system. Identifiers that use positional notation are usually unintelligible without a key. An example would be APRPOCTT, where the first two characters AP stand for the main program module “Accounts Payable”, RP stands for the sub-module “reporting” and OCT1 means “operating costs total 1”.

Needless to say that this notation defies easy readability and should not be used unless absolutely necessary. A more widely used scheme is Hungarian notation. Hungarian notation is characterised by prefixes, and sometimes postfixes, which are added to the variable name and which carry type information. For example, in the original Hungarian notation suggested by Charles Simonyi for the C language, b stands for boolean, i for integer, p for pointer, sz for zero-terminated strings, and so on. These prefixes can be combined. The name arrpszMessages, for example, refers to an array of pointers to zero-terminated strings.

Sometimes Hungarian notation is applied to conceptual types rather than implementation types. For instance, the identifier arrstrMessages refers to an array of strings without saying anything about pointers or zero-termination. Hungarian notation is still used today with certain languages, such as C, C++ and Delphi. In C++ the following prefixes are often used to denote scope: g_ for global, m_ for class members (fields), s_ for static members, and _ for local variables. Delphi is unusual in the sense that Hungarian notation denote object types. The type names themselves are always prefixed with the letter T in Delphi, so a button class would be defined as TButton and an instance of that type would be prefixed with Btn or something similar, such as BtnOK, BtnCancel, BtnRetry, etc.

There is a lot of debate whether Hungarian notation is generally useful or not. The main criticism is that implementation types are somewhat irrelevant for the programmer writing code in a statically typed language, since types are checked by the compiler. When the programmer needs to know about the type of a variable or an object, modern IDEs can usually resolve it automatically. In this case, the prefixes just add unnecessary visual clutter. Hungarian notation still has its place in C, but it’s definitely not that useful with contemporary languages and development tools. The same can be said in principle about most other identifier naming schemes. Naming schemes generally attempt to add metadata to identifiers, or create artificial namespaces. Most modern languages provide better means to both ends.

For example, the Java language has annotations for metadata and packages for namespaces. There is usually no need to employ naming schemes with most modern languages. It’s probably more worthwhile to spent effort on finding intelligible and descriptive names rather than inventing clever naming schemes. Next time I will talk about the practical considerations in choosing good identifier names and provide some examples to illustrate best practices.

Naming Conventions (1)

Nomen est omen. This old Latin proverb means something like “the name says it all”. The ancients were superstitious, and they believed that names carry special powers. Names were thought to predispose its subject to bring about certain fortunes or to have certain qualities. Today, we have largely done away with such superstitions. In the scientific worldview, names are nothing but symbolic artefacts without intrinsic powers.

However, there is one field where this Latin proverb still applies, and where it is indeed more true than ever. Oddly, this field wasn’t even known to the Romans. I am talking about software development, of course. Names have a special importance to software, or perhaps better, the practice of naming does. The first thing I do when looking at a piece of software written by somebody else is to look at the names given to variables and other program elements. My experience has shown that the quality of the identifier names corresponds directly to the overall quality of the program code.

Identifier names are a crucial part of any program. They provide clues about semantics and program logic. They make or break code readability. They determine whether code is self-documenting or not. So, the old Latin proverb “nomen est omen” can be applied as follows: If you read through a piece of code for the first time and you have no idea what the variables are supposed to represent, or what the methods are supposed to accomplish, then this is a bad omen. It suggests that the author was not quite sure how to formulate the problem (or didn’t care) and it can be expected that the other aspects of the program are at least as confusing.

If you read through a piece of code and the identifier names are easily comprehensible and fit together like the pieces of a jigsaw puzzle, then this is a good omen. It suggests that the author had a clear idea of the task at hand. Naturally, there are many intermediate levels between these two opposites. While contemporary code editors and IDEs are very powerful, identifier naming is one of the things that cannot currently be automated by these tools.

It is up to the programmer to choose identifier names. Since good naming practice is essential for code maintainability, we will first define what makes a good naming practice and then look at some concrete examples of good and bad strategies. There are three basic ingredients for a good naming practice: (1) semantic precision, (2) consistency, (3) the right amount of verbosity.

The first aspect is by far the hardest to get right. Semantic precision means that the chosen name is appropriate, unambiguous, well defined, and compliant with conventions. Consistency means that names are formed according to common patterns and that terms are used consistently throughout the program. The right amount of verbosity relates to identifier length. It means that names do not leave anything open to guesswork while avoiding redundancy.

An identifier usually consists of a single word or a combination of words. In case of the latter, the individual words are often set apart by using CamelCase or the “_” underscore character. One of the most commonly found questionable practice is the use of abbreviations instead of written out words, for example rptCount instead of repeatCount. The word count indicates that this variable is a counter, but what is counted? Repetitions, receptors, recipients, red points, or something else? Reptiles perhaps? By adding a mere three characters and writing out the first word, the ambiguity is eliminated.

This doesn’t mean that abbreviations are always bad. For example, nothing speaks against widely used acronyms like URI for UniqueResourceIdentifier, or LCD instead of LiquidCrystalDisplay. Likewise, domain-specific abbreviations are acceptable, if the program is written within that domain, for example FOB (free on board) in the shipping domain, or VAT (value added tax) in the accounting domain. By definition this also includes acronyms in the software domain, such as i18n for internationalisation, or ftp for file transfer protocol. In addition, there are a number of  pre- and postfixes used ubiquitously in programming, such as min, max, fmt, pos, len, num, cnt, etc. which every programmer understands.

Generally speaking, abbreviations and acronyms should be used sparingly and only when they are common and free of ambiguity. This also means that one-letter or two-letter variable names, such as a, b, c, f1, x2, etc. are generally a bad idea, because they say nothing about the content of the variable. There is one exception to this rule: loop indices. Since loop indices (or iterator variables) are only used to to iterate through a collection of values, they don’t have any intrinsic meaning. So one might as well give them one-letter names. By convention, the letters i, j, k, etc. are used, whereas the alphabetic order corresponds to the loop nesting level. This mean i is used for the outermost loop, j for the second nested loop, k for the third, and so on.

This is standard practice for loop indices, but in other cases, index position corresponds to certain semantics. In this case, indices do have meaning. For example, one might define an array of counters, where counter[0] contains the number of students, counter[1] contains the number of passed tests and counter[2] contains the number of failed tests. Since the index numbers themselves don’t communicate any meaning, it is appropriate to define an enumerable type or a set of integer constants that conveys this meaning, for example STUDENTS=0, PASSED_TESTS=1, FAILED_TESTS=2, and so on.

This is all pretty much standard programming practice. Next time we will look at common identifier naming schemes, their merits and demerits, as well as language conventions.

Discussion board moderation

Discussion board moderation is a new “profession” and as such it requires a new set of skills. These are not, as many believe, technical skills. Discussion board moderation is primarily a management task and therefore it requires management skills. Since management is not an exact science, the dos and don’ts of discussion board moderation are not chiselled in granite. Yet, there are some important principles which executive and prospective moderators should consider.

Discussion boards (or “forums”) are a newfangled social phenomenon that came about with the Internet. They are meeting places for people who share a common interest about which they like to talk. An online discussion is essentially a written asynchronous conversation between two or more parties who send and receive questions, answers, and comments with a relative delay. These written conversations are much slower than natural conversations, but still faster than a traditional exchange of letters.

The necessity for moderation exists for several reasons. Usually the board operator desires some level of control over the content posted by other participants in order to ensure that it does not violate laws and regulations. In addition, the operator might want to define specific rules for the discussion board that fit the culture of its community. Such rules usually target netiquette and ethical codes. Finally, the board administrator must uphold the technical functioning of the discussion board system and prevent abuse. The attainment of these goals are usually delegated to the moderator(s) who may or may not be the same person as the board operator.

Common Challenges

Discussion boards provide entertainment, support, and fun for many people, but they are not without challenges. A virtual meeting place is a bit like a masked ball where participants enjoy complete anonymity. This can lead to problems. Anonymity, as well as the lack of physical contact, has a tendency to lower the inhibition threshold for socially unacceptable behaviour in some individuals. Common challenges are angry, hateful, obscene, or otherwise inappropriate posts, cross posting, spamming, trolling, DoS attacks, identity theft, and other more technical problems.

Flaming And Flame Wars

Flames are intentionally hostile or insulting messages that usually result from a heated exchange between people holding different opinions. Flames are the most common problem of discussion boards. The flame character of a message is identified by its design to attack the opponent rather than the argument. Hence, flames are ad hominems with a strong emotional impact. Flame wars are prolonged exchanges of flame posts, into which –according to the group dynamics of the community– many individuals may get involved. The affinity to flame wars depends on many factors, such as community behaviour, the nature of topics discussed, as well as moderation practices. Flaming is generally deterring and discouraging to users. Obviously, controversial topics are especially susceptible to flames.

Flames are a rather difficult challenge for the moderator. The most suitable strategy to control flames is to employ non-punitive measures, for example posting placatory comments, appeals to fairness, and conciliation proposals to calm the situation. Diplomacy and humour often work well. Prevention of flames, for example by creating a relaxed and intimate atmosphere, is even better. If this doesn’t work, it may be necessary to remind the opponents of the rules regarding discussion style or to close the thread. If the posted flames are inappropriate it may also be essential to delete offensive passages or posts. Finally, if nothing else works, warning and barring the offending member(s) is the last recourse.

Trolling

A troll is someone who habitually posts disturbing, inflammatory or nonsensical messages that disrupt the discussion and upset the community. Trolls are basically agitators who provoke and create perturbation by some means, usually by flames, in order to drag attention to themselves or to sabotage the discussion. Trolling is best moderated by confronting the offender directly via the personal message system and by putting the troll user on the pre-moderation list if the discussion board software allows it. The motives for trolling are varied. The troll may be a disgruntled user, someone who feels that the board community “has turned against him”, someone with an underlying psychological problem, or merely someone venting temporary frustration. Trolls can be quite problematic. Persistent trolls should be pre-moderated or banned if pre-moderation is not an option.

Spamming

Outright spamming has become somewhat rare on discussion boards, since most board software prevents robots from signing up and submitting spam. Yet, there is still the problem of spam posted by human subscribers. Spam contents range from fairly subtle, such as text links to a commercial website, to blatant, such as advertising banners in user signatures and posts. Spammers frequently seek out communities that fit the target group for their products or services. For example, a shop that sells exercise machines might seek out sports communities. Evidently most spammers have an agenda apart from the community and the discussion. Nothing is lost by immediately deleting the spam posts and blocking the offending user and IP address. The situation is somewhat different if a regular member submits an advertising post. In most cases, deletion and a warning issued via PM or the warning system will be sufficient to deal with a one-time transgression.

Cross-Posting

Cross-posting is the practice of submitting the same message to more than one forum. The intention of the sender is to reach the greatest possible number of readers. The conjunct problem is fragmentation of the ensuing discussion. If the cross-post is targeted at the same community, people also get the impression of being spammed. Cross-posting within the same discussion board is annoying in most cases. The moderator needs to decide whether cross-posting is appropriate or whether to delete duplicate posts. In order to avoid thread fragmentation, duplicate threads may be closed, ideally with an annotation containing a link that leads to one thread singled out to continue the discussion. Alternatively, the administrator may disallow cross-posting within the same discussion board altogether.

Off-Topic Posts

This is a very common problem and it is simultaneously difficult to control. Off-topic (OT) posts arise from the associative nature of subject matters, a characteristic that goes to the root of human language. Getting off on a tangent is all to easy. For example, a discussion about nuclear energy may divert into a discussion about alternative energies, nuclear weapons, or state regulations. In the natural flow of a discussion, minor diversions are common and probably unobjectionable. However, a thread often develops in a contingent way that spawns discussions about multiple topics –often in parallel– which is confusing in the same way a group of people talking at the same time is confusing. Unfortunately, there are no universally valid guidelines for off-topic moderation. It always depends on context and community. In an informal discussion about philosophy OT posts may be of no concern, while in a more formal setting, such as a technical support forum, off-topic contributions may not be allowed at all.

A topic is usually outlined by its thread title and the tagline (short description). If a thread develops an OT sideline, the OT posts may be swapped out into a new thread by the administrator. Many software packages provide a “split thread” operation for this purpose. To what extent OT posts are moderated and how strongly OT contributions are discouraged depends very much on the nature of the discussion board.

Noise

“Noise” is text and other content that does either not belong to a discussion or that interrupts the flow of a discussion. For example, long quotations or distracting signatures can be considered noise. If the noise ratio exceeds a certain value, following the discussion becomes visually tiresome. The best strategy to avoid this is by limiting signatures to a certain length (perhaps also to disallow images in signatures) and by discouraging full quotes. Quotations are often useful, even necessary to remind the reader of something previously mentioned and to establish the context for a reply. However, a full quote in which the answerer refers only to a tiny fragment within the quote is confusing and counterproductive. To avoid this, the discussion board software may be configured to discourage full quotes, for example by ergonomic means. Alternatively, the moderator may remind people not to overuse full quotes and edit out noise manually if necessary.

Multiple Identities and Impersonation

Multiple identities result from the same user subscribing several times to the same discussion board. This might happen with technically inexperienced users, users who have lost their password, or users who intentionally create multiple identities. Although most software packages can be configured to prohibit multiple subscriptions with the same email address and/or from the same IP number, subscribers may bypass this mechanism by using different email addresses and IPs. Furthermore, blocking IP addresses is problematic with dynamically assigned IPs. In most cases, multiple subscriptions result in a number of dead accounts which can be deleted after a certain period of inactivity. Other cases are more troublesome, especially those which involve the continued use of multiple identities or impersonation (identity theft). These are deceptive tactics which are not always easy to detect. They a re popular with trolls. An analysis of IP numbers and time stamps of a sequence of posts is often necessary to uncover this form of abuse. Since this is a serious form of abuse, it usually results in account termination and banning.

Denial of Service Attacks, Hacker Attacks

Denial of Service (DoS) Attacks are technical sabotage manoeuvres aimed at disrupting the discussion board service. The most common method is flooding. A flooding robot (a program) sends huge quantities of messages to the board, which then becomes unusable for other users. Most discussion board software packages have basic features to avert such attacks, for example by limiting the number of messages a user can post within a certain period. However, resourceful attackers may find ways to bypass these protection mechanisms. Luckily, DoS are somewhat rare since they require a some technical sophistication, and quite a bit of dedication to the purpose of sabotage. Hacker attacks, on the other hand, are more common. The most ordinary hacker attack is password sniffing on unencrypted connections, and subsequently using passwords for gaining entry to the discussion board system, preferably as a user with administrator privileges. DoS and hacker attacks are serious forms of abuse and should be reported to the service provider and possibly to the law enforcement authorities. Board operators do not always have the technical means to take on such attacks on their own.

Types of Moderation

The Usenet community generally distinguishes between four types of moderation, which are likewise applicable to web-based discussion board systems. These types of moderation differ in the way posts are moderated. They feature different decision and communication flow models.

Post-Moderation

The most common form of moderation is post-moderation, which means that either a single moderator or a group of moderators reviews contributions once they have been posted. In such a setting, messages ought to reviewed on a regular basis (perhaps daily) and moderators ought to perform editorial tasks as required. Post-moderation is time-consuming if done correctly, because moderators need to review all content and respond to inappropriate content in time. Moderators have full censoring power.

Pre-Moderation

The most restrictive form of moderation is pre-moderation. Again, moderators have full censoring power and need to review every message, but content is reviewed before it goes online, not after. This means that posted messages first go into a waiting queue before they are approved and released by the moderator. The delay that results from this procedure is quite detrimental to discussions, because replies are not available to the community in real time. Since this normally drains the lifeblood from a discussion, pre-moderation is applied only in special situations, where the sensitivity of the topic requires more restrictive action. One example for pre-moderation are the book reviews on amazon.com.

Reactive Moderation

Reactive moderation relies on alerts from members of the discussion board. It moves the task of supervision from the moderator to the audience by offering easily accessible means of reporting problems to the moderator. The moderator only needs to review those areas with reported problems. This form of moderation is quite effective in conjunction with automatic supervision, such as word filters. Its greatest advantage is the reduction of moderation workload associated with the pre- and post-moderation methods. What is more, the legal responsibilities of the operator seem to move primarily to removing questionable content, rather than preventing it being posted. The principal disadvantage of reactive moderation is that not all breaches of house rules and legal provisions might get reported.

Distributed Moderation

The distributed moderation model is even more radical. It dispenses with the concept of a moderator person altogether. Instead it relies on the assumption that a community can collectively decide what is appropriate for itself and what is not. Moderation tasks are thus carried out by the community by means of a voting system. Current implementations of voting systems are often similar to content rating systems. For example, if someone suggests a post for deletion, it takes a number of consenting votes to actually carry out deletion. There are two problems with this approach. First, the community might have different views about “appropriate content” than the board operator. Second, online voting systems are still prone to abuse. Thus distributed moderation is not yet widespread, although some groups, such as slashdot.org and wikipedia.org have used it with great success.

Towards web engineering

Perhaps you have never heard of the term web engineering. That is not surprising, because it is not commonly used. The fact that it is not commonly used is, however, surprising. Very surprising actually. The process of software creation resembles that of website creation and engineering principles are readily applied to the former. A website can be considered an information system. However, there is one peculiarity: the creation of a website is as much a technical as an artistic process.

The design of graphics and text content (presentation) goes hand in hand with the design of data structures, interactive features, and user interface (application). Despite some crucial differences, web development and software development have many features in common. Since we are familiar with software engineering, and since we understand its principles and advantages, it seems sensible to apply similar principles to web development. Thus web engineering.

Before expanding on this thought delving into the details of the engineering process, let’s try to define the term “web engineering”. Here is the definition that Prof. San Murugesan suggested in his 2005 paper Web Engineering: Introduction and Perspectives: “Web engineering uses scientific, engineering, and management principles and systematic approaches to successfully develop, deploy, and maintain high-quality web systems and applications.”

Maturation Stages

It seems that current web development practices rarely conform with this definition. Most websites are still implemented in an uncontrolled code-and-fix style without codified procedures and heuristics. The person responsible for implementation is often a webmaster who is basically a skilled craftsman. The process of crafted web development is quite different from an engineering process. Generally speaking, there is a progression that all fields of industrial production go through from craftsmanship to engineering. The following figure illustrates this maturation process (Steve McConnel, 2003):

Industry Maturity Stages

At the craft stage, production is carried out by accomplished craftsmen, who rely on artistic skill, rule of thumb, and experience to create a product. Craftsmen tend to make extravagant use of resources and production is often time consuming. Therefore, production cost is high.

The commercial stage is marked by a stronger economic orientation. Growing demand, increased competition, and companies instead of craftsmen carrying out work are hallmarks of the commercial phase. At this stage, the production process is systematically refined and codified. Commercial suppliers provide standardised products at competitive prices.

Some of the technical challenges encountered at the commercial stage cannot be solved, because the research and development costs are too high for individual manufacturers. If the economic stakes are high enough, a corresponding science will emerge. As the science matures, it develops theories that contribute to commercial practice. This is the point at which production reaches the professional engineering stage.

On a global level, software production was in the craft stage until the 1970s and has progressed to the commercial stage since then. Though the level of professional engineering is already on the horizon, the software industry has not reached it yet. Not all software producers make full use of available methods, while other more advanced methods are still being researched.

Recent developments in methodology

Web development became a new field of production with the universal breakthrough of the Internet in the 1990s. During the subsequent decade, web development has largely remained in the craft stage. This is now changing, albeit slowly. The move from hand-coded websites to template-based websites, content management systems, and standard application packages signals the transition from craft to commercial stage. Nonetheless, web development has not yet drawn level with the state of art in software development.

Until recently, web development was like a blast from the past. Scripting a web application involved the use of arcane languages for producing non-reusable, non-object-oriented, non-componentised, non-modularised, and in the worst case non-procedural code. HTML, JavaScript, and messy CGI scripts were glued together to create so-called dynamic web pages. In other words, web development was a primitive craft. Ironically, all principles of software engineering were either forgotten or ignored. Thus, in terms of work practices, developers found themselves thrown back twenty years. However, the rapid growth of the Internet quickly made these methods appear outmoded.

During the past 15 years, the World Wide Web underwent a transformation from a linked information repository (for which it was originally designed) to a universal vehicle for worldwide communication and transactions. It has become a major delivery platform for a wide range of applications, such as e-commerce, online banking, community sites, e-government, and others. This transition has created demand for new and advanced web development technologies.

Today, we have some of these technologies. We have unravelled the more obscure aspects of HTML. We have style sheets to encapsulate presentation logic. We have OOP scripting languages for web programming. We have multi-tier architectures. We have more capable componentised browsers. We have specialised protocols and applications for a great variety of purposes.

However, we don’t yet have established methodologies to integrate all these technologies and build robust, maintainable, high-quality web applications. We don’t yet have a universally applicable set of best practices that tells us how to build, say a financial web application, from a server language, a client language, and a database backend. Consequently, there is still a good deal of black magic involved in web development.

If you build a house, you can choose from a number of construction techniques and building materials, such as brick, wood, stone, or concrete. For any of these materials, established methods exist that describe how to join them to create buildings. Likewise, there are recognised procedures to install plumbing, electricity, drainage, and so on. When you build a house, you fit ready-made components together. Normally you don’t fabricate your own bricks, pipes, or sockets.

Unfortunately, this is not so in the field of web development. Web developers do not ubiquitously rely on standard components and established methods. On occasion, they still manufacture the equivalent of bricks, pipes, and sockets for their own purposes. And they fit them together at their own discretion, rather than by following standard procedures. Unsurprisingly, this results in a more time consuming development process and in less predictable product quality.

Having recognised the need for web engineering, a number of question arises. What does web engineering have in common with software engineering? What are the differences? Which software engineering methods lend themselves best to web development? Which methods must be defined from scratch? A detailed examination of all these questions is unfortunately beyond the scope of this article. However, we can briefly outline the span of the field and name those aspects that are most crucial to the Web engineering process.

Web applications are different

Web applications are inherently different from traditional software applications, because they combine multimedia content (text, graphics, animation, audio, video) with procedural processing. Web development comprises software development as well as the discipline of publishing. This includes, for instance, authoring, proofing, editing, graphic design, layout, etc.

Web applications evolve faster and in smaller increments than conventional software. Installing, fixing, and updating a website is easier than distributing and installing a large number of applications on individual computers. Web applications can be used by anyone with Internet access. Hence, the user community may be vast and may have different cultural and educational backgrounds. Security and privacy requirements of web applications are more demanding. Web applications grow in an environment of rapid technological change. Developers must constantly cope with new standards, tools, and languages.

Multidisciplinary approach

Building a large website involves dissimilar tasks such as photo editing, graphics design, user interface design, copy writing, and programming, which in turn requires a palette of dissimilar skills. It is therefore likely that a number of specialists are involved in the creation of a website, each one working on a different aspect of it. For example, there may be a writer, a graphic designer, a Flash specialist and a programmer in the team. Hence, web development calls for a multidisciplinary approach and team work techniques. “Web development is a mixture between print publishing and software development, between marketing and computing, between internal communications and external relations, and between art and technology.” (Powell, 2000)

The website lifecycle

The concept of the website lifecycle is analogous to that of the software lifecycle. Since the field of software engineering knows several competing lifecycle models, there are likewise different approaches to website design. For example, the waterfall model can be used for relatively small web sites with mainly static content:

1. Requirements Analysis
2. Design
3. Implementation
4. Integration
5. Testing and Debugging
6. Installation
7. Maintenance

This methodology obviously fails for larger websites and web applications for the same reason it fails for larger software projects: the development process of large scale projects is incremental and requirements typically evolve with the project. But there is another argument that speaks for a more incremental/iterative methodology. Web applications are much easier to rollout and update than traditional software applications. Shorter lifecycles therefore make good practical sense. The “release early, release often” philosophy of the open source community certainly applies to web development. Frequent releases increase customer confidence, feedback, and avoid early design mistakes.

Categories of web applications

Different types of web applications can be distinguished by functionality (Murugesan, 2005):

Function Examples
Informational Online newspapers, product catalogues, newsletters, manuals, reports, online classifieds, online books.
Interactive Registration forms, customised information presentation, online games.
Transactional Online shopping (ordering goods and services), online banking, online airline reservation, online payment of bills
Workflow Oriented Online planning and scheduling, inventory management, status monitoring, supply chain management
Collaborative work environments Distributed authoring systems, collaborative design tools
Online communities, market places Discussion groups, recommender systems, online market places, e-malls (electronic shopping malls), online auctions, intermediaries

Maintainability

Maintainability is an absolutely crucial aspect in the design of a website. Even small to medium sites can quickly become difficult to maintain. Many developers find it tempting to use static HTML for presentation logic, because it allows for quick prototyping, and because script code can be mixed seamlessly with plain HTML. What is more, presentation logic is notoriously difficult to separate from business and application logic in web applications. However, this approach is only suitable for very small sites. In most other cases, a HTML generator, a template engine, or a content management system will be more appropriate.

Dependency reduction also contributes to maintainability. Web applications have multiple levels of dependencies: internal script code dependencies, HTML and template dependencies, and style sheet dependencies. These issues need to be addressed individually. The reduction of dependencies usually comes at the price of increasing redundancy/complexity. For example, it is more complex to use individual style sheets for different templates, than to use one master style sheet for all. Internal dependencies and complexity levels need to be balanced by an experienced developer.

As in conventional software development, code reusability is paramount to maintainability. Modern object-oriented scripting languages allow for adequate encapsulation and componentisation of web application parts. The problem is that developers do not always make use of these programming techniques, because they require more upfront planning effort. In environments with high economical pressure, there is a tendency to ad-hoc coding which yields quick results, but lacks maintainability.

Scalability

Scalability is one of the friendlier facets of web development, because the web platform is -at least in theory- inherently scalable. Increasing traffic can often be handled by simply upgrading the server system. Nonetheless, developers still need to consider scalability when designing software systems. Two important issues are session management and database management. Memory resource usage is proportional to the amount of session management data. I/O and CPU load is proportional to concurrent database access, thus developers do well to anticipate peak loads and optimise links between different tiers in an n-tier system.

Ergonomics and aesthetics

The look-and-feel of a website, its layout, navigation, colour schemes, menus, etc. make up the ergonomics and aesthetics of a website. This is a more artistic aspect of web development and it should be left to a professional with relevant skills and a good understanding of usability aspects. Software ergonomics and aesthetics should not be an afterthought, but an integral aspect of the development process. These factors are strongly influenced by culture. A website that appeals to a global audience is more difficult to build than a website targeted at a specific group and culture.

Website testing

Website testing comprises all aspects of conventional software testing. In addition, it involves special fields of testing which are specific to web development:

  • Page display
  • Semantic clarity and cohesion
  • Browser compatibility
  • Navigation
  • Usability
  • User Interaction
  • Performance
  • Security
  • Code standards compliance

Compatibility and interoperability

Web applications are generally intended to run on a large variety of client computers. Users might use different operating systems, different font sets, different languages, different monitors, different screen sizes, and different browsers to display web pages. In some cases, web applications do not only run on standard PCs, but also on pocket computers, PDAs, mobile phones, and other devices. While it is nearly impossible to guarantee proper page display on all of these devices, there needs to be a clearly defined scope of interoperability. This scope should spell out at least the browsers, browser versions, languages, and screen sizes that are to be supported.

Bibliography

Steve McConnell (2003). Professional Software Development
San Murugesan (2005). Web Engineering: Introduction and Perspectives
T.A. Powell (1998). Web site engineering: Beyond Web page design