Guides to Software Evaluation

Authors: Hans-Ludwig Hausen, Dieter Welzel
Arbeitspapiere der GMD 746, April 1993, 92 pages

Summary by
George Candea

One-sentence summary:

Describes a framework for measuring, assessing, and certifying the quality of a software product; reported information is meant to be equally useful to software producers, vendors, and users.

Overview/Main Points

According to one of the authors, the guide was used in about 100 industrial case studies in the European Union and was adopted by the European Space Organization after the Ariane-5 disaster to improve software quality. The framework is intended for evaluating a wide range of products, from off-the-shelf packages to custom/commissioned software to embedded systems.
The evaluation procedure needs to be cost-effective and run by an independent 3rd party, such as specialized testing labs. This software evaluation guide is meant to be a handbook for such labs, as well as producers/purchasers of software.
There are 4 principles guiding the authors:
- repeatability: repeated evaluation yields same results
- reproducibility: evaluation by different labs gives same results
- impartiality
- objectivity: there is a minimal amount of subjective judgment required.
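The first two principles can be told apart with a small sketch. This is only an illustration under my own assumptions: the function name, the (lab, result) record layout, and the use of exact equality between results are hypothetical and not taken from the guide.

```python
from collections import defaultdict


def check_consistency(records):
    """Classify a set of evaluation results for one product and one module.

    records: iterable of (lab, result) pairs (hypothetical layout).
    Returns (repeatable, reproducible).
    """
    by_lab = defaultdict(set)
    for lab, result in records:
        by_lab[lab].add(result)

    # Repeatability: each lab gets the same result every time it re-runs.
    repeatable = all(len(results) == 1 for results in by_lab.values())

    # Reproducibility: all labs, taken together, report one single result.
    all_results = set().union(*by_lab.values())
    reproducible = len(all_results) <= 1

    return repeatable, reproducible
```

Exact equality is a simplification; a real lab would presumably need a tolerance for measured quantities and some agreement procedure for judgment-based items.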
The evaluation procedure consists of 5 steps (key terms are defined below):
1. The lab's client (e.g., a software vendor) states the evaluation requirements, including evaluation levels, and the lab analyzes them.
2. Lab produces an evaluation specification. For example, the spec describes the evaluation techniques to be used for each software characteristic: black box testing + glass box testing + UI inspection + algorithmic complexity analysis + static analysis + design process evaluation. The client can accept the spec or withdraw from the evaluation process.
3. Lab designs the evaluation process based on spec; client accepts or withdraws.
4. Lab performs evaluation. This may involve manual processes, computer-aided processes (e.g., using a checklist manager for applying checklists), as well as automatic evaluation (e.g., measuring complexity using a static analyzer).
5. Lab reports results; client accepts or lodges an appeal in court or some legal forum (results become public information otherwise).
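The five steps and their exit points can be sketched as a small workflow. This is a minimal sketch under my own assumptions, not the guide's notation: the names Step, Outcome, and run_evaluation are hypothetical, and the client's decisions are reduced to a yes/no per step.

```python
from enum import Enum, auto


class Step(Enum):
    REQUIREMENTS = auto()    # step 1: client states requirements, lab analyzes them
    SPECIFICATION = auto()   # step 2: lab produces the evaluation specification
    DESIGN = auto()          # step 3: lab designs the evaluation process
    EXECUTION = auto()       # step 4: lab performs the evaluation
    REPORTING = auto()       # step 5: lab reports results


class Outcome(Enum):
    PUBLISHED = auto()   # no appeal: results become public information
    WITHDRAWN = auto()   # client withdrew at step 2 or 3
    APPEALED = auto()    # client lodged an appeal at step 5


# Steps at which the client may accept or withdraw.
WITHDRAWAL_POINTS = {Step.SPECIFICATION, Step.DESIGN}


def run_evaluation(client_accepts):
    """Walk the five steps in order.

    client_accepts maps a Step to False where the client rejects that
    step's output; steps that are not listed default to acceptance.
    Returns (outcome, list of steps that were reached).
    """
    reached = []
    for step in Step:  # Enum members iterate in definition order
        reached.append(step)
        accepted = client_accepts.get(step, True)
        if step in WITHDRAWAL_POINTS and not accepted:
            return Outcome.WITHDRAWN, reached
        if step is Step.REPORTING and not accepted:
            return Outcome.APPEALED, reached
    return Outcome.PUBLISHED, reached
```

For example, run_evaluation({Step.DESIGN: False}) stops after three steps with Outcome.WITHDRAWN, which reflects that the client can leave before any evaluation work is performed.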
An evaluation level represents the depth/thoroughness of the evaluation techniques and results, with respect to safety, security, economic risk, availability, and application constraints. Evaluation levels are named A, B, C, D and are defined separately for each of these aspects:
- safety: can range from a safety breach causing "small damage to property + no risk to people" to "unrecoverable environmental damage + many people killed".
- economy: ranges from "negligible loss" to "financial disaster that ruins company".
- security: defined in terms of ITSEC assurance levels.
- availability: ranges from "up on request/demand" to "no down time at all".
- application domain: e.g., "office automation + entertainment", "air and railway systems", etc.
Software characteristics are: functionality, reliability, usability, efficiency, maintainability, and portability.
An evaluation module encapsulates atomic evaluation procedures (e.g., "check whether system automatically proposes corrections to clearly correctable user-errors?"), metrics and evaluation levels for each software characteristic (for the example shown above, "yes=2, some=1, no=0"), a description of the assessment procedure, and a format for reporting results and costs.
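An evaluation module of the checklist kind can be sketched as a small data structure. Only the "yes=2, some=1, no=0" scale and the first question come from the guide's example; the class names, the second question, and the idea of reporting points earned against the maximum are my own assumptions for illustration.

```python
from dataclasses import dataclass, field

# Scoring scale quoted in the example above.
ANSWER_SCORES = {"yes": 2, "some": 1, "no": 0}


@dataclass
class ChecklistItem:
    question: str
    characteristic: str  # e.g. "usability"


@dataclass
class EvaluationModule:
    name: str
    items: list = field(default_factory=list)

    def score(self, answers):
        """Return (points earned, maximum points).

        answers maps each question to "yes", "some", or "no". A missing
        or off-scale answer raises KeyError, so a report is never
        silently incomplete.
        """
        earned = 0
        for item in self.items:
            answer = answers[item.question]   # KeyError if unanswered
            earned += ANSWER_SCORES[answer]   # KeyError if not on the scale
        return earned, max(ANSWER_SCORES.values()) * len(self.items)


usability = EvaluationModule(
    name="usability checklist (hypothetical)",
    items=[
        ChecklistItem(
            "Does the system automatically propose corrections to "
            "clearly correctable user errors?",
            "usability",
        ),
        # Hypothetical wording, not quoted from the guide.
        ChecklistItem("Can the product be installed from scratch?", "usability"),
    ],
)

answers = {
    usability.items[0].question: "some",
    usability.items[1].question: "yes",
}
earned, maximum = usability.score(answers)
```

Keeping the scale as data rather than code makes it visible in the report, which seems in the spirit of the repeatability principle, though the guide itself also attaches an assessment procedure and cost information to each module that this sketch omits.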
The information required by the lab for the evaluation can include: product information (user handbook, design docs, object code, code listings, etc.), development process information (management report, quality assurance report, project documentation), etc.
Annex B shows an evaluation module in the form of a usability checklist. It is based on the following usability metrics: installability from scratch, learnability, use efficiency, interface customizability, experienced user migration ease.

Relevance

Lays out a reasonably well thought-out process that can be used in evaluating software. However, this seems like a more useful tool for the software producers themselves, as part of their internal reviews, rather than for customers, because many of the features in the evaluation may not be things they care about. Hence, it's unclear to what extent one could compare two products on the basis of their evaluations.
The authors rely on existing, readily available tools and technology, which makes the guide easier to adopt. They don't invent new languages or use exotic tools.
It is a realistic guide, in that it allows for reasonably useful reviews to be conducted for reasonable software. It is clearly not the result of a committee. The approach is rather holistic and end-to-end, much unlike most evaluation processes I've seen so far (e.g., verification and validation).
Would certainly be useful and have an impact on the industry if all major players agreed upon it.

Drawbacks

Some of these are not necessarily flaws of this guide, but rather issues that would affect any process that tried to evaluate software quality.

The development of evaluation modules is closely tied to existing computing paradigms (e.g., windowing interfaces for UIs). As such, they might become obsolete. Worse, the existence of some "standard" evaluation modules may thwart innovation (e.g., see the industry's focus on TPC-C/W performance). But, "for better or worse, benchmarks shape a field" (Dave Patterson).
In step 1, "analyzing evaluation requirements," the guide suggests that in the case of complex systems, the lab should closely collaborate with its customer, in order to reduce costs; this however can seriously influence objectivity.
Some of the evaluation procedures are inherently subjective and there is no way around it. E.g., in the usability module, there are a few metrics that evaluate the "understandability" and "clarity" of documentation.
Impartiality can also be a problem. In the usability module, for the "check that all required installation files are present" metric, there is a note saying "The architecture of the product may make it impossible for the evaluator to directly test for this feature. If so, the evaluator should answer this question by querying the developer."
The guide may not be amenable to today's software industry. For any reasonably complex system, source code and other internal, company-confidential documents may be required. It also seems like few companies would be willing to commit to the "results reporting" step unless the evaluation truly guarantees reproducibility (e.g., like TPC-C). However, the necessarily-subjective items would prevent such reproducibility.
I haven't seen the case studies mentioned by the author, so I can't speak to the domain of applicability nor to the effectiveness. I don't know how many man-hours are required for such an evaluation, nor how automatable the evaluation really is. However, many items seem to not allow automated evaluation.
Finally, I don't know whether independent labs using this guide would do much more for the software industry than PC Magazine, Gartner, etc. do when they publish their research. It is possible that these organizations already use parts of this guide in their work.

Summaries may be used for non-commercial purposes only, provided the summary's author and origin are acknowledged. For all other uses, please contact us.