Guides to Software Evaluation
Authors: Hans-Ludwig Hausen, Dieter Welzel
Arbeitspapiere der GMD 746, April 1993, 92 pages
Describes a framework for measuring, assessing, and certifying the
quality of a software product; reported information is meant to be
equally useful to software producers, vendors, and users.
- According to one of the authors, the guide was used on about 100
industrial case studies in the European Union and has been adopted
by the European Space Organization after the Ariane-5 disaster to
improve software quality. The framework is intended to be useful in
evaluating a wide range of products, from off-the-shelf packages
to custom/commissioned software, to embedded systems.
- The evaluation procedure needs to be cost-effective and run by an
independent 3rd party, such as specialized testing labs. This
software evaluation guide is meant to be a handbook for such labs,
as well as producers/purchasers of software.
- There are 4 principles guiding the authors:
- repeatability: repeated evaluation yields same results
- reproducibility: evaluation by different labs gives
- objectivity: there is a minimal amount of subjective
- The evaluation procedure consists of 5 steps (italicized terms are
- The lab's client (e.g., a software vendor) states the
evaluation requirements, including evaluation levels,
and the lab analyzes them.
- Lab produces an evaluation specification. For example, the
spec describes evaluation techniques to be used for each
software characteristic: black box testing + glass box
+ UI inspection + algorithmic complexity analysis + static
analysis + design process evaluation. The client can accept
the spec or withdraw from evaluation process.
- Lab designs the evaluation process based on spec; client
accepts or withdraws.
- Lab performs evaluation. This may involve manual processes,
computer aided processes (e.g., using a check list manager for
applying check lists), as well as automatic evaluation (e.g.,
measuring complexity using static analyzer).
- Lab reports results; client accepts or lodges an appeal in
court or some legal forum (results become public information
- An evaluation level represents the depth/thoroughness of
the evaluation techniques and results (w.r.t. safety, security,
economic risk, availability, and application constraints). Evaluation
levels are named A, B, C, D and are defined for each of these aspects:
- safety: can range from a safety breach causing "small damage
to property + no risk to people" to "unrecoverable
environmental damage + many people killed".
- economy: ranges from "negligible loss" to "financial disaster
that ruins company".
- security: defined in terms of ITSEC assurance levels.
- availability: ranges from "up on request/demand" to "no down
time at all".
- application domain: e.g., "office automation + entertainment",
"air and railway systems", etc.
- Software characteristics are: functionality, reliability,
usability, efficiency, maintainability, and portability.
- An evaluation module encapsulates atomic evaluation
procedures (e.g., "check whether system automatically proposes
corrections to clearly correctable user-errors?"), metrics and
evaluation levels for each software characteristic (for the
example shown above, "yes=2, some=1, no=0"), a description of the
assessment procedure, and a format for reporting results and
- The information required by the lab for the evaluation can
include: product information (user handbook, design docs, object
code, code listings, etc.), development process information
(management report, quality assurance report, project
B shows an evaluation module in the form of a usability
checklist. It is based on the following usability metrics:
installability from scratch, learnability, use efficiency,
interface customizability, experienced user migration ease.
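To make the evaluation module idea concrete, here is a minimal sketch of a
checklist-style module using the "yes=2, some=1, no=0" scale quoted above.
The class names, the per-metric normalization, and the assignment of the
example question to the "use efficiency" metric are my own assumptions for
illustration; the guide's actual module format and aggregation rules may
differ.

```python
from dataclasses import dataclass, field

# Scoring scale quoted in the summary: yes=2, some=1, no=0.
SCORES = {"yes": 2, "some": 1, "no": 0}
MAX_SCORE = max(SCORES.values())


@dataclass
class ChecklistItem:
    """One atomic evaluation procedure, phrased as a question."""
    metric: str    # e.g. "learnability"
    question: str


@dataclass
class EvaluationModule:
    """Hypothetical container for a checklist-style evaluation module."""
    characteristic: str  # e.g. "usability"
    items: list[ChecklistItem] = field(default_factory=list)

    def assess(self, answers: dict[str, str]) -> dict[str, float]:
        """Return, per metric, the fraction of the maximum score achieved.

        `answers` maps each question to "yes", "some", or "no".
        Normalizing per metric is an assumption of this sketch.
        """
        earned: dict[str, int] = {}
        possible: dict[str, int] = {}
        for item in self.items:
            answer = answers[item.question]  # KeyError if a question is unanswered
            earned[item.metric] = earned.get(item.metric, 0) + SCORES[answer]
            possible[item.metric] = possible.get(item.metric, 0) + MAX_SCORE
        return {metric: earned[metric] / possible[metric] for metric in earned}


# Usage: the first question comes from the summary; its metric is assumed.
question = ("Does the system automatically propose corrections "
            "to clearly correctable user errors?")
module = EvaluationModule("usability", [ChecklistItem("use efficiency", question)])
report = module.assess({question: "some"})
print(report)
```

Even in this toy form, the answer "some" is a judgment call by the
evaluator, which is where two labs could diverge on the same product.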
- Lays out a reasonably well thought-out process that can be used in
evaluating software. However, this seems like a more useful tool
for the software producers themselves, as part of their internal
reviews, rather than for customers, because many of the features
in the evaluation may not be things they care about. Hence, it's
unclear to what extent one could compare two products on the basis
of their evaluations.
- In doing so, the authors employ existing/available tools and
technology, which makes the guide easier to adopt. They don't
invent new languages or use exotic tools.
- It is a realistic guide, in that it allows for reasonably useful
reviews to be conducted for reasonable software. It is clearly
not the result of a committee. The approach is rather holistic
and end-to-end, much unlike most evaluation processes I've seen so
far (e.g., verification and validation).
- Would certainly be useful and have an impact on the industry
if all major players agreed upon it.
Some of these are not necessarily flaws of this guide, but rather
issues that would affect any process that tried to evaluate software
- The development of evaluation modules is closely tied to existing
computing paradigms (e.g., windowing interfaces for UI's). As
such, they might become obsolete. But worse, the existence of
some "standard" evaluation modules may thwart innovation (e.g.,
see the industry's focus on TPC-C/W performance). But, "for
better or worse, benchmarks shape a field" (Dave Patterson).
- In step 1, "analyzing evaluation requirements," the guide suggests
that in the case of complex systems, the lab should closely
collaborate with its customer in order to reduce costs; this,
however, can seriously compromise objectivity.
- Some of the evaluation procedures are inherently subjective and
there is no way around it. E.g., in the usability module, there
are a few metrics that evaluate the "understandability" and
"clarity" of documentation.
- Impartiality can also be a problem. In the usability module, for
the "check that all required installation files are present"
metric, there is a note saying "The architecture of the product
may make it impossible for the evaluator to directly test for this
feature. If so, the evaluator should answer this question by
querying the developer."
- The guide may not be a good fit for today's software industry. For
any reasonably complex system, source code and other internal,
company-confidential documents may be required. It also seems
like few companies would be willing to commit to the "results
reporting" step unless the evaluation truly guarantees
reproducibility (e.g., like TPC-C). However, the
necessarily-subjective items would prevent such reproducibility.
- I haven't seen the case studies mentioned by the author, so I
can't speak to the domain of applicability nor to the
effectiveness. I don't know how many man-hours are required for
such an evaluation, nor how automatable the evaluation really is.
However, many items do not seem to allow automated evaluation.
- Finally, I don't know whether such independent labs using this
guide would do much more for the software industry than PC
Magazine, Gartner, etc. do when they publish their research. It
is possible that these organizations already use parts of this
guide in their work.