Legacy intelligence automation, from code to application to help improve software quality. Asetechs Softwares for legacy applications quality assessment and renovation


Your Account | Your Cart | Register


  • Interpretring Halstead and McCabe measures S.Q.MEASUREMENTS : Interpretation of Halstead measures, article 1 of 2.

    (back to initial page)

KRIS Description
  • Products overview
  • Support and services
KRIS Features
  • KRIS Code Reviewer (CR)
  • KRIS Code Normalizer (CN)
  • KRIS Dead Code Cleaner
  • KRIS Extractor xmi/cwm/kdm
  • KRIS CR measures collected
Learn More
  • KRIS video tour
  • McCabe revisited (discussion)
Contact us
  • Further information
  • Asetechs Home |
  • Selected articles

KRIS on MVS for System, Applications and Code inspection is available in Q1 2011 for: COBOL, C, CICS, IMS-DC, IMS-DB, JCL (IBM), SQL (DB2, ORACLE), VSAM. Other languages available only on demand ( PL/1 with preprocessing, Natural, Adabas, Rexx, Clist).

ERROR! You do not have the required version of Flash Player.

Get Adobe Flash player
Next article
Beginning article

How Halstead's metrics can be useful ?

There are 2 main types of software reliability models: the deterministic and the probabilistic (random event is involved). In general, these models represent a growing quantitative approach to the measurement of computer software.

Two well known models of the deterministic type are the Halstead's software metric and the McCabe's cyclomatic complexity metric.

Halstead's software metric is used to estimate the number of errors in a program. (see article about Halstead´s software metrics)

McCabe's cyclomatic complexity metric (McCabe 1976) is used to determine an upper bound on the model for estimating the number of remaining defects.(see article about McCabe´s software metric)

For software qualification, it is highly desirable to have an estimate of the remaining errors in a software system. It is difficult to determine such an important finding without knowing what the initial errors are. Research activities in software reliability engineering have been studied over the past 30 years and many statistical models and various techniques have been developed for estimating and predicting reliability of software and numbers of residual errors in software. From historical data on programming errors, there are likely to be about 8 errors per 1000 program statements after the unit tests. This, of course, is just an average and does not take into account any tests on the program.

We present in the next 2 articles (Halstead) and (McCabe) some historical results obtained thanks to the utilization of the Halstead's and McCabe software metrics and their interpretation(s) in order to answer the questions: are the Halstead's software metrics and the McCabe's software metrics useful ? how ?

Previous article
Beginning of article

Halstead's software metric interpreted.

Content

  • Halstead's software metric description
  • General analysis of Halstead's software metrics
    • The theory behind Halstead's software science
    • Empirical evidence
    • SQM based on Halstead's software metrics: is it useful and reliable ?

Halstead's software metric description

Halstead's theory of software metric is probably the best-known technique to measure complexity in a software program and the amount of difficulty involved in testing and debugging the software. Halstead (1977) uses the number of distinct operators and the number of distinct operands in a program to develop expressions for the overall program length, volume, and the number of remaining defects in a program.

The following notations are used:

  • $n_{1} = $ number of unique or distinct operators appearing in a program.
  • $n_{2} = $ number of unique or distinct operands.
  • $n = n_{1} + n_{2}$, this is the vocabulary.
  • $N_{1} = $ total number of operators (implementation).
  • $N_{2} = $ total number of operands (implementation).
  • $N = N_{1} + N_{2}$, this is the counted (or implementation) length (KRIS delivers the counted length for each program it analyzes) .
  • $N' = \left( n_{1} \times Log_{2}\left(n_{1}\right) + n_{2} \times Log_{2} \left(n_{2}\right) \right)$, this is the calculated (or estimated) length.
  • $V = N \times Log_{2} \left(n1+n2\right)$, this is the volume.
  • $Ê_{1} = $ calculated (or estimated) number of errors (or bugs) in a program.
    • model 1 : $ \begin{equation} Ê_{1} = \frac{V}{3200} \end{equation} $
    • model 2 : $ \begin{equation} Ê_{2} = \frac{A}{3200} \end{equation} $ where $A = \left(\frac{V}{\left(\frac{2\times n_{2}} {n_{1}\times N_{2}}\right)}\right)^{\frac{2}{3}}$
      • The value “3,200” in the 2 equations above represents the Halstead number of discriminations per bug. Experimentation has found that 3,200 discriminations per bug is a typical value.

The Halstead measures computed by KRIS are the following: distinct operators, distinct operands, total operators, total operands, vocabulary, counted length, volume, difficulty, effort. From these values all other Halstead measures can be deduced.

KRIS computes all the Halstead measures required to proceed with the interpretation of results presented here.

General analysis of Halstead's software metrics

Criticism of Halstead's software metrics falls into two categories:

  • criticism of the software science theory, and
  • criticism of the empirical results used to confirm the theories.

In the former case, researchers look for consistent logic among the equations, analyze whether the assumptions used to derive them are valid, and determine whether the metrics measure the software attributes they purport to measure. In the latter, empirical results are analyzed and critiqued.

To determine whether the metrics measure these software attributes, experiments must be conducted. From early experiments, several criticisms pointed out weaknesses in the experimental evidence. Later experiments have provided better results.

The theory behind Halstead's software science

In general, the original theory addressed metrics that would analyze algorithms, not programs. Since algorithms only contain operators and operands, it was easy to form a counting strategy because classifying each was fairly straightforward.

In certain languages it is difficult to determine whether some tokens should be operators or operands. The meaning may depend on how the token is used at execution time, or on information given in the declaration section.

Since operator and operand counts originally did not include declaration sections and other non-executable code, the counts could yield inaccurate results if they were used to measure effort or time. In a COBOL application, for example, the extensive declaration section would not be counted. Yet this section requires a significant amount of programmer effort since it can account for a major portion of a program.

To address this problem, it is necessary to establish clearly the rules for defining the operators and operands of the implemented language. If a large application uses several languages, rules should be established for each language based on the constructs of that language. While most measures purportedly are language-independent, it is critical to define the counting strategy to accommodate the syntax peculiarities of a particular language. If automatic counting tools or analyzers are developed for this purpose, documentation on the tools should include a description of the counting methods used.

While mathematical derivations have been performed on all of Halstead's equations, researchers have questioned the assumptions underlying some of the variables. Researchers have also questioned the derivations. For example, Halstead bases many of the relationships on the assumption that programmers use the "binary search" method when they create algorithms or programs. He further contends the binary search is the most efficient to use on an ordered list (Halstead 1977). Halstead used this relationship in many of his fundamental equations, including the Program Length and Program Volume.

Critics question the validity of the use of a binary search mathematical component in some of Halstead's equations. Although a software program may use a binary search to select items, it is not clear that people, in general, use mental processes that can be likened to such a search. Nor is it clear that programmers, in particular, use binary searches when selecting their Vocabulary for writing a program. However, experimental evidence suggests that Program Length and Program Volume equations (which include this assumption) provide monotonic, repeatable relationships, providing the counting methods employed are used consistently.

The explanations for some of the derivations are not given. In the Implementation Length equation, Halstead provides no reason for dividing a program of Length N into N/η substrings. While this equation cannot be justified theoretically, its validity has been established through experiments.

Empirical evidence

Although the theoretical foundation of some of Halstead's metrics is suspect, it is possible to determine whether the formulas approximate real values by empirically validating the metrics. The number of empirical studies performed over 25 years on his metrics substantiate their utility.

Researchers have criticized early experiments conducted by Halstead and others for the following reasons:

  • The sample sizes were too small, so there was not enough data to infer much from the experiment. Also, the small sample constrained the use of parametric statistics.
  • The programs were small. All except one were single modules containing fewer than 50 lines.
  • The programmers were generally college students. Hence, the results may not apply to professional programmers.

While these first criticisms were well-founded because the work had been conducted in a university setting, further testing has validated most of Halstead's equations.

The Estimated Length equation is a good indicator of the Implementation Length. In analyzing 1,637 modules, the relative error between N and N' was less than six percent, although the error can be larger for separate modules. This measure is most accurate when programs range in size from 2,000 to 4,000 words (Shen, Conte, and Dunsmore 1983). For larger applications, the error of the length equation can be minimized by dividing a program into separate modules and then summing the individual module estimates. For programs that are much smaller than the minimum range listed, inlining subroutines may help to create larger modules that would minimize the error.

Ottenstein (1981) conducted a study based on 11 programs written by first year graduate students for a compiler class. She tested the correlation between the Estimated Length and the Implementation Length. The study used three counting methods. In the first method, operators and operands were computed for each routine and then were summed over the entire module. If an operatoror operand occurred in two different routines, it was counted as unique in each. N1 and N2 were calculated by counting the total number of occurrences of operators and operands in the module.

The second method was similar to the first but did not consider routine boundaries. If a variable was global, it was counted only once. Operators were considered in the same way.

The third method estimated the number of operators used in the program, leaving unknown only the number of distinct operators that would be used in the code. The results from this study indicate that N' (estimated length) is a good estimator of N (implementation length).

The results from this study indicate a high correlation exists between Halstead's Estimated Length and Implementation Length. The average correlation coefficient was 0.968 (significant at the 0.001 level) between the Estimated Length and the Implementation Length.

Bulut and Halstead (1974) analyzed 429 FORTRAN programs from a library at the Purdue University Computing Center. The total sample had 242,990 occurrences of operators and operands, with individual program lengths ranging from seven to 10,661 occurrences. The coefficient of correlation between the observed and predicted lengths was 0.95.

Elshoff (1976) tested the length estimator on 154 programs. The correlation between actual and estimated lengths was 0.985 for structured programs and 0.976 for non-structured programs, further validating this metric.

Studies have been performed to test the number of bugs (errors) estimate, Akiyama's study (1971) involved a six module software project. The correlation coefficient between number of predicted bugs and actual bugs was 0.95 (significant at the 0.01 level). Another study on implementing an assembler in PL/1 (Ottenstein 1981), calculated a correlation coefficient between the predicted number of bugs (errors) and actual number of bugs of 0.828 (significant at the 0.02 level).

Although critics have raised serious questions about the validity of some of the assumptions underlying Halstead's metrics, the metrics have been substantiated by almost two decades of empirical studies. In spite of the theoretical questions on the derivations of some of the formulas and on the assumptions built into them, field studies have supported their validity.

SQM based on Halstead's software metrics: is it useful and reliable?

Several of Halstead's software metrics have been applied, and have been substantiated by empirical study to correlate with SQFs (Software Quality Factors). Quantitative studies indicate that the Implementation Length, N, has been successfully correlated with maintainability, and number of bugs with reliability. Implementation Length is also correlated with flexibility, testability, portability, reusability, and interoperability (McCall, Richards, and Walters 1977; and Bowen, Wigle, and Tsai 1985). This software metric may be related to understandability, since this SQF is a subset of maintainability. Further, since understandability is complementary to complexity, the measure is related to complexity (Munson and Khoshgoftaar 1990).

Volume and Programming Effort have also produced several SQM. They have each been successfully correlated with complexity, maintainability, modifiability, modularity, number of bugs, reliability, simplicity, and understandability.

Implementation Length, Volume, and Effort form the basic software metric set, from which the other Halstead measures are derived. While correlational and other empirical studies quantify several attributes of software, do these measures suffice to give the overall software quality of a product? Their limitations suggest that they do not cover all the attributes that compose even one quality factor.

Halstead's software metrics provide practical SQM. When developers use his metrics, the resulting SQM indicate the structural and psychological complexity of a module. The software metrics, however, are all built on the fundamental counts of operators and operands. The resulting SQM thus have a narrow focus because the measures are not context-sensitive. The metrics do not address other aspects of code complexity, such as control flow (as the McCabe'software metric does, see Cyclomatic Complexity paper).

While the empirical evidence supports the SQM-claim that Halstead's metrics measure an aspect of complexity, the metrics are primarily concerned with the size or volume of an application. While size is an indicator of complexity, it is not the only aspect of this quality factor. The developer should consider combining metrics to capture more of the total quality factor complexity.

As a partial indicator of code complexity, Halstead's SQM are well-supported and provide a good, disciplined approach to aid in developing software. They do not, however, provide structural test coverage analysis.

 
  • © Copyright 2001-2012 Asetechs
  • Privacy Policy - Contact us

Valid XHTML 1.0 Strict