To get a feeling for how much testing has already been done, code coverage is often used as a benchmark for how much of the system under test is covered by automated test cases. When particular code-coverage benchmarks have to be reached, management often believes that this lets them fully control how well a piece of software is tested.
However, this assumption quickly proves to be simply wrong.
The limits of coverage-based statements about test cases quickly emerge when you think about the very thing code coverage is measured against: the code itself.
In this article, I showcase the limits at which code coverage fails catastrophically by implying claims about an evaluated suite of test cases that are simply wrong.
These examples should get you thinking about why (or actually whether) you should ever rely on code coverage results again.
But first, let’s agree on a common understanding of terminology: when I talk about “coverage” in the following, I mean “the number of statements that are executed by running a particular set of automated test cases, divided by the overall number of statements in the system” (more compactly: #statements run during test execution / #overall statements). As there are various terms and definitions for different coverage aspects out there, I will rely on this general definition throughout this article. However, the issues are the same under any other name you could give the problem (the problem does not change because you give it a different name).
So now, let us have a look at some examples:
Figure 1: neat implementation
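Since the original figure is not reproduced in this text, here is a hypothetical Java sketch of what a “neat” implementation along the lines described below might look like (class and method names are my own invention, and the statement counts in the original figure may differ):

```java
// Hypothetical sketch of the "neat" implementation from Figure 1 (assumed, not the original).
public class MyClass {
    private final int value; // the property the equality check is based on

    public MyClass(int value) {
        this.value = value;
    }

    // A compact equality check based solely on the value property.
    // (The null guard is my addition to keep the sketch safe to call.)
    public boolean equalsValue(MyClass that) {
        if (that == null) {
            return false;
        }
        return this.value == that.value;
    }

    public static void main(String[] args) {
        MyClass a = new MyClass(42);
        MyClass b = new MyClass(42);
        System.out.println(a.equalsValue(b)); // prints "true"
    }
}
```

A test case that only exercises the equal-values path leaves the other statements of this method unexecuted, which is what drives the partial coverage figure discussed below.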
Figure 2: spaghetti implementation
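Again, as the figure itself is not shown here, the following is a hypothetical sketch of a “spaghetti” variant: functionally identical to the neat version, but padded with redundant statements that inflate the coverage ratio for the very same test case (naming and exact statement count are my assumptions):

```java
// Hypothetical sketch of the "spaghetti" implementation from Figure 2 (assumed, not the original).
public class MyClassSpaghetti {
    private final int value;

    public MyClassSpaghetti(int value) {
        this.value = value;
    }

    // Same check as the neat version, but with redundant statements:
    // almost all of them are executed by a single equal-values test case,
    // so statement coverage looks much higher without better testing.
    public boolean equalsValue(MyClassSpaghetti that) {
        boolean result = false;         // redundant initialization
        int left = this.value;          // redundant local copy
        int right = that.value;         // redundant local copy
        boolean same = (left == right); // intermediate flag
        if (same) {
            result = true;
        } else {
            result = false;             // redundant branch
        }
        boolean finalResult = result;   // redundant copy
        return finalResult;
    }

    public static void main(String[] args) {
        MyClassSpaghetti a = new MyClassSpaghetti(7);
        MyClassSpaghetti b = new MyClassSpaghetti(7);
        System.out.println(a.equalsValue(b)); // prints "true"
    }
}
```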
Figure 3: incorrect implementation
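For the incorrect implementation, the text below tells us that the method returns false even when the values are equal, and that a single test case already covers all of its statements. A hypothetical sketch consistent with that description (names are mine) could be as simple as:

```java
// Hypothetical sketch of the incorrect implementation from Figure 3 (assumed, not the original).
public class MyClassBroken {
    private final int value;

    public MyClassBroken(int value) {
        this.value = value;
    }

    // Defective: the value comparison is missing entirely, so the method
    // always returns false. Any single test case executes 100 % of its
    // statements, regardless of whether that test case passes or fails.
    public boolean equalsValue(MyClassBroken that) {
        return false;
    }

    public static void main(String[] args) {
        MyClassBroken a = new MyClassBroken(42);
        MyClassBroken b = new MyClassBroken(42);
        System.out.println(a.equalsValue(b)); // prints "false"
    }
}
```

Note that coverage cannot see the missing comparison: there is no unexecuted statement left to point at.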
The figures above show three different implementations of an equals method. The requirement for this method is to check whether a given instance of an imaginary MyClass matches the current instance of MyClass (referenced via this), based on the value property of MyClass. The example is reduced to what is relevant for this article; however, the statements made generalize to more complex implementations.
In Figure 1, a simple test case in which this.value == that.value yields a coverage of 2/5 = 40 %. In Figure 2, the unnecessary and redundant statements executed by the same test case lead to a coverage of 9/10 = 90 %. In Figure 3, where the implementation of the method is obviously incorrect, this same test case leads to a coverage of 100 %, although the test case fails (because the method returns false instead of true). If, however, the evaluated test case used MyClass instances with different values (this.value != that.value), coverage would still be 100 %, with no failing test case. But does this coverage of 100 % really imply that testing is “complete”?
Although more sophisticated coverage metrics such as branch or path coverage might mitigate fallacies like the ones described in this example, the underlying problem remains: syntactic test case evaluation is an inaccurate proxy for the completeness of software testing. In general, this implies that (1) metric values can easily be manipulated, (2) code style (the syntactic realization rather than the semantic interpretation of the software) has a huge impact on the metric, (3) bad code style (code smells) is often incentivized, even though it is generally considered good practice to realize a particular functionality in few lines of code, and (4) metrics do not consider code that is missing (on purpose or because of a defect, as shown in Figure 3), because the syntax does not allow statements about what should be there.