Verifying Results based on Precision, Recall and False-Positive-Rate
Recently, I read a paper which reported quite impressive results on a time series anomaly detection task covering several time series. The paper has already been cited by many other researchers, and the presented algorithm really seems to perform quite well. However, after doing some calculations, it appears that the results were presented in a way which is a little misleading and strange. But let us take a closer look.
The table of results looked something like this:
| | Recall | Precision | FPR |
|---|---|---|---|
| time series 1 | 0.2555 | 0.9559 | 0.0118 |
| time series 2 | 0.5109 | 0.9909 | 0.0047 |
| time series 3 | 0.0953 | 0.9231 | 0.0079 |
| time series 4 | 0.9986 | 0.9930 | 0.0070 |
At first glance, the results are not that easy to interpret. What we can see is that the recall is not that high on time series 1 & 3 (only a small fraction of the anomalies was actually detected), while the precision is consistently high.
Similar to typical classification tasks, for time series anomaly detection problems an algorithm has to classify each time series sample as either anomalous (unusual) or normal (usual). Commonly, correctly identified anomalies and normal instances are counted as true-positives (TP) and true-negatives (TN), respectively. Misclassifications are accordingly referred to as false-positives (FP) and false-negatives (FN): either normal instances are falsely flagged as anomalous (FP), or the algorithm fails to detect real anomalies (FN). Due to the large number of TN in anomaly detection tasks, this count is usually not reported. Based on these quantities, the metrics listed in the above table can be computed:

\begin{align}
\mathrm{precision} &= \frac{TP}{TP + FP} \\
\mathrm{recall} &= \frac{TP}{TP + FN} \\
\mathrm{FPR} &= \frac{FP}{FP + TN}
\end{align}
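As a quick illustration of these definitions, here is a minimal Python sketch; the confusion-matrix counts are made up purely for illustration and are not taken from the paper:

```python
# Hypothetical confusion-matrix counts (illustrative only, not from the paper)
TP, FP, FN, TN = 40, 4, 10, 946

precision = TP / (TP + FP)  # fraction of flagged points that are real anomalies
recall = TP / (TP + FN)     # fraction of real anomalies that were detected
fpr = FP / (FP + TN)        # fraction of normal points that were falsely flagged

print(f"precision={precision:.4f}, recall={recall:.4f}, FPR={fpr:.4f}")
```

Note how the FPR stays tiny here: TN dominates its denominator whenever normal points vastly outnumber anomalies.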
Usually, the false-positive rate (FPR) would also not be reported for anomaly detection tasks, since its value will be rather small due to the fact that most of the data is normal (TN is very large).
Now that we have the three equations for precision, recall and FPR, it would be interesting to know the actual quantities TP, FP and FN behind the above table.
Let us start with finding a way to compute TP: in order to compute the actual number of true-positives (correctly detected anomalies) from precision, recall and false-positive-rate, we derive a formula in the following.
First, let us re-write the equations for precision and recall, solving for FP and FN:

\begin{equation}
FP = TP \cdot \frac{1 - \mathrm{precision}}{\mathrm{precision}}
\label{ctp:fp}
\end{equation}

\begin{equation}
FN = TP \cdot \frac{1 - \mathrm{recall}}{\mathrm{recall}}
\label{ctp:fn}
\end{equation}
Also the FPR can be written in a slightly different form:

\begin{equation}
\mathrm{FPR} = \frac{FP}{FP + TN} = \frac{FP}{N - TP - FN}
\label{ctp:fpr}
\end{equation}
where $N$ is the overall number of examples in the data set (the sum over all elements of the confusion matrix). Now we insert \eqref{ctp:fp} and \eqref{ctp:fn} into \eqref{ctp:fpr} and solve for TP:

\begin{equation}
TP = \frac{\mathrm{FPR} \cdot N}{\frac{1 - \mathrm{precision}}{\mathrm{precision}} + \frac{\mathrm{FPR}}{\mathrm{recall}}}
\end{equation}
If $N$ is unknown, we can still compute the fraction (between 0 and 1) of the data points which are true-positives:

\begin{equation}
TP_{\%} = \frac{TP}{N} = \frac{\mathrm{FPR}}{\frac{1 - \mathrm{precision}}{\mathrm{precision}} + \frac{\mathrm{FPR}}{\mathrm{recall}}}
\end{equation}
With equations \eqref{ctp:fp} and \eqref{ctp:fn} we can also retrieve the values for FP and FN. Again, if $N$ is unknown, we can write these quantities as fractions of the overall number of data points:

\begin{equation}
FP_{\%} = TP_{\%} \cdot \frac{1 - \mathrm{precision}}{\mathrm{precision}}, \qquad
FN_{\%} = TP_{\%} \cdot \frac{1 - \mathrm{recall}}{\mathrm{recall}}
\end{equation}
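This inversion fits in a few lines of Python. The round-trip check below confirms that the recovered fractions reproduce exactly the metrics we started from; the input values are arbitrary illustrative numbers, not taken from the paper:

```python
def confusion_fractions(precision, recall, fpr):
    """Recover TP, FP and FN as fractions of the data set size N
    from precision, recall and false-positive rate."""
    tp = fpr / ((1 - precision) / precision + fpr / recall)
    fp = tp * (1 - precision) / precision
    fn = tp * (1 - recall) / recall
    return tp, fp, fn

# Round-trip check with arbitrary example values
tp, fp, fn = confusion_fractions(0.95, 0.25, 0.01)
tn = 1 - tp - fp - fn  # whatever is left over must be the true-negatives

assert abs(tp / (tp + fp) - 0.95) < 1e-12  # precision recovered
assert abs(tp / (tp + fn) - 0.25) < 1e-12  # recall recovered
assert abs(fp / (fp + tn) - 0.01) < 1e-12  # FPR recovered
```

Since the three metrics pin down the confusion matrix only up to the overall scale $N$, working with fractions loses nothing.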
So what does this mean for the values presented in the above table? Let us do the calculations:
| | Recall | Precision | FPR | TP_{%} | FP_{%} | FN_{%} | TP_{%} + FN_{%} |
|---|---|---|---|---|---|---|---|
| time series 1 | 0.2555 | 0.9559 | 0.0118 | 12.78 | 0.59 | 37.24 | 50.03 |
| time series 2 | 0.5109 | 0.9909 | 0.0047 | 25.57 | 0.23 | 24.48 | 50.04 |
| time series 3 | 0.0953 | 0.9231 | 0.0079 | 4.75 | 0.40 | 45.12 | 49.88 |
| time series 4 | 0.9986 | 0.9930 | 0.0070 | 49.79 | 0.35 | 0.07 | 49.86 |

The first interesting observation is that for each time series more than 4% of the data points are correctly classified as anomalous (column TP_{%}). For time series 4, that means almost 50% of the data is anomalous! If we sum up true-positives and false-negatives, we get the overall share of anomalies in the data (column TP_{%} + FN_{%}). For every single time series, roughly half of all the data is anomalous!? This is rather strange, since anomalies are usually very rare events. In this setup we could simply build a naive anomaly detection algorithm which classifies every single data point as anomalous: it would achieve perfect recall and still be right about half of the time. Apparently, many people did not read the paper carefully, which presented these results…
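The back-computation for the whole table can be scripted directly; the input triples are the metric values reported in the paper, and the fractions are printed as percentages:

```python
# Reported metrics from the paper, per series: (recall, precision, FPR)
series = {
    "time series 1": (0.2555, 0.9559, 0.0118),
    "time series 2": (0.5109, 0.9909, 0.0047),
    "time series 3": (0.0953, 0.9231, 0.0079),
    "time series 4": (0.9986, 0.9930, 0.0070),
}

results = {}
for name, (recall, precision, fpr) in series.items():
    # Fractions of the data set, via the formulas derived above
    tp = fpr / ((1 - precision) / precision + fpr / recall)
    fp = tp * (1 - precision) / precision
    fn = tp * (1 - recall) / recall
    results[name] = (tp, fp, fn)
    print(f"{name}: TP={tp:7.2%} FP={fp:6.2%} FN={fn:7.2%} anomalies={tp + fn:7.2%}")
```

Running this reproduces the table above: the implied anomaly share (TP + FN) lands near 50% for every series.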