April 29, 2024 by Iain Wilson
Measured concentrations below the limit of detection (LoD) or limit of quantification (LoQ) can be an issue when performing a chemical risk assessment. Numerous methods have been applied to such censored data, ranging from simple (non-statistical) approaches, for example ignoring censored values or substituting half the LoD (the substitution method), to more complex statistical methods including Maximum Likelihood Estimation and Kaplan-Meier analysis. The Regression on Order Statistics (ROS) method straddles these two areas: it is a statistical method that can nevertheless be readily applied.
Theory and Application
To apply the ROS method, a distribution must be assumed for the dataset to be analysed. In the absence of information to the contrary, a great deal of evidence supports the assumption of a lognormal distribution for environmental data sets. On this basis, the natural logarithm of each data point is calculated, the transformed data are sorted in ascending order, with blanks for the censored values, and rank orders are assigned. A Q-Q plot is then produced (Figure 1); the y-intercept and slope of the fitted straight line provide the mean and standard deviation, respectively, of the log-transformed population.
Figure 1 Example Q-Q plot
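The Q-Q regression step described above can be sketched in a few lines of Python. This is a minimal illustration rather than the tool described in this post: the function name ros_fit is hypothetical, a single detection limit is assumed (so all non-detects occupy the lowest ranks), and median-rank plotting positions (Benard's approximation) are an assumed choice.

```python
import math
from statistics import NormalDist, mean

def ros_fit(detects, n_censored):
    """Fit ln(concentration) against normal quantiles using detected values only.

    Assumes a single detection limit, so the non-detects take ranks
    1..n_censored. Returns (mu_log, sigma_log): the intercept and slope
    of the Q-Q line, i.e. the mean and sd of the log-transformed data.
    """
    n = len(detects) + n_censored
    logs = sorted(math.log(x) for x in detects)
    # Ranks of the detected values start after the censored block.
    ranks = range(n_censored + 1, n + 1)
    # Median-rank plotting positions (Benard's approximation): an assumption.
    z = [NormalDist().inv_cdf((r - 0.3) / (n + 0.4)) for r in ranks]
    # Ordinary least squares of log-concentration on normal quantile.
    z_bar, y_bar = mean(z), mean(logs)
    sxx = sum((zi - z_bar) ** 2 for zi in z)
    sxy = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, logs))
    slope = sxy / sxx
    intercept = y_bar - slope * z_bar
    return intercept, slope
```

For example, ros_fit([1.2, 1.9, 2.4, 3.8, 5.1], n_censored=3) fits the line to five detected values out of a sample of eight. Data sets with several detection limits need a more careful assignment of plotting positions than this sketch provides.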
These values can then be back-transformed to real-world values using defined equations and used as the basis of the (estimated) uncensored data distribution, shown as a Cumulative Distribution Function (CDF) (Figure 2).
Figure 2 Cumulative Distribution Function (CDF) plot
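The post does not state which equations are used for the back-transformation. One common choice, assumed here, is the standard pair of lognormal moment formulas, which convert the log-scale mean and standard deviation into their real-world (arithmetic) counterparts. The function name lognormal_moments is hypothetical.

```python
import math

def lognormal_moments(mu_log, sigma_log):
    """Arithmetic mean and sd of a lognormal distribution.

    mu_log and sigma_log are the mean and sd of ln(concentration),
    e.g. the intercept and slope of the Q-Q line.
    """
    arithmetic_mean = math.exp(mu_log + sigma_log ** 2 / 2)
    arithmetic_sd = arithmetic_mean * math.sqrt(math.exp(sigma_log ** 2) - 1)
    return arithmetic_mean, arithmetic_sd
```

A useful consequence of these formulas is that the coefficient of variation depends only on the log-scale standard deviation: CoV = sqrt(exp(sigma_log^2) - 1).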
In the CDF, the distribution of the non-detect observations can be plotted using the inverse lognormal distribution against the median rank values corresponding to the non-detect values. This “invented” section of the CDF is not strictly part of the ROS output and is shown merely as an aid to the visualisation of the distribution of lower percentiles.
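As a rough sketch of how the non-detect section of the CDF could be generated, the snippet below evaluates the inverse lognormal distribution at the median rank of each censored observation. The function name is hypothetical, a single detection limit is assumed, and Benard's approximation is again an assumed formula for the median ranks.

```python
import math
from statistics import NormalDist

def non_detect_cdf_points(mu_log, sigma_log, n_total, n_censored):
    """Return (estimated value, cumulative probability) for each non-detect.

    Assumes the non-detects occupy ranks 1..n_censored of n_total.
    """
    std_normal = NormalDist()
    points = []
    for rank in range(1, n_censored + 1):
        # Median rank (Benard's approximation): an assumption.
        p = (rank - 0.3) / (n_total + 0.4)
        # Inverse lognormal CDF evaluated at p.
        value = math.exp(mu_log + sigma_log * std_normal.inv_cdf(p))
        points.append((value, p))
    return points
```

The returned points can be drawn alongside the detected values to complete the lower end of the plot.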
Simulations of systematic and random error of different approaches to non-detects
A comparison between the estimates of population statistics (mean and standard deviation (sd)) obtained using ½ limit value substitution, ROS and ignoring non-detects has been performed (Figure 3). The simulations involve sampling from a lognormal population of data (mean 10, with the coefficient of variation (CoV = sd/mean) varying from a relatively precise value of 0.1 to a large value of 1.0). Each trial involved sampling 40 data points and using these to estimate the (known) mean and standard deviation of the population. This was repeated 500 times and the average values of the required statistics were calculated.
Figure 3 Simulations of errors associated with different treatments of non-detects
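The structure of such a simulation can be sketched as follows. This is an illustration only, not the code behind Figure 3: it covers just the two simple treatments (½ LoD substitution and ignoring non-detects), the function name is hypothetical, and it assumes a fixed detection limit placed at the population quantile that gives the target level of censoring. A ROS estimator could be added to the loop in the same way.

```python
import math
import random
from statistics import NormalDist, mean

def simulate_mean_bias(cov, censor_fraction, n=40, trials=500,
                       true_mean=10.0, seed=1):
    """Percent bias of the estimated mean for two simple non-detect treatments."""
    rng = random.Random(seed)
    # Log-scale parameters giving the requested arithmetic mean and CoV.
    sigma = math.sqrt(math.log(1 + cov ** 2))
    mu = math.log(true_mean) - sigma ** 2 / 2
    # Assumption: a fixed LoD at the population quantile matching the
    # target censoring level (censor_fraction must be between 0 and 1).
    lod = math.exp(mu + sigma * NormalDist().inv_cdf(censor_fraction))

    half_lod, ignored = [], []
    for _ in range(trials):
        sample = [rng.lognormvariate(mu, sigma) for _ in range(n)]
        detects = [x for x in sample if x >= lod]
        if not detects:
            continue  # skip the rare sample with no detected values
        n_censored = n - len(detects)
        # Half-LoD substitution: every non-detect is replaced by lod / 2.
        half_lod.append((sum(detects) + n_censored * lod / 2) / n)
        # Ignoring non-detects: average the detected values only.
        ignored.append(mean(detects))

    def pct_bias(estimates):
        return 100 * (mean(estimates) - true_mean) / true_mean

    return {"half_lod": pct_bias(half_lod), "ignore": pct_bias(ignored)}
```

Calling simulate_mean_bias(cov=0.5, censor_fraction=0.3) returns the percent bias of the mean for each treatment; looping over CoV values from 0.1 to 1.0 and over several censoring levels gives curves of the kind described here.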
The graphs in Figure 3 illustrate the effect of both increasing CoV and different levels of data censoring on systematic errors for the different treatments of non-detects. For the mean values, it is evident that the estimate with no censoring is essentially unbiased. This will not necessarily be true for much smaller sample sizes, which fail to sample adequately the rare but very high values in the right-hand “long tail” of the higher CoV data sets (Strimbu (2012)). ROS is shown to be subject to a small bias at CoV values approaching 1.0. For ½ substitution, serious negative bias of mean values occurs and is shown to increase markedly as levels of censoring rise from 15% to 50%, though it is fair to say that for censoring at 30% or less and for CoV values larger than 0.6, mean bias is less than 5%.
When the estimation of standard deviation is reviewed, it is clear that the key differentiating factor between treatments of non-detects is CoV itself (Figure 3). At low values of CoV there is evidence of high % bias for both ½ substitution (positive) and ignoring non-detects (negative). For the uncensored data case and for censoring plus ROS, bias is low. Remarkably, for CoV near to 1.0, there is less than 10% bias for all four data treatments, more or less regardless of the level of censoring up to 30%.
The likely range (5th to 95th percentile) of the estimated values of the mean and standard deviation statistics derived from the 500 simulations described above is shown in Figure 4 (note, these are not the 5th and 95th percentile estimates of the data). Data for different CoV values are plotted to show the uncertainty range for different levels of censoring. The larger levels of uncertainty associated with higher levels of CoV are not materially worse for ROS treatment than for no censoring: the increase in uncertainty over the CoV range is primarily a function of the difficulty of sampling highly variable populations of data, the issues of censoring having been largely dealt with by the ROS treatment over the whole range of CoV values.
Figure 4 Range of the 5th to 95th percentile estimated values of mean and standard deviation statistics for different levels of censorship
In summary, it appears that substitution is needlessly inaccurate under many circumstances (though not necessarily all). Even more seriously, the size and direction of the errors associated with substitution are not readily predictable, depending as they do on the coefficient of variation of the underlying data, which in real life is what is to be determined in the first place.
A poster discussing this topic is also being presented at SETAC Europe 2024 (Th-321), and we are in the process of preparing a manuscript describing this method, with an accompanying Excel tool that can be used to apply the ROS method. If you have any questions or comments on this work, please contact us or come and have a discussion with me at the poster or the wca booth (Booth 226 in the exhibition hall).