March 7, 2016 by Dean Leverett

All aquatic environmental risk assessments follow a basic paradigm, regardless of the particular set of regulations governing releases of a substance into the environment. This involves the comparison of hazard (Predicted No Effect Concentrations (PNECs)) and exposure (Predicted Environmental Concentrations (PECs)) values to predict the risk to aquatic organisms.
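As a minimal sketch of this comparison, the PEC can be divided by the PNEC to give a risk quotient. The function names and the threshold of 1 below are illustrative assumptions (the threshold reflects common practice and is not stated in this article), and the concentrations are made-up values in arbitrary but matching units.

```python
def risk_quotient(pec: float, pnec: float) -> float:
    """Return PEC / PNEC; both values must be in the same concentration units."""
    if pnec <= 0:
        raise ValueError("PNEC must be a positive concentration")
    return pec / pnec


def indicates_potential_risk(pec: float, pnec: float, threshold: float = 1.0) -> bool:
    """Flag a potential risk when exposure exceeds the no-effect level.

    The default threshold of 1.0 is an assumption for illustration only.
    """
    return risk_quotient(pec, pnec) > threshold


# Hypothetical example values, not taken from any real assessment.
low_exposure = risk_quotient(pec=0.5, pnec=2.0)
high_exposure_flag = indicates_potential_risk(pec=3.0, pnec=2.0)
```

Because the quotient depends directly on the PNEC, any change in which studies are accepted for PNEC derivation feeds straight through to the predicted risk.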

Since the derivation of these values relies on data generated under a wide range of circumstances – from highly quality controlled laboratories to academic or research institutions – it is necessary to first evaluate the quality of such data prior to its use in deriving the final values to be applied to the risk assessment. The use of flawed, unreliable or irrelevant data, or the erroneous exclusion of good quality data, could result in the final risk assessment predictions being too rigorous or too lenient, with myriad and often expensive implications (in both financial and environmental terms) in either case.

The reliability and relevance of aquatic ecotoxicology studies (the basis of PNEC derivation) have traditionally been assessed using the so-called Klimisch scoring method (Klimisch et al. 1997). This approach relies on the expert judgement of experienced ecotoxicologists to assess, based on a small number of broad criteria, whether a particular study is ‘reliable without restrictions’ (K1), ‘reliable with restrictions’ (K2), ‘unreliable’ (K3) or, where there is insufficient information presented with which to assess reliability, ‘unassignable’ (K4). The relevance of studies is also addressed by Klimisch, although this assessment is less comprehensive than that for reliability.

The Klimisch system has been applied in Europe for many years in the evaluation of study quality for use in regulatory regimes for controlling chemical emissions (e.g. the Notification of New Substances (NONS), REACH, the Plant Protection Products Directive, the Water Framework Directive (WFD), the European Medicines Agency guidelines on pharmaceutical ERA, etc.) and has long been considered to be fit for this purpose. In such regimes, studies scored as K1 or K2 are usually deemed to be sufficiently reliable for regulatory purposes, while those scored K3 or K4 are not. Although the Klimisch paper gives only brief consideration to study relevance, most assessors incorporate relevance within their final reliability score for a study. Thus, a biomarker study investigating the concentrations of an enzyme present in an organism’s tissues would generally be scored as K3 (‘unreliable’) for regulatory risk assessment, even if the study had been well performed and reported, simply because such responses are generally not considered to be directly relevant to population or ecosystem scale effects.
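The K1/K2 versus K3/K4 rule of thumb described above can be sketched as a simple filter. This is an illustration only: the class, function and study names are hypothetical, and in practice the score itself comes from expert judgement rather than from code.

```python
from enum import IntEnum


class Klimisch(IntEnum):
    K1 = 1  # reliable without restrictions
    K2 = 2  # reliable with restrictions
    K3 = 3  # unreliable
    K4 = 4  # unassignable (insufficient information)


def usable_for_regulation(score: Klimisch) -> bool:
    """K1 and K2 studies are usually accepted; K3 and K4 usually are not."""
    return score in (Klimisch.K1, Klimisch.K2)


# Hypothetical dataset: study labels mapped to expert-assigned scores.
studies = {
    "study_a": Klimisch.K1,
    "study_b": Klimisch.K2,
    "biomarker_study": Klimisch.K3,
    "poorly_reported_study": Klimisch.K4,
}

accepted = sorted(name for name, score in studies.items() if usable_for_regulation(score))
```

Note that the single score gives no indication of why the biomarker study was rejected: poor conduct and lack of relevance both end up as K3.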

While for the majority of studies the outcome of this assessment is clear and largely unequivocal amongst assessors, disagreements between assessors on the reliability of particular studies do occur, because for the most part the evaluation is based on expert judgement. In the context of the WFD, such disagreements have the greatest effect on chemical risk assessments when some assessors consider a study to be K2 and others score it as K3. The significance of the disagreement is directly related to the significance of the study in the overall substance dataset (i.e. whether its inclusion or exclusion has an appreciable effect on the PNEC). Disagreements over whether a study is K1 or K2 usually have less effect, since the study would usually be used in either case.

Though relatively rare, situations inevitably arise during the development of Environmental Quality Standards (EQS) for WFD Priority Substances, when the technical groups responsible for PNEC derivation cannot reach consensus on the reliability of a study that critically affects the final PNEC value. In such cases, the issue is referred to the European Commission’s Scientific Committee on Health and Environmental Risks (SCHER), an expert committee that provides opinions on health and environmental risks related to pollutants in environmental media. The SCHER opinion on the reliability of critical studies, and their subsequent use (or not) in the derivation of PNECs (and therefore EQS), amongst many other technical issues, is generally considered to be a means of settling the issue in a manner which enables the process of EQS development to progress, rather than stalling owing to ongoing debate (for example, between industry and regulators, or between individual EU Member States). The opinions of SCHER are highly regarded in this context, and once an issue has been put before SCHER for a decision, the decision of SCHER is generally considered, for all intents and purposes, to be final.

Over the last few years, there has been increasing criticism of the Klimisch approach to study reliability, primarily from NGOs and some national regulators. Such criticism has primarily been focussed on the lack of refinement in the reliability assessment (e.g. too few criteria, critical areas of study design not properly covered or missing) and the effect of this on the consistency of such assessments (i.e. different assessors may come to different conclusions), its apparent lack of transparency (i.e. studies are deemed as reliable or unreliable without further resolution as to the reasons for the outcome), and the lack of a robust assessment of the relevance of studies. In addition, there is a suggestion that studies contracted by industry and undertaken to standardised guidelines in highly quality controlled laboratory environments take disproportionate precedence over studies published in the scientific literature, which are often undertaken in a research or academic setting, may apply non-standard methods, and are often designed to address a research need rather than regulatory risk assessment (despite reporting data that may be useful for regulatory hazard assessment).

This has led to the development of a new scheme for the assessment of the reliability and relevance of aquatic ecotoxicology studies, known as Criteria for Reporting and Evaluating Ecotoxicity Data (CRED) (Moermond et al. 2015). CRED expands the reliability assessment proposed by Klimisch et al., by requiring the scoring of aquatic ecotoxicity studies against a more extensive and specific array of reliability criteria, and crucially also includes a discrete set of relevance criteria, enabling studies to be scored separately for reliability and relevance. Thus in the biomarker example described above, the CRED assessment would enable an assessor to reflect that the study had been performed and reported in a reliable manner, but is not relevant to regulatory PNEC derivation.
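The practical difference in the biomarker example can be sketched by recording reliability and relevance as two separate outcomes. The data structure below is a hypothetical simplification for illustration; it does not reproduce the actual CRED criteria or scoring categories.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StudyAssessment:
    """Hypothetical record keeping the two assessment dimensions apart."""
    name: str
    reliable: bool   # was the study well performed and reported?
    relevant: bool   # does the endpoint matter for regulatory PNEC derivation?

    def usable_for_pnec(self) -> bool:
        # A study is carried forward only if it passes on both dimensions.
        return self.reliable and self.relevant

    def summary(self) -> str:
        reliability = "reliable" if self.reliable else "not reliable"
        relevance = "relevant" if self.relevant else "not relevant"
        return f"{self.name}: {reliability}, {relevance}"


# The biomarker study: well performed and reported, but not relevant to PNEC derivation.
biomarker = StudyAssessment(name="biomarker_study", reliable=True, relevant=False)
biomarker_summary = biomarker.summary()
```

The study is still excluded from PNEC derivation, but the record now shows that the exclusion rests on relevance rather than on the quality of the work.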

wca were involved in the trialling of the CRED approach, and consider that it contains many positive aspects, most notably the increased focus on study relevance. However, we are concerned by some of the claims made with regard to its superiority to the well-established Klimisch approach, and especially that the mere existence of CRED may be used as a justification for re-evaluating studies on which SCHER have already provided a decision regarding reliability and relevance.

Firstly, we believe that CRED should be regarded as a supporting tool for the more established Klimisch method of scoring the reliability and relevance of ecotoxicity studies. It cannot be regarded as a system that is inherently different from Klimisch, because it is clearly a logical development of the previous scheme, at least in terms of study reliability. The expanded criteria for reliability assessment proposed by CRED are certainly those that should be considered in a reliability assessment. However, we would suggest that it is exactly these criteria that are considered in the Klimisch approach, even if they are not explicitly detailed in the Klimisch paper. Any experienced ecotoxicologist, in assessing the reliability of a study, would take these criteria into account in deriving their Klimisch score. Thus, in the majority of cases, it would be expected that CRED would (or should) result in the same outcome as Klimisch regarding study reliability. Indeed, for most studies that are assessed for the WFD, there is little contention over their reliability, and all stakeholders agree on the rating as K1/K2 (and therefore suitable for use in PNEC derivation) or K3/K4 (and therefore not suitable for use in PNEC derivation). Therefore, under most circumstances, there seems to be limited need for the application of a different scheme.

The problem arises for so-called ‘borderline’ studies, where assessors disagree on whether a study with a significant effect on the final PNEC value is reliable. Under the Klimisch approach, the specific area of disagreement may not initially be clear, but it generally becomes apparent as discussions on the study intensify. In such cases, it may be useful to apply a CRED assessment, in addition to the initial Klimisch assessment, to highlight the specific areas where the disagreements lie. In some cases, the increased focus on specific criteria may assist in brokering a consensus amongst assessors on the reliability of the study in question. However, assuming the same assessors undertake the Klimisch and CRED assessments, it is also possible that the disagreement will be perpetuated, albeit focussed on one or two specific criteria. The point here is that expert judgement is required in both cases, and the judgement of experts will sometimes differ, regardless of the system used to apply that expertise.

Our recent experiences actually highlight a slightly different situation, whereby a contentious study is re-evaluated by a single assessor using CRED, the outcome of which is then highlighted as the definitive, unambiguous result, which supersedes all previous assessments, simply because it has been undertaken using this ‘new’ approach. In such cases, it could be suggested that should a number of experts assess the study using CRED, the same lack of agreement in expert judgement would arise as originally resulted from a range of individual Klimisch-based assessments. It is, therefore, insufficient to undertake a single CRED assessment, and state that the outcome effectively settles the matter.

Furthermore, and perhaps more importantly, CRED should not be used as a justification for re-visiting critical studies which have already been subject to extreme scrutiny by all stakeholders, and which have been referred to SCHER for a final decision, simply because the decision made by SCHER is not agreeable to all. It should always be borne in mind that considerable resources are expended in the process of developing and evaluating substance EQS dossiers, and the continual re-visiting of contentious studies, especially those for which a final decision has already been made by those responsible for settling disagreements among stakeholders, risks wasting such resources. Where would such a process stop? Do we go back and re-evaluate every EQS where a contentious study was included or excluded in the derivation process?

CRED is additionally cited as being able to address potential data paucity in EQS derivation by ensuring that all available high-quality data are used, irrespective of whether the studies followed GLP or other standard guidelines, and as allowing more data from the open literature to be used to relieve the data scarcity that exists for many substances. This seems to imply that CRED will somehow improve the data finding process, and potentially also suggests that the threshold of acceptance for reliability is lower in CRED than it is for Klimisch. The EQS derivation process already involves extensive literature searching to ensure that all (or almost all) data from the open literature are assessed, so it is difficult to envisage how CRED will increase the amount of reliable data, unless the requirements for reliability are reduced. Indeed, in many recent cases (e.g. for pharmaceuticals), the studies that are missing from such a dataset (if any) tend to be the GLP studies, because they are not available in the public domain, rather than studies from the open literature. Therefore, far from relying on GLP studies, EQS for such substances tend to be derived without considering them at all (unless they are offered to the process by industry or identified via other regulatory regimes).

The CRED paper also suggests that relevant and well described ecotoxicity studies are sometimes rated to be of low reliability, because they did not follow GLP or standard procedures. Again, we do not believe this to be the case, so long as the assessment is undertaken by a qualified expert. Studies are never rated to be low reliability because they do not follow GLP, and only studies that are so far from standardised procedures that they are not useful for risk assessment (or where it is not possible to tell from the paper) are deemed unreliable. In our experience, the emphasis tends to be on inclusion rather than exclusion, and only very poorly reported or highly non-standardised studies are excluded.

In conclusion, we support the use of CRED for assessing study relevance, and for supporting the assessment of reliability on new (not previously assessed) studies where a rapid, objective consensus cannot be reached. However, the majority of studies can quickly, easily and consistently be assessed using the Klimisch approach without contention. We also do not believe that CRED should be used to re-visit studies where considerable effort and resource has already been expended in deciding on the reliability of a study unless significant new information has come to light in the meantime, and even then, re-assessments should be subject to a range of expert opinion. The development of CRED should, in itself, not be used as a reason to re-evaluate studies, nor should CRED outcomes derived by one assessor be considered to outweigh previous outcomes on reliability undertaken by other experts using the Klimisch-based system.

Finally, it is worth noting that while we remain fixated on the reliability and relevance of hazard data, the reliability of exposure data, which plays an equally crucial role in preliminary risk assessment and EQS setting, is subject to far less scrutiny and is often taken at face value. For many ‘emerging’ substances (e.g. pharmaceuticals), a wealth of exposure data exists that could be taken into account in deciding if the implementation of an EQS is warranted. A system which focusses on assessing the reliability and relevance of these data, and on ensuring that only high quality exposure data are taken into account, would likely greatly improve the processes involved, and prevent the setting of EQS, and resource-heavy monitoring, for substances whose concentrations are likely to pose negligible environmental risk.