A Proposal for Improving Organic Group Certification. Quantification of Internal Control Systems’ Performance and Sample Size Determination

Organic certification, especially for smallholders, often uses group certification procedures. An internal control system (ICS) visits all farmers, and then the external certification body (CB) inspects a sample to assess the ICS’ performance. Harmonised methods for measuring the ICS’ reliability are missing so far. Here, we define criteria of “ICS performance”, propose a new procedure for quantifying this performance and, based on this procedure, suggest that the sample size can be determined using classical statistical methods for survey sampling, instead of using the square root or a percentage of group size as in current practice.


Group Certification and One of Its Weaknesses
Group certification is used in different farm certification schemes (GLOBALG.A.P., Rainforest Alliance, Round Table on Sustainable Palm Oil, organic farming, etc.). The basic idea is to facilitate access to certification by building up an Internal Control System (ICS), the effectiveness of which is verified by an external inspection (also called "audit"). While under some programs (e.g. GLOBALG.A.P. and the National Organic Program of the USA, NOP [1]) there is no restriction concerning the size of the member farms, the EU regulation on organic farming restricts participation in group certification to small farms [2,3]. Research on group certification has so far addressed its impact on market access and smallholder incomes [4][5][6][7][8][9][10][11], implementation of improved agricultural practices by the certified farmers [10,12], schooling [13], scalability [14], internal organisational problems of the groups and certification costs [7,15], environment and nature conservation [4,9], and adaptation to climate change [16], but not the functioning of the ICS as such, its ability to ensure compliance with the standards, or the way that certification bodies (CBs) deal with the ICS.
For a better understanding of the organic group certification process, Figure 1 describes the general workflow.

Figure 1. General workflow of an organic group certification process. On the left side, the four on-site inspection activities, for which both the need for, and possibility of, quantification increase from top to bottom. Our article deals mainly with farm inspections and witness audits.
1) Buying centres (also called collection points, wholesale points, buying points) are places to which member farmers deliver their products. Sometimes the group contracts some of its members for this purpose, in other cases the group sets up its own structure. Some buying centres are permanent, others are active only during the harvest season. In other groups, the buying staff drives to the farmers to pick up the products.
2) NC: non-conformity.
3) Witness audits: the external inspector accompanies the internal inspector to observe her/his competence. See Section 8.
4) Denial: an initial application for certification is turned down; Suspension: existing certification is withdrawn temporarily; Revocation: existing certification is withdrawn terminally.
5) "Inspector" here refers to the external inspector who is an employee or contractor of the CB. Since the task is complex, group inspections are often performed by teams of several external inspectors.
6) Some certification programs require two, others three different persons to be involved in the certification process. The distribution of roles among these two or three persons depends on the certification program. All programs, however, require that the final certification decision is made by a person other than the inspector.

Table 1 summarizes the most important rules for an organic ICS and also explains at which level and through which methods an external inspector can verify compliance with each of these rules.
Out of the eight rules in this table, (h) is the most important one, because an ICS cannot be considered functional if it does not identify the existing nonconformities (NCs) among its members, ensuring that these are either corrected or the non-compliant members are excluded. Also, for the CB the visit to a sample of farmers is the core part of the group inspection. The CB should not only assess compliance with basic organic farming rules like, e.g., having a proper crop rotation, protecting the soil from erosion, ensuring adequate storage conditions for organic products, using only allowed fertilizers, etc. at each farm in the sample, but also use these visits to the farmers for cross checking the accuracy of records kept at the group level, verify separation of certified from non-certified products on their way from farm to export, and find out if member farmers have received appropriate training and consultancy (Table 1).
However, little to no effort has been made so far to assess the outcome of these external visits systematically. A new EU regulation for the first time establishes official rules for group certification instead of unofficial guidelines [3]. But what exactly does it mean when this new regulation says "For the purpose of evaluating the set-up, functioning and maintaining of the ICS of a group of operators, the [...] control body, shall determine at least that the ICS manager takes appropriate measures in case of non-compliance, including their follow up, according to the ICS documented procedures that have been put in place" [3]? If in a sample of n farmers the CB finds one case where the ICS manager has not "taken appropriate measures", does that mean the ICS is not functional, which ultimately means the group cannot be certified (Figure 1)? Or is there a meaningful threshold above which the CB should make that decision?
In a worldwide survey among organic CBs, including expert interviews, the lack of such thresholds was identified as one of the main weaknesses of the current situation of organic group certification (Textbox 1). Non-organic group certification schemes are also vague in this regard. GLOBALG.A.P., e.g., differentiates between "structural" and "non-structural" NCs, but does not explain how often an NC must occur for categorising it as "structural" [18].

The External Sample Size
The size of the sample of farmers visited by the CB (the "external sample") has been the subject of long-standing discussions between the stakeholders involved. Currently, the most common approach is to use the square root of the total number of group members, multiplied by a risk factor, which varies between 1.0 and 1.4. This is established in an unofficial guideline by the EU Commission [2]. GLOBALG.A.P. [18], Rainforest Alliance [19] and other programs also use the square root as the basis for calculating the external sample size, although without applying risk factors.
The new Regulation (EU) 2018/848 on organic farming [20], which will come into force in January 2022, for the first time introduces official minimum requirements for the groups and their ICS [21] and for the procedures to be followed by CBs for this purpose [3]. Although clear evidence does not exist in this regard, according to the perception of regulatory authorities, fraud is more common under group certification than under individual farm certification [22]. To address the related risks, the EU Commission stipulates that (a) the maximum group size shall be limited to 2,000 members, and (b) organic CBs, instead of the square root shall inspect a minimum of 5% of the group members [3]. Figure 2 shows that for small groups the sample would be smaller with the 5% rule, while for large groups it would be much bigger.
This proposed change raises two concerns: (a) a fixed 5% sample disregards basic statistical principles of sample size determination and will lead to high standard errors for small groups, and (b) as long as the weaknesses in the system described above are not addressed, larger sample sizes (for big groups, see Figure 2) will only reproduce the existing problems at a larger scale.

Figure 2. [...] groups up to 2,000 members. For very small groups, [3] furthermore prescribes: if N ≤ 10, then n = N; if N > 10, then n ≥ 10. These special cases are not considered in the graph. Sqrt = square root.
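The two rules compared in Figure 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and rounding up to the next whole farmer is an assumption, since the rounding convention is not specified in the text.

```python
import math

def sqrt_rule(group_size: int, risk_factor: float = 1.0) -> int:
    """Square root of the group size times a risk factor (1.0 to 1.4).
    Rounding up is an assumption made for this sketch."""
    return math.ceil(math.sqrt(group_size) * risk_factor)

def five_percent_rule(group_size: int) -> int:
    """5% of the group members, with the special cases for very small groups:
    N <= 10 -> n = N; N > 10 -> n >= 10."""
    if group_size <= 10:
        return group_size
    five_percent = (group_size + 19) // 20  # integer ceiling of N / 20
    return max(10, five_percent)

# Print group size, square root rule (risk factors 1.0 and 1.4), and 5% rule
for size in (50, 100, 400, 1000, 2000):
    print(size, sqrt_rule(size), sqrt_rule(size, 1.4), five_percent_rule(size))
```

Without a risk factor, the two rules meet where √N = 0.05 N, i.e. at N = 400; below that the 5% rule gives the smaller sample (subject to the minimum of 10), above it the larger one.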

What is the Purpose of Sampling in Organic Group Certification?
As explained above, the performance assessment of an ICS takes place at four different levels: at the ICS office, in the buying centres, at the farms and during witness audits with the internal inspectors (Table 1 and Figure 1, also [23]). The results of the audits at the first two levels are mostly qualitative, but a meaningful assessment of the findings from the farm level requires some kind of quantification (Figure 1). Quantification of the results of the witness audits with internal inspectors may not be necessary in small groups with one or few inspectors, but becomes important in large groups with many internal inspectors (Section 8). A key underlying question is: What exactly is the goal of sampling a certain number of member farmers?

a. Is the goal to determine the exact percentage (incidence) of each kind of NC? Not really. Let us assume we are dealing with a group where many farmers use herbicides, which are prohibited in organic farming. Does it matter for the CB if, say, 14%, 32% or 45% of the farmers use herbicides? The answer is "no", because in any of these cases the conclusion would be the same: the ICS is not functional, and certification would have to be suspended, temporarily or terminally. Or let us imagine a group where some farmers do not keep records of their daily field activities. Would it make a difference for the CB if this problem were found among, say, 2%, 4% or 10% of farmers? No, because in any of these cases the ICS would be requested to propose corrective actions to ensure that farmers keep their records in the future. And in none of these cases would the group's certification be at risk.
b. Is the goal to find each and every NC that may exist in the group and has slipped through the ICS? No. Any type of sampling involves the risk of a certain number of cases slipping through. This risk may not be acceptable when it comes, e.g., to high food safety risks, but aiming at complete detection would not be appropriate for organic group inspections, because (i) compliance with organic production rules is not a food safety issue, (ii) the idea of "group" certification would become meaningless, since ultimately the sample size would have to be equal to the total number of farmers, and (iii) even with 100% external inspections, not all NCs existing at the time of the inspection will be detected, let alone those NCs which may not be detectable on the day of the inspection.
c. Is the goal to ensure that non-compliant farmers identified during external inspections are excluded from the group? This is a common misunderstanding (see also Textbox 1), which completely misses the point of group certification. If the CB inspects, e.g., 10 out of 100 farmers, and finds in this sample two farmers using synthetic fertilizers, then we assume that in the entire group there are many more farmers with this problem, and excluding the two members would not solve the problem.
d. Is the goal to decertify groups when the incidence of severe NCs exceeds a certain threshold? This is how, e.g., the Rainforest Alliance group certification works: "if an irreversible non-compliant practice occurred on more than 5% (of the whole group, after extrapolation (...) and/or on at least 5 of the audited small farms this is considered to be a systemic issue (...) and therefore shall result in non-certification and/or cancellation" [19]. There may be different opinions among CBs and regulatory authorities in this regard, but the authors believe that this approach does not sufficiently consider the efforts made by the ICS. Let us look again at the example above of a group with widespread herbicide use: if, in a group of 100 farmers, the ICS has never detected any case of herbicide use, but then the CB detects one case in a sample of 10 farmers, this situation should be treated differently from the case where the ICS has already excluded 20 out of 100 farmers, but then the CB finds one more case.
e. The real goal of external inspections should be to determine (i) which existing NCs have been properly handled by the ICS and which not; (ii) among the latter, which are "systemic" and which are "isolated" cases; and (iii) which of the systemic cases put at risk the integrity of the products sold on the organic market, and the credibility of the certification system.

Judgement Sampling vs. Statistical Sampling
The U.S. Office of the Comptroller of the Currency [24] distinguishes between "judgement sampling" and "statistical sampling". The definition of judgement sampling is quoted in Textbox 2.

Textbox 2: Definition of "judgement sampling" [24].

The current organic group certification procedures are mostly based on judgement sampling. The problem is, however, that the involved CBs do not always have the necessary level of "professional judgement" that would lead to satisfactory results (see [17]). A solution to the problem presented in Textbox 1 can only be found using "statistical sampling", which allows extrapolation of sample results to the entire group. Statistical sampling methods must select the sample randomly, not on a risk basis [24], otherwise the results would be biased. If a CB knows, e.g., that a specific problem is more frequent in one village belonging to a producer group, and therefore targets farmers from that village more than the rest of the group, the results from this inspection cannot be extrapolated to the entire group, because the problem would be over-estimated (Figure 3).
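A random selection of the external sample, followed by extrapolation to the whole group, could look like the following sketch. The function names, the member labels and the seed are hypothetical and serve only the illustration; the group mirrors the Figure 3 example (80 members, 7 of them using herbicides, sample of 20).

```python
import random

def draw_random_sample(member_ids, sample_size, seed=None):
    """Simple random sample without replacement: every member has the same chance."""
    rng = random.Random(seed)
    return rng.sample(list(member_ids), sample_size)

def extrapolate(cases_in_sample, sample_size, group_size):
    """Scale a count found in a random sample up to the whole group: m * (N / n)."""
    return cases_in_sample * group_size / sample_size

# Hypothetical group: members 0..79, of which members 0..6 use herbicides (assumed labels)
members = list(range(80))
herbicide_users = set(range(7))

sample = draw_random_sample(members, 20, seed=1)
found = sum(1 for farmer in sample if farmer in herbicide_users)
print(found, extrapolate(found, 20, 80))
```

The extrapolation is only meaningful because the sample was drawn randomly; if farmers from a known problem village were over-represented, the same arithmetic would over-estimate the problem.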

What Does "Systemic" Mean? What Does "Integrity" Mean?
To find a solution to the problem described in Textbox 1, we must first define systemic NCs vs. isolated NCs and in which cases systemic NCs should lead to decertification. In this section, we propose a new procedure for quantifying these terms and for answering these questions, with the help of the variables defined in Table 2. Readers who are less interested in the statistical details can jump directly to Table 3, from there to Figure [...]

Figure 3. [...] In both cases, the group has 80 members, 9 of whom (11%) have erosion problems and 7 (9%) use herbicides. The sample is 20 in both cases. While (a) allows estimating the probable dimension of the two problems, (b) does not. Therefore, the conclusion in the red box is wrong.

Table 2. Abbreviations and variables used in this article. For an illustration of π, refer to Figure 4, while Figure 5 illustrates some of the other variables.

Abbreviation / Variable: Definition
CB: Certification body (called "control body" in the EU Regulation on organic farming).
ICS: Internal control system.
NA: Not applicable.
NC1: A specific non-conformity occurring among the members of the group (see Table 4 and following for examples).
M: Number of farmers in the entire group with NC1.
Ma: Number of farmers in the entire group identified by the ICS for NC1.
Mb: Number of farmers in the entire group with NC1 identified but not corrected by the ICS. Can be estimated from the sample by mb × (N/n).
Mc: Number of farmers in the entire group with NC1 found by the CB, which were missed by the ICS. The CB estimates this variable from the sample using mc × (N/n).
mb: Number of farmers with NC1 found by the CB, which had previously been detected by the ICS, but not yet corrected at the time of the external inspection.
δL: Lower limit of the confidence interval for δ: δL = πe(L) − πa.
r: Repetition: number of subsequent external inspections during which NC1 is found at a systemic level. Normally, such inspections take place yearly, but they can also be more frequent.
r0: Threshold above which the repetition of a systemic NC leads to decertification.
s: Severity category of an NC (see Table 3).
(symbol not recoverable from the source): Variance of a character trait within a group.

Figure 4. Venn diagram illustrating the incidence π of a specific NC in a producer group, its components πa, πb, πc and πe, and the definition of δ. Refer to Table 2 for further details. ICS = Internal Control System, CB = Certification Body.
a. We count Ma and compute πa (see Table 2 and Figure 4).
b. As explained in Section 2(d), one approach for assessing the performance of the ICS would be to simply define a threshold above which a group should be decertified. This would mean using an estimate of πe (Table 2) for this purpose, i.e. πe = m/n. For the reasons explained in Section 2(d) (we want to value the efforts made by the ICS, which has already detected certain cases), we suggest using instead the difference between the incidence of a specific NC identified by the CB in the sample (extrapolated to the entire group) and the incidence identified and corrected by the ICS in the entire group (an approach which may not be shared by all CBs and regulatory authorities):
$$\delta = \pi_e - \pi_a$$
c. Next, to reflect that an estimate is used, we compute the lower and upper limits of a 95% confidence interval for πe (Table 2) using standard procedures as described by Agresti ([25], pp. 15, 18-21) and also described in detail below (see Equations 4 to 7). The lower and upper limits for πe are denoted as πe(L) and πe(U), respectively. Those for δ are denoted as δL and δU, respectively.
d. We define a threshold above which the incidence of an NC is considered systemic. Since this threshold should differ depending on the type of NC, we group the existing NCs in five categories s, from 1 (least severe) to 5 (most severe). Refer to Table 3 for examples. These severity categories are associated with an acceptable threshold δ0, above which δ is considered "systemic" (third column in Table 3). The more severe the category, the lower the acceptance threshold. An NC is considered systemic if the lower confidence limit lies above the threshold (δL > δ0), and non-systemic if the upper confidence limit lies below it (δU < δ0). If neither of the two conditions holds, the sample size was too small to reach a definitive assessment. This is likely to happen only when δ is close to the threshold δ0. Note that this step amounts to a significance test at the 5% level to decide if δ is significantly smaller or larger than the threshold δ0.
e. As a second condition for considering an NC as "non-systemic", we introduce the requirement that πe must be below 0.3, regardless of δ. The rationale is as follows: if the ICS makes serious efforts to handle NCs, but in spite of these efforts the CB still finds many undetected or uncorrected cases, there is a systemic problem. If neither a systemic nor a non-systemic classification results, the assessment is inconclusive. If this happens for NCs with s ≤ 4, we suggest that the CB decides case by case whether the NC is considered systemic or not. For NCs with s = 5, the sample should be increased until a clear picture is obtained.

Finally, we suggest how often a systemic NC can be repeated (r, see Table 3) before it seriously affects the integrity of the system and should therefore lead to (temporary or final) decertification. We call this threshold "repetition tolerance" r0. r0 is also related to s (Table 3,
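The classification idea of this section (systemic when the whole confidence interval for δ lies above δ0, non-systemic when it lies below and πe stays under 0.3, unclear in between) can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function name is hypothetical, the large-sample normal approximation is used instead of the exact interval, the 0.3 requirement is applied to the point estimate (an assumption), cases with πe ≥ 0.3 that are not clearly systemic are returned as "unclear" (also an assumption), and the threshold 0.05 in the example is a placeholder rather than a value from Table 3.

```python
import math

def classify_nc(m, n, pi_a, delta_0, z=1.96, pi_e_cap=0.3):
    """Classify one NC as 'systemic', 'non-systemic' or 'unclear'.

    m: farmers in the external sample with the NC
    n: external sample size
    pi_a: incidence already identified and corrected by the ICS
    delta_0: acceptance threshold for the severity category (placeholder values here)
    """
    pi_e = m / n
    # Half-width of the approximate 95% interval (normal approximation)
    half_width = z * math.sqrt(pi_e * (1 - pi_e) / n)
    delta_low = (pi_e - half_width) - pi_a
    delta_up = (pi_e + half_width) - pi_a
    if delta_low > delta_0:
        return "systemic"
    # Assumption: the 0.3 requirement is checked against the point estimate
    if delta_up < delta_0 and pi_e < pi_e_cap:
        return "non-systemic"
    return "unclear"

# 10 cases in a sample of 54, none handled by the ICS, placeholder threshold 0.05
print(classify_nc(10, 54, pi_a=0.0, delta_0=0.05))
# 2 cases in a sample of 40: the interval straddles the threshold
print(classify_nc(2, 40, pi_a=0.0, delta_0=0.05))
```

Note that the normal approximation collapses to a zero-width interval when m = 0, which is one reason an exact interval is preferable for small samples.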

Two Real Life Examples
To exemplify the proposed method, we have selected two cases of group certification from the CERES database: a positive case of a group with a functioning ICS and only minor deficiencies, and a negative case, which lost its certification. If the suggested method had been applied, these results would have been confirmed, but based on a more transparent and reliable procedure.
The first case study refers to a cocoa farmers group with 1,079 members. Since this was the first inspection of this group, the risk factor had been calculated as 1.2, based on theoretical assumptions, leading to the sample size n = √1,079 × 1.2 ≈ 40.
Three NCs were found, two of which were systemic, but none of them had serious implications for integrity (Table 4). The rather small NCs could be easily corrected, and the group was certified.
The second case study is a group of 1,413 coffee farmers, spread over a large area with highly heterogeneous geographical conditions. A risk factor of 1.4 had been applied in determining the sample size. During the four previous years, only minor NCs had been detected. During inspection planning in 2016, the CB found that the samples in previous years had not been random, because they had only covered a relatively small part of the region. This was corrected by randomly including farmers from all parishes in the new sample. Furthermore, the CB had learned that agrochemical use among coffee smallholders in the entire region had increased substantially. Therefore, coffee leaf samples were taken from 16 of the 54 externally inspected farmers and tested for pesticide residues.
As a result of this change in inspection procedures, in addition to several other (systemic and non-systemic) NCs, the inspectors found synthetic pesticides and/or fertilizers on 10 farms. In 6 out of 16 leaf samples, residues of synthetic fungicides were found at levels which could only be explained by application by the organic farmers (Table 5).
None of these NCs had been detected by the ICS; therefore, the group's organic certificate was withdrawn immediately. If the method proposed here had been used, the result would have been the same. These severe problems in the group, however, were detected not because the sample size was increased as compared to previous years, but because (a) the sample was chosen randomly, and (b) the inspection procedure was improved by testing leaf samples, which had not been done in previous years.
Footnotes to Table 4:
1) Repetition: the number of external inspections in which this NC has been found at a systemic level. In the present case, this is 1, because it was the first inspection. For NC3 the value is 0, because this NC is not systemic.
2) Refer to Table 2 [...]

Table 5. [...] The group lost its certification because of these results. For further details regarding the different columns, refer to header and footnotes in Table 4.
2) Figures for NC5 refer to different agrochemicals found on the farms during field visits.
3) "Water Pollution" is caused by pulping coffee cherries in nearby creeks, which leads to heavy organic contamination.
4) The threshold in column K lies between the lower (H) and upper (I) limit here, therefore a clear statement concerning the systemic condition of this NC cannot be made, and the warning "Watch!" appears. Since it is not an issue of severity 5, it would be up to the CB to decide whether the sample is increased to arrive at a clear decision, or the group is requested to correct the problem and the CB follows up next year. In this case, this was no longer relevant, because the group lost its certification anyhow.

Sample Size Determination in Scientific Surveys
In a scientific survey with the goal explained above, neither a fixed percentage nor a square root of the total population size would be used as sample size. Instead, a specification would be made regarding the precision with which a population parameter is to be estimated, based on a random sample, and then the necessary sample size would be determined accordingly [26]. Again, readers who are not so much concerned about the mathematical details at this point, can go directly to Figure 9, from there to Textbox 3, then to Figure 11 and then continue with section 7 on stratification.
Assuming we deal with a very large population (as, e.g., in consumer studies or pre-election polls), an asymptotic interval with 95% coverage probability could be employed, based on the estimate πe = m/n and these equations for the lower (L) and upper (U) 95% confidence limits for πe [25]:
$$\pi_{e(L)} = \pi_e - HW \qquad (4)$$
$$\pi_{e(U)} = \pi_e + HW \qquad (5)$$
where
$$HW = 1.96 \times s.e.(\pi_e) \qquad (6)$$
with
$$s.e.(\pi_e) = \sqrt{\frac{\pi_e (1 - \pi_e)}{n}} \qquad (7)$$
is the half width of the confidence interval. Further, we may compute lower and upper 95% confidence limits for δ as δL = πe(L) − πa and δU = πe(U) − πa, respectively.
It is important to point out that the half-width (HW) of the interval is inversely proportional to the square root of the sample size n (see Equations 6 and 7). Thus, the larger the sample size, the smaller the HW. This relation can be used to determine the sample size, if we can specify the desired HW.
Thus, the sample size to achieve a desired HW can be computed as
$$n = \frac{1.96^2 \, \pi_e (1 - \pi_e)}{HW^2} \qquad (8)$$
Note that population size N does not enter this equation. The sample size remains the same, regardless of whether the population is, e.g., 10^4 or 10^8, so long as the population size is large relative to the sample size. What matters are the variables πe and HW. If, e.g., our rough guess in such a large group was that there is a proportion of up to πe = 0.10 or πe = 0.20 of undetected non-compliant members remaining for a specific problem (NC), then the sample size plotted against HW would take the form of Figure 6.
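As a worked illustration of this sample size equation, take the two rough guesses for πe mentioned above and assume HW = 0.05 (a value chosen here only for the arithmetic):

```latex
\begin{align*}
\pi_e = 0.10:\quad n &= \frac{1.96^2 \times 0.10 \times 0.90}{0.05^2}
  = \frac{0.3457}{0.0025} \approx 138 \\
\pi_e = 0.20:\quad n &= \frac{1.96^2 \times 0.20 \times 0.80}{0.05^2}
  = \frac{0.6147}{0.0025} \approx 246
\end{align*}
```

Because HW enters as a square, doubling it to 0.10 divides both results by four (to roughly 35 and 61).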
To shed further light on the equation for determining the sample size, we may give a second interpretation. If the sample size is chosen to equal n, then the probability is 5% that the estimate of πe deviates from the true value by more than HW [26].

So far, we have assumed that the population size is very large. In smaller populations, as in the case of group certification, the exact Clopper-Pearson interval should be used [25], which takes the population size into account. There are no exact equations to determine sample size for this procedure, which yields asymmetrical intervals. As an approximation, we may employ the fact that in finite populations the standard error (s.e.) takes the form described by Thompson [26]:
$$s.e.(\pi_e) = \sqrt{\frac{\pi_e (1 - \pi_e)}{n} \cdot \frac{N - n}{N - 1}} \qquad (9)$$
As opposed to Equation 8, population size N does enter here. The factor (N − n)/(N − 1) relates to the finite population correction. From this, assuming approximate normality of the estimator of πe, the sample size may be computed according to [26]:
$$n = \frac{N \times 1.96^2 \, \pi_e (1 - \pi_e)}{(N - 1)\, HW^2 + 1.96^2 \, \pi_e (1 - \pi_e)} \qquad (10)$$
Note that for large N, this equation approaches the simpler one in Equation 8. Also note that, even though there is a dependence on N, the required sample size is not proportional to N. And only in very small populations is the finite population correction at all noticeable. In Figure 7 we have plotted n against πe for four different HWs, showing that n is inversely related to HW (the higher our expectations on precision, the larger the sample must be), while in relation to πe, n is biggest for 0.5 and decreases both towards 0 and towards 1.
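Solving the finite-population standard error for n gives a closed form that is easy to evaluate. The sketch below uses hypothetical function names and returns an unrounded value, since the rounding convention is not stated; the algebraic form is the one implied by the standard error above and may be written differently in the original equation.

```python
def sample_size_large(pi_e, half_width, z=1.96):
    """Large-population sample size: n = z^2 * pi_e * (1 - pi_e) / HW^2."""
    return z ** 2 * pi_e * (1 - pi_e) / half_width ** 2

def sample_size_finite(group_size, pi_e, half_width, z=1.96):
    """Sample size with finite population correction, obtained by solving
    HW = z * sqrt(pi_e * (1 - pi_e) / n * (N - n) / (N - 1)) for n."""
    a = z ** 2 * pi_e * (1 - pi_e)
    return group_size * a / ((group_size - 1) * half_width ** 2 + a)

# Worst case pi_e = 0.5 with HW = 0.10 for growing group sizes
for size in (100, 500, 2000, 100_000):
    print(size, round(sample_size_finite(size, 0.5, 0.10), 1))
print("large-population limit:", round(sample_size_large(0.5, 0.10), 1))
```

The printed values approach the large-population limit quickly, which illustrates why the required sample size is far from proportional to N.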

Figure 7. Sample size n for a population of 2,000, plotted against incidence πe for four different HWs, using Equation 9.
In Figure 8 we see the impact of five different HWs and five different πe on the required sample size for populations up to 1,000.
Two decisions remain to be made: (a) which is the highest πe in the range from 0 to 0.5 that we must consider in an unknown group of farmers, and (b) which HW are we ready to accept? Statistics cannot answer these questions, which require normative or political answers.
Nevertheless, we can try an approximation:
a. In most cases, we do not know the incidence πe, therefore it is reasonable to assume a value that is close to the worst-case scenario. The worst case is πe = 0.5 (50% of the farmers have the NC we are dealing with); for this scenario we need the largest sample to arrive at a correct decision (Figure 7). If we move too far away from this worst case, there is a risk of arriving at wrong conclusions.
b. Furthermore, we consider that the s.e. should not be too far above 0.05, corresponding to an HW of 0.10.
d. Then we looked for real-life examples where the CB CERES had used sample sizes which were equal to, higher than, or at least close to these figures. Since CERES has also been using the square root multiplied by a risk factor, there are not many examples meeting these criteria. The examples we have found are all from very large groups, because, as shown in Figure 8, above a certain population size the sample sizes resulting from Equation 10 remain in the same range. We used the procedure explained in Section 4 to assess the systemic condition of the NCs found during inspection of these example groups.

Figure 8. Sample size plotted against populations up to 1,000, for half-widths (HW) between 0.05 and 0.20 and πe from 0.1 to 0.5, using Equation 10. Please note that the vertical scales are different for each HW. We omit displaying the results for larger groups, because for all HWs the lines turn almost horizontal above 1,000.
e. As a result, we selected nine groups, all of them from Africa, because this is where the largest producer groups exist [17], with between 3,554 and 78,496 members each. At this point, we do not want to enter into the debate on whether such large groups are certifiable or not; the groups were selected solely for the reasons explained in (d). Adding up all the NCs resulting from the nine inspections of these groups, CERES had identified 57 NCs in total, out of which (using the procedure explained in Section 4) 29 were systemic, 19 were non-systemic, and 9 remained unclear (Method I in Figure 9).
f. Then we calculated different sample sizes for each of the nine groups, using Equation 10, with πe ranging from 0.50 to 0.10, and HW from 0.10 to 0.20. The frequency of each NC was calculated proportionally to the sample size: When a specific NC had occurred 22 times in the original sample of 75 farmers, we assumed that in the same group, it would be detected 14 times in a sample of 58 farmers. From these proportional frequencies, we assessed the systemic condition of each NC, using the same procedure explained in Section 4. The results are shown in Figure 9 (Methods II to XIX).
g. To summarize what is represented in Figure 9:
• To achieve a result with only two "unclear" cases, we would have to use an unrealistically large sample size (Method II in Figure 9, with sample sizes between 2,594 and 8,577 farmers).
• As could be expected, the smaller the sample size, the higher the number of unclear ("watch") cases (yellow in Figure 9).
• Because of the confidence interval, there is no NC which would switch from "systemic" to "non-systemic" with decreasing sample size, or vice versa. They switch from systemic to unclear, or from non-systemic to unclear (see also Figure 5d).
• If we use, e.g., a sample of 15 farmers per group (Method XIX), the interpretation of 38 out of 57 results would remain unclear. With all these unclear results, the sample size would have to be increased after the inspection, which is more complicated than planning for a bigger sample from the beginning.
• It becomes obvious from Figure 9 that the impact of a decreasing HW on sample size and on the number of unclear cases is much stronger than the impact of an increasing πe. This is also confirmed through a regression analysis, where we get a steep and almost linear power function for unclear cases vs. HW, but a less steep power function for unclear cases vs. πe. From πe = 0.35 upwards, the results remain the same (Figure 10).

Figure 9. [...] (Method I) shows the real sample sizes which were used by CERES, based on the square root approach (for extremely large groups, CERES has been using a risk factor < 1, therefore some of the samples are smaller than the square root). The sample sizes for Methods III to XIX were calculated using Equation 10, with different values for πe (from 0.5 to 0.2) and HW (from 0.10 to 0.20). For demonstration purposes, the sample for πe = 0.5 and HW = 0.01 was also calculated (Method II, in purple), resulting in extremely high sample sizes. As reflected in the table on top, the sample sizes vary substantially between methods, but very little between groups. The incidence of each NC was then calculated proportionally to the sample size. Then the classification of each NC was computed for each sample size, using the method described in Section 4. Red means the systemic condition of the NC was confirmed; yellow means the systemic condition is unclear, because the threshold for qualifying an NC as systemic or not lies between the lower and the upper limit of the confidence interval; green means the NC is non-systemic. With decreasing sample size, the number of unclear cases increases. The only result with only two unclear cases was obtained with an unrealistically large sample (Method II), followed by Methods III, VII and XI, with nine unclear cases each.
We therefore suggest using πe = 0.35 and HW = 0.1.
This is depicted as Method XI in Figure 9 and yields the following equation:

(11)

Another option would be to use a slightly larger HW, e.g. 0.125, being aware that many cases may remain in the "unclear" category and that, especially when it comes to NCs of severity class 5, the sample size may have to be increased and the inspection extended to get a clearer picture. Figure 11a shows the sample size for Equation 11 (HW = 0.1, dotted black line) and for HW = 0.125 (dashed black line), as compared to the square root and percentage approaches. In Figure 11b we have plotted HW against group size, showing that for groups up to approximately 1,000 members, the method established by the European Commission accepts very large and questionable HWs, i.e. large standard errors.

Figure 10. Regression function between (a) HW and the number of unclear cases, and (b) πe and unclear cases, using the same data as in Figure 9 (more scenarios were considered than shown in Figure 9). In (a) πe is kept constant at 0.5, while in (b) HW is constant at 0.1. For both HW and πe, we have a very high coefficient of determination R², but for HW we have an almost linear correlation, while for πe we have a power function with a less steep slope. In (b), from πe = 0.35 to 0.5, the number of unclear cases remains constant.
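As a rough illustration of how such a sample size can be computed, the sketch below uses the textbook formula for estimating a proportion in a finite population. That this matches the article's Equations 10 and 11 term by term is an assumption, as are the function name `sample_size` and the rounding up to whole farmers.

```python
import math

def sample_size(N, pi_e=0.35, hw=0.1, z=1.96):
    """Sample size for a group of N farmers.

    Assumed formula (textbook, with finite population correction):
        n = pi_e(1 - pi_e) / ((hw / z)^2 + pi_e(1 - pi_e) / N)
    """
    v = pi_e * (1 - pi_e)                 # binomial variance
    n = v / ((hw / z) ** 2 + v / N)
    return min(N, math.ceil(n))           # round up, never exceed the group

for N in (20, 100, 2000):
    print(N, sample_size(N))
```

Under these assumptions a group of 2,000 members requires 84 farm visits, somewhat fewer than the 100 visits of a 5% rule, while a group of 20 requires 17.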

Textbox 3:
Summarised and simplified explanation of Section 6: sample size determination using statistical standard methods.
We are looking at a binomial trait: the farmer either complies or doesn't comply with a certain requirement. For such traits, sample size in scientific surveys is determined by two variables:
a. The probability of finding the trait, in our case the NC. We call this probability πe. This variable is similar to what is commonly called "risk". But, as opposed to the common perception of "risk-based sample size", the required sample size does not grow proportionally to πe. It is highest for πe = 0.5 (50% probability) and decreases both towards 0 and towards 1 (Figure 7). The problem is that normally we do not know πe beforehand, because the number of non-compliant farmers is exactly what we want to find out. Therefore, we start from the worst-case scenario: 0.5. The real-life examples we checked, however, showed that for our purpose, we can go down to πe = 0.35 without compromising the reliability of results.
b. The second variable is the standard error that we are ready to accept. A common value used in many surveys for this purpose is a standard error of 0.05. This means there is a 95% probability that the true value for the entire group lies within about 0.1 of the sample-based estimate. In our article we use the term "half-width" (HW) for this margin instead of the standard error. A standard error of 0.05 corresponds to an approximate HW of 0.1.
The combination of πe = 0.35 and HW = 0.1 yields the sample size represented by the black dotted line in Figure 11a.

Figure 11. (a) Sample sizes plotted against group members, for four different procedures. The lines for 5%, square root, and square root multiplied by a risk factor of 1.4 are the same as shown in Figure 2, but here presented in contrast to the sample size resulting from Equation 11 (black dotted line). The required sample for small groups is much bigger than with any of the other methods, while for a group of 2,000 members, it is slightly lower than the sample size required when using the 5% rule. (b) HW plotted against group members, for the same four methods. HW for Equation 11 is a horizontal line, because this is how it is defined. If we remember that HW = s.e. × 1.96 (Equation 9), this means that the accepted standard error is the same for all group sizes. If we look at the green curve for square root, we see that for a group of 20 farmers, HW is 0.41, and for a group of 100 farmers, it is 0.29, meaning that we are ready to accept that 20% or 15% of NCs, respectively, slip through. The line for the 5% rule takes an irregular form in both (a) and (b), because according to [3], for groups with fewer than 200 farmers, the rules described in the caption to Figure 2 apply. Therefore, the HW reaches its highest point with 200 members, and then drops. This means that an NC in a 2,000 member group is three times more likely to be spotted than in a 200 member group.
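The half-widths quoted for the square root rule can be checked with a short sketch. It assumes the worst case πe = 0.5, an unrounded square root as sample size, and a finite population correction of (1 − n/N); none of these settings is stated explicitly here, but together they reproduce the values 0.41 and 0.29. The function name `half_width` is illustrative.

```python
import math

def half_width(n, N, pi_e=0.5, z=1.96):
    """HW = z x standard error of a proportion (assumed form, with
    finite population correction 1 - n/N)."""
    se = math.sqrt(pi_e * (1 - pi_e) / n * (1 - n / N))
    return z * se

for N in (20, 100):
    n = math.sqrt(N)                      # square root rule, unrounded
    print(N, round(half_width(n, N), 2))  # prints 0.41 and 0.29
```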

Stratification
Even though most group certification rules include provisions for risk-based sample selection (see Section 3), in real life these rules are mostly not followed, because the risks are generally unknown (with the exception of obvious risks, e.g. larger farms posing a higher risk than small ones, and farms on steep slopes being more prone to soil erosion than farms on flat land). Therefore, and because most group certification rules prescribe that members should be located in geographic proximity and have similar farming systems, the situation presented in Figure 3 is a rather exceptional one. If a CB faces such a situation, where a specific risk exists in one specific sub-group and might slip through when applying random sampling, then the sampling method described above is applied to the rest of the group, while for the "risky sub-group" one of the two following procedures is used:
a. If the risk situation is very clear, judgement sampling may lead to clear results, without the need for quantification. If, e.g., in a risk-based sample of 10 farmers there are three cases of insecticide use, while in the random sample from the rest of the group there are no similar problems, then the sub-group can be excluded, while the rest of the group remains certified.
b. The group can be stratified into two sub-groups [26], and the sampling procedures described above are applied independently to each of the two sub-groups. We should be aware, however, that a stratification, with a certification decision being taken separately for each sub-group, means that the overall sample size is increased substantially (often doubled) compared to simple random sampling.
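The effect of stratification on the overall sample size can be illustrated as follows. The sketch assumes the textbook finite-population formula for a proportion with πe = 0.35 and HW = 0.1 (assumed, not confirmed, to correspond to the article's Equation 11); the group sizes and function names are hypothetical.

```python
import math

def sample_size(N, pi_e=0.35, hw=0.1, z=1.96):
    """Assumed textbook formula with finite population correction."""
    v = pi_e * (1 - pi_e)
    return min(N, math.ceil(v / ((hw / z) ** 2 + v / N)))

def stratified_total(strata_sizes):
    """Total sample when each stratum is sampled independently."""
    return sum(sample_size(N) for N in strata_sizes)

# Hypothetical group of 600 farmers
print(sample_size(600))               # simple random sample: 77
print(stratified_total([300, 300]))   # two equal sub-groups: 136
print(stratified_total([500, 100]))   # one small risky sub-group: 122
```

Because the required sample depends only weakly on group size, splitting the group leaves each stratum needing almost as many visits as the whole group did.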

Witness Audits: Sample Size and Quantification of Results
Witness audits with internal inspectors are an essential tool for assessing competence and compliance of an ICS [17,23]. Typically, such audits are combined with farm visits (see also Table 1 and Figure 1). For streamlining the assessment of the internal inspectors' performance, we suggest using a scoring tool based on a weighted Likert scale [27].
To oblige users to make a clear decision between positive and negative scores, we recommend the use of a scale with four possible answers [28], as explained in Table 6.
The results are then summarised for all witnessed internal inspectors. If the total score for all witness audits is below a certain threshold (we suggest 70% of the maximum possible score), the ICS is considered to be not functional. If it is between 70 and 100%, corrective actions should be implemented (Table 7).

Table 6. Scoring tool using a Likert scale for witness audits with internal inspectors. For each criterion, the external inspector can make a choice: "Strongly agree / Agree / Disagree / Strongly disagree", corresponding to 3, 2, 1 and 0 marks, respectively. The results are weighted for calculating the sum, because not all criteria are equally important.

Table 7. Summarising the scores from different witness audits for assessing the overall performance of internal inspectors. In these fictitious examples for two groups, six from a total of 10 internal inspectors have been witnessed. The maximum possible score (third column) differs from case to case, because not all questions are applicable to all farms (see Table 6, third column).

Small producer groups often have only one or two internal inspectors. In these cases the question of sampling does not come up. For groups with more internal inspectors, based on [29] we propose the following method for determining the sample of internal inspectors to be witnessed, out of a total of N internal inspectors (again, readers not interested in the statistical details can jump to Figure 12):

(12)

While in Equation 10 we deal with a binomial distribution (farmers comply or don't comply with a specific requirement), here we are assuming an approximately normal distribution with unknown variance. Therefore, as opposed to Equation 10, the variance σ² of scores enters Equation 12 (in place of the variance πe(1 − πe) in Equation 10). Figure 12 shows the results of this equation, for HW = 0.1 and five variances.
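The scoring and summarising steps described for Tables 6 and 7 can be sketched as follows. The criteria, weights and function names are hypothetical placeholders, since the actual criteria of Table 6 are not reproduced here; treating a score of exactly 100% as a separate outcome is also an assumption.

```python
# Marks per answer, as described for Table 6
MARKS = {"strongly agree": 3, "agree": 2, "disagree": 1, "strongly disagree": 0}

def audit_score(answers, weights):
    """Weighted score of one witness audit.

    answers: {criterion: answer, or None if not applicable}
    Returns (achieved score, maximum possible score).
    """
    applicable = {c: a for c, a in answers.items() if a is not None}
    achieved = sum(weights[c] * MARKS[a] for c, a in applicable.items())
    maximum = sum(weights[c] * 3 for c in applicable)
    return achieved, maximum

def ics_assessment(audits, threshold=0.70):
    """Summarise all witness audits of a group against the 70% threshold."""
    share = sum(a for a, _ in audits) / sum(m for _, m in audits)
    if share < threshold:
        return share, "ICS not functional"
    if share < 1.0:
        return share, "corrective actions"
    return share, "full score"          # assumption: no action at 100%

# Hypothetical criteria and weights
weights = {"checks records": 2, "interviews farmer": 1, "inspects storage": 1}
audit_1 = audit_score({"checks records": "agree",
                       "interviews farmer": "strongly agree",
                       "inspects storage": None}, weights)
audit_2 = audit_score({"checks records": "disagree",
                       "interviews farmer": "agree",
                       "inspects storage": "agree"}, weights)
print(ics_assessment([audit_1, audit_2]))
```

Criteria that are not applicable on a given farm are left out of both the achieved and the maximum score, which is why the maximum differs from audit to audit.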
From the CERES database, we evaluated the witness audit results from 18 producer groups from eight different countries, with a total of 72 internal inspectors. CERES has been working with a Likert scale with only three possible answers (Yes / Partly / No), but this should not substantially bias the variability of results, as compared to a scale with four answers. The within-group variance σ² for the performance of internal inspectors ranged from 0 to 0.34. For estimating the pooled variance σ²p across k groups, we used [29]:

[...]

which yielded σ²p = 0.079 for our case (orange line in Figure 12). To be on the safe side, we suggest using a variance of 0.15 (black dotted line in Figure 12). Here, it is assumed that the underlying true variance is constant. If the performance of internal inspectors is more variable, larger samples must be used accordingly.

According to our data, the variance tends to increase with lower score means. By way of analogy with the binomial distribution, and taking into account the fact that scores are integer values with a fixed lower and upper bound, it may be assumed that the variance drops to zero when the score mean µ attains the minimum or maximum value and follows a quadratic function of the mean in between. This model may be used to estimate a variance function for σ², which could then be used in Equation 12 with a prior estimate of the mean. Our estimate of the variance function, based on the evaluation of the scores from 18 producer groups, is

[...]

In the absence of such an estimate, the worst-case scenario may be considered by plugging in the midpoint between the minimum and maximum score. Details are described in the Appendix. For the sake of simplicity we will assume a constant variance here.

The suggested threshold of a minimum score of 70% is a political, normative proposal, and other choices are possible, of course.
If the result is close to this threshold (see Table 7), the results should be assessed in combination with the results of the other inspection levels (Table 1, Figure 1). This can be done e.g. using the traffic-light system described in Table 8.

Major inconsistencies
Major problems with farmer list, internal reports, conflicts of interest, etc.
Case-by-case decision whether a) certification can be granted after a follow-up inspection has confirmed implementation of corrective actions, or b) certification must be denied, suspended or revoked

Conclusions
a. Experts agree that many CBs lack the ability to address NCs in producer groups at a systemic level. Our procedure for defining the systemic condition of NCs at farm level, depending on the incidence and severity of each NC, offers a tool for solving this problem. The method should be tested in practice, and the variables adjusted as necessary.
b. Sample selection should be random, not risk-oriented. If a combination of random and risk-oriented sampling is used, then the group must be stratified, which leads to a larger sample size.
c. Neither a square root nor a 5% sampling rule is in line with the basic principles of sample size determination in scientific surveys. Especially for smaller groups, there is a high risk of cases slipping through with these methods. We suggest using Equation 11 for sample size determination. If a larger HW (and thus a smaller sample) is used instead of 0.1 as in Equation 11, the CB must be ready to increase the sample if NCs of severity class 5 come up for which it is not clear whether they are systemic or not.
d. Similar to the results of farm inspections, the results from witness audits with internal inspectors can also be quantified and summarised in a meaningful way.
e. The combination of the results from farm inspections, witness audits, ICS office and buying system assessment allows for differentiated certification decisions.
f. As a general rule, what matters most for assessing the functioning of an ICS is not large sample sizes, but personal integrity of inspectors, organisational integrity of CBs, inspector competence, inspection procedures (e.g. witness audits with internal inspectors, testing for residues where appropriate), asking the right questions to the right persons, cross-checking the right documents, and conducting inspections at the right time of the year.

data v;
input Mean Variance n;
x=Mean*(3-Mean);
datalines;
3.00 0.000 3
3.00 0.000 2
3.00 0.000 3
3.00 0.000