A Proposal for Improving Organic Group Certification: Quantification of Internal Control Systems' Performance and Sample Size Determination
|Albrecht Benzing 1 ,* ,Hans-Peter Piepho 2|
|1 CERES GmbH, Bavaria, Germany|
|2 Biostatistics Unit, Institute of Crop Science, University of Hohenheim, Baden-Württemberg, Germany|
* Corresponding author.
1.1.- Group Certification and One of Its Weaknesses
Group certification is used in different farm certification schemes (GLOBALG.A.P., Rainforest Alliance, Round Table on Sustainable Palm Oil, organic farming, etc.). The basic idea is to facilitate access to certification by building up an Internal Control System (ICS), the effectiveness of which is verified by an external inspection (also called “audit”). While under some programs (e.g. GLOBALG.A.P. and the National Organic Program of the USA, NOP), there is no restriction concerning the size of the member farms, the EU regulation on organic farming restricts participation in group certification to small farms [2, 3].
Research in relation to group certification so far has addressed its impact on market access and smallholder incomes [4, 5, 6, 7, 8, 9, 10, 11], implementation of improved agricultural practices by the certified farmers [10, 12], schooling, scalability, internal organisational problems of the groups and certification costs [7, 15], environment and nature conservation [4, 9], and adaptation to climate change, but not the functioning of the ICS as such, their ability to ensure compliance with the standards, nor the way that certification bodies (CBs) deal with the ICS.
For a better understanding of the organic group certification process, Figure 1 describes the general workflow.
Table 1 summarizes the most important rules for an organic ICS and also explains at which level and through which methods an external inspector can verify compliance with each of these rules. Out of the eight rules in this table, (h) is the most important one, because an ICS cannot be considered functional if it does not identify the existing non-conformities (NCs) among its members, ensuring that these are either corrected or the non-compliant members are excluded. Also, for the CB the visit to a sample of farmers is the core part of the group inspection. The CB should not only assess compliance with basic organic farming rules like, e.g., having a proper crop rotation, protecting the soil from erosion, ensuring adequate storage conditions for organic products, using only allowed fertilizers, etc. at each farm in the sample, but also use these visits to the farmers for cross checking the accuracy of records kept at the group level, verify separation of certified from non-certified products on their way from farm to export, and find out if member farmers have received appropriate training and consultancy (Table 1).
However, little to no effort has been made so far to systematically assess the outcome of these external visits. A new EU regulation for the first time establishes official rules for group certification instead of unofficial guidelines. But what exactly does it mean, when this new regulation says “For the purpose of evaluating the set-up, functioning and maintaining of the ICS of a group of operators, the […] control body, shall determine at least that the ICS manager takes appropriate measures in case of non-compliance, including their follow up, according to the ICS documented procedures that have been put in place”? If in a sample of farmers the CB finds one case where the ICS manager has not “taken appropriate measures”: does that mean the ICS is not functional, which ultimately means the group cannot be certified (Figure 1)? Or is there a meaningful threshold, above which the CB should make that decision?
In a worldwide survey among organic CBs, including expert interviews, the lack of such thresholds was identified as one of the main weaknesses of the current situation of organic group certification (Textbox 1).
Textbox 1. One of the conclusions from a worldwide survey on organic group certification (bold accentuation by the authors).
Non-organic group certification schemes are also vague in this regard. GLOBALG.A.P., e.g., differentiates between “structural” and “non-structural” NCs, but does not explain how often an NC must occur for categorising it as “structural”.
1.2.- The External Sample Size
The size of the sample of farmers visited by the CB (the “external sample”) has been the subject of long-standing discussions between the stakeholders involved. Currently, the most common approach is using the square root of the total number of group members, multiplied by a risk factor, which varies between 1.0 and 1.4. This is established in an unofficial guideline by the EU Commission. GLOBALG.A.P., Rainforest Alliance and other programs also use the square root as the basis for calculating the external sample size, although without applying risk factors.
The new Regulation (EU) 2018/848 on organic farming, which will come into force in January 2022, for the first time introduces official minimum requirements for the groups and their ICS and for the procedures to be followed by CBs for this purpose. Although clear evidence does not exist in this regard, according to the perception of regulatory authorities, fraud is more common under group certification than under individual farm certification. To address the related risks, the EU Commission stipulates that (a) the maximum group size shall be limited to 2,000 members, and (b) organic CBs shall inspect a minimum of 5% of the group members instead of the square root. Figure 2 shows that for small groups the sample would be smaller with the 5% rule, while for large groups it would be much bigger.
This proposed change raises two concerns: (a) a fixed 5% sample disregards basic statistical principles of sample size determination and will lead to high standard errors for small groups, and (b) as long as the weaknesses in the system described above are not addressed, larger sample sizes (for big groups, see Figure 2) will only reproduce the existing problems at a larger scale.
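The two rules compared above can be put side by side in a minimal sketch. The function names are hypothetical, and rounding up to the next whole farmer is an assumption (it reproduces the square-root figures quoted elsewhere in this paper, but the regulation's own rounding rule is not restated here).

```python
import math


def sqrt_rule(group_size: int, risk_factor: float = 1.0) -> int:
    """Square-root rule: risk factor times sqrt(N), rounded up (assumed rounding)."""
    return math.ceil(risk_factor * math.sqrt(group_size))


def five_percent_rule(group_size: int) -> int:
    """5% rule, rounded up with integer arithmetic (assumed rounding)."""
    return (group_size * 5 + 99) // 100


# Compare both rules for a few illustrative group sizes.
for size in (50, 100, 400, 1000, 2000):
    print(size, sqrt_rule(size), sqrt_rule(size, 1.4), five_percent_rule(size))
```

With a risk factor of 1.0 the two rules coincide at 400 members; below that the 5% rule gives the smaller sample, above it the larger one.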
2.- What is the Purpose of Sampling in Organic Group Certification?
As explained above, the performance assessment of an ICS takes place at four different levels: at the ICS office, in the buying centres, at the farms and during witness audits with the internal inspectors (Table 1 and Figure 1). The results of the audits at the first two levels are mostly qualitative, but a meaningful assessment of the findings from the farm level requires some kind of quantification (Figure 1). Quantification of the results of the witness audits with internal inspectors may not be necessary in small groups with one or few inspectors, but becomes important in large groups with many internal inspectors (Section 8). A key underlying question is: What exactly is the goal of sampling a certain number of member farmers?
a. Is the goal to determine the exact percentage (incidence) of each kind of NC? Not really. Let us assume we are dealing with a group where many farmers use herbicides, which are prohibited in organic farming. Does it matter for the CB if, say, 14%, 32% or 45% of the farmers use herbicides? The answer is “no”, because in any of these cases, the conclusion would be the same: the ICS is not functional, and certification would have to be suspended, temporarily or permanently. Or let us imagine a group where some farmers do not keep records of their daily field activities. Would it make a difference for the CB if this problem were found among, say, 2%, 4% or 10% of farmers? No, because in any of these cases, the ICS would be requested to propose corrective actions, to ensure that farmers keep their records in the future. And in none of these cases would the group’s certification be at risk.
b. Is the goal to find each and every NC that may exist in the group and has slipped through the ICS? Any type of sampling always involves the risk of a certain number of cases slipping through. Such a risk may not be acceptable when it comes, e.g., to high food safety risks, but a zero-tolerance goal would not be appropriate for organic group inspections, because (i) compliance with organic production rules is not a food safety issue, (ii) the idea of “group” certification would become meaningless, since ultimately the sample size would have to be equal to the total number of farmers, and (iii) even with 100% external inspections, not all NCs existing at the time of the inspection will be detected, let alone those NCs which may not be detectable on the day of the inspection.
c. Is the goal to ensure that non-compliant farmers identified during external inspections are excluded from the group? This is a common misunderstanding (see also Textbox 1), which completely misses the point of group certification. If the CB inspects, e.g., 10 out of 100 farmers, and finds in this sample two farmers using synthetic fertilizers, then we assume that in the entire group there are many more farmers with this problem, and excluding the two members would not solve the problem.
d. Is the goal to decertify groups when the incidence of severe NCs exceeds a certain threshold? This is how, e.g., the Rainforest Alliance group certification works: “if an irreversible non-compliant practice occurred on more than 5% (of the whole group, after extrapolation (…) and/or on at least 5 of the audited small farms this is considered to be a systemic issue (…) and therefore shall result in non-certification and/or cancellation”. There may be different opinions among CBs and regulatory authorities in this regard, but the authors believe that this approach does not sufficiently consider the efforts made by the ICS. Let us look again at the example above of a group with widespread herbicide use: if, in a group of 100 farmers, the ICS has never detected any case of herbicide use, but then the CB detects one case in a sample of 10 farmers, this situation should be treated differently from the case where the ICS has already excluded 20 out of 100 farmers, but then the CB finds one more case.
e. The real goal of external inspections should be to determine (i) which existing NCs have been properly handled by the ICS and which not; (ii) among the latter, which are “systemic” and which are “isolated” cases; and (iii) which of the systemic cases put at risk the integrity of the products sold on the organic market, and the credibility of the certification system.
3.- Judgement Sampling vs. Statistical Sampling
The U.S. Office of the Comptroller of the Currency distinguishes between “judgement sampling” and “statistical sampling”. The definition of judgement sampling is quoted in Textbox 2.
The current organic group certification procedures are mostly based on judgement sampling. The problem is, however, that the involved CBs do not always have the necessary level of “professional judgement” that would lead to satisfactory results. A solution to the problem presented in Textbox 1 can only be found using “statistical sampling”, which allows extrapolation of sample results to the entire group. Statistical sampling methods must select the sample randomly, not risk-based, otherwise the results would be biased. If a CB knows, e.g., that a specific problem is more frequent in one village belonging to a producer group, and therefore targets farmers from that village more than the rest of the group, the results from this inspection cannot be extrapolated to the entire group, because the problem would be over-estimated (Figure 3).
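The over-estimation described above can be illustrated with a small sketch. All numbers are hypothetical (one village of 20 farmers with a 50% incidence, 80 other farmers with 5%), and the helper names are introduced here for illustration only.

```python
import random


def draw_random_sample(member_ids, sample_size, seed=None):
    """Simple random sample without replacement: every member has the same chance."""
    rng = random.Random(seed)
    return sorted(rng.sample(list(member_ids), sample_size))


def expected_estimate(sample_shares, incidences):
    """Expected sample incidence when each subgroup supplies a given share of the sample."""
    return sum(share * inc for share, inc in zip(sample_shares, incidences))


# Hypothetical group: village A (20 farmers, 50% with the NC), rest (80 farmers, 5%).
true_incidence = (20 * 0.50 + 80 * 0.05) / 100

# Random sampling draws from each subgroup in proportion to its size.
random_expectation = expected_estimate([0.2, 0.8], [0.50, 0.05])

# A risk-based sample taking half of its farmers from village A over-estimates.
targeted_expectation = expected_estimate([0.5, 0.5], [0.50, 0.05])

sample = draw_random_sample(range(1, 101), 10, seed=2016)
print(true_incidence, random_expectation, targeted_expectation, sample)
```

In this hypothetical setting the random sample is unbiased for the group incidence, while the targeted sample would suggest roughly twice the true incidence.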
4.- What Does “Systemic” Mean? What Does “Integrity” Mean?
To find a solution to the problem described in Textbox 1, we must first define systemic NCs vs. isolated NCs and in which cases systemic NCs should lead to decertification. In this section, we propose a new procedure for quantifying these terms and for answering these questions, with the help of the variables defined in Table 2. Readers who are less interested in the statistical details can jump directly to Table 3, from there to Figure 5, and then continue with the real life examples in Section 5.
|CB||Certification body (called “control body” in the EU Regulation on organic farming).|
|ICS||Internal control system.|
|NC||A specific non-conformity occurring among the members of the group (see Table 4 and following for examples).|
|Half-width of 95% confidence interval (= standard error × 1.96).|
|Number of farmers in the entire group with NC .|
|Number of farmers in the entire group identified by the ICS for NC .|
|Number of farmers in the entire group with NC identified but not corrected by the ICS. Can be estimated from the sample by .|
|Number of farmers in the entire group with NC found by the CB, which were missed by the ICS. The CB estimates this variable from the sample using .|
|Number of farmers with NC found by the CB, which had previously been detected by the ICS, but not yet corrected at the time of the external inspection.|
|Number of farmers with NC found by the CB, which had not been detected by the ICS.|
|: These two cases are treated equally; number of farmers in sample taken by CB with NC .|
|N||Size of population (number of all members of the group).|
|n||Size of sample inspected by the CB.|
|: Incidence of NC in the entire group.|
|: Incidence of NC in the entire group that are detected and corrected by the ICS.|
|: Incidence of NC in the entire group, which were previously detected by the ICS, but not yet corrected at the time of the external inspection. Can be estimated from the sample by .|
|: Incidence of NC in the entire group, which were not detected by the ICS.|
|: Incidence of NC1 in the group, which either went undetected or were detected but not corrected by the ICS. This parameter is estimated by extrapolation from the sample by .|
|Lower limit of the confidence interval for ; this can be obtained by an asymptotic method for large samples or by the exact Clopper-Pearson interval for small samples and populations.|
|Upper limit of the confidence interval for ; this can be obtained by an asymptotic method for large samples or by the exact Clopper-Pearson interval for small samples and populations.|
|: For the sake of valuing the effort made by the ICS, is deducted from . Refer to section (b) in the text for more details. Small and negative values of this criterion are desirable. Values above a threshold are considered as a systemic failure of the ICS.|
|Lower limit of the confidence interval for .|
|Upper limit of the confidence interval for .|
|Threshold, above which is considered “systemic”.|
|Repetition: number of subsequent external inspections, during which NC is found at a systemic level. Normally, such inspections take place yearly, but they can also be more frequent.|
|Threshold, above which the repetition of a systemic NC leads to decertification.|
|Severity category of an NC (see Table 3).|
|Variance of a character trait within a group.|
|Pooled variance across groups.|
b. As explained in Section 2(d), one approach for assessing the performance of the ICS would be to simply define a threshold, above which a group should be de-certified. This would mean using an estimate of the incidence of undetected or uncorrected NCs (Table 2) for this purpose, i.e. the proportion m/n found by the CB in the sample. For the reasons explained in Section 2(d) (we want to value the efforts made by the ICS, which has already detected certain cases), we suggest using instead the difference between the incidence of a specific NC identified by the CB in the sample (extrapolated to the entire group), and the incidence identified and corrected by the ICS in the entire group, i.e. m/n minus M/N. This better values the efforts made by the ICS (an approach which may not be shared by all CBs and regulatory authorities).
c. Next, to reflect that an estimate is used, we compute the lower and upper limits of a 95% confidence interval for the incidence of undetected or uncorrected NCs (Table 2) using standard procedures as described by Agresti (pp. 15, 18-21) and also described in detail below (Section 6). The lower and upper limits for this incidence, and those for the difference criterion, are listed in Table 2.
d. We define a threshold above which the incidence of an NC is considered systemic. Since this threshold should differ depending on the type of NC, we group the existing NCs into five severity categories, from 1 (least severe) to 5 (most severe). Refer to Table 3 for examples. Each severity category is associated with an acceptable threshold, above which the difference criterion is considered “systemic” (third column in Table 3).
The more severe the category, the lower the acceptance threshold. The NC is considered systemic if the lower confidence limit of the criterion lies above the threshold, and non-systemic if the upper confidence limit lies below it. If neither of the two conditions holds, the sample size was too small to reach a definitive assessment. This is likely to happen only when the estimate is close to the threshold. Note that this step amounts to a significance test at the 5% level to decide if the criterion is significantly smaller or larger than the threshold.
e. As a second condition for considering an NC as “non-systemic”, we introduce the requirement that the incidence of undetected or uncorrected cases must be below 0.3, regardless of the incidence already handled by the ICS. The rationale is as follows: if the ICS makes serious efforts for handling NCs, but in spite of these efforts the CB still finds many undetected or uncorrected cases, there is a systemic problem.
The assessment is inconclusive otherwise. If this happens for NCs with a severity below 5, we suggest the CB decides from case to case whether the NC is considered systemic or not. For NCs with severity 5, the sample should be increased until a clear picture is obtained.
Finally, we suggest how often a systemic NC can be repeated (the repetition, see Table 3), before it seriously affects the integrity of the system and should therefore lead to (temporary or final) decertification. We call this threshold “repetition tolerance”. The repetition tolerance is also related to severity (Table 3, column 4). For NCs with severity 5, we have defined a repetition tolerance of 1, meaning there is no tolerance for systemic NCs of this category.
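The procedure outlined in steps (b) to (e) can be sketched as follows. This is an illustration under stated assumptions, not the authors' spreadsheet template: the decision rule (lower limit above the threshold means systemic, upper limit below it means non-systemic, the 0.3 cap applied to the confidence limits) is one reading of the text, the function names are hypothetical, and the exact Clopper-Pearson limits are obtained by bisection on the binomial distribution rather than through an F or beta quantile function.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(func, target):
    """Bisection on [0, 1] for func(p) = target, with func increasing in p."""
    low, high = 0.0, 1.0
    for _ in range(60):
        mid = (low + high) / 2
        if func(mid) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2


def clopper_pearson(cases, sample_size, alpha=0.05):
    """Exact two-sided confidence limits for a binomial proportion."""
    lower = 0.0 if cases == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(cases - 1, sample_size, p), alpha / 2)
    upper = 1.0 if cases == sample_size else _solve_increasing(
        lambda p: 1 - binom_cdf(cases, sample_size, p), 1 - alpha / 2)
    return lower, upper


def classify_nc(m, n, M, N, threshold, cap=0.3):
    """Assumed decision rule: compare the limits of (m/n - M/N) with the threshold."""
    lower, upper = clopper_pearson(m, n)
    ics_incidence = M / N  # incidence already detected and corrected by the ICS
    if lower - ics_incidence > threshold or lower > cap:
        return "systemic"
    if upper - ics_incidence < threshold and upper < cap:
        return "non-systemic"
    return "unclear"


# Figures taken from Table 4 (sample of 40 farmers, no cases detected by the ICS).
print(clopper_pearson(18, 40), classify_nc(18, 40, 0, 1079, 0.2))
print(clopper_pearson(2, 40), classify_nc(2, 40, 0, 1079, 0.2))
```

Under these assumptions the sketch gives confidence limits that agree, after rounding, with those listed in Table 4, and the same systemic or non-systemic verdicts.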
5.- Two Real Life Examples
For exemplifying the proposed method, we have selected two cases of group certification from the CERES database: a positive case of a group with a functioning ICS and only minor deficiencies, and a negative case, which lost its certification. If the method suggested had been applied, these results would have been confirmed—but based on a more transparent and reliable procedure.
The first case study refers to a cocoa farmers group with 1,079 members. Since this was the first inspection of this group, the risk factor had been calculated as 1.2, based on theoretical assumptions, leading to the sample size: 1.2 × √1,079 ≈ 39.4, rounded up to 40 farmers.
Three NCs were found, two of which were systemic, but none of these with serious implications for integrity (Table 4).
The rather small NCs could be easily corrected, and the group was certified.
The second case study is for a group of 1,413 coffee farmers, spread over a large area, with highly heterogeneous geographical conditions. A risk factor of 1.4 had been determined, leading to the following sample size: 1.4 × √1,413 ≈ 52.6.
During the four previous years, only minor NCs had been detected. During inspection planning in 2016, the CB found that the samples in previous years had not been random, because they had only covered a relatively small part of the region. This was corrected by randomly including farmers from all parishes in the new sample. Furthermore, the CB had learned that agrochemical use among coffee smallholders in the entire region had increased substantially. Therefore, coffee leaf samples were taken from 16 out of the externally inspected 54 farmers and tested for pesticide residues.
As a result of this change in inspection procedures, in addition to several other (systemic and non-systemic) NCs, on 10 farms the inspectors found synthetic pesticides and/or fertilizers. In 6 out of 16 leaf samples, residues of synthetic fungicides were found at levels, which could only be explained by application by the organic farmers (Table 5).
None of these NCs had been detected by the ICS, therefore the group’s organic certificate was withdrawn immediately. If the method proposed here had been used, the result would have been the same. These severe problems in the group, however, were detected not because the sample size was increased as compared to previous years, but because (a) the sample was chosen randomly, and (b) the inspection procedure was improved by testing leaf samples, which had not been done in previous years.
|1||NCs found during the inspection||Cases detected by ICS||Incidence per ICS (M/N)||Additional cases detected by CB||Incidence of additional cases per CB (m/n)||Lower confidence limit||Upper confidence limit||Difference lower limit||Difference upper limit||Severity||Threshold for systemic condition||Systemic?||Repetition||Repetition tolerance||Decertification|
|4||NC : Incorrect yield estimate||0||0.00||38||0.95||0.83||0.99||0.83||0.99||1||0.25||Yes||1||5||No|
|5||NC : Incorrect farm size||0||0.00||18||0.45||0.29||0.61||0.29||0.61||2||0.2||Yes||1||4||No|
|6||NC : Incorrect number of cocoa plots||0||0.00||2||0.05||0.006||0.17||0.006||0.17||2||0.2||No||1||NA||No|
|1) Repetition: in how many external inspections this NC has been found at a systemic level. In the present case, this is 1, because it was the first inspection. For NC the value is 0, because this NC is not systemic.|
|2) Refer to Table 2 for an explanation of this row.|
|3) Many CBs use MS Excel or similar tools for such procedures. Row 3 shows how this can be done, for the example of NC (row 4). Only the blue columns would require entries, the rest would be computed through formulas.|
|4) The common NCs in grower groups can be listed in a dropdown menu.|
|5) Cell B4 divided by cell B7 (both in yellow).|
|6) Here we calculate the lower Clopper-Pearson confidence limit (p. 18); in Excel syntax we use the function F.INV. Refer to the template in supplementary materials.|
|7) VLOOKUP is a formula linked to the type of NC (column A).|
|8) Nested IF-THEN-ELSE formulas are used here.|
|9) In this case, incorrect yield estimates were assigned a “severity” of 1 only, because the group had intentionally used a very conservative estimate for kg cocoa beans per hectare.|
6.- Sample Size Determination in Scientific Surveys
In a scientific survey with the goal explained above, neither a fixed percentage nor a square root of the total population size would be used as sample size. Instead, a specification would be made regarding the precision with which a population parameter is to be estimated, based on a random sample, and then the necessary sample size would be determined accordingly. Again, readers who are less concerned about the mathematical details at this point can go directly to Figure 9, from there to Textbox 3, then to Figure 11 and then continue with Section 7 on stratification.
Assuming we deal with a very large population (as, e.g., in consumer studies or pre-election polls), an asymptotic interval with 95% coverage probability could be employed: the lower (L) and upper (U) 95% confidence limits are obtained by subtracting from, and adding to, the estimated incidence a margin of 1.96 standard errors.
This margin is the half-width of the confidence interval. Further, we may compute lower and upper 95% confidence limits for the difference criterion of Section 4 from L and U.
It is important to point out that the half-width of the interval is inversely proportional to the square root of the sample size. Thus, the larger the sample size, the smaller the half-width. This relation can be used to determine sample size, if we can specify the desired half-width.
Thus, the sample size needed to achieve a desired half-width follows by solving the half-width expression for the sample size.
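A standard form of this relation is written out below. The symbols are assumptions introduced here for illustration: \(\hat{\pi} = m/n\) for the estimated incidence, \(hw\) for the half-width, and 1.96 as the normal quantile for 95% coverage.

```latex
\begin{align*}
L &= \hat{\pi} - hw, \qquad U = \hat{\pi} + hw, \\
hw &= 1.96 \sqrt{\frac{\hat{\pi}\,(1 - \hat{\pi})}{n}}, \\
n &= \frac{1.96^{2}\, \pi\,(1 - \pi)}{hw^{2}} .
\end{align*}
```

As a worked example under these assumptions, an incidence of 0.5 and a desired half-width of 0.10 give n = 3.8416 × 0.25 / 0.01 ≈ 96.04, i.e. 97 farmers after rounding up.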
Note that population size does not enter this equation. The sample size remains the same regardless of the population size, so long as the population is large relative to the sample size. What matters are the assumed incidence and the desired half-width. If, e.g., our rough guess in such a large group was that a certain proportion of undetected non-compliant members remains for a specific problem (NC), then the sample size plotted against the desired half-width would take the form of Figure 6.
To shed further light on the equation for determining the sample size, we may give a second interpretation. If the sample size is chosen according to this equation, then the probability is 5% that the estimate of the incidence deviates from the true value by more than the half-width.
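A minimal numerical sketch of the large-population case follows. It assumes the usual asymptotic (normal approximation) relation between half-width and sample size; the function names are hypothetical and rounding up is an assumption.

```python
import math

Z_95 = 1.96  # normal quantile for 95% coverage


def half_width(incidence, sample_size, z=Z_95):
    """Half-width of the asymptotic interval for a given incidence and sample size."""
    return z * math.sqrt(incidence * (1 - incidence) / sample_size)


def sample_size_large_population(incidence, desired_half_width, z=Z_95):
    """Smallest sample size reaching the desired half-width (population size ignored)."""
    return math.ceil(z**2 * incidence * (1 - incidence) / desired_half_width**2)


for hw in (0.05, 0.10, 0.20):
    print(hw, sample_size_large_population(0.5, hw))
```

Because the half-width shrinks only with the square root of the sample size, halving the half-width requires roughly four times as many farmers.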
So far, we have assumed that the population size is very large. In smaller populations, as in the case of group certification, the exact Clopper-Pearson interval should be used, which takes the population size into account. There are no exact equations to determine sample size for this procedure, which yields asymmetrical intervals. As an approximation, we may employ the fact that in finite populations the standard error takes the form described by Thompson, which includes a finite population correction.
As opposed to Equation 8, population size N does enter here, through the finite population correction. From this, assuming approximate normality of the estimator of the incidence, the required sample size may be computed.
Note that for large N, this equation approaches the simpler one for very large populations. Also note that, even though there is a dependence on N, the required sample size is not proportional to N. And only in very small populations is the finite population correction at all noticeable. In Figure 7 we have plotted the sample size against the incidence, for four different half-widths, showing that the sample size is inversely related to the half-width (the higher our expectations on precision, the larger the sample must be), while in relation to the incidence, the sample size is biggest for 0.5, and decreases both towards 0 and towards 1.
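The behaviour described above can be checked numerically with a minimal sketch. The formula used is one common textbook form of the finite-population sample size for a proportion and is an assumption here, since the exact equation is not reproduced in this text; the function name is hypothetical.

```python
def sample_size_finite_population(group_size, incidence, half_width, z=1.96):
    """Approximate sample size for a proportion with finite population correction.

    Assumed textbook form: n = N * v / ((N - 1) * hw^2 / z^2 + v), with v = p * (1 - p).
    Returns an unrounded value; rounding is left to the caller.
    """
    variance = incidence * (1 - incidence)
    return group_size * variance / (
        (group_size - 1) * half_width**2 / z**2 + variance
    )


# Required sample size grows with N but levels off quickly.
for size in (100, 500, 2000, 20000, 10**7):
    print(size, round(sample_size_finite_population(size, 0.5, 0.10), 1))
```

Under this assumed form the required sample size stays below the large-population value of about 96 for an incidence of 0.5 and a half-width of 0.10, and it is far from proportional to the group size.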
In Figure 8 we see the impact of five different incidences and five different half-widths on the required sample size for populations up to 1,000.
Two decisions remain to be made: (a) which is the highest incidence in the range from 0 to 0.5 that we must consider in an unknown group of farmers, and (b) which half-width are we ready to accept? Statistics cannot answer these questions, which require normative or political answers.
Nevertheless, we can try an approximation:
a. In most cases, we do not know the incidence, therefore it is reasonable to assume a value that is close to the worst-case scenario. The worst case is an incidence of 0.5 (50% of the farmers have the NC we are dealing with). For this scenario we need the largest sample for arriving at a correct decision (Figure 7). If we move too far away from this worst case, there is a risk of arriving at wrong conclusions.
b. Furthermore, we consider that the standard error should not be too far above 0.05, corresponding to a half-width of 0.10.
c. Based on these two considerations, let us use a near-worst-case incidence and a half-width of 0.10 as a starting point. The corresponding sample size is represented by the green line for 0.10 in Figure 8b. The sample sizes are substantially higher than the square root (e.g. N = 100: 48 vs. 10; N = 500: 78 vs. 23; N = 2,000: 84 vs. 45).
d. Then we looked for real life examples where the CB CERES had used sample sizes which were equal, higher, or at least close to these figures. Since CERES has also been using the square root multiplied by a risk factor, there are not many examples meeting these criteria. The examples we have found are all from very large groups, because, as shown in Figure 8, above a certain population size, the sample sizes resulting from the finite-population equation remain in the same range. We used the procedure explained in Section 4 for assessing the systemic condition of the NCs found during inspection of these example groups.
e. As a result, we selected nine groups, all of them from Africa, because this is where the largest producer groups exist, with between 3,554 and 78,496 members each. At this point, we do not want to enter into the debate whether such large groups are certifiable or not; the groups were solely selected for the reasons explained in (d). Adding up all the NCs resulting from the nine inspections of these groups, in total CERES had identified 57 NCs, out of which (using the procedure explained in Section 4) 29 were systemic, 19 were non-systemic, while 9 remained unclear (Method (I) in Figure 9).
f. Then we calculated different sample sizes for each of the nine groups, using the finite-population equation, with the incidence ranging from 0.50 to 0.10, and the half-width from 0.10 to 0.20. The frequency of each NC was calculated proportionally to the sample size: when a specific NC had occurred 22 times in the original sample of 75 farmers, we assumed that in the same group, it would be detected 17 times in a sample of 58 farmers (22 × 58 / 75 ≈ 17). From these proportional frequencies, we assessed the systemic condition of each NC, using the same procedure explained in Section 4. The results are shown in Figure 9 (Methods II to XIX).
g. To summarize what is represented in Figure 9:
For achieving a result with only two “unclear” cases, we would have to use an unrealistically large sample size (Method II in Figure 9, with sample sizes between 2,594 and 8,577 farmers).
As could be expected, the smaller the sample size, the higher the number of unclear (“watch”) cases (yellow in Figure 9).
Because of the confidence interval, there is no NC, which would switch from “systemic” to “non-systemic” with decreasing sample size, or vice-versa. They switch from systemic to unclear, or from non-systemic to unclear (see also Figure 5d).
If we use, e.g., a sample of 15 farmers per group (Method XIX), the interpretation of 38 out of 57 results would remain unclear. With all these unclear results, the sample size would have to be increased after the inspection—which is more complicated than planning for a bigger sample from the beginning.
It becomes obvious from Figure 9 that the impact of a decreasing on sample size and on the number of unclear cases is much stronger than the impact of an increasing . This is also confirmed through a regression analysis, where we get a steep and almost linear power function for unclear cases vs. , but a less steep power function for unclear cases vs. . From upwards, the results remain the same (Figure 10).
We therefore suggest to use and . This is depicted as Method XI in Figure 9 and yields the following equation:
Another option would be to use a slightly larger half-width, e.g. 0.125, being aware that many cases may remain in the “unclear” category, and especially when it comes to NCs of severity class 5, the sample size may have to be increased and the inspection extended to get a clearer picture. Figure 11a shows the sample sizes for both options (dotted and dashed black lines), as compared to the square root and percentage approaches. In Figure 11b we have plotted the half-width against sample size, showing that for groups up to approximately 1,000 members, the method established by the European Commission accepts very large and questionable half-widths, i.e. standard errors.
Textbox 3. Summarised and simplified explanation of Section 6: sample size determination using statistical standard methods.
Even though most group certification rules include provisions for risk-based sample selection, in real life these rules are mostly not followed, because the risks are generally unknown (with the exception of obvious risks, such as larger farms posing a higher risk than small ones, and farms on steep slopes being more prone to soil erosion than farms on flat land). Therefore, and because most group certification rules prescribe that members should be located in geographic proximity and have similar farming systems, the situation presented in Figure 3 is a rather exceptional one. If a CB faces such a situation, where a specific risk exists in one specific sub-group, which might slip through when applying random sampling, then the sampling method is applied to the rest of the group as described above, while for the “risky sub-group” one of the two following procedures is used:
a. If the risk situation is very clear, judgement sampling may lead to clear results, without the need for quantification. If, e.g., in a risk-based sample of 10 farmers there are three cases of insecticide use, while in the random sample from the rest of the group there are no similar problems, then the sub-group can be excluded, while the rest of the group remains certified.
b. The group can be stratified into two subgroups, and the sampling procedures described above are applied independently to each of the two subgroups. We should be aware, however, that a stratification, with a certification decision being taken separately for each sub-group, means that the overall sample size is increased substantially (often doubled) compared to simple random sampling.
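The cost of stratification mentioned in (b) can be illustrated with a short sketch. The group of 1,000 farmers, the split into 800 and 200, the incidence of 0.5 and the half-width of 0.10 are hypothetical, and the sample size formula is an assumed textbook form with finite population correction, not necessarily the exact equation used in this paper.

```python
import math


def sample_size(group_size, incidence=0.5, half_width=0.10, z=1.96):
    """Assumed finite-population sample size for a proportion, rounded up."""
    variance = incidence * (1 - incidence)
    n = group_size * variance / ((group_size - 1) * half_width**2 / z**2 + variance)
    return math.ceil(n)


# Hypothetical group of 1,000 farmers with a risky sub-group of 200.
unstratified = sample_size(1000)
strata = {"rest of the group": 800, "risky sub-group": 200}
per_stratum = {name: sample_size(size) for name, size in strata.items()}
stratified_total = sum(per_stratum.values())

print(unstratified, per_stratum, stratified_total)
```

In this hypothetical case the two strata together need well over one and a half times the farmers required for one simple random sample of the whole group.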
8.- Witness Audits: Sample Size and Quantification of Results
Witness audits with internal inspectors are an essential tool for assessing competence and compliance of an ICS [17, 23]. Typically, such audits are combined with farm visits (see also Table 1 and Figure 1). To streamline the assessment of the internal inspectors’ performance, we suggest using a scoring tool based on a weighted Likert scale. To oblige users to make a clear decision between positive and negative scores, we recommend a scale with four possible answers, as explained in Table 6.
The results are then summarized for all witnessed internal inspectors. If the total score for all witness audits is below a certain threshold (we suggest 70% of the maximum possible score), the ICS is considered to be not functional. If it is between 70 and 100%, corrective actions should be implemented (Table 7).
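The scoring logic can be sketched in Python as follows. This is a minimal illustration only: the coding of the four answers as 0 to 3, the example weights and the function names are assumptions, since the actual questions and weights are those defined in Table 6. Only the 70% threshold is taken from the text.

```python
from typing import Sequence

MAX_ANSWER = 3      # four possible answers, coded 0..3 (coding is an assumption)
THRESHOLD = 0.70    # suggested minimum share of the maximum possible score


def total_score_share(audits: Sequence[Sequence[int]], weights: Sequence[float]) -> float:
    """Weighted total score over all witness audits, as a share (0..1)
    of the maximum possible score."""
    if not audits or sum(weights) <= 0:
        raise ValueError("need at least one audit and positive weights")
    achieved = 0.0
    for answers in audits:
        if len(answers) != len(weights):
            raise ValueError("each audit needs one answer per weighted question")
        if any(a < 0 or a > MAX_ANSWER for a in answers):
            raise ValueError("answers must lie between 0 and MAX_ANSWER")
        achieved += sum(a * w for a, w in zip(answers, weights))
    maximum = len(audits) * MAX_ANSWER * sum(weights)
    return achieved / maximum


def ics_meets_threshold(audits: Sequence[Sequence[int]], weights: Sequence[float]) -> bool:
    """True if the total score reaches the suggested 70% threshold."""
    return total_score_share(audits, weights) >= THRESHOLD


if __name__ == "__main__":
    weights = [3, 2, 1]               # hypothetical weights for three questions
    audits = [[3, 2, 3], [2, 2, 1]]   # hypothetical answers of two internal inspectors
    share = total_score_share(audits, weights)
    print(f"share of maximum score: {share:.0%}, threshold met: {ics_meets_threshold(audits, weights)}")
```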
Small producer groups often have only one or two internal inspectors. In these cases the question of sampling does not arise. For groups with more internal inspectors, based on  we propose the following method for determining the sample of internal inspectors to be witnessed, out of a total of N internal inspectors (again, readers not interested in the statistical details can jump to Figure 12):
While in Equation we deal with a binomial distribution (farmers comply or do not comply with a specific requirement), here we are assuming an approximate normal distribution with unknown variance. Therefore, as opposed to Equation , the variance of scores enters Equation (in place of the variance in Equation ). Figure 12 shows the results of this equation, for and five variances.
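As a rough illustration of this kind of calculation, the Python sketch below uses the textbook sample size formula for estimating a mean under simple random sampling without replacement, with finite population correction. It is not necessarily identical to the equation used for Figure 12: the normal quantile `z` (instead of a t quantile), the tolerated half-width `half_width` of the confidence interval and the function name are assumptions made for the example.

```python
import math


def sample_size_for_mean(n_total: int, variance: float, half_width: float, z: float = 1.96) -> int:
    """Number of internal inspectors to witness, out of n_total, so that the
    mean score is estimated within +/- half_width (normal approximation,
    finite population correction)."""
    if n_total < 1 or variance < 0 or half_width <= 0:
        raise ValueError("invalid input")
    if variance == 0:
        return 1  # no variability: a single witness audit is informative
    n = 1.0 / (half_width ** 2 / (z ** 2 * variance) + 1.0 / n_total)
    return min(n_total, math.ceil(n))


if __name__ == "__main__":
    for n_total in (5, 10, 20, 50):
        n = sample_size_for_mean(n_total, variance=0.15, half_width=0.25)
        print(f"N={n_total:3d} internal inspectors -> witness n={n}")
```

Because of the finite population correction, the sample never exceeds the number of internal inspectors and grows only slowly for larger groups.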
From the CERES database, we evaluated the witness audit results from 18 producer groups from eight different countries, with a total of 72 internal inspectors. CERES has been working with a Likert scale with only three possible answers (Yes / Partly / No), but this should not substantially bias the variability of results, as compared to a scale with four answers. The within-group variance for the performance of internal inspectors ranged from 0 to 0.34. For estimating the pooled variance across groups, we used :
which yielded for our case (orange line in Figure 12). To be on the safe side, we suggest using a variance of 0.15 (black dotted line in Figure 12). Here, it is assumed that the underlying true variance is constant. If the performance of internal inspectors is more variable, larger samples must be used accordingly. According to our data, the variance tends to increase with lower score means. By way of analogy with the binomial distribution, and taking into account the fact that scores are integer values with a fixed lower and upper bound, it may be assumed that the variance drops to zero when the score mean attains the minimum or maximum value and follows a quadratic function of the mean in between. This model may be used to estimate a variance function for which could then be used in Equation with a prior estimate of the mean. Our estimate of the variance function based on the evaluation of the scores from 18 producer groups is
In the absence of such an estimate, the worst-case scenario may be considered by plugging in the midpoint between the minimum and maximum score. Details are described in the Appendix. For the sake of simplicity we will assume a constant variance here.
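The pooled within-group variance mentioned above can be recomputed from the group-level summary data listed in the Appendix (mean score, variance and number of internal inspectors per group). The Python sketch below assumes the usual degrees-of-freedom-weighted pooled variance, i.e. the sum of (n_i - 1) times the group variance, divided by the sum of (n_i - 1). Whether this matches the estimator used above in every detail is an assumption.

```python
# (mean score, within-group variance, number of internal inspectors) per group,
# copied from the data listing in the Appendix.
GROUPS = [
    (3.00, 0.000, 3), (3.00, 0.000, 2), (3.00, 0.000, 3), (3.00, 0.000, 3),
    (2.94, 0.006, 2), (2.92, 0.003, 4), (2.90, 0.030, 3), (2.90, 0.025, 3),
    (2.87, 0.053, 3), (2.83, 0.018, 9), (2.83, 0.012, 4), (2.78, 0.071, 7),
    (2.70, 0.043, 7), (2.65, 0.245, 2), (2.57, 0.263, 3), (2.57, 0.013, 3),
    (2.40, 0.000, 3), (2.03, 0.338, 8),
]


def pooled_variance(groups):
    """Degrees-of-freedom-weighted pooled variance (assumed estimator)."""
    numerator = sum((n - 1) * var for _, var, n in groups)
    denominator = sum(n - 1 for _, _, n in groups)
    if denominator <= 0:
        raise ValueError("need at least one group with more than one inspector")
    return numerator / denominator


if __name__ == "__main__":
    n_inspectors = sum(n for _, _, n in GROUPS)
    print(f"groups: {len(GROUPS)}, internal inspectors: {n_inspectors}")
    print(f"pooled variance: {pooled_variance(GROUPS):.3f}")
```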
The suggested threshold of a minimum score of 70% is a political, normative proposal, and other choices are possible, of course. If the result is close to this threshold (see Table 7), the results should be assessed in combination with the results of the other inspection levels (Table 1, Figure 1). This can be done e.g. using the traffic-light system described in Table 8.
a. Experts agree that many CBs lack the ability to address NCs in producer groups at a systemic level. Our procedure for defining the systemic condition of NCs at farm level, depending on the incidence and severity of each NC, offers a tool for solving this problem. The method should be tested in practice, and the variables adjusted as necessary.
b. Sample selection should be random, not risk-oriented. If a combination of random and risk-oriented sampling is used, the group must be stratified, which leads to a larger sample size.
c. Neither a square root rule nor a 5% sampling rule is in line with the basic principles of sample size determination in scientific surveys. Especially for smaller groups, there is a high risk of cases slipping through with these methods. We suggest using Equation for sample size determination. If a larger (and thus a smaller sample) is used, instead of 0.1 as in Equation , the CB must be ready to increase the sample if NCs of severity class 5 come up for which it is not clear whether they are systemic or not.
d. As with farm inspection results, the results from witness audits with internal inspectors can be quantified and summarised in a meaningful way.
e. The combination of the results from farm inspections, witness audits, ICS office and buying system assessment allows for differentiated certification decisions.
f. As a general rule, what matters most for assessing the functioning of an ICS is not a large sample size, but the personal integrity of inspectors, the organisational integrity of CBs, inspector competence, inspection procedures (e.g. witness audits with internal inspectors, testing for residues where appropriate), asking the right questions to the right persons, cross-checking the right documents, and conducting inspections at the right time of the year.
References and Notes
[1] Certifying Operations with Multiple Production Units, Sites, and Facilities under the National Organic Program by the National Organic Standards Board (NOSB) to the National Organic Program (NOP). National Organic Standards Board (NOSB); 2008. Available from: https://www.ams.usda.gov/sites/default/files/media/NOP%20Final%20Rec%20Certifying%20Operations%20with%20Multiple%20Sites.pdf.
[2] Guidelines on Imports of Organic Products into the European Union. 2008 December 15; Available from: https://ec.europa.eu/info/sites/info/files/food-farming-fisheries/farming/documents/guidelines-imports-organic-products_en.pdf.
[3] Commission Delegated Regulation (EU) 2021/771 of 21 January 2021 supplementing Regulation (EU) 2018/848 of the European Parliament and of the Council by Laying Down Specific Criteria and Conditions for the Checks of Documentary Accounts in the Framework of Official Controls in Organic Production and the Official Controls of Groups of Operators. Official Journal of the European Union. 2021 May 11; Available from: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32021R0771&qid=1622800043720.
[4] Tayleur C, Balmford A, Buchanan GM, Butchart SHM, Walker CC, Ducharme H, et al. Where are Commodity Crops Certified, and What Does it Mean for Conservation and Poverty Alleviation? Biological Conservation. 2018;217:36–46. doi:10.1016/j.biocon.2017.09.024.
[5] Kleemann L, Abdulai A, Buss M. Certification and Access to Export Markets: Adoption and Return on Investment of Organic-Certified Pineapple Farming in Ghana. World Development. 2014;64:79–92. doi:10.1016/j.worlddev.2014.05.005.
[6] Oelofse M, Høgh-Jensen H, Abreu LS, Almeida GF, Hui QY, Sultan T, et al. Certified Organic Agriculture in China and Brazil: Market Accessibility and Outcomes Following Adoption. Ecological Economics. 2010;69(9):1785–1793. doi:10.1016/j.ecolecon.2010.04.016.
[7] Handschuch C, Wollni M, Villalobos P. Adoption of Food Safety and Quality Standards among Chilean Raspberry Producers – Do Smallholders Benefit? Food Policy. 2013;40:64–73. doi:10.1016/j.foodpol.2013.02.002.
[8] Jouzi Z, Azadi H, Taheri F, Zarafshani K, Gebrehiwot K, Passel SV, et al. Organic Farming and Small-Scale Farmers: Main Opportunities and Challenges. Ecological Economics. 2017;132:144–154. doi:10.1016/j.ecolecon.2016.10.016.
[9] Rueda X, Lambin EF. Linking Globalization to Local Land Uses: How Eco-Consumers and Gourmands are Changing the Colombian Coffee Landscapes. World Development. 2013;41:286–301. doi:10.1016/j.worlddev.2012.05.018.
[10] Bolwig S, Gibbon P, Jones S. The Economics of Smallholder Organic Contract Farming in Tropical Africa. World Development. 2009;37(6):1094–1104. doi:10.1016/j.worlddev.2008.09.012.
[11] Subervie J, Vagneron I. A Drop of Water in the Indian Ocean? The Impact of GlobalGap Certification on Lychee Farmers in Madagascar. World Development. 2013;50:57–73. doi:10.1016/j.worlddev.2013.05.002.
[12] Innocenti ED, Oosterveer P. Opportunities and Bottlenecks for Upstream Learning within RSPO Certified Palm Oil Value Chains: A Comparative Analysis between Indonesia and Thailand. Journal of Rural Studies. 2020;78:426–437. doi:10.1016/j.jrurstud.2020.07.004.
[13] Akoyi KT, Mitiku F, Maertens M. Private Sustainability Standards and Child Schooling in the African Coffee Sector. Journal of Cleaner Production. 2020;264:121713. doi:10.1016/j.jclepro.2020.121713.
[14] Hajjar R, Newton P, Adshead D, Bogaerts M, Maguire-Rajpaul VA, Pinto LFG, et al. Scaling up Sustainability in Commodity Agriculture: Transferability of Governance Mechanisms across the Coffee and Cattle Sectors in Brazil. Journal of Cleaner Production. 2019;206:124–132. doi:10.1016/j.jclepro.2018.09.102.
[15] Blanc J, Kledal PR. The Brazilian Organic Food Sector: Prospects and Constraints of Facilitating the Inclusion of Smallholders. Journal of Rural Studies. 2012;28(1):142–154. doi:10.1016/j.jrurstud.2011.10.005.
[16] Verburg R, Rahn E, Verweij P, Kuijk MV, Ghazoul J. An Innovation Perspective to Climate Change Adaptation in Coffee Systems. Environmental Science & Policy. 2019;97:16–24. doi:10.1016/j.envsci.2019.03.017.
[17] Meinshausen F, Richter T, Blockeel J, Huber B. Group Certification. Internal Control Systems in Organic Agriculture: Significance, Opportunities and Challenges; 2019. Available from: https://orgprints.org/id/eprint/35159/7/fibl-2019-ics.pdf.
[18] General Regulations, Part I – General Requirements. English Version 5.2. GLOBALG.A.P; 2019. Available from: https://www.globalgap.org/.content/.galleries/documents/190201_GG_GR_Part-I_V5_2_en.pdf.
[19] Certification Program. 2020 Certification and Auditing Rules. Rainforest Alliance; 2021. Available from: https://www.rainforest-alliance.org/business/wp-content/uploads/2020/06/2020-Rainforest-Alliance-Certification-and-Auditing-Rules.pdf.
[20] Regulation (EU) 2018/848 of the European Parliament and of the Council of 30 May 2018 on organic production and labelling of organic products and repealing Council Regulation (EC) No 834/2007. 2021 January 21; Available from: https://eur-lex.europa.eu/eli/reg/2018/848/oj/eng.
[21] Commission Delegated Regulation (EU) 2021/715 of 20 January 2021 amending Regulation (EU) 2018/848 of the European Parliament and of the Council as regards the requirements for groups of operators. 2018 November 14; Available from: https://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX%3A32021R0715&qid=1622799975710.
[22] EU Regulatory changes and its effect on International Trade. Presentation during BIOFACH / VIVANESS 2021 eSPECIAL. European Commission; 2021.
[23] Smallholder Group Certification Training Curriculum on the Evaluation of Internal Control Systems. A Training Course for Organic Inspectors and Certification Personnel. The International Federation of Organic Agriculture Movements (IFOAM); 2004. Available from: https://archive.ifoam.bio/sites/default/files/ics_manual_inspector_en.pdf.
[24] Sampling Methodologies. Office of the Comptroller of the Currency (OCC); 2020. Available from: https://www.occ.gov/publications-and-resources/publications/comptrollers-handbook/files/sampling-methodologies/pub-ch-sampling-methodologies.pdf.
[25] Agresti A. Categorical Data Analysis. John Wiley & Sons; 2013.
[26] Thompson SK. Sampling. Wiley; 2002.
[27] Likert R. A Technique for the Measurement of Attitudes. vol. 140. Archives of Psychology; 1932.
[28] Allen E, Seaman C. Likert Scales and Data Analyses. Quality Progress. 2007 July; Available from: http://rube.asq.org/quality-progress/2007/07/statistics/likert-scales-and-data-analyses.html.
[29] Guidance on sampling methods for audit authorities. Programming periods 2007-2013 and 2014-2020. European Commission; 2017. Available from: https://ec.europa.eu/regional_policy/sources/docgener/informat/2014/guidance_sampling_method_en.pdf.
[30] Searle SR, Casella G, McCulloch CE. Variance Components. Wiley; 1992.
[31] Wedderburn RWM. Quasi-Likelihood Functions, Generalized Linear Models, and the Gauss-Newton Method. Biometrika. 1974;61(3):439–447. doi:10.2307/2334725.
Based on the minimum and maximum possible mean scores (0 and 3, respectively), we may assume this variance function:
$$\sigma^2 = \beta \, \mu \, (3 - \mu)$$
where $\sigma^2$ is the variance and $\mu$ the mean. This can be estimated by linear regression. The intercept is zero, and there is a single regression coefficient $\beta$ for the predictor variable $x = \mu (3 - \mu)$. Assuming approximate normality of the score means, we have for the sample variance :
This function (Equation ) for the variance estimate can be used in a quasi-likelihood approach  for fitting Equation . Here, we used the GENMOD procedure in SAS.
The estimated variance function is:
Using this function, the variance can be computed for an a priori estimate of the mean $\mu$, and this variance can then be used in an equation for determining sample size, such as Equation in the main text. If a prior value is not available, one may plug in the worst-case value $\mu = 1.5$, the midpoint between the minimum and maximum score.
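data v;
input Mean Variance n;
x=Mean*(3-Mean); /* predictor: zero at the minimum (0) and maximum (3) mean score */
datalines;
3.00 0.000 3
3.00 0.000 2
3.00 0.000 3
3.00 0.000 3
2.94 0.006 2
2.92 0.003 4
2.90 0.030 3
2.90 0.025 3
2.87 0.053 3
2.83 0.018 9
2.83 0.012 4
2.78 0.071 7
2.70 0.043 7
2.65 0.245 2
2.57 0.263 3
2.57 0.013 3
2.40 0.000 3
2.03 0.338 8
;
run;
/* regression of the within-group variance on x, without intercept */
proc glimmix data=v;
model variance=x/noint solution;
output out=v predicted=p;
run;
/* observed variances and fitted variance function, plotted against the mean */
proc sgplot data=v;
scatter y=variance x=mean;
reg y=p x=mean/degree=2;
run;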
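For readers without access to SAS, the following Python sketch fits the same no-intercept regression of the within-group variance on x = mean * (3 - mean) by ordinary least squares. Unweighted least squares is a simplification of the quasi-likelihood approach described above, so the coefficient obtained here may differ from the estimated variance function reported in this Appendix. The function names are hypothetical.

```python
# (mean score, within-group variance) per producer group, from the listing above
DATA = [
    (3.00, 0.000), (3.00, 0.000), (3.00, 0.000), (3.00, 0.000),
    (2.94, 0.006), (2.92, 0.003), (2.90, 0.030), (2.90, 0.025),
    (2.87, 0.053), (2.83, 0.018), (2.83, 0.012), (2.78, 0.071),
    (2.70, 0.043), (2.65, 0.245), (2.57, 0.263), (2.57, 0.013),
    (2.40, 0.000), (2.03, 0.338),
]

MIN_SCORE, MAX_SCORE = 0.0, 3.0


def predictor(mean: float) -> float:
    """Quadratic predictor that is zero at the minimum and maximum mean score."""
    return (mean - MIN_SCORE) * (MAX_SCORE - mean)


def fit_beta(data) -> float:
    """Least-squares slope of a regression through the origin (no intercept)."""
    sxy = sum(predictor(m) * v for m, v in data)
    sxx = sum(predictor(m) ** 2 for m, _ in data)
    return sxy / sxx


def variance_at(mean: float, beta: float) -> float:
    """Variance predicted by the fitted variance function for a given mean."""
    return beta * predictor(mean)


if __name__ == "__main__":
    beta = fit_beta(DATA)
    midpoint = (MIN_SCORE + MAX_SCORE) / 2  # worst case: largest predicted variance
    print(f"beta = {beta:.3f}")
    print(f"worst-case variance at mean {midpoint}: {variance_at(midpoint, beta):.3f}")
```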