Can we measure human–animal interactions in on-farm animal welfare assessment? Some unresolved issues: Reliability
On-farm assessment of animals’ fear of people requires reliable and valid means of measuring the nature of the relationship between farm animals and people, especially how fearful the animals are of people. Reliability usually refers to the repeatability of the measure: if we take the measure more than once, how similar are the results from one time to another? This generally has two components, which can be referred to as inter-observer reliability (the chance that two different observers will produce the same results) and test–retest reliability (the chance that the same results will be obtained if the test is repeated). In the context of on-farm welfare assessment, each farm will ideally be visited only once, and a number of observers should be usable more or less interchangeably. Reliability can be reduced by differences between observers in how they score the observed behaviour, by changes over time in how a single observer scores the behaviour, and by changes in how the animals respond from one test to another.

The reliability of behavioural tests can also be reduced if it is difficult to actually measure the distance between the animals and the people. For example, in “laboratory” settings with video equipment and well-defined landmarks, it is possible to obtain accurate measurements of the distance between an animal and a person, with little risk of reduced reliability due to difficulty in taking the measure. Under “field” conditions, e.g. on farms, such equipment may not be available and subjective judgements may be relied upon more heavily. The fact that a measure of behaviour can be taken under laboratory conditions does not necessarily mean that it can be measured reliably on-farm. In general, the reliability of measures of animals’ fear of people did not seem to be a major concern for most researchers until recently.
For example, in a random sample of 30 articles (published before 2003) that attempted to measure human–animal relationships, we found only five that explicitly assessed and reported on the repeatability of the tests used.

To demonstrate that “distance” measures can vary quite markedly over time, or with minor changes in the test situation, Table 1 shows some of our own unpublished data in which we repeatedly measured the approach distance of dairy calves at different ages in their home enclosure (1.8 m × 2.0 m), retesting them with two different people on the same day, and with the same person on two consecutive days. We found low correlations in most cases and at most ages, indicating that the repeatability of this test was quite low even though we went to great trouble to retest in very similar situations. These admittedly limited data suggest that many of the distance measures in use may not be very repeatable.

Recently, two studies have rigorously evaluated the repeatability of measures of human–animal relationships and claim that the measures were repeatable (e.g. Lensink et al., 2003; Rousing and Waiblinger, 2004). However, a closer inspection of the data leads us to a less sanguine view. Lensink et al. (2003) measured milk-fed veal calves’ fear of people by approaching the calves while milk was being supplied to them. The calves’ reaction to the arrival of an unknown person was scored on a two-point scale (withdrawal or not), and their response to an attempt to touch them on the head was scored on a four-point scale (1 = no withdrawal, 4 = strong withdrawal). Test–retest repeatability was assessed by repeating the test after a two-day interval. At the arrival of the unknown person, 84% of calves were scored the same way on both occasions. This seems encouraging, but it also shows that even with a simple two-point scale, 16% of the calves risk being misclassified by a single test.
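Raw percent agreement can also flatter a two-point scale, because two scorings will agree a certain proportion of the time by chance alone. As a rough illustration, Cohen’s kappa can be computed for a hypothetical test–retest table that reproduces the 84% agreement reported by Lensink et al. (2003); the marginal split between withdrawers and non-withdrawers is our assumption, not the published data:

```python
# Hypothetical 2x2 test-retest table for a two-point scale (withdraw / no withdraw).
# Only the 84% overall agreement comes from Lensink et al. (2003); the marginal
# split (78% "no withdrawal") is an illustrative assumption.
table = [[70, 8],   # day 1 no withdrawal -> day 2: no withdrawal / withdrawal
         [8, 14]]   # day 1 withdrawal    -> day 2: no withdrawal / withdrawal

n = sum(sum(row) for row in table)
observed = sum(table[i][i] for i in range(2)) / n          # raw agreement p_o

# Chance agreement p_e from the marginal proportions
row_marg = [sum(row) / n for row in table]
col_marg = [sum(table[i][j] for i in range(2)) / n for j in range(2)]
expected = sum(r * c for r, c in zip(row_marg, col_marg))

kappa = (observed - expected) / (1 - expected)
print(f"raw agreement = {observed:.2f}, chance-corrected kappa = {kappa:.2f}")
```

With these assumed marginals, the chance-corrected agreement is only moderate (kappa ≈ 0.53), reinforcing the point that 84% raw agreement is less reassuring than it first appears.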
During the attempted touching, the scores on the two days were significantly correlated, but the coefficient was only moderate (r = 0.62), indicating that less than 40% of the variance between calves on one day was related to the variance on the other day. The results presented show a number of discrepancies. For example, the calves that were judged most fearful on day 1 had only a 57% chance of being judged most fearful on day 2, and a 7% chance of being judged least fearful. The authors claim that this indicates that the measure is “repeatable”, but the question remains as to just how repeatable a measure must be for use in on-farm assessment.

Rousing and Waiblinger (2004) tested both the inter-observer reliability and the test–retest reliability of two methods of scoring dairy cows’ fear of people: an “approach test”, in which the latency of cows to approach within a defined distance of a stationary person was measured, and an “avoidance” test (which for consistency we will call a flight distance test), in which a person approached the cow and the distance at which the cow moved off was measured. Inter-observer repeatability of the approach test was high (0.97), indicating that there were few differences between the observers. Inter-observer repeatability of the flight distance test was also significant, with weighted kappa values ranging from 0.85 to 0.90, which the authors considered high. Test–retest values were less encouraging, however. Measures of concordance were significant, indicating that the two days were probably not unrelated, but considerable changes did occur. On the flight distance test, only 52% of the animals were classified the same way on the two days, while 13% of the animals differed by two categories or more (on a 5-point scale).
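The gap between a respectable concordance statistic and frequent category changes can be made concrete. The sketch below computes a linear-weighted Cohen’s kappa for a hypothetical 5 × 5 test–retest table built to match the published summary figures (52% of animals in the same category, 13% shifted by two or more categories); the individual cell counts are our own illustrative assumption, not the published data:

```python
# Hypothetical 5x5 test-retest table (rows: day 1 category, cols: day 2 category)
# constructed so that, as in Rousing and Waiblinger (2004), only 52% of animals
# keep their category and 13% shift by two or more categories. The cell counts
# themselves are illustrative assumptions.
table = [
    [11,  6,  2,  0,  1],
    [ 4, 10,  4,  2,  0],
    [ 2,  4,  9,  4,  1],
    [ 0,  2,  4, 10,  4],
    [ 1,  0,  2,  5, 12],
]
k = len(table)
n = sum(sum(row) for row in table)

exact = sum(table[i][i] for i in range(k)) / n
far_off = sum(table[i][j] for i in range(k) for j in range(k) if abs(i - j) >= 2) / n

# Linear-weighted Cohen's kappa: weight 1 on the diagonal, falling off
# linearly with the distance between categories.
def weight(i, j):
    return 1 - abs(i - j) / (k - 1)

row_m = [sum(row) / n for row in table]
col_m = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

po = sum(weight(i, j) * table[i][j] / n for i in range(k) for j in range(k))
pe = sum(weight(i, j) * row_m[i] * col_m[j] for i in range(k) for j in range(k))
kappa_w = (po - pe) / (1 - pe)

print(f"same category: {exact:.0%}, off by >=2: {far_off:.0%}, "
      f"weighted kappa = {kappa_w:.2f}")
```

Even with 48% of the animals changing category, the weighted kappa for this table comes out near 0.59, i.e. moderate and readily significant, which is exactly why a kappa value alone is a weak guarantee against misclassification.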
Some of these changes were quite large: for example, 15% of the cows that were scored as most fearful on day 1 (in that they withdrew at a distance of more than 2 m) were classified among the least fearful on day 2 (in that they let the person approach to within touching distance).

One of the largest difficulties in assessing tests of reliability arises from the lack of clear criteria for deciding when reliability is adequate. Finding statistically significant correlations between two measures simply shows that the two measures are unlikely to be completely unrelated. We must not forget that even with a correlation coefficient of 0.7, less than 50% of the variance in the scores is common to the two tests. Significant and moderate kappa values can be obtained even where a substantial percentage of the animals are misclassified. No tests have yet been published examining the repeatability of measures based on farm averages. However, from the data that have been presented on test–retest analyses of the scores of individual animals, we conclude that a number of farms risk being misclassified if scored in a single visit. In situations where the results of a test may affect the livelihood of a farmer (e.g. welfare audits by food retailers), it is essential to have very high reliability, not merely statistical significance.
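To put the correlation coefficients discussed above (0.62 and 0.7) in perspective, the simulation below draws test–retest scores from a bivariate normal model with those correlations and asks how often an animal in the "most fearful" quartile on day 1 is still in that quartile on day 2. The normality assumption and the quartile cut-off are ours, purely for illustration:

```python
import random

# Sketch of how much a "moderate" test-retest correlation actually constrains
# repeat scores. The correlations (0.62 and 0.7) come from the text; the
# bivariate-normal model and the quartile cut are illustrative assumptions.
random.seed(1)
n = 100_000

for r in (0.62, 0.7):
    day1 = [random.gauss(0, 1) for _ in range(n)]
    # day 2 score correlates with day 1 at exactly r under this model
    day2 = [r * x + (1 - r**2) ** 0.5 * random.gauss(0, 1) for x in day1]

    cut1 = sorted(day1)[int(0.75 * n)]   # top-quartile ("most fearful") cut, day 1
    cut2 = sorted(day2)[int(0.75 * n)]
    top1 = [i for i, x in enumerate(day1) if x >= cut1]
    stay = sum(1 for i in top1 if day2[i] >= cut2) / len(top1)

    print(f"r = {r}: shared variance = {r**2:.0%}, "
          f"top quartile on day 1 still top quartile on day 2: {stay:.0%}")
```

Under this model, an animal flagged as most fearful on one day has only somewhat better than even odds of being flagged again on retest, which is consistent in magnitude with the discrepancies reported above.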