Robust confidence intervals

In statistics, a robust confidence interval is a robust modification of the usual confidence interval, meaning that the non-robust calculations are modified so that they are not badly affected by outlying or aberrant observations in a data set.

Example

Consider a weighing process that involves 1000 individual weighings. Under practical conditions, it is easy to believe that the operator might make a mistake in procedure and so report an incorrect mass (thereby making one type of systematic error). Suppose he has 100 objects, weighs each of them once, and repeats the whole process ten times. He can then calculate a sample standard deviation for each object and look for outliers: any object with an unusually large standard deviation probably has an outlier in its data. These outliers can be removed by various non-parametric techniques.

If he repeated the process only three times, he would simply take the median of the three measurements for each object and use σ to give a confidence interval. In that case the 200 extra weighings serve only to detect and correct for operator error, and do nothing to improve the confidence interval. With more repetitions, he could use a truncated mean, discarding say the largest and smallest values and averaging the rest, and could then use a bootstrap calculation to determine a confidence interval narrower than the one calculated from σ, thereby obtaining some benefit from the large amount of extra work; a sketch of these steps appears below.
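The following Python sketch illustrates these steps under assumed conditions; the value of σ, the true masses, the injected operator error, and the 2σ outlier threshold are all hypothetical choices, not part of the example above.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    sigma = 0.5                                  # assumed balance standard deviation
    x = rng.normal(50.0, sigma, size=(100, 10))  # 100 objects, 10 weighings each
    x[3, 7] += 8 * sigma                         # inject one hypothetical operator error

    # Objects whose sample standard deviation is far above sigma probably
    # have an outlier in their data.
    suspect = x.std(axis=1, ddof=1) > 2 * sigma
    print("suspect objects:", np.flatnonzero(suspect))

    # Truncated mean: discard the largest and smallest of each object's
    # ten values and average the rest (10% trimmed from each end).
    trimmed = stats.trim_mean(x, proportiontocut=0.1, axis=1)

    # Bootstrap confidence interval for one object's truncated mean,
    # obtained by resampling its ten weighings with replacement.
    obj = x[0]
    boot = np.array([stats.trim_mean(rng.choice(obj, size=obj.size), 0.1)
                     for _ in range(10_000)])
    lo, hi = np.percentile(boot, [2.5, 97.5])
    print(f"95% bootstrap interval for object 0: ({lo:.3f}, {hi:.3f})")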

These procedures are robust against procedural errors that are not modeled by the assumption that the balance has a fixed known standard deviation σ. In practical applications, where the occasional operator error can occur or the balance can malfunction, the assumptions behind simple statistical calculations cannot be taken for granted. Before trusting confidence intervals calculated from σ for 100 objects weighed just three times each, it is necessary to test for and remove a reasonable number of outliers (testing the assumption that the operator is careful, and correcting for the fact that he is not perfect), and to test the assumption that the data really do have a normal distribution with standard deviation σ.
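One rough way to check the normality assumption, sketched here in Python with hypothetical values for σ and the true masses, is to pool the mean-centred residuals from all triplets and apply a standard normality test:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(2)
    sigma = 0.5
    x = rng.normal(50.0, sigma, size=(100, 3))   # 100 objects, 3 weighings each

    # Remove each object's mean; if the balance behaves as assumed, the
    # pooled residuals should look like draws from N(0, (2/3) sigma^2).
    resid = (x - x.mean(axis=1, keepdims=True)).ravel()

    # D'Agostino-Pearson test of normality (only a rough check, since
    # residuals within a triplet are not independent).
    print(stats.normaltest(resid))

    # Compare the residual spread with the claimed sigma; mean-centred
    # residuals carry only 2/3 of the variance, so rescale first.
    print(resid.std() / np.sqrt(2 / 3), "vs claimed sigma:", sigma)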

Computer simulation

The theoretical analysis of such an experiment is complicated, but it is easy to set up a spreadsheet that draws random numbers from a normal distribution with standard deviation σ to simulate the situation; this can be done in Microsoft Excel using =NORMINV(RAND(),0,σ), as discussed in [1], and the same techniques can be used in other spreadsheet programs such as OpenOffice.org Calc and Gnumeric.
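The same simulation can be set up outside a spreadsheet; for instance, a minimal Python equivalent (with an assumed σ and a hypothetical true mass) is:

    import numpy as np

    rng = np.random.default_rng(0)
    sigma = 0.5   # assumed standard deviation of the balance
    mass = 100.0  # hypothetical true mass

    # Each cell plays the role of =NORMINV(RAND(), mass, sigma):
    # 100 objects weighed three times each.
    weights = rng.normal(mass, sigma, size=(100, 3))
    print(weights.mean(), weights.std())  # should be near mass and sigma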

After removing obvious outliers, one could subtract the median of each triplet from its other two values and examine the distribution of the 200 resulting numbers. It should be roughly bell-shaped with mean near zero and standard deviation a little larger than σ; a simple Monte Carlo spreadsheet calculation reveals typical values for this standard deviation of around 105% to 115% of σ. Alternatively, one could subtract the mean of each triplet from its three values and examine the distribution of the 300 resulting values. Their mean is identically zero, and since mean-centred residuals carry variance (1 − 1/3)σ² = (2/3)σ², their standard deviation should be somewhat smaller, around 75% to 85% of σ (√(2/3) ≈ 0.82).
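A short Monte Carlo check of both figures, written here in Python rather than a spreadsheet (the number of replications and the value of σ are arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(0)
    sigma = 1.0
    n_objects, n_reps = 100, 2000

    # n_reps simulated experiments: 100 objects, 3 weighings each.
    x = rng.normal(0.0, sigma, size=(n_reps, n_objects, 3))

    # Median-based residuals: subtract each triplet's median from its
    # other two values, giving 200 numbers per experiment.
    med = np.median(x, axis=2, keepdims=True)
    resid_med = np.sort(x, axis=2)[:, :, [0, 2]] - med
    sd_med = resid_med.reshape(n_reps, -1).std(axis=1)

    # Mean-based residuals: subtract each triplet's mean, giving 300
    # numbers per experiment.
    resid_mean = x - x.mean(axis=2, keepdims=True)
    sd_mean = resid_mean.reshape(n_reps, -1).std(axis=1)

    print("median residuals:", sd_med.mean())   # text quotes ~105-115% of sigma
    print("mean residuals:  ", sd_mean.mean())  # text quotes ~75-85% of sigma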

References