## Thursday, December 22, 2016

### Why polls and surveys fail

Polls and opinion surveys often predict results that never happen. Is there a scientific reason that can explain it? I think so. The problem could be that the mathematical theories behind the polls are misapplied.
A branch of statistics is called sample theory (more commonly, sampling theory). It was invented to solve the problem of estimating whether the products of a factory are well made or defective, without having to analyze them one by one, which would be too costly.
Let us say, for example, that a factory produces one million screws a day. In theory they should be checked one by one, but since that is impossible, only a part of them is analyzed. Which part? This is the question that sample theory tries to answer.
Suppose we analyze just 2000 screws, and find that one of them is defective (0.05%). Can we extend this result to the million screws and assert that in that population there will be approximately 500 defective screws?

There is a theorem of sample theory that computes the confidence we can have in the assertion that the result of the sample applies to the whole population. Interestingly, if certain conditions are met, a sample of 2000 “individuals” is enough, regardless of the population size, to have 95% confidence that the result of the analysis extends to the population within a small margin of error (about two percentage points in the worst case). In other words, if we analyze 2000 screws, we can have 95% confidence that the result will apply to the entire set of screws, regardless of whether there are one hundred thousand, one million or ten million screws.
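To see why the population size hardly matters, here is a minimal sketch in Python. It assumes the usual normal approximation for a proportion, the worst case of a 50/50 split, and simple random sampling without replacement; the function name `margin_of_error` and its parameters are my own choices for illustration, not part of any polling standard.

```python
import math

def margin_of_error(n, population=None, p=0.5, z=1.96):
    """Half-width of the normal-approximation confidence interval for a proportion.

    n          -- sample size
    population -- population size N (None means "effectively infinite")
    p          -- assumed true proportion; 0.5 is the worst case
    z          -- normal quantile; 1.96 corresponds to 95% confidence
    """
    standard_error = math.sqrt(p * (1 - p) / n)
    if population is not None:
        # Finite population correction for sampling without replacement.
        standard_error *= math.sqrt((population - n) / (population - 1))
    return z * standard_error

# Same sample of 2000, three very different population sizes.
for N in (100_000, 1_000_000, 10_000_000):
    moe = margin_of_error(2000, N)
    print(f"N = {N:>10,}: +/- {100 * moe:.2f} percentage points")
```

Under these assumptions the three margins come out at roughly 2.2 percentage points each. The sample size appears in the formula directly, while the population size enters only through a correction factor that is very close to 1 whenever the sample is a small fraction of the population.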
Electoral polls often apply the theorems of sample theory without due consideration. If we look at the technical data that come with these surveys, we will see that they often say things like these:
- Size of the population surveyed: 2000 people.
- Confidence coefficient: 95%.

But let us look again at the qualification made above: “if certain conditions are met.” What are the conditions that must be met in order to apply the theorem? Essentially there are two:

1. The population must be uniform.
2. The sample must be meaningful.

That the population is uniform means that all the screws must be equivalent in principle, and that different sets, such as large screws and small screws, are not mixed.
That the sample must be meaningful means that, before extracting the sample, we must mix the million screws well; otherwise we could take a sample formed exclusively of screws produced by one particular machine that has a problem, or by a perfect machine, while none of those produced by the other machines would be analyzed. In such a case, the results of the analysis could not be extended to the total population with the same confidence.
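The effect of not mixing the screws can be illustrated with a small simulation. This is only a sketch under invented assumptions: a hypothetical factory with 10 machines, one of which is faulty, and defect rates (2% and 0.05%) that I chose for illustration.

```python
import random

rng = random.Random(2016)  # fixed seed so the run is repeatable

# Hypothetical factory: 10 machines, 100,000 screws each.
# Machine 0 is faulty (2% defective); the others are good (0.05% defective).
screws = []  # one (machine_id, is_defective) pair per screw, in production order
for machine in range(10):
    defect_rate = 0.02 if machine == 0 else 0.0005
    for _ in range(100_000):
        screws.append((machine, rng.random() < defect_rate))

def defective_fraction(sample):
    """Fraction of the screws in `sample` that are defective."""
    return sum(is_defective for _, is_defective in sample) / len(sample)

true_rate = defective_fraction(screws)

# Well mixed: every screw has the same chance of being picked.
mixed_sample = rng.sample(screws, 2000)

# Not mixed: the first 2000 screws off the line, all from machine 0.
unmixed_sample = screws[:2000]

print(f"whole population: {100 * true_rate:.3f}% defective")
print(f"mixed sample    : {100 * defective_fraction(mixed_sample):.3f}% defective")
print(f"unmixed sample  : {100 * defective_fraction(unmixed_sample):.3f}% defective")
```

With these made-up rates, the unmixed sample should report a defect rate close to that of the faulty machine, several times the true rate for the whole population, while the mixed sample should stay near the population value.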
What happens when the theorem is applied to a human population to predict the outcome of an election?
1. The most serious problem is that the population is not uniform. We know very well that the votes of some people are worth much more than those of others. In U.S. elections, for instance, the constituency is the state. Although states with large populations, such as New York or California, elect more representatives, each candidate there requires more votes to be elected than in states with smaller populations, such as Oklahoma.
2. Whether the sample is meaningful depends on the survey being well designed. For instance, the respondents to a poll should be chosen from all states in proportion to their populations. But that means that, in a sample of 2000 people, there will be very few from Oklahoma. Can the result of the election in that state be predicted with a 95% confidence level from such a small sample? The simple and straightforward answer is that it cannot.
3. There is an additional problem: people are not screws. When a screw is analyzed, it cannot lie; we can trust that the properties we detect are real, unless we are using defective instruments to measure them. People, on the other hand, can lie, or they can refuse to say whom they are going to vote for. Pollsters take this into account, and apply corrections to estimate the possible vote of those who do not want to give their opinion. But can it be held that the degree of confidence is still the same as stated by the theorem? Again, the simple answer is no.
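The second point can be put into numbers with a rough sketch. The population figures below are approximate 2016 values that I am assuming for illustration, and the formula is the same worst-case normal approximation usually applied to a national sample; with so few respondents that approximation is itself crude, which only reinforces the argument.

```python
import math

# Approximate 2016 populations, assumed here for illustration only.
US_POPULATION = 323_000_000
OKLAHOMA_POPULATION = 3_900_000
SAMPLE_SIZE = 2000

def margin_of_error(n, p=0.5, z=1.96):
    """Worst-case 95% margin of error for a proportion (normal approximation)."""
    return z * math.sqrt(p * (1 - p) / n)

# Respondents Oklahoma would get in a sample proportional to population.
share = OKLAHOMA_POPULATION / US_POPULATION
oklahoma_respondents = round(SAMPLE_SIZE * share)

print(f"Oklahoma respondents: {oklahoma_respondents}")
print(f"national margin     : +/- {100 * margin_of_error(SAMPLE_SIZE):.1f} points")
print(f"Oklahoma margin     : +/- {100 * margin_of_error(oklahoma_respondents):.1f} points")
```

With these figures, Oklahoma gets about 24 respondents, and the margin of error for that state alone is roughly 20 percentage points, compared with about 2 for the national sample. A result of 55% against 45% in such a subsample tells us almost nothing about who will win the state.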

In conclusion: the technical data that usually accompany polls and surveys are fictitious, based on a poor application of the theorems. The confidence coefficient they provide is enormously exaggerated. Although they say that its value is 95%, it may be no greater than 50%. Is it any wonder that the surveys are frequently wrong? The odd thing is that sometimes they hit the mark.

Manuel Alfonseca
Happy Christmas and New Year
We'll meet again in January