Let's look at a case of deceiving people with data: during the Spanish-American War, the death rate in the US Navy was 9‰, while the death rate among New York residents in the same period was 16‰. Navy recruiters later used these figures to argue that it was safer to join the Navy. Do you think this conclusion is correct? Of course not. The two figures do not match at all. Sailors are all able-bodied young people, while the residents' mortality rate includes the elderly, the weak and the sick, who have a relatively high mortality rate. A proper comparison would set the Navy figure against New York residents of the same age group.
Once you see this, it is clear that 9‰ and 16‰ cannot be compared at all.
Business managers hate "false" data. The reason is self-evident: "false data" leads to wasted resources, wrong decisions, missed opportunities and so on. Below I briefly summarize several kinds of "problematic data" to help everyone develop a critical eye as soon as possible. Note that "problematic data" does not necessarily mean "false" data: sometimes the data are true but the conclusion drawn from them is false. Data is usually used to mislead people in the following ways:
First, fabricating "fake" data at will to fool customers or consumers.
Please forgive me for using the verb "fabricate".
This can be seen everywhere. For some people or organizations, the rigor of data is just empty talk. They "edit" whatever data they want; their name is the "Editorial" Committee. In such cases, be sure to ask why a few more times, and above all ask for the source of the data. Remember: "no data (source), no truth". For example, newspaper circulation is one of the hardest puzzles in the world, and I don't know the answer. All I know is:
1. The circulation figure a newspaper publishes about itself is actually its highest distribution on record; people simply tend to leave out the word "highest".
2. To set a record circulation, some newspapers hauled copies straight from the printing house to the garbage dump. This was blatant, shameless fraud, and it was later banned.
Let's check whether the number in this sentence can be right: Xiao Qiang, a salesman at the company, has 24 customers, and his proportion of non-repeated customers in April was 78% (note: proportion of non-repeated customers = number of customers with orders / total number of customers). It cannot be right: with 24 customers, no whole number of customers works out to 78% (18/24 is 75% and 19/24 is about 79%).
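This kind of check is easy to automate. The sketch below is only an illustration: the function name `is_possible` and the half-point rounding tolerance are my own assumptions, not part of any standard tool. It tests whether some whole number of customers out of the total could round to the reported percentage.

```python
def is_possible(percent, total, tolerance=0.5):
    """Return True if some whole count k out of `total` gives a percentage
    within `tolerance` points of `percent` (i.e. it could round to it)."""
    return any(abs(100 * k / total - percent) < tolerance
               for k in range(total + 1))


# Xiao Qiang has 24 customers; could 78% of them have placed orders?
claim_ok = is_possible(78, 24)      # no count out of 24 lands near 78%
neighbour_ok = is_possible(75, 24)  # 18 out of 24 is exactly 75%
print(claim_ok, neighbour_ok)
```

If the reported percentage fails this test, either the denominator or the percentage has been misreported.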
Second, the problem of directional values.
This is a hidden and deceptive technique. What is a directional value? You assume a conclusion first, then pick the people most favorable to it for your market survey or research, and finally declare that the resulting pattern or conclusion is universal. Take average salary: if I want it to look high, I survey office buildings; if I want it to look low, I survey the labor market! It is not a sophisticated trick, yet many people are very keen on it!
Market research companies and some government agencies take this method to the extreme. For example, one year a certain area said it would bring house prices down over the following half year or so. Half a year later the figures showed they had done it, yet people felt no drop in house prices. Why? It turned out to be a numbers game: at the start, the sample was the average house price in the urban area; half a year later, the average was taken after adding suburban house prices.
Most market research companies are keen on directional values. The bosses of many enterprises ask research firms to run sample surveys that fit a conclusion chosen in advance, and then use the data for advertising, public relations and deceiving consumers. Some companies' survey data are genuine (the sample is large enough and respondents were not hand-picked), but the conclusion is false, because the conclusion itself can be manufactured. For example (the case only illustrates the point; the numbers are hypothetical and should not be taken seriously), a toothpaste advertisement claims that using this brand reduces tooth decay by 23%, a figure taken from market research. Such a figure is certainly attractive, because you assume the opposite of "reduced" is merely "not reduced". But the story behind it may be this: 23% of people had fewer cavities, 40% showed no change, and 37% had more cavities (though this is unlikely).
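To see how changing the sample can pull an average down while no price actually falls, here is a small sketch. All prices and counts are invented purely for illustration, and `mean_price` is a hypothetical helper.

```python
def mean_price(groups):
    """Weighted average price over (average_price, number_of_homes) groups."""
    total_value = sum(price * homes for price, homes in groups)
    total_homes = sum(homes for _, homes in groups)
    return total_value / total_homes


# Hypothetical figures: urban prices rise from 20,000 to 21,000 per square
# metre, but the later sample also includes cheaper suburban homes.
before = mean_price([(20000, 100)])               # urban area only
after = mean_price([(21000, 100), (8000, 100)])   # urban area plus suburbs
print(before, after)
```

In this made-up case the reported average drops even though urban prices rose, simply because the sample is no longer the same.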
Look at this picture and you will understand.
Third, Tian Ji's horse racing.
Everyone has probably heard the story of Tian Ji's horse racing, and using the same trick to mislead people is common. Here is an example. At the end of 2010, a well-known B2C website held a "national grab" promotion. After the event, someone wrote on Weibo: according to the transaction data, the average daily transaction volume during the four-day promotion far exceeded the combined average daily sales of Gome, Suning and Anbaili in 2009. Taken on its own, the sentence has no problem. The mistake is that the two figures are not comparable: it is meaningless to compare your own promotional peak with other companies' routine daily sales. It is like Liu Xiang winning a championship at the Paralympic Games; so what? They are not in the same class at all.
Let's look at another set of figures: from December 20 to December 26, 2010, the box office takings of the movies If You Are the One 2 and Let the Bullets Fly were 240 million and 210 million respectively (note: If You Are the One 2 was released on December 22). Can we conclude from these two figures that "Fei 2" (If You Are the One 2) did much better at the box office than "Jean" (Let the Bullets Fly)? Looked at purely as data, the two figures are not comparable and do not match, because 12.20-12.26 was the first week for "Fei 2" but the second week for "Jean", and a typical blockbuster's box office peaks in its first week. If we look only at first-week figures, "Jean" took 290 million in total in its first week, about 70 million per day on average, while "Fei 2" took 240 million in its first five days, about 50 million per day on average. On that basis, the box office of "Jean" was much higher!
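The like-for-like comparison is just a daily average over each film's own opening days. A minimal sketch follows; note that the four-day opening for "Jean" is my assumption, chosen because it is consistent with the stated average of about 70 million per day, and `daily_average` is a hypothetical helper.

```python
def daily_average(total_millions, days):
    """Average daily box office in millions over an opening period."""
    return total_millions / days


# "Fei 2": 240 million over its first five days (figures from the text).
fei2_opening = daily_average(240, 5)
# "Jean": 290 million in its first week; four days is an ASSUMPTION.
jean_opening = daily_average(290, 4)
print(fei2_opening, jean_opening)
```

Comparing opening days with opening days reverses the impression given by the raw weekly totals.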
Tian Ji's horse racing is really a matter of selecting data to fit a conclusion. Whether data are comparable is something we need to watch at all times. It is very easy to get wrong, and a comparison that looks reasonable may in fact be quite unreasonable.
Fourth, systematic errors in data analysis.
Data analysis is sometimes distorted by human factors, and sometimes by systematic errors. For example, suppose the personnel department wants to find out what people in the company think of the new general manager, with five options: like very much, like, no feeling, dislike, dislike very much. The vote is anonymous. The results are: like very much 25%, like 40%, no feeling 20%, dislike 10%, dislike very much 5%. Since the vote is anonymous, you may think these data are fine (assuming nobody is flattering the boss).
My answer is: not necessarily. There may well be many employees who did not vote at all. Some may not have known about the survey or were too busy, but many of the abstainers are probably people who dislike him: they did not want to reveal their true views, so they deliberately abstained. Think of abstentions at the United Nations General Assembly. In addition, if you reverse the order of the five options to: dislike very much, dislike, no feeling, like, like very much, and ask the same voters to vote again, the result may well be different!
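To get a feel for how much the abstainers could matter, one can bound the true favorable share by assuming the non-voters are either all unfavorable or all favorable. In the sketch below, the 60% response rate is a made-up number (the example does not say how many people voted), and `favorable_share_bounds` is a hypothetical helper.

```python
def favorable_share_bounds(favorable_among_voters, response_rate):
    """Lowest and highest possible favorable share among ALL employees,
    given the share among voters and the fraction who voted."""
    voted_favorable = favorable_among_voters * response_rate
    low = voted_favorable                          # every abstainer dislikes him
    high = voted_favorable + (1 - response_rate)   # every abstainer likes him
    return low, high


# 25% + 40% = 65% of voters chose a favorable option.
# ASSUMPTION: only 60% of employees voted.
low, high = favorable_share_bounds(0.65, 0.60)
print(round(low, 2), round(high, 2))
```

Under that assumption, the "65% favorable" headline could correspond to well under half of all employees.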