Practitioners of Big Data have not only transformed their thinking but also adopted a "Big Data" approach to data processing: analyze the whole rather than samples, do not pursue precision, and know "what" without knowing "why" (note: the third point is the author's generalization; the original meaning is that it is enough to know "what" without needing to know "why", i.e., to ask only about correlation, not causation). At the same time, traditional sampling methods have been declared outdated and unable to meet the requirements of today's Internet information society.
These assertions are rather arbitrary. If their purpose is to emphasize that, in the face of the information explosion, people must constantly look for new ways to analyze and process data, including "big data methods", then some exaggeration is understandable and acceptable. If, however, the purpose is to persuade people to abandon traditional sampling theory and convert wholesale to "big data thinking", then the assertions are questionable.
Throughout the history of science and technology, people have studied the laws of motion of objects, and Newton's laws were once considered absolutely correct. But as scientists turned to the world of microscopic particles and to objects moving at speeds approaching that of light, Newton's laws no longer applied and were replaced by quantum mechanics and relativity. This did not mean the death of Newton's laws, however; they still govern the physical world in which people live.
The same holds for the information society. The continuous expansion, change, and growing complexity of information have left traditional sampling methods looking overwhelmed, and so-called "big data thinking" has emerged. But whether "big data" will replace traditional methods or merely supplement them remains to be seen.
Question:
Three questions can be raised about the three shifts in "big data thinking". First, if the essential nature of things can be obtained by analyzing a small amount of sample data, is it necessary to bear the cost of collecting all the data? Second, if accurate data can be obtained, is it still necessary to deliberately pursue inaccuracy? Finally, if causality can be learned, would one turn a blind eye to it and analyze only correlation?
A more reasonable reading is this: first, if the nature of a thing cannot be obtained by analyzing a small sample, one has to spend more to collect and analyze the whole data set. Second, if accurate data are not available, one has to accept less accurate, lower-quality data for analysis. Finally, if causality cannot be established, people fall back on analyzing correlations to understand things.
On this reading, big data methods should not be pursued for their own sake but adopted as a last resort. In other words, big data methods have their place only where traditional sampling methods do not work, much as relativity replaces Newton's laws only when objects move close to the speed of light.
Of course, there is no denying that in the rapidly evolving cyberspace the object of study, the data, is becoming ever larger, more cumbersome, fuzzier, and less structured, and that this megatrend has made people receptive to big data thinking. To take a not entirely apt example, when people cannot explain many natural phenomena, they are more likely to accept some kind of religious explanation.
In today's information explosion, not only should traditional sampling statistical methods not be abandoned, but they should be strengthened through a series of improvements to become one of the main means of reflecting the state of things in an efficient and real-time manner. At the same time, we welcome and are willing to adopt new methods, such as the growing "big data methods" and the possible "fuzzy data methods" and so on.
A key question arises: how does one decide whether to use traditional methods or big data methods for a specific problem? When physicists study the forces between microscopic particles, they use quantum mechanics; when they study the forces on a bridge, they use Newtonian mechanics. Do information or data experts have comparable theories or criteria for discrimination? This is discussed in the next subsection of this paper.
Analysis:
First, the rules for selecting sample sizes in a general sense are examined.
Theorem: Let X1, X2, ..., Xn be independent and identically distributed random variables with distribution p(x), x ∈ {x1, x2, ..., xn}. Then the general sample size S is

S = λ · 2^H(X)    (1)

where λ is a constant and H(X) = -∑ p(xi) · log2 p(xi) is the entropy of the random variable X.
Example 1: Suppose we want to survey a population of N individuals on a question with two possible answers, yes or no, and the two answers are roughly equally frequent, so the entropy is about 1 bit. Under given confidence-level and confidence-interval requirements (this paper gives only qualitative illustrations rather than exact sampling derivations; likewise below), as N increases (say, to 100,000), S gradually converges to a constant of about 400; in this case λ = 200. It can be shown that, with the other conditions unchanged, S grows exponentially as the entropy increases while λ remains constant.
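As a rough check on the figure of 400, here is a minimal sketch using the standard Cochran sample-size formula with finite population correction; the 95% confidence level and 5% margin of error are my assumptions, not stated in the text, but under them the required sample size levels off near 384 ≈ 400 as N grows:

import math

def sample_size(N, p=0.5, z=1.96, e=0.05):
    """Cochran sample size with finite population correction.
    p: assumed proportion (0.5, i.e. entropy ~ 1 bit, maximizes variance)
    z: z-score for the confidence level (1.96 ~ 95%, assumed)
    e: margin of error (5%, assumed)"""
    n0 = (z ** 2) * p * (1 - p) / (e ** 2)        # infinite-population size (~384)
    return math.ceil(n0 / (1 + (n0 - 1) / N))     # finite population correction

for N in (1_000, 10_000, 100_000, 1_000_000):
    print(N, sample_size(N))   # converges to about 384, i.e. roughly 400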
Interpreting λ in a different way.
Definition 1: λ is the expected value of a "typical state" in a sample.
Definition 2: A typical state is a state whose probability of occurrence is equal, or approximately equal, to the probability of each state in a uniform distribution with the same entropy, i.e., 2^(-H(X)).
For example, if X obeys a uniform distribution over 8 states, its entropy is 3 bits and each of its states is a "typical state" with probability of occurrence 1/8.
If X obeys a 12-state distribution with probabilities p(x1, x2, x3, x4, x5, ..., x12) = (1/3, 1/5, 1/6, 1/7, 1/8, 1/15, ..., 1/50) and H(X) ≈ 3 bits, then its typical state is x5, with probability of occurrence 1/8.
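As an illustration, a minimal Python sketch that computes the entropy of a probability vector and picks out the state whose probability is closest to 2^(-H(X)); the distribution used here is the 8-state uniform example above, since the 12-state distribution is only partially specified in the text:

import math

def entropy_bits(p):
    """Shannon entropy in bits of a probability vector p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def typical_state(p):
    """Index of the state whose probability is closest to 2^(-H(X)),
    i.e. the 'typical state' in the sense of Definition 2."""
    h = entropy_bits(p)
    target = 2 ** (-h)
    idx = min(range(len(p)), key=lambda i: abs(p[i] - target))
    return idx, h, target

# 8-state uniform distribution: H = 3 bits, every state is typical (prob 1/8)
p_uniform = [1 / 8] * 8
print(typical_state(p_uniform))   # (0, 3.0, 0.125): here every state qualifies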
Based on the above, if λ is taken to be 1 and H(X) = 3, then the sample size S = 8, and the expected number of times the typical state (with probability of occurrence 1/8) appears in a sample is 1, equal to λ. However, state occurrences are probabilistic; even though the expected value is 1, the observed value may be 0, 2, 3, ..., so the estimation error is too large.
If λ is taken to be 100 and H(X) = 3, then the sample size S = 800, and the expected number of occurrences of the typical state in a sample is 100, equal to λ. With high probability the observed value falls roughly between 95 and 105; if that error is acceptable, take λ = 100, otherwise increase λ.
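A minimal simulation sketch (the number of trials is an arbitrary choice of mine) of how the observed count of the typical state scatters around its expectation for λ = 1 versus λ = 100:

import random

def simulate_counts(lam, H=3, trials=2000):
    """Count occurrences of the typical state (probability 2**-H)
    in samples of size S = lam * 2**H, repeated `trials` times."""
    p, S = 2 ** (-H), lam * 2 ** H
    counts = [sum(random.random() < p for _ in range(S)) for _ in range(trials)]
    mean = sum(counts) / trials
    return S, mean, min(counts), max(counts)

print(simulate_counts(1))     # S = 8,   mean ~ 1,   observed counts scatter over 0, 2, 3, ...
print(simulate_counts(100))   # S = 800, mean ~ 100, observed counts cluster near 100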
Another factor affecting λ is stratification. Suppose the population N in Example 1 is divided into three income strata, high (20%), middle (50%), and low (30%), to survey perceptions of something. If purely random sampling is used, then to ensure an accurate estimate of the distribution within each stratum, the stratum with the fewest individuals must yield enough respondents, so λ would have to be multiplied by 5 (the reciprocal of 20%). In practice, however, people care more about the overall result, with the stratified results as a secondary concern, so to save costs the actual correction factor applied to λ is smaller, say 3, giving a sample size of about 1,200. At that point, whether the population is 100,000 people or 300 million, a survey of a sample of 1,200 can reflect the actual situation within roughly a 3% margin of error.
It can be seen from the above analysis that λ is a constant between 100 and 1,000, its exact value depending on how many individuals of the typical state (or of a stratum) the surveyor wishes to obtain, in expectation, in one sample while meeting the error requirement. Once λ is determined, the sample size is related only to the system entropy, with respect to which it grows exponentially, i.e., equation (1).
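A quick illustration of equation (1), folding the stratification correction of the previous paragraph into λ and showing how fast S grows with entropy (the specific entropy values are just sample points):

def sample_size(lam, H):
    """Equation (1): S = lambda * 2**H(X)."""
    return lam * 2 ** H

# Example 1 with stratification: lambda = 200 * 3, entropy ~ 1 bit -> S ~ 1200
print(sample_size(200 * 3, 1))          # 1200

# With lambda fixed at 200, S grows exponentially with entropy
for H in (1, 3, 5, 10, 15):
    print(H, sample_size(200, H))       # 400, 1600, 6400, ~200 thousand, ~6.6 million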
When traditional sampling methods are used, the random states and variations of the objects of study are limited, or made limited by artificial categorization, so the entropy is small and the population can be accurately estimated with a fairly small sample. Coupled with the high cost of sampling in that era, the investigating party had to spend a great deal of effort designing the sampling scheme so as to make the sample as small as possible without losing accuracy.
The situation in the Internet era is just the opposite. The object of study is behavior on the Internet, and access to data is very easy: the data have already been generated and are there whether you use them or not. Moreover, many research objects on the Internet have effectively infinitely many states and are difficult to categorize statistically (as in the "long tail" phenomenon), so the system entropy is very large, which makes the required sample enormous or impossible to determine at all. In such cases, analyzing the whole population, i.e., the big data approach, is advantageous. Of course, even when the full data already exist, organizing and computing over them is quite resource intensive, and in some cases a sampling approach is still the best option.
Now, let's try to answer the question posed at the end of the previous section: how to choose an analytical method when faced with a specific problem?
First, examine whether the data required for the object of study are already collected automatically by the application, as with users' online shopping behavior. If not, as with offline shopping, the researcher must design methods to collect the data, in which case traditional sampling methods should be used.
Secondly, for the huge volumes of data that have been (or can be) obtained from the Internet online and in real time: when the entropy of the research object is less than 5, it is recommended to still use traditional sampling, which can be more efficient; when the entropy is between 5 and 15, either full-population analysis or sampling analysis may be considered, depending on the situation; when the entropy is greater than 15, full-population analysis, i.e., the big data method, is recommended.
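A minimal sketch of this decision rule as code; the thresholds of 5 and 15 bits come from the text, while the function name and return labels are mine:

def choose_method(entropy_bits, data_auto_collected=True):
    """Rough method-selection rule from the text, keyed on entropy thresholds of 5 and 15 bits."""
    if not data_auto_collected:
        return "traditional sampling (data must be collected by the researcher)"
    if entropy_bits < 5:
        return "traditional sampling"
    if entropy_bits <= 15:
        return "either sampling or full-population (big data), case by case"
    return "full-population analysis (big data)"

print(choose_method(1))    # e.g. a yes/no survey question
print(choose_method(18))   # e.g. a long-tail object such as keyword search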
The above recommendations remain abstract. In the next subsection, we borrow the descriptive approach of the long tail theory to categorize statistical research objects into four types and discuss the applicable methods for each.
Categories:
Type 1: the "no-tail model". Here the object of study has a clear and limited number of states, and even the state with the smallest probability of occurrence is still statistically significant. Examples include voting, where the states are "for", "against", and "abstain", or support for a limited number of candidates; and ratings surveys, where the states are dozens or hundreds of television stations. The statistical results are usually described with a distribution histogram, i.e., the frequencies of the states, ordered from high to low, drawn as a histogram. Connecting the tops of the bars gives the overall probability distribution curve; arranging the cumulative frequencies in the same order and connecting their tops gives the so-called Pareto curve. Both curves behave as concave functions, i.e., with a consistently negative second derivative (borrowing continuous language for what is actually discrete), and nothing unusual happens at the tail. As the number of states increases, the "80/20 phenomenon" becomes significant, i.e., a small proportion of the states (say 20%) accounts for most of the frequency (say 80%).
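A minimal sketch of building the sorted frequency distribution and the Pareto (cumulative) curve from raw counts, then checking what share of total frequency the top 20% of states carries; the example counts are made up for illustration:

import math

# Hypothetical vote/ratings-style counts per state, for illustration only.
counts = {"A": 500, "B": 300, "C": 120, "D": 50, "E": 20, "F": 10}

total = sum(counts.values())
# Frequencies sorted from high to low: the "distribution histogram" of the text.
freqs = sorted((c / total for c in counts.values()), reverse=True)

# Pareto curve: cumulative share after including each successive state.
pareto, cum = [], 0.0
for f in freqs:
    cum += f
    pareto.append(cum)

print([round(f, 3) for f in freqs])    # [0.5, 0.3, 0.12, 0.05, 0.02, 0.01]
print([round(p, 3) for p in pareto])   # rises steeply, then flattens

# Share of the total held by the top 20% of states (2 of the 6 states here).
top = max(1, math.ceil(0.2 * len(freqs)))
print(round(sum(freqs[:top]), 2))      # 0.8 -> the "80/20" pattern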
Type 2: the "warped-tail model". Here the states of the research object are still fairly clear but more numerous, and the states with small probabilities of occurrence individually lose statistical significance; statistically, these states are lumped together into an "other" category. In the vast majority of cases, because "other" is composed of many states, its combined probability is higher than that of some of the small-probability states ranked ahead of it, so the overall probability distribution curve and the Pareto curve turn upward at the tail, hence the name "warped-tail model". To preserve the statistical effect, the total probability of the "other" category generally should not exceed 5%. In this model the 80/20 phenomenon is extremely pronounced, lending itself to "ABC analysis" and management by priority, so the warped-tail model is widely used in enterprise management, for example in quality management (defect analysis) and inventory management (spare-parts depots, stores, outlets, and especially brick-and-mortar bookstores, which can be contrasted with the long-tail phenomenon of online bookstores discussed later).
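A minimal sketch of the warped-tail treatment: sort item shares, fold the smallest items into an "other" bucket capped at about 5% of the total, and assign rough A/B/C classes by cumulative share; the item data and the 80%/95% class boundaries are illustrative assumptions:

# Hypothetical item -> sales counts; values are for illustration only.
sales = {"p1": 400, "p2": 250, "p3": 150, "p4": 80, "p5": 50,
         "p6": 30, "p7": 20, "p8": 10, "p9": 6, "p10": 4}

total = sum(sales.values())
items = sorted(sales.items(), key=lambda kv: kv[1], reverse=True)

# Fold trailing items into "other" while keeping its share at or under 5%.
other = 0
while items and (other + items[-1][1]) / total <= 0.05:
    other += items.pop()[1]

# ABC classes by cumulative share: A up to ~80%, B up to ~95%, C the rest.
cum, classes = 0, {}
for name, count in items:
    cum += count
    share = cum / total
    classes[name] = "A" if share <= 0.80 else ("B" if share <= 0.95 else "C")

print(classes)                      # p1..p3 -> A, p4..p5 -> B, p6 -> C
print(round(other / total, 3))      # combined "other" share, <= 0.05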
Both of the above models yield good statistical results with traditional sampling methods, and as the number of states increases there is no sharp boundary between them. Take the ratings survey as an example, with 30,000 sample households. When there are 20 or 30 television stations, even the station with the lowest rating obtains significant observations, and the situation can be treated as a no-tail model. When the number of stations exceeds 100, many stations with ratings below 0.3% will not reach observation counts in a single sample that guarantee adequate relative accuracy. At that point, either the sample can be enlarged to meet the accuracy requirement, or the states below 0.3% can be merged into "other" and the warped-tail model adopted.
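A rough sketch of why 0.3% is a plausible cutoff at a sample of 30,000 households: the expected count and approximate relative standard error for a station with a given rating (treating each household as an independent Bernoulli trial is my simplifying assumption):

import math

def expected_count_and_rel_error(rating, sample=30_000):
    """Expected number of viewing households and approximate relative
    standard error (binomial approximation) for a station's rating."""
    expected = sample * rating
    rel_err = math.sqrt(rating * (1 - rating) / sample) / rating
    return expected, rel_err

for r in (0.05, 0.01, 0.003, 0.001):
    e, re = expected_count_and_rel_error(r)
    print(f"rating {r:.1%}: expected ~{e:.0f} households, relative error ~{re:.0%}")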
With the progress of triple play, the vast majority of television sets will have two-way capability, and full-population data will become readily available. Even then, the sampling method remains valid: it can be used for real-time, frequent statistics, while the big data method over the full population corrects the results at regular intervals. After all, processing a few tens of thousands of samples is much faster and cheaper than processing a few hundred million records of full-population data.
Type 3: the "long-tail model". Here the states of the research object are not clear enough and are very numerous, and countless states have such small probabilities of occurrence that they individually lose statistical significance. However, the sum of all or part of these small-probability states accounts for 30%-40% of the whole, or even more. In the probability distribution or Pareto chart they form a long tail (asymptotic to the X axis, or to the line Y = 1, respectively). With a long-tail object, a sampling approach would leave 30%-40% or more of the whole uncharacterized; hence a full-population, i.e., big data, approach must be used.
For example, a physical bookstore has 1,000 titles on its shelves. Looking at the statistics, the owner will find that the 200 best-selling titles account for more than 80% of sales, while the 500 worst-selling titles account for less than 5% and can be merged into one statistical category. This is the "80/20 phenomenon", and with sampling statistics the owner can grasp the distribution of the titles accounting for 95% of sales. An online bookstore, by contrast, may have 200,000 titles in its database, of which the 200 best sellers account for 20% of sales and the top 2,000 account for 40%. The remaining 198,000 titles make up the other 60% of sales, but each individual share is so small that it cannot be observed significantly no matter how much the sample is expanded. Only big data methods can be used in such cases; otherwise, what use is a statistic saying that 60% of sales come from somewhere unknown?
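A back-of-the-envelope sketch of why the online bookstore's tail defeats sampling: the average per-title share in the tail and the number of transactions needed just to expect a handful of observations of one such title (uniformity within the tail is my simplifying assumption):

tail_titles = 198_000      # titles outside the top 2,000
tail_share = 0.60          # their combined share of sales (figures from the text)

avg_share = tail_share / tail_titles                        # average share of one tail title
print(f"average share per tail title: {avg_share:.7%}")     # ~0.0003%

# Sample size needed to expect even 10 sales of one such title.
needed = 10 / avg_share
print(f"sample needed for ~10 observations of one title: {needed:,.0f} transactions")  # ~3.3 million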
Type 4: the "full-tail model". Here the states of the research object are very unclear or even unknown in advance, and their number is extremely large or even infinite. Under normal circumstances, no matter how the sample is chosen, no individual state can attain a statistically significant observed value; if one can be observed, that itself indicates an anomaly. The distribution curve is a line infinitely close and parallel to the X axis, so this could also be called the "flat-tail" model.
A typical example is keyword search, where the states cannot be determined in advance, i.e., the system does not know beforehand what users will search for, and the possible searches may be effectively infinite, so no sampling model can be designed in advance. With a big data approach that analyzes the entire population, anomalies can be detected and analyzed as they occur. For example, a surge in searches for a certain disease or drug term in a certain region can predict that the disease may be spreading there. In fact, Google's big data analytics has already done this job more quickly and efficiently than traditional epidemic prediction mechanisms and organizations.
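A minimal sketch of the kind of full-population anomaly check described here: compare each term's current count against its own recent history and flag large deviations; the z-score threshold and the toy counts are assumptions for illustration only:

import statistics

def flag_surges(history, current, z_threshold=3.0):
    """Flag terms whose current count deviates sharply from their own history.
    history: term -> list of past period counts; current: term -> latest count."""
    surges = []
    for term, past in history.items():
        mean = statistics.mean(past)
        sd = statistics.pstdev(past) or 1.0     # avoid division by zero for flat histories
        z = (current.get(term, 0) - mean) / sd
        if z >= z_threshold:
            surges.append((term, round(z, 1)))
    return surges

history = {"flu symptoms": [120, 130, 125, 118], "cough medicine": [60, 58, 65, 62]}
current = {"flu symptoms": 340, "cough medicine": 64}
print(flag_surges(history, current))   # flags "flu symptoms" as a candidate outbreak signal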
Big data methods are thought to be best suited for making early warnings or predicting some state that people don't know about in advance, whereas sampling statistics generally arranges sampling rules based on known states.
The above four-model analysis is consistent with the entropy-based analysis of the previous section: the entropy values of the no-tail and warped-tail models are less than 6 and between 5 and 15, respectively, while those of the long-tail and full-tail models are greater than 15 and tend to infinity, respectively. The first two are mostly analyzed with traditional sampling, while the latter two can only be analyzed with big data methods. More importantly, as quantitative change gives rise to qualitative change, big data methods will bring more and newer concepts, theories, and technologies.