In the Big Data Era, Statistics Is Still Useful
In an era of data "explosion", high hopes are often pinned on big data. But what kind of data counts as big data, how should big data be used, and does traditional statistics still have a place? The Tsinghua University Statistics Research Center was established recently, with the well-known statistician Liu Jun, a tenured professor at Harvard University, as its director. A few days ago, Liu Jun was a guest at the People's Daily "Cultural Forum", where he shared his thoughts.
What makes big data different from data is its massive accumulation, high growth rate and diversity
What is data? The word comes from Latin, where it means "something given", and one English definition is "a collection of facts from which conclusions may be drawn". Generally speaking, anything that is recorded on some carrier and reflects some information about nature or human society can be called data. Ancient people tied knots in rope to keep records, and those knotted ropes were data. In modern society, the types and quantity of information keep growing, and so do the carriers. Numbers are data, text is data, and images, audio and video are data too.
What is big data? The increase in volume is the first thing people notice about big data. With the development of technology, the amount of data in every field is growing rapidly. One study found that in recent years the amount of digital data has doubled roughly every three years.
Big data is also distinguished by its diversity. As a Gartner study points out, the explosion of data is three-dimensional. That is, it involves not only a rapid increase in the volume of data, but also an accelerating rate of data growth and a growing diversity of data, meaning ever more sources and types of data.
The move from data to big data is not only an accumulation of quantity but also a qualitative leap. Massive amounts of data from different sources, in different forms, and containing different information can be easily integrated and analyzed, and data that were originally isolated become interconnected. Through data analysis, people can then discover new knowledge and create new value that were hard to find in the era of small data.
Studying data to discover underlying patterns has been a constant throughout the development of human society. Many advances in the history of science are directly tied to the collection and analysis of data, the beginning of modern epidemiology among them. In 1854 a large cholera outbreak struck London, and for a long time there was no way to control it. A physician used dot maps to study the relationship between the locations of wells and the locations of cholera patients in the area, and found that cholera cases were markedly more frequent around one particular well. This pointed to the cause of the outbreak: a contaminated well. After the well was shut down, the incidence of cholera dropped significantly. The approach demonstrates the power of data.
In essence, many scientific activities are data mining: rather than starting from a predetermined theory or principle and studying a problem by deduction, they start from the data itself and summarize patterns by induction. In modern times, as the problems we face have become more and more complex, studying them by deduction has often become difficult. This has made induction from data increasingly important and has brought the importance of data into ever sharper focus.
Big data is a non-competitive resource that helps governments make scientific decisions and businesses market precisely
In the era of big data, the role of data has become even more prominent, and many countries have raised big data to the level of national strategy.
When the government makes rational use of big data to guide decision-making, its decisions will rest on empirical facts, and it will become more forward-looking, more accountable and more open. Chinese statecraft has valued data since ancient times. Shang Yang, for example, wrote: "A strong state knows thirteen numbers ... If one wishes to strengthen the state but does not know its thirteen numbers, then however favorable its land and however numerous its people, the state will grow ever weaker until it is carved up." In the era of big data, governing by the "numbers" will be more effective. In the era of small data, the government made decisions mostly on the basis of experience and partial data, so it inevitably treated the head when the head ached and the foot when the foot hurt. For example, when traffic was congested, the answer was simply to build more roads. In the era of big data, government decision-making can move from crude to fine-grained. When roads are congested, big data analysis can show at which times and on which sections congestion is most likely. The government can then build more roads near those sections, or issue early warnings that guide residents to plan their trips sensibly, so as to optimize the allocation and control of traffic flow and improve traffic conditions.
For merchants, big data makes precision marketing possible. An interesting story is the "beer and diapers" phenomenon at Walmart. When Walmart analyzed its sales data, it found that the product appearing most often on customers' receipts together with diapers was beer. Follow-up investigation showed that many young fathers would pick up some beer when buying diapers. Having found this pattern, Walmart promoted beer and diapers together, and sales rose sharply. In the era of big data, everyone provides data "spontaneously". Our behavior, such as clicking on web pages, using cell phones, swiping credit cards, watching TV, taking the subway or driving a car, generates data that is recorded, and merchants mine information such as our gender, occupation, preferences and spending power to find business opportunities.
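The kind of analysis behind the "beer and diapers" story can be illustrated with a simple co-occurrence count over shopping baskets. The following is only a minimal sketch in Python: the baskets are made up for illustration (they are not Walmart data), and the function names `pair_counts` and `top_partner` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how often each unordered pair of items shares a basket."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicate items and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


def top_partner(baskets, item):
    """Return (partner, count) for the item most often bought together with `item`."""
    partners = Counter()
    for (a, b), n in pair_counts(baskets).items():
        if a == item:
            partners[b] += n
        elif b == item:
            partners[a] += n
    return partners.most_common(1)[0] if partners else None


# Made-up baskets for illustration only.
baskets = [
    ["diapers", "beer", "milk"],
    ["diapers", "beer"],
    ["diapers", "wipes"],
    ["beer", "chips"],
    ["diapers", "beer", "chips"],
]

print(top_partner(baskets, "diapers"))  # ('beer', 3)
```

A raw count like this only shows that two products tend to appear together. Deciding whether the pattern is more than chance, and why it arises, still takes further analysis and follow-up investigation.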
Big data will also benefit individuals. Take biology and medicine. In the past, biologists could manipulate only one or a few genes at a time and observe the effect on the organism, which made it difficult to find overall correlations. Now, thanks to advances in technology, many kinds of information can be analyzed together, such as genetic information, expression information for all genes, protein profile information, genome-wide methylation information, and epigenetic information. There are also data on personal health indicators, medical history, drug reactions, and so on. If these multi-dimensional, multi-directional biological data can truly be integrated, it will be possible to describe an individual completely and thus achieve the goal of precision medicine.
In the era of big data, there are also more effective ways to verify the authenticity of data. One characteristic of big data is diversity, and data from different sources and dimensions are correlated to some degree, so they can be cross-validated. For example, suppose the industrial output value of some locality is falsely reported at double its true value, but its electricity and energy consumption do not reach the corresponding scale. This is a data anomaly that a system can easily detect. Once anomalies are found, the relevant departments can review them, making it possible to prevent and combat data fraud in a more targeted manner.
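The cross-check described above can be sketched as a comparison of each locality's reported output with its electricity use. This is a minimal illustration in Python, not an actual auditing system: the figures are invented, the function name `flag_suspicious` is hypothetical, and the cutoff `tolerance=1.5` is an arbitrary assumption.

```python
from statistics import median


def flag_suspicious(records, tolerance=1.5):
    """Flag localities whose output per unit of electricity is far above the typical ratio.

    records: dict mapping locality -> (reported_output, electricity_use)
    tolerance: assumed cutoff; a ratio above tolerance * median ratio is flagged
    """
    ratios = {name: output / power for name, (output, power) in records.items()}
    typical = median(ratios.values())
    return sorted(name for name, ratio in ratios.items() if ratio > tolerance * typical)


# Invented figures for illustration only.
records = {
    "A": (100.0, 50.0),   # ratio 2.0
    "B": (210.0, 100.0),  # ratio 2.1
    "C": (95.0, 50.0),    # ratio 1.9
    "D": (400.0, 100.0),  # ratio 4.0: output doubled without matching electricity use
    "E": (160.0, 80.0),   # ratio 2.0
}

print(flag_suspicious(records))  # ['D']
```

A flag here is only a prompt for review. Industries differ in how much electricity they use per unit of output, so a real check would need to compare like with like.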
Data is a resource, but it is not the same as material resources such as coal and oil. Material resources are not renewable: the more you use, the less is left for others, so they are hard to share. Data can be reused and keeps generating new value. The use of big data resources is not cutthroat competition, and on the premise of sharing it is better able to create win-win outcomes. From another perspective, data cannot be called big data unless it is fused and linked together.
Big data cannot be used directly, and statistics is still the soul of data analysis
There is now a popular view that in the era of big data "sample = all": people obtain not a sample of the data but the whole of it, so conclusions can be reached by simple counting and complex statistical methods are no longer needed.
In my opinion, this view is very wrong. First, big data provides information but does not interpret it. To use an analogy, big data is "crude oil", not "gasoline", and cannot be used directly. It is like the stock market: even if all the data are published, people who do not understand the data still do not know what they represent. In the era of big data, statistics is still the soul of data analysis. As Prof. Michael Jordan of the University of California, Berkeley points out, "Big data research without systematic data science as a guide is like building bridges without utilizing knowledge of engineering science; many bridges may collapse with serious consequences."
Second, the very concept of full data hardly stands up to scrutiny. Full data, by definition, means all the data. For certain specific questions in certain specific settings, this is indeed attainable. For example, to compare whether students at Tsinghua University or at Peking University are stronger overall in mathematical ability, one can collect the college entrance examination math scores of all students at the two schools as the data for study. In a sense, this is full data. However, having this full data does not mean that we can answer the question well.
On the one hand, although these data are full data, they still carry uncertainty. Math scores at the time of admission do not necessarily fully represent a student's mathematical ability. If all the students were to take the entrance examination again, almost every one of them would get a different score. If the analysis were run separately on the two full data sets, the conclusions might change. On the other hand, things are constantly evolving, and students' scores at admission do not represent their current ability. The entrance examination scores of all students are full data only with respect to that one examination. "All" has a boundary, and beyond it full data is no longer omniscient or omnipotent. The development of things is full of uncertainty. Statistics studies both how to extract information and patterns from data to find the best solution, and how to quantify the uncertainty in the data.
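The point that "full" data still carries uncertainty can be illustrated with a small simulation. The sketch below, in Python, assumes a deliberately simple model that is not part of the talk: each student's observed score is a fixed ability plus independent Gaussian noise. The abilities, the noise level `noise_sd`, and the function name `share_of_retakes_favoring_a` are all hypothetical.

```python
import random


def share_of_retakes_favoring_a(ability_a, ability_b, noise_sd=10.0,
                                n_retakes=1000, seed=0):
    """Simulate repeated exams for two schools.

    Returns the fraction of simulated exams in which school A's mean score
    is higher than school B's. Observed score = ability + Gaussian noise
    (an assumed model, for illustration only).
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    wins = 0
    for _ in range(n_retakes):
        mean_a = sum(a + rng.gauss(0, noise_sd) for a in ability_a) / len(ability_a)
        mean_b = sum(b + rng.gauss(0, noise_sd) for b in ability_b) / len(ability_b)
        if mean_a > mean_b:
            wins += 1
    return wins / n_retakes


# Made-up abilities: school A is truly one point stronger on average.
ability_a = [120.0] * 50
ability_b = [119.0] * 50

share = share_of_retakes_favoring_a(ability_a, ability_b)
print(f"School A comes out ahead in {share:.0%} of simulated exams")
```

Under these assumptions every simulated exam yields a complete set of scores for both schools, yet the ranking of the two schools is not the same in every run. Quantifying how often it could flip is exactly the kind of question statistics addresses.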
So, many of the fundamental problems of data analysis in the era of big data are no different in essence from those of the era of small data. Of course, the characteristics of big data do pose entirely new challenges for data analysis. For example, when many traditional statistical methods are applied to big data, the enormous computation and storage costs are often prohibitive, and for data with complex structure and diverse sources, how to build effective statistical models also calls for new exploration and experimentation. For data science in the new era, these challenges are at the same time great opportunities, with the potential to generate new ideas, methods and technologies.