Misconceptions about Big Data: Statistics ≠ Big Data
Statistics describe things that have already happened, while big data is often used to predict or recommend things that haven't happened yet, so the two can't be equated. But whether it's statistics or big data, the goal is the same: making work more efficient and decisions more rational and accurate.
Big data is so hot that it is being used in a wide range of industries, and recently it has shown clear signs of overheating. Is big data a marketing term or a methodology? The author of this article, Lao Li, is a senior employee at a big data service provider, where his projects involve analyzing big data for different industries. He believes that you must first have a basic understanding of big data, namely that "a large amount of data does not necessarily have value". In addition, data statistics is not the same as big data: the difference between the two is artificial intelligence.
In the past two years, "big data" has been widely used in all walks of life, and there are obvious signs of overheating at this stage. From CCTV's Spring Festival migration map to Yao Chen's exclamation on seeing her microblogging data; from big data at the Two Sessions to the high-neck and low-neck sweaters worn by the "professor" in "Stars", "big data" has been pushed to an unprecedented height, and has also gone from a highly sophisticated research direction to a marketing term known around the world.
I am not qualified to represent the academic community, nor am I qualified to judge who is right and who is wrong. I can only draw on my own work experience to describe big data as I see it:
What is big data?
Baidu Encyclopedia defines big data as follows: big data, also called massive data, refers to information whose scale is so large that current mainstream software tools cannot capture, manage, process, and organize it within a reasonable period of time into information that helps businesses make better decisions.
Gartner defines "big data" as massive, high-growth, and diverse information assets that require new processing models to enable stronger decision-making, insight discovery, and process optimization.
I personally think Gartner's definition is more relevant. "New processing model" is the key term, and as I understand it, it is also one of the most critical features that differentiates "big data" from traditional statistical analysis. This so-called "new processing model" has two meanings:
1. Because of the massive amount of data, more efficient storage and processing technology is needed, and Hadoop has become a symbol of the big data era;
2. If you think that big data is the same as Hadoop, that would be a big mistake. Hadoop is only a necessary condition of the big data era; a clear sign of big data is the close integration of data mining and artificial intelligence. This is also one of the most obvious differences between "big data" as I understand it and many so-called "big data" projects. I'll expand on this later in the case study.
In addition to the "new processing model" difference above, I think there is another major difference: statistical analysis summarizes and categorizes data that already exists, while big data processes the huge volume of existing data in order to make predictions and recommendations about data that has not yet been generated. Data statistics describe things that have already happened, while big data is often used for predictions or recommendations about things that haven't happened yet.
Predictions and recommendations, how are they achieved?
The main recommendation algorithms available today can be broadly divided into two types: behavior-based and content-based. Of course, there are more than ten specific algorithms for different domains and different prediction and recommendation targets, but expanding on them is beyond the scope of this article.
Behavior-based analysis, as the name suggests, analyzes the "traces" users leave on the Internet and the mobile Internet, such as browsing, clicks, favorites, purchases, and repeat purchases, to predict what they will choose to buy in the future and to produce recommendations. Behavior-based analysis relies on the wisdom of the crowd and makes comprehensive use of the behavioral preferences of a whole group of users. Because users influence one another, this approach is closer to real-world user behavior.
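To make the behavior-based idea concrete, here is a minimal sketch of user-based collaborative filtering, which is only one of many possible behavior-based methods. The user names, items, and implicit scores are all invented for illustration, and the scoring scheme (1 = browsed, 3 = purchased) is an assumption, not something taken from a real project.

```python
from math import sqrt

# Hypothetical interaction log: user -> {item: implicit score}.
# Scores are made up for illustration (e.g. 1 = browsed, 3 = purchased).
interactions = {
    "alice": {"novel": 3, "cookbook": 1, "atlas": 3},
    "bob": {"novel": 3, "atlas": 3, "thriller": 3},
    "carol": {"cookbook": 3, "garden_guide": 3},
}


def cosine(a, b):
    """Cosine similarity between two sparse score dictionaries."""
    shared = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(user, data, top_n=3):
    """Rank items the user has not touched, weighting other users' scores by similarity."""
    seen = data[user]
    scores = {}
    for other, items in data.items():
        if other == user:
            continue
        sim = cosine(seen, items)
        if sim <= 0:
            continue  # users with nothing in common contribute nothing
        for item, score in items.items():
            if item not in seen:
                scores[item] = scores.get(item, 0.0) + sim * score
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


print(recommend("alice", interactions))
```

In this toy data, "alice" behaves most like "bob", so the item only "bob" has bought ranks first. The recommendation comes entirely from other users' behavior; nothing about the items themselves is inspected.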
Content-based analysis examines text, images, audio, video, and other information to produce predictions and recommendations. The "genes" of the content are matched to the user's preferences. The best-known example is Pandora's music recommendation, where experts tag every song in the library on more than 400 attributes, and those tags are then matched to an individual listener's taste to make recommendations. The analysis is done only on an individual basis, independent of the relationships between users.
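The content-based approach can be sketched in a similar way. The example below assumes each song has already been scored on a few attributes (three invented ones here, standing in for the hundreds a real catalog would use) and builds a profile from one listener's liked songs only. Song names, attribute names, and values are all hypothetical.

```python
from math import sqrt

# Hypothetical attribute scores per song (the "genes"); names and values are invented.
# Attribute order: [tempo, acoustic, vocal]
songs = {
    "song_a": [0.9, 0.1, 0.8],
    "song_b": [0.8, 0.2, 0.9],
    "song_c": [0.1, 0.9, 0.2],
    "song_d": [0.2, 0.8, 0.1],
}


def cosine(u, v):
    """Cosine similarity between two equal-length attribute vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def build_profile(liked, catalog):
    """Average the attribute vectors of the songs one listener liked."""
    vectors = [catalog[s] for s in liked]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def recommend(liked, catalog, top_n=2):
    """Rank songs the listener has not liked yet by similarity to their profile."""
    profile = build_profile(liked, catalog)
    candidates = [s for s in catalog if s not in liked]
    return sorted(candidates, key=lambda s: cosine(profile, catalog[s]), reverse=True)[:top_n]


print(recommend(["song_a"], songs))
```

Unlike the behavior-based sketch, no other user appears anywhere in this code: the only inputs are the content attributes and one person's history.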
What big data can really do
Talking about this now may make people laugh. Everyone seems to know that big data can do this and that, to the point where even those of us in the field find it ridiculous. Big data has not been "demonized"; it has been turned into "entertainment". It seems both far away from us and close to us, and has come to feel unreal.
So, drawing on my experience as a practitioner, let me say what problems big data actually solves. Simply put, big data can help us solve problems of decision-making and choice.
Weather forecasting is one of the oldest and best-known kinds of prediction. Based on the forecast, you can decide what to wear tomorrow, whether to bring an umbrella, and so on.
In the past two years, big data has been applied to film and television production: audience preferences are analyzed to predict and design the plots audiences will like, to find the actors and actresses audiences most want to see in the relevant roles, and even to predict the box office. All of these predictions are based on data that, after a modeling process, yields conclusions close to reality. In a way this gives decision makers a basis for their decisions, as with House of Cards and Stars.
Big Data has another important role to play, which is to solve the problem of "choice". Don't laugh, no matter what your age, gender, or educational background, people are now faced with a choice problem like never before. To put it in academic terms, this is a problem caused by the "long-tail effect"; to put it in layman's terms, it is due to the contradiction between the ever-increasing number of available choices and our own ability to deal with them.
Technological advances have made people lazier, which means that our own capacity to process choices has decreased, both subjectively and objectively. Meanwhile, the number of things to choose from keeps growing: from the vast range of goods in e-commerce to massive music libraries, from dating sites to traffic signal management.
Big data, built on artificial intelligence, is a means of letting people be "lazy". Based on your historical behavior, it works out your likely preferences and even your needs, and recommends the best results to you. That is big data: your attentive butler, or the friend who understands you best.
One of the most classic cases is the "beer and diapers" study attributed to Walmart: the study found that a certain class of customers who bought diapers often bought beer as well. On the face of it, diapers and beer are two unrelated categories of goods, and personal experience suggests no connection between them. It was later explained as a social phenomenon. In many young American families, when the diapers run out, the woman stays home with the children while the man goes to the supermarket to buy more. After buying diapers, he usually picks up some beer as well.
The above example shows that data can often lead you to discover phenomena that seem irrational and illogical, but exist and occur frequently.
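A pattern like "diapers and beer" is usually quantified with association-rule measures: support, confidence, and lift. The sketch below computes them over a handful of invented baskets. The baskets are purely illustrative and are not Walmart data, and the helper names are hypothetical.

```python
# Hypothetical shopping baskets, invented for illustration (not Walmart data).
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "wipes"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "chips"},
    {"bread", "milk", "wipes"},
    {"chips", "bread"},
]


def support(itemset, transactions):
    """Share of transactions that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    sup_a = support(antecedent, transactions)
    sup_c = support(consequent, transactions)
    if sup_a == 0 or sup_c == 0:
        raise ValueError("antecedent and consequent must each occur at least once")
    both = support(set(antecedent) | set(consequent), transactions)
    confidence = both / sup_a   # P(consequent | antecedent)
    lift = confidence / sup_c   # > 1 means bought together more often than by chance
    return {"support": both, "confidence": confidence, "lift": lift}


print(rule_metrics({"diapers"}, {"beer"}, baskets))
```

A lift above 1 is what flags a pairing as worth a closer look. It says nothing about why the pairing exists; the explanation about young fathers still has to come from someone who understands the business.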
To take another example, Beijing's traffic congestion is something the whole world knows about. During the morning and evening peaks especially, it no longer needs predicting. But if you use historical traffic data and mathematical modeling to work out an optimal traffic signal management scheme for all of Beijing, that falls into the realm of big data.
This is where I see the biggest difference between big data and ordinary statistical analysis: statistics can help you find diseases, but big data can not only help you find them, but also help you treat them.
Big data is never a "gimmick". We helped an operator's reading base with its reading recommendation program, and the indicators improved greatly. The increase was not a few dozen percent, but several times over! (Per capita user traffic increased by 4 times, and silent user activation capacity increased by 6.5 times.) This is the charm of big data.
Big data is not everything
Big data is obviously not everything. That is exactly what makes it real. In some areas, for various reasons, the value big data brings is not as high as expected. Two main problems lead to this: either the quality or quantity of the data itself is insufficient, or the algorithm is not appropriate.
Don't assume that a massive amount of data necessarily has value. In our past work, we have often found that 80-90% of the data supplied by the client (Party A) is useless, and only 10-20% of it produces any value. This brings me back to the analogy Mary Meeker made: "Working with big data is like looking for a needle in a haystack."
What's more, many domains are themselves in the early stages of their business, and the data they have is very thin. Cold start and sparsity are challenges that big data faces in many domains.
On the other hand, there is no one-size-fits-all algorithm across domains and projects; each specific problem must be analyzed and solved on its own terms. In practice we have found that the right approach differs not only between fields (such as article recommendation and product recommendation) but even between businesses within the same field (different types of e-commerce, such as mother-and-baby products versus clothing or luxury goods).
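Both problems are easy to see in a few lines. The sketch below measures sparsity as the share of empty cells in the user-item matrix and shows one common cold-start fallback, recommending globally popular items to a user with no history. The event log is invented, and the popularity fallback is just one possible workaround chosen for illustration, not a method described in this article.

```python
from collections import Counter

# Hypothetical interaction log of (user, item) pairs; all names are invented.
events = [
    ("u1", "item_a"), ("u1", "item_b"),
    ("u2", "item_a"),
    ("u3", "item_c"), ("u3", "item_a"),
]


def sparsity(pairs):
    """Fraction of the user-item matrix with no observed interaction."""
    observed = set(pairs)
    if not observed:
        raise ValueError("no interactions to measure")
    users = {u for u, _ in observed}
    items = {i for _, i in observed}
    return 1 - len(observed) / (len(users) * len(items))


def popular_items(pairs, top_n=1):
    """Cold-start fallback: items touched by the most distinct users."""
    counts = Counter(item for _, item in set(pairs))
    return [item for item, _ in counts.most_common(top_n)]


print(round(sparsity(events), 3))
print(popular_items(events))
```

Even in this tiny log nearly half the matrix is empty. Real catalogs with many thousands of items are far sparser, which is why methods that work on dense data can fail here.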
Cross-utilization of data
The two data problems mentioned above that big data faces in practice, namely the lack of data at cold start and the sparseness of data in the early stages of a business, are not hopeless. Data bridging, which the industry has long been discussing, is the way out of both.
For some emerging areas, a lack of data is inevitable. Yet precisely because they lack data, these areas have an even greater need for a strong decision support system to guide their business, so that they can avoid detours and maximize returns.
Projects in the mobile Internet field are particularly representative. Although the mobile Internet has developed rapidly in the past two or three years, its accumulated data still cannot compare with that of the traditional Internet. Especially before people have formed stable usage habits, the data does not yet carry much value or significance.
But if we can connect Internet data with mobile Internet data, then we already have information about a person's preferences and other traits, and can use it to guide and support the mobile Internet business more effectively.
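As a rough sketch of what connecting the two sources can look like, the example below merges per-channel preference counts into one profile per person. It assumes the hard part is already solved, namely that the same person carries the same account ID in both channels. All IDs, topics, and counts are invented.

```python
# Hypothetical per-channel preference counts keyed by a shared account ID.
# Assumption: the same person has the same ID on both channels.
web_prefs = {
    "user_1": {"sci-fi": 12, "history": 3},
    "user_2": {"cooking": 7},
}
mobile_prefs = {
    "user_1": {"sci-fi": 1},
    "user_3": {"travel": 2},
}


def bridge(*sources):
    """Merge preference counts from several channels into one profile per user."""
    merged = {}
    for source in sources:
        for user, prefs in source.items():
            profile = merged.setdefault(user, {})
            for topic, count in prefs.items():
                profile[topic] = profile.get(topic, 0) + count
    return merged


profiles = bridge(web_prefs, mobile_prefs)
print(profiles["user_1"])
```

Here "user_1" has almost no mobile history, but the bridged profile already shows a strong preference that the mobile side can use from day one.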
Of course, data bridging is by no means limited to the Internet and the mobile Internet. Data from each source often portrays a different aspect of a person. As Prof. Barabási describes in his book "Bursts", 93% of human behavior is predictable and regular if there is sufficient data.
It is also only by reorganizing this data from different sources that more meaningful information can be mined.
Nowadays, many people in the industry do data statistics and analysis under the banner of "big data", which leads many laymen into a misunderstanding. Data statistics is not equal to big data. Still, whether it is data statistics or big data, the aim is to make our work more effective and our decisions more rational and accurate. Valuing data is in itself a sign of maturity for an organization.
The rapid rise of the mobile Internet has made data more diverse and abundant. Its mobility, fragmentation, privacy, and always-at-hand nature fill in exactly the data that goes missing once users leave their desktop computers. Combined with existing Internet data, it can outline a full day in an Internet user's life.
With the further enrichment and improvement of data, and with the opening up and cross-utilization of data from different channels, the possibilities for big data will be even broader.