1.1 Look at big data from a historical perspective
Compared with the agricultural and industrial eras, the information age is a relatively long historical period, and different periods show obvious differences in factors of production and in the driving forces of social development. The iconic technological inventions of the information age are the digital computer, the integrated circuit, optical fiber communication and the Internet (World Wide Web). Although the media often speaks of a "big data era", new technologies such as big data and cloud computing have not yet produced breakthroughs comparable to the epoch-making inventions listed above, and they can hardly constitute a new era beyond the information age. The information age can, however, be divided into several stages, and the application of new technologies such as big data indicates that the information society is entering a new stage.
An examination of the past hundred-plus years of history shows many similarities between the development patterns of the information age and those of the industrial age. The process by which electrification improved productivity is strikingly similar to that of the information age: only after 20 to 30 years of diffusion and accumulation did the gains become clearly visible, with the dividing lines at roughly 1915 and 1995 respectively. The author conjectures that, after several decades of diffusion of information technology, the first 30 years of the 21st century may be the golden age in which information technology boosts productivity.
1.2 Understanding big data from the perspective of the new stage of the information age
China has entered the information age, but many people's thinking remains stuck in the industrial age. Many problems in economic and scientific work are rooted in a lack of understanding of the times. China fell behind and was beaten in the 18th and 19th centuries because the Manchu government did not realize that the times had changed; we cannot repeat that historical mistake.
After the central government declared that China's economy had entered a "new normal", there was much discussion in the media, but most of it explained the slowdown in economic growth; few articles examined the new normal from the perspective of a change of era. The author believes that the new economic normal means China has entered a new stage in which informatization drives new industrialization, urbanization and agricultural modernization. It is a leap in economic and social management, not an expedient measure, still less a retrogression.
The IT architecture formed by next-generation information technologies such as big data, mobile Internet, social networks, cloud computing and the Internet of Things is a sign that the information society has entered a new stage, and it leads and drives the transformation of the whole economy. The "Internet+", makers, the "second machine revolution" and "Industry 4.0" that frequently appear in the media are all related to big data and cloud computing. Big data and cloud computing are new levers for raising productivity under the new normal; so-called innovation-driven development relies mainly on information technology to improve productivity.
1.3 Big data may be the breakthrough point for China's information industry to move from following to leading
China's big data enterprises already have a fairly good foundation. Among the world's top ten Internet service companies, China holds four seats (Alibaba, Tencent, Baidu and JD.com); the other six are all American companies, and no European or Japanese Internet company has made the top ten. This shows that Chinese enterprises have taken a leading position in Internet services based on big data. In the development of big data technology, China may reverse the situation of the past 30 years in which core technology was controlled by others, and may come to play a leading role in the worldwide application of big data.
However, the fact that some enterprises are at the world's forefront does not mean that China leads in big data technology. In fact, none of the mainstream big data technologies popular around the world originated in China. Open source communities and crowdsourcing are important routes for developing big data technology and industry, yet our contribution to open source communities is very small: among the nearly 10,000 core community volunteers worldwide, there may be fewer than 200 from China. We must learn from the past lesson that basic research supplied enterprises with too few core technologies, strengthen basic research and forward-looking technology research on big data, and strive to master its core and key technologies.
2 Understanding big data must rise to the level of culture and epistemology
2.1 Data culture is an advanced culture
The essence of data culture is respect for the objective world and the spirit of seeking truth from facts; data are facts. Paying attention to data means emphasizing the scientific spirit of speaking with facts and thinking rationally. The traditional habit of the Chinese people is qualitative rather than quantitative thinking. At present, many cities are opening up government data, only to find that most people are not interested in the data the government wants to release. To put big data on a healthy track, we must first vigorously promote data culture. The data culture discussed here does not mean big data used in cultural industries such as literature, art and publishing, but the data awareness of the whole population. The whole society should realize that the core of informatization is data; only when the government and the public attach importance to data can we truly understand the essence of informatization. Data is a new factor of production, and the use of big data can change the weight of traditional factors such as capital and land in the economy.
Some describe "trusting God and trusting data" as one of the characteristics of American culture: Americans combine sincerity toward God with the rationality of seeking truth through data. The United States completed this shift toward a data culture between the Gilded Age and the Progressive Era: after the Civil War, census methods were applied to many fields, forming a mode of thinking based on data prediction and analysis. Over the past century, the modernization of the United States and other Western countries has been closely tied to the spread and penetration of data culture, and China, too, must emphasize data culture if it is to achieve modernization.
The key to raising data awareness is to understand the strategic significance of big data. Data is a strategic resource as important as matter and energy. Data collection and analysis touch every industry; they are a global, strategic technology. The shift from hard technology to soft technology is a worldwide trend, and the technology that extracts value from data is the most dynamic kind of soft technology. Falling behind in data technology and the data industry would mean missing an entire era, just as missing the opportunity of the industrial revolution did.
2.2 Understanding big data requires a correct epistemology.
Historically, scientific research began with logical deduction: all the theorems of Euclidean geometry can be deduced from a few axioms. Since Galileo and Newton, scientific research has placed more emphasis on natural and experimental observation, extracting scientific theories from observations by induction, so that "science begins with observation" became the mainstream of scientific research and epistemology. Both empiricism and rationalism have contributed greatly to the development of science, but both have also exposed obvious problems and even gone to extremes: rationalism taken to the extreme became the dogmatism criticized by Kant, while empiricism taken to the extreme became skepticism and agnosticism.
In the 1930s the philosopher Karl Popper put forward an epistemological position that later generations called "falsificationism". He held that scientific theories cannot be proved by induction but can only be falsified by counterexamples found in experiments, and he therefore denied that science begins with observation, advancing the famous view that "science begins with problems" [3]. Falsificationism has its limitations: if the rule of falsification were strictly applied, important theories such as the law of universal gravitation and atomism might have been killed off by early so-called counterexamples. Nevertheless, the idea that science begins with problems has guiding significance for the development of big data technology.
The rise of big data has given rise to a new model of scientific research: "science begins with data". Epistemologically, big data analysis is close to the empiricist view that "science begins with observation", but we should bear in mind the lessons of history and avoid sliding into the mire of empiricism that denies the role of theory. When emphasizing "correlation", we should not doubt the existence of "causality"; when proclaiming the objectivity and neutrality of big data, we should not forget that data, however large, is always subject to its own limitations and to human bias. Do not believe the prophecy that with big data mining there is no longer any need to ask questions of the data and that knowledge will emerge from it automatically. Faced with an ocean of data, the biggest puzzle for scientists and engineers engaged in data mining is what "fish" they are trying to catch in this ocean; in other words, they need to know where the problem lies. In this sense, "science begins with data" and "science begins with problems" should be organically combined.
The pursuit of "why" is the eternal driving force of scientific development. But causes are endless, and humanity cannot find the ultimate cause within finite time. On the road of scientific exploration, people often explain the world in terms of "objective laws" without immediately asking why such objective laws exist; in other words, traditional science does not only pursue causality, it also accepts objective laws as conclusions. The results of big data research are mostly new knowledge or new models, which can likewise be used to predict the future and can be regarded as local objective laws. In the history of science there are examples of universal laws discovered through small-data models, such as Kepler's laws of planetary motion; most big data models, by contrast, discover particular laws. The laws of physics are generally necessary, whereas big data models are not necessarily inevitable or deducible. The objects of big data research are often human psychology and society, which sit higher on the ladder of knowledge: their natural boundaries are fuzzy, but they have more practical features. Big data researchers pay more attention to the unity of knowledge and action and trust in practice. Big data epistemology has many characteristics that differ from traditional epistemology, but we cannot deny the scientific nature of big data methods simply because of these differences. The study of big data challenges the traditional epistemological preference for causality, supplements purely causal laws with data-derived laws, moves toward a unification of rationalism and empiricism grounded in data, and a brand-new big data epistemology is taking shape.
3 Correctly understand the value and benefits of big data
3.1 The value of big data is mainly reflected in its driving effect
People always expect to mine unexpected "great value" from big data. In fact, the value of big data lies mainly in its driving effect: it drives related scientific research and industrial development and improves the ability of all industries to solve problems and add value through data analysis. Big data's contribution to the economy is not fully reflected in the direct revenue of big data companies; it also includes the gains in efficiency and quality it brings to other industries. Big data is a typical general-purpose technology, and a general-purpose technology should be understood with the "bee model": the main benefit of bees is not the honey they produce but the contribution their pollination makes to agriculture.
Von Neumann, one of the founders of the electronic computer, once pointed out that in every science, great progress is made when methods that can be continuously generalized are developed by studying problems that are fairly simple compared with the ultimate goal. We should not expect miracles every day; real progress lies in doing more simple things and in solid effort. The media likes to promote astonishing big data success stories, and we should keep a clear head about them. According to a report by Wu Gansha, chief engineer of Intel China Research Institute, the so-called classic data-mining case of "beer and diapers" was in fact a story made up by a Teradata manager and never actually happened [4]. Even if the case were real, it would not show that big data analysis is itself magical: examples of two seemingly unrelated things appearing together or in succession can be found everywhere in big data. The key is human analysis and reasoning to work out why the two things appear together or in succession; only when the right reason is found does it become new knowledge or a newly discovered law. Correlation by itself is of little value.
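To make the point about correlation concrete, here is a minimal sketch (in Python, over made-up transactions) of how a pairwise association score such as lift might be computed; all item names and figures are invented for illustration, and a lift above 1 only says that two items co-occur more often than independence would predict, not why.

```python
from itertools import combinations
from collections import Counter

# Made-up shopping baskets, purely illustrative.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "chips"},
    {"diapers", "milk"},
    {"beer", "diapers", "milk"},
]

n = len(transactions)
item_count = Counter(i for t in transactions for i in t)
pair_count = Counter(frozenset(p) for t in transactions for p in combinations(sorted(t), 2))

def lift(a, b):
    """lift > 1 means a and b co-occur more often than independence would predict."""
    p_a, p_b = item_count[a] / n, item_count[b] / n
    p_ab = pair_count[frozenset((a, b))] / n
    return p_ab / (p_a * p_b)

print(f"lift(beer, diapers) = {lift('beer', 'diapers'):.2f}")
# The number only measures co-occurrence; explaining *why* the two items
# co-occur (the new knowledge) still requires human reasoning and domain insight.
```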
There is a well-known fable that illustrates the value of big data from one angle. Before he died, an old farmer told his three sons that he had buried a bucket of gold in the family's fields, but he did not say where. The sons dug over every plot of the family land and found no gold, yet because of the deep digging the crops grew especially well from then on. Improving the ability to collect and analyze data is much the same: even if no universal laws or completely unexpected new knowledge are discovered, the value of big data gradually shows itself.
3.2 The power of big data comes from the "wisdom of great synthesis"
Every data source has its limitations and one-sidedness; only by fusing and integrating data from all sides can the full picture of things be reflected. The essence and laws of things are hidden in the associations among raw data. Different data may describe the same entity from different angles, and for the same problem different data can provide complementary information, allowing a deeper understanding. Therefore, gathering as many data sources as possible is the key in big data analysis.
Data science is a science that fuses mathematics (statistics, algebra, topology, etc.), computer science, basic science and various applied sciences, resembling the "science of the wisdom of great synthesis" proposed by Mr. Qian Xuesen [5]. Qian pointed out that only by bringing everything together can one attain wisdom. The key to gaining wisdom from big data lies in the fusion and integration of multiple data sources. Recently the IEEE Computer Society released a report forecasting computer technology trends for 2014, highlighting "seamless intelligence"; the goal of developing big data is precisely to obtain such seamless intelligence through collaboration and integration. Relying on a single data source, however large, can be as one-sided as the blind men feeling the elephant. Open sharing of data is not icing on the cake but a necessary precondition that determines the success or failure of big data.
Research and application of big data must change the traditional mindset in which departments and disciplines each develop in isolation. The emphasis should be not on supporting individual technologies and methods but on collaboration among different departments and disciplines. Data science is not a vertical "stovepipe" but a horizontally integrative science like environmental science or energy science.
3.3 Big data has a bright future, but we can't expect too much in the near future.
When alternating current first appeared it was mainly used for lighting, and no one could have imagined its ubiquitous applications today. The same is true of big data technology, which will produce many unexpected applications in the future. We need not worry about big data's long-term prospects, but in the near term we must work very pragmatically. People often overestimate short-term development and underestimate long-term development. Gartner predicts that big data technology will take 5 to 10 years to become mainstream, so we should be patient in developing it.
Like other information technologies, big data follows an exponential growth law for a period of time. The characteristic of exponential growth is that, measured over a historical period (at least 30 years), early development is relatively slow; after a long accumulation (possibly more than 20 years) an inflection point is reached, followed by explosive growth. But no technology keeps growing exponentially forever. In general, high technology follows the hype cycle described by Gartner and eventually either enters a stable state of healthy development or dies out.
The problems big data technology must address, such as social computing, life science and brain science, are often extremely complex and cannot be solved even by several generations of effort. The universe evolved for billions of years before life and humanity appeared; its complexity and ingenuity are unmatched, and we should not expect to unveil all its mysteries in our own generation. Viewed against a future of millions of years or longer, big data technology is just one wave in the long river of scientific and technological development, and we must not harbor unrealistic illusions about the scientific achievements that 10 to 20 years of big data research can deliver.
4 Challenges facing big data research and application from the perspective of complexity
Big data technology is closely related to humanity's efforts to grapple with complexity. In the 1970s the rise of the three new theories (dissipative structure theory, synergetics and catastrophe theory) challenged the reductionism that had guided science and technology for centuries. In 1984 three Nobel laureates, including Murray Gell-Mann, founded the Santa Fe Institute, devoted to the study of complexity; they raised the slogan of "transcending reductionism" and set off a complexity science movement in scientific and technological circles. Yet the thunder was loud and the rain small: the expected results have not been achieved over the past 30 years. One reason may be that the technology for handling complexity was not yet available.
The development of integrated circuits, computers and communication technology has greatly enhanced humanity's ability to study and handle complex problems. Big data technology will carry forward the new thinking of complexity science and may finally allow it to take root in practice. Complexity science is the scientific basis of big data technology, and big data methods can be seen as a technical realization of complexity science. Big data methods offer a technical path toward the dialectical unity of reductionism and holism. Big data research should draw nourishment from complexity research: scholars working in data science should not only understand the three new theories of the 20th century mentioned above but also learn about hypercycles, chaos, fractals and cellular automata, so as to broaden their horizons and deepen their understanding of the mechanisms of big data.
Big data technology is still immature. Faced with massive, heterogeneous and dynamic data, traditional data processing and analysis techniques struggle to cope, and existing data processing systems suffer from low efficiency, high cost, high energy consumption and poor scalability. Most of these challenges stem from the complexity of the data itself, the complexity of computation, and the complexity of information systems.
4.1 Challenges brought by data complexity
Because big data involves complex types, complex structures and complex patterns, the data itself is highly complex, which makes analysis tasks such as image and text retrieval, topic discovery, semantic analysis and sentiment analysis very difficult. At present we do not understand the physical meaning behind big data, the laws of association among data, or the intrinsic relationship between the complexity of big data and computational complexity, and the lack of domain knowledge limits the discovery of big data models and the design of efficient computational methods. Formally or quantitatively describing the essential characteristics and measures of big data complexity requires deep study of the internal mechanisms of data complexity. The complexity of the human brain lies mainly in the connections among trillions of dendrites and axons, and the complexity of big data likewise lies mainly in the associations among data; understanding the mystery of these associations may be the breakthrough for revealing laws of micro- and macro-level emergence. Research on the laws of big data complexity helps us understand the essential features and formation mechanisms of complex patterns in big data, thereby simplifying the representation of big data and obtaining better knowledge abstraction. It is therefore necessary to establish theories and models of data distribution under multimodal association and to clarify the intrinsic relationship between data complexity and computational complexity, laying a theoretical foundation for big data computing.
4.2 Challenges brought by computational complexity
Big data computing cannot, as with small-sample data sets, perform statistical analysis and iterative computation over the entire data set. When analyzing big data we need to re-examine its computability, computational complexity and solution algorithms. Big data samples are enormous, their internal associations are dense and complex, and their value density is distributed extremely unevenly; these characteristics challenge the establishment of a computing paradigm for big data. For petabyte-scale data, even algorithms of linear complexity are hard to carry out, and the sparsity of the data distribution may cause a great deal of wasted computation.
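As a back-of-the-envelope illustration of why even a single linear pass is costly at petabyte scale, the following sketch assumes a sustained aggregate read throughput of 10 GB/s, a figure chosen purely for illustration:

```python
# Back-of-the-envelope: how long does one linear scan of 1 PB take?
DATA_BYTES = 10**15          # 1 PB
THROUGHPUT = 10 * 10**9      # assumed sustained read rate: 10 GB/s (illustrative)

seconds = DATA_BYTES / THROUGHPUT
print(f"one pass over 1 PB: {seconds:.0f} s, about {seconds / 3600:.1f} hours")
# Roughly 100,000 s (about 28 hours) for a single O(n) pass, so any algorithm
# that needs many iterations over the full data quickly becomes impractical.
```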
Traditional computational complexity concerns the functional relationship between the time and space needed to solve a problem and the problem's size. A so-called polynomial-complexity algorithm is one whose time and space requirements grow at a tolerable rate as the problem size increases. The focus of traditional scientific computing is how to solve a problem of a given size as quickly as possible. In big data applications, especially stream computing, the time and space available for processing and analysis are often strictly limited; a network service whose response time exceeds a few seconds, or even a few milliseconds, will lose many users. Big data applications are essentially about "computing more" under given time and space constraints. Moving from "computing faster" to "computing more" greatly changes the logic by which we reason about computational complexity. "Computing more" does not mean the more data the better; we must explore on-demand reduction methods that go from "enough data" to "good-enough data" to "valuable data", as sketched below.
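One simple example of such on-demand reduction is reservoir sampling, which keeps a fixed-size uniform sample of an arbitrarily long stream so that later analysis fits a fixed space budget; the stream and sample size in this sketch are made up for illustration.

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of k items from a stream of unknown length,
    using O(k) memory -- one way to reduce 'enough data' to 'good-enough data'."""
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = random.randint(0, i)   # item replaces a slot with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

# Hypothetical usage: summarize a long event stream with a 1,000-item sample.
events = (x * x % 97 for x in range(1_000_000))
sample = reservoir_sample(events, 1000)
print(sum(sample) / len(sample))       # estimate of the stream mean from the sample
```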
One approach to solving problems with big data is to give up the search for general solutions and instead find solutions to specific problems under particular constraints. Human cognitive problems are generally NP-hard, but given enough data, very satisfactory solutions can often be found under restricted conditions; the great progress of self-driving cars in recent years is a good example. To reduce the amount of computation, we need to study local computation and approximation methods based on bootstrapping and sampling, propose new algorithmic theories that do not depend on the full data set, and study algorithms that tolerate uncertainty and are suited to big data.
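The bootstrap idea mentioned above can be sketched as follows: estimate a statistic and its uncertainty from a modest sample rather than from the full data set. The data, sample size and number of resamples here are invented for illustration.

```python
import random
import statistics

random.seed(0)

# Stand-in for "full data" we do not want to process exhaustively (illustrative).
full_data = [random.gauss(50, 10) for _ in range(1_000_000)]

# Work from a small uniform sample instead of the whole data set.
sample = random.sample(full_data, 2_000)

# Bootstrap: resample with replacement to estimate the variability of the median.
boot_medians = []
for _ in range(1_000):
    resample = random.choices(sample, k=len(sample))
    boot_medians.append(statistics.median(resample))

boot_medians.sort()
lo, hi = boot_medians[25], boot_medians[974]   # rough 95% interval
print(f"median ~ {statistics.median(sample):.2f}, 95% CI ~ ({lo:.2f}, {hi:.2f})")
# The estimate and its error bars come from 2,000 points, not 1,000,000.
```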
4.3 Challenges brought by system complexity
Big data places strict demands on the operating efficiency and energy consumption of computer systems, and evaluating and optimizing the efficiency of big data processing systems is challenging. We must not only clarify the relationship between the computational complexity of big data and system efficiency and energy consumption, but also measure system efficiency comprehensively in terms of throughput, parallel processing capability, job accuracy and energy consumption per job. In view of big data's sparsity and weak locality of access, distributed storage and processing architectures for big data need to be studied.
Big data applications touch almost every field, and the advantage of big data is its ability to find sparse but precious value in long-tail applications. However, a computer architecture optimized for one purpose can hardly satisfy such varied requirements, and fragmented applications greatly increase the complexity of information systems. Big data and Internet of Things applications are as numerous as species of insects (more than five million kinds): how can so many applications form a huge market like that for mobile phones? This is the so-called "insect paradox" [6]. To cope with the complexity of computer systems, heterogeneous computing systems and malleable ("plastic") computing technologies need to be studied.
In big data applications the workload of computer systems has changed substantially, and computer architecture needs revolutionary reconstruction. Information systems need to shift from "data revolving around the processor" to "processing power revolving around the data"; the focus is not the computation itself but the movement of data. The starting point of architecture design should shift from minimizing the completion time of a single task to raising system throughput and parallel processing capacity, with the scale of concurrent execution raised to a billion or more. The basic idea of building data-centric computing systems is to fundamentally eliminate unnecessary data movement; the data movement that remains necessary should also change from "elephants hauling logs" to "ants carrying grains of rice".
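A toy contrast between shipping raw data to the computation and shipping the computation (and only small summaries) to the data might look like the following; the partition sizes and the statistic computed are purely illustrative.

```python
# Toy contrast: "move the data to the computation" vs. "move the computation
# to the data". Partition contents and sizes are illustrative only.

partitions = [list(range(i, i + 100_000)) for i in range(0, 400_000, 100_000)]

# Data-centric style: each node reduces its own partition locally and ships
# only a tiny summary (sum, count) instead of the raw records.
def local_summary(part):
    return sum(part), len(part)

summaries = [local_summary(p) for p in partitions]   # runs "where the data lives"
total, count = map(sum, zip(*summaries))
print("mean via local aggregation:", total / count)

# A processor-centric style would instead concatenate all partitions at one
# node before computing -- the same answer, but with ~400,000 records moved
# across the system rather than four two-number summaries.
```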
5 Misunderstandings that should be avoided in the development of big data
5.1 Do not blindly pursue "large data scale"
The main difficulty with big data lies not in the sheer volume of data but in the diversity of data types, the demand for timely response, and the difficulty of distinguishing genuine data from spurious data. Existing database software cannot handle unstructured data, so attention should be paid to data fusion, standardization of data formats and data interoperability. Low quality in collected data is one of the characteristics of big data, but improving the quality of raw data as much as possible still deserves attention: the biggest problem in brain science research, for example, is the poor reliability of the collected data, and it is hard to obtain valuable results from analysis of unreliable data.
Blindly pursuing large data scale is not only wasteful but may also be ineffective. Fusing multi-source small data may yield great value that single-source big data cannot provide, so we should pay more attention to data fusion technology and to the opening and sharing of data. What counts as "large scale" is closely tied to the application domain: in some fields a few petabytes may not be large, while in others tens of terabytes may already be very large.
The development of big data must not be an endless pursuit of "bigger, more and faster"; it must reduce cost and energy consumption, benefit the public, and conform to fairness and the rule of law. As with today's control of environmental pollution, we should watch for the various harms big data may bring, such as "data pollution" and the invasion of privacy.
5.2 Do not be "technology-driven"; put applications first
New information technologies emerge one after another, and new concepts and terms keep appearing in the information field. It is predicted that after "big data", new technologies such as cognitive computing, wearable devices and robots will in turn climb to the peak of the hype cycle. We are used to following foreign fads and often unconsciously follow the technology, which makes it easy to slip onto the "technology-driven" road. In fact, the purpose of developing information technology is to serve people, and the only criterion for testing any technology is its application. To develop the big data industry, China must adhere to an application-first development strategy and an application-driven technical route. Technology is limited, but applications are unlimited. Localities developing cloud computing and big data must use policies and other measures to mobilize the enthusiasm of application departments and innovative enterprises, explore new applications through cross-sector combinatorial innovation, and find the way forward in applications.
5.3 Do not abandon "small data" methods
A popular definition of "big data" is: data sets that cannot be captured, stored and processed within a reasonable time by today's mainstream software tools. This defines the problem in terms of techniques that cannot cope with it, which can be misleading: by this definition one may attend only to problems that cannot currently be solved, like a walker trying to step on the shadow ahead of him. In fact, most of the data processing encountered in every industry is still a "small data" problem. We should focus on solving real problems, whether with big data or with small data.
Statisticians have spent more than 200 years summarizing the various traps in the process of gaining knowledge from data, and these traps do not automatically fill themselves in as data volumes grow. Big data contains a great many small-data problems, and big data collection can produce the same statistical biases as small-data collection. Google's flu prediction has failed in the past two years partly because of human interventions such as search recommendations.
A view popular in the big data community holds that big data needs no analysis of causality, no sampling and no exact data. These views must not be made absolute; in practical work we should combine logical deduction with induction, white-box with black-box research, and big data methods with small data methods.
5.4 Pay close attention to the cost of building big data platforms
At present big data centers are being built all over the country; even the Luliang mountain area has built a data processing center with a capacity of more than 2 PB, and many municipal public security departments require high-definition surveillance video to be kept for more than three months. Such systems are very expensive. The value mined from data is bought at a cost, so we cannot blindly build big data systems regardless of cost; what data to keep and for how long should be weighed against the possible value and the required cost. Big data system technology is still under study: the US exascale supercomputer program calls for a 1000-fold reduction in energy consumption and is not planned for completion until 2024, and a giant system built with today's technology consumes enormous amounts of energy.
We should not compete over the scale of big data systems but over actual application results, and over accomplishing the same task with fewer resources and less energy. We should first tackle the big data applications that ordinary people need most and develop big data in line with local conditions. The strategy for developing big data is the same as that for informatization: the goal should be ambitious, the start precise, and the development fast.