Big data is probably one of the most hyped terms in IT over the past two years. At every forum and conference, no talk is complete without it. In the IT world "big data" has become a "street word", as ubiquitous as the "street phone" every Apple fan carries: if you can't chime in with a few remarks about "big data this, big data that", you are almost embarrassed to tell people you work in IT. From a certain point of view, the big data "circle" is a mess, no tidier than the "your circle" people like to joke about.
First, conceptually, what is big data? Data processing has in fact been around since the dawn of mankind. When the ancients tied knots in rope to keep records, that was basic statistics: counting how many meals they had eaten, how many times they had hunted, and so on. Closer to our time, the emperor turning over a concubine's name tablet each night was data processing too: before choosing, he had to weigh a large pile of tablets against indicators such as "convenience", "popularity" and "freshness". More recently still, data warehouses had been developing and maturing for decades before the term "big data" appeared.

So big data is nothing new. It is a concept that got hyped once technologies such as Hadoop, MR (MapReduce), Storm and Spark had developed to a certain stage. What these technologies do share is one basic idea, open source, which no earlier stage had, and which saves money and improves efficiency. That is why everyone is throwing matches onto this industry's fire. (Many people are now jumping on the bandwagon and adding to the hype; I don't think that is a bad thing.)

Myth 1: only people who develop big data technology are real "insiders"

I have attended a number of meetings, about 70% of them technology-oriented, where everyone present was a project manager or technical lead from a data-related project somewhere in the country. The topics were: what problems come up when upgrading the CDH version, which approach works better for Hive jobs, how to pair Storm and Kafka efficiently, and how to free up memory in Spark applications. The attendees shared one attitude: people who do not understand big data technology are not qualified to comment on big data. If you do not understand resource allocation in Hadoop 2.0, tuning how long data stays resident in memory in Spark, or data collection with Kafka, don't come to this meeting!
Oh, and by the way, Google recently abandoned MR completely and now uses only Dataflow. Did you know that?

Here I would like to say that technical progress is driven by business. Does it only count as big data once a certain "Treasure" company has gotten rid of IOE (IBM, Oracle, EMC)? If I, a deaf masseur, use knotted rope to work out which massage techniques suit people of different builds over a whole course of treatment, is that not big data analytics? How far technology develops is only in small part driven by scientists' pursuit of excellence; mostly it is because the business has grown to the point where technology must advance for its goals to be met. The real big data "insiders" should therefore include at least the following kinds of people.

First, business operations people. An Internet product manager, for example, demands that the engineers compute a user's "mood index" for the day the moment the user arrives on the site, and monitor it dynamically, which can only be done with Storm or Spark. A telecom operator wants real-time marketing: the moment a user walks into a service hall, a text message must be pushed telling him that this very hall holds a blind-date match who suits him perfectly (with height, measurements, weight and other indicators attached), but that he must buy a 4G phone before they meet. Or a customer comes to a bank to open an account; the bank knows that in the past week he has been to a hospital outpatient clinic twice, traveled abroad three times and taken his child swimming twice, and the account manager immediately recommends the matching bancassurance and wealth-management products. These business people are often the core force driving technical progress.

Second, architects.
Architects matter because of what happens when a business person and an engineer sit down to discuss a problem, one speaking the language of business and the other technical jargon. The engineer is usually thinking about what code will shut the other person up fastest. The architect is the one who jumps in and says: "No, not like that. Written your way, it solves one problem and creates several more down the line. Follow my design and you solve several problems at once." In a non-technical enterprise, more than 70% of what determines the level of its IT systems is often in the hands of the architects, even though many excellent architects started out as engineers and learned their way up. Many companies have realized how important IT architecture is, which is why many have both a CTO and a CIO, two equally important positions. The beauty of architecture is that nobody notices it while the IT systems are running smoothly. But anyone who has walked through an environment crowded with "chimney" systems and chaotic architecture knows that in IT, architecture must come first and development second!

Third, investors. That is, the boss. Needless to say, the boss feeds and clothes you and you work yourself to the bone for the boss, so the boss is naturally the source of the basic requirements. The boss says let there be a mountain, and there is a mountain. The boss says do real-time data processing and analysis, and there is Storm. The boss says go open source, and there is Hadoop. The boss says do iterative mining, and there is Spark...

Fourth, scientists. In other people's eyes they are geeks, they are the elite, they are mysterious Hawking-like figures: bespectacled men and women who leave early and come home late, or sleep by day and work by night. They are the core force driving the world's technical progress.
Apart from the world's top IT companies (which often hold the direction of the world's technology in their hands), most companies need only one or two scientists. They are truly devoted to science. Don't make them think about business scenarios, business processes, cost, or project schedules. The only thing they need to think about is how to beat their rivals on some metric; a 0.1% improvement on one metric is enough to keep them fighting on without sleep. Let us all applaud and cheer for these scientists. In China, I doubt there are more than a hundred real big data scientists...

Fifth, engineers. Engineers are a lovable bunch: young, impulsive, full of ambition, and honored with titles such as "loser" and "keyboard warrior". They struggle tirelessly for their ideals, and every time they get a little ahead they find themselves wondering whether the egg pancakes at the subway entrance have gone up another 50 cents. They are sensitive and proud, and never deign to argue with business people. The difference between engineers and scientists is that engineers must change code often, test often and release often, and the final system is always a patchwork of several engineers' code. Every proud engineer who reads the system's legacy code snorts, "Hmm, what garbage", and then throws himself into writing the code that the next generation will despise in turn.

Sixth, followers. Some are trainers, some are flashy hairdressers, some are coal-mine bosses, and some are young women who have lost their way.
What characterizes them is speculation, and the only difference between them and real speculators is that they do not have to put up any money. They think anything that so much as touches data can be called big data, and some have never even laid hands on an IT system. They are masters of fishing in troubled waters and of passing themselves off as players in the band, and they are the invisible people looked down on by the first five groups. But I want to say: hype is welcome. The more fiercely an industry is hyped, the more the truly valuable people in it can show what they are worth.

Myth 2: only big data can save the world

Today's big data technologies and applications are all in areas such as data analysis and data warehousing, and mainly target OLAP (Online Analytical Processing). Technically, as I would summarize it, big data stands on two legs: one is batch data processing (MR, MPP and so on), the other is real-time stream processing (Storm, in-memory databases and so on). On top of these, some scenarios showed that neither the MR framework nor the real-time frameworks served near-line processing and iterative mining well, which gave rise to Spark, the very popular in-memory data processing framework.

The big data setup at many enterprises today looks like this. On one side, Hive and Pig on Hadoop 2.0 or later handle the underlying data processing, and data processed according to the business logic is loaded directly into application databases. On the other side, the Storm stream processing engine handles real-time data and triggers the appropriate marketing scenarios according to business marketing rules. Alongside both, a cluster based on Spark meets the demand for real-time data processing and mining.

As this description shows, big data has frankly not yet entered real transaction systems and has contributed little to OLTP (Online Transaction Processing).
As for the many articles that tie big data to the Internet of Things, ubiquitous networks and smart cities, I think big data is only one of the preconditions. The rest, such as the state of the OLTP systems, the physical network, and even the organizational structure, are all important factors.

Finally, I would also say that big data processing technologies, whether as dazzling as Google's Dataflow or as mature as Hadoop 2.0, data warehouses and Storm, are in essence data processing tools. Many engineers only need to understand the data processing flow; running fixed templates and scripts on the platform is enough. After all, more than 70% of the value of data lies in business applications. A dazzling term that does nothing for the business will end up as nothing more than the art of slaying dragons: impressive and useless. Any technology or IT architecture should follow business planning and the requirements of business development; otherwise technology will only hold back the business and productivity.
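The "two legs" described under Myth 2, batch processing and real-time stream processing, can be illustrated with a toy sketch. This is plain Python, not Hadoop, Hive or Storm code, and it only shows the difference in shape between the two: the `Event` fields, the `batch_total_by_user` and `StreamRuleEngine` names, the sample data and the marketing rule are all hypothetical and chosen purely for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class Event:
    """One record in the data flow (all fields are hypothetical)."""
    user: str
    kind: str            # e.g. "purchase" or "enter_hall"
    amount: float = 0.0


def batch_total_by_user(events: Iterable[Event]) -> Dict[str, float]:
    """Batch leg: scan the whole stored history and aggregate it in one pass."""
    totals: Dict[str, float] = defaultdict(float)
    for event in events:
        if event.kind == "purchase":
            totals[event.user] += event.amount
    return dict(totals)


class StreamRuleEngine:
    """Stream leg: inspect each event as it arrives and fire a rule at once."""

    def __init__(self, rule: Callable[[Event], bool],
                 action: Callable[[Event], str]) -> None:
        self.rule = rule
        self.action = action
        self.sent: List[str] = []   # messages triggered so far

    def on_event(self, event: Event) -> None:
        if self.rule(event):
            self.sent.append(self.action(event))


# Batch: run over accumulated history; the result would then be loaded
# into an application database.
history = [
    Event("alice", "purchase", 30.0),
    Event("bob", "purchase", 5.0),
    Event("alice", "purchase", 12.5),
]
totals = batch_total_by_user(history)

# Stream: a made-up marketing rule that reacts when a user enters a service hall.
engine = StreamRuleEngine(
    rule=lambda e: e.kind == "enter_hall",
    action=lambda e: f"SMS to {e.user}: today's offer",
)
for live_event in [Event("bob", "enter_hall"), Event("alice", "purchase", 1.0)]:
    engine.on_event(live_event)
```

Neither leg writes to a transactional store: the batch function summarizes history after the fact and the stream engine only reacts to events, which is the sense in which both sit on the analytical side rather than in OLTP.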
As times change and the great waves sift the sand, each of us in the data industry moves between roles. Today's scientist may be tomorrow's architect, today's engineer may be a scientist in a few years, and some people will end up in the ranks of the followers.

Myth 3: only a very large amount of data counts as big data

In the "data world" there is a group of people who believe that "only data at the petabyte level and above counts as big data, or even only at the zettabyte level, so the real era of big data has not yet arrived!" Whenever I hear this, I know these people have been too heavily influenced by the "Volume" in the 4V theory of a certain IOE giant. My first response is: "Better to have no books than to believe everything in them; better to get rid of IOE than to believe everything the giants say." Getting rid of IOE is not only about hardware; it also starts with daring to challenge the giants' ideas. Many classic theories in IT were put forward by the traditional giants, but as challengers, new ideas and new technologies emerge, those giants will gradually be overturned, and that is an important part of how humanity moves forward. If we stayed stuck in an age of blind faith in giants, pursuing a single concept so rigidly and dogmatically, there would be no Hadoop today, no Spark, no Tesla, no machine learning or artificial intelligence, and certainly no Nth industrial revolution to come.
First, I want to stress that big data technology is really nothing new. As I said earlier, the essence of big data is still data. The data industry has been developing for many years, and the scale of data has always exceeded what each era could imagine. A dozen or so years ago a floppy disk held just 1.44 MB, and anyone with 1 TB of data would have left bystanders gasping. So, judged by volume, would someone who had collected 1 TB of data back then have entered the era of big data? Obviously not! Data volume is therefore not the yardstick for big data. If volume decided what counts as big data, the term "big data" would be a false proposition, as absurd as reading every name literally: "tigers must be old, lads must be small, giants must have big heads, and flyers must have wings."

So what is big data? First, big data is a complete ecosystem. The generation, collection, processing, aggregation, presentation, mining and pushing of data form a closed-loop value chain, and technology applied at each link delivers valuable applications and services for business scenarios. Second, what is at the core of big data? Open source on the one hand, cost reduction on the other. The core goal of current big data technology is to meet the demand for data better with low-cost technology (especially the growing volume of unstructured data in recent years), and to save the enterprise as much investment as possible while meeting that demand.
To sum it all up in one sentence: the core idea of big data is still to meet application needs. Technology with a clear goal is called productivity; technology with no business goal is called "a waste of life".

Myth 4: big data for the sake of big data

I think this misconception is currently the most serious one. In some enterprises, technology must be the newest, the best and the flashiest; it must be internationally advanced and world-class. Enterprises of every industry, nature, region and age are all shouting "catch up with BAT; big data will help ** enterprise reach ** goal". The next step is to get rid of IOE and then invest in clusters. The high-performance mainframes bought earlier go unused, the "O" licenses already paid for are abandoned, decades of investment are written off overnight, and still more resources are poured into chasing "big data". Friends, I believe we hear about or witness this kind of waste of people and money every day. Many companies spend without counting the cost just to win a smile from the leadership. How big a misconception must that be?

To this I would say the following. First, on the technical side: BAT and many other Internet companies pursue big data because their business development requires it. Any Internet business lives on traffic and clicks, which means large volumes of unstructured data must be processed quickly. That forces Internet businesses to break the underlying data apart, process the pieces concurrently, and do it fast enough to serve their users and their market. It is the business processes and business models of Internet companies that make big data technology necessary.
By contrast, many companies simply have no use for these technologies. Some can cover everything they need with one or two Excel files and a few formulas, and their data is still processed on a monthly cycle. There is no need for big data technology there.

Second, on the investment side: Internet companies were born as commoners and simply could not afford large equipment. Even after getting rich overnight, they found that no traditional minicomputer or mainframe could keep up with their growth, so they had to find another way and create their own value chain and standards: starting from a low-investment, lightweight architecture and adding small, linear increments of hardware as the business grew. Some traditional enterprises, even giants, are in the opposite position. Their investment plans were settled a year ago, and building on their existing base would give a better ROI (return on investment). Yet to chase the slogan of big data they sacrifice a great deal of earlier investment. Beyond "the loss outweighs the gain", the only thing left to say is that their heads are full of mush.

Big data technology, like any technology, was born to meet specific business objectives. Having a clear business purpose and designing a technical architecture that fits your own business structure is the scientific and healthy way to develop. If you are a boss, CEO or investor, you must understand that for an enterprise big data technology is sometimes like water, and the enterprise's business objectives are the boat: "water can carry the boat, but it can also overturn it."
As the relations of production keep adjusting, productivity will advance through round after round, and the technologies that come after big data will keep progressing too. The trend is already starting to show: artificial intelligence technologies such as machine learning and deep learning, as well as finer-grained technical subdivisions such as "small data" and "microdata". In this flood of technology, as long as you keep a clear, business-oriented head and design your technical architecture around your own business needs, you will not be drowned by the various schools and concepts.