The History of Big Data Technology: The Past and Present of Big Data
The big data technology we talk about today actually originated with the three papers Google published around 2004, the famous "Troika": the distributed file system GFS, the big data distributed computing framework MapReduce, and the NoSQL database system BigTable.
As you know, a search engine mainly does two things, crawling web pages and building indexes, and both involve storing and computing huge amounts of data. The "Troika" was built precisely to solve this problem; as you can see from the introduction above, it consists of a file system, a computing framework, and a database system.
Words like distributed and big data are certainly nothing new to you today. But you have to remember that in 2004 the Internet industry was still in its infancy in this respect, and Google's papers genuinely stunned the industry for a moment: everyone suddenly realized that things could be done this way.
At that point in time, most companies were still focused on single machines, thinking about how to squeeze more performance out of one server and shopping for ever more expensive, ever more powerful hardware. Google's idea, by contrast, was to deploy a large cluster of servers, store massive amounts of data across that cluster in a distributed way, and then use every machine in the cluster to compute on the data. This way Google did not need to buy many extremely expensive servers; it only had to organize ordinary machines together to obtain enormous computing power.
Doug Cutting, a gifted programmer and the founder of the Lucene open source project, was developing the open source search engine Nutch at the time. He was very excited when he read Google's papers and, following the principles they described, built a preliminary implementation of GFS-like and MapReduce-like functionality.
Two years later, in 2006, Doug Cutting separated these big-data-related features from Nutch and started a dedicated project to develop and maintain big data technology. That project became the Hadoop we know today, which includes the Hadoop Distributed File System (HDFS) and the big data computing engine MapReduce.
This project was the first of its kind in the world.
When we look back at the history of software development, including the software we have written ourselves, we find that some software is used by no one, or by only a handful of people, after it is developed; such software actually makes up the majority of everything ever written. Other software, however, can create an entire industry, generating tens of billions of dollars of value and millions of jobs every year. That list once consisted of Windows, Linux, and Java; now Hadoop has to be added to it.
If you have time, you can briefly browse the Hadoop source code. This software, written purely in Java, contains no profound technical mysteries; it uses the most basic programming techniques and nothing out of the ordinary. Yet it has had a huge impact on society and even triggered a profound technological revolution, pushing forward the development of artificial intelligence.
I think that when we develop software, we should also spend more time thinking about where the value of that software lies. Where is it genuinely needed, and where can it deliver value? Pay attention to the business, understand the business, stay value-oriented, and use your skills to create real value for the company; that is how you realize your own worth, rather than burying your head in requirements documents all day and becoming a coding robot that does not think.
After Hadoop was released, Yahoo quickly adopted it. About a year later, in 2007, Baidu and Alibaba also started using Hadoop for big data storage and computation.
In 2008, Hadoop officially became a top-level Apache project, and Doug Cutting himself later became chairman of the Apache Software Foundation. From then on, Hadoop rose to prominence as a star of the software development world.
In the same year, Cloudera, a commercial company specializing in Hadoop, was founded, and Hadoop received further commercial support.
Around this time, some engineers at Yahoo felt that programming big data jobs directly with MapReduce was too cumbersome, so they developed Pig. Pig is a scripting language with an SQL-like syntax: developers write Pig scripts to describe the operations to be performed on large data sets, and Pig compiles the scripts into MapReduce programs that are then run on Hadoop.
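To get a feel for why hand-written MapReduce was considered such a hassle, here is a minimal word-count sketch in Python written in the Hadoop Streaming style (Hadoop Streaming is a standard Hadoop facility that runs mappers and reducers as external scripts reading stdin and writing tab-separated key/value pairs). The file names and the toy logic are my own illustrative assumptions, not anything from the original column.

```python
# mapper.py: a hypothetical Hadoop Streaming mapper that emits (word, 1) pairs
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py: a hypothetical Hadoop Streaming reducer; Hadoop delivers mapper
# output sorted by key, so equal words arrive contiguously on stdin
import sys

current_word, total = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{total}")
        current_word, total = word, 0
    total += int(count)

if current_word is not None:
    print(f"{current_word}\t{total}")
```

Even counting words takes two scripts plus a job submission; in Pig the same computation is a few lines of script, and with Hive, described next, it collapses into a single GROUP BY query.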
Writing Pig scripts was easier than programming MapReduce directly, but it still required learning a new script syntax. So Facebook released Hive, which supports big data computation using SQL syntax: you can simply write a Select statement to query data, for example, and Hive converts the SQL statement into a MapReduce computation program.
In this way, data analysts and engineers who were already familiar with databases could use big data for analysis and processing with virtually no barrier to entry. Hive greatly lowered the difficulty of using Hadoop and quickly won over developers and enterprises. In 2011, 90% of the jobs running on Facebook's big data platform came from Hive.
After that, many peripheral products began to appear around Hadoop, and a big data ecosystem gradually took shape. It included Sqoop, which specializes in importing and exporting data between relational databases and the Hadoop platform; Flume, a distributed system for collecting, aggregating, and transporting large-scale logs; and Oozie, a MapReduce workflow scheduling engine.
In the early days of Hadoop, MapReduce was both an execution engine and a resource scheduling framework: the resource scheduling and management of the server cluster was handled by MapReduce itself. But this was not conducive to resource reuse and made MapReduce very bloated, so a new project was started to separate the execution engine from resource scheduling. That project is Yarn. In 2012, Yarn became an independent project and went into operation; it was subsequently supported by all kinds of big data products and became the mainstream resource scheduling system on big data platforms.
Also in 2012, Spark, developed at UC Berkeley's AMP Lab (an acronym for Algorithms, Machines, and People), began to emerge. Matei Zaharia at the AMP Lab had found that machine learning computations ran very poorly on MapReduce, because machine learning algorithms usually require many iterations, and MapReduce starts a new job for every round of Map and Reduce computation, which brings a great deal of unnecessary overhead. Another point is that MapReduce mainly uses disk as its storage medium, whereas by 2012 memory had overcome its capacity and cost limitations and become the main storage medium during computation. Once Spark was released it was immediately embraced by the industry, and it gradually replaced MapReduce in enterprise applications.
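To make the iteration problem concrete, here is a minimal PySpark sketch. The HDFS path and the toy one-dimensional gradient descent are hypothetical and purely for illustration; the point is that the dataset is read and cached once, and every later iteration reuses the in-memory data instead of launching a fresh job that rereads it from disk, which is what a MapReduce implementation would have to do.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-sketch").getOrCreate()
sc = spark.sparkContext

# Load once and keep the RDD in memory; with MapReduce each pass below
# would be a separate job re-reading the same data from disk.
points = sc.textFile("hdfs:///tmp/points.txt").map(float).cache()

w = 0.0
for _ in range(10):
    # one gradient step toward the mean of the cached data
    grad = points.map(lambda x: w - x).mean()
    w -= 0.5 * grad

print("estimate after 10 iterations:", w)
spark.stop()
```

This reuse of cached data across iterations is exactly why Spark outperformed MapReduce so dramatically on machine learning workloads.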
Generally speaking, the business scenarios handled by computing frameworks such as MapReduce and Spark are called batch computing, because they usually process the data accumulated over a period of days in a single run and take a few minutes or longer to produce the required result. Because the data being computed is not real-time data generated online but historical data, this kind of computation is also called big data offline computing.
In the big data field there is another class of application scenarios that require large amounts of data generated in real time to be computed immediately, for example face recognition and suspect tracking across the surveillance cameras of an entire city. This kind of computation is called big data streaming computing, and streaming frameworks such as Storm, Flink, and Spark Streaming exist to serve exactly these scenarios. Because the data processed by streaming computing is generated online in real time, it is also called big data real-time computing.
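As a small taste of the streaming style, here is a minimal Spark Streaming (DStream API) sketch in Python that counts words arriving on a local socket in five-second micro-batches; the host, port, and batch interval are arbitrary choices made for this example.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-sketch")
ssc = StreamingContext(sc, 5)  # 5-second micro-batches

# Count words from text arriving on localhost:9999 as it is produced.
lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()

ssc.start()
ssc.awaitTermination()
```

Structured Streaming and Flink express the same idea with different APIs, but the shape of the computation, continuous small results over an unbounded input, is the same.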
In a typical big data business scenario, the most common practice is to use batch processing for the full set of historical data and streaming computation for the data newly added in real time. A compute engine like Flink can support both streaming and batch computation.
Besides big data batch and streaming processing, NoSQL systems, which also mainly deal with the storage and access of massive amounts of data, are categorized as big data technology as well. NoSQL was very hot around 2011, and many excellent products emerged, such as HBase and Cassandra; HBase, in particular, is an HDFS-based NoSQL system that was split out of Hadoop.
Looking back at the history of software development, you will find that products with similar features tend to appear at very close points in time: Linux and Windows both emerged in the early 1990s, the various Java MVC frameworks were developed during roughly the same period, and Android and iOS were launched one right after the other. Around 2011, all kinds of NoSQL databases were likewise appearing one after another; I was involved in developing Alibaba's own NoSQL system at the time.
Everything develops according to its own trends and laws. When you find yourself in the middle of a trend, seize the opportunity it offers and find a way to stand out; even if you do not succeed, you will gain a deeper insight into the pulse of the times and harvest valuable knowledge and experience. But if the tide has already receded and you only start working in that direction then, you will reap little besides confusion and frustration, which helps neither the times nor yourself.
But the waves of the times are like waves on a beach: one always follows another. As long as you are standing on the shore, in the middle of this industry, the next wave will soon arrive. You need to observe keenly and deeply, skim off the impetuous froth, seize the opportunities of the real trend, and fight for them; then, regardless of success or failure, you will have no regrets.
As the saying goes, advance with the logic of history and develop with the trend of the times. In plain terms, fly with the wind at your back.
Everything I have talked about above can basically be categorized as big data engines or big data frameworks. The main application scenarios of big data processing are data analysis, data mining, and machine learning. Data analysis is mainly done with SQL engines such as Hive and Spark SQL; data mining and machine learning rely on dedicated machine learning frameworks such as TensorFlow, Mahout, and MLlib, which have the major machine learning and data mining algorithms built in.
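As a quick illustration of the data-analysis side, here is a minimal Spark SQL sketch in Python. The log path and the dt and user_id column names are hypothetical, but the pattern of registering a dataset as a table and querying it with plain SQL is the same one that Hive popularized.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-analysis-sketch").getOrCreate()

# Hypothetical dataset: JSON access logs with at least dt and user_id fields.
logs = spark.read.json("hdfs:///tmp/access_logs")
logs.createOrReplaceTempView("access_logs")

# Daily page views and unique visitors, expressed as ordinary SQL.
daily = spark.sql("""
    SELECT dt,
           COUNT(*)                AS page_views,
           COUNT(DISTINCT user_id) AS unique_visitors
    FROM access_logs
    GROUP BY dt
    ORDER BY dt
""")
daily.show()
spark.stop()
```

A query this simple would look essentially the same in Hive; the difference lies mainly in which engine executes it.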
In addition, for big data to be stored in a distributed file system (HDFS), for MapReduce and Spark jobs to be scheduled and executed in an orderly manner, and for the execution results to be written back to the databases of the various application systems, we also need a big data platform that integrates all of these big data components with the enterprise's applications.
The frameworks, platforms, and related algorithms shown in the figure together make up the technical system of big data. I will analyze them one by one later in this column, to help you build a complete body of knowledge about the principles of big data technology and its applied algorithms, so that you can either take up big data development as a full-time job or, at the very least, integrate big data more effectively into your own application development and stay in control of your own projects.
I hope this helps you! ~