This article is adapted mainly from a talk by Mo Wen, a senior technical expert in Alibaba's Computing Platform Business Unit, at the Yunqi Conference.
A towering tree grows from a tiny sprout
With the advent of the AI era and the explosion of data volumes, the most common approach to data processing in typical big data business scenarios is to use batch technology to process full-volume data and stream computing to process incremental data in real time. In most business scenarios, the user's business logic is the same in batch and stream processing; however, the two compute engines users rely on for batch and for streaming are different.
As a result, users often need to write two sets of code, which imposes extra burden and cost. Alibaba's merchandise data processing frequently has to handle two different business flows, incremental and full-volume, so Alibaba asked: could a single, unified big data engine let users develop just one set of code for their business logic? That one program would then cover every scenario: full data, incremental data, and real-time processing alike. This is the background and the original motivation behind Alibaba's choice of Flink.
Currently, there are many open source options for stream computing engines, such as Storm, Samza, Flink, and Kafka Streams, and likewise for batch processing, such as Spark, Hive, Pig, and Flink. But for a compute engine that supports both streaming and batch processing, there are only two choices: Apache Spark and Apache Flink.
Both options were weighed comprehensively, across technology, ecosystem, and other aspects. Spark's technical approach is to simulate stream computation on top of batch processing; Flink, on the other hand, is the complete opposite: it simulates batch computation on top of stream computing.
From the direction of technical development, using batch to simulate streams has certain limitations that may be difficult to break through, whereas Flink's approach of simulating batch on top of streams has better technical scalability. Looking at the long term, Alibaba decided to use Flink to build a unified, general-purpose big data engine as its choice for the future.
Flink is a low-latency, high-throughput, unified big data compute engine. In Alibaba's production environment, the Flink computing platform achieves millisecond latency while processing hundreds of millions of messages or events per second. At the same time, Flink provides exactly-once consistency semantics, which guarantees data correctness and lets the Flink engine deliver financial-grade data processing.
Flink's current situation at Alibaba
The platform Alibaba built on Apache Flink formally went live in 2016, starting with the two major scenarios of search and recommendation. Today, all of Alibaba's businesses, including all of its subsidiaries, have adopted the real-time computing platform built on Flink. The platform runs on top of open source Hadoop clusters, with YARN handling resource management and scheduling and HDFS serving as data storage, so Flink connects seamlessly to the open source Hadoop big data stack.
Currently, this Flink-based real-time computing platform not only serves Alibaba Group internally, but also provides Flink-based cloud product support to the entire developer ecosystem through Alibaba Cloud's product APIs.
How has Flink fared in large-scale adoption at Alibaba?
Scale: scale is an important indicator of whether a system is mature. Flink initially launched at Alibaba with only a few hundred servers and has since grown to tens of thousands, a scale that few deployments in the world reach.
State data: the state data accumulated internally on the basis of Flink has reached petabyte scale;
Events: more than a trillion records are processed on Flink's computing platform every day;
QPS: during peak periods the platform handles more than 472 million events per second; the most iconic application scenario is the real-time big screen for Alibaba's Double 11.
Flink's development path
Next, from an open source technology perspective, let's look at how Apache Flink was born and how it grew, how Alibaba came in at a critical point in that growth, and what contributions and support it has made to the project.
Flink was born out of StratoSphere, a European big data research project at the Technical University of Berlin. Early on, Flink focused on batch computing, but in 2014 the core members of StratoSphere spun Flink out, donated it to Apache the same year, and it later became a top-level Apache big data project. At that point the mainstream direction of Flink was positioned as streaming: using stream computation to perform all big data computation. That is the background against which Flink was born.
In 2014, Flink began to emerge in the open source big data industry as an engine focused on stream processing. What distinguishes it from Storm, Spark Streaming, and other streaming engines is that it is not only a high-throughput, low-latency compute engine, but one that also provides many advanced features: stateful computation, state management, strongly consistent data semantics, and Event Time with WaterMark for handling out-of-order messages.
Flink core concepts and basic ideas
What sets Flink apart from other streaming engines is state management.
What is state? Suppose you are developing a stream processing system or task, and you often need to compute aggregates over the data, such as Sum, Count, Min, and Max. These values need to be stored because they are continuously updated, and such values or variables can be understood as a kind of state. If the data source is reading from Kafka or RocketMQ, you may also want to record how far you have read, and record the offset; these offset variables are state as well.
Flink provides built-in state management, storing state inside Flink rather than in an external system. The benefits are, first, that it reduces the compute engine's dependence on external systems and their deployment, making operations and maintenance simpler; second, a significant performance improvement: accessing state externally, in Redis or HBase for example, has to go over the network via RPC, whereas Flink accesses these variables within its own process. At the same time, Flink periodically persists this state via checkpoints, storing the checkpoints in a distributed durable system such as HDFS. Whenever a Flink task fails, it restores the state of the entire stream from the most recent checkpoint and then resumes processing, with no impact on the user's data.
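To make the idea concrete, here is a minimal, hypothetical sketch in Python (deliberately not Flink's actual API): an operator keeps its aggregates and source offset as managed state, snapshots them periodically, and restores the latest snapshot after a failure. All class and method names are invented for illustration.

```python
# Illustrative sketch (not Flink's API): a stateful operator whose aggregates
# and source offset are managed state, with periodic checkpoints and restore.
import copy

class StatefulCounter:
    def __init__(self):
        # Managed state: aggregates plus the source read position (offset).
        self.state = {"count": 0, "sum": 0, "offset": -1}
        self._checkpoints = []          # stand-in for durable storage (e.g. HDFS)

    def process(self, offset, value):
        self.state["count"] += 1
        self.state["sum"] += value
        self.state["offset"] = offset   # remember how far we have read

    def checkpoint(self):
        # Persist a consistent snapshot of all state.
        self._checkpoints.append(copy.deepcopy(self.state))

    def restore(self):
        # After a failure, resume from the most recent snapshot.
        self.state = copy.deepcopy(self._checkpoints[-1])
        return self.state["offset"]     # the source replays from offset + 1

op = StatefulCounter()
for off, v in enumerate([3, 5, 7]):
    op.process(off, v)
op.checkpoint()                         # snapshot after offset 2
op.process(3, 100)                      # this record's effect is lost on failure
resume_from = op.restore()              # back to {count: 3, sum: 15, offset: 2}
```

Because the offset is part of the same snapshot as the aggregates, replaying the source from `resume_from + 1` neither loses nor double-counts records.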
How does Flink recover from a checkpoint without losing or duplicating any data? How does Flink ensure the computation remains accurate?
The reason is that Flink uses the classic Chandy-Lamport algorithm. Its core idea is to treat the streaming computation as a streaming topology and to periodically inject special barriers at the head of the topology, the Source, broadcasting them from upstream to downstream. When a node has received the barriers from all of its inputs, it takes a snapshot of its state; once every node has taken a snapshot, the whole topology has completed one consistent checkpoint. Whatever failure then occurs, the system recovers from the most recent checkpoint.
Flink uses this classic algorithm to guarantee strongly consistent semantics; this is the core difference between Flink and other, stateless stream computing engines.
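A heavily simplified sketch of the barrier idea, assuming a single-input chain of two operators (real Flink must additionally align barriers arriving on multiple inputs):

```python
# Simplified barrier-snapshot sketch: a barrier flows through the stream, and
# the snapshot it triggers reflects exactly the records that precede it.
BARRIER = object()

def run_chain(events):
    node_state = {"a": 0, "b": 0}       # state of two chained operators
    snapshots = []                      # completed checkpoints
    for item in events:
        if item is BARRIER:
            # Each operator snapshots as the barrier passes through it; once
            # all operators have snapshotted, the checkpoint is complete.
            snapshots.append(dict(node_state))
        else:
            node_state["a"] += item     # operator a: running sum
            node_state["b"] += 1        # operator b: running count
    return node_state, snapshots

final, snaps = run_chain([1, 2, BARRIER, 3])
# The checkpoint captures the state after records 1 and 2, before record 3.
```

On failure, every operator would reload its part of `snaps[-1]` and the source would replay only records after the barrier, which is what yields exactly-once state.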
Next, how does Flink solve the out-of-order problem? Consider the Star Wars films as an analogy: watched in release order, the story appears to jump around; watched in the order of the in-story timeline, it is coherent.
Stream computing faces something very similar. The time at which messages arrive at the processor is not the same as the time at which the events actually happened at the source, for example in an online system's log. In stream processing we want to handle messages in the order in which they actually occurred at the source, not the order in which they happen to arrive at the program. Flink provides Event Time, WaterMark, and other advanced mechanisms to solve this out-of-order problem, so that users can process messages in order. This is a very important Flink feature.
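The following toy Python sketch, which is not Flink's API, shows the role a watermark plays: out-of-order events are buffered, and an event is released in event-time order only once the watermark guarantees that nothing older can still arrive. The `max_delay` heuristic for advancing the watermark is an assumption made for illustration.

```python
# Illustrative watermark sketch: the watermark asserts "no event with
# timestamp <= T will still arrive", so buffered events up to T are safe
# to emit in event-time order.
import heapq

def process_with_watermark(stream, max_delay):
    buffer, emitted = [], []
    for ts, value in stream:            # (event_time, payload), arrival order
        heapq.heappush(buffer, (ts, value))
        watermark = ts - max_delay      # simple heuristic watermark
        while buffer and buffer[0][0] <= watermark:
            emitted.append(heapq.heappop(buffer))
    while buffer:                       # end of stream: flush the rest in order
        emitted.append(heapq.heappop(buffer))
    return emitted

# Events arrive out of order (event times 1, 3, 2, 5, 4) ...
out = process_with_watermark([(1, "a"), (3, "b"), (2, "c"), (5, "d"), (4, "e")],
                             max_delay=2)
# ... but are emitted sorted by event time.
```

A larger `max_delay` tolerates more disorder at the cost of latency; this latency/completeness trade-off is exactly what watermark strategies tune.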
That covers the core ideas and concepts Flink launched with, which form the first stage of Flink's development. The second stage runs from 2015 to 2017, the period of both Flink's growth and Alibaba's involvement. The story begins with a survey we conducted in the search division in mid-2015. At that time, Alibaba had its own batch processing and stream computing technologies, both self-developed and open source; but in order to think through the direction and future trends of the next generation of big data engines, we researched a great deal of new technology.
Combining the results of all that research, we finally concluded that a converged batch-and-stream compute engine, one that addresses general big data computing needs, is the direction in which big data technology is developing, and in the end we chose Flink.
But Flink in 2015 was not yet mature enough; neither its scale nor its stability had been tested in practice. In the end we decided to maintain a Flink branch within Alibaba and make extensive modifications and improvements to Flink so that it could handle Alibaba's massive-scale business scenarios. In this process, our team not only improved and optimized Flink's performance and stability, but also made major innovations and improvements to its core architecture and functionality, and contributed them to the community, for example: Flink's new distributed architecture, the incremental checkpoint mechanism, the credit-based network flow control mechanism, and Streaming SQL.
Alibaba's contribution to the Flink community
Let's take two design examples. The first is that Alibaba refactored Flink's distributed architecture, clearly layering and decoupling Flink's job scheduling from its resource management. The primary benefit is that Flink can run natively on a variety of open source resource managers; after this architectural improvement, Flink runs natively on both Hadoop YARN and Kubernetes, the two most common resource management systems. The refactoring also changed Flink's task scheduling from centralized to distributed, allowing Flink to support larger clusters and achieve better resource isolation.
The other is the incremental checkpoint mechanism. Because Flink provides stateful computation with a periodic checkpoint mechanism, as internal state grows and checkpoints keep being taken, the checkpoints get larger and larger, until eventually they may fail to complete at all. With incremental checkpoints, Flink automatically identifies which data has changed and persists only the modified data. Checkpointing therefore does not become harder and harder over time, and the performance of the whole system remains smooth. This is a very significant feature we contributed to the community.
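The contrast can be sketched in a few lines of illustrative Python (not Flink's actual implementation): the state tracks which keys were modified since the last checkpoint, and only that delta is persisted.

```python
# Illustrative sketch: a full checkpoint would copy the entire state, while an
# incremental checkpoint persists only the keys modified since the last one.
class IncrementalState:
    def __init__(self):
        self.state = {}
        self.dirty = set()              # keys changed since last checkpoint
        self.persisted = {}             # stand-in for durable storage

    def put(self, key, value):
        self.state[key] = value
        self.dirty.add(key)

    def checkpoint(self):
        delta = {k: self.state[k] for k in self.dirty}
        self.persisted.update(delta)    # write only the delta
        self.dirty.clear()
        return delta                    # size tracks changes, not total state

s = IncrementalState()
for k in range(1000):
    s.put(k, k)
s.checkpoint()                          # first checkpoint writes 1000 entries
s.put(7, -1)
delta = s.checkpoint()                  # second checkpoint writes just 1 entry
```

The second checkpoint's cost is proportional to the one modified key, not to the thousand keys of total state, which is why checkpoint time stays flat as state grows.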
Through the refinement of Flink Streaming from 2015 to 2017, the Flink community matured and Flink became the dominant compute engine in the streaming space. Because Flink set out from the very beginning to be a unified stream-and-batch big data engine, it began this work in 2018. To achieve this goal, Alibaba proposed a new unified API architecture and a unified SQL solution. At the same time, with the various streaming features now well developed, we believe batch computing also needs improvement in many respects, whether in the task scheduling layer, the data shuffle layer, fault tolerance, or ease of use.
For reasons of space, the following shares two main points:
● Unified API Stack
● Unified SQL solution
First, let's look at the status quo of Flink's API stack. Developers who have studied or used Flink will know that Flink has two basic API sets: one is DataStream, the other is DataSet. The DataStream API serves streaming users and the DataSet API serves batch users, but the execution paths of these two APIs are completely different; they even generate different Tasks to execute. This conflicts with the goal of a unified API; it is imperfect and not the final solution. The first thing to build on top of the runtime is therefore a unified, converged base API layer for batch and stream, which is where we want to unify the API.
Therefore, in the new architecture we adopt a DAG (directed acyclic graph) API as the unified batch-stream API layer. With this directed acyclic graph, batch computation and stream computation no longer need to be expressed separately. Developers only need to define different attributes on different nodes and edges to indicate whether the data is stream-flavored or batch-flavored. The whole topology becomes one unified semantic expression that can integrate batch and stream; the computation no longer distinguishes between stream and batch, but simply expresses its own requirements. With this API, Flink's API stack will be unified.
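As a rough illustration of "write the logic once, attach stream or batch as an execution attribute", here is a hypothetical Python sketch; it is not the actual unified API, and the `pipeline`/`mode` names are invented.

```python
# Illustrative sketch: one pipeline definition, with stream vs. batch as an
# attribute of execution rather than a separate API.
def pipeline(records, mode):
    """Sum of squares, written once; `mode` is a property of the execution."""
    transform = lambda x: x * x
    if mode == "batch":                 # bounded input: one-shot result
        return sum(transform(r) for r in records)
    elif mode == "stream":              # unbounded view: running results
        total, emitted = 0, []
        for r in records:
            total += transform(r)
            emitted.append(total)       # continuously updated result
        return emitted[-1]
    raise ValueError(mode)

data = [1, 2, 3]
batch_out = pipeline(data, "batch")
stream_out = pipeline(data, "stream")   # same logic, same final answer
```

The point is that the business logic (`transform` and the aggregation) appears exactly once; only the execution attribute differs.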
Beyond the unified base API layer and unified API stack, the upper layer needs an equally unified SQL solution. SQL over streams and batches can be thought of as follows: there is a data source for stream computation and a data source for batch computation, and both sources can be modeled as data tables. The streaming source is a table that is continuously updated, while the batch source can be seen as a relatively static table that is not updated. The entire data processing job can then be treated as a SQL Query, and the final result it produces can likewise be modeled as a result table.
For stream computing, the result table is continuously updated; for batch processing, the result table is the equivalent of a table updated once. Expressed in unified SQL semantics, streams and batches can be harmonized. Moreover, streaming SQL and batch SQL can share the same Query for reuse: both stream and batch can be optimized and parsed with the same Query, and many stream and batch operators are even reusable.
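A small illustrative experiment, using SQLite purely as a stand-in SQL engine, shows the idea: the same Query computes a one-shot result over a static table and a continuously updated result over a growing table, and the final results agree. The table and column names here are invented for the example.

```python
# Illustrative sketch: one SQL Query over a static (batch) table and over a
# continuously updated (stream-like) table yields the same final result table.
import sqlite3

QUERY = "SELECT user, SUM(amount) FROM orders GROUP BY user ORDER BY user"
rows = [("alice", 10), ("bob", 5), ("alice", 7)]

def batch_result():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (user TEXT, amount INT)")
    db.executemany("INSERT INTO orders VALUES (?, ?)", rows)  # static source
    return db.execute(QUERY).fetchall()

def stream_result():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (user TEXT, amount INT)")
    result = None
    for row in rows:                    # continuously updated source table
        db.execute("INSERT INTO orders VALUES (?, ?)", row)
        result = db.execute(QUERY).fetchall()  # continuously updated result
    return result
```

In the streaming case each intermediate `result` is a valid "result table so far"; the last one matches the batch answer, which is the convergence the unified SQL semantics rely on.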
Flink's future direction
First of all, Alibaba still wants to build, on the essential strengths of Flink, an all-in-one unified big data compute engine, and to land it in real ecosystems and scenarios. Flink is already a mainstream stream computing engine, and many Internet companies have reached a consensus: Flink is the future of big data and the best stream computing engine. The next important step is for Flink to make a breakthrough in batch computing, to land in more scenarios and become a mainstream batch engine, and then to move toward seamless switching between stream and batch, until the boundary between them grows increasingly blurred. With Flink, a single computation can encompass both stream and batch processing.
The second direction is broader language support in Flink's ecosystem, not just Java and Scala, but also languages such as Python and Go, and languages used for machine learning. In the future, we hope developers can use a richer set of languages to develop Flink computing tasks, describe their computation logic, and connect to more ecosystems.
Finally, AI. A large share of today's big data computing demand and data volume supports the very hot AI scenarios, so on top of completing Flink's stream-batch ecosystem, Flink will keep moving up the stack, improving its upper-layer machine learning algorithm libraries while also integrating with mature machine learning and deep learning systems. For example, with TensorFlow on Flink, the ETL data processing of big data can be integrated with machine learning feature computation and training computation, letting developers enjoy the benefits of multiple ecosystems at once.