Why Flink will become the standard for the next generation of big data processing frameworks

By Zhang Libing

For reprints, please contact Huazhang Technology

In the current era of data surge tradition, different business scenarios have a large amount of business data generated, for these constantly generated data should be how to effectively deal with the majority of the companies faced with the problem nowadays.

With Yahoo's open source of Hadoop, more and more big data processing technologies are starting to come into view, such as Apache Spark, which is currently the more popular big data processing engine and has basically replaced MapReduce as the current standard for big data processing.

But as data continues to grow and new technologies continue to evolve, people are realizing the importance of real-time data processing, and organizations need to be able to support high throughput, low latency, and high-performance streaming technologies to deal with the growing amount of data.

Compared to traditional data processing models, streaming data processing has higher processing efficiency and cost control. Apache Flink is the open source community that has evolved in recent years to support high-throughput, low-latency, high-performance distributed processing frameworks.

Between 2010 and 2014, a research project called "Stratosphere: Information Management on the Cloud" was launched by the Technische Universit?t Berlin, Humboldt Universit?t Berlin, and the Hasso Plattner Institute.

Between 2010 and 2014, a research project called "Stratosphere: Information Management on the Cloud" was initiated by TU Berlin, Humboldt University Berlin, and the Hasso Plattner Institute, and gradually gained some visibility in the community at the time.

At the beginning of the project, the core members of the project were all from the original core members of Stratosphere, and then most of the founding members of the team left the school, **** with the founding of a company called Data Artisans, whose main business is the commercialization of Stratosphere, which is later called Flink. During the incubation period, Stratosphere was renamed Flink.

Flink is the German word for fast and responsive, which is used to reflect the speed and flexibility of streaming data processors, and the use of the brownish-red squirrel design as the Flink project logo is also based on the squirrel's flexibility and speed, which has made Flink officially accessible to community developers. Flink began to formally enter the community of developers in the line of sight.

In December 2014, the project became a top-tier project of the Apache Software Foundation, and from September 2015, the first stable version 0.9 was released, and in April 2019, version 1.8 was released, with more community developers joining in, and now Flink has more than 350 developers around the world. New features are constantly being released.

At the same time, more and more companies are using Flink globally, and some of the most famous Internet companies in China, such as Alibaba, Meituan, and DDT, are using Flink on a large scale as a distributed big data processing engine for their enterprises.

Flink has been gradually known and used in recent years, the main reason is not only because it provides real-time computing capabilities that simultaneously support high throughput, low latency, and exactly-once semantics, but also because Flink provides the ability to process batch data based on the streaming computation engine, which truly realizes batch-streaming unification, and at the same time, with the open-source of Blink by Alibaba, it greatly enhances the performance of the enterprise's distributed big data processing engine. Blink's open source, which greatly enhances Flink's support for the batch computing field.

Numerous excellent features make Flink a rising star in open source big data processing frameworks. As the domestic community continues to promote, more and more domestic companies are choosing to use Flink as a real-time data processing technology, and in the near future, Flink will become the mainstream of the enterprise internal data processing framework, and eventually become the standard for the next generation of big data data processing frameworks.

Stateful Streaming Computing will evolve with the technology and gradually become the architectural model for enterprises to build data platforms, and the open-source solution to realize this technology is Apache Flink from the community's point of view. Flink realizes high throughput, low latency, high performance, and real-time streaming computing frameworks through the implementation of Google's Dataflow streaming model.

Apache Flink is a real-time, high-performance streaming framework that implements the Google Dataflow model.

At the same time, Flink supports efficient and fault-tolerant state management. Flink maintains its state in memory or in a RockDB database, and in order to prevent the state from being lost due to system anomalies during the computation process, Flink periodically takes distributed snapshots of its state through CheckPoints to achieve persistent state maintenance, which makes it possible for the system to be used in the event of system downtime or anomalies. The system can correctly recover state even in the case of downtime or anomalies, thus ensuring that the correct results can be calculated at any time.

The evolution of the data architecture is accompanied by iterative updates to the technology. Flink has advanced architectural concepts, as well as a number of excellent features, and a well-developed programming interface, and Flink continues to introduce new features in each Release version.

For example, the Queryable State feature allows users to remotely access the state of a streaming computing task, which means that data can be queried directly from the streaming application without landing in a database, and for real-time interactive querying, you can query the state of Flink directly for the latest results. This feature is currently in Beta, but I believe that in the near future, it will become more and more perfect, then Flink will not only be used as a real-time streaming processing framework, more likely to become a set of real-time storage engine, will allow more users to benefit from the technology of stateful computing .

Flink is a distributed streaming data processing framework that combines high throughput, low latency, and high performance.

The very mature computing framework Apache Spark can only take into account high-throughput and high-performance features, in the Spark Streaming streaming computing can not do low-latency guarantee; and Apache Storm can only support low-latency and high-performance features, but can not meet the requirements of high throughput. To meet the high throughput, low latency, high performance of these three goals for distributed streaming computing framework is very important.

In the field of streaming computing, the status of the window calculation is important, but most of the current computing framework window calculation used by the system time (Process Time), but also the event transmission to the computing framework processing, the current time of the system host, Flink can support the window calculation based on the Event Time (Event Time) semantics, and the window calculation is based on the Event Time (Event Time) semantics, which can be used by the system host. Flink supports windowing based on Event Time semantics, which uses the time when the event is generated. This timing mechanism allows the data stream to compute accurate results even if the event arrives out of order or even late, while maintaining the time dimension of the event as it was originally generated, independent of network traffic or the compute framework.

In version 1.4, Flink implemented state management, which means that the intermediate results of a streaming computation are stored in memory or in the DB, and then the next event can be retrieved from the state to perform the computation without having to compute the results based on all the original data, which greatly improves the performance of the system, and at the same time reduces the time-consumption of the computation process.

This approach greatly improves the performance of the system and reduces the time consuming process.

State-based streaming is useful for streaming very large amounts of data and complex logic.

In stream processing applications, where the data is continuous, a window is needed to perform a certain range of aggregation calculations on the streaming data, such as counting how many users have clicked on a certain web page in the last minute, in which case we have to define a window that collects the data from the most recent minute and performs the calculations again on the data in this window. .

Flink divides windows into Time, Count, Session, and Data-driven window operations. Windows can be customized with flexible triggers to support complex streaming patterns, and different window operations can be used to provide feedback on real-world events, allowing users to define different window triggers to meet different needs.

Windows is a powerful tool that allows you to customize your windows to support complex streaming patterns.

Flink can run on thousands of nodes in a distributed fashion, breaking down a large computational process into smaller computational processes, and then distributing the computational processes to a single parallel node for processing.

During the execution of a task, data inconsistencies are automatically detected due to errors in event processing, such as node downtime, network transmission problems, or restart of the computing service due to user upgrades or repair issues.

In these cases, Checkpoints, which are based on distributed snapshots, are used to persistently store information about the tasks being performed, so that if a task goes down abnormally, Flink is able to automatically recover the task to ensure that the data is consistent throughout the process.

Memory management is a key consideration for every computing framework, especially for heavy computing scenarios, where data is managed in memory. For memory management, Flink implements its own memory management mechanism to minimize the impact of Full GC on the system.

In addition, by customizing the serialization/deserialization method, all objects are converted to binary and stored in memory, which reduces the size of the data storage, utilizes the memory space more efficiently, reduces the risk of performance degradation or task stoppage caused by GC, and improves the performance of distributed processing over data transfer.

Therefore, Flink is more stable than other distributed processing frameworks and will not cause the entire application to go down due to JVM GC issues.

For streaming applications that run 24/7, data is constantly being accessed, and the termination of the application within a period of time can lead to data loss or inaccurate results, such as version upgrades, downtime operations, and so on, all of which can lead to this situation.

However, it is worth mentioning that Flink uses its Save Points technology to save a snapshot of the task execution on the storage medium, and when waiting for the task to restart, it can restore the original computation state directly from the realization of the saved Save Points, so that the task can continue to run in accordance with the state prior to the shutdown, and the Save Points technology allows users to better utilize their data. The Save Points technology allows users to better manage and maintain real-time streaming applications.

At the same time, Flink has other great features in addition to the ones mentioned above that allow users to have more choices.

Flink has a lot of great features, which not only makes Flink more and more visible in the community, but also attracts a lot of companies to participate in the development and use of Flink.

About the author: Zhang Libing, senior architect, expert in streaming computing, AI project architect of Fourth Paradigm in East China, former big data architect of Millward Data in East China. He has many years of development experience in big data and streaming computing, has a very deep understanding of Hadoop, Spark, Flink and other big data computing engines, and has accumulated rich project experience.

Recommendation: From the function, principle, practice and tuning of the four dimensions of the step-by-step explanation of the use of Flink for distributed streaming application development, to guide readers from zero basic to advanced. It is suitable for: streaming computing development engineers, big data architecture engineers, big data development engineers, data mining engineers, college graduate students and senior undergraduates.

Schedule of Guizhou Talent Expo 2023

Jiangxi Youth Vocational College in which district

Is it easy for Shanghai Industrial and Commercial Bank to handle real estate mortgage loan?

What are the causes of problematic big data?

What does big screen display capability mean

Is the trip code cell phone shutdown is no longer

China New Energy Big Data Research Report

Vocational Education Baidu Marketing Platform Crowd Analysis

Is Beiqi Foton a state-owned enterprise?

Reporter's question on cross-border manipulation of Shanghai-Hong Kong Stock Connect account by Shanghai Stock Exchange