Develop a big data governance framework
Spark is a fast, general-purpose engine for processing massive amounts of data. As a big data processing technology, it is often compared with Hadoop.

Hadoop has become the de facto standard for big data technology, and Hadoop MapReduce is well suited to batch processing of large-scale data sets, but it still has some shortcomings, specifically:

1. Hadoop MapReduce has limited expressive power. Every computation must be expressed as Map and Reduce operations, which does not fit all scenarios and makes complex data processing pipelines hard to describe.

2. Disk I/O costs are high. Hadoop MapReduce serializes the data between each step to disk, so I/O overhead is heavy, which makes interactive analysis and iterative algorithms expensive; since almost all optimization and machine learning algorithms are iterative, Hadoop MapReduce is poorly suited to interactive analysis and machine learning.

3. Computation latency is high. Completing more complex work means chaining a series of MapReduce jobs and executing them in sequence. Each job has high latency, and a job can start only after the previous one has finished, so Hadoop MapReduce cannot handle complex, multi-stage computation well.

Spark grew out of Hadoop MapReduce technology: it inherits the advantages of distributed parallel computing while fixing many of MapReduce's shortcomings. Its specific advantages are as follows:

1. Spark provides a wide range of data set operations (more than 20 types), offers Java, Python, and Scala APIs, and supports interactive Python and Scala shells, making it more general than Hadoop MapReduce.

2. Spark provides a caching mechanism to support computations that iterate repeatedly over the same data or share data across multiple steps, reducing the I/O overhead of rereading data. Because Spark caches data in memory, interactive analysis is fast enough, and caching also speeds up iterative algorithms, which makes Spark well suited to such workloads, especially machine learning.

3. Spark computes in memory and keeps intermediate results there, which makes iterative operations much more efficient. Its directed acyclic graph (DAG) execution model supports distributed parallel computing while reducing the need to write data to disk between iterations, improving processing efficiency (see the sketch after this list).

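To make the caching and in-memory points concrete, here is a minimal Scala sketch against Spark's RDD API (not taken from the article): a data set is cached once and then reused across several iterations, so the loop avoids writing intermediate results to disk. The data, the update rule, and the local master setting are illustrative placeholders rather than a real workload.

```scala
import org.apache.spark.sql.SparkSession

object CacheIterationSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("cache-iteration-sketch")
      .master("local[*]") // local mode only for illustration
      .getOrCreate()
    val sc = spark.sparkContext

    // Cache the data set in memory once; every later pass reuses it
    // instead of rereading and re-serializing intermediate results to disk.
    val values = sc.parallelize(1 to 1000000).map(_.toDouble % 100).cache()
    val total  = values.count()

    // Beyond map and reduce, Spark offers operators such as filter, join,
    // groupByKey, union, etc. Here a toy iterative loop repeatedly filters
    // the cached RDD, the kind of access pattern iterative algorithms have.
    var threshold = 50.0
    for (_ <- 1 to 5) {
      val above = values.filter(_ > threshold).count()
      threshold += above.toDouble / total // hypothetical update rule
    }

    println(s"final threshold: $threshold")
    spark.stop()
  }
}
```
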
In addition, Spark integrates seamlessly with Hadoop: it can use YARN as its cluster manager and can read data from Hadoop storage such as HDFS and HBase, as sketched below.

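As a brief illustration of that integration, the following sketch assumes a Spark build with Hadoop support and a reachable HDFS path (the path below is a placeholder); in practice the YARN master is usually supplied at submission time rather than hard-coded.

```scala
import org.apache.spark.sql.SparkSession

object HdfsOnYarnSketch {
  def main(args: Array[String]): Unit = {
    // The cluster manager (YARN) is normally chosen when submitting the job,
    // so the code itself only needs to know where the data lives.
    val spark = SparkSession.builder()
      .appName("hdfs-on-yarn-sketch")
      .getOrCreate()

    // Read a text file straight from HDFS; Spark reuses Hadoop's input
    // formats, so existing HDFS data needs no conversion.
    val lines = spark.read.textFile("hdfs:///data/events/part-*.txt") // placeholder path
    println(s"line count: ${lines.count()}")

    spark.stop()
  }
}
```

A job like this would typically be launched with something along the lines of `spark-submit --master yarn --deploy-mode cluster --class HdfsOnYarnSketch app.jar`, letting YARN schedule the executors across the Hadoop cluster.
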
Spark has developed rapidly in recent years, and its code base is among the most active of the big data platforms and frameworks. As of this writing, the latest release is Spark 3.3.0.

Many data governance tools also build on Spark to deliver real-time, general-purpose data governance. Take the SoData data robot launched by Feidian as an example: it is an efficient data development and governance toolset with unified stream and batch processing that helps enterprises put data to use quickly.

Compared with traditional data processing pipelines, the SoData data robot implements a unified stream-and-batch data synchronization mechanism. Built on deep secondary development of the Spark and Flink frameworks, it delivers real-time and batch processing across the whole flow of data collection, integration, transformation, loading, processing, and offloading, with second-level latency (a stable, efficient average of 5 to 10 seconds) and rapid response to enterprise data application needs.

Beyond the data processing advantages of Spark itself, the SoData data robot's Spark architecture supports Spark SQL development: it can execute SQL against various data sources to generate Spark dictionary tables while debugging the SQL, and it can output arbitrary result sets to various databases. Its visual development and operations mode also greatly lowers the barrier to data development, governance, and application while improving efficiency.

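SoData's own implementation is proprietary and not shown here; the sketch below only illustrates the general Spark SQL pattern being described: register a table from one data source as a view, run SQL over it, and write the result set to a different database. All URLs, table names, and credentials are made-up placeholders, and the relevant JDBC drivers would need to be on the classpath.

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object SqlBridgeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sql-bridge-sketch").getOrCreate()

    // Register the source table as a temporary view so it can be queried with SQL.
    spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://source-host:3306/ods") // placeholder connection
      .option("dbtable", "patient_visits")                // placeholder table
      .option("user", "reader")
      .option("password", "******")
      .load()
      .createOrReplaceTempView("patient_visits")

    // Run arbitrary SQL over the registered view; the result is a DataFrame.
    val summary = spark.sql(
      """SELECT dept_id, COUNT(*) AS visit_cnt
        |FROM patient_visits
        |GROUP BY dept_id""".stripMargin)

    // Write the result set out to a different database.
    summary.write
      .format("jdbc")
      .option("url", "jdbc:postgresql://target-host:5432/dw") // placeholder connection
      .option("dbtable", "dept_visit_summary")                // placeholder table
      .option("user", "writer")
      .option("password", "******")
      .mode(SaveMode.Overwrite)
      .save()

    spark.stop()
  }
}
```
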
In the informatization project of a general hospital, the SoData data robot completed a data migration that previously took 8 to 9 hours in 5 minutes.

At present, the SoData data robot has been applied in industries such as finance, healthcare, and energy, and will continue to bring better and faster data development, governance, and application experiences to organizations across industries through technological innovation.