To distinguish Hadoop from Storm, this section answers the following questions:
1. How do Hadoop and Storm operate?
2. Why is Storm called a streaming computing system?
3. What scenarios is Hadoop suited to, and when should it be used?
4. What is throughput?
First, the overall picture: Hadoop is disk-level computation; the data sits on disk during computation, so it must be read from and written to disk. Storm is memory-level computation; data flows directly into memory over the network. Reading and writing memory is orders of magnitude faster than reading and writing disk: according to the Harvard CS61 courseware, disk access latency is roughly 75,000 times that of memory access. So Storm is faster.
Two clarifications:
1. Latency refers to the time from data generation to the availability of the result; "fast" should mainly refer to this.
2. Throughput refers to the amount of data the system processes per unit time.
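The two metrics can be made concrete with a small sketch; the timestamps below are invented for illustration, not measurements of either system:

```python
# Latency vs. throughput, with hypothetical numbers.
# Latency: time from a record's creation until its result is available.
# Throughput: records processed per unit of time.

records = [
    # (created_at, result_ready_at) in seconds, invented for illustration
    (0.0, 0.4),
    (1.0, 1.5),
    (2.0, 2.3),
]

latencies = [done - created for created, done in records]
avg_latency = sum(latencies) / len(latencies)

# Throughput over the span from the first creation to the last result.
window = max(done for _, done in records) - min(created for created, _ in records)
throughput = len(records) / window  # records per second

print(f"average latency: {avg_latency:.2f} s")
print(f"throughput: {throughput:.2f} records/s")
```

A system can have low latency and low throughput at the same time, which is exactly the trade-off described below.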
Storm's direct network transmission and in-memory computation necessarily give it much lower latency than Hadoop's transmission through HDFS. When the computation model suits stream processing, Storm's streaming also saves the data-collection time of batch processing; and because a Storm job runs as a long-lived service, it saves job-scheduling latency as well. So in terms of latency, Storm is faster than Hadoop.
In principle:
Hadoop MapReduce is based on HDFS: it must split the input data, generate intermediate data files, sort, compress, and replicate data, so its efficiency is low.
Storm is based on ZeroMQ, a high-performance messaging library, and does not persist data.
Why is Storm faster than Hadoop? Consider an application scenario.
In a typical scenario, thousands of log producers generate log files, which need some ETL processing before being stored in a database.
With Hadoop, the logs must first be saved to HDFS, cut at a granularity of, say, one file per minute (that granularity is already extremely fine; any finer and HDFS fills up with small files). By the time Hadoop starts computing, a minute has already passed; scheduling the tasks takes roughly another minute; then the job runs. Even assuming plenty of machines so the job finishes quickly, and assuming the database writes also take little time, well over two minutes have elapsed from data generation to usable results.
In streaming computation, a program continuously monitors log generation; each newly produced line is sent through a transport system to the streaming system, which processes it directly and writes the result to the database once done. With sufficient resources, each record can go from generation to database write within milliseconds.
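A minimal sketch of the shape of that pipeline, simulated in plain Python; the tail-reader, ETL step, and database below are stand-ins invented for illustration, not Storm APIs:

```python
def log_source(lines):
    """Stand-in for a tailer that emits log lines as they are produced."""
    for line in lines:
        yield line

def etl(line):
    """Stand-in ETL step: parse a 'user,action' line into a record."""
    user, action = line.strip().split(",")
    return {"user": user, "action": action}

def run_pipeline(lines, db):
    """Process each line the moment it appears: no batching, no job scheduling."""
    for line in log_source(lines):
        record = etl(line)   # transform immediately
        db.append(record)    # stand-in for the database write

db = []
run_pipeline(["alice,login", "bob,click"], db)
print(db)
```

The point of the shape is that every record traverses the whole pipeline individually, so per-record latency is bounded by processing time alone.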
Now consider another scenario:
If you run a word count over a large file as a stream on Storm, and only read the results once all the existing data has been processed, then you can compare it with Hadoop; but at that point the concern is no longer latency, it is throughput.
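The word-count scenario can be sketched both ways; when all data must be consumed before results are read, the per-record streaming update yields the same final answer as the batch pass, and only the latency profile differs:

```python
from collections import Counter

data = ["the quick fox", "the lazy dog", "the fox"]

# Batch style: one pass over the complete data set (Hadoop-like).
batch_counts = Counter(word for line in data for word in line.split())

# Streaming style: update counts record by record (Storm-like).
stream_counts = Counter()
for line in data:                # each line arrives as an event
    for word in line.split():
        stream_counts[word] += 1

assert batch_counts == stream_counts  # identical final result
print(stream_counts["the"])
```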
-
The most important point: Hadoop uses disk as the medium for intermediate exchange, while Storm's data always flows in memory.
They target different domains: one is batch processing, based on task scheduling; the other is real-time processing, based on streams.
To use water as an analogy, Hadoop is like bottled water, moved bucket by bucket; Storm is like a water pipe: connect the topology in advance, turn on the tap, and the water flows out continuously.
-
Nathan Marz, the lead engineer of Storm, put it this way: Storm makes it easy to write and scale complex real-time computations on a computer cluster; Storm is to real-time processing what Hadoop is to batch processing. Storm guarantees that every message is processed, and it is fast: on a small cluster, millions of messages can be processed every second. Better still, you can develop in any programming language.
The main features of Storm are as follows:
1. Simple programming model. Just as MapReduce reduces the complexity of parallel batch processing, Storm reduces the complexity of real-time processing.
2. Multiple programming languages. Clojure, Java, Ruby, and Python are supported by default; to add support for another language, you only need to implement Storm's simple communication protocol.
3. Fault tolerance. Storm manages failures of worker processes and nodes.
4. Horizontal scalability. Computation runs in parallel across multiple threads, processes, and servers.
5. Reliable message processing. Storm guarantees that each message is fully processed at least once; when a task fails, it retries the message from the message source.
6. Fast. The system is designed so that messages are processed quickly, using ZeroMQ as the underlying message transport.
7. Local mode. Storm has a "local mode" that fully simulates a Storm cluster in-process, letting you develop and unit-test quickly.
-
Generally speaking, at the same resource cost, Storm's latency is lower than MapReduce's, but its throughput is also lower. Storm is a typical stream computing system, and MapReduce is a typical batch processing system. The following is the processing flow of stream computing and batch processing systems.
This data processing flow can be roughly divided into three stages:
1. Data acquisition and preparation
2. Data computation (intermediate storage is involved in the computation); the "which aspects decide it" in the original question mainly refers to the processing mode at this stage.
3. Presentation of data results (feedback)
1) In the data collection stage, the typical strategies today are: the data generally comes from page instrumentation and parsed DB logs. Stream computing collects the data into message queues (such as Kafka, MetaQ, TimeTunnel), while batch processing systems usually collect it into a distributed file system (such as HDFS), though some also use message queues. For now, call both the message queue and the file system "pre-processing storage." The two differ little in latency and throughput at this stage; the big difference appears between pre-processing storage and the computation stage. Stream computing usually reads data from the message queue in real time and computes immediately, whereas a batch system accumulates a large batch and then imports it into the computing system (Hadoop), so the latency differs.
2) In the data computation stage, the low latency of a stream computing system (Storm) comes mainly from the following aspects (addressing the question asked).
A: Storm's processes are long-lived, so data can be processed in real time.
MapReduce, by contrast, starts only after a batch of data has been saved: the job management system launches the job, the JobTracker computes the task assignment, and the TaskTrackers start the corresponding processing.
B: Data between Storm's computing units is transmitted directly over the network (ZeroMQ).
In MapReduce, a map task's results must be written to disk before the reduce task pulls them over the network for processing. Relatively speaking, the more disk reads and writes, the slower the job.
C: For complex computations, Storm's computation model directly supports DAGs (directed acyclic graphs), while MapReduce must compose them from multiple MR jobs, some of whose map phases are redundant.
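What "directly supports DAGs" means can be sketched with a hypothetical four-step processing graph executed in dependency order; Storm wires such a graph up once as a topology, whereas MapReduce would express it as chained jobs:

```python
# Hypothetical processing DAG: parse -> (clean, enrich) -> join.
# Each node maps to the names of the nodes it depends on.
dag = {
    "parse":  [],
    "clean":  ["parse"],
    "enrich": ["parse"],
    "join":   ["clean", "enrich"],
}

def topo_order(graph):
    """Return the nodes so every node comes after all of its dependencies."""
    order, seen = [], set()
    def visit(node):
        if node in seen:
            return
        for dep in graph[node]:
            visit(dep)
        seen.add(node)
        order.append(node)
    for node in graph:
        visit(node)
    return order

print(topo_order(dag))
```

In a chained-MR encoding, each edge of this graph would pay the full map/shuffle/reduce cost, which is where the redundant map phases come from.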
3) Presentation of results
Stream computing generally feeds its results directly into the final result store (a display page's index, a database, a search engine), while MapReduce usually imports results into the result store in one batch after the whole job finishes.
In practice there is no hard boundary between stream computing and batch processing systems. For example, Storm's Trident also has a notion of batches, and MapReduce can shrink the data set of each run (for example, launching every few minutes); Facebook's Puma is a Hadoop-based streaming computing system.
Second, a comparison of the high-performance parallel computing engines Storm and Spark.
Spark's idea is that when the data is huge, moving the computation to the data is more efficient than moving the data to the computation. Each node stores (or caches) its data set, and tasks are submitted to the nodes holding the data.
So this moves the process to the data. It is similar to Hadoop MapReduce, except that Spark makes aggressive use of memory to avoid I/O, so iterative algorithms (where one round's output is the next round's input) perform better.
Shark is just a Spark-based query engine (supporting ad-hoc analysis and queries).
Storm's architecture is the opposite of Spark's. Storm is a distributed stream computing engine: each node implements a basic computation step, and data items flow into and out of a network of interconnected nodes. In contrast to Spark, this moves the data to the process.
Both of these frameworks are used for parallel computing with large amounts of data.
Storm is better at dynamically processing large numbers of freshly generated "small data blocks" (for example, computing aggregate functions or analytics in real time over the Twitter data stream).
Spark works on existing data sets (such as Hadoop data) that have already been imported into the Spark cluster. With memory-based management, Spark scans quickly and minimizes the global I/O of iterative algorithms.
However, Spark's streaming module is similar to Storm (both are stream computing engines), though not identical.
Spark Streaming first collects data into small batches and then distributes each batch as an immutable data block, while Storm processes and distributes every record in real time, the moment it arrives.
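The two dispatch styles can be contrasted in a few lines; this is a plain-Python simulation, not either framework's API, and the batch size of 2 is arbitrary:

```python
def micro_batches(events, batch_size):
    """Spark Streaming style: group events into small immutable batches first."""
    batch = []
    for e in events:
        batch.append(e)
        if len(batch) == batch_size:
            yield tuple(batch)   # tuple: the batch is treated as immutable
            batch = []
    if batch:
        yield tuple(batch)       # flush the final partial batch

def per_record(events):
    """Storm style: hand each event onward the moment it arrives."""
    for e in events:
        yield e

events = [1, 2, 3, 4, 5]
print(list(micro_batches(events, 2)))  # [(1, 2), (3, 4), (5,)]
print(list(per_record(events)))        # [1, 2, 3, 4, 5]
```

With micro-batching, the first event waits for its batch to fill (or for a timer) before it is processed; per-record dispatch never introduces that wait, which is the latency difference described above.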
I am not sure which approach has the advantage in throughput, but Storm's computation latency is smaller.
To sum up, Spark and Storm are opposite in design, while Spark Streaming is similar to Storm; the difference is that Spark Streaming has a built-in data sliding window, whereas with Storm you must maintain such a window yourself.
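Maintaining such a window by hand, as one would in Storm, can be sketched with a bounded deque; the window size of 3 and the running-sum aggregate are arbitrary choices for illustration:

```python
from collections import deque

WINDOW = 3  # arbitrary window size for illustration

window = deque(maxlen=WINDOW)  # deque drops the oldest item automatically
sums = []

for value in [4, 1, 7, 2, 5]:  # events arriving one at a time
    window.append(value)
    sums.append(sum(window))   # aggregate over the current window contents

print(sums)  # [4, 5, 12, 10, 14]
```

Spark Streaming provides this windowing as a built-in operation over its batches; in Storm the application code owns the window state, as above.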