Handling big data: how to cope with the flood of Double 11 orders
Big data: a huge collection of data (an IT-industry term)

Big data is a collection of data that cannot be captured, managed, and processed within an acceptable time frame using conventional software tools. It is an information asset that demands new processing paradigms to deliver stronger decision-making, insight discovery, and process optimization, and it is characterized by large volume, high growth rate, and great diversity.

In The Age of Big Data, Viktor Mayer-Schönberger and Kenneth Cukier describe big data as analyzing and processing all of the data, rather than taking shortcuts such as random sampling. The 5V characteristics of big data (proposed by IBM) are Volume, Velocity, Variety, Value, and Veracity.

Specific operations:

The whole process can be summarized in four steps: acquisition, import and preprocessing, statistics and analysis, and mining.

Acquisition

Acquisition of big data means using multiple databases to receive data from clients (web, app, sensors, etc.); users can run simple queries and processing against these databases. For example, e-commerce companies use traditional relational databases such as MySQL and Oracle to store every transaction, and NoSQL databases such as Redis and MongoDB are also commonly used for data collection.
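As a hedged illustration, the sketch below buffers incoming order events in Redis before they are later persisted to a relational store; the key name, fields, and connection settings are invented for the example.

```python
# Minimal sketch: buffering order events in a Redis list that acts as a
# write buffer in front of the relational store. Names are illustrative.
import json
import redis

r = redis.Redis(host="localhost", port=6379, db=0)

def capture_order(order_id: str, user_id: str, amount: float) -> None:
    """Push one order event onto the Redis buffer as a JSON string."""
    event = {"order_id": order_id, "user_id": user_id, "amount": amount}
    r.lpush("orders:pending", json.dumps(event))

capture_order("20231111-0001", "u42", 199.00)
```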

The main feature, and challenge, of the acquisition stage is high concurrency: thousands of users may be accessing and operating the system at the same time. Train-ticket sites and Taobao, for example, see peak concurrent traffic in the millions, so a large number of databases must be deployed on the acquisition side to carry the load, and how to load-balance and shard across those databases requires careful thought and design.
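A minimal sketch of the sharding decision mentioned above, assuming a simple hash of the user id over a fixed list of shards (the shard DSNs are placeholders):

```python
# Minimal sketch of hash-based sharding: route each user's writes to one of
# several database instances. The connection strings are placeholders.
import hashlib

SHARDS = ["mysql://db0", "mysql://db1", "mysql://db2", "mysql://db3"]

def shard_for(user_id: str) -> str:
    """Pick a shard deterministically from the user id."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("u42"))  # the same user always maps to the same shard
```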

Import/Preprocessing

Although many databases sit at the acquisition end, effectively analyzing massive data requires importing it from the front end into a centralized large-scale distributed database or distributed storage cluster, and some simple cleansing and preprocessing can be done during the import. Some users also use Storm from Twitter to stream the data as it is imported, to meet the real-time computing needs of the business.
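The cleansing step can be as simple as the sketch below, which assumes newline-delimited JSON events with invented field names; it drops malformed or incomplete records and normalizes the amount field before the data is loaded into the central store:

```python
# Minimal sketch of cleansing during import: discard unusable records and
# normalize fields. Field names are assumptions for the example.
import json
from typing import Optional

def clean(raw_line: str) -> Optional[dict]:
    """Return a normalized record, or None if the line is unusable."""
    try:
        record = json.loads(raw_line)
    except json.JSONDecodeError:
        return None                      # discard malformed input
    if "order_id" not in record or "amount" not in record:
        return None                      # discard incomplete records
    record["amount"] = round(float(record["amount"]), 2)
    return record

raw = ['{"order_id": "1", "amount": "19.999"}', "not json"]
cleaned = [rec for line in raw if (rec := clean(line)) is not None]
print(cleaned)  # only the valid, normalized record survives
```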

The main feature and challenge of the import and preprocessing stage is the sheer volume of imported data, which often reaches hundreds of megabytes, or even a gigabyte, per second.

Statistics/Analytics

Statistics and analysis mainly rely on distributed databases or distributed computing clusters to run ordinary analysis and classification over the massive data stored in them, satisfying the most common analytical needs. For real-time needs, systems such as EMC's Greenplum, Oracle's Exadata, and the MySQL-based columnar store Infobright are used; for batch processing, or for needs based on semi-structured data, Hadoop can be used.
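As a toy, single-machine stand-in for the batch aggregation such a cluster would run (the records and field names are invented for the example), a per-category revenue roll-up might look like this:

```python
# Minimal single-machine stand-in for the kind of batch aggregation a
# Hadoop/MPP cluster would perform: total order amount per category.
from collections import defaultdict

orders = [
    {"category": "electronics", "amount": 199.0},
    {"category": "apparel", "amount": 59.0},
    {"category": "electronics", "amount": 899.0},
]

def aggregate(records):
    """Sum amounts grouped by category (the 'reduce' half of the job)."""
    totals = defaultdict(float)
    for rec in records:                  # 'map': emit (category, amount)
        totals[rec["category"]] += rec["amount"]
    return dict(totals)

print(aggregate(orders))  # {'electronics': 1098.0, 'apparel': 59.0}
```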

The main feature and challenge of the statistics and analysis stage is the large volume of data involved, which puts heavy demands on system resources, especially I/O.

Mining

Unlike the earlier statistics and analysis, data mining generally has no predetermined theme: it runs various algorithms over the existing data in order to make predictions, and so satisfies higher-level analytical needs. Typical algorithms include K-Means for clustering, SVM for statistical learning, and Naive Bayes for classification, and a commonly used tool is Apache Mahout on Hadoop.
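For a compact, single-machine illustration of the K-Means case, the sketch below uses scikit-learn rather than the Mahout/Hadoop stack named above; the user features and values are invented:

```python
# Minimal sketch of K-Means clustering on toy per-user features
# (orders placed, total spend). Data and cluster count are assumptions.
import numpy as np
from sklearn.cluster import KMeans

X = np.array([
    [1, 50], [2, 80], [1, 60],           # low-spend users
    [20, 5000], [18, 4200], [25, 6100],  # high-spend users
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)           # cluster assignment for each user
print(model.cluster_centers_)  # per-cluster mean of [orders, spend]
```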
