How to collect data for big data
Data collection is essential to every data system, and as big data receives more attention, the challenges of collecting data have become especially prominent. This article looks at the methods big data technology uses for data collection:

1. Offline collection. Tool: ETL. In data warehousing, ETL is practically synonymous with data collection; it covers data extraction (Extract), transformation (Transform), and loading (Load). During transformation, the data must be governed to suit the specific business scenario, for example by detecting and filtering invalid data, converting formats, normalizing data, replacing values, and ensuring data integrity.
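As an illustration of the extract, transform, and load steps, here is a minimal sketch in Python using only the standard library. The sample data, the column names, and the `orders` table are assumptions made up for this example, and the "business rules" (non-numeric or negative amounts are invalid) are hypothetical; a real pipeline would read from actual source systems and apply its own rules.

```python
import csv
import io
import sqlite3
from datetime import datetime

# Hypothetical raw export; the column names are assumptions for this example.
RAW_CSV = """order_id,customer,amount,order_date
1001, Alice ,19.90,2023/01/05
1002,bob,-5.00,2023/01/06
1003,,42.50,2023/01/07
1004,Carol,abc,2023/01/08
1005,dave,7.25,2023/01/09
"""

def extract(text):
    """Extract: read raw rows from the source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: filter invalid rows, convert formats, normalize, replace gaps."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # filter rows whose amount is not numeric
        if amount < 0:
            continue  # filter rows that break the (assumed) business rule
        # Normalization and replacement: tidy the name, fill in a placeholder.
        customer = row["customer"].strip().title() or "Unknown"
        # Format conversion: 2023/01/05 -> 2023-01-05 (ISO 8601).
        order_date = datetime.strptime(row["order_date"], "%Y/%m/%d").date().isoformat()
        cleaned.append((int(row["order_id"]), customer, amount, order_date))
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders ("
        "order_id INTEGER PRIMARY KEY, customer TEXT NOT NULL, "
        "amount REAL NOT NULL, order_date TEXT NOT NULL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?, ?)", rows)
    conn.commit()

def run_etl(text=RAW_CSV):
    """Run all three steps against an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    load(transform(extract(text)), conn)
    return conn
```

Calling `run_etl()` returns a connection whose `orders` table holds only the rows that passed the filters. The `NOT NULL` and `PRIMARY KEY` constraints are one simple way to let the target store enforce integrity during loading.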

2. Real-time collection. Tools: Flume/Kafka. Real-time collection mainly serves stream-processing business scenarios, for example recording the various operations a data source performs: traffic management in network monitoring, stock bookkeeping in financial applications, and user access behavior recorded by web servers. In a stream-processing scenario, the collection component acts as a Kafka consumer. Like a dam, it intercepts the continuous flow of data from upstream, processes it as the business scenario requires (deduplication, denoising, intermediate computation, etc.), and then writes it to the appropriate data store. The process is similar to traditional ETL, but it runs as a stream rather than as a scheduled batch job. These tools all use a distributed architecture and can meet log collection and transmission demands of hundreds of MB per second.
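The per-record processing described above can be sketched as follows. To keep the example runnable without a broker, a Python generator stands in for the stream; in a real deployment the records would come from a Kafka consumer instead. The event fields (`id`, `page`, `status`) and the list used as a sink are assumptions for illustration only.

```python
from collections import defaultdict

def process_stream(events, sink):
    """Consume events one by one: deduplicate, denoise, aggregate, store."""
    seen_ids = set()                    # deduplication state
    hits_per_page = defaultdict(int)    # intermediate computation
    for event in events:
        if event["id"] in seen_ids:
            continue                    # drop a record that was delivered twice
        seen_ids.add(event["id"])
        if event.get("status") is None or not event.get("page"):
            continue                    # drop incomplete (noisy) records
        hits_per_page[event["page"]] += 1
        sink.append(event)              # stand-in for a write to data storage
    return dict(hits_per_page)

def sample_events():
    """Hypothetical web-server access records, yielded as they 'arrive'."""
    yield {"id": 1, "page": "/home", "status": 200}
    yield {"id": 2, "page": "/loans", "status": 200}
    yield {"id": 2, "page": "/loans", "status": 200}  # duplicate delivery
    yield {"id": 3, "page": "", "status": 200}        # noise: missing page
    yield {"id": 4, "page": "/home", "status": 404}
```

For example, `process_stream(sample_events(), sink=[])` handles each record as it arrives rather than waiting for a complete batch. Note that this sketch keeps every seen ID in memory; a long-running job would need to bound that state, for instance by only remembering IDs within a time window.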

3. Internet collection. Tools: crawlers, DPI, etc. A crawler, also known as a web spider or web robot, is a program or script that automatically crawls the World Wide Web for information according to certain rules; it supports collecting images, audio, video, and other files or attachments. Beyond the content of web pages, network traffic can be collected using bandwidth management techniques such as DPI (deep packet inspection) or DFI (deep/dynamic flow inspection). (Scribe, a data (log) collection system developed by Facebook, belongs with the log collection tools covered under real-time collection rather than with crawlers.)
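To make the idea of "crawling according to certain rules" concrete, here is a minimal breadth-first crawler sketch using Python's standard library. The `fetch` callable and the in-memory `FAKE_SITE` dictionary are assumptions so the example runs without network access; a real crawler would download pages over HTTP and should also respect robots.txt and rate limits.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects the href of every <a> tag in a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; `fetch` maps a URL to HTML text or None."""
    queue, visited, order = deque([start_url]), {start_url}, []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue                       # page could not be retrieved
        order.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in visited:    # rule: visit each URL only once
                visited.add(absolute)
                queue.append(absolute)
    return order

# Hypothetical in-memory "site" so the sketch runs without network access.
FAKE_SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="/">home</a>',
    "http://example.test/b": '<a href="/missing">gone</a>',
}
```

Calling `crawl("http://example.test/", FAKE_SITE.get)` visits each reachable page once, in breadth-first order, and skips the link that cannot be fetched. The `max_pages` limit is a simple safeguard against crawling without bound.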

4. Other data collection methods. Data with high confidentiality requirements, such as customer data and financial data from enterprise production and operations, can be collected in cooperation with data technology service providers, using dedicated system interfaces and similar means. One example is Octave Cloud Computing's Digital Enterprise BDSaaS, which covers data collection technology, BI data analysis, and data security and confidentiality.

Data collection is the first step in uncovering the value of data. As data volumes grow, more useful data can be extracted from them. Making good use of a data processing platform helps keep analysis results valid and helps enterprises become data-driven.