How to collect big data
1. Offline collection. Tool: ETL. In the context of data warehousing, ETL is the representative approach to data collection, covering data extraction (Extract), transformation (Transform), and loading (Load). During transformation, the data must be governed according to the specific business scenario: monitoring and filtering invalid data, converting formats, normalizing and replacing values, and ensuring data integrity (a minimal sketch follows this list).

2. Real-time collection. Tools: Flume/Kafka. Real-time collection is used mainly in stream-processing scenarios, for example to record the operational activity of a data source: traffic management in network monitoring, stock bookkeeping in financial applications, or user access behavior in web server logs. In a stream-processing scenario, the collector becomes a Kafka consumer, acting like a dam that intercepts the continuous flow of data from upstream, processes it as the business scenario requires (e.g., deduplication, denoising, intermediate computation), and then writes it to the appropriate data store. The process resembles traditional ETL, but it is streaming rather than a scheduled batch job, and some tools use a distributed architecture to meet log collection and transmission demands of hundreds of MB per second (see the consumer sketch after this list).

3. Internet collection. Tools: crawler, DPI, etc. A crawler, also known as a web spider or web robot, is a program or script that automatically crawls World Wide Web information according to certain rules; it can collect files and attachments such as images, audio, and video (a crawler sketch follows this list). Scribe, a data (log) collection system developed by Facebook, is another widely used collection tool.
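To make the Extract/Transform/Load steps in item 1 concrete, here is a minimal sketch in plain Python. The source file users.csv, its field names, and the SQLite target are hypothetical assumptions for illustration; production ETL normally runs on dedicated tooling rather than hand-written scripts.

```python
import csv
import sqlite3

def extract(path):
    """Extract: read raw records from a source file (hypothetical users.csv)."""
    with open(path, newline="", encoding="utf-8") as f:
        yield from csv.DictReader(f)

def transform(rows):
    """Transform: the governance steps named in the text, i.e. filter
    invalid data, convert formats, and normalize values."""
    for row in rows:
        # Filter invalid data: drop records missing a required field.
        if not row.get("user_id"):
            continue
        # Format conversion: parse the age field into an integer.
        try:
            age = int(row["age"])
        except (KeyError, ValueError):
            continue  # skip malformed records
        # Normalization: lower-case the e-mail so duplicates compare equal.
        yield (row["user_id"], row.get("email", "").strip().lower(), age)

def load(records, db_path="warehouse.db"):
    """Load: write the cleaned records into the target store (SQLite here)."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS users"
        " (user_id TEXT PRIMARY KEY, email TEXT, age INTEGER)"
    )
    conn.executemany("INSERT OR REPLACE INTO users VALUES (?, ?, ?)", records)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    load(transform(extract("users.csv")))
```

Keeping the three stages as separate functions mirrors the E/T/L separation: each governance rule lives in transform and can change without touching extraction or loading.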
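The "dam" pattern described in item 2, consuming from Kafka, cleaning the stream, and writing it onward, might look like the following sketch using the kafka-python client. The topic name access-logs, the broker address, the event fields, and the store stand-in are all assumptions for illustration.

```python
import json
from kafka import KafkaConsumer  # pip install kafka-python

# Hypothetical topic and broker address; adjust for a real deployment.
consumer = KafkaConsumer(
    "access-logs",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)

seen_ids = set()  # in-memory dedup window; real systems bound this by time or size

def store(event):
    """Stand-in for the downstream data store (HDFS, a database, etc.)."""
    print("stored:", event)

# The consumer acts like a dam on the stream: each event is inspected,
# deduplicated, lightly computed on, and then written onward.
for message in consumer:
    event = message.value
    event_id = event.get("id")
    if event_id in seen_ids:
        continue            # deduplication
    seen_ids.add(event_id)
    if not event.get("url"):
        continue            # denoising: drop malformed events
    event["path_depth"] = event["url"].count("/")  # intermediate computation
    store(event)
```

Unlike a timed batch job, this loop runs continuously; a real deployment would bound the deduplication window and write to durable storage instead of printing.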
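The crawler behavior in item 3, fetching pages and following links "according to certain rules", can be sketched with the requests library and the standard-library HTML parser. The start URL, the same-host rule, and the page budget are hypothetical choices; a production crawler would also honor robots.txt, rate-limit itself, and persist what it collects.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

import requests  # pip install requests

class LinkParser(HTMLParser):
    """Collect href targets from anchor tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_pages=10):
    """Breadth-first crawl under simple rules: stay on one host,
    visit each URL once, stop after max_pages fetches."""
    host = urlparse(start_url).netloc
    queue, seen = deque([start_url]), {start_url}
    while queue and max_pages > 0:
        url = queue.popleft()
        try:
            resp = requests.get(url, timeout=5)
        except requests.RequestException:
            continue  # skip unreachable pages
        max_pages -= 1
        print(f"fetched {url} ({len(resp.text)} bytes)")
        parser = LinkParser()
        parser.feed(resp.text)
        for href in parser.links:
            absolute = urljoin(url, href)
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)

if __name__ == "__main__":
    crawl("https://example.com")  # hypothetical start URL
```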