This piece has been on my mind for a while; since it's a ramble, I'll just write down whatever comes to mind. I've worked in the Internet industry, so I'll use it as the example. First, a rough list of what the Internet industry uses a data warehouse/data platform for:
Integrating all of the company's business data and establishing a unified data center;
Providing various reports, some for senior management and some for the individual business lines;
Providing operational data support for website operations, i.e. letting operations staff learn in a timely way, through data, how the site and its products are performing;
Providing online or offline data support for each business line, becoming the company's unified platform for data exchange and provision;
Analyzing user behavior data and, through data mining, reducing input costs and improving input effects; for example, targeted, precise ad placement and personalized recommendations for users;
Developing data products that directly or indirectly make money for the company;
Building an open data platform that opens up the company's data;
...
The items listed above look much the same as the uses of a traditional-industry data warehouse, and all of them require the data warehouse/data platform to be stable and reliable. In the Internet industry, however, besides the data volume being large, more and more business requirements are about timeliness, many even demanding real time; moreover, Internet businesses change very fast, so you cannot build the warehouse top-down once and for all the way traditional industries can. New businesses must be integrated into the data warehouse quickly, and businesses that are taken offline must be conveniently removed from the existing warehouse;
In fact, the Internet industry's data warehouse is the so-called agile data warehouse: it must respond quickly not only to data but also to business changes;
Building an agile data warehouse places requirements on the technical architecture, but another very important aspect is data modeling. If you set out to build one set of data models compatible with all data and all businesses, you are back to building a traditional data warehouse and will find it hard to respond quickly to business changes. To cope with this, the usual approach is to do in-depth modeling for the core, persistent businesses (for example, building website statistical-analysis and user browsing-trajectory models on top of the website logs, and a user model on top of the company's core user data), and to model the other businesses with dimensions plus wide tables. I'll cover this piece separately later.
Overall Architecture
The following diagram shows the architecture of the data platform we currently use; in fact, most companies' platforms are more or less the same:
Logically, there are generally a data collection layer, a data storage and analysis layer, a data sharing layer, and a data application layer. They may be named differently, but the roles are essentially the same.
Let's look at it from the bottom up:
Data collection
The task of the data collection layer is to collect data from the various data sources and store it in the data storage layer, possibly doing some simple cleaning along the way.
There are quite a few kinds of data sources:
Website logs:
For an Internet company, website logs account for the largest share. The logs are stored on multiple log servers,
and a Flume agent is generally deployed on each log server to collect the website logs in real time and store them on HDFS;
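As a sketch, a minimal Flume agent configuration for this kind of setup might look like the following; the agent name, log path, and HDFS URL are placeholders for illustration, not an actual production config:

```properties
# Hypothetical Flume agent: tail a website access log and sink it to HDFS
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: tail the access log in real time (path is a placeholder)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/nginx/access.log
a1.sources.r1.channels = c1

# Channel: buffer events in memory between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000

# Sink: write to HDFS, partitioned by day
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/logs/website/%Y%m%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```

In practice a file-backed channel is often preferred over the memory channel for reliability.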
Business databases:
Business databases also come in many kinds: MySQL, Oracle, SQL Server, and so on. What is urgently needed here is a tool that can synchronize data from the various databases to HDFS. Sqoop is one option, but Sqoop is too heavyweight: regardless of data volume it has to launch MapReduce to execute, and it requires every machine in the Hadoop cluster to be able to reach the business databases. For this scenario, Taobao's open-source DataX is a good solution (see the article "Heterogeneous Data Source Massive Data Exchange Tool - Taobao DataX Download and Use"); if you have the resources, you can do secondary development on top of DataX and solve the problem very well. The DataHub we currently use is also built on it.
Of course, with some configuration and development, Flume can also synchronize data from databases to HDFS in real time.
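To make the idea of a database-to-HDFS sync job concrete, here is a sketch in the general shape of a DataX job description; treat the exact keys and plugin names as assumptions to be checked against the DataX version you actually use, and all hosts, tables, and credentials are made up:

```json
{
  "job": {
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "etl",
          "password": "***",
          "connection": [{
            "jdbcUrl": ["jdbc:mysql://db-host:3306/shop"],
            "table": ["orders"]
          }],
          "column": ["order_id", "user_id", "amount"]
        }
      },
      "writer": {
        "name": "hdfswriter",
        "parameter": {
          "defaultFS": "hdfs://namenode:8020",
          "path": "/warehouse/ods/orders",
          "fileName": "orders",
          "fileType": "orc"
        }
      }
    }],
    "setting": { "speed": { "channel": 3 } }
  }
}
```

One reader/writer pair per job keeps each sync task independent, which fits the scheduling model described later.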
Data sources from FTP/HTTP:
Some data provided by partners may need to be fetched periodically over FTP/HTTP and the like; DataX can also meet this need;
Other data sources:
For example, for some manually entered data, you only need to provide an interface or a small program to handle it.
Data storage and analysis
Without a doubt, HDFS is the most suitable data storage solution for a data warehouse/data platform in a big data environment.
For offline data analysis and computation, that is, the part with low timeliness requirements, Hive is in my opinion still the first choice: rich data types and built-in functions; a very high compression ratio with the ORC file storage format; and very convenient SQL support, which makes statistical analysis of structured data with Hive far more efficient to develop than with MapReduce. A need that one SQL statement can satisfy may take hundreds of lines of MR code to develop;
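To illustrate the "one SQL statement vs. hundreds of lines of MR code" point, here is a toy example with Python's built-in sqlite3 standing in for Hive; the table and the data are made up purely for illustration:

```python
import sqlite3

# In-memory database standing in for Hive (illustrative only)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE page_views (page TEXT, user_id TEXT)")
conn.executemany(
    "INSERT INTO page_views VALUES (?, ?)",
    [("/home", "u1"), ("/home", "u2"), ("/item", "u1"), ("/home", "u1")],
)

# One SQL statement covers grouping, counting and distinct counting --
# the same logic in hand-written MapReduce takes far more code.
rows = conn.execute(
    "SELECT page, COUNT(*) AS pv, COUNT(DISTINCT user_id) AS uv "
    "FROM page_views GROUP BY page ORDER BY pv DESC"
).fetchall()
print(rows)  # [('/home', 3, 2), ('/item', 1, 1)]
```

The same GROUP BY pattern is what most of the daily PV/UV reporting jobs boil down to.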
Of course, the Hadoop framework naturally also provides the MapReduce interface; if you genuinely enjoy developing in Java, or are not familiar with SQL, you can still use MapReduce for analysis and computation. Spark has been very hot these last two years, and in practice its performance is indeed much better than MapReduce's, and its integration with Hive and Yarn keeps getting better; so supporting Spark and SparkSQL for analysis and computation is a must. Because Hadoop Yarn is already in place, using Spark is actually very easy, with no need to deploy a separate Spark cluster. For related articles on Spark on Yarn, see "Spark On Yarn series of articles".
The real-time computation part is discussed separately below.
Data sharing
Here, data sharing actually refers to where the results of the preceding data analysis and computation are stored: in effect, relational databases and NoSQL databases;
The results computed above with Hive, MR, Spark, and SparkSQL are still on HDFS, but most businesses and applications cannot fetch data from HDFS directly, so a data sharing place is needed from which the various businesses and products can easily obtain data. This is just the opposite of the data collection layer writing into HDFS: here a tool is needed to synchronize data from HDFS to the other target data sources, and again, DataX can meet this need.
Additionally, some real-time computation results may be written directly to the data sharing layer by the real-time computation module.
Data application
Business products
The data used by business products already exists in the data sharing layer, so they can access it directly from the data sharing layer;
Reporting
As with the business products, the data used for reporting is also generally summarized in advance and stored in the data sharing layer;
Ad-hoc queries
There are many ad-hoc query users: data developers, website and product operators, data analysts, even department bosses; they all have needs to query data on the spot;
Such ad-hoc queries usually arise when the existing reports and the data in the data sharing layer cannot satisfy the need, so the data has to be queried directly from the data storage layer.
Ad-hoc queries are generally done through SQL, and the biggest difficulty lies in response speed. Hive is a bit slow for this; my current solution is SparkSQL, whose response speed is much faster than Hive's and which is well compatible with Hive.
Of course, you can also use Impala, if you don't mind having one more framework in the platform.
OLAP
At present, many OLAP tools cannot well support fetching data directly from HDFS; they all synchronize the data needed for OLAP into a relational database first. But if the data volume is huge, a relational database is clearly not viable;
In that case, you need to do the corresponding development to fetch data from HDFS or HBase and complete the OLAP functionality;
For example: according to the indeterminate dimensions and indicators the user selects on the interface, fetch the data from HBase through a developed interface and display it.
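The "user picks arbitrary dimensions and indicators" idea can be sketched in plain Python; the row layout and field names below are hypothetical, and in practice the rows would be read from HBase or HDFS rather than a literal list:

```python
from collections import defaultdict

def olap_query(rows, dimensions, metrics):
    """Group rows by the user-chosen dimension fields and sum the chosen metrics."""
    result = defaultdict(lambda: defaultdict(int))
    for row in rows:
        key = tuple(row[d] for d in dimensions)   # dynamic grouping key
        for m in metrics:
            result[key][m] += row[m]              # dynamic metric aggregation
    return {k: dict(v) for k, v in result.items()}

# Hypothetical ad-traffic rows, as an interface might read them from HBase
rows = [
    {"date": "2015-01-01", "channel": "seo", "pv": 100, "clicks": 8},
    {"date": "2015-01-01", "channel": "sem", "pv": 50,  "clicks": 5},
    {"date": "2015-01-02", "channel": "seo", "pv": 120, "clicks": 10},
]

# The user chose dimension "channel" and indicators "pv"/"clicks" on the interface
print(olap_query(rows, ["channel"], ["pv", "clicks"]))
# {('seo',): {'pv': 220, 'clicks': 18}, ('sem',): {'pv': 50, 'clicks': 5}}
```

Because dimensions and metrics are parameters rather than hard-coded columns, the same interface serves whatever combination the user selects.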
Other Data Interfaces
Such interfaces come in generic and customized varieties. For example, an interface that fetches user attributes from Redis is generic: all businesses can call it to obtain user attributes.
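A minimal sketch of such a generic interface, with an in-memory dict standing in for Redis; the key scheme and attribute names are made up for illustration:

```python
import json

# In-memory store standing in for Redis; "user:<id>" is a hypothetical key scheme
_store = {
    "user:1001": json.dumps({"gender": "f", "city": "Beijing", "level": 3}),
}

def get_user_attributes(user_id):
    """Generic interface: any business can call this to fetch user attributes."""
    raw = _store.get("user:%s" % user_id)   # with real Redis: GET user:<id>
    return json.loads(raw) if raw is not None else {}

print(get_user_attributes(1001)["city"])  # Beijing
print(get_user_attributes(9999))          # {}
```

Keeping the lookup behind one function is what makes the interface generic: businesses depend on the function, not on the storage details.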
Real-time computation
Businesses now place more and more real-time demands on the data warehouse, for example: knowing the site's overall traffic in real time, or getting an ad's exposures and clicks in real time. On massive data, traditional databases and traditional implementation methods cannot get this done; what is needed is a distributed, high-throughput, low-latency, highly reliable real-time computation framework. Storm is fairly mature in this area, but I chose Spark Streaming, for the simple reason that I did not want to introduce one more framework into the platform; besides, although Spark Streaming's latency is a bit higher than Storm's, for our needs that can be ignored.
We currently use Spark Streaming to implement two pieces of functionality: real-time website traffic statistics and real-time advertising effect statistics.
The approach is also very simple: Flume collects the website logs and ad logs on the front-end log servers and sends them in real time to Spark Streaming; Spark Streaming completes the statistics and stores the results in Redis, and the businesses access them in real time through Redis.
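The per-batch logic is simple. Here is a pure-Python sketch of what each micro-batch does (count, then push to Redis); plain lists and a dict stand in for the stream and for Redis, and the log line format is a made-up "<page> <user>":

```python
from collections import Counter

redis_stub = {}  # stands in for Redis

def process_batch(log_lines):
    """What one micro-batch does: count PV per page, then update 'Redis'."""
    counts = Counter(line.split()[0] for line in log_lines)
    for page, pv in counts.items():
        redis_stub[page] = redis_stub.get(page, 0) + pv  # with real Redis: INCRBY

# Two simulated micro-batches of website log lines
process_batch(["/home u1", "/item u2", "/home u3"])
process_batch(["/home u1"])
print(redis_stub)  # {'/home': 3, '/item': 1}
```

In the real pipeline the same counting runs inside Spark Streaming's batch function, so the counters in Redis are always only one batch interval behind the live traffic.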
Task scheduling and monitoring
In a data warehouse/data platform there are very many kinds of programs and tasks, for example: data collection tasks, data synchronization tasks, data analysis tasks, and so on;
Besides being scheduled on a timer, these tasks also have very complex dependencies. For example, a data analysis task can only start after the corresponding data collection task has completed, and a data synchronization task can only start after the data analysis task has completed. This calls for a very complete task scheduling and monitoring system, which, as the pivot of the data warehouse/data platform, is responsible for scheduling and monitoring the allocation and running of all tasks.
This was covered in the earlier article "Task Scheduling and Monitoring in Big Data Platforms" and will not be repeated here.
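The dependency handling at the core of such a scheduler can be sketched as a topological sort; the task names below are made up for illustration:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical tasks: each maps to the set of tasks it depends on
dependencies = {
    "collect_logs":  set(),
    "sync_mysql":    set(),
    "analyze_pv_uv": {"collect_logs"},                  # analysis waits for collection
    "sync_to_share": {"analyze_pv_uv", "sync_mysql"},   # export waits for analysis
}

# A valid execution order: no task runs before its dependencies
order = list(TopologicalSorter(dependencies).static_order())
print(order)
# e.g. ['collect_logs', 'sync_mysql', 'analyze_pv_uv', 'sync_to_share']
```

A real scheduler additionally runs independent tasks in parallel, retries failures, and sends the alerts mentioned in the summary, but the ordering constraint is exactly this.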
Summary
In my opinion, an architecture is not better the newer its technologies are; as long as it meets the needs, the simpler and more stable the better. Currently, on our data platform, the developers care more about the business than about the technology: once they have figured out the business and the requirements, they basically only need to do simple SQL development and then configure it into the scheduling system, and if a task fails they receive an alert. This lets more resources be focused on the business itself.