I've been meaning to organize this piece for a while; since it's a ramble, I'll just write down whatever comes to mind. I've worked in the Internet industry, so I'll use the Internet industry as my example.
First, a rough list of what an Internet company's data warehouse / data platform is used for:
Integrating all of the company's business data and establishing a unified data center;
Providing a variety of reports, some for senior management and some for the individual business lines;
Providing operational data support for website operations, that is, using data to let the operations staff keep abreast of the status of the website and its products in a timely manner;
Providing online or offline data support for each business line, becoming the company's unified platform for data exchange and provision;
Analyzing user behavior data and, through data mining, reducing input costs and improving input effectiveness; for example, targeted precision advertising and personalized user recommendations;
Developing data products that directly or indirectly generate profit for the company;
Constructing an open data platform to open up company data;
......
The uses listed above look more or less the same as those of a traditional-industry data warehouse, and all of them require the data warehouse/data platform to offer good stability and reliability. In the Internet industry, however, besides the sheer volume of data, more and more business requirements demand timeliness, and many even demand real-time; moreover, Internet businesses change very quickly, so it is not possible, as in traditional industries, to build one all-encompassing data warehouse up front and be done with it.
In fact, the Internet industry's data warehouse is the so-called agile data warehouse: it must respond quickly not only to the data but also to the business.
Building an agile data warehouse, besides making demands on the architecture and technology, has another very important aspect: data modeling. If you set out to build one set of data models compatible with all data and all businesses, you end up back at a traditional data warehouse build-out, and it becomes hard to respond quickly to business changes. The usual way to cope is to do in-depth modeling only for the core, persistent businesses (for example: building a website statistical-analysis model and a user browsing-trajectory model on top of the website logs; building a user model on top of the company's core user data), while the other businesses build their data models with dimensions plus wide tables, as in the sketch below. But that is a topic for another time.
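As a purely illustrative sketch of the "dimensions + wide table" approach (the table and column names are hypothetical, not from any real model), commonly used dimension attributes are denormalized right next to the facts, so most queries need no joins and new columns can simply be appended as the business changes:

```sql
-- Hypothetical wide table for an ad-click business: user, ad, and
-- channel dimension attributes are flattened into the fact table.
CREATE TABLE dw.ad_click_wide (
    click_time   STRING,
    user_id      STRING,
    user_city    STRING,   -- user dimension, denormalized
    user_level   STRING,
    ad_id        STRING,
    ad_campaign  STRING,   -- ad dimension, denormalized
    channel      STRING,   -- traffic-source dimension
    click_cnt    BIGINT
)
PARTITIONED BY (dt STRING)
STORED AS ORC;
```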
Overall Architecture
The following diagram shows the architecture of the data platform we are currently using; in fact, most companies' platforms should look more or less the same:
[Figure: architecture diagram of the data platform]
Logically, there are generally a data collection layer, a data storage and analysis layer, a data sharing layer, and a data application layer. They may be called different things, but in essence the roles are much the same. Let's look at it from the bottom up:
Data Collection
The task of the data collection layer is to collect data from the various data sources and store it in the data storage layer, possibly doing some simple cleaning along the way.
There are quite a few kinds of data sources:
Website logs:
In the Internet industry, website logs account for the largest share, and they are stored on multiple website log servers. Generally, a Flume agent is deployed on each log server to collect the website logs in real time and store them to HDFS.
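A minimal sketch of such an agent (the log path, HDFS address, and sizing are assumptions, not from the original article):

```properties
# Hypothetical Flume agent on a web log server: tail the access log
# and write it to HDFS, partitioned by day.
agent.sources  = r1
agent.channels = c1
agent.sinks    = k1

agent.sources.r1.type = exec
agent.sources.r1.command = tail -F /var/log/nginx/access.log
agent.sources.r1.channels = c1

agent.channels.c1.type = memory
agent.channels.c1.capacity = 10000

agent.sinks.k1.type = hdfs
agent.sinks.k1.channel = c1
agent.sinks.k1.hdfs.path = hdfs://nameservice1/logs/web/%Y%m%d
agent.sinks.k1.hdfs.useLocalTimeStamp = true
agent.sinks.k1.hdfs.fileType = DataStream
agent.sinks.k1.hdfs.rollInterval = 300
```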
Business database:
Business databases are also diverse: MySQL, Oracle, SQL Server, and so on. What's urgently needed here is a tool that can synchronize data from these various databases to HDFS. Sqoop is one option, but Sqoop is too heavyweight: regardless of the data volume, it has to start a MapReduce job to execute, and it requires every machine in the Hadoop cluster to be able to access the business database. For this scenario, Taobao's open-source DataX is a good solution (see the article "Heterogeneous Data Sources Massive Data Exchange Tool - Taobao DataX Download and Use"); if you have the resources, you can do secondary development on top of DataX, which works very well. The DataHub we are currently using is one such example.
Of course, Flume, with some configuration and development, can also synchronize database data to HDFS in real time.
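For illustration only (the plugin names and parameters below follow the later open-source DataX job format; the Taobao-era version is configured differently, and all connection details are hypothetical), a MySQL-to-HDFS sync job looks roughly like this:

```json
{
  "job": {
    "setting": { "speed": { "channel": 3 } },
    "content": [{
      "reader": {
        "name": "mysqlreader",
        "parameter": {
          "username": "dw_sync",
          "password": "******",
          "column": ["id", "user_id", "amount", "create_time"],
          "connection": [{
            "table": ["orders"],
            "jdbcUrl": ["jdbc:mysql://db-host:3306/shop"]
          }]
        }
      },
      "writer": {
        "name": "hdfswriter",
        "parameter": {
          "defaultFS": "hdfs://nameservice1",
          "path": "/dw/ods/orders/dt=20150801",
          "fileName": "orders",
          "fileType": "orc",
          "writeMode": "append",
          "fieldDelimiter": "\t",
          "column": [
            {"name": "id",          "type": "BIGINT"},
            {"name": "user_id",     "type": "BIGINT"},
            {"name": "amount",      "type": "DOUBLE"},
            {"name": "create_time", "type": "STRING"}
          ]
        }
      }
    }]
  }
}
```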
Data from Ftp/Http:
Data provided by some partners may need to be fetched periodically via FTP/HTTP; DataX can also meet this demand;
Other data sources:
For example, some manually entered data: for this, providing an interface or a small program is all that's needed;
Data Storage and Analysis
Undoubtedly, HDFS is the ideal data storage for a data warehouse/data platform in a big data environment.
For offline data analysis and computation, that is, the parts without high real-time requirements, Hive is in my opinion still the first choice: rich data types and built-in functions; the ORC file storage format with its very high compression ratio; and very convenient SQL support. All of this makes statistical analysis of structured data on Hive far more efficient than with MapReduce: a requirement that one SQL statement can satisfy might take hundreds of lines of MR code.
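A purely illustrative example (table and column names are hypothetical): creating an ORC table and computing daily PV/UV takes a few lines of HiveQL, where raw MapReduce would need a custom mapper, reducer, and driver:

```sql
-- Hypothetical ORC-backed log table.
CREATE TABLE dw.web_log (
    user_id  STRING,
    url      STRING,
    log_time STRING
)
PARTITIONED BY (dt STRING)
STORED AS ORC;

-- Daily PV and UV in a single statement.
SELECT dt,
       COUNT(*)                AS pv,
       COUNT(DISTINCT user_id) AS uv
FROM dw.web_log
WHERE dt = '2015-08-01'
GROUP BY dt;
```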
Of course, the Hadoop framework naturally also provides the MapReduce interface; if you really enjoy developing in Java, or are unfamiliar with SQL, you can also use MapReduce for analysis and computation. Spark has been very hot for the last two years, and after trying it in practice, its performance is indeed much better than MapReduce, and its integration with Hive and Yarn keeps getting better; therefore, we must support using Spark and SparkSQL for analysis and computation. Since Hadoop Yarn is already in place, using Spark is actually very easy: there is no need to deploy a separate Spark cluster. For related articles on Spark On Yarn, see: "Spark On Yarn series of articles".
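A minimal sketch of running the same kind of query through Spark on an existing Yarn cluster (Spark 1.x-era API, since that is the generation the article describes; the query and names are illustrative):

```scala
// Spark 1.x-style sketch: run SQL over Hive tables on Yarn;
// submitted with spark-submit --master yarn-cluster, no separate
// Spark cluster required.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object DailyStats {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DailyStats"))
    val hiveContext = new HiveContext(sc)

    // The same HiveQL as before, now executed by Spark.
    val result = hiveContext.sql(
      "SELECT dt, COUNT(*) AS pv, COUNT(DISTINCT user_id) AS uv " +
      "FROM dw.web_log WHERE dt = '2015-08-01' GROUP BY dt")
    result.collect().foreach(println)
    sc.stop()
  }
}
```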
The real-time computation part will be covered separately later.
Data Sharing
The data sharing layer here actually refers to where the results of the preceding data analysis and computation are stored; in practice it means relational databases and NOSQL databases;
The results computed earlier with Hive, MR, Spark, and SparkSQL still sit on HDFS, but most businesses and applications cannot be expected to fetch data directly from HDFS; so a data sharing layer is needed that makes it easy for the various businesses and products to access the data. This is exactly the opposite direction from the collection layer writing into HDFS: here you need a tool that synchronizes data from HDFS to other target data sources, and again, DataX can meet the need.
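Mirroring the earlier sync job, the reader and writer simply swap roles (again purely illustrative; plugin names follow the open-source DataX, and paths and tables are hypothetical):

```json
{
  "job": {
    "setting": { "speed": { "channel": 2 } },
    "content": [{
      "reader": {
        "name": "hdfsreader",
        "parameter": {
          "defaultFS": "hdfs://nameservice1",
          "path": "/dw/report/daily_stats/dt=20150801/*",
          "fileType": "orc",
          "column": ["*"]
        }
      },
      "writer": {
        "name": "mysqlwriter",
        "parameter": {
          "username": "report",
          "password": "******",
          "column": ["dt", "pv", "uv"],
          "connection": [{
            "table": ["daily_stats"],
            "jdbcUrl": "jdbc:mysql://report-db:3306/report"
          }]
        }
      }
    }]
  }
}
```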
In addition, some real-time computation results may be written by the real-time computation module directly into the data sharing layer.
Data Application
Business Products
The data used by business products already lives in the data sharing layer, and they can access it directly from there;
Reporting
As with the business products, the data needed for reports is usually already summarized and stored in the data sharing layer, and can be accessed from there directly;
Ad-hoc Query
There are many ad-hoc query users: data developers, website and product operations staff, data analysts, even department bosses; they all have the need to run ad-hoc queries on the data;
Such ad-hoc queries usually arise because the existing reports and the data in the data sharing layer do not satisfy the users' needs, so they have to query the data storage layer directly.
Ad-hoc queries are generally issued as SQL; the biggest difficulty lies in response speed. Hive is somewhat slow for this; my current solution is SparkSQL, whose response speed is much faster than Hive's and which is highly compatible with Hive.
Of course, you could also use Impala, if you don't mind one more framework in the platform.
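One concrete way to serve such queries (a sketch of one possible setup, not necessarily how the author deployed it) is the HiveServer2-compatible Thrift server that ships with Spark, so analysts can keep using ordinary SQL clients:

```bash
# Start the Spark SQL Thrift server on Yarn (ships with Spark).
./sbin/start-thriftserver.sh --master yarn

# Analysts connect with any HiveServer2-compatible client, e.g. beeline
# (host name is a placeholder).
beeline -u jdbc:hive2://thrift-server-host:10000
```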
OLAP
Currently, many OLAP tools cannot support fetching data directly from HDFS well; the usual workaround is to synchronize the data into a relational database and do OLAP there, but if the data volume is huge, a relational database obviously cannot cope. In that case, you need to do the corresponding development yourself, fetching data from HDFS or HBase to implement the OLAP functionality.
For example: based on the dimensions and metrics the user selects in the interface, fetch data from HBase through a purpose-built interface and display it.
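A purely illustrative sketch of that kind of interface (the table, row-key scheme, and column names are all hypothetical): fetching one pre-aggregated cell from HBase for the chosen dimension and metric:

```scala
// Sketch: read a pre-aggregated OLAP value from HBase.
// Assumed row-key scheme: "<date>|<dimension value>", e.g. "20150801|beijing".
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes

object OlapQuery {
  def main(args: Array[String]): Unit = {
    val conn  = ConnectionFactory.createConnection(HBaseConfiguration.create())
    val table = conn.getTable(TableName.valueOf("olap_web_stats"))

    val get = new Get(Bytes.toBytes("20150801|beijing"))
    get.addColumn(Bytes.toBytes("m"), Bytes.toBytes("pv")) // metric column

    val result = table.get(get)
    // Assumes the value was written as a long.
    val pv = Bytes.toLong(result.getValue(Bytes.toBytes("m"), Bytes.toBytes("pv")))
    println(s"pv = $pv")

    table.close()
    conn.close()
  }
}
```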
Other Data Interfaces
These interfaces include both generic and customized ones. For example, an interface that fetches user attributes from Redis is a generic one: every business can call it to obtain user attributes.
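A minimal sketch of such a generic interface (the key scheme and attribute names are assumptions; the Jedis client is used for illustration):

```scala
// Sketch of a generic "user attributes" interface backed by Redis.
// Assumed storage: one Redis hash per user under the key "user:<id>".
import redis.clients.jedis.Jedis
import scala.collection.JavaConverters._

class UserAttributeService(redisHost: String) {
  private val jedis = new Jedis(redisHost, 6379)

  /** Return all attributes of a user, e.g. Map("city" -> "beijing"). */
  def getUserAttributes(userId: String): Map[String, String] =
    jedis.hgetAll(s"user:$userId").asScala.toMap
}

object UserAttributeServiceDemo {
  def main(args: Array[String]): Unit = {
    val service = new UserAttributeService("redis-host")
    println(service.getUserAttributes("10001"))
  }
}
```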
Real-time Computing
Nowadays the business has more and more real-time demands on the data warehouse, for example: knowing the website's overall traffic in real time; getting an advertisement's exposures and clicks in real time. On massive data, it is basically impossible to achieve this with traditional databases and traditional implementation methods; what is needed is a distributed, high-throughput, low-latency, highly reliable real-time computing framework. Storm is relatively mature in this area, but I chose Spark Streaming, for the simple reason that I didn't want to introduce yet another framework into the platform; besides, Spark Streaming's latency is a bit higher than Storm's, but for our needs that is negligible.
We currently use Spark Streaming to implement two pieces of functionality: real-time website traffic statistics and real-time advertising effectiveness statistics. The approach is also very simple: Flume collects the website logs and advertising logs on the front-end log servers and sends them to Spark Streaming in real time; Spark Streaming does the statistics and stores the data into Redis, and the businesses access it in real time through Redis.
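A minimal sketch of the pipeline (Spark 1.x Flume integration via the spark-streaming-flume module; hosts, ports, and keys are assumptions):

```scala
// Flume -> Spark Streaming -> Redis: count page views per batch
// and accumulate the total in Redis for the business to read.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils
import redis.clients.jedis.Jedis

object RealtimeTraffic {
  def main(args: Array[String]): Unit = {
    val sc  = new SparkContext(new SparkConf().setAppName("RealtimeTraffic"))
    val ssc = new StreamingContext(sc, Seconds(10))

    // Flume pushes log events here through an avro sink.
    val events = FlumeUtils.createStream(ssc, "0.0.0.0", 44444)

    // Each batch's event count is this batch's PV.
    events.count().foreachRDD { rdd =>
      val pv = rdd.first() // DStream.count() yields exactly one element
      val jedis = new Jedis("redis-host", 6379)
      jedis.incrBy("pv:realtime", pv)
      jedis.close()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```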
Task Scheduling and Monitoring
In a data warehouse/data platform there are a great many programs and tasks, for example: data collection tasks, data synchronization tasks, data analysis tasks, and so on;
Besides requiring scheduled execution, these tasks also have very complex dependencies among them. For example, a data analysis task can only start after the corresponding data collection task has completed, and a data synchronization task can in turn only start after the data analysis task has completed. This calls for a very well-built task scheduling and monitoring system, which, as the hub of the data warehouse/data platform, is responsible for scheduling and monitoring the allocation and execution of all tasks.
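Purely as an illustration of declaring such dependencies (the article does not name a scheduler; Azkaban's .job files are shown only as an example, with hypothetical script paths):

```properties
# collect_logs.job -- runs on its own schedule
type=command
command=sh /opt/dw/bin/collect_logs.sh

# analyze_traffic.job -- starts only after collection completes
type=command
command=sh /opt/dw/bin/analyze_traffic.sh
dependencies=collect_logs

# sync_to_mysql.job -- starts only after analysis completes
type=command
command=sh /opt/dw/bin/sync_to_mysql.sh
dependencies=analyze_traffic
```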
This was covered in the earlier article "Task Scheduling and Monitoring in Big Data Platforms", so I won't repeat it here.
Summary
In my opinion, an architecture is not better the newer its technology; rather, as long as it meets the requirements, the simpler and more stable, the better. On our current data platform, developers pay more attention to the business than to the technology: once they have figured out the business and the requirements, they basically only need to do simple SQL development, then configure the task into the scheduling system; if a task runs abnormally, they receive an alert. This lets more resources be focused on the business itself.