Apache Kylin in practice: OLAP on billions of rows at Meituan

Meituan's business lines have a large number of OLAP analysis scenarios: billions of rows of data on Hadoop must be analyzed, and interactive requests from thousands of users such as analysts and city BDs must be answered directly, which places high demands on the scalability, stability, data accuracy, and performance of the OLAP service. This article introduces Meituan's concrete OLAP requirements, how Kylin is applied to real scenarios, and the current state of its use. It also compares Kylin with other systems (such as Presto and Druid) and explains Kylin's unique advantages.

As the company's platform department, we need to provide platform services to every business line, so the question is how to build a company-wide, platform-level OLAP analysis service that satisfies a variety of needs. First, an open-source project runs into many obstacles when it is adopted inside a company, mainly because each business line has different data and business characteristics, so this article starts by describing the characteristics of Meituan's data scenarios. Second, it covers the solutions for those characteristics, especially the ones that do not match Kylin's original design intent. Third, there is currently no de facto standard in the OLAP field; many engines can do similar things, such as general-purpose MPP systems, Kylin, or ES. How do these systems compare, and how should one choose among them? We have some test data to share on that. Finally, we briefly discuss the work we plan to do on Kylin in the future.

1. Meituan’s data scenario characteristics

The first characteristic concerns data scale and the data model. In terms of scale, fact tables generally hold between 100 million and 1 billion rows, and there are also dimension tables with tens of millions of rows, i.e., ultra-high-cardinality dimension tables. The data model, on the other hand, was the biggest difficulty at the beginning. Kylin was originally designed around the star schema, but for various reasons much of our data follows a snowflake schema, and there are other shapes as well, such as the so-called "constellation" model, with two or three fact tables in the middle surrounded by many dimension tables. Business logic determines how these tables are related, and the relationships are often too complex to be described by the classical textbook models.

The second characteristic is dimensions. Ideally, dimensions are fixed and only the fact table changes from day to day. In practice, however, dimensions change frequently, which may be an industry characteristic: the organizational structure, for example, and its related dimension data may change every day. Moreover, today's dimensions may need to be joined against all historical data, so historical data has to be refreshed, and the corresponding overhead is considerable.

The third issue is data backfilling. For example, if something goes wrong in data generation or an error occurs upstream, the data has to be rerun. This also differs from the classical theoretical model.

From the dimension perspective, the number of dimensions is generally between 5 and 20, which suits Kylin well. Another feature is that there is usually a date dimension, which may cover the current day, a week, a month, or an arbitrary time span. There are also many hierarchical dimensions, such as the organizational structure from top-level regions down to individual cells, a typical hierarchical dimension.

From the metric perspective, the number of metrics is generally within 50; Kylin's restrictions on metrics are comparatively loose and can meet the needs. However, there are many expression metrics: in Kylin the argument of an aggregate function can only be a single column, so something like SUM(IF(...)) is not supported directly, and a special solution is needed. Another very important issue is data accuracy. In the OLAP field, most systems use approximate algorithms such as HyperLogLog for deduplicated counting, mainly for cost reasons, but our business scenarios require exact results, so this is also a key problem to solve.

There are also relatively high requirements for queries.

Because the platform's query service may be opened directly to city BDs and receives a large volume of requests, stability must be guaranteed first. The second requirement is performance: since Kylin is mainly used for interactive analysis where users expect results quickly, second-level response times are required.

In addition, people often ask whether Kylin has a visualization front end. Internally this is mostly handled by the business side, because we already have such a system, which used to connect to MySQL and other data sources; it can now connect directly through Kylin's JDBC driver.

The above are some characteristics of OLAP queries at Meituan. Before adopting Kylin we did have some solutions, but the results were not ideal. Querying Hive directly, for example, is slow and consumes compute-cluster resources. On the first day of every month in particular, when everyone publishes monthly reports, a large amount of SQL hits the cluster at once, and the concurrency limit makes everything run slower than usual. We also tried pre-aggregation, an idea very similar to Kylin's but implemented by ourselves: compute all dimension combinations with Hive first and then import the results into MySQL or HBase. That solution, however, lacks Kylin's model-definition abstraction and its end-to-end framework covering configuration, execution, precomputation, and query. With Kylin we can now solve these problems at low cost.

2. Solution to connect to Apache Kylin

In response to the above problems, after a lot of attempts and verifications, the current main solutions are as follows.

The first and most important point is to use a wide table. Any non-standard star data model can be handled by flattening it into a wide table in preprocessing. As long as the tables can be joined according to business logic, a wide table can be generated, and Kylin then aggregates the data on top of that table. Wide tables solve not only the data-model problem but also issues such as changing dimensions and ultra-high-cardinality dimensions.
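
As an illustration, a snowflake model can be flattened with a Hive view along the following lines. The table and column names (fact_order, dim_poi, dim_city, and so on) are hypothetical, not Meituan's actual schema; Kylin then sees the view as a single fact table when the cube is defined.

    -- Hypothetical sketch: flatten a snowflake model (fact -> poi -> city) into one wide table.
    CREATE VIEW dw.wide_order AS
    SELECT
        f.order_id,
        f.dt,                  -- date dimension
        f.user_id,
        f.order_status,
        f.amount,
        p.poi_name,            -- dimension table one hop from the fact table
        c.city_name,
        c.region_name          -- hierarchical dimension: region > city
    FROM dw.fact_order f
    LEFT JOIN dw.dim_poi  p ON f.poi_id  = p.poi_id
    LEFT JOIN dw.dim_city c ON p.city_id = c.city_id;   -- the snowflake branch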

The second point is the expression-metric problem, which can also be solved by preprocessing: simply materialize the expression as a column and then aggregate on that column. In practice, both the wide table and the expression transformation can be implemented with a Hive view, or a physical table can be generated instead.
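
For example, a metric that a user would naturally write as SUM(IF(order_status = 'PAID', amount, 0)) can be materialized as a column in the view, and the cube measure then becomes a plain SUM over that column. The names below are hypothetical and continue the sketch above.

    -- Hypothetical sketch: precompute the expression so the aggregate only sees a single column.
    CREATE VIEW dw.wide_order_with_expr AS
    SELECT
        t.*,
        IF(t.order_status = 'PAID', t.amount, 0) AS paid_amount   -- expression column
    FROM dw.wide_order t;
    -- In the cube definition, use SUM(paid_amount) instead of SUM(IF(order_status = 'PAID', amount, 0)).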

The third problem is exact deduplicated counting. The current solution is based on Bitmap. Due to data-type limitations, only the int type is supported; other types such as long and string are not supported yet, because every value has to be mapped into the Bitmap: with long the overhead would be too high, and hashing would introduce collisions and make the result inexact. In addition, the Bitmap itself is fairly expensive, especially during precomputation: if the cardinality being counted is very large, the data structure can reach tens of megabytes and there is a risk of OOM. We will look into solutions for these problems later, and discussion in the community is welcome. (Additional note: exact deduplicated counting for all types has been implemented in version 1.5.3.)
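
One workaround that can be sketched here (it is preprocessing on our side, not a Kylin feature, and the names are hypothetical) is to map a high-cardinality string or long identifier to a dense int surrogate key, so that the bitmap-based exact count distinct can be applied to the int column.

    -- Hypothetical sketch: build a dense int surrogate key for user_id in Hive.
    CREATE TABLE dw.user_id_map AS
    SELECT user_id,
           ROW_NUMBER() OVER (ORDER BY user_id) AS user_int_id   -- dense int id that fits the Bitmap
    FROM (SELECT DISTINCT user_id FROM dw.wide_order) u;

    -- Join the mapping back; the cube measure then becomes COUNT(DISTINCT user_int_id).
    CREATE VIEW dw.wide_order_dedup AS
    SELECT o.*, m.user_int_id
    FROM dw.wide_order o
    JOIN dw.user_id_map m ON o.user_id = m.user_id;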

In terms of deployment, the Kylin Server is currently deployed separately. It is essentially a client and does not need many resources; a virtual machine is generally sufficient.

The actual deployment can be seen in this picture. At the lower left is the main Hadoop cluster, which executes all Hadoop jobs every day. The most important parts in the middle are the two servers kylin01 and kylin02, which serve the online environment. kylin01 is the production environment: it runs the precomputation jobs on the main cluster and imports the results into HBase, and it also answers front-end requests by reading data from HBase. To add a new Cube, you operate on kylin02, which is the pre-production environment.

All business users define their cube data models on kylin02; once everything checks out, the administrator switches the cube over to kylin01.

One advantage of this setup is that kylin01, as the online service, can guarantee stability, and its permission control can be stricter; another is that after development in the pre-production environment is complete, the administrator can review the cube to ensure its efficiency.

At the upper right is the scheduling system. Data production for the entire data warehouse is driven by this scheduler, which runs many types of tasks, and Kylin's cube build task is one of them. Once the upstream data is ready, the cube build process is triggered automatically based on the configured dependencies.

At the upper left is a completely independent offline test cluster. It is fully open and is mainly used for users' initial feasibility studies or evaluations, but its stability is not guaranteed.

Taking an open-source system from the community to actual adoption and then to production is a fairly long process. The issues here are not so much technical as procedural: how to serve the business side well at every stage so that onboarding Kylin is smoother and communication costs stay low. The whole process falls into four stages.

The first stage is solution selection: the business side investigates candidate solutions based on its needs. At this stage we provide a requirements checklist that lists detailed points covering the data model, dimensions, and other aspects, so that the business side can check whether its requirements can be met.

After confirming that Kylin can meet the needs, the next step is offline exploration, i.e., offline evaluation or testing; we provide very detailed onboarding documentation and an offline test environment. The third step is online development, for which we also provide documentation: based on the abstracted scenarios, it explains how to configure a Cube for each scenario, what preprocessing should be done, how to optimize, and so on, giving the business side guidance.

Finally, after development is complete, the cube is switched over and goes online. This step is still performed by the administrator, partly to avoid misuse or abuse and partly so that the cube can be reviewed and further optimized.

3. Comparative analysis of mainstream OLAP systems

From talking with colleagues elsewhere, my impression is that everyone thinks Kylin is decent, but they are not particularly confident: they do not know whether they really need it, what the reasons for using it would be, or how it compares with other systems. Here are some test results to share.

The whole test is based on the SSB (Star Schema Benchmark) data set, which is fully open source and designed specifically for testing star-schema OLAP scenarios. The data set consists of a very standard five tables; parameters control the size of the generated data, so different query scenarios can be tested at different scales. The systems tested are Presto, Kylin 1.3, Kylin 1.5, and Druid. The data scales are tens of millions, hundreds of millions, and billions of rows; there are 30 dimensions and 50 metrics. Typical test scenarios include roll-up, drill-down, and commonly used aggregate functions.

Five typical query scenarios were selected: filtering and aggregation on a single fact table; a query joining all five tables; a query with two exact Count Distinct metrics and two Sum metrics; and two further queries that add filters on 8–10 dimensions.
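
As a sketch, the third scenario (two exact Count Distinct metrics plus two Sum metrics) has roughly the following shape on the SSB tables; the column names follow the standard SSB schema, and the actual filters used in the test are not reproduced here.

    -- Rough shape of the "two Count Distinct + two Sum" test query on SSB.
    -- The date dimension table is named dates here; generators also call it date or dwdate.
    SELECT d.d_year,
           c.c_region,
           COUNT(DISTINCT lo.lo_custkey) AS distinct_buyers,
           COUNT(DISTINCT lo.lo_suppkey) AS distinct_suppliers,
           SUM(lo.lo_revenue)            AS total_revenue,
           SUM(lo.lo_quantity)           AS total_quantity
    FROM ssb.lineorder lo
    JOIN ssb.dates    d ON lo.lo_orderdate = d.d_datekey
    JOIN ssb.customer c ON lo.lo_custkey   = c.c_custkey
    WHERE d.d_year = 1997
    GROUP BY d.d_year, c.c_region;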

This picture shows the test results at the tens-of-millions scale, covering the four systems. Before we used Kylin or similar systems, there was no engine dedicated to OLAP analysis, so we could only use general-purpose ones.

Presto is one such engine and performs very well in general, but in a specific scenario like OLAP it is, as you can see, far behind Kylin or Druid. The first two tests therefore include Presto's results, while the later ones do not.

An interesting phenomenon here is that in the third query Kylin 1.5 is slower than Kylin 1.3. We have not figured out the reason yet and will look into it later. If nothing else, this shows the numbers have not been doctored; they are real test data.

As the next two queries show, there is still a fairly large gap to Druid at the tens-of-millions scale. This is mainly down to their implementation modes: Druid loads all data into memory after preprocessing and can be extremely fast when aggregating small amounts of data, whereas Kylin has to read from HBase, so its performance is worse, though still fully sufficient for our needs.

The situation changes at the hundreds-of-millions scale. Looking at the last two queries, Kylin 1.3's latency grows roughly linearly and the numbers look much worse, because Kylin 1.3 scans HBase serially. Kylin 1.5 performs better here because it introduces parallel HBase scans, which greatly reduce scan time: Kylin 1.5 shards its data across different regions. At the tens-of-millions scale the data volume is still small and the effect is not obvious, but beyond hundreds of millions of rows the number of regions grows with the data, so concurrency can actually increase, which is why Kylin 1.5 comes out ahead. It also shows that as the data volume grows by orders of magnitude, Kylin stays stable and maintains good query performance across data sets of different sizes, whereas Druid's performance loss grows exponentially as the data volume increases.

We have just analyzed performance, but for a system performance is only one aspect. We also consider other aspects, mainly the following four points.

First, functional completeness. As mentioned, all our data must be exact, and basically no system fully covers this requirement. Druid's performance, for instance, is indeed better, but its deduplicated counting cannot be exact.

Second, ease of use. As a platform service we not only use the system but also maintain it, so deployment and monitoring costs must be considered, and Kylin does fairly well here. A Druid cluster has many roles, and standing up all of those services can take a long time; for us as a platform that is rather painful, because both deployment and monitoring are costly. As for the query interface, the most familiar, most standard, and easiest to use is of course standard SQL. Systems such as ES and Druid did not originally support SQL; there are plug-ins now, but they are not as complete as native support in terms of functionality and data efficiency.

Third, data cost. As mentioned, some data requires preprocessing, such as flattening tables or materializing expressions as columns, and there are also format conversions: some systems can only read TEXT format, which adds to data preparation cost. Another aspect is data import efficiency: how long does it take from the moment the data lands in the warehouse until it can actually be queried? How many machine resources are needed for storage and serving? All of this counts as data cost, the cost of putting the data to use.

Fourth, query flexibility. Business users often ask: what happens if a query is not covered by the Cube definition? Right now such a query can only fail.

This shows that some query patterns are not fixed: you may suddenly need to look up one number and then never query it again. Engines that require predefinition generally do not support this kind of requirement well.

This picture is an all-round comparison of various systems.

In terms of query efficiency, the worst performer is Presto, and the best are Druid and Kylin 1.5, which are comparable. In terms of functional completeness, Presto's syntax, UDFs, and so on are indeed very complete; Kylin is somewhat weaker, but better than Druid.

The differences in ease of use are not large. The main considerations are whether there is a standard SQL interface, whether deployment cost is high, and whether users can get started quickly. At present Druid's interface is indeed not friendly enough; you have to read its documentation to know how to write its query syntax.

In terms of data cost, Presto is the best because it needs almost no special processing: basically, any data Hive can read, Presto can read too, so the cost is very low. Druid and Kylin cost more because they require precomputation; in particular, if a Kylin cube has a very large number of dimensions and is not specially optimized, the resulting data volume can still be considerable.

Finally, in terms of flexibility, Presto can answer any query as long as the SQL can be written, while Druid and Kylin both require some model definition up front. This can also serve as a reference when making a choice.

I have just compared several systems objectively; now let me summarize Kylin's advantages.

First, performance is very stable. All the services Kylin depends on, such as Hive and HBase, are very mature, and Kylin's own logic is not complicated, so stability is well guaranteed; in our production environment availability stays above 99.99%. Query latency is also good: one business line now issues more than 20,000 queries per day, with 95th-percentile latency under 1 second and 99th-percentile latency within 3 seconds, which basically meets our interactive-analysis needs.

Second, a point particularly important to us is exact data. At the moment Kylin is effectively the only system that can deliver it, so we do not have many other options.

Third, in terms of ease of use, Kylin has many strengths. First, its surrounding services: as long as you run a Hadoop stack, Hive and HBase are basically already there, so no extra work is needed, and deployment, operation, maintenance, and usage costs are all relatively low. Second, there is a public web UI for model configuration. Druid, by contrast, is still configuration-file based, and configuration files are generally managed by the platform or administrators; such a system cannot be opened up to users, which is bad for communication cost and response time. Kylin exposes a general Web Server on which all users can experiment and define models, and the administrator only needs to review a cube when it goes online, which makes for a much better experience.

Fourth and last, the open, active community and the enthusiastic core developer team. Discussion in the community is very open: anyone can offer opinions and patches, fix bugs, and submit new features. Our team at Meituan has also contributed many features, such as writing to different HBase clusters. It is worth pointing out that the core team is entirely Chinese, making this the only Apache top-level project led primarily by Chinese developers. The community is active and welcoming, with many Chinese engineers, and as your contributions grow the community will invite you to become a committer. I and members of my team are already Apache Kylin committers.

I am also very glad to see Kyligence, the startup founded earlier this year by the Apache Kylin core team headed by Han Qing; I believe it will bring greater room for growth and a brighter future to the project and its community.

4. Future work

As for future work, we think Kylin still has some unsatisfactory aspects, and we will keep working to optimize and improve it.

First, exact deduplicated counting. As mentioned, only int is supported today; a solution is coming that will support all data types, which broadens the scenarios where exact deduplication can be used (additional note: this feature has been implemented in version 1.5.3).

Second, we also see room for optimization in query efficiency and build efficiency. One example is queue resource splitting: all of our compute-cluster resources are accounted for by business line, but Kylin does not support this yet; we will work on it soon, and I believe many companies have similar needs. Another is large result sets and pagination: when a result reaches millions of rows, query latency rises to tens of seconds, and the query may also need sorting and paging, that is, reading the full result and then applying ORDER BY to one of the metrics, which is also expensive. We will look for ways to optimize this as well.
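
The pattern in question looks roughly like the query below (illustrative names, reusing the earlier sketch): the full aggregated result has to be produced before the ORDER BY on a metric and the page can be taken, which is where the overhead comes from.

    -- Sorting on an aggregated metric and taking one page of a large result set.
    -- Exact pagination syntax depends on the engine; LIMIT/OFFSET is shown as one common form.
    SELECT poi_name,
           SUM(amount) AS total_amount
    FROM dw.wide_order
    WHERE dt BETWEEN '2016-01-01' AND '2016-06-30'
    GROUP BY poi_name
    ORDER BY total_amount DESC       -- order by one of the metrics
    LIMIT 100 OFFSET 200;            -- e.g. page 3 with 100 rows per page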

Finally, after Kylin 1.5 the detail-data and streaming features will be released, and we will experiment with them later.

5. Q&A

Q1: The cost of the Build was mentioned earlier. Can you give an estimate? For 10 billion rows of data, how long would it take?

Sun Yerui: Here is a simple data point: roughly 200 million rows with 14 or 15 dimensions. The build takes no more than two hours, and the data produced by the build is 500 to 600 GB.

Q2: And how large is the original data?

Sun Yerui: After extraction, the data that actually takes part in the Build is only a few GB after compression.

Q3: Have you ever encountered the problem of Kerberos authentication failure?

Sun Yerui: After Kerberos authentication there is a ticket issue in the local temporary directory; the ticket can be refreshed externally every day, without stopping the service.

Moderator: Let me add why we chose the SQL interface. Who are the real users of what Kylin solves? They are business staff and analysts, and they only know SQL; very few of them want to learn Java. So giving them standard SQL is the quickest way to get them on board. This is exactly where ease of use matters.

We have just seen Kylin's performance at the tens-of-millions and hundreds-of-millions scales. Once the data reaches billions, tens of billions, or hundreds of billions of rows, I believe Kylin will leave the others far behind. We now have another case in which a table of hundreds of billions of rows in a production environment answers 90% of queries within 1.8 seconds. I also think it is great that Meituan and JD.com have contributed many patches; they raise the requirements and everyone can work on them together.

Author introduction

Sun Yerui, senior engineer at Meituan and Apache Kylin Committer. He graduated from the University of Electronic Science and Technology of China in 2012 and then worked at Qihoo 360, where he was responsible for building the Hadoop platform. He joined Meituan in 2015 and is currently mainly responsible for improving and optimizing data production and query engines, focusing on distributed computing, OLAP analysis, and related fields; he also has rich experience with distributed storage systems.