The first layer of data synchronized from source systems into the data warehouse is usually called the ODS layer or staging layer; Alibaba refers to it collectively as ODS. Data drift is a persistent problem with ODS data: the partition for a given business date in an ODS table contains data from the previous day or from shortly after midnight of the following day, or is missing some of that day's change data.
Because the ODS layer has to serve historical detail queries, it is physically materialized in the data warehouse, and each ODS table is stored in partitions cut by time period. The usual practice is to cut the partitions by some timestamp field, but in practice data drift often occurs because these timestamp fields are not as accurate as we assume.
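As a minimal illustration of how cutting partitions by a timestamp field produces drift, here is a small Python/pandas sketch with invented data: the second order's business time (proc_time) is on November 11th, but because it was last modified just after midnight, a partition key derived from modified_time puts it on the 12th.

```python
# Hypothetical sample data: slicing by modified_time vs. the business time.
import pandas as pd

rows = pd.DataFrame([
    {"order_id": 1, "proc_time": "2021-11-11 20:15:00",
     "modified_time": "2021-11-11 20:15:02"},
    {"order_id": 2, "proc_time": "2021-11-11 23:59:59",
     "modified_time": "2021-11-12 00:00:03"},
])

# Derive the partition key (dt) from modified_time, a common but drift-prone choice.
rows["dt"] = pd.to_datetime(rows["modified_time"]).dt.date.astype(str)
print(rows[["order_id", "proc_time", "dt"]])
# Order 2 lands in the 2021-11-12 partition even though its business date is 2021-11-11.
```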
Typically, these timestamp fields fall into four categories:
(1) modified_time: the field in the database table that marks when a data record was last updated;
(2) log_time: the timestamp in the database log that marks when the record update was logged;
(3) proc_time: the field in the table that records when the business process itself happened (order time, payment time, and so on);
(4) extract_time: the time at which the record was extracted into the data warehouse.
Theoretically, these times should be consistent, but in practice they often differ: data extraction takes time, so extract_time lags behind the others, and network or system delays can push log_time or modified_time later than proc_time.
The common practice is to cut the ODS partitions by just one of these fields, and that is what leads to data drift. Let's look at a few typical data drift scenarios.
There are two main ways to deal with it:
(1) Redundantly fetch some of the next day's data. Since it is difficult to eliminate data drift at the source, each ODS time partition redundantly includes some data from before and after the partition boundary, so that the partition can only contain more data, never less; the precise cut is left to downstream consumers, which limit the redundant data with the business time (proc_time) appropriate to their own scenarios (a minimal sketch of this approach follows after item (2) below). This approach can still produce errors, though. For example, an order is paid on the day in question, but in the early hours of the next day a refund is requested and the order is closed; that order record's status is updated, so a downstream job that counts paid orders by status will get a wrong result.
(2) Limit the time range with multiple timestamp fields to get relatively accurate data.
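Here is the sketch promised for approach (1): a minimal Python/pandas illustration under assumed names (an ods_order DataFrame with a dt partition column and a proc_time business-time field). It is only a sketch of the idea, not an actual implementation.

```python
import pandas as pd

def load_with_redundancy(ods_order: pd.DataFrame, biz_date: str) -> pd.DataFrame:
    """Read the biz_date partition plus its neighbours, so drifted rows are
    included: the partition can only have more data, never less."""
    day = pd.Timestamp(biz_date)
    wanted = {str((day + pd.Timedelta(days=d)).date()) for d in (-1, 0, 1)}
    return ods_order[ods_order["dt"].isin(wanted)]

def limit_by_proc_time(df: pd.DataFrame, biz_date: str) -> pd.DataFrame:
    """Downstream consumers then narrow the redundant data with their own
    business time (proc_time here)."""
    proc_date = pd.to_datetime(df["proc_time"]).dt.date.astype(str)
    return df[proc_date == biz_date]

# Usage: rows whose business date is 2021-11-11, even if they drifted
# into the neighbouring partitions.
# day_orders = limit_by_proc_time(load_with_redundancy(ods_order, "2021-11-11"),
#                                 "2021-11-11")
```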
Approach (2) is best illustrated with a real-world example of dealing with data drift in Taobao transaction orders.
When processing "Double 11" orders, we found that a large number of orders paid around 23:59:59 on November 11th drifted into the 12th. The main reason is that after the user places the order, the system has to call Alipay's interface, and that call is delayed, so the final writes for these orders cross midnight. In other words, modified_time and log_time are both later than proc_time.
If an order had only a single payment step, we could simply filter on the payment time to get the correct data. But a real order usually goes through multiple business processes: order placement, payment, and completion.
Each business process has its own timestamp field, and the payment data is not the only data that drifts. If we simply fetch the next day's data and filter on these times, we can pull the drifted records back, but a record in the next day's data may already have been updated several times, so the record we fetch directly is not necessarily accurate.
Therefore, depending on the actual situation, we can take the first 15 minutes of the next day's data, restrict the timestamp fields of every business process (order placement, payment, completion) to fall within the "Double 11" day, then sort each order's records in ascending order of modified_time and keep the record of the first data change for each order.
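As a rough illustration of this first cut, here is a minimal Python/pandas sketch; the column names (order_id, order_time, pay_time, finish_time, modified_time) and the business date are assumptions, and extraction of the 15-minute window itself is taken as already done upstream.

```python
import pandas as pd

BIZ_DATE = "2021-11-11"

def first_change_of_drifted_orders(next_day_head: pd.DataFrame) -> pd.DataFrame:
    """next_day_head: the first ~15 minutes of data that landed on 2021-11-12.
    Keep only rows whose business-process timestamps (order placement, payment,
    completion) all fall on the business day, then take each order's first
    change by modified_time."""
    mask = pd.Series(True, index=next_day_head.index)
    for col in ("order_time", "pay_time", "finish_time"):
        # NOTE: real data may have NULL business timestamps (e.g. an order
        # that is not yet completed); those would need explicit handling.
        mask &= pd.to_datetime(next_day_head[col]).dt.date.astype(str) == BIZ_DATE
    drifted = next_day_head[mask]
    # Ascending modified_time, first row per order = the first data change
    # after the order drifted across midnight.
    return (drifted.sort_values("modified_time")
                   .drop_duplicates("order_id", keep="first"))
```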
In addition, based on log_time we redundantly include the last 15 minutes of the previous day and the first 15 minutes of the next day's early morning, filter out the records whose modified_time is not on the same day, then sort each order's records in descending order of log_time and keep the record of each order's last data change on that day.
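A matching sketch for this second cut, with the same assumed column names; the input is assumed to be the data selected by log_time, already redundantly covering the previous day's last 15 minutes and the next day's first 15 minutes.

```python
import pandas as pd

BIZ_DATE = "2021-11-11"

def last_change_on_the_day(log_time_window: pd.DataFrame) -> pd.DataFrame:
    """log_time_window: data selected by log_time, redundantly covering the
    previous day's last 15 minutes and the next day's first 15 minutes.
    Drop rows whose modified_time is not on the business day, then keep each
    order's last change of the day, ordered by log_time."""
    same_day = (pd.to_datetime(log_time_window["modified_time"])
                  .dt.date.astype(str) == BIZ_DATE)
    kept = log_time_window[same_day]
    # Descending log_time, first row per order = the order's last change that day.
    return (kept.sort_values("log_time", ascending=False)
                .drop_duplicates("order_id", keep="first"))
```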
Finally, we do a full outer join of the two result sets on the order and patch the drifted data back into the same-day data.
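And a sketch of this final step; treating "patch the drifted data back" as a coalesce (prefer the same-day record, fall back to the drifted order's first post-midnight change) is my own reading, not necessarily the original implementation.

```python
import pandas as pd

def restore_drifted_orders(last_changes: pd.DataFrame,
                           first_changes: pd.DataFrame) -> pd.DataFrame:
    """Full outer join the two result sets on order_id and patch the drifted
    orders back into the business day's data: keep the same-day record where
    one exists, otherwise fall back to the order's first post-midnight change."""
    merged = last_changes.set_index("order_id").join(
        first_changes.set_index("order_id"), how="outer", rsuffix="_drifted")
    for col in last_changes.columns:
        drifted_col = f"{col}_drifted"
        if col != "order_id" and drifted_col in merged.columns:
            # Coalesce: prefer the same-day value, fall back to the drifted one.
            merged[col] = merged[col].fillna(merged[drifted_col])
    keep = [c for c in last_changes.columns if c != "order_id"]
    return merged[keep].reset_index()
```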
========================== split line ==========================
The above is rather abstract, and the example is hard to follow because it is not backed by concrete data. Below I will explain the example a bit further, using my own understanding and an imagined version of the actual data.
The first scenario: around 23:59:59 on Double 11 there is a large number of payment orders, and because the call chain is long and there are network delays and other issues, the data that finally lands in the warehouse actually drifts into the 12th.
In my understanding, the data here should come from the MySQL database's binlog.
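To make the explanation easier to follow, here is what I imagine a few change records reconstructed from the binlog might look like for one such order; every value is invented for illustration.

```python
# Imagined change records for one order, roughly as they might be
# reconstructed from the MySQL binlog (all values are made up).
changes = [
    {"order_id": 1001, "status": "paid",
     "proc_time": "2021-11-11 23:59:59",      # business time: still the 11th
     "modified_time": "2021-11-12 00:00:03",  # row updated after midnight
     "log_time": "2021-11-12 00:00:04"},      # binlog entry written after midnight
    {"order_id": 1001, "status": "refund_closed",
     "proc_time": "2021-11-11 23:59:59",
     "modified_time": "2021-11-12 01:30:00",  # a later update to the same order
     "log_time": "2021-11-12 01:30:01"},
]
# Cutting partitions by modified_time or log_time puts both records on the
# 12th, even though the payment itself belongs to the 11th; and if we naively
# pull the drifted records back, we might pick up the later "refund_closed"
# version instead of the "paid" one.
```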