Data Granularity in Data Warehousing
Determining the proper granularity of data is one of the most important design issues a data warehouse developer faces. Granularity refers to the level of detail at which data is stored: for example, whether a statistics department counts population at the neighborhood level or at the community level. The more detailed the data, the lower its granularity level; the more summarized the data, the higher its granularity level. Granularity is a major design issue because it strongly affects both the volume of data stored in the warehouse and the types of queries the warehouse can answer. Weighing data volume against the types of queries during design yields a reasonable granularity. Below we explain how the granularity of a data warehouse is determined through the planning and construction phases.

1. Planning phase

"Planning" - the future of the overall, long-term, basic type of thinking and consideration of the problem, the design of the future set of action programs. In the planning phase process first roughly estimate the amount of data, the purpose of the estimation is to grasp a range of the amount of data in the data warehouse. The second step predicts the granularity needed for future applications in the data mart, and the minimum granularity used by the data warehouse to store the data mart.

1.1. It is important to have a good feedback loop.

The first step is to establish a good feedback loop. A data warehouse is built against fuzzy requirements, so granularity cannot be fixed in a single pass. Start by loading a small amount of data, build a partial application, hand it to users, listen carefully to how they use it, and adjust the granularity according to their opinions.

1.2. A rough estimate of the data stored is useful for those designing the architecture.

A rough estimate of the amount of data in the data warehouse is useful for planning its architecture. If there are only 10,000 rows of data, the warehouse can store data at the lowest level of granularity, keeping all detail data. If there are 10,000,000 rows of granular data, the data entering the warehouse needs some initial summarization. If there are 10 billion rows, not only does the warehouse need a high granularity level, but it will probably also move most of the data to overflow storage.

The estimation methodology boils down to projecting the row count of each table over a planning horizon and multiplying by the estimated row size.
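A minimal sketch of such an estimate in Python follows; the table names, row counts, and row widths are hypothetical placeholders, not figures from any real system:

```python
# Rough space estimate for a planned data warehouse.
# Each entry: table name -> (rows after 1 year, rows after 5 years, bytes per row).
# All numbers here are hypothetical.
TABLES = {
    "account_transaction": (10_000_000, 50_000_000, 120),
    "account_snapshot": (1_200_000, 6_000_000, 80),
}

def estimate_space(tables):
    """Return (1-year bytes, 5-year bytes) summed over all tables."""
    one_year = sum(r1 * width for r1, _, width in tables.values())
    five_year = sum(r5 * width for _, r5, width in tables.values())
    return one_year, five_year

one_year, five_year = estimate_space(TABLES)
print(f"1-year estimate: {one_year / 1e9:.2f} GB")
print(f"5-year estimate: {five_year / 1e9:.2f} GB")
```

The two horizons matter because, as noted above, the order of magnitude of the row count (10,000 versus 10,000,000 versus 10 billion) drives the granularity and overflow decisions.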

1.3. It is necessary to predict the granularity of data that will be used in the data marts.

In order to populate all data marts appropriately, the data in the warehouse must be kept at the minimum level of granularity required by any of the marts, as the sketch below illustrates.
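As a small illustration (the mart names and their required granularities are assumptions made up for this sketch), the warehouse must keep data at the finest granularity that any mart requests:

```python
# Lower rank = finer granularity. The warehouse must store data at the
# finest (minimum-rank) granularity required by any of its data marts.
GRAIN_RANK = {"day": 0, "week": 1, "month": 2, "quarter": 3}

# Hypothetical marts and the granularity each one needs.
mart_requirements = {
    "risk_mart": "day",
    "finance_mart": "week",
    "marketing_mart": "month",
}

warehouse_grain = min(mart_requirements.values(), key=GRAIN_RANK.get)
print(f"The warehouse must store data at '{warehouse_grain}' granularity")  # -> day
```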

The results of the planning phase form an important part of the basis for constructing the data warehouse. Planning maps out the organizational structure, the data volume, and the later applications, makes it possible to develop scenarios with some notion of the likely outcomes, and lets foreseeable design problems be avoided.

2. Construction phase

2.1. Based on the estimated space requirements, the architecture design can select storage devices according to the size of the data volume: how many direct-access storage devices are needed, and whether a dual-granularity design is required.
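A minimal sketch of that decision, reusing the rough row-count thresholds from section 1.2 (the cutoffs are illustrative, not prescriptive):

```python
def choose_design(estimated_rows: int) -> str:
    """Map the rough row-count estimate to a storage design, following
    the thresholds sketched in section 1.2."""
    if estimated_rows <= 10_000:
        return "low granularity level; keep all detail data on direct-access storage"
    if estimated_rows <= 10_000_000:
        return "dual granularity: lightly summarized data on direct-access storage, detail retained"
    return "high granularity level plus overflow storage for most of the detail data"

print(choose_design(50_000_000))  # -> high granularity level plus overflow storage ...
```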

2.2. Design the management of overflow data. Overflow data is outdated, infrequently accessed data that the warehouse moves to higher-capacity but slower storage. Managing overflow data properly lets an index locate historical data easily and retrieve it quickly.

Overflow data is managed efficiently with two pieces of software: a cross-media storage manager and a data activity monitor. The movement of data between disk storage and mass storage is controlled by the cross-media storage manager (CMSM). The data activity monitor determines which data is being accessed and which is not, providing the information needed to decide where data should be stored.
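As a minimal sketch of how the two cooperate (the partition names, access times, and 90-day threshold are assumptions for illustration):

```python
from datetime import datetime, timedelta

# Hypothetical partition catalog: name -> last time any query touched it.
# Collecting these timestamps is the job of the data activity monitor.
last_access = {
    "txn_2024_q4": datetime(2025, 6, 1),
    "txn_2018_q1": datetime(2019, 2, 3),
    "txn_2016_q2": datetime(2016, 9, 9),
}

STALE_AFTER = timedelta(days=90)  # assumed policy threshold

def partitions_to_overflow(catalog, now):
    """Flag partitions not accessed recently; the cross-media storage
    manager (CMSM) would then move them from disk to slower,
    higher-capacity overflow storage."""
    return [name for name, seen in catalog.items() if now - seen > STALE_AFTER]

for name in partitions_to_overflow(last_access, datetime(2025, 7, 1)):
    print(f"CMSM: move {name} to overflow storage")
```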

2.3. Determining granularity during implementation is an iterative, back-and-forth loop. Using the feedback loop established in the planning phase, obtain constant feedback from analysts and continually refine the granularity of the data warehouse.

A successful data warehouse cannot be built without the collaborative effort of the analysts, so the builder has to listen to them constantly. Analysts do not know what they need while the warehouse is being built; only when they see actual analysis results can they tell the data warehouse staff what is really useful to them. To get feedback effectively, a few guidelines help:

- Build a very small subset of the data warehouse quickly and listen carefully to user feedback;
- Use a prototyping approach;
- Draw on the experience of others;
- Work with experienced users;
- Use the functional requirements that already exist in the organization as a reference;
- Hold regular data warehouse construction meetings.

3. A small example of granularity in banking

3.1. Granularity levels in a banking environment. The following example shows how data granularity is layered in a bank.

The bank's operational layer stores data at a granularity of days. Its various business systems keep only the last 60 days of transaction detail, so that users can query the details of the last two months of transactions; this is the period whose details users care about most.

The data warehouse layer aggregates the data into summaries at a monthly granularity. For up to ten years of past data, the bank aggregates each account's transactions by month and stores the result on direct-access storage for high-speed query access. Users do not care about the detail of transactions from long ago, but they do need query results quickly, and monthly summary data meets that need; a sketch of this roll-up follows.
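A minimal sketch of that aggregation step, with hypothetical accounts and amounts:

```python
from collections import defaultdict
from datetime import date

# Hypothetical daily transaction detail: (account, transaction date, amount).
transactions = [
    ("acct-001", date(2025, 1, 3), 120.00),
    ("acct-001", date(2025, 1, 17), -40.50),
    ("acct-001", date(2025, 2, 2), 300.00),
    ("acct-002", date(2025, 1, 9), 75.25),
]

# Roll daily detail up to the warehouse grain: one row per (account, month).
monthly = defaultdict(lambda: {"count": 0, "total": 0.0})
for account, d, amount in transactions:
    key = (account, d.year, d.month)
    monthly[key]["count"] += 1
    monthly[key]["total"] += amount

for (account, year, month), s in sorted(monthly.items()):
    print(f"{account} {year}-{month:02d}: {s['count']} txns, net {s['total']:+.2f}")
```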

All historical detail is stored in the overflow storage area at day granularity; this is a very large amount of data and it is accessed very rarely. Banks generally do not take requests to query up to ten years of historical detail, and if special circumstances require querying history that old, the query will be quite slow.
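Putting the three layers together, a minimal sketch of routing a query to the right tier (the 60-day and ten-year cutoffs come from the example above; the date arithmetic is illustrative):

```python
from datetime import date, timedelta

TODAY = date(2025, 7, 1)  # assumed "current" date for the example

def storage_tier(txn_date: date, need_detail: bool) -> str:
    """Route a query to the tier described in the banking example:
    operational layer (last 60 days of daily detail), data warehouse
    (up to ~10 years of monthly summaries), or overflow storage
    (full day-level history, slow to access)."""
    age = TODAY - txn_date
    if need_detail and age <= timedelta(days=60):
        return "operational layer (daily detail)"
    if not need_detail and age <= timedelta(days=3650):
        return "data warehouse (monthly summaries)"
    return "overflow storage (daily detail, slow)"

print(storage_tier(TODAY - timedelta(days=10), need_detail=True))    # operational layer
print(storage_tier(TODAY - timedelta(days=400), need_detail=False))  # data warehouse
print(storage_tier(TODAY - timedelta(days=4000), need_detail=True))  # overflow storage
```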

4. Summary

Determining the granularity of a data warehouse is a difficult process: the level must be suitable, neither too high nor too low.

Choosing a granularity level is largely a matter of common sense. Before construction, do proper planning: estimate the data volume and set up an appropriate feedback loop. During implementation, start by building a small part of the data warehouse and letting analysts use it; then listen to them and adjust the granularity level based on their feedback.