Current location - Loan Platform Complete Network - Big data management - What are the similarities and differences between a data warehouse and a database?
What are the similarities and differences between a data warehouse and a database?
Database:It is a logical concept to store the data in a warehouse, which is realized by database software. Database consists of many tables, tables are two-dimensional, a table with many fields inside. The fields are lined up one by one, and the data is written to the table line by line. Database tables, lies in the ability to use two-dimensional performance of multi-dimensional relationships. For example: oracle, DB2, MySQL, Sybase, MSSQL Server and so on.

Data warehouse: is an upgrade of the database concept. From a logical understanding, there is no difference between the database and data warehouse, are realized through the database software to store the data place, only from the amount of data, data warehouse than the database is much more huge de more. The data warehouse is mainly used for data mining and data analysis, to assist the leadership to make decisions.

And then look at the different applications

Business database is operation-oriented, mainly serving the business products and development. While the data warehouse is analysis-oriented, mainly serving analysts. Evaluation of the data warehouse to do a good job, depends on the analysts to use the pleasure or not. Therefore, the data warehouse from the beginning of the product design , has always stood in the position of analysts to consider , is committed to solving the use of business data for analysis of the various shortcomings . For more details, check out?1, which mentions ETL, which is briefly described in the following block.

ETL is an acronym for Extract-Transform-Load, which describes the process of extracting (extract), transforming (transform), and loading (load) data from the source to the destination.

The term ETL is more commonly used in the context of data warehousing, but the object is not limited to data warehousing.

Description of (author's source unknown)

Problems solved by data warehouses

Structural clarity and simplicity

Data warehouses don't need to follow the database design paradigm, and therefore have a great deal of freedom in designing the data model.

The data model is generally a star model, with tables divided into two categories: fact tables and dimension tables.

The fact table is located at the center of the star and stores various metrics that describe the business situation.

Dimension tables surround the fact tables and are related one-to-one by foreign keys, providing different perspectives on the business situation.

The star model is easy to use, easy to understand, and focused on the business.

When we do data analysis, we first select a topic, such as analyzing user registration; then we find the corresponding business data source based on the selected topic, and then we observe what analysis perspectives the business data source provides, and finally we analyze the data based on the data.

The star model is perfect for this idea and greatly simplifies the process. Here's an example of our current model.

Reusable and easy to expand

The star model is not only easy to understand and use, but the dimension table is also easy to reuse and the fields in the dimension table are easy to expand.

For example, the date dimension table can be used not only by different fact tables, but also in the same fact table can be reused, such as a fact table of different dates of operation, an order for a product has a date of creation, date of payment, date of shipment, time of refund, time of receipt and so on.

The fields in a dimension table are easy to extend, as long as the primary key of the dimension data remains unchanged, adding new fields directly in the dimension table can be added, and the new content added will only affect the dimension table. Moreover, dimension tables usually have a small amount of data, so even if they are completely reloaded, it doesn't take much time.

Clean data

Unclean data is removed or labeled during the ETL process, making it easier to use.

Note: Since data cleaning requires the establishment of certain rules, and the current focus is on data modeling and ETL system design, there is no additional time and effort to design cleaning rules. In order to ensure the integrity of the data, cleaning is not done in the current ETL.

Data Semantics/Unified Description

All kinds of states can be directly written as specific values, no longer need to use opcodes for querying, and SQL statements are more natural and easier to understand.

For some of the commonly used combination of states, they can be merged into one field to represent them. For example, in the repayment analysis, you need to filter out valid orders based on the combination of repayment status, release status/shipment status, you can directly set up an order valid field to simplify the filtering conditions.

For the same meaning of the data in different contexts, it can also be described uniformly. For example, for the description of the date of disbursement, when the product is a consumer loan, it refers to the date of shipment, and when the product is a cash loan, it refers to the date of disbursement to the user. The fact that both dates indicate the date of disbursement can be harmonized, again simplifying the filtering criteria.

Keeping History

Data warehouses can record business state changes in the form of zipper tables, or even dedicated fact tables can be designed to do so. Whenever there is a need for historical analysis, you can go ahead and implement it.

High-speed querying

The data warehouse itself does not provide high-speed querying capabilities. It is only because of its simple star structure that it has an advantage in speed over the complex queries of business databases. If the traditional relational database is still used to store data. After the volume of data on the scale, the same problem of slow queries will be encountered.

However, using Hive to store data, and then using the multidimensional query engine Kylin built on Hive, the results of all possible query scenarios under the star model are saved, and using space for time, you can do high-speed queries, and the time consumed for large-scale querying can be shortened to sub-seconds, which greatly improves work efficiency.