Question: what are the future prospects for developing warehouse management software?
The Current State and Development Prospects of Data Warehouse Technology

---- It has been more than thirty years since the role of computer systems expanded from numerical computation to data management. Data management initially took the form of file systems; later, associations and semantics were added between small pieces of data to form hierarchical or network databases. Access to the data, however, depended on specific programs, and the access paths were fixed and rigid. In 1970, Dr. E.F. Codd published his famous paper on the relational data model. The relational databases that followed ushered in a new era of data management.

---- Over the following two decades, a large number of new technologies and ideas emerged and were applied to the development and implementation of relational database systems: client/server architecture, stored procedures, multi-threaded concurrent kernels, asynchronous I/O, cost-based optimization, and so on. Together these gave relational database systems processing power no less than that of traditional closed database systems, while the benefits of relational databases in access logic and application development went far beyond them. The use of SQL became an unstoppable trend, and, coupled with an order-of-magnitude increase in the processing power of computer hardware in recent years, relational databases eventually came to dominate online transaction processing (OLTP) systems. Throughout the 1980s and into the early 1990s, online transaction processing was the dominant database application. Applications, however, continued to advance. Once online transaction processing systems had reached a certain stage, business leaders found that an OLTP system alone was no longer enough to gain a competitive advantage in the market. They needed to analyze their own business operations and the state of the whole market and industry in order to make favorable decisions. Such decisions require analyzing a large amount of business data, including historical business data. In today's highly competitive market environment, this kind of decision analysis based on business data, which we call online analytical processing (OLAP), is more important than ever. If traditional online transaction processing emphasizes updating the database, that is, putting information into the database, then online analytical processing is about getting information out of the database and using it. As the noted data warehousing expert Ralph Kimball writes, "We've spent more than two decades putting data into databases, and now it's time to get them out."

---- In fact, applying large amounts of business data to analysis and statistics was originally a simple and natural idea. In practice, however, obtaining useful information turns out to be harder than one might think. First, online transaction processing emphasizes the performance of intensive data updates and system reliability, and pays little attention to the convenience and speed of queries. Online analysis and transaction processing place different demands on a system, and in theory it is difficult for the same database to serve both. Second, business data is often scattered across heterogeneous environments that cannot easily be queried in a unified way, and a large amount of historical data sits offline, where it is virtually useless. Third, the schema of business data is designed for the transaction processing system, so the format and description of the data are not suited to business analysis and statistics by people who are not computer professionals. Hence the lament: twenty years ago you could not find the data you wanted because there was too little of it, and today you cannot find it because there is too much. In response, it was proposed to establish a data center dedicated to the statistical analysis of the business, with data drawn from online transaction processing systems, from heterogeneous external data sources, from offline historical business data, and so on. This data center is an online system dedicated to analysis, statistics, and decision support applications, and through it everything required for decision support and online analytical applications can be obtained. Such a data center is called a data warehouse. The concept was introduced in the early 1990s. If a definition is needed, a data warehouse is a structured data environment that serves as the data source for decision support systems and online analytical applications. The problem that data warehouses are designed to study and solve is that of obtaining information from databases.

---- So what is the relationship between a data warehouse and a database (mainly a relational database)? Looking back, people stuck with closed systems out of a preference for transaction processing, and chose relational databases to make information easier to access. Open Dr. C.J. Date's classic work "An Introduction to Database Systems" and you will find that what today's data warehouse sets out to provide is exactly what relational databases advocated in their early days. However, as the Chinese saying goes, "success is due to Xiao He, and so is failure": the cause of one's rise is also the cause of one's fall. Because relational database systems were so successful in online transaction processing, people unconsciously came to classify them as transaction processing systems, and so much attention went to improving transaction processing capability that relational databases facing online analytical applications resemble "an old revolutionary meeting a new problem". Today's data warehouses place higher demands on the online analysis capabilities of relational databases. An ordinary relational database used as a data warehouse falls short in both functionality and performance and needs specialized improvements. The difference between data warehouses and databases therefore lies not only in the method and purpose of the application, but also in products and configurations.

---- Seen with a discerning eye, the rise of the data warehouse is really a return in data management, a turn of the spiral. Today's databases are like the hierarchical and network databases of the past, oriented toward transaction processing; today's data warehouses are like the relational databases of the past, aimed at online analysis. The difference is that today's data warehouses need not be preoccupied with the demands of online transaction processing; thanks to this specialization, they can concentrate on developing and exploring the field of online analysis.

---- From the vendors' perspective, after a long period of development the market for online transaction processing systems showed signs of saturation by the mid-1990s, and its growth slowed significantly. The traditional business of the major database vendors therefore faced serious challenges, and finding new sources of growth became their top priority. The rise of data warehousing has undoubtedly created a huge market for database products, one that will be a new growth point for the database market from the end of this century into the beginning of the next. As a result, the concept of data warehousing has been accompanied by strong market hype from the very beginning. Users can obtain satisfactory products, solutions, and economic benefits only if they start from their own application needs, see through the mystique of technologies and concepts, favor substance over hype, and follow the direction of technological development closely.

---- Once the concept of data warehousing appeared, it was first applied in finance, telecommunications, insurance, and other traditionally data-intensive industries. Many large data warehouses abroad were established in 1996 and 1997. So which industries most need a data warehouse and are most likely to build one? There are two basic conditions. First, the industry has fairly mature online transaction processing systems, which provide the objective conditions for a data warehouse. Second, the industry faces the pressure of market competition, which provides the external motivation for building one.

Key Technologies for Data Warehousing

---- So what are the components and key technologies of a data warehouse? Unlike relational databases, data warehouses have no rigorous mathematical foundation; they are more a matter of engineering. Because of this engineering nature, data warehouse technology can be divided into four areas according to the working process: data extraction, storage and management, data representation, and technical consulting on data warehouse design. We discuss each in turn.

---- 1. Data Extraction

---- Data extraction is the entry point for data into the warehouse. Because the data warehouse is an independent data environment, it needs an extraction process to import data from online transaction processing systems, external data sources, and offline storage media. Technically, data extraction involves interconnection, replication, incremental extraction, transformation, scheduling, and monitoring. The data in the warehouse does not need to stay synchronized in real time with the online transaction processing system, so extraction can run on a schedule. However, the timing, order, and success or failure of the multiple extraction operations are critical to the validity of the information in the warehouse.
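
---- The incremental, scheduled extraction described above can be sketched roughly as follows. This is a minimal illustration, not a description of any vendor's tool: the tables `orders` and `fact_orders`, the function `extract_increment`, and the use of an increasing `id` as the watermark are all assumptions made for the example, and two in-memory SQLite databases stand in for the transaction system and the warehouse.

```python
import sqlite3

def extract_increment(source, warehouse, last_id):
    """Copy rows newer than last_id into the warehouse; return the new watermark."""
    rows = source.execute(
        "SELECT id, customer, amount_cents FROM orders WHERE id > ? ORDER BY id",
        (last_id,),
    ).fetchall()
    # Transformation step (hypothetical rules): tidy names, convert cents to units.
    cleaned = [(i, c.strip().upper(), a / 100.0) for i, c, a in rows]
    with warehouse:  # one transaction: the batch is loaded completely or not at all
        warehouse.executemany(
            "INSERT INTO fact_orders (order_id, customer, amount) VALUES (?, ?, ?)",
            cleaned,
        )
    return rows[-1][0] if rows else last_id

# Hypothetical source (transaction system) and target (warehouse).
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount_cents INTEGER)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)", [(1, " alice ", 1250), (2, "bob", 800)]
)
source.commit()
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE fact_orders (order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)

watermark = extract_increment(source, warehouse, 0)          # first scheduled run
source.execute("INSERT INTO orders VALUES (3, 'carol', 4300)")
source.commit()
watermark = extract_increment(source, warehouse, watermark)  # copies only order 3
loaded = warehouse.execute(
    "SELECT order_id, customer, amount FROM fact_orders ORDER BY order_id"
).fetchall()
```

---- Keeping the watermark outside the load transaction is a simplification; a real extraction job would also need to record the success or failure of each run, since the text notes that this is critical to the validity of the warehouse.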

---- In terms of technology, the individual techniques involved in data extraction are relatively mature, and some of them inevitably require programming, but the overall level of integration is still far from sufficient. Most of what the market currently offers are data extraction tools, which automatically generate extraction code from the correspondence the user defines between source data and target data. The data types these tools support are limited, however. In addition, extraction involves data transformation, which is closely tied to the actual application and complex enough that tools which cannot embed user programming often fail to meet requirements. In practice, therefore, data warehouse implementations do not necessarily use extraction tools. More important than whether a tool is used is whether the whole extraction process can be brought under effective management, scheduling, and maintenance. In terms of the market, vendors that offer only data extraction and heterogeneous interconnection products are likely to be absorbed by companies with database products. In the world of data warehousing they can play only a supporting role.

---- 2. Storage and Management

---- The real key to data warehousing is the storage and management of data. The way a data warehouse is organized and managed determines both the characteristics that set it apart from a traditional database and the way it presents data externally. Deciding which products and technologies to use for the core of the data warehouse requires starting from an analysis of the warehouse's technical characteristics.

---- The first problem a data warehouse faces is storing and managing large amounts of data. The volume involved is much larger than in traditional transaction processing, and it accumulates over time. Among existing technologies and products, only relational database systems can take on this task. After nearly thirty years of development, relational databases are very mature in data storage and management, unmatched by other data management systems. Many relational database systems now support data partitioning, which spreads a large table across several physical storage devices and further improves the system's ability to scale to large data volumes. Using relational databases to manage hundreds of gigabytes or even terabytes of data has become commonplace. Some vendors also give special attention to backing up large data volumes. Fortunately, data warehouses place few demands on online backup.

---- The second problem a data warehouse has to solve is parallel processing. In traditional online transaction processing, user requests are short and frequent; for a multiprocessor system, the key is to spread user requests evenly across the processors, which is concurrent operation. In a data warehouse, by contrast, requests are large and infrequent: each query or statistical computation is complex, but such requests do not arrive very often. Here the system needs to be able to mobilize all of its processors to serve a single complex query, processing that one request in parallel. Parallel processing is therefore more important than ever in data warehousing. You may have noticed that the TPC-D benchmark for data warehouses includes a single-user test called "system power" (QppD), and a system's parallel processing capability has a significant impact on its QppD value. Relational database systems today can parallelize by decomposing the query statement and by data partitioning, and they support cross-platform multiprocessor cluster environments and MPP environments, scaling to hardware with hundreds of processors while maintaining performance scalability.
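
---- The idea of putting several workers onto one query by way of data partitioning can be sketched as below. This is only an illustration of the structure, under stated assumptions: the helper names (`partition`, `partial_sum`, `parallel_total`) and the sample data are hypothetical, and Python threads are used for brevity even though, in CPython, they would not actually speed up a CPU-bound scan the way separate processors in an MPP system would.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(rows, n):
    """Split rows into n partitions round-robin (a stand-in for data partitioning)."""
    return [rows[i::n] for i in range(n)]

def partial_sum(rows, region):
    """Work done by one worker: scan only its own partition."""
    return sum(amount for r, amount in rows if r == region)

def parallel_total(rows, region, workers=4):
    """Answer a single query by combining partial results from all workers."""
    parts = partition(rows, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(partial_sum, parts, [region] * workers)
        return sum(partials)

# Hypothetical fact rows: (region, sales amount).
sales = [("north", 10), ("south", 5), ("north", 7), ("east", 3), ("north", 1)]
total_north = parallel_total(sales, "north")
```

---- Note the contrast with transaction processing: here every worker serves the same request, and the final answer exists only after the partial results are merged.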

---- The third problem of data warehousing is optimization for decision support queries. This issue concerns mainly relational databases, since other data management environments lack even basic general-purpose query capabilities. Technically, optimization for decision support touches many parts of the database system, including indexing mechanisms, the query optimizer, join strategies, sorting, and sampling. Ordinary relational databases use B-tree indexes, which are almost ineffective for fields with many duplicate values, such as gender, age, and region. Extended relational databases introduce bitmap indexes, which represent the state of a field with binary bits and turn the query into a filtering process in which a single basic machine operation can filter many records at once. Because the amount of data in the different tables of a data warehouse is often very uneven, the query path chosen by an ordinary query optimizer may not be the best one. Relational databases oriented toward decision support have therefore improved their query optimizers and, reflecting how indexes are used, added the ability to scan multiple indexes. Data warehouses built on relational databases encounter a large number of joins between tables, which are time-consuming. Extended relational databases can define join operations in advance, in what we call a join index, so that during query execution the database can access the data directly without performing the join itself. Data warehouse queries often need only some of the records, such as the 50 largest customers. Ordinary relational databases do not provide this kind of query and have to sort the whole table, which takes a great deal of time. Relational databases oriented toward decision support have improved here by providing this capability. In addition, queries in a data warehouse need not be as exact as those in a transaction processing system, but they do need sufficiently short response times over large data volumes. Some database systems have therefore added the ability to query sampled data, which greatly improves query efficiency within the limits of acceptable accuracy. In short, a great deal of work is needed to turn an ordinary relational database into a server suited to the role of a data warehouse, and this has become an important research topic and development direction in relational database technology. Extending traditional relational databases with decision support capabilities is thus an important technical step for entering the data warehouse market.
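
---- To make the bitmap index idea concrete, here is a minimal sketch in which each distinct value of a low-cardinality field gets one bitmap, stored as a Python integer whose bit i is set when row i holds that value. The function names and sample rows are hypothetical, and real database products use compressed, disk-resident bitmaps rather than plain integers.

```python
def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to an int with bit i set if row i matches."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index

def matching_rows(bitmap):
    """Decode a bitmap back into a list of row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical customer records with fields that have many duplicate values.
customers = [
    {"gender": "F", "region": "north"},
    {"gender": "M", "region": "north"},
    {"gender": "F", "region": "south"},
    {"gender": "F", "region": "north"},
]
by_gender = build_bitmap_index(customers, "gender")
by_region = build_bitmap_index(customers, "region")

# WHERE gender = 'F' AND region = 'north' becomes one bitwise AND over all rows.
hits = matching_rows(by_gender["F"] & by_region["north"])
```

---- The bitwise AND compares every row in one operation, which is the sense in which the query becomes a filtering process rather than a record-by-record lookup.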

---- The fourth problem of data warehousing is supporting the query pattern of multidimensional analysis, one of the most serious challenges relational databases face in this field. Users access a data warehouse very differently from a traditional relational database. Access is often not a simple query of tables and records but a mode of analysis based on the user's business, that is, online analysis. As shown in the accompanying figure, the data is imagined as a multidimensional cube. A user's query amounts to imposing conditions on some of the dimensions (axes) in order to slice and dice the cube, and the result is a matrix or vector of values that is then turned into a chart or fed into statistical algorithms.
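
---- The cube picture above can be illustrated with a small sketch. The three dimensions (region, product, quarter), the sample figures, and the function `slice_and_aggregate` are all assumptions made for the example; the point is only that fixing some dimensions and keeping others as axes yields a vector or matrix of values.

```python
from collections import defaultdict

# Each cell of the hypothetical cube: (region, product, quarter) -> sales amount.
DIMS = ("region", "product", "quarter")
cube = {
    ("north", "tea", "Q1"): 10,
    ("north", "tea", "Q2"): 12,
    ("north", "coffee", "Q1"): 7,
    ("south", "tea", "Q1"): 4,
    ("south", "coffee", "Q2"): 9,
}

def slice_and_aggregate(cube, keep, **fixed):
    """Fix some dimensions to given values, keep others as axes, sum over the rest."""
    result = defaultdict(int)
    for coords, value in cube.items():
        cell = dict(zip(DIMS, coords))
        if all(cell[dim] == wanted for dim, wanted in fixed.items()):
            result[tuple(cell[dim] for dim in keep)] += value
    return dict(result)

# A vector: total sales per region in Q1 (slice on quarter, roll up product).
q1_by_region = slice_and_aggregate(cube, keep=("region",), quarter="Q1")
# A matrix: tea sales by region and quarter (slice on product).
tea_matrix = slice_and_aggregate(cube, keep=("region", "quarter"), product="tea")
```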

---- Relational databases do not themselves provide query functionality for this kind of multidimensional analysis. In the early days of data warehousing, implementing the multidimensional query model on a relational database was found to be very inefficient, and the query processing was hard to automate. This led to the concept of the multidimensional database. A multidimensional database is a data management system that organizes data in a multidimensional storage form. It is not a relational database: data must first be loaded from the relational database into the multidimensional database before it can be accessed. Online analytical applications built on a multidimensional database are called MOLAP. Multidimensional databases work well for small multidimensional analysis applications, but they lack the parallel processing and large-scale data management scalability of relational databases, so they can hardly support large data warehouse applications. This situation changed completely only once the "star schema" came into wide use in relational database design. A few years ago, data warehousing experts found that a relational database whose data is organized in a star schema can handle multidimensional analysis well. The star schema is simply a pattern of relationships between tables in database design. Its ingenuity lies in the existence of a fixed algorithm that converts a user's multidimensional query request into a standard SQL statement against that schema, and that statement can be optimized. The star schema gave relational databases the green light in data warehousing. Online analytical applications implemented on relational databases are called ROLAP, and most vendors' data warehousing solutions today use ROLAP.
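
---- The following sketch shows, under simplifying assumptions, how a multidimensional request can be translated mechanically into one SQL statement over a star schema. The schema (one fact table `fact_sales` joined to `dim_store` and `dim_product`), the `DIMENSIONS` mapping, and the function `star_query` are all hypothetical; it is not the algorithm of any particular ROLAP product, only an instance of the general idea.

```python
import sqlite3

# Dimension name -> (dimension table, key column shared with the fact table).
DIMENSIONS = {
    "region": ("dim_store", "store_id"),
    "product": ("dim_product", "product_id"),
}

def star_query(group_by, filters):
    """Build one SQL statement for a request; group_by must be non-empty."""
    used = list(dict.fromkeys(list(group_by) + list(filters)))
    # Dimension names come only from DIMENSIONS (unknown names raise KeyError),
    # while filter values are passed separately as bound parameters.
    joins = " ".join(
        f"JOIN {DIMENSIONS[d][0]} USING ({DIMENSIONS[d][1]})" for d in used
    )
    cols = ", ".join(group_by)
    where = " AND ".join(f"{d} = ?" for d in filters) or "1 = 1"
    sql = (
        f"SELECT {cols}, SUM(amount) FROM fact_sales {joins} "
        f"WHERE {where} GROUP BY {cols} ORDER BY {cols}"
    )
    return sql, list(filters.values())

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dim_store (store_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product TEXT);
    CREATE TABLE fact_sales (store_id INTEGER, product_id INTEGER, amount INTEGER);
    INSERT INTO dim_store VALUES (1, 'north'), (2, 'south');
    INSERT INTO dim_product VALUES (1, 'tea'), (2, 'coffee');
    INSERT INTO fact_sales VALUES (1, 1, 10), (1, 2, 7), (2, 1, 4), (1, 1, 12);
""")

# "Tea sales by region": slice on product, keep region as the axis.
sql, params = star_query(group_by=["region"], filters={"product": "tea"})
tea_by_region = db.execute(sql, params).fetchall()
```

---- Because every request has the same shape (the fact table joined to the dimensions it mentions, then filtered and grouped), the translation needs no case-by-case programming, and the database is free to optimize the resulting statement.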

---- In data storage and management for data warehousing, current technological development indicates that parallel relational databases with decision support extensions will be the core of the data warehouse. In the market, database vendors will be the backbone of data warehousing.

---- 3. Data Representation

---- Data representation is the face of the data warehouse, and it is the territory of tool vendors. Their focus is on multidimensional analysis, mathematical statistics, and data mining.

---- Multidimensional analysis is an important form of presentation for the data warehouse. Because MOLAP systems are specialized, most tools and products for multidimensional analysis are ROLAP tools. Over the last two years these products have concentrated on providing Web-based front ends for online analysis, rather than merely publishing data online.

---- Mathematical statistics originally had no direct connection with data warehousing, but in practice customers need statistics on their data to validate their assumptions before making decisions. Like mathematical statistics, data mining is not directly tied to data warehousing, and in practice the concept is somewhat vague. Data mining emphasizes not just validating people's assumptions about the characteristics of the data, but actively searching for and discovering the patterns hidden in it. This sounds appealing, but what is actually implemented is quite different. Many data mining tools on the market are really nothing more than applications of mathematical statistics. Instead of actually discovering patterns in the data, they test as many hypotheses as possible, including many meaningless combinations, and leave it to a human to judge which results are reasonable. In current data warehousing applications, therefore, effective use of mathematical statistics alone can already yield considerable benefits.

---- 4. Technical Consulting on Data Warehouse Design

---- There are some fundamental questions that must be answered when implementing a data warehouse. Which departments will the data warehouse serve? How can different departments benefit from it in their decision making? What data needs to be stored in the warehouse? In what structure should the data be stored? Where will the data be loaded from? How often should it be loaded? What data management products and tools need to be purchased to build the warehouse? And so on. The answers depend on the particular data warehouse system and fall within the scope of technical consulting.

---- In fact, a data warehouse is never a simple pile of products; it is a comprehensive solution and a systems engineering effort. Technical consulting is crucial and indispensable when implementing a data warehouse, even more important than the purchase of products. At present, such consulting comes mainly from vendors of data warehouse software products and from independent consulting firms specializing in data warehouse technology.

Mainstream Vendors and Products

---- Data warehousing is a hot spot in the data management market, and many companies have entered the competition in recent years. Here we introduce some of these vendors, selected as follows. First, they are familiar to the Chinese market and their products are easy to purchase. Second, we choose mainly software vendors. Third, vendors fall into two main categories: those with a background in database products, which will be the backbone of the data warehouse market, and tool vendors, which supply peripheral tools for data warehouse solutions (not covered here).

---- The main vendors in the data management category are, in alphabetical order, IBM, Informix, Microsoft, NCR, Oracle, Sybase, and others.

----■ IBM

---- IBM is a strong force in data warehousing and a vendor of both hardware and software. In data warehouse technology, IBM is best known for its SP/2 MPP hardware environment, which in recent years has hosted a large number of data warehouses of more than a terabyte on open systems. Since closed mainframe systems are unlikely to become the mainstream data warehouse platform any time soon, SP/2 and other open MPP environments are bound to become dominant. By contrast, IBM's database software is unremarkable; the core of its data warehouse is DB2 Universal Database (UDB) Parallel Edition. IBM's advantages are its reputation in the industry, its market share, its hardware systems, and its consulting services.

----■ Informix

---- Informix is a specialized database vendor whose relational database server, Dynamic Server, has long held a steady and broad share of the market for traditional online transaction processing applications. In recent years data warehousing has become one of the company's key areas of development. Informix's data warehousing technology focuses on three areas. First, a parallel database server: Informix's Extended Parallel Server (XPS) is designed for enterprise-level decision support systems, uses a shared-nothing architecture to support clustered systems and MPP environments, and can deliver near-linear performance scalability. Second, Informix adds decision support extensions to the parallel relational database. Third, Informix provides MetaCube OLAP middleware, which implements ROLAP in a multi-tier client/server architecture and integrates query optimization mechanisms based on aggregation and sampling.

---- At the end of 1998, Red Brick, a well-known data warehouse vendor, was merged into Informix, strengthening Informix in data extraction, data mining, and industry consulting. Informix now treats data warehousing as a collection of products and services and calls the overall solution Decision Frontier.

----■ Microsoft

---- Microsoft uses its relational database, SQL Server, as the core of its data warehousing offering. Microsoft's plan for the data warehousing space is to include Plato (an OLAP server) and Data Transformation Services (covering data extraction, transformation, and loading) as complimentary parts of its SQL Server 7.0 database. Microsoft's OLAP takes the ROLAP route, which, like its data transformation, is a conventional solution, while parallel processing and decision support extensions are not SQL Server's strong points. As a result, the whole solution is still aimed at the low end and mid-range, where price is the key to winning.

---- For this reason, Microsoft promotes another concept in the data warehouse market: the data mart. A data mart is a small data warehouse oriented toward the applications of a single department; the technology is similar to that of a data warehouse, but the content stored is more focused on particular subjects. For a data mart of this size, Microsoft's solution is ideal.

----■ NCR

---- NCR is one of the pioneers of data warehousing, with a strong business-focused consultancy and a large share of the traditional data warehouse market. NCR's data warehouse product is called Teradata Scalable Warehouse, meaning a highly scalable data warehouse, and it targets the high end of the market. NCR's Teradata is a non-open database system designed specifically for data warehousing. Nevertheless, Teradata performs well in the TPC-D benchmark for data warehouse performance, although this requires a larger number of parallel processors. Teradata runs primarily on MPP platforms; the operating system was NCR's own, and only recently has it come to support Unix and NT.

---- NCR is a vendor that specializes in high-end data warehousing, and its Teradata performs well with large systems and large data volumes. Its solutions do face challenges, however: online multidimensional analysis is a weak point.

----■ Oracle

---- Oracle's earlier research in data warehousing focused on OLAP multidimensional analysis. A few years ago, Oracle acquired a multidimensional database vendor called IRI and launched the Express multidimensional database, which provides online analysis in MOLAP mode. As ROLAP has gradually become mainstream in recent years, Oracle's latest data warehouse solution, Oracle Data Mart Suite, uses Oracle8 Enterprise Server as the data warehouse server.

----■ Sybase

---- As early as 1994, when it was promoting System 10, Sybase did a great deal of work on online backup for massively parallel databases, data replication, interconnection of heterogeneous databases, and related areas. For the core, Sybase designed Navigation Server specifically for MPP environments, working with SQL Server to form a massively parallel processing environment. In early 1995, through the acquisition of ExpressWay, Sybase introduced Sybase IQ, becoming the first to combine a bitmap indexing mechanism with a large relational database. Sybase's current data warehouse solution is called Sybase Warehouse Studio; it consists of Adaptive Server enhanced with Sybase IQ, together with the Power series of design, transformation, and OLAP tools. In actual application solutions, however, for market reasons Sybase often has to rely on third-party tools.

Data Warehouse Future Directions

---- Data warehousing is a booming field in data management technology and its market, with good prospects. Here we discuss its future from the perspectives of technology, applications, and the market.

---- The development of data warehouse technology naturally covers data extraction, storage and management, data representation, and methodology. In data extraction, future development will focus on system integration: bringing interconnection, transformation, replication, scheduling, and monitoring under standardized, unified management so that the system can adapt to changes in the data warehouse itself or in the data sources, and is easier to manage and maintain. In data management, database vendors will explicitly offer data warehouse engines as server products alongside their database servers; here, parallel relational databases with decision support extensions have the greatest growth potential. In data representation, mathematical and statistical algorithms and functions will be widely integrated into online analysis products, which will also be tightly integrated with Internet/Web technology to deliver intranet-oriented data warehouse front ends that require no maintenance on the client. Front-end software tailored to the characteristics of particular industries will become a product in its own right, as part of the data warehouse solution. The methodology of data warehouse implementation will become more widespread, emerging as a distinct branch of database design and a required part of management information system design.

---- The trend of computer applications toward data warehousing is the driving force behind data warehouse development. Traditional online transaction processing systems were not designed with a separate data warehouse in mind, yet practical applications have long needed the functionality a data warehouse provides. In recent years many transaction processing systems have therefore found themselves in a dilemma after adding limited online analysis functions, including complex reports and data aggregation, to the existing system: on the one hand, these functions seriously degrade online transaction performance; on the other, structural constraints of the system prevent the statistical analysis from realizing its full value. The result is that application technology is moving in a more refined, more specialized direction. In the new generation of application systems, the data warehouse is part of the system design from the outset, and online analysis becomes a feature of ordinary transaction processing applications. In data management, online transaction processing and the data warehouse remain relatively independent within the application, which makes the transaction processing system itself simpler and more efficient and makes analysis and statistics more convenient. Industry-oriented mathematical statistics will move toward more general use and be integrated into the data warehouse solution of the application system, building on the wealth of information the data warehouse provides to better serve business decisions.

---- For the market, we look at the development of data warehousing from the perspectives of both vendors and users. For vendors of data warehouse products and solutions, fierce market competition is a constant theme. Vendors that do not offer complete solutions may well be acquired; for example, software companies that provide specialized data extraction tools are likely to be merged into large database vendors and contribute to complete solutions. The vendors that can endure fall broadly into two categories: companies with a strong background in databases and data management, and companies that specialize in industry-specific technical consulting on data warehouse implementation.

---- From the user's point of view, specific applications in traditional data management fields such as finance, insurance, and telecommunications, for example credit analysis, risk analysis, and fraud detection, are the main market for data warehousing. Beyond these, data warehouse applications will spread and deepen as the business models of modern society change. In recent years a quiet revolution has been changing the way products are manufactured and services are delivered: the economic model of digital customization. In this world, customers can buy a computer assembled to their own requirements, a pair of jeans designed for their own figure, a health supplement formulated for their own physical needs, a pair of glasses matched to the shape of their own face, and so on. Mass customization is not merely a manufacturing process, a logistics system, or a sales promotion strategy; it may become the organizing principle of corporate production in the next century, just as batch production has been the organizing principle of this one. In the mass customization economy of the future, the data warehouse will be a key weapon for companies seeking a competitive advantage.

---- In short, data warehousing is a comprehensive technology and solution built on the management and use of data. It will become a new growth point for the database market and an important part of the next generation of application systems. For the great majority of computer users, including users in China, the data warehouse is not far away: it can be seen, touched, and bought. Nor is data warehouse technology mysterious; it is at least simpler than most statistical theorems. I believe we will be able to implement and use data warehouses and obtain satisfactory results.