Data quality and the eight dimensions of data quality indicators

The quality of data directly determines its value, and in turn the results of data analysis and the quality of the decisions we make. Poor-quality data is not just a problem with the data itself; it also undermines business decision-making. Wrong data is worse than no data: with no data we still make decisions based on experience and common sense, and those decisions may well be right, whereas wrong data leads us to make wrong decisions. Data quality is therefore the key to data governance in business management.

The quality of data can be measured along eight dimensions, each of which reflects one facet of the data's character: accuracy, veracity, completeness, comprehensiveness, timeliness, immediacy, precision, and relevance.

We often use this kind of graphical representation when comparing the quality of two data sets. For example, internally collected data is conventionally high in accuracy, veracity, and completeness, while its comprehensiveness, timeliness, immediacy, precision, and relevance depend on how much attention the enterprise pays to its data and on the sophistication of the technical means it uses. External data sets, such as microblogging data or Internet media data, can be improved in some dimensions through technical means such as web crawlers, but their accuracy, veracity, and precision are hard to guarantee or control, and their relevance depends on the data collection and mining technology applied.

We can also use this model to measure the character of data across functions within a company. The figure below illustrates how an organization can take targeted steps to improve data quality by evaluating its data governance against these eight indicators.

Accuracy of data

The accuracy of data (Accuracy) refers to how close a collected or observed value is to the true value. The difference between the two is called the error; the larger the error, the lower the accuracy. The accuracy of the data is determined by the method of data collection.

Precision of data

The precision of data (Precision) refers to how close the values obtained from repeated measurements of the same object are to one another. Precision is related to how finely we collect data: higher precision requires a finer granularity of collection and a lower tolerance for error.

When measuring a person's height we are accurate to the centimeter, so repeated measurements will differ only at the centimeter level; when measuring the distance from Beijing to Shanghai we are accurate to the kilometer, so repeated measurements will differ at the kilometer level; a vernier caliper can measure the thickness of a part to 1/50 of a millimeter, so repeated measurements will differ by no more than 1/50 of a millimeter. The measurement methods and instruments used have a direct impact on the precision of the data.
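The distinction between accuracy and precision can be sketched numerically: accuracy is the error of the measurements relative to the true value, while precision is the spread among repeated measurements of the same object. The readings and the true value below are invented for illustration.

```python
import statistics

# Hypothetical repeated measurements of a part's thickness (mm);
# assume the true thickness is 12.50 mm.
true_value = 12.50
measurements = [12.52, 12.48, 12.51, 12.49, 12.50]

# Accuracy: closeness to the true value
# (here, the absolute error of the mean of the readings).
error = abs(statistics.mean(measurements) - true_value)

# Precision: closeness of repeated measurements to each other
# (here, the sample standard deviation of the readings).
spread = statistics.stdev(measurements)

print(f"error (accuracy): {error:.4f} mm")
print(f"spread (precision): {spread:.4f} mm")
```

A data set can score well on one and badly on the other: biased instruments give consistent (precise) but wrong (inaccurate) readings.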

Data Veracity

Data veracity is also known as data correctness (Rightness). The correctness of data depends on how well the collection process is controlled: when control is tight and the data is traceable, its authenticity is easy to guarantee; when control is loose or the data cannot be traced, forged data cannot be traced back to its source, and authenticity is hard to guarantee.

To improve veracity, collecting data directly through intelligent terminals, with no manual intervention in the process, better ensures the authenticity of the collected data: it reduces human intervention and the opportunity for forgery, so the data reflects objective reality more correctly.

Data timeliness

Data timeliness (In-time) is whether data is available when it is needed. At the beginning of each month we summarize the previous month's operating and management data; the question is whether that data can be processed promptly after the monthly accounts are closed so that the financial statements are completed on time. The timeliness of data is what guarantees the timeliness of data analysis and mining. If a company's financial accounting is complex and closing is slow, the previous month's statistics may not be ready until the middle of the month; by the time the financial strategy can be adjusted, the month is nearly over. This is especially acute once a company grows and its business spans multiple markets and countries: if data cannot be summarized in time, senior management cannot make timely decisions.

The timeliness of data is directly related to the speed and efficiency of an enterprise's data processing. To improve it, more and more companies adopt management information systems with built-in automated data processing, so that once data is uploaded, the vast majority of reports are produced automatically. Computerized automatic processing of intermediate-level data is an effective way to improve the efficiency of enterprise data processing.

Besides the timeliness of data collection and the efficiency of data processing, the system and process must also guarantee timely data transmission: once a report is completed, it should be sent to the designated departments, or uploaded to the designated storage space, promptly or within the required time frame.

The immediacy of data

The immediacy of data refers to the interval between when data is collected and when it is transmitted. Data from a source that is stored the moment it is collected, and processed and presented immediately, is immediate data; data that is transmitted to the information system only after some delay has poorer immediacy.

In microblogging data collection, the moment a user posts, the data can be captured and processed and an up-to-the-moment report generated; as time passes the data keeps changing, and we can call this instantaneous collection and processing. The instruments on a piece of production equipment react instantly to its temperature, voltage, current, air pressure, and other readings, generating a data stream that monitors the equipment's operation at any time; this too can be viewed as immediate data. But when the equipment's real-time operating data is stored and later used to analyze the relationship between operating conditions and equipment life, it has become historical data.

Data Completeness

Data completeness is measured by the degree to which the data that should be collected actually is collected: the ratio of what was actually collected to what should have been collected. Suppose, for example, that when collecting employee information we require twelve items to be filled in: name, date of birth, gender, ethnicity, place of origin, height, blood type, marital status, highest degree, major of the highest degree, institution of the highest degree, and graduation date of the highest degree. If an employee fills in only some of them, say six, the completeness of that record is only one half.

The completeness of a company's data reflects the importance the company attaches to data. When data that is required to be collected is only partially collected, the data set is incomplete, often because the company has not enforced its data collection quality requirements. If a company requires everyone to fill out a complete personal information form, some employees refuse, and only 1,200 of its 2,000 employees do so, the data set is incomplete.
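The field-level completeness of a single record, as in the twelve-item form above, can be sketched as a simple ratio. The field names and the sample record below are hypothetical.

```python
# Required fields for the hypothetical employee information form.
REQUIRED_FIELDS = [
    "name", "date_of_birth", "gender", "ethnicity", "place_of_origin",
    "height", "blood_type", "marital_status", "highest_degree",
    "degree_major", "degree_institution", "graduation_date",
]

def completeness(record: dict) -> float:
    """Ratio of required fields that are actually filled in."""
    filled = sum(1 for f in REQUIRED_FIELDS
                 if record.get(f) not in (None, ""))
    return filled / len(REQUIRED_FIELDS)

# A record with only 6 of the 12 required fields filled in.
employee = {"name": "A. Zhang", "gender": "F", "ethnicity": "Han",
            "height": "170cm", "blood_type": "O", "marital_status": "single"}
print(f"{completeness(employee):.0%}")  # → 50%
```

The same ratio applied across all records (e.g. 1,200 complete forms out of 2,000) gives a data-set-level completeness figure.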

In addition, for dynamic data we can measure completeness along the timeline. For example, if data is required to be collected every hour, each day should yield 24 data points recorded as 24 records; if, through negligence, only 20 are recorded, that data set is also incomplete.
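Timeline completeness can be checked the same way, by comparing the expected collection timestamps against those actually recorded. The date and the missing hours below are invented for illustration.

```python
from datetime import datetime, timedelta

# Hourly collection should yield 24 points per day.
day = datetime(2024, 1, 1)
expected = {day + timedelta(hours=h) for h in range(24)}

# Hypothetical log in which four hourly readings were never recorded.
recorded = {day + timedelta(hours=h) for h in range(24)
            if h not in (3, 7, 15, 21)}

missing = sorted(expected - recorded)          # the gaps in the timeline
ratio = len(recorded) / len(expected)          # timeline completeness
print(f"completeness: {ratio:.0%}, "
      f"missing hours: {[t.hour for t in missing]}")
```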

Comprehensiveness of data

Comprehensiveness of data is different from completeness. Completeness measures the gap between what should be collected and what actually was collected, whereas comprehensiveness concerns collection points that were omitted in the first place. For example, suppose we want to collect employee behavior data but only capture employees clocking in and out, leaving behavior during working hours uncollected, or we cannot find a suitable way to collect it. Then the data set is not comprehensive.

If we describe the packaging of a product and record only the front and back, not the sides, the data set is not comprehensive. If we record a customer's transactions and collect only the products in each order with their prices and quantities, but not the delivery address and purchase time, the data collection is not comprehensive.

Tencent's QQ and WeChat user data records customers' communications; Alibaba's and JD.com's user data records purchase transactions; Baidu Maps records users' travel; Dianping and Meituan record customers' dining and entertainment. For a comprehensive description of every aspect of a person's life in food, clothing, housing, and transport, each of these companies' data is incomplete on its own, but integrating them would yield more comprehensive data. Data comprehensiveness is therefore a relative concept, and pursuing it to excess is unrealistic.

Data relevance

Data relevance refers to the relationships among data sets. For example, employee salary data and employee performance appraisal data are linked through the employee as a shared resource, and performance data directly affects the amount of salary. Purchase order data and production order data are linked through the material traceability mechanism, and since production orders are completed by employees, they are in turn linked to employee job data and employee information data.

In fact, in the enterprise big data we explore in this book, every data set is associated with others: some directly, such as employee salary data and employee performance data, and some indirectly, such as material purchase order data and employee salary data. These correlations are connected through the company's resources: people, money, materials, and information. If a data set cannot be connected to any other, there is data fragmentation, or a data silo. Data fragmentation and data silos are the result of insufficient data relevance in an organization, and data relevance directly affects the value of enterprise data sets.
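As a sketch, linking two data sets through a shared resource key also exposes the records that cannot be connected, i.e. the silos. The field name `employee_id` and the sample records below are hypothetical.

```python
# Salary records and performance appraisals, joined on a shared key.
salaries = [
    {"employee_id": "E001", "salary": 9000},
    {"employee_id": "E002", "salary": 7500},
    {"employee_id": "E999", "salary": 6000},  # no matching appraisal record
]
appraisals = {
    "E001": {"score": 92},
    "E002": {"score": 85},
}

linked, orphans = [], []
for row in salaries:
    appraisal = appraisals.get(row["employee_id"])
    if appraisal:
        linked.append({**row, **appraisal})  # joined record
    else:
        orphans.append(row)                  # cannot be connected: a silo

print(len(linked), "linked,", len(orphans), "orphaned")  # → 2 linked, 1 orphaned
```

The orphaned records are exactly the fragmentation the text describes: data that no shared resource can tie back to the rest of the enterprise.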