Big data is huge in volume and diverse in format.
A huge amount of data is generated by a wide range of devices in homes, manufacturing plants, and offices, by Internet transactions, by activity on social networks, and by automated sensors, mobile devices, and scientific instruments.
Its explosive growth has outstripped the processing capabilities of traditional IT infrastructures, creating serious data management problems for organizations and society.
Therefore, it is necessary to develop a new data architecture, one focused on the entire process of "data collection, data management, data analysis, knowledge formation, and intelligent action," in order to develop and use these data and to unleash their hidden value.
First, big data construction ideas
1) Data acquisition
The fundamental reason for the emergence of big data is the widespread use of perceptual systems.
With the development of technology, people have gained the ability to create extremely small sensors with processing capabilities and have begun to place these devices in every corner of society, using them to monitor the functioning of society as a whole.
These devices are constantly generating new data, and this data generation is automatic.
Therefore, in terms of data collection, it is important to attach spatial and temporal markers to data from the network, including the Internet of Things, social networks, and institutional information systems; to weed out false data and collect heterogeneous, even disparate, data as comprehensively as possible; and, where necessary, to cross-check against historical data, validating the data's comprehensiveness and trustworthiness from multiple perspectives.
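As a minimal sketch of this idea (the field names and source identifiers are hypothetical), the snippet below tags each incoming payload with spatial and temporal markers and cross-checks a content fingerprint against previously collected data:

```python
import hashlib
import time

def fingerprint(payload: dict) -> str:
    """Content hash of the raw payload, used to detect duplicates or replays."""
    return hashlib.sha256(repr(sorted(payload.items())).encode()).hexdigest()

def tag_record(payload: dict, source: str, location: str) -> dict:
    """Attach spatial and temporal markers to a raw payload."""
    return {
        "payload": payload,
        "source": source,             # e.g. "iot-gateway-3" (hypothetical)
        "location": location,         # spatial marker
        "collected_at": time.time(),  # temporal marker
        "fingerprint": fingerprint(payload),
    }

def accept(record: dict, seen: set) -> bool:
    """Cross-check against history: reject payloads already collected."""
    if record["fingerprint"] in seen:
        return False
    seen.add(record["fingerprint"])
    return True
```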
2) Data aggregation and storage
Data has vitality only when it flows constantly and is fully shared. Building on the specialized databases already constructed, data integration should be used to achieve data exchange and data sharing among information systems at all levels.
Data storage should achieve the goals of low cost, low energy consumption, and high reliability, which usually calls for redundant configurations, distributed storage, and cloud computing technology. When data is stored, it should be classified according to defined rules and be filtered and deduplicated to reduce the storage volume, with labels added at the same time for easy retrieval in the future.
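A toy sketch of that storage pipeline (the classification rules and field names are purely illustrative): records are deduplicated by content hash, then classified and labeled before being stored:

```python
import hashlib

def classify(record: dict) -> str:
    """Classify a record by simple, illustrative rules before storage."""
    if str(record.get("source", "")).startswith("sensor"):
        return "telemetry"
    if "order_id" in record:
        return "transaction"
    return "other"

def prepare_for_storage(records, seen_hashes: set):
    """Filter, deduplicate, classify, and label records to cut storage volume."""
    for r in records:
        h = hashlib.md5(repr(sorted(r.items())).encode()).hexdigest()
        if h in seen_hashes:       # drop duplicates to reduce storage
            continue
        seen_hashes.add(h)
        r["_label"] = classify(r)  # label for easy retrieval later
        yield r
```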
3) Data management
There is also a proliferation of techniques for managing big data.
Among the many technologies, there are six data management technologies that are commonly followed, namely, distributed storage and computing, in-memory database technologies, columnar database technologies, cloud databases, non-relational databases, and mobile database technologies.
Of these, distributed storage and computing has received the most attention.
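To make the columnar idea from that list concrete, here is a toy, language-level illustration (the data values are made up): a column-oriented layout keeps each column together, so an aggregation touches only the column it needs:

```python
# Row-oriented layout: each record is stored together,
# which suits reading whole records at a time.
rows = [
    {"id": 1, "city": "Beijing", "sales": 120},
    {"id": 2, "city": "Shanghai", "sales": 95},
]

# Column-oriented layout: each column is stored together,
# which suits analytics that scan one column across many records.
columns = {
    "id": [1, 2],
    "city": ["Beijing", "Shanghai"],
    "sales": [120, 95],
}

total_sales = sum(columns["sales"])  # scans only the "sales" column
```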
(Figure: a library data management system)
4) Data analysis
Data analysis and processing: some industry data involve hundreds of parameters, and their complexity lies not only in the data samples themselves but also in the dynamic interactions among multiple heterogeneous sources, entities, and spaces, which are difficult to describe and measure with traditional methods. The processing complexity is enormous, and high-dimensional data such as images and other multimedia must first be reduced in dimension before they can be measured and processed.
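As one common dimensionality-reduction technique, here is a minimal sketch using scikit-learn's PCA on synthetic data (the sample and parameter counts are arbitrary):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 200))  # 1000 samples, 200 parameters (synthetic)

pca = PCA(n_components=10)        # project onto the 10 strongest directions
X_reduced = pca.fit_transform(X)  # shape: (1000, 10)

print(X_reduced.shape, pca.explained_variance_ratio_.sum())
```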
Big data analysis is a powerful tool for synthesizing information from large amounts of dynamic, and possibly ambiguous, data and deriving comprehensible content from it.
There are many types of big data processing, and the main processing models can be categorized into streaming and batch processing.
Batch processing stores the data first and processes it afterward, while stream processing processes the data directly as it arrives.
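A toy contrast between the two models, computing a mean (the numbers are arbitrary):

```python
# Batch: store first, then process the whole collection at once.
stored = [3, 1, 4, 1, 5, 9, 2, 6]
batch_mean = sum(stored) / len(stored)  # processed only after storage

# Stream: process each element directly as it arrives,
# keeping only a small running state rather than the full dataset.
def stream_mean(source):
    count, total = 0, 0.0
    for x in source:          # `source` could be an unbounded feed
        count += 1
        total += x
        yield total / count   # an up-to-date result at every step

for running_mean in stream_mean(iter([3, 1, 4, 1, 5])):
    print(running_mean)
```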
The main tasks of mining are correlation analysis, cluster analysis, classification, prediction, temporal pattern and bias analysis.
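As an example of one of these mining tasks, a minimal cluster-analysis sketch using scikit-learn's KMeans on synthetic two-dimensional data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Two synthetic groups of points around different centers.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_[:5])       # cluster assignment of the first five points
print(km.cluster_centers_)  # the two recovered centers
```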
5) The value of big data: decision support
The magic of big data is that it can accurately predict the future through the analysis of past and present data; that, by integrating an organization's internal and external data, it can gain insight into the correlations between things; and that, by mining huge volumes of data, it can stand in for the human brain and take on the responsibility of enterprise and social management.
6) The use of data
Big data has three connotations: first, a huge amount of data, from a variety of sources and a variety of types of data sets; second, a new type of data processing and analysis technology; and third, the use of data analysis to form value.
Big data is revolutionizing scientific research, economic construction, social development, and cultural life.
The key to applying big data, and also its necessary condition, lies in the integration of "IT" and "business". Of course, "business" here can be understood very broadly, from the operation of a small retail store to the operation of an entire city.
Second, the basic architecture of big data
Based on the above characteristics of big data, the cost of storing and processing big data through traditional IT technology is high.
For an enterprise to vigorously develop big data applications, two problems must be solved first: first, how to extract and store massive, multi-category data quickly and at low cost; and second, how to analyze and mine those data with new technologies to create value for the enterprise.
Therefore, the storage and processing of big data are inextricably linked to cloud computing technology, and under current technical conditions, distributed systems based on inexpensive hardware (such as Hadoop) are considered the most suitable technology platform for processing big data.
Hadoop is a distributed infrastructure that enables users to conveniently and efficiently utilize computing resources and process massive amounts of data, and is now widely used in many large Internet enterprises, such as Amazon, Facebook, and Yahoo.
It is an open architecture whose member components are constantly being extended and improved; the usual architecture is shown in Figure 2:
(Figure 2: Hadoop architecture)
(1) The lowest layer of Hadoop is HDFS (Hadoop Distributed File System). Files stored in HDFS are first divided into blocks, which are then replicated to multiple hosts (DataNodes, the data nodes).
(2) The core of Hadoop is the MapReduce engine (a map-and-reduce programming model): Map decomposes a single task into multiple subtasks, and Reduce summarizes the results of those subtasks. The engine consists of JobTrackers (job tracking, corresponding to the name nodes) and TaskTrackers (task tracking, corresponding to the data nodes).
When dealing with big data queries, MapReduce breaks down tasks across multiple nodes to improve data processing efficiency and avoid single-machine performance bottlenecks.
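A word-count sketch that simulates the three MapReduce phases in plain Python (in a real Hadoop job, the shuffle step and the distribution across nodes are handled by the framework):

```python
from collections import defaultdict

documents = ["big data is huge", "big data is diverse"]

# Map: decompose the job; each document independently emits (word, 1) pairs.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the intermediate pairs by key.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: summarize the partial results of each group.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'big': 2, 'data': 2, 'is': 2, 'huge': 1, 'diverse': 1}
```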
(3) Hive is the data warehouse in the Hadoop architecture, used mainly for data with a static structure and for work that requires frequent analysis.
HBase runs on top of HDFS, primarily as a column-oriented database that can store petabytes of data. Internally, HBase uses MapReduce to process this massive volume of data and to locate and access the records you need within it.
(4) Sqoop is designed for data interoperability and can import data from relational databases into Hadoop and directly into HDFS or Hive.
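As an illustration, such an import is typically launched from the command line; the sketch below wraps one invocation in Python, and the JDBC URL, credentials, table, and target directory are all hypothetical:

```python
import subprocess

# Import the relational table "orders" into HDFS via Sqoop.
subprocess.run([
    "sqoop", "import",
    "--connect", "jdbc:mysql://dbhost/sales",   # source relational database
    "--username", "etl_user",
    "--password-file", "/user/etl/.password",   # keep credentials off argv
    "--table", "orders",                        # table to import
    "--target-dir", "/data/raw/orders",         # destination in HDFS
], check=True)
```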
(5) Zookeeper is responsible for application orchestration in the Hadoop architecture to maintain synchronization within the Hadoop cluster.
(6) Thrift is a software framework for scalable and cross-language services, originally developed by Facebook, that is built to work seamlessly and efficiently across programming languages.
Hadoop Core Design
HBase: distributed data storage system
Client: uses the HBase RPC mechanism to communicate with the HMaster and the HRegionServers
Zookeeper: coordination service management; through Zookeeper, the HMaster can sense the health status of each HRegionServer at any time
HMaster: manages users' add, delete, modify, and query operations on tables
HRegionServer: the most central module in HBase, mainly responsible for responding to users' I/O requests and for reading and writing data to and from the HDFS file system
HRegion: the smallest unit of distributed storage in HBase; a Table is divided across multiple HRegions
HStore: the core of HBase storage, consisting of a MemStore and StoreFiles
HLog: every time a user operation writes to the MemStore, a copy of the data is also written to the HLog file
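A minimal read/write sketch against this stack, assuming the happybase Python client and an HBase Thrift gateway (the host, table name, and column family are hypothetical, and the table must already exist):

```python
import happybase

connection = happybase.Connection("thrift-gateway-host")  # Thrift server
table = connection.table("demo")

# Write: a put lands in the MemStore (with a copy in the HLog)
# before being flushed to StoreFiles on HDFS.
table.put(b"row-1", {b"cf:city": b"Beijing", b"cf:sales": b"120"})

# Read: fetch a single row by its key.
print(table.row(b"row-1"))

connection.close()
```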
Combining the Hadoop architectural features described above, the system functionality of the big data platform is proposed as shown in the figure:
Application system: for most enterprises, applications in the operational domain are the core application of big data. Enterprises used to rely mainly on various report data from production and operations, but with the arrival of the big data era, massive amounts of data from the Internet, the Internet of Things, and all kinds of sensors now confront them.
As a result, some companies have begun to mine and utilize this data to drive operational efficiency.
Data platform: with a big data platform, the Internet of the future will allow businesses to better understand consumers' habits and improve their experience.
Based on big data, the corresponding analysis can be more targeted to improve the user experience and explore new business opportunities.
Data source: The data source is the database or database server used by the database application.
A rich data source is a prerequisite for the development of the Big Data industry.
Data sources are expanding and becoming more and more diverse.
For example, smart cars can turn the dynamic driving process into data, and IoT embedded in production equipment can turn the production process and the dynamic status of equipment into data.
The continuous expansion of data sources not only drives the development of collection devices but also makes it possible to control the value of data by controlling new data sources.
However, the total amount of digitized data resources in China is much lower than in the US and Europe, and even the limited data resources that are available suffer from low standardization, low accuracy, low completeness, and low utilization value, which reduces the value of the data.
Third, the target effect of big data
Through the introduction and deployment of big data, the following effects can be achieved:
1) Data integration
- Unified Data Model: carry the enterprise data model and promote the unification of the logical data models across all domains of the enterprise;
- Unified Data Standard: uniformly establish a standard data-encoding directory to achieve the standardized, unified storage of enterprise data (a toy sketch follows this list);
- Unified Data View: achieve a unified view of data so that the enterprise has consistent information across the customer, product, and resource perspectives.
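The sketch referenced in the unified-data-standard item above maps each source system's local codes onto one enterprise-wide code; every system name and code here is hypothetical:

```python
# Standard data-encoding directory: (system, local code) -> unified code.
STANDARD_CODES = {
    ("crm", "BJ"): "CN-110000",       # Beijing under the unified directory
    ("erp", "beijing"): "CN-110000",
    ("crm", "SH"): "CN-310000",       # Shanghai
}

def standardize(system: str, local_code: str) -> str:
    """Translate a system-local code into the unified enterprise code."""
    try:
        return STANDARD_CODES[(system, local_code)]
    except KeyError:
        raise ValueError(f"unmapped code {local_code!r} from system {system!r}")

print(standardize("erp", "beijing"))  # -> CN-110000
```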
2) Data quality control
- Data quality verification: verify stored data against defined rules for consistency, completeness, and accuracy;
- Data quality control: establish enterprise data quality standards, data control organizations, and data control processes in order to exercise unified control over data quality and achieve its gradual improvement.
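A minimal rule-based verification sketch along these lines; the required fields and the rules themselves are hypothetical:

```python
REQUIRED_FIELDS = {"customer_id", "order_date", "amount"}

def check_completeness(record: dict) -> list:
    """Completeness: every required field must be present."""
    return [f"missing field: {f}" for f in REQUIRED_FIELDS - record.keys()]

def check_accuracy(record: dict) -> list:
    """Accuracy: values must fall within their valid ranges."""
    errors = []
    if record.get("amount", 0) < 0:
        errors.append("amount must be non-negative")
    return errors

def verify(record: dict) -> list:
    """Return all rule violations; an empty list means the record passes."""
    return check_completeness(record) + check_accuracy(record)

print(verify({"customer_id": "C1", "amount": -5}))
# ['missing field: order_date', 'amount must be non-negative']
```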
3) Data sharing
- Eliminate the mesh of point-to-point interfaces: establish a big data sharing center to provide shared data to each business system, reduce the complexity of interfaces, and improve the efficiency and quality of inter-system interfaces;
- Provide consolidated or computed data to external systems in real time or near-real time.
4) Data application
- Query application: the platform provides on-demand queries whose conditions are not fixed or predictable in advance and whose output format is flexible;
- Fixed report application: based on fixed statistical dimensions and indicators, the platform generates the various business report data required by the business systems;
- Dynamic analysis application: thematic analysis of the data is carried out along whatever dimensions and indicators are of interest; in dynamic analysis applications, the dimensions and indicators are not fixed (a sketch follows this list).
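A sketch of the dynamic-analysis idea with pandas, where the grouping dimensions and the indicator are caller-chosen parameters rather than fixed in advance (all data values are made up):

```python
import pandas as pd

df = pd.DataFrame({
    "region":  ["north", "north", "south", "south"],
    "product": ["A", "B", "A", "B"],
    "sales":   [120, 80, 95, 60],
})

def analyze(frame, dimensions, indicator, agg="sum"):
    """Aggregate one indicator over whatever dimensions the caller chooses."""
    return frame.groupby(list(dimensions))[indicator].agg(agg)

print(analyze(df, ["region"], "sales"))             # one dimension
print(analyze(df, ["region", "product"], "sales"))  # drill down dynamically
```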
A big data platform based on distributed technology can effectively reduce data storage costs and improve the efficiency of data analysis and processing. It supports massive data volumes and high-concurrency scenarios, significantly shortens data query response times, and meets the data needs of the enterprise's various upper-layer applications.