Data mining in the era of big data has become a major focus across all walks of life.
I. Data Mining
In the era of big data, the generation and collection of data are the foundation, and data mining is the key: it can be said to be the most critical and fundamental work of big data. Generally speaking, data mining, also known as knowledge discovery from data (KDD), refers to an engineered, systematic process of extracting implicit, previously unknown, but potentially useful information and patterns from large amounts of data.
Different scholars understand data mining differently, but I personally believe its characteristics lie mainly in the following four aspects:
1. Application (A Combination of Theory and Application): Data mining is a combination of theoretical algorithms and practical application. The need for data mining arises from specific applications in real production and life, and the knowledge discovered through mining must in turn be used in practice to assist real decision-making. Data mining thus comes from application practice and also serves it. Data is fundamental, so data mining should be data-oriented: the design and development of algorithms must take the needs of the actual application into account, abstract and generalize the problem, and good algorithms must then be applied in practice and tested there.
2. Engineering (An Engineering Process): Data mining is an engineering process consisting of multiple steps. Its application character means that data mining is not just algorithm analysis and application but a complete process that includes data preparation and management, data preprocessing and transformation, mining algorithm development and application, result presentation and validation, and knowledge accumulation and use. In practical applications, the typical data mining process is also interactive and cyclic.
3. A Collection of Functionalities: Data mining is a collection of multiple functions. Commonly used functions include data exploration and analysis, association rule mining, time-series pattern mining, classification and prediction, cluster analysis, anomaly detection, data visualization, and link analysis. A specific use case often involves several different functions, and different functions usually rest on different theoretical and technical foundations, each supported by different algorithms.
4. Interdisciplinarity (An Interdisciplinary Field): Data mining is an interdisciplinary field that draws on research results and ideas from many areas, such as statistical analysis, pattern recognition, machine learning, artificial intelligence, information retrieval, and databases. Other fields, such as randomized algorithms, information theory, visualization, distributed computing, and optimization, also play an important role in its development. The difference between data mining and these related fields can be summarized by the three characteristics above; most importantly, data mining focuses more on applications.
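To make the "collection of functionalities" point concrete, here is a minimal sketch of one such function, association rule mining, run over a toy transaction set. The transactions and thresholds are invented for illustration, and the enumeration is limited to two-item rules:

```python
from itertools import combinations

# Toy market-basket transactions (invented for illustration).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def rules(min_support=0.4, min_confidence=0.7):
    """Enumerate rules A -> B over 2-itemsets meeting both thresholds."""
    items = sorted(set().union(*transactions))
    found = []
    for a, b in combinations(items, 2):
        pair = {a, b}
        s = support(pair)
        if s < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            conf = s / support({lhs})
            if conf >= min_confidence:
                found.append((lhs, rhs, round(s, 2), round(conf, 2)))
    return found

print(rules())
```

Real association rule miners (e.g., Apriori or FP-Growth) generate candidate itemsets of arbitrary size far more efficiently; the sketch only shows the support/confidence logic that the function rests on.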
In summary, application is an important characteristic of data mining and the key to distinguishing it from other disciplines. At the same time, the application characteristic complements the other characteristics, determines the research and development of data mining to a certain extent, and offers guidance on how to learn and master the field. From the point of view of research and development, the demands of practical application are at the root of many methods proposed in data mining: from early customer transaction analysis (market basket analysis), through multimedia data mining, privacy-preserving data mining, text mining, and Web mining, to social media mining, all have been driven by applications. The engineering and functional-collection characteristics determine the breadth of data mining research. In particular, the engineering characteristic brings the different steps of the whole process into the scope of data mining research, while the functional-collection characteristic gives data mining many different functions, and how to connect and combine these functions has influenced the development of research methods. For example, in the mid-1990s data mining research focused mainly on mining association rules and time-series patterns. By the end of the 1990s, researchers began to study classification algorithms based on association rules and sequential patterns (e.g., classification based on association), which organically combine two different data mining functions. At the beginning of the 21st century, semi-supervised learning and semi-supervised clustering became research hotspots, again organically combining the two functions of classification and clustering.
Some other research directions of recent years, such as subspace clustering (a combination of feature extraction and clustering) and graph classification (a combination of graph mining and classification), also link and combine multiple functions. Finally, the interdisciplinary characteristic leads to a diversity of research ideas and method designs.
The characteristics above shape not only research and methodology but also offer guidance on how to learn and master data mining, which is helpful for training graduate and undergraduate students. The application characteristic means that practitioners should be familiar with the application's business and its demands: demand is the purpose of data mining, and the close integration of business with algorithms and technology is essential, because only by understanding the business and grasping the demand can one analyze the data in a targeted way and mine its value. What practical applications need, therefore, are people who understand both the business and data mining algorithms. The engineering characteristic means that mastering data mining requires solid engineering ability: a good data mining practitioner is first of all an engineer, with strong skills in handling large-scale data and developing prototype systems, so data-handling and programming ability are very important. The functional-collection characteristic makes it important, when applying data mining concretely, to accumulate a foundation of different functions and multiple algorithms. The interdisciplinary characteristic means that learners must actively understand and absorb ideas and techniques from related fields.
In short, these four characteristics together summarize data mining and provide a framework for learning it.
II. Characteristics of Big Data
The term big data is often used to describe the massive amount of information generated in the era of information explosion. The significance of studying big data lies in discovering and understanding the content of information and the connections among pieces of information, which first requires clarifying the characteristics and basic concepts of big data. It is widely recognized that big data has the standard "4V" characteristics:
1. Volume: the volume of data is huge, jumping from the terabyte (TB) level to the petabyte (PB) level.
2. Variety: There are many types of data, such as web logs, videos, pictures, and geolocation information.
3. Velocity: processing must be fast, often in real time, which is also fundamentally different from traditional data mining techniques.
4. Value: value density is low but overall value is high; making reasonable use of such low-density-value data and analyzing it correctly and accurately can bring great commercial and social value.
The above "4V" characteristics capture the main differences between big data and the partially sampled "small data" of the past. However, practice is the only way to realize the ultimate value of big data. From the point of view of practical applications and the complexity of big data processing, big data also exhibits the following additional "4V" characteristics:
5. Variability: in different scenarios and under different research objectives, the structure and meaning of the data may change; therefore, actual research needs to consider the specific context.
6. Veracity: obtaining real and reliable data is a prerequisite for accurate and effective analysis; only real and accurate data can yield truly meaningful results.
7. Volatility/Variance: because the data itself contains noise and the analysis process is not standardized, different algorithms or different analytical procedures and means can produce unstable results.
8. Visualization: in the big data environment, visualization can explain the meaning of data more intuitively, helping people understand the data and interpret the results.
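The volatility point (7) can be seen even in a tiny experiment: the same clustering algorithm, run on the same data with different random starts, can settle into different local optima. The data and seeds below are invented for illustration:

```python
import random

# Toy 1-D data with three well-separated groups (invented for illustration).
data = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0]

def kmeans(points, k, seed, iters=20):
    """Plain Lloyd's algorithm; the result depends on the random start."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return tuple(sorted(round(c, 2) for c in centroids))

# The same algorithm on the same data, run with different seeds,
# can converge to different final centroids -- the volatility at issue.
outcomes = {kmeans(data, k=2, seed=s) for s in range(30)}
print(outcomes)
```

Standardizing the analysis process (fixed seeds, repeated runs, agreed-upon validation) is what keeps this kind of instability from silently contaminating conclusions.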
In summary, these "8V" characteristics provide strong guidance for big data analysis and data mining.
III. Data Mining in the Era of Big Data
In the era of big data, the core and essence of data mining is the organic combination of four elements: applications, algorithms, data, and platforms.
Data mining is application-driven: it comes from practice, and massive data is generated in applications. It must be driven by specific application data, supported by algorithms, tools, and platforms, and it must ultimately apply the discovered knowledge and information back to practice, so as to provide quantitative, reasonable, feasible information that can produce great value.
Mining the useful information implicit in big data requires designing and developing corresponding data mining and learning algorithms. Algorithm design and development must be driven by specific application data and be applied and verified on real problems, while the implementation and application of algorithms require an efficient processing platform, which can also address the volatility problem. An efficient processing platform must analyze massive data effectively, integrate multiple data sources in a timely manner, strongly support the implementation of data-driven algorithms and data visualization, and standardize the data analysis process.
In short, combining the four aspects of applications, algorithms, data, and platforms is a comprehensive distillation of how data mining is understood in the era of big data, reflecting its essence and core. These four aspects also form the architecture of the corresponding research, developed at the following four levels:
Application layer (Application): concerned with data collection and algorithm validation; the key issue is understanding the semantics and domain knowledge related to the application.
Data layer (Data): data management, storage, access, and security; the concern is how to use data efficiently.
Algorithm layer (Algorithm): mainly the design and implementation of algorithms for data mining, machine learning, approximation, and so on.
Platform layer (Infrastructure): data access and computation; the computing platform handles distributed, large-scale data.
In summary, data mining research is divided into several levels, with different research content at each level. From these levels we can see the main current research directions in data mining, such as: using data fusion techniques to preprocess sparse, heterogeneous, uncertain, incomplete, multi-source data; mining complex and dynamically changing data; testing the global knowledge obtained through local learning and model fusion, and feeding relevant information back to the preprocessing stage; and parallelizing and distributing data for effective use.
IV. Development of a Big Data Mining System
1. Background Objectives
The advent of the big data era has led to an explosion in the size and complexity of data, prompting data analysts in different application domains to use data mining techniques to analyze their data. In application domains such as healthcare, high-end manufacturing, and finance, a typical data mining task often requires complex sub-task configuration, integration of many different types of mining algorithms, and efficient operation in a distributed computing environment. Therefore, an imperative for data mining applications in the big data era is to develop and build computing platforms and tools that enable data analysts in application domains to perform data analysis tasks efficiently.
As mentioned earlier, data mining involves multiple tasks, multiple functions, and many different mining algorithms, all of which require an efficient supporting platform.
2. Related Products
Existing data mining tools
Tools such as Weka, SPSS, and SQL Server provide user-friendly interfaces that facilitate users' analyses. However, they are not suitable for large-scale data analysis, and it is difficult for users to add new algorithm programs to these tools.
Popular data mining algorithm libraries
Libraries such as Mahout, MLC++, and MILK provide a large number of data mining algorithms. However, they require advanced programming skills for task configuration and algorithm integration.
Some recently emerged integrated data mining products
Products such as Radoop and BC-PDM provide user-friendly interfaces for quickly configuring data mining tasks. However, they are based on the Hadoop framework, offer very limited support for non-Hadoop algorithm programs, and do not explicitly address resource allocation in multi-user, multi-task situations.
3. FIU-Miner
To address the limitations of existing tools and products in big data mining, our team developed a new platform, FIU-Miner, which stands for A Fast, Integrated, and User-Friendly System for Data Mining in Distributed Environments. It is a user-friendly data mining system that supports efficient computation and fast integration in distributed environments. Compared with existing data mining platforms, FIU-Miner provides a set of new features that help data analysts carry out complex data mining tasks easily and effectively.
These new features fall mainly into the following areas:
A. User-friendly and fast configuration of data mining tasks. Based on the "Software as a Service" model, FIU-Miner hides low-level details unrelated to the data analysis task. With its user-friendly interface, users can configure a complex data mining task by directly assembling existing algorithms into a workflow, without writing any code.
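FIU-Miner's actual interface is graphical and is not reproduced here. As a rough, hypothetical sketch of the underlying idea, assembling existing algorithms into a workflow amounts to declaring a small dependency graph and executing its steps in topological order; all step names and the toy "algorithms" below are invented:

```python
from graphlib import TopologicalSorter  # Python 3.9+

# A hypothetical workflow: each step lists the steps it depends on.
workflow = {
    "load":     [],
    "clean":    ["load"],
    "features": ["clean"],
    "train":    ["features"],
    "evaluate": ["train"],
}

# Each step is a stand-in "algorithm" that reads/writes a shared context.
registry = {
    "load":     lambda ctx: ctx.update(rows=[3, 1, 2]),
    "clean":    lambda ctx: ctx.update(rows=sorted(ctx["rows"])),
    "features": lambda ctx: ctx.update(feats=[r * 2 for r in ctx["rows"]]),
    "train":    lambda ctx: ctx.update(model=sum(ctx["feats"])),
    "evaluate": lambda ctx: ctx.update(score=ctx["model"] / len(ctx["feats"])),
}

def run(workflow, registry):
    """Execute steps in an order consistent with their dependencies."""
    ctx = {}
    for step in TopologicalSorter(workflow).static_order():
        registry[step](ctx)
    return ctx

result = run(workflow, registry)
print(result["score"])
```

The user-facing benefit is that only the graph and the algorithm names are declared; the ordering and execution are handled by the platform.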
B. Flexible multi-language program integration. Users can expand and manage the collection of analysis tools by directly importing state-of-the-art data mining algorithms into the system's algorithm library. There is no restriction on the implementation language of these imported algorithms, because FIU-Miner is able to assign tasks to computing nodes with suitable operating environments.
C. Effective resource management in heterogeneous environments. FIU-Miner supports running data mining tasks in heterogeneous computing environments (including graphics workstations, single computers, servers, and so on). It takes a variety of factors into account (including algorithm implementation, server load balancing, and data location) to optimize the utilization of computing resources.
D. Efficient program scheduling and execution.
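Points C and D can be illustrated with a hypothetical greedy scheduler (not FIU-Miner's actual algorithm, which is not described here): each task declares the runtime it needs and is assigned to the least-loaded node that offers that runtime, placing expensive tasks first. Node names, runtimes, and costs are invented:

```python
def schedule(tasks, nodes):
    """tasks: list of (name, runtime, cost); nodes: name -> set of runtimes.
    Returns {task name: node name}, balancing total assigned cost."""
    load = {n: 0 for n in nodes}
    assignment = {}
    # Place expensive tasks first so the greedy balancing works better.
    for name, runtime, cost in sorted(tasks, key=lambda t: -t[2]):
        eligible = [n for n, rts in nodes.items() if runtime in rts]
        if not eligible:
            raise ValueError(f"no node supports runtime {runtime!r}")
        chosen = min(eligible, key=lambda n: load[n])
        load[chosen] += cost
        assignment[name] = chosen
    return assignment

# Hypothetical heterogeneous nodes and the runtimes they provide.
nodes = {
    "workstation-1": {"java", "python"},
    "server-1": {"java", "r"},
    "server-2": {"python", "r"},
}
# Hypothetical tasks: (name, required runtime, estimated cost).
tasks = [
    ("preprocess", "python", 4),
    ("train", "java", 9),
    ("visualize", "r", 2),
    ("score", "python", 3),
]
print(schedule(tasks, nodes))
```

A production scheduler would also weigh data location and dynamic load, as the text notes, but the sketch shows why matching runtimes to nodes removes the single-language restriction of Hadoop-only products.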
The application architecture includes a user-interface layer, a task and system management layer, a logical resource layer, and a heterogeneous physical resource layer. This layered architecture gives full consideration to the distributed storage of massive data, the integration of different data mining algorithms, the configuration of multiple tasks, and the delivery of functionality to system users.
The FIU-Miner system has been applied in different areas, such as high-end manufacturing, intelligent warehouse management, and spatial data processing. TerraFly GeoCloud is a platform built on top of the TerraFly system to support a variety of online spatial data analyses. It provides MapQL, a spatial data query and mining language with SQL-like statements; beyond SQL-like querying, it can mine spatial data and render and draw it according to users' different requirements. By constructing spatial data analysis workflows, it optimizes the analysis process and improves analysis efficiency.
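MapQL's actual syntax is not reproduced here. As a rough stand-in for the kind of spatial predicate such a language expresses, the following sketch keeps only the points within a given radius of a center, using the haversine distance; the place names and coordinates are approximate and invented for illustration:

```python
from math import radians, sin, cos, asin, sqrt

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometers."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = radians(lat2 - lat1)
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))

def within_radius(points, center, radius_km):
    """Keep named points whose distance to `center` is at most radius_km."""
    clat, clon = center
    return [name for name, lat, lon in points
            if haversine_km(lat, lon, clat, clon) <= radius_km]

# Toy points around Miami (coordinates approximate, illustration only).
points = [
    ("downtown", 25.774, -80.193),
    ("airport", 25.795, -80.279),
    ("faraway", 26.715, -80.053),
]
print(within_radius(points, center=(25.774, -80.193), radius_km=15))
```

A spatial query language lets analysts state such a radius predicate declaratively and leaves the distance computation and rendering to the platform.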
Manufacturing refers to the industrial process of turning raw materials into finished products on a large scale. High-end manufacturing refers to newly emerged manufacturing industries with high technological content, high added value, and strong competitiveness; typical examples include electronic semiconductor production, precision instrument manufacturing, and biopharmaceuticals. These areas often involve tight engineering design, complex assembly lines, large amounts of controlled processing equipment with many process parameters, precise process control, and strict specification of materials. Yield and quality rely heavily on process control and optimized decision-making. Manufacturing companies therefore spare no effort in adopting measures to optimize production processes, tune control parameters, and improve product quality and yield, thereby increasing their competitiveness.
In terms of spatial data processing, TerraFly GeoCloud analyzes a wide range of online spatial data. For traditional analysis approaches, the difficulties lie in the fact that MapQL statements are hard to write, the relationships between tasks are complex, and passing spatial data between sequentially executed steps is inefficient. FIU-Miner can effectively address these three difficulties.
In summary, the complexity of big data poses new requirements and challenges for data mining in both theoretical and algorithmic research. Big data is the phenomenon; the core is to mine the potential information contained in the data and make it valuable. Data mining is thus a combination of theoretical techniques and practical applications, an example of theory and practice working together.