How to use data mining to extract problem hotspots
1. Visual analysis

Big data analytics has both expert users and ordinary users, but for both of them the most basic requirement is visual analysis, because visualization presents the characteristics of big data intuitively and is easily accepted by readers, as clear and concise as reading a map.
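
As a minimal illustration of visual analysis (a hypothetical sketch using pandas and matplotlib, not a method prescribed by this article), one might plot the distribution of a metric so its characteristics can be read at a glance:

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical example data: one row per order with an amount column.
    df = pd.DataFrame({"amount": [12.0, 8.5, 23.1, 40.0, 7.2, 19.9, 55.3, 18.4]})

    # A histogram makes the shape of the distribution visible at a glance.
    df["amount"].plot(kind="hist", bins=5, title="Order amount distribution")
    plt.xlabel("amount")
    plt.savefig("amount_hist.png")  # or plt.show() in an interactive session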

2. Data mining algorithms

The theoretical core of big data analysis is data mining algorithms. Different algorithms, suited to different data types and formats, present the characteristics of the data more scientifically, and it is precisely because these statistical methods are recognized as sound by statisticians around the world that we can go deep into the data and uncover its accepted value. These algorithms are also what allow big data to be processed quickly; if an algorithm took years to reach a conclusion, the value of big data would be lost.

3. Predictive analytics

One of the ultimate applications of big data analytics is predictive analytics: features are mined from big data and a model is built scientifically, after which new data can be fed into the model to predict future outcomes.
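
As a hedged sketch of this idea (assuming scikit-learn; the feature values here are invented for illustration), a model can be fitted on historical data and then applied to new data:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical historical data: features mined from big data (e.g. page views,
    # ad spend) and the quantity we want to predict (e.g. next-week sales).
    X_hist = np.array([[120, 3.0], [150, 4.5], [90, 2.0], [200, 6.0]])
    y_hist = np.array([14.0, 18.5, 10.2, 25.1])

    model = LinearRegression().fit(X_hist, y_hist)  # build the model from history

    # Bring new data into the model to predict a future value.
    X_new = np.array([[170, 5.0]])
    print(model.predict(X_new))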

4. Semantic engines

The diversity of unstructured data brings new challenges to data analysis, and a set of tools is needed to parse, extract, and analyze data systematically. Semantic engines need to be designed with enough artificial intelligence to proactively extract information from data.
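
One very simplified way to proactively extract information from unstructured text (a sketch assuming scikit-learn's TfidfVectorizer; a real semantic engine would go far beyond this) is to surface each document's most characteristic terms:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = [
        "payment failed when checking out with a credit card",
        "app crashes on login after the latest update",
        "credit card payment declined at checkout",
    ]

    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(docs)
    terms = vec.get_feature_names_out()

    # For each document, print the three highest-weighted terms as rough "keywords".
    for i, row in enumerate(tfidf.toarray()):
        top = row.argsort()[-3:][::-1]
        print(i, [terms[j] for j in top])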

5. Data quality and data management

Big data analytics is inseparable from data quality and data management. High-quality data and effective data management, whether in academic research or in business applications, ensure that analysis results are true and valuable.
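
A minimal, hypothetical data-quality check with pandas (the column names and rules are invented for illustration) might look like this:

    import pandas as pd

    df = pd.DataFrame({
        "order_id": [1, 2, 2, 3, 4],
        "amount":   [10.0, None, 5.0, -3.0, 8.0],
    })

    # Basic data-quality rules: no duplicate keys, no missing or negative amounts.
    df = df.drop_duplicates(subset="order_id")
    df = df.dropna(subset=["amount"])
    df = df[df["amount"] >= 0]

    print(df)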

These five aspects are the foundation of big data analysis; of course, deeper big data analysis involves many more distinctive, more in-depth, and more specialized methods.

Big data technologies

Data collection: ETL tools extract data from distributed, heterogeneous data sources such as relational databases and flat data files into a temporary middle layer, where it is cleansed, transformed, and integrated, and finally load it into a data warehouse or data mart, where it becomes the basis for online analytical processing and data mining.
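
A toy ETL step in Python (a sketch only; the column names and the SQLite target are hypothetical stand-ins, and real pipelines would use dedicated ETL tools) could look like this:

    import io
    import sqlite3
    import pandas as pd

    # Extract: read from a flat-file source (an inline CSV standing in for one).
    csv_source = io.StringIO(
        "order_id,customer_id,order_date,amount\n"
        "1,c7,2024-01-03,19.90\n"
        "2,,2024-01-04,5.00\n"
    )
    raw = pd.read_csv(csv_source)

    # Transform: cleanse and convert in a temporary "middle layer".
    raw["order_date"] = pd.to_datetime(raw["order_date"])
    raw = raw.dropna(subset=["customer_id"])

    # Load: write the integrated result into a warehouse table (SQLite as a stand-in).
    with sqlite3.connect("warehouse.db") as conn:
        raw.to_sql("fact_orders", conn, if_exists="append", index=False)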

Data access: relational databases, NoSQL, SQL, etc.

Infrastructure: Cloud storage, distributed file storage, etc.

Data processing: Natural Language Processing (NLP) is a discipline that studies the language problems of human-computer interaction. The key to processing natural language is letting the computer "understand" it, so natural language processing is also called natural language understanding, or computational linguistics. It is on the one hand a branch of language information processing and on the other one of the core topics of artificial intelligence.
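
A minimal, purely illustrative step toward letting a program work with text (standard-library Python only; real NLP systems use dedicated toolkits) is tokenization plus word frequency:

    import re
    from collections import Counter

    text = "Natural language processing lets computers process natural language."

    # Tokenize: lowercase the text and split it into word tokens.
    tokens = re.findall(r"[a-z]+", text.lower())

    # Count how often each token occurs, a first step toward many NLP tasks.
    print(Counter(tokens).most_common(3))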

Statistical analysis: hypothesis testing, significance testing, analysis of variance, correlation analysis, t-tests, chi-square tests, partial correlation analysis, distance analysis, regression analysis (simple regression, multiple regression, stepwise regression, regression prediction and residual analysis, ridge regression, logistic regression, curve estimation), factor analysis, cluster analysis, principal component analysis, fast clustering and hierarchical clustering, discriminant analysis, correspondence analysis, multiple correspondence analysis (optimal scaling), bootstrap techniques, and more.
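
As a small hedged example of two of these methods (assuming SciPy; the samples are invented), a t-test and a chi-square test can be run as follows:

    from scipy import stats

    # Two hypothetical samples, e.g. conversion times for two page designs.
    a = [12.1, 11.8, 13.0, 12.4, 11.9]
    b = [13.5, 13.1, 14.0, 13.8, 13.2]

    t_stat, t_p = stats.ttest_ind(a, b)          # independent two-sample t-test
    chi2, chi_p = stats.chisquare([18, 22, 20])  # chi-square goodness-of-fit test

    print(t_stat, t_p)
    print(chi2, chi_p)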

Data mining: classification, estimation, prediction, affinity grouping or association rules, clustering, description and visualization, and complex data type mining (text, Web, graphics and images, video, audio, etc.).

Model prediction: predictive modeling, machine learning, modeling and simulation.

Results presentation: cloud computing, tag clouds, relationship diagrams, etc.

Big Data Processing

1. Big Data Processing I: Collection

Big data collection refers to using multiple databases to receive data from clients (web, app, sensors, etc.), with which users can perform simple query and processing work. For example, e-commerce companies use traditional relational databases such as MySQL and Oracle to store each transaction, and NoSQL databases such as Redis and MongoDB are also commonly used for data collection.

The main feature and challenge of big data collection is high concurrency, because thousands of users may be accessing and operating at the same time; train ticket websites and Taobao, for example, see peak concurrent access in the millions, so a large number of databases must be deployed on the collection side to handle the load, and how to load balance and shard among these databases requires careful thought and design.
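
One common, simplified approach to spreading writes across many collection databases (a hypothetical hash-based sharding sketch, not a recommendation from this article) is to route each record by a key:

    import hashlib

    SHARDS = ["db0", "db1", "db2", "db3"]   # hypothetical collection databases

    def shard_for(user_id: str) -> str:
        """Route a record to a shard by hashing its key."""
        digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    # Each user's transactions always land on the same database.
    print(shard_for("user-42"), shard_for("user-43"))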

2. Big Data Processing II: Import/Preprocessing

Although there are many databases on the collection side, effectively analyzing the massive data requires importing it from the front end into a centralized large distributed database or distributed storage cluster, and some simple cleansing and preprocessing can be done during the import. Some users also use Storm from Twitter to stream the data as it is imported, to meet some of the business's real-time computing needs.

The import and preprocessing stage is characterized mainly by the sheer volume of imported data, which often reaches hundreds of megabytes or even gigabytes per second.
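
A toy version of cleansing records as they are imported into central storage (a generator-based sketch; the field names are invented, and a deployment facing this kind of throughput would use a stream processor such as Storm) might be:

    def preprocess(records):
        """Drop obviously bad records and normalize fields during import."""
        for rec in records:
            if not rec.get("user_id"):
                continue                      # discard incomplete records
            rec["amount"] = float(rec.get("amount", 0))
            yield rec

    incoming = [{"user_id": "u1", "amount": "19.9"}, {"user_id": "", "amount": "5"}]
    cleaned = list(preprocess(incoming))      # would be written to the central cluster
    print(cleaned)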

3. Big Data Processing III: Statistics/Analysis

Statistics and analysis mainly use distributed databases or distributed computing clusters to perform ordinary analysis, classification, and aggregation on the massive data stored in them, in order to meet the majority of common analytical needs. For real-time needs, EMC's GreenPlum, Oracle's Exadata, and the MySQL-based columnar store Infobright are often used; for batch processing, or for needs based on semi-structured data, Hadoop can be used.

The main feature and challenge of the statistics and analysis stage is the large volume of data involved, which places heavy demands on system resources, especially I/O.
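
The kind of classified aggregation described above usually boils down to grouped summaries; a hedged single-machine sketch with pandas (column names invented, whereas the systems named above run this at cluster scale) is:

    import pandas as pd

    orders = pd.DataFrame({
        "region": ["north", "south", "north", "east"],
        "amount": [120.0, 85.5, 60.0, 42.3],
    })

    # Classified aggregation: total, average, and count of order amounts per region.
    summary = orders.groupby("region")["amount"].agg(["sum", "mean", "count"])
    print(summary)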

4. Big Data Processing IV: Mining

Unlike the preceding statistics and analysis processes, data mining generally has no predetermined theme; it mainly applies various algorithms to the existing data to produce a predictive effect and thereby meet some higher-level data analysis needs. Typical algorithms include K-means for clustering, SVM for statistical learning, and Naive Bayes for classification, and the main tool used is Hadoop's Mahout. This process is characterized by complex mining algorithms and very large volumes of data and computation; moreover, commonly used data mining algorithms are mostly single-threaded.
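
As a short, hedged illustration of one of the algorithms mentioned (K-means clustering via scikit-learn rather than Mahout; the feature values are invented):

    import numpy as np
    from sklearn.cluster import KMeans

    # Hypothetical customer features: [monthly spend, visits per month].
    X = np.array([[10, 1], [12, 2], [11, 1], [95, 20], [90, 18], [100, 22]])

    # Cluster with no predetermined theme; the groups emerge from the data itself.
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(km.labels_)           # which cluster each customer falls into
    print(km.cluster_centers_)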

A general big data processing workflow should satisfy at least these four steps to be considered reasonably complete.