Who Says Rookies Can't Analyze Data: SPSS article

Capture

Big data capture refers to using multiple databases to receive data from clients (web pages, apps, sensors, and so on); users can run simple queries and processing against these databases. For example, e-commerce companies use traditional relational databases such as MySQL and Oracle to store every transaction, while NoSQL databases such as Redis and MongoDB are also commonly used for data collection.
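As a rough illustration of this collection layer, here is a minimal sketch that records each transaction durably in MySQL and mirrors a copy into Redis. The connection parameters, the `orders` table, and the `record_transaction` helper are all hypothetical, invented for this example; the client libraries assumed are `pymysql` and `redis`.

```python
# Minimal collection-layer sketch: persist each transaction to MySQL
# and mirror a lightweight copy into Redis for fast access.
# Connection details and the `orders` table are hypothetical.
import json

import pymysql
import redis

mysql_conn = pymysql.connect(host="localhost", user="shop",
                             password="secret", database="commerce")
redis_conn = redis.Redis(host="localhost", port=6379)

def record_transaction(order_id: int, user_id: int, amount: float) -> None:
    # Durable store: one row per transaction in the relational database.
    with mysql_conn.cursor() as cur:
        cur.execute(
            "INSERT INTO orders (order_id, user_id, amount) VALUES (%s, %s, %s)",
            (order_id, user_id, amount),
        )
    mysql_conn.commit()
    # Fast store: push the same record onto a Redis list for downstream collectors.
    redis_conn.lpush("orders:stream",
                     json.dumps({"order_id": order_id,
                                 "user_id": user_id,
                                 "amount": amount}))

record_transaction(1001, 42, 19.99)
```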

Import/Preprocessing

Although the collection side involves many databases, analyzing this massive data effectively requires importing it from the front end into a centralized large-scale distributed database or distributed storage cluster, and some simple cleaning and preprocessing can be done during the import. Some users also stream the data through Twitter's Storm at import time to meet the business's real-time computing needs.
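To make the "simple cleaning and preprocessing" step concrete, here is a small, self-contained sketch in plain Python. The record layout (`user_id`, `amount`, `timestamp`) is a made-up example, not a real schema from the article.

```python
# Illustrative import-time cleaning step: validate and normalize raw
# records before loading them into the central store.
import csv
import io

RAW = """user_id,amount,timestamp
42,19.99,2015-06-01T10:00:00
,bad,2015-06-01T10:00:01
43,5.00,2015-06-01T10:00:02
"""

def clean_records(raw_text):
    reader = csv.DictReader(io.StringIO(raw_text))
    for row in reader:
        # Drop rows with a missing user id.
        if not row["user_id"]:
            continue
        # Drop rows whose amount cannot be parsed as a number.
        try:
            amount = float(row["amount"])
        except ValueError:
            continue
        yield {"user_id": int(row["user_id"]),
               "amount": amount,
               "timestamp": row["timestamp"]}

for rec in clean_records(RAW):
    print(rec)  # in practice, write to the distributed store instead
```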

Statistics/Analytics

Statistics and analytics mainly use distributed databases or distributed computing clusters to run ordinary analysis, classification, and summarization over the massive stored data, which covers the majority of common analytical needs. For real-time requirements, options include EMC's Greenplum, Oracle's Exadata, and the MySQL-based columnar store Infobright; batch-processing workloads, or workloads over semi-structured data, can use Hadoop.
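The kind of summarization described here typically boils down to aggregation queries. Below is a hedged sketch: the GROUP BY SQL itself is portable across these analytical stores, while the connection uses `pymysql` and so assumes a MySQL-protocol backend such as Infobright; the `warehouse` database and `orders` table are hypothetical.

```python
# Sketch of a typical summary/classification query against an analytical store.
# Table and column names are invented for illustration.
import pymysql

conn = pymysql.connect(host="analytics-db", user="analyst",
                       password="secret", database="warehouse")

SUMMARY_SQL = """
    SELECT category,
           COUNT(*)    AS order_count,
           SUM(amount) AS revenue
    FROM orders
    GROUP BY category
    ORDER BY revenue DESC
"""

with conn.cursor() as cur:
    cur.execute(SUMMARY_SQL)
    for category, order_count, revenue in cur.fetchall():
        print(category, order_count, revenue)
```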

Mining

Unlike the preceding statistics and analytics, data mining generally has no preset topic: it runs a variety of algorithms over the existing data to achieve a predictive effect, thereby meeting higher-level analysis needs. Typical algorithms include K-Means for clustering, SVM for statistical learning, and Naive Bayes for classification; the main tool used is Mahout, which runs on Hadoop.
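For a feel of the K-Means clustering named above, here is a single-machine illustration using scikit-learn as a stand-in for Mahout's distributed implementation; the toy (recency, spend) customer features are invented for the example.

```python
# Single-machine K-Means illustration (scikit-learn stands in for Mahout).
import numpy as np
from sklearn.cluster import KMeans

# Toy data: (days since last purchase, total spend) per customer.
X = np.array([[1, 200], [2, 180], [30, 10],
              [28, 15], [15, 90], [14, 100]], dtype=float)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("labels:", kmeans.labels_)            # cluster assignment per customer
print("centers:", kmeans.cluster_centers_)  # learned cluster centroids
```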