Python big data mining series: an organized summary of the introductory basics (introductory tutorial with source code)

Python has been very popular in the big data industry over the past two years. As a Pythonista, I have had to dabble in big data analysis too, so here are some notes on it.

Python data analysis and mining technology overview

Data analysis means analyzing known data to extract valuable information from it, such as means and standard deviations; the amount of data involved is not necessarily large. Data mining means analyzing and mining large amounts of data to obtain previously unknown, valuable information. For example, mining a website's user and user-behavior data can reveal users' latent needs, which can then be used to improve the site.

Data analysis and data mining are inseparable: data mining is an extension of data analysis. Data mining techniques help us discover the patterns that link things together, for example discovering users' latent needs, delivering personalized information, or finding relationships between diseases and symptoms, or even between diseases and drugs.

First, let's look at the modules available for data analysis. This article covers numpy, pandas, scipy, and matplotlib.

Basic usage of these modules follows.

numpy module installation and use

Installation:

Download it from http://www.lfd.uci.edu/~gohlke/pythonlibs/

The package I downloaded is version 1.11.3: http://www.lfd.uci.edu/~gohlke/pythonlibs/f9r7rmd8/numpy-1.11.3+mkl-cp35-cp35m-win_amd64.whl

After downloading, run pip install "numpy-1.11.3+mkl-cp35-cp35m-win_amd64.whl"

Be sure to install the numpy build that includes mkl (Intel's Math Kernel Library), since numpy is better supported with it.

Simple use of numpy
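A minimal sketch of everyday numpy usage, assuming nothing beyond numpy itself. The array contents and variable names are made up for illustration:

```python
import numpy as np

# Build a one-dimensional and a two-dimensional array from Python lists.
a = np.array([5, 2, 9, 1])
b = np.array([[3, 7, 1], [8, 4, 6]])

print(b[1][2])           # element in the second row, third column
print(a.max(), a.min())  # largest and smallest element

a_sorted = np.sort(a)    # returns a sorted copy and leaves a unchanged
print(a_sorted)
print(a_sorted[1:3])     # slicing works the same way as with Python lists
```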

Generating random numbers

This mainly uses the random module under numpy.
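A small sketch of generating random numbers with numpy.random. The seed, ranges, and sample sizes are arbitrary choices for the demo, not values from the article:

```python
import numpy as np

np.random.seed(0)  # fixed seed so the run is repeatable (an assumption for the demo)

# 10 random integers in the half-open range [0, 20)
ints = np.random.randint(0, 20, 10)

# 1000 samples from a normal distribution with mean 5 and standard deviation 2
normal = np.random.normal(5, 2, 1000)

print(ints)
print(normal.mean(), normal.std())
```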

pandas

Installing it with pip install pandas is all it takes.

Straight to the code:
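A minimal sketch of the two basic pandas containers, Series and DataFrame. The numbers and column names are invented for illustration:

```python
import pandas as pd

# A Series is a labelled one-dimensional column of values.
s = pd.Series([8, 9, 2, 1])

# A DataFrame is a table; without explicit labels, rows and columns
# are both numbered from 0.
df = pd.DataFrame([[5, 6, 2, 3], [3, 5, 1, 4], [7, 9, 3, 5]])

# Column names can also be given explicitly (names here are hypothetical).
named = pd.DataFrame([[5, 6], [3, 5]], columns=["one", "two"])

print(s)
print(df)
print(named)
```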

In the pandas output, the row of numbers across the top holds the column labels and the first column of numbers holds the row labels; a value is located by its row and its column:

Commonly used methods are as follows:
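As a rough sketch, a few methods that come up constantly are head, tail, and sort_values. The DataFrame below is made-up sample data:

```python
import pandas as pd

df = pd.DataFrame([[5, 6, 2, 3], [3, 5, 1, 4], [7, 9, 3, 5]])

first_two = df.head(2)          # first 2 rows (head() alone returns the first 5)
last_one = df.tail(1)           # last row
by_col0 = df.sort_values(by=0)  # rows ordered by the column labelled 0

print(first_two)
print(last_one)
print(by_col0)
```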

Next, let's look at the statistics pandas computes on the data with describe(). Each row of the result holds one statistic: count, mean, std, min, the 25%, 50%, and 75% quantiles, and max.
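A short sketch of those summary statistics, using describe() on made-up sample data:

```python
import pandas as pd

df = pd.DataFrame([[5, 6, 2, 3], [3, 5, 1, 4], [7, 9, 3, 5]])

# One column of statistics per numeric column of df.
stats = df.describe()
print(stats)

# Individual statistics can be picked out by row label and column label.
print(stats.loc["mean", 0])
```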

Transposing turns rows into columns and columns into rows, as follows:
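A minimal sketch of transposing with the T attribute, again on invented sample data:

```python
import pandas as pd

df = pd.DataFrame([[5, 6, 2, 3], [3, 5, 1, 4], [7, 9, 3, 5]])

transposed = df.T  # 3 rows x 4 columns becomes 4 rows x 3 columns
print(df.shape, transposed.shape)
print(transposed)
```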

Importing data through pandas

pandas supports a variety of input formats. I'll simply list the few most common in everyday use here; for more input methods, see the source code or the official website.

CSV file

After a csv file is imported, the output follows the csv file's own rows and columns, and every column in the file is displayed. For example, if the file has five columns of data, the printed result shows five columns.
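A small sketch of reading CSV with read_csv. To keep it self-contained, the CSV text is held in memory; the column names and the file name mentioned in the comment are hypothetical:

```python
import io
import pandas as pd

# Stand-in for a file on disk. With a real file, pass its path instead,
# e.g. pd.read_csv("data.csv") (hypothetical file name).
csv_text = (
    "id,name,reads,comments,score\n"
    "1,a,100,3,4.5\n"
    "2,b,250,7,3.9\n"
)

data = pd.read_csv(io.StringIO(csv_text))
print(data)  # five columns in the file, five columns in the output
```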

excel table

Depends on the xlrd module, so install it first. (Recent pandas versions read .xlsx files with openpyxl and use xlrd only for legacy .xls files.)

As before, the output shows the Excel sheet as it was, except that a row number is added at the start of each row.

Reading SQL

Depends on PyMySQL, so install it first. When pandas takes SQL as input, you need to supply two arguments: the first is the SQL statement and the second is the database connection instance.
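The two-argument call can be sketched as follows. To stay runnable without a MySQL server, this sketch swaps in Python's built-in sqlite3 for PyMySQL; with PyMySQL the connection object would come from pymysql.connect(...) instead. The table and column names are invented:

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for a MySQL server (an assumption for the demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER, reads INTEGER)")
conn.executemany("INSERT INTO articles VALUES (?, ?)", [(1, 100), (2, 250)])
conn.commit()

# First argument: the SQL statement. Second argument: the connection instance.
result = pd.read_sql("SELECT * FROM articles", conn)
print(result)

conn.close()
```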

Reading HTML

Depends on the lxml module, so install it first.

For HTTPS pages, it depends on the BeautifulSoup4 and html5lib modules.

Reading HTML only reads the tables in the page, that is, only the content inside <table> tags.

The result is displayed as a Python list, with row and column labels added.

Reading txt files

The output is displayed with row and column labels added.
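A minimal sketch of reading a text file with read_table, assuming the fields are separated by tabs (the default for read_table). The text is held in memory here, and the file name in the comment is hypothetical:

```python
import io
import pandas as pd

# Stand-in for a tab-separated text file; with a real file use
# pd.read_table("data.txt", header=None) (hypothetical file name).
text = "1\tfoo\t10\n2\tbar\t20\n"

# header=None because this sample has no header line.
data = pd.read_table(io.StringIO(text), header=None)
print(data)  # row and column labels 0, 1, 2... are added automatically
```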

scipy

To install it, first download the whl file and then install it with pip install "package name". The whl package download address is: http://www.lfd.uci.edu/~gohlke/pythonlibs/f9r7rmd8/scipy-0.18.1-cp35-cp35m-win_amd64.whl

matplotlib data visualization and analysis

Install this module directly with pip install matplotlib; there is no need to download a whl file first.

See the code below:
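A minimal sketch of a basic line plot, assuming matplotlib.pyplot and made-up data points. The Agg backend is selected only so the sketch runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; remove this line to use a window
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 8]  # x coordinates (sample data)
y = [5, 7, 2, 1, 5]  # y coordinates (sample data)

lines = plt.plot(x, y)   # with no style argument, plot draws a solid line
plt.gcf().canvas.draw()  # render off-screen; in an interactive session call plt.show() instead
```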

Below are ways to change the style of the graph.

There are several line types:

There are several colors:

There are several marker shapes:
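For reference, here is a partial list of the format-string codes that plt.plot accepts, followed by one plot that combines a code from each group. The dictionaries are only a convenient way to present the codes, not part of matplotlib:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; remove this line to use a window
import matplotlib.pyplot as plt

# Partial lists of matplotlib format-string codes.
LINE_STYLES = {"-": "solid", "--": "dashed", "-.": "dash-dot", ":": "dotted"}
COLORS = {"b": "blue", "g": "green", "r": "red", "c": "cyan",
          "m": "magenta", "y": "yellow", "k": "black", "w": "white"}
MARKERS = {"o": "circle", "*": "star", "s": "square", "x": "x mark",
           "+": "plus", "^": "triangle up", "D": "diamond",
           "p": "pentagon", "h": "hexagon"}

# A format string combines codes from the groups: green, dashed, circle markers.
line, = plt.plot([1, 2, 3], [4, 1, 3], "g--o")
plt.gcf().canvas.draw()  # in an interactive session call plt.show() instead
```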

We can also tweak the graph to add some style. The following changes the scatter plot to use red dots; the code is as follows.
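A sketch of the red-dot version, assuming the same kind of made-up data. The format string "or" means circle markers ("o") in red ("r"), with no connecting line:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; remove this line to use a window
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 8]  # sample data
y = [5, 7, 2, 1, 5]

dots, = plt.plot(x, y, "or")  # "o" = circle markers, "r" = red
plt.gcf().canvas.draw()       # in an interactive session call plt.show() instead
```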

We can also draw a dashed line graph; the code is shown below:
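A sketch of the dashed-line version on made-up data, where the format string "--" selects the dashed line style:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; remove this line to use a window
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 8]  # sample data
y = [5, 7, 2, 1, 5]

dashed, = plt.plot(x, y, "--")  # "--" = dashed line
plt.gcf().canvas.draw()         # in an interactive session call plt.show() instead
```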

You can also add a title and x-axis and y-axis labels to the graph; the code is shown below:
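A minimal sketch of adding a title and axis labels. The title and label texts are placeholders chosen for the demo:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; remove this line to use a window
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 8]  # sample data
y = [5, 7, 2, 1, 5]

plt.plot(x, y)
plt.title("show")   # title above the plot (placeholder text)
plt.xlabel("ages")  # x-axis label (placeholder text)
plt.ylabel("temp")  # y-axis label (placeholder text)
plt.gcf().canvas.draw()  # in an interactive session call plt.show() instead
```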

Histogram

A histogram is a very good way to show how much data falls in each segment. Here is a histogram built from random numbers.

The Y-axis is the number of occurrences, and the X-axis is the value (or range of values).
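A sketch of such a histogram, assuming normally distributed random numbers. The mean, standard deviation, sample size, and seed are arbitrary choices:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; remove this line to use a window
import matplotlib.pyplot as plt

np.random.seed(0)  # repeatable demo data (an assumption for the sketch)
data = np.random.normal(10.0, 1.0, 10000)

# hist returns the count per bin, the bin edges, and the drawn patches.
# With no bins argument, the data range is split into 10 equal-width bins.
counts, edges, patches = plt.hist(data)
plt.gcf().canvas.draw()  # in an interactive session call plt.show() instead
```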

You can also specify the type of histogram with the histtype parameter:

The differences between these types are hard to describe in words, so go ahead and try them out.

An example:
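One possible example, assuming the same kind of random data as before. It passes explicit bin edges and picks histtype="stepfilled"; the other accepted values are "bar", "barstacked", and "step":

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; remove this line to use a window
import matplotlib.pyplot as plt

np.random.seed(0)  # repeatable demo data (an assumption for the sketch)
data = np.random.normal(10.0, 1.0, 10000)

bins = np.arange(6, 15, 1)  # bin edges 6, 7, ..., 14 (chosen for this data)

# stepfilled draws one filled outline instead of separate bars.
counts, edges, _ = plt.hist(data, bins, histtype="stepfilled")
plt.gcf().canvas.draw()  # in an interactive session call plt.show() instead
```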

The subplot function

What is the subplot function? Subplots let you display multiple small plots inside one large panel, each small plot being a subplot of the larger panel.

We know that a plot is generated with the plot function; a subplot is created with subplot. The code works as follows:
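A minimal sketch of subplot, with an arbitrary layout and made-up data. The three arguments are the number of rows, the number of columns, and the position of the current cell:

```python
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; remove this line to use a window
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]  # sample data

top_left = plt.subplot(2, 2, 1)   # 2 rows, 2 columns, cell 1 (top left)
plt.plot(x, [1, 4, 9, 16])

top_right = plt.subplot(2, 2, 2)  # cell 2 (top right)
plt.plot(x, [4, 3, 2, 1])

bottom = plt.subplot(2, 1, 2)     # 2 rows, 1 column, cell 2: the whole bottom row
plt.plot(x, [2, 2, 3, 3])

plt.gcf().canvas.draw()  # in an interactive session call plt.show() instead
```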

We can now plot a batch of data, and anomalies are easy to spot from the plot. Let's practice with a csv file that records the reads and comments of a website's articles.

First, the structure of this csv file: the first column is the serial number, the second column is the URL of each article, the third column is the number of reads of each article, and the fourth column is the number of comments on each article.

Our requirement is to use the number of comments as the Y-axis and the number of reads as the X-axis, so we need the data in the third and fourth columns. The values attribute of pandas gives us the values of one row, and slicing that row yields its third value (number of reads) and fourth value (number of comments). But that is only a single row, and we need the reads and comments for every row in the csv file. So what to do? A clever reader might say: define two lists, loop over the csv file, and append each row's reads and comments to the corresponding list. That works, but there is a faster way: transpose with T first, and then values gives the full column of reads and the full column of comments directly. From there, matplotlib's pylab can draw the plot. With the idea clear, let's write it.

Here is the code:
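A sketch of the transpose approach, under a few assumptions: the CSV rows are invented sample data held in memory, the column names are hypothetical, and pyplot is used in place of pylab (the plotting calls are the same):

```python
import io
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # off-screen rendering; remove this line to use a window
import matplotlib.pyplot as plt

# Invented sample in the layout described: serial number, URL, reads, comments.
csv_text = (
    "id,url,reads,comments\n"
    "1,http://example.com/a,120,3\n"
    "2,http://example.com/b,450,12\n"
    "3,http://example.com/c,80,1\n"
)
data = pd.read_csv(io.StringIO(csv_text))

# After transposing, each entry of .values is one whole column of the csv file.
columns = data.T.values
reads = columns[2].astype(int)     # third column; cast because mixed columns give object dtype
comments = columns[3].astype(int)  # fourth column

plt.plot(reads, comments, "o")  # reads on the X-axis, comments on the Y-axis
plt.xlabel("reads")
plt.ylabel("comments")
plt.gcf().canvas.draw()  # in an interactive session call plt.show() instead
```

Points that sit far away from the rest of the cloud are the anomalies worth a closer look.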
