Python has been very popular in the big data industry over the past two years. As a Pythonista I have had to dabble in big data analysis too, so here are some notes on it.
Python data analysis and mining technology overview
Data analysis means analyzing known data and extracting valuable information from it, such as averages and standard deviations; the amount of data involved is usually not very large. Data mining means analyzing and digging through a large amount of data to obtain previously unknown, valuable information. For example, mining a website's user data and user behavior to uncover users' potential needs, and then improving the website accordingly.
Data analysis and data mining are inseparable; data mining is an enhancement of data analysis. Data mining technology can help us better discover the laws between things, for example discovering users' potential needs, pushing personalized information, and discovering the relationships between diseases and symptoms, or even between diseases and drugs.
First, let's look at which modules are available for data analysis. This post covers numpy, pandas, scipy, and matplotlib.
The basic use of these modules is shown below.
numpy module installation and use
Installation:
Download the package from http://www.lfd.uci.edu/~gohlke/pythonlibs/
The package I downloaded is version 1.11.3, at http://www.lfd.uci.edu/~gohlke/pythonlibs/f9r7rmd8/numpy-1.11.3+mkl-cp35-cp35m-win_amd64.whl
After downloading, install it with pip install "numpy-1.11.3+mkl-cp35-cp35m-win_amd64.whl"
Be sure to install the numpy build that includes mkl, so that numpy is better supported.
Simple use of numpy
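A minimal sketch of everyday numpy usage: creating arrays, indexing, sorting, taking the maximum and minimum, and slicing. The array values are made up for illustration.

```python
import numpy as np

# One-dimensional array built from a Python list
x = np.array([5, 2, 9, 1])

# Two-dimensional array with 2 rows and 3 columns
y = np.array([[3, 7, 1], [8, 4, 6]])

print(x[0])       # first element of x
print(y[1][2])    # row 1, column 2 (both zero-based)

x.sort()          # sorts x in place, ascending
print(x)

print(y.max(), y.min())   # largest and smallest value in the whole array
print(x[1:3])             # slice: elements at index 1 and 2
```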
Generating random numbers
This mainly uses the random module under numpy.
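As a rough sketch, here are three commonly used functions from numpy.random. The sizes and ranges are arbitrary choices for the example.

```python
import numpy as np

# 3x3 array of samples from a normal distribution (mean 0, standard deviation 1)
normal_data = np.random.normal(0, 1, size=(3, 3))

# 10 random integers from 0 up to, but not including, 100
int_data = np.random.randint(0, 100, size=10)

# 5 floats drawn uniformly from the interval [0, 1)
uniform_data = np.random.random(5)

print(normal_data)
print(int_data)
print(uniform_data)
```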
pandas
Install it with pip install pandas, and that's it.
Straight to the code:
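A minimal sketch of the two basic pandas structures, Series and DataFrame. The numbers are invented for the example.

```python
import pandas as pd

# A Series is a one-dimensional labeled array
a = pd.Series([8, 9, 2, 1])

# A DataFrame is a two-dimensional table; without explicit labels,
# both rows and columns are numbered from 0
b = pd.DataFrame([[5, 6, 2, 3], [3, 5, 1, 4], [7, 9, 3, 5]])

print(a)
print(b)
print(b.loc[0, 1])   # the element at row label 0, column label 1
```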
Here is how to read the pandas output: the row of numbers across the top gives the column numbers, and the first column of numbers gives the row numbers. An element is located by its row and its column:
Commonly used methods are as follows:
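Which methods count as "commonly used" is a matter of taste; the sketch below assumes head, tail, and sort_values are among them. The column names are hypothetical.

```python
import pandas as pd

b = pd.DataFrame([[5, 6, 2, 3], [3, 5, 1, 4], [7, 9, 3, 5]],
                 columns=["one", "two", "three", "four"])

print(b.head(2))                  # first 2 rows (5 when no argument is given)
print(b.tail(1))                  # last row (5 when no argument is given)
print(b.sort_values(by="two"))    # rows ordered by the "two" column, ascending
```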
Next, look at the summary statistics pandas computes for the data. Each row of the result carries the following information:
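A small sketch using describe(), which is the pandas call that produces this kind of summary. The data and column names are made up.

```python
import pandas as pd

b = pd.DataFrame([[5, 6, 2, 3], [3, 5, 1, 4], [7, 9, 3, 5]],
                 columns=["one", "two", "three", "four"])

stats = b.describe()
print(stats)

# Rows of the result, per column:
#   count  number of non-missing values
#   mean   average
#   std    standard deviation
#   min    smallest value
#   25%    first quartile
#   50%    median
#   75%    third quartile
#   max    largest value
```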
Transpose: rows become columns and columns become rows, as follows:
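A short sketch of transposing with the T attribute, again on invented data.

```python
import pandas as pd

b = pd.DataFrame([[5, 6, 2, 3], [3, 5, 1, 4], [7, 9, 3, 5]],
                 columns=["one", "two", "three", "four"])

t = b.T          # 3 rows x 4 columns becomes 4 rows x 3 columns
print(b.shape)
print(t.shape)
print(t)
```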
Importing data through pandas
pandas supports a variety of input formats. Here I simply list the few that are most common in day-to-day work; for more input methods, see the source code or the official website.
CSV file
After a csv file is imported, the output follows the file's own layout and shows as many columns as the file has. For example, my file has five columns of data, so the printed result is displayed in five columns.
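The behavior described above can be sketched with read_csv. To keep the example self-contained it first writes a tiny five-column file; the file name and its contents are hypothetical.

```python
import os
import tempfile

import pandas as pd

# Write a small sample file so the sketch runs on its own (made-up data)
csv_path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with open(csv_path, "w", encoding="utf-8") as f:
    f.write("id,name,price,sales,stock\n")
    f.write("1,pen,2.5,100,40\n")
    f.write("2,book,12.0,30,15\n")

data = pd.read_csv(csv_path)   # the first line is used as the header by default
print(data)                    # five columns in the file, five columns shown
print(data.shape)              # (rows, columns)
```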
excel table
Depends on the xlrd module, so please install it.
As before, the output shows the Excel content as it was, except that a row number is added at the start of each row.
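A minimal sketch of reading a spreadsheet with read_excel. The helper name load_excel and the file name in the usage comment are hypothetical. Note that recent pandas versions use xlrd only for .xls files and rely on openpyxl for .xlsx files.

```python
import pandas as pd


def load_excel(path, sheet_name=0):
    """Read one worksheet into a DataFrame (sheet_name=0 means the first sheet)."""
    return pd.read_excel(path, sheet_name=sheet_name)


# Usage, with a hypothetical file name:
# data = load_excel("abc.xls")
# print(data)    # the sheet contents, with a row number in front of each row
```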
Reading SQL
Depends on PyMySQL, so you need to install it. When pandas takes SQL as input, you need to supply two arguments: the first is the SQL statement, and the second is the database connection instance.
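The two-argument call can be sketched as follows. So that the example runs without a MySQL server, it swaps in Python's built-in sqlite3 for PyMySQL; the table name, columns, and rows are invented. With PyMySQL, only the line that creates the connection would differ.

```python
import sqlite3

import pandas as pd

# Assumption: an in-memory SQLite database stands in for the MySQL server
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE articles (id INTEGER, title TEXT, hits INTEGER)")
conn.executemany("INSERT INTO articles VALUES (?, ?, ?)",
                 [(1, "first post", 120), (2, "second post", 45)])
conn.commit()

sql = "SELECT * FROM articles"    # first argument: the SQL statement
data = pd.read_sql(sql, conn)     # second argument: the connection instance
print(data)

conn.close()
```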
Reading HTML
Depends on the lxml module, so install it.
For HTTPS pages, it also depends on the BeautifulSoup4 and html5lib modules.
Reading HTML only reads the tables in the page, that is, only the content inside <table> tags.
The result is displayed as a Python list, with row and column labels added.
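A rough sketch of read_html on an inline snippet instead of a live page. The helper name read_tables and the sample markup are hypothetical; running it needs one of the parser libraries mentioned above.

```python
from io import StringIO

import pandas as pd


def read_tables(html_text):
    """Return a list with one DataFrame per <table> found in html_text."""
    return pd.read_html(StringIO(html_text))


# Made-up page: the paragraph is ignored, only the table is read
sample_html = """
<p>This paragraph is not a table.</p>
<table>
  <tr><th>name</th><th>score</th></tr>
  <tr><td>alice</td><td>90</td></tr>
  <tr><td>bob</td><td>85</td></tr>
</table>
"""

# Usage (requires lxml, or BeautifulSoup4 plus html5lib):
# tables = read_tables(sample_html)
# print(len(tables))    # number of tables found on the page
# print(tables[0])      # the first table as a DataFrame
```

read_html also accepts a URL in place of the text, in which case pandas fetches the page itself.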
Reading txt files
The output is displayed with row and column labels added.
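A minimal sketch with read_table, which splits on tabs by default. It assumes a tab-separated text file; the file name and contents are made up.

```python
import os
import tempfile

import pandas as pd

# Write a small tab-separated sample file (hypothetical data)
txt_path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(txt_path, "w", encoding="utf-8") as f:
    f.write("name\tscore\n")
    f.write("alice\t90\n")
    f.write("bob\t85\n")

data = pd.read_table(txt_path)   # default separator is the tab character
print(data)
```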
scipy
To install it, first download the whl file and then install it with pip install "package name". The whl package download address is: http://www.lfd.uci.edu/~gohlke/pythonlibs/f9r7rmd8/scipy-0.18.1-cp35-cp35m-win_amd64.whl
matplotlib data visualization and analysis
We install this module directly with pip install matplotlib. There is no need to download a whl file in advance.
See the code below:
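A minimal line-plot sketch. It uses the matplotlib.pyplot interface and renders to a file with the Agg backend so that it runs without a display; both are assumptions of this example, and the data points and file name are invented.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: draw off-screen; remove this line to use plt.show()
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 8]
y = [5, 7, 2, 1, 5]

lines = plt.plot(x, y)         # x values first, then y values; draws a solid line
plt.savefig("line_plot.png")   # with an interactive backend, plt.show() opens a window instead
plt.close()
```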
Next, how to modify the style of the graph.
For the line type, there are several options:
For the color, there are several options:
For the point shape, there are several options:
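As a hedged summary, the dictionaries below list some of the format characters matplotlib accepts; they are not exhaustive. The three kinds can be combined into one format string, as the last lines show with made-up data.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: draw off-screen so no display is needed
import matplotlib.pyplot as plt

line_types = {"-": "solid line", "--": "dashed line",
              "-.": "dash-dot line", ":": "dotted line"}
colors = {"b": "blue", "g": "green", "r": "red", "c": "cyan",
          "m": "magenta", "y": "yellow", "k": "black", "w": "white"}
shapes = {"o": "circle", "s": "square", "h": "hexagon", "*": "star",
          "+": "plus sign", "x": "x mark", "d": "thin diamond", "p": "pentagon"}

# Color + shape + line type in one string: red squares joined by a dashed line
fmt = "r" + "s" + "--"
(line,) = plt.plot([1, 2, 3], [2, 4, 1], fmt)
plt.savefig("styled_plot.png")
plt.close()
```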
We can also adjust the graph a little by adding some styles. The following changes the scatter plot so that its points are red; the code is as follows.
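One way to get red dots, sketched with a format string passed to plot ("o" for circle markers, "r" for red). The data is invented.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: draw off-screen so no display is needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 8]
y = [5, 7, 2, 1, 5]

# "o" draws circle markers with no connecting line, "r" makes them red
(dots,) = plt.plot(x, y, "or")
plt.savefig("red_dots.png")
plt.close()
```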
We can also draw a dashed line graph. The code is shown below:
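A dashed line can be sketched the same way, with "--" as the format string. The data is invented.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: draw off-screen so no display is needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 8]
y = [5, 7, 2, 1, 5]

(dashed,) = plt.plot(x, y, "--")   # "--" selects the dashed line type
plt.savefig("dashed_line.png")
plt.close()
```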
You can also add a title and x-axis and y-axis labels to the graph. The code is shown below:
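A short sketch with title, xlabel, and ylabel. The label texts and data are placeholders.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: draw off-screen so no display is needed
import matplotlib.pyplot as plt

x = [1, 2, 3, 4, 8]
y = [5, 7, 2, 1, 5]

plt.plot(x, y)
plt.title("show")      # text above the plot
plt.xlabel("ages")     # label under the X-axis
plt.ylabel("temp")     # label beside the Y-axis

ax = plt.gca()         # keep a handle on the axes for later inspection
plt.savefig("labeled_plot.png")
plt.close()
```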
Histogram
A histogram is a very good way to show how much data falls in each segment. Here is a histogram built from random numbers.
The Y-axis is the number of occurrences, and the X-axis is the value (or range of values).
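A minimal histogram sketch. The distribution parameters and sample size are arbitrary choices for the example.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: draw off-screen so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# 1000 samples from a normal distribution (mean 5, standard deviation 2)
data = np.random.normal(5.0, 2.0, 1000)

# hist returns the count per bin, the bin edges, and the drawn bars;
# 10 equal-width bins are used when none are specified
counts, bin_edges, bars = plt.hist(data)
plt.savefig("histogram.png")
plt.close()

print(counts)      # Y-axis: occurrences in each bin
print(bin_edges)   # X-axis: boundaries of the value ranges
```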
You can also specify the type of histogram with the histtype parameter:
The differences between the types are hard to describe in words, so feel free to try them out.
An example:
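A sketch using histtype="stepfilled" together with explicit bin edges. matplotlib accepts "bar", "barstacked", "step", and "stepfilled" for this parameter; the data and the bin range here are made up.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: draw off-screen so no display is needed
import matplotlib.pyplot as plt
import numpy as np

data = np.random.normal(10.0, 1.0, 10000)

# Bin edges at 6, 7, ..., 14, which gives 8 bins of width 1
bins = np.arange(6, 15, 1)

# "stepfilled" draws a single filled outline instead of separate bars
counts, bin_edges, shapes_drawn = plt.hist(data, bins=bins, histtype="stepfilled")
plt.savefig("histogram_stepfilled.png")
plt.close()
```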
The subplot function
What is the subplot function? A subplot lets you display multiple small plots inside one large panel, each small plot being a sub-plot of the larger panel.
We know that a plot is generated with the plot function; a subplot is created with subplot. The code works as follows:
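A small sketch of subplot(rows, columns, position). The layout below, two plots on the top row and one wide plot underneath, is just one possible arrangement, and the data is invented.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: draw off-screen so no display is needed
import matplotlib.pyplot as plt

fig = plt.figure()

plt.subplot(2, 2, 1)    # grid of 2 rows x 2 columns, slot 1: top-left
plt.plot([1, 2, 3], [1, 4, 9])

plt.subplot(2, 2, 2)    # same grid, slot 2: top-right
plt.plot([1, 2, 3], [9, 4, 1], "--")

plt.subplot(2, 1, 2)    # grid of 2 rows x 1 column, slot 2: the whole bottom row
plt.plot([1, 2, 3], [2, 2, 2], "or")

plt.savefig("subplots.png")
plt.close()
```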
We can now plot a batch of data, and it is very easy to spot anomalies from the plot. Here we will practice with a csv file that records the number of reads and comments for a website's articles.
First, the structure of this csv file: the first column is the serial number, the second column is the URL of each article, the third column is the number of reads for each article, and the fourth column is the number of comments for each article.
Our requirement is to put the number of comments on the Y-axis and the number of reads on the X-axis, so we need the data in the third and fourth columns. The values attribute of pandas gives us the values of one row, and slicing that row yields its number of reads and number of comments. But that is only one row, and we need the reads and comments of every row in the csv file, so what to do? A clever reader might say: define two lists, traverse the csv file, and append each row's reads and comments to the corresponding list. In fact there is a faster way: transpose the data with T, after which values returns the whole reads column and the whole comments column directly. Then plot them with matplotlib's pylab interface and we are done. With the idea clear, let's write it.
Here is the code:
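A sketch of the idea under stated assumptions: the csv file is generated on the spot with invented rows (the real file is not shown here), the column names are hypothetical, and pyplot is used in place of pylab. Column positions are zero-based, so the third and fourth columns are at index 2 and 3.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")   # assumption: draw off-screen so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in for the article file: serial number, URL, reads, comments
csv_path = os.path.join(tempfile.mkdtemp(), "articles.csv")
with open(csv_path, "w", encoding="utf-8") as f:
    f.write("id,url,reads,comments\n")
    f.write("1,http://example.com/a,1200,15\n")
    f.write("2,http://example.com/b,800,9\n")
    f.write("3,http://example.com/c,950,300\n")   # unusually many comments
    f.write("4,http://example.com/d,1500,20\n")

data = pd.read_csv(csv_path)

print(data.values[0])        # values gives one row at a time

columns = data.T.values      # after transposing, each entry is a whole column
reads = columns[2].astype(int)       # third column: number of reads
comments = columns[3].astype(int)    # fourth column: number of comments

plt.plot(reads, comments, "o")       # reads on the X-axis, comments on the Y-axis
plt.xlabel("reads")
plt.ylabel("comments")
plt.savefig("reads_vs_comments.png")
plt.close()
```

In the resulting scatter plot, the article with 950 reads and 300 comments sits far above the others, which is the kind of anomaly the plot makes easy to spot.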