1. matplotlib, plotting and graphics
2. pandas, the key library for data mining; provides data structures and tools for mining analysis
3. numpy, provides arrays and basic statistics
4. scipy, provides various mathematical and scientific routines
5. python common lib, the basic Python standard library
II. Environment setup
1. install python
2. install pip
Installing pandas depends on the pip version; the minimum is 8.0.0. If pip is below 8, such as 7.2.1, upgrade pip first.
On Windows the command is "python -m pip install -U pip".
On Linux it is "pip install -U pip".
The command "pip --version" shows the pip version number.
3. Install pandas
On Windows the command is "pip install pandas".
On Debian-based Linux platforms a system package is also available (python3-pandas for Python  3):
sudo apt-get install python-pandas
4. Install matplotlib
pip install matplotlib
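After installation, the result can be checked from Python itself. The following is a minimal sketch, assuming Python 3.8 or later (for importlib.metadata); the helper name installed_versions is hypothetical and only used for illustration.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# The packages named in the setup steps above
report = installed_versions(["pip", "numpy", "scipy", "pandas", "matplotlib"])
for name, version in report.items():
    print(f"{name}: {version or 'not installed'}")
```

A package reported as "not installed" can then be added with the matching pip command.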
III. Data types
python common types
string, list, tuple, dict, set
6 sequence types (Python 2)
list, tuple, string, unicode string, buffer object, xrange
pandas types
ndarray (from numpy), Series, DataFrame
ndarray, the array type, was added because:
list and tuple are built on a pointer + object design, i.e. a list or tuple stores void* pointers, each pointing to the data of a specific object.
Because they are void* pointers, both can store any data type, i.e. the element types can be non-uniform.
This flexibility has drawbacks when the amount of data is large, i.e. when dealing with big data:
1. Large storage footprint, wasting memory, because two parts are stored: pointer + data
2. Slow reads: the index locates the pointer, then the pointer locates the data
So for big data there is ndarray, a numerical type similar to a C++ array. All elements have the same type, and reads and modifications are fast.
Alias: array. It saves memory, reduces CPU computing time, and comes with a rich set of processing functions.
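The difference can be made concrete with a small comparison. This is a rough sketch, not a benchmark: the variable names are made up for illustration, and sys.getsizeof only approximates the memory a list really uses.

```python
import sys

import numpy as np

values = list(range(1000))
arr = np.array(values, dtype=np.int64)

# A list may mix element types; an ndarray has one dtype for all elements.
mixed = [1, "two", 3.0]
print(arr.dtype)

# Raw data of the array: 1000 elements * 8 bytes each
print(arr.nbytes)

# The list stores an array of pointers plus one Python int object per element
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)
print(list_bytes)

# Whole-array arithmetic without an explicit Python loop
doubled = arr * 2
print(doubled[:5])
```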
Series, a fixed-length ordered dictionary,
similar to a one-dimensional array of objects; it is composed of data and an index
added because:
dict is unordered, and its keys and values have a mapping relationship. However, key and value are not independent of each other; they are stored together.
Manipulating one affects the other. In a Series, the keys (the index) and the values are independent and stored separately.
The index of a Series is fixed-length and ordered. You get the whole index via series.index, and all values via series.values.
The index of a Series can be given a name via series.index.name.
The Series as a whole can also be given a name, via series.name
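A short sketch of these Series attributes follows. The labels and numbers are invented for illustration only.

```python
import pandas as pd

# Data and index are passed separately and stored separately
scores = pd.Series([88, 92, 91], index=["java", "c", "python"])
scores.name = "score"            # name of the Series as a whole
scores.index.name = "language"   # name of the index

print(scores.index)    # all labels
print(scores.values)   # all values, as a numpy ndarray
print(scores["c"])     # lookup by label, like a dict
```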
DataFrame:
1. a tabular data structure
2. containing an ordered set of columns (similar to an index)
3. which can be thought of as a collection of Series sharing one index
import pandas as pd
data1 = {'name': ['java', 'c', 'python'], 'year': [2, 2, 3]}
frame = pd.DataFrame(data1)
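The "collection of Series" view can be checked directly. The sketch below reuses the data1 dictionary from above; the names name_col and year_col are only illustrative.

```python
import pandas as pd

data1 = {'name': ['java', 'c', 'python'], 'year': [2, 2, 3]}
frame = pd.DataFrame(data1)

# Each column is a Series
name_col = frame['name']
year_col = frame['year']
print(type(year_col).__name__)

# All columns share the same row index
print(year_col.index.equals(name_col.index))

# The columns themselves are an ordered set
print(list(frame.columns))
```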
------------------------------------------------
IV. Basic data analysis process:
1. Acquisition of data
2. Data preparation - normalization, creation of the various indexes
3. Display and description of the data, for debugging
e.g. df.index, df.values, df.head(n), df.tail(n), df.describe()
4. Data selection
index fetch, slice fetch, row and column fetch, rectangular region fetch
index fetch: df.row1 or df['row1'] (row1 here is a column label)
row and column fetch: df.loc[rowlist, columnlist], e.g. df.loc[0:1, ['col1','col2']]
positional fetch with two integer indexes: df.iloc[0,0] takes the upper left cell of the 2D data; a slice such as df.iloc[0:2,0:2] takes the first 2 rows and first 2 columns.
5. Simple statistics and processing
Statistical averages, maximum values, etc.
6. Grouping
df.groupby(df.row1)
7. Merging
append, adds rows at the end (DataFrame.append was removed in pandas 2.0; use pd.concat instead)
concat, concatenation; covers what append does, and can also merge two different 2D data structures
join, a SQL-style join on a shared field, like the SQL condition where a.row1 = b.row1
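Steps 3 to 7 can be sketched on a small table. Everything here is illustrative: the data and the column names col1, col2, col3 and label are assumptions, and pd.concat and merge are used for the merging step.

```python
import pandas as pd

# Illustrative data; column names are assumptions
df = pd.DataFrame({
    "col1": ["a", "b", "a", "b"],
    "col2": [1, 2, 3, 4],
    "col3": [10.0, 20.0, 30.0, 40.0],
})

# 3. Display and description
print(df.head(2))
print(df.describe())

# 4. Selection
column = df["col1"]                     # one column, as a Series
block = df.loc[0:1, ["col1", "col2"]]   # label slice, both ends included
corner = df.iloc[0, 0]                  # upper left cell
square = df.iloc[0:2, 0:2]              # position slice, end excluded

# 5. Simple statistics
mean_col2 = df["col2"].mean()
max_col3 = df["col3"].max()

# 6. Grouping: sum of col2 within each col1 group
sums = df.groupby("col1")["col2"].sum()

# 7. Merging
more = pd.DataFrame({"col1": ["c"], "col2": [5], "col3": [50.0]})
stacked = pd.concat([df, more], ignore_index=True)   # rows appended

lookup = pd.DataFrame({"col1": ["a", "b"], "label": ["first", "second"]})
joined = df.merge(lookup, on="col1")   # like SQL: where a.col1 = b.col1
print(joined)
```

Note that df.loc slices by label and includes both end points, while df.iloc slices by position and excludes the end, which is why both slices above return two rows.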
------------------------------------------------
V. Advanced Data Processing and Visualization:
1. Cluster Analysis
Clustering is an important part of both the descriptive and the predictive tasks of data mining. It is based on similarity,
dividing similar objects into groups or subsets by static classification.
Many third-party Python libraries provide clustering algorithms.
Among the many clustering algorithms, K-means is widely used because of its simplicity and speed.
The basic principle is:
1. find the centers (centroids) of the dataset,
2. compute distances using the mean squared deviation and assign each data point to its nearest center, so that the points within a group are close together and the groups are well separated
Example:
>>> from pylab import *
>>> from scipy.cluster.vq import *
>>> list1=[88,64,96,85]
>>> list2=[92,99,95,94]
>>> list3=[91,87,99,95]
>>> list4=[78,99,97,81]
>>> list5=[88,78,98,84]
>>> list6=[100,95,100,92]
>>> tempdate = (list1, list2, list3, list4, list5, list6)
>>>
>>> tempdate
([88, 64, 96, 85], [92, 99, 95, 94], [91, 87, 99, 95], [78, 99, 97, 81], [88, 78, 98, 84], [100, 95, 100, 92])
>>> date = vstack(tempdate)
>>>
>>> date
array([[ 88,  64,  96,  85],
       [ 92,  99,  95,  94],
       [ 91,  87,  99,  95],
       [ 78,  99,  97,  81],
       [ 88,  78,  98,  84],
       [100,  95, 100,  92]])
>>> centroids,abc=kmeans(date,2) # Find the cluster centers; the second parameter sets the number of classes, e.g. 5 for 5 classes
>>> centroids # One centroid per cluster, computed column by column as the mean of the cluster's rows
array([[88, 71, 97, 84],
[90, 95, 97, 90]])
>>>
>>> result,cde=vq(date,centroids) # Assign each row of the dataset to its nearest cluster center
>>> result
array([0, 1, 1, 1, 0, 1])
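To make the two steps of the principle visible, here is a minimal numpy-only sketch of the assign-then-average loop. It is not the scipy implementation: the function names assign and kmeans_sketch are hypothetical, and taking the first two rows as starting centers is an assumption made so the run is repeatable (scipy's kmeans picks its starting centers randomly, so its cluster numbering can differ between runs).

```python
import numpy as np


def assign(data, centroids):
    """Index of the nearest centroid (squared Euclidean distance) per row."""
    diffs = data[:, None, :] - centroids[None, :, :]
    return (diffs ** 2).sum(axis=2).argmin(axis=1)


def kmeans_sketch(data, initial_centroids, n_iter=10):
    """Alternate between assigning rows and averaging each group."""
    data = np.asarray(data, dtype=float)
    centroids = np.asarray(initial_centroids, dtype=float).copy()
    for _ in range(n_iter):
        labels = assign(data, centroids)
        # New center = column-wise mean of the rows in the group;
        # an empty group keeps its previous center
        updated = np.array([
            data[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(len(centroids))
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return centroids, assign(data, centroids)


scores = np.array([
    [88, 64, 96, 85],
    [92, 99, 95, 94],
    [91, 87, 99, 95],
    [78, 99, 97, 81],
    [88, 78, 98, 84],
    [100, 95, 100, 92],
])

# Assumption: start from the first two rows instead of random centers
centers, labels = kmeans_sketch(scores, scores[:2])
print(centers)
print(labels)   # [0 1 1 1 0 1]
```

Each row gets exactly one label, so the label array always has as many entries as the dataset has rows.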
2. Plotting fundamentals
Plotting in Python consists of two parts:
the plotting API: matplotlib offers the various plotting interfaces.
the integrated library: pylab (which bundles the common methods of numpy and matplotlib) makes plotting faster and more convenient.
import numpy as np
import matplotlib.pyplot as plt
t = np.arange(0,10)
plt.plot(t, t+2)
plt.plot(t,t, 'o', t,t +2, t,t**2, 'o') # sets of (x,y); drawn as lines by default, 'o' draws points (scatter)
plt.bar(t,t**2) # bar chart
plt.show()
--------------------
import pylab as pl
t = pl.arange(0,10)
pl.plot(t, t+2)
pl.show()
3. matplotlib image attribute control
Color, style
Titles: plot, horizontal axis, vertical axis
plt.title('philip\'s python plot')
plt.xlabel('date')
plt.ylabel('value')
Other: pl.figure(figsize=(8,6),dpi =100)
pl.plot(x,y, color='red', linewidth=3, label='line1')
pl.legend(loc='upper left')
Subplot
pl.subplot(211) # the overall image can be divided into a two-dimensional grid of parts;
# the first digit is the number of rows, the second is the number of columns; the third is the index, which counts from 1 at the upper left across the current row and then on to the next row.
# If any of the numbers has two digits, such as index 11, separate them with ',', e.g. pl.subplot(3,4,11)
pl.axes([left, bottom, width, height]) # the four values are passed as one list, each in the range (0,1); left is the distance to the left edge, bottom is the distance to the bottom edge
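The attribute controls above can be combined in one figure. This is a sketch under a few assumptions: it uses matplotlib.pyplot rather than pylab, selects the non-interactive Agg backend so it also runs without a display, and writes the image to an in-memory buffer instead of calling show(). The data and the inset position are made up.

```python
import io

import matplotlib
matplotlib.use("Agg")   # assumption: no display available
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(0, 10)
fig = plt.figure(figsize=(8, 6), dpi=100)

# Upper half of a 2-row, 1-column grid (index 1)
plt.subplot(211)
plt.plot(x, x + 2, color='red', linewidth=3, label='line1')
plt.title("philip's python plot")
plt.xlabel('date')
plt.ylabel('value')
plt.legend(loc='upper left')

# Lower half (index 2)
plt.subplot(212)
plt.bar(x, x ** 2)

# Free-floating axes: [left, bottom, width, height] as fractions of the figure
inset = plt.axes([0.2, 0.25, 0.2, 0.15])
inset.plot(x, x, 'o')

# Render to memory; pass a file name instead to save to disk
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```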
4. pandas plotting
Series and DataFrame support plotting directly; their plot interface wraps calls to matplotlib, such as
series.plot()
df.close.plot() # specific parameters are similar to the matplotlib common interface
Attribute control
Similar to the matplotlib common interface; the kind of image can be changed, e.g. bar charts, line charts, etc.
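A small sketch of pandas plotting follows. The numbers are invented, the column name close follows the example above, the volume column is an added assumption, and the Agg backend is selected so the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")   # assumption: no display available
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data for illustration
df = pd.DataFrame({
    "close": [10.0, 10.5, 10.2, 11.0],
    "volume": [100, 120, 90, 150],
})

fig, (ax_line, ax_bar) = plt.subplots(1, 2, figsize=(8, 4))

# One column (a Series) as a line; keyword arguments are passed on to matplotlib
df["close"].plot(ax=ax_line, title="close", color="red", linewidth=2)

# The whole DataFrame as a bar chart, one bar per column and row
df.plot(kind="bar", ax=ax_bar)
```

Passing ax= places each plot on a chosen subplot; without it, pandas chooses or creates the axes itself.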
--------common-----------------
list, tuple, dict
-------- numpy / pandas --------
ndarray (numpy), Series, DataFrame (pandas)