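## Time Series Analysis in Python, with Worked Examples

### Overview

1. Handling time series data with pandas
2. How to check whether a time series is stationary
3. How to make a time series stationary
4. Forecasting a time series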

### 1. Importing and handling time series data with pandas

    import pandas as pd
    import numpy as np
    import matplotlib.pylab as plt
    from matplotlib.pylab import rcParams
    # rcParams sets the figure size
    rcParams['figure.figsize'] = 15, 6

The data file AirPassengers.csv can be downloaded from http://github.com/aarshayj/Analytics_Vidhya/tree/master/Articles/Time_Series_Analysis

    data = pd.read_csv('AirPassengers.csv')
    print('\n Data types:')
    print(data.dtypes)

    # parse_dates selects the column that holds the date-time information
    # index_col tells pandas which column to use as the index
    # Strings such as '1949-01' are converted to datetime values while parsing
    data = pd.read_csv('AirPassengers.csv', parse_dates=['Month'], index_col='Month')
    print(data.index)

### 2. How to check whether a time series is stationary (Stationarity)

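#### 1. Stationarity is usually judged by a few statistics that should be constant over time:

1. A constant mean
2. A constant variance
3. An autocovariance that does not depend on time

1. Mean: X is the value of the series and t is time. In the left plot the mean is constant along the time axis, that is, the mean is not a function of time, so the series is stationary. In the right plot the values rise overall as time passes, so the mean is a function of time; the series has a trend and is therefore non-stationary.
2. Variance: In the left plot the variance is constant over time, that is, the amplitude with which the values fluctuate around the mean is fixed, so the series is stationary. In the right plot the amplitude differs from one point in time to another, so the variance depends on time and the series is non-stationary, even though the left and right plots have the same mean.
3. Autocovariance: The autocovariance of a series is the covariance between its values at two different times i and j. In the left plot the autocovariance does not depend on time. In the right plot the frequency of the fluctuations clearly changes over time, so different choices of i and j give different covariances and the series is non-stationary, even though its mean and variance do not depend on time.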

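#### 2. Checking stationarity in Python

1. Rolling statistics: the mean and standard deviation of the data within each time window.
2. Dickey-Fuller test: this one is more involved. Roughly, at a given confidence level, the null hypothesis is that the series is non-stationary. If the test statistic is less than the critical value, the null hypothesis is rejected and the series is considered stationary; otherwise it is considered non-stationary.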
    from statsmodels.tsa.stattools import adfuller

    def test_stationarity(timeseries):
        # One-year window: the value at time t is replaced by the mean of the
        # 12 months up to and including t; likewise for the standard deviation.
        rolmean = timeseries.rolling(window=12).mean()
        rolstd = timeseries.rolling(window=12).std()

        # Plot rolling statistics:
        fig = plt.figure()
        orig = plt.plot(timeseries, color='blue', label='Original')
        mean = plt.plot(rolmean, color='red', label='Rolling mean')
        std = plt.plot(rolstd, color='black', label='Rolling standard deviation')

        plt.legend(loc='best')
        plt.title('Rolling Mean & Standard Deviation')
        plt.show(block=False)

        # Dickey-Fuller test:
        print('Results of Dickey-Fuller Test:')
        dftest = adfuller(timeseries, autolag='AIC')
        # The first four outputs are the test statistic, p-value, number of lags used
        # and number of observations used; dftest[4] holds the critical values.
        dfoutput = pd.Series(dftest[0:4], index=['Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used'])
        for key, value in dftest[4].items():
            dfoutput['Critical value (%s)' % key] = value

        print(dfoutput)

    ts = data['#Passengers']
    test_stationarity(ts)

### 3. Ways to make a time series stationary

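1. Trend: the level of the data changes over time, for example rising or falling.
2. Seasonality: the data vary within particular periods, for example holidays or events that cause unusual values.

    ts_log = np.log(ts)

1. Detecting and removing trend

There are usually three approaches:

• Aggregation: shorten the time axis by using the weekly/monthly/yearly mean over a period as the data value, which narrows the gaps between values in different periods.
• Smoothing: replace each value with the mean over a sliding window, to narrow the gaps between values.
• Polynomial fitting: fit a regression model to the existing data to make it smoother.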

Moving average

    moving_avg = ts_log.rolling(window=12).mean()
    plt.plot(ts_log, color='blue')
    plt.plot(moving_avg, color='red')

    ts_log_moving_avg_diff = ts_log - moving_avg
    ts_log_moving_avg_diff.dropna(inplace=True)
    test_stationarity(ts_log_moving_avg_diff)

    # halflife determines the decay factor alpha: alpha = 1 - exp(log(0.5) / halflife)
    expweighted_avg = ts_log.ewm(halflife=12).mean()
    ts_log_ewma_diff = ts_log - expweighted_avg
    test_stationarity(ts_log_ewma_diff)

2. Detecting and removing seasonality
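There are two approaches:

• Differencing: take the difference between the value at one time and the value a given number of lags earlier.
• Decomposition: model the trend and the seasonality separately, then remove them.

Differencing

    ts_log_diff = ts_log - ts_log.shift()
    ts_log_diff.dropna(inplace=True)
    test_stationarity(ts_log_diff)

Decomposing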

    # Decomposition separates the trend and the seasonal component of a time series:
    from statsmodels.tsa.seasonal import seasonal_decompose

    def decompose(timeseries):
        # The result has three parts: trend, seasonal and resid (residual)
        decomposition = seasonal_decompose(timeseries)

        trend = decomposition.trend
        seasonal = decomposition.seasonal
        residual = decomposition.resid

        plt.subplot(411)
        plt.plot(timeseries, label='Original')
        plt.legend(loc='best')
        plt.subplot(412)
        plt.plot(trend, label='Trend')
        plt.legend(loc='best')
        plt.subplot(413)
        plt.plot(seasonal, label='Seasonality')
        plt.legend(loc='best')
        plt.subplot(414)
        plt.plot(residual, label='Residuals')
        plt.legend(loc='best')
        plt.tight_layout()

        return trend, seasonal, residual

    # With trend and seasonality removed, only the residual is processed further
    trend, seasonal, residual = decompose(ts_log)
    residual.dropna(inplace=True)
    test_stationarity(residual)

### 4. Forecasting the time series

Step 1: estimate the p and q parameters of ARIMA(p, d, q) from the ACF and PACF plots

With first-order differencing (d = 1):

$$y_t = Y_t - Y_{t-1}$$

    # ACF and PACF plots:
    from statsmodels.tsa.stattools import acf, pacf
    lag_acf = acf(ts_log_diff, nlags=20)
    lag_pacf = pacf(ts_log_diff, nlags=20, method='ols')

    # Plot ACF:
    plt.subplot(121)
    plt.plot(lag_acf)
    plt.axhline(y=0, linestyle='--', color='gray')
    plt.axhline(y=-1.96/np.sqrt(len(ts_log_diff)), linestyle='--', color='gray')
    plt.axhline(y=1.96/np.sqrt(len(ts_log_diff)), linestyle='--', color='gray')
    plt.title('Autocorrelation Function')

    # Plot PACF:
    plt.subplot(122)
    plt.plot(lag_pacf)
    plt.axhline(y=0, linestyle='--', color='gray')
    plt.axhline(y=-1.96/np.sqrt(len(ts_log_diff)), linestyle='--', color='gray')
    plt.axhline(y=1.96/np.sqrt(len(ts_log_diff)), linestyle='--', color='gray')
    plt.title('Partial Autocorrelation Function')
    plt.tight_layout()

Step 2: with the estimates of p, d and q, build the ARIMA(p, d, q) model

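    from statsmodels.tsa.arima.model import ARIMA

    # The models are fitted to the differenced series ts_log_diff with d=0.
    # This corresponds to ARIMA(p, 1, q) with drift on ts_log, and it keeps
    # fittedvalues on the differenced scale that the RSS in the titles uses.

    # AR model, ARIMA(2, 1, 0) on ts_log:
    model = ARIMA(ts_log_diff, order=(2, 0, 0))
    results_AR = model.fit()
    plt.plot(ts_log_diff)
    plt.plot(results_AR.fittedvalues, color='red')
    plt.title('RSS: %.4f' % sum((results_AR.fittedvalues - ts_log_diff)**2))

    # MA model, ARIMA(0, 1, 2) on ts_log:
    model = ARIMA(ts_log_diff, order=(0, 0, 2))
    results_MA = model.fit()
    plt.plot(ts_log_diff)
    plt.plot(results_MA.fittedvalues, color='red')
    plt.title('RSS: %.4f' % sum((results_MA.fittedvalues - ts_log_diff)**2))

    # Combined model, ARIMA(2, 1, 2) on ts_log:
    model = ARIMA(ts_log_diff, order=(2, 0, 2))
    results_ARIMA = model.fit()
    plt.plot(ts_log_diff)
    plt.plot(results_ARIMA.fittedvalues, color='red')
    plt.title('RSS: %.4f' % sum((results_ARIMA.fittedvalues - ts_log_diff)**2))

Step 3: apply the model to the original data to make predictions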


    # ARIMA fits the first difference ts_log_diff: predictions_ARIMA_diff[i] is the
    # difference in ts_log between month i and month i-1. Because differencing has
    # a lag of one, the first month has no value.
    predictions_ARIMA_diff = pd.Series(results_ARIMA.fittedvalues, copy=True)
    # Cumulative sum of the differences: predictions_ARIMA_diff_cumsum[i] is the
    # difference in ts_log between month i and the first month.
    predictions_ARIMA_diff_cumsum = predictions_ARIMA_diff.cumsum()
    # Going back: ts_log_diff => ts_log => ts.
    # Take the first value of ts_log as the base for every date, then add each date's
    # cumulative difference; fill_value=0 covers the first month, which has no difference.
    predictions_ARIMA_log = pd.Series(ts_log.iloc[0], index=ts_log.index)
    predictions_ARIMA_log = predictions_ARIMA_log.add(predictions_ARIMA_diff_cumsum, fill_value=0)
    predictions_ARIMA = np.exp(predictions_ARIMA_log)
    plt.figure()
    plt.plot(ts)
    plt.plot(predictions_ARIMA)
    plt.title('RMSE: %.4f' % np.sqrt(sum((predictions_ARIMA - ts)**2) / len(ts)))

### 5. Summary

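(1) Obtain the time series data of the observed system.
(2) Plot the data and check whether the series is stationary; a non-stationary series is first differenced d times to make it stationary.
(3) After step 2 the series is stationary. Compute its autocorrelation function (ACF) and partial autocorrelation function (PACF), and from the two plots choose the best orders p and q.
(4) With d, q and p from the steps above, build the ARIMA model, then check the fitted model.

1. Determining whether a time series is stationary. Corresponds to step (1).
2. Making a time series stationary. Corresponds to step (2).
3. Forecasting a time series with an ARIMA model. Corresponds to steps (3) and (4).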

https://www.analyticsvidhya.com/blog/

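## GBDT (MART): An Introductory Tutorial on Iterative Decision Trees | Overview

November 29, 2012, 19:12:19

GBDT (Gradient Boosting Decision Tree), also known as MART (Multiple Additive Regression Tree), is an iterative decision tree algorithm. It consists of many decision trees, and the outputs of all the trees are summed to give the final answer. When it was first proposed it was regarded, together with SVM, as an algorithm with strong generalization ability. In recent years it has drawn further attention as a machine learning model for search ranking.

[1] Introductory tutorial on Boosting Decision Tree: http://www.schonlau.net/publication/05stata_boosting.pdf

[2] Introductory tutorial on LambdaMART for search ranking: http://research.microsoft.com/pubs/132652/MSR-TR-2010-82.pdf

GBDT rests on three concepts: Regression Decision Tree (DT), Gradient Boosting (GB), and Shrinkage (an important evolution of the algorithm; most source code today implements this version). Once these three concepts are clear, you understand how GBDT works. Understanding how it is used for search ranking additionally requires the RankNet concept. The sections below introduce the pieces one by one and then put the whole picture together.

A: a 14-year-old first-year high school student who shops little and often asks older students questions; predicted age A = 15 - 1 = 14

B: a 16-year-old third-year high school student who shops little and is often asked questions by younger students; predicted age B = 15 + 1 = 16

C: a 24-year-old recent graduate who shops a lot and often asks senior colleagues questions; predicted age C = 25 - 1 = 24

D: a 26-year-old employee with two years of work experience who shops a lot and is often asked questions by junior colleagues; predicted age D = 25 + 1 = 26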

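1) Since Figure 1 and Figure 2 give the same final result, why is GBDT needed at all?

Shrinkage holds that approaching the result in many small steps avoids overfitting more easily than approaching it quickly in a few large steps. That is, it does not fully trust any single residual tree: it assumes each tree has learned only a small part of the truth, adds only a small part of each tree's output when accumulating, and makes up the difference by learning more trees. In equations, without shrinkage:

y(i+1) = residual(y1~yi), where residual(y1~yi) = y_true - y(1 ~ i)

y(1 ~ i) = SUM(y1, ..., yi)

Shrinkage leaves the first equation unchanged and only changes the second to:

y(1 ~ i) = y(1 ~ i-1) + step * yi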
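The update rule above can be tried out on a toy version of the four-person age example. The sketch below is only an illustration under stated assumptions: the numeric feature values are invented, the weak learner is a one-split regression stump written by hand, and the names `fit_stump` and `boost` are hypothetical rather than taken from any GBDT library.

```python
# Toy data loosely based on persons A-D: (shopping amount, asks questions) -> age.
# The feature values are invented for illustration.
X = [(1.0, 1.0), (1.0, 0.0), (3.0, 1.0), (3.0, 0.0)]
y = [14.0, 16.0, 24.0, 26.0]


def fit_stump(X, residuals):
    """Fit a one-split regression tree to the residuals by least squares."""
    best = None
    for feature in range(len(X[0])):
        for threshold in sorted({row[feature] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[feature] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[feature] > threshold]
            if not left or not right:
                continue
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            sse = (sum((r - left_mean) ** 2 for r in left)
                   + sum((r - right_mean) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, feature, threshold, left_mean, right_mean)
    if best is None:  # no valid split: fall back to the mean residual
        mean = sum(residuals) / len(residuals)
        return lambda row: mean
    _, feature, threshold, left_mean, right_mean = best
    return lambda row: left_mean if row[feature] <= threshold else right_mean


def boost(X, y, n_trees, step):
    """Apply y(1~i) = y(1~i-1) + step * y_i, fitting tree i to the current residuals."""
    prediction = [0.0] * len(y)
    for _ in range(n_trees):
        residuals = [target - p for target, p in zip(y, prediction)]
        tree = fit_stump(X, residuals)
        prediction = [p + step * tree(row) for p, row in zip(prediction, X)]
    return prediction


print(boost(X, y, n_trees=2, step=1.0))    # no shrinkage: two large steps
print(boost(X, y, n_trees=200, step=0.1))  # shrinkage: many small steps
```

With `step=1.0` two trees reproduce the ages exactly (15 or 25 from the first tree, then plus or minus 1 from the second). With `step=0.1` each tree contributes only a tenth of its output, so many more trees are needed to get close.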

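# Introduction

Boosting classifiers are ensemble learning models. The basic idea is to combine hundreds or thousands of tree models with low classification accuracy into one model with high accuracy. The model is built iteratively, and each iteration produces a new tree. Many methods have been proposed for producing a sensible tree at each step; here we briefly introduce the Gradient Boosting Machine proposed by Friedman. When building each tree it uses the idea of gradient descent: starting from all the trees built so far, it takes one more step in the direction that minimizes the given objective function. With reasonable parameter settings, a certain number of trees is usually needed to reach satisfactory accuracy. When the dataset is large and complex, several thousand iterations may be needed, and if building one tree takes a few seconds, the running time of that many iterations leaves you plenty of time to sit and contemplate...

# Features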

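## 1. Basic features

    devtools::install_github('dmlc/xgboost',subdir='R-package')

    require(xgboost)
    data(agaricus.train, package='xgboost')
    data(agaricus.test, package='xgboost')
    train <- agaricus.train
    test <- agaricus.test

    class(train$data)

    "dgCMatrix"
    attr(,"package")
    "Matrix"

Don't worry: xgboost accepts sparse matrices as input. The following command trains the model:

    bst <- xgboost(data = train$data, label = train$label, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic")

    train-error:0.046522
    train-error:0.022263

We ran two iterations, and the function printed the model's error after each one. The data here is a sparse matrix; ordinary dense matrices are supported as well. If the data file is too large to read into R, we can set the parameter data = 'path_to_file' so that xgboost reads and analyzes the data directly from disk. Currently, files in libsvm format can be read directly from disk.

Prediction takes a single line:

    pred <- predict(bst, test$data)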


cv.res <- xgb.cv(data = train$data, label = train$label, max.depth = 2, eta = 1, nround = 2, objective = "binary:logistic",
nfold = 5)

 train-error:0.046522+0.001102   test-error:0.046523+0.004410
 train-error:0.022264+0.000864   test-error:0.022266+0.003450

cv.res

   train.error.mean train.error.std test.error.mean test.error.std
1:         0.046522        0.001102        0.046523       0.004410
2:         0.022264        0.000864        0.022266       0.003450


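## 2. Fast and accurate

1. With OpenMP, xgboost automatically uses multiple CPU cores on a single machine for parallel computation. Note that Clang on Mac has poor OpenMP support, so by default it runs on a single core there.
2. xgboost defines its own data matrix class, DMatrix, which preprocesses the data once when training starts and so makes every later iteration more efficient.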

Time (in secs) 761.48 450.22 102.41 44.18 34.04

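## 3. Advanced features

1. As long as the gradient and the Hessian of the objective function can be computed, users can define their own objective function for training.
2. Users can define their own evaluation metric for cross-validation, for example RMSE or RMSLE for regression, and AUC, classification error rate or F1-score for classification, or even the unusual AMS metric from the Higgs boson competition.
3. Cross-validation can return the model's predictions on each fold when it serves as the held-out set, which makes it easy to build ensemble models.
4. Users can run 1000 iterations, inspect the model's predictive performance at that point, and then continue for another 1000 iterations; the final model is equivalent to one trained for 2000 iterations in one go.
6. Variable importance can be computed and the trees can be plotted.
7. A linear model can be used in place of tree models, giving linear or logistic regression with L1+L2 penalties.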
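To make the first item concrete, here is a minimal Python sketch of the two quantities a custom objective has to supply, using the logistic loss as the example. It deliberately does not call xgboost; the function name `logistic_grad_hess` and its argument layout are assumptions made for illustration, and the exact callback signature xgboost expects should be taken from its documentation.

```python
import numpy as np


def logistic_grad_hess(raw_scores, labels):
    """First and second derivatives of the log loss with respect to the raw score."""
    raw_scores = np.asarray(raw_scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    prob = 1.0 / (1.0 + np.exp(-raw_scores))  # sigmoid turns scores into probabilities
    grad = prob - labels                      # gradient of the log loss
    hess = prob * (1.0 - prob)                # second derivative (diagonal of the Hessian)
    return grad, hess


grad, hess = logistic_grad_hess([0.0, 2.0, -1.0], [1, 1, 0])
print(grad, hess)
```

Any loss for which these two arrays can be computed per training example fits the same pattern.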

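# Closing remarks

xgboost has many features and its parameters are fairly intricate. Readers who want a fuller picture after getting started can consult the project wiki. You are welcome to discuss, and to raise questions and suggestions in the project's issue tracker. We also invite interested readers to contribute code and help make xgboost a better tool.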

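## What is the ARIMA model

ARIMA stands for Autoregressive Integrated Moving Average model. It is also written ARIMA(p,d,q) and is one of the most common statistical models used for time series forecasting.

### 1. Strengths and weaknesses of ARIMA

1. It requires the series to be stationary, or to be stationary after differencing.
2. In essence it can capture only linear relationships, not nonlinear ones.

### 2. How to tell whether a time series is stationary

1. Stationary data has no trend and no seasonality; that is, it fluctuates around its mean with a constant amplitude over time, and its variance stays close to one stable value.
2. The Dickey-Fuller test can be used as a hypothesis test. (Covered in a separate article.)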

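### 3. Parameters and mathematical form of ARIMA

An ARIMA model has three parameters: p, d and q.

• p: the number of lags of the series itself used in the forecasting model, also called the AR (Auto-Regressive) term.
• d: the order of differencing needed to make the series stationary, also called the Integrated term.
• q: the number of lags of the forecast error used in the forecasting model, also called the MA (Moving Average) term.

With Y the original series and y the differenced series:

$$
\begin{aligned}
\text{if } d=0:\quad & y_t = Y_t \\
\text{if } d=1:\quad & y_t = Y_t - Y_{t-1} \\
\text{if } d=2:\quad & y_t = (Y_t - Y_{t-1}) - (Y_{t-1} - Y_{t-2}) = Y_t - 2Y_{t-1} + Y_{t-2}
\end{aligned}
$$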
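The differencing rule for d = 0, 1, 2 can be checked numerically. The following is a small sketch using NumPy; the helper name `difference` is hypothetical and the input values are made up.

```python
import numpy as np


def difference(values, d=1):
    """Apply first-order differencing d times: y_t = Y_t - Y_{t-1}, repeated."""
    out = np.asarray(values, dtype=float)
    for _ in range(d):
        out = out[1:] - out[:-1]
    return out


Y = [1.0, 4.0, 9.0, 16.0, 25.0]
print(difference(Y, d=0))  # the series itself
print(difference(Y, d=1))  # Y_t - Y_{t-1}
print(difference(Y, d=2))  # Y_t - 2*Y_{t-1} + Y_{t-2}
```

Each round of differencing shortens the series by one value, which is why the first d observations have no differenced counterpart.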

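The ARIMA forecasting model can be described as:

Predicted value of Y = a constant c and/or a weighted sum of one or more recent values of Y and/or one or more recent forecast errors.

In mathematical form:

$$\hat{y}_t = \mu + \phi_1 y_{t-1} + \dots + \phi_p y_{t-p} + \theta_1 e_{t-1} + \dots + \theta_q e_{t-q}$$

where $\phi$ denotes the AR coefficients and $\theta$ the MA coefficients. The sign of the MA terms is a matter of convention; some of the special cases below write them with a minus sign.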
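As a rough illustration of the forecasting equation, the sketch below computes a single one-step prediction from given coefficients and recent history. The function name `arma_one_step` and all numbers are invented for the example, and the MA terms use the plus-sign convention of the general formula.

```python
def arma_one_step(mu, phi, theta, y_hist, e_hist):
    """One-step prediction: mu + sum(phi_i * y_{t-i}) + sum(theta_j * e_{t-j}).

    y_hist and e_hist list past (differenced) values and forecast errors,
    oldest first, so y_hist[-1] is y_{t-1} and e_hist[-1] is e_{t-1}.
    """
    ar_part = sum(p * y_hist[-i] for i, p in enumerate(phi, start=1))
    ma_part = sum(q * e_hist[-j] for j, q in enumerate(theta, start=1))
    return mu + ar_part + ma_part


# p = 2 and q = 1 with made-up coefficients and history
prediction = arma_one_step(mu=1.0, phi=[0.5, 0.2], theta=[0.3],
                           y_hist=[2.0, 4.0], e_hist=[1.0])
print(prediction)
```

Here the prediction is 1.0 + 0.5 * 4.0 + 0.2 * 2.0 + 0.3 * 1.0.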

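### 4. Some special cases of the ARIMA model

#### 1. ARIMA(0,1,0) = random walk:

$$\hat{Y}_t = \mu + Y_{t-1}$$

#### 2. ARIMA(1,0,0) = first-order autoregressive model:

p=1, d=0, q=0. The series is stationary and autocorrelated: the value of Y at one time depends only on the value of Y at the previous time.

$$\hat{Y}_t = \mu + \phi_1 Y_{t-1}$$

where $\phi_1 \in [-1, 1]$ is a slope coefficient.

#### 3. ARIMA(1,1,0) = differenced first-order autoregressive model:

p=1, d=1, q=0. The series is stationary and autoregressive after first-order differencing: the difference (y) at one time depends only on the difference at the previous time.

$$\hat{y}_t = \mu + \phi_1 y_{t-1}$$

Combined with the definition of the first difference, this can also be written as

$$\hat{Y}_t - Y_{t-1} = \mu + \phi_1 (Y_{t-1} - Y_{t-2})$$

or

$$\hat{Y}_t = \mu + Y_{t-1} + \phi_1 (Y_{t-1} - Y_{t-2})$$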

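#### 4. ARIMA(0,1,1) = simple exponential smoothing with growth.

p=0, d=1, q=1. The series is stationary after first-order differencing and follows a moving-average model: the difference of the estimate at one time depends on the forecast error at the previous time.

$$\hat{y}_t = \mu + \alpha_1 e_{t-1}$$

Note that the difference $y_t$ for q=1 is not the same as the difference $y_t$ for p=1. Here $\hat{y}_t = \hat{Y}_t - \hat{Y}_{t-1}$ and $e_{t-1} = Y_{t-1} - \hat{Y}_{t-1}$. Setting $\theta_1 = 1 - \alpha_1$, this can also be written as

$$\hat{Y}_t = \mu + \hat{Y}_{t-1} + \alpha_1 (Y_{t-1} - \hat{Y}_{t-1}) = \mu + Y_{t-1} - \theta_1 e_{t-1}$$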
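The equivalence between the exponential-smoothing form and the moving-average form can be verified numerically. The sketch below is an illustration only: the series values are made up, the first forecast is initialized to the first observation (an assumption, since the text does not specify a starting value), and both function names are hypothetical.

```python
def smoothing_form(Y, alpha, mu=0.0):
    """Forecast_t = mu + Forecast_{t-1} + alpha * (Y_{t-1} - Forecast_{t-1})."""
    forecasts = [Y[0]]  # assumed starting value
    for t in range(1, len(Y)):
        previous = forecasts[-1]
        forecasts.append(mu + previous + alpha * (Y[t - 1] - previous))
    return forecasts


def moving_average_form(Y, theta, mu=0.0):
    """Forecast_t = mu + Y_{t-1} - theta * e_{t-1}, with e_{t-1} = Y_{t-1} - Forecast_{t-1}."""
    forecasts = [Y[0]]  # same assumed starting value
    for t in range(1, len(Y)):
        error = Y[t - 1] - forecasts[-1]
        forecasts.append(mu + Y[t - 1] - theta * error)
    return forecasts


Y = [10.0, 12.0, 11.0, 15.0, 14.0]
alpha = 0.3
a = smoothing_form(Y, alpha, mu=0.5)
b = moving_average_form(Y, theta=1 - alpha, mu=0.5)
print(a)
print(b)
```

Both lists agree up to floating-point rounding, which is the identity obtained by setting theta_1 = 1 - alpha_1.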

#### 5. ARIMA(2,1,2)

$$\hat{y}_t = \mu + \phi_1 y_{t-1} + \phi_2 y_{t-2} - \theta_1 e_{t-1} - \theta_2 e_{t-2}$$

which can also be written as

$$\hat{Y}_t = \mu + \phi_1 (Y_{t-1} - Y_{t-2}) + \phi_2 (Y_{t-2} - Y_{t-3}) - \theta_1 (Y_{t-1} - \hat{Y}_{t-1}) - \theta_2 (Y_{t-2} - \hat{Y}_{t-2})$$

#### 6. ARIMA(2,2,2)

$$\hat{y}_t = \mu + \phi_1 y_{t-1} + \phi_2 y_{t-2} - \theta_1 e_{t-1} - \theta_2 e_{t-2}$$

$$\hat{Y}_t = \mu + \phi_1 (Y_{t-1} - 2Y_{t-2} + Y_{t-3}) + \phi_2 (Y_{t-2} - 2Y_{t-3} + Y_{t-4}) - \theta_1 (Y_{t-1} - \hat{Y}_{t-1}) - \theta_2 (Y_{t-2} - \hat{Y}_{t-2})$$

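#### 7. Basic steps of ARIMA modeling

1. Obtain the time series data of the observed system.
2. Plot the data and check whether the series is stationary; a non-stationary series is first differenced d times to make it stationary.
3. After step 2 the series is stationary. Compute its autocorrelation function (ACF) and partial autocorrelation function (PACF), and from the two plots choose the best orders p and q.
4. With d, q and p from the steps above, build the ARIMA model, then check the fitted model.

A concrete example is given in a separate article.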

# Your First Machine Learning Project

Do you want to do machine learning using Python, but you’re having trouble getting started?

In this post, you will complete your first machine learning project using Python.

In this step-by-step tutorial you will:

1. Download and install Python SciPy and get the most useful package for machine learning in Python.
2. Load a dataset and understand its structure using statistical summaries and data visualization.
3. Create 6 machine learning models, pick the best and build confidence that the accuracy is reliable.

If you are a machine learning beginner and looking to finally get started using Python, this tutorial was designed for you.

Let’s get started!

• Update Jan/2017: Updated to reflect changes to the scikit-learn API in version 0.18.
• Update Mar/2017: Added links to help set up your Python environment.

## How Do You Start Machine Learning in Python?

The best way to learn machine learning is by designing and completing small projects.

### Python Can Be Intimidating When Getting Started

Python is a popular and powerful interpreted language. Unlike R, Python is a complete language and platform that you can use for both research and development and developing production systems.

There are also a lot of modules and libraries to choose from, providing multiple ways to do each task. It can feel overwhelming.

The best way to get started using Python for machine learning is to complete a project.

• It will force you to install and start the Python interpreter (at the very least).
• It will give you a bird’s eye view of how to step through a small project.
• It will give you confidence, maybe to go on to your own small projects.

### Beginners Need A Small End-to-End Project

Books and courses are frustrating. They give you lots of recipes and snippets, but you never get to see how they all fit together.

When you are applying machine learning to your own datasets, you are working on a project.

A machine learning project may not be linear, but it has a number of well known steps:

1. Define Problem.
2. Prepare Data.
3. Evaluate Algorithms.
4. Improve Results.
5. Present Results.

The best way to really come to terms with a new platform or tool is to work through a machine learning project end-to-end and cover the key steps. Namely, from loading data, summarizing data, evaluating algorithms and making some predictions.

If you can do that, you have a template that you can use on dataset after dataset. You can fill in the gaps such as further data preparation and improving result tasks later, once you have more confidence.

### Hello World of Machine Learning

The best small project to start with on a new tool is the classification of iris flowers (e.g. the iris dataset).

This is a good project because it is so well understood.

• Attributes are numeric so you have to figure out how to load and handle data.
• It is a classification problem, allowing you to practice with perhaps an easier type of supervised learning algorithm.
• It is a multi-class classification problem (multi-nominal) that may require some specialized handling.
• It only has 4 attributes and 150 rows, meaning it is small and easily fits into memory (and a screen or A4 page).
• All of the numeric attributes are in the same units and the same scale, not requiring any special scaling or transforms to get started.

Let’s get started with your hello world machine learning project in Python.

## Machine Learning in Python: Step-By-Step Tutorial (start here)

In this section, we are going to work through a small machine learning project end-to-end.

Here is an overview of what we are going to cover:

1. Installing the Python and SciPy platform.
2. Loading the dataset.
3. Summarizing the dataset.
4. Visualizing the dataset.
5. Evaluating some algorithms.
6. Making some predictions.

Take your time. Work through each step.

Try to type in the commands yourself or copy-and-paste the commands to speed things up.

If you have any questions at all, please leave a comment at the bottom of the post.


Get the Python and SciPy platform installed on your system if it is not already.

I do not want to cover this in great detail, because others already have. This is already pretty straightforward, especially if you are a developer. If you do need help, ask a question in the comments.

### 1.1 Install SciPy Libraries

This tutorial assumes Python version 2.7 or 3.5.

There are 5 key libraries that you will need to install. Below is a list of the Python SciPy libraries required for this tutorial:

• scipy
• numpy
• matplotlib
• pandas
• sklearn

There are many ways to install these libraries. My best advice is to pick one method then be consistent in installing each library.

The scipy installation page provides excellent instructions for installing the above libraries on multiple different platforms, such as Linux, mac OS X and Windows. If you have any doubts or questions, refer to this guide, it has been followed by thousands of people.

• On Mac OS X, you can use macports to install Python 2.7 and these libraries. For more information on macports, see the homepage.
• On Linux you can use your package manager, such as yum on Fedora to install RPMs.

If you are on Windows or you are not confident, I would recommend installing the free version of Anaconda that includes everything you need.

Note: This tutorial assumes you have scikit-learn version 0.18 or higher installed.


### 1.2 Start Python and Check Versions

It is a good idea to make sure your Python environment was installed successfully and is working as expected.

The script below will help you test out your environment. It imports each library required in this tutorial and prints the version.

Open a command line and start the python interpreter:

I recommend working directly in the interpreter or writing your scripts and running them on the command line rather than big editors and IDEs. Keep things simple and focus on the machine learning not the toolchain.

Type or copy and paste the following script:
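A script along the following lines would do the check. It is a sketch, not necessarily the tutorial's exact script: the helper name `library_versions` is made up, and missing libraries are reported instead of raising an error so that the whole list is always printed.

```python
import importlib
import sys


def library_versions(names=("scipy", "numpy", "matplotlib", "pandas", "sklearn")):
    """Return a dict mapping each library name to its version string."""
    versions = {"python": sys.version.split()[0]}
    for name in names:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = "not installed"
    return versions
```

Calling `library_versions()` in the interpreter and printing the items gives one line per library; any entry reading "not installed" needs fixing before moving on.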

Here is the output I get on my OS X workstation:

Compare the above output to your versions.

Ideally, your versions should match or be more recent. The APIs do not change quickly, so do not be too concerned if you are a few versions behind. Everything in this tutorial will very likely still work for you.

If you get an error, stop. Now is the time to fix it.

If you cannot run the above script cleanly you will not be able to complete this tutorial.

My best advice is to Google search for your error message or post a question on Stack Exchange.

## 2. Load The Data

We are going to use the iris flowers dataset. This dataset is famous because it is used as the “hello world” dataset in machine learning and statistics by pretty much everyone.

The dataset contains 150 observations of iris flowers. There are four columns of measurements of the flowers in centimeters. The fifth column is the species of the flower observed. All observed flowers belong to one of three species.

In this step we are going to load the iris data from CSV file URL.

### 2.1 Import libraries

First, let’s import all of the modules, functions and objects we are going to use in this tutorial.

Everything should load without error. If you have an error, stop. You need a working SciPy environment before continuing. See the advice above about setting up your environment.

### 2.2 Load Dataset

We can load the data directly from the UCI Machine Learning repository.

We are using pandas to load the data. We will also use pandas next to explore the data both with descriptive statistics and data visualization.

Note that we are specifying the names of each column when loading the data. This will help later when we explore the data.

The dataset should load without incident.

If you do have network problems, you can download the iris.data file into your working directory and load it using the same method, changing URL to the local file name.
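A minimal sketch of the loading step with pandas is shown here. The column names are assumptions chosen for readability, and an in-memory sample of three rows in the iris.data layout stands in for the UCI URL or a local file so that the example runs without network access.

```python
from io import StringIO

import pandas as pd

# Assumed column names; the data file itself has no header row.
COLUMN_NAMES = ["sepal-length", "sepal-width", "petal-length", "petal-width", "class"]


def load_iris(source):
    """Read a headerless iris CSV from a path, URL or file-like object."""
    return pd.read_csv(source, names=COLUMN_NAMES)


# Three sample rows in the same layout as iris.data
sample = StringIO(
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "7.0,3.2,4.7,1.4,Iris-versicolor\n"
    "6.3,3.3,6.0,2.5,Iris-virginica\n"
)
dataset = load_iris(sample)
print(dataset.shape)
print(dataset.head())
```

Passing the file name `iris.data` or the dataset URL to `load_iris` instead of `sample` loads the full data the same way.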

## 3. Summarize the Dataset

Now it is time to take a look at the data.

In this step we are going to take a look at the data a few different ways:

1. Dimensions of the dataset.
2. Peek at the data itself.
3. Statistical summary of all attributes.
4. Breakdown of the data by the class variable.

Don’t worry, each look at the data is one command. These are useful commands that you can use again and again on future projects.

### 3.1 Dimensions of Dataset

We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.

You should see 150 instances and 5 attributes:

### 3.2 Peek at the Data

It is also always a good idea to actually eyeball your data.

You should see the first 20 rows of the data:

### 3.3 Statistical Summary

Now we can take a look at a summary of each attribute.

This includes the count, mean, the min and max values as well as some percentiles.

We can see that all of the numerical values have the same scale (centimeters) and similar ranges between 0 and 8 centimeters.

### 3.4 Class Distribution

Let’s now take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count.

We can see that each class has the same number of instances (50 or 33% of the dataset).

## 4. Data Visualization

We now have a basic idea about the data. We need to extend that with some visualizations.

We are going to look at two types of plots:

1. Univariate plots to better understand each attribute.
2. Multivariate plots to better understand the relationships between attributes.

### 4.1 Univariate Plots

We start with some univariate plots, that is, plots of each individual variable.

Given that the input variables are numeric, we can create box and whisker plots of each.

This gives us a much clearer idea of the distribution of the input attributes:

We can also create a histogram of each input variable to get an idea of the distribution.

It looks like perhaps two of the input variables have a Gaussian distribution. This is useful to note as we can use algorithms that can exploit this assumption.

### 4.2 Multivariate Plots

Now we can look at the interactions between the variables.

First, let’s look at scatterplots of all pairs of attributes. This can be helpful to spot structured relationships between input variables.

Note the diagonal grouping of some pairs of attributes. This suggests a high correlation and a predictable relationship.

## 5. Evaluate Some Algorithms

Now it is time to create some models of the data and estimate their accuracy on unseen data.

Here is what we are going to cover in this step:

1. Separate out a validation dataset.
2. Set-up the test harness to use 10-fold cross validation.
3. Build 6 different models to predict species from flower measurements.
4. Select the best model.

### 5.1 Create a Validation Dataset

We need to know that the model we created is any good.

Later, we will use statistical methods to estimate the accuracy of the models that we create on unseen data. We also want a more concrete estimate of the accuracy of the best model on unseen data by evaluating it on actual unseen data.

That is, we are going to hold back some data that the algorithms will not get to see and we will use this data to get a second and independent idea of how accurate the best model might actually be.

We will split the loaded dataset into two, 80% of which we will use to train our models and 20% that we will hold back as a validation dataset.
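The 80/20 split can be sketched in plain Python as below. This is an illustration of the idea rather than the tutorial's own code: the function name `train_validation_split`, the seed value and the toy data are assumptions, and a library routine such as scikit-learn's `train_test_split` would normally be used instead.

```python
import random


def train_validation_split(X, Y, validation_size=0.20, seed=7):
    """Shuffle row indices with a fixed seed, then hold back a fraction for validation."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_validation = int(round(len(X) * validation_size))
    validation_idx = indices[:n_validation]
    train_idx = indices[n_validation:]
    X_train = [X[i] for i in train_idx]
    X_validation = [X[i] for i in validation_idx]
    Y_train = [Y[i] for i in train_idx]
    Y_validation = [Y[i] for i in validation_idx]
    return X_train, X_validation, Y_train, Y_validation


# Toy stand-in for the 150-row iris data: one feature per row plus a label
X = [[float(i)] for i in range(150)]
Y = [i % 3 for i in range(150)]
X_train, X_validation, Y_train, Y_validation = train_validation_split(X, Y)
print(len(X_train), len(X_validation))
```

Fixing the seed makes the split reproducible, so repeated runs hold back the same rows.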

You now have training data in X_train and Y_train for preparing models, and X_validation and Y_validation sets that we can use later.

### 5.2 Test Harness

We will use 10-fold cross validation to estimate accuracy.

This will split our dataset into 10 parts, train on 9 and test on 1 and repeat for all combinations of train-test splits.
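How the 10 parts are formed can be sketched with a small index generator. The name `k_fold_indices` is hypothetical, and for simplicity the folds are contiguous blocks without shuffling; a real test harness would typically shuffle first with a fixed seed.

```python
def k_fold_indices(n_samples, n_splits=10):
    """Yield (train_indices, test_indices) so that each sample is tested exactly once."""
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)  # spread any remainder over the first folds
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


folds = list(k_fold_indices(120, n_splits=10))
for train, test in folds[:2]:
    print(len(train), len(test))
```

With 120 training rows each fold trains on 108 rows and tests on the remaining 12, and the ten accuracy values are then averaged.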

The specific random seed does not matter.

We are using the metric of 'accuracy' to evaluate models. This is the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage (e.g. 95% accurate). We will be using the scoring variable when we build and evaluate each model next.
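The accuracy metric itself is a one-liner. The sketch below uses a made-up helper name and made-up labels, and returns a fraction that can be multiplied by 100 for a percentage.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for truth, guess in zip(y_true, y_pred) if truth == guess)
    return correct / len(y_true)


score = accuracy(["setosa", "virginica", "versicolor", "setosa"],
                 ["setosa", "virginica", "setosa", "setosa"])
print("%.0f%% accurate" % (score * 100))
```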

### 5.3 Build Models

We don’t know which algorithms would be good on this problem or what configurations to use. We get an idea from the plots that some of the classes are partially linearly separable in some dimensions, so we are expecting generally good results.

Let’s evaluate 6 different algorithms:

• Logistic Regression (LR)
• Linear Discriminant Analysis (LDA)
• K-Nearest Neighbors (KNN).
• Classification and Regression Trees (CART).
• Gaussian Naive Bayes (NB).
• Support Vector Machines (SVM).

This is a good mixture of simple linear (LR and LDA) and nonlinear (KNN, CART, NB and SVM) algorithms. We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits. It ensures the results are directly comparable.

Let’s build and evaluate our six models:

### 5.4 Select Best Model

We now have 6 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.

Running the example above, we get the following raw results:

Note, your results may differ.

We can see that it looks like KNN has the largest estimated accuracy score.

We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (10 fold cross validation).

You can see that the box and whisker plots are squashed at the top of the range, with many samples achieving 100% accuracy.

## 6. Make Predictions

The KNN algorithm was the most accurate model that we tested. Now we want to get an idea of the accuracy of the model on our validation set.

This will give us an independent final check on the accuracy of the best model. It is valuable to keep a validation set just in case you made a slip during training, such as overfitting to the training set or a data leak. Both will result in an overly optimistic result.

We can run the KNN model directly on the validation set and summarize the results as a final accuracy score, a confusion matrix and a classification report.

We can see that the accuracy is 0.9 or 90%. The confusion matrix provides an indication of the three errors made. Finally, the classification report provides a breakdown of each class by precision, recall, f1-score and support showing excellent results (granted the validation dataset was small).
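To show how a confusion matrix relates to the accuracy figure, here is a small pure-Python sketch. The helper name and the label lists are invented for illustration; the counts were chosen so that 3 of 30 validation predictions are wrong, which matches the 90% mentioned above but is not the tutorial's actual output.

```python
def confusion_matrix(y_true, y_pred, labels):
    """Rows are true classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, guess in zip(y_true, y_pred):
        matrix[index[truth]][index[guess]] += 1
    return matrix


labels = ["setosa", "versicolor", "virginica"]
# Invented validation labels: 30 samples, 3 of them misclassified
y_true = ["setosa"] * 7 + ["versicolor"] * 12 + ["virginica"] * 11
y_pred = (["setosa"] * 7
          + ["versicolor"] * 11 + ["virginica"] * 1
          + ["virginica"] * 9 + ["versicolor"] * 2)

matrix = confusion_matrix(y_true, y_pred, labels)
for row in matrix:
    print(row)
correct = sum(matrix[i][i] for i in range(len(labels)))
print(correct / len(y_true))
```

The diagonal holds the correct predictions, so accuracy is the diagonal sum divided by the number of samples, and every off-diagonal entry is an error.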


## You Can Do Machine Learning in Python

Work through the tutorial above. It will take you 5-to-10 minutes, max!

You do not need to understand everything (at least not right now). Your goal is to run through the tutorial end-to-end and get a result. List down your questions as you go. Make heavy use of the help(“FunctionName”) help syntax in Python to learn about all of the functions that you’re using.

You do not need to know how the algorithms work. It is important to know about the limitations and how to configure machine learning algorithms. But learning about algorithms can come later. You need to build up this algorithm knowledge slowly over a long period of time. Today, start off by getting comfortable with the platform.

You do not need to be a Python programmer. The syntax of the Python language can be intuitive if you are new to it. Just like other languages, focus on function calls (e.g. function()) and assignments (e.g. a = “b”). This will get you most of the way. You are a developer, you know how to pick up the basics of a language real fast. Just get started and dive into the details later.

You do not need to be a machine learning expert. You can learn about the benefits and limitations of various algorithms later, and there are plenty of posts that you can read later to brush up on the steps of a machine learning project and the importance of evaluating accuracy using cross validation.

What about other steps in a machine learning project? We did not cover all of the steps in a machine learning project because this is your first project and we need to focus on the key steps. Namely, loading data, looking at the data, evaluating some algorithms and making some predictions. In later tutorials we can look at other data preparation and result improvement tasks.

## Summary

In this post, you discovered step-by-step how to complete your first machine learning project in Python.

You discovered that completing a small end-to-end project from loading the data to making predictions is the best way to get familiar with a new platform.

### Your Next Step

Did you work through the tutorial?

1. Work through the above tutorial.
2. List any questions you have.
3. Search or research the answers.
4. Remember, you can use the help(“FunctionName”) in Python to get help on any function.

Do you have a question?
Post it in the comments below.

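## How to Set Up a Python Environment for Machine Learning and Deep Learning with Anaconda

• Update 2017/03: Note that you need Theano or TensorFlow to use Keras for deep learning.

## Tutorial overview

1. Download Anaconda
2. Install Anaconda
3. Start and update Anaconda
4. Update the scikit-learn library
5. Install deep learning libraries

## 1. Download Anaconda

Anaconda is a free and easy-to-use scientific Python environment.

3. Choose the download that suits your platform (Windows, OSX or Linux):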

• Choose Python 3.5
• Choose the Graphical Installer: Anaconda3-4.2.0-MacOSX-x86_64.pkg

## 2. Install Anaconda

• 1. Double-click the downloaded file.
• 2. Follow the installation wizard.

## 3. Start and update Anaconda

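Anaconda comes with a suite of graphical tools called Anaconda Navigator. You can open Anaconda Navigator from your application launcher.

Conda is fast and simple, it does not hide error messages, and it lets you quickly confirm that your environment is installed and working.

• 1. Open a terminal (command line window).
• 2. Confirm that conda is installed correctly by typing:

    conda -V

    conda 4.2.9

• 3. Confirm that Python is installed correctly by typing:

    python -V

    Python 3.5.2 :: Anaconda 4.2.0 (x86_64)

• 4. To make sure your conda environment is up to date, type:

    conda update conda
    conda update anaconda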

• 5. Confirm your SciPy environment.

    # scipy
    import scipy
    print('scipy: %s' % scipy.__version__)
    # numpy
    import numpy
    print('numpy: %s' % numpy.__version__)
    # matplotlib
    import matplotlib
    print('matplotlib: %s' % matplotlib.__version__)
    # pandas
    import pandas
    print('pandas: %s' % pandas.__version__)
    # statsmodels
    import statsmodels
    print('statsmodels: %s' % statsmodels.__version__)
    # scikit-learn
    import sklearn
    print('sklearn: %s' % sklearn.__version__)

Run the script, saved as versions.py:

    python versions.py

    scipy: 0.18.1
    numpy: 1.11.1
    matplotlib: 1.5.3
    pandas: 0.18.1
    statsmodels: 0.6.1
    sklearn: 0.17.1

## 4. Update the scikit-learn library

• 1. Update scikit-learn to the latest version.

    conda update scikit-learn

Or, to install a specific version:

    conda install -c anaconda scikit-learn=0.18.1

    scipy: 0.18.1
    numpy: 1.11.3
    matplotlib: 1.5.3
    pandas: 0.18.1
    statsmodels: 0.6.1
    sklearn: 0.18.1

## 5. Install deep learning libraries

• 1. Install the Theano deep learning library by typing:

    conda install theano

• 2. Install the TensorFlow deep learning library (except on Windows) by typing:

    conda install -c conda-forge tensorflow

• 3. Install Keras by typing:

    pip install keras

• 4. Confirm that your deep learning environment is installed and working.

    # theano
    import theano
    print('theano: %s' % theano.__version__)
    # tensorflow
    import tensorflow
    print('tensorflow: %s' % tensorflow.__version__)
    # keras
    import keras
    print('keras: %s' % keras.__version__)

Run the script, saved as deep_versions.py:

    python deep_versions.py

    theano: 0.8.2.dev-901275534cbfe3fbbe290ce85d1abf8bb9a5b203
    tensorflow: 0.12.1
    Using TensorFlow backend.
    keras: 1.2.1

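# 1. Small UAV data

## 1.1 Images

1. Forward and side overlap between images is irregular.
2. The image format is small and the number of photos is large.
3. Image tilt angles are large and the tilt direction follows no pattern.
4. The terrain in the survey area is rugged with pronounced elevation changes, so scale differs greatly between images and the swing angle is large.
5. The images are noticeably distorted.

Achieving automatic aerial triangulation under these conditions is the main challenge for existing digital photogrammetry systems, and in most cases it leads to wrong results.

## 1.2 Video data

1. High-frequency vibration during flight makes the video shot by the UAV shaky.
2. The video frames captured by the camera are strongly distorted, which causes serious problems for subsequent processing.
3. Using 1080P HD video for target recognition and tracking means a very large amount of data to process.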

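# 2. What electronic image stabilization does for small UAVs

1. It removes jitter and shake from the video, making transitions between frames smoother and reducing visual fatigue.
2. It improves the accuracy of target recognition and tracking.

# 3. Electronic image stabilization

Depending on how the processing is done, electronic image stabilization is either real-time or post-processing.
Real-time stabilization processes the video captured by the camera with a real-time algorithm while the UAV is in flight; post-processing stabilization processes the video after the flight has ended.

## 3.1 Real-time electronic image stabilization

### 3.1.1 Basic principle

Electronic image stabilization differs from image restoration in image processing. Image restoration deals with individual blurred frames, whereas electronic image stabilization stabilizes an image sequence. The instability of a sequence is the instability seen on the monitor caused by changes from frame to frame; a basic precondition is that every frame in the sequence is sharp.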

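### 3.1.2 Common methods

Common real-time methods include gray projection, image-feature-based methods, image-block-based methods and background-difference-based methods.
The gray projection method is described in more detail below: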
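As a rough illustration of the gray projection idea, the sketch below estimates the vertical and horizontal shift between two frames from their row and column gray-level sums. It is one common formulation under simplifying assumptions (pure translation, no rotation or zoom), not necessarily the exact variant the text has in mind, and the function names are made up for the example.

```python
import numpy as np


def projections(frame):
    """Row and column gray projections of a frame, with the mean removed."""
    frame = np.asarray(frame, dtype=float)
    rows = frame.sum(axis=1)  # one value per image row
    cols = frame.sum(axis=0)  # one value per image column
    return rows - rows.mean(), cols - cols.mean()


def best_shift(reference, current, max_shift):
    """Shift s that minimizes the mean squared difference between current[i + s] and reference[i]."""
    n = len(reference)
    best, best_cost = 0, np.inf
    for s in range(-max_shift, max_shift + 1):
        lo, hi = max(0, -s), min(n, n - s)  # overlap of the two curves for this shift
        diff = current[lo + s:hi + s] - reference[lo:hi]
        cost = np.mean(diff ** 2)
        if cost < best_cost:
            best, best_cost = s, cost
    return best


def estimate_motion(reference_frame, current_frame, max_shift=8):
    """Return (dy, dx): how far the current frame moved relative to the reference."""
    ref_rows, ref_cols = projections(reference_frame)
    cur_rows, cur_cols = projections(current_frame)
    return best_shift(ref_rows, cur_rows, max_shift), best_shift(ref_cols, cur_cols, max_shift)


rng = np.random.default_rng(0)
frame = rng.random((64, 80))
moved = np.roll(frame, shift=(3, -5), axis=(0, 1))  # simulate a shaken frame
print(estimate_motion(frame, moved))
```

Shifting the current frame back by the estimated (dy, dx) compensates the motion. Reducing each frame to two one-dimensional curves is what keeps the method cheap enough for real-time use.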
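Gray projection is decomposed into: [formula missing] and [formula missing]

## 3.2 Post-processing electronic image stabilization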

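Post-processing stabilization mainly optimizes the motion trajectories (local motion) of all frames of the video to obtain an accurate global motion, so that motion between frames is smoother and frames join more seamlessly.

### 3.2.1 Characteristics

Accuracy is higher than with real-time stabilization, but the algorithms have higher computational complexity and take more processing time.

### 3.2.2 Common algorithms

• Tripod -> DP(t) = 0
• Dolly or pan -> D^2P(t) = 0
• Ease in and out transitions -> D^3P(t) = 0

Microsoft's stabilization algorithm:
The core of the algorithm has two parts. 1) Image partitioning: a single motion model can hardly fit a whole image, so the image is divided into blocks, each treated as its own motion unit, which gives a better local fit and avoids severe warping. 2) Estimating the corner points of each block: the four corners of each block are inferred from feature points matched between frames, and iteration then optimizes each block's path and smooths the transitions between neighbouring blocks.

OpenCV electronic image stabilization example: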