Your First Machine Learning Project in Python Step-By-Step
Do you want to do machine learning using Python, but you’re having trouble getting started?
In this post, you will complete your first machine learning project using Python.
In this step-by-step tutorial you will:
- Download and install Python SciPy and get the most useful package for machine learning in Python.
- Load a dataset and understand its structure using statistical summaries and data visualization.
- Create 6 machine learning models, pick the best and build confidence that the accuracy is reliable.
If you are a machine learning beginner and looking to finally get started using Python, this tutorial was designed for you.
Let’s get started!
- Update Jan/2017: Updated to reflect changes to the scikit-learn API in version 0.18.
- Update Mar/2017: Added links to help setup your Python environment.
- Update Apr/2018: Added some helpful links about randomness and making predictions.
How Do You Start Machine Learning in Python?
The best way to learn machine learning is by designing and completing small projects.
Python Can Be Intimidating When Getting Started
Python is a popular and powerful interpreted language. Unlike R, Python is a complete language and platform that you can use both for research and development and for building production systems.
There are also a lot of modules and libraries to choose from, providing multiple ways to do each task. It can feel overwhelming.
The best way to get started using Python for machine learning is to complete a project.
- It will force you to install and start the Python interpreter (at the very least).
- It will give you a bird's-eye view of how to step through a small project.
- It will give you confidence, maybe to go on to your own small projects.
Beginners Need A Small End-to-End Project
Books and courses are frustrating. They give you lots of recipes and snippets, but you never get to see how they all fit together.
When you are applying machine learning to your own datasets, you are working on a project.
A machine learning project may not be linear, but it has a number of well known steps:
- Define Problem.
- Prepare Data.
- Evaluate Algorithms.
- Improve Results.
- Present Results.
The best way to really come to terms with a new platform or tool is to work through a machine learning project end-to-end and cover the key steps: loading data, summarizing data, evaluating algorithms and making some predictions.
If you can do that, you have a template that you can use on dataset after dataset. You can fill in the gaps, such as further data preparation and result-improvement tasks, later, once you have more confidence.
Hello World of Machine Learning
The best small project to start with on a new tool is the classification of iris flowers (e.g. the iris dataset).
This is a good project because it is so well understood.
- Attributes are numeric so you have to figure out how to load and handle data.
- It is a classification problem, allowing you to practice with perhaps an easier type of supervised learning algorithm.
- It is a multi-class classification problem (multinomial) that may require some specialized handling.
- It only has 4 attributes and 150 rows, meaning it is small and easily fits into memory (and a screen or A4 page).
- All of the numeric attributes are in the same units and the same scale, not requiring any special scaling or transforms to get started.
Let’s get started with your hello world machine learning project in Python.
Machine Learning in Python: Step-By-Step Tutorial
(start here)
In this section, we are going to work through a small machine learning project end-to-end.
Here is an overview of what we are going to cover:
- Installing the Python and SciPy platform.
- Loading the dataset.
- Summarizing the dataset.
- Visualizing the dataset.
- Evaluating some algorithms.
- Making some predictions.
Take your time. Work through each step.
Try to type in the commands yourself or copy-and-paste the commands to speed things up.
If you have any questions at all, please leave a comment at the bottom of the post.
1. Downloading, Installing and Starting Python SciPy
Get the Python and SciPy platform installed on your system if it is not already.
I do not want to cover this in great detail, because others already have. This is already pretty straightforward, especially if you are a developer. If you do need help, ask a question in the comments.
1.1 Install SciPy Libraries
This tutorial assumes Python version 2.7 or 3.5.
There are 5 key libraries that you will need to install. Below is a list of the Python SciPy libraries required for this tutorial:
- scipy
- numpy
- matplotlib
- pandas
- sklearn
There are many ways to install these libraries. My best advice is to pick one method then be consistent in installing each library.
The scipy installation page provides excellent instructions for installing the above libraries on multiple different platforms, such as Linux, Mac OS X and Windows. If you have any doubts or questions, refer to this guide; it has been followed by thousands of people.
- On Mac OS X, you can use macports to install Python 2.7 and these libraries. For more information on macports, see the homepage.
- On Linux you can use your package manager, such as yum on Fedora to install RPMs.
If you are on Windows or you are not confident, I would recommend installing the free version of Anaconda that includes everything you need.
Note: This tutorial assumes you have scikit-learn version 0.18 or higher installed.
Need more help? See one of these tutorials:
- How to Setup a Python Environment for Machine Learning and Deep Learning with Anaconda
- How to Create a Linux Virtual Machine For Machine Learning Development With Python 3
1.2 Start Python and Check Versions
It is a good idea to make sure your Python environment was installed successfully and is working as expected.
The script below will help you test out your environment. It imports each library required in this tutorial and prints the version.
Open a command line and start the python interpreter:
python
I recommend working directly in the interpreter or writing your scripts and running them on the command line rather than big editors and IDEs. Keep things simple and focus on the machine learning not the toolchain.
Type or copy and paste the following script:
# Check the versions of libraries
# Python version
import sys
print('Python: {}'.format(sys.version))
# scipy
import scipy
print('scipy: {}'.format(scipy.__version__))
# numpy
import numpy
print('numpy: {}'.format(numpy.__version__))
# matplotlib
import matplotlib
print('matplotlib: {}'.format(matplotlib.__version__))
# pandas
import pandas
print('pandas: {}'.format(pandas.__version__))
# scikit-learn
import sklearn
print('sklearn: {}'.format(sklearn.__version__))
Here is the output I get on my OS X workstation:
Python: 2.7.11 (default, Mar 1 2016, 18:40:10)
[GCC 4.2.1 Compatible Apple LLVM 7.0.2 (clang-700.1.81)]
scipy: 0.17.0
numpy: 1.10.4
matplotlib: 1.5.1
pandas: 0.17.1
sklearn: 0.18.1
Compare the above output to your versions.
Ideally, your versions should match or be more recent. The APIs do not change quickly, so do not be too concerned if you are a few versions behind; everything in this tutorial will very likely still work for you.
If you get an error, stop. Now is the time to fix it.
If you cannot run the above script cleanly you will not be able to complete this tutorial.
My best advice is to Google search for your error message or post a question on Stack Exchange.
2. Load The Data
We are going to use the iris flowers dataset. This dataset is famous because it is used as the “hello world” dataset in machine learning and statistics by pretty much everyone.
The dataset contains 150 observations of iris flowers. There are four columns of measurements of the flowers in centimeters. The fifth column is the species of the flower observed. All observed flowers belong to one of three species.
You can learn more about this dataset on Wikipedia.
In this step we are going to load the iris data from CSV file URL.
2.1 Import libraries
First, let’s import all of the modules, functions and objects we are going to use in this tutorial.
# Load libraries
import pandas
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
Everything should load without error. If you have an error, stop. You need a working SciPy environment before continuing. See the advice above about setting up your environment.
2.2 Load Dataset
We can load the data directly from the UCI Machine Learning repository.
We are using pandas to load the data. We will also use pandas next to explore the data both with descriptive statistics and data visualization.
Note that we are specifying the names of each column when loading the data. This will help later when we explore the data.
# Load dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)
The dataset should load without incident.
If you do have network problems, you can download the iris.data file into your working directory and load it using the same method, changing URL to the local file name.
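For example, a minimal sketch of that offline fallback, assuming you have saved the file as iris.data in your working directory:
# Load the dataset from a local copy instead of the URL
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv('iris.data', names=names)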
3. Summarize the Dataset
Now it is time to take a look at the data.
In this step we are going to take a look at the data a few different ways:
- Dimensions of the dataset.
- Peek at the data itself.
- Statistical summary of all attributes.
- Breakdown of the data by the class variable.
Don’t worry, each look at the data is one command. These are useful commands that you can use again and again on future projects.
3.1 Dimensions of Dataset
We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.
# shape
print(dataset.shape)
You should see 150 instances and 5 attributes:
(150, 5)
3.2 Peek at the Data
It is also always a good idea to actually eyeball your data.
# head
print(dataset.head(20))
You should see the first 20 rows of the data:
sepal-length sepal-width petal-length petal-width class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
10 5.4 3.7 1.5 0.2 Iris-setosa
11 4.8 3.4 1.6 0.2 Iris-setosa
12 4.8 3.0 1.4 0.1 Iris-setosa
13 4.3 3.0 1.1 0.1 Iris-setosa
14 5.8 4.0 1.2 0.2 Iris-setosa
15 5.7 4.4 1.5 0.4 Iris-setosa
16 5.4 3.9 1.3 0.4 Iris-setosa
17 5.1 3.5 1.4 0.3 Iris-setosa
18 5.7 3.8 1.7 0.3 Iris-setosa
19 5.1 3.8 1.5 0.3 Iris-setosa
3.3 Statistical Summary
Now we can take a look at a summary of each attribute.
This includes the count, mean, the min and max values as well as some percentiles.
# descriptions
print(dataset.describe())
We can see that all of the numerical values have the same scale (centimeters) and similar ranges between 0 and 8 centimeters.
sepal-length sepal-width petal-length petal-width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
3.4 Class Distribution
Let’s now take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count.
# class distribution
print(dataset.groupby('class').size())
We can see that each class has the same number of instances (50 or 33% of the dataset).
class
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
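If you also want that breakdown as a proportion rather than a count, a small optional sketch (not part of the original recipe) is:
# class distribution as a percentage of all rows
print(dataset.groupby('class').size() / len(dataset) * 100)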
4. Data Visualization
We now have a basic idea about the data. We need to extend that with some visualizations.
We are going to look at two types of plots:
- Univariate plots to better understand each attribute.
- Multivariate plots to better understand the relationships between attributes.
4.1 Univariate Plots
We start with some univariate plots, that is, plots of each individual variable.
Given that the input variables are numeric, we can create box and whisker plots of each.
# box and whisker plots
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
plt.show()
This gives us a much clearer idea of the distribution of the input attributes:
We can also create a histogram of each input variable to get an idea of the distribution.
# histograms
dataset.hist()
plt.show()
It looks like perhaps two of the input variables have a Gaussian distribution. This is useful to note as we can use algorithms that can exploit this assumption.
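If you want to check that impression numerically, an optional sketch (not part of the original recipe) runs SciPy's D'Agostino-Pearson normality test on each input column; a large p-value is consistent with a Gaussian shape:
from scipy import stats

# normality test for each numeric input column
for col in ['sepal-length', 'sepal-width', 'petal-length', 'petal-width']:
    stat, p = stats.normaltest(dataset[col])
    print('%s: statistic=%.3f, p=%.3f' % (col, stat, p))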
4.2 Multivariate Plots
Now we can look at the interactions between the variables.
First, let’s look at scatterplots of all pairs of attributes. This can be helpful to spot structured relationships between input variables.
# scatter plot matrix
scatter_matrix(dataset)
plt.show()
Note the diagonal grouping of some pairs of attributes. This suggests a high correlation and a predictable relationship.
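To put rough numbers on that impression, an optional one-liner (using pandas' built-in corr() method on the numeric columns) is:
# pairwise Pearson correlations between the numeric attributes
print(dataset.corr())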
5. Evaluate Some Algorithms
Now it is time to create some models of the data and estimate their accuracy on unseen data.
Here is what we are going to cover in this step:
- Separate out a validation dataset.
- Set-up the test harness to use 10-fold cross validation.
- Build 6 different models to predict species from flower measurements.
- Select the best model.
5.1 Create a Validation Dataset
We need to know whether the model we have created is any good.
Later, we will use statistical methods to estimate the accuracy of the models that we create on unseen data. We also want a more concrete estimate of the accuracy of the best model on unseen data by evaluating it on actual unseen data.
That is, we are going to hold back some data that the algorithms will not get to see and we will use this data to get a second and independent idea of how accurate the best model might actually be.
We will split the loaded dataset into two, 80% of which we will use to train our models and 20% that we will hold back as a validation dataset.
# Split-out validation dataset
array = dataset.values
X = array[:,0:4]
Y = array[:,4]
validation_size = 0.20
seed = 7
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(X, Y, test_size=validation_size, random_state=seed)
You now have training data in X_train and Y_train for preparing models and X_validation and Y_validation sets that we can use later.
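As a quick sanity check (optional), you can print the shapes of the splits to confirm that roughly 120 rows went to training and 30 to validation:
# confirm the 80/20 split of the 150 rows
print(X_train.shape, Y_train.shape)
print(X_validation.shape, Y_validation.shape)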
5.2 Test Harness
We will use 10-fold cross validation to estimate accuracy.
This will split our dataset into 10 parts, train on 9 and test on 1 and repeat for all combinations of train-test splits.
# Test options and evaluation metric
seed = 7
scoring = 'accuracy'
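To see what the 10-fold procedure actually does, an optional sketch (reusing the same model_selection.KFold class the evaluation below relies on) prints the size of each train/test fold over the 120 training rows:
# illustrate how 10-fold cross validation partitions the training data
kfold = model_selection.KFold(n_splits=10, random_state=seed)
for i, (train_index, test_index) in enumerate(kfold.split(X_train)):
    print('Fold %d: train on %d rows, test on %d rows' % (i + 1, len(train_index), len(test_index)))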
The specific random seed does not matter; any fixed value makes the evaluation repeatable.
We are using the metric of 'accuracy' to evaluate models. This is the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage (e.g. 95% accurate). We will be using the scoring variable when we build and evaluate each model next.
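As a concrete illustration of the metric (a toy example, not data from this tutorial), accuracy_score simply reports the fraction of predictions that match the true labels:
# 4 of these 5 toy predictions match the true labels, so accuracy is 0.8
print(accuracy_score(['a', 'a', 'b', 'b', 'c'], ['a', 'a', 'b', 'c', 'c']))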
5.3 Build Models
We don’t know which algorithms would be good on this problem or what configurations to use. We get an idea from the plots that some of the classes are partially linearly separable in some dimensions, so we are expecting generally good results.
Let’s evaluate 6 different algorithms:
- Logistic Regression (LR)
- Linear Discriminant Analysis (LDA)
- K-Nearest Neighbors (KNN)
- Classification and Regression Trees (CART)
- Gaussian Naive Bayes (NB)
- Support Vector Machines (SVM)
This is a good mixture of simple linear (LR and LDA) and nonlinear (KNN, CART, NB and SVM) algorithms. We reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits. It ensures the results are directly comparable.
Let's build and evaluate our six models:
# Spot Check Algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn
results = []
names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
5.4 Select Best Model
We now have 6 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.
Running the example above, we get the following raw results:
LR: 0.966667 (0.040825)
LDA: 0.975000 (0.038188)
KNN: 0.983333 (0.033333)
CART: 0.975000 (0.038188)
NB: 0.975000 (0.053359)
SVM: 0.981667 (0.025000)
Note: your results may differ slightly (for example, with different library versions).
We can see that it looks like KNN has the largest estimated accuracy score.
We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (10 fold cross validation).
# Compare Algorithms
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
You can see that the box and whisker plots are squashed at the top of the range, with many samples achieving 100% accuracy.
6. Make Predictions
The KNN algorithm was the most accurate model that we tested. Now we want to get an idea of the accuracy of the model on our validation set.
This will give us an independent final check on the accuracy of the best model. It is valuable to keep a validation set just in case you made a slip during training, such as overfitting to the training set or a data leak. Both will result in an overly optimistic result.
We can run the KNN model directly on the validation set and summarize the results as a final accuracy score, a confusion matrix and a classification report.
# Make predictions on validation dataset
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))
We can see that the accuracy is 0.9 or 90%. The confusion matrix provides an indication of the three errors made. Finally, the classification report provides a breakdown of each class by precision, recall, f1-score and support showing excellent results (granted the validation dataset was small).
0.9
[[ 7 0 0]
[ 0 11 1]
[ 0 2 9]]
precision recall f1-score support
Iris-setosa 1.00 1.00 1.00 7
Iris-versicolor 0.85 0.92 0.88 12
Iris-virginica 0.90 0.82 0.86 11
avg / total 0.90 0.90 0.90 30
You can also go beyond class labels and predict class membership probabilities for new data.
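For example, a minimal sketch using the KNN model fitted above (predict_proba is a standard scikit-learn classifier method):
# predict class membership probabilities for the validation rows
probabilities = knn.predict_proba(X_validation)
print(probabilities[:5])  # one row per instance, one column per class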
You Can Do Machine Learning in Python
Work through the tutorial above. It will take you 5-to-10 minutes, max!
You do not need to understand everything (at least not right now). Your goal is to run through the tutorial end-to-end and get a result. Write down your questions as you go, and make heavy use of the help("FunctionName") syntax in Python to learn about the functions you are using.
You do not need to know how the algorithms work. It is important to know about the limitations and how to configure machine learning algorithms. But learning about algorithms can come later. You need to build up this algorithm knowledge slowly over a long period of time. Today, start off by getting comfortable with the platform.
You do not need to be a Python programmer. The syntax of the Python language can be intuitive even if you are new to it. Just like other languages, focus on function calls (e.g. function()) and assignments (e.g. a = "b"). This will get you most of the way. You are a developer; you know how to pick up the basics of a language really fast. Just get started and dive into the details later.
You do not need to be a machine learning expert. You can learn about the benefits and limitations of various algorithms later, and there are plenty of posts that you can read later to brush up on the steps of a machine learning project and the importance of evaluating accuracy using cross validation.
What about the other steps in a machine learning project? We did not cover all of the steps in a machine learning project because this is your first project and we need to focus on the key steps: loading data, looking at the data, evaluating some algorithms and making some predictions. In later tutorials we can look at other data preparation and result improvement tasks.
Summary
In this post, you discovered step-by-step how to complete your first machine learning project in Python.
You discovered that completing a small end-to-end project from loading the data to making predictions is the best way to get familiar with a new platform.
Your Next Step
Did you work through the tutorial?
- Work through the above tutorial.
- List any questions you have.
- Search or research the answers.
- Remember, you can use help("FunctionName") in Python to get help on any function.
Do you have a question?
Post it in the comments below.
How to Set Up a Python Environment for Machine Learning and Deep Learning with Anaconda
On some platforms, installing a Python machine learning environment can be a hassle.
You must first install Python itself and then install many packages, which can easily confuse a beginner.
In this tutorial, you will learn how to set up a Python machine learning development environment using Anaconda.
After completing this tutorial, you will have a working Python environment that lets you learn, practice and develop machine learning and deep learning software.
These instructions apply to Windows, Mac OS X and Linux. I will demonstrate them on OS X, so you may see some Mac dialogs and file extensions.
- Update Mar/2017: Note: you need Theano or TensorFlow installed to use Keras for deep learning.
Tutorial Overview
In this tutorial, we will cover the following steps:
- Download Anaconda
- Install Anaconda
- Start and Update Anaconda
- Update the scikit-learn Library
- Install Deep Learning Libraries
1. Download Anaconda
In this step, you will download the Anaconda Python package for your platform.
Anaconda is a free and easy-to-use scientific Python environment.
- 1. Visit the Anaconda homepage.
- 2. Click "Anaconda" in the menu, then click "Download" to go to the download page.
- 3. Choose the download that suits your platform (Windows, OS X or Linux):
- Choose Python 3.5
- Choose the Graphical Installer
This downloads the Anaconda Python package to your workstation.
I am on OS X, so I chose the OS X version. The file is about 426 MB.
You should end up with a file with a name like:
Anaconda3-4.2.0-MacOSX-x86_64.pkg
2. Install Anaconda
In this step, you will install the Anaconda Python software on your system.
This step assumes you have sufficient administrative privileges to install software on your system.
- 1. Double-click the downloaded file.
- 2. Follow the installation wizard.
Installation is straightforward and should not present any tricky problems.
It takes less than 10 minutes and occupies about 1 GB of disk space.
3. Start and Update Anaconda
In this step, you will confirm that your Anaconda Python environment is up to date.
Anaconda comes with a suite of graphical tools called Anaconda Navigator. You can open Anaconda Navigator from your application launcher.
You can learn more about Anaconda Navigator on its homepage.
We will use Anaconda Navigator and the graphical development environment later; for now, I recommend starting with the Anaconda command-line environment, called conda.
Conda is fast and simple, it does not hide error messages, and you can quickly confirm that your environment is installed and working correctly.
- 1. Open a terminal (command-line window).
- 2. Confirm conda is installed correctly by typing:
conda -V
You should see something like the following:
conda 4.2.9
- 3. Confirm Python is installed correctly by typing:
python -V
You should see something like the following:
Python 3.5.2 :: Anaconda 4.2.0 (x86_64)
If these commands do not work or report an error, check the help documentation for your platform.
Also see the resources in the Further Reading section.
- 4. Confirm your conda environment is up to date by typing:
conda update conda
conda update anaconda
You may need to install updates for some packages.
- 5. Confirm your SciPy environment.
The script below prints the version numbers of the key SciPy libraries you need for machine learning development: SciPy, NumPy, Matplotlib, Pandas, Statsmodels and scikit-learn.
You could type "python" and enter the commands directly, but I recommend opening a text file and copying the script into it.
# scipy
import scipy
print('scipy: %s' % scipy.__version__)
# numpy
import numpy
print('numpy: %s' % numpy.__version__)
# matplotlib
import matplotlib
print('matplotlib: %s' % matplotlib.__version__)
# pandas
import pandas
print('pandas: %s' % pandas.__version__)
# statsmodels
import statsmodels
print('statsmodels: %s' % statsmodels.__version__)
# scikit-learn
import sklearn
print('sklearn: %s' % sklearn.__version__)
Save the script as a file named versions.py.
On the command line, change directory to where you saved the script and type:
python versions.py
You should see output like the following:
scipy: 0.18.1
numpy: 1.11.1
matplotlib: 1.5.3
pandas: 0.18.1
statsmodels: 0.6.1
sklearn: 0.17.1
4. Update the scikit-learn Library
In this step, we will update the main library used for machine learning in Python, scikit-learn.
- 1. Update scikit-learn to the latest version.
At the time of writing, the version of scikit-learn shipped with Anaconda is out of date (0.17.1 instead of 0.18.1). You can use the conda command to update a specific library; below is an example of updating scikit-learn to the latest version.
Type:
conda update scikit-learn
Alternatively, you can upgrade to a specific version by typing:
conda install -c anaconda scikit-learn=0.18.1
To confirm the installation was successful, rerun the versions.py script (python versions.py).
You should see output like the following:
scipy: 0.18.1
numpy: 1.11.3
matplotlib: 1.5.3
pandas: 0.18.1
statsmodels: 0.6.1
sklearn: 0.18.1
You can use these commands to update your machine learning and SciPy libraries as needed.
Try out a scikit-learn tutorial, such as the step-by-step project above.
5. Install Deep Learning Libraries
In this step, we will install the Python libraries used for deep learning: Theano, TensorFlow and Keras.
Note: I recommend using Keras for deep learning, and Keras only requires one of Theano or TensorFlow to be installed. Installing TensorFlow can be problematic on some Windows systems.
- 1. Install the Theano deep learning library by typing:
conda install theano
- 2. Install the TensorFlow deep learning library (all platforms except Windows) by typing:
conda install -c conda-forge tensorflow
Alternatively, you may choose to install using pip and a specific version of TensorFlow for your platform.
See the TensorFlow installation instructions for details.
- 3. Install Keras by typing:
pip install keras
- 4. Confirm your deep learning environment is installed and working correctly.
Create a script that prints the version number of each library, just as we did above for the SciPy environment.
# theano
import theano
print('theano: %s' % theano.__version__)
# tensorflow
import tensorflow
print('tensorflow: %s' % tensorflow.__version__)
# keras
import keras
print('keras: %s' % keras.__version__)
Save the script to a file named deep_versions.py. Run the script by typing:
python deep_versions.py
You should see output like the following:
theano: 0.8.2.dev-901275534cbfe3fbbe290ce85d1abf8bb9a5b203
tensorflow: 0.12.1
Using TensorFlow backend.
keras: 1.2.1
Try out a Keras deep learning tutorial to put the environment to work.
Further Reading
This section provides some resources for further reading.
Summary
Congratulations, you now have a working Python development environment for machine learning and deep learning.
You can now learn and practice machine learning and deep learning on your workstation.
Fixing Alibaba Cloud Vulnerability RHSA-2018:0169: kernel security and bug fix update
Basic Information
CVE-2017-11176 (severity: critical)
Title: Linux kernel 'mq_notify' invalid memory reference vulnerability
Disclosed: 2017-07-11 00:00:00
CVE ID: CVE-2017-11176
Description:
The Linux kernel is the kernel used by the Linux operating system, published by the Linux Foundation.
The 'mq_notify' function in Linux kernel versions 4.11.9 and earlier contains a security vulnerability. An attacker can exploit it to cause a denial of service.
Solution:
On the vulnerability handling page, select the affected server and vulnerability, generate the fix commands, then log in to the server and run them.
Resolution Steps
- Upgrade the kernel version
[root@CYBSERVER_HK ~]# uname -r
2.6.32-696.16.1.el6.x86_64
[root@CYBSERVER_HK ~]# rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
[root@CYBSERVER_HK ~]# rpm -Uvh http://www.elrepo.org/elrepo-release-6-8.el6.elrepo.noarch.rpm   # install the ELRepo yum repository
...
Installed: kernel-lt.x86_64 0:4.4.101-1.el6.elrepo
Complete!
- Change the GRUB boot order
[root@CYBSERVER_HK ~]# vim /etc/grub.conf
# Change the default boot entry to 0: default=0
# grub.conf generated by anaconda
#
# Note that you do not have to rerun grub after making changes to this file
# NOTICE: You do not have a /boot partition. This means that
#     all kernel and initrd paths are relative to /, eg.
#     root (hd0,0)
#     kernel /boot/vmlinuz-version ro root=/dev/xvda1
#     initrd /boot/initrd-[generic-]version.img
#boot=/dev/xvda
default=0
title CentOS (4.4.135-1.el6.elrepo.i686)
    root (hd0,0)
    kernel /boot/vmlinuz-4.4.135-1.el6.elrepo.i686 ro root=UUID=e76a7b8d-20c2-4f94-bdd1-f4054a34c206 rd_NO_LUKS rd_NO_LVM LANG=en_US.UTF-8 rd_NO_MD SYSFONT=latarcyrheb-sun16 crashkernel=auto KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet
- Reboot and check the kernel version
[root@CYBSERVER_HK ~]# reboot
[root@CYBSERVER_HK ~]# uname -r
4.4.135-1.el6.elrepo.i686
- Upgrade kernel-devel, kernel-headers and kernel-doc
[root@CYBSERVER_HK ~]# yum check-update
[root@CYBSERVER_HK ~]# yum update
[root@CYBSERVER_HK ~]# yum --enablerepo=elrepo-kernel install kernel-lt-headers -y
[root@CYBSERVER_HK ~]# yum --enablerepo=elrepo-kernel install kernel-lt-devel -y
[root@CYBSERVER_HK ~]# yum --enablerepo=elrepo-kernel install kernel-lt-doc -y
rpm Command Reference
The rpm command is the management tool for RPM packages. rpm was originally the program Red Hat Linux used to manage the packages in its distribution; because it follows the GPL and is powerful and convenient, it became popular and was gradually adopted by other distributions. The RPM approach to package management made Linux easier to install and upgrade, which indirectly improved Linux's usability.
Syntax
rpm (options) (arguments)
Options
-a: query all installed packages;
-b<stage><package file>+ or -t<stage><package file>+: set the build stage and specify the package file names;
-c: list only configuration files; use together with "-l";
-d: list only documentation files; use together with "-l";
-e<package> or --erase<package>: remove (uninstall) the specified package;
-f<file>+: query which package owns the specified file;
-h or --hash: print hash marks while the package is being installed;
-i: display information about the package;
-i<package file> or --install<package file>: install the specified package file;
-l: list the files contained in the package;
-p<package file>+: query the specified RPM package file;
-q: query mode; rpm asks the user when it encounters any problem;
-R: display the package's dependency information;
-s: display file states; use together with "-l";
-U<package file> or --upgrade<package file>: upgrade using the specified package file;
-v: show the command's progress;
-vv: show detailed progress, useful for troubleshooting.
Arguments
package: the RPM package(s) to operate on.
Examples
How to install an RPM package
RPM packages are installed with the rpm program. Run the following command:
rpm -ivh your-package.rpm
where your-package.rpm is the file name of the RPM package you want to install, usually located in the current directory.
During installation you may see warnings or messages such as:
... conflict with ...
This means some files in the package would overwrite existing files, and by default the installation will not proceed; you can force it with rpm --force -i.
... is needed by ...
... is not installed ...
This means some software required by the package is not installed; you can ignore the message with rpm --nodeps -i. In other words, rpm -i --force --nodeps ignores all dependency and file problems and will install any package, but a package forced in this way is not guaranteed to work fully.
How to install a .src.rpm package
Some packages end in .src.rpm; these are RPM packages that contain source code and must be compiled during installation. There are two ways to install them:
Method 1:
rpm -i your-package.src.rpm
cd /usr/src/redhat/SPECS
rpmbuild -bp your-package.specs    # a .specs file with the same name as your package
cd /usr/src/redhat/BUILD/your-package/    # a directory with the same name as your package
./configure    # the same as building ordinary source software; you can add options
make
make install
Method 2:
rpm -i you-package.src.rpm
cd /usr/src/redhat/SPECS
The first two steps are the same as in Method 1.
rpmbuild -bb your-package.specs    # a .specs file with the same name as your package
A new RPM package, the compiled binary, will now be in /usr/src/redhat/RPM/i386/ (depending on the package, the directory may instead be i686, noarch, and so on).
Run rpm -i new-package.rpm
to complete the installation.
How to uninstall an RPM package
Use rpm -e <package name>. The name may include version information but must not include the .rpm suffix. For example, to uninstall the package proftpd-1.2.8-1, any of the following formats work:
rpm -e proftpd-1.2.8-1
rpm -e proftpd-1.2.8
rpm -e proftpd-
rpm -e proftpd
The following formats do not work:
rpm -e proftpd-1.2.8-1.i386.rpm
rpm -e proftpd-1.2.8-1.i386
rpm -e proftpd-1.2
rpm -e proftpd-1
Sometimes you will see errors or warnings such as:
... is needed by ...
This means the package is required by other software and cannot simply be removed; you can force removal with rpm -e --nodeps.
How to extract files from an RPM package without installing it
Use the tools rpm2cpio and cpio:
rpm2cpio xxx.rpm | cpio -vi
rpm2cpio xxx.rpm | cpio -idmv
rpm2cpio xxx.rpm | cpio --extract --make-directories
The option i is the same as --extract and extracts the files; v shows progress; d is the same as --make-directories and recreates directories according to the files' original paths in the package; m preserves the files' modification times.
How to view files and other information related to an RPM package
All of the examples below assume the package mysql-3.23.54a-11.
- Which RPM packages are installed on my system?
rpm -qa lists all installed packages.
To find all installed packages whose names contain a given string, for example sql: rpm -qa | grep sql
- How do I get the full name of an installed package?
rpm -q mysql
returns the full name of the installed mysql package, which includes information such as the current version. In this example the result is mysql-3.23.54a-11.
- Where were the files in an RPM package installed?
rpm -ql <package name>
Note that this is the package name without the .rpm suffix; that is, you can use mysql or mysql-3.23.54a-11 but not mysql-3.23.54a-11.rpm. If you only want to know where an executable program ended up, you can also use which, for example:
which mysql
- Which files does an RPM package contain?
For a package that has not been installed, use rpm -qlp ****.rpm
For a package that is already installed, you can also use rpm -ql ****.rpm
- How do I get information about a package, such as its version and purpose?
For a package that has not been installed, use rpm -qip ****.rpm
For a package that is already installed, you can also use rpm -qi ****.rpm
- Which package installed a given program, or which package contains it?
rpm -qf `which <program>`     # returns the full package name
rpm -qif `which <program>`    # returns information about the package
rpm -qlf `which <program>`    # returns the package's file list
Note that these are not quotation marks but backticks (`), the key in the top-left corner of the keyboard. You can also use rpm -qilf to output both the package information and the file list at once.
- Which package installed a given file, or which package contains it?
Note that the method in the previous question only works for executable programs, whereas the method below works not only for executables but for any ordinary file, provided you know the file name. First obtain the file's full path with whereis or which, then use rpm -qf. For example:
whereis ftptop
ftptop: /usr/bin/ftptop /usr/share/man/man1/ftptop.1.gz
rpm -qf /usr/bin/ftptop
proftpd-1.2.8-1
rpm -qf /usr/share/doc/proftpd-1.2.8/rfc/rfc0959.txt
proftpd-1.2.8-1
Linux yum Command
Source: li1121567428 (http://www.runoob.com/linux/linux-yum.html)
yum (Yellow dog Updater, Modified) is a shell front-end package manager used in Fedora, Red Hat and SUSE.
It is built on RPM package management and can automatically download and install RPM packages from specified servers, automatically resolving dependencies and installing all dependent packages at once, without the tedium of downloading and installing each one by hand.
yum provides commands for finding, installing and removing a single package, a group of packages, or even all packages, and the commands are concise and easy to remember.
yum Syntax
yum [options] [command] [package ...]
- options: optional; includes -h (help), -y (answer "yes" to all prompts during installation), -q (do not show installation progress), and so on.
- command: the operation to perform.
- package: the package(s) to operate on.
Common yum Commands
- 1. List all packages that can be updated: yum check-update
- 2. Update all packages: yum update
- 3. Install only the specified package: yum install
- 4. Update only the specified package: yum update
- 5. List all installable packages: yum list
- 6. Remove a package: yum remove
- 7. Search for a package: yum search
- 8. Clear the cache:
- yum clean packages: clear cached packages from the cache directory
- yum clean headers: clear headers from the cache directory
- yum clean oldheaders: clear old headers from the cache directory
- yum clean, yum clean all (= yum clean packages; yum clean oldheaders): clear cached packages and old headers from the cache directory
Example 1
Install pam-devel
[root@www ~]# yum install pam-devel
Setting up Install Process
Parsing package install arguments
Resolving Dependencies   <== first resolve dependency issues
--> Running transaction check
---> Package pam-devel.i386 0:0.99.6.2-4.el5 set to be updated
--> Processing Dependency: pam = 0.99.6.2-4.el5 for package: pam-devel
--> Running transaction check
---> Package pam.i386 0:0.99.6.2-4.el5 set to be updated
filelists.xml.gz 100% |=========================| 1.6 MB 00:05
filelists.xml.gz 100% |=========================| 138 kB 00:00
-> Finished Dependency Resolution
...(omitted)
Example 2
Remove pam-devel
[root@www ~]# yum remove pam-devel
Setting up Remove Process
Resolving Dependencies   <== again, resolve dependencies first
--> Running transaction check
---> Package pam-devel.i386 0:0.99.6.2-4.el5 set to be erased
--> Finished Dependency Resolution
Dependencies Resolved
=============================================================================
Package        Arch       Version            Repository       Size
=============================================================================
Removing:
pam-devel      i386       0.99.6.2-4.el5     installed        495 k
Transaction Summary
=============================================================================
Install    0 Package(s)
Update     0 Package(s)
Remove     1 Package(s)   <== no dependency problems; just one package to remove
Is this ok [y/N]: y
Downloading Packages:
Running rpm_check_debug
Running Transaction Test
Finished Transaction Test
Transaction Test Succeeded
Running Transaction
Erasing : pam-devel ######################### [1/1]
Removed: pam-devel.i386 0:0.99.6.2-4.el5
Complete!
Example 3
Use yum to find all packages whose names start with pam.
[root@www ~]# yum list pam*
Installed Packages
pam.i386            0.99.6.2-3.27.el5   installed
pam_ccreds.i386     3-5                 installed
pam_krb5.i386       2.2.14-1            installed
pam_passwdqc.i386   1.0.2-1.2.2         installed
pam_pkcs11.i386     0.5.3-23            installed
pam_smb.i386        1.1.7-7.2.1         installed
Available Packages   <== the packages below are upgradable or not yet installed
pam.i386            0.99.6.2-4.el5      base
pam-devel.i386      0.99.6.2-4.el5      base
pam_krb5.i386       2.2.14-10           base
Chinese yum Mirrors
The NetEase (163) yum mirror is one of the best yum mirrors in China; both its speed and its package versions are excellent.
Pointing yum at the 163 mirror speeds up package installation and updates and avoids some common packages not being found.
Setup Steps
First, back up /etc/yum.repos.d/CentOS-Base.repo:
mv /etc/yum.repos.d/CentOS-Base.repo /etc/yum.repos.d/CentOS-Base.repo.backup
Download the repo file for your version and place it in /etc/yum.repos.d/ (make a backup before doing this):
- CentOS5: http://mirrors.163.com/.help/CentOS5-Base-163.repo
- CentOS6: http://mirrors.163.com/.help/CentOS6-Base-163.repo
- CentOS7: http://mirrors.163.com/.help/CentOS7-Base-163.repo
wget http://mirrors.163.com/.help/CentOS6-Base-163.repo
mv CentOS6-Base-163.repo CentOS-Base.repo
Run the following commands to rebuild the cache:
yum clean all
yum makecache
Besides NetEase, there are other good yum mirrors in China, such as USTC and Sohu.
For the USTC yum mirror, see: https://lug.ustc.edu.cn/wiki/mirrors/help/centos
For the Sohu yum mirror, see: http://mirrors.sohu.com/help/centos.html
Electronic Image Stabilization [repost]
Electronic Image Stabilization for UAV Video
1. Introduction to Small UAV Data
1.1 Image Data
1.2 Video Data
2. The Role of Electronic Image Stabilization for Small UAVs
3. Electronic Image Stabilization
3.1 Real-Time Electronic Image Stabilization
3.1.1 Basic Principles
3.1.2 Common Methods
3.2 Post-Processing Electronic Image Stabilization
3.2.1 Characteristics
3.2.2 Common Algorithms
[java] Configuring JDK 1.7 Environment Variables on Windows 7
Environment: Windows 7 (32-bit; 64-bit is much the same), JDK 1.7
1. Right-click Computer → Properties → Advanced system settings → Advanced → Environment Variables to open the "Environment Variables" dialog. Edit the system variables in the lower pane, not the Administrator user variables in the upper pane (otherwise every other user would have to configure them separately).
Add JAVA_HOME pointing to the JDK installation path, e.g. C:\Program Files\Java\jdk1.7.0; this folder contains lib, bin, jre and other directories.
2. Add the following to the Path variable:
%JAVA_HOME%\bin;%JAVA_HOME%\jre\bin;
Note: the semicolons separate the JDK paths from the paths of other system and application programs.
On Windows 10 you can simply click "New" to add each path entry.
3. Add a CLASSPATH variable for the paths (class or lib) from which Java loads classes (this is how the Java virtual machine knows where to look when loading class files; the java command only recognizes classes that are on the classpath).
Set it to: .;%JAVA_HOME%\lib;%JAVA_HOME%\lib\tools.jar (the leading . means the current directory, i.e. wherever the class files of the program you are compiling and running live).
Test whether the configuration succeeded:
On the command line, typing java and pressing Enter should list the various java options;
>java -version
java version "1.7.0_79"
Java(TM) SE Runtime Environment (build 1.7.0_79-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)
javac should likewise list the compiler options;
java -version prints the JDK version number. Note that java and javac are the commands (verbs) and -version can be thought of as the object; there is a space between them, so remember it!
Special notes:
In cmd, typing set java_home shows the JDK installation directory;
>set java_home
JAVA_HOME=C:\Program Files\Java\jdk1.7.0_79
set path shows the value of the Path variable, i.e. the search paths for the various programs;
>set path
Path=C:\Program Files (x86)\Intel\iCLS Client\;C:\Program Files\Intel\iCLS Client\;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Program Files\Intel\WiFi\bin\;C:\Program Files\Common Files\Intel\WirelessCommon\;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files\Intel\Intel(R) Management Engine Components\DAL;C:\Program Files (x86)\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files\Intel\Intel(R) Management Engine Components\IPT;C:\Program Files\Git\cmd;C:\mysql-5.6.17\bin;C:\Program Files\PuTTY\;C:\Program Files (x86)\WinSCP\;C:\Program Files\Git LFS;C:\Program Files (x86)\QuickTime\QTSystem\;C:\Program Files\Java\jdk1.7.0_79\bin;C:\Program Files\Java\jdk1.7.0_79\jre\bin;C:\Users\cyb\AppData\Local\Microsoft\WindowsApps;C:\Program Files\Intel\WiFi\bin\;C:\Program Files\Common Files\Intel\WirelessCommon\;C:\Program Files\Microsoft VS Code\bin
PATHEXT=.COM;.EXE;.BAT;.CMD;.VBS;.VBE;.JS;.JSE;.WSF;.WSH;.MSC
set classpath shows the class loading paths;
>set classpath
CLASSPATH=.;C:\Program Files\Java\jdk1.7.0_79\lib;C:\Program Files\Java\jdk1.7.0_79\lib\tools.jar;C:\Program Files (x86)\QuickTime\QTSystem\QTJava.zip;
◆ An environment variable value may or may not end with a semicolon; separate different values with ; (remember: the semicolon is a separator, and any two different paths must be separated by one).
◆ The . in the CLASSPATH value means the current directory. Also, JAVA_HOME is referenced as a whole by Path and CLASSPATH, which has the advantage that if you reinstall the JDK later you only need to change the value of JAVA_HOME.
Configuring a Local Yum Repository
To complete this exercise, follow the steps below.
Step 1: Set up a local Yum repository by manually mounting the RHEL6 installation disc at /media.
The commands are as follows:
Step 2: Configure the local machine as a Yum client and verify it.
The Yum client requires editing its configuration file; the commands are as follows: