Your First Machine Learning Project in Python Step-By-Step


Do you want to do machine learning using Python, but you’re having trouble getting started?

In this post, you will complete your first machine learning project using Python.

In this step-by-step tutorial you will:

  1. Download and install Python SciPy and get the most useful package for machine learning in Python.
  2. Load a dataset and understand its structure using statistical summaries and data visualization.
  3. Create 6 machine learning models, pick the best and build confidence that the accuracy is reliable.

If you are a machine learning beginner and looking to finally get started using Python, this tutorial was designed for you.

Let’s get started!

  • Update Jan/2017: Updated to reflect changes to the scikit-learn API in version 0.18.
  • Update Mar/2017: Added links to help setup your Python environment.
  • Update Apr/2018: Added some helpful links about randomness and making predictions.


How Do You Start Machine Learning in Python?

The best way to learn machine learning is by designing and completing small projects.

Python Can Be Intimidating When Getting Started

Python is a popular and powerful interpreted language. Unlike R, Python is a complete language and platform that you can use both for research and for developing production systems.

There are also a lot of modules and libraries to choose from, providing multiple ways to do each task. It can feel overwhelming.

The best way to get started using Python for machine learning is to complete a project.

  • It will force you to install and start the Python interpreter (at the very least).
  • It will give you a bird’s eye view of how to step through a small project.
  • It will give you confidence, maybe to go on to your own small projects.

Beginners Need A Small End-to-End Project

Books and courses are frustrating. They give you lots of recipes and snippets, but you never get to see how they all fit together.

When you are applying machine learning to your own datasets, you are working on a project.

A machine learning project may not be linear, but it has a number of well-known steps:

  1. Define Problem.
  2. Prepare Data.
  3. Evaluate Algorithms.
  4. Improve Results.
  5. Present Results.

The best way to really come to terms with a new platform or tool is to work through a machine learning project end-to-end and cover the key steps: loading data, summarizing data, evaluating algorithms and making some predictions.

If you can do that, you have a template that you can use on dataset after dataset. You can fill in the gaps, such as further data preparation and result-improvement tasks, later, once you have more confidence.

Hello World of Machine Learning

The best small project to start with on a new tool is the classification of iris flowers (e.g. the iris dataset).

This is a good project because it is so well understood.

  • Attributes are numeric so you have to figure out how to load and handle data.
  • It is a classification problem, allowing you to practice with perhaps an easier type of supervised learning algorithm.
  • It is a multi-class (multinomial) classification problem that may require some specialized handling.
  • It only has 4 attributes and 150 rows, meaning it is small and easily fits into memory (and a screen or A4 page).
  • All of the numeric attributes are in the same units and the same scale, not requiring any special scaling or transforms to get started.

Let’s get started with your hello world machine learning project in Python.

Machine Learning in Python: Step-By-Step Tutorial
(start here)

In this section, we are going to work through a small machine learning project end-to-end.

Here is an overview of what we are going to cover:

  1. Installing the Python and SciPy platform.
  2. Loading the dataset.
  3. Summarizing the dataset.
  4. Visualizing the dataset.
  5. Evaluating some algorithms.
  6. Making some predictions.

Take your time. Work through each step.

Try to type in the commands yourself or copy-and-paste the commands to speed things up.

If you have any questions at all, please leave a comment at the bottom of the post.


1. Downloading, Installing and Starting Python SciPy

Get the Python and SciPy platform installed on your system if it is not already.

I do not want to cover this in great detail, because others already have. This is already pretty straightforward, especially if you are a developer. If you do need help, ask a question in the comments.

1.1 Install SciPy Libraries

This tutorial assumes Python version 2.7 or 3.5.

There are 5 key libraries that you will need to install. Below is a list of the Python SciPy libraries required for this tutorial:

  • scipy
  • numpy
  • matplotlib
  • pandas
  • sklearn

There are many ways to install these libraries. My best advice is to pick one method then be consistent in installing each library.

The scipy installation page provides excellent instructions for installing the above libraries on multiple platforms, such as Linux, Mac OS X and Windows. If you have any doubts or questions, refer to that guide; it has been followed by thousands of people.

  • On Mac OS X, you can use MacPorts to install Python 2.7 and these libraries. For more information on MacPorts, see the homepage.
  • On Linux, you can use your package manager, such as yum on Fedora, to install RPMs.

If you are on Windows or you are not confident, I would recommend installing the free version of Anaconda that includes everything you need.

Note: This tutorial assumes you have scikit-learn version 0.18 or higher installed.

Need more help? See one of these tutorials:

1.2 Start Python and Check Versions

It is a good idea to make sure your Python environment was installed successfully and is working as expected.

The script below will help you test out your environment. It imports each library required in this tutorial and prints the version.

Open a command line and start the python interpreter:
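On most platforms, that is a single command:

python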

I recommend working directly in the interpreter, or writing your scripts and running them on the command line, rather than using big editors and IDEs. Keep things simple and focus on the machine learning, not the toolchain.

Type or copy and paste the following script:
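A minimal sketch of such a script is below; it prints the version of Python itself and of each of the five libraries listed above:

# Check the versions of libraries
# Python version
import sys
print('Python: %s' % sys.version)
# scipy
import scipy
print('scipy: %s' % scipy.__version__)
# numpy
import numpy
print('numpy: %s' % numpy.__version__)
# matplotlib
import matplotlib
print('matplotlib: %s' % matplotlib.__version__)
# pandas
import pandas
print('pandas: %s' % pandas.__version__)
# scikit-learn
import sklearn
print('sklearn: %s' % sklearn.__version__)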

Here is the output I get on my OS X workstation:

Compare the above output to your versions.

Ideally, your versions should match or be more recent. The APIs do not change quickly, so do not be too concerned if you are a few versions behind; everything in this tutorial will very likely still work for you.

If you get an error, stop. Now is the time to fix it.

If you cannot run the above script cleanly you will not be able to complete this tutorial.

My best advice is to Google search for your error message or post a question on Stack Exchange.

2. Load The Data

We are going to use the iris flowers dataset. This dataset is famous because it is used as the “hello world” dataset in machine learning and statistics by pretty much everyone.

The dataset contains 150 observations of iris flowers. There are four columns of measurements of the flowers in centimeters. The fifth column is the species of the flower observed. All observed flowers belong to one of three species.

You can learn more about this dataset on Wikipedia.

In this step we are going to load the iris data from a CSV file URL.

2.1 Import libraries

First, let’s import all of the modules, functions and objects we are going to use in this tutorial.
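A sketch of the imports used in the rest of the tutorial; it assumes a recent pandas that exposes scatter_matrix under pandas.plotting (older releases kept it in pandas.tools.plotting):

# Load libraries
import pandas
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt
from sklearn import model_selection
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC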

Everything should load without error. If you have an error, stop. You need a working SciPy environment before continuing. See the advice above about setting up your environment.

2.2 Load Dataset

We can load the data directly from the UCI Machine Learning repository.

We are using pandas to load the data. We will also use pandas next to explore the data both with descriptive statistics and data visualization.

Note that we are specifying the names of each column when loading the data. This will help later when we explore the data.
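A sketch of the loading code; the column names are our own choice, matching the four measurements and the class label:

# Load dataset from the UCI Machine Learning repository
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = pandas.read_csv(url, names=names)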

The dataset should load without incident.

If you do have network problems, you can download the iris.data file into your working directory and load it using the same method, changing the URL to the local file name.

3. Summarize the Dataset

Now it is time to take a look at the data.

In this step we are going to take a look at the data a few different ways:

  1. Dimensions of the dataset.
  2. Peek at the data itself.
  3. Statistical summary of all attributes.
  4. Breakdown of the data by the class variable.

Don’t worry, each look at the data is one command. These are useful commands that you can use again and again on future projects.

3.1 Dimensions of Dataset

We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.
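For example:

# shape: a (rows, columns) tuple
print(dataset.shape)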

You should see 150 instances and 5 attributes:

(150, 5)

3.2 Peek at the Data

It is also always a good idea to actually eyeball your data.
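The head() function shows the first n rows; for example:

# peek at the first 20 rows of the data
print(dataset.head(20))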

You should see the first 20 rows of the data:

3.3 Statistical Summary

Now we can take a look at a summary of each attribute.

This includes the count, mean, min and max values, as well as some percentiles.
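The describe() function reports all of these in one call:

# statistical summary of each attribute
print(dataset.describe())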

We can see that all of the numerical values have the same scale (centimeters) and similar ranges between 0 and 8 centimeters.

3.4 Class Distribution

Let’s now take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count.
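One way is to group by the class column and count the rows in each group:

# class distribution
print(dataset.groupby('class').size())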

We can see that each class has the same number of instances (50 or 33% of the dataset).

4. Data Visualization

We now have a basic idea about the data. We need to extend that with some visualizations.

We are going to look at two types of plots:

  1. Univariate plots to better understand each attribute.
  2. Multivariate plots to better understand the relationships between attributes.

4.1 Univariate Plots

We start with some univariate plots, that is, plots of each individual variable.

Given that the input variables are numeric, we can create box and whisker plots of each.
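A sketch using the pandas plot() function (this assumes the matplotlib.pyplot import from section 2.1):

# box and whisker plots, one per input variable
dataset.plot(kind='box', subplots=True, layout=(2, 2), sharex=False, sharey=False)
plt.show()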

This gives us a much clearer idea of the distribution of the input attributes:

Box and Whisker Plots

We can also create a histogram of each input variable to get an idea of the distribution.
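For example:

# histograms of each input variable
dataset.hist()
plt.show()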

It looks like perhaps two of the input variables have a Gaussian distribution. This is useful to note as we can use algorithms that can exploit this assumption.

Histogram Plots

4.2 Multivariate Plots

Now we can look at the interactions between the variables.

First, let’s look at scatterplots of all pairs of attributes. This can be helpful to spot structured relationships between input variables.
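A sketch using the scatter_matrix function imported in section 2.1:

# scatter plot matrix of all attribute pairs
scatter_matrix(dataset)
plt.show()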

Note the diagonal grouping of some pairs of attributes. This suggests a high correlation and a predictable relationship.

Scatterplot Matrix

5. Evaluate Some Algorithms

Now it is time to create some models of the data and estimate their accuracy on unseen data.

Here is what we are going to cover in this step:

  1. Separate out a validation dataset.
  2. Set-up the test harness to use 10-fold cross validation.
  3. Build 6 different models to predict species from flower measurements.
  4. Select the best model.

5.1 Create a Validation Dataset

We need to know that the model we created is any good.

Later, we will use statistical methods to estimate the accuracy of the models that we create on unseen data. We also want a more concrete estimate of the accuracy of the best model on unseen data by evaluating it on actual unseen data.

That is, we are going to hold back some data that the algorithms will not get to see and we will use this data to get a second and independent idea of how accurate the best model might actually be.

We will split the loaded dataset into two, 80% of which we will use to train our models and 20% that we will hold back as a validation dataset.
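A sketch using train_test_split from scikit-learn; the fixed random_state simply makes the split reproducible:

# Split-out a validation dataset: 80% train, 20% validation
array = dataset.values
X = array[:, 0:4]
Y = array[:, 4]
X_train, X_validation, Y_train, Y_validation = model_selection.train_test_split(
    X, Y, test_size=0.20, random_state=7)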

You now have training data in X_train and Y_train for preparing models, and X_validation and Y_validation sets that we can use later.

5.2 Test Harness

We will use 10-fold cross validation to estimate accuracy.

This will split our dataset into 10 parts, train on 9 and test on 1 and repeat for all combinations of train-test splits.
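The harness itself is just two settings that every model below will share:

# Test options and evaluation metric
seed = 7
scoring = 'accuracy'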

The specific random seed does not matter. You can learn more about pseudorandom number generators here:

We are using the metric of ‘accuracy’ to evaluate models. This is the ratio of the number of correctly predicted instances divided by the total number of instances in the dataset, multiplied by 100 to give a percentage (e.g. 95% accurate). We will use the scoring variable when we build and evaluate each model next.

5.3 Build Models

We don’t know which algorithms would be good on this problem or what configurations to use. We get an idea from the plots that some of the classes are partially linearly separable in some dimensions, so we are expecting generally good results.

Let’s evaluate 6 different algorithms:

  • Logistic Regression (LR)
  • Linear Discriminant Analysis (LDA)
  • K-Nearest Neighbors (KNN)
  • Classification and Regression Trees (CART)
  • Gaussian Naive Bayes (NB)
  • Support Vector Machines (SVM)

This is a good mixture of simple linear (LR and LDA) and nonlinear (KNN, CART, NB and SVM) algorithms. We reset the random number seed before each run to ensure that each algorithm is evaluated using exactly the same data splits. This makes the results directly comparable.

Let’s build and evaluate our six models:
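A sketch of the evaluation loop; note that recent scikit-learn versions require shuffle=True when KFold is given a random_state:

# Spot-check the six algorithms
models = []
models.append(('LR', LogisticRegression()))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn with 10-fold cross validation
results = []
names = []
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, shuffle=True, random_state=seed)
    cv_results = model_selection.cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    print('%s: %f (%f)' % (name, cv_results.mean(), cv_results.std()))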

5.4 Select Best Model

We now have 6 models and accuracy estimations for each. We need to compare the models to each other and select the most accurate.

Running the example above, we get the following raw results:

Note: your results may differ. For more on this, see the post:

We can see that it looks like KNN has the largest estimated accuracy score.

We can also create a plot of the model evaluation results and compare the spread and the mean accuracy of each model. There is a population of accuracy measures for each algorithm because each algorithm was evaluated 10 times (10 fold cross validation).
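A sketch of the comparison plot, reusing the results and names lists collected in the loop above:

# Compare the spread of cross validation scores for each algorithm
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()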

You can see that the box and whisker plots are squashed at the top of the range, with many samples achieving 100% accuracy.

Compare Algorithm Accuracy

6. Make Predictions

The KNN algorithm was the most accurate model that we tested. Now we want to get an idea of the accuracy of the model on our validation set.

This will give us an independent final check on the accuracy of the best model. It is valuable to keep a validation set just in case you made a slip during training, such as overfitting to the training set or a data leak. Both will result in an overly optimistic result.

We can run the KNN model directly on the validation set and summarize the results as a final accuracy score, a confusion matrix and a classification report.
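A sketch of the final evaluation; it fits KNN on the full training set and scores it against the held-back validation data:

# Make predictions on the validation dataset
knn = KNeighborsClassifier()
knn.fit(X_train, Y_train)
predictions = knn.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))
print(classification_report(Y_validation, predictions))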

We can see that the accuracy is 0.9 or 90%. The confusion matrix provides an indication of the three errors made. Finally, the classification report provides a breakdown of each class by precision, recall, f1-score and support showing excellent results (granted the validation dataset was small).

You can learn more about how to make predictions and predict probabilities here:

You Can Do Machine Learning in Python

Work through the tutorial above. It will take you 5-to-10 minutes, max!

You do not need to understand everything (at least not right now). Your goal is to run through the tutorial end-to-end and get a result; you do not need to understand everything on the first pass. Write down your questions as you go. Make heavy use of the help("FunctionName") syntax in Python to learn about all of the functions that you’re using.

You do not need to know how the algorithms work. It is important to know about the limitations and how to configure machine learning algorithms. But learning about algorithms can come later. You need to build up this algorithm knowledge slowly over a long period of time. Today, start off by getting comfortable with the platform.

You do not need to be a Python programmer. The syntax of the Python language can be intuitive, even if you are new to it. Just like other languages, focus on function calls (e.g. function()) and assignments (e.g. a = "b"). This will get you most of the way. You are a developer; you know how to pick up the basics of a language quickly. Just get started and dive into the details later.

You do not need to be a machine learning expert. You can learn about the benefits and limitations of various algorithms later, and there are plenty of posts that you can read later to brush up on the steps of a machine learning project and the importance of evaluating accuracy using cross validation.

What about the other steps in a machine learning project? We did not cover all of the steps in a machine learning project, because this is your first project and we need to focus on the key steps: loading data, looking at the data, evaluating some algorithms and making some predictions. In later tutorials we can look at other data preparation and result-improvement tasks.

Summary

In this post, you discovered step-by-step how to complete your first machine learning project in Python.

You discovered that completing a small end-to-end project from loading the data to making predictions is the best way to get familiar with a new platform.

Your Next Step

Ready to work through the tutorial?

  1. Work through the above tutorial.
  2. List any questions you have.
  3. Search or research the answers.
  4. Remember, you can use the help(“FunctionName”) in Python to get help on any function.

Do you have a question?
Post it in the comments below.

How to Set Up a Python Environment for Machine Learning and Deep Learning with Anaconda

On some platforms, installing a Python machine learning environment can be a hassle.

First you have to install Python itself, and then install many packages, which can easily confuse a beginner.

In this tutorial, you will learn how to set up a Python machine learning development environment with Anaconda.

After completing this tutorial, you will have a working Python environment that lets you learn, practice and develop machine learning and deep learning software.

These instructions are suitable for Windows, Mac OS X and Linux. I will demonstrate them on OS X, so you may see some Mac dialogs and file extensions.

  • Update Mar/2017: Note: you need Theano or TensorFlow installed to use Keras for deep learning.

Tutorial Overview

In this tutorial, we will cover the following steps:

  1. Download Anaconda
  2. Install Anaconda
  3. Start and Update Anaconda
  4. Update the scikit-learn Library
  5. Install Deep Learning Libraries

1. Download Anaconda

In this step, we will download Anaconda Python for your platform.

Anaconda is a free and easy-to-use scientific Python environment.


Choose the download suitable for your platform (Windows, OSX or Linux):

  • Choose Python 3.5
  • Choose the Graphical Installer


Download the Anaconda Python package to your workstation.

I am on OS X, so I chose the OS X version. The file is about 426 MB.

You should end up with a file with a name like:

Anaconda3-4.2.0-MacOSX-x86_64.pkg

2. Install Anaconda

In this step, we will install the Anaconda Python software on your system.

This step assumes you have sufficient administrative privileges to install software on your system.

  • 1. Double-click the downloaded file.
  • 2. Follow the installation wizard.


The installation should go smoothly; you should not encounter any tricky problems.


The installation takes less than 10 minutes and uses about 1 GB of hard disk space.

3. Start and Update Anaconda

In this step, we will confirm that your Anaconda Python environment is up to date.

Anaconda comes with a suite of graphical tools called Anaconda Navigator. You can open Anaconda Navigator from your application launcher.

You can learn all about Anaconda Navigator here.

We will use Anaconda Navigator and graphical development environments later; for now, I recommend starting with the Anaconda command-line environment, called conda.

Conda is fast and simple, error messages are shown in full, and you can quickly confirm that your environment is installed and working correctly.

  • 1. Open a terminal (command-line window).
  • 2. Confirm that conda is installed correctly, by typing:

conda -V

You should see the following (or something similar):

conda 4.2.9

  • 3. Confirm that Python is installed correctly, by typing:

python -V

You should see the following (or something similar):

Python 3.5.2 :: Anaconda 4.2.0 (x86_64)


If these commands do not work or report an error, check the documentation for your platform.

See also some of the resources in the “Further Reading” section.

  • 4. Confirm that your conda environment is up to date, by typing:

conda update conda
conda update anaconda

You may need to install updates to some packages.

  • 5. Confirm your SciPy environment.

The script below will print the version numbers of the key SciPy libraries you need for machine learning development: SciPy, NumPy, Matplotlib, Pandas, Statsmodels and Scikit-learn.

You can type "python" and enter the commands directly; instead, I recommend opening a text editor and copying the script into a file.

# scipy
import scipy
print('scipy: %s' % scipy.__version__)
# numpy
import numpy
print('numpy: %s' % numpy.__version__)
# matplotlib
import matplotlib
print('matplotlib: %s' % matplotlib.__version__)
# pandas
import pandas
print('pandas: %s' % pandas.__version__)
# statsmodels
import statsmodels
print('statsmodels: %s' % statsmodels.__version__)
# scikit-learn
import sklearn
print('sklearn: %s' % sklearn.__version__)

Save the script as a file with the name versions.py.

On the command line, change directory to the location where you saved the script and type:

python versions.py

You should see output like the following:

scipy: 0.18.1
numpy: 1.11.1
matplotlib: 1.5.3
pandas: 0.18.1
statsmodels: 0.6.1
sklearn: 0.17.1


4. Update the scikit-learn Library

In this step, we will update the main library used for machine learning in Python, called scikit-learn.

  • 1. Update scikit-learn to the latest version.

At the time of writing, the version of scikit-learn shipped with Anaconda is out of date (0.17.1 instead of 0.18.1). You can update a specific library using the conda command; below is an example of updating scikit-learn to the latest version.

Type:

conda update scikit-learn


You can also update a library to a specific version by typing:

conda install -c anaconda scikit-learn=0.18.1

To confirm that the installation was successful, re-run the versions.py script by typing:

python versions.py

You should see output like the following:

scipy: 0.18.1
numpy: 1.11.3
matplotlib: 1.5.3
pandas: 0.18.1
statsmodels: 0.6.1
sklearn: 0.18.1

You can use these commands to update the machine learning and SciPy libraries as needed.

Try a scikit-learn tutorial, such as the step-by-step iris project above.

5. Install Deep Learning Libraries

In this step, we will install the Python libraries used for deep learning, specifically: Theano, TensorFlow and Keras.

Note: I recommend using Keras for deep learning, and Keras only requires one of Theano or TensorFlow to be installed. Installing TensorFlow can be problematic on some Windows systems.

  • 1. Install the Theano deep learning library by typing:

conda install theano

  • 2. Install the TensorFlow deep learning library (all except Windows) by typing:

conda install -c conda-forge tensorflow

Alternatively, you may choose to install using pip and a specific version of tensorflow for your platform.

See the tensorflow installation instructions for details.

  • 3. Install Keras by typing:

pip install keras

  • 4. Confirm that your deep learning environment is installed and working correctly.

Create a script that prints the version number of each library, as we did above for the SciPy environment.

# theano
import theano
print('theano: %s' % theano.__version__)
# tensorflow
import tensorflow
print('tensorflow: %s' % tensorflow.__version__)
# keras
import keras
print('keras: %s' % keras.__version__)

Save the script to a file named deep_versions.py. Type the following command to run the script:

python deep_versions.py

You should see output like:

theano: 0.8.2.dev-901275534cbfe3fbbe290ce85d1abf8bb9a5b203
tensorflow: 0.12.1
Using TensorFlow backend.
keras: 1.2.1


Try out a Keras deep learning tutorial.

Further Reading

This section provides some links for further reading.

Summary

Congratulations! You now have a working Python development environment for machine learning and deep learning.

You can now learn and practice machine learning and deep learning on your workstation.