Classification problem with PyCaret

Jane Alam
4 min readAug 26, 2020

Overview

PyCaret is an open-source, low-code machine learning library in Python that aims to reduce the cycle time from hypothesis to insights. It is well suited for seasoned data scientists who want to increase the productivity of their ML experiments by using PyCaret in their workflows or for citizen data scientists and those new to data science with little or no background in coding. PyCaret allows you to go from preparing your data to deploying your model within seconds using your choice of notebook environment. Please choose your track below to continue learning more about PyCaret.

link:- https://pycaret.org/

Table of Contents

  1. What is PyCaret and Why Should you Use it?
  2. Installing PyCaret
  3. Problem statement and Dataset
  4. Training Machine Learning Model using PyCaret
  5. Analyze Model

What is PyCaret and Why Should you Use it?

PyCaret is an open-source, machine learning library in Python that helps you from data preparation to model deployment. It is easy to use and you can do almost every data science project task with just one line of code.

I have found PyCaret useful. Here are two reasons why:

  • PyCaret being a low-code library makes you more productive. With less time spent coding, you and your team can now focus on business problems.
  • PyCaret is simple and easy to use machine learning library that will help you to perform end-to-end ML experiments with fewer lines of code.

Installing PyCaret

Problem statement and Dataset

In this article, we are going to solve a classification problem. We have a breast cancer dataset with features like radius_mean, area_mean, texture_mean, diagnosis whether patients have Malignant or Benign cancer. We will build a machine learning model that will help doctors to identify the cancers malignant or benign.

The dataset has 569 rows and 31 columns. You can find the complete code and dataset used in this

https://github.com/alamjane/Breast-Cancer-Detection/blob/master/EDA/AutoML/Breast_Cancer_Pycaret.ipynb.

Let’s start by reading the dataset using the Pandas library:

import pandas as pd

path = "https://raw.githubusercontent.com/alamjane/Project/master/Breast%20Cancer.csv"
df = pd.read_csv(path)


df.head()

Initializing the Setup: In this step, PyCaret performs some basic preprocessing tasks, like ignoring the ID, imputing the missing values, encoding the categorical variables, and splitting the dataset into the train-test split for the rest of the modeling steps. When you run the setup function, it will first confirm the data types, and then if you press enter, it will create the environment for you to go-ahead

# Importing module and initializing setup
from pycaret.classification import *
clf1 = setup(data = df, target = 'diagnosis')

Training Machine Learning Model using PyCaret

Training a model in PyCaret is very simple. You need to use the create_models function that takes just the one parameter

compare_models()

Here Ada Boost Classifier has better accuracy than other algorithms. For training the Ada Boost model, you just need to pass the string “ada”:

ada = create_model('ada')

Hyperparameter tuning

We can tune the hyperparameters of a machine learning model by just using the tune_model function which takes one parameter — the model abbreviation string (the same as we used in the create_model function).

Let’s tune it.

tuned_ada = tune_model('ada')

Analyze Model

Now, after training the model, the next step is to analyze the results. Analyzing a model in PyCaret is again very simple. Just a single line of code and you can do

Plot Model Results

You can plot model results by providing the model object as the parameter and the type of plot you want. Let’s plot the AUC-ROC curve and decision boundary:

AUC Curve

plot_model(estimator = tuned_ada, plot = 'auc')

Learning Curve for Ada Boost Classifier

plot_model(estimator = tuned_ada, plot = 'learning')

Confusion Matrix

plot_model(estimator = tuned_ada, plot = 'confusion_matrix')

Conclusion

It really is that easy to use. I’ve personally found PyCaret to be quite useful for generating quick results.

Practice using it on different types of datasets — you’ll truly get more interest and it will help.

If you have any suggestions/feedback related to the article, please post them in the comments section below. I look forward to hearing about your experience using PyCaret as well.

--

--

Jane Alam

I am Sr. Data Scientist working in Birlasoft | Writing about Gen AI, Data Science, AI, ML, DL, Stats, Math