Classification using Random Forest

Learn supervised learning with Random Forest using the iris dataset. This tutorial covers data preprocessing, model training, and results visualization using Python.


Hi, today we will learn about supervised learning using Random Forest, and we will also practically play with the iris dataset. Below, I'll break down the tutorial into several sections:

  1. Supervised Learning
  2. Dataset
  3. Preprocessing
  4. Model and Training
  5. Results

1. Supervised Learning

Supervised learning is a type of machine learning where the model is trained on a labeled dataset. The dataset consists of input-output pairs, and the goal is for the model to learn to map inputs to outputs. This training allows the model to make predictions or decisions without being explicitly programmed to perform the task. In this article we will classify the iris dataset using a Random Forest model. Random Forest is a machine learning method that creates many decision trees and combines their results to make better predictions. This approach improves accuracy and reduces errors by averaging the outcomes of multiple trees (see the example below).

Building a Random Forest Model: how many decision trees are combined into a single prediction
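
To make the voting idea concrete, here is a tiny illustrative sketch (not the article's original code; it uses scikit-learn's built-in copy of the iris data): each decision tree is trained on a bootstrap sample, and the forest's prediction is the majority vote of the trees.

# Toy sketch of the Random Forest idea: several decision trees trained on
# bootstrap samples, combined by majority vote (illustration only; later we
# simply use sklearn's RandomForestClassifier, which does this for us).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X_toy, y_toy = load_iris(return_X_y=True)
rng = np.random.default_rng(42)

trees = []
for _ in range(5):
    idx = rng.integers(0, len(X_toy), size=len(X_toy))   # bootstrap sample (with replacement)
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=0)
    tree.fit(X_toy[idx], y_toy[idx])
    trees.append(tree)

votes = np.array([t.predict(X_toy) for t in trees])       # shape: (n_trees, n_samples)
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
print("toy forest training accuracy:", (majority == y_toy).mean())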

2. Dataset

Download Dataset

The first step is to download the dataset; you can either download it from the link below or get it from Kaggle. I have inserted the data here just for ease.
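
If you would rather not download the CSV by hand, an optional alternative is to rebuild a similar file from scikit-learn's built-in copy of the iris data. The column and species names below are chosen to mimic the Kaggle file and are an assumption, not the original download.

# Optional: recreate an iris.csv similar to the Kaggle file from sklearn's
# built-in iris data (column/species names mimic the Kaggle format and are
# an assumption, not the article's original source).
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris(as_frame=True)
df = iris.frame.rename(columns={
    'sepal length (cm)': 'SepalLengthCm',
    'sepal width (cm)': 'SepalWidthCm',
    'petal length (cm)': 'PetalLengthCm',
    'petal width (cm)': 'PetalWidthCm',
})
df['Species'] = ['Iris-' + name for name in iris.target_names[iris.target]]
df = df.drop(columns='target')
df.index = df.index + 1                   # Kaggle Ids start at 1
df.to_csv('iris.csv', index_label='Id')   # the rest of the tutorial reads this file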

Now let's open Google Colab (a browser-based notebook environment) and import the relevant libraries.

Load and Visualize Dataset

First import all libraries:

# import the seaborn, pandas, matplotlib, plotly and sklearn libraries used in this tutorial
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
from sklearn.model_selection import train_test_split


Data visualization: now we use pandas and seaborn to understand the structure of the CSV file.

df = pd.read_csv('iris.csv', index_col=0)
df.head(4)
First rows and columns of the dataset
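
For a quick look at the column types and the class balance, a couple of extra pandas calls are handy (optional):

# Optional: inspect the column dtypes and the class balance
df.info()                             # column names, dtypes, non-null counts
print(df['Species'].value_counts())   # the standard iris data has 50 samples per class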

Now let's visualize the columns of the dataset using a pairplot:

sns.set_style("whitegrid")
sns.pairplot(df, hue="Species", height=3)
plt.show()
Iris data visualization (pairplot colored by species)

Feature Importance

There is a lot more we could do in terms of visualization (please check the Google Colab file); here, let's use Plotly for feature importance. This plot shows which features are most important for the classification model. Here 'PetalWidthCm' has the highest importance.

from sklearn.ensemble import ExtraTreesClassifier

# Assume df is already loaded with your data
X = df.drop(columns=['Species', 'Id'], errors='ignore')  # features ('Id' is dropped only if it is a column)
y = df['Species']  # target

# Feature selection using Extra Trees Classifier
feat_selection = ExtraTreesClassifier()
feat_selection.fit(X, y)
feat_importances = pd.Series(feat_selection.feature_importances_, index=X.columns)

# Sorting features by importance
sorted_importances = feat_importances.sort_values(ascending=True)

# Creating a bar plot for feature importances
fig = px.bar(sorted_importances, 
             orientation='h', 
             labels={'value': 'Feature Importance', 'index': 'Features'}, 
             title='Feature Importance Ranking',
             color=sorted_importances.values,
             color_continuous_scale='Viridis')

# Enhancing the layout
fig.update_layout(xaxis_title='Feature Importance',
                  yaxis_title='Features',
                  coloraxis_showscale=False, # Turn off color scale if not needed
                  plot_bgcolor='white')

# Display the plot
fig.show()

Feature importance ranking (Plotly bar chart)

3. Preprocessing

This process can have multiple steps depending on the task. This dataset is well filtered and maintained, so we can skip those steps for now; we will discuss them in later articles. For now we will only divide the dataset into training, validation, and test sets.
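
Purely for illustration, this is the kind of quick sanity check you would normally run on a less curated dataset before splitting; on the iris data these checks mostly come back clean.

# Optional sanity checks before splitting (the iris data needs little cleaning)
print(df.isnull().sum())       # missing values per column
print(df.duplicated().sum())   # exact duplicate rows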

# divide the dataset into the training, test and validation sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

# print the shape of the training, test and validation sets
print(X_train.shape, X_test.shape, X_val.shape)


4. Model and Training

Now let's prepare the model and train it with different numbers of estimators.

# build a random forest classifier, train and validate it, and finally test it
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

est = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 60, 70, 80, 90, 100]
train_accuracy_ = []
val_accuracy_ = []
test_accuracy_ = []

for i in range(len(est)):

    # create a random forest classifier model
    rf = RandomForestClassifier(n_estimators=est[i], random_state=42)

    # train the model
    rf.fit(X_train, y_train)

    # evaluate the model on the training set
    y_train_pred = rf.predict(X_train)
    train_accuracy = accuracy_score(y_train, y_train_pred)
    print(f'Training accuracy: {train_accuracy}')
    train_accuracy_.append(train_accuracy)

    # validate the model
    y_val_pred = rf.predict(X_val)
    val_accuracy = accuracy_score(y_val, y_val_pred)
    print(f'Validation accuracy: {val_accuracy}')
    val_accuracy_.append(val_accuracy)

    # test the model
    y_test_pred = rf.predict(X_test)
    test_accuracy = accuracy_score(y_test, y_test_pred)
    print(f'Test accuracy: {test_accuracy}')
    test_accuracy_.append(test_accuracy)


# plot the training, validation and test accuracies with respect to the number of estimators
plt.figure(figsize=(10, 6))
plt.plot(est, train_accuracy_, '-o', label='Training Accuracy')
plt.plot(est, val_accuracy_, '-o', label='Validation Accuracy')
plt.plot(est, test_accuracy_, '-o', label='Test Accuracy')
plt.xticks(est)
plt.xlabel('Number of Estimators')
plt.ylabel('Accuracy')
plt.title('Random Forest Classifier Accuracy')
plt.legend()
plt.show()


Training, validation, and test accuracy versus the number of estimators

What did you notice? All of the models have almost the same training and test accuracy (green line). So choose any model you want (with 10, 20, 25, or 35-90 estimators), because for those the validation accuracy is at its highest along with the training accuracy.
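
If you would rather pick the model programmatically, one option (a small sketch that reuses the lists built in the training loop above) is to select the estimator count with the highest validation accuracy and retrain:

# Pick the number of estimators with the best validation accuracy and retrain
import numpy as np

best_idx = int(np.argmax(val_accuracy_))
best_n = est[best_idx]
print(f'Best n_estimators by validation accuracy: {best_n} '
      f'(validation accuracy = {val_accuracy_[best_idx]:.3f})')

best_rf = RandomForestClassifier(n_estimators=best_n, random_state=42)
best_rf.fit(X_train, y_train)
print(f'Test accuracy with {best_n} trees: '
      f'{accuracy_score(y_test, best_rf.predict(X_test)):.3f}')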

5. Results

In order to visualize the results we use a confusion matrix. A confusion matrix is a table that summarizes the performance of a classification model by showing the counts of true positives, true negatives, false positives, and false negatives. It helps evaluate how well the model distinguishes between different classes.

from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_test_pred, labels=['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])
plt.figure(figsize=(10, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], yticklabels=['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()


Confusion matrix of the classification model on the test set

Finally, get the classification report from the sklearn library:

# get the report of the model
from sklearn.metrics import classification_report
report = classification_report(y_test, y_test_pred)
print(report)


It is amazing, right? Enjoy the article, and I will see you again πŸ˜„