Learn supervised learning with Random Forest using the iris dataset. This tutorial covers data preprocessing, model training, and results visualization using Python.
Hi, today we will learn about supervised learning using Random Forest, and we will also get hands-on practice with the iris dataset. Below, I'll break down the tutorial into several sections:
Supervised Learning
Dataset
Preprocessing
Model and Training
Results
1. Supervised Learning
Supervised learning is a type of machine learning where the model is trained on a labeled dataset. The dataset consists of input-output pairs, and the goal is for the model to learn to map inputs to outputs. This training allows the model to make predictions or decisions without being explicitly programmed to perform the task. In this article, we will classify the iris dataset using a Random Forest model. Random Forest is a machine learning method that builds many decision trees and combines their results to make better predictions. This approach improves accuracy and reduces errors by aggregating (voting over or averaging) the outcomes of multiple trees (see the example below).
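To make the idea concrete, here is a minimal sketch (not the article's original figure) of how scikit-learn's RandomForestClassifier builds an ensemble of decision trees; the parameter values are illustrative placeholders.

# Minimal sketch: a Random Forest is an ensemble of decision trees.
# Each tree is trained on a bootstrap sample of the data, and the forest
# predicts by majority vote across the trees (for classification).
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# n_estimators = number of decision trees in the forest (value is illustrative)
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)

# The individual trees live in forest.estimators_;
# their combined vote is what forest.predict() returns.
print(len(forest.estimators_))   # 100 trees
print(forest.predict(X[:5]))     # class predictions from the ensemble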
2. Dataset
Download Dataset
The first step is to download the dataset; you can either download it from the link below or get it from Kaggle. I have inserted the data here just for ease.
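Since the embedded data and links are not reproduced here, the sketch below loads the iris data in two ways. The file name Iris.csv and its column names follow the Kaggle convention and are assumptions; the load_iris copy ships with scikit-learn, so that path runs without any download.

import pandas as pd
from sklearn.datasets import load_iris

# Option 1: the copy bundled with scikit-learn (no download needed)
iris = load_iris(as_frame=True)
df = iris.frame  # sepal/petal measurements plus a 'target' column

# Option 2: the Kaggle CSV (assumed file name; assumed columns: Id,
# SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm, Species)
# df = pd.read_csv("Iris.csv")

print(df.head())
print(df.shape)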
There is a lot you can do in terms of visualization (please check the Google Colab file); let's use Plotly for feature importance. This plot shows which features matter most to the classification model. Here, 'PetalWidthCm' has the highest importance.
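The actual figure lives in the Colab notebook; below is a hedged sketch of how such a plot can be made with Plotly Express from a fitted forest's feature_importances_. The quick fit here is only to make the snippet self-contained, and scikit-learn's column names differ from the Kaggle CSV's (e.g. 'petal width (cm)' vs 'PetalWidthCm').

import plotly.express as px
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

# Fit a quick forest only so this snippet runs on its own;
# the tutorial trains the model properly in the next sections.
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Bar chart of feature importances; petal width usually dominates for iris.
fig = px.bar(
    x=X.columns,
    y=model.feature_importances_,
    labels={"x": "Feature", "y": "Importance"},
    title="Random Forest feature importance",
)
fig.show()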
3. Preprocessing
This process can have multiple steps depending on the task. This dataset is well filtered and maintained, so we can skip those steps for now; we will discuss them in other articles. For now, we will only divide the dataset into training, test, and validation sets, as shown in the sketch below.
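A minimal sketch of that split, calling train_test_split twice; the 70/15/15 proportions and the random seed are assumptions, not the article's exact numbers.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris(as_frame=True)
X, y = iris.data, iris.target

# First carve out the test set, then split the remainder into train/validation.
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.15 / 0.85, stratify=y_temp, random_state=42
)

print(len(X_train), len(X_val), len(X_test))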
4. Model and Training
Now let's prepare the model and train it with different parameters.
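Here is a hedged sketch of that parameter sweep: it varies n_estimators and records training and validation accuracy. The split mirrors the preprocessing sketch above, and the list of estimator counts is illustrative rather than the article's exact grid.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Recreate the train/validation/test split from the preprocessing step.
iris = load_iris(as_frame=True)
X, y = iris.data, iris.target
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.15 / 0.85, stratify=y_temp, random_state=42
)

# Train one forest per number of estimators and record the accuracies.
for n in [5, 10, 20, 25, 35, 50, 75, 90]:
    model = RandomForestClassifier(n_estimators=n, random_state=42)
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    val_acc = model.score(X_val, y_val)
    print(f"n_estimators={n:3d}  train acc={train_acc:.3f}  val acc={val_acc:.3f}")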
What did you notice? All of them have almost the same training and testing accuracy (green line). So you can choose any of these models (with the number of estimators at 10, 20, 25, or 35-90), because the validation accuracy is highest for all of them, along with the training accuracy.
5. Results
In order to visualize the results, we use a confusion matrix. A confusion matrix is a table that summarizes the performance of a classification model by showing the counts of true positives, true negatives, false positives, and false negatives. It helps evaluate how well the model distinguishes between different classes.
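A hedged sketch of computing and displaying the confusion matrix with scikit-learn; it assumes the chosen model and the X_test/y_test split from the training sketch above, and it defines the y_test_pred used by the classification report below.

from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# `model`, `X_test`, `y_test` come from the training sketch above (assumption).
y_test_pred = model.predict(X_test)

# Raw counts of correct / incorrect predictions per class.
cm = confusion_matrix(y_test, y_test_pred)
print(cm)

# Heatmap-style visualization (uses matplotlib under the hood).
ConfusionMatrixDisplay.from_predictions(y_test, y_test_pred)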
Finally, get the classification report from the sklearn library:
# get the report of the model
from sklearn.metrics import classification_report
report = classification_report(y_test, y_test_pred)
print(report)
It is amazing, right? Enjoy the article, and I will see you again!