
Regression Using Random Forest on a Custom Dataset

Supervised learning involves training a model on a labeled dataset to predict outcomes. Today, we’ll explore this using a Medical Cost dataset.

Sumit Pandey

Jun 28, 2024 — 3 min read

Hi, welcome back! Today we will learn about supervised learning and get some hands-on practice with the Medical Cost dataset. The tutorial is broken down into the sections listed below.

Before you go ahead, please click the button (play in Colab), copy the code, download the dataset provided in this tutorial, and have fun 😸.

Open In Colab Button

Supervised Learning
Dataset
Preprocessing
Model and Training
Results

Supervised Learning

Supervised learning is a type of machine learning where the model is trained on a labeled dataset. The goal is for the model to learn the mapping from input features to the output label. In this tutorial, we will use a Random Forest regression model to predict medical costs based on various features.

Dataset

The first step is to download the dataset. We will use the Medical Cost dataset, which contains the columns age, sex, BMI, children, smoker status, region, and charges. The goal is to predict the 'charges' column from the other features.

Download Dataset: Medical Cost dataset.csv (54 KB)

Load and Visualize Dataset

First, import the required libraries:

# Import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score

Next, we load the CSV file with pandas and print its first few rows to understand the file structure.

# Load the dataset (change the filename if your downloaded CSV is named differently)
data = pd.read_csv('insurance 2.csv')

# Display the first few rows of the dataset
print(data.head())

Preprocessing

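Before training our model, we need to preprocess the data. Preprocessing typically includes handling missing values, encoding categorical variables, and splitting the data into training and testing sets. The implementation in this tutorial performs the encoding and the train/test split.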
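The tutorial code does not include a missing-value step, so here is a minimal sketch of how one might check for gaps and fill them before encoding. The rows in `sample` are made up to mimic the dataset's columns, and filling with the median is only one possible choice, not necessarily what this dataset needs.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up rows that mimic the dataset's columns (not real data)
sample = pd.DataFrame({
    "age": [19, 33, None, 45],
    "sex": ["female", "male", "male", "female"],
    "bmi": [27.9, 22.7, 30.1, None],
    "children": [0, 1, 2, 3],
    "smoker": ["yes", "no", "no", "yes"],
    "region": ["southwest", "northwest", "southeast", "northeast"],
    "charges": [17000.0, 4500.0, 11000.0, 39000.0],
})

# Count missing values in each column
missing_counts = sample.isnull().sum()
print(missing_counts)

# Assumption: fill numeric gaps with the column median
numeric_cols = ["age", "bmi"]
sample[numeric_cols] = sample[numeric_cols].fillna(sample[numeric_cols].median())

# Encode one categorical column as integers (classes are sorted alphabetically)
le = LabelEncoder()
sample["smoker"] = le.fit_transform(sample["smoker"])
print(list(le.classes_))
```

On the real data, running `data.isnull().sum()` right after loading would show whether any filling is needed at all.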

Model and Training

We will use the Random Forest regression model for our task. This model is an ensemble learning method that constructs multiple decision trees during training and outputs the average prediction of the individual trees for regression tasks.
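To make the "average of the individual trees" idea concrete, the sketch below trains a small forest on synthetic data and compares the forest's prediction with a manual average over its trees. The data, the variable names, and the choice of 25 trees are assumptions for illustration only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 3))
y_demo = 2.0 * X_demo[:, 0] + X_demo[:, 1] ** 2 + rng.normal(0, 1, size=200)

forest = RandomForestRegressor(n_estimators=25, random_state=0)
forest.fit(X_demo, y_demo)

# Ask every fitted tree for its own prediction on a few rows
X_new = X_demo[:5]
tree_preds = np.array([tree.predict(X_new) for tree in forest.estimators_])

# Averaging over the trees should reproduce the forest's prediction
manual_average = tree_preds.mean(axis=0)
forest_pred = forest.predict(X_new)
print(np.allclose(manual_average, forest_pred))
```

Each tree is trained on a bootstrap sample of the rows, so individual trees disagree; averaging them is what reduces the variance of the final prediction.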

Results

Finally, we will evaluate the performance of our model using metrics such as Mean Absolute Error (MAE) and R-squared (R²).
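As a quick illustration of what these two metrics measure, the following sketch computes them by hand with NumPy on four made-up values and checks the result against scikit-learn. The numbers are invented for the example and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up true values and predictions
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 330.0, 380.0])

# MAE: the average absolute difference between truth and prediction
mae_manual = np.mean(np.abs(y_true - y_hat))

# R-squared: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mae_manual, mean_absolute_error(y_true, y_hat))
print(r2_manual, r2_score(y_true, y_hat))
```

MAE is expressed in the same unit as the target (here, charges), while R² is unitless; a value of 1 would mean perfect predictions and 0 means no better than always predicting the mean.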

Let's start coding!

Step-by-Step Implementation

# Preprocess the data
# Encode categorical variables
label_encoders = {}
for column in ['sex', 'smoker', 'region']:
    le = LabelEncoder()
    data[column] = le.fit_transform(data[column])
    label_encoders[column] = le

# Split the data into features and target
X = data.drop(columns='charges')
y = data['charges']

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialize and train the Random Forest regressor
model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Make predictions on the test set
y_pred = model.predict(X_test)

# Evaluate the model
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f'Mean Absolute Error: {mae}')
print(f'R-squared: {r2}')

# Feature importance
feature_importances = model.feature_importances_
for name, importance in zip(X.columns, feature_importances):
    print(f'{name}: {importance:.4f}')

Explanation

Import Libraries: We import the necessary libraries for data manipulation, model training, and evaluation.
Load Dataset: The dataset is loaded into a Pandas DataFrame.
Preprocessing:
- Categorical variables ('sex', 'smoker', 'region') are encoded using LabelEncoder.
- The dataset is split into features (X) and target (y).
- The data is further split into training and testing sets.
Model Training:
- A RandomForestRegressor model is initialized and trained on the training data.
Model Evaluation:
- Predictions are made on the test set.
- The model's performance is evaluated using Mean Absolute Error and R-squared metrics.
Feature Importance: The importance of each feature in making predictions is printed.
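The implementation stores each fitted LabelEncoder in a dictionary, which is useful when you later want a prediction for a new record. The sketch below shows one way this could work. It uses a handful of made-up rows instead of the real CSV so that it runs on its own, and names such as `encoders`, `small_model`, and `new_person` are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import LabelEncoder

# Made-up training rows with the same columns as the dataset (not real data)
train = pd.DataFrame({
    "age": [19, 33, 52, 45, 28, 61],
    "sex": ["female", "male", "female", "male", "male", "female"],
    "bmi": [27.9, 22.7, 30.8, 36.1, 33.0, 29.1],
    "children": [0, 1, 2, 0, 3, 0],
    "smoker": ["yes", "no", "no", "yes", "no", "no"],
    "region": ["southwest", "northwest", "southeast",
               "northeast", "southeast", "northwest"],
    "charges": [17000.0, 4500.0, 11000.0, 39000.0, 3800.0, 13500.0],
})

# Fit one encoder per categorical column and keep it for later reuse
encoders = {}
for column in ["sex", "smoker", "region"]:
    le = LabelEncoder()
    train[column] = le.fit_transform(train[column])
    encoders[column] = le

X_small = train.drop(columns="charges")
y_small = train["charges"]
small_model = RandomForestRegressor(n_estimators=10, random_state=0)
small_model.fit(X_small, y_small)

# A new record must be encoded with the SAME fitted encoders
new_person = pd.DataFrame([{
    "age": 40, "sex": "female", "bmi": 26.0,
    "children": 1, "smoker": "no", "region": "southeast",
}])
for column, le in encoders.items():
    new_person[column] = le.transform(new_person[column])

# Keep the column order identical to the training features
new_person = new_person[X_small.columns]
predicted_charge = small_model.predict(new_person)[0]
print(f"Predicted charge: {predicted_charge:.2f}")
```

Note that `le.transform` raises a ValueError for a category it did not see during fitting, so the new record can only contain values that appeared in the training data.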

