Exploring Machine Learning with Iris Dataset: A Deep Dive into K-Nearest Neighbors

Introduction

Let's talk about a machine learning technique called the K-Nearest Neighbors (KNN) algorithm. In this article, we'll use the well-known Iris dataset to show how KNN can be used for classification. The Iris dataset is like a classic puzzle in the world of machine learning, and we'll use it to learn the fundamentals of classification methods.

Understanding the Iris Dataset

The Iris dataset contains 150 samples of iris flowers, each belonging to one of three species: setosa, versicolor, or virginica. For each sample, four features are measured: sepal length, sepal width, petal length, and petal width. Our goal is to classify the species of iris flowers based on these features.
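If you want to take a quick look at the data yourself, scikit-learn ships a bundled copy of Iris; here is a minimal sketch for inspecting it (the walkthrough below loads the same data from a CSV file instead):

# Peek at scikit-learn's bundled copy of the Iris dataset
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.feature_names)  # the four measured features
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
print(iris.data.shape)     # (150, 4): 150 samples, 4 features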

Implementing K-Nearest Neighbors

KNN classifies a new input by finding the k nearest data points in the training set and assigning the majority class among those neighbors. With a popular Python library like scikit-learn, implementing KNN is straightforward. After preparing the data, tuning k with cross-validation, and training the model, we achieved a test accuracy of about 97.78%.
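To make the "majority vote among the k nearest neighbors" idea concrete, here is a minimal from-scratch sketch of the prediction step (illustration only; the walkthrough below uses scikit-learn's optimized KNeighborsClassifier):

import numpy as np

def knn_predict(x_train, y_train, x_new, k=5):
    # Euclidean distance from the new point to every training point
    distances = np.linalg.norm(x_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of those neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]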

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Load the Iris data from a local CSV file and take a first look
dataset = pd.read_csv('Iris.csv')
dataset.head()

# Check for missing values
dataset.isnull().sum()

# Inspect the column names
dataset.columns

# Feature matrix: the four measurements; target: the species labels
x = dataset[["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]].values
y = dataset.Species.values  # keep y one-dimensional; no reshape needed

# Hold out 30% of the samples as a test set
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.30, random_state=4)

# Use 5-fold cross-validation on the training set to choose k
from sklearn.model_selection import cross_val_score

k_range = range(1, 11)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, x_train, y_train, cv=5, scoring='accuracy')
    k_scores.append(scores.mean())

# Report the mean cross-validated accuracy for each k
for i, score in enumerate(k_scores, start=1):
    print("k =", i, "score =", score)

best_k = np.argmax(k_scores) + 1
print("Best k value =", best_k)

# Train the final classifier with the best k found above
clf = KNeighborsClassifier(n_neighbors=best_k)
clf.fit(x_train, y_train)
y_pred = clf.predict(x_test)

print("Accuracy is :",metrics.accuracy_score(y_test,y_pred))
print("F1 Score:",metrics.f1_score(y_test, y_pred, average='macro'))
print("Confusion Matrix:\n",metrics.confusion_matrix(y_test,y_pred))
print("Recall:",metrics.recall_score(y_test, y_pred, average='macro'))
print("Precision:",metrics.precision_score(y_test, y_pred, average='macro'))

Model Performance and Application

  1. Accuracy: 0.9778, or about 97.78%. Accuracy is the proportion of correctly classified instances out of all instances; our model correctly classified about 97.78% of the test flowers.

  2. F1 Score: 0.971. For each class, the F1 score is the harmonic mean of precision and recall, so it accounts for both false positives and false negatives; the value reported here is the macro average of the per-class F1 scores (see the short sketch after this list). Our model's F1 score of about 0.971 indicates a good balance between precision and recall.

  3. Recall: 0.9667. Recall, also known as sensitivity, is the proportion of actual instances of a class that the model correctly identified, macro-averaged over the three species. Our model correctly identified about 96.67% of them on average.

  4. Precision: 0.9778. Precision is the proportion of predicted instances of a class that actually belong to it, again macro-averaged. Our model's precision of about 97.78% means that when it predicted a flower as a particular species, it was right about 97.78% of the time.
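To see where the F1 number comes from: for a single class, F1 is the harmonic mean of that class's precision and recall, and the macro F1 above averages the per-class F1 scores (which is why it is not exactly the harmonic mean of the macro precision and macro recall). A tiny sketch with hypothetical per-class values:

# F1 for one class: harmonic mean of precision (p) and recall (r)
p, r = 0.95, 1.00  # hypothetical values, for illustration only
print(2 * p * r / (p + r))  # ~0.9744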

Based on these metrics, our KNN model performs very well at classifying iris flowers. An accuracy of 97.78% suggests that the model is reliable, and the balance between precision and recall (as indicated by the F1 score) is quite good. This level of performance makes the model a reasonable candidate for real-world applications such as botanical research and automated plant recognition systems.
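As a simple illustration of how such a system would use the model, the trained classifier can label a new measurement directly (the values below are hypothetical):

# Predict the species of a new, hypothetical flower
# (sepal length, sepal width, petal length, petal width, in cm)
new_flower = [[5.1, 3.5, 1.4, 0.2]]
print(clf.predict(new_flower))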

Conclusion

In this blog post, we explored the fundamentals of the K-Nearest Neighbors algorithm and its application to the Iris dataset. Achieving an accuracy of about 97.78% demonstrates the algorithm's capability on classification tasks of this kind. We've learned how to implement KNN with scikit-learn, how to choose k with cross-validation, and where such a model could be applied in real-world scenarios.