This Pandas Trick Will Blow Your Mind As a Data Scientist!
Nov. 3, 2024

Pandas is one of the most widely used data science libraries, but what if I told you that you could automate routine data exploration and run each step with just a click?

In this article, we’ll explore how to do this, but first, let’s look at what the final script will look like.

Final Look

We will build this in eight steps. Here is the finished interface:

[Screenshot of the output]

Here, you can upload any CSV file and, with a single click, see:

  • First rows
  • Last rows
  • Data types
  • Statistical Summary
  • Missing Values
  • Correlation Matrix

You can also select a column and see its:

  • Value Counts
  • Unique Values
  • Histogram
  • Box plot

Step 1: Setting Up the Environment

First, let’s install the packages we need.

pip install pandas numpy ipywidgets matplotlib seaborn

Step 2: Importing Libraries and Creating the File Upload Widget

Now, let’s import the libraries and create the widget you will use to upload the dataset.

import pandas as pd
import numpy as np
import ipywidgets as widgets
from ipywidgets import FileUpload
from IPython.display import display, clear_output
import matplotlib.pyplot as plt
import seaborn as sns
import io

# Output area for displaying results
output = widgets.Output()

# File upload widget to upload CSV files
upload_widget = FileUpload(
    accept='.csv',
    multiple=False,
    description='Upload CSV File'
)

# Display the widget
display(upload_widget)

Step 3: Handling File Uploads and Loading Data

At this step, we will define the on_file_upload function to handle the uploaded file and load it into a data frame.

df = pd.DataFrame()
column_options = []

def on_file_upload(change):
    global df, column_options
    with output:
        clear_output()
        uploaded_file = change['new'][0]
        content = uploaded_file['content']
        # Load the CSV content into a DataFrame
        df = pd.read_csv(io.BytesIO(content))
        print("File uploaded successfully!")
        print(f"DataFrame shape: {df.shape}")
        column_options = df.columns.tolist()  # Update column names
        update_dropdown_options()

This function reads the uploaded file into a DataFrame and prints the shape of the loaded data.

[Screenshot of the output]

It also updates the dropdowns with the column names.

[Screenshot of the output]
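One caveat is worth sketching here: the shape of the FileUpload value depends on the installed ipywidgets version. As far as I know, ipywidgets 8 exposes a tuple of dict-like entries (which is what the handler above assumes), while ipywidgets 7 exposes a dict keyed by file name. The helper below, extract_first_upload, is a hypothetical name that is not part of the original script; it is a minimal sketch that accepts either shape and returns the raw bytes. The sample values only simulate what the widget would hold.

```python
def extract_first_upload(value):
    """Return the raw bytes of the first uploaded file, or None if empty.

    Assumption: `value` is either a tuple/list of dict-like entries
    (ipywidgets 8 style) or a dict keyed by file name (ipywidgets 7 style).
    """
    if not value:
        return None
    if isinstance(value, dict):
        # Older style: {"cars.csv": {"metadata": {...}, "content": b"..."}}
        first = next(iter(value.values()))
    else:
        # Newer style: ({"name": "cars.csv", "content": memoryview(...)},)
        first = value[0]
    # The content may be a memoryview; bytes() copies it into plain bytes.
    return bytes(first["content"])


# Simulated widget values, standing in for upload_widget.value
v8_style = ({"name": "cars.csv", "content": memoryview(b"a,b\n1,2\n")},)
v7_style = {"cars.csv": {"metadata": {"name": "cars.csv"}, "content": b"a,b\n1,2\n"}}

print(extract_first_upload(v8_style))
print(extract_first_upload(v7_style))
print(extract_first_upload(()))
```

Inside on_file_upload, the returned bytes could be passed to pd.read_csv(io.BytesIO(...)), and a None result could be used to skip the rest of the handler when nothing was uploaded.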

Step 4: Updating Dropdown Options for Column Selection

The update_dropdown_options function fills each dropdown with the column names of df.

def update_dropdown_options():
    value_counts_column.options = column_options
    unique_values_column.options = column_options
    histogram_column.options = column_options
    boxplot_column.options = column_options
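The function above gives every dropdown the full column list, so a text column can end up selected for the box plot. As an optional variation, here is a minimal sketch that separates all columns from numeric ones. The helper name split_column_options and the sample data are assumptions for illustration and do not appear in the original script.

```python
import pandas as pd


def split_column_options(frame):
    """Return (all_columns, numeric_columns) as plain lists of column names."""
    all_columns = frame.columns.tolist()
    # include="number" selects integer and float columns, not text or booleans
    numeric_columns = frame.select_dtypes(include="number").columns.tolist()
    return all_columns, numeric_columns


# Hypothetical sample data, not the dataset used in the article
sample = pd.DataFrame({
    "brand": ["A", "B", "C"],
    "price": [10500.0, 8200.0, 15900.0],
    "doors": [4, 2, 4],
})

all_cols, numeric_cols = split_column_options(sample)
print(all_cols)      # ['brand', 'price', 'doors']
print(numeric_cols)  # ['price', 'doors']
```

With this in place, update_dropdown_options could assign the numeric list to boxplot_column.options and keep the full list for the value counts and unique values dropdowns.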

Step 5: Creating Data Exploration Buttons and Dropdown Widgets

In this step, we’ll create the buttons and dropdowns. Two buttons and one dropdown are shown here; the rest follow the same pattern, so feel free to add or remove any of them.

# Basic exploration buttons
button_head = widgets.Button(description="First Rows")
button_dtypes = widgets.Button(description="Data Types")

# Column selection dropdowns
value_counts_column = widgets.Dropdown(
    options=column_options,
    description='Select Column:'
)

These buttons and dropdowns will be connected to specific functions to explore the data interactively.

[Screenshot of the output]

Step 6: Defining Data Exploration Functions

In this step, we will define the exploration functions. Two of them are shown here:

  • show_head
  • show_corr

def show_head():
    with output:
        clear_output()
        if df.empty:
            print("Please upload a CSV file.")
        else:
            display(df.head())

def show_corr():
    with output:
        clear_output()
        if df.empty:
            print("Please upload a CSV file.")
        else:
            # Correlation is only defined for numeric columns
            numeric_df = df.select_dtypes(include=[np.number])
            corr = numeric_df.corr()
            display(corr)
            sns.heatmap(corr, annot=True, fmt=".2f", cmap='coolwarm')
            plt.show()
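To see what the correlation step computes without the widget layer, here is a small sketch on made-up data. The function name correlation_table and the sample values are assumptions for illustration. It also returns None when there are no numeric columns, a case the short version above does not handle.

```python
import numpy as np
import pandas as pd


def correlation_table(frame):
    """Return the correlation matrix of the numeric columns, or None if there are none."""
    numeric = frame.select_dtypes(include=[np.number])
    if numeric.empty:
        return None
    return numeric.corr()


# Hypothetical sample data with one missing price
sample = pd.DataFrame({
    "brand": ["A", "B", "C", "D"],
    "price": [100.0, 200.0, np.nan, 400.0],
    "mileage": [40.0, 30.0, 20.0, 10.0],
})

missing = sample.isnull().sum()   # missing values per column
corr = correlation_table(sample)  # text column "brand" is ignored

print(missing)
print(corr.round(2))
```

In this sample, mileage falls linearly as price rises, so the two columns are perfectly negatively correlated; the row with the missing price is simply left out of that pair.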

Step 7: Connecting Buttons to Functions

In this step, we will connect the buttons to the functions; here are two examples.

button_head.on_click(lambda b: show_head())
button_dtypes.on_click(lambda b: show_dtypes())

These connections allow the buttons to call specific functions when clicked.
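A click callback receives the button itself as its argument, which is why the lambdas above accept b and ignore it. If you wire many buttons in a loop, a small factory function is a safer pattern than a lambda, because each handler keeps its own function. The sketch below uses a hypothetical make_handler helper and stub functions in place of show_head and show_dtypes, so it runs without any widgets.

```python
def make_handler(func):
    """Wrap a zero-argument function so it accepts (and ignores) the button."""
    def handler(button):
        return func()
    return handler


# Stand-ins for show_head and show_dtypes that return labels instead of displaying
def show_head_stub():
    return "head"


def show_dtypes_stub():
    return "dtypes"


actions = {"First Rows": show_head_stub, "Data Types": show_dtypes_stub}
handlers = {label: make_handler(func) for label, func in actions.items()}

# Simulate a click by calling the handler with a placeholder button argument
print(handlers["First Rows"](None))
print(handlers["Data Types"](None))
```

In the notebook, the same idea would read button_head.on_click(make_handler(show_head)).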

Step 8: Arranging the Interface Layout

In this last step, we will group the widgets into a layout for the user interface.

# Group widgets
button_row1 = widgets.HBox([button_head, button_dtypes])
value_counts_widget = widgets.VBox([value_counts_column, button_value_counts])

# Arrange the layout
ui = widgets.VBox([
    upload_widget,
    button_row1,
    value_counts_widget,
])

# Display the interface
display(ui, output)

Here is the entire code.

import pandas as pd
import numpy as np
import ipywidgets as widgets
from ipywidgets import FileUpload
from IPython.display import display, clear_output
import matplotlib.pyplot as plt
import seaborn as sns
import io

# Output area
output = widgets.Output()

# Create a FileUpload widget
upload_widget = FileUpload(
    accept='.csv',  # Accept CSV files
    multiple=False,  # Do not allow multiple uploads
    description='Upload CSV File'
)

# Initialize an empty DataFrame
df = pd.DataFrame()
column_options = []

def on_file_upload(change):
    global df, column_options
    with output:
        clear_output()
        uploaded_file = change['new'][0]  # Access the first item in the tuple
        content = uploaded_file['content']
        # Read the CSV file
        df = pd.read_csv(io.BytesIO(content))
        print("File uploaded successfully!")
        print(f"DataFrame shape: {df.shape}")
        # Update column options for dropdowns
        column_options = df.columns.tolist()
        update_dropdown_options()

def update_dropdown_options():
    # Update options for column selection widgets
    value_counts_column.options = column_options
    unique_values_column.options = column_options
    histogram_column.options = column_options
    boxplot_column.options = column_options

# Data exploration functions with checks
def show_head():
    with output:
        clear_output()
        if df.empty:
            print("Please upload a CSV file.")
        else:
            display(df.head())

def show_tail():
    with output:
        clear_output()
        if df.empty:
            print("Please upload a CSV file.")
        else:
            display(df.tail())

def show_dtypes():
    with output:
        clear_output()
        if df.empty:
            print("Please upload a CSV file.")
        else:
            display(df.dtypes)

def show_describe():
    with output:
        clear_output()
        if df.empty:
            print("Please upload a CSV file.")
        else:
            display(df.describe())

def show_missing_values():
    with output:
        clear_output()
        if df.empty:
            print("Please upload a CSV file.")
        else:
            display(df.isnull().sum())

def show_corr():
    with output:
        clear_output()
        if df.empty:
            print("Please upload a CSV file.")
        else:
            # Select only numeric columns for correlation
            numeric_df = df.select_dtypes(include=[np.number])
            if numeric_df.empty:
                print("No numeric columns available for correlation.")
            else:
                corr = numeric_df.corr()
                display(corr)
                plt.figure(figsize=(10, 8))
                sns.heatmap(corr, annot=True, fmt=".2f", cmap='coolwarm')
                plt.show()


def show_value_counts(column):
    with output:
        clear_output()
        if df.empty:
            print("Please upload a CSV file.")
        else:
            counts = df[column].value_counts()
            display(counts)

def show_unique_values(column):
    with output:
        clear_output()
        if df.empty:
            print("Please upload a CSV file.")
        else:
            uniques = df[column].unique()
            print(f"Unique values in '{column}':")
            display(uniques)

def show_histogram(column):
    with output:
        clear_output()
        if df.empty:
            print("Please upload a CSV file.")
        else:
            plt.figure(figsize=(8, 6))
            sns.histplot(df[column].dropna(), kde=True)
            plt.title(f"Histogram of {column}")
            plt.show()

def show_boxplot(column):
    with output:
        clear_output()
        if df.empty:
            print("Please upload a CSV file.")
        # is_numeric_dtype also accepts int32, float32, and similar numeric types,
        # which a comparison against only float64 and int64 would reject
        elif not pd.api.types.is_numeric_dtype(df[column]):
            print(f"The selected column '{column}' is not numeric. Please select a numeric column.")
        else:
            plt.figure(figsize=(8, 6))
            sns.boxplot(y=df[column].dropna())
            plt.title(f"Boxplot of {column}")
            plt.show()


# Buttons for data exploration options
button_head = widgets.Button(description="First Rows")
button_tail = widgets.Button(description="Last Rows")
button_dtypes = widgets.Button(description="Data Types")
button_describe = widgets.Button(description="Statistical Summary")
button_missing = widgets.Button(description="Missing Values")
button_corr = widgets.Button(description="Correlation Matrix")

# Initialize dropdowns with empty options
value_counts_column = widgets.Dropdown(
    options=column_options,
    description='Select Column:'
)

unique_values_column = widgets.Dropdown(
    options=column_options,
    description='Select Column:'
)

histogram_column = widgets.Dropdown(
    options=column_options,
    description='Select Column:'
)

boxplot_column = widgets.Dropdown(
    options=column_options,
    description='Select Column:'
)

# Buttons for functions requiring column selection
button_value_counts = widgets.Button(description="Show Value Counts")
button_unique_values = widgets.Button(description="Show Unique Values")
button_histogram = widgets.Button(description="Show Histogram")
button_boxplot = widgets.Button(description="Show Boxplot")

# Button click handlers
def on_button_head_clicked(b):
    show_head()

def on_button_tail_clicked(b):
    show_tail()

def on_button_dtypes_clicked(b):
    show_dtypes()

def on_button_describe_clicked(b):
    show_describe()

def on_button_missing_clicked(b):
    show_missing_values()

def on_button_corr_clicked(b):
    show_corr()

def on_button_value_counts_clicked(b):
    show_value_counts(value_counts_column.value)

def on_button_unique_values_clicked(b):
    show_unique_values(unique_values_column.value)

def on_button_histogram_clicked(b):
    show_histogram(histogram_column.value)

def on_button_boxplot_clicked(b):
    show_boxplot(boxplot_column.value)

# Connect buttons to handlers
button_head.on_click(on_button_head_clicked)
button_tail.on_click(on_button_tail_clicked)
button_dtypes.on_click(on_button_dtypes_clicked)
button_describe.on_click(on_button_describe_clicked)
button_missing.on_click(on_button_missing_clicked)
button_corr.on_click(on_button_corr_clicked)

button_value_counts.on_click(on_button_value_counts_clicked)
button_unique_values.on_click(on_button_unique_values_clicked)
button_histogram.on_click(on_button_histogram_clicked)
button_boxplot.on_click(on_button_boxplot_clicked)

# Observe the upload widget
upload_widget.observe(on_file_upload, names='value')

# Group buttons without column selection
button_row1 = widgets.HBox([button_head, button_tail, button_dtypes])
button_row2 = widgets.HBox([button_describe, button_missing, button_corr])

# Group widgets for value counts
value_counts_widget = widgets.VBox([value_counts_column, button_value_counts])

# Group widgets for unique values
unique_values_widget = widgets.VBox([unique_values_column, button_unique_values])

# Group widgets for histogram
histogram_widget = widgets.VBox([histogram_column, button_histogram])

# Group widgets for boxplot
boxplot_widget = widgets.VBox([boxplot_column, button_boxplot])

# Arrange all widgets
ui = widgets.VBox([
    upload_widget,
    button_row1,
    button_row2,
    value_counts_widget,
    unique_values_widget,
    histogram_widget,
    boxplot_widget
])

# Display the UI and output
display(ui, output)

Testing with the Car Price Dataset

Reference

Now, let’s test it. After running the code, you will see the following interface.

[Screenshot of the output]

Now let’s upload the CSV file you downloaded from Kaggle above and click the buttons.

First rows:

[Screenshot of the output]

Statistical Summary:

[Screenshot of the output]

If you select the price column, you can see its histogram:

[Screenshot of the output]

You can run each of the remaining tasks the same way, one by one.
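If you do not have the Kaggle file at hand, you can check the loading step with a tiny in-memory CSV. This is only a sketch: the column names and values are made up, and the bytes stand in for the content that the upload widget would provide. It uses the same pd.read_csv(io.BytesIO(...)) call as the upload handler.

```python
import io

import pandas as pd

# Made-up CSV content standing in for an uploaded file
csv_bytes = (
    b"brand,price,doors\n"
    b"A,10500,4\n"
    b"B,8200,2\n"
    b"C,15900,4\n"
)

# Same call the upload handler uses on the uploaded content
sample = pd.read_csv(io.BytesIO(csv_bytes))

print(f"DataFrame shape: {sample.shape}")
print(sample.dtypes)
```

Saving the same text to a .csv file gives you a quick test input for the upload button itself.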

Final Thoughts

In this article, we’ve explored an efficient way to explore and analyze data using pandas’ features.

Copyright © Learn AI With Me All Rights Reserved