Pandas vs. Polars: Why Speed Matters (What Most Data Scientists Overlook)
Nov. 3, 2024

Did you know that Polars can outperform Pandas by up to 100 times in certain operations?
As data continues to grow larger and larger, the importance of efficient computation grows alongside it.
In this article, we’ll explore why Polars is gaining more attention as a powerful alternative to Pandas, focusing on speed and memory usage for data science operations — from data loading to joining. Let’s start!

Mock-Up Dataset with Custom Functions

Now, before we begin the comparison, let’s create a large mock dataset and custom functions to measure time and perform calculations for the comparison.

Mock-up Dataset

Here is the code to create the mock-up dataset.

import numpy as np
import pandas as pd
import polars as pl
import time

# Initialize data with different variable names
row_count = 10000
data_alt = {
    'X': np.random.randint(0, 150, row_count),
    'Y': np.random.randint(0, 150, row_count),
    'Z': np.random.rand(row_count)
}

df_pd = pd.DataFrame(data_alt)
df_pl = pl.DataFrame(data_alt)

Speed Measuring Custom Function

Here is the code to measure the speed.

# Function to measure execution time
def measure_time(fn, *args):
    start = time.time()
    result = fn(*args)
    end = time.time()
    return result, end - start

Calculation Function

Here is the code to do the calculation.

# Function to store, calculate comparisons, and print at each stage
def calculate_and_print_comparison(timing_results, operation):
    pandas_time = timing_results[operation]['Pandas']
    polars_time = timing_results[operation]['Polars']

    # Calculate percentage difference
    percentage_diff = (pandas_time - polars_time) / max(pandas_time, polars_time) * 100

    # Determine which is faster
    if pandas_time < polars_time:
        faster = f"Pandas is {-percentage_diff:.2f}% faster"
    else:
        faster = f"Polars is {percentage_diff:.2f}% faster"

    # Store the comparison for later display
    timing_results[operation]['Comparison'] = faster

For the final setup, let’s create an empty dictionary for the timing results and copy the DataFrames for both Polars and Pandas with the following code.

# Clone the data for the operations
df_pd_copy = df_pd.copy()
df_pl_copy = df_pl.clone()

# Run tests and collect timing results
timing_results = {}

Data Loading: How Fast Can You Get Started?

 

 


Data loading is the initial stage of data analysis. If your dataset is too large, you might experience some delays at this stage. That’s why we’re testing data loading.

To do that, we’ll use the following custom functions.

# Define operations with different function names
def load_data_pd():
    return pd.DataFrame(data_alt)

def load_data_pl():
    return pl.DataFrame(data_alt)

These functions build the same mock dataset with Pandas and Polars. To time them and compute the comparison, we’ll use the speed and calculation functions defined above. Here’s the code.

# Load Data
_, pd_load_time = measure_time(load_data_pd)
_, pl_load_time = measure_time(load_data_pl)
timing_results['Load Data'] = {'Pandas': pd_load_time, 'Polars': pl_load_time}
calculate_and_print_comparison(timing_results, 'Load Data')

Here is the output.

SS of the output

As you can see, Polars is considerably faster.

Score: Polars 1–0 Pandas.

Aggregation: Which Library Handles Group Operations Better?

 


Aggregations over grouped data are among the most common operations in data analysis.
That’s why this is the second feature we’ll test. To do that, let’s first define the custom functions for aggregation.

def group_agg_pd(df):
    return df.groupby('X').agg({'Y': 'mean', 'Z': 'sum'})

def group_agg_pl(df):
    return df.group_by('X').agg(pl.col('Y').mean(), pl.col('Z').sum())

We’ll use the same custom functions (speed + calculation), which you’re familiar with by now.

Here is the code.

# Aggregation
_, pd_agg_time = measure_time(group_agg_pd, df_pd)
_, pl_agg_time = measure_time(group_agg_pl, df_pl)
timing_results['Aggregation'] = {'Pandas': pd_agg_time, 'Polars': pl_agg_time}
calculate_and_print_comparison(timing_results, 'Aggregation')

Here is the output.

SS of the output

Score: Polars 2–0 Pandas

Filtering: Can You Slice Data Faster?

 


Let’s say you want to select rows where a column meets a certain condition or includes specific criteria. That’s where filtering comes in handy, a common task in data science.
Let’s test this operation. Here are the custom filtering functions.

def filter_data_pd(df):
    return df[df['X'] > 75]

def filter_data_pl(df):
    return df.filter(pl.col('X') > 75)

We’ll use the same functions. Here is the code.

# Filtering
_, pd_filter_time = measure_time(filter_data_pd, df_pd)
_, pl_filter_time = measure_time(filter_data_pl, df_pl)
timing_results['Filtering'] = {'Pandas': pd_filter_time, 'Polars': pl_filter_time}
calculate_and_print_comparison(timing_results, 'Filtering')

Here is the output.

SS of the output

Score: Polars 3–0 Pandas

Joining: Who Wins in Merging Large Datasets?

 


The final method is joining, one of the most common operations in data science.
To test this, as we’ve done before, let’s first define the custom functions for joining.

def join_data_pd(df1, df2):
    return df1.merge(df2, on='X')

def join_data_pl(df1, df2):
    return df1.join(df2, on='X')

Now let’s do the calculation.

# Joining
_, pd_join_time = measure_time(join_data_pd, df_pd_copy, df_pd)
_, pl_join_time = measure_time(join_data_pl, df_pl_copy, df_pl)
timing_results['Joining'] = {'Pandas': pd_join_time, 'Polars': pl_join_time}
calculate_and_print_comparison(timing_results, 'Joining')

Here is the output.

SS of the output

Score: Polars 4–0 Pandas

Comparison Altogether: What’s the Final Verdict?

 


Now let’s compare all of them together. Here is the code.

# Create a DataFrame to hold the timing results
results_df = pd.DataFrame({
    operation: {key: value for key, value in timing_results[operation].items() if key != 'Comparison'}
    for operation in timing_results
})

# Add the comparisons as a row to the DataFrame
comparison_series = pd.Series({operation: timing_results[operation]['Comparison'] for operation in timing_results}, name='Comparison')
results_df = pd.concat([results_df, comparison_series.to_frame().T])

# Display the final DataFrame with all results and comparisons
results_df

Here is the output.

SS of the output

Final Thoughts

In this article, we compared Pandas and Polars across the operations used most frequently in data science work and examined the results for each of them.

If you are working with big data, even 1–2% improvements will save you, your customer, or your boss a lot of computation power, and cost accordingly. So, while searching for the best tool, give Polars a chance too.

On Substack, we have celebrated our one-year anniversary. Join us here to follow the latest AI news, use the LearnAIWithMe GPTs, and access our next-gen datasets to be part of the future. See you there!

Series

  • Weekly AI Pulse: Get the latest updates as you read this.
  • LearnAI Series: Learn AI with our unique GPT and empower with this series.
  • Job Hunt Series: Discover freelance opportunities on Upwork here.

 


 

Here are the free resources.

Here is the Prompt Techniques cheat sheet.

Here is the ChatGPT cheat sheet.

Here is my NumPy cheat sheet.

Here is the source code of the “How to be a Billionaire” data project.

Here is the source code of the “Classification Task with 6 Different Algorithms using Python” data project.

Here is the source code of the “Decision Tree in Energy Efficiency Analysis” data project.

Here is the source code of the “DataDrivenInvestor 2022 Articles Analysis” data project.

 

“Machine learning is the last invention that humanity will ever need to make.” Nick Bostrom

 

 

Copyright © Learn AI With Me All Rights Reserved