Pandas vs. PySpark: Navigating Data Read/Write Challenges with ADLS Gen2

In the data science and data engineering fields, Python is the dominant language for processing data. Most beginners are introduced to Pandas, which provides a DataFrame structure for manipulating datasets programmatically. However, as the scale of data grows, the limitations of single-node processing tools like Pandas become apparent. It’s not feasible to keep scaling computing resources vertically forever. At this point, the need arises for a tool that supports horizontal scaling and distributed computing.

Enter Apache Spark, a distributed computing engine designed for big data workloads. Spark supports APIs in multiple languages, including Python, R, SQL, and Scala, but its Python interface, PySpark, is by far the most popular.

The transition from Pandas to PySpark can be challenging for aspiring data scientists and data engineers. In this post, we’ll explore a real-world use case that highlights how understanding the differences between the two tools helped solve a practical problem.

On the surface, Pandas and PySpark seem similar because they both feature the DataFrame structure for data manipulation. For example, creating a new column or casting data types follows a similar concept but with slightly different syntax:

Creating a New Column

Pandas:

df['new_column'] = 5

PySpark:

from pyspark.sql.functions import lit

df = df.withColumn('new_column', lit(5))

Casting a Column from Integer to String

Pandas:

df['col1'] = df['col1'].astype(str)


PySpark:

df = df.withColumn('col1', df['col1'].cast('string'))
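
To try these operations end to end, here is a minimal, self-contained sketch; the column names and values are illustrative placeholders rather than a real dataset:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

# Pandas: build a small DataFrame, add a constant column, cast to string
pdf = pd.DataFrame({'col1': [1, 2, 3]})
pdf['new_column'] = 5
pdf['col1'] = pdf['col1'].astype(str)

# PySpark: the same operations on a Spark DataFrame
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1,), (2,), (3,)], ['col1'])
sdf = sdf.withColumn('new_column', lit(5))
sdf = sdf.withColumn('col1', sdf['col1'].cast('string'))
sdf.show()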


While these basic operations are straightforward, the differences between the two tools become evident as tasks grow more complex, especially when working in distributed environments.

A Real-World Challenge: Saving DataFrames to ADLS Gen2
Recently, I encountered a challenge while trying to save a PySpark DataFrame as a single CSV file in Azure Data Lake Storage Gen2 (ADLS Gen2). My initial instinct was to use Pandas, as saving a DataFrame to CSV in Pandas is simple:

df.to_csv('file_name.csv', index=False)

However, I was working in an Azure Synapse workspace, which authenticates through preconfigured linked services, and Pandas cannot leverage the workspace’s Managed Identity. Using Pandas would have required me to either:

1. Use the adlfs library and manually specify account credentials.
2. Use the storage_options parameter to provide authentication details.

Example 1: Using adlfs to Connect to ADLS Gen2

import pandas as pd
import adlfs

# Define storage account details
fs = adlfs.AzureBlobFileSystem(account_name="your_account_name",
                               account_key="your_account_key")

# Reading a CSV from ADLS Gen2
file_path = "abfs://your-container-name/path/to/file.csv"
with fs.open(file_path, "rb") as f:
    df = pd.read_csv(f)

# Writing a CSV to ADLS Gen2
output_path = "abfs://your-container-name/path/to/output_file.csv"
with fs.open(output_path, "wb") as f:
    df.to_csv(f, index=False)


Example 2: Using storage_options to Connect to ADLS Gen2

import pandas as pd

# Storage options with credentials
storage_options = {
    "account_name": "your_account_name",
    "account_key": "your_account_key"
}

# Reading a CSV
file_path = "abfs://your-container-name/path/to/file.csv"
df = pd.read_csv(file_path, storage_options=storage_options)

# Writing a CSV
output_path = "abfs://your-container-name/path/to/output_file.csv"
df.to_csv(output_path, index=False, storage_options=storage_options)

Both options required hardcoding sensitive credentials, which poses a security risk. Thus, I opted for PySpark, which integrates seamlessly with the Synapse linked service to authenticate using Managed Identity.
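
For reference, here is a minimal sketch of what reading from ADLS Gen2 looks like in a Synapse notebook once the linked service and Managed Identity handle authentication; the storage account, container, and file path below are placeholders, and the exact setup depends on how your workspace and permissions are configured:

# No keys or secrets in the notebook: the workspace's Managed Identity,
# granted access to the storage account, handles authentication.
adls_path = "abfss://your-container-name@youraccountname.dfs.core.windows.net/path/to/file.csv"

# Read the CSV into a distributed Spark DataFrame
df = spark.read.csv(adls_path, header=True, inferSchema=True)
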
Challenges with PySpark for Single CSV File Output

1. Combining Partitions

Since Spark is a distributed system, data is processed and stored in partitions. Writing a DataFrame to a single CSV file requires reducing the number of partitions to one. This can be achieved using:

coalesce: Recommended for reducing the number of partitions without triggering a full shuffle.
repartition: Used to increase or redistribute partitions, but it involves a full shuffle, which is resource-intensive.

Example:

df = df.coalesce(1)
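
As a quick illustration of the difference, you can inspect the partition count before and after (df here is any existing Spark DataFrame):

# Number of partitions Spark is currently using for df
print(df.rdd.getNumPartitions())

# coalesce(1) merges existing partitions without a full shuffle
single_partition_df = df.coalesce(1)
print(single_partition_df.rdd.getNumPartitions())  # 1

# repartition redistributes the data evenly but triggers a shuffle
redistributed_df = df.repartition(8)
print(redistributed_df.rdd.getNumPartitions())  # 8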


2. File Naming

Spark writes files with system-generated names (e.g., part-00000) and doesn’t allow custom file naming directly. To resolve this, you can:

1. Write the DataFrame to a temporary directory.
2. Rename and move the file using file system operations.

Example:

output_temp_path = "abfs://your-container-name/temp_output/"
final_output_path = "abfs://your-container-name/final_output/your_file.csv"

# Save the DataFrame as a single part file in a temporary location
df.coalesce(1).write.mode('overwrite').csv(output_temp_path, header=True)

# Use Synapse utilities to move and rename the file
from notebookutils import mssparkutils

# List files in the temporary output path
files = mssparkutils.fs.ls(output_temp_path)

for file in files:
    if file.name.endswith(".csv"):
        # Move (rename) the part file to the final output path
        mssparkutils.fs.mv(file.path, final_output_path)
        break

# Clean up the temporary directory
mssparkutils.fs.rm(output_temp_path, recurse=True)

Key Lessons
1. Pandas is ideal for small-scale, single-node operations and provides a simple interface for common tasks. However, it struggles with scalability and requires additional configuration for secure cloud integration.

2. PySpark excels in handling large-scale, distributed workloads but requires additional effort for tasks that seem simple in Pandas, such as saving a single CSV file.

These differences stem from their underlying architectures: Pandas operates on a single node, while PySpark runs on a distributed cluster. Understanding these trade-offs is crucial when deciding which tool to use for a given task.

About the Author

Stephen Situ

Stephen Situ is an experienced data engineer and data scientist with over 7 years of expertise spanning data engineering, machine learning, and AI across industries such as oil and gas, manufacturing, and technology. With a background in Chemical Engineering and Applied Mathematics, he brings full-stack proficiency to the entire data lifecycle: designing scalable data infrastructures, building distributed computing systems, developing machine learning models, and deploying them into production. Certified in top cloud platforms like AWS and Azure, including Azure Data Engineer Associate and Databricks Certified Data Engineer Professional, Stephen combines his technical depth and practical experience to solve complex data challenges and deliver end-to-end solutions that drive business value.
