In the data science and data engineering fields, Python is the dominant language for processing data. Most beginners are introduced to Pandas, which provides a way to work with DataFrames, allowing them to manipulate datasets programmatically. However, as the scale of data grows, the limitations of single-node processing technologies like Pandas become apparent. It's not feasible to keep scaling computing resources vertically forever. At this point, the need for a tool that supports horizontal scaling and distributed computing arises.
Enter Apache Spark, a distributed computing engine designed for big data workloads. Spark supports APIs in multiple languages, including Python, R, SQL, and Scala, but its Python interface, PySpark, is by far the most popular.
The transition from Pandas to PySpark can be challenging for aspiring data scientists and data engineers. In this post, we'll explore a real-world use case that highlights how understanding the differences between the two tools helped solve a practical problem.
On the surface, Pandas and PySpark seem similar because they both feature the DataFrame structure for data manipulation. For example, creating a new column or casting data types follows a similar concept but with slightly different syntax:
Creating a New Column
• Pandas:
df['new_column'] = 5
• PySpark:
from pyspark.sql.functions import lit
df = df.withColumn('new_column', lit(5))
Casting a Column from Integer to String
• Pandas:
df['col1'] = df['col1'].astype(str)
• PySpark:
df = df.withColumn('col1', df['col1'].cast('string'))
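For context, here is a minimal, self-contained version of the snippets above (a sketch with illustrative data; the column names follow the examples):
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
# Pandas: a small in-memory DataFrame
pdf = pd.DataFrame({'col1': [1, 2, 3]})
pdf['new_column'] = 5                    # new constant column
pdf['col1'] = pdf['col1'].astype(str)    # cast integer column to string
# PySpark: the same operations on a distributed DataFrame
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(1,), (2,), (3,)], ['col1'])
sdf = sdf.withColumn('new_column', lit(5))
sdf = sdf.withColumn('col1', sdf['col1'].cast('string'))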
While these basic operations are straightforward, the differences between the two tools become evident as tasks grow more complex, especially when working in distributed environments.
A Real-World Challenge: Saving DataFrames to ADLS Gen2
Recently, I encountered a challenge while trying to save a PySpark DataFrame as a single CSV file in Azure Data Lake Storage Gen2 (ADLS Gen2). My initial instinct was to use Pandas, as saving a DataFrame to CSV in Pandas is simple:
df.to_csv('file_name.csv', index=False)
However, Pandas needs explicit credentials to read from or write to ADLS Gen2. I found two ways to connect:
1. Use the adlfs library and manually specify account credentials.
2. Use the storage_options parameter to provide authentication details.
Example 1: Using adlfs to Connect to ADLS Gen2
import pandas as pd
import adlfs
# Define storage account details
fs = adlfs.AzureBlobFileSystem(
    account_name="your_account_name",
    account_key="your_account_key"
)
# Reading a CSV from ADLS Gen2
file_path = "abfs://your-container-name/path/to/file.csv"
with fs.open(file_path, "rb") as f:
    df = pd.read_csv(f)
# Writing a CSV to ADLS Gen2
output_path = "abfs://your-container-name/path/to/output_file.csv"
with fs.open(output_path, "wb") as f:
    df.to_csv(f, index=False)
Example 2: Using storage_options to Connect to ADLS Gen2
import pandas as pd
# Storage options with credentials
storage_options = {
    "account_name": "your_account_name",
    "account_key": "your_account_key"
}
# Reading a CSV
file_path = "abfs://your-container-name/path/to/file.csv"
df = pd.read_csv(file_path, storage_options=storage_options)
# Writing a CSV
output_path = "abfs://your-container-name/path/to/output_file.csv"
df.to_csv(output_path, index=False, storage_options=storage_options)
Both options required hardcoding sensitive credentials, which poses a security risk. Thus, I opted for PySpark, which integrates seamlessly with the Synapse linked service to authenticate using Managed Identity.
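In a Synapse notebook this looks roughly like the following (a sketch, assuming the workspace already has access to the storage account, for example through the linked service's Managed Identity; the container, account, and path names are placeholders):
# No account keys are hardcoded; access is governed by the workspace /
# linked-service identity. The built-in `spark` session is used.
adls_path = 'abfss://your-container-name@your-account-name.dfs.core.windows.net/path/to/file.csv'
df = spark.read.csv(adls_path, header=True, inferSchema=True)
However, writing the result back as a single, custom-named CSV file is where PySpark needs extra work. Two things have to be handled: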
1. Combining Partitions
Since Spark is a distributed system, data is processed and stored in partitions. Writing a DataFrame to a single CSV file requires reducing the number of partitions to one. This can be achieved using:
• coalesce: Recommended for reducing the number of partitions without triggering a full shuffle.
• repartition: Used to increase or redistribute partitions, but it involves shuffling, which is resource-intensive.
Example:
df = df.coalesce(1)
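To verify the effect, you can check the partition count before and after (a quick sketch; the actual count depends on your data and cluster):
print(df.rdd.getNumPartitions())  # e.g. dozens of partitions for a large dataset
df = df.coalesce(1)               # merge into one partition without a full shuffle
print(df.rdd.getNumPartitions())  # 1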
2. File Naming
Spark writes files with system-generated names (e.g., part-00000) and doesn't allow custom file naming directly. To resolve this, you can:
1. Write the DataFrame to a temporary directory.
2. Rename and move the file using file system operations.
Example:
output_temp_path = "abfs://your-container-name/temp_output/"
final_output_path = "abfs://your-container-name/path/to/output_file.csv"
# Save the DataFrame to a temporary location
df.coalesce(1).write.mode('overwrite').csv(output_temp_path, header=True)
# Use Synapse utilities or Azure SDK to move and rename the file
from notebookutils import mssparkutils
# List files in the temporary output path
files = mssparkutils.fs.ls(output_temp_path)
for file in files:
    if file.name.endswith(".csv"):
        # Move (rename) the file to the final output path
        mssparkutils.fs.mv(file.path, final_output_path)
        break
# Clean up the temporary directory
mssparkutils.fs.rm(output_temp_path, recurse=True)
Key Lessons
1. Pandas is ideal for small-scale, single-node operations and provides a simple interface for common tasks. However, it struggles with scalability and requires additional configuration for secure cloud integration.
2. PySpark excels at handling large-scale, distributed workloads but requires additional effort for tasks that seem simple in Pandas, such as saving a single CSV file.
These differences stem from their underlying architectures:
• Pandas operates on a single node.
• PySpark runs on a distributed cluster.
Understanding these trade-offs is crucial when deciding which tool to use for a given task.
About the Author
Stephen Situ
Stephen Situ is an experienced data engineer and data scientist with over 7 years of expertise spanning data engineering, machine learning, and AI across industries such as oil and gas, manufacturing, and technology. With a background in Chemical Engineering and Applied Mathematics, he brings full-stack proficiency to the entire data lifecycle: designing scalable data infrastructures, building distributed computing systems, developing machine learning models, and deploying them into production. Certified in top cloud platforms like AWS and Azure, including Azure Data Engineer Associate and Databricks Certified Data Engineer Professional, Stephen combines his technical depth and practical experience to solve complex data challenges and deliver end-to-end solutions that drive business value.