Processing data in chunks using pandas

Introduction

While a decent computer with 16 GB of RAM or more serves you well in daily tasks, that memory quickly proves insufficient when working with the large datasets common in data science projects.

Fortunately, pandas, the de facto data processing library in Python, can read CSV files in chunks (a chunk is a subset of the rows of the original dataset). This allows us to process data that would otherwise not fit in memory.

Creating an iterator

The chunksize argument of the read_csv function takes an integer, the number of rows in each chunk, and makes read_csv return the dataset piece by piece rather than all at once.

import pandas as pd

# Read blood_pressure.csv in chunks of 10 rows
df = pd.read_csv("blood_pressure.csv", chunksize=10)

The call above does not load the file into memory. It returns an iterator, and each item it yields is a regular DataFrame that we can work on like any other.
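To make the iterator's behavior concrete, here is a minimal sketch. It assumes a small in-memory CSV built with io.StringIO as a stand-in for blood_pressure.csv; the patient and bp columns and the name reader are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for blood_pressure.csv (25 rows)
csv_text = "patient,sex,bp\n" + "\n".join(
    f"{i},{'Male' if i % 2 else 'Female'},{110 + i}" for i in range(25)
)

# With chunksize set, read_csv returns an iterator, not a DataFrame
reader = pd.read_csv(io.StringIO(csv_text), chunksize=10)
is_dataframe = isinstance(reader, pd.DataFrame)

# Each item the iterator yields is an ordinary DataFrame of up to 10 rows
chunk_lengths = [len(chunk) for chunk in reader]

print(is_dataframe)   # False
print(chunk_lengths)  # [10, 10, 5]
```

The last chunk holds only the leftover rows, so code that processes chunks should not assume every chunk has exactly chunksize rows.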

There is a catch, though! Unless we only want the result for a single chunk, we need to combine the results from all chunks.

male = 0
female = 0
# Loop through the chunks
for chunk in df:
    # Count the rows of the current chunk whose sex is "Male"
    male += (chunk["sex"] == "Male").sum()
    # Count the rows of the current chunk whose sex is "Female"
    female += (chunk["sex"] == "Female").sum()
# Print the results
print(f"Total number of males: {male}")
print(f"Total number of females: {female}")

The example above counts the males and females in each chunk and adds them to the running totals male and female, which are defined outside the loop so that they persist across chunks.

A workflow like this lets us process the whole dataset without running into a MemoryError, because only one chunk is held in memory at a time.
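The same pattern extends to other aggregates, as long as each one is built from quantities that can be added up across chunks. The sketch below is one possible way to do it, not the only one. It assumes hypothetical in-memory sample data with a made-up bp column; a real file path would replace the StringIO object.

```python
import io
from collections import Counter

import pandas as pd

# Hypothetical sample data (25 rows); the bp column is an assumption
csv_text = "sex,bp\n" + "\n".join(
    f"{'Male' if i % 3 else 'Female'},{100 + i}" for i in range(25)
)

totals = Counter()  # running count per category
bp_sum = 0          # running sum of the bp column
bp_rows = 0         # running number of non-missing bp values

for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=10):
    # Add this chunk's category counts to the running totals
    totals.update(chunk["sex"].value_counts().to_dict())
    # Track sum and count separately; averaging per-chunk means
    # would be wrong when chunks differ in size
    bp_sum += chunk["bp"].sum()
    bp_rows += chunk["bp"].count()

mean_bp = bp_sum / bp_rows

print(dict(totals))
print(mean_bp)
```

Keeping a sum and a count, and dividing only at the end, gives the same mean as computing it on the full dataset, which matters here because the final chunk is smaller than the others.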

Conclusion

Reading data in chunks is handy when working with large datasets, allowing us to process data that would otherwise not fit in our computer's RAM.

Another option is Dask, a flexible library for parallel computing in Python. Dask splits a dataset into partitions and can process them in parallel on multiple CPU cores, which can reduce processing time and lets it handle datasets that are larger than memory.