Finding Unique records in column of a giant CSV

Valliappan Thenappan
2 min read · Apr 3, 2021

When you are dealing with giant datasets, Python is your best friend. I was tasked with finding the unique records in a column of a ten-million-record dataset. For smaller datasets, it's really easy to do in Excel this way:

(1) Select a column, go to the Data tab and click Remove Duplicates

Remove Duplicates in Excel
Excel — Remove Duplicate Result

So how do we do this for bigger datasets and CSV files? Python and Pandas to the rescue. You can do it with a simple code snippet like the following:

import pandas as pd
import os

data = []
csv_directory = r'<your directory>'

for filename in os.listdir(csv_directory):
    if filename.lower().endswith('.csv'):
        # Load each CSV and collect the values from the target column
        df = pd.read_csv(os.path.join(csv_directory, filename))
        data.extend(df['COLUMN_NAME'].tolist())

# Fetch unique values with set
data_unique = list(set(data))
data_unique_string = map(str, data_unique)

# Write one unique value per line to the output file
with open("outputfile.csv", "w") as new_list:
    new_list.write('\n'.join(data_unique_string))

What the above snippet does is load the column from each CSV into a Pandas DataFrame and use a set to retrieve the unique values. We convert the results to strings and write them back out to a CSV file. Though the script is slow, it still saves a considerable amount of time compared to the manual effort.
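If the files are too big to load comfortably in one go, a chunked variant of the same idea can keep memory usage flat. The sketch below is one possible approach, not the only way to do it: it assumes the same placeholder directory and COLUMN_NAME column as above, and the 100,000-row chunk size is an arbitrary value you can tune.

import os
import pandas as pd

csv_directory = r'<your directory>'  # placeholder, same as above
unique_values = set()

for filename in os.listdir(csv_directory):
    if filename.lower().endswith('.csv'):
        path = os.path.join(csv_directory, filename)
        # Read only the column we need, 100,000 rows at a time,
        # so the whole file never has to sit in memory at once
        for chunk in pd.read_csv(path, usecols=['COLUMN_NAME'], chunksize=100_000):
            unique_values.update(chunk['COLUMN_NAME'].dropna())

# Write one unique value per line, without an index or header
pd.Series(sorted(unique_values, key=str)).to_csv('outputfile.csv', index=False, header=False)

Building the set incrementally as each chunk arrives avoids collecting every row's value into one big list first, which is usually where the memory goes with very large files.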

Hope you enjoyed the read. Cheers!
