Finding Unique records in column of a giant CSV

Valliappan Thenappan
2 min read · Apr 3, 2021

When you are dealing with giant datasets, Python is your best friend. I was tasked with finding the unique records in a column of a ten-million-record dataset. For smaller datasets, it's really easy to do in Excel this way:

(1) Select a column, go to the Data tab and click Remove Duplicates

Remove Duplicates in Excel
Excel — Remove Duplicate Result

So how do we do this for bigger datasets and CSV files? Python and Pandas to the rescue. You can do it with a simple code snippet like the following:

import pandas as pd
import os

data = []
csv_directory = r'<your directory>'

for filename in os.listdir(csv_directory):
    if filename.lower().endswith('.csv'):
        # Load each CSV and collect the values from the target column
        df = pd.read_csv(os.path.join(csv_directory, filename))
        data.extend(df['COLUMN_NAME'].tolist())

# Fetch unique values with set
data_unique = list(set(data))
data_unique_string = map(str, data_unique)

# Write one unique value per line to the output file
with open("outputfile.csv", "w") as new_list:
    new_list.write('\n'.join(data_unique_string))

What the above snippet does is load the column from each CSV into a Pandas DataFrame and use a set to retrieve the unique values. We convert the results to strings and write them back out to a CSV file. Though the script is slow, it still saves a considerable amount of time compared to the manual effort.
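If the files are too big to load comfortably in one go, a chunked variant of the same idea can keep memory usage flat. The sketch below is one possible approach, not the only way to do it: it assumes the same placeholder directory and COLUMN_NAME column as above, and the 100,000-row chunk size is an arbitrary value you can tune.

import os
import pandas as pd

csv_directory = r'<your directory>'  # placeholder, same as above
unique_values = set()

for filename in os.listdir(csv_directory):
    if filename.lower().endswith('.csv'):
        path = os.path.join(csv_directory, filename)
        # Read only the column we need, 100,000 rows at a time,
        # so the whole file never has to sit in memory at once
        for chunk in pd.read_csv(path, usecols=['COLUMN_NAME'], chunksize=100_000):
            unique_values.update(chunk['COLUMN_NAME'].dropna())

# Write one unique value per line, without an index or header
pd.Series(sorted(unique_values, key=str)).to_csv('outputfile.csv', index=False, header=False)

Building the set incrementally as each chunk arrives avoids collecting every row's value into one big list first, which is usually where the memory goes with very large files.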

Hope you enjoyed the read. Cheers!
