2025-02-14

Learn R - Dedup!

Dedup (deduplication) is one of the first steps in analyzing any data.

Raw, "real-world" data sets almost always contain some redundant rows. They may be the result of a chain of complex joins, or the data may simply have been copied twice by mistake. Whatever the reason, perfectly clean data is rare, and running further analyses without checking for duplicates can cause a lot of trouble later!

Let's explore the ways to dedup in R.

Here is a simple data frame.

> df <- data.frame(name=c('Alice', 'Brian', 'Chris', 'Brian'), age=c(10, 20, 30, 20), height=c(5.2, 5.7, 6, 5.7))
> df
   name age height
1 Alice  10    5.2
2 Brian  20    5.7
3 Chris  30    6.0
4 Brian  20    5.7

Rows 2 and 4 are identical. There are three ways to remove one of them.

(1) duplicated()

The function duplicated() indicates which rows are duplicates of an earlier row. It is part of base R, so nothing needs to be installed.

> duplicated(df)
[1] FALSE FALSE FALSE  TRUE

The 4th row is flagged as a duplicate (as the TRUE indicates). Note that row 2, the first occurrence, is not flagged. The rows that are not duplicates are found using the "not" operator (!) as follows.

> !duplicated(df)
[1]  TRUE  TRUE  TRUE FALSE

Indexing the data frame with this logical vector keeps only the rows that are not duplicates.

> df[!duplicated(df),]
   name age height
1 Alice  10    5.2
2 Brian  20    5.7
3 Chris  30    6.0
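Two variations of duplicated() are often handy: counting the duplicates before dropping them, and deduplicating on a subset of columns. The sketch below is only an illustration. It assumes that name and age together identify a person, and the extra row for Brian with a different height is made up for the example.

```r
# Same data as above, plus one hypothetical row (Brian with a different height)
df2 <- data.frame(name   = c('Alice', 'Brian', 'Chris', 'Brian', 'Brian'),
                  age    = c(10, 20, 30, 20, 20),
                  height = c(5.2, 5.7, 6, 5.7, 5.8))

# Count the rows that are exact copies of an earlier row
n_dup <- sum(duplicated(df2))

# Flag duplicates by looking at the (assumed) key columns only
key_dup <- duplicated(df2[, c('name', 'age')])

# Keep the first row for each key
first_per_key <- df2[!key_dup, ]

# Keep the last row for each key instead, using fromLast = TRUE
last_per_key <- df2[!duplicated(df2[, c('name', 'age')], fromLast = TRUE), ]
```

Whether the first or the last row per key is the "right" one depends on the data, so it is worth looking at the flagged rows before dropping anything.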


(2) unique()

The function unique() returns the data frame with the duplicated rows removed. It is also part of base R, so it is ready to use without installing anything.

> unique(df)
   name age height
1 Alice  10    5.2
2 Brian  20    5.7
3 Chris  30    6.0
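For a data frame, unique() and the duplicated() indexing shown earlier should give the same result. The sketch below checks that on the example data, and also shows that the original row names are kept, which becomes visible when the last occurrence is kept instead of the first. The variable names u_first and u_last are only for illustration.

```r
df <- data.frame(name   = c('Alice', 'Brian', 'Chris', 'Brian'),
                 age    = c(10, 20, 30, 20),
                 height = c(5.2, 5.7, 6, 5.7))

# unique() and !duplicated() keep the same rows
u_first <- unique(df)
same <- identical(u_first, df[!duplicated(df), ])

# Keep the last occurrence instead; rows 1, 3 and 4 survive
u_last <- unique(df, fromLast = TRUE)
old_names <- rownames(u_last)

# Reset the row names to 1, 2, 3 if the gaps are unwanted
rownames(u_last) <- NULL
```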


(3) distinct() in dplyr

Install the dplyr package first (install.packages('dplyr')), then use its distinct() function. It returns only the distinct rows, the same result as above.

> dplyr::distinct(df)
   name age height
1 Alice  10    5.2
2 Brian  20    5.7
3 Chris  30    6.0

The same can be done with the pipe in tidyverse style, once dplyr has been loaded with library(dplyr).

> df %>% distinct()
   name age height
1 Alice  10    5.2
2 Brian  20    5.7
3 Chris  30    6.0
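distinct() can also dedup on selected columns. The sketch below assumes, purely for illustration, that name and age form the key, and it adds a made-up fifth row so that the key-based result differs from the full-row result. It needs dplyr to be installed.

```r
library(dplyr)

# Example data plus one hypothetical row (Brian with a different height)
df2 <- data.frame(name   = c('Alice', 'Brian', 'Chris', 'Brian', 'Brian'),
                  age    = c(10, 20, 30, 20, 20),
                  height = c(5.2, 5.7, 6, 5.7, 5.8))

# Distinct combinations of the key columns; only those columns are returned
keys_only <- df2 %>% distinct(name, age)

# Keep every column, taking the first row found for each key
first_rows <- df2 %>% distinct(name, age, .keep_all = TRUE)
```

Without .keep_all = TRUE the height column is dropped, so the choice depends on whether only the keys or the full rows are needed.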

It's that easy. Just remember to dedup!