Data cleaning in R

Data cleaning is the process of fixing or removing incorrect, improperly formatted, duplicated, or missing data from a dataset. Data can be duplicated or mislabeled in many ways, for example when separate data sources or datasets are merged. Even if the software processes the data without errors, the findings are untrustworthy when the underlying data is wrong. There is no universal sequence of steps for data cleaning, because the work depends on the dataset. The main aspects of data cleaning are explained in the separate Data Cleaning article. You can download the .csv file here.

To begin the analysis, the dataset is imported into R using the readr package, whose read_csv() function reads a CSV file directly into a data frame for further processing.


library(readr)
data=read_csv("test_measurements.csv")

read_csv() returns the imported file as a tibble, which is a kind of data frame. Next, we check the dimensions of the data frame and a summary of the data. For every column that contains missing values, the summary also reports the number of NA's.


dim(data)
summary(data)

This initial inspection helps identify missing values, unusual patterns, and potential issues that need to be addressed before proceeding with deeper analysis or modeling.
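As a small illustration of this first inspection, here is a sketch on a made-up data frame. The object `toy` and its values are assumptions for demonstration only and are not part of the downloaded dataset.

```r
# Toy data frame standing in for the imported data (values are made up).
toy <- data.frame(
  id = 1:5,
  Presentation = c(3, NA, 4, 5, 2),
  Stress = c(1, 2, NA, NA, 4)
)

dim(toy)       # number of rows and columns
str(toy)       # column types at a glance
summary(toy)   # numeric summaries, with an NA's count where values are missing
anyNA(toy)     # TRUE if at least one value is missing
```

str() is a useful companion to summary() because it shows whether a numeric column was read as text by mistake.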

Next, we find the total number of missing values in the data frame and the number of missing values in each column.


total = sum(is.na(data))
print(total)
colSums(is.na(data))
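Raw counts can be hard to judge on their own, so it may help to express them as a share of each column. The sketch below uses a made-up data frame (`toy` is hypothetical) and base R only.

```r
# Toy data with missing values (made-up numbers).
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, 2, 5),
  c = c(1, 2, 3, 4)
)

# Share of missing values per column, as a percentage.
pct_missing <- round(100 * colMeans(is.na(toy)), 1)
print(pct_missing)

# Number of rows that contain at least one missing value.
rows_with_na <- sum(!complete.cases(toy))
print(rows_with_na)
```

A column with a large share of missing values is often a poor candidate for imputation, so this check can guide the choice between filling and dropping.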

The missing values in each affected column are replaced with that column's median by using the code below:


# Keep columns 2 to 12, leaving out the first column
New_df = data[, 2:12]

# For each column with missing values, replace NA with the column median
New_df$Presentation = ifelse(is.na(New_df$Presentation),
                             median(New_df$Presentation, na.rm = TRUE),
                             New_df$Presentation)

New_df$`Influencing and Convincing` = ifelse(is.na(New_df$`Influencing and Convincing`),
                                             median(New_df$`Influencing and Convincing`, na.rm = TRUE),
                                             New_df$`Influencing and Convincing`)

New_df$`Stress Tolerance` = ifelse(is.na(New_df$`Stress Tolerance`),
                                   median(New_df$`Stress Tolerance`, na.rm = TRUE),
                                   New_df$`Stress Tolerance`)

New_df$`Achievement Orientation` = ifelse(is.na(New_df$`Achievement Orientation`),
                                          median(New_df$`Achievement Orientation`, na.rm = TRUE),
                                          New_df$`Achievement Orientation`)
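When several columns need the same treatment, the repeated ifelse() calls can be folded into one helper. The following is a minimal sketch, not the method used above: the function name `impute_median` and the data frame `toy` are hypothetical, and the values are made up.

```r
# Hypothetical helper: replace NA in a numeric vector with the vector's median.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

# Toy data; check.names = FALSE keeps the space in the column name.
toy <- data.frame(
  Presentation = c(2, NA, 4, 6),
  `Stress Tolerance` = c(NA, 1, 3, 5),
  check.names = FALSE
)

# Apply the helper to every listed column in one step.
cols <- c("Presentation", "Stress Tolerance")
toy[cols] <- lapply(toy[cols], impute_median)
print(toy)
```

Listing the columns in one vector keeps the imputation step short and makes it harder to mistype a column name in one of several near-identical statements.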

We check the data frame for missing values again to confirm that the imputation worked.


total = sum(is.na(New_df))
print(total)
summary(New_df)

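Next, we check for outliers. boxplot.stats() flags values that lie more than 1.5 times the interquartile range beyond the box of the box plot, and we replace those values with NA.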

# Box plots of every column give a first visual check for outliers
boxplot(New_df)

# Columns to screen; read_csv() keeps the spaces in the column names
col = c('Presentation', 'Influencing and Convincing', 'Stress Tolerance', 'Achievement Orientation')

boxplot(New_df[, col])

for (x in col)
{
  # Values that boxplot.stats() flags as outliers in this column
  value = boxplot.stats(New_df[[x]])$out
  # Replace the flagged values with NA
  New_df[[x]][New_df[[x]] %in% value] = NA
}
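To make the rule behind boxplot.stats() concrete, the sketch below recomputes the fences by hand on a made-up vector and compares the result with the function's own output. The vector `x` is an assumption for illustration; fivenum() is used because boxplot.stats() works from the hinges rather than from quantile().

```r
# Made-up measurements with one obviously extreme value.
x <- c(10, 12, 11, 13, 12, 11, 40)

# Lower and upper hinge (elements 2 and 4 of Tukey's five-number summary).
hinges <- fivenum(x)[c(2, 4)]
spread <- diff(hinges)

# Fences at 1.5 times the hinge spread, the default coef of boxplot.stats().
lower <- hinges[1] - 1.5 * spread
upper <- hinges[2] + 1.5 * spread

# Values outside the fences, computed by hand and by boxplot.stats().
manual_out <- x[x < lower | x > upper]
print(manual_out)
print(boxplot.stats(x)$out)
```

Both print statements should show the same values, which confirms that the loop above is applying the usual 1.5 * IQR rule.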

Next, we check whether the outliers in the columns defined above have been replaced by NA.


as.data.frame(colSums(is.na(New_df)))

In some cases, missing values can reduce the accuracy of later analysis, so we remove them. The code below drops every row that contains an NA:

library(tidyr)
New_df = drop_na(New_df)
as.data.frame(colSums(is.na(New_df)))
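If tidyr is not available, base R can perform the same row-wise removal. This is a sketch on a made-up data frame (`toy` is hypothetical), offered as an alternative rather than the approach used above.

```r
# Toy data with one missing value in each of two rows (made-up values).
toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

# Keep only the rows that have no missing value in any column.
clean <- toy[complete.cases(toy), , drop = FALSE]
print(clean)

# Report how many rows were dropped, so the loss of data is visible.
dropped <- nrow(toy) - nrow(clean)
print(dropped)
```

Printing the number of dropped rows is worthwhile because removing rows with any NA can shrink a dataset considerably when the missing values are spread across many rows.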

The complete source code is available as an R file.
