Data Collection: Gather raw data from various sources, including databases, APIs, files (e.g., CSV, Excel, JSON), and web scraping.
Be clear about the file format before you import data into the environment, because most file-format problems only surface as runtime errors when the file is read. Load the data into a data frame:
import pandas as pd # Importing Pandas library as alias "pd".
data = pd.read_csv('data.csv') # for .csv
print(data)
df = pd.read_excel("sample.xlsx") # for .xlsx
print(df)
import json as js
f = open('data.json') # open the JSON file
dataset = js.load(f) # parse the file into a Python object
for j in dataset['emps_details']: # iterate through the list stored under 'emps_details'
    print(j) # print each record
f.close() # close the file
# Don't forget to replace 'data.csv', 'sample.xlsx', and 'data.json' with your exact file paths and formats.
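For the database and API sources mentioned under Data Collection, the pattern is the same: query the source and load the result into a data frame. Below is a minimal sketch, assuming a hypothetical SQLite file 'employees.db' that contains an 'employees' table:
import sqlite3 # built-in SQLite driver
conn = sqlite3.connect('employees.db') # connect to the (hypothetical) database file
df = pd.read_sql('SELECT * FROM employees', conn) # run a query and load the result into a DataFrame
print(df)
conn.close() # close the connection when done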
Data Inspection: Perform preliminary exploratory data analysis (EDA) to understand the raw data's structure, content, and quality. This helps identify potential issues early and plan the data cleaning and transformation steps. Inspect or explore the data with the code below:
data.head() # view the first few rows of the data
data.info() # check the data types and column
data.describe() # Get summary statistics of the data
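In addition to the calls above, a few quick checks help size up missing data and duplicates before cleaning; 'column' below is a placeholder name:
data.shape # number of rows and columns
data.isnull().sum() # count of missing values in each column
data.duplicated().sum() # number of duplicate rows
data['column'].value_counts() # frequency of each value in a specific column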
Handle missing values:
import numpy as np # NumPy is needed below for np.nan
data = data.dropna() # remove rows with missing values
# or data = data.dropna(subset=['column1', 'column2']) # remove rows with missing values in specific columns
Handle inconsistent or incorrect values:
data.loc[data['column'] == 'incorrect_value', 'column'] = np.nan # replace incorrect values with NaN
data['column'].fillna(data['column'].mean(), inplace=True) # impute missing values with the mean
Reformat and reshape the data if necessary:
data = data.melt(id_vars=['id'], value_vars=['column1', 'column2'], var_name='variable', value_name='value') # convert from wide to long format
Handle inconsistent categorical variables:
data['column'] = data['column'].replace({'Incorrect': 'Correct'}) # correct inconsistent categorical values
Correct data types:
data['column'] = data['column'].astype('category') # convert a column to a categorical type
Perform additional data validation checks:
inconsistent_data = data[data['column1'] > data['column2']] # check for logical inconsistencies in the data
missing_values = data[data['column'].isnull()] # check for missing values in a specific column
Handling duplicate values: To maintain data integrity and correctness, remove duplicate records.
data = data.drop_duplicates() # remove duplicate rows
Outlier Detection and Handling: Identify and address outliers that may distort the outcomes of the study or modelling.
lower_limit = data['column'].quantile(0.05) # 5th percentile
upper_limit = data['column'].quantile(0.95) # 95th percentile
data = data[(data['column'] >= lower_limit) & (data['column'] <= upper_limit)] # keep only values inside the specified range
Transformation of data: Converting data types (for example, numeric and categorical) to the proper formats, converting uppercase to lowercase, and removing stray whitespace all help produce a dataset that is ready for further analysis.
Data Formatting: Maintain consistent data formatting and organization for better analysis and visualization.
data['column'] = pd.to_numeric(data['column'], errors='coerce') # convert a column to numeric type
data['column'] = pd.to_datetime(data['column'], format='%Y-%m-%d') # convert a column to date type
data['column'] = data['column'].str.lower() # convert a column to lowercase
data['column'] = data['column'].str.strip() # remove leading and trailing whitespace
Normalization and standardization: To enable fair comparisons and improved model performance, scale numerical features to a standard scale.
from sklearn.preprocessing import StandardScaler, MinMaxScaler
scaler = StandardScaler() # create a standard scaler object
data['column'] = scaler.fit_transform(data[['column']]) # standardize a column
# or
scaler = MinMaxScaler() # create a min-max scaler object
data['column'] = scaler.fit_transform(data[['column']]) # normalize a column
Handle skewed or non-normal distributions:
data['column'] = np.log(data['column'] + 1) # apply a log transformation to reduce skewness
Encoding categorical variables: For model training, convert categorical variables into numerical representations.
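A minimal sketch of this encoding step, assuming a placeholder categorical column named 'column':
encoded = pd.get_dummies(data, columns=['column']) # one-hot encode the column into indicator (0/1) columns
# or map each category to an integer code instead:
data['column'] = data['column'].astype('category').cat.codes # label-encode categories as integers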
Other important data wrangling operations are listed below.
Save the cleaned data:
data.to_csv('cleaned_data.csv', index=False) # replace 'cleaned_data.csv' with your desired file name
Handle inconsistent string formats:
data['column'] = data['column'].str.replace(r'\s+', ' ', regex=True) # collapse multiple whitespaces within strings
data['column'] = data['column'].str.replace(r'[^a-zA-Z0-9\s]', '', regex=True) # remove non-alphanumeric characters from strings
Impute missing values with the mean:
data['column'].fillna(data['column'].mean(), inplace=True) # impute missing values with the mean
Reorder and rename variables:
data = data[['column3', 'column1', 'column2']] # reorder columns
data.rename(columns={'OldColumnName': 'NewColumnName'}, inplace=True) # rename a column
Remove unnecessary variables:
data = data.drop(['column3', 'column4'], axis=1) # remove specific columns from the dataset
Merge or join datasets:
merged_data = data1.merge(data2, on='common_column') # merge two datasets based on a common column
Handle inconsistent or misspelled values:
data.loc[data['column'].isin(['Incorrect', 'Misspelled']), 'column'] = 'Correct' # replace inconsistent values with a correct value
Handle data truncation or rounding:
data['column'] = data['column'].round(2) # round numerical values to a specific number of decimal places
Perform data aggregation or summarization:
aggregated_data = data.groupby('category')['column'].agg(['mean', 'max']) # calculate the mean and maximum value for each category
Perform data type conversions and factor re-leveling:
data['column'] = data['column'].astype('int') # convert a column to integer type
data['column'] = pd.Categorical(data['column'], categories=['Level 1', 'Level 2']) # redefine categorical levels
Data Validation: Ensure that the processed data fulfils the specified requirements and is ready for the analytical step.
data.info() # verify the data types and column information of the cleaned data
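Beyond data.info(), explicit checks can confirm that specific requirements hold; the column name and rules below are placeholder assumptions, not fixed requirements:
assert data['column'].notnull().all(), 'column still contains missing values' # no missing values remain
assert data.duplicated().sum() == 0, 'duplicate rows remain' # dataset is free of duplicate rows
assert (data['column'] >= 0).all(), 'unexpected negative values found' # values fall in the expected range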
Data wrangling is a repetitive and time-consuming procedure that requires an in-depth understanding of the analytic or modelling requirements of the task. The objective is to create a clean, well-structured dataset that maximizes the insights that can be extracted from it.