Data Analytics Intermediate I
Module 2: Fundamentals of Data in Excel
2021 © TRUE DIGITAL ACADEMY

Data Cleaning

Garbage in. Garbage out.

Variable Type
1. ID
2. Text
3. Ordinal
4. Categorical
5. Numerical

Ordinal and categorical data
Ordinal data is a categorical, statistical data type where the variables have natural, ordered categories and the distances between the categories are not known.

Numerical data
Numerical data is a data type expressed in numbers, rather than natural language description. Sometimes called quantitative data, numerical data is always collected in number form.

5 types of data problems that need to be cleaned
1. Duplicate ID
2. Type (wrong data type)
3. Missing
4. Out of range
5. Outlier

Data Cleaning Best Practice

Guided practice for cleaning data
1. Copy the data separately before making any changes.
2. Make a note of what you've done.
3. Create a document to record conclusions:
   a. Where does the data come from?
   b. Explain analysis results.
   c. Summary of analysis results.

Summary Statistics

1. Central tendency
   a. Mean: A central value of a finite set of numbers: specifically, the sum of the values divided by the number of values.
   b. Median: The middle value of the given list of data, when arranged in order.
   c. Mode: The value that appears most often in a set of data values.
2. Dispersion
   a. Standard deviation: A measure of the amount of variation or dispersion of a set of values.
   b. Range: The difference between the largest and smallest values in a set of data.
3. Statistical dependency
   a. Correlation: Any statistical relationship, whether causal or not, between two random variables or bivariate data.

Data Referencing

VLOOKUP syntax

The syntax is:

=VLOOKUP(lookup_value, table_array, col_index_num, [range_lookup])

Lookup_value is the value that will be used to match data. This is usually an identifier (an ID of some kind). It must exist in both worksheets.

Table_array is the table from which you want to retrieve data.

Col_index_num is the number of the column, counted from the left side of the table_array, from which you want to retrieve data.

Lock the column with $, e.g. $C4.

Range_lookup defines whether the lookup_value is an approximate match or an exact match of the value you are comparing it to in the left-most column of the table_array.
   TRUE: An approximate match is needed.
   FALSE: An exact match is required.

NULL values

A NULL is any missing value in your data. There are four primary strategies for handling NULL values:
1. Delete them (only with caution).
2. Ignore them (some may have meaning).
3. Impute values (e.g. median or zeros).
4. Find missing values (using reference resources).

Guideline: If over 15% of a dataset is filled with NULL values, find new data!

Example: customer table

ID | Age range | Region    | Spending | Name
---|-----------|-----------|----------|----------
1  | 10-30     | North     | 20000    | Anthony
2  | 31-50     | North     | 30000    | Brittney
3  | 31-50     | Northeast | 100000   | Christina
4  | 51-90     | North     | 20000    | Donald
5  | 51-90     | South     | 10000    | Elaine
6  | 10-30     | Northeast | 4000     | Frank
7  | 10-30     | South     | 40000    | Gary

Example: card table (ID 6 appears three times)

ID | Bill  | Card type
---|-------|----------
1  | 80000 | Primo
3  | 20000 | Super
5  | 10000 | Platinum
6  | 0     | Primo
6  | 3000  | Super
6  | 1000  | Platinum

Example: customer table with Spending values that need cleaning

ID | Age range | Region    | Spending | Name
---|-----------|-----------|----------|----------
1  | 10-30     | North     | east     | Anthony
2  | 31-50     | North     | 30000    | Brittney
3  | 31-50     | Northeast | -100000  | Christina
4  | 51-90     | North     | 20000    | Donald
5  | 51-90     | South     | 10000    | Elaine
6  | 10-30     | Northeast | 4000     | Frank
7  | 10-30     | South     | 4000000  | Gary

Example: cleaning plan by field

Field                      | Data type    | Notes                                                  | Action to take
---------------------------|--------------|--------------------------------------------------------|-------------------------------------------
Accident_Index             | Alphanumeric | Unique identifier                                      | Check for and handle duplicates.
Accident_Severity          | Text         | Slight, Serious, or Fatal                              | Check for and handle blanks.
Date                       | Date         | Expect duplicates                                      | None
Region                     | Text         | Likely entered by a human; critical to our analysis    | Check for incorrect and/or missing values.
Number_of_Included_Parties | Numerical    | The spread of numbers looks concerning at first glance | Check for and handle outliers.
Number_of_Vehicles         | Numerical    | The spread of numbers looks concerning at first glance | Check for and handle outliers.
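
Worked example: summary statistics

The summary statistics defined above (mean, median, mode, standard deviation, range, correlation) can be checked outside Excel. The following is a minimal Python sketch, not part of the course material, that computes each one for the Spending column of the clean example customer table using the standard statistics module. The helper name pearson and the derived list doubled are illustrative assumptions; statistics.stdev is the sample standard deviation, which may differ from the population version.

```python
import math
import statistics

# Spending column from the clean example customer table (IDs 1 to 7).
spending = [20000, 30000, 100000, 20000, 10000, 4000, 40000]

# 1. Central tendency
mean_spending = statistics.mean(spending)      # sum of values / number of values
median_spending = statistics.median(spending)  # middle value once sorted
mode_spending = statistics.mode(spending)      # most frequent value

# 2. Dispersion
stdev_spending = statistics.stdev(spending)    # sample standard deviation
range_spending = max(spending) - min(spending)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences
    (hypothetical helper, written out so no extra library is needed)."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# 3. Statistical dependency: a series that is an exact multiple of Spending
# is perfectly positively correlated with it (illustrative data only).
doubled = [2 * s for s in spending]
corr = pearson(spending, doubled)

print(mean_spending)    # 32000
print(median_spending)  # 20000
print(mode_spending)    # 20000
print(range_spending)   # 96000
```

The mean (32000) sits well above the median (20000) because the single large value 100000 pulls it upward, which is one reason to report both.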
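
Worked example: VLOOKUP logic

To make the four VLOOKUP arguments concrete, here is a rough Python sketch of the lookup logic applied to the example card table. It is an approximation for illustration, not Excel's implementation: the function name vlookup, the use of None in place of Excel's #N/A error, and the list-of-rows layout are all assumptions. The approximate branch assumes the first column is sorted in ascending order.

```python
from bisect import bisect_right

# Example card table: columns are ID, Bill, Card type.
bills = [
    [1, 80000, "Primo"],
    [3, 20000, "Super"],
    [5, 10000, "Platinum"],
    [6, 0, "Primo"],
    [6, 3000, "Super"],
    [6, 1000, "Platinum"],
]


def vlookup(lookup_value, table_array, col_index_num, range_lookup=True):
    """Sketch of VLOOKUP. col_index_num counts from 1, starting at the
    left-most column. Returns None where Excel would show #N/A."""
    if range_lookup:
        # Approximate match: last row whose key is <= lookup_value,
        # assuming the first column is sorted ascending.
        keys = [row[0] for row in table_array]
        pos = bisect_right(keys, lookup_value) - 1
        if pos < 0:
            return None
        return table_array[pos][col_index_num - 1]
    # Exact match: the first row whose key equals lookup_value.
    for row in table_array:
        if row[0] == lookup_value:
            return row[col_index_num - 1]
    return None


print(vlookup(3, bills, 3, False))  # Super
print(vlookup(6, bills, 2, False))  # 0 (only the first of the three ID 6 rows)
print(vlookup(2, bills, 2, False))  # None (ID 2 has no row in the card table)
print(vlookup(2, bills, 2, True))   # 80000 (falls back to ID 1, the largest ID <= 2)
```

The last two calls show why FALSE is usually the safer choice for ID lookups: an approximate match silently returns another customer's bill. The second call shows why duplicate IDs should be handled before any lookup.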
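
Worked example: NULL share and median imputation

The 15% guideline and the "impute values" strategy for NULL values can be sketched in a few lines of Python. This is an illustration under stated assumptions: the names NULL_LIMIT, null_share and impute_median are hypothetical, None stands in for an empty cell, and the Spending value for ID 2 has been blanked out purely for the example (it is not missing in the example customer table).

```python
import statistics

NULL_LIMIT = 0.15  # guideline from the notes: above 15% NULLs, find new data


def null_share(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def impute_median(values):
    """Replace each missing entry with the median of the known entries."""
    known = [v for v in values if v is not None]
    fill = statistics.median(known)
    return [fill if v is None else v for v in values]


# Spending column with one value removed for illustration only.
spending = [20000, None, 100000, 20000, 10000, 4000, 40000]

share = null_share(spending)  # 1 of 7 values is missing
if share > NULL_LIMIT:
    print("Too many NULLs: look for a better data source.")
else:
    filled = impute_median(spending)
    print(filled)
```

Imputing the median rather than zero keeps the filled value inside the typical range of the column, so it distorts the mean and standard deviation less.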
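
Worked example: flagging the five data problems

The five data problems (duplicate ID, type, missing, out of range, outlier) can each be flagged with a simple check. The sketch below runs them on the example tables. Several choices here are assumptions rather than anything the course prescribes: the function names, the rule that Spending must be at least 0, and the outlier test, which is a modified z-score based on the median absolute deviation with the commonly used cutoff of 3.5.

```python
import statistics
from collections import Counter

# Spending column from the example table that needs cleaning, and the
# ID column from the example card table.
spending = ["east", 30000, -100000, 20000, 10000, 4000, 4000000]
bill_ids = [1, 3, 5, 6, 6, 6]


def duplicate_ids(ids):
    """IDs that occur more than once."""
    return sorted(i for i, n in Counter(ids).items() if n > 1)


def classify(values, minimum=0, threshold=3.5):
    """Label each value: missing, type, out of range, outlier, or ok."""
    def is_number(v):
        return isinstance(v, (int, float)) and not isinstance(v, bool)

    # Only well-typed, in-range values define what "typical" looks like.
    valid = [v for v in values if is_number(v) and v >= minimum]
    med = statistics.median(valid) if valid else None
    mad = statistics.median([abs(v - med) for v in valid]) if valid else 0

    labels = []
    for v in values:
        if v is None:
            labels.append("missing")
        elif not is_number(v):
            labels.append("type")
        elif v < minimum:
            labels.append("out of range")
        elif mad and 0.6745 * abs(v - med) / mad > threshold:
            labels.append("outlier")  # modified z-score above the cutoff
        else:
            labels.append("ok")
    return labels


print(duplicate_ids(bill_ids))  # [6]
print(classify(spending))
# ['type', 'ok', 'out of range', 'ok', 'ok', 'ok', 'outlier']
```

The checks run in a fixed order (missing, type, range, outlier) so that each value gets a single label, and the text and negative entries do not distort the outlier test.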