Skip to content
Snippets Groups Projects
Commit cfb4d074 authored by pfoo's avatar pfoo
Browse files

Added the related files for this part 2 course

parent 6499141d
No related branches found
No related tags found
No related merge requests found
.Rproj.user
.Rhistory
.RData
.Ruserdata
Item Price
Burger 10.90
Fries 4.45
Juice NA
Salad 9.50
Chicken 10.90
Drinks 3.33
ID,First Name,Last Name
D001,NA,Smith
D002,Anne,Clare
D003,Charlotte,NA
D004,Thomas,Train
File added
---
title: "R Beginners Exercise 2: Data Processing"
output:
word_document: default
html_document:
toc: true
editor_options:
chunk_output_type: console
---
# Introduction
Welcome to R for Beginners Exercise 2! This notebook contains the exercises for the lesson that we will be looking at during exercise breaks throughout the course as well as being a work space for you to use during the session!
To execute a line of code, click on it and press *Ctrl + Enter*.
To execute a chunk of code, click the green run button at the top right corner of the code chunk or highlight the entire code chunk and press *Ctrl + Enter*.
# 2.1 Import dataset
R supports importing datasets from a range of file types. Some common file types supported by R are `.csv`, `.txt`, `.xlxs`. There are two methods to import a dataset.
- Through the "Import Dataset" button in the Environment window (top right).
- Through command *(more information regarding this method can be found in the [R documentation](https://cran.r-project.org/doc/manuals/R-data.html#XML))*
To import data through command, first set working directory to the folder where your data files are stored.
```{r}
setwd("~/IntroductionToR_P2_ExerciseNotebook")
```
## 2.1.1 Import from CSV file
```{r}
students = read.csv("Ex2-Students_CSV.csv")
students
```
## 2.1.2 Import from TXT file
```{r}
menu = read.delim("Ex2-Menu_TXT.txt") # with tab character as the delimiter
menu
```
## 2.1.3 Import from Excel file
```{r}
library(readxl) # the package which enables Excel file importation
students2 = read_xlsx("Ex2-Students_Excel.xlsx")
students2
```
# 2.2 Dataset overview
## 2.2.1 View dataset in new window
Use the `view()` function to view the `students` dataset in another window.
```{r}
# The package which contains the view() function to display dataset in a separate window
library(tibble)
# Write your code below
```
## 2.2.2 View columns of dataset in console
Use the `glimpse()` function to print all the columns of the `students` dataset to console.
```{r}
# Write your code below
```
## 2.2.3 View column names in console
Use the `names()` function to print all the column names of the `students` dataset to console.
```{r}
# Write your code below
```
\* In R, you can view column names of a dataset when writing code by typing the dollar sign, `$`, after the dataset's name.
## 2.2.4 Find datatype of a column
The `class()` function in R can be used to find the datatype of a specific column in a table, e.g. `class(tableName$columnName)`. Use the `class()` function to find the datatype of the `Price` column in the `menu` dataset.
```{r}
# Write your code below
```
## 2.2.5 Summary of dataset
Use the `summary()` function to get a summary of all columns in the `menu` dataset.
```{r}
# Write your code below
```
## 2.2.6 Find unique values in a column
The `unique()` function in R can be used to find all unique values in a specific column in a table, e.g., `unique(tableName$columnName)`. Use the `unique()` function to find all values in the `Price` column in the `menu` dataset, without duplication.
```{r}
# Write your code below
```
## 2.2.7 Incomplete cases in dataset
In R, incomplete cases are rows in dataset that have `Na` value(s). These incomplete cases can be viewed using the function `complete.cases`.
For example, we can view the incomplete cases in the `menu` dataset by executing the code below.
```{r}
menu[!complete.cases(menu),]
```
Breaking down the code written, a condition is being used with the `[]` operator. The condition states 'rows in `menu` that are not a complete case', utilising the NOT logical operator and the `complete.cases` function. Note that the `,` sign within the `[]` operator is mandatory since `menu` is a 2-dimensional dataframe. The second index has been left out on purpose to include all columns in the result.
With reference to the syntax shown above, find the incomplete cases in `students`
```{r}
# Write your code below
```
# 2.3 Data ordering
The order of vector elements can be easily reordered using the function `sort()` in R. It can be applied to all basic datatypes to order data in numerical order (numeric, integer, complex, logical datatypes) or alphabetical order (character datatypes). It has a parameter, `decreasing`, to specify whether to sort a vector in ascending or descending order, e.g., `sort(someVector, decreasing = TRUE)`.
\* Executing the code `sort(someVector)` will only print the sorting result to console. To save the sorted vector, you would need to write `someVector = sort(someVector)`.
Execute the `numbers` vector defined below, then sort it in ascending order.
```{r}
numbers = c(4, 3, 5, 1, 2)
# Write your code below
```
Execute the `words` vector defined below, then sort it in descending alphabetical order.
```{r}
words = c("table", "water", "chair", "ruler")
# Write your code below
```
# 2.4 Dealing with incomplete cases
Missing values in R are very problematic as R cannot process a function correctly with their presence, and may even run into error. In order to obtain accurate results, these incomplete cases will need to be dealt with.
## 2.4.1 Drop all incomplete cases
The simplest way to deal with missing value(s) is to drop (remove) all the incomplete cases. Extending from the syntax to find incomplete cases in [2.2.7 Incomplete cases in dataset], we can drop all incomplete cases by omitting the use of NOT operator such as:
```{r}
menu[complete.cases(menu),]
```
Try using this method to output a table which drops all the incomplete cases in `students`.
```{r}
# Write your code below
```
A simpler method uses the function `na.omit()`. Using this function, try to obtain the same output table as above.
```{r}
# Write your code below
```
As you might notice, dealing with missing value(s) by dropping all the incomplete cases ignores a lot of rows in the dataset. Therefore, this method is not recommended because it can greatly affect the reliability of the analysis.
## 2.4.2 Drop incomplete cases in specific column
A better way to deal with missing value(s) is to remove only rows with a missing value in the variable we are interested in. This minimises the number of rows being ignored. The method to do so is similar to the first method in the previous section, with the addition of the name of the column that you want to drop missing values from in the condition.
```{r}
menu[complete.cases(menu$Price),]
```
Apply this method on the `students` dataset to output a table where rows with missing value in the `First.Name` column are removed.
```{r}
# Write your code below
```
## 2.4.3 Replace the missing value(s)
If you know your data well enough, you can replace the missing value(s) in the dataset. This can be done manually by assigning a new value to the cell. Taking the dataset `menu` as an example, the cell with missing value has index `[3, 2]`, a new value can be assigned to it as follows:
```{r}
menu[3, 2] = 4.50
menu # Check to make sure the missing value is being replaced correctly.
```
For a small dataset, the method above is plausible. For a larger dataset, the better way is to replace missing value(s) in a batch using the function `replace_na` in the `dbplyr` package.
```{r}
library(tidyr) # The package which includes the replace_na function.
replace_na(menu, list(Price = 4.50))
```
This method should be used with caution as it replaces all `NA` values in the column with the value specified. Run the code chunk below which shows an example of such an issue.
```{r}
# Creating the dataset for illustration purposes.
menu_example = menu
menu_example[1:4, 2] = NA
menu_example
# Applying the function
replace_na(menu_example, list(Price = 4.50))
```
The `students2` dataset is a complete version of the `students` dataset, refer to it to apply the following changes to the `students` dataset.
1. Replace the missing value in the `First.Name` column using the first method.
2. Replace the missing value in the `Last.Name` column using the second method.
```{r}
# Replacing with first method
# Replacing with second method
```
## 2.4.4 Ignore missing value(s) in function
When applying aggregate functions to a dataset with missing value(s), the value `NA` will be returned. In order to obtain a useful output, we can ask these functions to ignore the missing value(s) by setting the `na.rm` parameter to `TRUE`.
\* Aggregate functions: sum, mean, median, min, max, variance, range, etc.
```{r}
numbers = c(2, 3, 5, 8, NA, 4, 7)
# Using the default value assigned to the na.rm parameter
sum(numbers)
# Setting the na.rm parameter to TRUE to ignore missing value(s)
sum(numbers, na.rm = TRUE)
```
Calculate the mean prices in the `menu_example` dataset.
\* Use `datasetName$colname` to specify which column to apply the function on.
```{r}
# Write your code below
```
---
title: "R Beginners Exercise 2: Data Processing"
output:
word_document: default
html_document:
toc: true
editor_options:
chunk_output_type: console
---
# Introduction
Welcome to R for Beginners Exercise 2! This notebook contains the exercises for the lesson that we will be looking at during exercise breaks throughout the course as well as being a work space for you to use during the session!
To execute a line of code, click on it and press *Ctrl + Enter*.
To execute a chunk of code, click the green run button at the top right corner of the code chunk or highlight the entire code chunk and press *Ctrl + Enter*.
# 2.1 Import dataset
R supports importing datasets from a range of file types. Some common file types supported by R are `.csv`, `.txt`, `.xlxs`. There are two methods to import a dataset.
- Through the "Import Dataset" button in the Environment window (top right).
- Through command *(more information regarding this method can be found in the [R documentation](https://cran.r-project.org/doc/manuals/R-data.html#XML))*
To import data through command, first set working directory to the folder where your data files are stored.
```{r eval = FALSE}
setwd("~/IntroductionToR_P2_ExerciseNotebook_Sol")
```
## 2.1.1 Import from CSV file
```{r}
students = read.csv("Ex2-Students_CSV.csv")
students
```
## 2.1.2 Import from TXT file
```{r}
menu = read.delim("Ex2-Menu_TXT.txt") # with tab character as the delimiter
menu
```
## 2.1.3 Import from Excel file
```{r}
library(readxl) # the package which enables Excel file importation
students2 = read_xlsx("Ex2-Students_Excel.xlsx")
students2
```
# 2.2 Dataset overview
## 2.2.1 View dataset in new window
Use the `view()` function to view the `students` dataset in another window.
```{r}
# The package which contains the view() function to display dataset in a separate window
library(tibble)
# Write your code below
view(students)
```
## 2.2.2 View columns of dataset in console
Use the `glimpse()` function to print all the columns of the `students` dataset to console.
```{r}
# Write your code below
glimpse(students)
```
## 2.2.3 View column names in console
Use the `names()` function to print all the column names of the `students` dataset to console.
```{r}
# Write your code below
names(students)
```
\* In R, you can view column names of a dataset when writing code by typing the dollar sign, `$`, after the dataset's name.
## 2.2.4 Find datatype of a column
The `class()` function in R can be used to find the datatype of a specific column in a table, e.g. `class(tableName$columnName)`. Use the `class()` function to find the datatype of the `Price` column in the `menu` dataset.
```{r}
# Write your code below
class(menu$Price)
```
## 2.2.5 Summary of dataset
Use the `summary()` function to get a summary of all columns in the `menu` dataset.
```{r}
# Write your code below
summary(menu)
```
## 2.2.6 Find unique values in a column
The `unique()` function in R can be used to find all unique values in a specific column in a table, e.g., `unique(tableName$columnName)`. Use the `unique()` function to find all values in the `Price` column in the `menu` dataset, without duplication.
```{r}
# Write your code below
unique(menu$Price)
```
## 2.2.7 Incomplete cases in dataset
In R, incomplete cases are rows in dataset that have `Na` value(s). These incomplete cases can be viewed using the function `complete.cases`.
For example, we can view the incomplete cases in the `menu` dataset by executing the code below.
```{r}
menu[!complete.cases(menu),]
```
Breaking down the code written, a condition is being used with the `[]` operator. The condition states 'rows in `menu` that are not a complete case', utilising the NOT logical operator and the `complete.cases` function. Note that the `,` sign within the `[]` operator is mandatory since `menu` is a 2-dimensional dataframe. The second index has been left out on purpose to include all columns in the result.
With reference to the syntax shown above, find the incomplete cases in `students`
```{r}
# Write your code below
students[!complete.cases(students),]
```
# 2.3 Data ordering
The order of vector elements can be easily reordered using the function `sort()` in R. It can be applied to all basic datatypes to order data in numerical order (numeric, integer, complex, logical datatypes) or alphabetical order (character datatypes). It has a parameter, `decreasing`, to specify whether to sort a vector in ascending or descending order, e.g., `sort(someVector, decreasing = TRUE)`.
\* Executing the code `sort(someVector)` will only print the sorting result to console. To save the sorted vector, you would need to write `someVector = sort(someVector)`.
Execute the `numbers` vector defined below, then sort it in ascending order.
```{r}
numbers = c(4, 3, 5, 1, 2)
# Write your code below
numbers = sort(numbers)
```
Execute the `words` vector defined below, then sort it in descending alphabetical order.
```{r}
words = c("table", "water", "chair", "ruler")
# Write your code below
words = sort(words, decreasing = TRUE)
```
# 2.4 Dealing with incomplete cases
Missing values in R are very problematic as R cannot process a function correctly with their presence, and may even run into error. In order to obtain accurate results, these incomplete cases will need to be dealt with.
## 2.4.1 Drop all incomplete cases
The simplest way to deal with missing value(s) is to drop (remove) all the incomplete cases. Extending from the syntax to find incomplete cases in [2.2.7 Incomplete cases in dataset], we can drop all incomplete cases by omitting the use of NOT operator such as:
```{r}
menu[complete.cases(menu),]
```
Try using this method to output a table which drops all the incomplete cases in `students`.
```{r}
# Write your code below
students[complete.cases(students),]
```
A simpler method uses the function `na.omit()`. Using this function, try to obtain the same output table as above.
```{r}
# Write your code below
na.omit(students)
```
As you might notice, dealing with missing value(s) by dropping all the incomplete cases ignores a lot of rows in the dataset. Therefore, this method is not recommended because it can greatly affect the reliability of the analysis.
## 2.4.2 Drop incomplete cases in specific column
A better way to deal with missing value(s) is to remove only rows with a missing value in the variable we are interested in. This minimises the number of rows being ignored. The method to do so is similar to the first method in the previous section, with the addition of the name of the column that you want to drop missing values from in the condition.
```{r}
menu[complete.cases(menu$Price),]
```
Apply this method on the `students` dataset to output a table where rows with missing value in the `First.Name` column are removed.
```{r}
# Write your code below
students[complete.cases(students$First.Name),]
```
## 2.4.3 Replace the missing value(s)
If you know your data well enough, you can replace the missing value(s) in the dataset. This can be done manually by assigning a new value to the cell. Taking the dataset `menu` as an example, the cell with missing value has index `[3, 2]`, a new value can be assigned to it as follows:
```{r}
menu[3, 2] = 4.50
menu # Check to make sure the missing value is being replaced correctly.
```
For a small dataset, the method above is plausible. For a larger dataset, the better way is to replace missing value(s) in a batch using the function `replace_na` in the `dbplyr` package.
```{r}
library(tidyr) # The package which includes the replace_na function.
replace_na(menu, list(Price = 4.50))
```
This method should be used with caution as it replaces all `NA` values in the column with the value specified. Run the code chunk below which shows an example of such an issue.
```{r}
# Creating the dataset for illustration purposes.
menu_example = menu
menu_example[1:4, 2] = NA
menu_example
# Applying the function
replace_na(menu_example, list(Price = 4.50))
```
The `students2` dataset is a complete version of the `students` dataset, refer to it to apply the following changes to the `students` dataset.
1. Replace the missing value in the `First.Name` column using the first method.
2. Replace the missing value in the `Last.Name` column using the second method.
```{r}
# Replacing with first method
students[1, 2] = "John"
# Replacing with second method
replace_na(students, list(Last.Name = "Brey"))
```
## 2.4.4 Ignore missing value(s) in function
When applying aggregate functions to a dataset with missing value(s), the value `NA` will be returned. In order to obtain a useful output, we can ask these functions to ignore the missing value(s) by setting the `na.rm` parameter to `TRUE`.
\* Aggregate functions: sum, mean, median, min, max, variance, range, etc.
```{r}
numbers = c(2, 3, 5, 8, NA, 4, 7)
# Using the default value assigned to the na.rm parameter
sum(numbers)
# Setting the na.rm parameter to TRUE to ignore missing value(s)
sum(numbers, na.rm = TRUE)
```
Calculate the mean prices in the `menu_example` dataset.
\* Use `datasetName$colname` to specify which column to apply the function on.
```{r}
# Write your code below
mean(menu_example$Price, na.rm = TRUE)
```
Version: 1.0
RestoreWorkspace: Default
SaveWorkspace: Default
AlwaysSaveHistory: Default
EnableCodeIndexing: Yes
UseSpacesForTab: Yes
NumSpacesForTab: 2
Encoding: UTF-8
RnwWeave: Sweave
LaTeX: pdfLaTeX
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment