Added the related files for this part 2 course

cfb4d074 · pfoo · 6499141d · cfb4d074 · cfb4d074 · cfb4d074
Commit cfb4d074 authored 1 month ago by pfoo
--- a/.gitignore
+++ b/.gitignore
+.Rproj.user
+.Rhistory
+.RData
+.Ruserdata
--- a/Ex2-Menu_TXT.txt
+++ b/Ex2-Menu_TXT.txt
+Item	Price
+Burger	10.90
+Fries	4.45
+Juice	NA
+Salad	9.50
+Chicken	10.90
+Drinks	3.33
--- a/Ex2-Students_CSV.csv
+++ b/Ex2-Students_CSV.csv
+ID,First Name,Last Name
+D001,NA,Smith
+D002,Anne,Clare
+D003,Charlotte,NA
+D004,Thomas,Train
--- a/Ex2-Students_Excel.xlsx
+++ b/Ex2-Students_Excel.xlsx
--- a/RBeginnersExercise2.Rmd
+++ b/RBeginnersExercise2.Rmd
+---
+title: "R Beginners Exercise 2: Data Processing"
+output:
+  word_document: default
+  html_document: 
+    toc: true
+editor_options:
+  chunk_output_type: console
+---
+# Introduction
+Welcome to R for Beginners Exercise 2! This notebook contains the exercises for the lesson that we will be looking at during exercise breaks throughout the course as well as being a work space for you to use during the session!
+To execute a line of code, click on it and press *Ctrl + Enter*.
+To execute a chunk of code, click the green run button at the top right corner of the code chunk or highlight the entire code chunk and press *Ctrl + Enter*.
+# 2.1 Import dataset
+R supports importing datasets from a range of file types. Some common file types supported by R are `.csv`, `.txt`, `.xlxs`. There are two methods to import a dataset.
+-   Through the "Import Dataset" button in the Environment window (top right).
+-   Through command *(more information regarding this method can be found in the [R documentation](https://cran.r-project.org/doc/manuals/R-data.html#XML))*
+To import data through command, first set working directory to the folder where your data files are stored.
+```{r}
+setwd("~/IntroductionToR_P2_ExerciseNotebook")
+```
+## 2.1.1 Import from CSV file
+```{r}
+students = read.csv("Ex2-Students_CSV.csv")
+students
+```
+## 2.1.2 Import from TXT file
+```{r}
+menu = read.delim("Ex2-Menu_TXT.txt") # with tab character as the delimiter
+menu
+```
+## 2.1.3 Import from Excel file
+```{r}
+library(readxl) # the package which enables Excel file importation
+students2 = read_xlsx("Ex2-Students_Excel.xlsx")
+students2
+```
+# 2.2 Dataset overview
+## 2.2.1 View dataset in new window
+Use the `view()` function to view the `students` dataset in another window.
+```{r}
+# The package which contains the view() function to display dataset in a separate window
+library(tibble)
+# Write your code below
+```
+## 2.2.2 View columns of dataset in console
+Use the `glimpse()` function to print all the columns of the `students` dataset to console.
+```{r}
+# Write your code below
+```
+## 2.2.3 View column names in console
+Use the `names()` function to print all the column names of the `students` dataset to console.
+```{r}
+# Write your code below
+```
+\* In R, you can view column names of a dataset when writing code by typing the dollar sign, `$`, after the dataset's name.
+## 2.2.4 Find datatype of a column
+The `class()` function in R can be used to find the datatype of a specific column in a table, e.g. `class(tableName$columnName)`. Use the `class()` function to find the datatype of the `Price` column in the `menu` dataset.
+```{r}
+# Write your code below
+```
+## 2.2.5 Summary of dataset
+Use the `summary()` function to get a summary of all columns in the `menu` dataset.
+```{r}
+# Write your code below
+```
+## 2.2.6 Find unique values in a column
+The `unique()` function in R can be used to find all unique values in a specific column in a table, e.g., `unique(tableName$columnName)`. Use the `unique()` function to find all values in the `Price` column in the `menu` dataset, without duplication.
+```{r}
+# Write your code below
+```
+## 2.2.7 Incomplete cases in dataset
+In R, incomplete cases are rows in dataset that have `Na` value(s). These incomplete cases can be viewed using the function `complete.cases`.
+For example, we can view the incomplete cases in the `menu` dataset by executing the code below.
+```{r}
+menu[!complete.cases(menu),]
+```
+Breaking down the code written, a condition is being used with the `[]` operator. The condition states 'rows in `menu` that are not a complete case', utilising the NOT logical operator and the `complete.cases` function. Note that the `,` sign within the `[]` operator is mandatory since `menu` is a 2-dimensional dataframe. The second index has been left out on purpose to include all columns in the result.
+With reference to the syntax shown above, find the incomplete cases in `students`
+```{r}
+# Write your code below
+```
+# 2.3 Data ordering
+The order of vector elements can be easily reordered using the function `sort()` in R. It can be applied to all basic datatypes to order data in numerical order (numeric, integer, complex, logical datatypes) or alphabetical order (character datatypes). It has a parameter, `decreasing`, to specify whether to sort a vector in ascending or descending order, e.g., `sort(someVector, decreasing = TRUE)`.
+\* Executing the code `sort(someVector)` will only print the sorting result to console. To save the sorted vector, you would need to write `someVector = sort(someVector)`.
+Execute the `numbers` vector defined below, then sort it in ascending order.
+```{r}
+numbers = c(4, 3, 5, 1, 2)
+# Write your code below
+```
+Execute the `words` vector defined below, then sort it in descending alphabetical order.
+```{r}
+words = c("table", "water", "chair", "ruler")
+# Write your code below
+```
+# 2.4 Dealing with incomplete cases
+Missing values in R are very problematic as R cannot process a function correctly with their presence, and may even run into error. In order to obtain accurate results, these incomplete cases will need to be dealt with.
+## 2.4.1 Drop all incomplete cases
+The simplest way to deal with missing value(s) is to drop (remove) all the incomplete cases. Extending from the syntax to find incomplete cases in [2.2.7 Incomplete cases in dataset], we can drop all incomplete cases by omitting the use of NOT operator such as:
+```{r}
+menu[complete.cases(menu),]
+```
+Try using this method to output a table which drops all the incomplete cases in `students`.
+```{r}
+# Write your code below
+```
+A simpler method uses the function `na.omit()`. Using this function, try to obtain the same output table as above.
+```{r}
+# Write your code below
+```
+As you might notice, dealing with missing value(s) by dropping all the incomplete cases ignores a lot of rows in the dataset. Therefore, this method is not recommended because it can greatly affect the reliability of the analysis.
+## 2.4.2 Drop incomplete cases in specific column
+A better way to deal with missing value(s) is to remove only rows with a missing value in the variable we are interested in. This minimises the number of rows being ignored. The method to do so is similar to the first method in the previous section, with the addition of the name of the column that you want to drop missing values from in the condition.
+```{r}
+menu[complete.cases(menu$Price),]
+```
+Apply this method on the `students` dataset to output a table where rows with missing value in the `First.Name` column are removed.
+```{r}
+# Write your code below
+```
+## 2.4.3 Replace the missing value(s)
+If you know your data well enough, you can replace the missing value(s) in the dataset. This can be done manually by assigning a new value to the cell. Taking the dataset `menu` as an example, the cell with missing value has index `[3, 2]`, a new value can be assigned to it as follows:
+```{r}
+menu[3, 2] = 4.50
+menu # Check to make sure the missing value is being replaced correctly. 
+```
+For a small dataset, the method above is plausible. For a larger dataset, the better way is to replace missing value(s) in a batch using the function `replace_na` in the `dbplyr` package.
+```{r}
+library(tidyr) # The package which includes the replace_na function. 
+replace_na(menu, list(Price = 4.50))
+```
+This method should be used with caution as it replaces all `NA` values in the column with the value specified. Run the code chunk below which shows an example of such an issue.
+```{r}
+# Creating the dataset for illustration purposes. 
+menu_example = menu
+menu_example[1:4, 2] = NA
+menu_example
+# Applying the function
+replace_na(menu_example, list(Price = 4.50))
+```
+The `students2` dataset is a complete version of the `students` dataset, refer to it to apply the following changes to the `students` dataset.
+1.  Replace the missing value in the `First.Name` column using the first method.
+2.  Replace the missing value in the `Last.Name` column using the second method.
+```{r}
+# Replacing with first method
+# Replacing with second method
+```
+## 2.4.4 Ignore missing value(s) in function
+When applying aggregate functions to a dataset with missing value(s), the value `NA` will be returned. In order to obtain a useful output, we can ask these functions to ignore the missing value(s) by setting the `na.rm` parameter to `TRUE`.
+\* Aggregate functions: sum, mean, median, min, max, variance, range, etc.
+```{r}
+numbers = c(2, 3, 5, 8, NA, 4, 7)
+# Using the default value assigned to the na.rm parameter
+sum(numbers)
+# Setting the na.rm parameter to TRUE to ignore missing value(s)
+sum(numbers, na.rm = TRUE)
+```
+Calculate the mean prices in the `menu_example` dataset.
+\* Use `datasetName$colname` to specify which column to apply the function on.
+```{r}
+# Write your code below
+```
--- a/RBeginnersExercise2_Sol.Rmd
+++ b/RBeginnersExercise2_Sol.Rmd
+---
+title: "R Beginners Exercise 2: Data Processing"
+output:
+  word_document: default
+  html_document: 
+    toc: true
+editor_options:
+  chunk_output_type: console
+---
+# Introduction
+Welcome to R for Beginners Exercise 2! This notebook contains the exercises for the lesson that we will be looking at during exercise breaks throughout the course as well as being a work space for you to use during the session!
+To execute a line of code, click on it and press *Ctrl + Enter*.
+To execute a chunk of code, click the green run button at the top right corner of the code chunk or highlight the entire code chunk and press *Ctrl + Enter*.
+# 2.1 Import dataset
+R supports importing datasets from a range of file types. Some common file types supported by R are `.csv`, `.txt`, `.xlxs`. There are two methods to import a dataset.
+-   Through the "Import Dataset" button in the Environment window (top right).
+-   Through command *(more information regarding this method can be found in the [R documentation](https://cran.r-project.org/doc/manuals/R-data.html#XML))*
+To import data through command, first set working directory to the folder where your data files are stored.
+```{r eval = FALSE}
+setwd("~/IntroductionToR_P2_ExerciseNotebook_Sol")
+```
+## 2.1.1 Import from CSV file
+```{r}
+students = read.csv("Ex2-Students_CSV.csv")
+students
+```
+## 2.1.2 Import from TXT file
+```{r}
+menu = read.delim("Ex2-Menu_TXT.txt") # with tab character as the delimiter
+menu
+```
+## 2.1.3 Import from Excel file
+```{r}
+library(readxl) # the package which enables Excel file importation
+students2 = read_xlsx("Ex2-Students_Excel.xlsx")
+students2
+```
+# 2.2 Dataset overview
+## 2.2.1 View dataset in new window
+Use the `view()` function to view the `students` dataset in another window.
+```{r}
+# The package which contains the view() function to display dataset in a separate window
+library(tibble)
+# Write your code below
+view(students)
+```
+## 2.2.2 View columns of dataset in console
+Use the `glimpse()` function to print all the columns of the `students` dataset to console.
+```{r}
+# Write your code below
+glimpse(students)
+```
+## 2.2.3 View column names in console
+Use the `names()` function to print all the column names of the `students` dataset to console.
+```{r}
+# Write your code below
+names(students)
+```
+\* In R, you can view column names of a dataset when writing code by typing the dollar sign, `$`, after the dataset's name.
+## 2.2.4 Find datatype of a column
+The `class()` function in R can be used to find the datatype of a specific column in a table, e.g. `class(tableName$columnName)`. Use the `class()` function to find the datatype of the `Price` column in the `menu` dataset.
+```{r}
+# Write your code below
+class(menu$Price)
+```
+## 2.2.5 Summary of dataset
+Use the `summary()` function to get a summary of all columns in the `menu` dataset.
+```{r}
+# Write your code below
+summary(menu)
+```
+## 2.2.6 Find unique values in a column
+The `unique()` function in R can be used to find all unique values in a specific column in a table, e.g., `unique(tableName$columnName)`. Use the `unique()` function to find all values in the `Price` column in the `menu` dataset, without duplication.
+```{r}
+# Write your code below
+unique(menu$Price)
+```
+## 2.2.7 Incomplete cases in dataset
+In R, incomplete cases are rows in dataset that have `Na` value(s). These incomplete cases can be viewed using the function `complete.cases`.
+For example, we can view the incomplete cases in the `menu` dataset by executing the code below.
+```{r}
+menu[!complete.cases(menu),]
+```
+Breaking down the code written, a condition is being used with the `[]` operator. The condition states 'rows in `menu` that are not a complete case', utilising the NOT logical operator and the `complete.cases` function. Note that the `,` sign within the `[]` operator is mandatory since `menu` is a 2-dimensional dataframe. The second index has been left out on purpose to include all columns in the result.
+With reference to the syntax shown above, find the incomplete cases in `students`
+```{r}
+# Write your code below
+students[!complete.cases(students),]
+```
+# 2.3 Data ordering
+The order of vector elements can be easily reordered using the function `sort()` in R. It can be applied to all basic datatypes to order data in numerical order (numeric, integer, complex, logical datatypes) or alphabetical order (character datatypes). It has a parameter, `decreasing`, to specify whether to sort a vector in ascending or descending order, e.g., `sort(someVector, decreasing = TRUE)`.
+\* Executing the code `sort(someVector)` will only print the sorting result to console. To save the sorted vector, you would need to write `someVector = sort(someVector)`.
+Execute the `numbers` vector defined below, then sort it in ascending order.
+```{r}
+numbers = c(4, 3, 5, 1, 2)
+# Write your code below
+numbers = sort(numbers)
+```
+Execute the `words` vector defined below, then sort it in descending alphabetical order.
+```{r}
+words = c("table", "water", "chair", "ruler")
+# Write your code below
+words = sort(words, decreasing = TRUE)
+```
+# 2.4 Dealing with incomplete cases
+Missing values in R are very problematic as R cannot process a function correctly with their presence, and may even run into error. In order to obtain accurate results, these incomplete cases will need to be dealt with.
+## 2.4.1 Drop all incomplete cases
+The simplest way to deal with missing value(s) is to drop (remove) all the incomplete cases. Extending from the syntax to find incomplete cases in [2.2.7 Incomplete cases in dataset], we can drop all incomplete cases by omitting the use of NOT operator such as:
+```{r}
+menu[complete.cases(menu),]
+```
+Try using this method to output a table which drops all the incomplete cases in `students`.
+```{r}
+# Write your code below
+students[complete.cases(students),]
+```
+A simpler method uses the function `na.omit()`. Using this function, try to obtain the same output table as above.
+```{r}
+# Write your code below
+na.omit(students)
+```
+As you might notice, dealing with missing value(s) by dropping all the incomplete cases ignores a lot of rows in the dataset. Therefore, this method is not recommended because it can greatly affect the reliability of the analysis.
+## 2.4.2 Drop incomplete cases in specific column
+A better way to deal with missing value(s) is to remove only rows with a missing value in the variable we are interested in. This minimises the number of rows being ignored. The method to do so is similar to the first method in the previous section, with the addition of the name of the column that you want to drop missing values from in the condition.
+```{r}
+menu[complete.cases(menu$Price),]
+```
+Apply this method on the `students` dataset to output a table where rows with missing value in the `First.Name` column are removed.
+```{r}
+# Write your code below
+students[complete.cases(students$First.Name),]
+```
+## 2.4.3 Replace the missing value(s)
+If you know your data well enough, you can replace the missing value(s) in the dataset. This can be done manually by assigning a new value to the cell. Taking the dataset `menu` as an example, the cell with missing value has index `[3, 2]`, a new value can be assigned to it as follows:
+```{r}
+menu[3, 2] = 4.50
+menu # Check to make sure the missing value is being replaced correctly. 
+```
+For a small dataset, the method above is plausible. For a larger dataset, the better way is to replace missing value(s) in a batch using the function `replace_na` in the `dbplyr` package.
+```{r}
+library(tidyr) # The package which includes the replace_na function. 
+replace_na(menu, list(Price = 4.50))
+```
+This method should be used with caution as it replaces all `NA` values in the column with the value specified. Run the code chunk below which shows an example of such an issue.
+```{r}
+# Creating the dataset for illustration purposes. 
+menu_example = menu
+menu_example[1:4, 2] = NA
+menu_example
+# Applying the function
+replace_na(menu_example, list(Price = 4.50))
+```
+The `students2` dataset is a complete version of the `students` dataset, refer to it to apply the following changes to the `students` dataset.
+1.  Replace the missing value in the `First.Name` column using the first method.
+2.  Replace the missing value in the `Last.Name` column using the second method.
+```{r}
+# Replacing with first method
+students[1, 2] = "John"
+# Replacing with second method
+replace_na(students, list(Last.Name = "Brey"))
+```
+## 2.4.4 Ignore missing value(s) in function
+When applying aggregate functions to a dataset with missing value(s), the value `NA` will be returned. In order to obtain a useful output, we can ask these functions to ignore the missing value(s) by setting the `na.rm` parameter to `TRUE`.
+\* Aggregate functions: sum, mean, median, min, max, variance, range, etc.
+```{r}
+numbers = c(2, 3, 5, 8, NA, 4, 7)
+# Using the default value assigned to the na.rm parameter
+sum(numbers)
+# Setting the na.rm parameter to TRUE to ignore missing value(s)
+sum(numbers, na.rm = TRUE)
+```
+Calculate the mean prices in the `menu_example` dataset.
+\* Use `datasetName$colname` to specify which column to apply the function on.
+```{r}
+# Write your code below
+mean(menu_example$Price, na.rm = TRUE)
+```
--- a/part-2-data-processing.Rproj
+++ b/part-2-data-processing.Rproj
+Version: 1.0
+RestoreWorkspace: Default
+SaveWorkspace: Default
+AlwaysSaveHistory: Default
+EnableCodeIndexing: Yes
+UseSpacesForTab: Yes
+NumSpacesForTab: 2
+Encoding: UTF-8
+RnwWeave: Sweave
+LaTeX: pdfLaTeX