Commit faa375fe authored by pfoo
Created the exercise notebook, with answers removed.

parent b774313d
---
title: "R Beginners Exercise 3: Data Visualisation"
output:
  word_document: default
  html_document:
    toc: true
editor_options:
  chunk_output_type: console
---
# Introduction
Welcome to R for Beginners Exercise 3! This notebook contains the exercises that we will work through during the exercise breaks in the lesson, and it doubles as a workspace for you to use during the session.
To execute a line of code, click on it and press *Ctrl + Enter*.
To execute a chunk of code, click the green run button at the top right corner of the code chunk or highlight the entire code chunk and press *Ctrl + Enter*.
# 3.1 Introduction to built-in datasets
For this exercise, we will be using two different datasets. Both are built into R, so they can be used simply by calling their names, without the need to load any packages. Run the following code chunks to get a quick overview of each dataset.
## 3.1.1 Iris dataset
```{r}
# The dimension of the dataset
sprintf("The iris dataset contains %.0f rows and %.0f columns. ",
        dim(iris)[1], dim(iris)[2])
# The columns' name and datatype
sapply(iris, class)
# Check for incomplete case
iris[!complete.cases(iris),]
# A summary of each column
summary(iris)
# View more information of iris dataset in the documentation
?iris
```
## 3.1.2 Pressure dataset
```{r}
# The dimension of the dataset
sprintf("The pressure dataset contains %.0f rows and %.0f columns. ",
        dim(pressure)[1], dim(pressure)[2])
# The columns' name and datatype
sapply(pressure, class)
# Check for incomplete case
pressure[!complete.cases(pressure),]
# A summary of each column
summary(pressure)
# View more information of pressure dataset in the documentation
?pressure
```
# 3.2 Segment canvas
In R, it can be helpful to create multiple plots on a single canvas. This can be done by splitting the canvas into a specific grid size using the function `par()`, with the argument `mfrow` that takes a vector of 2 numeric values, i.e., `c(num1, num2)`. The first value in the vector represents the number of rows whereas the second value represents the number of columns.
\* `rnorm(1000, 0, 1)` generates 1000 random samples from the normal distribution N(0, 1), i.e., with mean = 0 and standard deviation = 1. It is used for demonstration purposes in this section only.
```{r}
# One plot in the canvas
par(mfrow = c(1, 1))
hist(rnorm(1000, 0, 1))
# Segment the canvas to fit four plots
par(mfrow = c(2, 2))
hist(rnorm(1000, 0, 1))
hist(rnorm(1000, 0, 1))
hist(rnorm(1000, 0, 1))
hist(rnorm(1000, 0, 1))
# Reset the canvas
par(mfrow = c(1, 1))
```
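As a complementary sketch, `par()` invisibly returns the previous values of the settings it changes, so the earlier layout can be stored and restored rather than reset by hand. The variable name `old_par` and the plot titles below are illustrative assumptions, not part of the exercise.

```r
# par() returns the old values of the settings it changes,
# so they can be saved and put back later
old_par <- par(mfrow = c(1, 2))   # one row, two columns

hist(rnorm(1000, 0, 1), main = "Sample A")
hist(rnorm(1000, 0, 1), main = "Sample B")

# Restore the layout that was active before the change
par(old_par)
```

This pattern is handy when the original layout is not known in advance, for example inside a helper function.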
# 3.3 Basic plots
## 3.3.1 Scatter plot
The easiest plot to create in R is the scatter plot, using the `plot()` function, e.g., `plot(x_axis_var, y_axis_var)`.
Try plotting a scatter plot for the `iris` dataset, using `Petal.Width` as the x-axis and `Petal.Length` as the y-axis.
\* Reminder: The syntax to select a column of a dataset is `dataset$columnName`.
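To illustrate the `plot(x_axis_var, y_axis_var)` pattern without giving away the exercise, the sketch below uses `cars`, another dataset built into R. Choosing `cars` is an assumption made purely for demonstration; it is not otherwise used in this notebook.

```r
# `cars` has two columns: speed (mph) and dist (stopping distance, ft)
head(cars)

# Scatter plot with speed on the x-axis and stopping distance on the y-axis
plot(cars$speed, cars$dist)
```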
```{r}
# Write your code below
```
### 3.3.1.1 Plot title and labels
Notice that the scatter plot above lacks a descriptive plot title, and the axis labels are not easily understandable by someone unfamiliar with the dataset. To enhance the plot's clarity, you can add the following arguments in the `plot()` function:
- `main`: Adds a title to the plot.
- `xlab`: Adds an x-axis label to the plot.
- `ylab`: Adds a y-axis label to the plot.
Complete the code by adding the title, x-axis label and y-axis label.
\* To specify an argument when calling the function, use `function(..., argument = argumentValue)`
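A minimal sketch of the three arguments in use, again on the built-in `cars` dataset (an assumption for illustration only; the label wording is hypothetical):

```r
# Same scatter plot idea, now with a title and readable axis labels
plot(cars$speed, cars$dist,
     main = "Stopping Distance against Speed",
     xlab = "Speed (mph)",
     ylab = "Stopping distance (ft)")
```

Including the unit in each axis label helps readers who have not seen the dataset documentation.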
```{r}
# Complete the following code
plot(iris$Petal.Width, iris$Petal.Length)
```
### 3.3.1.2 Best fit line
In a scatter plot, a best-fit line is often used to summarise the relationship between the two variables and to provide estimates. This line can be computed using the `lm()` function, which takes a formula as input. To compute the best-fit line, the code looks like `lm(y_axis_var ~ x_axis_var)`, where the dependent variable is on the left and the independent variable is on the right. The tilde `~` character signifies the relationship between the variables.
The best-fit line can be added directly to the same scatter plot by calling the `abline()` function, e.g., `abline(best_fit_line)`.
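The two steps can be sketched as follows, using the built-in `cars` dataset as a stand-in (an assumption for illustration; `best_fit_line` is simply the name used in the text above):

```r
# Draw the scatter plot first; abline() adds to an existing plot
plot(cars$speed, cars$dist)

# Fit the line: dependent variable (y) on the left of ~, independent (x) on the right
best_fit_line <- lm(cars$dist ~ cars$speed)

# Inspect the intercept and slope of the fitted line
coef(best_fit_line)

# Add the fitted line to the scatter plot
abline(best_fit_line)
```

Swapping the two sides of the formula fits a different line, so it is worth checking that the y-axis variable is on the left.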
Write the code to add a best fit line to the scatter plot plotted in the previous section.
```{r}
# Write your code below
```
## 3.3.2 Line graph
A line graph is drawn with the same `plot()` function introduced previously. To plot a line graph in R, simply add the `type = "l"` argument when calling the `plot()` function, i.e., `plot(x_axis_var, y_axis_var, type = "l")`.
Try plotting a line graph for the `pressure` dataset, providing a descriptive plot title, x-axis label, and y-axis label.
\* Tip: The `plot()` function can generate a basic scatter/line plot when provided with a dataset containing exactly two columns. For example, `plot(dataset)` uses the first column as the x-axis and the second column as the y-axis.
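As a hedged illustration on different data, the sketch below draws a line graph of the built-in `women` dataset (height in inches, weight in pounds). Using `women` is an assumption for demonstration; it is not part of the exercise.

```r
# plot() joins the points in the order they appear in the data,
# so the x values should already be sorted
is.unsorted(women$height)   # FALSE means the heights are in increasing order

# Line graph: same call as a scatter plot, plus type = "l"
plot(women$height, women$weight, type = "l",
     main = "Average Weight against Height",
     xlab = "Height (in)",
     ylab = "Weight (lbs)")
```

If the x values were not sorted, the line would zigzag back and forth across the plot.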
```{r}
# Write your code below
```
### 3.3.2.1 Multiple lines on a single plot
Sometimes it can be useful to plot multiple lines on the same plot for comparison purposes. Additional lines can be added to the plot using the `lines()` function. Note that this function only works after the `plot()` function has been called.
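A minimal sketch of the idea on the built-in `women` dataset (an assumption for illustration; `weight_shifted` and `y_range` are hypothetical names). Because `lines()` draws on the existing axes without rescaling them, the y-axis range is widened up front so that both lines fit.

```r
# Derived series for illustration only: every weight shifted up by 10 lbs
weight_shifted <- women$weight + 10

# lines() does not rescale the axes, so make room for both series first
y_range <- range(c(women$weight, weight_shifted))

# First line: plot() creates the canvas and axes
plot(women$height, women$weight, type = "l", ylim = y_range,
     main = "Original and Shifted Weights",
     xlab = "Height (in)",
     ylab = "Weight (lbs)")

# Second line: lines() adds to the plot that already exists
lines(women$height, weight_shifted)
```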
For demonstration purposes, the following code creates a dataset named `pressure_new` derived from the existing `pressure` dataset. Using the `lines()` function, plot a line of this derived dataset on the line graph plotted previously.
```{r}
# Create the derived dataset
pressure_new = 0.9 * pressure
# Plot the line on top of the line graph
# Write your code below
```
### 3.3.2.2 Plot with colour and legend
After adding a line to the previous line graph, notice that both lines are plotted in the same colour, which makes it difficult to tell them apart. Hence, it is good practice to use colours in plots. This can be done through the `col` argument, which is supported by most basic plot functions.
Plot a line graph for the `pressure` dataset with the colour `blue`, and add a line for the `pressure_new` dataset in `red`.
```{r}
# Write your code to plot the pressure dataset below
# Write your code to add the line for pressure_new dataset below
```
With differently coloured lines, it is obvious that they represent different datasets. However, it is not clear from the plot which line represents which dataset. To make the plot easier to understand, a legend can be added to the plot by calling the `legend()` function. The required arguments are:
- `x`: Specifies the location of the legend. For simplicity, it is common to use a predefined keyword such as `"topleft"`, `"bottomleft"`, etc. (Run `?legend` in the console to find out more)
- `legend`: A list (vector) of labels to be presented in the legend.
- `fill`: A list (vector) of corresponding colours to create filled checkboxes in the legend.
Add a legend to the line graph with two lines, using the three arguments introduced.
\* For an argument that takes a list (vector), use `function(argument = c(value1, value2, ...))`
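The three arguments can be sketched together on the built-in `women` dataset (an assumption for illustration; `series_labels` and `series_colours` are hypothetical names). The labels and colours are matched by position, so both vectors must be in the same order.

```r
# Illustrative second series: weights shifted up by 10 lbs
weight_shifted <- women$weight + 10
y_range <- range(c(women$weight, weight_shifted))

# Labels and colours are paired by position
series_labels  <- c("Original", "Shifted by 10 lbs")
series_colours <- c("blue", "red")

plot(women$height, women$weight, type = "l", col = series_colours[1],
     ylim = y_range,
     main = "Original and Shifted Weights",
     xlab = "Height (in)",
     ylab = "Weight (lbs)")
lines(women$height, weight_shifted, col = series_colours[2])

# Add the legend in the top-left corner of the plot
legend(x = "topleft", legend = series_labels, fill = series_colours)
```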
```{r}
# Write your code to add the legend below
```
## 3.3.3 Bar chart
Bar charts are often used to visualise a frequency table, plotted in R using the `barplot()` function. Since neither the `iris` nor the `pressure` dataset is a frequency table, this section uses the `table()` function to create a frequency table from a column of the `iris` dataset for demonstration purposes.
\* Frequency table: A table with the count of each unique value in the dataset.
```{r}
barplot(table(iris$Petal.Length),
        main = "Frequency of Iris' Petal Length",
        xlab = "Petal Length (cm)",
        ylab = "Frequency")
```
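To see what `table()` hands to `barplot()`, the sketch below builds a frequency table from a small hand-made vector. The vector `colours_seen` and its values are made up for illustration and are not part of the exercise data.

```r
# A small made-up vector with repeated values
colours_seen <- c("red", "blue", "red", "green", "blue", "red")

# table() counts how many times each unique value occurs
colour_counts <- table(colours_seen)
colour_counts

# barplot() draws one bar per unique value, with height equal to its count
barplot(colour_counts,
        main = "Frequency of Colours Seen",
        xlab = "Colour",
        ylab = "Frequency")
```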
Try to plot a bar chart of the `Petal.Width` column in the `iris` dataset with a descriptive plot title, x-axis label, and y-axis label.
```{r}
# Write your code below
```
## 3.3.4 Histogram
A histogram is often used to observe the distribution of values in a dataset, plotted in R using the `hist()` function. Note that this function takes a single numeric column (vector), not a whole table. Create a histogram of the `Sepal.Width` column in the `iris` dataset with a descriptive plot title, x-axis label, and y-axis label.
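As a hedged illustration on different data, the sketch below draws a histogram of the `waiting` column of the built-in `faithful` dataset (waiting time in minutes between eruptions). Using `faithful` and the name `waiting_hist` are assumptions for demonstration only.

```r
# hist() groups a numeric vector into bins and counts the values in each bin.
# It also returns the binning it used, which can be stored and inspected.
waiting_hist <- hist(faithful$waiting,
                     main = "Waiting Time between Eruptions",
                     xlab = "Waiting time (min)",
                     ylab = "Frequency")

# Bin boundaries and the number of values that fell into each bin
waiting_hist$breaks
waiting_hist$counts

# Every observation lands in exactly one bin
sum(waiting_hist$counts) == nrow(faithful)
```

Unlike a bar chart of a frequency table, which has one bar per unique value, a histogram has one bar per range of values.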
```{r}
# Write your code below
```
## 3.3.5 Box plot
A box plot is often used to visualise the summary statistics of a dataset, showing:
- median
- lower quartile (first quartile)
- upper quartile (third quartile)
- min (excluding outliers)
- max (excluding outliers)
- outliers
It is plotted in R using the `boxplot()` function. Unlike `hist()`, `boxplot()` can be applied to a table with multiple columns. Produce a boxplot of the `iris` dataset with a descriptive plot title.
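The numbers behind a box plot can be inspected with `boxplot.stats()`. The sketch below uses the `dist` column of the built-in `cars` dataset, an assumption for illustration only; `dist_stats` is a hypothetical name. Note that the box edges are "hinges", which are close to, but not always identical to, the quartiles.

```r
# Compute the values that boxplot() would draw for this column
dist_stats <- boxplot.stats(cars$dist)

# Five numbers: lower whisker, lower hinge, median, upper hinge, upper whisker
dist_stats$stats

# Points beyond the whiskers, drawn individually as outliers
dist_stats$out

# The corresponding box plot
boxplot(cars$dist, main = "Boxplot of Stopping Distance")
```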
```{r}
# Write your code below
```
When `boxplot()` is used on a specific column in a table or on a vector, the `horizontal = TRUE` argument is often applied to rotate the boxplot for better visualisation. Produce a boxplot of the same column used to plot the histogram, rotate it with the `horizontal = TRUE` argument, and give it a descriptive plot title.
```{r}
boxplot(iris$Sepal.Width, horizontal = TRUE,
        main = "Boxplot of Iris' Sepal Width")
```
# 3.4 Customisation
R provides various built-in arguments to customise a plot. This section will introduce some of the commonly used arguments for such purpose. The full list of arguments for plot customisation can be found in the documentation (run the code chunk below):
```{r}
?par
```
## 3.4.1 Types of points
The plot point's style and size can be customised with the `pch` and `cex` arguments, respectively. The `pch` argument has a list of pre-defined styles represented by integers. Run the code chunk below to check the pre-defined point styles in R.
```{r}
?pch
```
The `cex` argument controls the size of the point relative to the default size of 1. Hence, a value larger than 1 enlarges the plot point, while a value smaller than 1 shrinks it.
Try applying different combinations of `pch` and `cex` to the plot below and see how the plot points change.
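As a quick visual reference, the sketch below draws each of the integer point styles 0 to 25 with its code underneath. The name `symbol_codes` and the layout values are illustrative assumptions.

```r
# Integer codes for the pre-defined point styles
symbol_codes <- 0:25

# Draw one point per style, enlarged with cex so the shapes are easy to see
plot(symbol_codes, rep(1, length(symbol_codes)),
     pch = symbol_codes, cex = 1.5,
     ylim = c(0.5, 1.5), yaxt = "n",
     main = "Point styles by pch value",
     xlab = "pch value", ylab = "")

# Print each code just below its symbol, slightly smaller than default text
text(symbol_codes, rep(0.8, length(symbol_codes)),
     labels = symbol_codes, cex = 0.7)
```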
```{r}
# Complete the code below
plot(pressure)
```
## 3.4.2 Types of lines
The plot line's style and width can be customised with the `lty` and `lwd` arguments, respectively. Similar to points, R has a list of pre-defined line styles represented by integers. A description of these styles can be found in the documentation by running `?par`. The `lwd` argument works like `cex`, controlling the width of the line relative to the default width of 1. A value larger than 1 results in a thicker line, while a value smaller than 1 results in a thinner line.
Try applying different combinations of `lty` and `lwd` to the plot below and see how the line changes.
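For a side-by-side view, the sketch below draws one horizontal line for each of the integer line styles 1 to 6. The name `line_types` and the empty-canvas setup are illustrative assumptions.

```r
# Integer codes for the six standard line styles
line_types <- 1:6

# Set up an empty canvas (type = "n" draws axes but no points)
plot(0, 0, type = "n",
     xlim = c(0, 1), ylim = c(0.5, 6.5), xaxt = "n",
     main = "Line styles by lty value",
     xlab = "", ylab = "lty value")

# One horizontal line per style, at the height of its code, drawn thicker than default
abline(h = line_types, lty = line_types, lwd = 2)
```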
```{r}
# Complete the code below
plot(pressure, type = "l")
```
## 3.4.3 Axis limits
Sometimes, it can be helpful to shorten an axis range for a focused view. This can be achieved by specifying the range using the `xlim` and `ylim` arguments, each of which takes a vector of two values, `c(lower, upper)`. The code below demonstrates how to specify the x-axis range.
```{r}
plot(pressure, type = "l", xlim = c(250, 350))
```
Using a similar approach, set the y-axis range from 200 to 400 using the `ylim` argument.
```{r}
# Complete the code below by adding the ylim argument
plot(pressure, type = "l", xlim = c(200, 350))
```