First, follow the instructions in NLU+ Coursework 1 and create the `nlu` conda environment. Then activate the environment and install Jupyter Notebook with pip:
```shell
pip install notebook
```
Once installed, launch Jupyter notebook with:
```shell
jupyter notebook
```
and open the `lab1.ipynb` file.
### Alternative: Lab1-specific environment
As an alternative, you can create and work with a virtual environment just for Lab1.
Simply run:
```shell
conda create -n lab1 python=3.7
```
then activate the environment with `conda activate lab1` and install the packages the lab needs; at minimum this means NumPy and Jupyter Notebook, e.g. `pip install numpy notebook`.
"#### Authors: Christos Baziotis, Lexi Birch, Frank Keller\n",
"These exercises aim to support you in successfully completing your assignments. Here, we will focus on NumPy, which is a library for working with numerical data in Python and is what you will be using in your first assignment. It would be helpful to first read the official NumPy quickstart [guide](https://numpy.org/doc/stable/user/absolute_beginners.html). While here we will review some operations that will help you with your assignment, you should use the official guide as a reference.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3d3209c9",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"import numpy as np\n",
"np.set_printoptions(suppress=True) # suppresses the use of scientific notation for small numbers\n",
"\n",
"# you may use this function to print a numpy array and its properties\n",
"def print_array(arr):\n",
" print(arr)\n",
" print(\"shape:\", arr.shape)\n",
" print(\"type:\", arr.dtype.type)\n",
" print()"
]
},
{
"cell_type": "markdown",
"id": "40374a13",
"metadata": {},
"source": [
"# Load the data\n",
"We will be working with the Wine Data Set from UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Wine). It is contains the results of a chemical analysis of 178 wines. The wines are categorized into 3 classes and described by 13 attributes. All attributes are continuous.\n",
"\n",
"The dataset is stored in the `wine.csv` file. The first row contains the column names and the rest of the rows the corresponding values. Open the file and check its structure. The columns in the data are as follows:\n",
"\n",
" 1. *Type*: The type of wine, into one of three classes, 1 (59 obs), 2(71 obs), and 3 (48 obs).\n",
" 2. Alcohol\n",
" 3. Malic acid\n",
" 4. Ash\n",
" 5. Alcalinity of ash\n",
" 6. Magnesium\n",
" 7. Total phenols\n",
" 8. Flavanoids\n",
" 9. Nonflavanoid phenols\n",
" 10. Proanthocyanins\n",
" 11. Color intensity\n",
" 12. Hue\n",
" 13. D280/OD315 of diluted wines\n",
" 14. Proline\n",
"\n",
"**NOTE**: As you can see, the first attribute is the *class* identifier (1-3)\n",
"\n",
"\n",
"\n",
"First, we naively read all the data into a regular 2D Python list (i.e., list of lists), named `data`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "cd548302",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# solution\n",
"data = []\n",
"with open(\"wine.csv\") as f:\n",
" for line in f:\n",
" row = line.strip().split(\",\")\n",
" data.append(row)\n",
" \n",
"print(data)"
]
},
{
"cell_type": "markdown",
"id": "6f138e6b",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### 1. Initialize a Numpy Array\n",
"Create a numpy array (named `data` ) out of the Python array and check its shape and data type.\n",
"What is the data type of the numpy array and why? How do numpy arrays differ from regular Python lists?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c0ec94d7",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# write your code here"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1cb13efd",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# let's see what is in the array\n",
"print_array(data)"
]
},
{
"cell_type": "markdown",
"id": "4d572df0",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### 2. Array Indexing and Slicing\n",
"Now we need to split the data into separate numpy arrays, in order to disentangle the attribute names, attribute values and class labels. To do this you need to use numpy array slicing.\n",
"Do the following:\n",
" 1. Store the 13 attribute names into an 1D numpy array, called `names`. This means you should ignore the first column (i.e., Wine type).\n",
" 2. Store the class labels (i.e., Wine type) into an 1D numpy array, called `classes`.\n",
" 3. Store the attribute values into an 2D numpy array, called `attributes`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2712f977",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# write your code here"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "65acf4d2",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"print(\"names\")\n",
"print_array(names)\n",
"\n",
"print(\"classes\")\n",
"print_array(classes)\n",
"\n",
"print(\"attributes\")\n",
"print_array(attributes)\n",
"\n",
"assert names.shape == (13,)\n",
"assert classes.shape == (178,)\n",
"assert attributes.shape == (178, 13)"
]
},
{
"cell_type": "markdown",
"id": "90ce761e",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"4\\. Using the `attributes` array, print the second to last row, without its last 3 elements.\n",
"Cast each numpy array to the appropriate data type. We need to represent numerical values with the appropriate data type to be able to do numerical operations.\n",
"1. The `attributes` array contains continuous values, therefore it needs to be converted to `float`.\n",
"2. The `classes` array contains categorical values, so you should convert it to `int`.\n",
"\n",
"(The `names` array already contains string values as it should. You don't need to change it.)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "459fe022",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# write your code here"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d9d77dd0",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"print(\"classes\")\n",
"print_array(classes)\n",
"\n",
"print(\"attributes\")\n",
"print_array(attributes)"
]
},
{
"cell_type": "markdown",
"id": "cae988ab",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"\n",
"### 4. Array Axis and Statistics\n",
"We often need to compute some statistics using aggregating methods. A common pitfall however is computing these statistics along the wrong axis.\n",
"\n",
"Using the `attributes` numpy array, do the following:\n",
"\n",
"1. Compute the sum of all values.\n",
"2. Compute the average value of each column (i.e., feature).\n",
"3. Compute the maximum of the row minimums.\n",
"4. Compute the average of the column maximus.\n",
"\n",
"**Hint:** The output for questions `4.3`, `4.4`, will be a *scalar*. To make sure you are aggregating over the correct values, check the shape of the intermediate resulting array first.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "26b18248",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Solution for 4.1\n",
"# write your code here"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b51ed6f9",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Solution for 4.2\n",
"# write your code here"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "752f51ff",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Solution for 4.3\n",
"# write your code here"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "bdfb54b6",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Solution for 4.4\n",
"# write your code here"
]
},
{
"cell_type": "markdown",
"id": "deaae13d",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### 5. Array Transpose\n",
"When transposing a `2x3` array, we get a `3x2` array. Matrix transpose has a lot of significance in linear algebra and you will rely on it many times in your assignments.\n",
"\n",
"Using the transpose of the original `attributes` array:\n",
"\n",
"1. Compute the same statistics of the question 4.2.\n",
"1. Compute the same statistics of the question 4.3.\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0b3d09ae",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Solution for 5.1\n",
"# write your code here"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ff9dbd08",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# Solution for 5.2\n",
"# write your code here"
]
},
{
"cell_type": "markdown",
"id": "9b2df45d",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### 6. Sorting and Indexing\n",
"Sort the `names` array alphabetically, and then apply the same ordering to the **columns** of the `attributes` array, in order to preserve the correspondence between them.\n",
"\n",
"Hint: be careful when applying the sorting of `names` to `attributes` and think about the role of each axis.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9536b1ad",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# write your code here"
]
},
{
"cell_type": "markdown",
"id": "1f0dc8c2",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### 7. Data Standardization\n",
"Standardization (not to be confused with normalization), is a preprocessing step that is commonly used in many machine learning models and ensures that all features are normally distributed (i.e., they have zero mean and unit variance).\n",
"\n",
"To do this, you need to transform the data as follows: \n",
"1. Remove the mean value of each feature (i.e., centering).\n",
"2. Divide the features by their standard deviation (i.e., rescaling).\n",
"\n",
"Save the standardized version of the `attributes` array to `attributes_norm`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1d9ea13d",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# write your code here"
]
},
{
"cell_type": "markdown",
"id": "f86fc083",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"### 8. Matrix Operations\n",
"We are going to work with following slices of the `attributes` array (see the cell below). You will compute some simple operations without using NumPy's builtin methods, but you may use them to check that your solution is correct."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3ec7f5b7-662e-4bc8-a154-5bd93ea2a518",
"metadata": {},
"outputs": [],
"source": [
"slice1 = attributes[6:10]\n",
"slice2 = attributes[76:80]\n",
"\n",
"print_array(slice1)\n",
"print_array(slice2)"
]
},
{
"cell_type": "markdown",
"id": "2d120955",
"metadata": {
"pycharm": {
"name": "#%% md\n"
},
"tags": []
},
"source": [
"**1\\.** Compute the dot product between each vector (i.e., row) of `slice1`, with the corresponding vector of `slice2`. This means, the 1st vector `slice1` with the 1st vector of `slice2`, the 2nd vector `slice1` with the 2nd vector of `slice2` etc. Use numpy, but avoid using `np.dot` or for loops. Think about the definition of the dot product.\n",
"\n",
"The expected output is `[514410.1698, 661579.8319, 797379.7166, 494338.7313]`"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "28dc7dae-854f-4701-bc42-7d10524e7d9d",
"metadata": {},
"outputs": [],
"source": [
"# write your code here"
]
},
{
"cell_type": "markdown",
"id": "2b4ba17d",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"**2\\.** Compute the cosine similarity between 7th and 77th rows (use 0-based indexing) of the `attributes` array using the dot product. \n",
"\n",
"(0-based indexing, means you should use the vectors `attributes[7]` and `attributes[77]` )\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c7db29f9",
"metadata": {
"collapsed": false,
"jupyter": {
"outputs_hidden": false
},
"pycharm": {
"name": "#%%\n"
}
},
"outputs": [],
"source": [
"# write your code here"
]
},
{
"cell_type": "markdown",
"id": "1d6d022c",
"metadata": {
"pycharm": {
"name": "#%% md\n"
}
},
"source": [
"Cosine similarity is used a lot in machine learning and has many nice properties compared to other metrics, such as a cosine. One of them is that its values are in the `[-1, 1]` range regardless of the properties of the vector space (e.g., dimensionality).\n",
"\n",
"**3\\.** Compute the cosine similarity between the 7th and 77th rows (use 0-based indexing) of the `attributes_norm` array. How can you explain the difference in their cosine similarities? Why did these vectors look very similar before standardization, but dissimilar afterwards?\n",
#### Authors: Christos Baziotis, Lexi Birch, Frank Keller
These exercises aim to support you in successfully completing your assignments. Here, we will focus on NumPy, which is a library for working with numerical data in Python and is what you will be using in your first assignment. It would be helpful to first read the official NumPy quickstart [guide](https://numpy.org/doc/stable/user/absolute_beginners.html). While here we will review some operations that will help you with your assignment, you should use the official guide as a reference.
%% Cell type:code id:3d3209c9 tags:
``` python
import numpy as np
np.set_printoptions(suppress=True)  # suppresses the use of scientific notation for small numbers

# you may use this function to print a numpy array and its properties
def print_array(arr):
    print(arr)
    print("shape:", arr.shape)
    print("type:", arr.dtype.type)
    print()
```
%% Cell type:markdown id:40374a13 tags:
# Load the data
We will be working with the Wine Data Set from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Wine). It contains the results of a chemical analysis of 178 wines. The wines are categorized into 3 classes and described by 13 attributes. All attributes are continuous.
The dataset is stored in the `wine.csv` file. The first row contains the column names and the rest of the rows the corresponding values. Open the file and check its structure. The columns in the data are as follows:
1. *Type*: The type of wine, assigned to one of three classes: 1 (59 observations), 2 (71 observations), and 3 (48 observations).
2. Alcohol
3. Malic acid
4. Ash
5. Alcalinity of ash
6. Magnesium
7. Total phenols
8. Flavanoids
9. Nonflavanoid phenols
10. Proanthocyanins
11. Color intensity
12. Hue
13. OD280/OD315 of diluted wines
14. Proline
**NOTE**: As you can see, the first column is the *class* identifier (1-3).
First, we naively read all the data into a regular 2D Python list (i.e., list of lists), named `data`.
%% Cell type:code id:cd548302 tags:
``` python
# solution
data = []
with open("wine.csv") as f:
    for line in f:
        row = line.strip().split(",")
        data.append(row)
print(data)
```
%% Cell type:markdown id:6f138e6b tags:
### 1. Initialize a Numpy Array
Create a numpy array (named `data`) out of the Python list and check its shape and data type.
What is the data type of the numpy array and why? How do numpy arrays differ from regular Python lists?
%% Cell type:code id:c0ec94d7 tags:
``` python
# write your code here
```
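One possible approach, sketched below: `np.array` converts the nested list into a 2D array, and because the CSV values were read in as strings, every element is stored as a (Unicode) string type.

``` python
# convert the list of lists into a 2D numpy array (this overwrites the Python list)
data = np.array(data)
print_array(data)  # expect shape (179, 14): header row + 178 wines, and a string dtype (numpy.str_)
```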
%% Cell type:code id:1cb13efd tags:
``` python
# let's see what is in the array
print_array(data)
```
%% Cell type:markdown id:4d572df0 tags:
### 2. Array Indexing and Slicing
Now we need to split the data into separate numpy arrays, in order to disentangle the attribute names, attribute values and class labels. To do this you need to use numpy array slicing.
Do the following:
1. Store the 13 attribute names into a 1D numpy array called `names`. This means you should ignore the first column (i.e., Wine type).
2. Store the class labels (i.e., Wine type) into a 1D numpy array called `classes`.
3. Store the attribute values into a 2D numpy array called `attributes`.
%% Cell type:code id:2712f977 tags:
``` python
# write your code here
```
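A minimal sketch of one way to do the slicing, assuming `data` is the numpy array created above (header in row 0, class label in column 0):

``` python
# header row, without the Type column
names = data[0, 1:]
# class labels: every data row, first column only
classes = data[1:, 0]
# attribute values: every data row, all columns except the first
attributes = data[1:, 1:]
```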
%% Cell type:code id:65acf4d2 tags:
``` python
print("names")
print_array(names)
print("classes")
print_array(classes)
print("attributes")
print_array(attributes)
assert names.shape == (13,)
assert classes.shape == (178,)
assert attributes.shape == (178, 13)
```
%% Cell type:markdown id:90ce761e tags:
4\. Using the `attributes` array, print the second-to-last row without its last 3 elements.

### 3. Casting Data Types
Cast each numpy array to the appropriate data type. We need to represent numerical values with the appropriate data type to be able to do numerical operations on them.
1. The `attributes` array contains continuous values, therefore it needs to be converted to `float`.
2. The `classes` array contains categorical values, so you should convert it to `int`.
(The `names` array already contains string values as it should. You don't need to change it.)
%% Cell type:code id:459fe022 tags:
``` python
# write your code here
```
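A sketch of one possible answer for both parts (negative indexing for the row, then `astype` for the casts):

``` python
# second-to-last row of attributes, without its last 3 elements
print(attributes[-2, :-3])

# cast to numerical types so that arithmetic works as expected
attributes = attributes.astype(float)
classes = classes.astype(int)
```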
%% Cell type:code id:d9d77dd0 tags:
``` python
print("classes")
print_array(classes)
print("attributes")
print_array(attributes)
```
%% Cell type:markdown id:cae988ab tags:
### 4. Array Axis and Statistics
We often need to compute some statistics using aggregation methods. A common pitfall, however, is computing these statistics along the wrong axis.
Using the `attributes` numpy array, do the following:
1. Compute the sum of all values.
2. Compute the average value of each column (i.e., feature).
3. Compute the maximum of the row minimums.
4. Compute the average of the column maximums.
**Hint:** The output for questions `4.3` and `4.4` will be a *scalar*. To make sure you are aggregating over the correct values, check the shape of the intermediate array first.
%% Cell type:code id:26b18248 tags:
``` python
# Solution for 4.1
# write your code here
```
%% Cell type:code id:b51ed6f9 tags:
``` python
# Solution for 4.2
# write your code here
```
%% Cell type:code id:752f51ff tags:
``` python
# Solution for 4.3
# write your code here
```
%% Cell type:code id:bdfb54b6 tags:
``` python
# Solution for 4.4
# write your code here
```
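One possible way to compute the four statistics follows; the key point is the choice of axis (`axis=0` aggregates over rows, i.e. per column, while `axis=1` aggregates over columns, i.e. per row):

``` python
print(attributes.sum())               # 4.1: sum of all values
print(attributes.mean(axis=0))        # 4.2: average of each column (shape (13,))
print(attributes.min(axis=1).max())   # 4.3: maximum of the row minimums
print(attributes.max(axis=0).mean())  # 4.4: average of the column maximums
```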
%% Cell type:markdown id:deaae13d tags:
### 5. Array Transpose
When transposing a `2x3` array, we get a `3x2` array. Matrix transpose has a lot of significance in linear algebra and you will rely on it many times in your assignments.
Using the transpose of the original `attributes` array:
1. Compute the same statistics as in question 4.2.
2. Compute the same statistic as in question 4.3.
%% Cell type:code id:0b3d09ae tags:
``` python
# Solution for 5.1
# write your code here
```
%% Cell type:code id:ff9dbd08 tags:
``` python
# Solution for 5.2
# write your code here
```
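After transposing, rows and columns swap roles, so the axis arguments flip as well. A minimal sketch, assuming the cast `attributes` array from above:

``` python
attributes_T = attributes.T            # shape (13, 178)
print(attributes_T.mean(axis=1))       # same as 4.2: per-feature averages
print(attributes_T.min(axis=0).max())  # same as 4.3: maximum of the row minimums
```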
%% Cell type:markdown id:9b2df45d tags:
### 6. Sorting and Indexing
Sort the `names` array alphabetically, and then apply the same ordering to the **columns** of the `attributes` array, in order to preserve the correspondence between them.
Hint: be careful when applying the sorting of `names` to `attributes` and think about the role of each axis.
%% Cell type:code id:9536b1ad tags:
``` python
# write your code here
```
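A sketch using `np.argsort`, which returns the permutation that sorts `names`; applying the same permutation to the column axis of `attributes` preserves the name-to-column correspondence:

``` python
order = np.argsort(names)                 # indices that sort the names alphabetically
names_sorted = names[order]
attributes_sorted = attributes[:, order]  # reorder columns, not rows
```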
%% Cell type:markdown id:1f0dc8c2 tags:
### 7. Data Standardization
Standardization (not to be confused with normalization) is a preprocessing step that is commonly used in many machine learning models and ensures that all features have zero mean and unit variance.
To do this, you need to transform the data as follows:
1. Remove the mean value of each feature (i.e., centering).
2. Divide the features by their standard deviation (i.e., rescaling).
Save the standardized version of the `attributes` array to `attributes_norm`.
%% Cell type:code id:1d9ea13d tags:
``` python
# write your code here
```
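A minimal sketch: compute the per-feature mean and standard deviation along `axis=0` and let broadcasting apply them to every row:

``` python
attributes_norm = (attributes - attributes.mean(axis=0)) / attributes.std(axis=0)
print(attributes_norm.mean(axis=0))  # close to 0 for every feature
print(attributes_norm.std(axis=0))   # close to 1 for every feature
```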
%% Cell type:markdown id:f86fc083 tags:
### 8. Matrix Operations
We are going to work with the following slices of the `attributes` array (see the cell below). You will compute some simple operations without using NumPy's built-in methods, but you may use them to check that your solution is correct.
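%% Cell type:code id:3ec7f5b7-662e-4bc8-a154-5bd93ea2a518 tags:
``` python
slice1 = attributes[6:10]
slice2 = attributes[76:80]

print_array(slice1)
print_array(slice2)
```
%% Cell type:markdown id:2d120955 tags: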
**1\.** Compute the dot product between each vector (i.e., row) of `slice1` and the corresponding vector of `slice2`. That is, the 1st vector of `slice1` with the 1st vector of `slice2`, the 2nd vector of `slice1` with the 2nd vector of `slice2`, etc. Use numpy, but avoid using `np.dot` or for loops. Think about the definition of the dot product.
The expected output is `[514410.1698, 661579.8319, 797379.7166, 494338.7313]`
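%% Cell type:code id:28dc7dae-854f-4701-bc42-7d10524e7d9d tags:
``` python
# write your code here
```

One way to get there without `np.dot` or loops, sketched below: elementwise multiplication followed by a sum over the feature axis gives exactly the row-wise dot products.

``` python
rowwise_dot = (slice1 * slice2).sum(axis=1)
print(rowwise_dot)  # should match the expected output above
```

%% Cell type:markdown id:2b4ba17d tags: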
**2\.** Compute the cosine similarity between the 7th and 77th rows (use 0-based indexing) of the `attributes` array using the dot product.
(0-based indexing means you should use the vectors `attributes[7]` and `attributes[77]`.)
%% Cell type:code id:c7db29f9 tags:
``` python
# write your code here
```
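A sketch of one way to compute it, following the definition of cosine similarity (the dot product of the two vectors divided by the product of their norms):

``` python
a = attributes[7]
b = attributes[77]
cos_sim = (a * b).sum() / (np.sqrt((a * a).sum()) * np.sqrt((b * b).sum()))
print(cos_sim)
```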
%% Cell type:markdown id:1d6d022c tags:
Cosine similarity is used a lot in machine learning and has many nice properties compared to other metrics, such as the plain dot product. One of them is that its values are in the `[-1, 1]` range regardless of the properties of the vector space (e.g., dimensionality).
**3\.** Compute the cosine similarity between the 7th and 77th rows (use 0-based indexing) of the `attributes_norm` array. How can you explain the difference in their cosine similarities? Why did these vectors look very similar before standardization, but dissimilar afterwards?
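A sketch for question 3, reusing the same formula on the standardized vectors (assuming `attributes_norm` from section 7):

``` python
a = attributes_norm[7]
b = attributes_norm[77]
cos_sim_norm = (a * b).sum() / (np.sqrt((a * a).sum()) * np.sqrt((b * b).sum()))
print(cos_sim_norm)
```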