{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Machine Learning with Python\n", "\n", "Machine Learning (ML) is a field where we attempt to teach computers to do complex tasks. But how do you define a task? We use computers for many different things, so we can also teach them to do many different things. Some examples include:\n", "- Image Recognition - you have probably used your smartphone to take a picture, but how does it detect faces in pictures? That is done via ML. More recent examples include detecting objects for autonomous cars (which is a bit scary, in all fairness).\n", "\n", "<figure class=\"image\">\n", " <img src=\"https://cdn.technologyreview.com/i/images/Face%20detection.png?sw=600\" alt=\"drawing\" width=\"500\"/>\n", " <center>\n", " <figcaption>Face recognition with Machine Learning © MIT Technology Review</figcaption>\n", " </center>\n", "</figure>\n", "\n", "\n", "- Speech Recognition - have you ever used Alexa, Siri or Google Assistant? Well, all of them are based on ML - computers learning to understand our speech.\n", "- Medical diagnosis - ML can help with the diagnosis of diseases. For example, if you take a CT scan of a patient's brain, you can then put the image through an ML algorithm which can indicate whether the patient has a brain tumor.\n", "- and much more...\n", "\n", "In this notebook we will use the package `sklearn` to do some basic machine learning. Namely, we will teach a computer to recognise handwritten digits and predict house prices." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 1. Linear Regression\n", "Let's kick off with **linear regression**. It is a *supervised method* which tries to predict never-before-seen data based on data it has seen before.\n", "\n", "A simple example would be: Imagine you are budgeting for renting a flat around George Square in Edinburgh but you don't know how much it might cost. 
Luckily, you have a few friends who live in that area and you ask them how much they are paying based on how many bedrooms they have:\n", "- friend A has 1 bedroom and is paying £560\n", "- friend B has 3 bedrooms and is paying £1200\n", "- friend C has 1 bedroom and is paying £540\n", "\n", "Based on that, you can find how much you expect to be paying for a 2 bedroom flat by simply averaging the price of a bedroom in that area, i.e. dividing the total rent by the total number of bedrooms.\n", "\n", "$$ \\dfrac{A + B + C}{\\text{Num. of bedrooms}} $$\n", "\n", "\n", "$$ = \\dfrac{560 + 1200 + 540}{5} $$\n", "\n", "\n", "$$ = £460 $$\n", "\n", "Therefore you should expect to be paying 2 x £460 = £920 for a 2 bedroom flat.\n", "\n", "You just performed *regression* there. You predicted how much a 2-bedroom flat might cost to rent in the future based on data you have from the past.\n", "\n", "Now let's see how we can do that in Python. For that we will be using the package `sklearn`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import matplotlib.pyplot as plt\n", "%matplotlib inline" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we need to enter the data we just defined above." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "price = np.array([[560], [1200], [540]])\n", "bedrooms = np.array([[1], [3], [1]])\n", "plt.scatter(bedrooms, price)\n", "plt.xlabel(\"Num of bedrooms\")\n", "plt.ylabel(\"Price of flat\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we want to fit a linear regression through it which will predict how much flats will cost to rent.\n", "\n", "First we have to define the model. You can treat the model as the container for your Linear Regression and all of its parameters. 
Then we will fit it to the data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression\n", "\n", "# Create the model\n", "linreg = LinearRegression()\n", "\n", "# Learn to predict\n", "linreg.fit(bedrooms, price)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's predict how much flats with 1 to 6 bedrooms will cost. First we define the bedrooms we want in `bedrooms_prediction` and then we predict the prices of these flats with `linreg.predict()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "bedrooms_prediction = np.array([[1], [2], [3], [4], [5], [6]])\n", "prices_prediction = linreg.predict(bedrooms_prediction)\n", "\n", "# plot the results\n", "fig = plt.figure()\n", "plt.scatter(bedrooms, price, label=\"Input data\")\n", "plt.plot(bedrooms_prediction, prices_prediction, color=\"r\", label=\"Linear fit\")\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "From the graph, you can see our initial estimate was fairly close. Now we can also see what the prices would be even for 4, 5 and 6 bedroom flats. More than £2000 is a lot of money for a flat!\n", "\n", "This example is a fairly trivial one, but imagine if you had a lot more information about the flats, like number of bathrooms, floor, distance from the closest shop, etc. Then predicting by hand would become much more difficult, but Linear Regression would perform just as well!\n", "\n", "You will find out how well exactly in the next exercise!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 1 - Linear regression for predicting house pricing <a name=\"ex1\"></a>\n", "Here you will perform linear regression on a dataset of housing prices.\n", "\n", "Let's first load the dataset `data/kc_house_data_min.csv`. 
It has the following columns:\n", "- **price** in US dollars\n", "- number of **bedrooms**\n", "- number of **bathrooms**\n", "- **sqft_living** - square feet of living space\n", "- **sqft_lot** - square feet of the whole lot\n", "- number of **floors**\n", "- **waterfront** - 1 if there is one; 0 otherwise\n", "- **view** - 1 if there is a nice view; 0 otherwise\n", "- **condition** rating of the house from 1 to 5\n", "- **grade** - overall grade given to the housing unit, based on the King County grading system, from 1 to 13" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = np.genfromtxt(\"data/kc_house_data_min.csv\", delimiter=\",\", skip_header=1)\n", "data.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we want to separate the data into the descriptions of the houses and the values we want to predict for them. In our case, we are trying to predict the house prices and all other columns are just the description of the houses. Separate the data into arrays `prices` and `descriptions` below:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# [ENTER CODE IN THIS BLOCK]\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# USE THIS TO CHECK THE CORRECTNESS OF THE PREVIOUS BLOCK\n", "assert prices.shape == (21613,) , \"Price variable has wrong shape\"\n", "assert descriptions.shape == (21613,9), \"Descriptions variable has wrong shape\"\n", "print(\"You separated the data correctly!\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now you can just apply linear regression as it was shown in the example above! First, create an instance of a linear regression and fit the model." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# [ENTER CODE IN THIS BLOCK]\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "OK, now let's make some predictions! Create an arbitrary description of an imaginary house using the fields defined [earlier](#ex1)! You can use `linreg.predict()` to predict its price.\n", "\n", "*Note: Data you input into the linear regression model must be of shape (1,9)!*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# [ENTER CODE IN THIS BLOCK]\n", "# Create a description of a house\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# [ENTER CODE IN THIS BLOCK]\n", "# Predict its price\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# 2. Logistic regression\n", "\n", "Logistic regression is in many ways similar to Linear regression. However, its major difference is that it is a **classifying machine learning algorithm**. Instead of outputting a numerical prediction, **logistic regression outputs a label**. But wait, hold on, what is a label?\n", "\n", "Imagine that you are driving a car - your goal is to reach your destination safely without causing any incidents. For that, you need to look out for other cars, pedestrians, roads, dogs, cats, etc. as you drive. Based on how you classify those objects, you take different actions. If you see a road you think to yourself \"I should make sure I am driving on that\" but if you see a pedestrian you think to yourself \"Better try to not run over that person\".\n", "\n", "This is pretty much how autonomous cars work as well, but first they need to identify and classify objects on the road. Say that your machine learning algorithm can identify a [person, car, traffic light, handbag, backpack, truck]. When you give it an image, it tries to identify objects and assign a label to each object. 
\"Hey, you look like a human, I'll label you a `person`\"\n", "\n", "<figure class=\"image\">\n", " <img src=\"https://cdn-images-1.medium.com/max/1600/1*QOGcvHbrDZiCqTG6THIQ_w.png\" alt=\"drawing\" width=\"600\"/>\n", " <center>\n", " <figcaption>YOLO detection algorithm on ImageNet 1000 dataset</figcaption>\n", " </center>\n", "</figure>" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Logistic Regression is one of the ways you can achieve the above. Let's first have a look at a simple example.\n", "\n", "We will now generate random data with `make_blobs` which creates data clustered around a common center for each class. Then we will attempt to classify the points back into classes with logistic regression!" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegression\n", "from sklearn.datasets import make_blobs\n", "\n", "# generate 2d classification dataset\n", "X, y = make_blobs(n_samples=100, centers=2, n_features=2, cluster_std=4)\n", "\n", "# plot the data into classes\n", "plt.scatter(X[y==0, 0], X[y==0, 1], label=\"Class 1\")\n", "plt.scatter(X[y==1, 0], X[y==1, 1], label=\"Class 2\")\n", "plt.legend()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we can distinguish between 2 different classes via their colours. Now let's attempt to learn a boundary between them and classify them with Logistic Regression." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "logreg = LogisticRegression()\n", "logreg.fit(X, y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we have a fitted Logistic Regression model, we would like to plot the _decision boundary_ between the two classes. In this case, since we are using a _linear_ Logistic Regression model, our decision boundary will just be a straight line. 
All points on one side of that line belong to one class and all points on the other side belong to the other class.\n", "\n", "The code below will visualise the decision boundary. You don't need to understand the code, it is just there to visualise the results." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Plot the decision boundary. For that, we will assign a color to each\n", "# point in the mesh [x_min, x_max]x[y_min, y_max].\n", "x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1\n", "y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1\n", "h = .02 # step size in the mesh\n", "xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))\n", "Z = logreg.predict(np.c_[xx.ravel(), yy.ravel()])\n", "\n", "# Put the result into a color plot\n", "Z = Z.reshape(xx.shape)\n", "plt.pcolormesh(xx, yy, Z, cmap=plt.cm.Paired)\n", "\n", "# Plot also the training points\n", "plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', cmap=plt.cm.Paired)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Pretty cool!\n", "\n", "Now you have the chance to apply this to something a bit more practical!" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exercise 2 - Logistic regression image classification" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Your task here is to make the computer learn how to classify hand-written digits.\n", "\n", "First let's make sure you have the right version of the `sklearn` package. Just run the cell below to verify that. If you don't, it will automatically install the correct version you need for this exercise. Don't worry about the outputs of the cell!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "!pip install scikit-learn==0.20.2" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now let's download our digits from the internet (it might take some time).\n", "\n", "*Note: we will be using only 5000 of the images, picked after shuffling, to make the exercise faster. You can use the full dataset, but training might take more than an hour.*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "%run download_mnist.py\n", "from sklearn.datasets import fetch_mldata\n", "from sklearn.utils import shuffle\n", "\n", "# load the MNIST dataset\n", "fetch_mnist()\n", "a, b, y, X = fetch_mldata(\"MNIST original\").values()\n", "\n", "# reorder the data randomly\n", "X, y = shuffle(X, y)\n", "\n", "# Take only 5000 examples\n", "X = X[:5000]\n", "y = y[:5000]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's now create a function that can visualise the data along with labels. There is no need to understand the code fully. All you should know is that it receives a dataset of inputs `X` and the targets of those inputs `y`. Then the code picks 10 random examples and visualises them along with their target outputs.\n", "\n", "You can use it to visualise both the dataset and then your trained outputs!" 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "def plot_digits(X, y):\n", "    plt.rc(\"image\", cmap=\"binary\")\n", "    nums = np.random.randint(0, len(X), 10)\n", "    for idx, i in enumerate(nums):\n", "        axs = plt.subplot(2,5,idx+1)\n", "        plt.imshow(X[i].reshape(28,28), label=\"1\")\n", "        axs.set_title(\"Label: \" + str(y[i]))\n", "        plt.xticks(())\n", "        plt.yticks(())\n", "    plt.tight_layout()" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# X is the data you want to visualise\n", "# y is the labels that will be displayed on top\n", "plot_digits(X,y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "You have your data and you have a method of visualising it.\n", "\n", "**Now apply Logistic Regression to the data!**\n", "1. Initialise a LogisticRegression model\n", "2. Train it on the data\n", "3. Predict 10 examples (can be from the full dataset)\n", "4. Plot the input images and their predicted labels using `plot_digits()`\n", "\n", "*Warning: Training might take up to 5 minutes, so be patient*" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# [ENTER CODE IN THIS BLOCK]\n" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6" } }, "nbformat": 4, "nbformat_minor": 2 }