python-data-4-plotting.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Notebook 4 - Basic Plotting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Producing informative visuals for your data is an important part of data analysis. Here, we introduce matplotlib which is the de facto standard plotting library used in Python. We tend to interact with matplotlib through its `pyplot` module, which is normally imported like so:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## My first plot\n",
    "\n",
    "Let's see how easy it is to plot with pyplot by drawing a line graph of some randomly-generated data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "data = np.random.randn(100).cumsum() #Create some random data\n",
    "plt.plot(data) #Plot the data\n",
    "plt.show() #Show the plot"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's not so hard! The `plt.plot` function in general can take sequences of numerical values of the same length and plot them against each other. If we give just one array, `plt.plot` will plot the elements of the array against their index.\n",
    "\n",
    "We can also give `plt.plot` what's called a format string, which controls the style of the line we plot. In the example below, the string `g.:` specifies we want our plot to be green (`g`), with data marked by a small dot (`.`), and joined to one another by a dotted line (`:`). For a full list of colours, point markers, and line styles, see the documentation for `plt.plot` (remember you can run `plt.plot?` or sude `Shift + TAB` to access this)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(data, 'g.:')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The `plt.plot` method is called an artist method, because it draws something, and it does so in a articular way (it draws line plots). There are many other artist methods available to us in the `pyplot` module. Here are some of the most useful artist methods:\n",
    "\n",
    "|Artist|Style|\n",
    "|---|---|\n",
    "|plot|line plots|\n",
    "|bar|bar charts (vertica)|\n",
    "|barh|horizontal bar charts|\n",
    "|boxplot|boxplots|\n",
    "|hist|histograms|\n",
    "|loglog|log-log scaled plots|\n",
    "|matshow|heat map of a matrix|\n",
    "|pie|pie charts|\n",
    "|quiver|2-D vector fields|\n",
    "|scatter|scatter diagrams|\n",
    "|violin|violin plots|\n",
    "\n",
    "Each artist method can be accessed by prefixing it with `plt`. You can find out more about each method using the `pyplot` documentation available within Jupyter notebooks or in more detail at [the matplotlib documentation site](https://matplotlib.org/index.html). By exploring this documentation you may find even more useful artist methods for you to use!\n",
    "\n",
    "### Exercise 1\n",
    "\n",
    "Use Numpy to generate 2 different sequences of random numbers of the same length. Then plot these against one another as a scatter diagram."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
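  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a further illustration of the artist methods listed in the table above, here is a minimal sketch using `plt.hist`. The sample size, the number of bins, and the variable name `samples` are our own choices for illustration; they are not taken from any of the data files used in this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "samples = np.random.randn(1000) # Illustrative data: 1000 draws from a standard normal distribution\n",
    "plt.hist(samples, bins=20) # Sort the samples into 20 equal-width bins and draw one bar per bin\n",
    "plt.show() # Show the plot"
   ]
  },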
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Annotation and Customisation\n",
    "\n",
    "Annotating and customising plots is just as easy as creating them. In the example below, we import some categorical data from a file, plot it, and add axis labels and a title."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#Read the data from a csv file\n",
    "rawdata = np.genfromtxt(\"./data/headcount.csv\", dtype=np.str, delimiter=\",\") # read data into an np array\n",
    "header = rawdata[0] # Separate out column headers\n",
    "groups = rawdata[1:, 0] # Separate first column\n",
    "counts = np.asarray(rawdata[1:, 1], dtype=np.int) # Separate second column\n",
    "\n",
    "#Plot a bar chart of staff numbers\n",
    "plt.bar(np.arange(len(counts)), counts) # Create bar chart\n",
    "plt.xlabel(header[0]) # Set x axis label\n",
    "plt.ylabel(header[1]) # Set y axis label\n",
    "plt.xticks(np.arange(len(counts)), groups, rotation=30) # Set labels for x axis ticks\n",
    "plt.title(\"UoE Staff Headcount Dec 2018\") # Set title\n",
    "plt.show() # Show the figure"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's break that sequence of plotting commands down some more.\n",
    "- The method `plt.bar` takes an array of positions for which to centre the bars, and an array of the same length giving the heights of the bars. In general, if we have n bars, the numbers 1, 2, ..., n are a good way of centering the bars. We specified that here by passing the array `np.arange(len(counts))`.\n",
    "- The methods `plt.xlabel` and `plt.ylabel` set labels for the x and y axes respectively. We had stored these in the array `header`, which contained the header row in our csv file.\n",
    "- The method `plt.xticks` sets where the ticks should be placed on the x axis. We have told it to put them at the centre of each bar, and have given it an array of strings to be used as labels (the default is to use numerical values), as well as an optional rotation parameter.\n",
    "- The method `plt.title` gives our figure a title.\n",
    "- Finally, `plt.show` displays the figure for us!\n",
    "\n",
    "We can also add a legend to our figures if, for instance, they contain multiple plots. Here's an example, using randomly generated data."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = np.arange(0, 6.5, 0.01)\n",
    "plt.plot(x, np.sin(x), 'r')\n",
    "plt.plot(x, np.cos(x), 'g')\n",
    "plt.legend([\"sin(x)\", \"cos(x)\"], loc='best')\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We were able to draw multiple lines on the same figure by multiple successive calls to `plt.plot`. Then `plt.legend` adds a legend to the figure: it takes a list of labels, and uses these to label the lines in the order they were drawn on the figure. The optional `loc` parameter specifies where the legend will be placed (defaults to top right).\n",
    "\n",
    "### Exercise 2\n",
    "\n",
    "Read in the data from the file `cse_staff_population.csv`. Each line of the file lists the name of a School/Planning unit, followed by the number of academic staff, followed by the number of professional services staff (the first line is a header line).\n",
    "\n",
    "Plot a bar chart showing, for each School/Planning Unit, the number of academic staff and the number of professional services staff (so each School/Planning Unit will have two bars associated to it). Your figure should have axes labels, appropriate labels for the xticks, and a title. The bars for academic and professional services should be colour coded, and this should be noted by a legend.\n",
    "\n",
    "(Hint: You can use two different calls to `plt.bar` to overlay two bar charts on one figure.)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Saving Figures\n",
    "\n",
    "The function used to save the current figure is `plt.savefig`. This function accepts a string representing a file path, which tells Python where the image should be saved. In specifying an extension, the `plt.savefig` method will infer which format to export to. We can additionally control the resolution with the `dpi` parameter, which stands for dots-per-inch."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = np.random.randn(100) # Generate some random data\n",
    "plt.boxplot(data) # Create a boxplot of that data\n",
    "plt.savefig(\"my_boxplot.png\", dpi=300) # Save the figure as a png, with high resolution."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Figure Objects\n",
    "\n",
    "As we have seen, we build plots in matplotlib by successive calls to `plt`. We start by calling an artist method, which draws a figure. Then if we call other artist methods, these draw on the same figure. Calling methods such as `plt.xlabel` or `plt.title` update the current figure too, which we can save using `plt.savefig`. To see the current figure, we can call `plt.show`.\n",
    "\n",
    "In the background, matplotlib always maintains a \"current figure\", which is empty to start with. Successive calls to functions from the `plt` module draw on or change the current figure. This allows us to build up a figure piece-by-piece: layering different plots on the same figure, and updating the figure's annotations and other settings one-by-one. But then we need to know: how do we clear the current figure to create something new? It turns out, this is done automatically every time we call `plt.show`.\n",
    "\n",
    "### Exercise 3\n",
    "Take a look at the file `my_graph.jpeg` produced by running this code block. Something is wrong: change the code block so that the output file is the same as the displayed graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = np.arange(0.1, 10, 0.1)\n",
    "plt.plot(x, np.log(x))\n",
    "plt.show()\n",
    "plt.savefig(\"my_graph.jpeg\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The current figure is also cleared when we run a new Jupyter cell. This means we must complete a figure in a single cell, and we then have a \"blank canvas\" in every new cell. This is something to be aware of if you try to migrate code from a Jupyter notebook to a regular python script: in the background, Jupyter will have been clearing the current figure for you, so you may need to add code to clear the current figure manually. Here's an example showing how Jupyter clears figures automatically:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "x = np.arange(0, 6, 0.1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(x, np.exp(x))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "plt.plot(x, x**3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "plt.plot(x, np.exp(x))\n",
    "plt.plot(x, x**3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "It's good practice to end the creation of a new plot with `plt.show`, so that if code is transferred to a non-Jupyter environment we will not have uncleared figures interfering with each other. You can also use `plt.clf` to clear the current figure (this may be preferable, since in a regular python script, `plt.show` creates a pop-up window displaying the figure, and the rest of the script will not execute until this window has been closed).\n",
    "\n",
    "We've seen that by default, matplotlib builds up a figure by updating an object called the \"current figure\". But can we also access other, previously created figures to update them. We can even create more complicated figures which may cintain several sub-plots side by side. This involves creating and referncing figure objects, and axes objects. This is not much harder than we've been doing already, but since it goes a little beyond the basic matplotlib functionality, it will be covered in an extension notebook.\n",
    "\n",
    "Another way to enhance our plotting capabilities is through the [seaborn](https://seaborn.pydata.org/) package. Matplotlib, with its style of building up plots in a piecewise manner, gives us lots of control over our visuals - but it can alos be tedious to use, especially for routine tasks. Seaborn is based on matplotlib, but wraps up lots of convenient functionality in an easy to use way. Its visual style is also a little different to matplotlib. It is closely related to another package known as Pandas, which will be introduced in Notebook 5."
   ]
  },
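  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To make the role of `plt.clf` more concrete, here is a minimal sketch of how the exponential and cubic curves from the cells above might be drawn one after the other in a regular Python script, rather than in separate notebook cells. The file names `exp_plot.png` and `cubic_plot.png` are placeholders chosen purely for illustration."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "x = np.arange(0, 6, 0.1)\n",
    "\n",
    "plt.plot(x, np.exp(x)) # Draw the first curve on the current figure\n",
    "plt.savefig(\"exp_plot.png\") # Save it (placeholder file name)\n",
    "plt.clf() # Clear the current figure, giving us a blank canvas\n",
    "\n",
    "plt.plot(x, x**3) # Draw the second curve on the now-empty figure\n",
    "plt.savefig(\"cubic_plot.png\") # This file contains only the cubic curve (placeholder file name)\n",
    "plt.clf() # Clear again so that later plotting code starts afresh"
   ]
  },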
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}