Skip to content
Snippets Groups Projects
python-data-5-pandas.ipynb 184 KiB
Newer Older
ignat's avatar
ignat committed
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Notebook 5 - pandas\n",
ignat's avatar
ignat committed
    "[pandas](http://pandas.pydata.org) provides high-level data structures and functions designed to make working with structured or tabular data fast, easy and expressive. The primary objects in pandas that we will be using are the `DataFrame`, a tabular, column-oriented data structure with both row and column labels, and the `Series`, a one-dimensional labeled array object.\n",
    "\n",
ignat's avatar
ignat committed
    "pandas blends the high-performance, array-computing ideas of NumPy with the flexible data manipulation capabilities of spreadsheets and relational databases. It provides sophisticated indexing functionality to make it easy to reshape, slice and perform aggregations.\n",
ignat's avatar
ignat committed
    "\n",
ignat's avatar
ignat committed
    "While pandas adopts many coding idioms from NumPy, the most significant difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.\n",
ignat's avatar
ignat committed
    "<br>\n",
    "\n",
    "## Table of Contents:\n",
ignat's avatar
ignat committed
    "- [Data Structures](#structures)\n",
    "    - [Series](#series)\n",
    "    - [DataFrame](#dataframe)\n",
    "- [Essential Functionality](#ess_func)\n",
    "    - [Reindexing](#reindexing)\n",
    "    - [Dropping Entries](#removing)\n",
    "    - [Indexing, Slicing and Filtering](#indexing)\n",
    "    - [Arithmetic Operations](#arithmetic)\n",
    "- [Summarizing and Computing Descriptive Statistics](#sums)\n",
    "- [Loading and storing data](#loading)\n",
    "    - [Text Format](#text) \n",
    "    - [Web Scraping](#web)\n",
    "- [Data Cleaning and preperation](#cleaning)\n",
    "    - [Handling missing data](#missing)\n",
    "    - [Data transformation](#transformation)\n",
    "- [String manipulation](#strings)\n",
    "\n",
    "The common pandas import statment is shown below:"
ignat's avatar
ignat committed
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
ignat's avatar
ignat committed
   "metadata": {},
   "outputs": [],
   "source": [
    "# Common pandas import statement\n",
    "import numpy as np\n",
ignat's avatar
ignat committed
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
ignat's avatar
ignat committed
    "# Data Structures <a name=\"structures\"></a>\n",
    "## Series <a name=\"series\"></a>\n",
    "A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels called its index.\n",
ignat's avatar
ignat committed
    "\n",
    "The easiest way to make a Series is from an array of data:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
ignat's avatar
ignat committed
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.Series([4, 7, -5, 3])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now try printing out data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    4\n",
       "1    7\n",
       "2   -5\n",
       "3    3\n",
       "dtype: int64"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data"
   ]
ignat's avatar
ignat committed
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
ignat's avatar
ignat committed
    "The string representation of a Series displayed interactively shows the index on the left and the values on the right. Because we didn't specify an index, the default on is simply integers 0 through N-1.\n",
ignat's avatar
ignat committed
    "\n",
    "You can output only the values of a Series using \n",
    "```python\n",
    "data.values\n",
    "```\n",
ignat's avatar
ignat committed
    "or you can get only the indices using\n",
ignat's avatar
ignat committed
    "```python\n",
    "data.index\n",
    "```\n",
    "Try it out below!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
ignat's avatar
ignat committed
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([ 4,  7, -5,  3])"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.values"
   ]
ignat's avatar
ignat committed
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can specify custom indeces when intialising the Series"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
ignat's avatar
ignat committed
   "metadata": {},
   "outputs": [],
   "source": [
    "data2 = pd.Series([4, 7, -5, 3], index=[\"a\", \"b\", \"c\", \"d\"])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
ignat's avatar
ignat committed
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "a    4\n",
       "b    7\n",
       "c   -5\n",
       "d    3\n",
       "dtype: int64"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data2"
   ]
ignat's avatar
ignat committed
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now you can use these labels to access the data similar to a normal array"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "4"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
ignat's avatar
ignat committed
   "source": [
    "data2[\"a\"]"
   ]
Loading
Loading full blame...