Skip to content
Snippets Groups Projects
python-data-5-pandas.ipynb 184 KiB
Newer Older
       "      <td>0.001317</td>\n",
       "      <td>-0.047580</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>std</th>\n",
       "      <td>1.007406</td>\n",
       "      <td>0.977138</td>\n",
       "      <td>1.021164</td>\n",
       "      <td>1.006258</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>min</th>\n",
       "      <td>-3.000000</td>\n",
       "      <td>-3.000000</td>\n",
       "      <td>-2.824025</td>\n",
       "      <td>-3.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>25%</th>\n",
       "      <td>-0.696013</td>\n",
       "      <td>-0.664201</td>\n",
       "      <td>-0.667879</td>\n",
       "      <td>-0.735838</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>50%</th>\n",
       "      <td>-0.014608</td>\n",
       "      <td>0.012376</td>\n",
       "      <td>0.018254</td>\n",
       "      <td>-0.078759</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>75%</th>\n",
       "      <td>0.660866</td>\n",
       "      <td>0.642424</td>\n",
       "      <td>0.678173</td>\n",
       "      <td>0.617265</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>max</th>\n",
       "      <td>3.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>3.000000</td>\n",
       "      <td>3.000000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                 0            1            2            3\n",
       "count  1000.000000  1000.000000  1000.000000  1000.000000\n",
       "mean     -0.025998    -0.007102     0.001317    -0.047580\n",
       "std       1.007406     0.977138     1.021164     1.006258\n",
       "min      -3.000000    -3.000000    -2.824025    -3.000000\n",
       "25%      -0.696013    -0.664201    -0.667879    -0.735838\n",
       "50%      -0.014608     0.012376     0.018254    -0.078759\n",
       "75%       0.660866     0.642424     0.678173     0.617265\n",
       "max       3.000000     3.000000     3.000000     3.000000"
      ]
     },
     "execution_count": 105,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data[np.abs(data) > 3] = np.sign(data) * 3\n",
    "data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Permutation and Random Sampling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Permuting (randomly reordering) of rows in pandas is easy to do using the `numpy.random.permutation` function. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 106,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>8</td>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>12</td>\n",
       "      <td>13</td>\n",
       "      <td>14</td>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>16</td>\n",
       "      <td>17</td>\n",
       "      <td>18</td>\n",
       "      <td>19</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    0   1   2   3\n",
       "0   0   1   2   3\n",
       "1   4   5   6   7\n",
       "2   8   9  10  11\n",
       "3  12  13  14  15\n",
       "4  16  17  18  19"
      ]
     },
     "execution_count": 106,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 107,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([1, 2, 0, 4, 3])"
      ]
     },
     "execution_count": 107,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# generate random order\n",
    "sampler = np.random.permutation(5)\n",
ignat's avatar
ignat committed
    "sampler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 108,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>8</td>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>16</td>\n",
       "      <td>17</td>\n",
       "      <td>18</td>\n",
       "      <td>19</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>12</td>\n",
       "      <td>13</td>\n",
       "      <td>14</td>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    0   1   2   3\n",
       "1   4   5   6   7\n",
       "2   8   9  10  11\n",
       "0   0   1   2   3\n",
       "4  16  17  18  19\n",
       "3  12  13  14  15"
      ]
     },
     "execution_count": 108,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.take(sampler)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To select a random subset without replacement, you can use the sample method:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 109,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "      <th>3</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>12</td>\n",
       "      <td>13</td>\n",
       "      <td>14</td>\n",
       "      <td>15</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>8</td>\n",
       "      <td>9</td>\n",
       "      <td>10</td>\n",
       "      <td>11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>6</td>\n",
       "      <td>7</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "    0   1   2   3\n",
       "3  12  13  14  15\n",
       "2   8   9  10  11\n",
       "1   4   5   6   7"
      ]
     },
     "execution_count": 109,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df.sample(n=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
ignat's avatar
ignat committed
    "# String manipulation <a name=\"strings\"></a>\n",
    "Python has long been popular for its raw data manipulation in part due to its ease of use for string and text processing. Most text operations are made simple with the string object's built-in methods. For more complex pattern matching and text manipulations, regular expressions may be needed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Basics\n",
    "Let's refresh what normal `str` (String objects) are capable of in Python"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 110,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['Edinburgh', 'is', 'great']"
      ]
     },
     "execution_count": 110,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# complex strings can be broken into small bits\n",
    "val = \"Edinburgh is great\"\n",
    "val.split(\" \")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 111,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'Edinburgh::is::great'"
      ]
     },
     "execution_count": 111,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# substrings can be concatinated together with +\n",
    "first, second, last = val.split(\" \")\n",
    "first + \"::\" + second + \"::\" + last"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remember that Strings are just lists of individual charecters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 112,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "E\n",
      "d\n",
      "i\n",
      "n\n",
      "b\n",
      "u\n",
      "r\n",
      "g\n",
      "h\n"
     ]
    }
   ],
   "source": [
    "val = \"Edinburgh\"\n",
    "for each in val:\n",
    "    print(each)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can use standard list operations with them"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 113,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "3"
      ]
     },
     "execution_count": 113,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "val.find(\"n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 114,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "-1"
      ]
     },
     "execution_count": 114,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "val.find(\"x\")  # -1 means that there is no such element"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 115,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'EDINBURGH'"
      ]
     },
     "execution_count": 115,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# and of course remember about upper() and lower()\n",
    "val.upper()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want to learn more about strings you can always refer to the [Python manual](https://docs.python.org/2/library/string.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Regular expressions\n",
ignat's avatar
ignat committed
    "provide a flexible way to search or match (often more complex) string patterns in text. A single expression, commonly called *regex*, is a string formed according to the regular expression language. Python's built-in module is responsible for applying regular expression of strings via the `re` package"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 116,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "'foo    bar\\t baz   \\tqux'"
      ]
     },
     "execution_count": 116,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import re\n",
    "text = \"foo    bar\\t baz   \\tqux\"\n",
    "text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 117,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['foo', 'bar', 'baz', 'qux']"
      ]
     },
     "execution_count": 117,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "re.split(\"\\s+\", text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
ignat's avatar
ignat committed
    "this expression effectively removed all whitespaces and tab characters (`\\t`) which was stated with the `\\s` regex and then the `+` after it means to remove any number of sequential occurrences of that character.\n",
    "\n",
ignat's avatar
ignat committed
    "Let's have a look at a more complex example - identifying email addresses in a text file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 118,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = \"\"\"Dave dave@google.com\n",
    "Steve steve@gmail.com\n",
    "Rob rob@gmail.com\n",
    "Ryan ryan@yahoo.com\n",
    "\"\"\"\n",
    "\n",
    "# pattern to be used for searching\n",
    "pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}'\n",
    "\n",
    "# re.IGNORECASE makes the regex case-insensitive\n",
    "regex = re.compile(pattern, flags=re.IGNORECASE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 119,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']"
      ]
     },
     "execution_count": 119,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "regex.findall(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's dissect the regex part by part:\n",
    "```\n",
    "pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}'\n",
    "```\n",
    "\n",
ignat's avatar
ignat committed
    "- the `r` prefix before the string signals that the string should keep special characters such as the newline character `\\n`. Otherwise, Python would just treat it as a newline\n",
    "- `A-Z` means all letters from A to Z including lowercase and uppercase\n",
    "- `0-9` similarly means all characters from 0 to 9\n",
    "- the concatenation `._%+-` means just include those characters\n",
    "- the square brackets [ ] means to combine all of the regular expressions inside. For example `[A-Z0-9._%+-]` would mean include all letters A to Z, all numbers 0 to 9, and the characters ._%+-\n",
    "- `+` means to concatenate the strings patterns\n",
    "- `{2,4}` means consider only 2 to 4 character strings\n",
    "\n",
    "To summarise the pattern above searches for any combination of letters and numbers, followed by a `@`, then any combination of letters and numbers followed by a `.` with only 2 to 4 letters after it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Regular expressions and pandas\n",
    "Let's see how they can be combined. Replicating the example above"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": null,
   "metadata": {},
ignat's avatar
ignat committed
   "outputs": [],
   "source": [
    "data = pd.Series({'Dave': 'Daves email dave@google.com', 'Steve': 'Steves email steve@gmail.com',\n",
    "        'Rob': 'Robs rob@gmail.com', 'Wes': np.nan})\n",
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can reuse the same `pattern` variable from above"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": null,
   "metadata": {},
ignat's avatar
ignat committed
   "outputs": [],
   "source": [
    "data.str.findall(pattern, flags=re.IGNORECASE)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
ignat's avatar
ignat committed
    "pandas also offers more standard string operations. For example, we can check if a string is contained within a data row:"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": null,
   "metadata": {},
ignat's avatar
ignat committed
   "outputs": [],
   "source": [
    "data.str.contains(\"gmail\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Many more of these methods exist:\n",
    "    \n",
    "    \n",
ignat's avatar
ignat committed
    "| Methods | Description |\n",
    "| -- | -- |\n",
    "| cat | Concatenate strings element-wise with optional delimiter |\n",
    "| contains | Return boolean array if each string contains pattern/regex |\n",
ignat's avatar
ignat committed
    "| count | Count occurrences of a pattern |\n",
    "| extract | Use a regex with groups to extract one or more strings from a Series |\n",
ignat's avatar
ignat committed
    "| findall | Computer list of all occurrences of pattern/regex for each string |\n",
    "| get | Index into each element |\n",
    "| isdecimal | Checks if the string is a decimal number |\n",
    "| isdigit | Checks if the string is a digit |\n",
    "| islower | Checks if the string is in lower case |\n",
    "| isupper | Checks if the string is in upper case |\n",
    "| join | Join strings in each element of the Series with passed seperator |\n",
ignat's avatar
ignat committed
    "| len | Compute the length of each string |\n",
    "| lower, upper | Convert cases |\n",
    "| match | Returns matched groups as a list |\n",
    "| pad | Adds whitespace to left, right or both sides of strings |\n",
    "| repeat | Duplicate string values |\n",
ignat's avatar
ignat committed
    "| slice | Slice each string in the Series |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 12\n",
    "There is a `dataset data/yob2012.txt` which lists the number of newborns registered in 2018 with their names and sex. Using regular expressions, extract all names from the dataset which start with letters A to C. How many names did you find?\n",
    "\n",
    "Note: `^` is the \"starting with\" operator in regular expressions, "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Thanks "
ignat's avatar
ignat committed
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
ignat's avatar
ignat committed
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}