python-data-5-pandas.ipynb

   "outputs": [],
   "source": [
    "# substrings can be concatinated together with +\n",
    "first, second, last = val.split(\" \")\n",
    "first + \"::\" + second + \"::\" + last"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Remember that Strings are just lists of individual charecters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val = \"Edinburgh\"\n",
    "for each in val:\n",
    "    print(each)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can use standard list operations with them"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val.find(\"n\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "val.find(\"x\")  # -1 means that there is no such element"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# and of course remember about upper() and lower()\n",
    "val.upper()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want to learn more about strings you can always refer to the [Python manual](https://docs.python.org/2/library/string.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Regular expressions\n",
    "provide a flexible way to search or match (often more complex) string patterns in text. A single expression, commonly called *regex*, is a string formed according to the regular expression language. Python's built-in module is responsible for applying regular expression of strings via the `re` package"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import re\n",
    "text = \"foo    bar\\t baz   \\tqux\"\n",
    "text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "re.split(\"\\s+\", text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "this expression effectively removed all whitespaces and tab characters (`\\t`) which was stated with the `\\s` regex and then the `+` after it means to remove any number of sequential occurrences of that character.\n",
    "\n",
    "Let's have a look at a more complex example - identifying email addresses in a text file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "text = \"\"\"Dave dave@google.com\n",
    "Steve steve@gmail.com\n",
    "Rob rob@gmail.com\n",
    "Ryan ryan@yahoo.com\n",
    "\"\"\"\n",
    "\n",
    "# pattern to be used for searching\n",
    "pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}'\n",
    "\n",
    "# re.IGNORECASE makes the regex case-insensitive\n",
    "regex = re.compile(pattern, flags=re.IGNORECASE)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "regex.findall(text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's dissect the regex part by part:\n",
    "```\n",
    "pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}'\n",
    "```\n",
    "\n",
    "- the `r` prefix before the string signals that the string should keep special characters such as the newline character `\\n`. Otherwise, Python would just treat it as a newline\n",
    "- `A-Z` means all letters from A to Z including lowercase and uppercase\n",
    "- `0-9` similarly means all characters from 0 to 9\n",
    "- the concatenation `._%+-` means just include those characters\n",
    "- the square brackets [ ] means to combine all of the regular expressions inside. For example `[A-Z0-9._%+-]` would mean include all letters A to Z, all numbers 0 to 9, and the characters ._%+-\n",
    "- `+` means to concatenate the strings patterns\n",
    "- `{2,4}` means consider only 2 to 4 character strings\n",
    "\n",
    "To summarise the pattern above searches for any combination of letters and numbers, followed by a `@`, then any combination of letters and numbers followed by a `.` with only 2 to 4 letters after it."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Regular expressions and pandas\n",
    "Let's see how they can be combined. Replicating the example above"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.Series({'Dave': 'Daves email dave@google.com', 'Steve': 'Steves email steve@gmail.com',\n",
    "        'Rob': 'Robs rob@gmail.com', 'Wes': np.nan})\n",
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can reuse the same `pattern` variable from above"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.str.findall(pattern, flags=re.IGNORECASE)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "pandas also offers more standard string operations. For example, we can check if a string is contained within a data row:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data.str.contains(\"gmail\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Many more of these methods exist:\n",
    "    \n",
    "    \n",
    "| Methods | Description |\n",
    "| -- | -- |\n",
    "| cat | Concatenate strings element-wise with optional delimiter |\n",
    "| contains | Return boolean array if each string contains pattern/regex |\n",
    "| count | Count occurrences of a pattern |\n",
    "| extract | Use a regex with groups to extract one or more strings from a Series |\n",
    "| findall | Computer list of all occurrences of pattern/regex for each string |\n",
    "| get | Index into each element |\n",
    "| isdecimal | Checks if the string is a decimal number |\n",
    "| isdigit | Checks if the string is a digit |\n",
    "| islower | Checks if the string is in lower case |\n",
    "| isupper | Checks if the string is in upper case |\n",
    "| join | Join strings in each element of the Series with passed seperator |\n",
    "| len | Compute the length of each string |\n",
    "| lower, upper | Convert cases |\n",
    "| match | Returns matched groups as a list |\n",
    "| pad | Adds whitespace to left, right or both sides of strings |\n",
    "| repeat | Duplicate string values |\n",
    "| slice | Slice each string in the Series |"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 12\n",
    "There is a `dataset data/yob2012.txt` which lists the number of newborns registered in 2018 with their names and sex. Using regular expressions, extract all names from the dataset which start with letters A to C. How many names did you find?\n",
    "\n",
    "Note: `^` is the \"starting with\" operator in regular expressions, "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Thanks "
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}