python-data-1-warmup.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Notebook 1 - Warm-up Exercises <a name=\"text\"></a>\n",
    "\n",
    "In this notebook we will warm up with a textual analysis exercise, using some of the assumed basic python knowledge for the course. We will see how to use some basic string methods as well as how to open and close files in python (later, some of these methods will be superceded by inbuild methods of data science packages we will use).\n",
    "\n",
    "For this we will be using the text [Humanistic Nursing by Josephine G. Paterson and Loretta T. Zderad](http://www.gutenberg.org/ebooks/25020). You already have this downloaded in your workspace.\n",
    "\n",
    "To open up the file, Python gives us a very handy function, we just have to give it the path to the file:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "file = open(\"data/humanistic_nursing.txt\")"
   ]
  },
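  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Since this notebook covers closing files as well as opening them, here is a minimal sketch of one common way to do it: a `with` block, which closes the file automatically when the block ends. The throwaway file and the names `sample_path` and `sample_file` are assumptions made up for illustration; they are not part of the course data.\n",
    "\n",
    "```python\n",
    "import os\n",
    "import tempfile\n",
    "\n",
    "# Create a small throwaway file so the sketch is self-contained.\n",
    "with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:\n",
    "    print('first line', file=tmp)\n",
    "    print('second line', file=tmp)\n",
    "    sample_path = tmp.name  # hypothetical path, for illustration only\n",
    "\n",
    "# The with-block closes the file as soon as the block ends.\n",
    "with open(sample_path) as sample_file:\n",
    "    lines = [line.strip() for line in sample_file]\n",
    "\n",
    "print(lines)\n",
    "print(sample_file.closed)\n",
    "\n",
    "os.remove(sample_path)  # clean up the throwaway file\n",
    "```\n",
    "\n",
    "This should print the two stripped lines followed by `True`, because the file is already closed once the block has ended. Calling `.close()` on the file object by hand achieves the same result."
   ]
  },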
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "An easy way to deal with text files is reading it line by line within a for loop:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "for line in file:\n",
    "    print(line)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Well, that was a lot of text. Can we turn it into something useful?\n",
    "\n",
    "For example, we can split up each line into the words making it and then count the occurances of the word \"and\". Here's a function that does that. Try it out!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# define functino\n",
    "def countAnd(file_path):\n",
    "    counter = 0\n",
    "    file = open(file_path)\n",
    "    \n",
    "    for line in file:\n",
    "        for word in line.split():\n",
    "            if word == \"and\":\n",
    "                counter += 1\n",
    "                \n",
    "    return counter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# try function\n",
    "countAnd(\"./data/humanistic_nursing.txt\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 1: Count any word\n",
    "Based on the function above, now write your own function which counts the occurances of any word. For example:\n",
    "```python\n",
    "countAny(filename, \"medicine\")\n",
    "```\n",
    "will return the occurences of the word \"medicine\" in the file filename."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def countAny(file_path, des_word):\n",
    "    # [ WRITE YOUR CODE HERE ]\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Verify your function\n",
    "countAny(\"./data/humanistic_nursing.txt\", \"patient\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 2: Count multiple words\n",
    "Before this exercise you should be familiar with Python dictionaries. If you're not, please see [here](https://docs.python.org/3/tutorial/datastructures.html#dictionaries).\n",
    "\n",
    "Write a function which takes a file path and a list of words, and returns a dictionary mapping each word to its frequency in the given file.\n",
    "\n",
    "Intuitively, we can first fill in the dictionary keys with the words in our list. Afterwards we can count the occurrences of each word and and fill in the appropriate dictionary value.\n",
    "\n",
    "*Hint: Can we use `countAny()` for this?*"
   ]
  },
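  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The two steps described above can be sketched on a small in-memory list instead of a file. This is only an illustration under stated assumptions: `tokens`, `wanted` and `counts` are hypothetical names, and the word list is made up rather than taken from the book.\n",
    "\n",
    "```python\n",
    "# Hypothetical word list standing in for the words read from a file.\n",
    "tokens = ['the', 'nurse', 'and', 'the', 'patient']\n",
    "wanted = ['the', 'patient', 'doctor']\n",
    "\n",
    "# Step 1: fill in the keys, each starting at zero.\n",
    "counts = {}\n",
    "for word in wanted:\n",
    "    counts[word] = 0\n",
    "\n",
    "# Step 2: fill in the value for each key.\n",
    "for word in wanted:\n",
    "    counts[word] = tokens.count(word)\n",
    "\n",
    "print(counts)  # {'the': 2, 'patient': 1, 'doctor': 0}\n",
    "```\n",
    "\n",
    "A word that never appears keeps its initial value of zero, so every requested word ends up as a key in the result."
   ]
  },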
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def countAll(file_path, words):\n",
    "    # [ WRITE YOUR CODE HERE ]\n",
    "    "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Verify your function\n",
    "countAll(\"./data/humanistic_nursing.txt\", [\"patient\", \"and\", \"the\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You should expect `{'patient': 125, 'and': 1922, 'the': 2604}`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exercise 3: Cleaning up \n",
    "Unless you have already accounted for it, your counter was thrown off by some words. Consider this:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(countAny(filename, \"work\"))\n",
    "print(countAny(filename, \"work.\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Although the word is the same, you have different values for the 2 keys.\n",
    "\n",
    "In order to fix this we have to clean up the words before they are counted.\n",
    "\n",
    "There are multiple ways to do this. A good approach will be:\n",
    "- take all wards, one by one\n",
    "- use the `.strip()` method to clean up bad charecters\n",
    "- convert all words to lowercase\n",
    "\n",
    "You can get some ideas from the [String methods page](https://docs.python.org/3/library/stdtypes.html#string-methods)\n",
    "\n",
    "A good way to \n",
    "\n",
    "Now write a function that opens up the text, cleans all of the words and returns a big long list of words:"
   ]
  },
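  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "As a hint, the cleaning steps listed above can be tried on a handful of words first. This is a minimal sketch, assuming that the bad characters are ASCII punctuation; `raw_words` and `cleaned` are hypothetical names, and the real text may contain other characters that need handling.\n",
    "\n",
    "```python\n",
    "import string\n",
    "\n",
    "raw_words = ['Work.', '(work)', 'work,', 'WORK']\n",
    "\n",
    "# string.punctuation holds the ASCII punctuation characters. Passing it to\n",
    "# .strip() removes them from both ends of a word, but not from the middle.\n",
    "cleaned = [word.strip(string.punctuation).lower() for word in raw_words]\n",
    "\n",
    "print(cleaned)  # ['work', 'work', 'work', 'work']\n",
    "```\n",
    "\n",
    "All four variants collapse to the same word, which is what the counter needs."
   ]
  },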
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def cleanText(file_path):\n",
    "    # [ WRITE YOUR CODE HERE ]\n",
    "             "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Verify your function\n",
    "cleanText(filename)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}