{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Notebook 1 - Warm-up Exercises <a name=\"text\"></a>\n",
"\n",
"In this notebook we will warm up with a textual analysis exercise, using some of the assumed basic python knowledge for the course. We will see how to use some basic string methods as well as how to open and close files in python (later, some of these methods will be superceded by inbuild methods of data science packages we will use).\n",
"\n",
"For this we will be using the text [Humanistic Nursing by Josephine G. Paterson and Loretta T. Zderad](http://www.gutenberg.org/ebooks/25020). You already have this downloaded in your workspace.\n",
"\n",
"To open up the file, Python gives us a very handy function, we just have to give it the path to the file:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"file = open(\"data/humanistic_nursing.txt\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"An easy way to deal with text files is reading it line by line within a for loop:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"for line in file:\n",
" print(line)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Well, that was a lot of text. Can we turn it into something useful?\n",
"\n",
"For example, we can split up each line into the words making it and then count the occurances of the word \"and\". Here's a function that does that. Try it out!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# define functino\n",
"def countAnd(file_path):\n",
" counter = 0\n",
" file = open(file_path)\n",
" \n",
" for line in file:\n",
" for word in line.split():\n",
" if word == \"and\":\n",
" counter += 1\n",
" \n",
" return counter"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# try function\n",
"countAnd(\"./data/humanistic_nursing.txt\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 1: Count any word\n",
"Based on the function above, now write your own function which counts the occurances of any word. For example:\n",
"```python\n",
"countAny(filename, \"medicine\")\n",
"```\n",
"will return the occurences of the word \"medicine\" in the file filename."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def countAny(file_path, des_word):\n",
" # [ WRITE YOUR CODE HERE ]\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Verify your function\n",
"countAny(\"./data/humanistic_nursing.txt\", \"patient\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 2: Count multiple words\n",
"Before this exercise you should be familiar with Python dictionaries. If you're not, please see [here](https://docs.python.org/3/tutorial/datastructures.html#dictionaries).\n",
"\n",
"Write a function which takes a file path and a list of words, and returns a dictionary mapping each word to its frequency in the given file.\n",
"\n",
"Intuitively, we can first fill in the dictionary keys with the words in our list. Afterwards we can count the occurrences of each word and and fill in the appropriate dictionary value.\n",
"\n",
"*Hint: Can we use `countAny()` for this?*"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def countAll(file_path, words):\n",
" # [ WRITE YOUR CODE HERE ]\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Verify your function\n",
"countAll(\"./data/humanistic_nursing.txt\", [\"patient\", \"and\", \"the\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You should expect `{'patient': 125, 'and': 1922, 'the': 2604}`"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exercise 3: Cleaning up \n",
"Unless you have already accounted for it, your counter was thrown off by some words. Consider this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(countAny(filename, \"work\"))\n",
"print(countAny(filename, \"work.\"))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Although the word is the same, you have different values for the 2 keys.\n",
"\n",
"In order to fix this we have to clean up the words before they are counted.\n",
"\n",
"There are multiple ways to do this. A good approach will be:\n",
"- take all wards, one by one\n",
"- use the `.strip()` method to clean up bad charecters\n",
"- convert all words to lowercase\n",
"\n",
"You can get some ideas from the [String methods page](https://docs.python.org/3/library/stdtypes.html#string-methods)\n",
"\n",
"A good way to \n",
"\n",
"Now write a function that opens up the text, cleans all of the words and returns a big long list of words:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def cleanText(file_path):\n",
" # [ WRITE YOUR CODE HERE ]\n",
" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Verify your function\n",
"cleanText(filename)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.6"
}
},
"nbformat": 4,
"nbformat_minor": 2
}