{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Regular expressions (regex)\n",
"A regular expression is a sequence of characters that defines a search pattern. Regexes let us do things like searching for email addresses that follow a particular pattern, e.g. ones that start with an \"s\", followed by 3 digits and ending with \"@yahoo.com\".\n",
"\n",
"In this notebook we will briefly touch upon string manipulation and using regex with pandas."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# String manipulation <a name=\"strings\"></a>\n",
"Python has long been popular for its raw data manipulation in part due to its ease of use for string and text processing. Most text operations are made simple with the string object's built-in methods. For more complex pattern matching and text manipulations, regular expressions may be needed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Basics\n",
"Let's refresh what normal `str` (String objects) are capable of in Python"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# complex strings can be broken into small bits\n",
"val = \"Edinburgh is great\"\n",
"val.split(\" \")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# substrings can be concatenated together with +\n",
"first, second, last = val.split(\" \")\n",
"first + \"::\" + second + \"::\" + last"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remember that strings are sequences of individual characters, so you can loop over them:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"val = \"Edinburgh\"\n",
"for each in val:\n",
"    print(each)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Strings also come with search methods such as `find`, which returns the index of the first occurrence of a substring:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"val.find(\"n\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"val.find(\"x\") # -1 means that there is no such element"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# and of course remember about upper() and lower()\n",
"val.upper()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to learn more about strings you can always refer to the [Python manual](https://docs.python.org/2/library/string.html)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Regular expressions\n",
"Regular expressions provide a flexible way to search or match (often more complex) string patterns in text. A single expression, commonly called a *regex*, is a string formed according to the regular expression language. Python's built-in `re` module is responsible for applying regular expressions to strings."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"text = \"foo bar\\t baz \\tqux\"\n",
"text"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"re.split(r\"\\s+\", text)"
]
},
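{
"cell_type": "markdown",
"metadata": {},
"source": [
"For comparison, the cell below is a small sketch of what plain `str.split(\" \")` returns on the same string next to the regex version. The name `messy` is introduced only for this illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"# same sample string as above, under a new illustrative name\n",
"messy = \"foo bar\\t baz \\tqux\"\n",
"\n",
"# splitting on a single space leaves the tab characters attached to the tokens\n",
"print(messy.split(\" \"))\n",
"\n",
"# splitting on runs of whitespace gives clean tokens\n",
"print(re.split(r\"\\s+\", messy))"
]
},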
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This expression splits the text on whitespace: `\\s` matches any whitespace character (spaces, tabs `\\t`, newlines), and the `+` after it means one or more consecutive occurrences, so each run of whitespace is treated as a single separator and dropped from the result.\n",
"\n",
"Let's have a look at a more complex example - identifying email addresses in a text file:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"text = \"\"\"Dave dave@google.com\n",
"Steve steve@gmail.com\n",
"Rob rob@gmail.com\n",
"Ryan ryan@yahoo.com\n",
"\"\"\"\n",
"\n",
"# pattern to be used for searching\n",
"pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}'\n",
"\n",
"# re.IGNORECASE makes the regex case-insensitive\n",
"regex = re.compile(pattern, flags=re.IGNORECASE)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"regex.findall(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's dissect the regex part by part:\n",
"```\n",
"pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}'\n",
"```\n",
"\n",
"- the `r` prefix marks a raw string: Python leaves backslashes untouched instead of interpreting escape sequences such as `\\n` (newline), so `\\.` reaches the regex engine exactly as written\n",
"- `A-Z` means all uppercase letters from A to Z; lowercase letters are matched here only because the pattern is compiled with `re.IGNORECASE`\n",
"- `0-9` similarly means all digits from 0 to 9\n",
"- the run `._%+-` means just those literal characters\n",
"- the square brackets [ ] define a character class, which matches any single character listed inside. For example `[A-Z0-9._%+-]` matches one letter A to Z, one digit 0 to 9, or one of the characters ._%+-\n",
"- `+` means one or more repetitions of the preceding element\n",
"- `\\.` matches a literal dot (an unescaped `.` would match almost any character)\n",
"- `{2,4}` means 2 to 4 repetitions of the preceding element\n",
"\n",
"To summarise, the pattern above searches for one or more letters, digits or the characters ._%+-, followed by a `@`, then one or more letters, digits, dots or hyphens, followed by a `.` with 2 to 4 letters after it."
]
},
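{
"cell_type": "markdown",
"metadata": {},
"source": [
"The pieces described above can be checked one at a time. The cell below is a minimal sketch and not part of the original example: the sample strings are made up for illustration, and `re.fullmatch` is used so that the whole string has to match the piece being tested."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"# character class plus `+`: one or more of the allowed characters\n",
"print(bool(re.fullmatch(r\"[A-Z0-9._%+-]+\", \"DAVE.SMITH+NEWS\")))\n",
"\n",
"# without re.IGNORECASE, lowercase letters are not covered by A-Z\n",
"print(bool(re.fullmatch(r\"[A-Z]+\", \"dave\")))\n",
"print(bool(re.fullmatch(r\"[A-Z]+\", \"dave\", flags=re.IGNORECASE)))\n",
"\n",
"# {2,4}: only 2 to 4 repetitions of the preceding class are accepted\n",
"for ending in [\"c\", \"com\", \"museum\"]:\n",
"    print(ending, bool(re.fullmatch(r\"[A-Z]{2,4}\", ending, flags=re.IGNORECASE)))"
]
},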
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Regular expressions and pandas\n",
"Let's see how they can be combined by replicating the example above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = pd.Series({'Dave': 'Daves email dave@google.com', 'Steve': 'Steves email steve@gmail.com',\n",
" 'Rob': 'Robs rob@gmail.com', 'Wes': np.nan})\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can reuse the same `pattern` variable from above"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data.str.findall(pattern, flags=re.IGNORECASE)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"pandas also offers more standard string operations. For example, we can check if a string is contained within a data row:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data.str.contains(\"gmail\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Many more of these methods exist:\n",
" \n",
" \n",
"| Methods | Description |\n",
"| -- | -- |\n",
"| cat | Concatenate strings element-wise with optional delimiter |\n",
"| contains | Return boolean array indicating whether each string contains pattern/regex |\n",
"| count | Count occurrences of a pattern |\n",
"| extract | Use a regex with groups to extract one or more strings from a Series |\n",
"| findall | Compute list of all occurrences of pattern/regex for each string |\n",
"| get | Index into each element |\n",
"| isdecimal | Checks if all characters in the string are decimal characters |\n",
"| isdigit | Checks if all characters in the string are digits |\n",
"| islower | Checks if the string is in lower case |\n",
"| isupper | Checks if the string is in upper case |\n",
"| join | Join strings in each element of the Series with passed separator |\n",
"| len | Compute the length of each string |\n",
"| lower, upper | Convert cases |\n",
"| match | Check whether each string matches the pattern/regex from its start (boolean result in recent pandas versions) |\n",
"| pad | Adds whitespace to left, right or both sides of strings |\n",
"| repeat | Duplicate string values |\n",
"| slice | Slice each string in the Series |"
]
},
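{
"cell_type": "markdown",
"metadata": {},
"source": [
"As one example from the table, `extract` can pull parts out of each string. The cell below is a minimal sketch, assuming we want the user name and the domain of each address. The Series `emails`, the group names `user` and `domain`, and the variable `parts` are made up for this illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"emails = pd.Series({\"Dave\": \"Daves email dave@google.com\",\n",
"                    \"Rob\": \"Robs rob@gmail.com\",\n",
"                    \"Wes\": np.nan})\n",
"\n",
"# named groups become the column names of the resulting DataFrame\n",
"parts = r\"(?P<user>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+\\.[A-Z]{2,4})\"\n",
"\n",
"# rows without a match (including missing values) come back as NaN\n",
"emails.str.extract(parts, flags=re.IGNORECASE)"
]
},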
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise\n",
"There is a dataset `data/yob2012.txt` which lists the number of newborns registered in 2012 with their names and sex. Using regular expressions, extract all names from the dataset which start with letters A to C. How many names did you find?\n",
"\n",
"Note: `^` is the \"starting with\" operator in regular expressions: it anchors the pattern to the start of the string."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",