" <td>0.001317</td>\n",
" <td>-0.047580</td>\n",
" </tr>\n",
" <tr>\n",
" <th>std</th>\n",
" <td>1.007406</td>\n",
" <td>0.977138</td>\n",
" <td>1.021164</td>\n",
" <td>1.006258</td>\n",
" </tr>\n",
" <tr>\n",
" <th>min</th>\n",
" <td>-3.000000</td>\n",
" <td>-3.000000</td>\n",
" <td>-2.824025</td>\n",
" <td>-3.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>25%</th>\n",
" <td>-0.696013</td>\n",
" <td>-0.664201</td>\n",
" <td>-0.667879</td>\n",
" <td>-0.735838</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50%</th>\n",
" <td>-0.014608</td>\n",
" <td>0.012376</td>\n",
" <td>0.018254</td>\n",
" <td>-0.078759</td>\n",
" </tr>\n",
" <tr>\n",
" <th>75%</th>\n",
" <td>0.660866</td>\n",
" <td>0.642424</td>\n",
" <td>0.678173</td>\n",
" <td>0.617265</td>\n",
" </tr>\n",
" <tr>\n",
" <th>max</th>\n",
" <td>3.000000</td>\n",
" <td>3.000000</td>\n",
" <td>3.000000</td>\n",
" <td>3.000000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3\n",
"count 1000.000000 1000.000000 1000.000000 1000.000000\n",
"mean -0.025998 -0.007102 0.001317 -0.047580\n",
"std 1.007406 0.977138 1.021164 1.006258\n",
"min -3.000000 -3.000000 -2.824025 -3.000000\n",
"25% -0.696013 -0.664201 -0.667879 -0.735838\n",
"50% -0.014608 0.012376 0.018254 -0.078759\n",
"75% 0.660866 0.642424 0.678173 0.617265\n",
"max 3.000000 3.000000 3.000000 3.000000"
]
},
"execution_count": 105,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data[np.abs(data) > 3] = np.sign(data) * 3\n",
"data.describe()"
]
},
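As a hedged aside, the capping step in the cell above can also be written with `DataFrame.clip`, which bounds every value to a given interval. The sketch below uses a hypothetical `demo` frame of seeded random numbers (not the `data` frame from this notebook) so that it runs on its own:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the notebook's data: 1000 x 4 standard normal draws
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.standard_normal((1000, 4)))

# Approach used above: overwrite values beyond +/-3 with the signed limit
capped_sign = demo.copy()
capped_sign[np.abs(capped_sign) > 3] = np.sign(capped_sign) * 3

# Alternative: clip bounds every value to the interval [-3, 3]
capped_clip = demo.clip(lower=-3, upper=3)

# Both approaches should agree element by element
print(np.allclose(capped_sign.to_numpy(), capped_clip.to_numpy()))
```

`clip` returns a new DataFrame and leaves `demo` unchanged, whereas the boolean-mask assignment modifies the frame in place.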
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Permutation and Random Sampling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Permuting (randomly reordering) of rows in pandas is easy to do using the `numpy.random.permutation` function. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering:"
]
},
{
"cell_type": "code",
"execution_count": 106,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" <td>10</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>12</td>\n",
" <td>13</td>\n",
" <td>14</td>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>16</td>\n",
" <td>17</td>\n",
" <td>18</td>\n",
" <td>19</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3\n",
"0 0 1 2 3\n",
"1 4 5 6 7\n",
"2 8 9 10 11\n",
"3 12 13 14 15\n",
"4 16 17 18 19"
]
},
"execution_count": 106,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 107,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 2, 0, 4, 3])"
]
},
"execution_count": 107,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# generate random order\n",
"sampler = np.random.permutation(5)\n",
"sampler"
]
},
{
"cell_type": "code",
"execution_count": 108,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" <td>10</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>2</td>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>16</td>\n",
" <td>17</td>\n",
" <td>18</td>\n",
" <td>19</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>12</td>\n",
" <td>13</td>\n",
" <td>14</td>\n",
" <td>15</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3\n",
"1 4 5 6 7\n",
"2 8 9 10 11\n",
"0 0 1 2 3\n",
"4 16 17 18 19\n",
"3 12 13 14 15"
]
},
"execution_count": 108,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.take(sampler)"
]
},
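The shuffle above gives a different ordering on every run. A minimal sketch of a reproducible version, assuming NumPy's `default_rng` API is available (NumPy 1.17 or later); the seed value 42 and the names `shuffled`, `shuffled_again` and `restored` are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))

# A seeded generator returns the same permutation on every run
rng = np.random.default_rng(42)
sampler = rng.permutation(len(df))

# take() reorders the rows by integer position
shuffled = df.take(sampler)

# sample(frac=1) shuffles all rows in a single call
shuffled_again = df.sample(frac=1, random_state=42)

# Sorting by the index restores the original row order
restored = shuffled.sort_index()
print((restored.to_numpy() == df.to_numpy()).all())
```

Because the original index labels travel with the rows, `sort_index` is enough to undo the permutation.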
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To select a random subset without replacement, you can use the sample method:"
]
},
{
"cell_type": "code",
"execution_count": 109,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>12</td>\n",
" <td>13</td>\n",
" <td>14</td>\n",
" <td>15</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>8</td>\n",
" <td>9</td>\n",
" <td>10</td>\n",
" <td>11</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>4</td>\n",
" <td>5</td>\n",
" <td>6</td>\n",
" <td>7</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2 3\n",
"3 12 13 14 15\n",
"2 8 9 10 11\n",
"1 4 5 6 7"
]
},
"execution_count": 109,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.sample(n=3)"
]
},
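By default `sample` draws without replacement, so `n` cannot exceed the number of rows. To allow repeated draws, pass `replace=True`. A small sketch; the `choices` Series and the `random_state` value are illustrative assumptions, not part of the original notebook:

```python
import pandas as pd

choices = pd.Series([5, 7, -1, 6, 4])

# With replace=True the same element can be drawn more than once,
# so the sample may be larger than the original Series
draws = choices.sample(n=10, replace=True, random_state=0)
print(draws.values)

# Without replacement, asking for more elements than exist raises an error
try:
    choices.sample(n=10)
except ValueError as err:
    print("ValueError:", err)
```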
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# String manipulation <a name=\"strings\"></a>\n",
"Python has long been popular for its raw data manipulation in part due to its ease of use for string and text processing. Most text operations are made simple with the string object's built-in methods. For more complex pattern matching and text manipulations, regular expressions may be needed."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Basics\n",
"Let's refresh what normal `str` (String objects) are capable of in Python"
]
},
{
"cell_type": "code",
"execution_count": 110,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['Edinburgh', 'is', 'great']"
]
},
"execution_count": 110,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# complex strings can be broken into small bits\n",
"val = \"Edinburgh is great\"\n",
"val.split(\" \")"
]
},
{
"cell_type": "code",
"execution_count": 111,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Edinburgh::is::great'"
]
},
"execution_count": 111,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# substrings can be concatinated together with +\n",
"first, second, last = val.split(\" \")\n",
"first + \"::\" + second + \"::\" + last"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remember that Strings are just lists of individual charecters"
]
},
{
"cell_type": "code",
"execution_count": 112,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"E\n",
"d\n",
"i\n",
"n\n",
"b\n",
"u\n",
"r\n",
"g\n",
"h\n"
]
}
],
"source": [
"val = \"Edinburgh\"\n",
"for each in val:\n",
" print(each)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can use standard list operations with them"
]
},
{
"cell_type": "code",
"execution_count": 113,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"3"
]
},
"execution_count": 113,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"val.find(\"n\")"
]
},
{
"cell_type": "code",
"execution_count": 114,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-1"
]
},
"execution_count": 114,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"val.find(\"x\") # -1 means that there is no such element"
]
},
{
"cell_type": "code",
"execution_count": 115,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'EDINBURGH'"
]
},
"execution_count": 115,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# and of course remember about upper() and lower()\n",
"val.upper()"
]
},
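A few more built-in string operations are worth keeping in mind. The sketch below shows `join`, `strip`, the `in` operator, slicing, and `replace`; the sample values are made up for illustration:

```python
val = "Edinburgh is great"
pieces = val.split(" ")

# join is the usual way to glue a list of substrings together
joined = "::".join(pieces)
print(joined)  # Edinburgh::is::great

# strip removes leading and trailing whitespace
print("  padded  ".strip())  # padded

# the in operator tests for a substring
print("burgh" in val)  # True

# slicing works as it does for other sequences
print(val[:9])  # Edinburgh

# strings are immutable: replace returns a new string
print(val.replace("great", "rainy"))  # Edinburgh is rainy
```

Using `join` avoids the repeated `+` concatenation shown earlier and works for any number of pieces.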
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to learn more about strings you can always refer to the [Python manual](https://docs.python.org/2/library/string.html)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Regular expressions\n",
"provide a flexible way to search or match (often more complex) string patterns in text. A single expression, commonly called *regex*, is a string formed according to the regular expression language. Python's built-in module is responsible for applying regular expression of strings via the `re` package"
"execution_count": 116,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'foo bar\\t baz \\tqux'"
]
},
"execution_count": 116,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import re\n",
"text = \"foo bar\\t baz \\tqux\"\n",
"text"
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['foo', 'bar', 'baz', 'qux']"
]
},
"execution_count": 117,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"re.split(\"\\s+\", text)"
]
},
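If the same pattern is applied to many strings, it can be compiled once with `re.compile` and reused. A minimal sketch, where the name `whitespace` is just an illustrative choice:

```python
import re

text = "foo    bar\t baz  \tqux"

# Compile the pattern once; the raw string keeps the backslash intact
whitespace = re.compile(r"\s+")

# split breaks the text on each run of whitespace
parts = whitespace.split(text)
print(parts)  # ['foo', 'bar', 'baz', 'qux']

# findall returns the runs of whitespace that were matched
gaps = whitespace.findall(text)
print(gaps)

# sub replaces each run with a single space
print(whitespace.sub(" ", text))  # foo bar baz qux
```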
{
"cell_type": "markdown",
"metadata": {},
"source": [
"this expression effectively removed all whitespaces and tab characters (`\\t`) which was stated with the `\\s` regex and then the `+` after it means to remove any number of sequential occurrences of that character.\n",
"Let's have a look at a more complex example - identifying email addresses in a text file:"
"metadata": {},
"outputs": [],
"source": [
"text = \"\"\"Dave dave@google.com\n",
"Steve steve@gmail.com\n",
"Rob rob@gmail.com\n",
"Ryan ryan@yahoo.com\n",
"\"\"\"\n",
"\n",
"# pattern to be used for searching\n",
"pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}'\n",
"\n",
"# re.IGNORECASE makes the regex case-insensitive\n",
"regex = re.compile(pattern, flags=re.IGNORECASE)"
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['dave@google.com', 'steve@gmail.com', 'rob@gmail.com', 'ryan@yahoo.com']"
]
},
"execution_count": 119,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"regex.findall(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's dissect the regex part by part:\n",
"```\n",
"pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}'\n",
"```\n",
"\n",
"- the `r` prefix before the string signals that the string should keep special characters such as the newline character `\\n`. Otherwise, Python would just treat it as a newline\n",
"- `A-Z` means all letters from A to Z including lowercase and uppercase\n",
"- `0-9` similarly means all characters from 0 to 9\n",
"- the concatenation `._%+-` means just include those characters\n",
"- the square brackets [ ] means to combine all of the regular expressions inside. For example `[A-Z0-9._%+-]` would mean include all letters A to Z, all numbers 0 to 9, and the characters ._%+-\n",
"- `+` means to concatenate the strings patterns\n",
"- `{2,4}` means consider only 2 to 4 character strings\n",
"\n",
"To summarise the pattern above searches for any combination of letters and numbers, followed by a `@`, then any combination of letters and numbers followed by a `.` with only 2 to 4 letters after it."
]
},
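To see the individual parts of the pattern at work, each segment can be wrapped in parentheses to form a capture group. This is a sketch under the same simplified notion of an email address as above, not a complete validator; `grouped_pattern` is an illustrative name:

```python
import re

text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""

# Same pattern as above, split into username, domain name and suffix groups
grouped_pattern = r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})"
regex = re.compile(grouped_pattern, flags=re.IGNORECASE)

# With groups, findall returns one tuple per match
matches = regex.findall(text)
print(matches)

# search returns only the first match; groups() exposes its parts
first = regex.search(text)
print(first.group(0))   # dave@google.com
print(first.groups())   # ('dave', 'google', 'com')
```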
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Regular expressions and pandas\n",
"Let's see how they can be combined. Replicating the example above"
]
},
{
"cell_type": "code",
"source": [
"data = pd.Series({'Dave': 'Daves email dave@google.com', 'Steve': 'Steves email steve@gmail.com',\n",
" 'Rob': 'Robs rob@gmail.com', 'Wes': np.nan})\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can reuse the same `pattern` variable from above"
]
},
{
"cell_type": "code",
"source": [
"data.str.findall(pattern, flags=re.IGNORECASE)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"pandas also offers more standard string operations. For example, we can check if a string is contained within a data row:"
"source": [
"data.str.contains(\"gmail\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Many more of these methods exist:\n",
" \n",
" \n",
"| -- | -- |\n",
"| cat | Concatenate strings element-wise with optional delimiter |\n",
"| contains | Return boolean array if each string contains pattern/regex |\n",
"| extract | Use a regex with groups to extract one or more strings from a Series |\n",
"| findall | Computer list of all occurrences of pattern/regex for each string |\n",
"| get | Index into each element |\n",
"| isdecimal | Checks if the string is a decimal number |\n",
"| isdigit | Checks if the string is a digit |\n",
"| islower | Checks if the string is in lower case |\n",
"| isupper | Checks if the string is in upper case |\n",
"| join | Join strings in each element of the Series with passed seperator |\n",
"| lower, upper | Convert cases |\n",
"| match | Returns matched groups as a list |\n",
"| pad | Adds whitespace to left, right or both sides of strings |\n",
"| repeat | Duplicate string values |\n",
"| slice | Slice each string in the Series |"
]
},
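A short sketch of a few of these methods in action, reusing a Series shaped like the one above. Missing values propagate through the `.str` methods; the `na=False` argument to `contains` is one way to report them as `False` instead:

```python
import re

import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'Daves email dave@google.com',
                  'Steve': 'Steves email steve@gmail.com',
                  'Rob': 'Robs rob@gmail.com',
                  'Wes': np.nan})

# upper and slice operate element-wise and skip missing values
print(data.str.upper())
print(data.str.slice(0, 5))

# na=False reports missing entries as False instead of NaN
is_gmail = data.str.contains("gmail", na=False)
print(data[is_gmail])

# extract returns one column per capture group in the regex
parts = data.str.extract(r"([A-Z0-9._%+-]+)@([A-Z0-9.-]+)\.([A-Z]{2,4})",
                         flags=re.IGNORECASE)
print(parts)
```

A boolean Series without missing values, such as `is_gmail`, can be used directly to filter rows.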
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 12\n",
"There is a `dataset data/yob2012.txt` which lists the number of newborns registered in 2018 with their names and sex. Using regular expressions, extract all names from the dataset which start with letters A to C. How many names did you find?\n",
"\n",
"Note: `^` is the \"starting with\" operator in regular expressions, "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Thanks "
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",