"source": [
"# substrings can be concatinated together with +\n",
"first, second, last = val.split(\" \")\n",
"first + \"::\" + second + \"::\" + last"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remember that Strings are just lists of individual charecters"
]
},
{
"cell_type": "code",
"source": [
"val = \"Edinburgh\"\n",
"for each in val:\n",
" print(each)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can use standard list operations with them"
]
},
{
"cell_type": "code",
"source": [
"val.find(\"n\")"
]
},
{
"cell_type": "code",
"source": [
"val.find(\"x\") # -1 means that there is no such element"
]
},
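As a small sketch of how `find()` relates to other membership checks (the variable names `pos_n`, `pos_x`, `has_x` and `raised` are illustrative and not part of the notebook): `find()` returns the index of the first match or `-1`, the `in` operator returns a boolean, and `index()` raises a `ValueError` when nothing is found.

```python
val = "Edinburgh"

# find() returns the position of the first match, or -1 if there is none
pos_n = val.find("n")
pos_x = val.find("x")

# the `in` operator is the usual way to test membership
has_x = "x" in val

# index() behaves like find() but raises ValueError on a miss
try:
    val.index("x")
    raised = False
except ValueError:
    raised = True

print(pos_n, pos_x, has_x, raised)
```

When only a yes/no answer is needed, `in` tends to read more clearly than comparing the result of `find()` against `-1`.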
{
"cell_type": "code",
"source": [
"# and of course remember about upper() and lower()\n",
"val.upper()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want to learn more about strings you can always refer to the [Python manual](https://docs.python.org/2/library/string.html)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Regular expressions\n",
"provide a flexible way to search or match (often more complex) string patterns in text. A single expression, commonly called *regex*, is a string formed according to the regular expression language. Python's built-in module is responsible for applying regular expression of strings via the `re` package"
"source": [
"import re\n",
"text = \"foo bar\\t baz \\tqux\"\n",
"text"
]
},
{
"cell_type": "code",
"source": [
"re.split(\"\\s+\", text)"
]
},
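To see what the `+` in the pattern used above contributes, here is a short comparison (a sketch reusing the same `text` value; the names `single` and `runs` are illustrative). Splitting on a single `\s` yields empty strings wherever two whitespace characters are adjacent, while `\s+` treats each run of whitespace as one separator.

```python
import re

text = "foo bar\t baz \tqux"

# one separator per whitespace character: adjacent whitespace gives empty strings
single = re.split(r"\s", text)

# one separator per run of whitespace characters
runs = re.split(r"\s+", text)

print(single)
print(runs)
```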
{
"cell_type": "markdown",
"metadata": {},
"source": [
"this expression effectively removed all whitespaces and tab characters (`\\t`) which was stated with the `\\s` regex and then the `+` after it means to remove any number of sequential occurrences of that character.\n",
"Let's have a look at a more complex example - identifying email addresses in a text file:"
"metadata": {},
"outputs": [],
"source": [
"text = \"\"\"Dave dave@google.com\n",
"Steve steve@gmail.com\n",
"Rob rob@gmail.com\n",
"Ryan ryan@yahoo.com\n",
"\"\"\"\n",
"\n",
"# pattern to be used for searching\n",
"pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}'\n",
"\n",
"# re.IGNORECASE makes the regex case-insensitive\n",
"regex = re.compile(pattern, flags=re.IGNORECASE)"
]
},
{
"cell_type": "code",
"source": [
"regex.findall(text)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's dissect the regex part by part:\n",
"```\n",
"pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\\.[A-Z]{2,4}'\n",
"```\n",
"\n",
"- the `r` prefix before the string signals that the string should keep special characters such as the newline character `\\n`. Otherwise, Python would just treat it as a newline\n",
"- `A-Z` means all letters from A to Z including lowercase and uppercase\n",
"- `0-9` similarly means all characters from 0 to 9\n",
"- the concatenation `._%+-` means just include those characters\n",
"- the square brackets [ ] means to combine all of the regular expressions inside. For example `[A-Z0-9._%+-]` would mean include all letters A to Z, all numbers 0 to 9, and the characters ._%+-\n",
"- `+` means to concatenate the strings patterns\n",
"- `{2,4}` means consider only 2 to 4 character strings\n",
"\n",
"To summarise the pattern above searches for any combination of letters and numbers, followed by a `@`, then any combination of letters and numbers followed by a `.` with only 2 to 4 letters after it."
]
},
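The individual pieces of the pattern can be checked in isolation. The following is a minimal sketch: the pattern is the one defined above, while the sample strings and the variable names are made up for illustration.

```python
import re

pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)

# a full match needs a local part, "@", a domain, "." and a 2 to 4 letter suffix
ok = regex.fullmatch("dave@google.com") is not None

# a one-letter suffix is too short for {2,4}
too_short = regex.fullmatch("dave@google.c") is not None

# without re.IGNORECASE, A-Z matches uppercase letters only
case_sensitive = re.compile(pattern)
lower_found = case_sensitive.findall("dave@google.com")
upper_found = case_sensitive.findall("DAVE@GOOGLE.COM")

print(ok, too_short, lower_found, upper_found)
```

Note that `fullmatch` requires the whole string to match, whereas `findall` (used in the notebook) scans for matches anywhere in the text.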
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Regular expressions and pandas\n",
"Let's see how they can be combined. Replicating the example above"
]
},
{
"cell_type": "code",
"source": [
"data = pd.Series({'Dave': 'Daves email dave@google.com', 'Steve': 'Steves email steve@gmail.com',\n",
" 'Rob': 'Robs rob@gmail.com', 'Wes': np.nan})\n",
"data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can reuse the same `pattern` variable from above"
]
},
{
"cell_type": "code",
"source": [
"data.str.findall(pattern, flags=re.IGNORECASE)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"pandas also offers more standard string operations. For example, we can check if a string is contained within a data row:"
"source": [
"data.str.contains(\"gmail\")"
]
},
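Missing values propagate through the `.str` methods, which matters when the result is used for filtering. A minimal sketch, assuming `pandas` and `numpy` are installed and rebuilding the `data` Series and `pattern` from above (`mask`, `gmail_rows` and `emails` are illustrative names): `na=False` makes `contains` return `False` for missing entries so the result can be used as a boolean mask, and `str.extract` with a capture group pulls the first matched address out of each row.

```python
import re

import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'Daves email dave@google.com',
                  'Steve': 'Steves email steve@gmail.com',
                  'Rob': 'Robs rob@gmail.com',
                  'Wes': np.nan})
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# na=False turns the missing entry into False instead of NaN
mask = data.str.contains("gmail", na=False)
gmail_rows = data[mask]

# extract needs a capture group, so the whole pattern is wrapped in ( )
emails = data.str.extract("(" + pattern + ")", flags=re.IGNORECASE, expand=False)

print(gmail_rows)
print(emails)
```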
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Many more of these methods exist:\n",
" \n",
" \n",
"| -- | -- |\n",
"| cat | Concatenate strings element-wise with optional delimiter |\n",
"| contains | Return boolean array if each string contains pattern/regex |\n",
"| extract | Use a regex with groups to extract one or more strings from a Series |\n",
"| findall | Computer list of all occurrences of pattern/regex for each string |\n",
"| get | Index into each element |\n",
"| isdecimal | Checks if the string is a decimal number |\n",
"| isdigit | Checks if the string is a digit |\n",
"| islower | Checks if the string is in lower case |\n",
"| isupper | Checks if the string is in upper case |\n",
"| join | Join strings in each element of the Series with passed seperator |\n",
"| lower, upper | Convert cases |\n",
"| match | Returns matched groups as a list |\n",
"| pad | Adds whitespace to left, right or both sides of strings |\n",
"| repeat | Duplicate string values |\n",
"| slice | Slice each string in the Series |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 12\n",
"There is a `dataset data/yob2012.txt` which lists the number of newborns registered in 2018 with their names and sex. Using regular expressions, extract all names from the dataset which start with letters A to C. How many names did you find?\n",
"\n",
"Note: `^` is the \"starting with\" operator in regular expressions, "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Thanks "
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",