# Regular expressions (regex)
A regular expression (regex) is a sequence of characters that defines a search pattern. Regexes allow us to do fancy data sciency things like searching for email addresses that follow a particular pattern, e.g. ones that start with an "s", followed by 3 digits and ending with "@yahoo.com".

In this notebook we will briefly touch upon string manipulation and using regex with pandas.
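
As a quick taste, the email pattern described above could be sketched like this. The variable names and the sample addresses are made up for illustration, and the pattern assumes the strictest reading of the description: an "s", exactly three digits, then "@yahoo.com" and nothing else.

```python
import re

# Assumed reading: "s", exactly three digits, then the literal "@yahoo.com"
yahoo_pattern = re.compile(r"s\d{3}@yahoo\.com")

# Hypothetical sample addresses
candidates = ["s123@yahoo.com", "s12@yahoo.com", "t123@yahoo.com", "s123@gmail.com"]

# fullmatch only succeeds if the whole string fits the pattern
matches = [c for c in candidates if yahoo_pattern.fullmatch(c)]
print(matches)
```

Only the first address fits, because the others have too few digits, the wrong first letter or the wrong domain.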

# String manipulation <a name="strings"></a>
Python has long been popular for its raw data manipulation in part due to its ease of use for string and text processing. Most text operations are made simple with the string object's built-in methods. For more complex pattern matching and text manipulations, regular expressions may be needed.

In [None]:
import numpy as np
import pandas as pd

### Basics
Let's refresh what normal `str` (String objects) are capable of in Python

In [None]:
# complex strings can be broken into small bits
val = "Edinburgh is great"
val.split(" ")

In [None]:
# substrings can be concatenated together with +
first, second, last = val.split(" ")
first + "::" + second + "::" + last
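
When there are many pieces to glue together, `str.join` is a common alternative to repeated `+`. A minimal sketch that reproduces the result above (the name `joined` is just for illustration):

```python
val = "Edinburgh is great"

# join places the separator between every element of the list
joined = "::".join(val.split(" "))
print(joined)
```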

Remember that strings are sequences of individual characters, so you can loop over them just like a list

In [None]:
val = "Edinburgh"
for each in val:
    print(each)

Strings also come with their own search methods. For example, `find` returns the index of the first occurrence of a substring

In [None]:
val.find("n")

In [None]:
val.find("x")  # -1 means that there is no such element
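
Because strings are sequences, the usual indexing, slicing and membership operations work on them too. A short sketch (the variable names are illustrative only):

```python
val = "Edinburgh"

first_char = val[0]          # indexing starts at 0
last_five = val[-5:]         # slicing with a negative start counts from the end
has_burgh = "burgh" in val   # membership test for a substring
length = len(val)            # number of characters
position = val.find("n")     # index of the first "n"

print(first_char, last_five, has_burgh, length, position)
```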

In [None]:
# and of course remember about upper() and lower()
val.upper()

If you want to learn more about strings you can always refer to the [Python manual](https://docs.python.org/3/library/string.html)

### Regular expressions
Regular expressions provide a flexible way to search or match (often more complex) string patterns in text. A single expression, commonly called a *regex*, is a string formed according to the regular expression language. Python's built-in `re` module is responsible for applying regular expressions to strings.

In [None]:
import re
text = "foo    bar\t baz   \tqux"
text

In [None]:
# use a raw string so that the backslash in \s reaches the regex engine untouched
re.split(r"\s+", text)

This expression split the text on every run of whitespace. The `\s` regex matches any whitespace character (spaces, tab characters `\t`, newlines), and the `+` after it means one or more consecutive occurrences, so each run of whitespace is treated as a single separator and does not appear in the result.
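
To see why the regex helps here, it can be compared with a plain `split(" ")` on the same text. This is only a sketch; the names `naive`, `by_runs` and `collapsed` are made up for the comparison.

```python
import re

text = "foo    bar\t baz   \tqux"

# Splitting on a single space leaves empty strings and stray tabs behind
naive = text.split(" ")

# Splitting on runs of whitespace gives the clean words
by_runs = re.split(r"\s+", text)

# The same pattern can also collapse each run into one space
collapsed = re.sub(r"\s+", " ", text)

print(naive)
print(by_runs)
print(collapsed)
```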

Let's have a look at a more complex example - identifying email addresses in a text file:

In [None]:
text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""

# pattern to be used for searching
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'

# re.IGNORECASE makes the regex case-insensitive
regex = re.compile(pattern, flags=re.IGNORECASE)

In [None]:
regex.findall(text)
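
Besides `findall`, a compiled regex offers a few other commonly used methods. The sketch below repeats the setup from above so that it runs on its own; the names `first`, `at_start` and `redacted` are illustrative.

```python
import re

text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
regex = re.compile(pattern, flags=re.IGNORECASE)

# search returns a match object for the first occurrence anywhere in the text
first = regex.search(text)
print(first.group(), first.start())

# match only looks at the very beginning of the text, which here is "Dave "
at_start = regex.match(text)
print(at_start)

# sub replaces every occurrence with the given string
redacted = regex.sub("REDACTED", text)
print(redacted)
```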

Let's dissect the regex part by part:
```
pattern = r'[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}'
```

- the `r` prefix makes this a raw string, so Python leaves backslashes untouched and passes sequences such as `\.` to the regex engine as written. Without it, Python would process escape sequences first, e.g. turning `\n` into an actual newline
- `A-Z` means all letters from A to Z. On its own it only covers uppercase letters; lowercase letters are included here because we compile the pattern with `re.IGNORECASE`
- `0-9` similarly means all digits from 0 to 9
- the concatenation `._%+-` means just include those characters
- the square brackets [ ] combine everything inside them into a set that matches any single one of those characters. For example `[A-Z0-9._%+-]` matches one letter A to Z, one digit 0 to 9, or one of the characters ._%+-
- `+` means that the preceding element is repeated one or more times
- `\.` matches a literal dot (an unescaped `.` would match any character)
- `{2,4}` means that the preceding element is repeated 2 to 4 times

To summarise, the pattern above searches for one or more letters, digits or the characters `._%+-`, followed by an `@`, then one or more letters, digits, dots or hyphens, followed by a `.` and finally 2 to 4 letters.
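
One way to keep such a dissection next to the pattern itself is the `re.VERBOSE` flag, which lets a pattern span several lines with comments. The following is a sketch of the same pattern written that way; the name `verbose_pattern` is an assumption for this example.

```python
import re

text = """Dave dave@google.com
Steve steve@gmail.com
Rob rob@gmail.com
Ryan ryan@yahoo.com
"""

# With re.VERBOSE, whitespace and "#" comments inside the pattern are ignored
verbose_pattern = r"""
    [A-Z0-9._%+-]+   # one or more letters, digits or the listed symbols
    @                # a literal at sign
    [A-Z0-9.-]+      # one or more letters, digits, dots or hyphens
    \.               # a literal dot
    [A-Z]{2,4}       # 2 to 4 letters
"""

regex = re.compile(verbose_pattern, flags=re.IGNORECASE | re.VERBOSE)
found = regex.findall(text)
print(found)
```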

### Regular expressions and pandas
Let's see how they can be combined by replicating the example above with a pandas `Series`

In [None]:
data = pd.Series({'Dave': 'Daves email dave@google.com', 'Steve': 'Steves email steve@gmail.com',
        'Rob': 'Robs rob@gmail.com', 'Wes': np.nan})
data

We can reuse the same `pattern` variable from above

In [None]:
data.str.findall(pattern, flags=re.IGNORECASE)
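
If the parts of each address are needed separately, `str.extract` with capture groups is one option. The sketch below adds two named groups to the pattern; the group names `user` and `domain` and the variable names are assumptions made for this example.

```python
import re
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'Daves email dave@google.com', 'Steve': 'Steves email steve@gmail.com',
                  'Rob': 'Robs rob@gmail.com', 'Wes': np.nan})

# Same email pattern as before, with named groups around the two halves
grouped_pattern = r'(?P<user>[A-Z0-9._%+-]+)@(?P<domain>[A-Z0-9.-]+\.[A-Z]{2,4})'

# extract returns a DataFrame with one column per group; missing values stay missing
parts = data.str.extract(grouped_pattern, flags=re.IGNORECASE)
print(parts)
```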

pandas also offers more standard string operations. For example, we can check if a string is contained within a data row:

In [None]:
data.str.contains("gmail")
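
`str.contains` treats its argument as a regex by default, and its `na` parameter controls what is returned for missing values. A small sketch, where the pattern and the name `is_webmail` are made up for illustration:

```python
import numpy as np
import pandas as pd

data = pd.Series({'Dave': 'Daves email dave@google.com', 'Steve': 'Steves email steve@gmail.com',
                  'Rob': 'Robs rob@gmail.com', 'Wes': np.nan})

# (?:...) groups the alternatives without capturing them;
# na=False reports missing values as False instead of NaN
is_webmail = data.str.contains(r"@(?:gmail|yahoo)\.com", na=False)
print(is_webmail)

# The boolean Series can be used directly as a filter
print(data[is_webmail])
```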

Many more of these methods exist:
    
    
| Methods | Description |
| -- | -- |
| cat | Concatenate strings element-wise with optional delimiter |
| contains | Return a boolean array indicating whether each string contains the pattern/regex |
| count | Count occurrences of a pattern |
| extract | Use a regex with groups to extract one or more strings from a Series |
| findall | Compute list of all occurrences of pattern/regex for each string |
| get | Index into each element |
| isdecimal | Checks if all characters in the string are decimal characters |
| isdigit | Checks if all characters in the string are digits |
| islower | Checks if the string is in lower case |
| isupper | Checks if the string is in upper case |
| join | Join strings in each element of the Series with passed separator |
| len | Compute the length of each string |
| lower, upper | Convert cases |
| match | Return a boolean for each string indicating whether it starts with a match of the regex |
| pad | Adds whitespace to left, right or both sides of strings |
| repeat | Duplicate string values |
| slice | Slice each string in the Series |
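
A few of the methods from the table in action, as a rough sketch on a small made-up Series (the values and variable names are purely illustrative):

```python
import pandas as pd

s = pd.Series(["alpha", "Beta", "gamma"])

lengths = s.str.len()          # length of each string
uppercased = s.str.upper()     # convert to upper case
prefixes = s.str.slice(0, 3)   # first three characters of each string
a_counts = s.str.count("a")    # occurrences of the pattern "a"
initials = s.str.get(0)        # first character of each string

print(pd.DataFrame({"len": lengths, "upper": uppercased, "slice": prefixes,
                    "count_a": a_counts, "get_0": initials}))
```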

### Exercise
There is a dataset `data/yob2012.txt` which lists the number of newborns registered in 2012 with their names and sex. Using regular expressions, extract all names from the dataset which start with letters A to C. How many names did you find?

Note: `^` anchors a pattern to the start of the string, so it acts as the "starting with" operator in regular expressions.
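
To illustrate the effect of `^` on something unrelated to the exercise, here is a small sketch with made-up words; the names `anywhere` and `at_start` are illustrative.

```python
import re

words = ["edinburgh", "medal", "Eden"]

# Without ^ the pattern may match anywhere inside the word
anywhere = [w for w in words if re.search(r"ed", w, flags=re.IGNORECASE)]

# With ^ the match has to begin at the first character
at_start = [w for w in words if re.search(r"^ed", w, flags=re.IGNORECASE)]

print(anywhere)
print(at_start)
```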