"data = pd.DataFrame([1., -999., 2., -999., 3., 4., -999, -999, 7.])\n",
"data"
]
},
{
"cell_type": "code",
"source": [
"data.replace(-999, np.nan)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Detecting and Filtering Outliers\n",
"Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:"
]
},
{
"cell_type": "code",
"source": [
"data = pd.DataFrame(np.random.randn(1000, 4))\n",
"data.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Suppose you now want to find the values in one of the columns whose absolute value exceeds 3:"
]
},
{
"cell_type": "code",
"source": [
"col = data[2]\n",
"col[np.abs(col) > 3]"
]
},
{
"cell_type": "code",
"source": [
"# cap values outside the interval [-3, 3] at -3 or 3\n",
"data[np.abs(data) > 3] = np.sign(data) * 3\n",
"data.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Boolean indexing\n",
"What we did above is called boolean indexing.\n",
"\n",
"We generated a boolean array and then used it to select particular values of the data.\n",
"\n",
"Let's have a look at a simpler example."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# generate a random dataframe\n",
"df = pd.DataFrame(np.random.randn(3, 3))\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# see which values are bigger than 0\n",
"df > 0"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We get `True` where the condition holds and `False` otherwise. Now we can use that boolean dataframe to index into the original dataframe:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# set all positive values to 0\n",
"df[df>0] = 0\n",
"df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise 9\n",
"Let's load our file with home prices again and filter the homes based on our preferences:\n",
"1. Load the file `data/homes.csv`.\n",
"2. The data contains some duplicates. Filter them out.\n",
"3. Let's say that the most we can spend on a house is £150. Keep only houses with a **sell**ing price less than £150 and remove the rest.\n",
"4. Select only houses that have 4 or more bedrooms.\n",
"5. Select only houses that have 3 or more baths.\n",
"\n",
"You should end up with only 2 houses."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",