Skip to content
Snippets Groups Projects
python-data-4-pandas.ipynb 30.5 KiB
Newer Older
    "data = pd.DataFrame([1., -999., 2., -999., 3., 4., -999, -999, 7.])\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": null,
   "metadata": {},
ignat's avatar
ignat committed
   "outputs": [],
   "source": [
    "data.replace(-999, np.nan)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Detection and Filtering Outliers\n",
    "Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": null,
   "metadata": {},
ignat's avatar
ignat committed
   "outputs": [],
   "source": [
    "data = pd.DataFrame(np.random.randn(1000, 4))\n",
    "data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
ignat's avatar
ignat committed
    "Suppose you now want to lower all absolute values exceeding 3 from one of the columns"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": null,
   "metadata": {},
ignat's avatar
ignat committed
   "outputs": [],
   "source": [
    "col = data[2]\n",
    "col[np.abs(col) > 3]"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": null,
   "metadata": {},
ignat's avatar
ignat committed
   "outputs": [],
   "source": [
    "data[np.abs(data) > 3] = np.sign(data) * 3\n",
    "data.describe()"
   ]
  },
ignat's avatar
ignat committed
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Boolean indexing\n",
    "What we did above was actually boolean indexing.\n",
    "\n",
    "We generated a boolean array and then use it to access particular values of the array.\n",
    "\n",
    "Let's have a look at a more simple example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# generate a random dataframe\n",
    "df = pd.DataFrame(np.random.randn(3, 3))\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# see which values are bigger than 0\n",
    "df > 0"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We get `True` where the condition held and `False` otherwise. Now we can actually use that dataframe to index into another dataframe:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# set all positive values to 0\n",
    "df[df>0] = 0\n",
    "df"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise 9\n",
    "Let's load again our file with home prices and filter out homes based on our preference:\n",
    "1. Load up the file `data/homes.csv`\n",
    "2. The data contains some duplicates. Filter them out.\n",
    "3. Let's say that the most we can spend on a house is £150. Keep only houses that have a **sell**ing price less than £150 and remove the rest\n",
    "4. Select only houses that have 4 or more bedrooms\n",
    "5. Select only houses that have 3 or more baths\n",
ignat's avatar
ignat committed
    "\n",
    "You should end up with only 2 houses"
ignat's avatar
ignat committed
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
ignat's avatar
ignat committed
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.2"
ignat's avatar
ignat committed
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}