Skip to content
Snippets Groups Projects
python-data-5-pandas.ipynb 207 KiB
Newer Older
5001 5002 5003 5004 5005 5006 5007 5008 5009 5010 5011 5012 5013 5014 5015 5016 5017 5018 5019 5020 5021 5022 5023 5024 5025 5026 5027 5028 5029 5030 5031 5032 5033 5034 5035 5036 5037 5038 5039 5040 5041 5042 5043 5044 5045 5046 5047 5048 5049 5050 5051 5052 5053 5054 5055 5056 5057 5058 5059 5060 5061 5062 5063 5064 5065 5066 5067 5068 5069 5070 5071 5072 5073 5074 5075 5076 5077 5078 5079 5080 5081 5082 5083 5084 5085 5086 5087 5088 5089 5090 5091 5092 5093 5094 5095 5096 5097 5098 5099 5100 5101 5102 5103 5104 5105 5106 5107 5108 5109 5110 5111 5112 5113 5114 5115 5116 5117 5118 5119 5120 5121 5122 5123 5124 5125 5126 5127 5128 5129 5130 5131 5132 5133 5134 5135 5136 5137 5138 5139 5140 5141 5142 5143 5144 5145 5146 5147 5148 5149 5150 5151 5152 5153 5154 5155 5156 5157 5158 5159 5160 5161 5162 5163 5164 5165 5166 5167 5168 5169 5170 5171 5172 5173 5174 5175 5176 5177 5178 5179 5180 5181 5182 5183 5184 5185 5186 5187 5188 5189 5190 5191 5192 5193 5194 5195 5196 5197 5198 5199 5200 5201 5202 5203 5204 5205 5206 5207 5208 5209 5210 5211 5212 5213 5214 5215 5216 5217 5218 5219 5220 5221 5222 5223 5224 5225 5226 5227 5228 5229 5230 5231 5232 5233 5234 5235 5236 5237 5238 5239 5240 5241 5242 5243 5244 5245 5246 5247 5248 5249 5250 5251 5252 5253 5254 5255 5256 5257 5258 5259 5260 5261 5262 5263 5264 5265 5266 5267 5268 5269 5270 5271 5272 5273 5274 5275 5276 5277 5278 5279 5280 5281 5282 5283 5284 5285 5286 5287 5288 5289 5290 5291 5292 5293 5294 5295 5296 5297 5298 5299 5300 5301 5302 5303 5304 5305 5306 5307 5308 5309 5310 5311 5312 5313 5314 5315 5316 5317 5318 5319 5320 5321 5322 5323 5324 5325 5326 5327 5328 5329 5330 5331 5332 5333 5334 5335 5336 5337 5338 5339 5340 5341 5342 5343 5344 5345 5346 5347 5348 5349 5350 5351 5352 5353 5354 5355 5356 5357 5358 5359 5360 5361 5362 5363 5364 5365 5366 5367 5368 5369 5370 5371 5372 5373 5374 5375 5376 5377 5378 5379 5380 5381 5382 5383 5384 5385 5386 5387 5388 5389 5390 5391 5392 5393 5394 5395 5396 5397 5398 5399 5400 5401 5402 5403 5404 5405 5406 5407 5408 5409 5410 5411 5412 5413 5414 5415 5416 5417 5418 5419 5420 5421 5422 5423 5424 5425 5426 5427 5428 5429 5430 5431 5432 5433 5434 5435 5436 5437 5438 5439 5440 5441 5442 5443 5444 5445 5446 5447 5448 5449 5450 5451 5452 5453 5454 5455 5456 5457 5458 5459 5460 5461 5462 5463 5464 5465 5466 5467 5468 5469 5470 5471 5472 5473 5474 5475 5476 5477 5478 5479 5480 5481 5482 5483 5484 5485 5486 5487 5488 5489 5490 5491 5492 5493 5494 5495 5496 5497 5498 5499 5500 5501 5502 5503 5504 5505 5506 5507 5508 5509 5510 5511 5512 5513 5514 5515 5516 5517 5518 5519 5520 5521 5522 5523 5524 5525 5526 5527 5528 5529 5530 5531 5532 5533 5534 5535 5536 5537 5538 5539 5540 5541 5542 5543 5544 5545 5546 5547 5548 5549 5550 5551 5552 5553 5554 5555 5556 5557 5558 5559 5560 5561 5562 5563 5564 5565 5566 5567 5568 5569 5570 5571 5572 5573 5574 5575 5576 5577 5578 5579 5580 5581 5582 5583 5584 5585 5586 5587 5588 5589 5590 5591 5592 5593 5594 5595 5596 5597 5598 5599 5600 5601 5602 5603 5604 5605 5606 5607 5608 5609 5610 5611 5612 5613 5614 5615 5616 5617 5618 5619 5620 5621 5622 5623 5624 5625 5626 5627 5628 5629 5630 5631 5632 5633 5634 5635 5636 5637 5638 5639 5640 5641 5642 5643 5644 5645 5646 5647 5648 5649 5650 5651 5652 5653 5654 5655 5656 5657 5658 5659 5660 5661 5662 5663 5664 5665 5666 5667 5668 5669 5670 5671 5672 5673 5674 5675 5676 5677 5678 5679 5680 5681 5682 5683 5684 5685 5686 5687 5688 5689 5690 5691 5692 5693 5694 5695 5696 5697 5698 5699 5700 5701 5702 5703 5704 5705 5706 5707 5708 5709 5710 5711 5712 5713 5714 5715 5716 5717 5718 5719 5720 5721 5722 5723 5724 5725 5726 5727 5728 5729 5730 5731 5732 5733 5734 5735 5736 5737 5738 5739 5740 5741 5742 5743 5744 5745 5746 5747 5748 5749 5750 5751 5752 5753 5754 5755 5756 5757 5758 5759 5760 5761 5762 5763 5764 5765 5766 5767 5768 5769 5770 5771 5772 5773 5774 5775 5776 5777 5778 5779 5780 5781 5782 5783 5784 5785 5786 5787 5788 5789 5790 5791 5792 5793 5794 5795 5796 5797 5798 5799 5800 5801 5802 5803 5804 5805 5806 5807 5808 5809 5810 5811 5812 5813 5814 5815 5816 5817 5818 5819 5820 5821 5822 5823 5824 5825 5826 5827 5828 5829 5830 5831 5832 5833 5834 5835 5836 5837 5838 5839 5840 5841 5842 5843 5844 5845 5846 5847 5848 5849 5850 5851 5852 5853 5854 5855 5856 5857 5858 5859 5860 5861 5862 5863 5864 5865 5866 5867 5868 5869 5870 5871 5872 5873 5874 5875 5876 5877 5878 5879 5880 5881 5882 5883 5884 5885 5886 5887 5888 5889 5890 5891 5892 5893 5894 5895 5896 5897 5898 5899 5900 5901 5902 5903 5904 5905 5906 5907 5908 5909 5910 5911 5912 5913 5914 5915 5916 5917 5918 5919 5920 5921 5922 5923 5924 5925 5926 5927 5928 5929 5930 5931 5932 5933 5934 5935 5936 5937 5938 5939 5940 5941 5942 5943 5944 5945 5946 5947 5948 5949 5950 5951 5952 5953 5954 5955 5956 5957 5958 5959 5960 5961 5962 5963 5964 5965 5966 5967 5968 5969 5970 5971 5972 5973 5974 5975 5976 5977 5978 5979 5980 5981 5982 5983 5984 5985 5986 5987 5988 5989 5990 5991 5992 5993 5994 5995 5996 5997 5998 5999 6000
       "      <td>December 14, 2000</td>\n",
       "      <td>March 17, 2005</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>554</th>\n",
       "      <td>Bank of Honolulu</td>\n",
       "      <td>Honolulu</td>\n",
       "      <td>HI</td>\n",
       "      <td>21029</td>\n",
       "      <td>Bank of the Orient</td>\n",
       "      <td>October 13, 2000</td>\n",
       "      <td>March 17, 2005</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>555 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                             Bank Name                City  \\\n",
       "0                  Washington Federal Bank for Savings             Chicago   \n",
       "1      The Farmers and Merchants State Bank of Argonia             Argonia   \n",
       "2                                  Fayette County Bank          Saint Elmo   \n",
       "3    Guaranty Bank, (d/b/a BestBank in Georgia & Mi...           Milwaukee   \n",
       "4                                       First NBC Bank         New Orleans   \n",
       "5                                        Proficio Bank  Cottonwood Heights   \n",
       "6                        Seaway Bank and Trust Company             Chicago   \n",
       "7                               Harvest Community Bank          Pennsville   \n",
       "8                                          Allied Bank            Mulberry   \n",
       "9                         The Woodbury Banking Company            Woodbury   \n",
       "10                              First CornerStone Bank     King of Prussia   \n",
       "11                                  Trust Company Bank             Memphis   \n",
       "12                          North Milwaukee State Bank           Milwaukee   \n",
       "13                              Hometown National Bank            Longview   \n",
       "14                                 The Bank of Georgia      Peachtree City   \n",
       "15                                        Premier Bank              Denver   \n",
       "16                                      Edgebrook Bank             Chicago   \n",
       "17                              Doral Bank  En Espanol            San Juan   \n",
       "18                   Capitol City Bank & Trust Company             Atlanta   \n",
       "19                             Highland Community Bank             Chicago   \n",
       "20                    First National Bank of Crestview           Crestview   \n",
       "21                                  Northern Star Bank             Mankato   \n",
       "22              Frontier Bank, FSB D/B/A El Paseo Bank         Palm Desert   \n",
       "23               The National Republic Bank of Chicago             Chicago   \n",
       "24                                      NBRS Financial          Rising Sun   \n",
       "25                               GreenChoice Bank, fsb             Chicago   \n",
       "26                            Eastside Commercial Bank             Conyers   \n",
       "27                              The Freedom State Bank             Freedom   \n",
       "28                                         Valley Bank     Fort Lauderdale   \n",
       "29                                         Valley Bank              Moline   \n",
       "..                                                 ...                 ...   \n",
       "525                                  ANB Financial, NA         Bentonville   \n",
       "526                                          Hume Bank                Hume   \n",
       "527                             Douglass National Bank         Kansas City   \n",
       "528                                  Miami Valley Bank            Lakeview   \n",
       "529                                            NetBank          Alpharetta   \n",
       "530                          Metropolitan Savings Bank          Pittsburgh   \n",
       "531                                    Bank of Ephraim             Ephraim   \n",
       "532                                      Reliance Bank        White Plains   \n",
       "533              Guaranty National Bank of Tallahassee         Tallahassee   \n",
       "534                                Dollar Savings Bank              Newark   \n",
       "535                               Pulaski Savings Bank        Philadelphia   \n",
       "536              First National Bank of Blanchardville      Blanchardville   \n",
       "537                              Southern Pacific Bank            Torrance   \n",
       "538                        Farmers Bank of Cheneyville         Cheneyville   \n",
       "539                                      Bank of Alamo               Alamo   \n",
       "540             AmTrade International Bank  En Espanol             Atlanta   \n",
       "541                     Universal Federal Savings Bank             Chicago   \n",
       "542                       Connecticut Bank of Commerce            Stamford   \n",
       "543                                   New Century Bank     Shelby Township   \n",
       "544                              Net 1st National Bank          Boca Raton   \n",
       "545                                       NextBank, NA             Phoenix   \n",
       "546                           Oakwood Deposit Bank Co.             Oakwood   \n",
       "547                              Bank of Sierra Blanca       Sierra Blanca   \n",
       "548                      Hamilton Bank, NA  En Espanol               Miami   \n",
       "549                             Sinclair National Bank            Gravette   \n",
       "550                                 Superior Bank, FSB            Hinsdale   \n",
       "551                                Malta National Bank               Malta   \n",
       "552                    First Alliance Bank & Trust Co.          Manchester   \n",
       "553                  National State Bank of Metropolis          Metropolis   \n",
       "554                                   Bank of Honolulu            Honolulu   \n",
       "\n",
       "     ST   CERT                Acquiring Institution        Closing Date  \\\n",
       "0    IL  30570                   Royal Savings Bank   December 15, 2017   \n",
       "1    KS  17719                          Conway Bank    October 13, 2017   \n",
       "2    IL   1802            United Fidelity Bank, fsb        May 26, 2017   \n",
       "3    WI  30003  First-Citizens Bank & Trust Company         May 5, 2017   \n",
       "4    LA  58302                         Whitney Bank      April 28, 2017   \n",
       "5    UT  35495                    Cache Valley Bank       March 3, 2017   \n",
       "6    IL  19328                  State Bank of Texas    January 27, 2017   \n",
       "7    NJ  34951  First-Citizens Bank & Trust Company    January 13, 2017   \n",
       "8    AR     91                         Today's Bank  September 23, 2016   \n",
       "9    GA  11297                          United Bank     August 19, 2016   \n",
       "10   PA  35312  First-Citizens Bank & Trust Company         May 6, 2016   \n",
       "11   TN   9956           The Bank of Fayette County      April 29, 2016   \n",
       "12   WI  20364  First-Citizens Bank & Trust Company      March 11, 2016   \n",
       "13   WA  35156                       Twin City Bank     October 2, 2015   \n",
       "14   GA  35259                        Fidelity Bank     October 2, 2015   \n",
       "15   CO  34112            United Fidelity Bank, fsb       July 10, 2015   \n",
       "16   IL  57772             Republic Bank of Chicago         May 8, 2015   \n",
       "17   PR  32102         Banco Popular de Puerto Rico   February 27, 2015   \n",
       "18   GA  33938  First-Citizens Bank & Trust Company   February 13, 2015   \n",
       "19   IL  20290            United Fidelity Bank, fsb    January 23, 2015   \n",
       "20   FL  17557                       First NBC Bank    January 16, 2015   \n",
       "21   MN  34983                            BankVista   December 19, 2014   \n",
       "22   CA  34738    Bank of Southern California, N.A.    November 7, 2014   \n",
       "23   IL    916                  State Bank of Texas    October 24, 2014   \n",
       "24   MD   4862                          Howard Bank    October 17, 2014   \n",
       "25   IL  28462                 Providence Bank, LLC       July 25, 2014   \n",
       "26   GA  58125            Community & Southern Bank       July 18, 2014   \n",
       "27   OK  12483      Alva State Bank & Trust Company       June 27, 2014   \n",
       "28   FL  21793  Landmark Bank, National Association       June 20, 2014   \n",
       "29   IL  10450                  Great Southern Bank       June 20, 2014   \n",
       "..   ..    ...                                  ...                 ...   \n",
       "525  AR  33901       Pulaski Bank and Trust Company         May 9, 2008   \n",
       "526  MO   1971                        Security Bank       March 7, 2008   \n",
       "527  MO  24660       Liberty Bank and Trust Company    January 25, 2008   \n",
       "528  OH  16848         The Citizens Banking Company     October 4, 2007   \n",
       "529  GA  32575                           ING DIRECT  September 28, 2007   \n",
       "530  PA  35353  Allegheny Valley Bank of Pittsburgh    February 2, 2007   \n",
       "531  UT   1249                        Far West Bank       June 25, 2004   \n",
       "532  NY  26778                     Union State Bank      March 19, 2004   \n",
       "533  FL  26838              Hancock Bank of Florida      March 12, 2004   \n",
       "534  NJ  31330                          No Acquirer   February 14, 2004   \n",
       "535  PA  27203                       Earthstar Bank   November 14, 2003   \n",
       "536  WI  11639                        The Park Bank         May 9, 2003   \n",
       "537  CA  27094                            Beal Bank    February 7, 2003   \n",
       "538  LA  16445            Sabine State Bank & Trust   December 17, 2002   \n",
       "539  TN   9961                          No Acquirer    November 8, 2002   \n",
       "540  GA  33784                          No Acquirer  September 30, 2002   \n",
       "541  IL  29355               Chicago Community Bank       June 27, 2002   \n",
       "542  CT  19183                   Hudson United Bank       June 26, 2002   \n",
       "543  MI  34979                          No Acquirer      March 28, 2002   \n",
       "544  FL  26652                       Bank Leumi USA       March 1, 2002   \n",
       "545  AZ  22314                          No Acquirer    February 7, 2002   \n",
       "546  OH   8966       The State Bank & Trust Company    February 1, 2002   \n",
       "547  TX  22002     The Security State Bank of Pecos    January 18, 2002   \n",
       "548  FL  24382     Israel Discount Bank of New York    January 11, 2002   \n",
       "549  AR  34248                   Delta Trust & Bank   September 7, 2001   \n",
       "550  IL  32646                Superior Federal, FSB       July 27, 2001   \n",
       "551  OH   6629                    North Valley Bank         May 3, 2001   \n",
       "552  NH  34264  Southern New Hampshire Bank & Trust    February 2, 2001   \n",
       "553  IL   3815              Banterra Bank of Marion   December 14, 2000   \n",
       "554  HI  21029                   Bank of the Orient    October 13, 2000   \n",
       "\n",
       "           Updated Date  \n",
       "0     February 21, 2018  \n",
       "1     February 21, 2018  \n",
       "2         July 26, 2017  \n",
       "3        March 22, 2018  \n",
       "4      December 5, 2017  \n",
       "5         March 7, 2018  \n",
       "6          May 18, 2017  \n",
       "7          May 18, 2017  \n",
       "8    September 25, 2017  \n",
       "9          June 1, 2017  \n",
       "10    November 13, 2018  \n",
       "11    September 6, 2016  \n",
       "12       March 13, 2017  \n",
       "13    February 19, 2018  \n",
       "14         July 9, 2018  \n",
       "15    February 20, 2018  \n",
       "16        July 12, 2016  \n",
       "17         May 13, 2015  \n",
       "18       April 21, 2015  \n",
       "19    November 15, 2017  \n",
       "20    November 15, 2017  \n",
       "21      January 3, 2018  \n",
       "22    November 10, 2016  \n",
       "23      January 6, 2016  \n",
       "24    February 19, 2018  \n",
       "25    December 12, 2016  \n",
       "26      October 6, 2017  \n",
       "27    February 21, 2018  \n",
       "28    February 14, 2018  \n",
       "29        June 26, 2015  \n",
       "..                  ...  \n",
       "525     August 28, 2012  \n",
       "526     August 28, 2012  \n",
       "527    October 26, 2012  \n",
       "528  September 12, 2016  \n",
       "529     August 28, 2012  \n",
       "530    October 27, 2010  \n",
       "531       April 9, 2008  \n",
       "532       April 9, 2008  \n",
       "533      April 17, 2018  \n",
       "534       April 9, 2008  \n",
       "535     October 6, 2017  \n",
       "536        June 5, 2012  \n",
       "537    October 20, 2008  \n",
       "538    October 20, 2004  \n",
       "539      March 18, 2005  \n",
       "540  September 11, 2006  \n",
       "541     October 6, 2017  \n",
       "542   February 14, 2012  \n",
       "543      March 18, 2005  \n",
       "544       April 9, 2008  \n",
       "545    February 5, 2015  \n",
       "546    October 25, 2012  \n",
       "547    November 6, 2003  \n",
       "548  September 21, 2015  \n",
       "549     October 6, 2017  \n",
       "550     August 19, 2014  \n",
       "551   November 18, 2002  \n",
       "552   February 18, 2003  \n",
       "553      March 17, 2005  \n",
       "554      March 17, 2005  \n",
       "\n",
       "[555 rows x 7 columns]"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "banks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Powerful no? Now let's turn that into an exercise.\n",
    "\n",
    "### Exercise\n",
    "Given the data you just extracted above, can you analyse how many banks have failed per state?\n",
    "\n",
    "Georgia (GA) should be the state with the most failed banks!\n",
    "\n",
    "*Hint: try searching the web for pandas counting occurances* "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Data Cleaning\n",
    "During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transofrming and rearranging. Such tasks are often reported to take **up to 80%** or more of a data analyst's time. Often the way the data is stored in files isn't in the correct format and needs to be modified. Reseraches usually do this on an ad-hoc basis using programming languages like Python.\n",
    "\n",
    "In this chapter we will discuss tools for handling missing data, duplicate data, string manipulation, and some other analytical data transformations.\n",
    "\n",
    "## Handling missing data\n",
    "Mussing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible.\n",
    "\n",
    "In pandas, missing numberic data is represented by `NaN` (Not a Number) and can easily be handled:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     orange\n",
       "1     tomato\n",
       "2        NaN\n",
       "3    avocado\n",
       "dtype: object"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "string_data = pd.Series(['orange', 'tomato', np.nan, 'avocado'])\n",
    "string_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    False\n",
       "1    False\n",
       "2     True\n",
       "3    False\n",
       "dtype: bool"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "string_data.isnull()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Furthermore, the pandas `NaN` is functionally equlevant to the standard Python type `NoneType` which can be defined with `x = None`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0       None\n",
       "1     tomato\n",
       "2        NaN\n",
       "3    avocado\n",
       "dtype: object"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "string_data[0] = None\n",
    "string_data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     True\n",
       "1    False\n",
       "2     True\n",
       "3    False\n",
       "dtype: bool"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "string_data.isnull()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here are some other methods which you can find useful:\n",
    "    \n",
    "| Method | Decription |\n",
    "| -- | -- |\n",
    "| dropna | Filter axis labels based on whether values of each label have missing data|\n",
    "| fillna | Fill in missing data with some vallue |\n",
    "| isnull | Return boolean values indicating which vallues are missin |\n",
    "| notnull | Negation of isnull |\n",
    "\n",
    "### Exercise\n",
    "Remove the missing data below using the appropriate method"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1.0\n",
       "2    3.0\n",
       "3    4.0\n",
       "5    6.0\n",
       "dtype: float64"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.Series([1, None, 3, 4, None, 6])\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`dropna()` by default removes any row/column that has a missing value. What if we want to remove only rows in which all of the data is missing though?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NaN</td>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1    2\n",
       "0  1.0  6.5  3.0\n",
       "1  1.0  NaN  NaN\n",
       "2  NaN  NaN  NaN\n",
       "3  NaN  6.5  3.0"
      ]
     },
     "execution_count": 14,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.DataFrame([[1., 6.5, 3.], [1., None, None],\n",
    "                    [None, None, None], [None, 6.5, 3.]])\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1    2\n",
       "0  1.0  6.5  3.0"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.dropna()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NaN</td>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1    2\n",
       "0  1.0  6.5  3.0\n",
       "1  1.0  NaN  NaN\n",
       "3  NaN  6.5  3.0"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.dropna(how=\"all\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Exercise\n",
    "That's fine if we want to remove missing data, what if we want to fill in missing data? Do you know of a way? Try to fill in all of the missing values from the data below with **0s**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.DataFrame([[1., 6.5, 3.], [2., None, None],\n",
    "                    [None, None, None], [None, 1.5, 9.]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "pandas also allows us to interpolate the data instead of just filling it with a constant. The easiest way to do that is shown below, but there are more complex ones that are not covered in this course."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2.0</td>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2.0</td>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2.0</td>\n",
       "      <td>1.5</td>\n",
       "      <td>9.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1    2\n",
       "0  1.0  6.5  3.0\n",
       "1  2.0  6.5  3.0\n",
       "2  2.0  6.5  3.0\n",
       "3  2.0  1.5  9.0"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.fillna(method=\"ffill\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want you can explore the other capabilities of [`fillna`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Data Transformation\n",
    "### Removing duplicates\n",
    "Duplicate data can be a serious issue, luckily pandas offers a simple way to remove duplicates"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   0\n",
       "0  1\n",
       "1  2\n",
       "2  3\n",
       "3  4\n",
       "4  3\n",
       "5  2\n",
       "6  1"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.DataFrame([1, 2, 3, 4, 3, 2, 1])\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   0\n",
       "0  1\n",
       "1  2\n",
       "2  3\n",
       "3  4"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.drop_duplicates()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also select which rows to keep"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>3</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>5</th>\n",
       "      <td>2</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6</th>\n",
       "      <td>1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "   0\n",
       "3  4\n",
       "4  3\n",
       "5  2\n",
       "6  1"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.drop_duplicates(keep=\"last\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Replacing data\n",
    "You've already seen how you can fill in missing data with `fillna`. That is actually a special case of more general value replacement. That is done via the `replace` method.\n",
    "\n",
    "Let's consider an example where the dataset given to us had `-999` as sentinel values for missing data instead of `NaN`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",