Skip to content
Snippets Groups Projects
python-data-5-pandas.ipynb 182 KiB
Newer Older
5001 5002 5003 5004 5005 5006 5007 5008 5009 5010 5011 5012 5013 5014 5015 5016 5017 5018 5019 5020 5021 5022 5023 5024 5025 5026 5027 5028 5029 5030 5031 5032 5033 5034 5035 5036 5037 5038 5039 5040 5041 5042 5043 5044 5045 5046 5047 5048 5049 5050 5051 5052 5053 5054 5055 5056 5057 5058 5059 5060 5061 5062 5063 5064 5065 5066 5067 5068 5069 5070 5071 5072 5073 5074 5075 5076 5077 5078 5079 5080 5081 5082 5083 5084 5085 5086 5087 5088 5089 5090 5091 5092 5093 5094 5095 5096 5097 5098 5099 5100 5101 5102 5103 5104 5105 5106 5107 5108 5109 5110 5111 5112 5113 5114 5115 5116 5117 5118 5119 5120 5121 5122 5123 5124 5125 5126 5127 5128 5129 5130 5131 5132 5133 5134 5135 5136 5137 5138 5139 5140 5141 5142 5143 5144 5145 5146 5147 5148 5149 5150 5151 5152 5153 5154 5155 5156 5157 5158 5159 5160 5161 5162 5163 5164 5165 5166 5167 5168 5169 5170 5171 5172 5173 5174 5175 5176 5177 5178 5179 5180 5181 5182 5183 5184 5185 5186 5187 5188 5189 5190 5191 5192 5193 5194 5195 5196 5197 5198 5199 5200 5201 5202 5203 5204 5205 5206 5207 5208 5209 5210 5211 5212 5213 5214 5215 5216 5217 5218 5219 5220 5221 5222 5223 5224 5225 5226 5227 5228 5229 5230 5231 5232 5233 5234 5235 5236 5237 5238 5239 5240 5241 5242 5243 5244 5245 5246 5247 5248 5249 5250 5251 5252 5253 5254 5255 5256 5257 5258 5259 5260 5261 5262 5263 5264 5265 5266 5267 5268 5269 5270 5271 5272 5273 5274 5275 5276 5277 5278 5279 5280 5281 5282 5283 5284 5285 5286 5287 5288 5289 5290 5291 5292 5293 5294 5295 5296 5297 5298 5299 5300
       "      <td>TN</td>\n",
       "      <td>9961</td>\n",
       "      <td>No Acquirer</td>\n",
       "      <td>November 8, 2002</td>\n",
       "      <td>March 18, 2005</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>540</th>\n",
       "      <td>AmTrade International Bank  En Espanol</td>\n",
       "      <td>Atlanta</td>\n",
       "      <td>GA</td>\n",
       "      <td>33784</td>\n",
       "      <td>No Acquirer</td>\n",
       "      <td>September 30, 2002</td>\n",
       "      <td>September 11, 2006</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>541</th>\n",
       "      <td>Universal Federal Savings Bank</td>\n",
       "      <td>Chicago</td>\n",
       "      <td>IL</td>\n",
       "      <td>29355</td>\n",
       "      <td>Chicago Community Bank</td>\n",
       "      <td>June 27, 2002</td>\n",
       "      <td>October 6, 2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>542</th>\n",
       "      <td>Connecticut Bank of Commerce</td>\n",
       "      <td>Stamford</td>\n",
       "      <td>CT</td>\n",
       "      <td>19183</td>\n",
       "      <td>Hudson United Bank</td>\n",
       "      <td>June 26, 2002</td>\n",
       "      <td>February 14, 2012</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>543</th>\n",
       "      <td>New Century Bank</td>\n",
       "      <td>Shelby Township</td>\n",
       "      <td>MI</td>\n",
       "      <td>34979</td>\n",
       "      <td>No Acquirer</td>\n",
       "      <td>March 28, 2002</td>\n",
       "      <td>March 18, 2005</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>544</th>\n",
       "      <td>Net 1st National Bank</td>\n",
       "      <td>Boca Raton</td>\n",
       "      <td>FL</td>\n",
       "      <td>26652</td>\n",
       "      <td>Bank Leumi USA</td>\n",
       "      <td>March 1, 2002</td>\n",
       "      <td>April 9, 2008</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>545</th>\n",
       "      <td>NextBank, NA</td>\n",
       "      <td>Phoenix</td>\n",
       "      <td>AZ</td>\n",
       "      <td>22314</td>\n",
       "      <td>No Acquirer</td>\n",
       "      <td>February 7, 2002</td>\n",
       "      <td>February 5, 2015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>546</th>\n",
       "      <td>Oakwood Deposit Bank Co.</td>\n",
       "      <td>Oakwood</td>\n",
       "      <td>OH</td>\n",
       "      <td>8966</td>\n",
       "      <td>The State Bank &amp; Trust Company</td>\n",
       "      <td>February 1, 2002</td>\n",
       "      <td>October 25, 2012</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>547</th>\n",
       "      <td>Bank of Sierra Blanca</td>\n",
       "      <td>Sierra Blanca</td>\n",
       "      <td>TX</td>\n",
       "      <td>22002</td>\n",
       "      <td>The Security State Bank of Pecos</td>\n",
       "      <td>January 18, 2002</td>\n",
       "      <td>November 6, 2003</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>548</th>\n",
       "      <td>Hamilton Bank, NA  En Espanol</td>\n",
       "      <td>Miami</td>\n",
       "      <td>FL</td>\n",
       "      <td>24382</td>\n",
       "      <td>Israel Discount Bank of New York</td>\n",
       "      <td>January 11, 2002</td>\n",
       "      <td>September 21, 2015</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>549</th>\n",
       "      <td>Sinclair National Bank</td>\n",
       "      <td>Gravette</td>\n",
       "      <td>AR</td>\n",
       "      <td>34248</td>\n",
       "      <td>Delta Trust &amp; Bank</td>\n",
       "      <td>September 7, 2001</td>\n",
       "      <td>October 6, 2017</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>550</th>\n",
       "      <td>Superior Bank, FSB</td>\n",
       "      <td>Hinsdale</td>\n",
       "      <td>IL</td>\n",
       "      <td>32646</td>\n",
       "      <td>Superior Federal, FSB</td>\n",
       "      <td>July 27, 2001</td>\n",
       "      <td>August 19, 2014</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>551</th>\n",
       "      <td>Malta National Bank</td>\n",
       "      <td>Malta</td>\n",
       "      <td>OH</td>\n",
       "      <td>6629</td>\n",
       "      <td>North Valley Bank</td>\n",
       "      <td>May 3, 2001</td>\n",
       "      <td>November 18, 2002</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>552</th>\n",
       "      <td>First Alliance Bank &amp; Trust Co.</td>\n",
       "      <td>Manchester</td>\n",
       "      <td>NH</td>\n",
       "      <td>34264</td>\n",
       "      <td>Southern New Hampshire Bank &amp; Trust</td>\n",
       "      <td>February 2, 2001</td>\n",
       "      <td>February 18, 2003</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>553</th>\n",
       "      <td>National State Bank of Metropolis</td>\n",
       "      <td>Metropolis</td>\n",
       "      <td>IL</td>\n",
       "      <td>3815</td>\n",
       "      <td>Banterra Bank of Marion</td>\n",
       "      <td>December 14, 2000</td>\n",
       "      <td>March 17, 2005</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>554</th>\n",
       "      <td>Bank of Honolulu</td>\n",
       "      <td>Honolulu</td>\n",
       "      <td>HI</td>\n",
       "      <td>21029</td>\n",
       "      <td>Bank of the Orient</td>\n",
       "      <td>October 13, 2000</td>\n",
       "      <td>March 17, 2005</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>555 rows × 7 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                                             Bank Name                City  \\\n",
       "0                  Washington Federal Bank for Savings             Chicago   \n",
       "1      The Farmers and Merchants State Bank of Argonia             Argonia   \n",
       "2                                  Fayette County Bank          Saint Elmo   \n",
       "3    Guaranty Bank, (d/b/a BestBank in Georgia & Mi...           Milwaukee   \n",
       "4                                       First NBC Bank         New Orleans   \n",
       "5                                        Proficio Bank  Cottonwood Heights   \n",
       "6                        Seaway Bank and Trust Company             Chicago   \n",
       "7                               Harvest Community Bank          Pennsville   \n",
       "8                                          Allied Bank            Mulberry   \n",
       "9                         The Woodbury Banking Company            Woodbury   \n",
       "10                              First CornerStone Bank     King of Prussia   \n",
       "11                                  Trust Company Bank             Memphis   \n",
       "12                          North Milwaukee State Bank           Milwaukee   \n",
       "13                              Hometown National Bank            Longview   \n",
       "14                                 The Bank of Georgia      Peachtree City   \n",
       "15                                        Premier Bank              Denver   \n",
       "16                                      Edgebrook Bank             Chicago   \n",
       "17                              Doral Bank  En Espanol            San Juan   \n",
       "18                   Capitol City Bank & Trust Company             Atlanta   \n",
       "19                             Highland Community Bank             Chicago   \n",
       "20                    First National Bank of Crestview           Crestview   \n",
       "21                                  Northern Star Bank             Mankato   \n",
       "22              Frontier Bank, FSB D/B/A El Paseo Bank         Palm Desert   \n",
       "23               The National Republic Bank of Chicago             Chicago   \n",
       "24                                      NBRS Financial          Rising Sun   \n",
       "25                               GreenChoice Bank, fsb             Chicago   \n",
       "26                            Eastside Commercial Bank             Conyers   \n",
       "27                              The Freedom State Bank             Freedom   \n",
       "28                                         Valley Bank     Fort Lauderdale   \n",
       "29                                         Valley Bank              Moline   \n",
       "..                                                 ...                 ...   \n",
       "525                                  ANB Financial, NA         Bentonville   \n",
       "526                                          Hume Bank                Hume   \n",
       "527                             Douglass National Bank         Kansas City   \n",
       "528                                  Miami Valley Bank            Lakeview   \n",
       "529                                            NetBank          Alpharetta   \n",
       "530                          Metropolitan Savings Bank          Pittsburgh   \n",
       "531                                    Bank of Ephraim             Ephraim   \n",
       "532                                      Reliance Bank        White Plains   \n",
       "533              Guaranty National Bank of Tallahassee         Tallahassee   \n",
       "534                                Dollar Savings Bank              Newark   \n",
       "535                               Pulaski Savings Bank        Philadelphia   \n",
       "536              First National Bank of Blanchardville      Blanchardville   \n",
       "537                              Southern Pacific Bank            Torrance   \n",
       "538                        Farmers Bank of Cheneyville         Cheneyville   \n",
       "539                                      Bank of Alamo               Alamo   \n",
       "540             AmTrade International Bank  En Espanol             Atlanta   \n",
       "541                     Universal Federal Savings Bank             Chicago   \n",
       "542                       Connecticut Bank of Commerce            Stamford   \n",
       "543                                   New Century Bank     Shelby Township   \n",
       "544                              Net 1st National Bank          Boca Raton   \n",
       "545                                       NextBank, NA             Phoenix   \n",
       "546                           Oakwood Deposit Bank Co.             Oakwood   \n",
       "547                              Bank of Sierra Blanca       Sierra Blanca   \n",
       "548                      Hamilton Bank, NA  En Espanol               Miami   \n",
       "549                             Sinclair National Bank            Gravette   \n",
       "550                                 Superior Bank, FSB            Hinsdale   \n",
       "551                                Malta National Bank               Malta   \n",
       "552                    First Alliance Bank & Trust Co.          Manchester   \n",
       "553                  National State Bank of Metropolis          Metropolis   \n",
       "554                                   Bank of Honolulu            Honolulu   \n",
       "\n",
       "     ST   CERT                Acquiring Institution        Closing Date  \\\n",
       "0    IL  30570                   Royal Savings Bank   December 15, 2017   \n",
       "1    KS  17719                          Conway Bank    October 13, 2017   \n",
       "2    IL   1802            United Fidelity Bank, fsb        May 26, 2017   \n",
       "3    WI  30003  First-Citizens Bank & Trust Company         May 5, 2017   \n",
       "4    LA  58302                         Whitney Bank      April 28, 2017   \n",
       "5    UT  35495                    Cache Valley Bank       March 3, 2017   \n",
       "6    IL  19328                  State Bank of Texas    January 27, 2017   \n",
       "7    NJ  34951  First-Citizens Bank & Trust Company    January 13, 2017   \n",
       "8    AR     91                         Today's Bank  September 23, 2016   \n",
       "9    GA  11297                          United Bank     August 19, 2016   \n",
       "10   PA  35312  First-Citizens Bank & Trust Company         May 6, 2016   \n",
       "11   TN   9956           The Bank of Fayette County      April 29, 2016   \n",
       "12   WI  20364  First-Citizens Bank & Trust Company      March 11, 2016   \n",
       "13   WA  35156                       Twin City Bank     October 2, 2015   \n",
       "14   GA  35259                        Fidelity Bank     October 2, 2015   \n",
       "15   CO  34112            United Fidelity Bank, fsb       July 10, 2015   \n",
       "16   IL  57772             Republic Bank of Chicago         May 8, 2015   \n",
       "17   PR  32102         Banco Popular de Puerto Rico   February 27, 2015   \n",
       "18   GA  33938  First-Citizens Bank & Trust Company   February 13, 2015   \n",
       "19   IL  20290            United Fidelity Bank, fsb    January 23, 2015   \n",
       "20   FL  17557                       First NBC Bank    January 16, 2015   \n",
       "21   MN  34983                            BankVista   December 19, 2014   \n",
       "22   CA  34738    Bank of Southern California, N.A.    November 7, 2014   \n",
       "23   IL    916                  State Bank of Texas    October 24, 2014   \n",
       "24   MD   4862                          Howard Bank    October 17, 2014   \n",
       "25   IL  28462                 Providence Bank, LLC       July 25, 2014   \n",
       "26   GA  58125            Community & Southern Bank       July 18, 2014   \n",
       "27   OK  12483      Alva State Bank & Trust Company       June 27, 2014   \n",
       "28   FL  21793  Landmark Bank, National Association       June 20, 2014   \n",
       "29   IL  10450                  Great Southern Bank       June 20, 2014   \n",
       "..   ..    ...                                  ...                 ...   \n",
       "525  AR  33901       Pulaski Bank and Trust Company         May 9, 2008   \n",
       "526  MO   1971                        Security Bank       March 7, 2008   \n",
       "527  MO  24660       Liberty Bank and Trust Company    January 25, 2008   \n",
       "528  OH  16848         The Citizens Banking Company     October 4, 2007   \n",
       "529  GA  32575                           ING DIRECT  September 28, 2007   \n",
       "530  PA  35353  Allegheny Valley Bank of Pittsburgh    February 2, 2007   \n",
       "531  UT   1249                        Far West Bank       June 25, 2004   \n",
       "532  NY  26778                     Union State Bank      March 19, 2004   \n",
       "533  FL  26838              Hancock Bank of Florida      March 12, 2004   \n",
       "534  NJ  31330                          No Acquirer   February 14, 2004   \n",
       "535  PA  27203                       Earthstar Bank   November 14, 2003   \n",
       "536  WI  11639                        The Park Bank         May 9, 2003   \n",
       "537  CA  27094                            Beal Bank    February 7, 2003   \n",
       "538  LA  16445            Sabine State Bank & Trust   December 17, 2002   \n",
       "539  TN   9961                          No Acquirer    November 8, 2002   \n",
       "540  GA  33784                          No Acquirer  September 30, 2002   \n",
       "541  IL  29355               Chicago Community Bank       June 27, 2002   \n",
       "542  CT  19183                   Hudson United Bank       June 26, 2002   \n",
       "543  MI  34979                          No Acquirer      March 28, 2002   \n",
       "544  FL  26652                       Bank Leumi USA       March 1, 2002   \n",
       "545  AZ  22314                          No Acquirer    February 7, 2002   \n",
       "546  OH   8966       The State Bank & Trust Company    February 1, 2002   \n",
       "547  TX  22002     The Security State Bank of Pecos    January 18, 2002   \n",
       "548  FL  24382     Israel Discount Bank of New York    January 11, 2002   \n",
       "549  AR  34248                   Delta Trust & Bank   September 7, 2001   \n",
       "550  IL  32646                Superior Federal, FSB       July 27, 2001   \n",
       "551  OH   6629                    North Valley Bank         May 3, 2001   \n",
       "552  NH  34264  Southern New Hampshire Bank & Trust    February 2, 2001   \n",
       "553  IL   3815              Banterra Bank of Marion   December 14, 2000   \n",
       "554  HI  21029                   Bank of the Orient    October 13, 2000   \n",
       "\n",
       "           Updated Date  \n",
       "0     February 21, 2018  \n",
       "1     February 21, 2018  \n",
       "2         July 26, 2017  \n",
       "3        March 22, 2018  \n",
       "4      December 5, 2017  \n",
       "5         March 7, 2018  \n",
       "6          May 18, 2017  \n",
       "7          May 18, 2017  \n",
       "8    September 25, 2017  \n",
       "9          June 1, 2017  \n",
       "10    November 13, 2018  \n",
ignat's avatar
ignat committed
       "11   September 14, 2018  \n",
       "12       March 13, 2017  \n",
       "13    February 19, 2018  \n",
       "14         July 9, 2018  \n",
       "15    February 20, 2018  \n",
       "16        July 12, 2016  \n",
       "17         May 13, 2015  \n",
       "18       April 21, 2015  \n",
       "19    November 15, 2017  \n",
       "20    November 15, 2017  \n",
       "21      January 3, 2018  \n",
       "22    November 10, 2016  \n",
       "23      January 6, 2016  \n",
       "24    February 19, 2018  \n",
       "25    December 12, 2016  \n",
       "26      October 6, 2017  \n",
       "27    February 21, 2018  \n",
       "28    February 14, 2018  \n",
       "29        June 26, 2015  \n",
       "..                  ...  \n",
       "525     August 28, 2012  \n",
       "526     August 28, 2012  \n",
       "527    October 26, 2012  \n",
       "528  September 12, 2016  \n",
       "529     August 28, 2012  \n",
       "530    October 27, 2010  \n",
       "531       April 9, 2008  \n",
       "532       April 9, 2008  \n",
       "533      April 17, 2018  \n",
       "534       April 9, 2008  \n",
       "535     October 6, 2017  \n",
       "536        June 5, 2012  \n",
       "537    October 20, 2008  \n",
       "538    October 20, 2004  \n",
       "539      March 18, 2005  \n",
       "540  September 11, 2006  \n",
       "541     October 6, 2017  \n",
       "542   February 14, 2012  \n",
       "543      March 18, 2005  \n",
       "544       April 9, 2008  \n",
       "545    February 5, 2015  \n",
       "546    October 25, 2012  \n",
       "547    November 6, 2003  \n",
       "548  September 21, 2015  \n",
       "549     October 6, 2017  \n",
       "550     August 19, 2014  \n",
       "551   November 18, 2002  \n",
       "552   February 18, 2003  \n",
       "553      March 17, 2005  \n",
       "554      March 17, 2005  \n",
       "\n",
       "[555 rows x 7 columns]"
      ]
     },
ignat's avatar
ignat committed
     "execution_count": 64,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "banks"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Powerful no? Now let's turn that into an exercise.\n",
    "\n",
ignat's avatar
ignat committed
    "### Exercise 9\n",
    "Given the data you just extracted above, can you analyse how many banks have failed per state?\n",
    "\n",
    "Georgia (GA) should be the state with the most failed banks!\n",
    "\n",
ignat's avatar
ignat committed
    "*Hint: try searching the web for pandas counting occurrences* "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
ignat's avatar
ignat committed
    "# Data Cleaning <a name=\"cleaning\"></a>\n",
    "While doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming and rearranging. Such tasks are often reported to take **up to 80%** or more of a data analyst's time. Often the way the data is stored in files isn't in the correct format and needs to be modified. Researchers usually do this on an ad-hoc basis using programming languages like Python.\n",
    "\n",
ignat's avatar
ignat committed
    "In this chapter, we will discuss tools for handling missing data, duplicate data, string manipulation, and some other analytical data transformations.\n",
    "\n",
ignat's avatar
ignat committed
    "## Handling missing data <a name=\"missing\"></a>\n",
    "Mussing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible.\n",
    "\n",
ignat's avatar
ignat committed
    "In pandas, missing numeric data is represented by `NaN` (Not a Number) and can easily be handled:"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": 65,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     orange\n",
       "1     tomato\n",
       "2        NaN\n",
       "3    avocado\n",
       "dtype: object"
      ]
     },
ignat's avatar
ignat committed
     "execution_count": 65,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "string_data = pd.Series(['orange', 'tomato', np.nan, 'avocado'])\n",
    "string_data"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": 66,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    False\n",
       "1    False\n",
       "2     True\n",
       "3    False\n",
       "dtype: bool"
      ]
     },
ignat's avatar
ignat committed
     "execution_count": 66,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "string_data.isnull()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Furthermore, the pandas `NaN` is functionally equlevant to the standard Python type `NoneType` which can be defined with `x = None`."
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": 67,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0       None\n",
       "1     tomato\n",
       "2        NaN\n",
       "3    avocado\n",
       "dtype: object"
      ]
     },
ignat's avatar
ignat committed
     "execution_count": 67,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "string_data[0] = None\n",
    "string_data"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": 68,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0     True\n",
       "1    False\n",
       "2     True\n",
       "3    False\n",
       "dtype: bool"
      ]
     },
ignat's avatar
ignat committed
     "execution_count": 68,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "string_data.isnull()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Here are some other methods which you can find useful:\n",
    "    \n",
ignat's avatar
ignat committed
    "| Method | Description |\n",
    "| -- | -- |\n",
ignat's avatar
ignat committed
    "| dropna | Filter axis labels based on whether the values of each label have missing data|\n",
    "| fillna | Fill in missing data with some value |\n",
    "| isnull | Return boolean values indicating which values are missing |\n",
    "| notnull | Negation of isnull |\n",
    "\n",
ignat's avatar
ignat committed
    "### Exercise 10\n",
    "Remove the missing data below using the appropriate method"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": 69,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "0    1.0\n",
ignat's avatar
ignat committed
       "1    NaN\n",
       "2    3.0\n",
       "3    4.0\n",
ignat's avatar
ignat committed
       "4    NaN\n",
       "5    6.0\n",
       "dtype: float64"
      ]
     },
ignat's avatar
ignat committed
     "execution_count": 69,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
ignat's avatar
ignat committed
    "data = pd.Series([1, None, 3, 4, None, 6])\n",
    "data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`dropna()` by default removes any row/column that has a missing value. What if we want to remove only rows in which all of the data is missing though?"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": 70,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>1.0</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>NaN</td>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1    2\n",
       "0  1.0  6.5  3.0\n",
       "1  1.0  NaN  NaN\n",
       "2  NaN  NaN  NaN\n",
       "3  NaN  6.5  3.0"
      ]
     },
ignat's avatar
ignat committed
     "execution_count": 70,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data = pd.DataFrame([[1., 6.5, 3.], [1., None, None],\n",
    "                    [None, None, None], [None, 6.5, 3.]])\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": 71,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>0</th>\n",
       "      <th>1</th>\n",
       "      <th>2</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1.0</td>\n",
       "      <td>6.5</td>\n",
       "      <td>3.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "     0    1    2\n",
       "0  1.0  6.5  3.0"
      ]
     },
ignat's avatar
ignat committed
     "execution_count": 71,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "data.dropna()"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": null,
   "metadata": {},
ignat's avatar
ignat committed
   "outputs": [],
   "source": [
    "data.dropna(how=\"all\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
ignat's avatar
ignat committed
    "### Exercise 11\n",
    "That's fine if we want to remove missing data, what if we want to fill in missing data? Do you know of a way? Try to fill in all of the missing values from the data below with **0s**"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data = pd.DataFrame([[1., 6.5, 3.], [2., None, None],\n",
    "                    [None, None, None], [None, 1.5, 9.]])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "\n",
    "pandas also allows us to interpolate the data instead of just filling it with a constant. The easiest way to do that is shown below, but there are more complex ones that are not covered in this course."
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": null,
   "metadata": {},
ignat's avatar
ignat committed
   "outputs": [],
   "source": [
    "data.fillna(method=\"ffill\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you want you can explore the other capabilities of [`fillna`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
ignat's avatar
ignat committed
    "## Data Transformation  <a name=\"transformation\"></a>\n",
    "### Removing duplicates\n",
    "Duplicate data can be a serious issue, luckily pandas offers a simple way to remove duplicates"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": null,
   "metadata": {},
ignat's avatar
ignat committed
   "outputs": [],
   "source": [
    "data = pd.DataFrame([1, 2, 3, 4, 3, 2, 1])\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": null,
   "metadata": {},
ignat's avatar
ignat committed
   "outputs": [],
   "source": [
    "data.drop_duplicates()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also select which rows to keep"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": null,
   "metadata": {},
ignat's avatar
ignat committed
   "outputs": [],
   "source": [
    "data.drop_duplicates(keep=\"last\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
ignat's avatar
ignat committed
    "### Replacing data\n",
    "You've already seen how you can fill in missing data with `fillna`. That is actually a special case of more general value replacement. That is done via the `replace` method.\n",
    "\n",
    "Let's consider an example where the dataset given to us had `-999` as sentinel values for missing data instead of `NaN`."
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": null,
   "metadata": {},
ignat's avatar
ignat committed
   "outputs": [],
   "source": [
    "data = pd.DataFrame([1., -999., 2., -999., 3., 4., -999, -999, 7.])\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": null,
   "metadata": {},
ignat's avatar
ignat committed
   "outputs": [],
   "source": [
    "data.replace(-999, np.nan)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Renaming axis indexes\n",
    "Similar to `replace` you can also rename the labels of your axis"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": null,
   "metadata": {},
ignat's avatar
ignat committed
   "outputs": [],
   "source": [
    "data = pd.DataFrame(np.arange(12).reshape((3, 4)),\n",
    "                    index=['Edinburgh', 'Glasgow', 'Aberdeen'])\n",
    "data"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": null,
   "metadata": {},
ignat's avatar
ignat committed
   "outputs": [],
   "source": [
    "# create a map using a standard Python dictionary\n",
    "mapping = { 0 : \"one\",\n",
    "            1 : \"two\",\n",
    "            2 : \"three\",\n",
    "            3 : \"four\"}\n",
    "\n",
    "# now rename the columns\n",
    "data.rename(columns=mapping)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Rows can be renamed in a similar fashion"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Detection and Filtering Outliers\n",
    "Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": null,
   "metadata": {},
ignat's avatar
ignat committed
   "outputs": [],
   "source": [
    "data = pd.DataFrame(np.random.randn(1000, 4))\n",
    "data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
ignat's avatar
ignat committed
    "Suppose you now want to lower all absolute values exceeding 3 from one of the columns"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": null,
   "metadata": {},
ignat's avatar
ignat committed
   "outputs": [],
   "source": [
    "col = data[2]\n",
    "col[np.abs(col) > 3]"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": null,
   "metadata": {},
ignat's avatar
ignat committed
   "outputs": [],
   "source": [
    "data[np.abs(data) > 3] = np.sign(data) * 3\n",
    "data.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Permutation and Random Sampling"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Permuting (randomly reordering) of rows in pandas is easy to do using the `numpy.random.permutation` function. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering:"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": null,
   "metadata": {},
ignat's avatar
ignat committed
   "outputs": [],
   "source": [
    "df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))\n",
    "df"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": null,
   "metadata": {},
ignat's avatar
ignat committed
   "outputs": [],
   "source": [
    "# generate random order\n",
    "sampler = np.random.permutation(5)\n",
ignat's avatar
ignat committed
    "sampler"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.take(sampler)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To select a random subset without replacement, you can use the sample method:"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": null,
   "metadata": {},
ignat's avatar
ignat committed
   "outputs": [],
   "source": [
    "df.sample(n=3)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
ignat's avatar
ignat committed
    "# String manipulation <a name=\"strings\"></a>\n",
    "Python has long been popular for its raw data manipulation in part due to its ease of use for string and text processing. Most text operations are made simple with the string object's built-in methods. For more complex pattern matching and text manipulations, regular expressions may be needed."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Basics\n",
    "Let's refresh what normal `str` (String objects) are capable of in Python"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": null,
   "metadata": {},
ignat's avatar
ignat committed
   "outputs": [],
   "source": [
    "# complex strings can be broken into small bits\n",
    "val = \"Edinburgh is great\"\n",
    "val.split(\" \")"
   ]
  },
  {
   "cell_type": "code",
ignat's avatar
ignat committed
   "execution_count": null,
   "metadata": {},