" <td>TN</td>\n",
" <td>9961</td>\n",
" <td>No Acquirer</td>\n",
" <td>November 8, 2002</td>\n",
" <td>March 18, 2005</td>\n",
" </tr>\n",
" <tr>\n",
" <th>540</th>\n",
" <td>AmTrade International Bank En Espanol</td>\n",
" <td>Atlanta</td>\n",
" <td>GA</td>\n",
" <td>33784</td>\n",
" <td>No Acquirer</td>\n",
" <td>September 30, 2002</td>\n",
" <td>September 11, 2006</td>\n",
" </tr>\n",
" <tr>\n",
" <th>541</th>\n",
" <td>Universal Federal Savings Bank</td>\n",
" <td>Chicago</td>\n",
" <td>IL</td>\n",
" <td>29355</td>\n",
" <td>Chicago Community Bank</td>\n",
" <td>June 27, 2002</td>\n",
" <td>October 6, 2017</td>\n",
" </tr>\n",
" <tr>\n",
" <th>542</th>\n",
" <td>Connecticut Bank of Commerce</td>\n",
" <td>Stamford</td>\n",
" <td>CT</td>\n",
" <td>19183</td>\n",
" <td>Hudson United Bank</td>\n",
" <td>June 26, 2002</td>\n",
" <td>February 14, 2012</td>\n",
" </tr>\n",
" <tr>\n",
" <th>543</th>\n",
" <td>New Century Bank</td>\n",
" <td>Shelby Township</td>\n",
" <td>MI</td>\n",
" <td>34979</td>\n",
" <td>No Acquirer</td>\n",
" <td>March 28, 2002</td>\n",
" <td>March 18, 2005</td>\n",
" </tr>\n",
" <tr>\n",
" <th>544</th>\n",
" <td>Net 1st National Bank</td>\n",
" <td>Boca Raton</td>\n",
" <td>FL</td>\n",
" <td>26652</td>\n",
" <td>Bank Leumi USA</td>\n",
" <td>March 1, 2002</td>\n",
" <td>April 9, 2008</td>\n",
" </tr>\n",
" <tr>\n",
" <th>545</th>\n",
" <td>NextBank, NA</td>\n",
" <td>Phoenix</td>\n",
" <td>AZ</td>\n",
" <td>22314</td>\n",
" <td>No Acquirer</td>\n",
" <td>February 7, 2002</td>\n",
" <td>February 5, 2015</td>\n",
" </tr>\n",
" <tr>\n",
" <th>546</th>\n",
" <td>Oakwood Deposit Bank Co.</td>\n",
" <td>Oakwood</td>\n",
" <td>OH</td>\n",
" <td>8966</td>\n",
" <td>The State Bank & Trust Company</td>\n",
" <td>February 1, 2002</td>\n",
" <td>October 25, 2012</td>\n",
" </tr>\n",
" <tr>\n",
" <th>547</th>\n",
" <td>Bank of Sierra Blanca</td>\n",
" <td>Sierra Blanca</td>\n",
" <td>TX</td>\n",
" <td>22002</td>\n",
" <td>The Security State Bank of Pecos</td>\n",
" <td>January 18, 2002</td>\n",
" <td>November 6, 2003</td>\n",
" </tr>\n",
" <tr>\n",
" <th>548</th>\n",
" <td>Hamilton Bank, NA En Espanol</td>\n",
" <td>Miami</td>\n",
" <td>FL</td>\n",
" <td>24382</td>\n",
" <td>Israel Discount Bank of New York</td>\n",
" <td>January 11, 2002</td>\n",
" <td>September 21, 2015</td>\n",
" </tr>\n",
" <tr>\n",
" <th>549</th>\n",
" <td>Sinclair National Bank</td>\n",
" <td>Gravette</td>\n",
" <td>AR</td>\n",
" <td>34248</td>\n",
" <td>Delta Trust & Bank</td>\n",
" <td>September 7, 2001</td>\n",
" <td>October 6, 2017</td>\n",
" </tr>\n",
" <tr>\n",
" <th>550</th>\n",
" <td>Superior Bank, FSB</td>\n",
" <td>Hinsdale</td>\n",
" <td>IL</td>\n",
" <td>32646</td>\n",
" <td>Superior Federal, FSB</td>\n",
" <td>July 27, 2001</td>\n",
" <td>August 19, 2014</td>\n",
" </tr>\n",
" <tr>\n",
" <th>551</th>\n",
" <td>Malta National Bank</td>\n",
" <td>Malta</td>\n",
" <td>OH</td>\n",
" <td>6629</td>\n",
" <td>North Valley Bank</td>\n",
" <td>May 3, 2001</td>\n",
" <td>November 18, 2002</td>\n",
" </tr>\n",
" <tr>\n",
" <th>552</th>\n",
" <td>First Alliance Bank & Trust Co.</td>\n",
" <td>Manchester</td>\n",
" <td>NH</td>\n",
" <td>34264</td>\n",
" <td>Southern New Hampshire Bank & Trust</td>\n",
" <td>February 2, 2001</td>\n",
" <td>February 18, 2003</td>\n",
" </tr>\n",
" <tr>\n",
" <th>553</th>\n",
" <td>National State Bank of Metropolis</td>\n",
" <td>Metropolis</td>\n",
" <td>IL</td>\n",
" <td>3815</td>\n",
" <td>Banterra Bank of Marion</td>\n",
" <td>December 14, 2000</td>\n",
" <td>March 17, 2005</td>\n",
" </tr>\n",
" <tr>\n",
" <th>554</th>\n",
" <td>Bank of Honolulu</td>\n",
" <td>Honolulu</td>\n",
" <td>HI</td>\n",
" <td>21029</td>\n",
" <td>Bank of the Orient</td>\n",
" <td>October 13, 2000</td>\n",
" <td>March 17, 2005</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>555 rows × 7 columns</p>\n",
"</div>"
],
"text/plain": [
" Bank Name City \\\n",
"0 Washington Federal Bank for Savings Chicago \n",
"1 The Farmers and Merchants State Bank of Argonia Argonia \n",
"2 Fayette County Bank Saint Elmo \n",
"3 Guaranty Bank, (d/b/a BestBank in Georgia & Mi... Milwaukee \n",
"4 First NBC Bank New Orleans \n",
"5 Proficio Bank Cottonwood Heights \n",
"6 Seaway Bank and Trust Company Chicago \n",
"7 Harvest Community Bank Pennsville \n",
"8 Allied Bank Mulberry \n",
"9 The Woodbury Banking Company Woodbury \n",
"10 First CornerStone Bank King of Prussia \n",
"11 Trust Company Bank Memphis \n",
"12 North Milwaukee State Bank Milwaukee \n",
"13 Hometown National Bank Longview \n",
"14 The Bank of Georgia Peachtree City \n",
"15 Premier Bank Denver \n",
"16 Edgebrook Bank Chicago \n",
"17 Doral Bank En Espanol San Juan \n",
"18 Capitol City Bank & Trust Company Atlanta \n",
"19 Highland Community Bank Chicago \n",
"20 First National Bank of Crestview Crestview \n",
"21 Northern Star Bank Mankato \n",
"22 Frontier Bank, FSB D/B/A El Paseo Bank Palm Desert \n",
"23 The National Republic Bank of Chicago Chicago \n",
"24 NBRS Financial Rising Sun \n",
"25 GreenChoice Bank, fsb Chicago \n",
"26 Eastside Commercial Bank Conyers \n",
"27 The Freedom State Bank Freedom \n",
"28 Valley Bank Fort Lauderdale \n",
"29 Valley Bank Moline \n",
".. ... ... \n",
"525 ANB Financial, NA Bentonville \n",
"526 Hume Bank Hume \n",
"527 Douglass National Bank Kansas City \n",
"528 Miami Valley Bank Lakeview \n",
"529 NetBank Alpharetta \n",
"530 Metropolitan Savings Bank Pittsburgh \n",
"531 Bank of Ephraim Ephraim \n",
"532 Reliance Bank White Plains \n",
"533 Guaranty National Bank of Tallahassee Tallahassee \n",
"534 Dollar Savings Bank Newark \n",
"535 Pulaski Savings Bank Philadelphia \n",
"536 First National Bank of Blanchardville Blanchardville \n",
"537 Southern Pacific Bank Torrance \n",
"538 Farmers Bank of Cheneyville Cheneyville \n",
"539 Bank of Alamo Alamo \n",
"540 AmTrade International Bank En Espanol Atlanta \n",
"541 Universal Federal Savings Bank Chicago \n",
"542 Connecticut Bank of Commerce Stamford \n",
"543 New Century Bank Shelby Township \n",
"544 Net 1st National Bank Boca Raton \n",
"545 NextBank, NA Phoenix \n",
"546 Oakwood Deposit Bank Co. Oakwood \n",
"547 Bank of Sierra Blanca Sierra Blanca \n",
"548 Hamilton Bank, NA En Espanol Miami \n",
"549 Sinclair National Bank Gravette \n",
"550 Superior Bank, FSB Hinsdale \n",
"551 Malta National Bank Malta \n",
"552 First Alliance Bank & Trust Co. Manchester \n",
"553 National State Bank of Metropolis Metropolis \n",
"554 Bank of Honolulu Honolulu \n",
"\n",
" ST CERT Acquiring Institution Closing Date \\\n",
"0 IL 30570 Royal Savings Bank December 15, 2017 \n",
"1 KS 17719 Conway Bank October 13, 2017 \n",
"2 IL 1802 United Fidelity Bank, fsb May 26, 2017 \n",
"3 WI 30003 First-Citizens Bank & Trust Company May 5, 2017 \n",
"4 LA 58302 Whitney Bank April 28, 2017 \n",
"5 UT 35495 Cache Valley Bank March 3, 2017 \n",
"6 IL 19328 State Bank of Texas January 27, 2017 \n",
"7 NJ 34951 First-Citizens Bank & Trust Company January 13, 2017 \n",
"8 AR 91 Today's Bank September 23, 2016 \n",
"9 GA 11297 United Bank August 19, 2016 \n",
"10 PA 35312 First-Citizens Bank & Trust Company May 6, 2016 \n",
"11 TN 9956 The Bank of Fayette County April 29, 2016 \n",
"12 WI 20364 First-Citizens Bank & Trust Company March 11, 2016 \n",
"13 WA 35156 Twin City Bank October 2, 2015 \n",
"14 GA 35259 Fidelity Bank October 2, 2015 \n",
"15 CO 34112 United Fidelity Bank, fsb July 10, 2015 \n",
"16 IL 57772 Republic Bank of Chicago May 8, 2015 \n",
"17 PR 32102 Banco Popular de Puerto Rico February 27, 2015 \n",
"18 GA 33938 First-Citizens Bank & Trust Company February 13, 2015 \n",
"19 IL 20290 United Fidelity Bank, fsb January 23, 2015 \n",
"20 FL 17557 First NBC Bank January 16, 2015 \n",
"21 MN 34983 BankVista December 19, 2014 \n",
"22 CA 34738 Bank of Southern California, N.A. November 7, 2014 \n",
"23 IL 916 State Bank of Texas October 24, 2014 \n",
"24 MD 4862 Howard Bank October 17, 2014 \n",
"25 IL 28462 Providence Bank, LLC July 25, 2014 \n",
"26 GA 58125 Community & Southern Bank July 18, 2014 \n",
"27 OK 12483 Alva State Bank & Trust Company June 27, 2014 \n",
"28 FL 21793 Landmark Bank, National Association June 20, 2014 \n",
"29 IL 10450 Great Southern Bank June 20, 2014 \n",
".. .. ... ... ... \n",
"525 AR 33901 Pulaski Bank and Trust Company May 9, 2008 \n",
"526 MO 1971 Security Bank March 7, 2008 \n",
"527 MO 24660 Liberty Bank and Trust Company January 25, 2008 \n",
"528 OH 16848 The Citizens Banking Company October 4, 2007 \n",
"529 GA 32575 ING DIRECT September 28, 2007 \n",
"530 PA 35353 Allegheny Valley Bank of Pittsburgh February 2, 2007 \n",
"531 UT 1249 Far West Bank June 25, 2004 \n",
"532 NY 26778 Union State Bank March 19, 2004 \n",
"533 FL 26838 Hancock Bank of Florida March 12, 2004 \n",
"534 NJ 31330 No Acquirer February 14, 2004 \n",
"535 PA 27203 Earthstar Bank November 14, 2003 \n",
"536 WI 11639 The Park Bank May 9, 2003 \n",
"537 CA 27094 Beal Bank February 7, 2003 \n",
"538 LA 16445 Sabine State Bank & Trust December 17, 2002 \n",
"539 TN 9961 No Acquirer November 8, 2002 \n",
"540 GA 33784 No Acquirer September 30, 2002 \n",
"541 IL 29355 Chicago Community Bank June 27, 2002 \n",
"542 CT 19183 Hudson United Bank June 26, 2002 \n",
"543 MI 34979 No Acquirer March 28, 2002 \n",
"544 FL 26652 Bank Leumi USA March 1, 2002 \n",
"545 AZ 22314 No Acquirer February 7, 2002 \n",
"546 OH 8966 The State Bank & Trust Company February 1, 2002 \n",
"547 TX 22002 The Security State Bank of Pecos January 18, 2002 \n",
"548 FL 24382 Israel Discount Bank of New York January 11, 2002 \n",
"549 AR 34248 Delta Trust & Bank September 7, 2001 \n",
"550 IL 32646 Superior Federal, FSB July 27, 2001 \n",
"551 OH 6629 North Valley Bank May 3, 2001 \n",
"552 NH 34264 Southern New Hampshire Bank & Trust February 2, 2001 \n",
"553 IL 3815 Banterra Bank of Marion December 14, 2000 \n",
"554 HI 21029 Bank of the Orient October 13, 2000 \n",
"\n",
" Updated Date \n",
"0 February 21, 2018 \n",
"1 February 21, 2018 \n",
"2 July 26, 2017 \n",
"3 March 22, 2018 \n",
"4 December 5, 2017 \n",
"5 March 7, 2018 \n",
"6 May 18, 2017 \n",
"7 May 18, 2017 \n",
"8 September 25, 2017 \n",
"9 June 1, 2017 \n",
"10 November 13, 2018 \n",
"12 March 13, 2017 \n",
"13 February 19, 2018 \n",
"14 July 9, 2018 \n",
"15 February 20, 2018 \n",
"16 July 12, 2016 \n",
"17 May 13, 2015 \n",
"18 April 21, 2015 \n",
"19 November 15, 2017 \n",
"20 November 15, 2017 \n",
"21 January 3, 2018 \n",
"22 November 10, 2016 \n",
"23 January 6, 2016 \n",
"24 February 19, 2018 \n",
"25 December 12, 2016 \n",
"26 October 6, 2017 \n",
"27 February 21, 2018 \n",
"28 February 14, 2018 \n",
"29 June 26, 2015 \n",
".. ... \n",
"525 August 28, 2012 \n",
"526 August 28, 2012 \n",
"527 October 26, 2012 \n",
"528 September 12, 2016 \n",
"529 August 28, 2012 \n",
"530 October 27, 2010 \n",
"531 April 9, 2008 \n",
"532 April 9, 2008 \n",
"533 April 17, 2018 \n",
"534 April 9, 2008 \n",
"535 October 6, 2017 \n",
"536 June 5, 2012 \n",
"537 October 20, 2008 \n",
"538 October 20, 2004 \n",
"539 March 18, 2005 \n",
"540 September 11, 2006 \n",
"541 October 6, 2017 \n",
"542 February 14, 2012 \n",
"543 March 18, 2005 \n",
"544 April 9, 2008 \n",
"545 February 5, 2015 \n",
"546 October 25, 2012 \n",
"547 November 6, 2003 \n",
"548 September 21, 2015 \n",
"549 October 6, 2017 \n",
"550 August 19, 2014 \n",
"551 November 18, 2002 \n",
"552 February 18, 2003 \n",
"553 March 17, 2005 \n",
"554 March 17, 2005 \n",
"\n",
"[555 rows x 7 columns]"
]
},
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"banks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Powerful, no? Now let's turn that into an exercise.\n",
"\n",
"Given the data you just extracted above, can you analyse how many banks have failed per state?\n",
"\n",
"Georgia (GA) should be the state with the most failed banks!\n",
"\n",
"*Hint: try searching the web for pandas counting occurrences* "
]
},
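One way to approach this exercise (a sketch, using a tiny stand-in for the scraped `banks` DataFrame and assuming the state column is labelled `ST`, as in the output above): `value_counts` tallies how often each value occurs in a column.

```python
import pandas as pd

# Tiny stand-in for the scraped table; the real `banks` DataFrame
# has one row per failed bank, with the state in the "ST" column.
banks = pd.DataFrame({"ST": ["GA", "IL", "GA", "FL", "GA", "IL"]})

# value_counts() counts occurrences of each state,
# sorted from most failures to fewest.
failures_per_state = banks["ST"].value_counts()
print(failures_per_state)
```

On the full dataset, the first index entry of `failures_per_state` should be `GA`.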
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Cleaning <a name=\"cleaning\"></a>\n",
    "While doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transforming, and rearranging. Such tasks are often reported to take **80% or more** of a data analyst's time. Often the data is not stored in the format needed and has to be modified first. Researchers usually do this on an ad-hoc basis using programming languages like Python.\n",
"In this chapter, we will discuss tools for handling missing data, duplicate data, string manipulation, and some other analytical data transformations.\n",
    "Missing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible.\n",
"\n",
    "In pandas, missing numeric data is represented by `NaN` (Not a Number) and can easily be handled:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 orange\n",
"1 tomato\n",
"2 NaN\n",
"3 avocado\n",
"dtype: object"
]
},
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"string_data = pd.Series(['orange', 'tomato', np.nan, 'avocado'])\n",
"string_data"
]
},
{
   "cell_type": "code",
   "execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 False\n",
"1 False\n",
"2 True\n",
"3 False\n",
"dtype: bool"
]
},
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"string_data.isnull()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Furthermore, pandas treats the standard Python value `None` (of type `NoneType`, i.e. `x = None`) as missing data, making it functionally equivalent to `NaN`:"
]
},
{
   "cell_type": "code",
   "execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 None\n",
"1 tomato\n",
"2 NaN\n",
"3 avocado\n",
"dtype: object"
]
},
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"string_data[0] = None\n",
"string_data"
]
},
{
   "cell_type": "code",
   "execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 True\n",
"1 False\n",
"2 True\n",
"3 False\n",
"dtype: bool"
]
},
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"string_data.isnull()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Here are some other methods which you can find useful:\n",
    "\n",
    "| Method | Description |\n",
    "| --- | --- |\n",
    "| dropna | Filter axis labels based on whether the values of each label have missing data |\n",
    "| fillna | Fill in missing data with some value |\n",
    "| isnull | Return boolean values indicating which values are missing |\n",
    "| notnull | Negation of isnull |\n",
"\n",
"Remove the missing data below using the appropriate method"
]
},
{
   "cell_type": "code",
   "execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 1.0\n",
"5 6.0\n",
"dtype: float64"
]
},
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = pd.Series([1, None, 3, 4, None, 6])\n",
"data"
]
},
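A minimal sketch of the intended answer, reusing the Series above: `dropna` returns a new Series with the missing entries removed, leaving the original untouched.

```python
import pandas as pd

data = pd.Series([1, None, 3, 4, None, 6])

# dropna() returns a copy without the NaN entries;
# the original Series is not modified.
cleaned = data.dropna()
print(cleaned)
```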
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "By default, `dropna()` removes any row that contains at least one missing value. What if we want to remove only rows in which *all* of the data is missing?"
]
},
{
"cell_type": "code",
   "execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>6.5</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>NaN</td>\n",
" <td>6.5</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1.0 6.5 3.0\n",
"1 1.0 NaN NaN\n",
"2 NaN NaN NaN\n",
"3 NaN 6.5 3.0"
]
},
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = pd.DataFrame([[1., 6.5, 3.], [1., None, None],\n",
" [None, None, None], [None, 6.5, 3.]])\n",
"data"
]
},
{
"cell_type": "code",
   "execution_count": null,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>6.5</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1.0 6.5 3.0"
]
},
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.dropna()"
]
},
{
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
"source": [
"data.dropna(how=\"all\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "That's fine if we want to remove missing data, but what if we want to fill it in instead? Do you know of a way? Try to fill in all of the missing values in the data below with **0s**."
]
},
{
"cell_type": "code",
   "execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = pd.DataFrame([[1., 6.5, 3.], [2., None, None],\n",
" [None, None, None], [None, 1.5, 9.]])"
]
},
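One possible answer, sketched with the same DataFrame: `fillna` with a scalar replaces every `NaN` with that constant and returns a new DataFrame.

```python
import pandas as pd

data = pd.DataFrame([[1., 6.5, 3.], [2., None, None],
                     [None, None, None], [None, 1.5, 9.]])

# Replace every missing value with 0; the result is a new DataFrame.
filled = data.fillna(0)
print(filled)
```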
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "pandas can also fill gaps from neighbouring values instead of a constant. The simplest approach, shown below, is a forward fill (each gap takes the last valid value above it); more sophisticated interpolation methods exist but are not covered in this course."
]
},
{
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
"source": [
    "data.ffill()  # forward fill: propagate the last valid value downwards"
]
},
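For genuine interpolation between known values (rather than propagating the previous value), pandas offers `interpolate`; by default it interpolates linearly down each column. A quick sketch:

```python
import pandas as pd

s = pd.Series([1.0, None, None, 4.0])

# Linear interpolation fills the gap evenly between 1.0 and 4.0.
interpolated = s.interpolate()
print(interpolated)
```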
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want you can explore the other capabilities of [`fillna`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Transformation <a name=\"transformation\"></a>\n",
"### Removing duplicates\n",
    "Duplicate data can be a serious issue. Luckily, pandas offers a simple way to remove duplicates:"
]
},
{
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
"source": [
"data = pd.DataFrame([1, 2, 3, 4, 3, 2, 1])\n",
"data"
]
},
{
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
"source": [
"data.drop_duplicates()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "You can also choose which of the duplicates to keep:"
]
},
{
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
"source": [
"data.drop_duplicates(keep=\"last\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You've already seen how you can fill in missing data with `fillna`. That is actually a special case of more general value replacement. That is done via the `replace` method.\n",
"\n",
"Let's consider an example where the dataset given to us had `-999` as sentinel values for missing data instead of `NaN`."
]
},
{
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
"source": [
"data = pd.DataFrame([1., -999., 2., -999., 3., 4., -999, -999, 7.])\n",
"data"
]
},
{
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
"source": [
"data.replace(-999, np.nan)"
]
},
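`replace` also accepts a list or a dict, which helps when a dataset uses several different sentinel values at once; here `-999` and `-1000` are made-up sentinels for illustration.

```python
import numpy as np
import pandas as pd

data = pd.Series([1., -999., 2., -1000., 3.])

# A dict maps each sentinel to its replacement in a single call.
cleaned = data.replace({-999: np.nan, -1000: np.nan})
print(cleaned)
```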
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Renaming axis indexes\n",
    "Similarly to `replace`, you can also rename the labels of your axes:"
]
},
{
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
"source": [
"data = pd.DataFrame(np.arange(12).reshape((3, 4)),\n",
" index=['Edinburgh', 'Glasgow', 'Aberdeen'])\n",
"data"
]
},
{
"cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
"source": [
"# create a map using a standard Python dictionary\n",
"mapping = { 0 : \"one\",\n",
" 1 : \"two\",\n",
" 2 : \"three\",\n",
" 3 : \"four\"}\n",
"\n",
"# now rename the columns\n",
"data.rename(columns=mapping)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Rows can be renamed in a similar fashion"
]
},
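A sketch of renaming rows, reusing the city-indexed DataFrame from above: pass the mapping (or a function) to `index=` instead of `columns=`.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(12).reshape((3, 4)),
                    index=['Edinburgh', 'Glasgow', 'Aberdeen'])

# A function works too: upper-case every row label.
renamed = data.rename(index=str.upper)
print(renamed)
```

As with `rename(columns=...)`, this returns a new DataFrame and leaves the original index unchanged.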
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Detection and Filtering Outliers\n",
"Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:"
]
},
{
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
"source": [
"data = pd.DataFrame(np.random.randn(1000, 4))\n",
"data.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "Suppose you now want to cap values whose absolute value exceeds 3. First, find them in one of the columns:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
"col = data[2]\n",
"col[np.abs(col) > 3]"
]
},
{
"cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
"source": [
"data[np.abs(data) > 3] = np.sign(data) * 3\n",
"data.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Permutation and Random Sampling"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Permuting (randomly reordering) of rows in pandas is easy to do using the `numpy.random.permutation` function. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering:"
]
},
{
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
"source": [
"df = pd.DataFrame(np.arange(5 * 4).reshape((5, 4)))\n",
"df"
]
},
{
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
"source": [
"# generate random order\n",
"sampler = np.random.permutation(5)\n",
"sampler"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.take(sampler)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
    "To select a random subset without replacement, you can use the `sample` method:"
]
},
{
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
"source": [
"df.sample(n=3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# String manipulation <a name=\"strings\"></a>\n",
    "Python has long been popular for raw data manipulation, in part due to its ease of use for string and text processing. Most text operations are made simple with the string object's built-in methods. For more complex pattern matching and text manipulation, regular expressions may be needed."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Basics\n",
    "Let's refresh what plain `str` (string) objects are capable of in Python:"
]
},
{
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
"source": [
"# complex strings can be broken into small bits\n",
"val = \"Edinburgh is great\"\n",
"val.split(\" \")"
]
},
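`split` is often combined with a list comprehension and `join` to clean up and reassemble the pieces; a quick sketch:

```python
# Split on spaces, strip any stray whitespace from each piece,
# then reassemble with a different separator.
val = "Edinburgh is great"
pieces = [p.strip() for p in val.split(" ")]
joined = "::".join(pieces)
print(joined)
```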
{
"cell_type": "code",