" <td>December 14, 2000</td>\n",
" <td>March 17, 2005</td>\n",
" </tr>\n",
" <tr>\n",
" <th>554</th>\n",
" <td>Bank of Honolulu</td>\n",
" <td>Honolulu</td>\n",
" <td>HI</td>\n",
" <td>21029</td>\n",
" <td>Bank of the Orient</td>\n",
" <td>October 13, 2000</td>\n",
" <td>March 17, 2005</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>555 rows × 7 columns</p>\n",
"</div>"
],
"text/plain": [
" Bank Name City \\\n",
"0 Washington Federal Bank for Savings Chicago \n",
"1 The Farmers and Merchants State Bank of Argonia Argonia \n",
"2 Fayette County Bank Saint Elmo \n",
"3 Guaranty Bank, (d/b/a BestBank in Georgia & Mi... Milwaukee \n",
"4 First NBC Bank New Orleans \n",
"5 Proficio Bank Cottonwood Heights \n",
"6 Seaway Bank and Trust Company Chicago \n",
"7 Harvest Community Bank Pennsville \n",
"8 Allied Bank Mulberry \n",
"9 The Woodbury Banking Company Woodbury \n",
"10 First CornerStone Bank King of Prussia \n",
"11 Trust Company Bank Memphis \n",
"12 North Milwaukee State Bank Milwaukee \n",
"13 Hometown National Bank Longview \n",
"14 The Bank of Georgia Peachtree City \n",
"15 Premier Bank Denver \n",
"16 Edgebrook Bank Chicago \n",
"17 Doral Bank En Espanol San Juan \n",
"18 Capitol City Bank & Trust Company Atlanta \n",
"19 Highland Community Bank Chicago \n",
"20 First National Bank of Crestview Crestview \n",
"21 Northern Star Bank Mankato \n",
"22 Frontier Bank, FSB D/B/A El Paseo Bank Palm Desert \n",
"23 The National Republic Bank of Chicago Chicago \n",
"24 NBRS Financial Rising Sun \n",
"25 GreenChoice Bank, fsb Chicago \n",
"26 Eastside Commercial Bank Conyers \n",
"27 The Freedom State Bank Freedom \n",
"28 Valley Bank Fort Lauderdale \n",
"29 Valley Bank Moline \n",
".. ... ... \n",
"525 ANB Financial, NA Bentonville \n",
"526 Hume Bank Hume \n",
"527 Douglass National Bank Kansas City \n",
"528 Miami Valley Bank Lakeview \n",
"529 NetBank Alpharetta \n",
"530 Metropolitan Savings Bank Pittsburgh \n",
"531 Bank of Ephraim Ephraim \n",
"532 Reliance Bank White Plains \n",
"533 Guaranty National Bank of Tallahassee Tallahassee \n",
"534 Dollar Savings Bank Newark \n",
"535 Pulaski Savings Bank Philadelphia \n",
"536 First National Bank of Blanchardville Blanchardville \n",
"537 Southern Pacific Bank Torrance \n",
"538 Farmers Bank of Cheneyville Cheneyville \n",
"539 Bank of Alamo Alamo \n",
"540 AmTrade International Bank En Espanol Atlanta \n",
"541 Universal Federal Savings Bank Chicago \n",
"542 Connecticut Bank of Commerce Stamford \n",
"543 New Century Bank Shelby Township \n",
"544 Net 1st National Bank Boca Raton \n",
"545 NextBank, NA Phoenix \n",
"546 Oakwood Deposit Bank Co. Oakwood \n",
"547 Bank of Sierra Blanca Sierra Blanca \n",
"548 Hamilton Bank, NA En Espanol Miami \n",
"549 Sinclair National Bank Gravette \n",
"550 Superior Bank, FSB Hinsdale \n",
"551 Malta National Bank Malta \n",
"552 First Alliance Bank & Trust Co. Manchester \n",
"553 National State Bank of Metropolis Metropolis \n",
"554 Bank of Honolulu Honolulu \n",
"\n",
" ST CERT Acquiring Institution Closing Date \\\n",
"0 IL 30570 Royal Savings Bank December 15, 2017 \n",
"1 KS 17719 Conway Bank October 13, 2017 \n",
"2 IL 1802 United Fidelity Bank, fsb May 26, 2017 \n",
"3 WI 30003 First-Citizens Bank & Trust Company May 5, 2017 \n",
"4 LA 58302 Whitney Bank April 28, 2017 \n",
"5 UT 35495 Cache Valley Bank March 3, 2017 \n",
"6 IL 19328 State Bank of Texas January 27, 2017 \n",
"7 NJ 34951 First-Citizens Bank & Trust Company January 13, 2017 \n",
"8 AR 91 Today's Bank September 23, 2016 \n",
"9 GA 11297 United Bank August 19, 2016 \n",
"10 PA 35312 First-Citizens Bank & Trust Company May 6, 2016 \n",
"11 TN 9956 The Bank of Fayette County April 29, 2016 \n",
"12 WI 20364 First-Citizens Bank & Trust Company March 11, 2016 \n",
"13 WA 35156 Twin City Bank October 2, 2015 \n",
"14 GA 35259 Fidelity Bank October 2, 2015 \n",
"15 CO 34112 United Fidelity Bank, fsb July 10, 2015 \n",
"16 IL 57772 Republic Bank of Chicago May 8, 2015 \n",
"17 PR 32102 Banco Popular de Puerto Rico February 27, 2015 \n",
"18 GA 33938 First-Citizens Bank & Trust Company February 13, 2015 \n",
"19 IL 20290 United Fidelity Bank, fsb January 23, 2015 \n",
"20 FL 17557 First NBC Bank January 16, 2015 \n",
"21 MN 34983 BankVista December 19, 2014 \n",
"22 CA 34738 Bank of Southern California, N.A. November 7, 2014 \n",
"23 IL 916 State Bank of Texas October 24, 2014 \n",
"24 MD 4862 Howard Bank October 17, 2014 \n",
"25 IL 28462 Providence Bank, LLC July 25, 2014 \n",
"26 GA 58125 Community & Southern Bank July 18, 2014 \n",
"27 OK 12483 Alva State Bank & Trust Company June 27, 2014 \n",
"28 FL 21793 Landmark Bank, National Association June 20, 2014 \n",
"29 IL 10450 Great Southern Bank June 20, 2014 \n",
".. .. ... ... ... \n",
"525 AR 33901 Pulaski Bank and Trust Company May 9, 2008 \n",
"526 MO 1971 Security Bank March 7, 2008 \n",
"527 MO 24660 Liberty Bank and Trust Company January 25, 2008 \n",
"528 OH 16848 The Citizens Banking Company October 4, 2007 \n",
"529 GA 32575 ING DIRECT September 28, 2007 \n",
"530 PA 35353 Allegheny Valley Bank of Pittsburgh February 2, 2007 \n",
"531 UT 1249 Far West Bank June 25, 2004 \n",
"532 NY 26778 Union State Bank March 19, 2004 \n",
"533 FL 26838 Hancock Bank of Florida March 12, 2004 \n",
"534 NJ 31330 No Acquirer February 14, 2004 \n",
"535 PA 27203 Earthstar Bank November 14, 2003 \n",
"536 WI 11639 The Park Bank May 9, 2003 \n",
"537 CA 27094 Beal Bank February 7, 2003 \n",
"538 LA 16445 Sabine State Bank & Trust December 17, 2002 \n",
"539 TN 9961 No Acquirer November 8, 2002 \n",
"540 GA 33784 No Acquirer September 30, 2002 \n",
"541 IL 29355 Chicago Community Bank June 27, 2002 \n",
"542 CT 19183 Hudson United Bank June 26, 2002 \n",
"543 MI 34979 No Acquirer March 28, 2002 \n",
"544 FL 26652 Bank Leumi USA March 1, 2002 \n",
"545 AZ 22314 No Acquirer February 7, 2002 \n",
"546 OH 8966 The State Bank & Trust Company February 1, 2002 \n",
"547 TX 22002 The Security State Bank of Pecos January 18, 2002 \n",
"548 FL 24382 Israel Discount Bank of New York January 11, 2002 \n",
"549 AR 34248 Delta Trust & Bank September 7, 2001 \n",
"550 IL 32646 Superior Federal, FSB July 27, 2001 \n",
"551 OH 6629 North Valley Bank May 3, 2001 \n",
"552 NH 34264 Southern New Hampshire Bank & Trust February 2, 2001 \n",
"553 IL 3815 Banterra Bank of Marion December 14, 2000 \n",
"554 HI 21029 Bank of the Orient October 13, 2000 \n",
"\n",
" Updated Date \n",
"0 February 21, 2018 \n",
"1 February 21, 2018 \n",
"2 July 26, 2017 \n",
"3 March 22, 2018 \n",
"4 December 5, 2017 \n",
"5 March 7, 2018 \n",
"6 May 18, 2017 \n",
"7 May 18, 2017 \n",
"8 September 25, 2017 \n",
"9 June 1, 2017 \n",
"10 November 13, 2018 \n",
"11 September 6, 2016 \n",
"12 March 13, 2017 \n",
"13 February 19, 2018 \n",
"14 July 9, 2018 \n",
"15 February 20, 2018 \n",
"16 July 12, 2016 \n",
"17 May 13, 2015 \n",
"18 April 21, 2015 \n",
"19 November 15, 2017 \n",
"20 November 15, 2017 \n",
"21 January 3, 2018 \n",
"22 November 10, 2016 \n",
"23 January 6, 2016 \n",
"24 February 19, 2018 \n",
"25 December 12, 2016 \n",
"26 October 6, 2017 \n",
"27 February 21, 2018 \n",
"28 February 14, 2018 \n",
"29 June 26, 2015 \n",
".. ... \n",
"525 August 28, 2012 \n",
"526 August 28, 2012 \n",
"527 October 26, 2012 \n",
"528 September 12, 2016 \n",
"529 August 28, 2012 \n",
"530 October 27, 2010 \n",
"531 April 9, 2008 \n",
"532 April 9, 2008 \n",
"533 April 17, 2018 \n",
"534 April 9, 2008 \n",
"535 October 6, 2017 \n",
"536 June 5, 2012 \n",
"537 October 20, 2008 \n",
"538 October 20, 2004 \n",
"539 March 18, 2005 \n",
"540 September 11, 2006 \n",
"541 October 6, 2017 \n",
"542 February 14, 2012 \n",
"543 March 18, 2005 \n",
"544 April 9, 2008 \n",
"545 February 5, 2015 \n",
"546 October 25, 2012 \n",
"547 November 6, 2003 \n",
"548 September 21, 2015 \n",
"549 October 6, 2017 \n",
"550 August 19, 2014 \n",
"551 November 18, 2002 \n",
"552 February 18, 2003 \n",
"553 March 17, 2005 \n",
"554 March 17, 2005 \n",
"\n",
"[555 rows x 7 columns]"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"banks"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Powerful no? Now let's turn that into an exercise.\n",
"\n",
"### Exercise\n",
"Given the data you just extracted above, can you analyse how many banks have failed per state?\n",
"\n",
"Georgia (GA) should be the state with the most failed banks!\n",
"\n",
"*Hint: try searching the web for pandas counting occurances* "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Cleaning\n",
"During the course of doing data analysis and modeling, a significant amount of time is spent on data preparation: loading, cleaning, transofrming and rearranging. Such tasks are often reported to take **up to 80%** or more of a data analyst's time. Often the way the data is stored in files isn't in the correct format and needs to be modified. Reseraches usually do this on an ad-hoc basis using programming languages like Python.\n",
"\n",
"In this chapter we will discuss tools for handling missing data, duplicate data, string manipulation, and some other analytical data transformations.\n",
"\n",
"## Handling missing data\n",
"Mussing data occurs commonly in many data analysis applications. One of the goals of pandas is to make working with missing data as painless as possible.\n",
"\n",
"In pandas, missing numberic data is represented by `NaN` (Not a Number) and can easily be handled:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 orange\n",
"1 tomato\n",
"2 NaN\n",
"3 avocado\n",
"dtype: object"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"string_data = pd.Series(['orange', 'tomato', np.nan, 'avocado'])\n",
"string_data"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 False\n",
"1 False\n",
"2 True\n",
"3 False\n",
"dtype: bool"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"string_data.isnull()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Furthermore, the pandas `NaN` is functionally equlevant to the standard Python type `NoneType` which can be defined with `x = None`."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 None\n",
"1 tomato\n",
"2 NaN\n",
"3 avocado\n",
"dtype: object"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"string_data[0] = None\n",
"string_data"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 True\n",
"1 False\n",
"2 True\n",
"3 False\n",
"dtype: bool"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"string_data.isnull()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here are some other methods which you can find useful:\n",
" \n",
"| Method | Decription |\n",
"| -- | -- |\n",
"| dropna | Filter axis labels based on whether values of each label have missing data|\n",
"| fillna | Fill in missing data with some vallue |\n",
"| isnull | Return boolean values indicating which vallues are missin |\n",
"| notnull | Negation of isnull |\n",
"\n",
"### Exercise\n",
"Remove the missing data below using the appropriate method"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 1.0\n",
"2 3.0\n",
"3 4.0\n",
"5 6.0\n",
"dtype: float64"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = pd.Series([1, None, 3, 4, None, 6])\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"`dropna()` by default removes any row/column that has a missing value. What if we want to remove only rows in which all of the data is missing though?"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>6.5</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>NaN</td>\n",
" <td>6.5</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1.0 6.5 3.0\n",
"1 1.0 NaN NaN\n",
"2 NaN NaN NaN\n",
"3 NaN 6.5 3.0"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = pd.DataFrame([[1., 6.5, 3.], [1., None, None],\n",
" [None, None, None], [None, 6.5, 3.]])\n",
"data"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>6.5</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1.0 6.5 3.0"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.dropna()"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>6.5</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1.0</td>\n",
" <td>NaN</td>\n",
" <td>NaN</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>NaN</td>\n",
" <td>6.5</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1.0 6.5 3.0\n",
"1 1.0 NaN NaN\n",
"3 NaN 6.5 3.0"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.dropna(how=\"all\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise\n",
"That's fine if we want to remove missing data, what if we want to fill in missing data? Do you know of a way? Try to fill in all of the missing values from the data below with **0s**"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"data = pd.DataFrame([[1., 6.5, 3.], [2., None, None],\n",
" [None, None, None], [None, 1.5, 9.]])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"pandas also allows us to interpolate the data instead of just filling it with a constant. The easiest way to do that is shown below, but there are more complex ones that are not covered in this course."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1.0</td>\n",
" <td>6.5</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2.0</td>\n",
" <td>6.5</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2.0</td>\n",
" <td>6.5</td>\n",
" <td>3.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2.0</td>\n",
" <td>1.5</td>\n",
" <td>9.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 1.0 6.5 3.0\n",
"1 2.0 6.5 3.0\n",
"2 2.0 6.5 3.0\n",
"3 2.0 1.5 9.0"
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.fillna(method=\"ffill\")"
]
},
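{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: newer pandas versions deprecate the `method` argument of `fillna` in favour of a dedicated method, so the equivalent idiom would be:\n",
"\n",
"```python\n",
"data.ffill()  # same forward fill as fillna(method=\"ffill\")\n",
"```"
]
},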
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you want you can explore the other capabilities of [`fillna`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.fillna.html)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Data Transformation\n",
"### Removing duplicates\n",
"Duplicate data can be a serious issue, luckily pandas offers a simple way to remove duplicates"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0\n",
"0 1\n",
"1 2\n",
"2 3\n",
"3 4\n",
"4 3\n",
"5 2\n",
"6 1"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data = pd.DataFrame([1, 2, 3, 4, 3, 2, 1])\n",
"data"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0\n",
"0 1\n",
"1 2\n",
"2 3\n",
"3 4"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.drop_duplicates()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also select which rows to keep"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>3</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>2</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0\n",
"3 4\n",
"4 3\n",
"5 2\n",
"6 1"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data.drop_duplicates(keep=\"last\")"
]
},
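{
"cell_type": "markdown",
"metadata": {},
"source": [
"For DataFrames with several columns, `drop_duplicates` can also judge duplicates by a subset of columns. A minimal sketch on made-up data:\n",
"\n",
"```python\n",
"df = pd.DataFrame({\"k\": [\"a\", \"a\", \"b\"], \"v\": [1, 2, 3]})\n",
"df.drop_duplicates(subset=[\"k\"])  # keeps the first row for each value of k\n",
"```"
]
},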
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Replacing data\n",
"You've already seen how you can fill in missing data with `fillna`. That is actually a special case of more general value replacement. That is done via the `replace` method.\n",
"\n",
"Let's consider an example where the dataset given to us had `-999` as sentinel values for missing data instead of `NaN`."
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",