{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Notebook 5 - pandas\n",
"[pandas](http://pandas.pydata.org) provides high-level data structures and functions designed to make working with structured or tabular data fast, easy and expressive. The primary objects in pandas that we will be using are the `DataFrame`, a tabular, column-oriented data structure with both row and column labels, and the `Series`, a one-dimensional labeled array object.\n",
"\n",
"pandas blends the high-performance, array-computing ideas of NumPy with the flexible data manipulation capabilities of spreadsheets and relational databases. It provides sophisticated indexing functionality to make it easy to reshape, slice and perform aggregations.\n",
"\n",
"While pandas adopts many coding idioms from NumPy, the most significant difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.\n",
"<br>\n",
"\n",
"## Table of Contents:\n",
"- [Data Structures](#structures)\n",
" - [Series](#series)\n",
" - [DataFrame](#dataframe)\n",
"- [Essential Functionality](#ess_func)\n",
" - [Reindexing](#reindexing)\n",
" - [Dropping Entries](#removing)\n",
" - [Indexing, Slicing and Filtering](#indexing)\n",
" - [Arithmetic Operations](#arithmetic)\n",
"- [Summarizing and Computing Descriptive Statistics](#sums)\n",
"- [Loading and storing data](#loading)\n",
" - [Text Format](#text) \n",
" - [Web Scraping](#web)\n",
"- [Data Cleaning and preparation](#cleaning)\n",
" - [Handling missing data](#missing)\n",
" - [Data transformation](#transformation)\n",
"\n",
"The common pandas import statement is shown below:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Common pandas import statement\n",
"import numpy as np\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Structures <a name=\"structures\"></a>\n",
"## Series <a name=\"series\"></a>\n",
"A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels called its index.\n",
"\n",
"The easiest way to make a Series is from an array of data:"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"data = pd.Series([4, 7, -5, 3])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now try printing out `data`:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The string representation of a Series displayed interactively shows the index on the left and the values on the right. Because we didn't specify an index, the default one is simply the integers 0 through N-1, where N is the length of the data.\n",
"\n",
"You can output only the values of a Series using \n",
"```python\n",
"data.values\n",
"```\n",
"or you can get only the indices using\n",
"```python\n",
"data.index\n",
"```\n",
"Try it out below!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can specify custom indices when initialising the Series:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"data2 = pd.Series([4, 7, -5, 3], index=[\"a\", \"b\", \"c\", \"d\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now you can use these labels to access the data, much as you would index a normal array:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data2[\"a\"]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another way to think about a Series is as a fixed-length, ordered dictionary that maps index labels to values. You can even create a Series directly from a dictionary:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"cities = {\"Glasgow\": 599650, \"Edinburgh\": 464990, \"Aberdeen\": 196670, \"Dundee\": 147710}\n",
"data3 = pd.Series(cities)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Glasgow      599650\n",
"Edinburgh    464990\n",
"Aberdeen     196670\n",
"Dundee       147710\n",
"dtype: int64"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data3"
]
},
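{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the dictionary analogy (the names `populations` and `s` and their values are illustrative assumptions, not part of this notebook), a Series supports membership tests on its index labels and can be converted back to a plain dictionary:\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Hypothetical example data, chosen only for illustration\n",
"populations = {'x': 10, 'y': 20, 'z': 30}\n",
"s = pd.Series(populations)\n",
"\n",
"# Membership tests look at the index labels, like dictionary keys\n",
"print('x' in s)    # True\n",
"print('w' in s)    # False\n",
"\n",
"# The dictionary's insertion order is kept in the index\n",
"print(list(s.index))    # ['x', 'y', 'z']\n",
"\n",
"# A Series converts back to a plain dictionary\n",
"print(s.to_dict() == populations)    # True\n",
"```"
]
},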
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can do arithmetic operations between Series, much as with NumPy arrays. Even if two Series contain different entries, arithmetic operations are aligned on their index labels, and a label present in only one of them produces `NaN` in the result.\n",
"\n",