{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"[pandas](http://pandas.pydata.org) provides high-level data structures and functions designed to make working with structured or tabular data fast, easy and expressive. The primary objects in pandas that we will be using are the `DataFrame`, a tabular, column-oriented data structure with both row and column labels, and the `Series`, a one-dimensional labeled array object.\n",
"\n",
"pandas blends the high-performance, array-computing ideas of NumPy with the flexible data manipulation capabilities of spreadsheets and relational databases. It provides sophisticated indexing functionality that makes it easy to reshape, slice and aggregate data.\n",
"\n",
"While pandas adopts many coding idioms from NumPy, the biggest difference is that pandas is designed for working with tabular or heterogeneous data. NumPy, by contrast, is best suited for working with homogeneous numerical array data.\n",
"<br>\n",
"- Data Structures\n",
" - Series\n",
" - DataFrame\n",
"- Essential Functionality\n",
" - Reindexing\n",
" - Dropping Entries\n",
" - Indexing, Slicing and Filtering\n",
" - Arithmetic Operations\n",
" - Sorting and ranking\n",
"- Summarizing and Computing Descriptive Statistics\n",
" - Correlation and Covariance\n",
" - Unique values, value counts and Membership\n",
"- Reading and storing data\n",
" - Text Format\n",
" - Text Format Writing\n",
" - XML and HTML Web Scraping\n",
" - Reading Excel files\n",
" - Mention that pandas allows interfacing with web APIs and SQL databases\n",
"- Data Cleaning and Preparation\n",
" - Missing data\n",
" - Data transformation\n",
" - String manipulation incl. regexp\n",
"- Data wrangling\n",
"- Plotting?\n",
" "
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Common pandas import statement\n",
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Data Structures\n",
"## Series\n",
"A Series is a one-dimensional array-like object containing a sequence of values and an associated array of data labels, called its index.\n",
"\n",
"The easiest way to make a Series is from an array of data:"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"data = pd.Series([4, 7, -5, 3])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now try printing out `data`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The string representation of a Series displayed interactively shows the index on the left and the values on the right. Because we didn't specify an index, the default one is simply the integers 0 through N-1, where N is the length of the data.\n",
"\n",
"You can output only the values of a Series using \n",
"```python\n",
"data.values\n",
"```\n",
"or you can get only the index using\n",
"```python\n",
"data.index\n",
"```\n",
"Try it out below!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
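{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a minimal sketch of the two attributes above (the variable names `s`, `values` and `labels` are chosen here purely for illustration), the values come back as a NumPy array and the default index can be converted to a plain list:\n",
"```python\n",
"import pandas as pd\n",
"\n",
"# Same data as above, with the default integer index\n",
"s = pd.Series([4, 7, -5, 3])\n",
"\n",
"values = s.values        # NumPy array holding the data\n",
"labels = list(s.index)   # default index converted to a plain list\n",
"```\n",
"Here `labels` should contain the integers 0 through 3, matching the default index described above."
]
},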
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can specify custom index labels when initialising the Series:"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"data2 = pd.Series([4, 7, -5, 3], index=[\"a\", \"b\", \"c\", \"d\"])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now you can use these labels to access the data, much as you would use positions with a normal array:"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"4"
]
},
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"data2[\"a\"]"
]
},
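{
"cell_type": "markdown",
"metadata": {},
"source": [
"Label-based access goes beyond single lookups. As a small sketch (the names `s` and `subset` are illustrative and not part of the original notebook), a list of labels selects several entries at once, and assigning through a label updates the stored value:\n",
"```python\n",
"import pandas as pd\n",
"\n",
"s = pd.Series([4, 7, -5, 3], index=['a', 'b', 'c', 'd'])\n",
"\n",
"# Select several entries at once by passing a list of labels\n",
"subset = s[['c', 'a']]\n",
"\n",
"# Assign through a label, just like assigning to an array element\n",
"s['d'] = 6\n",
"```\n",
"Note that `subset` keeps the labels in the order they were requested, so its index is `c` followed by `a`."
]
},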
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another way to think about a Series is as a fixed-length ordered dictionary that maps index labels to data values. In fact, you can create a Series directly from a dictionary:"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [],
"source": [
"cities = {\"Glasgow\" : 599650, \"Edinburgh\" : 464990, \"Aberdeen\" : 196670, \"Dundee\" : 147710}\n",
"data3 = pd.Series(cities)"
]
},
{
"cell_type": "code",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Aberdeen 196670\n",
"Dundee 147710\n",
"dtype: int64"
]
},