Pandas¶
Introduction¶
Pandas is an open-source Python library used mainly for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.
Installation using pip¶
# !pip install pandas
To use pandas, import it in your file, conventionally under the alias pd:
import pandas as pd
Creating data¶
There are two core objects in pandas: the DataFrame and the Series.
DataFrame¶
A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.
For example, consider the following simple DataFrame:
pd.DataFrame({'Likes': [130, 121], 'Dislikes': [11, 2]})
| | Likes | Dislikes |
|---|---|---|
| 0 | 130 | 11 |
| 1 | 121 | 2 |
In this example, the “0, Likes” entry has the value of 130. The “0, Dislikes” entry has a value of 11, and so on.
DataFrame entries are not limited to integers. For instance, here’s a DataFrame whose values are strings:
pd.DataFrame({'Anonymous': ['I liked it.', 'It was great!'], 'Analyst': ['Looks good.', 'Informative']})
| | Anonymous | Analyst |
|---|---|---|
| 0 | I liked it. | Looks good. |
| 1 | It was great! | Informative |
We are using the pd.DataFrame() constructor to generate these DataFrame objects. The syntax for declaring a new one is a dictionary whose keys are the column names (Anonymous and Analyst in this example) and whose values are lists of entries. This is the standard way of constructing a new DataFrame, and the one you are most likely to encounter.
The dictionary-list constructor assigns values to the column labels, but just uses an ascending count from 0 (0, 1, 2, 3, …) for the row labels. Sometimes this is OK, but oftentimes we will want to assign these labels ourselves.
The list of row labels used in a DataFrame is known as an Index. We can assign values to it by using an index parameter in our constructor:
pd.DataFrame({'Anonymous': ['I liked it.', 'It was great!'],
'Analyst': ['Looks good.', 'Informative']},
index=['Product A', 'Product B'])
| | Anonymous | Analyst |
|---|---|---|
| Product A | I liked it. | Looks good. |
| Product B | It was great! | Informative |
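As a small sketch of how the labels can be inspected after construction, the snippet below rebuilds the same DataFrame and reads its row and column labels back. The variable names (reviews, row_labels, column_labels) are chosen here for illustration and do not come from the original text.

```python
import pandas as pd

# Same data as the example above, with custom row labels.
# The name `reviews` is hypothetical, used only for this sketch.
reviews = pd.DataFrame(
    {'Anonymous': ['I liked it.', 'It was great!'],
     'Analyst': ['Looks good.', 'Informative']},
    index=['Product A', 'Product B'],
)

# Row labels live in .index, column labels in .columns.
row_labels = list(reviews.index)
column_labels = list(reviews.columns)

print(row_labels)     # ['Product A', 'Product B']
print(column_labels)  # ['Anonymous', 'Analyst']
```

Keeping the labels in the Index is what later makes label-based selection possible.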
Series¶
A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list. And in fact you can create one with nothing more than a list:
pd.Series([1, 2, 3, 4, 5])
0 1
1 2
2 3
3 4
4 5
dtype: int64
A Series is, in essence, a single column of a DataFrame. So you can assign row labels to the Series the same way as before, using an index parameter. However, a Series does not have a column name; it only has one overall name:
pd.Series([400, 515, 605], index=['2017 Sales', '2018 Sales', '2019 Sales'], name='Product A')
2017 Sales 400
2018 Sales 515
2019 Sales 605
Name: Product A, dtype: int64
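To make the "single column of a DataFrame" relationship concrete, here is a minimal sketch that goes in both directions: pulling a column out of a DataFrame as a Series, and turning a named Series into a one-column DataFrame. The variable names are illustrative assumptions; to_frame() is a standard Series method.

```python
import pandas as pd

# Hypothetical variable names; the data matches the earlier examples.
votes = pd.DataFrame({'Likes': [130, 121], 'Dislikes': [11, 2]})

# Selecting one column returns a Series named after that column.
likes_column = votes['Likes']
print(type(likes_column).__name__)  # Series
print(likes_column.name)            # Likes

# A named Series becomes a one-column DataFrame; its name is the column label.
sales = pd.Series([400, 515, 605],
                  index=['2017 Sales', '2018 Sales', '2019 Sales'],
                  name='Product A')
sales_frame = sales.to_frame()
print(list(sales_frame.columns))    # ['Product A']
```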
Reading Data Files¶
Being able to create a DataFrame or Series by hand is handy. But, most of the time, we won’t actually be creating our own data by hand. Instead, we’ll be working with data that already exists.
Data can be stored in any of a number of different forms and formats. By far the most basic of these is the humble CSV file. When you open a CSV file you get something that looks like this:
Product A,Product B,Product C,
30,21,9,
35,34,1,
41,11,11
So a CSV file is a table of values separated by commas. Hence the name: “Comma-Separated Values”, or CSV.
Let’s now read a well-known dataset, the Golf dataset. We’ll use the pd.read_csv() function to read the data into a DataFrame:
golf_dataset = pd.read_csv("./Data/Pandas/golf-dataset.csv")
golf_dataset
| | Outlook | Temp | Humidity | Windy | Play Golf |
|---|---|---|---|---|---|
| 0 | Rainy | Hot | High | False | No |
| 1 | Rainy | Hot | High | True | No |
| 2 | Overcast | Hot | High | False | Yes |
| 3 | Sunny | Mild | High | False | Yes |
| 4 | Sunny | Cool | Normal | False | Yes |
| 5 | Sunny | Cool | Normal | True | No |
| 6 | Overcast | Cool | Normal | True | Yes |
| 7 | Rainy | Mild | High | False | No |
| 8 | Rainy | Cool | Normal | False | Yes |
| 9 | Sunny | Mild | Normal | False | Yes |
| 10 | Rainy | Mild | Normal | True | Yes |
| 11 | Overcast | Mild | High | True | Yes |
| 12 | Overcast | Hot | Normal | False | Yes |
| 13 | Sunny | Mild | High | True | No |
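If the CSV file is not at hand, the same call can be tried on in-memory text. The sketch below is an assumption-laden stand-in: csv_text holds three rows typed in from the table above, and io.StringIO wraps it so that pd.read_csv() can treat it like a file.

```python
import io

import pandas as pd

# Hypothetical CSV text standing in for a file on disk:
# the first three rows of the golf table, typed in by hand.
csv_text = """Outlook,Temp,Humidity,Windy,Play Golf
Rainy,Hot,High,False,No
Rainy,Hot,High,True,No
Overcast,Hot,High,False,Yes
"""

# read_csv accepts a path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)          # (3, 5)
print(list(sample.columns))  # ['Outlook', 'Temp', 'Humidity', 'Windy', 'Play Golf']
```

The first line of the text is used as the header, and the row labels default to 0, 1, 2.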
shape¶
We can use the shape attribute to check how large the resulting DataFrame is:
golf_dataset.shape
(14, 5)
head¶
We can examine the contents of the resulting DataFrame using the head() method, which returns the first five rows:
golf_dataset.head()
| | Outlook | Temp | Humidity | Windy | Play Golf |
|---|---|---|---|---|---|
| 0 | Rainy | Hot | High | False | No |
| 1 | Rainy | Hot | High | True | No |
| 2 | Overcast | Hot | High | False | Yes |
| 3 | Sunny | Mild | High | False | Yes |
| 4 | Sunny | Cool | Normal | False | Yes |
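head() also accepts a row count, and tail() is its counterpart for the end of the table. A minimal sketch, assuming a hand-typed six-row sample (golf_sample is a hypothetical name, not the full dataset):

```python
import pandas as pd

# First six rows of two columns from the golf table, typed in by hand.
golf_sample = pd.DataFrame({
    'Outlook': ['Rainy', 'Rainy', 'Overcast', 'Sunny', 'Sunny', 'Sunny'],
    'Temp': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool'],
})

default_head = golf_sample.head()   # five rows by default
first_two = golf_sample.head(2)     # rows labelled 0 and 1
last_two = golf_sample.tail(2)      # rows labelled 4 and 5

print(len(default_head))            # 5
print(list(first_two.index))        # [0, 1]
print(list(last_two.index))         # [4, 5]
```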
In Python, we can access a property of an object as an attribute. A book object, for example, might have a title property, which we can access with book.title. Columns in a pandas DataFrame work in much the same way.
Hence, to access the Temp column of golf_dataset, we can use:
golf_dataset.Temp
0 Hot
1 Hot
2 Hot
3 Mild
4 Cool
5 Cool
6 Cool
7 Mild
8 Cool
9 Mild
10 Mild
11 Mild
12 Hot
13 Mild
Name: Temp, dtype: object
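Attribute access has a practical limit: it only works when the column name is a valid Python identifier. The sketch below, built on a hypothetical three-row sample, shows that bracket indexing returns the same Series and also handles a name with a space, such as Play Golf.

```python
import pandas as pd

# Hypothetical three-row sample using two column names from the golf table.
golf_sample = pd.DataFrame({
    'Temp': ['Hot', 'Hot', 'Mild'],
    'Play Golf': ['No', 'No', 'Yes'],
})

# Attribute access and bracket access return the same Series.
same = golf_sample.Temp.equals(golf_sample['Temp'])
print(same)  # True

# 'Play Golf' contains a space, so only the bracket form can reach it.
play = golf_sample['Play Golf']
print(play.tolist())  # ['No', 'No', 'Yes']
```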
Indexing in Pandas¶
Attribute selection and the indexing operator (for example, golf_dataset['Temp']) are convenient because they work just as they do in the rest of the Python ecosystem, which makes them easy for a novice to pick up. However, pandas also has its own accessor operators, loc and iloc. For more advanced operations, these are the ones you’re supposed to be using.
Index-based Selection¶
Pandas indexing works in one of two paradigms. The first is index-based selection: selecting data based on its numerical position in the data. iloc follows this paradigm.
To select the first row of data in a DataFrame, we may use the following:
golf_dataset.iloc[0]
Outlook Rainy
Temp Hot
Humidity High
Windy False
Play Golf No
Name: 0, dtype: object
Both loc and iloc are row-first, column-second. This is the opposite of chained selection with the indexing operator, such as golf_dataset['Temp'][0], which is column-first, row-second.
This means that it’s marginally easier to retrieve rows, and marginally harder to retrieve columns. To get a column with iloc, we can do the following:
golf_dataset.iloc[:,0]
0 Rainy
1 Rainy
2 Overcast
3 Sunny
4 Sunny
5 Sunny
6 Overcast
7 Rainy
8 Rainy
9 Sunny
10 Rainy
11 Overcast
12 Overcast
13 Sunny
Name: Outlook, dtype: object
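iloc accepts the same kinds of positions as a Python list: single integers, slices, lists of integers, and negative numbers counted from the end. A short sketch on a hypothetical four-row sample (golf_sample is an illustrative name):

```python
import pandas as pd

# First four rows of three columns from the golf table, typed in by hand.
golf_sample = pd.DataFrame({
    'Outlook': ['Rainy', 'Rainy', 'Overcast', 'Sunny'],
    'Temp': ['Hot', 'Hot', 'Hot', 'Mild'],
    'Windy': [False, True, False, False],
})

first_two_rows = golf_sample.iloc[:2]      # rows at positions 0 and 1
picked_temps = golf_sample.iloc[[0, 3], 1] # column 1 (Temp) at positions 0 and 3
last_row = golf_sample.iloc[-1]            # negative positions count from the end

print(first_two_rows.shape)    # (2, 3)
print(picked_temps.tolist())   # ['Hot', 'Mild']
print(last_row['Outlook'])     # Sunny
```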
Label-based Selection¶
The second paradigm for attribute selection is the one followed by the loc operator: label-based selection. In this paradigm, it’s the data index value, not its position, which matters.
For example, to get the first entry in golf_dataset, we would now do the following:
golf_dataset.loc[0,'Outlook']
'Rainy'
iloc is conceptually simpler than loc because it ignores the dataset’s indices. When we use iloc we treat the dataset like a big matrix (a list of lists), one that we have to index into by position. loc, by contrast, uses the information in the indices to do its work. Since your dataset usually has meaningful indices, it’s usually easier to do things using loc instead.
For example, here’s one operation that’s much easier using loc: selecting two columns by name for rows 0 through 3 (a loc slice includes its end label):
golf_dataset.loc[:3, ['Outlook', 'Temp']]
| | Outlook | Temp |
|---|---|---|
| 0 | Rainy | Hot |
| 1 | Rainy | Hot |
| 2 | Overcast | Hot |
| 3 | Sunny | Mild |
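The output above has four rows because loc slices include the end label, whereas iloc slices exclude the end position, as Python lists do. The following sketch contrasts the two. The sales table and its product rows are made-up illustrative data; only the figures 400 and 515 echo the earlier Series example.

```python
import pandas as pd

# Hypothetical sales table with meaningful row labels.
sales = pd.DataFrame(
    {'2017': [400, 120, 75], '2018': [515, 130, 60]},
    index=['Product A', 'Product B', 'Product C'],
)

# iloc slices by position and excludes the end, like a Python list.
by_position = sales.iloc[0:2]
# loc slices by label and includes the end label.
by_label = sales.loc['Product A':'Product B']
print(by_position.equals(by_label))  # True

# With the default 0, 1, 2 index, loc[:1] therefore returns two rows.
default_index = sales.reset_index(drop=True)
rows_with_loc = len(default_index.loc[:1])
rows_with_iloc = len(default_index.iloc[:1])
print(rows_with_loc)   # 2
print(rows_with_iloc)  # 1
```

This difference is a common source of off-by-one surprises when switching between the two accessors.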
Summary functions in Pandas¶
Pandas provides many simple “summary functions” (not an official name) which restructure the data in some useful way. For example, consider the describe() method:
golf_dataset.describe()
| | Outlook | Temp | Humidity | Windy | Play Golf |
|---|---|---|---|---|---|
| count | 14 | 14 | 14 | 14 | 14 |
| unique | 3 | 3 | 2 | 2 | 2 |
| top | Rainy | Mild | High | False | Yes |
| freq | 5 | 6 | 7 | 8 | 9 |
We can also get the summary for an individual column, such as Temp:
golf_dataset.Temp.describe()
count 14
unique 3
top Mild
freq 6
Name: Temp, dtype: object
To see a list of unique values, we can use the unique() method:
golf_dataset.Temp.unique()
array(['Hot', 'Mild', 'Cool'], dtype=object)
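A natural companion to unique() is value_counts(), which is not covered in the text above but is a standard Series method: it reports how often each unique value occurs. The sketch below assumes the Temp column is typed in by hand from the golf table rather than loaded from the file.

```python
import pandas as pd

# Temp column copied by hand from the golf table above.
temp = pd.Series(['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
                  'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
                 name='Temp')

# value_counts() maps each unique value to its number of occurrences.
counts = temp.value_counts()
print(counts['Mild'])  # 6
print(counts['Hot'])   # 4
print(counts['Cool'])  # 4

# nunique() gives the number of distinct values.
print(temp.nunique())  # 3
```

The count of 6 for Mild matches the top and freq rows reported by describe().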
Dtypes¶
The data type for a column in a DataFrame or a Series is known as the dtype.
You can use the dtype property to grab the type of a specific column. For instance, we can get the dtype of the Windy column in the golf_dataset DataFrame:
golf_dataset.Windy.dtype
dtype('bool')
Alternatively, the dtypes property returns the dtype of every column in the DataFrame:
golf_dataset.dtypes
Outlook object
Temp object
Humidity object
Windy bool
Play Golf object
dtype: object
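A column's dtype can also be changed. As a hedged sketch going slightly beyond the text above, the standard astype() method returns a converted copy of a column. The three-row golf_sample table is a hypothetical stand-in for the full dataset.

```python
import pandas as pd

# Hypothetical three-row sample with a boolean column.
golf_sample = pd.DataFrame({
    'Outlook': ['Rainy', 'Overcast', 'Sunny'],
    'Windy': [False, True, False],
})

print(golf_sample.Windy.dtype)  # bool

# astype returns a converted copy; the original column keeps its dtype.
windy_as_int = golf_sample.Windy.astype('int64')
print(windy_as_int.tolist())    # [0, 1, 0]
print(windy_as_int.dtype)       # int64
```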
Further Reading¶
We’ve seen the pandas functions most commonly used in machine learning and data analysis. However, pandas provides many other functions as well.
To learn more, refer to the official pandas documentation at:
https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html