Pandas

Introduction

Pandas is an open-source Python library used for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

Installation using pip

# !pip install pandas

To use pandas, import it at the top of the file, conventionally under the alias pd:

import pandas as pd

Creating data

There are two core objects in pandas: the DataFrame and the Series.

DataFrame

A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.

For example, consider the following simple DataFrame:

pd.DataFrame({'Likes': [130, 121], 'Dislikes': [11, 2]})
   Likes  Dislikes
0    130        11
1    121         2

In this example, the “0, Likes” entry has the value of 130. The “0, Dislikes” entry has a value of 11, and so on.

DataFrame entries are not limited to integers. For instance, here’s a DataFrame whose values are strings:

pd.DataFrame({'Anonymous': ['I liked it.', 'It was great!'], 'Analyst': ['Looks good.', 'Informative']})
       Anonymous      Analyst
0    I liked it.  Looks good.
1  It was great!  Informative

We are using the pd.DataFrame() constructor to generate these DataFrame objects. The syntax for declaring a new one is a dictionary whose keys are the column names (Anonymous and Analyst in this example) and whose values are lists of entries. This is the standard way of constructing a new DataFrame, and the one you are most likely to encounter.

The dictionary-list constructor assigns values to the column labels, but just uses an ascending count from 0 (0, 1, 2, 3, …) for the row labels. Sometimes this is OK, but oftentimes we will want to assign these labels ourselves.

The list of row labels used in a DataFrame is known as an Index. We can assign values to it by using an index parameter in our constructor:

pd.DataFrame({'Anonymous': ['I liked it.', 'It was great!'], 
              'Analyst': ['Looks good.', 'Informative']},
             index=['Product A', 'Product B'])
               Anonymous      Analyst
Product A    I liked it.  Looks good.
Product B  It was great!  Informative
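
As a small sketch of the difference between default and custom row labels, the index attribute can be inspected on both kinds of DataFrame. The variable names default_labels and reviews are illustrative and do not come from the text above.

```python
import pandas as pd

# Without an index argument, rows are labelled 0, 1, 2, ...
default_labels = pd.DataFrame({'Likes': [130, 121], 'Dislikes': [11, 2]})

# With an index argument, rows carry the labels we supply.
reviews = pd.DataFrame(
    {'Anonymous': ['I liked it.', 'It was great!'],
     'Analyst': ['Looks good.', 'Informative']},
    index=['Product A', 'Product B'],
)

print(list(default_labels.index))  # row labels of the first DataFrame
print(list(reviews.index))         # row labels of the second DataFrame
print(list(reviews.columns))       # column labels come from the dict keys
```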

Series

A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list. And in fact you can create one with nothing more than a list:

pd.Series([1, 2, 3, 4, 5])
0    1
1    2
2    3
3    4
4    5
dtype: int64

A Series is, in essence, a single column of a DataFrame. So you can assign row labels to the Series the same way as before, using an index parameter. However, a Series does not have a column name; it only has one overall name:

pd.Series([400, 515, 605], index=['2017 Sales', '2018 Sales', '2019 Sales'], name='Product A')
2017 Sales    400
2018 Sales    515
2019 Sales    605
Name: Product A, dtype: int64
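
To illustrate the relationship between the two objects, the following sketch turns the Series above into a one-column DataFrame and then selects that column back out. The to_frame() method is not mentioned in the text above, and the variable names are illustrative.

```python
import pandas as pd

sales_a = pd.Series([400, 515, 605],
                    index=['2017 Sales', '2018 Sales', '2019 Sales'],
                    name='Product A')

# The Series name becomes the column name of the resulting DataFrame.
sales_frame = sales_a.to_frame()

# Selecting a single column of a DataFrame gives back a Series.
column = sales_frame['Product A']

print(type(sales_frame).__name__)
print(type(column).__name__)
print(column.name)
```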

Reading Data Files

Being able to create a DataFrame or Series by hand is handy. But, most of the time, we won’t actually be creating our own data by hand. Instead, we’ll be working with data that already exists.

Data can be stored in any of a number of different forms and formats. By far the most basic of these is the humble CSV file. When you open a CSV file you get something that looks like this:

Product A,Product B,Product C
30,21,9
35,34,1
41,11,11

So a CSV file is a table of values separated by commas. Hence the name: “Comma-Separated Values”, or CSV.
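
A minimal sketch of parsing such text with pd.read_csv(), assuming the CSV content is held in a string rather than a file. The sample is written with no trailing commas, since a trailing comma on the header line would be parsed as an extra, unnamed column. The names csv_text and products are illustrative.

```python
import io

import pandas as pd

# The three-column sample, held in memory instead of on disk.
csv_text = """Product A,Product B,Product C
30,21,9
35,34,1
41,11,11
"""

# read_csv accepts a file path or any file-like object, such as StringIO.
products = pd.read_csv(io.StringIO(csv_text))

print(products)
```

The first line of the text is used for the column names, and each later line becomes one row.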

Let’s now read a well-known dataset, the Golf dataset. We’ll use the pd.read_csv() function to read the data into a DataFrame:

golf_dataset = pd.read_csv("./Data/Pandas/golf-dataset.csv")

golf_dataset
     Outlook  Temp  Humidity  Windy  Play Golf
0      Rainy   Hot      High  False         No
1      Rainy   Hot      High   True         No
2   Overcast   Hot      High  False        Yes
3      Sunny  Mild      High  False        Yes
4      Sunny  Cool    Normal  False        Yes
5      Sunny  Cool    Normal   True         No
6   Overcast  Cool    Normal   True        Yes
7      Rainy  Mild      High  False         No
8      Rainy  Cool    Normal  False        Yes
9      Sunny  Mild    Normal  False        Yes
10     Rainy  Mild    Normal   True        Yes
11  Overcast  Mild      High   True        Yes
12  Overcast   Hot    Normal  False        Yes
13     Sunny  Mild      High   True         No

shape

We can use the shape attribute to check how large the resulting DataFrame is:

golf_dataset.shape
(14, 5)
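
If the CSV file is not at hand, the same table can be rebuilt in memory to follow along. This is a sketch that copies the 14 rows printed above. It assumes Windy holds Python booleans, and the name rows is illustrative.

```python
import pandas as pd

# One tuple per row, in the order shown in the printed table.
rows = [
    ('Rainy', 'Hot', 'High', False, 'No'),
    ('Rainy', 'Hot', 'High', True, 'No'),
    ('Overcast', 'Hot', 'High', False, 'Yes'),
    ('Sunny', 'Mild', 'High', False, 'Yes'),
    ('Sunny', 'Cool', 'Normal', False, 'Yes'),
    ('Sunny', 'Cool', 'Normal', True, 'No'),
    ('Overcast', 'Cool', 'Normal', True, 'Yes'),
    ('Rainy', 'Mild', 'High', False, 'No'),
    ('Rainy', 'Cool', 'Normal', False, 'Yes'),
    ('Sunny', 'Mild', 'Normal', False, 'Yes'),
    ('Rainy', 'Mild', 'Normal', True, 'Yes'),
    ('Overcast', 'Mild', 'High', True, 'Yes'),
    ('Overcast', 'Hot', 'Normal', False, 'Yes'),
    ('Sunny', 'Mild', 'High', True, 'No'),
]
golf_dataset = pd.DataFrame(
    rows, columns=['Outlook', 'Temp', 'Humidity', 'Windy', 'Play Golf'])

# shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = golf_dataset.shape
print(n_rows, n_cols)
```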

Indexing in Pandas

The indexing operator ([]) and attribute selection (for example, golf_dataset.Temp) are convenient because they work just like they do in the rest of the Python ecosystem, which makes them easy for a novice to pick up. However, pandas also has its own accessor operators, loc and iloc, and these are the recommended tools for more advanced operations.

Index-based Selection

Pandas indexing works in one of two paradigms. The first is index-based selection: selecting data based on its numerical position in the data. iloc follows this paradigm.

To select the first row of data in a DataFrame, we may use the following:

golf_dataset.iloc[0]
Outlook      Rainy
Temp           Hot
Humidity      High
Windy        False
Play Golf       No
Name: 0, dtype: object

Both loc and iloc are row-first, column-second. This is the opposite of the plain indexing operator, where golf_dataset['Outlook'][0] selects the column first and the row second.

This means that it’s marginally easier to retrieve rows and marginally harder to retrieve columns. To get a column with iloc, we can do the following:

golf_dataset.iloc[:,0]
0        Rainy
1        Rainy
2     Overcast
3        Sunny
4        Sunny
5        Sunny
6     Overcast
7        Rainy
8        Rainy
9        Sunny
10       Rainy
11    Overcast
12    Overcast
13       Sunny
Name: Outlook, dtype: object
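
The following sketch shows a few more position-based selections. To stay self-contained it uses only the first five rows of the table, stored under the illustrative name golf_head.

```python
import pandas as pd

# First five rows of the golf data, copied from the printed table.
golf_head = pd.DataFrame(
    [('Rainy', 'Hot', 'High', False, 'No'),
     ('Rainy', 'Hot', 'High', True, 'No'),
     ('Overcast', 'Hot', 'High', False, 'Yes'),
     ('Sunny', 'Mild', 'High', False, 'Yes'),
     ('Sunny', 'Cool', 'Normal', False, 'Yes')],
    columns=['Outlook', 'Temp', 'Humidity', 'Windy', 'Play Golf'])

# Row first, column second: row 0, column 0.
first_entry = golf_head.iloc[0, 0]

# The plain indexing operator goes the other way: column, then row.
same_entry = golf_head['Outlook'][0]

# Slices follow Python's convention: positions 0, 1 and 2, excluding 3.
first_three_outlook = golf_head.iloc[:3, 0]

# A list of positions picks out specific rows.
rows_zero_and_two = golf_head.iloc[[0, 2]]

# Negative positions count from the end, as with Python lists.
last_row = golf_head.iloc[-1]

print(first_entry, same_entry)
print(list(first_three_outlook))
print(last_row['Temp'])
```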

Label-based Selection

The second paradigm is label-based selection, which is the one followed by the loc operator. In this paradigm, it’s the data index value, not its position, which matters.

For example, to get the first entry in golf_dataset, we would now do the following:

golf_dataset.loc[0,'Outlook']
'Rainy'

iloc is conceptually simpler than loc because it ignores the dataset’s indices. When we use iloc we treat the dataset like a big matrix (a list of lists), one that we have to index into by position. loc, by contrast, uses the information in the indices to do its work. Since your dataset usually has meaningful indices, it’s usually easier to do things using loc instead.

For example, here’s one operation that’s much easier using loc:

golf_dataset.loc[:3, ['Outlook', 'Temp']]
    Outlook  Temp
0     Rainy   Hot
1     Rainy   Hot
2  Overcast   Hot
3     Sunny  Mild
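
One difference is worth keeping in mind when moving between the two accessors: a loc slice includes its end label, while an iloc slice excludes its end position. The sketch below shows this on the first five rows of the table, stored under the illustrative name golf_head.

```python
import pandas as pd

# First five rows of the golf data, copied from the printed table.
golf_head = pd.DataFrame(
    [('Rainy', 'Hot', 'High', False, 'No'),
     ('Rainy', 'Hot', 'High', True, 'No'),
     ('Overcast', 'Hot', 'High', False, 'Yes'),
     ('Sunny', 'Mild', 'High', False, 'Yes'),
     ('Sunny', 'Cool', 'Normal', False, 'Yes')],
    columns=['Outlook', 'Temp', 'Humidity', 'Windy', 'Play Golf'])

# loc: labels 0 through 3 inclusive, columns chosen by name.
by_label = golf_head.loc[:3, ['Outlook', 'Temp']]

# iloc: positions 0, 1, 2 only, columns chosen by position.
by_position = golf_head.iloc[:3, [0, 1]]

print(len(by_label))     # number of rows selected by loc
print(len(by_position))  # number of rows selected by iloc
```

With loc the columns are named directly, so the selection keeps working even if the column order changes.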

Summary functions in Pandas

Pandas provides many simple “summary functions” (not an official name) which restructure the data in some useful way. For example, consider the describe() method:

golf_dataset.describe()
        Outlook  Temp  Humidity  Windy  Play Golf
count        14    14        14     14         14
unique        3     3         2      2          2
top       Rainy  Mild      High  False        Yes
freq          5     6         7      8          9

We can also get a summary for an individual column, such as Temp:

golf_dataset.Temp.describe()
count       14
unique       3
top       Mild
freq         6
Name: Temp, dtype: object

To see a list of unique values we can use the unique() method:

golf_dataset.Temp.unique()
array(['Hot', 'Mild', 'Cool'], dtype=object)
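
A related method, value_counts(), is not covered above but complements unique() by reporting how often each value occurs. The sketch below applies it to the Temp column, rebuilt here as a standalone Series from the 14 rows of the printed table.

```python
import pandas as pd

# The Temp column, copied row by row from the printed table.
temp = pd.Series(
    ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
     'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
    name='Temp')

# unique() lists the distinct values in order of first appearance.
distinct = list(temp.unique())

# value_counts() maps each distinct value to its number of occurrences.
counts = temp.value_counts()

print(distinct)
print(counts['Mild'], counts['Hot'], counts['Cool'])
```

The count for Mild matches the freq row reported by describe() for this column.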

Dtypes

The data type for a column in a DataFrame or a Series is known as the dtype.

You can use the dtype property to grab the type of a specific column. For instance, we can get the dtype of the Windy column in the golf_dataset DataFrame:

golf_dataset.Windy.dtype
dtype('bool')

Alternatively, the dtypes property returns the dtype of every column in the DataFrame:

golf_dataset.dtypes
Outlook      object
Temp         object
Humidity     object
Windy          bool
Play Golf    object
dtype: object
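
A column can also be converted from one type to another. The sketch below uses the astype() method, which is not covered above, on the first five rows of the Windy and Play Golf columns. The variable names are illustrative.

```python
import pandas as pd

# First five rows of two columns, copied from the printed table.
golf_head = pd.DataFrame({
    'Windy': [False, True, False, False, False],
    'Play Golf': ['No', 'No', 'Yes', 'Yes', 'Yes'],
})

# Convert the boolean column to 64-bit integers (False -> 0, True -> 1).
windy_as_int = golf_head['Windy'].astype('int64')

# A comparison turns the Yes/No strings into True/False values.
played = golf_head['Play Golf'] == 'Yes'

print(golf_head['Windy'].dtype)
print(windy_as_int.dtype)
print(list(played))
```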

Further Reading

We’ve seen the pandas functions most commonly used in Machine Learning and Data Analysis. However, pandas provides a variety of other functions as well.

To explore further, you may refer to the official pandas documentation at:

https://pandas.pydata.org/docs/getting_started/intro_tutorials/index.html