Pandas¶

Introduction¶

Pandas is an open-source Python library used for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series.

Installation using pip¶

# !pip install pandas

To use pandas, import it at the top of the file:

import pandas as pd

Creating data¶

There are two core objects in pandas: the DataFrame and the Series.

DataFrame¶

A DataFrame is a table. It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.

For example, consider the following simple DataFrame:

pd.DataFrame({'Likes': [130, 121], 'Dislikes': [11, 2]})

	Likes	Dislikes
0	130	11
1	121	2

In this example, the “0, Likes” entry has the value of 130. The “0, Dislikes” entry has a value of 11, and so on.
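As a small illustration of reading individual entries, the sketch below rebuilds the same DataFrame and looks up two values by row label and column name using `loc`, which is introduced in the indexing section. The variable names `reactions`, `first_likes`, and `first_dislikes` are chosen here for illustration only.

```python
import pandas as pd

# Same data as the example above; the variable name is illustrative.
reactions = pd.DataFrame({'Likes': [130, 121], 'Dislikes': [11, 2]})

# Look up single entries by row label and column name.
first_likes = reactions.loc[0, 'Likes']        # row 0, column 'Likes'
first_dislikes = reactions.loc[0, 'Dislikes']  # row 0, column 'Dislikes'

print(first_likes, first_dislikes)  # 130 11
```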

DataFrame entries are not limited to integers. For instance, here’s a DataFrame whose values are strings:

pd.DataFrame({'Anonymous': ['I liked it.', 'It was great!'], 'Analyst': ['Looks good.', 'Informative']})

	Anonymous	Analyst
0	I liked it.	Looks good.
1	It was great!	Informative

We are using the pd.DataFrame() constructor to generate these DataFrame objects. The argument is a dictionary whose keys are the column names (Anonymous and Analyst in this example) and whose values are lists of entries. This is the standard way of constructing a new DataFrame, and the one you are most likely to encounter.

The dictionary-list constructor assigns values to the column labels, but just uses an ascending count from 0 (0, 1, 2, 3, …) for the row labels. Sometimes this is OK, but oftentimes we will want to assign these labels ourselves.

The list of row labels used in a DataFrame is known as an Index. We can assign values to it by using an index parameter in our constructor:

pd.DataFrame({'Anonymous': ['I liked it.', 'It was great!'], 
              'Analyst': ['Looks good.', 'Informative']},
             index=['Product A', 'Product B'])

	Anonymous	Analyst
Product A	I liked it.	Looks good.
Product B	It was great!	Informative
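To check what the index parameter did, one option is to inspect the `index` attribute and then use a row label for a lookup. This is a minimal sketch; the variable names `reviews` and `row_labels` are illustrative and not part of the original example.

```python
import pandas as pd

reviews = pd.DataFrame(
    {'Anonymous': ['I liked it.', 'It was great!'],
     'Analyst': ['Looks good.', 'Informative']},
    index=['Product A', 'Product B'],
)

# The row labels are stored in the DataFrame's index.
row_labels = list(reviews.index)
print(row_labels)  # ['Product A', 'Product B']

# A row label can now be used to look up an entry.
print(reviews.loc['Product B', 'Analyst'])  # Informative
```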

Series¶

A Series, by contrast, is a sequence of data values. If a DataFrame is a table, a Series is a list. And in fact you can create one with nothing more than a list:

pd.Series([1, 2, 3, 4, 5])

0    1
1    2
2    3
3    4
4    5
dtype: int64

A Series is, in essence, a single column of a DataFrame. So you can assign row labels to the Series the same way as before, using an index parameter. However, a Series does not have a column name; it only has one overall name:

pd.Series([400, 515, 605], index=['2017 Sales', '2018 Sales', '2019 Sales'], name='Product A')

2017 Sales    400
2018 Sales    515
2019 Sales    605
Name: Product A, dtype: int64
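The relationship between a Series name and a column name can be seen by converting the Series into a one-column DataFrame. The sketch below assumes nothing beyond the example above; `sales` and `sales_frame` are illustrative names.

```python
import pandas as pd

sales = pd.Series([400, 515, 605],
                  index=['2017 Sales', '2018 Sales', '2019 Sales'],
                  name='Product A')

print(sales.name)           # Product A
print(sales['2018 Sales'])  # 515

# Converting to a DataFrame turns the Series name into the column name.
sales_frame = sales.to_frame()
print(list(sales_frame.columns))  # ['Product A']
```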

Reading Data Files¶

Being able to create a DataFrame or Series by hand is handy. But, most of the time, we won’t actually be creating our own data by hand. Instead, we’ll be working with data that already exists.

Data can be stored in any of a number of different forms and formats. By far the most basic of these is the humble CSV file. When you open a CSV file you get something that looks like this:

Product A,Product B,Product C
30,21,9
35,34,1
41,11,11

So a CSV file is a table of values separated by commas. Hence the name: “Comma-Separated Values”, or CSV.
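To try this without a file on disk, the sample values can be parsed from a string. This is a minimal sketch that wraps the text in `io.StringIO` so that `pd.read_csv()` can treat it like a file; the names `csv_text` and `products` are illustrative, and the trailing commas are left out so that every row has exactly three fields.

```python
import io

import pandas as pd

# The sample CSV content, written without trailing commas.
csv_text = "Product A,Product B,Product C\n30,21,9\n35,34,1\n41,11,11\n"

# StringIO makes the string behave like an open file.
products = pd.read_csv(io.StringIO(csv_text))

print(products.shape)                  # (3, 3)
print(products['Product B'].tolist())  # [21, 34, 11]
```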

Let’s now read a well-known dataset, the Golf Dataset. We’ll use the pd.read_csv() function to read the data into a DataFrame:

golf_dataset = pd.read_csv("./Data/Pandas/golf-dataset.csv")

golf_dataset

	Outlook	Temp	Humidity	Windy	Play Golf
0	Rainy	Hot	High	False	No
1	Rainy	Hot	High	True	No
2	Overcast	Hot	High	False	Yes
3	Sunny	Mild	High	False	Yes
4	Sunny	Cool	Normal	False	Yes
5	Sunny	Cool	Normal	True	No
6	Overcast	Cool	Normal	True	Yes
7	Rainy	Mild	High	False	No
8	Rainy	Cool	Normal	False	Yes
9	Sunny	Mild	Normal	False	Yes
10	Rainy	Mild	Normal	True	Yes
11	Overcast	Mild	High	True	Yes
12	Overcast	Hot	Normal	False	Yes
13	Sunny	Mild	High	True	No

`shape`¶

We can use the shape attribute to check how large the resulting DataFrame is:

golf_dataset.shape

(14, 5)
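Because `shape` is a plain `(rows, columns)` tuple, it can be unpacked directly. The sketch below uses a hand-built frame holding only the first five rows of the golf data, since the CSV file itself is not assumed to be available; `golf_sample`, `n_rows`, and `n_cols` are illustrative names.

```python
import pandas as pd

# First five rows of the golf data, typed in by hand for illustration.
golf_sample = pd.DataFrame({
    'Outlook': ['Rainy', 'Rainy', 'Overcast', 'Sunny', 'Sunny'],
    'Temp': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool'],
    'Humidity': ['High', 'High', 'High', 'High', 'Normal'],
    'Windy': [False, True, False, False, False],
    'Play Golf': ['No', 'No', 'Yes', 'Yes', 'Yes'],
})

# shape is a tuple: (number of rows, number of columns).
n_rows, n_cols = golf_sample.shape
print(n_rows, n_cols)  # 5 5
```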

`head`¶

We can examine the contents of the resultant DataFrame using the head() command, which grabs the first five rows:

golf_dataset.head()

	Outlook	Temp	Humidity	Windy	Play Golf
0	Rainy	Hot	High	False	No
1	Rainy	Hot	High	True	No
2	Overcast	Hot	High	False	Yes
3	Sunny	Mild	High	False	Yes
4	Sunny	Cool	Normal	False	Yes
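`head()` also accepts the number of rows to return. The following sketch uses a throwaway ten-row frame (the name `numbers` is illustrative) to compare the default with an explicit count.

```python
import pandas as pd

# A small frame with ten rows, used only to demonstrate head().
numbers = pd.DataFrame({'n': range(10)})

default_head = numbers.head()   # first five rows by default
short_head = numbers.head(3)    # first three rows

print(len(default_head))         # 5
print(len(short_head))           # 3
print(short_head['n'].tolist())  # [0, 1, 2]
```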

In Python, we can access the property of an object by accessing it as an attribute. A book object, for example, might have a title property, which we can access by calling book.title. Columns in a pandas DataFrame work in much the same way.

So, to access the Temp column of golf_dataset, we can use:

golf_dataset.Temp

0      Hot
1      Hot
2      Hot
3     Mild
4     Cool
5     Cool
6     Cool
7     Mild
8     Cool
9     Mild
10    Mild
11    Mild
12     Hot
13    Mild
Name: Temp, dtype: object
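Attribute access only works when the column name is a valid Python identifier. A column such as Play Golf, whose name contains a space, has to be selected with the indexing operator instead. The sketch below shows both forms on a small hand-built frame; `golf_sample` and the other variable names are illustrative.

```python
import pandas as pd

# A hand-built subset of the golf data, for illustration only.
golf_sample = pd.DataFrame({
    'Temp': ['Hot', 'Mild', 'Cool'],
    'Play Golf': ['No', 'Yes', 'Yes'],
})

# Attribute access and the indexing operator return the same column.
temp_attr = golf_sample.Temp
temp_item = golf_sample['Temp']
print(temp_attr.equals(temp_item))  # True

# A name with a space cannot be written as an attribute,
# so the indexing operator is required here.
play = golf_sample['Play Golf']
print(play.tolist())  # ['No', 'Yes', 'Yes']
```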

Indexing in Pandas¶

The indexing operator and attribute selection are convenient because they work just like they do in the rest of the Python ecosystem, which makes them easy for a novice to pick up and use. However, pandas has its own accessor operators, loc and iloc. For more advanced operations, these are the ones you’re supposed to be using.

Index-based Selection¶

Pandas indexing works in one of two paradigms. The first is index-based selection: selecting data based on its numerical position in the data. iloc follows this paradigm.

To select the first row of data in a DataFrame, we may use the following:

golf_dataset.iloc[0]

Outlook      Rainy
Temp           Hot
Humidity      High
Windy        False
Play Golf       No
Name: 0, dtype: object

Both loc and iloc are row-first, column-second. This is the opposite of plain DataFrame indexing such as golf_dataset['Temp'][0], which is column-first, row-second.

This means that it’s marginally easier to retrieve rows, and marginally harder to retrieve columns. To get a column with iloc, we can do the following:

golf_dataset.iloc[:,0]

0        Rainy
1        Rainy
2     Overcast
3        Sunny
4        Sunny
5        Sunny
6     Overcast
7        Rainy
8        Rainy
9        Sunny
10       Rainy
11    Overcast
12    Overcast
13       Sunny
Name: Outlook, dtype: object
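A few more positional selections may help make the row-first, column-second order concrete. This sketch uses a hand-built four-row subset of the golf data (`golf_sample` is an illustrative name); like ordinary Python slices, iloc slices exclude the end position and accept negative positions.

```python
import pandas as pd

# Hand-built subset of the golf data, for illustration only.
golf_sample = pd.DataFrame({
    'Outlook': ['Rainy', 'Rainy', 'Overcast', 'Sunny'],
    'Temp': ['Hot', 'Hot', 'Hot', 'Mild'],
    'Humidity': ['High', 'High', 'High', 'High'],
})

first_row = golf_sample.iloc[0]      # row at position 0
first_col = golf_sample.iloc[:, 0]   # all rows, column at position 0
top_left = golf_sample.iloc[:2, :2]  # first two rows, first two columns
last_row = golf_sample.iloc[-1]      # negative positions count from the end

print(first_row['Outlook'])  # Rainy
print(first_col.tolist())    # ['Rainy', 'Rainy', 'Overcast', 'Sunny']
print(top_left.shape)        # (2, 2)
print(last_row['Temp'])      # Mild
```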

Label-based Selection¶

The second paradigm for attribute selection is the one followed by the loc operator: label-based selection. In this paradigm, it’s the data index value, not its position, which matters.

For example, to get the first entry in golf_dataset, we would now do the following:

golf_dataset.loc[0,'Outlook']

'Rainy'

iloc is conceptually simpler than loc because it ignores the dataset’s indices. When we use iloc we treat the dataset like a big matrix (a list of lists), one that we have to index into by position. loc, by contrast, uses the information in the indices to do its work. Since your dataset usually has meaningful indices, it’s usually easier to do things using loc instead.

For example, here’s one operation that’s much easier using loc:

golf_dataset.loc[:3, ['Outlook', 'Temp']]

	Outlook	Temp
0	Rainy	Hot
1	Rainy	Hot
2	Overcast	Hot
3	Sunny	Mild
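Note that the output above has four rows, 0 through 3. A loc slice includes its end label, whereas an iloc slice excludes its end position. The sketch below contrasts the two on a hand-built five-row frame with the default 0, 1, 2, … index; `golf_sample`, `by_label`, and `by_position` are illustrative names.

```python
import pandas as pd

# Hand-built subset of the golf data, for illustration only.
golf_sample = pd.DataFrame({
    'Outlook': ['Rainy', 'Rainy', 'Overcast', 'Sunny', 'Sunny'],
    'Temp': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool'],
})

by_label = golf_sample.loc[:3, ['Outlook', 'Temp']]  # labels 0..3, inclusive
by_position = golf_sample.iloc[:3, [0, 1]]           # positions 0..2

print(len(by_label))     # 4
print(len(by_position))  # 3
```

The difference matters most when the index is not the default range, for example when the row labels are strings or dates.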

Summary functions in Pandas¶

Pandas provides many simple “summary functions” (not an official name) which restructure the data in some useful way. For example, consider the describe() method:

golf_dataset.describe()

	Outlook	Temp	Humidity	Windy	Play Golf
count	14	14	14	14	14
unique	3	3	2	2	2
top	Rainy	Mild	High	False	Yes
freq	5	6	7	8	9

We can also get the summary for an individual column, such as Temp:

golf_dataset.Temp.describe()

count       14
unique       3
top       Mild
freq         6
Name: Temp, dtype: object

To see a list of unique values we can use the unique() function:

golf_dataset.Temp.unique()

array(['Hot', 'Mild', 'Cool'], dtype=object)
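A related method, `value_counts()`, reports how often each unique value occurs. It is not used elsewhere in these notes, so treat this as an optional sketch. The Temp values below are copied by hand from the output shown earlier, and the names `temp` and `counts` are illustrative.

```python
import pandas as pd

# The 14 Temp values from the golf data, typed in by hand.
temp = pd.Series(
    ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool',
     'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
    name='Temp',
)

# unique() keeps values in order of first appearance.
print(list(temp.unique()))  # ['Hot', 'Mild', 'Cool']

# value_counts() counts the occurrences of each value.
counts = temp.value_counts()
print(counts['Mild'], counts['Hot'], counts['Cool'])  # 6 4 4
```

The count of 6 for Mild agrees with the top and freq rows reported by describe().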

Dtypes¶

The data type for a column in a DataFrame or a Series is known as the dtype.

You can use the dtype property to grab the type of a specific column. For instance, we can get the dtype of the Windy column in the golf_dataset DataFrame:

golf_dataset.Windy.dtype

dtype('bool')

Alternatively, the dtypes property returns the dtype of every column in the DataFrame:

golf_dataset.dtypes

Outlook      object
Temp         object
Humidity     object
Windy          bool
Play Golf    object
dtype: object
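The dtype of each column is inferred from the values it holds. The sketch below builds a small frame by hand (the frame and its column names are illustrative) and checks the integer and boolean columns. String columns appear as object in the output above, although the exact dtype reported for strings may differ between pandas versions, so it is printed here rather than asserted.

```python
import pandas as pd

# A hand-built frame with one column of each kind, for illustration only.
mixed = pd.DataFrame({
    'Likes': [130, 121],        # integers
    'Windy': [False, True],     # booleans
    'Outlook': ['Rainy', 'Sunny'],  # strings
})

print(mixed['Likes'].dtype)  # int64
print(mixed['Windy'].dtype)  # bool

# The string column's dtype depends on the pandas version in use.
print(mixed['Outlook'].dtype)

# dtypes returns one entry per column.
print(len(mixed.dtypes))  # 3
```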
