The objective of this post is to present an intuitive overview of the features of the pandas DataFrame object. Minimum temperature data from 1901 to 2017, provided by data.gov.in, is used as an example.

1. What is pandas?

pandas is a Python library for data analysis. Its name is derived from "panel data". It provides rich data structures and tools for working with the structured data sets common in statistics and other fields. Its main data structure is the DataFrame.

2. Installing pandas

conda install pandas
  • If you have Anaconda installed, you can install pandas using the above command.

3. Running this example on Kaggle

4. Creating a DataFrame from Excel or CSV

import pandas as pd
temp = pd.read_excel('../input/temp.xls')
# temp = pd.read_csv('../input/temp.csv')
temp = temp.set_index(temp.YEAR)
  • First, we import the pandas library.
  • read_excel() and read_csv() both return a DataFrame object. Here we use read_excel() because the input file is an Excel file.
  • Every DataFrame has an index; in this case we want the YEAR column to be the index. set_index() returns a new DataFrame and doesn’t modify the existing one.
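
The behaviour of set_index() can be checked on a tiny stand-in table. This is only a sketch: the values below are made up and are not taken from the data.gov.in file. It passes the column name rather than the column itself, which also removes YEAR from the columns (passing temp.YEAR, as above, keeps it).

```python
import pandas as pd

# Hypothetical miniature of the temperature table (values are made up).
temp = pd.DataFrame({
    "YEAR": [1901, 1902, 1903],
    "ANNUAL": [19.1, 19.3, 19.0],
})

# set_index() returns a new DataFrame; `temp` itself is left untouched.
indexed = temp.set_index("YEAR")

print(list(temp.columns))     # the original still has YEAR as a column
print(list(indexed.columns))  # the new frame keeps only ANNUAL as a column
print(indexed.index.name)     # YEAR is now the index
```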

5. Glancing at the data

temp.head()
  • head() returns the first five rows of the data by default, along with the column headers.
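
head() also accepts the number of rows to return. A minimal sketch on made-up values (the frame below is hypothetical, not the real data set):

```python
import pandas as pd

# Hypothetical seven-year table indexed by YEAR (values are made up).
frame = pd.DataFrame(
    {"ANNUAL": [19.1, 19.3, 19.0, 19.4, 19.2, 19.5, 19.6]},
    index=pd.Index(range(1901, 1908), name="YEAR"),
)

first_default = frame.head()  # no argument: the first five rows
first_two = frame.head(2)     # explicit row count

print(first_default)
print(first_two)
```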

6. Statistical overview of the data

temp.describe()

  • describe() returns basic statistics for each numeric column: count, mean, std, min, quartiles and max.
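
The result of describe() is itself a DataFrame, so individual statistics can be looked up by label. A small sketch with made-up numbers:

```python
import pandas as pd

# Hypothetical two-column table (values are made up).
frame = pd.DataFrame({
    "JAN": [13.0, 14.0, 15.0],
    "FEB": [14.0, 16.0, 18.0],
})

summary = frame.describe()

# The statistics are the row labels; the original columns stay as columns.
print(summary.index.tolist())
print(summary.loc["mean", "JAN"])
```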

7. Finding the hottest year

temp['ANNUAL'].idxmax()
2016
  • idxmax() returns the index label of the row where the column value is at its maximum. Because YEAR is our index, calling idxmax() on the ANNUAL column directly gives the hottest year.
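
The difference between the maximum value and the label where it occurs can be seen in a short sketch; the three values below are made up for illustration.

```python
import pandas as pd

# Hypothetical ANNUAL column indexed by YEAR (values are made up).
annual = pd.Series(
    [19.1, 19.8, 19.4],
    index=pd.Index([2015, 2016, 2017], name="YEAR"),
    name="ANNUAL",
)

hottest_value = annual.max()    # the largest value itself
hottest_year = annual.idxmax()  # the index label where that value occurs

print(hottest_year, hottest_value)
```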

8. Visualizing annual minimum temperature over years

import matplotlib.pyplot as plt
x = temp.index
y = temp.ANNUAL
 
plt.scatter(x,y)
plt.show()

  • We import matplotlib for plotting.
  • The scatter plot shows the ANNUAL column against YEAR (the index).

9. Visualizing temperatures rise and fall (Mean Temp – Months)

mean_months = temp.loc[:,'JAN':'DEC'].mean()
plt.plot(mean_months.index, mean_months)
JAN    13.167009
FEB    14.656239
MAR    17.774872
APR    21.054274
MAY    23.233846
JUN    23.838291
JUL    23.718462
AUG    23.386838
SEP    22.228974
OCT    19.735299
NOV    16.255470
DEC    13.735641
dtype: float64

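  • loc is used to access values by label. Here we are accessing the columns from ‘JAN’ through ‘DEC’; unlike positional slices, label slices include both endpoints.
  • loc with a single column label, e.g. temp.loc[:, 'JAN'], returns a Series.
  • loc with a list of labels ([[]]) or a label slice, as used here, returns a DataFrame.
  • mean() averages each column by default, so the result is a Series with one mean value per month.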
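
The return types described above can be verified on a small frame. This is a sketch on made-up values; only the column names mirror the real data set.

```python
import pandas as pd

# Hypothetical two-year table (values are made up).
frame = pd.DataFrame(
    {
        "JAN": [13.0, 13.4],
        "FEB": [14.5, 14.9],
        "MAR": [17.6, 18.0],
        "ANNUAL": [19.1, 19.3],
    },
    index=pd.Index([1901, 1902], name="YEAR"),
)

one_column = frame.loc[:, "JAN"]    # single label -> Series
as_frame = frame.loc[:, ["JAN"]]    # list of labels -> DataFrame
months = frame.loc[:, "JAN":"MAR"]  # label slice -> DataFrame, MAR included

monthly_mean = months.mean()        # column-wise mean -> Series indexed by month

print(type(one_column).__name__, type(as_frame).__name__, type(months).__name__)
print(monthly_mean)
```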

10. Finding hottest seasons (1901-2017)

hottest_seasons = {'Winter' : temp['JAN-FEB'].idxmax(),
                   'Summer' : temp['MAR-MAY'].idxmax(),
                   'Monsoon': temp['JUN-SEP'].idxmax(),
                   'Autumn' : temp['OCT-DEC'].idxmax()}
print (hottest_seasons)
{'Winter': 2016, 'Summer': 2016, 'Monsoon': 2016, 'Autumn': 2017}
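
As an alternative to calling idxmax() once per column, DataFrame.idxmax() can be applied to all season columns at once. The sketch below uses made-up values, and the season_names mapping is introduced here purely for illustration.

```python
import pandas as pd

# Hypothetical seasonal columns for three years (values are made up).
frame = pd.DataFrame(
    {
        "JAN-FEB": [14.0, 15.2, 14.8],
        "MAR-MAY": [20.1, 21.0, 20.5],
        "JUN-SEP": [23.0, 23.4, 23.1],
        "OCT-DEC": [16.0, 16.2, 16.9],
    },
    index=pd.Index([2015, 2016, 2017], name="YEAR"),
)

# Illustrative mapping from column names to season names.
season_names = {
    "JAN-FEB": "Winter",
    "MAR-MAY": "Summer",
    "JUN-SEP": "Monsoon",
    "OCT-DEC": "Autumn",
}

# idxmax() on a DataFrame returns, for each column, the index label of its maximum.
hottest_seasons = frame.idxmax().rename(index=season_names).to_dict()
print(hottest_seasons)
```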

11. Finding the most extreme year

temp['DIFF'] = temp.loc[:,'JAN':'DEC'].max(axis=1) - temp.loc[:,'JAN':'DEC'].min(axis=1)
temp.DIFF.idxmax()
1921
  • Calculate max() and min() over the JAN to DEC columns for each row (axis=1)
  • Calculate difference = max – min for each row
  • Add the difference as a new DIFF column to the DataFrame
  • Call idxmax() on the DIFF column to get the year with the largest difference
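
The steps above can be traced on a two-row example where the result is easy to check by hand. The values are made up; only the approach mirrors the code above.

```python
import pandas as pd

# Hypothetical table with three monthly columns (values are made up).
frame = pd.DataFrame(
    {
        "JAN": [13.0, 12.0],
        "JUN": [23.5, 24.5],
        "DEC": [13.5, 13.0],
    },
    index=pd.Index([1920, 1921], name="YEAR"),
)

months = frame.loc[:, "JAN":"DEC"]

# axis=1 reduces across the columns, giving one value per row (year).
frame["DIFF"] = months.max(axis=1) - months.min(axis=1)

most_extreme_year = frame["DIFF"].idxmax()
print(frame["DIFF"])
print(most_extreme_year)
```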

12. Plotting Difference over Years

axes= plt.axes()
axes.set_ylim([5,15])
axes.set_xlim([1901,2017])
plt.plot(temp.index, temp.DIFF)

temp.DIFF.mean()
10.895128205128202

13. Looking into abnormal winters

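# Map each year to a {month: temperature} dictionary
year_dict = temp.loc[:,'JAN':'DEC'].to_dict(orient='index')

# For each year, keep the names of the four coldest months, coldest first
sorted_months = []
for key, value in year_dict.items():
    sorted_months.append(sorted(value, key=value.get)[:4])

# Store the four coldest months of each year as an unordered set
winter_set = []
for x in sorted_months:
    winter_set.append(set(x))
temp['WINTER'] = winter_set

# Most frequently observed list of four coldest months
winter_routine = max(sorted_months, key=sorted_months.count)

# Years whose set of coldest months differs from the usual one
temp.WINTER[temp.WINTER != set(winter_routine)]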
YEAR
1957    {FEB, JAN, MAR, DEC}
1976    {FEB, JAN, MAR, DEC}
1978    {FEB, JAN, MAR, DEC}
1979    {FEB, JAN, MAR, DEC}
Name: WINTER, dtype: object
  • Here, an abnormal winter is a year whose four coldest months differ in at least one month from the most commonly observed set of four coldest months.
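
The same idea can be sketched with nsmallest() and collections.Counter. This is an alternative under stated assumptions, not the code used above: it counts unordered sets (frozenset) rather than ordered lists, so two years with the same four months in a different order count as the same winter. The data is made up.

```python
from collections import Counter

import pandas as pd

# Hypothetical monthly values for four years (made up); 1957 has a cold March.
frame = pd.DataFrame(
    {
        "JAN": [13.0, 13.2, 12.9, 13.1],
        "FEB": [14.5, 14.4, 14.2, 14.6],
        "MAR": [17.5, 17.8, 15.0, 17.6],
        "JUN": [23.8, 23.9, 24.0, 23.7],
        "NOV": [16.0, 16.3, 16.5, 16.1],
        "DEC": [13.6, 13.7, 13.4, 13.5],
    },
    index=pd.Index([1955, 1956, 1957, 1958], name="YEAR"),
)

# For each year, the set of its four coldest months (order ignored).
coldest = {
    year: frozenset(row.nsmallest(4).index)
    for year, row in frame.iterrows()
}

# The most frequently observed set is treated as the usual winter.
usual_winter = Counter(coldest.values()).most_common(1)[0][0]

# Years whose coldest months differ from the usual set.
abnormal = {year: months for year, months in coldest.items() if months != usual_winter}

print(sorted(usual_winter))
print({year: sorted(months) for year, months in abnormal.items()})
```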

References

  1. pandas: a Foundational Python Library for Data Analysis and Statistics, Wes McKinney
  2. Monthly, Seasonal and Annual Mean Temp Series from 1901 to 2017
