Pandas - DataFrame









Data Frames

In my previous blog, I talked about using the Series Data Structure in Pandas. Feel free to have a  look at that blog.
This blog post will be a brief walkthrough of Data Frames. Data Frame is one of the most widely used tool in python
by Data Scientists/Analyst.
A DataFrame can be considered as a data table. Data Frame provides a wide range of functionality like
  • filtering
  • grouping
  • sorting
  • joins
  • merging

Let's start with creating a data frame
In [7]:
#importing pandas and DataFrame
import pandas as pd
from pandas import DataFrame

#constructor to create a data frame
#df=DataFrame( data, index, columns, dtype, copy)
#Lists, dict, Series, Numpy ndarrays, Another DataFrame

#creating an empty data frame
df=DataFrame()

#creating a data frame form an array
df=DataFrame([1,2,3,4,5,5])
print df
   0
0  1
1  2
2  3
3  4
4  5
5  5
In [8]:
# creating a dataframe with an array of array
# each inner array represents a row
d2=[[1,2,3,4],
    ['a','b','c','d']]
df= DataFrame(d2)
print df
   0  1  2  3
0  1  2  3  4
1  a  b  c  d
In [9]:
#creating a datframe from a dict
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data,columns=['Age','Name','a'])
print df
   Age   Name    a
0   28    Tom  NaN
1   34   Jack  NaN
2   29  Steve  NaN
3   42  Ricky  NaN
In [10]:
# creating a df from a list of dictionaries
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print df
   a   b     c
0  1   2   NaN
1  5  10  20.0
Now that we know how to create a dataframe. Let's try some functions and statistics with Data Frames
In [11]:
#create a data frame
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data,columns=['Age','Name','a'])
print df
   Age   Name    a
0   28    Tom  NaN
1   34   Jack  NaN
2   29  Steve  NaN
3   42  Ricky  NaN
In [12]:
#understand the data
#gives brief statistics about the data like mean,count,standard deviation, percentile limits
print df.describe()
             Age
count   4.000000
mean   33.250000
std     6.396614
min    28.000000
25%    28.750000
50%    31.500000
75%    36.000000
max    42.000000
In [14]:
#gives information about the data frame like numbr of columns,adta types memory usage
print df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
Age     4 non-null int64
Name    4 non-null object
a       0 non-null object
dtypes: int64(1), object(2)
memory usage: 168.0+ bytes
None
In [15]:
#to get all the columns
print df.columns
print "---------"
# to print all the values
print df.values
print "---------"
# to print the index
print df.index
Index([u'Age', u'Name', u'a'], dtype='object')
---------
[[28 'Tom' nan]
 [34 'Jack' nan]
 [29 'Steve' nan]
 [42 'Ricky' nan]]
---------
RangeIndex(start=0, stop=4, step=1)

Indexing in DataFrame

In [17]:
# selecting a column from the dataframe
print df['Name']
print "----------"

#selecting multiple columns
print df[['Name','Age']]
print "-----------"

# adding a new column
df['Gender']=['M','M','M','M']
print df
print "------------"

#deleting a column from data frame
del df['Gender']
# or df.pop('a')
0      Tom
1     Jack
2    Steve
3    Ricky
Name: Name, dtype: object
----------
    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42
-----------
   Age   Name    a Gender
0   28    Tom  NaN      M
1   34   Jack  NaN      M
2   29  Steve  NaN      M
3   42  Ricky  NaN      M
------------

For selecting values from a DataFrame we have 3 functions

  • iloc = For accessing the data frame with integer indexing
  • loc = For accessing the dataframe with label indexing
  • ix = For accessing teh dataframe with mixture of integer and label indexing
In [24]:
# selecting a row by integer index
df.iloc[1,1]
Out[24]:
'Jack'
In [27]:
# selecting a row by label index
df.loc[1,"Name"]
Out[27]:
'Jack'
In [29]:
# selecting a row by integer index
df.ix[3,1]
Out[29]:
'Ricky'

Iterating over the Data Frame Rows

In [39]:
for row in df.iterrows():
    #returns the tuples with (label,row) mapping
    #check the data type of tuple keys
    print "return type of iterrows ",type(row[0])
    print "values of tupe are of type ",type(row[1])
    break
    
for row in df.iterrows():   
    print "Name is : ",row[1]['Name']
    print "Age is : ",row[1]['Age']
return type of iterrows  <type 'numpy.int64'>
values of tupe are of type  <class 'pandas.core.series.Series'>
Name is :  Tom
Age is :  28
Name is :  Jack
Age is :  34
Name is :  Steve
Age is :  29
Name is :  Ricky
Age is :  42
In [71]:
# returns the columns with values in columns
for row in df.iteritems():
    print row[0]
    print row[1]
Age
rank1    28
rank3    34
rank2    29
rank4    42
Name: Age, dtype: int64
Name
rank1      Tom
rank3     Jack
rank2    Steve
rank4    Ricky
Name: Name, dtype: object
a
rank1    NaN
rank3    NaN
rank2    NaN
rank4    NaN
Name: a, dtype: object
In [78]:
# gives rows as tuples
for row in df.itertuples():
    print row
Pandas(Index='rank1', Age=28, Name='Tom', a=nan)
Pandas(Index='rank3', Age=34, Name='Jack', a=nan)
Pandas(Index='rank2', Age=29, Name='Steve', a=nan)
Pandas(Index='rank4', Age=42, Name='Ricky', a=nan)

Reindexing a DataFrame

Reindexing a dataframe creates a new Dataframe with the desired index from the orignal dataframe
In [43]:
print df
df1=df.reindex(index=[1,2],columns=['Name','Age'])
df1
   Age   Name    a
0   28    Tom  NaN
1   34   Jack  NaN
2   29  Steve  NaN
3   42  Ricky  NaN
Out[43]:
NameAge
1Jack34
2Steve29
In [46]:
#changing the index values of a Data Frame
df.index=['rank1','rank3','rank2','rank4']
df
Out[46]:
AgeNamea
rank128TomNaN
rank334JackNaN
rank229SteveNaN
rank442RickyNaN

Sorting a DataFrame

Sorting can be done in 2 ways
  • Sort by indexes
  • Sort y values
In [82]:
import numpy as np
unsorted_df=pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns=['col2','col1'])
#sorting can be done in 
#sorting  based on index
print unsorted_df.sort_index()
print "-----------------------------"
print unsorted_df.sort_index(ascending=False)
       col2      col1
0  0.991140 -1.203227
1 -1.335899  1.494410
2  0.182436  0.167360
3  0.379386 -0.028920
4 -0.076693  1.179620
5  0.418442 -0.282703
6 -0.136885  0.812825
7  0.808891  0.336993
8  2.177074  1.805119
9  1.240525  0.104993
-----------------------------
       col2      col1
9  1.240525  0.104993
8  2.177074  1.805119
7  0.808891  0.336993
6 -0.136885  0.812825
5  0.418442 -0.282703
4 -0.076693  1.179620
3  0.379386 -0.028920
2  0.182436  0.167360
1 -1.335899  1.494410
0  0.991140 -1.203227
In [86]:
#sorting based on values
#meathod takes a column name or list of columns as input
unsorted_df.sort_values('col1')
unsorted_df.sort_values(['col1','col2'])
Out[86]:
col2col1
00.991140-1.203227
50.418442-0.282703
30.379386-0.028920
91.2405250.104993
20.1824360.167360
70.8088910.336993
6-0.1368850.812825
4-0.0766931.179620
1-1.3358991.494410
82.1770741.805119
In [87]:
#### Statistical Functions with DataFrame
In [95]:
#calculate the percent change on rolling window of 1
print unsorted_df.pct_change(1)
print "---------------"
print "correlation matrix"
print unsorted_df.corr()
       col2       col1
1       NaN        NaN
4 -0.942591  -0.210645
6  0.784835  -0.310944
2 -2.332776  -0.794100
3  1.079551  -1.172798
5  0.102945   8.775505
9  1.964630  -1.371389
8  0.754962  16.192776
0 -0.544738  -1.666564
7 -0.183878  -1.280075
---------------
correlation matrix
          col2      col1
col2  1.000000 -0.157285
col1 -0.157285  1.000000
In [100]:
#rolling functions are window based functions which can be applied on a a set of rows
# calclulate the mean based on every 4 rows in a data frame

print df.rolling(window=2).mean()
print df.rolling(window=2).sum()

#applying custome function to adta frame
def fun(s):
    print s
    return s[0]
print df.rolling(window=2).agg(fun)
        Age   Name    a
rank1   NaN    Tom  NaN
rank3  31.0   Jack  NaN
rank2  31.5  Steve  NaN
rank4  35.5  Ricky  NaN
        Age   Name    a
rank1   NaN    Tom  NaN
rank3  62.0   Jack  NaN
rank2  63.0  Steve  NaN
rank4  71.0  Ricky  NaN
[28. 34.]
[34. 29.]
[29. 42.]
        Age   Name    a
rank1   NaN    Tom  NaN
rank3  28.0   Jack  NaN
rank2  34.0  Steve  NaN
rank4  29.0  Ricky  NaN

Grouping a DataFrame

In [105]:
# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
#group by creates groups of the data frame rows
print df.groupby('Team').groups

#printing the group names
for name,group in df.groupby('Team'):
    print name
    print group
{'Kings': Int64Index([4, 6, 7], dtype='int64'), 'Devils': Int64Index([2, 3], dtype='int64'), 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'), 'Royals': Int64Index([9, 10], dtype='int64'), 'kings': Int64Index([5], dtype='int64')}
Devils
   Points  Rank    Team  Year
2     863     2  Devils  2014
3     673     3  Devils  2015
Kings
   Points  Rank   Team  Year
4     741     3  Kings  2014
6     756     1  Kings  2016
7     788     1  Kings  2017
Riders
    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
8      694     2  Riders  2016
11     690     2  Riders  2017
Royals
    Points  Rank    Team  Year
9      701     4  Royals  2014
10     804     1  Royals  2015
kings
   Points  Rank   Team  Year
5     812     4  kings  2015
In [108]:
#applying sum function in a dataframe
print df.groupby('Team').sum()

#applying sum function on a column
print df.groupby('Team')['Points'].sum()
#or
print df.groupby('Team')['Points'].agg(np.sum)
        Points  Rank  Year
Team                      
Devils    1536     5  4029
Kings     2285     5  6047
Riders    3049     7  8062
Royals    1505     5  4029
kings      812     4  2015
Team
Devils    1536
Kings     2285
Riders    3049
Royals    1505
kings      812
Name: Points, dtype: int64
Team
Devils    1536
Kings     2285
Riders    3049
Royals    1505
kings      812
Name: Points, dtype: int64
In [109]:
# applying multiple functions 
print df.groupby(['Rank'])['Points'].agg([np.sum, np.mean, np.std])
print df.groupby(['Rank','Team'])['Points'].agg([np.sum, np.mean, np.std])
       sum   mean        std
Rank                        
1     3224  806.0  50.754310
2     3036  759.0  83.070251
3     1414  707.0  48.083261
4     1513  756.5  78.488853
              sum        mean        std
Rank Team                               
1    Kings   1544  772.000000  22.627417
     Riders   876  876.000000        NaN
     Royals   804  804.000000        NaN
2    Devils   863  863.000000        NaN
     Riders  2173  724.333333  56.038677
3    Devils   673  673.000000        NaN
     Kings    741  741.000000        NaN
4    Royals   701  701.000000        NaN
     kings    812  812.000000        NaN


Hope the blog was helpful. Share with your fellow Python practitioners.


Comments

Popular posts from this blog

Spidering the web with Python

Word Vectorization

Machine Learning -Solution or Problem