Data Frames

In my previous post, I talked about using the Series data structure in pandas. Feel free to have a look at that post.

This blog post is a brief walkthrough of Data Frames. The DataFrame is one of the most widely used tools in Python among data scientists and analysts.

A DataFrame can be thought of as a data table. It provides a wide range of functionality, such as

filtering
grouping
sorting
joins
merging
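Filtering and merging are not shown later in this post, so here is a minimal sketch of both. The two small frames and the City values are made up for illustration, and the code uses Python 3 print() calls.

```python
import pandas as pd

# hypothetical sample data, invented for this illustration
people = pd.DataFrame({'Name': ['Tom', 'Jack', 'Steve', 'Ricky'],
                       'Age': [28, 34, 29, 42]})
cities = pd.DataFrame({'Name': ['Tom', 'Steve'],
                       'City': ['Pune', 'Delhi']})

# filtering: keep only the rows where Age is above 30
over_30 = people[people['Age'] > 30]
print(over_30)

# merging: inner join of the two frames on the shared 'Name' column
merged = pd.merge(people, cities, on='Name', how='inner')
print(merged)
```

Only the names present in both frames survive the inner join; how='left' would keep every row of people instead.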

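Let's start by creating a data frame. The cells below were run under Python 2, so they use the print statement; under Python 3, write print(df) instead.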

In [7]:

#importing pandas and DataFrame
import pandas as pd
from pandas import DataFrame

#constructor to create a data frame
#df=DataFrame(data, index, columns, dtype, copy)
#data can be a list, dict, Series, NumPy ndarray or another DataFrame

#creating an empty data frame
df=DataFrame()

#creating a data frame from a list
df=DataFrame([1,2,3,4,5,5])
print df

In [8]:

# creating a dataframe from a list of lists
# each inner list represents a row
d2=[[1,2,3,4],
    ['a','b','c','d']]
df= DataFrame(d2)
print df

   0  1  2  3
0  1  2  3  4
1  a  b  c  d

In [9]:

#creating a dataframe from a dict
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
#'a' is not a key in data, so that column is filled with NaN
df = pd.DataFrame(data,columns=['Age','Name','a'])
print df

   Age   Name    a
0   28    Tom  NaN
1   34   Jack  NaN
2   29  Steve  NaN
3   42  Ricky  NaN

In [10]:

# creating a df from a list of dictionaries
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print df

   a   b     c
0  1   2   NaN
1  5  10  20.0
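The constructor comment earlier also lists index, columns and dtype, which none of the cells above use. Here is a small sketch of those parameters; the row and column labels are made up for illustration, and the code is written for Python 3.

```python
import pandas as pd

# index sets the row labels, columns sets the column labels,
# dtype forces a single data type on every column
df_idx = pd.DataFrame([[1, 2], [3, 4]],
                      index=['row1', 'row2'],
                      columns=['x', 'y'],
                      dtype=float)
print(df_idx)
print(df_idx.dtypes)
```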

Now that we know how to create a dataframe, let's try some functions and statistics on it.

In [11]:

#create a data frame
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data,columns=['Age','Name','a'])
print df

   Age   Name    a
0   28    Tom  NaN
1   34   Jack  NaN
2   29  Steve  NaN
3   42  Ricky  NaN

In [12]:

#understand the data
#describe() gives summary statistics (count, mean, standard deviation, min, quartiles, max) for the numeric columns
print df.describe()

             Age
count   4.000000
mean   33.250000
std     6.396614
min    28.000000
25%    28.750000
50%    31.500000
75%    36.000000
max    42.000000

In [14]:

#gives information about the data frame like number of columns, data types and memory usage
#info() prints its report itself and returns None, hence the trailing None in the output
print df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 3 columns):
Age     4 non-null int64
Name    4 non-null object
a       0 non-null object
dtypes: int64(1), object(2)
memory usage: 168.0+ bytes
None

In [15]:

#to get all the columns
print df.columns
print "---------"
# to print all the values
print df.values
print "---------"
# to print the index
print df.index

Index([u'Age', u'Name', u'a'], dtype='object')
---------
[[28 'Tom' nan]
 [34 'Jack' nan]
 [29 'Steve' nan]
 [42 'Ricky' nan]]
---------
RangeIndex(start=0, stop=4, step=1)

Indexing in DataFrame

In [17]:

# selecting a column from the dataframe
print df['Name']
print "----------"

#selecting multiple columns
print df[['Name','Age']]
print "-----------"

# adding a new column
df['Gender']=['M','M','M','M']
print df
print "------------"

#deleting a column from the data frame
del df['Gender']
# or df.pop('Gender'), which also returns the removed column

0      Tom
1     Jack
2    Steve
3    Ricky
Name: Name, dtype: object
----------
    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42
-----------
   Age   Name    a Gender
0   28    Tom  NaN      M
1   34   Jack  NaN      M
2   29  Steve  NaN      M
3   42  Ricky  NaN      M
------------
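Adding a column with df[...] = ... and removing one with del both change the data frame in place. If you would rather get a new frame back, assign() and drop() do the same jobs without touching the original. A small sketch on a two-row sample (Python 3 syntax):

```python
import pandas as pd

df = pd.DataFrame({'Age': [28, 34], 'Name': ['Tom', 'Jack']})

# assign returns a new frame with the extra column; df is unchanged
with_gender = df.assign(Gender=['M', 'M'])

# drop returns a new frame without the listed columns
without_age = with_gender.drop(columns=['Age'])

# pop removes the column in place and returns it as a Series
popped = with_gender.pop('Gender')

print(df.columns.tolist())
print(without_age.columns.tolist())
print(with_gender.columns.tolist())
```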

To select values from a DataFrame, we have 3 indexers

iloc = for accessing the data frame by integer position
loc = for accessing the data frame by label
ix = for accessing the data frame with a mixture of integer and label indexing (deprecated in pandas 0.20 and removed in pandas 1.0; use loc or iloc instead)

In [24]:

# selecting a single value by integer position (row 1, column 1)
df.iloc[1,1]

Out[24]:

'Jack'

In [27]:

# selecting a single value by row label and column label
df.loc[1,"Name"]

Out[27]:

'Jack'

In [29]:

# selecting a single value with ix (here both arguments are integer positions)
df.ix[3,1]

Out[29]:

'Ricky'
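Since ix was removed in pandas 1.0, mixed access has to be spelled with loc or iloc on current versions. One way to do it, sketched here on the same sample data with the rank labels used later in the post, is to convert the label to a position with get_loc (Python 3 syntax):

```python
import pandas as pd

df = pd.DataFrame({'Age': [28, 34, 29, 42],
                   'Name': ['Tom', 'Jack', 'Steve', 'Ricky']},
                  index=['rank1', 'rank3', 'rank2', 'rank4'])

# purely positional: second row, second column
by_position = df.iloc[1, 1]

# purely label based
by_label = df.loc['rank3', 'Name']

# mixed, as ix used to allow: row by position, column by label
# get_loc turns the column label into its integer position
mixed = df.iloc[1, df.columns.get_loc('Name')]

# mixed the other way round: row by label, column by position
mixed_other = df.loc['rank3', df.columns[1]]

print(by_position, by_label, mixed, mixed_other)
```

All four expressions pick the same cell, so which one to use mostly depends on what you already have at hand, a label or a position.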

Iterating over the Data Frame Rows

In [39]:

for row in df.iterrows():
    #iterrows yields (index label, row) tuples
    #check the types of the two tuple elements
    print "index label is of type ",type(row[0])
    print "row values are of type ",type(row[1])
    break
    
for row in df.iterrows():   
    print "Name is : ",row[1]['Name']
    print "Age is : ",row[1]['Age']

index label is of type  <type 'numpy.int64'>
row values are of type  <class 'pandas.core.series.Series'>
Name is :  Tom
Age is :  28
Name is :  Jack
Age is :  34
Name is :  Steve
Age is :  29
Name is :  Ricky
Age is :  42

In [71]:

# iteritems yields (column name, column Series) pairs; it was removed in pandas 2.0, where df.items() does the same job
for row in df.iteritems():
    print row[0]
    print row[1]

Age
rank1    28
rank3    34
rank2    29
rank4    42
Name: Age, dtype: int64
Name
rank1      Tom
rank3     Jack
rank2    Steve
rank4    Ricky
Name: Name, dtype: object
a
rank1    NaN
rank3    NaN
rank2    NaN
rank4    NaN
Name: a, dtype: object

In [78]:

# gives rows as tuples
for row in df.itertuples():
    print row

Pandas(Index='rank1', Age=28, Name='Tom', a=nan)
Pandas(Index='rank3', Age=34, Name='Jack', a=nan)
Pandas(Index='rank2', Age=29, Name='Steve', a=nan)
Pandas(Index='rank4', Age=42, Name='Ricky', a=nan)
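On current pandas the same loops can be written as below. This is a sketch in Python 3 syntax on the sample data; items() stands in for iteritems(), and the namedtuples from itertuples() let you read fields as attributes.

```python
import pandas as pd

df = pd.DataFrame({'Age': [28, 34, 29, 42],
                   'Name': ['Tom', 'Jack', 'Steve', 'Ricky']})

# items() yields (column name, column as a Series) pairs
column_names = [name for name, column in df.items()]
print(column_names)

# itertuples() yields one namedtuple per row;
# index=False leaves the index label out of the tuple
lines = ['%s is %d' % (row.Name, row.Age)
         for row in df.itertuples(index=False)]
for line in lines:
    print(line)
```

itertuples() is usually the faster choice compared to iterrows(), because it does not build a Series for every row.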

Reindexing a DataFrame

Reindexing a dataframe creates a new DataFrame with the desired index from the original dataframe

In [43]:

print df
df1=df.reindex(index=[1,2],columns=['Name','Age'])
df1

   Age   Name    a
0   28    Tom  NaN
1   34   Jack  NaN
2   29  Steve  NaN
3   42  Ricky  NaN

Out[43]:

	Name	Age
1	Jack	34
2	Steve	29
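The example above only asks for labels that already exist. When reindex is given a label that is missing from the original frame, it adds a row of NaN for it, or a value of your choice through fill_value. A minimal sketch (Python 3 syntax; the label 10 is picked just because it does not exist):

```python
import pandas as pd

df = pd.DataFrame({'Age': [28, 34, 29, 42],
                   'Name': ['Tom', 'Jack', 'Steve', 'Ricky']})

# label 10 is not in df, so its row is filled with NaN
with_gap = df.reindex(index=[1, 2, 10], columns=['Name', 'Age'])
print(with_gap)

# fill_value replaces NaN for the missing labels
filled = df.reindex(index=[1, 2, 10], columns=['Name', 'Age'], fill_value=0)
print(filled)
```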

In [46]:

#changing the index values of a Data Frame
df.index=['rank1','rank3','rank2','rank4']
df

Out[46]:

	Age	Name	a
rank1	28	Tom	NaN
rank3	34	Jack	NaN
rank2	29	Steve	NaN
rank4	42	Ricky	NaN

Sorting a DataFrame

Sorting can be done in 2 ways

Sort by index
Sort by values

In [82]:

import numpy as np
unsorted_df=pd.DataFrame(np.random.randn(10,2),index=[1,4,6,2,3,5,9,8,0,7],columns=['col2','col1'])
#sorting based on index (ascending by default)
print unsorted_df.sort_index()
print "-----------------------------"
print unsorted_df.sort_index(ascending=False)

       col2      col1
0  0.991140 -1.203227
1 -1.335899  1.494410
2  0.182436  0.167360
3  0.379386 -0.028920
4 -0.076693  1.179620
5  0.418442 -0.282703
6 -0.136885  0.812825
7  0.808891  0.336993
8  2.177074  1.805119
9  1.240525  0.104993
-----------------------------
       col2      col1
9  1.240525  0.104993
8  2.177074  1.805119
7  0.808891  0.336993
6 -0.136885  0.812825
5  0.418442 -0.282703
4 -0.076693  1.179620
3  0.379386 -0.028920
2  0.182436  0.167360
1 -1.335899  1.494410
0  0.991140 -1.203227

In [86]:

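#sorting based on values
#the method takes a column name or a list of column names as input
#only the last expression in a cell is displayed, so the output below is the two-column sort
unsorted_df.sort_values('col1')
unsorted_df.sort_values(['col1','col2'])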

Out[86]:

	col2	col1
0	0.991140	-1.203227
5	0.418442	-0.282703
3	0.379386	-0.028920
9	1.240525	0.104993
2	0.182436	0.167360
7	0.808891	0.336993
6	-0.136885	0.812825
4	-0.076693	1.179620
1	-1.335899	1.494410
8	2.177074	1.805119
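sort_values also takes an ascending argument, either one flag or one flag per column. Because the random numbers above differ on every run, this sketch uses a small fixed frame made up for illustration (Python 3 syntax):

```python
import pandas as pd

scores = pd.DataFrame({'col1': [2, 1, 2, 1],
                       'col2': [10, 40, 30, 20]})

# descending by a single column
desc = scores.sort_values('col2', ascending=False)
print(desc)

# col1 ascending, ties broken by col2 descending
mixed_order = scores.sort_values(['col1', 'col2'], ascending=[True, False])
print(mixed_order)
```

Both calls return a new frame and leave scores as it was.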

Statistical Functions with DataFrame

In [95]:

#calculate the percent change between each row and the row before it (periods=1)
print unsorted_df.pct_change(1)
print "---------------"
print "correlation matrix"
print unsorted_df.corr()

       col2       col1
1       NaN        NaN
4 -0.942591  -0.210645
6  0.784835  -0.310944
2 -2.332776  -0.794100
3  1.079551  -1.172798
5  0.102945   8.775505
9  1.964630  -1.371389
8  0.754962  16.192776
0 -0.544738  -1.666564
7 -0.183878  -1.280075
---------------
correlation matrix
          col2      col1
col2  1.000000 -0.157285
col1 -0.157285  1.000000
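To see what pct_change actually computes, it helps to redo it by hand on numbers that are easy to check. The values below are made up for illustration (Python 3 syntax):

```python
import pandas as pd

prices = pd.DataFrame({'a': [100.0, 110.0, 99.0],
                       'b': [1.0, 2.0, 3.0]})

# pct_change(1) is (current - previous) / previous for every row
changes = prices.pct_change(1)
manual = (prices - prices.shift(1)) / prices.shift(1)
print(changes)

# corr() gives the pairwise Pearson correlation between columns;
# y is exactly 2 * x here, so the two columns are perfectly correlated
xy = pd.DataFrame({'x': [1, 2, 3], 'y': [2, 4, 6]})
corr = xy.corr()
print(corr)
```

The first row of changes is NaN because there is no previous row to compare with, which is also why row 1 is NaN in the output above.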

In [100]:

#rolling functions are window-based functions applied to a sliding set of rows
#calculate the mean and the sum over every 2 consecutive rows of the data frame

print df.rolling(window=2).mean()
print df.rolling(window=2).sum()

#applying a custom function to the data frame
#each window is passed to fun as an array, so s[0] is the first value in the window
def fun(s):
    print s
    return s[0]
print df.rolling(window=2).agg(fun)

        Age   Name    a
rank1   NaN    Tom  NaN
rank3  31.0   Jack  NaN
rank2  31.5  Steve  NaN
rank4  35.5  Ricky  NaN
        Age   Name    a
rank1   NaN    Tom  NaN
rank3  62.0   Jack  NaN
rank2  63.0  Steve  NaN
rank4  71.0  Ricky  NaN
[28. 34.]
[34. 29.]
[29. 42.]
        Age   Name    a
rank1   NaN    Tom  NaN
rank3  28.0   Jack  NaN
rank2  34.0  Steve  NaN
rank4  29.0  Ricky  NaN
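The output above comes from an older pandas that passed the Name column through untouched. On recent versions it is safer to pick the numeric column first and roll over that. A sketch of the same mean and custom function on the Age values, written for Python 3; first_in_window is just a name chosen for this example:

```python
import pandas as pd

ages = pd.Series([28, 34, 29, 42],
                 index=['rank1', 'rank3', 'rank2', 'rank4'])

# mean over every 2 consecutive values; the first window is incomplete
rolling_mean = ages.rolling(window=2).mean()
print(rolling_mean)

def first_in_window(window):
    # with raw=True each window arrives as a NumPy array
    return window[0]

# apply runs the custom function once per complete window
firsts = ages.rolling(window=2).apply(first_in_window, raw=True)
print(firsts)
```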

Grouping a DataFrame

In [105]:

# import the pandas library
import pandas as pd
ipl_data = {'Team': ['Riders', 'Riders', 'Devils', 'Devils', 'Kings',
         'kings', 'Kings', 'Kings', 'Riders', 'Royals', 'Royals', 'Riders'],
         'Rank': [1, 2, 2, 3, 3,4 ,1 ,1,2 , 4,1,2],
         'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
         'Points':[876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
#group by creates groups of the data frame rows
print df.groupby('Team').groups

#printing each group's name followed by its rows
for name,group in df.groupby('Team'):
    print name
    print group

{'Kings': Int64Index([4, 6, 7], dtype='int64'), 'Devils': Int64Index([2, 3], dtype='int64'), 'Riders': Int64Index([0, 1, 8, 11], dtype='int64'), 'Royals': Int64Index([9, 10], dtype='int64'), 'kings': Int64Index([5], dtype='int64')}
Devils
   Points  Rank    Team  Year
2     863     2  Devils  2014
3     673     3  Devils  2015
Kings
   Points  Rank   Team  Year
4     741     3  Kings  2014
6     756     1  Kings  2016
7     788     1  Kings  2017
Riders
    Points  Rank    Team  Year
0      876     1  Riders  2014
1      789     2  Riders  2015
8      694     2  Riders  2016
11     690     2  Riders  2017
Royals
    Points  Rank    Team  Year
9      701     4  Royals  2014
10     804     1  Royals  2015
kings
   Points  Rank   Team  Year
5     812     4  kings  2015

In [108]:

#applying sum function in a dataframe
print df.groupby('Team').sum()

#applying sum function on a column
print df.groupby('Team')['Points'].sum()
#or
print df.groupby('Team')['Points'].agg(np.sum)

        Points  Rank  Year
Team                      
Devils    1536     5  4029
Kings     2285     5  6047
Riders    3049     7  8062
Royals    1505     5  4029
kings      812     4  2015
Team
Devils    1536
Kings     2285
Riders    3049
Royals    1505
kings      812
Name: Points, dtype: int64
Team
Devils    1536
Kings     2285
Riders    3049
Royals    1505
kings      812
Name: Points, dtype: int64

In [109]:

# applying multiple functions 
print df.groupby(['Rank'])['Points'].agg([np.sum, np.mean, np.std])
print df.groupby(['Rank','Team'])['Points'].agg([np.sum, np.mean, np.std])

       sum   mean        std
Rank                        
1     3224  806.0  50.754310
2     3036  759.0  83.070251
3     1414  707.0  48.083261
4     1513  756.5  78.488853
              sum        mean        std
Rank Team                               
1    Kings   1544  772.000000  22.627417
     Riders   876  876.000000        NaN
     Royals   804  804.000000        NaN
2    Devils   863  863.000000        NaN
     Riders  2173  724.333333  56.038677
3    Devils   673  673.000000        NaN
     Kings    741  741.000000        NaN
4    Royals   701  701.000000        NaN
     kings    812  812.000000        NaN
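Notice that 'kings' and 'Kings' end up as two separate groups, because group keys are case sensitive. If that was not intended, the names can be normalised before grouping. The sketch below uses a four-row subset of the data and passes the aggregation names as strings, which is the spelling newer pandas versions prefer over np.sum and np.mean (Python 3 syntax):

```python
import pandas as pd

ipl = pd.DataFrame({'Team': ['Kings', 'kings', 'Kings', 'Riders'],
                    'Points': [741, 812, 756, 876]})

# make the case consistent so 'kings' joins the 'Kings' group
ipl['Team'] = ipl['Team'].str.title()

# several aggregations at once, named as strings
summary = ipl.groupby('Team')['Points'].agg(['sum', 'mean'])
print(summary)
```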

Hope the blog was helpful. Share with your fellow Python practitioners.
