Hero image credits to Charles Deluvio on Unsplash.
Computing relative dates in pandas
I had to compute the relative differences of dates based by some other column value.
My first thought was that of course there’s a built in functionality for this in pandas, but to my surprise I couldn’t find any.
I came up with this solution.
Solution
The full notebook can be found here.
First, let’s create a DataFrame with same dates and IDs. (I’ll use the ID-values to group the dates.)
1import pandas as pd2import numpy as np3from pandas.util.testing import makeTimeDataFrame45# Dummy dataframe with date as index6df = makeTimeDataFrame()78# Add a random ID-column9df['ID'] = np.random.randint(1, 4, df.shape[0])10# Drop unnecessary columns11df = df.drop(['A', 'B', 'C', 'D'], axis = 1)1213# put the date in it's own column14df['date'] = df.index1516# reset the index17df = df.reset_index(drop = True)
Then the data needs to be sorted for the solution to work correctly.
1# sort the dataframe by the date-column2df = df.sort_values(by = ['date'])
The dataframe now looks like this:
The function do_date_differences
is the one that does the computing.
1# row is the target row2# col is the name of the column in the row that I want to "group by"3# date_col is the name of the column that has the date-formatted data4def do_date_differences(row, col, date_col = 'date'):5 first = df.loc[df[col].isin([row[col]])].iloc[0][date_col]6 second = row[date_col]7 diff = second - first8 return diff
The function works by finding out the first date occurence of the value in the given (grouper) column for the target row. Then it calculates the difference of the found (first) date and the date that’s in the date_col of the target row. The difference is returned so it can be added as a new column value to the row.
And then by using the dataframe’s apply-function, call the do_date_differences.
1# I want to find relative dates for all the rows in the dataframe, grouped by the ID-column2df['difference'] = df.apply(lambda row: do_date_differences(row, 'ID', 'date'), axis=1)
The dataframe now looks like this:
-Juho