Computing relative dates in pandas

Hero image credits to Charles Deluvio on Unsplash.

Computing relative dates in pandas

I had to compute the relative differences of dates based by some other column value.

My first thought was that of course there’s a built in functionality for this in pandas, but to my surprise I couldn’t find any.

I came up with this solution.

Solution

The full notebook can be found here.

First, let’s create a DataFrame with same dates and IDs. (I’ll use the ID-values to group the dates.)

1import pandas as pd
2import numpy as np
3from pandas.util.testing import makeTimeDataFrame
4
5# Dummy dataframe with date as index
6df = makeTimeDataFrame()
7
8# Add a random ID-column
9df['ID'] = np.random.randint(1, 4, df.shape[0])
10# Drop unnecessary columns
11df = df.drop(['A', 'B', 'C', 'D'], axis = 1)
12
13# put the date in it's own column
14df['date'] = df.index
15
16# reset the index
17df = df.reset_index(drop = True)

Then the data needs to be sorted for the solution to work correctly.

1# sort the dataframe by the date-column
2df = df.sort_values(by = ['date'])

The dataframe now looks like this:

The function do_date_differences is the one that does the computing.

1# row is the target row
2# col is the name of the column in the row that I want to "group by"
3# date_col is the name of the column that has the date-formatted data
4def do_date_differences(row, col, date_col = 'date'):
5    first = df.loc[df[col].isin([row[col]])].iloc[0][date_col]
6    second = row[date_col]
7    diff = second - first
8    return diff

The function works by finding out the first date occurence of the value in the given (grouper) column for the target row. Then it calculates the difference of the found (first) date and the date that’s in the date_col of the target row. The difference is returned so it can be added as a new column value to the row.

And then by using the dataframe’s apply-function, call the do_date_differences.

1# I want to find relative dates for all the rows in the dataframe, grouped by the ID-column
2df['difference'] = df.apply(lambda row: do_date_differences(row, 'ID', 'date'), axis=1)

The dataframe now looks like this:

-Juho

Computing relative dates in pandas

Computing relative dates in pandas

Solution

More articles from Juho Salli

From bug to fix to production

Getting CSS styles in a web page