Speed considertions¶

Below there are two notebooks I used for comparing speeds of several functions that I have developed iteratively. Generally speaking, the conclusions are:

Pandas-based functions are quick to develop and easy to understand
Numpy-based functions run significantly faster than pandas-based functions
Numba-compiled functions run significantly faster than Numpy-based functions

Optimal delay¶

01-jsg-Optimal_delay_speed

In [1]:

%load_ext autoreload
%autoreload 2

# Asegurarme de que estamos ejecutando Python en el virtual env
import sys
print(f'Python env: {sys.executable}')
import os
if os.path.basename(os.getcwd()) == "notebooks":
    os.chdir('..')
print(f'Working dir: {os.getcwd()}')

Python env: /Users/jsg/Documents/GitHub/bundle_notifications_ds/.venv_bundle_notifications/bin/python
Working dir: /Users/jsg/Documents/GitHub/bundle_notifications_ds

In [20]:

#from bundle_notifications.bundle_notifications import load_data, bundle_func,add_message
import numpy as np
import pandas as pd
import tabulate
pd.set_option('display.max_rows', 100)

import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'notebook' 

from bundle_notifications.bundle_notifications import create_message

from tqdm.notebook import trange, tqdm
%load_ext line_profiler
from numba import jit

The line_profiler extension is already loaded. To reload it, use:
  %reload_ext line_profiler

In [11]:

def random_dates(start = '2017-01-01 00:00:00', n=30):
    """
    Generate n number of numpy dates on the given day
    """
    return np.sort(np.datetime64(start) + np.random.randint(1, 24*60*60,n)).astype(int)

timestamp = random_dates('2017-08-01 00:00:00', n=10)

Speed comparison for delay() function¶

We compare the "no @jit" version with the implemented one

In [12]:

from bundle_notifications.optimal_delay import delay,total_delay_initial

def delay_nojit(t, x):
    """Calculates delay if notifications are sent at indexes indicated by notification_idx
    """
    p1 = t[x[0]]- t[:x[0]]
    p2 = t[x[1]]- t[x[0]+1:x[1]]
    p3 = t[x[2]]- t[x[1]+1:x[2]]
    p4 = t[-1] - t[(x[2]+1):]
    
    #print(f"{p1}\n{p2}\n{p3}\n{p4}")
    return  p1.sum() + p2.sum() + p3.sum() + p4.sum()

In [13]:

x = total_delay_initial(timestamp)
%timeit delay_nojit(timestamp,x)

17.1 µs ± 371 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

In [15]:

%timeit delay(timestamp,x)

1.14 µs ± 18.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

With a simple decorator the funcion runs more than 10x faster :) The first version I did of this function based on pd.Series was about 100x slower.

Compare the computing time and the total delay of the grouping functions¶

Below we compare the computational speed and total delay of three bundling methods:

total_delay_initial() used as initial solution, simply schedules the notifications equally distributed along the day
total_delay_brute() calculates all possible solutions where the notifications are sent, and returns the one that incurs in the least total delay
local_search() from a starting point, this method searches locally for an optimal solution

From the graph we observe that:

Calculating all possible solutions using brute force is very expensive. The computational time grows exponentially
The computing time of the local search is quite small
The total delay is obviously better for the brute force approach. The local search is somewhere in between the other two methods, and it achieves a good trade-off between speed and accuracy.

In [18]:

import time
from bundle_notifications.optimal_delay import total_delay_brute,local_search


### DO a plot: time one methods vs the other and also versus the loss
N = np.linspace(5, 100, num=15,dtype='int')

print(f"Sizes: {N}")
time_brute = [-1]*len(N)
delay_brute = [-1]*len(N)

time_initial = [-1]*len(N)
delay_initial = [-1]*len(N)

time_heuristic = [-1]*len(N)
delay_heuristic = [-1]*len(N)

for i,n in enumerate( tqdm(N)):
    
    # Randomly generate some timestamps
    timestamp = random_dates('2017-08-01 00:00:00', n)
    
    # initial
    start = time.time()
    x = total_delay_initial(timestamp)
    delay_initial[i] = delay(timestamp, x)
    time_initial[i] = time.time() - start
    
    # Brute force
    start = time.time()
    x  = total_delay_brute(timestamp)
    delay_brute[i] = delay(timestamp, x)
    time_brute[i] = time.time() - start

    # heuristic
    start = time.time()
    x = local_search(timestamp)
    delay_heuristic[i] = delay(timestamp, x)
    time_heuristic[i] = time.time() - start
    
    # Sanity check
    if delay_heuristic[i] < delay_brute[i]:
        print("Heuristic produces better results than exact method -> that cannot be!")
        break

Sizes: [  5  11  18  25  32  38  45  52  59  66  72  79  86  93 100]

In [21]:

## Make plot
fig = make_subplots(rows=2, cols=1)

fig.add_trace(go.Scatter(x=N, y=time_heuristic, name="Time heuristic"),row=1, col=1)
fig.add_trace(go.Scatter(x=N, y=time_brute,name="Time brute"),row=1, col=1)
fig.add_trace(go.Scatter(x=N, y=time_initial,name="Time initial"),row=1, col=1)
fig.update_yaxes(title_text="Elapsed seconds", row=1, col=1)
fig.update_xaxes(title_text="Size", row=1, col=1)


fig.add_trace(go.Scatter(x=N, y=delay_heuristic,name="Total delay heuristic (secs)"),row=2, col=1)
fig.add_trace(go.Scatter(x=N, y=delay_initial,name="Total initial (secs)"),row=2, col=1)
fig.add_trace(go.Scatter(x=N, y=delay_brute,name="Total brute (secs)"),row=2, col=1)
fig.update_yaxes(title_text="Total delay in ns", row=2, col=1)
fig.update_xaxes(title_text="Size", row=2, col=1)

fig.update_layout(height=600, width=800, title_text="Comparison of methods")
fig.show()
#plotly.offline.plot(fig, filename='local_search.html')

In [ ]:

Bundle functions¶

02-jsg-Bundle_speed_tests

In [1]:

%load_ext autoreload
%autoreload 2
%load_ext line_profiler

In [2]:

import sys
import os
if os.path.basename(os.getcwd()) == "notebooks":
    os.chdir('..')

from tqdm.notebook import trange, tqdm
    
import numpy as np
from numba import jit
import pandas as pd
pd.set_option('display.max_rows', 100)

import plotly.graph_objects as go
from plotly.subplots import make_subplots

from bundle_notifications.bundle_notifications import load_data,create_message
from bundle_notifications.optimal_delay import delay, local_search

1. Load data¶

We first load the whole dataset and then get a sample of it in order to do some speed tests

In [3]:

url = 'https://static-eu-komoot.s3.amazonaws.com/backend/challenge/notifications.csv'
df = load_data(url, nrows=15000)

df['timestamp_ns'] = df.timestamp.copy().astype("int")
df['dayofyear'] = df.timestamp.copy().dt.dayofyear
    
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 15000 entries, 0 to 14999
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   timestamp     15000 non-null  datetime64[ns]
 1   user_id       15000 non-null  object        
 2   friend_id     15000 non-null  object        
 3   friend_name   15000 non-null  object        
 4   timestamp_ns  15000 non-null  int64         
 5   dayofyear     15000 non-null  int64         
dtypes: datetime64[ns](1), int64(2), object(3)
memory usage: 703.2+ KB

In [4]:

# Sample for an ID
df_g = df.loc[(df.user_id == 'CFFEC5978B0A4A05FA6DCEFB2C82CC') & (df.dayofyear == 213),].copy()#.head(20)
df_g.head()

Out[4]:

	timestamp	user_id	friend_id	friend_name	timestamp_ns	dayofyear
3	2017-08-01 01:20:47	CFFEC5978B0A4A05FA6DCEFB2C82CC	2BB0471CAA78ED0FCEE143E175F034	Mona	1501550447000000000	213
25	2017-08-01 02:28:27	CFFEC5978B0A4A05FA6DCEFB2C82CC	2BB0471CAA78ED0FCEE143E175F034	Mona	1501554507000000000	213
29	2017-08-01 03:00:42	CFFEC5978B0A4A05FA6DCEFB2C82CC	74C09338D7CA031859AE26A1586692	Toomas	1501556442000000000	213
39	2017-08-01 03:51:05	CFFEC5978B0A4A05FA6DCEFB2C82CC	F039A0F7A3245F7B2D7BD0942F3680	Sean	1501559465000000000	213
99	2017-08-01 05:03:44	CFFEC5978B0A4A05FA6DCEFB2C82CC	DF6A386FE701217C2A12292DB8D142	Buse	1501563824000000000	213

Speed tests: add_notif_counter()¶

In [5]:

def notif_counter_pandas(df_g,x):
    df_g['notification_bool'] = False
    df_g['notification_bool'].iloc[x] = True

    # Now do a cumsum counter
    df_g['notification_counter'] = df_g.notification_bool.cumsum().shift()+1
    df_g['notification_counter'].iloc[0] = 1
    return df_g.notification_counter

@jit
def add_notif_counter_j(x,notification_counter):
    """ Equivalent to add_notif_counter but compilting it with jit
    """
    
    notification_counter[:x[0]+1] = 1
    notification_counter[x[0]+1:x[1]+1] = 2
    notification_counter[x[1]+1:x[2]+1] = 3
    notification_counter[(x[2]+1):] = 4
    
    return notification_counter


def add_notif_counter(x):
    """ 
    
    Equivalent function to:
    
    df_g['notification_bool'] = False
    df_g.notification_bool.iloc[x] = True

    # Now do a cumsum counter
    df_g['notification_counter'] = df_g.notification_bool.cumsum().shift()+1
    df_g.notification_counter.iloc[0] = 1
    """
    notification_counter = np.zeros(x[-1]+1,dtype='int')
    notification_counter[:x[0]+1] = 1
    notification_counter[x[0]+1:x[1]+1] = 2
    notification_counter[x[1]+1:x[2]+1] = 3
    notification_counter[(x[2]+1):] = 4
    
    return notification_counter

In [7]:

print(f"Shape of test dataset: {df_g.shape}")
x = local_search(df_g.timestamp_ns.to_numpy())

print('With @jit:')
%time n1 = add_notif_counter_j(x,np.zeros(x[-1]+1,dtype='int')) 

print('Without @jit:')
%time n2 = add_notif_counter(x)

print('With pandas:')
%time n3 = notif_counter_pandas(df_g,x)

np.allclose(n1,n2.astype("int") ) 
np.allclose(n2,n3.astype("int") )

Shape of test dataset: (58, 8)
With @jit:
CPU times: user 24 µs, sys: 1 µs, total: 25 µs
Wall time: 26.9 µs
Without @jit:
CPU times: user 25 µs, sys: 1e+03 ns, total: 26 µs
Wall time: 28.1 µs
With pandas:
CPU times: user 2.2 ms, sys: 449 µs, total: 2.65 ms
Wall time: 2.35 ms

Out[7]:

True

Speed test: groupbyuy¶

In [8]:

# prev code
df_g = df.loc[(df.user_id == 'CFFEC5978B0A4A05FA6DCEFB2C82CC') & (df.dayofyear == 213),].copy()#.head(20)

# Calculate when to send notification
x = local_search(df_g.timestamp_ns.to_numpy())

# Add notification counter
df_g['notification_counter'] = add_notif_counter(x)

In [9]:

print('Speed with pandas:')
%time tours_pd =  df_g.groupby('notification_counter')['friend_id'].apply(lambda x: (1- x.duplicated()).cumsum())#.droplevel(0)

from bundle_notifications.bundle_notifications import count_tours_per_notif

# Inputs
notification_counter = df_g.notification_counter.to_numpy('int')
friend_id = df_g.friend_id.to_numpy()
friend_name = df_g.friend_name.to_numpy()
timestamp = df_g.timestamp.to_numpy()

print('Speed with custom numpy:')
%time tours_np,_,_,_ = count_tours_per_notif(notification_counter, friend_id,friend_name,timestamp)

np.allclose(tours_pd.astype("int"),tours_np )

Speed with pandas:
CPU times: user 3.84 ms, sys: 533 µs, total: 4.37 ms
Wall time: 3.91 ms
Speed with custom numpy:
CPU times: user 78 µs, sys: 14 µs, total: 92 µs
Wall time: 87 µs

Out[9]:

True

A custom numpy functtion goes 35x faster! It is not even 100% comparable because inside _count_tours_pernotif we output some extra information.

Speed analysis for bundle()¶

In [10]:

from bundle_notifications.bundle_notifications import bundle_func, bundle

# Use a subset of the dataset
users = df.user_id.unique()[:50]
df_subset = df.loc[ (df.user_id.isin(users)),].copy()
print(df_subset.shape)

%lprun -T notebooks/profile_bundle_func -f bundle_func bundle(df_subset)
print(open('notebooks/profile_bundle_func', 'r').read())

(2513, 6)

*** Profile printout saved to text file 'notebooks/profile_bundle_func'. 
Timer unit: 1e-06 s

Total time: 0.6141 s
File: /Users/jsg/Documents/GitHub/bundle_notifications_ds/bundle_notifications/bundle_notifications.py
Function: bundle_func at line 229

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   229                                           def bundle_func(df_g):
   230                                               """Bundles notifications for a user_id
   231                                           
   232                                               This function is meant to be used after a pandas groupping (or manual filtering) of user_ids. 
   233                                           
   234                                               Parameters
   235                                               ----------
   236                                               df_g : pd.DataFrame
   237                                                   DataFrame containing 4 columns: ['timestamp', 'user_id', 'friend_id', 'friend_name']
   238                                           
   239                                               Returns
   240                                               -------
   241                                               pd.DataFrame
   242                                                   
   243                                               
   244                                               """
   245                                           
   246                                           
   247                                               # Always send if less than 4 notifications per day
   248       129        476.0      3.7      0.1      if df_g.shape[0] <= 4:
   249                                           
   250        38      30539.0    803.7      5.0          df_g['notification_bool'] = True
   251        38      29977.0    788.9      4.9          df_g['tours'] = 1        
   252        38      30125.0    792.8      4.9          df_g['notification_counter'] = list(range(1,df_g.shape[0]+1)) # [1,2,3,4]
   253        38      37670.0    991.3      6.1          df_g['message'] = create_message(df_g.tours, df_g.friend_name)
   254        38      32158.0    846.3      5.2          df_g['timestamp_first_tour'] = df_g.timestamp
   255                                           
   256                                               else:
   257                                                   
   258                                                   # Calculate best times to send notification
   259        91      13460.0    147.9      2.2          x = local_search(df_g.timestamp_ns.to_numpy())
   260                                           
   261                                                   # Add up an indicator whether the notification is sent at those times
   262        91      68933.0    757.5     11.2          df_g['notification_counter'] = add_notif_counter(x) 
   263                                           
   264                                                   # Group by counter and add
   265                                                   # We also get the names and timestamp of the first element in the notif counter
   266        90     112254.0   1247.3     18.3          df_g['tours'], name_first,timestamp_first_tour, message = count_tours_per_notif(df_g.notification_counter.to_numpy('int'),df_g.friend_id.to_numpy(),df_g.friend_name.to_numpy(),df_g.timestamp.to_numpy())
   267                                                   
   268                                                   # Filter: lets keep the rows with notifications
   269        90     115138.0   1279.3     18.7          df_g = df_g.iloc[x,]
   270                                                   
   271                                                   # Add timestamp first tour
   272        90      71801.0    797.8     11.7          df_g['timestamp_first_tour'] = timestamp_first_tour
   273                                           
   274                                                   # Cutomize message.
   275        90      71427.0    793.6     11.6          df_g['message'] = message
   276                                           
   277                                                   
   278       128        142.0      1.1      0.0      return df_g

I am surprised that a simple filter of rows (df_g = df_g.iloc[x,]) takes as much time as a much more complex custom function df_g['tours'], name_first,timestamp_first_tour, message = count_tours_per_notif(...)
Assigning a value to a new column df_g['message'] = message takes also quite some time.