Speed considerations¶
Below are the two notebooks I used to compare the speed of several functions that I developed iteratively. Generally speaking, the conclusions are:
- Pandas-based functions are quick to develop and easy to understand
- Numpy-based functions run significantly faster than pandas-based functions
- Numba-compiled functions run significantly faster than Numpy-based functions
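As a toy illustration of these three tiers (not taken from the notebooks below; the function names and workload here are invented), a cumulative sum implemented with pandas, NumPy, and Numba:

```python
import numpy as np
import pandas as pd

try:
    from numba import njit          # njit == jit(nopython=True)
except ImportError:                 # keep the sketch runnable without numba
    njit = lambda f: f

values = np.random.rand(100_000)

def cumsum_pandas(a):
    # pandas: quick to write and readable, but carries Series overhead
    return pd.Series(a).cumsum().to_numpy()

def cumsum_numpy(a):
    # NumPy: one vectorized call, no per-element Python overhead
    return np.cumsum(a)

@njit
def cumsum_numba(a):
    # Numba: an explicit loop, compiled to machine code on first call
    out = np.empty_like(a)
    total = 0.0
    for i in range(a.shape[0]):
        total += a[i]
        out[i] = total
    return out

# All three agree up to floating-point rounding; time them with %timeit.
assert np.allclose(cumsum_pandas(values), cumsum_numpy(values))
assert np.allclose(cumsum_numpy(values), cumsum_numba(values))
```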
Optimal delay¶
In [1]:
%load_ext autoreload
%autoreload 2
# Make sure we are running Python from the virtual env
import sys
print(f'Python env: {sys.executable}')
import os
if os.path.basename(os.getcwd()) == "notebooks":
    os.chdir('..')
print(f'Working dir: {os.getcwd()}')
In [20]:
#from bundle_notifications.bundle_notifications import load_data, bundle_func,add_message
import numpy as np
import pandas as pd
import tabulate
pd.set_option('display.max_rows', 100)
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.io as pio
pio.renderers.default = 'notebook'
from bundle_notifications.bundle_notifications import create_message
from tqdm.notebook import trange, tqdm
%load_ext line_profiler
from numba import jit
In [11]:
def random_dates(start='2017-01-01 00:00:00', n=30):
    """Generate n random timestamps within the given day,
    sorted and returned as integers (seconds).
    """
    return np.sort(np.datetime64(start) + np.random.randint(1, 24*60*60, n)).astype(int)
timestamp = random_dates('2017-08-01 00:00:00', n=10)
Speed comparison for delay() function¶
We compare the "no @jit" version with the implemented one
In [12]:
from bundle_notifications.optimal_delay import delay,total_delay_initial
def delay_nojit(t, x):
    """Calculate the total delay when notifications are sent at the indexes in x.
    """
    p1 = t[x[0]] - t[:x[0]]
    p2 = t[x[1]] - t[x[0]+1:x[1]]
    p3 = t[x[2]] - t[x[1]+1:x[2]]
    p4 = t[-1] - t[(x[2]+1):]
    return p1.sum() + p2.sum() + p3.sum() + p4.sum()
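The three-notification logic above can also be written for an arbitrary number of send indexes. The generalization below is my own sketch, not part of the package; it assumes x is sorted and that a final notification is implicitly sent at t[-1], as in delay_nojit:

```python
import numpy as np

def delay_general(t, x):
    """Total delay when notifications are sent at the (sorted) indexes in x,
    plus a final implicit send at t[-1]. Hypothetical generalization of
    delay_nojit to any number of notifications.
    """
    total = 0
    start = 0
    for idx in list(x) + [len(t) - 1]:
        # every event in t[start:idx] waits until the send at t[idx]
        total += (t[idx] - t[start:idx]).sum()
        start = idx + 1
    return total
```

For three indexes this reproduces delay_nojit exactly, since the events at the send indexes themselves contribute zero delay.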
In [13]:
x = total_delay_initial(timestamp)
%timeit delay_nojit(timestamp,x)
In [15]:
%timeit delay(timestamp,x)
With a simple decorator the function runs more than 10x faster :) My first version of this function, based on pd.Series, was about 100x slower.
Compare the computing time and the total delay of the grouping functions¶
Below we compare the computational speed and total delay of three bundling methods:
- total_delay_initial() is used as the initial solution; it simply schedules the notifications evenly spaced along the day
- total_delay_brute() evaluates every possible schedule for the notifications and returns the one that incurs the least total delay
- local_search() starts from an initial solution and searches locally for a better one
From the graph we observe that:
- Calculating all possible solutions by brute force is very expensive: the computing time grows combinatorially with the number of events
- The computing time of the local search is quite small
- The total delay is, by construction, lowest for the brute-force approach. The local search sits between the other two methods and achieves a good trade-off between speed and accuracy.
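To see why brute force blows up, assume total_delay_brute enumerates every choice of three send indexes out of n events (my reading of the method; the exact enumeration may differ). The number of candidate schedules is then C(n, 3), each costing O(n) to evaluate with delay():

```python
from math import comb

# Candidate schedules for n events, assuming the brute force tries every
# choice of 3 send indexes (hypothetical reading of total_delay_brute).
for n in (10, 50, 100, 500):
    print(f"n={n:>4}: {comb(n, 3):>12,} candidates")
```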
In [18]:
import time
from bundle_notifications.optimal_delay import total_delay_brute,local_search
# Plot the computing time of each method, and the resulting total delay
N = np.linspace(5, 100, num=15,dtype='int')
print(f"Sizes: {N}")
time_brute = [-1]*len(N)
delay_brute = [-1]*len(N)
time_initial = [-1]*len(N)
delay_initial = [-1]*len(N)
time_heuristic = [-1]*len(N)
delay_heuristic = [-1]*len(N)
for i, n in enumerate(tqdm(N)):
    # Randomly generate some timestamps
    timestamp = random_dates('2017-08-01 00:00:00', n)
    # Initial solution
    start = time.time()
    x = total_delay_initial(timestamp)
    delay_initial[i] = delay(timestamp, x)
    time_initial[i] = time.time() - start
    # Brute force
    start = time.time()
    x = total_delay_brute(timestamp)
    delay_brute[i] = delay(timestamp, x)
    time_brute[i] = time.time() - start
    # Heuristic (local search)
    start = time.time()
    x = local_search(timestamp)
    delay_heuristic[i] = delay(timestamp, x)
    time_heuristic[i] = time.time() - start
    # Sanity check
    if delay_heuristic[i] < delay_brute[i]:
        print("Heuristic produces better results than the exact method -> that cannot be!")
        break
In [21]:
## Make plot
fig = make_subplots(rows=2, cols=1)
fig.add_trace(go.Scatter(x=N, y=time_heuristic, name="Time heuristic"),row=1, col=1)
fig.add_trace(go.Scatter(x=N, y=time_brute,name="Time brute"),row=1, col=1)
fig.add_trace(go.Scatter(x=N, y=time_initial,name="Time initial"),row=1, col=1)
fig.update_yaxes(title_text="Elapsed seconds", row=1, col=1)
fig.update_xaxes(title_text="Size", row=1, col=1)
fig.add_trace(go.Scatter(x=N, y=delay_heuristic,name="Total delay heuristic (secs)"),row=2, col=1)
fig.add_trace(go.Scatter(x=N, y=delay_initial, name="Total delay initial (secs)"), row=2, col=1)
fig.add_trace(go.Scatter(x=N, y=delay_brute, name="Total delay brute (secs)"), row=2, col=1)
fig.update_yaxes(title_text="Total delay in seconds", row=2, col=1)
fig.update_xaxes(title_text="Size", row=2, col=1)
fig.update_layout(height=600, width=800, title_text="Comparison of methods")
fig.show()
#plotly.offline.plot(fig, filename='local_search.html')
Bundle functions¶
In [1]:
%load_ext autoreload
%autoreload 2
%load_ext line_profiler
In [2]:
import sys
import os
if os.path.basename(os.getcwd()) == "notebooks":
os.chdir('..')
from tqdm.notebook import trange, tqdm
import numpy as np
from numba import jit
import pandas as pd
pd.set_option('display.max_rows', 100)
import plotly.graph_objects as go
from plotly.subplots import make_subplots
from bundle_notifications.bundle_notifications import load_data,create_message
from bundle_notifications.optimal_delay import delay, local_search
1. Load data¶
We first load the whole dataset and then take a sample of it in order to run some speed tests.
In [3]:
url = 'https://static-eu-komoot.s3.amazonaws.com/backend/challenge/notifications.csv'
df = load_data(url, nrows=15000)
df['timestamp_ns'] = df.timestamp.copy().astype("int")
df['dayofyear'] = df.timestamp.copy().dt.dayofyear
df.info()
In [4]:
# Sample for an ID
df_g = df.loc[(df.user_id == 'CFFEC5978B0A4A05FA6DCEFB2C82CC') & (df.dayofyear == 213),].copy()#.head(20)
df_g.head()
Out[4]:
Speed tests: add_notif_counter()¶
In [5]:
def notif_counter_pandas(df_g, x):
    df_g['notification_bool'] = False
    # positional assignment (chained .iloc assignment is unreliable in pandas)
    df_g.iloc[x, df_g.columns.get_loc('notification_bool')] = True
    # Now do a cumsum counter
    df_g['notification_counter'] = df_g.notification_bool.cumsum().shift() + 1
    df_g.iloc[0, df_g.columns.get_loc('notification_counter')] = 1
    return df_g.notification_counter
@jit
def add_notif_counter_j(x, notification_counter):
    """Equivalent to add_notif_counter, but compiled with jit.
    """
    notification_counter[:x[0]+1] = 1
    notification_counter[x[0]+1:x[1]+1] = 2
    notification_counter[x[1]+1:x[2]+1] = 3
    notification_counter[(x[2]+1):] = 4
    return notification_counter
def add_notif_counter(x):
    """
    Equivalent function to:
        df_g['notification_bool'] = False
        df_g.notification_bool.iloc[x] = True
        # Now do a cumsum counter
        df_g['notification_counter'] = df_g.notification_bool.cumsum().shift()+1
        df_g.notification_counter.iloc[0] = 1
    """
    notification_counter = np.zeros(x[-1]+1, dtype='int')
    notification_counter[:x[0]+1] = 1
    notification_counter[x[0]+1:x[1]+1] = 2
    notification_counter[x[1]+1:x[2]+1] = 3
    notification_counter[(x[2]+1):] = 4
    return notification_counter
In [7]:
print(f"Shape of test dataset: {df_g.shape}")
x = local_search(df_g.timestamp_ns.to_numpy())
print('With @jit:')
%time n1 = add_notif_counter_j(x,np.zeros(x[-1]+1,dtype='int'))
print('Without @jit:')
%time n2 = add_notif_counter(x)
print('With pandas:')
%time n3 = notif_counter_pandas(df_g,x)
np.allclose(n1,n2.astype("int") )
np.allclose(n2,n3.astype("int") )
Out[7]:
Speed test: groupby()¶
In [8]:
# prev code
df_g = df.loc[(df.user_id == 'CFFEC5978B0A4A05FA6DCEFB2C82CC') & (df.dayofyear == 213),].copy()#.head(20)
# Calculate when to send notification
x = local_search(df_g.timestamp_ns.to_numpy())
# Add notification counter
df_g['notification_counter'] = add_notif_counter(x)
In [9]:
print('Speed with pandas:')
%time tours_pd = df_g.groupby('notification_counter')['friend_id'].apply(lambda x: (1- x.duplicated()).cumsum())#.droplevel(0)
from bundle_notifications.bundle_notifications import count_tours_per_notif
# Inputs
notification_counter = df_g.notification_counter.to_numpy('int')
friend_id = df_g.friend_id.to_numpy()
friend_name = df_g.friend_name.to_numpy()
timestamp = df_g.timestamp.to_numpy()
print('Speed with custom numpy:')
%time tours_np,_,_,_ = count_tours_per_notif(notification_counter, friend_id,friend_name,timestamp)
np.allclose(tours_pd.astype("int"),tours_np )
Out[9]:
A custom numpy function runs about 35x faster! The comparison even understates the gap, because count_tours_per_notif additionally returns some extra information.
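The essence of the NumPy approach can be sketched as a single pass that resets a "seen friends" set at each group boundary. This is my own illustration of the idea, not the actual count_tours_per_notif implementation (which also returns names, timestamps and the message, and may be structured differently for numba):

```python
import numpy as np

def tours_sketch(notification_counter, friend_id):
    # Cumulative count of distinct friend_ids within each notification group;
    # equivalent to the pandas groupby/duplicated/cumsum line above.
    n = len(friend_id)
    tours = np.empty(n, dtype=np.int64)
    seen = set()
    count = 0
    for i in range(n):
        if i > 0 and notification_counter[i] != notification_counter[i - 1]:
            seen.clear()        # new notification bundle: reset the counter
            count = 0
        if friend_id[i] not in seen:
            seen.add(friend_id[i])
            count += 1
        tours[i] = count
    return tours
```

For a numba-compiled variant, the Python set would typically be replaced with array operations.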
Speed analysis for bundle()¶
In [10]:
from bundle_notifications.bundle_notifications import bundle_func, bundle
# Use a subset of the dataset
users = df.user_id.unique()[:50]
df_subset = df.loc[ (df.user_id.isin(users)),].copy()
print(df_subset.shape)
%lprun -T notebooks/profile_bundle_func -f bundle_func bundle(df_subset)
print(open('notebooks/profile_bundle_func', 'r').read())
- I am surprised that a simple row filter (`df_g = df_g.iloc[x,]`) takes as much time as the much more complex custom function call `df_g['tours'], name_first, timestamp_first_tour, message = count_tours_per_notif(...)`
- Assigning a value to a new column (`df_g['message'] = message`) also takes quite some time.
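One workaround that often helps with both hotspots is to leave the DataFrame alone until the end: pull the needed columns out as arrays, do the row selection in NumPy, and build the result frame once. A hypothetical restructuring (select_rows_fast is my own name, not a package function):

```python
import numpy as np
import pandas as pd

def select_rows_fast(df_g, x, message):
    # Select rows x and attach a 'message' column in a single DataFrame
    # construction, avoiding repeated .iloc writes on df_g.
    cols = {c: df_g[c].to_numpy()[x] for c in df_g.columns}
    cols['message'] = message
    return pd.DataFrame(cols)
```

Whether this beats the in-place version depends on the number of columns and group sizes, so it would need profiling with %lprun as above.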