[75.06 / 95.58] Organización de Datos
Trabajo Práctico 1: Análisis Exploratorio de Datos
Notebook Anexo

Grupo 30: Datatouille

  • 101055 - Bojman, Camila
  • 100029 - del Mazo, Federico
  • 100687 - Hortas, Cecilia
  • 97649 - Souto, Rodrigo

https://github.com/FdelMazo/7506-Datos

https://kaggle.com/datatouille2018/7506-TP1/

https://kaggle.com/datatouille2018/7506-TP1-anexo/

Presentamos aca un anexo al trabajo práctico, donde se generan los sets de datos adicionales usados por el TP principal.

In [2]:
import pandas as pd
import numpy as np

df = pd.read_csv("data/events.csv", low_memory=False)
df['timestamp'] = pd.to_datetime(df['timestamp'])

brands.csv

In [3]:
checked = ['iphone', 'samsung', 'motorola', 'lenovo', 'sony', 'lg', 'ipad', 'asus', 'quantum', 'blackberry']
model_parsed = df['model'].dropna().map(lambda x: x.lower())
model_parsed = model_parsed.map(lambda x: x.split())

def find_brand(model):
    for str in model:
        if str in checked:
            return str
    return "other"


df['brand'] = model_parsed.map(find_brand)
df['brand'] = df['brand'].astype('category')
df[['model', 'brand']].dropna().drop_duplicates().to_csv('data/brands.csv', index=False)
In [28]:
brands_csv = pd.read_csv("data/brands.csv", low_memory=False)
display(brands_csv.head())
brands_csv.shape
model brand
0 iPhone 5s iphone
1 Samsung Galaxy S8 samsung
2 Motorola Moto Z Play motorola
3 iPhone 7 Plus iphone
4 iPhone 6S iphone
Out[28]:
(202, 2)

os.csv

In [5]:
df['operating_system_version'].unique()
Out[5]:
array([nan, 'Android 6', 'Android 5.1.1', 'Windows 10 ', 'Windows 7 ',
       'Windows 8.1 ', 'Android 7.1.1', 'Android 7', 'iOS 10.2.1',
       'Android 5.0.2', 'Android 4.2.2', 'Android 4.1.2', 'Android 4.4.4',
       'Android 6.0.1', 'Linux ', 'Android 5.1', 'Mac OS X 10.13.5',
       'iOS 11.3', 'Android 7.1.2', 'Windows XP ', 'Android 8',
       'Android 4.3', 'iOS 11.2.2', 'iOS 11.2.5', 'iOS 10.3.3',
       'Android 4.4.2', 'Windows 8 ', 'Android 5', 'iOS 9.3.5',
       'iOS 7.1.2', 'Android 5.0.1', 'iOS 11.4', 'Windows Phone 8.1',
       'Ubuntu ', 'Android 8.1', 'Windows Vista ', 'Mac OS X 10.10.2',
       'iOS 11.2.6', 'Other ', 'Android 4.0.4', 'Android 4.4.3',
       'iOS 11.1.2', 'Chrome OS 10032.86', 'Windows Phone 10',
       'Windows RT ', 'iOS 10.3.2', 'Windows Phone 8', 'iOS 11.2.1',
       'Mac OS X 10.11.6', 'iOS 10.2', 'Mac OS X 10.13.3',
       'Mac OS X 10.6.4', 'Android ', 'iOS 10.0.2', 'Mac OS X 10.13.4',
       'Android 6.1', 'iOS 3.2', 'Mac OS X 10.12.3', 'iOS 11.1',
       'iOS 11.0.1', 'Tizen 3', 'iOS 9.2.1', 'Windows Phone 7.5',
       'Mac OS X 10.12.6', 'Chrome OS 10452.96', 'Ubuntu 10.4',
       'Mac OS X 10.10.5', 'iOS 4', 'iOS 7.1.1', 'iOS 11.2',
       'Mac OS X 10.9.5', 'Mac OS X 10.6.8', 'iOS 10.3.1',
       'Mac OS X 10.10', 'iOS 11.0.3', 'iOS 11.0.2', 'Mac OS X 10.12.5',
       'Mac OS X 10.10.1', 'Mac OS X 10.13.2', 'Android 4.1.1',
       'Chrome OS 10452.99', 'Mac OS X 10.13.1', 'iOS 9.3.4',
       'Mac OS X 10.11.3', 'iOS 8.1', 'iOS 11.1.1', 'iOS 5.0.1', 'iOS 12',
       'Mac OS X 10.11.1', 'iOS 10.1.1', 'iOS 8.1.3',
       'Chrome OS 10032.75', 'Chrome OS 10452.85', 'Tizen 2.4',
       'Android 2.3.6', 'iOS 8.3', 'iOS 8.4', 'iOS 8.1.1',
       'Mac OS X 10.7.5', 'Chrome OS 10176.66', 'Android 3.2',
       'iOS 9.3.2', 'Chrome OS 10323.67', 'Android 6.3', 'iOS 11',
       'Chrome OS 10176.76', 'Chrome OS 10323.62', 'Mac OS X 10.11.4',
       'Android 4.4.1', 'iOS 11.4.1', 'Android 4.0.3', 'iOS 9.1',
       'Android 4.4', 'iOS 10.0.1', 'iOS 6.1.6', 'Android 3.1',
       'Chrome OS 10452.74', 'Mac OS X 10.13', 'Mac OS X 10.10.4',
       'Chrome OS 10575.54', 'Mac OS X 10.10.3', 'Mac OS X 10.6'],
      dtype=object)
In [6]:
checked = ['windows', 'android', 'linux', 'mac', 'ios', 'ubuntu', 'chrome os', 'tizen', 'other']
os_version_parsed = df['operating_system_version'].dropna().map(lambda x: x.lower())

def find_os(os_version):
    for os in checked:
        if os in os_version:
            return os
    return "another"


df['operating_system'] = os_version_parsed.map(find_os)
df['operating_system'] = df['operating_system'].astype('category')
In [7]:
# Chequeamos cuantos os quedaron con el nombre 'another' (idealmente, ninguno)
df[['operating_system_version', 'operating_system']].dropna().head(10)
df[df['operating_system'] == 'another'][['operating_system_version', 'operating_system']].head(10)
Out[7]:
operating_system_version operating_system
In [8]:
df['operating_system'].value_counts()
Out[8]:
windows      41495
android      41410
ios           3539
mac            406
linux          293
ubuntu          97
chrome os       83
other           48
tizen            7
Name: operating_system, dtype: int64
In [9]:
df[['operating_system_version', 'operating_system']].dropna().drop_duplicates().to_csv('data/os.csv', index=False)
In [29]:
os_csv = pd.read_csv("data/os.csv", low_memory=False)
display(os_csv.head())
os_csv.shape
operating_system_version operating_system
0 Android 6 android
1 Android 5.1.1 android
2 Windows 10 windows
3 Windows 7 windows
4 Windows 8.1 windows
Out[29]:
(121, 2)

browsers.csv

In [50]:
checked = ['mobile safari', 'chrome mobile', 'ie mobile', 'firefox mobile', 'edge mobile', 'opera mobile',
           'mobile', 'chrome', 'android', 'opera', 'ie', 'firefox', 
           'facebook', 'samsung', 'chromium', 'edge', 'yandex', 'uc', 
           'other', 'safari', 'puffin', 'maxthon', 'vivaldi']
browser_version_parsed = df['browser_version'].dropna().map(lambda x: x.lower())

def find_browser(browser_version):
    for browser in checked:
        if browser in browser_version:
            return browser
    return "other"


df['browser'] = browser_version_parsed.map(find_browser)
df['browser'] = df['browser'].astype('category')
In [51]:
df[['browser_version', 'browser']].dropna()
display(df['browser'].value_counts().head())
df[['browser_version', 'browser']].head()
chrome           38081
chrome mobile    37056
mobile safari     3239
firefox           2567
samsung           1699
Name: browser, dtype: int64
Out[51]:
browser_version browser
0 NaN NaN
1 Chrome Mobile 39 chrome mobile
2 NaN NaN
3 NaN NaN
4 NaN NaN
In [52]:
display(df[['browser_version', 'browser']].dropna().drop_duplicates().head())
df[['browser_version', 'browser']].dropna().drop_duplicates().to_csv('data/browsers.csv', index=False)
browser_version browser
1 Chrome Mobile 39 chrome mobile
5 Android 5.1 android
9 Chrome 65.0 chrome
16 Chrome 66.0 chrome
96 Chrome Mobile 66.0 chrome mobile
In [53]:
browsers_csv = pd.read_csv("data/browsers.csv", low_memory=False)
display(browsers_csv.head())
browsers_csv.shape
browser_version browser
0 Chrome Mobile 39 chrome mobile
1 Android 5.1 android
2 Chrome 65.0 chrome
3 Chrome 66.0 chrome
4 Chrome Mobile 66.0 chrome mobile
Out[53]:
(343, 2)

sessions.csv

Definimos una sesión como una serie de eventos por usuario, los cuales están todos con menos de 30 minutos de inactividad entre el actual y el anterior.

In [55]:
funnel = df.sort_values(['person', 'timestamp'])
funnel.head()
Out[55]:
timestamp event person url sku model condition storage color skus ... city region country device_type screen_resolution operating_system_version browser_version brand operating_system browser
0 2018-05-31 23:38:05 ad campaign hit 0004b0a2 /comprar/iphone/iphone-5s NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2018-05-31 23:38:05 visited site 0004b0a2 NaN NaN NaN NaN NaN NaN NaN ... Camaragibe Pernambuco Brazil Smartphone 360x640 Android 6 Chrome Mobile 39 NaN android chrome mobile
2 2018-05-31 23:38:09 viewed product 0004b0a2 NaN 2694.0 iPhone 5s Bom 32GB Cinza espacial NaN ... NaN NaN NaN NaN NaN NaN NaN iphone NaN NaN
3 2018-05-31 23:38:40 checkout 0004b0a2 NaN 2694.0 iPhone 5s Bom 32GB Cinza espacial NaN ... NaN NaN NaN NaN NaN NaN NaN iphone NaN NaN
4 2018-05-29 13:29:25 viewed product 0006a21a NaN 15338.0 Samsung Galaxy S8 Bom 64GB Dourado NaN ... NaN NaN NaN NaN NaN NaN NaN samsung NaN NaN

5 rows × 26 columns

In [56]:
funnel['time_diff'] = funnel.groupby('person')['timestamp'].diff()
funnel['time_diff'] = funnel['time_diff'].fillna(0)
funnel['time_diff_min'] = funnel['time_diff'] / np.timedelta64(1, 'm')
In [57]:
THRESHOLD = 30 # minutes

funnel['new_session'] = funnel['time_diff_min'] > THRESHOLD
funnel['session_id'] = funnel.groupby('person')['new_session'].cumsum()
funnel['session_id'] = funnel['session_id'].astype('int')
In [58]:
gb = funnel.groupby(['person', 'session_id'])

funnel['session_cumno'] = gb.cumcount()
funnel['session_total_events'] = gb['session_cumno'].transform(lambda x: x.size)
funnel['session_first'] = funnel['session_cumno'] == 0
funnel['session_last'] = funnel['session_cumno'] == (-1+funnel['session_total_events'])
In [59]:
cols = ['person', 'timestamp', 'time_diff_min', \
        'session_id', 'event', 'session_total_events', \
        'session_cumno', 'session_first', 'session_last']
funnel[cols]
Out[59]:
person timestamp time_diff_min session_id event session_total_events session_cumno session_first session_last
0 0004b0a2 2018-05-31 23:38:05 0.000000 0 ad campaign hit 4 0 True False
1 0004b0a2 2018-05-31 23:38:05 0.000000 0 visited site 4 1 False False
2 0004b0a2 2018-05-31 23:38:09 0.066667 0 viewed product 4 2 False False
3 0004b0a2 2018-05-31 23:38:40 0.516667 0 checkout 4 3 False True
4 0006a21a 2018-05-29 13:29:25 0.000000 0 viewed product 4 0 True False
5 0006a21a 2018-05-29 13:29:26 0.016667 0 visited site 4 1 False False
6 0006a21a 2018-05-29 13:29:27 0.016667 0 ad campaign hit 4 2 False False
7 0006a21a 2018-05-29 13:29:35 0.133333 0 checkout 4 3 False True
8 000a54b2 2018-04-09 20:12:31 0.000000 0 generic listing 7 0 True False
9 000a54b2 2018-04-09 20:12:31 0.000000 0 visited site 7 1 False False
10 000a54b2 2018-04-09 20:12:31 0.000000 0 search engine hit 7 2 False False
11 000a54b2 2018-04-09 20:12:31 0.000000 0 ad campaign hit 7 3 False False
12 000a54b2 2018-04-09 20:12:53 0.366667 0 brand listing 7 4 False False
13 000a54b2 2018-04-09 20:13:14 0.350000 0 viewed product 7 5 False False
14 000a54b2 2018-04-09 20:13:20 0.100000 0 brand listing 7 6 False True
15 000a54b2 2018-05-24 11:21:07 64267.783333 1 ad campaign hit 30 0 True False
16 000a54b2 2018-05-24 11:21:07 0.000000 1 visited site 30 1 False False
17 000a54b2 2018-05-24 11:21:07 0.000000 1 search engine hit 30 2 False False
18 000a54b2 2018-05-24 11:21:08 0.016667 1 generic listing 30 3 False False
19 000a54b2 2018-05-24 11:26:34 5.433333 1 generic listing 30 4 False False
20 000a54b2 2018-05-24 11:26:53 0.316667 1 brand listing 30 5 False False
21 000a54b2 2018-05-24 11:26:56 0.050000 1 generic listing 30 6 False False
22 000a54b2 2018-05-24 11:27:47 0.850000 1 viewed product 30 7 False False
23 000a54b2 2018-05-24 11:27:50 0.050000 1 generic listing 30 8 False False
24 000a54b2 2018-05-24 11:28:10 0.333333 1 brand listing 30 9 False False
25 000a54b2 2018-05-24 11:28:36 0.433333 1 brand listing 30 10 False False
26 000a54b2 2018-05-24 11:28:59 0.383333 1 viewed product 30 11 False False
27 000a54b2 2018-05-24 11:29:06 0.116667 1 viewed product 30 12 False False
28 000a54b2 2018-05-24 11:29:14 0.133333 1 viewed product 30 13 False False
29 000a54b2 2018-05-24 11:29:16 0.033333 1 viewed product 30 14 False False
... ... ... ... ... ... ... ... ... ...
1011258 ffffa8d1 2018-06-09 13:40:05 0.000000 1 visited site 7 1 False False
1011259 ffffa8d1 2018-06-09 13:40:05 0.000000 1 search engine hit 7 2 False False
1011260 ffffa8d1 2018-06-09 13:40:05 0.000000 1 generic listing 7 3 False False
1011261 ffffa8d1 2018-06-09 13:42:05 2.000000 1 searched products 7 4 False False
1011262 ffffa8d1 2018-06-09 13:43:43 1.633333 1 viewed product 7 5 False False
1011263 ffffa8d1 2018-06-09 13:43:52 0.150000 1 viewed product 7 6 False True
1011264 ffffa8d1 2018-06-09 15:33:23 109.516667 2 viewed product 12 0 True False
1011265 ffffa8d1 2018-06-09 15:33:30 0.116667 2 viewed product 12 1 False False
1011266 ffffa8d1 2018-06-09 15:34:46 1.266667 2 viewed product 12 2 False False
1011267 ffffa8d1 2018-06-09 15:35:00 0.233333 2 viewed product 12 3 False False
1011268 ffffa8d1 2018-06-09 15:35:11 0.183333 2 viewed product 12 4 False False
1011269 ffffa8d1 2018-06-09 15:35:13 0.033333 2 viewed product 12 5 False False
1011270 ffffa8d1 2018-06-09 15:35:18 0.083333 2 viewed product 12 6 False False
1011271 ffffa8d1 2018-06-09 15:39:05 3.783333 2 viewed product 12 7 False False
1011272 ffffa8d1 2018-06-09 15:39:07 0.033333 2 viewed product 12 8 False False
1011273 ffffa8d1 2018-06-09 15:39:56 0.816667 2 visited site 12 9 False False
1011274 ffffa8d1 2018-06-09 15:41:29 1.550000 2 generic listing 12 10 False False
1011275 ffffa8d1 2018-06-09 15:43:29 2.000000 2 viewed product 12 11 False True
1011276 ffffa8d1 2018-06-11 12:25:40 2682.183333 3 search engine hit 8 0 True False
1011277 ffffa8d1 2018-06-11 12:25:40 0.000000 3 visited site 8 1 False False
1011278 ffffa8d1 2018-06-11 12:25:40 0.000000 3 ad campaign hit 8 2 False False
1011279 ffffa8d1 2018-06-11 12:25:41 0.016667 3 generic listing 8 3 False False
1011280 ffffa8d1 2018-06-11 12:28:47 3.100000 3 searched products 8 4 False False
1011281 ffffa8d1 2018-06-11 12:29:11 0.400000 3 viewed product 8 5 False False
1011282 ffffa8d1 2018-06-11 12:29:14 0.050000 3 viewed product 8 6 False False
1011283 ffffa8d1 2018-06-11 12:31:49 2.583333 3 checkout 8 7 False True
1011284 ffffac8a 2018-06-04 23:18:03 0.000000 0 visited site 4 0 True False
1011285 ffffac8a 2018-06-04 23:18:03 0.000000 0 ad campaign hit 4 1 False False
1011286 ffffac8a 2018-06-04 23:18:03 0.000000 0 viewed product 4 2 False False
1011287 ffffac8a 2018-06-04 23:18:10 0.116667 0 checkout 4 3 False True

1011288 rows × 9 columns

In [61]:
funnel['is_conversion'] = funnel['event'] == 'conversion'
gb = funnel.groupby(['person', 'session_id'])
funnel['session_conversion'] = gb['is_conversion'].transform(lambda x: x.sum())
funnel['session_conversion'] = funnel['session_conversion'] > 0
funnel.head()
Out[61]:
timestamp event person url sku model condition storage color skus ... time_diff time_diff_min new_session session_id session_cumno session_total_events session_first session_last is_conversion session_conversion
0 2018-05-31 23:38:05 ad campaign hit 0004b0a2 /comprar/iphone/iphone-5s NaN NaN NaN NaN NaN NaN ... 00:00:00 0.000000 False 0 0 4 True False False False
1 2018-05-31 23:38:05 visited site 0004b0a2 NaN NaN NaN NaN NaN NaN NaN ... 00:00:00 0.000000 False 0 1 4 False False False False
2 2018-05-31 23:38:09 viewed product 0004b0a2 NaN 2694.0 iPhone 5s Bom 32GB Cinza espacial NaN ... 00:00:04 0.066667 False 0 2 4 False False False False
3 2018-05-31 23:38:40 checkout 0004b0a2 NaN 2694.0 iPhone 5s Bom 32GB Cinza espacial NaN ... 00:00:31 0.516667 False 0 3 4 False True False False
4 2018-05-29 13:29:25 viewed product 0006a21a NaN 15338.0 Samsung Galaxy S8 Bom 64GB Dourado NaN ... 00:00:00 0.000000 False 0 0 4 True False False False

5 rows × 36 columns

In [62]:
funnel['is_checkout'] = funnel['event'] == 'checkout'
gb = funnel.groupby(['person', 'session_id'])
funnel['session_checkout'] = gb['is_checkout'].transform(lambda x: x.sum())
funnel['session_checkout'] = funnel['session_checkout'] > 0
funnel.head()
Out[62]:
timestamp event person url sku model condition storage color skus ... new_session session_id session_cumno session_total_events session_first session_last is_conversion session_conversion is_checkout session_checkout
0 2018-05-31 23:38:05 ad campaign hit 0004b0a2 /comprar/iphone/iphone-5s NaN NaN NaN NaN NaN NaN ... False 0 0 4 True False False False False True
1 2018-05-31 23:38:05 visited site 0004b0a2 NaN NaN NaN NaN NaN NaN NaN ... False 0 1 4 False False False False False True
2 2018-05-31 23:38:09 viewed product 0004b0a2 NaN 2694.0 iPhone 5s Bom 32GB Cinza espacial NaN ... False 0 2 4 False False False False False True
3 2018-05-31 23:38:40 checkout 0004b0a2 NaN 2694.0 iPhone 5s Bom 32GB Cinza espacial NaN ... False 0 3 4 False True False False True True
4 2018-05-29 13:29:25 viewed product 0006a21a NaN 15338.0 Samsung Galaxy S8 Bom 64GB Dourado NaN ... False 0 0 4 True False False False False True

5 rows × 38 columns

In [64]:
funnel['ad_origin'] = (funnel['event'] == 'ad campaign hit') & funnel['session_first']
gb = funnel.groupby(['person', 'session_id'])
funnel['session_ad'] = gb['ad_origin'].transform(lambda x: x.sum())
funnel['session_ad'] = funnel['session_ad'] > 0
funnel.head()
Out[64]:
timestamp event person url sku model condition storage color skus ... session_cumno session_total_events session_first session_last is_conversion session_conversion is_checkout session_checkout ad_origin session_ad
0 2018-05-31 23:38:05 ad campaign hit 0004b0a2 /comprar/iphone/iphone-5s NaN NaN NaN NaN NaN NaN ... 0 4 True False False False False True True True
1 2018-05-31 23:38:05 visited site 0004b0a2 NaN NaN NaN NaN NaN NaN NaN ... 1 4 False False False False False True False True
2 2018-05-31 23:38:09 viewed product 0004b0a2 NaN 2694.0 iPhone 5s Bom 32GB Cinza espacial NaN ... 2 4 False False False False False True False True
3 2018-05-31 23:38:40 checkout 0004b0a2 NaN 2694.0 iPhone 5s Bom 32GB Cinza espacial NaN ... 3 4 False True False False True True False True
4 2018-05-29 13:29:25 viewed product 0006a21a NaN 15338.0 Samsung Galaxy S8 Bom 64GB Dourado NaN ... 0 4 True False False False False True False False

5 rows × 40 columns

In [65]:
funnel[cols]
Out[65]:
person timestamp time_diff_min session_id event session_total_events session_cumno session_first session_last
0 0004b0a2 2018-05-31 23:38:05 0.000000 0 ad campaign hit 4 0 True False
1 0004b0a2 2018-05-31 23:38:05 0.000000 0 visited site 4 1 False False
2 0004b0a2 2018-05-31 23:38:09 0.066667 0 viewed product 4 2 False False
3 0004b0a2 2018-05-31 23:38:40 0.516667 0 checkout 4 3 False True
4 0006a21a 2018-05-29 13:29:25 0.000000 0 viewed product 4 0 True False
5 0006a21a 2018-05-29 13:29:26 0.016667 0 visited site 4 1 False False
6 0006a21a 2018-05-29 13:29:27 0.016667 0 ad campaign hit 4 2 False False
7 0006a21a 2018-05-29 13:29:35 0.133333 0 checkout 4 3 False True
8 000a54b2 2018-04-09 20:12:31 0.000000 0 generic listing 7 0 True False
9 000a54b2 2018-04-09 20:12:31 0.000000 0 visited site 7 1 False False
10 000a54b2 2018-04-09 20:12:31 0.000000 0 search engine hit 7 2 False False
11 000a54b2 2018-04-09 20:12:31 0.000000 0 ad campaign hit 7 3 False False
12 000a54b2 2018-04-09 20:12:53 0.366667 0 brand listing 7 4 False False
13 000a54b2 2018-04-09 20:13:14 0.350000 0 viewed product 7 5 False False
14 000a54b2 2018-04-09 20:13:20 0.100000 0 brand listing 7 6 False True
15 000a54b2 2018-05-24 11:21:07 64267.783333 1 ad campaign hit 30 0 True False
16 000a54b2 2018-05-24 11:21:07 0.000000 1 visited site 30 1 False False
17 000a54b2 2018-05-24 11:21:07 0.000000 1 search engine hit 30 2 False False
18 000a54b2 2018-05-24 11:21:08 0.016667 1 generic listing 30 3 False False
19 000a54b2 2018-05-24 11:26:34 5.433333 1 generic listing 30 4 False False
20 000a54b2 2018-05-24 11:26:53 0.316667 1 brand listing 30 5 False False
21 000a54b2 2018-05-24 11:26:56 0.050000 1 generic listing 30 6 False False
22 000a54b2 2018-05-24 11:27:47 0.850000 1 viewed product 30 7 False False
23 000a54b2 2018-05-24 11:27:50 0.050000 1 generic listing 30 8 False False
24 000a54b2 2018-05-24 11:28:10 0.333333 1 brand listing 30 9 False False
25 000a54b2 2018-05-24 11:28:36 0.433333 1 brand listing 30 10 False False
26 000a54b2 2018-05-24 11:28:59 0.383333 1 viewed product 30 11 False False
27 000a54b2 2018-05-24 11:29:06 0.116667 1 viewed product 30 12 False False
28 000a54b2 2018-05-24 11:29:14 0.133333 1 viewed product 30 13 False False
29 000a54b2 2018-05-24 11:29:16 0.033333 1 viewed product 30 14 False False
... ... ... ... ... ... ... ... ... ...
1011258 ffffa8d1 2018-06-09 13:40:05 0.000000 1 visited site 7 1 False False
1011259 ffffa8d1 2018-06-09 13:40:05 0.000000 1 search engine hit 7 2 False False
1011260 ffffa8d1 2018-06-09 13:40:05 0.000000 1 generic listing 7 3 False False
1011261 ffffa8d1 2018-06-09 13:42:05 2.000000 1 searched products 7 4 False False
1011262 ffffa8d1 2018-06-09 13:43:43 1.633333 1 viewed product 7 5 False False
1011263 ffffa8d1 2018-06-09 13:43:52 0.150000 1 viewed product 7 6 False True
1011264 ffffa8d1 2018-06-09 15:33:23 109.516667 2 viewed product 12 0 True False
1011265 ffffa8d1 2018-06-09 15:33:30 0.116667 2 viewed product 12 1 False False
1011266 ffffa8d1 2018-06-09 15:34:46 1.266667 2 viewed product 12 2 False False
1011267 ffffa8d1 2018-06-09 15:35:00 0.233333 2 viewed product 12 3 False False
1011268 ffffa8d1 2018-06-09 15:35:11 0.183333 2 viewed product 12 4 False False
1011269 ffffa8d1 2018-06-09 15:35:13 0.033333 2 viewed product 12 5 False False
1011270 ffffa8d1 2018-06-09 15:35:18 0.083333 2 viewed product 12 6 False False
1011271 ffffa8d1 2018-06-09 15:39:05 3.783333 2 viewed product 12 7 False False
1011272 ffffa8d1 2018-06-09 15:39:07 0.033333 2 viewed product 12 8 False False
1011273 ffffa8d1 2018-06-09 15:39:56 0.816667 2 visited site 12 9 False False
1011274 ffffa8d1 2018-06-09 15:41:29 1.550000 2 generic listing 12 10 False False
1011275 ffffa8d1 2018-06-09 15:43:29 2.000000 2 viewed product 12 11 False True
1011276 ffffa8d1 2018-06-11 12:25:40 2682.183333 3 search engine hit 8 0 True False
1011277 ffffa8d1 2018-06-11 12:25:40 0.000000 3 visited site 8 1 False False
1011278 ffffa8d1 2018-06-11 12:25:40 0.000000 3 ad campaign hit 8 2 False False
1011279 ffffa8d1 2018-06-11 12:25:41 0.016667 3 generic listing 8 3 False False
1011280 ffffa8d1 2018-06-11 12:28:47 3.100000 3 searched products 8 4 False False
1011281 ffffa8d1 2018-06-11 12:29:11 0.400000 3 viewed product 8 5 False False
1011282 ffffa8d1 2018-06-11 12:29:14 0.050000 3 viewed product 8 6 False False
1011283 ffffa8d1 2018-06-11 12:31:49 2.583333 3 checkout 8 7 False True
1011284 ffffac8a 2018-06-04 23:18:03 0.000000 0 visited site 4 0 True False
1011285 ffffac8a 2018-06-04 23:18:03 0.000000 0 ad campaign hit 4 1 False False
1011286 ffffac8a 2018-06-04 23:18:03 0.000000 0 viewed product 4 2 False False
1011287 ffffac8a 2018-06-04 23:18:10 0.116667 0 checkout 4 3 False True

1011288 rows × 9 columns

In [66]:
cols_csv = ['time_diff_min', \
        'session_id', 'session_total_events', \
        'session_cumno', 'session_first', 'session_last', \
        'session_conversion', 'session_checkout', 'session_ad']

funnel[cols_csv].to_csv('data/sessions.csv', index=False)
In [67]:
sessions_csv = pd.read_csv("data/sessions.csv", low_memory=False)
display(sessions_csv.head())
sessions_csv.shape
time_diff_min session_id session_total_events session_cumno session_first session_last session_conversion session_checkout session_ad
0 0.000000 0 4 0 True False False True True
1 0.000000 0 4 1 False False False True True
2 0.066667 0 4 2 False False False True True
3 0.516667 0 4 3 False True False True True
4 0.000000 0 4 0 True False False True False
Out[67]:
(1011288, 9)