[75.06 / 95.58] Organización de Datos
Practical Assignment 2 (TP2): Machine Learning

Main Notebook

Group 30: Datatouille

  • 101055 - Bojman, Camila
  • 100029 - del Mazo, Federico
  • 100687 - Hortas, Cecilia
  • 97649 - Souto, Rodrigo

http://fdelmazo.github.io/7506-Datos/

https://www.kaggle.com/datatouille2018/

https://www.kaggle.com/c/trocafone

Continuing the investigation of the company Trocafone carried out in TP1 (the first practical assignment), the goal is to estimate the probability that a user of the site makes a conversion in the given period.

Notebooks, in the order they should be run and read:

  1. TP1 and its appendix --> Getting familiar with the dataset and exploring it.

  2. Preliminary Research --> Building on the work done in TP1, we dig further into the data, looking for what can be reused.

  3. Dataframe Creation --> As part of feature engineering, new dataframes are created with information about the site's products and about how the site is accessed (brands, operating systems, etc.).

  4. Feature Engineering --> Search for attributes of the users whose conversion is to be predicted.

  5. Feature Selection --> Search for the most favorable combination of features.

  6. Parameter Tuning --> Search for the best hyperparameters for each ML algorithm.

  7. Submission Framework --> A small framework for building the label submissions.

  8. TP2 (this notebook) --> With all of the above in place, and using the dataframes that hold every attribute sought and found, the classification algorithms are defined and applied, the models are trained and used to predict conversions, and finally the label submissions are built.

Before starting, set up the Kaggle credentials (username and token):

  1. Visit https://www.kaggle.com/datatouille2018/account (with whichever account you use)
  2. Click 'Create New API Token'
  3. Save the downloaded file to ~/.kaggle/kaggle.json
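Before running the cells that talk to Kaggle, it can help to confirm that the token file is where the steps above put it. The following is a minimal sketch, not part of the original notebooks: the helper name `load_kaggle_credentials` is hypothetical, and it only assumes that the downloaded token is a JSON file with `username` and `key` fields.

```python
import json
from pathlib import Path


def load_kaggle_credentials(path=Path.home() / ".kaggle" / "kaggle.json"):
    """Return (username, key) from a Kaggle API token file, or raise a clear error.

    Hypothetical helper for illustration; it is not part of the kaggle package.
    """
    path = Path(path)
    if not path.is_file():
        raise FileNotFoundError(f"Kaggle token not found at {path}")
    data = json.loads(path.read_text())
    missing = {"username", "key"} - set(data)
    if missing:
        raise KeyError(f"Token file is missing fields: {sorted(missing)}")
    return data["username"], data["key"]
```

Calling `load_kaggle_credentials()` with no arguments checks the default location from step 3 and fails early with a readable message if the file is absent or incomplete.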
In [ ]:
# !unzip -q -o data/events_up_to_01062018.zip -d data

# !pip install kaggle
# !pip install nbimporter
# !pip install ggplot
# !pip install hdbscan
# !conda install -y -c conda-forge xgboost 
# !conda install -y -c conda-forge lightgbm 
# !conda install -y -c conda-forge catboost
In [1]:
import nbimporter # pip install nbimporter
import pandas as pd
import numpy as np
import time
import calendar
from itertools import combinations
import random
from time import sleep
from parameter_tuning import get_hiper_params
from feature_selection import get_feature_selection
import submission_framework as SF

seed = 42
hiper_params = get_hiper_params()
feature_selection = get_feature_selection()
Importing Jupyter notebook from parameter_tuning.ipynb
Importing Jupyter notebook from submission_framework.ipynb
Importing Jupyter notebook from feature_selection.ipynb
In [2]:
df_users = pd.read_csv('data/user-features.csv',low_memory=False).set_index('person')
df_y = pd.read_csv('data/labels_training_set.csv').groupby('person').sum()

display(df_users.head(), df_y.head())
total_brand_listings total_viewed_products total_checkouts total_conversions total_events total_sessions total_session_checkout total_session_conversion total_events_ad_session total_ad_sessions ... percentage_l2w_week_activity percentage_l2w_brand_listings percentage_l2w_viewed_products percentage_l2w_checkouts percentage_l2w_conversions kmeans_3 kmeans_5 kmeans_6 kmeans_15 kmeans_25
person
0008ed71 0.0 0.0 3.0 0.0 6 3.0 3.0 0.0 0.0 0.0 ... 1.000000 0.000000 0.000000 0.500000 0.000000 1 1 4 0 14
00091926 25.0 372.0 2.0 0.0 448 34.0 2.0 0.0 54.0 9.0 ... 0.582589 0.006696 0.511161 0.004464 0.000000 1 1 1 3 7
00091a7a 5.0 3.0 0.0 0.0 10 1.0 0.0 0.0 10.0 1.0 ... 0.000000 0.000000 0.000000 0.000000 0.000000 0 3 3 8 16
000ba417 24.0 153.0 6.0 1.0 206 5.0 4.0 1.0 0.0 0.0 ... 1.000000 0.116505 0.742718 0.029126 0.004854 2 2 0 12 0
000c79fe 0.0 3.0 1.0 0.0 17 1.0 1.0 0.0 17.0 1.0 ... 1.000000 0.000000 0.176471 0.058824 0.000000 1 1 1 3 7

5 rows × 188 columns

label
person
0008ed71 0
000c79fe 0
001802e4 0
0019e639 0
001b0bf9 0

Machine Learning Algorithms

In [9]:
posibilidades_algoritmos = []

Decision Tree

In [10]:
from sklearn.tree import DecisionTreeClassifier, export_graphviz

model_name = 'decision_tree'
params = hiper_params[model_name]
model = DecisionTreeClassifier(**params,random_state=seed)
model_with_name = (model_name,model)

SF.full_framework_wrapper(df_users,df_y,model_with_name)

posibilidades_algoritmos.append(model_with_name)
Model: decision_tree - AUC: 0.8250 - AUCPR:0.1890 - Accuracy: 0.9496 
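Each run logs AUC, AUCPR and accuracy. How `SF.full_framework_wrapper` computes them is not shown in this notebook, so the following is only a pure-Python sketch of two of those metrics on made-up data. It illustrates why accuracy can stay almost constant across models while AUC still separates them, which is the pattern one would expect if conversions are rare (an assumption about the data, not something the log confirms). For binary labels, the rank-based definition used here is the same quantity that scikit-learn's `roc_auc_score` reports.

```python
def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy(y_true, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    preds = [int(s >= threshold) for s in scores]
    return sum(p == y for p, y in zip(preds, y_true)) / len(y_true)


# Toy data (hypothetical): 1 positive among 20 users.
y_true = [0] * 19 + [1]
constant = [0.0] * 20        # always predicts "no conversion"
ranked = [0.1] * 19 + [0.4]  # ranks the positive first but never crosses 0.5

for name, scores in [("constant", constant), ("ranked", ranked)]:
    print(name, accuracy(y_true, scores), roc_auc(y_true, scores))
```

Both toy models reach the same accuracy because neither ever predicts the positive class, yet only the second one ranks the positive user above the rest.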

Random Forest

In [52]:
from sklearn.ensemble import RandomForestClassifier

model_name = 'random_forest'
params = hiper_params[model_name]
model = RandomForestClassifier(**params,random_state=seed)
model_with_name = (model_name,model)

SF.full_framework_wrapper(df_users,df_y,model_with_name)

posibilidades_algoritmos.append(model_with_name)
Model: random_forest - AUC: 0.8420 - AUCPR:0.1634 - Accuracy: 0.9496 

XGBoost

In [53]:
import xgboost as xgb #conda install -c conda-forge xgboost 

model_name = 'xgboost'
params = hiper_params[model_name]
model = xgb.XGBClassifier(**params,random_state=seed)
model_with_name = (model_name,model)

SF.full_framework_wrapper(df_users,df_y,model_with_name)

posibilidades_algoritmos.append(model_with_name)
Model: xgboost - AUC: 0.8599 - AUCPR:0.2521 - Accuracy: 0.9491 

KNN

In [13]:
from sklearn.neighbors import KNeighborsClassifier

model_name = 'knn'
params = hiper_params[model_name]
K = params['n_neighbors']
model_name = f'KNN{K}'

model = KNeighborsClassifier(**params)
model_with_name = (model_name,model)

SF.full_framework_wrapper(df_users,df_y,model_with_name, normalize=True)

posibilidades_algoritmos.append(model_with_name)
Model: KNN21_normalized - AUC: 0.7901 - AUCPR:0.1666 - Accuracy: 0.9489 

Naive-Bayes

In [14]:
from sklearn.naive_bayes import GaussianNB,BernoulliNB,MultinomialNB,ComplementNB

model_name = 'naive_bayes_Gaussian'
model = GaussianNB(var_smoothing=1e-09)
model_with_name = (model_name,model)

SF.full_framework_wrapper(df_users,df_y,model_with_name)
posibilidades_algoritmos.append(model_with_name)

model_name = 'naive_bayes_Bernoulli'
model = BernoulliNB()
model_with_name = (model_name,model)

SF.full_framework_wrapper(df_users,df_y,model_with_name)
posibilidades_algoritmos.append(model_with_name)

model_name = 'naive_bayes_Multinomial'
model = MultinomialNB()
model_with_name = (model_name,model)

SF.full_framework_wrapper(df_users,df_y,model_with_name)
posibilidades_algoritmos.append(model_with_name)

model_name = 'naive_bayes_Complement'
model = ComplementNB()
model_with_name = (model_name,model)
SF.full_framework_wrapper(df_users,df_y,model_with_name)
posibilidades_algoritmos.append(model_with_name)
Model: naive_bayes_Gaussian - AUC: 0.7809 - AUCPR:0.1531 - Accuracy: 0.9274 
Model: naive_bayes_Bernoulli - AUC: 0.8126 - AUCPR:0.1731 - Accuracy: 0.8068 
Model: naive_bayes_Multinomial - AUC: 0.7450 - AUCPR:0.1152 - Accuracy: 0.7587 
Model: naive_bayes_Complement - AUC: 0.7450 - AUCPR:0.1152 - Accuracy: 0.7587 

LightGBM

In [15]:
import lightgbm as lgb  #conda install -c conda-forge lightgbm 

model_name = 'lightgbm'
params = hiper_params[model_name]
model = lgb.LGBMClassifier(**params)
model_with_name = (model_name,model)


SF.full_framework_wrapper(df_users,df_y, model_with_name)

posibilidades_algoritmos.append(model_with_name)
Model: lightgbm - AUC: 0.8669 - AUCPR:0.2556 - Accuracy: 0.9497 

Neural Network

In [16]:
from sklearn.neural_network import MLPClassifier

model_name = 'neuralnetwork'
params = hiper_params[model_name]
model = MLPClassifier(**params)
model_with_name = (model_name, model)

# Only works with normalized data
SF.full_framework_wrapper(df_users, df_y, model_with_name, normalize=True)
posibilidades_algoritmos.append(model_with_name)
Model: neuralnetwork_normalized - AUC: 0.8526 - AUCPR:0.2151 - Accuracy: 0.9496 
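KNN and the neural network are both run with `normalize=True`, since distance-based and gradient-based models are sensitive to feature scale. The notebook does not show how the framework normalizes, so the following is only a sketch of one common choice, z-score standardization, with the statistics fitted on the training rows and reused for new rows. The function names are hypothetical.

```python
from statistics import mean, pstdev


def fit_scaler(rows):
    """Compute per-column mean and population standard deviation from training rows."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    # A constant column has std 0; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(c) or 1.0 for c in columns]
    return means, stds


def transform(rows, means, stds):
    """Apply previously fitted statistics to any set of rows."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


# Toy rows (hypothetical): two features on very different scales.
train = [[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]]
test = [[2.0, 300.0]]

means, stds = fit_scaler(train)
train_scaled = transform(train, means, stds)
test_scaled = transform(test, means, stds)
print(test_scaled)
```

Fitting on the training rows only keeps information from the evaluation rows out of the scaling step.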

Gradient Boosting

In [ ]:
from sklearn.ensemble import GradientBoostingClassifier as GBC  

model_name = 'gradient_boosting'
params = hiper_params[model_name]

model = GBC(**params)
model_with_name = (model_name,model)

SF.full_framework_wrapper(df_users,df_y,model_with_name)
posibilidades_algoritmos.append(model_with_name)

Adaboost

In [ ]:
from sklearn.ensemble import AdaBoostClassifier

model_name = 'adaboost'

model = AdaBoostClassifier()
model_with_name = (model_name,model)

SF.full_framework_wrapper(df_users,df_y,model_with_name)
posibilidades_algoritmos.append(model_with_name)

Catboost

In [17]:
# Very expensive in time; the results do not justify it

# import catboost as cb #conda install -c conda-forge catboost

# model_name = 'catboost'
# params = hiper_params[model_name]

# model = cb.CatBoostClassifier(**params,verbose=False)
# model_with_name = (model_name,model)

# SF.full_framework_wrapper(df_users,df_y,model_with_name)
# posibilidades_algoritmos.append(model_with_name)

Logistic Regression

In [18]:
# Very poor results

# from sklearn.linear_model import LogisticRegression

# model_name = 'logistic_regresion'
# model = LogisticRegression(solver='lbfgs')
# model_with_name = (model_name,model)

# SF.full_framework_wrapper(df_users,df_y,model_with_name)

# posibilidades_algoritmos.append(model_with_name)

Finding the best submission

We run all the algorithms defined above, ensembles included, looking for the best hyperparameter combination for each.

Finally, every algorithm in its best configuration is run against every defined feature set, looking for the best overall pairing.
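The search described here is an exhaustive sweep over (feature set, model) pairs. A minimal sketch of that idea with `itertools.product` follows; the names, the `evaluate` function and the scores are all made up for illustration and stand in for a real train-and-validate run.

```python
from itertools import product

# Hypothetical candidates: feature-set name -> columns, model name -> made-up bonus.
feature_sets = {"set_a": ["f1", "f2"], "set_b": ["f1", "f2", "f3"]}
models = {"model_x": 0.1, "model_y": 0.2}


def evaluate(bonus, features):
    """Stand-in for training and validating a model; returns a fake score."""
    return round(0.5 + bonus + 0.05 * len(features), 2)


# Score every pairing of a feature set with a model.
results = [
    (evaluate(bonus, feats), set_name, model_name)
    for (set_name, feats), (model_name, bonus) in product(
        feature_sets.items(), models.items()
    )
]

# Pick the winner by score only.
best = max(results, key=lambda r: r[0])
print(best)
```

The number of runs is the product of both list sizes, which is why the notebook trims slow models before sweeping.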

All candidate algorithms

In [40]:
print([x[0] for x in posibilidades_algoritmos])

posibilidades_algoritmos_y_ensambles = posibilidades_algoritmos[:]
['decision_tree', 'random_forest', 'xgboost', 'KNN21', 'naive_bayes_Gaussian', 'naive_bayes_Bernoulli', 'naive_bayes_Multinomial', 'naive_bayes_Complement', 'lightgbm', 'neuralnetwork']
In [41]:
from sklearn.ensemble import BaggingClassifier

# Algorithms that are already bagging-based and those that take too long are excluded

EXCLUDED = ['catboost','neuralnetwork',f'KNN{K}','random_forest']

def bagging(posibilidades):
    posibilidades = list(filter(lambda x:x[0] not in EXCLUDED,posibilidades))
    baggins = [] # Frodo
    for n,m in posibilidades:
        baggins.append((n+'_bagging',BaggingClassifier(m)))
    return baggins

posibilidades_algoritmos_y_ensambles += bagging(posibilidades_algoritmos)
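For readers unfamiliar with what the `BaggingClassifier` wrapper adds, here is a pure-Python sketch of the underlying idea: train several copies of a base learner on bootstrap resamples and let them vote. It is an illustration under simplifying assumptions (one numeric feature, a toy threshold learner, positives having larger values), not how scikit-learn implements it, and every name in it is hypothetical.

```python
import random
from collections import Counter


def bootstrap(rows, rng):
    """Draw len(rows) examples with replacement."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(rows):
    """Toy base learner: predict 1 when x is at or above the midpoint of the class means."""
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    if not pos or not neg:
        # Degenerate resample with a single class: always predict that class.
        label = 1 if pos else 0
        return lambda x: label
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: int(x >= cut)


def fit_bagging(rows, n_estimators=10, seed=42):
    """Train one stump per bootstrap resample."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap(rows, rng)) for _ in range(n_estimators)]


def predict_bagging(estimators, x):
    """Majority vote across the trained stumps."""
    votes = Counter(est(x) for est in estimators)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): (feature value, label).
rows = [(0.0, 0), (1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1), (10.0, 1)]
ensemble = fit_bagging(rows)
print(predict_bagging(ensemble, 0.5), predict_bagging(ensemble, 9.5))
```

Random forests already resample in this way internally, which is consistent with `random_forest` being in the `EXCLUDED` list.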
In [42]:
EXCLUDED = ['catboost','neuralnetwork']

def ensamblar_algoritmos(posibilidades, n, tipo):
    # All n-sized combinations of the non-excluded models, named 'a+b+..._tipo'
    posibilidades = list(filter(lambda x:x[0] not in EXCLUDED,posibilidades))
    result = list(combinations(posibilidades,n))
    result_names = ['+'.join(m[0] for m in x) + f'_{tipo}' for x in result]
    return list(zip(result_names,result))

def ensambles_a_mano(posibilidades, nombres,tipo):
    # Hand-picked ensemble: find the combination that holds exactly the models in `nombres`
    result = list(combinations(posibilidades, len(nombres)))
    for r in result:
        if not r: continue
        names = [x[0] for x in r]
        if all([x in names for x in nombres]):
            result_names = '+'.join(nombres)
            if tipo=='both':
                return [(result_names+'_hard',r),(result_names+'_soft',r)]
            result_names += f"_{tipo}"
            return [(result_names,r)]
    return [] # No match: return an empty list so the caller's `+=` does not fail on None
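The `tipo` suffix distinguishes hard from soft voting. How the submission framework consumes these tuples is not shown in this notebook; the sketch below only illustrates, in pure Python with made-up probabilities, how the two voting rules can disagree on the same predictions. The function names are hypothetical.

```python
def hard_vote(probas, threshold=0.5):
    """Majority vote over each model's thresholded label."""
    labels = [int(p >= threshold) for p in probas]
    return int(sum(labels) * 2 > len(labels))


def soft_vote(probas, threshold=0.5):
    """Threshold the average of the predicted probabilities."""
    return int(sum(probas) / len(probas) >= threshold)


# Hypothetical positive-class probabilities from three models for one user.
probas = [0.9, 0.4, 0.4]

print("hard:", hard_vote(probas))
print("soft:", soft_vote(probas))
```

One confident model can outweigh two lukewarm ones under soft voting, whereas hard voting only counts labels. Soft voting also yields a continuous score, which is convenient when the metric of interest is AUC.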
In [43]:
posibilidades_algoritmos_y_ensambles += ensambles_a_mano(posibilidades_algoritmos,['random_forest','lightgbm','neuralnetwork'],'soft')
posibilidades_algoritmos_y_ensambles += ensambles_a_mano(posibilidades_algoritmos,['random_forest','xgboost','naive_bayes_Bernoulli'],'soft')

# Adds every possible pairwise ensemble.
# Beyond a few runs to get a sense of what works and what does not,
# it is not practical day to day: it is brute force and takes too long.

# posibilidades_algoritmos_y_ensambles += ensamblar_algoritmos(posibilidades_algoritmos,2,'soft') # `tipo` is required; 'soft' is assumed here, mirroring the ensembles above
In [44]:
print([x[0] for x in posibilidades_algoritmos_y_ensambles])
['decision_tree', 'random_forest', 'xgboost', 'KNN21', 'naive_bayes_Gaussian', 'naive_bayes_Bernoulli', 'naive_bayes_Multinomial', 'naive_bayes_Complement', 'lightgbm', 'neuralnetwork', 'decision_tree_bagging', 'xgboost_bagging', 'naive_bayes_Gaussian_bagging', 'naive_bayes_Bernoulli_bagging', 'naive_bayes_Multinomial_bagging', 'naive_bayes_Complement_bagging', 'lightgbm_bagging', 'random_forest+lightgbm+neuralnetwork_soft', 'random_forest+xgboost+naive_bayes_Bernoulli_soft']

All candidate feature sets

In [67]:
posibilidades_features = {
    'Cumulative Importance':feature_selection['best_features_progresivo'],
    'Forward Selection':feature_selection['best_features_forward'],
    'Backward Elimination':feature_selection['best_features_backward'],
    'Stepwise Regression':feature_selection['best_features_stepwise'],
    'Full Dataframe':[],
}
In [46]:
todo_junto = [x for f in posibilidades_features.values() for x in f]
intersec = list(set([x for x in todo_junto if todo_junto.count(x)>=2]))
posibilidades_features['Feature Intersection'] = intersec
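The cell above keeps every feature that appears at least twice across the selections, using `list.count` inside a comprehension, which rescans the whole list for each element. A sketch of the same idea with `collections.Counter` follows; the function name and toy data are hypothetical. Note one deliberate difference: each selection is deduplicated first, so a feature repeated inside a single list counts once.

```python
from collections import Counter


def features_in_at_least(selections, min_count=2):
    """Return the features present in at least `min_count` of the given selections."""
    counts = Counter(f for feats in selections.values() for f in set(feats))
    return sorted(f for f, c in counts.items() if c >= min_count)


# Toy selections (hypothetical); an empty list behaves like 'Full Dataframe' above.
selections = {
    "a": ["x", "y"],
    "b": ["y", "z"],
    "c": ["y", "x", "x"],
    "full": [],
}

print(features_in_at_least(selections))
```

Sorting the result also makes the feature order stable between runs, which a plain `set` does not guarantee.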
In [47]:
posibilidades_features['Seleccion a Mano (Boj)'] = ['total_checkouts_month_5',
                    'timestamp_last_checkout',
                    'timestamp_last_event',
                    'has_checkout_month_5',
                    'total_checkouts',
                    'days_to_last_event',
                    'total_checkouts_lw',
                    'total_checkouts_months_1_to_4',
                    'total_conversions',
                    'total_session_conversion',
                    'total_events',
                    'total_sessions',
                    'avg_events_per_session',
                    'total_session_checkout',
                    'has_checkout'
                    ]

posibilidades_features['Seleccion a Mano (Souto)'] = ['dow_last_conversion', 
                     'has_conversion_lw', 'total_conversions_month_4', 
                     'total_session_checkout', 'doy_last_conversion', 'timestamp_last_event', 
                     'dow_last_checkout', 'total_checkouts', 'has_checkout', 'doy_last_checkout', 
                     'has_checkout_month_1', 'timestamp_last_checkout', 'total_sessions', 
                     'woy_last_event', 'has_checkout_month_5', 'avg_events_per_session']

posibilidades_features['Seleccion a Mano (Chortas)'] = [
    'dow_last_conversion', 
    'timestamp_last_event',
    'timestamp_last_checkout',
    'timestamp_last_conversion',
    'timestamp_last_viewed_product',
    'days_to_last_event',
    'days_to_last_checkout',
    'days_to_last_conversion',
    'days_to_last_viewed_product',
    'total_brand_listings_lw',
    'total_viewed_products_lw',
    'total_checkouts_lw',
    'total_conversions_lw',
    'total_events_lw',
    'total_sessions_lw',
    'total_session_checkouts_lw',
    'total_session_conversions_lw',
    'total_events_ad_session_lw',
    'total_ad_sessions_lw',
    'has_checkout_lw',
    'has_conversion_lw',
    'percentage_last_week_activity',
    'percentage_last_week_brand_listings',
    'percentage_last_week_viewed_products',
    'percentage_last_week_checkouts',
    'percentage_last_week_conversions',
    'amount_of_months_that_has_bought'
]

posibilidades_features['Seleccion a Mano (FdM)'] = [ 'total_checkouts',
                                                     'total_conversions',
                                                     'total_events',
                                                     'total_sessions',
                                                     'total_session_checkout',
                                                     'total_session_conversion',
                                                     'total_events_ad_session',
                                                     'total_ad_sessions',
                                                     'avg_events_per_session',
                                                     'avg_events_per_ad_session',
                                                     'percentage_session_ad',
                                                     'has_checkout',
                                                     'has_conversion',
                                                     'total_viewed_products_month_1',
                                                     'total_checkouts_month_1',
                                                     'total_conversions_month_1',
                                                     'total_events_month_1',
                                                     'total_sessions_month_1',
                                                     'total_session_checkouts_month_1',
                                                     'total_session_conversions_month_1',
                                                     'total_events_ad_session_month_1',
                                                     'total_ad_sessions_month_1',
                                                     'has_checkout_month_1',
                                                     'has_conversion_month_1',
                                                     'total_viewed_products_month_2',
                                                     'total_checkouts_month_2',
                                                     'total_conversions_month_2',
                                                     'total_events_month_2',
                                                     'total_sessions_month_2',
                                                     'total_session_checkouts_month_2',
                                                     'total_session_conversions_month_2',
                                                     'total_events_ad_session_month_2',
                                                     'total_ad_sessions_month_2',
                                                     'has_checkout_month_2',
                                                     'has_conversion_month_2',
                                                     'total_viewed_products_month_3',
                                                     'total_checkouts_month_3',
                                                     'total_conversions_month_3',
                                                     'total_events_month_3',
                                                     'total_sessions_month_3',
                                                     'total_session_checkouts_month_3',
                                                     'total_session_conversions_month_3',
                                                     'total_events_ad_session_month_3',
                                                     'total_ad_sessions_month_3',
                                                     'has_checkout_month_3',
                                                     'has_conversion_month_3',
                                                     'total_viewed_products_month_4',
                                                     'total_checkouts_month_4',
                                                     'total_conversions_month_4',
                                                     'total_events_month_4',
                                                     'total_sessions_month_4',
                                                     'total_session_checkouts_month_4',
                                                     'total_session_conversions_month_4',
                                                     'total_events_ad_session_month_4',
                                                     'total_ad_sessions_month_4',
                                                     'has_checkout_month_4',
                                                     'has_conversion_month_4',
                                                     'total_viewed_products_month_5',
                                                     'total_checkouts_month_5',
                                                     'total_conversions_month_5',
                                                     'total_events_month_5',
                                                     'total_sessions_month_5',
                                                     'total_session_checkouts_month_5',
                                                     'total_session_conversions_month_5',
                                                     'total_events_ad_session_month_5',
                                                     'total_ad_sessions_month_5',
                                                     'has_checkout_month_5',
                                                     'has_conversion_month_5',
                                                     'total_viewed_products_months_1_to_4',
                                                     'total_checkouts_months_1_to_4',
                                                     'total_conversions_months_1_to_4',
                                                     'total_events_months_1_to_4',
                                                     'total_sessions_months_1_to_4',
                                                     'total_session_checkouts_months_1_to_4',
                                                     'total_session_conversions_months_1_to_4',
                                                     'total_events_ad_session_months_1_to_4',
                                                     'total_ad_sessions_months_1_to_4',
                                                     'has_checkout_months_1_to_4',
                                                     'has_conversion_months_1_to_4',
                                                     'total_viewed_products_lw',
                                                     'total_checkouts_lw',
                                                     'total_conversions_lw',
                                                     'total_events_lw',
                                                     'total_sessions_lw',
                                                     'total_session_checkouts_lw',
                                                     'total_session_conversions_lw',
                                                     'total_events_ad_session_lw',
                                                     'total_ad_sessions_lw',
                                                     'has_checkout_lw',
                                                     'has_conversion_lw',
                                                     'amount_of_months_that_has_bought',
                                                     'timestamp_last_event',
                                                     'timestamp_last_checkout',
                                                     'timestamp_last_conversion',
                                                     'timestamp_last_viewed_product',
                                                     'days_to_last_event',
                                                     'days_to_last_checkout',
                                                     'days_to_last_conversion',
                                                     'days_to_last_viewed_product',
                                                     'doy_last_event',
                                                     'dow_last_event',
                                                     'dom_last_event',
                                                     'woy_last_event',
                                                     'doy_last_checkout',
                                                     'dow_last_checkout',
                                                     'dom_last_checkout',
                                                     'woy_last_checkout',
                                                     'doy_last_conversion',
                                                     'dow_last_conversion',
                                                     'dom_last_conversion',
                                                     'woy_last_conversion',
                                                     'doy_last_viewed_product',
                                                     'dow_last_viewed_product',
                                                     'dom_last_viewed_product',
                                                     'woy_last_viewed_product',
                                                     'last_conversion_sku',
                                                     'last_conversion_price',
                                                     'percentage_last_week_activity',
                                                     'percentage_last_month_activity',
                                                     'days_between_last_event_and_checkout',
                                                     'percentage_regular_celphones_activity',
                                                     'var_viewed',
                                                     'conversion_gt_media'
]
In [48]:
cant_features = 30

posibilidades_features[f'{cant_features} Random Sample'] = random.sample(df_users.columns.tolist(),cant_features)
posibilidades_features[f'{cant_features} Random Sample 2'] = random.sample(df_users.columns.tolist(),cant_features)
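The two random samples above draw from the module-level random generator, which this notebook never seeds, so the selected columns change on every run. A minimal sketch of a reproducible variant follows, using a dedicated `random.Random` instance per sample. The helper name and the placeholder column names are hypothetical; 188 matches the column count shown for `df_users`.

```python
import random


def sample_features(columns, k, seed):
    """Draw k column names reproducibly with a dedicated, seeded RNG."""
    return random.Random(seed).sample(list(columns), k)


# Placeholder column names standing in for df_users.columns.tolist().
columns = [f"feature_{i}" for i in range(188)]

first = sample_features(columns, 30, seed=42)
second = sample_features(columns, 30, seed=43)
print(first[:3], second[:3])
```

Using a different seed per sample keeps the two random feature sets distinct while still letting a later run reproduce them.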
In [49]:
print([x for x in posibilidades_features])
['Cumulative Importance', 'Forward Selection', 'Backward Elimination', 'Stepwise Regression ', 'Full Dataframe', 'Feature Intersection', 'Seleccion a Mano (Boj)', 'Seleccion a Mano (Souto)', 'Seleccion a Mano (Chortas)', 'Seleccion a Mano (FdM)', '30 Random Sample', '30 Random Sample 2']

Combining both ideas

In [70]:
resultados = []
global_time = 0

norm = False

for forma, features in posibilidades_features.items():
    global_start = time.process_time()
    print("{: ^100}\n{: ^100s}".format(f"Tardó: {global_time:.2f}s",'-----------------------'))
    print(f'{forma}:\n')
    print(f'{features}\n\n')
    for nombre,algoritmo in posibilidades_algoritmos_y_ensambles:
        norm = True if ('NN' in nombre or 'neuralnetwork' in nombre) and ('+' not in nombre) else False
        print('\t * ',end='')
        model_with_name = (f'{nombre}',algoritmo)
        start = time.process_time()
        model, auc = SF.full_framework_wrapper(df_users, df_y, model_with_name, columns=features, normalize=norm)
        end = time.process_time()
        print(f'\t\t Tardó: {end-start:.2f}s')
        resultados.append((auc, forma, (nombre, algoritmo), features))
    global_end = time.process_time()
    global_time = global_end-global_start
                                            Tardó: 0.00s                                            
                                      -----------------------                                       
Cumulative Importance:

['timestamp_last_checkout', 'doy_last_checkout', 'woy_last_checkout', 'percentage_last_month_checkouts', 'has_checkout_month_5', 'total_checkouts_month_5', 'days_between_last_event_and_checkout', 'kmeans_5', 'total_events_ad_session', 'dom_last_checkout', 'total_events', 'days_to_last_checkout', 'kmeans_3', 'total_events_month_5', 'total_events_ad_session_month_5', 'total_viewed_products', 'dow_last_checkout', 'total_sessions', 'total_viewed_products_month_5', 'timestamp_last_event', 'total_events_lw', 'total_events_ad_session_lw', 'total_checkouts', 'total_max_viewed_product', 'total_ad_sessions', 'percentage_last_week_checkouts', 'percentage_last_month_activity', 'total_last_week_max_viewed_brand', 'days_to_last_viewed_product', 'doy_last_event', 'days_to_last_event', 'avg_events_per_ad_session', 'dom_last_event', 'total_sessions_month_5', 'avg_events_per_session', 'kmeans_6', 'total_viewed_products_months_1_to_4', 'total_viewed_products_lw', 'total_sessions_last_week', 'var_viewed', 'total_ad_sessions_month_5', 'woy_last_event', 'days_to_last_conversion', 'total_brand_listings', 'percentage_max_viewed_product', 'total_events_months_1_to_4', 'total_brand_listings_lw', 'total_sessions_lw', 'total_last_week_max_viewed_model', 'has_checkout_lw', 'total_checkouts_lw', 'total_sessions_months_1_to_4', 'percentage_last_week_viewed_products', 'total_session_checkout', 'percentage_regular_celphones_activity', 'percentage_last_month_viewed_products', 'percentage_last_week_activity', 'percentage_last_week_brand_listings', 'percentage_last_week_max_viewed_brand', 'percentage_last_week_max_viewed_model', 'dow_last_event', 'total_ad_sessions_lw', 'percentage_session_ad', 'total_sessions_month_1', 'total_conversions_month_2', 'total_session_checkouts_month_1', 'total_events_month_1', 'doy_last_conversion', 'total_checkouts_month_2', 'total_ad_sessions_month_1', 'total_session_conversions_month_1', 'total_events_ad_session_month_1', 'total_viewed_products_month_2', 
'dom_last_conversion', 'has_checkout_month_1', 'has_conversion_month_1', 'dow_last_conversion', 'dom_last_viewed_product', 'woy_last_conversion', 'conversion_gt_media', 'total_conversions', 'total_session_conversion', 'cant_visitas_faq_ecommerce', 'cant_visitas_customer_service', 'has_event_last_week', 'ratio_sessions_last_week_over_total', 'percentage_session_conversion', 'cant_viewed_brand_last_conversion', 'has_checkout', 'has_conversion', 'doy_last_viewed_product', 'total_viewed_products_month_1', 'total_checkouts_month_1', 'percentage_last_month_conversions', 'total_conversions_month_1', 'last_conversion_price', 'last_conversion_sku', 'woy_last_viewed_product', 'timestamp_last_conversion', 'dow_last_viewed_product', 'timestamp_last_viewed_product', 'total_events_ad_session_month_3', 'total_events_month_2', 'total_ad_sessions_month_4', 'total_session_conversions_month_5', 'total_session_checkouts_month_5', 'total_events_month_3', 'total_sessions_month_3', 'total_conversions_month_5']


	 * Model: random_forest - AUC: 0.8443 - AUCPR:0.1678 - Accuracy: 0.9496 
		 Tardó: 10.58s
	 * Model: xgboost - AUC: 0.8636 - AUCPR:0.2545 - Accuracy: 0.9496 
		 Tardó: 2.79s
                                           Tardó: 13.37s                                            
                                      -----------------------                                       
Forward Selection:

['doy_last_checkout', 'total_events', 'dom_last_event', 'percentage_session_ad', 'avg_events_per_session', 'total_events_lw']


	 * Model: random_forest - AUC: 0.8563 - AUCPR:0.2025 - Accuracy: 0.9496 
		 Tardó: 3.46s
	 * Model: xgboost - AUC: 0.8587 - AUCPR:0.2124 - Accuracy: 0.9496 
		 Tardó: 0.40s
                                            Tardó: 3.86s                                            
                                      -----------------------                                       
Backward Elimination:

['total_conversions', 'total_events', 'total_session_conversion', 'total_events_ad_session', 'total_ad_sessions', 'avg_events_per_session', 'avg_events_per_ad_session', 'percentage_session_ad', 'percentage_session_conversion', 'has_checkout', 'has_conversion', 'total_viewed_products_month_1', 'total_checkouts_month_1', 'total_conversions_month_1', 'total_events_month_1', 'total_sessions_month_1', 'total_session_checkouts_month_1', 'total_session_conversions_month_1', 'total_events_ad_session_month_1', 'total_ad_sessions_month_1', 'has_checkout_month_1', 'has_conversion_month_1', 'total_viewed_products_month_2', 'total_checkouts_month_2', 'total_conversions_month_2', 'total_events_month_2', 'total_sessions_month_2', 'total_session_checkouts_month_2', 'total_session_conversions_month_2', 'total_events_ad_session_month_2', 'total_ad_sessions_month_2', 'has_checkout_month_2', 'has_conversion_month_2', 'total_viewed_products_month_3', 'total_checkouts_month_3', 'total_conversions_month_3', 'total_events_month_3', 'total_sessions_month_3', 'total_session_checkouts_month_3', 'total_session_conversions_month_3', 'total_events_ad_session_month_3', 'total_ad_sessions_month_3', 'has_checkout_month_3', 'has_conversion_month_3', 'total_viewed_products_month_4', 'total_checkouts_month_4', 'total_conversions_month_4', 'total_session_checkouts_month_4', 'total_session_conversions_month_4', 'total_events_ad_session_month_4', 'total_ad_sessions_month_4', 'has_checkout_month_4', 'has_conversion_month_4', 'total_viewed_products_month_5', 'total_events_month_5', 'total_sessions_month_5', 'total_session_checkouts_month_5', 'total_session_conversions_month_5', 'total_events_ad_session_month_5', 'total_ad_sessions_month_5', 'total_checkouts_months_1_to_4', 'total_conversions_months_1_to_4', 'total_session_conversions_months_1_to_4', 'total_events_ad_session_months_1_to_4', 'total_ad_sessions_months_1_to_4', 'has_conversion_months_1_to_4', 'amount_of_months_that_has_bought', 
'timestamp_last_event', 'days_to_last_event', 'days_to_last_conversion', 'days_to_last_viewed_product', 'doy_last_event', 'dow_last_event', 'dom_last_event', 'woy_last_event', 'doy_last_checkout', 'woy_last_checkout', 'dow_last_conversion', 'dom_last_conversion', 'woy_last_conversion', 'dow_last_viewed_product', 'dom_last_viewed_product', 'woy_last_viewed_product', 'last_conversion_price', 'percentage_last_week_activity', 'percentage_last_week_conversions', 'percentage_last_week_viewed_products', 'percentage_last_month_activity', 'percentage_last_month_checkouts', 'percentage_regular_celphones_activity', 'var_viewed', 'conversion_gt_media', 'total_max_viewed_product', 'cant_viewed_brand_last_conversion', 'ratio_sessions_last_week_over_total', 'has_event_last_week', 'cant_visitas_customer_service', 'cant_visitas_faq_ecommerce']


	 * Model: random_forest - AUC: 0.8510 - AUCPR:0.1967 - Accuracy: 0.9496 
		 Took: 8.51s
	 * Model: xgboost - AUC: 0.8669 - AUCPR:0.2624 - Accuracy: 0.9496 
		 Took: 1.68s
                                            Took: 10.19s                                            
                                      -----------------------                                       
Stepwise Regression:

['doy_last_checkout', 'total_events', 'dom_last_event', 'percentage_session_ad', 'avg_events_per_session', 'total_events_lw']


	 * Model: random_forest - AUC: 0.8563 - AUCPR:0.2025 - Accuracy: 0.9496 
		 Took: 3.21s
	 * Model: xgboost - AUC: 0.8587 - AUCPR:0.2124 - Accuracy: 0.9496 
		 Took: 0.49s
                                            Took: 3.70s                                             
                                      -----------------------                                       
Full Dataframe:

[]


	 * Model: random_forest - AUC: 0.8420 - AUCPR:0.1634 - Accuracy: 0.9496 
		 Took: 15.80s
	 * Model: xgboost - AUC: 0.8599 - AUCPR:0.2521 - Accuracy: 0.9491 
		 Took: 3.53s
In [71]:
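# Sort by AUC, then selection method, then model name; the explicit key keeps
# Python from ever comparing the model objects themselves when the other fields tie.
resultados.sort(key=lambda x: (x[0], x[1], x[2][0]), reverse=True)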
display([(x[0],x[1],x[2][0]) for x in resultados])
[(0.8669218835938682, 'Backward Elimination', 'xgboost'),
 (0.8635511708262187, 'Cumulative Importance', 'xgboost'),
 (0.8598891648508751, 'Full Dataframe', 'xgboost'),
 (0.8587055466442831, 'Stepwise Regression', 'xgboost'),
 (0.8587055466442831, 'Forward Selection', 'xgboost'),
 (0.8563304050700351, 'Stepwise Regression', 'random_forest'),
 (0.8563304050700351, 'Forward Selection', 'random_forest'),
 (0.8510466433248821, 'Backward Elimination', 'random_forest'),
 (0.844323663165399, 'Cumulative Importance', 'random_forest'),
 (0.8420273336514561, 'Full Dataframe', 'random_forest')]
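
A minimal sketch of the ranking step, assuming each entry has the layout (auc, selection_method, (model_name, model), features) that the unpacking in the next cell suggests. The AUC values are rounded from the output above; the `object()` placeholders and the one-element feature lists are illustrative stand-ins, not the real models or feature sets.

```python
# Illustrative entries: (auc, selection_method, (model_name, model), features)
results = [
    (0.8599, "Full Dataframe", ("xgboost", object()), []),
    (0.8669, "Backward Elimination", ("xgboost", object()), ["total_events"]),
    (0.8587, "Stepwise Regression", ("xgboost", object()), ["doy_last_checkout"]),
    (0.8587, "Forward Selection", ("xgboost", object()), ["doy_last_checkout"]),
]

# Rank by AUC (descending); ties fall back to method and model name,
# so the placeholder model objects are never compared with each other.
ranked = sorted(results, key=lambda r: (r[0], r[1], r[2][0]), reverse=True)

# The first entry is the candidate for the final submission.
best_auc, best_method, (best_name, best_model), best_features = ranked[0]
print(f"{best_name} - {best_method} - {best_auc:.4f}")
```

Sorting on an explicit key rather than on the whole tuple matters only when two entries tie on every comparable field, but in that case a plain tuple sort would try to compare the model objects and fail.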

Final Run

For the final submission, the model is trained on the full X rather than on X_train only.
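
A minimal sketch of this evaluate-then-refit pattern, assuming scikit-learn and a synthetic imbalanced dataset in place of df_users. RandomForestClassifier stands in for whichever model wins, and all variable names here are illustrative; this is not the code of the Submission Framework.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labeled users (about 5% positives); not the Trocafone data.
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95], random_state=42)

# 1) Estimate AUC on a hold-out split, training on X_train only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)
holdout_model = RandomForestClassifier(n_estimators=50, random_state=42)
holdout_model.fit(X_train, y_train)
holdout_auc = roc_auc_score(y_test, holdout_model.predict_proba(X_test)[:, 1])

# 2) Refit the same configuration on every labeled row (X, not X_train).
final_model = RandomForestClassifier(n_estimators=50, random_state=42)
final_model.fit(X, y)

# 3) Score the users to predict; the first five rows are only a placeholder here.
X_to_predict = X[:5]
final_probabilities = final_model.predict_proba(X_to_predict)[:, 1]

print(f"Hold-out AUC: {holdout_auc:.4f}")
print(final_probabilities)
```

Once the model has been refit on all labeled rows there is nothing left to evaluate it on, so the hold-out AUC from step 1 is the only local estimate available for the submitted model.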

In [72]:
max_auc, campeon_forma, (campeon_nombre, campeon_algoritmo), campeon_features = resultados[0]
display(f"Best bet: {campeon_nombre} ({max_auc:.4f} AUC) - Feature selection: {campeon_forma}")
display(f"Features: {campeon_features}")
'Best bet: xgboost (0.8669 AUC) - Feature selection: Backward Elimination'
"Features: ['total_conversions', 'total_events', 'total_session_conversion', 'total_events_ad_session', 'total_ad_sessions', 'avg_events_per_session', 'avg_events_per_ad_session', 'percentage_session_ad', 'percentage_session_conversion', 'has_checkout', 'has_conversion', 'total_viewed_products_month_1', 'total_checkouts_month_1', 'total_conversions_month_1', 'total_events_month_1', 'total_sessions_month_1', 'total_session_checkouts_month_1', 'total_session_conversions_month_1', 'total_events_ad_session_month_1', 'total_ad_sessions_month_1', 'has_checkout_month_1', 'has_conversion_month_1', 'total_viewed_products_month_2', 'total_checkouts_month_2', 'total_conversions_month_2', 'total_events_month_2', 'total_sessions_month_2', 'total_session_checkouts_month_2', 'total_session_conversions_month_2', 'total_events_ad_session_month_2', 'total_ad_sessions_month_2', 'has_checkout_month_2', 'has_conversion_month_2', 'total_viewed_products_month_3', 'total_checkouts_month_3', 'total_conversions_month_3', 'total_events_month_3', 'total_sessions_month_3', 'total_session_checkouts_month_3', 'total_session_conversions_month_3', 'total_events_ad_session_month_3', 'total_ad_sessions_month_3', 'has_checkout_month_3', 'has_conversion_month_3', 'total_viewed_products_month_4', 'total_checkouts_month_4', 'total_conversions_month_4', 'total_session_checkouts_month_4', 'total_session_conversions_month_4', 'total_events_ad_session_month_4', 'total_ad_sessions_month_4', 'has_checkout_month_4', 'has_conversion_month_4', 'total_viewed_products_month_5', 'total_events_month_5', 'total_sessions_month_5', 'total_session_checkouts_month_5', 'total_session_conversions_month_5', 'total_events_ad_session_month_5', 'total_ad_sessions_month_5', 'total_checkouts_months_1_to_4', 'total_conversions_months_1_to_4', 'total_session_conversions_months_1_to_4', 'total_events_ad_session_months_1_to_4', 'total_ad_sessions_months_1_to_4', 'has_conversion_months_1_to_4', 'amount_of_months_that_has_bought', 
'timestamp_last_event', 'days_to_last_event', 'days_to_last_conversion', 'days_to_last_viewed_product', 'doy_last_event', 'dow_last_event', 'dom_last_event', 'woy_last_event', 'doy_last_checkout', 'woy_last_checkout', 'dow_last_conversion', 'dom_last_conversion', 'woy_last_conversion', 'dow_last_viewed_product', 'dom_last_viewed_product', 'woy_last_viewed_product', 'last_conversion_price', 'percentage_last_week_activity', 'percentage_last_week_conversions', 'percentage_last_week_viewed_products', 'percentage_last_month_activity', 'percentage_last_month_checkouts', 'percentage_regular_celphones_activity', 'var_viewed', 'conversion_gt_media', 'total_max_viewed_product', 'cant_viewed_brand_last_conversion', 'ratio_sessions_last_week_over_total', 'has_event_last_week', 'cant_visitas_customer_service', 'cant_visitas_faq_ecommerce']"
In [73]:
print(f"{campeon_nombre} - {campeon_forma} - {max_auc:.4f}")
# Normalize inputs only for neural-network models whose name contains no '+'.
norm = ('NN' in campeon_nombre or 'neuralnetwork' in campeon_nombre) and '+' not in campeon_nombre
campeon_model, campeon_auc, csv_name, campeon_message = SF.full_framework_wrapper(df_users, 
                                                                                    df_y, 
                                                                                    (campeon_nombre,campeon_algoritmo),
                                                                                    columns=campeon_features,
                                                                                    submit=True,
                                                                                    all_in=True,
                                                                                    normalize=norm)   

#!kaggle competitions submit -f {csv_name} -m "{campeon_message}" trocafone
xgboost - Backward Elimination - 0.8669
Model: xgboost_all_in - AUC: 0.8669 - AUCPR:0.2624 - Accuracy: 0.9496 
'submission-xgboost_all_in-0.8669.csv'
"xgboost_all_in - {'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 0.65, 'colsample_bytree': 0.7500000000000001, 'gamma': 0.0, 'learning_rate': 0.1, 'max_delta_step': 0, 'max_depth': 4, 'min_child_weight': 5, 'missing': None, 'n_estimators': 16, 'n_jobs': 1, 'nthread': None, 'objective': 'binary:logistic', 'random_state': 42, 'reg_alpha': 0, 'reg_lambda': 1, 'scale_pos_weight': 2, 'seed': None, 'silent': True, 'subsample': 0.7} - ['total_conversions', 'total_events', 'total_session_conversion', 'total_events_ad_session', 'total_ad_sessions', 'avg_events_per_session', 'avg_events_per_ad_session', 'percentage_session_ad', 'percentage_session_conversion', 'has_checkout', 'has_conversion', 'total_viewed_products_month_1', 'total_checkouts_month_1', 'total_conversions_month_1', 'total_events_month_1', 'total_sessions_month_1', 'total_session_checkouts_month_1', 'total_session_conversions_month_1', 'total_events_ad_session_month_1', 'total_ad_sessions_month_1', 'has_checkout_month_1', 'has_conversion_month_1', 'total_viewed_products_month_2', 'total_checkouts_month_2', 'total_conversions_month_2', 'total_events_month_2', 'total_sessions_month_2', 'total_session_checkouts_month_2', 'total_session_conversions_month_2', 'total_events_ad_session_month_2', 'total_ad_sessions_month_2', 'has_checkout_month_2', 'has_conversion_month_2', 'total_viewed_products_month_3', 'total_checkouts_month_3', 'total_conversions_month_3', 'total_events_month_3', 'total_sessions_month_3', 'total_session_checkouts_month_3', 'total_session_conversions_month_3', 'total_events_ad_session_month_3', 'total_ad_sessions_month_3', 'has_checkout_month_3', 'has_conversion_month_3', 'total_viewed_products_month_4', 'total_checkouts_month_4', 'total_conversions_month_4', 'total_session_checkouts_month_4', 'total_session_conversions_month_4', 'total_events_ad_session_month_4', 'total_ad_sessions_month_4', 'has_checkout_month_4', 'has_conversion_month_4', 'total_viewed_products_month_5', 
'total_events_month_5', 'total_sessions_month_5', 'total_session_checkouts_month_5', 'total_session_conversions_month_5', 'total_events_ad_session_month_5', 'total_ad_sessions_month_5', 'total_checkouts_months_1_to_4', 'total_conversions_months_1_to_4', 'total_session_conversions_months_1_to_4', 'total_events_ad_session_months_1_to_4', 'total_ad_sessions_months_1_to_4', 'has_conversion_months_1_to_4', 'amount_of_months_that_has_bought', 'timestamp_last_event', 'days_to_last_event', 'days_to_last_conversion', 'days_to_last_viewed_product', 'doy_last_event', 'dow_last_event', 'dom_last_event', 'woy_last_event', 'doy_last_checkout', 'woy_last_checkout', 'dow_last_conversion', 'dom_last_conversion', 'woy_last_conversion', 'dow_last_viewed_product', 'dom_last_viewed_product', 'woy_last_viewed_product', 'last_conversion_price', 'percentage_last_week_activity', 'percentage_last_week_conversions', 'percentage_last_week_viewed_products', 'percentage_last_month_activity', 'percentage_last_month_checkouts', 'percentage_regular_celphones_activity', 'var_viewed', 'conversion_gt_media', 'total_max_viewed_product', 'cant_viewed_brand_last_conversion', 'ratio_sessions_last_week_over_total', 'has_event_last_week', 'cant_visitas_customer_service', 'cant_visitas_faq_ecommerce']"
In [ ]:
# Burn n submissions end to end

# for resultado in resultados:
#     print(f"\n\n{resultado[2][0]} - {resultado[1]} - {resultado[0]:.4f}\n\n")
#     max_auc, campeon_forma, (campeon_nombre, campeon_algoritmo), campeon_features = resultado
#     norm = True if ('NN' in campeon_nombre or 'neuralnetwork' in campeon_nombre) and ('+' not in campeon_nombre) else False
#     campeon_model, campeon_auc, csv_name, campeon_message = SF.full_framework_wrapper(df_users, 
#                                                                                     df_y, 
#                                                                                     (campeon_nombre,campeon_algoritmo),
#                                                                                     columns=campeon_features,
#                                                                                     submit=True,
#                                                                                     all_in=True,
#                                                                                     normalize=norm)   
#     !kaggle competitions submit -f {csv_name} -m "{campeon_message}" trocafone
#     sleep(10)
#     print()
In [74]:
#!kaggle competitions leaderboard -d trocafone
#!unzip -o trocafone.zip
#print('\n\nLast Best Score')
#!cat trocafone-publicleaderboard.csv | grep Datatouille | tail -n 1 | awk '{split($0,a,","); print "\t Date: " a[3] ; print "\t Score: " a[4]}'
Warning: Your Kaggle API key is readable by other users on this system! To fix this, you can run 'chmod 600 /home/delmazo/.kaggle/kaggle.json'
Downloading trocafone.zip to /home/delmazo/Desktop/7506-Datos/TP2
  0%|                                               | 0.00/5.82k [00:00<?, ?B/s]
100%|██████████████████████████████████████| 5.82k/5.82k [00:00<00:00, 1.84MB/s]
Archive:  trocafone.zip
  inflating: trocafone-publicleaderboard.csv  
Last Best Score
	 Date: "2018-11-29 22:33:11"
	 Score: 0.87196
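
The shell pipeline above can also be sketched in Python with the standard csv module. This assumes, as the awk script implies, that the third and fourth comma-separated fields of the leaderboard file are the submission date and the score. The header names, the team ids and every sample row except the last are made up for illustration, and `last_team_entry` is a hypothetical helper.

```python
import csv
import io


def last_team_entry(csv_text, team_name):
    """Return (date, score) from the last row whose team field contains team_name, or None."""
    last = None
    for row in csv.reader(io.StringIO(csv_text)):
        # Assumed layout: id, team name, submission date, score.
        if len(row) >= 4 and team_name in row[1]:
            last = (row[2], row[3])
    return last


# Illustrative file contents; only the last row reuses values shown in the output above.
sample = """TeamId,TeamName,SubmissionDate,Score
1,Datatouille,"2018-11-20 10:00:00",0.85000
2,OtherTeam,"2018-11-21 09:30:00",0.86000
1,Datatouille,"2018-11-29 22:33:11",0.87196
"""

entry = last_team_entry(sample, "Datatouille")
if entry is not None:
    date, score = entry
    print(f"\t Date: {date}")
    print(f"\t Score: {score}")
```

Unlike the awk split on commas, csv.reader strips the quotes around the date and would handle team names that themselves contain commas.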