Roc courbe et point de coupure. Python.

Question

J'utilise un modèle logistique et j'ai prédit les valeurs logit. J'ai utilisé :

 from sklearn import metrics fpr, tpr, thresholds = metrics.roc_curve(Y_test,p)

Je sais que metric.roc_auc_score donnera l'aire sous la courbe mais quelqu'un peut-il me dire quelle est la commande permettant de trouver le point de coupure optimal (valeur seuil).

Manohar Swamynathan · Accepted Answer

Bien que sa réponse tardive, la pensée pourrait être utile. Vous pouvez le faire en utilisant le paquetage epi dans R (ici!) , mais je n’ai pas pu trouver de paquet similaire ni d’exemple en python.

Le point de coupure optimal serait où true positive rate est high et le false positive rate est low. Sur la base de cette logique, j'ai tiré un exemple ci-dessous pour trouver le seuil optimal.

Code python:

import pandas as pd import statsmodels.api as sm import pylab as pl import numpy as np from sklearn.metrics import roc_curve, auc # read the data in df = pd.read_csv("http://www.ats.ucla.edu/stat/data/binary.csv") # rename the 'rank' column because there is also a DataFrame method called 'rank' df.columns = ["admit", "gre", "gpa", "prestige"] # dummify rank dummy_ranks = pd.get_dummies(df['prestige'], prefix='prestige') # create a clean data frame for the regression cols_to_keep = ['admit', 'gre', 'gpa'] data = df[cols_to_keep].join(dummy_ranks.ix[:, 'prestige_2':]) # manually add the intercept data['intercept'] = 1.0 train_cols = data.columns[1:] # fit the model result = sm.Logit(data['admit'], data[train_cols]).fit() print result.summary() # Add prediction to dataframe data['pred'] = result.predict(data[train_cols]) fpr, tpr, thresholds =roc_curve(data['admit'], data['pred']) roc_auc = auc(fpr, tpr) print("Area under the ROC curve : %f" % roc_auc) #################################### # The optimal cut off would be where tpr is high and fpr is low # tpr - (1-fpr) is zero or near to zero is the optimal cut off point #################################### i = np.arange(len(tpr)) # index for df roc = pd.DataFrame({'fpr' : pd.Series(fpr, index=i),'tpr' : pd.Series(tpr, index = i), '1-fpr' : pd.Series(1-fpr, index = i), 'tf' : pd.Series(tpr - (1-fpr), index = i), 'thresholds' : pd.Series(thresholds, index = i)}) roc.ix[(roc.tf-0).abs().argsort()[:1]] # Plot tpr vs 1-fpr fig, ax = pl.subplots() pl.plot(roc['tpr']) pl.plot(roc['1-fpr'], color = 'red') pl.xlabel('1-False Positive Rate') pl.ylabel('True Positive Rate') pl.title('Receiver operating characteristic') ax.set_xticklabels([])

Le point de coupure optimal est 0,317628, donc tout ce qui se situe au-dessus de ce chiffre peut être qualifié de 1, sinon de 0. Vous pouvez voir dans le graphique/sortie que si tpr traverse 1-fpr, tpr est à 63%, fpr à 36% et tpr 1-fpr) est le plus proche de zéro dans l'exemple actuel.

Sortie:

 1-fpr fpr tf thresholds tpr 171 0.637363 0.362637 0.000433 0.317628 0.637795

J'espère que c'est utile.

Modifier

Pour simplifier et permettre la réutilisation, j'ai créé une fonction permettant de trouver le seuil de probabilité optimal.

Code Python:

def Find_Optimal_Cutoff(target, predicted): """ Find the optimal probability cutoff point for a classification model related to event rate Parameters ---------- target : Matrix with dependent or target data, where rows are observations predicted : Matrix with predicted data, where rows are observations Returns ------- list type, with optimal cutoff value """ fpr, tpr, threshold = roc_curve(target, predicted) i = np.arange(len(tpr)) roc = pd.DataFrame({'tf' : pd.Series(tpr-(1-fpr), index=i), 'threshold' : pd.Series(threshold, index=i)}) roc_t = roc.ix[(roc.tf-0).abs().argsort()[:1]] return list(roc_t['threshold']) # Add prediction probability to dataframe data['pred_proba'] = result.predict(data[train_cols]) # Find optimal probability threshold threshold = Find_Optimal_Cutoff(data['admit'], data['pred_proba']) print threshold # [0.31762762459360921] # Find prediction to the dataframe applying threshold data['pred'] = data['pred_proba'].map(lambda x: 1 if x > threshold else 0) # Print confusion Matrix from sklearn.metrics import confusion_matrix confusion_matrix(data['admit'], data['pred']) # array([[175, 98], # [ 46, 81]])

cgnorthcutt · Answer

Soit tpr, fpr, les seuils de votre question, la réponse pour le seuil optimal est simplement:

optimal_idx = np.argmax(tpr - fpr) optimal_threshold = thresholds[optimal_idx]

lee · Answer

Implémentation Vanilla Python du J-Score de Youden

def cutoff_youdens_j(fpr,tpr,thresholds): j_scores = tpr-fpr j_ordered = sorted(Zip(j_scores,thresholds)) return j_ordered[-1][1]

j35t3r · Answer

Le poste de cgnorthcutt

Soit tpr, fpr, les seuils de votre question, la réponse pour le seuil optimal est simplement:

optimal_idx = np.argmax (tpr - fpr) optimal_threshold = seuils [optimal_idx]

est presque correct. La valeur abs doit être prise.

optimal_idx = np.argmax(np.abs(tpr - fpr)) optimal_threshold = thresholds[optimal_idx]

Selon la référence mentionnée -> http://www.medicalbiostatistics.com/roccurve.pdf p.6, j’ai trouvé une autre possibilité:

opt_idx = np.argmin (np.sqrt (np.square (1-tpr) + np.square (fpr)))