You have a job to do. The business needs a model to detect fraudulent credit card transactions. They hand you a dataset. You dutifully load it up, and the first thing you do, as any good craftsman would, is inspect your materials. You find that a mere 0.1% of the transactions are labeled as fraudulent. The other 99.9% are legitimate. This is a classic imbalanced dataset. A minefield for the unwary.
Your first instinct might be to reach for a standard classifier and measure its success with the most common metric of all: accuracy. It seems reasonable. It’s the default in many cases. So, you build a quick model. Let’s not even use a sophisticated one. Let’s use the most naive model imaginable, one that simply learns the most frequent class and predicts it every single time. In our case, it will always predict “not fraudulent”.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Create a highly imbalanced dataset
X, y = make_classification(n_samples=10000, n_features=20, n_informative=2,
                           n_redundant=10, n_classes=2, n_clusters_per_class=1,
                           weights=[0.999, 0.001], flip_y=0, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42, stratify=y)

# A naive classifier that always predicts the majority class
dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(X_train, y_train)
y_pred = dummy_clf.predict(X_test)

# Let's check the accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.4f}")
The code executes, and you see the result: Model Accuracy: 0.9990. A stunning 99.9% accuracy! A number that high would get you a promotion in a lesser shop. But you know better. You feel a cold dread, because you understand what this number is truly telling you. It’s telling you that your model is completely, utterly, and catastrophically useless.
This model, with its near-perfect accuracy, has never identified a single fraudulent transaction. Not one. Its sole “skill” is to parrot the most common label. We can prove this by checking a different metric, one that cares about finding the positive cases we’re looking for. Let’s measure the recall for the positive class (fraud).
# Now, let's check what really matters: recall for the positive class
recall = recall_score(y_test, y_pred)
print(f"Model Recall: {recall:.4f}")
The output is a stark and damning Model Recall: 0.0000. Zero. A perfect zero. Our model has an accuracy of 99.9% and a recall of 0%. It is a perfect failure disguised as a spectacular success. This is the fundamental lie of standard metrics when misapplied. They answer a question, but it is often not the question we should be asking. The accuracy score answered, “What percentage of transactions did you label correctly?” The real business question was, “How much of the fraud are you successfully catching?” The first answer is a vanity metric; the second is a measure of value.
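You can see the anatomy of this failure directly in the confusion matrix. The quick check below is a sketch that reuses y_test and y_pred from the code above:

from sklearn.metrics import confusion_matrix

# Rows are true classes, columns are predicted classes: [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print(f"True negatives:  {tn}")
print(f"False positives: {fp}")
print(f"False negatives: {fn}")   # every fraudulent transaction, missed
print(f"True positives:  {tp}")   # not a single fraud caught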
The problem isn’t with scikit-learn. The problem isn’t the algorithm. The problem is our choice of measurement. When the cost of a false negative (missing a fraudulent transaction) is thousands of times greater than the cost of a false positive (flagging a good transaction for review), a metric that treats all errors equally is not just insufficient; it is actively dangerous. It leads you down a path of false confidence and delivers a system that fails at its one critical task. We must, therefore, reject these defaults. We must define what success truly means for the specific problem at hand and encode that definition into a custom measure of performance.
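As a first pass, encoding that definition can be as simple as a hand-written cost function over the predicted labels, wrapped with make_scorer so scikit-learn's tooling can use it. The sketch below assumes a flat penalty of 100 for every missed fraud and 1 for every false alarm; those numbers are placeholders for whatever your business actually loses:

from sklearn.metrics import make_scorer
import numpy as np

def fraud_cost(y_true, y_pred):
    """Return the negated total cost: 100 per missed fraud, 1 per false alarm."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    missed_fraud = np.sum((y_true == 1) & (y_pred == 0))   # false negatives
    false_alarms = np.sum((y_true == 0) & (y_pred == 1))   # false positives
    return -(100 * missed_fraud + 1 * false_alarms)

# Higher (less negative) is better, which is make_scorer's default assumption
fraud_cost_scorer = make_scorer(fraud_cost)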
Scorers with Greater Needs
The simple scorer you forged was a solid piece of work. It took y_true and y_pred and produced a number that meant something to the business. A fine tool. But some jobs require more specialized instruments. A simple comparison of predicted labels to true labels throws away a wealth of information.
What information? The model’s certainty. A good classifier doesn’t just spit out a binary 0 or 1. Under the hood, it calculates a probability. It says, “I am 95% certain this is fraud,” or “I am 51% certain this is fraud.” Both predictions, when forced through a 50% threshold, become a 1. But they are not the same. One is a confident assertion; the other is barely a guess. A metric that treats them identically is a blunt instrument. We need a sharper one.
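To see the difference the threshold erases, here is a small sketch; the LogisticRegression is only a stand-in for any classifier that exposes predict_proba, fitted on the training split from earlier:

from sklearn.linear_model import LogisticRegression

# Any model with predict_proba would do; logistic regression is just an example
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# Column 1 holds the predicted probability of the positive class (fraud)
proba_fraud = clf.predict_proba(X_test)[:, 1]

# A 0.51 and a 0.95 both collapse to the same hard label at a 0.5 threshold
hard_labels = (proba_fraud > 0.5).astype(int)
print(proba_fraud[:5].round(3))
print(hard_labels[:5])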
Let’s imagine a scenario where we want to reward the model not just for being right, but for being confidently right. We can create a scorer that looks at the predicted probabilities directly. Scikit-learn provides a clean way to do this. When you build a scorer with make_scorer, you can tell it that your function needs these probabilities instead of the final predictions. You simply set the needs_proba flag to True (recent scikit-learn releases express the same thing with response_method="predict_proba").
Consider this custom scoring function. It heavily penalizes false negatives, lightly penalizes false positives, but gives the highest reward for correctly identifying a fraudulent transaction with high confidence.
from sklearn.metrics import make_scorer
import numpy as np

def confidence_based_reward(y_true, y_proba):
    """
    A custom scorer that rewards confident, correct predictions of the positive class.
    - Correctly identified fraud (TP): reward is proportional to confidence.
    - Missed fraud (FN): heavy, fixed penalty.
    - Incorrectly flagged transaction (FP): small, fixed penalty.
    - Correctly identified non-fraud (TN): zero score, no reward or penalty.
    """
    reward = 0
    # Probabilities for the positive class (fraud).
    # Note: for a binary problem, the scorer machinery may hand us either the
    # full (n_samples, 2) array or just the positive-class column, so handle both.
    y_proba = np.asarray(y_proba)
    proba_fraud = y_proba[:, 1] if y_proba.ndim == 2 else y_proba

    for true_label, prob in zip(y_true, proba_fraud):
        if true_label == 1 and prob > 0.5:  # True Positive
            # Reward is higher for more confident predictions
            reward += (1 + prob)
        elif true_label == 1 and prob <= 0.5:  # False Negative
            # Heavy penalty for missing fraud
            reward -= 100
        elif true_label == 0 and prob > 0.5:  # False Positive
            # Small penalty for a false alarm
            reward -= 1
        # True Negatives (true_label == 0 and prob <= 0.5) get a score of 0

    return reward

# Create the scorer, telling it we need probabilities
# (scikit-learn >= 1.4 prefers response_method="predict_proba" over needs_proba)
confidence_scorer = make_scorer(confidence_based_reward, needs_proba=True)
Here, our function confidence_based_reward no longer accepts y_pred. Instead, it receives y_proba, the probability estimates. Called with the raw output of predict_proba, that is an array with shape (n_samples, n_classes); when scikit-learn invokes the scorer on a binary problem, it may pass only the positive-class column, which is why the function above handles either shape. We then define our business logic directly on these probabilities. We have encoded a more nuanced definition of “good” into our system. This is a significant step up.
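Using the new scorer is no different from passing a built-in metric name. A brief sketch with cross_val_score, again using LogisticRegression purely as a placeholder estimator and the training split from earlier:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(max_iter=1000)

# Each fold is now judged by the confidence-based reward, not by accuracy
scores = cross_val_score(clf, X_train, y_train, cv=5, scoring=confidence_scorer)
print("Reward per fold:", scores)
print("Mean reward:", scores.mean())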
But we can go further. The needs of the business can be even more deeply tied to the data itself. Let’s return to our fraud detection problem. What is the actual business cost of a missed fraudulent transaction? It is not a uniform value of -100. The cost is the amount of the transaction itself. A missed $10,000 fraud is a hundred times worse than a missed $100 fraud. A model that catches all the small frauds but misses the single largest one is a failure, no matter what a standard metric says.
Our scoring function must therefore have access to the feature matrix, X, where the transaction amount is stored. This is a need that make_scorer cannot fulfill. The make_scorer factory is a convenience wrapper, but its convenience comes at the cost of abstraction. It deliberately hides X from the underlying score function. To break through this abstraction, we must bypass make_scorer entirely.
The tools in scikit-learn, like GridSearchCV and cross_val_score, are more flexible than they first appear. Their scoring parameter does not require an object made by make_scorer. It can accept any callable that takes the estimator, the feature matrix X, and the true labels y_true as arguments. The signature is my_scorer(estimator, X, y). This allows us to build scorers of arbitrary complexity that are perfectly tailored to the business problem.
Let’s build a scorer that calculates the net monetary value saved by the model. We’ll assume the transaction amount is in the first column of our data X. The score will be the total value of all correctly identified frauds (True Positives) minus the total value of all missed frauds (False Negatives). We can even subtract a small, fixed “investigation cost” for every false alarm (False Positive).
def monetary_value_scorer(estimator, X, y_true):
    """
    Calculates the net monetary value of a fraud detection model.
    - Assumes transaction amount is the first feature in X.
    - Sums the value of correctly caught frauds (TPs).
    - Subtracts the value of missed frauds (FNs).
    - Subtracts a fixed cost for each false alarm (FP).
    """
    y_pred = estimator.predict(X)
    X = np.asarray(X)
    y_true = np.asarray(y_true)
    transaction_amounts = X[:, 0]
    investigation_cost = 5  # A fixed cost to investigate a flagged transaction

    # Value of correctly identified frauds (True Positives)
    tp_mask = (y_true == 1) & (y_pred == 1)
    saved_value = np.sum(transaction_amounts[tp_mask])

    # Value of missed frauds (False Negatives)
    fn_mask = (y_true == 1) & (y_pred == 0)
    lost_value = np.sum(transaction_amounts[fn_mask])

    # Cost of false alarms (False Positives)
    fp_mask = (y_true == 0) & (y_pred == 1)
    total_investigation_cost = np.sum(fp_mask) * investigation_cost

    net_value = saved_value - lost_value - total_investigation_cost
    return net_value
This function, monetary_value_scorer, is a pure expression of business value. It is not an abstract statistical measure. It is a direct calculation of profit and loss. It takes the estimator and the test data X and y_true, makes predictions, and then uses the transaction amounts from X to compute a score in dollars and cents. This is the kind of metric that a project manager can understand. It is the kind of metric that aligns the work of the data scientist directly with the goals of the business. When you pass this function directly to GridSearchCV(..., scoring=monetary_value_scorer), you are no longer asking the machine to optimize for F1-score or AUC. You are asking it to optimize for money. This is a powerful shift in perspective. It moves from building a “good model” in a generic sense to building a tool that performs a specific, valuable job. The code is no longer just code; it is a contract that codifies the financial requirements of the system.
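As a closing sketch, here is what that wiring might look like. The LogisticRegression and its parameter grid are placeholders, and with the synthetic data from earlier the first feature merely stands in for a real transaction amount:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

param_grid = {
    "C": [0.01, 0.1, 1, 10],
    "class_weight": [None, "balanced"],
}

grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    scoring=monetary_value_scorer,  # optimize for net dollars, not accuracy
    cv=5,
)
grid.fit(X_train, y_train)

print("Best parameters:", grid.best_params_)
print(f"Best mean net value per fold: {grid.best_score_:,.2f}")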
Source: https://www.pythonlore.com/customizing-scoring-and-evaluation-metrics-in-scikit-learn/