Metrics

uqdd.metrics

Metrics subpackage for UQDD

The uqdd.metrics subpackage provides tools to compute, analyze, and visualize performance and uncertainty metrics for UQDD models. It includes plotting and analysis helpers, statistical testing routines, and reassessment utilities to benchmark and compare models rigorously.

Modules:

- ``constants``: Canonical metric names, grouping orders, hatches, and helper
  structures to standardize plots and reports.
- ``analysis``: Functions for aggregating results, loading predictions,
  computing rejection curves, and producing comparison plots and calibration visualizations.
- ``stats``: Statistical metrics and tests (bootstrapping, Wilcoxon, Holm–Bonferroni,
  Friedman–Nemenyi, Cliff's delta), along with boxplots, curve plots, and significance
  analysis/reporting.
- ``reassessment``: Utilities to reassess runs and models (e.g. evidential models),
  export predictions, and post-process metrics from CSV.

Public API

Commonly used names are re-exported for convenient access via uqdd.metrics.<name>. They are grouped by module below for discoverability.

  • Constants: group_cols, numeric_cols, string_cols, order_by, group_order, group_order_no_time, hatches_dict, hatches_dict_no_time, accmetrics, accmetrics2, uctmetrics, uctmetrics2

  • Analysis: aggregate_results_csv, save_plot, handle_inf_values, plot_pairplot, plot_line_metrics, plot_histogram_metrics, plot_pairwise_scatter_metrics, plot_metrics, find_highly_correlated_metrics, plot_comparison_metrics, load_and_aggregate_calibration_data, plot_calibration_data, move_model_folders, load_predictions, calculate_rmse_rejection_curve, calculate_rejection_curve, get_handles_labels, plot_rmse_rejection_curves, plot_auc_comparison, save_stats_df, load_stats_df

  • Statistics: calc_regression_metrics, bootstrap_ci, rm_tukey_hsd, make_boxplots, make_boxplots_parametric, make_boxplots_nonparametric, make_sign_plots_nonparametric, make_critical_difference_diagrams, make_normality_diagnostic, mcs_plot, make_mcs_plot_grid, make_scatterplot, ci_plot, make_ci_plot_grid, recall_at_precision, calc_classification_metrics, make_curve_plots, harmonize_columns, cliffs_delta, wilcoxon_pairwise_test, holm_bonferroni_correction, pairwise_model_comparison, friedman_nemenyi_test, calculate_critical_difference, bootstrap_auc_difference, plot_critical_difference_diagram, analyze_significance, comprehensive_statistical_analysis, generate_statistical_report

  • Reassessment: nll_evidentials, convert_to_list, preprocess_runs, get_model_class, get_predict_fn, get_preds, pkl_preds_export, csv_nll_post_processing, reassess_metrics

Usage Notes
  • Reproducibility: Prefer functions that accept random seeds and write diagnostics under uqdd/logs; capture versions and configurations for statistical comparisons.
  • Data paths: Use the global paths from uqdd.__init__ to keep file/plot outputs consistent.
  • Plot styles: Use constants from metrics.constants to standardize the look and ordering across figures.
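
For example, a minimal access sketch (assuming uqdd is installed and importable):

# Re-exported names are available directly from the subpackage.
from uqdd.metrics import accmetrics, aggregate_results_csv, plot_metrics

print(accmetrics)  # ['RMSE', 'R2', 'MAE', 'MDAE', 'MARPD', 'PCC']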

uqdd.metrics.group_cols module-attribute

group_cols = ['Model type', 'Task', 'Activity', 'Split', 'desc_prot', 'desc_chem', 'dropout']

uqdd.metrics.numeric_cols module-attribute

numeric_cols = ['RMSE', 'R2', 'MAE', 'MDAE', 'MARPD', 'PCC', 'RMS Calibration', 'MA Calibration', 'Miscalibration Area', 'Sharpness', 'NLL', 'CRPS', 'Check', 'Interval', 'rho_rank', 'rho_rank_sim', 'rho_rank_sim_std', 'uq_mis_cal', 'uq_NLL', 'uq_NLL_sim', 'uq_NLL_sim_std', 'Z_var', 'Z_var_CI_low', 'Z_var_CI_high', 'Z_mean', 'Z_mean_CI_low', 'Z_mean_CI_high', 'rmv_rmse_slope', 'rmv_rmse_r_sq', 'rmv_rmse_intercept', 'aleatoric_uct_mean', 'epistemic_uct_mean', 'total_uct_mean']

uqdd.metrics.string_cols module-attribute

string_cols = ['wandb project', 'wandb run', 'model name']

uqdd.metrics.order_by module-attribute

order_by = ['Split', 'Model type']

uqdd.metrics.group_order module-attribute

group_order = ['stratified_pnn', 'stratified_ensemble', 'stratified_mcdropout', 'stratified_evidential', 'stratified_eoe', 'stratified_emc', 'scaffold_cluster_pnn', 'scaffold_cluster_ensemble', 'scaffold_cluster_mcdropout', 'scaffold_cluster_evidential', 'scaffold_cluster_eoe', 'scaffold_cluster_emc', 'time_pnn', 'time_ensemble', 'time_mcdropout', 'time_evidential', 'time_eoe', 'time_emc']

uqdd.metrics.group_order_no_time module-attribute

group_order_no_time = ['stratified_pnn', 'stratified_ensemble', 'stratified_mcdropout', 'stratified_evidential', 'stratified_eoe', 'stratified_emc', 'scaffold_cluster_pnn', 'scaffold_cluster_ensemble', 'scaffold_cluster_mcdropout', 'scaffold_cluster_evidential', 'scaffold_cluster_eoe', 'scaffold_cluster_emc']

uqdd.metrics.hatches_dict module-attribute

hatches_dict = {'stratified': '\\\\', 'scaffold_cluster': '', 'time': '...'}

uqdd.metrics.hatches_dict_no_time module-attribute

hatches_dict_no_time = {'stratified': '\\\\', 'scaffold_cluster': ''}

uqdd.metrics.accmetrics module-attribute

accmetrics = ['RMSE', 'R2', 'MAE', 'MDAE', 'MARPD', 'PCC']

uqdd.metrics.accmetrics2 module-attribute

accmetrics2 = ['RMSE', 'R2', 'PCC']

uqdd.metrics.uctmetrics module-attribute

uctmetrics = ['RMS Calibration', 'MA Calibration', 'Miscalibration Area', 'Sharpness', 'CRPS', 'Check', 'NLL', 'Interval']

uqdd.metrics.uctmetrics2 module-attribute

uctmetrics2 = ['Miscalibration Area', 'Sharpness', 'CRPS', 'NLL', 'Interval']

uqdd.metrics.__all__ module-attribute

__all__ = ['group_cols', 'numeric_cols', 'string_cols', 'order_by', 'group_order', 'group_order_no_time', 'hatches_dict', 'hatches_dict_no_time', 'accmetrics', 'accmetrics2', 'uctmetrics', 'uctmetrics2', 'aggregate_results_csv', 'save_plot', 'handle_inf_values', 'plot_pairplot', 'plot_line_metrics', 'plot_histogram_metrics', 'plot_pairwise_scatter_metrics', 'plot_metrics', 'find_highly_correlated_metrics', 'plot_comparison_metrics', 'load_and_aggregate_calibration_data', 'plot_calibration_data', 'move_model_folders', 'load_predictions', 'calculate_rmse_rejection_curve', 'calculate_rejection_curve', 'get_handles_labels', 'plot_rmse_rejection_curves', 'plot_auc_comparison', 'save_stats_df', 'load_stats_df', 'calc_regression_metrics', 'bootstrap_ci', 'rm_tukey_hsd', 'make_boxplots', 'make_boxplots_parametric', 'make_boxplots_nonparametric', 'make_sign_plots_nonparametric', 'make_critical_difference_diagrams', 'make_normality_diagnostic', 'mcs_plot', 'make_mcs_plot_grid', 'make_scatterplot', 'ci_plot', 'make_ci_plot_grid', 'recall_at_precision', 'calc_classification_metrics', 'make_curve_plots', 'harmonize_columns', 'cliffs_delta', 'wilcoxon_pairwise_test', 'holm_bonferroni_correction', 'pairwise_model_comparison', 'friedman_nemenyi_test', 'calculate_critical_difference', 'bootstrap_auc_difference', 'plot_critical_difference_diagram', 'analyze_significance', 'comprehensive_statistical_analysis', 'generate_statistical_report', 'nll_evidentials', 'convert_to_list', 'preprocess_runs', 'get_model_class', 'get_predict_fn', 'get_preds', 'pkl_preds_export', 'csv_nll_post_processing', 'reassess_metrics']

uqdd.metrics.aggregate_results_csv

aggregate_results_csv(df: DataFrame, group_cols: List[str], numeric_cols: List[str], string_cols: List[str], order_by: Optional[Union[str, List[str]]] = None, output_file_path: Optional[str] = None) -> pd.DataFrame

Aggregate metrics by groups and export a compact CSV summary.

Parameters:

Name Type Description Default
df DataFrame

Input results DataFrame.

required
group_cols list of str

Column names to group by.

required
numeric_cols list of str

Numeric metric columns to aggregate with mean and std.

required
string_cols list of str

String columns to aggregate as lists.

required
order_by str or list of str or None

Column(s) to sort the final aggregated DataFrame by. Default is None.

None
output_file_path str or None

Path to write the aggregated CSV. If None, no file is written.

None

Returns:

Type Description
DataFrame

Aggregated DataFrame with combined mean(std) strings plus string/list aggregates.

Notes
  • A helper column project_model is constructed and included in the aggregates.
  • When output_file_path is provided, the function ensures the directory exists.
Source code in uqdd/metrics/analysis.py
def aggregate_results_csv(
    df: pd.DataFrame,
    group_cols: List[str],
    numeric_cols: List[str],
    string_cols: List[str],
    order_by: Optional[Union[str, List[str]]] = None,
    output_file_path: Optional[str] = None,
) -> pd.DataFrame:
    """
    Aggregate metrics by groups and export a compact CSV summary.

    Parameters
    ----------
    df : pd.DataFrame
        Input results DataFrame.
    group_cols : list of str
        Column names to group by.
    numeric_cols : list of str
        Numeric metric columns to aggregate with mean and std.
    string_cols : list of str
        String columns to aggregate as lists.
    order_by : str or list of str or None, optional
        Column(s) to sort the final aggregated DataFrame by. Default is None.
    output_file_path : str or None, optional
        Path to write the aggregated CSV. If None, no file is written.

    Returns
    -------
    pd.DataFrame
        Aggregated DataFrame with combined mean(std) strings plus string/list aggregates.

    Notes
    -----
    - A helper column `project_model` is constructed and included in the aggregates.
    - When `output_file_path` is provided, the function ensures the directory exists.
    """
    grouped = df.groupby(group_cols)
    aggregated = grouped[numeric_cols].agg(["mean", "std"])
    for col in numeric_cols:
        aggregated[(col, "combined")] = (
            aggregated[(col, "mean")].round(3).astype(str)
            + "("
            + aggregated[(col, "std")].round(3).astype(str)
            + ")"
        )
    aggregated = aggregated[[col for col in aggregated.columns if col[1] == "combined"]]
    aggregated.columns = [col[0] for col in aggregated.columns]

    string_aggregated = grouped[string_cols].agg(lambda x: list(x))

    df["project_model"] = (
        "papyrus"
        + "/"
        + df["Activity"]
        + "/"
        + "all"
        + "/"
        + df["wandb project"]
        + "/"
        + df["model name"]
        + "/"
    )
    project_model_aggregated = grouped["project_model"].agg(lambda x: list(x))

    final_aggregated = pd.concat(
        [aggregated, string_aggregated, project_model_aggregated], axis=1
    ).reset_index()

    if order_by:
        final_aggregated = final_aggregated.sort_values(by=order_by)

    if output_file_path:
        os.makedirs(os.path.dirname(output_file_path), exist_ok=True)
        final_aggregated.to_csv(output_file_path, index=False)

    return final_aggregated
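
A minimal usage sketch (illustrative data; besides the grouping columns, the frame must also carry 'Activity', 'wandb project', and 'model name' so the project_model helper column can be built):

import pandas as pd
from uqdd.metrics import aggregate_results_csv

# Tiny illustrative results table; real inputs come from exported run metrics.
df = pd.DataFrame({
    "Model type": ["pnn", "pnn"],
    "Split": ["stratified", "stratified"],
    "Activity": ["pchembl", "pchembl"],
    "RMSE": [0.71, 0.69],
    "R2": [0.55, 0.58],
    "wandb project": ["proj", "proj"],
    "wandb run": ["run1", "run2"],
    "model name": ["m1", "m2"],
})
summary = aggregate_results_csv(
    df,
    group_cols=["Model type", "Split"],
    numeric_cols=["RMSE", "R2"],
    string_cols=["wandb project", "wandb run", "model name"],
    order_by="Model type",
    output_file_path=None,  # set a path to also write the CSV
)
print(summary)  # one row per group with 'mean(std)' strings per metric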

uqdd.metrics.save_plot

save_plot(fig: Figure, save_dir: Optional[str], plot_name: str, tighten: bool = True, show_legend: bool = False) -> None

Save a matplotlib figure to PNG, SVG, and PDF with optional tight layout.

Parameters:

Name Type Description Default
fig Figure

Figure to save.

required
save_dir str or None

Directory to save the figure files. If None, no files are written.

required
plot_name str

Base filename (without extension).

required
tighten bool

If True, apply tight_layout and bbox_inches="tight". Default is True.

True
show_legend bool

If False, remove legend before saving. Default is False.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def save_plot(
    fig: plt.Figure,
    save_dir: Optional[str],
    plot_name: str,
    tighten: bool = True,
    show_legend: bool = False,
) -> None:
    """
    Save a matplotlib figure to PNG, SVG, and PDF with optional tight layout.

    Parameters
    ----------
    fig : matplotlib.figure.Figure
        Figure to save.
    save_dir : str or None
        Directory to save the figure files. If None, no files are written.
    plot_name : str
        Base filename (without extension).
    tighten : bool, optional
        If True, apply tight_layout and bbox_inches="tight". Default is True.
    show_legend : bool, optional
        If False, remove legend before saving. Default is False.

    Returns
    -------
    None
    """
    ax = fig.gca()
    if not show_legend:
        legend = ax.get_legend()
        if legend is not None:
            legend.remove()
    if tighten:
        try:
            with warnings.catch_warnings():
                warnings.filterwarnings(
                    "ignore",
                    message="This figure includes Axes that are not compatible with tight_layout",
                )
                fig.tight_layout()
        except (ValueError, RuntimeError):
            fig.subplots_adjust(left=0.1, right=0.9, top=0.9, bottom=0.1)

    if save_dir and tighten:
        os.makedirs(save_dir, exist_ok=True)
        fig.savefig(os.path.join(save_dir, f"{plot_name}.png"), dpi=300, bbox_inches="tight")
        fig.savefig(os.path.join(save_dir, f"{plot_name}.svg"), bbox_inches="tight")
        fig.savefig(os.path.join(save_dir, f"{plot_name}.pdf"), dpi=300, bbox_inches="tight")
    elif save_dir and not tighten:
        os.makedirs(save_dir, exist_ok=True)
        fig.savefig(os.path.join(save_dir, f"{plot_name}.png"), dpi=300)
        fig.savefig(os.path.join(save_dir, f"{plot_name}.svg"))
        fig.savefig(os.path.join(save_dir, f"{plot_name}.pdf"), dpi=300)
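
A minimal sketch (the directory name is illustrative):

import matplotlib.pyplot as plt
from uqdd.metrics import save_plot

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1], label="identity")
ax.legend()
# Writes plots/demo_curve.png, .svg, and .pdf; the legend is removed unless show_legend=True.
save_plot(fig, save_dir="plots", plot_name="demo_curve", tighten=True, show_legend=False)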

uqdd.metrics.handle_inf_values

handle_inf_values(df: DataFrame) -> pd.DataFrame

Replace +/- infinity values in a DataFrame with NaN.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required

Returns:

Type Description
DataFrame

DataFrame with infinite values replaced by NaN.

Source code in uqdd/metrics/analysis.py
def handle_inf_values(df: pd.DataFrame) -> pd.DataFrame:
    """
    Replace +/- infinity values in a DataFrame with NaN.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.

    Returns
    -------
    pd.DataFrame
        DataFrame with infinite values replaced by NaN.
    """
    return df.replace([float("inf"), -float("inf")], float("nan"))
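
For example:

import pandas as pd
from uqdd.metrics import handle_inf_values

df = pd.DataFrame({"NLL": [1.2, float("inf"), -float("inf")]})
print(handle_inf_values(df))  # both infinities become NaN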

uqdd.metrics.plot_pairplot

plot_pairplot(df: DataFrame, title: str, metrics: List[str], save_dir: Optional[str] = None, cmap: str = 'viridis', group_order: Optional[List[str]] = group_order, show_legend: bool = False) -> None

Plot a seaborn pairplot for a set of metrics colored by Group.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing the metrics and a 'Group' column.

required
title str

Plot title.

required
metrics list of str

Metric column names to include in the pairplot.

required
save_dir str or None

Directory to save plot images. Default is None.

None
cmap str

Seaborn/matplotlib palette name. Default is "viridis".

'viridis'
group_order list of str or None

Order of class labels in the legend. Default is from constants.

group_order
show_legend bool

If True, keep the legend; otherwise it will be removed before saving.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def plot_pairplot(
    df: pd.DataFrame,
    title: str,
    metrics: List[str],
    save_dir: Optional[str] = None,
    cmap: str = "viridis",
    group_order: Optional[List[str]] = group_order,
    show_legend: bool = False,
) -> None:
    """
    Plot a seaborn pairplot for a set of metrics colored by Group.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame containing the metrics and a 'Group' column.
    title : str
        Plot title.
    metrics : list of str
        Metric column names to include in the pairplot.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    cmap : str, optional
        Seaborn/matplotlib palette name. Default is "viridis".
    group_order : list of str or None, optional
        Order of class labels in the legend. Default is from constants.
    show_legend : bool, optional
        If True, keep the legend; otherwise it will be removed before saving.

    Returns
    -------
    None
    """
    df = handle_inf_values(df)
    sns.pairplot(
        df,
        hue="Group",
        hue_order=group_order,
        vars=metrics,
        palette=cmap,
        plot_kws={"alpha": 0.7},
    )
    plt.suptitle(title, y=1.02)
    plot_name = f"pairplot_{title.replace(' ', '_')}"
    save_plot(plt.gcf(), save_dir, plot_name, tighten=False, show_legend=show_legend)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
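
A usage sketch with illustrative data (the sibling helpers plot_line_metrics, plot_histogram_metrics, and plot_pairwise_scatter_metrics take the same df/title/metrics/save_dir arguments):

import pandas as pd
from uqdd.metrics import plot_pairplot, accmetrics2

# Metric columns plus the 'Group' label used for coloring.
df = pd.DataFrame({
    "RMSE": [0.70, 0.72, 0.90, 0.88],
    "R2": [0.60, 0.58, 0.40, 0.42],
    "PCC": [0.78, 0.77, 0.65, 0.66],
    "Group": ["stratified_pnn"] * 2 + ["time_pnn"] * 2,
})
plot_pairplot(df, title="Accuracy metrics", metrics=accmetrics2, save_dir="plots", group_order=None)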

uqdd.metrics.plot_line_metrics

plot_line_metrics(df: DataFrame, title: str, metrics: List[str], save_dir: Optional[str] = None, group_order: Optional[List[str]] = group_order, show_legend: bool = False) -> None

Plot line charts of metrics over runs, colored by Group.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with 'wandb run', metrics, and 'Group'.

required
title str

Plot title.

required
metrics list of str

Metric column names to plot.

required
save_dir str or None

Directory to save plot images. Default is None.

None
group_order list of str or None

Order of class labels in the legend. Default is from constants.

group_order
show_legend bool

If True, keep the legend; otherwise it will be removed before saving.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def plot_line_metrics(
    df: pd.DataFrame,
    title: str,
    metrics: List[str],
    save_dir: Optional[str] = None,
    group_order: Optional[List[str]] = group_order,
    show_legend: bool = False,
) -> None:
    """
    Plot line charts of metrics over runs, colored by Group.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with 'wandb run', metrics, and 'Group'.
    title : str
        Plot title.
    metrics : list of str
        Metric column names to plot.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    group_order : list of str or None, optional
        Order of class labels in the legend. Default is from constants.
    show_legend : bool, optional
        If True, keep the legend; otherwise it will be removed before saving.

    Returns
    -------
    None
    """
    df = handle_inf_values(df)
    for metric in metrics:
        plt.figure(figsize=(14, 7))
        sns.lineplot(
            data=df,
            x="wandb run",
            y=metric,
            hue="Group",
            marker="o",
            palette="Set2",
            hue_order=group_order,
            label=metric,
        )
        plt.title(f"{title} - {metric}")
        plt.xticks(rotation=45)
        plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
        if INTERACTIVE_MODE:
            plt.show()
        plot_name = f"line_{title.replace(' ', '_')}_{metric}"
        save_plot(plt.gcf(), save_dir, plot_name, tighten=False, show_legend=show_legend)
        plt.close()

uqdd.metrics.plot_histogram_metrics

plot_histogram_metrics(df: DataFrame, title: str, metrics: List[str], save_dir: Optional[str] = None, group_order: Optional[List[str]] = group_order, cmap: str = 'crest', show_legend: bool = False) -> None

Plot histograms with KDE for metrics, split by Group.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with metrics and 'Group'.

required
title str

Plot title.

required
metrics list of str

Metric column names to plot.

required
save_dir str or None

Directory to save plot images. Default is None.

None
group_order list of str or None

Order of class labels in the legend. Default is from constants.

group_order
cmap str

Seaborn/matplotlib palette name. Default is "crest".

'crest'
show_legend bool

If True, keep the legend; otherwise it will be removed before saving.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def plot_histogram_metrics(
    df: pd.DataFrame,
    title: str,
    metrics: List[str],
    save_dir: Optional[str] = None,
    group_order: Optional[List[str]] = group_order,
    cmap: str = "crest",
    show_legend: bool = False,
) -> None:
    """
    Plot histograms with KDE for metrics, split by Group.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with metrics and 'Group'.
    title : str
        Plot title.
    metrics : list of str
        Metric column names to plot.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    group_order : list of str or None, optional
        Order of class labels in the legend. Default is from constants.
    cmap : str, optional
        Seaborn/matplotlib palette name. Default is "crest".
    show_legend : bool, optional
        If True, keep the legend; otherwise it will be removed before saving.

    Returns
    -------
    None
    """
    df = handle_inf_values(df)
    for metric in metrics:
        plt.figure(figsize=(14, 7))
        sns.histplot(
            data=df,
            x=metric,
            hue="Group",
            kde=True,
            palette=cmap,
            element="step",
            hue_order=group_order,
            fill=True,
            alpha=0.7,
        )
        plt.title(f"{title} - {metric}")
        if INTERACTIVE_MODE:
            plt.show()
        plot_name = f"histogram_{title.replace(' ', '_')}_{metric}"
        save_plot(plt.gcf(), save_dir, plot_name, show_legend=show_legend)
        plt.close()

uqdd.metrics.plot_pairwise_scatter_metrics

plot_pairwise_scatter_metrics(df: DataFrame, title: str, metrics: List[str], save_dir: Optional[str] = None, group_order: Optional[List[str]] = group_order, cmap: str = 'tab10_r', show_legend: bool = False) -> None

Plot pairwise scatterplots for all metric combinations, colored by Group.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with metrics and 'Group'.

required
title str

Plot title.

required
metrics list of str

Metric column names to plot pairwise.

required
save_dir str or None

Directory to save plot images. Default is None.

None
group_order list of str or None

Order of class labels in the legend. Default is from constants.

group_order
cmap str

Matplotlib palette name. Default is "tab10_r".

'tab10_r'
show_legend bool

If True, keep the legend; otherwise it will be removed before saving.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def plot_pairwise_scatter_metrics(
    df: pd.DataFrame,
    title: str,
    metrics: List[str],
    save_dir: Optional[str] = None,
    group_order: Optional[List[str]] = group_order,
    cmap: str = "tab10_r",
    show_legend: bool = False,
) -> None:
    """
    Plot pairwise scatterplots for all metric combinations, colored by Group.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with metrics and 'Group'.
    title : str
        Plot title.
    metrics : list of str
        Metric column names to plot pairwise.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    group_order : list of str or None, optional
        Order of class labels in the legend. Default is from constants.
    cmap : str, optional
        Matplotlib palette name. Default is "tab10_r".
    show_legend : bool, optional
        If True, keep the legend; otherwise it will be removed before saving.

    Returns
    -------
    None
    """
    df = handle_inf_values(df)
    num_metrics = len(metrics)
    fig, axes = plt.subplots(num_metrics, num_metrics, figsize=(15, 15))

    for i in range(num_metrics):
        for j in range(num_metrics):
            if i != j:
                ax = sns.scatterplot(
                    data=df,
                    x=metrics[j],
                    y=metrics[i],
                    hue="Group",
                    palette=cmap,
                    hue_order=group_order,
                    ax=axes[i, j],
                    legend=False if not (i == 1 and j == 0) else "brief",
                )
                if i == 1 and j == 0:
                    handles, labels = ax.get_legend_handles_labels()
                    ax.legend().remove()
            else:
                axes[i, j].set_visible(False)

            axes[i, j].set_ylabel(metrics[i] if j == 0 and i > 0 else "")
            axes[i, j].set_xlabel(metrics[j] if i == num_metrics - 1 else "")

    fig.legend(handles, labels, loc="upper right", bbox_to_anchor=(1.15, 1))
    fig.suptitle(title, y=1.02)
    fig.subplots_adjust(top=0.95, wspace=0.4, hspace=0.4)
    plot_name = f"pairwise_scatter_{title.replace(' ', '_')}"
    save_plot(fig, save_dir, plot_name, tighten=True, show_legend=show_legend)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.plot_metrics

plot_metrics(df: DataFrame, metrics: List[str], cmap: str = 'tab10_r', save_dir: Optional[str] = None, hatches_dict: Optional[Dict[str, str]] = None, group_order: Optional[List[str]] = None, show: bool = True, fig_width: Optional[float] = None, fig_height: Optional[float] = None, show_legend: bool = False) -> Dict[str, str]

Plot grouped bar charts showing mean and std for metrics across splits and model types.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with columns ['Split', 'Model type'] and metrics.

required
metrics list of str

Metric column names to plot.

required
cmap str

Matplotlib colormap name used to derive distinct colors per model type. Default is "tab10_r".

'tab10_r'
save_dir str or None

Directory to save plot images. Default is None.

None
hatches_dict dict[str, str] or None

Mapping from Split to hatch pattern. Default is None.

None
group_order list of str or None

Order of grouped labels (Split_Model type). Default derives from data.

None
show bool

If True, display plot in interactive mode. Default is True.

True
fig_width float or None

Width of the plot area (excluding legend). Default scales with number of metrics.

None
fig_height float or None

Height of the plot area (excluding legend). Default is 6.

None
show_legend bool

If True, include a legend of split/model combinations. Default is False.

False

Returns:

Type Description
dict[str, str]

Color mapping from 'Model type' to RGBA string used in the plot.

Source code in uqdd/metrics/analysis.py
def plot_metrics(
    df: pd.DataFrame,
    metrics: List[str],
    cmap: str = "tab10_r",
    save_dir: Optional[str] = None,
    hatches_dict: Optional[Dict[str, str]] = None,
    group_order: Optional[List[str]] = None,
    show: bool = True,
    fig_width: Optional[float] = None,
    fig_height: Optional[float] = None,
    show_legend: bool = False,
) -> Dict[str, str]:
    """
    Plot grouped bar charts showing mean and std for metrics across splits and model types.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with columns ['Split', 'Model type'] and metrics.
    metrics : list of str
        Metric column names to plot.
    cmap : str, optional
        Matplotlib colormap name used to derive distinct colors per model type. Default is "tab10_r".
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    hatches_dict : dict[str, str] or None, optional
        Mapping from Split to hatch pattern. Default is None.
    group_order : list of str or None, optional
        Order of grouped labels (Split_Model type). Default derives from data.
    show : bool, optional
        If True, display plot in interactive mode. Default is True.
    fig_width : float or None, optional
        Width of the plot area (excluding legend). Default scales with number of metrics.
    fig_height : float or None, optional
        Height of the plot area (excluding legend). Default is 6.
    show_legend : bool, optional
        If True, include a legend of split/model combinations. Default is False.

    Returns
    -------
    dict[str, str]
        Color mapping from 'Model type' to RGBA string used in the plot.
    """
    plot_width = fig_width if fig_width else max(10, len(metrics) * 2)
    plot_height = fig_height if fig_height else 6
    total_width = plot_width + 5
    total_height = plot_height + 2

    fig = plt.figure(figsize=(total_width, total_height))
    gs = gridspec.GridSpec(1, 1, figure=fig, left=0.1, right=0.75, top=0.9, bottom=0.2)
    ax = fig.add_subplot(gs[0])
    ax.set_position([0.1, 0.15, plot_width / total_width, plot_height / total_height])

    stats_dfs = []
    for metric in metrics:
        mean_df = df.groupby(["Split", "Model type"])[metric].mean().rename(f"{metric}_mean")
        std_df = df.groupby(["Split", "Model type"])[metric].std().rename(f"{metric}_std")
        stats_df = pd.merge(mean_df, std_df, left_index=True, right_index=True).reset_index()
        stats_df["Group"] = stats_df.apply(lambda row: f"{row['Split']}_{row['Model type']}", axis=1)
        stats_df["Metric"] = metric
        stats_dfs.append(stats_df)

    combined_stats_df = pd.concat(stats_dfs)
    if group_order:
        combined_stats_df["Group"] = pd.Categorical(
            combined_stats_df["Group"], categories=group_order, ordered=True
        )
    else:
        group_order = combined_stats_df["Group"].unique().tolist()

    scalar_mappable = ScalarMappable(cmap=cmap)
    model_types = combined_stats_df["Model type"].unique()
    color_dict = {
        m: c
        for m, c in zip(
            model_types,
            scalar_mappable.to_rgba(range(len(model_types)), alpha=1).tolist(),
        )
    }

    bar_width = 0.12
    group_spacing = 0.4
    num_bars = len(model_types) * len(hatches_dict)
    positions = []
    tick_positions = []
    tick_labels = []

    for i, metric in enumerate(metrics):
        metric_data = combined_stats_df[combined_stats_df["Metric"] == metric]
        metric_data.loc[:, "Group"] = pd.Categorical(
            metric_data["Group"], categories=group_order, ordered=True
        )
        metric_data = metric_data.sort_values("Group").reset_index(drop=True)
        for j, (_, row) in enumerate(metric_data.iterrows()):
            position = i * (num_bars * bar_width + group_spacing) + (j % num_bars) * bar_width
            positions.append(position)
            ax.bar(
                position,
                height=row[f"{metric}_mean"],
                color=color_dict[row["Model type"]],
                hatch=hatches_dict[row["Split"]],
                width=bar_width,
            )
        center_position = i * (num_bars * bar_width + group_spacing) + (num_bars * bar_width) / 2
        tick_positions.append(center_position)
        tick_labels.append(metric.replace(" ", "\n") if " " in metric else metric)

    def create_stats_legend(df, color_mapping, hatches_dict, group_order):
        patches_dict = {}
        for _, row in df.iterrows():
            label = f"{row['Split']} {row['Model type']}"
            group_label = f"{row['Split']}_{row['Model type']}"
            if group_label not in patches_dict:
                patches_dict[group_label] = mpatches.Patch(
                    facecolor=color_mapping[row["Model type"]],
                    hatch=hatches_dict[row["Split"]],
                    label=label,
                )
        return [patches_dict[group] for group in group_order if group in patches_dict]

    if show_legend:
        legend_elements = create_stats_legend(combined_stats_df, color_dict, hatches_dict, group_order)
        ax.legend(handles=legend_elements, bbox_to_anchor=(1.05, 1), loc="upper left", borderaxespad=0, frameon=False)

    for (_, row), bar in zip(combined_stats_df.iterrows(), ax.patches):
        x_bar = bar.get_x() + bar.get_width() / 2
        y_bar = bar.get_height()
        ax.errorbar(
            x_bar,
            y_bar,
            yerr=row[f"{row['Metric']}_std"],
            color="black",
            fmt="none",
            elinewidth=1,
            capsize=3,
            alpha=0.5,
        )

    ax.set_xticks(tick_positions)
    ax.set_xticklabels(tick_labels, rotation=45, ha="right", rotation_mode="anchor", fontsize=9)
    ax.set_ylim(bottom=0.0)

    if save_dir:
        metrics_names = "_".join(metrics)
        plot_name = f"barplot_{cmap}_{metrics_names}"
        save_plot(fig, save_dir, plot_name, show_legend=show_legend)

    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

    return color_dict
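
A usage sketch wiring in the plotting constants (data are illustrative; the returned color mapping can be reused as color_dict in plot_comparison_metrics):

import pandas as pd
from uqdd.metrics import plot_metrics, hatches_dict, group_order, accmetrics2

# Illustrative results: two runs per split for a single model type.
df = pd.DataFrame({
    "Split": ["stratified", "stratified", "time", "time"],
    "Model type": ["pnn", "pnn", "pnn", "pnn"],
    "RMSE": [0.70, 0.72, 0.90, 0.88],
    "R2": [0.60, 0.58, 0.40, 0.42],
    "PCC": [0.78, 0.77, 0.65, 0.66],
})
color_dict = plot_metrics(
    df,
    metrics=accmetrics2,
    save_dir="plots",
    hatches_dict=hatches_dict,  # maps each Split to a hatch pattern
    group_order=group_order,    # canonical Split_Model ordering from the constants
    show_legend=True,
)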

uqdd.metrics.find_highly_correlated_metrics

find_highly_correlated_metrics(df: DataFrame, metrics: List[str], threshold: float = 0.8, save_dir: Optional[str] = None, cmap: str = 'coolwarm', show_legend: bool = False) -> List[Tuple[str, str, float]]

Identify pairs of metrics with correlation above a threshold and plot the matrix.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing the metric columns.

required
metrics list of str

Metric column names to include in the correlation analysis.

required
threshold float

Absolute correlation threshold for reporting pairs. Default is 0.8.

0.8
save_dir str or None

Directory to save the heatmap plot. Default is None.

None
cmap str

Matplotlib colormap name. Default is "coolwarm".

'coolwarm'
show_legend bool

If True, keep the legend; otherwise it will be removed before saving.

False

Returns:

Type Description
list of tuple[str, str, float]

List of metric pairs and their absolute correlation values.

Source code in uqdd/metrics/analysis.py
def find_highly_correlated_metrics(
    df: pd.DataFrame,
    metrics: List[str],
    threshold: float = 0.8,
    save_dir: Optional[str] = None,
    cmap: str = "coolwarm",
    show_legend: bool = False,
) -> List[Tuple[str, str, float]]:
    """
    Identify pairs of metrics with correlation above a threshold and plot the matrix.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame containing the metric columns.
    metrics : list of str
        Metric column names to include in the correlation analysis.
    threshold : float, optional
        Absolute correlation threshold for reporting pairs. Default is 0.8.
    save_dir : str or None, optional
        Directory to save the heatmap plot. Default is None.
    cmap : str, optional
        Matplotlib colormap name. Default is "coolwarm".
    show_legend : bool, optional
        If True, keep the legend; otherwise it will be removed before saving.

    Returns
    -------
    list of tuple[str, str, float]
        List of metric pairs and their absolute correlation values.
    """
    corr_matrix = df[metrics].corr().abs()
    pairs = []
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if corr_matrix.iloc[i, j] > threshold:
                pairs.append((corr_matrix.columns[i], corr_matrix.columns[j], corr_matrix.iloc[i, j]))

    print(f"Highly correlated metrics (correlation coefficient > {threshold}):")
    for a, b, v in pairs:
        print(f"{a} and {b}: {v:.2f}")

    plt.figure(figsize=(10, 8))
    sns.heatmap(corr_matrix, annot=True, cmap=cmap)
    plt.title("Correlation Matrix")
    plot_name = f"correlation_matrix_{threshold}_{'_'.join(metrics)}"
    save_plot(plt.gcf(), save_dir, plot_name, show_legend=show_legend)
    if INTERACTIVE_MODE:
        plt.show()

    return pairs
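
For example (illustrative data):

import pandas as pd
from uqdd.metrics import find_highly_correlated_metrics

df = pd.DataFrame({
    "RMSE": [0.7, 0.8, 0.9, 1.0],
    "MAE": [0.5, 0.6, 0.7, 0.8],
    "R2": [0.6, 0.5, 0.4, 0.3],
})
pairs = find_highly_correlated_metrics(df, metrics=["RMSE", "MAE", "R2"], threshold=0.8)
# pairs is a list of (metric_a, metric_b, |correlation|) tuples above the threshold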

uqdd.metrics.plot_comparison_metrics

plot_comparison_metrics(df: DataFrame, metrics: List[str], cmap: str = 'tab10_r', color_dict: Optional[Dict[str, str]] = None, save_dir: Optional[str] = None, fig_width: Optional[float] = None, fig_height: Optional[float] = None, show_legend: bool = False, models_order: Optional[List[str]] = None) -> None

Plot comparison bar charts across splits, model types, and calibration states.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with columns ['Split', 'Model type', 'Calibration'] and metrics.

required
metrics list of str

Metric column names to plot.

required
cmap str

Matplotlib colormap name used to derive distinct colors per model type. Default is "tab10_r".

'tab10_r'
color_dict dict[str, str] or None

Precomputed color mapping from model type to color. If None, one is generated.

None
save_dir str or None

Directory to save plot images. Default is None.

None
fig_width float or None

Width of the plot area (excluding legend). Default scales with the number of metrics.

None
fig_height float or None

Height of the plot area (excluding legend). Default is 6.

None
show_legend bool

If True, include a legend. Default is False.

False
models_order list of str or None

Explicit order of model types for coloring and grouping. Default derives from data.

None

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def plot_comparison_metrics(
    df: pd.DataFrame,
    metrics: List[str],
    cmap: str = "tab10_r",
    color_dict: Optional[Dict[str, str]] = None,
    save_dir: Optional[str] = None,
    fig_width: Optional[float] = None,
    fig_height: Optional[float] = None,
    show_legend: bool = False,
    models_order: Optional[List[str]] = None,
) -> None:
    """
    Plot comparison bar charts across splits, model types, and calibration states.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with columns ['Split', 'Model type', 'Calibration'] and metrics.
    metrics : list of str
        Metric column names to plot.
    cmap : str, optional
        Matplotlib colormap name used to derive distinct colors per model type. Default is "tab10_r".
    color_dict : dict[str, str] or None, optional
        Precomputed color mapping from model type to color. If None, one is generated.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    fig_width : float or None, optional
        Width of the plot area (excluding legend). Default scales with the number of metrics.
    fig_height : float or None, optional
        Height of the plot area (excluding legend). Default is 6.
    show_legend : bool, optional
        If True, include a legend. Default is False.
    models_order : list of str or None, optional
        Explicit order of model types for coloring and grouping. Default derives from data.

    Returns
    -------
    None
    """
    plot_width = fig_width if fig_width else max(7, len(metrics) * 3)
    plot_height = fig_height if fig_height else 6
    total_width = plot_width + 5
    total_height = plot_height + 2

    fig = plt.figure(figsize=(total_width, total_height))
    gs = gridspec.GridSpec(1, 1, figure=fig, left=0.1, right=0.75, top=0.9, bottom=0.15)
    ax = fig.add_subplot(gs[0])
    ax.set_position([0.1, 0.15, plot_width / total_width, plot_height / total_height])

    stats_dfs = []
    for metric in metrics:
        mean_df = df.groupby(["Split", "Model type", "Calibration"])[metric].mean().rename(f"{metric}_mean")
        std_df = df.groupby(["Split", "Model type", "Calibration"])[metric].std().rename(f"{metric}_std")
        stats_df = pd.merge(mean_df, std_df, left_index=True, right_index=True).reset_index()
        stats_df["Group"] = stats_df.apply(
            lambda row: f"{row['Split']}_{row['Model type']}_{row['Calibration']}", axis=1
        )
        stats_df["Metric"] = metric
        stats_dfs.append(stats_df)

    combined_stats_df = pd.concat(stats_dfs)
    if models_order is None:
        models_order = combined_stats_df["Model type"].unique().tolist()

    if color_dict is None:
        scalar_mappable = ScalarMappable(cmap=cmap)
        color_dict = {
            m: c
            for m, c in zip(
                models_order,
                scalar_mappable.to_rgba(range(len(models_order)), alpha=1).tolist(),
            )
        }
    color_dict = {k: color_dict[k] for k in models_order}

    hatches_dict = {
        "Before Calibration": "\\\\",
        "After Calibration": "",
    }

    bar_width = 0.1
    group_spacing = 0.2
    split_spacing = 0.6
    num_bars = len(models_order) * 2
    positions = []
    tick_positions = []
    tick_labels = []

    for i, metric in enumerate(metrics):
        metric_data = combined_stats_df[combined_stats_df["Metric"] == metric]
        split_types = metric_data["Split"].unique()
        for j, split in enumerate(split_types):
            split_data = metric_data[metric_data["Split"] == split]
            split_data = split_data[split_data["Model type"].isin(models_order)]

            for k, model_type in enumerate(models_order):
                for l, calibration in enumerate(["Before Calibration", "After Calibration"]):
                    position = (
                        i * (split_spacing + len(split_types) * (num_bars * bar_width + group_spacing))
                        + j * (num_bars * bar_width + group_spacing)
                        + k * 2 * bar_width
                        + l * bar_width
                    )
                    positions.append(position)
                    height = split_data[
                        (split_data["Model type"] == model_type)
                        & (split_data["Calibration"] == calibration)
                    ][f"{metric}_mean"].values[0]
                    ax.bar(
                        position,
                        height=height,
                        color=color_dict[model_type],
                        hatch=hatches_dict[calibration],
                        width=bar_width,
                    )

            center_position = (
                i * (split_spacing + len(split_types) * (num_bars * bar_width + group_spacing))
                + j * (num_bars * bar_width + group_spacing)
                + (num_bars * bar_width) / 2
            )
            tick_positions.append(center_position)
            tick_labels.append(f"{metric}\n{split}")

    if show_legend:
        legend_elements = [
            mpatches.Patch(facecolor=color_dict[model], edgecolor="black", label=model)
            for model in models_order
        ]
        legend_elements += [
            mpatches.Patch(facecolor="white", edgecolor="black", hatch=h, label=label)
            for label, h in hatches_dict.items()
        ]
        ax.legend(handles=legend_elements, bbox_to_anchor=(1.05, 1), loc="upper left", borderaxespad=0, frameon=False)

    for (_, row), bar in zip(combined_stats_df.iterrows(), ax.patches):
        x_bar = bar.get_x() + bar.get_width() / 2
        y_bar = bar.get_height()
        yerr_lower = y_bar - max(0, y_bar - row[f"{row['Metric']}_std"])
        yerr_upper = row[f"{row['Metric']}_std"]
        ax.errorbar(
            x_bar,
            y_bar,
            yerr=[[yerr_lower], [yerr_upper]],
            color="black",
            fmt="none",
            elinewidth=1,
            capsize=3,
            alpha=0.5,
        )

    ax.set_xticks(tick_positions)
    ax.set_xticklabels(tick_labels, rotation=45, ha="right", rotation_mode="anchor", fontsize=9)
    ax.set_ylim(bottom=0.0)

    if save_dir:
        metrics_names = "_".join(metrics)
        plot_name = f"comparison_barplot_{cmap}_{metrics_names}"
        save_plot(fig, save_dir, plot_name, show_legend=show_legend)

    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
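
A usage sketch with illustrative before/after-calibration rows (every Split/Model type/Calibration combination present in the frame must have values for each plotted metric):

import pandas as pd
from uqdd.metrics import plot_comparison_metrics

rows = [
    {"Split": "stratified", "Model type": "pnn", "Calibration": calib, "NLL": nll}
    for calib, nll in [
        ("Before Calibration", 1.25), ("Before Calibration", 1.20),
        ("After Calibration", 1.05), ("After Calibration", 1.00),
    ]
]
df = pd.DataFrame(rows)
plot_comparison_metrics(df, metrics=["NLL"], save_dir="plots", show_legend=True)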

uqdd.metrics.load_and_aggregate_calibration_data

load_and_aggregate_calibration_data(base_path: str, paths: List[str]) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]

Load calibration curve data from multiple model paths and aggregate statistics.

Parameters:

Name Type Description Default
base_path str

Base directory from which model subpaths are resolved.

required
paths list of str

Relative paths to model directories containing 'calibration_plot_data.csv'.

required

Returns:

Type Description
(ndarray, ndarray, ndarray, ndarray)

Tuple of (expected_values, mean_observed, lower_bound, upper_bound), each of shape (n_bins,).

Source code in uqdd/metrics/analysis.py
def load_and_aggregate_calibration_data(base_path: str, paths: List[str]) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """
    Load calibration curve data from multiple model paths and aggregate statistics.

    Parameters
    ----------
    base_path : str
        Base directory from which model subpaths are resolved.
    paths : list of str
        Relative paths to model directories containing 'calibration_plot_data.csv'.

    Returns
    -------
    (numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray)
        Tuple of (expected_values, mean_observed, lower_bound, upper_bound), each of shape (n_bins,).
    """
    expected_values = []
    observed_values = []
    for path in paths:
        file_path = os.path.join(base_path, path, "calibration_plot_data.csv")
        if os.path.exists(file_path):
            data = pd.read_csv(file_path)
            expected_values = data["Expected Proportion"]
            observed_values.append(data["Observed Proportion"])
        else:
            print(f"File not found: {file_path}")

    expected_values = np.array(expected_values)
    observed_values = np.array(observed_values)
    mean_observed = np.mean(observed_values, axis=0)
    lower_bound = np.min(observed_values, axis=0)
    upper_bound = np.max(observed_values, axis=0)
    return expected_values, mean_observed, lower_bound, upper_bound
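
A runnable sketch that writes two illustrative model folders to disk and aggregates their calibration curves (all names and values are made up); the resulting arrays are what plot_calibration_data consumes per group:

import os
import pandas as pd
from uqdd.metrics import load_and_aggregate_calibration_data

base = "calib_demo"
for name, observed in [("model1", [0.0, 0.45, 1.0]), ("model2", [0.0, 0.55, 1.0])]:
    model_dir = os.path.join(base, name)
    os.makedirs(model_dir, exist_ok=True)
    pd.DataFrame({
        "Expected Proportion": [0.0, 0.5, 1.0],
        "Observed Proportion": observed,
    }).to_csv(os.path.join(model_dir, "calibration_plot_data.csv"), index=False)

expected, mean_obs, lower, upper = load_and_aggregate_calibration_data(base, ["model1", "model2"])
print(mean_obs)  # element-wise mean of the observed curves -> [0.  0.5 1. ]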

uqdd.metrics.plot_calibration_data

plot_calibration_data(df_aggregated: DataFrame, base_path: str, save_dir: Optional[str] = None, title: str = 'Calibration Plot', color_name: str = 'tab10_r', color_dict: Optional[Dict[str, str]] = None, group_order: Optional[List[str]] = None, fig_width: Optional[float] = None, fig_height: Optional[float] = None, show_legend: bool = False) -> None

Plot aggregated calibration curves for multiple groups against the perfect calibration line.

Parameters:

Name Type Description Default
df_aggregated DataFrame

Aggregated DataFrame containing 'Group' and 'project_model' lists for each group.

required
base_path str

Base directory where model paths are located.

required
save_dir str or None

Directory to save plot images. Default is None.

None
title str

Plot title. Default is "Calibration Plot".

'Calibration Plot'
color_name str

Colormap name used to derive distinct colors per group. Default is "tab10_r".

'tab10_r'
color_dict dict[str, str] or None

Precomputed color mapping from group to color. If None, one is generated.

None
group_order list of str or None

Order of groups in the legend. Default derives from data.

None
fig_width float or None

Width of the plot area. Default is 6.

None
fig_height float or None

Height of the plot area. Default is 6.

None
show_legend bool

If True, include a legend. Default is False.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def plot_calibration_data(
    df_aggregated: pd.DataFrame,
    base_path: str,
    save_dir: Optional[str] = None,
    title: str = "Calibration Plot",
    color_name: str = "tab10_r",
    color_dict: Optional[Dict[str, str]] = None,
    group_order: Optional[List[str]] = None,
    fig_width: Optional[float] = None,
    fig_height: Optional[float] = None,
    show_legend: bool = False,
) -> None:
    """
    Plot aggregated calibration curves for multiple groups against the perfect calibration line.

    Parameters
    ----------
    df_aggregated : pd.DataFrame
        Aggregated DataFrame containing 'Group' and 'project_model' lists for each group.
    base_path : str
        Base directory where model paths are located.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    title : str, optional
        Plot title. Default is "Calibration Plot".
    color_name : str, optional
        Colormap name used to derive distinct colors per group. Default is "tab10_r".
    color_dict : dict[str, str] or None, optional
        Precomputed color mapping from group to color. If None, one is generated.
    group_order : list of str or None, optional
        Order of groups in the legend. Default derives from data.
    fig_width : float or None, optional
        Width of the plot area. Default is 6.
    fig_height : float or None, optional
        Height of the plot area. Default is 6.
    show_legend : bool, optional
        If True, include a legend. Default is False.

    Returns
    -------
    None
    """
    plot_width = fig_width if fig_width else 6
    plot_height = fig_height if fig_height else 6
    total_width = plot_width + 4
    total_height = plot_height + 2

    fig = plt.figure(figsize=(total_width, total_height))
    gs = gridspec.GridSpec(1, 1, figure=fig, left=0.15, right=0.75, top=0.9, bottom=0.15)
    ax = fig.add_subplot(gs[0])
    ax.set_position([0.15, 0.15, plot_width / total_width, plot_height / total_height])

    if group_order is None:
        group_order = list(df_aggregated["Group"].unique())

    if color_dict is None:
        scalar_mappable = ScalarMappable(cmap=color_name)
        colors = scalar_mappable.to_rgba(range(len(group_order)))
        color_dict = {group: color for group, color in zip(group_order, colors)}

    legend_handles = {}
    for idx, row in df_aggregated.iterrows():
        model_paths = row["project_model"]
        group_label = row["Group"]
        color = color_dict[group_label]
        expected, mean_observed, lower_bound, upper_bound = load_and_aggregate_calibration_data(base_path, model_paths)
        (line,) = ax.plot(expected, mean_observed, label=group_label, color=color)
        ax.fill_between(expected, lower_bound, upper_bound, alpha=0.2, color=color)
        if group_label not in legend_handles:
            legend_handles[group_label] = line

    (perfect_line,) = ax.plot([0, 1], [0, 1], "k--", label="Perfect Calibration")
    legend_handles["Perfect Calibration"] = perfect_line

    ordered_legend_handles = [legend_handles[group] for group in group_order if group in legend_handles]
    ordered_legend_handles.append(legend_handles["Perfect Calibration"])
    if show_legend:
        ax.legend(handles=ordered_legend_handles, bbox_to_anchor=(1.05, 1), loc="upper left")

    ax.set_title(title)
    ax.set_xlabel("Expected Proportion")
    ax.set_ylabel("Observed Proportion")
    ax.grid(True)
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)

    if save_dir:
        plot_name = f"{title.replace(' ', '_')}"
        save_plot(fig, save_dir, plot_name, tighten=True, show_legend=show_legend)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.move_model_folders

move_model_folders(df: DataFrame, search_dirs: List[str], output_dir: str, overwrite: bool = False) -> None

Move or merge model directories into a single output folder based on model names.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing a 'model name' column.

required
search_dirs list of str

Directories to search for model subfolders.

required
output_dir str

Destination directory where model folders will be moved or merged.

required
overwrite bool

If True, existing folders are merged (copied) with source. Default is False.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def move_model_folders(
    df: pd.DataFrame,
    search_dirs: List[str],
    output_dir: str,
    overwrite: bool = False,
) -> None:
    """
    Move or merge model directories into a single output folder based on model names.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing a 'model name' column.
    search_dirs : list of str
        Directories to search for model subfolders.
    output_dir : str
        Destination directory where model folders will be moved or merged.
    overwrite : bool, optional
        If True, existing folders are merged (copied) with source. Default is False.

    Returns
    -------
    None
    """
    model_names = df["model name"].unique()
    if not os.path.exists(output_dir):
        os.makedirs(output_dir, exist_ok=True)
        print(f"Created output directory '{output_dir}'.")

    for model_name in model_names:
        found = False
        for search_dir in search_dirs:
            if not os.path.isdir(search_dir):
                print(f"Search directory '{search_dir}' does not exist. Skipping.")
                continue
            subdirs = [d for d in os.listdir(search_dir) if os.path.isdir(os.path.join(search_dir, d))]
            if model_name in subdirs:
                source_dir = os.path.join(search_dir, model_name)
                dest_dir = os.path.join(output_dir, model_name)
                if os.path.exists(dest_dir):
                    if overwrite:
                        shutil.copytree(source_dir, dest_dir, dirs_exist_ok=True)
                        print(f"Merged (Copied) '{source_dir}' to '{dest_dir}'.")
                else:
                    try:
                        shutil.move(source_dir, dest_dir)
                        print(f"Moved '{source_dir}' to '{dest_dir}'.")
                    except Exception as e:
                        print(f"Error moving '{source_dir}' to '{dest_dir}': {e}")
                found = True
                break
        if not found:
            print(f"Model folder '{model_name}' not found in any of the search directories.")

uqdd.metrics.load_predictions

load_predictions(model_path: str) -> pd.DataFrame

Load pickled predictions from a model directory.

Parameters:

Name Type Description Default
model_path str

Path to the model directory containing 'preds.pkl'.

required

Returns:

Type Description
DataFrame

DataFrame loaded from the pickle file.

Source code in uqdd/metrics/analysis.py, lines 962–977
def load_predictions(model_path: str) -> pd.DataFrame:
    """
    Load pickled predictions from a model directory.

    Parameters
    ----------
    model_path : str
        Path to the model directory containing 'preds.pkl'.

    Returns
    -------
    pd.DataFrame
        DataFrame loaded from the pickle file.
    """
    preds_path = os.path.join(model_path, "preds.pkl")
    return pd.read_pickle(preds_path)
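
A one-line usage sketch; the model directory below is a hypothetical path that is assumed to contain a preds.pkl file.

from uqdd.metrics import load_predictions

# Hypothetical model directory holding 'preds.pkl'.
preds = load_predictions("runs/consolidated/ens_stratified_0")
print(preds.columns.tolist())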

uqdd.metrics.calculate_rmse_rejection_curve

calculate_rmse_rejection_curve(preds: DataFrame, uncertainty_col: str = 'y_alea', true_label_col: str = 'y_true', pred_label_col: str = 'y_pred', normalize_rmse: bool = False, random_rejection: bool = False, unc_type: Optional[str] = None, max_rejection_ratio: float = 0.95) -> Tuple[np.ndarray, np.ndarray, float]

Compute RMSE vs. rejection rate curve and its AUC by rejecting high-uncertainty predictions.

Parameters:

Name Type Description Default
preds DataFrame

DataFrame with columns for true labels, predicted labels, and uncertainty components.

required
uncertainty_col str

Column name for uncertainty to sort by if unc_type is None. Default is "y_alea".

'y_alea'
true_label_col str

Column name for true labels. Default is "y_true".

'y_true'
pred_label_col str

Column name for predicted labels. Default is "y_pred".

'y_pred'
normalize_rmse bool

If True, normalize RMSE by the initial RMSE before rejection. Default is False.

False
random_rejection bool

If True, randomly reject samples instead of sorting by uncertainty. Default is False.

False
unc_type (aleatoric, epistemic, both)

Which uncertainty to use. If "both", sums aleatoric and epistemic. If None, use uncertainty_col.

"aleatoric"
max_rejection_ratio float

Maximum fraction of samples to reject (exclusive of the tail). Default is 0.95.

0.95

Returns:

Type Description
(ndarray, ndarray, float)

Tuple of (rejection_rates, rmses, AUC of the RMSE–rejection curve).

Raises:

Type Description
ValueError

If unc_type is invalid or uncertainty_col is not present when needed.

Source code in uqdd/metrics/analysis.py, lines 980–1056
def calculate_rmse_rejection_curve(
    preds: pd.DataFrame,
    uncertainty_col: str = "y_alea",
    true_label_col: str = "y_true",
    pred_label_col: str = "y_pred",
    normalize_rmse: bool = False,
    random_rejection: bool = False,
    unc_type: Optional[str] = None,
    max_rejection_ratio: float = 0.95,
) -> Tuple[np.ndarray, np.ndarray, float]:
    """
    Compute RMSE vs. rejection rate curve and its AUC by rejecting high-uncertainty predictions.

    Parameters
    ----------
    preds : pd.DataFrame
        DataFrame with columns for true labels, predicted labels, and uncertainty components.
    uncertainty_col : str, optional
        Column name for uncertainty to sort by if `unc_type` is None. Default is "y_alea".
    true_label_col : str, optional
        Column name for true labels. Default is "y_true".
    pred_label_col : str, optional
        Column name for predicted labels. Default is "y_pred".
    normalize_rmse : bool, optional
        If True, normalize RMSE by the initial RMSE before rejection. Default is False.
    random_rejection : bool, optional
        If True, randomly reject samples instead of sorting by uncertainty. Default is False.
    unc_type : {"aleatoric", "epistemic", "both"} or None, optional
        Which uncertainty to use. If "both", sums aleatoric and epistemic. If None, use `uncertainty_col`.
    max_rejection_ratio : float, optional
        Maximum fraction of samples to reject (exclusive of the tail). Default is 0.95.

    Returns
    -------
    (numpy.ndarray, numpy.ndarray, float)
        Tuple of (rejection_rates, rmses, AUC of the RMSE–rejection curve).

    Raises
    ------
    ValueError
        If `unc_type` is invalid or `uncertainty_col` is not present when needed.
    """
    if unc_type == "aleatoric":
        uncertainty_col = "y_alea"
    elif unc_type == "epistemic":
        uncertainty_col = "y_eps"
    elif unc_type == "both":
        preds["y_unc"] = preds["y_alea"] + preds["y_eps"]
        uncertainty_col = "y_unc"
    elif unc_type is None and uncertainty_col in preds.columns:
        pass
    else:
        raise ValueError(
            "Either provide valid uncertainty type or provide the uncertainty column name in the DataFrame"
        )

    if random_rejection:
        preds = preds.sample(frac=max_rejection_ratio).reset_index(drop=True)
    else:
        preds = preds.sort_values(by=uncertainty_col, ascending=False)

    max_rejection_index = int(len(preds) * max_rejection_ratio)
    step = max(1, int(len(preds) * 0.01))
    rejection_steps = np.arange(0, max_rejection_index, step=step)
    rejection_rates = rejection_steps / len(preds)
    rmses = []

    initial_rmse = mean_squared_error(preds[true_label_col], preds[pred_label_col], squared=False)

    for i in rejection_steps:
        selected_preds = preds.iloc[i:]
        rmse = mean_squared_error(selected_preds[true_label_col], selected_preds[pred_label_col], squared=False)
        if normalize_rmse:
            rmse /= initial_rmse
        rmses.append(rmse)
    auc_arc = auc(rejection_rates, rmses)
    return rejection_rates, np.array(rmses), float(auc_arc)
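
A minimal sketch on synthetic predictions: the column names follow the defaults documented above ('y_true', 'y_pred', 'y_alea', 'y_eps'), while the values are randomly generated purely for illustration.

import numpy as np
import pandas as pd

from uqdd.metrics import calculate_rmse_rejection_curve

rng = np.random.default_rng(0)
n = 500
y_true = rng.normal(size=n)
noise_scale = rng.uniform(0.1, 1.0, size=n)
preds = pd.DataFrame({
    "y_true": y_true,
    "y_pred": y_true + rng.normal(scale=noise_scale, size=n),
    "y_alea": noise_scale ** 2,              # synthetic aleatoric uncertainty
    "y_eps": rng.uniform(0.0, 0.1, size=n),  # synthetic epistemic uncertainty
})

# Reject by total uncertainty (aleatoric + epistemic) and normalize by the
# RMSE of the full set before any rejection.
rates, rmses, auc_rrc = calculate_rmse_rejection_curve(
    preds, unc_type="both", normalize_rmse=True, max_rejection_ratio=0.95
)
print(f"AUC of the RMSE-rejection curve: {auc_rrc:.3f}")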

uqdd.metrics.calculate_rejection_curve

calculate_rejection_curve(df: DataFrame, model_paths: List[str], unc_col: str, random_rejection: bool = False, normalize_rmse: bool = False, max_rejection_ratio: float = 0.95) -> Tuple[np.ndarray, np.ndarray, np.ndarray, float, float]

Aggregate RMSE–rejection curves across models and compute mean/std and AUC statistics.

Parameters:

Name Type Description Default
df DataFrame

Auxiliary DataFrame (not used directly, kept for API symmetry).

required
model_paths list of str

Paths to model directories containing 'preds.pkl'.

required
unc_col str

Uncertainty column name to use when computing curves (e.g., 'y_alea' or 'y_eps').

required
random_rejection bool

If True, randomly reject samples. Default is False.

False
normalize_rmse bool

If True, normalize RMSE by the initial RMSE. Default is False.

False
max_rejection_ratio float

Maximum fraction of samples to reject. Default is 0.95.

0.95

Returns:

Type Description
(ndarray, ndarray, ndarray, float, float)

Tuple of (rejection_rates, mean_rmses, std_rmses, mean_auc, std_auc).

Source code in uqdd/metrics/analysis.py, lines 1059–1113
def calculate_rejection_curve(
    df: pd.DataFrame,
    model_paths: List[str],
    unc_col: str,
    random_rejection: bool = False,
    normalize_rmse: bool = False,
    max_rejection_ratio: float = 0.95,
) -> Tuple[np.ndarray, np.ndarray, np.ndarray, float, float]:
    """
    Aggregate RMSE–rejection curves across models and compute mean/std and AUC statistics.

    Parameters
    ----------
    df : pd.DataFrame
        Auxiliary DataFrame (not used directly, kept for API symmetry).
    model_paths : list of str
        Paths to model directories containing 'preds.pkl'.
    unc_col : str
        Uncertainty column name to use when computing curves (e.g., 'y_alea' or 'y_eps').
    random_rejection : bool, optional
        If True, randomly reject samples. Default is False.
    normalize_rmse : bool, optional
        If True, normalize RMSE by the initial RMSE. Default is False.
    max_rejection_ratio : float, optional
        Maximum fraction of samples to reject. Default is 0.95.

    Returns
    -------
    (numpy.ndarray, numpy.ndarray, numpy.ndarray, float, float)
        Tuple of (rejection_rates, mean_rmses, std_rmses, mean_auc, std_auc).
    """
    aggregated_rmses = []
    auc_values = []
    rejection_rates = None

    for model_path in model_paths:
        preds = load_predictions(model_path)
        if preds.empty:
            print(f"Preds not loaded for model: {model_path}")
            continue
        rejection_rates, rmses, auc_arc = calculate_rmse_rejection_curve(
            preds,
            uncertainty_col=unc_col,
            random_rejection=random_rejection,
            normalize_rmse=normalize_rmse,
            max_rejection_ratio=max_rejection_ratio,
        )
        aggregated_rmses.append(rmses)
        auc_values.append(auc_arc)

    mean_rmses = np.mean(aggregated_rmses, axis=0)
    std_rmses = np.std(aggregated_rmses, axis=0)
    mean_auc = np.mean(auc_values)
    std_auc = np.std(auc_values)
    return rejection_rates, mean_rmses, std_rmses, float(mean_auc), float(std_auc)
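
A sketch of aggregating curves over replicate runs; the model directories are hypothetical and each is assumed to contain a preds.pkl produced beforehand.

from uqdd.metrics import calculate_rejection_curve

# Hypothetical replicate directories, each holding a 'preds.pkl'.
model_paths = [f"runs/consolidated/ens_stratified_{i}" for i in range(5)]

rates, mean_rmses, std_rmses, mean_auc, std_auc = calculate_rejection_curve(
    df=None,                 # not used by the function, kept for API symmetry
    model_paths=model_paths,
    unc_col="y_alea",
    normalize_rmse=True,
)
print(f"AUC-RRC: {mean_auc:.3f} ± {std_auc:.3f}")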

uqdd.metrics.get_handles_labels

get_handles_labels(ax: Axes, group_order: List[str]) -> Tuple[List, List[str]]

Extract legend handles/labels ordered by group prefix.

Parameters:

Name Type Description Default
ax Axes

Axes object from which to retrieve legend entries.

required
group_order list of str

Group prefixes to order legend entries by.

required

Returns:

Type Description
(list, list of str)

Ordered handles and labels.

Source code in uqdd/metrics/analysis.py, lines 1116–1140
def get_handles_labels(ax: plt.Axes, group_order: List[str]) -> Tuple[List, List[str]]:
    """
    Extract legend handles/labels ordered by group prefix.

    Parameters
    ----------
    ax : matplotlib.axes.Axes
        Axes object from which to retrieve legend entries.
    group_order : list of str
        Group prefixes to order legend entries by.

    Returns
    -------
    (list, list of str)
        Ordered handles and labels.
    """
    handles, labels = ax.get_legend_handles_labels()
    ordered_handles = []
    ordered_labels = []
    for group in group_order:
        for label, handle in zip(labels, handles):
            if label.startswith(group):
                ordered_handles.append(handle)
                ordered_labels.append(label)
    return ordered_handles, ordered_labels
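
A small sketch showing how the helper reorders an existing legend by label prefix; the line labels here are illustrative only.

import matplotlib.pyplot as plt

from uqdd.metrics import get_handles_labels

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1], label="ensemble_stratified")
ax.plot([0, 1], [1, 0], label="mcdropout_stratified")
ax.plot([0, 1], [0.5, 0.5], label="ensemble_scaffold")

# Put all 'ensemble' entries before all 'mcdropout' entries.
handles, labels = get_handles_labels(ax, group_order=["ensemble", "mcdropout"])
ax.legend(handles=handles, labels=labels)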

uqdd.metrics.plot_rmse_rejection_curves

plot_rmse_rejection_curves(df: DataFrame, base_dir: str, cmap: str = 'tab10_r', color_dict: Optional[Dict[str, str]] = None, save_dir_plot: Optional[str] = None, add_to_title: str = '', normalize_rmse: bool = False, unc_type: str = 'aleatoric', max_rejection_ratio: float = 0.95, group_order: Optional[List[str]] = None, fig_width: Optional[float] = None, fig_height: Optional[float] = None, show_legend: bool = False) -> pd.DataFrame

Plot RMSE–rejection curves per group, including random rejection baselines, and summarize AUCs.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing columns 'Group', 'Split', and 'project_model'.

required
base_dir str

Base directory where model paths are located.

required
cmap str

Colormap name used to derive distinct colors per group. Default is "tab10_r".

'tab10_r'
color_dict dict[str, str] or None

Precomputed color mapping from group to color. If None, one is generated.

None
save_dir_plot str or None

Directory to save the plot images. Default is None.

None
add_to_title str

Suffix for the plot filename and title. Default is empty string.

''
normalize_rmse bool

If True, normalize RMSE by initial RMSE. Default is False.

False
unc_type (aleatoric, epistemic, both)

Uncertainty component to use for rejection. Default is "aleatoric".

"aleatoric"
max_rejection_ratio float

Maximum fraction of samples to reject. Default is 0.95.

0.95
group_order list of str or None

Order of groups in the legend. Default derives from data.

None
fig_width float or None

Plot width. Default is 6.

None
fig_height float or None

Plot height. Default is 6.

None
show_legend bool

If True, include a legend. Default is False.

False

Returns:

Type Description
DataFrame

Summary DataFrame with columns ['Model type', 'Split', 'Group', 'AUC-RRC_mean', 'AUC-RRC_std'].

Source code in uqdd/metrics/analysis.py, lines 1143–1289
def plot_rmse_rejection_curves(
    df: pd.DataFrame,
    base_dir: str,
    cmap: str = "tab10_r",
    color_dict: Optional[Dict[str, str]] = None,
    save_dir_plot: Optional[str] = None,
    add_to_title: str = "",
    normalize_rmse: bool = False,
    unc_type: str = "aleatoric",
    max_rejection_ratio: float = 0.95,
    group_order: Optional[List[str]] = None,
    fig_width: Optional[float] = None,
    fig_height: Optional[float] = None,
    show_legend: bool = False,
) -> pd.DataFrame:
    """
    Plot RMSE–rejection curves per group, including random rejection baselines, and summarize AUCs.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing columns 'Group', 'Split', and 'project_model'.
    base_dir : str
        Base directory where model paths are located.
    cmap : str, optional
        Colormap name used to derive distinct colors per group. Default is "tab10_r".
    color_dict : dict[str, str] or None, optional
        Precomputed color mapping from group to color. If None, one is generated.
    save_dir_plot : str or None, optional
        Directory to save the plot images. Default is None.
    add_to_title : str, optional
        Suffix for the plot filename and title. Default is empty string.
    normalize_rmse : bool, optional
        If True, normalize RMSE by initial RMSE. Default is False.
    unc_type : {"aleatoric", "epistemic", "both"}, optional
        Uncertainty component to use for rejection. Default is "aleatoric".
    max_rejection_ratio : float, optional
        Maximum fraction of samples to reject. Default is 0.95.
    group_order : list of str or None, optional
        Order of groups in the legend. Default derives from data.
    fig_width : float or None, optional
        Plot width. Default is 6.
    fig_height : float or None, optional
        Plot height. Default is 6.
    show_legend : bool, optional
        If True, include a legend. Default is False.

    Returns
    -------
    pd.DataFrame
        Summary DataFrame with columns ['Model type', 'Split', 'Group', 'AUC-RRC_mean', 'AUC-RRC_std'].
    """
    assert unc_type in ["aleatoric", "epistemic", "both"], "Invalid unc_type"
    unc_col = "y_alea" if unc_type == "aleatoric" else "y_eps"

    plot_width = fig_width if fig_width else 6
    plot_height = fig_height if fig_height else 6
    total_width = plot_width + 4
    total_height = plot_height + 2

    fig = plt.figure(figsize=(total_width, total_height))
    gs = gridspec.GridSpec(1, 1, figure=fig, left=0.15, right=0.75, top=0.9, bottom=0.15)
    ax = fig.add_subplot(gs[0])
    ax.set_position([0.15, 0.15, plot_width / total_width, plot_height / total_height])

    if group_order is None:
        group_order = list(df["Group"].unique())

    if color_dict is None:
        scalar_mappable = ScalarMappable(cmap=cmap)
        colors = scalar_mappable.to_rgba(range(len(group_order)))
        color_dict = {group: color for group, color in zip(group_order, colors)}

    color_dict["random reject"] = "black"

    df = df.copy()
    df.loc[:, "model_path"] = df["project_model"].apply(
        lambda x: (str(os.path.join(base_dir, x)) if not str(x).startswith(base_dir) else x)
    )

    stats_dfs = []
    included_groups = df["Group"].unique()
    legend_handles = []

    for group in included_groups:
        group_data = df[df["Group"] == group]
        model_paths = group_data["model_path"].unique()
        rejection_rates, mean_rmses, std_rmses, mean_auc, std_auc = calculate_rejection_curve(
            df, model_paths, unc_col, normalize_rmse=normalize_rmse, max_rejection_ratio=max_rejection_ratio
        )
        (line,) = ax.plot(
            rejection_rates,
            mean_rmses,
            label=f"{group} (AUC-RRC: {mean_auc:.3f} ± {std_auc:.3f})",
            color=color_dict[group],
        )
        ax.fill_between(rejection_rates, mean_rmses - std_rmses, mean_rmses + std_rmses, color=color_dict[group], alpha=0.2)
        legend_handles.append(line)
        stats_dfs.append({
            "Model type": group.rsplit("_", 1)[1],
            "Split": group.rsplit("_", 1)[0],
            "Group": group,
            "AUC-RRC_mean": mean_auc,
            "AUC-RRC_std": std_auc,
        })

    for split in df["Split"].unique():
        split_data = df[df["Split"] == split]
        model_paths = split_data["model_path"].unique()
        rejection_rates, mean_rmses, std_rmses, mean_auc, std_auc = calculate_rejection_curve(
            df, model_paths, unc_col, random_rejection=True, normalize_rmse=normalize_rmse, max_rejection_ratio=max_rejection_ratio
        )
        (line,) = ax.plot(
            rejection_rates,
            mean_rmses,
            label=f"random reject - {split} (AUC-RRC: {mean_auc:.3f} ± {std_auc:.3f})",
            color="black",
            linestyle="--",
        )
        ax.fill_between(rejection_rates, mean_rmses - std_rmses, mean_rmses + std_rmses, color="grey", alpha=0.2)
        legend_handles.append(line)
        stats_dfs.append({
            "Model type": "random reject",
            "Split": split,
            "Group": f"random reject - {split}",
            "AUC-RRC_mean": mean_auc,
            "AUC-RRC_std": std_auc,
        })

    ax.set_xlabel("Rejection Rate")
    ax.set_ylabel("RMSE" if not normalize_rmse else "Normalized RMSE")
    ax.set_xlim(0, max_rejection_ratio)
    ax.grid(True)

    if show_legend:
        ordered_handles, ordered_labels = get_handles_labels(ax, group_order)
        ordered_handles += [legend_handles[-1]]
        ordered_labels += [legend_handles[-1].get_label()]
        ax.legend(handles=ordered_handles, loc="lower left")

    plot_name = f"rmse_rejection_curve_{add_to_title}" if add_to_title else "rmse_rejection_curve"
    save_plot(fig, save_dir_plot, plot_name, tighten=True, show_legend=show_legend)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

    return pd.DataFrame(stats_dfs)
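
An end-to-end sketch under stated assumptions: the 'Group', 'Split', and 'project_model' columns follow the documentation above, each project_model folder under base_dir is assumed to contain a preds.pkl, and all paths are hypothetical.

import pandas as pd

from uqdd.metrics import plot_rmse_rejection_curves

# Hypothetical aggregation: one row per trained model replicate.
df = pd.DataFrame({
    "Group": ["stratified_ensemble"] * 3 + ["stratified_mcdropout"] * 3,
    "Split": ["stratified"] * 6,
    "project_model": [f"proj/ens_{i}" for i in range(3)]
                     + [f"proj/mcd_{i}" for i in range(3)],
})

stats_df = plot_rmse_rejection_curves(
    df,
    base_dir="runs/consolidated",     # hypothetical; holds the preds.pkl folders
    unc_type="aleatoric",
    normalize_rmse=True,
    save_dir_plot="uqdd/logs/plots",  # hypothetical output directory
    show_legend=True,
)
print(stats_df[["Group", "AUC-RRC_mean", "AUC-RRC_std"]])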

uqdd.metrics.plot_auc_comparison

plot_auc_comparison(stats_df: DataFrame, cmap: str = 'tab10_r', color_dict: Optional[Dict[str, str]] = None, save_dir: Optional[str] = None, add_to_title: str = '', min_y_axis: float = 0.0, hatches_dict: Optional[Dict[str, str]] = None, group_order: Optional[List[str]] = None, fig_width: Optional[float] = None, fig_height: Optional[float] = None, show_legend: bool = False) -> None

Plot bar charts comparing RRC-AUC across splits and model types, including random reject baselines.

Parameters:

Name Type Description Default
stats_df DataFrame

Summary DataFrame with columns ['Group', 'Split', 'Model type', 'AUC-RRC_mean', 'AUC-RRC_std'].

required
cmap str

Colormap name used to derive distinct colors per model type. Default is "tab10_r".

'tab10_r'
color_dict dict[str, str] or None

Precomputed color mapping from model type to color. If None, one is generated.

None
save_dir str or None

Directory to save plot images. Default is None.

None
add_to_title str

Title suffix for the plot. Default is empty string.

''
min_y_axis float

Minimum y-axis limit. Default is 0.0.

0.0
hatches_dict dict[str, str] or None

Hatch mapping for splits (e.g., {"stratified": "\\"}). If None, sensible defaults are used.

None
group_order list of str or None

Order of groups in the legend and x-axis. Default derives from data.

None
fig_width float or None

Plot width. Default is 6.

None
fig_height float or None

Plot height. Default is 6.

None
show_legend bool

If True, include a legend. Default is False.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py, lines 1292–1419
def plot_auc_comparison(
    stats_df: pd.DataFrame,
    cmap: str = "tab10_r",
    color_dict: Optional[Dict[str, str]] = None,
    save_dir: Optional[str] = None,
    add_to_title: str = "",
    min_y_axis: float = 0.0,
    hatches_dict: Optional[Dict[str, str]] = None,
    group_order: Optional[List[str]] = None,
    fig_width: Optional[float] = None,
    fig_height: Optional[float] = None,
    show_legend: bool = False,
) -> None:
    """
    Plot bar charts comparing RRC-AUC across splits and model types, including random reject baselines.

    Parameters
    ----------
    stats_df : pd.DataFrame
        Summary DataFrame with columns ['Group', 'Split', 'Model type', 'AUC-RRC_mean', 'AUC-RRC_std'].
    cmap : str, optional
        Colormap name used to derive distinct colors per model type. Default is "tab10_r".
    color_dict : dict[str, str] or None, optional
        Precomputed color mapping from model type to color. If None, one is generated.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    add_to_title : str, optional
        Title suffix for the plot. Default is empty string.
    min_y_axis : float, optional
        Minimum y-axis limit. Default is 0.0.
    hatches_dict : dict[str, str] or None, optional
        Hatch mapping for splits (e.g., {"stratified": "\\\\"}). Default uses sensible defaults.
    group_order : list of str or None, optional
        Order of groups in the legend and x-axis. Default derives from data.
    fig_width : float or None, optional
        Plot width. Default is 6.
    fig_height : float or None, optional
        Plot height. Default is 6.
    show_legend : bool, optional
        If True, include a legend. Default is False.

    Returns
    -------
    None
    """
    if hatches_dict is None:
        hatches_dict = {"stratified": "\\\\", "scaffold_cluster": "", "time": "/\\/\\/"}

    if group_order:
        all_groups = group_order + list(stats_df.loc[stats_df["Group"].str.startswith("random reject"), "Group"].unique())
        stats_df["Group"] = pd.Categorical(stats_df["Group"], categories=all_groups, ordered=True)
    else:
        all_groups = stats_df["Group"].unique().tolist()

    stats_df = stats_df.sort_values("Group").reset_index(drop=True)

    splits = list(hatches_dict.keys())
    stats_df.loc[:, "Split"] = pd.Categorical(stats_df["Split"], categories=splits, ordered=True)
    stats_df = stats_df.sort_values("Split").reset_index(drop=True)

    unique_model_types = stats_df.loc[stats_df["Model type"] != "random reject", "Model type"].unique()

    if color_dict is None:
        scalar_mappable = ScalarMappable(cmap=cmap)
        colors = scalar_mappable.to_rgba(range(len(unique_model_types)))
        color_dict = {model: color for model, color in zip(unique_model_types, colors)}
    color_dict["random reject"] = "black"

    unique_model_types = np.append(unique_model_types, "random reject")

    bar_width = 0.12
    group_spacing = 0.6

    plot_width = fig_width if fig_width else 6
    plot_height = fig_height if fig_height else 6
    total_width = plot_width + 4
    total_height = plot_height + 4

    fig = plt.figure(figsize=(total_width, total_height))
    gs = gridspec.GridSpec(1, 1, figure=fig, left=0.15, right=0.75, top=0.9, bottom=0.15)
    ax = fig.add_subplot(gs[0])
    ax.set_position([0.15, 0.15, plot_width / total_width, plot_height / total_height])

    tick_positions = []
    tick_labels = []

    for i, split in enumerate(splits):
        split_data = stats_df[stats_df["Split"] == split]
        split_data.loc[:, "Group"] = pd.Categorical(split_data["Group"], categories=all_groups, ordered=True)
        for j, (_, row) in enumerate(split_data.iterrows()):
            position = i * (len(unique_model_types) * bar_width + group_spacing) + j * bar_width
            ax.bar(
                position,
                height=row["AUC-RRC_mean"],
                yerr=row["AUC-RRC_std"],
                color=color_dict[row["Model type"]],
                edgecolor="white" if row["Model type"] == "random reject" else "black",
                hatch=hatches_dict[row["Split"]],
                width=bar_width,
            )
        center_position = i * (len(unique_model_types) * bar_width + group_spacing) + (len(unique_model_types) * bar_width) / 2
        tick_positions.append(center_position)
        tick_labels.append(split)

    def create_stats_legend(color_dict: Dict[str, str], hatches_dict: Dict[str, str], splits: List[str], model_types: Union[List[str], np.ndarray]):
        patches = []
        for split in splits:
            for model in model_types:
                label = f"{split} {model}"
                hatch_color = "white" if model == "random reject" else "black"
                patch = mpatches.Patch(facecolor=color_dict[model], hatch=hatches_dict[split], edgecolor=hatch_color, label=label)
                patches.append(patch)
        return patches

    if show_legend:
        legend_elements = create_stats_legend(color_dict, hatches_dict, splits, unique_model_types)
        ax.legend(handles=legend_elements, bbox_to_anchor=(1.05, 1), loc="upper left", borderaxespad=0, frameon=False)

    ax.set_xticks(tick_positions)
    ax.set_xticklabels(tick_labels, rotation=45, ha="right", rotation_mode="anchor", fontsize=9)
    ax.set_ylabel("RRC-AUC")
    ax.set_ylim(min_y_axis, 1.0)

    plot_name = f"auc_comparison_barplot_{cmap}" + (f"_{add_to_title}" if add_to_title else "")
    save_plot(fig, save_dir, plot_name, tighten=True, show_legend=show_legend)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
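
A short follow-on sketch; stats_df is assumed to be the summary returned by plot_rmse_rejection_curves above, and the output directory is hypothetical.

from uqdd.metrics import plot_auc_comparison

# stats_df as returned by plot_rmse_rejection_curves(...)
plot_auc_comparison(
    stats_df,
    save_dir="uqdd/logs/plots",   # hypothetical output directory
    add_to_title="aleatoric",
    min_y_axis=0.0,
    show_legend=True,
)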

uqdd.metrics.save_stats_df

save_stats_df(stats_df: DataFrame, save_dir: str, add_to_title: str = '') -> None

Save a stats DataFrame to CSV in a given directory.

Parameters:

Name Type Description Default
stats_df DataFrame

DataFrame to save.

required
save_dir str

Target directory to save the CSV.

required
add_to_title str

Suffix to append to the filename. Default is empty string.

''

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py, lines 1422–1440
def save_stats_df(stats_df: pd.DataFrame, save_dir: str, add_to_title: str = "") -> None:
    """
    Save a stats DataFrame to CSV in a given directory.

    Parameters
    ----------
    stats_df : pd.DataFrame
        DataFrame to save.
    save_dir : str
        Target directory to save the CSV.
    add_to_title : str, optional
        Suffix to append to the filename. Default is empty string.

    Returns
    -------
    None
    """
    os.makedirs(save_dir, exist_ok=True)
    stats_df.to_csv(os.path.join(save_dir, f"stats_df_{add_to_title}.csv"), index=False)

uqdd.metrics.load_stats_df

load_stats_df(save_dir: str, add_to_title: str = '') -> pd.DataFrame

Load a stats DataFrame from CSV in a given directory.

Parameters:

Name Type Description Default
save_dir str

Directory containing the CSV.

required
add_to_title str

Suffix appended to the filename. Default is empty string.

''

Returns:

Type Description
DataFrame

Loaded DataFrame.

Source code in uqdd/metrics/analysis.py, lines 1443–1459
def load_stats_df(save_dir: str, add_to_title: str = "") -> pd.DataFrame:
    """
    Load a stats DataFrame from CSV in a given directory.

    Parameters
    ----------
    save_dir : str
        Directory containing the CSV.
    add_to_title : str, optional
        Suffix appended to the filename. Default is empty string.

    Returns
    -------
    pd.DataFrame
        Loaded DataFrame.
    """
    return pd.read_csv(os.path.join(save_dir, f"stats_df_{add_to_title}.csv"))
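
A quick round-trip sketch exercising save_stats_df and load_stats_df together; the directory is a hypothetical location.

from uqdd.metrics import load_stats_df, save_stats_df

save_stats_df(stats_df, save_dir="uqdd/logs/stats", add_to_title="aleatoric")
stats_df_reloaded = load_stats_df("uqdd/logs/stats", add_to_title="aleatoric")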

uqdd.metrics.calc_regression_metrics

calc_regression_metrics(df, cycle_col, val_col, pred_col, thresh)

Compute regression and thresholded classification metrics per cycle/method/split.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing true and predicted values.

required
cycle_col str

Column name identifying cross-validation cycles.

required
val_col str

Column with true target values.

required
pred_col str

Column with predicted target values.

required
thresh float

Threshold to derive binary classes for precision/recall.

required

Returns:

Type Description
DataFrame

Metrics per (cv_cycle, method, split) with columns ['mae', 'mse', 'r2', 'rho', 'prec', 'recall'].

Source code in uqdd/metrics/stats.py, lines 43–82
def calc_regression_metrics(df, cycle_col, val_col, pred_col, thresh):
    """
    Compute regression and thresholded classification metrics per cycle/method/split.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame containing true and predicted values.
    cycle_col : str
        Column name identifying cross-validation cycles.
    val_col : str
        Column with true target values.
    pred_col : str
        Column with predicted target values.
    thresh : float
        Threshold to derive binary classes for precision/recall.

    Returns
    -------
    pd.DataFrame
        Metrics per (cv_cycle, method, split) with columns ['mae', 'mse', 'r2', 'rho', 'prec', 'recall'].
    """
    df_in = df.copy()
    metric_ls = ["mae", "mse", "r2", "rho", "prec", "recall"]
    metric_list = []
    df_in["true_class"] = df_in[val_col] > thresh
    assert len(df_in.true_class.unique()) == 2, "Binary classification requires two classes"
    df_in["pred_class"] = df_in[pred_col] > thresh

    for k, v in df_in.groupby([cycle_col, "method", "split"]):
        cycle, method, split = k
        mae = mean_absolute_error(v[val_col], v[pred_col])
        mse = mean_squared_error(v[val_col], v[pred_col])
        r2 = r2_score(v[val_col], v[pred_col])
        recall = recall_score(v.true_class, v.pred_class)
        prec = precision_score(v.true_class, v.pred_class)
        rho, _ = spearmanr(v[val_col], v[pred_col])
        metric_list.append([cycle, method, split, mae, mse, r2, rho, prec, recall])
    metric_df = pd.DataFrame(metric_list, columns=["cv_cycle", "method", "split"] + metric_ls)
    return metric_df
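
A minimal sketch on synthetic data. Besides the cycle, value, and prediction columns named in the signature, the grouping in the source also requires 'method' and 'split' columns; all values below are randomly generated for illustration.

import numpy as np
import pandas as pd

from uqdd.metrics import calc_regression_metrics

rng = np.random.default_rng(0)
rows = []
for cycle in range(5):
    for method in ["ensemble", "mcdropout"]:
        y = rng.uniform(4, 9, size=200)   # synthetic target values
        rows.append(pd.DataFrame({
            "cv_cycle": cycle,
            "method": method,
            "split": "stratified",
            "y_true": y,
            "y_pred": y + rng.normal(scale=0.5, size=y.size),
        }))
df = pd.concat(rows, ignore_index=True)

metric_df = calc_regression_metrics(
    df, cycle_col="cv_cycle", val_col="y_true", pred_col="y_pred", thresh=6.5
)
print(metric_df.groupby("method")[["mae", "r2", "rho"]].mean())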

uqdd.metrics.bootstrap_ci

bootstrap_ci(data, func=np.mean, n_bootstrap=1000, ci=95, random_state=42)

Compute bootstrap confidence interval for a statistic.

Parameters:

Name Type Description Default
data array - like

Sequence of numeric values.

required
func callable

Statistic function applied to bootstrap samples (e.g., numpy.mean). Default is numpy.mean.

mean
n_bootstrap int

Number of bootstrap resamples. Default is 1000.

1000
ci int or float

Confidence level percentage (e.g., 95). Default is 95.

95
random_state int

Seed for reproducibility. Default is 42.

42

Returns:

Type Description
tuple[float, float]

Lower and upper bounds for the confidence interval.

Source code in uqdd/metrics/stats.py, lines 85–115
def bootstrap_ci(data, func=np.mean, n_bootstrap=1000, ci=95, random_state=42):
    """
    Compute bootstrap confidence interval for a statistic.

    Parameters
    ----------
    data : array-like
        Sequence of numeric values.
    func : callable, optional
        Statistic function applied to bootstrap samples (e.g., numpy.mean). Default is numpy.mean.
    n_bootstrap : int, optional
        Number of bootstrap resamples. Default is 1000.
    ci : int or float, optional
        Confidence level percentage (e.g., 95). Default is 95.
    random_state : int, optional
        Seed for reproducibility. Default is 42.

    Returns
    -------
    tuple[float, float]
        Lower and upper bounds for the confidence interval.
    """
    np.random.seed(random_state)
    bootstrap_samples = []
    for _ in range(n_bootstrap):
        sample = resample(data, random_state=np.random.randint(0, 10000))
        bootstrap_samples.append(func(sample))
    alpha = (100 - ci) / 2
    lower = np.percentile(bootstrap_samples, alpha)
    upper = np.percentile(bootstrap_samples, 100 - alpha)
    return lower, upper
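
A small sketch: a 95% bootstrap confidence interval for the mean of a handful of per-cycle RMSE values; the numbers are illustrative.

import numpy as np

from uqdd.metrics import bootstrap_ci

rmse_per_cycle = np.array([0.62, 0.58, 0.65, 0.61, 0.60, 0.63])
lower, upper = bootstrap_ci(rmse_per_cycle, func=np.mean, n_bootstrap=2000, ci=95)
print(f"mean RMSE = {rmse_per_cycle.mean():.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")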

uqdd.metrics.rm_tukey_hsd

rm_tukey_hsd(df, metric, group_col, alpha=0.05, sort=False, direction_dict=None)

Repeated-measures Tukey HSD approximation using RM-ANOVA and studentized range.

Parameters:

Name Type Description Default
df DataFrame

Long-form DataFrame with columns including the metric, group, and 'cv_cycle' subject.

required
metric str

Metric column to compare.

required
group_col str

Column indicating groups (e.g., method/model type).

required
alpha float

Family-wise error rate for intervals. Default is 0.05.

0.05
sort bool

If True, sort groups by mean value of the metric. Default is False.

False
direction_dict dict or None

Mapping of metric -> 'maximize'|'minimize' to set sort ascending/descending.

None

Returns:

Type Description
tuple

(result_tab, df_means, df_means_diff, pc) where:

- result_tab: DataFrame of pairwise comparisons with mean differences and confidence intervals.
- df_means: mean metric value per group.
- df_means_diff: matrix of pairwise mean differences.
- pc: matrix of adjusted p-values.

Source code in uqdd/metrics/stats.py, lines 118–195
def rm_tukey_hsd(df, metric, group_col, alpha=0.05, sort=False, direction_dict=None):
    """
    Repeated-measures Tukey HSD approximation using RM-ANOVA and studentized range.

    Parameters
    ----------
    df : pd.DataFrame
        Long-form DataFrame with columns including the metric, group, and 'cv_cycle' subject.
    metric : str
        Metric column to compare.
    group_col : str
        Column indicating groups (e.g., method/model type).
    alpha : float, optional
        Family-wise error rate for intervals. Default is 0.05.
    sort : bool, optional
        If True, sort groups by mean value of the metric. Default is False.
    direction_dict : dict or None, optional
        Mapping of metric -> 'maximize'|'minimize' to set sort ascending/descending.

    Returns
    -------
    tuple
        (result_tab, df_means, df_means_diff, p_values_matrix) where:
        - result_tab: DataFrame of pairwise comparisons with mean differences and CIs.
        - df_means: mean per group.
        - df_means_diff: matrix of mean differences.
        - pc: matrix of adjusted p-values.
    """
    if sort and direction_dict and metric in direction_dict:
        ascending = direction_dict[metric] != "maximize"
        df_means = df.groupby(group_col).mean(numeric_only=True).sort_values(metric, ascending=ascending)
    else:
        df_means = df.groupby(group_col).mean(numeric_only=True)

    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=RuntimeWarning, message="divide by zero encountered in scalar divide")
        aov = pg.rm_anova(dv=metric, within=group_col, subject="cv_cycle", data=df, detailed=True)
    mse = aov.loc[1, "MS"]
    df_resid = aov.loc[1, "DF"]

    methods = df_means.index
    n_groups = len(methods)
    n_per_group = df[group_col].value_counts().mean()
    tukey_se = np.sqrt(2 * mse / (n_per_group))
    q = qsturng(1 - alpha, n_groups, df_resid)
    if isinstance(q, (tuple, list, np.ndarray)):
        q = q[0]

    num_comparisons = len(methods) * (len(methods) - 1) // 2
    result_tab = pd.DataFrame(index=range(num_comparisons), columns=["group1", "group2", "meandiff", "lower", "upper", "p-adj"])
    df_means_diff = pd.DataFrame(index=methods, columns=methods, data=0.0)
    pc = pd.DataFrame(index=methods, columns=methods, data=1.0)

    row_idx = 0
    for i, method1 in enumerate(methods):
        for j, method2 in enumerate(methods):
            if i < j:
                group1 = df[df[group_col] == method1][metric]
                group2 = df[df[group_col] == method2][metric]
                mean_diff = group1.mean() - group2.mean()
                studentized_range = np.abs(mean_diff) / tukey_se
                adjusted_p = psturng(studentized_range * np.sqrt(2), n_groups, df_resid)
                if isinstance(adjusted_p, (tuple, list, np.ndarray)):
                    adjusted_p = adjusted_p[0]
                lower = mean_diff - (q / np.sqrt(2) * tukey_se)
                upper = mean_diff + (q / np.sqrt(2) * tukey_se)
                result_tab.loc[row_idx] = [method1, method2, mean_diff, lower, upper, adjusted_p]
                pc.loc[method1, method2] = adjusted_p
                pc.loc[method2, method1] = adjusted_p
                df_means_diff.loc[method1, method2] = mean_diff
                df_means_diff.loc[method2, method1] = -mean_diff
                row_idx += 1

    df_means_diff = df_means_diff.astype(float)
    result_tab["group1_mean"] = result_tab["group1"].map(df_means[metric])
    result_tab["group2_mean"] = result_tab["group2"].map(df_means[metric])
    result_tab.index = result_tab["group1"] + " - " + result_tab["group2"]
    return result_tab, df_means, df_means_diff, pc
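
A sketch of a repeated-measures comparison on synthetic, balanced data (every method observed in every cv_cycle, as the RM-ANOVA requires); the method names and values are placeholders.

import numpy as np
import pandas as pd

from uqdd.metrics import rm_tukey_hsd

rng = np.random.default_rng(0)
methods = {"ensemble": 0.60, "mcdropout": 0.65, "evidential": 0.70}
df = pd.DataFrame([
    {"cv_cycle": c, "method": m, "rmse": base + rng.normal(scale=0.02)}
    for c in range(10) for m, base in methods.items()
])

result_tab, df_means, df_means_diff, pc = rm_tukey_hsd(
    df, metric="rmse", group_col="method", alpha=0.05,
    sort=True, direction_dict={"rmse": "minimize"},
)
print(result_tab[["meandiff", "lower", "upper", "p-adj"]])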

uqdd.metrics.make_boxplots

make_boxplots(df, metric_ls, save_dir=None, name_prefix='', model_order=None)

Plot boxplots for each metric grouped by method.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric_ls list of str

Metrics to visualize.

required
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Prefix for the output filename. Default is empty.

''
model_order list of str or None

Explicit order of methods on the x-axis. Default derives from data.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py, lines 198–238
def make_boxplots(df, metric_ls, save_dir=None, name_prefix="", model_order=None):
    """
    Plot boxplots for each metric grouped by method.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric_ls : list of str
        Metrics to visualize.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Prefix for the output filename. Default is empty.
    model_order : list of str or None, optional
        Explicit order of methods on the x-axis. Default derives from data.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    sns.set_theme(context="paper", font_scale=1.5)
    sns.set_style("whitegrid")
    figure, axes = plt.subplots(1, len(metric_ls), sharex=False, sharey=False, figsize=(28, 8))
    for i, stat in enumerate(metric_ls):
        ax = sns.boxplot(y=stat, x="method", hue="method", ax=axes[i], data=df, palette="Set2", legend=False, order=model_order, hue_order=model_order)
        title = stat.upper()
        ax.set_xlabel("")
        ax.set_ylabel(title)
        x_tick_labels = ax.get_xticklabels()
        label_text_list = [x.get_text() for x in x_tick_labels]
        new_xtick_labels = ["\n".join(x.split("_")) for x in label_text_list]
        ax.set_xticks(list(range(0, len(x_tick_labels))))
        ax.set_xticklabels(new_xtick_labels, rotation=45, ha="right")
    plt.subplots_adjust(hspace=0.3, wspace=0.3)
    save_plot(figure, save_dir, f"{name_prefix}_boxplot_{'_'.join(metric_ls)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
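
A short sketch using a synthetic long-form frame; the same call pattern applies to make_boxplots_parametric and make_boxplots_nonparametric below (both additionally use the 'cv_cycle' column). Note that the subplot grid indexes into axes, so pass at least two metrics.

import numpy as np
import pandas as pd

from uqdd.metrics import make_boxplots

rng = np.random.default_rng(1)
methods = ["ensemble", "mcdropout", "evidential"]
df = pd.DataFrame([
    {"cv_cycle": c, "method": m,
     "rmse": 0.60 + 0.05 * i + rng.normal(scale=0.02),
     "r2": 0.75 - 0.05 * i + rng.normal(scale=0.02)}
    for c in range(10) for i, m in enumerate(methods)
])

make_boxplots(
    df, metric_ls=["rmse", "r2"],
    save_dir="uqdd/logs/plots",   # hypothetical output directory
    name_prefix="demo", model_order=methods,
)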

uqdd.metrics.make_boxplots_parametric

make_boxplots_parametric(df, metric_ls, save_dir=None, name_prefix='', model_order=None)

Plot boxplots with RM-ANOVA p-values annotated per metric.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric_ls list of str

Metrics to visualize.

required
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Prefix for the output filename. Default is empty.

''
model_order list of str or None

Explicit order of methods on the x-axis. Default derives from data.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py, lines 241–284
def make_boxplots_parametric(df, metric_ls, save_dir=None, name_prefix="", model_order=None):
    """
    Plot boxplots with RM-ANOVA p-values annotated per metric.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric_ls : list of str
        Metrics to visualize.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Prefix for the output filename. Default is empty.
    model_order : list of str or None, optional
        Explicit order of methods on the x-axis. Default derives from data.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    sns.set_theme(context="paper", font_scale=1.5)
    sns.set_style("whitegrid")
    figure, axes = plt.subplots(1, len(metric_ls), sharex=False, sharey=False, figsize=(28, 8))
    for i, stat in enumerate(metric_ls):
        model = AnovaRM(data=df, depvar=stat, subject="cv_cycle", within=["method"]).fit()
        p_value = model.anova_table["Pr > F"].iloc[0]
        ax = sns.boxplot(y=stat, x="method", hue="method", ax=axes[i], data=df, palette="Set2", legend=False, order=model_order, hue_order=model_order)
        title = stat.upper()
        ax.set_title(f"p={p_value:.1e}")
        ax.set_xlabel("")
        ax.set_ylabel(title)
        x_tick_labels = ax.get_xticklabels()
        label_text_list = [x.get_text() for x in x_tick_labels]
        new_xtick_labels = ["\n".join(x.split("_")) for x in label_text_list]
        ax.set_xticks(list(range(0, len(x_tick_labels))))
        ax.set_xticklabels(new_xtick_labels, rotation=45, ha="right")
    plt.subplots_adjust(hspace=0.3, wspace=0.3)
    save_plot(figure, save_dir, f"{name_prefix}_boxplot_parametric_{'_'.join(metric_ls)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.make_boxplots_nonparametric

make_boxplots_nonparametric(df, metric_ls, save_dir=None, name_prefix='', model_order=None)

Plot boxplots with Friedman p-values annotated per metric.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric_ls list of str

Metrics to visualize.

required
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Prefix for the output filename. Default is empty.

''
model_order list of str or None

Explicit order of methods on the x-axis. Default derives from data.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py, lines 287–330
def make_boxplots_nonparametric(df, metric_ls, save_dir=None, name_prefix="", model_order=None):
    """
    Plot boxplots with Friedman p-values annotated per metric.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric_ls : list of str
        Metrics to visualize.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Prefix for the output filename. Default is empty.
    model_order : list of str or None, optional
        Explicit order of methods on the x-axis. Default derives from data.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    n_metrics = len(metric_ls)
    sns.set_theme(context="paper", font_scale=1.5)
    sns.set_style("whitegrid")
    figure, axes = plt.subplots(1, n_metrics, sharex=False, sharey=False, figsize=(28, 8))
    for i, stat in enumerate(metric_ls):
        friedman = pg.friedman(df, dv=stat, within="method", subject="cv_cycle")["p-unc"].values[0]
        ax = sns.boxplot(y=stat, x="method", hue="method", ax=axes[i], data=df, palette="Set2", legend=False, order=model_order, hue_order=model_order)
        title = stat.replace("_", " ").upper()
        ax.set_title(f"p={friedman:.1e}")
        ax.set_xlabel("")
        ax.set_ylabel(title)
        x_tick_labels = ax.get_xticklabels()
        label_text_list = [x.get_text() for x in x_tick_labels]
        new_xtick_labels = ["\n".join(x.split("_")) for x in label_text_list]
        ax.set_xticks(list(range(0, len(x_tick_labels))))
        ax.set_xticklabels(new_xtick_labels, rotation=45, ha="right")
    plt.subplots_adjust(hspace=0.3, wspace=0.3)
    save_plot(figure, save_dir, f"{name_prefix}_boxplot_nonparametric_{'_'.join(metric_ls)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.make_sign_plots_nonparametric

make_sign_plots_nonparametric(df, metric_ls, save_dir=None, name_prefix='', model_order=None)

Plot significance heatmaps (Conover post-hoc) for nonparametric comparisons.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric_ls list of str

Metrics to analyze.

required
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Prefix for the output filename. Default is empty.

''
model_order list of str or None

Explicit order of methods on axes. Default derives from data.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py, lines 333–374
def make_sign_plots_nonparametric(df, metric_ls, save_dir=None, name_prefix="", model_order=None):
    """
    Plot significance heatmaps (Conover post-hoc) for nonparametric comparisons.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric_ls : list of str
        Metrics to analyze.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Prefix for the output filename. Default is empty.
    model_order : list of str or None, optional
        Explicit order of methods on axes. Default derives from data.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    heatmap_args = {"linewidths": 0.25, "linecolor": "0.5", "clip_on": True, "square": True, "cbar_kws": {"pad": 0.05, "location": "right"}}
    n_metrics = len(metric_ls)
    sns.set_theme(context="paper", font_scale=1.5)
    figure, axes = plt.subplots(1, n_metrics, sharex=False, sharey=True, figsize=(26, 8))
    if n_metrics == 1:
        axes = [axes]
    for i, stat in enumerate(metric_ls):
        pc = sp.posthoc_conover_friedman(df, y_col=stat, group_col="method", block_col="cv_cycle", block_id_col="cv_cycle", p_adjust="holm", melted=True)
        if model_order is not None:
            pc = pc.reindex(index=model_order, columns=model_order)
        sub_ax, sub_c = sp.sign_plot(pc, **heatmap_args, ax=axes[i], xticklabels=True)
        sub_ax.set_title(stat.upper())
        if sub_c is not None and hasattr(sub_c, "ax"):
            figure.subplots_adjust(right=0.85)
            sub_c.ax.set_position([0.87, 0.5, 0.02, 0.2])
    save_plot(figure, save_dir, f"{name_prefix}_sign_plot_nonparametric_{'_'.join(metric_ls)}", tighten=False)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.make_critical_difference_diagrams

make_critical_difference_diagrams(df, metric_ls, save_dir=None, name_prefix='', model_order=None)

Plot critical difference diagrams per metric using average ranks and post-hoc p-values.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric_ls list of str

Metrics to analyze.

required
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Prefix for the output filename. Default is empty.

''
model_order list of str or None

Explicit order of models on diagrams. Default derives from data.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py, lines 377–414
def make_critical_difference_diagrams(df, metric_ls, save_dir=None, name_prefix="", model_order=None):
    """
    Plot critical difference diagrams per metric using average ranks and post-hoc p-values.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric_ls : list of str
        Metrics to analyze.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Prefix for the output filename. Default is empty.
    model_order : list of str or None, optional
        Explicit order of models on diagrams. Default derives from data.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    n_metrics = len(metric_ls)
    figure, axes = plt.subplots(n_metrics, 1, sharex=True, sharey=False, figsize=(16, 10))
    for i, stat in enumerate(metric_ls):
        avg_rank = df.groupby("cv_cycle")[stat].rank(pct=True).groupby(df.method).mean()
        pc = sp.posthoc_conover_friedman(df, y_col=stat, group_col="method", block_col="cv_cycle", block_id_col="cv_cycle", p_adjust="holm", melted=True)
        if model_order is not None:
            avg_rank = avg_rank.reindex(model_order)
            pc = pc.reindex(index=model_order, columns=model_order)
        sp.critical_difference_diagram(avg_rank, pc, ax=axes[i])
        axes[i].set_title(stat.upper())
    plt.subplots_adjust(hspace=0.3, wspace=0.3)
    save_plot(figure, save_dir, f"{name_prefix}_critical_difference_diagram_{'_'.join(metric_ls)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
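
A hedged sketch reusing the synthetic frame built in the make_boxplots example; the diagrams rely on the 'cv_cycle' blocking column, and (as with the boxplots) at least two metrics are needed for the subplot indexing.

from uqdd.metrics import make_critical_difference_diagrams

# df as constructed in the make_boxplots sketch above.
make_critical_difference_diagrams(
    df, metric_ls=["rmse", "r2"],
    save_dir="uqdd/logs/plots",   # hypothetical output directory
    name_prefix="demo",
    model_order=["ensemble", "mcdropout", "evidential"],
)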

uqdd.metrics.make_normality_diagnostic

make_normality_diagnostic(df, metric_ls, save_dir=None, name_prefix='')

Plot normality diagnostics (histogram/KDE and Q-Q) for residualized metrics.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric_ls list of str

Metrics to diagnose.

required
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Prefix for the output filename. Default is empty.

''

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py, lines 417–469
def make_normality_diagnostic(df, metric_ls, save_dir=None, name_prefix=""):
    """
    Plot normality diagnostics (histogram/KDE and Q-Q) for residualized metrics.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric_ls : list of str
        Metrics to diagnose.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Prefix for the output filename. Default is empty.

    Returns
    -------
    None
    """
    df_norm = df.copy()
    df_norm.replace([np.inf, -np.inf], np.nan, inplace=True)
    for metric in metric_ls:
        df_norm[metric] = df_norm[metric] - df_norm.groupby("method")[metric].transform("mean")
    df_norm = df_norm.melt(id_vars=["cv_cycle", "method", "split"], value_vars=metric_ls, var_name="metric", value_name="value")
    sns.set_theme(context="paper", font_scale=1.5)
    sns.set_style("whitegrid")
    metrics = df_norm["metric"].unique()
    n_metrics = len(metrics)
    fig, axes = plt.subplots(2, n_metrics, figsize=(20, 10))
    for i, metric in enumerate(metrics):
        ax = axes[0, i]
        sns.histplot(df_norm[df_norm["metric"] == metric]["value"], kde=True, ax=ax)
        ax.set_title(f"{metric}")
        ax.set_xlabel("")
        if i == 0:
            ax.set_ylabel("Count")
        else:
            ax.set_ylabel("")
    for i, metric in enumerate(metrics):
        ax = axes[1, i]
        metric_data = df_norm[df_norm["metric"] == metric]["value"]
        stats.probplot(metric_data, dist="norm", plot=ax)
        ax.set_title("")
        ax.set_xlabel("Theoretical Quantiles")
        if i == 0:
            ax.set_ylabel("Ordered Values")
        else:
            ax.set_ylabel("")
    plt.subplots_adjust(hspace=0.3, wspace=0.8)
    save_plot(fig, save_dir, f"{name_prefix}_normality_diagnostic_{'_'.join(metric_ls)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.mcs_plot

mcs_plot(pc, effect_size, means, labels=True, cmap=None, cbar_ax_bbox=None, ax=None, show_diff=True, cell_text_size=10, axis_text_size=8, show_cbar=True, reverse_cmap=False, vlim=None, **kwargs)

Render a multiple-comparisons significance heatmap annotated with effect sizes and stars.

Parameters:

Name Type Description Default
pc DataFrame

Matrix of adjusted p-values.

required
effect_size DataFrame

Matrix of mean differences (effect sizes) aligned with pc.

required
means Series

Mean values per group for labeling.

required
labels bool

If True, add x/y tick labels from means.index. Default is True.

True
cmap str or None

Colormap name for effect sizes. Default is 'YlGnBu'.

None
cbar_ax_bbox tuple or None

Custom colorbar axes bbox; unused here but kept for API compatibility.

None
ax Axes or None

Axes to draw into; if None, a new axes is created.

None
show_diff bool

If True, annotate cells with rounded effect sizes plus significance. Default is True.

True
cell_text_size int

Font size for annotations. Default is 10.

10
axis_text_size int

Font size for axis tick labels. Default is 8.

8
show_cbar bool

If True, show colorbar. Default is True.

True
reverse_cmap bool

If True, use reversed colormap. Default is False.

False
vlim float or None

Symmetric limit for color scaling around 0. Default is None.

None

Returns:

Type Description
Axes

Axes containing the rendered heatmap.

Source code in uqdd/metrics/stats.py, lines 472–536
def mcs_plot(pc, effect_size, means, labels=True, cmap=None, cbar_ax_bbox=None, ax=None, show_diff=True, cell_text_size=10, axis_text_size=8, show_cbar=True, reverse_cmap=False, vlim=None, **kwargs):
    """
    Render a multiple-comparisons significance heatmap annotated with effect sizes and stars.

    Parameters
    ----------
    pc : pd.DataFrame
        Matrix of adjusted p-values.
    effect_size : pd.DataFrame
        Matrix of mean differences (effect sizes) aligned with `pc`.
    means : pd.Series
        Mean values per group for labeling.
    labels : bool, optional
        If True, add x/y tick labels from `means.index`. Default is True.
    cmap : str or None, optional
        Colormap name for effect sizes. Default is 'YlGnBu'.
    cbar_ax_bbox : tuple or None, optional
        Custom colorbar axes bbox; unused here but kept for API compatibility.
    ax : matplotlib.axes.Axes or None, optional
        Axes to draw into; if None, a new axes is created.
    show_diff : bool, optional
        If True, annotate cells with rounded effect sizes plus significance. Default is True.
    cell_text_size : int, optional
        Font size for annotations. Default is 10.
    axis_text_size : int, optional
        Font size for axis tick labels. Default is 8.
    show_cbar : bool, optional
        If True, show colorbar. Default is True.
    reverse_cmap : bool, optional
        If True, use reversed colormap. Default is False.
    vlim : float or None, optional
        Symmetric limit for color scaling around 0. Default is None.

    Returns
    -------
    matplotlib.axes.Axes
        Axes containing the rendered heatmap.
    """
    for key in ["cbar", "vmin", "vmax", "center"]:
        if key in kwargs:
            del kwargs[key]
    if not cmap:
        cmap = "YlGnBu"
    if reverse_cmap:
        cmap = cmap + "_r"
    significance = pc.copy().astype(object)
    significance[(pc < 0.001) & (pc >= 0)] = "***"
    significance[(pc < 0.01) & (pc >= 0.001)] = "**"
    significance[(pc < 0.05) & (pc >= 0.01)] = "*"
    significance[(pc >= 0.05)] = ""
    np.fill_diagonal(significance.values, "")
    annotations = effect_size.round(2).astype(str) + significance if show_diff else significance
    hax = sns.heatmap(effect_size, cmap=cmap, annot=annotations, fmt="", cbar=show_cbar, ax=ax, annot_kws={"size": cell_text_size}, vmin=-2 * vlim if vlim else None, vmax=2 * vlim if vlim else None, square=True, **kwargs)
    if labels:
        label_list = list(means.index)
        x_label_list = label_list
        y_label_list = label_list
        xtick_positions = np.arange(len(label_list))
        hax.set_xticks(xtick_positions + 0.5)
        hax.set_xticklabels(x_label_list, size=axis_text_size, ha="center", va="center", rotation=90)
        hax.set_yticks(xtick_positions + 0.5)
        hax.set_yticklabels(y_label_list, size=axis_text_size, ha="center", va="center", rotation=0)
    hax.set_xlabel("")
    hax.set_ylabel("")
    return hax
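
Example: a minimal sketch with hand-made p-value and effect-size matrices; in practice pc, effect_size, and means come from rm_tukey_hsd, as in make_mcs_plot_grid below. The model names and numbers here are placeholders.

import pandas as pd
import matplotlib.pyplot as plt
from uqdd.metrics import mcs_plot

models = ["MCD", "Ensemble", "Evidential"]  # placeholder model names
pc = pd.DataFrame([[1.0, 0.004, 0.030],
                   [0.004, 1.0, 0.200],
                   [0.030, 0.200, 1.0]], index=models, columns=models)
diff = pd.DataFrame([[0.00, 0.12, 0.05],
                     [-0.12, 0.00, -0.07],
                     [-0.05, 0.07, 0.00]], index=models, columns=models)
means = pd.Series([0.70, 0.58, 0.65], index=models)
fig, ax = plt.subplots(figsize=(4, 4))
mcs_plot(pc, effect_size=diff, means=means, ax=ax, vlim=0.1)
plt.show()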

uqdd.metrics.make_mcs_plot_grid

make_mcs_plot_grid(df, stats_list, group_col, alpha=0.05, figsize=(20, 10), direction_dict=None, effect_dict=None, show_diff=True, cell_text_size=16, axis_text_size=12, title_text_size=16, sort_axes=False, save_dir=None, name_prefix='', model_order=None)

Generate a grid of MCS plots for multiple metrics.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
stats_list list of str

Metrics to include.

required
group_col str

Column indicating groups (e.g., method).

required
alpha float

Significance level. Default is 0.05.

0.05
figsize tuple

Figure size. Default is (20, 10).

(20, 10)
direction_dict dict or None

Mapping metric -> 'maximize'|'minimize' for colormap orientation.

None
effect_dict dict or None

Mapping metric -> effect size limit for color scaling.

None
show_diff bool

If True, annotate mean differences; else annotate significance only.

True
cell_text_size int

Annotation font size.

16
axis_text_size int

Axis label font size.

12
title_text_size int

Title font size.

16
sort_axes bool

If True, sort groups by mean values per metric.

False
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Filename prefix. Default is empty.

''
model_order list of str or None

Explicit model order for rows/cols.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py (lines 539–620)
def make_mcs_plot_grid(df, stats_list, group_col, alpha=0.05, figsize=(20, 10), direction_dict=None, effect_dict=None, show_diff=True, cell_text_size=16, axis_text_size=12, title_text_size=16, sort_axes=False, save_dir=None, name_prefix="", model_order=None):
    """
    Generate a grid of MCS plots for multiple metrics.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    stats_list : list of str
        Metrics to include.
    group_col : str
        Column indicating groups (e.g., method).
    alpha : float, optional
        Significance level. Default is 0.05.
    figsize : tuple, optional
        Figure size. Default is (20, 10).
    direction_dict : dict or None, optional
        Mapping metric -> 'maximize'|'minimize' for colormap orientation.
    effect_dict : dict or None, optional
        Mapping metric -> effect size limit for color scaling.
    show_diff : bool, optional
        If True, annotate mean differences; else annotate significance only.
    cell_text_size : int, optional
        Annotation font size.
    axis_text_size : int, optional
        Axis label font size.
    title_text_size : int, optional
        Title font size.
    sort_axes : bool, optional
        If True, sort groups by mean values per metric.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Filename prefix. Default is empty.
    model_order : list of str or None, optional
        Explicit model order for rows/cols.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    nrow = math.ceil(len(stats_list) / 3)
    fig, ax = plt.subplots(nrow, 3, figsize=figsize)
    for key in ["r2", "rho", "prec", "recall", "mae", "mse"]:
        direction_dict.setdefault(key, "maximize" if key in ["r2", "rho", "prec", "recall"] else "minimize")
    for key in ["r2", "rho", "prec", "recall"]:
        effect_dict.setdefault(key, 0.1)
    for i, stat in enumerate(stats_list):
        row = i // 3
        col = i % 3
        if stat not in direction_dict:
            raise ValueError(f"Stat '{stat}' is missing in direction_dict. Please set its value.")
        if stat not in effect_dict:
            raise ValueError(f"Stat '{stat}' is missing in effect_dict. Please set its value.")
        reverse_cmap = direction_dict[stat] == "minimize"
        _, df_means, df_means_diff, pc = rm_tukey_hsd(df, stat, group_col, alpha, sort_axes, direction_dict)
        if model_order is not None:
            df_means = df_means.reindex(model_order)
            df_means_diff = df_means_diff.reindex(index=model_order, columns=model_order)
            pc = pc.reindex(index=model_order, columns=model_order)
        hax = mcs_plot(pc, effect_size=df_means_diff, means=df_means[stat], show_diff=show_diff, ax=ax[row, col], cbar=True, cell_text_size=cell_text_size, axis_text_size=axis_text_size, reverse_cmap=reverse_cmap, vlim=effect_dict[stat])
        hax.set_title(stat.upper(), fontsize=title_text_size)
    if (len(stats_list) % 3) != 0:
        for i in range(len(stats_list), nrow * 3):
            row = i // 3
            col = i % 3
            ax[row, col].set_visible(False)
    from matplotlib.lines import Line2D
    legend_elements = [
        Line2D([0], [0], marker="o", color="w", label="p < 0.001 (***): Highly Significant", markerfacecolor="black", markersize=10),
        Line2D([0], [0], marker="o", color="w", label="p < 0.01 (**): Very Significant", markerfacecolor="black", markersize=10),
        Line2D([0], [0], marker="o", color="w", label="p < 0.05 (*): Significant", markerfacecolor="black", markersize=10),
        Line2D([0], [0], marker="o", color="w", label="p >= 0.05: Not Significant", markerfacecolor="black", markersize=10),
    ]
    fig.legend(handles=legend_elements, loc="upper right", ncol=2, fontsize=12, frameon=False)
    plt.subplots_adjust(top=0.88)
    save_plot(fig, save_dir, f"{name_prefix}_mcs_plot_grid_{'_'.join(stats_list)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
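
Example: a sketch on a synthetic, already-harmonized results table (columns cv_cycle, method, split plus one numeric column per metric); it assumes rm_tukey_hsd treats cv_cycle as the repeated-measures factor. Model names, metric values, and effect-size limits are placeholders; pass save_dir to write the figure via save_plot.

import numpy as np
import pandas as pd
from uqdd.metrics import make_mcs_plot_grid

rng = np.random.default_rng(0)
rows = [{"cv_cycle": c, "method": m, "split": "random",
         "rmse": rng.normal(0.70 + d, 0.03), "nll": rng.normal(1.0 + d, 0.05),
         "r2": rng.normal(0.55 - d, 0.03), "rho": rng.normal(0.60 - d, 0.03)}
        for m, d in [("MCD", 0.05), ("Ensemble", 0.0), ("Evidential", 0.02), ("GP", 0.03)]
        for c in range(10)]
df = pd.DataFrame(rows)
direction = {"rmse": "minimize", "nll": "minimize", "r2": "maximize", "rho": "maximize"}
effect = {"rmse": 0.05, "nll": 0.1, "r2": 0.1, "rho": 0.1}
make_mcs_plot_grid(df, stats_list=["rmse", "nll", "r2", "rho"], group_col="method",
                   direction_dict=direction, effect_dict=effect, name_prefix="demo")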

uqdd.metrics.make_scatterplot

make_scatterplot(df, val_col, pred_col, thresh, cycle_col='cv_cycle', group_col='method', save_dir=None)

Scatter plots of predicted vs true values per method, with threshold lines and summary stats.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
val_col str

True value column.

required
pred_col str

Predicted value column.

required
thresh float

Threshold for classification overlays.

required
cycle_col str

Cross-validation cycle column. Default is 'cv_cycle'.

'cv_cycle'
group_col str

Method/model type column. Default is 'method'.

'method'
save_dir str or None

Directory to save the plot. Default is None.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py (lines 623–673)
def make_scatterplot(df, val_col, pred_col, thresh, cycle_col="cv_cycle", group_col="method", save_dir=None):
    """
    Scatter plots of predicted vs true values per method, with threshold lines and summary stats.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    val_col : str
        True value column.
    pred_col : str
        Predicted value column.
    thresh : float
        Threshold for classification overlays.
    cycle_col : str, optional
        Cross-validation cycle column. Default is 'cv_cycle'.
    group_col : str, optional
        Method/model type column. Default is 'method'.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    df_split_metrics = calc_regression_metrics(df, cycle_col=cycle_col, val_col=val_col, pred_col=pred_col, thresh=thresh)
    methods = df[group_col].unique()
    fig, axs = plt.subplots(nrows=1, ncols=len(methods), figsize=(25, 10))
    for ax, method in zip(axs, methods):
        df_method = df.query(f"{group_col} == @method")
        df_metrics = df_split_metrics.query(f"{group_col} == @method")
        ax.scatter(df_method[pred_col], df_method[val_col], alpha=0.3)
        ax.plot([df_method[val_col].min(), df_method[val_col].max()], [df_method[val_col].min(), df_method[val_col].max()], "k--", lw=1)
        ax.axhline(y=thresh, color="r", linestyle="--")
        ax.axvline(x=thresh, color="r", linestyle="--")
        ax.set_title(method)
        y_true = df_method[val_col] > thresh
        y_pred = df_method[pred_col] > thresh
        precision = precision_score(y_true, y_pred)
        recall = recall_score(y_true, y_pred)
        metrics_text = f"MAE: {df_metrics['mae'].mean():.2f}\nMSE: {df_metrics['mse'].mean():.2f}\nR2: {df_metrics['r2'].mean():.2f}\nrho: {df_metrics['rho'].mean():.2f}\nPrecision: {precision:.2f}\nRecall: {recall:.2f}"
        ax.text(0.05, 0.5, metrics_text, transform=ax.transAxes, verticalalignment="top")
        ax.set_xlabel("Predicted")
        ax.set_ylabel("Measured")
    plt.subplots_adjust(hspace=0.3, wspace=0.3)
    save_plot(fig, save_dir, f"scatterplot_{val_col}_vs_{pred_col}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
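
Example: a sketch on synthetic predictions; the pKi column names and method names are placeholders, and the per-method summary metrics are computed internally via calc_regression_metrics.

import numpy as np
import pandas as pd
from uqdd.metrics import make_scatterplot

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "cv_cycle": rng.integers(0, 5, n),
    "method": rng.choice(["MCD", "Ensemble"], n),  # placeholder model names
    "split": "random",
    "pKi": rng.normal(6.5, 1.0, n),
})
df["pKi_pred"] = df["pKi"] + rng.normal(0.0, 0.5, n)
make_scatterplot(df, val_col="pKi", pred_col="pKi_pred", thresh=7.0)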

uqdd.metrics.ci_plot

ci_plot(result_tab, ax_in, name)

Plot mean differences with confidence intervals for pairwise comparisons.

Parameters:

Name Type Description Default
result_tab DataFrame

Output of rm_tukey_hsd with columns ['meandiff', 'lower', 'upper'].

required
ax_in Axes

Axes to plot into.

required
name str

Title for the plot.

required

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py (lines 676–702)
def ci_plot(result_tab, ax_in, name):
    """
    Plot mean differences with confidence intervals for pairwise comparisons.

    Parameters
    ----------
    result_tab : pd.DataFrame
        Output of rm_tukey_hsd with columns ['meandiff', 'lower', 'upper'].
    ax_in : matplotlib.axes.Axes
        Axes to plot into.
    name : str
        Title for the plot.

    Returns
    -------
    None
    """
    result_err = np.array([result_tab["meandiff"] - result_tab["lower"], result_tab["upper"] - result_tab["meandiff"]])
    sns.set_theme(context="paper")
    sns.set_style("whitegrid")
    ax = sns.pointplot(x=result_tab.meandiff, y=result_tab.index, marker="o", linestyle="", ax=ax_in)
    ax.errorbar(y=result_tab.index, x=result_tab["meandiff"], xerr=result_err, fmt="o", capsize=5)
    ax.axvline(0, ls="--", lw=3)
    ax.set_xlabel("Mean Difference")
    ax.set_ylabel("")
    ax.set_title(name)
    ax.set_xlim(-0.2, 0.2)

uqdd.metrics.make_ci_plot_grid

make_ci_plot_grid(df_in, metric_list, group_col='method', save_dir=None, name_prefix='', model_order=None)

Plot a grid of confidence-interval charts for multiple metrics.

Parameters:

Name Type Description Default
df_in DataFrame

Input DataFrame.

required
metric_list list of str

Metrics to render.

required
group_col str

Group column (e.g., 'method'). Default is 'method'.

'method'
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Filename prefix. Default is empty.

''
model_order list of str or None

Explicit row order for the CI plots.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py (lines 705–743)
def make_ci_plot_grid(df_in, metric_list, group_col="method", save_dir=None, name_prefix="", model_order=None):
    """
    Plot a grid of confidence-interval charts for multiple metrics.

    Parameters
    ----------
    df_in : pd.DataFrame
        Input DataFrame.
    metric_list : list of str
        Metrics to render.
    group_col : str, optional
        Group column (e.g., 'method'). Default is 'method'.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Filename prefix. Default is empty.
    model_order : list of str or None, optional
        Explicit row order for the CI plots.

    Returns
    -------
    None
    """
    df_in = df_in.copy()
    df_in.replace([np.inf, -np.inf], np.nan, inplace=True)
    figure, axes = plt.subplots(len(metric_list), 1, figsize=(8, 2 * len(metric_list)), sharex=False)
    if not isinstance(axes, np.ndarray):
        axes = np.array([axes])
    for i, metric in enumerate(metric_list):
        df_tukey, _, _, _ = rm_tukey_hsd(df_in, metric, group_col=group_col)
        if model_order is not None:
            df_tukey = df_tukey.reindex(index=model_order)
        ci_plot(df_tukey, ax_in=axes[i], name=metric)
    figure.suptitle("Multiple Comparison of Means\nTukey HSD, FWER=0.05")
    plt.subplots_adjust(hspace=0.9, wspace=0.3)
    save_plot(figure, save_dir, f"{name_prefix}_ci_plot_grid_{'_'.join(metric_list)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
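
Example: a sketch reusing the same kind of harmonized results table as in the make_mcs_plot_grid example above; again, rm_tukey_hsd is assumed to use cv_cycle as the repeated-measures factor, and all names and values are synthetic.

import numpy as np
import pandas as pd
from uqdd.metrics import make_ci_plot_grid

rng = np.random.default_rng(1)
rows = [{"cv_cycle": c, "method": m, "split": "random",
         "rmse": rng.normal(0.70 + d, 0.03), "nll": rng.normal(1.0 + d, 0.05)}
        for m, d in [("MCD", 0.05), ("Ensemble", 0.0), ("Evidential", 0.02)]
        for c in range(10)]
df = pd.DataFrame(rows)
make_ci_plot_grid(df, metric_list=["rmse", "nll"], group_col="method", name_prefix="demo")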

uqdd.metrics.recall_at_precision

recall_at_precision(y_true, y_score, precision_threshold=0.5, direction='greater')

Find recall and threshold achieving at least a target precision.

Parameters:

Name Type Description Default
y_true array-like

Binary ground-truth labels.

required
y_score array-like

Continuous scores or probabilities.

required
precision_threshold float

Minimum precision to achieve. Default is 0.5.

0.5
direction {'greater', 'lesser'}

If 'greater', thresholding uses >=; if 'lesser', uses <=. Default is 'greater'.

'greater'

Returns:

Type Description
tuple[float, float or None]

(recall, threshold) if achievable; otherwise (nan, None).

Raises:

Type Description
ValueError

If direction is invalid.

Source code in uqdd/metrics/stats.py (lines 746–785)
def recall_at_precision(y_true, y_score, precision_threshold=0.5, direction="greater"):
    """
    Find recall and threshold achieving at least a target precision.

    Parameters
    ----------
    y_true : array-like
        Binary ground-truth labels.
    y_score : array-like
        Continuous scores or probabilities.
    precision_threshold : float, optional
        Minimum precision to achieve. Default is 0.5.
    direction : {"greater", "lesser"}, optional
        If 'greater', thresholding uses >=; if 'lesser', uses <=. Default is 'greater'.

    Returns
    -------
    tuple[float, float or None]
        (recall, threshold) if achievable; otherwise (nan, None).

    Raises
    ------
    ValueError
        If `direction` is invalid.
    """
    if direction not in ["greater", "lesser"]:
        raise ValueError("Invalid direction. Expected one of: ['greater', 'lesser']")
    y_true = np.array(y_true)
    y_score = np.array(y_score)
    thresholds = np.unique(y_score)
    thresholds = np.sort(thresholds)
    if direction == "lesser":
        thresholds = thresholds[::-1]
    for threshold in thresholds:
        y_pred = y_score >= threshold if direction == "greater" else y_score <= threshold
        precision = precision_score(y_true, y_pred)
        if precision >= precision_threshold:
            recall = recall_score(y_true, y_pred)
            return recall, threshold
    return np.nan, None
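
Example: a small self-contained run on synthetic labels and scores.

import numpy as np
from uqdd.metrics import recall_at_precision

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)
y_score = 0.6 * y_true + 0.4 * rng.random(500)  # scores loosely separated by class
recall, thr = recall_at_precision(y_true, y_score, precision_threshold=0.8)
print(f"recall={recall:.2f} at threshold={thr:.2f} for precision >= 0.8")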

uqdd.metrics.calc_classification_metrics

calc_classification_metrics(df_in, cycle_col, val_col, prob_col, pred_col)

Compute classification metrics per cycle/method/split, including ROC-AUC, PR-AUC, MCC, recall, and TNR.

Parameters:

Name Type Description Default
df_in DataFrame

Input DataFrame.

required
cycle_col str

Column name for cross-validation cycles.

required
val_col str

True binary label column.

required
prob_col str

Predicted probability/score column.

required
pred_col str

Predicted binary label column.

required

Returns:

Type Description
DataFrame

Metrics per (cv_cycle, method, split) with columns ['roc_auc', 'pr_auc', 'mcc', 'recall', 'tnr'].

Source code in uqdd/metrics/stats.py (lines 788–820)
def calc_classification_metrics(df_in, cycle_col, val_col, prob_col, pred_col):
    """
    Compute classification metrics per cycle/method/split, including ROC-AUC, PR-AUC, MCC, recall, and TNR.

    Parameters
    ----------
    df_in : pd.DataFrame
        Input DataFrame.
    cycle_col : str
        Column name for cross-validation cycles.
    val_col : str
        True binary label column.
    prob_col : str
        Predicted probability/score column.
    pred_col : str
        Predicted binary label column.

    Returns
    -------
    pd.DataFrame
        Metrics per (cv_cycle, method, split) with columns ['roc_auc', 'pr_auc', 'mcc', 'recall', 'tnr'].
    """
    metric_list = []
    for k, v in df_in.groupby([cycle_col, "method", "split"]):
        cycle, method, split = k
        roc_auc = roc_auc_score(v[val_col], v[prob_col])
        pr_auc = average_precision_score(v[val_col], v[prob_col])
        mcc = matthews_corrcoef(v[val_col], v[pred_col])
        recall, _ = recall_at_precision(v[val_col].astype(bool), v[prob_col], precision_threshold=0.8, direction="greater")
        tnr, _ = recall_at_precision(~v[val_col].astype(bool), v[prob_col], precision_threshold=0.8, direction="lesser")
        metric_list.append([cycle, method, split, roc_auc, pr_auc, mcc, recall, tnr])
    metric_df = pd.DataFrame(metric_list, columns=["cv_cycle", "method", "split", "roc_auc", "pr_auc", "mcc", "recall", "tnr"])
    return metric_df
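
Example: a sketch on a synthetic binary-classification table; the column names (active, active_prob, active_pred) and method names are placeholders.

import numpy as np
import pandas as pd
from uqdd.metrics import calc_classification_metrics

rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "cv_cycle": rng.integers(0, 3, n),
    "method": rng.choice(["MCD", "Ensemble"], n),  # placeholder model names
    "split": "random",
    "active": rng.integers(0, 2, n),
})
df["active_prob"] = 0.7 * df["active"] + 0.3 * rng.random(n)
df["active_pred"] = (df["active_prob"] > 0.5).astype(int)
metric_df = calc_classification_metrics(df, cycle_col="cv_cycle", val_col="active",
                                         prob_col="active_prob", pred_col="active_pred")
print(metric_df.head())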

uqdd.metrics.make_curve_plots

make_curve_plots(df)

Plot ROC and PR curves for split/method selections with threshold markers.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing 'cv_cycle', 'split', and method columns plus true/probability fields.

required

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py (lines 823–869)
def make_curve_plots(df):
    """
    Plot ROC and PR curves for split/method selections with threshold markers.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame containing 'cv_cycle', 'split', and method columns plus true/probability fields.

    Returns
    -------
    None
    """
    df_plot = df.query("cv_cycle == 0 and split == 'scaffold'").copy()
    color_map = plt.get_cmap("tab10")
    le = LabelEncoder()
    df_plot["color"] = le.fit_transform(df_plot["method"])
    colors = color_map(df_plot["color"].unique())
    val_col = "Sol"
    prob_col = "Sol_prob"
    fig, axes = plt.subplots(1, 2, figsize=(12, 6))
    for (k, v), color in zip(df_plot.groupby("method"), colors):
        roc_auc = roc_auc_score(v[val_col], v[prob_col])
        pr_auc = average_precision_score(v[val_col], v[prob_col])
        fpr, recall_pos, thresholds_roc = roc_curve(v[val_col], v[prob_col])
        precision, recall, thresholds_pr = precision_recall_curve(v[val_col], v[prob_col])
        _, threshold_recall_pos = recall_at_precision(v[val_col].astype(bool), v[prob_col], precision_threshold=0.8, direction="greater")
        _, threshold_recall_neg = recall_at_precision(~v[val_col].astype(bool), v[prob_col], precision_threshold=0.8, direction="lesser")
        fpr_recall_pos = fpr[np.abs(thresholds_roc - threshold_recall_pos).argmin()]
        fpr_recall_neg = fpr[np.abs(thresholds_roc - threshold_recall_neg).argmin()]
        recall_recall_pos = recall[np.abs(thresholds_pr - threshold_recall_pos).argmin()]
        recall_recall_neg = recall[np.abs(thresholds_pr - threshold_recall_neg).argmin()]
        axes[0].plot(fpr, recall_pos, label=f"{k} (ROC AUC={roc_auc:.03f})", color=color, alpha=0.75)
        axes[1].plot(recall, precision, label=f"{k} (PR AUC={pr_auc:.03f})", color=color, alpha=0.75)
        axes[0].axvline(fpr_recall_pos, color=color, linestyle=":", alpha=0.75)
        axes[0].axvline(fpr_recall_neg, color=color, linestyle="--", alpha=0.75)
        axes[1].axvline(recall_recall_pos, color=color, linestyle=":", alpha=0.75)
        axes[1].axvline(recall_recall_neg, color=color, linestyle="--", alpha=0.75)
    axes[0].plot([0, 1], [0, 1], "--", color="black", lw=0.5)
    axes[0].set_xlabel("False Positive Rate")
    axes[0].set_ylabel("True Positive Rate")
    axes[0].set_title("ROC Curve")
    axes[0].legend()
    axes[1].set_xlabel("Recall")
    axes[1].set_ylabel("Precision")
    axes[1].set_title("Precision-Recall Curve")
    axes[1].legend()

uqdd.metrics.harmonize_columns

harmonize_columns(df)

Normalize common column names to ['method', 'split', 'cv_cycle'].

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with possibly varied column naming.

required

Returns:

Type Description
DataFrame

DataFrame with standardized column names; an assertion verifies that the required columns ('method', 'split', 'cv_cycle') exist.

Source code in uqdd/metrics/stats.py (lines 872–894)
def harmonize_columns(df):
    """
    Normalize common column names to ['method', 'split', 'cv_cycle'].

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with possibly varied column naming.

    Returns
    -------
    pd.DataFrame
        DataFrame with standardized column names and assertion that required columns exist.
    """
    df = df.copy()
    rename_map = {
        "Model type": "method",
        "Split": "split",
        "Group_Number": "cv_cycle",
    }
    df.rename(columns={k: v for k, v in rename_map.items() if k in df.columns}, inplace=True)
    assert {"method", "split", "cv_cycle"}.issubset(df.columns)
    return df
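
Example: the rename map shown above applied to a toy frame (the model names and values are placeholders).

import pandas as pd
from uqdd.metrics import harmonize_columns

df = pd.DataFrame({"Model type": ["MCD", "Ensemble"],  # placeholder model names
                   "Split": ["random", "random"],
                   "Group_Number": [0, 1],
                   "RMSE": [0.71, 0.68]})
df = harmonize_columns(df)
print(df.columns.tolist())  # ['method', 'split', 'cv_cycle', 'RMSE']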

uqdd.metrics.cliffs_delta

cliffs_delta(x, y)

Compute Cliff's delta effect size and qualitative interpretation.

Parameters:

Name Type Description Default
x array-like

First sample of numeric values.

required
y array-like

Second sample of numeric values.

required

Returns:

Type Description
tuple[float, str]

(delta, interpretation) where interpretation is one of {'negligible','small','medium','large'}.

Source code in uqdd/metrics/stats.py (lines 897–932)
def cliffs_delta(x, y):
    """
    Compute Cliff's delta effect size and qualitative interpretation.

    Parameters
    ----------
    x : array-like
        First sample of numeric values.
    y : array-like
        Second sample of numeric values.

    Returns
    -------
    tuple[float, str]
        (delta, interpretation) where interpretation is one of {'negligible','small','medium','large'}.
    """
    x, y = np.array(x), np.array(y)
    m, n = len(x), len(y)
    comparisons = 0
    for xi in x:
        for yi in y:
            if xi > yi:
                comparisons += 1
            elif xi < yi:
                comparisons -= 1
    delta = comparisons / (m * n)
    abs_delta = abs(delta)
    if abs_delta < 0.147:
        interpretation = "negligible"
    elif abs_delta < 0.33:
        interpretation = "small"
    elif abs_delta < 0.474:
        interpretation = "medium"
    else:
        interpretation = "large"
    return delta, interpretation
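
Example: two small synthetic samples of per-fold RMSE values.

from uqdd.metrics import cliffs_delta

rmse_a = [0.70, 0.72, 0.69, 0.71, 0.73]
rmse_b = [0.64, 0.66, 0.63, 0.65, 0.67]
delta, label = cliffs_delta(rmse_a, rmse_b)
print(delta, label)  # 1.0 'large': every value in rmse_a exceeds every value in rmse_b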

uqdd.metrics.wilcoxon_pairwise_test

wilcoxon_pairwise_test(df, metric, model_a, model_b, task=None, split=None, seed_col=None)

Perform paired Wilcoxon signed-rank test between two models on a metric.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric str

Metric column to compare.

required
model_a str

First model type name.

required
model_b str

Second model type name.

required
task str or None

Task filter. Default is None.

None
split str or None

Split filter. Default is None.

None
seed_col str or None

Optional seed column identifier (unused here).

None

Returns:

Type Description
dict or None

Test summary including statistic, p-value, Cliff's delta, CI on differences; None if insufficient data.

Source code in uqdd/metrics/stats.py (lines 935–999)
def wilcoxon_pairwise_test(df, metric, model_a, model_b, task=None, split=None, seed_col=None):
    """
    Perform paired Wilcoxon signed-rank test between two models on a metric.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric : str
        Metric column to compare.
    model_a : str
        First model type name.
    model_b : str
        Second model type name.
    task : str or None, optional
        Task filter. Default is None.
    split : str or None, optional
        Split filter. Default is None.
    seed_col : str or None, optional
        Optional seed column identifier (unused here).

    Returns
    -------
    dict or None
        Test summary including statistic, p-value, Cliff's delta, CI on differences; None if insufficient data.
    """
    data = df.copy()
    if task is not None:
        data = data[data["Task"] == task]
    if split is not None:
        data = data[data["Split"] == split]
    values_a = data[data["Model type"] == model_a][metric].values
    values_b = data[data["Model type"] == model_b][metric].values
    if len(values_a) == 0 or len(values_b) == 0:
        return None
    min_len = min(len(values_a), len(values_b))
    values_a = values_a[:min_len]
    values_b = values_b[:min_len]
    statistic, p_value = wilcoxon(values_a, values_b, alternative="two-sided")
    delta, effect_size_interpretation = cliffs_delta(values_a, values_b)
    differences = values_a - values_b
    median_diff = np.median(differences)
    ci_lower, ci_upper = bootstrap_ci(differences, np.median, n_bootstrap=1000)
    if ci_lower <= 0 <= ci_upper:
        practical_significance = "difference is small (CI includes 0)"
    elif abs(median_diff) < 0.1 * np.std(np.concatenate([values_a, values_b])):
        practical_significance = "difference is small"
    else:
        practical_significance = "difference may be meaningful"
    return {
        "model_a": model_a,
        "model_b": model_b,
        "metric": metric,
        "task": task,
        "split": split,
        "n_pairs": min_len,
        "wilcoxon_statistic": statistic,
        "p_value": p_value,
        "cliffs_delta": delta,
        "effect_size_interpretation": effect_size_interpretation,
        "median_difference": median_diff,
        "ci_lower": ci_lower,
        "ci_upper": ci_upper,
        "practical_significance": practical_significance,
    }
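
Example: a sketch on a synthetic results table with the columns this function expects ('Model type', 'Task', 'Split' and a metric column); the model and task names are placeholders, and the per-fold scores are made up.

import numpy as np
import pandas as pd
from uqdd.metrics import wilcoxon_pairwise_test

rng = np.random.default_rng(0)
base = rng.normal(0.70, 0.03, 10)
df = pd.DataFrame({
    "Model type": ["Ensemble"] * 10 + ["MCD"] * 10,  # placeholder model names
    "Task": "ki",
    "Split": "random",
    "RMSE": np.concatenate([base, base + 0.05]),
})
res = wilcoxon_pairwise_test(df, "RMSE", "Ensemble", "MCD", task="ki", split="random")
print(res["p_value"], res["cliffs_delta"], res["practical_significance"])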

uqdd.metrics.holm_bonferroni_correction

holm_bonferroni_correction(p_values)

Apply Holm–Bonferroni correction to an array of p-values.

Parameters:

Name Type Description Default
p_values array-like

Raw p-values.

required

Returns:

Type Description
tuple[ndarray, ndarray]

(corrected_p_values, rejected_mask) where rejected indicates significance after correction.

Source code in uqdd/metrics/stats.py (lines 1002–1029)
def holm_bonferroni_correction(p_values):
    """
    Apply Holm–Bonferroni correction to an array of p-values.

    Parameters
    ----------
    p_values : array-like
        Raw p-values.

    Returns
    -------
    tuple[numpy.ndarray, numpy.ndarray]
        (corrected_p_values, rejected_mask) where rejected indicates significance after correction.
    """
    p_values = np.array(p_values)
    n = len(p_values)
    sorted_indices = np.argsort(p_values)
    sorted_p_values = p_values[sorted_indices]
    corrected_p_values = np.zeros(n)
    rejected = np.zeros(n, dtype=bool)
    for i in range(n):
        correction_factor = n - i
        corrected_p_values[sorted_indices[i]] = min(1.0, sorted_p_values[i] * correction_factor)
        if corrected_p_values[sorted_indices[i]] < 0.05:
            rejected[sorted_indices[i]] = True
        else:
            break
    return corrected_p_values, rejected
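
Example: note that, as written, the step-down loop stops at the first hypothesis it fails to reject, so later entries keep their initialized corrected value of 0 and remain non-significant.

from uqdd.metrics import holm_bonferroni_correction

raw_p = [0.001, 0.02, 0.04, 0.30]
corrected, rejected = holm_bonferroni_correction(raw_p)
print(corrected)  # 0.001*4 = 0.004, then 0.02*3 = 0.06 (not rejected); remaining entries stay 0.0
print(rejected)   # only the first hypothesis survives the correction at 0.05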

uqdd.metrics.pairwise_model_comparison

pairwise_model_comparison(df, metrics, models=None, tasks=None, splits=None, alpha=0.05)

Run pairwise Wilcoxon tests across models/tasks/splits for multiple metrics and adjust p-values.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metrics list of str

Metrics to compare.

required
models list of str or None

Models to include; default derives from data.

None
tasks list of str or None

Tasks to include; default derives from data.

None
splits list of str or None

Splits to include; default derives from data.

None
alpha float

Significance level. Default is 0.05.

0.05

Returns:

Type Description
DataFrame

Results table with corrected p-values and significance flags.

Source code in uqdd/metrics/stats.py (lines 1032–1079)
def pairwise_model_comparison(df, metrics, models=None, tasks=None, splits=None, alpha=0.05):
    """
    Run pairwise Wilcoxon tests across models/tasks/splits for multiple metrics and adjust p-values.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metrics : list of str
        Metrics to compare.
    models : list of str or None, optional
        Models to include; default derives from data.
    tasks : list of str or None, optional
        Tasks to include; default derives from data.
    splits : list of str or None, optional
        Splits to include; default derives from data.
    alpha : float, optional
        Significance level. Default is 0.05.

    Returns
    -------
    pd.DataFrame
        Results table with corrected p-values and significance flags.
    """
    if models is None:
        models = df["Model type"].unique()
    if tasks is None:
        tasks = df["Task"].unique()
    if splits is None:
        splits = df["Split"].unique()
    results = []
    for metric in metrics:
        for task in tasks:
            for split in splits:
                for i, model_a in enumerate(models):
                    for j, model_b in enumerate(models):
                        if i < j:
                            result = wilcoxon_pairwise_test(df, metric, model_a, model_b, task, split)
                            if result is not None:
                                results.append(result)
    if not results:
        return pd.DataFrame()
    results_df = pd.DataFrame(results)
    p_values = results_df["p_value"].values
    corrected_p_values, rejected = holm_bonferroni_correction(p_values)
    results_df["corrected_p_value"] = corrected_p_values
    results_df["significant_after_correction"] = rejected
    return results_df
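
Example: a synthetic end-to-end run; the model names, the task label, and all metric values are placeholders, while 'RMSE' and 'NLL' are metric columns as used elsewhere in this package.

import numpy as np
import pandas as pd
from uqdd.metrics import pairwise_model_comparison

rng = np.random.default_rng(0)
rows = [{"Model type": m, "Task": "ki", "Split": s,
         "RMSE": rng.normal(0.70 + d, 0.03), "NLL": rng.normal(1.0 + 2 * d, 0.05)}
        for m, d in [("MCD", 0.05), ("Ensemble", 0.0), ("Evidential", 0.02)]
        for s in ["random", "scaffold"] for _ in range(8)]
df = pd.DataFrame(rows)
results = pairwise_model_comparison(df, metrics=["RMSE", "NLL"])
print(results[["model_a", "model_b", "metric", "split",
               "corrected_p_value", "significant_after_correction"]].head())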

uqdd.metrics.friedman_nemenyi_test

friedman_nemenyi_test(df, metrics, models=None, alpha=0.05)

Run Friedman test across models with Nemenyi post-hoc where significant, per metric.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metrics list of str

Metrics to test.

required
models list of str or None

Models to include; default derives from data.

None
alpha float

Significance level. Default is 0.05.

0.05

Returns:

Type Description
dict

Mapping metric -> result dict containing stats, p-values, mean ranks, and optional post-hoc outputs.

Source code in uqdd/metrics/stats.py (lines 1082–1136)
def friedman_nemenyi_test(df, metrics, models=None, alpha=0.05):
    """
    Run Friedman test across models with Nemenyi post-hoc where significant, per metric.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metrics : list of str
        Metrics to test.
    models : list of str or None, optional
        Models to include; default derives from data.
    alpha : float, optional
        Significance level. Default is 0.05.

    Returns
    -------
    dict
        Mapping metric -> result dict containing stats, p-values, mean ranks, and optional post-hoc outputs.
    """
    if models is None:
        models = df["Model type"].unique()
    results = {}
    for metric in metrics:
        pivot_data = df.pivot_table(values=metric, index=["Task", "Split"], columns="Model type", aggfunc="mean")
        available_models = [m for m in models if m in pivot_data.columns]
        pivot_data = pivot_data[available_models]
        pivot_data = pivot_data.dropna()
        if pivot_data.shape[0] < 2 or pivot_data.shape[1] < 3:
            results[metric] = {"error": "Insufficient data for Friedman test", "data_shape": pivot_data.shape}
            continue
        try:
            friedman_stat, friedman_p = friedmanchisquare(*[pivot_data[col].values for col in pivot_data.columns])
            ranks = pivot_data.rank(axis=1, ascending=False)
            mean_ranks = ranks.mean()
            result = {
                "friedman_statistic": friedman_stat,
                "friedman_p_value": friedman_p,
                "mean_ranks": mean_ranks.to_dict(),
                "significant": friedman_p < alpha,
            }
            if friedman_p < alpha:
                try:
                    data_array = pivot_data.values
                    nemenyi_result = sp.posthoc_nemenyi_friedman(data_array.T)
                    nemenyi_result.index = available_models
                    nemenyi_result.columns = available_models
                    result["nemenyi_p_values"] = nemenyi_result.to_dict()
                    result["critical_difference"] = calculate_critical_difference(len(available_models), pivot_data.shape[0], alpha)
                except Exception as e:
                    result["nemenyi_error"] = str(e)
            results[metric] = result
        except Exception as e:
            results[metric] = {"error": str(e)}
    return results
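
Example: a sketch on a synthetic table with at least three models and several (Task, Split) blocks, which is what the internal pivot requires; all names and values are made up.

import numpy as np
import pandas as pd
from uqdd.metrics import friedman_nemenyi_test

rng = np.random.default_rng(0)
rows = [{"Model type": m, "Task": t, "Split": s, "RMSE": rng.normal(0.70 + d, 0.02)}
        for m, d in [("MCD", 0.05), ("Ensemble", 0.0), ("Evidential", 0.02), ("GP", 0.03)]
        for t in ["ki", "kd", "ic50"] for s in ["random", "scaffold"]]
df = pd.DataFrame(rows)
res = friedman_nemenyi_test(df, metrics=["RMSE"])
print(res["RMSE"]["friedman_p_value"], res["RMSE"]["mean_ranks"])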

uqdd.metrics.calculate_critical_difference

calculate_critical_difference(k, n, alpha=0.05)

Compute the critical difference for average ranks in Nemenyi post-hoc tests.

Parameters:

Name Type Description Default
k int

Number of models.

required
n int

Number of datasets/blocks.

required
alpha float

Significance level. Default is 0.05.

0.05

Returns:

Type Description
float

Critical difference value.

Source code in uqdd/metrics/stats.py (lines 1139–1160)
def calculate_critical_difference(k, n, alpha=0.05):
    """
    Compute the critical difference for average ranks in Nemenyi post-hoc tests.

    Parameters
    ----------
    k : int
        Number of models.
    n : int
        Number of datasets/blocks.
    alpha : float, optional
        Significance level. Default is 0.05.

    Returns
    -------
    float
        Critical difference value.
    """
    from scipy.stats import studentized_range
    q_alpha = studentized_range.ppf(1 - alpha, k, np.inf) / np.sqrt(2)
    cd = q_alpha * np.sqrt(k * (k + 1) / (6 * n))
    return cd
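
Example: the critical difference for four models compared over six dataset/split blocks at alpha = 0.05.

from uqdd.metrics import calculate_critical_difference

cd = calculate_critical_difference(k=4, n=6, alpha=0.05)
print(cd)  # rank differences larger than this are significant in the Nemenyi sense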

uqdd.metrics.bootstrap_auc_difference

bootstrap_auc_difference(auc_values_a, auc_values_b, n_bootstrap=1000, ci=95, random_state=42)

Bootstrap confidence interval for difference of mean AUCs between two models.

Parameters:

Name Type Description Default
auc_values_a array-like

AUC values for model A.

required
auc_values_b array-like

AUC values for model B.

required
n_bootstrap int

Number of bootstrap resamples. Default is 1000.

1000
ci int or float

Confidence level in percent. Default is 95.

95
random_state int

Seed for reproducibility. Default is 42.

42

Returns:

Type Description
dict

{'mean_difference', 'ci_lower', 'ci_upper', 'bootstrap_differences'}

Source code in uqdd/metrics/stats.py (lines 1163–1197)
def bootstrap_auc_difference(auc_values_a, auc_values_b, n_bootstrap=1000, ci=95, random_state=42):
    """
    Bootstrap confidence interval for difference of mean AUCs between two models.

    Parameters
    ----------
    auc_values_a : array-like
        AUC values for model A.
    auc_values_b : array-like
        AUC values for model B.
    n_bootstrap : int, optional
        Number of bootstrap resamples. Default is 1000.
    ci : int or float, optional
        Confidence level in percent. Default is 95.
    random_state : int, optional
        Seed for reproducibility. Default is 42.

    Returns
    -------
    dict
        {'mean_difference', 'ci_lower', 'ci_upper', 'bootstrap_differences'}
    """
    np.random.seed(random_state)
    differences = []
    for _ in range(n_bootstrap):
        sample_a = resample(auc_values_a, random_state=np.random.randint(0, 10000))
        sample_b = resample(auc_values_b, random_state=np.random.randint(0, 10000))
        diff = np.mean(sample_a) - np.mean(sample_b)
        differences.append(diff)
    differences = np.array(differences)
    alpha = (100 - ci) / 2
    ci_lower = np.percentile(differences, alpha)
    ci_upper = np.percentile(differences, 100 - alpha)
    original_diff = np.mean(auc_values_a) - np.mean(auc_values_b)
    return {"mean_difference": original_diff, "ci_lower": ci_lower, "ci_upper": ci_upper, "bootstrap_differences": differences}
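
Example: synthetic per-fold AUC values for two models.

import numpy as np
from uqdd.metrics import bootstrap_auc_difference

rng = np.random.default_rng(0)
auc_a = rng.normal(0.85, 0.02, 10)  # synthetic per-fold ROC-AUCs
auc_b = rng.normal(0.82, 0.02, 10)
res = bootstrap_auc_difference(auc_a, auc_b, n_bootstrap=2000)
print(res["mean_difference"], res["ci_lower"], res["ci_upper"])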

uqdd.metrics.plot_critical_difference_diagram

plot_critical_difference_diagram(friedman_results, metric, save_dir=None, alpha=0.05)

Plot a simple critical difference diagram using mean ranks and CD value.

Parameters:

Name Type Description Default
friedman_results dict

Output dictionary from friedman_nemenyi_test.

required
metric str

Metric to plot.

required
save_dir str or None

Directory to save the plot. Default is None.

None
alpha float

Significance level used to compute CD. Default is 0.05.

0.05

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py (lines 1200–1271)
def plot_critical_difference_diagram(friedman_results, metric, save_dir=None, alpha=0.05):
    """
    Plot a simple critical difference diagram using mean ranks and CD value.

    Parameters
    ----------
    friedman_results : dict
        Output dictionary from friedman_nemenyi_test.
    metric : str
        Metric to plot.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    alpha : float, optional
        Significance level used to compute CD. Default is 0.05.

    Returns
    -------
    None
    """
    if metric not in friedman_results:
        print(f"Metric {metric} not found in Friedman results")
        return
    result = friedman_results[metric]
    if "error" in result:
        print(f"Error in Friedman test for {metric}: {result['error']}")
        return
    if not result["significant"]:
        print(f"Friedman test not significant for {metric}, skipping CD diagram")
        return
    mean_ranks = result["mean_ranks"]
    models = list(mean_ranks.keys())
    ranks = [mean_ranks[model] for model in models]
    sorted_indices = np.argsort(ranks)
    sorted_models = [models[i] for i in sorted_indices]
    sorted_ranks = [ranks[i] for i in sorted_indices]
    fig, ax = plt.subplots(figsize=(12, 6))
    y_pos = 0
    ax.scatter(sorted_ranks, [y_pos] * len(sorted_ranks), s=100, c="blue")
    for i, (model, rank) in enumerate(zip(sorted_models, sorted_ranks)):
        ax.annotate(model, (rank, y_pos), xytext=(0, 20), textcoords="offset points", ha="center", rotation=45)
    if "critical_difference" in result:
        cd = result["critical_difference"]
        groups = []
        for i, model_a in enumerate(sorted_models):
            group = [model_a]
            rank_a = sorted_ranks[i]
            for j, model_b in enumerate(sorted_models):
                if i != j:
                    rank_b = sorted_ranks[j]
                    if abs(rank_a - rank_b) <= cd:
                        if model_b not in [m for g in groups for m in g]:
                            group.append(model_b)
            if len(group) > 1:
                groups.append(group)
        colors = plt.cm.Set3(np.linspace(0, 1, len(groups)))
        for group, color in zip(groups, colors):
            if len(group) > 1:
                group_ranks = [sorted_ranks[sorted_models.index(m)] for m in group]
                min_rank, max_rank = min(group_ranks), max(group_ranks)
                ax.plot([min_rank, max_rank], [y_pos - 0.05, y_pos - 0.05], color=color, linewidth=3, alpha=0.7)
    ax.set_xlim(min(sorted_ranks) - 0.5, max(sorted_ranks) + 0.5)
    ax.set_ylim(-0.3, 0.5)
    ax.set_xlabel("Average Rank")
    ax.set_title(f"Critical Difference Diagram - {metric}")
    ax.grid(True, alpha=0.3)
    ax.set_yticks([])
    if save_dir:
        plot_name = f"critical_difference_{metric.replace(' ', '_')}"
        save_plot(fig, save_dir, plot_name)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
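
Example: a sketch that first builds Friedman results on synthetic data (as in the friedman_nemenyi_test example above) and then draws the diagram; if the Friedman test is not significant, the function prints a message and skips the plot. All names and values are placeholders.

import numpy as np
import pandas as pd
from uqdd.metrics import friedman_nemenyi_test, plot_critical_difference_diagram

rng = np.random.default_rng(0)
rows = [{"Model type": m, "Task": t, "Split": s, "RMSE": rng.normal(0.70 + d, 0.02)}
        for m, d in [("MCD", 0.05), ("Ensemble", 0.0), ("Evidential", 0.02), ("GP", 0.03)]
        for t in ["ki", "kd", "ic50"] for s in ["random", "scaffold"]]
friedman_results = friedman_nemenyi_test(pd.DataFrame(rows), metrics=["RMSE"])
plot_critical_difference_diagram(friedman_results, metric="RMSE")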

uqdd.metrics.analyze_significance

analyze_significance(df_raw, metrics, direction_dict, effect_dict, save_dir=None, model_order=None, activity=None)

End-to-end significance analysis and plotting across splits for multiple metrics.

Parameters:

Name Type Description Default
df_raw DataFrame

Raw results DataFrame.

required
metrics list of str

Metric names to analyze.

required
direction_dict dict

Mapping metric -> 'maximize'|'minimize'.

required
effect_dict dict

Mapping metric -> effect size threshold for visualization.

required
save_dir str or None

Directory to save plots and outputs. Default is None.

None
model_order list of str or None

Explicit ordering of models. Default derives from data.

None
activity str or None

Activity name for prefixes. Default is None.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py (lines 1274–1328)
def analyze_significance(df_raw, metrics, direction_dict, effect_dict, save_dir=None, model_order=None, activity=None):
    """
    End-to-end significance analysis and plotting across splits for multiple metrics.

    Parameters
    ----------
    df_raw : pd.DataFrame
        Raw results DataFrame.
    metrics : list of str
        Metric names to analyze.
    direction_dict : dict
        Mapping metric -> 'maximize'|'minimize'.
    effect_dict : dict
        Mapping metric -> effect size threshold for visualization.
    save_dir : str or None, optional
        Directory to save plots and outputs. Default is None.
    model_order : list of str or None, optional
        Explicit ordering of models. Default derives from data.
    activity : str or None, optional
        Activity name for prefixes. Default is None.

    Returns
    -------
    None
    """
    df = harmonize_columns(df_raw)
    for metric in metrics:
        df[metric] = pd.to_numeric(df[metric], errors="coerce")
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    for split in df["split"].unique():
        df_s = df[df["split"] == split].copy()
        print(f"\n=== Split: {split} ===")
        name_prefix = f"06_{activity}_{split}" if activity else f"{split}"
        make_normality_diagnostic(df_s, metrics, save_dir=save_dir, name_prefix=name_prefix)
        for metric in metrics:
            print(f"\n-- Metric: {metric}")
            wide = df_s.pivot(index="cv_cycle", columns="method", values=metric)
            resid = (wide.T - wide.mean(axis=1)).T
            vals = resid.values.flatten()
            vals = vals[~np.isnan(vals)]
            W, p_norm = shapiro(vals) if len(vals) >= 3 else (None, 0.0)
            if p_norm is None:
                print("Not enough data for Shapiro-Wilk test (need at least 3 non-NaN values), assuming non-normality")
            elif p_norm < 0.05:
                print(f"Shapiro-Wilk test for {metric} indicates non-normality (W={W:.3f}, p={p_norm:.3f})")
            else:
                print(f"Shapiro-Wilk test for {metric} indicates normality (W={W:.3f}, p={p_norm:.3f})")
        make_boxplots(df_s, metrics, save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)
        make_boxplots_parametric(df_s, metrics, save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)
        make_boxplots_nonparametric(df_s, metrics, save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)
        make_sign_plots_nonparametric(df_s, metrics, save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)
        make_critical_difference_diagrams(df_s, metrics, save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)
        make_mcs_plot_grid(df=df_s, stats_list=metrics, group_col="method", alpha=0.05, figsize=(30, 15), direction_dict=direction_dict, effect_dict=effect_dict, show_diff=True, sort_axes=True, save_dir=save_dir, name_prefix=name_prefix + "_diff", model_order=model_order)
        make_mcs_plot_grid(df=df_s, stats_list=metrics, group_col="method", alpha=0.05, figsize=(30, 15), direction_dict=direction_dict, effect_dict=effect_dict, show_diff=False, sort_axes=True, save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)
        make_ci_plot_grid(df_s, metrics, group_col="method", save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)

uqdd.metrics.comprehensive_statistical_analysis

comprehensive_statistical_analysis(df, metrics, models=None, tasks=None, splits=None, save_dir=None, alpha=0.05)

Run a comprehensive suite of statistical tests and export results.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metrics list of str

Metrics to analyze.

required
models list of str or None

Models to include. Default derives from data.

None
tasks list of str or None

Tasks to include. Default derives from data.

None
splits list of str or None

Splits to include. Default derives from data.

None
save_dir str or None

Directory to save tables and JSON outputs. Default is None.

None
alpha float

Significance level. Default is 0.05.

0.05

Returns:

Type Description
dict

Results dict including pairwise tests, Friedman/Nemenyi outputs, and optional AUC bootstrap comparisons.

Source code in uqdd/metrics/stats.py (lines 1331–1406)
def comprehensive_statistical_analysis(df, metrics, models=None, tasks=None, splits=None, save_dir=None, alpha=0.05):
    """
    Run a comprehensive suite of statistical tests and export results.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metrics : list of str
        Metrics to analyze.
    models : list of str or None, optional
        Models to include. Default derives from data.
    tasks : list of str or None, optional
        Tasks to include. Default derives from data.
    splits : list of str or None, optional
        Splits to include. Default derives from data.
    save_dir : str or None, optional
        Directory to save tables and JSON outputs. Default is None.
    alpha : float, optional
        Significance level. Default is 0.05.

    Returns
    -------
    dict
        Results dict including pairwise tests, Friedman/Nemenyi outputs, and optional AUC bootstrap comparisons.
    """
    print("Performing comprehensive statistical analysis...")
    results = {}
    print("1. Running pairwise Wilcoxon signed-rank tests...")
    pairwise_results = pairwise_model_comparison(df, metrics, models, tasks, splits, alpha)
    results["pairwise_tests"] = pairwise_results
    print("2. Running Friedman tests with Nemenyi post-hoc...")
    friedman_results = friedman_nemenyi_test(df, metrics, models, alpha)
    results["friedman_nemenyi"] = friedman_results
    auc_columns = [col for col in df.columns if "AUC" in col or "auc" in col]
    if auc_columns:
        print("3. Running bootstrap comparisons for AUC metrics...")
        auc_bootstrap_results = {}
        for auc_col in auc_columns:
            auc_bootstrap_results[auc_col] = {}
            available_models = df["Model type"].unique() if models is None else models
            for i, model_a in enumerate(available_models):
                for j, model_b in enumerate(available_models):
                    if i < j:
                        auc_a = df[df["Model type"] == model_a][auc_col].dropna().values
                        auc_b = df[df["Model type"] == model_b][auc_col].dropna().values
                        if len(auc_a) > 0 and len(auc_b) > 0:
                            bootstrap_result = bootstrap_auc_difference(auc_a, auc_b)
                            auc_bootstrap_results[auc_col][f"{model_a}_vs_{model_b}"] = bootstrap_result
        results["auc_bootstrap"] = auc_bootstrap_results
    if save_dir:
        os.makedirs(save_dir, exist_ok=True)
        if not pairwise_results.empty:
            pairwise_results.to_csv(os.path.join(save_dir, "pairwise_statistical_tests.csv"), index=False)
        import json
        with open(os.path.join(save_dir, "friedman_nemenyi_results.json"), "w") as f:
            json_compatible_results = {}
            for metric, result in friedman_results.items():
                json_compatible_results[metric] = {}
                for key, value in result.items():
                    if isinstance(value, (np.ndarray, np.generic)):
                        json_compatible_results[metric][key] = value.tolist()
                    elif isinstance(value, dict):
                        json_compatible_results[metric][key] = {str(k): (float(v) if isinstance(v, (np.ndarray, np.generic)) else v) for k, v in value.items()}
                    else:
                        json_compatible_results[metric][key] = (float(value) if isinstance(value, (np.ndarray, np.generic)) else value)
            json.dump(json_compatible_results, f, indent=2)
        if auc_columns:
            with open(os.path.join(save_dir, "auc_bootstrap_results.json"), "w") as f:
                json_compatible_auc = {}
                for auc_col, comparisons in results["auc_bootstrap"].items():
                    json_compatible_auc[auc_col] = {}
                    for comparison, result in comparisons.items():
                        json_compatible_auc[auc_col][comparison] = {k: v.tolist() if isinstance(v, np.ndarray) else v for k, v in result.items()}
                json.dump(json_compatible_auc, f, indent=2)
    return results
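
Example: a synthetic end-to-end run, followed by generate_statistical_report, which (per its documented return value) turns the results dict into report text. The model, task, and split names are placeholders; pass save_dir to also write the CSV/JSON outputs.

import numpy as np
import pandas as pd
from uqdd.metrics import comprehensive_statistical_analysis, generate_statistical_report

rng = np.random.default_rng(0)
rows = [{"Model type": m, "Task": t, "Split": s,
         "RMSE": rng.normal(0.70 + d, 0.02), "NLL": rng.normal(1.0 + 2 * d, 0.05)}
        for m, d in [("MCD", 0.05), ("Ensemble", 0.0), ("Evidential", 0.02)]
        for t in ["ki", "kd", "ic50"] for s in ["random", "scaffold"] for _ in range(8)]
df = pd.DataFrame(rows)
results = comprehensive_statistical_analysis(df, metrics=["RMSE", "NLL"])
report = generate_statistical_report(results)
print(report[:500])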

uqdd.metrics.generate_statistical_report

generate_statistical_report(results, save_dir=None, df_raw=None, metrics=None, direction_dict=None, effect_dict=None)

Generate a human-readable text report from comprehensive statistical results and optionally run plots.

Parameters:

Name Type Description Default
results dict

Output of comprehensive_statistical_analysis.

required
save_dir str or None

Directory to save the report text file. Default is None.

None
df_raw DataFrame or None

Raw DataFrame to run plotting-based significance analysis. Default is None.

None
metrics list of str or None

Metrics to plot (when df_raw provided).

None
direction_dict dict or None

Direction mapping for metrics (required when df_raw provided).

None
effect_dict dict or None

Effect threshold mapping (required when df_raw provided).

None

Returns:

Type Description
str

Report text.

Source code in uqdd/metrics/stats.py (lines 1409–1505)
def generate_statistical_report(results, save_dir=None, df_raw=None, metrics=None, direction_dict=None, effect_dict=None):
    """
    Generate a human-readable text report from comprehensive statistical results and optionally run plots.

    Parameters
    ----------
    results : dict
        Output of comprehensive_statistical_analysis.
    save_dir : str or None, optional
        Directory to save the report text file. Default is None.
    df_raw : pd.DataFrame or None, optional
        Raw DataFrame to run plotting-based significance analysis. Default is None.
    metrics : list of str or None, optional
        Metrics to plot (when df_raw provided).
    direction_dict : dict or None, optional
        Direction mapping for metrics (required when df_raw provided).
    effect_dict : dict or None, optional
        Effect threshold mapping (required when df_raw provided).

    Returns
    -------
    str
        Report text.
    """
    report = []
    report.append("=" * 80)
    report.append("COMPREHENSIVE STATISTICAL ANALYSIS REPORT")
    report.append("=" * 80)
    report.append("")
    if "pairwise_tests" in results and not results["pairwise_tests"].empty:
        pairwise_df = results["pairwise_tests"]
        report.append("1. PAIRWISE MODEL COMPARISONS (Wilcoxon Signed-Rank Test)")
        report.append("-" * 60)
        significant = pairwise_df[pairwise_df["significant_after_correction"] == True]
        report.append(f"Total pairwise comparisons performed: {len(pairwise_df)}")
        report.append(f"Significant differences (after Holm-Bonferroni correction): {len(significant)}")
        report.append("")
        if len(significant) > 0:
            report.append("Significant differences found:")
            for _, row in significant.iterrows():
                effect_size = row["effect_size_interpretation"]
                report.append(f"  • {row['model_a']} vs {row['model_b']} ({row['metric']}, {row['split']}):")
                report.append(f"    - p-value: {row['p_value']:.4f} (corrected: {row['corrected_p_value']:.4f})")
                report.append(f"    - Cliff's Δ: {row['cliffs_delta']:.3f} ({effect_size} effect)")
                report.append(f"    - Median difference: {row['median_difference']:.4f} [{row['ci_lower']:.4f}, {row['ci_upper']:.4f}]")
                report.append(f"    - {row['practical_significance']}")
                report.append("")
        else:
            report.append("No significant differences found after multiple comparison correction.")
            report.append("")
    if "friedman_nemenyi" in results:
        friedman_results = results["friedman_nemenyi"]
        report.append("2. MULTIPLE MODEL COMPARISONS (Friedman + Nemenyi Tests)")
        report.append("-" * 60)
        for metric, result in friedman_results.items():
            if "error" in result:
                report.append(f"{metric}: {result['error']}")
                continue
            report.append(f"Metric: {metric}")
            report.append(f"  Friedman test p-value: {result['friedman_p_value']:.4f}")
            if result["significant"]:
                report.append("  Result: Significant difference between models detected")
                mean_ranks = result["mean_ranks"]
                sorted_ranks = sorted(mean_ranks.items(), key=lambda x: x[1])
                report.append("  Model rankings (lower rank = better performance):")
                for i, (model, rank) in enumerate(sorted_ranks, 1):
                    report.append(f"    {i}. {model}: {rank:.2f}")
                if "critical_difference" in result:
                    report.append(f"  Critical difference: {result['critical_difference']:.3f}")
            else:
                report.append("  Result: No significant difference between models")
            report.append("")
    if "auc_bootstrap" in results:
        auc_results = results["auc_bootstrap"]
        report.append("3. AUC BOOTSTRAP COMPARISONS")
        report.append("-" * 60)
        for auc_col, comparisons in auc_results.items():
            report.append(f"AUC Metric: {auc_col}")
            for comparison, result in comparisons.items():
                model_a, model_b = comparison.split("_vs_")
                mean_diff = result["mean_difference"]
                ci_lower = result["ci_lower"]
                ci_upper = result["ci_upper"]
                significance = "difference is small (CI includes 0)" if (ci_lower <= 0 <= ci_upper) else "difference may be meaningful"
                report.append(f"  {model_a} vs {model_b}:")
                report.append(f"    Mean difference: {mean_diff:.4f} [{ci_lower:.4f}, {ci_upper:.4f}]")
                report.append(f"    {significance}")
            report.append("")
    report_text = "\n".join(report)
    if save_dir:
        os.makedirs(save_dir, exist_ok=True)
        with open(os.path.join(save_dir, "statistical_analysis_report.txt"), "w") as f:
            f.write(report_text)
    print(report_text)
    if df_raw is not None and metrics is not None and direction_dict is not None and effect_dict is not None:
        analyze_significance(df_raw, metrics, direction_dict, effect_dict, save_dir=save_dir)
    return report_text
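
Example

A minimal, runnable sketch. In practice results is the dict returned by comprehensive_statistical_analysis; here a stripped-down stand-in with a single Friedman entry is used, and the output directory name is illustrative.

from uqdd.metrics import generate_statistical_report

# Stand-in for the output of comprehensive_statistical_analysis; real runs pass
# that dict directly (keys such as "pairwise_tests", "friedman_nemenyi",
# "auc_bootstrap").
results = {
    "friedman_nemenyi": {
        "RMSE": {"error": "not enough models for the Friedman test"},
    },
}

# Prints the report and writes statistical_analysis_report.txt under save_dir.
report_text = generate_statistical_report(results, save_dir="stats_report")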

uqdd.metrics.nll_evidentials

nll_evidentials(evidential_model, test_dataloader, model_type: str = 'evidential', num_mc_samples: int = 100, device=DEVICE)

Compute negative log-likelihood (NLL) for evidential-style models.

Parameters:

Name Type Description Default
evidential_model Module

Trained model instance.

required
test_dataloader DataLoader

DataLoader providing test set batches.

required
model_type {"evidential", "eoe", "emc"}

Model family determining the NLL backend. Default is "evidential".

"evidential"
num_mc_samples int

Number of MC samples for EMC models. Default is 100.

100
device device

Device to run evaluation on. Default uses DEVICE.

DEVICE

Returns:

Type Description
float or None

Scalar NLL if supported by the model type; None otherwise.

Source code in uqdd/metrics/reassessment.py
def nll_evidentials(
    evidential_model,
    test_dataloader,
    model_type: str = "evidential",
    num_mc_samples: int = 100,
    device=DEVICE,
):
    """
    Compute negative log-likelihood (NLL) for evidential-style models.

    Parameters
    ----------
    evidential_model : torch.nn.Module
        Trained model instance.
    test_dataloader : torch.utils.data.DataLoader
        DataLoader providing test set batches.
    model_type : {"evidential", "eoe", "emc"}, optional
        Model family determining the NLL backend. Default is "evidential".
    num_mc_samples : int, optional
        Number of MC samples for EMC models. Default is 100.
    device : torch.device, optional
        Device to run evaluation on. Default uses `DEVICE`.

    Returns
    -------
    float or None
        Scalar NLL if supported by the model type; None otherwise.
    """
    if model_type in ["evidential", "eoe"]:
        return ev_nll(evidential_model, test_dataloader, device=device)
    elif model_type == "emc":
        return emc_nll(evidential_model, test_dataloader, num_mc_samples=num_mc_samples, device=device)
    else:
        return None
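
Example

A usage sketch wrapped in a helper so the snippet stays self-contained; the model and dataloader arguments are placeholders produced by earlier training code.

from uqdd.metrics import nll_evidentials

def test_nll(model, test_loader, model_type="evidential"):
    """Scalar test-set NLL, or None for model families without an NLL backend."""
    # "evidential" and "eoe" dispatch to ev_nll; "emc" dispatches to emc_nll
    # using num_mc_samples Monte Carlo samples.
    return nll_evidentials(model, test_loader, model_type=model_type, num_mc_samples=100)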

uqdd.metrics.convert_to_list

convert_to_list(val)

Parse a string representation of a Python list to a list; pass through non-strings.

Parameters:

Name Type Description Default
val str or any

Input value, possibly a string encoding of a list.

required

Returns:

Type Description
list

Parsed list if val is a valid string list, empty list on parse failure.

any

Original value if not a string.

Notes
  • Uses ast.literal_eval for safe evaluation.
  • Prints a warning and returns [] when parsing fails.
Source code in uqdd/metrics/reassessment.py
def convert_to_list(val):
    """
    Parse a string representation of a Python list to a list; pass through non-strings.

    Parameters
    ----------
    val : str or any
        Input value, possibly a string encoding of a list.

    Returns
    -------
    list
        Parsed list if `val` is a valid string list, empty list on parse failure.
    any
        Original value if not a string.

    Notes
    -----
    - Uses `ast.literal_eval` for safe evaluation.
    - Prints a warning and returns [] when parsing fails.
    """
    if isinstance(val, str):
        try:
            parsed_val = ast.literal_eval(val)
            if isinstance(parsed_val, list):
                return parsed_val
            else:
                return []
        except (SyntaxError, ValueError):
            print(f"Warning: Unable to parse value {val}, returning empty list.")
            return []
    return val
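
Example

The three branches in action; the layer strings mirror how list-valued hyperparameters are stored in a runs CSV.

from uqdd.metrics import convert_to_list

convert_to_list("[256, 128, 64]")   # valid string list -> [256, 128, 64]
convert_to_list("not a list")       # parse failure -> warning printed, returns []
convert_to_list([32, 16])           # non-string input is passed through unchanged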

uqdd.metrics.preprocess_runs

preprocess_runs(runs_path: str, models_dir: str = MODELS_DIR, data_name: str = 'papyrus', activity_type: str = 'xc50', descriptor_protein: str = 'ankh-large', descriptor_chemical: str = 'ecfp2048', data_specific_path: str = 'papyrus/xc50/all', prot_input_dim: int = 1536, chem_input_dim: int = 2048) -> pd.DataFrame

Read a runs CSV and enrich with resolved model paths and descriptor metadata.

Parameters:

Name Type Description Default
runs_path str

Path to the CSV file containing run metadata.

required
models_dir str

Directory containing trained model .pt files. Default uses MODELS_DIR.

MODELS_DIR
data_name str

Dataset identifier. Default is "papyrus".

'papyrus'
activity_type str

Activity type (e.g., "xc50", "kc"). Default is "xc50".

'xc50'
descriptor_protein str

Protein descriptor type. Default is "ankh-large".

'ankh-large'
descriptor_chemical str

Chemical descriptor type. Default is "ecfp2048".

'ecfp2048'
data_specific_path str

Subpath encoding dataset context for figures/exports. Default is "papyrus/xc50/all".

'papyrus/xc50/all'
prot_input_dim int

Protein input dimensionality. Default is 1536.

1536
chem_input_dim int

Chemical input dimensionality. Default is 2048.

2048

Returns:

Type Description
DataFrame

Preprocessed runs DataFrame with columns like 'model_name', 'model_path', and descriptor fields.

Notes
  • Resolves model_name to actual .pt files via glob and sets 'model_path'.
  • Adds multi-task flag 'MT' from 'n_targets' > 1.
  • Converts layer columns from strings to lists using convert_to_list.
Source code in uqdd/metrics/reassessment.py
def preprocess_runs(
    runs_path: str,
    models_dir: str = MODELS_DIR,
    data_name: str = "papyrus",
    activity_type: str = "xc50",
    descriptor_protein: str = "ankh-large",
    descriptor_chemical: str = "ecfp2048",
    data_specific_path: str = "papyrus/xc50/all",
    prot_input_dim: int = 1536,
    chem_input_dim: int = 2048,
) -> pd.DataFrame:
    """
    Read a runs CSV and enrich with resolved model paths and descriptor metadata.

    Parameters
    ----------
    runs_path : str
        Path to the CSV file containing run metadata.
    models_dir : str, optional
        Directory containing trained model .pt files. Default uses `MODELS_DIR`.
    data_name : str, optional
        Dataset identifier. Default is "papyrus".
    activity_type : str, optional
        Activity type (e.g., "xc50", "kc"). Default is "xc50".
    descriptor_protein : str, optional
        Protein descriptor type. Default is "ankh-large".
    descriptor_chemical : str, optional
        Chemical descriptor type. Default is "ecfp2048".
    data_specific_path : str, optional
        Subpath encoding dataset context for figures/exports. Default is "papyrus/xc50/all".
    prot_input_dim : int, optional
        Protein input dimensionality. Default is 1536.
    chem_input_dim : int, optional
        Chemical input dimensionality. Default is 2048.

    Returns
    -------
    pd.DataFrame
        Preprocessed runs DataFrame with columns like 'model_name', 'model_path', and descriptor fields.

    Notes
    -----
    - Resolves `model_name` to actual .pt files via glob and sets 'model_path'.
    - Adds multi-task flag 'MT' from 'n_targets' > 1.
    - Converts layer columns from strings to lists using `convert_to_list`.
    """
    runs_df = pd.read_csv(
        runs_path,
        converters={
            "chem_layers": convert_to_list,
            "prot_layers": convert_to_list,
            "regressor_layers": convert_to_list,
        },
    )
    runs_df.rename(columns={"Name": "run_name"}, inplace=True)
    i = 1
    for index, row in runs_df.iterrows():
        model_name = row["model_name"] if not pd.isna(row["model_name"]) else row["run_name"]
        model_file_pattern = os.path.join(models_dir, f"*{model_name}.pt")
        model_files = glob.glob(model_file_pattern)
        if model_files:
            model_file_path = model_files[0]
            model_name = os.path.basename(model_file_path).replace(".pt", "")
            runs_df.at[index, "model_name"] = model_name
            runs_df.at[index, "model_path"] = model_file_path
        else:
            print(f"{i} Model file(s) not found for {model_name} \n with pattern {model_file_pattern}")
            runs_df.at[index, "model_path"] = ""
            i += 1
    runs_df["data_name"] = data_name
    runs_df["activity_type"] = activity_type
    runs_df["descriptor_protein"] = descriptor_protein
    runs_df["descriptor_chemical"] = descriptor_chemical
    runs_df["chem_input_dim"] = chem_input_dim
    runs_df["prot_input_dim"] = prot_input_dim
    runs_df["data_specific_path"] = data_specific_path
    runs_df["MT"] = runs_df["n_targets"].apply(lambda x: True if x > 1 else False)
    return runs_df
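
Example

A call sketch; the CSV path is hypothetical and must contain at least the 'Name', 'model_name', and 'n_targets' columns (plus the layer-list columns), while matching *<model_name>.pt checkpoints are expected under models_dir.

from uqdd.metrics import preprocess_runs

# Hypothetical W&B runs export; checkpoints matching "*<model_name>.pt" are
# resolved under MODELS_DIR (or a custom models_dir).
runs_df = preprocess_runs(
    "exports/wandb_runs.csv",
    descriptor_protein="ankh-large",
    descriptor_chemical="ecfp2048",
)
print(runs_df[["run_name", "model_name", "model_path", "MT"]].head())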

uqdd.metrics.get_model_class

get_model_class(model_type: str)

Map a model type name to the corresponding class.

Parameters:

Name Type Description Default
model_type str

Model type identifier (e.g., "pnn", "ensemble", "evidential", "eoe", "emc", "mcdropout").

required

Returns:

Type Description
type

Model class matching the type.

Raises:

Type Description
ValueError

If the model_type is not recognized.

Source code in uqdd/metrics/reassessment.py
def get_model_class(model_type: str):
    """
    Map a model type name to the corresponding class.

    Parameters
    ----------
    model_type : str
        Model type identifier (e.g., "pnn", "ensemble", "evidential", "eoe", "emc", "mcdropout").

    Returns
    -------
    type
        Model class matching the type.

    Raises
    ------
    ValueError
        If the `model_type` is not recognized.
    """
    if model_type.lower() in ["pnn", "mcdropout"]:
        return PNN
    elif model_type.lower() == "ensemble":
        return EnsembleDNN
    elif model_type.lower() in ["evidential", "emc"]:
        return EvidentialDNN
    elif model_type.lower() == "eoe":
        return EoEDNN
    else:
        raise ValueError(f"Model type {model_type} not recognized")

uqdd.metrics.get_predict_fn

get_predict_fn(model_type: str, num_mc_samples: int = 100)

Get the appropriate predict function and kwargs for a given model type.

Parameters:

Name Type Description Default
model_type str

Model type identifier.

required
num_mc_samples int

Number of MC samples for MC Dropout or EMC models. Default is 100.

100

Returns:

Type Description
(callable, dict)

Tuple of (predict_function, keyword_arguments).

Raises:

Type Description
ValueError

If the model_type is not recognized.

Source code in uqdd/metrics/reassessment.py
def get_predict_fn(model_type: str, num_mc_samples: int = 100):
    """
    Get the appropriate predict function and kwargs for a given model type.

    Parameters
    ----------
    model_type : str
        Model type identifier.
    num_mc_samples : int, optional
        Number of MC samples for MC Dropout or EMC models. Default is 100.

    Returns
    -------
    (callable, dict)
        Tuple of (predict_function, keyword_arguments).

    Raises
    ------
    ValueError
        If the `model_type` is not recognized.
    """
    if model_type.lower() == "mcdropout":
        return mc_predict, {"num_mc_samples": num_mc_samples}
    elif model_type.lower() in ["ensemble", "pnn"]:
        return predict, {}
    elif model_type.lower() in ["evidential", "eoe"]:
        return ev_predict, {}
    elif model_type.lower() == "emc":
        return emc_predict, {"num_mc_samples": num_mc_samples}
    else:
        raise ValueError(f"Model type {model_type} not recognized")

uqdd.metrics.get_preds

get_preds(model, dataloaders, model_type: str, subset: str = 'test', num_mc_samples: int = 100)

Run inference and unpack predictions for the requested subset.

Parameters:

Name Type Description Default
model Module

Trained model instance.

required
dataloaders dict

Dictionary of DataLoaders keyed by subset (e.g., 'train', 'val', 'test').

required
model_type str

Model type determining the predict function and outputs.

required
subset str

Subset key to use from dataloaders. Default is "test".

'test'
num_mc_samples int

Number of MC samples for stochastic predictors. Default is 100.

100

Returns:

Type Description
tuple

(preds, labels, alea_vars, epi_vars) where epi_vars may be None for non-evidential models.

Source code in uqdd/metrics/reassessment.py
def get_preds(
    model,
    dataloaders,
    model_type: str,
    subset: str = "test",
    num_mc_samples: int = 100,
):
    """
    Run inference and unpack predictions for the requested subset.

    Parameters
    ----------
    model : torch.nn.Module
        Trained model instance.
    dataloaders : dict
        Dictionary of DataLoaders keyed by subset (e.g., 'train', 'val', 'test').
    model_type : str
        Model type determining the predict function and outputs.
    subset : str, optional
        Subset key to use from `dataloaders`. Default is "test".
    num_mc_samples : int, optional
        Number of MC samples for stochastic predictors. Default is 100.

    Returns
    -------
    tuple
        (preds, labels, alea_vars, epi_vars) where `epi_vars` may be None for non-evidential models.
    """
    predict_fn, predict_kwargs = get_predict_fn(model_type, num_mc_samples=num_mc_samples)
    preds_res = predict_fn(model, dataloaders[subset], device=DEVICE, **predict_kwargs)
    if model_type in ["evidential", "eoe", "emc"]:
        preds, labels, alea_vars, epi_vars = preds_res
    else:
        preds, labels, alea_vars = preds_res
        epi_vars = None
    return preds, labels, alea_vars, epi_vars
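
Example

A small wrapper illustrating the unpacking contract; the model and dataloaders are placeholders supplied by earlier training or loading code.

from uqdd.metrics import get_preds

def collect_test_outputs(model, dataloaders, model_type):
    """Sketch: predictions plus uncertainty components for the test split."""
    preds, labels, alea_vars, epi_vars = get_preds(
        model, dataloaders, model_type, subset="test", num_mc_samples=100
    )
    # epi_vars is None for "pnn", "ensemble", and "mcdropout"; the evidential-style
    # families ("evidential", "eoe", "emc") populate it.
    return preds, labels, alea_vars, epi_vars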

uqdd.metrics.pkl_preds_export

pkl_preds_export(preds, labels, alea_vars, epi_vars, outpath: str, model_type: str, logger=None)

Export predictions and uncertainties to a standardized pickle and return the DataFrame.

Parameters:

Name Type Description Default
preds ndarray or Tensor

Model predictions.

required
labels ndarray or Tensor

True labels.

required
alea_vars ndarray or Tensor

Aleatoric uncertainty components.

required
epi_vars ndarray or Tensor or None

Epistemic uncertainty components, or None for non-evidential models.

required
outpath str

Output directory to write 'preds.pkl'.

required
model_type str

Model type used to guide process_preds behavior.

required
logger Logger or None

Logger for messages. Default is None.

None

Returns:

Type Description
DataFrame

DataFrame with columns [y_true, y_pred, y_err, y_alea, y_eps].

Source code in uqdd/metrics/reassessment.py
def pkl_preds_export(
    preds,
    labels,
    alea_vars,
    epi_vars,
    outpath: str,
    model_type: str,
    logger=None,
):
    """
    Export predictions and uncertainties to a standardized pickle and return the DataFrame.

    Parameters
    ----------
    preds : numpy.ndarray or torch.Tensor
        Model predictions.
    labels : numpy.ndarray or torch.Tensor
        True labels.
    alea_vars : numpy.ndarray or torch.Tensor
        Aleatoric uncertainty components.
    epi_vars : numpy.ndarray or torch.Tensor or None
        Epistemic uncertainty components, or None for non-evidential models.
    outpath : str
        Output directory to write 'preds.pkl'.
    model_type : str
        Model type used to guide `process_preds` behavior.
    logger : logging.Logger or None, optional
        Logger for messages. Default is None.

    Returns
    -------
    pd.DataFrame
        DataFrame with columns [y_true, y_pred, y_err, y_alea, y_eps].
    """
    y_true, y_pred, y_err, y_alea, y_eps = process_preds(preds, labels, alea_vars, epi_vars, None, model_type)
    df = create_df_preds(y_true=y_true, y_pred=y_pred, y_err=y_err, y_alea=y_alea, y_eps=y_eps, export=False, logger=logger)
    df.to_pickle(os.path.join(outpath, "preds.pkl"))
    return df

uqdd.metrics.csv_nll_post_processing

csv_nll_post_processing(csv_path: str) -> None

Normalize NLL values in a CSV by taking the first value per model name.

Parameters:

Name Type Description Default
csv_path str

Path to the CSV file containing a 'model name' and 'NLL' column.

required

Returns:

Type Description
None
Source code in uqdd/metrics/reassessment.py
def csv_nll_post_processing(csv_path: str) -> None:
    """
    Normalize NLL values in a CSV by taking the first value per model name.

    Parameters
    ----------
    csv_path : str
        Path to the CSV file containing a 'model name' and 'NLL' column.

    Returns
    -------
    None
    """
    df = pd.read_csv(csv_path)
    df["NLL"] = df.groupby("model name")["NLL"].transform("first")
    df.to_csv(csv_path, index=False)
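
Example

A runnable round trip on a toy CSV (the file name is illustrative): duplicate NLL rows for the same model collapse to the first recorded value.

import pandas as pd

from uqdd.metrics import csv_nll_post_processing

pd.DataFrame(
    {"model name": ["pnn_a", "pnn_a", "ens_b"], "NLL": [1.20, 0.95, 1.05]}
).to_csv("metrics.csv", index=False)

csv_nll_post_processing("metrics.csv")
print(pd.read_csv("metrics.csv"))   # both pnn_a rows now carry 1.20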

uqdd.metrics.reassess_metrics

reassess_metrics(runs_df: DataFrame, figs_out_path: str, csv_out_path: str, project_out_name: str, logger) -> None

Reassess metrics for each run: reload model, predict, compute NLL, evaluate, and recalibrate.

Parameters:

Name Type Description Default
runs_df DataFrame

Preprocessed runs DataFrame with resolved 'model_path' and configuration fields.

required
figs_out_path str

Directory where per-model figures and prediction pickles are saved.

required
csv_out_path str

Path to a CSV for logging metrics (passed to evaluate_predictions).

required
project_out_name str

Name used for grouping results in downstream logging.

required
logger Logger

Logger instance used throughout evaluation and recalibration.

required

Returns:

Type Description
None
Notes
  • Skips models already reassessed when a figure directory exists.
  • Uses validation split for isotonic recalibration and logs final metrics.
Source code in uqdd/metrics/reassessment.py
def reassess_metrics(
    runs_df: pd.DataFrame,
    figs_out_path: str,
    csv_out_path: str,
    project_out_name: str,
    logger,
) -> None:
    """
    Reassess metrics for each run: reload model, predict, compute NLL, evaluate, and recalibrate.

    Parameters
    ----------
    runs_df : pd.DataFrame
        Preprocessed runs DataFrame with resolved 'model_path' and configuration fields.
    figs_out_path : str
        Directory where per-model figures and prediction pickles are saved.
    csv_out_path : str
        Path to a CSV for logging metrics (passed to `evaluate_predictions`).
    project_out_name : str
        Name used for grouping results in downstream logging.
    logger : logging.Logger
        Logger instance used throughout evaluation and recalibration.

    Returns
    -------
    None

    Notes
    -----
    - Skips models already reassessed when a figure directory exists.
    - Uses validation split for isotonic recalibration and logs final metrics.
    """
    runs_df = runs_df.sample(frac=1).reset_index(drop=True)
    for index, row in runs_df.iterrows():
        model_path = row["model_path"]
        model_name = row["model_name"]
        run_name = row["run_name"]
        rowkwargs = row.to_dict()
        model_type = rowkwargs.pop("model_type")
        activity_type = rowkwargs.pop("activity_type")
        if model_path:
            model_fig_out_path = os.path.join(figs_out_path, model_name)
            if os.path.exists(model_fig_out_path):
                print(f"Model {model_name} already reassessed")
                continue
            os.makedirs(model_fig_out_path, exist_ok=True)
            config = get_model_config(model_type=model_type, activity_type=activity_type, **rowkwargs)
            num_mc_samples = config.get("num_mc_samples", 100)
            model_class = get_model_class(model_type)
            prefix = "models." if model_type == "eoe" else ""
            model = load_model(model_class, model_path, prefix_to_state_keys=prefix, config=config).to(DEVICE)
            dataloaders = get_dataloader(config, device=DEVICE, logger=logger)
            preds, labels, alea_vars, epi_vars = get_preds(model, dataloaders, model_type, subset="test", num_mc_samples=num_mc_samples)
            nll = nll_evidentials(model, dataloaders["test"], model_type=model_type, num_mc_samples=num_mc_samples, device=DEVICE)
            df = pkl_preds_export(preds, labels, alea_vars, epi_vars, model_fig_out_path, model_type, logger=logger)
            metrics, plots, uct_logger = evaluate_predictions(
                config,
                preds,
                labels,
                alea_vars,
                model_type,
                logger,
                epi_vars=epi_vars,
                wandb_push=False,
                run_name=config["run_name"],
                project_name=project_out_name,
                figpath=model_fig_out_path,
                export_preds=False,
                verbose=False,
                csv_path=csv_out_path,
                nll=nll,
            )
            preds_val, labels_val, alea_vars_val, epi_vars_val = get_preds(model, dataloaders, model_type, subset="val", num_mc_samples=num_mc_samples)
            nll = nll_evidentials(model, dataloaders["val"], model_type=model_type, num_mc_samples=num_mc_samples, device=DEVICE)
            iso_recal_model = recalibrate_model(
                preds_val,
                labels_val,
                alea_vars_val,
                preds,
                labels,
                alea_vars,
                config=config,
                epi_val=epi_vars_val,
                epi_test=epi_vars,
                uct_logger=uct_logger,
                figpath=model_fig_out_path,
                nll=nll,
            )
            uct_logger.csv_log()
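
Example

An end-to-end sketch combining preprocess_runs and reassess_metrics; every path and the project name are hypothetical, and trained .pt checkpoints must already exist for the runs listed in the CSV.

import logging

from uqdd.metrics import preprocess_runs, reassess_metrics

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("reassessment")

runs_df = preprocess_runs("exports/wandb_runs.csv")      # hypothetical W&B export
reassess_metrics(
    runs_df,
    figs_out_path="figures/reassessment",                # per-model figures and preds.pkl
    csv_out_path="figures/reassessment/metrics.csv",     # metrics CSV log
    project_out_name="papyrus_xc50_reassessment",        # hypothetical grouping name
    logger=logger,
)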

uqdd.metrics.analysis

Analysis and plotting utilities for model metrics.

This module provides functions to aggregate experiment results, compute summary statistics, and visualize metrics via pairplots, line plots, histograms, bar plots, correlation matrices, calibration curves, and RMSE rejection curves.

uqdd.metrics.analysis.aggregate_results_csv

aggregate_results_csv(df: DataFrame, group_cols: List[str], numeric_cols: List[str], string_cols: List[str], order_by: Optional[Union[str, List[str]]] = None, output_file_path: Optional[str] = None) -> pd.DataFrame

Aggregate metrics by groups and export a compact CSV summary.

Parameters:

Name Type Description Default
df DataFrame

Input results DataFrame.

required
group_cols list of str

Column names to group by.

required
numeric_cols list of str

Numeric metric columns to aggregate with mean and std.

required
string_cols list of str

String columns to aggregate as lists.

required
order_by str or list of str or None

Column(s) to sort the final aggregated DataFrame by. Default is None.

None
output_file_path str or None

Path to write the aggregated CSV. If None, no file is written.

None

Returns:

Type Description
DataFrame

Aggregated DataFrame with combined mean(std) strings plus string/list aggregates.

Notes
  • A helper column project_model is constructed and included in the aggregates.
  • When output_file_path is provided, the function ensures the directory exists.
Source code in uqdd/metrics/analysis.py
def aggregate_results_csv(
    df: pd.DataFrame,
    group_cols: List[str],
    numeric_cols: List[str],
    string_cols: List[str],
    order_by: Optional[Union[str, List[str]]] = None,
    output_file_path: Optional[str] = None,
) -> pd.DataFrame:
    """
    Aggregate metrics by groups and export a compact CSV summary.

    Parameters
    ----------
    df : pd.DataFrame
        Input results DataFrame.
    group_cols : list of str
        Column names to group by.
    numeric_cols : list of str
        Numeric metric columns to aggregate with mean and std.
    string_cols : list of str
        String columns to aggregate as lists.
    order_by : str or list of str or None, optional
        Column(s) to sort the final aggregated DataFrame by. Default is None.
    output_file_path : str or None, optional
        Path to write the aggregated CSV. If None, no file is written.

    Returns
    -------
    pd.DataFrame
        Aggregated DataFrame with combined mean(std) strings plus string/list aggregates.

    Notes
    -----
    - A helper column `project_model` is constructed and included in the aggregates.
    - When `output_file_path` is provided, the function ensures the directory exists.
    """
    grouped = df.groupby(group_cols)
    aggregated = grouped[numeric_cols].agg(["mean", "std"])
    for col in numeric_cols:
        aggregated[(col, "combined")] = (
            aggregated[(col, "mean")].round(3).astype(str)
            + "("
            + aggregated[(col, "std")].round(3).astype(str)
            + ")"
        )
    aggregated = aggregated[[col for col in aggregated.columns if col[1] == "combined"]]
    aggregated.columns = [col[0] for col in aggregated.columns]

    string_aggregated = grouped[string_cols].agg(lambda x: list(x))

    df["project_model"] = (
        "papyrus"
        + "/"
        + df["Activity"]
        + "/"
        + "all"
        + "/"
        + df["wandb project"]
        + "/"
        + df["model name"]
        + "/"
    )
    project_model_aggregated = grouped["project_model"].agg(lambda x: list(x))

    final_aggregated = pd.concat(
        [aggregated, string_aggregated, project_model_aggregated], axis=1
    ).reset_index()

    if order_by:
        final_aggregated = final_aggregated.sort_values(by=order_by)

    if output_file_path:
        os.makedirs(os.path.dirname(output_file_path), exist_ok=True)
        final_aggregated.to_csv(output_file_path, index=False)

    return final_aggregated
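
Example

Runnable on a toy results table; the 'Activity', 'wandb project', and 'model name' columns must be present because the helper column project_model is built from them.

import pandas as pd

from uqdd.metrics import aggregate_results_csv

df = pd.DataFrame({
    "Model type": ["PNN", "PNN", "Ensemble", "Ensemble"],
    "Split": ["random"] * 4,
    "Activity": ["xc50"] * 4,
    "wandb project": ["demo"] * 4,
    "model name": ["m1", "m2", "m3", "m4"],
    "RMSE": [0.71, 0.69, 0.66, 0.65],
    "NLL": [1.10, 1.05, 0.98, 0.97],
})

summary = aggregate_results_csv(
    df,
    group_cols=["Model type", "Split"],
    numeric_cols=["RMSE", "NLL"],      # reported as "mean(std)" strings
    string_cols=["model name"],        # collected into per-group lists
    order_by="Model type",
    output_file_path=None,             # supply a path to also write the CSV
)
print(summary)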

uqdd.metrics.analysis.save_plot

save_plot(fig: Figure, save_dir: Optional[str], plot_name: str, tighten: bool = True, show_legend: bool = False) -> None

Save a matplotlib figure to PNG, SVG, and PDF with optional tight layout.

Parameters:

Name Type Description Default
fig Figure

Figure to save.

required
save_dir str or None

Directory to save the figure files. If None, no files are written.

required
plot_name str

Base filename (without extension).

required
tighten bool

If True, apply tight_layout and bbox_inches="tight". Default is True.

True
show_legend bool

If False, remove legend before saving. Default is False.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def save_plot(
    fig: plt.Figure,
    save_dir: Optional[str],
    plot_name: str,
    tighten: bool = True,
    show_legend: bool = False,
) -> None:
    """
    Save a matplotlib figure to PNG, SVG, and PDF with optional tight layout.

    Parameters
    ----------
    fig : matplotlib.figure.Figure
        Figure to save.
    save_dir : str or None
        Directory to save the figure files. If None, no files are written.
    plot_name : str
        Base filename (without extension).
    tighten : bool, optional
        If True, apply tight_layout and bbox_inches="tight". Default is True.
    show_legend : bool, optional
        If False, remove legend before saving. Default is False.

    Returns
    -------
    None
    """
    ax = fig.gca()
    if not show_legend:
        legend = ax.get_legend()
        if legend is not None:
            legend.remove()
    if tighten:
        try:
            with warnings.catch_warnings():
                warnings.filterwarnings(
                    "ignore",
                    message="This figure includes Axes that are not compatible with tight_layout",
                )
                fig.tight_layout()
        except (ValueError, RuntimeError):
            fig.subplots_adjust(left=0.1, right=0.9, top=0.9, bottom=0.1)

    if save_dir and tighten:
        os.makedirs(save_dir, exist_ok=True)
        fig.savefig(os.path.join(save_dir, f"{plot_name}.png"), dpi=300, bbox_inches="tight")
        fig.savefig(os.path.join(save_dir, f"{plot_name}.svg"), bbox_inches="tight")
        fig.savefig(os.path.join(save_dir, f"{plot_name}.pdf"), dpi=300, bbox_inches="tight")
    elif save_dir and not tighten:
        os.makedirs(save_dir, exist_ok=True)
        fig.savefig(os.path.join(save_dir, f"{plot_name}.png"), dpi=300)
        fig.savefig(os.path.join(save_dir, f"{plot_name}.svg"))
        fig.savefig(os.path.join(save_dir, f"{plot_name}.pdf"), dpi=300)
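
Example

Saving a small figure in all three formats; the directory and file stem are illustrative.

import matplotlib.pyplot as plt

from uqdd.metrics import save_plot

fig, ax = plt.subplots()
ax.plot([0.0, 0.1, 0.2], [0.92, 0.88, 0.85], marker="o", label="RMSE")
ax.legend()

# Writes example_curve.png/.svg/.pdf under "figures/"; keep show_legend=True,
# otherwise the legend is stripped before saving.
save_plot(fig, save_dir="figures", plot_name="example_curve", show_legend=True)
plt.close(fig)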

uqdd.metrics.analysis.handle_inf_values

handle_inf_values(df: DataFrame) -> pd.DataFrame

Replace +/- infinity values in a DataFrame with NaN.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required

Returns:

Type Description
DataFrame

DataFrame with infinite values replaced by NaN.

Source code in uqdd/metrics/analysis.py
def handle_inf_values(df: pd.DataFrame) -> pd.DataFrame:
    """
    Replace +/- infinity values in a DataFrame with NaN.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.

    Returns
    -------
    pd.DataFrame
        DataFrame with infinite values replaced by NaN.
    """
    return df.replace([float("inf"), -float("inf")], float("nan"))
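
Example

A runnable demonstration on a toy frame.

import numpy as np
import pandas as pd

from uqdd.metrics import handle_inf_values

df = pd.DataFrame({"R2": [0.81, -np.inf, 0.78], "NLL": [1.2, np.inf, 1.1]})
clean = handle_inf_values(df)
print(clean.isna().sum())   # one NaN per column where +/-inf used to be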

uqdd.metrics.analysis.plot_pairplot

plot_pairplot(df: DataFrame, title: str, metrics: List[str], save_dir: Optional[str] = None, cmap: str = 'viridis', group_order: Optional[List[str]] = group_order, show_legend: bool = False) -> None

Plot a seaborn pairplot for a set of metrics colored by Group.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing the metrics and a 'Group' column.

required
title str

Plot title.

required
metrics list of str

Metric column names to include in the pairplot.

required
save_dir str or None

Directory to save plot images. Default is None.

None
cmap str

Seaborn/matplotlib palette name. Default is "viridis".

'viridis'
group_order list of str or None

Order of class labels in the legend. Default is from constants.

group_order
show_legend bool

If True, keep the legend; otherwise it will be removed before saving.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def plot_pairplot(
    df: pd.DataFrame,
    title: str,
    metrics: List[str],
    save_dir: Optional[str] = None,
    cmap: str = "viridis",
    group_order: Optional[List[str]] = group_order,
    show_legend: bool = False,
) -> None:
    """
    Plot a seaborn pairplot for a set of metrics colored by Group.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame containing the metrics and a 'Group' column.
    title : str
        Plot title.
    metrics : list of str
        Metric column names to include in the pairplot.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    cmap : str, optional
        Seaborn/matplotlib palette name. Default is "viridis".
    group_order : list of str or None, optional
        Order of class labels in the legend. Default is from constants.
    show_legend : bool, optional
        If True, keep the legend; otherwise it will be removed before saving.

    Returns
    -------
    None
    """
    df = handle_inf_values(df)
    sns.pairplot(
        df,
        hue="Group",
        hue_order=group_order,
        vars=metrics,
        palette=cmap,
        plot_kws={"alpha": 0.7},
    )
    plt.suptitle(title, y=1.02)
    plot_name = f"pairplot_{title.replace(' ', '_')}"
    save_plot(plt.gcf(), save_dir, plot_name, tighten=False, show_legend=show_legend)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.analysis.plot_line_metrics

plot_line_metrics(df: DataFrame, title: str, metrics: List[str], save_dir: Optional[str] = None, group_order: Optional[List[str]] = group_order, show_legend: bool = False) -> None

Plot line charts of metrics over runs, colored by Group.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with 'wandb run', metrics, and 'Group'.

required
title str

Plot title.

required
metrics list of str

Metric column names to plot.

required
save_dir str or None

Directory to save plot images. Default is None.

None
group_order list of str or None

Order of class labels in the legend. Default is from constants.

group_order
show_legend bool

If True, keep the legend; otherwise it will be removed before saving.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def plot_line_metrics(
    df: pd.DataFrame,
    title: str,
    metrics: List[str],
    save_dir: Optional[str] = None,
    group_order: Optional[List[str]] = group_order,
    show_legend: bool = False,
) -> None:
    """
    Plot line charts of metrics over runs, colored by Group.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with 'wandb run', metrics, and 'Group'.
    title : str
        Plot title.
    metrics : list of str
        Metric column names to plot.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    group_order : list of str or None, optional
        Order of class labels in the legend. Default is from constants.
    show_legend : bool, optional
        If True, keep the legend; otherwise it will be removed before saving.

    Returns
    -------
    None
    """
    df = handle_inf_values(df)
    for metric in metrics:
        plt.figure(figsize=(14, 7))
        sns.lineplot(
            data=df,
            x="wandb run",
            y=metric,
            hue="Group",
            marker="o",
            palette="Set2",
            hue_order=group_order,
            label=metric,
        )
        plt.title(f"{title} - {metric}")
        plt.xticks(rotation=45)
        plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
        if INTERACTIVE_MODE:
            plt.show()
        plot_name = f"line_{title.replace(' ', '_')}_{metric}"
        save_plot(plt.gcf(), save_dir, plot_name, tighten=False, show_legend=show_legend)
        plt.close()

uqdd.metrics.analysis.plot_histogram_metrics

plot_histogram_metrics(df: DataFrame, title: str, metrics: List[str], save_dir: Optional[str] = None, group_order: Optional[List[str]] = group_order, cmap: str = 'crest', show_legend: bool = False) -> None

Plot histograms with KDE for metrics, split by Group.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with metrics and 'Group'.

required
title str

Plot title.

required
metrics list of str

Metric column names to plot.

required
save_dir str or None

Directory to save plot images. Default is None.

None
group_order list of str or None

Order of class labels in the legend. Default is from constants.

group_order
cmap str

Seaborn/matplotlib palette name. Default is "crest".

'crest'
show_legend bool

If True, keep the legend; otherwise it will be removed before saving.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def plot_histogram_metrics(
    df: pd.DataFrame,
    title: str,
    metrics: List[str],
    save_dir: Optional[str] = None,
    group_order: Optional[List[str]] = group_order,
    cmap: str = "crest",
    show_legend: bool = False,
) -> None:
    """
    Plot histograms with KDE for metrics, split by Group.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with metrics and 'Group'.
    title : str
        Plot title.
    metrics : list of str
        Metric column names to plot.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    group_order : list of str or None, optional
        Order of class labels in the legend. Default is from constants.
    cmap : str, optional
        Seaborn/matplotlib palette name. Default is "crest".
    show_legend : bool, optional
        If True, keep the legend; otherwise it will be removed before saving.

    Returns
    -------
    None
    """
    df = handle_inf_values(df)
    for metric in metrics:
        plt.figure(figsize=(14, 7))
        sns.histplot(
            data=df,
            x=metric,
            hue="Group",
            kde=True,
            palette=cmap,
            element="step",
            hue_order=group_order,
            fill=True,
            alpha=0.7,
        )
        plt.title(f"{title} - {metric}")
        if INTERACTIVE_MODE:
            plt.show()
        plot_name = f"histogram_{title.replace(' ', '_')}_{metric}"
        save_plot(plt.gcf(), save_dir, plot_name, show_legend=show_legend)
        plt.close()

uqdd.metrics.analysis.plot_pairwise_scatter_metrics

plot_pairwise_scatter_metrics(df: DataFrame, title: str, metrics: List[str], save_dir: Optional[str] = None, group_order: Optional[List[str]] = group_order, cmap: str = 'tab10_r', show_legend: bool = False) -> None

Plot pairwise scatterplots for all metric combinations, colored by Group.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with metrics and 'Group'.

required
title str

Plot title.

required
metrics list of str

Metric column names to plot pairwise.

required
save_dir str or None

Directory to save plot images. Default is None.

None
group_order list of str or None

Order of class labels in the legend. Default is from constants.

group_order
cmap str

Matplotlib palette name. Default is "tab10_r".

'tab10_r'
show_legend bool

If True, keep the legend; otherwise it will be removed before saving.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def plot_pairwise_scatter_metrics(
    df: pd.DataFrame,
    title: str,
    metrics: List[str],
    save_dir: Optional[str] = None,
    group_order: Optional[List[str]] = group_order,
    cmap: str = "tab10_r",
    show_legend: bool = False,
) -> None:
    """
    Plot pairwise scatterplots for all metric combinations, colored by Group.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with metrics and 'Group'.
    title : str
        Plot title.
    metrics : list of str
        Metric column names to plot pairwise.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    group_order : list of str or None, optional
        Order of class labels in the legend. Default is from constants.
    cmap : str, optional
        Matplotlib palette name. Default is "tab10_r".
    show_legend : bool, optional
        If True, keep the legend; otherwise it will be removed before saving.

    Returns
    -------
    None
    """
    df = handle_inf_values(df)
    num_metrics = len(metrics)
    fig, axes = plt.subplots(num_metrics, num_metrics, figsize=(15, 15))

    for i in range(num_metrics):
        for j in range(num_metrics):
            if i != j:
                ax = sns.scatterplot(
                    data=df,
                    x=metrics[j],
                    y=metrics[i],
                    hue="Group",
                    palette=cmap,
                    hue_order=group_order,
                    ax=axes[i, j],
                    legend=False if not (i == 1 and j == 0) else "brief",
                )
                if i == 1 and j == 0:
                    handles, labels = ax.get_legend_handles_labels()
                    ax.legend().remove()
            else:
                axes[i, j].set_visible(False)

            axes[i, j].set_ylabel(metrics[i] if j == 0 and i > 0 else "")
            axes[i, j].set_xlabel(metrics[j] if i == num_metrics - 1 else "")

    fig.legend(handles, labels, loc="upper right", bbox_to_anchor=(1.15, 1))
    fig.suptitle(title, y=1.02)
    fig.subplots_adjust(top=0.95, wspace=0.4, hspace=0.4)
    plot_name = f"pairwise_scatter_{title.replace(' ', '_')}"
    save_plot(fig, save_dir, plot_name, tighten=True, show_legend=show_legend)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.analysis.plot_metrics

plot_metrics(df: DataFrame, metrics: List[str], cmap: str = 'tab10_r', save_dir: Optional[str] = None, hatches_dict: Optional[Dict[str, str]] = None, group_order: Optional[List[str]] = None, show: bool = True, fig_width: Optional[float] = None, fig_height: Optional[float] = None, show_legend: bool = False) -> Dict[str, str]

Plot grouped bar charts showing mean and std for metrics across splits and model types.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with columns ['Split', 'Model type'] and metrics.

required
metrics list of str

Metric column names to plot.

required
cmap str

Matplotlib colormap name used to derive distinct colors per model type. Default is "tab10_r".

'tab10_r'
save_dir str or None

Directory to save plot images. Default is None.

None
hatches_dict dict[str, str] or None

Mapping from Split to hatch pattern. Default is None.

None
group_order list of str or None

Order of grouped labels (Split_Model type). Default derives from data.

None
show bool

If True, display plot in interactive mode. Default is True.

True
fig_width float or None

Width of the plot area (excluding legend). Default scales with number of metrics.

None
fig_height float or None

Height of the plot area (excluding legend). Default is 6.

None
show_legend bool

If True, include a legend of split/model combinations. Default is False.

False

Returns:

Type Description
dict[str, str]

Color mapping from 'Model type' to RGBA string used in the plot.

Source code in uqdd/metrics/analysis.py
def plot_metrics(
    df: pd.DataFrame,
    metrics: List[str],
    cmap: str = "tab10_r",
    save_dir: Optional[str] = None,
    hatches_dict: Optional[Dict[str, str]] = None,
    group_order: Optional[List[str]] = None,
    show: bool = True,
    fig_width: Optional[float] = None,
    fig_height: Optional[float] = None,
    show_legend: bool = False,
) -> Dict[str, str]:
    """
    Plot grouped bar charts showing mean and std for metrics across splits and model types.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with columns ['Split', 'Model type'] and metrics.
    metrics : list of str
        Metric column names to plot.
    cmap : str, optional
        Matplotlib colormap name used to derive distinct colors per model type. Default is "tab10_r".
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    hatches_dict : dict[str, str] or None, optional
        Mapping from Split to hatch pattern. Default is None.
    group_order : list of str or None, optional
        Order of grouped labels (Split_Model type). Default derives from data.
    show : bool, optional
        If True, display plot in interactive mode. Default is True.
    fig_width : float or None, optional
        Width of the plot area (excluding legend). Default scales with number of metrics.
    fig_height : float or None, optional
        Height of the plot area (excluding legend). Default is 6.
    show_legend : bool, optional
        If True, include a legend of split/model combinations. Default is False.

    Returns
    -------
    dict[str, str]
        Color mapping from 'Model type' to RGBA string used in the plot.
    """
    plot_width = fig_width if fig_width else max(10, len(metrics) * 2)
    plot_height = fig_height if fig_height else 6
    total_width = plot_width + 5
    total_height = plot_height + 2

    fig = plt.figure(figsize=(total_width, total_height))
    gs = gridspec.GridSpec(1, 1, figure=fig, left=0.1, right=0.75, top=0.9, bottom=0.2)
    ax = fig.add_subplot(gs[0])
    ax.set_position([0.1, 0.15, plot_width / total_width, plot_height / total_height])

    stats_dfs = []
    for metric in metrics:
        mean_df = df.groupby(["Split", "Model type"])[metric].mean().rename(f"{metric}_mean")
        std_df = df.groupby(["Split", "Model type"])[metric].std().rename(f"{metric}_std")
        stats_df = pd.merge(mean_df, std_df, left_index=True, right_index=True).reset_index()
        stats_df["Group"] = stats_df.apply(lambda row: f"{row['Split']}_{row['Model type']}", axis=1)
        stats_df["Metric"] = metric
        stats_dfs.append(stats_df)

    combined_stats_df = pd.concat(stats_dfs)
    if group_order:
        combined_stats_df["Group"] = pd.Categorical(
            combined_stats_df["Group"], categories=group_order, ordered=True
        )
    else:
        group_order = combined_stats_df["Group"].unique().tolist()

    scalar_mappable = ScalarMappable(cmap=cmap)
    model_types = combined_stats_df["Model type"].unique()
    color_dict = {
        m: c
        for m, c in zip(
            model_types,
            scalar_mappable.to_rgba(range(len(model_types)), alpha=1).tolist(),
        )
    }

    bar_width = 0.12
    group_spacing = 0.4
    num_bars = len(model_types) * len(hatches_dict)
    positions = []
    tick_positions = []
    tick_labels = []

    for i, metric in enumerate(metrics):
        metric_data = combined_stats_df[combined_stats_df["Metric"] == metric]
        metric_data.loc[:, "Group"] = pd.Categorical(
            metric_data["Group"], categories=group_order, ordered=True
        )
        metric_data = metric_data.sort_values("Group").reset_index(drop=True)
        for j, (_, row) in enumerate(metric_data.iterrows()):
            position = i * (num_bars * bar_width + group_spacing) + (j % num_bars) * bar_width
            positions.append(position)
            ax.bar(
                position,
                height=row[f"{metric}_mean"],
                color=color_dict[row["Model type"]],
                hatch=hatches_dict[row["Split"]],
                width=bar_width,
            )
        center_position = i * (num_bars * bar_width + group_spacing) + (num_bars * bar_width) / 2
        tick_positions.append(center_position)
        tick_labels.append(metric.replace(" ", "\n") if " " in metric else metric)

    def create_stats_legend(df, color_mapping, hatches_dict, group_order):
        patches_dict = {}
        for _, row in df.iterrows():
            label = f"{row['Split']} {row['Model type']}"
            group_label = f"{row['Split']}_{row['Model type']}"
            if group_label not in patches_dict:
                patches_dict[group_label] = mpatches.Patch(
                    facecolor=color_mapping[row["Model type"]],
                    hatch=hatches_dict[row["Split"]],
                    label=label,
                )
        return [patches_dict[group] for group in group_order if group in patches_dict]

    if show_legend:
        legend_elements = create_stats_legend(combined_stats_df, color_dict, hatches_dict, group_order)
        ax.legend(handles=legend_elements, bbox_to_anchor=(1.05, 1), loc="upper left", borderaxespad=0, frameon=False)

    for (_, row), bar in zip(combined_stats_df.iterrows(), ax.patches):
        x_bar = bar.get_x() + bar.get_width() / 2
        y_bar = bar.get_height()
        ax.errorbar(
            x_bar,
            y_bar,
            yerr=row[f"{row['Metric']}_std"],
            color="black",
            fmt="none",
            elinewidth=1,
            capsize=3,
            alpha=0.5,
        )

    ax.set_xticks(tick_positions)
    ax.set_xticklabels(tick_labels, rotation=45, ha="right", rotation_mode="anchor", fontsize=9)
    ax.set_ylim(bottom=0.0)

    if save_dir:
        metrics_names = "_".join(metrics)
        plot_name = f"barplot_{cmap}_{metrics_names}"
        save_plot(fig, save_dir, plot_name, show_legend=show_legend)

    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

    return color_dict
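
Example

Runnable on a toy results table. Although hatches_dict defaults to None in the signature, the implementation above takes len(hatches_dict), so a mapping covering every 'Split' value should be supplied; the split labels and metric values below are illustrative.

import pandas as pd

from uqdd.metrics import plot_metrics

df = pd.DataFrame({
    "Split": (["random"] * 2 + ["scaffold"] * 2) * 2,
    "Model type": ["PNN"] * 4 + ["Ensemble"] * 4,
    "RMSE": [0.70, 0.68, 0.82, 0.80, 0.66, 0.65, 0.78, 0.77],
    "NLL": [1.10, 1.05, 1.30, 1.28, 0.98, 0.97, 1.20, 1.18],
})

color_dict = plot_metrics(
    df,
    metrics=["RMSE", "NLL"],
    hatches_dict={"random": "", "scaffold": "//"},   # one hatch per Split value
    save_dir=None,                                   # pass a directory to export the figure
    show_legend=True,
)
# color_dict maps each 'Model type' to its bar color and can be passed to
# plot_comparison_metrics for consistent coloring across figures.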

uqdd.metrics.analysis.find_highly_correlated_metrics

find_highly_correlated_metrics(df: DataFrame, metrics: List[str], threshold: float = 0.8, save_dir: Optional[str] = None, cmap: str = 'coolwarm', show_legend: bool = False) -> List[Tuple[str, str, float]]

Identify pairs of metrics with correlation above a threshold and plot the matrix.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing the metric columns.

required
metrics list of str

Metric column names to include in the correlation analysis.

required
threshold float

Absolute correlation threshold for reporting pairs. Default is 0.8.

0.8
save_dir str or None

Directory to save the heatmap plot. Default is None.

None
cmap str

Matplotlib colormap name. Default is "coolwarm".

'coolwarm'
show_legend bool

If True, keep the legend; otherwise it will be removed before saving.

False

Returns:

Type Description
list of tuple[str, str, float]

List of metric pairs and their absolute correlation values.

Source code in uqdd/metrics/analysis.py
def find_highly_correlated_metrics(
    df: pd.DataFrame,
    metrics: List[str],
    threshold: float = 0.8,
    save_dir: Optional[str] = None,
    cmap: str = "coolwarm",
    show_legend: bool = False,
) -> List[Tuple[str, str, float]]:
    """
    Identify pairs of metrics with correlation above a threshold and plot the matrix.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame containing the metric columns.
    metrics : list of str
        Metric column names to include in the correlation analysis.
    threshold : float, optional
        Absolute correlation threshold for reporting pairs. Default is 0.8.
    save_dir : str or None, optional
        Directory to save the heatmap plot. Default is None.
    cmap : str, optional
        Matplotlib colormap name. Default is "coolwarm".
    show_legend : bool, optional
        If True, keep the legend; otherwise it will be removed before saving.

    Returns
    -------
    list of tuple[str, str, float]
        List of metric pairs and their absolute correlation values.
    """
    corr_matrix = df[metrics].corr().abs()
    pairs = []
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if corr_matrix.iloc[i, j] > threshold:
                pairs.append((corr_matrix.columns[i], corr_matrix.columns[j], corr_matrix.iloc[i, j]))

    print(f"Highly correlated metrics (correlation coefficient > {threshold}):")
    for a, b, v in pairs:
        print(f"{a} and {b}: {v:.2f}")

    plt.figure(figsize=(10, 8))
    sns.heatmap(corr_matrix, annot=True, cmap=cmap)
    plt.title("Correlation Matrix")
    plot_name = f"correlation_matrix_{threshold}_{'_'.join(metrics)}"
    save_plot(plt.gcf(), save_dir, plot_name, show_legend=show_legend)
    if INTERACTIVE_MODE:
        plt.show()

    return pairs
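
A minimal usage sketch with synthetic data (column values are illustrative; it assumes the name is importable from `uqdd.metrics` as listed in the public API):

```python
import numpy as np
import pandas as pd

from uqdd.metrics import find_highly_correlated_metrics

# Hypothetical results table with three metric columns; RMSE and MAE are
# constructed to be strongly correlated so at least one pair is reported.
rng = np.random.default_rng(0)
rmse = rng.uniform(0.5, 1.5, size=50)
df = pd.DataFrame({
    "RMSE": rmse,
    "MAE": 0.8 * rmse + rng.normal(0.0, 0.01, size=50),
    "NLL": rng.normal(1.0, 0.3, size=50),
})

pairs = find_highly_correlated_metrics(df, metrics=["RMSE", "MAE", "NLL"], threshold=0.8)
# e.g. [('MAE', 'RMSE', 0.99...)]; a correlation heatmap is drawn as a side effect.
```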

uqdd.metrics.analysis.plot_comparison_metrics

plot_comparison_metrics(df: DataFrame, metrics: List[str], cmap: str = 'tab10_r', color_dict: Optional[Dict[str, str]] = None, save_dir: Optional[str] = None, fig_width: Optional[float] = None, fig_height: Optional[float] = None, show_legend: bool = False, models_order: Optional[List[str]] = None) -> None

Plot comparison bar charts across splits, model types, and calibration states.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with columns ['Split', 'Model type', 'Calibration'] and metrics.

required
metrics list of str

Metric column names to plot.

required
cmap str

Matplotlib colormap name used to derive distinct colors per model type. Default is "tab10_r".

'tab10_r'
color_dict dict[str, str] or None

Precomputed color mapping from model type to color. If None, one is generated.

None
save_dir str or None

Directory to save plot images. Default is None.

None
fig_width float or None

Width of the plot area (excluding legend). Default scales with the number of metrics.

None
fig_height float or None

Height of the plot area (excluding legend). Default is 6.

None
show_legend bool

If True, include a legend. Default is False.

False
models_order list of str or None

Explicit order of model types for coloring and grouping. Default derives from data.

None

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def plot_comparison_metrics(
    df: pd.DataFrame,
    metrics: List[str],
    cmap: str = "tab10_r",
    color_dict: Optional[Dict[str, str]] = None,
    save_dir: Optional[str] = None,
    fig_width: Optional[float] = None,
    fig_height: Optional[float] = None,
    show_legend: bool = False,
    models_order: Optional[List[str]] = None,
) -> None:
    """
    Plot comparison bar charts across splits, model types, and calibration states.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with columns ['Split', 'Model type', 'Calibration'] and metrics.
    metrics : list of str
        Metric column names to plot.
    cmap : str, optional
        Matplotlib colormap name used to derive distinct colors per model type. Default is "tab10_r".
    color_dict : dict[str, str] or None, optional
        Precomputed color mapping from model type to color. If None, one is generated.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    fig_width : float or None, optional
        Width of the plot area (excluding legend). Default scales with the number of metrics.
    fig_height : float or None, optional
        Height of the plot area (excluding legend). Default is 6.
    show_legend : bool, optional
        If True, include a legend. Default is False.
    models_order : list of str or None, optional
        Explicit order of model types for coloring and grouping. Default derives from data.

    Returns
    -------
    None
    """
    plot_width = fig_width if fig_width else max(7, len(metrics) * 3)
    plot_height = fig_height if fig_height else 6
    total_width = plot_width + 5
    total_height = plot_height + 2

    fig = plt.figure(figsize=(total_width, total_height))
    gs = gridspec.GridSpec(1, 1, figure=fig, left=0.1, right=0.75, top=0.9, bottom=0.15)
    ax = fig.add_subplot(gs[0])
    ax.set_position([0.1, 0.15, plot_width / total_width, plot_height / total_height])

    stats_dfs = []
    for metric in metrics:
        mean_df = df.groupby(["Split", "Model type", "Calibration"])[metric].mean().rename(f"{metric}_mean")
        std_df = df.groupby(["Split", "Model type", "Calibration"])[metric].std().rename(f"{metric}_std")
        stats_df = pd.merge(mean_df, std_df, left_index=True, right_index=True).reset_index()
        stats_df["Group"] = stats_df.apply(
            lambda row: f"{row['Split']}_{row['Model type']}_{row['Calibration']}", axis=1
        )
        stats_df["Metric"] = metric
        stats_dfs.append(stats_df)

    combined_stats_df = pd.concat(stats_dfs)
    if models_order is None:
        models_order = combined_stats_df["Model type"].unique().tolist()

    if color_dict is None:
        scalar_mappable = ScalarMappable(cmap=cmap)
        color_dict = {
            m: c
            for m, c in zip(
                models_order,
                scalar_mappable.to_rgba(range(len(models_order)), alpha=1).tolist(),
            )
        }
    color_dict = {k: color_dict[k] for k in models_order}

    hatches_dict = {
        "Before Calibration": "\\\\",
        "After Calibration": "",
    }

    bar_width = 0.1
    group_spacing = 0.2
    split_spacing = 0.6
    num_bars = len(models_order) * 2
    positions = []
    tick_positions = []
    tick_labels = []

    for i, metric in enumerate(metrics):
        metric_data = combined_stats_df[combined_stats_df["Metric"] == metric]
        split_types = metric_data["Split"].unique()
        for j, split in enumerate(split_types):
            split_data = metric_data[metric_data["Split"] == split]
            split_data = split_data[split_data["Model type"].isin(models_order)]

            for k, model_type in enumerate(models_order):
                for l, calibration in enumerate(["Before Calibration", "After Calibration"]):
                    position = (
                        i * (split_spacing + len(split_types) * (num_bars * bar_width + group_spacing))
                        + j * (num_bars * bar_width + group_spacing)
                        + k * 2 * bar_width
                        + l * bar_width
                    )
                    positions.append(position)
                    height = split_data[
                        (split_data["Model type"] == model_type)
                        & (split_data["Calibration"] == calibration)
                    ][f"{metric}_mean"].values[0]
                    ax.bar(
                        position,
                        height=height,
                        color=color_dict[model_type],
                        hatch=hatches_dict[calibration],
                        width=bar_width,
                    )

            center_position = (
                i * (split_spacing + len(split_types) * (num_bars * bar_width + group_spacing))
                + j * (num_bars * bar_width + group_spacing)
                + (num_bars * bar_width) / 2
            )
            tick_positions.append(center_position)
            tick_labels.append(f"{metric}\n{split}")

    if show_legend:
        legend_elements = [
            mpatches.Patch(facecolor=color_dict[model], edgecolor="black", label=model)
            for model in models_order
        ]
        legend_elements += [
            mpatches.Patch(facecolor="white", edgecolor="black", hatch=h, label=label)
            for label, h in hatches_dict.items()
        ]
        ax.legend(handles=legend_elements, bbox_to_anchor=(1.05, 1), loc="upper left", borderaxespad=0, frameon=False)

    for (_, row), bar in zip(combined_stats_df.iterrows(), ax.patches):
        x_bar = bar.get_x() + bar.get_width() / 2
        y_bar = bar.get_height()
        yerr_lower = y_bar - max(0, y_bar - row[f"{row['Metric']}_std"])
        yerr_upper = row[f"{row['Metric']}_std"]
        ax.errorbar(
            x_bar,
            y_bar,
            yerr=[[yerr_lower], [yerr_upper]],
            color="black",
            fmt="none",
            elinewidth=1,
            capsize=3,
            alpha=0.5,
        )

    ax.set_xticks(tick_positions)
    ax.set_xticklabels(tick_labels, rotation=45, ha="right", rotation_mode="anchor", fontsize=9)
    ax.set_ylim(bottom=0.0)

    if save_dir:
        metrics_names = "_".join(metrics)
        plot_name = f"comparison_barplot_{cmap}_{metrics_names}"
        save_plot(fig, save_dir, plot_name, show_legend=show_legend)

    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
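
A hedged sketch with synthetic data (split and model names are placeholders); each (Split, Model type, Calibration) combination gets two replicates so the standard-deviation error bars are defined:

```python
import numpy as np
import pandas as pd

from uqdd.metrics import plot_comparison_metrics

rng = np.random.default_rng(0)
rows = []
for split in ["stratified", "scaffold_cluster"]:          # placeholder split names
    for model in ["pnn", "ensemble"]:                     # placeholder model types
        for calib in ["Before Calibration", "After Calibration"]:
            for _ in range(2):                            # two replicates per group
                rows.append({
                    "Split": split,
                    "Model type": model,
                    "Calibration": calib,
                    "RMSE": rng.uniform(0.6, 1.0),
                    "NLL": rng.uniform(0.8, 1.4),
                })
df = pd.DataFrame(rows)

plot_comparison_metrics(df, metrics=["RMSE", "NLL"], show_legend=True)
```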

uqdd.metrics.analysis.load_and_aggregate_calibration_data

load_and_aggregate_calibration_data(base_path: str, paths: List[str]) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]

Load calibration curve data from multiple model paths and aggregate statistics.

Parameters:

Name Type Description Default
base_path str

Base directory from which model subpaths are resolved.

required
paths list of str

Relative paths to model directories containing 'calibration_plot_data.csv'.

required

Returns:

Type Description
(ndarray, ndarray, ndarray, ndarray)

Tuple of (expected_values, mean_observed, lower_bound, upper_bound), each of shape (n_bins,).

Source code in uqdd/metrics/analysis.py
def load_and_aggregate_calibration_data(base_path: str, paths: List[str]) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """
    Load calibration curve data from multiple model paths and aggregate statistics.

    Parameters
    ----------
    base_path : str
        Base directory from which model subpaths are resolved.
    paths : list of str
        Relative paths to model directories containing 'calibration_plot_data.csv'.

    Returns
    -------
    (numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray)
        Tuple of (expected_values, mean_observed, lower_bound, upper_bound), each of shape (n_bins,).
    """
    expected_values = []
    observed_values = []
    for path in paths:
        file_path = os.path.join(base_path, path, "calibration_plot_data.csv")
        if os.path.exists(file_path):
            data = pd.read_csv(file_path)
            expected_values = data["Expected Proportion"]
            observed_values.append(data["Observed Proportion"])
        else:
            print(f"File not found: {file_path}")

    expected_values = np.array(expected_values)
    observed_values = np.array(observed_values)
    mean_observed = np.mean(observed_values, axis=0)
    lower_bound = np.min(observed_values, axis=0)
    upper_bound = np.max(observed_values, axis=0)
    return expected_values, mean_observed, lower_bound, upper_bound
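
A sketch of the expected on-disk layout (paths are placeholders): each listed model directory must contain a `calibration_plot_data.csv` with 'Expected Proportion' and 'Observed Proportion' columns, as read by the function above.

```python
from uqdd.metrics import load_and_aggregate_calibration_data

base_path = "runs/papyrus_xc50"                      # placeholder base directory
paths = ["pnn_seed0", "pnn_seed1", "pnn_seed2"]      # placeholder model sub-directories

expected, mean_obs, lower, upper = load_and_aggregate_calibration_data(base_path, paths)
# 'expected' holds the shared bin values; mean/lower/upper summarise the
# observed proportions across the replicate models (min/max envelope).
```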

uqdd.metrics.analysis.plot_calibration_data

plot_calibration_data(df_aggregated: DataFrame, base_path: str, save_dir: Optional[str] = None, title: str = 'Calibration Plot', color_name: str = 'tab10_r', color_dict: Optional[Dict[str, str]] = None, group_order: Optional[List[str]] = None, fig_width: Optional[float] = None, fig_height: Optional[float] = None, show_legend: bool = False) -> None

Plot aggregated calibration curves for multiple groups against the perfect calibration line.

Parameters:

Name Type Description Default
df_aggregated DataFrame

Aggregated DataFrame containing 'Group' and 'project_model' lists for each group.

required
base_path str

Base directory where model paths are located.

required
save_dir str or None

Directory to save plot images. Default is None.

None
title str

Plot title. Default is "Calibration Plot".

'Calibration Plot'
color_name str

Colormap name used to derive distinct colors per group. Default is "tab10_r".

'tab10_r'
color_dict dict[str, str] or None

Precomputed color mapping from group to color. If None, one is generated.

None
group_order list of str or None

Order of groups in the legend. Default derives from data.

None
fig_width float or None

Width of the plot area. Default is 6.

None
fig_height float or None

Height of the plot area. Default is 6.

None
show_legend bool

If True, include a legend. Default is False.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def plot_calibration_data(
    df_aggregated: pd.DataFrame,
    base_path: str,
    save_dir: Optional[str] = None,
    title: str = "Calibration Plot",
    color_name: str = "tab10_r",
    color_dict: Optional[Dict[str, str]] = None,
    group_order: Optional[List[str]] = None,
    fig_width: Optional[float] = None,
    fig_height: Optional[float] = None,
    show_legend: bool = False,
) -> None:
    """
    Plot aggregated calibration curves for multiple groups against the perfect calibration line.

    Parameters
    ----------
    df_aggregated : pd.DataFrame
        Aggregated DataFrame containing 'Group' and 'project_model' lists for each group.
    base_path : str
        Base directory where model paths are located.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    title : str, optional
        Plot title. Default is "Calibration Plot".
    color_name : str, optional
        Colormap name used to derive distinct colors per group. Default is "tab10_r".
    color_dict : dict[str, str] or None, optional
        Precomputed color mapping from group to color. If None, one is generated.
    group_order : list of str or None, optional
        Order of groups in the legend. Default derives from data.
    fig_width : float or None, optional
        Width of the plot area. Default is 6.
    fig_height : float or None, optional
        Height of the plot area. Default is 6.
    show_legend : bool, optional
        If True, include a legend. Default is False.

    Returns
    -------
    None
    """
    plot_width = fig_width if fig_width else 6
    plot_height = fig_height if fig_height else 6
    total_width = plot_width + 4
    total_height = plot_height + 2

    fig = plt.figure(figsize=(total_width, total_height))
    gs = gridspec.GridSpec(1, 1, figure=fig, left=0.15, right=0.75, top=0.9, bottom=0.15)
    ax = fig.add_subplot(gs[0])
    ax.set_position([0.15, 0.15, plot_width / total_width, plot_height / total_height])

    if group_order is None:
        group_order = list(df_aggregated["Group"].unique())

    if color_dict is None:
        scalar_mappable = ScalarMappable(cmap=color_name)
        colors = scalar_mappable.to_rgba(range(len(group_order)))
        color_dict = {group: color for group, color in zip(group_order, colors)}

    legend_handles = {}
    for idx, row in df_aggregated.iterrows():
        model_paths = row["project_model"]
        group_label = row["Group"]
        color = color_dict[group_label]
        expected, mean_observed, lower_bound, upper_bound = load_and_aggregate_calibration_data(base_path, model_paths)
        (line,) = ax.plot(expected, mean_observed, label=group_label, color=color)
        ax.fill_between(expected, lower_bound, upper_bound, alpha=0.2, color=color)
        if group_label not in legend_handles:
            legend_handles[group_label] = line

    (perfect_line,) = ax.plot([0, 1], [0, 1], "k--", label="Perfect Calibration")
    legend_handles["Perfect Calibration"] = perfect_line

    ordered_legend_handles = [legend_handles[group] for group in group_order if group in legend_handles]
    ordered_legend_handles.append(legend_handles["Perfect Calibration"])
    if show_legend:
        ax.legend(handles=ordered_legend_handles, bbox_to_anchor=(1.05, 1), loc="upper left")

    ax.set_title(title)
    ax.set_xlabel("Expected Proportion")
    ax.set_ylabel("Observed Proportion")
    ax.grid(True)
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)

    if save_dir:
        plot_name = f"{title.replace(' ', '_')}"
        save_plot(fig, save_dir, plot_name, tighten=True, show_legend=show_legend)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
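
A sketch of the expected input (directories and group names are placeholders): one row per group, with `project_model` holding the list of replicate model sub-directories under `base_path`.

```python
import pandas as pd

from uqdd.metrics import plot_calibration_data

df_aggregated = pd.DataFrame({
    "Group": ["stratified_pnn", "stratified_ensemble"],
    "project_model": [
        ["pnn_seed0", "pnn_seed1"],              # placeholder sub-directories
        ["ensemble_seed0", "ensemble_seed1"],
    ],
})

plot_calibration_data(
    df_aggregated,
    base_path="runs/papyrus_xc50",               # placeholder base directory
    title="Calibration Plot",
    show_legend=True,
)
```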

uqdd.metrics.analysis.move_model_folders

move_model_folders(df: DataFrame, search_dirs: List[str], output_dir: str, overwrite: bool = False) -> None

Move or merge model directories into a single output folder based on model names.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing a 'model name' column.

required
search_dirs list of str

Directories to search for model subfolders.

required
output_dir str

Destination directory where model folders will be moved or merged.

required
overwrite bool

If True, an existing destination folder is merged with the source folder (its contents are copied over). Default is False.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def move_model_folders(
    df: pd.DataFrame,
    search_dirs: List[str],
    output_dir: str,
    overwrite: bool = False,
) -> None:
    """
    Move or merge model directories into a single output folder based on model names.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing a 'model name' column.
    search_dirs : list of str
        Directories to search for model subfolders.
    output_dir : str
        Destination directory where model folders will be moved or merged.
    overwrite : bool, optional
        If True, existing folders are merged (copied) with source. Default is False.

    Returns
    -------
    None
    """
    model_names = df["model name"].unique()
    if not os.path.exists(output_dir):
        os.makedirs(output_dir, exist_ok=True)
        print(f"Created output directory '{output_dir}'.")

    for model_name in model_names:
        found = False
        for search_dir in search_dirs:
            if not os.path.isdir(search_dir):
                print(f"Search directory '{search_dir}' does not exist. Skipping.")
                continue
            subdirs = [d for d in os.listdir(search_dir) if os.path.isdir(os.path.join(search_dir, d))]
            if model_name in subdirs:
                source_dir = os.path.join(search_dir, model_name)
                dest_dir = os.path.join(output_dir, model_name)
                if os.path.exists(dest_dir):
                    if overwrite:
                        shutil.copytree(source_dir, dest_dir, dirs_exist_ok=True)
                        print(f"Merged (Copied) '{source_dir}' to '{dest_dir}'.")
                else:
                    try:
                        shutil.move(source_dir, dest_dir)
                        print(f"Moved '{source_dir}' to '{dest_dir}'.")
                    except Exception as e:
                        print(f"Error moving '{source_dir}' to '{dest_dir}': {e}")
                found = True
                break
        if not found:
            print(f"Model folder '{model_name}' not found in any of the search directories.")

uqdd.metrics.analysis.load_predictions

load_predictions(model_path: str) -> pd.DataFrame

Load pickled predictions from a model directory.

Parameters:

Name Type Description Default
model_path str

Path to the model directory containing 'preds.pkl'.

required

Returns:

Type Description
DataFrame

DataFrame loaded from the pickle file.

Source code in uqdd/metrics/analysis.py
def load_predictions(model_path: str) -> pd.DataFrame:
    """
    Load pickled predictions from a model directory.

    Parameters
    ----------
    model_path : str
        Path to the model directory containing 'preds.pkl'.

    Returns
    -------
    pd.DataFrame
        DataFrame loaded from the pickle file.
    """
    preds_path = os.path.join(model_path, "preds.pkl")
    return pd.read_pickle(preds_path)
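
Usage is a one-liner; the path below is a placeholder and must point to a directory holding the exported `preds.pkl`:

```python
from uqdd.metrics import load_predictions

preds = load_predictions("runs/papyrus_xc50/pnn_seed0")  # placeholder model directory
print(preds.columns.tolist())  # typically includes y_true, y_pred, y_alea, y_eps
```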

uqdd.metrics.analysis.calculate_rmse_rejection_curve

calculate_rmse_rejection_curve(preds: DataFrame, uncertainty_col: str = 'y_alea', true_label_col: str = 'y_true', pred_label_col: str = 'y_pred', normalize_rmse: bool = False, random_rejection: bool = False, unc_type: Optional[str] = None, max_rejection_ratio: float = 0.95) -> Tuple[np.ndarray, np.ndarray, float]

Compute RMSE vs. rejection rate curve and its AUC by rejecting high-uncertainty predictions.

Parameters:

Name Type Description Default
preds DataFrame

DataFrame with columns for true labels, predicted labels, and uncertainty components.

required
uncertainty_col str

Column name for uncertainty to sort by if unc_type is None. Default is "y_alea".

'y_alea'
true_label_col str

Column name for true labels. Default is "y_true".

'y_true'
pred_label_col str

Column name for predicted labels. Default is "y_pred".

'y_pred'
normalize_rmse bool

If True, normalize RMSE by the initial RMSE before rejection. Default is False.

False
random_rejection bool

If True, randomly reject samples instead of sorting by uncertainty. Default is False.

False
unc_type {"aleatoric", "epistemic", "both"} or None

Which uncertainty to use. If "both", sums aleatoric and epistemic. If None, use uncertainty_col.

None
max_rejection_ratio float

Maximum fraction of samples to reject (exclusive of the tail). Default is 0.95.

0.95

Returns:

Type Description
(ndarray, ndarray, float)

Tuple of (rejection_rates, rmses, AUC of the RMSE–rejection curve).

Raises:

Type Description
ValueError

If unc_type is invalid or uncertainty_col is not present when needed.

Source code in uqdd/metrics/analysis.py
def calculate_rmse_rejection_curve(
    preds: pd.DataFrame,
    uncertainty_col: str = "y_alea",
    true_label_col: str = "y_true",
    pred_label_col: str = "y_pred",
    normalize_rmse: bool = False,
    random_rejection: bool = False,
    unc_type: Optional[str] = None,
    max_rejection_ratio: float = 0.95,
) -> Tuple[np.ndarray, np.ndarray, float]:
    """
    Compute RMSE vs. rejection rate curve and its AUC by rejecting high-uncertainty predictions.

    Parameters
    ----------
    preds : pd.DataFrame
        DataFrame with columns for true labels, predicted labels, and uncertainty components.
    uncertainty_col : str, optional
        Column name for uncertainty to sort by if `unc_type` is None. Default is "y_alea".
    true_label_col : str, optional
        Column name for true labels. Default is "y_true".
    pred_label_col : str, optional
        Column name for predicted labels. Default is "y_pred".
    normalize_rmse : bool, optional
        If True, normalize RMSE by the initial RMSE before rejection. Default is False.
    random_rejection : bool, optional
        If True, randomly reject samples instead of sorting by uncertainty. Default is False.
    unc_type : {"aleatoric", "epistemic", "both"} or None, optional
        Which uncertainty to use. If "both", sums aleatoric and epistemic. If None, use `uncertainty_col`.
    max_rejection_ratio : float, optional
        Maximum fraction of samples to reject (exclusive of the tail). Default is 0.95.

    Returns
    -------
    (numpy.ndarray, numpy.ndarray, float)
        Tuple of (rejection_rates, rmses, AUC of the RMSE–rejection curve).

    Raises
    ------
    ValueError
        If `unc_type` is invalid or `uncertainty_col` is not present when needed.
    """
    if unc_type == "aleatoric":
        uncertainty_col = "y_alea"
    elif unc_type == "epistemic":
        uncertainty_col = "y_eps"
    elif unc_type == "both":
        preds["y_unc"] = preds["y_alea"] + preds["y_eps"]
        uncertainty_col = "y_unc"
    elif unc_type is None and uncertainty_col in preds.columns:
        pass
    else:
        raise ValueError(
            "Either provide valid uncertainty type or provide the uncertainty column name in the DataFrame"
        )

    if random_rejection:
        preds = preds.sample(frac=max_rejection_ratio).reset_index(drop=True)
    else:
        preds = preds.sort_values(by=uncertainty_col, ascending=False)

    max_rejection_index = int(len(preds) * max_rejection_ratio)
    step = max(1, int(len(preds) * 0.01))
    rejection_steps = np.arange(0, max_rejection_index, step=step)
    rejection_rates = rejection_steps / len(preds)
    rmses = []

    initial_rmse = mean_squared_error(preds[true_label_col], preds[pred_label_col], squared=False)

    for i in rejection_steps:
        selected_preds = preds.iloc[i:]
        rmse = mean_squared_error(selected_preds[true_label_col], selected_preds[pred_label_col], squared=False)
        if normalize_rmse:
            rmse /= initial_rmse
        rmses.append(rmse)
    auc_arc = auc(rejection_rates, rmses)
    return rejection_rates, np.array(rmses), float(auc_arc)
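
A self-contained sketch on synthetic predictions, where the error is made to scale with the reported aleatoric uncertainty so that rejecting high-uncertainty samples lowers the RMSE:

```python
import numpy as np
import pandas as pd

from uqdd.metrics import calculate_rmse_rejection_curve

rng = np.random.default_rng(0)
n = 1000
y_true = rng.normal(6.5, 1.0, size=n)
y_alea = rng.uniform(0.1, 1.0, size=n)
y_pred = y_true + rng.normal(0.0, 1.0, size=n) * y_alea   # larger uncertainty -> larger error
preds = pd.DataFrame({"y_true": y_true, "y_pred": y_pred, "y_alea": y_alea})

rates, rmses, auc_rrc = calculate_rmse_rejection_curve(preds, unc_type="aleatoric")
print(f"AUC-RRC: {auc_rrc:.3f}")   # lower is better for informative uncertainties
```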

uqdd.metrics.analysis.calculate_rejection_curve

calculate_rejection_curve(df: DataFrame, model_paths: List[str], unc_col: str, random_rejection: bool = False, normalize_rmse: bool = False, max_rejection_ratio: float = 0.95) -> Tuple[np.ndarray, np.ndarray, np.ndarray, float, float]

Aggregate RMSE–rejection curves across models and compute mean/std and AUC statistics.

Parameters:

Name Type Description Default
df DataFrame

Auxiliary DataFrame (not used directly, kept for API symmetry).

required
model_paths list of str

Paths to model directories containing 'preds.pkl'.

required
unc_col str

Uncertainty column name to use when computing curves (e.g., 'y_alea' or 'y_eps').

required
random_rejection bool

If True, randomly reject samples. Default is False.

False
normalize_rmse bool

If True, normalize RMSE by the initial RMSE. Default is False.

False
max_rejection_ratio float

Maximum fraction of samples to reject. Default is 0.95.

0.95

Returns:

Type Description
(ndarray, ndarray, ndarray, float, float)

Tuple of (rejection_rates, mean_rmses, std_rmses, mean_auc, std_auc).

Source code in uqdd/metrics/analysis.py
def calculate_rejection_curve(
    df: pd.DataFrame,
    model_paths: List[str],
    unc_col: str,
    random_rejection: bool = False,
    normalize_rmse: bool = False,
    max_rejection_ratio: float = 0.95,
) -> Tuple[np.ndarray, np.ndarray, np.ndarray, float, float]:
    """
    Aggregate RMSE–rejection curves across models and compute mean/std and AUC statistics.

    Parameters
    ----------
    df : pd.DataFrame
        Auxiliary DataFrame (not used directly, kept for API symmetry).
    model_paths : list of str
        Paths to model directories containing 'preds.pkl'.
    unc_col : str
        Uncertainty column name to use when computing curves (e.g., 'y_alea' or 'y_eps').
    random_rejection : bool, optional
        If True, randomly reject samples. Default is False.
    normalize_rmse : bool, optional
        If True, normalize RMSE by the initial RMSE. Default is False.
    max_rejection_ratio : float, optional
        Maximum fraction of samples to reject. Default is 0.95.

    Returns
    -------
    (numpy.ndarray, numpy.ndarray, numpy.ndarray, float, float)
        Tuple of (rejection_rates, mean_rmses, std_rmses, mean_auc, std_auc).
    """
    aggregated_rmses = []
    auc_values = []
    rejection_rates = None

    for model_path in model_paths:
        preds = load_predictions(model_path)
        if preds.empty:
            print(f"Preds not loaded for model: {model_path}")
            continue
        rejection_rates, rmses, auc_arc = calculate_rmse_rejection_curve(
            preds,
            uncertainty_col=unc_col,
            random_rejection=random_rejection,
            normalize_rmse=normalize_rmse,
            max_rejection_ratio=max_rejection_ratio,
        )
        aggregated_rmses.append(rmses)
        auc_values.append(auc_arc)

    mean_rmses = np.mean(aggregated_rmses, axis=0)
    std_rmses = np.std(aggregated_rmses, axis=0)
    mean_auc = np.mean(auc_values)
    std_auc = np.std(auc_values)
    return rejection_rates, mean_rmses, std_rmses, float(mean_auc), float(std_auc)

uqdd.metrics.analysis.get_handles_labels

get_handles_labels(ax: Axes, group_order: List[str]) -> Tuple[List, List[str]]

Extract legend handles/labels ordered by group prefix.

Parameters:

Name Type Description Default
ax Axes

Axes object from which to retrieve legend entries.

required
group_order list of str

Group prefixes to order legend entries by.

required

Returns:

Type Description
(list, list of str)

Ordered handles and labels.

Source code in uqdd/metrics/analysis.py
def get_handles_labels(ax: plt.Axes, group_order: List[str]) -> Tuple[List, List[str]]:
    """
    Extract legend handles/labels ordered by group prefix.

    Parameters
    ----------
    ax : matplotlib.axes.Axes
        Axes object from which to retrieve legend entries.
    group_order : list of str
        Group prefixes to order legend entries by.

    Returns
    -------
    (list, list of str)
        Ordered handles and labels.
    """
    handles, labels = ax.get_legend_handles_labels()
    ordered_handles = []
    ordered_labels = []
    for group in group_order:
        for label, handle in zip(labels, handles):
            if label.startswith(group):
                ordered_handles.append(handle)
                ordered_labels.append(label)
    return ordered_handles, ordered_labels

uqdd.metrics.analysis.plot_rmse_rejection_curves

plot_rmse_rejection_curves(df: DataFrame, base_dir: str, cmap: str = 'tab10_r', color_dict: Optional[Dict[str, str]] = None, save_dir_plot: Optional[str] = None, add_to_title: str = '', normalize_rmse: bool = False, unc_type: str = 'aleatoric', max_rejection_ratio: float = 0.95, group_order: Optional[List[str]] = None, fig_width: Optional[float] = None, fig_height: Optional[float] = None, show_legend: bool = False) -> pd.DataFrame

Plot RMSE–rejection curves per group, including random rejection baselines, and summarize AUCs.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing columns 'Group', 'Split', and 'project_model'.

required
base_dir str

Base directory where model paths are located.

required
cmap str

Colormap name used to derive distinct colors per group. Default is "tab10_r".

'tab10_r'
color_dict dict[str, str] or None

Precomputed color mapping from group to color. If None, one is generated.

None
save_dir_plot str or None

Directory to save the plot images. Default is None.

None
add_to_title str

Suffix for the plot filename and title. Default is empty string.

''
normalize_rmse bool

If True, normalize RMSE by initial RMSE. Default is False.

False
unc_type {"aleatoric", "epistemic", "both"}

Uncertainty component to use for rejection. Default is "aleatoric".

"aleatoric"
max_rejection_ratio float

Maximum fraction of samples to reject. Default is 0.95.

0.95
group_order list of str or None

Order of groups in the legend. Default derives from data.

None
fig_width float or None

Plot width. Default is 6.

None
fig_height float or None

Plot height. Default is 6.

None
show_legend bool

If True, include a legend. Default is False.

False

Returns:

Type Description
DataFrame

Summary DataFrame with columns ['Model type', 'Split', 'Group', 'AUC-RRC_mean', 'AUC-RRC_std'].

Source code in uqdd/metrics/analysis.py
def plot_rmse_rejection_curves(
    df: pd.DataFrame,
    base_dir: str,
    cmap: str = "tab10_r",
    color_dict: Optional[Dict[str, str]] = None,
    save_dir_plot: Optional[str] = None,
    add_to_title: str = "",
    normalize_rmse: bool = False,
    unc_type: str = "aleatoric",
    max_rejection_ratio: float = 0.95,
    group_order: Optional[List[str]] = None,
    fig_width: Optional[float] = None,
    fig_height: Optional[float] = None,
    show_legend: bool = False,
) -> pd.DataFrame:
    """
    Plot RMSE–rejection curves per group, including random rejection baselines, and summarize AUCs.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing columns 'Group', 'Split', and 'project_model'.
    base_dir : str
        Base directory where model paths are located.
    cmap : str, optional
        Colormap name used to derive distinct colors per group. Default is "tab10_r".
    color_dict : dict[str, str] or None, optional
        Precomputed color mapping from group to color. If None, one is generated.
    save_dir_plot : str or None, optional
        Directory to save the plot images. Default is None.
    add_to_title : str, optional
        Suffix for the plot filename and title. Default is empty string.
    normalize_rmse : bool, optional
        If True, normalize RMSE by initial RMSE. Default is False.
    unc_type : {"aleatoric", "epistemic", "both"}, optional
        Uncertainty component to use for rejection. Default is "aleatoric".
    max_rejection_ratio : float, optional
        Maximum fraction of samples to reject. Default is 0.95.
    group_order : list of str or None, optional
        Order of groups in the legend. Default derives from data.
    fig_width : float or None, optional
        Plot width. Default is 6.
    fig_height : float or None, optional
        Plot height. Default is 6.
    show_legend : bool, optional
        If True, include a legend. Default is False.

    Returns
    -------
    pd.DataFrame
        Summary DataFrame with columns ['Model type', 'Split', 'Group', 'AUC-RRC_mean', 'AUC-RRC_std'].
    """
    assert unc_type in ["aleatoric", "epistemic", "both"], "Invalid unc_type"
    unc_col = "y_alea" if unc_type == "aleatoric" else "y_eps"

    plot_width = fig_width if fig_width else 6
    plot_height = fig_height if fig_height else 6
    total_width = plot_width + 4
    total_height = plot_height + 2

    fig = plt.figure(figsize=(total_width, total_height))
    gs = gridspec.GridSpec(1, 1, figure=fig, left=0.15, right=0.75, top=0.9, bottom=0.15)
    ax = fig.add_subplot(gs[0])
    ax.set_position([0.15, 0.15, plot_width / total_width, plot_height / total_height])

    if group_order is None:
        group_order = list(df["Group"].unique())

    if color_dict is None:
        scalar_mappable = ScalarMappable(cmap=cmap)
        colors = scalar_mappable.to_rgba(range(len(group_order)))
        color_dict = {group: color for group, color in zip(group_order, colors)}

    color_dict["random reject"] = "black"

    df = df.copy()
    df.loc[:, "model_path"] = df["project_model"].apply(
        lambda x: (str(os.path.join(base_dir, x)) if not str(x).startswith(base_dir) else x)
    )

    stats_dfs = []
    included_groups = df["Group"].unique()
    legend_handles = []

    for group in included_groups:
        group_data = df[df["Group"] == group]
        model_paths = group_data["model_path"].unique()
        rejection_rates, mean_rmses, std_rmses, mean_auc, std_auc = calculate_rejection_curve(
            df, model_paths, unc_col, normalize_rmse=normalize_rmse, max_rejection_ratio=max_rejection_ratio
        )
        (line,) = ax.plot(
            rejection_rates,
            mean_rmses,
            label=f"{group} (AUC-RRC: {mean_auc:.3f} ± {std_auc:.3f})",
            color=color_dict[group],
        )
        ax.fill_between(rejection_rates, mean_rmses - std_rmses, mean_rmses + std_rmses, color=color_dict[group], alpha=0.2)
        legend_handles.append(line)
        stats_dfs.append({
            "Model type": group.rsplit("_", 1)[1],
            "Split": group.rsplit("_", 1)[0],
            "Group": group,
            "AUC-RRC_mean": mean_auc,
            "AUC-RRC_std": std_auc,
        })

    for split in df["Split"].unique():
        split_data = df[df["Split"] == split]
        model_paths = split_data["model_path"].unique()
        rejection_rates, mean_rmses, std_rmses, mean_auc, std_auc = calculate_rejection_curve(
            df, model_paths, unc_col, random_rejection=True, normalize_rmse=normalize_rmse, max_rejection_ratio=max_rejection_ratio
        )
        (line,) = ax.plot(
            rejection_rates,
            mean_rmses,
            label=f"random reject - {split} (AUC-RRC: {mean_auc:.3f} ± {std_auc:.3f})",
            color="black",
            linestyle="--",
        )
        ax.fill_between(rejection_rates, mean_rmses - std_rmses, mean_rmses + std_rmses, color="grey", alpha=0.2)
        legend_handles.append(line)
        stats_dfs.append({
            "Model type": "random reject",
            "Split": split,
            "Group": f"random reject - {split}",
            "AUC-RRC_mean": mean_auc,
            "AUC-RRC_std": std_auc,
        })

    ax.set_xlabel("Rejection Rate")
    ax.set_ylabel("RMSE" if not normalize_rmse else "Normalized RMSE")
    ax.set_xlim(0, max_rejection_ratio)
    ax.grid(True)

    if show_legend:
        ordered_handles, ordered_labels = get_handles_labels(ax, group_order)
        ordered_handles += [legend_handles[-1]]
        ordered_labels += [legend_handles[-1].get_label()]
        ax.legend(handles=ordered_handles, loc="lower left")

    plot_name = f"rmse_rejection_curve_{add_to_title}" if add_to_title else "rmse_rejection_curve"
    save_plot(fig, save_dir_plot, plot_name, tighten=True, show_legend=show_legend)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

    return pd.DataFrame(stats_dfs)
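
A sketch of the expected input (directories are placeholders): each row points at one trained model whose directory contains `preds.pkl`, and group names follow the `<split>_<model type>` pattern the function splits on.

```python
import pandas as pd

from uqdd.metrics import plot_rmse_rejection_curves

df = pd.DataFrame({
    "Group": ["stratified_pnn", "stratified_pnn",
              "stratified_ensemble", "stratified_ensemble"],
    "Split": ["stratified"] * 4,
    "project_model": ["pnn_seed0", "pnn_seed1",
                      "ensemble_seed0", "ensemble_seed1"],   # placeholder sub-directories
})

stats_df = plot_rmse_rejection_curves(
    df,
    base_dir="runs/papyrus_xc50",    # placeholder base directory
    unc_type="aleatoric",
    show_legend=True,
)
# stats_df columns: Model type, Split, Group, AUC-RRC_mean, AUC-RRC_std
```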

uqdd.metrics.analysis.plot_auc_comparison

plot_auc_comparison(stats_df: DataFrame, cmap: str = 'tab10_r', color_dict: Optional[Dict[str, str]] = None, save_dir: Optional[str] = None, add_to_title: str = '', min_y_axis: float = 0.0, hatches_dict: Optional[Dict[str, str]] = None, group_order: Optional[List[str]] = None, fig_width: Optional[float] = None, fig_height: Optional[float] = None, show_legend: bool = False) -> None

Plot bar charts comparing RRC-AUC across splits and model types, including random reject baselines.

Parameters:

Name Type Description Default
stats_df DataFrame

Summary DataFrame with columns ['Group', 'Split', 'Model type', 'AUC-RRC_mean', 'AUC-RRC_std'].

required
cmap str

Colormap name used to derive distinct colors per model type. Default is "tab10_r".

'tab10_r'
color_dict dict[str, str] or None

Precomputed color mapping from model type to color. If None, one is generated.

None
save_dir str or None

Directory to save plot images. Default is None.

None
add_to_title str

Title suffix for the plot. Default is empty string.

''
min_y_axis float

Minimum y-axis limit. Default is 0.0.

0.0
hatches_dict dict[str, str] or None

Hatch mapping for splits (e.g., {"stratified": "\\"}). If None, sensible defaults are used.

None
group_order list of str or None

Order of groups in the legend and x-axis. Default derives from data.

None
fig_width float or None

Plot width. Default is 6.

None
fig_height float or None

Plot height. Default is 6.

None
show_legend bool

If True, include a legend. Default is False.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def plot_auc_comparison(
    stats_df: pd.DataFrame,
    cmap: str = "tab10_r",
    color_dict: Optional[Dict[str, str]] = None,
    save_dir: Optional[str] = None,
    add_to_title: str = "",
    min_y_axis: float = 0.0,
    hatches_dict: Optional[Dict[str, str]] = None,
    group_order: Optional[List[str]] = None,
    fig_width: Optional[float] = None,
    fig_height: Optional[float] = None,
    show_legend: bool = False,
) -> None:
    """
    Plot bar charts comparing RRC-AUC across splits and model types, including random reject baselines.

    Parameters
    ----------
    stats_df : pd.DataFrame
        Summary DataFrame with columns ['Group', 'Split', 'Model type', 'AUC-RRC_mean', 'AUC-RRC_std'].
    cmap : str, optional
        Colormap name used to derive distinct colors per model type. Default is "tab10_r".
    color_dict : dict[str, str] or None, optional
        Precomputed color mapping from model type to color. If None, one is generated.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    add_to_title : str, optional
        Title suffix for the plot. Default is empty string.
    min_y_axis : float, optional
        Minimum y-axis limit. Default is 0.0.
    hatches_dict : dict[str, str] or None, optional
        Hatch mapping for splits (e.g., {"stratified": "\\\\"}). Default uses sensible defaults.
    group_order : list of str or None, optional
        Order of groups in the legend and x-axis. Default derives from data.
    fig_width : float or None, optional
        Plot width. Default is 6.
    fig_height : float or None, optional
        Plot height. Default is 6.
    show_legend : bool, optional
        If True, include a legend. Default is False.

    Returns
    -------
    None
    """
    if hatches_dict is None:
        hatches_dict = {"stratified": "\\\\", "scaffold_cluster": "", "time": "/\\/\\/"}

    if group_order:
        all_groups = group_order + list(stats_df.loc[stats_df["Group"].str.startswith("random reject"), "Group"].unique())
        stats_df["Group"] = pd.Categorical(stats_df["Group"], categories=all_groups, ordered=True)
    else:
        all_groups = stats_df["Group"].unique().tolist()

    stats_df = stats_df.sort_values("Group").reset_index(drop=True)

    splits = list(hatches_dict.keys())
    stats_df.loc[:, "Split"] = pd.Categorical(stats_df["Split"], categories=splits, ordered=True)
    stats_df = stats_df.sort_values("Split").reset_index(drop=True)

    unique_model_types = stats_df.loc[stats_df["Model type"] != "random reject", "Model type"].unique()

    if color_dict is None:
        scalar_mappable = ScalarMappable(cmap=cmap)
        colors = scalar_mappable.to_rgba(range(len(unique_model_types)))
        color_dict = {model: color for model, color in zip(unique_model_types, colors)}
    color_dict["random reject"] = "black"

    unique_model_types = np.append(unique_model_types, "random reject")

    bar_width = 0.12
    group_spacing = 0.6

    plot_width = fig_width if fig_width else 6
    plot_height = fig_height if fig_height else 6
    total_width = plot_width + 4
    total_height = plot_height + 4

    fig = plt.figure(figsize=(total_width, total_height))
    gs = gridspec.GridSpec(1, 1, figure=fig, left=0.15, right=0.75, top=0.9, bottom=0.15)
    ax = fig.add_subplot(gs[0])
    ax.set_position([0.15, 0.15, plot_width / total_width, plot_height / total_height])

    tick_positions = []
    tick_labels = []

    for i, split in enumerate(splits):
        split_data = stats_df[stats_df["Split"] == split]
        split_data.loc[:, "Group"] = pd.Categorical(split_data["Group"], categories=all_groups, ordered=True)
        for j, (_, row) in enumerate(split_data.iterrows()):
            position = i * (len(unique_model_types) * bar_width + group_spacing) + j * bar_width
            ax.bar(
                position,
                height=row["AUC-RRC_mean"],
                yerr=row["AUC-RRC_std"],
                color=color_dict[row["Model type"]],
                edgecolor="white" if row["Model type"] == "random reject" else "black",
                hatch=hatches_dict[row["Split"]],
                width=bar_width,
            )
        center_position = i * (len(unique_model_types) * bar_width + group_spacing) + (len(unique_model_types) * bar_width) / 2
        tick_positions.append(center_position)
        tick_labels.append(split)

    def create_stats_legend(color_dict: Dict[str, str], hatches_dict: Dict[str, str], splits: List[str], model_types: Union[List[str], np.ndarray]):
        patches = []
        for split in splits:
            for model in model_types:
                label = f"{split} {model}"
                hatch_color = "white" if model == "random reject" else "black"
                patch = mpatches.Patch(facecolor=color_dict[model], hatch=hatches_dict[split], edgecolor=hatch_color, label=label)
                patches.append(patch)
        return patches

    if show_legend:
        legend_elements = create_stats_legend(color_dict, hatches_dict, splits, unique_model_types)
        ax.legend(handles=legend_elements, bbox_to_anchor=(1.05, 1), loc="upper left", borderaxespad=0, frameon=False)

    ax.set_xticks(tick_positions)
    ax.set_xticklabels(tick_labels, rotation=45, ha="right", rotation_mode="anchor", fontsize=9)
    ax.set_ylabel("RRC-AUC")
    ax.set_ylim(min_y_axis, 1.0)

    plot_name = f"auc_comparison_barplot_{cmap}" + (f"_{add_to_title}" if add_to_title else "")
    save_plot(fig, save_dir, plot_name, tighten=True, show_legend=show_legend)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
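
A sketch using a hand-written summary table in the shape returned by plot_rmse_rejection_curves (all values are illustrative only):

```python
import pandas as pd

from uqdd.metrics import plot_auc_comparison

stats_df = pd.DataFrame([
    {"Model type": "pnn", "Split": "stratified", "Group": "stratified_pnn",
     "AUC-RRC_mean": 0.41, "AUC-RRC_std": 0.02},
    {"Model type": "ensemble", "Split": "stratified", "Group": "stratified_ensemble",
     "AUC-RRC_mean": 0.38, "AUC-RRC_std": 0.03},
    {"Model type": "random reject", "Split": "stratified",
     "Group": "random reject - stratified",
     "AUC-RRC_mean": 0.55, "AUC-RRC_std": 0.01},
])

plot_auc_comparison(stats_df, add_to_title="aleatoric", show_legend=True)
```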

uqdd.metrics.analysis.save_stats_df

save_stats_df(stats_df: DataFrame, save_dir: str, add_to_title: str = '') -> None

Save a stats DataFrame to CSV in a given directory.

Parameters:

Name Type Description Default
stats_df DataFrame

DataFrame to save.

required
save_dir str

Target directory to save the CSV.

required
add_to_title str

Suffix to append to the filename. Default is empty string.

''

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def save_stats_df(stats_df: pd.DataFrame, save_dir: str, add_to_title: str = "") -> None:
    """
    Save a stats DataFrame to CSV in a given directory.

    Parameters
    ----------
    stats_df : pd.DataFrame
        DataFrame to save.
    save_dir : str
        Target directory to save the CSV.
    add_to_title : str, optional
        Suffix to append to the filename. Default is empty string.

    Returns
    -------
    None
    """
    os.makedirs(save_dir, exist_ok=True)
    stats_df.to_csv(os.path.join(save_dir, f"stats_df_{add_to_title}.csv"), index=False)

uqdd.metrics.analysis.load_stats_df

load_stats_df(save_dir: str, add_to_title: str = '') -> pd.DataFrame

Load a stats DataFrame from CSV in a given directory.

Parameters:

Name Type Description Default
save_dir str

Directory containing the CSV.

required
add_to_title str

Suffix appended to the filename. Default is empty string.

''

Returns:

Type Description
DataFrame

Loaded DataFrame.

Source code in uqdd/metrics/analysis.py
def load_stats_df(save_dir: str, add_to_title: str = "") -> pd.DataFrame:
    """
    Load a stats DataFrame from CSV in a given directory.

    Parameters
    ----------
    save_dir : str
        Directory containing the CSV.
    add_to_title : str, optional
        Suffix appended to the filename. Default is empty string.

    Returns
    -------
    pd.DataFrame
        Loaded DataFrame.
    """
    return pd.read_csv(os.path.join(save_dir, f"stats_df_{add_to_title}.csv"))
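
A minimal round-trip sketch for the two helpers above (the directory is a placeholder):

```python
import pandas as pd

from uqdd.metrics import load_stats_df, save_stats_df

stats_df = pd.DataFrame({"Group": ["stratified_pnn"],
                         "AUC-RRC_mean": [0.41], "AUC-RRC_std": [0.02]})

save_stats_df(stats_df, save_dir="results/rrc", add_to_title="aleatoric")  # writes stats_df_aleatoric.csv
reloaded = load_stats_df("results/rrc", add_to_title="aleatoric")
assert list(reloaded.columns) == list(stats_df.columns)
```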

uqdd.metrics.constants

uqdd.metrics.reassessment

Model reassessment utilities: loading trained models, generating predictions, computing NLL, exporting artifacts, and recalibrating with isotonic regression.

This module wires together model loaders and predictors to re-run evaluation on saved runs, export standardized prediction pickles, append NLL to CSV logs, and apply isotonic recalibration using validation data.

uqdd.metrics.reassessment.nll_evidentials

nll_evidentials(evidential_model, test_dataloader, model_type: str = 'evidential', num_mc_samples: int = 100, device=DEVICE)

Compute negative log-likelihood (NLL) for evidential-style models.

Parameters:

Name Type Description Default
evidential_model Module

Trained model instance.

required
test_dataloader DataLoader

DataLoader providing test set batches.

required
model_type {"evidential", "eoe", "emc"}

Model family determining the NLL backend. Default is "evidential".

"evidential"
num_mc_samples int

Number of MC samples for EMC models. Default is 100.

100
device device

Device to run evaluation on. Default uses DEVICE.

DEVICE

Returns:

Type Description
float or None

Scalar NLL if supported by the model type; None otherwise.

Source code in uqdd/metrics/reassessment.py
def nll_evidentials(
    evidential_model,
    test_dataloader,
    model_type: str = "evidential",
    num_mc_samples: int = 100,
    device=DEVICE,
):
    """
    Compute negative log-likelihood (NLL) for evidential-style models.

    Parameters
    ----------
    evidential_model : torch.nn.Module
        Trained model instance.
    test_dataloader : torch.utils.data.DataLoader
        DataLoader providing test set batches.
    model_type : {"evidential", "eoe", "emc"}, optional
        Model family determining the NLL backend. Default is "evidential".
    num_mc_samples : int, optional
        Number of MC samples for EMC models. Default is 100.
    device : torch.device, optional
        Device to run evaluation on. Default uses `DEVICE`.

    Returns
    -------
    float or None
        Scalar NLL if supported by the model type; None otherwise.
    """
    if model_type in ["evidential", "eoe"]:
        return ev_nll(evidential_model, test_dataloader, device=device)
    elif model_type == "emc":
        return emc_nll(evidential_model, test_dataloader, num_mc_samples=num_mc_samples, device=device)
    else:
        return None

uqdd.metrics.reassessment.convert_to_list

convert_to_list(val)

Parse a string representation of a Python list to a list; pass through non-strings.

Parameters:

Name Type Description Default
val str or any

Input value, possibly a string encoding of a list.

required

Returns:

Type Description
list

Parsed list if val is a valid string list, empty list on parse failure.

any

Original value if not a string.

Notes
  • Uses ast.literal_eval for safe evaluation.
  • Prints a warning and returns [] when parsing fails.
Source code in uqdd/metrics/reassessment.py
def convert_to_list(val):
    """
    Parse a string representation of a Python list to a list; pass through non-strings.

    Parameters
    ----------
    val : str or any
        Input value, possibly a string encoding of a list.

    Returns
    -------
    list
        Parsed list if `val` is a valid string list, empty list on parse failure.
    any
        Original value if not a string.

    Notes
    -----
    - Uses `ast.literal_eval` for safe evaluation.
    - Prints a warning and returns [] when parsing fails.
    """
    if isinstance(val, str):
        try:
            parsed_val = ast.literal_eval(val)
            if isinstance(parsed_val, list):
                return parsed_val
            else:
                return []
        except (SyntaxError, ValueError):
            print(f"Warning: Unable to parse value {val}, returning empty list.")
            return []
    return val
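
Behaviour in the three cases described above:

```python
from uqdd.metrics import convert_to_list

convert_to_list("[512, 256, 128]")   # -> [512, 256, 128]
convert_to_list("not a list")        # -> [] (a warning is printed)
convert_to_list([64, 32])            # -> [64, 32] (non-strings pass through unchanged)
```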

uqdd.metrics.reassessment.preprocess_runs

preprocess_runs(runs_path: str, models_dir: str = MODELS_DIR, data_name: str = 'papyrus', activity_type: str = 'xc50', descriptor_protein: str = 'ankh-large', descriptor_chemical: str = 'ecfp2048', data_specific_path: str = 'papyrus/xc50/all', prot_input_dim: int = 1536, chem_input_dim: int = 2048) -> pd.DataFrame

Read a runs CSV and enrich with resolved model paths and descriptor metadata.

Parameters:

Name Type Description Default
runs_path str

Path to the CSV file containing run metadata.

required
models_dir str

Directory containing trained model .pt files. Default uses MODELS_DIR.

MODELS_DIR
data_name str

Dataset identifier. Default is "papyrus".

'papyrus'
activity_type str

Activity type (e.g., "xc50", "kc"). Default is "xc50".

'xc50'
descriptor_protein str

Protein descriptor type. Default is "ankh-large".

'ankh-large'
descriptor_chemical str

Chemical descriptor type. Default is "ecfp2048".

'ecfp2048'
data_specific_path str

Subpath encoding dataset context for figures/exports. Default is "papyrus/xc50/all".

'papyrus/xc50/all'
prot_input_dim int

Protein input dimensionality. Default is 1536.

1536
chem_input_dim int

Chemical input dimensionality. Default is 2048.

2048

Returns:

Type Description
DataFrame

Preprocessed runs DataFrame with columns like 'model_name', 'model_path', and descriptor fields.

Notes
  • Resolves model_name to actual .pt files via glob and sets 'model_path'.
  • Adds multi-task flag 'MT' from 'n_targets' > 1.
  • Converts layer columns from strings to lists using convert_to_list.
Source code in uqdd/metrics/reassessment.py
def preprocess_runs(
    runs_path: str,
    models_dir: str = MODELS_DIR,
    data_name: str = "papyrus",
    activity_type: str = "xc50",
    descriptor_protein: str = "ankh-large",
    descriptor_chemical: str = "ecfp2048",
    data_specific_path: str = "papyrus/xc50/all",
    prot_input_dim: int = 1536,
    chem_input_dim: int = 2048,
) -> pd.DataFrame:
    """
    Read a runs CSV and enrich with resolved model paths and descriptor metadata.

    Parameters
    ----------
    runs_path : str
        Path to the CSV file containing run metadata.
    models_dir : str, optional
        Directory containing trained model .pt files. Default uses `MODELS_DIR`.
    data_name : str, optional
        Dataset identifier. Default is "papyrus".
    activity_type : str, optional
        Activity type (e.g., "xc50", "kc"). Default is "xc50".
    descriptor_protein : str, optional
        Protein descriptor type. Default is "ankh-large".
    descriptor_chemical : str, optional
        Chemical descriptor type. Default is "ecfp2048".
    data_specific_path : str, optional
        Subpath encoding dataset context for figures/exports. Default is "papyrus/xc50/all".
    prot_input_dim : int, optional
        Protein input dimensionality. Default is 1536.
    chem_input_dim : int, optional
        Chemical input dimensionality. Default is 2048.

    Returns
    -------
    pd.DataFrame
        Preprocessed runs DataFrame with columns like 'model_name', 'model_path', and descriptor fields.

    Notes
    -----
    - Resolves `model_name` to actual .pt files via glob and sets 'model_path'.
    - Adds multi-task flag 'MT' from 'n_targets' > 1.
    - Converts layer columns from strings to lists using `convert_to_list`.
    """
    runs_df = pd.read_csv(
        runs_path,
        converters={
            "chem_layers": convert_to_list,
            "prot_layers": convert_to_list,
            "regressor_layers": convert_to_list,
        },
    )
    runs_df.rename(columns={"Name": "run_name"}, inplace=True)
    i = 1
    for index, row in runs_df.iterrows():
        model_name = row["model_name"] if not pd.isna(row["model_name"]) else row["run_name"]
        model_file_pattern = os.path.join(models_dir, f"*{model_name}.pt")
        model_files = glob.glob(model_file_pattern)
        if model_files:
            model_file_path = model_files[0]
            model_name = os.path.basename(model_file_path).replace(".pt", "")
            runs_df.at[index, "model_name"] = model_name
            runs_df.at[index, "model_path"] = model_file_path
        else:
            print(f"{i} Model file(s) not found for {model_name} \n with pattern {model_file_pattern}")
            runs_df.at[index, "model_path"] = ""
            i += 1
    runs_df["data_name"] = data_name
    runs_df["activity_type"] = activity_type
    runs_df["descriptor_protein"] = descriptor_protein
    runs_df["descriptor_chemical"] = descriptor_chemical
    runs_df["chem_input_dim"] = chem_input_dim
    runs_df["prot_input_dim"] = prot_input_dim
    runs_df["data_specific_path"] = data_specific_path
    runs_df["MT"] = runs_df["n_targets"].apply(lambda x: True if x > 1 else False)
    return runs_df
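
A minimal usage sketch; the CSV path is illustrative, and the file is assumed to provide the columns this function reads (e.g., 'Name', 'model_name', 'n_targets', and the layer columns):

from uqdd.metrics.reassessment import preprocess_runs

# Hypothetical export of run metadata; adjust the path to your own runs CSV.
runs_df = preprocess_runs("runs_export.csv")

# Rows whose checkpoint could not be resolved carry an empty 'model_path'.
resolved = runs_df[runs_df["model_path"] != ""]
print(resolved[["run_name", "model_name", "model_path", "MT"]].head())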

uqdd.metrics.reassessment.get_model_class

get_model_class(model_type: str)

Map a model type name to the corresponding class.

Parameters:

Name Type Description Default
model_type str

Model type identifier (e.g., "pnn", "ensemble", "evidential", "eoe", "emc", "mcdropout").

required

Returns:

Type Description
type

Model class matching the type.

Raises:

Type Description
ValueError

If the model_type is not recognized.

Source code in uqdd/metrics/reassessment.py
def get_model_class(model_type: str):
    """
    Map a model type name to the corresponding class.

    Parameters
    ----------
    model_type : str
        Model type identifier (e.g., "pnn", "ensemble", "evidential", "eoe", "emc", "mcdropout").

    Returns
    -------
    type
        Model class matching the type.

    Raises
    ------
    ValueError
        If the `model_type` is not recognized.
    """
    if model_type.lower() in ["pnn", "mcdropout"]:
        return PNN
    elif model_type.lower() == "ensemble":
        return EnsembleDNN
    elif model_type.lower() in ["evidential", "emc"]:
        return EvidentialDNN
    elif model_type.lower() == "eoe":
        return EoEDNN
    else:
        raise ValueError(f"Model type {model_type} not recognized")
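
A quick sketch of the mapping implemented above:

from uqdd.metrics.reassessment import get_model_class

ensemble_cls = get_model_class("ensemble")    # EnsembleDNN
mc_cls = get_model_class("mcdropout")         # MC Dropout reuses the PNN class
try:
    get_model_class("transformer")            # unsupported identifier
except ValueError as err:
    print(err)                                # "Model type transformer not recognized"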

uqdd.metrics.reassessment.get_predict_fn

get_predict_fn(model_type: str, num_mc_samples: int = 100)

Get the appropriate predict function and kwargs for a given model type.

Parameters:

Name Type Description Default
model_type str

Model type identifier.

required
num_mc_samples int

Number of MC samples for MC Dropout or EMC models. Default is 100.

100

Returns:

Type Description
(callable, dict)

Tuple of (predict_function, keyword_arguments).

Raises:

Type Description
ValueError

If the model_type is not recognized.

Source code in uqdd/metrics/reassessment.py
def get_predict_fn(model_type: str, num_mc_samples: int = 100):
    """
    Get the appropriate predict function and kwargs for a given model type.

    Parameters
    ----------
    model_type : str
        Model type identifier.
    num_mc_samples : int, optional
        Number of MC samples for MC Dropout or EMC models. Default is 100.

    Returns
    -------
    (callable, dict)
        Tuple of (predict_function, keyword_arguments).

    Raises
    ------
    ValueError
        If the `model_type` is not recognized.
    """
    if model_type.lower() == "mcdropout":
        return mc_predict, {"num_mc_samples": num_mc_samples}
    elif model_type.lower() in ["ensemble", "pnn"]:
        return predict, {}
    elif model_type.lower() in ["evidential", "eoe"]:
        return ev_predict, {}
    elif model_type.lower() == "emc":
        return emc_predict, {"num_mc_samples": num_mc_samples}
    else:
        raise ValueError(f"Model type {model_type} not recognized")
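
A short sketch of how the returned pair is typically used downstream:

from uqdd.metrics.reassessment import get_predict_fn

predict_fn, predict_kwargs = get_predict_fn("mcdropout", num_mc_samples=50)
# predict_fn is the MC-dropout predictor; predict_kwargs == {"num_mc_samples": 50}

predict_fn, predict_kwargs = get_predict_fn("pnn")
# deterministic-style predictors come back with an empty kwargs dict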

uqdd.metrics.reassessment.get_preds

get_preds(model, dataloaders, model_type: str, subset: str = 'test', num_mc_samples: int = 100)

Run inference and unpack predictions for the requested subset.

Parameters:

Name Type Description Default
model Module

Trained model instance.

required
dataloaders dict

Dictionary of DataLoaders keyed by subset (e.g., 'train', 'val', 'test').

required
model_type str

Model type determining the predict function and outputs.

required
subset str

Subset key to use from dataloaders. Default is "test".

'test'
num_mc_samples int

Number of MC samples for stochastic predictors. Default is 100.

100

Returns:

Type Description
tuple

(preds, labels, alea_vars, epi_vars) where epi_vars may be None for non-evidential models.

Source code in uqdd/metrics/reassessment.py
def get_preds(
    model,
    dataloaders,
    model_type: str,
    subset: str = "test",
    num_mc_samples: int = 100,
):
    """
    Run inference and unpack predictions for the requested subset.

    Parameters
    ----------
    model : torch.nn.Module
        Trained model instance.
    dataloaders : dict
        Dictionary of DataLoaders keyed by subset (e.g., 'train', 'val', 'test').
    model_type : str
        Model type determining the predict function and outputs.
    subset : str, optional
        Subset key to use from `dataloaders`. Default is "test".
    num_mc_samples : int, optional
        Number of MC samples for stochastic predictors. Default is 100.

    Returns
    -------
    tuple
        (preds, labels, alea_vars, epi_vars) where `epi_vars` may be None for non-evidential models.
    """
    predict_fn, predict_kwargs = get_predict_fn(model_type, num_mc_samples=num_mc_samples)
    preds_res = predict_fn(model, dataloaders[subset], device=DEVICE, **predict_kwargs)
    if model_type in ["evidential", "eoe", "emc"]:
        preds, labels, alea_vars, epi_vars = preds_res
    else:
        preds, labels, alea_vars = preds_res
        epi_vars = None
    return preds, labels, alea_vars, epi_vars
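
A hedged sketch, assuming a trained model and a dict of DataLoaders (keyed by 'train'/'val'/'test') are already in scope:

from uqdd.metrics.reassessment import get_preds

# `model` and `dataloaders` are assumed to exist already, e.g. a reloaded
# checkpoint and the loaders produced by the project's data pipeline.
preds, labels, alea_vars, epi_vars = get_preds(
    model, dataloaders, model_type="evidential", subset="test"
)
# For "pnn"/"ensemble" predictors, epi_vars is returned as None.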

uqdd.metrics.reassessment.pkl_preds_export

pkl_preds_export(preds, labels, alea_vars, epi_vars, outpath: str, model_type: str, logger=None)

Export predictions and uncertainties to a standardized pickle and return the DataFrame.

Parameters:

Name Type Description Default
preds ndarray or Tensor

Model predictions.

required
labels ndarray or Tensor

True labels.

required
alea_vars ndarray or Tensor

Aleatoric uncertainty components.

required
epi_vars ndarray or Tensor or None

Epistemic uncertainty components, or None for non-evidential models.

required
outpath str

Output directory to write 'preds.pkl'.

required
model_type str

Model type used to guide process_preds behavior.

required
logger Logger or None

Logger for messages. Default is None.

None

Returns:

Type Description
DataFrame

DataFrame with columns [y_true, y_pred, y_err, y_alea, y_eps].

Source code in uqdd/metrics/reassessment.py
def pkl_preds_export(
    preds,
    labels,
    alea_vars,
    epi_vars,
    outpath: str,
    model_type: str,
    logger=None,
):
    """
    Export predictions and uncertainties to a standardized pickle and return the DataFrame.

    Parameters
    ----------
    preds : numpy.ndarray or torch.Tensor
        Model predictions.
    labels : numpy.ndarray or torch.Tensor
        True labels.
    alea_vars : numpy.ndarray or torch.Tensor
        Aleatoric uncertainty components.
    epi_vars : numpy.ndarray or torch.Tensor or None
        Epistemic uncertainty components, or None for non-evidential models.
    outpath : str
        Output directory to write 'preds.pkl'.
    model_type : str
        Model type used to guide `process_preds` behavior.
    logger : logging.Logger or None, optional
        Logger for messages. Default is None.

    Returns
    -------
    pd.DataFrame
        DataFrame with columns [y_true, y_pred, y_err, y_alea, y_eps].
    """
    y_true, y_pred, y_err, y_alea, y_eps = process_preds(preds, labels, alea_vars, epi_vars, None, model_type)
    df = create_df_preds(y_true=y_true, y_pred=y_pred, y_err=y_err, y_alea=y_alea, y_eps=y_eps, export=False, logger=logger)
    df.to_pickle(os.path.join(outpath, "preds.pkl"))
    return df
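
A follow-on sketch (continuing from the get_preds example above); the output directory is illustrative and created up front because the function only writes into it:

import os
from uqdd.metrics.reassessment import pkl_preds_export

outdir = "figs/reassessment/my_model"          # illustrative path
os.makedirs(outdir, exist_ok=True)
df_preds = pkl_preds_export(preds, labels, alea_vars, epi_vars,
                            outpath=outdir, model_type="evidential")
# Writes <outdir>/preds.pkl and returns the same DataFrame.
print(df_preds.columns.tolist())               # ['y_true', 'y_pred', 'y_err', 'y_alea', 'y_eps']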

uqdd.metrics.reassessment.csv_nll_post_processing

csv_nll_post_processing(csv_path: str) -> None

Normalize NLL values in a CSV by taking the first value per model name.

Parameters:

Name Type Description Default
csv_path str

Path to the CSV file containing 'model name' and 'NLL' columns.

required

Returns:

Type Description
None
Source code in uqdd/metrics/reassessment.py
def csv_nll_post_processing(csv_path: str) -> None:
    """
    Normalize NLL values in a CSV by taking the first value per model name.

    Parameters
    ----------
    csv_path : str
        Path to the CSV file containing 'model name' and 'NLL' columns.

    Returns
    -------
    None
    """
    df = pd.read_csv(csv_path)
    df["NLL"] = df.groupby("model name")["NLL"].transform("first")
    df.to_csv(csv_path, index=False)

uqdd.metrics.reassessment.reassess_metrics

reassess_metrics(runs_df: DataFrame, figs_out_path: str, csv_out_path: str, project_out_name: str, logger) -> None

Reassess metrics for each run: reload model, predict, compute NLL, evaluate, and recalibrate.

Parameters:

Name Type Description Default
runs_df DataFrame

Preprocessed runs DataFrame with resolved 'model_path' and configuration fields.

required
figs_out_path str

Directory where per-model figures and prediction pickles are saved.

required
csv_out_path str

Path to a CSV for logging metrics (passed to evaluate_predictions).

required
project_out_name str

Name used for grouping results in downstream logging.

required
logger Logger

Logger instance used throughout evaluation and recalibration.

required

Returns:

Type Description
None
Notes
  • Skips models that have already been reassessed, i.e., whose figure output directory already exists.
  • Uses validation split for isotonic recalibration and logs final metrics.
Source code in uqdd/metrics/reassessment.py
def reassess_metrics(
    runs_df: pd.DataFrame,
    figs_out_path: str,
    csv_out_path: str,
    project_out_name: str,
    logger,
) -> None:
    """
    Reassess metrics for each run: reload model, predict, compute NLL, evaluate, and recalibrate.

    Parameters
    ----------
    runs_df : pd.DataFrame
        Preprocessed runs DataFrame with resolved 'model_path' and configuration fields.
    figs_out_path : str
        Directory where per-model figures and prediction pickles are saved.
    csv_out_path : str
        Path to a CSV for logging metrics (passed to `evaluate_predictions`).
    project_out_name : str
        Name used for grouping results in downstream logging.
    logger : logging.Logger
        Logger instance used throughout evaluation and recalibration.

    Returns
    -------
    None

    Notes
    -----
    - Skips models that have already been reassessed, i.e., whose figure output directory already exists.
    - Uses validation split for isotonic recalibration and logs final metrics.
    """
    runs_df = runs_df.sample(frac=1).reset_index(drop=True)
    for index, row in runs_df.iterrows():
        model_path = row["model_path"]
        model_name = row["model_name"]
        run_name = row["run_name"]
        rowkwargs = row.to_dict()
        model_type = rowkwargs.pop("model_type")
        activity_type = rowkwargs.pop("activity_type")
        if model_path:
            model_fig_out_path = os.path.join(figs_out_path, model_name)
            if os.path.exists(model_fig_out_path):
                print(f"Model {model_name} already reassessed")
                continue
            os.makedirs(model_fig_out_path, exist_ok=True)
            config = get_model_config(model_type=model_type, activity_type=activity_type, **rowkwargs)
            num_mc_samples = config.get("num_mc_samples", 100)
            model_class = get_model_class(model_type)
            prefix = "models." if model_type == "eoe" else ""
            model = load_model(model_class, model_path, prefix_to_state_keys=prefix, config=config).to(DEVICE)
            dataloaders = get_dataloader(config, device=DEVICE, logger=logger)
            preds, labels, alea_vars, epi_vars = get_preds(model, dataloaders, model_type, subset="test", num_mc_samples=num_mc_samples)
            nll = nll_evidentials(model, dataloaders["test"], model_type=model_type, num_mc_samples=num_mc_samples, device=DEVICE)
            df = pkl_preds_export(preds, labels, alea_vars, epi_vars, model_fig_out_path, model_type, logger=logger)
            metrics, plots, uct_logger = evaluate_predictions(
                config,
                preds,
                labels,
                alea_vars,
                model_type,
                logger,
                epi_vars=epi_vars,
                wandb_push=False,
                run_name=config["run_name"],
                project_name=project_out_name,
                figpath=model_fig_out_path,
                export_preds=False,
                verbose=False,
                csv_path=csv_out_path,
                nll=nll,
            )
            preds_val, labels_val, alea_vars_val, epi_vars_val = get_preds(model, dataloaders, model_type, subset="val", num_mc_samples=num_mc_samples)
            nll = nll_evidentials(model, dataloaders["val"], model_type=model_type, num_mc_samples=num_mc_samples, device=DEVICE)
            iso_recal_model = recalibrate_model(
                preds_val,
                labels_val,
                alea_vars_val,
                preds,
                labels,
                alea_vars,
                config=config,
                epi_val=epi_vars_val,
                epi_test=epi_vars,
                uct_logger=uct_logger,
                figpath=model_fig_out_path,
                nll=nll,
            )
            uct_logger.csv_log()

uqdd.metrics.stats

Statistical utilities for metrics analysis and significance testing.

This module includes helpers to compute descriptive statistics, confidence intervals, bootstrap aggregates, correlation and significance tests, and summary tables to support model evaluation and reporting.

uqdd.metrics.stats.calc_regression_metrics

calc_regression_metrics(df, cycle_col, val_col, pred_col, thresh)

Compute regression and thresholded classification metrics per cycle/method/split.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing true and predicted values.

required
cycle_col str

Column name identifying cross-validation cycles.

required
val_col str

Column with true target values.

required
pred_col str

Column with predicted target values.

required
thresh float

Threshold to derive binary classes for precision/recall.

required

Returns:

Type Description
DataFrame

Metrics per (cv_cycle, method, split) with columns ['mae', 'mse', 'r2', 'rho', 'prec', 'recall'].

Source code in uqdd/metrics/stats.py
def calc_regression_metrics(df, cycle_col, val_col, pred_col, thresh):
    """
    Compute regression and thresholded classification metrics per cycle/method/split.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame containing true and predicted values.
    cycle_col : str
        Column name identifying cross-validation cycles.
    val_col : str
        Column with true target values.
    pred_col : str
        Column with predicted target values.
    thresh : float
        Threshold to derive binary classes for precision/recall.

    Returns
    -------
    pd.DataFrame
        Metrics per (cv_cycle, method, split) with columns ['mae', 'mse', 'r2', 'rho', 'prec', 'recall'].
    """
    df_in = df.copy()
    metric_ls = ["mae", "mse", "r2", "rho", "prec", "recall"]
    metric_list = []
    df_in["true_class"] = df_in[val_col] > thresh
    assert len(df_in.true_class.unique()) == 2, "Binary classification requires two classes"
    df_in["pred_class"] = df_in[pred_col] > thresh

    for k, v in df_in.groupby([cycle_col, "method", "split"]):
        cycle, method, split = k
        mae = mean_absolute_error(v[val_col], v[pred_col])
        mse = mean_squared_error(v[val_col], v[pred_col])
        r2 = r2_score(v[val_col], v[pred_col])
        recall = recall_score(v.true_class, v.pred_class)
        prec = precision_score(v.true_class, v.pred_class)
        rho, _ = spearmanr(v[val_col], v[pred_col])
        metric_list.append([cycle, method, split, mae, mse, r2, rho, prec, recall])
    metric_df = pd.DataFrame(metric_list, columns=["cv_cycle", "method", "split"] + metric_ls)
    return metric_df
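
A self-contained sketch on synthetic predictions; column names and the threshold are illustrative, and the frame is built so every (cv_cycle, method) pair is populated:

import numpy as np
import pandas as pd
from uqdd.metrics.stats import calc_regression_metrics

rng = np.random.default_rng(0)
methods = ["pnn", "ensemble", "evidential"]
n_per = 20  # predictions per (cycle, method) pair
df = pd.DataFrame({
    "cv_cycle": np.repeat(np.arange(5), n_per * len(methods)),
    "method": np.tile(np.repeat(methods, n_per), 5),
    "split": "test",
})
df["y_true"] = rng.normal(6.5, 1.0, len(df))
df["y_pred"] = df["y_true"] + rng.normal(0.0, 0.5, len(df))

metric_df = calc_regression_metrics(df, cycle_col="cv_cycle",
                                     val_col="y_true", pred_col="y_pred", thresh=6.5)
print(metric_df.head())  # one row per (cv_cycle, method, split)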

uqdd.metrics.stats.bootstrap_ci

bootstrap_ci(data, func=np.mean, n_bootstrap=1000, ci=95, random_state=42)

Compute bootstrap confidence interval for a statistic.

Parameters:

Name Type Description Default
data array - like

Sequence of numeric values.

required
func callable

Statistic function applied to bootstrap samples (e.g., numpy.mean). Default is numpy.mean.

mean
n_bootstrap int

Number of bootstrap resamples. Default is 1000.

1000
ci int or float

Confidence level percentage (e.g., 95). Default is 95.

95
random_state int

Seed for reproducibility. Default is 42.

42

Returns:

Type Description
tuple[float, float]

Lower and upper bounds for the confidence interval.

Source code in uqdd/metrics/stats.py
def bootstrap_ci(data, func=np.mean, n_bootstrap=1000, ci=95, random_state=42):
    """
    Compute bootstrap confidence interval for a statistic.

    Parameters
    ----------
    data : array-like
        Sequence of numeric values.
    func : callable, optional
        Statistic function applied to bootstrap samples (e.g., numpy.mean). Default is numpy.mean.
    n_bootstrap : int, optional
        Number of bootstrap resamples. Default is 1000.
    ci : int or float, optional
        Confidence level percentage (e.g., 95). Default is 95.
    random_state : int, optional
        Seed for reproducibility. Default is 42.

    Returns
    -------
    tuple[float, float]
        Lower and upper bounds for the confidence interval.
    """
    np.random.seed(random_state)
    bootstrap_samples = []
    for _ in range(n_bootstrap):
        sample = resample(data, random_state=np.random.randint(0, 10000))
        bootstrap_samples.append(func(sample))
    alpha = (100 - ci) / 2
    lower = np.percentile(bootstrap_samples, alpha)
    upper = np.percentile(bootstrap_samples, 100 - alpha)
    return lower, upper
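
A standalone sketch on synthetic per-fold scores:

import numpy as np
from uqdd.metrics.stats import bootstrap_ci

rng = np.random.default_rng(0)
rmse_per_fold = rng.normal(0.85, 0.05, size=30)   # synthetic per-fold RMSE values
lower, upper = bootstrap_ci(rmse_per_fold, func=np.mean, n_bootstrap=2000, ci=95)
print(f"95% bootstrap CI of the mean RMSE: [{lower:.3f}, {upper:.3f}]")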

uqdd.metrics.stats.rm_tukey_hsd

rm_tukey_hsd(df, metric, group_col, alpha=0.05, sort=False, direction_dict=None)

Repeated-measures Tukey HSD approximation using RM-ANOVA and studentized range.

Parameters:

Name Type Description Default
df DataFrame

Long-form DataFrame with columns including the metric, group, and 'cv_cycle' subject.

required
metric str

Metric column to compare.

required
group_col str

Column indicating groups (e.g., method/model type).

required
alpha float

Family-wise error rate for intervals. Default is 0.05.

0.05
sort bool

If True, sort groups by mean value of the metric. Default is False.

False
direction_dict dict or None

Mapping of metric -> 'maximize'|'minimize' to set sort ascending/descending.

None

Returns:

Type Description
tuple

(result_tab, df_means, df_means_diff, p_values_matrix) where:
  • result_tab: DataFrame of pairwise comparisons with mean differences and confidence intervals.
  • df_means: mean of the metric per group.
  • df_means_diff: matrix of pairwise mean differences.
  • p_values_matrix: matrix of Tukey-adjusted p-values.

Source code in uqdd/metrics/stats.py
def rm_tukey_hsd(df, metric, group_col, alpha=0.05, sort=False, direction_dict=None):
    """
    Repeated-measures Tukey HSD approximation using RM-ANOVA and studentized range.

    Parameters
    ----------
    df : pd.DataFrame
        Long-form DataFrame with columns including the metric, group, and 'cv_cycle' subject.
    metric : str
        Metric column to compare.
    group_col : str
        Column indicating groups (e.g., method/model type).
    alpha : float, optional
        Family-wise error rate for intervals. Default is 0.05.
    sort : bool, optional
        If True, sort groups by mean value of the metric. Default is False.
    direction_dict : dict or None, optional
        Mapping of metric -> 'maximize'|'minimize' to set sort ascending/descending.

    Returns
    -------
    tuple
        (result_tab, df_means, df_means_diff, p_values_matrix) where:
        - result_tab: DataFrame of pairwise comparisons with mean differences and CIs.
        - df_means: mean per group.
        - df_means_diff: matrix of mean differences.
        - pc: matrix of adjusted p-values.
    """
    if sort and direction_dict and metric in direction_dict:
        ascending = direction_dict[metric] != "maximize"
        df_means = df.groupby(group_col).mean(numeric_only=True).sort_values(metric, ascending=ascending)
    else:
        df_means = df.groupby(group_col).mean(numeric_only=True)

    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=RuntimeWarning, message="divide by zero encountered in scalar divide")
        aov = pg.rm_anova(dv=metric, within=group_col, subject="cv_cycle", data=df, detailed=True)
    mse = aov.loc[1, "MS"]
    df_resid = aov.loc[1, "DF"]

    methods = df_means.index
    n_groups = len(methods)
    n_per_group = df[group_col].value_counts().mean()
    tukey_se = np.sqrt(2 * mse / (n_per_group))
    q = qsturng(1 - alpha, n_groups, df_resid)
    if isinstance(q, (tuple, list, np.ndarray)):
        q = q[0]

    num_comparisons = len(methods) * (len(methods) - 1) // 2
    result_tab = pd.DataFrame(index=range(num_comparisons), columns=["group1", "group2", "meandiff", "lower", "upper", "p-adj"])
    df_means_diff = pd.DataFrame(index=methods, columns=methods, data=0.0)
    pc = pd.DataFrame(index=methods, columns=methods, data=1.0)

    row_idx = 0
    for i, method1 in enumerate(methods):
        for j, method2 in enumerate(methods):
            if i < j:
                group1 = df[df[group_col] == method1][metric]
                group2 = df[df[group_col] == method2][metric]
                mean_diff = group1.mean() - group2.mean()
                studentized_range = np.abs(mean_diff) / tukey_se
                adjusted_p = psturng(studentized_range * np.sqrt(2), n_groups, df_resid)
                if isinstance(adjusted_p, (tuple, list, np.ndarray)):
                    adjusted_p = adjusted_p[0]
                lower = mean_diff - (q / np.sqrt(2) * tukey_se)
                upper = mean_diff + (q / np.sqrt(2) * tukey_se)
                result_tab.loc[row_idx] = [method1, method2, mean_diff, lower, upper, adjusted_p]
                pc.loc[method1, method2] = adjusted_p
                pc.loc[method2, method1] = adjusted_p
                df_means_diff.loc[method1, method2] = mean_diff
                df_means_diff.loc[method2, method1] = -mean_diff
                row_idx += 1

    df_means_diff = df_means_diff.astype(float)
    result_tab["group1_mean"] = result_tab["group1"].map(df_means[metric])
    result_tab["group2_mean"] = result_tab["group2"].map(df_means[metric])
    result_tab.index = result_tab["group1"] + " - " + result_tab["group2"]
    return result_tab, df_means, df_means_diff, pc
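
A sketch reusing the synthetic metric_df built in the calc_regression_metrics example above, which already has the repeated-measures layout this function expects (one row per cv_cycle and method):

from uqdd.metrics.stats import rm_tukey_hsd

result_tab, df_means, df_means_diff, pc = rm_tukey_hsd(
    metric_df, metric="r2", group_col="method", alpha=0.05
)
print(result_tab[["meandiff", "lower", "upper", "p-adj"]])
print(pc.round(3))   # symmetric matrix of adjusted p-values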

uqdd.metrics.stats.make_boxplots

make_boxplots(df, metric_ls, save_dir=None, name_prefix='', model_order=None)

Plot boxplots for each metric grouped by method.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric_ls list of str

Metrics to visualize.

required
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Prefix for the output filename. Default is empty.

''
model_order list of str or None

Explicit order of methods on the x-axis. Default derives from data.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py
def make_boxplots(df, metric_ls, save_dir=None, name_prefix="", model_order=None):
    """
    Plot boxplots for each metric grouped by method.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric_ls : list of str
        Metrics to visualize.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Prefix for the output filename. Default is empty.
    model_order : list of str or None, optional
        Explicit order of methods on the x-axis. Default derives from data.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    sns.set_theme(context="paper", font_scale=1.5)
    sns.set_style("whitegrid")
    figure, axes = plt.subplots(1, len(metric_ls), sharex=False, sharey=False, figsize=(28, 8))
    for i, stat in enumerate(metric_ls):
        ax = sns.boxplot(y=stat, x="method", hue="method", ax=axes[i], data=df, palette="Set2", legend=False, order=model_order, hue_order=model_order)
        title = stat.upper()
        ax.set_xlabel("")
        ax.set_ylabel(title)
        x_tick_labels = ax.get_xticklabels()
        label_text_list = [x.get_text() for x in x_tick_labels]
        new_xtick_labels = ["\n".join(x.split("_")) for x in label_text_list]
        ax.set_xticks(list(range(0, len(x_tick_labels))))
        ax.set_xticklabels(new_xtick_labels, rotation=45, ha="right")
    plt.subplots_adjust(hspace=0.3, wspace=0.3)
    save_plot(figure, save_dir, f"{name_prefix}_boxplot_{'_'.join(metric_ls)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
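
A sketch reusing metric_df from the calc_regression_metrics example; the output directory is illustrative and created up front in case save_plot expects it to exist. The parametric and nonparametric variants documented below take the same arguments and additionally annotate RM-ANOVA or Friedman p-values:

import os
from uqdd.metrics.stats import make_boxplots

os.makedirs("figs/stats", exist_ok=True)
make_boxplots(
    metric_df,
    metric_ls=["r2", "mae"],
    save_dir="figs/stats",
    name_prefix="demo",
    model_order=["pnn", "ensemble", "evidential"],
)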

uqdd.metrics.stats.make_boxplots_parametric

make_boxplots_parametric(df, metric_ls, save_dir=None, name_prefix='', model_order=None)

Plot boxplots with RM-ANOVA p-values annotated per metric.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric_ls list of str

Metrics to visualize.

required
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Prefix for the output filename. Default is empty.

''
model_order list of str or None

Explicit order of methods on the x-axis. Default derives from data.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py
def make_boxplots_parametric(df, metric_ls, save_dir=None, name_prefix="", model_order=None):
    """
    Plot boxplots with RM-ANOVA p-values annotated per metric.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric_ls : list of str
        Metrics to visualize.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Prefix for the output filename. Default is empty.
    model_order : list of str or None, optional
        Explicit order of methods on the x-axis. Default derives from data.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    sns.set_theme(context="paper", font_scale=1.5)
    sns.set_style("whitegrid")
    figure, axes = plt.subplots(1, len(metric_ls), sharex=False, sharey=False, figsize=(28, 8))
    for i, stat in enumerate(metric_ls):
        model = AnovaRM(data=df, depvar=stat, subject="cv_cycle", within=["method"]).fit()
        p_value = model.anova_table["Pr > F"].iloc[0]
        ax = sns.boxplot(y=stat, x="method", hue="method", ax=axes[i], data=df, palette="Set2", legend=False, order=model_order, hue_order=model_order)
        title = stat.upper()
        ax.set_title(f"p={p_value:.1e}")
        ax.set_xlabel("")
        ax.set_ylabel(title)
        x_tick_labels = ax.get_xticklabels()
        label_text_list = [x.get_text() for x in x_tick_labels]
        new_xtick_labels = ["\n".join(x.split("_")) for x in label_text_list]
        ax.set_xticks(list(range(0, len(x_tick_labels))))
        ax.set_xticklabels(new_xtick_labels, rotation=45, ha="right")
    plt.subplots_adjust(hspace=0.3, wspace=0.3)
    save_plot(figure, save_dir, f"{name_prefix}_boxplot_parametric_{'_'.join(metric_ls)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.stats.make_boxplots_nonparametric

make_boxplots_nonparametric(df, metric_ls, save_dir=None, name_prefix='', model_order=None)

Plot boxplots with Friedman p-values annotated per metric.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric_ls list of str

Metrics to visualize.

required
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Prefix for the output filename. Default is empty.

''
model_order list of str or None

Explicit order of methods on the x-axis. Default derives from data.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py
def make_boxplots_nonparametric(df, metric_ls, save_dir=None, name_prefix="", model_order=None):
    """
    Plot boxplots with Friedman p-values annotated per metric.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric_ls : list of str
        Metrics to visualize.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Prefix for the output filename. Default is empty.
    model_order : list of str or None, optional
        Explicit order of methods on the x-axis. Default derives from data.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    n_metrics = len(metric_ls)
    sns.set_theme(context="paper", font_scale=1.5)
    sns.set_style("whitegrid")
    figure, axes = plt.subplots(1, n_metrics, sharex=False, sharey=False, figsize=(28, 8))
    for i, stat in enumerate(metric_ls):
        friedman = pg.friedman(df, dv=stat, within="method", subject="cv_cycle")["p-unc"].values[0]
        ax = sns.boxplot(y=stat, x="method", hue="method", ax=axes[i], data=df, palette="Set2", legend=False, order=model_order, hue_order=model_order)
        title = stat.replace("_", " ").upper()
        ax.set_title(f"p={friedman:.1e}")
        ax.set_xlabel("")
        ax.set_ylabel(title)
        x_tick_labels = ax.get_xticklabels()
        label_text_list = [x.get_text() for x in x_tick_labels]
        new_xtick_labels = ["\n".join(x.split("_")) for x in label_text_list]
        ax.set_xticks(list(range(0, len(x_tick_labels))))
        ax.set_xticklabels(new_xtick_labels, rotation=45, ha="right")
    plt.subplots_adjust(hspace=0.3, wspace=0.3)
    save_plot(figure, save_dir, f"{name_prefix}_boxplot_nonparametric_{'_'.join(metric_ls)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.stats.make_sign_plots_nonparametric

make_sign_plots_nonparametric(df, metric_ls, save_dir=None, name_prefix='', model_order=None)

Plot significance heatmaps (Conover post-hoc) for nonparametric comparisons.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric_ls list of str

Metrics to analyze.

required
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Prefix for the output filename. Default is empty.

''
model_order list of str or None

Explicit order of methods on axes. Default derives from data.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py
def make_sign_plots_nonparametric(df, metric_ls, save_dir=None, name_prefix="", model_order=None):
    """
    Plot significance heatmaps (Conover post-hoc) for nonparametric comparisons.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric_ls : list of str
        Metrics to analyze.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Prefix for the output filename. Default is empty.
    model_order : list of str or None, optional
        Explicit order of methods on axes. Default derives from data.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    heatmap_args = {"linewidths": 0.25, "linecolor": "0.5", "clip_on": True, "square": True, "cbar_kws": {"pad": 0.05, "location": "right"}}
    n_metrics = len(metric_ls)
    sns.set_theme(context="paper", font_scale=1.5)
    figure, axes = plt.subplots(1, n_metrics, sharex=False, sharey=True, figsize=(26, 8))
    if n_metrics == 1:
        axes = [axes]
    for i, stat in enumerate(metric_ls):
        pc = sp.posthoc_conover_friedman(df, y_col=stat, group_col="method", block_col="cv_cycle", block_id_col="cv_cycle", p_adjust="holm", melted=True)
        if model_order is not None:
            pc = pc.reindex(index=model_order, columns=model_order)
        sub_ax, sub_c = sp.sign_plot(pc, **heatmap_args, ax=axes[i], xticklabels=True)
        sub_ax.set_title(stat.upper())
        if sub_c is not None and hasattr(sub_c, "ax"):
            figure.subplots_adjust(right=0.85)
            sub_c.ax.set_position([0.87, 0.5, 0.02, 0.2])
    save_plot(figure, save_dir, f"{name_prefix}_sign_plot_nonparametric_{'_'.join(metric_ls)}", tighten=False)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.stats.make_critical_difference_diagrams

make_critical_difference_diagrams(df, metric_ls, save_dir=None, name_prefix='', model_order=None)

Plot critical difference diagrams per metric using average ranks and post-hoc p-values.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric_ls list of str

Metrics to analyze.

required
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Prefix for the output filename. Default is empty.

''
model_order list of str or None

Explicit order of models on diagrams. Default derives from data.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py
def make_critical_difference_diagrams(df, metric_ls, save_dir=None, name_prefix="", model_order=None):
    """
    Plot critical difference diagrams per metric using average ranks and post-hoc p-values.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric_ls : list of str
        Metrics to analyze.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Prefix for the output filename. Default is empty.
    model_order : list of str or None, optional
        Explicit order of models on diagrams. Default derives from data.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    n_metrics = len(metric_ls)
    figure, axes = plt.subplots(n_metrics, 1, sharex=True, sharey=False, figsize=(16, 10))
    for i, stat in enumerate(metric_ls):
        avg_rank = df.groupby("cv_cycle")[stat].rank(pct=True).groupby(df.method).mean()
        pc = sp.posthoc_conover_friedman(df, y_col=stat, group_col="method", block_col="cv_cycle", block_id_col="cv_cycle", p_adjust="holm", melted=True)
        if model_order is not None:
            avg_rank = avg_rank.reindex(model_order)
            pc = pc.reindex(index=model_order, columns=model_order)
        sp.critical_difference_diagram(avg_rank, pc, ax=axes[i])
        axes[i].set_title(stat.upper())
    plt.subplots_adjust(hspace=0.3, wspace=0.3)
    save_plot(figure, save_dir, f"{name_prefix}_critical_difference_diagram_{'_'.join(metric_ls)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
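
A sketch reusing metric_df from the calc_regression_metrics example; make_sign_plots_nonparametric (documented above) accepts the same arguments and renders the corresponding significance heatmaps:

from uqdd.metrics.stats import make_critical_difference_diagrams

make_critical_difference_diagrams(
    metric_df,
    metric_ls=["r2", "mae"],
    save_dir="figs/stats",    # illustrative output directory
    name_prefix="demo",
)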

uqdd.metrics.stats.make_normality_diagnostic

make_normality_diagnostic(df, metric_ls, save_dir=None, name_prefix='')

Plot normality diagnostics (histogram/KDE and Q-Q) for residualized metrics.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric_ls list of str

Metrics to diagnose.

required
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Prefix for the output filename. Default is empty.

''

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py
def make_normality_diagnostic(df, metric_ls, save_dir=None, name_prefix=""):
    """
    Plot normality diagnostics (histogram/KDE and Q-Q) for residualized metrics.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric_ls : list of str
        Metrics to diagnose.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Prefix for the output filename. Default is empty.

    Returns
    -------
    None
    """
    df_norm = df.copy()
    df_norm.replace([np.inf, -np.inf], np.nan, inplace=True)
    for metric in metric_ls:
        df_norm[metric] = df_norm[metric] - df_norm.groupby("method")[metric].transform("mean")
    df_norm = df_norm.melt(id_vars=["cv_cycle", "method", "split"], value_vars=metric_ls, var_name="metric", value_name="value")
    sns.set_theme(context="paper", font_scale=1.5)
    sns.set_style("whitegrid")
    metrics = df_norm["metric"].unique()
    n_metrics = len(metrics)
    fig, axes = plt.subplots(2, n_metrics, figsize=(20, 10))
    for i, metric in enumerate(metrics):
        ax = axes[0, i]
        sns.histplot(df_norm[df_norm["metric"] == metric]["value"], kde=True, ax=ax)
        ax.set_title(f"{metric}")
        ax.set_xlabel("")
        if i == 0:
            ax.set_ylabel("Count")
        else:
            ax.set_ylabel("")
    for i, metric in enumerate(metrics):
        ax = axes[1, i]
        metric_data = df_norm[df_norm["metric"] == metric]["value"]
        stats.probplot(metric_data, dist="norm", plot=ax)
        ax.set_title("")
        ax.set_xlabel("Theoretical Quantiles")
        if i == 0:
            ax.set_ylabel("Ordered Values")
        else:
            ax.set_ylabel("")
    plt.subplots_adjust(hspace=0.3, wspace=0.8)
    save_plot(fig, save_dir, f"{name_prefix}_normality_diagnostic_{'_'.join(metric_ls)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.stats.mcs_plot

mcs_plot(pc, effect_size, means, labels=True, cmap=None, cbar_ax_bbox=None, ax=None, show_diff=True, cell_text_size=10, axis_text_size=8, show_cbar=True, reverse_cmap=False, vlim=None, **kwargs)

Render a multiple-comparisons significance heatmap annotated with effect sizes and stars.

Parameters:

Name Type Description Default
pc DataFrame

Matrix of adjusted p-values.

required
effect_size DataFrame

Matrix of mean differences (effect sizes) aligned with pc.

required
means Series

Mean values per group for labeling.

required
labels bool

If True, add x/y tick labels from means.index. Default is True.

True
cmap str or None

Colormap name for effect sizes. Default is 'YlGnBu'.

None
cbar_ax_bbox tuple or None

Custom colorbar axes bbox; unused here but kept for API compatibility.

None
ax Axes or None

Axes to draw into; if None, a new axes is created.

None
show_diff bool

If True, annotate cells with rounded effect sizes plus significance. Default is True.

True
cell_text_size int

Font size for annotations. Default is 10.

10
axis_text_size int

Font size for axis tick labels. Default is 8.

8
show_cbar bool

If True, show colorbar. Default is True.

True
reverse_cmap bool

If True, use reversed colormap. Default is False.

False
vlim float or None

Symmetric limit for color scaling around 0. Default is None.

None

Returns:

Type Description
Axes

Axes containing the rendered heatmap.

Source code in uqdd/metrics/stats.py
def mcs_plot(pc, effect_size, means, labels=True, cmap=None, cbar_ax_bbox=None, ax=None, show_diff=True, cell_text_size=10, axis_text_size=8, show_cbar=True, reverse_cmap=False, vlim=None, **kwargs):
    """
    Render a multiple-comparisons significance heatmap annotated with effect sizes and stars.

    Parameters
    ----------
    pc : pd.DataFrame
        Matrix of adjusted p-values.
    effect_size : pd.DataFrame
        Matrix of mean differences (effect sizes) aligned with `pc`.
    means : pd.Series
        Mean values per group for labeling.
    labels : bool, optional
        If True, add x/y tick labels from `means.index`. Default is True.
    cmap : str or None, optional
        Colormap name for effect sizes. Default is 'YlGnBu'.
    cbar_ax_bbox : tuple or None, optional
        Custom colorbar axes bbox; unused here but kept for API compatibility.
    ax : matplotlib.axes.Axes or None, optional
        Axes to draw into; if None, a new axes is created.
    show_diff : bool, optional
        If True, annotate cells with rounded effect sizes plus significance. Default is True.
    cell_text_size : int, optional
        Font size for annotations. Default is 10.
    axis_text_size : int, optional
        Font size for axis tick labels. Default is 8.
    show_cbar : bool, optional
        If True, show colorbar. Default is True.
    reverse_cmap : bool, optional
        If True, use reversed colormap. Default is False.
    vlim : float or None, optional
        Symmetric limit for color scaling around 0. Default is None.

    Returns
    -------
    matplotlib.axes.Axes
        Axes containing the rendered heatmap.
    """
    for key in ["cbar", "vmin", "vmax", "center"]:
        if key in kwargs:
            del kwargs[key]
    if not cmap:
        cmap = "YlGnBu"
    if reverse_cmap:
        cmap = cmap + "_r"
    significance = pc.copy().astype(object)
    significance[(pc < 0.001) & (pc >= 0)] = "***"
    significance[(pc < 0.01) & (pc >= 0.001)] = "**"
    significance[(pc < 0.05) & (pc >= 0.01)] = "*"
    significance[(pc >= 0.05)] = ""
    np.fill_diagonal(significance.values, "")
    annotations = effect_size.round(2).astype(str) + significance if show_diff else significance
    hax = sns.heatmap(effect_size, cmap=cmap, annot=annotations, fmt="", cbar=show_cbar, ax=ax, annot_kws={"size": cell_text_size}, vmin=-2 * vlim if vlim else None, vmax=2 * vlim if vlim else None, square=True, **kwargs)
    if labels:
        label_list = list(means.index)
        x_label_list = label_list
        y_label_list = label_list
        xtick_positions = np.arange(len(label_list))
        hax.set_xticks(xtick_positions + 0.5)
        hax.set_xticklabels(x_label_list, size=axis_text_size, ha="center", va="center", rotation=90)
        hax.set_yticks(xtick_positions + 0.5)
        hax.set_yticklabels(y_label_list, size=axis_text_size, ha="center", va="center", rotation=0)
    hax.set_xlabel("")
    hax.set_ylabel("")
    return hax

uqdd.metrics.stats.make_mcs_plot_grid

make_mcs_plot_grid(df, stats_list, group_col, alpha=0.05, figsize=(20, 10), direction_dict=None, effect_dict=None, show_diff=True, cell_text_size=16, axis_text_size=12, title_text_size=16, sort_axes=False, save_dir=None, name_prefix='', model_order=None)

Generate a grid of MCS plots for multiple metrics.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
stats_list list of str

Metrics to include.

required
group_col str

Column indicating groups (e.g., method).

required
alpha float

Significance level. Default is 0.05.

0.05
figsize tuple

Figure size. Default is (20, 10).

(20, 10)
direction_dict dict or None

Mapping metric -> 'maximize'|'minimize' for colormap orientation.

None
effect_dict dict or None

Mapping metric -> effect size limit for color scaling.

None
show_diff bool

If True, annotate mean differences; else annotate significance only.

True
cell_text_size int

Annotation font size.

16
axis_text_size int

Axis label font size.

12
title_text_size int

Title font size.

16
sort_axes bool

If True, sort groups by mean values per metric.

False
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Filename prefix. Default is empty.

''
model_order list of str or None

Explicit model order for rows/cols.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py
def make_mcs_plot_grid(df, stats_list, group_col, alpha=0.05, figsize=(20, 10), direction_dict=None, effect_dict=None, show_diff=True, cell_text_size=16, axis_text_size=12, title_text_size=16, sort_axes=False, save_dir=None, name_prefix="", model_order=None):
    """
    Generate a grid of MCS plots for multiple metrics.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    stats_list : list of str
        Metrics to include.
    group_col : str
        Column indicating groups (e.g., method).
    alpha : float, optional
        Significance level. Default is 0.05.
    figsize : tuple, optional
        Figure size. Default is (20, 10).
    direction_dict : dict or None, optional
        Mapping metric -> 'maximize'|'minimize' for colormap orientation.
    effect_dict : dict or None, optional
        Mapping metric -> effect size limit for color scaling.
    show_diff : bool, optional
        If True, annotate mean differences; else annotate significance only.
    cell_text_size : int, optional
        Annotation font size.
    axis_text_size : int, optional
        Axis label font size.
    title_text_size : int, optional
        Title font size.
    sort_axes : bool, optional
        If True, sort groups by mean values per metric.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Filename prefix. Default is empty.
    model_order : list of str or None, optional
        Explicit model order for rows/cols.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    nrow = math.ceil(len(stats_list) / 3)
    fig, ax = plt.subplots(nrow, 3, figsize=figsize)
    for key in ["r2", "rho", "prec", "recall", "mae", "mse"]:
        direction_dict.setdefault(key, "maximize" if key in ["r2", "rho", "prec", "recall"] else "minimize")
    for key in ["r2", "rho", "prec", "recall"]:
        effect_dict.setdefault(key, 0.1)
    for i, stat in enumerate(stats_list):
        row = i // 3
        col = i % 3
        if stat not in direction_dict:
            raise ValueError(f"Stat '{stat}' is missing in direction_dict. Please set its value.")
        if stat not in effect_dict:
            raise ValueError(f"Stat '{stat}' is missing in effect_dict. Please set its value.")
        reverse_cmap = direction_dict[stat] == "minimize"
        _, df_means, df_means_diff, pc = rm_tukey_hsd(df, stat, group_col, alpha, sort_axes, direction_dict)
        if model_order is not None:
            df_means = df_means.reindex(model_order)
            df_means_diff = df_means_diff.reindex(index=model_order, columns=model_order)
            pc = pc.reindex(index=model_order, columns=model_order)
        hax = mcs_plot(pc, effect_size=df_means_diff, means=df_means[stat], show_diff=show_diff, ax=ax[row, col], cbar=True, cell_text_size=cell_text_size, axis_text_size=axis_text_size, reverse_cmap=reverse_cmap, vlim=effect_dict[stat])
        hax.set_title(stat.upper(), fontsize=title_text_size)
    if (len(stats_list) % 3) != 0:
        for i in range(len(stats_list), nrow * 3):
            row = i // 3
            col = i % 3
            ax[row, col].set_visible(False)
    from matplotlib.lines import Line2D
    legend_elements = [
        Line2D([0], [0], marker="o", color="w", label="p < 0.001 (***): Highly Significant", markerfacecolor="black", markersize=10),
        Line2D([0], [0], marker="o", color="w", label="p < 0.01 (**): Very Significant", markerfacecolor="black", markersize=10),
        Line2D([0], [0], marker="o", color="w", label="p < 0.05 (*): Significant", markerfacecolor="black", markersize=10),
        Line2D([0], [0], marker="o", color="w", label="p >= 0.05: Not Significant", markerfacecolor="black", markersize=10),
    ]
    fig.legend(handles=legend_elements, loc="upper right", ncol=2, fontsize=12, frameon=False)
    plt.subplots_adjust(top=0.88)
    save_plot(fig, save_dir, f"{name_prefix}_mcs_plot_grid_{'_'.join(stats_list)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
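
A sketch reusing metric_df from the calc_regression_metrics example. The grid is laid out three panels per row, so a stats list spanning at least two rows is used here to match the 2-D axes indexing in the source; direction and effect-size limits must be supplied for any metric the built-in defaults do not cover (the 'mae'/'mse' limits below are illustrative):

from uqdd.metrics.stats import make_mcs_plot_grid

make_mcs_plot_grid(
    metric_df,
    stats_list=["mae", "mse", "r2", "rho", "prec", "recall"],
    group_col="method",
    direction_dict={},                      # defaults are filled in for these six metrics
    effect_dict={"mae": 0.3, "mse": 0.3},   # illustrative limits; r2/rho/prec/recall default to 0.1
    save_dir="figs/stats",
    name_prefix="demo",
)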

uqdd.metrics.stats.make_scatterplot

make_scatterplot(df, val_col, pred_col, thresh, cycle_col='cv_cycle', group_col='method', save_dir=None)

Scatter plots of predicted vs true values per method, with threshold lines and summary stats.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
val_col str

True value column.

required
pred_col str

Predicted value column.

required
thresh float

Threshold for classification overlays.

required
cycle_col str

Cross-validation cycle column. Default is 'cv_cycle'.

'cv_cycle'
group_col str

Method/model type column. Default is 'method'.

'method'
save_dir str or None

Directory to save the plot. Default is None.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py
def make_scatterplot(df, val_col, pred_col, thresh, cycle_col="cv_cycle", group_col="method", save_dir=None):
    """
    Scatter plots of predicted vs true values per method, with threshold lines and summary stats.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    val_col : str
        True value column.
    pred_col : str
        Predicted value column.
    thresh : float
        Threshold for classification overlays.
    cycle_col : str, optional
        Cross-validation cycle column. Default is 'cv_cycle'.
    group_col : str, optional
        Method/model type column. Default is 'method'.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    df_split_metrics = calc_regression_metrics(df, cycle_col=cycle_col, val_col=val_col, pred_col=pred_col, thresh=thresh)
    methods = df[group_col].unique()
    fig, axs = plt.subplots(nrows=1, ncols=len(methods), figsize=(25, 10))
    for ax, method in zip(axs, methods):
        df_method = df.query(f"{group_col} == @method")
        df_metrics = df_split_metrics.query(f"{group_col} == @method")
        ax.scatter(df_method[pred_col], df_method[val_col], alpha=0.3)
        ax.plot([df_method[val_col].min(), df_method[val_col].max()], [df_method[val_col].min(), df_method[val_col].max()], "k--", lw=1)
        ax.axhline(y=thresh, color="r", linestyle="--")
        ax.axvline(x=thresh, color="r", linestyle="--")
        ax.set_title(method)
        y_true = df_method[val_col] > thresh
        y_pred = df_method[pred_col] > thresh
        precision = precision_score(y_true, y_pred)
        recall = recall_score(y_true, y_pred)
        metrics_text = f"MAE: {df_metrics['mae'].mean():.2f}\nMSE: {df_metrics['mse'].mean():.2f}\nR2: {df_metrics['r2'].mean():.2f}\nrho: {df_metrics['rho'].mean():.2f}\nPrecision: {precision:.2f}\nRecall: {recall:.2f}"
        ax.text(0.05, 0.5, metrics_text, transform=ax.transAxes, verticalalignment="top")
        ax.set_xlabel("Predicted")
        ax.set_ylabel("Measured")
    plt.subplots_adjust(hspace=0.3, wspace=0.3)
    save_plot(fig, save_dir, f"scatterplot_{val_col}_vs_{pred_col}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.stats.ci_plot

ci_plot(result_tab, ax_in, name)

Plot mean differences with confidence intervals for pairwise comparisons.

Parameters:

Name Type Description Default
result_tab DataFrame

Output of rm_tukey_hsd with columns ['meandiff', 'lower', 'upper'].

required
ax_in Axes

Axes to plot into.

required
name str

Title for the plot.

required

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py
def ci_plot(result_tab, ax_in, name):
    """
    Plot mean differences with confidence intervals for pairwise comparisons.

    Parameters
    ----------
    result_tab : pd.DataFrame
        Output of rm_tukey_hsd with columns ['meandiff', 'lower', 'upper'].
    ax_in : matplotlib.axes.Axes
        Axes to plot into.
    name : str
        Title for the plot.

    Returns
    -------
    None
    """
    result_err = np.array([result_tab["meandiff"] - result_tab["lower"], result_tab["upper"] - result_tab["meandiff"]])
    sns.set_theme(context="paper")
    sns.set_style("whitegrid")
    ax = sns.pointplot(x=result_tab.meandiff, y=result_tab.index, marker="o", linestyle="", ax=ax_in)
    ax.errorbar(y=result_tab.index, x=result_tab["meandiff"], xerr=result_err, fmt="o", capsize=5)
    ax.axvline(0, ls="--", lw=3)
    ax.set_xlabel("Mean Difference")
    ax.set_ylabel("")
    ax.set_title(name)
    ax.set_xlim(-0.2, 0.2)

uqdd.metrics.stats.make_ci_plot_grid

make_ci_plot_grid(df_in, metric_list, group_col='method', save_dir=None, name_prefix='', model_order=None)

Plot a grid of confidence-interval charts for multiple metrics.

Parameters:

Name Type Description Default
df_in DataFrame

Input DataFrame.

required
metric_list list of str

Metrics to render.

required
group_col str

Group column (e.g., 'method'). Default is 'method'.

'method'
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Filename prefix. Default is empty.

''
model_order list of str or None

Explicit row order for the CI plots.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py
def make_ci_plot_grid(df_in, metric_list, group_col="method", save_dir=None, name_prefix="", model_order=None):
    """
    Plot a grid of confidence-interval charts for multiple metrics.

    Parameters
    ----------
    df_in : pd.DataFrame
        Input DataFrame.
    metric_list : list of str
        Metrics to render.
    group_col : str, optional
        Group column (e.g., 'method'). Default is 'method'.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Filename prefix. Default is empty.
    model_order : list of str or None, optional
        Explicit row order for the CI plots.

    Returns
    -------
    None
    """
    df_in = df_in.copy()
    df_in.replace([np.inf, -np.inf], np.nan, inplace=True)
    figure, axes = plt.subplots(len(metric_list), 1, figsize=(8, 2 * len(metric_list)), sharex=False)
    if not isinstance(axes, np.ndarray):
        axes = np.array([axes])
    for i, metric in enumerate(metric_list):
        df_tukey, _, _, _ = rm_tukey_hsd(df_in, metric, group_col=group_col)
        if model_order is not None:
            df_tukey = df_tukey.reindex(index=model_order)
        ci_plot(df_tukey, ax_in=axes[i], name=metric)
    figure.suptitle("Multiple Comparison of Means\nTukey HSD, FWER=0.05")
    plt.subplots_adjust(hspace=0.9, wspace=0.3)
    save_plot(figure, save_dir, f"{name_prefix}_ci_plot_grid_{'_'.join(metric_list)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.stats.recall_at_precision

recall_at_precision(y_true, y_score, precision_threshold=0.5, direction='greater')

Find recall and threshold achieving at least a target precision.

Parameters:

Name Type Description Default
y_true array-like

Binary ground-truth labels.

required
y_score array-like

Continuous scores or probabilities.

required
precision_threshold float

Minimum precision to achieve. Default is 0.5.

0.5
direction {'greater', 'lesser'}

If 'greater', thresholding uses >=; if 'lesser', uses <=. Default is 'greater'.

"greater"

Returns:

Type Description
tuple[float, float or None]

(recall, threshold) if achievable; otherwise (nan, None).

Raises:

Type Description
ValueError

If direction is invalid.

Source code in uqdd/metrics/stats.py, lines 746-785
def recall_at_precision(y_true, y_score, precision_threshold=0.5, direction="greater"):
    """
    Find recall and threshold achieving at least a target precision.

    Parameters
    ----------
    y_true : array-like
        Binary ground-truth labels.
    y_score : array-like
        Continuous scores or probabilities.
    precision_threshold : float, optional
        Minimum precision to achieve. Default is 0.5.
    direction : {"greater", "lesser"}, optional
        If 'greater', thresholding uses >=; if 'lesser', uses <=. Default is 'greater'.

    Returns
    -------
    tuple[float, float or None]
        (recall, threshold) if achievable; otherwise (nan, None).

    Raises
    ------
    ValueError
        If `direction` is invalid.
    """
    if direction not in ["greater", "lesser"]:
        raise ValueError("Invalid direction. Expected one of: ['greater', 'lesser']")
    y_true = np.array(y_true)
    y_score = np.array(y_score)
    thresholds = np.unique(y_score)
    thresholds = np.sort(thresholds)
    if direction == "lesser":
        thresholds = thresholds[::-1]
    for threshold in thresholds:
        y_pred = y_score >= threshold if direction == "greater" else y_score <= threshold
        precision = precision_score(y_true, y_pred)
        if precision >= precision_threshold:
            recall = recall_score(y_true, y_pred)
            return recall, threshold
    return np.nan, None
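
A small worked example with toy labels and scores: the first threshold reaching the target precision of 0.75 is 0.35, at which all three positives are recovered.

from uqdd.metrics.stats import recall_at_precision

y_true = [0, 0, 1, 1, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.70]
recall, threshold = recall_at_precision(y_true, y_score, precision_threshold=0.75, direction="greater")
print(recall, threshold)  # 1.0, 0.35 (precision at this threshold is 3/4 = 0.75)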

uqdd.metrics.stats.calc_classification_metrics

calc_classification_metrics(df_in, cycle_col, val_col, prob_col, pred_col)

Compute classification metrics per cycle/method/split, including ROC-AUC, PR-AUC, MCC, recall, and TNR.

Parameters:

Name Type Description Default
df_in DataFrame

Input DataFrame.

required
cycle_col str

Column name for cross-validation cycles.

required
val_col str

True binary label column.

required
prob_col str

Predicted probability/score column.

required
pred_col str

Predicted binary label column.

required

Returns:

Type Description
DataFrame

Metrics per (cv_cycle, method, split) with columns ['roc_auc', 'pr_auc', 'mcc', 'recall', 'tnr'].

Source code in uqdd/metrics/stats.py, lines 788-820
def calc_classification_metrics(df_in, cycle_col, val_col, prob_col, pred_col):
    """
    Compute classification metrics per cycle/method/split, including ROC-AUC, PR-AUC, MCC, recall, and TNR.

    Parameters
    ----------
    df_in : pd.DataFrame
        Input DataFrame.
    cycle_col : str
        Column name for cross-validation cycles.
    val_col : str
        True binary label column.
    prob_col : str
        Predicted probability/score column.
    pred_col : str
        Predicted binary label column.

    Returns
    -------
    pd.DataFrame
        Metrics per (cv_cycle, method, split) with columns ['roc_auc', 'pr_auc', 'mcc', 'recall', 'tnr'].
    """
    metric_list = []
    for k, v in df_in.groupby([cycle_col, "method", "split"]):
        cycle, method, split = k
        roc_auc = roc_auc_score(v[val_col], v[prob_col])
        pr_auc = average_precision_score(v[val_col], v[prob_col])
        mcc = matthews_corrcoef(v[val_col], v[pred_col])
        recall, _ = recall_at_precision(v[val_col].astype(bool), v[prob_col], precision_threshold=0.8, direction="greater")
        tnr, _ = recall_at_precision(~v[val_col].astype(bool), v[prob_col], precision_threshold=0.8, direction="lesser")
        metric_list.append([cycle, method, split, roc_auc, pr_auc, mcc, recall, tnr])
    metric_df = pd.DataFrame(metric_list, columns=["cv_cycle", "method", "split", "roc_auc", "pr_auc", "mcc", "recall", "tnr"])
    return metric_df
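
A minimal sketch with synthetic, well-separated predictions; the column names 'Sol', 'Sol_prob', and 'Sol_pred' are hypothetical stand-ins for your label, score, and hard-prediction columns.

import numpy as np
import pandas as pd
from uqdd.metrics.stats import calc_classification_metrics

# Two hypothetical CV cycles for a single method on the scaffold split.
frames = []
for cycle in (0, 1):
    frames.append(pd.DataFrame({
        "cv_cycle": cycle,
        "method": "ensemble",
        "split": "scaffold",
        "Sol": [0] * 10 + [1] * 10,
        "Sol_prob": list(np.linspace(0.05, 0.45, 10)) + list(np.linspace(0.40, 0.95, 10)),
    }))
df = pd.concat(frames, ignore_index=True)
df["Sol_pred"] = (df["Sol_prob"] >= 0.5).astype(int)
print(calc_classification_metrics(df, cycle_col="cv_cycle", val_col="Sol",
                                  prob_col="Sol_prob", pred_col="Sol_pred"))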

uqdd.metrics.stats.make_curve_plots

make_curve_plots(df)

Plot ROC and PR curves for split/method selections with threshold markers.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing 'cv_cycle', 'split', and 'method' columns, plus the true-label and probability columns.

required

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py, lines 823-869
def make_curve_plots(df):
    """
    Plot ROC and PR curves for split/method selections with threshold markers.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame containing 'cv_cycle', 'split', and 'method' columns, plus the true-label and probability columns.

    Returns
    -------
    None
    """
    df_plot = df.query("cv_cycle == 0 and split == 'scaffold'").copy()
    color_map = plt.get_cmap("tab10")
    le = LabelEncoder()
    df_plot["color"] = le.fit_transform(df_plot["method"])
    colors = color_map(df_plot["color"].unique())
    val_col = "Sol"
    prob_col = "Sol_prob"
    fig, axes = plt.subplots(1, 2, figsize=(12, 6))
    for (k, v), color in zip(df_plot.groupby("method"), colors):
        roc_auc = roc_auc_score(v[val_col], v[prob_col])
        pr_auc = average_precision_score(v[val_col], v[prob_col])
        fpr, recall_pos, thresholds_roc = roc_curve(v[val_col], v[prob_col])
        precision, recall, thresholds_pr = precision_recall_curve(v[val_col], v[prob_col])
        _, threshold_recall_pos = recall_at_precision(v[val_col].astype(bool), v[prob_col], precision_threshold=0.8, direction="greater")
        _, threshold_recall_neg = recall_at_precision(~v[val_col].astype(bool), v[prob_col], precision_threshold=0.8, direction="lesser")
        fpr_recall_pos = fpr[np.abs(thresholds_roc - threshold_recall_pos).argmin()]
        fpr_recall_neg = fpr[np.abs(thresholds_roc - threshold_recall_neg).argmin()]
        recall_recall_pos = recall[np.abs(thresholds_pr - threshold_recall_pos).argmin()]
        recall_recall_neg = recall[np.abs(thresholds_pr - threshold_recall_neg).argmin()]
        axes[0].plot(fpr, recall_pos, label=f"{k} (ROC AUC={roc_auc:.03f})", color=color, alpha=0.75)
        axes[1].plot(recall, precision, label=f"{k} (PR AUC={pr_auc:.03f})", color=color, alpha=0.75)
        axes[0].axvline(fpr_recall_pos, color=color, linestyle=":", alpha=0.75)
        axes[0].axvline(fpr_recall_neg, color=color, linestyle="--", alpha=0.75)
        axes[1].axvline(recall_recall_pos, color=color, linestyle=":", alpha=0.75)
        axes[1].axvline(recall_recall_neg, color=color, linestyle="--", alpha=0.75)
    axes[0].plot([0, 1], [0, 1], "--", color="black", lw=0.5)
    axes[0].set_xlabel("False Positive Rate")
    axes[0].set_ylabel("True Positive Rate")
    axes[0].set_title("ROC Curve")
    axes[0].legend()
    axes[1].set_xlabel("Recall")
    axes[1].set_ylabel("Precision")
    axes[1].set_title("Precision-Recall Curve")
    axes[1].legend()
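
Note that the function expects the hard-coded label/score columns 'Sol' and 'Sol_prob' and plots only rows with cv_cycle == 0 on the 'scaffold' split. A minimal sketch with synthetic, well-separated scores:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from uqdd.metrics.stats import make_curve_plots

n = 20
df = pd.DataFrame({
    "method": ["ensemble"] * n,
    "cv_cycle": [0] * n,
    "split": ["scaffold"] * n,
    "Sol": [0] * (n // 2) + [1] * (n // 2),
    "Sol_prob": list(np.linspace(0.05, 0.40, n // 2)) + list(np.linspace(0.60, 0.95, n // 2)),
})
make_curve_plots(df)
plt.show()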

uqdd.metrics.stats.harmonize_columns

harmonize_columns(df)

Normalize common column names to ['method', 'split', 'cv_cycle'].

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with possibly varied column naming.

required

Returns:

Type Description
DataFrame

DataFrame with standardized column names; an assertion verifies that the required columns are present.

Source code in uqdd/metrics/stats.py, lines 872-894
def harmonize_columns(df):
    """
    Normalize common column names to ['method', 'split', 'cv_cycle'].

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with possibly varied column naming.

    Returns
    -------
    pd.DataFrame
        DataFrame with standardized column names; an assertion verifies that the required columns are present.
    """
    df = df.copy()
    rename_map = {
        "Model type": "method",
        "Split": "split",
        "Group_Number": "cv_cycle",
    }
    df.rename(columns={k: v for k, v in rename_map.items() if k in df.columns}, inplace=True)
    assert {"method", "split", "cv_cycle"}.issubset(df.columns)
    return df
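
A minimal sketch showing the renaming; metric columns are kept untouched.

import pandas as pd
from uqdd.metrics.stats import harmonize_columns

df = pd.DataFrame({
    "Model type": ["ensemble", "mc_dropout"],
    "Split": ["random", "random"],
    "Group_Number": [0, 1],
    "RMSE": [0.61, 0.66],
})
print(harmonize_columns(df).columns.tolist())
# ['method', 'split', 'cv_cycle', 'RMSE']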

uqdd.metrics.stats.cliffs_delta

cliffs_delta(x, y)

Compute Cliff's delta effect size and qualitative interpretation.

Parameters:

Name Type Description Default
x array-like

First sample of numeric values.

required
y array-like

Second sample of numeric values.

required

Returns:

Type Description
tuple[float, str]

(delta, interpretation) where interpretation is one of {'negligible','small','medium','large'}.

Source code in uqdd/metrics/stats.py, lines 897-932
def cliffs_delta(x, y):
    """
    Compute Cliff's delta effect size and qualitative interpretation.

    Parameters
    ----------
    x : array-like
        First sample of numeric values.
    y : array-like
        Second sample of numeric values.

    Returns
    -------
    tuple[float, str]
        (delta, interpretation) where interpretation is one of {'negligible','small','medium','large'}.
    """
    x, y = np.array(x), np.array(y)
    m, n = len(x), len(y)
    comparisons = 0
    for xi in x:
        for yi in y:
            if xi > yi:
                comparisons += 1
            elif xi < yi:
                comparisons -= 1
    delta = comparisons / (m * n)
    abs_delta = abs(delta)
    if abs_delta < 0.147:
        interpretation = "negligible"
    elif abs_delta < 0.33:
        interpretation = "small"
    elif abs_delta < 0.474:
        interpretation = "medium"
    else:
        interpretation = "large"
    return delta, interpretation
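
A small worked example: of the 9 pairwise comparisons below, 6 favor y, 1 favors x, and 2 are ties, so delta = (1 - 6) / 9 ≈ -0.556, which falls in the 'large' band.

from uqdd.metrics.stats import cliffs_delta

delta, label = cliffs_delta([1, 2, 3], [2, 3, 4])
print(round(delta, 3), label)  # -0.556 large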

uqdd.metrics.stats.wilcoxon_pairwise_test

wilcoxon_pairwise_test(df, metric, model_a, model_b, task=None, split=None, seed_col=None)

Perform paired Wilcoxon signed-rank test between two models on a metric.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric str

Metric column to compare.

required
model_a str

First model type name.

required
model_b str

Second model type name.

required
task str or None

Task filter. Default is None.

None
split str or None

Split filter. Default is None.

None
seed_col str or None

Optional seed column identifier (unused here).

None

Returns:

Type Description
dict or None

Test summary including statistic, p-value, Cliff's delta, CI on differences; None if insufficient data.

Source code in uqdd/metrics/stats.py, lines 935-999
def wilcoxon_pairwise_test(df, metric, model_a, model_b, task=None, split=None, seed_col=None):
    """
    Perform paired Wilcoxon signed-rank test between two models on a metric.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric : str
        Metric column to compare.
    model_a : str
        First model type name.
    model_b : str
        Second model type name.
    task : str or None, optional
        Task filter. Default is None.
    split : str or None, optional
        Split filter. Default is None.
    seed_col : str or None, optional
        Optional seed column identifier (unused here).

    Returns
    -------
    dict or None
        Test summary including statistic, p-value, Cliff's delta, CI on differences; None if insufficient data.
    """
    data = df.copy()
    if task is not None:
        data = data[data["Task"] == task]
    if split is not None:
        data = data[data["Split"] == split]
    values_a = data[data["Model type"] == model_a][metric].values
    values_b = data[data["Model type"] == model_b][metric].values
    if len(values_a) == 0 or len(values_b) == 0:
        return None
    min_len = min(len(values_a), len(values_b))
    values_a = values_a[:min_len]
    values_b = values_b[:min_len]
    statistic, p_value = wilcoxon(values_a, values_b, alternative="two-sided")
    delta, effect_size_interpretation = cliffs_delta(values_a, values_b)
    differences = values_a - values_b
    median_diff = np.median(differences)
    ci_lower, ci_upper = bootstrap_ci(differences, np.median, n_bootstrap=1000)
    if ci_lower <= 0 <= ci_upper:
        practical_significance = "difference is small (CI includes 0)"
    elif abs(median_diff) < 0.1 * np.std(np.concatenate([values_a, values_b])):
        practical_significance = "difference is small"
    else:
        practical_significance = "difference may be meaningful"
    return {
        "model_a": model_a,
        "model_b": model_b,
        "metric": metric,
        "task": task,
        "split": split,
        "n_pairs": min_len,
        "wilcoxon_statistic": statistic,
        "p_value": p_value,
        "cliffs_delta": delta,
        "effect_size_interpretation": effect_size_interpretation,
        "median_difference": median_diff,
        "ci_lower": ci_lower,
        "ci_upper": ci_upper,
        "practical_significance": practical_significance,
    }
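
A minimal sketch comparing two model types on hypothetical per-fold RMSE values, with no Task/Split filtering:

import pandas as pd
from uqdd.metrics.stats import wilcoxon_pairwise_test

df = pd.DataFrame({
    "Model type": ["ensemble"] * 6 + ["mc_dropout"] * 6,
    "RMSE": [0.61, 0.63, 0.60, 0.64, 0.62, 0.59,
             0.66, 0.68, 0.65, 0.70, 0.67, 0.64],
})
result = wilcoxon_pairwise_test(df, metric="RMSE", model_a="ensemble", model_b="mc_dropout")
print(result["p_value"], result["cliffs_delta"], result["effect_size_interpretation"])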

uqdd.metrics.stats.holm_bonferroni_correction

holm_bonferroni_correction(p_values)

Apply Holm–Bonferroni correction to an array of p-values.

Parameters:

Name Type Description Default
p_values array-like

Raw p-values.

required

Returns:

Type Description
tuple[ndarray, ndarray]

(corrected_p_values, rejected_mask) where rejected indicates significance after correction.

Source code in uqdd/metrics/stats.py, lines 1002-1029
def holm_bonferroni_correction(p_values):
    """
    Apply Holm–Bonferroni correction to an array of p-values.

    Parameters
    ----------
    p_values : array-like
        Raw p-values.

    Returns
    -------
    tuple[numpy.ndarray, numpy.ndarray]
        (corrected_p_values, rejected_mask) where rejected indicates significance after correction.
    """
    p_values = np.array(p_values)
    n = len(p_values)
    sorted_indices = np.argsort(p_values)
    sorted_p_values = p_values[sorted_indices]
    corrected_p_values = np.zeros(n)
    rejected = np.zeros(n, dtype=bool)
    for i in range(n):
        correction_factor = n - i
        corrected_p_values[sorted_indices[i]] = min(1.0, sorted_p_values[i] * correction_factor)
        if corrected_p_values[sorted_indices[i]] < 0.05:
            rejected[sorted_indices[i]] = True
        else:
            break
    return corrected_p_values, rejected
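
A small worked example: with four p-values, only the smallest survives (0.01 × 4 = 0.04 < 0.05); the procedure stops as soon as 0.03 × 3 = 0.09 fails to reject.

from uqdd.metrics.stats import holm_bonferroni_correction

corrected, rejected = holm_bonferroni_correction([0.01, 0.04, 0.03, 0.20])
print(rejected)  # [ True False False False]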

uqdd.metrics.stats.pairwise_model_comparison

pairwise_model_comparison(df, metrics, models=None, tasks=None, splits=None, alpha=0.05)

Run pairwise Wilcoxon tests across models/tasks/splits for multiple metrics and adjust p-values.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metrics list of str

Metrics to compare.

required
models list of str or None

Models to include; default derives from data.

None
tasks list of str or None

Tasks to include; default derives from data.

None
splits list of str or None

Splits to include; default derives from data.

None
alpha float

Significance level. Default is 0.05.

0.05

Returns:

Type Description
DataFrame

Results table with corrected p-values and significance flags.

Source code in uqdd/metrics/stats.py, lines 1032-1079
def pairwise_model_comparison(df, metrics, models=None, tasks=None, splits=None, alpha=0.05):
    """
    Run pairwise Wilcoxon tests across models/tasks/splits for multiple metrics and adjust p-values.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metrics : list of str
        Metrics to compare.
    models : list of str or None, optional
        Models to include; default derives from data.
    tasks : list of str or None, optional
        Tasks to include; default derives from data.
    splits : list of str or None, optional
        Splits to include; default derives from data.
    alpha : float, optional
        Significance level. Default is 0.05.

    Returns
    -------
    pd.DataFrame
        Results table with corrected p-values and significance flags.
    """
    if models is None:
        models = df["Model type"].unique()
    if tasks is None:
        tasks = df["Task"].unique()
    if splits is None:
        splits = df["Split"].unique()
    results = []
    for metric in metrics:
        for task in tasks:
            for split in splits:
                for i, model_a in enumerate(models):
                    for j, model_b in enumerate(models):
                        if i < j:
                            result = wilcoxon_pairwise_test(df, metric, model_a, model_b, task, split)
                            if result is not None:
                                results.append(result)
    if not results:
        return pd.DataFrame()
    results_df = pd.DataFrame(results)
    p_values = results_df["p_value"].values
    corrected_p_values, rejected = holm_bonferroni_correction(p_values)
    results_df["corrected_p_value"] = corrected_p_values
    results_df["significant_after_correction"] = rejected
    return results_df
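
A minimal sketch on synthetic results; the model, task, and split names are hypothetical.

import numpy as np
import pandas as pd
from uqdd.metrics.stats import pairwise_model_comparison

rng = np.random.default_rng(0)
models, folds = ["ensemble", "mc_dropout", "evidential"], 8
df = pd.DataFrame({
    "Model type": np.repeat(models, folds),
    "Task": "pKi",
    "Split": "random",
    "RMSE": rng.normal(loc=np.repeat([0.60, 0.65, 0.70], folds), scale=0.02),
})
results = pairwise_model_comparison(df, metrics=["RMSE"])
print(results[["model_a", "model_b", "p_value", "corrected_p_value", "significant_after_correction"]])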

uqdd.metrics.stats.friedman_nemenyi_test

friedman_nemenyi_test(df, metrics, models=None, alpha=0.05)

Run Friedman test across models with Nemenyi post-hoc where significant, per metric.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metrics list of str

Metrics to test.

required
models list of str or None

Models to include; default derives from data.

None
alpha float

Significance level. Default is 0.05.

0.05

Returns:

Type Description
dict

Mapping metric -> result dict containing stats, p-values, mean ranks, and optional post-hoc outputs.

Source code in uqdd/metrics/stats.py, lines 1082-1136
def friedman_nemenyi_test(df, metrics, models=None, alpha=0.05):
    """
    Run Friedman test across models with Nemenyi post-hoc where significant, per metric.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metrics : list of str
        Metrics to test.
    models : list of str or None, optional
        Models to include; default derives from data.
    alpha : float, optional
        Significance level. Default is 0.05.

    Returns
    -------
    dict
        Mapping metric -> result dict containing stats, p-values, mean ranks, and optional post-hoc outputs.
    """
    if models is None:
        models = df["Model type"].unique()
    results = {}
    for metric in metrics:
        pivot_data = df.pivot_table(values=metric, index=["Task", "Split"], columns="Model type", aggfunc="mean")
        available_models = [m for m in models if m in pivot_data.columns]
        pivot_data = pivot_data[available_models]
        pivot_data = pivot_data.dropna()
        if pivot_data.shape[0] < 2 or pivot_data.shape[1] < 3:
            results[metric] = {"error": "Insufficient data for Friedman test", "data_shape": pivot_data.shape}
            continue
        try:
            friedman_stat, friedman_p = friedmanchisquare(*[pivot_data[col].values for col in pivot_data.columns])
            ranks = pivot_data.rank(axis=1, ascending=False)
            mean_ranks = ranks.mean()
            result = {
                "friedman_statistic": friedman_stat,
                "friedman_p_value": friedman_p,
                "mean_ranks": mean_ranks.to_dict(),
                "significant": friedman_p < alpha,
            }
            if friedman_p < alpha:
                try:
                    data_array = pivot_data.values
                    nemenyi_result = sp.posthoc_nemenyi_friedman(data_array.T)
                    nemenyi_result.index = available_models
                    nemenyi_result.columns = available_models
                    result["nemenyi_p_values"] = nemenyi_result.to_dict()
                    result["critical_difference"] = calculate_critical_difference(len(available_models), pivot_data.shape[0], alpha)
                except Exception as e:
                    result["nemenyi_error"] = str(e)
            results[metric] = result
        except Exception as e:
            results[metric] = {"error": str(e)}
    return results
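
A minimal sketch with one block per (Task, Split) combination and three models; the block values are hypothetical constants chosen so that the ranking is identical in every block.

import pandas as pd
from uqdd.metrics.stats import friedman_nemenyi_test, plot_critical_difference_diagram

rows = []
for task in ["pKi", "pIC50"]:
    for split in ["random", "scaffold"]:
        for model, rmse in [("ensemble", 0.60), ("mc_dropout", 0.65), ("evidential", 0.70)]:
            rows.append({"Task": task, "Split": split, "Model type": model, "RMSE": rmse})
df = pd.DataFrame(rows)
results = friedman_nemenyi_test(df, metrics=["RMSE"])
print(results["RMSE"]["friedman_p_value"], results["RMSE"]["mean_ranks"])
plot_critical_difference_diagram(results, "RMSE")  # diagram helper documented further below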

uqdd.metrics.stats.calculate_critical_difference

calculate_critical_difference(k, n, alpha=0.05)

Compute the critical difference for average ranks in Nemenyi post-hoc tests.

Parameters:

Name Type Description Default
k int

Number of models.

required
n int

Number of datasets/blocks.

required
alpha float

Significance level. Default is 0.05.

0.05

Returns:

Type Description
float

Critical difference value.

Source code in uqdd/metrics/stats.py, lines 1139-1160
def calculate_critical_difference(k, n, alpha=0.05):
    """
    Compute the critical difference for average ranks in Nemenyi post-hoc tests.

    Parameters
    ----------
    k : int
        Number of models.
    n : int
        Number of datasets/blocks.
    alpha : float, optional
        Significance level. Default is 0.05.

    Returns
    -------
    float
        Critical difference value.
    """
    from scipy.stats import studentized_range
    q_alpha = studentized_range.ppf(1 - alpha, k, np.inf) / np.sqrt(2)
    cd = q_alpha * np.sqrt(k * (k + 1) / (6 * n))
    return cd
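
For example, with k = 4 models compared over n = 10 datasets at alpha = 0.05, the studentized-range quantile is about 3.63, so q_alpha ≈ 3.63 / sqrt(2) ≈ 2.57 and CD ≈ 2.57 * sqrt(4 * 5 / (6 * 10)) ≈ 1.48.

from uqdd.metrics.stats import calculate_critical_difference

cd = calculate_critical_difference(k=4, n=10, alpha=0.05)
print(round(cd, 2))  # approximately 1.48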

uqdd.metrics.stats.bootstrap_auc_difference

bootstrap_auc_difference(auc_values_a, auc_values_b, n_bootstrap=1000, ci=95, random_state=42)

Bootstrap confidence interval for difference of mean AUCs between two models.

Parameters:

Name Type Description Default
auc_values_a array-like

AUC values for model A.

required
auc_values_b array-like

AUC values for model B.

required
n_bootstrap int

Number of bootstrap resamples. Default is 1000.

1000
ci int or float

Confidence level in percent. Default is 95.

95
random_state int

Seed for reproducibility. Default is 42.

42

Returns:

Type Description
dict

{'mean_difference', 'ci_lower', 'ci_upper', 'bootstrap_differences'}

Source code in uqdd/metrics/stats.py, lines 1163-1197
def bootstrap_auc_difference(auc_values_a, auc_values_b, n_bootstrap=1000, ci=95, random_state=42):
    """
    Bootstrap confidence interval for difference of mean AUCs between two models.

    Parameters
    ----------
    auc_values_a : array-like
        AUC values for model A.
    auc_values_b : array-like
        AUC values for model B.
    n_bootstrap : int, optional
        Number of bootstrap resamples. Default is 1000.
    ci : int or float, optional
        Confidence level in percent. Default is 95.
    random_state : int, optional
        Seed for reproducibility. Default is 42.

    Returns
    -------
    dict
        {'mean_difference', 'ci_lower', 'ci_upper', 'bootstrap_differences'}
    """
    np.random.seed(random_state)
    differences = []
    for _ in range(n_bootstrap):
        sample_a = resample(auc_values_a, random_state=np.random.randint(0, 10000))
        sample_b = resample(auc_values_b, random_state=np.random.randint(0, 10000))
        diff = np.mean(sample_a) - np.mean(sample_b)
        differences.append(diff)
    differences = np.array(differences)
    alpha = (100 - ci) / 2
    ci_lower = np.percentile(differences, alpha)
    ci_upper = np.percentile(differences, 100 - alpha)
    original_diff = np.mean(auc_values_a) - np.mean(auc_values_b)
    return {"mean_difference": original_diff, "ci_lower": ci_lower, "ci_upper": ci_upper, "bootstrap_differences": differences}
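
A minimal sketch on hypothetical per-fold AUC values; a confidence interval that excludes zero suggests the difference is consistent across resamples.

from uqdd.metrics.stats import bootstrap_auc_difference

auc_a = [0.81, 0.83, 0.80, 0.85, 0.82]
auc_b = [0.78, 0.79, 0.77, 0.80, 0.78]
res = bootstrap_auc_difference(auc_a, auc_b, n_bootstrap=2000)
print(f"{res['mean_difference']:.3f} [{res['ci_lower']:.3f}, {res['ci_upper']:.3f}]")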

uqdd.metrics.stats.plot_critical_difference_diagram

plot_critical_difference_diagram(friedman_results, metric, save_dir=None, alpha=0.05)

Plot a simple critical difference diagram using mean ranks and CD value.

Parameters:

Name Type Description Default
friedman_results dict

Output dictionary from friedman_nemenyi_test.

required
metric str

Metric to plot.

required
save_dir str or None

Directory to save the plot. Default is None.

None
alpha float

Significance level used to compute CD. Default is 0.05.

0.05

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py, lines 1200-1271
def plot_critical_difference_diagram(friedman_results, metric, save_dir=None, alpha=0.05):
    """
    Plot a simple critical difference diagram using mean ranks and CD value.

    Parameters
    ----------
    friedman_results : dict
        Output dictionary from friedman_nemenyi_test.
    metric : str
        Metric to plot.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    alpha : float, optional
        Significance level used to compute CD. Default is 0.05.

    Returns
    -------
    None
    """
    if metric not in friedman_results:
        print(f"Metric {metric} not found in Friedman results")
        return
    result = friedman_results[metric]
    if "error" in result:
        print(f"Error in Friedman test for {metric}: {result['error']}")
        return
    if not result["significant"]:
        print(f"Friedman test not significant for {metric}, skipping CD diagram")
        return
    mean_ranks = result["mean_ranks"]
    models = list(mean_ranks.keys())
    ranks = [mean_ranks[model] for model in models]
    sorted_indices = np.argsort(ranks)
    sorted_models = [models[i] for i in sorted_indices]
    sorted_ranks = [ranks[i] for i in sorted_indices]
    fig, ax = plt.subplots(figsize=(12, 6))
    y_pos = 0
    ax.scatter(sorted_ranks, [y_pos] * len(sorted_ranks), s=100, c="blue")
    for i, (model, rank) in enumerate(zip(sorted_models, sorted_ranks)):
        ax.annotate(model, (rank, y_pos), xytext=(0, 20), textcoords="offset points", ha="center", rotation=45)
    if "critical_difference" in result:
        cd = result["critical_difference"]
        groups = []
        for i, model_a in enumerate(sorted_models):
            group = [model_a]
            rank_a = sorted_ranks[i]
            for j, model_b in enumerate(sorted_models):
                if i != j:
                    rank_b = sorted_ranks[j]
                    if abs(rank_a - rank_b) <= cd:
                        if model_b not in [m for g in groups for m in g]:
                            group.append(model_b)
            if len(group) > 1:
                groups.append(group)
        colors = plt.cm.Set3(np.linspace(0, 1, len(groups)))
        for group, color in zip(groups, colors):
            if len(group) > 1:
                group_ranks = [sorted_ranks[sorted_models.index(m)] for m in group]
                min_rank, max_rank = min(group_ranks), max(group_ranks)
                ax.plot([min_rank, max_rank], [y_pos - 0.05, y_pos - 0.05], color=color, linewidth=3, alpha=0.7)
    ax.set_xlim(min(sorted_ranks) - 0.5, max(sorted_ranks) + 0.5)
    ax.set_ylim(-0.3, 0.5)
    ax.set_xlabel("Average Rank")
    ax.set_title(f"Critical Difference Diagram - {metric}")
    ax.grid(True, alpha=0.3)
    ax.set_yticks([])
    if save_dir:
        plot_name = f"critical_difference_{metric.replace(' ', '_')}"
        save_plot(fig, save_dir, plot_name)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.stats.analyze_significance

analyze_significance(df_raw, metrics, direction_dict, effect_dict, save_dir=None, model_order=None, activity=None)

End-to-end significance analysis and plotting across splits for multiple metrics.

Parameters:

Name Type Description Default
df_raw DataFrame

Raw results DataFrame.

required
metrics list of str

Metric names to analyze.

required
direction_dict dict

Mapping metric -> 'maximize'|'minimize'.

required
effect_dict dict

Mapping metric -> effect size threshold for visualization.

required
save_dir str or None

Directory to save plots and outputs. Default is None.

None
model_order list of str or None

Explicit ordering of models. Default derives from data.

None
activity str or None

Activity name for prefixes. Default is None.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py, lines 1274-1328
def analyze_significance(df_raw, metrics, direction_dict, effect_dict, save_dir=None, model_order=None, activity=None):
    """
    End-to-end significance analysis and plotting across splits for multiple metrics.

    Parameters
    ----------
    df_raw : pd.DataFrame
        Raw results DataFrame.
    metrics : list of str
        Metric names to analyze.
    direction_dict : dict
        Mapping metric -> 'maximize'|'minimize'.
    effect_dict : dict
        Mapping metric -> effect size threshold for visualization.
    save_dir : str or None, optional
        Directory to save plots and outputs. Default is None.
    model_order : list of str or None, optional
        Explicit ordering of models. Default derives from data.
    activity : str or None, optional
        Activity name for prefixes. Default is None.

    Returns
    -------
    None
    """
    df = harmonize_columns(df_raw)
    for metric in metrics:
        df[metric] = pd.to_numeric(df[metric], errors="coerce")
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    for split in df["split"].unique():
        df_s = df[df["split"] == split].copy()
        print(f"\n=== Split: {split} ===")
        name_prefix = f"06_{activity}_{split}" if activity else f"{split}"
        make_normality_diagnostic(df_s, metrics, save_dir=save_dir, name_prefix=name_prefix)
        for metric in metrics:
            print(f"\n-- Metric: {metric}")
            wide = df_s.pivot(index="cv_cycle", columns="method", values=metric)
            resid = (wide.T - wide.mean(axis=1)).T
            vals = resid.values.flatten()
            vals = vals[~np.isnan(vals)]
            W, p_norm = shapiro(vals) if len(vals) >= 3 else (None, None)
            if p_norm is None:
                print("Not enough data for Shapiro-Wilk test (need at least 3 non-NaN values), assuming non-normality")
            elif p_norm < 0.05:
                print(f"Shapiro-Wilk test for {metric} indicates non-normality (W={W:.3f}, p={p_norm:.3f})")
            else:
                print(f"Shapiro-Wilk test for {metric} indicates normality (W={W:.3f}, p={p_norm:.3f})")
        make_boxplots(df_s, metrics, save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)
        make_boxplots_parametric(df_s, metrics, save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)
        make_boxplots_nonparametric(df_s, metrics, save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)
        make_sign_plots_nonparametric(df_s, metrics, save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)
        make_critical_difference_diagrams(df_s, metrics, save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)
        make_mcs_plot_grid(df=df_s, stats_list=metrics, group_col="method", alpha=0.05, figsize=(30, 15), direction_dict=direction_dict, effect_dict=effect_dict, show_diff=True, sort_axes=True, save_dir=save_dir, name_prefix=name_prefix + "_diff", model_order=model_order)
        make_mcs_plot_grid(df=df_s, stats_list=metrics, group_col="method", alpha=0.05, figsize=(30, 15), direction_dict=direction_dict, effect_dict=effect_dict, show_diff=False, sort_axes=True, save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)
        make_ci_plot_grid(df_s, metrics, group_col="method", save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)

uqdd.metrics.stats.comprehensive_statistical_analysis

comprehensive_statistical_analysis(df, metrics, models=None, tasks=None, splits=None, save_dir=None, alpha=0.05)

Run a comprehensive suite of statistical tests and export results.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metrics list of str

Metrics to analyze.

required
models list of str or None

Models to include. Default derives from data.

None
tasks list of str or None

Tasks to include. Default derives from data.

None
splits list of str or None

Splits to include. Default derives from data.

None
save_dir str or None

Directory to save tables and JSON outputs. Default is None.

None
alpha float

Significance level. Default is 0.05.

0.05

Returns:

Type Description
dict

Results dict including pairwise tests, Friedman/Nemenyi outputs, and optional AUC bootstrap comparisons.

Source code in uqdd/metrics/stats.py, lines 1331-1406
def comprehensive_statistical_analysis(df, metrics, models=None, tasks=None, splits=None, save_dir=None, alpha=0.05):
    """
    Run a comprehensive suite of statistical tests and export results.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metrics : list of str
        Metrics to analyze.
    models : list of str or None, optional
        Models to include. Default derives from data.
    tasks : list of str or None, optional
        Tasks to include. Default derives from data.
    splits : list of str or None, optional
        Splits to include. Default derives from data.
    save_dir : str or None, optional
        Directory to save tables and JSON outputs. Default is None.
    alpha : float, optional
        Significance level. Default is 0.05.

    Returns
    -------
    dict
        Results dict including pairwise tests, Friedman/Nemenyi outputs, and optional AUC bootstrap comparisons.
    """
    print("Performing comprehensive statistical analysis...")
    results = {}
    print("1. Running pairwise Wilcoxon signed-rank tests...")
    pairwise_results = pairwise_model_comparison(df, metrics, models, tasks, splits, alpha)
    results["pairwise_tests"] = pairwise_results
    print("2. Running Friedman tests with Nemenyi post-hoc...")
    friedman_results = friedman_nemenyi_test(df, metrics, models, alpha)
    results["friedman_nemenyi"] = friedman_results
    auc_columns = [col for col in df.columns if "AUC" in col or "auc" in col]
    if auc_columns:
        print("3. Running bootstrap comparisons for AUC metrics...")
        auc_bootstrap_results = {}
        for auc_col in auc_columns:
            auc_bootstrap_results[auc_col] = {}
            available_models = df["Model type"].unique() if models is None else models
            for i, model_a in enumerate(available_models):
                for j, model_b in enumerate(available_models):
                    if i < j:
                        auc_a = df[df["Model type"] == model_a][auc_col].dropna().values
                        auc_b = df[df["Model type"] == model_b][auc_col].dropna().values
                        if len(auc_a) > 0 and len(auc_b) > 0:
                            bootstrap_result = bootstrap_auc_difference(auc_a, auc_b)
                            auc_bootstrap_results[auc_col][f"{model_a}_vs_{model_b}"] = bootstrap_result
        results["auc_bootstrap"] = auc_bootstrap_results
    if save_dir:
        os.makedirs(save_dir, exist_ok=True)
        if not pairwise_results.empty:
            pairwise_results.to_csv(os.path.join(save_dir, "pairwise_statistical_tests.csv"), index=False)
        import json
        with open(os.path.join(save_dir, "friedman_nemenyi_results.json"), "w") as f:
            json_compatible_results = {}
            for metric, result in friedman_results.items():
                json_compatible_results[metric] = {}
                for key, value in result.items():
                    if isinstance(value, (np.ndarray, np.generic)):
                        json_compatible_results[metric][key] = value.tolist()
                    elif isinstance(value, dict):
                        json_compatible_results[metric][key] = {str(k): (float(v) if isinstance(v, (np.ndarray, np.generic)) else v) for k, v in value.items()}
                    else:
                        json_compatible_results[metric][key] = (float(value) if isinstance(value, (np.ndarray, np.generic)) else value)
            json.dump(json_compatible_results, f, indent=2)
        if auc_columns:
            with open(os.path.join(save_dir, "auc_bootstrap_results.json"), "w") as f:
                json_compatible_auc = {}
                for auc_col, comparisons in results["auc_bootstrap"].items():
                    json_compatible_auc[auc_col] = {}
                    for comparison, result in comparisons.items():
                        json_compatible_auc[auc_col][comparison] = {k: v.tolist() if isinstance(v, np.ndarray) else v for k, v in result.items()}
                json.dump(json_compatible_auc, f, indent=2)
    return results

uqdd.metrics.stats.generate_statistical_report

generate_statistical_report(results, save_dir=None, df_raw=None, metrics=None, direction_dict=None, effect_dict=None)

Generate a human-readable text report from comprehensive statistical results and optionally run plots.

Parameters:

Name Type Description Default
results dict

Output of comprehensive_statistical_analysis.

required
save_dir str or None

Directory to save the report text file. Default is None.

None
df_raw DataFrame or None

Raw DataFrame to run plotting-based significance analysis. Default is None.

None
metrics list of str or None

Metrics to plot (when df_raw provided).

None
direction_dict dict or None

Direction mapping for metrics (required when df_raw provided).

None
effect_dict dict or None

Effect threshold mapping (required when df_raw provided).

None

Returns:

Type Description
str

Report text.

Source code in uqdd/metrics/stats.py, lines 1409-1505
def generate_statistical_report(results, save_dir=None, df_raw=None, metrics=None, direction_dict=None, effect_dict=None):
    """
    Generate a human-readable text report from comprehensive statistical results and optionally run plots.

    Parameters
    ----------
    results : dict
        Output of comprehensive_statistical_analysis.
    save_dir : str or None, optional
        Directory to save the report text file. Default is None.
    df_raw : pd.DataFrame or None, optional
        Raw DataFrame to run plotting-based significance analysis. Default is None.
    metrics : list of str or None, optional
        Metrics to plot (when df_raw provided).
    direction_dict : dict or None, optional
        Direction mapping for metrics (required when df_raw provided).
    effect_dict : dict or None, optional
        Effect threshold mapping (required when df_raw provided).

    Returns
    -------
    str
        Report text.
    """
    report = []
    report.append("=" * 80)
    report.append("COMPREHENSIVE STATISTICAL ANALYSIS REPORT")
    report.append("=" * 80)
    report.append("")
    if "pairwise_tests" in results and not results["pairwise_tests"].empty:
        pairwise_df = results["pairwise_tests"]
        report.append("1. PAIRWISE MODEL COMPARISONS (Wilcoxon Signed-Rank Test)")
        report.append("-" * 60)
        significant = pairwise_df[pairwise_df["significant_after_correction"] == True]
        report.append(f"Total pairwise comparisons performed: {len(pairwise_df)}")
        report.append(f"Significant differences (after Holm-Bonferroni correction): {len(significant)}")
        report.append("")
        if len(significant) > 0:
            report.append("Significant differences found:")
            for _, row in significant.iterrows():
                effect_size = row["effect_size_interpretation"]
                report.append(f"  • {row['model_a']} vs {row['model_b']} ({row['metric']}, {row['split']}):")
                report.append(f"    - p-value: {row['p_value']:.4f} (corrected: {row['corrected_p_value']:.4f})")
                report.append(f"    - Cliff's Δ: {row['cliffs_delta']:.3f} ({effect_size} effect)")
                report.append(f"    - Median difference: {row['median_difference']:.4f} [{row['ci_lower']:.4f}, {row['ci_upper']:.4f}]")
                report.append(f"    - {row['practical_significance']}")
                report.append("")
        else:
            report.append("No significant differences found after multiple comparison correction.")
            report.append("")
    if "friedman_nemenyi" in results:
        friedman_results = results["friedman_nemenyi"]
        report.append("2. MULTIPLE MODEL COMPARISONS (Friedman + Nemenyi Tests)")
        report.append("-" * 60)
        for metric, result in friedman_results.items():
            if "error" in result:
                report.append(f"{metric}: {result['error']}")
                continue
            report.append(f"Metric: {metric}")
            report.append(f"  Friedman test p-value: {result['friedman_p_value']:.4f}")
            if result["significant"]:
                report.append("  Result: Significant difference between models detected")
                mean_ranks = result["mean_ranks"]
                sorted_ranks = sorted(mean_ranks.items(), key=lambda x: x[1])
                report.append("  Model rankings (lower rank = better performance):")
                for i, (model, rank) in enumerate(sorted_ranks, 1):
                    report.append(f"    {i}. {model}: {rank:.2f}")
                if "critical_difference" in result:
                    report.append(f"  Critical difference: {result['critical_difference']:.3f}")
            else:
                report.append("  Result: No significant difference between models")
            report.append("")
    if "auc_bootstrap" in results:
        auc_results = results["auc_bootstrap"]
        report.append("3. AUC BOOTSTRAP COMPARISONS")
        report.append("-" * 60)
        for auc_col, comparisons in auc_results.items():
            report.append(f"AUC Metric: {auc_col}")
            for comparison, result in comparisons.items():
                model_a, model_b = comparison.split("_vs_")
                mean_diff = result["mean_difference"]
                ci_lower = result["ci_lower"]
                ci_upper = result["ci_upper"]
                significance = "difference is small (CI includes 0)" if (ci_lower <= 0 <= ci_upper) else "difference may be meaningful"
                report.append(f"  {model_a} vs {model_b}:")
                report.append(f"    Mean difference: {mean_diff:.4f} [{ci_lower:.4f}, {ci_upper:.4f}]")
                report.append(f"    {significance}")
            report.append("")
    report_text = "\n".join(report)
    if save_dir:
        os.makedirs(save_dir, exist_ok=True)
        with open(os.path.join(save_dir, "statistical_analysis_report.txt"), "w") as f:
            f.write(report_text)
    print(report_text)
    if df_raw is not None and metrics is not None and direction_dict is not None and effect_dict is not None:
        analyze_significance(df_raw, metrics, direction_dict, effect_dict, save_dir=save_dir)
    return report_text
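
An end-to-end sketch on synthetic results (hypothetical model, task, and split names); the report and intermediate tables are written to save_dir.

import numpy as np
import pandas as pd
from uqdd.metrics.stats import comprehensive_statistical_analysis, generate_statistical_report

rng = np.random.default_rng(0)
rows = []
for task in ["pKi", "pIC50"]:
    for split in ["random", "scaffold"]:
        for model, base in [("ensemble", 0.60), ("mc_dropout", 0.65), ("evidential", 0.70)]:
            for _ in range(8):  # eight hypothetical CV folds
                rows.append({"Model type": model, "Task": task, "Split": split,
                             "RMSE": base + rng.normal(0, 0.02)})
df = pd.DataFrame(rows)
results = comprehensive_statistical_analysis(df, metrics=["RMSE"], save_dir="stats_out")
report = generate_statistical_report(results, save_dir="stats_out")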