Metrics

uqdd.metrics

Metrics subpackage for UQDD

The uqdd.metrics subpackage provides tools to compute, analyze, and visualize performance and uncertainty metrics for UQDD models. It includes plotting and analysis helpers, statistical testing routines, and reassessment utilities to benchmark and compare models rigorously.

Modules:

- ``constants``: Canonical metric names, grouping orders, hatches, and helper
  structures to standardize plots and reports.
- ``analysis``: Functions for aggregating results, loading predictions,
  computing rejection curves, and producing comparison plots and calibration visualizations.
- ``stats``: Statistical metrics and tests (bootstrapping, Wilcoxon, Holm–Bonferroni,
  Friedman–Nemenyi, Cliff's delta), along with boxplots, curve plots, and significance
  analysis/reporting.
- ``reassessment``: Utilities to reassess runs and models (e.g. evidential models),
  export predictions, and post-process metrics from CSV.

Public API

Commonly used names are re-exported for convenient access via uqdd.metrics.<name>. They are grouped by module below for discoverability.

  • Constants: group_cols, numeric_cols, string_cols, order_by, group_order, group_order_no_time, hatches_dict, hatches_dict_no_time, accmetrics, accmetrics2, uctmetrics, uctmetrics2

  • Analysis: aggregate_results_csv, save_plot, handle_inf_values, plot_pairplot, plot_line_metrics, plot_histogram_metrics, plot_pairwise_scatter_metrics, plot_metrics, find_highly_correlated_metrics, plot_comparison_metrics, load_and_aggregate_calibration_data, plot_calibration_data, move_model_folders, load_predictions, calculate_rmse_rejection_curve, calculate_rejection_curve, get_handles_labels, plot_rmse_rejection_curves, plot_auc_comparison, save_stats_df, load_stats_df

  • Statistics: calc_regression_metrics, bootstrap_ci, rm_tukey_hsd, make_boxplots, make_boxplots_parametric, make_boxplots_nonparametric, make_sign_plots_nonparametric, make_critical_difference_diagrams, make_normality_diagnostic, mcs_plot, make_mcs_plot_grid, make_scatterplot, ci_plot, make_ci_plot_grid, recall_at_precision, calc_classification_metrics, make_curve_plots, harmonize_columns, cliffs_delta, wilcoxon_pairwise_test, holm_bonferroni_correction, pairwise_model_comparison, friedman_nemenyi_test, calculate_critical_difference, bootstrap_auc_difference, plot_critical_difference_diagram, analyze_significance, comprehensive_statistical_analysis, generate_statistical_report

  • Reassessment: nll_evidentials, convert_to_list, preprocess_runs, get_model_class, get_predict_fn, get_preds, pkl_preds_export, csv_nll_post_processing, reassess_metrics

Usage Notes
  • Reproducibility: Prefer functions that accept random seeds and write diagnostics under uqdd/logs; capture versions and configurations for statistical comparisons.
  • Data paths: Use the global paths from uqdd.__init__ to keep file/plot outputs consistent.
  • Plot styles: Use constants from metrics.constants to standardize the look and ordering across figures.
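
For example, a minimal access sketch (assuming uqdd is installed and importable):

# Re-exported names are available directly from the subpackage.
from uqdd.metrics import accmetrics, aggregate_results_csv, plot_metrics

print(accmetrics)  # ['RMSE', 'R2', 'MAE', 'MDAE', 'MARPD', 'PCC']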

uqdd.metrics.group_cols module-attribute

group_cols = ['Model type', 'Task', 'Activity', 'Split', 'desc_prot', 'desc_chem', 'dropout']

uqdd.metrics.numeric_cols module-attribute

numeric_cols = ['RMSE', 'R2', 'MAE', 'MDAE', 'MARPD', 'PCC', 'RMS Calibration', 'MA Calibration', 'Miscalibration Area', 'Sharpness', 'NLL', 'CRPS', 'Check', 'Interval', 'rho_rank', 'rho_rank_sim', 'rho_rank_sim_std', 'uq_mis_cal', 'uq_NLL', 'uq_NLL_sim', 'uq_NLL_sim_std', 'Z_var', 'Z_var_CI_low', 'Z_var_CI_high', 'Z_mean', 'Z_mean_CI_low', 'Z_mean_CI_high', 'rmv_rmse_slope', 'rmv_rmse_r_sq', 'rmv_rmse_intercept', 'aleatoric_uct_mean', 'epistemic_uct_mean', 'total_uct_mean']

uqdd.metrics.string_cols module-attribute

string_cols = ['wandb project', 'wandb run', 'model name']

uqdd.metrics.order_by module-attribute

order_by = ['Split', 'Model type']

uqdd.metrics.group_order module-attribute

group_order = ['stratified_pnn', 'stratified_ensemble', 'stratified_mcdropout', 'stratified_evidential', 'stratified_eoe', 'stratified_emc', 'scaffold_cluster_pnn', 'scaffold_cluster_ensemble', 'scaffold_cluster_mcdropout', 'scaffold_cluster_evidential', 'scaffold_cluster_eoe', 'scaffold_cluster_emc', 'time_pnn', 'time_ensemble', 'time_mcdropout', 'time_evidential', 'time_eoe', 'time_emc']

uqdd.metrics.group_order_no_time module-attribute

group_order_no_time = ['stratified_pnn', 'stratified_ensemble', 'stratified_mcdropout', 'stratified_evidential', 'stratified_eoe', 'stratified_emc', 'scaffold_cluster_pnn', 'scaffold_cluster_ensemble', 'scaffold_cluster_mcdropout', 'scaffold_cluster_evidential', 'scaffold_cluster_eoe', 'scaffold_cluster_emc']

uqdd.metrics.hatches_dict module-attribute

hatches_dict = {'stratified': '\\\\', 'scaffold_cluster': '', 'time': '...'}

uqdd.metrics.hatches_dict_no_time module-attribute

hatches_dict_no_time = {'stratified': '\\\\', 'scaffold_cluster': ''}

uqdd.metrics.accmetrics module-attribute

accmetrics = ['RMSE', 'R2', 'MAE', 'MDAE', 'MARPD', 'PCC']

uqdd.metrics.accmetrics2 module-attribute

accmetrics2 = ['RMSE', 'R2', 'PCC']

uqdd.metrics.uctmetrics module-attribute

uctmetrics = ['RMS Calibration', 'MA Calibration', 'Miscalibration Area', 'Sharpness', 'CRPS', 'Check', 'NLL', 'Interval']

uqdd.metrics.uctmetrics2 module-attribute

uctmetrics2 = ['Miscalibration Area', 'Sharpness', 'CRPS', 'NLL', 'Interval']

uqdd.metrics.__all__ module-attribute

__all__ = ['group_cols', 'numeric_cols', 'string_cols', 'order_by', 'group_order', 'group_order_no_time', 'hatches_dict', 'hatches_dict_no_time', 'accmetrics', 'accmetrics2', 'uctmetrics', 'uctmetrics2', 'aggregate_results_csv', 'save_plot', 'handle_inf_values', 'plot_pairplot', 'plot_line_metrics', 'plot_histogram_metrics', 'plot_pairwise_scatter_metrics', 'plot_metrics', 'find_highly_correlated_metrics', 'plot_comparison_metrics', 'load_and_aggregate_calibration_data', 'plot_calibration_data', 'move_model_folders', 'load_predictions', 'calculate_rmse_rejection_curve', 'calculate_rejection_curve', 'get_handles_labels', 'plot_rmse_rejection_curves', 'plot_auc_comparison', 'save_stats_df', 'load_stats_df', 'calc_regression_metrics', 'bootstrap_ci', 'rm_tukey_hsd', 'make_boxplots', 'make_boxplots_parametric', 'make_boxplots_nonparametric', 'make_sign_plots_nonparametric', 'make_critical_difference_diagrams', 'make_normality_diagnostic', 'mcs_plot', 'make_mcs_plot_grid', 'make_scatterplot', 'ci_plot', 'make_ci_plot_grid', 'recall_at_precision', 'calc_classification_metrics', 'make_curve_plots', 'harmonize_columns', 'cliffs_delta', 'wilcoxon_pairwise_test', 'holm_bonferroni_correction', 'pairwise_model_comparison', 'friedman_nemenyi_test', 'calculate_critical_difference', 'bootstrap_auc_difference', 'plot_critical_difference_diagram', 'analyze_significance', 'comprehensive_statistical_analysis', 'generate_statistical_report', 'nll_evidentials', 'convert_to_list', 'preprocess_runs', 'get_model_class', 'get_predict_fn', 'get_preds', 'pkl_preds_export', 'csv_nll_post_processing', 'reassess_metrics']

uqdd.metrics.aggregate_results_csv

aggregate_results_csv(df: DataFrame, group_cols: List[str], numeric_cols: List[str], string_cols: List[str], order_by: Optional[Union[str, List[str]]] = None, output_file_path: Optional[str] = None) -> pd.DataFrame

Aggregate metrics by groups and export a compact CSV summary.

Parameters:

Name Type Description Default
df DataFrame

Input results DataFrame.

required
group_cols list of str

Column names to group by.

required
numeric_cols list of str

Numeric metric columns to aggregate with mean and std.

required
string_cols list of str

String columns to aggregate as lists.

required
order_by str or list of str or None

Column(s) to sort the final aggregated DataFrame by. Default is None.

None
output_file_path str or None

Path to write the aggregated CSV. If None, no file is written.

None

Returns:

Type Description
DataFrame

Aggregated DataFrame with combined mean(std) strings plus string/list aggregates.

Notes
  • A helper column project_model is constructed and included in the aggregates.
  • When output_file_path is provided, the function ensures the directory exists.
Source code in uqdd/metrics/analysis.py
def aggregate_results_csv(
    df: pd.DataFrame,
    group_cols: List[str],
    numeric_cols: List[str],
    string_cols: List[str],
    order_by: Optional[Union[str, List[str]]] = None,
    output_file_path: Optional[str] = None,
) -> pd.DataFrame:
    """
    Aggregate metrics by groups and export a compact CSV summary.

    Parameters
    ----------
    df : pd.DataFrame
        Input results DataFrame.
    group_cols : list of str
        Column names to group by.
    numeric_cols : list of str
        Numeric metric columns to aggregate with mean and std.
    string_cols : list of str
        String columns to aggregate as lists.
    order_by : str or list of str or None, optional
        Column(s) to sort the final aggregated DataFrame by. Default is None.
    output_file_path : str or None, optional
        Path to write the aggregated CSV. If None, no file is written.

    Returns
    -------
    pd.DataFrame
        Aggregated DataFrame with combined mean(std) strings plus string/list aggregates.

    Notes
    -----
    - A helper column `project_model` is constructed and included in the aggregates.
    - When `output_file_path` is provided, the function ensures the directory exists.
    """
    grouped = df.groupby(group_cols)
    aggregated = grouped[numeric_cols].agg(["mean", "std"])
    for col in numeric_cols:
        aggregated[(col, "combined")] = (
            aggregated[(col, "mean")].round(3).astype(str)
            + "("
            + aggregated[(col, "std")].round(3).astype(str)
            + ")"
        )
    aggregated = aggregated[[col for col in aggregated.columns if col[1] == "combined"]]
    aggregated.columns = [col[0] for col in aggregated.columns]

    string_aggregated = grouped[string_cols].agg(lambda x: list(x))

    df["project_model"] = (
        "papyrus"
        + "/"
        + df["Activity"]
        + "/"
        + "all"
        + "/"
        + df["wandb project"]
        + "/"
        + df["model name"]
        + "/"
    )
    project_model_aggregated = grouped["project_model"].agg(lambda x: list(x))

    final_aggregated = pd.concat(
        [aggregated, string_aggregated, project_model_aggregated], axis=1
    ).reset_index()

    if order_by:
        final_aggregated = final_aggregated.sort_values(by=order_by)

    if output_file_path:
        os.makedirs(os.path.dirname(output_file_path), exist_ok=True)
        final_aggregated.to_csv(output_file_path, index=False)

    return final_aggregated
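
A minimal usage sketch (illustrative data; besides the grouping columns, the frame must also carry 'Activity', 'wandb project', and 'model name' so the project_model helper column can be built):

import pandas as pd
from uqdd.metrics import aggregate_results_csv

# Tiny illustrative results table; real inputs come from exported run metrics.
df = pd.DataFrame({
    "Model type": ["pnn", "pnn"],
    "Split": ["stratified", "stratified"],
    "Activity": ["pchembl", "pchembl"],
    "RMSE": [0.71, 0.69],
    "R2": [0.55, 0.58],
    "wandb project": ["proj", "proj"],
    "wandb run": ["run1", "run2"],
    "model name": ["m1", "m2"],
})
summary = aggregate_results_csv(
    df,
    group_cols=["Model type", "Split"],
    numeric_cols=["RMSE", "R2"],
    string_cols=["wandb project", "wandb run", "model name"],
    order_by="Model type",
    output_file_path=None,  # set a path to also write the CSV
)
print(summary)  # one row per group with 'mean(std)' strings per metric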

uqdd.metrics.save_plot

save_plot(fig: Figure, save_dir: Optional[str], plot_name: str, tighten: bool = True, show_legend: bool = False) -> None

Save a matplotlib figure to PNG, SVG, and PDF with optional tight layout.

Parameters:

Name Type Description Default
fig Figure

Figure to save.

required
save_dir str or None

Directory to save the figure files. If None, no files are written.

required
plot_name str

Base filename (without extension).

required
tighten bool

If True, apply tight_layout and bbox_inches="tight". Default is True.

True
show_legend bool

If False, remove legend before saving. Default is False.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def save_plot(
    fig: plt.Figure,
    save_dir: Optional[str],
    plot_name: str,
    tighten: bool = True,
    show_legend: bool = False,
) -> None:
    """
    Save a matplotlib figure to PNG, SVG, and PDF with optional tight layout.

    Parameters
    ----------
    fig : matplotlib.figure.Figure
        Figure to save.
    save_dir : str or None
        Directory to save the figure files. If None, no files are written.
    plot_name : str
        Base filename (without extension).
    tighten : bool, optional
        If True, apply tight_layout and bbox_inches="tight". Default is True.
    show_legend : bool, optional
        If False, remove legend before saving. Default is False.

    Returns
    -------
    None
    """
    ax = fig.gca()
    if not show_legend:
        legend = ax.get_legend()
        if legend is not None:
            legend.remove()
    if tighten:
        try:
            with warnings.catch_warnings():
                warnings.filterwarnings(
                    "ignore",
                    message="This figure includes Axes that are not compatible with tight_layout",
                )
                fig.tight_layout()
        except (ValueError, RuntimeError):
            fig.subplots_adjust(left=0.1, right=0.9, top=0.9, bottom=0.1)

    if save_dir and tighten:
        os.makedirs(save_dir, exist_ok=True)
        fig.savefig(os.path.join(save_dir, f"{plot_name}.png"), dpi=300, bbox_inches="tight")
        fig.savefig(os.path.join(save_dir, f"{plot_name}.svg"), bbox_inches="tight")
        fig.savefig(os.path.join(save_dir, f"{plot_name}.pdf"), dpi=300, bbox_inches="tight")
    elif save_dir and not tighten:
        os.makedirs(save_dir, exist_ok=True)
        fig.savefig(os.path.join(save_dir, f"{plot_name}.png"), dpi=300)
        fig.savefig(os.path.join(save_dir, f"{plot_name}.svg"))
        fig.savefig(os.path.join(save_dir, f"{plot_name}.pdf"), dpi=300)
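
A minimal sketch (the directory name is illustrative):

import matplotlib.pyplot as plt
from uqdd.metrics import save_plot

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1], label="identity")
ax.legend()
# Writes plots/demo_curve.png, .svg, and .pdf; the legend is removed unless show_legend=True.
save_plot(fig, save_dir="plots", plot_name="demo_curve", tighten=True, show_legend=False)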

uqdd.metrics.handle_inf_values

handle_inf_values(df: DataFrame) -> pd.DataFrame

Replace +/- infinity values in a DataFrame with NaN.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required

Returns:

Type Description
DataFrame

DataFrame with infinite values replaced by NaN.

Source code in uqdd/metrics/analysis.py
def handle_inf_values(df: pd.DataFrame) -> pd.DataFrame:
    """
    Replace +/- infinity values in a DataFrame with NaN.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.

    Returns
    -------
    pd.DataFrame
        DataFrame with infinite values replaced by NaN.
    """
    return df.replace([float("inf"), -float("inf")], float("nan"))
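
For example:

import pandas as pd
from uqdd.metrics import handle_inf_values

df = pd.DataFrame({"NLL": [1.2, float("inf"), -float("inf")]})
print(handle_inf_values(df))  # both infinities become NaN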

uqdd.metrics.plot_pairplot

plot_pairplot(df: DataFrame, title: str, metrics: List[str], save_dir: Optional[str] = None, cmap: str = 'viridis', group_order: Optional[List[str]] = group_order, show_legend: bool = False) -> None

Plot a seaborn pairplot for a set of metrics colored by Group.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing the metrics and a 'Group' column.

required
title str

Plot title.

required
metrics list of str

Metric column names to include in the pairplot.

required
save_dir str or None

Directory to save plot images. Default is None.

None
cmap str

Seaborn/matplotlib palette name. Default is "viridis".

'viridis'
group_order list of str or None

Order of class labels in the legend. Default is from constants.

group_order
show_legend bool

If True, keep the legend; otherwise it will be removed before saving.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def plot_pairplot(
    df: pd.DataFrame,
    title: str,
    metrics: List[str],
    save_dir: Optional[str] = None,
    cmap: str = "viridis",
    group_order: Optional[List[str]] = group_order,
    show_legend: bool = False,
) -> None:
    """
    Plot a seaborn pairplot for a set of metrics colored by Group.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame containing the metrics and a 'Group' column.
    title : str
        Plot title.
    metrics : list of str
        Metric column names to include in the pairplot.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    cmap : str, optional
        Seaborn/matplotlib palette name. Default is "viridis".
    group_order : list of str or None, optional
        Order of class labels in the legend. Default is from constants.
    show_legend : bool, optional
        If True, keep the legend; otherwise it will be removed before saving.

    Returns
    -------
    None
    """
    df = handle_inf_values(df)
    sns.pairplot(
        df,
        hue="Group",
        hue_order=group_order,
        vars=metrics,
        palette=cmap,
        plot_kws={"alpha": 0.7},
    )
    plt.suptitle(title, y=1.02)
    plot_name = f"pairplot_{title.replace(' ', '_')}"
    save_plot(plt.gcf(), save_dir, plot_name, tighten=False, show_legend=show_legend)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
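
A usage sketch with illustrative data (the sibling helpers plot_line_metrics, plot_histogram_metrics, and plot_pairwise_scatter_metrics take the same df/title/metrics/save_dir arguments):

import pandas as pd
from uqdd.metrics import plot_pairplot, accmetrics2

# Metric columns plus the 'Group' label used for coloring.
df = pd.DataFrame({
    "RMSE": [0.70, 0.72, 0.90, 0.88],
    "R2": [0.60, 0.58, 0.40, 0.42],
    "PCC": [0.78, 0.77, 0.65, 0.66],
    "Group": ["stratified_pnn"] * 2 + ["time_pnn"] * 2,
})
plot_pairplot(df, title="Accuracy metrics", metrics=accmetrics2, save_dir="plots", group_order=None)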

uqdd.metrics.plot_line_metrics

plot_line_metrics(df: DataFrame, title: str, metrics: List[str], save_dir: Optional[str] = None, group_order: Optional[List[str]] = group_order, show_legend: bool = False) -> None

Plot line charts of metrics over runs, colored by Group.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with 'wandb run', metrics, and 'Group'.

required
title str

Plot title.

required
metrics list of str

Metric column names to plot.

required
save_dir str or None

Directory to save plot images. Default is None.

None
group_order list of str or None

Order of class labels in the legend. Default is from constants.

group_order
show_legend bool

If True, keep the legend; otherwise it will be removed before saving.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def plot_line_metrics(
    df: pd.DataFrame,
    title: str,
    metrics: List[str],
    save_dir: Optional[str] = None,
    group_order: Optional[List[str]] = group_order,
    show_legend: bool = False,
) -> None:
    """
    Plot line charts of metrics over runs, colored by Group.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with 'wandb run', metrics, and 'Group'.
    title : str
        Plot title.
    metrics : list of str
        Metric column names to plot.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    group_order : list of str or None, optional
        Order of class labels in the legend. Default is from constants.
    show_legend : bool, optional
        If True, keep the legend; otherwise it will be removed before saving.

    Returns
    -------
    None
    """
    df = handle_inf_values(df)
    for metric in metrics:
        plt.figure(figsize=(14, 7))
        sns.lineplot(
            data=df,
            x="wandb run",
            y=metric,
            hue="Group",
            marker="o",
            palette="Set2",
            hue_order=group_order,
            label=metric,
        )
        plt.title(f"{title} - {metric}")
        plt.xticks(rotation=45)
        plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
        if INTERACTIVE_MODE:
            plt.show()
        plot_name = f"line_{title.replace(' ', '_')}_{metric}"
        save_plot(plt.gcf(), save_dir, plot_name, tighten=False, show_legend=show_legend)
        plt.close()

uqdd.metrics.plot_histogram_metrics

plot_histogram_metrics(df: DataFrame, title: str, metrics: List[str], save_dir: Optional[str] = None, group_order: Optional[List[str]] = group_order, cmap: str = 'crest', show_legend: bool = False) -> None

Plot histograms with KDE for metrics, split by Group.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with metrics and 'Group'.

required
title str

Plot title.

required
metrics list of str

Metric column names to plot.

required
save_dir str or None

Directory to save plot images. Default is None.

None
group_order list of str or None

Order of class labels in the legend. Default is from constants.

group_order
cmap str

Seaborn/matplotlib palette name. Default is "crest".

'crest'
show_legend bool

If True, keep the legend; otherwise it will be removed before saving.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def plot_histogram_metrics(
    df: pd.DataFrame,
    title: str,
    metrics: List[str],
    save_dir: Optional[str] = None,
    group_order: Optional[List[str]] = group_order,
    cmap: str = "crest",
    show_legend: bool = False,
) -> None:
    """
    Plot histograms with KDE for metrics, split by Group.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with metrics and 'Group'.
    title : str
        Plot title.
    metrics : list of str
        Metric column names to plot.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    group_order : list of str or None, optional
        Order of class labels in the legend. Default is from constants.
    cmap : str, optional
        Seaborn/matplotlib palette name. Default is "crest".
    show_legend : bool, optional
        If True, keep the legend; otherwise it will be removed before saving.

    Returns
    -------
    None
    """
    df = handle_inf_values(df)
    for metric in metrics:
        plt.figure(figsize=(14, 7))
        sns.histplot(
            data=df,
            x=metric,
            hue="Group",
            kde=True,
            palette=cmap,
            element="step",
            hue_order=group_order,
            fill=True,
            alpha=0.7,
        )
        plt.title(f"{title} - {metric}")
        if INTERACTIVE_MODE:
            plt.show()
        plot_name = f"histogram_{title.replace(' ', '_')}_{metric}"
        save_plot(plt.gcf(), save_dir, plot_name, show_legend=show_legend)
        plt.close()

uqdd.metrics.plot_pairwise_scatter_metrics

plot_pairwise_scatter_metrics(df: DataFrame, title: str, metrics: List[str], save_dir: Optional[str] = None, group_order: Optional[List[str]] = group_order, cmap: str = 'tab10_r', show_legend: bool = False) -> None

Plot pairwise scatterplots for all metric combinations, colored by Group.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with metrics and 'Group'.

required
title str

Plot title.

required
metrics list of str

Metric column names to plot pairwise.

required
save_dir str or None

Directory to save plot images. Default is None.

None
group_order list of str or None

Order of class labels in the legend. Default is from constants.

group_order
cmap str

Matplotlib palette name. Default is "tab10_r".

'tab10_r'
show_legend bool

If True, keep the legend; otherwise it will be removed before saving.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def plot_pairwise_scatter_metrics(
    df: pd.DataFrame,
    title: str,
    metrics: List[str],
    save_dir: Optional[str] = None,
    group_order: Optional[List[str]] = group_order,
    cmap: str = "tab10_r",
    show_legend: bool = False,
) -> None:
    """
    Plot pairwise scatterplots for all metric combinations, colored by Group.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with metrics and 'Group'.
    title : str
        Plot title.
    metrics : list of str
        Metric column names to plot pairwise.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    group_order : list of str or None, optional
        Order of class labels in the legend. Default is from constants.
    cmap : str, optional
        Matplotlib palette name. Default is "tab10_r".
    show_legend : bool, optional
        If True, keep the legend; otherwise it will be removed before saving.

    Returns
    -------
    None
    """
    df = handle_inf_values(df)
    num_metrics = len(metrics)
    fig, axes = plt.subplots(num_metrics, num_metrics, figsize=(15, 15))

    for i in range(num_metrics):
        for j in range(num_metrics):
            if i != j:
                ax = sns.scatterplot(
                    data=df,
                    x=metrics[j],
                    y=metrics[i],
                    hue="Group",
                    palette=cmap,
                    hue_order=group_order,
                    ax=axes[i, j],
                    legend=False if not (i == 1 and j == 0) else "brief",
                )
                if i == 1 and j == 0:
                    handles, labels = ax.get_legend_handles_labels()
                    ax.legend().remove()
            else:
                axes[i, j].set_visible(False)

            axes[i, j].set_ylabel(metrics[i] if j == 0 and i > 0 else "")
            axes[i, j].set_xlabel(metrics[j] if i == num_metrics - 1 else "")

    fig.legend(handles, labels, loc="upper right", bbox_to_anchor=(1.15, 1))
    fig.suptitle(title, y=1.02)
    fig.subplots_adjust(top=0.95, wspace=0.4, hspace=0.4)
    plot_name = f"pairwise_scatter_{title.replace(' ', '_')}"
    save_plot(fig, save_dir, plot_name, tighten=True, show_legend=show_legend)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.plot_metrics

plot_metrics(df: DataFrame, metrics: List[str], cmap: str = 'tab10_r', save_dir: Optional[str] = None, hatches_dict: Optional[Dict[str, str]] = None, group_order: Optional[List[str]] = None, show: bool = True, fig_width: Optional[float] = None, fig_height: Optional[float] = None, show_legend: bool = False) -> Dict[str, str]

Plot grouped bar charts showing mean and std for metrics across splits and model types.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with columns ['Split', 'Model type'] and metrics.

required
metrics list of str

Metric column names to plot.

required
cmap str

Matplotlib colormap name used to derive distinct colors per model type. Default is "tab10_r".

'tab10_r'
save_dir str or None

Directory to save plot images. Default is None.

None
hatches_dict dict[str, str] or None

Mapping from Split to hatch pattern. Default is None.

None
group_order list of str or None

Order of grouped labels (Split_Model type). Default derives from data.

None
show bool

If True, display plot in interactive mode. Default is True.

True
fig_width float or None

Width of the plot area (excluding legend). Default scales with number of metrics.

None
fig_height float or None

Height of the plot area (excluding legend). Default is 6.

None
show_legend bool

If True, include a legend of split/model combinations. Default is False.

False

Returns:

Type Description
dict[str, str]

Color mapping from 'Model type' to RGBA string used in the plot.

Source code in uqdd/metrics/analysis.py
def plot_metrics(
    df: pd.DataFrame,
    metrics: List[str],
    cmap: str = "tab10_r",
    save_dir: Optional[str] = None,
    hatches_dict: Optional[Dict[str, str]] = None,
    group_order: Optional[List[str]] = None,
    show: bool = True,
    fig_width: Optional[float] = None,
    fig_height: Optional[float] = None,
    show_legend: bool = False,
) -> Dict[str, str]:
    """
    Plot grouped bar charts showing mean and std for metrics across splits and model types.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with columns ['Split', 'Model type'] and metrics.
    metrics : list of str
        Metric column names to plot.
    cmap : str, optional
        Matplotlib colormap name used to derive distinct colors per model type. Default is "tab10_r".
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    hatches_dict : dict[str, str] or None, optional
        Mapping from Split to hatch pattern. Default is None.
    group_order : list of str or None, optional
        Order of grouped labels (Split_Model type). Default derives from data.
    show : bool, optional
        If True, display plot in interactive mode. Default is True.
    fig_width : float or None, optional
        Width of the plot area (excluding legend). Default scales with number of metrics.
    fig_height : float or None, optional
        Height of the plot area (excluding legend). Default is 6.
    show_legend : bool, optional
        If True, include a legend of split/model combinations. Default is False.

    Returns
    -------
    dict[str, str]
        Color mapping from 'Model type' to RGBA string used in the plot.
    """
    plot_width = fig_width if fig_width else max(10, len(metrics) * 2)
    plot_height = fig_height if fig_height else 6
    total_width = plot_width + 5
    total_height = plot_height + 2

    fig = plt.figure(figsize=(total_width, total_height))
    gs = gridspec.GridSpec(1, 1, figure=fig, left=0.1, right=0.75, top=0.9, bottom=0.2)
    ax = fig.add_subplot(gs[0])
    ax.set_position([0.1, 0.15, plot_width / total_width, plot_height / total_height])

    stats_dfs = []
    for metric in metrics:
        mean_df = df.groupby(["Split", "Model type"])[metric].mean().rename(f"{metric}_mean")
        std_df = df.groupby(["Split", "Model type"])[metric].std().rename(f"{metric}_std")
        stats_df = pd.merge(mean_df, std_df, left_index=True, right_index=True).reset_index()
        stats_df["Group"] = stats_df.apply(lambda row: f"{row['Split']}_{row['Model type']}", axis=1)
        stats_df["Metric"] = metric
        stats_dfs.append(stats_df)

    combined_stats_df = pd.concat(stats_dfs)
    if group_order:
        combined_stats_df["Group"] = pd.Categorical(
            combined_stats_df["Group"], categories=group_order, ordered=True
        )
    else:
        group_order = combined_stats_df["Group"].unique().tolist()

    scalar_mappable = ScalarMappable(cmap=cmap)
    model_types = combined_stats_df["Model type"].unique()
    color_dict = {
        m: c
        for m, c in zip(
            model_types,
            scalar_mappable.to_rgba(range(len(model_types)), alpha=1).tolist(),
        )
    }

    bar_width = 0.12
    group_spacing = 0.4
    num_bars = len(model_types) * len(hatches_dict)
    positions = []
    tick_positions = []
    tick_labels = []

    for i, metric in enumerate(metrics):
        metric_data = combined_stats_df[combined_stats_df["Metric"] == metric]
        metric_data.loc[:, "Group"] = pd.Categorical(
            metric_data["Group"], categories=group_order, ordered=True
        )
        metric_data = metric_data.sort_values("Group").reset_index(drop=True)
        for j, (_, row) in enumerate(metric_data.iterrows()):
            position = i * (num_bars * bar_width + group_spacing) + (j % num_bars) * bar_width
            positions.append(position)
            ax.bar(
                position,
                height=row[f"{metric}_mean"],
                color=color_dict[row["Model type"]],
                hatch=hatches_dict[row["Split"]],
                width=bar_width,
            )
        center_position = i * (num_bars * bar_width + group_spacing) + (num_bars * bar_width) / 2
        tick_positions.append(center_position)
        tick_labels.append(metric.replace(" ", "\n") if " " in metric else metric)

    def create_stats_legend(df, color_mapping, hatches_dict, group_order):
        patches_dict = {}
        for _, row in df.iterrows():
            label = f"{row['Split']} {row['Model type']}"
            group_label = f"{row['Split']}_{row['Model type']}"
            if group_label not in patches_dict:
                patches_dict[group_label] = mpatches.Patch(
                    facecolor=color_mapping[row["Model type"]],
                    hatch=hatches_dict[row["Split"]],
                    label=label,
                )
        return [patches_dict[group] for group in group_order if group in patches_dict]

    if show_legend:
        legend_elements = create_stats_legend(combined_stats_df, color_dict, hatches_dict, group_order)
        ax.legend(handles=legend_elements, bbox_to_anchor=(1.05, 1), loc="upper left", borderaxespad=0, frameon=False)

    for (_, row), bar in zip(combined_stats_df.iterrows(), ax.patches):
        x_bar = bar.get_x() + bar.get_width() / 2
        y_bar = bar.get_height()
        ax.errorbar(
            x_bar,
            y_bar,
            yerr=row[f"{row['Metric']}_std"],
            color="black",
            fmt="none",
            elinewidth=1,
            capsize=3,
            alpha=0.5,
        )

    ax.set_xticks(tick_positions)
    ax.set_xticklabels(tick_labels, rotation=45, ha="right", rotation_mode="anchor", fontsize=9)
    ax.set_ylim(bottom=0.0)

    if save_dir:
        metrics_names = "_".join(metrics)
        plot_name = f"barplot_{cmap}_{metrics_names}"
        save_plot(fig, save_dir, plot_name, show_legend=show_legend)

    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

    return color_dict
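
A usage sketch wiring in the plotting constants (data are illustrative; the returned color mapping can be reused as color_dict in plot_comparison_metrics):

import pandas as pd
from uqdd.metrics import plot_metrics, hatches_dict, group_order, accmetrics2

# Illustrative results: two runs per split for a single model type.
df = pd.DataFrame({
    "Split": ["stratified", "stratified", "time", "time"],
    "Model type": ["pnn", "pnn", "pnn", "pnn"],
    "RMSE": [0.70, 0.72, 0.90, 0.88],
    "R2": [0.60, 0.58, 0.40, 0.42],
    "PCC": [0.78, 0.77, 0.65, 0.66],
})
color_dict = plot_metrics(
    df,
    metrics=accmetrics2,
    save_dir="plots",
    hatches_dict=hatches_dict,  # maps each Split to a hatch pattern
    group_order=group_order,    # canonical Split_Model ordering from the constants
    show_legend=True,
)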

uqdd.metrics.find_highly_correlated_metrics

find_highly_correlated_metrics(df: DataFrame, metrics: List[str], threshold: float = 0.8, save_dir: Optional[str] = None, cmap: str = 'coolwarm', show_legend: bool = False) -> List[Tuple[str, str, float]]

Identify pairs of metrics with correlation above a threshold and plot the matrix.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing the metric columns.

required
metrics list of str

Metric column names to include in the correlation analysis.

required
threshold float

Absolute correlation threshold for reporting pairs. Default is 0.8.

0.8
save_dir str or None

Directory to save the heatmap plot. Default is None.

None
cmap str

Matplotlib colormap name. Default is "coolwarm".

'coolwarm'
show_legend bool

If True, keep the legend; otherwise it will be removed before saving.

False

Returns:

Type Description
list of tuple[str, str, float]

List of metric pairs and their absolute correlation values.

Source code in uqdd/metrics/analysis.py
def find_highly_correlated_metrics(
    df: pd.DataFrame,
    metrics: List[str],
    threshold: float = 0.8,
    save_dir: Optional[str] = None,
    cmap: str = "coolwarm",
    show_legend: bool = False,
) -> List[Tuple[str, str, float]]:
    """
    Identify pairs of metrics with correlation above a threshold and plot the matrix.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame containing the metric columns.
    metrics : list of str
        Metric column names to include in the correlation analysis.
    threshold : float, optional
        Absolute correlation threshold for reporting pairs. Default is 0.8.
    save_dir : str or None, optional
        Directory to save the heatmap plot. Default is None.
    cmap : str, optional
        Matplotlib colormap name. Default is "coolwarm".
    show_legend : bool, optional
        If True, keep the legend; otherwise it will be removed before saving.

    Returns
    -------
    list of tuple[str, str, float]
        List of metric pairs and their absolute correlation values.
    """
    corr_matrix = df[metrics].corr().abs()
    pairs = []
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if corr_matrix.iloc[i, j] > threshold:
                pairs.append((corr_matrix.columns[i], corr_matrix.columns[j], corr_matrix.iloc[i, j]))

    print(f"Highly correlated metrics (correlation coefficient > {threshold}):")
    for a, b, v in pairs:
        print(f"{a} and {b}: {v:.2f}")

    plt.figure(figsize=(10, 8))
    sns.heatmap(corr_matrix, annot=True, cmap=cmap)
    plt.title("Correlation Matrix")
    plot_name = f"correlation_matrix_{threshold}_{'_'.join(metrics)}"
    save_plot(plt.gcf(), save_dir, plot_name, show_legend=show_legend)
    if INTERACTIVE_MODE:
        plt.show()

    return pairs
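
For example (illustrative data):

import pandas as pd
from uqdd.metrics import find_highly_correlated_metrics

df = pd.DataFrame({
    "RMSE": [0.7, 0.8, 0.9, 1.0],
    "MAE": [0.5, 0.6, 0.7, 0.8],
    "R2": [0.6, 0.5, 0.4, 0.3],
})
pairs = find_highly_correlated_metrics(df, metrics=["RMSE", "MAE", "R2"], threshold=0.8)
# pairs is a list of (metric_a, metric_b, |correlation|) tuples above the threshold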

uqdd.metrics.plot_comparison_metrics

plot_comparison_metrics(df: DataFrame, metrics: List[str], cmap: str = 'tab10_r', color_dict: Optional[Dict[str, str]] = None, save_dir: Optional[str] = None, fig_width: Optional[float] = None, fig_height: Optional[float] = None, show_legend: bool = False, models_order: Optional[List[str]] = None) -> None

Plot comparison bar charts across splits, model types, and calibration states.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with columns ['Split', 'Model type', 'Calibration'] and metrics.

required
metrics list of str

Metric column names to plot.

required
cmap str

Matplotlib colormap name used to derive distinct colors per model type. Default is "tab10_r".

'tab10_r'
color_dict dict[str, str] or None

Precomputed color mapping from model type to color. If None, one is generated.

None
save_dir str or None

Directory to save plot images. Default is None.

None
fig_width float or None

Width of the plot area (excluding legend). Default scales with the number of metrics.

None
fig_height float or None

Height of the plot area (excluding legend). Default is 6.

None
show_legend bool

If True, include a legend. Default is False.

False
models_order list of str or None

Explicit order of model types for coloring and grouping. Default derives from data.

None

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def plot_comparison_metrics(
    df: pd.DataFrame,
    metrics: List[str],
    cmap: str = "tab10_r",
    color_dict: Optional[Dict[str, str]] = None,
    save_dir: Optional[str] = None,
    fig_width: Optional[float] = None,
    fig_height: Optional[float] = None,
    show_legend: bool = False,
    models_order: Optional[List[str]] = None,
) -> None:
    """
    Plot comparison bar charts across splits, model types, and calibration states.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with columns ['Split', 'Model type', 'Calibration'] and metrics.
    metrics : list of str
        Metric column names to plot.
    cmap : str, optional
        Matplotlib colormap name used to derive distinct colors per model type. Default is "tab10_r".
    color_dict : dict[str, str] or None, optional
        Precomputed color mapping from model type to color. If None, one is generated.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    fig_width : float or None, optional
        Width of the plot area (excluding legend). Default scales with the number of metrics.
    fig_height : float or None, optional
        Height of the plot area (excluding legend). Default is 6.
    show_legend : bool, optional
        If True, include a legend. Default is False.
    models_order : list of str or None, optional
        Explicit order of model types for coloring and grouping. Default derives from data.

    Returns
    -------
    None
    """
    plot_width = fig_width if fig_width else max(7, len(metrics) * 3)
    plot_height = fig_height if fig_height else 6
    total_width = plot_width + 5
    total_height = plot_height + 2

    fig = plt.figure(figsize=(total_width, total_height))
    gs = gridspec.GridSpec(1, 1, figure=fig, left=0.1, right=0.75, top=0.9, bottom=0.15)
    ax = fig.add_subplot(gs[0])
    ax.set_position([0.1, 0.15, plot_width / total_width, plot_height / total_height])

    stats_dfs = []
    for metric in metrics:
        mean_df = df.groupby(["Split", "Model type", "Calibration"])[metric].mean().rename(f"{metric}_mean")
        std_df = df.groupby(["Split", "Model type", "Calibration"])[metric].std().rename(f"{metric}_std")
        stats_df = pd.merge(mean_df, std_df, left_index=True, right_index=True).reset_index()
        stats_df["Group"] = stats_df.apply(
            lambda row: f"{row['Split']}_{row['Model type']}_{row['Calibration']}", axis=1
        )
        stats_df["Metric"] = metric
        stats_dfs.append(stats_df)

    combined_stats_df = pd.concat(stats_dfs)
    if models_order is None:
        models_order = combined_stats_df["Model type"].unique().tolist()

    if color_dict is None:
        scalar_mappable = ScalarMappable(cmap=cmap)
        color_dict = {
            m: c
            for m, c in zip(
                models_order,
                scalar_mappable.to_rgba(range(len(models_order)), alpha=1).tolist(),
            )
        }
    color_dict = {k: color_dict[k] for k in models_order}

    hatches_dict = {
        "Before Calibration": "\\\\",
        "After Calibration": "",
    }

    bar_width = 0.1
    group_spacing = 0.2
    split_spacing = 0.6
    num_bars = len(models_order) * 2
    positions = []
    tick_positions = []
    tick_labels = []

    for i, metric in enumerate(metrics):
        metric_data = combined_stats_df[combined_stats_df["Metric"] == metric]
        split_types = metric_data["Split"].unique()
        for j, split in enumerate(split_types):
            split_data = metric_data[metric_data["Split"] == split]
            split_data = split_data[split_data["Model type"].isin(models_order)]

            for k, model_type in enumerate(models_order):
                for l, calibration in enumerate(["Before Calibration", "After Calibration"]):
                    position = (
                        i * (split_spacing + len(split_types) * (num_bars * bar_width + group_spacing))
                        + j * (num_bars * bar_width + group_spacing)
                        + k * 2 * bar_width
                        + l * bar_width
                    )
                    positions.append(position)
                    height = split_data[
                        (split_data["Model type"] == model_type)
                        & (split_data["Calibration"] == calibration)
                    ][f"{metric}_mean"].values[0]
                    ax.bar(
                        position,
                        height=height,
                        color=color_dict[model_type],
                        hatch=hatches_dict[calibration],
                        width=bar_width,
                    )

            center_position = (
                i * (split_spacing + len(split_types) * (num_bars * bar_width + group_spacing))
                + j * (num_bars * bar_width + group_spacing)
                + (num_bars * bar_width) / 2
            )
            tick_positions.append(center_position)
            tick_labels.append(f"{metric}\n{split}")

    if show_legend:
        legend_elements = [
            mpatches.Patch(facecolor=color_dict[model], edgecolor="black", label=model)
            for model in models_order
        ]
        legend_elements += [
            mpatches.Patch(facecolor="white", edgecolor="black", hatch=h, label=label)
            for label, h in hatches_dict.items()
        ]
        ax.legend(handles=legend_elements, bbox_to_anchor=(1.05, 1), loc="upper left", borderaxespad=0, frameon=False)

    for (_, row), bar in zip(combined_stats_df.iterrows(), ax.patches):
        x_bar = bar.get_x() + bar.get_width() / 2
        y_bar = bar.get_height()
        yerr_lower = y_bar - max(0, y_bar - row[f"{row['Metric']}_std"])
        yerr_upper = row[f"{row['Metric']}_std"]
        ax.errorbar(
            x_bar,
            y_bar,
            yerr=[[yerr_lower], [yerr_upper]],
            color="black",
            fmt="none",
            elinewidth=1,
            capsize=3,
            alpha=0.5,
        )

    ax.set_xticks(tick_positions)
    ax.set_xticklabels(tick_labels, rotation=45, ha="right", rotation_mode="anchor", fontsize=9)
    ax.set_ylim(bottom=0.0)

    if save_dir:
        metrics_names = "_".join(metrics)
        plot_name = f"comparison_barplot_{cmap}_{metrics_names}"
        save_plot(fig, save_dir, plot_name, show_legend=show_legend)

    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
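
A usage sketch with illustrative before/after-calibration rows (every Split/Model type/Calibration combination present in the frame must have values for each plotted metric):

import pandas as pd
from uqdd.metrics import plot_comparison_metrics

rows = [
    {"Split": "stratified", "Model type": "pnn", "Calibration": calib, "NLL": nll}
    for calib, nll in [
        ("Before Calibration", 1.25), ("Before Calibration", 1.20),
        ("After Calibration", 1.05), ("After Calibration", 1.00),
    ]
]
df = pd.DataFrame(rows)
plot_comparison_metrics(df, metrics=["NLL"], save_dir="plots", show_legend=True)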

uqdd.metrics.load_and_aggregate_calibration_data

load_and_aggregate_calibration_data(base_path: str, paths: List[str]) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]

Load calibration curve data from multiple model paths and aggregate statistics.

Parameters:

Name Type Description Default
base_path str

Base directory from which model subpaths are resolved.

required
paths list of str

Relative paths to model directories containing 'calibration_plot_data.csv'.

required

Returns:

Type Description
(ndarray, ndarray, ndarray, ndarray)

Tuple of (expected_values, mean_observed, lower_bound, upper_bound), each of shape (n_bins,).

Source code in uqdd/metrics/analysis.py
def load_and_aggregate_calibration_data(base_path: str, paths: List[str]) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """
    Load calibration curve data from multiple model paths and aggregate statistics.

    Parameters
    ----------
    base_path : str
        Base directory from which model subpaths are resolved.
    paths : list of str
        Relative paths to model directories containing 'calibration_plot_data.csv'.

    Returns
    -------
    (numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray)
        Tuple of (expected_values, mean_observed, lower_bound, upper_bound), each of shape (n_bins,).
    """
    expected_values = []
    observed_values = []
    for path in paths:
        file_path = os.path.join(base_path, path, "calibration_plot_data.csv")
        if os.path.exists(file_path):
            data = pd.read_csv(file_path)
            expected_values = data["Expected Proportion"]
            observed_values.append(data["Observed Proportion"])
        else:
            print(f"File not found: {file_path}")

    expected_values = np.array(expected_values)
    observed_values = np.array(observed_values)
    mean_observed = np.mean(observed_values, axis=0)
    lower_bound = np.min(observed_values, axis=0)
    upper_bound = np.max(observed_values, axis=0)
    return expected_values, mean_observed, lower_bound, upper_bound
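
A runnable sketch that writes two illustrative model folders to disk and aggregates their calibration curves (all names and values are made up); the resulting arrays are what plot_calibration_data consumes per group:

import os
import pandas as pd
from uqdd.metrics import load_and_aggregate_calibration_data

base = "calib_demo"
for name, observed in [("model1", [0.0, 0.45, 1.0]), ("model2", [0.0, 0.55, 1.0])]:
    model_dir = os.path.join(base, name)
    os.makedirs(model_dir, exist_ok=True)
    pd.DataFrame({
        "Expected Proportion": [0.0, 0.5, 1.0],
        "Observed Proportion": observed,
    }).to_csv(os.path.join(model_dir, "calibration_plot_data.csv"), index=False)

expected, mean_obs, lower, upper = load_and_aggregate_calibration_data(base, ["model1", "model2"])
print(mean_obs)  # element-wise mean of the observed curves -> [0.  0.5 1. ]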

uqdd.metrics.plot_calibration_data

plot_calibration_data(df_aggregated: DataFrame, base_path: str, save_dir: Optional[str] = None, title: str = 'Calibration Plot', color_name: str = 'tab10_r', color_dict: Optional[Dict[str, str]] = None, group_order: Optional[List[str]] = None, fig_width: Optional[float] = None, fig_height: Optional[float] = None, show_legend: bool = False) -> None

Plot aggregated calibration curves for multiple groups against the perfect calibration line.

Parameters:

Name Type Description Default
df_aggregated DataFrame

Aggregated DataFrame containing 'Group' and 'project_model' lists for each group.

required
base_path str

Base directory where model paths are located.

required
save_dir str or None

Directory to save plot images. Default is None.

None
title str

Plot title. Default is "Calibration Plot".

'Calibration Plot'
color_name str

Colormap name used to derive distinct colors per group. Default is "tab10_r".

'tab10_r'
color_dict dict[str, str] or None

Precomputed color mapping from group to color. If None, one is generated.

None
group_order list of str or None

Order of groups in the legend. Default derives from data.

None
fig_width float or None

Width of the plot area. Default is 6.

None
fig_height float or None

Height of the plot area. Default is 6.

None
show_legend bool

If True, include a legend. Default is False.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def plot_calibration_data(
    df_aggregated: pd.DataFrame,
    base_path: str,
    save_dir: Optional[str] = None,
    title: str = "Calibration Plot",
    color_name: str = "tab10_r",
    color_dict: Optional[Dict[str, str]] = None,
    group_order: Optional[List[str]] = None,
    fig_width: Optional[float] = None,
    fig_height: Optional[float] = None,
    show_legend: bool = False,
) -> None:
    """
    Plot aggregated calibration curves for multiple groups against the perfect calibration line.

    Parameters
    ----------
    df_aggregated : pd.DataFrame
        Aggregated DataFrame containing 'Group' and 'project_model' lists for each group.
    base_path : str
        Base directory where model paths are located.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    title : str, optional
        Plot title. Default is "Calibration Plot".
    color_name : str, optional
        Colormap name used to derive distinct colors per group. Default is "tab10_r".
    color_dict : dict[str, str] or None, optional
        Precomputed color mapping from group to color. If None, one is generated.
    group_order : list of str or None, optional
        Order of groups in the legend. Default derives from data.
    fig_width : float or None, optional
        Width of the plot area. Default is 6.
    fig_height : float or None, optional
        Height of the plot area. Default is 6.
    show_legend : bool, optional
        If True, include a legend. Default is False.

    Returns
    -------
    None
    """
    plot_width = fig_width if fig_width else 6
    plot_height = fig_height if fig_height else 6
    total_width = plot_width + 4
    total_height = plot_height + 2

    fig = plt.figure(figsize=(total_width, total_height))
    gs = gridspec.GridSpec(1, 1, figure=fig, left=0.15, right=0.75, top=0.9, bottom=0.15)
    ax = fig.add_subplot(gs[0])
    ax.set_position([0.15, 0.15, plot_width / total_width, plot_height / total_height])

    if group_order is None:
        group_order = list(df_aggregated["Group"].unique())

    if color_dict is None:
        scalar_mappable = ScalarMappable(cmap=color_name)
        colors = scalar_mappable.to_rgba(range(len(group_order)))
        color_dict = {group: color for group, color in zip(group_order, colors)}

    legend_handles = {}
    for idx, row in df_aggregated.iterrows():
        model_paths = row["project_model"]
        group_label = row["Group"]
        color = color_dict[group_label]
        expected, mean_observed, lower_bound, upper_bound = load_and_aggregate_calibration_data(base_path, model_paths)
        (line,) = ax.plot(expected, mean_observed, label=group_label, color=color)
        ax.fill_between(expected, lower_bound, upper_bound, alpha=0.2, color=color)
        if group_label not in legend_handles:
            legend_handles[group_label] = line

    (perfect_line,) = ax.plot([0, 1], [0, 1], "k--", label="Perfect Calibration")
    legend_handles["Perfect Calibration"] = perfect_line

    ordered_legend_handles = [legend_handles[group] for group in group_order if group in legend_handles]
    ordered_legend_handles.append(legend_handles["Perfect Calibration"])
    if show_legend:
        ax.legend(handles=ordered_legend_handles, bbox_to_anchor=(1.05, 1), loc="upper left")

    ax.set_title(title)
    ax.set_xlabel("Expected Proportion")
    ax.set_ylabel("Observed Proportion")
    ax.grid(True)
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)

    if save_dir:
        plot_name = f"{title.replace(' ', '_')}"
        save_plot(fig, save_dir, plot_name, tighten=True, show_legend=show_legend)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.move_model_folders

move_model_folders(df: DataFrame, search_dirs: List[str], output_dir: str, overwrite: bool = False) -> None

Move or merge model directories into a single output folder based on model names.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing a 'model name' column.

required
search_dirs list of str

Directories to search for model subfolders.

required
output_dir str

Destination directory where model folders will be moved or merged.

required
overwrite bool

If True, existing folders are merged (copied) with source. Default is False.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def move_model_folders(
    df: pd.DataFrame,
    search_dirs: List[str],
    output_dir: str,
    overwrite: bool = False,
) -> None:
    """
    Move or merge model directories into a single output folder based on model names.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing a 'model name' column.
    search_dirs : list of str
        Directories to search for model subfolders.
    output_dir : str
        Destination directory where model folders will be moved or merged.
    overwrite : bool, optional
        If True, existing folders are merged (copied) with source. Default is False.

    Returns
    -------
    None
    """
    model_names = df["model name"].unique()
    if not os.path.exists(output_dir):
        os.makedirs(output_dir, exist_ok=True)
        print(f"Created output directory '{output_dir}'.")

    for model_name in model_names:
        found = False
        for search_dir in search_dirs:
            if not os.path.isdir(search_dir):
                print(f"Search directory '{search_dir}' does not exist. Skipping.")
                continue
            subdirs = [d for d in os.listdir(search_dir) if os.path.isdir(os.path.join(search_dir, d))]
            if model_name in subdirs:
                source_dir = os.path.join(search_dir, model_name)
                dest_dir = os.path.join(output_dir, model_name)
                if os.path.exists(dest_dir):
                    if overwrite:
                        shutil.copytree(source_dir, dest_dir, dirs_exist_ok=True)
                        print(f"Merged (Copied) '{source_dir}' to '{dest_dir}'.")
                else:
                    try:
                        shutil.move(source_dir, dest_dir)
                        print(f"Moved '{source_dir}' to '{dest_dir}'.")
                    except Exception as e:
                        print(f"Error moving '{source_dir}' to '{dest_dir}': {e}")
                found = True
                break
        if not found:
            print(f"Model folder '{model_name}' not found in any of the search directories.")

uqdd.metrics.load_predictions

load_predictions(model_path: str) -> pd.DataFrame

Load pickled predictions from a model directory.

Parameters:

Name Type Description Default
model_path str

Path to the model directory containing 'preds.pkl'.

required

Returns:

Type Description
DataFrame

DataFrame loaded from the pickle file.

Source code in uqdd/metrics/analysis.py, lines 962–977
def load_predictions(model_path: str) -> pd.DataFrame:
    """
    Load pickled predictions from a model directory.

    Parameters
    ----------
    model_path : str
        Path to the model directory containing 'preds.pkl'.

    Returns
    -------
    pd.DataFrame
        DataFrame loaded from the pickle file.
    """
    preds_path = os.path.join(model_path, "preds.pkl")
    return pd.read_pickle(preds_path)
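
A one-line usage sketch; the model directory below is a hypothetical path that is assumed to contain a preds.pkl file.

from uqdd.metrics import load_predictions

# Hypothetical model directory holding 'preds.pkl'.
preds = load_predictions("runs/consolidated/ens_stratified_0")
print(preds.columns.tolist())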

uqdd.metrics.calculate_rmse_rejection_curve

calculate_rmse_rejection_curve(preds: DataFrame, uncertainty_col: str = 'y_alea', true_label_col: str = 'y_true', pred_label_col: str = 'y_pred', normalize_rmse: bool = False, random_rejection: bool = False, unc_type: Optional[str] = None, max_rejection_ratio: float = 0.95) -> Tuple[np.ndarray, np.ndarray, float]

Compute RMSE vs. rejection rate curve and its AUC by rejecting high-uncertainty predictions.

Parameters:

Name Type Description Default
preds DataFrame

DataFrame with columns for true labels, predicted labels, and uncertainty components.

required
uncertainty_col str

Column name for uncertainty to sort by if unc_type is None. Default is "y_alea".

'y_alea'
true_label_col str

Column name for true labels. Default is "y_true".

'y_true'
pred_label_col str

Column name for predicted labels. Default is "y_pred".

'y_pred'
normalize_rmse bool

If True, normalize RMSE by the initial RMSE before rejection. Default is False.

False
random_rejection bool

If True, randomly reject samples instead of sorting by uncertainty. Default is False.

False
unc_type (aleatoric, epistemic, both)

Which uncertainty to use. If "both", sums aleatoric and epistemic. If None, use uncertainty_col.

"aleatoric"
max_rejection_ratio float

Maximum fraction of samples to reject (exclusive of the tail). Default is 0.95.

0.95

Returns:

Type Description
(ndarray, ndarray, float)

Tuple of (rejection_rates, rmses, AUC of the RMSE–rejection curve).

Raises:

Type Description
ValueError

If unc_type is invalid or uncertainty_col is not present when needed.

Source code in uqdd/metrics/analysis.py, lines 980–1056
def calculate_rmse_rejection_curve(
    preds: pd.DataFrame,
    uncertainty_col: str = "y_alea",
    true_label_col: str = "y_true",
    pred_label_col: str = "y_pred",
    normalize_rmse: bool = False,
    random_rejection: bool = False,
    unc_type: Optional[str] = None,
    max_rejection_ratio: float = 0.95,
) -> Tuple[np.ndarray, np.ndarray, float]:
    """
    Compute RMSE vs. rejection rate curve and its AUC by rejecting high-uncertainty predictions.

    Parameters
    ----------
    preds : pd.DataFrame
        DataFrame with columns for true labels, predicted labels, and uncertainty components.
    uncertainty_col : str, optional
        Column name for uncertainty to sort by if `unc_type` is None. Default is "y_alea".
    true_label_col : str, optional
        Column name for true labels. Default is "y_true".
    pred_label_col : str, optional
        Column name for predicted labels. Default is "y_pred".
    normalize_rmse : bool, optional
        If True, normalize RMSE by the initial RMSE before rejection. Default is False.
    random_rejection : bool, optional
        If True, randomly reject samples instead of sorting by uncertainty. Default is False.
    unc_type : {"aleatoric", "epistemic", "both"} or None, optional
        Which uncertainty to use. If "both", sums aleatoric and epistemic. If None, use `uncertainty_col`.
    max_rejection_ratio : float, optional
        Maximum fraction of samples to reject (exclusive of the tail). Default is 0.95.

    Returns
    -------
    (numpy.ndarray, numpy.ndarray, float)
        Tuple of (rejection_rates, rmses, AUC of the RMSE–rejection curve).

    Raises
    ------
    ValueError
        If `unc_type` is invalid or `uncertainty_col` is not present when needed.
    """
    if unc_type == "aleatoric":
        uncertainty_col = "y_alea"
    elif unc_type == "epistemic":
        uncertainty_col = "y_eps"
    elif unc_type == "both":
        preds["y_unc"] = preds["y_alea"] + preds["y_eps"]
        uncertainty_col = "y_unc"
    elif unc_type is None and uncertainty_col in preds.columns:
        pass
    else:
        raise ValueError(
            "Either provide valid uncertainty type or provide the uncertainty column name in the DataFrame"
        )

    if random_rejection:
        preds = preds.sample(frac=max_rejection_ratio).reset_index(drop=True)
    else:
        preds = preds.sort_values(by=uncertainty_col, ascending=False)

    max_rejection_index = int(len(preds) * max_rejection_ratio)
    step = max(1, int(len(preds) * 0.01))
    rejection_steps = np.arange(0, max_rejection_index, step=step)
    rejection_rates = rejection_steps / len(preds)
    rmses = []

    initial_rmse = mean_squared_error(preds[true_label_col], preds[pred_label_col], squared=False)

    for i in rejection_steps:
        selected_preds = preds.iloc[i:]
        rmse = mean_squared_error(selected_preds[true_label_col], selected_preds[pred_label_col], squared=False)
        if normalize_rmse:
            rmse /= initial_rmse
        rmses.append(rmse)
    auc_arc = auc(rejection_rates, rmses)
    return rejection_rates, np.array(rmses), float(auc_arc)
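
A minimal sketch on synthetic predictions: the column names follow the defaults documented above ('y_true', 'y_pred', 'y_alea', 'y_eps'), while the values are randomly generated purely for illustration.

import numpy as np
import pandas as pd

from uqdd.metrics import calculate_rmse_rejection_curve

rng = np.random.default_rng(0)
n = 500
y_true = rng.normal(size=n)
noise_scale = rng.uniform(0.1, 1.0, size=n)
preds = pd.DataFrame({
    "y_true": y_true,
    "y_pred": y_true + rng.normal(scale=noise_scale, size=n),
    "y_alea": noise_scale ** 2,              # synthetic aleatoric uncertainty
    "y_eps": rng.uniform(0.0, 0.1, size=n),  # synthetic epistemic uncertainty
})

# Reject by total uncertainty (aleatoric + epistemic) and normalize by the
# RMSE of the full set before any rejection.
rates, rmses, auc_rrc = calculate_rmse_rejection_curve(
    preds, unc_type="both", normalize_rmse=True, max_rejection_ratio=0.95
)
print(f"AUC of the RMSE-rejection curve: {auc_rrc:.3f}")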

uqdd.metrics.calculate_rejection_curve

calculate_rejection_curve(df: DataFrame, model_paths: List[str], unc_col: str, random_rejection: bool = False, normalize_rmse: bool = False, max_rejection_ratio: float = 0.95) -> Tuple[np.ndarray, np.ndarray, np.ndarray, float, float]

Aggregate RMSE–rejection curves across models and compute mean/std and AUC statistics.

Parameters:

Name Type Description Default
df DataFrame

Auxiliary DataFrame (not used directly, kept for API symmetry).

required
model_paths list of str

Paths to model directories containing 'preds.pkl'.

required
unc_col str

Uncertainty column name to use when computing curves (e.g., 'y_alea' or 'y_eps').

required
random_rejection bool

If True, randomly reject samples. Default is False.

False
normalize_rmse bool

If True, normalize RMSE by the initial RMSE. Default is False.

False
max_rejection_ratio float

Maximum fraction of samples to reject. Default is 0.95.

0.95

Returns:

Type Description
(ndarray, ndarray, ndarray, float, float)

Tuple of (rejection_rates, mean_rmses, std_rmses, mean_auc, std_auc).

Source code in uqdd/metrics/analysis.py, lines 1059–1113
def calculate_rejection_curve(
    df: pd.DataFrame,
    model_paths: List[str],
    unc_col: str,
    random_rejection: bool = False,
    normalize_rmse: bool = False,
    max_rejection_ratio: float = 0.95,
) -> Tuple[np.ndarray, np.ndarray, np.ndarray, float, float]:
    """
    Aggregate RMSE–rejection curves across models and compute mean/std and AUC statistics.

    Parameters
    ----------
    df : pd.DataFrame
        Auxiliary DataFrame (not used directly, kept for API symmetry).
    model_paths : list of str
        Paths to model directories containing 'preds.pkl'.
    unc_col : str
        Uncertainty column name to use when computing curves (e.g., 'y_alea' or 'y_eps').
    random_rejection : bool, optional
        If True, randomly reject samples. Default is False.
    normalize_rmse : bool, optional
        If True, normalize RMSE by the initial RMSE. Default is False.
    max_rejection_ratio : float, optional
        Maximum fraction of samples to reject. Default is 0.95.

    Returns
    -------
    (numpy.ndarray, numpy.ndarray, numpy.ndarray, float, float)
        Tuple of (rejection_rates, mean_rmses, std_rmses, mean_auc, std_auc).
    """
    aggregated_rmses = []
    auc_values = []
    rejection_rates = None

    for model_path in model_paths:
        preds = load_predictions(model_path)
        if preds.empty:
            print(f"Preds not loaded for model: {model_path}")
            continue
        rejection_rates, rmses, auc_arc = calculate_rmse_rejection_curve(
            preds,
            uncertainty_col=unc_col,
            random_rejection=random_rejection,
            normalize_rmse=normalize_rmse,
            max_rejection_ratio=max_rejection_ratio,
        )
        aggregated_rmses.append(rmses)
        auc_values.append(auc_arc)

    mean_rmses = np.mean(aggregated_rmses, axis=0)
    std_rmses = np.std(aggregated_rmses, axis=0)
    mean_auc = np.mean(auc_values)
    std_auc = np.std(auc_values)
    return rejection_rates, mean_rmses, std_rmses, float(mean_auc), float(std_auc)
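
A sketch of aggregating curves over replicate runs; the model directories are hypothetical and each is assumed to contain a preds.pkl produced beforehand.

from uqdd.metrics import calculate_rejection_curve

# Hypothetical replicate directories, each holding a 'preds.pkl'.
model_paths = [f"runs/consolidated/ens_stratified_{i}" for i in range(5)]

rates, mean_rmses, std_rmses, mean_auc, std_auc = calculate_rejection_curve(
    df=None,                 # not used by the function, kept for API symmetry
    model_paths=model_paths,
    unc_col="y_alea",
    normalize_rmse=True,
)
print(f"AUC-RRC: {mean_auc:.3f} ± {std_auc:.3f}")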

uqdd.metrics.get_handles_labels

get_handles_labels(ax: Axes, group_order: List[str]) -> Tuple[List, List[str]]

Extract legend handles/labels ordered by group prefix.

Parameters:

Name Type Description Default
ax Axes

Axes object from which to retrieve legend entries.

required
group_order list of str

Group prefixes to order legend entries by.

required

Returns:

Type Description
(list, list of str)

Ordered handles and labels.

Source code in uqdd/metrics/analysis.py, lines 1116–1140
def get_handles_labels(ax: plt.Axes, group_order: List[str]) -> Tuple[List, List[str]]:
    """
    Extract legend handles/labels ordered by group prefix.

    Parameters
    ----------
    ax : matplotlib.axes.Axes
        Axes object from which to retrieve legend entries.
    group_order : list of str
        Group prefixes to order legend entries by.

    Returns
    -------
    (list, list of str)
        Ordered handles and labels.
    """
    handles, labels = ax.get_legend_handles_labels()
    ordered_handles = []
    ordered_labels = []
    for group in group_order:
        for label, handle in zip(labels, handles):
            if label.startswith(group):
                ordered_handles.append(handle)
                ordered_labels.append(label)
    return ordered_handles, ordered_labels
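
A small sketch showing how the helper reorders an existing legend by label prefix; the line labels here are illustrative only.

import matplotlib.pyplot as plt

from uqdd.metrics import get_handles_labels

fig, ax = plt.subplots()
ax.plot([0, 1], [0, 1], label="ensemble_stratified")
ax.plot([0, 1], [1, 0], label="mcdropout_stratified")
ax.plot([0, 1], [0.5, 0.5], label="ensemble_scaffold")

# Put all 'ensemble' entries before all 'mcdropout' entries.
handles, labels = get_handles_labels(ax, group_order=["ensemble", "mcdropout"])
ax.legend(handles=handles, labels=labels)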

uqdd.metrics.plot_rmse_rejection_curves

plot_rmse_rejection_curves(df: DataFrame, base_dir: str, cmap: str = 'tab10_r', color_dict: Optional[Dict[str, str]] = None, save_dir_plot: Optional[str] = None, add_to_title: str = '', normalize_rmse: bool = False, unc_type: str = 'aleatoric', max_rejection_ratio: float = 0.95, group_order: Optional[List[str]] = None, fig_width: Optional[float] = None, fig_height: Optional[float] = None, show_legend: bool = False) -> pd.DataFrame

Plot RMSE–rejection curves per group, including random rejection baselines, and summarize AUCs.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing columns 'Group', 'Split', and 'project_model'.

required
base_dir str

Base directory where model paths are located.

required
cmap str

Colormap name used to derive distinct colors per group. Default is "tab10_r".

'tab10_r'
color_dict dict[str, str] or None

Precomputed color mapping from group to color. If None, one is generated.

None
save_dir_plot str or None

Directory to save the plot images. Default is None.

None
add_to_title str

Suffix for the plot filename and title. Default is empty string.

''
normalize_rmse bool

If True, normalize RMSE by initial RMSE. Default is False.

False
unc_type (aleatoric, epistemic, both)

Uncertainty component to use for rejection. Default is "aleatoric".

"aleatoric"
max_rejection_ratio float

Maximum fraction of samples to reject. Default is 0.95.

0.95
group_order list of str or None

Order of groups in the legend. Default derives from data.

None
fig_width float or None

Plot width. Default is 6.

None
fig_height float or None

Plot height. Default is 6.

None
show_legend bool

If True, include a legend. Default is False.

False

Returns:

Type Description
DataFrame

Summary DataFrame with columns ['Model type', 'Split', 'Group', 'AUC-RRC_mean', 'AUC-RRC_std'].

Source code in uqdd/metrics/analysis.py, lines 1143–1289
def plot_rmse_rejection_curves(
    df: pd.DataFrame,
    base_dir: str,
    cmap: str = "tab10_r",
    color_dict: Optional[Dict[str, str]] = None,
    save_dir_plot: Optional[str] = None,
    add_to_title: str = "",
    normalize_rmse: bool = False,
    unc_type: str = "aleatoric",
    max_rejection_ratio: float = 0.95,
    group_order: Optional[List[str]] = None,
    fig_width: Optional[float] = None,
    fig_height: Optional[float] = None,
    show_legend: bool = False,
) -> pd.DataFrame:
    """
    Plot RMSE–rejection curves per group, including random rejection baselines, and summarize AUCs.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing columns 'Group', 'Split', and 'project_model'.
    base_dir : str
        Base directory where model paths are located.
    cmap : str, optional
        Colormap name used to derive distinct colors per group. Default is "tab10_r".
    color_dict : dict[str, str] or None, optional
        Precomputed color mapping from group to color. If None, one is generated.
    save_dir_plot : str or None, optional
        Directory to save the plot images. Default is None.
    add_to_title : str, optional
        Suffix for the plot filename and title. Default is empty string.
    normalize_rmse : bool, optional
        If True, normalize RMSE by initial RMSE. Default is False.
    unc_type : {"aleatoric", "epistemic", "both"}, optional
        Uncertainty component to use for rejection. Default is "aleatoric".
    max_rejection_ratio : float, optional
        Maximum fraction of samples to reject. Default is 0.95.
    group_order : list of str or None, optional
        Order of groups in the legend. Default derives from data.
    fig_width : float or None, optional
        Plot width. Default is 6.
    fig_height : float or None, optional
        Plot height. Default is 6.
    show_legend : bool, optional
        If True, include a legend. Default is False.

    Returns
    -------
    pd.DataFrame
        Summary DataFrame with columns ['Model type', 'Split', 'Group', 'AUC-RRC_mean', 'AUC-RRC_std'].
    """
    assert unc_type in ["aleatoric", "epistemic", "both"], "Invalid unc_type"
    unc_col = "y_alea" if unc_type == "aleatoric" else "y_eps"

    plot_width = fig_width if fig_width else 6
    plot_height = fig_height if fig_height else 6
    total_width = plot_width + 4
    total_height = plot_height + 2

    fig = plt.figure(figsize=(total_width, total_height))
    gs = gridspec.GridSpec(1, 1, figure=fig, left=0.15, right=0.75, top=0.9, bottom=0.15)
    ax = fig.add_subplot(gs[0])
    ax.set_position([0.15, 0.15, plot_width / total_width, plot_height / total_height])

    if group_order is None:
        group_order = list(df["Group"].unique())

    if color_dict is None:
        scalar_mappable = ScalarMappable(cmap=cmap)
        colors = scalar_mappable.to_rgba(range(len(group_order)))
        color_dict = {group: color for group, color in zip(group_order, colors)}

    color_dict["random reject"] = "black"

    df = df.copy()
    df.loc[:, "model_path"] = df["project_model"].apply(
        lambda x: (str(os.path.join(base_dir, x)) if not str(x).startswith(base_dir) else x)
    )

    stats_dfs = []
    included_groups = df["Group"].unique()
    legend_handles = []

    for group in included_groups:
        group_data = df[df["Group"] == group]
        model_paths = group_data["model_path"].unique()
        rejection_rates, mean_rmses, std_rmses, mean_auc, std_auc = calculate_rejection_curve(
            df, model_paths, unc_col, normalize_rmse=normalize_rmse, max_rejection_ratio=max_rejection_ratio
        )
        (line,) = ax.plot(
            rejection_rates,
            mean_rmses,
            label=f"{group} (AUC-RRC: {mean_auc:.3f} ± {std_auc:.3f})",
            color=color_dict[group],
        )
        ax.fill_between(rejection_rates, mean_rmses - std_rmses, mean_rmses + std_rmses, color=color_dict[group], alpha=0.2)
        legend_handles.append(line)
        stats_dfs.append({
            "Model type": group.rsplit("_", 1)[1],
            "Split": group.rsplit("_", 1)[0],
            "Group": group,
            "AUC-RRC_mean": mean_auc,
            "AUC-RRC_std": std_auc,
        })

    for split in df["Split"].unique():
        split_data = df[df["Split"] == split]
        model_paths = split_data["model_path"].unique()
        rejection_rates, mean_rmses, std_rmses, mean_auc, std_auc = calculate_rejection_curve(
            df, model_paths, unc_col, random_rejection=True, normalize_rmse=normalize_rmse, max_rejection_ratio=max_rejection_ratio
        )
        (line,) = ax.plot(
            rejection_rates,
            mean_rmses,
            label=f"random reject - {split} (AUC-RRC: {mean_auc:.3f} ± {std_auc:.3f})",
            color="black",
            linestyle="--",
        )
        ax.fill_between(rejection_rates, mean_rmses - std_rmses, mean_rmses + std_rmses, color="grey", alpha=0.2)
        legend_handles.append(line)
        stats_dfs.append({
            "Model type": "random reject",
            "Split": split,
            "Group": f"random reject - {split}",
            "AUC-RRC_mean": mean_auc,
            "AUC-RRC_std": std_auc,
        })

    ax.set_xlabel("Rejection Rate")
    ax.set_ylabel("RMSE" if not normalize_rmse else "Normalized RMSE")
    ax.set_xlim(0, max_rejection_ratio)
    ax.grid(True)

    if show_legend:
        ordered_handles, ordered_labels = get_handles_labels(ax, group_order)
        ordered_handles += [legend_handles[-1]]
        ordered_labels += [legend_handles[-1].get_label()]
        ax.legend(handles=ordered_handles, loc="lower left")

    plot_name = f"rmse_rejection_curve_{add_to_title}" if add_to_title else "rmse_rejection_curve"
    save_plot(fig, save_dir_plot, plot_name, tighten=True, show_legend=show_legend)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

    return pd.DataFrame(stats_dfs)
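
An end-to-end sketch under stated assumptions: the 'Group', 'Split', and 'project_model' columns follow the documentation above, each project_model folder under base_dir is assumed to contain a preds.pkl, and all paths are hypothetical.

import pandas as pd

from uqdd.metrics import plot_rmse_rejection_curves

# Hypothetical aggregation: one row per trained model replicate.
df = pd.DataFrame({
    "Group": ["stratified_ensemble"] * 3 + ["stratified_mcdropout"] * 3,
    "Split": ["stratified"] * 6,
    "project_model": [f"proj/ens_{i}" for i in range(3)]
                     + [f"proj/mcd_{i}" for i in range(3)],
})

stats_df = plot_rmse_rejection_curves(
    df,
    base_dir="runs/consolidated",     # hypothetical; holds the preds.pkl folders
    unc_type="aleatoric",
    normalize_rmse=True,
    save_dir_plot="uqdd/logs/plots",  # hypothetical output directory
    show_legend=True,
)
print(stats_df[["Group", "AUC-RRC_mean", "AUC-RRC_std"]])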

uqdd.metrics.plot_auc_comparison

plot_auc_comparison(stats_df: DataFrame, cmap: str = 'tab10_r', color_dict: Optional[Dict[str, str]] = None, save_dir: Optional[str] = None, add_to_title: str = '', min_y_axis: float = 0.0, hatches_dict: Optional[Dict[str, str]] = None, group_order: Optional[List[str]] = None, fig_width: Optional[float] = None, fig_height: Optional[float] = None, show_legend: bool = False) -> None

Plot bar charts comparing RRC-AUC across splits and model types, including random reject baselines.

Parameters:

Name Type Description Default
stats_df DataFrame

Summary DataFrame with columns ['Group', 'Split', 'Model type', 'AUC-RRC_mean', 'AUC-RRC_std'].

required
cmap str

Colormap name used to derive distinct colors per model type. Default is "tab10_r".

'tab10_r'
color_dict dict[str, str] or None

Precomputed color mapping from model type to color. If None, one is generated.

None
save_dir str or None

Directory to save plot images. Default is None.

None
add_to_title str

Title suffix for the plot. Default is empty string.

''
min_y_axis float

Minimum y-axis limit. Default is 0.0.

0.0
hatches_dict dict[str, str] or None

Hatch mapping for splits (e.g., {"stratified": "\\"}). If None, sensible defaults are used.

None
group_order list of str or None

Order of groups in the legend and x-axis. Default derives from data.

None
fig_width float or None

Plot width. Default is 6.

None
fig_height float or None

Plot height. Default is 6.

None
show_legend bool

If True, include a legend. Default is False.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py, lines 1292–1419
def plot_auc_comparison(
    stats_df: pd.DataFrame,
    cmap: str = "tab10_r",
    color_dict: Optional[Dict[str, str]] = None,
    save_dir: Optional[str] = None,
    add_to_title: str = "",
    min_y_axis: float = 0.0,
    hatches_dict: Optional[Dict[str, str]] = None,
    group_order: Optional[List[str]] = None,
    fig_width: Optional[float] = None,
    fig_height: Optional[float] = None,
    show_legend: bool = False,
) -> None:
    """
    Plot bar charts comparing RRC-AUC across splits and model types, including random reject baselines.

    Parameters
    ----------
    stats_df : pd.DataFrame
        Summary DataFrame with columns ['Group', 'Split', 'Model type', 'AUC-RRC_mean', 'AUC-RRC_std'].
    cmap : str, optional
        Colormap name used to derive distinct colors per model type. Default is "tab10_r".
    color_dict : dict[str, str] or None, optional
        Precomputed color mapping from model type to color. If None, one is generated.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    add_to_title : str, optional
        Title suffix for the plot. Default is empty string.
    min_y_axis : float, optional
        Minimum y-axis limit. Default is 0.0.
    hatches_dict : dict[str, str] or None, optional
        Hatch mapping for splits (e.g., {"stratified": "\\\\"}). Default uses sensible defaults.
    group_order : list of str or None, optional
        Order of groups in the legend and x-axis. Default derives from data.
    fig_width : float or None, optional
        Plot width. Default is 6.
    fig_height : float or None, optional
        Plot height. Default is 6.
    show_legend : bool, optional
        If True, include a legend. Default is False.

    Returns
    -------
    None
    """
    if hatches_dict is None:
        hatches_dict = {"stratified": "\\\\", "scaffold_cluster": "", "time": "/\\/\\/"}

    if group_order:
        all_groups = group_order + list(stats_df.loc[stats_df["Group"].str.startswith("random reject"), "Group"].unique())
        stats_df["Group"] = pd.Categorical(stats_df["Group"], categories=all_groups, ordered=True)
    else:
        all_groups = stats_df["Group"].unique().tolist()

    stats_df = stats_df.sort_values("Group").reset_index(drop=True)

    splits = list(hatches_dict.keys())
    stats_df.loc[:, "Split"] = pd.Categorical(stats_df["Split"], categories=splits, ordered=True)
    stats_df = stats_df.sort_values("Split").reset_index(drop=True)

    unique_model_types = stats_df.loc[stats_df["Model type"] != "random reject", "Model type"].unique()

    if color_dict is None:
        scalar_mappable = ScalarMappable(cmap=cmap)
        colors = scalar_mappable.to_rgba(range(len(unique_model_types)))
        color_dict = {model: color for model, color in zip(unique_model_types, colors)}
    color_dict["random reject"] = "black"

    unique_model_types = np.append(unique_model_types, "random reject")

    bar_width = 0.12
    group_spacing = 0.6

    plot_width = fig_width if fig_width else 6
    plot_height = fig_height if fig_height else 6
    total_width = plot_width + 4
    total_height = plot_height + 4

    fig = plt.figure(figsize=(total_width, total_height))
    gs = gridspec.GridSpec(1, 1, figure=fig, left=0.15, right=0.75, top=0.9, bottom=0.15)
    ax = fig.add_subplot(gs[0])
    ax.set_position([0.15, 0.15, plot_width / total_width, plot_height / total_height])

    tick_positions = []
    tick_labels = []

    for i, split in enumerate(splits):
        split_data = stats_df[stats_df["Split"] == split]
        split_data.loc[:, "Group"] = pd.Categorical(split_data["Group"], categories=all_groups, ordered=True)
        for j, (_, row) in enumerate(split_data.iterrows()):
            position = i * (len(unique_model_types) * bar_width + group_spacing) + j * bar_width
            ax.bar(
                position,
                height=row["AUC-RRC_mean"],
                yerr=row["AUC-RRC_std"],
                color=color_dict[row["Model type"]],
                edgecolor="white" if row["Model type"] == "random reject" else "black",
                hatch=hatches_dict[row["Split"]],
                width=bar_width,
            )
        center_position = i * (len(unique_model_types) * bar_width + group_spacing) + (len(unique_model_types) * bar_width) / 2
        tick_positions.append(center_position)
        tick_labels.append(split)

    def create_stats_legend(color_dict: Dict[str, str], hatches_dict: Dict[str, str], splits: List[str], model_types: Union[List[str], np.ndarray]):
        patches = []
        for split in splits:
            for model in model_types:
                label = f"{split} {model}"
                hatch_color = "white" if model == "random reject" else "black"
                patch = mpatches.Patch(facecolor=color_dict[model], hatch=hatches_dict[split], edgecolor=hatch_color, label=label)
                patches.append(patch)
        return patches

    if show_legend:
        legend_elements = create_stats_legend(color_dict, hatches_dict, splits, unique_model_types)
        ax.legend(handles=legend_elements, bbox_to_anchor=(1.05, 1), loc="upper left", borderaxespad=0, frameon=False)

    ax.set_xticks(tick_positions)
    ax.set_xticklabels(tick_labels, rotation=45, ha="right", rotation_mode="anchor", fontsize=9)
    ax.set_ylabel("RRC-AUC")
    ax.set_ylim(min_y_axis, 1.0)

    plot_name = f"auc_comparison_barplot_{cmap}" + (f"_{add_to_title}" if add_to_title else "")
    save_plot(fig, save_dir, plot_name, tighten=True, show_legend=show_legend)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
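
A short follow-on sketch; stats_df is assumed to be the summary returned by plot_rmse_rejection_curves above, and the output directory is hypothetical.

from uqdd.metrics import plot_auc_comparison

# stats_df as returned by plot_rmse_rejection_curves(...)
plot_auc_comparison(
    stats_df,
    save_dir="uqdd/logs/plots",   # hypothetical output directory
    add_to_title="aleatoric",
    min_y_axis=0.0,
    show_legend=True,
)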

uqdd.metrics.save_stats_df

save_stats_df(stats_df: DataFrame, save_dir: str, add_to_title: str = '') -> None

Save a stats DataFrame to CSV in a given directory.

Parameters:

Name Type Description Default
stats_df DataFrame

DataFrame to save.

required
save_dir str

Target directory to save the CSV.

required
add_to_title str

Suffix to append to the filename. Default is empty string.

''

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py, lines 1422–1440
def save_stats_df(stats_df: pd.DataFrame, save_dir: str, add_to_title: str = "") -> None:
    """
    Save a stats DataFrame to CSV in a given directory.

    Parameters
    ----------
    stats_df : pd.DataFrame
        DataFrame to save.
    save_dir : str
        Target directory to save the CSV.
    add_to_title : str, optional
        Suffix to append to the filename. Default is empty string.

    Returns
    -------
    None
    """
    os.makedirs(save_dir, exist_ok=True)
    stats_df.to_csv(os.path.join(save_dir, f"stats_df_{add_to_title}.csv"), index=False)

uqdd.metrics.load_stats_df

load_stats_df(save_dir: str, add_to_title: str = '') -> pd.DataFrame

Load a stats DataFrame from CSV in a given directory.

Parameters:

Name Type Description Default
save_dir str

Directory containing the CSV.

required
add_to_title str

Suffix appended to the filename. Default is empty string.

''

Returns:

Type Description
DataFrame

Loaded DataFrame.

Source code in uqdd/metrics/analysis.py, lines 1443–1459
def load_stats_df(save_dir: str, add_to_title: str = "") -> pd.DataFrame:
    """
    Load a stats DataFrame from CSV in a given directory.

    Parameters
    ----------
    save_dir : str
        Directory containing the CSV.
    add_to_title : str, optional
        Suffix appended to the filename. Default is empty string.

    Returns
    -------
    pd.DataFrame
        Loaded DataFrame.
    """
    return pd.read_csv(os.path.join(save_dir, f"stats_df_{add_to_title}.csv"))
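
A quick round-trip sketch exercising save_stats_df and load_stats_df together; the directory is a hypothetical location.

from uqdd.metrics import load_stats_df, save_stats_df

save_stats_df(stats_df, save_dir="uqdd/logs/stats", add_to_title="aleatoric")
stats_df_reloaded = load_stats_df("uqdd/logs/stats", add_to_title="aleatoric")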

uqdd.metrics.calc_regression_metrics

calc_regression_metrics(df, cycle_col, val_col, pred_col, thresh)

Compute regression and thresholded classification metrics per cycle/method/split.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing true and predicted values.

required
cycle_col str

Column name identifying cross-validation cycles.

required
val_col str

Column with true target values.

required
pred_col str

Column with predicted target values.

required
thresh float

Threshold to derive binary classes for precision/recall.

required

Returns:

Type Description
DataFrame

Metrics per (cv_cycle, method, split) with columns ['mae', 'mse', 'r2', 'rho', 'prec', 'recall'].

Source code in uqdd/metrics/stats.py, lines 43–82
def calc_regression_metrics(df, cycle_col, val_col, pred_col, thresh):
    """
    Compute regression and thresholded classification metrics per cycle/method/split.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame containing true and predicted values.
    cycle_col : str
        Column name identifying cross-validation cycles.
    val_col : str
        Column with true target values.
    pred_col : str
        Column with predicted target values.
    thresh : float
        Threshold to derive binary classes for precision/recall.

    Returns
    -------
    pd.DataFrame
        Metrics per (cv_cycle, method, split) with columns ['mae', 'mse', 'r2', 'rho', 'prec', 'recall'].
    """
    df_in = df.copy()
    metric_ls = ["mae", "mse", "r2", "rho", "prec", "recall"]
    metric_list = []
    df_in["true_class"] = df_in[val_col] > thresh
    assert len(df_in.true_class.unique()) == 2, "Binary classification requires two classes"
    df_in["pred_class"] = df_in[pred_col] > thresh

    for k, v in df_in.groupby([cycle_col, "method", "split"]):
        cycle, method, split = k
        mae = mean_absolute_error(v[val_col], v[pred_col])
        mse = mean_squared_error(v[val_col], v[pred_col])
        r2 = r2_score(v[val_col], v[pred_col])
        recall = recall_score(v.true_class, v.pred_class)
        prec = precision_score(v.true_class, v.pred_class)
        rho, _ = spearmanr(v[val_col], v[pred_col])
        metric_list.append([cycle, method, split, mae, mse, r2, rho, prec, recall])
    metric_df = pd.DataFrame(metric_list, columns=["cv_cycle", "method", "split"] + metric_ls)
    return metric_df
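
A minimal sketch on synthetic data. Besides the cycle, value, and prediction columns named in the signature, the grouping in the source also requires 'method' and 'split' columns; all values below are randomly generated for illustration.

import numpy as np
import pandas as pd

from uqdd.metrics import calc_regression_metrics

rng = np.random.default_rng(0)
rows = []
for cycle in range(5):
    for method in ["ensemble", "mcdropout"]:
        y = rng.uniform(4, 9, size=200)   # synthetic target values
        rows.append(pd.DataFrame({
            "cv_cycle": cycle,
            "method": method,
            "split": "stratified",
            "y_true": y,
            "y_pred": y + rng.normal(scale=0.5, size=y.size),
        }))
df = pd.concat(rows, ignore_index=True)

metric_df = calc_regression_metrics(
    df, cycle_col="cv_cycle", val_col="y_true", pred_col="y_pred", thresh=6.5
)
print(metric_df.groupby("method")[["mae", "r2", "rho"]].mean())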

uqdd.metrics.bootstrap_ci

bootstrap_ci(data, func=np.mean, n_bootstrap=1000, ci=95, random_state=42)

Compute bootstrap confidence interval for a statistic.

Parameters:

Name Type Description Default
data array - like

Sequence of numeric values.

required
func callable

Statistic function applied to bootstrap samples (e.g., numpy.mean). Default is numpy.mean.

mean
n_bootstrap int

Number of bootstrap resamples. Default is 1000.

1000
ci int or float

Confidence level percentage (e.g., 95). Default is 95.

95
random_state int

Seed for reproducibility. Default is 42.

42

Returns:

Type Description
tuple[float, float]

Lower and upper bounds for the confidence interval.

Source code in uqdd/metrics/stats.py, lines 85–115
def bootstrap_ci(data, func=np.mean, n_bootstrap=1000, ci=95, random_state=42):
    """
    Compute bootstrap confidence interval for a statistic.

    Parameters
    ----------
    data : array-like
        Sequence of numeric values.
    func : callable, optional
        Statistic function applied to bootstrap samples (e.g., numpy.mean). Default is numpy.mean.
    n_bootstrap : int, optional
        Number of bootstrap resamples. Default is 1000.
    ci : int or float, optional
        Confidence level percentage (e.g., 95). Default is 95.
    random_state : int, optional
        Seed for reproducibility. Default is 42.

    Returns
    -------
    tuple[float, float]
        Lower and upper bounds for the confidence interval.
    """
    np.random.seed(random_state)
    bootstrap_samples = []
    for _ in range(n_bootstrap):
        sample = resample(data, random_state=np.random.randint(0, 10000))
        bootstrap_samples.append(func(sample))
    alpha = (100 - ci) / 2
    lower = np.percentile(bootstrap_samples, alpha)
    upper = np.percentile(bootstrap_samples, 100 - alpha)
    return lower, upper
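
A small sketch: a 95% bootstrap confidence interval for the mean of a handful of per-cycle RMSE values; the numbers are illustrative.

import numpy as np

from uqdd.metrics import bootstrap_ci

rmse_per_cycle = np.array([0.62, 0.58, 0.65, 0.61, 0.60, 0.63])
lower, upper = bootstrap_ci(rmse_per_cycle, func=np.mean, n_bootstrap=2000, ci=95)
print(f"mean RMSE = {rmse_per_cycle.mean():.3f}, 95% CI = [{lower:.3f}, {upper:.3f}]")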

uqdd.metrics.rm_tukey_hsd

rm_tukey_hsd(df, metric, group_col, alpha=0.05, sort=False, direction_dict=None)

Repeated-measures Tukey HSD approximation using RM-ANOVA and studentized range.

Parameters:

Name Type Description Default
df DataFrame

Long-form DataFrame with columns including the metric, group, and 'cv_cycle' subject.

required
metric str

Metric column to compare.

required
group_col str

Column indicating groups (e.g., method/model type).

required
alpha float

Family-wise error rate for intervals. Default is 0.05.

0.05
sort bool

If True, sort groups by mean value of the metric. Default is False.

False
direction_dict dict or None

Mapping of metric -> 'maximize'|'minimize' to set sort ascending/descending.

None

Returns:

Type Description
tuple

(result_tab, df_means, df_means_diff, pc) where:

- result_tab: DataFrame of pairwise comparisons with mean differences and confidence intervals.
- df_means: mean metric value per group.
- df_means_diff: matrix of pairwise mean differences.
- pc: matrix of adjusted p-values.

Source code in uqdd/metrics/stats.py, lines 118–195
def rm_tukey_hsd(df, metric, group_col, alpha=0.05, sort=False, direction_dict=None):
    """
    Repeated-measures Tukey HSD approximation using RM-ANOVA and studentized range.

    Parameters
    ----------
    df : pd.DataFrame
        Long-form DataFrame with columns including the metric, group, and 'cv_cycle' subject.
    metric : str
        Metric column to compare.
    group_col : str
        Column indicating groups (e.g., method/model type).
    alpha : float, optional
        Family-wise error rate for intervals. Default is 0.05.
    sort : bool, optional
        If True, sort groups by mean value of the metric. Default is False.
    direction_dict : dict or None, optional
        Mapping of metric -> 'maximize'|'minimize' to set sort ascending/descending.

    Returns
    -------
    tuple
        (result_tab, df_means, df_means_diff, p_values_matrix) where:
        - result_tab: DataFrame of pairwise comparisons with mean differences and CIs.
        - df_means: mean per group.
        - df_means_diff: matrix of mean differences.
        - pc: matrix of adjusted p-values.
    """
    if sort and direction_dict and metric in direction_dict:
        ascending = direction_dict[metric] != "maximize"
        df_means = df.groupby(group_col).mean(numeric_only=True).sort_values(metric, ascending=ascending)
    else:
        df_means = df.groupby(group_col).mean(numeric_only=True)

    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=RuntimeWarning, message="divide by zero encountered in scalar divide")
        aov = pg.rm_anova(dv=metric, within=group_col, subject="cv_cycle", data=df, detailed=True)
    mse = aov.loc[1, "MS"]
    df_resid = aov.loc[1, "DF"]

    methods = df_means.index
    n_groups = len(methods)
    n_per_group = df[group_col].value_counts().mean()
    tukey_se = np.sqrt(2 * mse / (n_per_group))
    q = qsturng(1 - alpha, n_groups, df_resid)
    if isinstance(q, (tuple, list, np.ndarray)):
        q = q[0]

    num_comparisons = len(methods) * (len(methods) - 1) // 2
    result_tab = pd.DataFrame(index=range(num_comparisons), columns=["group1", "group2", "meandiff", "lower", "upper", "p-adj"])
    df_means_diff = pd.DataFrame(index=methods, columns=methods, data=0.0)
    pc = pd.DataFrame(index=methods, columns=methods, data=1.0)

    row_idx = 0
    for i, method1 in enumerate(methods):
        for j, method2 in enumerate(methods):
            if i < j:
                group1 = df[df[group_col] == method1][metric]
                group2 = df[df[group_col] == method2][metric]
                mean_diff = group1.mean() - group2.mean()
                studentized_range = np.abs(mean_diff) / tukey_se
                adjusted_p = psturng(studentized_range * np.sqrt(2), n_groups, df_resid)
                if isinstance(adjusted_p, (tuple, list, np.ndarray)):
                    adjusted_p = adjusted_p[0]
                lower = mean_diff - (q / np.sqrt(2) * tukey_se)
                upper = mean_diff + (q / np.sqrt(2) * tukey_se)
                result_tab.loc[row_idx] = [method1, method2, mean_diff, lower, upper, adjusted_p]
                pc.loc[method1, method2] = adjusted_p
                pc.loc[method2, method1] = adjusted_p
                df_means_diff.loc[method1, method2] = mean_diff
                df_means_diff.loc[method2, method1] = -mean_diff
                row_idx += 1

    df_means_diff = df_means_diff.astype(float)
    result_tab["group1_mean"] = result_tab["group1"].map(df_means[metric])
    result_tab["group2_mean"] = result_tab["group2"].map(df_means[metric])
    result_tab.index = result_tab["group1"] + " - " + result_tab["group2"]
    return result_tab, df_means, df_means_diff, pc
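
A sketch of a repeated-measures comparison on synthetic, balanced data (every method observed in every cv_cycle, as the RM-ANOVA requires); the method names and values are placeholders.

import numpy as np
import pandas as pd

from uqdd.metrics import rm_tukey_hsd

rng = np.random.default_rng(0)
methods = {"ensemble": 0.60, "mcdropout": 0.65, "evidential": 0.70}
df = pd.DataFrame([
    {"cv_cycle": c, "method": m, "rmse": base + rng.normal(scale=0.02)}
    for c in range(10) for m, base in methods.items()
])

result_tab, df_means, df_means_diff, pc = rm_tukey_hsd(
    df, metric="rmse", group_col="method", alpha=0.05,
    sort=True, direction_dict={"rmse": "minimize"},
)
print(result_tab[["meandiff", "lower", "upper", "p-adj"]])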

uqdd.metrics.make_boxplots

make_boxplots(df, metric_ls, save_dir=None, name_prefix='', model_order=None)

Plot boxplots for each metric grouped by method.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric_ls list of str

Metrics to visualize.

required
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Prefix for the output filename. Default is empty.

''
model_order list of str or None

Explicit order of methods on the x-axis. Default derives from data.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py, lines 198–238
def make_boxplots(df, metric_ls, save_dir=None, name_prefix="", model_order=None):
    """
    Plot boxplots for each metric grouped by method.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric_ls : list of str
        Metrics to visualize.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Prefix for the output filename. Default is empty.
    model_order : list of str or None, optional
        Explicit order of methods on the x-axis. Default derives from data.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    sns.set_theme(context="paper", font_scale=1.5)
    sns.set_style("whitegrid")
    figure, axes = plt.subplots(1, len(metric_ls), sharex=False, sharey=False, figsize=(28, 8))
    for i, stat in enumerate(metric_ls):
        ax = sns.boxplot(y=stat, x="method", hue="method", ax=axes[i], data=df, palette="Set2", legend=False, order=model_order, hue_order=model_order)
        title = stat.upper()
        ax.set_xlabel("")
        ax.set_ylabel(title)
        x_tick_labels = ax.get_xticklabels()
        label_text_list = [x.get_text() for x in x_tick_labels]
        new_xtick_labels = ["\n".join(x.split("_")) for x in label_text_list]
        ax.set_xticks(list(range(0, len(x_tick_labels))))
        ax.set_xticklabels(new_xtick_labels, rotation=45, ha="right")
    plt.subplots_adjust(hspace=0.3, wspace=0.3)
    save_plot(figure, save_dir, f"{name_prefix}_boxplot_{'_'.join(metric_ls)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
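
A short sketch using a synthetic long-form frame; the same call pattern applies to make_boxplots_parametric and make_boxplots_nonparametric below (both additionally use the 'cv_cycle' column). Note that the subplot grid indexes into axes, so pass at least two metrics.

import numpy as np
import pandas as pd

from uqdd.metrics import make_boxplots

rng = np.random.default_rng(1)
methods = ["ensemble", "mcdropout", "evidential"]
df = pd.DataFrame([
    {"cv_cycle": c, "method": m,
     "rmse": 0.60 + 0.05 * i + rng.normal(scale=0.02),
     "r2": 0.75 - 0.05 * i + rng.normal(scale=0.02)}
    for c in range(10) for i, m in enumerate(methods)
])

make_boxplots(
    df, metric_ls=["rmse", "r2"],
    save_dir="uqdd/logs/plots",   # hypothetical output directory
    name_prefix="demo", model_order=methods,
)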

uqdd.metrics.make_boxplots_parametric

make_boxplots_parametric(df, metric_ls, save_dir=None, name_prefix='', model_order=None)

Plot boxplots with RM-ANOVA p-values annotated per metric.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric_ls list of str

Metrics to visualize.

required
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Prefix for the output filename. Default is empty.

''
model_order list of str or None

Explicit order of methods on the x-axis. Default derives from data.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py, lines 241–284
def make_boxplots_parametric(df, metric_ls, save_dir=None, name_prefix="", model_order=None):
    """
    Plot boxplots with RM-ANOVA p-values annotated per metric.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric_ls : list of str
        Metrics to visualize.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Prefix for the output filename. Default is empty.
    model_order : list of str or None, optional
        Explicit order of methods on the x-axis. Default derives from data.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    sns.set_theme(context="paper", font_scale=1.5)
    sns.set_style("whitegrid")
    figure, axes = plt.subplots(1, len(metric_ls), sharex=False, sharey=False, figsize=(28, 8))
    for i, stat in enumerate(metric_ls):
        model = AnovaRM(data=df, depvar=stat, subject="cv_cycle", within=["method"]).fit()
        p_value = model.anova_table["Pr > F"].iloc[0]
        ax = sns.boxplot(y=stat, x="method", hue="method", ax=axes[i], data=df, palette="Set2", legend=False, order=model_order, hue_order=model_order)
        title = stat.upper()
        ax.set_title(f"p={p_value:.1e}")
        ax.set_xlabel("")
        ax.set_ylabel(title)
        x_tick_labels = ax.get_xticklabels()
        label_text_list = [x.get_text() for x in x_tick_labels]
        new_xtick_labels = ["\n".join(x.split("_")) for x in label_text_list]
        ax.set_xticks(list(range(0, len(x_tick_labels))))
        ax.set_xticklabels(new_xtick_labels, rotation=45, ha="right")
    plt.subplots_adjust(hspace=0.3, wspace=0.3)
    save_plot(figure, save_dir, f"{name_prefix}_boxplot_parametric_{'_'.join(metric_ls)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.make_boxplots_nonparametric

make_boxplots_nonparametric(df, metric_ls, save_dir=None, name_prefix='', model_order=None)

Plot boxplots with Friedman p-values annotated per metric.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric_ls list of str

Metrics to visualize.

required
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Prefix for the output filename. Default is empty.

''
model_order list of str or None

Explicit order of methods on the x-axis. Default derives from data.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py, lines 287–330
def make_boxplots_nonparametric(df, metric_ls, save_dir=None, name_prefix="", model_order=None):
    """
    Plot boxplots with Friedman p-values annotated per metric.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric_ls : list of str
        Metrics to visualize.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Prefix for the output filename. Default is empty.
    model_order : list of str or None, optional
        Explicit order of methods on the x-axis. Default derives from data.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    n_metrics = len(metric_ls)
    sns.set_theme(context="paper", font_scale=1.5)
    sns.set_style("whitegrid")
    figure, axes = plt.subplots(1, n_metrics, sharex=False, sharey=False, figsize=(28, 8))
    for i, stat in enumerate(metric_ls):
        friedman = pg.friedman(df, dv=stat, within="method", subject="cv_cycle")["p-unc"].values[0]
        ax = sns.boxplot(y=stat, x="method", hue="method", ax=axes[i], data=df, palette="Set2", legend=False, order=model_order, hue_order=model_order)
        title = stat.replace("_", " ").upper()
        ax.set_title(f"p={friedman:.1e}")
        ax.set_xlabel("")
        ax.set_ylabel(title)
        x_tick_labels = ax.get_xticklabels()
        label_text_list = [x.get_text() for x in x_tick_labels]
        new_xtick_labels = ["\n".join(x.split("_")) for x in label_text_list]
        ax.set_xticks(list(range(0, len(x_tick_labels))))
        ax.set_xticklabels(new_xtick_labels, rotation=45, ha="right")
    plt.subplots_adjust(hspace=0.3, wspace=0.3)
    save_plot(figure, save_dir, f"{name_prefix}_boxplot_nonparametric_{'_'.join(metric_ls)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.make_sign_plots_nonparametric

make_sign_plots_nonparametric(df, metric_ls, save_dir=None, name_prefix='', model_order=None)

Plot significance heatmaps (Conover post-hoc) for nonparametric comparisons.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric_ls list of str

Metrics to analyze.

required
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Prefix for the output filename. Default is empty.

''
model_order list of str or None

Explicit order of methods on axes. Default derives from data.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py, lines 333–374
def make_sign_plots_nonparametric(df, metric_ls, save_dir=None, name_prefix="", model_order=None):
    """
    Plot significance heatmaps (Conover post-hoc) for nonparametric comparisons.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric_ls : list of str
        Metrics to analyze.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Prefix for the output filename. Default is empty.
    model_order : list of str or None, optional
        Explicit order of methods on axes. Default derives from data.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    heatmap_args = {"linewidths": 0.25, "linecolor": "0.5", "clip_on": True, "square": True, "cbar_kws": {"pad": 0.05, "location": "right"}}
    n_metrics = len(metric_ls)
    sns.set_theme(context="paper", font_scale=1.5)
    figure, axes = plt.subplots(1, n_metrics, sharex=False, sharey=True, figsize=(26, 8))
    if n_metrics == 1:
        axes = [axes]
    for i, stat in enumerate(metric_ls):
        pc = sp.posthoc_conover_friedman(df, y_col=stat, group_col="method", block_col="cv_cycle", block_id_col="cv_cycle", p_adjust="holm", melted=True)
        if model_order is not None:
            pc = pc.reindex(index=model_order, columns=model_order)
        sub_ax, sub_c = sp.sign_plot(pc, **heatmap_args, ax=axes[i], xticklabels=True)
        sub_ax.set_title(stat.upper())
        if sub_c is not None and hasattr(sub_c, "ax"):
            figure.subplots_adjust(right=0.85)
            sub_c.ax.set_position([0.87, 0.5, 0.02, 0.2])
    save_plot(figure, save_dir, f"{name_prefix}_sign_plot_nonparametric_{'_'.join(metric_ls)}", tighten=False)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.make_critical_difference_diagrams

make_critical_difference_diagrams(df, metric_ls, save_dir=None, name_prefix='', model_order=None)

Plot critical difference diagrams per metric using average ranks and post-hoc p-values.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric_ls list of str

Metrics to analyze.

required
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Prefix for the output filename. Default is empty.

''
model_order list of str or None

Explicit order of models on diagrams. Default derives from data.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py, lines 377–414
def make_critical_difference_diagrams(df, metric_ls, save_dir=None, name_prefix="", model_order=None):
    """
    Plot critical difference diagrams per metric using average ranks and post-hoc p-values.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric_ls : list of str
        Metrics to analyze.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Prefix for the output filename. Default is empty.
    model_order : list of str or None, optional
        Explicit order of models on diagrams. Default derives from data.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    n_metrics = len(metric_ls)
    figure, axes = plt.subplots(n_metrics, 1, sharex=True, sharey=False, figsize=(16, 10))
    for i, stat in enumerate(metric_ls):
        avg_rank = df.groupby("cv_cycle")[stat].rank(pct=True).groupby(df.method).mean()
        pc = sp.posthoc_conover_friedman(df, y_col=stat, group_col="method", block_col="cv_cycle", block_id_col="cv_cycle", p_adjust="holm", melted=True)
        if model_order is not None:
            avg_rank = avg_rank.reindex(model_order)
            pc = pc.reindex(index=model_order, columns=model_order)
        sp.critical_difference_diagram(avg_rank, pc, ax=axes[i])
        axes[i].set_title(stat.upper())
    plt.subplots_adjust(hspace=0.3, wspace=0.3)
    save_plot(figure, save_dir, f"{name_prefix}_critical_difference_diagram_{'_'.join(metric_ls)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
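
A hedged sketch reusing the synthetic frame built in the make_boxplots example; the diagrams rely on the 'cv_cycle' blocking column, and (as with the boxplots) at least two metrics are needed for the subplot indexing.

from uqdd.metrics import make_critical_difference_diagrams

# df as constructed in the make_boxplots sketch above.
make_critical_difference_diagrams(
    df, metric_ls=["rmse", "r2"],
    save_dir="uqdd/logs/plots",   # hypothetical output directory
    name_prefix="demo",
    model_order=["ensemble", "mcdropout", "evidential"],
)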

uqdd.metrics.make_normality_diagnostic

make_normality_diagnostic(df, metric_ls, save_dir=None, name_prefix='')

Plot normality diagnostics (histogram/KDE and Q-Q) for residualized metrics.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric_ls list of str

Metrics to diagnose.

required
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Prefix for the output filename. Default is empty.

''

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py, lines 417–469
def make_normality_diagnostic(df, metric_ls, save_dir=None, name_prefix=""):
    """
    Plot normality diagnostics (histogram/KDE and Q-Q) for residualized metrics.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric_ls : list of str
        Metrics to diagnose.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Prefix for the output filename. Default is empty.

    Returns
    -------
    None
    """
    df_norm = df.copy()
    df_norm.replace([np.inf, -np.inf], np.nan, inplace=True)
    for metric in metric_ls:
        df_norm[metric] = df_norm[metric] - df_norm.groupby("method")[metric].transform("mean")
    df_norm = df_norm.melt(id_vars=["cv_cycle", "method", "split"], value_vars=metric_ls, var_name="metric", value_name="value")
    sns.set_theme(context="paper", font_scale=1.5)
    sns.set_style("whitegrid")
    metrics = df_norm["metric"].unique()
    n_metrics = len(metrics)
    fig, axes = plt.subplots(2, n_metrics, figsize=(20, 10))
    for i, metric in enumerate(metrics):
        ax = axes[0, i]
        sns.histplot(df_norm[df_norm["metric"] == metric]["value"], kde=True, ax=ax)
        ax.set_title(f"{metric}")
        ax.set_xlabel("")
        if i == 0:
            ax.set_ylabel("Count")
        else:
            ax.set_ylabel("")
    for i, metric in enumerate(metrics):
        ax = axes[1, i]
        metric_data = df_norm[df_norm["metric"] == metric]["value"]
        stats.probplot(metric_data, dist="norm", plot=ax)
        ax.set_title("")
        ax.set_xlabel("Theoretical Quantiles")
        if i == 0:
            ax.set_ylabel("Ordered Values")
        else:
            ax.set_ylabel("")
    plt.subplots_adjust(hspace=0.3, wspace=0.8)
    save_plot(fig, save_dir, f"{name_prefix}_normality_diagnostic_{'_'.join(metric_ls)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.mcs_plot

mcs_plot(pc, effect_size, means, labels=True, cmap=None, cbar_ax_bbox=None, ax=None, show_diff=True, cell_text_size=10, axis_text_size=8, show_cbar=True, reverse_cmap=False, vlim=None, **kwargs)

Render a multiple-comparisons significance heatmap annotated with effect sizes and stars.

Parameters:

Name Type Description Default
pc DataFrame

Matrix of adjusted p-values.

required
effect_size DataFrame

Matrix of mean differences (effect sizes) aligned with pc.

required
means Series

Mean values per group for labeling.

required
labels bool

If True, add x/y tick labels from means.index. Default is True.

True
cmap str or None

Colormap name for effect sizes. Default is 'YlGnBu'.

None
cbar_ax_bbox tuple or None

Custom colorbar axes bbox; unused here but kept for API compatibility.

None
ax Axes or None

Axes to draw into; if None, a new axes is created.

None
show_diff bool

If True, annotate cells with rounded effect sizes plus significance. Default is True.

True
cell_text_size int

Font size for annotations. Default is 10.

10
axis_text_size int

Font size for axis tick labels. Default is 8.

8
show_cbar bool

If True, show colorbar. Default is True.

True
reverse_cmap bool

If True, use reversed colormap. Default is False.

False
vlim float or None

Symmetric limit for color scaling around 0. Default is None.

None

Returns:

Type Description
Axes

Axes containing the rendered heatmap.

Source code in uqdd/metrics/stats.py, lines 472–536
def mcs_plot(pc, effect_size, means, labels=True, cmap=None, cbar_ax_bbox=None, ax=None, show_diff=True, cell_text_size=10, axis_text_size=8, show_cbar=True, reverse_cmap=False, vlim=None, **kwargs):
    """
    Render a multiple-comparisons significance heatmap annotated with effect sizes and stars.

    Parameters
    ----------
    pc : pd.DataFrame
        Matrix of adjusted p-values.
    effect_size : pd.DataFrame
        Matrix of mean differences (effect sizes) aligned with `pc`.
    means : pd.Series
        Mean values per group for labeling.
    labels : bool, optional
        If True, add x/y tick labels from `means.index`. Default is True.
    cmap : str or None, optional
        Colormap name for effect sizes. Default is 'YlGnBu'.
    cbar_ax_bbox : tuple or None, optional
        Custom colorbar axes bbox; unused here but kept for API compatibility.
    ax : matplotlib.axes.Axes or None, optional
        Axes to draw into; if None, a new axes is created.
    show_diff : bool, optional
        If True, annotate cells with rounded effect sizes plus significance. Default is True.
    cell_text_size : int, optional
        Font size for annotations. Default is 10.
    axis_text_size : int, optional
        Font size for axis tick labels. Default is 8.
    show_cbar : bool, optional
        If True, show colorbar. Default is True.
    reverse_cmap : bool, optional
        If True, use reversed colormap. Default is False.
    vlim : float or None, optional
        Symmetric limit for color scaling around 0. Default is None.

    Returns
    -------
    matplotlib.axes.Axes
        Axes containing the rendered heatmap.
    """
    for key in ["cbar", "vmin", "vmax", "center"]:
        if key in kwargs:
            del kwargs[key]
    if not cmap:
        cmap = "YlGnBu"
    if reverse_cmap:
        cmap = cmap + "_r"
    significance = pc.copy().astype(object)
    significance[(pc < 0.001) & (pc >= 0)] = "***"
    significance[(pc < 0.01) & (pc >= 0.001)] = "**"
    significance[(pc < 0.05) & (pc >= 0.01)] = "*"
    significance[(pc >= 0.05)] = ""
    np.fill_diagonal(significance.values, "")
    annotations = effect_size.round(2).astype(str) + significance if show_diff else significance
    hax = sns.heatmap(effect_size, cmap=cmap, annot=annotations, fmt="", cbar=show_cbar, ax=ax, annot_kws={"size": cell_text_size}, vmin=-2 * vlim if vlim else None, vmax=2 * vlim if vlim else None, square=True, **kwargs)
    if labels:
        label_list = list(means.index)
        x_label_list = label_list
        y_label_list = label_list
        xtick_positions = np.arange(len(label_list))
        hax.set_xticks(xtick_positions + 0.5)
        hax.set_xticklabels(x_label_list, size=axis_text_size, ha="center", va="center", rotation=90)
        hax.set_yticks(xtick_positions + 0.5)
        hax.set_yticklabels(y_label_list, size=axis_text_size, ha="center", va="center", rotation=0)
    hax.set_xlabel("")
    hax.set_ylabel("")
    return hax
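
Example: a minimal sketch with hand-made p-value and effect-size matrices; in practice pc, effect_size, and means come from rm_tukey_hsd, as in make_mcs_plot_grid below. The model names and numbers here are placeholders.

import pandas as pd
import matplotlib.pyplot as plt
from uqdd.metrics import mcs_plot

models = ["MCD", "Ensemble", "Evidential"]  # placeholder model names
pc = pd.DataFrame([[1.0, 0.004, 0.030],
                   [0.004, 1.0, 0.200],
                   [0.030, 0.200, 1.0]], index=models, columns=models)
diff = pd.DataFrame([[0.00, 0.12, 0.05],
                     [-0.12, 0.00, -0.07],
                     [-0.05, 0.07, 0.00]], index=models, columns=models)
means = pd.Series([0.70, 0.58, 0.65], index=models)
fig, ax = plt.subplots(figsize=(4, 4))
mcs_plot(pc, effect_size=diff, means=means, ax=ax, vlim=0.1)
plt.show()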

uqdd.metrics.make_mcs_plot_grid

make_mcs_plot_grid(df, stats_list, group_col, alpha=0.05, figsize=(20, 10), direction_dict=None, effect_dict=None, show_diff=True, cell_text_size=16, axis_text_size=12, title_text_size=16, sort_axes=False, save_dir=None, name_prefix='', model_order=None)

Generate a grid of MCS plots for multiple metrics.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
stats_list list of str

Metrics to include.

required
group_col str

Column indicating groups (e.g., method).

required
alpha float

Significance level. Default is 0.05.

0.05
figsize tuple

Figure size. Default is (20, 10).

(20, 10)
direction_dict dict or None

Mapping metric -> 'maximize'|'minimize' for colormap orientation.

None
effect_dict dict or None

Mapping metric -> effect size limit for color scaling.

None
show_diff bool

If True, annotate mean differences; else annotate significance only.

True
cell_text_size int

Annotation font size.

16
axis_text_size int

Axis label font size.

12
title_text_size int

Title font size.

16
sort_axes bool

If True, sort groups by mean values per metric.

False
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Filename prefix. Default is empty.

''
model_order list of str or None

Explicit model order for rows/cols.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py (lines 539–620)
def make_mcs_plot_grid(df, stats_list, group_col, alpha=0.05, figsize=(20, 10), direction_dict=None, effect_dict=None, show_diff=True, cell_text_size=16, axis_text_size=12, title_text_size=16, sort_axes=False, save_dir=None, name_prefix="", model_order=None):
    """
    Generate a grid of MCS plots for multiple metrics.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    stats_list : list of str
        Metrics to include.
    group_col : str
        Column indicating groups (e.g., method).
    alpha : float, optional
        Significance level. Default is 0.05.
    figsize : tuple, optional
        Figure size. Default is (20, 10).
    direction_dict : dict or None, optional
        Mapping metric -> 'maximize'|'minimize' for colormap orientation.
    effect_dict : dict or None, optional
        Mapping metric -> effect size limit for color scaling.
    show_diff : bool, optional
        If True, annotate mean differences; else annotate significance only.
    cell_text_size : int, optional
        Annotation font size.
    axis_text_size : int, optional
        Axis label font size.
    title_text_size : int, optional
        Title font size.
    sort_axes : bool, optional
        If True, sort groups by mean values per metric.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Filename prefix. Default is empty.
    model_order : list of str or None, optional
        Explicit model order for rows/cols.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    nrow = math.ceil(len(stats_list) / 3)
    fig, ax = plt.subplots(nrow, 3, figsize=figsize)
    for key in ["r2", "rho", "prec", "recall", "mae", "mse"]:
        direction_dict.setdefault(key, "maximize" if key in ["r2", "rho", "prec", "recall"] else "minimize")
    for key in ["r2", "rho", "prec", "recall"]:
        effect_dict.setdefault(key, 0.1)
    for i, stat in enumerate(stats_list):
        row = i // 3
        col = i % 3
        if stat not in direction_dict:
            raise ValueError(f"Stat '{stat}' is missing in direction_dict. Please set its value.")
        if stat not in effect_dict:
            raise ValueError(f"Stat '{stat}' is missing in effect_dict. Please set its value.")
        reverse_cmap = direction_dict[stat] == "minimize"
        _, df_means, df_means_diff, pc = rm_tukey_hsd(df, stat, group_col, alpha, sort_axes, direction_dict)
        if model_order is not None:
            df_means = df_means.reindex(model_order)
            df_means_diff = df_means_diff.reindex(index=model_order, columns=model_order)
            pc = pc.reindex(index=model_order, columns=model_order)
        hax = mcs_plot(pc, effect_size=df_means_diff, means=df_means[stat], show_diff=show_diff, ax=ax[row, col], cbar=True, cell_text_size=cell_text_size, axis_text_size=axis_text_size, reverse_cmap=reverse_cmap, vlim=effect_dict[stat])
        hax.set_title(stat.upper(), fontsize=title_text_size)
    if (len(stats_list) % 3) != 0:
        for i in range(len(stats_list), nrow * 3):
            row = i // 3
            col = i % 3
            ax[row, col].set_visible(False)
    from matplotlib.lines import Line2D
    legend_elements = [
        Line2D([0], [0], marker="o", color="w", label="p < 0.001 (***): Highly Significant", markerfacecolor="black", markersize=10),
        Line2D([0], [0], marker="o", color="w", label="p < 0.01 (**): Very Significant", markerfacecolor="black", markersize=10),
        Line2D([0], [0], marker="o", color="w", label="p < 0.05 (*): Significant", markerfacecolor="black", markersize=10),
        Line2D([0], [0], marker="o", color="w", label="p >= 0.05: Not Significant", markerfacecolor="black", markersize=10),
    ]
    fig.legend(handles=legend_elements, loc="upper right", ncol=2, fontsize=12, frameon=False)
    plt.subplots_adjust(top=0.88)
    save_plot(fig, save_dir, f"{name_prefix}_mcs_plot_grid_{'_'.join(stats_list)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
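
Example: a sketch on a synthetic, already-harmonized results table (columns cv_cycle, method, split plus one numeric column per metric); it assumes rm_tukey_hsd treats cv_cycle as the repeated-measures factor. Model names, metric values, and effect-size limits are placeholders; pass save_dir to write the figure via save_plot.

import numpy as np
import pandas as pd
from uqdd.metrics import make_mcs_plot_grid

rng = np.random.default_rng(0)
rows = [{"cv_cycle": c, "method": m, "split": "random",
         "rmse": rng.normal(0.70 + d, 0.03), "nll": rng.normal(1.0 + d, 0.05),
         "r2": rng.normal(0.55 - d, 0.03), "rho": rng.normal(0.60 - d, 0.03)}
        for m, d in [("MCD", 0.05), ("Ensemble", 0.0), ("Evidential", 0.02), ("GP", 0.03)]
        for c in range(10)]
df = pd.DataFrame(rows)
direction = {"rmse": "minimize", "nll": "minimize", "r2": "maximize", "rho": "maximize"}
effect = {"rmse": 0.05, "nll": 0.1, "r2": 0.1, "rho": 0.1}
make_mcs_plot_grid(df, stats_list=["rmse", "nll", "r2", "rho"], group_col="method",
                   direction_dict=direction, effect_dict=effect, name_prefix="demo")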

uqdd.metrics.make_scatterplot

make_scatterplot(df, val_col, pred_col, thresh, cycle_col='cv_cycle', group_col='method', save_dir=None)

Scatter plots of predicted vs true values per method, with threshold lines and summary stats.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
val_col str

True value column.

required
pred_col str

Predicted value column.

required
thresh float

Threshold for classification overlays.

required
cycle_col str

Cross-validation cycle column. Default is 'cv_cycle'.

'cv_cycle'
group_col str

Method/model type column. Default is 'method'.

'method'
save_dir str or None

Directory to save the plot. Default is None.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py (lines 623–673)
def make_scatterplot(df, val_col, pred_col, thresh, cycle_col="cv_cycle", group_col="method", save_dir=None):
    """
    Scatter plots of predicted vs true values per method, with threshold lines and summary stats.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    val_col : str
        True value column.
    pred_col : str
        Predicted value column.
    thresh : float
        Threshold for classification overlays.
    cycle_col : str, optional
        Cross-validation cycle column. Default is 'cv_cycle'.
    group_col : str, optional
        Method/model type column. Default is 'method'.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    df_split_metrics = calc_regression_metrics(df, cycle_col=cycle_col, val_col=val_col, pred_col=pred_col, thresh=thresh)
    methods = df[group_col].unique()
    fig, axs = plt.subplots(nrows=1, ncols=len(methods), figsize=(25, 10))
    for ax, method in zip(axs, methods):
        df_method = df.query(f"{group_col} == @method")
        df_metrics = df_split_metrics.query(f"{group_col} == @method")
        ax.scatter(df_method[pred_col], df_method[val_col], alpha=0.3)
        ax.plot([df_method[val_col].min(), df_method[val_col].max()], [df_method[val_col].min(), df_method[val_col].max()], "k--", lw=1)
        ax.axhline(y=thresh, color="r", linestyle="--")
        ax.axvline(x=thresh, color="r", linestyle="--")
        ax.set_title(method)
        y_true = df_method[val_col] > thresh
        y_pred = df_method[pred_col] > thresh
        precision = precision_score(y_true, y_pred)
        recall = recall_score(y_true, y_pred)
        metrics_text = f"MAE: {df_metrics['mae'].mean():.2f}\nMSE: {df_metrics['mse'].mean():.2f}\nR2: {df_metrics['r2'].mean():.2f}\nrho: {df_metrics['rho'].mean():.2f}\nPrecision: {precision:.2f}\nRecall: {recall:.2f}"
        ax.text(0.05, 0.5, metrics_text, transform=ax.transAxes, verticalalignment="top")
        ax.set_xlabel("Predicted")
        ax.set_ylabel("Measured")
    plt.subplots_adjust(hspace=0.3, wspace=0.3)
    save_plot(fig, save_dir, f"scatterplot_{val_col}_vs_{pred_col}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
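
Example: a sketch on synthetic predictions; the pKi column names and method names are placeholders, and the per-method summary metrics are computed internally via calc_regression_metrics.

import numpy as np
import pandas as pd
from uqdd.metrics import make_scatterplot

rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "cv_cycle": rng.integers(0, 5, n),
    "method": rng.choice(["MCD", "Ensemble"], n),  # placeholder model names
    "split": "random",
    "pKi": rng.normal(6.5, 1.0, n),
})
df["pKi_pred"] = df["pKi"] + rng.normal(0.0, 0.5, n)
make_scatterplot(df, val_col="pKi", pred_col="pKi_pred", thresh=7.0)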

uqdd.metrics.ci_plot

ci_plot(result_tab, ax_in, name)

Plot mean differences with confidence intervals for pairwise comparisons.

Parameters:

Name Type Description Default
result_tab DataFrame

Output of rm_tukey_hsd with columns ['meandiff', 'lower', 'upper'].

required
ax_in Axes

Axes to plot into.

required
name str

Title for the plot.

required

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py (lines 676–702)
def ci_plot(result_tab, ax_in, name):
    """
    Plot mean differences with confidence intervals for pairwise comparisons.

    Parameters
    ----------
    result_tab : pd.DataFrame
        Output of rm_tukey_hsd with columns ['meandiff', 'lower', 'upper'].
    ax_in : matplotlib.axes.Axes
        Axes to plot into.
    name : str
        Title for the plot.

    Returns
    -------
    None
    """
    result_err = np.array([result_tab["meandiff"] - result_tab["lower"], result_tab["upper"] - result_tab["meandiff"]])
    sns.set_theme(context="paper")
    sns.set_style("whitegrid")
    ax = sns.pointplot(x=result_tab.meandiff, y=result_tab.index, marker="o", linestyle="", ax=ax_in)
    ax.errorbar(y=result_tab.index, x=result_tab["meandiff"], xerr=result_err, fmt="o", capsize=5)
    ax.axvline(0, ls="--", lw=3)
    ax.set_xlabel("Mean Difference")
    ax.set_ylabel("")
    ax.set_title(name)
    ax.set_xlim(-0.2, 0.2)

uqdd.metrics.make_ci_plot_grid

make_ci_plot_grid(df_in, metric_list, group_col='method', save_dir=None, name_prefix='', model_order=None)

Plot a grid of confidence-interval charts for multiple metrics.

Parameters:

Name Type Description Default
df_in DataFrame

Input DataFrame.

required
metric_list list of str

Metrics to render.

required
group_col str

Group column (e.g., 'method'). Default is 'method'.

'method'
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Filename prefix. Default is empty.

''
model_order list of str or None

Explicit row order for the CI plots.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py (lines 705–743)
def make_ci_plot_grid(df_in, metric_list, group_col="method", save_dir=None, name_prefix="", model_order=None):
    """
    Plot a grid of confidence-interval charts for multiple metrics.

    Parameters
    ----------
    df_in : pd.DataFrame
        Input DataFrame.
    metric_list : list of str
        Metrics to render.
    group_col : str, optional
        Group column (e.g., 'method'). Default is 'method'.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Filename prefix. Default is empty.
    model_order : list of str or None, optional
        Explicit row order for the CI plots.

    Returns
    -------
    None
    """
    df_in = df_in.copy()
    df_in.replace([np.inf, -np.inf], np.nan, inplace=True)
    figure, axes = plt.subplots(len(metric_list), 1, figsize=(8, 2 * len(metric_list)), sharex=False)
    if not isinstance(axes, np.ndarray):
        axes = np.array([axes])
    for i, metric in enumerate(metric_list):
        df_tukey, _, _, _ = rm_tukey_hsd(df_in, metric, group_col=group_col)
        if model_order is not None:
            df_tukey = df_tukey.reindex(index=model_order)
        ci_plot(df_tukey, ax_in=axes[i], name=metric)
    figure.suptitle("Multiple Comparison of Means\nTukey HSD, FWER=0.05")
    plt.subplots_adjust(hspace=0.9, wspace=0.3)
    save_plot(figure, save_dir, f"{name_prefix}_ci_plot_grid_{'_'.join(metric_list)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
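
Example: a sketch reusing the same kind of harmonized results table as in the make_mcs_plot_grid example above; again, rm_tukey_hsd is assumed to use cv_cycle as the repeated-measures factor, and all names and values are synthetic.

import numpy as np
import pandas as pd
from uqdd.metrics import make_ci_plot_grid

rng = np.random.default_rng(1)
rows = [{"cv_cycle": c, "method": m, "split": "random",
         "rmse": rng.normal(0.70 + d, 0.03), "nll": rng.normal(1.0 + d, 0.05)}
        for m, d in [("MCD", 0.05), ("Ensemble", 0.0), ("Evidential", 0.02)]
        for c in range(10)]
df = pd.DataFrame(rows)
make_ci_plot_grid(df, metric_list=["rmse", "nll"], group_col="method", name_prefix="demo")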

uqdd.metrics.recall_at_precision

recall_at_precision(y_true, y_score, precision_threshold=0.5, direction='greater')

Find recall and threshold achieving at least a target precision.

Parameters:

Name Type Description Default
y_true array-like

Binary ground-truth labels.

required
y_score array-like

Continuous scores or probabilities.

required
precision_threshold float

Minimum precision to achieve. Default is 0.5.

0.5
direction {'greater', 'lesser'}

If 'greater', thresholding uses >=; if 'lesser', uses <=. Default is 'greater'.

'greater'

Returns:

Type Description
tuple[float, float or None]

(recall, threshold) if achievable; otherwise (nan, None).

Raises:

Type Description
ValueError

If direction is invalid.

Source code in uqdd/metrics/stats.py (lines 746–785)
def recall_at_precision(y_true, y_score, precision_threshold=0.5, direction="greater"):
    """
    Find recall and threshold achieving at least a target precision.

    Parameters
    ----------
    y_true : array-like
        Binary ground-truth labels.
    y_score : array-like
        Continuous scores or probabilities.
    precision_threshold : float, optional
        Minimum precision to achieve. Default is 0.5.
    direction : {"greater", "lesser"}, optional
        If 'greater', thresholding uses >=; if 'lesser', uses <=. Default is 'greater'.

    Returns
    -------
    tuple[float, float or None]
        (recall, threshold) if achievable; otherwise (nan, None).

    Raises
    ------
    ValueError
        If `direction` is invalid.
    """
    if direction not in ["greater", "lesser"]:
        raise ValueError("Invalid direction. Expected one of: ['greater', 'lesser']")
    y_true = np.array(y_true)
    y_score = np.array(y_score)
    thresholds = np.unique(y_score)
    thresholds = np.sort(thresholds)
    if direction == "lesser":
        thresholds = thresholds[::-1]
    for threshold in thresholds:
        y_pred = y_score >= threshold if direction == "greater" else y_score <= threshold
        precision = precision_score(y_true, y_pred)
        if precision >= precision_threshold:
            recall = recall_score(y_true, y_pred)
            return recall, threshold
    return np.nan, None
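
Example: a small self-contained run on synthetic labels and scores.

import numpy as np
from uqdd.metrics import recall_at_precision

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 500)
y_score = 0.6 * y_true + 0.4 * rng.random(500)  # scores loosely separated by class
recall, thr = recall_at_precision(y_true, y_score, precision_threshold=0.8)
print(f"recall={recall:.2f} at threshold={thr:.2f} for precision >= 0.8")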

uqdd.metrics.calc_classification_metrics

calc_classification_metrics(df_in, cycle_col, val_col, prob_col, pred_col)

Compute classification metrics per cycle/method/split, including ROC-AUC, PR-AUC, MCC, recall, and TNR.

Parameters:

Name Type Description Default
df_in DataFrame

Input DataFrame.

required
cycle_col str

Column name for cross-validation cycles.

required
val_col str

True binary label column.

required
prob_col str

Predicted probability/score column.

required
pred_col str

Predicted binary label column.

required

Returns:

Type Description
DataFrame

Metrics per (cv_cycle, method, split) with columns ['roc_auc', 'pr_auc', 'mcc', 'recall', 'tnr'].

Source code in uqdd/metrics/stats.py (lines 788–820)
def calc_classification_metrics(df_in, cycle_col, val_col, prob_col, pred_col):
    """
    Compute classification metrics per cycle/method/split, including ROC-AUC, PR-AUC, MCC, recall, and TNR.

    Parameters
    ----------
    df_in : pd.DataFrame
        Input DataFrame.
    cycle_col : str
        Column name for cross-validation cycles.
    val_col : str
        True binary label column.
    prob_col : str
        Predicted probability/score column.
    pred_col : str
        Predicted binary label column.

    Returns
    -------
    pd.DataFrame
        Metrics per (cv_cycle, method, split) with columns ['roc_auc', 'pr_auc', 'mcc', 'recall', 'tnr'].
    """
    metric_list = []
    for k, v in df_in.groupby([cycle_col, "method", "split"]):
        cycle, method, split = k
        roc_auc = roc_auc_score(v[val_col], v[prob_col])
        pr_auc = average_precision_score(v[val_col], v[prob_col])
        mcc = matthews_corrcoef(v[val_col], v[pred_col])
        recall, _ = recall_at_precision(v[val_col].astype(bool), v[prob_col], precision_threshold=0.8, direction="greater")
        tnr, _ = recall_at_precision(~v[val_col].astype(bool), v[prob_col], precision_threshold=0.8, direction="lesser")
        metric_list.append([cycle, method, split, roc_auc, pr_auc, mcc, recall, tnr])
    metric_df = pd.DataFrame(metric_list, columns=["cv_cycle", "method", "split", "roc_auc", "pr_auc", "mcc", "recall", "tnr"])
    return metric_df
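
Example: a sketch on a synthetic binary-classification table; the column names (active, active_prob, active_pred) and method names are placeholders.

import numpy as np
import pandas as pd
from uqdd.metrics import calc_classification_metrics

rng = np.random.default_rng(0)
n = 600
df = pd.DataFrame({
    "cv_cycle": rng.integers(0, 3, n),
    "method": rng.choice(["MCD", "Ensemble"], n),  # placeholder model names
    "split": "random",
    "active": rng.integers(0, 2, n),
})
df["active_prob"] = 0.7 * df["active"] + 0.3 * rng.random(n)
df["active_pred"] = (df["active_prob"] > 0.5).astype(int)
metric_df = calc_classification_metrics(df, cycle_col="cv_cycle", val_col="active",
                                         prob_col="active_prob", pred_col="active_pred")
print(metric_df.head())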

uqdd.metrics.make_curve_plots

make_curve_plots(df)

Plot ROC and PR curves for split/method selections with threshold markers.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing 'cv_cycle', 'split', and method columns plus true/probability fields.

required

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py (lines 823–869)
def make_curve_plots(df):
    """
    Plot ROC and PR curves for split/method selections with threshold markers.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame containing 'cv_cycle', 'split', and method columns plus true/probability fields.

    Returns
    -------
    None
    """
    df_plot = df.query("cv_cycle == 0 and split == 'scaffold'").copy()
    color_map = plt.get_cmap("tab10")
    le = LabelEncoder()
    df_plot["color"] = le.fit_transform(df_plot["method"])
    colors = color_map(df_plot["color"].unique())
    val_col = "Sol"
    prob_col = "Sol_prob"
    fig, axes = plt.subplots(1, 2, figsize=(12, 6))
    for (k, v), color in zip(df_plot.groupby("method"), colors):
        roc_auc = roc_auc_score(v[val_col], v[prob_col])
        pr_auc = average_precision_score(v[val_col], v[prob_col])
        fpr, recall_pos, thresholds_roc = roc_curve(v[val_col], v[prob_col])
        precision, recall, thresholds_pr = precision_recall_curve(v[val_col], v[prob_col])
        _, threshold_recall_pos = recall_at_precision(v[val_col].astype(bool), v[prob_col], precision_threshold=0.8, direction="greater")
        _, threshold_recall_neg = recall_at_precision(~v[val_col].astype(bool), v[prob_col], precision_threshold=0.8, direction="lesser")
        fpr_recall_pos = fpr[np.abs(thresholds_roc - threshold_recall_pos).argmin()]
        fpr_recall_neg = fpr[np.abs(thresholds_roc - threshold_recall_neg).argmin()]
        recall_recall_pos = recall[np.abs(thresholds_pr - threshold_recall_pos).argmin()]
        recall_recall_neg = recall[np.abs(thresholds_pr - threshold_recall_neg).argmin()]
        axes[0].plot(fpr, recall_pos, label=f"{k} (ROC AUC={roc_auc:.03f})", color=color, alpha=0.75)
        axes[1].plot(recall, precision, label=f"{k} (PR AUC={pr_auc:.03f})", color=color, alpha=0.75)
        axes[0].axvline(fpr_recall_pos, color=color, linestyle=":", alpha=0.75)
        axes[0].axvline(fpr_recall_neg, color=color, linestyle="--", alpha=0.75)
        axes[1].axvline(recall_recall_pos, color=color, linestyle=":", alpha=0.75)
        axes[1].axvline(recall_recall_neg, color=color, linestyle="--", alpha=0.75)
    axes[0].plot([0, 1], [0, 1], "--", color="black", lw=0.5)
    axes[0].set_xlabel("False Positive Rate")
    axes[0].set_ylabel("True Positive Rate")
    axes[0].set_title("ROC Curve")
    axes[0].legend()
    axes[1].set_xlabel("Recall")
    axes[1].set_ylabel("Precision")
    axes[1].set_title("Precision-Recall Curve")
    axes[1].legend()

uqdd.metrics.harmonize_columns

harmonize_columns(df)

Normalize common column names to ['method', 'split', 'cv_cycle'].

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with possibly varied column naming.

required

Returns:

Type Description
DataFrame

DataFrame with standardized column names; an assertion verifies that the required columns ('method', 'split', 'cv_cycle') exist.

Source code in uqdd/metrics/stats.py (lines 872–894)
def harmonize_columns(df):
    """
    Normalize common column names to ['method', 'split', 'cv_cycle'].

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with possibly varied column naming.

    Returns
    -------
    pd.DataFrame
        DataFrame with standardized column names and assertion that required columns exist.
    """
    df = df.copy()
    rename_map = {
        "Model type": "method",
        "Split": "split",
        "Group_Number": "cv_cycle",
    }
    df.rename(columns={k: v for k, v in rename_map.items() if k in df.columns}, inplace=True)
    assert {"method", "split", "cv_cycle"}.issubset(df.columns)
    return df
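
Example: the rename map shown above applied to a toy frame (the model names and values are placeholders).

import pandas as pd
from uqdd.metrics import harmonize_columns

df = pd.DataFrame({"Model type": ["MCD", "Ensemble"],  # placeholder model names
                   "Split": ["random", "random"],
                   "Group_Number": [0, 1],
                   "RMSE": [0.71, 0.68]})
df = harmonize_columns(df)
print(df.columns.tolist())  # ['method', 'split', 'cv_cycle', 'RMSE']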

uqdd.metrics.cliffs_delta

cliffs_delta(x, y)

Compute Cliff's delta effect size and qualitative interpretation.

Parameters:

Name Type Description Default
x array-like

First sample of numeric values.

required
y array-like

Second sample of numeric values.

required

Returns:

Type Description
tuple[float, str]

(delta, interpretation) where interpretation is one of {'negligible','small','medium','large'}.

Source code in uqdd/metrics/stats.py (lines 897–932)
def cliffs_delta(x, y):
    """
    Compute Cliff's delta effect size and qualitative interpretation.

    Parameters
    ----------
    x : array-like
        First sample of numeric values.
    y : array-like
        Second sample of numeric values.

    Returns
    -------
    tuple[float, str]
        (delta, interpretation) where interpretation is one of {'negligible','small','medium','large'}.
    """
    x, y = np.array(x), np.array(y)
    m, n = len(x), len(y)
    comparisons = 0
    for xi in x:
        for yi in y:
            if xi > yi:
                comparisons += 1
            elif xi < yi:
                comparisons -= 1
    delta = comparisons / (m * n)
    abs_delta = abs(delta)
    if abs_delta < 0.147:
        interpretation = "negligible"
    elif abs_delta < 0.33:
        interpretation = "small"
    elif abs_delta < 0.474:
        interpretation = "medium"
    else:
        interpretation = "large"
    return delta, interpretation
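
Example: two small synthetic samples of per-fold RMSE values.

from uqdd.metrics import cliffs_delta

rmse_a = [0.70, 0.72, 0.69, 0.71, 0.73]
rmse_b = [0.64, 0.66, 0.63, 0.65, 0.67]
delta, label = cliffs_delta(rmse_a, rmse_b)
print(delta, label)  # 1.0 'large': every value in rmse_a exceeds every value in rmse_b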

uqdd.metrics.wilcoxon_pairwise_test

wilcoxon_pairwise_test(df, metric, model_a, model_b, task=None, split=None, seed_col=None)

Perform paired Wilcoxon signed-rank test between two models on a metric.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric str

Metric column to compare.

required
model_a str

First model type name.

required
model_b str

Second model type name.

required
task str or None

Task filter. Default is None.

None
split str or None

Split filter. Default is None.

None
seed_col str or None

Optional seed column identifier (unused here).

None

Returns:

Type Description
dict or None

Test summary including statistic, p-value, Cliff's delta, CI on differences; None if insufficient data.

Source code in uqdd/metrics/stats.py (lines 935–999)
def wilcoxon_pairwise_test(df, metric, model_a, model_b, task=None, split=None, seed_col=None):
    """
    Perform paired Wilcoxon signed-rank test between two models on a metric.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric : str
        Metric column to compare.
    model_a : str
        First model type name.
    model_b : str
        Second model type name.
    task : str or None, optional
        Task filter. Default is None.
    split : str or None, optional
        Split filter. Default is None.
    seed_col : str or None, optional
        Optional seed column identifier (unused here).

    Returns
    -------
    dict or None
        Test summary including statistic, p-value, Cliff's delta, CI on differences; None if insufficient data.
    """
    data = df.copy()
    if task is not None:
        data = data[data["Task"] == task]
    if split is not None:
        data = data[data["Split"] == split]
    values_a = data[data["Model type"] == model_a][metric].values
    values_b = data[data["Model type"] == model_b][metric].values
    if len(values_a) == 0 or len(values_b) == 0:
        return None
    min_len = min(len(values_a), len(values_b))
    values_a = values_a[:min_len]
    values_b = values_b[:min_len]
    statistic, p_value = wilcoxon(values_a, values_b, alternative="two-sided")
    delta, effect_size_interpretation = cliffs_delta(values_a, values_b)
    differences = values_a - values_b
    median_diff = np.median(differences)
    ci_lower, ci_upper = bootstrap_ci(differences, np.median, n_bootstrap=1000)
    if ci_lower <= 0 <= ci_upper:
        practical_significance = "difference is small (CI includes 0)"
    elif abs(median_diff) < 0.1 * np.std(np.concatenate([values_a, values_b])):
        practical_significance = "difference is small"
    else:
        practical_significance = "difference may be meaningful"
    return {
        "model_a": model_a,
        "model_b": model_b,
        "metric": metric,
        "task": task,
        "split": split,
        "n_pairs": min_len,
        "wilcoxon_statistic": statistic,
        "p_value": p_value,
        "cliffs_delta": delta,
        "effect_size_interpretation": effect_size_interpretation,
        "median_difference": median_diff,
        "ci_lower": ci_lower,
        "ci_upper": ci_upper,
        "practical_significance": practical_significance,
    }
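
Example: a sketch on a synthetic results table with the columns this function expects ('Model type', 'Task', 'Split' and a metric column); the model and task names are placeholders, and the per-fold scores are made up.

import numpy as np
import pandas as pd
from uqdd.metrics import wilcoxon_pairwise_test

rng = np.random.default_rng(0)
base = rng.normal(0.70, 0.03, 10)
df = pd.DataFrame({
    "Model type": ["Ensemble"] * 10 + ["MCD"] * 10,  # placeholder model names
    "Task": "ki",
    "Split": "random",
    "RMSE": np.concatenate([base, base + 0.05]),
})
res = wilcoxon_pairwise_test(df, "RMSE", "Ensemble", "MCD", task="ki", split="random")
print(res["p_value"], res["cliffs_delta"], res["practical_significance"])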

uqdd.metrics.holm_bonferroni_correction

holm_bonferroni_correction(p_values)

Apply Holm–Bonferroni correction to an array of p-values.

Parameters:

Name Type Description Default
p_values array-like

Raw p-values.

required

Returns:

Type Description
tuple[ndarray, ndarray]

(corrected_p_values, rejected_mask) where rejected indicates significance after correction.

Source code in uqdd/metrics/stats.py (lines 1002–1029)
def holm_bonferroni_correction(p_values):
    """
    Apply Holm–Bonferroni correction to an array of p-values.

    Parameters
    ----------
    p_values : array-like
        Raw p-values.

    Returns
    -------
    tuple[numpy.ndarray, numpy.ndarray]
        (corrected_p_values, rejected_mask) where rejected indicates significance after correction.
    """
    p_values = np.array(p_values)
    n = len(p_values)
    sorted_indices = np.argsort(p_values)
    sorted_p_values = p_values[sorted_indices]
    corrected_p_values = np.zeros(n)
    rejected = np.zeros(n, dtype=bool)
    for i in range(n):
        correction_factor = n - i
        corrected_p_values[sorted_indices[i]] = min(1.0, sorted_p_values[i] * correction_factor)
        if corrected_p_values[sorted_indices[i]] < 0.05:
            rejected[sorted_indices[i]] = True
        else:
            break
    return corrected_p_values, rejected
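
Example: note that, as written, the step-down loop stops at the first hypothesis it fails to reject, so later entries keep their initialized corrected value of 0 and remain non-significant.

from uqdd.metrics import holm_bonferroni_correction

raw_p = [0.001, 0.02, 0.04, 0.30]
corrected, rejected = holm_bonferroni_correction(raw_p)
print(corrected)  # 0.001*4 = 0.004, then 0.02*3 = 0.06 (not rejected); remaining entries stay 0.0
print(rejected)   # only the first hypothesis survives the correction at 0.05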

uqdd.metrics.pairwise_model_comparison

pairwise_model_comparison(df, metrics, models=None, tasks=None, splits=None, alpha=0.05)

Run pairwise Wilcoxon tests across models/tasks/splits for multiple metrics and adjust p-values.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metrics list of str

Metrics to compare.

required
models list of str or None

Models to include; default derives from data.

None
tasks list of str or None

Tasks to include; default derives from data.

None
splits list of str or None

Splits to include; default derives from data.

None
alpha float

Significance level. Default is 0.05.

0.05

Returns:

Type Description
DataFrame

Results table with corrected p-values and significance flags.

Source code in uqdd/metrics/stats.py (lines 1032–1079)
def pairwise_model_comparison(df, metrics, models=None, tasks=None, splits=None, alpha=0.05):
    """
    Run pairwise Wilcoxon tests across models/tasks/splits for multiple metrics and adjust p-values.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metrics : list of str
        Metrics to compare.
    models : list of str or None, optional
        Models to include; default derives from data.
    tasks : list of str or None, optional
        Tasks to include; default derives from data.
    splits : list of str or None, optional
        Splits to include; default derives from data.
    alpha : float, optional
        Significance level. Default is 0.05.

    Returns
    -------
    pd.DataFrame
        Results table with corrected p-values and significance flags.
    """
    if models is None:
        models = df["Model type"].unique()
    if tasks is None:
        tasks = df["Task"].unique()
    if splits is None:
        splits = df["Split"].unique()
    results = []
    for metric in metrics:
        for task in tasks:
            for split in splits:
                for i, model_a in enumerate(models):
                    for j, model_b in enumerate(models):
                        if i < j:
                            result = wilcoxon_pairwise_test(df, metric, model_a, model_b, task, split)
                            if result is not None:
                                results.append(result)
    if not results:
        return pd.DataFrame()
    results_df = pd.DataFrame(results)
    p_values = results_df["p_value"].values
    corrected_p_values, rejected = holm_bonferroni_correction(p_values)
    results_df["corrected_p_value"] = corrected_p_values
    results_df["significant_after_correction"] = rejected
    return results_df
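
Example: a synthetic end-to-end run; the model names, the task label, and all metric values are placeholders, while 'RMSE' and 'NLL' are metric columns as used elsewhere in this package.

import numpy as np
import pandas as pd
from uqdd.metrics import pairwise_model_comparison

rng = np.random.default_rng(0)
rows = [{"Model type": m, "Task": "ki", "Split": s,
         "RMSE": rng.normal(0.70 + d, 0.03), "NLL": rng.normal(1.0 + 2 * d, 0.05)}
        for m, d in [("MCD", 0.05), ("Ensemble", 0.0), ("Evidential", 0.02)]
        for s in ["random", "scaffold"] for _ in range(8)]
df = pd.DataFrame(rows)
results = pairwise_model_comparison(df, metrics=["RMSE", "NLL"])
print(results[["model_a", "model_b", "metric", "split",
               "corrected_p_value", "significant_after_correction"]].head())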

uqdd.metrics.friedman_nemenyi_test

friedman_nemenyi_test(df, metrics, models=None, alpha=0.05)

Run Friedman test across models with Nemenyi post-hoc where significant, per metric.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metrics list of str

Metrics to test.

required
models list of str or None

Models to include; default derives from data.

None
alpha float

Significance level. Default is 0.05.

0.05

Returns:

Type Description
dict

Mapping metric -> result dict containing stats, p-values, mean ranks, and optional post-hoc outputs.

Source code in uqdd/metrics/stats.py (lines 1082–1136)
def friedman_nemenyi_test(df, metrics, models=None, alpha=0.05):
    """
    Run Friedman test across models with Nemenyi post-hoc where significant, per metric.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metrics : list of str
        Metrics to test.
    models : list of str or None, optional
        Models to include; default derives from data.
    alpha : float, optional
        Significance level. Default is 0.05.

    Returns
    -------
    dict
        Mapping metric -> result dict containing stats, p-values, mean ranks, and optional post-hoc outputs.
    """
    if models is None:
        models = df["Model type"].unique()
    results = {}
    for metric in metrics:
        pivot_data = df.pivot_table(values=metric, index=["Task", "Split"], columns="Model type", aggfunc="mean")
        available_models = [m for m in models if m in pivot_data.columns]
        pivot_data = pivot_data[available_models]
        pivot_data = pivot_data.dropna()
        if pivot_data.shape[0] < 2 or pivot_data.shape[1] < 3:
            results[metric] = {"error": "Insufficient data for Friedman test", "data_shape": pivot_data.shape}
            continue
        try:
            friedman_stat, friedman_p = friedmanchisquare(*[pivot_data[col].values for col in pivot_data.columns])
            ranks = pivot_data.rank(axis=1, ascending=False)
            mean_ranks = ranks.mean()
            result = {
                "friedman_statistic": friedman_stat,
                "friedman_p_value": friedman_p,
                "mean_ranks": mean_ranks.to_dict(),
                "significant": friedman_p < alpha,
            }
            if friedman_p < alpha:
                try:
                    data_array = pivot_data.values
                    nemenyi_result = sp.posthoc_nemenyi_friedman(data_array.T)
                    nemenyi_result.index = available_models
                    nemenyi_result.columns = available_models
                    result["nemenyi_p_values"] = nemenyi_result.to_dict()
                    result["critical_difference"] = calculate_critical_difference(len(available_models), pivot_data.shape[0], alpha)
                except Exception as e:
                    result["nemenyi_error"] = str(e)
            results[metric] = result
        except Exception as e:
            results[metric] = {"error": str(e)}
    return results
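
Example: a sketch on a synthetic table with at least three models and several (Task, Split) blocks, which is what the internal pivot requires; all names and values are made up.

import numpy as np
import pandas as pd
from uqdd.metrics import friedman_nemenyi_test

rng = np.random.default_rng(0)
rows = [{"Model type": m, "Task": t, "Split": s, "RMSE": rng.normal(0.70 + d, 0.02)}
        for m, d in [("MCD", 0.05), ("Ensemble", 0.0), ("Evidential", 0.02), ("GP", 0.03)]
        for t in ["ki", "kd", "ic50"] for s in ["random", "scaffold"]]
df = pd.DataFrame(rows)
res = friedman_nemenyi_test(df, metrics=["RMSE"])
print(res["RMSE"]["friedman_p_value"], res["RMSE"]["mean_ranks"])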

uqdd.metrics.calculate_critical_difference

calculate_critical_difference(k, n, alpha=0.05)

Compute the critical difference for average ranks in Nemenyi post-hoc tests.

Parameters:

Name Type Description Default
k int

Number of models.

required
n int

Number of datasets/blocks.

required
alpha float

Significance level. Default is 0.05.

0.05

Returns:

Type Description
float

Critical difference value.

Source code in uqdd/metrics/stats.py (lines 1139–1160)
def calculate_critical_difference(k, n, alpha=0.05):
    """
    Compute the critical difference for average ranks in Nemenyi post-hoc tests.

    Parameters
    ----------
    k : int
        Number of models.
    n : int
        Number of datasets/blocks.
    alpha : float, optional
        Significance level. Default is 0.05.

    Returns
    -------
    float
        Critical difference value.
    """
    from scipy.stats import studentized_range
    q_alpha = studentized_range.ppf(1 - alpha, k, np.inf) / np.sqrt(2)
    cd = q_alpha * np.sqrt(k * (k + 1) / (6 * n))
    return cd
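
Example: the critical difference for four models compared over six dataset/split blocks at alpha = 0.05.

from uqdd.metrics import calculate_critical_difference

cd = calculate_critical_difference(k=4, n=6, alpha=0.05)
print(cd)  # rank differences larger than this are significant in the Nemenyi sense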

uqdd.metrics.bootstrap_auc_difference

bootstrap_auc_difference(auc_values_a, auc_values_b, n_bootstrap=1000, ci=95, random_state=42)

Bootstrap confidence interval for difference of mean AUCs between two models.

Parameters:

Name Type Description Default
auc_values_a array-like

AUC values for model A.

required
auc_values_b array-like

AUC values for model B.

required
n_bootstrap int

Number of bootstrap resamples. Default is 1000.

1000
ci int or float

Confidence level in percent. Default is 95.

95
random_state int

Seed for reproducibility. Default is 42.

42

Returns:

Type Description
dict

{'mean_difference', 'ci_lower', 'ci_upper', 'bootstrap_differences'}

Source code in uqdd/metrics/stats.py (lines 1163–1197)
def bootstrap_auc_difference(auc_values_a, auc_values_b, n_bootstrap=1000, ci=95, random_state=42):
    """
    Bootstrap confidence interval for difference of mean AUCs between two models.

    Parameters
    ----------
    auc_values_a : array-like
        AUC values for model A.
    auc_values_b : array-like
        AUC values for model B.
    n_bootstrap : int, optional
        Number of bootstrap resamples. Default is 1000.
    ci : int or float, optional
        Confidence level in percent. Default is 95.
    random_state : int, optional
        Seed for reproducibility. Default is 42.

    Returns
    -------
    dict
        {'mean_difference', 'ci_lower', 'ci_upper', 'bootstrap_differences'}
    """
    np.random.seed(random_state)
    differences = []
    for _ in range(n_bootstrap):
        sample_a = resample(auc_values_a, random_state=np.random.randint(0, 10000))
        sample_b = resample(auc_values_b, random_state=np.random.randint(0, 10000))
        diff = np.mean(sample_a) - np.mean(sample_b)
        differences.append(diff)
    differences = np.array(differences)
    alpha = (100 - ci) / 2
    ci_lower = np.percentile(differences, alpha)
    ci_upper = np.percentile(differences, 100 - alpha)
    original_diff = np.mean(auc_values_a) - np.mean(auc_values_b)
    return {"mean_difference": original_diff, "ci_lower": ci_lower, "ci_upper": ci_upper, "bootstrap_differences": differences}
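
Example: synthetic per-fold AUC values for two models.

import numpy as np
from uqdd.metrics import bootstrap_auc_difference

rng = np.random.default_rng(0)
auc_a = rng.normal(0.85, 0.02, 10)  # synthetic per-fold ROC-AUCs
auc_b = rng.normal(0.82, 0.02, 10)
res = bootstrap_auc_difference(auc_a, auc_b, n_bootstrap=2000)
print(res["mean_difference"], res["ci_lower"], res["ci_upper"])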

uqdd.metrics.plot_critical_difference_diagram

plot_critical_difference_diagram(friedman_results, metric, save_dir=None, alpha=0.05)

Plot a simple critical difference diagram using mean ranks and CD value.

Parameters:

Name Type Description Default
friedman_results dict

Output dictionary from friedman_nemenyi_test.

required
metric str

Metric to plot.

required
save_dir str or None

Directory to save the plot. Default is None.

None
alpha float

Significance level used to compute CD. Default is 0.05.

0.05

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py (lines 1200–1271)
def plot_critical_difference_diagram(friedman_results, metric, save_dir=None, alpha=0.05):
    """
    Plot a simple critical difference diagram using mean ranks and CD value.

    Parameters
    ----------
    friedman_results : dict
        Output dictionary from friedman_nemenyi_test.
    metric : str
        Metric to plot.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    alpha : float, optional
        Significance level used to compute CD. Default is 0.05.

    Returns
    -------
    None
    """
    if metric not in friedman_results:
        print(f"Metric {metric} not found in Friedman results")
        return
    result = friedman_results[metric]
    if "error" in result:
        print(f"Error in Friedman test for {metric}: {result['error']}")
        return
    if not result["significant"]:
        print(f"Friedman test not significant for {metric}, skipping CD diagram")
        return
    mean_ranks = result["mean_ranks"]
    models = list(mean_ranks.keys())
    ranks = [mean_ranks[model] for model in models]
    sorted_indices = np.argsort(ranks)
    sorted_models = [models[i] for i in sorted_indices]
    sorted_ranks = [ranks[i] for i in sorted_indices]
    fig, ax = plt.subplots(figsize=(12, 6))
    y_pos = 0
    ax.scatter(sorted_ranks, [y_pos] * len(sorted_ranks), s=100, c="blue")
    for i, (model, rank) in enumerate(zip(sorted_models, sorted_ranks)):
        ax.annotate(model, (rank, y_pos), xytext=(0, 20), textcoords="offset points", ha="center", rotation=45)
    if "critical_difference" in result:
        cd = result["critical_difference"]
        groups = []
        for i, model_a in enumerate(sorted_models):
            group = [model_a]
            rank_a = sorted_ranks[i]
            for j, model_b in enumerate(sorted_models):
                if i != j:
                    rank_b = sorted_ranks[j]
                    if abs(rank_a - rank_b) <= cd:
                        if model_b not in [m for g in groups for m in g]:
                            group.append(model_b)
            if len(group) > 1:
                groups.append(group)
        colors = plt.cm.Set3(np.linspace(0, 1, len(groups)))
        for group, color in zip(groups, colors):
            if len(group) > 1:
                group_ranks = [sorted_ranks[sorted_models.index(m)] for m in group]
                min_rank, max_rank = min(group_ranks), max(group_ranks)
                ax.plot([min_rank, max_rank], [y_pos - 0.05, y_pos - 0.05], color=color, linewidth=3, alpha=0.7)
    ax.set_xlim(min(sorted_ranks) - 0.5, max(sorted_ranks) + 0.5)
    ax.set_ylim(-0.3, 0.5)
    ax.set_xlabel("Average Rank")
    ax.set_title(f"Critical Difference Diagram - {metric}")
    ax.grid(True, alpha=0.3)
    ax.set_yticks([])
    if save_dir:
        plot_name = f"critical_difference_{metric.replace(' ', '_')}"
        save_plot(fig, save_dir, plot_name)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
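
Example: a sketch that first builds Friedman results on synthetic data (as in the friedman_nemenyi_test example above) and then draws the diagram; if the Friedman test is not significant, the function prints a message and skips the plot. All names and values are placeholders.

import numpy as np
import pandas as pd
from uqdd.metrics import friedman_nemenyi_test, plot_critical_difference_diagram

rng = np.random.default_rng(0)
rows = [{"Model type": m, "Task": t, "Split": s, "RMSE": rng.normal(0.70 + d, 0.02)}
        for m, d in [("MCD", 0.05), ("Ensemble", 0.0), ("Evidential", 0.02), ("GP", 0.03)]
        for t in ["ki", "kd", "ic50"] for s in ["random", "scaffold"]]
friedman_results = friedman_nemenyi_test(pd.DataFrame(rows), metrics=["RMSE"])
plot_critical_difference_diagram(friedman_results, metric="RMSE")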

uqdd.metrics.analyze_significance

analyze_significance(df_raw, metrics, direction_dict, effect_dict, save_dir=None, model_order=None, activity=None)

End-to-end significance analysis and plotting across splits for multiple metrics.

Parameters:

Name Type Description Default
df_raw DataFrame

Raw results DataFrame.

required
metrics list of str

Metric names to analyze.

required
direction_dict dict

Mapping metric -> 'maximize'|'minimize'.

required
effect_dict dict

Mapping metric -> effect size threshold for visualization.

required
save_dir str or None

Directory to save plots and outputs. Default is None.

None
model_order list of str or None

Explicit ordering of models. Default derives from data.

None
activity str or None

Activity name for prefixes. Default is None.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py (lines 1274–1328)
def analyze_significance(df_raw, metrics, direction_dict, effect_dict, save_dir=None, model_order=None, activity=None):
    """
    End-to-end significance analysis and plotting across splits for multiple metrics.

    Parameters
    ----------
    df_raw : pd.DataFrame
        Raw results DataFrame.
    metrics : list of str
        Metric names to analyze.
    direction_dict : dict
        Mapping metric -> 'maximize'|'minimize'.
    effect_dict : dict
        Mapping metric -> effect size threshold for visualization.
    save_dir : str or None, optional
        Directory to save plots and outputs. Default is None.
    model_order : list of str or None, optional
        Explicit ordering of models. Default derives from data.
    activity : str or None, optional
        Activity name for prefixes. Default is None.

    Returns
    -------
    None
    """
    df = harmonize_columns(df_raw)
    for metric in metrics:
        df[metric] = pd.to_numeric(df[metric], errors="coerce")
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    for split in df["split"].unique():
        df_s = df[df["split"] == split].copy()
        print(f"\n=== Split: {split} ===")
        name_prefix = f"06_{activity}_{split}" if activity else f"{split}"
        make_normality_diagnostic(df_s, metrics, save_dir=save_dir, name_prefix=name_prefix)
        for metric in metrics:
            print(f"\n-- Metric: {metric}")
            wide = df_s.pivot(index="cv_cycle", columns="method", values=metric)
            resid = (wide.T - wide.mean(axis=1)).T
            vals = resid.values.flatten()
            vals = vals[~np.isnan(vals)]
            W, p_norm = shapiro(vals) if len(vals) >= 3 else (None, 0.0)
            if p_norm is None:
                print("Not enough data for Shapiro-Wilk test (need at least 3 non-NaN values), assuming non-normality")
            elif p_norm < 0.05:
                print(f"Shapiro-Wilk test for {metric} indicates non-normality (W={W:.3f}, p={p_norm:.3f})")
            else:
                print(f"Shapiro-Wilk test for {metric} indicates normality (W={W:.3f}, p={p_norm:.3f})")
        make_boxplots(df_s, metrics, save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)
        make_boxplots_parametric(df_s, metrics, save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)
        make_boxplots_nonparametric(df_s, metrics, save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)
        make_sign_plots_nonparametric(df_s, metrics, save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)
        make_critical_difference_diagrams(df_s, metrics, save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)
        make_mcs_plot_grid(df=df_s, stats_list=metrics, group_col="method", alpha=0.05, figsize=(30, 15), direction_dict=direction_dict, effect_dict=effect_dict, show_diff=True, sort_axes=True, save_dir=save_dir, name_prefix=name_prefix + "_diff", model_order=model_order)
        make_mcs_plot_grid(df=df_s, stats_list=metrics, group_col="method", alpha=0.05, figsize=(30, 15), direction_dict=direction_dict, effect_dict=effect_dict, show_diff=False, sort_axes=True, save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)
        make_ci_plot_grid(df_s, metrics, group_col="method", save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)

uqdd.metrics.comprehensive_statistical_analysis

comprehensive_statistical_analysis(df, metrics, models=None, tasks=None, splits=None, save_dir=None, alpha=0.05)

Run a comprehensive suite of statistical tests and export results.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metrics list of str

Metrics to analyze.

required
models list of str or None

Models to include. Default derives from data.

None
tasks list of str or None

Tasks to include. Default derives from data.

None
splits list of str or None

Splits to include. Default derives from data.

None
save_dir str or None

Directory to save tables and JSON outputs. Default is None.

None
alpha float

Significance level. Default is 0.05.

0.05

Returns:

Type Description
dict

Results dict including pairwise tests, Friedman/Nemenyi outputs, and optional AUC bootstrap comparisons.

Source code in uqdd/metrics/stats.py (lines 1331–1406)
def comprehensive_statistical_analysis(df, metrics, models=None, tasks=None, splits=None, save_dir=None, alpha=0.05):
    """
    Run a comprehensive suite of statistical tests and export results.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metrics : list of str
        Metrics to analyze.
    models : list of str or None, optional
        Models to include. Default derives from data.
    tasks : list of str or None, optional
        Tasks to include. Default derives from data.
    splits : list of str or None, optional
        Splits to include. Default derives from data.
    save_dir : str or None, optional
        Directory to save tables and JSON outputs. Default is None.
    alpha : float, optional
        Significance level. Default is 0.05.

    Returns
    -------
    dict
        Results dict including pairwise tests, Friedman/Nemenyi outputs, and optional AUC bootstrap comparisons.
    """
    print("Performing comprehensive statistical analysis...")
    results = {}
    print("1. Running pairwise Wilcoxon signed-rank tests...")
    pairwise_results = pairwise_model_comparison(df, metrics, models, tasks, splits, alpha)
    results["pairwise_tests"] = pairwise_results
    print("2. Running Friedman tests with Nemenyi post-hoc...")
    friedman_results = friedman_nemenyi_test(df, metrics, models, alpha)
    results["friedman_nemenyi"] = friedman_results
    auc_columns = [col for col in df.columns if "AUC" in col or "auc" in col]
    if auc_columns:
        print("3. Running bootstrap comparisons for AUC metrics...")
        auc_bootstrap_results = {}
        for auc_col in auc_columns:
            auc_bootstrap_results[auc_col] = {}
            available_models = df["Model type"].unique() if models is None else models
            for i, model_a in enumerate(available_models):
                for j, model_b in enumerate(available_models):
                    if i < j:
                        auc_a = df[df["Model type"] == model_a][auc_col].dropna().values
                        auc_b = df[df["Model type"] == model_b][auc_col].dropna().values
                        if len(auc_a) > 0 and len(auc_b) > 0:
                            bootstrap_result = bootstrap_auc_difference(auc_a, auc_b)
                            auc_bootstrap_results[auc_col][f"{model_a}_vs_{model_b}"] = bootstrap_result
        results["auc_bootstrap"] = auc_bootstrap_results
    if save_dir:
        os.makedirs(save_dir, exist_ok=True)
        if not pairwise_results.empty:
            pairwise_results.to_csv(os.path.join(save_dir, "pairwise_statistical_tests.csv"), index=False)
        import json
        with open(os.path.join(save_dir, "friedman_nemenyi_results.json"), "w") as f:
            json_compatible_results = {}
            for metric, result in friedman_results.items():
                json_compatible_results[metric] = {}
                for key, value in result.items():
                    if isinstance(value, (np.ndarray, np.generic)):
                        json_compatible_results[metric][key] = value.tolist()
                    elif isinstance(value, dict):
                        json_compatible_results[metric][key] = {str(k): (float(v) if isinstance(v, (np.ndarray, np.generic)) else v) for k, v in value.items()}
                    else:
                        json_compatible_results[metric][key] = (float(value) if isinstance(value, (np.ndarray, np.generic)) else value)
            json.dump(json_compatible_results, f, indent=2)
        if auc_columns:
            with open(os.path.join(save_dir, "auc_bootstrap_results.json"), "w") as f:
                json_compatible_auc = {}
                for auc_col, comparisons in results["auc_bootstrap"].items():
                    json_compatible_auc[auc_col] = {}
                    for comparison, result in comparisons.items():
                        json_compatible_auc[auc_col][comparison] = {k: v.tolist() if isinstance(v, np.ndarray) else v for k, v in result.items()}
                json.dump(json_compatible_auc, f, indent=2)
    return results
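
Example: a synthetic end-to-end run, followed by generate_statistical_report, which (per its documented return value) turns the results dict into report text. The model, task, and split names are placeholders; pass save_dir to also write the CSV/JSON outputs.

import numpy as np
import pandas as pd
from uqdd.metrics import comprehensive_statistical_analysis, generate_statistical_report

rng = np.random.default_rng(0)
rows = [{"Model type": m, "Task": t, "Split": s,
         "RMSE": rng.normal(0.70 + d, 0.02), "NLL": rng.normal(1.0 + 2 * d, 0.05)}
        for m, d in [("MCD", 0.05), ("Ensemble", 0.0), ("Evidential", 0.02)]
        for t in ["ki", "kd", "ic50"] for s in ["random", "scaffold"] for _ in range(8)]
df = pd.DataFrame(rows)
results = comprehensive_statistical_analysis(df, metrics=["RMSE", "NLL"])
report = generate_statistical_report(results)
print(report[:500])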

uqdd.metrics.generate_statistical_report

generate_statistical_report(results, save_dir=None, df_raw=None, metrics=None, direction_dict=None, effect_dict=None)

Generate a human-readable text report from comprehensive statistical results and optionally run plots.

Parameters:

Name Type Description Default
results dict

Output of comprehensive_statistical_analysis.

required
save_dir str or None

Directory to save the report text file. Default is None.

None
df_raw DataFrame or None

Raw DataFrame to run plotting-based significance analysis. Default is None.

None
metrics list of str or None

Metrics to plot (when df_raw provided).

None
direction_dict dict or None

Direction mapping for metrics (required when df_raw provided).

None
effect_dict dict or None

Effect threshold mapping (required when df_raw provided).

None

Returns:

Type Description
str

Report text.

Source code in uqdd/metrics/stats.py (lines 1409–1505)
def generate_statistical_report(results, save_dir=None, df_raw=None, metrics=None, direction_dict=None, effect_dict=None):
    """
    Generate a human-readable text report from comprehensive statistical results and optionally run plots.

    Parameters
    ----------
    results : dict
        Output of comprehensive_statistical_analysis.
    save_dir : str or None, optional
        Directory to save the report text file. Default is None.
    df_raw : pd.DataFrame or None, optional
        Raw DataFrame to run plotting-based significance analysis. Default is None.
    metrics : list of str or None, optional
        Metrics to plot (when df_raw provided).
    direction_dict : dict or None, optional
        Direction mapping for metrics (required when df_raw provided).
    effect_dict : dict or None, optional
        Effect threshold mapping (required when df_raw provided).

    Returns
    -------
    str
        Report text.
    """
    report = []
    report.append("=" * 80)
    report.append("COMPREHENSIVE STATISTICAL ANALYSIS REPORT")
    report.append("=" * 80)
    report.append("")
    if "pairwise_tests" in results and not results["pairwise_tests"].empty:
        pairwise_df = results["pairwise_tests"]
        report.append("1. PAIRWISE MODEL COMPARISONS (Wilcoxon Signed-Rank Test)")
        report.append("-" * 60)
        significant = pairwise_df[pairwise_df["significant_after_correction"] == True]
        report.append(f"Total pairwise comparisons performed: {len(pairwise_df)}")
        report.append(f"Significant differences (after Holm-Bonferroni correction): {len(significant)}")
        report.append("")
        if len(significant) > 0:
            report.append("Significant differences found:")
            for _, row in significant.iterrows():
                effect_size = row["effect_size_interpretation"]
                report.append(f"  • {row['model_a']} vs {row['model_b']} ({row['metric']}, {row['split']}):")
                report.append(f"    - p-value: {row['p_value']:.4f} (corrected: {row['corrected_p_value']:.4f})")
                report.append(f"    - Cliff's Δ: {row['cliffs_delta']:.3f} ({effect_size} effect)")
                report.append(f"    - Median difference: {row['median_difference']:.4f} [{row['ci_lower']:.4f}, {row['ci_upper']:.4f}]")
                report.append(f"    - {row['practical_significance']}")
                report.append("")
        else:
            report.append("No significant differences found after multiple comparison correction.")
            report.append("")
    if "friedman_nemenyi" in results:
        friedman_results = results["friedman_nemenyi"]
        report.append("2. MULTIPLE MODEL COMPARISONS (Friedman + Nemenyi Tests)")
        report.append("-" * 60)
        for metric, result in friedman_results.items():
            if "error" in result:
                report.append(f"{metric}: {result['error']}")
                continue
            report.append(f"Metric: {metric}")
            report.append(f"  Friedman test p-value: {result['friedman_p_value']:.4f}")
            if result["significant"]:
                report.append("  Result: Significant difference between models detected")
                mean_ranks = result["mean_ranks"]
                sorted_ranks = sorted(mean_ranks.items(), key=lambda x: x[1])
                report.append("  Model rankings (lower rank = better performance):")
                for i, (model, rank) in enumerate(sorted_ranks, 1):
                    report.append(f"    {i}. {model}: {rank:.2f}")
                if "critical_difference" in result:
                    report.append(f"  Critical difference: {result['critical_difference']:.3f}")
            else:
                report.append("  Result: No significant difference between models")
            report.append("")
    if "auc_bootstrap" in results:
        auc_results = results["auc_bootstrap"]
        report.append("3. AUC BOOTSTRAP COMPARISONS")
        report.append("-" * 60)
        for auc_col, comparisons in auc_results.items():
            report.append(f"AUC Metric: {auc_col}")
            for comparison, result in comparisons.items():
                model_a, model_b = comparison.split("_vs_")
                mean_diff = result["mean_difference"]
                ci_lower = result["ci_lower"]
                ci_upper = result["ci_upper"]
                significance = "difference is small (CI includes 0)" if (ci_lower <= 0 <= ci_upper) else "difference may be meaningful"
                report.append(f"  {model_a} vs {model_b}:")
                report.append(f"    Mean difference: {mean_diff:.4f} [{ci_lower:.4f}, {ci_upper:.4f}]")
                report.append(f"    {significance}")
            report.append("")
    report_text = "\n".join(report)
    if save_dir:
        os.makedirs(save_dir, exist_ok=True)
        with open(os.path.join(save_dir, "statistical_analysis_report.txt"), "w") as f:
            f.write(report_text)
    print(report_text)
    if df_raw is not None and metrics is not None and direction_dict is not None and effect_dict is not None:
        analyze_significance(df_raw, metrics, direction_dict, effect_dict, save_dir=save_dir)
    return report_text
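
Example

A minimal, runnable sketch. In practice results is the dict returned by comprehensive_statistical_analysis; here a stripped-down stand-in with a single Friedman entry is used, and the output directory name is illustrative.

from uqdd.metrics import generate_statistical_report

# Stand-in for the output of comprehensive_statistical_analysis; real runs pass
# that dict directly (keys such as "pairwise_tests", "friedman_nemenyi",
# "auc_bootstrap").
results = {
    "friedman_nemenyi": {
        "RMSE": {"error": "not enough models for the Friedman test"},
    },
}

# Prints the report and writes statistical_analysis_report.txt under save_dir.
report_text = generate_statistical_report(results, save_dir="stats_report")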

uqdd.metrics.nll_evidentials

nll_evidentials(evidential_model, test_dataloader, model_type: str = 'evidential', num_mc_samples: int = 100, device=DEVICE)

Compute negative log-likelihood (NLL) for evidential-style models.

Parameters:

Name Type Description Default
evidential_model Module

Trained model instance.

required
test_dataloader DataLoader

DataLoader providing test set batches.

required
model_type {"evidential", "eoe", "emc"}

Model family determining the NLL backend. Default is "evidential".

"evidential"
num_mc_samples int

Number of MC samples for EMC models. Default is 100.

100
device device

Device to run evaluation on. Default uses DEVICE.

DEVICE

Returns:

Type Description
float or None

Scalar NLL if supported by the model type; None otherwise.

Source code in uqdd/metrics/reassessment.py
def nll_evidentials(
    evidential_model,
    test_dataloader,
    model_type: str = "evidential",
    num_mc_samples: int = 100,
    device=DEVICE,
):
    """
    Compute negative log-likelihood (NLL) for evidential-style models.

    Parameters
    ----------
    evidential_model : torch.nn.Module
        Trained model instance.
    test_dataloader : torch.utils.data.DataLoader
        DataLoader providing test set batches.
    model_type : {"evidential", "eoe", "emc"}, optional
        Model family determining the NLL backend. Default is "evidential".
    num_mc_samples : int, optional
        Number of MC samples for EMC models. Default is 100.
    device : torch.device, optional
        Device to run evaluation on. Default uses `DEVICE`.

    Returns
    -------
    float or None
        Scalar NLL if supported by the model type; None otherwise.
    """
    if model_type in ["evidential", "eoe"]:
        return ev_nll(evidential_model, test_dataloader, device=device)
    elif model_type == "emc":
        return emc_nll(evidential_model, test_dataloader, num_mc_samples=num_mc_samples, device=device)
    else:
        return None
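
Example

A usage sketch wrapped in a helper so the snippet stays self-contained; the model and dataloader arguments are placeholders produced by earlier training code.

from uqdd.metrics import nll_evidentials

def test_nll(model, test_loader, model_type="evidential"):
    """Scalar test-set NLL, or None for model families without an NLL backend."""
    # "evidential" and "eoe" dispatch to ev_nll; "emc" dispatches to emc_nll
    # using num_mc_samples Monte Carlo samples.
    return nll_evidentials(model, test_loader, model_type=model_type, num_mc_samples=100)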

uqdd.metrics.convert_to_list

convert_to_list(val)

Parse a string representation of a Python list to a list; pass through non-strings.

Parameters:

Name Type Description Default
val str or any

Input value, possibly a string encoding of a list.

required

Returns:

Type Description
list

Parsed list if val is a valid string list, empty list on parse failure.

any

Original value if not a string.

Notes
  • Uses ast.literal_eval for safe evaluation.
  • Prints a warning and returns [] when parsing fails.
Source code in uqdd/metrics/reassessment.py
def convert_to_list(val):
    """
    Parse a string representation of a Python list to a list; pass through non-strings.

    Parameters
    ----------
    val : str or any
        Input value, possibly a string encoding of a list.

    Returns
    -------
    list
        Parsed list if `val` is a valid string list, empty list on parse failure.
    any
        Original value if not a string.

    Notes
    -----
    - Uses `ast.literal_eval` for safe evaluation.
    - Prints a warning and returns [] when parsing fails.
    """
    if isinstance(val, str):
        try:
            parsed_val = ast.literal_eval(val)
            if isinstance(parsed_val, list):
                return parsed_val
            else:
                return []
        except (SyntaxError, ValueError):
            print(f"Warning: Unable to parse value {val}, returning empty list.")
            return []
    return val
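
Example

The three branches in action; the layer strings mirror how list-valued hyperparameters are stored in a runs CSV.

from uqdd.metrics import convert_to_list

convert_to_list("[256, 128, 64]")   # valid string list -> [256, 128, 64]
convert_to_list("not a list")       # parse failure -> warning printed, returns []
convert_to_list([32, 16])           # non-string input is passed through unchanged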

uqdd.metrics.preprocess_runs

preprocess_runs(runs_path: str, models_dir: str = MODELS_DIR, data_name: str = 'papyrus', activity_type: str = 'xc50', descriptor_protein: str = 'ankh-large', descriptor_chemical: str = 'ecfp2048', data_specific_path: str = 'papyrus/xc50/all', prot_input_dim: int = 1536, chem_input_dim: int = 2048) -> pd.DataFrame

Read a runs CSV and enrich with resolved model paths and descriptor metadata.

Parameters:

Name Type Description Default
runs_path str

Path to the CSV file containing run metadata.

required
models_dir str

Directory containing trained model .pt files. Default uses MODELS_DIR.

MODELS_DIR
data_name str

Dataset identifier. Default is "papyrus".

'papyrus'
activity_type str

Activity type (e.g., "xc50", "kc"). Default is "xc50".

'xc50'
descriptor_protein str

Protein descriptor type. Default is "ankh-large".

'ankh-large'
descriptor_chemical str

Chemical descriptor type. Default is "ecfp2048".

'ecfp2048'
data_specific_path str

Subpath encoding dataset context for figures/exports. Default is "papyrus/xc50/all".

'papyrus/xc50/all'
prot_input_dim int

Protein input dimensionality. Default is 1536.

1536
chem_input_dim int

Chemical input dimensionality. Default is 2048.

2048

Returns:

Type Description
DataFrame

Preprocessed runs DataFrame with columns like 'model_name', 'model_path', and descriptor fields.

Notes
  • Resolves model_name to actual .pt files via glob and sets 'model_path'.
  • Adds multi-task flag 'MT' from 'n_targets' > 1.
  • Converts layer columns from strings to lists using convert_to_list.
Source code in uqdd/metrics/reassessment.py
def preprocess_runs(
    runs_path: str,
    models_dir: str = MODELS_DIR,
    data_name: str = "papyrus",
    activity_type: str = "xc50",
    descriptor_protein: str = "ankh-large",
    descriptor_chemical: str = "ecfp2048",
    data_specific_path: str = "papyrus/xc50/all",
    prot_input_dim: int = 1536,
    chem_input_dim: int = 2048,
) -> pd.DataFrame:
    """
    Read a runs CSV and enrich with resolved model paths and descriptor metadata.

    Parameters
    ----------
    runs_path : str
        Path to the CSV file containing run metadata.
    models_dir : str, optional
        Directory containing trained model .pt files. Default uses `MODELS_DIR`.
    data_name : str, optional
        Dataset identifier. Default is "papyrus".
    activity_type : str, optional
        Activity type (e.g., "xc50", "kc"). Default is "xc50".
    descriptor_protein : str, optional
        Protein descriptor type. Default is "ankh-large".
    descriptor_chemical : str, optional
        Chemical descriptor type. Default is "ecfp2048".
    data_specific_path : str, optional
        Subpath encoding dataset context for figures/exports. Default is "papyrus/xc50/all".
    prot_input_dim : int, optional
        Protein input dimensionality. Default is 1536.
    chem_input_dim : int, optional
        Chemical input dimensionality. Default is 2048.

    Returns
    -------
    pd.DataFrame
        Preprocessed runs DataFrame with columns like 'model_name', 'model_path', and descriptor fields.

    Notes
    -----
    - Resolves `model_name` to actual .pt files via glob and sets 'model_path'.
    - Adds multi-task flag 'MT' from 'n_targets' > 1.
    - Converts layer columns from strings to lists using `convert_to_list`.
    """
    runs_df = pd.read_csv(
        runs_path,
        converters={
            "chem_layers": convert_to_list,
            "prot_layers": convert_to_list,
            "regressor_layers": convert_to_list,
        },
    )
    runs_df.rename(columns={"Name": "run_name"}, inplace=True)
    i = 1
    for index, row in runs_df.iterrows():
        model_name = row["model_name"] if not pd.isna(row["model_name"]) else row["run_name"]
        model_file_pattern = os.path.join(models_dir, f"*{model_name}.pt")
        model_files = glob.glob(model_file_pattern)
        if model_files:
            model_file_path = model_files[0]
            model_name = os.path.basename(model_file_path).replace(".pt", "")
            runs_df.at[index, "model_name"] = model_name
            runs_df.at[index, "model_path"] = model_file_path
        else:
            print(f"{i} Model file(s) not found for {model_name} \n with pattern {model_file_pattern}")
            runs_df.at[index, "model_path"] = ""
            i += 1
    runs_df["data_name"] = data_name
    runs_df["activity_type"] = activity_type
    runs_df["descriptor_protein"] = descriptor_protein
    runs_df["descriptor_chemical"] = descriptor_chemical
    runs_df["chem_input_dim"] = chem_input_dim
    runs_df["prot_input_dim"] = prot_input_dim
    runs_df["data_specific_path"] = data_specific_path
    runs_df["MT"] = runs_df["n_targets"].apply(lambda x: True if x > 1 else False)
    return runs_df
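
Example

A call sketch; the CSV path is hypothetical and must contain at least the 'Name', 'model_name', and 'n_targets' columns (plus the layer-list columns), while matching *<model_name>.pt checkpoints are expected under models_dir.

from uqdd.metrics import preprocess_runs

# Hypothetical W&B runs export; checkpoints matching "*<model_name>.pt" are
# resolved under MODELS_DIR (or a custom models_dir).
runs_df = preprocess_runs(
    "exports/wandb_runs.csv",
    descriptor_protein="ankh-large",
    descriptor_chemical="ecfp2048",
)
print(runs_df[["run_name", "model_name", "model_path", "MT"]].head())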

uqdd.metrics.get_model_class

get_model_class(model_type: str)

Map a model type name to the corresponding class.

Parameters:

Name Type Description Default
model_type str

Model type identifier (e.g., "pnn", "ensemble", "evidential", "eoe", "emc", "mcdropout").

required

Returns:

Type Description
type

Model class matching the type.

Raises:

Type Description
ValueError

If the model_type is not recognized.

Source code in uqdd/metrics/reassessment.py
def get_model_class(model_type: str):
    """
    Map a model type name to the corresponding class.

    Parameters
    ----------
    model_type : str
        Model type identifier (e.g., "pnn", "ensemble", "evidential", "eoe", "emc", "mcdropout").

    Returns
    -------
    type
        Model class matching the type.

    Raises
    ------
    ValueError
        If the `model_type` is not recognized.
    """
    if model_type.lower() in ["pnn", "mcdropout"]:
        return PNN
    elif model_type.lower() == "ensemble":
        return EnsembleDNN
    elif model_type.lower() in ["evidential", "emc"]:
        return EvidentialDNN
    elif model_type.lower() == "eoe":
        return EoEDNN
    else:
        raise ValueError(f"Model type {model_type} not recognized")

uqdd.metrics.get_predict_fn

get_predict_fn(model_type: str, num_mc_samples: int = 100)

Get the appropriate predict function and kwargs for a given model type.

Parameters:

Name Type Description Default
model_type str

Model type identifier.

required
num_mc_samples int

Number of MC samples for MC Dropout or EMC models. Default is 100.

100

Returns:

Type Description
(callable, dict)

Tuple of (predict_function, keyword_arguments).

Raises:

Type Description
ValueError

If the model_type is not recognized.

Source code in uqdd/metrics/reassessment.py
def get_predict_fn(model_type: str, num_mc_samples: int = 100):
    """
    Get the appropriate predict function and kwargs for a given model type.

    Parameters
    ----------
    model_type : str
        Model type identifier.
    num_mc_samples : int, optional
        Number of MC samples for MC Dropout or EMC models. Default is 100.

    Returns
    -------
    (callable, dict)
        Tuple of (predict_function, keyword_arguments).

    Raises
    ------
    ValueError
        If the `model_type` is not recognized.
    """
    if model_type.lower() == "mcdropout":
        return mc_predict, {"num_mc_samples": num_mc_samples}
    elif model_type.lower() in ["ensemble", "pnn"]:
        return predict, {}
    elif model_type.lower() in ["evidential", "eoe"]:
        return ev_predict, {}
    elif model_type.lower() == "emc":
        return emc_predict, {"num_mc_samples": num_mc_samples}
    else:
        raise ValueError(f"Model type {model_type} not recognized")

uqdd.metrics.get_preds

get_preds(model, dataloaders, model_type: str, subset: str = 'test', num_mc_samples: int = 100)

Run inference and unpack predictions for the requested subset.

Parameters:

Name Type Description Default
model Module

Trained model instance.

required
dataloaders dict

Dictionary of DataLoaders keyed by subset (e.g., 'train', 'val', 'test').

required
model_type str

Model type determining the predict function and outputs.

required
subset str

Subset key to use from dataloaders. Default is "test".

'test'
num_mc_samples int

Number of MC samples for stochastic predictors. Default is 100.

100

Returns:

Type Description
tuple

(preds, labels, alea_vars, epi_vars) where epi_vars may be None for non-evidential models.

Source code in uqdd/metrics/reassessment.py
def get_preds(
    model,
    dataloaders,
    model_type: str,
    subset: str = "test",
    num_mc_samples: int = 100,
):
    """
    Run inference and unpack predictions for the requested subset.

    Parameters
    ----------
    model : torch.nn.Module
        Trained model instance.
    dataloaders : dict
        Dictionary of DataLoaders keyed by subset (e.g., 'train', 'val', 'test').
    model_type : str
        Model type determining the predict function and outputs.
    subset : str, optional
        Subset key to use from `dataloaders`. Default is "test".
    num_mc_samples : int, optional
        Number of MC samples for stochastic predictors. Default is 100.

    Returns
    -------
    tuple
        (preds, labels, alea_vars, epi_vars) where `epi_vars` may be None for non-evidential models.
    """
    predict_fn, predict_kwargs = get_predict_fn(model_type, num_mc_samples=num_mc_samples)
    preds_res = predict_fn(model, dataloaders[subset], device=DEVICE, **predict_kwargs)
    if model_type in ["evidential", "eoe", "emc"]:
        preds, labels, alea_vars, epi_vars = preds_res
    else:
        preds, labels, alea_vars = preds_res
        epi_vars = None
    return preds, labels, alea_vars, epi_vars
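
Example

A small wrapper illustrating the unpacking contract; the model and dataloaders are placeholders supplied by earlier training or loading code.

from uqdd.metrics import get_preds

def collect_test_outputs(model, dataloaders, model_type):
    """Sketch: predictions plus uncertainty components for the test split."""
    preds, labels, alea_vars, epi_vars = get_preds(
        model, dataloaders, model_type, subset="test", num_mc_samples=100
    )
    # epi_vars is None for "pnn", "ensemble", and "mcdropout"; the evidential-style
    # families ("evidential", "eoe", "emc") populate it.
    return preds, labels, alea_vars, epi_vars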

uqdd.metrics.pkl_preds_export

pkl_preds_export(preds, labels, alea_vars, epi_vars, outpath: str, model_type: str, logger=None)

Export predictions and uncertainties to a standardized pickle and return the DataFrame.

Parameters:

Name Type Description Default
preds ndarray or Tensor

Model predictions.

required
labels ndarray or Tensor

True labels.

required
alea_vars ndarray or Tensor

Aleatoric uncertainty components.

required
epi_vars ndarray or Tensor or None

Epistemic uncertainty components, or None for non-evidential models.

required
outpath str

Output directory to write 'preds.pkl'.

required
model_type str

Model type used to guide process_preds behavior.

required
logger Logger or None

Logger for messages. Default is None.

None

Returns:

Type Description
DataFrame

DataFrame with columns [y_true, y_pred, y_err, y_alea, y_eps].

Source code in uqdd/metrics/reassessment.py
def pkl_preds_export(
    preds,
    labels,
    alea_vars,
    epi_vars,
    outpath: str,
    model_type: str,
    logger=None,
):
    """
    Export predictions and uncertainties to a standardized pickle and return the DataFrame.

    Parameters
    ----------
    preds : numpy.ndarray or torch.Tensor
        Model predictions.
    labels : numpy.ndarray or torch.Tensor
        True labels.
    alea_vars : numpy.ndarray or torch.Tensor
        Aleatoric uncertainty components.
    epi_vars : numpy.ndarray or torch.Tensor or None
        Epistemic uncertainty components, or None for non-evidential models.
    outpath : str
        Output directory to write 'preds.pkl'.
    model_type : str
        Model type used to guide `process_preds` behavior.
    logger : logging.Logger or None, optional
        Logger for messages. Default is None.

    Returns
    -------
    pd.DataFrame
        DataFrame with columns [y_true, y_pred, y_err, y_alea, y_eps].
    """
    y_true, y_pred, y_err, y_alea, y_eps = process_preds(preds, labels, alea_vars, epi_vars, None, model_type)
    df = create_df_preds(y_true=y_true, y_pred=y_pred, y_err=y_err, y_alea=y_alea, y_eps=y_eps, export=False, logger=logger)
    df.to_pickle(os.path.join(outpath, "preds.pkl"))
    return df

uqdd.metrics.csv_nll_post_processing

csv_nll_post_processing(csv_path: str) -> None

Normalize NLL values in a CSV by taking the first value per model name.

Parameters:

Name Type Description Default
csv_path str

Path to the CSV file containing a 'model name' and 'NLL' column.

required

Returns:

Type Description
None
Source code in uqdd/metrics/reassessment.py
def csv_nll_post_processing(csv_path: str) -> None:
    """
    Normalize NLL values in a CSV by taking the first value per model name.

    Parameters
    ----------
    csv_path : str
        Path to the CSV file containing a 'model name' and 'NLL' column.

    Returns
    -------
    None
    """
    df = pd.read_csv(csv_path)
    df["NLL"] = df.groupby("model name")["NLL"].transform("first")
    df.to_csv(csv_path, index=False)
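
Example

A runnable round trip on a toy CSV (the file name is illustrative): duplicate NLL rows for the same model collapse to the first recorded value.

import pandas as pd

from uqdd.metrics import csv_nll_post_processing

pd.DataFrame(
    {"model name": ["pnn_a", "pnn_a", "ens_b"], "NLL": [1.20, 0.95, 1.05]}
).to_csv("metrics.csv", index=False)

csv_nll_post_processing("metrics.csv")
print(pd.read_csv("metrics.csv"))   # both pnn_a rows now carry 1.20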

uqdd.metrics.reassess_metrics

reassess_metrics(runs_df: DataFrame, figs_out_path: str, csv_out_path: str, project_out_name: str, logger) -> None

Reassess metrics for each run: reload model, predict, compute NLL, evaluate, and recalibrate.

Parameters:

Name Type Description Default
runs_df DataFrame

Preprocessed runs DataFrame with resolved 'model_path' and configuration fields.

required
figs_out_path str

Directory where per-model figures and prediction pickles are saved.

required
csv_out_path str

Path to a CSV for logging metrics (passed to evaluate_predictions).

required
project_out_name str

Name used for grouping results in downstream logging.

required
logger Logger

Logger instance used throughout evaluation and recalibration.

required

Returns:

Type Description
None
Notes
  • Skips models already reassessed when a figure directory exists.
  • Uses validation split for isotonic recalibration and logs final metrics.
Source code in uqdd/metrics/reassessment.py
def reassess_metrics(
    runs_df: pd.DataFrame,
    figs_out_path: str,
    csv_out_path: str,
    project_out_name: str,
    logger,
) -> None:
    """
    Reassess metrics for each run: reload model, predict, compute NLL, evaluate, and recalibrate.

    Parameters
    ----------
    runs_df : pd.DataFrame
        Preprocessed runs DataFrame with resolved 'model_path' and configuration fields.
    figs_out_path : str
        Directory where per-model figures and prediction pickles are saved.
    csv_out_path : str
        Path to a CSV for logging metrics (passed to `evaluate_predictions`).
    project_out_name : str
        Name used for grouping results in downstream logging.
    logger : logging.Logger
        Logger instance used throughout evaluation and recalibration.

    Returns
    -------
    None

    Notes
    -----
    - Skips models already reassessed when a figure directory exists.
    - Uses validation split for isotonic recalibration and logs final metrics.
    """
    runs_df = runs_df.sample(frac=1).reset_index(drop=True)
    for index, row in runs_df.iterrows():
        model_path = row["model_path"]
        model_name = row["model_name"]
        run_name = row["run_name"]
        rowkwargs = row.to_dict()
        model_type = rowkwargs.pop("model_type")
        activity_type = rowkwargs.pop("activity_type")
        if model_path:
            model_fig_out_path = os.path.join(figs_out_path, model_name)
            if os.path.exists(model_fig_out_path):
                print(f"Model {model_name} already reassessed")
                continue
            os.makedirs(model_fig_out_path, exist_ok=True)
            config = get_model_config(model_type=model_type, activity_type=activity_type, **rowkwargs)
            num_mc_samples = config.get("num_mc_samples", 100)
            model_class = get_model_class(model_type)
            prefix = "models." if model_type == "eoe" else ""
            model = load_model(model_class, model_path, prefix_to_state_keys=prefix, config=config).to(DEVICE)
            dataloaders = get_dataloader(config, device=DEVICE, logger=logger)
            preds, labels, alea_vars, epi_vars = get_preds(model, dataloaders, model_type, subset="test", num_mc_samples=num_mc_samples)
            nll = nll_evidentials(model, dataloaders["test"], model_type=model_type, num_mc_samples=num_mc_samples, device=DEVICE)
            df = pkl_preds_export(preds, labels, alea_vars, epi_vars, model_fig_out_path, model_type, logger=logger)
            metrics, plots, uct_logger = evaluate_predictions(
                config,
                preds,
                labels,
                alea_vars,
                model_type,
                logger,
                epi_vars=epi_vars,
                wandb_push=False,
                run_name=config["run_name"],
                project_name=project_out_name,
                figpath=model_fig_out_path,
                export_preds=False,
                verbose=False,
                csv_path=csv_out_path,
                nll=nll,
            )
            preds_val, labels_val, alea_vars_val, epi_vars_val = get_preds(model, dataloaders, model_type, subset="val", num_mc_samples=num_mc_samples)
            nll = nll_evidentials(model, dataloaders["val"], model_type=model_type, num_mc_samples=num_mc_samples, device=DEVICE)
            iso_recal_model = recalibrate_model(
                preds_val,
                labels_val,
                alea_vars_val,
                preds,
                labels,
                alea_vars,
                config=config,
                epi_val=epi_vars_val,
                epi_test=epi_vars,
                uct_logger=uct_logger,
                figpath=model_fig_out_path,
                nll=nll,
            )
            uct_logger.csv_log()
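
Example

An end-to-end sketch combining preprocess_runs and reassess_metrics; every path and the project name are hypothetical, and trained .pt checkpoints must already exist for the runs listed in the CSV.

import logging

from uqdd.metrics import preprocess_runs, reassess_metrics

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("reassessment")

runs_df = preprocess_runs("exports/wandb_runs.csv")      # hypothetical W&B export
reassess_metrics(
    runs_df,
    figs_out_path="figures/reassessment",                # per-model figures and preds.pkl
    csv_out_path="figures/reassessment/metrics.csv",     # metrics CSV log
    project_out_name="papyrus_xc50_reassessment",        # hypothetical grouping name
    logger=logger,
)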

uqdd.metrics.analysis

Analysis and plotting utilities for model metrics.

This module provides functions to aggregate experiment results, compute summary statistics, and visualize metrics via pairplots, line plots, histograms, bar plots, correlation matrices, calibration curves, and RMSE rejection curves.

uqdd.metrics.analysis.aggregate_results_csv

aggregate_results_csv(df: DataFrame, group_cols: List[str], numeric_cols: List[str], string_cols: List[str], order_by: Optional[Union[str, List[str]]] = None, output_file_path: Optional[str] = None) -> pd.DataFrame

Aggregate metrics by groups and export a compact CSV summary.

Parameters:

Name Type Description Default
df DataFrame

Input results DataFrame.

required
group_cols list of str

Column names to group by.

required
numeric_cols list of str

Numeric metric columns to aggregate with mean and std.

required
string_cols list of str

String columns to aggregate as lists.

required
order_by str or list of str or None

Column(s) to sort the final aggregated DataFrame by. Default is None.

None
output_file_path str or None

Path to write the aggregated CSV. If None, no file is written.

None

Returns:

Type Description
DataFrame

Aggregated DataFrame with combined mean(std) strings plus string/list aggregates.

Notes
  • A helper column project_model is constructed and included in the aggregates.
  • When output_file_path is provided, the function ensures the directory exists.
Source code in uqdd/metrics/analysis.py
def aggregate_results_csv(
    df: pd.DataFrame,
    group_cols: List[str],
    numeric_cols: List[str],
    string_cols: List[str],
    order_by: Optional[Union[str, List[str]]] = None,
    output_file_path: Optional[str] = None,
) -> pd.DataFrame:
    """
    Aggregate metrics by groups and export a compact CSV summary.

    Parameters
    ----------
    df : pd.DataFrame
        Input results DataFrame.
    group_cols : list of str
        Column names to group by.
    numeric_cols : list of str
        Numeric metric columns to aggregate with mean and std.
    string_cols : list of str
        String columns to aggregate as lists.
    order_by : str or list of str or None, optional
        Column(s) to sort the final aggregated DataFrame by. Default is None.
    output_file_path : str or None, optional
        Path to write the aggregated CSV. If None, no file is written.

    Returns
    -------
    pd.DataFrame
        Aggregated DataFrame with combined mean(std) strings plus string/list aggregates.

    Notes
    -----
    - A helper column `project_model` is constructed and included in the aggregates.
    - When `output_file_path` is provided, the function ensures the directory exists.
    """
    grouped = df.groupby(group_cols)
    aggregated = grouped[numeric_cols].agg(["mean", "std"])
    for col in numeric_cols:
        aggregated[(col, "combined")] = (
            aggregated[(col, "mean")].round(3).astype(str)
            + "("
            + aggregated[(col, "std")].round(3).astype(str)
            + ")"
        )
    aggregated = aggregated[[col for col in aggregated.columns if col[1] == "combined"]]
    aggregated.columns = [col[0] for col in aggregated.columns]

    string_aggregated = grouped[string_cols].agg(lambda x: list(x))

    df["project_model"] = (
        "papyrus"
        + "/"
        + df["Activity"]
        + "/"
        + "all"
        + "/"
        + df["wandb project"]
        + "/"
        + df["model name"]
        + "/"
    )
    project_model_aggregated = grouped["project_model"].agg(lambda x: list(x))

    final_aggregated = pd.concat(
        [aggregated, string_aggregated, project_model_aggregated], axis=1
    ).reset_index()

    if order_by:
        final_aggregated = final_aggregated.sort_values(by=order_by)

    if output_file_path:
        os.makedirs(os.path.dirname(output_file_path), exist_ok=True)
        final_aggregated.to_csv(output_file_path, index=False)

    return final_aggregated
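
Example

Runnable on a toy results table; the 'Activity', 'wandb project', and 'model name' columns must be present because the helper column project_model is built from them.

import pandas as pd

from uqdd.metrics import aggregate_results_csv

df = pd.DataFrame({
    "Model type": ["PNN", "PNN", "Ensemble", "Ensemble"],
    "Split": ["random"] * 4,
    "Activity": ["xc50"] * 4,
    "wandb project": ["demo"] * 4,
    "model name": ["m1", "m2", "m3", "m4"],
    "RMSE": [0.71, 0.69, 0.66, 0.65],
    "NLL": [1.10, 1.05, 0.98, 0.97],
})

summary = aggregate_results_csv(
    df,
    group_cols=["Model type", "Split"],
    numeric_cols=["RMSE", "NLL"],      # reported as "mean(std)" strings
    string_cols=["model name"],        # collected into per-group lists
    order_by="Model type",
    output_file_path=None,             # supply a path to also write the CSV
)
print(summary)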

uqdd.metrics.analysis.save_plot

save_plot(fig: Figure, save_dir: Optional[str], plot_name: str, tighten: bool = True, show_legend: bool = False) -> None

Save a matplotlib figure to PNG, SVG, and PDF with optional tight layout.

Parameters:

Name Type Description Default
fig Figure

Figure to save.

required
save_dir str or None

Directory to save the figure files. If None, no files are written.

required
plot_name str

Base filename (without extension).

required
tighten bool

If True, apply tight_layout and bbox_inches="tight". Default is True.

True
show_legend bool

If False, remove legend before saving. Default is False.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def save_plot(
    fig: plt.Figure,
    save_dir: Optional[str],
    plot_name: str,
    tighten: bool = True,
    show_legend: bool = False,
) -> None:
    """
    Save a matplotlib figure to PNG, SVG, and PDF with optional tight layout.

    Parameters
    ----------
    fig : matplotlib.figure.Figure
        Figure to save.
    save_dir : str or None
        Directory to save the figure files. If None, no files are written.
    plot_name : str
        Base filename (without extension).
    tighten : bool, optional
        If True, apply tight_layout and bbox_inches="tight". Default is True.
    show_legend : bool, optional
        If False, remove legend before saving. Default is False.

    Returns
    -------
    None
    """
    ax = fig.gca()
    if not show_legend:
        legend = ax.get_legend()
        if legend is not None:
            legend.remove()
    if tighten:
        try:
            with warnings.catch_warnings():
                warnings.filterwarnings(
                    "ignore",
                    message="This figure includes Axes that are not compatible with tight_layout",
                )
                fig.tight_layout()
        except (ValueError, RuntimeError):
            fig.subplots_adjust(left=0.1, right=0.9, top=0.9, bottom=0.1)

    if save_dir and tighten:
        os.makedirs(save_dir, exist_ok=True)
        fig.savefig(os.path.join(save_dir, f"{plot_name}.png"), dpi=300, bbox_inches="tight")
        fig.savefig(os.path.join(save_dir, f"{plot_name}.svg"), bbox_inches="tight")
        fig.savefig(os.path.join(save_dir, f"{plot_name}.pdf"), dpi=300, bbox_inches="tight")
    elif save_dir and not tighten:
        os.makedirs(save_dir, exist_ok=True)
        fig.savefig(os.path.join(save_dir, f"{plot_name}.png"), dpi=300)
        fig.savefig(os.path.join(save_dir, f"{plot_name}.svg"))
        fig.savefig(os.path.join(save_dir, f"{plot_name}.pdf"), dpi=300)
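
Example

Saving a small figure in all three formats; the directory and file stem are illustrative.

import matplotlib.pyplot as plt

from uqdd.metrics import save_plot

fig, ax = plt.subplots()
ax.plot([0.0, 0.1, 0.2], [0.92, 0.88, 0.85], marker="o", label="RMSE")
ax.legend()

# Writes example_curve.png/.svg/.pdf under "figures/"; keep show_legend=True,
# otherwise the legend is stripped before saving.
save_plot(fig, save_dir="figures", plot_name="example_curve", show_legend=True)
plt.close(fig)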

uqdd.metrics.analysis.handle_inf_values

handle_inf_values(df: DataFrame) -> pd.DataFrame

Replace +/- infinity values in a DataFrame with NaN.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required

Returns:

Type Description
DataFrame

DataFrame with infinite values replaced by NaN.

Source code in uqdd/metrics/analysis.py
def handle_inf_values(df: pd.DataFrame) -> pd.DataFrame:
    """
    Replace +/- infinity values in a DataFrame with NaN.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.

    Returns
    -------
    pd.DataFrame
        DataFrame with infinite values replaced by NaN.
    """
    return df.replace([float("inf"), -float("inf")], float("nan"))
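
Example

A runnable demonstration on a toy frame.

import numpy as np
import pandas as pd

from uqdd.metrics import handle_inf_values

df = pd.DataFrame({"R2": [0.81, -np.inf, 0.78], "NLL": [1.2, np.inf, 1.1]})
clean = handle_inf_values(df)
print(clean.isna().sum())   # one NaN per column where +/-inf used to be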

uqdd.metrics.analysis.plot_pairplot

plot_pairplot(df: DataFrame, title: str, metrics: List[str], save_dir: Optional[str] = None, cmap: str = 'viridis', group_order: Optional[List[str]] = group_order, show_legend: bool = False) -> None

Plot a seaborn pairplot for a set of metrics colored by Group.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing the metrics and a 'Group' column.

required
title str

Plot title.

required
metrics list of str

Metric column names to include in the pairplot.

required
save_dir str or None

Directory to save plot images. Default is None.

None
cmap str

Seaborn/matplotlib palette name. Default is "viridis".

'viridis'
group_order list of str or None

Order of class labels in the legend. Default is from constants.

group_order
show_legend bool

If True, keep the legend; otherwise it will be removed before saving.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def plot_pairplot(
    df: pd.DataFrame,
    title: str,
    metrics: List[str],
    save_dir: Optional[str] = None,
    cmap: str = "viridis",
    group_order: Optional[List[str]] = group_order,
    show_legend: bool = False,
) -> None:
    """
    Plot a seaborn pairplot for a set of metrics colored by Group.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame containing the metrics and a 'Group' column.
    title : str
        Plot title.
    metrics : list of str
        Metric column names to include in the pairplot.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    cmap : str, optional
        Seaborn/matplotlib palette name. Default is "viridis".
    group_order : list of str or None, optional
        Order of class labels in the legend. Default is from constants.
    show_legend : bool, optional
        If True, keep the legend; otherwise it will be removed before saving.

    Returns
    -------
    None
    """
    df = handle_inf_values(df)
    sns.pairplot(
        df,
        hue="Group",
        hue_order=group_order,
        vars=metrics,
        palette=cmap,
        plot_kws={"alpha": 0.7},
    )
    plt.suptitle(title, y=1.02)
    plot_name = f"pairplot_{title.replace(' ', '_')}"
    save_plot(plt.gcf(), save_dir, plot_name, tighten=False, show_legend=show_legend)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.analysis.plot_line_metrics

plot_line_metrics(df: DataFrame, title: str, metrics: List[str], save_dir: Optional[str] = None, group_order: Optional[List[str]] = group_order, show_legend: bool = False) -> None

Plot line charts of metrics over runs, colored by Group.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with 'wandb run', metrics, and 'Group'.

required
title str

Plot title.

required
metrics list of str

Metric column names to plot.

required
save_dir str or None

Directory to save plot images. Default is None.

None
group_order list of str or None

Order of class labels in the legend. Default is from constants.

group_order
show_legend bool

If True, keep the legend; otherwise it will be removed before saving.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def plot_line_metrics(
    df: pd.DataFrame,
    title: str,
    metrics: List[str],
    save_dir: Optional[str] = None,
    group_order: Optional[List[str]] = group_order,
    show_legend: bool = False,
) -> None:
    """
    Plot line charts of metrics over runs, colored by Group.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with 'wandb run', metrics, and 'Group'.
    title : str
        Plot title.
    metrics : list of str
        Metric column names to plot.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    group_order : list of str or None, optional
        Order of class labels in the legend. Default is from constants.
    show_legend : bool, optional
        If True, keep the legend; otherwise it will be removed before saving.

    Returns
    -------
    None
    """
    df = handle_inf_values(df)
    for metric in metrics:
        plt.figure(figsize=(14, 7))
        sns.lineplot(
            data=df,
            x="wandb run",
            y=metric,
            hue="Group",
            marker="o",
            palette="Set2",
            hue_order=group_order,
            label=metric,
        )
        plt.title(f"{title} - {metric}")
        plt.xticks(rotation=45)
        plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
        if INTERACTIVE_MODE:
            plt.show()
        plot_name = f"line_{title.replace(' ', '_')}_{metric}"
        save_plot(plt.gcf(), save_dir, plot_name, tighten=False, show_legend=show_legend)
        plt.close()

uqdd.metrics.analysis.plot_histogram_metrics

plot_histogram_metrics(df: DataFrame, title: str, metrics: List[str], save_dir: Optional[str] = None, group_order: Optional[List[str]] = group_order, cmap: str = 'crest', show_legend: bool = False) -> None

Plot histograms with KDE for metrics, split by Group.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with metrics and 'Group'.

required
title str

Plot title.

required
metrics list of str

Metric column names to plot.

required
save_dir str or None

Directory to save plot images. Default is None.

None
group_order list of str or None

Order of class labels in the legend. Default is from constants.

group_order
cmap str

Seaborn/matplotlib palette name. Default is "crest".

'crest'
show_legend bool

If True, keep the legend; otherwise it will be removed before saving.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def plot_histogram_metrics(
    df: pd.DataFrame,
    title: str,
    metrics: List[str],
    save_dir: Optional[str] = None,
    group_order: Optional[List[str]] = group_order,
    cmap: str = "crest",
    show_legend: bool = False,
) -> None:
    """
    Plot histograms with KDE for metrics, split by Group.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with metrics and 'Group'.
    title : str
        Plot title.
    metrics : list of str
        Metric column names to plot.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    group_order : list of str or None, optional
        Order of class labels in the legend. Default is from constants.
    cmap : str, optional
        Seaborn/matplotlib palette name. Default is "crest".
    show_legend : bool, optional
        If True, keep the legend; otherwise it will be removed before saving.

    Returns
    -------
    None
    """
    df = handle_inf_values(df)
    for metric in metrics:
        plt.figure(figsize=(14, 7))
        sns.histplot(
            data=df,
            x=metric,
            hue="Group",
            kde=True,
            palette=cmap,
            element="step",
            hue_order=group_order,
            fill=True,
            alpha=0.7,
        )
        plt.title(f"{title} - {metric}")
        if INTERACTIVE_MODE:
            plt.show()
        plot_name = f"histogram_{title.replace(' ', '_')}_{metric}"
        save_plot(plt.gcf(), save_dir, plot_name, show_legend=show_legend)
        plt.close()

uqdd.metrics.analysis.plot_pairwise_scatter_metrics

plot_pairwise_scatter_metrics(df: DataFrame, title: str, metrics: List[str], save_dir: Optional[str] = None, group_order: Optional[List[str]] = group_order, cmap: str = 'tab10_r', show_legend: bool = False) -> None

Plot pairwise scatterplots for all metric combinations, colored by Group.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with metrics and 'Group'.

required
title str

Plot title.

required
metrics list of str

Metric column names to plot pairwise.

required
save_dir str or None

Directory to save plot images. Default is None.

None
group_order list of str or None

Order of class labels in the legend. Default is from constants.

group_order
cmap str

Matplotlib palette name. Default is "tab10_r".

'tab10_r'
show_legend bool

If True, keep the legend; otherwise it will be removed before saving.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def plot_pairwise_scatter_metrics(
    df: pd.DataFrame,
    title: str,
    metrics: List[str],
    save_dir: Optional[str] = None,
    group_order: Optional[List[str]] = group_order,
    cmap: str = "tab10_r",
    show_legend: bool = False,
) -> None:
    """
    Plot pairwise scatterplots for all metric combinations, colored by Group.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with metrics and 'Group'.
    title : str
        Plot title.
    metrics : list of str
        Metric column names to plot pairwise.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    group_order : list of str or None, optional
        Order of class labels in the legend. Default is from constants.
    cmap : str, optional
        Matplotlib palette name. Default is "tab10_r".
    show_legend : bool, optional
        If True, keep the legend; otherwise it will be removed before saving.

    Returns
    -------
    None
    """
    df = handle_inf_values(df)
    num_metrics = len(metrics)
    fig, axes = plt.subplots(num_metrics, num_metrics, figsize=(15, 15))

    for i in range(num_metrics):
        for j in range(num_metrics):
            if i != j:
                ax = sns.scatterplot(
                    data=df,
                    x=metrics[j],
                    y=metrics[i],
                    hue="Group",
                    palette=cmap,
                    hue_order=group_order,
                    ax=axes[i, j],
                    legend=False if not (i == 1 and j == 0) else "brief",
                )
                if i == 1 and j == 0:
                    handles, labels = ax.get_legend_handles_labels()
                    ax.legend().remove()
            else:
                axes[i, j].set_visible(False)

            axes[i, j].set_ylabel(metrics[i] if j == 0 and i > 0 else "")
            axes[i, j].set_xlabel(metrics[j] if i == num_metrics - 1 else "")

    fig.legend(handles, labels, loc="upper right", bbox_to_anchor=(1.15, 1))
    fig.suptitle(title, y=1.02)
    fig.subplots_adjust(top=0.95, wspace=0.4, hspace=0.4)
    plot_name = f"pairwise_scatter_{title.replace(' ', '_')}"
    save_plot(fig, save_dir, plot_name, tighten=True, show_legend=show_legend)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.analysis.plot_metrics

plot_metrics(df: DataFrame, metrics: List[str], cmap: str = 'tab10_r', save_dir: Optional[str] = None, hatches_dict: Optional[Dict[str, str]] = None, group_order: Optional[List[str]] = None, show: bool = True, fig_width: Optional[float] = None, fig_height: Optional[float] = None, show_legend: bool = False) -> Dict[str, str]

Plot grouped bar charts showing mean and std for metrics across splits and model types.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with columns ['Split', 'Model type'] and metrics.

required
metrics list of str

Metric column names to plot.

required
cmap str

Matplotlib colormap name used to derive distinct colors per model type. Default is "tab10_r".

'tab10_r'
save_dir str or None

Directory to save plot images. Default is None.

None
hatches_dict dict[str, str] or None

Mapping from Split to hatch pattern. Default is None.

None
group_order list of str or None

Order of grouped labels (Split_Model type). Default derives from data.

None
show bool

If True, display plot in interactive mode. Default is True.

True
fig_width float or None

Width of the plot area (excluding legend). Default scales with number of metrics.

None
fig_height float or None

Height of the plot area (excluding legend). Default is 6.

None
show_legend bool

If True, include a legend of split/model combinations. Default is False.

False

Returns:

Type Description
dict[str, str]

Color mapping from 'Model type' to RGBA string used in the plot.

Source code in uqdd/metrics/analysis.py
def plot_metrics(
    df: pd.DataFrame,
    metrics: List[str],
    cmap: str = "tab10_r",
    save_dir: Optional[str] = None,
    hatches_dict: Optional[Dict[str, str]] = None,
    group_order: Optional[List[str]] = None,
    show: bool = True,
    fig_width: Optional[float] = None,
    fig_height: Optional[float] = None,
    show_legend: bool = False,
) -> Dict[str, str]:
    """
    Plot grouped bar charts showing mean and std for metrics across splits and model types.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with columns ['Split', 'Model type'] and metrics.
    metrics : list of str
        Metric column names to plot.
    cmap : str, optional
        Matplotlib colormap name used to derive distinct colors per model type. Default is "tab10_r".
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    hatches_dict : dict[str, str] or None, optional
        Mapping from Split to hatch pattern. Default is None.
    group_order : list of str or None, optional
        Order of grouped labels (Split_Model type). Default derives from data.
    show : bool, optional
        If True, display plot in interactive mode. Default is True.
    fig_width : float or None, optional
        Width of the plot area (excluding legend). Default scales with number of metrics.
    fig_height : float or None, optional
        Height of the plot area (excluding legend). Default is 6.
    show_legend : bool, optional
        If True, include a legend of split/model combinations. Default is False.

    Returns
    -------
    dict[str, str]
        Color mapping from 'Model type' to RGBA string used in the plot.
    """
    plot_width = fig_width if fig_width else max(10, len(metrics) * 2)
    plot_height = fig_height if fig_height else 6
    total_width = plot_width + 5
    total_height = plot_height + 2

    fig = plt.figure(figsize=(total_width, total_height))
    gs = gridspec.GridSpec(1, 1, figure=fig, left=0.1, right=0.75, top=0.9, bottom=0.2)
    ax = fig.add_subplot(gs[0])
    ax.set_position([0.1, 0.15, plot_width / total_width, plot_height / total_height])

    stats_dfs = []
    for metric in metrics:
        mean_df = df.groupby(["Split", "Model type"])[metric].mean().rename(f"{metric}_mean")
        std_df = df.groupby(["Split", "Model type"])[metric].std().rename(f"{metric}_std")
        stats_df = pd.merge(mean_df, std_df, left_index=True, right_index=True).reset_index()
        stats_df["Group"] = stats_df.apply(lambda row: f"{row['Split']}_{row['Model type']}", axis=1)
        stats_df["Metric"] = metric
        stats_dfs.append(stats_df)

    combined_stats_df = pd.concat(stats_dfs)
    if group_order:
        combined_stats_df["Group"] = pd.Categorical(
            combined_stats_df["Group"], categories=group_order, ordered=True
        )
    else:
        group_order = combined_stats_df["Group"].unique().tolist()

    scalar_mappable = ScalarMappable(cmap=cmap)
    model_types = combined_stats_df["Model type"].unique()
    color_dict = {
        m: c
        for m, c in zip(
            model_types,
            scalar_mappable.to_rgba(range(len(model_types)), alpha=1).tolist(),
        )
    }

    bar_width = 0.12
    group_spacing = 0.4
    num_bars = len(model_types) * len(hatches_dict)
    positions = []
    tick_positions = []
    tick_labels = []

    for i, metric in enumerate(metrics):
        metric_data = combined_stats_df[combined_stats_df["Metric"] == metric]
        metric_data.loc[:, "Group"] = pd.Categorical(
            metric_data["Group"], categories=group_order, ordered=True
        )
        metric_data = metric_data.sort_values("Group").reset_index(drop=True)
        for j, (_, row) in enumerate(metric_data.iterrows()):
            position = i * (num_bars * bar_width + group_spacing) + (j % num_bars) * bar_width
            positions.append(position)
            ax.bar(
                position,
                height=row[f"{metric}_mean"],
                color=color_dict[row["Model type"]],
                hatch=hatches_dict[row["Split"]],
                width=bar_width,
            )
        center_position = i * (num_bars * bar_width + group_spacing) + (num_bars * bar_width) / 2
        tick_positions.append(center_position)
        tick_labels.append(metric.replace(" ", "\n") if " " in metric else metric)

    def create_stats_legend(df, color_mapping, hatches_dict, group_order):
        patches_dict = {}
        for _, row in df.iterrows():
            label = f"{row['Split']} {row['Model type']}"
            group_label = f"{row['Split']}_{row['Model type']}"
            if group_label not in patches_dict:
                patches_dict[group_label] = mpatches.Patch(
                    facecolor=color_mapping[row["Model type"]],
                    hatch=hatches_dict[row["Split"]],
                    label=label,
                )
        return [patches_dict[group] for group in group_order if group in patches_dict]

    if show_legend:
        legend_elements = create_stats_legend(combined_stats_df, color_dict, hatches_dict, group_order)
        ax.legend(handles=legend_elements, bbox_to_anchor=(1.05, 1), loc="upper left", borderaxespad=0, frameon=False)

    for (_, row), bar in zip(combined_stats_df.iterrows(), ax.patches):
        x_bar = bar.get_x() + bar.get_width() / 2
        y_bar = bar.get_height()
        ax.errorbar(
            x_bar,
            y_bar,
            yerr=row[f"{row['Metric']}_std"],
            color="black",
            fmt="none",
            elinewidth=1,
            capsize=3,
            alpha=0.5,
        )

    ax.set_xticks(tick_positions)
    ax.set_xticklabels(tick_labels, rotation=45, ha="right", rotation_mode="anchor", fontsize=9)
    ax.set_ylim(bottom=0.0)

    if save_dir:
        metrics_names = "_".join(metrics)
        plot_name = f"barplot_{cmap}_{metrics_names}"
        save_plot(fig, save_dir, plot_name, show_legend=show_legend)

    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

    return color_dict
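
Example

Runnable on a toy results table. Although hatches_dict defaults to None in the signature, the implementation above takes len(hatches_dict), so a mapping covering every 'Split' value should be supplied; the split labels and metric values below are illustrative.

import pandas as pd

from uqdd.metrics import plot_metrics

df = pd.DataFrame({
    "Split": (["random"] * 2 + ["scaffold"] * 2) * 2,
    "Model type": ["PNN"] * 4 + ["Ensemble"] * 4,
    "RMSE": [0.70, 0.68, 0.82, 0.80, 0.66, 0.65, 0.78, 0.77],
    "NLL": [1.10, 1.05, 1.30, 1.28, 0.98, 0.97, 1.20, 1.18],
})

color_dict = plot_metrics(
    df,
    metrics=["RMSE", "NLL"],
    hatches_dict={"random": "", "scaffold": "//"},   # one hatch per Split value
    save_dir=None,                                   # pass a directory to export the figure
    show_legend=True,
)
# color_dict maps each 'Model type' to its bar color and can be passed to
# plot_comparison_metrics for consistent coloring across figures.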

uqdd.metrics.analysis.find_highly_correlated_metrics

find_highly_correlated_metrics(df: DataFrame, metrics: List[str], threshold: float = 0.8, save_dir: Optional[str] = None, cmap: str = 'coolwarm', show_legend: bool = False) -> List[Tuple[str, str, float]]

Identify pairs of metrics with correlation above a threshold and plot the matrix.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing the metric columns.

required
metrics list of str

Metric column names to include in the correlation analysis.

required
threshold float

Absolute correlation threshold for reporting pairs. Default is 0.8.

0.8
save_dir str or None

Directory to save the heatmap plot. Default is None.

None
cmap str

Matplotlib colormap name. Default is "coolwarm".

'coolwarm'
show_legend bool

If True, keep the legend; otherwise it will be removed before saving.

False

Returns:

Type Description
list of tuple[str, str, float]

List of metric pairs and their absolute correlation values.

Source code in uqdd/metrics/analysis.py
def find_highly_correlated_metrics(
    df: pd.DataFrame,
    metrics: List[str],
    threshold: float = 0.8,
    save_dir: Optional[str] = None,
    cmap: str = "coolwarm",
    show_legend: bool = False,
) -> List[Tuple[str, str, float]]:
    """
    Identify pairs of metrics with correlation above a threshold and plot the matrix.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame containing the metric columns.
    metrics : list of str
        Metric column names to include in the correlation analysis.
    threshold : float, optional
        Absolute correlation threshold for reporting pairs. Default is 0.8.
    save_dir : str or None, optional
        Directory to save the heatmap plot. Default is None.
    cmap : str, optional
        Matplotlib colormap name. Default is "coolwarm".
    show_legend : bool, optional
        If True, keep the legend; otherwise it will be removed before saving.

    Returns
    -------
    list of tuple[str, str, float]
        List of metric pairs and their absolute correlation values.
    """
    corr_matrix = df[metrics].corr().abs()
    pairs = []
    for i in range(len(corr_matrix.columns)):
        for j in range(i):
            if corr_matrix.iloc[i, j] > threshold:
                pairs.append((corr_matrix.columns[i], corr_matrix.columns[j], corr_matrix.iloc[i, j]))

    print(f"Highly correlated metrics (correlation coefficient > {threshold}):")
    for a, b, v in pairs:
        print(f"{a} and {b}: {v:.2f}")

    plt.figure(figsize=(10, 8))
    sns.heatmap(corr_matrix, annot=True, cmap=cmap)
    plt.title("Correlation Matrix")
    plot_name = f"correlation_matrix_{threshold}_{'_'.join(metrics)}"
    save_plot(plt.gcf(), save_dir, plot_name, show_legend=show_legend)
    if INTERACTIVE_MODE:
        plt.show()

    return pairs
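
A minimal usage sketch with synthetic data (column values are illustrative; it assumes the name is importable from `uqdd.metrics` as listed in the public API):

```python
import numpy as np
import pandas as pd

from uqdd.metrics import find_highly_correlated_metrics

# Hypothetical results table with three metric columns; RMSE and MAE are
# constructed to be strongly correlated so at least one pair is reported.
rng = np.random.default_rng(0)
rmse = rng.uniform(0.5, 1.5, size=50)
df = pd.DataFrame({
    "RMSE": rmse,
    "MAE": 0.8 * rmse + rng.normal(0.0, 0.01, size=50),
    "NLL": rng.normal(1.0, 0.3, size=50),
})

pairs = find_highly_correlated_metrics(df, metrics=["RMSE", "MAE", "NLL"], threshold=0.8)
# e.g. [('MAE', 'RMSE', 0.99...)]; a correlation heatmap is drawn as a side effect.
```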

uqdd.metrics.analysis.plot_comparison_metrics

plot_comparison_metrics(df: DataFrame, metrics: List[str], cmap: str = 'tab10_r', color_dict: Optional[Dict[str, str]] = None, save_dir: Optional[str] = None, fig_width: Optional[float] = None, fig_height: Optional[float] = None, show_legend: bool = False, models_order: Optional[List[str]] = None) -> None

Plot comparison bar charts across splits, model types, and calibration states.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with columns ['Split', 'Model type', 'Calibration'] and metrics.

required
metrics list of str

Metric column names to plot.

required
cmap str

Matplotlib colormap name used to derive distinct colors per model type. Default is "tab10_r".

'tab10_r'
color_dict dict[str, str] or None

Precomputed color mapping from model type to color. If None, one is generated.

None
save_dir str or None

Directory to save plot images. Default is None.

None
fig_width float or None

Width of the plot area (excluding legend). Default scales with the number of metrics.

None
fig_height float or None

Height of the plot area (excluding legend). Default is 6.

None
show_legend bool

If True, include a legend. Default is False.

False
models_order list of str or None

Explicit order of model types for coloring and grouping. Default derives from data.

None

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def plot_comparison_metrics(
    df: pd.DataFrame,
    metrics: List[str],
    cmap: str = "tab10_r",
    color_dict: Optional[Dict[str, str]] = None,
    save_dir: Optional[str] = None,
    fig_width: Optional[float] = None,
    fig_height: Optional[float] = None,
    show_legend: bool = False,
    models_order: Optional[List[str]] = None,
) -> None:
    """
    Plot comparison bar charts across splits, model types, and calibration states.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with columns ['Split', 'Model type', 'Calibration'] and metrics.
    metrics : list of str
        Metric column names to plot.
    cmap : str, optional
        Matplotlib colormap name used to derive distinct colors per model type. Default is "tab10_r".
    color_dict : dict[str, str] or None, optional
        Precomputed color mapping from model type to color. If None, one is generated.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    fig_width : float or None, optional
        Width of the plot area (excluding legend). Default scales with the number of metrics.
    fig_height : float or None, optional
        Height of the plot area (excluding legend). Default is 6.
    show_legend : bool, optional
        If True, include a legend. Default is False.
    models_order : list of str or None, optional
        Explicit order of model types for coloring and grouping. Default derives from data.

    Returns
    -------
    None
    """
    plot_width = fig_width if fig_width else max(7, len(metrics) * 3)
    plot_height = fig_height if fig_height else 6
    total_width = plot_width + 5
    total_height = plot_height + 2

    fig = plt.figure(figsize=(total_width, total_height))
    gs = gridspec.GridSpec(1, 1, figure=fig, left=0.1, right=0.75, top=0.9, bottom=0.15)
    ax = fig.add_subplot(gs[0])
    ax.set_position([0.1, 0.15, plot_width / total_width, plot_height / total_height])

    stats_dfs = []
    for metric in metrics:
        mean_df = df.groupby(["Split", "Model type", "Calibration"])[metric].mean().rename(f"{metric}_mean")
        std_df = df.groupby(["Split", "Model type", "Calibration"])[metric].std().rename(f"{metric}_std")
        stats_df = pd.merge(mean_df, std_df, left_index=True, right_index=True).reset_index()
        stats_df["Group"] = stats_df.apply(
            lambda row: f"{row['Split']}_{row['Model type']}_{row['Calibration']}", axis=1
        )
        stats_df["Metric"] = metric
        stats_dfs.append(stats_df)

    combined_stats_df = pd.concat(stats_dfs)
    if models_order is None:
        models_order = combined_stats_df["Model type"].unique().tolist()

    if color_dict is None:
        scalar_mappable = ScalarMappable(cmap=cmap)
        color_dict = {
            m: c
            for m, c in zip(
                models_order,
                scalar_mappable.to_rgba(range(len(models_order)), alpha=1).tolist(),
            )
        }
    color_dict = {k: color_dict[k] for k in models_order}

    hatches_dict = {
        "Before Calibration": "\\\\",
        "After Calibration": "",
    }

    bar_width = 0.1
    group_spacing = 0.2
    split_spacing = 0.6
    num_bars = len(models_order) * 2
    positions = []
    tick_positions = []
    tick_labels = []

    for i, metric in enumerate(metrics):
        metric_data = combined_stats_df[combined_stats_df["Metric"] == metric]
        split_types = metric_data["Split"].unique()
        for j, split in enumerate(split_types):
            split_data = metric_data[metric_data["Split"] == split]
            split_data = split_data[split_data["Model type"].isin(models_order)]

            for k, model_type in enumerate(models_order):
                for l, calibration in enumerate(["Before Calibration", "After Calibration"]):
                    position = (
                        i * (split_spacing + len(split_types) * (num_bars * bar_width + group_spacing))
                        + j * (num_bars * bar_width + group_spacing)
                        + k * 2 * bar_width
                        + l * bar_width
                    )
                    positions.append(position)
                    height = split_data[
                        (split_data["Model type"] == model_type)
                        & (split_data["Calibration"] == calibration)
                    ][f"{metric}_mean"].values[0]
                    ax.bar(
                        position,
                        height=height,
                        color=color_dict[model_type],
                        hatch=hatches_dict[calibration],
                        width=bar_width,
                    )

            center_position = (
                i * (split_spacing + len(split_types) * (num_bars * bar_width + group_spacing))
                + j * (num_bars * bar_width + group_spacing)
                + (num_bars * bar_width) / 2
            )
            tick_positions.append(center_position)
            tick_labels.append(f"{metric}\n{split}")

    if show_legend:
        legend_elements = [
            mpatches.Patch(facecolor=color_dict[model], edgecolor="black", label=model)
            for model in models_order
        ]
        legend_elements += [
            mpatches.Patch(facecolor="white", edgecolor="black", hatch=h, label=label)
            for label, h in hatches_dict.items()
        ]
        ax.legend(handles=legend_elements, bbox_to_anchor=(1.05, 1), loc="upper left", borderaxespad=0, frameon=False)

    for (_, row), bar in zip(combined_stats_df.iterrows(), ax.patches):
        x_bar = bar.get_x() + bar.get_width() / 2
        y_bar = bar.get_height()
        yerr_lower = y_bar - max(0, y_bar - row[f"{row['Metric']}_std"])
        yerr_upper = row[f"{row['Metric']}_std"]
        ax.errorbar(
            x_bar,
            y_bar,
            yerr=[[yerr_lower], [yerr_upper]],
            color="black",
            fmt="none",
            elinewidth=1,
            capsize=3,
            alpha=0.5,
        )

    ax.set_xticks(tick_positions)
    ax.set_xticklabels(tick_labels, rotation=45, ha="right", rotation_mode="anchor", fontsize=9)
    ax.set_ylim(bottom=0.0)

    if save_dir:
        metrics_names = "_".join(metrics)
        plot_name = f"comparison_barplot_{cmap}_{metrics_names}"
        save_plot(fig, save_dir, plot_name, show_legend=show_legend)

    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
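
A hedged sketch with synthetic data (split and model names are placeholders); each (Split, Model type, Calibration) combination gets two replicates so the standard-deviation error bars are defined:

```python
import numpy as np
import pandas as pd

from uqdd.metrics import plot_comparison_metrics

rng = np.random.default_rng(0)
rows = []
for split in ["stratified", "scaffold_cluster"]:          # placeholder split names
    for model in ["pnn", "ensemble"]:                     # placeholder model types
        for calib in ["Before Calibration", "After Calibration"]:
            for _ in range(2):                            # two replicates per group
                rows.append({
                    "Split": split,
                    "Model type": model,
                    "Calibration": calib,
                    "RMSE": rng.uniform(0.6, 1.0),
                    "NLL": rng.uniform(0.8, 1.4),
                })
df = pd.DataFrame(rows)

plot_comparison_metrics(df, metrics=["RMSE", "NLL"], show_legend=True)
```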

uqdd.metrics.analysis.load_and_aggregate_calibration_data

load_and_aggregate_calibration_data(base_path: str, paths: List[str]) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]

Load calibration curve data from multiple model paths and aggregate statistics.

Parameters:

Name Type Description Default
base_path str

Base directory from which model subpaths are resolved.

required
paths list of str

Relative paths to model directories containing 'calibration_plot_data.csv'.

required

Returns:

Type Description
(ndarray, ndarray, ndarray, ndarray)

Tuple of (expected_values, mean_observed, lower_bound, upper_bound), each of shape (n_bins,).

Source code in uqdd/metrics/analysis.py
def load_and_aggregate_calibration_data(base_path: str, paths: List[str]) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """
    Load calibration curve data from multiple model paths and aggregate statistics.

    Parameters
    ----------
    base_path : str
        Base directory from which model subpaths are resolved.
    paths : list of str
        Relative paths to model directories containing 'calibration_plot_data.csv'.

    Returns
    -------
    (numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray)
        Tuple of (expected_values, mean_observed, lower_bound, upper_bound), each of shape (n_bins,).
    """
    expected_values = []
    observed_values = []
    for path in paths:
        file_path = os.path.join(base_path, path, "calibration_plot_data.csv")
        if os.path.exists(file_path):
            data = pd.read_csv(file_path)
            expected_values = data["Expected Proportion"]
            observed_values.append(data["Observed Proportion"])
        else:
            print(f"File not found: {file_path}")

    expected_values = np.array(expected_values)
    observed_values = np.array(observed_values)
    mean_observed = np.mean(observed_values, axis=0)
    lower_bound = np.min(observed_values, axis=0)
    upper_bound = np.max(observed_values, axis=0)
    return expected_values, mean_observed, lower_bound, upper_bound
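
A sketch of the expected on-disk layout (paths are placeholders): each listed model directory must contain a `calibration_plot_data.csv` with 'Expected Proportion' and 'Observed Proportion' columns, as read by the function above.

```python
from uqdd.metrics import load_and_aggregate_calibration_data

base_path = "runs/papyrus_xc50"                      # placeholder base directory
paths = ["pnn_seed0", "pnn_seed1", "pnn_seed2"]      # placeholder model sub-directories

expected, mean_obs, lower, upper = load_and_aggregate_calibration_data(base_path, paths)
# 'expected' holds the shared bin values; mean/lower/upper summarise the
# observed proportions across the replicate models (min/max envelope).
```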

uqdd.metrics.analysis.plot_calibration_data

plot_calibration_data(df_aggregated: DataFrame, base_path: str, save_dir: Optional[str] = None, title: str = 'Calibration Plot', color_name: str = 'tab10_r', color_dict: Optional[Dict[str, str]] = None, group_order: Optional[List[str]] = None, fig_width: Optional[float] = None, fig_height: Optional[float] = None, show_legend: bool = False) -> None

Plot aggregated calibration curves for multiple groups against the perfect calibration line.

Parameters:

Name Type Description Default
df_aggregated DataFrame

Aggregated DataFrame containing 'Group' and 'project_model' lists for each group.

required
base_path str

Base directory where model paths are located.

required
save_dir str or None

Directory to save plot images. Default is None.

None
title str

Plot title. Default is "Calibration Plot".

'Calibration Plot'
color_name str

Colormap name used to derive distinct colors per group. Default is "tab10_r".

'tab10_r'
color_dict dict[str, str] or None

Precomputed color mapping from group to color. If None, one is generated.

None
group_order list of str or None

Order of groups in the legend. Default derives from data.

None
fig_width float or None

Width of the plot area. Default is 6.

None
fig_height float or None

Height of the plot area. Default is 6.

None
show_legend bool

If True, include a legend. Default is False.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def plot_calibration_data(
    df_aggregated: pd.DataFrame,
    base_path: str,
    save_dir: Optional[str] = None,
    title: str = "Calibration Plot",
    color_name: str = "tab10_r",
    color_dict: Optional[Dict[str, str]] = None,
    group_order: Optional[List[str]] = None,
    fig_width: Optional[float] = None,
    fig_height: Optional[float] = None,
    show_legend: bool = False,
) -> None:
    """
    Plot aggregated calibration curves for multiple groups against the perfect calibration line.

    Parameters
    ----------
    df_aggregated : pd.DataFrame
        Aggregated DataFrame containing 'Group' and 'project_model' lists for each group.
    base_path : str
        Base directory where model paths are located.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    title : str, optional
        Plot title. Default is "Calibration Plot".
    color_name : str, optional
        Colormap name used to derive distinct colors per group. Default is "tab10_r".
    color_dict : dict[str, str] or None, optional
        Precomputed color mapping from group to color. If None, one is generated.
    group_order : list of str or None, optional
        Order of groups in the legend. Default derives from data.
    fig_width : float or None, optional
        Width of the plot area. Default is 6.
    fig_height : float or None, optional
        Height of the plot area. Default is 6.
    show_legend : bool, optional
        If True, include a legend. Default is False.

    Returns
    -------
    None
    """
    plot_width = fig_width if fig_width else 6
    plot_height = fig_height if fig_height else 6
    total_width = plot_width + 4
    total_height = plot_height + 2

    fig = plt.figure(figsize=(total_width, total_height))
    gs = gridspec.GridSpec(1, 1, figure=fig, left=0.15, right=0.75, top=0.9, bottom=0.15)
    ax = fig.add_subplot(gs[0])
    ax.set_position([0.15, 0.15, plot_width / total_width, plot_height / total_height])

    if group_order is None:
        group_order = list(df_aggregated["Group"].unique())

    if color_dict is None:
        scalar_mappable = ScalarMappable(cmap=color_name)
        colors = scalar_mappable.to_rgba(range(len(group_order)))
        color_dict = {group: color for group, color in zip(group_order, colors)}

    legend_handles = {}
    for idx, row in df_aggregated.iterrows():
        model_paths = row["project_model"]
        group_label = row["Group"]
        color = color_dict[group_label]
        expected, mean_observed, lower_bound, upper_bound = load_and_aggregate_calibration_data(base_path, model_paths)
        (line,) = ax.plot(expected, mean_observed, label=group_label, color=color)
        ax.fill_between(expected, lower_bound, upper_bound, alpha=0.2, color=color)
        if group_label not in legend_handles:
            legend_handles[group_label] = line

    (perfect_line,) = ax.plot([0, 1], [0, 1], "k--", label="Perfect Calibration")
    legend_handles["Perfect Calibration"] = perfect_line

    ordered_legend_handles = [legend_handles[group] for group in group_order if group in legend_handles]
    ordered_legend_handles.append(legend_handles["Perfect Calibration"])
    if show_legend:
        ax.legend(handles=ordered_legend_handles, bbox_to_anchor=(1.05, 1), loc="upper left")

    ax.set_title(title)
    ax.set_xlabel("Expected Proportion")
    ax.set_ylabel("Observed Proportion")
    ax.grid(True)
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)

    if save_dir:
        plot_name = f"{title.replace(' ', '_')}"
        save_plot(fig, save_dir, plot_name, tighten=True, show_legend=show_legend)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
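
A sketch of the expected input (directories and group names are placeholders): one row per group, with `project_model` holding the list of replicate model sub-directories under `base_path`.

```python
import pandas as pd

from uqdd.metrics import plot_calibration_data

df_aggregated = pd.DataFrame({
    "Group": ["stratified_pnn", "stratified_ensemble"],
    "project_model": [
        ["pnn_seed0", "pnn_seed1"],              # placeholder sub-directories
        ["ensemble_seed0", "ensemble_seed1"],
    ],
})

plot_calibration_data(
    df_aggregated,
    base_path="runs/papyrus_xc50",               # placeholder base directory
    title="Calibration Plot",
    show_legend=True,
)
```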

uqdd.metrics.analysis.move_model_folders

move_model_folders(df: DataFrame, search_dirs: List[str], output_dir: str, overwrite: bool = False) -> None

Move or merge model directories into a single output folder based on model names.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing a 'model name' column.

required
search_dirs list of str

Directories to search for model subfolders.

required
output_dir str

Destination directory where model folders will be moved or merged.

required
overwrite bool

If True, an existing destination folder is merged with the source folder (its contents are copied over). Default is False.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def move_model_folders(
    df: pd.DataFrame,
    search_dirs: List[str],
    output_dir: str,
    overwrite: bool = False,
) -> None:
    """
    Move or merge model directories into a single output folder based on model names.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing a 'model name' column.
    search_dirs : list of str
        Directories to search for model subfolders.
    output_dir : str
        Destination directory where model folders will be moved or merged.
    overwrite : bool, optional
        If True, existing folders are merged (copied) with source. Default is False.

    Returns
    -------
    None
    """
    model_names = df["model name"].unique()
    if not os.path.exists(output_dir):
        os.makedirs(output_dir, exist_ok=True)
        print(f"Created output directory '{output_dir}'.")

    for model_name in model_names:
        found = False
        for search_dir in search_dirs:
            if not os.path.isdir(search_dir):
                print(f"Search directory '{search_dir}' does not exist. Skipping.")
                continue
            subdirs = [d for d in os.listdir(search_dir) if os.path.isdir(os.path.join(search_dir, d))]
            if model_name in subdirs:
                source_dir = os.path.join(search_dir, model_name)
                dest_dir = os.path.join(output_dir, model_name)
                if os.path.exists(dest_dir):
                    if overwrite:
                        shutil.copytree(source_dir, dest_dir, dirs_exist_ok=True)
                        print(f"Merged (Copied) '{source_dir}' to '{dest_dir}'.")
                else:
                    try:
                        shutil.move(source_dir, dest_dir)
                        print(f"Moved '{source_dir}' to '{dest_dir}'.")
                    except Exception as e:
                        print(f"Error moving '{source_dir}' to '{dest_dir}': {e}")
                found = True
                break
        if not found:
            print(f"Model folder '{model_name}' not found in any of the search directories.")

uqdd.metrics.analysis.load_predictions

load_predictions(model_path: str) -> pd.DataFrame

Load pickled predictions from a model directory.

Parameters:

Name Type Description Default
model_path str

Path to the model directory containing 'preds.pkl'.

required

Returns:

Type Description
DataFrame

DataFrame loaded from the pickle file.

Source code in uqdd/metrics/analysis.py
def load_predictions(model_path: str) -> pd.DataFrame:
    """
    Load pickled predictions from a model directory.

    Parameters
    ----------
    model_path : str
        Path to the model directory containing 'preds.pkl'.

    Returns
    -------
    pd.DataFrame
        DataFrame loaded from the pickle file.
    """
    preds_path = os.path.join(model_path, "preds.pkl")
    return pd.read_pickle(preds_path)
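
Usage is a one-liner; the path below is a placeholder and must point to a directory holding the exported `preds.pkl`:

```python
from uqdd.metrics import load_predictions

preds = load_predictions("runs/papyrus_xc50/pnn_seed0")  # placeholder model directory
print(preds.columns.tolist())  # typically includes y_true, y_pred, y_alea, y_eps
```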

uqdd.metrics.analysis.calculate_rmse_rejection_curve

calculate_rmse_rejection_curve(preds: DataFrame, uncertainty_col: str = 'y_alea', true_label_col: str = 'y_true', pred_label_col: str = 'y_pred', normalize_rmse: bool = False, random_rejection: bool = False, unc_type: Optional[str] = None, max_rejection_ratio: float = 0.95) -> Tuple[np.ndarray, np.ndarray, float]

Compute RMSE vs. rejection rate curve and its AUC by rejecting high-uncertainty predictions.

Parameters:

Name Type Description Default
preds DataFrame

DataFrame with columns for true labels, predicted labels, and uncertainty components.

required
uncertainty_col str

Column name for uncertainty to sort by if unc_type is None. Default is "y_alea".

'y_alea'
true_label_col str

Column name for true labels. Default is "y_true".

'y_true'
pred_label_col str

Column name for predicted labels. Default is "y_pred".

'y_pred'
normalize_rmse bool

If True, normalize RMSE by the initial RMSE before rejection. Default is False.

False
random_rejection bool

If True, randomly reject samples instead of sorting by uncertainty. Default is False.

False
unc_type {"aleatoric", "epistemic", "both"} or None

Which uncertainty to use. If "both", sums aleatoric and epistemic. If None, use uncertainty_col.

None
max_rejection_ratio float

Maximum fraction of samples to reject (exclusive of the tail). Default is 0.95.

0.95

Returns:

Type Description
(ndarray, ndarray, float)

Tuple of (rejection_rates, rmses, AUC of the RMSE–rejection curve).

Raises:

Type Description
ValueError

If unc_type is invalid or uncertainty_col is not present when needed.

Source code in uqdd/metrics/analysis.py
def calculate_rmse_rejection_curve(
    preds: pd.DataFrame,
    uncertainty_col: str = "y_alea",
    true_label_col: str = "y_true",
    pred_label_col: str = "y_pred",
    normalize_rmse: bool = False,
    random_rejection: bool = False,
    unc_type: Optional[str] = None,
    max_rejection_ratio: float = 0.95,
) -> Tuple[np.ndarray, np.ndarray, float]:
    """
    Compute RMSE vs. rejection rate curve and its AUC by rejecting high-uncertainty predictions.

    Parameters
    ----------
    preds : pd.DataFrame
        DataFrame with columns for true labels, predicted labels, and uncertainty components.
    uncertainty_col : str, optional
        Column name for uncertainty to sort by if `unc_type` is None. Default is "y_alea".
    true_label_col : str, optional
        Column name for true labels. Default is "y_true".
    pred_label_col : str, optional
        Column name for predicted labels. Default is "y_pred".
    normalize_rmse : bool, optional
        If True, normalize RMSE by the initial RMSE before rejection. Default is False.
    random_rejection : bool, optional
        If True, randomly reject samples instead of sorting by uncertainty. Default is False.
    unc_type : {"aleatoric", "epistemic", "both"} or None, optional
        Which uncertainty to use. If "both", sums aleatoric and epistemic. If None, use `uncertainty_col`.
    max_rejection_ratio : float, optional
        Maximum fraction of samples to reject (exclusive of the tail). Default is 0.95.

    Returns
    -------
    (numpy.ndarray, numpy.ndarray, float)
        Tuple of (rejection_rates, rmses, AUC of the RMSE–rejection curve).

    Raises
    ------
    ValueError
        If `unc_type` is invalid or `uncertainty_col` is not present when needed.
    """
    if unc_type == "aleatoric":
        uncertainty_col = "y_alea"
    elif unc_type == "epistemic":
        uncertainty_col = "y_eps"
    elif unc_type == "both":
        preds["y_unc"] = preds["y_alea"] + preds["y_eps"]
        uncertainty_col = "y_unc"
    elif unc_type is None and uncertainty_col in preds.columns:
        pass
    else:
        raise ValueError(
            "Either provide valid uncertainty type or provide the uncertainty column name in the DataFrame"
        )

    if random_rejection:
        preds = preds.sample(frac=max_rejection_ratio).reset_index(drop=True)
    else:
        preds = preds.sort_values(by=uncertainty_col, ascending=False)

    max_rejection_index = int(len(preds) * max_rejection_ratio)
    step = max(1, int(len(preds) * 0.01))
    rejection_steps = np.arange(0, max_rejection_index, step=step)
    rejection_rates = rejection_steps / len(preds)
    rmses = []

    initial_rmse = mean_squared_error(preds[true_label_col], preds[pred_label_col], squared=False)

    for i in rejection_steps:
        selected_preds = preds.iloc[i:]
        rmse = mean_squared_error(selected_preds[true_label_col], selected_preds[pred_label_col], squared=False)
        if normalize_rmse:
            rmse /= initial_rmse
        rmses.append(rmse)
    auc_arc = auc(rejection_rates, rmses)
    return rejection_rates, np.array(rmses), float(auc_arc)
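
A self-contained sketch on synthetic predictions, where the error is made to scale with the reported aleatoric uncertainty so that rejecting high-uncertainty samples lowers the RMSE:

```python
import numpy as np
import pandas as pd

from uqdd.metrics import calculate_rmse_rejection_curve

rng = np.random.default_rng(0)
n = 1000
y_true = rng.normal(6.5, 1.0, size=n)
y_alea = rng.uniform(0.1, 1.0, size=n)
y_pred = y_true + rng.normal(0.0, 1.0, size=n) * y_alea   # larger uncertainty -> larger error
preds = pd.DataFrame({"y_true": y_true, "y_pred": y_pred, "y_alea": y_alea})

rates, rmses, auc_rrc = calculate_rmse_rejection_curve(preds, unc_type="aleatoric")
print(f"AUC-RRC: {auc_rrc:.3f}")   # lower is better for informative uncertainties
```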

uqdd.metrics.analysis.calculate_rejection_curve

calculate_rejection_curve(df: DataFrame, model_paths: List[str], unc_col: str, random_rejection: bool = False, normalize_rmse: bool = False, max_rejection_ratio: float = 0.95) -> Tuple[np.ndarray, np.ndarray, np.ndarray, float, float]

Aggregate RMSE–rejection curves across models and compute mean/std and AUC statistics.

Parameters:

Name Type Description Default
df DataFrame

Auxiliary DataFrame (not used directly, kept for API symmetry).

required
model_paths list of str

Paths to model directories containing 'preds.pkl'.

required
unc_col str

Uncertainty column name to use when computing curves (e.g., 'y_alea' or 'y_eps').

required
random_rejection bool

If True, randomly reject samples. Default is False.

False
normalize_rmse bool

If True, normalize RMSE by the initial RMSE. Default is False.

False
max_rejection_ratio float

Maximum fraction of samples to reject. Default is 0.95.

0.95

Returns:

Type Description
(ndarray, ndarray, ndarray, float, float)

Tuple of (rejection_rates, mean_rmses, std_rmses, mean_auc, std_auc).

Source code in uqdd/metrics/analysis.py
def calculate_rejection_curve(
    df: pd.DataFrame,
    model_paths: List[str],
    unc_col: str,
    random_rejection: bool = False,
    normalize_rmse: bool = False,
    max_rejection_ratio: float = 0.95,
) -> Tuple[np.ndarray, np.ndarray, np.ndarray, float, float]:
    """
    Aggregate RMSE–rejection curves across models and compute mean/std and AUC statistics.

    Parameters
    ----------
    df : pd.DataFrame
        Auxiliary DataFrame (not used directly, kept for API symmetry).
    model_paths : list of str
        Paths to model directories containing 'preds.pkl'.
    unc_col : str
        Uncertainty column name to use when computing curves (e.g., 'y_alea' or 'y_eps').
    random_rejection : bool, optional
        If True, randomly reject samples. Default is False.
    normalize_rmse : bool, optional
        If True, normalize RMSE by the initial RMSE. Default is False.
    max_rejection_ratio : float, optional
        Maximum fraction of samples to reject. Default is 0.95.

    Returns
    -------
    (numpy.ndarray, numpy.ndarray, numpy.ndarray, float, float)
        Tuple of (rejection_rates, mean_rmses, std_rmses, mean_auc, std_auc).
    """
    aggregated_rmses = []
    auc_values = []
    rejection_rates = None

    for model_path in model_paths:
        preds = load_predictions(model_path)
        if preds.empty:
            print(f"Preds not loaded for model: {model_path}")
            continue
        rejection_rates, rmses, auc_arc = calculate_rmse_rejection_curve(
            preds,
            uncertainty_col=unc_col,
            random_rejection=random_rejection,
            normalize_rmse=normalize_rmse,
            max_rejection_ratio=max_rejection_ratio,
        )
        aggregated_rmses.append(rmses)
        auc_values.append(auc_arc)

    mean_rmses = np.mean(aggregated_rmses, axis=0)
    std_rmses = np.std(aggregated_rmses, axis=0)
    mean_auc = np.mean(auc_values)
    std_auc = np.std(auc_values)
    return rejection_rates, mean_rmses, std_rmses, float(mean_auc), float(std_auc)

uqdd.metrics.analysis.get_handles_labels

get_handles_labels(ax: Axes, group_order: List[str]) -> Tuple[List, List[str]]

Extract legend handles/labels ordered by group prefix.

Parameters:

Name Type Description Default
ax Axes

Axes object from which to retrieve legend entries.

required
group_order list of str

Group prefixes to order legend entries by.

required

Returns:

Type Description
(list, list of str)

Ordered handles and labels.

Source code in uqdd/metrics/analysis.py
def get_handles_labels(ax: plt.Axes, group_order: List[str]) -> Tuple[List, List[str]]:
    """
    Extract legend handles/labels ordered by group prefix.

    Parameters
    ----------
    ax : matplotlib.axes.Axes
        Axes object from which to retrieve legend entries.
    group_order : list of str
        Group prefixes to order legend entries by.

    Returns
    -------
    (list, list of str)
        Ordered handles and labels.
    """
    handles, labels = ax.get_legend_handles_labels()
    ordered_handles = []
    ordered_labels = []
    for group in group_order:
        for label, handle in zip(labels, handles):
            if label.startswith(group):
                ordered_handles.append(handle)
                ordered_labels.append(label)
    return ordered_handles, ordered_labels

uqdd.metrics.analysis.plot_rmse_rejection_curves

plot_rmse_rejection_curves(df: DataFrame, base_dir: str, cmap: str = 'tab10_r', color_dict: Optional[Dict[str, str]] = None, save_dir_plot: Optional[str] = None, add_to_title: str = '', normalize_rmse: bool = False, unc_type: str = 'aleatoric', max_rejection_ratio: float = 0.95, group_order: Optional[List[str]] = None, fig_width: Optional[float] = None, fig_height: Optional[float] = None, show_legend: bool = False) -> pd.DataFrame

Plot RMSE–rejection curves per group, including random rejection baselines, and summarize AUCs.

Parameters:

Name Type Description Default
df DataFrame

DataFrame containing columns 'Group', 'Split', and 'project_model'.

required
base_dir str

Base directory where model paths are located.

required
cmap str

Colormap name used to derive distinct colors per group. Default is "tab10_r".

'tab10_r'
color_dict dict[str, str] or None

Precomputed color mapping from group to color. If None, one is generated.

None
save_dir_plot str or None

Directory to save the plot images. Default is None.

None
add_to_title str

Suffix for the plot filename and title. Default is empty string.

''
normalize_rmse bool

If True, normalize RMSE by initial RMSE. Default is False.

False
unc_type {"aleatoric", "epistemic", "both"}

Uncertainty component to use for rejection. Default is "aleatoric".

"aleatoric"
max_rejection_ratio float

Maximum fraction of samples to reject. Default is 0.95.

0.95
group_order list of str or None

Order of groups in the legend. Default derives from data.

None
fig_width float or None

Plot width. Default is 6.

None
fig_height float or None

Plot height. Default is 6.

None
show_legend bool

If True, include a legend. Default is False.

False

Returns:

Type Description
DataFrame

Summary DataFrame with columns ['Model type', 'Split', 'Group', 'AUC-RRC_mean', 'AUC-RRC_std'].

Source code in uqdd/metrics/analysis.py
def plot_rmse_rejection_curves(
    df: pd.DataFrame,
    base_dir: str,
    cmap: str = "tab10_r",
    color_dict: Optional[Dict[str, str]] = None,
    save_dir_plot: Optional[str] = None,
    add_to_title: str = "",
    normalize_rmse: bool = False,
    unc_type: str = "aleatoric",
    max_rejection_ratio: float = 0.95,
    group_order: Optional[List[str]] = None,
    fig_width: Optional[float] = None,
    fig_height: Optional[float] = None,
    show_legend: bool = False,
) -> pd.DataFrame:
    """
    Plot RMSE–rejection curves per group, including random rejection baselines, and summarize AUCs.

    Parameters
    ----------
    df : pd.DataFrame
        DataFrame containing columns 'Group', 'Split', and 'project_model'.
    base_dir : str
        Base directory where model paths are located.
    cmap : str, optional
        Colormap name used to derive distinct colors per group. Default is "tab10_r".
    color_dict : dict[str, str] or None, optional
        Precomputed color mapping from group to color. If None, one is generated.
    save_dir_plot : str or None, optional
        Directory to save the plot images. Default is None.
    add_to_title : str, optional
        Suffix for the plot filename and title. Default is empty string.
    normalize_rmse : bool, optional
        If True, normalize RMSE by initial RMSE. Default is False.
    unc_type : {"aleatoric", "epistemic", "both"}, optional
        Uncertainty component to use for rejection. Default is "aleatoric".
    max_rejection_ratio : float, optional
        Maximum fraction of samples to reject. Default is 0.95.
    group_order : list of str or None, optional
        Order of groups in the legend. Default derives from data.
    fig_width : float or None, optional
        Plot width. Default is 6.
    fig_height : float or None, optional
        Plot height. Default is 6.
    show_legend : bool, optional
        If True, include a legend. Default is False.

    Returns
    -------
    pd.DataFrame
        Summary DataFrame with columns ['Model type', 'Split', 'Group', 'AUC-RRC_mean', 'AUC-RRC_std'].
    """
    assert unc_type in ["aleatoric", "epistemic", "both"], "Invalid unc_type"
    unc_col = "y_alea" if unc_type == "aleatoric" else "y_eps"

    plot_width = fig_width if fig_width else 6
    plot_height = fig_height if fig_height else 6
    total_width = plot_width + 4
    total_height = plot_height + 2

    fig = plt.figure(figsize=(total_width, total_height))
    gs = gridspec.GridSpec(1, 1, figure=fig, left=0.15, right=0.75, top=0.9, bottom=0.15)
    ax = fig.add_subplot(gs[0])
    ax.set_position([0.15, 0.15, plot_width / total_width, plot_height / total_height])

    if group_order is None:
        group_order = list(df["Group"].unique())

    if color_dict is None:
        scalar_mappable = ScalarMappable(cmap=cmap)
        colors = scalar_mappable.to_rgba(range(len(group_order)))
        color_dict = {group: color for group, color in zip(group_order, colors)}

    color_dict["random reject"] = "black"

    df = df.copy()
    df.loc[:, "model_path"] = df["project_model"].apply(
        lambda x: (str(os.path.join(base_dir, x)) if not str(x).startswith(base_dir) else x)
    )

    stats_dfs = []
    included_groups = df["Group"].unique()
    legend_handles = []

    for group in included_groups:
        group_data = df[df["Group"] == group]
        model_paths = group_data["model_path"].unique()
        rejection_rates, mean_rmses, std_rmses, mean_auc, std_auc = calculate_rejection_curve(
            df, model_paths, unc_col, normalize_rmse=normalize_rmse, max_rejection_ratio=max_rejection_ratio
        )
        (line,) = ax.plot(
            rejection_rates,
            mean_rmses,
            label=f"{group} (AUC-RRC: {mean_auc:.3f} ± {std_auc:.3f})",
            color=color_dict[group],
        )
        ax.fill_between(rejection_rates, mean_rmses - std_rmses, mean_rmses + std_rmses, color=color_dict[group], alpha=0.2)
        legend_handles.append(line)
        stats_dfs.append({
            "Model type": group.rsplit("_", 1)[1],
            "Split": group.rsplit("_", 1)[0],
            "Group": group,
            "AUC-RRC_mean": mean_auc,
            "AUC-RRC_std": std_auc,
        })

    for split in df["Split"].unique():
        split_data = df[df["Split"] == split]
        model_paths = split_data["model_path"].unique()
        rejection_rates, mean_rmses, std_rmses, mean_auc, std_auc = calculate_rejection_curve(
            df, model_paths, unc_col, random_rejection=True, normalize_rmse=normalize_rmse, max_rejection_ratio=max_rejection_ratio
        )
        (line,) = ax.plot(
            rejection_rates,
            mean_rmses,
            label=f"random reject - {split} (AUC-RRC: {mean_auc:.3f} ± {std_auc:.3f})",
            color="black",
            linestyle="--",
        )
        ax.fill_between(rejection_rates, mean_rmses - std_rmses, mean_rmses + std_rmses, color="grey", alpha=0.2)
        legend_handles.append(line)
        stats_dfs.append({
            "Model type": "random reject",
            "Split": split,
            "Group": f"random reject - {split}",
            "AUC-RRC_mean": mean_auc,
            "AUC-RRC_std": std_auc,
        })

    ax.set_xlabel("Rejection Rate")
    ax.set_ylabel("RMSE" if not normalize_rmse else "Normalized RMSE")
    ax.set_xlim(0, max_rejection_ratio)
    ax.grid(True)

    if show_legend:
        ordered_handles, ordered_labels = get_handles_labels(ax, group_order)
        ordered_handles += [legend_handles[-1]]
        ordered_labels += [legend_handles[-1].get_label()]
        ax.legend(handles=ordered_handles, loc="lower left")

    plot_name = f"rmse_rejection_curve_{add_to_title}" if add_to_title else "rmse_rejection_curve"
    save_plot(fig, save_dir_plot, plot_name, tighten=True, show_legend=show_legend)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

    return pd.DataFrame(stats_dfs)
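
A sketch of the expected input (directories are placeholders): each row points at one trained model whose directory contains `preds.pkl`, and group names follow the `<split>_<model type>` pattern the function splits on.

```python
import pandas as pd

from uqdd.metrics import plot_rmse_rejection_curves

df = pd.DataFrame({
    "Group": ["stratified_pnn", "stratified_pnn",
              "stratified_ensemble", "stratified_ensemble"],
    "Split": ["stratified"] * 4,
    "project_model": ["pnn_seed0", "pnn_seed1",
                      "ensemble_seed0", "ensemble_seed1"],   # placeholder sub-directories
})

stats_df = plot_rmse_rejection_curves(
    df,
    base_dir="runs/papyrus_xc50",    # placeholder base directory
    unc_type="aleatoric",
    show_legend=True,
)
# stats_df columns: Model type, Split, Group, AUC-RRC_mean, AUC-RRC_std
```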

uqdd.metrics.analysis.plot_auc_comparison

plot_auc_comparison(stats_df: DataFrame, cmap: str = 'tab10_r', color_dict: Optional[Dict[str, str]] = None, save_dir: Optional[str] = None, add_to_title: str = '', min_y_axis: float = 0.0, hatches_dict: Optional[Dict[str, str]] = None, group_order: Optional[List[str]] = None, fig_width: Optional[float] = None, fig_height: Optional[float] = None, show_legend: bool = False) -> None

Plot bar charts comparing RRC-AUC across splits and model types, including random reject baselines.

Parameters:

Name Type Description Default
stats_df DataFrame

Summary DataFrame with columns ['Group', 'Split', 'Model type', 'AUC-RRC_mean', 'AUC-RRC_std'].

required
cmap str

Colormap name used to derive distinct colors per model type. Default is "tab10_r".

'tab10_r'
color_dict dict[str, str] or None

Precomputed color mapping from model type to color. If None, one is generated.

None
save_dir str or None

Directory to save plot images. Default is None.

None
add_to_title str

Title suffix for the plot. Default is empty string.

''
min_y_axis float

Minimum y-axis limit. Default is 0.0.

0.0
hatches_dict dict[str, str] or None

Hatch mapping for splits (e.g., {"stratified": "\\"}). If None, sensible defaults are used.

None
group_order list of str or None

Order of groups in the legend and x-axis. Default derives from data.

None
fig_width float or None

Plot width. Default is 6.

None
fig_height float or None

Plot height. Default is 6.

None
show_legend bool

If True, include a legend. Default is False.

False

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def plot_auc_comparison(
    stats_df: pd.DataFrame,
    cmap: str = "tab10_r",
    color_dict: Optional[Dict[str, str]] = None,
    save_dir: Optional[str] = None,
    add_to_title: str = "",
    min_y_axis: float = 0.0,
    hatches_dict: Optional[Dict[str, str]] = None,
    group_order: Optional[List[str]] = None,
    fig_width: Optional[float] = None,
    fig_height: Optional[float] = None,
    show_legend: bool = False,
) -> None:
    """
    Plot bar charts comparing RRC-AUC across splits and model types, including random reject baselines.

    Parameters
    ----------
    stats_df : pd.DataFrame
        Summary DataFrame with columns ['Group', 'Split', 'Model type', 'AUC-RRC_mean', 'AUC-RRC_std'].
    cmap : str, optional
        Colormap name used to derive distinct colors per model type. Default is "tab10_r".
    color_dict : dict[str, str] or None, optional
        Precomputed color mapping from model type to color. If None, one is generated.
    save_dir : str or None, optional
        Directory to save plot images. Default is None.
    add_to_title : str, optional
        Title suffix for the plot. Default is empty string.
    min_y_axis : float, optional
        Minimum y-axis limit. Default is 0.0.
    hatches_dict : dict[str, str] or None, optional
        Hatch mapping for splits (e.g., {"stratified": "\\\\"}). Default uses sensible defaults.
    group_order : list of str or None, optional
        Order of groups in the legend and x-axis. Default derives from data.
    fig_width : float or None, optional
        Plot width. Default is 6.
    fig_height : float or None, optional
        Plot height. Default is 6.
    show_legend : bool, optional
        If True, include a legend. Default is False.

    Returns
    -------
    None
    """
    if hatches_dict is None:
        hatches_dict = {"stratified": "\\\\", "scaffold_cluster": "", "time": "/\\/\\/"}

    if group_order:
        all_groups = group_order + list(stats_df.loc[stats_df["Group"].str.startswith("random reject"), "Group"].unique())
        stats_df["Group"] = pd.Categorical(stats_df["Group"], categories=all_groups, ordered=True)
    else:
        all_groups = stats_df["Group"].unique().tolist()

    stats_df = stats_df.sort_values("Group").reset_index(drop=True)

    splits = list(hatches_dict.keys())
    stats_df.loc[:, "Split"] = pd.Categorical(stats_df["Split"], categories=splits, ordered=True)
    stats_df = stats_df.sort_values("Split").reset_index(drop=True)

    unique_model_types = stats_df.loc[stats_df["Model type"] != "random reject", "Model type"].unique()

    if color_dict is None:
        scalar_mappable = ScalarMappable(cmap=cmap)
        colors = scalar_mappable.to_rgba(range(len(unique_model_types)))
        color_dict = {model: color for model, color in zip(unique_model_types, colors)}
    color_dict["random reject"] = "black"

    unique_model_types = np.append(unique_model_types, "random reject")

    bar_width = 0.12
    group_spacing = 0.6

    plot_width = fig_width if fig_width else 6
    plot_height = fig_height if fig_height else 6
    total_width = plot_width + 4
    total_height = plot_height + 4

    fig = plt.figure(figsize=(total_width, total_height))
    gs = gridspec.GridSpec(1, 1, figure=fig, left=0.15, right=0.75, top=0.9, bottom=0.15)
    ax = fig.add_subplot(gs[0])
    ax.set_position([0.15, 0.15, plot_width / total_width, plot_height / total_height])

    tick_positions = []
    tick_labels = []

    for i, split in enumerate(splits):
        split_data = stats_df[stats_df["Split"] == split]
        split_data.loc[:, "Group"] = pd.Categorical(split_data["Group"], categories=all_groups, ordered=True)
        for j, (_, row) in enumerate(split_data.iterrows()):
            position = i * (len(unique_model_types) * bar_width + group_spacing) + j * bar_width
            ax.bar(
                position,
                height=row["AUC-RRC_mean"],
                yerr=row["AUC-RRC_std"],
                color=color_dict[row["Model type"]],
                edgecolor="white" if row["Model type"] == "random reject" else "black",
                hatch=hatches_dict[row["Split"]],
                width=bar_width,
            )
        center_position = i * (len(unique_model_types) * bar_width + group_spacing) + (len(unique_model_types) * bar_width) / 2
        tick_positions.append(center_position)
        tick_labels.append(split)

    def create_stats_legend(color_dict: Dict[str, str], hatches_dict: Dict[str, str], splits: List[str], model_types: Union[List[str], np.ndarray]):
        patches = []
        for split in splits:
            for model in model_types:
                label = f"{split} {model}"
                hatch_color = "white" if model == "random reject" else "black"
                patch = mpatches.Patch(facecolor=color_dict[model], hatch=hatches_dict[split], edgecolor=hatch_color, label=label)
                patches.append(patch)
        return patches

    if show_legend:
        legend_elements = create_stats_legend(color_dict, hatches_dict, splits, unique_model_types)
        ax.legend(handles=legend_elements, bbox_to_anchor=(1.05, 1), loc="upper left", borderaxespad=0, frameon=False)

    ax.set_xticks(tick_positions)
    ax.set_xticklabels(tick_labels, rotation=45, ha="right", rotation_mode="anchor", fontsize=9)
    ax.set_ylabel("RRC-AUC")
    ax.set_ylim(min_y_axis, 1.0)

    plot_name = f"auc_comparison_barplot_{cmap}" + (f"_{add_to_title}" if add_to_title else "")
    save_plot(fig, save_dir, plot_name, tighten=True, show_legend=show_legend)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
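
A sketch using a hand-written summary table in the shape returned by plot_rmse_rejection_curves (all values are illustrative only):

```python
import pandas as pd

from uqdd.metrics import plot_auc_comparison

stats_df = pd.DataFrame([
    {"Model type": "pnn", "Split": "stratified", "Group": "stratified_pnn",
     "AUC-RRC_mean": 0.41, "AUC-RRC_std": 0.02},
    {"Model type": "ensemble", "Split": "stratified", "Group": "stratified_ensemble",
     "AUC-RRC_mean": 0.38, "AUC-RRC_std": 0.03},
    {"Model type": "random reject", "Split": "stratified",
     "Group": "random reject - stratified",
     "AUC-RRC_mean": 0.55, "AUC-RRC_std": 0.01},
])

plot_auc_comparison(stats_df, add_to_title="aleatoric", show_legend=True)
```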

uqdd.metrics.analysis.save_stats_df

save_stats_df(stats_df: DataFrame, save_dir: str, add_to_title: str = '') -> None

Save a stats DataFrame to CSV in a given directory.

Parameters:

Name Type Description Default
stats_df DataFrame

DataFrame to save.

required
save_dir str

Target directory to save the CSV.

required
add_to_title str

Suffix to append to the filename. Default is empty string.

''

Returns:

Type Description
None
Source code in uqdd/metrics/analysis.py
def save_stats_df(stats_df: pd.DataFrame, save_dir: str, add_to_title: str = "") -> None:
    """
    Save a stats DataFrame to CSV in a given directory.

    Parameters
    ----------
    stats_df : pd.DataFrame
        DataFrame to save.
    save_dir : str
        Target directory to save the CSV.
    add_to_title : str, optional
        Suffix to append to the filename. Default is empty string.

    Returns
    -------
    None
    """
    os.makedirs(save_dir, exist_ok=True)
    stats_df.to_csv(os.path.join(save_dir, f"stats_df_{add_to_title}.csv"), index=False)

uqdd.metrics.analysis.load_stats_df

load_stats_df(save_dir: str, add_to_title: str = '') -> pd.DataFrame

Load a stats DataFrame from CSV in a given directory.

Parameters:

Name Type Description Default
save_dir str

Directory containing the CSV.

required
add_to_title str

Suffix appended to the filename. Default is empty string.

''

Returns:

Type Description
DataFrame

Loaded DataFrame.

Source code in uqdd/metrics/analysis.py
def load_stats_df(save_dir: str, add_to_title: str = "") -> pd.DataFrame:
    """
    Load a stats DataFrame from CSV in a given directory.

    Parameters
    ----------
    save_dir : str
        Directory containing the CSV.
    add_to_title : str, optional
        Suffix appended to the filename. Default is empty string.

    Returns
    -------
    pd.DataFrame
        Loaded DataFrame.
    """
    return pd.read_csv(os.path.join(save_dir, f"stats_df_{add_to_title}.csv"))
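
A minimal round-trip sketch for the two helpers above (the directory is a placeholder):

```python
import pandas as pd

from uqdd.metrics import load_stats_df, save_stats_df

stats_df = pd.DataFrame({"Group": ["stratified_pnn"],
                         "AUC-RRC_mean": [0.41], "AUC-RRC_std": [0.02]})

save_stats_df(stats_df, save_dir="results/rrc", add_to_title="aleatoric")  # writes stats_df_aleatoric.csv
reloaded = load_stats_df("results/rrc", add_to_title="aleatoric")
assert list(reloaded.columns) == list(stats_df.columns)
```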

uqdd.metrics.constants

uqdd.metrics.reassessment

Model reassessment utilities: loading trained models, generating predictions, computing NLL, exporting artifacts, and recalibrating with isotonic regression.

This module wires together model loaders and predictors to re-run evaluation on saved runs, export standardized prediction pickles, append NLL to CSV logs, and apply isotonic recalibration using validation data.

uqdd.metrics.reassessment.nll_evidentials

nll_evidentials(evidential_model, test_dataloader, model_type: str = 'evidential', num_mc_samples: int = 100, device=DEVICE)

Compute negative log-likelihood (NLL) for evidential-style models.

Parameters:

Name Type Description Default
evidential_model Module

Trained model instance.

required
test_dataloader DataLoader

DataLoader providing test set batches.

required
model_type {"evidential", "eoe", "emc"}

Model family determining the NLL backend. Default is "evidential".

"evidential"
num_mc_samples int

Number of MC samples for EMC models. Default is 100.

100
device device

Device to run evaluation on. Default uses DEVICE.

DEVICE

Returns:

Type Description
float or None

Scalar NLL if supported by the model type; None otherwise.

Source code in uqdd/metrics/reassessment.py
def nll_evidentials(
    evidential_model,
    test_dataloader,
    model_type: str = "evidential",
    num_mc_samples: int = 100,
    device=DEVICE,
):
    """
    Compute negative log-likelihood (NLL) for evidential-style models.

    Parameters
    ----------
    evidential_model : torch.nn.Module
        Trained model instance.
    test_dataloader : torch.utils.data.DataLoader
        DataLoader providing test set batches.
    model_type : {"evidential", "eoe", "emc"}, optional
        Model family determining the NLL backend. Default is "evidential".
    num_mc_samples : int, optional
        Number of MC samples for EMC models. Default is 100.
    device : torch.device, optional
        Device to run evaluation on. Default uses `DEVICE`.

    Returns
    -------
    float or None
        Scalar NLL if supported by the model type; None otherwise.
    """
    if model_type in ["evidential", "eoe"]:
        return ev_nll(evidential_model, test_dataloader, device=device)
    elif model_type == "emc":
        return emc_nll(evidential_model, test_dataloader, num_mc_samples=num_mc_samples, device=device)
    else:
        return None

uqdd.metrics.reassessment.convert_to_list

convert_to_list(val)

Parse a string representation of a Python list to a list; pass through non-strings.

Parameters:

Name Type Description Default
val str or any

Input value, possibly a string encoding of a list.

required

Returns:

Type Description
list

Parsed list if val is a valid string list, empty list on parse failure.

any

Original value if not a string.

Notes
  • Uses ast.literal_eval for safe evaluation.
  • Prints a warning and returns [] when parsing fails.
Source code in uqdd/metrics/reassessment.py
def convert_to_list(val):
    """
    Parse a string representation of a Python list to a list; pass through non-strings.

    Parameters
    ----------
    val : str or any
        Input value, possibly a string encoding of a list.

    Returns
    -------
    list
        Parsed list if `val` is a valid string list, empty list on parse failure.
    any
        Original value if not a string.

    Notes
    -----
    - Uses `ast.literal_eval` for safe evaluation.
    - Prints a warning and returns [] when parsing fails.
    """
    if isinstance(val, str):
        try:
            parsed_val = ast.literal_eval(val)
            if isinstance(parsed_val, list):
                return parsed_val
            else:
                return []
        except (SyntaxError, ValueError):
            print(f"Warning: Unable to parse value {val}, returning empty list.")
            return []
    return val
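
Behaviour in the three cases described above:

```python
from uqdd.metrics import convert_to_list

convert_to_list("[512, 256, 128]")   # -> [512, 256, 128]
convert_to_list("not a list")        # -> [] (a warning is printed)
convert_to_list([64, 32])            # -> [64, 32] (non-strings pass through unchanged)
```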

uqdd.metrics.reassessment.preprocess_runs

preprocess_runs(runs_path: str, models_dir: str = MODELS_DIR, data_name: str = 'papyrus', activity_type: str = 'xc50', descriptor_protein: str = 'ankh-large', descriptor_chemical: str = 'ecfp2048', data_specific_path: str = 'papyrus/xc50/all', prot_input_dim: int = 1536, chem_input_dim: int = 2048) -> pd.DataFrame

Read a runs CSV and enrich with resolved model paths and descriptor metadata.

Parameters:

Name Type Description Default
runs_path str

Path to the CSV file containing run metadata.

required
models_dir str

Directory containing trained model .pt files. Default uses MODELS_DIR.

MODELS_DIR
data_name str

Dataset identifier. Default is "papyrus".

'papyrus'
activity_type str

Activity type (e.g., "xc50", "kc"). Default is "xc50".

'xc50'
descriptor_protein str

Protein descriptor type. Default is "ankh-large".

'ankh-large'
descriptor_chemical str

Chemical descriptor type. Default is "ecfp2048".

'ecfp2048'
data_specific_path str

Subpath encoding dataset context for figures/exports. Default is "papyrus/xc50/all".

'papyrus/xc50/all'
prot_input_dim int

Protein input dimensionality. Default is 1536.

1536
chem_input_dim int

Chemical input dimensionality. Default is 2048.

2048

Returns:

Type Description
DataFrame

Preprocessed runs DataFrame with columns like 'model_name', 'model_path', and descriptor fields.

Notes
  • Resolves model_name to actual .pt files via glob and sets 'model_path'.
  • Adds multi-task flag 'MT' from 'n_targets' > 1.
  • Converts layer columns from strings to lists using convert_to_list.
Source code in uqdd/metrics/reassessment.py
def preprocess_runs(
    runs_path: str,
    models_dir: str = MODELS_DIR,
    data_name: str = "papyrus",
    activity_type: str = "xc50",
    descriptor_protein: str = "ankh-large",
    descriptor_chemical: str = "ecfp2048",
    data_specific_path: str = "papyrus/xc50/all",
    prot_input_dim: int = 1536,
    chem_input_dim: int = 2048,
) -> pd.DataFrame:
    """
    Read a runs CSV and enrich with resolved model paths and descriptor metadata.

    Parameters
    ----------
    runs_path : str
        Path to the CSV file containing run metadata.
    models_dir : str, optional
        Directory containing trained model .pt files. Default uses `MODELS_DIR`.
    data_name : str, optional
        Dataset identifier. Default is "papyrus".
    activity_type : str, optional
        Activity type (e.g., "xc50", "kc"). Default is "xc50".
    descriptor_protein : str, optional
        Protein descriptor type. Default is "ankh-large".
    descriptor_chemical : str, optional
        Chemical descriptor type. Default is "ecfp2048".
    data_specific_path : str, optional
        Subpath encoding dataset context for figures/exports. Default is "papyrus/xc50/all".
    prot_input_dim : int, optional
        Protein input dimensionality. Default is 1536.
    chem_input_dim : int, optional
        Chemical input dimensionality. Default is 2048.

    Returns
    -------
    pd.DataFrame
        Preprocessed runs DataFrame with columns like 'model_name', 'model_path', and descriptor fields.

    Notes
    -----
    - Resolves `model_name` to actual .pt files via glob and sets 'model_path'.
    - Adds multi-task flag 'MT' from 'n_targets' > 1.
    - Converts layer columns from strings to lists using `convert_to_list`.
    """
    runs_df = pd.read_csv(
        runs_path,
        converters={
            "chem_layers": convert_to_list,
            "prot_layers": convert_to_list,
            "regressor_layers": convert_to_list,
        },
    )
    runs_df.rename(columns={"Name": "run_name"}, inplace=True)
    i = 1
    for index, row in runs_df.iterrows():
        model_name = row["model_name"] if not pd.isna(row["model_name"]) else row["run_name"]
        model_file_pattern = os.path.join(models_dir, f"*{model_name}.pt")
        model_files = glob.glob(model_file_pattern)
        if model_files:
            model_file_path = model_files[0]
            model_name = os.path.basename(model_file_path).replace(".pt", "")
            runs_df.at[index, "model_name"] = model_name
            runs_df.at[index, "model_path"] = model_file_path
        else:
            print(f"{i} Model file(s) not found for {model_name} \n with pattern {model_file_pattern}")
            runs_df.at[index, "model_path"] = ""
            i += 1
    runs_df["data_name"] = data_name
    runs_df["activity_type"] = activity_type
    runs_df["descriptor_protein"] = descriptor_protein
    runs_df["descriptor_chemical"] = descriptor_chemical
    runs_df["chem_input_dim"] = chem_input_dim
    runs_df["prot_input_dim"] = prot_input_dim
    runs_df["data_specific_path"] = data_specific_path
    runs_df["MT"] = runs_df["n_targets"].apply(lambda x: True if x > 1 else False)
    return runs_df
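
A minimal usage sketch; the CSV path is illustrative, and the file is assumed to provide the columns this function reads (e.g., 'Name', 'model_name', 'n_targets', and the layer columns):

from uqdd.metrics.reassessment import preprocess_runs

# Hypothetical export of run metadata; adjust the path to your own runs CSV.
runs_df = preprocess_runs("runs_export.csv")

# Rows whose checkpoint could not be resolved carry an empty 'model_path'.
resolved = runs_df[runs_df["model_path"] != ""]
print(resolved[["run_name", "model_name", "model_path", "MT"]].head())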

uqdd.metrics.reassessment.get_model_class

get_model_class(model_type: str)

Map a model type name to the corresponding class.

Parameters:

Name Type Description Default
model_type str

Model type identifier (e.g., "pnn", "ensemble", "evidential", "eoe", "emc", "mcdropout").

required

Returns:

Type Description
type

Model class matching the type.

Raises:

Type Description
ValueError

If the model_type is not recognized.

Source code in uqdd/metrics/reassessment.py
def get_model_class(model_type: str):
    """
    Map a model type name to the corresponding class.

    Parameters
    ----------
    model_type : str
        Model type identifier (e.g., "pnn", "ensemble", "evidential", "eoe", "emc", "mcdropout").

    Returns
    -------
    type
        Model class matching the type.

    Raises
    ------
    ValueError
        If the `model_type` is not recognized.
    """
    if model_type.lower() in ["pnn", "mcdropout"]:
        return PNN
    elif model_type.lower() == "ensemble":
        return EnsembleDNN
    elif model_type.lower() in ["evidential", "emc"]:
        return EvidentialDNN
    elif model_type.lower() == "eoe":
        return EoEDNN
    else:
        raise ValueError(f"Model type {model_type} not recognized")
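
A quick sketch of the mapping implemented above:

from uqdd.metrics.reassessment import get_model_class

ensemble_cls = get_model_class("ensemble")    # EnsembleDNN
mc_cls = get_model_class("mcdropout")         # MC Dropout reuses the PNN class
try:
    get_model_class("transformer")            # unsupported identifier
except ValueError as err:
    print(err)                                # "Model type transformer not recognized"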

uqdd.metrics.reassessment.get_predict_fn

get_predict_fn(model_type: str, num_mc_samples: int = 100)

Get the appropriate predict function and kwargs for a given model type.

Parameters:

Name Type Description Default
model_type str

Model type identifier.

required
num_mc_samples int

Number of MC samples for MC Dropout or EMC models. Default is 100.

100

Returns:

Type Description
(callable, dict)

Tuple of (predict_function, keyword_arguments).

Raises:

Type Description
ValueError

If the model_type is not recognized.

Source code in uqdd/metrics/reassessment.py
def get_predict_fn(model_type: str, num_mc_samples: int = 100):
    """
    Get the appropriate predict function and kwargs for a given model type.

    Parameters
    ----------
    model_type : str
        Model type identifier.
    num_mc_samples : int, optional
        Number of MC samples for MC Dropout or EMC models. Default is 100.

    Returns
    -------
    (callable, dict)
        Tuple of (predict_function, keyword_arguments).

    Raises
    ------
    ValueError
        If the `model_type` is not recognized.
    """
    if model_type.lower() == "mcdropout":
        return mc_predict, {"num_mc_samples": num_mc_samples}
    elif model_type.lower() in ["ensemble", "pnn"]:
        return predict, {}
    elif model_type.lower() in ["evidential", "eoe"]:
        return ev_predict, {}
    elif model_type.lower() == "emc":
        return emc_predict, {"num_mc_samples": num_mc_samples}
    else:
        raise ValueError(f"Model type {model_type} not recognized")
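
A short sketch of how the returned pair is typically used downstream:

from uqdd.metrics.reassessment import get_predict_fn

predict_fn, predict_kwargs = get_predict_fn("mcdropout", num_mc_samples=50)
# predict_fn is the MC-dropout predictor; predict_kwargs == {"num_mc_samples": 50}

predict_fn, predict_kwargs = get_predict_fn("pnn")
# deterministic-style predictors come back with an empty kwargs dict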

uqdd.metrics.reassessment.get_preds

get_preds(model, dataloaders, model_type: str, subset: str = 'test', num_mc_samples: int = 100)

Run inference and unpack predictions for the requested subset.

Parameters:

Name Type Description Default
model Module

Trained model instance.

required
dataloaders dict

Dictionary of DataLoaders keyed by subset (e.g., 'train', 'val', 'test').

required
model_type str

Model type determining the predict function and outputs.

required
subset str

Subset key to use from dataloaders. Default is "test".

'test'
num_mc_samples int

Number of MC samples for stochastic predictors. Default is 100.

100

Returns:

Type Description
tuple

(preds, labels, alea_vars, epi_vars) where epi_vars may be None for non-evidential models.

Source code in uqdd/metrics/reassessment.py
def get_preds(
    model,
    dataloaders,
    model_type: str,
    subset: str = "test",
    num_mc_samples: int = 100,
):
    """
    Run inference and unpack predictions for the requested subset.

    Parameters
    ----------
    model : torch.nn.Module
        Trained model instance.
    dataloaders : dict
        Dictionary of DataLoaders keyed by subset (e.g., 'train', 'val', 'test').
    model_type : str
        Model type determining the predict function and outputs.
    subset : str, optional
        Subset key to use from `dataloaders`. Default is "test".
    num_mc_samples : int, optional
        Number of MC samples for stochastic predictors. Default is 100.

    Returns
    -------
    tuple
        (preds, labels, alea_vars, epi_vars) where `epi_vars` may be None for non-evidential models.
    """
    predict_fn, predict_kwargs = get_predict_fn(model_type, num_mc_samples=num_mc_samples)
    preds_res = predict_fn(model, dataloaders[subset], device=DEVICE, **predict_kwargs)
    if model_type in ["evidential", "eoe", "emc"]:
        preds, labels, alea_vars, epi_vars = preds_res
    else:
        preds, labels, alea_vars = preds_res
        epi_vars = None
    return preds, labels, alea_vars, epi_vars
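
A hedged sketch, assuming a trained model and a dict of DataLoaders (keyed by 'train'/'val'/'test') are already in scope:

from uqdd.metrics.reassessment import get_preds

# `model` and `dataloaders` are assumed to exist already, e.g. a reloaded
# checkpoint and the loaders produced by the project's data pipeline.
preds, labels, alea_vars, epi_vars = get_preds(
    model, dataloaders, model_type="evidential", subset="test"
)
# For "pnn"/"ensemble" predictors, epi_vars is returned as None.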

uqdd.metrics.reassessment.pkl_preds_export

pkl_preds_export(preds, labels, alea_vars, epi_vars, outpath: str, model_type: str, logger=None)

Export predictions and uncertainties to a standardized pickle and return the DataFrame.

Parameters:

Name Type Description Default
preds ndarray or Tensor

Model predictions.

required
labels ndarray or Tensor

True labels.

required
alea_vars ndarray or Tensor

Aleatoric uncertainty components.

required
epi_vars ndarray or Tensor or None

Epistemic uncertainty components, or None for non-evidential models.

required
outpath str

Output directory to write 'preds.pkl'.

required
model_type str

Model type used to guide process_preds behavior.

required
logger Logger or None

Logger for messages. Default is None.

None

Returns:

Type Description
DataFrame

DataFrame with columns [y_true, y_pred, y_err, y_alea, y_eps].

Source code in uqdd/metrics/reassessment.py
def pkl_preds_export(
    preds,
    labels,
    alea_vars,
    epi_vars,
    outpath: str,
    model_type: str,
    logger=None,
):
    """
    Export predictions and uncertainties to a standardized pickle and return the DataFrame.

    Parameters
    ----------
    preds : numpy.ndarray or torch.Tensor
        Model predictions.
    labels : numpy.ndarray or torch.Tensor
        True labels.
    alea_vars : numpy.ndarray or torch.Tensor
        Aleatoric uncertainty components.
    epi_vars : numpy.ndarray or torch.Tensor or None
        Epistemic uncertainty components, or None for non-evidential models.
    outpath : str
        Output directory to write 'preds.pkl'.
    model_type : str
        Model type used to guide `process_preds` behavior.
    logger : logging.Logger or None, optional
        Logger for messages. Default is None.

    Returns
    -------
    pd.DataFrame
        DataFrame with columns [y_true, y_pred, y_err, y_alea, y_eps].
    """
    y_true, y_pred, y_err, y_alea, y_eps = process_preds(preds, labels, alea_vars, epi_vars, None, model_type)
    df = create_df_preds(y_true=y_true, y_pred=y_pred, y_err=y_err, y_alea=y_alea, y_eps=y_eps, export=False, logger=logger)
    df.to_pickle(os.path.join(outpath, "preds.pkl"))
    return df
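
A follow-on sketch (continuing from the get_preds example above); the output directory is illustrative and created up front because the function only writes into it:

import os
from uqdd.metrics.reassessment import pkl_preds_export

outdir = "figs/reassessment/my_model"          # illustrative path
os.makedirs(outdir, exist_ok=True)
df_preds = pkl_preds_export(preds, labels, alea_vars, epi_vars,
                            outpath=outdir, model_type="evidential")
# Writes <outdir>/preds.pkl and returns the same DataFrame.
print(df_preds.columns.tolist())               # ['y_true', 'y_pred', 'y_err', 'y_alea', 'y_eps']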

uqdd.metrics.reassessment.csv_nll_post_processing

csv_nll_post_processing(csv_path: str) -> None

Normalize NLL values in a CSV by taking the first value per model name.

Parameters:

Name Type Description Default
csv_path str

Path to the CSV file containing 'model name' and 'NLL' columns.

required

Returns:

Type Description
None
Source code in uqdd/metrics/reassessment.py
def csv_nll_post_processing(csv_path: str) -> None:
    """
    Normalize NLL values in a CSV by taking the first value per model name.

    Parameters
    ----------
    csv_path : str
        Path to the CSV file containing 'model name' and 'NLL' columns.

    Returns
    -------
    None
    """
    df = pd.read_csv(csv_path)
    df["NLL"] = df.groupby("model name")["NLL"].transform("first")
    df.to_csv(csv_path, index=False)

uqdd.metrics.reassessment.reassess_metrics

reassess_metrics(runs_df: DataFrame, figs_out_path: str, csv_out_path: str, project_out_name: str, logger) -> None

Reassess metrics for each run: reload model, predict, compute NLL, evaluate, and recalibrate.

Parameters:

Name Type Description Default
runs_df DataFrame

Preprocessed runs DataFrame with resolved 'model_path' and configuration fields.

required
figs_out_path str

Directory where per-model figures and prediction pickles are saved.

required
csv_out_path str

Path to a CSV for logging metrics (passed to evaluate_predictions).

required
project_out_name str

Name used for grouping results in downstream logging.

required
logger Logger

Logger instance used throughout evaluation and recalibration.

required

Returns:

Type Description
None
Notes
  • Skips models that have already been reassessed, i.e., whose figure output directory already exists.
  • Uses validation split for isotonic recalibration and logs final metrics.
Source code in uqdd/metrics/reassessment.py
def reassess_metrics(
    runs_df: pd.DataFrame,
    figs_out_path: str,
    csv_out_path: str,
    project_out_name: str,
    logger,
) -> None:
    """
    Reassess metrics for each run: reload model, predict, compute NLL, evaluate, and recalibrate.

    Parameters
    ----------
    runs_df : pd.DataFrame
        Preprocessed runs DataFrame with resolved 'model_path' and configuration fields.
    figs_out_path : str
        Directory where per-model figures and prediction pickles are saved.
    csv_out_path : str
        Path to a CSV for logging metrics (passed to `evaluate_predictions`).
    project_out_name : str
        Name used for grouping results in downstream logging.
    logger : logging.Logger
        Logger instance used throughout evaluation and recalibration.

    Returns
    -------
    None

    Notes
    -----
    - Skips models that have already been reassessed, i.e., whose figure output directory already exists.
    - Uses validation split for isotonic recalibration and logs final metrics.
    """
    runs_df = runs_df.sample(frac=1).reset_index(drop=True)
    for index, row in runs_df.iterrows():
        model_path = row["model_path"]
        model_name = row["model_name"]
        run_name = row["run_name"]
        rowkwargs = row.to_dict()
        model_type = rowkwargs.pop("model_type")
        activity_type = rowkwargs.pop("activity_type")
        if model_path:
            model_fig_out_path = os.path.join(figs_out_path, model_name)
            if os.path.exists(model_fig_out_path):
                print(f"Model {model_name} already reassessed")
                continue
            os.makedirs(model_fig_out_path, exist_ok=True)
            config = get_model_config(model_type=model_type, activity_type=activity_type, **rowkwargs)
            num_mc_samples = config.get("num_mc_samples", 100)
            model_class = get_model_class(model_type)
            prefix = "models." if model_type == "eoe" else ""
            model = load_model(model_class, model_path, prefix_to_state_keys=prefix, config=config).to(DEVICE)
            dataloaders = get_dataloader(config, device=DEVICE, logger=logger)
            preds, labels, alea_vars, epi_vars = get_preds(model, dataloaders, model_type, subset="test", num_mc_samples=num_mc_samples)
            nll = nll_evidentials(model, dataloaders["test"], model_type=model_type, num_mc_samples=num_mc_samples, device=DEVICE)
            df = pkl_preds_export(preds, labels, alea_vars, epi_vars, model_fig_out_path, model_type, logger=logger)
            metrics, plots, uct_logger = evaluate_predictions(
                config,
                preds,
                labels,
                alea_vars,
                model_type,
                logger,
                epi_vars=epi_vars,
                wandb_push=False,
                run_name=config["run_name"],
                project_name=project_out_name,
                figpath=model_fig_out_path,
                export_preds=False,
                verbose=False,
                csv_path=csv_out_path,
                nll=nll,
            )
            preds_val, labels_val, alea_vars_val, epi_vars_val = get_preds(model, dataloaders, model_type, subset="val", num_mc_samples=num_mc_samples)
            nll = nll_evidentials(model, dataloaders["val"], model_type=model_type, num_mc_samples=num_mc_samples, device=DEVICE)
            iso_recal_model = recalibrate_model(
                preds_val,
                labels_val,
                alea_vars_val,
                preds,
                labels,
                alea_vars,
                config=config,
                epi_val=epi_vars_val,
                epi_test=epi_vars,
                uct_logger=uct_logger,
                figpath=model_fig_out_path,
                nll=nll,
            )
            uct_logger.csv_log()

uqdd.metrics.stats

Statistical utilities for metrics analysis and significance testing.

This module includes helpers to compute descriptive statistics, confidence intervals, bootstrap aggregates, correlation and significance tests, and summary tables to support model evaluation and reporting.

uqdd.metrics.stats.calc_regression_metrics

calc_regression_metrics(df, cycle_col, val_col, pred_col, thresh)

Compute regression and thresholded classification metrics per cycle/method/split.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing true and predicted values.

required
cycle_col str

Column name identifying cross-validation cycles.

required
val_col str

Column with true target values.

required
pred_col str

Column with predicted target values.

required
thresh float

Threshold to derive binary classes for precision/recall.

required

Returns:

Type Description
DataFrame

Metrics per (cv_cycle, method, split) with columns ['mae', 'mse', 'r2', 'rho', 'prec', 'recall'].

Source code in uqdd/metrics/stats.py
def calc_regression_metrics(df, cycle_col, val_col, pred_col, thresh):
    """
    Compute regression and thresholded classification metrics per cycle/method/split.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame containing true and predicted values.
    cycle_col : str
        Column name identifying cross-validation cycles.
    val_col : str
        Column with true target values.
    pred_col : str
        Column with predicted target values.
    thresh : float
        Threshold to derive binary classes for precision/recall.

    Returns
    -------
    pd.DataFrame
        Metrics per (cv_cycle, method, split) with columns ['mae', 'mse', 'r2', 'rho', 'prec', 'recall'].
    """
    df_in = df.copy()
    metric_ls = ["mae", "mse", "r2", "rho", "prec", "recall"]
    metric_list = []
    df_in["true_class"] = df_in[val_col] > thresh
    assert len(df_in.true_class.unique()) == 2, "Binary classification requires two classes"
    df_in["pred_class"] = df_in[pred_col] > thresh

    for k, v in df_in.groupby([cycle_col, "method", "split"]):
        cycle, method, split = k
        mae = mean_absolute_error(v[val_col], v[pred_col])
        mse = mean_squared_error(v[val_col], v[pred_col])
        r2 = r2_score(v[val_col], v[pred_col])
        recall = recall_score(v.true_class, v.pred_class)
        prec = precision_score(v.true_class, v.pred_class)
        rho, _ = spearmanr(v[val_col], v[pred_col])
        metric_list.append([cycle, method, split, mae, mse, r2, rho, prec, recall])
    metric_df = pd.DataFrame(metric_list, columns=["cv_cycle", "method", "split"] + metric_ls)
    return metric_df
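
A self-contained sketch on synthetic predictions; column names and the threshold are illustrative, and the frame is built so every (cv_cycle, method) pair is populated:

import numpy as np
import pandas as pd
from uqdd.metrics.stats import calc_regression_metrics

rng = np.random.default_rng(0)
methods = ["pnn", "ensemble", "evidential"]
n_per = 20  # predictions per (cycle, method) pair
df = pd.DataFrame({
    "cv_cycle": np.repeat(np.arange(5), n_per * len(methods)),
    "method": np.tile(np.repeat(methods, n_per), 5),
    "split": "test",
})
df["y_true"] = rng.normal(6.5, 1.0, len(df))
df["y_pred"] = df["y_true"] + rng.normal(0.0, 0.5, len(df))

metric_df = calc_regression_metrics(df, cycle_col="cv_cycle",
                                     val_col="y_true", pred_col="y_pred", thresh=6.5)
print(metric_df.head())  # one row per (cv_cycle, method, split)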

uqdd.metrics.stats.bootstrap_ci

bootstrap_ci(data, func=np.mean, n_bootstrap=1000, ci=95, random_state=42)

Compute bootstrap confidence interval for a statistic.

Parameters:

Name Type Description Default
data array - like

Sequence of numeric values.

required
func callable

Statistic function applied to bootstrap samples (e.g., numpy.mean). Default is numpy.mean.

mean
n_bootstrap int

Number of bootstrap resamples. Default is 1000.

1000
ci int or float

Confidence level percentage (e.g., 95). Default is 95.

95
random_state int

Seed for reproducibility. Default is 42.

42

Returns:

Type Description
tuple[float, float]

Lower and upper bounds for the confidence interval.

Source code in uqdd/metrics/stats.py
def bootstrap_ci(data, func=np.mean, n_bootstrap=1000, ci=95, random_state=42):
    """
    Compute bootstrap confidence interval for a statistic.

    Parameters
    ----------
    data : array-like
        Sequence of numeric values.
    func : callable, optional
        Statistic function applied to bootstrap samples (e.g., numpy.mean). Default is numpy.mean.
    n_bootstrap : int, optional
        Number of bootstrap resamples. Default is 1000.
    ci : int or float, optional
        Confidence level percentage (e.g., 95). Default is 95.
    random_state : int, optional
        Seed for reproducibility. Default is 42.

    Returns
    -------
    tuple[float, float]
        Lower and upper bounds for the confidence interval.
    """
    np.random.seed(random_state)
    bootstrap_samples = []
    for _ in range(n_bootstrap):
        sample = resample(data, random_state=np.random.randint(0, 10000))
        bootstrap_samples.append(func(sample))
    alpha = (100 - ci) / 2
    lower = np.percentile(bootstrap_samples, alpha)
    upper = np.percentile(bootstrap_samples, 100 - alpha)
    return lower, upper
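
A standalone sketch on synthetic per-fold scores:

import numpy as np
from uqdd.metrics.stats import bootstrap_ci

rng = np.random.default_rng(0)
rmse_per_fold = rng.normal(0.85, 0.05, size=30)   # synthetic per-fold RMSE values
lower, upper = bootstrap_ci(rmse_per_fold, func=np.mean, n_bootstrap=2000, ci=95)
print(f"95% bootstrap CI of the mean RMSE: [{lower:.3f}, {upper:.3f}]")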

uqdd.metrics.stats.rm_tukey_hsd

rm_tukey_hsd(df, metric, group_col, alpha=0.05, sort=False, direction_dict=None)

Repeated-measures Tukey HSD approximation using RM-ANOVA and studentized range.

Parameters:

Name Type Description Default
df DataFrame

Long-form DataFrame with columns including the metric, group, and 'cv_cycle' subject.

required
metric str

Metric column to compare.

required
group_col str

Column indicating groups (e.g., method/model type).

required
alpha float

Family-wise error rate for intervals. Default is 0.05.

0.05
sort bool

If True, sort groups by mean value of the metric. Default is False.

False
direction_dict dict or None

Mapping of metric -> 'maximize'|'minimize' to set sort ascending/descending.

None

Returns:

Type Description
tuple

(result_tab, df_means, df_means_diff, p_values_matrix) where:
  • result_tab: DataFrame of pairwise comparisons with mean differences and confidence intervals.
  • df_means: mean of the metric per group.
  • df_means_diff: matrix of pairwise mean differences.
  • p_values_matrix: matrix of Tukey-adjusted p-values.

Source code in uqdd/metrics/stats.py
def rm_tukey_hsd(df, metric, group_col, alpha=0.05, sort=False, direction_dict=None):
    """
    Repeated-measures Tukey HSD approximation using RM-ANOVA and studentized range.

    Parameters
    ----------
    df : pd.DataFrame
        Long-form DataFrame with columns including the metric, group, and 'cv_cycle' subject.
    metric : str
        Metric column to compare.
    group_col : str
        Column indicating groups (e.g., method/model type).
    alpha : float, optional
        Family-wise error rate for intervals. Default is 0.05.
    sort : bool, optional
        If True, sort groups by mean value of the metric. Default is False.
    direction_dict : dict or None, optional
        Mapping of metric -> 'maximize'|'minimize' to set sort ascending/descending.

    Returns
    -------
    tuple
        (result_tab, df_means, df_means_diff, p_values_matrix) where:
        - result_tab: DataFrame of pairwise comparisons with mean differences and CIs.
        - df_means: mean per group.
        - df_means_diff: matrix of mean differences.
        - pc: matrix of adjusted p-values.
    """
    if sort and direction_dict and metric in direction_dict:
        ascending = direction_dict[metric] != "maximize"
        df_means = df.groupby(group_col).mean(numeric_only=True).sort_values(metric, ascending=ascending)
    else:
        df_means = df.groupby(group_col).mean(numeric_only=True)

    with warnings.catch_warnings():
        warnings.filterwarnings("ignore", category=RuntimeWarning, message="divide by zero encountered in scalar divide")
        aov = pg.rm_anova(dv=metric, within=group_col, subject="cv_cycle", data=df, detailed=True)
    mse = aov.loc[1, "MS"]
    df_resid = aov.loc[1, "DF"]

    methods = df_means.index
    n_groups = len(methods)
    n_per_group = df[group_col].value_counts().mean()
    tukey_se = np.sqrt(2 * mse / (n_per_group))
    q = qsturng(1 - alpha, n_groups, df_resid)
    if isinstance(q, (tuple, list, np.ndarray)):
        q = q[0]

    num_comparisons = len(methods) * (len(methods) - 1) // 2
    result_tab = pd.DataFrame(index=range(num_comparisons), columns=["group1", "group2", "meandiff", "lower", "upper", "p-adj"])
    df_means_diff = pd.DataFrame(index=methods, columns=methods, data=0.0)
    pc = pd.DataFrame(index=methods, columns=methods, data=1.0)

    row_idx = 0
    for i, method1 in enumerate(methods):
        for j, method2 in enumerate(methods):
            if i < j:
                group1 = df[df[group_col] == method1][metric]
                group2 = df[df[group_col] == method2][metric]
                mean_diff = group1.mean() - group2.mean()
                studentized_range = np.abs(mean_diff) / tukey_se
                adjusted_p = psturng(studentized_range * np.sqrt(2), n_groups, df_resid)
                if isinstance(adjusted_p, (tuple, list, np.ndarray)):
                    adjusted_p = adjusted_p[0]
                lower = mean_diff - (q / np.sqrt(2) * tukey_se)
                upper = mean_diff + (q / np.sqrt(2) * tukey_se)
                result_tab.loc[row_idx] = [method1, method2, mean_diff, lower, upper, adjusted_p]
                pc.loc[method1, method2] = adjusted_p
                pc.loc[method2, method1] = adjusted_p
                df_means_diff.loc[method1, method2] = mean_diff
                df_means_diff.loc[method2, method1] = -mean_diff
                row_idx += 1

    df_means_diff = df_means_diff.astype(float)
    result_tab["group1_mean"] = result_tab["group1"].map(df_means[metric])
    result_tab["group2_mean"] = result_tab["group2"].map(df_means[metric])
    result_tab.index = result_tab["group1"] + " - " + result_tab["group2"]
    return result_tab, df_means, df_means_diff, pc
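
A sketch reusing the synthetic metric_df built in the calc_regression_metrics example above, which already has the repeated-measures layout this function expects (one row per cv_cycle and method):

from uqdd.metrics.stats import rm_tukey_hsd

result_tab, df_means, df_means_diff, pc = rm_tukey_hsd(
    metric_df, metric="r2", group_col="method", alpha=0.05
)
print(result_tab[["meandiff", "lower", "upper", "p-adj"]])
print(pc.round(3))   # symmetric matrix of adjusted p-values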

uqdd.metrics.stats.make_boxplots

make_boxplots(df, metric_ls, save_dir=None, name_prefix='', model_order=None)

Plot boxplots for each metric grouped by method.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric_ls list of str

Metrics to visualize.

required
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Prefix for the output filename. Default is empty.

''
model_order list of str or None

Explicit order of methods on the x-axis. Default derives from data.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py
def make_boxplots(df, metric_ls, save_dir=None, name_prefix="", model_order=None):
    """
    Plot boxplots for each metric grouped by method.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric_ls : list of str
        Metrics to visualize.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Prefix for the output filename. Default is empty.
    model_order : list of str or None, optional
        Explicit order of methods on the x-axis. Default derives from data.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    sns.set_theme(context="paper", font_scale=1.5)
    sns.set_style("whitegrid")
    figure, axes = plt.subplots(1, len(metric_ls), sharex=False, sharey=False, figsize=(28, 8))
    for i, stat in enumerate(metric_ls):
        ax = sns.boxplot(y=stat, x="method", hue="method", ax=axes[i], data=df, palette="Set2", legend=False, order=model_order, hue_order=model_order)
        title = stat.upper()
        ax.set_xlabel("")
        ax.set_ylabel(title)
        x_tick_labels = ax.get_xticklabels()
        label_text_list = [x.get_text() for x in x_tick_labels]
        new_xtick_labels = ["\n".join(x.split("_")) for x in label_text_list]
        ax.set_xticks(list(range(0, len(x_tick_labels))))
        ax.set_xticklabels(new_xtick_labels, rotation=45, ha="right")
    plt.subplots_adjust(hspace=0.3, wspace=0.3)
    save_plot(figure, save_dir, f"{name_prefix}_boxplot_{'_'.join(metric_ls)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
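
A sketch reusing metric_df from the calc_regression_metrics example; the output directory is illustrative and created up front in case save_plot expects it to exist. The parametric and nonparametric variants documented below take the same arguments and additionally annotate RM-ANOVA or Friedman p-values:

import os
from uqdd.metrics.stats import make_boxplots

os.makedirs("figs/stats", exist_ok=True)
make_boxplots(
    metric_df,
    metric_ls=["r2", "mae"],
    save_dir="figs/stats",
    name_prefix="demo",
    model_order=["pnn", "ensemble", "evidential"],
)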

uqdd.metrics.stats.make_boxplots_parametric

make_boxplots_parametric(df, metric_ls, save_dir=None, name_prefix='', model_order=None)

Plot boxplots with RM-ANOVA p-values annotated per metric.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric_ls list of str

Metrics to visualize.

required
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Prefix for the output filename. Default is empty.

''
model_order list of str or None

Explicit order of methods on the x-axis. Default derives from data.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py
def make_boxplots_parametric(df, metric_ls, save_dir=None, name_prefix="", model_order=None):
    """
    Plot boxplots with RM-ANOVA p-values annotated per metric.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric_ls : list of str
        Metrics to visualize.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Prefix for the output filename. Default is empty.
    model_order : list of str or None, optional
        Explicit order of methods on the x-axis. Default derives from data.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    sns.set_theme(context="paper", font_scale=1.5)
    sns.set_style("whitegrid")
    figure, axes = plt.subplots(1, len(metric_ls), sharex=False, sharey=False, figsize=(28, 8))
    for i, stat in enumerate(metric_ls):
        model = AnovaRM(data=df, depvar=stat, subject="cv_cycle", within=["method"]).fit()
        p_value = model.anova_table["Pr > F"].iloc[0]
        ax = sns.boxplot(y=stat, x="method", hue="method", ax=axes[i], data=df, palette="Set2", legend=False, order=model_order, hue_order=model_order)
        title = stat.upper()
        ax.set_title(f"p={p_value:.1e}")
        ax.set_xlabel("")
        ax.set_ylabel(title)
        x_tick_labels = ax.get_xticklabels()
        label_text_list = [x.get_text() for x in x_tick_labels]
        new_xtick_labels = ["\n".join(x.split("_")) for x in label_text_list]
        ax.set_xticks(list(range(0, len(x_tick_labels))))
        ax.set_xticklabels(new_xtick_labels, rotation=45, ha="right")
    plt.subplots_adjust(hspace=0.3, wspace=0.3)
    save_plot(figure, save_dir, f"{name_prefix}_boxplot_parametric_{'_'.join(metric_ls)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.stats.make_boxplots_nonparametric

make_boxplots_nonparametric(df, metric_ls, save_dir=None, name_prefix='', model_order=None)

Plot boxplots with Friedman p-values annotated per metric.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric_ls list of str

Metrics to visualize.

required
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Prefix for the output filename. Default is empty.

''
model_order list of str or None

Explicit order of methods on the x-axis. Default derives from data.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py
def make_boxplots_nonparametric(df, metric_ls, save_dir=None, name_prefix="", model_order=None):
    """
    Plot boxplots with Friedman p-values annotated per metric.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric_ls : list of str
        Metrics to visualize.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Prefix for the output filename. Default is empty.
    model_order : list of str or None, optional
        Explicit order of methods on the x-axis. Default derives from data.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    n_metrics = len(metric_ls)
    sns.set_theme(context="paper", font_scale=1.5)
    sns.set_style("whitegrid")
    figure, axes = plt.subplots(1, n_metrics, sharex=False, sharey=False, figsize=(28, 8))
    for i, stat in enumerate(metric_ls):
        friedman = pg.friedman(df, dv=stat, within="method", subject="cv_cycle")["p-unc"].values[0]
        ax = sns.boxplot(y=stat, x="method", hue="method", ax=axes[i], data=df, palette="Set2", legend=False, order=model_order, hue_order=model_order)
        title = stat.replace("_", " ").upper()
        ax.set_title(f"p={friedman:.1e}")
        ax.set_xlabel("")
        ax.set_ylabel(title)
        x_tick_labels = ax.get_xticklabels()
        label_text_list = [x.get_text() for x in x_tick_labels]
        new_xtick_labels = ["\n".join(x.split("_")) for x in label_text_list]
        ax.set_xticks(list(range(0, len(x_tick_labels))))
        ax.set_xticklabels(new_xtick_labels, rotation=45, ha="right")
    plt.subplots_adjust(hspace=0.3, wspace=0.3)
    save_plot(figure, save_dir, f"{name_prefix}_boxplot_nonparametric_{'_'.join(metric_ls)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.stats.make_sign_plots_nonparametric

make_sign_plots_nonparametric(df, metric_ls, save_dir=None, name_prefix='', model_order=None)

Plot significance heatmaps (Conover post-hoc) for nonparametric comparisons.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric_ls list of str

Metrics to analyze.

required
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Prefix for the output filename. Default is empty.

''
model_order list of str or None

Explicit order of methods on axes. Default derives from data.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py
def make_sign_plots_nonparametric(df, metric_ls, save_dir=None, name_prefix="", model_order=None):
    """
    Plot significance heatmaps (Conover post-hoc) for nonparametric comparisons.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric_ls : list of str
        Metrics to analyze.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Prefix for the output filename. Default is empty.
    model_order : list of str or None, optional
        Explicit order of methods on axes. Default derives from data.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    heatmap_args = {"linewidths": 0.25, "linecolor": "0.5", "clip_on": True, "square": True, "cbar_kws": {"pad": 0.05, "location": "right"}}
    n_metrics = len(metric_ls)
    sns.set_theme(context="paper", font_scale=1.5)
    figure, axes = plt.subplots(1, n_metrics, sharex=False, sharey=True, figsize=(26, 8))
    if n_metrics == 1:
        axes = [axes]
    for i, stat in enumerate(metric_ls):
        pc = sp.posthoc_conover_friedman(df, y_col=stat, group_col="method", block_col="cv_cycle", block_id_col="cv_cycle", p_adjust="holm", melted=True)
        if model_order is not None:
            pc = pc.reindex(index=model_order, columns=model_order)
        sub_ax, sub_c = sp.sign_plot(pc, **heatmap_args, ax=axes[i], xticklabels=True)
        sub_ax.set_title(stat.upper())
        if sub_c is not None and hasattr(sub_c, "ax"):
            figure.subplots_adjust(right=0.85)
            sub_c.ax.set_position([0.87, 0.5, 0.02, 0.2])
    save_plot(figure, save_dir, f"{name_prefix}_sign_plot_nonparametric_{'_'.join(metric_ls)}", tighten=False)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.stats.make_critical_difference_diagrams

make_critical_difference_diagrams(df, metric_ls, save_dir=None, name_prefix='', model_order=None)

Plot critical difference diagrams per metric using average ranks and post-hoc p-values.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric_ls list of str

Metrics to analyze.

required
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Prefix for the output filename. Default is empty.

''
model_order list of str or None

Explicit order of models on diagrams. Default derives from data.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py
def make_critical_difference_diagrams(df, metric_ls, save_dir=None, name_prefix="", model_order=None):
    """
    Plot critical difference diagrams per metric using average ranks and post-hoc p-values.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric_ls : list of str
        Metrics to analyze.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Prefix for the output filename. Default is empty.
    model_order : list of str or None, optional
        Explicit order of models on diagrams. Default derives from data.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    n_metrics = len(metric_ls)
    figure, axes = plt.subplots(n_metrics, 1, sharex=True, sharey=False, figsize=(16, 10))
    for i, stat in enumerate(metric_ls):
        avg_rank = df.groupby("cv_cycle")[stat].rank(pct=True).groupby(df.method).mean()
        pc = sp.posthoc_conover_friedman(df, y_col=stat, group_col="method", block_col="cv_cycle", block_id_col="cv_cycle", p_adjust="holm", melted=True)
        if model_order is not None:
            avg_rank = avg_rank.reindex(model_order)
            pc = pc.reindex(index=model_order, columns=model_order)
        sp.critical_difference_diagram(avg_rank, pc, ax=axes[i])
        axes[i].set_title(stat.upper())
    plt.subplots_adjust(hspace=0.3, wspace=0.3)
    save_plot(figure, save_dir, f"{name_prefix}_critical_difference_diagram_{'_'.join(metric_ls)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
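
A sketch reusing metric_df from the calc_regression_metrics example; make_sign_plots_nonparametric (documented above) accepts the same arguments and renders the corresponding significance heatmaps:

from uqdd.metrics.stats import make_critical_difference_diagrams

make_critical_difference_diagrams(
    metric_df,
    metric_ls=["r2", "mae"],
    save_dir="figs/stats",    # illustrative output directory
    name_prefix="demo",
)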

uqdd.metrics.stats.make_normality_diagnostic

make_normality_diagnostic(df, metric_ls, save_dir=None, name_prefix='')

Plot normality diagnostics (histogram/KDE and Q-Q) for residualized metrics.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric_ls list of str

Metrics to diagnose.

required
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Prefix for the output filename. Default is empty.

''

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py
def make_normality_diagnostic(df, metric_ls, save_dir=None, name_prefix=""):
    """
    Plot normality diagnostics (histogram/KDE and Q-Q) for residualized metrics.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric_ls : list of str
        Metrics to diagnose.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Prefix for the output filename. Default is empty.

    Returns
    -------
    None
    """
    df_norm = df.copy()
    df_norm.replace([np.inf, -np.inf], np.nan, inplace=True)
    for metric in metric_ls:
        df_norm[metric] = df_norm[metric] - df_norm.groupby("method")[metric].transform("mean")
    df_norm = df_norm.melt(id_vars=["cv_cycle", "method", "split"], value_vars=metric_ls, var_name="metric", value_name="value")
    sns.set_theme(context="paper", font_scale=1.5)
    sns.set_style("whitegrid")
    metrics = df_norm["metric"].unique()
    n_metrics = len(metrics)
    fig, axes = plt.subplots(2, n_metrics, figsize=(20, 10))
    for i, metric in enumerate(metrics):
        ax = axes[0, i]
        sns.histplot(df_norm[df_norm["metric"] == metric]["value"], kde=True, ax=ax)
        ax.set_title(f"{metric}")
        ax.set_xlabel("")
        if i == 0:
            ax.set_ylabel("Count")
        else:
            ax.set_ylabel("")
    for i, metric in enumerate(metrics):
        ax = axes[1, i]
        metric_data = df_norm[df_norm["metric"] == metric]["value"]
        stats.probplot(metric_data, dist="norm", plot=ax)
        ax.set_title("")
        ax.set_xlabel("Theoretical Quantiles")
        if i == 0:
            ax.set_ylabel("Ordered Values")
        else:
            ax.set_ylabel("")
    plt.subplots_adjust(hspace=0.3, wspace=0.8)
    save_plot(fig, save_dir, f"{name_prefix}_normality_diagnostic_{'_'.join(metric_ls)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.stats.mcs_plot

mcs_plot(pc, effect_size, means, labels=True, cmap=None, cbar_ax_bbox=None, ax=None, show_diff=True, cell_text_size=10, axis_text_size=8, show_cbar=True, reverse_cmap=False, vlim=None, **kwargs)

Render a multiple-comparisons significance heatmap annotated with effect sizes and stars.

Parameters:

Name Type Description Default
pc DataFrame

Matrix of adjusted p-values.

required
effect_size DataFrame

Matrix of mean differences (effect sizes) aligned with pc.

required
means Series

Mean values per group for labeling.

required
labels bool

If True, add x/y tick labels from means.index. Default is True.

True
cmap str or None

Colormap name for effect sizes. Default is 'YlGnBu'.

None
cbar_ax_bbox tuple or None

Custom colorbar axes bbox; unused here but kept for API compatibility.

None
ax Axes or None

Axes to draw into; if None, a new axes is created.

None
show_diff bool

If True, annotate cells with rounded effect sizes plus significance. Default is True.

True
cell_text_size int

Font size for annotations. Default is 10.

10
axis_text_size int

Font size for axis tick labels. Default is 8.

8
show_cbar bool

If True, show colorbar. Default is True.

True
reverse_cmap bool

If True, use reversed colormap. Default is False.

False
vlim float or None

Symmetric limit for color scaling around 0. Default is None.

None

Returns:

Type Description
Axes

Axes containing the rendered heatmap.

Source code in uqdd/metrics/stats.py
def mcs_plot(pc, effect_size, means, labels=True, cmap=None, cbar_ax_bbox=None, ax=None, show_diff=True, cell_text_size=10, axis_text_size=8, show_cbar=True, reverse_cmap=False, vlim=None, **kwargs):
    """
    Render a multiple-comparisons significance heatmap annotated with effect sizes and stars.

    Parameters
    ----------
    pc : pd.DataFrame
        Matrix of adjusted p-values.
    effect_size : pd.DataFrame
        Matrix of mean differences (effect sizes) aligned with `pc`.
    means : pd.Series
        Mean values per group for labeling.
    labels : bool, optional
        If True, add x/y tick labels from `means.index`. Default is True.
    cmap : str or None, optional
        Colormap name for effect sizes. Default is 'YlGnBu'.
    cbar_ax_bbox : tuple or None, optional
        Custom colorbar axes bbox; unused here but kept for API compatibility.
    ax : matplotlib.axes.Axes or None, optional
        Axes to draw into; if None, a new axes is created.
    show_diff : bool, optional
        If True, annotate cells with rounded effect sizes plus significance. Default is True.
    cell_text_size : int, optional
        Font size for annotations. Default is 10.
    axis_text_size : int, optional
        Font size for axis tick labels. Default is 8.
    show_cbar : bool, optional
        If True, show colorbar. Default is True.
    reverse_cmap : bool, optional
        If True, use reversed colormap. Default is False.
    vlim : float or None, optional
        Symmetric limit for color scaling around 0. Default is None.

    Returns
    -------
    matplotlib.axes.Axes
        Axes containing the rendered heatmap.
    """
    for key in ["cbar", "vmin", "vmax", "center"]:
        if key in kwargs:
            del kwargs[key]
    if not cmap:
        cmap = "YlGnBu"
    if reverse_cmap:
        cmap = cmap + "_r"
    significance = pc.copy().astype(object)
    significance[(pc < 0.001) & (pc >= 0)] = "***"
    significance[(pc < 0.01) & (pc >= 0.001)] = "**"
    significance[(pc < 0.05) & (pc >= 0.01)] = "*"
    significance[(pc >= 0.05)] = ""
    np.fill_diagonal(significance.values, "")
    annotations = effect_size.round(2).astype(str) + significance if show_diff else significance
    hax = sns.heatmap(effect_size, cmap=cmap, annot=annotations, fmt="", cbar=show_cbar, ax=ax, annot_kws={"size": cell_text_size}, vmin=-2 * vlim if vlim else None, vmax=2 * vlim if vlim else None, square=True, **kwargs)
    if labels:
        label_list = list(means.index)
        x_label_list = label_list
        y_label_list = label_list
        xtick_positions = np.arange(len(label_list))
        hax.set_xticks(xtick_positions + 0.5)
        hax.set_xticklabels(x_label_list, size=axis_text_size, ha="center", va="center", rotation=90)
        hax.set_yticks(xtick_positions + 0.5)
        hax.set_yticklabels(y_label_list, size=axis_text_size, ha="center", va="center", rotation=0)
    hax.set_xlabel("")
    hax.set_ylabel("")
    return hax

uqdd.metrics.stats.make_mcs_plot_grid

make_mcs_plot_grid(df, stats_list, group_col, alpha=0.05, figsize=(20, 10), direction_dict=None, effect_dict=None, show_diff=True, cell_text_size=16, axis_text_size=12, title_text_size=16, sort_axes=False, save_dir=None, name_prefix='', model_order=None)

Generate a grid of MCS plots for multiple metrics.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
stats_list list of str

Metrics to include.

required
group_col str

Column indicating groups (e.g., method).

required
alpha float

Significance level. Default is 0.05.

0.05
figsize tuple

Figure size. Default is (20, 10).

(20, 10)
direction_dict dict or None

Mapping metric -> 'maximize'|'minimize' for colormap orientation.

None
effect_dict dict or None

Mapping metric -> effect size limit for color scaling.

None
show_diff bool

If True, annotate mean differences; else annotate significance only.

True
cell_text_size int

Annotation font size.

16
axis_text_size int

Axis label font size.

12
title_text_size int

Title font size.

16
sort_axes bool

If True, sort groups by mean values per metric.

False
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Filename prefix. Default is empty.

''
model_order list of str or None

Explicit model order for rows/cols.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py
def make_mcs_plot_grid(df, stats_list, group_col, alpha=0.05, figsize=(20, 10), direction_dict=None, effect_dict=None, show_diff=True, cell_text_size=16, axis_text_size=12, title_text_size=16, sort_axes=False, save_dir=None, name_prefix="", model_order=None):
    """
    Generate a grid of MCS plots for multiple metrics.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    stats_list : list of str
        Metrics to include.
    group_col : str
        Column indicating groups (e.g., method).
    alpha : float, optional
        Significance level. Default is 0.05.
    figsize : tuple, optional
        Figure size. Default is (20, 10).
    direction_dict : dict or None, optional
        Mapping metric -> 'maximize'|'minimize' for colormap orientation.
    effect_dict : dict or None, optional
        Mapping metric -> effect size limit for color scaling.
    show_diff : bool, optional
        If True, annotate mean differences; else annotate significance only.
    cell_text_size : int, optional
        Annotation font size.
    axis_text_size : int, optional
        Axis label font size.
    title_text_size : int, optional
        Title font size.
    sort_axes : bool, optional
        If True, sort groups by mean values per metric.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Filename prefix. Default is empty.
    model_order : list of str or None, optional
        Explicit model order for rows/cols.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    nrow = math.ceil(len(stats_list) / 3)
    fig, ax = plt.subplots(nrow, 3, figsize=figsize)
    for key in ["r2", "rho", "prec", "recall", "mae", "mse"]:
        direction_dict.setdefault(key, "maximize" if key in ["r2", "rho", "prec", "recall"] else "minimize")
    for key in ["r2", "rho", "prec", "recall"]:
        effect_dict.setdefault(key, 0.1)
    for i, stat in enumerate(stats_list):
        row = i // 3
        col = i % 3
        if stat not in direction_dict:
            raise ValueError(f"Stat '{stat}' is missing in direction_dict. Please set its value.")
        if stat not in effect_dict:
            raise ValueError(f"Stat '{stat}' is missing in effect_dict. Please set its value.")
        reverse_cmap = direction_dict[stat] == "minimize"
        _, df_means, df_means_diff, pc = rm_tukey_hsd(df, stat, group_col, alpha, sort_axes, direction_dict)
        if model_order is not None:
            df_means = df_means.reindex(model_order)
            df_means_diff = df_means_diff.reindex(index=model_order, columns=model_order)
            pc = pc.reindex(index=model_order, columns=model_order)
        hax = mcs_plot(pc, effect_size=df_means_diff, means=df_means[stat], show_diff=show_diff, ax=ax[row, col], cbar=True, cell_text_size=cell_text_size, axis_text_size=axis_text_size, reverse_cmap=reverse_cmap, vlim=effect_dict[stat])
        hax.set_title(stat.upper(), fontsize=title_text_size)
    if (len(stats_list) % 3) != 0:
        for i in range(len(stats_list), nrow * 3):
            row = i // 3
            col = i % 3
            ax[row, col].set_visible(False)
    from matplotlib.lines import Line2D
    legend_elements = [
        Line2D([0], [0], marker="o", color="w", label="p < 0.001 (***): Highly Significant", markerfacecolor="black", markersize=10),
        Line2D([0], [0], marker="o", color="w", label="p < 0.01 (**): Very Significant", markerfacecolor="black", markersize=10),
        Line2D([0], [0], marker="o", color="w", label="p < 0.05 (*): Significant", markerfacecolor="black", markersize=10),
        Line2D([0], [0], marker="o", color="w", label="p >= 0.05: Not Significant", markerfacecolor="black", markersize=10),
    ]
    fig.legend(handles=legend_elements, loc="upper right", ncol=2, fontsize=12, frameon=False)
    plt.subplots_adjust(top=0.88)
    save_plot(fig, save_dir, f"{name_prefix}_mcs_plot_grid_{'_'.join(stats_list)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()
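
A sketch reusing metric_df from the calc_regression_metrics example. The grid is laid out three panels per row, so a stats list spanning at least two rows is used here to match the 2-D axes indexing in the source; direction and effect-size limits must be supplied for any metric the built-in defaults do not cover (the 'mae'/'mse' limits below are illustrative):

from uqdd.metrics.stats import make_mcs_plot_grid

make_mcs_plot_grid(
    metric_df,
    stats_list=["mae", "mse", "r2", "rho", "prec", "recall"],
    group_col="method",
    direction_dict={},                      # defaults are filled in for these six metrics
    effect_dict={"mae": 0.3, "mse": 0.3},   # illustrative limits; r2/rho/prec/recall default to 0.1
    save_dir="figs/stats",
    name_prefix="demo",
)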

uqdd.metrics.stats.make_scatterplot

make_scatterplot(df, val_col, pred_col, thresh, cycle_col='cv_cycle', group_col='method', save_dir=None)

Scatter plots of predicted vs true values per method, with threshold lines and summary stats.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
val_col str

True value column.

required
pred_col str

Predicted value column.

required
thresh float

Threshold for classification overlays.

required
cycle_col str

Cross-validation cycle column. Default is 'cv_cycle'.

'cv_cycle'
group_col str

Method/model type column. Default is 'method'.

'method'
save_dir str or None

Directory to save the plot. Default is None.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py
def make_scatterplot(df, val_col, pred_col, thresh, cycle_col="cv_cycle", group_col="method", save_dir=None):
    """
    Scatter plots of predicted vs true values per method, with threshold lines and summary stats.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    val_col : str
        True value column.
    pred_col : str
        Predicted value column.
    thresh : float
        Threshold for classification overlays.
    cycle_col : str, optional
        Cross-validation cycle column. Default is 'cv_cycle'.
    group_col : str, optional
        Method/model type column. Default is 'method'.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.

    Returns
    -------
    None
    """
    df = df.copy()
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    df_split_metrics = calc_regression_metrics(df, cycle_col=cycle_col, val_col=val_col, pred_col=pred_col, thresh=thresh)
    methods = df[group_col].unique()
    fig, axs = plt.subplots(nrows=1, ncols=len(methods), figsize=(25, 10))
    for ax, method in zip(axs, methods):
        df_method = df.query(f"{group_col} == @method")
        df_metrics = df_split_metrics.query(f"{group_col} == @method")
        ax.scatter(df_method[pred_col], df_method[val_col], alpha=0.3)
        ax.plot([df_method[val_col].min(), df_method[val_col].max()], [df_method[val_col].min(), df_method[val_col].max()], "k--", lw=1)
        ax.axhline(y=thresh, color="r", linestyle="--")
        ax.axvline(x=thresh, color="r", linestyle="--")
        ax.set_title(method)
        y_true = df_method[val_col] > thresh
        y_pred = df_method[pred_col] > thresh
        precision = precision_score(y_true, y_pred)
        recall = recall_score(y_true, y_pred)
        metrics_text = f"MAE: {df_metrics['mae'].mean():.2f}\nMSE: {df_metrics['mse'].mean():.2f}\nR2: {df_metrics['r2'].mean():.2f}\nrho: {df_metrics['rho'].mean():.2f}\nPrecision: {precision:.2f}\nRecall: {recall:.2f}"
        ax.text(0.05, 0.5, metrics_text, transform=ax.transAxes, verticalalignment="top")
        ax.set_xlabel("Predicted")
        ax.set_ylabel("Measured")
    plt.subplots_adjust(hspace=0.3, wspace=0.3)
    save_plot(fig, save_dir, f"scatterplot_{val_col}_vs_{pred_col}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.stats.ci_plot

ci_plot(result_tab, ax_in, name)

Plot mean differences with confidence intervals for pairwise comparisons.

Parameters:

Name Type Description Default
result_tab DataFrame

Output of rm_tukey_hsd with columns ['meandiff', 'lower', 'upper'].

required
ax_in Axes

Axes to plot into.

required
name str

Title for the plot.

required

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py
def ci_plot(result_tab, ax_in, name):
    """
    Plot mean differences with confidence intervals for pairwise comparisons.

    Parameters
    ----------
    result_tab : pd.DataFrame
        Output of rm_tukey_hsd with columns ['meandiff', 'lower', 'upper'].
    ax_in : matplotlib.axes.Axes
        Axes to plot into.
    name : str
        Title for the plot.

    Returns
    -------
    None
    """
    result_err = np.array([result_tab["meandiff"] - result_tab["lower"], result_tab["upper"] - result_tab["meandiff"]])
    sns.set_theme(context="paper")
    sns.set_style("whitegrid")
    ax = sns.pointplot(x=result_tab.meandiff, y=result_tab.index, marker="o", linestyle="", ax=ax_in)
    ax.errorbar(y=result_tab.index, x=result_tab["meandiff"], xerr=result_err, fmt="o", capsize=5)
    ax.axvline(0, ls="--", lw=3)
    ax.set_xlabel("Mean Difference")
    ax.set_ylabel("")
    ax.set_title(name)
    ax.set_xlim(-0.2, 0.2)

uqdd.metrics.stats.make_ci_plot_grid

make_ci_plot_grid(df_in, metric_list, group_col='method', save_dir=None, name_prefix='', model_order=None)

Plot a grid of confidence-interval charts for multiple metrics.

Parameters:

Name Type Description Default
df_in DataFrame

Input DataFrame.

required
metric_list list of str

Metrics to render.

required
group_col str

Group column (e.g., 'method'). Default is 'method'.

'method'
save_dir str or None

Directory to save the plot. Default is None.

None
name_prefix str

Filename prefix. Default is empty.

''
model_order list of str or None

Explicit row order for the CI plots.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py
def make_ci_plot_grid(df_in, metric_list, group_col="method", save_dir=None, name_prefix="", model_order=None):
    """
    Plot a grid of confidence-interval charts for multiple metrics.

    Parameters
    ----------
    df_in : pd.DataFrame
        Input DataFrame.
    metric_list : list of str
        Metrics to render.
    group_col : str, optional
        Group column (e.g., 'method'). Default is 'method'.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    name_prefix : str, optional
        Filename prefix. Default is empty.
    model_order : list of str or None, optional
        Explicit row order for the CI plots.

    Returns
    -------
    None
    """
    df_in = df_in.copy()
    df_in.replace([np.inf, -np.inf], np.nan, inplace=True)
    figure, axes = plt.subplots(len(metric_list), 1, figsize=(8, 2 * len(metric_list)), sharex=False)
    if not isinstance(axes, np.ndarray):
        axes = np.array([axes])
    for i, metric in enumerate(metric_list):
        df_tukey, _, _, _ = rm_tukey_hsd(df_in, metric, group_col=group_col)
        if model_order is not None:
            df_tukey = df_tukey.reindex(index=model_order)
        ci_plot(df_tukey, ax_in=axes[i], name=metric)
    figure.suptitle("Multiple Comparison of Means\nTukey HSD, FWER=0.05")
    plt.subplots_adjust(hspace=0.9, wspace=0.3)
    save_plot(figure, save_dir, f"{name_prefix}_ci_plot_grid_{'_'.join(metric_list)}")
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.stats.recall_at_precision

recall_at_precision(y_true, y_score, precision_threshold=0.5, direction='greater')

Find recall and threshold achieving at least a target precision.

Parameters:

Name Type Description Default
y_true array-like

Binary ground-truth labels.

required
y_score array-like

Continuous scores or probabilities.

required
precision_threshold float

Minimum precision to achieve. Default is 0.5.

0.5
direction {'greater', 'lesser'}

If 'greater', thresholding uses >=; if 'lesser', uses <=. Default is 'greater'.

"greater"

Returns:

Type Description
tuple[float, float or None]

(recall, threshold) if achievable; otherwise (nan, None).

Raises:

Type Description
ValueError

If direction is invalid.

Source code in uqdd/metrics/stats.py, lines 746-785
def recall_at_precision(y_true, y_score, precision_threshold=0.5, direction="greater"):
    """
    Find recall and threshold achieving at least a target precision.

    Parameters
    ----------
    y_true : array-like
        Binary ground-truth labels.
    y_score : array-like
        Continuous scores or probabilities.
    precision_threshold : float, optional
        Minimum precision to achieve. Default is 0.5.
    direction : {"greater", "lesser"}, optional
        If 'greater', thresholding uses >=; if 'lesser', uses <=. Default is 'greater'.

    Returns
    -------
    tuple[float, float or None]
        (recall, threshold) if achievable; otherwise (nan, None).

    Raises
    ------
    ValueError
        If `direction` is invalid.
    """
    if direction not in ["greater", "lesser"]:
        raise ValueError("Invalid direction. Expected one of: ['greater', 'lesser']")
    y_true = np.array(y_true)
    y_score = np.array(y_score)
    thresholds = np.unique(y_score)
    thresholds = np.sort(thresholds)
    if direction == "lesser":
        thresholds = thresholds[::-1]
    for threshold in thresholds:
        y_pred = y_score >= threshold if direction == "greater" else y_score <= threshold
        precision = precision_score(y_true, y_pred)
        if precision >= precision_threshold:
            recall = recall_score(y_true, y_pred)
            return recall, threshold
    return np.nan, None
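
A small worked example with toy labels and scores: the first threshold reaching the target precision of 0.75 is 0.35, at which all three positives are recovered.

from uqdd.metrics.stats import recall_at_precision

y_true = [0, 0, 1, 1, 1]
y_score = [0.10, 0.40, 0.35, 0.80, 0.70]
recall, threshold = recall_at_precision(y_true, y_score, precision_threshold=0.75, direction="greater")
print(recall, threshold)  # 1.0, 0.35 (precision at this threshold is 3/4 = 0.75)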

uqdd.metrics.stats.calc_classification_metrics

calc_classification_metrics(df_in, cycle_col, val_col, prob_col, pred_col)

Compute classification metrics per cycle/method/split, including ROC-AUC, PR-AUC, MCC, recall, and TNR.

Parameters:

Name Type Description Default
df_in DataFrame

Input DataFrame.

required
cycle_col str

Column name for cross-validation cycles.

required
val_col str

True binary label column.

required
prob_col str

Predicted probability/score column.

required
pred_col str

Predicted binary label column.

required

Returns:

Type Description
DataFrame

Metrics per (cv_cycle, method, split) with columns ['roc_auc', 'pr_auc', 'mcc', 'recall', 'tnr'].

Source code in uqdd/metrics/stats.py, lines 788-820
def calc_classification_metrics(df_in, cycle_col, val_col, prob_col, pred_col):
    """
    Compute classification metrics per cycle/method/split, including ROC-AUC, PR-AUC, MCC, recall, and TNR.

    Parameters
    ----------
    df_in : pd.DataFrame
        Input DataFrame.
    cycle_col : str
        Column name for cross-validation cycles.
    val_col : str
        True binary label column.
    prob_col : str
        Predicted probability/score column.
    pred_col : str
        Predicted binary label column.

    Returns
    -------
    pd.DataFrame
        Metrics per (cv_cycle, method, split) with columns ['roc_auc', 'pr_auc', 'mcc', 'recall', 'tnr'].
    """
    metric_list = []
    for k, v in df_in.groupby([cycle_col, "method", "split"]):
        cycle, method, split = k
        roc_auc = roc_auc_score(v[val_col], v[prob_col])
        pr_auc = average_precision_score(v[val_col], v[prob_col])
        mcc = matthews_corrcoef(v[val_col], v[pred_col])
        recall, _ = recall_at_precision(v[val_col].astype(bool), v[prob_col], precision_threshold=0.8, direction="greater")
        tnr, _ = recall_at_precision(~v[val_col].astype(bool), v[prob_col], precision_threshold=0.8, direction="lesser")
        metric_list.append([cycle, method, split, roc_auc, pr_auc, mcc, recall, tnr])
    metric_df = pd.DataFrame(metric_list, columns=["cv_cycle", "method", "split", "roc_auc", "pr_auc", "mcc", "recall", "tnr"])
    return metric_df
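
A minimal sketch with synthetic, well-separated predictions; the column names 'Sol', 'Sol_prob', and 'Sol_pred' are hypothetical stand-ins for your label, score, and hard-prediction columns.

import numpy as np
import pandas as pd
from uqdd.metrics.stats import calc_classification_metrics

# Two hypothetical CV cycles for a single method on the scaffold split.
frames = []
for cycle in (0, 1):
    frames.append(pd.DataFrame({
        "cv_cycle": cycle,
        "method": "ensemble",
        "split": "scaffold",
        "Sol": [0] * 10 + [1] * 10,
        "Sol_prob": list(np.linspace(0.05, 0.45, 10)) + list(np.linspace(0.40, 0.95, 10)),
    }))
df = pd.concat(frames, ignore_index=True)
df["Sol_pred"] = (df["Sol_prob"] >= 0.5).astype(int)
print(calc_classification_metrics(df, cycle_col="cv_cycle", val_col="Sol",
                                  prob_col="Sol_prob", pred_col="Sol_pred"))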

uqdd.metrics.stats.make_curve_plots

make_curve_plots(df)

Plot ROC and PR curves for split/method selections with threshold markers.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame containing 'cv_cycle', 'split', and 'method' columns, plus the true-label and probability columns.

required

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py, lines 823-869
def make_curve_plots(df):
    """
    Plot ROC and PR curves for split/method selections with threshold markers.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame containing 'cv_cycle', 'split', and 'method' columns, plus the true-label and probability columns.

    Returns
    -------
    None
    """
    df_plot = df.query("cv_cycle == 0 and split == 'scaffold'").copy()
    color_map = plt.get_cmap("tab10")
    le = LabelEncoder()
    df_plot["color"] = le.fit_transform(df_plot["method"])
    colors = color_map(df_plot["color"].unique())
    val_col = "Sol"
    prob_col = "Sol_prob"
    fig, axes = plt.subplots(1, 2, figsize=(12, 6))
    for (k, v), color in zip(df_plot.groupby("method"), colors):
        roc_auc = roc_auc_score(v[val_col], v[prob_col])
        pr_auc = average_precision_score(v[val_col], v[prob_col])
        fpr, recall_pos, thresholds_roc = roc_curve(v[val_col], v[prob_col])
        precision, recall, thresholds_pr = precision_recall_curve(v[val_col], v[prob_col])
        _, threshold_recall_pos = recall_at_precision(v[val_col].astype(bool), v[prob_col], precision_threshold=0.8, direction="greater")
        _, threshold_recall_neg = recall_at_precision(~v[val_col].astype(bool), v[prob_col], precision_threshold=0.8, direction="lesser")
        fpr_recall_pos = fpr[np.abs(thresholds_roc - threshold_recall_pos).argmin()]
        fpr_recall_neg = fpr[np.abs(thresholds_roc - threshold_recall_neg).argmin()]
        recall_recall_pos = recall[np.abs(thresholds_pr - threshold_recall_pos).argmin()]
        recall_recall_neg = recall[np.abs(thresholds_pr - threshold_recall_neg).argmin()]
        axes[0].plot(fpr, recall_pos, label=f"{k} (ROC AUC={roc_auc:.03f})", color=color, alpha=0.75)
        axes[1].plot(recall, precision, label=f"{k} (PR AUC={pr_auc:.03f})", color=color, alpha=0.75)
        axes[0].axvline(fpr_recall_pos, color=color, linestyle=":", alpha=0.75)
        axes[0].axvline(fpr_recall_neg, color=color, linestyle="--", alpha=0.75)
        axes[1].axvline(recall_recall_pos, color=color, linestyle=":", alpha=0.75)
        axes[1].axvline(recall_recall_neg, color=color, linestyle="--", alpha=0.75)
    axes[0].plot([0, 1], [0, 1], "--", color="black", lw=0.5)
    axes[0].set_xlabel("False Positive Rate")
    axes[0].set_ylabel("True Positive Rate")
    axes[0].set_title("ROC Curve")
    axes[0].legend()
    axes[1].set_xlabel("Recall")
    axes[1].set_ylabel("Precision")
    axes[1].set_title("Precision-Recall Curve")
    axes[1].legend()
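
Note that the function expects the hard-coded label/score columns 'Sol' and 'Sol_prob' and plots only rows with cv_cycle == 0 on the 'scaffold' split. A minimal sketch with synthetic, well-separated scores:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from uqdd.metrics.stats import make_curve_plots

n = 20
df = pd.DataFrame({
    "method": ["ensemble"] * n,
    "cv_cycle": [0] * n,
    "split": ["scaffold"] * n,
    "Sol": [0] * (n // 2) + [1] * (n // 2),
    "Sol_prob": list(np.linspace(0.05, 0.40, n // 2)) + list(np.linspace(0.60, 0.95, n // 2)),
})
make_curve_plots(df)
plt.show()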

uqdd.metrics.stats.harmonize_columns

harmonize_columns(df)

Normalize common column names to ['method', 'split', 'cv_cycle'].

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame with possibly varied column naming.

required

Returns:

Type Description
DataFrame

DataFrame with standardized column names; an assertion verifies that the required columns are present.

Source code in uqdd/metrics/stats.py, lines 872-894
def harmonize_columns(df):
    """
    Normalize common column names to ['method', 'split', 'cv_cycle'].

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame with possibly varied column naming.

    Returns
    -------
    pd.DataFrame
        DataFrame with standardized column names; an assertion verifies that the required columns are present.
    """
    df = df.copy()
    rename_map = {
        "Model type": "method",
        "Split": "split",
        "Group_Number": "cv_cycle",
    }
    df.rename(columns={k: v for k, v in rename_map.items() if k in df.columns}, inplace=True)
    assert {"method", "split", "cv_cycle"}.issubset(df.columns)
    return df
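
A minimal sketch showing the renaming; metric columns are kept untouched.

import pandas as pd
from uqdd.metrics.stats import harmonize_columns

df = pd.DataFrame({
    "Model type": ["ensemble", "mc_dropout"],
    "Split": ["random", "random"],
    "Group_Number": [0, 1],
    "RMSE": [0.61, 0.66],
})
print(harmonize_columns(df).columns.tolist())
# ['method', 'split', 'cv_cycle', 'RMSE']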

uqdd.metrics.stats.cliffs_delta

cliffs_delta(x, y)

Compute Cliff's delta effect size and qualitative interpretation.

Parameters:

Name Type Description Default
x array-like

First sample of numeric values.

required
y array-like

Second sample of numeric values.

required

Returns:

Type Description
tuple[float, str]

(delta, interpretation) where interpretation is one of {'negligible','small','medium','large'}.

Source code in uqdd/metrics/stats.py, lines 897-932
def cliffs_delta(x, y):
    """
    Compute Cliff's delta effect size and qualitative interpretation.

    Parameters
    ----------
    x : array-like
        First sample of numeric values.
    y : array-like
        Second sample of numeric values.

    Returns
    -------
    tuple[float, str]
        (delta, interpretation) where interpretation is one of {'negligible','small','medium','large'}.
    """
    x, y = np.array(x), np.array(y)
    m, n = len(x), len(y)
    comparisons = 0
    for xi in x:
        for yi in y:
            if xi > yi:
                comparisons += 1
            elif xi < yi:
                comparisons -= 1
    delta = comparisons / (m * n)
    abs_delta = abs(delta)
    if abs_delta < 0.147:
        interpretation = "negligible"
    elif abs_delta < 0.33:
        interpretation = "small"
    elif abs_delta < 0.474:
        interpretation = "medium"
    else:
        interpretation = "large"
    return delta, interpretation
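
A small worked example: of the 9 pairwise comparisons below, 6 favor y, 1 favors x, and 2 are ties, so delta = (1 - 6) / 9 ≈ -0.556, which falls in the 'large' band.

from uqdd.metrics.stats import cliffs_delta

delta, label = cliffs_delta([1, 2, 3], [2, 3, 4])
print(round(delta, 3), label)  # -0.556 large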

uqdd.metrics.stats.wilcoxon_pairwise_test

wilcoxon_pairwise_test(df, metric, model_a, model_b, task=None, split=None, seed_col=None)

Perform paired Wilcoxon signed-rank test between two models on a metric.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metric str

Metric column to compare.

required
model_a str

First model type name.

required
model_b str

Second model type name.

required
task str or None

Task filter. Default is None.

None
split str or None

Split filter. Default is None.

None
seed_col str or None

Optional seed column identifier (unused here).

None

Returns:

Type Description
dict or None

Test summary including statistic, p-value, Cliff's delta, CI on differences; None if insufficient data.

Source code in uqdd/metrics/stats.py, lines 935-999
def wilcoxon_pairwise_test(df, metric, model_a, model_b, task=None, split=None, seed_col=None):
    """
    Perform paired Wilcoxon signed-rank test between two models on a metric.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metric : str
        Metric column to compare.
    model_a : str
        First model type name.
    model_b : str
        Second model type name.
    task : str or None, optional
        Task filter. Default is None.
    split : str or None, optional
        Split filter. Default is None.
    seed_col : str or None, optional
        Optional seed column identifier (unused here).

    Returns
    -------
    dict or None
        Test summary including statistic, p-value, Cliff's delta, CI on differences; None if insufficient data.
    """
    data = df.copy()
    if task is not None:
        data = data[data["Task"] == task]
    if split is not None:
        data = data[data["Split"] == split]
    values_a = data[data["Model type"] == model_a][metric].values
    values_b = data[data["Model type"] == model_b][metric].values
    if len(values_a) == 0 or len(values_b) == 0:
        return None
    min_len = min(len(values_a), len(values_b))
    values_a = values_a[:min_len]
    values_b = values_b[:min_len]
    statistic, p_value = wilcoxon(values_a, values_b, alternative="two-sided")
    delta, effect_size_interpretation = cliffs_delta(values_a, values_b)
    differences = values_a - values_b
    median_diff = np.median(differences)
    ci_lower, ci_upper = bootstrap_ci(differences, np.median, n_bootstrap=1000)
    if ci_lower <= 0 <= ci_upper:
        practical_significance = "difference is small (CI includes 0)"
    elif abs(median_diff) < 0.1 * np.std(np.concatenate([values_a, values_b])):
        practical_significance = "difference is small"
    else:
        practical_significance = "difference may be meaningful"
    return {
        "model_a": model_a,
        "model_b": model_b,
        "metric": metric,
        "task": task,
        "split": split,
        "n_pairs": min_len,
        "wilcoxon_statistic": statistic,
        "p_value": p_value,
        "cliffs_delta": delta,
        "effect_size_interpretation": effect_size_interpretation,
        "median_difference": median_diff,
        "ci_lower": ci_lower,
        "ci_upper": ci_upper,
        "practical_significance": practical_significance,
    }
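
A minimal sketch comparing two model types on hypothetical per-fold RMSE values, with no Task/Split filtering:

import pandas as pd
from uqdd.metrics.stats import wilcoxon_pairwise_test

df = pd.DataFrame({
    "Model type": ["ensemble"] * 6 + ["mc_dropout"] * 6,
    "RMSE": [0.61, 0.63, 0.60, 0.64, 0.62, 0.59,
             0.66, 0.68, 0.65, 0.70, 0.67, 0.64],
})
result = wilcoxon_pairwise_test(df, metric="RMSE", model_a="ensemble", model_b="mc_dropout")
print(result["p_value"], result["cliffs_delta"], result["effect_size_interpretation"])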

uqdd.metrics.stats.holm_bonferroni_correction

holm_bonferroni_correction(p_values)

Apply Holm–Bonferroni correction to an array of p-values.

Parameters:

Name Type Description Default
p_values array-like

Raw p-values.

required

Returns:

Type Description
tuple[ndarray, ndarray]

(corrected_p_values, rejected_mask) where rejected indicates significance after correction.

Source code in uqdd/metrics/stats.py, lines 1002-1029
def holm_bonferroni_correction(p_values):
    """
    Apply Holm–Bonferroni correction to an array of p-values.

    Parameters
    ----------
    p_values : array-like
        Raw p-values.

    Returns
    -------
    tuple[numpy.ndarray, numpy.ndarray]
        (corrected_p_values, rejected_mask) where rejected indicates significance after correction.
    """
    p_values = np.array(p_values)
    n = len(p_values)
    sorted_indices = np.argsort(p_values)
    sorted_p_values = p_values[sorted_indices]
    corrected_p_values = np.zeros(n)
    rejected = np.zeros(n, dtype=bool)
    for i in range(n):
        correction_factor = n - i
        corrected_p_values[sorted_indices[i]] = min(1.0, sorted_p_values[i] * correction_factor)
        if corrected_p_values[sorted_indices[i]] < 0.05:
            rejected[sorted_indices[i]] = True
        else:
            break
    return corrected_p_values, rejected
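
A small worked example: with four p-values, only the smallest survives (0.01 × 4 = 0.04 < 0.05); the procedure stops as soon as 0.03 × 3 = 0.09 fails to reject.

from uqdd.metrics.stats import holm_bonferroni_correction

corrected, rejected = holm_bonferroni_correction([0.01, 0.04, 0.03, 0.20])
print(rejected)  # [ True False False False]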

uqdd.metrics.stats.pairwise_model_comparison

pairwise_model_comparison(df, metrics, models=None, tasks=None, splits=None, alpha=0.05)

Run pairwise Wilcoxon tests across models/tasks/splits for multiple metrics and adjust p-values.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metrics list of str

Metrics to compare.

required
models list of str or None

Models to include; default derives from data.

None
tasks list of str or None

Tasks to include; default derives from data.

None
splits list of str or None

Splits to include; default derives from data.

None
alpha float

Significance level. Default is 0.05.

0.05

Returns:

Type Description
DataFrame

Results table with corrected p-values and significance flags.

Source code in uqdd/metrics/stats.py, lines 1032-1079
def pairwise_model_comparison(df, metrics, models=None, tasks=None, splits=None, alpha=0.05):
    """
    Run pairwise Wilcoxon tests across models/tasks/splits for multiple metrics and adjust p-values.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metrics : list of str
        Metrics to compare.
    models : list of str or None, optional
        Models to include; default derives from data.
    tasks : list of str or None, optional
        Tasks to include; default derives from data.
    splits : list of str or None, optional
        Splits to include; default derives from data.
    alpha : float, optional
        Significance level. Default is 0.05.

    Returns
    -------
    pd.DataFrame
        Results table with corrected p-values and significance flags.
    """
    if models is None:
        models = df["Model type"].unique()
    if tasks is None:
        tasks = df["Task"].unique()
    if splits is None:
        splits = df["Split"].unique()
    results = []
    for metric in metrics:
        for task in tasks:
            for split in splits:
                for i, model_a in enumerate(models):
                    for j, model_b in enumerate(models):
                        if i < j:
                            result = wilcoxon_pairwise_test(df, metric, model_a, model_b, task, split)
                            if result is not None:
                                results.append(result)
    if not results:
        return pd.DataFrame()
    results_df = pd.DataFrame(results)
    p_values = results_df["p_value"].values
    corrected_p_values, rejected = holm_bonferroni_correction(p_values)
    results_df["corrected_p_value"] = corrected_p_values
    results_df["significant_after_correction"] = rejected
    return results_df
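
A minimal sketch on synthetic results; the model, task, and split names are hypothetical.

import numpy as np
import pandas as pd
from uqdd.metrics.stats import pairwise_model_comparison

rng = np.random.default_rng(0)
models, folds = ["ensemble", "mc_dropout", "evidential"], 8
df = pd.DataFrame({
    "Model type": np.repeat(models, folds),
    "Task": "pKi",
    "Split": "random",
    "RMSE": rng.normal(loc=np.repeat([0.60, 0.65, 0.70], folds), scale=0.02),
})
results = pairwise_model_comparison(df, metrics=["RMSE"])
print(results[["model_a", "model_b", "p_value", "corrected_p_value", "significant_after_correction"]])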

uqdd.metrics.stats.friedman_nemenyi_test

friedman_nemenyi_test(df, metrics, models=None, alpha=0.05)

Run Friedman test across models with Nemenyi post-hoc where significant, per metric.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metrics list of str

Metrics to test.

required
models list of str or None

Models to include; default derives from data.

None
alpha float

Significance level. Default is 0.05.

0.05

Returns:

Type Description
dict

Mapping metric -> result dict containing stats, p-values, mean ranks, and optional post-hoc outputs.

Source code in uqdd/metrics/stats.py, lines 1082-1136
def friedman_nemenyi_test(df, metrics, models=None, alpha=0.05):
    """
    Run Friedman test across models with Nemenyi post-hoc where significant, per metric.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metrics : list of str
        Metrics to test.
    models : list of str or None, optional
        Models to include; default derives from data.
    alpha : float, optional
        Significance level. Default is 0.05.

    Returns
    -------
    dict
        Mapping metric -> result dict containing stats, p-values, mean ranks, and optional post-hoc outputs.
    """
    if models is None:
        models = df["Model type"].unique()
    results = {}
    for metric in metrics:
        pivot_data = df.pivot_table(values=metric, index=["Task", "Split"], columns="Model type", aggfunc="mean")
        available_models = [m for m in models if m in pivot_data.columns]
        pivot_data = pivot_data[available_models]
        pivot_data = pivot_data.dropna()
        if pivot_data.shape[0] < 2 or pivot_data.shape[1] < 3:
            results[metric] = {"error": "Insufficient data for Friedman test", "data_shape": pivot_data.shape}
            continue
        try:
            friedman_stat, friedman_p = friedmanchisquare(*[pivot_data[col].values for col in pivot_data.columns])
            ranks = pivot_data.rank(axis=1, ascending=False)
            mean_ranks = ranks.mean()
            result = {
                "friedman_statistic": friedman_stat,
                "friedman_p_value": friedman_p,
                "mean_ranks": mean_ranks.to_dict(),
                "significant": friedman_p < alpha,
            }
            if friedman_p < alpha:
                try:
                    data_array = pivot_data.values
                    nemenyi_result = sp.posthoc_nemenyi_friedman(data_array.T)
                    nemenyi_result.index = available_models
                    nemenyi_result.columns = available_models
                    result["nemenyi_p_values"] = nemenyi_result.to_dict()
                    result["critical_difference"] = calculate_critical_difference(len(available_models), pivot_data.shape[0], alpha)
                except Exception as e:
                    result["nemenyi_error"] = str(e)
            results[metric] = result
        except Exception as e:
            results[metric] = {"error": str(e)}
    return results
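
A minimal sketch with one block per (Task, Split) combination and three models; the block values are hypothetical constants chosen so that the ranking is identical in every block.

import pandas as pd
from uqdd.metrics.stats import friedman_nemenyi_test, plot_critical_difference_diagram

rows = []
for task in ["pKi", "pIC50"]:
    for split in ["random", "scaffold"]:
        for model, rmse in [("ensemble", 0.60), ("mc_dropout", 0.65), ("evidential", 0.70)]:
            rows.append({"Task": task, "Split": split, "Model type": model, "RMSE": rmse})
df = pd.DataFrame(rows)
results = friedman_nemenyi_test(df, metrics=["RMSE"])
print(results["RMSE"]["friedman_p_value"], results["RMSE"]["mean_ranks"])
plot_critical_difference_diagram(results, "RMSE")  # diagram helper documented further below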

uqdd.metrics.stats.calculate_critical_difference

calculate_critical_difference(k, n, alpha=0.05)

Compute the critical difference for average ranks in Nemenyi post-hoc tests.

Parameters:

Name Type Description Default
k int

Number of models.

required
n int

Number of datasets/blocks.

required
alpha float

Significance level. Default is 0.05.

0.05

Returns:

Type Description
float

Critical difference value.

Source code in uqdd/metrics/stats.py, lines 1139-1160
def calculate_critical_difference(k, n, alpha=0.05):
    """
    Compute the critical difference for average ranks in Nemenyi post-hoc tests.

    Parameters
    ----------
    k : int
        Number of models.
    n : int
        Number of datasets/blocks.
    alpha : float, optional
        Significance level. Default is 0.05.

    Returns
    -------
    float
        Critical difference value.
    """
    from scipy.stats import studentized_range
    q_alpha = studentized_range.ppf(1 - alpha, k, np.inf) / np.sqrt(2)
    cd = q_alpha * np.sqrt(k * (k + 1) / (6 * n))
    return cd
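
For example, with k = 4 models compared over n = 10 datasets at alpha = 0.05, the studentized-range quantile is about 3.63, so q_alpha ≈ 3.63 / sqrt(2) ≈ 2.57 and CD ≈ 2.57 * sqrt(4 * 5 / (6 * 10)) ≈ 1.48.

from uqdd.metrics.stats import calculate_critical_difference

cd = calculate_critical_difference(k=4, n=10, alpha=0.05)
print(round(cd, 2))  # approximately 1.48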

uqdd.metrics.stats.bootstrap_auc_difference

bootstrap_auc_difference(auc_values_a, auc_values_b, n_bootstrap=1000, ci=95, random_state=42)

Bootstrap confidence interval for difference of mean AUCs between two models.

Parameters:

Name Type Description Default
auc_values_a array-like

AUC values for model A.

required
auc_values_b array-like

AUC values for model B.

required
n_bootstrap int

Number of bootstrap resamples. Default is 1000.

1000
ci int or float

Confidence level in percent. Default is 95.

95
random_state int

Seed for reproducibility. Default is 42.

42

Returns:

Type Description
dict

{'mean_difference', 'ci_lower', 'ci_upper', 'bootstrap_differences'}

Source code in uqdd/metrics/stats.py, lines 1163-1197
def bootstrap_auc_difference(auc_values_a, auc_values_b, n_bootstrap=1000, ci=95, random_state=42):
    """
    Bootstrap confidence interval for difference of mean AUCs between two models.

    Parameters
    ----------
    auc_values_a : array-like
        AUC values for model A.
    auc_values_b : array-like
        AUC values for model B.
    n_bootstrap : int, optional
        Number of bootstrap resamples. Default is 1000.
    ci : int or float, optional
        Confidence level in percent. Default is 95.
    random_state : int, optional
        Seed for reproducibility. Default is 42.

    Returns
    -------
    dict
        {'mean_difference', 'ci_lower', 'ci_upper', 'bootstrap_differences'}
    """
    np.random.seed(random_state)
    differences = []
    for _ in range(n_bootstrap):
        sample_a = resample(auc_values_a, random_state=np.random.randint(0, 10000))
        sample_b = resample(auc_values_b, random_state=np.random.randint(0, 10000))
        diff = np.mean(sample_a) - np.mean(sample_b)
        differences.append(diff)
    differences = np.array(differences)
    alpha = (100 - ci) / 2
    ci_lower = np.percentile(differences, alpha)
    ci_upper = np.percentile(differences, 100 - alpha)
    original_diff = np.mean(auc_values_a) - np.mean(auc_values_b)
    return {"mean_difference": original_diff, "ci_lower": ci_lower, "ci_upper": ci_upper, "bootstrap_differences": differences}
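
A minimal sketch on hypothetical per-fold AUC values; a confidence interval that excludes zero suggests the difference is consistent across resamples.

from uqdd.metrics.stats import bootstrap_auc_difference

auc_a = [0.81, 0.83, 0.80, 0.85, 0.82]
auc_b = [0.78, 0.79, 0.77, 0.80, 0.78]
res = bootstrap_auc_difference(auc_a, auc_b, n_bootstrap=2000)
print(f"{res['mean_difference']:.3f} [{res['ci_lower']:.3f}, {res['ci_upper']:.3f}]")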

uqdd.metrics.stats.plot_critical_difference_diagram

plot_critical_difference_diagram(friedman_results, metric, save_dir=None, alpha=0.05)

Plot a simple critical difference diagram using mean ranks and CD value.

Parameters:

Name Type Description Default
friedman_results dict

Output dictionary from friedman_nemenyi_test.

required
metric str

Metric to plot.

required
save_dir str or None

Directory to save the plot. Default is None.

None
alpha float

Significance level used to compute CD. Default is 0.05.

0.05

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py, lines 1200-1271
def plot_critical_difference_diagram(friedman_results, metric, save_dir=None, alpha=0.05):
    """
    Plot a simple critical difference diagram using mean ranks and CD value.

    Parameters
    ----------
    friedman_results : dict
        Output dictionary from friedman_nemenyi_test.
    metric : str
        Metric to plot.
    save_dir : str or None, optional
        Directory to save the plot. Default is None.
    alpha : float, optional
        Significance level used to compute CD. Default is 0.05.

    Returns
    -------
    None
    """
    if metric not in friedman_results:
        print(f"Metric {metric} not found in Friedman results")
        return
    result = friedman_results[metric]
    if "error" in result:
        print(f"Error in Friedman test for {metric}: {result['error']}")
        return
    if not result["significant"]:
        print(f"Friedman test not significant for {metric}, skipping CD diagram")
        return
    mean_ranks = result["mean_ranks"]
    models = list(mean_ranks.keys())
    ranks = [mean_ranks[model] for model in models]
    sorted_indices = np.argsort(ranks)
    sorted_models = [models[i] for i in sorted_indices]
    sorted_ranks = [ranks[i] for i in sorted_indices]
    fig, ax = plt.subplots(figsize=(12, 6))
    y_pos = 0
    ax.scatter(sorted_ranks, [y_pos] * len(sorted_ranks), s=100, c="blue")
    for i, (model, rank) in enumerate(zip(sorted_models, sorted_ranks)):
        ax.annotate(model, (rank, y_pos), xytext=(0, 20), textcoords="offset points", ha="center", rotation=45)
    if "critical_difference" in result:
        cd = result["critical_difference"]
        groups = []
        for i, model_a in enumerate(sorted_models):
            group = [model_a]
            rank_a = sorted_ranks[i]
            for j, model_b in enumerate(sorted_models):
                if i != j:
                    rank_b = sorted_ranks[j]
                    if abs(rank_a - rank_b) <= cd:
                        if model_b not in [m for g in groups for m in g]:
                            group.append(model_b)
            if len(group) > 1:
                groups.append(group)
        colors = plt.cm.Set3(np.linspace(0, 1, len(groups)))
        for group, color in zip(groups, colors):
            if len(group) > 1:
                group_ranks = [sorted_ranks[sorted_models.index(m)] for m in group]
                min_rank, max_rank = min(group_ranks), max(group_ranks)
                ax.plot([min_rank, max_rank], [y_pos - 0.05, y_pos - 0.05], color=color, linewidth=3, alpha=0.7)
    ax.set_xlim(min(sorted_ranks) - 0.5, max(sorted_ranks) + 0.5)
    ax.set_ylim(-0.3, 0.5)
    ax.set_xlabel("Average Rank")
    ax.set_title(f"Critical Difference Diagram - {metric}")
    ax.grid(True, alpha=0.3)
    ax.set_yticks([])
    if save_dir:
        plot_name = f"critical_difference_{metric.replace(' ', '_')}"
        save_plot(fig, save_dir, plot_name)
    if INTERACTIVE_MODE:
        plt.show()
    plt.close()

uqdd.metrics.stats.analyze_significance

analyze_significance(df_raw, metrics, direction_dict, effect_dict, save_dir=None, model_order=None, activity=None)

End-to-end significance analysis and plotting across splits for multiple metrics.

Parameters:

Name Type Description Default
df_raw DataFrame

Raw results DataFrame.

required
metrics list of str

Metric names to analyze.

required
direction_dict dict

Mapping metric -> 'maximize'|'minimize'.

required
effect_dict dict

Mapping metric -> effect size threshold for visualization.

required
save_dir str or None

Directory to save plots and outputs. Default is None.

None
model_order list of str or None

Explicit ordering of models. Default derives from data.

None
activity str or None

Activity name for prefixes. Default is None.

None

Returns:

Type Description
None
Source code in uqdd/metrics/stats.py, lines 1274-1328
def analyze_significance(df_raw, metrics, direction_dict, effect_dict, save_dir=None, model_order=None, activity=None):
    """
    End-to-end significance analysis and plotting across splits for multiple metrics.

    Parameters
    ----------
    df_raw : pd.DataFrame
        Raw results DataFrame.
    metrics : list of str
        Metric names to analyze.
    direction_dict : dict
        Mapping metric -> 'maximize'|'minimize'.
    effect_dict : dict
        Mapping metric -> effect size threshold for visualization.
    save_dir : str or None, optional
        Directory to save plots and outputs. Default is None.
    model_order : list of str or None, optional
        Explicit ordering of models. Default derives from data.
    activity : str or None, optional
        Activity name for prefixes. Default is None.

    Returns
    -------
    None
    """
    df = harmonize_columns(df_raw)
    for metric in metrics:
        df[metric] = pd.to_numeric(df[metric], errors="coerce")
    df.replace([np.inf, -np.inf], np.nan, inplace=True)
    for split in df["split"].unique():
        df_s = df[df["split"] == split].copy()
        print(f"\n=== Split: {split} ===")
        name_prefix = f"06_{activity}_{split}" if activity else f"{split}"
        make_normality_diagnostic(df_s, metrics, save_dir=save_dir, name_prefix=name_prefix)
        for metric in metrics:
            print(f"\n-- Metric: {metric}")
            wide = df_s.pivot(index="cv_cycle", columns="method", values=metric)
            resid = (wide.T - wide.mean(axis=1)).T
            vals = resid.values.flatten()
            vals = vals[~np.isnan(vals)]
            W, p_norm = shapiro(vals) if len(vals) >= 3 else (None, None)
            if p_norm is None:
                print("Not enough data for Shapiro-Wilk test (need at least 3 non-NaN values), assuming non-normality")
            elif p_norm < 0.05:
                print(f"Shapiro-Wilk test for {metric} indicates non-normality (W={W:.3f}, p={p_norm:.3f})")
            else:
                print(f"Shapiro-Wilk test for {metric} indicates normality (W={W:.3f}, p={p_norm:.3f})")
        make_boxplots(df_s, metrics, save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)
        make_boxplots_parametric(df_s, metrics, save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)
        make_boxplots_nonparametric(df_s, metrics, save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)
        make_sign_plots_nonparametric(df_s, metrics, save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)
        make_critical_difference_diagrams(df_s, metrics, save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)
        make_mcs_plot_grid(df=df_s, stats_list=metrics, group_col="method", alpha=0.05, figsize=(30, 15), direction_dict=direction_dict, effect_dict=effect_dict, show_diff=True, sort_axes=True, save_dir=save_dir, name_prefix=name_prefix + "_diff", model_order=model_order)
        make_mcs_plot_grid(df=df_s, stats_list=metrics, group_col="method", alpha=0.05, figsize=(30, 15), direction_dict=direction_dict, effect_dict=effect_dict, show_diff=False, sort_axes=True, save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)
        make_ci_plot_grid(df_s, metrics, group_col="method", save_dir=save_dir, name_prefix=name_prefix, model_order=model_order)

uqdd.metrics.stats.comprehensive_statistical_analysis

comprehensive_statistical_analysis(df, metrics, models=None, tasks=None, splits=None, save_dir=None, alpha=0.05)

Run a comprehensive suite of statistical tests and export results.

Parameters:

Name Type Description Default
df DataFrame

Input DataFrame.

required
metrics list of str

Metrics to analyze.

required
models list of str or None

Models to include. Default derives from data.

None
tasks list of str or None

Tasks to include. Default derives from data.

None
splits list of str or None

Splits to include. Default derives from data.

None
save_dir str or None

Directory to save tables and JSON outputs. Default is None.

None
alpha float

Significance level. Default is 0.05.

0.05

Returns:

Type Description
dict

Results dict including pairwise tests, Friedman/Nemenyi outputs, and optional AUC bootstrap comparisons.

Source code in uqdd/metrics/stats.py, lines 1331-1406
def comprehensive_statistical_analysis(df, metrics, models=None, tasks=None, splits=None, save_dir=None, alpha=0.05):
    """
    Run a comprehensive suite of statistical tests and export results.

    Parameters
    ----------
    df : pd.DataFrame
        Input DataFrame.
    metrics : list of str
        Metrics to analyze.
    models : list of str or None, optional
        Models to include. Default derives from data.
    tasks : list of str or None, optional
        Tasks to include. Default derives from data.
    splits : list of str or None, optional
        Splits to include. Default derives from data.
    save_dir : str or None, optional
        Directory to save tables and JSON outputs. Default is None.
    alpha : float, optional
        Significance level. Default is 0.05.

    Returns
    -------
    dict
        Results dict including pairwise tests, Friedman/Nemenyi outputs, and optional AUC bootstrap comparisons.
    """
    print("Performing comprehensive statistical analysis...")
    results = {}
    print("1. Running pairwise Wilcoxon signed-rank tests...")
    pairwise_results = pairwise_model_comparison(df, metrics, models, tasks, splits, alpha)
    results["pairwise_tests"] = pairwise_results
    print("2. Running Friedman tests with Nemenyi post-hoc...")
    friedman_results = friedman_nemenyi_test(df, metrics, models, alpha)
    results["friedman_nemenyi"] = friedman_results
    auc_columns = [col for col in df.columns if "AUC" in col or "auc" in col]
    if auc_columns:
        print("3. Running bootstrap comparisons for AUC metrics...")
        auc_bootstrap_results = {}
        for auc_col in auc_columns:
            auc_bootstrap_results[auc_col] = {}
            available_models = df["Model type"].unique() if models is None else models
            for i, model_a in enumerate(available_models):
                for j, model_b in enumerate(available_models):
                    if i < j:
                        auc_a = df[df["Model type"] == model_a][auc_col].dropna().values
                        auc_b = df[df["Model type"] == model_b][auc_col].dropna().values
                        if len(auc_a) > 0 and len(auc_b) > 0:
                            bootstrap_result = bootstrap_auc_difference(auc_a, auc_b)
                            auc_bootstrap_results[auc_col][f"{model_a}_vs_{model_b}"] = bootstrap_result
        results["auc_bootstrap"] = auc_bootstrap_results
    if save_dir:
        os.makedirs(save_dir, exist_ok=True)
        if not pairwise_results.empty:
            pairwise_results.to_csv(os.path.join(save_dir, "pairwise_statistical_tests.csv"), index=False)
        import json
        with open(os.path.join(save_dir, "friedman_nemenyi_results.json"), "w") as f:
            json_compatible_results = {}
            for metric, result in friedman_results.items():
                json_compatible_results[metric] = {}
                for key, value in result.items():
                    if isinstance(value, (np.ndarray, np.generic)):
                        json_compatible_results[metric][key] = value.tolist()
                    elif isinstance(value, dict):
                        json_compatible_results[metric][key] = {str(k): (float(v) if isinstance(v, (np.ndarray, np.generic)) else v) for k, v in value.items()}
                    else:
                        json_compatible_results[metric][key] = (float(value) if isinstance(value, (np.ndarray, np.generic)) else value)
            json.dump(json_compatible_results, f, indent=2)
        if auc_columns:
            with open(os.path.join(save_dir, "auc_bootstrap_results.json"), "w") as f:
                json_compatible_auc = {}
                for auc_col, comparisons in results["auc_bootstrap"].items():
                    json_compatible_auc[auc_col] = {}
                    for comparison, result in comparisons.items():
                        json_compatible_auc[auc_col][comparison] = {k: v.tolist() if isinstance(v, np.ndarray) else v for k, v in result.items()}
                json.dump(json_compatible_auc, f, indent=2)
    return results

uqdd.metrics.stats.generate_statistical_report

generate_statistical_report(results, save_dir=None, df_raw=None, metrics=None, direction_dict=None, effect_dict=None)

Generate a human-readable text report from comprehensive statistical results and optionally run plots.

Parameters:

Name Type Description Default
results dict

Output of comprehensive_statistical_analysis.

required
save_dir str or None

Directory to save the report text file. Default is None.

None
df_raw DataFrame or None

Raw DataFrame to run plotting-based significance analysis. Default is None.

None
metrics list of str or None

Metrics to plot (when df_raw provided).

None
direction_dict dict or None

Direction mapping for metrics (required when df_raw provided).

None
effect_dict dict or None

Effect threshold mapping (required when df_raw provided).

None

Returns:

Type Description
str

Report text.

Source code in uqdd/metrics/stats.py, lines 1409-1505
def generate_statistical_report(results, save_dir=None, df_raw=None, metrics=None, direction_dict=None, effect_dict=None):
    """
    Generate a human-readable text report from comprehensive statistical results and optionally run plots.

    Parameters
    ----------
    results : dict
        Output of comprehensive_statistical_analysis.
    save_dir : str or None, optional
        Directory to save the report text file. Default is None.
    df_raw : pd.DataFrame or None, optional
        Raw DataFrame to run plotting-based significance analysis. Default is None.
    metrics : list of str or None, optional
        Metrics to plot (when df_raw provided).
    direction_dict : dict or None, optional
        Direction mapping for metrics (required when df_raw provided).
    effect_dict : dict or None, optional
        Effect threshold mapping (required when df_raw provided).

    Returns
    -------
    str
        Report text.
    """
    report = []
    report.append("=" * 80)
    report.append("COMPREHENSIVE STATISTICAL ANALYSIS REPORT")
    report.append("=" * 80)
    report.append("")
    if "pairwise_tests" in results and not results["pairwise_tests"].empty:
        pairwise_df = results["pairwise_tests"]
        report.append("1. PAIRWISE MODEL COMPARISONS (Wilcoxon Signed-Rank Test)")
        report.append("-" * 60)
        significant = pairwise_df[pairwise_df["significant_after_correction"] == True]
        report.append(f"Total pairwise comparisons performed: {len(pairwise_df)}")
        report.append(f"Significant differences (after Holm-Bonferroni correction): {len(significant)}")
        report.append("")
        if len(significant) > 0:
            report.append("Significant differences found:")
            for _, row in significant.iterrows():
                effect_size = row["effect_size_interpretation"]
                report.append(f"  • {row['model_a']} vs {row['model_b']} ({row['metric']}, {row['split']}):")
                report.append(f"    - p-value: {row['p_value']:.4f} (corrected: {row['corrected_p_value']:.4f})")
                report.append(f"    - Cliff's Δ: {row['cliffs_delta']:.3f} ({effect_size} effect)")
                report.append(f"    - Median difference: {row['median_difference']:.4f} [{row['ci_lower']:.4f}, {row['ci_upper']:.4f}]")
                report.append(f"    - {row['practical_significance']}")
                report.append("")
        else:
            report.append("No significant differences found after multiple comparison correction.")
            report.append("")
    if "friedman_nemenyi" in results:
        friedman_results = results["friedman_nemenyi"]
        report.append("2. MULTIPLE MODEL COMPARISONS (Friedman + Nemenyi Tests)")
        report.append("-" * 60)
        for metric, result in friedman_results.items():
            if "error" in result:
                report.append(f"{metric}: {result['error']}")
                continue
            report.append(f"Metric: {metric}")
            report.append(f"  Friedman test p-value: {result['friedman_p_value']:.4f}")
            if result["significant"]:
                report.append("  Result: Significant difference between models detected")
                mean_ranks = result["mean_ranks"]
                sorted_ranks = sorted(mean_ranks.items(), key=lambda x: x[1])
                report.append("  Model rankings (lower rank = better performance):")
                for i, (model, rank) in enumerate(sorted_ranks, 1):
                    report.append(f"    {i}. {model}: {rank:.2f}")
                if "critical_difference" in result:
                    report.append(f"  Critical difference: {result['critical_difference']:.3f}")
            else:
                report.append("  Result: No significant difference between models")
            report.append("")
    if "auc_bootstrap" in results:
        auc_results = results["auc_bootstrap"]
        report.append("3. AUC BOOTSTRAP COMPARISONS")
        report.append("-" * 60)
        for auc_col, comparisons in auc_results.items():
            report.append(f"AUC Metric: {auc_col}")
            for comparison, result in comparisons.items():
                model_a, model_b = comparison.split("_vs_")
                mean_diff = result["mean_difference"]
                ci_lower = result["ci_lower"]
                ci_upper = result["ci_upper"]
                significance = "difference is small (CI includes 0)" if (ci_lower <= 0 <= ci_upper) else "difference may be meaningful"
                report.append(f"  {model_a} vs {model_b}:")
                report.append(f"    Mean difference: {mean_diff:.4f} [{ci_lower:.4f}, {ci_upper:.4f}]")
                report.append(f"    {significance}")
            report.append("")
    report_text = "\n".join(report)
    if save_dir:
        os.makedirs(save_dir, exist_ok=True)
        with open(os.path.join(save_dir, "statistical_analysis_report.txt"), "w") as f:
            f.write(report_text)
    print(report_text)
    if df_raw is not None and metrics is not None and direction_dict is not None and effect_dict is not None:
        analyze_significance(df_raw, metrics, direction_dict, effect_dict, save_dir=save_dir)
    return report_text
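
An end-to-end sketch on synthetic results (hypothetical model, task, and split names); the report and intermediate tables are written to save_dir.

import numpy as np
import pandas as pd
from uqdd.metrics.stats import comprehensive_statistical_analysis, generate_statistical_report

rng = np.random.default_rng(0)
rows = []
for task in ["pKi", "pIC50"]:
    for split in ["random", "scaffold"]:
        for model, base in [("ensemble", 0.60), ("mc_dropout", 0.65), ("evidential", 0.70)]:
            for _ in range(8):  # eight hypothetical CV folds
                rows.append({"Model type": model, "Task": task, "Split": split,
                             "RMSE": base + rng.normal(0, 0.02)})
df = pd.DataFrame(rows)
results = comprehensive_statistical_analysis(df, metrics=["RMSE"], save_dir="stats_out")
report = generate_statistical_report(results, save_dir="stats_out")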