Cutoff selection

One of the most important tasks when constructing a cluster expansion is the choice of appropriate cutoffs. A simple way to select cutoffs is to scan a wide range of different values and monitor how the cross validation error changes, as will be demonstrated below.

Here, we will use the randomized dataset for the \(\ce{Mo_{1-x}V_xC_{1-y}}\) system.

[1]:
from ase.db import connect
from ase.io import read

# setup
chemical_symbols = [['C', 'Be'], ['Mo', 'V']]
db = connect('../dft-databases/rocksalt_MoVCvac_randomized.db')
prim = read('../structures/MoC_rocksalt_prim.xyz')

# collect training structures
training_structures = []
for row in db.select():
    structure = row.toatoms()
    structure.info['mixing_energy'] = row.mixing_energy
    training_structures.append(structure)
[2]:
from icet import ClusterSpace, StructureContainer

def get_fit_data(cutoffs: list[float]):
    """ Helper function for compiling the sensing matrix and
    target mixing energies given a set of cutoffs. """
    cs = ClusterSpace(prim, cutoffs, chemical_symbols)
    sc = StructureContainer(cs)
    for structure in training_structures:
        sc.add_structure(
            structure,
            properties=dict(mixing_energy=structure.info['mixing_energy']))
    return sc.get_fit_data(key='mixing_energy')

Selecting the pair cutoff

Next, we loop over different pairwise cutoffs and compute the cross-validation.

[3]:
import numpy as np
from pandas import DataFrame
from trainstation import CrossValidationEstimator

# parameters
fit_method = 'ardr'
train_size = 350
n_splits = 300
cutoff_pair_vals = [4, 5, 6, 7, 8, 9, 10, 11, 12]

# run pair cutoff scan
records = []
for c2 in cutoff_pair_vals:
    print(f'pair cutoff: {c2:4.1f}')
    A, y = get_fit_data([c2])

    cve = CrossValidationEstimator(
        (A, y), validation_method='shuffle-split', n_splits=n_splits,
        fit_method=fit_method, train_size=train_size)
    cve.train()
    cve.validate()

    records.append(dict(
        cutoff_pair=c2,
        rmse_train=cve.rmse_train,
        rmse_validation=cve.rmse_validation,
        n_parameters=cve.n_parameters,
    ))

df2 = DataFrame(records)
pair cutoff:  4.0
pair cutoff:  5.0
pair cutoff:  6.0
pair cutoff:  7.0
pair cutoff:  8.0
pair cutoff:  9.0
pair cutoff: 10.0
pair cutoff: 11.0
pair cutoff: 12.0
[4]:
from matplotlib import pyplot as plt

fig, ax = plt.subplots(
    figsize=(4.5, 3.4),
    dpi=120,
)

ax.plot(df2.cutoff_pair, 1e3 * df2.rmse_train,
        '--s', label='training error')
ax.plot(df2.cutoff_pair, 1e3 * df2.rmse_validation,
        '-o', label='validation error')
ax.set_xlabel('Pair cutoff (Å)')
ax.set_ylabel('RMSE (meV/site)')
ax.set_ylim(0, 15)

ax2 = ax.twinx()
ax2.plot(df2.cutoff_pair, df2.n_parameters,
         ':d', color='tab:red', label='number of parameters')
ax2.set_ylim(0, 60)
ax2.set_ylabel('Number of parameters')

ax.plot(np.nan, np.nan, ':d', color='tab:red', label='number of parameters')
ax.legend(loc='upper center')

fig.tight_layout()
../_images/part-1_cutoff-selection_5_0.png

From the above figure it is apparent that the validation error does not improve beyond 9 Å. In fact it even increases slightly again. This indicates that a suitable choice for the pair cutoff is 9 Å.

Selecting the triplet cutoff

Next, we repeat the same procedure for the triplet cutoff using a fixed pair cutoff of 9 Å.

[5]:
cutoff_pair_final = 9.0
cutoff_triplet_vals = [3, 4, 5, 6, 7]
records = []
for c3 in cutoff_triplet_vals:
    print(f'cutoff triplet: {c3}')
    A, y = get_fit_data([cutoff_pair_final, c3])

    cve = CrossValidationEstimator(
        (A, y), validation_method='shuffle-split', n_splits=n_splits,
        fit_method=fit_method, train_size=train_size)
    cve.train()
    cve.validate()

    records.append(dict(
        cutoff_triplet=c3,
        rmse_train=cve.rmse_train,
        rmse_validation=cve.rmse_validation,
        n_parameters=cve.n_parameters,
    ))

df3 = DataFrame(records)
cutoff triplet: 3
cutoff triplet: 4
cutoff triplet: 5
cutoff triplet: 6
cutoff triplet: 7
[6]:
fig, ax = plt.subplots(
    figsize=(4.5, 3.4),
    dpi=120,
)

ax.plot(df3.cutoff_triplet, 1e3 * df3.rmse_train,
        '--s', label='training error')
ax.plot(df3.cutoff_triplet, 1e3 * df3.rmse_validation,
        '-o', label='validation error')
ax.set_ylim(0, 10)
ax.set_xlabel('Triplet cutoff (Å)')
ax.set_ylabel('RMSE (meV/site)')

ax2 = ax.twinx()
ax2.plot(df3.cutoff_triplet, df3.n_parameters,
         ':d', color='tab:red', label='number of parameters')
ax2.set_ylim(0, 150)
ax2.set_ylabel('Number of parameters')

ax.plot(np.nan, np.nan, ':d', color='tab:red', label='number of parameters')
ax.legend(loc='upper left')

fig.tight_layout()
../_images/part-1_cutoff-selection_9_0.png

Applying the same logic as in the case of the pair cutoff this analysis indicates a triplet cutoff of 5 Å. This process can be repeated for higher order clusters (quadruplets etc) but in the present case this will not lead to a significant improvement of the model.

In general simpler models (shorter cutoffs) tend to yield more transferable and well-behaved models. In general, selecting cutoffs is not an exact procedure and one needs to balance accuracy vs simplicity of the model.