The data module (src/data/) handles loading, processing, and batching of crystal structure data.
Data Flow¶
Key Classes¶
CrystalBatch¶
The core data container for batched crystal structures (src/data/schema.py):
from src.data.schema import CrystalBatch
# CrystalBatch contains:
# - atom_types: Tensor of atomic numbers
# - frac_coords: Fractional coordinates
# - lengths: Lattice vector lengths
# - angles: Lattice angles
# - num_atoms: Number of atoms per structure
# - batch: Batch indices
DataModule¶
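Batching variable-size crystals usually means concatenating per-atom fields across structures and keeping one row per structure for lattice fields, with a per-atom index recording which structure each atom came from. The sketch below illustrates that layout for the CrystalBatch fields listed above in plain Python. `ToyCrystal` and `collate` are hypothetical names introduced here, and the project's actual collation (presumably tensor-based) may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ToyCrystal:
    """Hypothetical single-structure record; not the project's schema."""
    atom_types: List[int]                          # atomic numbers, one per atom
    frac_coords: List[Tuple[float, float, float]]  # fractional coordinates, one per atom
    lengths: Tuple[float, float, float]            # lattice lengths a, b, c
    angles: Tuple[float, float, float]             # lattice angles alpha, beta, gamma


def collate(crystals: List[ToyCrystal]) -> Dict[str, list]:
    """Concatenate per-atom fields and stack per-structure fields."""
    out: Dict[str, list] = {
        "atom_types": [], "frac_coords": [], "lengths": [],
        "angles": [], "num_atoms": [], "batch": [],
    }
    for idx, crystal in enumerate(crystals):
        n = len(crystal.atom_types)
        out["atom_types"].extend(crystal.atom_types)
        out["frac_coords"].extend(crystal.frac_coords)
        out["lengths"].append(crystal.lengths)
        out["angles"].append(crystal.angles)
        out["num_atoms"].append(n)
        out["batch"].extend([idx] * n)  # maps each atom to its structure index
    return out


# Two toy structures with different atom counts (values are illustrative only).
two_atom = ToyCrystal([55, 17], [(0.0, 0.0, 0.0), (0.5, 0.5, 0.5)],
                      (4.2, 4.2, 4.2), (90.0, 90.0, 90.0))
one_atom = ToyCrystal([14], [(0.0, 0.0, 0.0)],
                      (3.0, 3.0, 3.0), (90.0, 90.0, 90.0))

batch = collate([two_atom, one_atom])
print(batch["num_atoms"])  # [2, 1]
print(batch["batch"])      # [0, 0, 1]
```

Under this layout, `sum(num_atoms)` equals the length of every per-atom field, and `batch` lets per-atom values be pooled back to per-structure values.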
PyTorch Lightning DataModule for training (src/data/datamodule.py):
from src.data import DataModule
datamodule = DataModule(
    data_dir="data/mp-20",
    batch_size=32,
    num_workers=4,
)
Featurizer¶
Converts pymatgen structures to model-ready features (src/utils/featurizer.py):
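from pymatgen.core import Lattice, Structure
from src.utils.featurizer import featurize
# Example input for illustration only: cubic CsCl built with pymatgen
structure = Structure(Lattice.cubic(4.2), ["Cs", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]])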
# Featurize structures
features = featurize(
structures=[structure],
model_path=None, # Uses default pre-trained VAE
batch_size=2000,
)
# Returns dict with "structure_features", "composition_features", "atom_features"Configuration¶
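The `_target_` key and `${...}` interpolation used in the data configs look like Hydra/OmegaConf conventions. Assuming that is the case, the config is turned into an object by importing the dotted path in `_target_` and passing the remaining keys as keyword arguments. The toy `instantiate` below is a hypothetical stand-in that shows only this core idea; Hydra's own `hydra.utils.instantiate` additionally handles interpolation, nested configs, and more.

```python
import importlib
from typing import Any, Dict


def instantiate(cfg: Dict[str, Any]) -> Any:
    """Toy stand-in for Hydra-style instantiation (no interpolation, no recursion)."""
    kwargs = dict(cfg)  # copy so the caller's config is left untouched
    module_path, _, attr = kwargs.pop("_target_").rpartition(".")
    target = getattr(importlib.import_module(module_path), attr)
    return target(**kwargs)


# A standard-library target is used so the sketch runs anywhere; in the project
# the target would be src.data.datamodule.DataModule with its own keyword arguments.
cfg = {"_target_": "fractions.Fraction", "numerator": 3, "denominator": 4}
obj = instantiate(cfg)
print(obj)  # 3/4
```

With this pattern, switching datasets or batch sizes is a config change rather than a code change.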
See configs/data/ for data configurations:
# configs/data/mp-20.yaml (default)
_target_: src.data.datamodule.DataModule
data_dir: ${paths.data_dir}/mp-20
batch_size: 256
dataset_type: "mp_20"
num_workers: 16
Learn More¶
Predictor-Based Reward Tutorial - Guide on using predictors for RL rewards
API Reference - Full API documentation