Reinforcement Learning (RL) is the third stage of the Chemeleon2 pipeline. It fine-tunes the LDM to generate crystal structures that maximize user-defined reward functions.
## What RL Does
The RL module fine-tunes the LDM so that generated crystal structures are optimized for specific material properties. For architectural details, see RL Module.
Key concepts (see `src/rl_module/rl_module.py`):

- **GRPO Algorithm**: Group Relative Policy Optimization for efficient training
- **Reward Functions**: Define which properties to optimize (see `src/rl_module/reward.py`)
- **Policy Update**: Adjust LDM weights to favor high-reward structures (see the sketch below)
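These three pieces fit together in a simple sample, score, update loop. Below is a minimal sketch of one iteration, not the actual `src/rl_module/rl_module.py` implementation; the `ldm`, `vae`, `reward_fn`, and `optimizer` handles are hypothetical stand-ins for the real modules.

```python
def rl_step(ldm, vae, reward_fn, optimizer, num_group_samples: int = 64):
    """One RL iteration (illustrative sketch; handles are hypothetical
    stand-ins for the modules wired together in src/rl_module/rl_module.py)."""
    latents, log_probs = ldm.sample(num_group_samples)  # sample from the current policy
    structures = vae.decode(latents)                    # decode latents to crystal structures
    rewards = reward_fn(structures)                     # score structures with the reward function
    # REINFORCE-style surrogate: raise the likelihood of high-reward samples
    loss = -(rewards.detach() * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rewards.mean().item()
```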
## Prerequisites
RL training requires both trained LDM and VAE checkpoints. The LDM is fine-tuned with reward signals, while the VAE decodes latent vectors to structures for reward computation.
```yaml
# In config files
rl_module:
  ldm_ckpt_path: ${hub:mp_20_ldm_base}  # Or use local path
  vae_ckpt_path: ${hub:mp_20_vae}
```

```bash
# In CLI
python src/train_rl.py \
  rl_module.ldm_ckpt_path='${hub:mp_20_ldm_base}' \
  rl_module.vae_ckpt_path='${hub:mp_20_vae}'
```

See Checkpoint Management for available checkpoints.
## Quick Start
```bash
# Fine-tune with de novo generation reward (src/train_rl.py)
python src/train_rl.py experiment=mp_20/rl_dng
```

- Training script: `src/train_rl.py`
- Example config: `configs/experiment/mp_20/rl_dng.yaml`
## Training Commands
### Basic Training
```bash
# Use custom reward config
python src/train_rl.py custom_reward=rl_dng

# Override checkpoint paths (e.g., use alex_mp_20 model)
python src/train_rl.py custom_reward=rl_dng \
  rl_module.ldm_ckpt_path='${hub:alex_mp_20_ldm_base}' \
  rl_module.vae_ckpt_path='${hub:alex_mp_20_vae}'

# Override RL hyperparameters
python src/train_rl.py custom_reward=rl_dng \
  rl_module.rl_configs.num_group_samples=128 \
  data.batch_size=8
```

## GRPO Algorithm
Chemeleon2 uses Group Relative Policy Optimization (GRPO) for efficient RL training:
1. **Sample Groups**: Generate multiple structures per batch
2. **Compute Rewards**: Evaluate all structures in the group
3. **Relative Ranking**: Compare rewards within each group (see the sketch below)
4. **Policy Update**: Reinforce high-reward structures relative to the group
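The relative-ranking step is what distinguishes GRPO from vanilla policy gradients: instead of a learned value baseline, each reward is normalized against its own group's statistics. A minimal sketch of that computation, assuming a `[num_groups, num_group_samples]` reward tensor (the actual logic lives in `src/rl_module/rl_module.py`):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Normalize rewards within each group (sketch of the GRPO idea).
    `rewards` has shape [num_groups, num_group_samples]."""
    mean = rewards.mean(dim=1, keepdim=True)  # per-group mean reward
    std = rewards.std(dim=1, keepdim=True)    # per-group reward spread
    return (rewards - mean) / (std + eps)     # positive = better than the group average

# Example: one group with a clear winner, one nearly uniform group
rewards = torch.tensor([[1.0, 2.0, 3.0, 4.0],
                        [0.1, 0.1, 0.2, 0.0]])
print(group_relative_advantages(rewards))
```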
### Key GRPO Hyperparameters
| Parameter | Default | Description |
|---|---|---|
| `num_group_samples` | 64 | Structures per group |
| `group_reward_norm` | `true` | Normalize rewards within each group (required for GRPO) |
| `num_inner_batch` | 2 | Number of inner batches for gradient accumulation |
| `clip_ratio` | 0.001 | PPO-style clipping ratio |
| `kl_weight` | 1.0 | KL divergence penalty weight |
| `entropy_weight` | 1e-5 | Entropy regularization weight |
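To see where these knobs enter the objective: `clip_ratio` bounds the per-sample policy ratio PPO-style, `kl_weight` penalizes drift from the frozen pre-RL LDM, and `entropy_weight` discourages the policy from collapsing. The sketch below assumes a standard PPO-style formulation; it illustrates the roles of the hyperparameters rather than transcribing the Chemeleon2 loss.

```python
import torch

def grpo_objective(log_probs, old_log_probs, ref_log_probs, advantages,
                   clip_ratio=0.001, kl_weight=1.0, entropy_weight=1e-5):
    """Schematic clipped policy objective with KL and entropy terms
    (illustrative; not the exact Chemeleon2 implementation)."""
    ratio = torch.exp(log_probs - old_log_probs)               # policy ratio
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    kl_penalty = (log_probs - ref_log_probs).mean()            # drift from the frozen LDM
    entropy_bonus = -log_probs.mean()                          # crude entropy estimate
    return policy_loss + kl_weight * kl_penalty - entropy_weight * entropy_bonus
```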
```bash
# Example: adjust group size
python src/train_rl.py custom_reward=rl_dng \
  rl_module.rl_configs.num_group_samples=128
```

## Reward Configuration
Rewards are defined in the `reward_fn` section of the config (see `configs/train_rl.yaml` for defaults):

```yaml
rl_module:
  reward_fn:
    _target_: src.rl_module.reward.ReinforceReward
    normalize_fn: std  # Global normalization
    eps: 1e-4
    reference_dataset: mp-20  # For novelty/uniqueness metrics
    components:
      - _target_: src.rl_module.components.CreativityReward
        weight: 1.0
        normalize_fn: null
      - _target_: src.rl_module.components.EnergyReward
        weight: 1.0
        normalize_fn: norm
      - _target_: src.rl_module.components.StructureDiversityReward
        weight: 0.1
        normalize_fn: norm
      - _target_: src.rl_module.components.CompositionDiversityReward
        weight: 1.0
        normalize_fn: norm
```

See Custom Rewards Guide for detailed component documentation (`src/rl_module/components.py`).
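Conceptually, each component produces a per-structure reward that is normalized according to its `normalize_fn` and combined according to its `weight`. A sketch of that aggregation, assuming a simple weighted sum (the component names mirror the config above; the helper itself is hypothetical, see `src/rl_module/reward.py` for the real composition):

```python
import torch

def combine_rewards(component_rewards: dict, weights: dict,
                    eps: float = 1e-4) -> torch.Tensor:
    """Weighted sum of normalized component rewards (hypothetical sketch of
    how the `components` list above is aggregated)."""
    total = torch.zeros_like(next(iter(component_rewards.values())))
    for name, r in component_rewards.items():
        r = (r - r.mean()) / (r.std() + eps)  # stands in for normalize_fn: norm
        total = total + weights[name] * r
    return total

# Example with the weights from the config above
n = 8  # structures in a batch
rewards = {name: torch.rand(n) for name in
           ["creativity", "energy", "structure_diversity", "composition_diversity"]}
weights = {"creativity": 1.0, "energy": 1.0,
           "structure_diversity": 0.1, "composition_diversity": 1.0}
print(combine_rewards(rewards, weights))
```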
## Available Experiments
| Custom Reward Config | Dataset | Reward | Description |
|---|---|---|---|
| `atomic_density` | Alex MP-20 | Custom | Example: atomic density optimization (see Custom Reward tutorial) |
| `rl_dng` | MP-20 | DNG (multi-objective) | Paper's de novo generation reward (see DNG Reward tutorial) |
| `rl_bandgap` | Alex MP-20 | Predictor-based | Band gap optimization (see Predictor Reward tutorial) |
## Training Tips
### Monitoring
Key metrics to watch in WandB:
- `train/reward`: Average reward (should increase)
- `train/kl_div`: KL divergence from the original LDM
- `val/reward`: Validation reward, to check generalization
- Component-specific metrics (e.g., `train/creativity`, `train/energy`)
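To inspect these curves outside the WandB UI, the public `wandb` API can pull a run's history; the run path below is a placeholder.

```python
import wandb

api = wandb.Api()
run = api.run("my-entity/chemeleon2-rl/run_id")  # placeholder entity/project/run_id
# Fetch the reward and KL curves as a pandas DataFrame
history = run.history(keys=["train/reward", "train/kl_div", "val/reward"])
print(history.tail())
```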
### Hyperparameter Tuning
| Issue | Solution |
|---|---|
| Unstable training | Increase `num_group_samples`, enable `group_reward_norm` |
| Mode collapse | Increase `kl_weight`, add diversity rewards |
| Slow convergence | Decrease `kl_weight`, increase reward weights |
| Poor structure quality | Add an `EnergyReward` component |
### Typical Training
- Duration: ~500-2000 steps
- Batch size: 5 (default for GRPO with 64 group samples)
- GPU memory: scales with `num_group_samples` (64 samples × 5 batch = 320 structures per step)
## Next Steps
- Custom Rewards Overview - Learn about reward components
- DNG Reward Tutorial - Paper's multi-objective reward
- Predictor Reward Tutorial - Property optimization