Reinforcement Learning (RL) is the third stage of the Chemeleon2 pipeline. It fine-tunes the LDM to generate crystal structures that maximize user-defined reward functions.
## What RL Does
The RL module fine-tunes the LDM so that generated crystal structures are optimized for specific material properties. For architectural details, see RL Module.
Key concepts (see `src/rl_module/rl_module.py`):

- **GRPO Algorithm**: Group Relative Policy Optimization for efficient training
- **Reward Functions**: Define which properties to optimize (see `src/rl_module/reward.py`)
- **Policy Update**: Adjust LDM weights to favor high-reward structures (see the sketch below)
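These three pieces fit together in a simple sample, score, update loop. Below is a minimal sketch of one iteration, not the actual `src/rl_module/rl_module.py` implementation; the `ldm`, `vae`, `reward_fn`, and `optimizer` handles are hypothetical stand-ins for the real modules.

```python
def rl_step(ldm, vae, reward_fn, optimizer, num_group_samples: int = 64):
    """One RL iteration (illustrative sketch; handles are hypothetical
    stand-ins for the modules wired together in src/rl_module/rl_module.py)."""
    latents, log_probs = ldm.sample(num_group_samples)  # sample from the current policy
    structures = vae.decode(latents)                    # decode latents to crystal structures
    rewards = reward_fn(structures)                     # score structures with the reward function
    # REINFORCE-style surrogate: raise the likelihood of high-reward samples
    loss = -(rewards.detach() * log_probs).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rewards.mean().item()
```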
## Prerequisites
RL training requires both trained LDM and VAE checkpoints. The LDM is fine-tuned with reward signals, while the VAE decodes latent vectors to structures for reward computation.
```yaml
# In config files
rl_module:
  ldm_ckpt_path: ${hub:mp_20_ldm_base}  # Or use local path
  vae_ckpt_path: ${hub:mp_20_vae}
```

```bash
# In CLI
python src/train_rl.py \
  rl_module.ldm_ckpt_path='${hub:mp_20_ldm_base}' \
  rl_module.vae_ckpt_path='${hub:mp_20_vae}'
```

See Checkpoint Management for available checkpoints.
## Quick Start
```bash
# Fine-tune with de novo generation reward (src/train_rl.py)
python src/train_rl.py experiment=mp_20/rl_dng
```

- Training script: `src/train_rl.py`
- Example config: `configs/experiment/mp_20/rl_dng.yaml`
## Training Commands
### Basic Training
```bash
# Use custom reward config
python src/train_rl.py custom_reward=rl_dng

# Override checkpoint paths (e.g., use alex_mp_20 model)
python src/train_rl.py custom_reward=rl_dng \
  rl_module.ldm_ckpt_path='${hub:alex_mp_20_ldm_base}' \
  rl_module.vae_ckpt_path='${hub:alex_mp_20_vae}'

# Override RL hyperparameters
python src/train_rl.py custom_reward=rl_dng \
  rl_module.rl_configs.num_group_samples=128 \
  data.batch_size=8
```

## GRPO Algorithm
Chemeleon2 uses Group Relative Policy Optimization (GRPO) for efficient RL training:
1. **Sample Groups**: Generate multiple structures per batch
2. **Compute Rewards**: Evaluate all structures in the group
3. **Relative Ranking**: Compare rewards within each group (see the sketch below)
4. **Policy Update**: Reinforce high-reward structures relative to the group
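The relative-ranking step is what distinguishes GRPO from vanilla policy gradients: instead of a learned value baseline, each reward is normalized against its own group's statistics. A minimal sketch of that computation, assuming a `[num_groups, num_group_samples]` reward tensor (the actual logic lives in `src/rl_module/rl_module.py`):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-4) -> torch.Tensor:
    """Normalize rewards within each group (sketch of the GRPO idea).
    `rewards` has shape [num_groups, num_group_samples]."""
    mean = rewards.mean(dim=1, keepdim=True)  # per-group mean reward
    std = rewards.std(dim=1, keepdim=True)    # per-group reward spread
    return (rewards - mean) / (std + eps)     # positive = better than the group average

# Example: one group with a clear winner, one nearly uniform group
rewards = torch.tensor([[1.0, 2.0, 3.0, 4.0],
                        [0.1, 0.1, 0.2, 0.0]])
print(group_relative_advantages(rewards))
```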
### Key GRPO Hyperparameters
| Parameter | Default | Description |
|---|---|---|
| `num_group_samples` | 64 | Structures per group |
| `group_reward_norm` | `true` | Normalize rewards within each group (required for GRPO) |
| `num_inner_batch` | 2 | Number of inner batches for gradient accumulation |
| `clip_ratio` | 0.001 | PPO-style clipping ratio |
| `kl_weight` | 1.0 | KL divergence penalty weight |
| `entropy_weight` | 1e-5 | Entropy regularization weight |
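To see where these knobs enter the objective: `clip_ratio` bounds the per-sample policy ratio PPO-style, `kl_weight` penalizes drift from the frozen pre-RL LDM, and `entropy_weight` discourages the policy from collapsing. The sketch below assumes a standard PPO-style formulation; it illustrates the roles of the hyperparameters rather than transcribing the Chemeleon2 loss.

```python
import torch

def grpo_objective(log_probs, old_log_probs, ref_log_probs, advantages,
                   clip_ratio=0.001, kl_weight=1.0, entropy_weight=1e-5):
    """Schematic clipped policy objective with KL and entropy terms
    (illustrative; not the exact Chemeleon2 implementation)."""
    ratio = torch.exp(log_probs - old_log_probs)               # policy ratio
    clipped = torch.clamp(ratio, 1.0 - clip_ratio, 1.0 + clip_ratio)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
    kl_penalty = (log_probs - ref_log_probs).mean()            # drift from the frozen LDM
    entropy_bonus = -log_probs.mean()                          # crude entropy estimate
    return policy_loss + kl_weight * kl_penalty - entropy_weight * entropy_bonus
```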
```bash
# Example: adjust group size
python src/train_rl.py custom_reward=rl_dng \
  rl_module.rl_configs.num_group_samples=128
```

## Reward Configuration
Rewards are defined in the `reward_fn` section of the config (see `configs/train_rl.yaml` for defaults):

```yaml
rl_module:
  reward_fn:
    _target_: src.rl_module.reward.ReinforceReward
    normalize_fn: std  # Global normalization
    eps: 1e-4
    reference_dataset: mp-20  # For novelty/uniqueness metrics
    components:
      - _target_: src.rl_module.components.CreativityReward
        weight: 1.0
        normalize_fn: null
      - _target_: src.rl_module.components.EnergyReward
        weight: 1.0
        normalize_fn: norm
      - _target_: src.rl_module.components.StructureDiversityReward
        weight: 0.1
        normalize_fn: norm
      - _target_: src.rl_module.components.CompositionDiversityReward
        weight: 1.0
        normalize_fn: norm
```

See Custom Rewards Guide for detailed component documentation (`src/rl_module/components.py`).
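Conceptually, each component produces a per-structure reward that is normalized according to its `normalize_fn` and combined according to its `weight`. A sketch of that aggregation, assuming a simple weighted sum (the component names mirror the config above; the helper itself is hypothetical, see `src/rl_module/reward.py` for the real composition):

```python
import torch

def combine_rewards(component_rewards: dict, weights: dict,
                    eps: float = 1e-4) -> torch.Tensor:
    """Weighted sum of normalized component rewards (hypothetical sketch of
    how the `components` list above is aggregated)."""
    total = torch.zeros_like(next(iter(component_rewards.values())))
    for name, r in component_rewards.items():
        r = (r - r.mean()) / (r.std() + eps)  # stands in for normalize_fn: norm
        total = total + weights[name] * r
    return total

# Example with the weights from the config above
n = 8  # structures in a batch
rewards = {name: torch.rand(n) for name in
           ["creativity", "energy", "structure_diversity", "composition_diversity"]}
weights = {"creativity": 1.0, "energy": 1.0,
           "structure_diversity": 0.1, "composition_diversity": 1.0}
print(combine_rewards(rewards, weights))
```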
## Available Experiments
| Custom Reward Config | Dataset | Reward | Description |
|---|---|---|---|
| `atomic_density` | Alex MP-20 | Custom | Example: atomic density optimization (see Custom Reward tutorial) |
| `rl_dng` | MP-20 | DNG (multi-objective) | Paper's de novo generation reward (see DNG Reward tutorial) |
| `rl_bandgap` | Alex MP-20 | Predictor-based | Band gap optimization (see Predictor Reward tutorial) |
## Training Tips
### Monitoring
Key metrics to watch in WandB:
- `train/reward`: Average reward (should increase)
- `train/kl_div`: KL divergence from the original LDM
- `val/reward`: Validation reward, to check generalization
- Component-specific metrics (e.g., `train/creativity`, `train/energy`)
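To inspect these curves outside the WandB UI, the public `wandb` API can pull a run's history; the run path below is a placeholder.

```python
import wandb

api = wandb.Api()
run = api.run("my-entity/chemeleon2-rl/run_id")  # placeholder entity/project/run_id
# Fetch the reward and KL curves as a pandas DataFrame
history = run.history(keys=["train/reward", "train/kl_div", "val/reward"])
print(history.tail())
```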
### Hyperparameter Tuning
| Issue | Solution |
|---|---|
| Unstable training | Increase `num_group_samples`, enable `group_reward_norm` |
| Mode collapse | Increase `kl_weight`, add diversity rewards |
| Slow convergence | Decrease `kl_weight`, increase reward weights |
| Poor structure quality | Add an `EnergyReward` component |
### Typical Training
- Duration: ~500-2000 steps
- Batch size: 5 (default for GRPO with 64 group samples)
- GPU memory: scales with `num_group_samples` (64 samples × 5 batch = 320 structures per step)
## Next Steps
- Custom Rewards Overview - Learn about reward components
- DNG Reward Tutorial - Paper's multi-objective reward
- Predictor Reward Tutorial - Property optimization