Dataset Preparation#

1. Introduction#

0_best_mae MOFTransformer takes both atom-wise graph embeddings and energy-grid embeddings to capture local and global features,respectively. Both of embeddings are generated from the cif files of MOFs.

(1) atom-wise graph embeddings

Tha atom-wise graph embeddings are taken from the modified CGCNN by removing pooling layer and adding opologically unique atom selection.

(2) energy-grid embeddings

The 3D energy grids are calculated using GRIDAY with the united atom model of methane molecule using UFF.

2.Generate custom dataset#

To generate inputs for MOFTransformer from CIF files, use the moftransformer/utils/prepare_data file to generate atom-wise graph embeddings and energy-grid embeddings. You will need to prepare CIF files and raw_{downstream}.json files in a root_cifs directory.

  • root_cif: A directory that contains .cif and .json file.

  • downstream : name of user-specific downstream task (e.g. band_gap, gas_uptake, etc).

Example for root_cif#

The example of root_cifs directory is as follows.

root_cifs # root for cif files
├── [cif_id].cif
├── [cif_id].cif
├── ...
└── raw_{downstream}.json

The example of raw_{downstream}.json files is as follows.

{ 
    cif_id : property (float or int),
    ...
}

Run prepare_data function#

If there is a json files named raw_{downstream}.json in root_cifs directory, then it will be randomly splitted to train, val and test. (Default = 8:1:1).

from moftransformer.utils import prepare_data

# single task
prepare_data(root_cifs, root_dataset, downstream="example") 

# multiple tasks (contain several json files in root_cif)
prepare_data(root_cifs, root_dataset, downstream=["example1", "example2", ...])

prepare_data will generate the atom-wise graph embeddings and energy-grid embeddings in root_dataset directory.

root_dataset # root for generated inputs 
├── train
│   ├── [cif_id].graphdata # graphdata
│   ├── [cif_id].grid # energy grid information
│   ├── [cif_id].griddata16 # grid data
│   ├── [cif_id].cif # primitive cif
│   └── ...
├── val
│   ├── [cif_id].graphdata # graphdata
│   ├── [cif_id].grid # energy grid information
│   ├── [cif_id].griddata16 # grid data
│   ├── [cif_id].cif # primitive cif
│   └── ...
├── test    
│   ├── [cif_id].graphdata # graphdata
│   ├── [cif_id].grid # energy grid information
│   ├── [cif_id].griddata16 # grid data
│   ├── [cif_id].cif # primitive cif
│   └── ...
├── train_{downstream}.json
├── val_{downstream}.json
└── test_{downstream}.json

3. Dataset for public database (CoREMOF, QMOF).#

we’ve provided the dataset of atom-wise graph embedding and energy-grid embedding for the CoREMOF and the QMOF database in our figshare database.

Or, you can download using command line or python (refer to installation)