Dataset Preparation¶
1. Introduction¶
MOFTransformer takes both atom-wise graph embeddings and energy-grid embeddings as input to capture local and global features, respectively. Both embeddings are generated from the CIF files of MOFs.
(1) atom-wise graph embeddings
The atom-wise graph embeddings are taken from a modified CGCNN in which the pooling layer is removed and topologically unique atom selection is added.
(2) energy-grid embeddings
The 3D energy grids are calculated with GRIDAY, using a united-atom model of the methane molecule and the Universal Force Field (UFF).
2. Generate custom dataset¶
To generate inputs for MOFTransformer from CIF files, use the prepare_data function in moftransformer/utils, which produces the atom-wise graph embeddings and the energy-grid embeddings. You will need to place the CIF files and the raw_{downstream}.json files in a root_cifs directory.
root_cifs: A directory that contains the .cif and .json files.
downstream: Name of the user-specific downstream task (e.g. band_gap, gas_uptake).
Example for root_cifs¶
An example root_cifs directory is shown below.
root_cifs # root for cif files
├── [cif_id].cif
├── [cif_id].cif
├── ...
└── raw_{downstream}.json
An example raw_{downstream}.json file is shown below.
{
cif_id : property (float or int),
...
}
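As a minimal sketch of how such a file could be produced, the snippet below writes a raw_{downstream}.json from a Python dictionary and reports labels that have no matching CIF file. The helper name write_raw_json is hypothetical and is not part of MOFTransformer; the sketch assumes that each key is the CIF file name without the .cif extension, as the directory layout above suggests.

```python
import json
from pathlib import Path


def write_raw_json(root_cifs, downstream, targets):
    """Write {cif_id: property} to root_cifs/raw_{downstream}.json.

    Returns the sorted list of cif_ids that have no [cif_id].cif file
    in root_cifs. Hypothetical helper, not part of MOFTransformer.
    """
    root = Path(root_cifs)
    root.mkdir(parents=True, exist_ok=True)
    # cif_ids found on disk: file names without the .cif extension
    cif_ids_on_disk = {path.stem for path in root.glob("*.cif")}
    missing = sorted(set(targets) - cif_ids_on_disk)
    with open(root / f"raw_{downstream}.json", "w") as f:
        json.dump(targets, f, indent=2)
    return missing
```

For instance, calling write_raw_json("root_cifs", "band_gap", labels) with a dictionary named labels would create root_cifs/raw_band_gap.json; the task name band_gap is only an illustration.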
Run the prepare_data function¶
If a JSON file named raw_{downstream}.json exists in the root_cifs directory, its entries are randomly split into train, val, and test sets (default ratio 8:1:1).
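from moftransformer.utils import prepare_data

# Placeholder paths: replace them with your own directories
root_cifs = "path/to/root_cifs"        # contains the CIF files and raw_{downstream}.json
root_dataset = "path/to/root_dataset"  # generated inputs are written here

# single task
prepare_data(root_cifs, root_dataset, downstream="example")
# multiple tasks (root_cifs contains several JSON files)
prepare_data(root_cifs, root_dataset, downstream=["example1", "example2", ...])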
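To illustrate what a random 8:1:1 split of the labels looks like, here is a minimal sketch in plain Python. It is not the implementation used inside prepare_data; the function name split_targets, its parameters, and the fixed seed are assumptions made for this example only.

```python
import random


def split_targets(targets, train_fraction=0.8, test_fraction=0.1, seed=42):
    """Randomly split a {cif_id: property} dict into train, val and test dicts.

    Illustrative only: this is not the code that prepare_data runs.
    """
    cif_ids = sorted(targets)  # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(cif_ids)
    n_train = int(len(cif_ids) * train_fraction)
    n_test = int(len(cif_ids) * test_fraction)
    train_ids = cif_ids[:n_train]
    test_ids = cif_ids[n_train:n_train + n_test]
    val_ids = cif_ids[n_train + n_test:]  # whatever is left goes to validation
    train = {cif_id: targets[cif_id] for cif_id in train_ids}
    val = {cif_id: targets[cif_id] for cif_id in val_ids}
    test = {cif_id: targets[cif_id] for cif_id in test_ids}
    return train, val, test
```

Because the validation set takes the remainder, every cif_id ends up in exactly one of the three sets even when the fractions do not divide the dataset evenly.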
prepare_data generates the atom-wise graph embeddings and energy-grid embeddings in the root_dataset directory.
root_dataset # root for generated inputs
├── train
│ ├── [cif_id].graphdata # graphdata
│ ├── [cif_id].grid # energy grid information
│ ├── [cif_id].griddata16 # grid data
│ ├── [cif_id].cif # primitive cif
│ └── ...
├── val
│ ├── [cif_id].graphdata # graphdata
│ ├── [cif_id].grid # energy grid information
│ ├── [cif_id].griddata16 # grid data
│ ├── [cif_id].cif # primitive cif
│ └── ...
├── test
│ ├── [cif_id].graphdata # graphdata
│ ├── [cif_id].grid # energy grid information
│ ├── [cif_id].griddata16 # grid data
│ ├── [cif_id].cif # primitive cif
│ └── ...
├── train_{downstream}.json
├── val_{downstream}.json
└── test_{downstream}.json
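After the inputs have been generated, it can be useful to confirm that every labelled structure has all of its files. The sketch below checks one split against the layout shown above. The helper name find_incomplete is hypothetical, and the sketch assumes that {split}_{downstream}.json uses the same {cif_id: property} layout as raw_{downstream}.json.

```python
import json
from pathlib import Path

# File suffixes listed in the root_dataset layout
EXPECTED_SUFFIXES = (".graphdata", ".grid", ".griddata16")


def find_incomplete(root_dataset, downstream, split="train"):
    """Return {cif_id: [missing suffixes]} for the labelled entries of one split.

    Hypothetical helper, not part of MOFTransformer.
    """
    root = Path(root_dataset)
    with open(root / f"{split}_{downstream}.json") as f:
        targets = json.load(f)  # assumed layout: {cif_id: property}
    incomplete = {}
    for cif_id in targets:
        missing = [
            suffix
            for suffix in EXPECTED_SUFFIXES
            if not (root / split / f"{cif_id}{suffix}").exists()
        ]
        if missing:
            incomplete[cif_id] = missing
    return incomplete
```

An empty dictionary means that every labelled cif_id in that split has its graph data and grid files in place.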
3. Datasets for public databases (CoREMOF, QMOF)¶
We provide the atom-wise graph embeddings and energy-grid embeddings for the CoREMOF and QMOF databases in our figshare database.
Alternatively, you can download them from the command line or from Python (refer to installation).