code problem #188

Open
coderabbittank opened this issue Dec 1, 2023 · 2 comments

Comments

@coderabbittank

I noticed that 3D molecular conformations are generated in conformer.py, in the data folder of mol_tools, but it seems that only one conformation is generated per molecule. The Uni-Mol paper says that 11 conformations are generated for each molecule. Where is the relevant code? If it exists, please point me to it. Thank you.

@ZhouGengmo
Contributor

You can refer to this.

@coderabbittank
Author

coderabbittank commented Dec 17, 2023

Another question: when I use unimol_tools to fine-tune on the Tox21 dataset, my code looks like this:

clf = MolTrain(task='multilabel_classification',
               data_type='molecule',
               batch_size=16,
               metrics='auc',
               split='random',
               epochs=20,
               learning_rate=2e-5,
               )
pred = clf.fit(data='./train_data.csv')

clf = MolPredict(load_model='/home/zhuyifeng/gitcode/Uni-Mol-main/unimol_tools/unimol_tools/exp')
res = clf.predict(data = './test_data.csv')

After data processing, my training set is as follows:

SMILES,TARGET1,TARGET2,TARGET3,TARGET4,TARGET5,TARGET6,TARGET7,TARGET8,TARGET9,TARGET10,TARGET11,TARGET12
O=C([O-])COc1ccc(Cl)cc1Cl,0,0,0,0,0,0,0,0,0,0,0,0
ClCC(Cl)CCl,0,0,0,0,1,0,0,0,0,0,0,0
Nc1ccn([C@@H]2O[C@H](CO)[C@@H](O)C2(F)F)c(=O)n1,0,0,0,0,0,0,,0,0,0,0,1
CO,0,0,0,0,0,0,0,0,0,0,0,0
FC1(F)C(F)(F)C(F)(F)C2(F)C(F)(C1(F)F)C(F)(F)C(F)(F)C1(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C12F,0,0,0,0,,0,0,0,0,0,0,0
CC1(C)S[C@@h]2C@HC(=O)N2[C@H]1C(=O)OCOC(=O)[C@@h]1N2C(=O)C[C@H]2S(=O)(=O)C1(C)C,0,,0,0,0,0,,,0,,0,
CC@@HC@@Hc1ccccc1,0,0,0,0,0,0,0,0,0,0,0,0
...

My test set is as follows:

SMILES,TARGET1,TARGET2,TARGET3,TARGET4,TARGET5,TARGET6,TARGET7,TARGET8,TARGET9,TARGET10,TARGET11,TARGET12
O=C(O)CNC(=O)c1ccc([N+](=O)[O-])cc1,0,0,0,0,1,0,0,0,0,0,0,0
COC(=O)c1ccccc1C(=O)OC,0,0,0,0,0,0,0,0,0,0,0,0
COc1cc2nc(N3CCN(C(=O)c4ccco4)CC3)nc(N)c2cc1OC,0,,,,0,0,0,0,,0,0,
Cc1cccc(N(C)C(=S)Oc2ccc3c(c2)C2CCC3C2)c1,,,,,,,,,,0,,
CCN(CC)C(=O)[C@]1(c2ccccc2)C[C@@H]1CN,0,0,0,0,0,0,0,,0,,0,0
SCCSCCS,0,0,0,0,0,0,0,0,0,0,0,0
NC@@HC(=O)O,0,0,0,,0,0,1,0,0,0,0,0
CCOC(=O)CC#N,0,0,0,0,0,0,0,0,0,0,0,0
Cc1ccc(Nc2nccc(N(C)c3ccc4c(C)n(C)nc4c3)n2)cc1S(N)(=O)=O,0,0,,,,0,,,,0,,1
c1cc(C(c2ccc(OCC3CO3)cc2)c2ccc(OCC3CO3)cc2)ccc1OCC1CO1,0,,0,,,0,,1,,1,,0
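
The empty cells in both tables are unmeasured labels. As a quick check before training, the number of missing labels per target can be counted with Python's standard `csv` module. This is a minimal sketch: `count_missing` is a hypothetical helper, and `SAMPLE` holds only two rows copied from the training table above.

```python
import csv
import io

# Two rows copied from the training table in the post; empty cells are missing labels.
SAMPLE = """SMILES,TARGET1,TARGET2,TARGET3,TARGET4,TARGET5,TARGET6,TARGET7,TARGET8,TARGET9,TARGET10,TARGET11,TARGET12
ClCC(Cl)CCl,0,0,0,0,1,0,0,0,0,0,0,0
FC1(F)C(F)(F)C(F)(F)C2(F)C(F)(C1(F)F)C(F)(F)C(F)(F)C1(F)C(F)(F)C(F)(F)C(F)(F)C(F)(F)C12F,0,0,0,0,,0,0,0,0,0,0,0
"""


def count_missing(csv_text):
    """Return {target column: number of rows whose cell is empty}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    targets = [name for name in reader.fieldnames if name != "SMILES"]
    missing = dict.fromkeys(targets, 0)
    for row in reader:
        for name in targets:
            if row[name] == "":
                missing[name] += 1
    return missing


missing = count_missing(SAMPLE)
# Show only the targets that have at least one missing label.
print({name: n for name, n in missing.items() if n})
```

To run the same count on a full file, read it with `open(path, newline='').read()` and pass the text to `count_missing`.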

1. During fine-tuning, the loss sometimes takes a very small negative value, and an error is raised: ValueError: multi_class must be in ('ovo', 'ovr'). How can I solve this?

2. Before this, I fine-tuned on BBBP, BACE, and ClinTox. The resulting AUROC values are 0.737, 0.85, and 0.868, compared with 0.729, 0.857, and 0.919 in the Uni-Mol paper. The first two are close to the paper's results, but the last one differs considerably. Is this because of the hyperparameters?

3. When training on the SIDER dataset, generating molecular conformations is very slow and seems to get stuck.

4. unimol_tools performs 5-fold cross-validation on the training set, which differs from the standard 0.8/0.1/0.1 split used for benchmark comparisons. The standard practice is to divide the dataset 0.8/0.1/0.1 into training, validation, and test sets: the model is trained on the training set, validated on the validation set, and then tested on the test set. Using 5-fold cross-validation instead affects the results. What I did was split the original dataset 0.8/0.1/0.1, feed the 0.8 training set and the 0.1 validation set together to MolTrain as training input, and then test with MolPredict on the 0.1 test set. This approach may not be correct, and I would appreciate any suggestions or a pointer to the correct approach.
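
On the error in question 1: the message text matches the one scikit-learn's `roc_auc_score` raises when it treats its input as multiclass and `multi_class` is left at its default. Whether the missing labels trigger that path inside unimol_tools is an assumption here, not something confirmed in this thread. The sketch below shows one common way to score multilabel data with gaps: compute ROC AUC per target over the labeled rows only, then average. The helper names are hypothetical and the toy data is made up for illustration.

```python
def binary_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when only one class is present
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def masked_mean_auc(y_true, y_score):
    """Mean of per-target AUCs, skipping rows whose label is None for that target."""
    aucs = []
    for t in range(len(y_true[0])):
        pairs = [(row[t], s[t]) for row, s in zip(y_true, y_score) if row[t] is not None]
        auc = binary_auc([y for y, _ in pairs], [s for _, s in pairs])
        if auc is not None:
            aucs.append(auc)
    return sum(aucs) / len(aucs) if aucs else None


# Toy data (made up): 4 molecules, 2 targets, None marks a missing label.
y_true = [[0, 1], [1, None], [0, 0], [1, 1]]
y_score = [[0.2, 0.9], [0.8, 0.5], [0.4, 0.3], [0.3, 0.6]]
mean_auc = masked_mean_auc(y_true, y_score)
print(mean_auc)
```

The pairwise AUC is quadratic in the number of rows, which is fine for a sanity check but slow on large datasets.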

Here is the code to fine-tune the regression task:

from unimol_tools import MolTrain, MolPredict
from scaffold import load_data,scaffold_split,split_data
import pandas as pd
import csv
train_data_full = pd.read_csv('/gitcode/Uni-Mol-main/unimol_tools/unimol_tools/MoleculeNet/freesolv.csv')
label_count = train_data_full.head(1).shape[1]

target_columns = ["TARGET{}".format(i) for i in range(1, label_count)]

train_data_full.columns = ["SMILES"] + target_columns
train_data_full.to_csv("./mol_train_full.csv", index=False) 

data = load_data("./mol_train_full.csv")

train_data, val_data, test_data = split_data(data,'random',[0.8,0.1,0.1],42)

csv_file1 = 'train_data.csv'
csv_columns = ["SMILES"] + target_columns
csv_file2 = 'test_data.csv'


smile_list1 = train_data.smile() + val_data.smile()
label_list1 = train_data.label() + val_data.label()

smile_list2 = test_data.smile()
label_list2 = test_data.label()



def write_split(path, smiles, labels):
    # Write one row per molecule; missing labels (None) become empty cells.
    with open(path, 'w', newline='') as file:
        writer = csv.DictWriter(file, fieldnames=csv_columns)
        writer.writeheader()
        for smile, label in zip(smiles, labels):
            row = {'SMILES': smile}
            row.update({'TARGET{}'.format(i + 1): l for i, l in enumerate(label)})
            writer.writerow(row)

write_split(csv_file1, smile_list1, label_list1)
write_split(csv_file2, smile_list2, label_list2)


print(len(train_data)+len(val_data))
print(len(test_data))
clf = MolTrain(task='regression',
               data_type='molecule',
               batch_size=16,
               metrics='mse',
               split='random',
               epochs=20,
               learning_rate=5e-5,
               )
pred = clf.fit(data='./train_data.csv')
# currently supports SMILES-based csv/txt files, and
# a custom dict of {'atoms':[['C','C'],['C','H','O']], 'coordinates':[coordinates_1,coordinates_2]}

clf = MolPredict(load_model='/Uni-Mol-main/unimol_tools/unimol_tools/exp')
res = clf.predict(data = './test_data.csv')
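
The `scaffold` module imported in the script is not shown in the post, so the behavior of `split_data` is unknown. As a rough sketch of what a seeded random 0.8/0.1/0.1 split followed by the train + validation merge could look like using only the standard library (`random_split` is a hypothetical helper, not the poster's function):

```python
import random


def random_split(n_items, fractions=(0.8, 0.1, 0.1), seed=42):
    """Shuffle indices 0..n_items-1 with a fixed seed and cut them into three parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    n_train = round(fractions[0] * n_items)
    n_val = round(fractions[1] * n_items)
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder, so rounding never drops an item
    return train, val, test


train_idx, val_idx, test_idx = random_split(100)

# As described in the post: train and validation rows are merged for fitting,
# and the test rows are held out for prediction.
fit_idx = train_idx + val_idx
print(len(fit_idx), len(test_idx))
```

Fixing the seed keeps the held-out test rows identical across runs, which matters when comparing results between hyperparameter settings.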
