    su-chao
    @su-chao
    Can someone help with the problem that small molecules are too large? For example:
    CC(C)c1ccc(cc1)NC(=O)O[C@@H]1CO[C@H]2[C@H](CO[C@@H]12)NC(=O)Nc1cccc(C(F)(F)F)c1
    my featurizer : dc.feat.CircularFingerprint()
    su-chao
    @su-chao
    Hi, can anyone tell me why the output predict.shape is always (#,#,2), which is 3D?
    predict[:,:,0] and predict[:,:,1] hold different values. Which one is the correct predicted value?
    su-chao
    @su-chao
    It's very confusing
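For what it's worth, a trailing axis of size 2 is what a two-class classification model would produce: one probability per class for every (sample, task) pair. The sketch below assumes that layout, (n_samples, n_tasks, n_classes), which the thread does not confirm. The values and the helper names `class_probability` and `predicted_labels` are made up for illustration.

```python
# Hypothetical predictions shaped (n_samples, n_tasks, n_classes) = (3, 2, 2).
# Assumption: the last axis holds [P(class 0), P(class 1)] for each task.
predict = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.4, 0.6], [0.7, 0.3]],
    [[0.5, 0.5], [0.1, 0.9]],
]


def class_probability(predictions, class_index=1):
    """Return an (n_samples, n_tasks) table of probabilities for one class."""
    return [[task[class_index] for task in sample] for sample in predictions]


def predicted_labels(predictions):
    """Return the most probable class index for every (sample, task) pair."""
    return [
        [max(range(len(task)), key=task.__getitem__) for task in sample]
        for sample in predictions
    ]


# On a NumPy array the first call corresponds to the slice predict[:, :, 1].
p_class_1 = class_probability(predict)
labels = predicted_labels(predict)
```

If the assumption holds, the two slices are complementary (each pair sums to 1), and predict[:,:,1] is the probability of the positive class rather than a second, competing prediction.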
    davidRFB
    @davidRFB
    Hi @rbharath, thank you for the answer. The same error appears with another test.
    I am running deepchem 2.6.0dev from inside the deepchem directory that I forked from GitHub. Maybe an environment reset could work?
    Tonylac77
    @Tonylac77

    Dear all, sorry for the late answer. It seems we have fixed the iteration over DiskDatasets. In this case we use a 'for' loop where we unpack the tuples generated by the k-fold split function and then write each fold out as two CSV files (we want to do this for use in other machine learning software).

    # k1..k10 are the (train, cv) dataset pairs produced by the k-fold split
    folds = [k1, k2, k3, k4, k5, k6, k7, k8, k9, k10]

    for a, (train_ds, cv_ds) in enumerate(folds, start=1):
        # Keep only the ids column of each fold and write it to its own CSV
        train_ds.to_dataframe()['ids'].to_csv("k" + str(a) + "_train.csv")
        cv_ds.to_dataframe()['ids'].to_csv("k" + str(a) + "_cv.csv")

    This works fine when we load our dataset from CSV (with the CSVLoader function) without an ID field. However, if we use a dataset with ChEMBL IDs (as in this case), we get the following RDKit error when performing the k-fold split (see below). We would love any input on this!

    ArgumentError                             Traceback (most recent call last)
    <ipython-input-8-604eaa868421> in <module>
          1 # split dataset
    ----> 2 k = splitter.k_fold_split(dataset=dataset, k=10)
    
    C:\Anaconda3\envs\deepchem38\lib\site-packages\deepchem\splits\splitters.py in k_fold_split(self, dataset, k, directories, **kwargs)
         84       frac_fold = 1. / (k - fold)
         85       train_dir, cv_dir = directories[2 * fold], directories[2 * fold + 1]
    ---> 86       fold_inds, rem_inds, _ = self.split(
         87           rem_dataset,
         88           frac_train=frac_fold,
    
    C:\Anaconda3\envs\deepchem38\lib\site-packages\deepchem\splits\splitters.py in split(self, dataset, frac_train, frac_valid, frac_test, seed, log_every_n)
       1107     for ind, smiles in enumerate(dataset.ids):
       1108       mols.append(Chem.MolFromSmiles(smiles))
    -> 1109     fps = [AllChem.GetMorganFingerprintAsBitVect(x, 2, 1024) for x in mols]
       1110 
       1111     # calcaulate scaffold sets
    
    C:\Anaconda3\envs\deepchem38\lib\site-packages\deepchem\splits\splitters.py in <listcomp>(.0)
       1107     for ind, smiles in enumerate(dataset.ids):
       1108       mols.append(Chem.MolFromSmiles(smiles))
    -> 1109     fps = [AllChem.GetMorganFingerprintAsBitVect(x, 2, 1024) for x in mols]
       1110 
       1111     # calcaulate scaffold sets
    
    ArgumentError: Python argument types in
        rdkit.Chem.rdMolDescriptors.GetMorganFingerprintAsBitVect(NoneType, int, int)
    did not match C++ signature:
        GetMorganFingerprintAsBitVect(class RDKit::ROMol mol, unsigned int radius, unsigned int nBits=2048, class boost::python::api::object invariants=[], class boost::python::api::object fromAtoms=[], bool useChirality=False, bool useBondTypes=True, bool useFeatures=False, class boost::python::api::object bitInfo=None, bool includeRedundantEnvironments=False)
    Vignesh Ram Somnath
    @vsomnath
    This feels like a molecule / SMILES is invalid.
    Bharath Ramsundar
    @rbharath
    @davidRFB Looks like an error in your environment setup? We have some issues with installing head right now (because of the deepchem 2.6.0 delay and TensorFlow), so perhaps try installing the environment manually?
    @Tonylac77 Seconding @vsomnath. Basically, RDKit couldn't load one of the SMILES strings you provided into a Mol and returned None instead, which triggers the downstream error. Perhaps check whether you have an invalid SMILES string.
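The traceback above shows the splitter calling Chem.MolFromSmiles on every entry of dataset.ids, so any id that is not valid SMILES (a ChEMBL ID, for instance) would come back as None. A minimal sketch for locating such entries before splitting follows. The helper name `find_unparsable_ids` is hypothetical and not part of DeepChem or RDKit, and RDKit is only imported when no parser is passed in.

```python
def find_unparsable_ids(ids, parse=None):
    """Return (index, id) pairs whose id does not parse as SMILES.

    `parse` is any callable that returns None on failure. By default it is
    RDKit's Chem.MolFromSmiles, imported lazily so RDKit is only needed then.
    """
    if parse is None:
        from rdkit import Chem
        parse = Chem.MolFromSmiles
    return [(i, s) for i, s in enumerate(ids) if parse(s) is None]


# Intended use (assumes `dataset` is the DeepChem dataset being split):
#   bad = find_unparsable_ids(dataset.ids)
#   print(len(bad), "ids are not valid SMILES, e.g.", bad[:5])
```

If every id is reported, the ID field is probably holding identifiers rather than SMILES strings, which would match the behaviour described for the ChEMBL-ID dataset.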
    Antonio Parra
    @parrasevilla91_twitter
    Hi there! Can anyone recommend the most suitable model for predicting the activity of a molecule, based on a MoleculeNet dataset?
    Bharath Ramsundar
    @rbharath
    @parrasevilla91_twitter dc.models.GraphConvModel is probably a good place to start
    alat-rights
    @alat-rights
    I was wondering if we should do anything about the deprecation warning about imp raised by a number of DeepChem unit tests?
    I’m not sure if I’ve brought it up before so sorry if I have
    Antonio Parra
    @parrasevilla91_twitter

    @parrasevilla91_twitter dc.models.GraphConvModel is probably a good place to start

    thank you so much @rbharath

    Bharath Ramsundar
    @rbharath
    @alat-rights We should definitely fix these deprecation warnings. They seem to come primarily from TensorFlow though, so they may lessen as we migrate to PyTorch over time
    alat-rights
    @alat-rights
    Sounds good!
    Arthur Funnell
    @elemets
    I've been training a GraphConv model on peptide data but can't seem to get it to predict decently: the R-squared is -0.02. When I used a DAG model I trained for a lot less time and still got an R-squared of around 0.6. Does anyone have a suggestion as to why, and what I should try next?
    Omid Tarkhaneh
    @OmidTarkhaneh
    Can anyone say what the merits of deepchem are? Does it only do vectorizing? Why not use PyTorch or Keras for that? Which problems does deepchem actually solve?
    Bharath Ramsundar
    @rbharath
    @elemets Was the peptide represented as a smiles string for input?
    And how large were the peptides in question?
    Omid Tarkhaneh
    @OmidTarkhaneh
    Hello everybody. Could you please recommend some resources on graph neural networks that are useful for deep learning in molecular science and chemistry? Many thanks.
    Bharath Ramsundar
    @rbharath
    @OmidTarkhaneh I'd recommend checking out the deepchem tutorials (see tutorials link on deepchem.io)
    Omid Tarkhaneh
    @OmidTarkhaneh
    @rbharath Thank you so much.
    Sahar RZ
    @SaharRohaniZ
    Hi DeepChem team - I am working with ConvMolFeaturizer and I'd like to know how the feature matrix is generated. I used this featurizer to featurize a molecule (fed through a PDB file) that has 155 atoms, and the feature matrix is of shape [95,75]. I want to understand why only 95 atoms were featurized. Any help would be appreciated.
    Bharath Ramsundar
    @rbharath
    @SaharRohaniZ My guess is that the hydrogens were dropped, but I'd have to dig into the source code to know for sure
    The featurization is done by internal methods. Probably fastest to take a look at the ConvMolFeaturizer source directly since the featurization methods aren't part of DeepChem's public API
    Sahar RZ
    @SaharRohaniZ
    @rbharath I did the math and you are right. Thanks for your help.
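The check described above can be done without any chemistry toolkit by counting non-hydrogen atoms straight from the PDB file. This is a rough sketch that assumes a standard fixed-width PDB file with the element symbol in columns 77-78. The function name `count_pdb_atoms` is hypothetical, and this is not how ConvMolFeaturizer itself counts atoms.

```python
def count_pdb_atoms(pdb_text):
    """Count (all atoms, non-hydrogen atoms) in ATOM/HETATM records.

    Assumes the element symbol sits in columns 77-78 of each record.
    Records with an empty element field are counted as heavy atoms.
    """
    total = heavy = 0
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            total += 1
            element = line[76:78].strip().upper()
            if element not in ("H", "D"):  # skip hydrogen and deuterium
                heavy += 1
    return total, heavy


# Intended use ("molecule.pdb" is a placeholder path):
#   with open("molecule.pdb") as handle:
#       total, heavy = count_pdb_atoms(handle.read())
```

With the numbers in this thread, a result of 155 total and 95 heavy atoms would mean 60 hydrogens, consistent with the guess that hydrogens were dropped.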
    Arthur Funnell
    @elemets
    @rbharath Hey, yes, it was represented by SMILES strings. The peptide lengths vary, but they are pretty big
    Bharath Ramsundar
    @rbharath
    @elemets It's possible the graphconv is just struggling with lengths. If you're interested, it would actually be very useful if you could contribute a small peptide benchmark dataset for us. None of DeepChem's sample datasets use peptides so we've never benchmarked for that use case
    Omid Tarkhaneh
    @OmidTarkhaneh
    Hello, pytorch-geometric does not install for me in Google Colab. I am using PyTorch version 1.9.0+cu102. I would appreciate any help. Thanks. The error message is as follows: Detected that PyTorch and torch_sparse were compiled with different CUDA versions. PyTorch has CUDA version 10.2 and torch_sparse has CUDA version 11.0. Please reinstall the torch_sparse that matches your PyTorch install.
    I tried to install torch_sparse with cu102 but it does not work for me
    Omid Tarkhaneh
    @OmidTarkhaneh
    This is the installation command that I have used:
    import torch
    !pip install torch-scatter -f https://pytorch-geometric.com/whl/torch-1.9.0+cu102.html
    !pip install torch-sparse -f https://pytorch-geometric.com/whl/torch-1.9.0+cu102.html
    !pip install torch-cluster -f https://pytorch-geometric.com/whl/torch-1.9.0+cu102.html
    !pip install torch-spline-conv -f https://pytorch-geometric.com/whl/torch-1.9.0+cu102.html
    !pip install torch-geometric
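One way to rule out a mismatched wheel index is to derive the URL from the running PyTorch build rather than typing it by hand. The sketch below follows the URL pattern used in the commands above. The helper name `pyg_wheel_index` is hypothetical, and the `cpu` tag for CPU-only builds is an assumption.

```python
def pyg_wheel_index(torch_version, cuda_version):
    """Build the wheel index URL for a given PyTorch / CUDA pair.

    torch_version: e.g. "1.9.0" or "1.9.0+cu102" (any "+..." suffix is ignored)
    cuda_version:  e.g. "10.2", or None for a CPU-only build (assumed "cpu" tag)
    """
    base = torch_version.split("+")[0]
    tag = "cpu" if cuda_version is None else "cu" + cuda_version.replace(".", "")
    return f"https://pytorch-geometric.com/whl/torch-{base}+{tag}.html"


# In a notebook one would pass the values reported by PyTorch itself:
#   import torch
#   url = pyg_wheel_index(torch.__version__, torch.version.cuda)
print(pyg_wheel_index("1.9.0+cu102", "10.2"))
```

Since the error message asks for torch_sparse to be reinstalled, a copy built for CUDA 11.0 may already be present in the environment; uninstalling it before installing from the matching index could be worth trying.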
    Vignesh Venkataraman
    @VIGNESHinZONE
    There seems to be an error in the docs; deepchem uses CUDA 11.0. Also, which version of deepchem are you using?
    Omid Tarkhaneh
    @OmidTarkhaneh
    @VIGNESHinZONE No, this is not related to deepchem. I was just trying to install pytorch-geometric in Google Colab.
    Hannes Stärk
    @HannesStark

    Hello!
    Is there someone who is familiar with the BACE dataset from MoleculeNet?
    I was using the BACE dataset through SNAP Stanford's OGBG library but needed access to the smiles representations of the molecules and downloaded the BACE dataset directly from http://moleculenet.ai/datasets-1
    However, the annotations for the scaffold split in the CSV somewhat confuse me:
    There is the "Model" annotation which gives the following amount of molecules for each split:

    Train: 203
    Valid: 45
    Test: 1265

    This seems strange to me, especially considering the split in the OGBG library:

    Train: 1210
    Valid: 151
    Test: 152

    Is there something I am missing? Many thanks for any help!

    Bharath Ramsundar
    @rbharath
    @HannesStark Sorry for the slow response! I saw your email but didn't have a chance to respond.
    This is correct as structured (I was the author on the BACE paper who added the dataset into moleculenet)
    I think the OGB folks have restructured the dataset somehow, but I'm not familiar with what they've done
    For this paper, the train/valid/test split was asymmetrical because we were trying to understand the effects of having a small amount of training data, as is standard in drug discovery settings
    Hannes Stärk
    @HannesStark
    @rbharath Thank you very much!
    I would like to add a question: what do the split indices in http://deepchem.io.s3-website-us-west-1.amazonaws.com/trained_models/Hyperparameter_MoleculeNetv3.tar.gz, for instance in bace_cscaffold123.pkl, refer to? When using them I end up with a split of these sizes:
    Train: 1208
    Valid: 153
    Test: 152
    y6q9
    @yuanqidu
    Hi! I have a question about two molecule featurizers, RDKitDescriptors and MordredDescriptors. Is there any correspondence between the returned feature NumPy array and the real-world descriptors? I checked the Mordred website, which lists 1826 descriptors in total, while the returned array has 1613 entries once the option to ignore 3D descriptors is set to True. Is there any way to recover the name of each feature, i.e. which real-world descriptor it corresponds to?
    Bharath Ramsundar
    @rbharath
    @HannesStark I believe the three datasets were combined and then split using a scaffold split in an 80/10/10 ratio
    @yuanqidu Hmm, I'd recommend just checking the source. I think we just called the mordred API directly
    So I don't think we know the names and would have to look at the mordred docs to figure those out
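The BACE split sizes quoted earlier in this thread can be compared directly. The short check below only uses the numbers posted above (the dictionary labels are informal) and shows which splits are close to 80/10/10.

```python
# Train / valid / test sizes as quoted in the thread.
splits = {
    "CSV 'Model' column": (203, 45, 1265),
    "OGB library": (1210, 151, 152),
    "bace_cscaffold123.pkl": (1208, 153, 152),
}


def split_fractions(sizes):
    """Return the total size and each part's share, rounded to 3 decimals."""
    total = sum(sizes)
    return total, tuple(round(n / total, 3) for n in sizes)


summary = {name: split_fractions(sizes) for name, sizes in splits.items()}
for name, (total, fractions) in summary.items():
    print(f"{name}: total={total}, train/valid/test={fractions}")
```

All three partitions cover the same 1513 molecules. The OGB and .pkl splits are both roughly 80/10/10, while the CSV annotation keeps only about 13% of the data for training, in line with the small-training-set design described above.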
    y6q9
    @yuanqidu
    @rbharath Oh, got it! Thanks!
    Omid Tarkhaneh
    @OmidTarkhaneh
    Hello everybody. While featurizing and working with RDKit, I received this error, and I do not know how to fix my SMILES dataset. Any suggestions? The error is shown below: