from io import BytesIO
import pandas as pd
import numpy as np
from rdkit.Chem import PandasTools
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit.Chem import DataStructs
from rdkit.Chem import rdMolDescriptors
from rdkit.Chem import rdRGroupDecomposition
from rdkit.Chem.Draw import IPythonConsole  # Needed to show molecules
from rdkit.Chem import Draw
from rdkit.Chem import rdDepictor
from rdkit.Chem.Draw import rdMolDraw2D
from rdkit.Chem.Draw.MolDrawing import MolDrawing, DrawingOptions  # Only needed if modifying defaults
DrawingOptions.bondLineWidth = 1.8
IPythonConsole.ipython_useSVG = True
from rdkit import RDLogger
RDLogger.DisableLog('rdApp.warning')
import rdkit
print(rdkit.__version__)
2021.09.2
Motivation
Compound libraries are growing fast thanks to parallel synthesis. Enamine REAL Space now has more than 10 billion compounds available for purchase, and other vendors also offer libraries far larger than a traditional screening library.
The papers (1 and 2) from Enamine scientists discuss in detail how they constructed such a library. I got interested in building one myself and tested whether I could do it with RDKit.
Synthons
Prepare three groups: a core, block1, and block2. In this example, both blocks carry a carboxyl group. The core has one free amine, which can be reacted first, and an N-Boc group, which can be deprotected and reacted in a second step.
Because the reaction order can be swapped, these three synthons give four different arrangements of functional groups, depending on the order of reaction:
acylation with group1, deprotection, acylation with group1
acylation with group1, deprotection, acylation with group2
acylation with group2, deprotection, acylation with group1
acylation with group2, deprotection, acylation with group2
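The four routes above are simply every ordered pair of blocks, with repeats allowed. A minimal sketch of that enumeration, using the placeholder names `group1` and `group2` rather than real molecules:

```python
from itertools import product

blocks = ["group1", "group2"]

# One block reacts before the Boc deprotection and one after it,
# so each ordered pair (repeats allowed) is a distinct route.
routes = list(product(blocks, repeat=2))

for first, second in routes:
    print(f"{first} -> deprotection -> {second}")
```

With n blocks and two reactive sites this gives n × n routes per core, which is where the combinatorial growth discussed later comes from.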
Below is the core molecule, which has two amines, one of them protected by a Boc group, together with two reactant groups, each bearing one carboxyl group that can be activated as an acylating agent and then readily reacted with an amine.
Three main reactions are used here: acylation, amide formation, and deprotection. Each of them can be represented as a reaction SMARTS string. I'll walk through how we carry out these reactions to generate a new compound.
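As a rough illustration of the three reactions, here is a minimal sketch. The SMARTS strings, the `run_unique` helper, and the example molecules (a Boc-protected 3-aminopyrrolidine core, acetic acid, benzoic acid) are my own assumptions for demonstration, not necessarily the ones used for the figures in this post.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Assumed reaction SMARTS (illustrative only).
# 1. Acylation: activate a carboxylic acid as an acyl chloride.
acylation = AllChem.ReactionFromSmarts("[C:1](=[O:2])[OX2H1]>>[C:1](=[O:2])Cl")
# 2. Amide formation: acyl chloride + N-H amine (amide/carbamate N excluded).
amide_formation = AllChem.ReactionFromSmarts(
    "[C:1](=[O:2])Cl.[N;!H0;!$(NC=O):3]>>[C:1](=[O:2])[N:3]"
)
# 3. Deprotection: remove a Boc group from nitrogen.
deprotection = AllChem.ReactionFromSmarts("[N:1]C(=O)OC(C)(C)C>>[N:1]")


def run_unique(rxn, reactants):
    """Run a reaction; return sanitized products deduplicated by SMILES."""
    unique = {}
    for product_set in rxn.RunReactants(tuple(reactants)):
        mol = product_set[0]
        Chem.SanitizeMol(mol)
        unique.setdefault(Chem.MolToSmiles(mol), mol)
    return list(unique.values())


# Hypothetical example molecules.
core = Chem.MolFromSmiles("CC(C)(C)OC(=O)N1CCC(N)C1")  # free NH2 + N-Boc
group1 = Chem.MolFromSmiles("CC(=O)O")                 # acetic acid
group2 = Chem.MolFromSmiles("OC(=O)c1ccccc1")          # benzoic acid

# Route: acylation with group1, deprotection, acylation with group2.
acyl1 = run_unique(acylation, [group1])[0]
step1 = run_unique(amide_formation, [acyl1, core])[0]
step2 = run_unique(deprotection, [step1])[0]
acyl2 = run_unique(acylation, [group2])[0]
final = run_unique(amide_formation, [acyl2, step2])[0]

print(Chem.MolToSmiles(final))
```

The deduplication matters because the three equivalent methyl groups of the Boc group let the deprotection template match in several ways, so `RunReactants` can return the same product more than once.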
We can easily scale up the above procedure to generate a library of compounds by combining different reaction groups combinatorially.
core_smiles_arr = [core_smiles]
block_smiles_arr = [group1_smiles, group2_smiles]

# turn smiles to mol object
core_arr = [Chem.MolFromSmiles(smiles) for smiles in core_smiles_arr]
block_arr = [Chem.MolFromSmiles(smiles) for smiles in block_smiles_arr]

# set names for image generation
for i, core in enumerate(core_arr):
    core.SetProp("_Name", f'Core{i+1}')
for i, block in enumerate(block_arr):
    block.SetProp("_Name", f'Block{i+1}')
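The enumeration itself could look roughly like the sketch below, which fills a `product_library` list from every core and every ordered pair of blocks. The SMILES, the SMARTS, and the `first_product` helper are assumptions for illustration. Acid activation and amide formation are merged into a single coupling step to keep it short, and each molecule is assumed to have exactly one reactive site per step.

```python
from itertools import product
from rdkit import Chem
from rdkit.Chem import AllChem

# Assumed SMARTS: direct acid + amine coupling, and Boc removal.
couple = AllChem.ReactionFromSmarts(
    "[C:1](=[O:2])[OX2H1].[N;!H0;!$(NC=O):3]>>[C:1](=[O:2])[N:3]"
)
deprotect = AllChem.ReactionFromSmarts("[N:1]C(=O)OC(C)(C)C>>[N:1]")


def first_product(rxn, *reactants):
    """Return the first product, sanitized (assumes one reactive site)."""
    mol = rxn.RunReactants(reactants)[0][0]
    Chem.SanitizeMol(mol)
    return mol


# Hypothetical inputs, named the same way as in the cell above.
core_arr = [Chem.MolFromSmiles("CC(C)(C)OC(=O)N1CCC(N)C1")]
block_arr = [Chem.MolFromSmiles(s) for s in ("CC(=O)O", "OC(=O)c1ccccc1")]
for i, core in enumerate(core_arr):
    core.SetProp("_Name", f"Core{i+1}")
for i, block in enumerate(block_arr):
    block.SetProp("_Name", f"Block{i+1}")

product_library = []
for core, (b1, b2) in product(core_arr, product(block_arr, repeat=2)):
    mol = first_product(couple, b1, core)   # react the free amine
    mol = first_product(deprotect, mol)     # remove Boc
    mol = first_product(couple, b2, mol)    # react the freed amine
    mol.SetProp("_Name", "-".join(m.GetProp("_Name") for m in (core, b1, b2)))
    product_library.append(mol)

for mol in product_library:
    print(mol.GetProp("_Name"), Chem.MolToSmiles(mol))
```

Naming each product after its core and blocks keeps the synthetic route recoverable from the legend when the library is drawn.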
from rdkit.Chem.Draw import rdMolDraw2D
from io import BytesIO
from PIL import Image

ms = [Chem.RemoveHs(m) for m in product_library]
for m in ms:
    AllChem.Compute2DCoords(m)
legends = [m.GetProp("_Name") for m in ms]

molsPerRow = 4
subImgSize = (300, 300)
nRows = -(-len(ms) // molsPerRow)  # round up so a partial last row still fits
width = subImgSize[0] * molsPerRow
height = subImgSize[1] * nRows

d2d = rdMolDraw2D.MolDraw2DCairo(width, height, subImgSize[0], subImgSize[1])
d2d.drawOptions().legendFontSize = 24
d2d.DrawMolecules(ms, legends=legends)
d2d.FinishDrawing()

# load the PNG bytes into PIL for inline display
img = BytesIO(d2d.GetDrawingText())
Image.open(img)
Conclusion
The actual code for building the library was straightforward, as long as the fragments and the reactions are well curated, but I can imagine that maintaining such a curated fragment/reaction library at scale would be a real challenge.
Also, if one attempts to enumerate the entire possible library, the amount of computation grows combinatorially. For example, 200 cores that each permit two reactions, combined with 300 building blocks, already give 200 × 300 × 300 = 18 million products, and with 3,000 building blocks the count reaches 1.8 billion.
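A quick back-of-the-envelope check of the library size, assuming every core has the same number of reactive sites and any block can go on any site (the function name `library_size` is just for illustration):

```python
def library_size(n_cores: int, n_blocks: int, n_sites: int) -> int:
    """Products from attaching one block to each site of each core."""
    return n_cores * n_blocks ** n_sites


print(f"{library_size(200, 300, 2):,}")   # 18,000,000
print(f"{library_size(200, 3000, 2):,}")  # 1,800,000,000
print(f"{library_size(200, 300, 3):,}")   # 5,400,000,000
```

Under these assumptions, each extra reactive site multiplies the count by the number of blocks, so the size is exponential in the number of sites.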
Future
The paper also illustrates how such a database can be constructed on the fly with clever use of a virtual screening method, which removes the need to store the entire library on disk and reduces computation significantly.