Matched Molecular Pair Analysis

Introduction

In drug discovery, once the most promising lead compound has been found by screening, it usually needs to be improved in one or more properties during lead optimization before it can be considered a clinical candidate. The task is to understand and predict what effect a structural change will have on the properties, ideally retaining or enhancing the desirable properties while reducing the undesirable ones. Matched molecular pair analysis (MMPA) is a promising approach for this purpose. The term matched molecular pair analysis was coined by Kenny and Sadowski in 2004 for a special case of QSAR, and the approach is now widely used in drug design.

What Is Matched Molecular Pair Analysis?

MMPA has been defined as "identifying every pair of molecules that differ only by a particular, well-defined, structural transformation in a database of measured properties and computing the corresponding change in property". Such pairs of compounds are known as matched molecular pairs (MMPs). Because the structural difference between the two molecules is small, any experimentally observed change in a physical or biological property between them can be interpreted more easily. MMPA encourages a different way of thinking about medicinal chemistry: it can help chemists discover new effects, gain insights, and understand the relationships between structure and properties.
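The "change in property" part of this definition can be illustrated with a minimal sketch. It assumes the matched pairs have already been identified and labelled with a transformation string; the record layout, the transformation labels, and the solubility-like values are all made up for illustration and do not come from any real data set.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (transformation, property of molecule A, property of molecule B).
# The values are invented for illustration only.
pairs = [
    ("H>>F", -3.0, -3.4),
    ("H>>F", -2.0, -2.2),
    ("H>>OH", -3.0, -2.1),
]

def mean_change_by_transformation(pairs):
    """Group pairs by transformation and return {transformation: (mean delta, count)}."""
    deltas = defaultdict(list)
    for transformation, prop_a, prop_b in pairs:
        # The change is always measured in the direction A -> B.
        deltas[transformation].append(prop_b - prop_a)
    return {t: (mean(values), len(values)) for t, values in deltas.items()}

summary = mean_change_by_transformation(pairs)
for transformation, (delta, count) in summary.items():
    print(f"{transformation}: mean change {delta:+.2f} over {count} pair(s)")
```

Reporting the number of pairs alongside the mean is a deliberate choice: an average over one or two pairs says little about whether the effect generalizes.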

Application

All QSAR methods, including the MMP approach, inherently assume that the effect of a chemical substitution can be generalized. This assumption was successfully highlighted by the work of Lipinski et al., who correlated physicochemical properties with oral bioavailability. With the increasing availability of public databases containing millions of structure-activity relationship (SAR) or structure-property relationship (SPR) data points, multiple papers have applied the MMP concept to ADME, bioisosterism, aqueous solubility, plasma protein binding, oral exposure, logD, potency, intrinsic clearance, hERG and P450 metabolism, in vitro UGT (uridine 5′-diphosphoglucuronosyltransferase) glucuronidation clearance, half-life, selectivity against off-targets, the impact of N- and O-methylation on aqueous solubility and lipophilicity, and mode of action. These analyses differ only in the MMP algorithm used.

Activity Cliffs

One interesting subset of matched molecular pairs consists of those in which the change in property is surprisingly large. A number of researchers, most prominently the Bajorath group in Bonn, have reported on the insight that these pairs, which they call activity cliffs, can bring. Activity cliffs are generally defined as pairs of structurally similar compounds with large differences in potency. Notable recent contributions include using Hussain and Rea's fragment and index method to identify matched series within all the high-confidence ChEMBL Ki data relating to activity against human targets. These studies reveal coordinated activity cliffs, in which the same structural change causes the same large change in property across several chemical series. A similar method can be used to explore activity within large sets of screening hits. Studying activity cliffs with MMPA can yield new insights and provide an alternative starting point for further chemical exploration. In contrast to traditional SAR analysis, where similar compounds are assumed to have similar properties, activity cliffs highlight the small structural changes that have the greatest impact on a property.
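As a minimal sketch of how a matched pair might be flagged as an activity cliff, the function below compares two potencies on a logarithmic scale. The text above only says the difference must be "large"; the cutoff of 2 log units (a 100-fold potency difference) is an assumption chosen for illustration, and the function names are hypothetical.

```python
import math

def p_value(potency_nm):
    """Convert a potency in nM to its negative log10 molar value (e.g. pKi)."""
    return 9.0 - math.log10(potency_nm)

def is_activity_cliff(potency_a_nm, potency_b_nm, min_log_diff=2.0):
    """Return True if the two potencies differ by at least min_log_diff log units.

    min_log_diff=2.0 (100-fold) is an assumed threshold, not a fixed standard.
    """
    return abs(p_value(potency_a_nm) - p_value(potency_b_nm)) >= min_log_diff

# A 6600 nM compound paired with a 10 nM analogue differs 660-fold in potency.
print(is_activity_cliff(6600, 10))
print(is_activity_cliff(6600, 1000))
```

Working in log units keeps the criterion symmetric: swapping the two compounds gives the same answer.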

For example, in a depiction of a representative MMP-cliff, a label such as "6600nM | 6" indicates that the potency value of the compound is 6600 nM and that the differentiating fragment comprises 6 non-hydrogen atoms.

MMPA Algorithms

Generally, there are two broad classes of methods for identifying matched molecular pairs: supervised and unsupervised algorithms. They differ in whether the pairs are found with manual intervention or automatically.

Supervised

In supervised methods, the chemical transformation that generates the MMP is predefined. The SMARTS and SMIRKS notations are widely used to define such transformations. SMARTS strings have the great advantage that they can be written to be either extremely specific or very general, giving exquisite control over the chemical structures matched. The advantage of supervised methods lies in this precise control over the definition of the MMP, which allows a particular question to be addressed. On the other hand, these methods cannot find new and surprising MMPs in the way that unsupervised methods can. In the first publication on MMPA, Kenny and Sadowski described how a molecular editor could be used to identify matched molecular pairs in which substituents are added to a benzene-like ring, and analyzed the effect of the substituent on aqueous solubility.

Unsupervised

Supervised methods for finding matched pairs have several limitations: they are not user friendly, they require considerable human effort to encode the fragment patterns, and they cannot find MMPs outside the scope of the predefined patterns. A new approach was therefore needed in which matched pairs could be identified automatically, and a number of unsupervised approaches have been developed. They can be divided into two types: fragment and index, and maximum common subgraph (MCS).

Fragment and Index

The Hussain-Rea fragment and index algorithm was the first efficient, unsupervised MMPA algorithm. It consists of two main steps:

Fragmenting. Fragment each molecule in the data set by all possible single, double, and triple cuts.
Indexing. Generate a key-value store with fragments as keys and values as cores.

In the first step, each molecule is fragmented by breaking selected bonds. Hussain and Rea define the bonds to be broken using a SMARTS pattern that is intended to be specific to acyclic single bonds. The suggested pattern is "[*]!@!=[*]", which signifies any atom joined to any other atom by a bond that is neither in a ring nor a double bond. If a molecule has several bonds that can be cut, the bonds are broken one at a time, and the resulting fragments are stored as canonical representations (for example SMILES strings or hash codes).

As an example, removing the single acyclic bond within the connected molecular graph of bromobenzene yields two fragments, phenyl and bromo. The benzene ring can be regarded as a substituent of bromo. Likewise, bromo can be regarded as a substituent of benzene. This symmetry plays a prominent role in the Hussain-Rea algorithm.
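The single-cut step can be sketched without a cheminformatics toolkit by using a toy molecular graph. In the sketch below the graph representation, the hand-assigned ring flags, and the function names are all assumptions made for illustration. A real implementation would match the SMARTS pattern with a toolkit and store canonical SMILES rather than atom indices.

```python
from collections import defaultdict

# Toy molecular graph for bromobenzene: atom index -> element symbol.
atoms = {0: "C", 1: "C", 2: "C", 3: "C", 4: "C", 5: "C", 6: "Br"}
# Bonds as (atom_a, atom_b, bond_order, in_ring); ring membership is assigned by hand.
bonds = [(i, (i + 1) % 6, 1.5, True) for i in range(6)] + [(0, 6, 1, False)]

def is_cuttable(bond):
    """Mimic the intent of "[*]!@!=[*]": not a ring bond and not a double bond."""
    _, _, order, in_ring = bond
    return not in_ring and order != 2

def components(atom_ids, bond_list):
    """Return the connected components as sorted lists of atom indices."""
    neighbours = defaultdict(set)
    for a, b, _, _ in bond_list:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, result = set(), []
    for start in sorted(atom_ids):
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        result.append(sorted(component))
    return result

def single_cuts(atoms, bonds):
    """Break each cuttable bond in turn and yield (cut bond, resulting fragments)."""
    for bond in bonds:
        if is_cuttable(bond):
            remaining = [b for b in bonds if b is not bond]
            yield bond[:2], components(atoms, remaining)

for cut, fragments in single_cuts(atoms, bonds):
    print(cut, [[atoms[i] for i in fragment] for fragment in fragments])
```

Bromobenzene has only one cuttable bond, so the generator yields a single result with the six ring carbons in one fragment and bromine in the other.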

Breaking two bonds generates three fragments, one of which necessarily connects to both of the others when the cuts are reversed. This central fragment is always designated as the core.

Triple cuts yield four fragments. Here there are two possible results: (1) the third cut is made inside the core; or (2) the third cut occurs outside the core. Result (2) is not considered, because triple cuts that yield two cores are not allowed by the Hussain-Rea algorithm.

Quadruple (or higher-order) cuts are possible in the fragmentation step. However, in the author's opinion, they are likely to yield only a small increase in the number of unique MMPs found.

The next stage of the algorithm is to index these fragments. For a single cut, the two resulting fragments (i.e., fragment X and fragment Y) are both canonicalized and added to the index. First, fragment X is added to the index as a "key" with fragment Y as its "value". The converse is also carried out: fragment Y is added as a key with fragment X as its value. An identifier for the compound is also stored in the value, so each value can be a tuple (core, ID). In the double-cut scenario, the fragmentation results in a core and two terminal fragments. The core is stored as the value, with the dot-disconnected unique identifier of the terminal fragments (in the original publication, their canonicalized SMILES) as its key. Similarly, in the triple-cut case, only fragmentations that result in a core and three terminal groups are stored in the index.

Once the index has been generated for all molecule fragments, the MMPs can be read off directly: within each key's list of values, every pairing of members forms a matched pair.
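The indexing and pair-enumeration steps for single cuts can be sketched with a plain dictionary. The sketch assumes the fragmentation step has already produced canonical fragment strings; the compound IDs, the fragment strings, and the function names are hypothetical, and a production implementation would typically also limit the size of the variable part.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical single-cut results: (compound ID, fragment X, fragment Y).
# Fragment strings are illustrative SMILES with [*] marking the attachment point.
single_cut_fragments = [
    ("mol1", "[*]c1ccccc1", "[*]Br"),
    ("mol2", "[*]c1ccccc1", "[*]Cl"),
    ("mol3", "[*]c1ccncc1", "[*]Br"),
]

def build_index(records):
    """Store each fragment as a key, with (other fragment, compound ID) as a value."""
    index = defaultdict(list)
    for mol_id, frag_x, frag_y in records:
        index[frag_x].append((frag_y, mol_id))
        # The converse entry: fragment Y as key, fragment X as value.
        index[frag_y].append((frag_x, mol_id))
    return index

def matched_pairs(index):
    """Pair up all values sharing a key; the key is the constant part of the pair."""
    pairs = []
    for key, values in index.items():
        for (frag_a, id_a), (frag_b, id_b) in combinations(values, 2):
            if frag_a != frag_b:  # identical fragments would not be a transformation
                pairs.append((id_a, id_b, key, f"{frag_a}>>{frag_b}"))
    return pairs

for pair in matched_pairs(build_index(single_cut_fragments)):
    print(pair)
```

Because every fragment is stored both as a key and as a value, the same lookup finds pairs that share a ring and differ in the halogen as well as pairs that share the halogen and differ in the ring.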

The SMARTS definition in the original publication, which identifies the bonds to be fragmented, is not ideal: it can lead to fragmentations, and groupings into sets of pairs, that chemists would not normally consider chemically sensible. For instance, applying the original SMARTS proposed by Hussain and Rea would fragment the single bond in amides, esters, or sulfonamides, as well as acyclic triple bonds. Consequently, Wirth and coworkers proposed a refined SMARTS definition: "[#6+0;!$(*=,#[!#6&!R])]!@!#!=[*]". This definition signifies any atom type connected, via a bond that is neither in a ring nor double nor triple, to a carbon atom that is not charged and that in turn is not double- or triple-bonded to an atom of any element other than carbon (unless that atom is in a ring). They also proposed a second rule that allows fragmentation of exocyclic double bonds: "[#6+0;R]=[#7&!R,#8&!R,#16&!R,#6+0&!R]".

Maximum Common Subgraph

Another class of MMP algorithms is based on the maximum common subgraph (MCS). These algorithms perform a pairwise comparison within a compound data set to find the MCS between each pair of compounds, which defines the fixed part. The atoms in each pair of compounds that are not part of the MCS are then analyzed to determine whether they constitute a single-point change. This class of MMP identification algorithms can potentially find all MMPs within a compound data set. However, the computational expense of MCS algorithms, coupled with the O(n^2) nature of the MMP identification (because of the pairwise comparisons that need to be performed), makes these algorithms computationally expensive. To alleviate this limitation, heuristics such as clustering and topological similarity are applied to pairs of compounds before the MCS or MMP identification step is carried out, which may result in certain MMPs not being found. The high computational cost means that these algorithms are very difficult to apply to large compound data sets.
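The effect of a similarity heuristic on the pairwise workload can be sketched as follows. The sketch assumes each compound is represented by a set of integer feature IDs standing in for a topological fingerprint, and uses the Tanimoto coefficient with an arbitrary cutoff of 0.5 as the prefilter. The compound IDs, features, threshold, and function names are illustrative assumptions, and the expensive MCS step itself is not implemented here.

```python
from itertools import combinations

def tanimoto(features_a, features_b):
    """Tanimoto coefficient of two feature sets: |A & B| / |A | B|."""
    union = len(features_a | features_b)
    return len(features_a & features_b) / union if union else 0.0

def candidate_pairs(fingerprints, threshold=0.5):
    """Keep only the compound pairs similar enough to be worth an MCS calculation."""
    return [
        (a, b)
        for a, b in combinations(sorted(fingerprints), 2)
        if tanimoto(fingerprints[a], fingerprints[b]) >= threshold
    ]

# Hypothetical fingerprints: compound ID -> set of feature IDs.
fingerprints = {
    "A": {1, 2, 3, 4},
    "B": {1, 2, 3, 5},
    "C": {7, 8, 9},
}

n = len(fingerprints)
print("all pairs:", n * (n - 1) // 2)
print("pairs passed to MCS:", candidate_pairs(fingerprints))
```

The prefilter still visits every pair, but it replaces most MCS calculations with a much cheaper set comparison. Any true matched pair whose similarity falls below the cutoff is lost, which is the trade-off described above.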

Figure: MCS MMPA algorithm.

Conclusion

Matched molecular pair analysis is a powerful tool in drug discovery and other domains. The methods used to identify matched pairs range from manual inspection, through supervised methods to unsupervised methods. The MMP framework allows one to study numerous properties (most commonly binding affinity or potency) and to rationalize the design of the next compound to make within a series.