Exploring Chemical Space with Score-based Out-of-distribution Generation

Score-based generative models have shown promise in molecule generation, but often struggle to create truly novel candidates beyond the training distribution. MOOD (Molecular Out-Of-distribution Diffusion) is a score-based diffusion framework that enables controllable exploration of out-of-distribution chemical space without incurring additional computational costs. By integrating a property prediction network into the reverse-time SDE, MOOD effectively guides the generation toward molecules with desired novel traits such as high binding affinity, drug-likeness, and synthesizability.

Introduction: The Challenge of Novel Molecule Generation

In de novo drug discovery, deep generative models have emerged as powerful tools for automating the design of novel molecules. However, the generated molecules tend to closely resemble those in the training distribution, restricting their usefulness in discovering truly novel compounds with superior therapeutic potential. This is especially problematic when aiming to avoid patented scaffolds or explore uncharted regions of the chemical space. Moreover, real-world drug design often requires satisfying multiple complex property constraints, such as high binding affinity, drug-likeness, and synthesizability. Most existing models optimize simplistic proxy scores, which often result in trivial or unrealistic structures. This blog introduces MOOD (Molecular Out-Of-distribution Diffusion), a novel score-based generative framework that addresses these limitations by enabling controlled exploration beyond the training data, while optimizing for multiple drug-relevant properties.

Limitations of Existing Models

Prior work on molecular generation, such as VAEs, GANs, and reinforcement learning-based models, primarily relies on learning distributions from existing molecular datasets. As a result, these models exhibit a strong inductive bias toward the training distribution. For example, the molecules generated by GENTRL frequently bear a strong resemblance to known active compounds. Some works attempt to explore novelty through fragment-based RL or prioritized replay, but these methods are still constrained by the fragments derived from known molecules and incur high computational costs. Even recent diffusion models like GDSS, while capable of generating high-quality molecular graphs, lack a mechanism to explicitly control the deviation from the training data. Furthermore, these models often optimize simplified properties (e.g., penalized logP, QED) that do not necessarily correlate with real-world drug efficacy.

MOOD: A New Paradigm for Out-of-Distribution Generation

Figure 1: Concept Figure of MOOD.

MOOD introduces a new framework for molecule generation that targets two fundamental challenges: (1) generating molecules that are truly novel and out-of-distribution (OOD), and (2) ensuring that these molecules satisfy real-world drug-like properties. MOOD achieves this through an OOD-controlled score-based diffusion process combined with gradient-based conditional generation. Unlike traditional reinforcement learning approaches or fragment-based generators, MOOD operates without additional training overhead or sampling complexity, and enables flexible control over how far the generation deviates from the training distribution.

Score-based Generative Modeling with SDEs

MOOD builds upon GDSS, which models graphs using a system of stochastic differential equations (SDEs). The molecular graph is represented as $G_t = (X_t, A_t)$, where $X_t$ are node features and $A_t$ is the adjacency matrix.

The forward SDE is:

\[dG_t = f_t(G_t)\,dt + g_t\,d\omega,\]

and the reverse-time diffusion becomes:

\[\begin{aligned} dX_t &= \left[ f_{1,t}(X_t) - g_{1,t}^2 \nabla_{X_t} \log p_t(X_t, A_t) \right] dt + g_{1,t} d\bar{\omega}_1, \\ dA_t &= \left[ f_{2,t}(A_t) - g_{2,t}^2 \nabla_{A_t} \log p_t(X_t, A_t) \right] dt + g_{2,t} d\bar{\omega}_2, \end{aligned}\]

using learned score networks $s_{\theta_1,t}, s_{\theta_2,t}$. (More details about GDSS can be found in Blog Post “Score-based Generative Modeling of Graphs via the System of Stochastic Differential Equations”).

Expanding the Exploration Space with OOD Control

To expand the exploration space of the diffusion, MOOD introduces an OOD-controlled score-based graph generative model that can generate samples outside the training distribution, where the degree of OOD-ness is controlled by a hyperparameter $\lambda \in [0, 1)$.

We model this by sampling from the conditional distribution:

\[p_t(G_t \mid y_o = \lambda),\]

where $y_o$ denotes the OOD control condition. The reverse-time SDE becomes:

\[dG_t = \left[ f_t(G_t) - g_t^2 \nabla_{G_t} \log p_t(G_t \mid y_o = \lambda) \right] dt + g_t d\bar{\omega}. \tag{3}\]

The conditional score can be decomposed as:

\[\nabla_{G_t} \log p_t(G_t \mid y_o = \lambda) = \nabla_{G_t} \log p_t(G_t) + \nabla_{G_t} \log p_t(y_o = \lambda \mid G_t). \tag{4}\]

While $\nabla_{G_t} \log p_t(G_t)$ can be estimated by the score networks $s_{\theta_1,t}, s_{\theta_2,t}$, the OOD condition term is modeled as:

\[p_t(y_o = \lambda \mid G_t) \propto p_t(G_t)^{-\sqrt{\lambda}}.\]

This implies that low-likelihood samples are more likely to be considered OOD. This modeling choice, which is inspired by prior work, encourages the model to explore underrepresented regions of chemical space.
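To see what the OOD term does in practice, consider a minimal one-dimensional sketch. Substituting $p_t(y_o = \lambda \mid G_t) \propto p_t(G_t)^{-\sqrt{\lambda}}$ into Eq. (4) simply rescales the unconditional score by $(1 - \sqrt{\lambda})$. Everything else in the example is an assumption made for illustration and is not the MOOD implementation: the data are $\mathcal{N}(0, 1)$, the forward SDE is $dx = d\omega$ (so $f_t = 0$, $g_t = 1$, and $p_t = \mathcal{N}(0, 1 + t)$), and the exact score $-x / (1 + t)$ stands in for the learned score network. The names `sample_ood` and `variance` and all of their parameters are hypothetical.

```python
import math
import random


def sample_ood(lam, n_samples=2000, t_max=4.0, n_steps=200, seed=0):
    """Euler-Maruyama simulation of the OOD-controlled reverse-time SDE (toy 1D case).

    Assumed toy setup (not from the paper): data ~ N(0, 1), forward SDE dx = dw,
    hence p_t = N(0, 1 + t) and the exact score is -x / (1 + t).
    """
    rng = random.Random(seed)
    dt = t_max / n_steps
    scale = 1.0 - math.sqrt(lam)  # OOD control rescales the score by (1 - sqrt(lambda))
    samples = []
    for _ in range(n_samples):
        # Start from the prior p_T = N(0, 1 + T).
        x = rng.gauss(0.0, math.sqrt(1.0 + t_max))
        for k in range(n_steps):
            t = t_max - k * dt
            score = -x / (1.0 + t)
            # Reverse-time step with f = 0 and g = 1:
            # x <- x + scale * score * dt + sqrt(dt) * z
            x += scale * score * dt + math.sqrt(dt) * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


in_dist = sample_ood(lam=0.0)   # approximately recovers the data distribution
ood = sample_ood(lam=0.04)      # weaker pull toward high density, so a wider spread
print(f"variance at lambda=0.00: {variance(in_dist):.2f}")
print(f"variance at lambda=0.04: {variance(ood):.2f}")
```

In this toy case a larger $\lambda$ weakens the pull toward high-density regions, so the samples end up more spread out than the data. That is the one-dimensional analogue of drifting away from the training distribution.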

Conditional Generation for Property Optimization

MOOD further modifies generation to favor molecules with desirable chemical properties using conditional generation. Here, the objective is to sample from the joint conditional distribution:

\[p_t(G_t \mid y_o = \lambda, y_p),\]

where $y_p$ represents a property condition such as high binding affinity or drug-likeness. This is decomposed using Bayes’ rule:

\[p_t(G_t \mid y_o = \lambda, y_p) \propto p_t(G_t) \, p_t(y_o = \lambda \mid G_t) \, p_t(y_p \mid G_t, y_o = \lambda).\]

The property term $p_t(y_p \mid G_t, y_o = \lambda)$ is modeled with a Boltzmann distribution:

\[p_t(y_p \mid G_t, y_o = \lambda) = \frac{1}{Z_t} \exp\left( \alpha_t P_\phi(G_t, \lambda) \right),\]

where $P_\phi$ is a learned property prediction function and $\alpha_t$ is a scaling parameter. Substituting this into the reverse-time SDE gives:

\[dG_t = \left[ f_t(G_t) - (1 - \sqrt{\lambda}) g_t^2 \nabla_{G_t} \log p_t(G_t) - \alpha_t g_t^2 \nabla_{G_t} P_\phi(G_t, \lambda) \right] dt + g_t d\bar{\omega},\]

where the last term encourages sampling toward regions with higher predicted property values.
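The intermediate step can be written out explicitly. Taking the gradient of the logarithm of the Bayes decomposition and inserting the two modeling choices above (with the normalizing constant $Z_t$ assumed not to depend on $G_t$) gives:

\[\begin{aligned} \nabla_{G_t} \log p_t(G_t \mid y_o = \lambda, y_p) &= \nabla_{G_t} \log p_t(G_t) + \nabla_{G_t} \log p_t(y_o = \lambda \mid G_t) + \nabla_{G_t} \log p_t(y_p \mid G_t, y_o = \lambda) \\ &= \nabla_{G_t} \log p_t(G_t) - \sqrt{\lambda}\, \nabla_{G_t} \log p_t(G_t) + \alpha_t \nabla_{G_t} P_\phi(G_t, \lambda) \\ &= (1 - \sqrt{\lambda})\, \nabla_{G_t} \log p_t(G_t) + \alpha_t \nabla_{G_t} P_\phi(G_t, \lambda). \end{aligned}\]

Multiplying this conditional score by $-g_t^2$ and adding the drift $f_t(G_t)$ recovers the guided reverse-time SDE above.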

Following the form of GDSS, this can be separated into the node features $X_t$ and the adjacency matrix $A_t$:

\[\begin{aligned} dX_t &= \left[ f_{1,t}(X_t) - (1 - \sqrt{\lambda}) g_{1,t}^2 s_{\theta_1,t}(X_t, A_t) - \alpha_{1,t} g_{1,t}^2 \nabla_{X_t} P_\phi(X_t, A_t, \lambda) \right] dt + g_{1,t} d\bar{\omega}_1, \\ dA_t &= \left[ f_{2,t}(A_t) - (1 - \sqrt{\lambda}) g_{2,t}^2 s_{\theta_2,t}(X_t, A_t) - \alpha_{2,t} g_{2,t}^2 \nabla_{A_t} P_\phi(X_t, A_t, \lambda) \right] dt + g_{2,t} d\bar{\omega}_2, \end{aligned}\]

where $s_{\theta_1,t}$ and $s_{\theta_2,t}$ are score networks approximating the partial derivatives of the log data density with respect to $X_t$ and $A_t$, respectively.

To balance the influence of the score and property gradients, MOOD dynamically adjusts the weighting coefficients:

\[\alpha_{1,t} = \frac{r_{1,t} \| s_{\theta_1,t}(G_t) \|}{\| \nabla_{X_t} P_\phi(G_t, \lambda) \|}, \quad \alpha_{2,t} = \frac{r_{2,t} \| s_{\theta_2,t}(G_t) \|}{\| \nabla_{A_t} P_\phi(G_t, \lambda) \|},\]

where $r_{1,t}$ and $r_{2,t}$ are manually defined scaling ratios. This ensures that the property optimization does not overpower or vanish relative to the diffusion guidance.
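The effect of this rescaling can be checked with a few lines of code. The sketch below assumes the Euclidean norm over flattened tensors and adds a small `eps` in the denominator to guard against a vanishing property gradient; both are assumptions, and the helper names and the numbers are hypothetical.

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))


def adaptive_alpha(score, prop_grad, ratio, eps=1e-12):
    """alpha_t = r_t * ||score|| / ||grad P||.

    `eps` is an assumption added here to avoid division by zero when the
    property gradient vanishes; it is not part of the formula in the text.
    """
    return ratio * l2_norm(score) / (l2_norm(prop_grad) + eps)


# Hypothetical flattened score and property gradient for the node features X_t.
score_x = [0.3, -1.2, 0.8, 0.1]
grad_p_x = [0.002, 0.001, -0.004, 0.003]  # much smaller scale than the score

alpha_x = adaptive_alpha(score_x, grad_p_x, ratio=0.5)
guidance = [alpha_x * g for g in grad_p_x]

# After rescaling, the guidance norm is `ratio` times the score norm,
# regardless of the raw scale of the property gradient.
print(l2_norm(guidance) / l2_norm(score_x))
```

With this choice, $r_{1,t}$ and $r_{2,t}$ directly set the relative strength of the property guidance, which makes them easier to tune than a raw step size.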

Property Prediction Network Architecture

To approximate the property function $P_\phi$, MOOD trains a dedicated neural network on molecule-property pairs. This model predicts properties such as docking score, QED, and synthetic accessibility from graph inputs:

\[P_\phi(G_t) := \text{MLP}_s(\tanh(H')),\]

where

\[H' = \text{MLP}_s\left( \left[ H_0, H_1, ..., H_L \right] \right) \odot \text{MLP}_t\left( \left[ H_0, H_1, ..., H_L \right] \right).\]

Here, $H_0 = X_t$, and each $H_{l+1} = \text{GNN}(H_l, A_t)$, using a Graph Convolutional Network (GCN). The final feature is passed through two MLPs, one with a sigmoid and the other with a tanh activation, and their outputs are combined by element-wise multiplication to yield the final property score.
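The data flow of this predictor can be illustrated with a small pure-Python forward pass. This is only a sketch under several assumptions that the text does not specify: each "MLP" is a single linear layer, the GCN uses a row-normalized adjacency with self-loops and a tanh nonlinearity, node features are sum-pooled before the final readout, and all weights are random rather than trained. Every function and variable name is hypothetical.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def normalize_adjacency(adj):
    """Row-normalized adjacency with self-loops (one common GCN choice, assumed here)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    return [[v / sum(row) for v in row] for row in a_hat]


def predict_property(x, adj, gcn_weights, w_s, w_t, w_out):
    a_norm = normalize_adjacency(adj)
    layers = [x]                                  # H_0 = X_t
    for w in gcn_weights:                         # H_{l+1} = GNN(H_l, A_t)
        h = matmul(matmul(a_norm, layers[-1]), w)
        layers.append([[math.tanh(v) for v in row] for row in h])
    # Concatenate [H_0, ..., H_L] along the feature axis, node by node.
    h_cat = [sum((layer[i] for layer in layers), []) for i in range(len(x))]
    gate = [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in matmul(h_cat, w_s)]
    feat = [[math.tanh(v) for v in row] for row in matmul(h_cat, w_t)]
    # Element-wise product of the sigmoid branch and the tanh branch gives H'.
    h_prime = [[g * f for g, f in zip(g_row, f_row)] for g_row, f_row in zip(gate, feat)]
    # Assumed readout: sum over nodes, tanh, then a linear map to a scalar.
    pooled = [math.tanh(sum(col)) for col in zip(*h_prime)]
    return sum(p * w for p, w in zip(pooled, w_out))


rng = random.Random(0)
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]   # 3 atoms, 2 hypothetical atom types
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]    # a 3-node path graph
gcn_weights = [random_matrix(2, 4, rng), random_matrix(4, 4, rng)]
w_s = random_matrix(10, 4, rng)            # 10 = 2 + 4 + 4 concatenated features
w_t = random_matrix(10, 4, rng)
w_out = [rng.uniform(-0.5, 0.5) for _ in range(4)]
print(predict_property(x, adj, gcn_weights, w_s, w_t, w_out))
```

Because the node features are pooled with a sum, the predicted score does not depend on the order in which atoms are listed, which is a property any graph-level predictor should have.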

Overall, by focusing on OOD generation and property optimization, MOOD enables both controlled exploration and the generation of chemically meaningful molecules.

Experimental Results

Novel Molecule Generation

To evaluate MOOD’s ability to generate truly novel molecules, an experiment was conducted on an unconstrained OOD generation task, that is, generating molecules without explicitly optimizing for chemical properties. The goal is to validate whether MOOD’s $\lambda$-controlled diffusion process can produce molecules that systematically deviate from the training data distribution. Trained on the ZINC250k dataset, MOOD generates 3000 molecules using different values of $\lambda$, with the property guidance term $P_\phi$ removed. The following metrics are used to evaluate the novelty and diversity of the generated molecules:

Fréchet ChemNet Distance (FCD): Measures distributional shift in learned chemical representations between training and generated sets.
NSPDK MMD: Measures structural differences based on graph kernel statistics.
Novelty Score: The fraction of valid molecules that have a Tanimoto similarity less than 0.4 to their closest neighbor in the training data.
\[\text{Novelty} = \frac{\text{number of valid molecules with } \text{Tanimoto}(m, m') < 0.4}{\text{total number of valid generated molecules}}\]
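The novelty score is straightforward to compute once each molecule has a fingerprint. The sketch below represents a fingerprint as a plain Python set of "on" bit indices, which is an assumption made to keep the example dependency-free; in practice the fingerprints would come from a cheminformatics toolkit. The function names and the toy data are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of 'on' bit indices."""
    if not fp_a and not fp_b:
        return 0.0  # convention assumed here for two empty fingerprints
    return len(fp_a & fp_b) / len(fp_a | fp_b)


def novelty(generated_fps, training_fps, threshold=0.4):
    """Fraction of valid generated molecules whose nearest training neighbor
    has a Tanimoto similarity below `threshold`."""
    if not generated_fps:
        return 0.0
    novel = sum(
        1
        for fp in generated_fps
        if max((tanimoto(fp, ref) for ref in training_fps), default=0.0) < threshold
    )
    return novel / len(generated_fps)


# Toy fingerprints (hypothetical bit sets, not real molecules).
training = [{1, 2, 3, 4}, {5, 6, 7, 8}]
generated = [
    {1, 2, 3, 9},     # shares 3 of 5 bits with the first training item: 0.6, not novel
    {10, 11, 12, 1},  # shares 1 of 7 bits: about 0.14, novel
    {5, 6, 20, 21},   # shares 2 of 6 bits with the second item: about 0.33, novel
]
print(novelty(generated, training))
```

Two of the three toy molecules fall below the 0.4 threshold with respect to their closest training neighbor, so the novelty here is 2/3.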

Figure 3: (Left) UMAP visualization of the ZINC250k dataset and the generated molecules by the proposed OOD-controlled diffusion process. (Right) Evaluation results of the molecules generated by the OOD-controlled diffusion.

As shown in Figure 3 (a)-(d), increasing $\lambda$ causes the samples to diverge farther from the training data in the embedding space. This visually supports the claim that MOOD’s OOD-controlled diffusion process offers smooth and tunable control over novelty.

Moreover, on the right side of Figure 3, both FCD and NSPDK MMD increase monotonically with $\lambda$, suggesting greater divergence in biochemical features and molecular graph structures. The novelty also increases, meaning the generated molecules differ from the training data not only in distribution but also at the level of individual molecules.

These results indicate that MOOD can generate molecules beyond the training distribution in a controlled manner, opening the door to the exploration of novel chemical space.

Property Optimization

To evaluate whether MOOD can discover compounds that are not only out-of-distribution (OOD) but also biochemically desirable, with high binding affinity, strong drug-likeness, and good synthetic accessibility, a second experiment reflects the practical demands of real-world de novo drug discovery.

To reflect multi-objective optimization, the following composite property function is defined:

\[P_{\text{obj}}(G_t) = \text{DS'}(G_t) \times \text{QED}(G_t) \times \text{SA'}(G_t)\]

where

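DS’ is the normalized docking score. A lower (more negative) raw docking score indicates stronger binding; since the guidance pushes toward higher values of the objective, the normalization maps better docking scores to larger DS’.
QED quantifies drug-likeness (higher is better).
SA’ is the normalized synthetic accessibility. A lower raw SA score means easier synthesis, and the normalization likewise maps easier synthesis to larger SA’.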
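As a concrete illustration, the composite objective can be sketched as follows. The two normalizations, $-\text{DS}/20$ and $(10 - \text{SA})/9$, each clipped to $[0, 1]$, are assumptions chosen so that all three factors are higher-is-better; the post does not state the exact formulas, so they should be checked against the paper. The function names are hypothetical.

```python
def clip01(value):
    return max(0.0, min(1.0, value))


def normalized_docking_score(ds):
    """Map a raw docking score (kcal/mol, more negative is better) to [0, 1].
    Assumed normalization: -DS / 20, clipped."""
    return clip01(-ds / 20.0)


def normalized_sa(sa):
    """Map a raw SA score (1 = easy to make, 10 = hard) to [0, 1].
    Assumed normalization: (10 - SA) / 9, clipped."""
    return clip01((10.0 - sa) / 9.0)


def p_obj(ds, qed, sa):
    """Composite objective DS' x QED x SA'; every factor is higher-is-better."""
    return normalized_docking_score(ds) * qed * normalized_sa(sa)


print(p_obj(ds=-10.0, qed=0.8, sa=2.0))  # strong binder, drug-like, easy to make
print(p_obj(ds=-10.0, qed=0.8, sa=9.0))  # same binder, but hard to synthesize
```

Because the objective is a product, a molecule that fails badly on any one property receives a low overall score, which discourages trading away synthesizability or drug-likeness for docking score alone.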

A neural property predictor $P_\phi$ is trained on molecules from the ZINC250k dataset to learn $P_{\text{obj}}$. Performance is evaluated with two metrics that jointly measure novelty and property optimization:

Novel Hit Ratio (%):
Fraction of unique hit molecules (with DS < known active median, QED > 0.5, SA < 5) that are structurally novel (Tanimoto similarity < 0.4 with training data).
Novel Top-5% DS:
Average docking score of the top 5% unique and novel molecules, ensuring property quality alongside novelty.
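Both metrics can be sketched in a few lines. The record layout, the function names, and the toy values below are hypothetical. Two choices are assumptions rather than statements from the post: the hit ratio is taken relative to the total number of generated molecules, and the top-5% score is computed over all unique novel molecules without an additional QED or SA filter.

```python
def novel_hit_ratio(molecules, ds_threshold, n_generated, sim_threshold=0.4):
    """Percentage of generated molecules that are unique, novel hits."""
    hits = {
        m["smiles"]
        for m in molecules
        if m["ds"] < ds_threshold
        and m["qed"] > 0.5
        and m["sa"] < 5
        and m["max_sim"] < sim_threshold
    }
    return 100.0 * len(hits) / n_generated


def novel_top5_ds(molecules, sim_threshold=0.4):
    """Mean docking score of the best 5% of unique, novel molecules."""
    best = {}
    for m in molecules:
        if m["max_sim"] < sim_threshold:
            best[m["smiles"]] = m["ds"]  # duplicates collapse to a single entry
    if not best:
        return None
    scores = sorted(best.values())       # most negative (best) first
    k = max(1, int(0.05 * len(scores)))
    return sum(scores[:k]) / k


# Toy records; "max_sim" is the highest Tanimoto similarity to the training set.
molecules = [
    {"smiles": "mol_a", "ds": -11.0, "qed": 0.7, "sa": 3.0, "max_sim": 0.30},  # novel hit
    {"smiles": "mol_a", "ds": -11.0, "qed": 0.7, "sa": 3.0, "max_sim": 0.30},  # duplicate
    {"smiles": "mol_b", "ds": -12.0, "qed": 0.6, "sa": 2.5, "max_sim": 0.55},  # hit, not novel
    {"smiles": "mol_c", "ds": -8.0, "qed": 0.9, "sa": 2.0, "max_sim": 0.20},   # novel, weak binder
    {"smiles": "mol_d", "ds": -10.5, "qed": 0.4, "sa": 3.0, "max_sim": 0.10},  # fails QED
]
ds_threshold = -10.0  # hypothetical median docking score of known actives

print(novel_hit_ratio(molecules, ds_threshold, n_generated=len(molecules)))
print(novel_top5_ds(molecules))
```

Only `mol_a` passes every filter, and it is counted once despite appearing twice, which is why uniqueness matters: a generator that repeats a single good molecule should not be rewarded for it.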

To test generality, the evaluations are performed across five protein targets (parp1, fa7, 5ht1b, braf, and jak2), with the OOD control strength fixed at $\lambda = 0.04$.

Table 1: Novel hit ratio (%) results.

Table 2: Novel top 5% docking score (kcal/mol) results.

As shown in Tables 1 and 2, MOOD achieves state-of-the-art performance across almost all protein targets.

Table 3: Novel hit ratio (%) results with the similarity condition of 0.3.

Table 4: Novel top 5% docking score (kcal/mol) results with the similarity condition of 0.3.

Moreover, MOOD outperforms the baselines in novel hit ratio and top-5% docking score, especially under the stricter novelty threshold (Tables 3 and 4).

In particular, MOOD consistently beats MOOD-w/o OOD Control, which indicates that $\lambda$-based exploration improves discovery. In addition, MOOD-w/o Property Predictor still outperforms GDSS, showing that the OOD mechanism alone adds significant value.

Further results in the Appendix of the paper (Tables 9-13) show that MOOD also maintains high uniqueness, broad structural diversity, and robust hit rates.

These findings suggest that MOOD’s balanced guidance allows it to find chemically diverse and viable compounds that elude conventional methods.

Explorability Visualization

Figure 4: (Left) UMAP visualization of the molecules from ZINC250k and the generated samples with parp1 as the target protein. (Right) Distributional distances of the generated molecules measured by FCD and NSPDK MMD with respect to ZINC250k.

UMAP visualization (Figure 4, Left) shows that MOOD explores chemical space more broadly than competitors like REINVENT and FREED-QS, whose outputs cluster near the training data.
MOOD’s samples visibly shift into new, unexplored regions, showing the role of the OOD term.

Metrics such as FCD and NSPDK MMD (Figure 4, Right) further demonstrate this distributional divergence.

Generated Molecules and Visual Inspection

Figure 5: Generated hit molecules with parp1 as the target protein and the corresponding ZINC250k molecules of the highest similarity.

Figure 5 compares the actual molecules produced by each method and shows that the baseline models tend to reproduce redundant motifs or slight variations of the training data. In contrast, MOOD’s molecules show low similarity to ZINC250k yet high binding affinity, demonstrating both novelty and utility.

Figure 6: Novel hit molecules found by MOOD against parp1 and the top 0.01% ZINC250k molecules.

Furthermore, Figure 6 highlights MOOD’s hits that outperform the top 0.01% of ZINC250k molecules in docking score while remaining dissimilar, underscoring MOOD’s effectiveness in discovering novel chemical optima.

Comparison to 3D Generation Models

Table 5: Novel hit ratio (%) and novel top 5% docking score (kcal/mol) results with 3D molecule generation baselines and GDSS with respect to the target protein glmu.

MOOD is also benchmarked against modern 3D molecule generation models, such as Luo et al. and Pocket2Mol, which leverage spatial binding site data. The results in Table 5 show that MOOD outperforms both, even though it does not use any 3D information, emphasizing its practicality and generalization strength.

Ablation Studies

To isolate the contributions of each core component in MOOD, the authors perform a comprehensive ablation study. This analysis investigates how both the OOD control mechanism and the property-guided gradient contribute to the quality and novelty of generated molecules.

Effects of OOD Control and Property Gradient

To understand the effect of each module, the following variants are compared: (1) MOOD-w/o Property Predictor, which disables property conditioning and keeps only OOD guidance; (2) MOOD-w/o OOD Control, which disables $\lambda$-based novelty control and uses only the property gradient; (3) GDSS, the baseline diffusion model without OOD or property guidance; and (4) full MOOD, which combines both OOD control and property optimization.

As shown in Tables 1–3, both OOD control and property optimization are necessary for reaching the best chemical optima. The pairwise comparisons can be read as follows:

MOOD > MOOD-w/o Property Predictor: Conditioning on biochemical properties boosts optimization.
MOOD-w/o Property Predictor > GDSS: Demonstrates the standalone power of OOD exploration.
MOOD > MOOD-w/o OOD Control: Confirms that novelty control further enhances exploration.
MOOD-w/o OOD Control > GDSS: Even without OOD control, the property gradient alone steers the diffusion toward more promising areas of chemical space.

Training on Low-Property Subsets: Can MOOD Discover What It Never Saw?

Figure 7: Top 5% docking score distribution of the molecules with respect to the target protein parp1.

To test MOOD’s ability to generalize beyond biased training data, the following model variants are trained on a low-quality subset of the ZINC250k dataset (bottom 50% ranked by $P_{\text{obj}}$): (1) L-MOOD-w/o OOD Control, which uses only property guidance, and (2) L-MOOD, which uses both OOD control and property guidance. The top 5% docking scores (DS) are evaluated for generated molecules that satisfy QED > 0.5 and SA < 5. As shown in Figure 7, both L-MOOD and L-MOOD-w/o OOD Control outperform their low-quality training data, indicating the effectiveness of property-guided diffusion. Only L-MOOD surpasses the full ZINC250k dataset in top-5% DS, despite never seeing high-quality molecules during training. This suggests that OOD-controlled diffusion not only fosters novelty but can also find superior optima in property space, effectively extrapolating beyond the limits of the training data.

Personal Thoughts

MOOD addresses the core limitation of molecular graph generation with the ability to explore novel chemical spaces while optimizing for real-world drug-like properties.

By integrating OOD-controlled reverse-time diffusion with property-guided gradient optimization, MOOD creates a principled and controllable pathway for discovering molecules that are both chemically novel and pharmacologically relevant.

Key Takeaways

Controlled Novelty through OOD-guided Diffusion:
MOOD introduces a framework to bias sampling toward low-likelihood regions using a tunable $\lambda$ parameter. It is especially impressive that it adds only a slight modification to the reverse-time SDE and does not require any further training, while still leading to effective chemical diversity and exploration.
Multi-objective Property Optimization:
Rather than optimizing a single property, MOOD uses a composite objective: $P_{\text{obj}}(G_t) = \text{DS'}(G_t) \times \text{QED}(G_t) \times \text{SA'}(G_t)$. This objective is modeled by a learnable property predictor $P_\phi$, which provides gradient signals to steer generation.
Complementarity of OOD and Property Guidance:
Ablation studies show that both components, OOD control and property gradients, are individually beneficial but synergistic when combined.
Generalization Beyond Training Distribution:
Even when trained on low-quality subsets of the data, MOOD discovers superior molecules not seen during training, highlighting its extrapolative power.

Further Research Directions

Based on the proposed method, I believe the following approaches are promising directions for further research.

Incorporation of 3D Structural Information
Currently MOOD is limited to 2D graphs, so it could be extended to integrate 3D binding pocket data, similar to approaches like Pocket2Mol. This would be especially beneficial for structure-based drug design, where spatial complementarity plays a critical role.
End-to-End Differentiable Docking
While MOOD uses a learned property predictor $P_\phi$ to approximate scores like docking affinity, it does not simulate actual protein–ligand interactions. One possible extension would be to replace or augment $P_\phi$ with differentiable or learned docking simulators, such as DiffDock or EquiBind, that directly model the binding between a molecule and a protein target. This would allow end-to-end optimization grounded in real physical interaction modeling, improving the relevance of generated candidates for biochemical evaluation.
Multi-agent or Population-based Exploration
Inspired by evolutionary strategies, MOOD could be extended to use multiple agents or a population of models exploring different regions of chemical space in parallel. Such an approach would increase diversity and robustness in molecule discovery, potentially leading to more novel and optimal candidates by escaping local optima through collaborative or competitive exploration dynamics.
Generalization to Other Modalities
With the score-based generative modeling, OOD control, and property-conditioned sampling, MOOD could potentially be adapted to other structured domains like protein design, material discovery, or neural architecture search, where goal-conditioned and diverse generation is essential.

Extending MOOD: Incorporating 3D Structural Information

In this section, I propose a potential extension of MOOD by integrating 3D structural information of target proteins into the generative process. While MOOD currently operates only in the 2D topological space of molecules, ligand–protein interactions are fundamentally 3D, and incorporating spatial constraints from protein binding pockets can improve biological relevance.

The 3D-aware MOOD framework could be designed as follows:

Protein Pocket Encoder
A neural encoder processes the 3D structure of a protein binding site, represented as a voxel grid, surface mesh, or point cloud, to produce a latent descriptor $p \in \mathbb{R}^d,$ capturing shape and electrochemical features relevant for binding.
Conditioned Property Predictor
The property prediction network $P_\phi(G_t, \lambda)$ is augmented to incorporate protein context as $P_\phi(G_t, \lambda, p),$ allowing MOOD to optimize molecules for both chemical properties and spatial compatibility with the target site.
Modified Reverse-Time SDE
The conditional diffusion process is updated to guide generation with both OOD control and protein-specific gradients:
\[dG_t = \left[ f_t(G_t) - (1 - \sqrt{\lambda}) g_t^2 \nabla_{G_t} \log p_t(G_t) - \alpha_t g_t^2 \nabla_{G_t} P_\phi(G_t, \lambda, p) \right] dt + g_t d\bar{\omega}.\]

This protein-aware extension of MOOD offers several advantages:

Target-specific Design: Molecules are generated to be spatially compatible with specific protein pockets, improving hit rates in structure-based drug discovery.
End-to-End Learning: Gradients from the property predictor can be backpropagated through the binding site encoder, enabling joint optimization of ligands and binding-site representations.
Flexible Integration: This formulation can incorporate various protein representations, including pretrained geometric encoders or differentiable docking modules.

Possible issues such as the limited availability of aligned protein–ligand datasets and the increased complexity of modeling in 3D can be mitigated using transfer learning from structural databases such as PDBbind, or by leveraging pretrained protein encoders like those used in AlphaFold2 or EquiBind.