CDSML Seminar Series 

a. Online via Zoom: 10.00 am – 11.00 am, Thursday OR

b. *Face-to-Face: 4.00 pm – 5.00 pm, Thursday 

Details: Zoom link; Passcode: 123321

Coordinator: Louxin Zhang

Title  Mathematical AI for Molecular Data Analysis

Abstract  Artificial intelligence (AI)-based molecular data analysis has begun to gain momentum due to great advances in experimental data, computational power, and learning models. However, a major issue that remains for all AI-based learning models is efficient molecular representation and featurization. Here we propose advanced mathematics-based molecular representations and featurization (or feature engineering). Molecular structures and their interactions are represented as various simplicial complexes (Rips complexes, neighborhood complexes, Dowker complexes, and Hom-complexes), hypergraphs, and Tor-algebra-based models. Molecular descriptors are systematically generated from various persistent invariants, including persistent homology, persistent Ricci curvature, persistent spectral theory, and persistent Tor-algebra. These features are combined with machine learning and deep learning models, including random forests, CNNs, RNNs, Transformers, BERT, and others. Such approaches have demonstrated great advantages over traditional models in drug design and materials informatics.
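
The workflow described above, topological featurization of a molecular point cloud followed by a standard learning model, can be sketched as follows. This is a minimal illustrative example, not the speaker's actual pipeline: it assumes 3D atomic coordinates as input, uses the gudhi and scikit-learn libraries, and runs on synthetic coordinates with hypothetical labels.

    # Illustrative sketch only: persistent-homology features from a point cloud,
    # fed to a random forest. Assumes gudhi and scikit-learn are installed;
    # data and labels below are synthetic placeholders.
    import numpy as np
    import gudhi
    from sklearn.ensemble import RandomForestClassifier

    def persistence_features(points, max_edge_length=8.0, max_dim=2, n_bins=10):
        """Binned bar-length statistics (a simple persistence histogram)."""
        rips = gudhi.RipsComplex(points=points, max_edge_length=max_edge_length)
        st = rips.create_simplex_tree(max_dimension=max_dim)
        st.compute_persistence()
        feats = []
        for dim in range(max_dim):
            bars = st.persistence_intervals_in_dimension(dim)
            lengths = np.array([d - b for b, d in bars if np.isfinite(d)])
            hist, _ = np.histogram(lengths, bins=n_bins, range=(0.0, max_edge_length))
            feats.append(hist)
        return np.concatenate(feats)

    # Hypothetical data: each "molecule" is an (n_atoms, 3) coordinate array with a binary label.
    molecules = [np.random.rand(20, 3) * 10 for _ in range(50)]
    labels = np.random.randint(0, 2, size=50)

    X = np.vstack([persistence_features(m) for m in molecules])
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, labels)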

* The workshop will be held in the face-to-face format at NUS Block S16, #05-98.

Title  A Reinforcement Learning Framework for Bayesian Sequential Experimental Design

Abstract  Experiments are indispensable for learning and developing models in engineering and science. A careful design of these often-expensive data-acquisition opportunities can be immensely beneficial. Simulation-based optimal experimental design, while leveraging a predictive model, offers a framework to systematically quantify and maximize the value of experiments. We focus on designing a finite sequence of experiments, seeking optimal design policies that can (a) adapt to newly collected data during the sequence (i.e. feedback) and (b) anticipate future changes (i.e. lookahead). We cast this sequential decision-making problem in a Bayesian setting with information-based utilities, and solve it numerically via policy gradient methods from reinforcement learning. In particular, we directly parameterize the policies and value functions by neural networks—thus adopting an actor-critic approach—and improve them using gradient estimates produced from simulated design and observation sequences. We demonstrate the overall method on a problem of optimal sensor movement for contaminant source inversion.
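
The actor-critic construction mentioned in the abstract can be illustrated with a generic skeleton. The following is a schematic sketch only, under strong simplifying assumptions: a toy one-dimensional design, a placeholder simulate_step environment with a stand-in reward instead of a Bayesian posterior update and information-based utility, and PyTorch as the neural-network library. It is not the speakers' implementation.

    # Schematic actor-critic policy-gradient skeleton for a finite design horizon.
    import torch
    import torch.nn as nn

    state_dim, horizon = 4, 3  # hypothetical belief-state summary size and number of experiments

    actor = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, 2))   # mean, log-std of design
    critic = nn.Sequential(nn.Linear(state_dim, 32), nn.Tanh(), nn.Linear(32, 1))  # value baseline
    opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

    def simulate_step(state, design):
        """Placeholder environment: draws an observation and returns (reward, next_state).
        A real implementation would update the Bayesian posterior and evaluate an
        information-based utility such as expected information gain."""
        obs = torch.randn(())                      # stand-in observation
        reward = -(design - obs) ** 2              # stand-in utility
        next_state = torch.cat([state[1:], obs.view(1)])
        return reward, next_state

    for it in range(500):
        state = torch.zeros(state_dim)
        log_probs, values, rewards = [], [], []
        for t in range(horizon):                   # one simulated design/observation sequence
            mean, log_std = actor(state)
            dist = torch.distributions.Normal(mean, log_std.exp())
            design = dist.sample()
            reward, state_next = simulate_step(state, design)
            log_probs.append(dist.log_prob(design))
            values.append(critic(state).squeeze())
            rewards.append(reward)
            state = state_next
        returns = torch.stack([torch.stack(rewards[t:]).sum() for t in range(horizon)])
        advantage = returns - torch.stack(values)
        loss = -(torch.stack(log_probs) * advantage.detach()).sum() + (advantage ** 2).sum()
        opt.zero_grad(); loss.backward(); opt.step()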



Title  Debiased Learning for Optimization under Tail-based Data Imbalance

Abstract  Several problems in data-driven decision-making and risk management suffer from data imbalance, a term referring to settings where a small fraction of the data has an outsized impact on estimating one or more decision-making criteria. Due to the paucity of relevant samples, such problems are usually approached with the "estimate, then optimize" workflow, in which a model is estimated from data in the first step and the trained model is then plugged in to solve various downstream optimization tasks. As biases due to model selection, misspecification, and overfitting to in-sample data are especially difficult to avoid in the first-step estimated model in settings affected by data imbalance, we construct novel locally robust optimization formulations in which the first-step estimation has, locally, no effect on the optimal solutions obtained.

 

We show that this local insensitivity translates to improved out-of-sample performance, freed from the first-order impact of model errors introduced in the first-step estimation. A key ingredient in achieving this local robustness is a novel debiasing procedure that adds a non-parametric bias-correction term to the objective. The debiased formulation retains convexity, and the imputation of the correction term relies only on a non-restrictive large-deviations behavior conducive to transferring knowledge from representative data-rich regions to the data-scarce tail regions suffering from imbalance. The bias correction is determined by the extent of model error in the estimation step and the specifics of the stochastic program in the optimization step, thereby serving as a scalable "smart correction" step bridging the disparate goals in estimation and optimization.
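
For concreteness, the "estimate, then optimize" workflow that the talk aims to improve upon can be sketched on a hypothetical newsvendor-style problem with heavy-tailed demand. The lognormal model, cost parameters, and helper names below are all illustrative assumptions; the talk's non-parametric bias-correction term is only indicated in a comment and not reproduced here.

    # Illustrative "estimate, then optimize" baseline on a toy problem with a heavy right tail.
    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(0)
    demand = rng.lognormal(mean=1.0, sigma=1.2, size=200)   # imbalanced: few samples in the costly tail

    # Step 1 (estimate): fit a parametric model to the observed data.
    mu_hat, sigma_hat = np.log(demand).mean(), np.log(demand).std()

    # Step 2 (optimize): plug the fitted model into the downstream stochastic program,
    # here min_q E[ c_o * max(q - D, 0) + c_u * max(D - q, 0) ] under the fitted model.
    c_o, c_u = 1.0, 9.0
    sim = rng.lognormal(mean=mu_hat, sigma=sigma_hat, size=100_000)

    def plug_in_cost(q):
        return np.mean(c_o * np.maximum(q - sim, 0) + c_u * np.maximum(sim - q, 0))

    q_star = minimize_scalar(plug_in_cost, bounds=(0.0, sim.max()), method="bounded").x
    # The debiased formulation discussed in the talk augments an objective like plug_in_cost
    # with a non-parametric correction term so that first-step model errors have no
    # first-order effect on q_star; that correction is not reproduced in this sketch.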

*This talk will be held in the face-to-face format from 4.00 pm to 5.00 pm. The venue is Block S16, Room 03-06 at NUS.

Title Mathematical Foundations of Graph-Based Bayesian Semi-Supervised Learning

Abstract Semi-supervised learning refers to the problem of recovering an input-output map using many unlabeled examples and a few labeled ones. In this talk I will survey several mathematical questions arising from the Bayesian formulation of graph-based semi-supervised learning. These questions include the modeling of prior distributions for functions on graphs, the derivation of continuum limits for the posterior, the design of scalable posterior sampling algorithms, and the contraction of the posterior in the large data limit.
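
One standard construction behind graph-based Bayesian semi-supervised learning, a Gaussian prior whose precision involves a graph Laplacian combined with a Gaussian likelihood on the labeled nodes, can be sketched as follows. This is a minimal illustrative example with synthetic data and hypothetical parameter choices; it is not necessarily the exact model or posterior-sampling scheme discussed in the talk.

    # Minimal sketch: graph Laplacian prior + Gaussian likelihood on a few labeled nodes.
    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(60, 2))                  # hypothetical feature vectors (graph nodes)
    labeled = np.arange(5)                        # indices of the few labeled points
    y = np.sign(X[labeled, 0])                    # hypothetical labels in {-1, +1}

    # Similarity graph and its Laplacian L = D - W.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2 * 0.5 ** 2)); np.fill_diagonal(W, 0.0)
    L = np.diag(W.sum(1)) - W

    tau, gamma = 1e-2, 0.1                        # prior regularization and noise level (assumed)
    prior_precision = L + tau * np.eye(len(X))    # prior on node functions: u ~ N(0, prior_precision^{-1})

    # Gaussian likelihood y_j = u_j + noise on labeled nodes; the posterior is again Gaussian.
    H = np.zeros((len(labeled), len(X))); H[np.arange(len(labeled)), labeled] = 1.0
    post_precision = prior_precision + H.T @ H / gamma ** 2
    post_mean = np.linalg.solve(post_precision, H.T @ y / gamma ** 2)
    labels_pred = np.sign(post_mean)              # classify unlabeled nodes via the posterior mean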

**Zoom meeting details:
Link: https://nus-sg.zoom.us/j/81261210349?pwd=M2VlVjUvR1o4dGVodDY4MnBtRzMzUT09

Meeting ID: 812 6121 0349
Passcode: 557052