Available Masters/Part III Projects

Constrained Bayesian Optimisation with Unknown Constraints

Optimisation problems are commonplace in many engineering disciplines, from optimising the fuel efficiency of a jet engine to minimising the cost of shipping goods around the globe. These problems almost always come with constraints, such as ensuring that the turbulence of the jet stays within an acceptable level, or keeping the travel time of goods below a certain number of days. In most real-world settings, both the objective (e.g., fuel efficiency) and the constraints (e.g., turbulence levels) are unknown and need to be learned from data. Bayesian optimisation is the de facto standard method for optimising unknown objectives. In this project, we want to extend Bayesian optimisation to the problem of unknown constraints. We propose to do this using Lagrange multipliers, a well-known technique from classical optimisation.
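
As a starting point, here is a minimal sketch of the Lagrangian idea on a 1-D toy problem. The toy objective and constraint, the dual-ascent update for the multiplier, and the use of plain posterior means as the acquisition are illustrative assumptions, not the prescribed method:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    f = lambda x: np.sin(3 * x) + x ** 2        # unknown objective (toy)
    c = lambda x: np.cos(3 * x)                 # unknown constraint, feasible iff c(x) <= 0

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(5, 1))         # initial design
    grid = np.linspace(-1, 1, 200).reshape(-1, 1)
    lam, lr = 1.0, 0.5                          # Lagrange multiplier and its step size

    for _ in range(20):
        gp_f = GaussianProcessRegressor().fit(X, f(X).ravel())   # surrogate for objective
        gp_c = GaussianProcessRegressor().fit(X, c(X).ravel())   # surrogate for constraint
        mu_f = gp_f.predict(grid)
        mu_c = gp_c.predict(grid)
        # Minimise the posterior Lagrangian; a full method would add an
        # exploration bonus (e.g. expected improvement) to mu_f.
        x_next = grid[[np.argmin(mu_f + lam * mu_c)]]
        lam = max(0.0, lam + lr * c(x_next).item())   # dual ascent on the multiplier
        X = np.vstack([X, x_next])

    feasible = c(X).ravel() <= 0
    if feasible.any():
        best = X[feasible][np.argmin(f(X[feasible]).ravel())]
        print("best feasible x:", best)

A proper treatment would study how the multiplier schedule interacts with the exploration term in the acquisition function.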

Cross-modal alignment model for audio to sentence

Establishing vector representations of input features has been crucial to the success of machine learning, especially in natural language processing. These vector representations (embeddings) are exploited in downstream tasks such as machine translation, where the intuition is that words appearing in similar contexts should generate similar embeddings and hence should be aligned. The idea of similar words generating similar representations has been extended to cross-lingual alignment, where it has been exploited to perform machine translation between two languages without any annotated dataset linking them. This is due to the discovery that similar words from different languages share similar structures in a continuous word embedding space, which eliminates the need for a large parallel corpus to train machine translation systems. The work in [1] extends this concept to cross-modal alignment between text and audio, where audio segments are aligned to words with similar embeddings in a joint continuous embedding space. This makes it possible to generate audio transcriptions without any transcribed dataset to train the transcription model, which is crucial for low-resource languages.

Despite the remarkable results reported in [1], and while word-based generation from audio may appear natural for speech, it is still not clear how to chunk speech audio into lengths that correctly generate words. Word-based generation of text from audio may also be too slow for downstream applications. This project proposes audio-to-sentence generation, as opposed to the existing audio-to-word generation. Can we develop a model where a whole sentence is generated at once from a given chunk of speech? Basically, can cross-modal alignment be developed so that an audio chunk and a sentence share a continuous embedding space? Will this speed up transcription of recorded speech? Will the long-range dependencies within sentence-level audio chunks affect the quality of transcription? How can we effectively identify sentence boundaries in a given audio signal? In this project, we will utilize the dataset and model proposed in [1] and extend them to handle sentence-level text generation.

Reference

[1] Chung, Yu-An, et al. “Unsupervised cross-modal alignment of speech and text embedding spaces.” Advances in Neural Information Processing Systems 31 (2018).
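
To make the alignment idea concrete, here is a minimal sketch of the Procrustes step that underlies embedding-space alignment methods like [1]; the adversarial initialisation used in the fully unsupervised setting is omitted, and the random "embeddings" are stand-ins for real audio-chunk and sentence encoders:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 64
    sent_emb = rng.normal(size=(500, d))                  # sentence embeddings (toy)
    W_hidden = np.linalg.qr(rng.normal(size=(d, d)))[0]   # hidden ground-truth rotation
    audio_emb = sent_emb @ W_hidden.T + 0.01 * rng.normal(size=(500, d))

    # Orthogonal Procrustes: W = argmin ||audio_emb @ W - sent_emb||_F over
    # orthogonal W, solved in closed form via the SVD of audio_emb^T sent_emb.
    U, _, Vt = np.linalg.svd(audio_emb.T @ sent_emb)
    W = U @ Vt

    aligned = audio_emb @ W
    print("mean alignment error:", np.abs(aligned - sent_emb).mean())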

Dynamical Gaussian processes for Sequential Data

Gaussian process dynamical models (state-space models) build upon a long line of work combining Gaussian processes (GPs) with latent variable models for unsupervised learning tasks. Specifically, we narrow our focus to modelling high-dimensional sequential data, which is ubiquitous in nature. The dynamics in the observed space are captured by a smoothly evolving latent variable indexed by time and governed by a latent Gaussian process prior. The idea behind the project is to develop a scalable algorithm for sequential data which does not rely on holding the complete sequence in memory but can process the time series in chunks. We also want to amortise the model with a suitable encoder such as a recurrent neural network, an LSTM, or an attention-based transformer.

The model is basically an extension of the model proposed here: https://gregorygundersen.com/blog/2020/07/24/gpdm/
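
As a rough sketch of the chunking-plus-amortisation idea, the snippet below trains an LSTM encoder on randomly sampled chunks of a long sequence so the full series never has to sit in memory; the GP dynamics prior is replaced by a simple random-walk smoothness penalty, which is an assumption of this sketch rather than the intended model:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    obs_dim, lat_dim, chunk_len, T = 10, 2, 25, 1000
    # A long toy sequence; only one chunk of it is ever held per step.
    data = torch.sin(torch.linspace(0, 50, T)).unsqueeze(-1) * torch.randn(1, obs_dim)

    encoder = nn.LSTM(obs_dim, 2 * lat_dim, batch_first=True)   # amortised q(z_t | x)
    decoder = nn.Linear(lat_dim, obs_dim)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-2)

    for step in range(200):
        i = torch.randint(0, T - chunk_len, (1,)).item()        # sample a random chunk
        chunk = data[i:i + chunk_len].unsqueeze(0)              # (1, chunk_len, obs_dim)
        h, _ = encoder(chunk)
        mu, logvar = h.chunk(2, dim=-1)                         # per-step latent Gaussians
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()    # reparameterised sample
        recon = decoder(z)
        nll = ((recon - chunk) ** 2).mean()
        kl = (-0.5 * (1 + logvar - mu ** 2 - logvar.exp())).mean()
        smooth = ((z[:, 1:] - z[:, :-1]) ** 2).mean()           # stand-in for the GP prior
        loss = nll + 1e-3 * kl + 1e-2 * smooth
        opt.zero_grad(); loss.backward(); opt.step()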

Required reading:

[1] Gaussian process dynamical model: https://gregorygundersen.com/blog/2020/07/24/gpdm/

[2] Variational GPDM: https://proceedings.neurips.cc/paper/2011/file/af4732711661056eadbf798ba191272a-Paper.pdf

[3] Variational GPSSM: https://proceedings.neurips.cc/paper/2014/file/139f0874f2ded2e41b0393c4ac5644f7-Paper.pdf

[4] Recurrent Gaussian processes with SVI: https://arxiv.org/abs/1511.06644

[5] GPVAE for interpretable latent dynamics: http://proceedings.mlr.press/v118/pearce20a/pearce20a.pdf

How fair is fairness? A multiverse analysis of fairness definitions

As machine learning is increasingly deployed into high-stakes settings such as healthcare and justice, it’s essential our models behave in a fair and unbiased manner. Unfortunately, this isn’t always the case, as evidenced by a growing number of papers showing significant bias and disparate performance (e.g. [1-3]). These failures demonstrate the importance of auditing both datasets and models to identify biases, harmful associations, and performance gaps across groups.

However, in order to audit, we have a series of choices to make about how we measure fairness [4]. Do we define fairness as parity between similar individuals, or between groups based on demographic attributes? Do we want parity of accuracy, loss, false negatives, or false positives? Which dataset do we evaluate on? If we are using demographic data, how are we defining socially constructed terms like race, and are we relying on self-identified or perceived definitions of gender? We could go on, and each of these decisions may change the outcome of the fairness audit.
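
To make two of these choices concrete, the sketch below contrasts a demographic-parity view (rates of positive predictions per group) with an error-rate view (false-negative rates per group); the arrays are toy stand-ins for a real audit dataset:

    import numpy as np

    rng = np.random.default_rng(0)
    group = rng.integers(0, 2, 1000)                    # binary demographic attribute (toy)
    y = rng.integers(0, 2, 1000)                        # true labels
    flip = rng.random(1000) < np.where(group == 1, 0.20, 0.05)
    y_hat = np.where(flip, 1 - y, y)                    # predictions, noisier for group 1

    for g in (0, 1):
        m = group == g
        pos_rate = y_hat[m].mean()                      # demographic-parity view
        fnr = ((y_hat[m] == 0) & (y[m] == 1)).mean() / (y[m] == 1).mean()
        print(f"group {g}: positive rate {pos_rate:.3f}, FNR {fnr:.3f}")

Depending on which of the two numbers we report, the same model can look acceptable or clearly unfair.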

First introduced in psychology, the multiverse analysis [5] is a principled framework for evaluating the robustness and reproducibility of claims in light of the multitude of possible choices a researcher can make along the way [6]. The multiverse entails systematically re-evaluating our analyses at each leaf of the decision tree, then transparently reporting and visualizing the results. To make this tractable in a machine learning search space, Bell et al. [6] rely on a Gaussian process surrogate and Bayesian experimental design for efficient exploration of the multiverse of choices.

This project will apply a multiverse analysis to machine learning fairness, and systematically evaluate the conclusions of previous fairness audits in light of different definitions of fairness, different evaluation protocols, and different evaluation datasets. There is also scope to extend the project to a dataset audit or to a fairness intervention (e.g. gDRO [7]).

If you’re interested in the project, get in touch with Samuel Bell (sjb326@cam.ac.uk) for an informal chat.

[1] Buolamwini, J., & Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. FAccT.

[2] De Vries, T., Misra, I., Wang, C., & Van der Maaten, L. (2019). Does object recognition work for everyone? CVPR.

[3] Bolukbasi, T., Chang, K. W., Zou, J. Y., Saligrama, V., & Kalai, A. T. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. NeurIPS.

[4] Jacobs, A. Z., & Wallach, H. (2021). Measurement and Fairness. FAccT.

[5] Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing transparency through a multiverse analysis. Perspectives on Psychological Science.

[6] Bell, S. J., Kampman, O. P., Dodge, J., & Lawrence, N. D. (2022). Modeling the Machine Learning Multiverse. NeurIPS.

[7] Sagawa, S., Koh, P. W., Hashimoto, T. B., & Liang, P. (2019). Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. ICLR.

Investigating the Effect of Sequence Length on the Quality of Generated Clean Speech

To perform speech separation effectively, separation models need to capture the long-range dependencies that exist within the audio input. Early speech separation models relied on deep neural networks (DNNs) to estimate clean speech from a noisy signal. However, feedforward DNNs are poorly suited to speech data, since they cannot effectively model long dependencies across time. Researchers therefore introduced recurrent neural networks (RNNs), which have a feedback structure such that the representation at a given time step t is a function of the data at time t and the hidden state and memory at time t − 1. One such RNN that has been exploited in speech separation is the long short-term memory (LSTM) network. LSTM structures make it possible to learn sequential prediction networks that exploit long-term contextual information. Due to their inherently sequential nature, however, RNNs cannot parallelise computation, which limits their use on large datasets with long sequences. Conventional convolutional neural networks (CNNs) have also been used to design speech separation models, but their limited receptive field restricts their ability to model long-range dependencies. To address the shortcomings of CNNs and RNNs, dilated temporal convolutional networks (TCNs) have been exploited to encode long-range dependencies using hierarchical convolutional layers. Speech separation models have also explored the use of transformers: attention-based models that have recently been very successful at modelling sequences, and that can uncover dependencies within an input regardless of the distance between any two positions.

One common thread in this progression is that researchers have focused on designing larger models to handle the audio input, without examining the impact of the input size itself. In audio separation, a typical model’s input is a sequence of short frames (typically around 25 ms each). Is it possible to reduce the length of the input, and hence the burden of modelling long-range dependencies? Would this reduce the training cost of the models without affecting the quality of their output? In short, by focusing on the data side of training, can we reduce training cost without degrading the quality of the separations generated? The project will utilize existing systems such as [1] and evaluate their performance across variable frame sizes.

Reference

[1] Cem Subakan, Mirco Ravanelli, Samuele Cornell, Mirko Bronzi, and Jianyuan Zhong. Attention is all you need in speech separation. ICASSP 2021, IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 21–25, 2021. doi:10.1109/ICASSP39728.2021.9413901.
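
A minimal evaluation loop for the project might look like the following: chunk a mixture into variable-length segments, run a separator on each, and score with scale-invariant SNR. The separator here is an identity stub so the script runs standalone; in practice it would be swapped for a pretrained model such as the SepFormer of [1]:

    import numpy as np

    def si_snr(est, ref, eps=1e-8):
        """Scale-invariant signal-to-noise ratio in dB."""
        est, ref = est - est.mean(), ref - ref.mean()
        proj = (est @ ref) / (ref @ ref + eps) * ref      # projection of est onto ref
        noise = est - proj
        return 10 * np.log10((proj @ proj) / (noise @ noise + eps))

    def separator(mix):
        return mix        # identity stub; swap in a pretrained separator here

    sr = 8000
    t = np.arange(sr * 4) / sr
    clean = np.sin(2 * np.pi * 220 * t)                   # 4 s toy "clean speech"
    mix = clean + 0.3 * np.random.default_rng(0).normal(size=clean.shape)

    for chunk_sec in (0.5, 1.0, 2.0, 4.0):                # vary the input length
        n = int(sr * chunk_sec)
        scores = [si_snr(separator(mix[i:i + n]), clean[i:i + n])
                  for i in range(0, len(mix) - n + 1, n)]
        print(f"chunk {chunk_sec:.1f}s: mean SI-SNR {np.mean(scores):.2f} dB")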

Latent Probabilistic Models for Unsupervised Structure Learning in Massively Missing Data

Real-world datasets often contain entries with missing elements, e.g. in a medical dataset a patient is unlikely to have taken all possible diagnostic tests. Variational autoencoders (VAEs) are popular generative models often used for unsupervised learning. Despite their widespread use, it is unclear how best to apply VAEs to datasets with missing data.

In this project we intend to explore a novel incarnation of the Gaussian process latent variable model which can work seamlessly with missing data. The architecture would include a PointNet encoder, which encodes every partial data point as a set with indicator variables (capturing the dimension identity), and a classical Gaussian process decoder.

The focus would be on assessing the quality of uncertainty calibration, structure learning in latent space, and reconstruction quality on previously unseen data.
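
A minimal sketch of the proposed encoder, assuming the set-based treatment described above: each observed (value, dimension-identity) pair is embedded separately and pooled with a permutation-invariant max, so the encoder is indifferent to which dimensions are missing. The latent head is a plain linear map, and the Gaussian process decoder is omitted:

    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    data_dim, embed_dim, lat_dim = 8, 32, 2

    point_embed = nn.Sequential(nn.Linear(1 + data_dim, embed_dim), nn.ReLU(),
                                nn.Linear(embed_dim, embed_dim))
    to_latent = nn.Linear(embed_dim, lat_dim)

    def encode(x, observed):
        """x: (data_dim,) values; observed: (data_dim,) boolean mask."""
        idx = observed.nonzero().squeeze(-1)
        one_hot = torch.eye(data_dim)[idx]                    # dimension identity
        pairs = torch.cat([x[idx, None], one_hot], dim=-1)    # one (value, id) row per entry
        pooled = point_embed(pairs).max(dim=0).values         # permutation-invariant pooling
        return to_latent(pooled)

    x = torch.randn(data_dim)
    observed = torch.rand(data_dim) > 0.5                     # roughly half the entries missing
    observed[0] = True                                        # guard: keep at least one entry
    print("latent code:", encode(x, observed))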

Learning to Learn by Denoising Diffusion Optimisers

The seminal paper Learning to Learn by Gradient Descent by Gradient Descent [1] explores learning optimisers (rather than using SGD or Adam), with an RNN architecture that models time steps as training steps and an RL-based loss.

In this project we aim for a similar product, but rather than RNNs and RL we would like to explore the closely linked diffusion models [5,6] and stochastic control. In particular, there is already theoretical work motivating the use of these methodologies for global optimisation [2]; what remains is to explore it in practice.

We expect the student to lightly adapt the methods in [3,4] to the optimisation setting of [2] (this simply amounts to exploring low temperatures in the artificial target distribution induced by the loss function). Furthermore, exploring engineering tasks as in [1] might require being creative about the inductive biases baked into the NN parametrisation we work with.

The nature of this project is mostly ML engineering and playing around with, and being creative about, NN inductive biases. That said, if the student is more mathematically or theoretically oriented, we could explore extending the results in [2] (which apply only to the method in [4]) to the method in [3] using the mixing rates in [7], and perhaps comparing to approaches like [8] (this last part would be a bonus).

As always, the student should explore simple 1-D and 2-D toy optimisation examples to assess the validity of the method before moving on to real-world examples.
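
Here is a 1-D toy of the low-temperature idea from [2]: unadjusted Langevin dynamics targeting p(x) ∝ exp(−f(x)/T), standing in for the learned diffusions of [3,4], concentrates on minimisers as the temperature is annealed down. The double-well objective and schedule are illustrative:

    import numpy as np

    f = lambda x: (x ** 2 - 1) ** 2 + 0.3 * x      # double well; global minimum near x = -1
    grad_f = lambda x: 4 * x * (x ** 2 - 1) + 0.3

    rng = np.random.default_rng(0)
    x = rng.normal(size=2000)                      # 2000 parallel chains
    step = 1e-3
    for T in (1.0, 0.3, 0.1, 0.03):                # anneal the temperature downwards
        for _ in range(5000):
            # Unadjusted Langevin targeting p(x) ∝ exp(-f(x)/T):
            # dx = -∇f(x) dt + sqrt(2T) dW
            x = x - step * grad_f(x) + np.sqrt(2 * T * step) * rng.normal(size=x.size)
        print(f"T={T}: mean f(x) = {f(x).mean():.3f}, "
              f"fraction in global basin = {(x < 0).mean():.2f}")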

[1] https://arxiv.org/pdf/1606.04474.pdf 

[2] https://arxiv.org/abs/2111.00402

[3] https://openreview.net/forum?id=8pvnfTAbu1f

[4] https://arxiv.org/abs/2111.15141

[5] https://arxiv.org/abs/2006.11239

[6] https://arxiv.org/abs/2006.11239

[7] https://arxiv.org/pdf/2209.11215.pdf

[8] https://arxiv.org/abs/1707.06618

Machine Learning (Bayesian Methodology, Inference and Applications)

Students interested in working with me should come up with the grain of an idea before reaching out. If there is a match, I would be happy to discuss and flesh out the details to create a project. I have always believed that part of doing a project is coming up with ideas and angles ripe for exploration.

I am broadly interested in probabilistic machine learning and applications in climate science.

Machine Learning for Modelling Formula One Races

The machine learning group is working with one of the leading Formula One teams on the analysis of data generated in Formula One races, with the aim of improving strategy. To this end we are running one or more projects this year focussed on Formula One data. Formula One is a data-intensive sport: information about the location of each team’s car during the race is provided to the teams, and optimisation of pit-stop strategy can make the difference between winning and losing the race. There are commercial confidentiality issues over which areas will be studied, but interested students can discuss these areas directly with the supervisors.

Protein Folding Explanations via Diffusion Bridge Score Matching

Technical Title: Sampling Transition Paths Between Molecular Conformations Using Diffusion Bridges and Score Matching.

Recent advances in Schrödinger bridges [1,2] have made it possible to learn stochastic mappings between two probability distributions p(x) and q(x), such that the stochastic map (which is modelled by a diffusion/SDE) is regularised by some prior process (whether computational or physical).

This project seeks to explore these methodologies, in particular the simpler case studied in [3] where both the source and the target distributions are modelled as point masses (Dirac delta functions/measures). We seek to apply the approach in [3] to sampling physically meaningful transition paths between two protein configurations, as done in [4]. Ideally, we would start with simple/toy proteins and then move on to larger-scale tasks where one configuration is an unfolded amino-acid chain and the other a structure predicted by AlphaFold; the end product would be a video giving a physically plausible folding process for AlphaFold [5] predictions.

An example timeline for the project could be:

  1. Get the codebase of [4] working and reproduce results on simple proteins.
  2. Extend the work in [3] to the underdamped version (Francisco can help with this).
  3. Apply the extensions and adaptations of [3] to the simple proteins of [4] and compare with the method in [4].
  4. Consider enhancements and extensions: would a full Schrödinger bridge work better here?
  5. If time allows, pick some of the most exciting recent discoveries from AlphaFold [5] that have a known potential and see if we can get the method working on them.

Point 5 is an “if time allows” objective, and I predict most of the time will be spent on point 3; successful completion of point 3 could lead to a publication at a top venue, while point 5 could have a broader impact on the field.
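
For orientation, the simplest instance of the point-mass bridge in [3] is a Brownian bridge pinned at two configurations; the sketch below simulates one from its SDE dx = (x1 − x)/(1 − t) dt + σ dW. The two toy 2-D “conformations” and the noise scale are illustrative, and real transition paths would add the molecular force field to the drift:

    import numpy as np

    x0, x1 = np.array([-1.0, 0.0]), np.array([1.0, 0.5])   # two toy "conformations"
    n_steps, sigma = 1000, 0.5
    dt = 1.0 / n_steps

    rng = np.random.default_rng(0)
    x, path = x0.copy(), [x0.copy()]
    for k in range(n_steps - 1):
        t = k * dt
        drift = (x1 - x) / (1.0 - t)                       # pulls the path to x1 at t=1
        x = x + drift * dt + sigma * np.sqrt(dt) * rng.normal(size=x.shape)
        path.append(x.copy())
    path.append(x1.copy())
    path = np.array(path)
    print("start:", path[0], "end:", path[-1])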

A good background in the following would be very helpful for this project:

  1. Time-series models (Kalman filters, AR processes, Gaussian processes).
  2. Introductory calculus (ODEs, basic PDEs, limits).
  3. Probability theory (limit theorems, change of variables, basic concentration inequalities, e.g. Markov/Chebyshev).
  4. Variational inference (MFVI, amortised VI, deep hierarchical latent variable models).


[1] https://arxiv.org/pdf/2106.01357.pdf
[2] https://arxiv.org/pdf/2106.02081.pdf
[3] https://arxiv.org/pdf/2207.02149.pdf
[4] https://arxiv.org/pdf/2111.07243.pdf
[5] https://alphafold.ebi.ac.uk/

Unconventional AI and explainable AI

Project idea 1

Contemporary approaches to explainable AI are model-centric. We will use data-centric approaches to explain the complex interplay between data and models, building on published work [1]. This project is ideal for a student with an interest in machine learning and coding experience.

Project idea 2

For high-stakes decisions we need simple, explainable/interpretable models. This need is acute in healthcare and in social applications such as recidivism prediction [2]. In this project, we will build simple interpretable models that act as surrogates for deep learning models. The student will use publicly available data and synthetic data to generate surrogate models that are transparent and interpretable, and the process of creating them will be automated. This can be partially based on published work [1]. The surrogate models can be decision trees (like CART) trained on the input and output of a deep learning model [1], using R packages such as party, rpart, and partykit, or equivalent tools. This will lead to tools that automate the creation of surrogate interpretable models for deep learning models in healthcare. A sketch of the core idea follows below.
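
A minimal sketch of the surrogate idea in Python, with sklearn standing in for the R packages named above and an MLP standing in for an arbitrary deep model: the tree is trained on the black box's predictions rather than the true labels, so it can be read as a transparent approximation of the black box:

    from sklearn.datasets import load_breast_cancer
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.tree import DecisionTreeClassifier, export_text

    data = load_breast_cancer()
    X, y = data.data, data.target
    black_box = make_pipeline(StandardScaler(),
                              MLPClassifier(max_iter=500, random_state=0)).fit(X, y)

    # Train the shallow surrogate on the black box's outputs, not the true labels.
    surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
    surrogate.fit(X, black_box.predict(X))

    fidelity = (surrogate.predict(X) == black_box.predict(X)).mean()
    print(f"surrogate fidelity to the black box: {fidelity:.3f}")
    print(export_text(surrogate, feature_names=list(data.feature_names)))

Fidelity (agreement between surrogate and black box) is the key quantity to report alongside the tree itself.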

Project idea 3

Another project idea is to apply explainable AI approaches to genomic data. This will be a machine learning and bioinformatics project. The student will develop explainable AI approaches for interpreting clusters in single-cell gene expression data. This work is part of the Accelerate Programme for Scientific Discovery which aims to democratize access to AI tools and apply AI to problems from diverse disciplines. The student will be part of a growing community of inter-disciplinary AI researchers at the University of Cambridge.

Project idea 4

Tailor machine learning model explanations to the audience (e.g. patients, clinicians, farmers, etc.). Generate natural language explanations from a machine learning model and tailor these explanations to the unique background of the listener.

Project idea 5

Build a computational model of analogy making and apply it to biomedical and genomic data. Broadly, this will use concepts like analogies and stories to create new explainable AI methods.

Project idea 6

Build a machine learning algorithm or domain-specific language to solve the Abstraction and Reasoning Corpus (ARC) challenge. Domain-specific languages may be required (as suggested by Chollet), together with approaches like genetic algorithms and cellular automata.

Project idea 7

Build a Bayesian model and/or probabilistic programming model of infection dynamics (like an SIR model) or of an intra-cellular regulatory network [5]. This would apply a probabilistic programming model to infection data from different sources, giving an explainable AI model of a complex physical system. The project would involve building a model that generates insights from these complex systems (an artificial model of human creativity). A minimal deterministic starting point is sketched below.
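
As a deterministic core to build on, the sketch below integrates a standard SIR model; the Bayesian/probabilistic-programming layer of the project would place priors on beta and gamma and condition on observed case counts (e.g. with PyMC or NumPyro). Parameter values are illustrative:

    import numpy as np
    from scipy.integrate import odeint

    def sir(state, t, beta, gamma):
        s, i, r = state
        return [-beta * s * i, beta * s * i - gamma * i, gamma * i]

    t = np.linspace(0, 120, 121)                      # days
    beta, gamma = 0.3, 0.1                            # transmission and recovery rates
    s0, i0, r0 = 0.99, 0.01, 0.0                      # initial fractions of the population

    s, i, r = odeint(sir, [s0, i0, r0], t, args=(beta, gamma)).T
    print(f"peak infected fraction: {i.max():.3f} on day {t[i.argmax()]:.0f}")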

Project idea 8

Build a Bayesian model and/or probabilistic programming model of a complex system such as infection dynamics (an SIR model) or an intra-cellular regulatory network [5]. This would involve building a qualitative process model for a physical system, giving an explainable AI model of a complex system. The project would involve building a model that generates insights from these complex systems (an artificial model of human creativity).

Project idea 9

Extend the Ramanujan Machine by applying it to other data or other dynamical systems, or by using another machine learning approach. This would be an artificial model of human creativity.

Project idea 10

Dynamics of learning in artificial neural networks, Hopfield networks, self-organizing maps, and neural gas.

Project idea 11

The role of noise in collective artificial intelligence for building behaviour (altruism, cooperation, competition) and structures (e.g. structures to capture prey). This will use the multi-agent platform MAgent.

Project idea 12

Other project ideas include generating synthetic data from private datasets such as electronic healthcare records [3], other explainable AI (XAI) techniques, privacy-preserving machine learning [4], documenting data and models, detecting concept drift, etc.

Project ideas can be developed according to student interests.

Students will be jointly supervised with Prof. Neil Lawrence.

Contact

Please contact Soumya Banerjee at sb2333@cam.ac.uk to have an informal chat. You can learn more about my work here: https://sites.google.com/site/neelsoumya

References

  1. Banerjee S, Lio P, Jones PB, Cardinal RN (2021) A class-contrastive human-interpretable machine learning approach to predict mortality in severe mental illness. npj Schizophr 7: 1–13.
  2. Rudin C (2019) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat Mach Intell 1: 206–215.
  3. Banerjee S, Bishop TRP (2022) dsSynthetic: Synthetic data generation for the DataSHIELD federated analysis system. BMC Res Notes 15: 230.
  4. Banerjee S, Sofack GN, Papakonstantinou T, Avraam D, Burton P, et al. (2022) dsSurvival: Privacy preserving survival models for federated individual patient meta-analysis in DataSHIELD. BMC Res Notes 15: 197.

Why 10 random seeds? Exploring a heteroskedastic multiverse

A multiverse analysis, first introduced in psychology [1], is a principled toolkit for evaluating the robustness and generality of scientific claims. As researchers, we make numerous choices through the course of an investigation that could influence the final result. Applying the multiverse analysis to machine learning, Bell et al. [2] systematically evaluate how researcher choices (e.g., hyperparameters, evaluation dataset) can fundamentally change the conclusions drawn. For example, they evaluate the role of hyperparameters in optimizer choice, and whether large-batch training necessarily leads to a drop in generalisation performance.

In a deep learning setting, the multiverse is large and contains continuous dimensions, making it challenging to rigorously explore. To overcome this, Bell et al. implement a simple Gaussian Process surrogate and turn to Bayesian experimental design to sample experimental configurations to evaluate. This has the added benefit of allowing us to quantify our confidence in different regions of the multiverse.

A simplifying assumption in Bell et al.’s approach is that the multiverse is homoskedastic: that every point has the same variance. This is unlikely to hold when working with neural networks. This project will extend the underlying surrogate to account for heteroskedasticity. One possible approach is to explicitly model the noise as a function of the inputs using a second Gaussian process [3], as sketched below.
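
A crude two-stage stand-in for the model in [3], to illustrate the mechanics: fit a GP, fit a second GP to the log squared residuals as a noise model, and refit with the resulting per-point noise (sklearn's alpha accepts per-sample variances). A full treatment would infer both processes jointly:

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor

    rng = np.random.default_rng(0)
    X = np.sort(rng.uniform(0, 10, 100)).reshape(-1, 1)
    true_sd = 0.05 + 0.3 * X.ravel() / 10                 # noise grows with x
    y = np.sin(X).ravel() + true_sd * rng.normal(size=100)

    gp = GaussianProcessRegressor(alpha=0.1).fit(X, y)    # stage 1: homoskedastic fit
    log_resid2 = np.log((y - gp.predict(X)) ** 2 + 1e-8)
    gp_noise = GaussianProcessRegressor(alpha=0.5).fit(X, log_resid2)
    noise_var = np.exp(gp_noise.predict(X))               # stage 2: crude per-point noise

    gp_het = GaussianProcessRegressor(alpha=noise_var).fit(X, y)   # stage 3: refit
    print(f"estimated noise sd near x=0.5: {np.sqrt(noise_var[5]):.3f}, "
          f"near x=9.5: {np.sqrt(noise_var[-5]):.3f}")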

This approach has an important consequence. In deep learning, it is typical to evaluate models using the mean performance over a number of runs (usually 10) with different random seeds. But why 10? Whether this is an appropriately large sample size for robust inference, i.e. to confidently claim model A outperforms model B, remains to be seen (for an example in NLP, see [4]). By accounting for heteroskedasticity in a multiverse analysis, we can make an informed decision about how many runs are appropriate, and make a principled trade-off between re-running the same configuration and sampling a new point in the multiverse [5].

If you’re interested in the project, get in touch with Samuel Bell (sjb326@cam.ac.uk) for an informal chat.

[1] Steegen, S., Tuerlinckx, F., Gelman, A., & Vanpaemel, W. (2016). Increasing transparency through a multiverse analysis. Perspectives on Psychological Science.

[2] Bell, S. J., Kampman, O. P., Dodge, J., & Lawrence, N. D. (2022). Modeling the Machine Learning Multiverse. NeurIPS.

[3] Goldberg, P., Williams, C., & Bishop, C. (1997). Regression with input-dependent noise: A Gaussian process treatment. NeurIPS.

[4] Card, D., Henderson, P., Khandelwal, U., Jia, R., Mahowald, K., & Jurafsky, D. (2020, November). With Little Power Comes Great Responsibility. EMNLP.

[5] Binois, M., Huang, J., Gramacy, R. B., & Ludkovski, M. (2019). Replication or exploration? Sequential design for stochastic simulation experiments. Technometrics.

Available Undergrad Projects

An Academic Scholar System

Academic publication systems such as Microsoft Academic and Google Scholar track individual authors, their papers, and who has cited them. Semantic Scholar makes its data freely available. In this project you will ingest the Semantic Scholar data to provide a scalable cloud-based scholar service.
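
As a first step towards ingestion, the sketch below pulls one page of results from the public Semantic Scholar Graph API (endpoint and field names as documented at api.semanticscholar.org at the time of writing; bulk dataset downloads are also available, and the call requires network access):

    import requests

    resp = requests.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={"query": "Gaussian process",
                "fields": "title,year,citationCount",
                "limit": 10},
    )
    resp.raise_for_status()
    for paper in resp.json()["data"]:
        print(paper["year"], paper["citationCount"], paper["title"])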

Decision Support for Cities Management

Smart cities are urban environments where computing devices generate considerable amounts of heterogeneous data. City authorities need sophisticated platforms to manage and analyse such data before acting on it. In this project you will build a flexible, scalable, real-time decision-support tool for city managers. The tool will manage data from heterogeneous sources and provide meaningful insights to support city managers’ decisions.

Machine Learning (Bayesian Methodology, Inference and Applications)

Students interested in working with me should come up with the grain of an idea before reaching out. If there is a match, I would be happy to discuss and flesh out the details to create a project. I have always believed that part of doing a project is coming up with ideas and angles ripe for exploration.

I am broadly interested in probabilistic machine learning and applications in climate science.

Machine Learning for Modelling Formula One Races

The machine learning group is working with one of the leading Formula One teams on the analysis of data generated in Formula One races, with the aim of improving strategy. To this end we are running one or more projects this year focussed on Formula One data. Formula One is a data-intensive sport: information about the location of each team’s car during the race is provided to the teams, and optimisation of pit-stop strategy can make the difference between winning and losing the race. There are commercial confidentiality issues over which areas will be studied, but interested students can discuss these areas directly with the supervisors.