πŸ“Š AI and Materials Databases for Self-driving Labs

Unleash the power of data science in the realm of self-driving laboratories. This remote, asynchronous course empowers you to apply data science concepts to materials discovery tasks. You’ll write Bayesian optimization scripts, explore advanced optimization topics, and adapt templates into an advanced optimization setup for a materials discovery task. Topics include multi-objective, constrained, high-dimensional, multi-fidelity, batch, asynchronous, and domain-aware Bayesian optimization. You’ll also learn to share your findings by uploading datasets to a data repository, creating benchmark models, and hosting models on data science platforms.

Animation of 1D Bayesian Optimization: A Gaussian Process surrogate model can be used with an acquisition function to seamlessly transition from exploration to optimization in noisy settings. Source: https://ax.dev/docs/bayesopt.html

🎯 Learning Outcomes

  • Describe and categorize a materials discovery task using data science language and concepts

  • Customize a Bayesian optimization script to systematically identify the optimal chocolate chip cookie recipe, demonstrating practical application of optimization techniques

  • Evaluate and select an advanced optimization setup that is best suited for a specific materials discovery task, showcasing critical analysis and decision-making skills

  • Develop and execute a program to upload a dataset to a public database, construct a benchmark model, and deploy it online, illustrating proficiency in data sharing and model hosting

πŸ› οΈ Competencies/Skills

  • Data science literacy

  • Bayesian optimization

  • Advanced Bayesian optimization

  • Workflow orchestration

  • Benchmarking

🧩 Modules

Each module is intended to take approximately 3-4 hours, assuming that the recommended prerequisites have been met.

Each module lists its topics and learning outcomes below.

Intro to Bayesian optimization

Topics:

  • Design of experiments
  • Quasi-random search methods
  • Bayesian optimization
  • Expected improvement (EI)
  • Ax Platform
  • Honegumi template generator

Learning Outcomes:

  • Describe a materials discovery task using data science language and concepts
  • Adapt a Bayesian optimization script to find an optimal chocolate chip cookie recipe
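The expected improvement (EI) acquisition function covered in this module has a closed form when the surrogate's prediction at a candidate is Gaussian. A minimal stdlib-only sketch (the function name `expected_improvement` and the toy numbers are illustrative, not taken from the course materials):

```python
import math

def expected_improvement(mu, sigma, best, xi=0.01):
    """Closed-form EI for maximization, given a Gaussian prediction with
    mean `mu` and standard deviation `sigma`; `best` is the best
    objective value observed so far and `xi` a small exploration margin."""
    if sigma <= 0.0:
        return max(mu - best - xi, 0.0)
    z = (mu - best - xi) / sigma
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2)))          # Phi(z)
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)   # phi(z)
    return (mu - best - xi) * cdf + sigma * pdf

# Two candidates with the same predicted mean: the more uncertain one
# gets higher EI, which is what drives exploration early in a campaign.
low_unc = expected_improvement(mu=1.0, sigma=0.1, best=1.0)
high_unc = expected_improvement(mu=1.0, sigma=1.0, best=1.0)
```

In a full loop, EI would be evaluated over all candidates and the maximizer chosen as the next experiment; libraries such as the Ax Platform handle this internally.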

Mechanics

Topics:

  • Attaching existing data
  • Human-in-the-loop

Learning Outcomes:

  • Utilize the β€˜plumbing’ needed by optimization scripts in real-world lab settings
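In practice, this β€˜plumbing’ usually takes the shape of an ask/tell loop that can also ingest results gathered before the campaign started. A hypothetical stdlib-only sketch (the class name `AskTellOptimizer` and its random-search suggestions are stand-ins for an Ax-style interface, not the course's actual code):

```python
import random

class AskTellOptimizer:
    """Toy ask/tell interface: suggests random candidates and records
    results, including pre-existing ('attached') observations."""

    def __init__(self, bounds, seed=0):
        self.bounds = bounds          # {"name": (low, high)}
        self.rng = random.Random(seed)
        self.observations = []        # list of (params, objective) pairs

    def attach(self, params, objective):
        # Seed the optimizer with data collected before the campaign.
        self.observations.append((params, objective))

    def ask(self):
        # Random search stands in for a real surrogate-based suggestion.
        return {k: self.rng.uniform(lo, hi) for k, (lo, hi) in self.bounds.items()}

    def tell(self, params, objective):
        self.observations.append((params, objective))

    def best(self):
        return max(self.observations, key=lambda obs: obs[1])

opt = AskTellOptimizer({"temperature": (150.0, 250.0)})
opt.attach({"temperature": 180.0}, 0.42)   # existing lab data
params = opt.ask()                          # next suggested experiment
opt.tell(params, 0.10)                      # human-in-the-loop result entry
```

The same ask/attach/tell pattern works whether results come from a robot, an instrument API, or a human typing in measurements.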

Multi-objective optimization

Topics:

  • Pareto fronts
  • Hypervolume
  • qNEHVI
  • Objective thresholds

Learning Outcomes:

  • Explain the significance of a Pareto front
  • Compare simple scalarization with expected hypervolume improvement
  • Explore the effect of setting objective thresholds
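The Pareto front and hypervolume concepts above can be made concrete in a few lines. A minimal sketch, assuming both objectives are minimized and using a toy 2D point set (the function names and numbers are illustrative):

```python
def pareto_front(points):
    """Non-dominated points, assuming both objectives are minimized."""
    front = []
    for p in points:
        dominated = any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in points)
        if not dominated:
            front.append(p)
    return sorted(front)

def hypervolume_2d(front, ref):
    """Area dominated by a 2D Pareto front relative to a reference point
    `ref`, which must be worse than every front point in both objectives."""
    pts = sorted(front)  # ascending x implies descending y on a Pareto front
    xs = [p[0] for p in pts] + [ref[0]]
    return sum((xs[i + 1] - x) * (ref[1] - y) for i, (x, y) in enumerate(pts))

front = pareto_front([(1, 5), (2, 3), (3, 4), (4, 2), (5, 1)])
hv = hypervolume_2d(front, ref=(6, 6))
```

Acquisition functions such as qNEHVI pick the batch of candidates expected to grow this dominated area the most; objective thresholds act like the reference point, bounding the region that counts.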

Constrained optimization

Topics:

  • Linear constraints
  • Nonlinear constraints
  • Compositional constraints
  • Order constraints

Learning Outcomes:

  • Provide examples of materials discovery tasks with constraints
  • Adapt a Bayesian optimization script to include constraints
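A candidate generator that respects the constraint types above can be sketched with a feasibility check plus rejection sampling. The parameter names (A, B, C) and specific bounds are illustrative, not from the course:

```python
import random

def satisfies(params):
    """Example feasibility check combining three constraint types:
    compositional (fractions sum to 1), linear, and order constraints."""
    a, b, c = params["A"], params["B"], params["C"]
    composition_ok = abs(a + b + c - 1.0) < 1e-9   # compositional
    linear_ok = a + 2 * b <= 1.5                   # linear
    order_ok = a >= b                              # order: A >= B
    return composition_ok and linear_ok and order_ok

rng = random.Random(0)
candidates = []
while len(candidates) < 10:
    # Sample A and B, then let C absorb the remainder so the
    # compositional constraint holds by construction.
    a = rng.uniform(0.0, 1.0)
    b = rng.uniform(0.0, 1.0 - a)
    cand = {"A": a, "B": b, "C": 1.0 - a - b}
    if satisfies(cand):
        candidates.append(cand)
```

Real optimizers express the same constraints declaratively (e.g., as parameter and outcome constraints in Ax) so the search model, not just the sampler, respects them.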

High-dimensional optimization

Topics:

  • Curse of dimensionality
  • SAASBO

Learning Outcomes:

  • Explain the curse of dimensionality
  • Compare the efficiency of expected improvement and SAASBO as a function of dimensionality
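One quick way to feel the curse of dimensionality: the largest ball inscribed in the unit hypercube occupies a vanishing fraction of its volume as dimension grows, so uniform sampling increasingly lands in the corners. A small stdlib demo (our own illustration, not course code):

```python
import math

def unit_ball_fraction(d):
    """Fraction of the unit hypercube [0, 1]^d occupied by the largest
    inscribed ball (radius 0.5): V_ball(d) / V_cube(d)."""
    r = 0.5
    return math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)

# The fraction collapses toward zero as dimensionality increases.
fractions = {d: unit_ball_fraction(d) for d in (2, 5, 10, 20)}
```

This collapse is why vanilla expected improvement degrades in high dimensions and why SAASBO assumes that only a sparse subset of dimensions matters.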

Batch/asynchronous optimization

Topics:

  • Conditioning
  • Batch optimization
  • Asynchronous optimization
  • Parallelism

Learning Outcomes:

  • Explain the effect of parallelism on optimization efficiency in terms of wall time and resource utilization
  • Explain what conditioning is in the context of batch and asynchronous BO
  • Adapt a Bayesian optimization script to use batch optimization
  • Adapt a Bayesian optimization script to use asynchronous optimization
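The wall-time side of the parallelism tradeoff is simple arithmetic: running experiments in batches divides wall time by the batch size, at the cost that suggestions within a batch cannot condition on each other's results. A toy calculation (function name and numbers are illustrative):

```python
import math

def wall_time(n_trials, batch_size, hours_per_trial):
    """Wall-clock hours to complete `n_trials` when `batch_size`
    experiments run in parallel, assuming equal-duration trials."""
    return math.ceil(n_trials / batch_size) * hours_per_trial

sequential = wall_time(24, batch_size=1, hours_per_trial=2.0)  # one reactor
batched = wall_time(24, batch_size=8, hours_per_trial=2.0)     # eight reactors
```

Asynchronous optimization refines this further: instead of waiting for a whole batch to finish, a new suggestion is generated (conditioned on all completed and pending trials) the moment any worker frees up.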

Featurization

Topics:

  • Domain knowledge integration
  • Contextual variables
  • Predefined candidates

Learning Outcomes:

  • Explain the advantages and disadvantages of featurization
  • Adapt a Bayesian optimization script to use predefined candidates with featurization
  • Adapt a Bayesian optimization script to use contextual variables
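The core idea of featurizing predefined candidates is to swap an opaque categorical label for numeric descriptors the surrogate can generalize over. A hypothetical sketch (the blend names, descriptor table, and values are made up for illustration):

```python
# Hypothetical descriptor table: each predefined candidate (a polymer
# blend name) maps to physically meaningful features, so the surrogate
# sees numbers it can interpolate between rather than arbitrary labels.
DESCRIPTORS = {
    "blend_A": {"density": 1.05, "melt_temp": 160.0},
    "blend_B": {"density": 0.92, "melt_temp": 130.0},
    "blend_C": {"density": 1.20, "melt_temp": 210.0},
}

def featurize(candidate_name):
    """Replace a categorical candidate with its numeric feature vector."""
    feats = DESCRIPTORS[candidate_name]
    return [feats["density"], feats["melt_temp"]]

X = [featurize(name) for name in sorted(DESCRIPTORS)]
```

The tradeoff the module examines: good descriptors transfer knowledge between candidates, but poorly chosen ones can mislead the model more than a plain categorical encoding would.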

Multi-fidelity/multi-task

Topics:

  • Cost-fidelity tradeoffs
  • Knowledge gradient

Learning Outcomes:

  • Explain the effect of cost-fidelity tradeoffs on optimization
  • Explain when to use multi-fidelity vs. multi-task optimization
  • Assess the efficiency of expected improvement at fixed fidelities vs. knowledge gradient
  • Adapt a Bayesian optimization script to use a knowledge gradient
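The cost side of the cost-fidelity tradeoff is worth putting in numbers: a fixed budget buys many cheap low-fidelity evaluations or only a handful of expensive high-fidelity ones, and cost-aware acquisition functions like the knowledge gradient weigh information gained per unit cost. A toy calculation with hypothetical costs:

```python
def evaluations_within_budget(budget, cost_per_eval):
    """How many evaluations a fixed budget buys at each fidelity."""
    return {fidelity: budget // cost for fidelity, cost in cost_per_eval.items()}

# Hypothetical costs: a cheap simulation vs. an expensive synthesis run.
counts = evaluations_within_budget(
    budget=1000,
    cost_per_eval={"low_fidelity_sim": 5, "high_fidelity_experiment": 200},
)
```

A 40:1 cost ratio like this is what makes spending most of the budget at low fidelity attractive, provided the fidelities are correlated enough to be informative.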

Benchmark datasets/models

Topics:

  • Benchmarks
  • Surrogate models
  • Random forest regression
  • FAIR data
  • Model deployment
  • APIs

Learning Outcomes:

  • Explain the effect of noise on benchmarking tasks
  • Programmatically upload a completed dataset to Figshare
  • Create a benchmark model with scikit-learn
  • Host a model on HuggingFace
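The effect of noise on benchmarking can be demonstrated with a toy objective: repeated evaluations at the same point scatter around the true value, so benchmark comparisons need replicates and fixed seeds. A stdlib-only sketch (the objective function and noise level are illustrative):

```python
import random
import statistics

def noisy_benchmark(x, noise_sd, rng):
    """Toy benchmark objective with additive Gaussian noise; the
    noiseless ground truth is -(x - 0.3)**2, maximized at x = 0.3."""
    return -(x - 0.3) ** 2 + rng.gauss(0.0, noise_sd)

rng = random.Random(42)  # fixed seed makes the benchmark reproducible
# Repeated evaluations at the optimum scatter around the true value 0.0,
# so a single evaluation is an unreliable estimate of performance.
samples = [noisy_benchmark(0.3, noise_sd=0.1, rng=rng) for _ in range(200)]
mean = statistics.fmean(samples)
spread = statistics.pstdev(samples)
```

Benchmark surrogates (e.g., a scikit-learn random forest fit to a completed dataset) serve the same role: a cheap, reproducible stand-in for the expensive real experiment.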

βš–οΈ Course Assessments and Grading Schema

Each student is required to complete various quizzes and GitHub Classroom assignments. The course is structured into an orientation module followed by six subsequent modules. The course is graded on a pass/fail basis with 70% as the threshold for passing. Here is the breakdown of the points for each part of the course:

  • 🧭 Orientation Module: Worth 15 points.
  • πŸ“š Modules 1-6: Each includes:
    • 🧭 A guided notebook tutorial (ungraded)
    • πŸ““ A knowledge check (graded, 5 points)
    • πŸ› οΈ A GitHub Classroom assignment (graded, 10 points*)

*The final module's GitHub Classroom assignment is worth 30 points.

Note that partial points are available on certain assignments.
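The point totals implied by the schema above can be tallied directly, assuming each of Modules 1-6 has one knowledge check and one assignment, with the final assignment worth 30 points instead of 10:

```python
# Point totals implied by the grading schema.
orientation = 15
knowledge_checks = 6 * 5          # one 5-point check per module
assignments = 5 * 10 + 30         # five 10-point assignments + 30-point final
total = orientation + knowledge_checks + assignments
passing_score = 0.70 * total      # pass/fail threshold at 70%
```

So the course totals 125 points, and the 70% threshold corresponds to 87.5 points.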

πŸ‘€ Course developer(s)

  • Sterling Baird, PhD Materials Science and Engineering (Acceleration Consortium)