Open project database

We want to make finding projects easier for students and advertising projects easier for scientists with research opportunities. Today, this is very much a work in progress… eventually you will be able to search for perspective project ideas and easily add new project opportunities to this repository.

This page is a prototype project database. Use the menu bars to select projects based on their attributes. Projects with no value specified for a given attribute will not be included if a selection is made on that attribute. Projects may instead specify that they are appropriate for multiple options (or any option) for a given attribute. Click the triangle next to each project for more information (if provided by the project mentors).

Project selection menus

Selected projects

Mitigating the impact of simulation mis-modeling on DNN Training: Building robust DNNs in the presence of detector mis-modeling. Mentee: Andrii Len. Email the mentors (Dmytro Kovalskyi). Date posted 2025-04-10.

Simulation mis-modeling can significantly impact the performance of a DNN model trained using simulated signal events against data background. Under such conditions, the model may treat mis-modeled features as signal/background discriminators, introducing large systematic effects. There are multiple ways to address this issue, such as training solely on data samples or modifying the loss function to include penalty terms for mis-modeled features. In this project, we will compare such methods to assess their relative performance and identify common trends. The project requires a solid understanding of machine learning algorithms and the tools used to build and train deep neural networks.

BEAD - Background Enrichment for Anomaly Detection: Improving anomaly detection using enriched background representations via latent space ML models. Email the mentors (Pratik Jawahar,Sukanya Sinha). Date posted 2025-04-03.

Several Large Hadron Collider (LHC) experiments are conducting searches aimed at detecting dark matter. Unsupervised and semi-supervised learning outlier detection techniques are advantageous to these searches, for casting a wide net on a variety of possibilities for how dark matter manifests, as they impose minimal constraints from specific physics model details, but rather learn to separate characteristics of rare signals starting from the knowledge of the background they’ve been trained on. Developing innovative search techniques for probing dark matter signatures is crucial for broadening the DM search program at the LHC, and BEAD is a Python package that uses deep learning based methods for anomaly detection in HEP data for such new physics searches. BEAD has been designed with modularity in mind, to enable usage of various unsupervised latent variable models for any task. The aim of this project would be to develop new approaches for background enrichment with the end goal of improving anomaly detection performance for new physics searches.

Building an Acknowledgement Plugin for the Alarms and Alerts System: Building an Acknowledgement Plugin for the Alarms and Alerts System. Email the mentors (Petya Vasileva,Ilija Vukotic). Date posted 2025-03-14.

perfSONAR is a widely-used monitoring framework for Research and Education networks, which are critical to supporting global scientific collaboration. These high-speed networks connect universities, research institutes, and laboratories, enabling massive data transfers and providing the backbone for international science projects. The Alarms and Alerts schedulers process millions of daily network tests and produce alarms which aim to alert site administrators and scientists using the computing grid, about current network problems. By improving how these alerts are managed and acknowledged, you’ll help ensure that researchers worldwide can rely on stable, high-performing connections to conduct their work without interruptions.

The project involves enhancing an existing alert subscription frontend that notifies users of network issues. Your main responsibility will be implementing a new feature allowing administrators to acknowledge alarms, signaling that they are working on the issue. If time allows, you’ll also explore adding new alarm types to expand the system’s usefulness. These changes will help site admins react to problems more efficiently, reducing downtime and ensuring the smooth operation of a global research network that serves critical scientific projects.

You work within a pre-established framework and codebase, integrating your contributions with the current front-end. You’ll learn how to connect the user interface with backend logic to handle alarm status changes, enabling clear visibility into which issues are being addressed. The goal is to produce a polished feature that is both functional and user-friendly. Throughout the project, we expect regular check-ins (e.g., weekly) to help you stay on track and to provide guidance.

Learning Experience You’ll gain experience integrating code into an existing system, working on both front-end and back-end components. You’ll learn to implement new features in a eal-world environment, collaborate with experienced developers, and test your code under various scenarios. This project provides an opportunity to sharpen your full-stack development skills, while directly supporting a global scientific community.

Preferred Skills

Familiarity with modern web development tools and languages.
Some experience working with Jupyter Notebooks, Python and REST APIs.

HS3 Support for Combine: Provide HS3 support for the Combine statistical analysis tool. Email the mentors (Massimiliano Galli,Nick Smith). Date posted 2025-03-06.

The High Energy Physics Statistics Serialization Standard (HS3) [1] is an ongoing effort that defines standards of different statistical procedures and results used in High Energy Physics (HEP) in terms of human-and machine readable representations. Having a code-independent representation of the likelihood and inference results allows not only to publish the likelihood itself (a long-term goal in HEP experiments), but also to use different frameworks interchangeably, removing the dependency on code and reducing the need for maintenance of legacy projects.

The tool of election to perform statistical analysis in the CMS collaboration is Combine [2]. Combine uses a very specific format called “Datacard” (a plain text file), together with the ROOT workspaces, to define the statistical model. While human-readable and convenient in very simple analyses, this format scales poorly with the complexity of the analyses, making it very difficult to handle in the case of high number of channels, processes and nuisance parameters.

The goal of this projects consists in introducing the necessary changes in Combine to support HS3. Contributions to HS3 are also expected whenever aspects not taken into consideration in the current implementation arise.

[1] https://github.com/hep-statistics-serialization-standard/hep-statistics-serialization-standard

[2] https://arxiv.org/abs/2404.06614

SW development for the Inner Tracker Data Trigger and Control system (IT-DTC): Writing SW for the IT-DTC system that runs on the APOLLO board. Email the mentors (Serhii Cholak,Zeynep Demiragli,Gianfranco De Castro). Date posted 2025-02-26.

The APOLLO platform will be used as the Inner Tracker Data Trigger and Control system (IT-DTC) and the Level-1 track trigger finder for the CMS Phase-II upgrade at the HL-LHC. This project is focused on developing and testing the IT-DTC plugin SW, which is part of the Online Software for the Tracker upgrade. The student will join the effort of the SW development by testing the Front-End module readout blocks. This involves active collaboration with the system’s HW, FW and SW developers and will greatly contribute to student’s experience and understanding of the detector operation and readout systems. Depending on the project duration and general development progress, this project can be extended to participation in development and tests of the calibration of IT modules within the IT-DTC plugin.

Developing a Pythonic tool for smoothing histograms: Developing a Pythonic tool for smoothing histograms. Email the mentors (Mohamed Aly,Alexander Held,Matthew Feickert). Date posted 2025-02-23.

Histograms are fundamental objects in statistical modelling frameworks used in many Large Hadron Collider (LHC) analyses. They serve as the primary representation of both experimental data and theoretical predictions, with systematic uncertainties often modelled as variations on these histograms. Smoothing refers to techniques that reduce statistical noise in histograms, enhancing the stability of statistical models and ultimately leading to more reliable results in physics analyses. While smoothing is essential for analyses, existing tools lack accessible interfaces, comprehensive documentation, and integration into the Scikit-HEP ecosystems. This project aims to address this gap by developing a smoothing tool in Python based on existing algorithms, enabling physicists to apply smoothing methods more efficiently and consistently in their workflows.

Cooperative track building for particle tracking: A creative exploration of how to cooperatively build tracks in particle tracking. Email the mentors (Liv Våge). Date posted 2025-02-20.

When a charged particle traverses a detector, it leaves traces of its passage in the form of hits. These hits are used to reconstruct the particle’s trajectory, which is essential for understanding the particle’s properties. The traditional tracking methods reconstruct one particle track independently from its neighbouring tracks, when in fact the probability of a hit belonging to a track could be seen in the light of its neighbouring tracks. This information is exploited in graph neural nets for tracking, which is an [active area of research][gnn-paper]. The goal of this project is to explore an alternative approach to graph neural nets, which can take long to train, implement and understand. Instead, this project will attempt to exploit the same information in a more lightweight way. The project will involve implementing and testing different algorithms, which may involve classic computing algorithms like maze-solving, or it may be something more exotic like reinforcement learning. Code will be provided as a starting point, and the student is encouraged to experiment with different architectures and be creative. [gnn-paper]: https://www.epj-conferences.org/articles/epjconf/abs/2024/05/epjconf_chep2024_09004/epjconf_chep2024_09004.html

Creating ML templates for HEP: Building simple ML workflows for HEP applications. Email the mentors (Liv Våge). Date posted 2025-02-20.

In HEP, there are many ML methods and tasks that are common across experiments and across projects. Physics analysis projects often include boosted decision trees and neural nets, calorimeter clustering is sometimes done with ML clustering, and graph neural nets are being explored for tracking. Building an ML pipeline can be a time-consuming task, especially for those new to the field. This project will create a set of templates that people can use as a jumping-off point for their ML tasks. These templates will include the basic structure of a project, including data loading, preprocessing, model building, training, and evaluation. The templates will be written in Python and will use common ML libraries such as scikit-learn and PyTorch. Instead of being generic ML templates, these will be HEP specific, with examples of how to load and preprocess HEP data. This is a chance to learn best practices and get comfortable with a variety of ML models. For an overview of ML uses in HEP, see the HEP-ML Living review.

Exploring ML methods for Kalman filter tracking: Combining Kalman filters and ML for particle tracking. Email the mentors (Liv Våge). Date posted 2025-02-20.

When a charged particle traverses a detector, it leaves traces of its passage in the form of hits. These hits are used to reconstruct the particle’s trajectory, which is essential for understanding the particle’s properties. The Kalman filter is a widely used algorithm for tracking particles in high-energy physics experiments. It is an iterative algorithm that estimates the particle’s trajectory by combining the measurements from the detector with the predictions from the previous step. In this project, we will explore how machine learning methods can be combined with the Kalman filter to improve the tracking performance. The goal is to develop a hybrid tracking algorithm that leverages the strengths of both approaches. Code will be provided as a starting point, and the student is encouraged to experiment with different architectures and be creative. The project will use Python and Pytorch.

Displaced Vertex Finder on GPUs for LHCb: Development of algorithm that finds displaced vertices in Velo detector of LHCb. Email the mentors (Andrii Usachov). Date posted 2025-02-20.

Many models Beyond the Standard Model predict existence of new long-lived particles. When produced at the interaction point, these particles can travel a macroscopic distance before decaying. The decay products can leave traces (tracks) in the Velo detector of LHCb. The goal of this project is to develop fast and efficient algorithm that finds anomalous displaced vertices using GPUs. The developments will take place within the Allen project at LHCb, and the algorithm will be implemented in Python and CUDA, and will be tested on simulated data. Such algorithm will result in significant expansion of the LHCb sensitivity for many classes of BSM models at once. This project requires a solid understanding of math and experience with CUDA, knowledge of the LHCb tracking is a plus.

Adopting Analysis Grand Challenge (AGC) for Kubeflow MLOps platform: Adopting Analysis Grand Challenge (AGC) for Kubeflow MLOps platform. Email the mentors (Alexander Held,Oksana Shadura,Carl Lundstedt,Sam Albin). Date posted 2025-02-18.

The Analysis Grand Challenge (AGC) is performing the last steps in an analysis pipeline at scale to test workflows envisioned for the HL-LHC. The goal of this project is to adopt Kubeflow Pipelines for the AGC machine learning workflow. Kubeflow is an open source platform for implementing MLOps, providing a framework for building, deploying, and managing machine learning workflows.

One Flow to Unfold Them All: Unfolding quantities with a single normalizing flow. Email the mentors (Massimiliano Galli). Date posted 2025-02-17.

A key challenge in HEP analyses consists in linking the predictions from quantum field theory, provided at the level of partons, with the corresponding detector signatures. One of the most common approaches to overcome this issue consists in “unfolding” to the parton level, effectively adjusting the data on a statistical basis to provide an estimate of their pre-detector distributions. Traditional unfolding methods [1,2,3] have been successfully used so far, but they can be applied only to binned datasets of small dimensionality, such that the unfolded observables and their binning have to be selected in advance. In the most recent years, machine learning (ML) techniques have revolutionized unfolding by allowing for unbinned quantities to be measured across many dimensions [4].

In this project, we explore the application of a novel technique based on normalizing flows [5] to the unfolding case. The student(s), with a solid ML background and interest, will focus on the following aspects:

understand the correction technique and set-up a training+testing pipeline to unfold an arbitrary number of observables;
investigate the impact of negative weights in the samples;
test the scalability of the technique (how many observables are we able to unfold at once? what are the limits of normalizing flows?);
test the procedure on different physics samples;
investigate the impact of different labels of correlations among the observables on the performance of the normalizing flow;
investigate different ways of handling signal and background in the training procedure;
possibly explore using techniques like flow-matching to achieve the same results.

[1] https://inspirehep.net/literature/97289

[2] https://arxiv.org/abs/hep-ph/9509307

[3] https://arxiv.org/abs/1205.6201

[4] https://arxiv.org/abs/2404.18807

[5] https://arxiv.org/abs/2403.18582

AGC-column-joining: On-demand column-joining of input data for the AGC demonstrator analysis. Email the mentors (Nick Manganelli). Date posted 2025-02-13.

The project’s goal is to integrate ongoing R&D) of on-demand column-joining into the Analysis Grand Challenge. The project would involve testing components already developed and developing new integrations with coffea) to facilitate joining columns in separate datasets together for analysis, e.g. loading cached columns from expensive calculations. Benchmarking throughput and capability of the components, identifying and mitigating bottlenecks, and understanding overall feasibility and applicability are crucial. trino provides a Distributed SQL engine capable of joining ragged columnar data. This requires inputs to be in parquet format, which can be facilitated with transforming input ROOT TTree’s using ServiceX or hepconver

Scientific workflows with xeus-cpp: Research and development of scientific use cases powered by the HEP Python-C++ ecosystem. Email the mentors (Vassil Vasilev,Aaron Jomy). Date posted 2025-02-12.

High-energy physics (HEP) presents a wide range of computational challenges that demand scalable, high-performance software solutions. While significant efforts have been made to develop robust and efficient tools, the diversity of scientific applications often extends beyond the original scope envisioned by software developers. As a result, there is a growing need to explore and develop novel scientific use cases that leverage modern computational techniques.

Cutting-edge instruments able to collect unprecedented amounts of data such as radio telescopes, particle accelerators, and real-time AI inference in healthcare will go online by the 2030s. The seamless operation of these facilities demands nearly instantaneous data transfer from instruments to accelerator hardware, necessitating the development of next-generation software architectures and enhanced language interoperability. Unlike the current unidirectional interaction, future pipelines in C++ and Python must collaborate synergistically.

The field of HEP has developed an on-demand synergy between high-performance languages such as CUDA C++ and fast prototyping languages such as Python. Tools such as xeus-cpp explore high-performance scientific workflows. Existing examples prototyped with xeus-clang-repl (precursor to xeus-cpp) can be found here. The project aims to demonstrate low-latency interactions between C++ and Python in the context of various sciences such as physics and computational biology.

CMS Forward calorimeter Run 3 data analysis: Optimization of material activation correction for CMS Forward calorimeter data. Email the mentors (Alexey Shevelev,Olena Karacheban). Date posted 2025-02-09.

The Beam Radiation Instrumentation and Luminosity (BRIL) project focuses on precision luminosity measurements and beam-induced background studies. One of the primary systems used for precise luminosity measurement is the Forward Hadron (HF) Calorimeter. While HF provides excellent stability and linearity, the detector is susceptible to significant radiation activation. This activation, involving multiple short-lived isotopes with typical lifetimes of hundreds of nanoseconds, can introduce biases in measurements taken every 25 ns . Another potential source of systematic uncertainty arises from imperfections in the timing alignment of the detector, leading to signal spillover into neighboring measurement slices or bunch-crossings. To correct these effects, a multi-parameter fit should be derived and applied to the full year of data, significantly impacting the final luminosity measurement. Students participating in this project will gain insights into data analyses, work with Python-based data processing tools, including pandas and numpy.

DevOps in cloud hosted Django project to support new FBCM detector development: DevOps in cloud hosted database for HL-LHC FBCM detector development. Email the mentors (Arkady Lokhovitskiy,Mihailo Obradovic,Olena Karacheban). Date posted 2025-02-09.

For High-Lumi LHC in the CMS experiment the standalone luminometer FBCM is being designed. It is silicon pad-based detector with dedicated fast ASIC. Various parts of the detector are at the final design stage and will be produced in 2025. To keeping track of the produced components, test results, and overall progress of the detector construction the database is under development. It is Django framework with an accompanying Python frontend. The student will contribute to the design and optimization of the database, will take part in the ongoing development and subsequent deployment of this application. They will gain knowledge and experience in Python scripting, databases, and containerized deployment (OKD).

CMS Level 1 trigger scouting for luminosity measurement: Data processing of CMS Level 1 trigger 40 MHz scouting for luminosity measurement. Email the mentors (Olena Karacheban,Alexey Shevelev). Date posted 2025-02-09.

In High-Luminosity LHC era the CMS 40 MHz L1 Scouting system will be used for luminosity measurements by counting multiple trigger primitives, as muon stubs, energy deposition in calorimeters, track primitives, and others. A detailed study of these primitives and their correlations is required to determine their suitability for precision luminosity measurement in the HL-LHC conditions. A prototype of the system has already been deployed as a Run 3 demonstrator, and data is available for analysis. Students involved in this project will gain experience with the CMS data flow, including the use of the CMS Software Framework (CMSSW) and HTCondor for large-scale data processing.

Machine learning pileup suppression trigger algorithms at the HL-LHC: Machine learning pileup suppression trigger algorithms at the HL-LHC. Email the mentors (Ariel Schwartzman,Rainer Bartoldus). Date posted 2025-02-07.

This project proposes to investigate new approaches to the challenge of pileup at the High Luminosity LHC (HL-LHC). The goal is to investigate and implement advanced methods for the selection of multi-jet signatures at the trigger level, addressing key experimental challenges that arise at high luminosity and require the development of new techniques. The project consists of utilizing simulated HL-LHC data to implement and study a new type of machine learning algorithm to tackle these new challenges and enhance the power of the HL-LHC to discover new physics.

New Frontiers in 5D Calorimeter Reconstruction: New Frontiers in 5D Calorimeter Reconstruction. Email the mentors (Ariel Schwartzman,Michael Kagan,Rainer Bartoldus). Date posted 2025-02-07.

Machine learning is having a transformative impact in collider detectors, both at the LHC, and also in the context of future colliders. While many efforts in particle flow reconstruction are being pursued, there are unique new opportunities that are largely unexplored and have the potential to significantly enhance this area of research. In particular, the incorporation of fast-timing with dual-readout information. Fast-timing provides a new dimension to calorimetry. In the case of high granularity calorimeters, cell-time information can be utilized to better resolve nearby hadronic showers and improve the jet energy resolution beyond what can be achieved with existing AI/ML particle flow methods. This project consists of exploring a new frontier for particle flow reconstruction incorporating fast timing and dual readout information. The goal is to investigate how to combine fast-timing information with dual readout signals and tracks using modern machine learning techniques. This project presents an opportunity to explore cutting-edge calorimetry technology and computing algorithms and methods with the goal of realizing the next generation of 5D (x,y,z,t,E) particle flow reconstruction for future colliders.

Vertex Classification in High PU scenario: Selecting vertices in high-luminosity high energy hadron colliders. Email the mentors (Ariel Schwartzman,Wasikul Islam,Rocky Bala garg). Date posted 2025-02-05.

Future High luminosity LHC upgrades (HL-LHC) and high energy proton-proton colliders such as FCC-hh will produce 200 – 1000 simultaneous vertex interactions per event, posing significant challenges to the physics event reconstruction. At the LHC, we utilize algorithms that selects one “hard-scatter” vertex among all reconstructed vertices. The ability to select the correct vertex using existing algorithms degrade as the number of vertices continues to increase and new solutions will be required. This problem is particularly challenging due to the fact that different physics process give rise to very different event topologies making it difficult to design a single algorithm that can be broadly applicable to all events. In this project, we propose to utilize advanced machine learning methods to develop the next generation of vertex selection algorithms suitable for the very high luminosity hadron colliders of the future.

Validating fast analytical tools against ACTS framework to allow fast turn-around of detector designs: Validating fast analytical tools against ACTS framework to allow fast turn-around of detector designs. Email the mentors (Charles Young,Rocky Bala Garg). Date posted 2025-02-05.

Expected transverse momentum resolution (σ(pT)/pT) at a future electron-positron Higgs Factory is of the order of (10^−5⋅pT⊕10^−3). Achieving this ambitious goal requires innovative detector concepts. While detailed Monte Carlo simulations combined with sophisticated reconstruction software represent the gold standard for performance evaluation, this approach is often cumbersome and slow, making it impractical during the early stages of detector design. A fast feedback tool is therefore essential—one that is simple to use and does not rely on resource-intensive frameworks like Geant4. Such a tool should focus on intrinsic detector performance, minimizing sensitivity to reconstruction software tunings. It should be capable of rapid execution to enable the exploration of multiple detector concepts, facilitating the identification of the most promising designs. An early analytical framework addressing this need was introduced by R.L. Gluckstern in 1963 [Ref: NIM A24 (1963) 381–389], offering direct calculations based on simplifying assumptions such as negligible multiple scattering, uniformly spaced detector planes, and consistent hit resolution across layers. An enhanced calculation tool based on this foundation, relaxing these assumptions, allows for more realistic modeling. This direct calculation method, being analytical rather than simulation-based, is highly efficient. Validation against detailed simulation and reconstruction results from ATLAS and CMS has shown reasonable agreement. However, some limitations remain due to residual assumptions (e.g., discrete multiple scattering) and complexities in actual detector geometries, such as non-uniform material distribution as a function of η and ϕ. Project Objective: > The current project aims to utilize the ACTS framework, in conjunction with a detector geometry such as the Open Data Detector, to simulate and reconstruct single muon tracks. This will enable direct comparisons between full simulation + reconstruction methods and the analytical calculation approach with identical detector descriptions. Project Workflow: 1. Acquire a basic understanding of track reconstruction and the ACTS framework. 2. Simulate single muons, perform track reconstruction within ACTS, and obtain various resolutions. 3. Get key detector parameters (e.g., hit resolution, detector layer positions) for input into the direct calculation tool. 4. Run the calculation tool, determine resolutions, and compare results with those from ACTS simulations.

ML backends for Vector: Adding TensorFlow, PyTorch, and JAX backends to Vector. Email the mentors (Ianna Osborne,Liv Våge). Date posted 2025-01-20.

Scikit-HEP’s Vector library has several backends with different features, including a NumPy backend for flat or rectilinear arrays of Lorentz vectors. However, it would also benefit from TensorFlow, PyTorch, and/or JAX backends, since these could be used within an optimizable model with built-in autodiff. vector#541 is a step in this direction for PyTorch.

Writing std::vector from Uproot: Add support for writing ragged arrays as std::vectors to Uproot. Email the mentors (Jim Pivarski,Ianna Osborne). Date posted 2025-01-20.

Uproot can write ragged data as dynamic arrays, but ROOT requires TBranches of this type (e.g. Muon_pt) to be accompanied by a “counter” TBranch (e.g. nMuon), which can be cumbersome when writing many ragged arrays, some with the same dimensions, others with different dimensions. std::vectors do not require “counter” TBranches, but they’re expressed in a different format that would have to be handled separately. It is an often-requested feature, though (see uproot#257).

Two-phase Uproot: Split Uproot into two phases: metadata-reading and data-reading. Email the mentors (Jim Pivarski,Ianna Osborne). Date posted 2025-01-20.

Uproot is designed for interactive use: a user specifies a file to open, Uproot reads the file’s metadata, and then the user specifies which TBranches to read as arrays. However, in high-throughput situations, the metadata-reading (implemented in pure Python) is a bottleneck—it would be better to jump directly to the high-performance array-reading. This can be done by using Uproot in two passes: the first pass reads all the metadata and stores it in a database (once), and the second pass uses the database to jump directly to the seek locations of array data (many times). This is conceptually similar to what kerchunk does with other file formats (making ROOT a kerchunk format is one possible implementation), and a demonstrator for this concept was implemented using the Tiled database as tiled-uproot. This project would be to decide on a database format and implement two-phase reading with a nice interface.

TTree::Draw syntax from Uproot: Integrate Formulate into Uproot as a “language” for computing aliases. Email the mentors (Jim Pivarski,Ianna Osborne). Date posted 2025-01-20.

Recently, the Formulate library was updated to parse any ROOT TTree::Draw expression (also known as TFormula) robustly and with high performance. Uproot’s expressions and aliases arguments to uproot.TTree.arrays and similar functions accept string expressions, but they are interpreted as Python strings. It would be more natural for them to accept TTree::Draw syntax, especially for fAliases embedded within the ROOT file. This project is to add a second uproot.language object that uses Formulate to interpret the strings in TTree::Draw syntax and add Formulate as a dependency of Uproot.

Uproot to all the dataframes: Export Uproot data to all dataframes supported by Akimbo. Email the mentors (Jim Pivarski,Ianna Osborne). Date posted 2025-01-20.

Uproot’s library="pd" indicates that the data should be read into Pandas DataFrames, rather than NumPy or Awkward Arrays, and it’s implemented by invoking Akimbo, formerly known as uproot-pandas, to convert Awkward Arrays into Apache Arrow and store them as Pandas columns with the Arrow dtype. However, Akimbo also supports dask.dataframe, Polars, and cuDF (GPU) dataframes. Polars has been requested (see uproot#1096), but all of these dataframes would be potentially useful.

Modifying existing TTrees in Uproot: Add new columns to existing TTrees (99% done) and/or new rows (new project) in Uproot. Email the mentors (Jim Pivarski,Ianna Osborne). Date posted 2025-01-20.

Uproot can add new objects to existing ROOT files through uproot.update, but it would be even more useful if it could modify existing TTrees in place. Zoë Bilodeau implemented the ability to add new columns/TBranches, which is especially useful for backfilling data (e.g. adding an array of False for triggers that didn’t exist at the time of data-taking). This implementation is nearly done (see uproot#1155), apart from a few corner-cases that need to be tested and debugged. It would also be useful to be able to add rows/entries, which would be an entirely new project. Completing the adding-columns project would provide the experience necessary to tackle the adding-rows project.

Solidify the Scikit-HEP GPU ecosystem: Test and identify missing capabilities in the Scikit-HEP GPU ecosystem. Email the mentors (Jim Pivarski,Ianna Osborne). Date posted 2025-01-20.

Awkward Array’s CUDA kernels and Numba-CUDA support exist (see this training), as well as cuda-histogram, but these features haven’t been heavily tested and probably haven’t ever been used in an analysis. This project would be to try using Scikit-HEP libraries (including Vector and any other relevant libraries) in an analysis using GPUs to find out what the pain points are, and either fixing them directly or raising awareness among the developers.

Completing the Ragged library: Implement the remaining functions to make Ragged an Array-API compliant ragged array library. Email the mentors (Jim Pivarski,Ianna Osborne). Date posted 2025-01-20.

Scikit-HEP’s Ragged library is an interface over Awkward Array that restricts it to ragged arrays only (no records, missing data, etc.) and satisfies DataAPI’s Array API, which is rapidly becoming the standard interface for array libraries. As such, the requirements for Ragged are very precise: all required functions have already been stubbed out with full docstrings, and about half of them have been implemented. This project would be to complete it and promote it as a fully functional, Array API-compliant ragged array library.

Awkward Arrays with physical units: Adding “units” as Awkward Array metadata and conversions as behaviors. Email the mentors (Jim Pivarski,Ianna Osborne). Date posted 2025-01-20.

Awkward Arrays already have an ak.Array.attrs attribute that can carry arbitrary metadata (persistent or transient) and an ak.behavior that attaches functionality to arrays. One, the other, or both of these would be able to implement physical units on arrays and convert between units when appropriate, such as putting two arrays into common units before adding them. awkward#2468 is a discussion of this feature and possible implementations.

Using std::maps in Awkward Array: Implement sorted_map type in Awkward Array. Email the mentors (Jim Pivarski,Ianna Osborne). Date posted 2025-01-20.

Awkward Array implements some data types as types with equivalent storage (e.g. lists of uint8 for strings) plus ak.behavior to provide specialized functionality (e.g. printing as strings and broadcasting one string as one object). A basic type that has not been implemented is a key-value mapping, such as C++’s std::map. This is different from Awkward Array’s “record” type, which has a fixed set of field names, each of which can have a different type. A key-value mapping has keys of one type (often but not always strings) and values of another, fixed type (not different for each key), like std::map<std::string, int>. When Uproot encounters C++ std::map<K, V> in a ROOT file, it produces an Awkward Array of lists of pairs of types K and V with name "sorted_map". However, “sorted map” behaviors have not yet been implemented in Awkward Array, which would make this data type useful (see awkward#780). This project would be to add such functionality.

Custom autodiff in Awkward Array: Replace JAX with custom autodiff in Awkward Array. Email the mentors (Ianna Osborne,Lino Gerlach,Peter Fackeldey). Date posted 2025-01-20.

At an Analysis Tools meeting and in awkward#3349, we’ve discussed the possibility of switching from JAX to a custom implementation to implement automatic differentiation (autodiff, also known as autograd). The problems with JAX are related to its interface, which is intended to do much more than just autodiff. Also, implementing eager autodiff is likely not a major project, especially if we take advantage of complex-step differentiation. This project would either implement autodiff as a module within Awkward Array or as a new Scikit-HEP library (and possibly as a backend for Vector, too).

ML-ready Awkward Arrays: Helper functions to turn Awkward records into array dimensions and PyG indexes. Email the mentors (Ianna Osborne,Liv Våge). Date posted 2025-01-20.

Awkward Array has functions to convert to and from TensorFlow and PyTorch, such as ak.from_raggedtensor and following, with support for TensorFlow’s RaggedTensor. However, there are format conversions that still have to be handled manually, such as turning an Awkward Array of records (e.g. muon with pT, eta, phi fields) into an array dimension (e.g. length-3 dimension in the tensor shape). NumPy has a function for this, np.lib.recfunctions.structured_to_unstructured, though the Awkward equivalent can have a different name (since it has different submodules). The labor-intensive steps described in this StackOverflow answer and this tutorial could be encapsulated as ready-to-use functions. Also, PyTorch-Geometric (PyG) expects ragged arrays to be represented as an external array of integers, which Awkward Array could generate with a function (see awkward#3256). Yet another framework, PyTorch Cluster, expects raggedness to be expressed as a list of tensors (see awkward#3265). All of these helper functions would simplify the conversion of Awkward Arrays into tensors for fixed-size NNs and GNNs.

Dates and strings in Awkward Array: More date & string functions and NumPy’s new varlen string in Awkward Array. Email the mentors (Jim Pivarski,Ianna Osborne). Date posted 2025-01-20.

Awkward Array has a suite of string functions provided by Apache Arrow (in ak.str.*). However, it’s missing a few string functions (see awkward#2703) and it could also be useful to similarly wrap Arrow’s date-handling functions (see awkward#2702), taking care to translate between NumPy’s date format (which Awkward uses) and Arrow’s date format. In addition, NumPy added a new variable-length string format that is different from all other such formats and it would be useful to convert to and from Awkward Arrays (see awkward#3170). Although most functionality can be added in Python, there’s a slight chance that accessing NumPy’s varlen strings would require C (not C++).

Optimizing automatic differentiation using activity analysis: Optimizing automatic differentiation using activity analysis. Email the mentors (Maksym Andriichuk,Petro Zarytskyi,Vassil Vasilev). Date posted 2024-06-15.

In mathematics and computer algebra, automatic differentiation (AD) is a set of techniques to numerically evaluate the derivative of a function specified by a computer program. Automatic differentiation is an alternative technique to Symbolic differentiation and Numerical differentiation (the method of finite differences). Clad is a source transformation based AD tool which can perform more advanced program optimization by implementing more sophisticated analyses because it has access to a rich program representation – the Clang AST. These optimizations investigate which parts of the computation graph are relevant for the AD rules. One such optimization is the To-Be-Recorded optimization which reduces the memory pressure to the clad tape data structure. TBR analysis is a part of an adjoint mode of AD. It finds variables whose present value is used in a derivative instruction and reduces the number of statements by not creating temporary variables for dependent variables that are being overwritten and not being used. Another optimization is the activity analysis optimization which discards all statements which are irrelevant for the generated code. That is, if the statements do not depend on the input/output variables of a routine in a differentiable way, they are ignored. The advantage is that this improves the performance of the generated code and reduces the phase space of features needed to be supported to enable differentiable STL, for example.

CI/CD improvements for Alpaka library, and related projects.: Improvement and optimization of the CI job generator for the alpaka library. Mentee: Yurii Perets. Email the mentors (Jiri Vyskocil,Simeon Ehrig,Volodymyr Bezguba). Date posted 2024-06-07.

alpaka is an abstraction library that allows writing a function once and executed on different accelerators, e.g. on different CPU (x86, ARM, RISC V) or GPU architectures (Nvidia, AMD, Intel). Therefore, the library must support a wide range of different build tools and runtime libraries in different versions. At the moment, we could test 2,500,000 different combinations of software dependencies in our CI, which would take months for a single commit. To reduce the number of jobs and also save the time for manually adding new software dependency combinations, we implemented a CI job generator written in Python using pair-wise testing. The generator reduces the number of jobs to about 170. Unfortunately, we encountered several problems and limitations when using the generator. The biggest problem is to check whether all expected test pairs are generated. Therefore, we have started to rewrite the generator to solve all problems and ensure that it works as expected from the first commit by using strong test coverage. Your task is to finalize the work on the new version of the generator, migrate alpaka to the new generator and verify that the CI works as expected. The biggest challenge of the project is that trial and error does not work due to the immense number of combinations. Verifying that the generator works as expected is the main problem. Depending on the result of the migration, further optimizations are required. These can be simple optimizations, such as reducing the number of test parameters, but also clever optimizations inspired by HPC applications, such as an intelligent scheduling algorithm for the CI jobs to make better use of the CI resources and abort the CI as early as possible in case of broken code.

Beam-Induced Background Simulations at Muon Colliders: First characterization of the detector environment and simiulation development for a very high-energy wakefield-acceleration based collider.. Email the mentors (Simone Pagan Griso,Angira Rastogi). Date posted 2024-04-25.

A 10 TeV lepton collider has the promise to revolutionize our understanding of particle physics. One of the proposal entiles a very high-energy electron or photon collider using novel acceleration techniques that promise a relatively compact layout even for such large energies. The design of a detector for such a collider is just starting to be studied. In particular this project will develop the needed tools to study the environment created by beam-induced backgrounds that will inform which detector technologies are expected to be needed to be developed in the future. If time allows, a first proposal of how the detector layout will be also pursued for the first time. The project allows the fellow to mix acquiring good technical skills in Python and C++, and the ability to innovate state-of-the-art simulation techniques. The impact is setting up the first-ever simulation infrastructure for such colliders.

Charged-particles reconstruction at Muon Colliders: Optimization of charged-particle reconstruction algorithms in future Muon Colliders. Mentee: Sarah Shinde. Email the mentors (Simone Pagan Griso,Sergo Jindariani). Date posted 2024-04-25.

A muon-collider has been proposed as a possible path for future high-energy physics. The design of a detector for a muon collider has to cope with a large rate of beam-induced background, resulting in an unprecedentedly large multplicity of particles entering the detector that are unrelated to the main muon-muon collision. The algorithms used for charged particle reconstruction (tracking) need to cope with such “noise” and be able to successfully reconstruct the trajectories of the particles of interest, which results in a very large conbinatorial problem that challenges the approaches adopted so far. The project consists of investigating how the tracking algorithms can be improved by utilizing a combination of directional information from specially-arranged silicon-detector layers and more advanced reconstruction techniques and algorithms available in the experiment-generic ACTS tracking library adapted to the specific environment expected for a Muon Collider detector. The project allows the fellow to mix acquiring good technical skills and the ability to innovate state-of-the-art tracking algorithms in this less-explored environment.

Beam-Induced Background Simulations at Muon Colliders: New sampling techniques for beam-induced background generation in future Muon Colliders. Email the mentors (Simone Pagan Griso,Sergo Jindariani,Angira Rastogi). Date posted 2024-04-25.

A very high-energy muon-collider has been proposed as a possible path for future high-energy physics and represents a novel concept never implemented before. The design of a detector for a muon collider has to cope with a large rate of beam-induced background, resulting in an unprecedentedly large multplicity of particles entering the detector that are unrelated to the main muon-muon collision. The dominant source of such background comes from showering of high-energy electroncs produced by the beam muon decays into dedicated shielding. Simulations of beam-indueced backgrounds are extremely resource-intensive to carry out. The current approach is to simulate a single bunch-crossing and then create new events assuming azimutha;l symmetry of the produced particles. Unfortunately, this is not a good nor valid assumption. In this project you will change fundamentally the way we sample the initial simulation, keeping track of each muon decay and sampling those muons randomly to create new bunch crossings. To achieve that, we will ensure correlations and physucs of the nature of the beam-induced backgrounds are respected. The project allows the fellow to mix acquiring good technical skills in Python and C++, and the ability to innovate state-of-the-art simulation techniques. The impact is vritually changing how efficiently and realistically we perform any simulation of muon collider detectors that is used to assess the feasibility and challenges of this innovative endavor.

Towards Accountable Network Bandwidth Utilization via SDN: Contributing to the development of the Data Movement Manager (DMM) within the Rucio/SENSE context. Email the mentors (Aashay Arora,Diego Davila). Date posted 2024-04-17.

In the exa-scale computing era for large scale experiments such as the High Luminosity LHC, the current model for data transfer workflows summarized as push-now-worry-later will not be feasible. We are integrating software defined networking (SDN) with the current LHC software stack for data transfers in order to have accountability in the network bandwidth utilization for large bandwidth data flows. This project entails actively contributing to the development of the Data Movement Manager (DMM) which is the interface between SENSE, the SDN service from ESNet, and Rucio, the data management application used for LHC and beyond.

Implementing the ATLAS TopCP Transformer in ServiceX: Implement a standard ATLAS ntuple maker in a ServiceX transformer. Mentee: Alexander Christoph Schmidt. Email the mentors (Peter Onyisi,KyungEon Choi). Date posted 2024-04-15.

ATLAS has plans underway to develop standard ntuple dumpers for the PHYSLITE analysis data format, which include the possibility of dumping systematic uncertainties in a compact and efficient manner. This project is to encapsulate one of these dumpers in a ServiceX transformer to enable it to be used as a seamless way of accessing PHYSLITE data in analysis jobs in a columnar manner.

Topological Rare Hadron Decay Tagging with DNN: Deep neural net topological tagger for rare hadron decay identification. Mentee: Andrii Len. Email the mentors (Dmytro Kovalskyi). Date posted 2024-04-15.

The CMS experiment provides a unique opportunity to study extremely rare hadron decays due to a significantly larger amount of data compared with the LHCb and Belle II experiments. The main challenge is to identify such decays and suppress associated backgrounds in the high pile-up environment. In this project, we will use a Deep Neural Network to develop a new method of tagging such decays using topological information to identify the relevant charged tracks associated with the decays of interest. The project requires familiarity with machine learning algorithms and tools to build and train deep neural networks. It will involve testing different DNN types and feature engineering.

Implement Differentiating of the Kokkos Framework With Clad: Implement Differentiating of the Kokkos Framework With Clad. Email the mentors (Vaibhav Thakkar,Petro Zarytskyi,Vassil Vasilev). Date posted 2024-04-11.

The Kokkos C++ Performance Portability Ecosystem is a production level solution for writing modern C++ applications in a hardware agnostic way. It is part of the US Department of Energies Exascale Project – the leading effort in the US to prepare the HPC community for the next generation of super computing platforms. The Ecosystem consists of multiple libraries addressing the primary concerns for developing and maintaining applications in a portable way. The three main components are the Kokkos Core Programming Model, the Kokkos Kernels Math Libraries and the Kokkos Profiling and Debugging Tools.

The Kokkos framework is used in several domains including climate modeling where gradients are important part of the simulation process. This project aims at teaching Clad to differentiate Kokkos entities in a performance portable way.

Task ideas and expected results:

Implement common test cases for Kokkos in Clad
Add support for Kokkos functors
Add support for Kokkos lambdas
Incorporate the changes from the initial Kokkos PR
Enhance existing benchmarks demonstrating effectiveness of Clad for Kokkos
[Stretch goal] Performance benchmarks

Candidate requirements:

Experience with C++, experience with ML and backpropagation
Knowledge of Clang.

Statistical treatment of the AGC results with RooFit: Implement estimation of physics model parameters of the AGC with RooFit. Mentee: Valerii Kholoimov. Email the mentors (Jonas Rembser,Alexander Held). Date posted 2024-04-04.

The IRIS-HEP Analysis Grand Challenge (AGC) is a realistic environment for investigating how high energy physics data analysis workflows scale to the demands of the High-Luminosity LHC (HL-LHC). The project offers a blueprint for HEP analysis applications that can be implemented using different tools and approaches. One of the implementations offered is done with ROOT, the tool for storing, processing and data analysis used by LHC experiments. In particular, it demonstrates usage of the RDataFrame high-level interface for data analysis in the CMS ttbar OpenData application. At the same time, it lacks the final steps of the AGC workflow, which involve the estimation of physics model parameters from the output histograms using the maximum likelihood method. The objective of this project is adding those steps via RooFit, the tool provided by ROOT for statistical analysis and advanced fitting, showcasing the use of such tool in a Python environment.

Embedded software application for a RISC-V based system-on-chip (SoC) for LHCb Velo detector.: Embedded software application for a RISC-V based system-on-chip (SoC) for LHCb Velo detector.. Email the mentors (Alessandro Caratelli,Marco Andorno). Date posted 2024-03-20.

At the CERN Microelectronics section, we are in the process of developing a custom RISC-V based system-on-chip (SoC) generator. This will be used to embed on-chip processors and programmability into ASICs for multiple experiments. The student will participate developing the specific embedded software applications (C++/Python) for various operations targeting the custom processor system. If interested, the student can also participate in the development of software tools used to build the custom hardware system. The first prototype will be part of the LHCb Velo detector and possibly the CMS tracker.

Alpaka for CMS: Extending the Alpaka performance portability library with task-parallel constructs for the CMS pixel reconstruction. Mentee: Mykhailo Varvarin. Email the mentors (Jiri Vyskocil,Volodymyr Bezguba). Date posted 2024-03-06.

This project proposes to extend the Alpaka performance portability library with task-parallel constructs, like task graphs and cooperative groups, and to evaluate their performance using them in the pixel track reconstruction software of the CMS experiment at CERN. As data volume and complexity surge, the CMS pixel reconstruction process, crucial for accurate particle tracking and collision event analysis, demands optimized computational strategies for timely data processing. Alpaka, facilitating development across diverse hardware architectures by providing a unified API for writing parallel software for CPUs, GPUs, and FPGAs, will be extended with task graph and cooperative groups APIs to meet this demand. Integrating task graphs into Alpaka will streamline the scheduling and execution of interdependent tasks, optimizing resource utilization and reducing time-to-solution for complex data analyses. Cooperative groups will facilitate more flexible and efficient thread collaboration, crucial for fine-grained parallelism and dynamic workload distribution. These developments aim to improve the performance and scalability of CMS pixel track reconstruction algorithms, ensuring faster and more accurate data analysis for high-energy physics research. Students participating in this project will have the opportunity to contribute to programming a state-of-the-art C++ library, engaging directly with a developer group that adheres to the best software programming practices. They will gain hands-on experience in developing and implementing advanced computational solutions within a real-world scientific framework, enhancing their technical skills in high-performance computing and software development within a collaborative and cutting-edge research environment.

Enabling Julia code to run at scale with artefact caching: Develop HEP strategies for artefact caching in Julia to allow large scale running. Mentee: Elvis Agüero. Email the mentors (Graeme Stewart,Pere Mato). Date posted 2024-02-23.

Julia is a promising language for high-energy physics as it combines the easy of use and ergonomics of dynamic languages such as Python, with the runtime speed of C or C++. One of Julia’s features is that it uses a JIT (or just-ahead-of-time) compiler to target the specific architecture on which it is being run. This however, comes at the cost of the compilation time, meaning that the first pass through the code is slower. If Julia is to be adopted widely in high-energy physics, and run at large scales, then it is important to mitigate this cost by pre-compiling the Julia code to be used on the system and avoid the cost of recompiling on every node. This is accomplished by the use of the DEPOT_PATH setting. This will first be investigated on cluster systems at CERN, e.g., SWAN and lxbatch. Startup time and runtime will be investigated with increasingly large sets of jobs running. Then we shall extend our investigations to caching Julia code on CVMFS, which would allow scaling to running on the whole grid. Finally, we shall examine the possibility of precompiling artefacts for different microarchitectures, that would allow exploitation of the full power of modern CPUs in large scale heterogeneous systems.

Development of the generic vertex finder in HLT1-level trigger at LHCb.: Development of new algorithm that finds generic vertices in the LHCb using GPUs.. Mentee: Vladyslav Yankovskyi. Email the mentors (Andrii Usachov). Date posted 2024-02-22.

The project is dedicated to the development of an innovative algorithm designed to identify displaced decay vertices within the LHCb experiment. The primary aim of this algorithm is to facilitate an inclusive search methodology for long-lived particles, instead of targeting specific decay signatures. This strategy requires the algorithm to effectively differentiate and suppress vertices associated with well-established long-lived hadrons, necessitating the integration of ML solutions. This approach will enable the analysis to be adaptable to a wide range of New Physics models and searches. Furthermore, the algorithm is constrained by computational speed requirements of the online HLT1 trigger at LHCb, which makes it challenging. The student will develop the algorithm to be run on the LHCb GPU farm. This endeavor offers a rich opportunity for the student to gain hands-on experience with CUDA, C++, and Python. It also assumes the implication of the ML algorithms from at least scikit-learn or pytorch.

Searching for light dark hadrons at LHCb.: Development of the ML selection and optimisation for dark hadrons search at LHCb.. Email the mentors (Andrii Usachov). Date posted 2024-02-22.

The Standard Model of elementary particles does not contain a proper Dark Matter candidate. One of the most tantalizing theoretical developments is the so-called Hidden Valley models. These models predict the existence of dark hadrons - composite particles that are bound similarly to ordinary hadrons in the Standard Model. Such dark hadronscan be abundantly produced in high-energy proton-proton collisions. Some dark hadrons are stable like a proton, which makes them excellent Dark Matter candidates, while others decay to ordinary particles after flying a certain distance in the collider experiment. The LHCb detector has a unique capability to identify such decays, particularly if the new particles have a mass below ten times the proton mass. This project assumes a unique search for light dark hadrons that covers a mass range not accessible to other experiments. It assumes an interesting program on data analysis (python-based) with non-trivial machine learning solutions and phenomenology research using fast simulation framework. In particular, the search will cover a range of invariant masses and lifetimes which has to be covered optimally by smooth ML application. On top of this, to deal with theory dependence, the Pythia Hidden Valey modules to be used for developing fast simulation framework. Developed signal selection to be used for data analysis but also new trigger lines for LHCb Run 3. Depending on the outcome the autoencoder-based anomaly detection can be used on real data as well. The project offers a rich opportunity for the student to gain hands-on experience with python, C++. It also assumes the implication of the ML/NN algorithms from at least scikit-learn or pytorch.

Coffea development for LHC experiments: Development of coffea to support multiple LHC experiment analysis workflows. Mentee: Sam Kelson. Email the mentors (Lindsey Gray,Nick Smith,Matthew Feickert). Date posted 2024-02-22.

As the coffea analysis framework has continued to gain use across the CMS experiment and is beginning to be used more in ATLAS, the amount of experiment specific code has grown as well. With coffea’s recent transition in late 2023 to be more Dask focused there is work to be done to support experiment specific areas, such as the ATLAS specific coffea.nanoevents.PHYSLITESchema, while still focusing on performance and user needs. This project would support the general development of coffea with a focus on targeting experiment specific code to support and improve the user experience and would have the Fellow work closely with the coffea core team.

Alpaka benchmarks for CMS: Writing benchmarks for the alpaka performance portability library.. Email the mentors (Jiri Vyskocil,Volodymyr Bezguba). Date posted 2024-02-10.

This project proposes to write benchmarks for the alpaka performance portability library to evaluate the performance of using various parallel constructs in the pixel track reconstruction software of the CMS experiment at CERN, and other software. The alpaka library facilitates development across diverse hardware architectures by providing a unified API for writing parallel software for CPUs, GPUs, and FPGAs. As the complexity of scientific software and alpaka itself grows, we need to continuously monitor the performance of its modules. The student’s task will be to write performance benchmarks for alpaka in C++. In future, running these benchamerks will be integrated into our CI (Continuous Integration) and Unit Testing systems. A benchmark is a piece of C++ code written against the alpaka API that tests a specific functionality in alpaka, and measures its performance. The benchmarks range in complexity from short functions testing a single isolated feature, similar to a unit test, to complex algorithms that measure the performance of several modules interacting with each other. The project will start with short single-purpose benchmarks which will gradually introduce the student to the basics of using alpaka API and its different modules so that they can tackle complex challenging tasks in the later phases of the project. Students participating in this project will have the opportunity to contribute to programming a state-of-the-art C++ library, engaging directly with a developer group that adheres to the best software programming practices. They will gain hands-on experience in developing and implementing advanced computational solutions within a real-world scientific framework, enhancing their technical skills in high-performance computing and software development within a collaborative and cutting-edge research environment.

Packaging jet substructure observable tools: Repackage the EnergyFlow and Wasserstein tools for modern PyPI and conda-forge. Email the mentors (Matthew Feickert,Henry Schreiner). Date posted 2024-02-08.

In the area of jet substructure observable tools that leverage machine learning applications, the Python packages EnergyFlow and Wasserstein stood out for their use in the broader high energy physics theory and experimental communities. The original project creators and maintainers have left the projects, and though the projects have been placed in a community organization to aid support, the packaging tooling choices have made it difficult to maintain the projects. This project would update the original SWIG-based Python bindings for the project C++ code to use scikit-build-core and pybind11 and update the CI/CD system to allow for fluid building and testing of the packages. The end deliverables of the project would be for the project repositories to have transitioned their packaging systems, updated their CI/CD systems, made new releases to the PyPI package index, and releases the packages to conda-forge.

Packaging the HEP simulation stack on conda-forge: Package HEP tools for conda-forge. Email the mentors (Matthew Feickert). Date posted 2024-02-08.

One common toolchain used in high energy physics for simulation is: MadGraph5_aMC@NLO to PYTHIA8 to Delphes. Installing these tools can be challenging at times, especially for new users. Conda-forge allows for distribution of arbitrarily complex binaries across multiple platforms though the conda/mamba/micromamba/pixi package management ecosystem. As ROOT has been successfully packaged and distributed on conda-forge along with the PYTHIA8 Python bindings it should be possible to package all the components of the HEP simulation stack and distribute them on conda-forge. However, the interconnected nature of some of these tools requires that multiple dependencies are first packaged and distributed before the full stack can be. This project would attempt to package as many of the dependencies of the broader HEP simulation stack on conda-forge as possible starting with those outlined in the HEP Packaging Coordination list and additionally contribute to the maintenance of the ROOT feedstock on conda-forge.

Parameter optimization in ACTS framework: Developing an automatic parameter tuning pipeline for ACTS track reconstruction software framework. Email the mentors (Lauren Tompkins,Rocky Bala garg). Date posted 2024-02-02.

Particle tracking in HEP constitutes a crucial and intricate segment of the overall event reconstruction process. The track reconstruction algorithms involve numerous configuration parameters that require meticulous fine-tuning to accurately accommodate the underlying detector and experimental conditions. Traditionally, these parameters are manually adjusted, presenting an opportunity for improvement. Introducing an automated parameter tuning pipeline can significantly enhance tracking performance. The open-source track reconstruction software framework known as “A Common Tracking Framework (ACTS)” offers a robust R&D platform with a well-organized tracking flow and support for multiple detector geometries. Our previous work within this domain focused on developing derivative-less auto-tuning techniques for track seeding and primary vertexing algorithms. The goal of the current project will be to further advance these efforts by researching and developing an optimization technique that is differentiable across the intricate tracking algorithms. Additionally, this project aims to establish a seamless pipeline that facilitates the application of these optimization techniques with minimal user concerns.

Self-Supervised Approaches to Jet Assignment: Self-Supervised Approaches to Jet Assignment. Email the mentors (Javier Duarte). Date posted 2024-02-01.

Supervised machine learning has assisted various tasks in experimental high energy physics. However, using supervised learning to solve complicated problems, like assigning jets to resonant particles like Higgs bosons, requires a statistically representative, accurate, and fully labeled dataset. With the HL-LHC upgrade [1] in the near future, we will need to simulate an order of magnitude more events with a more complicated detector geometry to keep up with the recorded data [2], facing both budgetary and technological challenges [2, 3]. Therefore, it is desirable to explore how to assign jets to reconstruct particles via self-supervised learning (SSL) methods, which pretrain models on a large amount of unlabeled data and fine-tune those models on a small high-quality labeled dataset. Existing attempts [4-6] to use SSL in HEP focus on performing tasks at the jet or event levels. In this project, we propose to use the reconstruction of Higgs bosons from bottom quark jets as a test case to explore SSL for jet assignment. We will explore different neural network architectures, including PASSWD-ABC [7] for the self-supervised pretraining and SPANet [8, 9] for the supervised fine-tuning. The SSL model’s performance will be compared with a baseline model trained from scratch on the small labeled dataset. We will test if pretraining with diverse objectives [10] improves the model performance on downstream tasks like jet assignment or tagging. The code will be developed open source to help other SSL projects.

[HL-LHC] https://arxiv.org/abs/1705.08830 \ 2. [Computing for HL LHC] https://doi.org/10.1051/epjconf/201921402036 \ 3. [Computing summary] https://arxiv.org/abs/1803.04165 \ 4. [JetCLR] https://arxiv.org/abs/2108.04253 \ 5. [DarkCLR] https://arxiv.org/abs/2312.03067 \ 6. [SSL for new physics] https://doi.org/10.1103/PhysRevD.106.056005 \ 7. [PASSWD-ABC] https://arxiv.org/abs/2309.05728 \ 8. [SPANet1] https://arxiv.org/abs/2010.09206 \ 9. [SPANet2] https://arxiv.org/abs/2106.03898 \ 10. [Pretraining benefits] https://arxiv.org/abs/2306.15063

Improving the Maintainability of the ATLAS Global Sequential Calibration: Refactor parts of the ATLAS Global Sequential Calibration for ATLAS, integrating tests for continuous validation. Email the mentors (Tobias Fitschen). Date posted 2024-02-01.

The Monte Carlo-based Global Sequential Calibration (GSC) is a calibration step intended to correct differences in response between quark- & gluon-initiated jets. It is part of the jet calibration recommendations used by nearly every ATLAS analysis utilising jets. The candidate will contribute towards a more robust framework for semi-automated validation of newly derived calibrations of this kind. They will refactor parts of the code, making it more robust and less error-prone while improving the documentation. Tests will be added at various steps of the procedure to ensure that newly derived calibrations do not introduce any unphysical effects.

Charged-particles reconstruction at Muon Colliders: Charged-particle reconstruction algorithms in future Muon Colliders. Mentee: Samuel Ferraro. Email the mentors (Simone Pagan Griso,Sergo Jindariani). Date posted 2024-01-31.

A muon-collider has been proposed as a possible path for future high-energy physics. The design of a detector for a muon collider has to cope with a large rate of beam-induced background, resulting in an unprecedentedly large multplicity of particles entering the detector that are unrelated to the main muon-muon collision. The algorithms used for charged particle reconstruction (tracking) need to cope with such “noise” and be able to successfully reconstruct the trajectories of the particles of interest, which results in a very large conbinatorial problem that challenges the approaches adopted so far. The project consists of two complementary objectives. In the first one, we will investigate how the tracking algorithms can be improved by utilizing directional information from specially-arranged silicon-detector layers. The second one, that will be the bulk of the project, aims to port the modern track reconstruction algorithms we are using from the older ILCSoft framework to the new Key4HEP software framework, which supports parallel multi-threaded execution of algorithms and will be needed to scale performance to the needs of the Collaboration, validate them and ensure they can be widely used by all collaborators. Improvements and ample space for new creative solutions and optimization allows the fellow to mix acquiring good technical skills and the ability to innovate state-of-the-art tracking algorithms in this less-explored environment.

Data balancing tool for CMS data management: Adapting Rucio data rebalancing daemon for CMS use. Email the mentors (Hasan Ozturk,Panos Paparrigopoulos,Rahul Chauhan,Eric Wayne Vaandering,Dmytro Kovalskyi). Date posted 2024-01-31.

CMS data management relies on the open-source Rucio software framework to manage its data, which is stored at various sites around the world. A common challenge faced by the data management operators team is the overfilling of sites, either due to being targeted by specific physics data or from the gradual accumulation of data over time. Monitoring the capacity status of these sites and rebalancing data as needed is essential. To address such issues, the core Rucio team has developed a rebalancing daemon dedicated to managing data distribution across different sites. Originally designed for specific experiments, this rebalancing daemon contains code that is experiment-specific, limiting its applicability across diverse experiments or communities. Consequently, the main objective of this project is to refine the daemon to be experiment agnostic. Achieving this will enable the daemon to effectively serve not only CMS but also the broader Rucio community. This refinement process involves identifying and modifying any experiment-specific elements within the daemon, ensuring its seamless integration with CMS data management practices. Through this work, CMS will become more agile in maintaining balanced site storage, thereby avoiding scenarios where sites run out of space.

Smart Task Resubmission for CMS Workload Management System: Develop a tool to monitor and make smart decisions on how to resubmit tasks within CMS workflows.. Mentee: Volodymyr Kovalenko. Email the mentors (Hassan Ahmed,Hasan Ozturk,Luca Lavezzo,Jennifer Adelman McCarthy,Zhangqier Wang,Dmytro Kovalskyi). Date posted 2024-01-30.

The CMS experiment runs its data processing and simulation jobs on the Worldwide LHC Computing Grid at the scale of ~100k jobs in parallel. It is inevitable to avoid job failures at this scale, making an effective failure recovery system crucial. The existing algorithm for job resubmission is inefficient, as it does not consider information from other jobs running at the same computing site or belonging to the same physics class. The objective of this project is to migrate the autoACDC script to the new version of Unified, which controls the distribution of workflows across GRID computing clusters. The new “smart” autoACDC should operate with minimal human intervention, automatically selecting resubmission parameters based on error code statistics and previously successful resubmissions gathered from department staff. This tool aims to significantly reduce the manual effort required by the Production and Reprocessing (P&R) team, thus improving operational efficiency and increasing overall throughput.

CI/CD and Automation of Manual Operations: Automate manual operations and implement CI/CD for CMS Production & Reprocessing group.. Email the mentors (Hassan Ahmed,Hasan Ozturk,Luca Lavezzo,Jennifer Adelman McCarthy,Zhangqier Wang,Dmytro Kovalskyi). Date posted 2024-01-30.

The Production & Reprocessing (P&R) group is responsible for maintaining and operating the CMS central workload management system, which processes hundreds of physics workflows daily which produce the data that physicists use in their analyses. The requests which have a similar physics goal are grouped as ‘campaigns’. P&R manages the lifecycle of hundreds of campaigns, each with its unique needs. The objective of this project is to automate the checks that are performed manually on these campaigns. This involves creating a system to automatically set up, verify, and activate new campaigns, along with managing data storage and allocation. The second part of the project focuses on implementing a Continuous Integration and Continuous Deployment (CI/CD) pipeline for efficiently deploying and maintaining software services. This will include converting manual testing procedures into automated ones, improving overall efficiency and reducing errors. Tools such as Gitlab Pipelines for CI/CD, Python for scripting, and various automated testing frameworks will be employed. This initiative is designed to streamline operations, making them more efficient and effective.

Developing an automatic differentiation and initial parameters optimisation pipeline for the particle shower model: Developing an automatic differentiation and initial parameters optimisation pipeline for the particle shower model.. Mentee: Nishank Gite. Email the mentors (Lukas Heinrich,Michael Kagan). Date posted 2023-07-26.

The goal of this project is to develop a differentiable simulation and optimization pipeline for Geant4. The narrow task of this Fellowship project is to develop a trial automatic differentiation and backpropagation pipeline for the Markov-like stochastic branching process that is modeling a particle shower spreading inside a detector material in three spatial dimensions.

Improve testing procedures for prompt data processing at CMS: Improve functional testing before deployment of critical changes for CMS Tier-0. Mentee: Mycola Kolomiiets. Email the mentors (Dmytro Kovalskyi,German Giraldo,Jan Eysermans). Date posted 2023-06-27.

The CMS Tier-0 service is responsible for the prompt processing and distribution of the data collected by the CMS Experiment. Thorough testing of any code or configuration changes for the service is critical for timely data processing. The existing system has a Jenkins pipeline to execute a large-scale “replay” of the data processing using old data for the final functional testing before deployment of critical changes. The project is focusing on integration of unit tests and smaller functional tests in the integration pipeline to speed up testing and reduce resource utilization.

Predict data popularity to improve its availability for physics analysis: Predict data popularity to improve its availability for physics analysis. Mentee: Andrii Len. Email the mentors (Dmytro Kovalskyi,Rahul Chauhan,Hasan Ozturk). Date posted 2023-06-27.

The CMS data management team is responsible for distributing data among computing centers worldwide. Given the limited disk space at these sites, the team must dynamically manage the available data on disk. Whenever users attempt to access unavailable data, they are required to wait for the data to be retrieved from permanent tape storage. This delay impedes data analysis and hinders the scientific productivity of the collaboration. The objective of this project is to create a tool that utilizes machine learning algorithms to predict which data should be retained, based on current usage patterns.

Towards an end-end ML event reconstruction algorithm: Towards an end-end ML event reconstruction algorithm. Email the mentors (Javier Duarte,Farouk Mokhtar,Joosep Pata). Date posted 2023-05-25.

The most common approach to reconstructing full scale events at general purpose detectors (such as ATLAS and CMS) is a particleflow-like algorithm [1–3] which combines low-level information (PF-elements) from different sub-detectors to reconstruct stable particles (PF-candidates). Attempts are being developed to improve upon the current PF-algorithm with machine learning (ML) methods and improving particle reconstruction in general. Within CMS, work has been conducted [4] to develop a graph neural network (GNN) model by the name MLPF [5] to reproduce the existing PF algorithm. The proposed MLPF algorithm has the advantage of running on heterogeneous computing architectures and may be efficient when scaling up to accommodate the high luminosity LHC upgrade. The current status of the MLPF model works by reconstructing high level PF-candidates, in a supervised fashion [6], from low level PF-elements. These “low level inputs’’ have already gone through several steps of reconstruction such as track reconstruction, and energy clustering. Although each reconstruction step goes through several layers of development and validation, the optimization of each step is done independently which is not ideal since, for example, the tracking and clustering steps may not be well optimized for the final task if it is done separately from the full particle reconstruction. It is therefore interesting to explore an ML-based algorithm capable of performing the reconstruction in one-shot, in an end-end fashion. The first step towards this, is to train an MLPF-like model which takes as input the reconstructed tracks and, instead of the calorimeter clusters, the calorimeter cell energy deposits. The project will entail the exploration of the CLIC dataset published here [7] which contains hit-based information in the form of calorimeter energy deposits. Furthermore, the training of a GNN-based model, and possibly other models, and comparing with the current state-of-the-art results of the cluster-based model.

[1] https://www.sciencedirect.com/science/article/abs/pii/0168900295001387?via%3Dihub \ [2] https://arxiv.org/abs/1706.04965 \ [3] https://arxiv.org/abs/1703.10485 \ [4] https://github.com/jpata/particleflow \ [5] https://arxiv.org/abs/2101.08578 \ [6] https://arxiv.org/abs/2303.17657 \ [7] https://zenodo.org/records/8414225

Refactoring Uproot’s AwkwardForth implementation: Keeping the functionality of Uproot’s accelerated reading through AwkwardForth, but making it more maintainable by removing mutable state/coding it in a functional style. . Mentee: Seth Bendigo. Email the mentors (Ioana Ifrim,Jim Pivarski). Date posted 2023-05-24.

Uproot is a Python library for reading and writing ROOT files (the most common file format in particle physics). While it is relatively fast at reading “columnar” data, either arrays of numbers or arrays of numbers that are grouped into variable-length lists, any other data type requires iteration, which is a performance limitation in the Python language. (“for” loops in Python are 100’s of times slower than in compiled languages.) To improve this situation, we introduced a domain-specific language (DSL) called AwkwardForth, in which loops are much faster to execute than they are in Python (factors of 100’s again). This language was created in 2021 (https://arxiv.org/abs/2102.13516) and added to Uproot in 2022 (https://arxiv.org/abs/2303.02202). In the end, an example data structure (std::vector<std::vector>) could be read 400× faster with AwkwardForth than with Python. Users of Uproot don't have to opt in or change their code, it just runs faster. That would be the end of the story, except that the AwkwardForth-generating code in Uproot has been very hard to maintain. In part, it's because it's doing something complicated: generating code that runs later or generating code that generates code that runs later. But it is also more complicated than it needs to be, with Python objects that change their own attributes in arbitrary ways as information about what AwkwardForth needs to be generated accumulates. The code would be much easier to read and reason about if it were stateless or append-only (see: functional programming), and it easily could be. This project would be to restructure the AwkwardForth-generating code in a functional style, to "remove the moving parts." To be clear, the project will not require you to understand the AwkwardForth that is being generated (though that's not a bad thing), and it will not require you to figure out how to generate the right AwkwardForth for a given data type. This part of the problem has been solved and there are many unit tests that can check correctness, to allow you to do test-driven development. The project is about software engineering: how to structure code so that it can be read and understood, while keeping the problem-solving aspect unchanged.

Snakemake backend for RECAST: Implement a Snakemake backend for RECAST workflows. Mentee: Andrii Povsten. Email the mentors (Matthew Feickert,Lukas Heinrich). Date posted 2023-05-24.

RECAST is a platform for systematic interpretation of LHC searches. It reuses preserved analysis workflows from the LHC experiments, which is now possible with containerization and tools such as REANA. When RECAST was first implemented for ATLAS a workflow language with sufficient Linux container support was not available and the yadage workflow system was created. In the years since yadage was created the Snakemake workflow management system has become increasingly popular in the broader scientific community and has developed mature Linux container support. This project would aim to implement another RECAST backend using Snakemake as a modern alternative to yadage.

Recasting of IRIS-HEP Analysis Grand Challenge: Implement the CMS open data AGC analysis with RECAST and REANA. Email the mentors (Matthew Feickert,Tibor Simko,Kyle Cranmer,Lukas Heinrich). Date posted 2023-05-24.

RECAST is a platform for systematic interpretation of LHC searches. It reuses preserved analysis workflows from the LHC experiments, which is now possible with containerization and tools such as REANA. A yet unrealized component of the IRIS-HEP Analysis Grand Challenge (AGC) is reuse and reinterpretation of the analysis. This project would aim to preserve the AGC CMS open data analysis and the accompanying distributed infrastructure and implement a RECAST workflow allowing REANA integration with the AGC. A key challenge of the project is creating a preservation scheme for the associated Kubernetes distributed infrastructure.

Intelligent Caching for HSF Conditions Database: investigate patterns in conditions database accesses. Mentee: Ernest Sorokun. Email the mentors (Lino Gerlach). Date posted 2023-05-22.

Conditions data refers to additional information collected in particle physics experiments beyond what is recorded directly by the detector. This data plays a critical role in many experiments, providing crucial context and calibration information for the recorded measurements. However, managing conditions data poses unique challenges, particularly due to the high access rates involved. The High-Energy Physics Software Foundation (HSF) has developed an experiment-agnostic approach to address these challenges, which has already been successfully deployed for the sPHENIX experiment at the Brookhaven National Laboratory (BNL). The project focuses on investigating the access patterns for this conditions database to gather insights that will enable the development of an optimized caching system in the future. Machine Learning may be used for pattern recognition.

Rust interfaces to I/O libraries: Explore the addition of a Rust interface to the PODIO library.. Email the mentors (Benedikt Hegner). Date posted 2023-05-18.

The project’s main goal is to prototype a Rust interface to the PODIO library. Data models in PODIO are declared with a simple programming-language agnostic syntax. In this project we will explore how the PODIO data model concepts can be mapped best onto Rust. At the same time, we will investigate how much Rust’s macro system can support the function of PODIO. For the other supported languages Python, C++ and (experimentally) Julia PODIO generates all of the source code. With proper usage of Rust macros this code generation could be kept to a minimum.

Julia interface to PODIO: Add Julia Interface to the PODIO library.. Email the mentors (Benedikt Hegner,Graeme Stewart). Date posted 2023-05-18.

The project’s main goal is to add a proper Julia back-end to PODIO. A previous Google Summer of Code project worked on an early prototype, which showed the feasibility. The aim is to complete the feature set, and do thorough (performance) testing afterwards.

Measuring energy consumption of HEP software on user analysis facilities: Implementing energy consumption benchmarks on different analysis platforms and facilities. Email the mentors (Caterina Doglioni). Date posted 2023-05-10.

Benchmarks for software energy consumption are starting to appear (see e.g. the SCI score) alongside more common performance benchmarks. In this project, we will pilot the implementation of selected software energy consumption benchmarks on two different facilities for user analysis:

the Virtual Research Environment, a prototype analysis platform for the European Open Science Cloud.
Coffea-casa, a prototype Analysis Facility (AF), which provides services for “low-latency columnar analysis.” We will then test them with simple user software pipelines. The candidate will work in collaboration with another IRIS-HEP fellow investigating energy consumption benchmarks for ML algorithms, and alongside a team of students and interns working on the selection and implementation of the benchmarks.

Estimating the energy cost of ML algorithms: Test and validate power consumption estimates for a ML algorithm for data compression. Mentee: Leonid Didukh. Email the mentors (Caterina Doglioni,Alexander Ekman). Date posted 2023-05-10.

Baler is a machine-learning based lossy data compression tool. In this project, we will review the literature concerning energy consumption for ML models and obtain estimates of the energetic cost of training and hyperparameter optimization following the guidelines of the Green Software Foundation. The candidate will work in a team of students and interns who are developers of the compression tool, and will be able to suggest improvements and insights towards Baler’s energy efficiency.

Rucio-S3-compatible access interface for analysis facilities: Add S3 compatible access interface to Rucio. Mentee: Kyrylo Meliushko. Email the mentors (Lukas Heinrich,Matthew Feickert,Mario Lassnig). Date posted 2023-05-07.

Rucio is an open source software framework that provides functionality to scientific collaborations to organize, manage, monitor, and access their distributed data and data flows across heterogeneous infrastructures. Rucio was originally developed to meet the requirements of the high-energy physics experiment ATLAS, and is continuously enhanced to support diverse scientific communities. Since 2016 Rucio has orchestrated multiple Exabytes of data access and data transfers globally. With this project we seek to enhance Rucio to support a new mechanism for analysis facilities, which are oriented towards object stores in order to provide a portable destination for HEP analyzers to store data products produced in their research in a portable, shareable and standardized way.

Point-cloud diffusion models for TPCs: Build, train, and tune a point-cloud diffusion model for the Active-Target Time Projection Chamber. Mentee: Artem Havryliuk. Email the mentors (Michelle Kuchera). Date posted 2023-05-07.

This project aims to build a conditional point-cloud diffusion model to simulate detector response to events in time projection chambers (TPCs), with a focus on the Active-Target Time Projection Chamber (AT-TPC). Current analysis and simulation software is currently unable to model the noise signature in the AT-TPC. Prior explorations have successfully modeled this noise in a cycle-GAN, but in a lower-dimensional space than the detector resolution. The IRIS-HEP project will be completed in collaboration with the ALPhA research group at Davidson College. The applicant should have a foundation in ML and a comfort in python. This is a newer type of ML deep learning generative architecture, which will require the applicant to read, implement, and modify approaches taken in ML literature using pytorch or tensorflow.

Testing Real Time Analysis at LHCb using 2022 data: Testing the Real Time Analysis at LHCb using data collected in 2022.. Mentee: Nazar Semkiv. Email the mentors (Michele Atzeni,Eluned Anne Smith). Date posted 2023-05-06.

In this project, the candidate will use data collected at the LHCb experiment in 2022 to test and compare the different machine learning algorithms used to select events, in addition to probing the performance of the new detector more generally. They will first develop a BDT-based machine learning algorithm to cleanly select the decay candidates of interest in 2022 data. Statistical subtraction techniques combined with likelihood estimations will be used to remove any remaining background contamination. The decay properties of these channels, depending on which machine learning algorithm was employed in the trig- ger, will then be compared and their agreement with simulated events examined. If needed, the existing trigger algorithms may be altered and improved. If time allows, the student will also have a first look at the angular distributions of the calibration channels being studied. Understanding such angular distributions are vital to the LHCb physics program.

Data Popularity, Placement Optimization and Storage Usage Effectiveness: Data Popularity, Placement Optimization and Storage Usage Effectiveness at the Data Center. Mentee: Avi G. Kaufman. Email the mentors (Vincent Garonne). Date posted 2023-04-12.

The goal of this project is to take data management to the next level by employing machine learning methods to create a precise data use prediction model. This model applied to data placement decisions can bring important benefits both 1) to data center in dealing with large amounts of “cold,” or unused data, which can potentially become “hot”, or popular and heavily used, and 2) to scientists, by enabling them to access their data more quickly. Deploying such models in production have many challenges and pitfalls like the accuracy of the predictions at different scales.

Improving the Analysis Grand Challenge (AGC) machine learning training workflow: Improving the Analysis Grand Challenge (AGC) machine learning training workflow. Mentee: Con Muangkod. Email the mentors (Elliott Kauffman,Alexander Held,Oksana Shadura). Date posted 2023-04-07.

The Analysis Grand Challenge (AGC) is performing the last steps in an analysis pipeline at scale to test workflows envisioned for the HL-LHC. Since ML methods have become so widespread in analysis and these analyses also need to be scaled up for HL-LHC data, ML training and inference were also integrated into the AGC analysis pipeline. The goal of this project is to improve current the ML training implementation with Kubeflow Pipelines, an open source platform for implementing MLOps, providing a framework for building, deploying, and managing machine learning workflows.

Julia for the Analysis Grand Challenge: Implement an analysis pipeline for the Analysis Grand Challenge (AGC) using JuliaHEP ecosystem.. Mentee: Atell-Yehor Krasnopolski. Email the mentors (Jerry Ling,Alexander Held). Date posted 2023-04-07.

The project’s main goal is to implement AGC pipeline using Julia to demonstrate usability and as a test of performance. New utility packages can be expected especially for systematics handling and out-of-core orchestration. (built on existing packages such as FHist.jl and Dagger.jl) At the same time, the project can explore using RNTuple instead of TTree for AGC data storage. As the interface is exactly transparent, this goal mainly requires data conversion unless performance bugs are spotted. This will be help inform transition at LHC experiments in near future (Run 4).

Development of high level trigger (HLT1) lines to detect long lived particles at LHCb.: Development of trigger lines to detect long-lived particles at LHCb. Mentee: Volodymyr Svintozelskyi. Email the mentors (Arantza Oyanguren,Brij Kishor Jashal). Date posted 2023-03-29.

The project focuses on developing trigger lines to detect long-lived particles by utilizing recently developed reconstruction algorithms within the Allen framework. These lines will employ the topologies of SciFi seeds originating from standard model long-lived particles, such as the strange Λ0. Multiple studies based on Monte Carlo (MC) simulations will be conducted to understand the variables of interest that can be used to select events. The development of new trigger lines faces limitations due to output trigger rate throughput constraints of the entire HLT1 system. An essential aspect of the project is the physics performance and the capability of the trigger lines for long-lived particle detection. The student will develop and employ ML/AI techniques that could significantly boost performance while maintaining control over execution time. The physics performance of the new trigger lines will be tested using specific decay channels from both Standard Model (SM) transitions and novel processes, such as dark bosons. Additionally, 2023 data from collisions collected during Run 3 will be used to commission these lines.

To develop microservice architecture for CMS HTCondor Job Monitoring: To develop microservice architecture for CMS HTCondor Job Monitoring. Email the mentors (Brij Kishor Jashal). Date posted 2023-03-29.

Current implementation of HTCondor Job Monitoring, internally known as Spider service, is a monolithic application which query HTCondor Schedds periodically. This implementation does not allow deployment in modern Kubernetes infrastructures with advantages like auto-scaling, resilience, self-healing, and so on. However, it can be separated into microservices responsible for “ClassAds calculation and conversion to JSON documents”, “transmitting results to ActiveMQ and OpenSearch without any duplicates” and “highly durable query management”. Such a microservice architecture will allow the use of appropriate languages like GoLang when it has advantages over Python. Moreover, intermediate monitoring pipelines can be integrated into this microservice architecture and it will drop the work-power needed for the services that produce monitoring outcomes using HTCondor Job Monitoring data

Development of simulation workflows for BSM LLPs at LHCb: Development of simulation workflows for BSM LLPs at LHCb. Mentee: Valerii Kholoimov. Email the mentors (Arantza Oyanguren,Brij Kishor Jashal). Date posted 2023-03-29.

In this project, the candidate will become familiar with Monte Carlo (MC) simulations in the LHC environment, using generators such as Pythia and EvtGen, and will learn to work within the LHCb framework. They will learn to generate new physics scenarios concerning long-lived particles, particularly dark bosons and heavy neutral leptons. The software implementation of these models will be of general interest to the entire LHC community. Moreover, these simulations will be used to assess the physics performance of various trigger algorithms developed within the Allen framework and to determine LHCb’s sensitivity to detecting these types of particles. New particles in a wide mass range, from 500 to 4500 MeV, and lifetimes between 10 ps and 2 ns, will be generated and analyzed to develop pioneering selection procedures for recognizing these signals during Run 3 of LHC data collection.

Horizontal scaling of HTTP caches: learn how to automate load-balancing with Kubernetes. Email the mentors (Brian Bockelman,Brian Lin,Mátyás Selmeci). Date posted 2023-03-28.

The OSG offers an integrated software stack and infrastructure used by the High-Energy Physics community to meet their computational needs. Frontier Squid is part of this software stack and acts as an HTTP proxy, caching requests to improve network usage. We seek a fellow to turn our existing single cache Kubernetes deployment into one that can scale horizontally using the same underlying storage for its cache.

Software development for the Rucio Scientific Data Management system: Rucio core developments for large-scale data management. Mentee: Lev Pambuk. Email the mentors (Mario Lassnig,Martin Barisits). Date posted 2023-03-17.

The Rucio system is an open and community-driven data management system for data organisation, management, and access of scientific data. Several communities and experiments have adopted Rucio as a common solution, therefore we seek a dedicated software engineer to help implement much wished for features and extensions in the Rucio core. The selected candidate will focus on producing software for several Rucio components. There is a multitude of potential topics, also based on the candidate’s interests, that can be tackled. Examples include, but are not limited to (1) integrate static type checking capabilities into the framework as well as improve its runtime efficiency, (2) continue the documentation work for automatically generated API and REST interface documentation, (3) evolve the Rucio Upload and Download clients to new complex workflows suitable to modern analysis, (4) continue the development work on the new Rucio Web User Interface, and many more. The selected candidate will participate in a large distributed team using best industry practices such code review, continuous integration, test-driven development, and blue-green deployments. It is important to us that the candidate bring their creativity to the team, therefore we encourage them to also help with developing and evaluating new ideas and designs.

Geometric deep learning for high energy collision simulations: Simulate the detector response to high energy particle collisions using graph neural networks. Email the mentors (Javier Duarte,Raghav Kansal). Date posted 2023-03-10.

Geometric deep learning and graph neural networks (GNNs) have proven to be especially successful for machine learning tasks such as classification and reconstruction for high energy physics (HEP) data, such as jets and calorimeter showers produced at the Large Hadron Collider. This project extends their application to the computationally costly task of simulations, building off of existing work in this area [1]. Possible and complementary research directions are: (1) conditional generation of jets using auxiliary conditioning networks [2-3]; (2) application to generator-level showering and hadronization simulations, and investigating merging schemes with hard matrix elements; (3) developing an attention-based architecture, which has recently found success in jet classification [4]; (4) applications to shower datasets [5-6]. [1] https://arxiv.org/abs/2106.11535 \ [2] https://arxiv.org/abs/1610.09585 \ [3] https://cds.cern.ch/record/2701779/files/10.1051_epjconf_201921402010.pdf \ [4] https://arxiv.org/abs/2202.03772 \ [5] https://zenodo.org/record/3603086 \ [6] https://calochallenge.github.io/homepage/

Enabling Advanced Network and Infrastructure Alarms: Enabling advanced network problem detection for the science community.. Mentee: Yana Holoborodko. Email the mentors (Shawn McKee,Ilija Vukotic). Date posted 2023-03-09.

Research and Education networks are critical for modern, distributed scientific infrastructures. Networks enable data and services to operate across data centers and across the world. The IRIS-HEP/OSG-LHC team has members working on network measurement, analytics and pre-emptive problem identification and localization and would like to involve a student or students interested in data science, monitoring or analytics to participate in our work. The team has assembled a rich, unique dataset, consisting of network-specific metrics, statistics and other measurements which are collected by various tools and systems. In addition, we have developed simple functions to create alarms identifying some types of problems. This project is intended to expand and augment the existing simple alarms with new alarms based upon the extensive data we are collecting. As we examine the data in depth, we realize there are important indicators of problems both in our networks as well as in our network monitoring infrastructure. Interested students would work with our data using tools like Elasticsearch, Kibana and Jupyter Notebooks to first understand the types of data being collected and then use that knowledge to create increasingly powerful alarms which clearly identify specific problems. The task is to maximize the diagnostic range and capability of our alarms to proactively identify problems before they impact scientists who rely on these networks or impact our network measurement infrastructure’s ability to gather data in the first place. The student will be expected to participate in a weekly group meeting focused on network measurement, analytics, monitoring and alarming, which will provide a venue to discuss and learn about concepts, tools and methodologies relevant to the project. The project goal is to create improved alerting and alarming related to both the research and education networks used by HEP, WLCG and OSG communities and the infrastructure we have created to measure and monitor it.

Machine Learning on Network Data for Problem Identification: Machine learning for network problem identification.. Mentee: Maxym Naumchyk. Email the mentors (Shawn McKee,Petya Vasileva). Date posted 2023-03-09.

Research and Education networks are critical for modern, distributed scientific infrastructures. Networks enable data and services to operate across data centers and across the world. The IRIS-HEP/OSG-LHC team has members working on network measurement, analytics and pre-emptive problem identification and localization and would like to involve a student or students interested in data science, machine learning or analytics to participate in our work. The team has assembled a rich, unique dataset, consisting of network-specific metrics, statistics and other measurements which are collected by various tools and systems. In addition, we have developed simple functions to create alarms identifying some types of problems. Interested students would work with pre-prepared datasets, annotated via our existing alarms, to train one or more machine learning algorithms and then use the trained algorithms to process another dataset, comparing results with the sample alarm method. The task is to provide a more effective method of identifying certain types of network issues using machine learning so that such problems can be quickly resolved before they impact scientists who rely on these networks. The student will be expected to participate in a weekly group meeting focused on network measurement, analytics, monitoring and alarming, which will provide a venue to discuss and learn about concepts, tools and methodologies relevant to the project. The project goal is to create improved user-facing alerting and alarming related to the research and education networks used by HEP, WLCG and OSG communities.

PV-Finder ACTS example: Adapting a Machine Learning Algorithm for use in ACTS. Mentee: Layan AlSarayra. Email the mentors (Lauren Tompkins,Rocky Bala garg,Michael Sokoloff). Date posted 2023-03-08.

PV-Finder is a hybrid deep learning algorithm designed to identify the locations of proton-proton collisions (primary vertices) in the Run 3 LHCb detector. The underlying structure of the data and the approach to learning the locations of primary vertices may be useful for other detectors, including those at the LHC, such as ATLAS. The algorithm is approximately factorizable. Starting with reconstructed tracks, a kernel density estimator (KDE) can be calculated by a hand-written algorithm that reduces sparse point clouds of three dimensional track data to rich one dimensional data sets amenable to processing by a deep convolutional network. This is called a kde-to-hist algorithm and its predicted histograms are easily interpreted by a heuristic algorithm. A separate tracks-to-kde algorithm uses track parameters evaluated at their points of closest approach to the beamline as input features and predicts an approximation to the KDE. These two algorithms can be merged and the combined model trained to predict the easily interpreted histograms directly from track information. The incumbent will work under the joint supervision of Rocky Bala Garg and Lauren Tompkins (ATLAS physicists and ACTS developers, Stanford), and Michael Sokoloff (an LHCb physicist, Cincinnati) to adapt the pv-finder algorithms to process data generated by ACTS rather than LHCb.

Refactoring fastjet with Awkward LayoutBuilder: Replacing the fastjet implementation with safe, maintainable LayoutBuilder while retaining its interface. Email the mentors (Javier Duarte,Jim Pivarski). Date posted 2023-03-08.

fastjet is a Python interface to the popular FastJet particle-clustering package, which is written in C++. fastjet is unique in that it offers a vectorized interface to FastJet’s algorithms, allowing Python users to analyze many collision events in a single function call, avoiding the overhead of Python iteration. Collections of particles and jets with different lengths per event are managed by Awkward Array. Although the fastjet package functions and is currently used in HEP analysis, its Python-C++ interface predates LayoutBuilder, which simplifies the construction of Awkward Arrays in C++, is easier to maintain, and avoids the dangers of raw array handling. This project would be to refactor fastjet to use the new abstraction layer, maintaining its well-tested interface, and possibly adding new algorithms and functionality new algorithms and functionality, such as jet groomers and other transformations.

Dask in a HEP Analysis Facility at Scale: How fast can large-scale HEP data analysis be performed using Dask and Awkward Arrays?. Email the mentors (Alexander Held,Oksana Shadura,Nick Smith,Jim Pivarski,Ianna Osborne). Date posted 2023-03-08.

Coffea-casa is a prototype Analysis Facility (AF), which provides services for “low-latency columnar analysis.” Dask is an industry-standard scale-out mechanism for array-oriented data processing in Python, and Dask capabilities were recently added to Uproot and Awkward Array, making it possible to analyze jagged particle physics data from ROOT files as Dask arrays for the first time. The aim of this project is to determine how well these capabilities scale in a real AF environment. The project could be divided in the multiple sub-steps (depending on the timeline): adding CPU-intence CMS systematics to defined analysis workflow an/or perform the performance testing the uproot.dask and dask-awkward implementations with defined workload, learning best-practices, and performance-testing/tuning in single and multi-user environments.

Analysis Grand Challenge with ATLAS PHYSLITE data: Create an columnar analysis prototype using ATLAS PHYSLITE data. Mentee: Denys Klekots. Email the mentors (Vangelis Kourlitis,Alexander Held,Matthew Feickert). Date posted 2023-02-28.

The IRIS-HEP Analysis Grand Challenge (AGC) is a realistic environment for investigating how high energy physics data analysis workflows scale to the demands of the High-Luminosity LHC (HL-LHC). It captures relevant workflow aspects from data delivery to statistical inference. The AGC has so far been based on publicly available Open Data from the CMS experiment. The ATLAS collaboration aims to use a data format called PHYSLITE at the HL-LHC, which slightly differs from the data formats used so far within the AGC. This project involves implementing the capability to analyze PHYSLITE ATLAS data within the similar to AGC workflow, the columnar analysis prototype, and optimizing the related performance under large volumes of data. In addition to this, the evaluation of systematic uncertainties for ATLAS with PHYSLITE is expected to differ in some aspects from what the AGC has considered thus far. This project will also investigate workflows to integrate the evaluation of such sources of uncertainty for ATLAS.

ROOT’s RDataFrame for the Analysis Grand Challenge: Develop and test an analysis pipeline using ROOT’s RDataFrame for the next iteration of the Analysis Grand Challenge. Mentee: Andrii Falko. Email the mentors (Enrico Guiraud,Alexander Held). Date posted 2023-02-20.

The IRIS-HEP Analysis Grand Challenge (AGC) aims to develop examples of realistic, end-to-end high-energy physics analyses, as well as demonstrate the advantages of modern tools and technologies when applied to such tasks. The next iteration of the AGC (v2) will put the capabilities of modern analysis interfaces such as Coffea and ROOT’s RDataFrame under further test, for example by including more complex systematic variations and sophisticated machine learning techniques. The project consists in the investigation and implementation of such new developments in the context of RDataFrame as well as their benchmarking on state-of-the-art analysis facilities. The goal is to gain insights useful to guide the future design of both the analysis facilities and the applications that will be deployed on them.

Augmenting Line-Segment Tracking with Graph Neural Network: Leveraging Graph Neural Network to utilize graph input data produced by Line-Segment Tracking. Mentee: Povilas Pugzlys. Email the mentors (Philip Chang). Date posted 2023-02-18.

The increase of the pile-up in the upcoming HL-LHC will present a challenge to event reconstruction for the CMS experiment. The single largest contribution to the total reconstruction time comes from charged-particle tracking. Without algorithm innovation, the projected charged-particle reconstruction timing is projected to exponentially increase. This increase in timing in combination with the fact that the computational performance of single thread processors is plateauing, CMS Collaboration estimates that without algorithmic innovation the computing resource requirement will hit a factor 2 to 5 over the projected computing capabilities. This can seriously hinder physicists to publish timely scientific results. This motivates a new approach in tracking to develop a new algorithm that are parallel in nature to alleviate problems of combinatorics, and also can leverage industry advancements in parallel computing such as the GPUs. In light of this, Line-Segment Tracking project started. Line-Segment Tracking (LST) project leverages the CMS outer-tracker’s doublet modules to build mini-doublets (a pair of hits in each layer of the doublet layer) in parallel, and subsequently build line-segments via connecting consistent pair of mini-doublets across different logical layers of the tracker, all done on high-performance GPUs. Eventually, the line-segments are linked together iteratively to form a long chain of line-segments to produce list of track candidates. The parallel nature of the LST algorithm allows the algorithm to naturally lends itself for GPU usage. The project has produced on-par performance with the existing tracking alternatives, and have been integrated to central CMS Software as a step towards production. As LST algorithm creates line-segments and links them to create track candidates, a graph representation of hits and linking between them is naturally obtained. In other words, LST can also be thought of as a fast graph producing algorithm. The project will take the graph data and develop GNN models that classify linkings. We plan to integrate the GNN model to the LST algorithm to augment its capability to produce high-quality track candidates at a shorter time while keeping the same or better tracking performance. Also, a solution for a “one-shot” linking of long chains of line-segments in one algorithm instead of through iteration will also be studied. Estimated Timeline: Week 1/2: Understanding the preliminary LST GNN workflow for Line Segment classification Week 3: Creating example of running the Line Segment classification inference on C++ environment with TorchScript Week 4/5: Integrating the inference with LST’s CUDA code to run the inference on GNN Week 5: Validating the implementation in the LST framework Week 6/7: Performing optimization of utilizing the GNN inferences to measure performance gain in the efficiency metric of LST framework (i.e. efficiency, fake rate, and duplicate rate) Week 8/9: Perform large scale hyperparameter optimization to find best resulting model architectecture Week 10/11: Perform research and development of extending the ability to classify Triplets, and beyond, with the Line Graph transformation approach, which would enable “one-shot” inference Week 12: Wrap up the project, document and summarize the findings to allow for next steps

GNN Tracking: Reconstruct the trajectories of particle with graph neural networks. Mentee: Refilwe Bua. Email the mentors (Kilian Lieret,Gage deZoort). Date posted 2023-02-14.

In the GNN tracking project, we use graph neural networks (GNNs) to reconstruct trajectories (“tracks”) of elementary particles traveling through a detector. This task is called “tracking” and is different from many other problems that involve trajectories:

there are several thousand particles that need to be tracked at once,
there is no time information (the particles travel too fast),
we do not observe a continuous trajectory but instead only around five points (“hits”) along the way in different detector layers.

The task can be described as a combinatorically very challenging “connect-the-dots” problem, essentially turning a cloud of points (hits) in 3D space into a set of O(1000) trajectories. Expressed differently, each hit (containing not much more than the x/y/z coordinate) must be assigned to the particle/track it belongs to.

A conceptually simple way to turn this problem into a machine learning task is to create a fully connected graph of all points and then train an edge classifier to reject any edge that doesn’t connect points that belong to the same particle. In this way, only the individual trajectories remain as components of the initial fully connected graph. However, this strategy does not seem to lead to perfect results in practice. The approach of this project uses this strategy only as the first step to arrive at “small” graphs. It then projects all hits into a learned latent space with the model learning to place hits of the same particle close to each other, such that the hits belonging to the same particle form clusters.

The project code together with documentation and a reading list is available on github and uses pytorch geometric. See also our GSoC proposal for the same project, which lists prerequisites and possible tasks.

REANA workflow for Dark Matter Searches: Implement a REANA workflow for dark matter searches at RCFM. Mentee: Sambridhi Deo. Email the mentors (Matthew Feickert,Lukas Heinrich,Amy Roberts,Giordon Stark). Date posted 2023-02-14.

REANA is a platform for reproducible data analysis workflows that can be run at scale. REANA has been used extensively for running containerized workflows of LHC experiments, like ATLAS, and for reinterpretation of published analyses. This project would aim to implement a REANA workflow for a galaxy rotation-curve fitting analysis (RCFM) to improve replicability and to provide a starting point for future work.

CMS RECAST Example: Implement a CMS analysis with RECAST and REANA. Email the mentors (Kyle Cranmer,Matthew Feickert,Clemens Lange). Date posted 2023-02-14.

RECAST is a platform for systematic interpretation of LHC searches. It reuses preserved analysis workflows from the LHC experiments, which is now possible with containerization and tools such as REANA. Recently, IRIS-HEP and the HEP Software Foundation (HSF) supported an analysis preservation bootcamp at CERN teaching these tools. Furthermore, the ATLAS experiment is now actively using RECAST. We seek a member of CMS to incorporate a CMS analysis into the system with support from IRIS-HEP, REANA, and RECAST developers. The analysis can be done using CMS internal data or CMS open data and would use CMS analysis tooling.

Charged-particles reconstruction at Muon Colliders: Simulation and Charged-particle reconstruction algorithms in future Muon Colliders. Mentee: Chris Sellgren. Email the mentors (Simone Pagan Griso,Sergo Jindariani). Date posted 2023-02-08.

A muon-collider has been proposed as a possible path for future high-energy physics. The design of a detector for a muon collider has to cope with a large rate of beam-induced background, resulting in an unprecedentedly large multplicity of particles entering the detector that are unrelated to the main muon-muon collision. The algorithms used for charged particle reconstruction (tracking) need to cope with such “noise” and be able to successfully reconstruct the trajectories of the particles of interest, which results in a very large conbinatorial problem that challenges the approaches adopted so far. The project consists of two complementary objectives. In the first one, we will investigate how the tracking algorithms can be improved by utilizing directional information from specially-arranged silicon-detector layers; this requires improving the simulation of the detector as well. The second one, aims to port the modern track reconstruction algorithms we are using from the older ILCSoft framework to the new Key4HEP software framework, which supports parallel multi-threaded execution of algorithms and will be needed to scale performance to the needs of the Collaboration, validate them and ensure they can be widely used by all collaborators.

Bayesian Analysis with pyhf: Build a library on top of the pyhf Python API to allow for Bayesian analysis of HistFactory models. Mentee: Malin Horstmann. Email the mentors (Matthew Feickert,Lukas Heinrich). Date posted 2023-02-03.

Collider physics analyses have historically favored Frequentist statistical methodologies, with some exceptions of Bayesian inference in LHC analyses through use of the Bayesian Analysis Toolkit (BAT). As HistFactory’s model construction allows for creation of models that can be interpreted as having Bayesian priors, HistFactory models built with pyhf can be used for both Frequentist and Bayesian analyses. The project goal is to build a library on top of the pyhf Python API to allow for Bayesian analysis of HistFactory models using the PyMC Python library and leverage pyhf’s automatic differentiation and hardware acceleration through its JAX computational backend. Validation tests of results will be conducted against the BAT and LiteHF Julia libraries. If time permits, work on integrating the functionality into pyhf would be possible, though it would not be expected to be completed in this Fellow project. Applicants are expected to have strong working experience with Python and basic knowledge of statistical analysis.

Interactive C++ for Machine Learning: Interfacing Cling and PyTorch together to facilitate\ Machine Learning workflows. Email the mentors (David Lange,Vassil Vasilev). Date posted 2023-02-03.

Cling is an interactive C++ interpreter, built on top of Clang and the LLVM compiler infrastructure. Cling realizes the read-eval-print loop (REPL) concept, in order to leverage rapid application development. Implemented as a small extension to LLVM and Clang, the interpreter reuses its strengths such as the praised concise and expressive compiler diagnostics. The LLVM-based C++ interpreter has enabled interactive C++ coding environments, whether in a standalone shell or a Jupyter notebook environment in xeus-cling.

This project aims to demonstrate that interactive C++ is useful with data analysis tools outside of the field of HEP. For example, prototype implementations for Pytorch have been successfully completed. See this link

The project deliverables are:

Demonstrate that we can build and use PyTorch programs with
Cling in Jupyter notebook.
Develop several non-trivial ML tutorials using interactive C++
in Jupyter
Experiment with automatic differentiation via Clad of PyTorch codes Candidate requirements:
Experience with C++, Jupyter notebooks and Conda are desirable
Experience with ML Interest in exploring the intersection of data
science and interactive C++.