Open projects for IRIS-HEP Fellows (and more)

This page lists a number of known software R&D projects of interest to IRIS-HEP researchers. (This page will be updated from time to time, so check back and reload to see if new projects have been added.) Contact the mentors for more information about any of these projects! Be sure you have read the guidelines.

You can also find open projects for other programs or of general interest. Use the pulldown menus to select projects based on their attributes. Projects may instead specify that they are appropriate for multiple options (or any option) for a given attribute. Click the triangle next to each project for more information (if provided by the project mentors).

Selected projects

Enabling Advanced Network and Infrastructure Alarms: Enabling advanced network problem detection for the science community. Email the mentors (Shawn McKee,Ilija Vukotic)

Research and Education networks are critical for modern, distributed scientific infrastructures. Networks enable data and services to operate across data centers and across the world. The IRIS-HEP/OSG-LHC team has members working on network measurement, analytics and pre-emptive problem identification and localization and would like to involve a student or students interested in data science, monitoring or analytics to participate in our work. The team has assembled a rich, unique dataset, consisting of network-specific metrics, statistics and other measurements which are collected by various tools and systems. In addition, we have developed simple functions to create alarms identifying some types of problems. This project is intended to expand and augment the existing simple alarms with new alarms based upon the extensive data we are collecting. As we examine the data in depth, we realize there are important indicators of problems both in our networks as well as in our network monitoring infrastructure. Interested students would work with our data using tools like Elasticsearch, Kibana and Jupyter Notebooks to first understand the types of data being collected and then use that knowledge to create increasingly powerful alarms which clearly identify specific problems. The task is to maximize the diagnostic range and capability of our alarms to proactively identify problems before they impact scientists who rely on these networks or impact our network measurement infrastructure’s ability to gather data in the first place. The student will be expected to participate in a weekly group meeting focused on network measurement, analytics, monitoring and alarming, which will provide a venue to discuss and learn about concepts, tools and methodologies relevant to the project. 
The project goal is to create improved alerting and alarming related to both the research and education networks used by the HEP, WLCG and OSG communities and the infrastructure we have created to measure and monitor them.
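
As a rough illustration of what a simple alarm function could look like, the sketch below flags network paths whose mean packet loss exceeds a threshold. The record layout (`src`, `dest`, `packet_loss`), the function name, the 2% threshold and the minimum sample count are all assumptions made for this example; they are not the team's actual schema or alarm definitions.

```python
from collections import defaultdict
from statistics import mean


def packet_loss_alarms(records, threshold=0.02, min_samples=3):
    """Return one alarm per (src, dest) path whose mean packet loss exceeds threshold.

    records: iterable of dicts with the hypothetical keys 'src', 'dest' and
    'packet_loss' (a fraction between 0 and 1).
    """
    by_path = defaultdict(list)
    for rec in records:
        by_path[(rec["src"], rec["dest"])].append(rec["packet_loss"])

    alarms = []
    for (src, dest), losses in sorted(by_path.items()):
        if len(losses) < min_samples:
            continue  # too few measurements to judge this path
        avg = mean(losses)
        if avg > threshold:
            alarms.append(
                {"src": src, "dest": dest, "mean_loss": avg, "samples": len(losses)}
            )
    return alarms


# Made-up measurements for three paths.
measurements = [
    {"src": "siteA", "dest": "siteB", "packet_loss": 0.000},
    {"src": "siteA", "dest": "siteB", "packet_loss": 0.001},
    {"src": "siteA", "dest": "siteB", "packet_loss": 0.000},
    {"src": "siteA", "dest": "siteC", "packet_loss": 0.05},
    {"src": "siteA", "dest": "siteC", "packet_loss": 0.04},
    {"src": "siteA", "dest": "siteC", "packet_loss": 0.06},
    {"src": "siteB", "dest": "siteC", "packet_loss": 0.50},  # single sample only
]

alarms = packet_loss_alarms(measurements)
for alarm in alarms:
    print(alarm)
```

Requiring a minimum number of samples is one possible way to avoid alarms triggered by a single bad measurement; a real alarm would also need to handle missing data, which can itself indicate a problem in the monitoring infrastructure.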

Machine Learning on Network Data for Problem Identification: Machine learning for network problem identification. Mentee: Maxym Naumchyk. Email the mentors (Shawn McKee,Petya Vasileva)

Research and Education networks are critical for modern, distributed scientific infrastructures. Networks enable data and services to operate across data centers and across the world. The IRIS-HEP/OSG-LHC team has members working on network measurement, analytics and pre-emptive problem identification and localization and would like to involve a student or students interested in data science, machine learning or analytics to participate in our work. The team has assembled a rich, unique dataset, consisting of network-specific metrics, statistics and other measurements which are collected by various tools and systems. In addition, we have developed simple functions to create alarms identifying some types of problems. Interested students would work with pre-prepared datasets, annotated via our existing alarms, to train one or more machine learning algorithms and then use the trained algorithms to process another dataset, comparing results with the sample alarm method. The task is to provide a more effective method of identifying certain types of network issues using machine learning so that such problems can be quickly resolved before they impact scientists who rely on these networks. The student will be expected to participate in a weekly group meeting focused on network measurement, analytics, monitoring and alarming, which will provide a venue to discuss and learn about concepts, tools and methodologies relevant to the project. The project goal is to create improved user-facing alerting and alarming related to the research and education networks used by HEP, WLCG and OSG communities.

Development of high level trigger (HLT1) lines to detect long lived particles at LHCb: Development of trigger lines to detect long-lived particles at LHCb. Mentee: Volodymyr Svintozelskyi. Email the mentors (Arantza Oyanguren,Brij Kishor Jashal)

The project focuses on developing trigger lines to detect long-lived particles by utilizing recently developed reconstruction algorithms within the Allen framework. These lines will employ the topologies of SciFi seeds originating from Standard Model long-lived particles, such as the strange Λ0. Multiple studies based on Monte Carlo (MC) simulations will be conducted to understand the variables of interest that can be used to select events. The development of new trigger lines is limited by the output rate and throughput constraints of the entire HLT1 system. An essential aspect of the project is the physics performance and the capability of the trigger lines for long-lived particle detection. The student will develop and employ ML/AI techniques that could significantly boost performance while maintaining control over execution time. The physics performance of the new trigger lines will be tested using specific decay channels from both Standard Model (SM) transitions and novel processes, such as dark bosons. Additionally, 2023 data from collisions collected during Run 3 will be used to commission these lines.

To develop microservice architecture for CMS HTCondor Job Monitoring: To develop microservice architecture for CMS HTCondor Job Monitoring. Email the mentors (Brij Kishor Jashal)

The current implementation of HTCondor Job Monitoring, internally known as the Spider service, is a monolithic application that queries HTCondor Schedds periodically. This implementation does not allow deployment in modern Kubernetes infrastructures, which offer advantages such as auto-scaling, resilience and self-healing. However, it can be separated into microservices responsible for “ClassAds calculation and conversion to JSON documents”, “transmitting results to ActiveMQ and OpenSearch without any duplicates” and “highly durable query management”. Such a microservice architecture will allow the use of other languages, such as Go, where they have advantages over Python. Moreover, intermediate monitoring pipelines can be integrated into this microservice architecture, which will reduce the effort needed for the services that produce monitoring outcomes using HTCondor Job Monitoring data.
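
To make the first two responsibilities more concrete, here is a minimal sketch of converting ClassAd-like records to JSON documents and dropping duplicates before they would be sent on. It is not the Spider service's code: ClassAds are represented as plain Python dictionaries, no HTCondor, ActiveMQ or OpenSearch client is used, and the document ID scheme (a hash of `GlobalJobId` and `JobStatus`) and both function names are assumptions made for illustration.

```python
import hashlib
import json


def classad_to_document(ad):
    """Convert a ClassAd-like dict into a (doc_id, json_string) pair.

    The ID is deterministic, so the same job in the same state always maps
    to the same document ID (an assumed scheme, for illustration only).
    """
    key = f"{ad['GlobalJobId']}#{ad['JobStatus']}"
    doc_id = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return doc_id, json.dumps(ad, sort_keys=True)


def deduplicate(ads, seen_ids):
    """Yield only the documents whose ID is not yet in seen_ids."""
    for ad in ads:
        doc_id, doc = classad_to_document(ad)
        if doc_id in seen_ids:
            continue  # already produced for this job and state
        seen_ids.add(doc_id)
        yield doc_id, doc


# Made-up query results: the last record repeats the second one.
ads = [
    {"GlobalJobId": "schedd1#100.0#1", "JobStatus": 1, "RemoteWallClockTime": 0.0},
    {"GlobalJobId": "schedd1#100.0#1", "JobStatus": 2, "RemoteWallClockTime": 42.0},
    {"GlobalJobId": "schedd1#100.0#1", "JobStatus": 2, "RemoteWallClockTime": 42.0},
]

seen = set()
docs = list(deduplicate(ads, seen))
print(len(docs), "documents to send")
```

In a real microservice the set of seen IDs would have to live in durable shared storage rather than in memory, so that a restarted or scaled-out instance does not resend documents.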

Development of simulation workflows for BSM LLPs at LHCb: Development of simulation workflows for BSM LLPs at LHCb. Mentee: Valerii Kholoimov. Email the mentors (Arantza Oyanguren,Brij Kishor Jashal)

In this project, the candidate will become familiar with Monte Carlo (MC) simulations in the LHC environment, using generators such as Pythia and EvtGen, and will learn to work within the LHCb framework. They will learn to generate new physics scenarios concerning long-lived particles, particularly dark bosons and heavy neutral leptons. The software implementation of these models will be of general interest to the entire LHC community. Moreover, these simulations will be used to assess the physics performance of various trigger algorithms developed within the Allen framework and to determine LHCb’s sensitivity to detecting these types of particles. New particles in a wide mass range, from 500 to 4500 MeV, and lifetimes between 10 ps and 2 ns, will be generated and analyzed to develop pioneering selection procedures for recognizing these signals during Run 3 of LHC data collection.

Testing Real Time Analysis at LHCb using 2022 data: Testing the Real Time Analysis at LHCb using data collected in 2022. Mentee: Nazar Semkiv. Email the mentors (Michele Atzeni,Eluned Anne Smith)

In this project, the candidate will use data collected at the LHCb experiment in 2022 to test and compare the different machine learning algorithms used to select events, in addition to probing the performance of the new detector more generally. They will first develop a BDT-based machine learning algorithm to cleanly select the decay candidates of interest in 2022 data. Statistical subtraction techniques combined with likelihood estimations will be used to remove any remaining background contamination. The decay properties of these channels, depending on which machine learning algorithm was employed in the trigger, will then be compared and their agreement with simulated events examined. If needed, the existing trigger algorithms may be altered and improved. If time allows, the student will also have a first look at the angular distributions of the calibration channels being studied. Understanding such angular distributions is vital to the LHCb physics program.

Point-cloud diffusion models for TPCs: Build, train, and tune a point-cloud diffusion model for the Active-Target Time Projection Chamber. Mentee: Artem Havryliuk. Email the mentors (Michelle Kuchera)

This project aims to build a conditional point-cloud diffusion model to simulate detector response to events in time projection chambers (TPCs), with a focus on the Active-Target Time Projection Chamber (AT-TPC). Existing analysis and simulation software is unable to model the noise signature in the AT-TPC. Prior explorations have successfully modeled this noise with a cycle-GAN, but in a lower-dimensional space than the detector resolution. The IRIS-HEP project will be completed in collaboration with the ALPhA research group at Davidson College. The applicant should have a foundation in ML and be comfortable with Python. Point-cloud diffusion models are a newer type of deep generative architecture, so the applicant will need to read, implement, and modify approaches from the ML literature using PyTorch or TensorFlow.

Developing an automatic differentiation and initial parameters optimisation pipeline for the particle shower model: Developing an automatic differentiation and initial parameters optimisation pipeline for the particle shower model. Email the mentors (Lukas Heinrich,Michael Kagan)

The goal of this project is to develop a differentiable simulation and optimization pipeline for Geant4. The narrow task of this Fellowship project is to develop a trial automatic differentiation and backpropagation pipeline for the Markov-like stochastic branching process that is modeling a particle shower spreading inside a detector material in three spatial dimensions.

Improve testing procedures for prompt data processing at CMS: Improve functional testing before deployment of critical changes for CMS Tier-0. Email the mentors (Dmytro Kovalskyi,German Giraldo,Jan Eysermans)

The CMS Tier-0 service is responsible for the prompt processing and distribution of the data collected by the CMS Experiment. Thorough testing of any code or configuration changes for the service is critical for timely data processing. The existing system has a Jenkins pipeline to execute a large-scale “replay” of the data processing using old data for the final functional testing before deployment of critical changes. The project is focusing on integration of unit tests and smaller functional tests in the integration pipeline to speed up testing and reduce resource utilization.

Predict data popularity to improve its availability for physics analysis: Predict data popularity to improve its availability for physics analysis. Mentee: Andrii Len. Email the mentors (Dmytro Kovalskyi,Rahul Chauhan,Hasan Ozturk)

The CMS data management team is responsible for distributing data among computing centers worldwide. Given the limited disk space at these sites, the team must dynamically manage the available data on disk. Whenever users attempt to access unavailable data, they are required to wait for the data to be retrieved from permanent tape storage. This delay impedes data analysis and hinders the scientific productivity of the collaboration. The objective of this project is to create a tool that utilizes machine learning algorithms to predict which data should be retained, based on current usage patterns.

Semi supervised methods for event reconstruction at the LHC: Semi supervised methods for event reconstruction at the LHC. Email the mentors (Javier Duarte,Farouk Mokhtar)

The most common approach to reconstructing full-scale events at general-purpose detectors (such as ATLAS and CMS) is a particle-flow-like (PF) algorithm [1–3], which combines low-level information (PF-elements) from different sub-detectors to reconstruct stable particles (PF-candidates). Efforts are under way to improve upon the current PF algorithm, and particle reconstruction in general, with machine learning (ML) methods. Within CMS, work has been conducted [4] to develop a graph neural network (GNN) model named MLPF [5] that reproduces the existing PF algorithm. The proposed MLPF algorithm has the advantage of running on heterogeneous computing architectures and may be efficient when scaling up to accommodate the High-Luminosity LHC upgrade. MLPF is trained in a supervised fashion [6] to reconstruct high-level PF-candidates from low-level PF-elements. We propose to test new self-supervised learning (SSL) ideas, borrowed from computer vision tasks, to pre-train the MLPF model before performing the downstream task of event reconstruction. In particular, a method named VICReg [7] was developed under the umbrella of contrastive SSL in computer vision. In the pre-training phase, the algorithm takes two views of the same image and learns to associate their embeddings in some latent space. The aim is to learn efficient latent representations before performing the downstream task. In the case of MLPF, a natural way to think of two (or more) views of the same event is to consider the different sub-detector systems (e.g. track hits and calorimeter clusters). In addition, other physics-motivated event augmentations may be considered (e.g. Lorentz boosts). We expect that pre-training MLPF on a huge amount of unlabeled LHC data can provide better performance and better domain adaptation (e.g. from MC to data).
[1] https://www.sciencedirect.com/science/article/abs/pii/0168900295001387?via%3Dihub
[2] https://arxiv.org/abs/1706.04965
[3] https://arxiv.org/abs/1703.10485
[4] https://github.com/jpata/particleflow
[5] https://arxiv.org/abs/2101.08578
[6] https://arxiv.org/abs/2303.17657
[7] https://arxiv.org/abs/2105.04906
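
The pre-training objective described above can be sketched in a few lines. The following is a dependency-free illustration of a VICReg-style loss acting on two batches of embeddings, which here would stand for two views of the same events (for example, a track-based and a cluster-based view). It is not the MLPF code; the function names are made up for this example, and the weights (25, 25, 1) and the target standard deviation of 1 are placeholder values in the spirit of reference [7]. A real implementation would use a tensor library with automatic differentiation.

```python
import math


def _column_means(z):
    n = len(z)
    return [sum(row[j] for row in z) / n for j in range(len(z[0]))]


def invariance_term(z1, z2):
    """Mean squared distance between the paired embeddings of the two views."""
    n = len(z1)
    return sum(
        sum((a - b) ** 2 for a, b in zip(r1, r2)) for r1, r2 in zip(z1, z2)
    ) / n


def variance_term(z, gamma=1.0, eps=1e-4):
    """Hinge loss that is zero once every embedding dimension has std >= gamma."""
    n, d = len(z), len(z[0])
    means = _column_means(z)
    total = 0.0
    for j in range(d):
        var = sum((row[j] - means[j]) ** 2 for row in z) / (n - 1)
        total += max(0.0, gamma - math.sqrt(var + eps))
    return total / d


def covariance_term(z):
    """Sum of squared off-diagonal covariances, divided by the dimension."""
    n, d = len(z), len(z[0])
    means = _column_means(z)
    total = 0.0
    for a in range(d):
        for b in range(d):
            if a == b:
                continue
            cov = sum((row[a] - means[a]) * (row[b] - means[b]) for row in z) / (n - 1)
            total += cov ** 2
    return total / d


def vicreg_loss(z1, z2, lam=25.0, mu=25.0, nu=1.0):
    """Weighted sum of the invariance, variance and covariance terms."""
    return (
        lam * invariance_term(z1, z2)
        + mu * (variance_term(z1) + variance_term(z2))
        + nu * (covariance_term(z1) + covariance_term(z2))
    )


# Spread-out, decorrelated embeddings versus fully collapsed embeddings.
z_spread = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
z_collapsed = [[0.0, 0.0] for _ in range(4)]

print(vicreg_loss(z_spread, z_spread))
print(vicreg_loss(z_collapsed, z_collapsed))
```

The variance term is what penalizes the trivial solution in which every event is mapped to the same embedding, which the invariance term alone would reward.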

Snakemake backend for RECAST: Implement a Snakemake backend for RECAST workflows. Email the mentors (Matthew Feickert,Lukas Heinrich)

RECAST is a platform for systematic interpretation of LHC searches. It reuses preserved analysis workflows from the LHC experiments, which is now possible with containerization and tools such as REANA. When RECAST was first implemented for ATLAS a workflow language with sufficient Linux container support was not available and the yadage workflow system was created. In the years since yadage was created the Snakemake workflow management system has become increasingly popular in the broader scientific community and has developed mature Linux container support. This project would aim to implement another RECAST backend using Snakemake as a modern alternative to yadage.

Recasting of IRIS-HEP Analysis Grand Challenge: Implement the CMS open data AGC analysis with RECAST and REANA. Email the mentors (Kyle Cranmer,Matthew Feickert,Lukas Heinrich)

RECAST is a platform for systematic interpretation of LHC searches. It reuses preserved analysis workflows from the LHC experiments, which is now possible with containerization and tools such as REANA. A yet unrealized component of the IRIS-HEP Analysis Grand Challenge (AGC) is reuse and reinterpretation of the analysis. This project would aim to preserve the AGC CMS open data analysis and the accompanying distributed infrastructure and implement a RECAST workflow allowing REANA integration with the AGC. A key challenge of the project is creating a preservation scheme for the associated Kubernetes distributed infrastructure.

Intelligent Caching for HSF Conditions Database: investigate patterns in conditions database accesses. Mentee: Ernest Sorokun. Email the mentors (Lino Gerlach)

Conditions data refers to additional information collected in particle physics experiments beyond what is recorded directly by the detector. This data plays a critical role in many experiments, providing crucial context and calibration information for the recorded measurements. However, managing conditions data poses unique challenges, particularly due to the high access rates involved. The High-Energy Physics Software Foundation (HSF) has developed an experiment-agnostic approach to address these challenges, which has already been successfully deployed for the sPHENIX experiment at the Brookhaven National Laboratory (BNL). The project focuses on investigating the access patterns for this conditions database to gather insights that will enable the development of an optimized caching system in the future. Machine Learning may be used for pattern recognition.

Measuring energy consumption of HEP software on user analysis facilities: Implementing energy consumption benchmarks on different analysis platforms and facilities. Email the mentors (Caterina Doglioni)

Benchmarks for software energy consumption are starting to appear (see e.g. the SCI score) alongside more common performance benchmarks. In this project, we will pilot the implementation of selected software energy consumption benchmarks on two different facilities for user analysis:

  • the Virtual Research Environment, a prototype analysis platform for the European Open Science Cloud.
  • Coffea-casa, a prototype Analysis Facility (AF), which provides services for “low-latency columnar analysis.”

We will then test them with simple user software pipelines. The candidate will work in collaboration with another IRIS-HEP fellow investigating energy consumption benchmarks for ML algorithms, and alongside a team of students and interns working on the selection and implementation of the benchmarks.

Estimating the energy cost of ML algorithms: Test and validate power consumption estimates for a ML algorithm for data compression. Mentee: Leonid Didukh. Email the mentors (Caterina Doglioni,Alexander Ekman)

Baler is a machine-learning based lossy data compression tool. In this project, we will review the literature concerning energy consumption for ML models and obtain estimates of the energetic cost of training and hyperparameter optimization following the guidelines of the Green Software Foundation. The candidate will work in a team of students and interns who are developers of the compression tool, and will be able to suggest improvements and insights towards Baler’s energy efficiency.
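
A back-of-the-envelope estimate of this kind can be sketched as follows. The second function follows the general form of the Green Software Foundation's Software Carbon Intensity score, SCI = ((E × I) + M) per R, where E is the energy used, I the carbon intensity of the electricity, M the embodied emissions and R the functional unit. The function names and every number below are illustrative assumptions, not measurements of Baler.

```python
def energy_kwh(avg_power_watts, duration_hours):
    """Energy in kWh for a device drawing avg_power_watts over duration_hours."""
    return avg_power_watts * duration_hours / 1000.0


def sci_score(energy_used_kwh, intensity_g_per_kwh, embodied_g, functional_units):
    """Carbon emitted (gCO2eq) per functional unit: ((E * I) + M) / R."""
    return (energy_used_kwh * intensity_g_per_kwh + embodied_g) / functional_units


# Illustrative inputs: a 250 W accelerator busy for 8 hours, a grid at
# 400 gCO2eq/kWh, 50 gCO2eq of embodied emissions attributed to this workload,
# and 10 training runs as the functional unit.
training_energy = energy_kwh(avg_power_watts=250, duration_hours=8)
score = sci_score(
    training_energy,
    intensity_g_per_kwh=400,
    embodied_g=50,
    functional_units=10,
)
print(f"{training_energy} kWh, {score} gCO2eq per training run")
```

The choice of functional unit (per training run, per compressed file, per user) changes the score substantially, so it needs to be fixed before different models or facilities are compared.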

Rucio-S3-compatible access interface for analysis facilities: Add S3 compatible access interface to Rucio. Mentee: Kyrylo Meliushko. Email the mentors (Lukas Heinrich,Matthew Feickert,Mario Lassnig)

Rucio is an open source software framework that provides functionality to scientific collaborations to organize, manage, monitor, and access their distributed data and data flows across heterogeneous infrastructures. Rucio was originally developed to meet the requirements of the high-energy physics experiment ATLAS, and is continuously enhanced to support diverse scientific communities. Since 2016 Rucio has orchestrated multiple exabytes of data access and data transfers globally. With this project we seek to enhance Rucio with a new access mechanism for analysis facilities that are oriented towards object stores, giving HEP analyzers a destination where they can store the data products of their research in a portable, shareable and standardized way.

Rust interfaces to I/O libraries: Explore the addition of a Rust interface to the PODIO library. Email the mentors (Benedikt Hegner)

The project’s main goal is to prototype a Rust interface to the PODIO library. Data models in PODIO are declared with a simple programming-language-agnostic syntax. In this project we will explore how the PODIO data model concepts can best be mapped onto Rust. At the same time, we will investigate how much Rust’s macro system can support the function of PODIO. For the other supported languages (Python, C++ and, experimentally, Julia), PODIO generates all of the source code. With proper usage of Rust macros this code generation could be kept to a minimum.

Julia interface to PODIO: Add Julia Interface to the PODIO library. Email the mentors (Benedikt Hegner,Graeme Stewart)

The project’s main goal is to add a proper Julia back-end to PODIO. A previous Google Summer of Code project worked on an early prototype, which showed the feasibility. The aim is to complete the feature set, and do thorough (performance) testing afterwards.

Julia for the Analysis Grand Challenge: Implement an analysis pipeline for the Analysis Grand Challenge (AGC) using the JuliaHEP ecosystem. Mentee: Atell-Yehor Krasnopolski. Email the mentors (Jerry Ling,Alexander Held)

The project’s main goal is to implement an AGC pipeline using Julia to demonstrate usability and to test performance. New utility packages can be expected, especially for systematics handling and out-of-core orchestration (built on existing packages such as FHist.jl and Dagger.jl). At the same time, the project can explore using RNTuple instead of TTree for AGC data storage. As the interface is transparent, this goal mainly requires data conversion unless performance bugs are spotted. This will help inform the transition at LHC experiments in the near future (Run 4).

Horizontal scaling of HTTP caches: learn how to automate load-balancing with Kubernetes. Email the mentors (Brian Bockelman,Brian Lin,Mátyás Selmeci)

The OSG offers an integrated software stack and infrastructure used by the High-Energy Physics community to meet their computational needs. Frontier Squid is part of this software stack and acts as an HTTP proxy, caching requests to improve network usage. We seek a fellow to turn our existing single cache Kubernetes deployment into one that can scale horizontally using the same underlying storage for its cache.

Software development for the Rucio Scientific Data Management system: Rucio core developments for large-scale data management. Mentee: Lev Pambuk. Email the mentors (Mario Lassnig,Martin Barisits)

The Rucio system is an open and community-driven data management system for data organisation, management, and access of scientific data. Several communities and experiments have adopted Rucio as a common solution, therefore we seek a dedicated software engineer to help implement much-wished-for features and extensions in the Rucio core. The selected candidate will focus on producing software for several Rucio components. There is a multitude of potential topics, also based on the candidate’s interests, that can be tackled. Examples include, but are not limited to: (1) integrating static type checking capabilities into the framework as well as improving its runtime efficiency, (2) continuing the documentation work for automatically generated API and REST interface documentation, (3) evolving the Rucio Upload and Download clients to new complex workflows suitable to modern analysis, (4) continuing the development work on the new Rucio Web User Interface, and many more. The selected candidate will participate in a large distributed team using best industry practices such as code review, continuous integration, test-driven development, and blue-green deployments. It is important to us that the candidate bring their creativity to the team, therefore we encourage them to also help with developing and evaluating new ideas and designs.

Geometric deep learning for high energy collision simulations: Simulate the detector response to high energy particle collisions using graph neural networks. Email the mentors (Javier Duarte,Raghav Kansal)

Geometric deep learning and graph neural networks (GNNs) have proven to be especially successful for machine learning tasks such as classification and reconstruction for high energy physics (HEP) data, such as jets and calorimeter showers produced at the Large Hadron Collider. This project extends their application to the computationally costly task of simulations, building off of existing work in this area [1]. Possible and complementary research directions are: (1) conditional generation of jets using auxiliary conditioning networks [2-3]; (2) application to generator-level showering and hadronization simulations, and investigating merging schemes with hard matrix elements; (3) developing an attention-based architecture, which has recently found success in jet classification [4]; (4) applications to shower datasets [5-6].
[1] https://arxiv.org/abs/2106.11535
[2] https://arxiv.org/abs/1610.09585
[3] https://cds.cern.ch/record/2701779/files/10.1051_epjconf_201921402010.pdf
[4] https://arxiv.org/abs/2202.03772
[5] https://zenodo.org/record/3603086
[6] https://calochallenge.github.io/homepage/

PV-Finder ACTS example: Adapting a Machine Learning Algorithm for use in ACTS. Mentee: Layan AlSarayra. Email the mentors (Lauren Tompkins,Rocky Bala Garg,Michael Sokoloff)

PV-Finder is a hybrid deep learning algorithm designed to identify the locations of proton-proton collisions (primary vertices) in the Run 3 LHCb detector. The underlying structure of the data and the approach to learning the locations of primary vertices may be useful for other detectors, including those at the LHC, such as ATLAS. The algorithm is approximately factorizable. Starting with reconstructed tracks, a kernel density estimator (KDE) can be calculated by a hand-written algorithm that reduces sparse point clouds of three-dimensional track data to rich one-dimensional data sets amenable to processing by a deep convolutional network. This network is called a kde-to-hist algorithm, and its predicted histograms are easily interpreted by a heuristic algorithm. A separate tracks-to-kde algorithm uses track parameters evaluated at their points of closest approach to the beamline as input features and predicts an approximation to the KDE. These two algorithms can be merged and the combined model trained to predict the easily interpreted histograms directly from track information. The incumbent will work under the joint supervision of Rocky Bala Garg and Lauren Tompkins (ATLAS physicists and ACTS developers, Stanford), and Michael Sokoloff (an LHCb physicist, Cincinnati) to adapt the pv-finder algorithms to process data generated by ACTS rather than LHCb.
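
To give a feel for the idea of reducing track data to a one-dimensional density along the beamline, here is a heavily simplified toy. It is not the PV-Finder KDE, which is computed from three-dimensional track information: in this sketch each track is reduced to a single z position at its point of closest approach, with an assumed Gaussian uncertainty, and a simple local-maximum search stands in for the interpretation step. All names and numbers are invented for the example.

```python
import math


def toy_kde(track_z, track_sigma, z_min, z_max, n_bins):
    """Sum one Gaussian per track, evaluated at the centre of each z bin."""
    width = (z_max - z_min) / n_bins
    centers = [z_min + (i + 0.5) * width for i in range(n_bins)]
    kde = [0.0] * n_bins
    for z0, sigma in zip(track_z, track_sigma):
        norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
        for i, c in enumerate(centers):
            kde[i] += norm * math.exp(-0.5 * ((c - z0) / sigma) ** 2)
    return centers, kde


def find_peaks(centers, kde, threshold):
    """Return the bin centres that are local maxima above threshold."""
    peaks = []
    for i in range(1, len(kde) - 1):
        if kde[i] > threshold and kde[i] >= kde[i - 1] and kde[i] > kde[i + 1]:
            peaks.append(centers[i])
    return peaks


# Two made-up vertices: three tracks near z = -10.5 and four near z = 25.5
# (arbitrary length units), each with an assumed resolution of 0.5.
track_z = [-10.6, -10.4, -10.5, 25.4, 25.6, 25.5, 25.5]
track_sigma = [0.5] * len(track_z)

centers, kde = toy_kde(track_z, track_sigma, z_min=-50.0, z_max=50.0, n_bins=100)
peaks = find_peaks(centers, kde, threshold=1.0)
print(peaks)
```

The learned models described above replace both the hand-written density calculation and the simple peak search with networks trained on simulated events.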

Refactoring fastjet with Awkward LayoutBuilder: Replacing the fastjet implementation with safe, maintainable LayoutBuilder while retaining its interface. Email the mentors (Javier Duarte,Jim Pivarski)

fastjet is a Python interface to the popular FastJet particle-clustering package, which is written in C++. fastjet is unique in that it offers a vectorized interface to FastJet’s algorithms, allowing Python users to analyze many collision events in a single function call, avoiding the overhead of Python iteration. Collections of particles and jets with different lengths per event are managed by Awkward Array. Although the fastjet package functions and is currently used in HEP analysis, its Python-C++ interface predates LayoutBuilder, which simplifies the construction of Awkward Arrays in C++, is easier to maintain, and avoids the dangers of raw array handling. This project would be to refactor fastjet to use the new abstraction layer, maintaining its well-tested interface, and possibly adding new algorithms and functionality, such as jet groomers and other transformations.

Dask in a HEP Analysis Facility at Scale: How fast can large-scale HEP data analysis be performed using Dask and Awkward Arrays? Email the mentors (Ianna Osborne,Oksana Shadura)

Coffea-casa is a prototype Analysis Facility (AF), which provides services for “low-latency columnar analysis.” Dask is an industry-standard scale-out mechanism for array-oriented data processing in Python, and Dask capabilities were recently added to Uproot and Awkward Array, making it possible to analyze jagged particle physics data from ROOT files as Dask arrays for the first time. The aim of this project is to determine how well these capabilities scale in a real AF environment. The project will involve stress-testing the uproot.dask and dask-awkward implementations with physics-motivated workloads, learning best-practices, and performance-testing/tuning in single and multi-user environments.

Analysis Grand Challenge with ATLAS PHYSLITE data: Create an Analysis Grand Challenge implementation using ATLAS PHYSLITE data. Email the mentors (Matthew Feickert,Tal van Daalen,Alexander Held)

The IRIS-HEP Analysis Grand Challenge (AGC) is a realistic environment for investigating how high energy physics data analysis workflows scale to the demands of the High-Luminosity LHC (HL-LHC). It captures relevant workflow aspects from data delivery to statistical inference. The AGC has so far been based on publicly available Open Data from the CMS experiment. The ATLAS collaboration aims to use a data format called PHYSLITE at the HL-LHC, which slightly differs from the data formats used so far within the AGC. This project involves implementing the capability to analyze PHYSLITE ATLAS data within the AGC workflow and optimizing the related performance under large volumes of data. In addition to this, the evaluation of systematic uncertainties for ATLAS with PHYSLITE is expected to differ in some aspects from what the AGC has considered thus far. This project will also investigate workflows to integrate the evaluation of such sources of uncertainty within a Python-based implementation of an AGC analysis task.

ROOT’s RDataFrame for the Analysis Grand Challenge: Develop and test an analysis pipeline using ROOT’s RDataFrame for the next iteration of the Analysis Grand Challenge. Mentee: Andrii Falko. Email the mentors (Enrico Guiraud,Alexander Held)

The IRIS-HEP Analysis Grand Challenge (AGC) aims to develop examples of realistic, end-to-end high-energy physics analyses, as well as demonstrate the advantages of modern tools and technologies when applied to such tasks. The next iteration of the AGC (v2) will put the capabilities of modern analysis interfaces such as Coffea and ROOT’s RDataFrame under further test, for example by including more complex systematic variations and sophisticated machine learning techniques. The project consists in the investigation and implementation of such new developments in the context of RDataFrame as well as their benchmarking on state-of-the-art analysis facilities. The goal is to gain insights useful to guide the future design of both the analysis facilities and the applications that will be deployed on them.

Augmenting Line-Segment Tracking with Graph Neural Network: Leveraging Graph Neural Network to utilize graph input data produced by Line-Segment Tracking. Mentee: Povilas Pugzlys. Email the mentors (Philip Chang)

The increase in pile-up at the upcoming HL-LHC will present a challenge to event reconstruction for the CMS experiment. The single largest contribution to the total reconstruction time comes from charged-particle tracking. Without algorithmic innovation, the charged-particle reconstruction time is projected to increase exponentially. Given this increase, and the fact that the computational performance of single-thread processors is plateauing, the CMS Collaboration estimates that the computing resource requirement will exceed the projected computing capabilities by a factor of 2 to 5. This can seriously hinder physicists from publishing timely scientific results. This motivates a new approach to tracking: developing algorithms that are parallel in nature, to alleviate problems of combinatorics, and that can leverage industry advancements in parallel computing such as GPUs. In light of this, the Line-Segment Tracking (LST) project was started. LST leverages the CMS outer tracker’s doublet modules to build mini-doublets (a pair of hits, one in each layer of a doublet module) in parallel, and subsequently builds line-segments by connecting consistent pairs of mini-doublets across different logical layers of the tracker, all on high-performance GPUs. Eventually, the line-segments are linked together iteratively to form long chains of line-segments, producing a list of track candidates. The parallel nature of the LST algorithm lends itself naturally to GPU usage. The project has produced performance on par with the existing tracking alternatives and has been integrated into the central CMS software as a step towards production. As the LST algorithm creates line-segments and links them to create track candidates, a graph representation of the hits and the links between them is naturally obtained. In other words, LST can also be thought of as a fast graph-producing algorithm.
The project will take the graph data and develop GNN models that classify linkings. We plan to integrate the GNN model into the LST algorithm to augment its capability to produce high-quality track candidates in a shorter time while keeping the same or better tracking performance. A solution for “one-shot” linking of long chains of line-segments in one algorithm, instead of through iteration, will also be studied.

Estimated Timeline:

  • Week 1/2: Understanding the preliminary LST GNN workflow for Line Segment classification
  • Week 3: Creating an example of running the Line Segment classification inference in a C++ environment with TorchScript
  • Week 4/5: Integrating the inference with LST’s CUDA code to run the GNN inference
  • Week 5: Validating the implementation in the LST framework
  • Week 6/7: Optimizing the use of the GNN inference and measuring the performance gain in the metrics of the LST framework (i.e. efficiency, fake rate, and duplicate rate)
  • Week 8/9: Performing large-scale hyperparameter optimization to find the best-performing model architecture
  • Week 10/11: Performing research and development on extending the classification to Triplets, and beyond, with the Line Graph transformation approach, which would enable “one-shot” inference
  • Week 12: Wrapping up the project, documenting and summarizing the findings to allow for next steps
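
The Line Graph transformation mentioned in the timeline can be illustrated with a minimal sketch. It is not taken from the LST code base: the function name `line_graph`, the integer mini-doublet ids, and the toy input are all assumptions made for illustration. Each line-segment becomes a node of a new graph, and two segments are linked when the outer mini-doublet of one is the inner mini-doublet of the other, so every edge of the new graph corresponds to one triplet candidate.

```python
from collections import defaultdict


def line_graph(segments):
    """Build the directed line graph of a list of line-segments.

    Each segment is a pair (inner, outer) of mini-doublet ids (hypothetical
    integer labels). The returned list holds pairs of segment indices (i, j)
    such that segment i ends on the mini-doublet where segment j starts.
    """
    # Index the segments by the mini-doublet they start from.
    by_inner = defaultdict(list)
    for idx, (inner, _outer) in enumerate(segments):
        by_inner[inner].append(idx)

    # Link each segment to every segment that continues from its outer end.
    triplet_edges = []
    for idx, (_inner, outer) in enumerate(segments):
        for nxt in by_inner.get(outer, []):
            triplet_edges.append((idx, nxt))
    return triplet_edges


# Toy input (made up): five segments connecting mini-doublets 0..4.
segments = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]
triplets = line_graph(segments)
print(triplets)
```

In this picture, classifying an edge of the line graph amounts to classifying a triplet, which is why an edge classifier applied to the transformed graph can act on longer objects than single line-segments.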

GNN Tracking: Reconstruct the trajectories of particles with graph neural networks. Mentee: Refilwe Bua. Email the mentors (Kilian Lieret,Gage deZoort)

In the GNN tracking project, we use graph neural networks (GNNs) to reconstruct trajectories (“tracks”) of elementary particles traveling through a detector. This task is called “tracking” and is different from many other problems that involve trajectories:

  • there are several thousand particles that need to be tracked at once,
  • there is no time information (the particles travel too fast),
  • we do not observe a continuous trajectory but instead only around five points (“hits”) along the way in different detector layers.

The task can be described as a combinatorially very challenging “connect-the-dots” problem, essentially turning a cloud of points (hits) in 3D space into a set of O(1000) trajectories. Expressed differently, each hit (containing not much more than the x/y/z coordinates) must be assigned to the particle/track it belongs to.

A conceptually simple way to turn this problem into a machine learning task is to create a fully connected graph of all points and then train an edge classifier to reject any edge that doesn’t connect points that belong to the same particle. In this way, only the individual trajectories remain as components of the initial fully connected graph. However, this strategy does not seem to lead to perfect results in practice. The approach of this project uses this strategy only as the first step to arrive at “small” graphs. It then projects all hits into a learned latent space with the model learning to place hits of the same particle close to each other, such that the hits belonging to the same particle form clusters.
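
The first step described above (score every edge of a fully connected graph, drop the rejected edges, and read off the remaining connected components as tracks) can be sketched in a few lines of plain Python. This is only an illustration and not the project's code: `toy_score` stands in for a trained edge classifier and simply looks up made-up truth labels, and all function names are hypothetical.

```python
from itertools import combinations


def connected_components(n_hits, edges):
    """Group hit indices into connected components using union-find."""
    parent = list(range(n_hits))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for hit in range(n_hits):
        groups.setdefault(find(hit), []).append(hit)
    return sorted(groups.values())


def reconstruct_tracks(n_hits, edge_score, threshold=0.5):
    """Keep edges scoring above threshold and return the components."""
    all_edges = combinations(range(n_hits), 2)  # fully connected graph
    kept = [(a, b) for a, b in all_edges if edge_score(a, b) > threshold]
    return connected_components(n_hits, kept)


# Made-up truth labels: which particle each of six hits belongs to.
true_particle = [0, 0, 1, 1, 0, 1]


def toy_score(a, b):
    """Stand-in for a trained edge classifier (perfect by construction)."""
    return 1.0 if true_particle[a] == true_particle[b] else 0.0


tracks = reconstruct_tracks(len(true_particle), toy_score)
print(tracks)
```

With a real, imperfect classifier a single wrongly accepted edge merges two tracks into one component, which is one way to see why the edge-classification step alone is not sufficient.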

The project code together with documentation and a reading list is available on GitHub and uses PyTorch Geometric. See also our GSoC proposal for the same project, which lists prerequisites and possible tasks.

REANA workflow for Dark Matter Searches: Implement a REANA workflow for dark matter searches at RCFM. Mentee: Sambridhi Deo. Email the mentors (Matthew Feickert,Lukas Heinrich,Amy Roberts,Giordon Stark)

REANA is a platform for reproducible data analysis workflows that can be run at scale. REANA has been used extensively for running containerized workflows of LHC experiments, like ATLAS, and for reinterpretation of published analyses. This project would aim to implement a REANA workflow for a galaxy rotation-curve fitting analysis (RCFM) to improve replicability and to provide a starting point for future work.

CMS RECAST Example: Implement a CMS analysis with RECAST and REANA. Email the mentors (Kyle Cranmer,Matthew Feickert)

RECAST is a platform for systematic interpretation of LHC searches. It reuses preserved analysis workflows from the LHC experiments, which is now possible with containerization and tools such as REANA. Recently, IRIS-HEP and the HEP Software Foundation (HSF) supported an analysis preservation bootcamp at CERN teaching these tools. Furthermore, the ATLAS experiment is now actively using RECAST. We seek a member of CMS to incorporate a CMS analysis into the system with support from IRIS-HEP, REANA, and RECAST developers. The analysis can be done using CMS internal data or CMS open data and would use CMS analysis tooling.

Charged-particles reconstruction at Muon Colliders: Simulation and Charged-particle reconstruction algorithms in future Muon Colliders. Mentee: Chris Sellgren. Email the mentors (Simone Pagan Griso,Sergo Jindariani)

A muon collider has been proposed as a possible path for future high-energy physics. The design of a detector for a muon collider has to cope with a large rate of beam-induced background, resulting in an unprecedentedly large multiplicity of particles entering the detector that are unrelated to the main muon-muon collision. The algorithms used for charged-particle reconstruction (tracking) need to cope with such “noise” and be able to successfully reconstruct the trajectories of the particles of interest, which results in a very large combinatorial problem that challenges the approaches adopted so far. The project consists of two complementary objectives. In the first, we will investigate how the tracking algorithms can be improved by utilizing directional information from specially arranged silicon-detector layers; this requires improving the simulation of the detector as well. The second aims to port the modern track reconstruction algorithms we are using from the older ILCSoft framework to the new Key4HEP software framework, which supports parallel multi-threaded execution of algorithms and will be needed to scale performance to the needs of the Collaboration, to validate them, and to ensure they can be widely used by all collaborators.

Bayesian Analysis with pyhf: Build a library on top of the pyhf Python API to allow for Bayesian analysis of HistFactory models. Mentee: Malin Horstmann. Email the mentors (Matthew Feickert,Lukas Heinrich)

Collider physics analyses have historically favored Frequentist statistical methodologies, with some exceptions of Bayesian inference in LHC analyses through use of the Bayesian Analysis Toolkit (BAT). As HistFactory’s model construction allows for creation of models that can be interpreted as having Bayesian priors, HistFactory models built with pyhf can be used for both Frequentist and Bayesian analyses. The project goal is to build a library on top of the pyhf Python API to allow for Bayesian analysis of HistFactory models using the PyMC Python library and leverage pyhf’s automatic differentiation and hardware acceleration through its JAX computational backend. Validation tests of results will be conducted against the BAT and LiteHF Julia libraries. If time permits, work on integrating the functionality into pyhf would be possible, though it would not be expected to be completed in this Fellow project. Applicants are expected to have strong working experience with Python and basic knowledge of statistical analysis.
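
To make the Bayesian reading of such a model concrete, here is a toy sketch that uses neither pyhf nor PyMC. It assumes a single-bin counting experiment with a Poisson likelihood, a known signal and background yield, and a flat prior on the signal strength `mu`; all names and numbers are made up for illustration. The posterior is evaluated on a grid as prior times likelihood and then normalized.

```python
import math


def poisson_logpmf(n, lam):
    """Log of the Poisson probability of observing n events given rate lam."""
    return n * math.log(lam) - lam - math.lgamma(n + 1)


def posterior_on_grid(n_obs, signal, background, mu_max=5.0, n_points=501):
    """Posterior density of mu on a uniform grid, flat prior on [0, mu_max].

    The expected rate is mu * signal + background. With a flat prior the
    posterior is proportional to the likelihood, so only the normalization
    has to be computed (here with a simple rectangle rule).
    """
    step = mu_max / (n_points - 1)
    grid = [i * step for i in range(n_points)]
    log_post = [poisson_logpmf(n_obs, mu * signal + background) for mu in grid]
    peak = max(log_post)
    weights = [math.exp(lp - peak) for lp in log_post]  # avoid underflow
    norm = sum(weights) * step
    density = [w / norm for w in weights]
    return grid, density


# Made-up inputs: 15 observed events, 5 expected signal, 10 expected background.
grid, density = posterior_on_grid(n_obs=15, signal=5.0, background=10.0)
mu_map = grid[density.index(max(density))]  # posterior mode on the grid
print(f"posterior mode: mu = {mu_map:.2f}")
```

A grid scan like this only works for one or two parameters; realistic HistFactory models have many nuisance parameters, which is where a sampling library and gradient information from automatic differentiation become useful.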

Interactive C++ for Machine Learning: Interfacing Cling and PyTorch together to facilitate Machine Learning workflows. Email the mentors (David Lange,Vassil Vasilev)

Cling is an interactive C++ interpreter, built on top of Clang and the LLVM compiler infrastructure. Cling realizes the read-eval-print loop (REPL) concept, in order to leverage rapid application development. Implemented as a small extension to LLVM and Clang, the interpreter reuses its strengths such as the praised concise and expressive compiler diagnostics. The LLVM-based C++ interpreter has enabled interactive C++ coding environments, whether in a standalone shell or a Jupyter notebook environment in xeus-cling.

This project aims to demonstrate that interactive C++ is useful with data analysis tools outside of the field of HEP. For example, prototype implementations for PyTorch have been successfully completed.

The project deliverables are:

  • Demonstrate that we can build and use PyTorch programs with
    Cling in Jupyter notebook.
  • Develop several non-trivial ML tutorials using interactive C++
    in Jupyter
  • Experiment with automatic differentiation via Clad of PyTorch codes

Candidate requirements:

  • Experience with C++, Jupyter notebooks and Conda is desirable
  • Experience with ML
  • Interest in exploring the intersection of data
    science and interactive C++.

Data Popularity, Placement Optimization and Storage Usage Effectiveness: Data Popularity, Placement Optimization and Storage Usage Effectiveness at the Data Center. Mentee: Avi G. Kaufman. Email the mentors (Vincent Garonne)

The goal of this project is to take data management to the next level by employing machine learning methods to create a precise data use prediction model. Applied to data placement decisions, this model can bring important benefits both 1) to data centers, in dealing with large amounts of “cold”, or unused, data which can potentially become “hot”, or popular and heavily used, and 2) to scientists, by enabling them to access their data more quickly. Deploying such models in production has many challenges and pitfalls, such as the accuracy of the predictions at different scales.