Open projects for IRIS-HEP Fellows (and more)
This page lists a number of known software R&D projects of interest to IRIS-HEP researchers. (This page will be updated from time to time, so check back and reload to see if new projects have been added.) Contact the mentors for more information about any of these projects! Be sure you have read the guidelines.
You can also find open projects for other programs or of general interest. Use the pulldown menus to select projects based on their attributes. Projects may instead specify that they are appropriate for multiple options (or any option) for a given attribute. Click the triangle next to each project for more information (if provided by the project mentors).
Enabling Advanced Network and Infrastructure Alarms: Enabling advanced network problem detection for the science community. Email the mentors (Shawn McKee, Ilija Vukotic)
Research and Education networks are critical for modern, distributed scientific infrastructures. Networks enable data and services to operate across data centers and across the world. The IRIS-HEP/OSG-LHC team has members working on network measurement, analytics and pre-emptive problem identification and localization and would like to involve a student or students interested in data science, monitoring or analytics to participate in our work. The team has assembled a rich, unique dataset, consisting of network-specific metrics, statistics and other measurements which are collected by various tools and systems. In addition, we have developed simple functions to create alarms identifying some types of problems. This project is intended to expand and augment the existing simple alarms with new alarms based upon the extensive data we are collecting. As we examine the data in depth, we realize there are important indicators of problems both in our networks as well as in our network monitoring infrastructure. Interested students would work with our data using tools like Elasticsearch, Kibana and Jupyter Notebooks to first understand the types of data being collected and then use that knowledge to create increasingly powerful alarms which clearly identify specific problems. The task is to maximize the diagnostic range and capability of our alarms to proactively identify problems before they impact scientists who rely on these networks or impact our network measurement infrastructure’s ability to gather data in the first place. The student will be expected to participate in a weekly group meeting focused on network measurement, analytics, monitoring and alarming, which will provide a venue to discuss and learn about concepts, tools and methodologies relevant to the project. 
The project goal is to create improved alerting and alarming related to both the research and education networks used by the HEP, WLCG and OSG communities and the infrastructure we have created to measure and monitor them.
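As an illustration of what a simple alarm function might look like, the sketch below flags source-destination pairs whose average packet loss exceeds a threshold. The record fields (`src`, `dest`, `packet_loss`), the 2% threshold and the minimum sample count are assumptions made for this example; they are not the team's actual data schema or alarm definitions.

```python
from collections import defaultdict
from statistics import mean


def packet_loss_alarms(records, threshold=0.02, min_samples=3):
    """Return (src, dest, mean_loss) for pairs whose mean loss exceeds threshold.

    Pairs with fewer than `min_samples` measurements are skipped so that a
    single noisy measurement does not raise an alarm.
    """
    losses_by_pair = defaultdict(list)
    for rec in records:
        losses_by_pair[(rec["src"], rec["dest"])].append(rec["packet_loss"])

    alarms = []
    for (src, dest), losses in sorted(losses_by_pair.items()):
        if len(losses) >= min_samples and mean(losses) > threshold:
            alarms.append((src, dest, mean(losses)))
    return alarms


# Toy measurements (hypothetical site names and values).
records = [
    {"src": "siteA", "dest": "siteB", "packet_loss": 0.00},
    {"src": "siteA", "dest": "siteB", "packet_loss": 0.01},
    {"src": "siteA", "dest": "siteB", "packet_loss": 0.00},
    {"src": "siteA", "dest": "siteC", "packet_loss": 0.05},
    {"src": "siteA", "dest": "siteC", "packet_loss": 0.04},
    {"src": "siteA", "dest": "siteC", "packet_loss": 0.06},
    {"src": "siteB", "dest": "siteC", "packet_loss": 0.50},  # only one sample
]

alarms = packet_loss_alarms(records)
for src, dest, loss in alarms:
    print(f"ALARM {src} -> {dest}: mean packet loss {loss:.1%}")
```

More powerful alarms would combine several metrics and also check that the measurement infrastructure itself is reporting data, which this sketch does not attempt.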
Machine Learning on Network Data for Problem Identification: Machine learning for network problem identification. Email the mentors (Shawn McKee, Petya Vasileva)
Research and Education networks are critical for modern, distributed scientific infrastructures. Networks enable data and services to operate across data centers and across the world. The IRIS-HEP/OSG-LHC team has members working on network measurement, analytics and pre-emptive problem identification and localization and would like to involve a student or students interested in data science, machine learning or analytics to participate in our work. The team has assembled a rich, unique dataset, consisting of network-specific metrics, statistics and other measurements which are collected by various tools and systems. In addition, we have developed simple functions to create alarms identifying some types of problems. Interested students would work with pre-prepared datasets, annotated via our existing alarms, to train one or more machine learning algorithms and then use the trained algorithms to process another dataset, comparing results with the sample alarm method. The task is to provide a more effective method of identifying certain types of network issues using machine learning so that such problems can be quickly resolved before they impact scientists who rely on these networks. The student will be expected to participate in a weekly group meeting focused on network measurement, analytics, monitoring and alarming, which will provide a venue to discuss and learn about concepts, tools and methodologies relevant to the project. The project goal is to create improved user-facing alerting and alarming related to the research and education networks used by HEP, WLCG and OSG communities.
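One way to compare a trained model with the existing alarms is to treat the alarm annotations as the reference and compute precision and recall for the model's flags. The sketch below assumes both are available as aligned lists of 0/1 values, one per measurement window; the function name and the toy values are hypothetical.

```python
def precision_recall(alarm_labels, model_flags):
    """Precision and recall of the model flags, with alarm labels as reference."""
    pairs = list(zip(alarm_labels, model_flags))
    true_pos = sum(1 for a, m in pairs if a and m)
    false_pos = sum(1 for a, m in pairs if not a and m)
    false_neg = sum(1 for a, m in pairs if a and not m)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Toy example: 1 means "problem flagged" for that measurement window.
alarm_labels = [1, 0, 1, 1, 0, 0, 1, 0]
model_flags = [1, 0, 0, 1, 1, 0, 1, 0]

precision, recall = precision_recall(alarm_labels, model_flags)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Because the alarms themselves are imperfect, a disagreement is not necessarily a model error; flagged windows without a matching alarm may deserve manual inspection.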
Software development for the Rucio Scientific Data Management system: Rucio core developments for large-scale data management. Email the mentors (Mario Lassnig, Martin Barisits)
The Rucio system is an open and community-driven data management system for the organisation, management, and access of scientific data. Several communities and experiments have adopted Rucio as a common solution, therefore we seek a dedicated software engineer to help implement much-wished-for features and extensions in the Rucio core. The selected candidate will focus on producing software for several Rucio components. There is a multitude of potential topics that can be tackled, also based on the candidate’s interests. Examples include, but are not limited to: (1) integrating static type checking capabilities into the framework as well as improving its runtime efficiency, (2) continuing the documentation work for automatically generated API and REST interface documentation, (3) evolving the Rucio Upload and Download clients to support new complex workflows suitable for modern analysis, (4) continuing the development work on the new Rucio Web User Interface, and many more. The selected candidate will participate in a large distributed team using industry best practices such as code review, continuous integration, test-driven development, and blue-green deployments. It is important to us that the candidate bring their creativity to the team, therefore we encourage them to also help with developing and evaluating new ideas and designs.
PV-Finder ACTS example: Adapting a Machine Learning Algorithm for use in ACTS. Email the mentors (Lauren Tompkins, Rocky Bala Garg, Michael Sokoloff)
PV-Finder is a hybrid deep learning algorithm designed to identify the locations of proton-proton collisions (primary vertices) in the Run 3 LHCb detector. The underlying structure of the data and the approach to learning the locations of primary vertices may be useful for other detectors, including those at the LHC, such as ATLAS. The algorithm is approximately factorizable. Starting with reconstructed tracks, a kernel density estimator (KDE) can be calculated by a hand-written algorithm that reduces sparse point clouds of three-dimensional track data to rich one-dimensional data sets amenable to processing by a deep convolutional network. The network that performs this step is called a kde-to-hist algorithm, and its predicted histograms are easily interpreted by a heuristic algorithm. A separate tracks-to-kde algorithm uses track parameters evaluated at their points of closest approach to the beamline as input features and predicts an approximation to the KDE. These two algorithms can be merged and the combined model trained to predict the easily interpreted histograms directly from track information. The selected candidate will work under the joint supervision of Rocky Bala Garg and Lauren Tompkins (ATLAS physicists and ACTS developers, Stanford), and Michael Sokoloff (an LHCb physicist, Cincinnati) to adapt the PV-Finder algorithms to process data generated by ACTS rather than LHCb.
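To give a feel for the reduction from track data to a one-dimensional KDE, the sketch below sums one Gaussian per track along the beamline, using each track's z position at closest approach and an uncertainty. This is a deliberately simplified stand-in and an assumption of this example: the hand-written PV-Finder KDE is more elaborate, and the function name, binning and toy values here are hypothetical.

```python
import math


def gaussian_kde_1d(z0s, sigmas, z_min, z_max, n_bins):
    """Sum one Gaussian per track on a regular grid of bin centers along z."""
    width = (z_max - z_min) / n_bins
    centers = [z_min + (i + 0.5) * width for i in range(n_bins)]
    kde = [0.0] * n_bins
    for z0, sigma in zip(z0s, sigmas):
        norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
        for i, z in enumerate(centers):
            kde[i] += norm * math.exp(-0.5 * ((z - z0) / sigma) ** 2)
    return centers, kde


# Toy event: three tracks near z = -2 mm and four tracks near z = 3 mm.
z0s = [-2.1, -2.0, -1.9, 2.9, 3.0, 3.1, 3.05]
sigmas = [0.1] * len(z0s)

centers, kde = gaussian_kde_1d(z0s, sigmas, z_min=-5.0, z_max=5.0, n_bins=100)
peak_z = centers[kde.index(max(kde))]
print(f"highest KDE peak near z = {peak_z:.2f} mm")
```

The resulting list of KDE values is the kind of dense, fixed-length input that a convolutional network can process, in contrast to the variable-length list of tracks it was built from.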
Refactoring fastjet with Awkward LayoutBuilder: Replacing the fastjet implementation with safe, maintainable LayoutBuilder while retaining its interface. Email the mentors (Javier Duarte, Jim Pivarski)
fastjet is a Python interface to the popular FastJet particle-clustering package, which is written in C++. fastjet is unique in that it offers a vectorized interface to FastJet’s algorithms, allowing Python users to analyze many collision events in a single function call, avoiding the overhead of Python iteration. Collections of particles and jets with different lengths per event are managed by Awkward Array. Although the fastjet package functions and is currently used in HEP analysis, its Python-C++ interface predates LayoutBuilder, which simplifies the construction of Awkward Arrays in C++, is easier to maintain, and avoids the dangers of raw array handling. This project would refactor fastjet to use the new abstraction layer, maintaining its well-tested interface, and possibly add new algorithms and functionality, such as jet groomers and other transformations.
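The data structure at stake can be sketched in a few lines. Awkward Array stores variable-length lists as one flat content buffer plus an offsets buffer marking where each event starts and ends. The toy class below mimics that idea in pure Python; it is a hypothetical illustration and not the LayoutBuilder API, which is a C++ interface.

```python
class JaggedBuilder:
    """Toy builder for a list-of-lists stored as flat content plus offsets."""

    def __init__(self):
        self.offsets = [0]  # offsets[i]:offsets[i + 1] delimits event i
        self.content = []   # all values from all events, concatenated

    def append_event(self, values):
        """Append one event's values and record where the event ends."""
        self.content.extend(values)
        self.offsets.append(len(self.content))

    def event(self, i):
        """Return the values belonging to event i."""
        return self.content[self.offsets[i]:self.offsets[i + 1]]

    def __len__(self):
        return len(self.offsets) - 1


# Jet transverse momenta for three events with 3, 0 and 1 jets (toy values).
builder = JaggedBuilder()
for jet_pts in ([52.3, 31.0, 20.5], [], [101.2]):
    builder.append_event(jet_pts)

print(builder.offsets)   # [0, 3, 3, 4]
print(builder.event(0))  # [52.3, 31.0, 20.5]
```

Keeping the bookkeeping of offsets inside a builder, rather than writing to raw arrays by hand, is what removes the class of indexing mistakes that the project description refers to as the dangers of raw array handling.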
Dask in a HEP Analysis Facility at Scale: How fast can large-scale HEP data analysis be performed using Dask and Awkward Arrays? Email the mentors (Ianna Osborne, Oksana Shadura)
Coffea-casa is a prototype Analysis Facility (AF), which provides services for “low-latency columnar analysis.” Dask is an industry-standard scale-out mechanism for array-oriented data processing in Python, and Dask capabilities were recently added to Uproot and Awkward Array, making it possible to analyze jagged particle physics data from ROOT files as Dask arrays for the first time. The aim of this project is to determine how well these capabilities scale in a real AF environment. The project will involve stress-testing the uproot.dask and dask-awkward implementations with physics-motivated workloads, learning best practices, and performance-testing/tuning in single and multi-user environments.
Analysis Grand Challenge with ATLAS PHYSLITE data: Create an Analysis Grand Challenge implementation using ATLAS PHYSLITE data. Email the mentors (Matthew Feickert, Tal van Daalen, Alexander Held)
The IRIS-HEP Analysis Grand Challenge (AGC) is a realistic environment for investigating how high energy physics data analysis workflows scale to the demands of the High-Luminosity LHC (HL-LHC). It captures relevant workflow aspects from data delivery to statistical inference. The AGC has so far been based on publicly available Open Data from the CMS experiment. The ATLAS collaboration aims to use a data format called PHYSLITE at the HL-LHC, which slightly differs from the data formats used so far within the AGC. This project involves implementing the capability to analyze PHYSLITE ATLAS data within the AGC workflow and optimizing the related performance under large volumes of data. In addition to this, the evaluation of systematic uncertainties for ATLAS with PHYSLITE is expected to differ in some aspects from what the AGC has considered thus far. This project will also investigate workflows to integrate the evaluation of such sources of uncertainty within a Python-based implementation of an AGC analysis task.
ROOT’s RDataFrame for the Analysis Grand Challenge: Develop and test an analysis pipeline using ROOT’s RDataFrame for the next iteration of the Analysis Grand Challenge. Email the mentors (Enrico Guiraud, Alexander Held)
The IRIS-HEP Analysis Grand Challenge (AGC) aims to develop examples of realistic, end-to-end high-energy physics analyses, as well as demonstrate the advantages of modern tools and technologies when applied to such tasks. The next iteration of the AGC (v2) will put the capabilities of modern analysis interfaces such as Coffea and ROOT’s RDataFrame under further test, for example by including more complex systematic variations and sophisticated machine learning techniques. The project consists of investigating and implementing such new developments in the context of RDataFrame, as well as benchmarking them on state-of-the-art analysis facilities. The goal is to gain insights useful to guide the future design of both the analysis facilities and the applications that will be deployed on them.
Augmenting Line-Segment Tracking with Graph Neural Network: Leveraging a Graph Neural Network to utilize graph input data produced by Line-Segment Tracking. Email the mentors (Philip Chang)
The increase in pile-up at the upcoming HL-LHC will present a challenge to event reconstruction for the CMS experiment. The single largest contribution to the total reconstruction time comes from charged-particle tracking. Without algorithmic innovation, the charged-particle reconstruction time is projected to increase exponentially. Given this increase and the plateauing computational performance of single-threaded processors, the CMS Collaboration estimates that, without algorithmic innovation, the computing resource requirement will exceed the projected computing capabilities by a factor of 2 to 5. This could seriously hinder physicists’ ability to publish scientific results in a timely manner. This motivates a new approach to tracking: developing algorithms that are parallel in nature, which alleviates problems of combinatorics and can leverage industry advancements in parallel computing such as GPUs. The Line-Segment Tracking (LST) project was started in light of this. LST leverages the CMS outer tracker’s doublet modules to build mini-doublets (a pair of hits, one in each layer of a doublet module) in parallel, and subsequently builds line-segments by connecting consistent pairs of mini-doublets across different logical layers of the tracker, all done on high-performance GPUs. Finally, the line-segments are linked together iteratively to form long chains of line-segments, producing a list of track candidates. The parallel nature of the LST algorithm lends itself naturally to GPU usage. The project has achieved performance on par with the existing tracking alternatives and has been integrated into the central CMS software as a step towards production. As the LST algorithm creates line-segments and links them to create track candidates, a graph representation of hits and the links between them is naturally obtained. In other words, LST can also be thought of as a fast graph-producing algorithm.
The project will take the graph data and develop GNN models that classify linkings. We plan to integrate the GNN model into the LST algorithm to augment its capability to produce high-quality track candidates in a shorter time while keeping the same or better tracking performance. A solution for “one-shot” linking of long chains of line-segments in one algorithm, instead of through iteration, will also be studied.
Estimated Timeline:
- Week 1/2: Understanding the preliminary LST GNN workflow for Line Segment classification
- Week 3: Creating an example of running the Line Segment classification inference in a C++ environment with TorchScript
- Week 4/5: Integrating the inference with LST’s CUDA code to run the inference on GNN
- Week 5: Validating the implementation in the LST framework
- Week 6/7: Optimizing the use of the GNN inferences and measuring the performance gain in the metrics of the LST framework (i.e. efficiency, fake rate, and duplicate rate)
- Week 8/9: Performing large-scale hyperparameter optimization to find the best resulting model architecture
- Week 10/11: Performing research and development on extending the ability to classify Triplets, and beyond, with the Line Graph transformation approach, which would enable “one-shot” inference
- Week 12: Wrapping up the project, documenting and summarizing the findings to allow for next steps
GNN Tracking: Reconstruct the trajectories of particles with graph neural networks. Email the mentors (Kilian Lieret, Gage deZoort)
In the GNN tracking project, we use graph neural networks (GNNs) to reconstruct trajectories (“tracks”) of elementary particles traveling through a detector. This task is called “tracking” and is different from many other problems that involve trajectories:
- there are several thousand particles that need to be tracked at once,
- there is no time information (the particles travel too fast),
- we do not observe a continuous trajectory but instead only around five points (“hits”) along the way in different detector layers.
The task can be described as a combinatorially very challenging “connect-the-dots” problem, essentially turning a cloud of points (hits) in 3D space into a set of O(1000) trajectories. Expressed differently, each hit (containing not much more than the x/y/z coordinates) must be assigned to the particle/track it belongs to.
A conceptually simple way to turn this problem into a machine learning task is to create a fully connected graph of all points and then train an edge classifier to reject any edge that doesn’t connect points that belong to the same particle. In this way, only the individual trajectories remain as components of the initial fully connected graph. However, this strategy does not seem to lead to perfect results in practice. The approach of this project uses this strategy only as the first step to arrive at “small” graphs. It then projects all hits into a learned latent space with the model learning to place hits of the same particle close to each other, such that the hits belonging to the same particle form clusters.
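The first step, edge classification followed by reading off the remaining connected components, can be sketched as follows. The edge scores below are made-up stand-ins for the output of a trained edge classifier, and the function name, threshold and toy graph are assumptions of this example rather than part of the project code.

```python
def tracks_from_edges(n_hits, scored_edges, threshold=0.5):
    """Group hits into track candidates using edges that pass the threshold.

    `scored_edges` holds (hit_a, hit_b, score) tuples. Edges scoring below the
    threshold are rejected; the connected components of what remains are
    returned as sorted lists of hit indices.
    """
    parent = list(range(n_hits))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for hit_a, hit_b, score in scored_edges:
        if score >= threshold:
            root_a, root_b = find(hit_a), find(hit_b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = {}
    for hit in range(n_hits):
        groups.setdefault(find(hit), []).append(hit)
    return sorted(groups.values())


# Six hits; scores close to 1 mean "same particle" (toy values).
scored_edges = [
    (0, 1, 0.90), (1, 2, 0.80), (2, 3, 0.10),
    (3, 4, 0.95), (4, 5, 0.70), (0, 5, 0.20),
]
tracks = tracks_from_edges(6, scored_edges)
print(tracks)  # [[0, 1, 2], [3, 4, 5]]
```

A single wrongly accepted edge merges two tracks in this scheme, which is one way to see why the project follows the edge classification with clustering in a learned latent space.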
The project code, together with documentation and a reading list, is available on GitHub and uses PyTorch Geometric. See also our GSoC proposal for the same project, which lists prerequisites and possible tasks.
CMS RECAST Example: Implement a CMS analysis with RECAST and REANA. Email the mentors (Kyle Cranmer, Matthew Feickert)
RECAST is a platform for systematic interpretation of LHC searches. It reuses preserved analysis workflows from the LHC experiments, which is now possible with containerization and tools such as REANA. Recently, IRIS-HEP and the HEP Software Foundation (HSF) supported an analysis preservation bootcamp at CERN teaching these tools. Furthermore, the ATLAS experiment is now actively using RECAST. We seek a member of CMS to incorporate a CMS analysis into the system with support from IRIS-HEP, REANA, and RECAST developers.
Charged-particle reconstruction at Muon Colliders: Simulation and charged-particle reconstruction algorithms in future Muon Colliders. Email the mentors (Simone Pagan Griso, Sergo Jindariani)
A muon collider has been proposed as a possible path for future high-energy physics. The design of a detector for a muon collider has to cope with a large rate of beam-induced background, resulting in an unprecedentedly large multiplicity of particles entering the detector that are unrelated to the main muon-muon collision. The algorithms used for charged-particle reconstruction (tracking) need to cope with such “noise” and be able to successfully reconstruct the trajectories of the particles of interest, which results in a very large combinatorial problem that challenges the approaches adopted so far. The project consists of two complementary objectives. In the first, we will investigate how the tracking algorithms can be improved by utilizing directional information from specially arranged silicon-detector layers; this requires improving the simulation of the detector as well. The second aims to port the modern track reconstruction algorithms we are using from the older ILCSoft framework to the new Key4HEP software framework, which supports parallel multi-threaded execution of algorithms and will be needed to scale performance to the needs of the Collaboration, and then to validate them and ensure they can be widely used by all collaborators.
Bayesian Analysis with pyhf: Build a library on top of the pyhf Python API to allow for Bayesian analysis of HistFactory models. Email the mentors (Matthew Feickert, Lukas Heinrich)
Collider physics analyses have historically favored Frequentist statistical methodologies, with some exceptions of Bayesian inference in LHC analyses through use of the Bayesian Analysis Toolkit (BAT). As HistFactory’s model construction allows for creation of models that can be interpreted as having Bayesian priors, HistFactory models built with pyhf can be used for both Frequentist and Bayesian analyses. The project goal is to build a library on top of the pyhf Python API to allow for Bayesian analysis of HistFactory models using the PyMC Python library and leverage pyhf’s automatic differentiation and hardware acceleration through its JAX computational backend. Validation tests of results will be conducted against the BAT and LiteHF Julia libraries. If time permits, work on integrating the functionality into pyhf would be possible, though it would not be expected to be completed in this Fellow project. Applicants are expected to have strong working experience with Python and basic knowledge of statistical analysis.
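To make the Bayesian reading of such a model concrete, the sketch below computes the posterior of a signal strength `mu` for a single-bin counting model, n ~ Poisson(mu * s + b), with a flat prior on a finite range. It uses only the Python standard library and a grid, not pyhf or PyMC, and it omits nuisance parameters; the function names, yields and prior range are assumptions chosen for illustration.

```python
import math


def poisson_logpmf(n, lam):
    """Log of the Poisson probability of observing n events given mean lam."""
    return n * math.log(lam) - lam - math.lgamma(n + 1)


def posterior_on_grid(n_obs, signal, background, mu_max=5.0, n_points=501):
    """Posterior density of mu on a regular grid, flat prior on [0, mu_max]."""
    step = mu_max / (n_points - 1)
    mus = [i * step for i in range(n_points)]
    # With a flat prior the posterior is proportional to the likelihood.
    weights = [
        math.exp(poisson_logpmf(n_obs, mu * signal + background)) for mu in mus
    ]
    norm = sum(weights) * step  # simple rectangle-rule normalization
    density = [w / norm for w in weights]
    return mus, density


# Toy yields: 10 expected signal events, 50 expected background, 60 observed.
mus, density = posterior_on_grid(n_obs=60, signal=10.0, background=50.0)
mu_mode = mus[density.index(max(density))]
print(f"posterior mode at mu = {mu_mode:.2f}")
```

A grid works only for a handful of parameters. Realistic HistFactory models have many nuisance parameters, which is where sampling with a library such as PyMC and gradients from an automatic differentiation backend become useful.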
Interactive C++ for Machine Learning: Interfacing Cling and PyTorch together to facilitate Machine Learning workflows. Email the mentors (David Lange, Vassil Vasilev)
Cling is an interactive C++ interpreter, built on top of Clang and the LLVM compiler infrastructure. Cling realizes the read-eval-print loop (REPL) concept, in order to leverage rapid application development. Implemented as a small extension to LLVM and Clang, the interpreter reuses its strengths such as the praised concise and expressive compiler diagnostics. The LLVM-based C++ interpreter has enabled interactive C++ coding environments, whether in a standalone shell or a Jupyter notebook environment in xeus-cling.
This project aims to demonstrate that interactive C++ is useful with data analysis tools outside of the field of HEP. For example, prototype implementations for PyTorch have been successfully completed.
The project deliverables are:
- Demonstrate that we can build and use PyTorch programs with Cling in a Jupyter notebook.
- Develop several non-trivial ML tutorials using interactive C++.
- Experiment with automatic differentiation via Clad of PyTorch codes.
Candidate requirements:
- Experience with C++, Jupyter notebooks and Conda is desirable.
- Experience with ML.
- Interest in exploring the intersection of data science and interactive C++.