Open project database

We want to make finding projects easier for students and advertising projects easier for scientists with research opportunities. Today, this is very much a work in progress, but eventually you will be able to search for prospective project ideas and easily add new project opportunities to this repository.

This page is a prototype project database. Use the menu bars to select projects based on their attributes. Projects with no value specified for a given attribute will not be included if a selection is made on that attribute. Projects may instead specify that they are appropriate for multiple options (or any option) for a given attribute. Click the triangle next to each project for more information (if provided by the project mentors).

Selected projects

Enabling Advanced Network and Infrastructure Alarms: Enabling advanced network problem detection for the science community. Email the mentors (Shawn McKee,Ilija Vukotic)

Research and Education networks are critical for modern, distributed scientific infrastructures. Networks enable data and services to operate across data centers and across the world. The IRIS-HEP/OSG-LHC team has members working on network measurement, analytics and pre-emptive problem identification and localization, and would like to involve one or more students interested in data science, monitoring or analytics in this work. The team has assembled a rich, unique dataset consisting of network-specific metrics, statistics and other measurements collected by various tools and systems. In addition, we have developed simple functions to create alarms identifying some types of problems. This project is intended to expand and augment the existing simple alarms with new alarms based upon the extensive data we are collecting. As we examine the data in depth, we realize there are important indicators of problems both in our networks and in our network monitoring infrastructure. Interested students would work with our data using tools like Elasticsearch, Kibana and Jupyter Notebooks to first understand the types of data being collected and then use that knowledge to create increasingly powerful alarms which clearly identify specific problems. The task is to maximize the diagnostic range and capability of our alarms to proactively identify problems before they impact scientists who rely on these networks or impact our network measurement infrastructure’s ability to gather data in the first place. The student will be expected to participate in a weekly group meeting focused on network measurement, analytics, monitoring and alarming, which will provide a venue to discuss and learn about concepts, tools and methodologies relevant to the project.
The project goal is to create improved alerting and alarming related to both the research and education networks used by the HEP, WLCG and OSG communities and the infrastructure we have created to measure and monitor them.
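
As an illustration of what a simple alarm function might look like, the sketch below flags source-destination paths whose mean packet loss exceeds a threshold. The record layout, the field names, and the 2% threshold are assumptions made for this example, not the team's actual schema or alarm definitions.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical measurement records; field names are illustrative only.
measurements = [
    {"src": "siteA", "dest": "siteB", "packet_loss": 0.00},
    {"src": "siteA", "dest": "siteB", "packet_loss": 0.01},
    {"src": "siteA", "dest": "siteC", "packet_loss": 0.05},
    {"src": "siteA", "dest": "siteC", "packet_loss": 0.07},
]


def packet_loss_alarms(records, threshold=0.02):
    """Return (src, dest, mean_loss) for paths whose mean loss exceeds threshold."""
    by_path = defaultdict(list)
    for rec in records:
        by_path[(rec["src"], rec["dest"])].append(rec["packet_loss"])
    return [
        (src, dest, mean(losses))
        for (src, dest), losses in sorted(by_path.items())
        if mean(losses) > threshold
    ]


alarms = packet_loss_alarms(measurements)
```

In practice the records would come from the monitoring data store rather than an in-memory list, and more powerful alarms would combine several metrics instead of a single threshold.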

Machine Learning on Network Data for Problem Identification: Machine learning for network problem identification. Mentee: Maxym Naumchyk. Email the mentors (Shawn McKee,Petya Vasileva)

Research and Education networks are critical for modern, distributed scientific infrastructures. Networks enable data and services to operate across data centers and across the world. The IRIS-HEP/OSG-LHC team has members working on network measurement, analytics and pre-emptive problem identification and localization, and would like to involve one or more students interested in data science, machine learning or analytics in this work. The team has assembled a rich, unique dataset consisting of network-specific metrics, statistics and other measurements collected by various tools and systems. In addition, we have developed simple functions to create alarms identifying some types of problems. Interested students would work with pre-prepared datasets, annotated via our existing alarms, to train one or more machine learning algorithms and then use the trained algorithms to process another dataset, comparing results with the simple alarm method. The task is to provide a more effective method of identifying certain types of network issues using machine learning so that such problems can be quickly resolved before they impact scientists who rely on these networks. The student will be expected to participate in a weekly group meeting focused on network measurement, analytics, monitoring and alarming, which will provide a venue to discuss and learn about concepts, tools and methodologies relevant to the project. The project goal is to create improved user-facing alerting and alarming related to the research and education networks used by the HEP, WLCG and OSG communities.
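
The train-then-compare workflow described above can be sketched with a deliberately tiny model. The example below learns a one-feature threshold classifier (a decision stump) from alarm-annotated samples and measures how often it agrees with the alarm labels on a second dataset. The data, the single packet-loss feature, and the choice of model are all assumptions for illustration; a real study would use richer features and a proper ML library.

```python
def train_stump(samples):
    """Pick the threshold on a single feature that best reproduces the labels.

    samples: list of (feature_value, label), where label is True when the
    existing alarm fired. The stump predicts True when value >= threshold.
    """
    candidates = sorted({value for value, _ in samples})
    best_threshold, best_correct = candidates[0], -1
    for threshold in candidates:
        correct = sum((value >= threshold) == label for value, label in samples)
        if correct > best_correct:
            best_threshold, best_correct = threshold, correct
    return best_threshold


def agreement(threshold, samples):
    """Fraction of samples where the stump matches the alarm label."""
    hits = sum((value >= threshold) == label for value, label in samples)
    return hits / len(samples)


# Hypothetical alarm-annotated packet-loss fractions.
training = [(0.00, False), (0.01, False), (0.02, False), (0.04, True), (0.08, True)]
held_out = [(0.005, False), (0.03, False), (0.05, True), (0.10, True)]

threshold = train_stump(training)
score = agreement(threshold, held_out)
```

Agreement with the existing alarms is only a starting point: the more interesting cases are those where a trained model and the simple alarm disagree, since those are where one of the two methods may be missing a problem.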

Development of high level trigger (HLT1) lines to detect long-lived particles at LHCb: Development of trigger lines to detect long-lived particles at LHCb. Mentee: Volodymyr Svintozelskyi. Email the mentors (Arantza Oyanguren,Brij Kishor Jashal)

The project focuses on developing trigger lines to detect long-lived particles by utilizing recently developed reconstruction algorithms within the Allen framework. These lines will employ the topologies of SciFi seeds originating from Standard Model long-lived particles, such as the strange Λ0. Multiple studies based on Monte Carlo (MC) simulations will be conducted to understand the variables of interest that can be used to select events. The development of new trigger lines is limited by the output trigger rate and throughput constraints of the entire HLT1 system. An essential aspect of the project is the physics performance and the capability of the trigger lines for long-lived particle detection. The student will develop and employ ML/AI techniques that could significantly boost performance while maintaining control over execution time. The physics performance of the new trigger lines will be tested using specific decay channels from both Standard Model (SM) transitions and novel processes, such as dark bosons. Additionally, 2023 data from collisions collected during Run 3 will be used to commission these lines.

To develop microservice architecture for CMS HTCondor Job Monitoring: To develop microservice architecture for CMS HTCondor Job Monitoring. Email the mentors (Brij Kishor Jashal)

The current implementation of HTCondor Job Monitoring, internally known as the Spider service, is a monolithic application that queries HTCondor Schedds periodically. This implementation does not allow deployment on modern Kubernetes infrastructures, with advantages like auto-scaling, resilience and self-healing. However, it can be separated into microservices responsible for “ClassAds calculation and conversion to JSON documents”, “transmitting results to ActiveMQ and OpenSearch without any duplicates” and “highly durable query management”. Such a microservice architecture will allow the use of other appropriate languages, such as Go, where they have advantages over Python. Moreover, intermediate monitoring pipelines can be integrated into this microservice architecture, which will reduce the effort needed for the services that produce monitoring outcomes using HTCondor Job Monitoring data.
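
A minimal sketch of two of the responsibilities listed above: converting a job record to a JSON document and suppressing duplicates before transmission. The job attributes are represented here as a plain dictionary, and the document-ID scheme and the in-memory set of sent IDs are assumptions for illustration. This is not the Spider service's code and it does not use the HTCondor, ActiveMQ or OpenSearch APIs.

```python
import hashlib
import json


def to_document(job_ad):
    """Serialize a job record (a plain dict standing in for a ClassAd) to JSON."""
    return json.dumps(job_ad, sort_keys=True)


def document_id(job_ad):
    """Derive a stable ID from fields assumed to identify one job state."""
    key = f'{job_ad["GlobalJobId"]}#{job_ad["JobStatus"]}'
    return hashlib.sha256(key.encode()).hexdigest()


def deduplicate(job_ads, already_sent):
    """Yield (id, json_doc) only for records that were not transmitted before."""
    for ad in job_ads:
        doc_id = document_id(ad)
        if doc_id not in already_sent:
            already_sent.add(doc_id)
            yield doc_id, to_document(ad)


# Hypothetical records; the second repeats the first.
ads = [
    {"GlobalJobId": "schedd1#100.0", "JobStatus": 2},
    {"GlobalJobId": "schedd1#100.0", "JobStatus": 2},
    {"GlobalJobId": "schedd1#100.0", "JobStatus": 4},
]
sent = set()
documents = list(deduplicate(ads, sent))
```

In a microservice setting the set of sent IDs would need to live in durable shared storage so that a restarted or scaled-out service does not resend documents.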

Development of simulation workflows for BSM LLPs at LHCb: Development of simulation workflows for BSM LLPs at LHCb. Mentee: Valerii Kholoimov. Email the mentors (Arantza Oyanguren,Brij Kishor Jashal)

In this project, the candidate will become familiar with Monte Carlo (MC) simulations in the LHC environment, using generators such as Pythia and EvtGen, and will learn to work within the LHCb framework. They will learn to generate new physics scenarios concerning long-lived particles, particularly dark bosons and heavy neutral leptons. The software implementation of these models will be of general interest to the entire LHC community. Moreover, these simulations will be used to assess the physics performance of various trigger algorithms developed within the Allen framework and to determine LHCb’s sensitivity to detecting these types of particles. New particles in a wide mass range, from 500 to 4500 MeV, and lifetimes between 10 ps and 2 ns, will be generated and analyzed to develop pioneering selection procedures for recognizing these signals during Run 3 of LHC data collection.

Testing Real Time Analysis at LHCb using 2022 data: Testing the Real Time Analysis at LHCb using data collected in 2022. Mentee: Nazar Semkiv. Email the mentors (Michele Atzeni,Eluned Anne Smith)

In this project, the candidate will use data collected at the LHCb experiment in 2022 to test and compare the different machine learning algorithms used to select events, in addition to probing the performance of the new detector more generally. They will first develop a BDT-based machine learning algorithm to cleanly select the decay candidates of interest in 2022 data. Statistical subtraction techniques combined with likelihood estimations will be used to remove any remaining background contamination. The decay properties of these channels, depending on which machine learning algorithm was employed in the trigger, will then be compared and their agreement with simulated events examined. If needed, the existing trigger algorithms may be altered and improved. If time allows, the student will also have a first look at the angular distributions of the calibration channels being studied. Understanding such angular distributions is vital to the LHCb physics program.

Development of the generic vertex finder in HLT1-level trigger at LHCb: Development of a new algorithm that finds generic vertices in LHCb using GPUs. Email the mentors (Andrii Usachov)

The project is dedicated to the development of an innovative algorithm designed to identify displaced decay vertices within the LHCb experiment. The primary aim of this algorithm is to facilitate an inclusive search methodology for long-lived particles, instead of targeting specific decay signatures. This strategy requires the algorithm to effectively differentiate and suppress vertices associated with well-established long-lived hadrons, necessitating the integration of ML solutions. This approach will enable the analysis to be adaptable to a wide range of New Physics models and searches. Furthermore, the algorithm is constrained by the computational speed requirements of the online HLT1 trigger at LHCb, which makes the task challenging. The student will develop the algorithm to be run on the LHCb GPU farm. This endeavor offers a rich opportunity for the student to gain hands-on experience with CUDA, C++, and Python. It also involves applying ML algorithms from at least scikit-learn or PyTorch.

Enabling Julia code to run at scale with artefact caching: Develop HEP strategies for artefact caching in Julia to allow large scale running. Email the mentors (Graeme Stewart,Pere Mato)

Julia is a promising language for high-energy physics as it combines the ease of use and ergonomics of dynamic languages such as Python with the runtime speed of C or C++. One of Julia’s features is that it uses a JIT (or just-ahead-of-time) compiler to target the specific architecture on which it is being run. This, however, comes at the cost of compilation time, meaning that the first pass through the code is slower. If Julia is to be adopted widely in high-energy physics, and run at large scales, then it is important to mitigate this cost by pre-compiling the Julia code to be used on the system and avoiding the cost of recompiling on every node. This is accomplished by the use of the DEPOT_PATH setting. This will first be investigated on cluster systems at CERN, e.g., SWAN and lxbatch. Startup time and runtime will be investigated with increasingly large sets of jobs running. Then we shall extend our investigations to caching Julia code on CVMFS, which would allow scaling to running on the whole grid. Finally, we shall examine the possibility of precompiling artefacts for different microarchitectures, which would allow exploitation of the full power of modern CPUs in large-scale heterogeneous systems.
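
Julia's depot search path can be set through the JULIA_DEPOT_PATH environment variable, whose first entry is the depot Julia writes to. The sketch below shows one way a job wrapper might prepare that variable, placing a writable per-node depot ahead of a shared one holding pre-populated caches. The directory paths and the helper name are hypothetical, and whether caches in the shared depot are actually reused is exactly what the project sets out to investigate.

```python
import os


def julia_job_environment(shared_depot, local_depot, base_env=None):
    """Return an environment with a writable local depot ahead of a shared one."""
    env = dict(os.environ if base_env is None else base_env)
    # Entries are searched in order; the first one receives any new writes.
    env["JULIA_DEPOT_PATH"] = os.pathsep.join([local_depot, shared_depot])
    return env


env = julia_job_environment(
    shared_depot="/shared/julia-depot",  # hypothetical pre-populated cache
    local_depot="/tmp/julia-depot",      # hypothetical per-node scratch area
    base_env={},
)
```

The resulting dictionary could be passed as the `env` argument of `subprocess.run` when launching a Julia job.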

Packaging jet substructure observable tools: Repackage the EnergyFlow and Wasserstein tools for modern PyPI and conda-forge. Email the mentors (Matthew Feickert,Henry Schreiner)

In the area of jet substructure observable tools that leverage machine learning applications, the Python packages EnergyFlow and Wasserstein stand out for their use in the broader high energy physics theory and experimental communities. The original project creators and maintainers have left the projects, and though the projects have been placed in a community organization to aid support, the packaging tooling choices have made it difficult to maintain the projects. This project would update the original SWIG-based Python bindings for the project C++ code to use scikit-build-core and pybind11 and update the CI/CD system to allow for fluid building and testing of the packages. The end deliverables of the project would be for the project repositories to have transitioned their packaging systems, updated their CI/CD systems, made new releases to the PyPI package index, and released the packages to conda-forge.

Packaging the HEP simulation stack on conda-forge: Package the HEP simulation tools for conda-forge. Email the mentors (Matthew Feickert)

One common toolchain used in high energy physics for simulation is: MadGraph5_aMC@NLO to PYTHIA8 to Delphes. Installing these tools can be challenging at times, especially for new users. Conda-forge allows for distribution of arbitrarily complex binaries across multiple platforms through the conda/mamba/micromamba/pixi package management ecosystem. As ROOT has been successfully packaged and distributed on conda-forge along with the PYTHIA8 Python bindings, it should be possible to package all the components of the HEP simulation stack and distribute them on conda-forge. However, the interconnected nature of some of these tools requires that multiple dependencies are first packaged and distributed before the full stack can be. This project would attempt to package as many of the dependencies of the HEP simulation stack on conda-forge as possible, starting with FastJet and LHAPDF, and adding Python 3 bindings for the HepMC2 and HepMC3 conda-forge feedstocks.
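
The requirement that dependencies be packaged before the tools that need them amounts to a topological ordering of a dependency graph. The sketch below uses Python's standard `graphlib` module to compute one valid packaging order. The edges in the graph are assumptions chosen for illustration, not a verified list of what each tool requires.

```python
from graphlib import TopologicalSorter

# Illustrative graph only: package -> packages assumed to be needed first.
# These edges are assumptions for the example, not a verified dependency list.
depends_on = {
    "LHAPDF": set(),
    "FastJet": set(),
    "HepMC3": set(),
    "PYTHIA8": {"LHAPDF", "HepMC3"},
    "Delphes": {"PYTHIA8", "FastJet"},
    "MadGraph5_aMC@NLO": {"LHAPDF", "PYTHIA8"},
}

# static_order() yields each package only after all of its prerequisites.
packaging_order = list(TopologicalSorter(depends_on).static_order())
```

Several orders can be valid for the same graph; any of them guarantees that a feedstock is never attempted before the packages it builds against.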

Parameter optimization in ACTS framework: Developing an automatic parameter tuning pipeline for the ACTS track reconstruction software framework. Email the mentors (Lauren Tompkins,Rocky Bala Garg)

Particle tracking in HEP constitutes a crucial and intricate segment of the overall event reconstruction process. The track reconstruction algorithms involve numerous configuration parameters that require meticulous fine-tuning to accurately accommodate the underlying detector and experimental conditions. Traditionally, these parameters are manually adjusted, presenting an opportunity for improvement. Introducing an automated parameter tuning pipeline can significantly enhance tracking performance. The open-source track reconstruction software framework known as “A Common Tracking Software (ACTS)” offers a robust R&D platform with a well-organized tracking flow and support for multiple detector geometries. Our previous work within this domain focused on developing derivative-free auto-tuning techniques for track seeding and primary vertexing algorithms. The goal of the current project will be to further advance these efforts by researching and developing an optimization technique that is differentiable across the intricate tracking algorithms. Additionally, this project aims to establish a seamless pipeline that facilitates the application of these optimization techniques with minimal user concerns.
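
To make the idea of derivative-free auto-tuning concrete, the sketch below runs a plain random search over two parameters against a toy figure of merit. Random search is used here only as the simplest derivative-free baseline; it is not claimed to be the technique used in the previous work. The parameter names and the quadratic score are hypothetical stand-ins, and nothing here calls the ACTS API.

```python
import random


def toy_tracking_score(params):
    """Stand-in for a tracking figure of merit (higher is better).

    A real pipeline would run reconstruction and compute metrics such as
    efficiency and fake rate; this quadratic is purely illustrative.
    """
    return -((params["cut_a"] - 30.0) ** 2) - (params["cut_b"] - 3.0) ** 2


def random_search(score, bounds, n_trials=200, seed=0):
    """Sample parameters uniformly within bounds and keep the best trial."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        trial = {name: rng.uniform(lo, hi) for name, (lo, hi) in bounds.items()}
        value = score(trial)
        if value > best_score:
            best_params, best_score = trial, value
    return best_params, best_score


# Hypothetical parameter ranges.
bounds = {"cut_a": (0.0, 100.0), "cut_b": (0.0, 10.0)}
best_params, best_score = random_search(toy_tracking_score, bounds)
```

A differentiable approach, as targeted by the current project, would instead use gradients of the score with respect to the parameters, which typically needs far fewer evaluations when the number of parameters is large.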

Self-Supervised Approaches to Jet Assignment: Self-Supervised Approaches to Jet Assignment. Email the mentors (Javier Duarte)

Supervised machine learning has assisted various tasks in experimental high energy physics. However, using supervised learning to solve complicated problems, like assigning jets to resonant particles like Higgs bosons, requires a statistically representative, accurate, and fully labeled dataset. With the HL-LHC upgrade [1] in the near future, we will need to simulate an order of magnitude more events with a more complicated detector geometry to keep up with the recorded data [2], facing both budgetary and technological challenges [2, 3]. Therefore, it is desirable to explore how to assign jets to reconstruct particles via self-supervised learning (SSL) methods, which pretrain models on a large amount of unlabeled data and fine-tune those models on a small high-quality labeled dataset. Existing attempts [4-6] to use SSL in HEP focus on performing tasks at the jet or event levels. In this project, we propose to use the reconstruction of Higgs bosons from bottom quark jets as a test case to explore SSL for jet assignment. We will explore different neural network architectures, including PASSWD-ABC [7] for the self-supervised pretraining and SPANet [8, 9] for the supervised fine-tuning. The SSL model’s performance will be compared with a baseline model trained from scratch on the small labeled dataset. We will test if pretraining with diverse objectives [10] improves the model performance on downstream tasks like jet assignment or tagging. The code will be developed open source to help other SSL projects.

1. [HL-LHC] https://arxiv.org/abs/1705.08830
2. [Computing for HL LHC] https://doi.org/10.1051/epjconf/201921402036
3. [Computing summary] https://arxiv.org/abs/1803.04165
4. [JetCLR] https://arxiv.org/abs/2108.04253
5. [DarkCLR] https://arxiv.org/abs/2312.03067
6. [SSL for new physics] https://doi.org/10.1103/PhysRevD.106.056005
7. [PASSWD-ABC] https://arxiv.org/abs/2309.05728
8. [SPANet1] https://arxiv.org/abs/2010.09206
9. [SPANet2] https://arxiv.org/abs/2106.03898
10. [Pretraining benefits] https://arxiv.org/abs/2306.15063

Improving the Maintainability of the ATLAS Global Sequential Calibration: Refactor parts of the ATLAS Global Sequential Calibration for ATLAS, integrating tests for continuous validation. Email the mentors (Tobias Fitschen)

The Monte Carlo-based Global Sequential Calibration (GSC) is a calibration step intended to correct differences in response between quark- & gluon-initiated jets. It is part of the jet calibration recommendations used by nearly every ATLAS analysis utilising jets. The candidate will contribute towards a more robust framework for semi-automated validation of newly derived calibrations of this kind. They will refactor parts of the code, making it more robust and less error-prone while improving the documentation. Tests will be added at various steps of the procedure to ensure that newly derived calibrations do not introduce any unphysical effects.

Charged-particles reconstruction at Muon Colliders: Charged-particle reconstruction algorithms in future Muon Colliders. Email the mentors (Simone Pagan Griso,Sergo Jindariani)

A muon collider has been proposed as a possible path for future high-energy physics. The design of a detector for a muon collider has to cope with a large rate of beam-induced background, resulting in an unprecedentedly large multiplicity of particles entering the detector that are unrelated to the main muon-muon collision. The algorithms used for charged-particle reconstruction (tracking) need to cope with such “noise” and be able to successfully reconstruct the trajectories of the particles of interest, which results in a very large combinatorial problem that challenges the approaches adopted so far. The project consists of two complementary objectives. In the first, we will investigate how the tracking algorithms can be improved by utilizing directional information from specially arranged silicon-detector layers. The second, which will be the bulk of the project, aims to port the modern track reconstruction algorithms we are using from the older ILCSoft framework to the new Key4HEP software framework, validate them, and ensure they can be widely used by all collaborators. Key4HEP supports parallel multi-threaded execution of algorithms, which will be needed to scale performance to the needs of the Collaboration. The ample room for improvements, new creative solutions and optimization allows the fellow to combine acquiring good technical skills with innovating state-of-the-art tracking algorithms in this less-explored environment.

Data balancing tool for CMS data management: Adapting Rucio data rebalancing daemon for CMS use. Email the mentors (Hasan Ozturk,Panos Paparrigopoulos,Rahul Chauhan,Eric Wayne Vaandering,Dmytro Kovalskyi)

CMS data management relies on the open-source Rucio software framework to manage its data, which is stored at various sites around the world. A common challenge faced by the data management operators team is the overfilling of sites, either due to being targeted by specific physics data or from the gradual accumulation of data over time. Monitoring the capacity status of these sites and rebalancing data as needed is essential. To address such issues, the core Rucio team has developed a rebalancing daemon dedicated to managing data distribution across different sites. Originally designed for specific experiments, this rebalancing daemon contains code that is experiment-specific, limiting its applicability across diverse experiments or communities. Consequently, the main objective of this project is to refine the daemon to be experiment agnostic. Achieving this will enable the daemon to effectively serve not only CMS but also the broader Rucio community. This refinement process involves identifying and modifying any experiment-specific elements within the daemon, ensuring its seamless integration with CMS data management practices. Through this work, CMS will become more agile in maintaining balanced site storage, thereby avoiding scenarios where sites run out of space.

Smart job retries for CMS workload management system: Develop a tool to monitor and make smart decisions on how to retry CMS grid jobs. Email the mentors (Hassan Ahmed,Hasan Ozturk,Luca Lavezzo,Jennifer Adelman McCarthy,Zhangqier Wang,Dmytro Kovalskyi)

The CMS experiment runs its data processing and simulation jobs on the Worldwide LHC Computing Grid at a scale of ~100k jobs in parallel. Job failures are inevitable at this scale, so it is crucial to have an effective failure recovery system. The existing algorithm is agnostic to information about other jobs that run at the same site or belong to the same physics class. The objective of this project is to develop a tool which will monitor all the CMS grid jobs and make smart decisions on how to retry them by aggregating the data coming from different jobs across the globe. Such decisions could include reducing job submission to computing sites experiencing particular failures, changing the job configuration in case of inaccurate configurations, and not retrying potentially ill-configured jobs. This project has the potential to significantly improve the efficiency of the whole CMS computing grid, reducing wasted CPU cycles and increasing the overall throughput.
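
One of the decisions described above, steering retries away from sites with widespread failures, can be sketched as a small aggregation over job reports. The report fields, the 50% failure limit, the minimum sample size, and the decision labels are all assumptions for this example and do not reflect the CMS workload management system.

```python
from collections import Counter


def retry_decisions(job_reports, site_failure_limit=0.5, min_jobs=3):
    """Decide per failed job how to retry, using site-wide failure rates."""
    totals, failures = Counter(), Counter()
    for report in job_reports:
        totals[report["site"]] += 1
        if report["failed"]:
            failures[report["site"]] += 1

    # A site is considered problematic only with enough jobs to judge it.
    bad_sites = {
        site
        for site, total in totals.items()
        if total >= min_jobs and failures[site] / total > site_failure_limit
    }

    decisions = {}
    for report in job_reports:
        if not report["failed"]:
            continue
        # Avoid resubmitting to a site where most jobs are failing.
        decisions[report["job_id"]] = (
            "retry_elsewhere" if report["site"] in bad_sites else "retry_same_site"
        )
    return decisions


# Hypothetical job reports from two sites.
reports = [
    {"job_id": 1, "site": "T2_A", "failed": True},
    {"job_id": 2, "site": "T2_A", "failed": True},
    {"job_id": 3, "site": "T2_A", "failed": False},
    {"job_id": 4, "site": "T2_B", "failed": True},
    {"job_id": 5, "site": "T2_B", "failed": False},
    {"job_id": 6, "site": "T2_B", "failed": False},
]
decisions = retry_decisions(reports)
```

The same aggregation could be keyed on physics class or workflow instead of site to spot ill-configured jobs that fail wherever they run.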

CI/CD and Automation of Manual Operations: Automate manual operations and implement CI/CD for CMS Production & Reprocessing group. Email the mentors (Hassan Ahmed,Hasan Ozturk,Luca Lavezzo,Jennifer Adelman McCarthy,Zhangqier Wang,Dmytro Kovalskyi)

The Production & Reprocessing (P&R) group is responsible for maintaining and operating the CMS central workload management system, which processes hundreds of physics workflows daily to produce the data that physicists use in their analyses. Requests with a similar physics goal are grouped as ‘campaigns’. P&R manages the lifecycle of hundreds of campaigns, each with its unique needs. The objective of this project is to automate the checks that are performed manually on these campaigns. This involves creating a system to automatically set up, verify, and activate new campaigns, along with managing data storage and allocation. The second part of the project focuses on implementing a Continuous Integration and Continuous Deployment (CI/CD) pipeline for efficiently deploying and maintaining software services. This will include converting manual testing procedures into automated ones, improving overall efficiency and reducing errors. Tools such as GitLab Pipelines for CI/CD, Python for scripting, and various automated testing frameworks will be employed. This initiative is designed to streamline operations, making them more efficient and effective.

Point-cloud diffusion models for TPCs: Build, train, and tune a point-cloud diffusion model for the Active-Target Time Projection Chamber. Mentee: Artem Havryliuk. Email the mentors (Michelle Kuchera)

This project aims to build a conditional point-cloud diffusion model to simulate detector response to events in time projection chambers (TPCs), with a focus on the Active-Target Time Projection Chamber (AT-TPC). Current analysis and simulation software is unable to model the noise signature in the AT-TPC. Prior explorations have successfully modeled this noise with a cycle-GAN, but in a lower-dimensional space than the detector resolution. The IRIS-HEP project will be completed in collaboration with the ALPhA research group at Davidson College. The applicant should have a foundation in ML and be comfortable with Python. This is a newer type of deep learning generative architecture, which will require the applicant to read, implement, and modify approaches taken in the ML literature using PyTorch or TensorFlow.

Developing an automatic differentiation and initial parameters optimisation pipeline for the particle shower model: Developing an automatic differentiation and initial parameters optimisation pipeline for the particle shower model. Email the mentors (Lukas Heinrich,Michael Kagan)

The goal of this project is to develop a differentiable simulation and optimization pipeline for Geant4. The narrow task of this Fellowship project is to develop a trial automatic differentiation and backpropagation pipeline for the Markov-like stochastic branching process that is modeling a particle shower spreading inside a detector material in three spatial dimensions.

Improve testing procedures for prompt data processing at CMS: Improve functional testing before deployment of critical changes for CMS Tier-0. Mentee: Mycola Kolomiiets. Email the mentors (Dmytro Kovalskyi,German Giraldo,Jan Eysermans)

The CMS Tier-0 service is responsible for the prompt processing and distribution of the data collected by the CMS Experiment. Thorough testing of any code or configuration changes for the service is critical for timely data processing. The existing system has a Jenkins pipeline that executes a large-scale “replay” of the data processing, using old data, as the final functional test before deployment of critical changes. The project focuses on integrating unit tests and smaller functional tests into the integration pipeline to speed up testing and reduce resource utilization.

Predict data popularity to improve its availability for physics analysis: Predict data popularity to improve its availability for physics analysis. Mentee: Andrii Len. Email the mentors (Dmytro Kovalskyi,Rahul Chauhan,Hasan Ozturk)

The CMS data management team is responsible for distributing data among computing centers worldwide. Given the limited disk space at these sites, the team must dynamically manage the available data on disk. Whenever users attempt to access unavailable data, they are required to wait for the data to be retrieved from permanent tape storage. This delay impedes data analysis and hinders the scientific productivity of the collaboration. The objective of this project is to create a tool that utilizes machine learning algorithms to predict which data should be retained, based on current usage patterns.

Towards an end-end ML event reconstruction algorithm: Towards an end-end ML event reconstruction algorithm. Email the mentors (Javier Duarte,Farouk Mokhtar,Joosep Pata)

The most common approach to reconstructing full-scale events at general-purpose detectors (such as ATLAS and CMS) is a particle-flow-like algorithm [1–3], which combines low-level information (PF-elements) from different sub-detectors to reconstruct stable particles (PF-candidates). Efforts are under way to improve upon the current PF algorithm, and particle reconstruction in general, with machine learning (ML) methods. Within CMS, work has been conducted [4] to develop a graph neural network (GNN) model named MLPF [5] to reproduce the existing PF algorithm. The proposed MLPF algorithm has the advantage of running on heterogeneous computing architectures and may be efficient when scaling up to accommodate the high-luminosity LHC upgrade. The current MLPF model reconstructs high-level PF-candidates, in a supervised fashion [6], from low-level PF-elements. These “low-level inputs” have already gone through several steps of reconstruction, such as track reconstruction and energy clustering. Although each reconstruction step goes through several layers of development and validation, each step is optimized independently. This is not ideal since, for example, the tracking and clustering steps may not be well optimized for the final task if they are tuned separately from the full particle reconstruction. It is therefore interesting to explore an ML-based algorithm capable of performing the reconstruction in one shot, in an end-to-end fashion. The first step towards this is to train an MLPF-like model which takes as input the reconstructed tracks and, instead of the calorimeter clusters, the calorimeter cell energy deposits. The project will entail exploring the published CLIC dataset [7], which contains hit-based information in the form of calorimeter energy deposits. It will also involve training a GNN-based model, and possibly other models, and comparing them with the current state-of-the-art results of the cluster-based model.

[1] https://www.sciencedirect.com/science/article/abs/pii/0168900295001387?via%3Dihub
[2] https://arxiv.org/abs/1706.04965
[3] https://arxiv.org/abs/1703.10485
[4] https://github.com/jpata/particleflow
[5] https://arxiv.org/abs/2101.08578
[6] https://arxiv.org/abs/2303.17657
[7] https://zenodo.org/records/8414225

Refactoring Uproot’s AwkwardForth implementation: Keeping the functionality of Uproot’s accelerated reading through AwkwardForth, but making it more maintainable by removing mutable state/coding it in a functional style. Mentee: Seth Bendigo. Email the mentors (Ioana Ifrim,Jim Pivarski)

Uproot is a Python library for reading and writing ROOT files (the most common file format in particle physics). While it is relatively fast at reading “columnar” data, either arrays of numbers or arrays of numbers that are grouped into variable-length lists, any other data type requires iteration, which is a performance limitation in the Python language. (“for” loops in Python are hundreds of times slower than in compiled languages.) To improve this situation, we introduced a domain-specific language (DSL) called AwkwardForth, in which loops are much faster to execute than they are in Python (again by factors of hundreds). This language was created in 2021 (https://arxiv.org/abs/2102.13516) and added to Uproot in 2022 (https://arxiv.org/abs/2303.02202). In the end, an example data structure (std::vector<std::vector>) could be read 400× faster with AwkwardForth than with Python. Users of Uproot don't have to opt in or change their code; it just runs faster. That would be the end of the story, except that the AwkwardForth-generating code in Uproot has been very hard to maintain. In part, it's because it's doing something complicated: generating code that runs later or generating code that generates code that runs later. But it is also more complicated than it needs to be, with Python objects that change their own attributes in arbitrary ways as information about what AwkwardForth needs to be generated accumulates. The code would be much easier to read and reason about if it were stateless or append-only (see: functional programming), and it easily could be. This project would be to restructure the AwkwardForth-generating code in a functional style, to "remove the moving parts." To be clear, the project will not require you to understand the AwkwardForth that is being generated (though that's not a bad thing), and it will not require you to figure out how to generate the right AwkwardForth for a given data type. This part of the problem has been solved and there are many unit tests that can check correctness, to allow you to do test-driven development. The project is about software engineering: how to structure code so that it can be read and understood, while keeping the problem-solving aspect unchanged.
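
To illustrate what a stateless, append-only style of code generation might look like, here is a minimal sketch. It does not use Uproot's actual classes: `Fragment`, `read_number`, `read_list`, and the toy instruction strings are hypothetical placeholders, and the emitted text is not meant to be valid AwkwardForth.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Fragment:
    """An immutable piece of generated code (hypothetical, not Uproot's API)."""

    declarations: Tuple[str, ...] = ()
    instructions: Tuple[str, ...] = ()

    def then(self, other: "Fragment") -> "Fragment":
        # Combining fragments returns a new object; neither input is modified.
        return Fragment(
            self.declarations + other.declarations,
            self.instructions + other.instructions,
        )

    def render(self) -> str:
        # Declarations first, then instructions, one per line.
        return "\n".join(self.declarations + self.instructions)


def read_number(name: str) -> Fragment:
    # Toy instruction strings: placeholders, not real AwkwardForth.
    return Fragment((f"output {name}",), (f"read-into {name}",))


def read_list(name: str, content: Fragment) -> Fragment:
    # A list reader wraps the instructions of its content in a loop.
    header = Fragment(
        (f"output {name}-offsets",),
        (f"read-count {name}", "loop-begin"),
    )
    footer = Fragment((), ("loop-end",))
    return header.then(content).then(footer)


inner = read_number("x")
outer = read_list("xs", inner)
generated = outer.render()
```

Because every function returns a new `Fragment` and nothing is mutated, `inner` is still exactly what `read_number("x")` produced after `outer` has been built, which is the property that makes this style easy to test and reason about.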

Snakemake backend for RECAST: Implement a Snakemake backend for RECAST workflows. Mentee: Andrii Povsten. Email the mentors (Matthew Feickert,Lukas Heinrich)

RECAST is a platform for systematic interpretation of LHC searches. It reuses preserved analysis workflows from the LHC experiments, which is now possible with containerization and tools such as REANA. When RECAST was first implemented for ATLAS, a workflow language with sufficient Linux container support was not available, so the yadage workflow system was created. In the years since yadage was created, the Snakemake workflow management system has become increasingly popular in the broader scientific community and has developed mature Linux container support. This project would aim to implement another RECAST backend using Snakemake as a modern alternative to yadage.

Recasting of IRIS-HEP Analysis Grand Challenge: Implement the CMS open data AGC analysis with RECAST and REANA. Email the mentors (Matthew Feickert,Tibor Simko,Kyle Cranmer,Lukas Heinrich)

RECAST is a platform for systematic interpretation of LHC searches. It reuses preserved analysis workflows from the LHC experiments, which is now possible with containerization and tools such as REANA. An as-yet-unrealized component of the IRIS-HEP Analysis Grand Challenge (AGC) is reuse and reinterpretation of the analysis. This project would aim to preserve the AGC CMS open data analysis and the accompanying distributed infrastructure and implement a RECAST workflow allowing REANA integration with the AGC. A key challenge of the project is creating a preservation scheme for the associated Kubernetes distributed infrastructure.

Intelligent Caching for HSF Conditions Database: investigate patterns in conditions database accesses. Mentee: Ernest Sorokun. Email the mentors (Lino Gerlach)

Conditions data refers to additional information collected in particle physics experiments beyond what is recorded directly by the detector. This data plays a critical role in many experiments, providing crucial context and calibration information for the recorded measurements. However, managing conditions data poses unique challenges, particularly due to the high access rates involved. The High-Energy Physics Software Foundation (HSF) has developed an experiment-agnostic approach to address these challenges, which has already been successfully deployed for the sPHENIX experiment at the Brookhaven National Laboratory (BNL). The project focuses on investigating the access patterns for this conditions database to gather insights that will enable the development of an optimized caching system in the future. Machine Learning may be used for pattern recognition.
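
One simple way to start looking at access patterns is to replay an access log against a simulated cache and measure the hit rate. The sketch below does this for a least-recently-used (LRU) policy. The payload names and the trace are made up for illustration and do not come from the sPHENIX deployment; LRU is just one candidate policy, not a statement about what the project will adopt.

```python
from collections import OrderedDict
from typing import Hashable, Iterable


def lru_hit_rate(accesses: Iterable[Hashable], capacity: int) -> float:
    """Replay an access trace against an LRU cache and return the hit fraction."""
    cache: "OrderedDict[Hashable, None]" = OrderedDict()
    hits = 0
    total = 0
    for key in accesses:
        total += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)  # mark as most recently used
        else:
            cache[key] = None
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used entry
    return hits / total if total else 0.0


# Hypothetical trace of requested conditions payloads.
trace = ["calib_A", "calib_B", "calib_A", "calib_C", "calib_A", "calib_B"]
rate_small = lru_hit_rate(trace, capacity=1)
rate_large = lru_hit_rate(trace, capacity=3)
```

Comparing hit rates across cache sizes and eviction policies on real access logs is one way to quantify how much an optimized caching system could help.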

Measuring energy consumption of HEP software on user analysis facilities: Implementing energy consumption benchmarks on different analysis platforms and facilities. Email the mentors (Caterina Doglioni)

Benchmarks for software energy consumption are starting to appear (see e.g. the SCI score) alongside more common performance benchmarks. In this project, we will pilot the implementation of selected software energy consumption benchmarks on two different facilities for user analysis:

  • the Virtual Research Environment, a prototype analysis platform for the European Open Science Cloud.
  • Coffea-casa, a prototype Analysis Facility (AF), which provides services for “low-latency columnar analysis.”

We will then test them with simple user software pipelines. The candidate will work in collaboration with another IRIS-HEP fellow investigating energy consumption benchmarks for ML algorithms, and alongside a team of students and interns working on the selection and implementation of the benchmarks.

Estimating the energy cost of ML algorithms: Test and validate power consumption estimates for a ML algorithm for data compression. Mentee: Leonid Didukh. Email the mentors (Caterina Doglioni,Alexander Ekman)

Baler is a machine-learning based lossy data compression tool. In this project, we will review the literature concerning energy consumption for ML models and obtain estimates of the energetic cost of training and hyperparameter optimization following the guidelines of the Green Software Foundation. The candidate will work in a team of students and interns who are developers of the compression tool, and will be able to suggest improvements and insights towards Baler’s energy efficiency.
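
One metric published by the Green Software Foundation is the Software Carbon Intensity (SCI) score, defined as SCI = ((E × I) + M) per R, where E is the energy consumed, I is the carbon intensity of that energy, M is the embodied emissions of the hardware, and R is a functional unit. A minimal sketch of the arithmetic follows. The function name and all numbers are hypothetical placeholders, not measurements of Baler, and the choice of "one training run" as the functional unit is an assumption.

```python
def sci_score(
    energy_kwh: float,
    carbon_intensity_g_per_kwh: float,
    embodied_g: float,
    functional_units: float,
) -> float:
    """Return gCO2eq per functional unit, following SCI = ((E * I) + M) per R."""
    if functional_units <= 0:
        raise ValueError("functional_units must be positive")
    operational_g = energy_kwh * carbon_intensity_g_per_kwh
    return (operational_g + embodied_g) / functional_units


# Hypothetical inputs: 2 kWh consumed, 400 gCO2eq/kWh grid intensity,
# 200 gCO2eq of embodied emissions, amortized over 10 training runs.
score = sci_score(
    energy_kwh=2.0,
    carbon_intensity_g_per_kwh=400.0,
    embodied_g=200.0,
    functional_units=10,
)
```

With these placeholder inputs the score is 100 gCO2eq per training run; in practice the energy term would come from measurement or estimation tools rather than being typed in by hand.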

Rucio-S3-compatible access interface for analysis facilities: Add S3 compatible access interface to Rucio. Mentee: Kyrylo Meliushko. Email the mentors (Lukas Heinrich,Matthew Feickert,Mario Lassnig)

Rucio is an open source software framework that provides functionality to scientific collaborations to organize, manage, monitor, and access their distributed data and data flows across heterogeneous infrastructures. Rucio was originally developed to meet the requirements of the high-energy physics experiment ATLAS, and is continuously enhanced to support diverse scientific communities. Since 2016 Rucio has orchestrated multiple Exabytes of data access and data transfers globally. With this project we seek to enhance Rucio to support a new mechanism for analysis facilities, which are oriented towards object stores, in order to provide a destination for HEP analyzers to store data products produced in their research in a portable, shareable and standardized way.

Rust interfaces to I/O libraries: Explore the addition of a Rust interface to the PODIO library. Email the mentors (Benedikt Hegner)

The project’s main goal is to prototype a Rust interface to the PODIO library. Data models in PODIO are declared with a simple programming-language-agnostic syntax. In this project we will explore how the PODIO data model concepts can best be mapped onto Rust. At the same time, we will investigate how much Rust’s macro system can support the function of PODIO. For the other supported languages (Python, C++ and, experimentally, Julia), PODIO generates all of the source code. With proper usage of Rust macros this code generation could be kept to a minimum.

Julia interface to PODIO: Add Julia Interface to the PODIO library. Email the mentors (Benedikt Hegner,Graeme Stewart)

The project’s main goal is to add a proper Julia back-end to PODIO. A previous Google Summer of Code project worked on an early prototype, which showed the feasibility. The aim is to complete the feature set, and do thorough (performance) testing afterwards.

Improving the Analysis Grand Challenge (AGC) machine learning training workflow: Improving the Analysis Grand Challenge (AGC) machine learning training workflow. Email the mentors (Elliott Kauffman,Alexander Held,Oksana Shadura)

The Analysis Grand Challenge (AGC) is performing the last steps in an analysis pipeline at scale to test workflows envisioned for the HL-LHC. Since ML methods have become so widespread in analysis and these analyses also need to be scaled up for HL-LHC data, ML training and inference were also integrated into the AGC analysis pipeline. The goal of this project is to improve the current ML training implementation with Kubeflow Pipelines, an open source platform for implementing MLOps, providing a framework for building, deploying, and managing machine learning workflows.

Julia for the Analysis Grand Challenge: Implement an analysis pipeline for the Analysis Grand Challenge (AGC) using the JuliaHEP ecosystem. Mentee: Atell-Yehor Krasnopolski. Email the mentors (Jerry Ling,Alexander Held)

The project’s main goal is to implement the AGC pipeline using Julia to demonstrate usability and as a test of performance. New utility packages (built on existing packages such as FHist.jl and Dagger.jl) can be expected, especially for systematics handling and out-of-core orchestration. At the same time, the project can explore using RNTuple instead of TTree for AGC data storage. As the interface is exactly transparent, this goal mainly requires data conversion unless performance bugs are spotted. This will help inform the transition at LHC experiments in the near future (Run 4).

Horizontal scaling of HTTP caches: learn how to automate load-balancing with Kubernetes. Email the mentors (Brian Bockelman,Brian Lin,Mátyás Selmeci)

The OSG offers an integrated software stack and infrastructure used by the High-Energy Physics community to meet their computational needs. Frontier Squid is part of this software stack and acts as an HTTP proxy, caching requests to improve network usage. We seek a fellow to turn our existing single cache Kubernetes deployment into one that can scale horizontally using the same underlying storage for its cache.

Software development for the Rucio Scientific Data Management system: Rucio core developments for large-scale data management. Mentee: Lev Pambuk. Email the mentors (Mario Lassnig,Martin Barisits)

The Rucio system is an open and community-driven data management system for data organisation, management, and access of scientific data. Several communities and experiments have adopted Rucio as a common solution, therefore we seek a dedicated software engineer to help implement much-wished-for features and extensions in the Rucio core. The selected candidate will focus on producing software for several Rucio components. There is a multitude of potential topics, also based on the candidate’s interests, that can be tackled. Examples include, but are not limited to, (1) integrating static type checking capabilities into the framework as well as improving its runtime efficiency, (2) continuing the documentation work for automatically generated API and REST interface documentation, (3) evolving the Rucio Upload and Download clients to new complex workflows suitable to modern analysis, (4) continuing the development work on the new Rucio Web User Interface, and many more. The selected candidate will participate in a large distributed team using best industry practices such as code review, continuous integration, test-driven development, and blue-green deployments. It is important to us that the candidate bring their creativity to the team, therefore we encourage them to also help with developing and evaluating new ideas and designs.

Geometric deep learning for high energy collision simulations: Simulate the detector response to high energy particle collisions using graph neural networks. Email the mentors (Javier Duarte,Raghav Kansal)

Geometric deep learning and graph neural networks (GNNs) have proven to be especially successful for machine learning tasks such as classification and reconstruction for high energy physics (HEP) data, such as jets and calorimeter showers produced at the Large Hadron Collider. This project extends their application to the computationally costly task of simulations, building off of existing work in this area [1]. Possible and complementary research directions are: (1) conditional generation of jets using auxiliary conditioning networks [2-3]; (2) application to generator-level showering and hadronization simulations, and investigating merging schemes with hard matrix elements; (3) developing an attention-based architecture, which has recently found success in jet classification [4]; (4) applications to shower datasets [5-6].

[1] https://arxiv.org/abs/2106.11535
[2] https://arxiv.org/abs/1610.09585
[3] https://cds.cern.ch/record/2701779/files/10.1051_epjconf_201921402010.pdf
[4] https://arxiv.org/abs/2202.03772
[5] https://zenodo.org/record/3603086
[6] https://calochallenge.github.io/homepage/

PV-Finder ACTS example: Adapting a Machine Learning Algorithm for use in ACTS. Mentee: Layan AlSarayra. Email the mentors (Lauren Tompkins,Rocky Bala Garg,Michael Sokoloff)

PV-Finder is a hybrid deep learning algorithm designed to identify the locations of proton-proton collisions (primary vertices) in the Run 3 LHCb detector. The underlying structure of the data and the approach to learning the locations of primary vertices may be useful for other detectors, including those at the LHC, such as ATLAS. The algorithm is approximately factorizable. Starting with reconstructed tracks, a kernel density estimator (KDE) can be calculated by a hand-written algorithm that reduces sparse point clouds of three-dimensional track data to rich one-dimensional data sets amenable to processing by a deep convolutional network. This network is called a kde-to-hist algorithm, and its predicted histograms are easily interpreted by a heuristic algorithm. A separate tracks-to-kde algorithm uses track parameters evaluated at their points of closest approach to the beamline as input features and predicts an approximation to the KDE. These two algorithms can be merged and the combined model trained to predict the easily interpreted histograms directly from track information. The incumbent will work under the joint supervision of Rocky Bala Garg and Lauren Tompkins (ATLAS physicists and ACTS developers, Stanford), and Michael Sokoloff (an LHCb physicist, Cincinnati) to adapt the pv-finder algorithms to process data generated by ACTS rather than LHCb.

Refactoring fastjet with Awkward LayoutBuilder: Replacing the fastjet implementation with safe, maintainable LayoutBuilder while retaining its interface. Email the mentors (Javier Duarte,Jim Pivarski)

fastjet is a Python interface to the popular FastJet particle-clustering package, which is written in C++. fastjet is unique in that it offers a vectorized interface to FastJet’s algorithms, allowing Python users to analyze many collision events in a single function call, avoiding the overhead of Python iteration. Collections of particles and jets with different lengths per event are managed by Awkward Array. Although the fastjet package functions and is currently used in HEP analysis, its Python-C++ interface predates LayoutBuilder, which simplifies the construction of Awkward Arrays in C++, is easier to maintain, and avoids the dangers of raw array handling. This project would be to refactor fastjet to use the new abstraction layer, maintaining its well-tested interface, and possibly adding new algorithms and functionality, such as jet groomers and other transformations.

Dask in a HEP Analysis Facility at Scale: How fast can large-scale HEP data analysis be performed using Dask and Awkward Arrays? Email the mentors (Alexander Held,Oksana Shadura,Nick Smith,Jim Pivarski,Ianna Osborne)

Coffea-casa is a prototype Analysis Facility (AF), which provides services for “low-latency columnar analysis.” Dask is an industry-standard scale-out mechanism for array-oriented data processing in Python, and Dask capabilities were recently added to Uproot and Awkward Array, making it possible to analyze jagged particle physics data from ROOT files as Dask arrays for the first time. The aim of this project is to determine how well these capabilities scale in a real AF environment. The project could be divided into multiple sub-steps (depending on the timeline): adding CPU-intensive CMS systematics to a defined analysis workflow and/or performance testing the uproot.dask and dask-awkward implementations with a defined workload, learning best practices, and performance testing/tuning in single- and multi-user environments.

Analysis Grand Challenge with ATLAS PHYSLITE data: Create a columnar analysis prototype using ATLAS PHYSLITE data. Email the mentors (Vangelis Kourlitis,Alexander Held,Matthew Feickert)

The IRIS-HEP Analysis Grand Challenge (AGC) is a realistic environment for investigating how high energy physics data analysis workflows scale to the demands of the High-Luminosity LHC (HL-LHC). It captures relevant workflow aspects from data delivery to statistical inference. The AGC has so far been based on publicly available Open Data from the CMS experiment. The ATLAS collaboration aims to use a data format called PHYSLITE at the HL-LHC, which differs slightly from the data formats used so far within the AGC. This project involves implementing the capability to analyze ATLAS PHYSLITE data within a workflow similar to the AGC (the columnar analysis prototype) and optimizing the related performance under large volumes of data. In addition, the evaluation of systematic uncertainties for ATLAS with PHYSLITE is expected to differ in some aspects from what the AGC has considered thus far. This project will also investigate workflows to integrate the evaluation of such sources of uncertainty for ATLAS.

ROOT’s RDataFrame for the Analysis Grand Challenge: Develop and test an analysis pipeline using ROOT’s RDataFrame for the next iteration of the Analysis Grand Challenge. Mentee: Andrii Falko. Email the mentors (Enrico Guiraud,Alexander Held)

The IRIS-HEP Analysis Grand Challenge (AGC) aims to develop examples of realistic, end-to-end high-energy physics analyses, as well as demonstrate the advantages of modern tools and technologies when applied to such tasks. The next iteration of the AGC (v2) will put the capabilities of modern analysis interfaces such as Coffea and ROOT’s RDataFrame under further test, for example by including more complex systematic variations and sophisticated machine learning techniques. The project consists in the investigation and implementation of such new developments in the context of RDataFrame as well as their benchmarking on state-of-the-art analysis facilities. The goal is to gain insights useful to guide the future design of both the analysis facilities and the applications that will be deployed on them.

Augmenting Line-Segment Tracking with Graph Neural Network: Leveraging Graph Neural Network to utilize graph input data produced by Line-Segment Tracking. Mentee: Povilas Pugzlys. Email the mentors (Philip Chang)

The increase in pile-up at the upcoming HL-LHC will present a challenge to event reconstruction for the CMS experiment. The single largest contribution to the total reconstruction time comes from charged-particle tracking. Without algorithmic innovation, the charged-particle reconstruction time is projected to increase exponentially. Given this increase, and the fact that the computational performance of single-thread processors is plateauing, the CMS Collaboration estimates that the computing resource requirement will exceed the projected computing capabilities by a factor of 2 to 5. This could seriously hinder physicists from publishing timely scientific results. This motivates a new approach to tracking: developing algorithms that are parallel in nature, to alleviate problems of combinatorics, and that can leverage industry advancements in parallel computing such as GPUs. In light of this, the Line-Segment Tracking (LST) project was started. LST leverages the CMS outer tracker’s doublet modules to build mini-doublets (a pair of hits in each layer of the doublet layer) in parallel, and subsequently builds line-segments by connecting consistent pairs of mini-doublets across different logical layers of the tracker, all done on high-performance GPUs. Eventually, the line-segments are linked together iteratively to form long chains of line-segments, producing a list of track candidates. The parallel nature of the LST algorithm lends itself naturally to GPU usage. The project has produced performance on par with the existing tracking alternatives and has been integrated into the central CMS software as a step towards production. As the LST algorithm creates line-segments and links them to create track candidates, a graph representation of hits and the links between them is naturally obtained. In other words, LST can also be thought of as a fast graph-producing algorithm. The project will take the graph data and develop GNN models that classify linkings. We plan to integrate the GNN model into the LST algorithm to augment its capability to produce high-quality track candidates in a shorter time while keeping the same or better tracking performance. A solution for “one-shot” linking of long chains of line-segments in one algorithm, instead of through iteration, will also be studied.

Estimated Timeline:

  • Week 1/2: Understanding the preliminary LST GNN workflow for Line Segment classification
  • Week 3: Creating an example of running the Line Segment classification inference in a C++ environment with TorchScript
  • Week 4/5: Integrating the inference with LST’s CUDA code to run the inference on GNN
  • Week 5: Validating the implementation in the LST framework
  • Week 6/7: Optimizing the use of the GNN inferences and measuring the performance gain in the metrics of the LST framework (i.e. efficiency, fake rate, and duplicate rate)
  • Week 8/9: Performing large-scale hyperparameter optimization to find the best resulting model architecture
  • Week 10/11: Performing research and development on extending the ability to classify Triplets, and beyond, with the Line Graph transformation approach, which would enable “one-shot” inference
  • Week 12: Wrapping up the project, documenting and summarizing the findings to allow for next steps

GNN Tracking: Reconstruct the trajectories of particles with graph neural networks. Mentee: Refilwe Bua. Email the mentors (Kilian Lieret,Gage deZoort)

In the GNN tracking project, we use graph neural networks (GNNs) to reconstruct trajectories (“tracks”) of elementary particles traveling through a detector. This task is called “tracking” and is different from many other problems that involve trajectories:

  • there are several thousand particles that need to be tracked at once,
  • there is no time information (the particles travel too fast),
  • we do not observe a continuous trajectory but instead only around five points (“hits”) along the way in different detector layers.

The task can be described as a combinatorically very challenging “connect-the-dots” problem, essentially turning a cloud of points (hits) in 3D space into a set of O(1000) trajectories. Expressed differently, each hit (containing not much more than the x/y/z coordinate) must be assigned to the particle/track it belongs to.

A conceptually simple way to turn this problem into a machine learning task is to create a fully connected graph of all points and then train an edge classifier to reject any edge that doesn’t connect points that belong to the same particle. In this way, only the individual trajectories remain as components of the initial fully connected graph. However, this strategy does not seem to lead to perfect results in practice. The approach of this project uses this strategy only as the first step to arrive at “small” graphs. It then projects all hits into a learned latent space with the model learning to place hits of the same particle close to each other, such that the hits belonging to the same particle form clusters.
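
The first step described above, rejecting edges and reading the tracks off as connected components, can be sketched without any ML library. In this sketch the edge scores are made-up stand-ins for the output of a trained edge classifier, and the 0.5 threshold is an arbitrary assumption; it is an illustration of the idea, not the project's code.

```python
from typing import Dict, List, Tuple


def tracks_from_edges(
    n_hits: int,
    scored_edges: List[Tuple[int, int, float]],
    threshold: float = 0.5,
) -> List[List[int]]:
    """Keep edges scoring above the threshold, then return the connected
    components of the remaining graph as lists of hit indices."""
    parent = list(range(n_hits))

    def find(i: int) -> int:
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b, score in scored_edges:
        if score > threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups: Dict[int, List[int]] = {}
    for hit in range(n_hits):
        groups.setdefault(find(hit), []).append(hit)
    return sorted(groups.values())


# Hypothetical classifier output: (hit index, hit index, edge score).
edges = [(0, 1, 0.9), (1, 2, 0.8), (2, 3, 0.1), (3, 4, 0.95), (0, 4, 0.2)]
tracks = tracks_from_edges(5, edges)
```

Here the two low-scoring edges are rejected, leaving hits 0, 1, 2 in one component and hits 3, 4 in another. A single wrongly accepted edge would merge two tracks, which is one way to see why edge classification alone does not give perfect results.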

The project code together with documentation and a reading list is available on GitHub and uses PyTorch Geometric. See also our GSoC proposal for the same project, which lists prerequisites and possible tasks.

REANA workflow for Dark Matter Searches: Implement a REANA workflow for dark matter searches at RCFM. Mentee: Sambridhi Deo. Email the mentors (Matthew Feickert,Lukas Heinrich,Amy Roberts,Giordon Stark)

REANA is a platform for reproducible data analysis workflows that can be run at scale. REANA has been used extensively for running containerized workflows of LHC experiments, like ATLAS, and for reinterpretation of published analyses. This project would aim to implement a REANA workflow for a galaxy rotation-curve fitting analysis (RCFM) to improve replicability and to provide a starting point for future work.

CMS RECAST Example: Implement a CMS analysis with RECAST and REANA. Email the mentors (Kyle Cranmer,Matthew Feickert)

RECAST is a platform for systematic interpretation of LHC searches. It reuses preserved analysis workflows from the LHC experiments, which is now possible with containerization and tools such as REANA. Recently, IRIS-HEP and the HEP Software Foundation (HSF) supported an analysis preservation bootcamp at CERN teaching these tools. Furthermore, the ATLAS experiment is now actively using RECAST. We seek a member of CMS to incorporate a CMS analysis into the system with support from IRIS-HEP, REANA, and RECAST developers. The analysis can be done using CMS internal data or CMS open data and would use CMS analysis tooling.

Charged-particles reconstruction at Muon Colliders: Simulation and Charged-particle reconstruction algorithms in future Muon Colliders. Mentee: Chris Sellgren. Email the mentors (Simone Pagan Griso,Sergo Jindariani)

A muon collider has been proposed as a possible path for future high-energy physics. The design of a detector for a muon collider has to cope with a large rate of beam-induced background, resulting in an unprecedentedly large multiplicity of particles entering the detector that are unrelated to the main muon-muon collision. The algorithms used for charged particle reconstruction (tracking) need to cope with such “noise” and be able to successfully reconstruct the trajectories of the particles of interest, which results in a very large combinatorial problem that challenges the approaches adopted so far. The project consists of two complementary objectives. In the first one, we will investigate how the tracking algorithms can be improved by utilizing directional information from specially-arranged silicon-detector layers; this requires improving the simulation of the detector as well. The second one aims to port the modern track reconstruction algorithms we are using from the older ILCSoft framework to the new Key4HEP software framework, which supports parallel multi-threaded execution of algorithms and will be needed to scale performance to the needs of the Collaboration, and to validate them and ensure they can be widely used by all collaborators.

Bayesian Analysis with pyhf: Build a library on top of the pyhf Python API to allow for Bayesian analysis of HistFactory models. Mentee: Malin Horstmann. Email the mentors (Matthew Feickert,Lukas Heinrich)

Collider physics analyses have historically favored Frequentist statistical methodologies, with some exceptions of Bayesian inference in LHC analyses through use of the Bayesian Analysis Toolkit (BAT). As HistFactory’s model construction allows for creation of models that can be interpreted as having Bayesian priors, HistFactory models built with pyhf can be used for both Frequentist and Bayesian analyses. The project goal is to build a library on top of the pyhf Python API to allow for Bayesian analysis of HistFactory models using the PyMC Python library and leverage pyhf’s automatic differentiation and hardware acceleration through its JAX computational backend. Validation tests of results will be conducted against the BAT and LiteHF Julia libraries. If time permits, work on integrating the functionality into pyhf would be possible, though it would not be expected to be completed in this Fellow project. Applicants are expected to have strong working experience with Python and basic knowledge of statistical analysis.
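
The idea that one likelihood can serve both Frequentist and Bayesian analyses can be illustrated with a single-bin counting model. The sketch below does not use pyhf or PyMC; the model (Poisson counts with expected rate mu * signal + background), the flat prior on mu, the grid range, and all numbers are illustrative assumptions.

```python
import math
from typing import List, Tuple


def poisson_log_likelihood(n_obs: int, rate: float) -> float:
    """Log of the Poisson probability of observing n_obs given an expected rate."""
    return n_obs * math.log(rate) - rate - math.lgamma(n_obs + 1)


def grid_posterior(
    n_obs: int,
    signal: float,
    background: float,
    mu_max: float = 5.0,
    n_points: int = 501,
) -> Tuple[List[float], List[float]]:
    """Posterior density of the signal strength mu on a uniform grid,
    assuming a flat prior on [0, mu_max]."""
    step = mu_max / (n_points - 1)
    grid = [i * step for i in range(n_points)]
    log_like = [
        poisson_log_likelihood(n_obs, mu * signal + background) for mu in grid
    ]
    peak = max(log_like)
    # With a flat prior the posterior is proportional to the likelihood.
    weights = [math.exp(value - peak) for value in log_like]
    norm = sum(weights) * step
    return grid, [w / norm for w in weights]


# Hypothetical counting experiment: 15 observed events,
# 5 expected signal events at mu = 1, and 10 expected background events.
grid, density = grid_posterior(n_obs=15, signal=5.0, background=10.0)
mu_mode = grid[density.index(max(density))]
```

With a flat prior the posterior mode coincides with the maximum-likelihood estimate, here mu = 1. A real HistFactory model has many bins and nuisance parameters with constraint terms, which is where sampling tools become necessary instead of a one-dimensional grid.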

Interactive C++ for Machine Learning: Interfacing Cling and PyTorch together to facilitate Machine Learning workflows. Email the mentors (David Lange,Vassil Vasilev)

Cling is an interactive C++ interpreter, built on top of Clang and the LLVM compiler infrastructure. Cling realizes the read-eval-print loop (REPL) concept, in order to leverage rapid application development. Implemented as a small extension to LLVM and Clang, the interpreter reuses its strengths such as the praised concise and expressive compiler diagnostics. The LLVM-based C++ interpreter has enabled interactive C++ coding environments, whether in a standalone shell or a Jupyter notebook environment in xeus-cling.

This project aims to demonstrate that interactive C++ is useful with data analysis tools outside of the field of HEP. For example, prototype implementations for PyTorch have been successfully completed. See this link

The project deliverables are:

  • Demonstrate that we can build and use PyTorch programs with
    Cling in a Jupyter notebook.
  • Develop several non-trivial ML tutorials using interactive C++
    in Jupyter.
  • Experiment with automatic differentiation of PyTorch codes via Clad.

Candidate requirements:

  • Experience with C++, Jupyter notebooks and Conda is desirable.
  • Experience with ML.
  • Interest in exploring the intersection of data
    science and interactive C++.

Coffea development for LHC experiments: Development of coffea to support multiple LHC experiment analysis workflows. Email the mentors (Lindsey Gray,Nick Smith,Matthew Feickert)

As the coffea analysis framework has continued to gain use across the CMS experiment and is beginning to be used more in ATLAS, the amount of experiment-specific code has grown as well. With coffea’s recent transition in late 2023 to be more Dask-focused, there is work to be done to support experiment-specific areas, such as the ATLAS-specific coffea.nanoevents.PHYSLITESchema, while still focusing on performance and user needs. This project would support the general development of coffea with a focus on targeting experiment-specific code to support and improve the user experience, and would have the Fellow work closely with the coffea core team.

Data Popularity, Placement Optimization and Storage Usage Effectiveness: Data Popularity, Placement Optimization and Storage Usage Effectiveness at the Data Center. Mentee: Avi G. Kaufman. Email the mentors (Vincent Garonne)

The goal of this project is to take data management to the next level by employing machine learning methods to create a precise data use prediction model. This model, applied to data placement decisions, can bring important benefits both 1) to data centers, in dealing with large amounts of “cold,” or unused, data, which can potentially become “hot”, or popular and heavily used, and 2) to scientists, by enabling them to access their data more quickly. Deploying such models in production has many challenges and pitfalls, such as the accuracy of the predictions at different scales.