Brief: Build an intelligent monitoring and recommendation system that analyzes software usage and dependencies across a large HPC cluster (Intel Xeon CPUs, NVIDIA A100/H100 GPUs). The system provides data‑driven recommendations for cleanup, retention, and optimization of the software stack, which is managed with EasyBuild/Spack and environment modules.
Goals and responsibilities:
- Collect and analyze software usage metrics (module load frequency, scheduler logs such as Slurm accounting data, usage patterns by node and partition).
- Correlate installed software with actual usage and real application workflows and dependencies.
- Develop a recommendation engine to identify unused/redundant modules, critical software, and optimization opportunities; generate safe‑to‑remove and must‑preserve lists with confidence/risk levels.
- Integrate with the existing inventory system and enhance color‑coded reports (Red/Orange/White) with confidence scores.
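As a rough illustration of the recommendation step described above, the following minimal Python sketch labels each module as must-preserve or safe-to-remove and attaches a confidence value. Everything in it is an assumption made for illustration: the function name `classify`, the input shapes (a last-use date per module and a dependency map), the 365-day staleness threshold, and the confidence formula are hypothetical placeholders, not part of the existing inventory system.

```python
from datetime import date
from typing import Dict, Optional, Set, Tuple


def classify(
    last_used: Dict[str, Optional[date]],
    depends_on: Dict[str, Set[str]],
    today: date,
    stale_days: int = 365,  # hypothetical threshold, to be tuned on real data
) -> Dict[str, Tuple[str, float]]:
    """Return {module: (label, confidence)} for every module in last_used.

    last_used maps a module name to the date it was last loaded, or None if
    it never appears in the logs. depends_on maps a module to the modules it
    requires.
    """
    def is_active(name: str) -> bool:
        used = last_used.get(name)
        return used is not None and (today - used).days <= stale_days

    # A module is needed if it is active itself, or if an active module
    # depends on it directly or transitively.
    needed: Set[str] = set()
    stack = [name for name in last_used if is_active(name)]
    while stack:
        name = stack.pop()
        if name in needed:
            continue
        needed.add(name)
        stack.extend(depends_on.get(name, ()))

    result: Dict[str, Tuple[str, float]] = {}
    for name, used in last_used.items():
        if name in needed:
            result[name] = ("must_preserve", 1.0)
        elif used is None:
            # Never seen in the logs: strongest removal candidate here.
            result[name] = ("safe_to_remove", 1.0)
        else:
            # Placeholder heuristic: confidence grows with idle time and
            # saturates at twice the staleness threshold.
            idle = (today - used).days
            result[name] = ("safe_to_remove", min(1.0, idle / (2 * stale_days)))
    return result
```

Calling `classify` with the last-use dates and the dependency map returns one `(label, confidence)` pair per module. The point of the design is that a toolchain or library with no direct loads is still preserved when an actively used application depends on it. A real implementation would also need to account for incomplete logs and for software that is used without `module load`.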
Required skills:
- Linux and scripting (Python or Bash), software development (C/C++ or Python).
- Basic knowledge of HPC tools and environments, and of Git; familiarity with data analysis is a plus.
- Good English, organizational skills, and experience with project management tools.
Planned training:
- Linux fundamentals (Udemy course).
- Introduction to HPC and parallel programming (MPI/OpenMP/GPU basics).
- EasyBuild training.
- 1:1 mentorship with ReDX engineers.
Other details:
- Recommended period: 2–3 months.
- Compensation: Monthly stipend with a potential end‑of‑internship performance bonus.
- Opportunity to work on real HPC systems and interact with end users.