ReDX Technologies
Tunisia

Intelligent HPC Software Stack Monitoring & Recommendation System

HPC | IT Consulting / Software Engineering | Systems Monitoring | ERP & Data Analytics | Cloud infrastructure / DevOps | Cluster Computing

Posted about 20 hours ago

Internship
⏱️ 2-3 months
💼 On-site
💰 Paid
📅 Expires in 13 days

Job description

Brief: Build an intelligent monitoring and recommendation system to analyze software usage and dependencies across a large HPC cluster (Intel Xeon CPUs, NVIDIA A100/H100). Provide data‑driven recommendations for cleanup, retention, and optimization of the software stack managed with EasyBuild/Spack and environment modules.

Goals and responsibilities:

  • Collect and analyze software usage metrics (module load frequency, scheduler logs such as SLURM accounting, node/partition patterns).
  • Correlate installed software with actual usage and real application workflows and dependencies.
  • Develop a recommendation engine to identify unused/redundant modules, critical software, and optimization opportunities; generate safe‑to‑remove and must‑preserve lists with confidence/risk levels.
  • Integrate with the existing inventory system and enhance color‑coded reports (Red/Orange/White) with confidence scores.
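
To make the recommendation-engine goal more concrete, here is a minimal Python sketch of how a module might be labeled safe-to-remove or must-preserve with a confidence score. It is only an illustration under stated assumptions, not the ReDX design: the `ModuleUsage` fields, the `recommend` function, and the `stale_days` and `busy_loads` thresholds are all hypothetical, and the inputs are assumed to have been extracted beforehand from module load logs and dependency data.

```python
from dataclasses import dataclass


@dataclass
class ModuleUsage:
    # All fields are hypothetical; real inputs would come from module
    # load logs, scheduler accounting, and the dependency graph.
    name: str
    load_count: int            # module loads seen in the observation window
    days_since_last_load: int  # age of the most recent known load
    active_dependents: int     # in-use modules that depend on this one


def recommend(m: ModuleUsage, stale_days: int = 365, busy_loads: int = 100):
    """Return (label, confidence in [0, 1]) for one module."""
    if m.load_count > 0:
        # Any observed use argues for keeping; more loads, more confidence.
        return "must-preserve", min(1.0, m.load_count / busy_loads)
    if m.active_dependents > 0:
        # Never loaded directly, but something in use still needs it.
        return "must-preserve", 1.0
    # Unused and nothing depends on it: confidence grows with staleness.
    return "safe-to-remove", min(1.0, m.days_since_last_load / stale_days)


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    for m in [
        ModuleUsage("toolchain-a", 250, 1, 40),
        ModuleUsage("legacy-solver", 0, 730, 0),
        ModuleUsage("helper-lib", 0, 400, 12),
    ]:
        print(m.name, recommend(m))
```

A real engine would likely weigh more signals (per-user and per-partition patterns, job workflows) and route low-confidence cases to manual review rather than relying on two fixed thresholds.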

Required skills:

  • Linux and scripting (Python or Bash), software development (C/C++ or Python).
  • Basics of HPC tools/environments; Git; data analysis familiarity is a plus.
  • Good English, strong organizational skills, and familiarity with project management tools.

Planned training:

  • Linux fundamentals (Udemy course), intro to HPC and parallel programming (MPI/OpenMP/GPU basics), EasyBuild training, 1:1 mentorship with ReDX engineers.

Other details:

  • Recommended period: 2–3 months.
  • Compensation: Monthly stipend with potential end‑of‑internship performance bonus.
  • Opportunity to work on real HPC systems and interact with end users.