High-Performance Computing · Networking / Time Synchronization · Machine Learning (LLM) · ETL / Data Engineering · SaaS / Software Engineering
Published about 16 hours ago
Internship
⏱️ 4-6 months
💼 Hybrid
💰 Paid
📅 Expires in 13 days
Job description
Context & Motivation
HPC network fabrics (InfiniBand, high-speed Ethernet) are critical to performance scaling, latency predictability, energy efficiency, and total cost of ownership (TCO) in large clusters.
Knowledge is fragmented across vendor docs (NVIDIA/Mellanox, Intel, Broadcom), standards bodies (IEEE, OpenFabrics Alliance), whitepapers, and community guides, which evolve frequently.
Goal & Problem Statement
Build an autonomous, domain-specialized, LLM-assisted networking design assistant (GPTFabrics) that discovers, interprets, normalizes, and reasons over InfiniBand and Ethernet information.
The assistant should expose validated insights to the Cluster Configurator, such as topology design, bandwidth/latency estimation, congestion control, and offload compatibility.
Required Features & System Capabilities
Discover relevant networking information from heterogeneous sources without relying on fixed structures or vendor-specific assumptions.
Extract and normalize key fields: link speed (Gb/s), port count, ASIC generation, latency, buffer depth, PFC/ECN support, RDMA/RoCE capabilities, routing type, topology constraints, and oversubscription ratios (see the record sketch after this list).
Support design reasoning (e.g., non-blocking topology checks, oversubscription calculations) and estimate topology-aware metrics (bisection bandwidth, port utilization) while flagging configuration issues; a topology-math sketch also follows this list.
Track documentation changes, detect new hardware generations or unfamiliar field names, and flag them for human inspection rather than failing silently.
Software ecosystem: HuggingFace Transformers, NVIDIA NeMo, FAISS, LangChain or LlamaIndex; data sources include vendor docs, IEEE papers, OpenFabrics Alliance resources, and HPC case studies.
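To make the normalization target concrete, here is a minimal sketch of what a normalized switch record covering the fields listed above could look like; the class name, field names, and units are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative normalized record for the extracted fields; names and units
# are assumptions for this sketch, not a mandated schema.
@dataclass
class SwitchRecord:
    vendor: str                                   # e.g. "NVIDIA", "Broadcom"
    asic_generation: str                          # e.g. "Quantum-2", "Tomahawk 5"
    link_speed_gbps: float                        # per-port link speed in Gb/s
    port_count: int
    port_to_port_latency_ns: Optional[float] = None
    buffer_depth_mb: Optional[float] = None
    pfc_supported: Optional[bool] = None
    ecn_supported: Optional[bool] = None
    rdma_roce: Optional[str] = None               # e.g. "RoCEv2", "InfiniBand native"
    routing_type: Optional[str] = None            # e.g. "adaptive", "static"
    topology_constraints: list[str] = field(default_factory=list)
    oversubscription_ratio: Optional[float] = None
    unknown_fields: dict[str, str] = field(default_factory=dict)  # flagged for human review
```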
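And a minimal sketch of the topology arithmetic behind the non-blocking and oversubscription checks above, for a two-tier leaf-spine fabric; the function name and the numbers in the usage example are hypothetical, and the bisection figure is an upper-bound approximation that assumes the spine layer is not the bottleneck.

```python
def leaf_spine_metrics(leaves: int, spines: int,
                       downlinks_per_leaf: int, uplinks_per_leaf: int,
                       link_speed_gbps: float) -> dict:
    """Oversubscription and bisection bandwidth for a two-tier leaf-spine fabric.

    Assumes hosts attach only to leaf downlinks and each leaf spreads its
    uplinks evenly across all spines (a common but not universal design).
    """
    oversubscription = downlinks_per_leaf / uplinks_per_leaf      # e.g. 3:1 -> 3.0
    # Upper-bound bisection: traffic between the two halves of the leaves is
    # capped by the total uplink capacity of one half.
    bisection_gbps = (leaves // 2) * uplinks_per_leaf * link_speed_gbps
    non_blocking = uplinks_per_leaf >= downlinks_per_leaf
    issues = []
    if uplinks_per_leaf % spines != 0:
        issues.append("uplinks per leaf not evenly divisible across spines")
    return {
        "oversubscription_ratio": oversubscription,
        "bisection_bandwidth_gbps": bisection_gbps,
        "non_blocking": non_blocking,
        "issues": issues,
    }

# Hypothetical example: 32 leaves, 8 spines, 48 x 200 Gb/s downlinks and
# 16 x 200 Gb/s uplinks per leaf -> 3:1 oversubscription, not non-blocking.
print(leaf_spine_metrics(leaves=32, spines=8,
                         downlinks_per_leaf=48, uplinks_per_leaf=16,
                         link_speed_gbps=200.0))
```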
Learning Objectives & Evaluation
Analyze differences between InfiniBand and Ethernet in latency, throughput, congestion behavior, and scalability.
Design robust data and retrieval pipelines resilient to heterogeneous and evolving documentation formats; compare reasoning strategies (rule-based, semantic search, LLM prompting, fine-tuning, hybrid RAG); a retrieval sketch appears after this list.
Define evaluation metrics (precision/recall for field extraction, correctness of topology reasoning, hallucination rate, adaptability to new generations) and communicate trade-offs and limitations.
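As a concrete instance of the field-extraction metric above, a minimal exact-match precision/recall scorer; the reference and predicted dictionaries are hypothetical.

```python
# Score extracted fields against a hand-labeled reference record.
def extraction_precision_recall(predicted: dict, reference: dict) -> tuple[float, float]:
    """Exact-match precision/recall over (field, value) pairs."""
    pred_pairs = set(predicted.items())
    ref_pairs = set(reference.items())
    true_pos = len(pred_pairs & ref_pairs)
    precision = true_pos / len(pred_pairs) if pred_pairs else 0.0
    recall = true_pos / len(ref_pairs) if ref_pairs else 0.0
    return precision, recall

# Hypothetical example: port_count was extracted incorrectly -> roughly (0.67, 0.67).
reference = {"link_speed_gbps": "400", "port_count": "64", "rdma": "RoCEv2"}
predicted = {"link_speed_gbps": "400", "port_count": "32", "rdma": "RoCEv2"}
print(extraction_precision_recall(predicted, reference))
```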
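For the semantic-search / hybrid-RAG strategies mentioned above, a minimal retrieval sketch over documentation snippets, assuming the faiss-cpu and sentence-transformers packages are installed; the embedding model and the snippets themselves are illustrative choices, not project requirements. In a full RAG pipeline the retrieved snippets would then be inserted into the LLM prompt before reasoning.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy corpus of networking-doc snippets (illustrative only).
docs = [
    "NDR InfiniBand links run at 400 Gb/s per port with adaptive routing.",
    "RoCEv2 relies on PFC and ECN for lossless operation on Ethernet fabrics.",
    "A 3:1 oversubscribed leaf-spine fabric trades bisection bandwidth for cost.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")        # assumed embedding model
emb = model.encode(docs, normalize_embeddings=True)    # (n_docs, dim)
index = faiss.IndexFlatIP(emb.shape[1])                # inner product == cosine on normalized vectors
index.add(np.asarray(emb, dtype="float32"))

query = "How does congestion control work for RoCE?"
q = model.encode([query], normalize_embeddings=True)
scores, ids = index.search(np.asarray(q, dtype="float32"), k=2)
for rank, (i, s) in enumerate(zip(ids[0], scores[0]), start=1):
    print(f"{rank}. score={s:.2f}  {docs[i]}")
```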
Required Skills
Python programming: HTTP requests, HTML parsing, data processing.
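A minimal sketch of the kind of discovery script this skill implies, using requests and BeautifulSoup; the URL and the spec-line regular expression are placeholders, not actual project data sources.

```python
import re
import requests
from bs4 import BeautifulSoup

def fetch_spec_lines(url: str) -> list[str]:
    """Fetch a documentation page and keep lines that look like speed/port specs."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    text = soup.get_text(separator="\n")
    # Rough heuristic: keep lines mentioning Gb/s rates or port counts.
    pattern = re.compile(r"\b\d+\s*(Gb/s|Gbps|ports?)\b", re.IGNORECASE)
    return [line.strip() for line in text.splitlines() if pattern.search(line)]

if __name__ == "__main__":
    # Placeholder URL for illustration only.
    for line in fetch_spec_lines("https://example.com/switch-datasheet.html"):
        print(line)
```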