ReDX Technologies
Tunisia

Open-Source HPC Networking Chatbot (FabricsGPT)

High-Performance Computing, Networking / Time Synchronization, Machine Learning (LLM), ETL / Data Engineering, SaaS / Software Engineering

Published about 16 hours ago

Internship
⏱️ 4-6 months
💼 Hybrid
💰 Paid
📅 Expires in 13 days

Job Description

Context & Motivation

  • HPC network fabrics (InfiniBand, high-speed Ethernet) are critical for scaling performance, latency predictability, energy efficiency, and TCO in large clusters.
  • Knowledge is fragmented across vendor docs (NVIDIA/Mellanox, Intel, Broadcom), standards bodies (IEEE, OpenFabrics Alliance), whitepapers, and community guides, which evolve frequently.

Goal & Problem Statement

  • Build an autonomous, domain-specialized, LLM-assisted networking design assistant (FabricsGPT) that discovers, interprets, normalizes, and reasons over InfiniBand and Ethernet information.
  • The assistant should expose valid insights to the Cluster Configurator, covering topology design, bandwidth/latency estimation, congestion control, and offload compatibility.

Required Features & System Capabilities

  • Discover relevant networking information from heterogeneous sources without relying on fixed structures or vendor-specific assumptions.
  • Extract and normalize key fields: link speed (Gb/s), port count, ASIC generation, latency, buffer depth, PFC/ECN support, RDMA/RoCE capabilities, routing type, topology constraints, and oversubscription ratios.
  • Support design reasoning (e.g., non-blocking topology checks, oversubscription calculations), estimate topology-aware metrics (bisection bandwidth, port utilization), and flag configuration issues.
  • Track changes, detect new hardware generations or unfamiliar field names, and flag them for human inspection rather than failing silently.
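
As an illustration of the normalization and design-reasoning features above, here is a minimal Python sketch. It is not part of the project specification: the alias table, the canonical field names (`link_speed_gbps`, `port_count`), and the function names are all hypothetical, and link speeds are assumed to be integer Gb/s values.

```python
from fractions import Fraction

# Hypothetical alias table: maps field names found in source documents
# to canonical schema names. A real project would define its own schema.
FIELD_ALIASES = {
    "link speed": "link_speed_gbps",
    "port speed": "link_speed_gbps",
    "ports": "port_count",
    "port count": "port_count",
}


def normalize_fields(raw):
    """Map raw field names to canonical ones and collect unknown names."""
    normalized, unknown = {}, []
    for name, value in raw.items():
        key = FIELD_ALIASES.get(name.strip().lower())
        if key is None:
            # Unfamiliar field: keep it for human inspection, never drop silently.
            unknown.append(name)
        else:
            normalized[key] = value
    return normalized, unknown


def oversubscription_ratio(downlinks, uplinks, down_gbps, up_gbps):
    """Leaf oversubscription: host-facing capacity over fabric-facing capacity.

    Assumes integer port counts and integer link speeds in Gb/s.
    """
    if uplinks <= 0 or up_gbps <= 0:
        raise ValueError("a leaf switch needs at least one uplink")
    return Fraction(downlinks * down_gbps, uplinks * up_gbps)


def is_non_blocking(ratio):
    """A leaf is non-blocking when uplink capacity covers downlink capacity."""
    return ratio <= 1
```

For example, a leaf switch with 48 downlinks and 16 uplinks at the same speed gives a 3:1 ratio and is therefore not non-blocking, while a field name such as "Buffer Mem" that is missing from the alias table is returned in the `unknown` list for review.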

Tools, Data Sources & Resources

  • Hardware resources: 2–8 × NVIDIA H100 GPUs, Lustre filesystem, HPC cluster access.
  • Software ecosystem: HuggingFace Transformers, NVIDIA NeMo, FAISS, LangChain or LlamaIndex; data sources include vendor docs, IEEE papers, OpenFabrics Alliance resources, and HPC case studies.

Learning Objectives & Evaluation

  • Analyze differences between InfiniBand and Ethernet in latency, throughput, congestion behavior, and scalability.
  • Design robust data and retrieval pipelines resilient to heterogeneous and evolving documentation formats; compare reasoning strategies (rule-based, semantic search, LLM prompting, fine-tuning, hybrid RAG).
  • Define evaluation metrics (precision/recall for field extraction, correctness of topology reasoning, hallucination rate, adaptability to new generations) and communicate trade-offs and limitations.
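
The precision/recall metric for field extraction mentioned above could be computed along these lines. This is a sketch under one possible definition, scoring exact matches of (field, value) pairs against a hand-labeled reference; the function name and field names are hypothetical, and values are assumed to be hashable.

```python
def extraction_scores(predicted, gold):
    """Return (precision, recall) over exact (field, value) pair matches."""
    pred_pairs = set(predicted.items())
    gold_pairs = set(gold.items())
    true_pos = len(pred_pairs & gold_pairs)
    # Guard against empty inputs to avoid division by zero.
    precision = true_pos / len(pred_pairs) if pred_pairs else 0.0
    recall = true_pos / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall
```

Exact matching is strict: an extracted value that differs only in unit or formatting counts as an error, which is one reason normalization would need to happen before scoring.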

Required Skills

  • Python programming: HTTP requests, HTML parsing, data processing.
  • Networking fundamentals: InfiniBand, Ethernet, RDMA/RoCE, PFC/ECN, topology design (fat-tree, Dragonfly, leaf-spine).
  • Data normalization and database management: schema definition, DuckDB/PostgreSQL.
  • Intro AI/LLM skills (preferred): prompting, mapping unstructured fields into schemas, producing factual summaries of HPC network concepts.

Duration & Compensation

  • Recommended duration: 6 months (the listing allows 4-6 months).
  • Compensation: monthly stipend, with a possible end-of-internship performance bonus and possible co-authorship of a paper.

📧 To apply: contact@redxt.com