High-Performance Computing · Networking / Time Synchronization · Machine Learning (LLM) · ETL / Data Engineering · SaaS / Software Engineering
Published about 16 hours ago
Internship
⏱️ 4-6 months
💼 Hybrid
💰 Paid
📅 Expires in 13 days
Job description
Context & Motivation
HPC network fabrics (InfiniBand, high-speed Ethernet) are critical to performance scaling, latency predictability, energy efficiency, and total cost of ownership (TCO) in large clusters.
Knowledge is fragmented across vendor docs (NVIDIA/Mellanox, Intel, Broadcom), standards bodies (IEEE, OpenFabrics Alliance), whitepapers, and community guides, which evolve frequently.
Goal & Problem Statement
Build an autonomous, domain-specialized, LLM-assisted networking design assistant (GPTFabrics) that discovers, interprets, normalizes, and reasons over InfiniBand and Ethernet information.
The assistant should expose validated insights to the Cluster Configurator, such as topology design, bandwidth/latency estimation, congestion control, and offload compatibility.
Required Features & System Capabilities
Discover relevant networking information from heterogeneous sources without relying on fixed structures or vendor-specific assumptions.
Extract and normalize key fields: link speed (Gb/s), port count, ASIC generation, latency, buffer depth, PFC/ECN support, RDMA/RoCE capabilities, routing type, topology constraints, and oversubscription ratios (see the record sketch after this list).
Support design reasoning (e.g., non-blocking topology checks, oversubscription calculations) and estimate topology-aware metrics (bisection bandwidth, port utilization) while flagging configuration issues; a topology-math sketch also follows this list.
Track documentation changes, detect new hardware generations or unfamiliar field names, and flag them for human inspection rather than failing silently.
Software ecosystem: HuggingFace Transformers, NVIDIA NeMo, FAISS, LangChain or LlamaIndex; data sources include vendor docs, IEEE papers, OpenFabrics Alliance resources, and HPC case studies.
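To make the normalization target concrete, here is a minimal sketch of what a normalized switch record covering the fields listed above could look like; the class name, field names, and units are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative normalized record for the extracted fields; names and units
# are assumptions for this sketch, not a mandated schema.
@dataclass
class SwitchRecord:
    vendor: str                                   # e.g. "NVIDIA", "Broadcom"
    asic_generation: str                          # e.g. "Quantum-2", "Tomahawk 5"
    link_speed_gbps: float                        # per-port link speed in Gb/s
    port_count: int
    port_to_port_latency_ns: Optional[float] = None
    buffer_depth_mb: Optional[float] = None
    pfc_supported: Optional[bool] = None
    ecn_supported: Optional[bool] = None
    rdma_roce: Optional[str] = None               # e.g. "RoCEv2", "InfiniBand native"
    routing_type: Optional[str] = None            # e.g. "adaptive", "static"
    topology_constraints: list[str] = field(default_factory=list)
    oversubscription_ratio: Optional[float] = None
    unknown_fields: dict[str, str] = field(default_factory=dict)  # flagged for human review
```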
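And a minimal sketch of the topology arithmetic behind the non-blocking and oversubscription checks above, for a two-tier leaf-spine fabric; the function name and the numbers in the usage example are hypothetical, and the bisection figure is an upper-bound approximation that assumes the spine layer is not the bottleneck.

```python
def leaf_spine_metrics(leaves: int, spines: int,
                       downlinks_per_leaf: int, uplinks_per_leaf: int,
                       link_speed_gbps: float) -> dict:
    """Oversubscription and bisection bandwidth for a two-tier leaf-spine fabric.

    Assumes hosts attach only to leaf downlinks and each leaf spreads its
    uplinks evenly across all spines (a common but not universal design).
    """
    oversubscription = downlinks_per_leaf / uplinks_per_leaf      # e.g. 3:1 -> 3.0
    # Upper-bound bisection: traffic between the two halves of the leaves is
    # capped by the total uplink capacity of one half.
    bisection_gbps = (leaves // 2) * uplinks_per_leaf * link_speed_gbps
    non_blocking = uplinks_per_leaf >= downlinks_per_leaf
    issues = []
    if uplinks_per_leaf % spines != 0:
        issues.append("uplinks per leaf not evenly divisible across spines")
    return {
        "oversubscription_ratio": oversubscription,
        "bisection_bandwidth_gbps": bisection_gbps,
        "non_blocking": non_blocking,
        "issues": issues,
    }

# Hypothetical example: 32 leaves, 8 spines, 48 x 200 Gb/s downlinks and
# 16 x 200 Gb/s uplinks per leaf -> 3:1 oversubscription, not non-blocking.
print(leaf_spine_metrics(leaves=32, spines=8,
                         downlinks_per_leaf=48, uplinks_per_leaf=16,
                         link_speed_gbps=200.0))
```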
Learning Objectives & Evaluation
Analyze differences between InfiniBand and Ethernet in latency, throughput, congestion behavior, and scalability.
Design robust data and retrieval pipelines resilient to heterogeneous and evolving documentation formats; compare reasoning strategies (rule-based, semantic search, LLM prompting, fine-tuning, hybrid RAG); a retrieval sketch appears after this list.
Define evaluation metrics (precision/recall for field extraction, correctness of topology reasoning, hallucination rate, adaptability to new generations) and communicate trade-offs and limitations.
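As a concrete instance of the field-extraction metric above, a minimal exact-match precision/recall scorer; the reference and predicted dictionaries are hypothetical.

```python
# Score extracted fields against a hand-labeled reference record.
def extraction_precision_recall(predicted: dict, reference: dict) -> tuple[float, float]:
    """Exact-match precision/recall over (field, value) pairs."""
    pred_pairs = set(predicted.items())
    ref_pairs = set(reference.items())
    true_pos = len(pred_pairs & ref_pairs)
    precision = true_pos / len(pred_pairs) if pred_pairs else 0.0
    recall = true_pos / len(ref_pairs) if ref_pairs else 0.0
    return precision, recall

# Hypothetical example: port_count was extracted incorrectly -> roughly (0.67, 0.67).
reference = {"link_speed_gbps": "400", "port_count": "64", "rdma": "RoCEv2"}
predicted = {"link_speed_gbps": "400", "port_count": "32", "rdma": "RoCEv2"}
print(extraction_precision_recall(predicted, reference))
```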
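For the semantic-search / hybrid-RAG strategies mentioned above, a minimal retrieval sketch over documentation snippets, assuming the faiss-cpu and sentence-transformers packages are installed; the embedding model and the snippets themselves are illustrative choices, not project requirements. In a full RAG pipeline the retrieved snippets would then be inserted into the LLM prompt before reasoning.

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# Toy corpus of networking-doc snippets (illustrative only).
docs = [
    "NDR InfiniBand links run at 400 Gb/s per port with adaptive routing.",
    "RoCEv2 relies on PFC and ECN for lossless operation on Ethernet fabrics.",
    "A 3:1 oversubscribed leaf-spine fabric trades bisection bandwidth for cost.",
]

model = SentenceTransformer("all-MiniLM-L6-v2")        # assumed embedding model
emb = model.encode(docs, normalize_embeddings=True)    # (n_docs, dim)
index = faiss.IndexFlatIP(emb.shape[1])                # inner product == cosine on normalized vectors
index.add(np.asarray(emb, dtype="float32"))

query = "How does congestion control work for RoCE?"
q = model.encode([query], normalize_embeddings=True)
scores, ids = index.search(np.asarray(q, dtype="float32"), k=2)
for rank, (i, s) in enumerate(zip(ids[0], scores[0]), start=1):
    print(f"{rank}. score={s:.2f}  {docs[i]}")
```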
Required Skills
Python programming: HTTP requests, HTML parsing, data processing.
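A minimal sketch of the kind of discovery script this skill implies, using requests and BeautifulSoup; the URL and the spec-line regular expression are placeholders, not actual project data sources.

```python
import re
import requests
from bs4 import BeautifulSoup

def fetch_spec_lines(url: str) -> list[str]:
    """Fetch a documentation page and keep lines that look like speed/port specs."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")
    text = soup.get_text(separator="\n")
    # Rough heuristic: keep lines mentioning Gb/s rates or port counts.
    pattern = re.compile(r"\b\d+\s*(Gb/s|Gbps|ports?)\b", re.IGNORECASE)
    return [line.strip() for line in text.splitlines() if pattern.search(line)]

if __name__ == "__main__":
    # Placeholder URL for illustration only.
    for line in fetch_spec_lines("https://example.com/switch-datasheet.html"):
        print(line)
```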