MASS Analytics
Tunisia

Software Engineering Intern

Data Science / Data Engineering, Machine Learning / AI, Distributed Systems

Posted 2 days ago

Internship
⏱️ 3-6 months
💼 Hybrid
📅 Expires in 11 days

Job description

Topic 1: Design of a Schema-Resilient Data Ingestion Architecture for MASS Analytics Using Apache NiFi and Apache Iceberg

  • Description: MASS Analytics products ingest data from external client systems such as Snowflake and process it through analytics models and Always-On Analytics (AOA) workflows. Frequent schema changes in source systems can break ingestion, modeling, and automation pipelines, causing downtime and manual intervention. This project aims to design and implement a robust data ingestion and storage architecture using Apache NiFi and Apache Iceberg that detects, controls, and manages schema evolution (see the illustrative sketch after this list).
  • Key attributes / Main competencies:
  • Java and Python programming
  • Relational databases and SQL
  • Data modeling and schema management
  • Data pipeline design and integration
  • Problem-solving and analytical skills
  • Software engineering principles
  • Learning Outcomes:
  • Understand challenges of schema evolution in large-scale analytics platforms
  • Design resilient data pipelines decoupled from source system changes
  • Implement controlled schema evolution using modern Lakehouse technologies
  • Evaluate pipeline stability and performance under schema variability
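
To make "controlled schema evolution" concrete, here is a minimal, purely illustrative Python sketch of one possible gate before an Iceberg write, assuming a pyiceberg-managed table behind a REST catalog. The catalog URI, type mapping, and function names are placeholders, not part of the actual MASS Analytics stack.

```python
"""Illustrative sketch: allow additive schema drift, flag breaking drift for review."""
from pyiceberg.catalog import load_catalog
from pyiceberg.types import StringType, LongType, DoubleType

# Hypothetical mapping from source-system type names to Iceberg types.
TYPE_MAP = {"varchar": StringType(), "number": LongType(), "float": DoubleType()}

def reconcile_schema(table_name: str, incoming_fields: dict[str, str]) -> list[str]:
    """Apply additive-only schema evolution; return columns that need manual review."""
    catalog = load_catalog("default", uri="http://localhost:8181")  # assumed REST catalog
    table = catalog.load_table(table_name)
    existing = {f.name: f.field_type for f in table.schema().fields}

    added, flagged = [], []
    for name, src_type in incoming_fields.items():
        iceberg_type = TYPE_MAP.get(src_type, StringType())
        if name not in existing:
            added.append((name, iceberg_type))              # new column: safe to add
        elif str(existing[name]) != str(iceberg_type):
            flagged.append(name)                             # type change: breaking, review

    if added and not flagged:
        # Commit the additive evolution so downstream readers keep working.
        with table.update_schema() as update:
            for name, iceberg_type in added:
                update.add_column(name, iceberg_type)
    return flagged
```

In a NiFi flow, a check like this could run before the Iceberg writer, routing batches with breaking changes to a quarantine queue instead of failing the pipeline.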

Topic 2: Design of an Intelligent Orchestration Framework for MASS Analytics' Always-On Analytics (AOA) Workflows Using MCP and Large Language Models

  • Description: The project exposes AOA components as Model Context Protocol (MCP) tools and uses an LLM to dynamically plan, execute, and monitor end-to-end workflows. The solution handles failures, conditional steps, and component dependencies through policy-driven decision making. Built-in guardrails ensure secure, explainable, and auditable execution suitable for enterprise environments. The outcome is more resilient, adaptive, and maintainable orchestration of AOA pipelines (an illustrative sketch follows this list).
  • Key attributes / Main competencies:
  • Large Language Models and AI-assisted systems
  • Distributed systems and workflow orchestration
  • API-based system integration and MCP concepts
  • Software architecture and modular design
  • Learning Outcomes:
  • Understanding of LLM-based orchestration and decision-making systems
  • Ability to design and integrate distributed workflow components
  • Application of policy-driven control and guardrails in AI systems
  • Analysis and handling of failures in automated pipelines
  • Evaluation of system resilience, explainability, and maintainability
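
As a purely illustrative sketch of what "exposing AOA components as MCP tools" could look like, the snippet below uses the open-source `mcp` Python SDK. The server name, stage names, and the policy guardrail are invented for illustration and do not describe MASS Analytics' actual components.

```python
"""Illustrative sketch: a hypothetical AOA step exposed as an MCP tool with a policy guardrail."""
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("aoa-orchestrator")

# Hypothetical policy: only these pipeline stages may be triggered by the LLM planner.
ALLOWED_STAGES = {"ingest", "model_refresh", "report"}

@mcp.tool()
def run_aoa_stage(stage: str, dry_run: bool = True) -> dict:
    """Run one AOA pipeline stage, enforcing a policy-driven guardrail."""
    if stage not in ALLOWED_STAGES:
        # Guardrail: refuse out-of-policy actions and return an auditable reason.
        return {"status": "rejected", "reason": f"stage '{stage}' is not in policy"}
    # A real implementation would invoke the AOA component and capture logs/metrics here.
    return {"status": "ok", "stage": stage, "dry_run": dry_run}

if __name__ == "__main__":
    mcp.run()  # serve over stdio so an LLM client can plan and call the tool
```

Returning structured status objects (rather than raising) is one way to keep every LLM-initiated action explainable and auditable, which is the kind of guardrail behaviour this topic calls for.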

Topic 3: Design and implement an Always-On Analytics (AOA) application for the Databricks Marketplace that continuously runs cost-efficient analytics pipelines

  • Description: The project focuses on incremental data processing to refresh models automatically while minimizing compute usage. It includes monitoring mechanisms to track data quality, model stability, and performance over time. The application generates actionable insights and prioritized recommendations ready to drive the “next dollar” of value. The solution is built as a scalable, reusable, and marketplace-ready Databricks app (see the sketch after this list).
  • Key attributes / Main competencies:
  • Incremental and batch data processing
  • Machine learning model lifecycle management
  • Data quality monitoring and validation
  • Performance analysis and system monitoring
  • Distributed computing with Databricks and Spark
  • Scalable application design
  • Learning Outcomes:
  • Understanding incremental data processing strategies to optimize compute usage
  • Ability to automate model refresh and evaluation pipelines
  • Application of data quality and model stability monitoring techniques
  • Design of scalable and reusable analytics applications
  • Generation of data-driven insights and business recommendations
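
For illustration only, the sketch below shows one common way to express cost-efficient incremental refresh on Databricks using Spark Structured Streaming over Delta tables. The table names, transformation, and checkpoint path are placeholders, not the actual AOA application.

```python
"""Illustrative sketch: process only new rows, then stop, to keep compute usage low."""
from pyspark.sql import functions as F

def incremental_refresh(spark):
    # Read only rows added since the last checkpoint from the source Delta table.
    source = spark.readStream.table("raw_events")  # placeholder table name

    # Placeholder transformation feeding downstream AOA models.
    scored = source.withColumn("spend_usd", F.col("spend") * F.col("fx_rate"))

    # availableNow processes the current backlog once and then stops,
    # so compute is only spent when there is new data to refresh.
    return (
        scored.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", "/tmp/checkpoints/aoa_scored")  # placeholder path
        .trigger(availableNow=True)
        .toTable("aoa_scored_events")
    )
```

Run as a scheduled Databricks job, a query like this refreshes downstream models incrementally, and its streaming metrics can feed the data-quality and stability monitoring the topic describes.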