MASS Analytics - Software Engineering Intern | Hi Interns

Topic 1: Design of a Schema-Resilient Data Ingestion Architecture for MASS Analytics Using Apache NiFi and Apache Iceberg

Description: MASS Analytics products ingest data from external client systems such as Snowflake and process it through analytics models and AOA workflows. Frequent schema changes in source systems can break ingestion, modeling, and automation pipelines, causing downtime and manual intervention. This project aims to design and implement a robust data ingestion and storage architecture using Apache NiFi and Apache Iceberg to detect, control, and manage schema evolution.
Key attributes / Main competencies:
Java and Python programming
Relational databases and SQL
Data modeling and schema management
Data pipeline design and integration
Problem-solving and analytical skills
Software engineering principles
Learning Outcomes:
Understand challenges of schema evolution in large-scale analytics platforms
Design resilient data pipelines decoupled from source system changes
Implement controlled schema evolution using modern Lakehouse technologies
Evaluate pipeline stability and performance under schema variability

Description: The project exposes AOA components as MCP tools and uses an LLM to dynamically plan, execute, and monitor end-to-end workflows. The solution handles failures, conditional steps, and component dependencies through policy-driven decision making. Built-in guardrails ensure secure, explainable, and auditable execution suitable for enterprise environments. The outcome is a more resilient, adaptive, and maintainable AOA pipeline orchestration.
Key attributes / Main competencies:
Large Language Models and AI-assisted systems
Distributed systems and workflow orchestration
API-based system integration and MCP concepts
Software architecture and modular design
Learning Outcomes:
Understanding of LLM-based orchestration and decision-making systems
Ability to design and integrate distributed workflow components
Application of policy-driven control and guardrails in AI systems
Analysis and handling of failures in automated pipelines
Evaluation of system resilience, explainability, and maintainability

Description: The project focuses on incremental data processing to refresh models automatically while minimizing compute usage. It includes monitoring mechanisms to track data quality, model stability, and performance over time. The application generates actionable insights and prioritized recommendations ready to drive the “next dollar” of value. The solution is built as a scalable, reusable, and marketplace-ready Databricks app.
Key attributes / Main competencies:
Incremental and batch data processing
Machine learning model lifecycle management
Data quality monitoring and validation
Performance analysis and system monitoring
Distributed computing with Databricks and Spark
Scalable application design
Learning Outcomes:
Understanding incremental data processing strategies to optimize compute usage
Ability to automate model refresh and evaluation pipelines
Application of data quality and model stability monitoring techniques
Design of scalable and reusable analytics applications
Generation of data-driven insights and business recommendations