10 Subject 4: AI Model for Real-Time Video Translation with Live Lip Sync (PFE)
Olindias • Monastir
Time Series Modeling & Machine Learning • Computer Vision & NLP • Natural Language Processing
Published 6 months ago
Internship
⏱️ 3-6 months
💼 Hybrid
📅 Expired 6 months ago
Job Description
Project Overview
Develop an AI system capable of translating spoken language in real time within video streams while preserving natural lip synchronization on the target-language video output.
The system should combine automatic speech recognition (ASR), neural machine translation (NMT), text-to-speech (TTS) or voice conversion, and visual lip-sync generation to produce low-latency, high-quality translated video suitable for live or near-live scenarios.
Objectives & Key Tasks
Design and implement a pipeline that performs: speech capture → ASR → translation → speech generation → lip-sync-driven video rendering, minimizing end-to-end latency.
Research and integrate state-of-the-art models for real-time ASR, low-latency NMT, and lip-sync synthesis (e.g., audio-driven facial animation / viseme alignment), and evaluate trade-offs between quality and speed.
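As a rough illustration of the pipeline described above, the following minimal Python sketch chains the four processing stages and records per-stage and end-to-end latency. Every function name here (asr, translate, synthesize, render_lip_sync, run_pipeline) is a hypothetical placeholder returning dummy data, not a real model or library API; in an actual prototype each stub would wrap a real model.

```python
import time
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical stand-ins for real models. Each stage consumes the
# previous stage's output; the return values are dummy placeholders.
def asr(audio_chunk: bytes) -> str:
    return "hello world"  # placeholder transcript

def translate(text: str) -> str:
    return "bonjour le monde"  # placeholder translation

def synthesize(text: str) -> bytes:
    return text.encode("utf-8")  # placeholder for a synthesized waveform

def render_lip_sync(audio: bytes) -> Dict[str, Any]:
    return {"audio": audio, "frames": []}  # placeholder for rendered video

STAGES: List[Tuple[str, Callable[[Any], Any]]] = [
    ("asr", asr),
    ("translation", translate),
    ("speech_generation", synthesize),
    ("lip_sync_rendering", render_lip_sync),
]

def run_pipeline(audio_chunk: bytes, stages=STAGES):
    """Run one captured audio chunk through every stage, timing each one."""
    timings: Dict[str, float] = {}
    data: Any = audio_chunk
    for name, stage in stages:
        start = time.perf_counter()
        data = stage(data)
        timings[name] = time.perf_counter() - start
    # End-to-end latency is the sum of stage latencies in this sequential sketch.
    timings["end_to_end"] = sum(timings.values())
    return data, timings

if __name__ == "__main__":
    output, timings = run_pipeline(b"\x00" * 320)
    for name, seconds in timings.items():
        print(f"{name}: {seconds * 1000:.3f} ms")
```

Running stages strictly one after another keeps the sketch simple; a real low-latency system would more likely stream partial results between stages so that later stages start before earlier ones finish.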
Technical Requirements & Skills
Strong knowledge of machine learning, especially deep learning frameworks (PyTorch or TensorFlow), and experience with sequence-to-sequence models.
Experience in Computer Vision (video processing, facial landmark detection), Speech/NLP (ASR, NMT, TTS) and real-time inference optimization (quantization, model pruning, batching strategies).
Deliverables & Evaluation
A working prototype demonstrating live or near-real-time translation with synchronized lip movements on output video, plus qualitative and quantitative evaluation (latency, translation accuracy, lip-sync accuracy, perceptual quality).
Documentation, source code, and a short demo video showcasing typical use-cases and performance metrics.
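For the latency part of the quantitative evaluation, one possible way to summarize measurements is sketched below using only the Python standard library. The function name summarize_latency and the choice of mean, median, and 95th percentile are illustrative assumptions; the posting does not prescribe specific statistics or thresholds, and the sample values are made up.

```python
import statistics
from typing import Dict, Sequence

def summarize_latency(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize end-to-end latency samples given in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two latency samples")
    # quantiles(n=100) returns 99 cut points; index 49 is the 50th
    # percentile and index 94 is the 95th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "mean_ms": statistics.fmean(samples_ms),
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "max_ms": max(samples_ms),
    }

if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    print(summarize_latency([100, 200, 300, 400, 500]))
```

Reporting a tail percentile alongside the mean is useful for live scenarios, where occasional slow chunks are more noticeable to viewers than the average delay.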
Tools & Environment
Development in Python with common ML libraries (PyTorch/TensorFlow, OpenCV, Hugging Face Transformers, Kaldi/ESPnet or similar for ASR, TTS toolkits).
Optional deployment targets: desktop GPU, edge device, or cloud inference; include notes on scalability and latency optimization.
How to Apply
To apply for this PFE internship, send your CV, a brief cover letter describing relevant projects, and links to any demos or repositories.
Use the subject line: "PFE Application - 10 Subject 4: AI Model for Real-Time Video Translation with Live Lip Sync" and send to career@olindias.com.