Backend Developer
Location: Dubai, United Arab Emirates (Micropolis Robotics)
About the Role
We’re building an AI Assistant that serves as a digital concierge, helpdesk, and knowledge hub inside our
mobile application. Users will:
- Ask questions about our services and their own data (documents and tables)
- Perform actions by invoking our application services through the assistant (tool use/MCP/API integrations)
- Converse via both text and real‐time voice
The AI system runs using open-source models and tooling, with on-premise deployment as a first-class option.
Key Responsibilities
Design, build, and operate a production server‐side AI assistant that:
– Answers questions grounded in user‐scoped data (docs, tables) with citations where applicable
– Performs actions by calling internal/external APIs securely on the user’s behalf
– Supports low‐latency, real‐time voice chat (streaming STT/TTS + incremental LLM responses)
Implement the tool/agent layer:
– Structured tool calling (JSON‐schema based) to integrate business services
– Model Context Protocol (MCP) servers/clients where appropriate for tool discovery and execution
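For illustration, a minimal sketch of the tool layer this implies; the `get_invoice_status` tool, its schema, and the dispatcher are hypothetical stand-ins for real business services:

```python
import json

# JSON-schema description of a business-service tool, as exposed to the model.
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_invoice_status",
        "description": "Look up the status of an invoice for the current user.",
        "parameters": {
            "type": "object",
            "properties": {"invoice_id": {"type": "string"}},
            "required": ["invoice_id"],
        },
    },
}]

def get_invoice_status(invoice_id: str, user_id: str) -> dict:
    # Placeholder: would call the internal billing API, scoped to user_id.
    return {"invoice_id": invoice_id, "status": "paid"}

HANDLERS = {"get_invoice_status": get_invoice_status}

def dispatch(tool_call: dict, user_id: str) -> str:
    """Validate and execute a model-emitted tool call on the user's behalf."""
    fn = HANDLERS[tool_call["name"]]           # unknown tools raise KeyError
    args = json.loads(tool_call["arguments"])  # models emit arguments as JSON text
    return json.dumps(fn(**args, user_id=user_id))

# e.g. dispatch({"name": "get_invoice_status",
#                "arguments": '{"invoice_id": "INV-42"}'}, user_id="u1")
```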
Architect retrieval‐augmented generation (RAG):
– Ingestion for documents and tables, parsing, chunking, embeddings, metadata, and indexing
– Hybrid retrieval (sparse+dense), query rewriting, and answer attribution
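As a rough sketch of hybrid retrieval with reciprocal rank fusion (RRF), assuming `rank_bm25` for sparse scoring and `sentence-transformers` for dense embeddings; the corpus and model choice are illustrative:

```python
import numpy as np
from rank_bm25 import BM25Okapi                         # sparse lexical scoring
from sentence_transformers import SentenceTransformer   # dense embeddings

docs = ["Refunds are processed within 5 days.", "Invoices are emailed monthly."]

model = SentenceTransformer("all-MiniLM-L6-v2")  # any local embedding model works
doc_vecs = model.encode(docs, normalize_embeddings=True)
bm25 = BM25Okapi([d.lower().split() for d in docs])

def hybrid_search(query: str, k: int = 5) -> list[int]:
    """Fuse sparse and dense rankings with reciprocal rank fusion (RRF)."""
    q_vec = model.encode([query], normalize_embeddings=True)[0]
    dense = (doc_vecs @ q_vec).argsort()[::-1]                   # best-first
    sparse = np.argsort(bm25.get_scores(query.lower().split()))[::-1]
    scores: dict[int, float] = {}
    for ranking in (dense, sparse):
        for rank, idx in enumerate(ranking):
            scores[idx] = scores.get(idx, 0.0) + 1.0 / (60 + rank)  # RRF, k=60
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Answer attribution then falls out of returning chunk metadata (source document, offsets) alongside the fused hits.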
Deliver performant, cost-efficient inference on open-source models:
– Model selection/routing; context management; caching/batching; streaming token delivery
– GPU utilization and serving via vLLM/TGI/llama.cpp/Ollama or similar
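As one concrete example, vLLM exposes an OpenAI-compatible endpoint, so streaming token delivery can be consumed with the standard `openai` client; the model name below is only an example:

```python
from openai import OpenAI

# vLLM serves an OpenAI-compatible API, e.g.:
#   vllm serve Qwen/Qwen2.5-7B-Instruct --port 8000
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",
    messages=[{"role": "user", "content": "Summarize my open tickets."}],
    stream=True,  # incremental token delivery
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```

Because the endpoint shape is standard, model routing reduces to swapping `base_url` and `model` per request.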
Build resilient APIs and real‐time integrations:
– WebSockets/WebRTC/gRPC for streaming voice; REST/GraphQL for control and orchestration
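A minimal FastAPI WebSocket sketch of the streaming pattern; `generate_stream` and the `<eot>` end-of-turn marker are hypothetical placeholders for a real LLM streamer:

```python
from fastapi import FastAPI, WebSocket

app = FastAPI()

async def generate_stream(prompt: str):
    # Placeholder: swap in a real streaming LLM call (e.g. the vLLM client above).
    for token in ("You", " said: ", prompt):
        yield token

@app.websocket("/chat")
async def chat(ws: WebSocket):
    """Relay each user turn to the model and stream tokens back as they arrive."""
    await ws.accept()
    while True:
        user_turn = await ws.receive_text()
        async for token in generate_stream(user_turn):
            await ws.send_text(token)
        await ws.send_text("<eot>")  # hypothetical end-of-turn marker
```

Run with `uvicorn app:app`; the same pattern extends to binary audio frames for voice.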
Productionize and operate on server/on‐prem:
– Containerize with Docker; automate CI/CD; implement logs/metrics/traces (OpenTelemetry)
– Evals, A/B tests, safety/guardrails, and human‐in‐the‐loop feedback
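As one illustration of the observability side, OpenTelemetry spans around pipeline stages; the exporter and span names are placeholders, and production would swap in an OTLP exporter:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

# Console exporter for local use; replace with an OTLP exporter in production.
provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("assistant")

def answer(question: str) -> str:
    with tracer.start_as_current_span("rag.retrieve") as span:
        span.set_attribute("question.length", len(question))
        ...  # retrieval
    with tracer.start_as_current_span("llm.generate"):
        ...  # generation
    return "..."
```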
Required Skills and Experience
- Proficiency in Python
- Open-source LLMs and serving:
  - Experience with models such as Llama, Mistral/Mixtral, and Qwen
  - Serving stacks: vLLM, Text Generation Inference (TGI), llama.cpp, Ollama
  - Prompt engineering, routing, and context/window management
 
- RAG and data systems:
  - Vector DBs (e.g., FAISS, Qdrant, Weaviate, Milvus, pgvector) and hybrid search
  - Document/table ingestion and normalization; schema/metadata design
 
- Real-time voice:
  - STT/TTS (e.g., Whisper/faster-whisper, Vosk, Coqui TTS, Piper) with streaming pipelines
  - Low-latency streaming via WebSockets/WebRTC/gRPC
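For instance, a short faster-whisper sketch; the model size, device, and file name are illustrative. Segments are yielded lazily, so partial transcripts can be forwarded to the client early:

```python
from faster_whisper import WhisperModel

# Small CPU-friendly model; larger GPU models cut word error rate at higher latency.
model = WhisperModel("base", device="cpu", compute_type="int8")

segments, info = model.transcribe("user_turn.wav", vad_filter=True)
for seg in segments:  # lazy generator: text is available as each segment decodes
    print(f"[{seg.start:.1f}s -> {seg.end:.1f}s] {seg.text}")
```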
 
- Tooling for actions:
  - Structured tool/function calling, API design/integration, and service orchestration
  - Familiarity with Model Context Protocol (MCP) concepts and usage
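A minimal MCP server sketch, assuming the official MCP Python SDK's `FastMCP` helper; the `lookup_order` tool is hypothetical:

```python
from mcp.server.fastmcp import FastMCP  # official MCP Python SDK

mcp = FastMCP("micropolis-demo")

@mcp.tool()
def lookup_order(order_id: str) -> str:
    """Hypothetical business tool, discoverable by any MCP client."""
    return f"Order {order_id}: shipped"

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default
```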
 
- Deployment and operations (on-prem first):
  - Docker, Linux, networking, and secure service deployment
  - GPU stacks (CUDA/drivers/containers) and performance tuning
 
- Excellent communication, documentation, and cross‐functional collaboration
Preferred Skills
- Agents/frameworks: LangChain, LlamaIndex, Semantic Kernel, or custom tool routers
- Advanced retrieval: multi‐vector stores, RRF/hybrid search, query planning, re‐ranking
- SQL generation and safe execution over tabular data; row-level security; schema mapping (see the sketch after this list)
- Document processing: OCR, table extraction, CSV/Parquet pipelines
- Serving/perf: Triton, quantization (GGUF/GGML), LoRA/QLoRA with PEFT, KV‐cache optimizations
- Evals and observability: Ragas/DeepEval, Langfuse/PromptLayer, OpenTelemetry
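On the SQL point above, a small sketch of safe execution, assuming SQLite: a read-only connection plus parameterized, user-scoped queries. The `invoices` table and column names are hypothetical:

```python
import sqlite3

def run_user_query(db_path: str, user_id: str, invoice_id: str) -> list[tuple]:
    """Execute a vetted, parameterized query against a read-only connection."""
    # mode=ro opens the database read-only, so generated SQL cannot mutate data.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        # Row-level security: every query is scoped to the requesting user.
        return conn.execute(
            "SELECT id, status, total FROM invoices WHERE user_id = ? AND id = ?",
            (user_id, invoice_id),
        ).fetchall()
    finally:
        conn.close()
```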
Nice to Have (Not Required)
- Public cloud experience (AWS/Azure/GCP) for hybrid or future deployments