Job Responsibilities
- Own and drive the enterprise MLOps and data platform architecture, enabling scalable ML workloads and data pipelines across business units
- Lead development of enterprise-wide MLOps and Generative AI frameworks, standardizing ML lifecycle, deployment, and governance practices
- Drive DataOps and feature engineering initiatives, managing large-scale data pipelines from ingestion to model serving
- Design and implement end-to-end ML systems, including training, validation, deployment, monitoring, and retraining workflows
- Lead DevOps practices and CI/CD implementation for ML and data platforms, ensuring reliable and automated deployments
- Own Infrastructure as Code (IaC) strategy using Terraform, Terragrunt, and CloudFormation across multi-account cloud environments
- Establish and enforce security, governance, and compliance frameworks including IAM, RBAC, encryption, and auditability
- Build and maintain observability frameworks for monitoring, logging, model performance, and system reliability
- Enable and scale self-service ML platforms, improving developer productivity and shortening deployment timelines
- Drive cloud cost optimization (FinOps) and operational efficiency across ML and data workloads
- Lead cross-functional collaboration with data science, engineering, and business teams to deliver production-ready AI solutions
- Evaluate and integrate emerging technologies, including Generative AI (LLMs, RAG, multi-model systems), into enterprise platforms
Job Requirements
- Bachelor’s degree in Computer Science, Engineering, or a related field
- 6+ years of experience in MLOps, DevOps, Platform Engineering, or Data Engineering
- Strong experience building and operating enterprise-scale ML platforms
- Hands-on experience with AWS (preferred) or other cloud platforms
- Strong experience with CI/CD tools (Jenkins, GitHub Actions, etc.)
- Experience with Infrastructure as Code (Terraform, Terragrunt, CloudFormation)
- Experience with containerization and orchestration (Docker, Kubernetes)
- Strong understanding of ML lifecycle management and MLOps tools (MLflow, SageMaker, Databricks, etc.)
- Experience with data engineering systems (ETL/ELT pipelines, feature stores, large-scale data processing)
- Experience implementing observability and monitoring frameworks
- Strong understanding of security and governance in cloud and ML systems
Preferred Qualifications
- Experience building multi-tenant, enterprise ML platforms
- Experience with Databricks, feature stores, and large-scale distributed data systems
- Strong programming skills (Python preferred)
- Experience handling petabyte-scale data pipelines
- Experience in multi-account AWS environments and platform standardization
- Exposure to Generative AI frameworks (LLMs, RAG, vector databases)
- Experience in manufacturing or industrial domain environments
Key Competencies
- Strong system design and architecture mindset
- Ability to build platforms, not just pipelines
- Ownership of large-scale, cross-team engineering initiatives
- Strong collaboration and stakeholder communication skills
- Focus on scalability, reliability, and governance