Job Description
WHO WE ARE
Foundation models have transformed text and images, but structured data - the largest and most consequential data modality in the world - has remained untouched. Tables power every clinical trial, every financial model, every scientific experiment, every business decision. No one has built a foundation model that truly understands them.
Until now. What LLMs did for language, we're doing for tables.
Momentum: We pioneered tabular foundation models and are now the world-leading organization in structured data ML. Our TabPFN v2 model was published in Nature (https://www.nature.com/articles/s41586-024-08328-6) and set a new state of the art for tabular machine learning. Since its release, we've scaled model capabilities more than 20x, reached 3M+ downloads and 6,000+ GitHub stars, and are seeing accelerating adoption across research and industry - from detecting lung disease with Oxford Cancer Analytics (https://www.oxcan.org/news/prior-labs-and-oxford-cancer-analytics-partner-to-advance-liquid-biopsy-and-clinical-decision-making-in-lung-disease) to preventing train failures with Hitachi (https://siliconangle.com/2025/12/01/prior-labs-debuts-tabular-ai-foundation-model-scales-10-million-rows/) to improving clinical trial decisions with BostonGene (https://priorlabs.ai/case-studies/boston-gene).
The hardest work is in front of us. We're scaling tabular foundation models to handle millions of rows, thousands of features, real-time inference, and entirely new data modalities - while building the infrastructure to deploy them in production across some of the most demanding industries on earth. These are open problems no one else is working on at this level.
Our team: We’re a small, highly selective team (https://priorlabs.ai/about) of 20+ engineers and researchers, chosen from over 5,000 applicants, with backgrounds spanning Google, Apple, Amazon, Microsoft, G-Research, Jane Street, Goldman Sachs, and CERN. We're led by Frank Hutter (https://www.linkedin.com/in/frank-hutter-9190b24b/), Noah Hollmann (https://www.linkedin.com/in/noah-hollmann-668b9010b/), and Sauraj Gambhir (https://www.linkedin.com/in/sauraj-g/), and advised by world-leading AI researchers such as Bernhard Schölkopf and Turing Award winner Yann LeCun. We ship fast, produce top-tier research, and hold each other to an extremely high bar.
What’s Next: In 2025, we raised a €9m pre-seed led by Balderton Capital, backed by leaders from Hugging Face, DeepMind, and Black Forest Labs. The next modality shift in AI is happening - and we're hiring the team that makes it happen.
ABOUT THE ROLE
You’ll take on challenging engineering tasks crucial to the development of tabular foundation models. You’ll build and maintain best-in-class training infrastructure alongside our developer productivity tooling and open source projects, working closely with researchers to ensure we can iterate quickly and scale our models.
KEY RESPONSIBILITIES
- Own and evolve our cloud GPU infrastructure (currently Slurm on GCP) — operations, reliability, cost optimization, and architecture for scaling across multiple cloud and HPC providers
- Work closely with researchers to identify and resolve performance bottlenecks in distributed training and inference. Support high hardware utilization and efficient memory usage through systems-level debugging, profiling, and infrastructure improvements.
- Build and maintain the developer productivity layer: CI pipelines, experiment tracking, model registry, data processing, and internal tooling that keeps the research team's iteration speed high
- Try out your own ideas: We operate an open environment. If you’ve got the next SOTA tabular architecture up your sleeve, go ahead and train it.
What we use today: Slurm, GCP, Docker, wandb, GitHub Actions, uv, PyTorch, Triton
QUALIFICATIONS
- Exceptional software engineering fundamentals and expert-level Python proficiency, with 5+ years of hands-on industry experience building and operating production systems.
- Proven track record of designing and building complex, scalable software, preferably for data processing or distributed systems.
- Deep, practical knowledge of the modern ML ecosystem (PyTorch, scikit-learn, etc.) and a genuine interest in applying systems thinking to solve hard problems in AI.
- Strong understanding of the entire machine learning lifecycle, from data ingestion and preparation to model deployment, monitoring, and retraining. Familiarity with MLOps principles and best practices (e.g., reproducibility, versioning, automation, continuous integration/delivery for ML).
COMPENSATION & BENEFITS
- Competitive compensation package with meaningful equity (we compete with the world's biggest AI companies for talent)
- Work with state-of-the-art ML architecture, substantial compute resources, and a world-class team
- Annual company-wide offsites to bring the team together (last trip was to the Alps 🏔️ https://www.linkedin.com/posts/prior-labs_tabpfn-offsite-priorlabs-activity-7382386777921105921-7tRw?utm_source=share&utm_medium=member_desktop&rcm=ACoAABvC0LAB5X27zAd4f-VyE4KnR-29-QZuaSE)
- 30 days of paid vacation + public holidays
- Comprehensive benefits including healthcare, transportation, and fitness
- Support with relocation where needed
OUR COMMITMENTS
- We believe the best products and teams come from a wide range of perspectives, experiences, and backgrounds. That’s why we welcome applications from people of all identities and walks of life, especially anyone who’s ever felt discouraged by "not checking every box."
- We’re committed to creating a safe, inclusive environment and providing equal opportunities regardless of gender, sexual orientation, origin, disabilities, or any other traits that make you who you are.