What if the expertise that built foundation models could reshape how you think about AI's future? In this episode of The Data Engineering Show, host Benjamin Wagner sits down with Soumya Batra, founder and CEO of WisePort AI and former safety lead on Llama 2 and Llama 3 at Meta, to explore how foundation models evolved from traditional NLP, why post-training offers the highest-leverage point for safety and controllability, and why the next AI frontier lies in natively agentic systems rather than simply scaling larger transformers. Whether you're curious about the model training lifecycle or wondering what comes after large language models, this conversation unpacks the technical strategies and vision shaping tomorrow's AI systems.
What You'll Learn:
- Why historical NLP work becomes obsolete with each paradigm shift: Understand how Bayesian networks, RNNs, and LSTMs each dominated until they were replaced, and why the current transformer-scaling dogma will likely meet the same fate
- How to structure the foundation model training lifecycle for safety: Learn the three critical phases (pretraining with data-mix optimization, supervised fine-tuning for instruction alignment, and reinforcement learning for human-preference integration) and where safety interventions deliver maximum leverage
- The counterintuitive data strategy for pretraining safety: Discover why removing all toxic content actually weakens model robustness, and how maintaining a precise balance preserves the model's ability to classify and refuse harmful requests
- How dual reward models maximize both helpfulness and safety: See why combining helpfulness and safety objectives (as done in Llama 3) ensures every training sample reinforces both capabilities simultaneously rather than creating trade-offs
- What "natively agentic" means and why it matters more than LLM-powered agents: Learn how foundational agentic models dynamically explore action spaces at inference time instead of relying on fixed developer-defined scaffolding, unlocking domain-agnostic workflows
- How to build a foundational AI startup without massive training datasets: Understand why synthetic data generation, deterministic task validation, and deep domain expertise can substitute for Internet-scale language corpora in the agentic space
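The dual-reward idea above can be illustrated with a toy sketch. This is not Meta's implementation: the reward "models" below are stand-in functions, and the `min()` combination rule (a sample scores well only if it is both helpful and safe) is an assumption chosen for illustration.

```python
# Toy sketch of scoring RLHF candidates with two reward models:
# one for helpfulness, one for safety. Both scorers and the min()
# combination are illustrative assumptions, not Meta's actual recipe.

def helpfulness_reward(response: str) -> float:
    # Stand-in scorer: reward longer, more substantive answers, capped at 1.0.
    return min(len(response.split()) / 20.0, 1.0)

def safety_reward(response: str) -> float:
    # Stand-in scorer: zero out responses matching a toy blocklist.
    blocked = {"exploit", "weapon"}
    return 0.0 if any(word in response.lower() for word in blocked) else 1.0

def combined_reward(response: str) -> float:
    # Taking the minimum acts as a soft logical AND: a training sample
    # reinforces the policy only when it satisfies BOTH objectives,
    # instead of trading helpfulness against safety.
    return min(helpfulness_reward(response), safety_reward(response))

helpful_safe = ("Here is a detailed, step-by-step explanation of how "
                "transformers use attention to mix information across tokens.")
helpful_unsafe = "Here is a detailed explanation of how to build a weapon."

print(combined_reward(helpful_safe))    # high score: helpful and safe
print(combined_reward(helpful_unsafe))  # safety gates the reward to zero
```

A weighted sum would also work but can let strong helpfulness mask an unsafe answer; the min rule makes the safety objective a hard gate, which matches the spirit of "every sample reinforces both capabilities."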
If you enjoyed this episode, make sure to subscribe, rate, and review it on Apple Podcasts, Spotify, and YouTube Podcasts. Instructions on how to do this are here.
About the Guest
Soumya Batra is the Founder and CEO of WisePort AI, a foundational AI company specializing in agentic AI systems. With over twelve years of expertise in NLP and machine learning, she previously served as a Tech Lead and Applied Research Scientist at Meta, where she led safety and controllability efforts for both Llama 2 and Llama 3. Her career spans foundational work at Carnegie Mellon University, Microsoft, and Meta, establishing her as a pioneering voice in conversational AI and foundation model development.

In this episode, Soumya demystifies the journey from traditional NLP to large language models, revealing how safety and controllability are embedded across the entire model lifecycle, from pretraining through reinforcement learning. Her insights on the future of agentic AI and the limitations of current scaling-only approaches provide essential perspective for data engineers and ML practitioners navigating the rapidly evolving AI landscape.
Quotes
"I did not know then that this would become my career for the next decade." - Soumya
"Whatever work that I've done in the past becomes irrelevant all of a sudden." - Soumya
"There is always a notion of, yes, this is the big thing, and then no, it's not anymore." - Soumya
"I really think that we are going to be proven wrong once again about scaling transformers being the only way to achieve general intelligence." - Soumya
"Safety was an issue even back then, even though we were training in such controlled settings." - Soumya
"If you don't put some toxic content there, then it will lose the ability to classify it and it'll be much easier to break the safety later on." - Soumya
"In the post training phase, we are giving it that ability to be able to answer users' questions." - Soumya
"The next unlock will now come from foundational agent models that are natively agentic, which will unlock use cases that look unimaginable to us right now." - Soumya
"Natively agentic means the foundational model itself needs to dynamically explore the action space, rather than scaffolding around existing LLMs." - Soumya
"The real unlock comes from creating your own use cases, creating your own synthetic data, and going deep into a few workflows." - Soumya
Resources
Connect on LinkedIn:
Websites:
Articles & Research Papers:
- LLaMA: Open and Efficient Foundation Language Models – Meta AI Research
- LIMA: Less Is More for Alignment – Meta AI Research
Educational Institutions:
- Carnegie Mellon University - Language Technologies Institute (LTI)