
February 23, 2026

The Rise of Reinforcement Learning Gyms and the Future of Agentic AI

A recent conversation hosted by Norwest brought together Micro1 CEO Ali Ansari; Barak Turovsky, former Chief AI Officer at Google and GM; and Meta Superintelligence Labs Director Rohit Patel, with Forbes AI reporter Anna Tong moderating. Together they offered a clear view into why RL gyms are essential infrastructure and why building them well remains one of the hardest problems in AI.

AI has not yet reliably learned how to act without oversight. In recent years, the field has turned its attention to reinforcement learning gyms (RL gyms): simulated environments where models learn by interacting with a responsive world. Instead of absorbing static data, the AI system tries something, observes the outcome, and adjusts its next action.

Investment and research activity around RL gyms has surged as companies and labs push beyond language generation toward agentic systems. These are independently acting, goal-driven models that can use tools, navigate workflows, and optimize toward long-term objectives. As the ambition for agentic systems grows, so does the realization that better models alone are not enough. The environment matters just as much as the algorithm.

Key takeaways:

RL gyms are the training ground for the shift from answering to acting: As models move from single-response accuracy to multi-step, long-horizon decision making, errors compound quickly. Reinforcement learning gyms provide the only practical way to train agents to take sequential actions, optimize toward rewards, and behave reliably in real-world workflows.

The bottleneck isn’t the algorithm, it’s high-fidelity human expertise: The hardest and most expensive part of building high-quality RL environments is expert-created data: defining tasks, shaping reward signals, and verifying outcomes. Expert quality and engagement directly determine environment quality, making RL gyms inherently domain-specific and difficult to scale generically.

Most enterprises are not ready for RL yet, and evaluation comes first: The majority of enterprises are still extracting value from prompt engineering, retrieval, and off-the-shelf models. Rigorous evaluation frameworks are the missing foundation. Without strong evals, fine-tuning and RL are hard to justify, hard to steer, and easy to turn into expensive motion.

RL gyms will likely not commoditize because the real world is too complex: Each meaningful application requires a tailored environment that reflects real workflows, constraints, and failure modes. As AI moves into core business functions and physical systems, environment design becomes a durable source of differentiation.


Excerpts from the Webinar

Defining RL Gyms: From Tasks to Training

Anna Tong: Let’s start with a basic question for people like myself. How do you define what a reinforcement learning gym, or RL gym, is, and where in the model-creation pipeline are RL gyms normally used?

Ali Ansari: The way to think about RL environments is: there’s a reward model which a policy model uses to train. Tasks are defined as functions that a policy model needs to get good at. You can think of those functions as model capability improvements in a given domain.

We can choose a domain like tax as a real-world example. Let’s say we’re trying to improve model capabilities on W-2 California taxes. We would define the action space and state space of what it means to do W-2 California taxes. And with a tax expert, we would define tasks the model needs to get good at.

Each task could be something like: the tax expert collects data from a consumer to file their taxes, like an income statement and other information. Then the tax expert has a conversation with the consumer to optimize their taxes. And ultimately, the tax expert files the taxes, which is a PDF. What we aim to do with RL environments is train policy models, the foundational models, to do these tasks that experts do in real life, which require sequential actions through training. That’s one example to convey the idea.
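The structure Ali describes, a state space, an action space, and expert-defined tasks with verifiable outcomes, maps naturally onto the standard gym-style environment interface. Here is a minimal sketch under that reading; the `TaxFilingEnv` class, its task format, and its reward logic are hypothetical illustrations, not Micro1's implementation:

```python
# Minimal gym-style environment sketch for an expert-defined task.
# All names (TaxFilingEnv, required_docs, rewards) are illustrative.

class TaxFilingEnv:
    """One episode: an agent gathers documents, then files a return."""

    def __init__(self, tasks):
        self.tasks = tasks  # expert-authored task specifications
        self.state = None

    def reset(self, task_id=0):
        task = self.tasks[task_id]
        self.state = {"collected": set(), "filed": False, "task": task}
        return self.state

    def step(self, action):
        """Apply an action; return (state, reward, done)."""
        task = self.state["task"]
        reward, done = 0.0, False
        if action[0] == "collect":        # e.g. ("collect", "W-2")
            self.state["collected"].add(action[1])
        elif action[0] == "file":
            done = True
            self.state["filed"] = True
            # Verifier written with the domain expert: did the agent
            # gather every required document before filing?
            if self.state["collected"] >= set(task["required_docs"]):
                reward = 1.0
        return self.state, reward, done


# Usage: a rollout that collects both required documents, then files.
tasks = [{"required_docs": ["W-2", "1099-INT"]}]
env = TaxFilingEnv(tasks)
env.reset()
env.step(("collect", "W-2"))
env.step(("collect", "1099-INT"))
_, reward, done = env.step(("file", None))
print(reward, done)  # 1.0 True
```

The policy model is trained against many such episodes, with the expert's verifier supplying the reward signal.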

Barak Turovsky: To me, the most important thing is that it’s a transition from generating a nice response to focusing on acting. Ali mentioned tasks, and the key is actionable tasks. I believe one of the reasons there’s a lot of discussion and attention on RL gyms, and RL in general, is that we squeezed a lot of value from large, static LLMs and what I would call “prompt soup.” I like kitchen analogies — I’ll expand on that later.

If you want systems to actually act like agents and workflows, especially in the physical world, you need to measure success and optimize toward a reward, not just generate a nice response. If you care about long-horizon acting behavior, safety, and cost/quality trade-offs at scale, then reinforcement learning and by extension RL gyms becomes a natural next layer.

“If you want systems to actually act like agents and workflows, especially in the physical world, you need to measure success and optimize toward a reward. If you care about long-horizon acting behavior, safety, and cost/quality trade-offs at scale, then reinforcement learning becomes a natural next layer.” – Barak Turovsky, AI Strategist

RL Gyms and the Agentic Era: Why Now

Anna Tong: Why are RL gyms having a moment right now?

Ali Ansari: I think the main reason is very similar to what Barak just said: we’re going from this idea of answering very complex questions in domains like medical, finance, legal, and so forth, where models are getting quite good, to models actually doing things.

The way to think about models doing things is that they’re answering multiple questions in a row and then acting on those answers. If you assume models can answer questions with 90% accuracy on average (just as an arbitrary number) and then you assume the model has to answer five to ten questions in a row and act on those answers, you’re compounding errors quickly. If you take 0.9 to the power of 5 or 10, you end up with a much worse result.
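Ali's compounding arithmetic is easy to make concrete, using his 90% figure as the arbitrary per-step accuracy:

```python
# Per-step accuracy compounds multiplicatively over a multi-step task.
per_step_accuracy = 0.9
for steps in (1, 5, 10):
    task_success = per_step_accuracy ** steps
    print(f"{steps:>2} steps -> {task_success:.0%} end-to-end success")
# Output:
#  1 steps -> 90% end-to-end success
#  5 steps -> 59% end-to-end success
# 10 steps -> 35% end-to-end success
```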

So acting, which is answering multiple questions in a row, versus answering one very complex question is a different problem. We’re in this moment where this year was supposed to be the year of agents. I think 2026 will be the year of agents, and the year of agents means we need models to actually do things. That’s the transition happening now.

“I think 2026 will be the year of agents, and the year of agents means we need models to actually do things.” – Ali Ansari, CEO, Micro1

Anna Tong: Most RL gym vendors today seem to primarily sell their environments to just a handful of frontier AI labs, teams building cutting-edge models and using these environments to post-train them. From your vantage point, how have you seen this vendor landscape of startups evolve over the past couple of months?

Rohit Patel: I spend a lot of time in evals and RL. From a technical standpoint, what RL allows us to do is: once you run out of supervised fine-tuning data to update model parameters, but you still want to keep making the model better, you ask, “How do I update parameters in a way that’s consistent with the world?” A simple way is to have the model do something in the real world and then reinforce the parameters that gave the correct result.

That’s what’s happening at the core. Early Reinforcement Learning from Human Feedback (RLHF) tried to do this by having humans compare responses (by saying “this one is better than this one”) and reinforcing those answers. What’s happening now is that people are starting to realize we can move beyond having humans label which answers are better, to more high-fidelity feedback like matching answers to some state of the world. And maybe these aren’t just Q&A formats anymore; maybe models are actually taking actions.

You match the output to the state of the world and move parameters in the direction of what’s correct. If you look at what vendors are doing, they’re following that evolution. Initially, most of the work was around creating supervised fine-tuning datasets, Q&A pairs, or labeling which answer is better. They’re increasingly moving into richer, more automated ways of providing feedback, either by automating feedback themselves, or by creating a code execution environment like Docker, simulating an entire backend, or building a job task.

So I think that’s where this is going: models get feedback from these environments vendors are building. It’s an evolution from data labeling and tagging to building these environments which is what’s needed, and that’s a good direction.
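The evolution Rohit describes, from human preference labels to checking outputs against a state of the world, can be sketched in a few lines. Both reward functions below are invented illustrations: the first stands in for RLHF-style preference feedback, the second for an execution-based verifier of the kind a code environment provides:

```python
import os
import subprocess
import sys
import tempfile

# RLHF-era feedback: a human simply says which response is better.
def preference_reward(human_prefers_first: bool) -> float:
    return 1.0 if human_prefers_first else 0.0

# Environment-era feedback: a programmatic verifier checks the model's
# output against the world (here: does the generated code run cleanly?).
def verifier_reward(code: str) -> float:
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=10)
        return 1.0 if result.returncode == 0 else 0.0
    finally:
        os.unlink(path)

print(verifier_reward("assert 1 + 1 == 2"))  # 1.0
print(verifier_reward("assert 1 + 1 == 3"))  # 0.0
```

The second kind of signal is what the Docker-style execution environments Rohit mentions automate at scale: feedback comes from running the output, not from a label.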

Challenges of Building RL Gyms

Anna Tong: Ali, I know you’re on the vendor side. I’ve never built an RL gym… I’m sure it’s difficult. Can you tell us a little more about that process? What’s the hardest part of building an RL gym? Is it the design? Is it working with the labs to figure out what tasks should be rewarded, given unpredictable agent behavior? What’s the hard part?

Ali Ansari: Yeah, I think the hardest part is the quality of data that goes into these RL environments. Of course, I’m biased when I say this, but I’d say almost entirely in RL environments it is human data. You can expand on that with synthetic data, and there are other parts, like defining the state space, designing the algorithm for the reward model, and so forth. But we think the most important ingredient — and frankly the most expensive part of creating an environment — is the human data that goes into it. That’s our emphasis and it’s also the most difficult part.

And I’d say there are two things that result in really high-quality tasks, prompts, and verifiers in most cases. The first is expert quality: being able to source and deeply vet the experts working on these tasks, not just for their general skill sets, but for the specific skill sets the job and the environment require. The second thing, which I think is underappreciated in the space, is the happiness of the experts who are creating the data.

We track an “expert happiness index” at Micro1 because we want to make sure the folks doing this work are having a good time and that this new job sector results in good pay. Ultimately, that ties back to quality. When experts are happy, they create better data, which results in better models. So yeah: data is the hardest part, and those are some of the things we do to improve data quality.

“We think the most important ingredient of creating high-quality RL environments — and frankly the most expensive part of creating an environment — is the human data that goes into it. That’s our emphasis and it’s also the most difficult part.” – Ali Ansari, CEO, Micro1

The Talent Differentiator

Anna Tong: Let’s talk about talent a little more. I’d love to hear about the differences and similarities between the talent required to make these RL gyms versus the human talent required for previous types of model training. Is there a difference, or is it more of the same requirement for human experts?

Ali Ansari: Yeah, similar to what Rohit was saying a second ago, we’ve gone from very low-fidelity, generalist labeling and preference labeling to high-fidelity, complex datasets. And “labeling” is no longer really the right word, because of how complex the datasets have become and how long each data point takes to create, which is often 15 to 25 hours. The talent has shifted from generalists and crowdsourcing (basically anyone could do that labeling) to folks who are better than the models, which is obviously a high bar. So we’re often looking for PhDs, professors, and very experienced industry folks.

We have lots of lawyers, doctors, and finance experts working across many of our projects.

Barak Turovsky: Just to add: it’s very dependent on the use case. To give a comparison, when we worked on launching Google Translate in 2016, or on improving Search and Ads, you generally needed someone who knew how to translate and evaluate translation. But if you’re going to a specific domain (say, for General Motors, how to design a car, or, when I was VP of AI at Cisco, how to reconfigure a router), you need a subject matter expert. There’s a lot of expertise involved, and the task is way more complex. In my opinion, the use case drives the complexity, not necessarily whether it’s RL or other techniques. That’s what drives the complexity of human-collected data in evaluation.

The Labeling Market

Anna Tong: Yeah, maybe we can talk about market size a little bit too. A Scale AI study that came out yesterday said there are 200,000 folks now who do data labeling in the U.S. in some way, shape, or form. Most do it part-time as a supplement to their normal job. Do you guys have any stats on market size? Or how much money people can make doing that kind of thing?

Ali Ansari: Yeah, that sounds right. There are a lot of people doing this type of work now, and it’s growing fast. The way we think about it is the market will continue to grow and I think the growth will accelerate because of these real-world RL environments that need to be created by essentially every domain expert. The estimate I use is that spend in this market is roughly $10–15B a year right now across labs and enterprises, and in the next couple of years it will surpass $100B a year. Part of that will be labs continuing to spend and grow.

But I think a larger part could be the Fortune 1000 broadly spending on contextual evals to build agents. Spend per company will be a lot less than labs, but there are a lot more of them. And I think this upcoming year, enterprises will realize that implementing agents in production at large scale requires running lots of evals. It’ll become a very important part of the product development lifecycle, and eventually, enterprises may spend large portions of product budgets on evals, like they do on engineering talent.

Enterprise Readiness for RL Gyms

Anna Tong: Let’s talk about enterprises. Barak, you’re coming from a large enterprise. Give us some insight into where enterprises are in readiness to use RL gyms and what’s going to be coming in the next year or so.

Barak Turovsky: My assessment is that most enterprises are still in the phase of getting value from prompt engineering, RAG, and off-the-shelf tools. It also depends on the use case. Only a minority, in my opinion, are truly ready for serious fine-tuning. Fine-tuning makes sense when you already have clean, labeled domain data and stable use cases. That’s easier said than done. Clean, labeled domain data lives in multiple sources; it needs to be centralized and cleaned. You need instrumentation, an evaluation framework, and actual tooling and frameworks. And I want to double down on what Ali said.

Evaluation is the most under-appreciated aspect of RL gym readiness. I constantly hear people say, “We built something and it’s great,” and I ask, “How do you evaluate it?” And they say, “We ask a couple of people to evaluate the results.” That’s not production-grade evaluation. You need to be able to measure well. And finally, you need small, top-tier internal teams that understand both how to fine-tune or post-train the model and the business process. Without those, fine-tuning can become an expensive distraction because you won’t get the value if you don’t know what you’re doing.

“My assessment is that most enterprises are still in the phase of getting value from prompt engineering, RAG, and off-the-shelf tools. Only a minority are truly ready for serious fine-tuning, which makes sense when you already have clean, labeled domain data and stable use cases.” – Barak Turovsky, AI Strategist

Anna Tong: So what does this look like over the next year? You were saying very few enterprises do this now. What percent do you think will be doing it in the next year?

Barak Turovsky: I think it will increase, but I’d look at it based on the focus of the enterprise, the group, or the division. On a high level, I think of two simplified categories: use cases that are core to the business, and use cases that are corporate functions. Corporate functions, unless it’s a fintech company, are finance, legal, HR. I believe you can get by with off-the-shelf solutions with some customization. If you’re talking about core functionality, it’s different for different companies: for physical AI companies like GM, it’s designing cars, planes, fridges; for an entertainment company, it’s creating AI-generated movies or scripts.

For each core function, the requirements are much higher for accuracy and domain specificity when they’re ready. In terms of ingredients: you need infrastructure, which is like kitchen equipment and models, you need ingredients, like data, and you need staff, meaning talent. You need all three if you want to create a gourmet meal, or an AI solution that addresses core use cases that are more complex and require a combination of models and techniques, good data, and good talent.

Anna Tong: To your earlier point on talent: do you see enterprises making moves to put together teams that are able to execute on this?

Barak Turovsky: It’s mixed. Some are more advanced than others, but it’s not easy. It depends on the use cases. There are very few people who understand the space well. If you’re going to more frontier or novel areas like physical AI, which is very new, and even the models aren’t fully there, it’s an even narrower sliver of talent. It comes down to what’s important: to what extent the enterprise understands whether the AI solution is core to their business model, their margin, and their ability to hire, motivate, and retain the staff.

Anna Tong: To your point, do you think enterprises will allow AI companies to train on their environments, or do we need RL gym companies to clone these apps to allow training?

Rohit Patel: I’m not sure the best solution is big models having the ability to do everything. Maybe general navigation is okay. But I think the right solution might be that the enterprise itself trains a version on its own environment that works really well for interfacing with it. Then when I interface with that business, its model has been trained on its environment and can use internal APIs really well.

I can’t imagine there will be one model or one set of models by any lab that can do everything across all the things out there. But you do want to get part of the way there: you want strong zero-shot capabilities at baseline, and then enterprises can take it and further fine-tune. You don’t want to start from scratch. That’s where RL gyms come in: to ensure models have a base level of zero-shot ability to do these things when they’re released, and then people can build from there.

“You don’t want to start from scratch. That’s where RL gyms come in: to ensure models have a base level of zero-shot ability to do these things when they’re released, and then people can build from there.” – Rohit Patel, Director, Meta Superintelligence Labs

Anna Tong: Barak, do you think enterprises will be willing to train on their own sites? Because right now I’m seeing a lot of enterprises trying to block AI agents.

Barak Turovsky: Eventually yes, but not everywhere and not all at once. As I mentioned, you need all three things. You need infrastructure, where you could choose a model, and you definitely won’t start from scratch. But you also need ingredients in good shape.

Otherwise, the ingredients will come in different flavors, come late, and come with different quality in terms of data. And you need staff, talent. So I expect leading enterprises that understand it and focus on core use cases will start building small cross-functional AI product teams that will own defining the use cases well because it starts with that, then evaluation and data pipelines, and fine-tuning/post-training for a handful of mission-critical journeys.

And “own” doesn’t necessarily mean you develop everything yourself. You need a coherent picture: a combination of buy, build, partner; third-party tooling; internal work; and so on. All of this is non-trivial, and it will take time.

Ali Ansari: One quick point to add: there’s almost no case where you can take enterprise data off the shelf from a database and just train on it. Whether you’re creating a supervised fine-tuning (SFT) dataset, training in a more unsupervised manner, or building a model that’s niche to your use case, you need something like an RL gym. You need to structure that data, and at the very least categorize it and clean it up.

So the way to think about enterprise data is that it’s not mutually exclusive with RL gyms. You often need enterprise data to build RL gyms. It’s not always a necessary component, but if you’re using enterprise data, it can be very important.

“There’s almost no case where you can take enterprise data off the shelf and just train on it. Enterprise data is not mutually exclusive with RL gyms. You often need enterprise data to build RL gyms. It’s not always a necessary component, but if you’re using enterprise data, it can be very important.” – Ali Ansari, CEO, Micro1

And on favorite RL environments: I don’t have a favorite RL environment, but my favorite thing about RL environments is that experts, and humans in general, can do tasks they already do and are really good at, get paid for those tasks more than what they make in their day jobs and train models that will help them do those tasks in the future. That’s a beautiful thing to think about.

Core vs. Corporate: Where Customization Matters Most

Anna Tong: Can you think of any examples of enterprises now that have been able to make models for their core, what you’re calling their core ability, using custom models?

Barak Turovsky: Custom? I can’t think of an example. I think it’s early. Obviously, a lot of companies are experimenting with off-the-shelf. And to be clear, in some areas you don’t need custom models for your core functions. If you’re in legal, you probably can use off-the-shelf. It depends on the nature of the core use cases for that enterprise.

Ali Ansari: One quick point to add: I think in most cases, even when enterprises start doing lots of contextual evals for their core and other use cases, they won’t be creating any models from scratch. In some cases they won’t even be fine-tuning off-the-shelf models. They’ll run evaluations on the functions an off-the-shelf model needs to be good at, then make tweaks to prompts, make tweaks to the software layer, and make tweaks to the front-end interface. Those tweaks can be lower-hanging fruit with higher impact than fine-tuning.

What we’ve seen with some customers is the assumption that evals must always result in fine-tuning. And if an enterprise isn’t thinking about fine-tuning, they don’t think about evals. But those things aren’t coupled in every case. Even for us, if we’re improving our AI recruiter agent, Zara, we run what we call stability evals to track things like video latency. Often, changes we make based on those evals are switching a fallback model or changing a front-end interface. We don’t always fine-tune something. That distinction is important.

Anna Tong: Do you think most of the time people need to fine-tune, or is it mostly that you need to alter prompts or do something easier?

Ali Ansari: I think most of the time, because foundational models are improving so quickly and labs are spending so much on fine-tuning on their end, you don’t need to do it for a long time. Of course, if your use case is very niche, you’ll probably benefit from some fine-tuning. But doing lots of evals, understanding exactly the functions your model needs to be good at, and then changing prompts, tools, and interface on the front end will probably have more impact for now.

What Matters in an RL Gym: Vendor Evaluations

Anna Tong: I want to talk about RL gym vendor evaluations. What’s the best way to do this, and what dimensions matter the most and move the needle in terms of model performance?

Rohit Patel: If you think about how this is used today: most of the focus is still on training LLMs, and labs are working on many other specialized models. But where the money is being spent is through tool use. You get a description of an API tool, and as a language model you’re trying to call these tools. You’re not necessarily operating in a physical environment yet, and tool use has high latency. That restricts what kinds of environments these models can operate in, because they operate through APIs and tools.

So one criterion is: can we plug it in at all? Does it work? Then there’s: how general is it? There are two dimensions of RL use cases: one is building general models, and the other is building specialized models that do certain tasks. Another dimension is who is doing it: labs versus enterprises. Most RL use today still sits in labs, but it can be general or specialized. For example, Google uses RL for chip design; it’s unlikely that goes into a consumer-grade model. It’s a separate effort.

So you can have RL environments that are useful for specific purposes but not necessarily for general model training. Tasks have to be useful enough that you think they will generally enhance capabilities in the consumer space.

Then there’s fidelity: is it rich enough, and does it have high fidelity with the real world? You don’t want models learning the wrong thing because the RL environment is underspecified. You also want enough variety in tasks so you don’t just learn a finite pattern.

And from a technical standpoint, you don’t want environments with very sparse rewards, because that can be compute-intensive. These are really large networks. If rewards are massively sparse, that becomes an issue. You want somewhat dense rewards so you can train reasonably. Those are the top considerations.
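Rohit's point about sparse rewards can be illustrated with reward shaping: instead of paying out only when an entire long episode succeeds, the environment grants partial credit for verifiable intermediate milestones. The workflow and milestone names below are invented for illustration:

```python
# Sparse vs. shaped (denser) reward for the same four-step workflow.
MILESTONES = ["docs_collected", "fields_validated",
              "form_drafted", "form_filed"]

def sparse_reward(completed):
    # Signal only if the whole workflow finished. Most rollouts return
    # 0.0, giving the optimizer almost nothing to learn from, which is
    # why sparse-reward training is so compute-intensive.
    return 1.0 if completed == MILESTONES else 0.0

def shaped_reward(completed):
    # Partial credit per verified milestone: a denser training signal.
    return sum(0.25 for step in completed if step in MILESTONES)

partial = ["docs_collected", "fields_validated"]
print(sparse_reward(partial), shaped_reward(partial))        # 0.0 0.5
print(sparse_reward(MILESTONES), shaped_reward(MILESTONES))  # 1.0 1.0
```

The trade-off is that each milestone needs its own verifier, which is exactly the expert-authored data Ali describes as the expensive part.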

RL Gym Environments and the Training Pipeline

Anna Tong: You were also talking about plugging RL environments into the training pipeline. How hard is that, and how much work is it? Is it easy or hard?

Rohit Patel: Training is challenging in general. But if you think about what RL is, if you have your reward function and your update function, then in a way, supervised fine-tuning is a special case. So from a technical standpoint, it’s not massively challenging. The challenge is more: how does this affect the rest of your training? How does it work with your data mix? Which environment helps you, given you have a finite amount of compute? It’s more an optimization problem than “is it hard to get working.” It’s hard to figure out whether you should do it, for how long, and how much compute to spend on it versus something else.
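Rohit's observation that supervised fine-tuning is a special case of the RL update can be seen directly in the per-token loss: the REINFORCE objective with the reward fixed at 1.0 on demonstration tokens reduces to the standard cross-entropy loss. A toy numerical sketch, not a training recipe:

```python
import math

# REINFORCE loss for one token: -reward * log p(token).
def reinforce_loss(log_prob, reward):
    return -reward * log_prob

# SFT loss for one token: plain negative log-likelihood.
def sft_loss(log_prob):
    return -log_prob

# With reward pinned to 1.0 on a demonstrated token, the two coincide,
# so SFT is the RL update with a constant reward of 1 on expert data.
log_p = math.log(0.8)  # model's probability of the demonstrated token
assert reinforce_loss(log_p, reward=1.0) == sft_loss(log_p)
print(f"both losses: {sft_loss(log_p):.4f}")  # both losses: 0.2231
```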

Barak Turovsky: From an enterprise perspective, another challenge is access. Rohit is speaking from a frontier model development perspective. But in today’s infrastructure, frontier model providers don’t provide much access to deeply customize the model, because it can affect the models and you don’t want semi-random people messing around with them. There is basic fine-tuning or maybe basic to intermediate fine-tuning that platforms like Google Cloud, AWS, and Microsoft’s AI Foundry provide. But to go deeper into the model, the tooling doesn’t exist or exists in limited fashion for “whales” (very big clients) with sophisticated AI teams and a lot of budget. So there’s technical complexity, and there’s also an access component: to what extent frontier model developers allow you to do it.

Rohit Patel: And this question becomes very different if you’re thinking about open-source models and enterprises trying to fine-tune them with RL. But there are many stages before the world gets there. Companies first need to get into the habit of building good evals. Once they have evals, they’ll be able to use existing models and prompt engineers effectively. Then they’ll say, “Okay, maybe we’re not happy, maybe we start fine-tuning,” because evals give them the North Star. Then they start fine-tuning, and then they realize they need RL gyms. That progression will take time. The necessary condition is getting to a place where everyone knows how to evaluate before they decide they need RL because then they trust they’re moving the ball forward.

Will RL Gyms Commoditize?

Anna Tong: Maybe next panel we’ll talk about how to make a good evaluation. Do you see RL environment creation being quickly commoditized over time? Is that going to happen?

Ali Ansari: I think until we can model the world itself, it won’t be commoditized. We can infinitely strive toward modeling the world itself, but it’s an infinite journey and we won’t actually reach it. Each domain we need to improve models on will require expert domain data that won’t be commoditized. And physical intelligence requires even more dimensions and more data. The journey there will start similarly to how LLMs did, with generalists creating data, and then it will ultimately move to experts doing things in the physical world to train physically intelligent models. That will take even longer to commoditize. I don’t think it ever will.

Rohit Patel: Yeah. I don’t think people fully appreciate how much RL we’re going to need in the future. No matter what the future scenario is — bubble or not — we’re going to continue to want to use AI in different settings. The only way to make it good at different things is to build an environment for that particular task or setting. The world is infinitely rich, so there’s an explosion of places where people will try to apply this. They’ll need to build an RL environment for that before they can do any of it. It’s underappreciated because only the labs are working on it right now. But in a few years, I can imagine everyone building their own environments. That’s how we’ll get models to work on anything real and extract economic value.

“No matter what the future scenario is — bubble or not — we’re going to continue to want to use AI in different settings. The only way to make it good at different things is to build an environment for that particular task or setting.” – Rohit Patel, Director, Meta Superintelligence Labs

The Next Frontier: Physical Intelligence

Anna Tong: Ali, you were talking about the physical intelligence component. Can you talk about that a little more? Right now our models are mostly text- or image-based, or on the computer. What does physical intelligence look like?

Ali Ansari: The way to think about physical intelligence versus LLMs is that we’re currently bound to a laptop. A laptop isn’t the entire world, and LLMs can only act within that hardware. The actions they can take, even with environments that let them do things like file taxes, are limited compared to the physical world itself. The next step is having agents act in the physical world, moving atoms, not only bits.

Right now, for physical intelligence, we’re creating a pre-training dataset, the internet-equivalent dataset that doesn’t exist yet. That will require lots of data from generalists doing things in the real world and recording themselves. We annotate those actions for what are called VLA (vision-language-action) models; there are other architectures, but that’s one main approach. It will probably take a couple of years for physical intelligence models to have the pre-training dataset ready. After that, we’ll converge toward expert-level data and follow a similar trajectory to LLMs.

We see the pillars as similar. You need to source and vet the talent. You need a data platform, which could be Meta Ray-Ban glasses and other tools, to record the data. And you need a pipeline to manage data quality and velocity. At Micro1, we’re translating the data engine we have into physical intelligence.

Barak Turovsky: Just to add, having worked on this very recently: it’s definitely a new frontier. Those physical models will leverage LLMs a lot; it’s LLMs meeting the physical world. Ali mentioned data, but we probably need to start from the model. The good news is the recent LLM advances in text understanding. For example, if you want to design a robot that can load a dishwasher, we already have a lot of text training for understanding dishwasher manuals. Then you have multimodal understanding, learning from video how to operate a dishwasher, or the ability to read a CAD file.

Some frontier models are very good at reading a CAD drawing to learn how to grab a part. And you also have LLMs becoming very good at reasoning to understand complex relationships. That likely makes it possible. The challenge is bringing all of those together, because no model is doing that end-to-end yet. That will require data, but also rare talent, not only on the data generation side, but on the training side. It’s exciting, early, and moving fast.

RL Gyms: The Practice Ground for Agentic AI

AI has been rewarded for producing the right words. The next era will reward it for taking the right actions.

As Ali Ansari, Barak Turovsky, and Rohit Patel underscored, the shift to autonomous action changes what “progress” looks like for AI and calls for high-quality human data, deep domain expertise, and evaluation frameworks that can meaningfully test reliability, safety, and long-horizon performance.

Many enterprises will continue to find value in prompt engineering and off-the-shelf systems, but as ambitions move from assistants to agents, the limiting factor will increasingly be the environment and how well it reflects the real world.

Watch the full webinar to further explore what it will take to build agentic systems that can operate reliably in the real world.


Editor’s Note: The Norwest team thanks our three panelists for sharing their insights, as well as Anna Tong, who guided the conversation as an independent moderator.
