Amazon Nova Act: Building Reliable Browser Agents

Generative AI & LLMOps

Learn how Amazon Nova Act combines an AI model trained for UI understanding with a Python SDK, enabling developers to build browser automation that is more reliable than traditional methods.

Automating interactions with web applications is a common, yet often frustrating, task. While APIs offer clean integration points, many essential internal tools, legacy systems, or third-party websites lack adequate API coverage. This forces us back to crafting browser automation scripts, which are notoriously brittle, often breaking with the smallest UI change, leading to maintenance headaches and unreliable processes. Can AI offer a more robust way to interact directly with web UIs?

Amazon is exploring this space with Amazon Nova Act, currently available as a research preview. It combines a new AI model trained for UI understanding with a Python SDK, aiming to let developers build agents capable of performing actions within a web browser more reliably than traditional methods or some current AI agent approaches. The idea isn't necessarily full autonomy, but rather providing developers with better tools to automate specific, necessary browser-based workflows where APIs fall short.

This article provides a technical deep dive into the Amazon Nova Act SDK, based entirely on the information released by Amazon. We'll explore its core philosophy, particularly its emphasis on reliability, examine how to get started with the SDK, dissect the practical building blocks it offers for browser automation, look at the provided examples, and discuss the important considerations and limitations that come with using a research preview technology.

What is Amazon Nova Act?

Amazon Nova Act centers around an AI model specifically trained to understand web page structures and execute actions, exposed through the nova-act Python SDK. You interact with it by giving it natural language commands to perform tasks within a browser session it controls via Playwright.

The defining characteristic highlighted by Amazon is a deliberate focus on reliability. If you've experimented with AI agents prompted with high-level goals (like "Plan a complete vacation itinerary"), you might have encountered unpredictable behavior or frequent failures, especially with complex UI interactions. Amazon acknowledges this challenge, citing accuracy rates as low as 30-60% for state-of-the-art models on such tasks.

Nova Act takes a different path. Instead of aiming for end-to-end task completion from a single high-level prompt, it's designed for developers to break down workflows into a sequence of smaller, explicit, composable commands. Each command, representing a discrete step like "click the 'Submit' button" or "select 'Option 3' from the dropdown," is passed to the SDK's act() method. The premise is that by focusing the AI on smaller, well-defined actions, the reliability of each step, and thus the overall workflow, can be significantly improved. Why? Because this reduces ambiguity and the scope for misinterpretation by the AI model, leading to more predictable outcomes for each step compared to open-ended goals.

To this end, Amazon states they concentrated on achieving high accuracy (over 90% on internal tests) for fundamental UI actions known to trip up other models, such as interacting correctly with date pickers, dropdown menus, and popups. They also published benchmark results on ScreenSpot Web (evaluating interaction with text and visual elements based on language instructions) and GroundUI Web (evaluating interaction with various UI elements), showing competitive performance which further underscores this focus on dependable UI actuation. Early tests even showed some capability transfer to novel environments like web games, suggesting a degree of generalization in its UI understanding.

It's important to frame this correctly, however. Based on the provided information, Nova Act in its current form isn't positioned as a fully autonomous agent ready for complex, open-ended problem solving. It appears to be a tool for developers to build more reliable automation for specific, often multi-step, browser-based tasks where precision and predictability are key.

And critically, Nova Act is currently a research preview. It's experimental technology, evolving, and comes with specific caveats and limitations we'll detail later.

Setting Up Your Nova Act Environment

Before you can start automating, you need to set up your environment. You'll need macOS or Ubuntu, and Python 3.10 or higher.

The setup involves two main steps: authentication and SDK installation.

1. Authentication: Nova Act requires an API key to authenticate your requests.

First, get a key by visiting the Nova Act portal and following the instructions to generate one. Next, set it in an environment variable named NOVA_ACT_API_KEY.

Security Reminder: Anyone who has your API key can use Nova Act under your account, so protect it carefully. Do not embed it directly in your code, commit it to version control, or share it. If you believe your key has been compromised, contact Amazon support via nova-act@amazon.com to have it deactivated and request a new one.
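One way to follow this advice is to read the key from the environment at startup and fail fast when it's missing. This is a minimal sketch: the helper name require_api_key is hypothetical and not part of the SDK, which reads NOVA_ACT_API_KEY on its own.

```python
import os


def require_api_key(env=os.environ) -> str:
    """Return the Nova Act API key from the environment, or raise if it is unset."""
    key = env.get("NOVA_ACT_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "NOVA_ACT_API_KEY is not set; export it in your shell instead of "
            "hard-coding the key in your script."
        )
    return key
```

Calling this before creating any agent turns a missing key into an immediate, readable error rather than a failure partway through a workflow.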

2. Installation: With Python 3.10+ available and the environment variable set, you can install the SDK using pip: pip install nova-act.

A quick note: the very first time you run a script using NovaAct, you might notice a delay of a minute or two before things start happening. This is expected. The SDK needs to download and install the necessary Playwright browser components in the background. Subsequent runs should initialize much faster, typically within a few seconds.

Quick Start: Your First Browser Automation

Let's look at the basic usage pattern through the simple example provided in the SDK documentation: adding a coffee maker to an Amazon shopping cart. This gives you a feel for how NovaAct works in practice.

This example uses Script Mode, which is how you'd typically embed Nova Act within an automated script. The with NovaAct(...) as n: construct conveniently manages the browser lifecycle, ensuring it starts and stops correctly.
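The listing itself isn't reproduced here, so what follows is a sketch of what such a script could look like. The starting_page argument and the exact prompt wording are assumptions based on the SDK's quick start, and the helper function names are hypothetical; check them against the current documentation.

```python
def add_coffee_maker_to_cart(n) -> None:
    """Run the three steps of the example against an already-started agent."""
    n.act("search for a coffee maker")
    n.act("select the first result")
    n.act("scroll down or up until you see 'add to cart' and then click 'add to cart'")


def main() -> None:
    # Imported lazily so the helper above can be exercised without the SDK installed.
    from nova_act import NovaAct

    # The context manager starts the browser on entry and stops it on exit.
    with NovaAct(starting_page="https://www.amazon.com") as n:
        add_coffee_maker_to_cart(n)
```

Call main() once the SDK is installed and NOVA_ACT_API_KEY is set.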

Executing this script performs the described sequence, automating the initial steps of purchasing an item.

Now let's look at Interactive Mode, which you'd use for experimentation or step-by-step debugging. This allows you to use Nova Act within a standard Python REPL (note: the documentation mentions that IPython is not currently supported).
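As a rough illustration, an interactive session might look like the REPL lines in the comment below. The function underneath is a hypothetical helper (not part of the SDK) that wraps the same start, act, stop sequence so the browser is always closed; starting_page is an assumption taken from the SDK's quick start.

```python
# Equivalent REPL session, typed line by line:
#   >>> from nova_act import NovaAct
#   >>> n = NovaAct(starting_page="https://www.amazon.com")
#   >>> n.start()
#   >>> n.act("search for a coffee maker")
#   >>> n.stop()


def run_step_by_step(n, prompts) -> None:
    """Start the session, send prompts one at a time, and always stop the browser."""
    n.start()
    try:
        for prompt in prompts:
            n.act(prompt)
            # In a REPL you would inspect the browser here before the next command.
    finally:
        n.stop()
```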

This mode allows you to send commands one at a time using n.act() and observe the result in the browser before proceeding. Remember to call n.start() after initialization and n.stop() when you're finished to manage the browser session correctly. It's important, as the documentation highlights, not to manually interfere with the browser window while an act() command is executing, as the agent's internal state won't reflect your manual changes.

Core Concepts: Prompting Strategies and the act() Method

Now that you've seen a basic example, let's dive deeper into the core concepts you'll need to use Nova Act effectively, starting with how to structure your prompts.

As mentioned earlier, Nova Act's design encourages a specific prompting style focused on reliability. The key is to avoid high-level, ambiguous goals and instead provide clear, sequential, step-by-step instructions. Why? This approach aims to minimize the chances of the AI misinterpreting the goal or getting lost in a complex UI, which is a common failure mode for agents given less specific instructions. By breaking the task down, you make each step more deterministic.

Prompting Guidelines:

1. Be Prescriptive and Specific: Clearly state the exact action the agent should take in the current step.

  • Instead of: n.act("Reorder my last pizza")
  • Try: n.act("Click 'Account', then 'Order History', find the most recent order from 'Pizza Place', and click the 'Reorder' button")
  • Instead of: n.act("Check train times")
  • Try: n.act("In the 'From' field enter 'Downtown Station', in the 'To' field enter 'Uptown Station', select tomorrow's date, and click 'Find Trains'")

2. Decompose Complex Tasks: Break down larger workflows into a series of distinct act() calls. Each call should represent a logical step a human would take.

  • Instead of: n.act("Find the highest-rated hotel in Seattle under $200 for next weekend and book it")
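  • Try: a sequence of separate act() calls, one per logical step, such as searching for Seattle hotels for the chosen dates, applying the price limit, sorting by rating, and then booking the top result.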
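The hotel workflow could be decomposed along these lines. This is only a sketch: the function name, its parameters, and the prompt wording are hypothetical, and a real site would likely need prompts tailored to its actual filters and buttons.

```python
def book_top_rated_hotel(n, city: str, check_in: str, check_out: str, max_price: int) -> None:
    """Issue one act() call per logical step instead of a single broad goal."""
    steps = [
        f"search for hotels in {city} from {check_in} to {check_out}",
        f"set the maximum price filter to ${max_price} per night",
        "sort the results by guest rating, highest first",
        "click on the first hotel in the results",
        "click the button to book the room",
    ]
    for step in steps:
        n.act(step)
```

Keeping the steps in a list also makes it easy to log which step failed or to re-run a single step while debugging.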

Adhering to this strategy of clear, decomposed steps is presented as the most effective way to build robust and maintainable browser automations using Nova Act in its current form.

The NovaAct Class and act() Method:

The NovaAct class is your main tool. Initializing it (n = NovaAct(...)) sets up the Playwright-managed browser session.

The central piece of the interaction is the n.act() method. It takes your natural language prompt as input, sends it to the Nova Act AI model along with the current state of the web page, receives back a plan of low-level browser actions (like clicks, typing sequences, scrolls), and executes that plan in the browser.

Key parameters for the act() method that you'll often use include:

  • prompt (str): Your natural language instruction for this step.
  • max_steps (int, default: 30): A safeguard. It caps the number of individual browser interactions (clicks, key presses, etc.) the agent will attempt for a single act() call before giving up, which helps prevent the agent from getting stuck in unexpected loops.
  • schema (Dict[str, Any], optional): Used for structured data extraction. You provide a JSON schema definition (as a Python dictionary), and the agent will attempt to return information from the page matching that structure.
  • timeout (int, optional): An overall time limit in seconds for the act() call to complete.
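To make the parameters above concrete, here is a small sketch that passes max_steps and timeout as keyword arguments. The wrapper name act_with_limits and the chosen values are illustrative assumptions, not recommendations from the documentation.

```python
def act_with_limits(n, prompt: str, max_steps: int = 10, timeout: int = 60):
    """Run one step with a tighter action budget and time limit than the defaults."""
    return n.act(prompt, max_steps=max_steps, timeout=timeout)
```

A short, specific step rarely needs the full default budget, so a lower cap can surface a stuck agent sooner.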

With these fundamentals covered, let's explore the specific building blocks the SDK offers.

Essential Building Blocks for Nova Act Agents

The Nova Act SDK provides patterns and integrates with tools to handle common, practical automation needs effectively. These building blocks allow you to move beyond simple clicks and searches towards more sophisticated automation workflows.

Extracting Structured Information using Pydantic

Often, you need to extract specific data from a page, not just interact with it. Nova Act integrates with Pydantic to make this more reliable. Instead of asking for free-form text which might be inconsistent, you define a structure for the data you need.

The process involves:

  1. Defining Pydantic models for your target data structure.
  2. Generating a JSON schema from your model (.model_json_schema()).
  3. Calling n.act() with a prompt requesting the data, passing the schema via the schema argument.
  4. Checking the returned ActResult's matches_schema attribute.
  5. If True, validating and parsing the result.parsed_response using your Pydantic model (.model_validate()).

Why use Pydantic schemas? This approach strongly guides the AI to return information in the precise format you expect. It transforms potentially unstructured web content into validated, predictable Python objects, making the data extraction far more robust and easier to integrate into the rest of your application logic compared to parsing free-form text responses.

Consider the example for extracting book data:
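The five steps listed earlier can be sketched as follows. The model names, the prompt, and the helper function are illustrative assumptions; matches_schema and parsed_response are the ActResult attributes described above, and the code assumes Pydantic v2.

```python
from typing import List, Optional

from pydantic import BaseModel


class Book(BaseModel):
    title: str
    author: str


class BookList(BaseModel):
    books: List[Book]


def extract_books(n) -> Optional[BookList]:
    """Ask the agent for the books on the current page as validated objects."""
    result = n.act(
        "Return the books in the Fiction list",
        schema=BookList.model_json_schema(),
    )
    if not result.matches_schema:
        # The response did not conform to the schema, so there is nothing safe to parse.
        return None
    return BookList.model_validate(result.parsed_response)
```

Returning None on a schema mismatch lets the caller decide whether to retry, rephrase the prompt, or skip the page.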

For simple boolean checks, use the provided BOOL_SCHEMA:
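A yes/no check might look like the sketch below. The helper ask_yes_no is hypothetical, and the schema is passed in as an argument so the function can be tried without the SDK; in real use you would pass the SDK's BOOL_SCHEMA, as main() does. The starting page and question are placeholders.

```python
from typing import Optional


def ask_yes_no(n, question: str, bool_schema) -> Optional[bool]:
    """Return True or False for a yes/no question, or None if the answer was not a valid boolean."""
    result = n.act(question, schema=bool_schema)
    if not result.matches_schema:
        return None
    return bool(result.parsed_response)


def main() -> None:
    # Imported lazily so ask_yes_no can be exercised without the SDK installed.
    from nova_act import BOOL_SCHEMA, NovaAct

    with NovaAct(starting_page="https://www.amazon.com") as n:
        if ask_yes_no(n, "Am I logged in?", BOOL_SCHEMA):
            print("You are logged in")
```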

Remember to place data extraction prompts in separate act() calls from those performing actions.

Parallelizing Tasks with Multiple Sessions

While one NovaAct instance runs sequentially, you can achieve concurrency by running multiple NovaAct instances in parallel using Python's concurrent.futures.ThreadPoolExecutor. This is particularly useful for tasks like scraping data from many URLs or performing independent checks across different web interfaces simultaneously.

The documentation describes this as creating a "browser use map-reduce". The core idea is to submit multiple independent NovaAct tasks (each running in its own thread and controlling its own browser instance) to the executor and collect results as they complete.

The SDK documentation illustrates this pattern by fetching book data for multiple years in parallel. For a fuller code example, refer to the apartments_caltrain.py sample script included in the Nova Act SDK repository, which demonstrates how to set up the ThreadPoolExecutor, submit tasks, and collect results using as_completed. Remember that when running in parallel, proper handling of the user_data_dir (using the default cloning behavior) is important to ensure session isolation.
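The shape of that map-reduce pattern can be sketched with the standard library alone. The helper run_in_parallel is hypothetical: it assumes each call to task opens and closes its own NovaAct session, and that the inputs are hashable so they can be used as dictionary keys.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_in_parallel(task, inputs, max_workers: int = 4) -> dict:
    """Run task(item) for every item concurrently and map each item to its result.

    Each call to `task` is expected to open and close its own browser session.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_item = {executor.submit(task, item): item for item in inputs}
        for future in as_completed(future_to_item):
            item = future_to_item[future]
            try:
                results[item] = future.result()
            except Exception as exc:  # one failed session should not sink the rest
                print(f"task for {item!r} failed: {exc}")
                results[item] = None
    return results
```

Here max_workers bounds the number of browser sessions open at once, so it is worth keeping modest on a laptop.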

Managing Authentication, Cookies, and State

Many useful automation tasks involve sites requiring login. Since NovaAct starts with a clean slate by default (temporary user_data_dir), you need a way to handle authentication cookies and session state.

The user_data_dir parameter in the NovaAct constructor lets you specify a path to a persistent Chrome profile directory. Nova Act can then use the cookies and local storage within that profile.

The recommended way to prepare such a profile is to dedicate a directory, then use the helper module nova_act.samples.setup_chrome_user_data_dir provided with the SDK. Running python -m nova_act.samples.setup_chrome_user_data_dir --user_data_dir /path/to/profile launches a browser using that directory; you log in to your sites manually, then press Enter to save the session state.

Remember the clone_user_data_dir=True default behavior: Nova Act copies the specified profile to a temporary location for each run. This protects your original profile and is necessary for parallel execution. Keep this enabled unless you have a specific need to work directly on the original profile with only a single NovaAct instance.
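As a sketch of how these two constructor parameters fit together, the hypothetical helper below checks that the profile directory exists and builds the keyword arguments. The starting page, the prompt, and the /path/to/profile placeholder are illustrative only.

```python
from pathlib import Path


def profile_kwargs(profile_dir: str, clone: bool = True) -> dict:
    """Build NovaAct constructor arguments for a persistent Chrome profile."""
    path = Path(profile_dir).expanduser()
    if not path.is_dir():
        raise FileNotFoundError(f"profile directory does not exist: {path}")
    return {"user_data_dir": str(path), "clone_user_data_dir": clone}


def main() -> None:
    # Imported lazily so profile_kwargs can be exercised without the SDK installed.
    from nova_act import NovaAct

    with NovaAct(starting_page="https://www.amazon.com", **profile_kwargs("/path/to/profile")) as n:
        n.act("open my order history")
```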

Secure Handling of Sensitive Information

This is critically important. Never include passwords, API keys, credit card numbers, or other sensitive data directly in the prompt string you pass to n.act().

Prompts and interaction data (potentially including screenshots) might be collected by Amazon during the research preview for model improvement. Putting secrets in prompts creates an unnecessary security risk.

The secure method involves leveraging Playwright's direct interaction capabilities, which bypass the AI model for sensitive input:

  1. Use n.act() to navigate and place focus on the sensitive input field (e.g., password box).
  2. Retrieve the sensitive data securely within your Python script (e.g., from environment variables, a secrets manager, or using getpass.getpass() for interactive input).
  3. Access the underlying Playwright Page object via n.page and use n.page.keyboard.type() to directly input the sensitive string.

Here's the recommended pattern:
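The three steps above might be combined as in the sketch below. The helper sign_in, the SITE_PASSWORD environment variable, the login URL, and the prompt wording are all hypothetical; n.page.keyboard.type() is Playwright's own keyboard API.

```python
import os


def sign_in(n, username: str, password: str) -> None:
    """Log in without ever placing the password in a prompt."""
    # The model sees the username but only moves focus to the password field.
    n.act(f"enter username {username} and click on the password field")
    # Playwright types the secret directly; it is never sent as prompt text.
    n.page.keyboard.type(password)
    n.act("sign in")


def main() -> None:
    # Imported lazily so sign_in can be exercised without the SDK installed.
    from getpass import getpass

    from nova_act import NovaAct

    # SITE_PASSWORD is a placeholder name; fall back to an interactive prompt.
    password = os.environ.get("SITE_PASSWORD") or getpass("Password: ")
    with NovaAct(starting_page="https://example.com/login") as n:
        sign_in(n, "janedoe", password)
```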

Security Caveat Reminder: Be aware that if sensitive information typed via Playwright is visibly displayed on the screen when a subsequent n.act() call runs, it might still be captured in screenshots collected during the preview.

If the agent has trouble placing focus on the right field, try this workaround: call n.act("enter '' in the password field"), which asks it to enter an empty string, and then call n.page.keyboard.type(password).

Interacting with Captchas

Nova Act does not solve CAPTCHAs. Workflows encountering them require human assistance. The suggested pattern is: detect the CAPTCHA, pause the script, prompt the user to solve it manually in the controlled browser, and then resume.
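The detect, pause, resume pattern could be sketched like this. The helper pause_if_captcha is hypothetical, and the schema and the wait function are passed in so the logic can be tried without the SDK or a terminal; in real use you would pass the SDK's BOOL_SCHEMA and leave wait_for_human as input.

```python
def pause_if_captcha(n, bool_schema, wait_for_human=input) -> bool:
    """Ask the agent whether a CAPTCHA is visible and, if so, block until a human solves it."""
    result = n.act("Is there a captcha on the screen?", schema=bool_schema)
    if result.matches_schema and result.parsed_response:
        wait_for_human("Please solve the captcha in the browser, then press Enter to continue: ")
        return True
    return False
```

Calling this right after navigation steps that tend to trigger a challenge keeps the rest of the workflow unattended.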

Common Web Actions: Search and Downloads

Nova Act handles standard actions like searching and downloading files:

Searching: Provide instructions to find the search field, enter text, and submit.

If needed, be more specific about submission:
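Both variants can be expressed with one small helper. The function search and the exact prompt wording are illustrative assumptions; the point is only that the second form spells out how to submit the query.

```python
def search(n, query: str, press_enter: bool = False) -> None:
    """Send a search prompt, optionally spelling out how to submit it."""
    prompt = f"search for {query}"
    if press_enter:
        # Be explicit when the agent types the query but does not submit it.
        prompt += ". type enter to initiate the search."
    n.act(prompt)
```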

File Downloads: Use Playwright's expect_download() context manager combined with the act() call that triggers the download.
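A sketch of that combination follows. The helper download_file and its arguments are hypothetical; expect_download(), the value property, and save_as() belong to Playwright's synchronous API, reached through n.page.

```python
def download_file(n, trigger_prompt: str, destination: str) -> str:
    """Trigger a download through the agent and save it to `destination`."""
    # expect_download() must wrap the action that starts the download.
    with n.page.expect_download() as download_info:
        n.act(trigger_prompt)
    download = download_info.value  # Playwright Download object
    download.save_as(destination)
    return destination
```

For example, download_file(n, "click on the download button", "report.pdf") would save the file next to the script.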

Other Configuration Notes

A few other configuration points mentioned in the documentation:

  • Date Picking: When prompting for date selection, using absolute dates (e.g., "select April 10 2025") is suggested for better reliability.
  • User Agent: The browser's user agent string can be customized during NovaAct initialization using the user_agent="MyCustomAgent/1.0" parameter.
  • Logging: Console log verbosity is controlled via the NOVA_ACT_LOG_LEVEL environment variable, using standard Python logging level integers (e.g., 20 for INFO, 10 for DEBUG).
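Two of these notes lend themselves to tiny helpers, sketched below. Both function names are hypothetical: the first turns a date object into an absolute-date prompt, and the second sets NOVA_ACT_LOG_LEVEL from Python's logging constants, which should happen before the NovaAct instance is created.

```python
import logging
import os
from datetime import date


def absolute_date_prompt(d: date) -> str:
    """Build a date-selection prompt that names the date explicitly, not relatively."""
    return f"select {d:%B} {d.day} {d.year}"


def set_nova_act_log_level(level: int = logging.INFO) -> None:
    """Store a standard logging level integer in NOVA_ACT_LOG_LEVEL."""
    os.environ["NOVA_ACT_LOG_LEVEL"] = str(level)
```

Computing the date in Python and passing the result to n.act() avoids asking the model to work out what "tomorrow" means.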

Debugging and Observing Agent Behavior

When your automation doesn't behave as expected, you need tools to see what went wrong. Nova Act provides two useful mechanisms:

  • Act Traces: For each act() call, an HTML trace file is generated. The location of this file is printed in the console logs. Opening this file provides a step-by-step visual replay of that specific act() command, showing screenshots and identified elements. This is extremely helpful for pinpointing where the agent deviated from your expectation. The logs_directory parameter in the NovaAct constructor controls where these traces are saved.
  • Session Video Recording: For a complete overview, you can record the entire browser session as a video file. Enable this by setting record_video=True and providing a logs_directory when initializing NovaAct. This allows you to watch the full workflow, which can reveal issues spanning multiple act() calls.
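To keep traces and recordings together, the constructor arguments named above could be assembled as in this sketch. The helper debug_kwargs is hypothetical; it creates the directory if needed and returns keyword arguments intended to be passed as NovaAct(..., **debug_kwargs("logs")).

```python
from pathlib import Path


def debug_kwargs(logs_directory: str, record_video: bool = True) -> dict:
    """Build NovaAct constructor arguments that keep traces and video in one directory."""
    path = Path(logs_directory)
    path.mkdir(parents=True, exist_ok=True)
    return {"logs_directory": str(path), "record_video": record_video}
```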

Limitations and Important Considerations

Using Nova Act effectively requires acknowledging its current status as experimental research preview software. This comes with important caveats and responsibilities:

Keep these Known Limitations in mind:

  • It operates only within the web browser; it cannot interact with native desktop applications.
  • It performs best with specific, step-by-step prompts and can be unreliable with high-level, ambiguous goals.
  • Interactions with elements hidden until mouse hover may not work correctly.
  • It cannot control the browser's own UI (menus, address bar, permission dialogs).

Pay close attention to the Important Considerations (Disclosures) provided by Amazon:

  • Potential for Errors: The agent might make mistakes. You are ultimately responsible for monitoring its actions and ensuring they align with expected behavior and comply with Amazon's Acceptable Use Policy.
  • Data Collection for Improvement: Interaction data, including your prompts and screenshots taken during agent operation, is collected by Amazon to develop and improve the service. Contact nova-act@amazon.com for data deletion requests.
  • API Key Security: Safeguard your API key.
  • Sensitive Information Handling: Follow the recommended procedures (using Playwright directly) for handling sensitive data and be aware of the potential for visible data capture in screenshots during the preview.

Amazon encourages providing feedback during this preview phase. You can report bugs, suggest improvements, or share your experiences by emailing nova-act@amazon.com. Including the session ID from logs and relevant script details is helpful for bug reports.

Conclusion: Taking the First Steps with Browser Agents

Reliably automating tasks through web interfaces remains difficult, especially when APIs are inadequate or missing, and traditional UI automation scripts are fragile. Amazon Nova Act offers an early look at an alternative approach: leveraging AI trained for UI understanding, accessible via a Python SDK designed for developers.

Its core idea, focusing on reliability through developer-guided, composable commands rather than autonomous interpretation of high-level goals, presents a different strategy given the current state of agentic computer use technology.

However, it's important to remember that Nova Act is a research preview. It's experimental, it comes with known limitations, and it requires careful handling, particularly regarding security and data privacy (remember the disclosures about data collection). It is not yet a general-purpose autonomous agent but a specialized toolkit for engineers building targeted browser automation solutions.

Amazon hints at a longer-term vision involving more advanced training techniques like reinforcement learning to enable agents capable of more complex tasks, but it might take several months, if not a couple of years, for such agents to reach a solid and reliable state. For now, Nova Act provides a tangible first step in that direction.

Guille Ojeda

Guille Ojeda is a Software Architect at Caylent and a content creator. He has published 2 books, over 100 blogs, and writes a free newsletter called Simple AWS, with over 45,000 subscribers. Guille has been a developer, tech lead, cloud engineer, cloud architect, and AWS Authorized Instructor and has worked with startups, SMBs and big corporations. Now, Guille is focused on sharing that experience with others.
