Skip to content

Android Automation for LLM Agents

LLM agents need a different Android automation interface than traditional test suites.

A test suite wants assertions, reports, fixtures, and integration with a QA system.

An agent wants a tight loop:

  1. Read the current screen.
  2. Decide the next action.
  3. Tap, type, or swipe.
  4. Wait for the next state.
  5. Repeat.

The quality of that loop depends on two things: what you show the model, and how quickly you can execute the action it chooses.

Quick answer

For LLM-driven Android automation, use a compact text UI first:

hs ui

Example:

fill  EditText  "Email"     #email     540,540
fill  EditText  "Password"  #password  540,640  [password]
tap   Button    "Continue"  #continue  540,860

Then let the model choose an action by label:

hs tap "Continue"
hs wait "Dashboard"

Add a screenshot only when visual context matters:

hs see --size 768 /tmp/screen.jpg

Why screenshots alone are not enough

Screenshots are intuitive. They look like what a human sees.

But screenshots are often a poor default for every agent step:

  • They are large.
  • They are slower to capture and send.
  • They force the model to infer text and controls visually.
  • They hide accessible labels that Android already knows.
  • They make actions harder to audit.

Screenshots still matter for custom-rendered UI, visual layout, maps, charts, games, and canvas-heavy screens. They should be available. They should not be the only interface.

Why raw XML is also not enough

The other common answer is uiautomator dump.

It is faithful, but verbose. Android UI XML includes layout containers, repeated default attributes, full class names, bounds rectangles, and many nodes the agent cannot act on.

An LLM can read that XML, but it pays for every token.

For one Settings screen, a raw UIAutomator dump measured 5,762 tokens. A compact Handsets action table for the same screen measured 729 tokens.

Same decision. Much smaller context.

What an Android agent actually needs

For most UI actions, the model needs:

  • The visible label.
  • The action type: tap, fill, scroll, wait.
  • The control type: Button, EditText, Switch.
  • A selector or coordinate for execution.
  • Current state hints such as visible, checked, password.

It usually does not need:

  • Empty layout ancestors.
  • Boolean attributes set to false.
  • Full package names on every node.
  • Four-number bounds rectangles.
  • XML closing tags.

The useful output is closer to an action menu than a DOM tree.

A practical agent loop

Pseudo-code:

from handsets import Session

with Session() as d:
    while True:
        ui = d.ui_text()
        action = model.choose_action(ui)

        if action.kind == "tap":
            d.tap(action.label, visible=True)
        elif action.kind == "fill":
            d.fill(action.selector, action.text)
        elif action.kind == "wait":
            d.wait(text=action.text, timeout="15s")

The same loop can be implemented with subprocess calls if your agent runtime is not Python:

hs ui
hs tap "Continue"
hs wait "Dashboard"

Make the model choose labels, not pixels

Pixel actions are tempting:

tap 540,860

But labels are easier to inspect and retry:

tap "Continue"

If a run fails, you can read the transcript and understand the intended action. If a screen changes, the selector can still find the button.

Use coordinates only when the UI has no accessible label or when visual placement is the actual target.

Failure handling

Good agent tools should fail with useful errors:

  • NOT_FOUND: no matching node exists.
  • AMBIGUOUS: more than one matching node exists.
  • TIMEOUT: the expected next state did not appear.
  • SECURE_WINDOW: Android blocked screenshot capture.

Those errors are better than "the click did nothing." They let the agent decide whether to retry, ask for a screenshot, scroll, or escalate to a human.

What to optimize first

Before changing prompts, inspect the tool output.

Ask:

  • Is the model reading fields it cannot act on?
  • Are default values repeated on every row?
  • Are you sending screenshots when labels would be enough?
  • Are you asking the model to compute coordinates that the tool can compute?
  • Are you sending the entire UI tree when the agent only needs visible actions?

Often the easiest win is not a better prompt. It is a smaller, more actionable observation.

FAQ

Can LLM agents control Android without root?

Yes. For normal UI automation, an agent can control Android through adb and the shell user. Root is not required for tapping, typing, swiping, reading visible UI, or waiting for text.

Should an Android agent use screenshots or UI dumps?

Use both, but not equally. Start with a compact text UI dump for actions. Add screenshots when visual context matters or when the app does not expose accessible labels.

Is Appium good for LLM agents?

Appium can work, especially if you already use it. But for Android-only agents, it can be heavier than necessary because every action goes through WebDriver and the UI source is usually verbose.

Why does token count matter?

Long mobile trajectories repeat screen observations many times. Reducing each observation from thousands of tokens to hundreds can lower cost, latency, and prompt noise.