Tapping Android in 5 ms (vs 100 ms in Appium, 30 ms in uiautomator2)¶
tap("Continue") from Python, including the text-to-coordinates lookup
on the live accessibility tree, runs in 2–7 ms on a recent Pixel.
For reference, the same call through uiautomator2 is 30–100 ms; through
Appium it's 100–500 ms. Raw adb shell input tap X Y is 40–80 ms — and
it only takes coordinates, not text.
Why does this matter, and how does it get that low? Both questions have the same answer, and they share an architecture diagram.
Why a millisecond budget matters¶
The inner loop of any LLM-driven Android workflow is roughly four steps:
- Read the screen (UI dump or screenshot).
- Ask the model what to tap.
- Tap it.
- Wait for the next screen.
If the model thinks for 800 ms and the screen settles in 400 ms, the automation framework is the part you stop thinking about — until you write a script that taps the same button 200 times. Mine did. The model and the device together took four minutes. Appium added twenty seconds on top — a 9 % surcharge for typing a thin layer of HTTP requests in front of UI Automator.
That overhead is fine for a regression suite where every call already costs 800 ms of model time. It's brutal for any tap-heavy workload where the framework's per-call cost dominates the wall-clock — which is exactly the shape of an agent or an RPA flow.
I looked at where the time was going. This is what I found.
How uiautomator2 and Appium add 50–500 ms per call¶
Both uiautomator2 and Appium ultimately reach the same Android
framework — UIAutomator running under an instrumentation. The path
looks like this:
Python / WebDriver client
│
▼ HTTP / WebDriver
atx-agent (u2)
appium server (Appium)
│
▼ UIAutomator instrumentation
AccessibilityNodeInfo
│
▼ binder
the device's WindowManager
Each call serialises a JSON command over a TCP socket, JSON-decodes on the other end, dispatches into instrumentation, dumps (or queries) the a11y tree, and crosses two process boundaries on the device. The HTTP round-trip alone — even on localhost — is rarely under 5 ms. Each instrumentation hop adds variable overhead because UI Automator is itself running fairly complex framework code.
Measured on a recent Pixel, single-call latencies cluster like this:
adb shell input tap X Y 40–80 ms (but X Y, not text)
uiautomator2 d(text=…).click() 30–100 ms
Appium driver.find_element(…).click() 100–500 ms
The Appium tail comes from the WebDriver session-management layer
between the local server and the device. uiautomator2 skips WebDriver
and talks to atx-agent over plain HTTP/JSON, which is why it's a band
faster. But both still pay for:
- A fresh HTTP request per call.
- A full UI Automator dump plus filter for every text lookup.
- A round-trip just to learn that "the screen finished settling."
These are not bugs in those tools. They're the natural consequence of choosing HTTP/WebDriver as the wire. The bill comes due when you make thousands of calls.
What I wanted instead¶
Two specific requirements made the existing tools awkward for the workload I was running:
An LLM-friendly UI dump. The default dump_hierarchy() returns
~20 KB of XML for a typical screen. That's hundreds of tokens of layout
noise wrapping the four labels the agent actually cares about. For a
loop that dumps the screen between every step, every byte the model
doesn't need is a byte it has to read anyway.
A latency budget under 10 ms per call. Not for sport — because at 100 ms per call, an agent that issues four calls between every model decision adds 400 ms of pure wait time per step. That's the difference between a 30-second flow and a 90-second flow.
What I built is called Handsets. The agent-loop surface is small:
$ hs use # connect, start the on-device daemon
$ hs ui # flat table of tappable nodes
fill EditText "Email" #email 540,540
fill EditText "Password" #password 540,640 [password]
tap Button "Continue" #continue 540,860
$ hs tap "Continue" # text-lookup tap
tapped "Continue" → ok
Each row reads like the next CLI call: the verb is what the agent would
issue, the label is in the quotes, the coordinates trail. An LLM picks a
label and you hand it back to hs tap. The UI dump for the screen above
is 3.3 KB / 729 tokens — versus Appium's page_source on the same
screen at 22.3 KB / 5 762 tokens. About 8× fewer tokens, same agent
decisions. (Full bench: An Android UI dump for LLMs.)
Where the time went¶
I took three things off the critical path.
1. The HTTP/WebDriver hop is gone. Handsets uses a length-prefixed
binary protocol over adb forward:
A tap x=540 y=860 is one frame in, one frame out. No JSON parser, no
HTTP headers, no session id. On a warm socket the round-trip is 2–3 ms.
2. The JVM stays warm. Most automation frameworks spawn or attach an
instrumentation per session. Handsets pushes one small jar to the device
(/data/local/tmp/hs.jar, a few hundred KB) and runs it under
app_process with shell UID and hidden-API restrictions lifted. That
process stays alive between calls. The first call is ~150 ms while the
daemon spins up; every subsequent call is one binary round-trip.
3. Read-only state is mirrored to the host. "What's the top
activity?", "What's the battery level?", "What's the screen size?" —
those don't need a device round-trip at all. A small daemon on the host
subscribes to a push stream of state changes and atomically rewrites
~/.handsets/state-<port>.json on each event. Reads are
microseconds, not milliseconds.
Here is the resulting per-call comparison versus raw adb shell — the
baseline most people are actually competing against:
hs show 0.21 µs push-mirror local read
hs prop KEY 1.6 ms vs adb shell getprop 46 ms → 29×
hs show top 2.0 ms vs dumpsys window | grep 86 ms → 43×
hs see x.jpg 7.7 ms vs adb exec-out screencap 705 ms → 92×
Reproduce locally with hs bench -n 50; methodology in
docs/benchmark.md.
The honest tradeoff¶
uiautomator2 and Appium ship things Handsets doesn't:
- Recorders that watch a human and emit code.
- pytest plugins, HTML reports, screenshot diffing.
- Cross-platform test orchestration (Appium drives iOS too).
- A huge community and a decade of Stack Overflow answers.
If you're writing a regression suite for a corporate test team, those things matter more than 50 ms per call. If you're driving an LLM agent loop, scripting RPA flows from bash, or running a tap-heavy workload where the framework's per-call cost dominates wall-clock time, the latency floor of UI Automator wrappers may be higher than your project can afford.
The code is on GitHub under
MIT. pip install handsets for the Python SDK, curl … | bash for the
CLI. macOS and Linux hosts. No root, no installed app on the phone, just
adb on $PATH.
I'd love to hear what your bench shows.