From "Play My Workout Playlist" to a Real Android Tap Plan

2026-06-16 · Tech

Built for the Build Small Hackathon Backyard AI track, sponsored by Modal. Also published on Hugging Face.

Introduction

How does a 3B-parameter model turn messy phone requests into replayable UI automation without shipping your life to a cloud API?

The Problem with Phone Automation Today

You say: “text mom on whatsapp i’m on my way.”

A voice assistant might reply with a web search, a generic “I can’t do that,” or a cloud API call that only works if WhatsApp cooperates. What you actually want is simpler and more direct: open WhatsApp, find Mom, type the message, send it.

That gap between natural language and deterministic UI actions on a real device is what Android Skill Router is built to close.

Why cloud agents fall short for personal automation

Most phone automation today follows one of three paths:

Approach	Strength	Weakness
Cloud voice assistants	Understand broad language	Can’t tap your apps; privacy concerns; needs network
Macro/script tools	Deterministic replay	Require exact trigger phrases; no natural language
Vision-based agents	Flexible	Slow, expensive, hallucinate UI coordinates

Android Skill Router takes a fourth path: a small local classifier that understands messy language, paired with pre-recorded UI trajectories that an accessibility runtime replays exactly.

The core insight:

You don’t need a 70B frontier model to do the tapping. You need a 3B model to understand what you mean, then hand off to a fixed replay plan.

"play my workout playlist"
    → spotify_play_playlist
    → trajectories/spotify_play_playlist.json
    → Pocket Automator replays taps on device

This is the classifier layer of the Pocket Automator stack: record once on your phone, route forever with a tiny local model.

The Architecture: Classify, Route, Replay

The system has three layers, each deliberately small and composable.

Figure: Natural language prompts flow through a fine-tuned classifier into skill routing and trajectory replay.

Layer 1: Intent classifier

A fine-tuned Qwen2.5-3B-Instruct model receives a user prompt and returns structured JSON:

{
  "skill": "whatsapp_send_message",
  "parameters": {
    "contact": "mom",
    "message": "i'm on my way"
  }
}

The model handles slang, typos, incomplete phrasing, and app disambiguation (WhatsApp vs Gmail vs Slack). It never invents UI steps. It only picks from 15 known skills and extracts parameter slots.

Layer 2: Skill router

A deterministic lookup table maps skill names to trajectory files:

SKILL_TO_TRAJECTORY = {
    "whatsapp_send_message": "trajectories/whatsapp_send_message.json",
    "spotify_play_playlist": "trajectories/spotify_play_playlist.json",
    # ... 15 skills total
}

If the model returns whatsapp_send_message, the router loads trajectories/whatsapp_send_message.json. No guessing, no hallucination.

The router also includes defensive parsing: skill aliases (send_whatsapp → whatsapp_send_message), JSON extraction from noisy model output, and keyword fallbacks when the model returns an unknown label. If the skill still can’t be resolved after those steps, or the trajectory file is missing, the system fails loudly with a clear error.
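The lookup and defensive parsing described above might look roughly like the sketch below. It is not the project's actual src/skill_router.py: the alias and keyword tables, and the function names extract_json, resolve_skill, and route, are hypothetical and only illustrate the order of the checks.

```python
import json
import re

# Subset of the skill table shown above; the real mapping covers all 15 skills.
SKILL_TO_TRAJECTORY = {
    "whatsapp_send_message": "trajectories/whatsapp_send_message.json",
    "spotify_play_playlist": "trajectories/spotify_play_playlist.json",
}
# Assumed alias and keyword tables, for illustration only.
SKILL_ALIASES = {"send_whatsapp": "whatsapp_send_message"}
KEYWORD_FALLBACKS = {
    "whatsapp": "whatsapp_send_message",
    "playlist": "spotify_play_playlist",
}


def extract_json(raw: str) -> dict:
    """Pull the outermost {...} span out of noisy model output and parse it."""
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match is None:
        return {}
    try:
        parsed = json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}


def resolve_skill(raw_output: str, prompt: str) -> str:
    """Return a known skill name, or raise KeyError if nothing matches."""
    label = str(extract_json(raw_output).get("skill", "")).strip().lower()
    label = SKILL_ALIASES.get(label, label)
    if label in SKILL_TO_TRAJECTORY:
        return label
    # Keyword fallback: scan the original prompt for a known hint.
    for keyword, skill in KEYWORD_FALLBACKS.items():
        if keyword in prompt.lower():
            return skill
    raise KeyError(f"Unknown skill {label!r} for prompt {prompt!r}")


def route(raw_output: str, prompt: str) -> str:
    """Map raw model output to a trajectory file path."""
    return SKILL_TO_TRAJECTORY[resolve_skill(raw_output, prompt)]
```

Raising instead of returning a default keeps the "fail loudly" behaviour: a prompt that matches no label, alias, or keyword never reaches the replay engine.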

Layer 3: Trajectory replay

Each trajectory is a JSON file exported from Pocket Automator, an Android accessibility recorder. It contains:

A task description (the original human intent)
The target app package (com.whatsapp, com.spotify.music, etc.)
A sequence of steps, each with a full UI tree snapshot and an action

Example step from a WhatsApp trajectory:

{
  "timestamp": 4024,
  "screen": { /* full accessibility tree */ },
  "action": {
    "type": "click",
    "resourceId": "com.motorola.launcher3:id/icon",
    "contentDescription": "WhatsApp",
    "path": [0, 0, 0, 0, 2, 0, 0]
  },
  "packageName": "com.motorola.launcher3"
}

Action types include click, set_text, and scroll gestures. Pocket Automator resolves nodes at replay time using resource IDs, content descriptions, and tree paths, so minor UI changes don’t break the flow.
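To make the replay-time resolution concrete, here is a minimal sketch over a simplified tree. Pocket Automator's real resolver runs on Android and is not shown in this post; the nested "children" layout, the function names, and the order of the fallbacks are all assumptions made for illustration.

```python
from typing import Iterator, Optional


def walk(node: dict) -> Iterator[dict]:
    """Yield every node in the tree, depth-first."""
    yield node
    for child in node.get("children", []):
        yield from walk(child)


def follow_path(root: dict, path: list) -> Optional[dict]:
    """Follow child indices from the root; None if the path no longer exists."""
    node = root
    for index in path:
        children = node.get("children", [])
        if index < 0 or index >= len(children):
            return None
        node = children[index]
    return node


def resolve_node(root: dict, action: dict) -> Optional[dict]:
    """Find the node a recorded action refers to on the current screen."""
    rid = action.get("resourceId")
    desc = action.get("contentDescription")
    nodes = list(walk(root))
    # 1. Strongest signal: resource ID and content description both match.
    if rid and desc:
        for node in nodes:
            if node.get("resourceId") == rid and node.get("contentDescription") == desc:
                return node
    # 2. A resource ID that is unique on the current screen.
    if rid:
        matches = [n for n in nodes if n.get("resourceId") == rid]
        if len(matches) == 1:
            return matches[0]
    # 3. Content description alone.
    if desc:
        for node in nodes:
            if node.get("contentDescription") == desc:
                return node
    # 4. Last resort: the recorded tree path.
    path = action.get("path")
    return follow_path(root, path) if path else None
```

Launcher icons are a good example of why the description matters: every icon can share one resource ID, so the ID alone does not identify WhatsApp, and the recorded path breaks as soon as icons are rearranged.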

The separation of concerns

Component	Responsibility	Can fail?
Language model	Understand intent	Gracefully; fallbacks exist
Skill router	Map intent → file	Never; deterministic lookup
Trajectory	Store ground-truth UI steps	Never; fixed recording
Pocket Automator	Execute on device	Only if UI changed drastically

This is the design bet: language understanding is fuzzy; automation must be exact.

Recording UI Flows on Android

Every skill starts on hardware you own. No synthetic UI trees, no emulated taps: only real recordings from a real Motorola device.

Pocket Automator: the Android recorder

Pocket Automator is an Android accessibility app that:

Records taps, text input, and scrolls while you use any app
Captures the full accessibility tree at each step (node IDs, bounds, class names, text)
Exports recordings as JSON for training pipelines
Replays saved recordings with smart node resolution

Requirements: Android 10+ (API 29), accessibility service enabled, overlay permission.

The recording workflow

Open Pocket Automator and tap Record
Name your task (e.g., “message hi to biraj on WhatsApp”)
Perform the task naturally on your phone
Stop recording from the floating overlay
Export the JSON to your development machine
Place it in trajectories/ and run scripts/generate_skill_dataset.py

The script reads each trajectory’s task and app fields, derives a snake_case skill name, and writes data/skills.jsonl:

{"skill": "whatsapp_send_message", "task": "message hi to biraj on WhatsApp"}
{"skill": "spotify_play_playlist", "task": "play liked songs playlist from Spotify"}
{"skill": "create_alarm", "task": "create alarm for 7 am tomorrow"}

Skill name derivation uses the app package and task keywords: WhatsApp tasks become whatsapp_send_message, Spotify pause tasks become spotify_pause, and so on.
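A rule table is one simple way to implement that derivation. The sketch below is not the code in scripts/generate_skill_dataset.py; the RULES table, the derive_skill_name function, and the snake_case fallback are hypothetical.

```python
import re

# Hypothetical rules: (package substring, task keyword, skill name).
# Rules are checked in order and the first match wins; an empty keyword
# matches any task for that package.
RULES = [
    ("com.whatsapp", "", "whatsapp_send_message"),
    ("com.spotify.music", "pause", "spotify_pause"),
    ("com.spotify.music", "playlist", "spotify_play_playlist"),
    ("com.spotify.music", "", "spotify_search_play"),
]


def derive_skill_name(app_package: str, task: str) -> str:
    """Derive a snake_case skill name from the app package and task text."""
    task_lower = task.lower()
    for package, keyword, skill in RULES:
        if package in app_package and keyword in task_lower:
            return skill
    # Assumed fallback: snake_case the first three words of the task.
    words = re.findall(r"[a-z0-9]+", task_lower)
    return "_".join(words[:3])
```

Putting the more specific keyword rules before the catch-all rule for each package is what keeps "pause" tasks from being swallowed by the generic Spotify skill.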

The 15 skills

Skill	App	Example task
`create_alarm`	Clock	Set alarm for 7 am tomorrow
`calendar_create_event`	Calendar	Create event tomorrow 4 pm
`wifi_enable`	Settings	Enable Wi-Fi
`bluetooth_enable`	Settings	Turn on Bluetooth
`whatsapp_send_message`	WhatsApp	Message a contact
`gmail_send_email`	Gmail	Send email to recipient
`slack_open_channel`	Slack	Open a channel
`spotify_play_playlist`	Spotify	Play a playlist
`spotify_search_play`	Spotify	Search and play music
`spotify_pause`	Spotify	Pause playback
`uber_request_ride`	Uber	Request ride to destination
`youtube_search`	YouTube	Search for videos
`linkedin_search_person`	LinkedIn	Search for a person
`contacts_search`	Contacts	Find a contact
`camera_take_photo`	Camera	Take a picture

Each trajectory file is large (often 5,000+ lines) because it includes the full accessibility tree at every step. That’s intentional: replay engines need rich node metadata to resolve targets reliably.

Why real recordings matter

Synthetic UI automation data is brittle. Real recordings capture:

Launcher states: how your home screen looks with your app icons
Keyboard transitions: when the soft keyboard appears during text input
Scroll positions: where list items sit after scrolling
Timing: natural pauses between actions

These details can’t be generated. They’re the ground truth that makes replay work on your specific device.

Training a Tiny Classifier

The model is Qwen2.5-3B-Instruct: deliberately under 4B parameters for the Build Small Hackathon’s Tiny Titan achievement.

Why 3B is enough

The classification task is narrow:

15 skill labels (not open-ended tool use)
Structured JSON output (not free-form text)
Parameter slot-filling (contact, message, time, not reasoning chains)

A 3B instruct model already understands apps, contacts, times, and natural language phrasing. Fine-tuning teaches it your skill taxonomy and output format, not general Android knowledge.

Training configuration

Training runs on Modal GPUs via modal_apps/train_modal.py:

Hyperparameter	Value
Base model	Qwen2.5-3B-Instruct
Method	4-bit QLoRA + SFT (Unsloth)
LoRA rank	32
LoRA alpha	32
Target modules	q/k/v/o_proj, gate/up/down_proj
Epochs	5
Batch size	8
Learning rate	2e-4
Optimizer	AdamW 8-bit
Max sequence length	2048
GPU	Modal A10G

The training pipeline:

Upload data/train_intent.jsonl to a Modal Volume
Load base model in 4-bit quantization
Apply QLoRA adapters to attention and MLP layers
Format examples with Qwen 2.5 chat template
Train with TRL’s SFTTrainer
Save LoRA adapter to /model/adapter
Save merged 16-bit model to /model/merged

python scripts/generate_intent_dataset.py
modal run modal_apps/train_modal.py --dataset train_intent.jsonl
modal volume get android-dataset-model adapter ./trained_model/adapter

V1 → V2: from labels to intents

V1 (skill classification only) mapped prompts to a skill name:

"play my workout playlist" → {"skill": "spotify_play_playlist"}

Training data: ~510 examples in data/train.jsonl (~30 variations per skill, plus the Gradio demo prompts).

V2 (structured intent extraction) adds parameter slot-filling:

"text mom on whatsapp i'm on my way"
→ {"skill": "whatsapp_send_message", "parameters": {"contact": "mom", "message": "i'm on my way"}}

Training data: ~15,000 examples in data/train_intent.jsonl (~1,000 per skill).

Parameter schemas

Each skill declares its parameters in data/skill_schemas.json:

{
  "whatsapp_send_message": {
    "description": "Send a WhatsApp message to a contact",
    "parameters": {
      "contact": {"type": "string", "required": true},
      "message": {"type": "string", "required": true}
    }
  },
  "create_alarm": {
    "description": "Set an alarm at a specific time",
    "parameters": {
      "time": {"type": "string", "required": true},
      "day": {"type": "string", "required": false}
    }
  },
  "wifi_enable": {
    "description": "Enable Wi-Fi on the device",
    "parameters": {}
  }
}

Skills with no variable inputs (wifi_enable, bluetooth_enable, spotify_pause, camera_take_photo) return empty parameter objects.
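A schema in this shape also makes it easy to sanity-check a predicted intent before routing it. The checker below is a sketch, not part of the project: validate_intent is a hypothetical helper, and SCHEMAS copies only the three entries shown above.

```python
# Three entries copied from the schema excerpt above.
SCHEMAS = {
    "whatsapp_send_message": {
        "parameters": {
            "contact": {"type": "string", "required": True},
            "message": {"type": "string", "required": True},
        }
    },
    "create_alarm": {
        "parameters": {
            "time": {"type": "string", "required": True},
            "day": {"type": "string", "required": False},
        }
    },
    "wifi_enable": {"parameters": {}},
}


def validate_intent(intent: dict, schemas: dict) -> list:
    """Return a list of problems; an empty list means the intent fits its schema."""
    skill = intent.get("skill")
    if skill not in schemas:
        return [f"unknown skill: {skill}"]
    declared = schemas[skill]["parameters"]
    params = intent.get("parameters") or {}
    problems = []
    for name, spec in declared.items():
        value = params.get(name)
        # Required slots must be present and non-blank.
        if spec.get("required") and (value is None or not str(value).strip()):
            problems.append(f"missing required parameter: {name}")
    for name in params:
        if name not in declared:
            problems.append(f"unexpected parameter: {name}")
    return problems
```

For example, an alarm intent with a time but no day passes, because day is declared optional.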

The system prompt

The model receives a tight, deterministic instruction:

You extract structured Android automation intents from natural language.
Reply with JSON only: {"skill": "<skill_name>", "parameters": {<extracted_fields>}}.
Pick exactly one skill. Extract all relevant parameters mentioned in the request
(contact names, messages, times, destinations, channel names, search queries, etc.).
Use an empty object for parameters when the skill needs none.
Use the app or action named in the request (contacts, Gmail, Slack, YouTube, etc.)
to pick the correct skill.

No chain-of-thought. No tool descriptions. No examples in the prompt. Just JSON.

Training example format

Each row in train_intent.jsonl is a three-message chat (system, user, assistant):

{
  "messages": [
    {"role": "system", "content": "You extract structured Android automation intents..."},
    {"role": "user", "content": "whatsapp message Vikram see you tonight"},
    {"role": "assistant", "content": "{\"skill\":\"whatsapp_send_message\",\"parameters\":{\"contact\":\"Vikram\",\"message\":\"see you tonight\"}}"}
  ]
}

The assistant always responds with compact JSON, no markdown fences, no explanation.
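Building one such row takes only the standard library. In this sketch, make_row is a hypothetical helper name and the system prompt is passed in rather than hard-coded; the compact separators reproduce the whitespace-free assistant JSON shown above.

```python
import json


def make_row(system_prompt: str, user_prompt: str, skill: str, parameters: dict) -> dict:
    """Build one chat-format SFT row with a compact JSON assistant reply."""
    assistant = json.dumps(
        {"skill": skill, "parameters": parameters},
        separators=(",", ":"),  # no spaces after commas or colons
        ensure_ascii=False,     # keep non-ASCII names readable
    )
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
            {"role": "assistant", "content": assistant},
        ]
    }


def to_jsonl_line(row: dict) -> str:
    """Serialize a row as one line of a .jsonl file."""
    return json.dumps(row, ensure_ascii=False)
```

Each line written this way parses back into the same three-message structure, which is what a chat template expects as input.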

Synthetic Data at Scale

Fifteen real trajectories can’t train a robust classifier alone. The project generates ~15,000 synthetic SFT examples locally via scripts/generate_intent_dataset.py.

How data generation works

The generator follows a four-step pipeline:

skill_schemas.json + skills.jsonl
        ↓
   Entity pools (contacts, messages, times, destinations...)
        ↓
   Template variations (24+ templates per skill)
        ↓
   train_intent.jsonl (~1000 examples/skill)
   eval_intent_prompts.json (~6 held-out prompts/skill)

Entity pools

Realistic but synthetic entities ensure diversity without privacy concerns:

Pool	Examples
Contacts	Ri, Biraj, Mom, Parag Shah, grandma, my roommate
Messages	“see you soon”, “running late”, “project update attached”
Alarm times	5 am, 6:30 am, 7 am, noon, 10 pm
Alarm days	today, tomorrow, monday, next friday
Destinations	airport, train station, home, office
Playlists	workout, liked songs, chill vibes, focus
Channels	engineering, general, data contributors
Search queries	pasta recipes, jazz music, ghibli food

Template variations

Each skill has 15–30 prompt templates with placeholder slots:

WhatsApp templates:

"message {message} to {contact} on whatsapp"
"text {contact} {message} on whatsapp"
"whatsapp {contact} saying {message}"
"ping {contact} on whatsapp with {message}"

Alarm templates:

"create alarm for {time} {day}"
"wake me up at {time} {day}"
"set a {time} alarm for {day}"
"{time} alarm {day} please"

Uber templates:

"get an uber to {destination}"
"uber me to {destination}"
"book a cab to {destination} via uber"

Templates are crossed with random entity samples to produce unique training pairs. The same intent can appear as:

“set an alarm for 7 am tomorrow”
“wake me up at seven tomorrow morning”
“7am alarm pls”
“please alarm 7 am tomorrow thanks”
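The crossing step can be sketched as follows. This is not scripts/generate_intent_dataset.py: the generate_examples function, the seed handling, and the deduplication are assumptions, and the templates and pools are small subsets of the ones listed above.

```python
import random
from string import Formatter

TEMPLATES = {
    "whatsapp_send_message": [
        "message {message} to {contact} on whatsapp",
        "text {contact} {message} on whatsapp",
        "whatsapp {contact} saying {message}",
        "ping {contact} on whatsapp with {message}",
    ],
    "uber_request_ride": [
        "get an uber to {destination}",
        "uber me to {destination}",
        "book a cab to {destination} via uber",
    ],
}
POOLS = {
    "contact": ["Ri", "Biraj", "Mom"],
    "message": ["see you soon", "running late"],
    "destination": ["airport", "train station", "home", "office"],
}


def generate_examples(skill: str, n: int, seed: int = 0) -> list:
    """Cross templates with random entities; return up to n unique examples."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    seen, examples = set(), []
    attempts = 0
    # Cap attempts so the loop ends even if fewer than n unique prompts exist.
    while len(examples) < n and attempts < n * 50:
        attempts += 1
        template = rng.choice(TEMPLATES[skill])
        slots = [name for _, name, _, _ in Formatter().parse(template) if name]
        params = {slot: rng.choice(POOLS[slot]) for slot in slots}
        prompt = template.format(**params)
        if prompt in seen:
            continue
        seen.add(prompt)
        examples.append({"prompt": prompt, "skill": skill, "parameters": params})
    return examples
```

Because the sampled entities double as the expected parameters, every generated prompt comes with its label for free. Variations that reword the entity itself (like "seven" for "7 am") would need paraphrase tables that this sketch leaves out.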

V1 training data (skill-only)

The earlier scripts/generate_training_data.py produces ~510 examples for V1 classification:

30 variations per skill from skills.jsonl task descriptions
Guaranteed inclusion of Gradio demo prompts
Regex-based parsing of task strings to derive alarm times, contacts, etc.

Held-out evaluation sets

Two evaluation sets prevent overfitting to templates:

File	Size	Purpose
`data/eval_intent_prompts.json`	~90 prompts	Structured eval during training
`data/pocket_benchmark_prompts.json`	200 prompts	Real-world messy language benchmark

The Pocket Automator benchmark is intentionally unlike the training data: it is full of slang, typos, incomplete phrasing, and conversational filler.

"yo set an alrm for like 5:45 tmrw morning pls"
"need to b up at 6ish on monday ngl"
"hit up zoe on whatsapp say im omw"
"wa msg marcus 'running 20 min late'"
"lowkey need 11:11 pm alarm tonight"
"deadass need alarm sunday noon"

Each benchmark case is tagged with domain (alarms, whatsapp, spotify…) and styles (slang, typo, incomplete, conversational). Prompts are filtered against training data to ensure zero overlap.
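The overlap filter can be as simple as a normalized set lookup. The sketch below assumes each benchmark case is a dict with a "prompt" key alongside its domain and styles tags; normalize and filter_overlap are hypothetical names, and the project's real filter may normalize differently.

```python
import re


def normalize(prompt: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", prompt.strip().lower())


def filter_overlap(benchmark_cases: list, training_prompts: list) -> list:
    """Drop benchmark cases whose normalized prompt also appears in training."""
    seen = {normalize(p) for p in training_prompts}
    return [case for case in benchmark_cases if normalize(case["prompt"]) not in seen]
```

Exact matching after normalization only catches literal duplicates; near-duplicates that differ by a word would need a fuzzier comparison.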

Deployment and Demo

Modal inference API

Training and inference both run on Modal: serverless GPU infrastructure with persistent volumes.

modal_apps/predict_api.py deploys a FastAPI endpoint:

modal deploy modal_apps/predict_api.py
# → https://<workspace>--android-skill-predict-api-skillpredictor-web.modal.run

Architecture:

Container class SkillPredictor loads the QLoRA model once via @modal.enter()
4-bit quantized base model + LoRA adapter from Modal Volume
Greedy decoding (do_sample=False) for deterministic JSON output
128 max new tokens: enough for any intent JSON
5-minute scale-down window: containers stay warm between requests

Request/response:

curl -X POST https://.../predict \
  -H "Content-Type: application/json" \
  -d '{"prompt": "text mom on whatsapp i am on my way"}'

{
  "skill": "whatsapp_send_message",
  "parameters": {
    "contact": "mom",
    "message": "i am on my way"
  }
}

The API applies the same post-processing as local evaluation: JSON extraction, skill normalization, alias resolution, and keyword fallbacks.
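The same request can be made from Python with only the standard library. This is a sketch of a client, not the code in app.py; build_request, parse_intent, and predict are hypothetical names, and predict_url stands in for your own deployment's /predict URL.

```python
import json
import urllib.request


def build_request(predict_url: str, prompt: str) -> urllib.request.Request:
    """Build the POST request that mirrors the curl call above."""
    body = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        predict_url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_intent(payload: bytes) -> tuple:
    """Validate the response shape and return (skill, parameters)."""
    data = json.loads(payload)
    skill = data["skill"]
    parameters = data.get("parameters") or {}
    if not isinstance(skill, str) or not isinstance(parameters, dict):
        raise ValueError("unexpected response shape")
    return skill, parameters


def predict(predict_url: str, prompt: str, timeout: float = 30.0) -> tuple:
    """Send one prompt to the endpoint and return (skill, parameters)."""
    request = build_request(predict_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return parse_intent(response.read())
```

A generous timeout is worth keeping: the first request after the scale-down window has to wait for a container to load the model.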

Gradio demo

The Gradio demo (app.py) is the hackathon submission UI, hosted on Hugging Face Spaces.

Flow:

User types a natural language prompt (or picks an example)
App POSTs to Modal /predict endpoint
Response is parsed: skill label, parameter tiles, confidence display
Skill router loads the matching trajectory from trajectories/
UI shows task description, app package, step count, and trajectory preview

Example prompts built into the demo:

“play my workout playlist”
“turn bluetooth on”
“wake me up tomorrow morning”
“send ri a message on whatsapp”
“book an uber to the airport”

The Space doesn’t ship model weights; inference stays on Modal. Only a MODAL_PREDICT_URL secret is needed.

Local development

Three steps drive the whole workflow from your machine (training and inference still run on Modal):

# 1. Generate training data
python scripts/generate_intent_dataset.py

# 2. Train on Modal GPU
modal run modal_apps/train_modal.py --dataset train_intent.jsonl

# 3. Deploy inference + run demo
modal deploy modal_apps/predict_api.py
export MODAL_PREDICT_URL="https://..."
python app.py

Evaluation can run locally on CPU/MPS if you download the adapter:

modal volume get android-dataset-model adapter ./trained_model/adapter
python -m src.evaluate_intent

Evaluation

Metrics

Three metrics capture different levels of correctness:

Metric	Definition	What it measures
Skill accuracy	Predicted skill matches expected	App/action disambiguation
Parameter accuracy	All expected parameters match (normalized)	Slot-filling quality
Exact JSON match	Skill + all parameters match exactly	End-to-end correctness

Parameter matching uses a normalized, lowercase comparison: "Mom" matches "mom", and extra whitespace is stripped.
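One way to compute the three metrics is sketched below. The exact definitions in the project's evaluation code are not shown in this post, so two choices here are assumptions: parameter accuracy ignores extra predicted keys, and exact match requires the correct skill plus identical normalized parameter dicts.

```python
def normalize_value(value) -> str:
    """Lowercase and collapse whitespace."""
    return " ".join(str(value).lower().split())


def normalize_params(params: dict) -> dict:
    return {key: normalize_value(value) for key, value in params.items()}


def score(predictions: list, expected: list) -> dict:
    """Return skill accuracy, parameter accuracy, and exact match rates."""
    total = len(expected)
    if total == 0:
        return {"skill_accuracy": 0.0, "parameter_accuracy": 0.0, "exact_match": 0.0}
    skill_hits = param_hits = exact_hits = 0
    for pred, gold in zip(predictions, expected):
        pred_params = normalize_params(pred.get("parameters") or {})
        gold_params = normalize_params(gold["parameters"])
        skill_ok = pred.get("skill") == gold["skill"]
        # Every expected slot must be present with the same normalized value.
        params_ok = all(pred_params.get(k) == v for k, v in gold_params.items())
        skill_hits += skill_ok
        param_hits += params_ok
        exact_hits += skill_ok and pred_params == gold_params
    return {
        "skill_accuracy": skill_hits / total,
        "parameter_accuracy": param_hits / total,
        "exact_match": exact_hits / total,
    }
```

Under these definitions a case can count toward parameter accuracy while missing exact match, for instance when the slots are right but the skill label is wrong.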

Pocket Automator benchmark results

Evaluation on 200 held-out prompts with slang, typos, and conversational phrasing:

Metric	Score
Skill accuracy	99.0%
Parameter accuracy	86.0%
Exact JSON match	85.5%

The model almost never picks the wrong app or action. Parameter extraction is harder (for example, deciding whether to preserve an informal time expression like "6ish" or normalize it to "6 am"), but 86% is strong for a 3B model with no cloud fallback.

Where errors happen

Parameter failures tend to cluster around:

Informal time expressions: “6ish on monday” vs "time": "6 am", "day": "monday"
Abbreviated days: “tmrw” vs “tomorrow morning”
Message truncation: model drops filler words the benchmark expects verbatim
Contact nicknames: “roomie” vs a full name

Skill errors (1%) mostly involve near-miss disambiguation, such as Spotify search-and-play vs play-playlist when the prompt is ambiguous.

Evaluation commands

# On Modal GPU
modal run modal_apps/evaluate_intent_modal.py
modal run modal_apps/evaluate_pocket_benchmark_modal.py

# Locally
python -m src.evaluate_intent
python -m src.evaluate_pocket_benchmark

The pocket benchmark runner produces a confusion matrix, per-domain breakdown, and a failure report saved to data/pocket_benchmark_report.txt.
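The confusion matrix and per-domain breakdown reduce to a couple of counters. This is a sketch rather than the code in src/pocket_benchmark.py; the result dicts with "domain", "expected", and "predicted" keys are an assumed shape.

```python
from collections import Counter, defaultdict


def confusion_counts(results: list) -> Counter:
    """Count (expected, predicted) skill pairs; off-diagonal pairs are errors."""
    return Counter((r["expected"], r["predicted"]) for r in results)


def per_domain_accuracy(results: list) -> dict:
    """Skill accuracy for each domain tag."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in results:
        totals[r["domain"]] += 1
        hits[r["domain"]] += int(r["expected"] == r["predicted"])
    return {domain: hits[domain] / totals[domain] for domain in totals}
```

Sorting the off-diagonal entries of the counter by count is a quick way to see which pairs of skills the model confuses most.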

Why This Approach Works

1. Local-first, privacy-preserving

A 3B model can run on-device (via llama.cpp, MLC, or similar) or on a small GPU. Your “text mom I’m running late” never needs to hit a frontier API. The entire inference stack fits in ~2GB of VRAM with 4-bit quantization.
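A back-of-envelope check on that figure, assuming a round 3 billion parameters and counting weights only (quantization constants, activations, and the KV cache are ignored, and quantized_weight_gb is a hypothetical helper):

```python
def quantized_weight_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate weight storage in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


weights_4bit = quantized_weight_gb(3e9, 4)    # 1.5 GB
weights_16bit = quantized_weight_gb(3e9, 16)  # 6.0 GB
```

The 4-bit weights alone come to about 1.5 GB, which leaves some headroom under the ~2GB figure for everything the estimate ignores.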

2. Deterministic replay, not hallucinated taps

The model outputs a skill label and parameters. The trajectory is a fixed file recorded on a real device. No invented button coordinates, no drift between runs. If the model says whatsapp_send_message, you get the exact same tap sequence every time.

This is fundamentally different from vision-based agents that re-locate UI elements on every run and can click the wrong thing.

3. Cheap to extend

Adding a new skill is a repeatable pipeline:

Record one trajectory with Pocket Automator
Add parameter schema to data/skill_schemas.json
Add skill mapping to src/skill_router.py
Regenerate training data: python scripts/generate_intent_dataset.py
Fine-tune: modal run modal_apps/train_modal.py --dataset train_intent.jsonl

No prompt engineering session. No re-architecting the model. Just more data and another training run.

4. Separation of concerns

Component	Responsibility	Swappable?
Language model	Understand intent	Yes; any 3B instruct model
Skill router	Map intent → file	Yes; add skills without retraining
Pocket Automator	Execute UI steps	Yes; any accessibility replay engine
Trajectory JSON	Store ground truth	Yes; re-record when UI changes

Each piece can be improved independently. Better model? Swap the adapter. UI changed? Re-record one trajectory. New app? Add a skill.

5. Designed for the “backyard”

This project targets personal automation on hardware you own: the Backyard AI track. It’s not trying to automate every Android app in existence. It’s trying to automate your apps, your flows, your phrasing, with a model small enough to run locally.

What’s Next

The current gap

V2 extracts parameters at inference time:

"text mom on whatsapp i'm on my way"
→ {"contact": "mom", "message": "i'm on my way"}

But trajectories are still recorded with fixed entities: the WhatsApp trajectory says “message hi to biraj” and the set_text actions contain "hi" and "biraj". Replay uses those literal values, not the extracted parameters.

The planned solution

Slot-filling at replay time: when the model returns {"contact": "mom", "message": "i'm on my way"}, the replay engine:

Identifies parameterizable steps in the trajectory (text input actions)
Substitutes extracted values into set_text actions
Uses smart node resolution to find the contact field, search box, etc.

This closes the loop:

Natural language → structured intent → parameterized replay on any device

The trajectory becomes a template rather than a fixed recording. Record once with placeholder entities, replay with any contact, message, time, or destination.
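Since this feature is still planned, the following is only a sketch of how the substitution could work. It assumes set_text actions store their typed value under a "text" key and that the values typed during recording are known per parameter; neither detail is confirmed by the trajectory excerpt earlier in this post, and parameterize is a hypothetical name.

```python
import copy


def parameterize(steps: list, recorded: dict, extracted: dict) -> list:
    """Return a copy of the steps with recorded literals swapped for new values.

    recorded:  parameter name -> literal typed during recording
    extracted: parameter name -> value the model extracted from the prompt
    """
    literal_to_param = {literal: name for name, literal in recorded.items()}
    filled = copy.deepcopy(steps)  # never mutate the stored trajectory
    for step in filled:
        action = step.get("action", {})
        if action.get("type") != "set_text":
            continue  # clicks and scrolls replay unchanged
        name = literal_to_param.get(action.get("text"))
        if name is not None and name in extracted:
            action["text"] = extracted[name]
    return filled
```

For the WhatsApp recording, recorded would be {"contact": "biraj", "message": "hi"}, so a prompt that extracts "mom" and "i'm on my way" rewrites only the two text inputs and leaves every tap as recorded.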

Other future work

On-device inference: run the 3B model locally without Modal
More skills: maps, photos, settings toggles, banking apps
Multi-step intents: “set alarm and text mom I’ll be late”
Confidence calibration: know when to ask the user for clarification
UI change detection: alert when a trajectory needs re-recording

Links

Resource	URL
Blog post	Hugging Face Blog; Android Skill Router
Live demo	android-skill-router on Hugging Face Spaces
Demo video	YouTube Short
Pocket Automator	GitHub: Android recorder and replay
Social post	Twitter/X

Quick start

git clone https://github.com/kriyanshii/android-dataset.git
cd android-dataset

# Generate intent training data
python scripts/generate_intent_dataset.py

# Train on Modal (requires modal setup)
pip install modal && modal setup
modal run modal_apps/train_modal.py --dataset train_intent.jsonl

# Deploy inference API
modal deploy modal_apps/predict_api.py

# Run Gradio demo
pip install -r requirements.txt
export MODAL_PREDICT_URL="https://<your-modal-url>/predict"
python app.py

Project layout

app.py                            # Gradio demo (hackathon submission UI)
data/
  skill_schemas.json              # Parameter definitions per skill
  skills.jsonl                    # Canonical skill ↔ task mapping
  train_intent.jsonl              # ~15k SFT examples (generated locally)
  eval_intent_prompts.json        # Held-out intent eval set
  pocket_benchmark_prompts.json   # 200 real-world messy prompts
src/
  skill_router.py                 # Skill name → trajectory JSON
  skill_utils.py                  # JSON parsing, aliases, fallbacks
  classifier_prompt.py            # System prompts for V1 and V2
  evaluate_intent.py              # Local evaluation
  pocket_benchmark.py             # Benchmark metrics and reports
modal_apps/
  train_modal.py                  # QLoRA fine-tuning on Modal GPU
  predict_api.py                  # FastAPI inference endpoint
  evaluate_intent_modal.py        # GPU evaluation
  evaluate_pocket_benchmark_modal.py
scripts/
  generate_skill_dataset.py       # trajectories → skills.jsonl
  generate_intent_dataset.py      # schemas → train_intent.jsonl
  generate_pocket_benchmark.py
trajectories/                     # Pocket Automator exports (15 skills)

Summary

Android Skill Router shows that personal phone automation doesn’t require a 70B agent in the cloud.

Record UI flows once on your Android device with Pocket Automator
Fine-tune a 3B model to understand how you actually talk (slang, typos, and all)
Route to deterministic trajectories, no hallucinated taps
Replay through accessibility APIs on real hardware

Classify → route → replay. Small model, real hardware, backyard-scale AI that actually does something useful.

Apache 2.0. Base model weights subject to Qwen license.


Kriyanshi

