v0.2.1 · macOS · open source

AI-driven user testing for iOS, macOS, and the web.

Write a goal in plain English. Pick a persona. Harness drives your target — iOS Simulator, macOS app, or a URL in an embedded browser — while an LLM agent reads each screen, taps and types its way through, and flags UX friction the way a real person would.

Download v0.2.1 · View on GitHub

macOS 14+ · MIT licensed · Last updated 2026-05-06

Harness Run Session — agent driving an app with a live simulator mirror, step feed, and approval card

Not a UI test. A user test.

Harness is a native macOS dev tool that drives your iOS Simulator, your macOS app, or a web app the way a real user would. You write a goal in plain English and pick a persona; an LLM agent reads each screen, decides what to do, and pursues the goal — tapping, typing, scrolling, navigating. When something is confusing, ambiguous, or a dead-end, the agent flags it. Every run produces a replayable timeline of screens, actions, and friction events you can scrub through later. No accessibility identifiers, no source-code access, no pre-written plan — the agent reasons from what's on screen.
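
The loop described above can be sketched in a few lines. This is a minimal illustration, not Harness's actual implementation: the `Decision` and `Timeline` types, the `run_loop` function, and the `capture`, `decide`, and `act` callbacks are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Decision:
    """What the model chose after looking at one screenshot (hypothetical shape)."""
    observation: str                 # what the agent saw on screen
    action: str                      # e.g. "tap", "type", "scroll", or "done"
    friction: Optional[str] = None   # set when the screen is confusing or a dead end


@dataclass
class Timeline:
    steps: List[Decision] = field(default_factory=list)
    friction: List[str] = field(default_factory=list)


def run_loop(
    capture: Callable[[], bytes],
    decide: Callable[[str, str, bytes], Decision],
    act: Callable[[Decision], None],
    goal: str,
    persona: str,
    max_steps: int = 40,
) -> Timeline:
    """Observe, decide, act, and record until the goal is done or steps run out."""
    timeline = Timeline()
    for _ in range(max_steps):
        screenshot = capture()                        # only pixels are observed
        decision = decide(goal, persona, screenshot)  # assumed to wrap the model call
        timeline.steps.append(decision)
        if decision.friction:
            timeline.friction.append(decision.friction)
        if decision.action == "done":
            break
        act(decision)                                 # tap, type, scroll, navigate
    return timeline
```

Every decision lands in the timeline whether or not it raised friction, which is what makes a run replayable afterwards.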

Drives three kinds of target.

Per-app setting: declare what kind of thing you're testing once, and Harness picks the right driver. Run history, replay, and friction events look the same across all three.

iOS Simulator

Provide an Xcode project + scheme. Harness builds, boots a Simulator, installs, launches, and drives the app via WebDriverAgent: taps, swipes, typing, and gestures.

macOS app

Launch a pre-built .app or build from source. Harness sends clicks, scrolling, keyboard input, and shortcuts through CGEvent and captures the screen with CGWindowList.

Web app

Embedded WKWebView at any CSS-pixel viewport (1280×800 desktop, 375×812 mobile). JS-synthesised events for input. Same engine as Safari.
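
One way to picture the per-app setting is a lookup from target kind to driver. The sketch below is illustrative only: `TargetKind`, `DRIVERS`, and `pick_driver` are hypothetical names, and the table just restates the mechanisms listed above for each target.

```python
from enum import Enum


class TargetKind(Enum):
    IOS_SIMULATOR = "ios-simulator"
    MACOS_APP = "macos-app"
    WEB_APP = "web-app"


# Input mechanism and surface per target kind, as described in the text above.
DRIVERS = {
    TargetKind.IOS_SIMULATOR: {"input": "WebDriverAgent", "surface": "iOS Simulator"},
    TargetKind.MACOS_APP: {"input": "CGEvent", "surface": "CGWindowList capture"},
    TargetKind.WEB_APP: {"input": "JS-synthesised events", "surface": "embedded WKWebView"},
}


def pick_driver(kind: TargetKind) -> dict:
    """Return the driver description for an application's declared target kind."""
    return DRIVERS[kind]
```

Because the kind is declared once per application, everything downstream (history, replay, friction) can stay agnostic about which driver produced a run.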

Compose

A goal sentence and a persona — that's the whole input.

Pick the application, choose a persona ("first-time user, never seen the app," "returning power user," "person in a hurry on a flaky network"), and type the goal in your own words. Step and token budgets cap how long a run can go; the model picker lets you trade speed for capability.

Built-in personas plus your own custom ones
Reusable Applications: project, scheme, and platform set once
Step-mode pauses for approval before every action

Harness goal-input form — Application picker, Persona dropdown, Goal text area, model and step-budget sliders
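
As a rough sketch of how step and token budgets plus a step-mode gate could fit together (the `Budget` class and `should_fire` function are hypothetical names for this example, not Harness's API):

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Budget:
    max_steps: int
    max_tokens: int
    steps_used: int = 0
    tokens_used: int = 0

    def charge(self, tokens: int) -> None:
        """Record one step and the tokens it consumed."""
        self.steps_used += 1
        self.tokens_used += tokens

    def exhausted(self) -> bool:
        """True once either cap has been reached."""
        return self.steps_used >= self.max_steps or self.tokens_used >= self.max_tokens


def should_fire(action: str, step_mode: bool, approve: Callable[[str], bool]) -> bool:
    """In step-mode, ask for approval before every action; in auto-mode, always fire."""
    return approve(action) if step_mode else True
```

A run would stop as soon as `exhausted()` returns true, so whichever cap is hit first bounds the run.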

Run

Watch every tap, type, and swipe in real time.

The simulator mirror updates several frames per second with a coordinate overlay on the last action. The step feed scrolls alongside, narrating what the agent saw and what it decided. Step-mode lets you approve each action before it fires.

Live screenshot mirror with tap-coordinate overlay
Step feed: observation, intent, tool call
Auto-mode runs end-to-end; step-mode gates each action

Harness Run Session — simulator mirror, step feed, and approval card visible mid-run

Diagnose

When the agent gets confused, you know exactly where.

Friction events are tagged by kind (dead end, ambiguous label, unresponsive control, confusing copy), each with a one-line description, timestamp, and screenshot. Browse them grouped by leg (a chain of related actions), or scan them flat in step order.

Semantic friction kinds, not free-text noise
Each event links back to the exact step that triggered it
Group by leg for chains, or by kind for an at-a-glance scan

Friction report — grouped cards showing dead-end, ambiguous-label, and unresponsive-control friction kinds
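
To make the two views concrete, here is a small sketch of grouping friction events by leg or by kind. The `FrictionEvent` fields and the `group_by` helper are assumptions made for illustration; the real event schema may differ.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class FrictionEvent:
    kind: str          # e.g. "dead end", "ambiguous label"
    description: str   # one-line description
    step: int          # step that triggered the event
    leg: int           # leg (action chain) the step belongs to


def group_by(events: List[FrictionEvent], key: str) -> Dict[object, List[FrictionEvent]]:
    """Group events by the named field ("leg" or "kind"), in step order within each group."""
    groups: Dict[object, List[FrictionEvent]] = defaultdict(list)
    for event in sorted(events, key=lambda e: e.step):
        groups[getattr(event, key)].append(event)
    return dict(groups)
```

Calling `group_by(events, "leg")` gives the chain view, while `group_by(events, "kind")` gives the at-a-glance scan.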

Review

Every run is a scrubbable timeline.

Drag the timeline scrubber to any moment. Each step shows the screenshot the agent saw, the observation it noted, the tool call it made, and any friction it raised. Use ←/→ to advance one step at a time. Leg boundaries on the scrubber show where action chains transition.

Append-only JSONL run log + screenshots, durable on disk
Reveal in Finder, export as a zipped bundle
Open old runs months later — the format is versioned

Run Replay — timeline scrubber dragged mid-run, step detail panel showing screenshot, observation, and tool call
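
Since the run log is append-only JSONL, reading it back is straightforward in principle. The sketch below assumes one JSON object per line; the field names used with it (`step`, `observation`, `tool_call`) are hypothetical, and the real schema is documented under Run-Replay-Format on the wiki.

```python
import json
from typing import Iterable, List


def read_run_log(lines: Iterable[str]) -> List[dict]:
    """Parse a JSONL log: one JSON object per non-empty line, in append order."""
    return [json.loads(line) for line in lines if line.strip()]


class Scrubber:
    """Step through parsed log records, like pressing the left and right arrow keys."""

    def __init__(self, steps: List[dict]) -> None:
        self.steps = steps
        self.index = 0

    def current(self) -> dict:
        return self.steps[self.index]

    def forward(self) -> dict:
        # Clamp at the last step instead of running off the end.
        self.index = min(self.index + 1, len(self.steps) - 1)
        return self.current()

    def back(self) -> dict:
        # Clamp at the first step.
        self.index = max(self.index - 1, 0)
        return self.current()
```

With a real log you would pass an open file handle to `read_run_log`, since iterating a text file yields one line at a time.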

Compare

A library of every run, filtered and searchable.

Filter by verdict (success, failure, or blocked), search by goal text, and see which app each run targeted. Reveal in Finder for export. Re-run the same goal against a new build to compare friction counts side-by-side.

SwiftData-backed; survives app restarts and app upgrades
Per-run verdict pill plus a one-paragraph summary in the agent's words
Right-click context menu on every row

Run History — filterable list of runs with verdict pills, project icons, timestamps, and search
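
The filtering and comparison described above amount to a couple of small operations over run summaries. This sketch uses a hypothetical `RunSummary` record and helper names; it does not reflect the SwiftData model Harness actually stores.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RunSummary:
    app: str
    goal: str
    verdict: str          # "success", "failure", or "blocked"
    friction_count: int


def filter_runs(
    runs: List[RunSummary],
    verdict: Optional[str] = None,
    goal_query: str = "",
) -> List[RunSummary]:
    """Keep runs matching the verdict (if given) whose goal contains the query text."""
    query = goal_query.lower()
    return [
        run for run in runs
        if (verdict is None or run.verdict == verdict) and query in run.goal.lower()
    ]


def friction_delta(before: RunSummary, after: RunSummary) -> int:
    """Change in friction count between two runs of the same goal (negative is better)."""
    return after.friction_count - before.friction_count
```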

Why vision-grounded matters.

Most UI testing tools require accessibility identifiers, source-code access, or a hand-written script. Harness's agent reasons from pixels — which means it can test apps the same way a human cold-opens them, including apps you didn't write.

No a11y IDs needed

The agent reads pixels, not the responder chain. Test apps that ship without accessibility identifiers — including third-party builds, web pages you don't control, and screens where a11y tagging is incomplete.

No source access required

A buildable Xcode project for iOS, a .app for macOS, or a URL for web — that's the input. The agent never reads your code, never knows your bundle IDs, never has internal context. It evaluates your UI from the same starting point your users do.

Persona-conditioned reasoning

The same goal under "first-time user" produces a different path than "returning power user." Friction events shift accordingly. Pick a persona that matches the cohort you actually care about.
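
To illustrate why the persona changes the path, consider how a goal and a persona might be combined into the agent's instructions. The `build_instructions` function and its wording are purely an assumption for this example; Harness's real prompts are not shown on this page.

```python
def build_instructions(goal: str, persona: str) -> str:
    """Combine a goal and a persona into one instruction string (illustrative wording)."""
    return (
        f"You are acting as this user: {persona}.\n"
        f"Your goal: {goal}\n"
        "Work only from what is visible on screen, and report anything "
        "confusing, ambiguous, or a dead end as friction."
    )


first_timer = build_instructions("Add an item to the cart", "first-time user, never seen the app")
power_user = build_instructions("Add an item to the cart", "returning power user")
```

The goal line is identical in both strings; only the persona line differs, and that difference is what conditions the model's choices.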

More on the loop algorithm: Agent-Loop · Tool-Schema · Run-Replay-Format on the wiki.

Get started

Harness is alpha. Build from source today; signed binary releases follow.

Prerequisites

macOS 14 or later, Xcode 16+ (Swift 6 strict concurrency)
Homebrew + idb-companion + xcodegen
An API key for at least one supported provider — Anthropic, OpenAI, or Google — stored in the macOS Keychain on first run

Build

git clone https://github.com/awizemann/harness.git
cd harness
git submodule update --init --recursive
brew tap facebook/fb && brew install idb-companion xcodegen
xcodegen generate
open Harness.xcodeproj

Full setup, including first-run WDA build details, lives on Build-and-Run.

Harness Settings — three providers (Anthropic, OpenAI, Google) with per-provider Keychain status, plus default Provider, Model, Mode, and Step budget

FAQ

Quick answers to the questions developers usually ask before installing.

Does Harness need access to my source code?

No. The agent reads screenshots and reasons about visible UI; it never reads your source. For iOS, you provide a buildable Xcode project so Harness can produce a .app to install in the Simulator. For macOS, a pre-built .app is enough. For web, just a URL.

Does Harness drive a real iPhone, or only the Simulator?

Today Harness is iOS-Simulator-only. Real-device support over WebDriverAgent and idb is on the roadmap. macOS targets are real (it's your actual Mac). Web targets render in an embedded WKWebView — the same engine Safari uses.

Which AI models does Harness support?

Harness supports vision tool-use models from Anthropic, OpenAI, and Google. Today that's:

Anthropic — Claude Opus 4.7, Sonnet 4.6, Haiku 4.5
OpenAI — GPT-5 Mini, GPT-4.1 Nano
Google — Gemini 2.5 Flash, Gemini 2.5 Flash Lite

You bring your own API key for whichever provider you pick — keys are stored in the macOS Keychain, one entry per provider. The Compose Run form has a per-run provider + model picker, so you can trade speed for capability without leaving the app.
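
The provider and model list above can be written down as a small catalog. The sketch below uses the display names from this page, not provider API identifiers, and `SUPPORTED_MODELS` and `is_supported` are hypothetical names for the example.

```python
# Display names as listed above; these are not the providers' API model identifiers.
SUPPORTED_MODELS = {
    "Anthropic": ["Claude Opus 4.7", "Claude Sonnet 4.6", "Claude Haiku 4.5"],
    "OpenAI": ["GPT-5 Mini", "GPT-4.1 Nano"],
    "Google": ["Gemini 2.5 Flash", "Gemini 2.5 Flash Lite"],
}


def is_supported(provider: str, model: str) -> bool:
    """Check that a per-run provider + model selection is in the catalog."""
    return model in SUPPORTED_MODELS.get(provider, [])
```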

Roughly how much does a run cost?

It depends on the goal length, step budget, model, and provider. A typical 20–40 step run on a mid-tier model lands in the few-cents-to-low-dollars range; smaller models (Haiku 4.5, Gemini 2.5 Flash Lite, GPT-4.1 Nano) shave that further. Token use is dominated by screenshot inputs. The step and token budgets cap cost predictably.
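
For a back-of-the-envelope estimate, multiply steps by average tokens per step and by the provider's per-token price. The function below is a simple sketch; the parameter names are made up for the example, and any numbers you plug in should come from your provider's current pricing, not from this page.

```python
def estimate_run_cost(
    steps: int,
    input_tokens_per_step: int,
    output_tokens_per_step: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
) -> float:
    """Rough run cost in USD, assuming average token counts per step."""
    input_cost = steps * input_tokens_per_step * usd_per_million_input / 1_000_000
    output_cost = steps * output_tokens_per_step * usd_per_million_output / 1_000_000
    return input_cost + output_cost


# Placeholder numbers for illustration only, not real prices or measured token counts.
example = estimate_run_cost(
    steps=30,
    input_tokens_per_step=2_000,
    output_tokens_per_step=200,
    usd_per_million_input=1.0,
    usd_per_million_output=5.0,
)
```

Because screenshots dominate input tokens, the input term is usually the one to watch when you raise the step budget.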

Can I run Harness in CI, headless?

Not yet. Harness today is a desktop tool with a GUI; a headless "harness run goal.md" mode is on the roadmap. If you want to fail PR builds on UX regressions, watch the issue tracker.

Does Harness self-update?

Auto-update isn't wired up yet. Releases are GitHub releases for now; a Sparkle-based auto-updater is queued.

Does my data leave my Mac?

Screenshots and step events are sent to whichever provider you picked (Anthropic, OpenAI, or Google), because that is the inference path that drives the agent. Run logs (JSONL plus screenshots) live on disk in ~/Library/Application Support/Harness/runs/. Nothing else phones home: no analytics, no telemetry, no remote logging.
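
If you want to check what is stored locally, you can list the runs directory yourself. This sketch only assumes the path given above; the layout inside the directory is not described on this page, so the helper (a hypothetical `list_run_entries`) just returns entry names.

```python
from pathlib import Path
from typing import List

# Path taken from the answer above; the layout inside it is not assumed.
RUNS_DIR = Path.home() / "Library" / "Application Support" / "Harness" / "runs"


def list_run_entries(runs_dir: Path = RUNS_DIR) -> List[str]:
    """Return the names of entries in the runs directory, or [] if it does not exist."""
    if not runs_dir.is_dir():
        return []
    return sorted(entry.name for entry in runs_dir.iterdir())
```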

What versions of Xcode and macOS do I need?

macOS 14 (Sonoma) or later, and Xcode 16 or later — Harness uses Swift 6 strict concurrency. The iOS Simulator ships with Xcode.