Open-source SDK for macOS

Your AI doesn't need
to see the screen.

It just needs to know
what's on it.

Glance reads the macOS accessibility tree — the same API VoiceOver uses — and gives your AI structured data about native app UI: buttons, text fields, sliders, their labels, and exact pixel coordinates. No screenshots. No vision models. One function call.

30ms: Your eye takes 300ms to blink. Glance reads the entire screen 10 times in that window.
10× faster: LLMs process text tokens 10× faster than image tokens. Less data in, faster answer out.
30× cheaper: ~500 text tokens vs ~3,000 image tokens per call, roughly $0.001 vs $0.03. That adds up fast at scale.

Let's be real for a second.

Glance doesn't replace screenshots entirely. About 45% of the time, when your user is in a code editor, terminal, chat app, or any standard productivity app, structured text is all your AI needs. It responds faster, costs less, and gets exact element positions instead of guessing from pixels.

The other 55%? Browser page content, creative tools, canvas apps, games, custom-rendered UIs: the accessibility tree is partial or sparse there. Combine Glance with a screenshot, or fall back to screenshots entirely. Glance tells you when to switch. Use the right tool for the moment.

Or just tell your agent to set it up
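import Anthropic from '@anthropic-ai/sdk'
import { screen } from 'glance-sdk'

// reads ANTHROPIC_API_KEY from the environment
const anthropic = new Anthropic()

// one line: returns structured text, not pixels
const ctx = await screen()

// feed it to any LLM as plain text
const res = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024, // required by the Messages API
  messages: [{ role: 'user', content: ctx }]
})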

The screenshot tax

Every Clicky, Snippy, and screen companion being built right now pays it on every single interaction. Capture, encode, upload, wait for the vision model to squint at pixels. Your users feel that lag.

Screenshot pipeline
1. Capture 2MB screenshot
2. Base64 encode, upload to API
3. Vision model processes ~3,000 tokens
4. AI guesses element coordinates
2–5 sec · ~$0.03 / call · ~50px off

With Glance
1. Read OS accessibility tree (~30ms)
2. Send ~500 tokens of structured text
3. Exact coordinates included free
~0.8 sec · ~$0.001 / call · 0px off

Why is it so much faster?

LLMs process text and image tokens on different pipelines. Image tokens take 1–3 extra seconds to encode and interpret, and that's time your user spends waiting. With text-only input, the model starts generating almost immediately, so the response feels instant.

What your LLM actually receives

Instead of a 2MB image for the AI to squint at, it gets structured data from the macOS accessibility tree:

[App: Ghostty | Window: "~/projects — zsh"]

## Tabs
- [Tab] "~/projects" at (120,12) [SELECTED]
- [Tab] "npm run dev" at (240,12)
- [Button] "+" at (360,12)

## Toolbar
- [Button] "Back" at (40,12)
- [Button] "Forward" at (70,12)
- [PopUpButton] "Profiles" at (420,12)

## Content
- [StaticText] "~/projects git:(main)" at (20,80)
- [StaticText] "$ glance screen" at (20,100)

This is the UI of a native macOS app, not a browser DOM. Glance reads the OS-level accessibility tree, the same data VoiceOver uses, so coordinates are exact rather than estimated from pixels.
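If you ever need to turn that text back into objects (for logging or tests, say), the line format in the sample above is regular enough to parse. The sketch below assumes every element line looks exactly like the ones shown; `ParsedElement` and `parseElements` are illustrative names, not part of glance-sdk, and in practice `capture()` already hands you the element array.

```typescript
// Hypothetical shape for one parsed line; not an official glance-sdk type.
interface ParsedElement {
  role: string;
  label: string;
  x: number;
  y: number;
  selected: boolean;
}

// Matches lines such as: - [Tab] "~/projects" at (120,12) [SELECTED]
const ELEMENT_LINE = /^- \[(\w+)\] "(.*)" at \((-?\d+),(-?\d+)\)( \[SELECTED\])?$/;

function parseElements(prompt: string): ParsedElement[] {
  const elements: ParsedElement[] = [];
  for (const line of prompt.split("\n")) {
    const match = ELEMENT_LINE.exec(line.trim());
    if (!match) continue; // skip the app header, section titles, blank lines
    elements.push({
      role: match[1],
      label: match[2],
      x: Number(match[3]),
      y: Number(match[4]),
      selected: match[5] !== undefined,
    });
  }
  return elements;
}

const sample = [
  "## Tabs",
  '- [Tab] "~/projects" at (120,12) [SELECTED]',
  '- [Button] "+" at (360,12)',
].join("\n");

const parsed = parseElements(sample);
console.log(parsed.length, parsed[0]);
```

Lines that do not match the pattern are skipped rather than treated as errors, which keeps the parser tolerant of headers and any extra annotations.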

Where Glance shines

Glance reads the native macOS accessibility tree — the same structured data VoiceOver uses. How much it can see depends on how well the app exposes its UI.

Best for (~45% of use cases)

Full element tree — every button, field, link, label, and its exact position. Text-only mode works perfectly here. This is where Glance saves you the most.

  • Code editors — VS Code, Cursor, Zed, Xcode. Sees file trees, tabs, editor content, terminal output.
  • Terminals — Ghostty, iTerm2, Warp, Terminal.app. Tabs, buttons, text content all exposed.
  • Chat & productivity — Slack, Discord, Notion, Notes, Mail. Full message and UI structure.
  • System apps — Finder, System Settings, Activity Monitor. Everything exposed.
  • Electron apps — Any app built on Electron inherits Chromium's accessibility tree.

Works, with gaps (~30% of use cases)

Menus, toolbars, and panels are readable — but the main content area may be partially exposed. Combine with a screenshot for full context.

  • Browsers — Chrome, Safari, Arc. Tabs, address bar, toolbar yes. Web content varies by site — well-structured ARIA sites work better.
  • DaVinci Resolve — color panels, timeline controls, menus yes. Video viewer no.
  • Adobe apps — toolbars, layers panel, menus yes. Artboard/canvas no.
  • Figma desktop — app chrome yes. Design canvas no.
  • Blender — UI panels yes. 3D viewport no.

Use screenshot (~25% of use cases)

These render everything custom or have deeply nested content, so the accessibility tree is sparse or unreliable. Glance detects this automatically so you can switch to screenshots.

  • Games and canvas apps
  • WebGL / custom-rendered UIs
  • Deep web page content
  • Remote desktop streams

Smart fallback, built in

Glance tells you when structured text isn't enough. One check, and you switch strategies automatically.

import { capture } from 'glance-sdk'

// sendToLLM and captureScreenshot are helpers you supply; they are not imported from glance-sdk here
const state = await capture()

if (state.elementCount > 5) {
  // rich structure: use text (10× faster, 30× cheaper)
  sendToLLM({ role: 'user', content: state.prompt })
} else {
  // sparse tree (canvas app, game): fall back to screenshot
  const img = await captureScreenshot()
  sendToLLM({ role: 'user', content: [{ type: 'image', data: img }] })
}

Per call            Glance (text)             Screenshot (image)
Capture             ~30ms                     ~50ms
Network             nothing to upload         200–500ms (2MB image)
LLM thinking        ~0.3s (500 text tokens)   ~1.8s (3,000 image tokens)
Total latency       ~0.8s                     ~3.5s
Element positions   exact (from OS)           estimated (from pixels)
Cost                ~$0.001                   ~$0.03
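To see what the per-call cost difference means at volume, here is a small worked calculation. It uses the two per-call figures from the comparison above; the workload of 10,000 calls per day is a made-up number for illustration, and amounts are kept in thousandths of a dollar so the arithmetic stays in integers.

```typescript
// Per-call figures from the comparison above, in thousandths of a dollar.
const SCREENSHOT_MILLIDOLLARS_PER_CALL = 30; // ~$0.03
const GLANCE_MILLIDOLLARS_PER_CALL = 1; // ~$0.001

// Total cost in dollars for a given daily call volume over a number of days.
function monthlyCostUsd(callsPerDay: number, milliDollarsPerCall: number, days = 30): number {
  return (callsPerDay * days * milliDollarsPerCall) / 1000;
}

const callsPerDay = 10_000; // hypothetical workload
const screenshotCost = monthlyCostUsd(callsPerDay, SCREENSHOT_MILLIDOLLARS_PER_CALL);
const glanceCost = monthlyCostUsd(callsPerDay, GLANCE_MILLIDOLLARS_PER_CALL);

console.log(`screenshots: $${screenshotCost}, text: $${glanceCost}`);
// prints "screenshots: $9000, text: $300"
```

This assumes every call can use text. In a mixed workload only the calls that land in well-exposed apps get the lower rate.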

Get started

The pre-built macOS binary is bundled inside the package. Nothing else to install.

npm
npm install glance-sdk
pip
pip install glance-sdk
swift
.package(url: "https://github.com/rishabhsai/glance", from: "0.1.0")

Four functions. That's the API.

screen()

Returns an LLM-ready string. Every UI element with role, label, value, and exact coordinates. Drop it straight into your prompt as text.

→ string

capture()

Full structured data — app name, window title, element array, the prompt string, and timing metrics. Use when you need programmatic access or want to check elementCount for fallback logic.

→ object

find(name)

Look up a UI element by label. Returns its exact pixel position — built for Clicky-style cursor pointing. No more coordinate guessing.

→ element | null

checkAccess()

Check if macOS Accessibility permission is granted. Same permission Clicky and similar tools already require for push-to-talk.

→ boolean
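
To make the find(name) behaviour concrete, here is a rough sketch of a label lookup over an element array. This is not the SDK's implementation: the `UIElement` shape and the matching rules (case-insensitive, exact match before substring match) are assumptions chosen for illustration.

```typescript
// Hypothetical element shape; the real objects returned by capture() and
// find() may carry different or additional fields.
interface UIElement {
  role: string;
  label: string;
  x: number;
  y: number;
}

// Case-insensitive lookup: exact label match first, then substring match.
function findByLabel(elements: UIElement[], name: string): UIElement | null {
  const needle = name.trim().toLowerCase();
  if (needle === "") return null;
  const exact = elements.find((el) => el.label.toLowerCase() === needle);
  if (exact) return exact;
  return elements.find((el) => el.label.toLowerCase().includes(needle)) ?? null;
}

const toolbar: UIElement[] = [
  { role: "Button", label: "Back", x: 40, y: 12 },
  { role: "Button", label: "Forward", x: 70, y: 12 },
  { role: "PopUpButton", label: "Profiles", x: 420, y: 12 },
];

const target = findByLabel(toolbar, "profiles");
console.log(target ? `move cursor to (${target.x}, ${target.y})` : "not found");
// prints "move cursor to (420, 12)"
```

Returning null instead of throwing mirrors the `element | null` result described above, so the caller can decide whether to retry, ask the user, or fall back to a screenshot.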

Drop it into your stack

Glance works as a drop-in enhancement for any screen-aware AI tool.

Clicky / Snippy

Replace the screenshot capture in CompanionManager.swift with Glance.screen(). Two lines changed. Responses feel instant, pointing becomes pixel-perfect.

Electron apps

npm install glance-sdk — the native binary ships inside node_modules. Call screen() from your main process. No native compilation needed.

Python agents

pip install glance-sdk — binary bundled in the wheel. Works with LangChain, Claude SDK, OpenAI SDK, CrewAI, or any framework.

Any language

Use the CLI: ./glance screen --json outputs structured JSON to stdout. Parse it from Go, Rust, Ruby — anything that can exec a process.
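
As one example of the exec-and-parse pattern, here is how it might look from Node. The helper name `readJsonFromCommand` is made up, and the JSON shape the CLI emits is not documented on this page, so the result is typed as `unknown`. A stand-in command replaces the real binary so the sketch runs anywhere Node is installed.

```typescript
import { execFileSync } from "node:child_process";

// Run a command that prints JSON to stdout and parse the result.
function readJsonFromCommand(command: string, args: string[]): unknown {
  const stdout = execFileSync(command, args, {
    encoding: "utf8",
    timeout: 5_000, // fail fast if the process hangs
  });
  return JSON.parse(stdout);
}

// Intended use, assuming the binary sits next to your script:
//   const state = readJsonFromCommand("./glance", ["screen", "--json"]);

// Stand-in command with a made-up payload, for demonstration only:
const demo = readJsonFromCommand(process.execPath, [
  "-e",
  'console.log(JSON.stringify({ app: "Ghostty", elementCount: 7 }))',
]);
console.log(demo);
```

`execFileSync` throws if the process exits non-zero or times out, so wrap the call in try/catch if a missing Accessibility permission should be handled gracefully.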

This is just the start

Windows & Linux

macOS uses AXUIElement. Windows has UI Automation, Linux has AT-SPI. Same idea, different OS APIs. We're working on it — or grab the issue and ship it first.

Smarter formatting

The prompt formatter is good but not perfect. Better grouping, context-aware truncation, app-specific templates: there is lots of room to make the text the LLM receives even tighter.
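
For a sense of what a formatter with truncation involves, here is a deliberately naive sketch. It renders elements in the line format shown earlier and cuts off after a fixed count. That is simple count-based truncation, not the context-aware kind this roadmap item describes, and `FormattableElement` and `formatElements` are hypothetical names.

```typescript
// Hypothetical element shape used only for this sketch.
interface FormattableElement {
  role: string;
  label: string;
  x: number;
  y: number;
}

// Render elements one per line, keeping at most maxElements of them and
// appending a note with the number that were dropped.
function formatElements(elements: FormattableElement[], maxElements: number): string {
  const kept = elements.slice(0, Math.max(0, maxElements));
  const lines = kept.map((el) => `- [${el.role}] "${el.label}" at (${el.x},${el.y})`);
  const dropped = elements.length - kept.length;
  if (dropped > 0) {
    lines.push(`(omitted: ${dropped})`);
  }
  return lines.join("\n");
}

const text = formatElements(
  [
    { role: "Button", label: "Back", x: 40, y: 12 },
    { role: "Button", label: "Forward", x: 70, y: 12 },
    { role: "PopUpButton", label: "Profiles", x: 420, y: 12 },
  ],
  2,
);
console.log(text);
```

A smarter version would rank elements (focused window first, interactive controls before static text) before cutting, which is where most of the remaining gains are likely to be.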

Diff mode

Instead of sending the full screen every time, only send what changed since the last capture. Fewer tokens, faster responses, lower cost.
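
One way such a diff could work, sketched under assumptions: treat two elements as the same when role, label, and position all match, then report what was added and removed between captures. The types and the `diffScreens` function are hypothetical, since diff mode is not yet part of the API described above.

```typescript
// Hypothetical element shape; not an official glance-sdk type.
interface SnapshotElement {
  role: string;
  label: string;
  x: number;
  y: number;
}

interface ScreenDiff {
  added: SnapshotElement[];
  removed: SnapshotElement[];
}

// Identity key: elements are "the same" if role, label, and position match.
function keyOf(el: SnapshotElement): string {
  return JSON.stringify([el.role, el.label, el.x, el.y]);
}

function diffScreens(previous: SnapshotElement[], current: SnapshotElement[]): ScreenDiff {
  const previousKeys = new Set(previous.map(keyOf));
  const currentKeys = new Set(current.map(keyOf));
  return {
    added: current.filter((el) => !previousKeys.has(keyOf(el))),
    removed: previous.filter((el) => !currentKeys.has(keyOf(el))),
  };
}

const before: SnapshotElement[] = [
  { role: "Tab", label: "~/projects", x: 120, y: 12 },
  { role: "Button", label: "+", x: 360, y: 12 },
];
const after: SnapshotElement[] = [
  { role: "Tab", label: "~/projects", x: 120, y: 12 },
  { role: "Tab", label: "npm run dev", x: 240, y: 12 },
  { role: "Button", label: "+", x: 480, y: 12 }, // moved right
];

const diff = diffScreens(before, after);
console.log(diff.added.length, diff.removed.length);
// prints "2 1"
```

Because position is part of the key, a moved element shows up as one removal plus one addition. Keying on role and label alone would report moves separately, at the cost of ambiguity when labels repeat.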

OmniParser fallback

For canvas apps, games, and custom-rendered UIs where accessibility trees are sparse — automatically fall back to vision-based parsing. Coming soon.

Make your AI companion
actually fast.

Open source · MIT license · One function call