Open-source SDK for macOS
Glance reads the macOS accessibility tree — the same API VoiceOver uses — and gives your AI structured data about native app UI: buttons, text fields, sliders, their labels, and exact pixel coordinates. No screenshots. No vision models. One function call.
Glance doesn't replace screenshots entirely. About 45% of the time — when your user is in a browser, text editor, terminal, or any standard productivity app — structured text is all your AI needs. It responds faster, costs less, and gets exact element positions instead of guessing from pixels.
The rest of the time? Canvas apps, games, and custom-rendered UIs expose only a sparse accessibility tree. Fall back to screenshots for those. Glance tells you when to switch. Use the right tool for the moment.
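import Anthropic from '@anthropic-ai/sdk'
import { screen } from 'glance-sdk'

const anthropic = new Anthropic() // reads ANTHROPIC_API_KEY from the environment

// one line: returns structured text, not pixels
const ctx = await screen()

// feed it to any LLM as plain text
const res = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024, // required by the Messages API
  messages: [{ role: 'user', content: ctx }]
})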
Every Clicky, Snippy, and screen companion being built right now pays the screenshot cost on every single interaction: capture, encode, upload, then wait for the vision model to squint at pixels. Your users feel that lag.
Instead of a 2 MB image to squint at, the AI gets structured data from the macOS accessibility tree:
[App: Ghostty | Window: "~/projects — zsh"]
## Tabs
- [Tab] "~/projects" at (120,12) [SELECTED]
- [Tab] "npm run dev" at (240,12)
- [Button] "+" at (360,12)
## Toolbar
- [Button] "Back" at (40,12)
- [Button] "Forward" at (70,12)
- [PopUpButton] "Profiles" at (420,12)
## Content
- [StaticText] "~/projects git:(main)" at (20,80)
- [StaticText] "$ glance screen" at (20,100)
These are native macOS apps, not browser DOM. Glance reads the OS-level accessibility tree — the same data VoiceOver uses. Coordinates are exact, not estimated from pixels.
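Because the text format is this regular, it can also be parsed back into objects when all you have is the string. The sketch below assumes every element line follows the pattern in the sample above (`- [Role] "Label" at (x,y)` with an optional trailing flag). `parseElements` and `findByLabel` are illustrative names, not part of the SDK.

```javascript
// Sketch: turn Glance-style prompt text back into element objects.
// Assumes the line format shown in the sample output; adjust if the real format differs.
const ELEMENT_LINE = /^- \[(\w+)\] "(.*)" at \((\d+),(\d+)\)(?: \[(\w+)\])?$/;

function parseElements(promptText) {
  const elements = [];
  let section = null;
  for (const rawLine of promptText.split('\n')) {
    const line = rawLine.trim();
    if (line.startsWith('## ')) {
      section = line.slice(3); // remember the current group, e.g. "Toolbar"
      continue;
    }
    const match = ELEMENT_LINE.exec(line);
    if (!match) continue; // app/window header or a line we don't recognize
    const [, role, label, x, y, flag] = match;
    elements.push({ section, role, label, x: Number(x), y: Number(y), flag: flag ?? null });
  }
  return elements;
}

// Hypothetical helper: first element whose label matches exactly, or null.
function findByLabel(elements, label) {
  return elements.find((el) => el.label === label) ?? null;
}

const sample = [
  '[App: Ghostty | Window: "~/projects"]',
  '## Tabs',
  '- [Tab] "~/projects" at (120,12) [SELECTED]',
  '## Toolbar',
  '- [Button] "Back" at (40,12)',
].join('\n');

const parsed = parseElements(sample);
console.log(findByLabel(parsed, 'Back'));
```

If you already call `capture()`, prefer its element array; parsing the string is mainly useful when the text is all that was logged or passed along.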
Glance reads the native macOS accessibility tree — the same structured data VoiceOver uses. How much it can see depends on how well the app exposes its UI.
Full element tree — every button, field, link, label, and its exact position. Text-only mode works perfectly here. This is where Glance saves you the most.
Menus, toolbars, and panels are readable — but the main content area may be partially exposed. Combine with a screenshot for full context.
Apps in this group render everything with custom drawing or have deeply nested content, so the accessibility tree is sparse or unreliable. Glance detects this automatically so you can switch to screenshots.
Glance tells you when structured text isn't enough. One check, and you switch strategies automatically.
import { capture } from 'glance-sdk'

// sendToLLM and captureScreenshot are your own helpers, not part of glance-sdk
const state = await capture()
if (state.elementCount > 5) {
  // rich structure: use text (10× faster, 30× cheaper)
  sendToLLM({ role: 'user', content: state.prompt })
} else {
  // canvas app: fall back to screenshot
  const img = await captureScreenshot()
  sendToLLM({ role: 'user', content: [{ type: 'image', data: img }] })
}
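The snippet above leaves `sendToLLM` undefined. One way to shape both branches as a user message for the Anthropic Messages API is sketched here. `buildUserMessage` and `MIN_ELEMENTS` are illustrative names, the `state` fields are the ones used in the check above, and the screenshot is assumed to be a base64-encoded PNG you captured yourself.

```javascript
const MIN_ELEMENTS = 5; // same threshold as the elementCount check above

// state: { elementCount, prompt }
// screenshotBase64: base64-encoded PNG, or null if none was captured
function buildUserMessage(state, screenshotBase64) {
  if (state.elementCount > MIN_ELEMENTS) {
    // Rich accessibility tree: plain text is enough.
    return { role: 'user', content: state.prompt };
  }
  if (!screenshotBase64) {
    throw new Error('sparse accessibility tree and no screenshot provided');
  }
  // Sparse tree: send the image, plus whatever structure was found.
  return {
    role: 'user',
    content: [
      {
        type: 'image',
        source: { type: 'base64', media_type: 'image/png', data: screenshotBase64 },
      },
      { type: 'text', text: state.prompt || 'No accessibility data was available for this window.' },
    ],
  };
}
```

Keeping the decision in one pure function makes the threshold easy to tune and test without touching the capture or network code.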
The pre-built macOS binary is bundled inside the package. Nothing else to install.
npm install glance-sdk
pip install glance-sdk
.package(url: "github.com/rishabhsai/glance", from: "0.1.0")
screen()
Returns an LLM-ready string. Every UI element with role, label, value, and exact coordinates. Drop it straight into your prompt as text.
→ string
capture()
Full structured data: app name, window title, element array, the prompt string, and timing metrics. Use when you need programmatic access or want to check elementCount for fallback logic.
→ object
find(name)
Look up a UI element by label. Returns its exact pixel position, built for Clicky-style cursor pointing. No more coordinate guessing.
→ element | null
checkAccess()
Check if macOS Accessibility permission is granted. Same permission Clicky and similar tools already require for push-to-talk.
→ boolean
Glance works as a drop-in enhancement for any screen-aware AI tool.
Replace the screenshot capture in CompanionManager.swift with Glance.screen(). Two lines changed. Responses feel instant, pointing becomes pixel-perfect.
npm install glance-sdk — the native binary ships inside node_modules. Call screen() from your main process. No native compilation needed.
pip install glance-sdk — binary bundled in the wheel. Works with LangChain, Claude SDK, OpenAI SDK, CrewAI, or any framework.
Use the CLI: ./glance screen --json outputs structured JSON to stdout. Parse it from Go, Rust, Ruby — anything that can exec a process.
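For the CLI route, a minimal Node sketch looks like this: run the binary, then parse stdout. Only the `./glance screen --json` invocation comes from the text above. The function names and default binary path are placeholders, and since the JSON shape isn't described here, the code returns the parsed object without assuming any field names.

```javascript
// Parse CLI stdout, failing loudly on empty or malformed output.
function parseGlanceOutput(stdout) {
  const text = stdout.trim();
  if (text === '') throw new Error('glance produced no output');
  try {
    return JSON.parse(text);
  } catch (err) {
    throw new Error(`glance output was not valid JSON: ${err.message}`);
  }
}

// Run `<binaryPath> screen --json` and resolve with the parsed result.
// Dynamic imports keep the sketch usable from both ESM and CommonJS files.
async function captureViaCli(binaryPath = './glance') {
  const { execFile } = await import('node:child_process');
  const { promisify } = await import('node:util');
  const { stdout } = await promisify(execFile)(binaryPath, ['screen', '--json']);
  return parseGlanceOutput(stdout);
}
```

The same two steps (exec, then JSON-decode stdout) carry over directly to Go, Rust, or Ruby.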
macOS uses AXUIElement. Windows has UI Automation, Linux has AT-SPI. Same idea, different OS APIs. We're working on it — or grab the issue and ship it first.
The prompt formatter is good but not perfect. Better grouping, context-aware truncation, and app-specific templates would all help. There is lots of room to make the LLM-ready output even tighter.
Instead of sending the full screen every time, only send what changed since the last capture. Fewer tokens, faster responses, lower cost.
For canvas apps, games, and custom-rendered UIs where accessibility trees are sparse — automatically fall back to vision-based parsing. Coming soon.
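The diff-based capture idea above (send only what changed since the last capture) can be prototyped without SDK support. In the sketch below, each element is keyed by role, label, and position, and two captures are compared. The element shape (`role`, `label`, `x`, `y`) and the function names are assumptions for illustration, not the SDK's API.

```javascript
// Identity of an element for diffing purposes (assumed fields).
const keyOf = (el) => JSON.stringify([el.role, el.label, el.x, el.y]);

// Return the elements that appeared and disappeared between two captures.
function diffCaptures(previous, current) {
  const prevKeys = new Set(previous.map(keyOf));
  const currKeys = new Set(current.map(keyOf));
  return {
    added: current.filter((el) => !prevKeys.has(keyOf(el))),
    removed: previous.filter((el) => !currKeys.has(keyOf(el))),
  };
}

const before = [
  { role: 'Button', label: 'Back', x: 40, y: 12 },
  { role: 'StaticText', label: '$ glance screen', x: 20, y: 100 },
];
const after = [
  { role: 'Button', label: 'Back', x: 40, y: 12 },
  { role: 'StaticText', label: '$ npm run dev', x: 20, y: 100 },
];

const delta = diffCaptures(before, after);
console.log(delta.added.length, delta.removed.length); // 1 1
```

With this keying, an element that moves or changes its label shows up as one removal plus one addition. A real implementation would likely want a stable element identity so such changes can be reported as updates.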
Open source · MIT license · One function call