Glance

Let's be real for a second.

Glance doesn't replace screenshots entirely. About 45% of the time — when your user is in a browser, text editor, terminal, or any standard productivity app — structured text is all your AI needs. It responds faster, costs less, and gets exact element positions instead of guessing from pixels.

The other 55%? Some apps expose only part of their UI, and canvas apps, games, and custom-rendered UIs expose very little, so the accessibility tree is sparse there. Add a screenshot or fall back to one for those. Glance tells you when to switch. Use the right tool for the moment.


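import Anthropic from '@anthropic-ai/sdk'
import { screen } from 'glance-sdk'

// reads ANTHROPIC_API_KEY from the environment
const anthropic = new Anthropic()

// one line: returns structured text, not pixels
const ctx = await screen()

// feed it to any LLM as plain text
const res = await anthropic.messages.create({
  model: 'claude-sonnet-4-6',
  max_tokens: 1024, // required by the Messages API
  messages: [{ role: 'user', content: ctx }]
})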

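import anthropic
from glance_sdk import screen

# reads ANTHROPIC_API_KEY from the environment
client = anthropic.Anthropic()

# one line: returns structured text, not pixels
ctx = screen()

# feed it to any LLM as plain text
res = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,  # required by the Messages API
    messages=[{"role": "user", "content": ctx}]
)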

import Glance

// one line — returns structured text, not pixels
let ctx = try Glance.screen()

// or get structured data with exact positions
let state = try Glance.capture()
// unwrap first: printing the optional directly would show Optional(...)
if let center = state.elements.first?.center {
    print(center)
}
// → e.g. (520.0, 340.0)

The screenshot tax

Every Clicky, Snippy, and screen companion being built right now pays it on every single interaction. Capture, encode, upload, wait for the vision model to squint at pixels. Your users feel that lag.

Screenshot pipeline

1. Capture 2MB screenshot
2. Base64 encode, upload to API
3. Vision model processes ~3,000 tokens
4. AI guesses element coordinates

2–5 sec · ~$0.03 / call · ~50px off

With Glance

1. Read OS accessibility tree (30ms)
2. Send ~500 tokens of structured text
3. Exact coordinates included free

~0.8 sec · ~$0.001 / call · 0px off

Why is it so much faster? A screenshot has to be captured, encoded, uploaded, and then read by a vision model as roughly 3,000 image tokens before the first word of the reply appears. That adds 1–3 seconds your user spends waiting. Structured text skips the upload and gives the model about 500 tokens to read, so generation starts almost right away and the response feels instant.

What your LLM actually receives

Instead of a 2MB image for the AI to squint at, it gets structured data from the macOS accessibility tree:

[App: Ghostty | Window: "~/projects — zsh"]

## Tabs
- [Tab] "~/projects" at (120,12) [SELECTED]
- [Tab] "npm run dev" at (240,12)
- [Button] "+" at (360,12)

## Toolbar
- [Button] "Back" at (40,12)
- [Button] "Forward" at (70,12)
- [PopUpButton] "Profiles" at (420,12)

## Content
- [StaticText] "~/projects git:(main)" at (20,80)
- [StaticText] "$ glance screen" at (20,100)

[App: Code | Window: "index.ts — my-project"]

## Sidebar
- [OutlineRow] "src" at (40,120) [EXPANDED]
- [OutlineRow] "index.ts" at (60,145) [SELECTED]
- [OutlineRow] "utils.ts" at (60,170)

## Editor Tabs
- [Tab] "index.ts" at (280,40) [SELECTED]
- [Tab] "package.json" at (380,40)

## Editor
- [TextArea] line 42, col 15 at (500,300) [FOCUSED]

## Terminal Panel
- [StaticText] "npm run dev" at (280,650)
- [StaticText] "Server running on :3000" at (280,670)

[App: Slack | Window: "Acme Inc"]

## Channels
- [StaticText] "# general" at (40,120)
- [StaticText] "# engineering" at (40,150) [SELECTED]
- [StaticText] "# random" at (40,180)

## Messages
- [Group] "Alice: shipped the fix" at (400,200)
- [Group] "Bob: LGTM 🚀" at (400,280)
- [Link] "View PR #142" at (420,310)

## Compose
- [TextArea] "Message #engineering" at (400,650) [FOCUSED]
- [Button] "Send" at (850,650)

[App: DaVinci Resolve | Window: "Project 1 - Edit"]

## Focused
- [Slider] "Midtones" value=0.32 at (510,390) [FOCUSED]

## Controls
- [Button] "Cut" at (120,42)
- [Button] "Color" at (680,42)
- [PopUpButton] "Node" value="Corrector 1" at (820,42)

## Color Wheels
- [Slider] "Lift" value=0.15 at (400,380)
- [Slider] "Gamma" value=-0.08 at (560,380)
- [Slider] "Gain" value=0.22 at (720,380)

These are desktop apps read through the OS, not a browser DOM. Glance reads the OS-level accessibility tree, the same data VoiceOver uses. Coordinates are exact, not estimated from pixels.
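If you want those elements as data rather than prose, the text is regular enough to parse. The sketch below is a minimal parser inferred from the sample output above, not from a published grammar: the `Element` type, `ELEMENT_RE`, and `parse_elements` are hypothetical names, and the real format may include cases this pattern misses. In practice capture() already returns an element array, so this is mainly useful when all you have is the string.

```python
import re
from typing import NamedTuple, Optional


class Element(NamedTuple):
    role: str
    label: str
    x: int
    y: int
    state: Optional[str]


# Matches lines shaped like: - [Button] "Send" at (850,650) [FOCUSED]
# Assumption: every element line follows the shape shown in the samples.
ELEMENT_RE = re.compile(
    r'^- \[(?P<role>\w+)\] (?P<body>.*) at \((?P<x>-?\d+),(?P<y>-?\d+)\)'
    r'(?: \[(?P<state>[A-Z]+)\])?$'
)


def parse_elements(prompt: str) -> list[Element]:
    """Extract element lines; skip app headers, section headings, blanks."""
    elements = []
    for line in prompt.splitlines():
        match = ELEMENT_RE.match(line.strip())
        if match is None:
            continue
        body = match["body"]
        # Use the first quoted string as the label when there is one,
        # otherwise keep the whole body (e.g. 'line 42, col 15').
        quoted = re.match(r'"([^"]*)"', body)
        elements.append(Element(
            role=match["role"],
            label=quoted.group(1) if quoted else body,
            x=int(match["x"]),
            y=int(match["y"]),
            state=match["state"],
        ))
    return elements


SAMPLE = """[App: DaVinci Resolve | Window: "Project 1 - Edit"]

## Focused
- [Slider] "Midtones" value=0.32 at (510,390) [FOCUSED]

## Controls
- [Button] "Cut" at (120,42)
- [PopUpButton] "Node" value="Corrector 1" at (820,42)
"""

focused = [e for e in parse_elements(SAMPLE) if e.state == "FOCUSED"]
```

Here `focused` holds the single Midtones slider with its coordinates, which is the kind of lookup a pointing tool would need.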

Where Glance shines

Glance reads the native macOS accessibility tree — the same structured data VoiceOver uses. How much it can see depends on how well the app exposes its UI.

Best for (~45% of use cases)

Full element tree — every button, field, link, label, and its exact position. Text-only mode works perfectly here. This is where Glance saves you the most.

Code editors — VS Code, Cursor, Zed, Xcode. Sees file trees, tabs, editor content, terminal output.
Terminals — Ghostty, iTerm2, Warp, Terminal.app. Tabs, buttons, text content all exposed.
Chat & productivity — Slack, Discord, Notion, Notes, Mail. Full message and UI structure.
System apps — Finder, System Settings, Activity Monitor. Everything exposed.
Electron apps — Any app built on Electron inherits Chromium's accessibility tree.

Works, with gaps (~30% of use cases)

Menus, toolbars, and panels are readable — but the main content area may be partially exposed. Combine with a screenshot for full context.

Browsers — Chrome, Safari, Arc. Tabs, address bar, toolbar yes. Web content varies by site — well-structured ARIA sites work better.
DaVinci Resolve — color panels, timeline controls, menus yes. Video viewer no.
Adobe apps — toolbars, layers panel, menus yes. Artboard/canvas no.
Figma desktop — app chrome yes. Design canvas no.
Blender — UI panels yes. 3D viewport no.

Use screenshot (~25% of use cases)

These apps draw their own UI or bury content in deeply nested structures, so the accessibility tree is sparse or unreliable. Glance detects this automatically so you can switch to screenshots.

Games and canvas apps
WebGL / custom-rendered UIs
Deep web page content
Remote desktop streams

Smart fallback, built in

Glance tells you when structured text isn't enough. One check, and you switch strategies automatically.

import { capture } from 'glance-sdk'

const state = await capture()

if (state.elementCount > 5) {
  // rich structure: use text (~4× faster, ~30× cheaper per the table below)
  sendToLLM({ role: 'user', content: state.prompt })
} else {
  // canvas app — fall back to screenshot
  const img = await captureScreenshot()
  sendToLLM({ role: 'user', content: [{ type: 'image', data: img }] })
}

from glance_sdk import capture

state = capture()

if state["elementCount"] > 5:
    # rich structure: use text (~4× faster, ~30× cheaper per the table below)
    send_to_llm(role="user", content=state["prompt"])
else:
    # canvas app — fall back to screenshot
    img = capture_screenshot()
    send_to_llm(role="user", content=[{"type": "image", "data": img}])

let state = try Glance.capture()

if state.elementCount > 5 {
    // rich structure: use text (~4× faster, ~30× cheaper per the table below)
    sendToLLM(role: "user", content: state.prompt)
} else {
    // canvas app — fall back to screenshot
    let img = try await captureScreenshot()
    sendToLLM(role: "user", content: [.image(img)])
}
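The snippets above leave sendToLLM and captureScreenshot up to you. As one possible shape for the message itself, here is a sketch that builds an Anthropic-style user message from a capture result. It assumes the state is a dict with the `elementCount` and `prompt` keys used in the Python snippet, and that you supply the screenshot as PNG bytes; `build_message` and `MIN_ELEMENTS` are hypothetical names, not part of the SDK.

```python
import base64
from typing import Any, Optional

MIN_ELEMENTS = 5  # same threshold as the fallback snippets above


def build_message(
    state: dict[str, Any],
    screenshot_png: Optional[bytes] = None,
) -> dict[str, Any]:
    """Return a text message for rich trees, an image message otherwise."""
    if state["elementCount"] > MIN_ELEMENTS:
        # Rich structure: plain text is enough.
        return {"role": "user", "content": state["prompt"]}

    if screenshot_png is None:
        raise ValueError("sparse accessibility tree and no screenshot supplied")

    # Sparse structure: send the screenshot as a base64 image block.
    return {
        "role": "user",
        "content": [{
            "type": "image",
            "source": {
                "type": "base64",
                "media_type": "image/png",
                "data": base64.b64encode(screenshot_png).decode("ascii"),
            },
        }],
    }
```

Keeping the decision in a pure function like this makes the threshold easy to tune and test without touching the screen or the network.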

| | Glance (text) | Screenshot (image) |
|---|---|---|
| Capture | ~30ms | ~50ms |
| Network | nothing to upload | 200–500ms (2MB image) |
| LLM thinking | ~0.3s (500 text tokens) | ~1.8s (3,000 image tokens) |
| Total latency | ~0.8s | ~3.5s |
| Element positions | exact (from OS) | estimated (from pixels) |
| Cost | ~$0.001 | ~$0.03 |
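To see what those totals mean in practice, here is the arithmetic worked out. The latency and cost figures are copied from the comparison above; the call volume is a made-up number for illustration only.

```python
# Figures taken from the comparison above.
glance = {"latency_s": 0.8, "cost_usd": 0.001}
screenshot = {"latency_s": 3.5, "cost_usd": 0.03}

# How many times faster and cheaper the text path is.
speedup = screenshot["latency_s"] / glance["latency_s"]
cost_ratio = screenshot["cost_usd"] / glance["cost_usd"]

# Hypothetical volume: 1,000 interactions per day.
calls_per_day = 1_000
daily_saving = calls_per_day * (screenshot["cost_usd"] - glance["cost_usd"])

print(f"{speedup:.1f}x faster, {cost_ratio:.0f}x cheaper")
print(f"${daily_saving:.2f} saved per {calls_per_day} calls")
```

By these figures the text path is a bit over four times faster end to end and thirty times cheaper per call. The savings only apply to the share of interactions where structured text is enough.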

Get started

The pre-built macOS binary is bundled inside the package. Nothing else to install.

npm

npm install glance-sdk

pip

pip install glance-sdk

swift

.package(url: "https://github.com/rishabhsai/glance", from: "0.1.0")

Four functions. That's the API.

screen()

Returns an LLM-ready string. Every UI element with role, label, value, and exact coordinates. Drop it straight into your prompt as text.

→ string

capture()

Full structured data — app name, window title, element array, the prompt string, and timing metrics. Use when you need programmatic access or want to check elementCount for fallback logic.

→ object

find(name)

Look up a UI element by label. Returns its exact pixel position — built for Clicky-style cursor pointing. No more coordinate guessing.

→ element | null
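The page doesn't spell out how find(name) matches labels, so treat the following as an illustration of the idea rather than the SDK's behavior. It assumes elements are dicts with `label`, `x`, and `y` keys (a hypothetical shape) and prefers an exact case-insensitive match before falling back to a substring match.

```python
from typing import Optional, TypedDict


class Element(TypedDict):
    role: str
    label: str
    x: int
    y: int


def find_by_label(elements: list[Element], name: str) -> Optional[Element]:
    """Return the first matching element, or None if nothing matches."""
    needle = name.casefold()

    # First pass: exact match, ignoring case.
    for element in elements:
        if element["label"].casefold() == needle:
            return element

    # Second pass: label merely contains the name.
    for element in elements:
        if needle in element["label"].casefold():
            return element

    return None


elements: list[Element] = [
    {"role": "TextArea", "label": "Message #engineering", "x": 400, "y": 650},
    {"role": "Button", "label": "Send", "x": 850, "y": 650},
]

target = find_by_label(elements, "send")
```

With the sample list, `target` is the Send button, and its `x` and `y` are what a cursor-pointing tool would move to.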

checkAccess()

Check if macOS Accessibility permission is granted. Same permission Clicky and similar tools already require for push-to-talk.

→ boolean

Drop it into your stack

Glance works as a drop-in enhancement for any screen-aware AI tool.

Clicky / Snippy

Replace the screenshot capture in CompanionManager.swift with Glance.screen(). Two lines changed. Responses feel instant, pointing becomes pixel-perfect.

Electron apps

npm install glance-sdk — the native binary ships inside node_modules. Call screen() from your main process. No native compilation needed.

Python agents

pip install glance-sdk — binary bundled in the wheel. Works with LangChain, Claude SDK, OpenAI SDK, CrewAI, or any framework.

Any language

Use the CLI: ./glance screen --json outputs structured JSON to stdout. Parse it from Go, Rust, Ruby — anything that can exec a process.
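Calling the CLI from another language comes down to running the process and decoding stdout. Here is a sketch in Python using only the standard library. The `./glance screen --json` command is the one quoted above; `read_screen` is a hypothetical helper, and the JSON shape is not assumed, so the function returns whatever the CLI emits.

```python
import json
import subprocess
from typing import Any, Callable


def read_screen(
    binary: str = "./glance",
    run: Callable[..., Any] = subprocess.run,
) -> Any:
    """Run `glance screen --json` and return the decoded JSON."""
    result = run(
        [binary, "screen", "--json"],
        capture_output=True,  # collect stdout and stderr
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError on non-zero exit
    )
    return json.loads(result.stdout)
```

The `run` parameter exists so the process call can be swapped for a stub in tests. In normal use you would call `read_screen()` with no arguments.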

This is just the start

Windows & Linux

macOS uses AXUIElement. Windows has UI Automation, Linux has AT-SPI. Same idea, different OS APIs. We're working on it — or grab the issue and ship it first.

Smarter formatting

The prompt formatter is good but not perfect. Better grouping, context-aware truncation, and app-specific templates would all make the prompt tighter.

Diff mode

Instead of sending the full screen every time, only send what changed since the last capture. Fewer tokens, faster responses, lower cost.
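Diff mode isn't shipped yet, so what follows is only a rough sketch of the idea, not the planned implementation. It compares two prompt strings line by line and keeps the element lines that appeared or disappeared. It assumes element lines start with "- " as in the sample output; `diff_prompt` is a hypothetical name.

```python
def diff_prompt(previous: str, current: str) -> str:
    """Return only the element lines that changed between two captures."""
    before = [l for l in previous.splitlines() if l.startswith("- ")]
    after = [l for l in current.splitlines() if l.startswith("- ")]
    before_set, after_set = set(before), set(after)

    # Drop the leading "- " and tag each line with what happened to it.
    added = [f"added: {l[2:]}" for l in after if l not in before_set]
    removed = [f"removed: {l[2:]}" for l in before if l not in after_set]
    return "\n".join(added + removed)


previous = '## Tabs\n- [Tab] "a" at (1,2) [SELECTED]\n- [Tab] "b" at (3,4)'
current = '## Tabs\n- [Tab] "a" at (1,2)\n- [Tab] "b" at (3,4) [SELECTED]'

delta = diff_prompt(previous, current)
```

An unchanged screen produces an empty string, which is the cheapest possible prompt. A real version would probably match elements by identity rather than by exact line, so that a moved or re-selected element shows up as one change instead of an add plus a remove.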

OmniParser fallback

For canvas apps, games, and custom-rendered UIs where accessibility trees are sparse — automatically fall back to vision-based parsing. Coming soon.

Your AI doesn't need
to see the screen.

It just needs to know
what's on it.

Make your AI companion
actually fast.
