AppleScript-8B & AppleScript-16B: Small Models That Actually Drive Your Mac
Two open models that turn plain-English requests into working macOS automation — on-device, tool-calling, quantized for Apple Silicon.
Today we're releasing two open models that turn plain-English requests into working macOS automation: AppleScript-8B and AppleScript-16B-A4B. Give either one a run_applescript tool and it emits a structured tool call containing correct, ready-to-run AppleScript — to open and read Safari tabs, file and tag things in Finder, draft Mail and Notes, flip System Settings, grab the clipboard, take screenshots, call ASObjC, or shell out with do shell script. Both are quantized for on-device inference on Apple Silicon and run inside Osaurus or any MLX runtime.
AppleScript is the oldest, broadest, and least fashionable way to control a Mac — which is exactly why general-purpose LLMs are bad at it. There's very little high-quality AppleScript on the web, the syntax is idiosyncratic ("English-like" but unforgiving), and the long tail of app dictionaries and ASObjC bridges is enormous. A frontier model will happily hallucinate a tell application "Safari" block that doesn't compile. We wanted a model small enough to live on a laptop that gets it right — and produces a tool call an agent can execute, not just a code block to copy-paste.
We're shipping two sizes so you can trade latency for accuracy:
| | AppleScript-8B | AppleScript-16B-A4B |
|---|---|---|
| Tier | Fast / small | Flagship quality |
| Base | Zyphra/ZAYA1-8B (MoE) | google/gemma-4-26B-A4B (MoE), expert-pruned |
| Parameters | 8.84 B total | 16.1 B total · ~4 B active |
| Bundle size | ~5.6 GB | ~11 GB |
| Quant | JANG_4M (8-bit attn / 4-bit experts) | JANG_4M (8-bit attn / 4-bit experts) |
| Tool-call format | <zyphra_tool_call> (native) | gemma tool-call (native) |
| Runtime | MLX / Osaurus | MLX / Osaurus |
Pick the 8B when you want the lowest latency and smallest footprint for on-device automation. Pick the 16B-A4B when correctness matters more than milliseconds — it's the highest-quality AppleScript model in the line.
Results
We evaluate on a held-out benchmark of 87 tasks spanning the breadth of the corpus — app automation, system control, language/parsing, ASObjC, and do shell script — none of which appear in training. Every task is scored in three tiers:
- Tool-call emission: did the model emit a valid `run_applescript` call at all?
- Compile: does the script compile as valid AppleScript (`osacompile`)?
- Exec: for the hardest pure-computational subset, does it run and return the correct result (sandboxed `osascript`, output checked against expected)?
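As a rough illustration, the three tiers could be scored along these lines. This is a sketch under assumptions, not the harness behind the numbers below: `compiles`, `runs_to`, and `score_task` are hypothetical names, the tool call is assumed to be already parsed into a dict, and the sandbox is omitted.

```python
import subprocess
import tempfile
from pathlib import Path


def compiles(script: str) -> bool:
    """Tier 2: True if osacompile accepts the script (macOS only)."""
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "task.applescript"
        src.write_text(script, encoding="utf-8")
        proc = subprocess.run(
            ["osacompile", "-o", str(Path(tmp) / "task.scpt"), str(src)],
            capture_output=True,
        )
    return proc.returncode == 0


def runs_to(script: str, expected: str, timeout: float = 30.0) -> bool:
    """Tier 3: True if osascript prints the expected result (no sandbox here)."""
    try:
        proc = subprocess.run(
            ["osascript", "-e", script],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return proc.returncode == 0 and proc.stdout.strip() == expected.strip()


def score_task(call, expected=None, compile_fn=compiles, exec_fn=runs_to):
    """Score one task; `call` is an already-parsed tool call dict, or None."""
    result = {"tool_call": False, "compile": False, "exec": None}
    if not call or call.get("name") != "run_applescript":
        return result  # tier 1 failed: no usable tool call
    script = call.get("arguments", {}).get("script")
    if not isinstance(script, str):
        return result
    result["tool_call"] = True
    result["compile"] = compile_fn(script)
    if expected is not None:  # only the computational subset has an expected value
        result["exec"] = result["compile"] and exec_fn(script, expected)
    return result
```

Passing `compile_fn` and `exec_fn` as arguments keeps the scoring logic testable on machines without `osacompile` or `osascript`.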
Figure: Held-out AppleScript benchmark (87 tasks), comparing AppleScript-8B (fast tier) and AppleScript-16B-A4B (flagship). compile = valid AppleScript; exec = runs and returns the correct result.
| Model | Tool-call emission | Compile | Exec |
|---|---|---|---|
| Base ZAYA1-8B | ✗ (writes raw, no tool calls) | 28.9% | 30.0% |
| AppleScript-8B | 100% | 93.4% | 70.8% |
| Base gemma-4-26B-A4B | ✗ (writes raw, no tool calls) | ~88% | unreliable |
| AppleScript-16B-A4B | 100% | 100% | 84% |
Two things stand out. First, tool-calling is learned, not free: both base models emit zero structured tool calls even when handed a correct tool spec — they just dump raw AppleScript. After fine-tuning, both emit 100% valid run_applescript calls. Second, the 16B flagship compiles every held-out task and runs 84% of the hardest computational ones correctly; the 8B fast tier isn't far behind on compile (93.4%) at half the size and latency.
The base-model gap also tells you where the work went. Gemma-4-26B-A4B is already a strong coder (~88% compile out of the box) — it mostly needed AppleScript polish and the tool-calling format. ZAYA1-8B started at 28.9% compile; for the small tier, the fine-tune is doing the heavy lifting, taking it to 93.4%.
How they're built
Both models follow the same pipeline: a synthetic-plus-real AppleScript dataset → LoRA supervised fine-tune → merge → (optional) expert prune → JANG 4-bit quantization. The differences are in the base model and whether pruning helps.
The data
There is no large, clean, instruction-formatted AppleScript dataset, so we built one. We generated examples with a strong code model across a four-axis taxonomy designed to cover the real surface area of the language and the OS:
- Language (24 axes): handlers, script objects, lists/records, text parsing, coercions, date/time, error handling, reference forms, and the ASObjC bridge (`NSString`, `NSArray`, `NSFileManager`, `NSWorkspace`, `NSPasteboard`, `NSUserDefaults`, `NSRegularExpression`, `NSJSONSerialization`, `NSURLSession`, …).
- macOS surface (46 axes): Safari/Chrome/Arc, Finder, System Settings UI-scripting, Control Center, Dock, Mission Control, window/keyboard/mouse automation, clipboard, screenshots, dialogs, Mail/Calendar/Reminders/Notes/Contacts/Messages/Music/Photos, Shortcuts, Spotlight, Terminal, launchd, and more.
- Task shape (9+ types): natural imperative requests, "how do I…" Q&A, debug/fix, explain, convert (shell↔AppleScript), refactor/modernize, complete-the-script, reusable handlers, terse one-liners, and multi-step agentic flows.
- Real-world topic (26 topics): email, finance, file organization, research, productivity, scheduling, and other everyday domains, so the model sees automation in context.
Every generated script is run through osacompile — if it doesn't compile, it's discarded. Survivors are canonicalized (osacompile → osadecompile), de-duplicated by script body, whitespace-cleaned, and rebalanced so the dominant macOS-surface category is capped (43% → 30%) instead of swamping the language and ASObjC examples. We also seed from real, hand-written corpora — including the community kevin-funderburg/AppleScripts collection — to ground the synthetic data in idiomatic style.
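The de-duplication and rebalancing steps can be sketched in a few lines. This is an illustration only: the `script` and `category` field names are hypothetical, the `osacompile`/`osadecompile` canonicalization is omitted, and whitespace normalization stands in for whatever cleaning was actually used.

```python
import hashlib
import random


def normalize(script: str) -> str:
    """Strip trailing whitespace and blank lines to build a de-dup key."""
    lines = [ln.rstrip() for ln in script.strip().splitlines()]
    return "\n".join(ln for ln in lines if ln)


def dedupe(records):
    """Keep the first record for each distinct normalized script body."""
    seen, out = set(), []
    for rec in records:
        key = hashlib.sha256(normalize(rec["script"]).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            out.append(rec)
    return out


def cap_category(records, category, max_share, seed=0):
    """Downsample one category so it is at most `max_share` of the result."""
    inside = [r for r in records if r["category"] == category]
    outside = [r for r in records if r["category"] != category]
    # Solve limit / (limit + len(outside)) <= max_share for limit.
    limit = int(max_share * len(outside) / (1 - max_share))
    if len(inside) <= limit:
        return list(records)
    kept = random.Random(seed).sample(inside, limit)
    return outside + kept
```

With a 43/57 split, `cap_category(data, "macos", 0.30)` keeps 24 of the 43 dominant-category records, which brings that category just under 30% of the remaining 81.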
The funnel, with real counts:
| Stage | Records |
|---|---|
| Raw generated | 17,172 |
| `osacompile`-validated | 15,031 |
| Cleaned · de-duped · rebalanced | 12,257 |
| Train / val split (90/10, stratified by task type) | 11,005 / 1,225 |
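A stratified 90/10 split can be sketched as below. The `task_type` field name and the per-group rounding are assumptions, so the exact counts in the table will not necessarily be reproduced.

```python
import random
from collections import defaultdict


def stratified_split(records, key="task_type", val_fraction=0.10, seed=0):
    """Split records so each task type contributes ~val_fraction to validation."""
    by_type = defaultdict(list)
    for rec in records:
        by_type[rec[key]].append(rec)

    rng = random.Random(seed)
    train, val = [], []
    for name in sorted(by_type):  # sorted for a reproducible order
        group = by_type[name][:]
        rng.shuffle(group)
        n_val = round(len(group) * val_fraction)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val
```

Splitting within each task type keeps rare shapes (for example multi-step agentic flows) represented in validation instead of leaving that to chance.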
The held-out 87-task benchmark is leakage-deduped against the training set (normalized script + instruction-paraphrase matching), so the numbers above measure generalization, not memorization.
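One way to picture the leakage check: drop any benchmark item whose script matches a training script after normalization, or whose instruction looks like a paraphrase of a training instruction. The sketch below uses token-set Jaccard similarity as a crude paraphrase proxy; that metric, the 0.8 threshold, and the field names are assumptions rather than the method actually used.

```python
import re


def norm_script(script: str) -> str:
    """Lowercase and collapse whitespace (deliberately aggressive for matching)."""
    return " ".join(script.lower().split())


def tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def leaks(bench_item, train_set, threshold=0.8):
    """True if the benchmark item overlaps any training example."""
    target = norm_script(bench_item["script"])
    for ex in train_set:
        if norm_script(ex["script"]) == target:
            return True
        if jaccard(ex["instruction"], bench_item["instruction"]) >= threshold:
            return True
    return False
```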
Training
We fine-tune with LoRA (rank 16, prompt-masked so loss lands only on the AppleScript answer) and keep the recipe deliberately small to avoid overfitting 12k examples. A rank-16 adapter has limited capacity to memorize the corpus, we early-stop on the held-out validation set, and we treat executable correctness, not validation loss, as the source of truth: validation loss and exec accuracy disagreed on the best checkpoint, and we shipped the exec-best one. The adapter is then merged into the base weights to produce a standalone model.
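Prompt masking is easy to show in miniature. The sketch below uses the common `-100` ignore-label convention, which is an assumption (some training stacks use an explicit boolean mask instead), and it leaves out the one-token shift used for next-token prediction.

```python
IGNORE_INDEX = -100  # conventional "skip this position" label; stacks differ


def build_labels(prompt_ids, answer_ids):
    """Concatenate prompt and answer, masking the prompt out of the loss."""
    input_ids = list(prompt_ids) + list(answer_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels


def masked_mean_loss(token_losses, labels):
    """Average per-token losses over answer positions only."""
    kept = [loss for loss, y in zip(token_losses, labels) if y != IGNORE_INDEX]
    return sum(kept) / len(kept) if kept else 0.0
```

The effect is that the model is never rewarded for reproducing the user's request or the tool spec, only for the AppleScript answer that follows.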
ZAYA1-8B isn't natively supported by the training stack, so we registered its zaya MoE architecture and merged the adapter manually (matching the fuse math exactly) to get a clean standalone bundle.
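For reference, the textbook LoRA merge folds the low-rank update into each adapted weight matrix. The symbols below are assumptions, since the post does not define them, and the exact fuse code may differ in layout or scaling convention:

```latex
W_{\text{merged}} = W_0 + s \, B A,
\qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\;
A \in \mathbb{R}^{r \times d_{\text{in}}},\;
r = 16
```

Here $W_0$ is the frozen base weight, $A$ and $B$ are the trained adapter factors, and $s$ is whatever scaling the training stack applies ($\alpha / r$ in the original LoRA formulation). A manual merge is only faithful if it uses the same $s$ and the same matrix orientation as the forward pass used during training.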
Pruning is free (on the 16B)
The 16B is derived from a 26B-parameter, 128-expert MoE. Before quantizing, we run a saliency probe on AppleScript activations and apply REAP expert pruning, dropping 40% of experts (keeping 77 of 128) to land at 16.1 B total / ~4 B active.
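Mechanically, the pruning step amounts to ranking experts by a saliency score and keeping the top 77. The sketch below assumes per-expert scores are already computed for one layer; it does not reproduce REAP's saliency criterion, and `prune_experts` and `slice_router` are hypothetical helper names.

```python
def prune_experts(saliency, keep):
    """Return the sorted indices of the `keep` highest-saliency experts."""
    ranked = sorted(range(len(saliency)), key=lambda i: saliency[i], reverse=True)
    return sorted(ranked[:keep])


def slice_router(router_rows, kept):
    """Keep only the router rows (one per expert) for surviving experts."""
    return [router_rows[i] for i in kept]


# 77 of 128 experts survive, so 51 are dropped: 51 / 128 is about 39.8%.
dropped_fraction = 1 - 77 / 128
```

The router has to be sliced with the same index list as the expert weights; otherwise it would still route tokens to experts that no longer exist.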
The surprising part: pruning costs nothing here. Measured on the benchmark, the fine-tuned model unpruned (17 GB) and pruned (11 GB) score identically — same compile, same exec. The exec gap versus full-precision (bf16) comes from 4-bit expert quantization, not pruning. So the pruned 11 GB bundle is a pure 35% size win with zero quality cost — which is why it's the only 16B variant we ship. (If you want the bf16 exec back in a deployable bundle, the lever is mixed-precision — keeping the exact-API ASObjC experts at 8-bit — not less pruning.)
The 8B has only 16 experts, so there's no pruning headroom worth taking; it ships fine-tuned and quantized, unpruned.
Quantization
Both models are quantized to JANG_4M: 8-bit affine for attention and the MoE router, 4-bit affine for the routed expert weights, with the embeddings and output head kept high-precision. Together with expert pruning on the 16B, that's what gets a 26B-class model into 11 GB and an 8B into 5.6 GB while staying coherent, small enough to load and run on a MacBook.
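Affine quantization itself is simple: each small group of weights is stored as integer codes plus a scale and an offset. The sketch below is a generic min/max affine quantizer in plain Python, not the JANG_4M implementation; group size and rounding details are assumptions.

```python
def quantize_group(weights, bits=4):
    """Map a group of floats to integer codes in [0, 2**bits - 1]."""
    levels = (1 << bits) - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0  # avoid a zero scale for constant groups
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, bias):
    """Reconstruct approximate weights: w is roughly code * scale + bias."""
    return [c * scale + bias for c in codes]
```

At 4 bits each group has 16 representable levels and at 8 bits it has 256, which is consistent with giving the more sensitive attention and router weights the wider format.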
Using them
Hand the model a run_applescript tool and parse the tool call it emits:
```python
from mlx_lm import load, generate

model, tok = load("OsaurusAI/Osaurus-AppleScript-16B-A4B-JANG_4M")
# or: "OsaurusAI/Osaurus-AppleScript-8B-JANG_4M"

tools = [
    {
        "type": "function",
        "function": {
            "name": "run_applescript",
            "description": "Execute AppleScript on macOS and return its output.",
            "parameters": {
                "type": "object",
                "properties": {"script": {"type": "string"}},
                "required": ["script"],
            },
        },
    }
]

msgs = [{"role": "user", "content": "Get the URL of the front Safari tab."}]
prompt = tok.apply_chat_template(
    msgs, tools=tools, add_generation_prompt=True, tokenize=False
)
print(generate(model, tok, prompt=prompt, max_tokens=300))
# -> a run_applescript tool call containing valid AppleScript
```

Your agent runs the returned script (e.g. via `osascript`) and feeds the result back into the loop. Omit `tools=` and the model writes AppleScript directly instead, which is handy for "give me a script" use cases. Both models drop straight into Osaurus for a local, no-setup agent loop.
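The "run it and feed the result back" step might look like the sketch below. It assumes the runtime has already parsed the model-specific tool-call markup into a dict with `name` and `arguments`; the `handle_tool_call` helper and the shape of the returned tool message are hypothetical and depend on your chat template.

```python
import json
import subprocess


def run_applescript(script: str, timeout: float = 30.0):
    """Run a script with osascript (macOS only); return (ok, text)."""
    try:
        proc = subprocess.run(
            ["osascript", "-e", script],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, "timed out"
    if proc.returncode != 0:
        return False, proc.stderr.strip()
    return True, proc.stdout.strip()


def handle_tool_call(call, runner=run_applescript):
    """Execute one parsed tool call and build the message to send back."""
    name = call.get("name")
    if name != "run_applescript":
        return {"role": "tool", "name": name, "content": "error: unknown tool"}
    args = call.get("arguments", {})
    if isinstance(args, str):  # some runtimes pass arguments as a JSON string
        args = json.loads(args)
    ok, output = runner(args["script"])
    content = output if ok else f"error: {output}"
    return {"role": "tool", "name": "run_applescript", "content": content}
```

Returning errors as tool output, rather than raising, lets the model see the `osascript` failure and try a corrected script on the next turn. Since the script is model-generated, an agent may also want a confirmation step before running anything destructive.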
Get started
- 🤗 Osaurus-AppleScript-16B-A4B-JANG_4M — flagship
- 🤗 Osaurus-AppleScript-8B-JANG_4M — fast tier
- 🦕 Run locally in Osaurus
Both are released under their respective base-model licenses (Gemma; Zyphra/ZAYA) — review and comply with the base terms.
References & acknowledgements
- Base models: Zyphra/ZAYA1-8B, google/gemma-4-26B-A4B.
- Inference & quantization: MLX; JANG_4M mixed-precision affine quantization (Osaurus).
- Expert pruning: REAP saliency-based MoE pruning.
- Validation/execution: Apple's `osacompile` / `osascript`.
- Seed corpus: kevin-funderburg/AppleScripts.
Citation
@misc{osaurus2026applescript,
title = {Osaurus-AppleScript-8B and Osaurus-AppleScript-16B-A4B:
on-device tool-calling models for macOS automation},
author = {Osaurus},
year = {2026},
url = {https://osaurus.ai/blog/applescript-models}
}