Texas School Board Video Archive Discovery
A study on making tools that work with agents to produce unlocks. The domain: locating board meeting video archives across Texas's ~1,060 independent school districts — a fragmented, manual problem with no central index.
There's no central database, no API, no unified platform. A human would have to visit 1,000+ district websites by hand. Pure automation fails — each district uses different platforms, different structures, different edge cases. Scripts can handle the predictable standard platforms; the rest requires judgment. The combination — automation for the automatable, agent instructions for the rest, validation to catch mistakes — is one of the few approaches that scales.
The hypothesis
Most tools assume human operators — checkpointing, validation, and instructions are afterthoughts. But if you design for agents from the start — validation gates, resumable batching, explicit state, clear failure modes, platform-specific rules — you get a different kind of unlock. Can an agent reliably process hundreds of districts with minimal human intervention?
What I built
Agent instructions. Search protocol (vendor-first, then YouTube, then manual), verification gates (video player loads, archive depth, free access), platform-specific rules (BoardBook = agendas only, CitizenPortal = paywall), output schema, checkpointing every 10–15 districts. Common pitfalls documented so the agent doesn't repeat them.
Validation layer. District ID validation before processing — prevents duplicates, ensures TEA canonical IDs. Validation pipeline catches bad URLs. Designed so an agent can't silently corrupt the dataset.
Automation + agent handoff. Scripts handle Swagit, Granicus, YouTube bulk finder to save usage. The agent handles the rest: custom website archives, BoardBook verification, YouTube iframe extraction, ambiguous cases. Clear handoff points.
Open data. Results published as CSV. The methodology is reproducible; other states or local bodies can adapt it.
Why it matters
Public meetings are public record, but they're only useful if you can actually find them. Right now, if you want to watch a school board meeting in Texas, you have to know which district, find their website, figure out which video platform they use, and hope the archive is actually public. Multiply that by a thousand districts and it's effectively inaccessible. A single index changes that — for journalists covering local governance, researchers studying policy trends, or parents who just want to see what their board discussed last Tuesday.
What I learned
Agents need constraints. Validation gates and explicit rules aren't bureaucracy — they're what let you trust agent output at scale. "Verify before marking HIGH" is the difference between a usable dataset and a mess.
Resumability is a design constraint. STATE.md, checkpointing, "next batch" prompts — the work is structured so an agent (or human) can pick up exactly where it left off. That unlocks long-horizon tasks that would otherwise be too tedious.
The unlock is the workflow, not just the data. The verified districts matter. But the bigger unlock is proving that agentic discovery works for this class of problem.
Built with