Coactive's agentic search uses an LLM to interpret what a user is actually looking for and dynamically orchestrates multiple search systems in parallel. A single natural-language query can simultaneously search across visual scenes, spoken dialogue (both semantic transcript matching and exact phrase search), environmental audio and sounds, on-screen text, recognized faces, and structured metadata like timestamps, locations, and events. The system automatically routes each query to the right combination of search modalities, so users never have to specify how to search. They just describe what they want to find.
Inside Mimir, that looks like typing:
"Find me where the spokesperson says 'we're responding to the situation' and include reporter questions."
"Find the story about 1,000+ people dressed as Marilyn Monroe in Palm Springs. I need the best wide shots and crowd reactions."
"Moments where the crowd is cheering but no one is speaking."
"Self-driving Tesla car on the road, autopilot footage."

Each of these queries triggers a different combination of search systems behind the scenes. The crowd-cheering example uses Coactive's audio sound search, powered by a 527-class audio recognition model, to find specific sounds within video independently of what is visible on screen. The spokesperson query combines transcript search with visual scene retrieval. Celebrity queries can optionally leverage Coactive's face recognition to surface enrolled individuals across the archive.
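To make the routing idea concrete, here is a minimal sketch of planning modalities for a query and fanning the query out in parallel. It is illustrative only and does not show Coactive's implementation or API: the `Modality` enum, the `KEYWORD_RULES` table, and the `plan_modalities` and `search_all` functions are all hypothetical names, and simple keyword rules stand in for the LLM that interprets the query in the real system.

```python
from concurrent.futures import ThreadPoolExecutor
from enum import Enum
from typing import Callable


class Modality(Enum):
    VISUAL = "visual"
    TRANSCRIPT = "transcript"
    AUDIO_SOUND = "audio_sound"


# Hypothetical keyword rules standing in for the LLM planner.
KEYWORD_RULES = {
    Modality.TRANSCRIPT: ("says", "said", "question"),
    Modality.AUDIO_SOUND: ("cheering", "applause", "siren"),
}


def plan_modalities(query: str) -> set[Modality]:
    """Pick which search systems to run for a query (toy heuristic)."""
    q = query.lower()
    # Assumption for this sketch: visual retrieval is always included.
    plan = {Modality.VISUAL}
    for modality, keywords in KEYWORD_RULES.items():
        if any(word in q for word in keywords):
            plan.add(modality)
    return plan


def search_all(
    query: str,
    backends: dict[Modality, Callable[[str], list[str]]],
) -> dict[Modality, list[str]]:
    """Run every planned backend concurrently and collect results per modality."""
    plan = plan_modalities(query)
    with ThreadPoolExecutor() as pool:
        futures = {m: pool.submit(backends[m], query) for m in plan}
        return {m: f.result() for m, f in futures.items()}


# Stub backends that just label their hits; real ones would query an index.
stub_backends = {m: (lambda q, m=m: [f"{m.value}:hit"]) for m in Modality}
results = search_all(
    "Moments where the crowd is cheering but no one is speaking.",
    stub_backends,
)
```

In this sketch the crowd-cheering query is planned onto the visual and audio-sound backends, while a query containing "says" would add transcript search. The caller never names a modality; only the planner does.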
Results surface as specific moments within assets, with timestamps, so editors can review and select the right section of footage rather than scrubbing through full files. Users can also refine results with negative search (for example, "cars but not white cars"), and the system preserves those constraints across follow-up queries and drill-downs.
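One way to picture negative search with preserved constraints is a session object that remembers every exclusion and reapplies it to later queries. The sketch below is a simplified assumption, not the product's data model: `Moment`, `SearchSession`, and the tag-based matching are hypothetical, and real retrieval would use embeddings and indexes rather than tag sets.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Moment:
    """A timestamped span inside an asset (hypothetical result shape)."""
    asset_id: str
    start_s: float
    end_s: float
    tags: frozenset


class SearchSession:
    """Keeps negative constraints alive across follow-up queries."""

    def __init__(self) -> None:
        self.exclusions: list[frozenset] = []

    def search(
        self,
        moments: list[Moment],
        include: set,
        exclude: Optional[set] = None,
    ) -> list[Moment]:
        # Remember the new exclusion so later queries inherit it.
        if exclude:
            self.exclusions.append(frozenset(exclude))
        return [
            m
            for m in moments
            # Keep moments that carry every requested tag...
            if include <= m.tags
            # ...and that do not fully match any remembered exclusion.
            and not any(ex <= m.tags for ex in self.exclusions)
        ]


archive = [
    Moment("asset-a", 0.0, 5.0, frozenset({"car", "white"})),
    Moment("asset-a", 10.0, 15.0, frozenset({"car", "red"})),
    Moment("asset-b", 3.0, 8.0, frozenset({"car", "red", "night"})),
]

session = SearchSession()
# "cars but not white cars"
first = session.search(archive, include={"car"}, exclude={"car", "white"})
# Follow-up query: the white-car exclusion still applies.
follow_up = session.search(archive, include={"night"})
```

Here the first query returns the two red-car moments with their start and end times, and the follow-up returns only the night shot. A later query for "white" alone would still skip the white-car moment, because the exclusion lives in the session rather than in any single query.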
From discovery to assembly