Solution / Research teams
Track long-horizon research with humans and agents in the same graph
Research projects span months, run lots of branching experiments, and increasingly include AI agents executing variants. Atoll's accountability graph fits that shape. Sprints do not.
The problem
Research doesn't fit a two-week sprint. Sprint-shaped tools don't fit research.
A research project spans months or years, branches constantly as hypotheses get tested and rejected, and now involves AI agents executing experiment variants at a cadence human researchers cannot match. The artefacts are heterogeneous (training runs, evaluation reports, dataset versions, writeups) and the relevant signal is a metric trajectory, not a status column.
Standard project-management (PM) tools optimise for short sprints with a fixed end date and a binary done state. They model issues well and outcomes badly. They do not track metrics over time as first-class objects, they do not group experiment runs under a parent hypothesis, and they treat agent runs as side scripts rather than as work that belongs in the same audit log as the human researcher who designed the experiment. The fit is bad enough that most research teams give up on PM tools and run the project from Jupyter notebooks and Google Sheets, which is worse, because nothing ties a run back to the hypothesis it was testing.
Mapping
How a research team uses Atoll
The four primitives map onto research work in a way most labs find immediately familiar.
Goal
A research hypothesis to prove or disprove.
"Sparse attention beats dense attention at 7B scale on long-context tasks." A goal has an owner, a horizon, and a definition of done that is a research decision, not a ship date.
KPI
A metric tracked over time.
Long-context eval score, training loss at step 100k, sample efficiency. The KPI takes snapshots from the experiment runs that feed it, and pace is read as a trajectory.
Initiative
An experiment family.
Sparse attention variants, learning rate sweeps, dataset curation. Initiatives group the runs that belong together and carry the bet on which family will move the KPI.
Issue
A single experiment run or task.
One training run with a specific config. One eval against a specific checkpoint. One dataset prep step. Atomic, attributable, replayable.
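The mapping above can be sketched as a small data model. This is an illustration only, not Atoll's schema or API: the class names follow the four primitives, but every field name, the member names, and the `all_issues` helper are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Issue:
    """A single experiment run or task."""
    title: str
    assignee: str  # hypothetical field: a human or agent member
    done: bool = False


@dataclass
class Initiative:
    """An experiment family that groups the runs that belong together."""
    name: str
    issues: list[Issue] = field(default_factory=list)


@dataclass
class KPI:
    """A metric tracked over time as (timestamp, value) snapshots."""
    name: str
    snapshots: list[tuple[str, float]] = field(default_factory=list)


@dataclass
class Goal:
    """A research hypothesis to prove or disprove."""
    hypothesis: str
    owner: str
    horizon: str
    kpis: list[KPI] = field(default_factory=list)
    initiatives: list[Initiative] = field(default_factory=list)

    def all_issues(self) -> list[Issue]:
        # Flatten every run under every experiment family of this goal.
        return [issue for init in self.initiatives for issue in init.issues]


# Hypothetical members and runs, using the hypothesis quoted above.
goal = Goal(
    hypothesis="Sparse attention beats dense attention at 7B scale on long-context tasks.",
    owner="researcher-a",
    horizon="one quarter",
    kpis=[KPI("long-context eval score")],
    initiatives=[
        Initiative("sparse attention variants", [
            Issue("train variant 1 with a specific config", assignee="training-agent"),
            Issue("eval variant 1 checkpoint", assignee="eval-agent"),
        ]),
        Initiative("learning rate sweeps", [
            Issue("train at one learning rate", assignee="training-agent"),
        ]),
    ],
)
```

The long horizon sits on the goal and its initiatives. Each issue stays small enough to be attributed to one member and replayed.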
Worked example
A small research project on Atoll
A hypothetical lab of three researchers and two AI agents, running for a quarter. The example is illustrative.
Imagine a small research group testing the hypothesis that a particular retrieval mechanism improves long-context reasoning at modest scale. The goal lives in Atoll as a single object: prove or disprove the claim by the end of the quarter. Two KPIs hang off it: eval score on a chosen benchmark and inference latency at the target context length.
Under the goal sit three initiatives: a baseline replication, a retrieval-augmented variant family, and an ablation series. Two AI agents are assigned to the org as members. One runs training jobs from the variant queue. The other handles evaluation runs and posts KPI snapshots when each checkpoint finishes. The three human researchers design the experiment family, write the analysis, and decide what to try next.
Day to day, the agents read their heartbeat at the start of each session, pull the next variant from their queue, execute the run on whatever compute is available, and write structured results back as comments and KPI snapshots. The researchers watch the trajectory in the goal pace view instead of digging through logs. When a variant looks promising, a researcher creates new issues under the same initiative to explore the neighbourhood. When the quarter ends, the full provenance (every run, every result, every decision) is queryable from a single audit feed.
This example is hypothetical and illustrative, not a real lab or a published result.
FAQ
Frequently asked questions
Our experiments run for months. How does that map to issues?
An issue is the atomic unit: a single experiment run, one configuration in a hyperparameter sweep, a dataset preparation step. Initiatives group the runs that belong to a single experiment family. Goals carry the hypothesis the family is testing. The long-horizon nature lives at the goal and initiative level. The issues are short-lived. That separation lets you track a six-month research arc without forcing every artefact into a sprint.
Can KPIs hold metrics that change shape over time?
Yes. A KPI in Atoll is a named metric with snapshots over time. You can track loss, accuracy, sample efficiency, or any number that has a value and a timestamp. The pace view shows the trajectory rather than a single instantaneous value. Research teams often track several KPIs per initiative (one primary, several secondary), and Atoll attributes movement back to the runs that produced it.
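As a rough sketch of what "snapshots over time" and "attributing movement back to runs" can mean, the example below stores each snapshot with the run that produced it and credits each change in the metric to the later run. It is not Atoll's implementation: `Snapshot`, `trajectory`, `movement_by_run`, the field names, and all values are made up for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Snapshot:
    taken_at: datetime
    value: float
    run_id: str  # hypothetical field: the run that produced this value


def trajectory(snapshots):
    """Return the snapshots ordered by time, oldest first."""
    return sorted(snapshots, key=lambda s: s.taken_at)


def movement_by_run(snapshots):
    """Credit each change in the metric to the run that posted the later snapshot."""
    ordered = trajectory(snapshots)
    moves = {}
    for earlier, later in zip(ordered, ordered[1:]):
        delta = later.value - earlier.value
        moves[later.run_id] = moves.get(later.run_id, 0.0) + delta
    return moves


# Made-up values, deliberately posted out of order.
snaps = [
    Snapshot(datetime(2025, 1, 3), 0.75, "run-2"),
    Snapshot(datetime(2025, 1, 1), 0.5, "run-1"),
    Snapshot(datetime(2025, 1, 5), 0.625, "run-3"),
]
moves = movement_by_run(snaps)
```

Here `run-2` is credited with a gain and `run-3` with a regression. The first snapshot has nothing before it, so `run-1` gets no entry.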
How do AI agents fit into a research workflow?
The common pattern: a researcher designs an experiment family. An agent reads its queue, runs the training jobs, and posts results back as KPI snapshots and structured comments. Humans review the outputs and decide which directions to pursue. The agent does not pick research directions. It executes the variants the humans prioritised. That division of labour is what most labs want.
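The pattern above can be sketched as a loop that drains a human-ordered queue and writes results back. Everything here is a stand-in: `Tracker` is an in-memory substitute for wherever results are written, `fake_run` replaces a real training job, and none of the names are Atoll's API.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Variant:
    issue_id: str
    config: dict


@dataclass
class Tracker:
    """In-memory stand-in for the place results get written."""
    comments: list = field(default_factory=list)
    kpi_snapshots: list = field(default_factory=list)


def fake_run(config):
    """Placeholder for a real training job. The score is not a real metric."""
    return config["window"] / 1024


def agent_session(queue, tracker, execute, agent="training-agent"):
    """Execute variants in the order the humans prioritised them."""
    while queue:
        variant = queue.popleft()  # the agent never reorders or adds work
        score = execute(variant.config)
        tracker.comments.append(
            {"issue": variant.issue_id, "author": agent, "body": f"score={score}"}
        )
        tracker.kpi_snapshots.append(
            {"kpi": "eval score", "value": score,
             "issue": variant.issue_id, "author": agent}
        )


# Hypothetical queue, ordered by a researcher.
queue = deque([
    Variant("run-1", {"window": 256}),
    Variant("run-2", {"window": 512}),
])
tracker = Tracker()
agent_session(queue, tracker, fake_run)
```

The loop only consumes the queue. Choosing what goes into it, and in what order, stays with the researchers.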
What about reproducibility and provenance?
Every action (issue transition, comment, KPI snapshot) is attributed to the member that took it and timestamped, so a run's full provenance is queryable through the activity feed. Agents log their actions the same way humans do, so a result produced by an agent has the same audit shape as a result produced by a human researcher. There is no separate research-only audit path to maintain.
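A minimal sketch of that idea, assuming a flat list of events where humans and agents share one record shape. The `Event` fields, the `provenance` helper, and the sample feed are hypothetical and do not describe Atoll's actual feed format.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Event:
    at: datetime
    member: str    # human or agent, the same field either way
    action: str    # "issue_transition", "comment", or "kpi_snapshot"
    issue_id: str
    detail: str


def provenance(feed, issue_id):
    """Return every event for one run, oldest first."""
    return sorted((e for e in feed if e.issue_id == issue_id), key=lambda e: e.at)


def utc(day, hour):
    # Helper for the made-up timestamps below.
    return datetime(2025, 1, day, hour, tzinfo=timezone.utc)


# Made-up feed, deliberately out of order.
feed = [
    Event(utc(2, 9), "eval-agent", "kpi_snapshot", "run-1", "eval score 0.5"),
    Event(utc(1, 9), "researcher-a", "issue_transition", "run-1", "todo -> in progress"),
    Event(utc(1, 10), "training-agent", "comment", "run-1", "checkpoint written"),
    Event(utc(1, 11), "training-agent", "comment", "run-2", "started"),
]
history = provenance(feed, "run-1")
```

Because agent and human events share one shape, the same query answers "who did what to this run, and when" for both.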
Track research the way research actually works.
Long horizons, branching experiments, humans and agents in one graph.