Product Spec — Expedite Experiment Detail Page

Section 1

Problem & Opportunity

Problem Statement

Experiment owners can't tell at a glance whether to ship a feature. Today, ship/iterate/stop decisions are made by reading scorecards, which are notoriously difficult to interpret correctly: dozens of metrics, mixed stat-sig signals, no clear hierarchy between guardrails and North Stars, and no shared rationale captured alongside the decision. The result is a workflow that is slow, inconsistent across owners, and effectively unaudited — owners regularly default to gut feel or escalate to leadership for a tie-break rather than reading the data themselves, and six months later no one can reconstruct why a given experiment shipped.

Opportunity

An Experiment Detail page turns the same scorecard data into a single decision-grade view: a Copilot-recommended call (advisory), the human's actual decision, and the guardrail evidence behind it — on one URL. By compressing minutes of scorecard interpretation into seconds of glanceable judgment and by capturing decision rationale in-product (including a hard exception trail when guardrails fail), we can make every ship decision across the M365 Copilot portfolio faster, more consistent, and fully auditable. The page becomes the single source of truth for "is this ready to ship, and why?" — replacing ad-hoc scorecard reading and email threads as the deciding artifact.

Evidence

Quantitative evidence pending. Recommended additions before review: (1) # experiments shipped per quarter and median time from "scorecard ready" to "decision locked"; (2) % of ship decisions today that escalate to a DS partner or skip-level for interpretation help; (3) # of post-hoc audit requests where the original rationale could not be reconstructed from existing artifacts.

Anecdotal — Experiment owners: consistent feedback that scorecards are hard to read and that there is no shared format for "the call I made and why."
Anecdotal — Reviewers / leadership: ship reviews routinely re-derive the same context from scratch because no decision artifact exists.
Adjacent signal: growth in 1:1 requests to DS partners for "is this ready?" interpretation across recent quarters.

Section 2

Goals & Success Metrics

Goals

Make ship/iterate/stop decisions faster and more consistent by giving experiment owners a glanceable, decision-grade view of scorecard data — replacing scorecard reading as the primary decision artifact.
Create an auditable record of every ship decision — what was decided, by whom, and why — including a structured exception trail when guardrails fail.
Establish the Detail page as the single source of truth for experiment ship readiness across the M365 Copilot portfolio.

Success Metrics

Primary Metric

Detail Page WoW Visit Growth

Target: 25% week-over-week growth in unique experiment owners visiting the page · Baseline: TBD (current page traffic)

Secondary Metric

Guardrail Exception Completion Rate

Target: within 90 days of launch, 60% of Ship decisions made over a failed guardrail include a completed Exception (the Guardrail Override form, R-06) · Baseline: 0% (Exception flow is new)

Guardrail Metric

Time-to-Decision

Must not regress relative to current scorecard-app workflow — measured as median time from "scorecard ready" to "decision locked"
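A minimal sketch of how the two numeric definitions above could be computed, assuming decision events are available as pairs of timestamps and weekly unique-owner counts are already aggregated. The function names and input shapes are hypothetical; this spec does not define a telemetry schema.

```python
from datetime import datetime
from statistics import median
from typing import Iterable, Optional, Tuple


def median_time_to_decision_hours(
    events: Iterable[Tuple[datetime, datetime]],
) -> Optional[float]:
    """Median hours from "scorecard ready" to "decision locked".

    Each event is a (scorecard_ready, decision_locked) pair. Pairs where the
    lock precedes the ready time are skipped as malformed (an assumption).
    """
    durations = [
        (locked - ready).total_seconds() / 3600.0
        for ready, locked in events
        if locked >= ready
    ]
    return median(durations) if durations else None


def wow_growth(prev_week_owners: int, this_week_owners: int) -> Optional[float]:
    """Week-over-week growth in unique visiting owners, as a fraction."""
    if prev_week_owners == 0:
        return None  # growth is undefined without a baseline week
    return (this_week_owners - prev_week_owners) / prev_week_owners
```

Under these assumptions the guardrail check is a comparison of two medians (Detail page workflow versus scorecard-app workflow), and the primary metric target corresponds to `wow_growth(...) >= 0.25`.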

Explicit Non-Goals

Replacing the scorecard app or its drill-downs — Detail page summarizes; the scorecard remains the source of truth for raw metric exploration

Becoming a generic dashboard — the page is purpose-built for ship/iterate/stop decisions, not exploratory analytics

Auto-shipping or auto-stopping experiments — Copilot recommendations are advisory only; humans own the decision

Section 3

User Scenarios

Scenario 1 — Experiment Owner reviewing for ship readiness (Primary)

Priya is the Experiment Owner for a new Copilot summarization feature. Her A/B has been running for two weeks and she needs to decide whether to ship. Today she opens the scorecard app, scrolls through 40+ metrics, tries to remember which ones are guardrails versus North Stars, and either escalates to her DS partner or makes a gut call. With the Detail page, Priya opens a single URL, sees a Copilot recommendation of "Iterate" labeled advisory with three cited guardrail concerns, reviews the structured Guardrail Metrics table (Consumer | Commercial × Retention, Upsell, Engagement), agrees with the rationale, and locks in "Iterate" as her decision in under two minutes. The decision and her free-text rationale are captured against the experiment for audit.

Scenario 2 — Reviewer / skip-level triaging across many experiments

Marcus is a Director reviewing six experiments before a Friday ship review. He has 30 minutes. He opens each Detail page in a tab, scans the header (experiment name + Copilot recommendation chip), looks at the prominent Decision selector to see what the owner is proposing, and reads the Decision Rationale. For two experiments where Copilot says "Stop" but the owner proposed "Ship," he comments and asks the owner to reconcile before the meeting. He never opens the scorecard app — the Detail page gives him everything he needs to triage at the level of the review.

Scenario 3 — Auditor / leadership reading rationale months later

Six months after a controversial ship, a leadership review asks: "Why did we ship this if engagement was flat?" An auditor opens the archived Detail page and reads the locked-in Decision ("Ship"), the Decision Rationale, the completed Guardrail Exception (filled out because Retention was Fail), and the cited guardrail values at the time of decision. The full reasoning is reconstructable in five minutes from a single URL — no email archeology, no scorecard re-derivation, no asking around for context.

Section 4

Requirements

Priority	ID	Requirement	Acceptance Criteria	Notes
P0	R-01	Header displays the experiment name (large, primary) and a Copilot Recommendation chip (smaller, labeled "Copilot recommendation (advisory)") with one of: Ship / Iterate / Stop / Hold	Header renders on every Detail page; chip label always reads "advisory"; recommendation values constrained to the four enum options	Identity is the experiment, not the recommendation
P0	R-02	Copilot Recommendation rationale is structured as three labeled blocks: Pros, Cons, Risks	All three sections always render; empty section shows "None identified"; freeform prose is not allowed outside the three labels	Forces consistent reasoning shape
P0	R-03	Decision UI (Ship / Iterate / Stop / Hold) is the most visually prominent control on the page — larger, higher-contrast, and higher in the layout than the Copilot Recommendation chip	In design review, Decision selector measurably dominates the visual hierarchy; usability test confirms users identify Decision before Recommendation	Human-owned decision must lead
P0	R-04	Decision Rationale is a free-text field that is required only when the saved Decision differs from the active Copilot Recommendation (R-01). When the Decision matches the Recommendation, the field remains available but is optional. When the Copilot Recommendation is unavailable (Section 5 edge case), rationale is required for every Decision	Save action is enabled with an empty rationale when Decision == Recommendation; Save is disabled with an empty rationale when Decision != Recommendation and inline messaging explains "Rationale is required because your decision differs from Copilot's recommendation"; rationale is also required when the Recommendation is in the unavailable state; switching the Decision selector updates the required/optional state in real time; minimum length enforced when required (TBD)	Reduces friction in the agreement path while keeping the audit anchor for divergent decisions. Pairs with R-20: Ship-with-failing-guardrail is always a divergent case (Copilot will not recommend Ship over a failing guardrail), so override events stay fully gated on rationale
P0	R-05	Guardrail Metrics table renders three categories — Retention, Upsell, Engagement, in that order — × two columns (Consumer | Commercial). Empty cells are not allowed; missing metrics show "Not applicable"	All three category rows always render; both columns always render; cells with no metric show explicit "Not applicable" text rather than blank space	Source §5
P0	R-06	Guardrail Override — when any guardrail = Fail, the Decision UI surfaces a persistent red "Ship is not recommended" warning and reveals a 3-field Override form (Why acceptable / Mitigation / Followups). Ship remains selectable (legal / brand / business reasons may force a Ship), but the Override is captured with the decision for the audit trail	Red warning callout renders whenever Ship is selected with a failing guardrail; Override form (3 required fields) is captured alongside the decision record; Save is gated on the Decision Rationale, not on the Override fields; Override capture is mandatory for downstream audit reports	Soft override (was hard-gate in early drafts) — captures audit data without blocking ship
P0	R-07	Page reads only from gold/silver layer Expedite data — no external retrieval, no client-side joins to ad-hoc sources	Code review confirms data scope; no network calls outside Expedite gold/silver endpoints; security review sign-off	Source §10 (Copilot CLI guardrails)
P1	R-08	Default rendering logic per guardrail category: stat-sig Fails first, then stat-sig Improvements sorted by absolute impact descending; non-stat-sig metrics hidden behind a "Show all (N more)" disclosure	Sort order verified per category; non-stat-sig metrics not rendered by default but count is visible; expansion is one click	Source §5
P1	R-09	Details & Links section surfaces the underlying scorecard, design doc, and telemetry queries — one-click to open the source-of-truth scorecard	All three link types render when available; missing links show "Not linked"; scorecard link opens correct scorecard in new tab	Source §6
P1	R-10	When a metric exists for only one of (Consumer | Commercial), the other column shows "Not applicable" rather than collapsing or leaving the cell empty	Renders "Not applicable" placeholder; visually distinct from a missing-data state	Source §5 — single-side metric handling
P2	R-11	Audit trail / decision history view — show the chronological sequence of decisions and rationales for an experiment	Toggleable history panel; entries show timestamp, actor, decision, rationale	Post-v1 candidate
P2	R-12	Export and share decision rationale (PDF or shareable link) for offline review meetings	Export renders header + decision + rationale + guardrail summary; shareable link respects permissions	Post-v1 candidate
P2	R-13	Per-guardrail commenting — Reviewers can leave a question or note tied to a specific row in the Guardrail Metrics table	Comments render inline; threaded; notification to experiment owner	Post-v1 candidate
P0	R-14	Page layout is a single-column vertical stack, top-to-bottom: Header → Generative Insights → Decision → Guardrail Metrics (Consumer | Commercial side-by-side) → Experiment Details → Decision History. There is no standalone "Decision Summary" tile	No side rail of secondary tiles; Experiment Details (owner, DS partner, dates, metadata chips) renders below Guardrail Metrics, not above the fold; Decision Summary tile does not exist	From prototype iteration — keeps the decision-critical content above the fold
P0	R-15	Page chrome matches the Expedite alpha shell: top blue app bar (IDEAS Experimentation brand + utility actions), left icon nav rail, light gray page background, white content cards with thin gray borders and 8px corner radius	Visual diff against alpha screenshot shows aligned bar height, nav width, background tone, card stroke; theme toggle and avatar render in the top bar	Carries the existing app shell so the Detail page feels native to Expedite
P0	R-16	Copilot Recommendation rationale (Pros / Cons / Risks per R-02) is presented inside a dedicated "Generative Insights" card with a sparkle icon, the AI-content disclaimer, and a Copilot recommendation pill — collapsible and placed between the Header and the Decision card	Card has visible AI disclaimer copy; sparkle/AI iconography is consistent with platform conventions; collapse/expand state persists per user session; rec pill matches R-01 enum	Makes "this is AI output, treat as advisory" unmissable
P0	R-17	Flat / non-stat-sig metric rows display "0%" in the delta column rather than rendering the underlying near-zero value	Every row marked flat / non-stat-sig in the data set renders exactly "0%"; tooltip can still surface the raw observed value for analysts who need it	Reduces visual noise; aligns the rendered table with stat-sig semantics (anything not stat-sig is treated as zero impact)
P1	R-18	Decision-card callout visibility rule: when Ship is selected and a guardrail is failing (R-06 state), the green "current decision matches recommendation" callout is suppressed so the red Override warning is the only callout in view. On any other selection the green callout is restored	Inspecting the DOM with Ship selected + failing guardrail shows the green callout hidden; switching to Iterate / Stop / Hold restores it; both callouts never render simultaneously	Prevents two competing callouts from fighting for attention
P0	R-19	Consumer and Commercial Guardrail cards render an additional MDE (Minimum Detectable Effect) column to the right of the metric value (delta) for every metric row, in both the default and "Show all" expanded states. MDE is rendered in black (`#000`) regardless of the metric's stat-sig state, so the MDE column never inherits the red / green / gray treatment used for the delta	Every metric row in both Consumer and Commercial cards renders three cells in this order: metric name, delta (impact %), MDE (%); the MDE cell uses color `#000`; the column appears for stat-sig Fails, stat-sig Improvements, and non-stat-sig rows alike; when MDE is unavailable for a row the cell shows "—" (em dash) rather than collapsing; "Not applicable" cells (R-10) remain unchanged because they have no metric rows	Lets reviewers judge whether a flat / non-stat-sig result is "truly flat" or "underpowered" at a glance — the delta alone hides the power story
P0	R-20	Every Guardrail Override of the Ship decision (i.e., Ship saved while one or more guardrails = Fail per R-06) writes an immutable audit-log entry that captures, at minimum: actor identity (UPN/email + display name), UTC timestamp (ISO 8601 with `Z` suffix), original verdict being overridden (the Copilot recommendation at the moment of override — Stop / Iterate / Hold), and the three justifications from R-06 (Why acceptable / Mitigation / Followups). Entries are persisted server-side and surfaced in the Decision History card	Every Save of Ship-with-failing-guardrail produces exactly one log entry that conforms to the documented schema (`actor`, `timestampUtc`, `originalVerdict`, `justifications`); actor equals the authenticated user; timestamp is UTC serialized with the `Z` suffix; original verdict is frozen at write time (not re-fetched on read); all three justification fields are persisted verbatim and never truncated; the Decision History card renders the entry as a distinct "Override" row exposing all four fields; entries are immutable (edits create a new entry referencing the original) and are queryable by downstream audit reports within 5 minutes of save	Tightens R-11 specifically for override events — even though the broader history view is P2, override logging itself ships in v1 because the audit trail is the whole point of allowing soft overrides (R-06)
P0	R-21	Statistical significance for guardrail metrics is determined by the Benjamini-Hochberg (BH) procedure controlling the False Discovery Rate at q ≤ 0.05, applied across the experiment's full guardrail family (Consumer + Commercial × Retention / Upsell / Engagement, excluding "Not applicable" cells per R-10). The BH-adjusted q-value — not the raw per-metric p-value at α = 0.05 — is the sole determinant of whether a metric is rendered as a stat-sig Fail, stat-sig Improvement, or non-stat-sig Flat. Reference: Benjamini-Hochberg procedure	The data layer (R-07) computes raw p-value, BH-adjusted q-value, family size m, sorted rank, and a `stat_sig` boolean (q ≤ 0.05) for every guardrail metric in the family; the family scope is recorded with each computation for audit reproducibility; every page-rendering decision that previously consulted "stat-sig" — R-06 Fail-detection / Override gate, R-08 sort + disclose, R-17 zero-out, R-20 override audit log capture — consumes the BH `stat_sig` flag and may not second-guess it; per-row inspection of raw p, BH q, family size m, and rank is exposed through the R-22 detail panel; a "Significance: Benjamini-Hochberg FDR ≤ 0.05" indicator is rendered on the Guardrail Metrics surface so reviewers know which rule produced the colors; the FDR threshold (default 0.05) is configurable per experiment via the same configuration surface that owns R-08 sort order	Resolves Open Question Q1. Multi-hypothesis correction is required because each experiment evaluates many guardrail metrics simultaneously; raw α = 0.05 inflates false positives roughly linearly with family size. BH controls the expected proportion of false positives among rejected hypotheses, which is the right correction for the guardrail use case. Trade-off: BH is more conservative than raw p < 0.05 when the family is large — see Risk 3 update
P0	R-22	Every guardrail metric row exposes a "?" affordance that toggles a slim per-row decision detail panel. The panel is a compact key/value grid showing, in this order: Observed Δ, p-value, BH-adjusted p-value (q), Relative CI (95%), Threshold Δ, Direction (higher / lower is better), MDE, and the exact decision rule that produced the row's classification (e.g., "Powered + non-stat-sig + CI passes → Safe", "Powered + stat-sig + CI excludes 0 in harmful direction → Fail", "Underpowered + non-stat-sig → Inconclusive (awaiting power)")	"?" button renders on every metric row in both Consumer and Commercial cards (default-visible and "Show all" expanded states alike); clicking the button expands the slim panel inline directly below the row; clicking again (or pressing Esc) collapses it; expanded state is per-row (multiple rows may be expanded simultaneously); the panel's eight fields render in the documented order; numeric fields use tabular-nums and inherit no color from the delta; the decision-rule string is the verbatim, fully resolved rule used by the data layer (no abbreviations, no localization lookups); panel is keyboard-traversable and the button exposes `aria-expanded`; "Not applicable" cells (R-10) do not render a "?" button because they have no metric row	Closes the loop on R-19 (MDE column) and R-21 (BH significance) — reviewers can now see why a row was classified the way it was and re-derive the call themselves without leaving the page. Captures the decision rule in the UI so future audits don't depend on tribal knowledge of the data-layer logic
P1	R-23	Experiment Details card surfaces Stage and Iteration alongside the other deployment-context fields (Ring, Audience, Treatment, etc.). Stage is a short string identifying the deployment tier (e.g., `WW`, `MSIT`, `SDF`, `Frontier`); Iteration is a small positive integer starting at 1 indicating which attempt this is within the same hypothesis (typically under a dozen). Both values MUST be sourced from the same experiment record whose scorecard, guardrails, and Copilot recommendation are rendered on this Detail page — they must not be hardcoded, defaulted, inherited from a sibling experiment, or otherwise decoupled from the experiment presented	Both fields render as labeled key/value chips inside the Experiment Details metadata strip, immediately adjacent to Ring; Stage accepts any short alphanumeric string; Iteration is a positive integer; the rendered Stage and Iteration values are read from the same experiment/progression identifier that drives the rest of the page (Header name, Guardrail Metrics, Decision History) — i.e., a single round-trip to the Expedite gold/silver layer (R-07) returns Stage and Iteration in the same payload as the metrics, and there is no separate fetch or fallback path; navigating to a different experiment updates Stage and Iteration in lockstep with every other field on the page; both fields render on every Detail page, with "—" shown only when the source record is genuinely missing the value (never as a default)	Lets owners distinguish ring (deployment population) from stage (release tier) and see at a glance how many times this experiment has iterated before today's decision — and guarantees those signals describe this experiment, not a stale or unrelated one
P1	R-24	Header renders a single-row Filter bar immediately under the experiment title block — five pill-style filter controls in this fixed left-to-right order: Ring, Audience, Iteration, App, Variant. Each pill renders as "Label: SelectedValue" with a chevron affordance and opens a single-select dropdown of allowed values. Allowed values: Ring = SDF, WW, MSIT, DOD · Audience = Any, Commercial, Consumer · Iteration = 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 · App = All, Word, Excel, PowerPoint, MCA, Teams, Outlook Web, Outlook Win 32, Copilot App · Variant = one entry per treatment-vs-control pairing for this experiment, formatted "`<treatment_id> - <control_id>`" (e.g., `Copilot_recap_v2_A - Copilot_recap_v2_off`); experiments with multiple treatments expose one option per pairing (e.g., `Copilot_recap_v2_B - Copilot_recap_v2_off`, `Copilot_recap_v2_C - Copilot_recap_v2_off`). Filter selections scope the Guardrail Metrics view to the chosen slice; the App dimension is intentionally excluded from the Decision and Generative Insights scope per R-25	All five pills render in the documented order on every Detail page; each pill is a single-select dropdown with the exact value list above (no free entry); selecting an option updates the pill's value label, persists per user/experiment, and triggers a re-fetch of all page surfaces against the new slice; only one dropdown is open at a time; opening one pill closes the others; clicking outside any pill or pressing Esc closes the open dropdown; pills are keyboard-traversable, expose `aria-haspopup="listbox"` + `aria-expanded`, and the option list exposes `role="listbox"` with `role="option"`/`aria-selected` on each row; default selections on first load are Ring=WW, Audience=Any, Iteration=10 (latest), App=All, Variant=first treatment-vs-control pairing for the experiment; when a slice returns zero rows the affected card renders an explicit "No data for this slice" empty state rather than collapsing	Replaces the earlier static "Variants: ON vs OFF" chip. The Variant pill enumerates the treatment-vs-control pairings defined by the experiment; single-treatment experiments expose exactly one option. The Iteration filter here scopes the data view and is independent of the Iteration metadata field surfaced in the Experiment Details card (R-23), which describes the experiment record itself
P1	R-25	The Decision card and the Generative Insights recommendation (Copilot rec, recap callout, Pros / Cons / Risks, Ship-override gate, decision rationale requirement) are scoped to the App = All aggregated rollup for the otherwise-current filter slice. Changing the App filter MUST NOT change the recommendation, narrative, decision pills, gate callout, override form, or rationale-required state — it only re-slices the Guardrail Metrics cards. Both the Decision card and the Generative Insights card surface a short subtext reading "Based on aggregated product experience" — directly under the "Decision" heading on the Decision card, and directly under the "Generative Insights" title on the GI card — so reviewers understand the scope	Selecting a non-All App value (e.g., Word, Excel) updates only the Guardrail Metrics rows (per-row deltas, severity dots, BH-significance markers); the Decision pill set, the active recommendation, the recap callout copy, the Pros/Cons/Risks lists, the gate-callout visibility, and the owner's currently-selected decision all remain unchanged. Returning App to All likewise leaves the decision-card and Generative Insights content unchanged. The "Based on aggregated product experience" subtext renders on both cards on every Detail page in a muted, italicized 12px style consistent with the existing secondary-text treatment; it is static copy (not data-driven). Ring, Audience, Iteration, and Variant filter changes continue to drive both the decision/GI surfaces and the metric surfaces as documented in R-24	Ship / Iterate / Stop / Hold are program-level calls and must be derived from the aggregated rollup — making them swing with each App selection would invite owners to pick the App that best supports their preferred decision. Keeping the decision App-agnostic while letting reviewers freely inspect per-App metrics preserves both auditability and exploration. The subtext makes the scoping rule visible at the point of decision instead of relying on tribal knowledge
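
The R-20 override audit-log entry in the table above can be sketched as an immutable record. This is an illustration only: the four top-level field names (`actor`, `timestampUtc`, `originalVerdict`, `justifications`) come from R-20, while the nested field names, the `build_override_entry` helper, and the validation choices are assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Verdicts a Ship override can supersede, per R-20.
OVERRIDABLE_VERDICTS = ("Stop", "Iterate", "Hold")


@dataclass(frozen=True)
class Actor:
    upn: str            # UPN/email of the authenticated user
    display_name: str


@dataclass(frozen=True)
class Justifications:
    why_acceptable: str
    mitigation: str
    followups: str


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class OverrideLogEntry:
    actor: Actor
    timestampUtc: str     # ISO 8601, UTC, "Z" suffix
    originalVerdict: str  # Copilot recommendation frozen at write time
    justifications: Justifications


def build_override_entry(
    actor: Actor,
    original_verdict: str,
    justifications: Justifications,
    now: Optional[datetime] = None,
) -> OverrideLogEntry:
    """Build one log entry for a Ship saved over a failing guardrail."""
    if original_verdict not in OVERRIDABLE_VERDICTS:
        raise ValueError(f"unexpected original verdict: {original_verdict!r}")
    moment = now if now is not None else datetime.now(timezone.utc)
    if moment.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    stamp = moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return OverrideLogEntry(actor, stamp, original_verdict, justifications)
```

Freezing the dataclasses mirrors the "edits create a new entry" rule at the object level only; server-side immutability and the 5-minute queryability window would need to be enforced by the storage layer.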

Section 5

Design Direction

UX Flow & Reference

Existing Detail page (reference for design system, fonts, and colors — to be retained):
expedite-web-alpha2 · experiment summary

Page layout (single column, top to bottom):
App shell (top blue bar + left icon nav rail · R-15) → Header (Experiment name + Copilot recommendation chip · advisory) → Filter bar (Ring · Audience · Iteration · App · Variant · single-select pill dropdowns · R-24) → Generative Insights (Pros / Cons / Risks · collapsible AI card · R-16) → Decision (Ship / Iterate / Stop / Hold pills, left-aligned with OWNER DECISION label — largest, most prominent) → Decision Rationale (free-text · required only when Decision diverges from the Copilot recommendation · R-04) → Guardrail Override form (conditional · R-06) → Guardrail Metrics (Consumer | Commercial × Retention, Upsell, Engagement) → Experiment Details (owner, DS partner, Stage, Iteration, metadata chips · placed below Guardrails · R-14, R-23) → Decision History → Footer

Figma: link pending

Key Interaction Notes

Filter bar: Five pill-style filters render in one row directly under the experiment title: Ring, Audience, Iteration, App, Variant (R-24). Each pill is a single-select dropdown with a short, fixed value list (no free entry). Ring, Audience, Iteration, and Variant apply to every data surface on the page; the App dimension only re-slices the Guardrail Metrics cards. The Decision and Generative Insights surfaces remain locked to the App=All aggregated rollup (R-25), and both cards carry the subtext "Based on aggregated product experience" to make this scope visible. Only one dropdown is open at a time; Esc or click-outside dismisses.
Visual hierarchy: Decision is a primary, full-width selector with high contrast. The Copilot Recommendation is a smaller chip with the explicit "advisory" label. The Decision must measurably dominate — it is the human-owned outcome, and the Recommendation is supporting context.
Guardrail Override pattern: When any guardrail = Fail, the Decision card renders a persistent red callout — "⛔ Ship is not recommended. … document the override below for the audit trail." The Ship pill is still selectable (legal, brand, or business obligations may force a Ship). Selecting Ship opens an inline Guardrail Override form with three required fields — Why acceptable, Mitigation, Followups — captured alongside the decision for audit. Save is gated on Decision Rationale, not on the Override fields. The green "current decision matches recommendation" callout is suppressed in this state (R-18) so only the red warning is visible. Every saved override also writes an immutable audit-log entry — actor, UTC timestamp, original Copilot verdict, and the three justifications — surfaced in the Decision History card per R-20.
Guardrail Metrics table: Retention, Upsell, Engagement render in that order, top to bottom. Each category is a row with two columns (Consumer | Commercial). Within a column, default rendering shows stat-sig Fails first, then stat-sig Improvements sorted by absolute impact descending. Non-stat-sig metrics are collapsed behind "Show all (N more)" with the hidden count always visible. Every metric row carries an MDE (Minimum Detectable Effect) column to the right of the delta — rendered in black so it never competes visually with the red / green / gray delta — so reviewers can tell at a glance whether a flat result is truly flat or simply underpowered (R-19).
Style retention: Visual style, fonts, and colors of the existing Expedite alpha page (linked above) are retained. Only the page structure and information hierarchy are new.
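
The default rendering described in the Guardrail Metrics note (R-08) and the flat-row display rule (R-17) can be sketched as follows. The `MetricRow` shape and the `is_fail` flag are hypothetical, and the sketch assumes Fails keep their incoming order because the spec only states a sort for Improvements.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MetricRow:
    name: str
    delta_pct: float  # observed impact, in percent
    stat_sig: bool    # BH flag from the data layer (R-21)
    is_fail: bool     # hypothetical: stat-sig move in the harmful direction


def default_rows(rows: List[MetricRow]) -> Tuple[List[MetricRow], int]:
    """Return (rows visible by default, count of hidden non-stat-sig rows)."""
    fails = [r for r in rows if r.stat_sig and r.is_fail]
    improvements = sorted(
        (r for r in rows if r.stat_sig and not r.is_fail),
        key=lambda r: abs(r.delta_pct),
        reverse=True,  # largest absolute impact first
    )
    hidden = sum(1 for r in rows if not r.stat_sig)
    return fails + improvements, hidden


def disclosure_label(hidden: int) -> str:
    """Label for the expander that reveals non-stat-sig rows."""
    return f"Show all ({hidden} more)"


def display_delta(row: MetricRow) -> str:
    """R-17: non-stat-sig rows render exactly "0%" in the delta column."""
    return f"{row.delta_pct:+.1f}%" if row.stat_sig else "0%"
```

The one-decimal delta format is an arbitrary choice for the sketch; the spec does not fix a number format.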

Edge Cases & Empty States

No guardrails defined: Show "No guardrails configured for this experiment" in each category row; do not collapse the table.
All guardrails non-stat-sig: Collapse the metric rows behind the "Show all (N more)" disclosure (R-08), but always render the three category headers so the structure stays predictable.
Copilot recommendation unavailable (computation failure or out of scope): Chip shows "Recommendation unavailable" rather than a stale value; the Decision flow continues to function, but Decision Rationale is required for every Decision in this state (R-04).
Scorecard data stale (>72 hours since last refresh): Stale-data banner above the page with the timestamp; Decision can still be saved, with a warning recorded in the rationale audit.
Missing single-side metric: The other column shows "Not applicable" — explicit placeholder, not an empty cell (R-10).
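
Two of the edge cases above reduce to small predicates: the rationale requirement when the recommendation is unavailable (R-04) and the 72-hour staleness check. The sketch below is a hedged illustration; the function names are hypothetical, `None` stands in for the unavailable state, and the minimum rationale length is left out because the spec marks it TBD.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DECISIONS = ("Ship", "Iterate", "Stop", "Hold")
STALE_AFTER = timedelta(hours=72)


def rationale_required(decision: str, recommendation: Optional[str]) -> bool:
    """R-04: required on divergence, or when no recommendation is available."""
    if decision not in DECISIONS:
        raise ValueError(f"unknown decision: {decision!r}")
    if recommendation is None:  # "Recommendation unavailable" state
        return True
    return decision != recommendation


def can_save(decision: str, recommendation: Optional[str], rationale: str) -> bool:
    """Save is enabled unless a required rationale is empty or whitespace."""
    if rationale_required(decision, recommendation):
        return bool(rationale.strip())
    return True


def is_stale(last_refresh: datetime, now: datetime) -> bool:
    """True when more than 72 hours have passed since the last refresh."""
    return now - last_refresh > STALE_AFTER
```

Because the required state is a pure function of the current selection and recommendation, the real-time update in R-04 amounts to re-evaluating it on every selector change.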

Accessibility

The Decision selector is keyboard-navigable and announces label and current selection — including the override state, e.g., "Ship — selected, guardrail failing, override required."
All guardrail rows are keyboard-traversable; stat-sig markers are conveyed both visually (color) and textually (e.g., "stat-sig fail" announced).
Copilot recommendation rationale (Pros / Cons / Risks) is announced as three labeled regions. The "advisory" label is read aloud with the recommendation value.

Section 6

Open Questions & Risks

Open Questions

Q1 (RESOLVED — see R-21): How is "stat-sig" threshold defined per metric — single fixed alpha, or per-metric configuration?

Default rendering logic (R-08) and the Guardrail Override warning (R-06) both depend on a clear stat-sig boundary. Resolution: R-21 defines stat-sig as the Benjamini-Hochberg FDR procedure at q ≤ 0.05 applied across the full guardrail family per experiment, replacing the implicit raw p < 0.05 alpha. Threshold is configurable per experiment but defaults to 0.05. Owner: Data Science (closed).
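
For readers less familiar with the procedure named in the resolution, here is a minimal pure-Python sketch of Benjamini-Hochberg adjustment over one guardrail family. It is not the data-layer implementation; the function name and the returned dictionary keys are assumptions chosen to echo the fields listed in R-21.

```python
from typing import Dict, List


def benjamini_hochberg(p_values: List[float], q: float = 0.05) -> List[Dict]:
    """Return per-metric BH results in the same order as the input p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    ranks = [0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the minimum
    # of p * m / rank over this rank and every larger rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
        ranks[i] = rank
    return [
        {
            "p": p_values[i],
            "q": adjusted[i],
            "m": m,
            "rank": ranks[i],
            "stat_sig": adjusted[i] <= q,
        }
        for i in range(m)
    ]


# Example family of four guardrail p-values (illustrative numbers only).
results = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
flags = [r["stat_sig"] for r in results]
```

In this example three metrics have raw p < 0.05, but only the first stays significant after adjustment (`flags` is `[True, False, False, False]`), which is the conservatism trade-off noted under Risk 3.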

Q2: What is the exact Guardrail Override template — fields, validation rules, and approval workflow?

R-06 specifies Why acceptable / Mitigation / Followups as required, but length minimums, structured tags, linkage to followup work items, and any approval/sign-off step are TBD. Owner: PM (David Lydston).

Q3: Should Ship-overrides require tiered approval, or is owner-only sign-off sufficient?

v1 makes the Override a soft gate — anyone can ship by completing the 3-field Override form. Open question: should overrides above a severity threshold (e.g., stat-sig retention regression > 1%) require Director sign-off or a co-signer recorded in the audit trail? Owner: Engineering + Leadership.

Q4: Does the "advisory" framing of the Copilot recommendation need legal/compliance review before launch?

Recommendation is shown alongside a binding human decision; we should confirm whether the advisory disclaimer is sufficient or whether additional language/positioning is required. Owner: PM + Legal.

Risks

Risk 1: Over-reliance on the Copilot recommendation despite "advisory" framing

Likelihood: High · Impact: High. Owners may rubber-stamp the recommendation without reviewing guardrails — undoing the speed gain by replacing one shortcut (gut feel) with another (Copilot-as-oracle). Mitigation: visual hierarchy puts Decision >> Recommendation (R-03); Decision Rationale is required whenever the Decision diverges from the Recommendation (R-04) so every disagreement is audited; for the agreement path — where R-04 no longer forces rationale — quarterly sampling of agreement decisions to spot rubber-stamping (e.g., random audit + DS partner spot-checks); periodic review of rationale-vs-recommendation agreement rates by experiment area to surface owners who never disagree.

Risk 2: Guardrail Override becomes a rubber-stamp

Likelihood: Medium · Impact: High. With Ship now selectable as a soft override (R-06), owners may write "minor regression, ship anyway" with no real mitigation plan, and the Override form becomes ceremony rather than protection. Mitigation: three structured fields (Why acceptable / Mitigation / Followups) all required; followups link to tracked work items; quarterly audit of accepted overrides; flag overrides whose followups are never closed; consider tiered approval for high-severity overrides (Q3).

Risk 3: Default rendering hides important non-stat-sig regressions

Likelihood: Medium · Impact: Medium. A real issue may live just below the stat-sig threshold and never get surfaced because R-08 collapses non-stat-sig metrics by default. R-21's BH FDR control at q ≤ 0.05 is intentionally more conservative than raw p < 0.05, which slightly amplifies this risk for small-but-real effects when the guardrail family is large. Mitigation: always show the "N hidden" count in each category; one-click expand; the MDE column (R-19) makes underpowered metrics obvious next to the delta; the per-row detail panel (R-22) exposes raw p, BH q, family size, and rank per R-21 so analysts can re-derive borderline calls; quarterly retro on near-miss regressions to validate both the BH threshold and the family scope.

Risk 4: Page performance degrades with large guardrail tables

Likelihood: Low · Impact: Medium. Slow load makes triage workflow worse, not better — directly undermining the Detail Page WoW Visit Growth target. Mitigation: server-side default rendering; lazy-load expanded rows; perf SLA included in launch criteria.

Risk 5: Decision Rationale is too basic and reduces to "stat-sig says yes"

Likelihood: High · Impact: High. If owners write rationales that merely repeat stat-sig results — without capturing what meaningful user/business change the experiment represents — the auditability goal collapses; six months later the rationale tells a future reader nothing new beyond what the scorecard already showed. Mitigation: free-text required in v1, but explore structured prompts in v1.1 (e.g., "What user behavior change does this represent?", "What would have changed your decision?"); quarterly review of rationale quality with sampling and feedback to owners.
