Problem:
I have a Playwright test for an onboarding flow that currently assesses LLM-generated content using an LLM-as-a-judge method with a binary pass/fail outcome (score > 60). This works, but it discards the actual score, so there is no way to track score trends over time, which would be far more useful than a single pass/fail result.
Suggested solution:
Implement a custom metrics system that:
Captures the actual LLM evaluation scores from tests
Exposes these scores to Checkly's dashboard
Enables visualization of individual scores per run and aggregated metrics (7-day and 14-day averages)
Creates an extensible framework for any future custom metrics beyond just LLM evaluation scores
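To illustrate the kind of aggregation requested above, here is a minimal sketch in TypeScript. It is only an assumption about how such metrics could be modelled, not an existing Checkly API: the `MetricSample` shape, the metric name `llm_eval_score`, and the `windowAverage` and `passes` helpers are all hypothetical names made up for this example.

```typescript
// Hypothetical shape of one recorded metric value (not a Checkly API).
interface MetricSample {
  name: string;      // metric name, e.g. "llm_eval_score"
  value: number;     // raw score returned by the LLM judge
  timestamp: number; // time of the test run, in epoch milliseconds
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Average of all samples named `name` recorded in the `days` days up to `now`.
// Returns null when the window contains no samples.
function windowAverage(
  samples: MetricSample[],
  name: string,
  days: number,
  now: number
): number | null {
  const cutoff = now - days * DAY_MS;
  const inWindow = samples.filter(
    (s) => s.name === name && s.timestamp > cutoff && s.timestamp <= now
  );
  if (inWindow.length === 0) return null;
  const sum = inWindow.reduce((acc, s) => acc + s.value, 0);
  return sum / inWindow.length;
}

// The existing binary check (score > 60), kept alongside the raw score.
function passes(score: number, threshold = 60): boolean {
  return score > threshold;
}

// Usage: four runs at 1, 5, 10 and 20 days before `now`.
const now = Date.UTC(2025, 0, 15);
const samples: MetricSample[] = [
  { name: "llm_eval_score", value: 80, timestamp: now - 1 * DAY_MS },
  { name: "llm_eval_score", value: 70, timestamp: now - 5 * DAY_MS },
  { name: "llm_eval_score", value: 60, timestamp: now - 10 * DAY_MS },
  { name: "llm_eval_score", value: 10, timestamp: now - 20 * DAY_MS },
];

const avg7 = windowAverage(samples, "llm_eval_score", 7, now);   // 75
const avg14 = windowAverage(samples, "llm_eval_score", 14, now); // 70
```

Because samples are keyed by `name`, the same structure would cover other custom metrics beyond LLM evaluation scores. Note that the 10-day-old run scored exactly 60 and would fail the current `score > 60` check, while the 14-day average still reflects it as a data point.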
In Review
Feature Request
10 months ago

Berk Durmus