Problem:
I have a Playwright test for an onboarding flow that currently assesses LLM-generated content using an LLM-as-a-judge method with a binary pass/fail outcome (score > 60). While functional, this approach cannot track score trends over time, which would provide more valuable insights.
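To make the pass/fail rule above concrete: the verdict collapses the judge's score into a boolean, and the score itself is lost. A minimal TypeScript sketch of keeping both, assuming the judge returns a number on a 0 to 100 scale (`JudgeResult` and `toJudgeResult` are hypothetical names; only the `> 60` rule comes from this post):

```typescript
// Hypothetical result shape: keep the raw score, not just the verdict.
interface JudgeResult {
  score: number;     // raw LLM-as-a-judge score, assumed 0 to 100
  threshold: number; // pass threshold used for this run
  passed: boolean;   // binary verdict the test currently asserts on
}

// Builds a result that preserves the score for later reporting.
function toJudgeResult(score: number, threshold = 60): JudgeResult {
  if (!Number.isFinite(score)) {
    throw new Error(`Judge returned a non-numeric score: ${score}`);
  }
  // Strictly greater than, matching the "score > 60" rule.
  return { score, threshold, passed: score > threshold };
}

const result = toJudgeResult(72);
console.log(JSON.stringify(result)); // {"score":72,"threshold":60,"passed":true}
```

The test can keep failing on `passed === false`, while `score` is the value a metrics system would need to store for each run.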
Suggested solution:
Implement a custom metrics system that:
Captures the actual LLM evaluation scores from tests
Exposes these scores to Checkly's dashboard
Enables visualization of individual scores per run and aggregated metrics (7-day and 14-day averages)
Creates an extensible framework for any future custom metrics beyond just LLM evaluation scores
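The 7-day and 14-day averages in the list above amount to a trailing-window mean over timestamped samples. A minimal TypeScript sketch, assuming each run emits a generic `{ name, value, timestamp }` record (`MetricSample` and `trailingAverage` are hypothetical, not an existing Checkly API). Keying samples by metric name is what would let the same aggregation serve custom metrics other than LLM scores:

```typescript
// Hypothetical generic metric record; not an existing Checkly type.
interface MetricSample {
  name: string;      // e.g. "llm_judge_score"; any future metric name works
  value: number;
  timestamp: number; // Unix epoch milliseconds
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Mean of one metric's samples inside the trailing `days` window ending at `now`.
// Returns undefined when the window contains no samples for that metric.
function trailingAverage(
  samples: MetricSample[],
  name: string,
  days: number,
  now: number,
): number | undefined {
  const cutoff = now - days * DAY_MS;
  const values = samples
    .filter((s) => s.name === name && s.timestamp > cutoff && s.timestamp <= now)
    .map((s) => s.value);
  if (values.length === 0) return undefined;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Illustrative samples only, not real results.
const now = Date.UTC(2024, 0, 15);
const samples: MetricSample[] = [
  { name: "llm_judge_score", value: 80, timestamp: now - 1 * DAY_MS },
  { name: "llm_judge_score", value: 70, timestamp: now - 5 * DAY_MS },
  { name: "llm_judge_score", value: 45, timestamp: now - 10 * DAY_MS },
  { name: "other_metric", value: 999, timestamp: now - 2 * DAY_MS },
];

console.log(trailingAverage(samples, "llm_judge_score", 7, now));  // 75
console.log(trailingAverage(samples, "llm_judge_score", 14, now)); // 65
```

Where the window boundary falls is a design choice; here a sample counts when it is newer than `now - days`, so the 7-day figure covers the most recent seven 24-hour periods.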
In Review
Feature Request
9 months ago

Berk Durmus