QA Frameworks: Scoring, Calibration, and DSAT Scrubbing

Why most QA programmes fail

Most customer service operations have a QA programme. Fewer have one that works. The gap between a QA programme that exists and one that drives genuine quality improvement is not a measurement gap — it is a design gap. Scorecards that measure the wrong things, calibration sessions that produce alignment on paper but not in practice, DSAT reviews that generate a number without generating insight. The programme runs, scores are produced, reports are circulated, and quality does not improve.

A QA programme fails when it is designed around the organisation's convenience rather than the customer's experience. When it measures what is easy to observe — was the greeting used, was the ticket tagged correctly — rather than what actually determines whether the customer had a good experience. When calibration is treated as a box-ticking exercise rather than a genuine alignment practice. When DSAT is reported as a percentage rather than investigated as a dataset.

Building a QA programme that works requires getting three things right simultaneously: a scoring framework that measures the dimensions of quality that actually matter, a calibration process that makes scores trustworthy and defensible, and a DSAT scrubbing practice that turns negative ratings into actionable intelligence. This article covers all three.

What a QA framework is designed to do

A QA framework is not a surveillance mechanism. It is a learning and improvement system — the structured practice through which an organisation understands whether its agents are delivering the experience it has designed, identifies where they are not, and creates the data that coaching and process improvement need to be effective.

A QA framework that is designed primarily to catch agents doing things wrong will produce a team that is defensive, risk-averse, and focused on scoring well rather than on serving customers well. A QA framework that is designed primarily to understand quality patterns and improve them will produce a team that views quality assessment as a development tool rather than a performance threat.

The design intent determines the culture the programme creates. Before designing the scorecard, the framework, or the calibration process, the intent should be explicit: this programme exists to help us understand how well we are delivering on our experience vision, identify where we are falling short, and improve systematically. Everything else follows from that intent.

Designing the scorecard

The scorecard is the core instrument of the QA framework — the structured set of criteria against which each assessed interaction is scored. Its design determines what the programme measures, what behaviour it incentivises, and what the coaching conversations it generates focus on.

Compliance-based versus quality-based criteria

A common and consequential mistake in scorecard design is conflating compliance-based criteria — did the agent follow the defined process — with quality-based criteria — did the agent deliver a genuinely good experience. Both matter. They are not the same thing and should not be mixed without clarity about what each is measuring.

Compliance-based criteria measure adherence to defined procedures and standards: was the correct greeting used, was the ticket documented according to the template, was the escalation path followed correctly, was the customer's data handled in accordance with policy. Compliance criteria are binary or near-binary — the agent either followed the procedure or did not. They are important because process adherence underpins consistency and regulatory compliance, but they do not, on their own, tell you whether the customer had a good experience.

Quality-based criteria measure the dimensions of the interaction that determine customer experience: was the agent's response accurate, was the communication clear and appropriately toned, did the agent demonstrate genuine understanding of the customer's situation, was the resolution complete rather than superficial. Quality criteria require judgment in their assessment — there is no binary right or wrong for "was the communication clear" — and that judgment is what calibration is designed to align.

A well-designed scorecard includes both categories, weights them appropriately, and makes clear which criteria are compliance-based and which are quality-based. The weighting should reflect what actually drives customer outcomes — typically quality-based criteria should carry more weight than compliance-based ones, because a technically compliant interaction that fails on accuracy and tone will produce DSAT regardless of how well the procedures were followed.

The critical error category

Every QA scorecard should include a category of critical errors — specific failures that are so consequential that they override the overall score regardless of performance on other criteria. Critical errors represent failures that directly harm the customer, create compliance or legal risk, or fundamentally undermine the purpose of the interaction.

Examples of critical errors in a CS context:

Providing incorrect regulatory or compliance information that could cause the customer financial or legal harm.

Sharing one customer's data with another customer.

Failing to escalate a contact that meets defined escalation criteria, resulting in an SLA breach.

Making a commitment to a customer that the organisation cannot or will not honour.

Handling a data subject request incorrectly in a way that creates GDPR exposure.

A critical error should result in an automatic fail for that interaction regardless of the score on other criteria. The threshold for what constitutes a critical error should be defined explicitly in the scorecard documentation — not left to reviewer judgment — and should be reviewed regularly as the operation's risk profile evolves.

The critical error category creates a separate accountability track that is distinct from developmental coaching. An agent whose overall QA scores are improving but who commits periodic critical errors is not progressing — the critical errors are a separate conversation from the development trend.

Scorecard structure

A practical scorecard for a CS operation covers four to six categories, each with two to four specific criteria, scored on a consistent scale. Fewer than four categories produces a scorecard that is too blunt to generate specific coaching insight. More than six creates scoring burden that reduces the quality of assessment and the consistency of calibration.

A sample scorecard structure:

Resolution quality (35% weight): Was the customer's issue fully resolved? Was the information provided accurate? Was the resolution complete rather than partial?

Communication quality (30% weight): Was the communication clear and appropriately structured? Was the tone appropriate to the situation? Was empathy demonstrated where the situation warranted it?

Process adherence (20% weight): Were the correct procedures followed? Was the ticket documented correctly? Were escalation criteria applied appropriately?

Efficiency (15% weight): Was handle time appropriate for the complexity of the contact? Was after-contact work completed correctly and promptly?

The weighting reflects the priority hierarchy: getting the answer right and communicating it well are more important than process adherence and efficiency. A scorecard that weights process adherence above resolution quality tells agents implicitly that following the procedure matters more than solving the problem — which is rarely the intended message.
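The arithmetic behind the sample weights can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the category keys, the 0.0 to 1.0 category scores, and the choice to represent an automatic fail as a score of zero are all assumptions made for the example.

```python
# Illustrative weights taken from the sample scorecard structure.
WEIGHTS = {
    "resolution_quality": 0.35,
    "communication_quality": 0.30,
    "process_adherence": 0.20,
    "efficiency": 0.15,
}


def interaction_score(category_scores, critical_error=False):
    """Return a 0-100 weighted score for one assessed interaction.

    category_scores maps each category to a value between 0.0 and 1.0
    (the share of available points earned in that category).
    Assumption: a critical error is recorded as an overall score of 0.
    """
    if critical_error:
        return 0.0
    missing = set(WEIGHTS) - set(category_scores)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    total = sum(WEIGHTS[name] * category_scores[name] for name in WEIGHTS)
    return round(total * 100, 1)


# Hypothetical interaction: strong everywhere except communication.
example = {
    "resolution_quality": 1.0,
    "communication_quality": 0.5,
    "process_adherence": 1.0,
    "efficiency": 1.0,
}
print(interaction_score(example))                       # 85.0
print(interaction_score(example, critical_error=True))  # 0.0
```

Halving the communication score costs 15 points here, whereas halving the efficiency score would cost 7.5. That difference is the priority hierarchy expressed in numbers.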

Scoring scales

Scoring scales should be consistent across all criteria and simple enough that they can be applied reliably without extensive training. Two common approaches:

A binary scale — pass/fail for each criterion — is simple, fast to apply, and produces clear pass rates for each criterion. Its limitation is that it loses nuance — an agent who barely passes a criterion and one who performs it excellently receive the same score. Binary scales are appropriate for compliance-based criteria where the criterion is genuinely binary.

A three- or four-point scale (for example: does not meet standard, meets standard, exceeds standard, or a numerical equivalent) captures nuance without creating the false precision of a ten-point scale. Four-point scales work well for quality-based criteria where the distinction between adequate and excellent performance is meaningful and observable.

Mixing binary scales for compliance criteria and multi-point scales for quality criteria within the same scorecard is a valid and often practical approach — it applies the appropriate precision to each category.

Calibration: making scores trustworthy

A scorecard produces scores. Calibration makes those scores trustworthy — ensuring that the same interaction assessed by different reviewers produces consistent scores, and that the standards applied today are the same as the standards applied last month.

Without calibration, QA scores are opinions dressed up as measurements. An agent who disputes a low score — "I think I handled that well" — has no reference point beyond the reviewer's judgment. With calibration, the score is grounded in a shared, documented standard that is the product of multiple reviewers' alignment rather than one person's assessment.

The calibration session

A calibration session brings together multiple reviewers — typically QA analysts, team leads, and the QA programme owner — to independently score the same interaction and then compare and discuss their scores.

The sequence matters. Independent scoring before the group discussion is what makes calibration valuable. If reviewers see each other's scores before forming their own, anchoring bias — the tendency to adjust toward the first number seen — eliminates the independence that calibration is designed to produce. The discipline of independent scoring first is non-negotiable.

After independent scoring, the group compares scores criterion by criterion. Where scores diverge significantly — more than one point on a four-point scale, or a pass-fail disagreement — the discussion focuses on what drove the divergence. Was it a different interpretation of the criterion? A different weighting of the evidence in the interaction? A difference in what the standard requires?

The outcome of the discussion is a documented consensus score and — more importantly — a documented rationale that clarifies the standard for that criterion in that type of situation. The calibration session is not primarily about reaching agreement on one score. It is about building a shared understanding of what the standard means in practice, across the range of situations the team encounters.
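The divergence check that decides which criteria the group discusses can be sketched as follows. The reviewer names, criterion names, and data layout are hypothetical; the thresholds are the ones described above (any pass/fail disagreement, or more than one point apart on a four-point scale).

```python
def divergent_criteria(scores_by_reviewer, scale_type):
    """Return the criteria whose independent scores diverge enough to discuss.

    scores_by_reviewer: {reviewer: {criterion: score}}
    scale_type: {criterion: "binary" or "four_point"}
    Binary criteria use 0 (fail) or 1 (pass); four-point criteria use 1 to 4.
    """
    flagged = []
    for criterion, kind in scale_type.items():
        values = [scores[criterion] for scores in scores_by_reviewer.values()]
        spread = max(values) - min(values)
        if kind == "binary" and spread > 0:
            flagged.append(criterion)  # any pass/fail disagreement
        elif kind == "four_point" and spread > 1:
            flagged.append(criterion)  # more than one point apart
    return flagged


# Hypothetical independent scores for one calibrated interaction.
scores = {
    "analyst_a": {"greeting": 1, "tone": 2, "accuracy": 3},
    "analyst_b": {"greeting": 1, "tone": 4, "accuracy": 4},
    "team_lead": {"greeting": 0, "tone": 3, "accuracy": 4},
}
scales = {"greeting": "binary", "tone": "four_point", "accuracy": "four_point"}

print(divergent_criteria(scores, scales))  # ['greeting', 'tone']
```

Accuracy is not flagged because the reviewers are only one point apart. In a session, that frees discussion time for the two criteria where interpretation actually differs.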

Calibration frequency and sample selection

Calibration sessions should run at minimum monthly. For operations implementing a new QA framework or introducing new reviewers, weekly calibration in the first two months accelerates alignment significantly.

Sample selection for calibration should be deliberate rather than random. A calibration sample that includes only straightforward contacts will not surface the interpretive disagreements that calibration is designed to resolve. The most useful calibration samples include a mix of contact types, at least one interaction that reviewers found genuinely difficult to score, and at least one interaction where a recent quality issue or edge case was observed in live operations.

Measuring calibration quality

The gap between reviewers' independent scores before discussion and the consensus score after is a measure of calibration quality — specifically of how much variance exists in how reviewers interpret the standards. Tracking this gap over time — the average score divergence across calibration sessions — shows whether calibration is producing genuine alignment or whether reviewers are agreeing in sessions while continuing to apply divergent standards independently.
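One simple way to put a number on that gap is the mean absolute difference between each reviewer's independent score and the documented consensus score. The sketch below assumes numeric scores and a hypothetical data layout; other summary statistics would serve equally well.

```python
from statistics import mean


def mean_divergence(independent_scores, consensus):
    """Average absolute gap between independent scores and the consensus.

    independent_scores: {reviewer: {criterion: score}}
    consensus: {criterion: agreed score after discussion}
    """
    gaps = [
        abs(score - consensus[criterion])
        for scores in independent_scores.values()
        for criterion, score in scores.items()
    ]
    return mean(gaps)


# Hypothetical session: two reviewers, two criteria on a four-point scale.
independent = {
    "analyst_a": {"tone": 2, "accuracy": 3},
    "analyst_b": {"tone": 4, "accuracy": 4},
}
agreed = {"tone": 3, "accuracy": 4}

print(mean_divergence(independent, agreed))  # 0.75
```

Recording this figure for every session gives the trend line: a value that falls over successive sessions suggests alignment is taking hold, while a flat value suggests it is not.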

An inter-rater reliability score, the statistical measure of agreement between reviewers scoring the same interactions independently, is the formal measurement of calibration quality. Many QA tools calculate this automatically. It is usually reported as a kappa coefficient, which corrects raw agreement for the agreement expected by chance: Cohen's kappa for two reviewers, Fleiss' kappa for more than two. A target above 0.80 is considered good alignment. Below 0.70 indicates significant divergence that the calibration process is not resolving.
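For teams whose tooling does not report this figure, the calculation is short. The sketch below assumes the kappa in question is Cohen's kappa, which covers exactly two reviewers; the source does not specify a variant, and panels of three or more reviewers would need something like Fleiss' kappa instead. The ratings are made up for illustration.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two reviewers rating the same interactions."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both reviewers must rate the same non-empty sample")
    n = len(ratings_a)
    # Share of interactions where the two reviewers gave the same rating.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance, from each reviewer's label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both reviewers used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail ratings on one criterion across ten interactions.
reviewer_a = ["pass", "pass", "fail", "pass", "fail",
              "pass", "pass", "fail", "pass", "pass"]
reviewer_b = ["pass", "pass", "fail", "pass", "pass",
              "pass", "pass", "fail", "pass", "pass"]

print(round(cohens_kappa(reviewer_a, reviewer_b), 3))  # 0.737
```

The two reviewers agree on nine of ten interactions, yet kappa is about 0.74 rather than 0.90. Because most ratings are passes, a large share of that agreement would occur by chance, which is why raw agreement percentages tend to overstate calibration quality.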

DSAT scrubbing: turning negative ratings into insight

DSAT, the set of interactions that customers rated as dissatisfied, is the most actionable dataset in a QA programme. Every DSAT rating represents a customer who experienced something sufficiently poor to register explicit dissatisfaction. Understanding what went wrong in those interactions, at a systematic rather than anecdotal level, is where the most valuable quality improvement intelligence comes from.

DSAT scrubbing is the structured practice of reviewing every DSAT-rated interaction, classifying the root cause of the dissatisfaction, and aggregating those classifications into a dataset that reveals the patterns driving negative customer experience.

The DSAT scrub process

A DSAT scrub is not a complaint review. It is a systematic analytical practice with a defined process, a consistent classification taxonomy, and a regular output that feeds into quality improvement priorities.

The process for each DSAT interaction:

Review the full interaction — not just the rated contact but the thread history if the customer has contacted previously. DSAT ratings are often the result of an accumulated experience rather than a single interaction failure.

Apply the QA scorecard to the interaction. What criteria did the interaction fail on? Was it a critical error, a quality failure, or a compliance miss? The QA score for DSAT interactions provides the quality dimension of the analysis.

Classify the root cause using the taxonomy defined below.

Note any contextual factors — was this a high-complexity contact type, was the agent new, was there a system issue that affected the interaction? Contextual factors do not excuse quality failures but they inform whether the failure is systemic or situational.

The DSAT classification taxonomy

A consistent classification taxonomy is what makes DSAT scrubbing analytically useful rather than a collection of individual anecdotes. Every DSAT interaction should be classified into one primary root cause category:

Resolution failure — the customer's problem was not solved, or was solved incorrectly. The most serious category because it represents the most fundamental failure of the CS function's purpose. Sub-categories: problem not understood, incorrect information provided, partial resolution without follow-through, problem recurred after claimed resolution.

Process failure — the problem was resolved but the process was poor. Sub-categories: excessive contacts required, unnecessary wait time, repeated context-setting required, poor handoff between agents or tiers, SLA breach.

Communication failure — the resolution was correct but the communication quality was poor. Sub-categories: unclear explanation, inappropriate tone, lack of empathy in a situation that warranted it, conflicting information from different agents.

Expectation mismatch — the dissatisfaction was driven by a gap between what the customer expected and what was possible — a resolution timeline that did not match expectation, a product limitation that was not communicated at sale, a policy that the customer did not know applied to their situation.

External factor — the dissatisfaction was driven by something outside the CS team's control — a product failure, a billing error, a sales promise that was not delivered. Still worth recording because it informs cross-functional action even though it does not indicate a CS quality failure.

Aggregating DSAT findings

The value of the classification taxonomy is in aggregation. A single classified DSAT interaction is an anecdote. A hundred classified DSAT interactions over a month is a dataset that reveals where quality is systematically failing.

The monthly DSAT analysis report should show: the distribution of root cause categories as a percentage of total DSAT volume, the trend in each category over the previous three months, the contact types and agent cohorts with the highest DSAT concentration, and the specific criteria that the QA assessments of DSAT interactions most frequently failed on.

This report is the primary input to coaching priorities, process improvement initiatives, and cross-functional escalations. A DSAT profile dominated by resolution failure on a specific contact type points to a knowledge gap or a process design problem. One dominated by communication failure points to a coaching priority. One dominated by external factors points to a cross-functional conversation with product or sales.
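The core of the monthly aggregation can be sketched in Python. The record fields (`contact_type`, `root_cause`) and the sample records are hypothetical; a real scrub log would carry more detail, such as sub-categories, agent cohort, and the failed QA criteria.

```python
from collections import Counter, defaultdict


def root_cause_distribution(records):
    """Percentage of total DSAT volume per primary root cause category."""
    counts = Counter(r["root_cause"] for r in records)
    total = sum(counts.values())
    return {cause: round(100 * n / total, 1) for cause, n in counts.most_common()}


def dominant_cause_by_contact_type(records):
    """Most frequent root cause within each contact type."""
    by_type = defaultdict(Counter)
    for r in records:
        by_type[r["contact_type"]][r["root_cause"]] += 1
    return {ctype: c.most_common(1)[0][0] for ctype, c in by_type.items()}


# Made-up scrub records for illustration only.
records = [
    {"contact_type": "billing", "root_cause": "resolution_failure"},
    {"contact_type": "billing", "root_cause": "resolution_failure"},
    {"contact_type": "billing", "root_cause": "external_factor"},
    {"contact_type": "onboarding", "root_cause": "communication_failure"},
    {"contact_type": "onboarding", "root_cause": "communication_failure"},
    {"contact_type": "onboarding", "root_cause": "expectation_mismatch"},
    {"contact_type": "billing", "root_cause": "resolution_failure"},
    {"contact_type": "onboarding", "root_cause": "process_failure"},
]

print(root_cause_distribution(records))
print(dominant_cause_by_contact_type(records))
```

In this made-up sample, resolution failure dominates billing contacts and communication failure dominates onboarding contacts, which would point to a knowledge or process review for the first and a coaching priority for the second.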

Connecting DSAT to QA scores

DSAT interactions should be systematically included in the QA review sample — not just reviewed as part of the DSAT scrub but formally scored against the QA scorecard. Connecting DSAT ratings to QA scores reveals the relationship between specific quality criteria and customer dissatisfaction — which quality failures are most predictive of DSAT, and which quality failures occur without generating DSAT.

This connection is what makes the QA framework genuinely linked to customer outcomes rather than existing as an internal quality measurement that may or may not reflect what customers actually experience. A QA criterion that consistently fails in DSAT interactions is a criterion that matters to customers. A QA criterion that fails frequently in assessed interactions but shows no correlation with DSAT may be measuring something that matters to the operation but not to the customer — and is worth reviewing for relevance.
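A first-pass version of this comparison needs nothing more than fail rates split by rating. The sketch below is a descriptive comparison under assumed field names (`dsat`, `failed`) and made-up data, not a formal correlation test; with real volumes a statistical test would be the natural next step.

```python
def fail_rates(assessments, criteria):
    """Fail rate per criterion in DSAT versus non-DSAT interactions.

    assessments: list of {"dsat": bool, "failed": set of criterion names}
    Returns {criterion: (dsat_fail_rate, other_fail_rate)}.
    """
    dsat = [a for a in assessments if a["dsat"]]
    other = [a for a in assessments if not a["dsat"]]

    def rate(group, criterion):
        if not group:
            return 0.0
        return sum(criterion in a["failed"] for a in group) / len(group)

    return {c: (rate(dsat, c), rate(other, c)) for c in criteria}


# Made-up QA assessments: four DSAT-rated and four other interactions.
assessments = [
    {"dsat": True, "failed": {"accuracy", "ticket_tagging"}},
    {"dsat": True, "failed": {"accuracy"}},
    {"dsat": True, "failed": {"accuracy", "ticket_tagging"}},
    {"dsat": True, "failed": set()},
    {"dsat": False, "failed": {"ticket_tagging"}},
    {"dsat": False, "failed": set()},
    {"dsat": False, "failed": {"ticket_tagging"}},
    {"dsat": False, "failed": set()},
]

print(fail_rates(assessments, ["accuracy", "ticket_tagging"]))
# {'accuracy': (0.75, 0.0), 'ticket_tagging': (0.5, 0.5)}
```

Here accuracy fails far more often in DSAT interactions than elsewhere, so it looks like a criterion customers feel. Ticket tagging fails at the same rate in both groups, which marks it as a candidate for the relevance review described above.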

Closing the loop: from QA findings to improvement

A QA framework that produces scores, calibration sessions that produce alignment, and DSAT scrubbing that produces classification data are all inputs to improvement. The improvement only happens if the findings are acted on — if coaching is delivered, processes are redesigned, and the QA framework itself is updated when it reveals its own limitations.

The closing of the loop has three mechanisms that are covered in the next two articles: QA-driven coaching that delivers targeted feedback on specific findings in near-real-time rather than waiting for scheduled 1:1s, RCA that traces quality patterns to their systemic causes and addresses them at the process or knowledge level, and the governance cadence that reviews QA programme performance regularly and updates the framework when it is not measuring the right things.
