Insights & Use Cases
April 8, 2026

How to build an AI-powered interview scoring system with speech-to-text

This tutorial shows you how to build an AI-powered interview scoring system that records interviews, converts speech-to-text, and systematically evaluates candidates using structured criteria.

Kelsey Foster
Growth

This tutorial shows you how to build an AI-powered interview scoring system that records interviews, converts speech-to-text, and systematically evaluates candidates using structured criteria. Instead of frantically taking notes while trying to listen and assess simultaneously, you'll separate interviewing from scoring—conduct natural conversations first, then analyze complete transcripts afterward with objective evidence for every evaluation.

You'll use Python with AssemblyAI's speech-to-text API to transcribe interviews with speaker separation, then implement automated scoring algorithms that search transcripts for competency evidence. The system generates detailed scorecards with supporting quotes, timestamps, and numerical ratings for each skill you're evaluating. By the end, you'll have a complete workflow that transforms subjective interview assessments into data-driven hiring decisions backed by verifiable evidence from actual candidate responses.

What is an AI-powered interview scoring system?

An AI-powered interview scoring system is a structured tool that records your interviews, turns the audio into text, and then searches for evidence of specific skills. This means you can focus entirely on the conversation during interviews, then analyze what was said afterward using the complete transcript.

Here's how it works: You record the interview, upload the audio file to a speech-to-text API, get back a transcript with speaker labels, and then search for quotes that prove or disprove each skill you're evaluating. Instead of frantically taking notes while trying to listen, you have every word preserved for careful analysis.

Traditional interview scoring happens while you're talking—you're trying to listen, take notes, and evaluate simultaneously. AI-powered scoring separates these tasks completely. You interview first, then score later using the full transcript.

Core components: Criteria, scale, evidence, and transcripts

Every interview scoring system needs four parts working together. First, you define what skills matter for the job—usually 4 to 6 specific competencies like problem-solving or technical knowledge.

Second, you create a 1-5 rating scale where each number means something specific. Third, you extract evidence from the transcript—actual quotes where candidates demonstrate each skill. Fourth, you use complete transcripts as your single source of truth instead of handwritten notes.

  • Scoring criteria: Job-specific skills you can observe in answers
  • Rating scale: Clear definitions for scores 1 through 5
  • Evidence extraction: Direct quotes that prove competency levels
  • Complete transcripts: Every word said, with speaker labels and timestamps
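The four components above can be sketched as one data structure. The class and field names here are illustrative choices for this tutorial, not part of any library:

```python
from dataclasses import dataclass, field

@dataclass
class CompetencyScore:
    """One scored competency with its supporting evidence."""
    name: str
    score: int                                     # 1-5 rating
    evidence: list = field(default_factory=list)   # quotes from the transcript

@dataclass
class Scorecard:
    """Ties the four components together for one interview."""
    candidate: str
    criteria: dict          # competency name -> keyword list
    scale: dict             # numeric score -> definition
    transcript_file: str    # the single source of truth
    results: list = field(default_factory=list)    # CompetencyScore entries

# Hypothetical example values, just to show the shape
card = Scorecard(
    candidate="Jane Doe",
    criteria={"Problem Solving": ["debug", "root cause"]},
    scale={3: "Meets Requirements: one relevant, complete example"},
    transcript_file="jane_doe_transcript.json",
)
card.results.append(CompetencyScore("Problem Solving", 4, ["I traced the bug to..."]))
```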

The manual scoring challenge: Bias, recall, and inconsistency

Manual scoring during interviews creates three problems that hurt your hiring decisions. You experience cognitive overload when trying to listen, evaluate, and take notes at the same time—your brain can't do all three tasks well simultaneously.

Memory-based scoring means you forget details from earlier in the interview. You might miss a brilliant answer from minute 10 if the candidate stumbles at minute 45. Different interviewers also interpret the same response differently—one person's "strong communication" is another's "adequate explanation."

Benefits of transcript-based scoring

Transcript-based scoring gives you four key advantages over manual evaluation:

  • Complete evidence: Every claim in your evaluation links to an actual quote
  • Reduced bias: All evaluators review identical information instead of different notes
  • Legal protection: Timestamped evidence supports your hiring decisions if challenged
  • Time savings: Batch-process multiple interviews and share transcripts with your team

You can search transcripts for specific keywords, share them with other interviewers for second opinions, and create detailed scorecards with supporting evidence for every rating.

How to build an AI-powered interview scoring system

Building your scoring system requires five sequential steps. Each step builds on the previous one, from defining what you're measuring to generating scores with supporting evidence.

Step 1: Define role-specific scoring criteria

Start by identifying 4 to 6 skills that directly predict success in the specific role you're hiring for. Don't use generic traits like "good communication"—these mean different things for different jobs.

For a software engineering role, focus on observable behaviors:

  • How candidates break down complex technical problems
  • Their approach to system design and architecture decisions
  • Knowledge of relevant programming languages and frameworks
  • Ability to explain technical concepts clearly

For a customer success role, look for different competencies:

  • Specific techniques for handling difficult customer situations
  • Methods for building trust and rapport with clients
  • How quickly they learn new software systems
  • Examples of influencing customer decisions

Write each competency as something you can identify in interview responses. Instead of "leadership potential," use "describes specific examples of mentoring team members or influencing technical decisions."
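In code, each competency pairs an observable definition with the keywords you will later search for. The definitions and keyword lists below are illustrative starting points to tune against your own transcripts, not a validated rubric:

```python
# Illustrative competency definitions for a software engineering role.
engineering_criteria = {
    "Problem decomposition": {
        "definition": "Breaks complex technical problems into smaller steps",
        "keywords": ["break down", "step by step", "isolate", "narrow down"],
    },
    "Mentoring": {
        "definition": "Describes specific examples of mentoring team members "
                      "or influencing technical decisions",
        "keywords": ["mentored", "code review", "onboarded", "paired with"],
    },
}

# Print the rubric so the whole panel works from the same definitions
for name, spec in engineering_criteria.items():
    print(f"{name}: {spec['definition']}")
```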

Step 2: Choose your rating scale

Use a 5-point scale with clear definitions for each score. This gives you enough detail without overwhelming evaluators with too many options.

Score 1 - Far Below Requirements: No evidence of competence, avoided the question, or gave completely irrelevant answers

Score 2 - Below Requirements: Minimal understanding shown, vague responses without concrete examples

Score 3 - Meets Requirements: Adequate demonstration with at least one relevant, complete example

Score 4 - Exceeds Requirements: Strong evidence with multiple detailed examples or sophisticated understanding

Score 5 - Far Exceeds Requirements: Exceptional mastery with innovative approaches or profound insights

Don't use scales with more than 5 points. Research shows humans struggle to consistently differentiate between more than 5 levels, so a 10-point scale creates noise rather than precision.
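Encoding the scale as a lookup table keeps every scorecard and report on identical definitions. A minimal sketch (labels copied from the scale above):

```python
# The 5-point scale as a shared lookup table
RATING_SCALE = {
    1: "Far Below Requirements: no evidence, avoided or irrelevant answers",
    2: "Below Requirements: vague responses without concrete examples",
    3: "Meets Requirements: at least one relevant, complete example",
    4: "Exceeds Requirements: multiple detailed examples or sophisticated understanding",
    5: "Far Exceeds Requirements: innovative approaches or profound insights",
}

def describe(score: int) -> str:
    """Translate a numeric score into its agreed definition."""
    if score not in RATING_SCALE:
        raise ValueError(f"Score must be 1-5, got {score}")
    return f"{score}/5 - {RATING_SCALE[score]}"

print(describe(3))
```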

Step 3: Set up interview recording and transcription

Quality transcription starts with quality audio recording. You need clear, separate audio for each speaker to get accurate transcripts with speaker labels.

Recording setup for video platforms:

  • Zoom: Enable cloud recording with separate audio files for each participant
  • Microsoft Teams: Use meeting recording, then download the MP4 file
  • Google Meet: Install a recording extension or use OBS Studio for local capture

Audio quality requirements:

  • Use external microphones instead of built-in laptop mics
  • Test audio levels before each interview—aim for clear, consistent volume
  • Record in quiet rooms with minimal background noise
  • Save recordings in common formats like MP3, MP4, or WAV

Legal requirements:

  • Inform candidates during scheduling that interviews will be recorded
  • Get verbal consent at the start: "This interview will be recorded for evaluation purposes. Do you consent?"
  • Follow your local laws—some states require consent from all parties
  • Store recordings securely and delete them after making hiring decisions

Step 4: Configure AssemblyAI transcription (Python)

Now you'll implement the transcription system using Python and AssemblyAI's speech-to-text API. Start by installing the required packages:

pip install assemblyai python-dotenv

Create a .env file to store your API key securely:

ASSEMBLYAI_API_KEY=your_api_key_here

Here's a complete script to transcribe interviews with speaker labels:

# interview_transcriber.py
import assemblyai as aai
import json
import os
from datetime import datetime
from dotenv import load_dotenv

# Load environment variables
load_dotenv()

# Configure AssemblyAI
aai.settings.api_key = os.getenv('ASSEMBLYAI_API_KEY')

def transcribe_interview(audio_file_path, candidate_name, position):
    """
    Transcribe an interview recording with speaker separation
    
    Args:
        audio_file_path: Path to your audio file
        candidate_name: Name of the candidate 
        position: Job title being interviewed for
    
    Returns:
        Dictionary with transcript and metadata
    """
    
    print(f"Starting transcription for {candidate_name}...")
    
    # Configure transcription settings
    config = aai.TranscriptionConfig(
        speech_model=aai.SpeechModel.best,  # Use highest accuracy model
        speaker_labels=True,        # Separate speakers
        speakers_expected=2,        # Interview has 2 people
        punctuate=True,            # Add punctuation
        format_text=True          # Proper capitalization
    )
    
    # Create transcriber and process file
    transcriber = aai.Transcriber(config=config)
    transcript = transcriber.transcribe(audio_file_path)
    
    # Check if transcription succeeded
    if transcript.status == aai.TranscriptStatus.error:
        print(f"Transcription failed: {transcript.error}")
        return None
    
    print("Transcription completed successfully!")
    
    # Extract speaker-labeled segments
    utterances = []
    for utterance in transcript.utterances:
        utterances.append({
            'speaker': utterance.speaker,
            'text': utterance.text,
            'start_time': utterance.start / 1000,  # Convert to seconds
            'end_time': utterance.end / 1000,
            'confidence': utterance.confidence
        })
    
    # Prepare results
    result = {
        'candidate_name': candidate_name,
        'position': position,
        'interview_date': datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
        'duration_minutes': round(transcript.audio_duration / 60, 2),
        'utterances': utterances,
        'full_text': transcript.text
    }
    
    # Save transcript to JSON file
    output_filename = f"{candidate_name.replace(' ', '_')}_{position.replace(' ', '_')}.json"
    with open(output_filename, 'w') as f:
        json.dump(result, f, indent=2, ensure_ascii=False)
    
    print(f"Transcript saved as {output_filename}")
    return result

# Example usage
if __name__ == "__main__":
    result = transcribe_interview(
        audio_file_path="interview_recording.mp3",
        candidate_name="Sarah Johnson", 
        position="Senior Developer"
    )
    
    if result:
        print(f"Interview duration: {result['duration_minutes']} minutes")
        print(f"Total utterances: {len(result['utterances'])}")

Key configuration options explained:

  • speech_model=aai.SpeechModel.best: Uses AssemblyAI's highest accuracy model for optimal transcription quality
  • speaker_labels=True: AssemblyAI automatically identifies different speakers and labels them consistently throughout the transcript
  • speakers_expected=2: Optimizes the algorithm for two-person conversations (interviewer and candidate)
  • punctuate=True: Adds periods, commas, and question marks to make the text readable
  • format_text=True: Capitalizes proper nouns and sentence beginnings

AssemblyAI handles multiple audio formats automatically—you don't need to convert files before uploading. The API works with MP3, MP4, M4A, WAV, and other common formats.
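Because the API takes these formats directly, batch-processing a folder of recordings is a thin wrapper around the transcribe_interview function above. The `candidate__position` filename convention here is an assumption made for this sketch; adapt the parsing to however you actually name recordings:

```python
from pathlib import Path

AUDIO_EXTS = {".mp3", ".mp4", ".m4a", ".wav"}

def parse_filename(path: Path):
    """Split 'candidate__position.ext' into its parts.

    The double-underscore convention is an assumption for this sketch;
    change it to match your own naming scheme.
    """
    candidate, _, position = path.stem.partition("__")
    return candidate, position or "Unknown"

def transcribe_folder(folder: str, transcribe):
    """Call transcribe(audio_path, candidate, position) on every
    recording in the folder; pass in transcribe_interview from the
    script above."""
    results = []
    for audio in sorted(Path(folder).glob("*")):
        if audio.suffix.lower() not in AUDIO_EXTS:
            continue  # skip notes, transcripts, etc.
        candidate, position = parse_filename(audio)
        result = transcribe(str(audio), candidate, position)
        if result:
            results.append(result)
    return results
```

Then `transcribe_folder("recordings", transcribe_interview)` processes every interview in one run.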

Get accurate, speaker-labeled interview transcripts

Sign up to get your AssemblyAI API key and run the Python script above. Diarization and formatting make quotes and timestamps effortless.

Get API key

Step 5: Extract evidence and calculate scores

With your transcript ready, you can now extract evidence for each competency and calculate scores. Here's a complete scoring system:

# interview_scorer.py
import json
import re
from typing import Dict, List, Tuple

class InterviewScorer:
    def __init__(self, transcript_file: str):
        """Load transcript and prepare for scoring"""
        with open(transcript_file, 'r') as f:
            self.transcript_data = json.load(f)
        
        # Extract candidate responses (usually Speaker B)
        self.candidate_responses = self._get_candidate_responses()
    
    def _get_candidate_responses(self) -> List[str]:
        """Extract only the candidate's speech from transcript"""
        responses = []
        
        # Count how many times each speaker talks
        speaker_counts = {}
        for utterance in self.transcript_data['utterances']:
            speaker = utterance['speaker']
            speaker_counts[speaker] = speaker_counts.get(speaker, 0) + 1
        
        # Heuristic: assume the speaker with fewer turns is the candidate.
        # Verify this against your own transcripts and adjust if needed.
        candidate_speaker = min(speaker_counts.keys(),
                                key=lambda x: speaker_counts[x])
        
        # Get all candidate utterances
        for utterance in self.transcript_data['utterances']:
            if utterance['speaker'] == candidate_speaker:
                responses.append(utterance['text'])
        
        return responses
    
    def find_evidence_for_competency(self, competency_keywords: List[str]) -> List[str]:
        """
        Find candidate responses that contain evidence of a competency
        
        Args:
            competency_keywords: List of words/phrases related to the skill
        
        Returns:
            List of candidate quotes showing this competency
        """
        evidence = []
        
        for response in self.candidate_responses:
            response_lower = response.lower()
            
            # Check if response contains any competency keywords
            keyword_matches = sum(
                1 for keyword in competency_keywords
                if keyword.lower() in response_lower
            )
            
            # If the response has keywords and is substantial, include it as evidence
            if keyword_matches > 0 and len(response.split()) > 10:
                evidence.append(response)
        
        return evidence[:3]  # Return top 3 pieces of evidence
    
    def score_competency(self, evidence: List[str]) -> int:
        """
        Calculate a 1-5 score based on evidence quality and quantity
        
        Args:
            evidence: List of quotes showing competency
        
        Returns:
            Score from 1 (poor) to 5 (excellent)
        """
        if not evidence:
            return 1  # No evidence found
        
        evidence_count = len(evidence)
        avg_length = sum(len(e.split()) for e in evidence) / len(evidence)
        
        # Score based on quantity and depth of evidence.
        # Check the strongest tiers first so scores 4 and 5 stay reachable.
        if evidence_count >= 3 and avg_length >= 40:
            return 5  # Exceptional evidence
        elif evidence_count >= 2 and avg_length >= 30:
            return 4  # Strong evidence
        elif evidence_count == 1 and avg_length < 20:
            return 2  # Minimal evidence
        elif evidence_count <= 2 and avg_length < 30:
            return 3  # Adequate evidence
        else:
            return 3  # Default to adequate
    
    def generate_scorecard(self, competencies: Dict[str, List[str]]) -> Dict:
        """
        Generate complete scorecard with scores and supporting evidence
        
        Args:
            competencies: Dict mapping competency names to keyword lists
        
        Returns:
            Complete scorecard with scores and evidence
        """
        scorecard = {
            'candidate': self.transcript_data['candidate_name'],
            'position': self.transcript_data['position'],
            'interview_date': self.transcript_data['interview_date'],
            'competency_scores': {},
            'supporting_evidence': {},
            'overall_score': 0
        }
        
        total_score = 0
        
        for competency_name, keywords in competencies.items():
            # Find evidence and calculate score
            evidence = self.find_evidence_for_competency(keywords)
            score = self.score_competency(evidence)
            
            # Store results
            scorecard['competency_scores'][competency_name] = score
            scorecard['supporting_evidence'][competency_name] = evidence
            total_score += score
        
        # Calculate overall average
        scorecard['overall_score'] = round(total_score / len(competencies), 1)
        
        return scorecard
    
    def save_scorecard(self, competencies: Dict[str, List[str]], output_file: str):
        """Generate and save scorecard to JSON file"""
        scorecard = self.generate_scorecard(competencies)
        
        with open(output_file, 'w') as f:
            json.dump(scorecard, f, indent=2, ensure_ascii=False)
        
        # Print summary
        print(f"\n=== Scorecard for {scorecard['candidate']} ===")
        for competency, score in scorecard['competency_scores'].items():
            print(f"{competency}: {score}/5")
        print(f"Overall Score: {scorecard['overall_score']}/5")
        print(f"Detailed scorecard saved to {output_file}")

# Example usage
if __name__ == "__main__":
    # Define competencies with relevant keywords
    engineering_competencies = {
        'Problem Solving': [
            'analyze', 'debug', 'troubleshoot', 'solution', 'approach',
            'investigate', 'root cause', 'systematic', 'break down'
        ],
        'Technical Skills': [
            'python', 'javascript', 'react', 'database', 'api',
            'algorithm', 'architecture', 'testing', 'performance'
        ],
        'Communication': [
            'explain', 'clarify', 'example', 'understand', 'question',
            'discuss', 'present', 'document', 'feedback'
        ],
        'Experience': [
            'project', 'team', 'lead', 'built', 'developed',
            'implemented', 'managed', 'delivered', 'worked on'
        ]
    }
    
    # Score the interview
    scorer = InterviewScorer('Sarah_Johnson_Senior_Developer.json')
    scorer.save_scorecard(engineering_competencies, 'sarah_johnson_scorecard.json')

This system searches candidate responses for keywords related to each competency, then scores based on the quantity and depth of evidence found. You can customize the keywords for different roles and adjust the scoring logic based on your needs.

For more sophisticated analysis, you can integrate with language models through AssemblyAI's LLM Gateway:

# Advanced scoring with AssemblyAI LLM Gateway
import requests
import os

def score_with_llm(competency: str, candidate_responses: str) -> str:
    """Use the AssemblyAI LLM Gateway to analyze responses for a specific competency"""
    
    prompt = f"""Analyze these interview responses for evidence of {competency}.

Candidate responses:
{candidate_responses}

Provide:
1. Score from 1-5 (1=no evidence, 5=exceptional evidence)
2. Three best quotes demonstrating this skill
3. Brief explanation of your score

Respond in JSON format."""
    
    headers = {
        'authorization': os.getenv('ASSEMBLYAI_API_KEY'),
        'Content-Type': 'application/json'
    }
    
    response = requests.post(
        'https://llm-gateway.assemblyai.com/v1/chat/completions',
        headers=headers,
        json={
            'model': 'claude-sonnet-4-20250514',
            'messages': [
                {
                    'role': 'user',
                    'content': prompt
                }
            ],
            'max_tokens': 1000,
            'temperature': 0.1
        }
    )
    
    return response.json()['choices'][0]['message']['content']
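The Gateway returns the model's reply as a string, so you still need to decode the JSON you asked for. A small defensive parser helps, since models sometimes wrap JSON in markdown fences; this sketch strips those before decoding:

```python
import json

def parse_llm_score(raw: str) -> dict:
    """Decode the JSON scorecard the prompt requested.

    Strips an optional markdown code fence first, then raises a
    clear error if the payload still isn't valid JSON.
    """
    text = raw.strip()
    if text.startswith("```"):
        # Drop the opening fence (with optional language tag) and the closing fence
        text = text.split("\n", 1)[1] if "\n" in text else text
        text = text.rsplit("```", 1)[0]
    try:
        return json.loads(text)
    except json.JSONDecodeError as e:
        raise ValueError(f"LLM did not return valid JSON: {e}") from e
```

For example, `parse_llm_score(score_with_llm("Problem Solving", responses))` yields a dict you can merge into the scorecard from Step 5.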

Common implementation mistakes to avoid

Three critical mistakes can undermine your scoring system and lead to poor hiring decisions.

Using one-size-fits-all criteria

Don't use the same competencies for every role—"communication skills" means different things for a backend engineer versus a sales representative. A data scientist needs to explain statistical concepts to non-technical stakeholders, while a customer support agent needs empathy and de-escalation techniques.

Instead, analyze your top performers in each role and identify what makes them successful. If your best engineers excel at code reviews, include "provides constructive technical feedback" rather than generic "teamwork." If your top salespeople ask great discovery questions, look for "uses strategic questioning to uncover customer needs."

Skipping interviewer calibration sessions

Even with transcripts and defined scales, different evaluators interpret evidence differently without alignment. One person's "strong problem-solving" is another's "basic analysis" because everyone brings different expectations.

Schedule monthly calibration sessions where all interviewers score the same sample transcript independently, then discuss differences. When scores differ by more than one point, dig into why. These conversations surface hidden assumptions and align your team on what each score level actually means.

Neglecting audio quality requirements

Poor audio quality ruins everything downstream—inaccurate transcripts lead to missed evidence and unreliable scores. A transcript with even 20% errors can completely change the meaning of technical explanations or miss key competency indicators.

Set strict quality standards and test them:

  • Test recording setup before every interview
  • Require external microphones for all participants
  • Use wired internet connections to prevent audio dropouts
  • Record practice sessions to verify audio clarity
  • Reject and reschedule interviews with poor audio quality

One garbled explanation of a technical concept can change a candidate's score from "exceeds expectations" to "below requirements" if key terms are mistranscribed.
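One way to catch such problems early is to scan the per-utterance confidence values the Step 4 script already saves. This sketch flags low-confidence segments for manual review; the 0.7 cutoff is an illustrative starting point, not an official recommendation:

```python
import json

def flag_low_confidence(transcript_file: str, threshold: float = 0.7):
    """Flag utterances whose transcription confidence falls below the
    threshold -- a rough proxy for audio problems worth re-checking.

    Reads the JSON file produced by transcribe_interview in Step 4.
    """
    with open(transcript_file) as f:
        data = json.load(f)
    flagged = [u for u in data["utterances"] if u["confidence"] < threshold]
    for u in flagged:
        print(f"[{u['start_time']:.0f}s] {u['speaker']} "
              f"(conf {u['confidence']:.2f}): {u['text'][:60]}")
    return flagged
```

If a large share of utterances gets flagged, treat it as a signal to reject the recording and reschedule rather than score from an unreliable transcript.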

How to measure interview scoring system effectiveness

Track four metrics to validate that your system improves hiring decisions and reduces inconsistency between interviewers.

Time-to-Hire: Measure days from job posting to offer acceptance. Structured scoring should speed up decision-making because you have clear evidence for each decision.

Quality of Hire: Track new hire performance ratings after 6 months on the job. Higher-scored candidates should perform better if your system accurately predicts success.

Inter-Rater Reliability: Calculate agreement between different evaluators scoring the same interview. Use Cohen's Kappa coefficient—scores above 0.7 indicate good agreement.

Score Distribution: Check if scores spread appropriately across candidates. If everyone clusters around the same score, your criteria might be too generic or your scale needs adjustment.
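Score distribution is easy to check programmatically. This sketch summarizes spread across candidates' overall scores; the 0.5 standard-deviation cutoff is an illustrative rule of thumb, not a validated threshold:

```python
from collections import Counter
from statistics import mean, stdev

def check_distribution(scores):
    """Summarize how overall scores spread across candidates.

    A standard deviation near zero suggests the criteria aren't
    differentiating candidates and the rubric needs adjustment.
    """
    counts = Counter(scores)
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return {
        "mean": round(mean(scores), 2),
        "stdev": round(spread, 2),
        "counts": dict(sorted(counts.items())),   # score -> how many candidates
        "too_clustered": spread < 0.5,            # illustrative cutoff
    }
```

For example, `check_distribution([3, 3, 3, 3])` flags clustering, while a spread like `[1, 3, 5, 2, 4]` does not.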

Here's Python code to calculate inter-rater reliability:

from sklearn.metrics import cohen_kappa_score

def calculate_agreement(evaluator1_scores, evaluator2_scores):
    """Calculate Cohen's Kappa for inter-rater agreement"""
    
    kappa = cohen_kappa_score(evaluator1_scores, evaluator2_scores)
    
    if kappa < 0.4:
        agreement_level = "Poor agreement - needs calibration"
    elif kappa < 0.6:
        agreement_level = "Fair agreement - some calibration needed"  
    elif kappa < 0.8:
        agreement_level = "Good agreement - system working well"
    else:
        agreement_level = "Excellent agreement - very consistent"
    
    return kappa, agreement_level

# Example
eval1_scores = [3, 4, 5, 2, 4, 3, 5, 4]
eval2_scores = [3, 5, 4, 2, 3, 3, 4, 4]

kappa, interpretation = calculate_agreement(eval1_scores, eval2_scores)
print(f"Agreement score: {kappa:.2f} - {interpretation}")

Track these metrics before and after implementing your system to measure improvement. Most teams see better consistency and faster decision-making within a few months of adoption.

Validate transcription quality on your audio

Upload an interview clip to test accuracy, speaker labels, and timestamps in your browser—no code required. Perfect before wiring up the API.

Try playground

Final words

This transcript-based approach transforms interview scoring from a subjective, memory-dependent process into an objective, evidence-based system. You record interviews to capture every detail, transcribe them to create searchable text, then systematically extract evidence for each competency you're evaluating. This separation of interviewing and scoring lets you focus entirely on the conversation while ensuring thorough, consistent evaluation afterward.

AssemblyAI's speech-to-text models excel at handling the diverse accents, technical terminology, and conversational speech patterns common in interviews. Features like speaker diarization automatically separate interviewer and candidate speech, while high accuracy rates ensure your evidence extraction reflects what was actually said rather than transcription errors. When hiring decisions depend on capturing nuanced technical discussions and behavioral examples, reliable speech recognition becomes essential infrastructure for building fair, defensible interview processes.

Start building your scoring system today

Create an account to access the Speech-to-Text API with speaker diarization, then plug it into your scoring workflow.

Sign up free

Frequently asked questions

How accurate does speech-to-text need to be for reliable interview scoring?

You need high accuracy for effective scoring because transcription errors can change technical terms or miss key competency indicators. AssemblyAI's Universal models achieve a mean Word Error Rate (WER) of 5.6% on English datasets, which translates to approximately 94.4% accuracy, ensuring your scores reflect actual candidate responses.

Can I use this system with video interview platforms like Zoom and Teams?

Yes, most video platforms let you download recordings as MP4 files, which AssemblyAI accepts directly. You can also extract just the audio track if you prefer working with audio files. The transcription process works the same regardless of whether you upload video or audio files.

How does automated transcript scoring compare to traditional manual evaluation during interviews?

Transcript-based scoring achieves much higher consistency between different evaluators because everyone reviews identical, complete information rather than relying on incomplete notes and fading memories. You also capture evidence that would be impossible to write down during fast-paced conversations.

What should I do if candidates refuse to consent to interview recording?

Offer traditional live scoring as an alternative while explaining that recording ensures fairer evaluation by allowing thorough review of their responses. Most candidates consent once they understand recordings are used solely for accurate assessment and aren't shared outside the hiring team.

How many competencies should I evaluate per interview to avoid overwhelming the scoring process?

Stick to 4-6 competencies maximum per interview. More than that becomes difficult to track effectively, and you risk diluting focus on the most important skills. Choose competencies that directly predict success in the specific role rather than using generic criteria.
