Skip to content
Olive Logo
Olive

The Hidden Crisis in AI

AI systems that appear fair can still discriminate

In healthcare. In hiring. In the legal system. The most dangerous bias is the kind you can't see.

High-Stakes Consequences

Healthcare

Medical recommendations shaped by dialect

Employment

Interview assessments biased by speech patterns

Legal

Models convicted AAE speakers to death more often

Language models are more likely to suggest that speakers of AAE be assigned less-prestigious jobs, be convicted of crimes and be sentenced to death.

Hofmann et al., Nature 2024

We found that all five ASR systems exhibited substantial racial disparities, with an average word error rate (WER) of 0.35 for black speakers compared with 0.19 for white speakers.

Koenecke et al., PNAS 2020

All models had examples of perpetuating race-based medicine in their responses...these LLMs could potentially cause harm by perpetuating debunked, racist ideas.

Omiye et al., npj Digital Medicine 2023

Covert Racism: The Voice Inside the Machine

personOlive Research
calendar_today
schedule10 min read

High-Stakes Harms

Speech is one of humanity's earliest interfaces. Writing and standardized text are later technologies layered on top of it. Yet much of contemporary AI has inverted this history, approaching voice as if it were simply text in audio form. Speech is reduced to generic text at the earliest possible stage, and the remainder of the signal, such as timing, emphasis, and variation, is treated as expendable errors. That algorithmic accommodation is often acceptable in low-stakes interaction: your assistant misunderstands you; you repeat yourself; you move on.

In high-stakes use cases like classrooms, clinics, and interviews, people speak under pressure, with unequal power, and in linguistically diverse ways. Meaning unfolds through pauses, emphasis, self-repair, overlap, and dialect—not just vocabulary (Sacks et al; De Ruiter et al). When a system collapses this signal into normalized text too early, it quietly defines one way of speaking as the default and treats variation as deviations or errors.

This systematic distortion generates rampant, covert racism—an allocational AI harm: who receives opportunity, care, and trust.

The Hidden Problem: Voice Bias Starts Inside Model Training

Voice is the most natural and accessible interface, but it also carries a dense bundle of social signals. Speech serves as a proxy for social sorting—accent, dialect, cadence, vocabulary. People encode region, class, and race through language variation, and listeners (including models) can convert those cues into judgments about competence, credibility, and intelligence.

In Nature, Hofmann et al. show that large language models can be "covertly prejudiced" against African American English (AAE), even when race is never mentioned. Dialect prejudice is "fundamentally different" from the overt racism often measured in audits because it is activated by dialect features rather than explicit racial labels. The authors argue that human feedback training can make the surface "look" better while leaving deeper stereotypes unchanged "HF training obscures the racism on the surface, but the racial stereotypes remain unaffected on a deeper level".

The more unsettling claim is about scaling: larger models can process AAE better, yet maintain their prejudice against the speakers. Research suggests a counterintuitive risk: as models get bigger and more instruction-following, they can become less overtly racist while becoming more covertly racist. This matters because many existing evaluations primarily test for "overt prejudice" meaning the system can appear safer precisely while its dialect-triggered stereotypes intensify.

As models are given the power to make choices, this underlying racism will be ingrained in every decision, including high stakes decisions such as hiring, loans, and grading. For example, Hofmann et al. also investigated "to what extent the decisions of the language models, made in the absence of any real evidence, were impacted by dialect", and found that, when models were asked to make judgements based on statements in AAE and SAE that did not relate to the case models consistently convicted and sentenced to death speakers of AAE more than speakers of SAE.

Anatomy of covert AI racism

Covert racism in language models can be understood as a pipeline of proxy inference:

  1. Perceptual degradation: AI systems can differentially fail to perceive inputs associated with marginalized groups. Koenecke et al. show that commercial automated speech recognition (ASR) systems make substantially more errors for Black speakers than for white speakers, treating Black speech patterns as noise rather than legitimate linguistic variation. These upstream failures propagate downstream while remaining ostensibly race-neutral.
  2. Proxy detection: The model detects linguistic markers of a socially stigmatized variety (e.g., canonical AAE features). Hofmann et al. emphasize that the dialect features they test are cross-regional and therefore likely to affect many Black speakers.
  3. Stereotype retrieval: The model associates those markers with stereotypes that historically track race, class, and "respectability." Hofmann et al. connect dialect-triggered harms to long-standing ideology and stereotypes about African Americans' "intelligence" and "competence".
  4. Decision translation: When a model is asked to decide—rank, recommend, screen, predict—it translates those stereotypes into allocations (who gets the better outcome).
  5. Surface compliance: Post-training and safety tuning can reduce explicit slurs or direct racial statements without removing the latent mapping between dialect and stereotypes, yielding a "color-blind" output that remains discriminatory in effect.

How bias flows through the system

Voice inputBiased output

How models covertly infer intelligence

Hofmann et al. provide an empirical bridge from dialect cues to competence judgments. In human experiments, the same speaker can be perceived as "less educated" and "less trustworthy" when using AAE rather than Standardized American English. This research demonstrated analogous patterns in language models: dialect can activate "archaic stereotypes" that align with extremely negative historical stereotypes about African Americans.

In practical terms, a voice system may never say "this student is less intelligent." Instead, it may:

  • Recommend simpler materials, fewer enrichment options, or more remedial tracks.
  • Provide more negative, less encouraging feedback.
  • Generate summaries that subtly downgrade the speaker's clarity or credibility.
  • Prefer lower-prestige pathways in "assistive" recommendations.

Even when each output seems benign, the aggregate outcome can become discriminatory allocation.

Concrete harms in high-stakes voice use cases

1) Healthcare: race-based medicine, fabricated equations, and biased steering

Omiye et al. tested multiple commercial LLMs with scenarios designed to detect race-based medical misconceptions and found that all models had examples of perpetuating harmful race-based medicine. Several findings are especially high-stakes:

  • Kidney function (eGFR): When asked how to calculate eGFR, models sometimes promoted the use of race and justified it with false claims about Black people having different muscle mass and creatinine levels.
  • Pulmonary function: In one response, GPT-4 asserted that "'normal' lung function values" for Black men and women "tend to be" 10–15% lower than for white people—an example of using race as a clinical modifier.
  • Skin thickness and pain: All tested models shared erroneous information about skin thickness differences where none exists. Some responses also leaned on unsubstantiated cultural or biological claims about pain reporting and pain thresholds.

As LLMs are proposed for integration into electronic health record workflows, they may steer clinicians "toward biased decision-making," especially when clinicians themselves vary in familiarity with updated guidance.

Voice-specific implication: In voice-based clinical intake or triage, dialect can function as an implicit racial proxy in addition to explicitly race-labeled prompts. That combination increases the risk of biased assumptions about pain tolerance, adherence, or credibility, especially when time pressure makes clinicians more likely to accept a system suggestion.

2) Education: dialect as a gatekeeper to opportunity

Hofmann et al. argue that dialect prejudice can produce "allocational harms" and that these harms may grow as covert prejudice rises with scale and human feedback training. In education, voice AI is often positioned as supportive—drafting lesson plans, tutoring, or summarizing student speech. But those are exactly the tasks where a model can covertly infer competence.

Concrete harm pathways include:

  • Tracking and placement: If dialect-triggered stereotypes shape perceived comprehension, a system can recommend lower-level materials or interventions for the same underlying performance.
  • Disciplinary narratives: Summaries of classroom incidents can shift tone—more "defiant" framing for some students—without stating race.
  • Feedback inequity: Feedback is likely to be less bespoke and helpful for individuals who are not properly understood by the system. Students receiving systematically less challenging, less affirming feedback accumulate downstream disadvantages.

The core point is that voice is not neutral: AAE is associated with stigmatized judgments in human settings (education included), and models trained on human text inherit these associations.

3) Hiring: voice screening as covert racial sorting

Many hiring systems already use structured interviews, automated scoring, and "culture fit" proxies. Adding an AI voice interview increases convenience—but also increases the risk that dialect becomes a hidden feature for sorting. Hofmann et al. connect AAE discrimination to employment contexts and emphasize that dialect prejudice can map onto stereotypes about intelligence and competence. Hofmann et al. also found that "when matching jobs to individuals on the basis of their dialect, language models assign considerably less-prestigious jobs to speakers of AAE than to speakers of SAE, even though the statements themselves do not relate to the job."

Why Olive's Specialized Voice Training Can Help—and What It Must Include

Olive does not treat speech as a transient input that collapses immediately into a flat orthographic string. Instead, Olive outputs a linguistically structured transcript that preserves interactional information: timing, pauses, emphasis, lengthening, and overlap. This matters for usability (turn-taking, intent resolution), but it is also a fairness safeguard because socially meaningful cues in speech—including dialect—can become a covert risk surface. Olive's approach frames voice training as a way to "bridge gaps" that current text-only speech to text prompting cannot. In order to meet the technology of today, Olive has chosen the approach of getting more from speech, and converting it into the text-based vectors that most models are designed for. By running models that provide this in-depth, linguistically structured transcript, Olive aims to create a cost effective, accessible solution to developers that more easily integrates into their current systems.

Example: Audio → Interactionally Detailed Transcript (Dialect-Preserving Output)

Orthographic (plain text):

"I been done told you I'm finna bounce, but you keep actin' like you don't hear me."

Interactionally detailed transcript:

A: I BEEN done TOLD you I'm FINna BOU:NCE, but you keep ACTin' like you don't HEAR me. (0.4)

What the notation is showing (quick legend):

  • A: speaker label
  • CAPS = emphasis/stress
  • : = sound stretching/lengthening (e.g., BOU:NCE)
  • (0.4) = timed pause in seconds
  • - AAE features here include been done (intensified/perfect-like "already"), finna (future/intent "about to"), and actin' (casual speech reduction)

For Olive, a transcript that retains timing and emphasis allows for systematic tests of whether downstream model behavior shifts when meaning is held constant but dialect/prosody changes. Hofmann et al.'s matched-guise approach is explicitly designed to surface "masked" stereotypes, and they note it can be extended to speech-based models, where dialect variation can be captured "on the phonetic level" more directly.

Phoneme-Level Recognition for MxAL and Chicano English

To generalize fairness beyond AAE, Olive incorporates phoneme-level recognition (with timestamps and confidence) as a parallel output alongside the orthographic transcript. This is especially important for dialects where socially salient variation is often phonetic and/or prosodic and may be inconsistently represented in spelling.

This design choice is motivated by what we already know about ASR disparities: Koenecke et al. find large racial gaps in word error rate (WER) in major speech recognizers (0.35 for Black speakers vs. 0.19 for white speakers on average), and they attribute the disparity primarily to shortcomings in acoustic modeling—i.e., confusion driven by phonological/phonetic/prosodic characteristics rather than "just" vocabulary or grammar. A phoneme-aware pipeline makes those acoustic mismatches measurable and correctable earlier in the stack.

How AI “Hears” Different Speakers

Black Speakers

35%

Word Error Rate

Gaps = errors

White Speakers

19%

Word Error Rate

More complete

84%higher error rate

for Black speakers across major commercial ASR systems

Source: Koenecke et al., PNAS 2020

For Mexican American Language (often abbreviated MxAL in some educational/linguistic materials) and related varieties commonly discussed under Chicano English, the fairness challenge is similar: the dialect is systematic and native, but it can be stigmatized and misheard. The LAUSD MELD guidance explicitly frames MxAL/Chicano English as a rule-governed variety spoken by native English speakers and distinguishes it from second-language learner English, drawing on Santa Ana's work. In sociolinguistic descriptions, Chicano English is also noted for salient intonational patterns—pitch "glides" and syllable lengthening used for emphasis—features that are naturally captured at the phonetic/prosodic level.

What phoneme-level recognition enables for Olive:

  • Dialect-robust decoding: Maintain a pronunciation/phoneme layer to represent dialectal realizations without forcing premature normalization into standardized spelling. Reduces the chance that dialectal speech is "corrected" into meaning drift.
  • Fairness diagnostics beyond WER: Track phoneme error rate and confusion patterns (e.g., systematic substitutions, timing/prosody mis-modeling) by dialect group, rather than waiting for downstream LLM effects. Aligns with evidence that disparities arise in the acoustic front end.
  • Matched-guise evaluation at the speech level: Use meaning-matched prompts where the phoneme/prosody layer differs while semantics remain stable, directly operationalizing Hofmann et al.'s note that speech-based matched-guise can capture phonetic variation more directly.
  • Mitigation hooks: When the phoneme/prosody output indicates dialectal marking, Olive can apply guardrails such as (a) calibration of downstream confidence, (b) dialect-invariant decision checks (e.g., compare outputs under a dialect-neutral paraphrase), and (c) targeted re-ranking to reduce dialect-triggered quality degradation.

Key design implications suggested by the research

  • Evaluate covert harms, not only overt harms. Hofmann et al. demonstrate that dialect-triggered bias can persist even when overt bias appears suppressed, so evaluation must include dialect-triggered matched-guise tests—not only toxicity/slur screens.
  • Red-team stochastically and repeatedly. Omiye et al. show harmful outputs may appear only in a subset of repeated runs; single-run benchmarks can miss failures. Voice pipelines should be tested across runs × accents × dialect variants.
  • Treat dialect-related error as a fairness issue, not "just UX." ASR accuracy gaps are large and consequential, and they can cascade into downstream judgments; Koenecke et al. quantify these disparities and emphasize the acoustic-model source of the gap.
  • Data and privacy guardrails, plus proxy awareness. Removing explicit PII is necessary but insufficient: dialect is not PII, yet it can operate as a proxy for race and trigger covert stereotyping.

Olive treats dialect as a protected-risk feature, even when it is not an explicit demographic label. The goal is to have a voice-to-transcript-to-LLM pipeline where dialect-triggered bias is measurable, reproducible, and mitigable.

Works Cited

Baugh, John. 2016. "Linguistic Profiling." In The Oxford Handbook of Language and Law, edited by Lawrence M. Solan and Peter M. Tiersma. Oxford: Oxford University Press.

Cohn, Michelle, et al. 2024. "Believing Anthropomorphism: Examining the Role of Anthropomorphic Cues on Trust in Large Language Models." arXiv preprint. https://arxiv.org/abs/2405.06079.

De Ruiter, Jan P., Holger Mitterer, and N. J. Enfield. 2006. "Projecting the End of a Speaker's Turn: A Cognitive Cornerstone of Conversation." Language 82 (3): 515–535. (Stable open-access repository copy: https://www.um.edu.mt/library/oar/handle/123456789/25355)

Fought, Carmen. 2002. Chicano English in Context. New York: Palgrave Macmillan.

Hofmann, Valentin, Pratyusha Ria Kalluri, Dan Jurafsky, and Sharese King. 2024. "AI Generates Covertly Racist Decisions about People Based on Their Dialect." Nature 633: 147–154. https://doi.org/10.1038/s41586-024-07856-5.

Koenecke, Allison, Andrew Nam, Emily K. Lake, Joe Nudell, Minnie Quartey, Zion Mengesha, and Sharad Goel. 2020. "Racial Disparities in Automated Speech Recognition." Proceedings of the National Academy of Sciences 117 (14): 7684–7689. https://doi.org/10.1073/pnas.1915768117.

Los Angeles Unified School District. n.d. Secondary Mainstream English Learner (MELD) Guide. Los Angeles: LAUSD.

Omiye, Jesutofunmi A., et al. 2023. "Large Language Models Propagate Race-Based Medicine." npj Digital Medicine 6: Article 195. https://doi.org/10.1038/s41746-023-00939-z.

Sacks, Harvey, Emanuel A. Schegloff, and Gail Jefferson. 1974. "A Simplest Systematics for the Organization of Turn-Taking for Conversation." Language 50 (4): 696–735.

Santa Ana, Otto. 1993. "Chicano English and the Nature of the Chicano Language Setting." Hispanic Journal of Behavioral Sciences 15 (1): 3–35. https://doi.org/10.1177/07399863930151001.

Santa Ana, Otto, and Robert Bayley. 2004. "Chicano English: Phonology." In A Handbook of Varieties of English: Phonology, edited by Bernd Kortmann et al., 417–434. Berlin: De Gruyter Mouton. https://doi.org/10.1515/9783110197181-030.

U.S. Department of Justice, Civil Rights Division. 1964. "Title VI of the Civil Rights Act of 1964." https://www.justice.gov/crt/fcs/TitleVI.

Weizenbaum, Joseph. 1966. "ELIZA—A Computer Program for the Study of Natural Language Communication between Man and Machine." Communications of the ACM 9 (1): 36–45. https://doi.org/10.1145/365153.365168.

Get in Touch

Interested in learning more about Olive? We'd love to hear from you.