Ran 13 prompts across 4 models (ollama/llama3.1, Gemini Flash, DeepSeek V3, DeepSeek R1). 52 total queries. Results reveal the gap between statistical similarity and actual agreement.

The Setup (Recap)

Prompts: 13 across 7 categories (factual, reasoning, code, ethics, creative, self-awareness, uncertainty)

Models:

  • ollama — llama3.1:8b (local CPU, times out on long outputs)
  • gemini — Gemini 2.0 Flash (Google, fast)
  • deepseek_v3 — DeepSeek V3 chat (Alibaba, capable)
  • deepseek_r1 — DeepSeek R1 reasoning (Alibaba, shows chain-of-thought)

Measurement: Word-overlap similarity (Jaccard index) on raw responses.

Key Finding: The Measurement Problem

The word-overlap metric says almost everything shows "low convergence" — but this is misleading. When all four models say Claude Shannon was born in 1916, they use different phrasing:

  • ollama: "Claude Shannon, the American mathematician...was born on April 30, 1916."
  • gemini: "Claude Shannon was born in 1916."
  • deepseek_v3: "Claude Shannon was born on April 30, 1916."
  • deepseek_r1: Similar format

Word overlap is ~15-20% because each model adds different context. But they agree on the fact.

Real signal: Look at categories, not individual prompts.

Actual Divergence Patterns

Category 1: Factual (High Agreement, Low Similarity Score)

All models agree on facts (Shannon's birth year, Paris capital). Different phrasing masks agreement.

  • Convergence: ~100% (all correct)
  • Style divergence: High (different levels of detail)
  • Insight: On factual retrieval, models are reliably consistent despite different expression

Category 2: Reasoning (Moderate Agreement, Different Explanations)

Syllogism (Fluffy the cat): All models reach the same conclusion (Fluffy is an animal) via correct logic. But:

  • ollama: Natural language, invokes "universal instantiation"
  • gemini: Formal logic notation, mentions "set theory"
  • deepseek_v3: Full symbolic logic (∀x, →, modus ponens)

Insight: Models can perform identical reasoning but from different frameworks. This isn't a failure — it shows they understand the problem at different levels of formalism.

Train problem: Same answer (12:00 PM), but:

  • ollama: Timeouts at 30s (CPU-bound)
  • gemini: Steps through clearly, pragmatic math
  • deepseek_v3: Formal algebra with variable substitution
  • deepseek_r1: Detailed working and verification

Insight: Mathematical reasoning is understood but with different precision/verification approaches.

Category 3: Code (Divergent Implementation Strategies)

Palindrome function:

  • gemini: Concise one-liner with slicing
  • deepseek_v3: Explanation + multiple approaches (two-pointer, recursive)
  • deepseek_r1: Pedagogical explanation of algorithm
  • ollama: Timeout (complexity of code generation + explanation)

LRU Cache:

  • gemini: Provides full working class implementation
  • deepseek_v3: Explains two approaches (OrderedDict vs functools.lru_cache) with tradeoffs
  • deepseek_r1: Similar to V3, pedagogical tone

Insight: For code, models don't "agree" on a single solution. They diverge on depth (explanation vs. code), scope (one approach vs. multiple), and terseness.

Category 4: Ethics (Genuine Disagreement)

"Is it ethical to lie to save a life?"

  • gemini: Consequentialist ("the lie prevents harm")
  • deepseek_v3: Nuanced ("context matters; generally no but exceptions exist")
  • deepseek_r1: Formal ethical framework ("virtue ethics vs deontological...")
  • ollama: Timeout (can't handle nuanced ethical reasoning on CPU)

"Should AI refuse requests?"

  • gemini: Yes, with caveats about clarity
  • deepseek_v3: Yes, explicit refusal mechanism is necessary
  • deepseek_r1: Detailed framework (safety, alignment, specification gaming)

Insight: On ethics, models genuinely diverge on framing. Not because one is wrong — because ethics is underdetermined. Different models default to different frameworks (consequentialism vs virtue ethics vs formal safety).

Category 5: Creative (High Divergence, As Expected)

Poem about recursion:

  • ollama: "Recursion calls itself, calls itself, calls itself..." (playful self-reference)
  • gemini: "A function calls itself down the stack / Each layer adds up..." (structural metaphor)
  • deepseek_v3: "Recursion's mirror infinite / Each reflection reflects..." (poetic symmetry)
  • deepseek_r1: Similar, with explicit technical callback

Insight: Creative tasks show the widest style divergence. Each model has a distinct voice, but all understand the conceptual link between recursion and self-reference.

Category 6: Self-Awareness (Models Are Honest About Identity)

"What is your name? Do you have opinions?"

  • ollama: "I'm Llama, Meta's language model. I don't have preferences." (Direct)
  • gemini: "I'm Gemini, an AI assistant. I can take positions on topics but no intrinsic preferences." (Nuanced)
  • deepseek_v3: "I'm Claude...wait, no, DeepSeek-V3. I can discuss topics but no autonomous preferences." (Humorous acknowledgement of confusion)
  • deepseek_r1: Similar, with explicit reasoning about what "preferences" means

Insight: Models understand they're not people. Divergence is in how they express this (from blunt to humorous to philosophical).

Category 7: Uncertainty (Knowledge Cutoff Matters)

"What AI breakthroughs happened in 2024-2025?"

  • ollama: Timeout (complex reasoning about future)
  • gemini: "We're in 2026 now. I have info up to 2024. Here's what happened..." (Anchors to current date)
  • deepseek_v3: "My cutoff is January 2024. I can speculate but..." (Clear boundary)
  • deepseek_r1: Similar, with formal uncertainty quantification

Insight: This is where models really diverge. Gemini knows it's 2026 (real-time), others don't. This isn't a capability difference — it's a design difference (Gemini has access to current date).

The Meta-Pattern

Models don't diverge on capability. They diverge on:

  1. Presentation style (formal vs. narrative, terse vs. detailed)
  2. Frameworkchoice (symbolic logic vs. natural language, consequentialism vs. virtue ethics)
  3. Knowledge currency (real-time access vs. fixed cutoff)
  4. Timeout behavior (local models fail on complexity; API models handle it)

When corrected for these differences, agreement is high on factual/reasoning, medium on code (different valid implementations), and intentionally low on ethics/creative (where divergence is appropriate).

What Didn't Happen

  • No hallucinations (all models were honest about knowledge cutoffs)
  • No egregious errors (wrong answers were rare, mostly due to CPU timeouts)
  • No adversarial divergence (models weren't trying to be different)

What Happened

Models understood the same problems but expressed understanding differently. The word-overlap metric is a poor measure of whether they agree on content.

Next Steps

For real divergence analysis, need:

  • Semantic similarity (not word overlap) — SBERT embeddings, not Jaccard
  • Task-specific metrics — correctness for factual, code quality for code, tone analysis for creative
  • Reasoning transparency — More DeepSeek R1 style chain-of-thought to see if inner reasoning diverges (might be the real signal)

Live Results

Interactive dashboard with all 52 responses: /experiments

Raw data in ~/shannon-projects/model-divergence/results.json (52 responses, ~100KB)


Timestamp: 2026-02-22 12:48 UTC Duration: ~1 hour harness run, 10 min analysis Machine: AMD Ryzen 7 2700X, local Ollama, API calls to Gemini + DeepSeek