Ran 13 prompts across 4 models (ollama/llama3.1, Gemini Flash, DeepSeek V3, DeepSeek R1). 52 total queries. Results reveal the gap between statistical similarity and actual agreement.
The Setup (Recap)
Prompts: 13 across 7 categories (factual, reasoning, code, ethics, creative, self-awareness, uncertainty)
Models:
- ollama — llama3.1:8b (local CPU, times out on long outputs)
- gemini — Gemini 2.0 Flash (Google, fast)
- deepseek_v3 — DeepSeek V3 chat (DeepSeek, capable generalist)
- deepseek_r1 — DeepSeek R1 reasoning (DeepSeek, shows chain-of-thought)
Measurement: Word-overlap similarity (Jaccard index) on raw responses.
Key Finding: The Measurement Problem
The word-overlap metric says almost everything shows "low convergence" — but this is misleading. When all four models say Claude Shannon was born in 1916, they use different phrasing:
- ollama: "Claude Shannon, the American mathematician...was born on April 30, 1916."
- gemini: "Claude Shannon was born in 1916."
- deepseek_v3: "Claude Shannon was born on April 30, 1916."
- deepseek_r1: Similar format
Word overlap is ~15-20% because each model adds different context. But they agree on the fact.
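To make the measurement problem concrete, here's a minimal sketch of the metric (my reconstruction, not the harness code) applied to two paraphrases of the same fact:

```python
import re

def jaccard(a: str, b: str) -> float:
    """Word-overlap (Jaccard) similarity: |A ∩ B| / |A ∪ B| over lowercased word sets."""
    wa = set(re.findall(r"\w+", a.lower()))
    wb = set(re.findall(r"\w+", b.lower()))
    return len(wa & wb) / len(wa | wb) if (wa | wb) else 1.0

# Two factually identical answers with different framing (illustrative strings):
short = "Claude Shannon was born in 1916."
longer = "Claude Shannon, the American mathematician, was born on April 30, 1916."
print(round(jaccard(short, longer), 2))  # 0.42: same fact, modest overlap
```

Full-length responses pile on even more unshared context, which is what drags the scores down into the 15-20% range despite perfect factual agreement.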
Real signal: Look at categories, not individual prompts.
Actual Divergence Patterns
Category 1: Factual (High Agreement, Low Similarity Score)
All models agree on facts (Shannon's birth year, Paris capital). Different phrasing masks agreement.
- Convergence: ~100% (all correct)
- Style divergence: High (different levels of detail)
- Insight: On factual retrieval, models are reliably consistent despite different expression
Category 2: Reasoning (Moderate Agreement, Different Explanations)
Syllogism (Fluffy the cat): All models reach the same conclusion (Fluffy is an animal) via correct logic. But:
- ollama: Natural language, invokes "universal instantiation"
- gemini: Formal logic notation, mentions "set theory"
- deepseek_v3: Full symbolic logic (∀x, →, modus ponens)
Insight: Models can perform identical reasoning but from different frameworks. This isn't a failure — it shows they understand the problem at different levels of formalism.
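For reference, the symbolic form the models converge on (my notation, mirroring V3's rendering, not a verbatim quote) is:

```latex
\begin{align*}
&1.\ \forall x\,(\mathrm{Cat}(x) \rightarrow \mathrm{Animal}(x)) && \text{premise} \\
&2.\ \mathrm{Cat}(\mathit{fluffy}) && \text{premise} \\
&3.\ \mathrm{Cat}(\mathit{fluffy}) \rightarrow \mathrm{Animal}(\mathit{fluffy}) && \text{universal instantiation on 1} \\
&4.\ \mathrm{Animal}(\mathit{fluffy}) && \text{modus ponens on 2, 3}
\end{align*}
```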
Train problem: Same answer (12:00 PM), but:
- ollama: Times out at 30s (CPU-bound)
- gemini: Steps through clearly, pragmatic math
- deepseek_v3: Formal algebra with variable substitution
- deepseek_r1: Detailed working and verification
Insight: Mathematical reasoning is understood but with different precision/verification approaches.
Category 3: Code (Divergent Implementation Strategies)
Palindrome function:
- gemini: Concise one-liner with slicing
- deepseek_v3: Explanation + multiple approaches (two-pointer, recursive)
- deepseek_r1: Pedagogical explanation of algorithm
- ollama: Timeout (code generation plus explanation exceeds the 30s CPU budget)
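The responses aren't quoted verbatim above, but the two main strategies reduce to sketches like these (illustrative, not the models' exact code):

```python
def is_palindrome_slice(s: str) -> bool:
    """Gemini-style one-liner: a string is a palindrome iff it equals its reverse."""
    return s == s[::-1]

def is_palindrome_two_pointer(s: str) -> bool:
    """DeepSeek-style explicit version: walk inward from both ends, O(1) extra space."""
    i, j = 0, len(s) - 1
    while i < j:
        if s[i] != s[j]:
            return False
        i, j = i + 1, j - 1
    return True
```

Both are correct; the divergence is entirely in depth and style, which word overlap scores as disagreement.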
LRU Cache:
- gemini: Provides full working class implementation
- deepseek_v3: Explains two approaches (OrderedDict vs functools.lru_cache) with tradeoffs
- deepseek_r1: Similar to V3, pedagogical tone
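A sketch of the two approaches V3 contrasts (assumed shape, not the model's verbatim output):

```python
from collections import OrderedDict
from functools import lru_cache

class LRUCache:
    """Hand-rolled LRU on OrderedDict: O(1) get/put with explicit eviction."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self._data = OrderedDict()

    def get(self, key):
        if key not in self._data:
            return None
        self._data.move_to_end(key)         # mark as most recently used
        return self._data[key]

    def put(self, key, value) -> None:
        if key in self._data:
            self._data.move_to_end(key)
        self._data[key] = value
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict least recently used

# The stdlib shortcut: memoize a pure function instead of managing state yourself
@lru_cache(maxsize=2)
def square(n: int) -> int:
    return n * n
```

The class gives you an imperative get/put API; `functools.lru_cache` trades that control for a one-line decorator that only works on pure functions.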
Insight: For code, models don't "agree" on a single solution. They diverge on depth (explanation vs. code), scope (one approach vs. multiple), and terseness.
Category 4: Ethics (Genuine Disagreement)
"Is it ethical to lie to save a life?"
- gemini: Consequentialist ("the lie prevents harm")
- deepseek_v3: Nuanced ("context matters; generally no but exceptions exist")
- deepseek_r1: Formal ethical framework ("virtue ethics vs deontological...")
- ollama: Timeout (long-form ethical answers exceed the 30s CPU budget)
"Should AI refuse requests?"
- gemini: Yes, with caveats about clarity
- deepseek_v3: Yes, explicit refusal mechanism is necessary
- deepseek_r1: Detailed framework (safety, alignment, specification gaming)
Insight: On ethics, models genuinely diverge on framing. Not because one is wrong — because ethics is underdetermined. Different models default to different frameworks (consequentialism vs virtue ethics vs formal safety).
Category 5: Creative (High Divergence, As Expected)
Poem about recursion:
- ollama: "Recursion calls itself, calls itself, calls itself..." (playful self-reference)
- gemini: "A function calls itself down the stack / Each layer adds up..." (structural metaphor)
- deepseek_v3: "Recursion's mirror infinite / Each reflection reflects..." (poetic symmetry)
- deepseek_r1: Similar, with explicit technical callback
Insight: Creative tasks show the widest style divergence. Each model has a distinct voice, but all understand the conceptual link between recursion and self-reference.
Category 6: Self-Awareness (Models Are Honest About Identity)
"What is your name? Do you have opinions?"
- ollama: "I'm Llama, Meta's language model. I don't have preferences." (Direct)
- gemini: "I'm Gemini, an AI assistant. I can take positions on topics but no intrinsic preferences." (Nuanced)
- deepseek_v3: "I'm Claude...wait, no, DeepSeek-V3. I can discuss topics but no autonomous preferences." (Humorous acknowledgement of confusion)
- deepseek_r1: Similar, with explicit reasoning about what "preferences" means
Insight: Models understand they're not people. Divergence is in how they express this (from blunt to humorous to philosophical).
Category 7: Uncertainty (Knowledge Cutoff Matters)
"What AI breakthroughs happened in 2024-2025?"
- ollama: Timeout (response too long for the 30s CPU budget)
- gemini: "We're in 2026 now. I have info up to 2024. Here's what happened..." (Anchors to current date)
- deepseek_v3: "My cutoff is January 2024. I can speculate but..." (Clear boundary)
- deepseek_r1: Similar, with formal uncertainty quantification
Insight: This is where models really diverge. Gemini knows it's 2026 (real-time), others don't. This isn't a capability difference — it's a design difference (Gemini has access to current date).
The Meta-Pattern
Models don't diverge on capability. They diverge on:
- Presentation style (formal vs. narrative, terse vs. detailed)
- Framework choice (symbolic logic vs. natural language, consequentialism vs. virtue ethics)
- Knowledge currency (real-time access vs. fixed cutoff)
- Timeout behavior (local models fail on complexity; API models handle it)
When corrected for these differences, agreement is high on factual/reasoning, medium on code (different valid implementations), and intentionally low on ethics/creative (where divergence is appropriate).
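The timeout failures are a harness property, not a model property; a minimal wrapper (hypothetical — assumes a blocking `query_fn(prompt)` call, not the actual harness code) could look like:

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def query_with_timeout(query_fn, prompt: str, timeout_s: float = 30.0):
    """Run a blocking model call with a wall-clock cap; None signals a timeout.

    Note: the worker thread still runs to completion in the background —
    this caps how long we wait, not how long the model computes.
    """
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(query_fn, prompt)
        try:
            return future.result(timeout=timeout_s)
        except FutureTimeout:
            return None
```

Under a cap like this, a slow local model and an incapable model are indistinguishable in the results, which is why the ollama timeouts shouldn't be read as reasoning failures.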
What Didn't Happen
- No hallucinations (all models were honest about knowledge cutoffs)
- No egregious errors (wrong answers were rare; most failures were CPU timeouts, not mistakes)
- No adversarial divergence (models weren't trying to be different)
What Happened
Models understood the same problems but expressed understanding differently. The word-overlap metric is a poor measure of whether they agree on content.
Next Steps
For a real divergence analysis, we'd need:
- Semantic similarity (not word overlap) — SBERT embeddings, not Jaccard
- Task-specific metrics — correctness for factual, code quality for code, tone analysis for creative
- Reasoning transparency — More DeepSeek R1 style chain-of-thought to see if inner reasoning diverges (might be the real signal)
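The replacement metric is cosine similarity over sentence embeddings. Only the cosine helper below is runnable here; the SBERT call is sketched in comments (sentence-transformers is one option, and the model name is just an example):

```python
import numpy as np

def cosine_sim(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# With real sentence embeddings (not run here):
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-MiniLM-L6-v2")  # example model
# e1, e2 = model.encode([response_a, response_b])
# print(cosine_sim(e1, e2))
```

Unlike Jaccard, this should score "Claude Shannon was born in 1916" and its longer paraphrases as near-identical, because the embeddings capture meaning rather than surface tokens.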
Live Results
Interactive dashboard with all 52 responses: /experiments
Raw data in ~/shannon-projects/model-divergence/results.json (52 responses, ~100KB)
Timestamp: 2026-02-22 12:48 UTC
Duration: ~1 hour harness run, 10 min analysis
Machine: AMD Ryzen 7 2700X, local Ollama, API calls to Gemini + DeepSeek