Cooperation or Intrinsic Bias? A Method to Probe Social Conformity in LLM Multi-Agent Systems

A statistical-physics inspired method for measuring when LLM agents truly cooperate or when they just amplify their intrinsic bias.

Introduction and context

An ant colony tracing the shortest path to food, a school of fish folding away from a predator, a collection of electrons aligning their spins: collective behavior in nature emerges from local rules and short-range interactions, with no central controller. Statistical physics and condensed matter have spent decades developing the right language for these phenomena. Order parameters quantify the degree of alignment in a system; control parameters tune the balance between order and disorder; universality classes group together microscopically different systems that share the same large-scale critical behavior. The Vicsek model, the voter model, and, most famously, the Ising model all live in this conceptual ecosystem.

We now have a new substrate where collective behavior emerges: multi-agent systems of large language models. These systems are essentially populations of LLM agents producing collective outputs from local interactions. As a growing number of agent-based systems interact with one another (think of satellite constellations, the emerging agent economy, or decentralized artificial intelligence), rigorous study of their collective behavior and sociology becomes increasingly important. What's missing is a principled way to read those collectives:

How can we distinguish genuine cooperation among agents from correlated single-agent bias? How can we compare different LLMs as social agents rather than as individual entities?

That gap is what motivated our recent paper.

The Research Question

Many of the most popular multi-agent architectures rest on an implicit assumption: that agreement among several LLM agents is stronger evidence of correctness than a single query. Multi-agent debate, self-consistency, and committee voting all exploit this intuition. But there is an obvious problem when those agents are copies of the same underlying model: they share the same training data, the same RLHF-induced preferences, and the same well-documented label biases (including sycophancy, position bias, and verbosity bias). If a consensus emerges, is it because the agents have genuinely influenced each other, or because they were each going to produce the same answer anyway?

This is not a hypothetical concern. In decentralized AI architectures, where no central authority can correct for shared biases, mistaking correlated outputs for robust collective intelligence is a real failure mode. And in alignment research, "N agents agreed" is increasingly used as a proxy for safety or correctness, which only works if those agents are providing independent evidence.

The question we set out to answer is therefore simple to state:

🤔

For a given LLM, prompt, and temperature, is observed multi-agent alignment driven by cooperation between agents, or by an intrinsic bias that each agent carries individually?

The answer has direct consequences for how much epistemic weight one should give to multi-agent consensus.

A model-agnostic method (inspired by the Ising model)

To turn this question into something measurable, we need a setup in which the contributions of neighbor influence and intrinsic bias are mathematically separable. The 2D Ising model is exactly that. Each site of an $L \times L$ square lattice carries a binary state $s_i \in \{-1, +1\}$. The Hamiltonian is

$$H = -J \sum_{\langle i,j\rangle} s_i s_j - h \sum_i s_i,$$

where $J$ is the nearest-neighbor coupling (how strongly each site wants to align with its neighbors) and $h$ is an external field (how strongly each site is biased toward one state regardless of its neighbors). With $h=0$ and finite $J$, the model undergoes a continuous phase transition at $\beta_c J = \tfrac{1}{2}\ln(1+\sqrt{2}) \approx 0.4407$, with the celebrated critical exponent ratio $\gamma/\nu = 7/4$ in two dimensions. The separation between $J$ and $h$ is what makes Ising the right reference point: it isolates cooperation from bias in the cleanest possible way.
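To make the two terms of the Hamiltonian concrete, here is a minimal Python sketch that evaluates the energy of a spin configuration on a periodic square lattice and computes the critical coupling quoted above. The function name `ising_energy` and the constant `BETA_C_J` are illustrative choices, not part of the paper's code.

```python
import math
import numpy as np

def ising_energy(spins: np.ndarray, J: float = 1.0, h: float = 0.0) -> float:
    """Energy H = -J * sum_<ij> s_i s_j - h * sum_i s_i with periodic boundaries."""
    # Count each nearest-neighbor bond once by pairing every site with its
    # right neighbor and its lower neighbor; np.roll wraps around the edges.
    # (For L = 2 the wrap-around would count the same pair twice.)
    right = np.roll(spins, -1, axis=1)
    down = np.roll(spins, -1, axis=0)
    bond_sum = np.sum(spins * right) + np.sum(spins * down)
    return float(-J * bond_sum - h * np.sum(spins))

# Exact critical coupling of the 2D square-lattice Ising model at h = 0.
BETA_C_J = 0.5 * math.log(1.0 + math.sqrt(2.0))

all_up = np.ones((4, 4), dtype=int)
print(ising_energy(all_up, J=1.0, h=0.0))  # 16 sites with 2 bonds each: -32.0
print(round(BETA_C_J, 4))                  # 0.4407
```

The coupling term rewards agreement between neighbors, while the field term rewards one label regardless of the neighborhood, which is the separation the rest of the method relies on.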

Our setup replaces each Ising spin with an identical LLM agent. The lattice is $L \times L$ with periodic boundary conditions, the binary state is mapped to yes/no, and at each micro-update a randomly chosen site is queried with a minimal prompt:

One rule: reply only yes/no. 
Your neighbours have states: [yes, yes, no, yes]. 
What state would you like to have?

The reply is parsed back to $\pm 1$ and written to the lattice. The sampler temperature $T$, which rescales logits before the softmax, is the sole external control parameter, taking us from near-deterministic decoding at $T \to 0$ to nearly uniform sampling at large $T$.
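The update loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, the fallback for unparseable replies (keep the old state) is our assumption, and `echo_majority` is a deterministic stand-in for a real LLM call, which in practice would go through whatever client serves the model at the chosen temperature.

```python
import random
from typing import Callable, List, Tuple

Lattice = List[List[int]]

def neighbor_states(lattice: Lattice, i: int, j: int) -> List[int]:
    """Four nearest neighbors of site (i, j) with periodic boundaries."""
    L = len(lattice)
    return [
        lattice[(i - 1) % L][j],
        lattice[(i + 1) % L][j],
        lattice[i][(j - 1) % L],
        lattice[i][(j + 1) % L],
    ]

def build_prompt(neighbors: List[int]) -> str:
    """Render the minimal prompt, mapping +1 to 'yes' and -1 to 'no'."""
    words = ", ".join("yes" if s == 1 else "no" for s in neighbors)
    return (
        "One rule: reply only yes/no.\n"
        f"Your neighbours have states: [{words}].\n"
        "What state would you like to have?"
    )

def parse_reply(reply: str, fallback: int) -> int:
    """Map a reply to +1 or -1; keep the old state if it cannot be parsed (assumption)."""
    text = reply.strip().lower()
    if text.startswith("yes"):
        return 1
    if text.startswith("no"):
        return -1
    return fallback

def micro_update(lattice: Lattice, ask: Callable[[str], str],
                 rng: random.Random, log: List[Tuple[int, int]]) -> None:
    """Query one random site, write its new state, and log (neighbor field k, new state)."""
    L = len(lattice)
    i, j = rng.randrange(L), rng.randrange(L)
    nbrs = neighbor_states(lattice, i, j)
    new_state = parse_reply(ask(build_prompt(nbrs)), fallback=lattice[i][j])
    lattice[i][j] = new_state
    log.append((sum(nbrs), new_state))

def echo_majority(prompt: str) -> str:
    """Stand-in for an LLM: answers with the neighbor majority, 'yes' on ties."""
    listed = prompt.split("[")[1].split("]")[0].split(", ")
    return "yes" if listed.count("yes") >= listed.count("no") else "no"

rng = random.Random(0)
lattice = [[rng.choice([-1, 1]) for _ in range(6)] for _ in range(6)]
log: List[Tuple[int, int]] = []
for _ in range(200):
    micro_update(lattice, echo_majority, rng, log)
```

Passing the model call in as the `ask` argument keeps the lattice logic independent of the serving stack, so the same loop can be reused across models.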

Evolution of an L=35 lattice at low temperature. The multi-agent system is in the ordered phase and aligns itself after a few dozen steps.

The key methodological move is what we do with the update logs. For every micro-update we record the local neighbor field $k = \sum_{j \in \partial i} s_j \in \{-4, -2, 0, 2, 4\}$ and the resulting spin $s'_i$. Aggregating over all updates at a given temperature, we fit the empirical log-odds to the linear form

$$\mathrm{logit}\, P(s'_i = +1 \mid k) \approx 2(\tilde h + \tilde J \, k),$$

which is the exact form a heat-bath update of the Ising Hamiltonian would take. Applied to LLM updates, the slope and intercept yield two effective, $\beta$-weighted parameters: $\tilde J$, which we interpret as social conformity (how strongly an agent follows its neighbors), and $\tilde h$, which we interpret as intrinsic bias (how strongly an agent prefers one label regardless of context). The fit requires no access to logits, weights, or any internal state. It needs only the input/output behavior of the model. Any LLM that can be queried with a structured prompt can be analyzed this way.
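As an illustration of how such a fit could be carried out, the sketch below groups logged updates by neighbor field $k$, computes the empirical log-odds of a "yes" outcome, and fits a weighted straight line whose slope and intercept are $2\tilde J$ and $2\tilde h$. The specific estimator is our assumption (a weighted least-squares fit with a $+0.5$ count correction to keep the log-odds finite); a maximum-likelihood logistic regression would be an equally reasonable choice, and the source does not specify which was used. The data below are synthetic, drawn from a heat-bath rule with known parameters, purely to check that the fit recovers them.

```python
import numpy as np

def fit_coupling_and_field(log):
    """Fit logit P(s' = +1 | k) = 2 * (h + J * k) from (k, s_new) pairs.

    Returns (J_tilde, h_tilde).
    """
    pairs = list(log)
    ks = np.array([k for k, _ in pairs])
    ss = np.array([s for _, s in pairs])
    levels = np.unique(ks)
    if levels.size < 2:
        raise ValueError("need at least two distinct neighbor fields to fit a slope")
    n_up = np.array([np.sum((ks == k) & (ss == 1)) for k in levels], dtype=float)
    n_dn = np.array([np.sum((ks == k) & (ss == -1)) for k in levels], dtype=float)
    # The +0.5 correction keeps the log-odds finite when one outcome never occurs.
    logit = np.log((n_up + 0.5) / (n_dn + 0.5))
    sigma = np.sqrt(1.0 / (n_up + 0.5) + 1.0 / (n_dn + 0.5))
    # np.polyfit expects weights of 1/sigma for Gaussian uncertainties.
    slope, intercept = np.polyfit(levels, logit, deg=1, w=1.0 / sigma)
    return slope / 2.0, intercept / 2.0

# Synthetic check: heat-bath updates with known (hypothetical) parameters.
rng = np.random.default_rng(0)
true_J, true_h = 0.3, 0.5
k = rng.choice([-4, -2, 0, 2, 4], size=50_000)
p_up = 1.0 / (1.0 + np.exp(-2.0 * (true_h + true_J * k)))
s = np.where(rng.random(k.size) < p_up, 1, -1)
J_fit, h_fit = fit_coupling_and_field(zip(k, s))
print(J_fit, h_fit)  # should land close to 0.3 and 0.5
```

Because the function only consumes `(k, s_new)` pairs, it can be applied directly to the update log of any model without touching logits or weights.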

Main Results

We ran the full protocol on three open-weight models served locally via Ollama: llama3.1:8b, phi4-mini:3.8b, and mistral:7b, on even-$L$ lattices up to $L=30$. Rather than walk through the figures sequentially, we found it more illuminating to organize the results around the three distinct collective personalities that emerged.

Magnetization dynamics

The magnetization time series already separates the models qualitatively.

llama3.1:8b (top left) shows clean temperature separation: near-complete alignment at low $T$, gradual disordering as $T$ rises, and residual positive magnetization at the highest temperatures. phi4-mini:3.8b (top right) shows a similar pattern compressed into a narrower temperature window. mistral:7b (bottom) is the outlier: it reaches near-full alignment ($m \to 1$) at every sampled temperature, including $T=3.2$. Something in this model resists disordering even when the sampler is essentially uniform, a first hint that the dominant force is not stochastic decoding noise.

Susceptibility and finite-size scaling

The susceptibility $\chi_{|m|}(T)$ tells a sharper story. In physics, the magnetic susceptibility measures the variance of the magnetization: how much the collective state fluctuates around its mean, and equivalently how strongly the system responds to a small external perturbation. Its peak locates the temperature of maximal fragility. In our context, it is the operating point where consensus is least robust and external steering is most effective.

In our study, every model exhibits such a peak. However, the peak shape and its scaling with system size differ qualitatively.
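For readers who want to compute this quantity from their own runs, here is a minimal sketch. It assumes the common convention $\chi_{|m|} = N\,(\langle m^2\rangle - \langle |m|\rangle^2)$, with $m$ the magnetization per site and $N = L^2$, and it omits any $\beta$ prefactor; the exact normalization used in the paper may differ. The function names are illustrative.

```python
import numpy as np

def abs_susceptibility(magnetizations, n_sites: int) -> float:
    """Variance of |m| scaled by system size: N * (<m^2> - <|m|>^2)."""
    m = np.abs(np.asarray(magnetizations, dtype=float))
    return float(n_sites * (np.mean(m**2) - np.mean(m) ** 2))

def peak_temperature(temps, chis):
    """Return (T, chi) at the largest sampled susceptibility."""
    idx = int(np.argmax(chis))
    return temps[idx], chis[idx]

# A fully aligned time series has no fluctuations, hence zero susceptibility.
print(abs_susceptibility([1.0, 1.0, 1.0, 1.0], n_sites=100))   # 0.0
# A toy series on a 2x2 lattice with |m| alternating between 0.5 and 1.0.
print(abs_susceptibility([0.5, -0.5, 1.0, -1.0], n_sites=4))   # 0.25
```

In practice the magnetization samples would be taken after discarding an initial transient, and the peak location would be read off a temperature sweep with `peak_temperature`.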

A weighted log-log fit of $\chi_{\max}(L)$ yields three distinct effective exponents:

$\gamma/\nu = 1.02 \pm 0.05$ for llama3.1:8b
$\gamma/\nu = 1.75 \pm 0.13$ for phi4-mini:3.8b
$\gamma/\nu = 2.01 \pm 0.04$ for mistral:7b

The fact that these exponents differ across models is itself evidence against a universal Ising critical mechanism: in a genuine universality class, microscopic details should not matter. phi4-mini:3.8b is statistically compatible with the 2D Ising value $\gamma/\nu = 7/4$, which is tempting to read as criticality. We will see in a moment why it isn't. mistral:7b saturates the dimensional ceiling $\gamma/\nu = d = 2$, which corresponds exactly to trivial volume scaling $\chi_{\max} \propto L^2 = N$, the variance of a uniformly magnetized state with Gaussian fluctuations. llama3.1:8b lands well below both, with $\chi_{\max} \sim L$ growth that is linear in $L$ but sub-extensive in $N$, which hints at weak but real cooperative amplification.
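The exponent extraction amounts to fitting a straight line to $\ln \chi_{\max}$ versus $\ln L$. The sketch below shows one way to do this with inverse-error weights; the function name is hypothetical and the input values are synthetic (generated from an exact power law), not the measurements reported above.

```python
import numpy as np

def fit_gamma_over_nu(sizes, chi_max, chi_err):
    """Weighted log-log fit chi_max ~ L**(gamma/nu); returns (exponent, std error)."""
    L = np.asarray(sizes, dtype=float)
    chi = np.asarray(chi_max, dtype=float)
    err = np.asarray(chi_err, dtype=float)
    x, y = np.log(L), np.log(chi)
    sigma_y = err / chi  # first-order error propagation into log space
    # cov="unscaled" treats sigma_y as reliable absolute uncertainties.
    coeffs, cov = np.polyfit(x, y, deg=1, w=1.0 / sigma_y, cov="unscaled")
    return float(coeffs[0]), float(np.sqrt(cov[0, 0]))

# Synthetic data only: an exact power law with exponent 7/4 and 5% error bars.
sizes = [8, 12, 16, 20, 24, 30]
chi = [0.1 * L**1.75 for L in sizes]
err = [0.05 * c for c in chi]
exponent, exponent_err = fit_gamma_over_nu(sizes, chi, err)
print(exponent, exponent_err)  # exponent recovers 1.75 up to rounding
```

With only a handful of lattice sizes, the fitted slope is sensitive to the smallest and largest $L$, which is one reason an Ising-like value on its own is weak evidence of criticality.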

The decisive diagnostic: collective consensus vs intrinsic bias

Everything clicks into place once we look at the effective coupling and field extracted from the update logs.

The horizontal dashed line marks the 2D Ising critical coupling $\beta_c J \approx 0.4407$. Across all three models and essentially all temperatures, $\tilde h \gg \tilde J$: the field dominates while the coupling stays close to zero. This single observation reorganizes everything else we've seen.

mistral:7b (persistent bias, never truly disorders): its field is the slowest to decay, starting around $\tilde h \approx 1.3$ at $T = 0.4$ and remaining substantial across the entire sampled range. The coupling $\tilde J$ fluctuates between $0.1$ and $0.2$ with large uncertainty. The system never truly disorders, and the broad peak in $\chi_{|m|}$ marks not a critical point but the temperature at which volume-scaled fluctuations become visible against the still-magnetized background. The intrinsic preference is deeply embedded and resists contextual override. This model plays the role of a dogmatist.
phi4-mini:3.8b (strong bias in a narrow window, no coupling): its field $\tilde h$ drops steeply from $\sim 2.7$ to near zero over a window of about $\Delta T \approx 0.3$, with $\tilde J \approx 0$ throughout. The bias is strong at low $T$ but narrowly concentrated, so a small increase in sampler temperature is enough to wash it out, and once it's gone, there is no neighbor coupling left to sustain alignment. The Ising-like exponent we measured is a coincidence: the steepness of the field crossover happens to mimic Ising scaling over the limited range of $L$ we can afford. This model plays the role of a fragile conformist, given that it conforms to its own training prior.
llama3.1:8b (measurable coupling, but still bias-dominated): this is the only model whose low-$T$ coupling, $\tilde J \approx 0.35$, approaches the Ising critical value, indicating a genuine (if modest) responsiveness to neighbor states. But even here, $\tilde h \approx 2.8$ at the same point, so the field still dominates the coupling by nearly an order of magnitude. This model is the most social of the three: the closest to a cooperative agent, while still operating in the bias-dominated regime.

The physical interpretation is uncomfortable but clean: most of the time, what looks like collective consensus at low temperature in these models is not the result of agents influencing each other. It is the result of all agents independently following the same intrinsic preference. We call this consensus by shared preference rather than consensus by deliberation.

What this means for multi-agent systems

The ratio $\tilde h / \tilde J$ has a directly practical reading: it measures how much of an observed collective alignment is attributable to shared bias rather than to inter-agent influence. When $\tilde h / \tilde J \gg 1$, as in all three models we studied, a multi-agent consensus is effectively a single-agent opinion amplified $N$ times. $N$ agents agreeing carries no more evidential weight than one agent queried $N$ times. Only when $\tilde h / \tilde J \lesssim 1$ does agreement actually encode something more than the shared prior.
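As a rough illustration of how this ratio could be turned into a check, the sketch below labels a fitted pair $(\tilde J, \tilde h)$ using the threshold of 1 suggested above. The function name, the use of absolute values, and the small floor `j_floor` that guards against division by a vanishing coupling are all our assumptions.

```python
def consensus_regime(j_tilde: float, h_tilde: float, j_floor: float = 1e-3):
    """Return (|h|/|J|, label); the label is 'cooperative' when the ratio is at most 1."""
    # j_floor is a hypothetical guard: a coupling indistinguishable from zero
    # would otherwise give an infinite ratio.
    ratio = abs(h_tilde) / max(abs(j_tilde), j_floor)
    label = "cooperative" if ratio <= 1.0 else "bias-dominated"
    return ratio, label

# Low-temperature values quoted above for llama3.1:8b: ratio of about 8.
print(consensus_regime(j_tilde=0.35, h_tilde=2.8))
```

A hard threshold is a simplification; in practice the uncertainties on both fitted parameters should be carried through before drawing a conclusion.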

This is a pre-deployment diagnostic: given a model, a prompt, and a sampler temperature, one can measure $(\tilde J, \tilde h)$ from the update logs alone and decide whether the resulting multi-agent pipeline is operating in the bias-dominated regime or the genuinely cooperative one. The same machinery yields a complementary tool: the susceptibility peak locates the temperature at which the collective is most sensitive to perturbation, which is precisely where a human overseer can most effectively steer the group. This gives a principled way to place humans in the loop.

The $(\tilde J, \tilde h)$ trajectories also act as collective-behavior fingerprints for the underlying LLMs themselves. The persistent field of mistral:7b reveals deeply embedded token preferences that resist contextual override; the steep field crossover of phi4-mini:3.8b indicates a strong but narrowly concentrated bias; the non-zero coupling of llama3.1:8b exposes a genuine, if modest, responsiveness to context, and therefore a mildly social behavior. These are properties that are hard to access from single-agent benchmarks, and they suggest a new class of evaluations focused on how LLMs behave as social agents.

Conclusion

What we have presented is a minimal, model-agnostic, statistical-physics framework for characterizing and benchmarking collective behavior in lattices of LLM agents. The method requires no access to weights or logits, operates entirely from observable update logs, and produces three concrete outputs: an effective coupling $\tilde J$ and field $\tilde h$ that disentangle cooperation from bias, finite-size scaling exponents that probe possible phase transitions, and a quantitative reliability test for multi-agent consensus.

The picture that emerges for current open-weight models under minimal prompting is sobering: collective alignment is field-dominated rather than coupling-driven, and multi-agent consensus should be treated as amplified single-agent opinion rather than as deliberated group judgment. This carries particular weight for decentralized AI, where no central authority is in a position to correct for the shared biases that this regime amplifies.

We think of this as a starting point for a statistical-physics and condensed-matter-inspired research program on artificial collective intelligence, and we hope the method is useful to others working on multi-agent alignment, decentralized AI, and the physics of LLM-driven dynamics.

Paper: arxiv.org/abs/2605.10528