Will Scaling Improve Social Simulation with LLMs?
English summary
Researchers fit scaling laws to 85 Qwen3-based transformer LLMs pretrained on DCLM web text at fixed compute budgets from 1e18 to 1e20 FLOPs, and evaluated 35 larger open-weight models of up to 70B parameters, to study how compute scale affects social simulation fidelity. They found strong compute scaling for opinion modeling and behavioral simulation tasks, especially for populations well represented in English web corpora. Longitudinal forecasting and the modeling of underrepresented opinions scale more slowly and correlate less with general benchmarks such as MMLU. Scaling does not bring models closer to human cognitive biases such as risk aversion, and models from 0.5B to 8B parameters fine-tuned on these tasks show no gain with size. The authors conclude that scaling will benefit most social simulation settings but will be unreliable for low-resource domains and certain cognitive heuristics.
Key points
Strong compute scaling was observed for opinion and behavioral simulation tasks, particularly for populations prevalent in English web text.
Longitudinal forecasting and underrepresented opinions scale more slowly and are less correlated with general knowledge benchmarks like MMLU.
Scaling fails to bring LLMs closer to human cognitive biases such as risk aversion, even when models from 0.5B to 8B parameters are fine-tuned.
Models fine-tuned on these bias tasks show no scaling trend, indicating limitations beyond general capability.
Scale will improve social simulations in most settings but will be unreliable in low-resource domains and for certain cognitive heuristics.
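Illustrative sketch
The summary above does not state which functional form the scaling laws take. As a rough illustration, the Python sketch below assumes a simple power law, error ≈ a · C^(-b), fitted by linear regression in log-log space. The function names, the error values, and the 1e22 FLOPs extrapolation target are hypothetical and synthetic, chosen only to contrast a task that scales with compute against one that shows no trend; none of them come from the study.

```python
import numpy as np


def fit_power_law(compute_flops, error):
    """Fit error ~= a * compute**(-b) by least squares in log-log space.

    Assumed functional form; the summarized study may use a different one.
    Returns (a, b), where b > 0 means error falls as compute grows.
    """
    log_c = np.log10(np.asarray(compute_flops, dtype=float))
    log_e = np.log10(np.asarray(error, dtype=float))
    slope, intercept = np.polyfit(log_c, log_e, deg=1)
    return 10.0 ** intercept, -slope


def predict_error(a, b, compute_flops):
    """Evaluate the fitted power law at new compute budgets."""
    return a * np.asarray(compute_flops, dtype=float) ** (-b)


# Compute budgets spanning the range named in the summary (1e18 to 1e20 FLOPs).
budgets = np.logspace(18, 20, num=5)

# SYNTHETIC data, for illustration only (not results from the paper).
# A task whose error falls with compute, e.g. opinion modeling:
scaling_error = 50.0 * budgets ** (-0.1)
# A task whose error stays flat, e.g. matching a human cognitive bias:
flat_error = np.full_like(budgets, 0.4)

a_scaling, b_scaling = fit_power_law(budgets, scaling_error)
a_flat, b_flat = fit_power_law(budgets, flat_error)

# Extrapolate the scaling task to a larger (hypothetical) budget.
extrapolated = float(predict_error(a_scaling, b_scaling, 1e22))

print(f"scaling task exponent b = {b_scaling:.3f}")
print(f"flat task exponent b = {b_flat:.3f}")
print(f"extrapolated error at 1e22 FLOPs = {extrapolated:.3f}")
```

Under this assumed form, a fitted exponent near zero corresponds to the "no scaling trend" case in the key points, while a clearly positive exponent corresponds to tasks that benefit from more compute.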