AI Ecosystem Intelligence Explorer
Reasoning models don’t always say what they think
Research from Anthropic on the faithfulness of AI models’ Chain-of-Thought
On the Biology of a Large Language Model
We investigate the internal mechanisms used by Claude 3.5 Haiku — Anthropic’s lightweight production model — in a variety of contexts, using our circuit tracing methodology.
The 2025 AI Index Report | Stanford HAI
Proof or Bluff? Evaluating LLMs on 2025 USA Math Olympiad
Recent math benchmarks for large language models (LLMs), such as MathArena, indicate that state-of-the-art reasoning models achieve impressive performance on mathematical competitions like AIME, with the leading model, o3-mini, achieving scores comparable to top human competitors. However, these benchmarks evaluate models solely on final numerical answers, neglecting the rigorous reasoning and proof generation that are essential for real-world mathematical tasks. To address this, we introduce the first comprehensive evaluation of full-solution reasoning for challenging mathematical problems. Using expert human annotators, we evaluated several state-of-the-art reasoning models on the six problems from the 2025 USAMO within hours of their release. Our results reveal that all tested models struggled significantly, scoring less than 5% on average. Through detailed analysis of reasoning traces, we identify the most common failure modes and find several unwanted artifacts arising from the optimization strategies employed during model training. Overall, our results suggest that current LLMs are inadequate for rigorous mathematical reasoning tasks, highlighting the need for substantial improvements in reasoning and proof generation capabilities.
Large Language Models Pass the Turing Test
We evaluated four systems (ELIZA, GPT-4o, LLaMa-3.1-405B, and GPT-4.5) in two randomised, controlled, and pre-registered Turing tests on independent populations. Participants held 5-minute conversations simultaneously with another human participant and one of these systems before judging which conversational partner they thought was human. When prompted to adopt a humanlike persona, GPT-4.5 was judged to be the human 73% of the time: significantly more often than interrogators selected the real human participant. LLaMa-3.1, with the same prompt, was judged to be the human 56% of the time, not significantly more or less often than the humans it was compared against, while the baseline models (ELIZA and GPT-4o) achieved win rates significantly below chance (23% and 21%, respectively). These results constitute the first empirical evidence that any artificial system passes a standard three-party Turing test, with implications for debates about what kind of intelligence Large Language Models (LLMs) exhibit and for the social and economic impacts these systems are likely to have.
GitHub - VAST-AI-Research/TripoSG: TripoSG: High-Fidelity 3D Shape Synthesis using Large-Scale Rectified Flow Models
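For readers unfamiliar with the technique named in the title, the sketch below illustrates the generic rectified-flow training objective in PyTorch: learn a velocity field along the straight-line path between noise and data, with the constant velocity as the regression target. This is a minimal, generic illustration under standard rectified-flow definitions; the toy `VelocityNet` and all names here are our placeholders, not TripoSG's actual architecture or code.

```python
# Minimal rectified-flow training step (illustrative sketch, not TripoSG's code).
# Rectified flow learns a velocity field v_theta(x_t, t) along the straight line
# x_t = (1 - t) * x0 + t * x1 between noise x0 and data x1; the regression
# target is the constant velocity x1 - x0.
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Toy velocity field; a real model would be a large transformer or U-Net."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 256), nn.SiLU(), nn.Linear(256, dim))

    def forward(self, x_t: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([x_t, t], dim=-1))  # condition on time by concatenation

def rectified_flow_loss(model: VelocityNet, x1: torch.Tensor) -> torch.Tensor:
    x0 = torch.randn_like(x1)         # noise endpoint of the path
    t = torch.rand(x1.shape[0], 1)    # uniform time in [0, 1]
    x_t = (1 - t) * x0 + t * x1       # point on the straight-line interpolation
    target = x1 - x0                  # constant velocity along that line
    return ((model(x_t, t) - target) ** 2).mean()

model = VelocityNet(dim=8)
loss = rectified_flow_loss(model, torch.randn(32, 8))
loss.backward()
```

Sampling then integrates the learned ODE dx/dt = v_theta(x, t) from noise at t = 0 to data at t = 1, often with only a handful of Euler steps, since the learned paths are nearly straight.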
DeepMesh: Auto-Regressive Artist-mesh Creation with Reinforcement Learning
tencent/Hunyuan3D-2mv · Hugging Face
Transformers from scratch
Let’s build a Transformer neural network from scratch, together!
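As a taste of what such a from-scratch walkthrough builds, here is a minimal single-head scaled dot-product self-attention in NumPy. This is an illustrative sketch under the standard Transformer definitions, not the linked post's actual code; all function and variable names are our own.

```python
# Single-head scaled dot-product self-attention, the core block that
# "Transformer from scratch" tutorials typically start with.
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x: np.ndarray, Wq: np.ndarray, Wk: np.ndarray, Wv: np.ndarray) -> np.ndarray:
    """x: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_k) projection matrices."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])  # scaled pairwise dot products
    return softmax(scores) @ v               # attention-weighted sum of values

rng = np.random.default_rng(0)
d_model, d_k, seq_len = 16, 8, 5
x = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)  # (5, 8)
```

Wrapping this in multi-head projections and adding residual connections, layer normalisation, and a feed-forward block yields a full Transformer layer, which is the direction such tutorials usually take from here.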