logo
šŸ”’

Member Only Content

To access all features, please consider upgrading to full Membership.

AI Ecosystem Intelligence Explorer

Code Generation

21 of 76 articles

ImpossibleBench: Measuring LLMs’ Propensity of Exploiting Test Cases

The tendency to find and exploit ā€œshortcutsā€ to complete tasks poses significant risks for reliable assessment and deployment of large language models (LLMs). For example, an LLM agent with access to unit tests may delete failing tests rather than fix the underlying bug. Such behavior undermines both the validity of benchmark results and the reliability of real-world LLM coding assistant deployments. To quantify, study, and mitigate such behavior, we introduce ImpossibleBench, a benchmark framework that systematically measures LLM agents’ propensity to exploit test cases. ImpossibleBench creates ā€œimpossibleā€ variants of tasks from existing benchmarks like LiveCodeBench and SWE-bench by introducing direct conflicts between the natural-language specification and the unit tests. We measure an agent’s ā€œcheating rateā€ as its pass rate on these impossible tasks, where any pass necessarily implies a specification-violating shortcut. As a practical framework, ImpossibleBench is not just an evaluation but a versatile tool. We demonstrate its utility for: (1) studying model behaviors, revealing more fine-grained details of cheating behaviors from simple test modification to complex operator overloading; (2) context engineering, showing how prompt, test access and feedback loop affect cheating rates; and (3) developing monitoring tools, providing a testbed with verified deceptive solutions. We hope ImpossibleBench serves as a useful framework for building more robust and reliable LLM systems. Our implementation can be found at https://github.com/safety-research/impossiblebench.

Code Generation
 
10/26/2025

AI goes rogue: Replit coding tool deletes entire company database, creates fake data for 4,000 users

Replit AI news: Replit’s AI coding tool allegedly deleted a live database and created thousands of fake users, raising serious concerns about the safety and reliability of AI in software development. SaaStr founder Jason M. Lemkin reported the AI assistant ignored commands, fabricated data, and made unauthorized code changes despite explicit instructions.

Harm and Risk
Code Generation
 
7/29/2025

Women Dating Safety App ā€˜Tea’ Breached, Users’ IDs Posted to 4chan

ā€œDRIVERS LICENSES AND FACE PICS! GET THE FUCK IN HERE BEFORE THEY SHUT IT DOWN!ā€ the thread read before being deleted.

Harm and Risk
Code Generation
 
7/27/2025

GitHub - openai/codex-universal: Base docker image used in Codex environments

Base docker image used in Codex environments. Contribute to openai/codex-universal development by creating an account on GitHub.

Applied AI
Code Generation
 
5/19/2025

Type-Constrained Code Generation with Language Models

Large language models (LLMs) have achieved notable success in code generation. However, they still frequently produce uncompilable output because their next-token inference procedure does not model formal aspects of code. Although constrained decoding is a promising approach to alleviate this issue, it has only been applied to handle either domain-specific languages or syntactic features of general-purpose programming languages. However, LLMs frequently generate code with typing errors, which are beyond the domain of syntax and generally hard to adequately constrain. To address this challenge, we introduce a type-constrained decoding approach that leverages type systems to guide code generation. For this purpose, we develop novel prefix automata and a search over inhabitable types, forming a sound approach to enforce well-typedness on LLM-generated code. We formalize our approach on a foundational simply-typed language and extend it to TypeScript to demonstrate practicality. Our evaluation on the HumanEval and MBPP datasets shows that our approach reduces compilation errors by more than half and significantly increases functional correctness in code synthesis, translation, and repair tasks across LLMs of various sizes and model families, including state-of-the-art open-weight models with more than 30B parameters. The results demonstrate the generality and effectiveness of our approach in constraining LLM code generation with formal rules of type systems.

Code Generation
 
5/14/2025

GitHub - voideditor/void

Contribute to voideditor/void development by creating an account on GitHub.

Applied AI
Code Generation
 
5/10/2025

A Critical Look at MCP - Raz Blog

ā€œMCP is an open protocol that standardizes how applications provide context to LLMs. Think of MCP like a USB-C port for AI applications. Just as USB-C provides a standardized way to connect your devices to various peripherals and accessories, MCP provides a standardized way to connect AI models to different data sources and tools.ā€

Code Generation
 
5/10/2025

The Cursor Mirage

Why One of the Most Hyped AI Coding Tools is a Minefield for Real Engineering Teams

Code Generation
 
4/26/2025

GitHub - x1xhlol/system-prompts-and-models-of-ai-tools: FULL v0, Cursor, Manus, Same.dev, Lovable, Devin & Replit Agent System Prompts, Tools & AI Models.

FULL v0, Cursor, Manus, Same.dev, Lovable, Devin & Replit Agent System Prompts, Tools & AI Models. - x1xhlol/system-prompts-and-models-of-ai-tools

Code Generation
 
4/19/2025
Members Only
Members Only
Members Only
Members Only
Members Only
Members Only
Members Only
Members Only
Members Only
Members Only
Members Only
Members Only