Tag

#software engineering

6 articles tagged #software engineering

arxiv · Jul 1 · bullish

AutoTrainess: Teaching Language Models to Improve Language Models Autonomously

arXiv:2606.31551v1 Announce Type: new Abstract: Training language models (LMs) remains a highly human-intensive process, even as frontier language model agents become increasingly capable at software engineering and other long-horizon tasks. A central challenge is that autonomous post-training is no

GP DE · 2 models · #autonomous training #language models #benchmark · Read on arxiv →

arxiv · Jun 12 · bullish

On Sequence-to-Sequence Models for Automated Log Parsing

arXiv:2602.07698v2 Announce Type: replace-cross Abstract: Context: Log parsing is a critical standard operating procedure in software systems, enabling monitoring, anomaly detection, and failure diagnosis. However, automated log parsing remains challenging due to heterogeneous log formats, distribut

TR MA LS · 5 models · +2 · #log parsing #sequence modelling #software engineering · Read on arxiv →

arxiv · May 11

The Single-File Test: A Longitudinal Public-Interface Evaluation of First-Output LLM Web Generation with Social Reach Tracking

arXiv:2605.06707v1 Announce Type: cross Abstract: This paper presents an eight-week observational comparison of 68 single-file HTML generations collected across 17 public experiments in the "HTML AI Battle" project between December 10, 2025 and February 4, 2026. Four reasoning model families, GPT, G

GP GE GR · 4 models · +1 · #software engineering #artificial intelligence #benchmark · Read on arxiv →

arxiv · May 1 · bearish

Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

arXiv:2604.28139v1 Announce Type: cross Abstract: LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, making it difficult

#benchmark #workflow #evaluation · Read on arxiv →

arxiv · Apr 24 · bullish

DryRUN: On the Role of Public Tests in LLM-Driven Code Generation

arXiv:2604.21598v1 Announce Type: cross Abstract: Multi-agent frameworks are widely used in autonomous code generation and have applications in complex algorithmic problem-solving. Recent work has addressed the challenge of generating functionally correct code by incorporating simulation-driven plan

DR CO · 2 models · #autonomous code generation #software engineering #large language models · Read on arxiv →

arxiv · Apr 16 · bullish

AnyPoC: Universal Proof-of-Concept Test Generation for Scalable LLM-Based Bug Detection

arXiv:2604.11950v1 Announce Type: cross Abstract: While recent LLM-based agents can identify many candidate bugs in source code, their reports remain static hypotheses that require manual validation, limiting the practicality of automated bug detection. We frame this challenge as a test generation t

CL CO · 2 models · #software engineering #bug detection #test generation · Read on arxiv →
