arxiv · 6d ago

Skillware: A Software Ontology and Engineering Lifecycle for Persistent Behavioral Artifacts

arXiv:2607.18970v1 Announce Type: cross Abstract: Agent Skills have become persistent behavioral artifacts across independent AI agent systems. They combine natural-language task specifications with metadata and optional references, scripts, assets, hooks, package manifests, tests, and companion int

#software-engineering #artificial-intelligence #agent-systems

arxiv · Jul 10 · bullish

Simulator Ensembles for Trustworthy Autonomous Driving Systems Testing

arXiv:2503.08936v3 Announce Type: replace-cross Abstract: Scenario-based testing with driving simulators is extensively used to identify failing conditions of automated driving assistance systems (ADAS). However, existing studies have shown that repeated test execution in the same as well as in dist

#testing #simulators #adas

arxiv · Jun 19

A Model-Driven Approach for Developing Families of Reinforcement Learning Environments

arXiv:2606.20324v1 Announce Type: cross Abstract: Virtual training environments are software-intensive systems in which reinforcement learning (RL) agents learn, adapt, and demonstrate meaningful behavior. Virtual training environments offer a safe and cost-efficient alternative to training agents i

#reinforcement-learning #software-engineering #model-driven-development

arxiv · Jun 18

SWE-Future: Forecast-Conditioned Data Synthesis for Future-Oriented Software Engineering Agents

arXiv:2606.18733v1 Announce Type: cross Abstract: Realistic coding-agent benchmarks often replay public GitHub issues and pull requests, making them vulnerable to overlap with model pretraining, fine-tuning, synthetic-data generation, or benchmark-driven model selection. Fully synthetic tasks avoid

#software-engineering #artificial-intelligence #benchmark

arxiv · Jun 17 · bullish

LLM-as-Judge in Education: A Curriculum-Grounded Marking Pipeline

arXiv:2606.17507v1 Announce Type: new Abstract: Generative AI and large language models (LLMs) are increasingly applied to question generation and automated assessment. However, deploying LLMs in preparation for high-stakes exams requires more than prompt engineering; it demands software pipelines t

#education #assessment #language-models

arxiv · Jun 3 · bullish

VulnAgent-R2: Evidence-Calibrated Multi-Agent Auditing for Repository-Level Vulnerability Detection

arXiv:2603.13384v2 Announce Type: replace-cross Abstract: Software vulnerabilities often depend on cross-file data flow, build options, framework conventions, and runtime guards, so isolated function classifiers produce fragile and poorly calibrated warnings. Repository-level LLM agents can gather r

#vulnerability-detection #software-engineering #auditability

arxiv · May 14 · bullish

HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench

arXiv:2601.20255v2 Announce Type: replace-cross Abstract: SWE-bench has emerged as the premier benchmark for evaluating Large Language Models on complex software engineering tasks. While these capabilities are fundamentally acquired during the mid-training phase and subsequently elicited during Supe

#benchmark #software-engineering #large-language-models

arxiv · May 8

BUILD-AND-FIND: An Effort-Aware Protocol for Evaluating Agent-Managed Codebases

arXiv:2605.06136v1 Announce Type: cross Abstract: Most coding-agent benchmarks ask whether generated code behaves correctly. That remains essential, but repository-level engineering is increasingly agent-managed: one agent writes a repository, and later agents inspect, audit, or extend it as working

#benchmark #software-engineering #artificial-intelligence

arxiv · May 7 · bullish

AI Advocate: Educational Path to Transform Squads to the Future

arXiv:2605.03800v1 Announce Type: cross Abstract: This paper analyzes the strategic education process aimed at transitioning traditional software development squads into hybrid structures centered on collaborative work between humans and Artificial Intelligence (AI). In a context where human-AI coll

#collaboration #education #software-engineering

arxiv · May 6

Using LLMs in Software Design: An Empirical Study of GitHub and A Practitioner Survey

arXiv:2605.01392v1 Announce Type: cross Abstract: Recent advancements in Large Language Models (LLMs) have demonstrated significant potential across a wide range of software engineering tasks, including software design, an area traditionally regarded as highly dependent on human expertise and judgme

#software-engineering #large-language-models #design

arxiv · Apr 23 · bullish

Coding with Eyes: Visual Feedback Unlocks Reliable GUI Code Generating and Debugging

arXiv:2604.19750v1 Announce Type: cross Abstract: Recent advances in Large Language Model (LLM)-based agents have shown remarkable progress in code generation. However, current agent methods mainly rely on text-output-based feedback (e.g. command-line outputs) for multi-round debugging and struggle

#gui #debugging #benchmark

arxiv · Apr 17 · bullish

Contract-Coding: Towards Repo-Level Generation via Structured Symbolic Paradigm

arXiv:2604.13100v1 Announce Type: cross Abstract: The shift toward intent-driven software engineering (often termed "Vibe Coding") exposes a critical Context-Fidelity Trade-off: vague user intents overwhelm linear reasoning chains, leading to architectural collapse in complex repo-level generation.

#software-engineering #autonomous-engineering #intent-driven

arxiv · Apr 10 · bullish

Triage: Routing Software Engineering Tasks to Cost-Effective LLM Tiers via Code Quality Signals

arXiv:2604.07494v1 Announce Type: cross Abstract: Context: AI coding agents route every task to a single frontier large language model (LLM), paying premium inference cost even when many tasks are routine. Objectives: We propose Triage, a framework that uses code health metrics -- indicators of soft

#software-engineering #model-selection #cost-optimization