arXiv:2604.04771v2 Announce Type: replace-cross Abstract: Current document parsing methods advance primarily through model architecture innovation, while systematic engineering of training data remains underexplored. Yet state-of-the-art models spanning diverse architectures and parameter scales exh