Model Detail
grok-2
grok-2 is an AI model released by xAI. It has accumulated 22K downloads on Hugging Face since publication.
grok-2 is priced at $2 per million input tokens and $10 per million output tokens. It offers a 131K-token context window, which limits how much prompt and conversation history fits in a single request and so matters most for prompt-heavy workloads. This pricing sits in the middle of the API market, neither the cheapest nor the most expensive option per token. Because output tokens cost five times as much as input tokens, cost-fit is usually a function of how much output you generate.
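The pricing arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not an official xAI calculator: the function names are hypothetical, the context limit uses the rounded 131K figure quoted on this page (the exact limit may differ), and the check assumes that prompt and generated tokens share the same window.

```python
# Illustrative cost and context sizing for grok-2, using the figures listed on this page.
INPUT_PRICE_PER_M = 2.00    # USD per 1M input tokens (from the listing)
OUTPUT_PRICE_PER_M = 10.00  # USD per 1M output tokens (from the listing)
CONTEXT_WINDOW_TOKENS = 131_000  # assumption: rounded "131K"; the exact limit may differ


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one request or a batch of requests."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


def fits_context(prompt_tokens: int, max_output_tokens: int) -> bool:
    """Check whether a request fits the window.

    Assumption: prompt and generated tokens count against the same limit.
    """
    return prompt_tokens + max_output_tokens <= CONTEXT_WINDOW_TOKENS


if __name__ == "__main__":
    # Hypothetical workload: an 8,000-token prompt with a 1,000-token reply.
    print(f"cost per request: ${estimate_cost_usd(8_000, 1_000):.4f}")
    print(f"fits in context: {fits_context(8_000, 1_000)}")
```

At these rates, one million input tokens plus one million output tokens costs $12, of which $10 is output. Trimming generated text therefore tends to save more than trimming prompts.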
grok-2 is published on Hugging Face but our pipeline has not yet captured architecture, license, or parameter-count metadata for this entry. The data is refreshed daily, so these fields typically populate within 24–48 hours of release.
grok-2 is best suited to general-purpose AI workloads. Treat this as a starting point rather than a benchmark verdict: the right deployment usually depends on an evaluation suite that mirrors your workload.
Deciphering Two Training Clocks in Grokking via Deep Linear Network Theory with Conditional ReLU Reduction
arXiv:2606.05863v1 (announce type: new). Abstract: Grokking suggests that fitting the training data and learning a simple underlying rule may occur on different time scales. We formalize this phenomenon by separating the fast decay of the classification loss from the slower simplification of the learne
Low-Rank Decay for Grokking in Scale-Invariant Transformers: A Spectral-Geometric View
arXiv:2606.04405v1 (announce type: cross). Abstract: Modern Transformer architectures frequently employ normalization mechanisms such as RMSNorm and Query-Key Normalization, making parts of the model approximately scale-invariant with respect to weight magnitudes. In this regime, standard Frobenius-nor
Grokers: Bottom-Up Inductive Comprehension and Write-Time Intelligence over Typed Knowledge Graphs
arXiv:2606.00050v1 (announce type: new). Abstract: We present Grokers, an architecture for building persistent, structured comprehension of typed knowledge graphs through bottom-up inductive traversal of dependency subgraphs. Unlike retrieval-augmented generation (RAG), which pays full comprehension co
The Geometry of Grokking: Norm Minimization on the Zero-Loss Manifold
arXiv:2511.01938v3 (announce type: replace-cross). Abstract: Grokking is a puzzling phenomenon in neural networks where full generalization occurs only after a substantial delay following the complete memorization of the training data. Previous research has linked this delayed generalization to represe
A Pre-Training Analogue of Grokking in Language Models: Tracing Delayed Grammatical Generalization
arXiv:2606.00230v1 (announce type: new). Abstract: Grokking, the phenomenon in which neural networks generalize long after fitting their training data, has been studied in supervised settings on many epochs. LLM pre-training instead involves next-token prediction over an unlabeled corpus, with limited
To Grok Grokking: Provable Grokking in Ridge Regression
arXiv:2601.19791v3 (announce type: replace). Abstract: We study grokking, the onset of generalization long after overfitting, in a classical ridge regression setting. We prove end-to-end grokking results for learning over-parameterized linear regression models using gradient descent with weight decay.