AI Spotlight

Grok 3 First Impressions: Smartest AI Yet?

Hands-on with Grok 3: First impressions show it wows with advanced reasoning, Deep Search, and possibly the smartest AI yet.

Stef Buzas

Feb 20, 2025 • 3 min read

Image composed by Hiraku for illustrative purposes.

What is Grok 3?

Grok 3 is xAI’s latest AI model, released on February 17, 2025. Marketed by Elon Musk as the "smartest AI on Earth," it competes with OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 2.0 Flash Thinking Experimental. It excels in math, science, and coding, showing strong performance in reasoning-heavy tasks.

Key Features & Advancements

Grok 3 brings significant upgrades over its predecessors:

Deep Search: Allows real-time web browsing for up-to-date knowledge, market insights, and fact-checking.
Advanced Reasoning: Performs step-by-step logical problem-solving, as demonstrated on the AIME’24 math benchmark.
Multimodal Capabilities: Expected to support both text and image generation based on hints from xAI engineers.
Massive Compute Power: Trained on 100,000 Nvidia H100 GPUs, accumulating 200 million GPU-hours, 10× more than Grok 2.
Enhanced Training Data: Uses a mix of public datasets, X user data (opt-out available), and reinforcement learning-generated synthetic data.
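
To put the compute figures above in perspective, here is a quick back-of-the-envelope sketch. It assumes the GPU-hours were spread evenly across all 100,000 GPUs, which the source does not state, and it reads "10× more" as "10 times as much." The Grok 2 number it prints is only what that ratio implies, not a published figure, and the function names are ours.

```python
# Back-of-the-envelope check of the compute figures quoted in the list above.
# Assumption (not stated in the source): the GPU-hours were spread evenly
# across all GPUs, so wall-clock time is simply total hours / GPU count.

TOTAL_GPU_HOURS = 200_000_000   # "200 million GPU-hours"
GPU_COUNT = 100_000             # "100,000 Nvidia H100 GPUs"
SCALE_VS_GROK_2 = 10            # "10x more than Grok 2", read as 10 times as much


def hours_per_gpu(total_gpu_hours: float, gpu_count: int) -> float:
    """Average number of hours each GPU would have been busy."""
    return total_gpu_hours / gpu_count


def implied_grok_2_gpu_hours(total_gpu_hours: float, scale: float) -> float:
    """GPU-hours implied for Grok 2 by the quoted ratio (not a published figure)."""
    return total_gpu_hours / scale


if __name__ == "__main__":
    per_gpu = hours_per_gpu(TOTAL_GPU_HOURS, GPU_COUNT)
    grok_2 = implied_grok_2_gpu_hours(TOTAL_GPU_HOURS, SCALE_VS_GROK_2)
    print(f"Hours per GPU: {per_gpu:,.0f}")
    print(f"Days per GPU:  {per_gpu / 24:.1f}")
    print(f"Implied Grok 2 GPU-hours: {grok_2:,.0f}")
```

Under that even-spread assumption, the totals work out to 2,000 hours, or roughly 83 days, of continuous use per GPU.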

Performance Benchmarks

Grok 3 leads on key AI benchmarks, outperforming major competitors on every task listed. The benchmark data comes from LMSYS Chatbot Arena, internal xAI tests, and independent AI research groups:

Task	Grok-3 Reasoning Beta	Grok-3 mini	GPT-4o (o3mini)	Claude 3.5 Sonnet	Gemini-2 Flash Exp	DeepSeek-R1	OpenAI o1
Math (AIME’24)	96	93	87	84	80	73	78
Science (GPQA)	85	84	80	81	74	71	79
Coding (LCB Oct-Feb)	80	79	74	76	73	65	77
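
To see how large the leads in the table actually are, the sketch below computes the margin between Grok-3 Reasoning Beta and the best-scoring non-xAI model on each task. The scores and column names are copied verbatim from the table. The helper name and the choice to exclude Grok-3 mini from the rivals are our own assumptions for illustration.

```python
# Scores copied from the benchmark table above (column headers kept verbatim).
SCORES = {
    "Math (AIME'24)": {
        "Grok-3 Reasoning Beta": 96, "Grok-3 mini": 93, "GPT-4o (o3mini)": 87,
        "Claude 3.5 Sonnet": 84, "Gemini-2 Flash Exp": 80,
        "DeepSeek-R1": 73, "OpenAI o1": 78,
    },
    "Science (GPQA)": {
        "Grok-3 Reasoning Beta": 85, "Grok-3 mini": 84, "GPT-4o (o3mini)": 80,
        "Claude 3.5 Sonnet": 81, "Gemini-2 Flash Exp": 74,
        "DeepSeek-R1": 71, "OpenAI o1": 79,
    },
    "Coding (LCB Oct-Feb)": {
        "Grok-3 Reasoning Beta": 80, "Grok-3 mini": 79, "GPT-4o (o3mini)": 74,
        "Claude 3.5 Sonnet": 76, "Gemini-2 Flash Exp": 73,
        "DeepSeek-R1": 65, "OpenAI o1": 77,
    },
}

# Assumption: only non-xAI models count as rivals.
XAI_MODELS = {"Grok-3 Reasoning Beta", "Grok-3 mini"}


def lead_over_best_rival(task_scores, model="Grok-3 Reasoning Beta"):
    """Return (best rival name, point margin of `model` over that rival)."""
    rivals = {name: s for name, s in task_scores.items() if name not in XAI_MODELS}
    best = max(rivals, key=rivals.get)
    return best, task_scores[model] - rivals[best]


if __name__ == "__main__":
    for task, task_scores in SCORES.items():
        rival, margin = lead_over_best_rival(task_scores)
        print(f"{task}: +{margin} over {rival}")
```

By this measure the margin is widest in math (9 points) and much narrower in science (4 points) and coding (3 points).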

Additionally, Grok 3’s “Chocolate” prototype topped the Chatbot Arena leaderboard with 1402 points, surpassing Gemini 2.0 Flash Thinking Experimental (1385).
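
For a sense of what a 17-point Arena gap means, the sketch below converts the two scores into an expected head-to-head preference rate. It assumes the leaderboard points can be read on the standard Elo scale, where a 400-point gap corresponds to 10-to-1 odds. That is a simplification of how the leaderboard is computed, and the function name is ours.

```python
def expected_win_rate(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score of A against B (400 points = 10x odds)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))


if __name__ == "__main__":
    grok_3 = 1402   # "Chocolate" prototype score quoted above
    gemini = 1385   # Gemini Flash Thinking score quoted above
    rate = expected_win_rate(grok_3, gemini)
    print(f"Expected preference rate for Grok 3: {rate:.1%}")
```

On that reading, the higher-rated model would be preferred in roughly 52% of head-to-head votes, so the lead is real but narrow.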

Hands-On Impressions

While our comprehensive evaluation of Grok 3 is ongoing, we are eager to share our initial findings based on early testing. These preliminary insights highlight both the strengths and areas for improvement observed thus far.

Reasoning & Problem-Solving

In our testing, Grok 3 performed well in math, structured logic, and coding challenges, showing strong step-by-step problem-solving. Given more time to process, it improved its accuracy significantly, especially in complex reasoning tasks.

However, it felt slower and less intuitive than OpenAI's o1 and GPT-4o, as well as Claude 3.5 Sonnet. It required longer, more explicit prompts for similar results and, in some cases, missed non-standard logical connections that competing models handled smoothly.

Real-Time Web Browsing (Deep Search)

Grok 3’s real-time web retrieval and summarization was one of its stronger areas. It pulled accurate, up-to-date information on complex and nuanced topics, then structured and synthesized it for the specific task at hand.

However, in practice, its performance did not consistently surpass Grok 2, which was already proficient in this area, making the improvement feel incremental rather than groundbreaking.

Coding

During our early tests with little or no context, Grok 3 demonstrated strong performance in structured programming tasks, particularly when provided with clear and detailed instructions. It excelled in algorithmic problem-solving, often generating accurate and well-documented code.

Compared with models like OpenAI's o1 and Claude 3.5 Sonnet, Grok 3 may require more specific prompts to achieve optimal results and might be less adept at refining and iterating on code dynamically.

We are currently conducting a deeper analysis and comparative studies of Grok 3 against leading AI models such as Gemini 2.0 Flash Thinking, Claude 3.5 Sonnet (along with Cursor Agent), OpenAI's o1 and o3 models, and more. Our forthcoming reports will provide a comprehensive evaluation of its performance across various tasks and applications.

Pricing and Access

X Premium+ now costs $40 per month, a recent increase from $22, giving access to Grok 3 for X users.
There’s also a SuperGrok subscription at $30 per month or $300 annually, which includes Deep Search, extended reasoning, and higher image generation limits.
An Enterprise API is expected soon, allowing businesses to integrate Grok 3 for advanced use cases beyond consumer access.
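
The pricing figures above are easier to compare after a little arithmetic. This sketch uses only the dollar amounts quoted in this section, and the helper names are ours.

```python
def percent_increase(old_price: float, new_price: float) -> float:
    """Relative price increase, in percent."""
    return (new_price - old_price) / old_price * 100


def annual_saving(monthly_price: float, annual_price: float) -> float:
    """Amount saved per year by paying annually instead of monthly."""
    return monthly_price * 12 - annual_price


if __name__ == "__main__":
    # X Premium+: $22 -> $40 per month
    print(f"X Premium+ increase: {percent_increase(22, 40):.1f}%")
    # SuperGrok: $30 per month vs. $300 per year
    print(f"SuperGrok annual saving: ${annual_saving(30, 300):.0f}")
    print(f"SuperGrok effective monthly price on the annual plan: ${300 / 12:.0f}")
```

In other words, the X Premium+ price rose by about 82%, while the annual SuperGrok plan saves $60 a year, bringing the effective price to $25 a month.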

Future Developments

A voice mode is rolling out within a week, enhancing conversational capabilities similar to ChatGPT’s voice mode, making interactions more dynamic.
Elon Musk hinted that Grok 2 may become open-source once Grok 3 stabilizes, in line with xAI’s approach of sharing previous models.
Future updates may include expanded multimodal abilities, potentially improving image and video understanding, though details are still emerging.

Conclusion

While our early hands-on testing reveals that there’s still room for refinement, Grok 3 is undoubtedly pushing the boundaries and challenging the notion of what the smartest AI can be.

Grok 3 is a powerhouse in reasoning and real-time search. Its massive training scale, Deep Search capabilities, and benchmark dominance show that it's a formidable contender!