We trained a 32B model to beat Opus 4 at credit card optimization

Author: dimberman

by dimberman · Apr 20, 2026 · 6 points · 0 comments


AI Analysis

●● Solid · Big Brain · Dark Horse

A GRPO-trained Qwen 32B beats Opus 4 on credit card optimization tasks: a domain-specific win.

Strengths

• Domain-specific RL environment with verifiable benchmarks against commercial APIs
• Open-source (Apache 2.0) environment enables reproducibility and extension
• Blog documents reward design failures and fixes, making the training process transparent

Weaknesses

• Credit card optimization is a narrow niche, so generalization beyond this domain is limited
• RL fine-tuning for specific tasks is becoming a common pattern, not a novel architecture

Post Description

We built an RL environment for credit card reward optimization and trained Qwen 32B against it with GRPO. The trained model scores ~0.51 on held-out tasks, vs. ~0.41 for Opus 4 and 0.36 for GPT-4o. The environment is open source (Apache 2.0). The blog post explains the reward design, what broke during training, how we fixed it, and what we'd do differently.
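
For readers unfamiliar with GRPO, its core step is scoring each sampled completion relative to the other completions drawn for the same task. Below is a minimal sketch of that group-relative advantage calculation. The reward values, the function name, and the use of population standard deviation are assumptions for illustration; the post does not describe the project's actual reward function or training code here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style).

    `rewards` holds the environment scores for several completions sampled
    for the same task. `eps` avoids division by zero when every completion
    in the group receives the same score.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical scores for four completions sampled on one task.
rewards = [0.2, 0.5, 0.5, 0.8]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Completions that score above the group mean get a positive advantage and are reinforced; those below get a negative one. When every completion in a group scores the same, all advantages are zero and that group contributes no learning signal.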