The Brief
Marktguru processes 10 million product lines a month (1M+ receipts, ~10 lines each). Three systematic data quality issues corrupt pricing data, product matching, and cross-retailer comparisons. Here's what we found and how to fix it.
Challenge 1: Pack Size Blindness
The packaging_quantity column is empty in every row.
1a: The Raw Data
Full Stiegl Hell dataset — 501 receipt lines sorted by price. Click any row to inspect it.
| Product Line | Price (EUR) | Qty | Pack Qty | Retailer | Discount |
|---|---|---|---|---|---|
1b: The Problem
The packaging_quantity column is empty in every row.
311 rows are single bottles (EUR 1-1.50). 150+ rows are hidden multi-packs (EUR 10-30) recorded as quantity=1.
1c: The Hypothesis
The pack size is hidden in the product_line column.
| Product Line | Price | Hidden Pack Size | True Unit Price |
|---|---|---|---|
| SC STIEGL HELL FL. 20ER | EUR 27.60 | 20-pack ("20ER") | EUR 1.38 |
| Stiegl Hell 6er | EUR 7.20 | 6-pack ("6er") | EUR 1.20 |
| Stiegl Hell FL. | EUR 28.56 | ? (no marker, but 28.56/24 = EUR 1.19) | ~EUR 1.19 |
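The first two rows of the table can be reproduced with a small sketch. The regular expression and the function names below are assumptions for illustration, not the production rules; real receipt text will likely need more marker variants.

```python
import re
from typing import Optional

# Hypothetical marker pattern: a 1-2 digit number followed by "er"
# (German shorthand for "-pack"), e.g. "20ER" or "6er".
PACK_MARKER = re.compile(r"\b(\d{1,2})\s?er\b", re.IGNORECASE)


def pack_size_from_name(product_line: str) -> Optional[int]:
    """Return the pack size encoded in the product line, or None if absent."""
    match = PACK_MARKER.search(product_line)
    if match is None:
        return None
    size = int(match.group(1))
    # A "1er" marker would not indicate a multi-pack.
    return size if size >= 2 else None


def unit_price(price_eur: float, pack_size: int) -> float:
    """Divide the recorded price by the pack size, rounded to cents."""
    return round(price_eur / pack_size, 2)
```

For "SC STIEGL HELL FL. 20ER" at EUR 27.60 this gives a pack size of 20 and a unit price of EUR 1.38. The third row has no marker, so this step alone returns None for it and a price-based step has to take over.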
1d: Pack Size Detective
Click a preset below, pick a row from the table, or type your own. The 5-step detection pipeline processes it live.
1e: The Result
Detection Method Legend
Each row is classified by one of 5 detection methods. Rules are evaluated top-to-bottom (waterfall). First match wins.
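A waterfall of this kind can be sketched as an ordered list of rules where the first non-empty answer wins. The four rules, their names, the reference price, the list of common pack sizes, and the 5% tolerance are all assumptions made for this example; the page does not list the actual five methods here.

```python
import re
from typing import Callable, List, Optional, Tuple

# Assumed values for illustration only.
COMMON_PACK_SIZES = (6, 12, 20, 24)
REFERENCE_SINGLE_PRICE = 1.20  # assumed typical single-bottle price in EUR
RATIO_TOLERANCE = 0.05         # accept a pack size if the ratio is within 5%

Record = dict
Detector = Callable[[Record], Optional[int]]


def by_explicit_field(rec: Record) -> Optional[int]:
    # Use packaging_quantity when it is populated (it is empty in this dataset).
    return rec.get("packaging_quantity") or None


def by_name_marker(rec: Record) -> Optional[int]:
    # Look for markers such as "20ER" or "6er" in the product line.
    m = re.search(r"\b(\d{1,2})\s?er\b", rec["product_line"], re.IGNORECASE)
    return int(m.group(1)) if m else None


def by_price_ratio(rec: Record) -> Optional[int]:
    # Ratio inference: price / single price should land near a common pack size.
    ratio = rec["price"] / REFERENCE_SINGLE_PRICE
    nearest = min(COMMON_PACK_SIZES, key=lambda size: abs(size - ratio))
    if abs(nearest - ratio) / nearest <= RATIO_TOLERANCE:
        return nearest
    return None


def assume_single(rec: Record) -> Optional[int]:
    # Last rule: nothing else matched, treat the row as a single unit.
    return 1


RULES: List[Tuple[str, Detector]] = [
    ("explicit_field", by_explicit_field),
    ("name_marker", by_name_marker),
    ("price_ratio", by_price_ratio),
    ("fallback_single", assume_single),
]


def classify(rec: Record) -> Tuple[str, int]:
    """Evaluate rules top-to-bottom and return (method, pack_size) of the first match."""
    for method, detect in RULES:
        size = detect(rec)
        if size is not None:
            return method, size
    raise ValueError("no rule matched")  # unreachable while the fallback rule exists
```

Ordering matters: a row with an explicit "20ER" marker is never passed to the price-ratio rule, which keeps the cheaper and more reliable evidence in control.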
Broader Application: 1,000 Stiegl Product Lines
The same detection runs on all Stiegl variants (Goldbräu, Bock, Bio). It catches the same patterns.
| Product Line | Recorded Price | Pack Size | Unit Price | Method | Cleaned Name (generated) |
|---|---|---|---|---|---|
Challenge 2: Cross-Retailer Price Chaos
2a: Raw Data Explorer
12,254 products across 17 German retailers. Click any product to see all retailer prices.
| Item | EAN | Category | Retailers | Min (EUR) | Max (EUR) | Spread % |
|---|---|---|---|---|---|---|
2b: The Problem
2c: The Hypothesis — Same Root Cause as Challenge 1
| Product | Retailer A | Retailer B | Spread | Explanation |
|---|---|---|---|---|
| Coca-Cola 1.251x6 | Aldi: EUR 0.75 | REWE: EUR 9.99 | 3,896% | REWE sells 8-pack, Aldi sells single |
| SPEZI | Penny: EUR 0.85 | Edeka: EUR 11.99 | 1,932% | Edeka sells 14-pack, Penny sells single |
| Radeberger Pils 20x0.5L | Netto: EUR 0.53 | Edeka: EUR 9.99 | 1,785% | Edeka sells 10-pack crate |
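To make the effect of pack size on the spread concrete, here is a minimal sketch. It assumes the spread is defined as (max − min) / min, which reproduces the Radeberger figure in the table; the function names are hypothetical.

```python
from typing import Iterable, Tuple


def spread_pct(prices: Iterable[float]) -> float:
    """Spread between cheapest and most expensive price, as a percentage of the cheapest."""
    prices = list(prices)
    lowest, highest = min(prices), max(prices)
    return (highest - lowest) / lowest * 100


def corrected_spread_pct(offers: Iterable[Tuple[float, int]]) -> float:
    """Same spread, computed on unit prices. Each offer is (recorded_price, pack_size)."""
    return spread_pct(price / pack_size for price, pack_size in offers)


# Radeberger Pils: Netto sells a single bottle, Edeka a 10-pack crate.
raw = spread_pct([0.53, 9.99])
fixed = corrected_spread_pct([(0.53, 1), (9.99, 10)])
```

With the recorded prices the spread rounds to 1,785%. Once the Edeka price is divided by its pack size of 10, the spread drops to roughly 88%, which is a plausible retailer difference rather than a data error.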
2d: Cross-Retailer Pack Detective
Click a preset below or search by EAN/name. The detection shows original retailer prices, detected pack sizes, and corrected unit prices.
2e: The Result — Corrected Prices
Before vs After: Price Spread Distribution
All Products — Original Spreadsheet Data → Corrected Prices
All 2,742 products from the AWS spreadsheet. Each row shows every retailer price column from the original data. Click any row to expand and see the full old→new price comparison per retailer, with detection steps explained.
| Product (from spreadsheet) | EAN | Retailer Prices (Original Spreadsheet Data) | Status |
|---|---|---|---|
Challenge 3: AI Hallucinations
The briefing explicitly states: “Snack Fun is an own brand of Hofer, definitely not Lidl. Also Lidl is nowhere to see on the images.”
3a: The Raw Data
SnackFun dataset — 96 receipt extraction records. All “Snack Fun” brand — Hofer's private label. The AI hallucinated “Lidl” as brand_parent for all 96 rows. Search and sort to explore.
| Line ID | EAN | Product Name | Brand | Volume | Unit | Confidence % |
|---|---|---|---|---|---|---|
3b: The Problem
EAN Conflicts
Same EAN mapped to different product names. Each conflict is a potential hallucination.
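A conflict check along these lines could look as follows. The record keys `ean` and `product_name` and the normalisation (trim plus case-folding) are assumptions; the EANs in the usage example are dummy values, not real data.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def find_ean_conflicts(records: Iterable[dict]) -> Dict[str, List[str]]:
    """Return every EAN that maps to more than one distinct product name."""
    names_by_ean = defaultdict(set)
    for rec in records:
        # Normalise lightly so that pure casing or whitespace differences
        # are not reported as conflicts.
        names_by_ean[rec["ean"]].add(rec["product_name"].strip().casefold())
    return {
        ean: sorted(names)
        for ean, names in names_by_ean.items()
        if len(names) > 1
    }


# Dummy records for illustration only.
sample = [
    {"ean": "0000000000001", "product_name": "Snack Fun Chips"},
    {"ean": "0000000000001", "product_name": "snack fun chips "},
    {"ean": "0000000000002", "product_name": "Snack Fun Chips"},
    {"ean": "0000000000002", "product_name": "Snack Fun Sticks"},
]
conflicts = find_ean_conflicts(sample)
```

Only the second dummy EAN is reported, because the two names under the first differ in casing and whitespace alone.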
Confidence Distribution
3c: Our Solution
brand_parent_source field: “packaging”, “knowledge”, or “none”.
3d: Prompt Comparison
We have Marktguru's ACTUAL production prompt (from their briefing PDF). It is detailed and professional. The hallucination source is one line: “use your training information to identify the brand parent.”
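Alongside the prompt change, the brand_parent_source field from 3c lends itself to a small guard after the model call. The policy below (trust "packaging", flag "knowledge" for review, clear the value for "none") and the `needs_review` key are assumptions for illustration, not a description of the deployed post-processing.

```python
ALLOWED_SOURCES = {"packaging", "knowledge", "none"}


def guard_brand_parent(extraction: dict) -> dict:
    """Return a copy of the extraction with brand_parent handled according to its source."""
    result = dict(extraction)
    source = result.get("brand_parent_source", "none")
    if source not in ALLOWED_SOURCES:
        # Unknown values are treated like a missing source.
        source = "none"

    if source == "packaging":
        # The brand parent was read from the image: accept it.
        result["needs_review"] = False
    elif source == "knowledge":
        # The model relied on training knowledge: keep the value but flag it.
        result["needs_review"] = True
    else:
        # No evidence at all: do not keep a guessed brand parent.
        result["brand_parent"] = None
        result["needs_review"] = False

    result["brand_parent_source"] = source
    return result
```

With this guard, a "Lidl" answer that came from model knowledge rather than from the packaging would at least be routed to review instead of being stored silently.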
Key Differences
3e: How It Works
The complete transparent pipeline — every step is visible to the user.
Assembled Prompt Preview
Change the market context to see how the prompt changes. This is exactly what the AI receives.
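The assembly step can be pictured as a fixed base prompt plus a context block that changes with the market. The base text, the function name, and the wording of the context lines are placeholders invented for this sketch; they are neither Marktguru's production prompt nor the improved prompt itself.

```python
from typing import Sequence

# Placeholder text only; this is NOT the production prompt.
BASE_PROMPT = "Extract brand_name, brand_parent and product_name from the product image."


def assemble_prompt(base: str, retailer: str, private_labels: Sequence[str]) -> str:
    """Append a market-context block to the unchanged base prompt."""
    context_lines = [
        f"Market context: the image comes from retailer {retailer}.",
        "Known private labels of this retailer: " + ", ".join(private_labels) + ".",
        "Only report a brand_parent that is visible on the packaging "
        "or listed above; otherwise return null.",
    ]
    return base + "\n\n" + "\n".join(context_lines)


prompt = assemble_prompt(BASE_PROMPT, "Hofer", ["Snack Fun"])
```

Keeping the base untouched and appending the context makes it easy to show both parts separately and to diff prompts between markets.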
3f: Live AI Comparison
3g: Test Results & Model Recommendations
Data source: app/vision_results/run_2026-03-06_009.json — 48 tests (6 models × 4 product images × 2 prompts). Each AI model was asked to extract product data from the same images using both Marktguru's production prompt and our improved prompt. Results compared against verified ground truth.
Passed: Tests where brand_parent AND brand_name are both correct (e.g. "2/4" = 2 out of 4 images correct).
Pass Rate: Passed ÷ Total tests, as percentage.
Failed (Hallucinations): Tests where the AI assigned the product to the WRONG retailer (e.g. Lidl instead of Hofer).
Quality Score (0-100): Field-by-field accuracy vs ground truth. Weighted: brand_parent 40pts, brand_name 30pts, product_name 20pts, volume+flavor+packaging 10pts.
Baseline: Marktguru's current production prompt (GPT-4o, temp 0, "use your training information to identify the brand parent").
Our prompt: Context-aware prompt with market info, anti-hallucination framing, packaging type list.
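The metric definitions above can be expressed in a few lines. Two details are assumptions: the 10 points for volume, flavor, and packaging are split evenly across the three fields, and values are compared after trimming and case-folding. The ground-truth values in the example are invented.

```python
from typing import Sequence

MAJOR_WEIGHTS = {"brand_parent": 40, "brand_name": 30, "product_name": 20}
MINOR_FIELDS = ("volume", "flavor", "packaging")
MINOR_POINTS = 10  # assumed to be shared evenly by the three minor fields


def _same(a, b) -> bool:
    # Assumed comparison rule: ignore case and surrounding whitespace.
    return str(a).strip().casefold() == str(b).strip().casefold()


def quality_score(pred: dict, truth: dict) -> float:
    """Field-by-field score from 0 to 100 against ground truth."""
    score = sum(
        weight
        for field, weight in MAJOR_WEIGHTS.items()
        if _same(pred.get(field), truth.get(field))
    )
    hits = sum(_same(pred.get(f), truth.get(f)) for f in MINOR_FIELDS)
    return score + MINOR_POINTS * hits / len(MINOR_FIELDS)


def passed(pred: dict, truth: dict) -> bool:
    """A test passes only if brand_parent AND brand_name are both correct."""
    return all(_same(pred.get(f), truth.get(f)) for f in ("brand_parent", "brand_name"))


def pass_rate(results: Sequence[bool]) -> float:
    """Passed divided by total tests, as a percentage."""
    return 100 * sum(results) / len(results)


# Invented ground truth and a prediction with the wrong retailer.
truth = {"brand_parent": "Hofer", "brand_name": "Snack Fun", "product_name": "Example Chips",
         "volume": "150", "flavor": "Paprika", "packaging": "bag"}
hallucinated = dict(truth, brand_parent="Lidl")
```

In this example the hallucinated prediction fails the pass check and scores 60 out of 100, since only the 40 points for brand_parent are lost.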
Head-to-Head Comparison
Model Ranking
All 6 tested models, sorted by quality improvement. Baseline = Marktguru's current approach.
| Model | Cost | Baseline Pass Rate | Baseline Quality | Our Pass Rate | Our Quality | Improvement | Recommendation |
|---|---|---|---|---|---|---|---|
What Made the Difference
Solution Pipeline & Next Steps
What We Proved
| Challenge | Status | Key Result | Production Readiness |
|---|---|---|---|
| Ch.1: Pack Size Detection | Solved | 98.9% std dev reduction (EUR 11.82 → EUR 0.13), regex + ratio inference | Ready — deterministic pipeline, no AI cost |
| Ch.2: Cross-Retailer Price Chaos | Solved | 177 products corrected, 113 removed from >100% spread | Ready — same pipeline as Ch.1, scales to full dataset |
| Ch.3: AI Hallucinations | Solved | Best model + our prompt: 50% → 100% pass rate (Gemini 3 Flash). Avg across all 6 models: 25% → 62% | Ready — prompt + market context + model choice, no fine-tuning needed |
Concrete Next Steps
Scale the Vision Test
Run our prompt vs. their prompt on 100-500 product images across diverse categories (drinks, snacks, dairy, frozen, household). Include multiple retailers, countries, and private-label brands. Estimated cost: ~$5-25 via OpenRouter. This is the critical validation step.
Expand Ground Truth
We need verified ground truth for more products — not just Snack Fun. Marktguru provides the product database; we define the test set with known-tricky cases: private-label brands, multi-pack products, regional variants. This is the foundation for measuring improvement.
Run Pack Size Pipeline on Full Dataset
The regex + ratio inference pipeline is deterministic and free — no AI cost. Apply to Marktguru's full product dataset (not just the 12K AWS sample). Measure how many price anomalies it catches across all categories and retailers.
Deploy & Monitor
Only after steps 1-3 validate the approach: switch the vision prompt (cost: $0), add post-processing layer, integrate pack size pipeline. Recommended model: Gemini 3 Flash or Gemini 2.5 Flash (~$0.30-0.50/1M tokens, 4/4 pass rate with our prompt).
What's NOT Needed
No new models. Gemini Flash at $0.30/1M tokens outperforms GPT-5 Mini at $0.25 and Claude Haiku at $1.00. Cost is not the bottleneck — context is.
No new infrastructure. All solutions work with Marktguru's existing OpenRouter integration. Prompt swap + post-processing layer + pack size pipeline.