Leaderboard

OCRGenBench: A Comprehensive Benchmark for Evaluating OCR Generative Capabilities

- 19 models evaluated
- 33 OCR generative tasks
- 5 text categories
- 1,060 human-annotated samples
- 2 languages (ZH & EN)
| # | Model | Date | Type | Source | Params | OCRGenScore | T2I (VIEScore) ↑ | T2I (AR) ↑ | Edit (1-LPIPS) ↑ | Edit (AR) ↑ | Dewarping (DD) ↑ | Deshadow (MS-SSIM) ↑ | Deblur (MS-SSIM) ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | **Closed-Source Models** | | | | | | | | | | | | |
| 1 | Nano Banana Pro | 2025.11 | Unified U&G | Closed | – | 77.19 | 92.24 | 76.96 | 85.22 | 71.46 | 42.52 | 85.93 | 61.37 |
| 3 | Seedream 4.5 | 2025.12 | Specialized | Closed | – | 63.35 | 90.45 | 74.09 | 65.75 | 45.19 | 23.85 | 57.99 | 55.09 |
| 9 | GPT Image 1.5 | 2025.03 | Specialized | Closed | – | 54.00 | 93.41 | 68.72 | 57.36 | 42.73 | 29.55 | 36.10 | 32.09 |
| | **Open-Source · Unified Understanding & Generation** | | | | | | | | | | | | |
| 5 | BAGEL | 2025.05 | Unified U&G | Open | 14B | 59.11 | 57.08 | 14.95 | 87.03 | 15.07 | 20.80 | 76.18 | 76.27 |
| 8 | OmniGen2 | 2025.06 | Unified U&G | Open | 7B | 54.24 | 59.96 | 26.37 | 65.83 | 13.82 | 21.21 | 64.26 | 65.17 |
| 11 | InternVL-U | 2026.03 | Unified U&G | Open | 4B | 43.64 | 66.25 | 44.17 | 53.87 | 17.21 | 28.69 | 49.45 | 30.80 |
| 13 | Janus-4o | 2025.06 | Unified U&G | Open | 7B | 29.58 | 53.59 | 13.63 | 45.53 | 5.80 | 24.88 | 34.28 | 39.33 |
| 15 | ILLUME+ | 2025.04 | Unified U&G | Open | 7B | 28.39 | 43.15 | 5.46 | 42.71 | 2.68 | 27.21 | 42.78 | 38.28 |
| | **Open-Source · Specialized Generation** | | | | | | | | | | | | |
| 2 | FLUX.2-dev | 2025.11 | Specialized | Open | 32B | 70.19 | 88.88 | 66.37 | 83.92 | 41.56 | 41.41 | 67.97 | 71.87 |
| 4 | LongCat-Image | 2025.12 | Specialized | Open | 6B | 66.39 | 84.02 | 67.51 | 85.56 | 51.53 | 28.04 | 72.12 | 56.62 |
| 6 | FLUX.2-Klein-9B | 2026.01 | Specialized | Open | 9B | 59.28 | 82.84 | 39.63 | 81.18 | 31.10 | 24.44 | 57.98 | 56.74 |
| 7 | Qwen-Image | 2025.12 | Specialized | Open | 20B | 56.29 | 84.41 | 65.21 | 65.65 | 41.20 | 25.14 | 50.05 | 38.81 |
| 10 | GLM-Image | 2026.01 | Specialized | Open | 9B | 50.12 | 83.53 | 69.36 | 70.97 | 21.54 | 24.99 | 43.15 | 41.13 |
| 12 | FLUX.1-Kontext-dev | 2025.06 | Specialized | Open | 12B | 36.51 | 39.58 | 21.69 | 53.76 | 15.13 | 24.30 | 30.36 | 30.80 |
| 14 | SD-3.5-Large | 2024.10 | Specialized | Open | 8B | 29.53 | 50.94 | 27.51 | 47.43 | 5.77 | 30.99 | 29.07 | 32.64 |
| # | Model | OCRGenScore | Text Removal: Handwriting (MS-SSIM) ↑ | Text Removal: Scene Text (MS-SSIM) ↑ | Style Transfer, Artistic Text: VIEScore ↑ | Style Transfer, Artistic Text: AR ↑ | Style Transfer, Hist. Doc. (VIEScore) ↑ | Hist. Doc. Restoration (1-LPIPS) ↑ | Scene Text SR (MS-SSIM) ↑ | Layout-Aware Text Gen.: VIEScore ↑ | Layout-Aware Text Gen.: AR ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | **Closed-Source Models** | | | | | | | | | | |
| 1 | Nano Banana Pro | 77.19 | 84.16 | 91.66 | 78.27 | 89.00 | 77.66 | 74.15 | 64.67 | 88.87 | 100.00 |
| 9 | GPT Image 1.5 | 54.00 | 47.59 | 51.38 | 87.16 | 94.22 | 80.57 | 46.18 | 23.23 | 85.69 | 97.77 |
| 3 | Seedream 4.5 | 64.69 | 69.97 | 83.93 | 81.52 | 99.34 | 78.07 | 59.88 | 42.02 | 54.50 | 95.00 |
| | **Open-Source · Unified Understanding & Generation** | | | | | | | | | | |
| 5 | BAGEL | 59.11 | 80.23 | 79.78 | 33.19 | 12.80 | 44.49 | 73.71 | 87.67 | 75.31 | 32.99 |
| 8 | OmniGen2 | 54.24 | 47.28 | 82.11 | 35.17 | 47.59 | 64.66 | 50.64 | 83.75 | 72.58 | 43.33 |
| 11 | InternVL-U | 43.64 | 57.19 | 61.13 | 53.41 | 59.99 | 20.75 | 37.45 | 17.74 | 53.94 | 85.83 |
| 15 | ILLUME+ | 28.39 | 48.93 | 68.42 | 0.50 | 2.26 | 1.00 | 25.84 | 21.88 | 7.67 | 3.29 |
| 13 | Janus-4o | 29.58 | 43.69 | 35.77 | 25.08 | 8.17 | 16.53 | 28.60 | 19.05 | 41.55 | 17.32 |
| | **Open-Source · Specialized Generation** | | | | | | | | | | |
| 7 | Qwen-Image | 56.29 | 49.22 | 44.56 | 70.63 | 95.47 | 72.53 | 59.48 | 27.65 | 84.44 | 98.93 |
| 4 | LongCat-Image | 66.39 | 78.11 | 89.56 | 69.22 | 94.27 | 69.74 | 66.66 | 24.37 | 83.57 | 96.14 |
| 14 | SD-3.5-Large | 29.53 | 44.64 | 26.11 | 56.64 | 44.01 | 0.00 | 42.79 | 17.37 | 18.52 | 4.84 |
| 12 | FLUX.1-Kontext-dev | 36.51 | 58.34 | 31.87 | 58.66 | 52.59 | 52.80 | 43.24 | 22.49 | 49.00 | 28.51 |
| 2 | FLUX.2-dev | 70.19 | 78.75 | 93.96 | 79.86 | 91.32 | 68.48 | 74.62 | 52.78 | 76.53 | 44.61 |
| 6 | FLUX.2-Klein-9B | 59.28 | 65.71 | 80.59 | 78.48 | 93.39 | 67.48 | 63.81 | 46.46 | 74.05 | 41.61 |
| 10 | GLM-Image | 50.12 | 64.61 | 16.21 | 59.77 | 51.90 | 77.81 | 67.98 | 11.80 | 80.81 | 66.67 |
T2I = Text-to-Image generation  |  DD = Document-Distortion metric
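Several columns above (Deshadow, Deblur, Text Removal, Scene Text SR) report MS-SSIM, a multi-scale structural similarity between the model output and the ground-truth image. Below is a minimal scoring sketch, assuming the third-party `pytorch_msssim` package; the file names are placeholders:

```python
# Minimal sketch: scoring a restoration output against its ground truth with
# MS-SSIM, as used for the Deshadow / Deblur / Text Removal columns above.
# Assumes the `pytorch_msssim` pip package; image paths are placeholders.
import numpy as np
import torch
from PIL import Image
from pytorch_msssim import ms_ssim

def load_tensor(path: str) -> torch.Tensor:
    """Load an RGB image as a 1x3xHxW float tensor in [0, 1]."""
    img = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32) / 255.0
    return torch.from_numpy(img).permute(2, 0, 1).unsqueeze(0)

pred = load_tensor("model_output.png")  # restored image from the model
gt = load_tensor("ground_truth.png")    # reference image, same size as pred
# Default MS-SSIM uses 5 scales, so inputs should be reasonably large (>160 px).
score = ms_ssim(pred, gt, data_range=1.0).item()
print(f"MS-SSIM: {100 * score:.2f}")  # the leaderboard reports percentages
```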
| # | Model | Document: Modern ↑ | Document: Historical ↑ | Handwriting ↑ | Scene Text ↑ | Artistic Text ↑ | Layout-Rich: Slide ↑ | Layout-Rich: Poster ↑ | Layout-Rich: Layout-Aware ↑ |
|---|---|---|---|---|---|---|---|---|---|
| | **Closed-Source Models** | | | | | | | | |
| 1 | Nano Banana Pro | 70.87 | 71.37 | 78.28 | 84.67 | 89.11 | 92.86 | 80.85 | 94.43 |
| 3 | Seedream 4.5 | 48.74 | 61.82 | 60.30 | 70.34 | 89.07 | 77.20 | 74.23 | 74.75 |
| 9 | GPT Image 1.5 | 36.31 | 56.47 | 57.16 | 60.37 | 86.26 | 80.67 | 63.21 | 91.73 |
| | **Open-Source · Unified Understanding & Generation** | | | | | | | | |
| 5 | BAGEL | 56.33 | 47.63 | 55.47 | 63.83 | 40.25 | 47.79 | 44.91 | 54.15 |
| 8 | OmniGen2 | 45.97 | 46.84 | 40.66 | 63.81 | 47.45 | 42.84 | 42.59 | 57.96 |
| 11 | InternVL-U | 36.83 | 29.96 | 47.58 | 49.94 | 57.94 | 52.56 | 41.60 | 69.88 |
| 13 | Janus-4o | 31.75 | 25.88 | 33.28 | 33.51 | 29.09 | 45.85 | 26.94 | 29.43 |
| 15 | ILLUME+ | 33.87 | 19.48 | 32.55 | 36.05 | 13.40 | 34.90 | 21.57 | 5.48 |
| | **Open-Source · Specialized Generation** | | | | | | | | |
| 2 | FLUX.2-dev | 60.17 | 64.36 | 69.75 | 75.50 | 87.04 | 80.17 | 70.68 | 80.52 |
| 4 | LongCat-Image | 54.79 | 63.47 | 70.92 | 72.21 | 85.95 | 79.04 | 82.48 | 89.86 |
| 6 | FLUX.2-Klein-9B | 47.85 | 56.55 | 60.73 | 66.89 | 72.56 | 68.01 | 57.04 | 57.83 |
| 7 | Qwen-Image | 45.22 | 56.37 | 52.12 | 58.59 | 87.26 | 75.63 | 70.40 | 91.68 |
| 10 | GLM-Image | 41.42 | 62.62 | 57.83 | 49.74 | 68.66 | 66.84 | 64.12 | 73.74 |
| 12 | FLUX.1-Kontext-dev | 31.10 | 35.72 | 33.39 | 32.38 | 52.95 | 38.09 | 27.40 | 38.76 |
| 14 | SD-3.5-Large | 30.11 | 22.86 | 33.23 | 30.56 | 47.90 | 43.93 | 36.96 | 11.68 |
T2I Generation Subtasks: Text-to-Image generation tasks across multiple text categories. OCRGenScore-T2I is a weighted average of VIEScore and Accuracy Rate (AR) across all T2I subtasks (a small scoring sketch follows the table). Models marked with * support only T2I generation (no image editing capabilities).
| # | Model | OCRGenScore-T2I | Hist. Doc.: VIEScore ↑ | Hist. Doc.: AR ↑ | Handwriting: VIEScore ↑ | Handwriting: AR ↑ | Scene Text: VIEScore ↑ | Scene Text: AR ↑ | Artistic Text: VIEScore ↑ | Artistic Text: AR ↑ | Slide: VIEScore ↑ | Poster: VIEScore ↑ | Poster: AR ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | **Closed-Source Models** | | | | | | | | | | | | |
| 1 | Nano Banana Pro | 85.53 | 92.02 | 61.76 | 91.36 | 74.35 | 94.06 | 90.89 | 90.91 | 91.32 | 94.82 | 91.87 | 58.22 |
| 2 | Seedream 4.5 | 83.83 | 89.60 | 53.38 | 89.48 | 53.92 | 90.81 | 84.58 | 90.98 | 95.42 | 90.72 | 92.50 | 83.87 |
| 3 | GPT Image 1.5 | 82.63 | 92.35 | 49.63 | 94.10 | 67.74 | 93.20 | 83.19 | 92.37 | 77.29 | 94.25 | 93.62 | 59.43 |
| | **Open-Source · Unified Understanding & Generation** | | | | | | | | | | | | |
| 11 | InternVL-U | 58.44 | 58.26 | 10.27 | 56.66 | 40.59 | 69.78 | 80.09 | 83.89 | 46.83 | 76.11 | – | – |
| 12 | OmniGen2 | 45.58 | 73.25 | 18.98 | 59.34 | 24.48 | 63.12 | 32.71 | 58.37 | 24.57 | 48.42 | 72.77 | 12.08 |
| 14 | Janus-4o | 38.79 | 72.78 | 10.01 | 49.22 | 14.59 | 53.66 | 15.70 | 42.48 | 19.20 | 62.31 | 62.30 | 0.07 |
| 15 | BAGEL | 37.34 | 54.34 | 8.59 | 54.11 | 16.05 | 65.46 | 9.53 | 55.08 | 25.76 | 43.69 | 70.40 | 1.40 |
| 17 | ILLUME+ | 27.26 | 59.42 | 8.37 | 43.87 | 7.81 | 51.07 | 2.72 | 24.62 | 4.12 | 36.13 | 52.72 | 0.14 |
| 18 | Show-o2* | 19.44 | 28.71 | 5.01 | 25.83 | 11.26 | 24.79 | 5.47 | 35.14 | 23.67 | 20.07 | 33.22 | 0.07 |
| | **Open-Source · Specialized Generation** | | | | | | | | | | | | |
| 4 | FLUX.2-dev | 79.28 | 85.59 | 31.41 | 81.89 | 29.94 | 79.47 | 99.99 | 85.70 | 92.47 | 81.14 | 89.75 | 67.62 |
| 5 | LongCat-Image | 78.92 | 82.91 | 42.03 | 78.30 | 51.74 | 87.47 | 95.99 | 88.55 | 88.05 | 83.82 | 88.41 | 75.96 |
| 6 | GLM-Image | 77.95 | 79.85 | 60.99 | 77.95 | 64.16 | 88.51 | 82.17 | 88.64 | 75.77 | 78.70 | 90.49 | 69.48 |
| 7 | Qwen-Image | 76.03 | 87.40 | 26.36 | 60.28 | 88.00 | 77.76 | 99.53 | 86.49 | 95.47 | 85.91 | 84.32 | 58.30 |
| 8 | Z-Image* | 74.46 | 66.22 | 57.31 | 68.52 | 72.78 | 75.87 | 93.08 | 77.29 | 89.80 | 66.44 | 79.48 | 77.30 |
| 9 | Ovis-Image* | 69.02 | 52.45 | 52.75 | 55.30 | 54.00 | 76.67 | 85.82 | 77.38 | 92.70 | 57.14 | 75.97 | 90.87 |
| 10 | FLUX.2-Klein-9B | 59.28 | 82.84 | 39.63 | 88.91 | 31.10 | 80.47 | 84.11 | 83.86 | 45.04 | 51.34 | 51.37 | 37.23 |
| 13 | SD-3.5-Large | 43.36 | 52.39 | 7.40 | 41.20 | 19.86 | 47.48 | 42.44 | 59.53 | 43.76 | 58.06 | 64.30 | 25.70 |
| 16 | FLUX.1-Kontext-dev | 31.78 | 37.21 | 9.55 | 36.20 | 21.25 | 40.15 | 29.01 | 39.80 | 33.69 | 39.25 | 51.98 | 3.96 |
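As noted in the caption, OCRGenScore-T2I aggregates VIEScore and AR over the T2I subtasks. The official weights are not given on this page, so the sketch below assumes a plain equal-weight average purely for illustration, using Nano Banana Pro's row from the table:

```python
# Sketch of the OCRGenScore-T2I aggregation described in the table caption:
# a weighted average of VIEScore and AR over the T2I subtasks. The official
# weights are not given on this page, so an equal-weight mean is ASSUMED here.
from statistics import mean

# Per-subtask (VIEScore, AR) pairs, taken from Nano Banana Pro's row above;
# the Slide subtask reports VIEScore only.
subtask_scores = {
    "hist_doc":    (92.02, 61.76),
    "handwriting": (91.36, 74.35),
    "scene_text":  (94.06, 90.89),
    "artistic":    (90.91, 91.32),
    "slide":       (94.82, None),
    "poster":      (91.87, 58.22),
}

components = [v for pair in subtask_scores.values() for v in pair if v is not None]
print(f"equal-weight OCRGenScore-T2I ~ {mean(components):.2f}")  # ~84.69
```

With equal weights this lands near, but not exactly on, the reported 85.53, so the official metric presumably weights the components differently.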
Text Editing Tasks: Modify specific textual content within images while preserving other elements unchanged. Each text category is evaluated by 1 − LPIPS (perceptual similarity, ↑ better) and Accuracy Rate (AR, ↑ better); a metric sketch follows the table. Note: models marked with * currently support only image editing tasks.
| # | Model | OCRGenScore-Edit | Modern Doc.: 1-LPIPS ↑ | Modern Doc.: AR ↑ | Historical Doc.: 1-LPIPS ↑ | Historical Doc.: AR ↑ | Handwriting: 1-LPIPS ↑ | Handwriting: AR ↑ | Scene Text: 1-LPIPS ↑ | Scene Text: AR ↑ | Artistic Text: 1-LPIPS ↑ | Artistic Text: AR ↑ | Slide: 1-LPIPS ↑ | Slide: AR ↑ | Poster: 1-LPIPS ↑ | Poster: AR ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | **Closed-Source Models** | | | | | | | | | | | | | | | |
| 2 | Nano Banana Pro | 80.30 | 88.15 | 80.28 | 74.66 | 38.87 | 77.42 | 58.21 | 92.03 | 73.36 | 92.15 | 94.00 | 85.92 | 96.90 | 93.50 | 79.80 |
| 6 | Seedream 4.5 | 56.74 | 65.14 | 41.20 | 60.95 | 13.77 | 53.77 | 24.48 | 68.67 | 45.31 | 67.15 | 100.00 | 64.06 | 64.31 | 59.84 | 69.71 |
| 7 | GPT Image 1.5 | 52.32 | 55.66 | 28.30 | 48.73 | 8.80 | 54.08 | 31.89 | 59.92 | 45.78 | 67.45 | 65.31 | 60.28 | 67.98 | 55.36 | 44.42 |
| | **Open-Source · Unified Understanding & Generation** | | | | | | | | | | | | | | | |
| 8 | BAGEL | 50.57 | 87.58 | 9.00 | 79.45 | 2.50 | 86.27 | 15.95 | 90.55 | 20.66 | 87.63 | 17.63 | 83.69 | 20.09 | 92.57 | 15.28 |
| 11 | OmniGen2 | 39.86 | 61.76 | 9.32 | 68.48 | 3.66 | 62.11 | 11.90 | 73.55 | 10.21 | 74.24 | 24.66 | 49.88 | 24.11 | 51.50 | 6.13 |
| 13 | InternVL-U | 35.03 | 51.20 | 5.59 | 43.98 | 10.75 | 58.54 | 15.32 | 58.50 | 21.29 | 58.39 | 45.15 | 44.29 | 13.72 | 49.59 | 14.15 |
| 14 | Janus-4o | 28.09 | 43.82 | 2.42 | 32.06 | 1.92 | 44.59 | 3.89 | 47.29 | 6.24 | 52.61 | 27.68 | 56.12 | 2.68 | 43.76 | 0.82 |
| 16 | ILLUME+ | 22.68 | 39.25 | 1.25 | 41.83 | 1.19 | 41.46 | 3.89 | 61.53 | 3.46 | 48.29 | 0.62 | 63.53 | 3.78 | 42.74 | 0.69 |
| | **Open-Source · Specialized Generation** | | | | | | | | | | | | | | | |
| 1 | FireRed-Image-Edit-v1.1* | 81.27 | 89.75 | 60.08 | 80.46 | 44.70 | 90.68 | 76.18 | 91.97 | 66.97 | 21.18 | 93.26 | 88.02 | 98.66 | 96.19 | 79.47 |
| 3 | LongCat-Image | 70.97 | 84.35 | 29.83 | 76.75 | 32.06 | 86.55 | 52.73 | 87.56 | 54.12 | 83.62 | 91.95 | 83.81 | 64.72 | 92.98 | 72.66 |
| 4 | FLUX.2-dev | 64.14 | 83.53 | 34.86 | 74.61 | 13.08 | 80.93 | 31.32 | 87.57 | 48.38 | 87.57 | 89.77 | 84.51 | 56.67 | 91.28 | 33.86 |
| 5 | Qwen-Image | 57.68 | 67.63 | 33.60 | 63.37 | 21.50 | 51.40 | 25.06 | 72.08 | 58.58 | 85.95 | 72.33 | 59.48 | 27.65 | 84.44 | 98.93 |
| 9 | FLUX.2-Klein-9B | 48.32 | 76.15 | 16.35 | 67.51 | 5.09 | 59.92 | 15.77 | 82.84 | 24.61 | 67.98 | 47.48 | 75.98 | 33.96 | 79.19 | 16.77 |
| 10 | GLM-Image | 46.72 | 41.42 | 14.42 | 54.20 | 14.42 | 65.15 | 44.39 | 59.40 | 77.81 | 67.98 | 47.48 | 78.31 | 84.08 | 58.31 | 74.23 |
| 12 | FLUX.1-Kontext-dev | 35.48 | 51.49 | 6.82 | 45.34 | 10.75 | 54.66 | 15.52 | 62.04 | 11.27 | 62.81 | 20.97 | 57.29 | 16.53 | 52.86 | 5.39 |
| 15 | SD-3.5-Large | 26.74 | 35.32 | 4.45 | 37.15 | 0.38 | 43.19 | 5.87 | 42.86 | 5.56 | 64.87 | 18.46 | 59.60 | 0.00 | 51.50 | 6.13 |
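The 1 − LPIPS similarity used in the editing columns can be computed with the `lpips` reference package; LPIPS is a learned perceptual distance, so subtracting it from 1 yields a similarity where higher is better. A minimal sketch with placeholder file names:

```python
# Sketch of the 1 - LPIPS similarity used in the editing columns above.
# LPIPS is a learned perceptual distance, so 1 - LPIPS is a similarity
# (higher = better). Assumes the `lpips` pip package; paths are placeholders.
import torch
import lpips

loss_fn = lpips.LPIPS(net="alex")  # AlexNet backbone, the package default

def to_lpips_tensor(path: str) -> torch.Tensor:
    # lpips.load_image returns an HxWx3 uint8 RGB array;
    # lpips.im2tensor maps it to a 1x3xHxW tensor scaled to [-1, 1].
    return lpips.im2tensor(lpips.load_image(path))

edited = to_lpips_tensor("edited_output.png")
reference = to_lpips_tensor("reference.png")

with torch.no_grad():
    distance = loss_fn(edited, reference).item()
print(f"1 - LPIPS: {100 * (1.0 - distance):.2f}")  # leaderboard-style percentage
```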
Chinese (ZH) Performance: All metrics (OCRGenScore, T2I, Editing, I2I Translation) are computed exclusively on Chinese-language test samples.
| # | Model | OCRGenScore | T2I (VIEScore) ↑ | T2I (AR) ↑ | Edit (1-LPIPS) ↑ | Edit (AR) ↑ | Dewarping (DD) ↑ | Deshadow (MS-SSIM) ↑ | Deblur (MS-SSIM) ↑ |
|---|---|---|---|---|---|---|---|---|---|
| | **Closed-Source Models** | | | | | | | | |
| 1 | Nano Banana Pro | 76.02 | 91.61 | 82.27 | 83.43 | 67.96 | 43.99 | 93.82 | 60.26 |
| 4 | Seedream 4.5 | 63.55 | 91.09 | 66.89 | 61.87 | 40.88 | 21.45 | 83.87 | 49.51 |
| 8 | GPT Image 1.5 | 53.04 | 93.50 | 69.24 | 55.35 | 34.08 | 31.69 | 34.34 | 33.60 |
| | **Open-Source · Unified Understanding & Generation** | | | | | | | | |
| 7 | BAGEL | 55.87 | 53.09 | 2.32 | 86.09 | 8.97 | 19.25 | 74.19 | 59.08 |
| 9 | OmniGen2 | 47.82 | 51.27 | 0.87 | 63.12 | 4.99 | 19.56 | 68.38 | 46.71 |
| 11 | InternVL-U | 43.88 | 70.18 | 44.30 | 50.93 | 9.34 | 23.52 | 64.68 | 33.54 |
| 12 | ILLUME+ | 29.17 | 53.17 | 0.56 | 42.34 | 1.69 | 22.80 | 44.82 | 42.50 |
| 15 | Janus-4o | 25.50 | 44.85 | 0.19 | 44.02 | 1.17 | 25.94 | 31.45 | 39.39 |
| | **Open-Source · Specialized Generation** | | | | | | | | |
| 2 | FLUX.2-dev | 69.30 | 89.60 | 58.42 | 81.81 | 28.48 | 37.52 | 81.94 | 61.19 |
| 3 | LongCat-Image | 65.97 | 87.39 | 68.92 | 85.83 | 52.70 | 28.00 | 79.45 | 59.62 |
| 5 | FLUX.2-Klein-9B | 56.53 | 80.47 | 11.39 | 78.95 | 12.37 | 23.16 | 83.86 | 51.34 |
| 6 | Qwen-Image | 56.44 | 88.11 | 68.43 | 60.69 | 33.06 | 19.27 | 60.62 | 42.47 |
| 10 | GLM-Image | 46.72 | 87.98 | 63.47 | 70.68 | 13.82 | 21.44 | 25.73 | 44.02 |
| 13 | FLUX.1-Kontext-dev | 28.72 | 3.85 | 0.22 | 53.46 | 2.90 | 19.78 | 26.00 | 34.36 |
| 14 | SD-3.5-Large | 26.74 | 22.55 | 0.60 | 45.29 | 1.01 | 31.62 | 22.71 | 37.63 |
| # | Model | OCRGenScore | Text Removal: Handwriting (MS-SSIM) ↑ | Text Removal: Scene Text (MS-SSIM) ↑ | Style Transfer, Artistic Text: VIEScore ↑ | Style Transfer, Artistic Text: AR ↑ | Style Transfer, Hist. Doc. (VIEScore) ↑ | Hist. Doc. Restoration (1-LPIPS) ↑ | Scene Text SR (MS-SSIM) ↑ | Layout-Aware Text Gen.: VIEScore ↑ | Layout-Aware Text Gen.: AR ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | **Closed-Source Models** | | | | | | | | | | |
| 1 | Nano Banana Pro | 76.02 | 84.16 | 93.53 | 64.21 | 79.23 | 77.66 | 74.15 | 52.98 | 88.02 | 100.00 |
| 8 | GPT Image 1.5 | 53.04 | 47.59 | 50.75 | 81.63 | 90.11 | 80.57 | 46.18 | 19.68 | 86.05 | 95.30 |
| 4 | Seedream 4.5 | 63.55 | 69.97 | 88.99 | 76.60 | 99.23 | 78.07 | 59.88 | 29.30 | 55.26 | 89.47 |
| | **Open-Source · Unified Understanding & Generation** | | | | | | | | | | |
| 7 | BAGEL | 55.87 | 80.23 | 87.20 | 12.74 | 1.30 | 44.49 | 73.71 | 88.05 | 82.51 | 1.74 |
| 9 | OmniGen2 | 47.82 | 47.28 | 83.19 | 5.66 | 6.52 | 64.66 | 50.64 | 82.67 | 67.37 | 0.00 |
| 11 | InternVL-U | 43.88 | 57.19 | 53.47 | 44.41 | 57.21 | 20.75 | 37.45 | 19.48 | 62.28 | 83.30 |
| 12 | ILLUME+ | 29.17 | 48.93 | 71.65 | 0.00 | 0.00 | 1.00 | 25.84 | 16.19 | 3.08 | 0.00 |
| 15 | Janus-4o | 28.50 | 43.69 | 37.33 | 0.00 | 0.00 | 16.53 | 28.60 | 16.21 | 32.96 | 0.00 |
| | **Open-Source · Specialized Generation** | | | | | | | | | | |
| 6 | Qwen-Image | 56.44 | 49.22 | 38.52 | 68.27 | 96.01 | 72.53 | 59.48 | 27.24 | 82.44 | 99.00 |
| 3 | LongCat-Image | 65.97 | 78.11 | 90.32 | 58.02 | 91.30 | 69.74 | 66.66 | 18.05 | 82.47 | 92.90 |
| 14 | SD-3.5-Large | 26.74 | 44.64 | 25.62 | 48.05 | 3.90 | 0.00 | 42.79 | 18.16 | 17.93 | 0.00 |
| 13 | FLUX.1-Kontext-dev | 28.52 | 58.34 | 27.83 | 37.85 | 3.96 | 52.80 | 43.24 | 21.34 | 22.61 | 0.00 |
| 2 | FLUX.2-dev | 69.30 | 78.75 | 93.96 | 73.96 | 82.64 | 68.48 | 74.62 | 33.10 | 75.85 | 36.57 |
| 5 | FLUX.2-Klein-9B | 56.53 | 65.71 | 92.63 | 67.54 | 86.79 | 67.48 | 63.81 | 44.49 | 75.16 | 10.47 |
| 10 | GLM-Image | 46.72 | 64.61 | 18.00 | 54.20 | 44.39 | 77.81 | 67.98 | 11.07 | 79.31 | 58.31 |
Ranked by OCRGenScore on Chinese (ZH) text samples.
English (EN) Performance: All metrics (OCRGenScore, T2I, Editing, I2I Translation) are computed exclusively on English-language test samples.
| # | Model | OCRGenScore | T2I (VIEScore) ↑ | T2I (AR) ↑ | Edit (1-LPIPS) ↑ | Edit (AR) ↑ | Dewarping (DD) ↑ | Deshadow (MS-SSIM) ↑ | Deblur (MS-SSIM) ↑ |
|---|---|---|---|---|---|---|---|---|---|
| | **Closed-Source Models** | | | | | | | | |
| 1 | Nano Banana Pro | 78.75 | 92.87 | 71.71 | 87.00 | 74.96 | 41.53 | 81.99 | 62.92 |
| 4 | Seedream 4.5 | 62.73 | 89.82 | 75.22 | 61.62 | 49.51 | 25.45 | 45.05 | 62.90 |
| 9 | GPT Image 1.5 | 53.95 | 93.31 | 68.21 | 59.36 | 51.37 | 28.13 | 36.99 | 29.98 |
| | **Open-Source · Unified Understanding & Generation** | | | | | | | | |
| 5 | OmniGen2 | 62.26 | 68.55 | 51.47 | 68.54 | 22.66 | 22.11 | 62.20 | 39.44 |
| 7 | BAGEL | 60.96 | 61.02 | 27.37 | 87.96 | 21.16 | 21.84 | 77.17 | 61.57 |
| 11 | InternVL-U | 45.45 | 62.36 | 44.04 | 56.82 | 25.08 | 32.13 | 41.83 | 26.97 |
| 14 | Janus-4o | 33.54 | 63.19 | 26.86 | 47.04 | 10.43 | 24.18 | 35.70 | 33.95 |
| 15 | ILLUME+ | 31.02 | 33.17 | 10.34 | 43.09 | 3.67 | 30.16 | 41.75 | 38.54 |
| | **Open-Source · Specialized Generation** | | | | | | | | |
| 2 | FLUX.2-dev | 70.74 | 88.17 | 74.19 | 86.04 | 54.64 | 43.81 | 60.98 | 59.35 |
| 3 | LongCat-Image | 65.76 | 80.70 | 66.12 | 85.29 | 56.37 | 28.07 | 68.45 | 60.66 |
| 6 | FLUX.2-Klein-9B | 61.95 | 85.18 | 67.43 | 83.40 | 49.82 | 25.28 | 45.04 | 51.37 |
| 8 | Qwen-Image | 55.14 | 80.76 | 62.04 | 70.61 | 49.34 | 29.06 | 44.77 | 33.67 |
| 10 | GLM-Image | 47.17 | 81.11 | 75.16 | 71.25 | 29.26 | 27.36 | 51.96 | 37.08 |
| 12 | FLUX.1-Kontext-dev | 43.59 | 72.99 | 42.82 | 54.67 | 27.37 | 37.31 | 33.54 | 25.53 |
| 13 | SD-3.5-Large | 35.32 | 75.96 | 54.00 | 49.65 | 10.30 | 30.37 | 32.24 | 25.66 |
| # | Model | OCRGenScore | Text Removal: Handwriting (MS-SSIM) ↑ | Text Removal: Scene Text (MS-SSIM) ↑ | Style Transfer, Artistic Text: VIEScore ↑ | Style Transfer, Artistic Text: AR ↑ | Style Transfer, Hist. Doc. (VIEScore) ↑ | Hist. Doc. Restoration (1-LPIPS) ↑ | Scene Text SR (MS-SSIM) ↑ | Layout-Aware Text Gen.: VIEScore ↑ | Layout-Aware Text Gen.: AR ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | **Closed-Source Models** | | | | | | | | | | |
| 1 | Nano Banana Pro | 78.75 | – | 89.79 | 92.33 | 98.77 | – | – | 76.36 | 89.72 | 100.00 |
| 9 | GPT Image 1.5 | 53.95 | – | 52.01 | 92.69 | 98.33 | – | – | 26.78 | 85.33 | 100.00 |
| 4 | Seedream 4.5 | 62.73 | – | 78.87 | 86.44 | 99.45 | – | – | 54.74 | 53.74 | 100.00 |
| | **Open-Source · Unified Understanding & Generation** | | | | | | | | | | |
| 7 | BAGEL | 60.96 | – | 72.36 | 53.64 | 24.30 | – | – | 87.29 | 68.11 | 64.24 |
| 5 | OmniGen2 | 62.26 | – | 81.03 | 64.68 | 88.66 | – | – | 84.83 | 77.79 | 86.66 |
| 11 | InternVL-U | 45.45 | – | 68.79 | 62.41 | 62.77 | – | – | 16.00 | 45.60 | 88.36 |
| 15 | ILLUME+ | 31.02 | – | 65.19 | 1.00 | 4.52 | – | – | 27.57 | 12.26 | 6.58 |
| 14 | Janus-4o | 33.54 | – | 34.21 | 50.16 | 16.34 | – | – | 21.89 | 50.14 | 34.64 |
| | **Open-Source · Specialized Generation** | | | | | | | | | | |
| 8 | Qwen-Image | 55.14 | – | 50.60 | 72.99 | 94.93 | – | – | 28.06 | 86.44 | 98.86 |
| 3 | LongCat-Image | 65.76 | – | 88.80 | 80.42 | 97.24 | – | – | 30.69 | 84.67 | 99.38 |
| 13 | SD-3.5-Large | 35.32 | – | 26.60 | 65.23 | 84.12 | – | – | 16.58 | 19.11 | 9.68 |
| 12 | FLUX.1-Kontext-dev | 43.59 | – | 35.91 | 79.47 | 100.00 | – | – | 23.64 | 75.39 | 57.02 |
| 2 | FLUX.2-dev | 70.74 | – | 99.71 | 85.76 | 100.00 | – | – | 72.46 | 77.21 | 52.65 |
| 6 | FLUX.2-Klein-9B | 61.95 | – | 68.55 | 89.42 | 100.00 | – | – | 48.43 | 72.94 | 72.75 |
| 10 | GLM-Image | 47.17 | – | 14.42 | 65.34 | 59.41 | – | – | 12.53 | 82.31 | 75.03 |
Ranked by OCRGenScore on English (EN) text samples.

Benchmark Overview — Task & Data Examples


OCRGenBench is the most comprehensive benchmark to date for evaluating the OCR generative capabilities of generative models. It is the first to unify T2I generation, text editing, and OCR-related image-to-image (I2I) translation into a single evaluation of a model's visual text synthesis abilities, i.e., its OCR generative capabilities. The benchmark covers 5 common text categories and 33 OCR generative tasks, comprising 1,060 challenging, human-annotated samples with dense text, varied layouts, multiple aspect ratios, and bilingual content. We also design OCRGenScore, a unified metric that assesses text accuracy, instruction following, visual quality, and structural consistency in visual text synthesis.
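For intuition, the Accuracy Rate (AR) columns throughout the leaderboard measure whether the rendered text is actually readable as the intended string. OCRGenBench's exact AR protocol is not spelled out on this page, so the sketch below is only one plausible instantiation: OCR the generated image (Tesseract via `pytesseract`, as an example engine) and compare the result to the target text with a normalized similarity ratio.

```python
# Illustrative text-accuracy check in the spirit of the AR columns.
# NOT the official AR protocol (which this page does not specify): this
# variant OCRs the generated image and scores similarity to the target text.
from difflib import SequenceMatcher

from PIL import Image
import pytesseract  # any OCR engine would do; Tesseract is just an example

def text_accuracy(image_path: str, target_text: str, lang: str = "eng") -> float:
    recognized = pytesseract.image_to_string(Image.open(image_path), lang=lang)
    # Normalize whitespace before comparing.
    a = " ".join(recognized.split())
    b = " ".join(target_text.split())
    return SequenceMatcher(None, a, b).ratio()  # 1.0 = exact match

# Hypothetical usage with placeholder file name and target string.
print(text_accuracy("generated_poster.png", "GRAND OPENING SALE"))
```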

Submit Your Model Results

Have you evaluated your model on OCRGenBench? Upload your results as an Excel or CSV file to request a leaderboard entry.


Download the template, fill in your model's scores, then upload to preview the format. The submission portal will open soon — follow the GitHub repo for updates.
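Until the portal opens, a quick local sanity check of your results file can catch format issues early. The column names below are hypothetical placeholders; match them to the headers in the downloaded template, which may differ:

```python
# Quick pre-upload sanity check for a results file, sketched with pandas.
# The EXPECTED_COLUMNS names are hypothetical placeholders, not the official
# template schema; replace them with the headers from the template you download.
import pandas as pd

EXPECTED_COLUMNS = ["model", "task", "metric", "score"]  # placeholder schema

df = pd.read_csv("my_results.csv")  # use pd.read_excel() for .xlsx files

missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
if missing:
    raise ValueError(f"Missing columns: {missing}")
if not df["score"].between(0, 100).all():
    raise ValueError("Scores should be percentages in [0, 100].")
print(df.head())  # eyeball the first rows before submitting
```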

BibTeX
@article{zhang2025ocrgenbench,
  title     = {{OCRGenBench: A Comprehensive Benchmark for Evaluating OCR Generative Capabilities}},
  author    = {Zhang, Peirong and Xu, Haowei and Zhang, Jiaxin and Zheng, Xuhan and Xu, Guitao and Zhang, Yuyi and Liu, Junle and Yang, Zhenhua and Zhou, Wei and Jin, Lianwen},
  journal   = {arXiv preprint arXiv:2507.15085},
  year      = {2025},
}