BioVLM 8B is a cost-efficient scientific domain vision-language model that surpasses GPT-5.2 on biological research tasks. Developed in collaboration with Harvard Medical School and Edinburgh's Roslin Institute, it uses automated rich-text data synthesis from raw PDF papers to train a domain-specialized VLM — with the entire pipeline costing less than $200.

Key Features

Anti-Hallucination Document Pipeline — Five-stage processing: layout detection, structured Markdown, cross-reference repair, specialized chart OCR, and hallucination cleaning with terminology correction.
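The five stages above compose into a linear document-processing chain. Below is a minimal sketch of that composition; every stage implementation is a placeholder stub (the real stages involve layout models and OCR), and all function names are assumptions made for illustration.

```python
from functools import reduce

def detect_layout(doc: str) -> str:
    return doc + " ->layout"       # stage 1: page layout detection (stub)

def to_markdown(doc: str) -> str:
    return doc + " ->markdown"     # stage 2: structured Markdown conversion (stub)

def repair_cross_references(doc: str) -> str:
    return doc + " ->xrefs"        # stage 3: cross-reference repair (stub)

def chart_ocr(doc: str) -> str:
    return doc + " ->chart-ocr"    # stage 4: specialized chart OCR (stub)

def clean_hallucinations(doc: str) -> str:
    return doc + " ->cleaned"      # stage 5: hallucination cleaning + terminology fixes (stub)

STAGES = [detect_layout, to_markdown, repair_cross_references,
          chart_ocr, clean_hallucinations]

def process(raw_pdf_text: str) -> str:
    # Run the document through all five stages in order.
    return reduce(lambda doc, stage: stage(doc), STAGES, raw_pdf_text)
```

Structuring the pipeline as an ordered list of pure functions makes it easy to insert, remove, or unit-test a stage in isolation.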

Agentic Data Synthesis — Dual-source question generation from papers and benchmark pattern extraction, with retrieval-augmented answer generation and weakness-driven augmentation.
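A toy sketch of the dual-source generation and retrieval step described above. The question templates and the word-overlap retriever are stand-ins chosen for brevity; a real pipeline would use an LLM generator and an embedding or BM25 retriever.

```python
def questions_from_paper(paper: dict) -> list[str]:
    # Source 1: questions grounded in the paper's own sections
    # (the template string here is a made-up placeholder).
    return [f"What does the paper report in its {s} section?"
            for s in paper["sections"]]

def questions_from_patterns(patterns: list[str], topic: str) -> list[str]:
    # Source 2: benchmark-style question patterns instantiated on a topic.
    return [p.format(topic=topic) for p in patterns]

def retrieve_answer_context(question: str, passages: list[str]) -> str:
    # Toy retrieval-augmented step: return the passage with the most
    # word overlap with the question.
    q_words = set(question.lower().split())
    return max(passages, key=lambda p: len(q_words & set(p.lower().split())))
```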

Two-Stage Training — "Know the Facts" (SFT for domain knowledge injection) followed by "Know the Reasons" (GRPO for reasoning), with no additional human annotation required.
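The schedule and the GRPO reward-normalization idea can be sketched as follows. The stage names come from the text; the group-relative advantage function shows the core mechanism of GRPO (scoring each sampled completion against its group, with no learned value model), not this project's exact implementation.

```python
from dataclasses import dataclass

@dataclass
class TrainingStage:
    name: str
    method: str   # "SFT" or "GRPO"
    goal: str

SCHEDULE = [
    TrainingStage("Know the Facts", "SFT", "domain knowledge injection"),
    TrainingStage("Know the Reasons", "GRPO", "reasoning"),
]

def grpo_advantages(rewards: list[float]) -> list[float]:
    # Group-relative advantage: normalize each completion's reward by the
    # mean and standard deviation of its sampling group.
    mu = sum(rewards) / len(rewards)
    std = (sum((r - mu) ** 2 for r in rewards) / len(rewards)) ** 0.5
    std = std or 1.0   # identical rewards carry no learning signal
    return [(r - mu) / std for r in rewards]
```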

Closed-Loop Evaluation — Automatically detects weak areas, searches PubMed for relevant papers, and iterates training until convergence.
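The closed loop amounts to: evaluate per category, pick the weak categories, retrain on fresh data for those topics, and repeat until nothing falls below threshold. A minimal sketch, where `augment_and_retrain` is a hypothetical stand-in for the PubMed search, data synthesis, and training round:

```python
def weak_categories(scores: dict[str, float], threshold: float) -> list[str]:
    # Weakness detection: any evaluation category below the accuracy threshold.
    return [cat for cat, acc in scores.items() if acc < threshold]

def closed_loop(evaluate, augment_and_retrain,
                threshold: float = 0.5, max_rounds: int = 5) -> dict[str, float]:
    # evaluate() returns per-category accuracy; augment_and_retrain(cats)
    # stands in for fetching papers on the weak topics and retraining.
    scores = evaluate()
    for _ in range(max_rounds):
        weak = weak_categories(scores, threshold)
        if not weak:                  # convergence: no category below threshold
            break
        augment_and_retrain(weak)
        scores = evaluate()
    return scores
```

The `max_rounds` cap guarantees termination even if some category never clears the threshold.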

Results

  • LAB-Bench weighted accuracy: 48.0%, vs. 44.2% for GPT-5.2 (+3.8 percentage points)
  • Synthetic data outperforms human-annotated data by 14.7 to 17.1 percentage points at the same token budget
  • Full pipeline cost: < $200 (cloud GPU)
  • Trained on ~20,000 PubMed Central open-access papers
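For reference, "weighted accuracy" here means averaging per-subtask accuracy weighted by each subtask's question count. A small sketch (the subtask names and counts below are invented for illustration, not LAB-Bench's actual composition):

```python
def weighted_accuracy(per_task: dict[str, tuple[float, int]]) -> float:
    # per_task maps subtask name -> (accuracy, question count);
    # the overall score weights each accuracy by its question count.
    total = sum(n for _, n in per_task.values())
    return sum(acc * n for acc, n in per_task.values()) / total

score = weighted_accuracy({"SeqQA": (0.50, 10), "FigQA": (1.00, 30)})  # -> 0.875
```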

Collaborators

Harvard Medical School and the Roslin Institute (University of Edinburgh)