olmocr is a toolkit from AllenAI for linearizing PDF documents. It is built for preparing LLM datasets and training models.
