Hey! I am the founder of Tensorlake. We benchmarked the models that our customers consider using in enterprises or regulated industries where there is a big need for processing documents for various automation. Benchmarking takes a lot of time so we focussed on the ones that we get asked about.
On Gemini and other VLMs - we excluded these models because they don't do visual grounding - aka they don't provide page layouts, bounding boxes of elements on the pages. This is a table stakes feature for use-cases customers are building with Tensorlake. It wouldn't be possible to build citations without bounding boxes.
On pricing - we are probably the only company offer a pure on-demand pricing without any tiers. With Tensorlake, you can get back markdown from every page, summaries of figures, tables and charts, structured data, page classification, etc - in ONE api call. This means we are running a bunch of different models under the hood. If you add up the token count, and complexity of infrastructure to build a complex pipeline around Gemini, and other OCR/Layout detection model I bet the price you would end up with won't be any cheaper than what we provide :) Plus doing this at scale is very very complex - it requires building a lot of sophisticated infrastructure - another source of cost behind modern Document Ingestion services.
Google's Vertex API for document processing absolutely does bounding boxes. In fact, some of the document processors are just a wrap around Google's product.
It does, we have users in Europe and Asia using it with non English languages. Can you please send me a message at diptanu at tensorlake dot ai, would love to see why it didn’t work.
On Gemini and other VLMs - we excluded these models because they don't do visual grounding - aka they don't provide page layouts, bounding boxes of elements on the pages. This is a table stakes feature for use-cases customers are building with Tensorlake. It wouldn't be possible to build citations without bounding boxes.
On pricing - we are probably the only company offer a pure on-demand pricing without any tiers. With Tensorlake, you can get back markdown from every page, summaries of figures, tables and charts, structured data, page classification, etc - in ONE api call. This means we are running a bunch of different models under the hood. If you add up the token count, and complexity of infrastructure to build a complex pipeline around Gemini, and other OCR/Layout detection model I bet the price you would end up with won't be any cheaper than what we provide :) Plus doing this at scale is very very complex - it requires building a lot of sophisticated infrastructure - another source of cost behind modern Document Ingestion services.