Unstructured API
Overview
The Unstructured API consists of two parts:
- The Unstructured Workflow Endpoint enables a full range of partitioning, chunking, embedding, and enrichment options for your files and data. It is designed to batch-process files and data in remote locations; send processed results to various storage locations, databases, and vector stores; and use the latest and highest-performing models on the market today. It has built-in logic to deliver the highest quality results at the lowest cost.
- The Unstructured Partition Endpoint is intended for rapid prototyping of Unstructured’s various partitioning strategies, with limited support for chunking. It works only with local files, processed one file at a time. Use the Unstructured Workflow Endpoint instead for production-level scenarios, file processing in batches, files and data in remote locations, generating embeddings, applying post-transform enrichments, using the latest and highest-performing models, and for the highest quality results at the lowest cost.
Benefits over open source
The Unstructured API provides the following benefits beyond the Unstructured open source library offering:
- Designed for production scenarios.
- Significantly increased performance on document and table extraction.
- Access to newer and more sophisticated vision transformer models.
- Access to Unstructured’s fine-tuned OCR models.
- Access to Unstructured’s by-page and by-similarity chunking strategies.
- Adherence to security and SOC2 Type 1, SOC2 Type 2, and HIPAA compliance standards.
- Authentication and identity management.
- Incremental data loading.
- Image extraction from documents.
- More sophisticated document hierarchy detection.
- Unstructured manages code dependencies, such as the Tesseract library.
- Unstructured manages its own infrastructure, including parallelization and other performance optimizations.
Pricing
To call the Unstructured API, you must have an Unstructured account.
Unstructured offers three account pricing plans:
- SaaS Cloud-hosted - Processing happens on Unstructured’s software-as-a-service (SaaS) cloud infrastructure in a multi-tenant environment.
- Hybrid SaaS - Processing also happens on Unstructured’s SaaS cloud infrastructure, but your data stays protected in a dedicated cloud environment, maintaining strict data privacy.
- VPC - Sometimes referred to as self-hosted, an instance of the Unstructured SaaS is deployed into your own virtual private cloud (VPC), providing complete data ownership and infrastructure control, full customization, and dedicated technical support.
For more details, see the Unstructured Pricing page.
Some of these plans are billed on a per-page basis.
Unstructured calculates a page as follows:
- For these file types, a page is a page, slide, or image: .pdf, .pptx, and .tiff.
- For .docx files that have page metadata, Unstructured calculates the number of pages based on that metadata.
- For all other file types, Unstructured calculates the number of pages as the file’s size divided by 100 KB.
- For non-file data, Unstructured calculates a page as 100 KB of incoming data to be processed.
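The page rules above can be sketched as a small estimator. This is only an illustration for budgeting purposes, not Unstructured's actual billing code: the function names are hypothetical, and two details the rules leave open are filled in as assumptions, namely that 100 KB means 100 * 1024 bytes and that a partial page rounds up to a whole page (with a minimum of one).

```python
import math
from pathlib import PurePath
from typing import Optional

# Assumption: "100 KB" is read as 100 * 1024 bytes. The pricing rules do not
# say whether a kilobyte is 1,000 or 1,024 bytes.
BYTES_PER_PAGE = 100 * 1024

# File types where each page, slide, or image counts as one page.
NATIVE_PAGE_TYPES = {".pdf", ".pptx", ".tiff"}


def pages_from_size(size_bytes: int) -> int:
    """Size-based page count: size divided by 100 KB.

    Assumption: partial pages round up, and anything counts as at least one page.
    """
    return max(1, math.ceil(size_bytes / BYTES_PER_PAGE))


def estimate_pages(file_name: str, size_bytes: int,
                   page_count: Optional[int] = None) -> int:
    """Estimate the page count for one file (hypothetical helper).

    page_count is the number of pages, slides, or images if the caller knows
    it, for example from the .docx page metadata.
    """
    suffix = PurePath(file_name).suffix.lower()
    if suffix in NATIVE_PAGE_TYPES:
        if page_count is None:
            raise ValueError(f"page_count is required for {suffix} files")
        return page_count
    if suffix == ".docx" and page_count is not None:
        # .docx with page metadata: use that metadata.
        return page_count
    # All other file types, and .docx without page metadata: size-based.
    return pages_from_size(size_bytes)


def estimate_pages_for_data(size_bytes: int) -> int:
    """Non-file data: one page per 100 KB of incoming data."""
    return pages_from_size(size_bytes)
```

Under these assumptions, a 12-page .pdf counts as 12 pages regardless of its size, while a 250 KB .csv file counts as 3 pages (2.5 rounded up).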
Get support
Should you require any assistance or have any questions regarding the Unstructured API, please contact us directly.