AI-Powered Business Intelligence: Embeddings, Semantic Search & Document Processing

The real cost of AI at scale

Most AI proofs-of-concept look cheap. Production systems don't. When you're processing thousands of documents, running semantic search over millions of records, and generating summaries on demand, the per-token costs compound fast.

We've built AI pipelines processing tens of thousands of company accounts — extracting financial data points, carbon emissions disclosures, and ESG metrics. Here's what actually works in production, and how we kept costs manageable.

Semantic search with text embeddings

Keyword search breaks on synonyms, paraphrasing, and domain jargon. Semantic search — using vector embeddings to represent meaning rather than tokens — solves this, and the tooling is now genuinely production-ready.

Our implementation uses OpenAI's text-embedding-3-small model to generate embeddings at indexing time. These are stored in OpenSearch using the k-NN plugin with HNSW indexing. At query time, the user's input is embedded with the same model and a nearest-neighbour search returns semantically similar results.

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def embed(text: str) -> list[float]:
    response = await client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding
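On the storage side, a minimal sketch of the OpenSearch index mapping and k-NN query body might look like the following. The field names (embedding, text), the lucene engine and the cosinesimil space are illustrative assumptions rather than our production settings; 1536 is the default output dimension of text-embedding-3-small.

```python
EMBEDDING_DIM = 1536  # default output size of text-embedding-3-small


def knn_index_body(field: str = "embedding") -> dict:
    """Index settings and mapping for an HNSW-backed vector field.

    The field name, engine and space type are assumptions for illustration.
    """
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "text": {"type": "text"},  # raw text, also usable for BM25
                field: {
                    "type": "knn_vector",
                    "dimension": EMBEDDING_DIM,
                    "method": {
                        "name": "hnsw",
                        "engine": "lucene",
                        "space_type": "cosinesimil",
                    },
                },
            }
        },
    }


def knn_query_body(vector: list[float], k: int = 10, field: str = "embedding") -> dict:
    """Approximate nearest-neighbour query for an already-embedded input."""
    return {"size": k, "query": {"knn": {field: {"vector": vector, "k": k}}}}
```

With the opensearch-py client, these bodies would be passed to client.indices.create(index=..., body=...) and client.search(index=..., body=...) respectively.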

The key decisions: text-embedding-3-small over text-embedding-ada-002 for better retrieval quality at lower cost; HNSW over exact k-NN for approximate search that keeps query latency low on large indices; and hybrid search that combines BM25 and vector scores, so exact keyword matches and semantic matches both surface.
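To make the hybrid idea concrete, here is a small sketch of one common way to fuse the two result sets: min-max normalise each score list, then take a weighted sum. This is an illustration under assumptions, not necessarily how our pipeline (or OpenSearch's own hybrid query) combines scores; hybrid_rank and vector_weight are hypothetical names.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale scores to [0, 1] so BM25 and vector scores are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def hybrid_rank(
    bm25: dict[str, float],
    vector: dict[str, float],
    vector_weight: float = 0.5,
) -> list[tuple[str, float]]:
    """Weighted sum of normalised scores; a missing score counts as 0."""
    bm25_n, vec_n = min_max(bm25), min_max(vector)
    combined = {
        doc_id: (1 - vector_weight) * bm25_n.get(doc_id, 0.0)
        + vector_weight * vec_n.get(doc_id, 0.0)
        for doc_id in set(bm25_n) | set(vec_n)
    }
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
```

A document that scores moderately on both signals can outrank one that tops a single list, which is the behaviour that makes hybrid search worth the extra query.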

Cutting OpenAI costs in half with Batch API

The OpenAI Batch API processes requests asynchronously, returning results within a 24-hour completion window at 50% of the standard per-token rate. For offline workloads (nightly document processing, bulk classification, scheduled summarisation) this is a straightforward cost reduction with no quality trade-off, because the same models serve the requests.

The pattern:

Collect tasks into a JSONL file with a unique custom_id per request
Submit the batch via the Batch API and store the batch ID
Poll for completion or use an SQS-triggered Lambda to check status
Retrieve results, match by custom_id, and write to the database
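The first and last steps above can be sketched without any network calls. The helper names (build_batch_lines, match_results) and the task IDs are hypothetical; the line shapes follow the Batch API's documented input and output formats for the /v1/chat/completions endpoint.

```python
import json


def build_batch_lines(tasks: dict[str, str], model: str = "gpt-4o-mini") -> str:
    """Serialise {custom_id: prompt} into Batch API input JSONL."""
    lines = []
    for custom_id, prompt in tasks.items():
        lines.append(json.dumps({
            "custom_id": custom_id,  # must be unique within the batch
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
            },
        }))
    return "\n".join(lines) + "\n"


def match_results(output_jsonl: str) -> tuple[dict[str, str], dict[str, object]]:
    """Split batch output into successes and failures, keyed by custom_id.

    Output order is not guaranteed to match input order, hence the keying.
    """
    done: dict[str, str] = {}
    failed: dict[str, object] = {}
    for line in output_jsonl.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        response = record.get("response") or {}
        if record.get("error") or response.get("status_code") != 200:
            failed[record["custom_id"]] = record.get("error") or response
            continue
        body = response["body"]
        done[record["custom_id"]] = body["choices"][0]["message"]["content"]
    return done, failed
```

In between, the file is uploaded with client.files.create(..., purpose="batch"), submitted with client.batches.create(input_file_id=..., endpoint="/v1/chat/completions", completion_window="24h"), and checked with client.batches.retrieve(batch_id).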

We run this for nightly ESG data enrichment on thousands of company profiles. The cost reduction was immediate; the 24-hour latency is invisible to users who see data refreshed each morning.

Document processing with JSON schema prompts

Extracting specific data points from unstructured company accounts is a hard problem. We use a JSON schema-based prompting approach: define the exact output shape you want, pass it as the response_format with json_schema mode, and the model returns structured data that maps directly to your ORM models.

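response = await client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Extract the financial metrics defined in the schema. Use null for any metric the document does not state."},
        # Crude character-based cap on input size; long documents need chunking.
        {"role": "user", "content": document_text[:8000]}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "financial_extract",
            # strict constrains the output to the schema. It requires every
            # property to be listed in "required" and additionalProperties False.
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    # Nullable, so the model is not forced to invent absent values.
                    "revenue": {"type": ["number", "null"]},
                    "ebitda": {"type": ["number", "null"]},
                    "carbon_scope_1": {"type": ["number", "null"]},
                    "carbon_scope_2": {"type": ["number", "null"]}
                },
                "required": ["revenue", "ebitda", "carbon_scope_1", "carbon_scope_2"],
                "additionalProperties": False
            }
        }
    }
)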

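Schema mode removes the need for custom output parsing: the response content is JSON you can load directly, and with strict enabled on the schema it is constrained to the declared shape, which virtually eliminates malformed responses. Combined with gpt-4o-mini rather than gpt-4o for extraction tasks, costs drop dramatically with minimal accuracy loss on well-structured documents.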
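As a sketch of the "maps directly to your models" step, the following loads the message content (response.choices[0].message.content) into a typed record. FinancialExtract and parse_extract are hypothetical stand-ins for whatever ORM model you use; a type check is kept as a cheap safety net even though schema mode should make it redundant.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class FinancialExtract:
    """Hypothetical stand-in for an ORM model."""
    revenue: Optional[float]
    ebitda: Optional[float]
    carbon_scope_1: Optional[float]
    carbon_scope_2: Optional[float]


FIELDS = ("revenue", "ebitda", "carbon_scope_1", "carbon_scope_2")


def parse_extract(content: str) -> FinancialExtract:
    """Load schema-mode JSON; absent or null metrics become None."""
    data = json.loads(content)
    values: dict[str, Optional[float]] = {}
    for name in FIELDS:
        value = data.get(name)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if value is not None and not is_number:
            raise ValueError(f"{name} must be a number or null, got {value!r}")
        values[name] = None if value is None else float(value)
    return FinancialExtract(**values)
```

Keeping unstated metrics as None rather than 0 matters downstream: a company that did not disclose Scope 1 emissions is not the same as one that reported zero.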

Gemini for PDF OCR

Google's Gemini models handle PDF inputs natively, making them ideal for OCR tasks on scanned annual reports where text extraction tools produce garbage. We use Gemini via the GCP Vertex AI API in batch prediction mode — upload PDFs to Cloud Storage, submit a batch prediction job, and retrieve structured JSON results at a fraction of the real-time API cost.

In our experience, on high-quality scanned documents Gemini's extraction quality matches or exceeds traditional OCR tooling such as Tesseract or Amazon Textract, and it understands the document's semantic structure (tables, footnotes, section headers) without custom parsing logic.
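For the batch prediction workflow described above, each input line wraps one generate-content request that points at a PDF in Cloud Storage. The sketch below builds such a line; the bucket path and instruction are made up, and the exact request shape is an assumption that should be checked against the current Vertex AI batch prediction documentation before use.

```python
import json


def gemini_batch_line(pdf_uri: str, instruction: str) -> str:
    """Build one JSONL input line for a Gemini batch prediction job.

    Assumed shape: {"request": <generate-content request>}, with the PDF
    referenced by its gs:// URI rather than inlined.
    """
    request = {
        "contents": [{
            "role": "user",
            "parts": [
                {"text": instruction},
                {"fileData": {"fileUri": pdf_uri, "mimeType": "application/pdf"}},
            ],
        }],
        # Ask for JSON output so results can be parsed without scraping prose.
        "generationConfig": {"responseMimeType": "application/json"},
    }
    return json.dumps({"request": request})
```

One line per PDF is written to a JSONL file in Cloud Storage, and the batch job is pointed at that file.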

Clustering for pattern discovery

Once you have embeddings, clustering becomes straightforward. We run K-means and DBSCAN clustering on company profile embeddings to discover natural groupings in large datasets — useful for segmentation, anomaly detection, and surfacing similar companies without manual categorisation.

The workflow: embed → cluster → label each cluster by summarising the records nearest its centroid with a cheap LLM call → store cluster assignments alongside records. Users get a "similar companies" feature; the data team gets automatic industry segmentation they didn't have to label by hand.
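The labelling step can be sketched in plain Python: given embeddings and cluster assignments (for example, the labels produced by scikit-learn's KMeans or DBSCAN), pick the few members closest to each cluster's mean and send only those to the LLM. The function names and the choice of Euclidean distance are assumptions for illustration.

```python
import math


def centroid(vectors: list[list[float]]) -> list[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def representatives(
    embeddings: dict[str, list[float]],
    assignments: dict[str, int],
    n: int = 3,
) -> dict[int, list[str]]:
    """Return the n record IDs nearest each cluster's centroid."""
    clusters: dict[int, list[str]] = {}
    for record_id, cluster_id in assignments.items():
        if cluster_id == -1:  # scikit-learn's DBSCAN marks noise points as -1
            continue
        clusters.setdefault(cluster_id, []).append(record_id)

    result: dict[int, list[str]] = {}
    for cluster_id, members in clusters.items():
        centre = centroid([embeddings[m] for m in members])
        ranked = sorted(members, key=lambda m: math.dist(embeddings[m], centre))
        result[cluster_id] = ranked[:n]
    return result
```

Summarising a handful of representatives per cluster keeps the labelling cost proportional to the number of clusters rather than the number of records.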

Practical takeaways

Use Batch API for any workload that tolerates 24-hour latency — the 50% cost saving is free money
JSON schema mode is non-negotiable for extraction tasks — it eliminates an entire class of production bugs
Hybrid search (BM25 + vector) consistently outperforms either alone — don't skip BM25 just because you have embeddings
Gemini is genuinely competitive on document tasks; the GCP batch prediction API makes it cost-effective at scale
Profile your token usage per endpoint before assuming the expensive model is necessary
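For the last takeaway, a minimal usage tracker is often enough to see where tokens go. This sketch assumes you feed it the prompt_tokens and completion_tokens values from each response's usage object; UsageTracker and the endpoint labels are hypothetical.

```python
from collections import defaultdict


class UsageTracker:
    """Accumulates token counts per logical endpoint (e.g. "extract")."""

    def __init__(self) -> None:
        self._totals: dict[str, dict[str, int]] = defaultdict(
            lambda: {"calls": 0, "prompt_tokens": 0, "completion_tokens": 0}
        )

    def record(self, endpoint: str, prompt_tokens: int, completion_tokens: int) -> None:
        totals = self._totals[endpoint]
        totals["calls"] += 1
        totals["prompt_tokens"] += prompt_tokens
        totals["completion_tokens"] += completion_tokens

    def report(self) -> list[tuple[str, dict[str, int]]]:
        """Endpoints sorted by total tokens, heaviest first."""
        return sorted(
            ((name, dict(totals)) for name, totals in self._totals.items()),
            key=lambda item: item[1]["prompt_tokens"] + item[1]["completion_tokens"],
            reverse=True,
        )
```

After a chat completion call, this would be used as tracker.record("extract", response.usage.prompt_tokens, response.usage.completion_tokens); the heaviest endpoints in the report are the first candidates for a cheaper model or the Batch API.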