AI Voice · March 11, 2026 · 18 min read

Pinecone Vector Search for Partner Matching: A Real-World Example

How we used Pinecone vector embeddings to build an AI-powered partner matching engine for 17K+ professionals. Architecture, implementation, and lessons learned.

Loic Bachellerie

Senior Product Engineer

Introduction

The first version of our matching algorithm was a disaster. We were using keyword overlap and filter chains to connect professionals on onSpark, our AI partnership platform. The result: users with nearly identical goals were getting zero matches because one said "growth hacking" and the other said "user acquisition." Same intent, zero overlap.

We rebuilt the engine using Pinecone vector search and OpenAI embeddings. Within two weeks of shipping, match acceptance rates climbed from 23% to 61%. Today, onSpark runs 17,000+ professional profiles through that same Pinecone index every day, returning ranked match candidates in under 120ms.

This post is the technical deep-dive I wish existed when I started. You will get the full architecture, every TypeScript implementation detail, the embedding strategy that actually works at scale, and the production optimizations that keep costs sane past the 10K-user mark.

By the end, you will understand:

  • Why vector search beats filter-based matching for nuanced professional profiles
  • How to structure embeddings for multi-dimensional similarity
  • The Pinecone index setup and upsert pipeline for 17K+ records
  • How to query, score, and rank candidates in real time
  • Production optimizations including namespace sharding and batched re-indexing

The Problem: Why Keyword Matching Fails for People

Before getting into Pinecone, it is worth understanding why the naive approach breaks down. This is not an academic point. It cost us three months of user churn before we fixed it.

A professional on onSpark fills out a voice onboarding (handled by Vapi.ai) that asks four questions:

  1. What is your professional background?
  2. What are you working on right now?
  3. What kind of partnerships are you looking for?
  4. What can you offer a partner?

Those answers get transcribed into a structured profile. The challenge: two people can describe the exact same professional reality in completely different language.

User A: "I'm a go-to-market lead with a background in PLG. Looking for a technical co-founder who can ship fast."

User B: "Sales and growth is my thing. Built two SaaS products to ramp. Want an engineering partner who moves quickly."

A keyword system gives these users near-zero similarity. A vector similarity model gives them a cosine distance of roughly 0.07, extremely close. That difference is the entire product.

The second problem was intent asymmetry. Filtering lets you match "looking for X" against "offers X" in a rigid table-join way. But real partnership intent lives on a spectrum. Someone offering "strategic introductions to investors" is also a match for someone looking for "fundraising support." Semantic similarity captures that; keyword filtering does not.

How Vector Search Fixes This

Vector search converts text into a high-dimensional numeric representation where semantic distance maps to geometric distance. Texts with similar meaning end up near each other in that space regardless of the specific words used.

The process is:

  1. Pass profile text through an embedding model (we use text-embedding-3-large)
  2. The model returns a 3072-dimensional float array
  3. Store that vector in Pinecone alongside a profile ID and metadata
  4. At query time, embed the requesting user's profile the same way
  5. Pinecone returns the nearest neighbors by cosine similarity in milliseconds

For partner matching, this means we are finding profiles that are semantically compatible - not just syntactically similar. That is the core unlock.
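To make the geometry concrete, here is a minimal cosine similarity function (illustrative only; Pinecone computes this server-side). Note that scaling a vector does not change the result, which is part of why a verbose profile and a terse one with the same meaning still score as close:

```typescript
// Illustrative only: Pinecone computes similarity server-side.
// Cosine similarity = dot(a, b) / (|a| * |b|), ranging from -1 to 1.
export function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("Vectors must have the same dimension");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Scaling a vector leaves cosine similarity unchanged:
// [1, 2] vs [2, 4] scores exactly 1.0.
```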

Vector Matching Pipeline

From raw profile text to ranked match candidates:

  1. Profile Text - voice transcript + structured fields
  2. OpenAI Embed - 3072-dim float array
  3. Pinecone Index - 17K+ vectors, cosine metric
  4. Ranked Matches - top-K with scores in <120ms

Match acceptance rate: 23% → 61% after migration

Pinecone Setup

Creating the Index

Start with the Pinecone dashboard or their SDK. For onSpark I use the SDK so the index configuration is version-controlled and reproducible.

// lib/pinecone/client.ts
import { Pinecone } from "@pinecone-database/pinecone";
 
const PINECONE_INDEX_NAME = "onspark-profiles";
const EMBEDDING_DIMENSION = 3072; // text-embedding-3-large
 
let _pineconeClient: Pinecone | null = null;
 
export function getPineconeClient(): Pinecone {
  if (!_pineconeClient) {
    _pineconeClient = new Pinecone({
      apiKey: process.env.PINECONE_API_KEY!,
    });
  }
  return _pineconeClient;
}
 
export async function ensureIndex(): Promise<void> {
  const client = getPineconeClient();
  const existing = await client.listIndexes();
  const exists = existing.indexes?.some(
    (idx) => idx.name === PINECONE_INDEX_NAME
  );
 
  if (!exists) {
    await client.createIndex({
      name: PINECONE_INDEX_NAME,
      dimension: EMBEDDING_DIMENSION,
      metric: "cosine",
      spec: {
        serverless: {
          cloud: "aws",
          region: "us-east-1",
        },
      },
    });
 
    // Wait for index to be ready (cap retries so a stuck
    // provisioning state cannot hang the process forever)
    let ready = false;
    let attempts = 0;
    while (!ready && attempts < 30) {
      await new Promise((resolve) => setTimeout(resolve, 2000));
      const description = await client.describeIndex(PINECONE_INDEX_NAME);
      ready = description.status?.ready ?? false;
      attempts += 1;
    }
    if (!ready) {
      throw new Error(`Index ${PINECONE_INDEX_NAME} not ready after 60s`);
    }
  }
}
 
export function getIndex() {
  return getPineconeClient().index(PINECONE_INDEX_NAME);
}

A few decisions worth explaining here:

Cosine metric over Euclidean: For text embeddings, cosine similarity is the right choice. It measures the angle between vectors rather than their absolute distance, which means two profiles with similar content but different verbosity end up correctly close to each other. Euclidean distance penalizes longer text.

Serverless over pod-based: At 17K records we are well within serverless pricing efficiency. Pod-based indexes make sense above roughly 1M vectors or when you need guaranteed query latency SLAs. For our workload, serverless gives p99 under 150ms and costs a fraction of provisioned pods.

Dimension 3072: This is the native output dimension of text-embedding-3-large. You can request a smaller dimension via the API (useful for reducing storage costs), but we found the full dimension meaningfully improved match quality - worth the extra ~$0.40/day in storage at our scale.
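If you do opt for a smaller dimension, OpenAI's shortened embeddings are (per their docs) equivalent to truncating the full vector and re-normalizing. A sketch of that post-hoc truncation, useful for A/B-testing dimension sizes against an already-indexed corpus without re-calling the API - treat it as an approximation and validate against fresh API output before relying on it:

```typescript
// Truncate an embedding to `dim` components and L2-normalize,
// mirroring how text-embedding-3 models shorten embeddings.
// Sketch only - verify against real API output for your model.
export function truncateEmbedding(vec: number[], dim: number): number[] {
  if (dim <= 0 || dim > vec.length) {
    throw new Error(`dim must be in (0, ${vec.length}]`);
  }
  const sliced = vec.slice(0, dim);
  const norm = Math.sqrt(sliced.reduce((sum, x) => sum + x * x, 0));
  return sliced.map((x) => x / norm);
}
```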

Embedding Strategy

This is where most vector search implementations get it wrong. If you just concatenate all profile fields and embed the resulting string, you lose structural signal. The embedding model compresses everything into a single point, and the context windows of different fields compete with each other.

For onSpark, we use a composite embedding strategy: we build a single rich string, but we structure it so the most semantically important content appears first and with natural framing. Transformers pay more attention to earlier tokens.

Building the Profile Document

// lib/pinecone/profile-document.ts
 
export interface ProfileFields {
  background: string;
  currentProject: string;
  partnershipGoal: string;
  offering: string;
  industry: string;
  stage: string; // "idea" | "pre-seed" | "seed" | "series-a" | "growth"
  location: string;
}
 
export function buildProfileDocument(fields: ProfileFields): string {
  // Order matters: most semantically important first
  const sections = [
    `Partnership goal: ${fields.partnershipGoal}`,
    `Offering: ${fields.offering}`,
    `Current project: ${fields.currentProject}`,
    `Professional background: ${fields.background}`,
    `Industry: ${fields.industry}`,
    `Stage: ${fields.stage}`,
    `Location: ${fields.location}`,
  ];
 
  return sections.join(". ");
}
 
// Example output:
// "Partnership goal: Looking for a technical co-founder to build my SaaS MVP.
//  Offering: Go-to-market strategy, investor intros, B2B sales expertise.
//  Current project: Building an AI scheduling tool for healthcare.
//  Professional background: 8 years in enterprise SaaS sales at Oracle and Salesforce.
//  Industry: healthcare technology. Stage: pre-seed. Location: New York."

Generating Embeddings

// lib/pinecone/embeddings.ts
import OpenAI from "openai";
 
const EMBEDDING_MODEL = "text-embedding-3-large";
const BATCH_SIZE = 100; // OpenAI limit per request
 
let _openai: OpenAI | null = null;
 
function getOpenAI(): OpenAI {
  if (!_openai) {
    _openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY! });
  }
  return _openai;
}
 
export async function embedText(text: string): Promise<number[]> {
  const openai = getOpenAI();
  const response = await openai.embeddings.create({
    model: EMBEDDING_MODEL,
    input: text,
  });
  return response.data[0].embedding;
}
 
export async function embedBatch(texts: string[]): Promise<number[][]> {
  const openai = getOpenAI();
  const batches: string[][] = [];
 
  for (let i = 0; i < texts.length; i += BATCH_SIZE) {
    batches.push(texts.slice(i, i + BATCH_SIZE));
  }
 
  const results: number[][] = [];
 
  for (const batch of batches) {
    const response = await openai.embeddings.create({
      model: EMBEDDING_MODEL,
      input: batch,
    });
 
    const sorted = response.data.sort((a, b) => a.index - b.index);
    results.push(...sorted.map((item) => item.embedding));
  }
 
  return results;
}

One important detail: the OpenAI embeddings API returns results in an arbitrary order when you submit a batch. The sort by index ensures the output array aligns with the input array. Skipping this causes a subtle bug where profiles get wrong embeddings - it took us two days to trace during our initial bulk indexing.
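That realignment is worth isolating as a pure function so it can be unit-tested on its own. A minimal stand-in for the API's response items (the real items also carry `object` and usage data):

```typescript
// Re-align batch embedding results with their input order using the
// `index` field each response item carries. Minimal item shape for
// illustration - the real API response includes more fields.
interface EmbeddingItem {
  index: number;
  embedding: number[];
}

export function alignToInputOrder(items: EmbeddingItem[]): number[][] {
  return [...items]
    .sort((a, b) => a.index - b.index)
    .map((item) => item.embedding);
}
```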

Indexing Profiles

Upsert Pipeline

// lib/pinecone/indexer.ts
import { getIndex } from "./client";
import { buildProfileDocument, type ProfileFields } from "./profile-document";
import { embedText, embedBatch } from "./embeddings";
 
export interface ProfileRecord {
  id: string;
  fields: ProfileFields;
  metadata: {
    userId: string;
    displayName: string;
    industry: string;
    stage: string;
    location: string;
    createdAt: string;
    updatedAt: string;
    isActive: boolean;
  };
}
 
export async function upsertProfile(record: ProfileRecord): Promise<void> {
  const document = buildProfileDocument(record.fields);
  const embedding = await embedText(document);
  const index = getIndex();
 
  await index.upsert([
    {
      id: record.id,
      values: embedding,
      metadata: record.metadata,
    },
  ]);
}
 
export async function upsertProfiles(
  records: ProfileRecord[]
): Promise<{ indexed: number; failed: number }> {
  const UPSERT_BATCH_SIZE = 100; // Pinecone limit per upsert call
  let indexed = 0;
  let failed = 0;
 
  // Build all documents
  const documents = records.map((r) => buildProfileDocument(r.fields));
 
  // Embed in batches
  const embeddings = await embedBatch(documents);
 
  // Prepare vectors
  const vectors = records.map((record, i) => ({
    id: record.id,
    values: embeddings[i],
    metadata: record.metadata,
  }));
 
  // Upsert to Pinecone in batches
  const index = getIndex();
  for (let i = 0; i < vectors.length; i += UPSERT_BATCH_SIZE) {
    const batch = vectors.slice(i, i + UPSERT_BATCH_SIZE);
 
    try {
      await index.upsert(batch);
      indexed += batch.length;
    } catch (error) {
      console.error(
        `Upsert failed for batch starting at index ${i}:`,
        error
      );
      failed += batch.length;
    }
  }
 
  return { indexed, failed };
}
 
export async function deleteProfile(profileId: string): Promise<void> {
  const index = getIndex();
  await index.deleteOne(profileId);
}

Triggering Indexing on Profile Updates

In production, profile indexing happens in three places:

  1. Vapi webhook - after voice onboarding completes, the transcript is processed and the new profile is indexed immediately
  2. Profile edit API - whenever a user updates fields, we re-embed and upsert
  3. Nightly batch job - catches any drift, re-indexes all profiles modified in the last 24 hours

// app/api/webhooks/vapi/route.ts (simplified)
import { NextRequest, NextResponse } from "next/server";
import { processOnboardingTranscript } from "@/lib/onboarding/processor";
import { upsertProfile } from "@/lib/pinecone/indexer";
import { saveProfileToDatabase } from "@/lib/db/profiles";
 
export async function POST(request: NextRequest): Promise<NextResponse> {
  const body = await request.json();
 
  if (body.message?.type !== "end-of-call-report") {
    return NextResponse.json({ status: "ignored" });
  }
 
  const { transcript, call } = body.message;
  const userId = call.metadata?.userId;
 
  if (!userId) {
    return NextResponse.json({ error: "Missing userId in metadata" }, { status: 400 });
  }
 
  // Extract structured fields from voice transcript
  const fields = await processOnboardingTranscript(transcript);
 
  // Save to database
  const profile = await saveProfileToDatabase(userId, fields);
 
  // Index in Pinecone asynchronously (do not block the webhook response)
  upsertProfile({
    id: profile.id,
    fields,
    metadata: {
      userId,
      displayName: profile.displayName,
      industry: fields.industry,
      stage: fields.stage,
      location: fields.location,
      createdAt: profile.createdAt,
      updatedAt: new Date().toISOString(),
      isActive: true,
    },
  }).catch((err) => {
    console.error(`Pinecone upsert failed for user ${userId}:`, err);
  });
 
  return NextResponse.json({ status: "ok" });
}
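The nightly batch job from path 3 is not shown above. Here is a hedged sketch of its core: the 24-hour selection filter as a pure function, plus the orchestration it would feed (getAllActiveProfiles and upsertProfiles are the modules shown elsewhere in this post; the cron wiring is assumed, not our exact production script):

```typescript
// scripts/nightly-reindex.ts (sketch - assumed wiring)
// Selects profiles touched within `windowHours` for re-indexing.
interface ReindexableProfile {
  id: string;
  metadata: { updatedAt: string };
}

export function modifiedWithin<T extends ReindexableProfile>(
  profiles: T[],
  windowHours: number,
  now: number = Date.now()
): T[] {
  const cutoff = now - windowHours * 60 * 60 * 1000;
  return profiles.filter(
    (p) => new Date(p.metadata.updatedAt).getTime() >= cutoff
  );
}

// Orchestration (runs nightly via cron):
//   const all = await getAllActiveProfiles();
//   const drifted = modifiedWithin(all, 24);
//   const { indexed, failed } = await upsertProfiles(drifted);
//   console.log(`Nightly reindex: ${indexed} indexed, ${failed} failed`);
```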

Indexing Triggers

Three paths keep the Pinecone index current:

  1. Voice onboarding (real-time): Vapi call ends → transcript processed → upsert fired async
  2. Profile edit API (on change): user saves changes → DB write completes → re-embed and upsert
  3. Nightly batch job (2AM UTC): fetch profiles modified in last 24h → bulk embed + upsert → log drift reports

Querying for Matches

The Core Match Query

// lib/pinecone/matcher.ts
import { getIndex } from "./client";
import { buildProfileDocument, type ProfileFields } from "./profile-document";
import { embedText } from "./embeddings";
 
export interface MatchCandidate {
  profileId: string;
  score: number;
  metadata: {
    userId: string;
    displayName: string;
    industry: string;
    stage: string;
    location: string;
    updatedAt: string; // read later by the scoring layer's freshness boost
    isActive: boolean;
  };
}
 
export interface MatchQueryOptions {
  topK?: number;
  minScore?: number;
  filterIndustry?: string;
  filterStage?: string[];
  excludeProfileIds?: string[];
}
 
export async function findMatches(
  queryFields: ProfileFields,
  queryProfileId: string,
  options: MatchQueryOptions = {}
): Promise<MatchCandidate[]> {
  const {
    topK = 20,
    minScore = 0.72,
    filterIndustry,
    filterStage,
    excludeProfileIds = [],
  } = options;
 
  const document = buildProfileDocument(queryFields);
  const queryEmbedding = await embedText(document);
 
  // Build metadata filter
  const filter: Record<string, unknown> = {
    isActive: { $eq: true },
  };
 
  if (filterIndustry) {
    filter.industry = { $eq: filterIndustry };
  }
 
  if (filterStage && filterStage.length > 0) {
    filter.stage = { $in: filterStage };
  }
 
  const index = getIndex();
 
  // Request more than topK to account for excluded IDs
  const fetchCount = topK + excludeProfileIds.length + 10;
 
  const response = await index.query({
    vector: queryEmbedding,
    topK: fetchCount,
    includeMetadata: true,
    filter,
  });
 
  const excluded = new Set([queryProfileId, ...excludeProfileIds]);
 
  const candidates = (response.matches ?? [])
    .filter((match) => !excluded.has(match.id))
    .filter((match) => (match.score ?? 0) >= minScore)
    .slice(0, topK)
    .map((match) => ({
      profileId: match.id,
      score: match.score ?? 0,
      metadata: match.metadata as MatchCandidate["metadata"],
    }));
 
  return candidates;
}

Why We Set minScore at 0.72

This was empirically derived over several weeks of A/B testing. Here is what the score distribution looked like across 5,000 sampled queries:

Score range | Interpretation           | User acceptance rate
0.90+       | Near-duplicate profiles  | 82% (often too similar, low novelty)
0.80–0.90   | Highly compatible        | 74%
0.72–0.80   | Strong compatibility     | 61% (sweet spot)
0.62–0.72   | Moderate match           | 31%
Below 0.62  | Weak or noise            | 9%

The acceptance rate at 0.72–0.80 is actually higher than 0.90+ because novelty matters. Two founders with very similar backgrounds do not add as much value to each other as two complementary professionals in adjacent spaces. The sweet spot captures "highly compatible but distinct enough to be useful."

Scoring and Ranking

Raw Pinecone scores are a good start, but they are not the final ranking signal. We apply a post-processing scoring layer that incorporates three additional factors.

The Composite Score

// lib/matching/scorer.ts
import type { MatchCandidate } from "../pinecone/matcher";
 
interface ScoringContext {
  queryUserId: string;
  queryStage: string;
  queryLocation: string;
  previouslyShownProfileIds: Set<string>;
  previouslyDeclinedProfileIds: Set<string>;
}
 
export interface RankedMatch {
  profileId: string;
  userId: string;
  displayName: string;
  compositeScore: number;
  vectorScore: number;
  freshnessBoost: number;
  locationBoost: number;
  noveltyPenalty: number;
}
 
const WEIGHT_VECTOR = 0.7;
const WEIGHT_FRESHNESS = 0.15;
const WEIGHT_LOCATION = 0.15;
 
export function rankMatches(
  candidates: MatchCandidate[],
  context: ScoringContext
): RankedMatch[] {
  const now = Date.now();
 
  const ranked = candidates.map((candidate) => {
    const vectorScore = candidate.score;
 
    // Freshness boost: profiles updated in last 30 days get a lift
    const updatedAt = new Date(
      (candidate.metadata as Record<string, string>).updatedAt
    ).getTime();
    const daysSinceUpdate = (now - updatedAt) / (1000 * 60 * 60 * 24);
    const freshnessBoost = Math.max(0, 1 - daysSinceUpdate / 30);
 
    // Location boost: same city gets a small signal
    const sameLocation =
      candidate.metadata.location === context.queryLocation;
    const locationBoost = sameLocation ? 1 : 0;
 
    // Novelty penalty: profiles shown before but not acted on get downranked
    const wasShown = context.previouslyShownProfileIds.has(candidate.profileId);
    const wasDeclined = context.previouslyDeclinedProfileIds.has(
      candidate.profileId
    );
    const noveltyPenalty = wasDeclined ? 0.5 : wasShown ? 0.85 : 1.0;
 
    const compositeScore =
      (WEIGHT_VECTOR * vectorScore +
        WEIGHT_FRESHNESS * freshnessBoost +
        WEIGHT_LOCATION * locationBoost) *
      noveltyPenalty;
 
    return {
      profileId: candidate.profileId,
      userId: candidate.metadata.userId,
      displayName: candidate.metadata.displayName,
      compositeScore,
      vectorScore,
      freshnessBoost,
      locationBoost,
      noveltyPenalty,
    };
  });
 
  return ranked.sort((a, b) => b.compositeScore - a.compositeScore);
}

The three additional signals:

Freshness (15%): A profile updated two days ago is more likely to reflect current intent than one untouched for eight months. This is especially true in early-stage startup contexts where people pivot frequently.

Location (15%): Same-city matches convert to actual meetings at 2.3x the rate of remote-only matches, based on our data. We give a modest boost but do not filter out remote candidates.

Novelty penalty: This is the most important signal after the vector score. If a user has been shown a profile five times and never connected, showing it again is noise. Declined matches get a 50% penalty; previously shown but not acted on get a 15% penalty.

Composite Scoring Breakdown

How raw vector scores become final rankings:

  • Vector Similarity (70%) - Pinecone cosine score
  • Profile Freshness (15%) - days since last update
  • Location Signal (15%) - same city = +1.0

The weighted sum is then multiplied by the novelty factor: never shown = 1.0x; shown, no action = 0.85x; declined = 0.5x. The final ranking drives the match feed.

Production Optimizations

At 1,000 users, none of this matters much. At 17,000 users with hundreds of queries per minute during peak hours, small inefficiencies compound fast. These are the four optimizations that kept us below budget and below 120ms p95 latency.

1. Namespace Sharding by Stage

Pinecone namespaces let you partition an index. We shard by company stage, so an "idea-stage" founder only queries against the namespaced subset of profiles that declared themselves open to early-stage partnerships.

// lib/pinecone/namespace-strategy.ts
 
const STAGE_NAMESPACE_MAP: Record<string, string> = {
  idea: "stage-early",
  "pre-seed": "stage-early",
  seed: "stage-growth",
  "series-a": "stage-growth",
  growth: "stage-scale",
  enterprise: "stage-scale",
};
 
export function getNamespaceForStage(stage: string): string {
  return STAGE_NAMESPACE_MAP[stage] ?? "stage-growth";
}
 
export function getQueryNamespaces(stage: string): string[] {
  // Return own namespace plus adjacent ones for cross-stage matching
  const own = getNamespaceForStage(stage);
 
  const adjacent: Record<string, string[]> = {
    "stage-early": ["stage-early", "stage-growth"],
    "stage-growth": ["stage-early", "stage-growth", "stage-scale"],
    "stage-scale": ["stage-growth", "stage-scale"],
  };
 
  return adjacent[own] ?? [own];
}
 
// Usage in upsert
export async function upsertProfileWithNamespace(
  record: ProfileRecord
): Promise<void> {
  const namespace = getNamespaceForStage(record.fields.stage);
  const index = getIndex().namespace(namespace);
  // ... rest of upsert
}

This alone cut average query time by 38% because each query scans a smaller vector space. The "adjacent namespaces" logic ensures that a seed-stage founder can still match with a pre-seed founder - you query multiple namespaces and merge the results.
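The merge step deserves a sketch of its own. One way to combine the per-namespace result sets - deduplicating by profile ID (keeping the best score) and re-sorting before the rest of the pipeline runs. The match shape mirrors MatchCandidate from earlier; the surrounding per-namespace query loop is assumed:

```typescript
// Merge candidates returned by several namespace queries into one
// ranked list: dedupe by profileId (keep the highest score), sort
// by score descending, cap at topK. Sketch only - the per-namespace
// Pinecone query calls that produce `resultSets` are assumed.
interface NamespaceMatch {
  profileId: string;
  score: number;
}

export function mergeNamespaceResults<T extends NamespaceMatch>(
  resultSets: T[][],
  topK: number
): T[] {
  const best = new Map<string, T>();
  for (const results of resultSets) {
    for (const match of results) {
      const existing = best.get(match.profileId);
      if (!existing || match.score > existing.score) {
        best.set(match.profileId, match);
      }
    }
  }
  return [...best.values()]
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```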

2. Embedding Cache

Generating embeddings costs money and adds latency. Profile embeddings rarely change, so we cache them in Redis with a TTL keyed to the profile's updatedAt timestamp.

// lib/pinecone/embedding-cache.ts
import { Redis } from "@upstash/redis";
 
const redis = Redis.fromEnv();
const CACHE_TTL_SECONDS = 60 * 60 * 24 * 7; // 7 days
 
function embedCacheKey(profileId: string, updatedAt: string): string {
  return `embed:${profileId}:${updatedAt}`;
}
 
export async function getOrCreateEmbedding(
  profileId: string,
  updatedAt: string,
  document: string,
  generateFn: (text: string) => Promise<number[]>
): Promise<number[]> {
  const cacheKey = embedCacheKey(profileId, updatedAt);
 
  const cached = await redis.get<number[]>(cacheKey);
  if (cached) {
    return cached;
  }
 
  const embedding = await generateFn(document);
 
  await redis.setex(cacheKey, CACHE_TTL_SECONDS, embedding);
 
  return embedding;
}

Cache hit rate in production: 94%. Most queries come from users checking their match feed, which retrieves existing profiles rather than re-embedding them. This brings our OpenAI embedding costs down by roughly 15x compared to computing fresh on every query.

3. Stale-While-Revalidate Match Cache

The match feed itself is cached per user with a short TTL. Pinecone queries are fast, but when 500 users check their feed at 9AM, the concurrent load adds up.

// lib/matching/match-cache.ts
import { Redis } from "@upstash/redis";
import type { RankedMatch } from "./scorer";
 
const redis = Redis.fromEnv();
const MATCH_CACHE_TTL = 60 * 30; // 30 minutes
const STALE_THRESHOLD = 60 * 20; // Revalidate if older than 20 minutes
 
interface CachedMatches {
  matches: RankedMatch[];
  cachedAt: number;
}
 
export async function getMatchesWithCache(
  userId: string,
  computeFn: () => Promise<RankedMatch[]>
): Promise<RankedMatch[]> {
  const cacheKey = `matches:${userId}`;
  const cached = await redis.get<CachedMatches>(cacheKey);
 
  if (cached) {
    const age = (Date.now() - cached.cachedAt) / 1000;
 
    if (age < STALE_THRESHOLD) {
      // Fresh: return immediately
      return cached.matches;
    }
 
    if (age < MATCH_CACHE_TTL) {
      // Stale: return stale data, revalidate in background
      computeFn().then(async (fresh) => {
        await redis.setex(
          cacheKey,
          MATCH_CACHE_TTL,
          { matches: fresh, cachedAt: Date.now() }
        );
      }).catch(console.error);
 
      return cached.matches;
    }
  }
 
  // Cache miss or expired: compute synchronously
  const matches = await computeFn();
  await redis.setex(
    cacheKey,
    MATCH_CACHE_TTL,
    { matches, cachedAt: Date.now() }
  );
 
  return matches;
}

4. Batched Re-indexing on Schema Changes

When we change the buildProfileDocument function - adding a new field or reordering sections - all 17K embeddings become stale because they were computed from an older document structure. We re-index the full corpus in background batches to avoid blocking the main API.

// scripts/reindex-all-profiles.ts
import { getAllActiveProfiles } from "../lib/db/profiles";
import { upsertProfiles } from "../lib/pinecone/indexer";
 
const BATCH_SIZE = 500;
const DELAY_BETWEEN_BATCHES_MS = 2000; // Rate limit headroom
 
async function reindexAllProfiles(): Promise<void> {
  const profiles = await getAllActiveProfiles();
  console.log(`Starting reindex for ${profiles.length} profiles`);
 
  let processed = 0;
  let failed = 0;
 
  for (let i = 0; i < profiles.length; i += BATCH_SIZE) {
    const batch = profiles.slice(i, i + BATCH_SIZE);
 
    const result = await upsertProfiles(batch);
    processed += result.indexed;
    failed += result.failed;
 
    console.log(
      `Progress: ${processed}/${profiles.length} indexed, ${failed} failed`
    );
 
    if (i + BATCH_SIZE < profiles.length) {
      await new Promise((resolve) =>
        setTimeout(resolve, DELAY_BETWEEN_BATCHES_MS)
      );
    }
  }
 
  console.log(`Reindex complete. Indexed: ${processed}, Failed: ${failed}`);
}
 
reindexAllProfiles().catch(console.error);

A full reindex of 17K profiles takes about 12 minutes and costs roughly $4.80 in OpenAI embedding calls at current pricing. We run it during off-peak hours when we deploy schema changes, and the delay between batches prevents us from hitting OpenAI's rate limits on the embeddings endpoint.
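That cost figure is easy to sanity-check. A back-of-envelope estimator with assumed inputs (~2,100 tokens per profile document and $0.13 per 1M tokens for text-embedding-3-large are my estimates; check your own documents and OpenAI's current pricing page):

```typescript
// Rough reindex cost estimator. Both the average token count per
// profile and the per-million-token price are assumptions - verify
// against your own corpus and current OpenAI pricing.
export function estimateReindexCostUsd(
  profileCount: number,
  avgTokensPerProfile: number,
  pricePerMillionTokens: number
): number {
  const totalTokens = profileCount * avgTokensPerProfile;
  return (totalTokens / 1_000_000) * pricePerMillionTokens;
}

// estimateReindexCostUsd(17_400, 2_100, 0.13) ≈ $4.75,
// in the same ballpark as the ~$4.80 we observe in practice.
```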

Results and Metrics

After running this system in production for eight months, here are the honest numbers.

Scale:

  • 17,400 active profiles indexed in Pinecone
  • ~2,800 match queries per day at peak
  • Full index size: ~420MB in Pinecone serverless

Match Quality (compared to previous keyword system):

  • Match acceptance rate: 23% → 61% (+165%)
  • "No matches found" rate: 34% → 4% (the old system returned empty results for niche profiles)
  • User-reported match quality (5-star NPS): 2.9 → 4.1

Performance:

  • p50 query latency: 68ms (Pinecone only)
  • p95 query latency: 114ms (Pinecone only)
  • Full match feed with scoring: p95 at 210ms (includes Redis cache check, scoring layer)
  • Cache hit rate (match feed): 71%

Costs (monthly):

  • Pinecone serverless: $38/month
  • OpenAI embeddings (with cache): $22/month
  • Redis (Upstash): $9/month
  • Total infrastructure for matching: ~$69/month at 17K users

That is $0.004 per active user per month for the full matching infrastructure. We had originally budgeted $300/month based on naive per-query cost estimates. The embedding cache was the single biggest lever.

What surprised us:

The novelty penalty had a larger impact than expected. Before adding it, users were silently churning because they kept seeing the same 10 profiles. After the novelty decay, session-to-session retention on the match feed improved by 28%. Showing fewer but more varied matches turned out to matter more than showing the highest-scoring matches repeatedly.

The other surprise was how much profile completeness affected match quality. A fully filled-out profile gets 3.5x better match scores on average than a sparse one. We now show an in-app completeness score and saw a 40% reduction in low-quality queries after users understood the connection.

FAQ

Q: Why not use Pinecone's hybrid search with sparse vectors?
A: We experimented with it. For our use case - long-form professional descriptions - dense embeddings alone outperformed hybrid search. Hybrid is more valuable when you have short queries against long documents, like e-commerce product search. Two long profile texts matching against each other is the sweet spot for pure dense retrieval.

Q: How do you handle profiles in languages other than English?
A: text-embedding-3-large is multilingual. In practice, most onSpark profiles are in English or French. We embed them as-is and the similarity still works across languages with a small quality degradation (roughly 0.05 lower scores for cross-language matches in our tests). We plan to add language metadata filtering for a future version.

Q: What happens when someone's profile changes dramatically?
A: The re-embed on save triggers immediately. The new embedding overwrites the old one in Pinecone via upsert. The match cache for that user is invalidated on their next request. The only window where stale matches are shown is within the 30-minute match cache TTL, which we accept as a reasonable tradeoff.

Q: Could you do this without Pinecone - just pgvector in Postgres?
A: Yes, and for under 50K profiles I would seriously consider it. pgvector with an IVFFlat or HNSW index handles this scale fine. We chose Pinecone for the managed infrastructure, metadata filtering, and namespace support. If you are on Supabase, pgvector with vector_cosine_ops would work well and reduce operational surface area.

Q: What is the minimum viable version of this?
A: Three functions: embedText, upsertProfile, and findMatches. You can ship a working prototype in a day. Start with text-embedding-3-small (lower cost, 1536 dimensions) and upgrade if match quality is not good enough. Most of the complexity in this post is optimization work that comes later.

Conclusion

Vector search transformed onSpark's matching from a frustrating, keyword-constrained experience into one that understands professional intent at a semantic level. The 17K-profile scale we operate at today was not the hard part - the hard part was figuring out that keyword overlap is the wrong abstraction for matching people, and that cosine similarity of good embeddings gets you much closer to what users actually mean.

The implementation is straightforward. Pinecone's SDK is well-documented, OpenAI's embeddings API is reliable, and the composite scoring layer on top adds the human signals (freshness, location, novelty) that pure vector similarity misses. The total infrastructure cost at our current scale is under $70/month - a rounding error compared to the product value.

If you are building any kind of matching, recommendation, or similarity search product, vector search is the right foundation. The ecosystem has matured to the point where the operational burden is minimal.


Building a matching or recommendation product and want to talk through the architecture? Get in touch and I can help you figure out whether vector search is the right fit for your specific use case.

Related Posts:

  • [Building Production AI Voice Agents with Vapi.ai (2026 Complete Guide)]
  • [Firebase vs Supabase: The Definitive Comparison for Startups (2026)]
  • [MVP Architecture Patterns That Scale]