
#120 — Choosing distance metrics for vectors in AI

October 8, 2025 · 5 min read

Why distance metrics matter:

  • "Distance metrics" are mathematical tools to measure similarity between things represented by numbers—in AI, these are called "vectors" or "embeddings." These lists of numbers encode context, meaning, or behavior. Everything from text to users to locations can be represented this way.
  • Imagine you’re using a dating app, searching for music, or getting friend recommendations. These systems aren’t looking for “literal matches”—they’re matching things that are close enough to your interests (i.e., they're looking for "semantic matches"). That’s where distance metrics (ways of measuring similarity) enter the picture.

    Semantic search: When you search for “cheese,” you don’t just want exact matches like “cheeseburger” or “cheesecake.” You may also want things related in meaning, like “parmesan,” or frequently paired with it, like “wine.”

  • Not just an ML concept: Whenever your product needs “fuzzy matching” (semantic search, recommendations, deduplication, anomaly detection), you’re implicitly relying on “distance metrics” to compute similarity—not just exact matches.

1. Core Problem: Defining ‘Similarity’

  • Technical challenge: It’s easy to define “exact match”; similarity is always contextual and hard to formalize. Choosing the right metric means better user results, lower infrastructure bills, and fewer edge case headaches.

    Does “similar” mean “almost the same,” “sort of related,” or “happens together a lot?” In engineering, there’s no universal answer—just tradeoffs. Think Tinder matches vs Spotify recommendations—each needs a different way to score 'similarity.' Picking the right similarity tool makes apps smarter, faster, and cheaper to run.

2. Dimensionality: Why All These Numbers?

  • Embeddings shrink complexity: Instead of encoding millions of possible features, models reduce input to hundreds/thousands of dimensions (e.g. 768 numbers for text, others for images, etc.). Each dimension encodes subtle relationships (“cheese” is closer to “wine” than “electromagnetism”—your vector math exposes these semantic links).

    Think of a vector as a super-detailed profile: not just age or city, but a fingerprint of connections—“loves jazz,” “weekend hiker,” “prefers quiet cafes.” AI packs hundreds of these traits into a list of numbers, shrinking messy reality into something computers can use.
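
To make this concrete, here's a minimal sketch of turning text into a vector, assuming the open-source sentence-transformers library and its all-MiniLM-L6-v2 model (the model choice and output size here are illustrative; dimensions vary by model):

```python
# Minimal sketch (assumes `pip install sentence-transformers`).
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small open-source text encoder

embedding = model.encode("cheese")  # a NumPy array of floats
print(embedding.shape)  # (384,) for this model; others use 768, 1536, etc.
print(embedding[:5])    # each number encodes one learned slice of "cheese"-ness
```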

3. The Most Common Distance Metrics (and When to Use Them)

| Metric | What It Means & When It’s Used | Pros | Cons/Warnings |
| --- | --- | --- | --- |
| Euclidean (L2) | “Straight line” between two points. “As the crow flies” is great if you’re finding the shortest path on a map, less so for complex data. Rarely used at scale in ML. | Intuitive | Expensive; sensitive to outliers |
| Manhattan (L1) | “Block by block”: Pac-Man, city grids. “Walking city blocks” is simple but rarely used with AI embeddings; occasionally useful for sparse features. | Easy to compute | Rare in modern embeddings |
| Cosine | Measures direction, not distance: “how similar is the flavor profile?” The most popular choice for text/image embeddings; it’s less about absolute numbers and more about direction. | Captures “concept”/context | Slightly more to compute than a plain dot product |
| Inner/dot product | Measures “alignment,” like checking if two arrows point the same way. Blazing fast, but your data needs a tune-up first. For normalized vectors (see below), identical to cosine. | Blazing fast; ideal for large searches | Only meaningful if vectors are normalized |
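
All four metrics reduce to a few lines of NumPy. A minimal sketch on two toy vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 1.0, 4.0])

euclidean = np.linalg.norm(a - b)   # L2: straight-line distance
manhattan = np.sum(np.abs(a - b))   # L1: block-by-block distance
dot = np.dot(a, b)                  # alignment; sensitive to vector length
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # direction only, in [-1, 1]

print(euclidean, manhattan, dot, cosine)
```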

4. Vectors & Normalization

  • What’s normalization? Scaling all vectors to have length 1 (“unit norm”). This enables using the much faster dot product in place of cosine similarity.
  • Normalization is critical: If your model or API expects normalized vectors—make sure your product normalizes inputs and outputs, or results will be garbage.

    Imagine you’re comparing recipes: you’d first scale them all so “serves one person” is the standard across the board. Normalization makes sure all vectors play by the same rules—a must for fair comparison using fast metrics.
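
Here's a minimal sketch of unit-norm scaling, and why it lets the cheap dot product stand in for cosine similarity:

```python
import numpy as np

def normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector to unit length (L2 norm of 1)."""
    return v / np.linalg.norm(v)

a, b = np.array([3.0, 4.0]), np.array([1.0, 7.0])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
fast_dot = np.dot(normalize(a), normalize(b))  # dot product of unit vectors

print(np.isclose(cosine, fast_dot))  # True: after normalization they're the same score
```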

5. Picking the Right Metric (and Not Screwing It Up)

  • Check what your vendor/model expects: BERT, CLIP, Cohere, OpenAI, Pinecone, Weaviate, and the rest typically document which metric they’re optimized for. Read the docs! (See the configuration sketch after this list.)
  • Can you change the metric? Some APIs let you choose, but you’ll get bad results if your embeddings aren’t compatible.
  • Scalability counts: Use dot product or cosine for anything requiring comparison across thousands/millions of records.

    Different AI services come with built-in preferences. It's like Keurig and Nespresso machines—each one only works with its own brand of pods, even though they both make coffee. If you try to use a Keurig pod in a Nespresso machine, or vice versa, it won’t fit and the process breaks down. Using the wrong "distance" metric is just like that: mismatch the type, and your results may be weird, slow, or expensive.
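
As one illustration, managed vector databases usually take the metric as an index-level setting. A sketch using the Pinecone Python client (the exact signature is an assumption based on the v3+ SDK; verify against current docs):

```python
# Sketch only: index creation with an explicit metric. SDK details change between
# versions, so treat the exact calls as an assumption and check Pinecone's docs.
from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key="YOUR_API_KEY")  # placeholder key

pc.create_index(
    name="products",
    dimension=1536,   # must match your embedding model's output size
    metric="cosine",  # or "dotproduct" / "euclidean"; match what your model expects
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),
)
```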

6. Edge Cases, Gotchas, and Performance

  • Outliers: Euclidean and Manhattan can explode with weird or extreme data. Cosine (and dot product on unit-normalized vectors) is generally more robust (see the sketch after this list).
  • Algorithmic complexity (speed and cost of processing): Faster metrics = cheaper bills. Always trade off “accuracy” and “speed” based on your users and infra.
  • Real-world testing: Don’t just trust the math. Test matches with actual user scenarios. Semantic relationships can surprise you.

    Not all data is neat; sometimes you get weird outliers. The right metric helps your product handle rough edges, stay fast, and deliver good results—even when things get messy.
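
A quick way to see the outlier problem: inflate one vector's magnitude and watch Euclidean distance explode while cosine barely moves. A minimal sketch:

```python
import numpy as np

def cosine_sim(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 1.0, 1.0])
b = np.array([1.0, 1.0, 1.2])  # nearly identical to `a`
b_outlier = b * 100            # same direction, extreme magnitude

print(np.linalg.norm(a - b))          # ~0.2  -> Euclidean says "very close"
print(np.linalg.norm(a - b_outlier))  # ~184  -> Euclidean says "very far"
print(cosine_sim(a, b))               # ~0.996
print(cosine_sim(a, b_outlier))       # ~0.996 -> cosine ignores scale entirely
```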

7. Decisions Checklist

  • What data types will you match? (text, images, user profiles, etc.)
  • Which embedding model will you use? (Check for recommended metric)
  • Normalize vectors if your metric (e.g., dot product) needs it
  • Benchmark metric speed/cost: run small and large dataset comparisons (see the sketch after this list)
  • Validate results with real users before rollout
  • If possible, run experiments with known similar/distant items to calibrate your metric early, not after launch
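
For the benchmarking step above, a minimal sketch timing dot product against Euclidean over a batch of random stand-in embeddings:

```python
import time
import numpy as np

rng = np.random.default_rng(0)
corpus = rng.standard_normal((100_000, 768), dtype=np.float32)  # fake embeddings
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)         # normalize once, up front
query = corpus[0]

t0 = time.perf_counter()
dot_scores = corpus @ query                         # one matrix-vector product
t1 = time.perf_counter()
l2_scores = np.linalg.norm(corpus - query, axis=1)  # full subtraction + norm per row
t2 = time.perf_counter()

print(f"dot: {t1 - t0:.4f}s  euclidean: {t2 - t1:.4f}s")
```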

Summary:

  • Right metric: Best results, happy users, scalable infra.
  • Wrong metric: Slow, expensive, inaccurate, confused users.
  • Next steps: Audit your stack, ensure every link (model → metric → infra) is aligned.

Frequently asked questions

What are the most important distance metrics for high-growth startups using AI?

For text, image, and recommendation systems, cosine similarity and dot product are the go-to metrics. These reliably scale across millions of records, stay fast, and align with output from most popular embedding models (OpenAI, Cohere, CLIP, etc.). Tech companies like Spotify use cosine similarity for music search and discovery, while e-commerce platforms rely on it for product recommendations.

How do I know which distance metric my AI model or vendor expects?

Always check your embedding model’s official documentation—Cohere, OpenAI, Pinecone, Weaviate and many more publish recommended distance metrics. Using the wrong one can drastically reduce relevance and speed. For example, OpenAI embeddings expect cosine similarity, while some open-source libraries default to Euclidean, meaning you need to normalize your vectors before comparing.

What happens if I use Euclidean distance instead of cosine similarity on vector embeddings?

You risk mismatches and slower, less relevant results. Euclidean distance treats all values equally, amplifying outlier features and commonly leading to poor matches in semantic tasks. Real-world case: founders using Euclidean in semantic search saw user complaints about 'weird' recommendations until switching to cosine similarity, which improved accuracy and reduced cloud costs by 30%.

Should I normalize vectors before searching or scoring with dot product?

Yes—if using dot product or cosine similarity, vectors should have unit length (normalize them), or results may be meaningless. Many hosted databases (Pinecone, Weaviate) can auto-normalize; DIY setups in Postgres or Elasticsearch require explicit normalization. A founder at an AI SaaS noted performance doubled after realizing the model and DB had different normalization standards.

How does metric choice impact cloud cost and infrastructure scaling?

Faster metrics like dot product and cosine similarity are cheaper to run at scale, particularly in vector databases and search APIs, because they exploit hardware acceleration and require less computation per match. Switching from Euclidean to dot product reduced matching costs by 60% for a startup running 10M queries/month in production.

Are distance metrics relevant for startups building recommendation engines?

Absolutely—distance metrics determine how products, users, and content are matched. For example, a DTC retail app saw conversion rates jump by 25% after switching its recommendation engine to cosine similarity, pulling in more relevant cross-category suggestions (e.g. matching sunglasses with travel bags based on user profiles).

How do I choose between open-source and proprietary vector search backends?

Choose open-source if you want cost control, customizability, and transparency; go proprietary if time-to-launch and scaling (global latency, SLOs) are critical. Case study: A fintech startup moved from FAISS (open-source) to Pinecone (managed) when their global search API hit 30 requests/sec, shaved two weeks off launch time, but paid 40% more monthly.

Can I mix different distance metrics in one product or stack?

It’s risky—mixing metrics can fragment relevance and cause duplication or gaps. Each data type or product feature should use the metric most aligned to its embedding model. For instance: Uber uses Haversine (geo) for location, cosine (semantic) for text matching, and keeps them separate to avoid confusing search logic and user results.
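
For the geo case, haversine is the standard great-circle formula; a minimal sketch with illustrative coordinates:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    r = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

print(haversine_km(37.7749, -122.4194, 34.0522, -118.2437))  # SF -> LA, ~559 km
```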

How do I test which metric is best for my app’s user experience?

Run A/B tests with real user scenarios, measuring match quality, speed, and user engagement. A news aggregator startup used Euclidean for months until an A/B test showed cosine increased click-through by 12%. Validate with users—math alone isn’t enough to catch semantic edge cases.

What are some failure modes for vector search and similarity matching?

Common pitfalls: forgetting normalization, ignoring vendor guidelines, using mismatched metrics, failing to benchmark user experience. A health tech startup learned the hard way—incorrect metric led to bizarre patient-doctor matches, hurting trust until they rebuilt with cosine and thorough scenario validation.

What is a distance metric in machine learning and why does it matter for startups?

A distance metric quantifies similarity between high-dimensional data points—like product features, user preferences, or text embeddings. For startups building AI search, recommendation, or deduplication features, picking the right metric means better results, faster performance, and lower infrastructure costs. Using cosine similarity instead of Euclidean distance, for example, enabled a SaaS startup to boost match accuracy and cut compute expenses by 40%.

How do I choose the best similarity metric for my AI-powered search or recommendation engine?

Review your embedding model’s documentation—text and image embeddings from OpenAI or Cohere almost always recommend cosine similarity, while some custom or open-source models work best with dot product or Euclidean. Validate your choice with user testing. In one marketplace case, founders switched from default Euclidean to cosine similarity and saw conversion rates spike due to more relevant search results.

Why is vector normalization important for AI vector search, and how do I implement it?

Normalization ensures all vectors have unit length, which is essential for cosine similarity and dot product to yield meaningful results. Most managed vector databases like Pinecone offer auto-normalization. If self-hosting, normalize each vector before storing or querying. One proptech founder avoided a costly cloud bill spike by auditing and enabling normalization in their home search API.

Are cosine similarity and dot product the same thing?

Cosine similarity and dot product are only identical when all vectors are normalized to unit length. Dot product is extremely fast, making it ideal for large-scale search and clustering if you ensure normalization. Not normalizing first can lead to poor matches—one AI recruitment app founder fixed candidate ranking issues instantly by normalizing vectors server-side.

Can distance metrics affect SEO and content personalization?

Absolutely. Smart use of distance metrics personalizes site navigation, recommendations, and even content clustering, keeping users engaged longer (a strong SEO signal). For instance, a news aggregator improved session duration by 18% after switching to cosine similarity for their article recommendation system, boosting their search rankings.

What’s the difference between Euclidean, Manhattan, and cosine metric for startup AI use cases?

Euclidean finds straight-line distance, Manhattan calculates block-by-block paths, and cosine measures angle (direction) between vectors—great for matching semantic meaning. In practice, cosine wins for most AI matching scenarios. A food delivery startup initially used Manhattan distance for menu clustering but switched to cosine after customers reported irrelevant suggestions, leading to higher order rates.

What mistakes do founders make when integrating distance metrics in vector search?

Top mistakes include skipping vector normalization, using the wrong metric for the embedding model, not reading vendor docs, and not testing real user queries. A fintech startup using Euclidean with text embeddings had to rebuild—after user feedback highlighted odd matches, switching to cosine similarity solved the issues and restored trust.

How does metric choice impact startup scaling, latency, and cloud spend?

Efficient metrics (cosine similarity, dot product) scale dramatically better for millions of matches, especially on modern hardware and distributed databases. Optimizing metrics helped a SaaS founder reduce latency by 45% on global queries and shrink their monthly bill when scaling from 100K to 10M records.

Do open-source vector search engines support all similarity metrics?

Support varies: FAISS, Milvus, and Vespa offer broad metric support (cosine, dot, Euclidean), but default behavior can differ. Always configure to match your data/model. One logistics startup improved search quality and on-call stability after switching Vespa’s default from Euclidean to cosine similarity for route matching.

How can startups benchmark and test distance metric performance effectively?

Combine A/B user experiments with quantitative metrics: measure engagement, match precision, cloud usage, and runtime. A travel AI startup’s A/B test revealed cosine increased tour match precision 11% over Euclidean without extra costs—locking in their metric choice for launch.
