
    Machine Learning in Small Use Cases: Embedding SQL Tables with Sentence Transformers

    Learn how to leverage sentence transformers for creating semantic embeddings of SQL database content, enabling powerful search and recommendation features without massive infrastructure.

    Abo Nazari
    September 28, 2026 · 12 min read

    Not every machine learning problem requires a massive GPU cluster or months of model training. Sometimes, the most elegant solutions come from applying well-established techniques to smaller, focused use cases. Today, we'll explore how sentence transformers can revolutionize the way you work with SQL databases.

    The Problem: Traditional SQL Search Limitations

    Imagine you have a product database with thousands of items. A user searches for "comfortable shoes for running." Traditional SQL queries using LIKE or full-text search might miss relevant results because:

    • Exact keyword matches are required
    • Synonyms aren't recognized ("comfortable" vs "cozy")
    • Semantic relationships are ignored ("running shoes" vs "athletic footwear")
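
    The limitation is easy to reproduce. With SQLite standing in for the real database (table and rows made up for the demo), a LIKE search for a synonym the user actually typed finds nothing, even though a semantically matching row exists:

```python
import sqlite3

# In-memory demo database with one relevant product
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (title TEXT)")
conn.execute("INSERT INTO products VALUES ('Comfortable running shoes')")

# Keyword search for the synonym "cozy" misses the relevant product
rows = conn.execute(
    "SELECT title FROM products WHERE title LIKE ?", ("%cozy%",)
).fetchall()
print(rows)  # → []
```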

    Enter Sentence Transformers

    Sentence transformers convert text into dense vector representations (embeddings) that capture semantic meaning. Similar concepts cluster together in vector space, enabling semantic search rather than just keyword matching.
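
    "Closeness in vector space" is measured with cosine similarity, which can be sketched in plain NumPy. The three-dimensional vectors below are hand-made toy values, not real model outputs; actual sentence-transformer embeddings have hundreds of dimensions:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity: near 1.0 = same direction, near 0.0 = unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d "embeddings" (illustrative values only)
running_shoes     = np.array([0.9, 0.8, 0.1])
athletic_footwear = np.array([0.8, 0.9, 0.2])
coffee_maker      = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(running_shoes, athletic_footwear))  # high
print(cosine_similarity(running_shoes, coffee_maker))       # low
```

    Related concepts end up pointing in similar directions, which is exactly what makes semantic search possible.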

    Why Sentence Transformers?

    • Pre-trained models: No need to train from scratch
    • Lightweight: Run on CPU for small to medium datasets
    • Fast: Millisecond inference times
    • Multilingual: Support for 50+ languages
    • Open source: Active community and regular updates

    Implementation Guide

    Step 1: Setup and Installation

    # Install required packages
    pip install sentence-transformers psycopg2-binary numpy
    
    # Import dependencies
    from sentence_transformers import SentenceTransformer
    import psycopg2
    import numpy as np
    

    Step 2: Initialize the Model

    # Load a pre-trained model
    # 'all-MiniLM-L6-v2' is fast and efficient for most use cases
    model = SentenceTransformer('all-MiniLM-L6-v2')
    
    # For multilingual support, use:
    # model = SentenceTransformer('paraphrase-multilingual-MiniLM-L12-v2')
    

    Step 3: Connect to Your Database

    # Database connection
    conn = psycopg2.connect(
        host="localhost",
        database="your_database",
        user="your_user",
        password="your_password"
    )
    cursor = conn.cursor()
    

    Step 4: Generate Embeddings

    # Fetch records from your table
    cursor.execute("""
        SELECT id, title, description
        FROM products
    """)
    products = cursor.fetchall()
    
    # Combine relevant text fields
    texts = [
        f"{title} {description}"
        for id, title, description in products
    ]
    
    # Generate embeddings
    embeddings = model.encode(texts, show_progress_bar=True)
    
    # Enable pgvector and add an embedding column (one-time setup)
    cursor.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cursor.execute("""
        ALTER TABLE products
        ADD COLUMN IF NOT EXISTS embedding vector(384)
    """)
    
    # Store embeddings
    # pgvector accepts the '[x, y, ...]' text form, so serialize each
    # vector with str() before binding it as a query parameter
    for (id, _, _), embedding in zip(products, embeddings):
        cursor.execute("""
            UPDATE products
            SET embedding = %s::vector
            WHERE id = %s
        """, (str(embedding.tolist()), id))
    
    conn.commit()
    

    Step 5: Implement Semantic Search

    def semantic_search(query, top_k=5):
        # Encode the query and serialize it to pgvector's text form
        query_embedding = str(model.encode(query).tolist())
    
        # <=> is pgvector's cosine-distance operator;
        # 1 - distance gives cosine similarity
        cursor.execute("""
            SELECT id, title, description,
                   1 - (embedding <=> %s::vector) AS similarity
            FROM products
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (query_embedding, query_embedding, top_k))
    
        return cursor.fetchall()
    
    # Example search
    results = semantic_search("comfortable running shoes")
    for id, title, description, similarity in results:
        print(f"[{similarity:.3f}] {title}")
    

    Real-world Use Cases

    1. E-commerce Search Enhancement

    Transform keyword searches into semantic searches that understand user intent.

    2. Document Management

    Find similar documents even when they use different terminology.

    3. Customer Support

    Match support tickets to relevant knowledge base articles automatically.

    4. Content Recommendation

    Suggest related articles, products, or content based on semantic similarity.

    5. Duplicate Detection

    Identify potential duplicate records that traditional methods might miss.
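
    The duplicate-detection case reduces to a simple rule: once every record has an embedding, near-duplicates are pairs whose cosine similarity exceeds a threshold. The embeddings and the 0.9 threshold below are illustrative placeholders:

```python
import numpy as np

def find_duplicate_pairs(embeddings, threshold=0.9):
    """Return index pairs (i, j) whose cosine similarity exceeds threshold."""
    # Normalize rows so a plain dot product equals cosine similarity
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    pairs = []
    for i in range(len(sims)):
        for j in range(i + 1, len(sims)):
            if sims[i, j] >= threshold:
                pairs.append((i, j))
    return pairs

# Toy data: rows 0 and 1 are near-duplicates, row 2 is unrelated
embeddings = np.array([
    [1.0, 0.0,  0.1],
    [1.0, 0.05, 0.1],
    [0.0, 1.0,  0.0],
])
print(find_duplicate_pairs(embeddings))  # → [(0, 1)]
```

    For large tables the O(n²) pairwise comparison should be replaced with an approximate nearest-neighbor index, as discussed under performance below.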

    Performance Considerations

    Embedding Generation

    • Batch processing: Generate embeddings in batches of 32-128 for optimal speed
    • Caching: Store embeddings rather than regenerating them
    • Incremental updates: Only generate embeddings for new/modified records
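
    The incremental-update idea can be sketched as: store a hash of the text that was embedded, and re-encode a record only when its current text hashes differently. The `encode` stub and the in-memory dictionaries below stand in for the real model and database:

```python
import hashlib

def encode(text):
    # Stand-in for model.encode(); returns a fake "embedding"
    return [float(len(text))]

embedding_store = {}   # record_id -> embedding
hash_store = {}        # record_id -> hash of the text that was embedded

def refresh_embeddings(records):
    """Re-encode only records whose text changed since the last run."""
    updated = 0
    for record_id, text in records:
        digest = hashlib.sha256(text.encode()).hexdigest()
        if hash_store.get(record_id) != digest:
            embedding_store[record_id] = encode(text)
            hash_store[record_id] = digest
            updated += 1
    return updated

print(refresh_embeddings([(1, "red shoes"), (2, "blue hat")]))  # 2 (first run)
print(refresh_embeddings([(1, "red shoes"), (2, "blue cap")]))  # 1 (only id 2 changed)
```

    In production the hash column would live next to the embedding column and the comparison would run as a scheduled job.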

    Search Performance

    For large datasets (>100k records):

    • Use pgvector extension for PostgreSQL
    • Implement HNSW indexes for fast approximate nearest neighbor search
    • Consider quantization to reduce storage requirements
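
    For reference, creating an HNSW index with pgvector looks like the following, assuming the `embedding vector(384)` column from earlier; `vector_cosine_ops` matches the cosine-distance `<=>` operator used in the search query:

```python
# Approximate nearest-neighbor index for the cosine-distance operator (<=>).
# Requires pgvector 0.5.0+; run once after the embeddings are populated.
CREATE_HNSW_INDEX = """
    CREATE INDEX IF NOT EXISTS products_embedding_hnsw
    ON products
    USING hnsw (embedding vector_cosine_ops)
"""

# cursor.execute(CREATE_HNSW_INDEX)  # uncomment with a live connection
```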

    Cost Analysis

    • Inference time: ~10ms per item on CPU
    • Storage: 384 floats × 4 bytes ≈ 1.5KB per embedding
    • No API costs: Run entirely on your infrastructure
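
    The storage figure follows directly from the embedding dimension; a quick back-of-envelope check:

```python
dims = 384              # all-MiniLM-L6-v2 output dimension
bytes_per_float = 4     # pgvector stores single-precision floats
per_row = dims * bytes_per_float          # 1536 bytes ≈ 1.5 KB
per_100k = per_row * 100_000 / 1024**2    # raw vector data for 100k rows

print(per_row)               # 1536
print(round(per_100k, 1))    # raw vector data in MB, excluding per-row overhead
```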

    Common Pitfalls and Solutions

    Pitfall 1: Wrong Model Choice

    Problem: Using large models unnecessarily.
    Solution: Start with MiniLM models; upgrade only if needed.

    Pitfall 2: Stale Embeddings

    Problem: Database updates don't refresh embeddings.
    Solution: Implement triggers or scheduled jobs for regeneration.

    Pitfall 3: Ignoring Context

    Problem: Embedding single fields in isolation.
    Solution: Combine relevant fields (title + description + category).

    Pitfall 4: No Hybrid Approach

    Problem: Purely semantic search misses exact matches.
    Solution: Combine with traditional filters and keyword search.
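
    One common way to combine keyword and semantic results is reciprocal rank fusion (RRF), which merges two ranked lists without needing their scores to be comparable. A sketch (the `k=60` constant is the usual default from the RRF literature; the product IDs are made up):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked ID lists; each ID scores sum(1 / (k + rank))."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits  = ["p7", "p3", "p9"]        # e.g. from full-text search
semantic_hits = ["p3", "p1", "p7"]        # e.g. from vector search

print(reciprocal_rank_fusion([keyword_hits, semantic_hits]))
```

    Items that appear high in both lists win, while items found by only one method still surface further down.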

    Production Deployment

    Architecture

    User Query → API Layer → Embedding Generation → Vector Search → Results Ranking → Response
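
    The pipeline above can be expressed as a thin function chain; everything below is a stub to show the shape of the flow, not a working service:

```python
def embed(query):
    # Stub: would call model.encode(query)
    return [float(len(query))]

def vector_search(embedding, top_k=5):
    # Stub: would run the pgvector query from Step 5
    return [("p1", 0.92), ("p2", 0.87)][:top_k]

def rank(hits):
    # Stub: re-rank results, e.g. boost in-stock items
    return sorted(hits, key=lambda h: h[1], reverse=True)

def handle_query(query):
    """User Query → Embedding → Vector Search → Ranking → Response."""
    return {"query": query, "results": rank(vector_search(embed(query)))}

print(handle_query("running shoes"))
```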
    

    Monitoring

    • Track search quality metrics
    • Monitor embedding generation latency
    • Alert on degraded search performance

    Scaling

    • Use read replicas for search workload
    • Implement caching layer (Redis)
    • Consider vector-specialized databases (Pinecone, Weaviate)

    Code Repository

    I've created a complete example implementation on GitHub:

    • Setup scripts
    • Sample data
    • API endpoints
    • Performance benchmarks
    • Docker compose for easy deployment

    Conclusion

    Sentence transformers provide a practical, cost-effective way to add semantic understanding to your SQL databases. You don't need a massive ML infrastructure to deliver intelligent search and recommendation features.

    The key is starting small, measuring results, and iterating based on real user feedback. Often, the 80/20 rule applies: you can achieve 80% of the benefits with 20% of the complexity of full-scale ML systems.