AI-powered semantic job matching with RAG, vector search, and intelligent enrichment
Built a production-ready AI job matching system that uses RAG (Retrieval-Augmented Generation), semantic search, and LLM-powered reasoning to intelligently match job seekers with AI/ML roles. The system goes beyond keyword matching by understanding context, culture fit, and market compensation.
Key Achievement: Optimized from $50-100/month to $3-5/month through smart caching, model selection (GPT-3.5-Turbo instead of GPT-4), and selective LLM analysis, a 90% cost reduction while maintaining high-quality matches.
Traditional job search relies on keyword matching, which misses nuanced requirements like culture fit, work-arrangement preferences, and market-competitive compensation. This called for an intelligent system that understands meaning rather than matching strings. The matching pipeline:
Job postings → Text chunks → Vector embeddings
User profile → Similarity search → Top matches
Culture matching + Salary analysis
GPT-3.5 explains why jobs match
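A minimal sketch of the ingestion half of this pipeline, assuming the LangChain split packages (import paths vary by version; the posting field names here are illustrative, not the project's schema):

from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

def ingest_postings(postings: list[dict]) -> Chroma:
    # Wrap each posting as a Document, keeping structured fields as metadata
    docs = [
        Document(page_content=p["description"],
                 metadata={"title": p["title"], "company": p["company"]})
        for p in postings
    ]
    # Chunk long descriptions so each embedding stays focused on one topic
    chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)
    return Chroma.from_documents(
        chunks,
        OpenAIEmbeddings(model="text-embedding-3-small"),
        persist_directory="./vectorstore",
    )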
ChromaDB for persistent vector storage with OpenAI embeddings (text-embedding-3-small)
LangChain for RAG pipelines, embeddings, and LLM chains with GPT-3.5-Turbo
FastAPI with async endpoints, lifespan management, and middleware for logging/monitoring
APScheduler for background job scraping (twice daily at 6am/6pm UTC)
Playwright + BeautifulSoup for company career pages and job boards
Diskcache for 6-hour match result caching (90% cost reduction on repeat searches)
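A sketch of how these pieces compose at the API layer; UserProfile and get_matcher are illustrative names, not the project's actual API:

from fastapi import Depends, FastAPI
from pydantic import BaseModel

class UserProfile(BaseModel):
    skills: list[str]
    culture_preferences: list[str] = []

app = FastAPI(lifespan=lifespan)  # lifespan wires up services and the scheduler (shown later)

def get_matcher() -> JobMatcherService:
    return app.state.matcher  # created during startup

@app.post("/match")
async def match_jobs(profile: UserProfile,
                     matcher: JobMatcherService = Depends(get_matcher)):
    return matcher.match(profile)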
Vector similarity search finds jobs based on meaning, not just keywords; it understands that "ML Engineer" relates to "Machine Learning Scientist".
GPT-3.5-Turbo analyzes top 5 matches and explains why each job fits the candidate's profile with specific highlights.
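The exact prompt isn't reproduced here; an illustrative version of the analysis chain might look like:

from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

analysis_prompt = ChatPromptTemplate.from_template(
    "You are a career advisor. Explain in 2-3 sentences why this job fits the "
    "candidate, citing specific highlights.\n\nProfile: {profile}\n\nJob: {job}"
)
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.3)
explain = analysis_prompt | llm  # LCEL chain: prompt -> model
# reasoning = explain.invoke({"profile": profile_text, "job": job_text})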
Scores companies (0-1) based on alignment with user's culture preferences (Research-driven, Fast-paced, etc.).
Compares listed compensation against market data from levels.fyi with variance analysis (above/below/competitive).
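A sketch of the variance classification, assuming a simple percentage band around the market median (the 10% threshold is illustrative):

def classify_salary(listed: float, market_median: float, band: float = 0.10) -> tuple[str, float]:
    # Variance as a signed fraction of the market median
    variance = (listed - market_median) / market_median
    if variance > band:
        return "above", variance
    if variance < -band:
        return "below", variance
    return "competitive", variance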
Background scheduler scrapes 13 sources (10 company sites + 3 job boards) twice daily, with auto-ingestion.
MD5-based cache with 6-hour TTL. Identical searches hit cache instantly, reducing API costs by 90%.
Protects against abuse with 20 searches per 15-min window. Tracks usage and enforces monthly cost limits.
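A minimal in-memory sliding-window limiter in that spirit (names are illustrative; a real deployment would likely back this with Redis):

import time
from collections import defaultdict

WINDOW_SECONDS, LIMIT = 15 * 60, 20
_hits: dict[str, list[float]] = defaultdict(list)

def allow_request(client_id: str) -> bool:
    now = time.time()
    # Drop timestamps that have aged out of the window
    _hits[client_id] = [t for t in _hits[client_id] if now - t < WINDOW_SECONDS]
    if len(_hits[client_id]) >= LIMIT:
        return False
    _hits[client_id].append(now)
    return True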
Structured JSON logs with request IDs, performance tracking, and daily rotation for debugging and monitoring.
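A stdlib-only sketch of that logging setup (the field names are illustrative; the project may use a dedicated logging library instead):

import json
import logging
from logging.handlers import TimedRotatingFileHandler

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            "request_id": getattr(record, "request_id", None),  # injected by middleware
        })

handler = TimedRotatingFileHandler("logs/app.jsonl", when="midnight", backupCount=14)
handler.setFormatter(JsonFormatter())
logging.getLogger("matcher").addHandler(handler)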
Refactored from a monolithic design to a service-oriented architecture with dependency injection for maintainability and testability:
from typing import List, Protocol

class ISearchService(Protocol):
    def search(self, query: str, k: int) -> List[Document]: ...
    def ingest(self, jobs: List[Job]) -> IngestionStats: ...

class SearchService(BaseService):
    """Vector search using ChromaDB + OpenAI embeddings."""

    def __init__(self):
        self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
        # Pass the embedding function so queries and documents share one model
        self.vectorstore = Chroma(
            persist_directory="./vectorstore",
            embedding_function=self.embeddings,
        )

    def search(self, query: str, k: int = 20) -> List[Document]:
        # Semantic similarity search over the embedded job postings
        return self.vectorstore.similarity_search(query, k=k)
class JobMatcherService:
"""Main orchestration service"""
def __init__(self, search: ISearchService, enrichment: IEnrichmentService):
self.search = search
self.enrichment = enrichment
self.cache = CacheService(ttl=21600) # 6 hours
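    # The orchestration itself, sketched; helper names like as_query and
    # rank_and_explain are illustrative, not the project's actual API.
    def match(self, profile: UserProfile) -> MatchResult:
        # Cache hit: identical profile searched within the last 6 hours
        if cached := self.cache.get(profile.cache_key()):
            return cached
        docs = self.search.search(profile.as_query(), k=20)
        enriched = [self.enrichment.enrich(doc, profile) for doc in docs]
        result = rank_and_explain(enriched, profile)  # LLM reasoning on top 5 only
        self.cache.set(profile.cache_key(), result)
        return result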
Achieved 90% cost reduction through multiple strategies:
# 1. Model Selection: GPT-3.5-Turbo instead of GPT-4 for match explanations
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.3)
# 2. Selective Analysis: Only top 5 jobs get LLM reasoning
top_jobs = ranked_jobs[:5]
for job in top_jobs:
    # One LLM call per top match; analysis_prompt is rendered from the job + profile
    reasoning = llm.invoke(analysis_prompt)
# 3. Smart Caching: 6-hour TTL with MD5 hash keys
cache_key = hashlib.md5(json.dumps(profile, sort_keys=True).encode()).hexdigest()  # sort_keys keeps keys deterministic
if cached := cache.get(cache_key):
return cached # 90% of searches hit cache
# 4. Incremental Embeddings: Only embed new jobs
existing_ids = set(vectorstore.get()['ids'])
new_jobs = [j for j in jobs if j.id not in existing_ids]
Culture and salary data enrichment with confidence scoring:
import json
from dataclasses import dataclass
from functools import lru_cache
from typing import Dict, List

@dataclass
class CultureMatch:
    # Result type; fields mirror the constructor call below
    score: float
    matching_traits: List[str]
    confidence: str

@lru_cache(maxsize=1)  # Singleton pattern: load the culture dataset once
def load_culture_data() -> Dict:
    with open("data/company_culture.json") as f:
        return json.load(f)

def match_culture(company: str, preferences: List[str]) -> CultureMatch:
    culture_db = load_culture_data()
    company_data = culture_db.get(company, {})
    # Trait alignment: fraction of the user's preferred traits the company exhibits
    matching_traits = set(preferences) & set(company_data.get("traits", []))
    score = len(matching_traits) / len(preferences) if preferences else 0.0
    return CultureMatch(
        score=score,
        matching_traits=list(matching_traits),
        confidence="High" if company in culture_db else "Low",
    )
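Usage, with a placeholder company key:

match = match_culture("ExampleCo", ["Research-driven", "Fast-paced"])
print(match.score, match.matching_traits, match.confidence)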
Automated job discovery with scheduler integration:
from contextlib import asynccontextmanager

from apscheduler.schedulers.asyncio import AsyncIOScheduler
from apscheduler.triggers.cron import CronTrigger
from fastapi import FastAPI

scheduler = AsyncIOScheduler(timezone="utc")

@asynccontextmanager
async def lifespan(app: FastAPI):
    # Startup: initialize services and the background scheduler
    matcher = JobMatcherService()
    app.state.matcher = matcher  # expose to request handlers
    # Scrape twice daily at 06:00 and 18:00 UTC
    scheduler.add_job(scrape_and_ingest, CronTrigger(hour=6))
    scheduler.add_job(scrape_and_ingest, CronTrigger(hour=18))
    scheduler.start()
    yield
    # Shutdown: stop the scheduler and release resources
    scheduler.shutdown()
    matcher.cleanup()
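scrape_and_ingest itself, sketched under assumed names (ALL_SCRAPERS, scraper.fetch, and the logger are illustrative):

async def scrape_and_ingest() -> None:
    jobs: list[Job] = []
    for scraper in ALL_SCRAPERS:  # 13 sources: 10 company career pages + 3 job boards
        try:
            jobs.extend(await scraper.fetch())
        except Exception:
            logger.exception("Scraper failed: %s", scraper.name)
    # Ingestion is incremental: only unseen job ids get embedded
    matcher.ingest(jobs)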
import requests
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(min=1, max=10),
    # Retry on timeouts and HTTP errors (incl. 429 rate limits surfaced by raise_for_status)
    retry=retry_if_exception_type((requests.exceptions.Timeout, requests.exceptions.HTTPError)),
)
def fetch_with_retry(url: str) -> str:
    # Exponential backoff between attempts for transient scraper/API failures
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text
RAG pipeline design, vector databases, embedding strategies, LLM orchestration
FastAPI, async/await patterns, dependency injection, service-oriented architecture
ETL pipelines, web scraping, data enrichment, incremental updates
Smart caching, model selection, selective processing, rate limiting
Structured logging, error handling, monitoring, performance optimization
Scalable architecture, background jobs, lifespan management, clean code patterns