Devon Cruz
AI Engineer
San Francisco, CA • aieng@gmail.com • +1 4153-4444
Profile Summary
- AI Engineer with 6 years of experience designing and shipping LLM-powered products across answer engines, customer-support agents, and developer tools, specializing in prompt engineering, retrieval-augmented generation, and agent orchestration.
- Solid technical background across LLMs (Claude, GPT-5), agent frameworks (LangChain, LangGraph), vector databases (Pinecone, pgvector), LLM observability (LangSmith, Langfuse), and languages (Python, TypeScript) with strong fundamentals in REST APIs, function calling, and structured outputs.
- Deep expertise in end-to-end LLM applications, agentic workflows, and structured-output orchestration, using prompt iteration loops and LLM-as-judge evaluation to ship reliable, cost-aware, and observable AI products.
- Engaged collaborator working cross-functionally with Product, Design, and domain experts in Agile environments, contributing to AI use-case discovery, eval design, and post-launch retrospectives with a pragmatic, ownership-first mindset.
- Emerging leader who models technical excellence and fosters a culture of evaluation-first thinking and responsible-AI discipline through PR reviews and runbooks, leading AI guild sessions and authoring widely adopted prompt-and-eval templates.
Technical Skills
- LLMs & Foundation Models: Claude, GPT-5, Gemini, Llama, Mistral
- Agent & Orchestration: LangChain, LangGraph, CrewAI, Pydantic AI, MCP
- RAG & Retrieval: Pinecone, Weaviate, Chroma, pgvector, Cohere reranker, hybrid search
- Languages & Scripting: Python, TypeScript, SQL, Bash
- LLM Observability: LangSmith, Langfuse, Arize, OpenTelemetry, Helicone
- Evaluation & QA: LLM-as-judge, golden datasets, RAGAS, DeepEval, human-in-the-loop
- Safety & Guardrails: Input/output filtering, PII redaction, jailbreak resistance, OWASP LLM Top 10
- Cloud & Inference: AWS Bedrock, Azure OpenAI, GCP Vertex AI, prompt caching, model routing
Education
Work Experience
Perplexity
- Owned the end-to-end LLM application layer for Perplexity Pro Search serving 5M+ paid subscribers, leading design across prompt orchestration, retrieval pipelines, and agentic search workflows for 18 product features in a polyglot Python/TypeScript environment.
- Designed the prompt-engineering framework for citation-grounded answers, applying few-shot prompting, structured-output JSON schemas, and prompt chaining across 4 model providers, lifting answer accuracy from 71% to 89% on the internal eval set.
- Built the retrieval-augmented generation pipeline on Pinecone and pgvector with hybrid BM25 + semantic search, adaptive chunking by content type, and Cohere reranking, indexing 180M+ documents at 240ms p95 retrieval latency (hybrid-fusion sketch below).
- Architected the multi-agent search system in LangGraph with planner + executor + critic roles, MCP-integrated tool use for live data, and graceful fallback chains, handling 2.4M agentic queries/day at 96% successful task completion (graph-wiring sketch below).
- Stood up the team's LLM evaluation framework, including LLM-as-judge graders, human-in-the-loop golden sets, and regression suites in LangSmith, running 40+ structured evals that gated 12 model launches and surfaced 9 pre-prod hallucination regressions (judge sketch below).
- Implemented the AI safety and guardrail layer with input and output filtering, jailbreak resistance via constitutional rules, PII redaction at the prompt boundary, and content-moderation routing, reducing policy-violating outputs from 3.2% to 0.18% (redaction sketch below).
- Optimized inference cost through prompt caching, semantic caching of repeat queries, dynamic model routing (Sonnet for routine, Opus for complex), and token-budget streaming, cutting per-query cost by 62% (~$1.4M in annual run-rate savings) without quality regression (routing sketch below).
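A minimal sketch of the hybrid BM25 + dense retrieval named above, fused with reciprocal-rank fusion; the corpus, the embed() stub, and the fusion constant are illustrative stand-ins, not the production Pinecone/pgvector pipeline, and a reranker such as Cohere's would reorder the fused top-k before generation.

```python
# Hybrid sparse + dense retrieval with reciprocal-rank fusion (RRF).
from rank_bm25 import BM25Okapi  # pip install rank-bm25
import numpy as np

corpus = [
    "pgvector stores embeddings inside Postgres",
    "BM25 is a sparse lexical ranking function",
    "Hybrid search fuses sparse and dense scores",
]

def embed(text: str) -> np.ndarray:
    """Stand-in embedding: hash tokens into a fixed-size vector (illustrative only)."""
    vec = np.zeros(64)
    for tok in text.lower().split():
        vec[hash(tok) % 64] += 1.0
    return vec / (np.linalg.norm(vec) or 1.0)

def rrf(rankings: list[list[int]], k: int = 60) -> list[int]:
    """Reciprocal-rank fusion over several ranked lists of doc indices."""
    scores: dict[int, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

query = "fuse lexical and vector search"
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
sparse_rank = list(np.argsort(-bm25.get_scores(query.lower().split())))
doc_vecs = np.stack([embed(d) for d in corpus])
dense_rank = list(np.argsort(-(doc_vecs @ embed(query))))
print([corpus[i] for i in rrf([sparse_rank, dense_rank])])
```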
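A minimal sketch of the planner + executor + critic loop in LangGraph; the node bodies are stubs (real nodes would call an LLM and MCP tools), and only the graph wiring and the loop-until-approved pattern reflect the architecture described above.

```python
# Planner -> executor -> critic loop, wired as a LangGraph state machine.
from typing import TypedDict
from langgraph.graph import StateGraph, END

class SearchState(TypedDict):
    query: str
    plan: str
    result: str
    approved: bool

def planner(state: SearchState) -> dict:
    return {"plan": f"search web for: {state['query']}"}  # stub: LLM call in practice

def executor(state: SearchState) -> dict:
    return {"result": f"executed [{state['plan']}]"}  # stub: tool/MCP calls in practice

def critic(state: SearchState) -> dict:
    return {"approved": "executed" in state["result"]}  # stub: LLM-as-judge in practice

graph = StateGraph(SearchState)
graph.add_node("planner", planner)
graph.add_node("executor", executor)
graph.add_node("critic", critic)
graph.set_entry_point("planner")
graph.add_edge("planner", "executor")
graph.add_edge("executor", "critic")
# Loop back to the planner until the critic approves; a production graph
# would also cap retries and fall back gracefully.
graph.add_conditional_edges(
    "critic", lambda s: "done" if s["approved"] else "retry",
    {"done": END, "retry": "planner"},
)
app = graph.compile()
print(app.invoke({"query": "latest GPU prices", "plan": "", "result": "", "approved": False}))
```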
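A minimal LLM-as-judge sketch in the spirit of the eval framework above; call_llm() is a hypothetical placeholder for whatever provider client the harness wires in, and the rubric and golden pair are illustrative.

```python
# Grade a candidate answer against a golden reference with a rubric prompt.
import json

JUDGE_PROMPT = """You are grading an answer against a reference.
Question: {question}
Reference answer: {reference}
Candidate answer: {candidate}
Return JSON: {{"score": 1-5, "reasoning": "<one sentence>"}}"""

def call_llm(prompt: str) -> str:
    # Placeholder: swap in a real provider call (Claude, GPT, etc.).
    return '{"score": 4, "reasoning": "Covers the reference with minor omissions."}'

def judge(question: str, reference: str, candidate: str) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate))
    verdict = json.loads(raw)            # structured verdict, not free text
    assert 1 <= verdict["score"] <= 5    # reject malformed grades
    return verdict

golden_set = [("What does RAG add?", "Grounding in retrieved documents",
               "It grounds answers in retrieved docs")]
scores = [judge(q, ref, cand)["score"] for q, ref, cand in golden_set]
print(f"mean score: {sum(scores) / len(scores):.2f}")
```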
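A minimal sketch of regex-based PII redaction at the prompt boundary, as in the guardrail bullet above; the patterns are illustrative, and production layered NER-based detection on top.

```python
# Scrub obvious identifiers before text reaches the model.
import re

PII_PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)  # replace match with a typed placeholder
    return text

print(redact("Reach me at jane@corp.com or +1 415 555 0100."))
```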
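A minimal sketch of cost-aware model routing with a cache, per the inference-cost bullet above; the length heuristic, model aliases, and exact-match cache are illustrative (production routing used richer signals, and semantic caching matches on embedding similarity rather than normalized strings).

```python
# Route routine queries to a cheap model, complex ones to a strong model,
# and short-circuit repeats through a cache.
import hashlib

CHEAP, STRONG = "claude-sonnet", "claude-opus"  # illustrative aliases
_cache: dict[str, str] = {}

def route(query: str) -> str:
    # Stand-in heuristic; production used a trained complexity classifier.
    return STRONG if len(query.split()) > 30 or "compare" in query.lower() else CHEAP

def answer(query: str) -> str:
    key = hashlib.sha256(" ".join(query.lower().split()).encode()).hexdigest()
    if key in _cache:          # exact-match cache on the normalized query;
        return _cache[key]     # semantic caching would match on embeddings
    model = route(query)
    response = f"[{model}] answer to: {query}"  # stub for the real API call
    _cache[key] = response
    return response

print(answer("what is pgvector?"))
print(answer("What is  pgvector?"))  # normalizes to the same key -> cache hit
```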
Intercom
- Owned the model-selection and fine-tuning workstream for the Fin support agent, evaluating Claude vs. GPT-4 vs. open-source Llama, applying LoRA fine-tuning on 18k labeled support tickets, and shipping a custom adapter that beat the closed-source baseline on resolution rate (+8%) at 40% lower cost-per-conversation (LoRA sketch below).
- Built the AI observability stack on Langfuse with custom OpenTelemetry traces, including prompt versioning, A/B test routing, and drift-monitoring dashboards covering 22 production prompts, surfacing 15 silent regressions in the first six months (tracing sketch below).
- Built Fin's structured-output orchestration layer on LangChain with Pydantic schemas, integrating 9 internal tools (CRM lookup, billing API, knowledge-base search) via function calling, powering 800k+ resolved conversations/month at sub-second p50 latency (dispatch sketch below).
- Worked closely with Product, Design, and domain experts across 5 product surfaces to align on AI use-case prioritization, non-determinism UX patterns, and launch quality bars; authored 8 RFCs that shaped the org's responsible-AI launch playbook and onboarded 9 new AI engineers.
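A minimal sketch of the LoRA setup with Hugging Face PEFT, assuming a Llama base model; the checkpoint name, rank, and target modules are illustrative, not the adapter actually shipped, and the 18k-ticket training loop itself is omitted.

```python
# Attach low-rank adapters to a base causal LM so only ~0.1% of weights train.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")  # illustrative checkpoint
config = LoraConfig(
    r=16,                                  # adapter rank (illustrative)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()
# Training then proceeds with a standard supervised fine-tuning loop
# over the labeled support-ticket data.
```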
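A minimal sketch of tracing an LLM call with OpenTelemetry spans whose attributes (prompt version, model, token counts) an observability backend such as Langfuse can ingest; the attribute keys and console exporter are illustrative.

```python
# Wrap each generation in a span and attach prompt/model metadata.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-app")

def generate(prompt: str) -> str:
    with tracer.start_as_current_span("llm.generate") as span:
        span.set_attribute("prompt.version", "support-v12")  # illustrative keys
        span.set_attribute("llm.model", "claude-sonnet")
        response = f"stubbed completion for: {prompt}"       # real API call here
        span.set_attribute("llm.output_tokens", len(response.split()))
        return response

generate("How do I reset my password?")
```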
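A minimal sketch of schema-validated tool dispatch with Pydantic v2, as in the structured-output bullet above; the ToolCall shape and tool table are hypothetical, not Fin's actual tool surface.

```python
# Validate LLM-emitted JSON against a schema, then route to an internal tool.
from pydantic import BaseModel, ValidationError

class ToolCall(BaseModel):
    tool: str        # e.g. "crm_lookup", "billing_api", "kb_search"
    argument: str

TOOLS = {
    "crm_lookup": lambda arg: f"CRM record for {arg}",
    "kb_search": lambda arg: f"top KB article about {arg}",
}

def dispatch(raw_llm_json: str) -> str:
    try:
        call = ToolCall.model_validate_json(raw_llm_json)
    except ValidationError:
        return "ask the model to re-emit valid JSON"  # retry loop in practice
    handler = TOOLS.get(call.tool)
    return handler(call.argument) if handler else f"unknown tool: {call.tool}"

print(dispatch('{"tool": "kb_search", "argument": "refund policy"}'))
```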