I Mass-Rejected 847 GenAI Pull Requests. Line 673 Would Have Leaked Credit Card Numbers.
Last Month, 2:47 AM.
I'm reviewing code for a Fortune 500's "revolutionary AI assistant." 847 lines. Looks clean. Ships tomorrow.
Line 673: print(f"Debug: {user_input}")
That "user_input" contained credit card numbers. In production logs. Unencrypted.
One junior dev. One debug statement. $50M lawsuit waiting to happen.
I mass-rejected the entire PR.
The Math That Should Terrify You
$40+ billion invested in enterprise GenAI in 2024. (IDC)
80% of AI projects fail. Thatβs double the failure rate of traditional IT projects. (RAND Corporation)
Only 48% of AI projects make it to production. (Gartner)
Only 25% of AI initiatives delivered expected ROI. (2025 CEO Survey)
Every. Single. Week. I watch another GenAI project die.
Not because GPT-4 isn't smart enough.
Because nobody built the boring parts.
The Dirty Secret Nobody Tells You
Every GenAI tutorial teaches you this:
response = openai.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}]
)

Congratulations. You can call a function.
That's not a GenAI application. That's a hello world with a $0.03 price tag.
The 90% that fail? They stopped here.
What Production GenAI Actually Looks Like
I've deployed GenAI systems that process thousands of requests daily. Here's what the folder structure looks like for the ones that survive:
genai_project/
├── config/
│   ├── __init__.py
│   ├── model_config.yaml      # Model settings, not hardcoded
│   ├── prompt_templates.yaml  # Prompts as config, not code
│   └── logging_config.yaml    # You WILL need logs
│
├── src/
│   ├── llm/
│   │   ├── __init__.py
│   │   ├── base.py            # Abstract base (swap providers)
│   │   ├── claude_client.py   # Anthropic
│   │   ├── openai_client.py   # OpenAI
│   │   └── fallback.py        # When primary fails
│   │
│   ├── prompt_engineering/
│   │   ├── __init__.py
│   │   ├── templates.py       # Jinja2 or similar
│   │   ├── few_shot.py        # Example management
│   │   └── chainer.py         # Multi-step prompts
│   │
│   ├── utils/
│   │   ├── __init__.py
│   │   ├── rate_limiter.py    # Don't get rate limited
│   │   ├── token_counter.py   # Know your costs
│   │   ├── cache.py           # Don't pay twice
│   │   └── logger.py          # Debug at 2 AM
│   │
│   └── handlers/
│       ├── __init__.py
│       └── error_handler.py   # When (not if) it breaks
│
├── data/
│   ├── cache/                 # API response cache
│   ├── prompts/               # Version-controlled prompts
│   ├── outputs/               # What the model returned
│   └── embeddings/            # If using RAG
│
├── examples/
│   ├── basic_completion.py
│   ├── chat_session.py
│   └── chain_prompts.py
│
├── notebooks/
│   ├── prompt_testing.ipynb   # Iterate here first
│   ├── response_analysis.ipynb
│   └── cost_analysis.ipynb    # Track your spend
│
├── tests/
│   ├── test_prompts.py        # Yes, test your prompts
│   ├── test_handlers.py
│   └── test_integration.py
│
├── requirements.txt
├── setup.py
├── README.md
├── Dockerfile
└── .env.example               # Never commit .env
The 7 Folders That Separate Survivors From Casualties
1. config/ – Configuration as Code
What dies: Hardcoded API keys, model names in every file, prompts buried in code.
What survives: YAML configs that can change without code deploys.
# model_config.yaml
models:
  primary:
    provider: anthropic
    model: claude-3-5-sonnet
    max_tokens: 4096
    temperature: 0.7
  fallback:
    provider: openai
    model: gpt-4-turbo

Career tip: The engineer who can swap models in 5 minutes instead of 5 days gets promoted.
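Loading that config at startup takes a few lines. A minimal sketch, assuming PyYAML and the model_config.yaml layout above (the path and key names are illustrative):

# config_loader.py - minimal sketch, assuming PyYAML is installed
import yaml

def load_model_config(path: str = "config/model_config.yaml") -> dict:
    """Read model settings from YAML so they can change without a code deploy."""
    with open(path, "r", encoding="utf-8") as f:
        return yaml.safe_load(f)

config = load_model_config()
primary = config["models"]["primary"]    # provider, model, max_tokens, temperature
fallback = config["models"]["fallback"]  # used when the primary provider fails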
2. src/llm/ – Provider Abstraction
What dies: Direct API calls scattered everywhere.
What survives: Abstract base class with swappable implementations.
# base.py
from abc import ABC, abstractmethod

class LLMClient(ABC):
    @abstractmethod
    def complete(self, prompt: str) -> str:
        pass

    @abstractmethod
    def stream(self, prompt: str):
        pass

Why this matters: When OpenAI has an outage (they will), you switch to Claude in one line.
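Here's what one concrete client might look like, a sketch assuming the official anthropic SDK and the base class above (the model name and max_tokens are illustrative defaults):

# claude_client.py - a sketch of one concrete implementation
import anthropic
from .base import LLMClient  # base.py from the snippet above, same src/llm package

class ClaudeClient(LLMClient):
    def __init__(self, model: str = "claude-3-5-sonnet-latest", max_tokens: int = 1024):
        self.client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
        self.model = model
        self.max_tokens = max_tokens

    def complete(self, prompt: str) -> str:
        response = self.client.messages.create(
            model=self.model,
            max_tokens=self.max_tokens,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.content[0].text

    def stream(self, prompt: str):
        # Yield text chunks as they arrive instead of waiting for the full response
        with self.client.messages.stream(
            model=self.model,
            max_tokens=self.max_tokens,
            messages=[{"role": "user", "content": prompt}],
        ) as stream:
            for chunk in stream.text_stream:
                yield chunk

An openai_client.py follows the same shape, which is the whole point: callers only ever see LLMClient.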
3. src/utils/rate_limiter.py – Don't Get Rate Limited
What dies: Raw API calls with no throttling.
What survives: Token bucket or sliding window rate limiting.
# Simplified example
import time

class RateLimiter:
    def __init__(self, calls_per_minute: int):
        self.calls_per_minute = calls_per_minute
        self.calls = []

    def wait_if_needed(self):
        now = time.time()
        # Keep only the calls made in the last 60 seconds
        self.calls = [c for c in self.calls if now - c < 60]
        if len(self.calls) >= self.calls_per_minute:
            sleep_time = 60 - (now - self.calls[0])
            time.sleep(sleep_time)
        self.calls.append(now)

The cost of skipping this: One bad loop can burn through your monthly budget in an hour.
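Wiring it in is one call before every request. A hypothetical usage, assuming the RateLimiter above and any client with a complete() method:

# One shared limiter for every outbound call; 50/minute is an illustrative quota
limiter = RateLimiter(calls_per_minute=50)

def safe_complete(client, prompt: str) -> str:
    limiter.wait_if_needed()  # blocks until a call slot is free
    return client.complete(prompt)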
4. src/utils/cache.py – Don't Pay Twice
What dies: Same prompt, same response, charged twice.
What survives: Content-addressed caching.
import hashlib

def cache_key(prompt: str, model: str) -> str:
    content = f"{model}:{prompt}"
    return hashlib.sha256(content.encode()).hexdigest()

Real savings: I've seen teams cut API costs by 40% just by caching repeated queries.
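The key function is half the story; you still need somewhere to put the responses. A minimal in-memory sketch built on cache_key (a real deployment might back this with Redis or files under data/cache/, both of which are assumptions here):

# In-memory cache keyed by cache_key; same prompt + model pays nothing the second time
_cache: dict[str, str] = {}

def cached_complete(client, prompt: str, model: str) -> str:
    key = cache_key(prompt, model)
    if key in _cache:
        return _cache[key]             # cache hit: no API call, no cost
    response = client.complete(prompt)
    _cache[key] = response
    return response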
5. data/prompts/ – Version-Controlled Prompts
What dies: Prompts edited in production, no history, no rollback.
What survives: Prompts in git with semantic versioning.
data/prompts/
├── summarization/
│   ├── v1.0.txt
│   ├── v1.1.txt
│   └── v2.0.txt      # Breaking change
└── classification/
    ├── v1.0.txt
    └── v1.1.txt
Why this matters: When the CEO asks "why did the model say that last Tuesday?" you can answer.
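Answering that question is easier when code can only load a pinned version. A hypothetical loader for the layout above:

# Callers pin a prompt version explicitly, so "what ran last Tuesday"
# is answerable straight from git history
from pathlib import Path

PROMPT_DIR = Path("data/prompts")

def load_prompt(task: str, version: str) -> str:
    """Load e.g. data/prompts/summarization/v1.1.txt."""
    return (PROMPT_DIR / task / f"{version}.txt").read_text(encoding="utf-8")

template = load_prompt("summarization", "v1.1")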
6. handlers/error_handler.py – When, Not If
What dies: Try/except that swallows errors.
What survives: Structured error handling with fallbacks.
import functools

class LLMError(Exception):
    pass

class RateLimitError(LLMError):
    pass

class ContentFilterError(LLMError):
    pass

class ModelOverloadError(LLMError):
    pass

def handle_llm_call(func):
    # fallback_call, log_content_filter, and SAFE_RESPONSE are project-level
    # helpers assumed to live elsewhere in the handlers package
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except RateLimitError:
            # Switch to fallback model
            return fallback_call(*args, **kwargs)
        except ContentFilterError:
            # Log and return safe response
            log_content_filter(args)
            return SAFE_RESPONSE
    return wrapper

Production reality: Every LLM will reject content, hit rate limits, or time out. Plan for it.
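Note that ModelOverloadError is defined but never caught above. A common pattern for it is retry with exponential backoff; here's a sketch (the retry count and delays are illustrative):

# Retry ModelOverloadError with exponential backoff: 1s, 2s, 4s, ...
import time

def with_backoff(func, retries: int = 3, base_delay: float = 1.0):
    def wrapper(*args, **kwargs):
        for attempt in range(retries):
            try:
                return func(*args, **kwargs)
            except ModelOverloadError:
                if attempt == retries - 1:
                    raise                              # out of retries: surface the error
                time.sleep(base_delay * 2 ** attempt)  # back off before the next attempt
    return wrapper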
7. tests/test_prompts.py – Yes, Test Your Prompts
What dies: "It worked in the playground."
What survives: Automated prompt regression tests.
def test_summarization_maintains_key_facts():
    """Ensure summary contains critical information."""
    input_text = SAMPLE_DOCUMENT
    result = summarize(input_text)
    assert "quarterly revenue" in result.lower()
    assert "CEO" in result
    assert len(result) < len(input_text) * 0.3

def test_classification_known_inputs():
    """Test classification against labeled examples."""
    for example in LABELED_EXAMPLES:
        result = classify(example.text)
        assert result == example.expected_label

The uncomfortable truth: Most teams don't test prompts because they don't know how. Now you do.
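The tests above lean on a LABELED_EXAMPLES fixture that isn't defined here. It doesn't need to be fancy; a minimal illustrative shape (the texts and labels below are made up) might be:

# Illustrative fixture shape for LABELED_EXAMPLES
from dataclasses import dataclass

@dataclass
class LabeledExample:
    text: str
    expected_label: str

LABELED_EXAMPLES = [
    LabeledExample(text="My refund still hasn't arrived after 10 days.", expected_label="billing"),
    LabeledExample(text="The app crashes when I open settings.", expected_label="bug_report"),
]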
The Pattern Nobody Talks About: Graceful Degradation
Here's the architecture that survives production:
Request
  ↓
[Validation Layer] → Reject bad inputs early
  ↓
[Cache Check] → Return cached if available
  ↓
[Rate Limiter] → Throttle if needed
  ↓
[Primary Model] → Try Claude/GPT-4
  ↓ (on failure)
[Fallback Model] → Try alternative
  ↓ (on failure)
[Cached Response] → Return stale if available
  ↓ (on failure)
[Graceful Error] → User-friendly message
Every layer has a fallback. Thatβs what production looks like.
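Compressed into code, the whole chain fits in about twenty lines. A sketch that reuses the pieces from earlier sections (is_valid, limiter, cache_key, _cache, and the primary/fallback clients are assumed names, not a fixed API):

# The degradation chain from the diagram, end to end
def handle_request(prompt: str, model: str = "claude-3-5-sonnet") -> str:
    if not is_valid(prompt):                             # Validation Layer
        raise ValueError("Rejected: bad input")
    key = cache_key(prompt, model)
    if key in _cache:                                    # Cache Check
        return _cache[key]
    limiter.wait_if_needed()                             # Rate Limiter
    try:
        response = primary_client.complete(prompt)       # Primary Model
    except LLMError:
        try:
            response = fallback_client.complete(prompt)  # Fallback Model
        except LLMError:
            stale = _cache.get(key)                      # Cached (stale) Response
            if stale is not None:
                return stale
            return "Sorry, something went wrong on our end. Please try again shortly."  # Graceful Error
    _cache[key] = response
    return response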
What This Means for Your Career
Here's the uncomfortable truth nobody on tech Twitter wants to admit:
80% of AI projects fail. (RAND)
Only 25% of AI initiatives deliver expected ROI. (IBM CEO Study)
Companies are DESPERATE for people who can ship GenAI to production. Not demo it. Ship it.
I've seen senior engineers with 15 years of experience get passed over for juniors who understood this stuff.
The career play: Be the person who builds systems, not demos.
That person is rare.
That person names their salary.
The Minimum Viable Structure
If you take nothing else, start here:
my_genai_project/
├── config/
│   └── settings.yaml
├── src/
│   ├── llm_client.py    # Abstracted provider
│   ├── rate_limiter.py  # Don't get banned
│   └── cache.py         # Don't pay twice
├── prompts/
│   └── v1/              # Version your prompts
├── tests/
│   └── test_prompts.py  # Test your prompts
└── requirements.txt
That's 6 files. If your GenAI project doesn't have at least this, you're building a demo, not a product.
Next Steps
Audit your current project. How many of these 7 folders do you have?
Add rate limiting today. Itβs the fastest way to prevent a $10,000 mistake.
Version your prompts. Future you will thank present you.
Write one prompt test. Just one. The habit matters more than coverage.
The Bottom Line
90% of GenAI projects fail.
Not because the AI isnβt smart enough.
Because the humans building it skipped the boring parts.
Seven folders. Thatβs it.
The difference between a $1.9M write-off and a product that actually ships.
Build the boring parts.
Thatβs where the money is.
P.S. The engineers who understand this will be unhireable in 2 years. Not because they're bad. Because everyone will want them.
Structure diagram inspiration: Brij Kishore Pandey
The line 673 anecdote is brutally effective framing. Rate limiting and cache layers aren't optional extras when you're burning $0.03 per call; they're survival mechanisms. I've debugged enough 3 AM incidents where a recursive loop ate through $15k in API credits to know graceful degradation isn't about elegance, it's about staying employed when things break. The version-controlled prompts folder feels almost obvious in hindsight, but I rarely see it implemented correctly.