Enterprise NLQ→SQL Hardening: ThoughtSpot Patterns
How Big Players Secure LLM-Generated SQL
ThoughtSpot and similar enterprise BI platforms use a multi-layered security approach that goes far beyond prompt engineering. Here’s the architecture pattern and how DataTruth implements it.
The Four Pillars
A) Semantic Model as the Gate (Not the LLM)
ThoughtSpot Approach:
- Heavy reliance on governed semantic layer
- Model metadata defines what’s queryable
- Joins, relationships, and constraints are pre-configured
- LLM can only generate queries using authorized model elements
DataTruth Implementation:
✅ Semantic Layer (src/semantic/)
```python
# loader.py - Semantic model defines all queryable elements
semantic_context = {
    "metrics": [...],        # Pre-approved calculations
    "dimensions": [...],     # Approved grouping fields
    "relationships": [...],  # Valid JOIN paths
    "synonyms": [...],       # NL term mappings
}
```
✅ Plan-Level Validation (orchestrator_v2.py)
```python
def _validate_plan(query_plan, semantic_context):
    """Validate BEFORE SQL generation."""
    errors = []
    # Check metrics exist
    available_metrics = {m["name"] for m in semantic_context.get("metrics", [])}
    for metric in query_plan.metrics:
        if metric.name not in available_metrics:
            errors.append(f"Unknown metric: {metric.name}")
    # Check dimensions exist
    # Check filters reference valid fields
    # Only approved elements pass through
```
Gap Analysis:
- ⚠️ Need explicit JOIN path validation (can only JOIN via pre-defined relationships)
- ⚠️ Need calculation validation (derived metrics must use approved formulas)
- ⚠️ Need data lineage tracking (audit what data feeds into results)
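The first gap, explicit JOIN path validation, can be sketched as a membership check against the relationships the semantic model pre-declares. This is a minimal illustration, not DataTruth code; the table names and `ALLOWED_JOINS` shape are assumptions:

```python
# Hypothetical sketch: only JOINs declared in the semantic model pass.
ALLOWED_JOINS = {
    frozenset(("orders", "customers")),  # e.g. orders.customer_id -> customers.id
    frozenset(("orders", "products")),
}

def validate_join_path(joins):
    """Return errors for any JOIN pair not declared in the semantic model."""
    errors = []
    for left, right in joins:
        if frozenset((left, right)) not in ALLOWED_JOINS:
            errors.append(f"Unauthorized JOIN: {left} <-> {right}")
    return errors
```

Using `frozenset` makes the check direction-agnostic: `orders JOIN customers` and `customers JOIN orders` are the same relationship.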
B) Generated SQL Not Blindly Executed
ThoughtSpot Approach:
- SQL generated by LLM is NOT executed directly
- Translated to internal query language (enforces controls)
- If SQL tries unauthorized access → no results returned
- Query rewriting layer enforces all policies
DataTruth Implementation:
✅ SQL Validation Layer (validator_v2.py)
```python
class ProductionSQLValidator:
    def validate(self, sql: str) -> SQLValidationResult:
        # 1. Parse to AST
        # 2. Security checks (injection, dangerous ops)
        # 3. Structure validation (depth, complexity)
        # 4. Schema validation (table/column authorization)
        # 5. Performance validation
        ...
```
✅ Query Plan Abstraction (planner/)
```python
# SQL is not generated directly from the NL question
# Flow: NL → Intent → QueryPlan → SQL
# Each stage has validation gates
```
Current Flow:
```
Natural Language Question
        ↓
Intent Extraction (LLM)
        ↓ [GATE 1: Plan Validation]
Query Plan
        ↓ [GATE 2: SQL Generation Rules]
Generated SQL
        ↓ [GATE 3: SQL Validation (AST + Security)]
Validated SQL
        ↓ [GATE 4: Execution]
Results
```
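The gated flow above can be sketched as a chain of stage/gate pairs, where any gate can reject before the next stage runs. This is a simplified illustration, not the actual orchestrator code:

```python
class GateError(Exception):
    """Raised when any validation gate rejects the query."""

def run_pipeline(question, gates, stages):
    """Run each stage, then pass its output through the matching gate.

    Each gate returns (ok, reason); a failing gate stops the pipeline,
    so later stages never see unvalidated input.
    """
    value = question
    for stage, gate in zip(stages, gates):
        value = stage(value)
        ok, reason = gate(value)
        if not ok:
            raise GateError(reason)
    return value
```

The key property is that every stage boundary is a hard stop: a rejected plan never reaches SQL generation, and rejected SQL never reaches execution.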
Gap Analysis:
- ⚠️ Need query rewriting layer (normalize SQL before execution)
- ⚠️ Need query plan caching with authorization metadata
- ⚠️ Need audit log of what SQL was generated vs what was executed
- ⚠️ Need SQL sanitization (even after validation)
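The audit gap above — recording what SQL was generated versus what was executed — can start as a side-by-side record per query. The field names here are illustrative assumptions, not an existing DataTruth schema:

```python
import json
import time

def audit_record(user_id, generated_sql, executed_sql):
    """Build an audit entry capturing both the LLM output and what actually ran."""
    return {
        "ts": time.time(),
        "user_id": user_id,
        "generated_sql": generated_sql,
        "executed_sql": executed_sql,
        # True whenever rewriting (RLS injection, normalization) changed the SQL
        "rewritten": generated_sql != executed_sql,
    }

def append_audit(path, record):
    """Append-only JSONL keeps the trail simple to review and hard to edit in place."""
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
```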
C) Constrained LLM Prompt Context
ThoughtSpot Approach:
- Explicitly select which columns/values/metadata go into prompt
- Use structured instructions for SQL generation
- Limit context to user’s authorized scope
- Don’t send full schema → send filtered view
DataTruth Implementation:
✅ Filtered Semantic Context (intent_extractor.py)
```python
# Only send relevant metrics/dimensions to the LLM
filtered_context = {
    "metrics": [m for m in metrics if user_can_access(m, user)],
    "dimensions": [d for d in dimensions if user_can_access(d, user)],
}
```
✅ Structured Prompts (prompts/)
```
System: You are a SQL generator. Use ONLY these metrics: [...]
User: {question}
Context: Available metrics, dimensions, sample values
Instructions: Generate SELECT only, use provided metric formulas
```
Gap Analysis:
- ⚠️ Need user-scoped semantic filtering (RLS at model level)
- ⚠️ Need column-level permissions (hide sensitive columns)
- ⚠️ Need value masking in prompts (PII/sensitive data)
- ⚠️ Need prompt size optimization (token limits with large schemas)
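The last gap, prompt size optimization, can begin as a simple budget cut over relevance-ranked context items. The ~4-chars-per-token estimate is a rough heuristic standing in for a real tokenizer; everything here is an illustrative sketch:

```python
def trim_context(items, budget_tokens, est=lambda s: max(1, len(s) // 4)):
    """Keep items (already ranked by relevance) until a rough token budget is hit.

    `est` approximates token count at ~4 characters per token; swap in a
    real tokenizer for production accuracy.
    """
    kept, used = [], 0
    for item in items:
        cost = est(item)
        if used + cost > budget_tokens:
            break
        kept.append(item)
        used += cost
    return kept
```

Because the cut is rank-ordered, the least relevant schema elements are the ones that fall off when a large model would overflow the context window.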
D) Platform Security Cannot Be Bypassed
ThoughtSpot Approach:
- Row-Level Security (RLS) enforced at table/model level
- Permissions apply to all derived content
- Cannot bypass via SQL generation tricks
- Security rules applied BEFORE query execution
DataTruth Implementation Status:
❌ Row-Level Security - NOT IMPLEMENTED
```python
# NEEDED: User context in every query
user_context = {
    "user_id": "...",
    "roles": ["analyst", "sales"],
    "permissions": {
        "tables": ["orders", "customers"],
        "columns": {"orders": ["id", "amount"]},  # Not "customer_ssn"
        "row_filters": {
            "orders": "region = 'US' AND created_date >= '2024-01-01'"
        },
    },
}
```
❌ Query Injection of RLS Filters - NOT IMPLEMENTED
```python
# NEEDED: Automatic filter injection
# User query: "Show all orders"
# Generated SQL: "SELECT * FROM orders WHERE region = 'US' LIMIT 100"
#                 ↑ Auto-injected based on user permissions
```
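One way to prototype that injection without parsing the generated SQL at all is to wrap it in a filtered subselect. This sketch is an assumption about approach, not existing code; it also assumes the inner query exposes the filtered columns (e.g. `region`):

```python
def inject_rls(sql, row_filters):
    """Naively enforce row filters by wrapping the query in a filtered subselect.

    Whatever the inner query returns is re-filtered, so the LLM cannot
    bypass the policy by phrasing the SQL differently. Limitation: the
    filter columns must appear in the inner query's projection.
    """
    conditions = " AND ".join(f"({c})" for c in row_filters)
    if not conditions:
        return sql
    return f"SELECT * FROM ({sql}) AS rls_scope WHERE {conditions}"
```

A production version would instead rewrite the AST (so filters push down to the right tables), but the wrapper is a useful correctness baseline to test against.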
❌ Authorization Cache - NOT IMPLEMENTED
```python
# NEEDED: Fast permission checks
authorization_cache = {
    "user_123": {
        "can_query_table": {"orders": True, "payroll": False},
        "can_access_metric": {"revenue": True, "cost": False},
    }
}
```
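A cache like the one above needs expiry and invalidation so permission changes take effect promptly. A minimal TTL wrapper around a permission-fetching callable might look like this (names and structure are illustrative assumptions):

```python
import time

class TTLAuthCache:
    """Cache permission lookups so hot-path checks avoid the policy store."""

    def __init__(self, fetch, ttl_seconds=300):
        self._fetch = fetch      # callable(user_id) -> permissions dict
        self._ttl = ttl_seconds
        self._entries = {}       # user_id -> (expires_at, permissions)

    def permissions(self, user_id):
        now = time.monotonic()
        entry = self._entries.get(user_id)
        if entry is None or entry[0] < now:
            perms = self._fetch(user_id)
            self._entries[user_id] = (now + self._ttl, perms)
            return perms
        return entry[1]

    def invalidate(self, user_id):
        """Drop a user's entry immediately (e.g. on role change)."""
        self._entries.pop(user_id, None)
```

The trade-off is staleness: a revoked permission can survive up to `ttl_seconds` unless `invalidate` is called from the role-change path.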
Gap Analysis:
- ❌ No user authentication/authorization framework
- ❌ No RLS policy engine
- ❌ No automatic filter injection
- ❌ No column-level masking
- ❌ No audit logging of data access
Implementation Roadmap
Phase 1: Core Security (Blocking Issues)
1. User Context Throughout Pipeline
```python
class QueryRequest(BaseModel):
    question: str
    user_id: str                            # NEW
    user_context: UserContext               # NEW - roles, permissions
    pagination: Optional[PaginationParams]
```
2. RLS Policy Engine
```python
class RLSPolicy(BaseModel):
    table: str
    filter_condition: str        # SQL WHERE clause
    applies_to_roles: List[str]

class RLSEngine:
    def apply_rls(self, sql: str, user_context: UserContext) -> str:
        """Inject RLS filters into SQL"""
```
3. Authorization Validator
```python
class AuthorizationValidator:
    def can_access_table(self, table: str, user: UserContext) -> bool: ...
    def can_access_column(self, table: str, column: str, user: UserContext) -> bool: ...
    def can_access_metric(self, metric: str, user: UserContext) -> bool: ...
```
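A minimal in-memory realization of that interface, backed by the `user_context` permission shape sketched earlier, could look like this. For simplicity it holds one user's permissions rather than taking a `UserContext` per call; all names are assumptions:

```python
class InMemoryAuthorizationValidator:
    """Toy validator over a single user's permission dict (illustrative only)."""

    def __init__(self, permissions):
        self._perms = permissions

    def can_access_table(self, table):
        return table in self._perms.get("tables", [])

    def can_access_column(self, table, column):
        cols = self._perms.get("columns", {}).get(table)
        # No explicit column list means the table-level decision applies.
        if cols is None:
            return self.can_access_table(table)
        return column in cols

    def can_access_metric(self, metric):
        return metric in self._perms.get("metrics", set())
```

Note the deny-by-default posture: anything absent from the permission dict is refused, which is the safe failure mode for an NLQ pipeline.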
Phase 2: Query Rewriting
4. Query Normalizer
```python
class QueryRewriter:
    def normalize(self, sql: str) -> str:
        """Normalize SQL to canonical form"""

    def inject_rls(self, sql: str, user: UserContext) -> str:
        """Inject row-level security filters"""

    def mask_columns(self, results: List[Dict], user: UserContext) -> List[Dict]:
        """Mask sensitive columns in results"""
```
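The `mask_columns` step can be sketched as a pure function over result rows. This is a simplified illustration (the placeholder string and masked-column set are assumptions); a real version would derive the masked set from column permissions:

```python
def mask_columns(rows, masked_columns, placeholder="***"):
    """Replace values in unauthorized columns rather than dropping them.

    Keeping the column (masked) makes it visible to the user that data
    was withheld, instead of silently changing the result shape.
    """
    return [
        {k: (placeholder if k in masked_columns else v) for k, v in row.items()}
        for row in rows
    ]
```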
5. Safe Query Executor
```python
class SecureQueryExecutor:
    def execute(self, sql: str, user_context: UserContext) -> Results:
        # 1. Validate user can execute query
        # 2. Rewrite SQL with RLS filters
        # 3. Execute in sandboxed connection
        # 4. Mask sensitive columns
        # 5. Audit log access
        ...
```
Phase 3: Semantic Filtering
6. User-Scoped Semantic Layer
```python
class UserScopedSemanticLoader:
    def get_context(self, user: UserContext) -> Dict:
        """Return only metrics/dimensions the user can access"""
        all_metrics = self.loader.get_all_metrics()
        all_dims = self.loader.get_all_dimensions()
        return {
            "metrics": [m for m in all_metrics if self.auth.can_access_metric(m, user)],
            "dimensions": [d for d in all_dims if self.auth.can_access_dimension(d, user)],
        }
```
7. Prompt Context Filter
```python
class SecurePromptBuilder:
    def build_prompt(self, question: str, user: UserContext) -> str:
        # Only include authorized elements
        # Mask sensitive sample values
        # Add user-specific constraints
        ...
```
Phase 4: Audit & Compliance
8. Query Audit Logger
```python
class QueryAuditLogger:
    def log_query(self,
                  user: UserContext,
                  nl_question: str,
                  generated_sql: str,
                  executed_sql: str,          # After RLS injection
                  results_returned: int,
                  sensitive_data_accessed: List[str]):
        """Log all query activity for compliance"""
```
9. Data Access Monitor
```python
class DataAccessMonitor:
    def track_access(self, user: str, table: str, columns: List[str]): ...
    def detect_anomalies(self) -> List[AnomalyAlert]: ...
    def generate_compliance_report(self, start_date: str, end_date: str): ...
```
Architecture Comparison
ThoughtSpot Architecture
```
User Question
      ↓
[Semantic Model Gate] ← User Permissions
      ↓
LLM (Constrained Prompt)
      ↓
Generated SQL
      ↓
[Query Rewriter] ← RLS Policies
      ↓
Internal Query Language
      ↓
[Security Layer] ← Cannot Bypass
      ↓
Database Execution
      ↓
[Result Masking] ← Column Permissions
      ↓
Results
```
DataTruth Current Architecture
```
User Question
      ↓
[Semantic Layer] ✅
      ↓
LLM (Structured Prompt) ✅
      ↓
Query Plan Validation ✅
      ↓
SQL Generation ✅
      ↓
SQL Validation (AST) ✅
      ↓
Database Execution ⚠️ (No RLS)
      ↓
Results ⚠️ (No masking)
```
DataTruth Target Architecture
```
User Question + User Context
      ↓
[User-Scoped Semantic Layer] ← Permissions Cache
      ↓
LLM (Filtered Prompt)
      ↓
Query Plan Validation ← Authorization Check
      ↓
SQL Generation
      ↓
SQL Validation (AST + Security)
      ↓
[Query Rewriter] ← RLS Injection
      ↓
[Authorization Check] ← Table/Column Permissions
      ↓
Database Execution (Sandboxed)
      ↓
[Result Masking] ← Column Permissions
      ↓
[Audit Logger]
      ↓
Results
```
Security Properties
What We Have (✅)
- Semantic Model Gate
- ✅ Semantic layer defines queryable elements
- ✅ Plan-level validation before SQL generation
- ✅ Schema validation (table/column existence)
- SQL Not Blindly Executed
- ✅ Multi-stage validation pipeline
- ✅ AST-based SQL parsing
- ✅ Security checks (injection, dangerous ops)
- ✅ Query plan abstraction
- Constrained LLM Context
- ✅ Structured prompts with instructions
- ✅ Filtered semantic context
- ✅ Sample value control
What We Need (❌)
- Row-Level Security
- ❌ User context in queries
- ❌ RLS policy engine
- ❌ Automatic filter injection
- ❌ Permission-based result filtering
- Query Rewriting
- ❌ SQL normalization
- ❌ RLS filter injection
- ❌ Column masking
- ❌ Query sanitization
- Authorization Framework
- ❌ User/role management
- ❌ Permission checks (table/column/metric)
- ❌ Authorization cache
- ❌ Permission inheritance
- Audit & Compliance
- ❌ Query audit logging
- ❌ Data access tracking
- ❌ Anomaly detection
- ❌ Compliance reporting
The “Secret Sauce” Summary
It’s NOT:
- ❌ Magical prompting
- ❌ Better LLM model
- ❌ More training data
- ❌ Clever SQL generation tricks
It IS:
- ✅ Semantic model that defines authorized scope
- ✅ Constrained generation with filtered context
- ✅ Query rewriting to enforce policies
- ✅ Security layer that cannot be bypassed
- ✅ Multi-gate validation at every stage
- ✅ Audit everything for compliance
Key Insight:
The LLM is a code generator, not a security boundary. Security must be enforced OUTSIDE the LLM, in the platform layer.
Next Steps
Immediate (Week 1)
- Design user context data model
- Implement authorization validator (table/column checks)
- Add user context to QueryRequest
- Update semantic loader to filter by permissions
Short-term (Month 1)
- Implement RLS policy engine
- Build query rewriter with filter injection
- Add column masking to results
- Create audit logging system
Medium-term (Quarter 1)
- User/role management UI
- RLS policy configuration UI
- Audit dashboard and compliance reports
- Performance optimization for permission checks
Long-term (Quarter 2+)
- Advanced RLS (dynamic filters, context-based)
- Data lineage tracking
- Anomaly detection in data access patterns
- Integration with enterprise identity providers (SSO, SAML)
References
- ThoughtSpot Sage Architecture
- Looker LookML Security Model
- Tableau Row-Level Security
- Power BI RLS Implementation
- Snowflake Row Access Policies
- Database RLS (PostgreSQL, Oracle VPD)