Enterprise NLQ→SQL Hardening: ThoughtSpot Patterns

How Big Players Secure LLM-Generated SQL

ThoughtSpot and similar enterprise BI platforms use a multi-layered security approach that goes far beyond prompt engineering. Here’s the architecture pattern and how DataTruth implements it.


The Four Pillars

A) Semantic Model as the Gate (Not the LLM)

ThoughtSpot Approach:

  • Heavy reliance on governed semantic layer
  • Model metadata defines what’s queryable
  • Joins, relationships, and constraints are pre-configured
  • LLM can only generate queries using authorized model elements

DataTruth Implementation:

Semantic Layer (src/semantic/)

# loader.py - Semantic model defines all queryable elements
semantic_context = {
    "metrics": [...],      # Pre-approved calculations
    "dimensions": [...],   # Approved grouping fields
    "relationships": [...], # Valid JOIN paths
    "synonyms": [...]      # NL term mappings
}

Plan-Level Validation (orchestrator_v2.py)

def _validate_plan(query_plan, semantic_context):
    """Validate BEFORE SQL generation"""
    errors = []

    # Check metrics exist
    available_metrics = {m["name"] for m in semantic_context.get("metrics", [])}
    for metric in query_plan.metrics:
        if metric.name not in available_metrics:
            errors.append(f"Unknown metric: {metric.name}")

    # Likewise: check dimensions exist, check filters reference valid fields
    # Only approved elements pass through; any error blocks SQL generation
    return errors

Gap Analysis:

  • ⚠️ Need explicit JOIN path validation (can only JOIN via pre-defined relationships)
  • ⚠️ Need calculation validation (derived metrics must use approved formulas)
  • ⚠️ Need data lineage tracking (audit what data feeds into results)
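
The first gap above (JOIN path validation) could be closed with a check like the following. This is a minimal sketch, not the shipped implementation: it assumes the semantic model's `relationships` entries carry `left`/`right` table names, and it treats the plan's join order as a simple chain (star-shaped joins would need a graph walk instead).

```python
# Hypothetical sketch: a plan may only JOIN tables along relationships
# declared in the semantic model. Field names ("left"/"right") are assumed.

def validate_join_paths(plan_tables, relationships):
    """Return errors for any adjacent table pair the model does not connect.

    plan_tables: ordered list of tables the query plan wants to join.
    relationships: semantic-model entries like {"left": "orders", "right": "customers"}.
    """
    allowed = {frozenset((r["left"], r["right"])) for r in relationships}
    errors = []
    # Simplification: checks consecutive pairs only; a real validator would
    # verify every join edge in the plan against the relationship graph.
    for left, right in zip(plan_tables, plan_tables[1:]):
        if frozenset((left, right)) not in allowed:
            errors.append(f"No declared relationship between {left} and {right}")
    return errors

relationships = [{"left": "orders", "right": "customers"}]
ok = validate_join_paths(["orders", "customers"], relationships)      # []
bad = validate_join_paths(["orders", "payroll"], relationships)       # 1 error
```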

B) Generated SQL Not Blindly Executed

ThoughtSpot Approach:

  • SQL generated by LLM is NOT executed directly
  • Translated to internal query language (enforces controls)
  • If SQL tries unauthorized access → no results returned
  • Query rewriting layer enforces all policies

DataTruth Implementation:

SQL Validation Layer (validator_v2.py)

class ProductionSQLValidator:
    def validate(self, sql: str) -> SQLValidationResult:
        # 1. Parse to AST
        # 2. Security checks (injection, dangerous ops)
        # 3. Structure validation (depth, complexity)
        # 4. Schema validation (table/column authorization)
        # 5. Performance validation
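
Step 2 of the validator could look something like the sketch below. It is a simplified regex stand-in for illustration only; the production check in validator_v2.py works on the parsed AST, which is the right place for this, since pattern matching alone can be evaded.

```python
import re

# Simplified stand-in for the "security checks" step: reject non-SELECT
# statements, dangerous operations, and statement chaining. Illustrative only;
# an AST-based check (step 1) is the authoritative mechanism.
DANGEROUS = re.compile(
    r"\b(DROP|DELETE|UPDATE|INSERT|ALTER|TRUNCATE|GRANT|REVOKE)\b", re.IGNORECASE
)

def security_check(sql: str) -> list:
    errors = []
    if not sql.lstrip().upper().startswith("SELECT"):
        errors.append("Only SELECT statements are allowed")
    match = DANGEROUS.search(sql)
    if match:
        errors.append(f"Dangerous operation: {match.group(0)}")
    # Any semicolon other than a trailing one suggests statement chaining
    if ";" in sql.rstrip().rstrip(";"):
        errors.append("Multiple statements are not allowed")
    return errors
```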

Query Plan Abstraction (planner/)

# SQL is not generated directly from NL question
# Flow: NL → Intent → QueryPlan → SQL
# Each stage has validation gates

Current Flow:

Natural Language Question
    ↓
Intent Extraction (LLM)
    ↓ [GATE 1: Plan Validation]
Query Plan
    ↓ [GATE 2: SQL Generation Rules]
Generated SQL
    ↓ [GATE 3: SQL Validation (AST + Security)]
Validated SQL
    ↓ [GATE 4: Execution]
Results
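
The gated flow above can be sketched as a chain of stages where any gate can block the query; nothing reaches execution unvalidated. All names here are illustrative toys, not the real pipeline functions.

```python
# Illustrative sketch of the multi-gate flow: each stage transforms the
# artifact or raises, so a rejection at any gate stops the pipeline.

class GateError(Exception):
    """Raised when a gate rejects the query."""

def run_pipeline(question, gates):
    """Pass the artifact through each (name, fn) gate in order."""
    artifact = question
    for _name, gate in gates:
        artifact = gate(artifact)
    return artifact

def extract_intent(question):      # stands in for the LLM stage
    return {"metric": "revenue", "question": question}

def validate_plan(plan):           # GATE 1
    if plan["metric"] != "revenue":
        raise GateError(f"unknown metric: {plan['metric']}")
    return plan

def generate_sql(plan):            # GATE 2 output (toy)
    return "SELECT SUM(revenue) FROM orders"

sql = run_pipeline("total revenue?", [
    ("intent", extract_intent),
    ("plan_validation", validate_plan),
    ("sql_generation", generate_sql),
])
```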

Gap Analysis:

  • ⚠️ Need query rewriting layer (normalize SQL before execution)
  • ⚠️ Need query plan caching with authorization metadata
  • ⚠️ Need audit log of what SQL was generated vs what was executed
  • ⚠️ Need SQL sanitization (even after validation)

C) Constrained LLM Prompt Context

ThoughtSpot Approach:

  • Explicitly select which columns/values/metadata go into prompt
  • Use structured instructions for SQL generation
  • Limit context to user’s authorized scope
  • Don’t send full schema → send filtered view

DataTruth Implementation:

Filtered Semantic Context (intent_extractor.py)

# Only send relevant metrics/dimensions to LLM
filtered_context = {
    "metrics": [m for m in metrics if user_can_access(m, user)],
    "dimensions": [d for d in dimensions if user_can_access(d, user)]
}

Structured Prompts (prompts/)

System: You are a SQL generator. Use ONLY these metrics: [...]
User: {question}
Context: Available metrics, dimensions, sample values
Instructions: Generate SELECT only, use provided metric formulas
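
Assembling that prompt from the filtered context might look like the sketch below. The field names mirror the `semantic_context` shape shown earlier and are assumptions; the real builder in prompts/ may differ.

```python
# Hedged sketch: build the constrained prompt from an already-filtered
# context, so only authorized elements ever reach the LLM.

def build_sql_prompt(question, filtered_context):
    metrics = ", ".join(m["name"] for m in filtered_context["metrics"])
    dimensions = ", ".join(d["name"] for d in filtered_context["dimensions"])
    return (
        f"You are a SQL generator. Use ONLY these metrics: {metrics}\n"
        f"Available dimensions: {dimensions}\n"
        "Generate a single SELECT statement using the provided metric formulas.\n"
        f"Question: {question}"
    )

ctx = {"metrics": [{"name": "revenue"}], "dimensions": [{"name": "region"}]}
prompt = build_sql_prompt("Revenue by region?", ctx)
```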

Gap Analysis:

  • ⚠️ Need user-scoped semantic filtering (RLS at model level)
  • ⚠️ Need column-level permissions (hide sensitive columns)
  • ⚠️ Need value masking in prompts (PII/sensitive data)
  • ⚠️ Need prompt size optimization (token limits with large schemas)

D) Platform Security Cannot Be Bypassed

ThoughtSpot Approach:

  • Row-Level Security (RLS) enforced at table/model level
  • Permissions apply to all derived content
  • Cannot bypass via SQL generation tricks
  • Security rules applied BEFORE query execution

DataTruth Implementation Status:

Row-Level Security - NOT IMPLEMENTED

# NEEDED: User context in every query
user_context = {
    "user_id": "...",
    "roles": ["analyst", "sales"],
    "permissions": {
        "tables": ["orders", "customers"],
        "columns": {"orders": ["id", "amount"]},  # Not "customer_ssn"
        "row_filters": {
            "orders": "region = 'US' AND created_date >= '2024-01-01'"
        }
    }
}

Query Injection of RLS Filters - NOT IMPLEMENTED

# NEEDED: Automatic filter injection
# User query: "Show all orders"
# Generated SQL: "SELECT * FROM orders WHERE region = 'US' LIMIT 100"
#                                           ↑ Auto-injected based on user permissions
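
The injection step above could be illustrated as follows. This string-level version exists only to show the idea; a real rewriter must operate on the parsed AST (handling subqueries, aliases, and existing predicates), which is why the query rewriting layer in Phase 2 matters.

```python
# Illustration only: append the user's row filter to generated SQL.
# NOT production-safe; real RLS injection must rewrite the AST.

def inject_rls(sql: str, row_filter: str) -> str:
    upper = sql.upper()
    if " WHERE " in upper:
        # AND the filter onto the existing predicate
        idx = upper.index(" WHERE ") + len(" WHERE ")
        return f"{sql[:idx]}({row_filter}) AND {sql[idx:]}"
    if " LIMIT " in upper:
        # Insert the filter before a trailing LIMIT clause
        idx = upper.index(" LIMIT ")
        return f"{sql[:idx]} WHERE {row_filter}{sql[idx:]}"
    return f"{sql} WHERE {row_filter}"

sql = inject_rls("SELECT * FROM orders LIMIT 100", "region = 'US'")
# → "SELECT * FROM orders WHERE region = 'US' LIMIT 100"
```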

Authorization Cache - NOT IMPLEMENTED

# NEEDED: Fast permission checks
authorization_cache = {
    "user_123": {
        "can_query_table": {"orders": True, "payroll": False},
        "can_access_metric": {"revenue": True, "cost": False}
    }
}
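
A minimal version of that cache, with a TTL so permission changes eventually propagate, could look like this. The loader callable and the TTL value are assumptions for illustration.

```python
import time

# Sketch of a per-user authorization cache: permission checks hit the policy
# store only when the cached entry is missing or expired.

class AuthorizationCache:
    def __init__(self, loader, ttl_seconds=300):
        self.loader = loader          # callable: user_id -> permissions dict
        self.ttl = ttl_seconds
        self._entries = {}            # user_id -> (expires_at, permissions)

    def get(self, user_id):
        entry = self._entries.get(user_id)
        if entry and entry[0] > time.time():
            return entry[1]
        permissions = self.loader(user_id)
        self._entries[user_id] = (time.time() + self.ttl, permissions)
        return permissions
```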

Gap Analysis:

  • ❌ No user authentication/authorization framework
  • ❌ No RLS policy engine
  • ❌ No automatic filter injection
  • ❌ No column-level masking
  • ❌ No audit logging of data access

Implementation Roadmap

Phase 1: Core Security (Blocking Issues)

1. User Context Throughout Pipeline

class QueryRequest(BaseModel):
    question: str
    user_id: str  # NEW
    user_context: UserContext  # NEW - roles, permissions
    pagination: Optional[PaginationParams]

2. RLS Policy Engine

class RLSPolicy(BaseModel):
    table: str
    filter_condition: str  # SQL WHERE clause
    applies_to_roles: List[str]

class RLSEngine:
    def apply_rls(self, sql: str, user_context: UserContext) -> str:
        """Inject RLS filters into SQL"""

3. Authorization Validator

class AuthorizationValidator:
    def can_access_table(self, table: str, user: UserContext) -> bool
    def can_access_column(self, table: str, column: str, user: UserContext) -> bool
    def can_access_metric(self, metric: str, user: UserContext) -> bool
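
A minimal sketch of the table/column checks, driven by the `user_context` permissions shape from section D. The dict layout is an assumption; binding permissions in the constructor (rather than passing `user` per call) is just a simplification for the sketch.

```python
# Minimal sketch of the authorization validator against the user_context
# shape shown earlier. Permission dict layout is assumed, not a fixed API.

class SimpleAuthorizationValidator:
    def __init__(self, permissions: dict):
        self.permissions = permissions

    def can_access_table(self, table: str) -> bool:
        return table in self.permissions.get("tables", [])

    def can_access_column(self, table: str, column: str) -> bool:
        allowed = self.permissions.get("columns", {}).get(table)
        # No column entry means "all columns of an allowed table"
        if allowed is None:
            return self.can_access_table(table)
        return column in allowed

auth = SimpleAuthorizationValidator({
    "tables": ["orders", "customers"],
    "columns": {"orders": ["id", "amount"]},   # not "customer_ssn"
})
```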

Phase 2: Query Rewriting

4. Query Normalizer

class QueryRewriter:
    def normalize(self, sql: str) -> str:
        """Normalize SQL to canonical form"""
    
    def inject_rls(self, sql: str, user: UserContext) -> str:
        """Inject row-level security filters"""
    
    def mask_columns(self, results: List[Dict], user: UserContext) -> List[Dict]:
        """Mask sensitive columns in results"""

5. Safe Query Executor

class SecureQueryExecutor:
    def execute(self, sql: str, user_context: UserContext) -> Results:
        # 1. Validate user can execute query
        # 2. Rewrite SQL with RLS filters
        # 3. Execute in sandboxed connection
        # 4. Mask sensitive columns
        # 5. Audit log access

Phase 3: Semantic Filtering

6. User-Scoped Semantic Layer

class UserScopedSemanticLoader:
    def get_context(self, user: UserContext) -> Dict:
        """Return only metrics/dimensions the user can access"""
        all_metrics = self.loader.get_all_metrics()
        all_dims = self.loader.get_all_dimensions()
        return {
            "metrics": [m for m in all_metrics if self.auth.can_access_metric(m, user)],
            "dimensions": [d for d in all_dims if self.auth.can_access_dimension(d, user)]
        }

7. Prompt Context Filter

class SecurePromptBuilder:
    def build_prompt(self, question: str, user: UserContext) -> str:
        # Only include authorized elements
        # Mask sensitive sample values
        # Add user-specific constraints

Phase 4: Audit & Compliance

8. Query Audit Logger

class QueryAuditLogger:
    def log_query(self, 
                  user: UserContext, 
                  nl_question: str,
                  generated_sql: str,
                  executed_sql: str,  # After RLS injection
                  results_returned: int,
                  sensitive_data_accessed: List[str]):
        """Log all query activity for compliance"""

9. Data Access Monitor

class DataAccessMonitor:
    def track_access(self, user: str, table: str, columns: List[str])
    def detect_anomalies(self) -> List[AnomalyAlert]
    def generate_compliance_report(self, start_date: str, end_date: str)

Architecture Comparison

ThoughtSpot Architecture

User Question
    ↓
[Semantic Model Gate] ← User Permissions
    ↓
LLM (Constrained Prompt)
    ↓
Generated SQL
    ↓
[Query Rewriter] ← RLS Policies
    ↓
Internal Query Language
    ↓
[Security Layer] ← Cannot Bypass
    ↓
Database Execution
    ↓
[Result Masking] ← Column Permissions
    ↓
Results

DataTruth Current Architecture

User Question
    ↓
[Semantic Layer] ✅
    ↓
LLM (Structured Prompt) ✅
    ↓
Query Plan Validation ✅
    ↓
SQL Generation ✅
    ↓
SQL Validation (AST) ✅
    ↓
Database Execution ⚠️ (No RLS)
    ↓
Results ⚠️ (No masking)

DataTruth Target Architecture

User Question + User Context
    ↓
[User-Scoped Semantic Layer] ← Permissions Cache
    ↓
LLM (Filtered Prompt)
    ↓
Query Plan Validation ← Authorization Check
    ↓
SQL Generation
    ↓
SQL Validation (AST + Security)
    ↓
[Query Rewriter] ← RLS Injection
    ↓
[Authorization Check] ← Table/Column Permissions
    ↓
Database Execution (Sandboxed)
    ↓
[Result Masking] ← Column Permissions
    ↓
[Audit Logger]
    ↓
Results

Security Properties

What We Have (✅)

  1. Semantic Model Gate
    • ✅ Semantic layer defines queryable elements
    • ✅ Plan-level validation before SQL generation
    • ✅ Schema validation (table/column existence)
  2. SQL Not Blindly Executed
    • ✅ Multi-stage validation pipeline
    • ✅ AST-based SQL parsing
    • ✅ Security checks (injection, dangerous ops)
    • ✅ Query plan abstraction
  3. Constrained LLM Context
    • ✅ Structured prompts with instructions
    • ✅ Filtered semantic context
    • ✅ Sample value control

What We Need (❌)

  1. Row-Level Security
    • ❌ User context in queries
    • ❌ RLS policy engine
    • ❌ Automatic filter injection
    • ❌ Permission-based result filtering
  2. Query Rewriting
    • ❌ SQL normalization
    • ❌ RLS filter injection
    • ❌ Column masking
    • ❌ Query sanitization
  3. Authorization Framework
    • ❌ User/role management
    • ❌ Permission checks (table/column/metric)
    • ❌ Authorization cache
    • ❌ Permission inheritance
  4. Audit & Compliance
    • ❌ Query audit logging
    • ❌ Data access tracking
    • ❌ Anomaly detection
    • ❌ Compliance reporting

The “Secret Sauce” Summary

It’s NOT:

  • ❌ Magical prompting
  • ❌ Better LLM model
  • ❌ More training data
  • ❌ Clever SQL generation tricks

It IS:

  • Semantic model that defines authorized scope
  • Constrained generation with filtered context
  • Query rewriting to enforce policies
  • Security layer that cannot be bypassed
  • Multi-gate validation at every stage
  • Audit everything for compliance

Key Insight:

The LLM is a code generator, not a security boundary. Security must be enforced OUTSIDE the LLM, in the platform layer.


Next Steps

Immediate (Week 1)

  1. Design user context data model
  2. Implement authorization validator (table/column checks)
  3. Add user context to QueryRequest
  4. Update semantic loader to filter by permissions

Short-term (Month 1)

  1. Implement RLS policy engine
  2. Build query rewriter with filter injection
  3. Add column masking to results
  4. Create audit logging system

Medium-term (Quarter 1)

  1. User/role management UI
  2. RLS policy configuration UI
  3. Audit dashboard and compliance reports
  4. Performance optimization for permission checks

Long-term (Quarter 2+)

  1. Advanced RLS (dynamic filters, context-based)
  2. Data lineage tracking
  3. Anomaly detection in data access patterns
  4. Integration with enterprise identity providers (SSO, SAML)

References

  • ThoughtSpot Sage Architecture
  • Looker LookML Security Model
  • Tableau Row-Level Security
  • Power BI RLS Implementation
  • Snowflake Row Access Policies
  • Database RLS (PostgreSQL, Oracle VPD)

DataTruth © 2025. Built with ♥ using FastAPI, React, and OpenAI.