Schema Discovery

The pg_ai_query extension includes an intelligent schema discovery system that automatically analyzes your PostgreSQL database structure to provide context for AI-powered query generation. This page explains how schema discovery works and how you can optimize it.

Overview

Schema discovery is the process by which the extension examines your database to understand:

Tables and Views: What data structures exist
Columns and Data Types: What fields are available and their types
Relationships: How tables connect through foreign keys
Constraints: Primary keys, unique constraints, and check constraints
Indexes: Available indexes for query optimization
Statistics: Row counts and table sizes for context

This information is then used to generate accurate, well-structured SQL queries from your natural language requests.

How Schema Discovery Works

1. Automatic Triggering

Schema discovery happens automatically when:

The extension is first used in a PostgreSQL session
You explicitly call schema discovery functions
The AI needs context to generate a query

-- These actions trigger schema discovery:
SELECT generate_query('show users');  -- Automatic discovery
SELECT get_database_tables();         -- Explicit discovery
SELECT get_table_details('users');    -- Detailed table analysis

2. Discovery Process

The extension follows this process:

Step 1: Table Enumeration

-- Extension queries information_schema to find user tables
-- Equivalent to:
SELECT table_name, table_schema, table_type
FROM information_schema.tables
WHERE table_schema NOT IN ('information_schema', 'pg_catalog', 'pg_toast');

Step 2: Column Analysis

-- For each table, analyze columns
-- Equivalent to:
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'your_table';

Step 3: Constraint Discovery

-- Find primary keys, foreign keys, and unique constraints
-- Equivalent to:
SELECT constraint_name, constraint_type, column_name
FROM information_schema.table_constraints tc
JOIN information_schema.constraint_column_usage ccu
  ON tc.constraint_name = ccu.constraint_name;

Step 4: Relationship Mapping

-- Discover foreign key relationships
-- Equivalent to:
SELECT
  tc.table_name,
  kcu.column_name,
  ccu.table_name AS foreign_table_name,
  ccu.column_name AS foreign_column_name
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
  ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage ccu
  ON ccu.constraint_name = tc.constraint_name
WHERE constraint_type = 'FOREIGN KEY';

Step 5: Statistics Collection

-- Gather table statistics for optimization
-- Equivalent to:
SELECT
  schemaname,
  tablename,
  n_tup_ins,
  n_tup_upd,
  n_tup_del,
  n_live_tup
FROM pg_stat_user_tables;

3. Information Processing

The collected information is then:

Structured into JSON format for AI consumption
Cached in the PostgreSQL session for performance
Filtered to exclude system tables and sensitive information
Contextualized with business-friendly descriptions

Schema Discovery Functions

get_database_tables()

Returns high-level information about all tables in your database.

Usage:

SELECT get_database_tables();

Example Output:

[
  {
    "table_name": "users",
    "schema_name": "public",
    "table_type": "BASE TABLE",
    "estimated_rows": 1500,
    "table_size": "128 kB"
  },
  {
    "table_name": "orders",
    "schema_name": "public",
    "table_type": "BASE TABLE",
    "estimated_rows": 5000,
    "table_size": "512 kB"
  }
]

Information Provided:

Table names and schemas
Table type (BASE TABLE, VIEW, etc.)
Estimated row counts
Physical table sizes

get_table_details()

Returns comprehensive information about a specific table.

Usage:

SELECT get_table_details('table_name', 'schema_name');

Example Output:

{
  "table_name": "users",
  "schema_name": "public",
  "columns": [
    {
      "column_name": "id",
      "data_type": "integer",
      "is_nullable": false,
      "column_default": "nextval('users_id_seq'::regclass)",
      "is_primary_key": true,
      "is_unique": true
    },
    {
      "column_name": "email",
      "data_type": "character varying",
      "character_maximum_length": 150,
      "is_nullable": false,
      "is_unique": true
    },
    {
      "column_name": "created_at",
      "data_type": "timestamp without time zone",
      "is_nullable": true,
      "column_default": "CURRENT_TIMESTAMP"
    }
  ],
  "constraints": [
    {
      "constraint_name": "users_pkey",
      "constraint_type": "PRIMARY KEY",
      "column_names": ["id"]
    }
  ],
  "foreign_keys": [],
  "indexes": [
    {
      "index_name": "users_email_idx",
      "columns": ["email"],
      "is_unique": true
    }
  ]
}

Optimizing Schema Discovery

1. Table and Column Naming

Use clear, descriptive names that help the AI understand your data:

Good Examples:

-- Clear table names
CREATE TABLE customer_orders (...);
CREATE TABLE product_categories (...);
CREATE TABLE user_preferences (...);

-- Descriptive column names
CREATE TABLE users (
    user_id SERIAL PRIMARY KEY,
    full_name VARCHAR(100),
    email_address VARCHAR(150),
    registration_date TIMESTAMP,
    account_status VARCHAR(20)
);

Less Optimal:

-- Unclear abbreviations
CREATE TABLE usr (...);
CREATE TABLE prod_cat (...);

-- Ambiguous column names
CREATE TABLE users (
    id INT,
    nm VARCHAR(100),
    em VARCHAR(150),
    dt TIMESTAMP,
    st VARCHAR(20)
);

2. Adding Table Comments

Enhance schema discovery with descriptive comments:

-- Table-level comments
COMMENT ON TABLE users IS 'Customer information and account details';
COMMENT ON TABLE orders IS 'Customer purchase orders and transaction history';
COMMENT ON TABLE products IS 'Product catalog with pricing and inventory';

-- Column-level comments
COMMENT ON COLUMN users.account_status IS 'User account status: active, inactive, suspended';
COMMENT ON COLUMN orders.fulfillment_status IS 'Order processing status: pending, shipped, delivered, cancelled';

3. Proper Constraint Definition

Well-defined constraints help the AI understand data relationships:

-- Primary keys
ALTER TABLE users ADD CONSTRAINT users_pkey PRIMARY KEY (user_id);

-- Foreign keys with descriptive names
ALTER TABLE orders ADD CONSTRAINT fk_orders_customer
    FOREIGN KEY (customer_id) REFERENCES users(user_id);

-- Meaningful unique constraints
ALTER TABLE users ADD CONSTRAINT unique_user_email
    UNIQUE (email_address);

-- Check constraints with business logic
ALTER TABLE orders ADD CONSTRAINT check_order_amount_positive
    CHECK (total_amount > 0);

4. Strategic Indexing

Indexes help the AI understand query optimization opportunities:

-- Indexes on frequently queried columns
CREATE INDEX idx_users_registration_date ON users(registration_date);
CREATE INDEX idx_orders_order_date ON orders(order_date);

-- Composite indexes for common query patterns
CREATE INDEX idx_orders_customer_date ON orders(customer_id, order_date);

Schema Discovery Best Practices

1. Database Design

Normalized Structure: Use proper normalization to make relationships clear
Consistent Naming: Follow consistent naming conventions across tables
Foreign Key Relationships: Always define foreign key constraints
Meaningful Defaults: Use sensible default values where appropriate

2. Documentation

-- Document complex business rules
COMMENT ON TABLE order_items IS 'Individual line items within customer orders, linked to product catalog';
COMMENT ON COLUMN products.discontinued_date IS 'Date when product was discontinued; NULL means still active';

3. Data Types

Choose appropriate data types that convey meaning:

-- Good: Specific types
CREATE TABLE events (
    event_date DATE,              -- Clearly a date
    event_time TIME,              -- Clearly a time
    duration_minutes INTEGER,     -- Clearly numeric
    event_type VARCHAR(50),       -- Clearly textual with reasonable limit
    is_public BOOLEAN            -- Clearly boolean
);

-- Less clear: Generic types
CREATE TABLE events (
    event_date TEXT,              -- Could be anything
    event_time TEXT,              -- Ambiguous format
    duration_minutes TEXT,        -- Should be numeric
    event_type TEXT,              -- No length constraint
    is_public TEXT               -- Should be boolean
);

4. Schema Organization

Organize related tables logically:

-- Group related tables with consistent prefixes
CREATE TABLE user_accounts (...);
CREATE TABLE user_preferences (...);
CREATE TABLE user_sessions (...);

CREATE TABLE product_catalog (...);
CREATE TABLE product_categories (...);
CREATE TABLE product_reviews (...);

CREATE TABLE order_headers (...);
CREATE TABLE order_items (...);
CREATE TABLE order_shipments (...);

Performance Considerations

Schema Discovery Caching

Session-level Caching: Schema information is cached per PostgreSQL session
Cache Invalidation: Schema cache is cleared when the session ends
Refresh Triggers: Schema is re-analyzed if tables are modified

Optimization Tips

Minimize Table Count: Large numbers of tables slow down discovery
Use Views: Create views to present simplified interfaces to complex schemas
Partition Strategy: Consider table partitioning for very large tables
Statistics Updates: Keep table statistics current with ANALYZE

-- Update statistics for better schema discovery
ANALYZE users;
ANALYZE orders;

-- Or update all tables
ANALYZE;

Security Considerations

What Schema Discovery Accesses

✅ Accessed Information:

User table structures
Column names and types
Constraint definitions
Index information
Table statistics

❌ Not Accessed:

Actual data content
System catalog details
User passwords or sensitive data
Database configuration

Privacy Protection

The schema discovery system:

Only reads metadata, never actual data
Respects PostgreSQL’s permission system
Excludes system tables and schemas
Never transmits sensitive information to external APIs

Troubleshooting Schema Discovery

Common Issues

Problem: “No tables found”

-- Check if tables exist
SELECT table_name FROM information_schema.tables
WHERE table_schema = 'public' AND table_type = 'BASE TABLE';

-- Verify permissions
SELECT has_table_privilege('users', 'SELECT');

Problem: Missing relationships

-- Check foreign key constraints
SELECT
    tc.table_name,
    kcu.column_name,
    ccu.table_name AS foreign_table_name
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
  ON tc.constraint_name = kcu.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY';

Problem: Outdated statistics

-- Update table statistics
ANALYZE your_table_name;

-- Check last analyze time
SELECT schemaname, tablename, last_analyze
FROM pg_stat_user_tables;

Schema discovery is a powerful feature that makes your database “AI-ready” by providing the context needed for intelligent query generation. Following these best practices will help you get the most accurate and useful results from the pg_ai_query extension.

Keyboard shortcuts

PostgreSQL AI Query Extension