Building an AI-Ready Data Foundation

Why Your Modernisation Strategy Must Account for AI Before You Need It

The architectural decisions you make today will determine whether you can adopt AI in 12 months — or spend 18 months rebuilding to get there.

Introduction

“In the last 18 months, I have watched three enterprise customers complete successful cloud modernisations — and immediately begin re-architecting for AI. Not because the modernisation failed. Because it succeeded in solving the problem they had in 2024, without anticipating the problem they would have in 2026. The CRM we containerised on ECS Fargate runs beautifully. It just cannot feed a Bedrock knowledge base without a 6-month data engineering project that nobody budgeted for. That gap is preventable. This post is about how to prevent it.”

1. The Modernisation Debt You Don’t See Coming

Every enterprise I work with in presales is having two conversations simultaneously: “How do we modernise our legacy platforms?” and “How do we adopt AI?” The problem is that these conversations happen in different rooms, with different stakeholders, different timelines, and different budgets.

The result is predictable: organisations invest 12–18 months modernising their infrastructure (containerising applications, automating deployments, implementing observability) and then discover that their freshly modernised platform is fundamentally unprepared for AI workloads. The data is in the wrong format, the wrong place, or the wrong structure.

The lesson is clear: modernisation without AI-readiness is incomplete modernisation. Not because every enterprise needs AI today — but because the cost of retrofitting AI-ready data patterns later is 3–5x higher than embedding them from the start.

2. Why This Matters Now: The Data Foundation Is the AI Bottleneck

The market has shifted. AI is no longer an R&D experiment — it’s a board-level strategic priority. But the gap between AI ambition and AI execution is almost always a data problem, not a model problem.

According to Gartner’s February 2025 research, the data foundation gap is so severe that organisations will abandon 60% of AI projects unsupported by AI-ready data through 2026 (Gartner, “Lack of AI-Ready Data Puts AI Projects at Risk,” February 2025). The differentiator is not which model you choose. It’s whether your data is structured, accessible, and governed in a way that AI systems can consume.

Three market forces are making AI-ready data foundations urgent:

Agentic AI demands structured, accessible enterprise data — Gartner predicts that 40% of enterprise applications will feature task-specific AI agents by the end of 2026, up from less than 5% in 2025 (Gartner, August 2025). These agents need semantically indexed, permission-aware, real-time data access. If your data isn’t ready, your agents are useless.
The cost of retrofitting multiplies — Adding vector search, embedding pipelines, and governance to an existing platform means reworking every layer. Designing for them during modernisation adds ~15% to initial project cost. Retrofitting later costs 3–5x that amount.
Competitive differentiation is shifting to data quality — Every enterprise has access to the same foundation models. The organisations that win are those whose proprietary data is clean, structured, and accessible to AI systems. Your data is your moat — but only if it’s AI-ready.

3. What “AI-Ready” Actually Means: A Practical Definition

“AI-ready data” is not a marketing term — it’s a set of specific architectural properties that determine whether your data can be consumed by AI systems without significant re-engineering.

Property	What It Means	Why AI Needs It
Structured & catalogued	Data has metadata, schemas, and lineage tracking	AI systems need to discover and understand data without human interpretation
Semantically searchable	Data can be queried by meaning, not just keywords	RAG, agents, and copilots search by intent, not exact match
Embeddable	Data can be converted to vector representations	Foundation models consume embeddings, not raw database rows
Governed & permissioned	Access controls, retention policies, PII classification	AI systems must respect the same data boundaries as humans
Fresh & synchronised	Data reflects current state, not stale snapshots	AI answers are only as good as the data they’re grounded in
Multi-modal accessible	Text, documents, images, structured data all queryable	Modern AI is multi-modal — your data layer should be too
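
One way to make this table actionable is to record, per dataset, which properties it already satisfies. The sketch below is illustrative only: the class and field names are assumptions introduced here, not part of any AWS service.

```python
from dataclasses import dataclass, fields


@dataclass
class ReadinessCheck:
    """One flag per AI-readiness property (names are illustrative)."""
    catalogued: bool
    semantically_searchable: bool
    embeddable: bool
    governed: bool
    fresh: bool
    multi_modal: bool


def readiness_gaps(check: ReadinessCheck) -> list[str]:
    """Return the names of the properties a dataset does not yet satisfy."""
    return [f.name for f in fields(check) if not getattr(check, f.name)]


# Example: a modernised CRM database with IAM in place but no catalogue or vector index
crm = ReadinessCheck(
    catalogued=False,
    semantically_searchable=False,
    embeddable=True,
    governed=True,
    fresh=True,
    multi_modal=False,
)
print(readiness_gaps(crm))
```

Run against each platform in scope, a list like this gives a rough picture of where retrofit effort would concentrate.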

4. The Architecture: AI-Ready Data Foundation on AWS

The reference architecture below connects the four layers that together form an AI-ready data foundation. The layers are sequential — data flows from enterprise sources through ingestion and storage into AI consumption — with governance enforced at every stage.

[Figure 1: AI-ready data foundation on AWS — four-layer architecture]

Layer 1: Data Ingestion & Integration

The problem this solves: Enterprise data lives in dozens of sources — CRM databases, ERP systems, document repositories, SaaS applications, operational logs. AI systems need unified access without building point-to-point integrations for each source.

AWS Glue — Serverless ETL for batch data integration, schema discovery, and data cataloguing
Amazon Kinesis Data Streams — Real-time data ingestion for operational events
AWS Glue Data Catalog — Centralised metadata repository that makes data discoverable to AI systems
Zero-ETL integrations — Direct data flow between operational databases and analytics environments without pipeline management

Strategic decision: Why Glue + Zero-ETL over custom pipelines?

Custom ETL pipelines give you maximum control — but they also give you maximum maintenance burden. AWS Glue handles the undifferentiated work — schema discovery, job scheduling, error handling, auto-scaling — while Zero-ETL integrations eliminate pipelines entirely for supported source-destination pairs.

The real value for AI readiness: Glue Data Catalog creates a metadata layer that AI systems can query to understand what data exists, where it lives, and what it means — without human intervention. This is the foundation for autonomous AI agents that can discover and access enterprise data on their own.
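
To make the “metadata layer that AI systems can query” concrete, here is a minimal sketch that turns a Glue `get_tables` response into a plain-text summary an agent could read. The helper names are assumptions made for illustration, and the live call reads only the first page of results; a production version would follow `NextToken`.

```python
def describe_tables(get_tables_response: dict) -> str:
    """Render a Glue get_tables response as plain text an agent can read."""
    lines = []
    for table in get_tables_response.get("TableList", []):
        columns = table.get("StorageDescriptor", {}).get("Columns", [])
        column_text = ", ".join(f"{c['Name']} ({c['Type']})" for c in columns)
        description = table.get("Description", "no description")
        lines.append(f"{table['Name']}: {description}. Columns: {column_text}")
    return "\n".join(lines)


def fetch_catalog_summary(database: str) -> str:
    """Summarise one catalog database (needs AWS credentials; first page only)."""
    import boto3  # imported lazily so describe_tables works without the AWS SDK

    glue = boto3.client("glue")
    return describe_tables(glue.get_tables(DatabaseName=database))


# Offline example using the documented response shape
sample = {
    "TableList": [{
        "Name": "customer_interactions",
        "Description": "Support tickets and complaints",
        "StorageDescriptor": {"Columns": [
            {"Name": "customer_id", "Type": "string"},
            {"Name": "body", "Type": "string"},
        ]},
    }]
}
print(describe_tables(sample))
```

The quality of the output depends entirely on how well tables and columns are described in the catalog, which is why cataloguing discipline matters during ingestion.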

PYTHON

# AWS Glue job: select and type CRM interaction fields, then write Parquet for downstream AI pipelines
# (embedding generation is a separate downstream step, not part of this job)
import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
# JOB_NAME is supplied by Glue at run time
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)
# Read from CRM database via Glue Data Catalog
crm_data = glueContext.create_dynamic_frame.from_catalog(
    database="crm_database",
    table_name="customer_interactions"
)
# Transform: keep only the fields AI consumers need and enforce explicit types
transformed = crm_data.apply_mapping([
    ("customer_id",       "string",    "customer_id",       "string"),
    ("interaction_date",  "timestamp", "interaction_date",  "timestamp"),
    ("interaction_type",  "string",    "interaction_type",  "string"),
    ("subject",           "string",    "subject",           "string"),
    ("body",              "string",    "body",              "string"),
    ("resolution_status", "string",    "resolution_status", "string")
])
# Write to S3 in Parquet, partitioned so analytics and embedding pipelines can both read it
glueContext.write_dynamic_frame.from_options(
    frame=transformed,
    connection_type="s3",
    connection_options={
        "path": "s3://crm-data-lake/ai-ready/customer-interactions/",
        "partitionKeys": ["interaction_type"]
    },
    format="parquet"
)
job.commit()

Key design decision: Writing Parquet partitioned by interaction type serves two purposes. Analytics tools (Athena, Redshift Spectrum) can query it efficiently, and embedding generation pipelines can process it in parallel, one partition at a time. One write, two consumption patterns.
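
The embedding step that consumes these Parquet partitions might look like the sketch below. It assumes the Titan Text Embeddings V2 model referenced elsewhere in this post and its `inputText` / `dimensions` / `normalize` request fields; the helper names and the choice to join `subject` and `body` are illustrative assumptions, not a prescribed pipeline.

```python
import json


def build_embedding_text(record: dict) -> str:
    """Concatenate the free-text fields of one CRM interaction into a single passage."""
    subject = (record.get("subject") or "").strip()
    body = (record.get("body") or "").strip()
    return f"{subject}\n{body}".strip()


def titan_request_body(text: str, dimensions: int = 1024) -> str:
    """JSON request body for amazon.titan-embed-text-v2:0."""
    return json.dumps({"inputText": text, "dimensions": dimensions, "normalize": True})


def embed(text: str, region: str = "ap-south-1") -> list:
    """Call Bedrock for one embedding (needs AWS credentials and model access)."""
    import boto3  # lazy import so the helpers above run without the AWS SDK

    client = boto3.client("bedrock-runtime", region_name=region)
    response = client.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=titan_request_body(text),
    )
    return json.loads(response["body"].read())["embedding"]
```

Because each partition can be read independently, one worker per `interaction_type` partition can call `embed` without coordinating with the others.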

Layer 2: Vector Storage & Semantic Search

Traditional databases answer “find records where customer_id = 12345.” AI systems need to answer “find interactions similar to this customer complaint about delivery delays.” That requires vector representations of your data — embeddings that capture semantic meaning.

Option	Best For	Trade-off
S3 Vectors	Massive scale (billions of vectors), cost-sensitive workloads, AI agent memory	Lowest cost per vector at the highest scale, but higher query latency than dedicated vector databases; GA since Dec 2025
Aurora pgvector	Applications already using PostgreSQL, need transactional + vector in one DB	Familiar tooling, but vector performance limited at very large scale
OpenSearch	Hybrid search (keyword + semantic), log analytics + AI	Excellent search, but higher operational cost
Bedrock Knowledge Bases	Fastest time-to-value, fully managed RAG, no infrastructure management	Least control, but zero operational burden

My recommendation: Start with Bedrock Knowledge Bases for your first AI use case — it gets you from zero to working RAG in days, not months. Then evaluate S3 Vectors or Aurora pgvector for production workloads where you need more control. The mistake I see most often: teams spending 3 months evaluating vector databases before validating that their AI use case delivers business value.

BASH

# Create an S3 Vectors vector index for CRM customer interaction embeddings
aws s3vectors create-vector-bucket 
  --vector-bucket-name crm-ai-embeddings
aws s3vectors create-vector-index 
  --vector-bucket-name crm-ai-embeddings 
  --vector-index-name customer-interactions 
  --dimension 1024 
  --distance-metric cosine 
  --metadata-configuration '{
    "fields": {
      "customer_id":        {"dataType": "str"},
      "interaction_type":   {"dataType": "str"},
      "interaction_date":   {"dataType": "str"},
      "resolution_status":  {"dataType": "str"}
    }
  }'

Metadata is the critical AI-readiness pattern here. Attaching structured metadata (customer ID, interaction type, interaction date, resolution status) to each vector enables filtered vector search, for example “find similar complaints, but only unresolved ones for this customer.” Without metadata, vector search returns semantically similar results with no way to filter by business context.
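
Filtered vector search can be illustrated without any AWS service. The sketch below is a toy in-memory version with made-up three-dimensional vectors: it first narrows candidates by metadata, then ranks the survivors by cosine similarity. It shows the idea only; it is not how S3 Vectors, OpenSearch, or pgvector implement it.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(query_vector, items, metadata_filter, top_k=3):
    """Rank only the items whose metadata matches every key in metadata_filter."""
    candidates = [
        item for item in items
        if all(item["metadata"].get(k) == v for k, v in metadata_filter.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda item: cosine_similarity(query_vector, item["vector"]),
        reverse=True,
    )
    return [item["id"] for item in ranked[:top_k]]


# Toy 3-dimensional vectors; real embeddings would have 1024 dimensions
items = [
    {"id": "t1", "vector": [0.9, 0.1, 0.0], "metadata": {"interaction_type": "complaint"}},
    {"id": "t2", "vector": [0.8, 0.2, 0.1], "metadata": {"interaction_type": "inquiry"}},
    {"id": "t3", "vector": [0.1, 0.9, 0.0], "metadata": {"interaction_type": "complaint"}},
]
print(filtered_search([1.0, 0.0, 0.0], items, {"interaction_type": "complaint"}))
# → ['t1', 't3']
```

Note that `t2` is semantically close to the query but is excluded because its metadata does not match, which is exactly the business-context filtering that plain similarity search lacks.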

Layer 3: AI Consumption — Knowledge Bases & Agents

Raw data and vector embeddings are not useful to end users. The AI consumption layer connects your data foundation to the applications and agents that deliver business value.

Amazon Bedrock Knowledge Bases — Managed RAG that connects foundation models to your enterprise data
Amazon Bedrock AgentCore — Platform for building, deploying, and managing AI agents at scale (launched 2025). Provides memory, tool execution, and multi-agent orchestration.
Amazon Q Business — Enterprise AI assistant that connects to corporate data sources with zero custom development. Unlike Bedrock Knowledge Bases (which requires developer effort to build an application layer), Q Business provides a ready-made conversational interface for non-technical employees — think of it as “enterprise ChatGPT over your internal data” that IT can deploy in days without writing application code.
Amazon Bedrock Guardrails — Content filtering, topic blocking, and PII redaction for AI outputs
AWS Lambda — Serverless compute for AI orchestration and custom logic

JSON

{
  "roleArn": "arn:aws:iam::<account-id>:role/<knowledge-base-service-role>",
  "name": "crm-customer-knowledge-base",
  "knowledgeBaseConfiguration": {
    "type": "VECTOR",
    "vectorKnowledgeBaseConfiguration": {
      "embeddingModelArn": "arn:aws:bedrock:ap-south-1::foundation-model/amazon.titan-embed-text-v2:0"
    }
  },
  "storageConfiguration": {
    "type": "OPENSEARCH_SERVERLESS",
    "opensearchServerlessConfiguration": {
      "collectionArn": "arn:aws:aoss:ap-south-1:<account-id>:collection/<collection-id>",
      "vectorIndexName": "crm-interactions-index",
      "fieldMapping": {
        "vectorField": "embedding",
        "textField": "content",
        "metadataField": "metadata"
      }
    }
  }
}
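
Once a knowledge base like this exists, an application can retrieve grounded passages from it. The following is a minimal sketch using the `bedrock-agent-runtime` `retrieve` call with a single metadata filter. The function names are assumptions for illustration, and the filter key `interaction_type` assumes that field was ingested as document metadata.

```python
def build_retrieve_request(kb_id: str, question: str, interaction_type: str, top_k: int = 5) -> dict:
    """Keyword arguments for bedrock-agent-runtime retrieve(), with one metadata filter."""
    return {
        "knowledgeBaseId": kb_id,
        "retrievalQuery": {"text": question},
        "retrievalConfiguration": {
            "vectorSearchConfiguration": {
                "numberOfResults": top_k,
                "filter": {"equals": {"key": "interaction_type", "value": interaction_type}},
            }
        },
    }


def retrieve_passages(kb_id: str, question: str, interaction_type: str) -> list:
    """Run the retrieval (needs AWS credentials and an existing knowledge base)."""
    import boto3  # lazy import so the request builder runs without the AWS SDK

    client = boto3.client("bedrock-agent-runtime")
    response = client.retrieve(**build_retrieve_request(kb_id, question, interaction_type))
    return [result["content"]["text"] for result in response["retrievalResults"]]
```

Keeping the request builder separate from the network call makes the filter logic easy to unit test before any AWS resources exist.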

Bedrock AgentCore — Orchestrating AI Agents Over Your Data Foundation:

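AgentCore provides the runtime for AI agents that can autonomously discover, reason over, and act on enterprise data. Here’s a simplified, illustrative agent definition that connects to the CRM knowledge base. The field names follow the Amazon Bedrock Agents request shape; in a real deployment, the knowledge base and action groups are associated through separate API calls and referenced by their generated IDs: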

JSON

{
  "agentName": "crm-customer-insights-agent",
  "foundationModel": "anthropic.claude-sonnet-4-20250514-v1:0",
  "instruction": "You are a customer insights agent. Use the CRM knowledge base to answer questions about customer interaction history, resolution patterns, and service trends. Always cite specific interaction records when providing answers.",
  "idleSessionTTLInSeconds": 1800,
  "knowledgeBases": [
    {
      "knowledgeBaseId": "crm-customer-knowledge-base",
      "description": "CRM customer interaction history including support tickets, complaints, and resolutions"
    }
  ],
  "actionGroups": [
    {
      "actionGroupName": "CRMActions",
      "description": "Actions for retrieving and summarising CRM data",
      "actionGroupExecutor": {
        "lambda": "arn:aws:lambda:ap-south-1:<account-id>:function:crm-agent-actions"
      }
    }
  ],
  "memoryConfiguration": {
    "enabledMemoryTypes": ["SESSION_SUMMARY"],
    "storageDays": 30
  }
}

This agent configuration demonstrates the key agent capabilities: it connects to the knowledge base for RAG-grounded answers, has action groups for executing business logic (e.g., creating follow-up tickets), and uses session-summary memory, retained for 30 days, so conversations maintain context across sessions.
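
The Lambda function behind an action group such as `CRMActions` could be sketched as follows. This assumes the event and response shapes used by Bedrock Agents action groups defined with function details (a `parameters` list in, a `functionResponse` with a `TEXT` body out); verify them against the current AWS documentation before relying on them. The function name `create_follow_up_ticket` and the returned payload are hypothetical.

```python
import json


def lambda_handler(event, context):
    """Handle one action-group invocation and return the result as text."""
    # Parameters arrive as a list of {"name", "type", "value"} entries
    params = {p["name"]: p["value"] for p in event.get("parameters", [])}

    if event.get("function") == "create_follow_up_ticket":  # hypothetical function name
        # A real implementation would call the CRM's ticketing API here
        result = {"status": "created", "customer_id": params.get("customer_id")}
    else:
        result = {"status": "unsupported", "function": event.get("function")}

    return {
        "messageVersion": event.get("messageVersion", "1.0"),
        "response": {
            "actionGroup": event.get("actionGroup"),
            "function": event.get("function"),
            "functionResponse": {
                "responseBody": {"TEXT": {"body": json.dumps(result)}}
            },
        },
    }
```

Returning an explicit "unsupported" result, rather than raising, lets the agent explain the failure to the user instead of ending the session with an error.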

A customer support agent can now ask “What similar issues have we resolved for this customer segment?” and get answers grounded in actual CRM interaction history — not hallucinated responses from a general-purpose model. Meanwhile, Amazon Q Business gives non-technical employees the same AI-powered access through a conversational interface — no custom application development required.

Layer 4: Governance & Security

AI systems that access enterprise data must respect the same access controls, data classification, and audit requirements as human users. Without governance, AI becomes a data exfiltration risk.

The most common AI governance failure: An AI system is deployed with access to a broad data lake, and six months later someone realises it can surface PII from HR records in customer-facing responses. Retrofitting governance after deployment means re-engineering data access patterns, re-testing AI outputs, and potentially recalling responses that already reached end users.

Designing governance from the start means:

Data classification — happens during ingestion (Layer 1), not after AI deployment
Access controls — inherited by AI services through IAM roles — the same permission model your human users follow
PII identification — tagged by Amazon Macie before it enters the vector store
Bedrock Guardrails — filter AI outputs at the application layer as a defence-in-depth measure

JSON

{
  "blockedInputMessaging": "Sorry, I can't help with that request.",
  "blockedOutputsMessaging": "Sorry, I can't share that information.",
  "name": "crm-ai-guardrail",
  "sensitiveInformationPolicyConfig": {
    "piiEntitiesConfig": [
      {"type": "EMAIL",   "action": "ANONYMIZE"},
      {"type": "PHONE",   "action": "ANONYMIZE"},
      {"type": "NAME",    "action": "ANONYMIZE"},
      {"type": "ADDRESS", "action": "BLOCK"}
    ]
  },
  "topicPolicyConfig": {
    "topicsConfig": [{
      "name": "internal-financials",
      "definition": "Questions about company revenue, margins, or financial performance",
      "type": "DENY"
    }]
  }
}

This guardrail configuration demonstrates defence-in-depth: even if the underlying data contains PII, the AI output layer anonymises or blocks sensitive information before it reaches the end user.
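
Output filtering is the last line of defence; the earlier one is tagging PII before text reaches the vector store. The sketch below illustrates that step with two simple regular expressions. It is a stand-in written for this post, not Amazon Macie's API, and the patterns are deliberately naive: real classification needs far broader coverage.

```python
import re

# Naive illustrative patterns; production PII detection needs much more than this
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> tuple[str, list[str]]:
    """Mask emails and phone numbers; return the masked text and the PII types found."""
    found = []
    if EMAIL.search(text):
        found.append("EMAIL")
        text = EMAIL.sub("{EMAIL}", text)
    if PHONE.search(text):
        found.append("PHONE")
        text = PHONE.sub("{PHONE}", text)
    return text, found


masked, tags = redact_pii("Call me on +91 98765 43210 or mail priya@example.com")
print(masked)  # → Call me on {PHONE} or mail {EMAIL}
print(tags)    # → ['EMAIL', 'PHONE']
```

The returned tags can be stored as vector metadata, so that retrieval can exclude PII-bearing passages for audiences that should never see them.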

5. The CRM Modernisation Revisited: What We’d Add for AI-Readiness

In my previous posts, I documented the modernisation of an enterprise CRM platform — containerisation, CI/CD automation, observability, security hardening. That architecture is solid for operational excellence. But if we were designing it today with AI-readiness as a requirement, here’s what would change:

Layer	Current State	AI-Ready Addition	Business Value
Data storage	RDS MySQL (transactional)	+ S3 data lake with Parquet exports	Analytics + AI consumption without impacting production DB
Search	SQL queries only	+ OpenSearch with vector search	Natural language search across CRM records
Knowledge access	Manual reports, dashboards	+ Bedrock Knowledge Base over CRM data	AI-powered customer insights on demand
Employee AI access	None	+ Amazon Q Business connected to CRM data	Non-technical staff get AI answers without custom apps
Data governance	IAM + encryption	+ Lake Formation + Macie + Guardrails	AI-safe data access with PII protection
Integration	Direct database queries	+ Glue ETL + Data Catalog	Discoverable, catalogued data for AI agents

The cost of adding this later vs. now

If we add these capabilities as a retrofit in 12 months, the estimated effort is 8–12 weeks of engineering work — because we’d need to build data export pipelines from production RDS, design a new data lake schema, implement embedding generation, and add governance controls that don’t exist today.

If we’d included them in the original design, the incremental effort would have been 2–3 weeks, because the data flows, access patterns, and governance model would have been designed holistically from the start. That roughly 4x cost multiplier (8–12 weeks versus 2–3 weeks) is the AI-readiness tax that enterprises pay when they treat modernisation and AI as separate initiatives.
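
The multiplier can be checked directly from the two effort ranges. This small sketch (the function name is an assumption for illustration) computes the best case, the midpoint, and the worst case:

```python
def retrofit_multiplier(retrofit_weeks: tuple, upfront_weeks: tuple) -> tuple:
    """Return (low, midpoint, high) ratios of retrofit effort to upfront effort."""
    low = retrofit_weeks[0] / upfront_weeks[1]   # cheapest retrofit vs. costliest upfront
    high = retrofit_weeks[1] / upfront_weeks[0]  # costliest retrofit vs. cheapest upfront
    mid = (sum(retrofit_weeks) / 2) / (sum(upfront_weeks) / 2)
    return round(low, 1), round(mid, 1), round(high, 1)


print(retrofit_multiplier((8, 12), (2, 3)))
# → (2.7, 4.0, 6.0)
```

The midpoint of 4x sits inside the 3–5x range quoted earlier, while the extremes show how much the outcome depends on where in each range a given project lands.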

6. A Presales Perspective: How to Position AI-Ready Data in Customer Conversations

In presales engagements, the AI-readiness conversation has become the most effective way to expand modernisation scope — not by overselling AI, but by helping customers avoid a predictable and expensive mistake.

The conversation I have most often goes like this:

Customer: “We want to modernise our CRM/ERP/portal. We’re not thinking about AI yet — that’s a 2027 initiative.”

My response: “That’s fine — you don’t need to build AI capabilities today. But let me show you what happens if we design the data layer without considering AI, and then you want to add it in 18 months.”

Then I walk through the retrofit cost comparison: 2–3 weeks of incremental design now vs. 8–12 weeks of re-engineering later. The math is simple, and it resonates with CFOs who hate paying twice for the same outcome.

Three questions that surface AI-readiness gaps in discovery:

“If your CEO asked tomorrow for AI-powered insights from this platform’s data, how long would it take to deliver?” — Most customers answer “months” or “I don’t know.” That gap is the opportunity.
“Where does your unstructured data live, and can any system search it by meaning rather than keywords?” — This surfaces the vector search gap that blocks RAG and agent use cases.
“Who controls what data your AI systems can access, and how is that audited?” — This surfaces the governance gap that blocks enterprise AI deployment in regulated industries.

7. The Business Case: Why CFOs Should Fund AI-Ready Data Now

For technology leaders building the business case, here are the numbers that resonate in CFO conversations:

Cost avoidance

Retrofit cost for AI-readiness after modernisation: 8–12 weeks engineering effort (~$150K–$300K for mid-market enterprises)
Incremental cost to include AI-readiness during modernisation: 2–3 weeks (~$40K–$75K)
Net savings per platform: $110K–$225K — and most enterprises have 3–5 platforms that will need AI capabilities

Time-to-value acceleration

Time to first AI use case without AI-ready data: 6–9 months (data re-engineering + AI development)
Time to first AI use case with AI-ready data: 4–8 weeks (AI development only — data is already prepared)
Competitive advantage: 4–7 months faster to market with AI-powered features

Risk reduction

Organisations that deploy AI without governance face an average of 2–3 data incidents in the first year
Each incident costs $50K–$500K in remediation depending on regulatory exposure
Governance-first design reduces incident probability by ~80%
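
These figures can be combined into a rough portfolio view. The sketch below uses only the numbers quoted in this section; the function names are illustrative, and treating the ~80% reduction in incident probability as a straight scaling of expected incident cost is a simplifying assumption.

```python
def portfolio_savings(retrofit_cost, upfront_cost, platforms):
    """Net cost avoided across a portfolio; each argument is a (low, high) pair."""
    low = (retrofit_cost[0] - upfront_cost[0]) * platforms[0]
    high = (retrofit_cost[1] - upfront_cost[1]) * platforms[1]
    return low, high


def expected_incident_cost(incidents, cost_per_incident, reduction=0.0):
    """Expected first-year incident cost as a (low, high) pair after a fractional reduction."""
    low = incidents[0] * cost_per_incident[0] * (1 - reduction)
    high = incidents[1] * cost_per_incident[1] * (1 - reduction)
    return round(low), round(high)


print(portfolio_savings((150_000, 300_000), (40_000, 75_000), (3, 5)))
# → (330000, 1125000)
print(expected_incident_cost((2, 3), (50_000, 500_000)))
# → (100000, 1500000)
print(expected_incident_cost((2, 3), (50_000, 500_000), reduction=0.8))
# → (20000, 300000)
```

Across three to five platforms, the cost avoided ranges from about $330K to about $1.1M, which is usually the number that moves the conversation from a single project to a portfolio decision.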

8. Cross-Layer Architecture Summary

Layer	AWS Services	Purpose	AI Enablement
Ingestion	Glue, Kinesis, Zero-ETL	Unified data integration	Makes data discoverable and processable
Storage	S3 Data Lake, RDS, S3 Vectors	Structured + vector storage	Supports both operational and AI query patterns
Semantic Search	OpenSearch, Aurora pgvector, Bedrock KB	Meaning-based retrieval	Enables RAG, agents, and natural language access
AI Consumption	Bedrock, AgentCore, Q Business, Lambda	Application layer	Delivers AI value to end users
Governance	Lake Formation, Macie, Guardrails, IAM	Access control + compliance	Ensures AI respects data boundaries

These layers are not independent — they form a pipeline. Data flows from ingestion through storage into semantic search, consumed by AI applications, governed at every stage. Skip any layer and the pipeline breaks.

9. Lessons for Technology Leaders

Modernisation without AI-readiness is incomplete modernisation — Design for the requirements your board will mandate in 12–18 months, not just today’s requirements.
The data foundation is the AI bottleneck, not the model — Every enterprise has access to the same foundation models. Your competitive advantage is the quality, structure, and accessibility of your proprietary data.
Start with cataloguing, prove with managed services, optimise with purpose-built infrastructure — Glue Data Catalog first, Bedrock Knowledge Bases to prove value, then S3 Vectors or pgvector for production scale.
Governance enables AI adoption — it doesn’t slow it down — Enterprises that skip governance don’t deploy faster. They deploy and retract. Build governance in from day one.
The 15% investment today saves the 3–5x retrofit tomorrow — Frame it as insurance to your CFO. It gets funded.
AI-ready data is a presales differentiator — When you help customers avoid the retrofit tax, you establish trust and expand engagement scope.

Conclusion

“The best time to build an AI-ready data foundation is during your cloud modernisation. The second-best time is now. But there is a third option that most enterprises choose: waiting until the AI use case is approved, the budget is allocated, and the business is asking why it’s taking 18 months to get a Bedrock pilot into production. That third option is the most expensive — and the most common.”

About the Author

Rajat Jindal is VP – Presales at AeonX Digital Technology Limited, where he architects winning cloud strategies for enterprise customers and translates modernisation into measurable business value. He is a strong advocate of AWS, committed to sharing thought leadership that helps technology leaders make faster, better-informed decisions.
