“Arm Waves” are messy brain dumps: long-form articles written to kickstart research, spark thinking, surface ideas and identify gaps, before they are eventually refined into a focused, curated set of “Finger Point” content in a published “Agile Data Guide”.
I often find writing helps me coalesce and refine my thoughts when new patterns start to emerge, but aren’t very clear yet.
This is the Arm Waves, aka brain dump / train of thought, to help refine what I think the data stack looks like in the new “AI” domain.
I am iterating this article as I research and think about the patterns needed in this space, so the post will be updated over time as I learn more.
Defining the “AI” domain
Let’s anchor the context of what I mean by the “AI” domain.
I am thinking about GenAI, Large Language Models (LLMs), Agents, Agentic and MCPs.
Let’s pop over to my ChatGPT co-friend for definitions:
GenAI: Software that generates new content (text, images, code, etc.) based on patterns it has learned from large datasets, typically using AI models.
Large Language Models (LLMs): Advanced AI models trained on vast amounts of text data to understand and generate human-like language.
Agents: Software systems that use models like LLMs to autonomously perform tasks, often combining reasoning, decision-making, and interaction with tools or data.
Agentic: Describes behaviour or systems that act with initiative, making decisions and taking action toward goals.
MCP (Model Context Protocol): Provides a universal adapter for AI agents / LLMs to integrate and share data with external tools, systems, and data sources without custom coding. MCP is often referred to as “the USB-C of AI apps”.
Don’t love the definitions it came up with, but don’t hate them enough to spend time rewriting them (yet).
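Since MCP is the newest of these terms, a minimal sketch may help ground it. This uses the FastMCP helper from the official MCP Python SDK (pip install mcp); the system name, tool and data are all illustrative stand-ins:

```python
# A minimal sketch of an MCP service: a source system exposing one of its
# queries as a tool that an agent can discover and call. Everything named
# here (system, tool, counts) is hypothetical.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("crm-system")  # hypothetical source system name

@mcp.tool()
def get_customer_count(segment: str) -> int:
    """Return the number of customers in a segment (stubbed for the sketch)."""
    fake_counts = {"enterprise": 42, "smb": 1337}
    return fake_counts.get(segment, 0)

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio by default
```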
I’m not thinking about Data Mining, Statistics, Data Science, Machine Learning and all those patterns that I believe have been in the “Analytics” domain for many decades.
Defining the current data stack
Context is key, so let’s understand how I see the current data stack pattern.
I tend to think of patterns in layers.
So for me there is a very high level view where I distill the patterns down into as few large boxes as possible:
And then I will break that down into a lot of much smaller boxes when I need to define more detailed patterns:
I have used both of these diagrams for a while now and they both need to be iterated for the “AI Data Stack” pattern.
I did an iteration a year or two ago, when the core data “AI” pattern that was emerging was the Text to SQL pattern, and so I added a box in the Consume area to cover that.
But things have accelerated a lot in this space and I need to do another major iteration to my thinking and pattern library.
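For anyone who hasn’t seen it, here is a minimal sketch of that Text to SQL pattern. The call_llm function is a hypothetical stand-in for whatever LLM client you use, and the schema is a toy example:

```python
# Text to SQL sketch: hand the LLM a schema plus a natural language question,
# get SQL back, run it. call_llm is stubbed so the sketch runs offline.
import sqlite3

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; a real implementation would hit an LLM API."""
    return "SELECT COUNT(*) FROM orders WHERE status = 'shipped';"

SCHEMA = "CREATE TABLE orders (id INTEGER, status TEXT);"

def text_to_sql(question: str) -> str:
    prompt = (
        f"Given this schema:\n{SCHEMA}\n"
        f"Write a single SQLite query answering: {question}\n"
        "Return only the SQL."
    )
    return call_llm(prompt)

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.execute("INSERT INTO orders VALUES (1, 'shipped'), (2, 'pending')")
sql = text_to_sql("How many orders have shipped?")
print(conn.execute(sql).fetchone()[0])  # -> 1
```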
For this article I am going to call out a few core patterns that are helping my current train of thought on the patterns that need to be iterated the most.
Again, when I am thinking about new patterns I find thinking in big boxes, not detailed ones, helps; it stops me getting stuck in the weeds too early, or “boiling the ocean” as Juan Sequeda likes to say.
Centralised Data
The current data stack pattern is focussed on collecting data from source systems, storing that data in one place, changing the data to make it fit for purpose and then providing access to that data to any person or system who needs it.
Data platform(s) harvest the data from the places that create it. Last Mile tools consume the data from the data platform.
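A toy sketch of that flow, with in-memory SQLite standing in for both the source system and the data platform (all names are illustrative):

```python
# Collect from a source, store it centrally, transform it to be fit for
# purpose, then serve it to a Last Mile consumer.
import sqlite3

source = sqlite3.connect(":memory:")     # the system that creates the data
source.execute("CREATE TABLE sales (region TEXT, amount REAL)")
source.executemany("INSERT INTO sales VALUES (?, ?)",
                   [("north", 10.0), ("north", 5.0), ("south", 7.5)])

platform = sqlite3.connect(":memory:")   # the central data platform
platform.execute("CREATE TABLE raw_sales (region TEXT, amount REAL)")
platform.executemany("INSERT INTO raw_sales VALUES (?, ?)",
                     source.execute("SELECT region, amount FROM sales"))

# Transform: a fit-for-purpose view that a Last Mile tool would consume.
platform.execute("""CREATE VIEW sales_by_region AS
                    SELECT region, SUM(amount) AS total
                    FROM raw_sales GROUP BY region""")
print(platform.execute("SELECT * FROM sales_by_region").fetchall())
```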
Catalogs harvest the Data / Metadata
The current data stacks are focussed on Catalog capabilities that extract the Metadata (and often data) from the various places where data is stored.
Catalog(s) harvest the metadata from the places that create it, after it has been created.
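A toy sketch of that harvest pattern, using SQLite’s own schema tables as the introspection point (names are illustrative):

```python
# The catalog pulls metadata out of a store after the data has been created,
# using whatever introspection the store offers.
import sqlite3

store = sqlite3.connect(":memory:")
store.execute("CREATE TABLE customers (id INTEGER, name TEXT)")

catalog = {}  # the catalog's copy of the metadata, built by pulling
for (table,) in store.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"):
    columns = [row[1] for row in store.execute(f"PRAGMA table_info({table})")]
    catalog[table] = columns

print(catalog)  # {'customers': ['id', 'name']}
```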
Defining the “AI Data Stack”
This diagram / map is a brain dump; my focus is the number of boxes and what they do / don’t do, not the flow or layout etc (yet).
Key patterns I am thinking about (in no particular order)
Source Systems are a source of Data and Context that the Agents need to understand and access;
Data Platforms are a source of Data and Context that the Agents need to understand and access, but not the only one anymore;
An Agent may invoke a Last Mile Tool, an action in an Application, or another Agent.
Source Systems and Applications are one and the same, but keeping them separate in this diagram helps with the thinking process right now.
All the places that hold data provide Context for that data to whatever needs that Context.
All the “SYSTEMS” / places that hold Data or have a User Interface will provide an MCP service that allows an agent to access them directly, removing the need to always access the centralised data platform (this feels very much like the data virtualisation patterns of old).
If the source system cannot serve the data needed, for example historical data, then the Agent will need to use the data from the Data Platform.
Source Systems will be forced to start storing Historical data to serve Agents directly.
The Context Plane will provide a single pane of glass for all Data Context held by every system, so the Agents can talk to one place.
The Context Plane will receive Context from the Systems, not harvest it. The Systems will Push the Context to the Context Plane. This will remove the need for the Context Plane to have to create a Pull pattern / adapter for every System in the world.
Or the Context Plane will operate under a Federated / Virtualised pattern where it doesn’t hold any Context but can point the Agents to the Systems Context so they can access it directly.
There needs to be a single language for Context, aka the equivalent of SQL. This is unlikely to happen, as the data domain can never agree, and vendors like to make their own standards to create lock-in and a “moat”.
The Context Plane will not provide Orchestration or Execution capabilities; Agents will talk directly with the MCP services for the relevant Systems to execute their tasks.
Agents will either be Orchestrated following a “Directed Graph” pattern or a “Pub/Sub” / “Fire and Forget” / “Mesh” pattern (a toy sketch of both follows this list).
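To make that last point concrete, here is a toy contrast of the two orchestration patterns. The agents are stubbed as plain functions and the topics are illustrative:

```python
# 1. Directed Graph: an orchestrator invokes agents in an explicit order.
def extract_agent(payload): return payload + ["extracted"]
def summarise_agent(payload): return payload + ["summarised"]

def run_graph(payload):
    for agent in (extract_agent, summarise_agent):  # the edges of the graph
        payload = agent(payload)
    return payload

print(run_graph([]))  # ['extracted', 'summarised']

# 2. Pub/Sub ("fire and forget" / mesh): agents react to events on a bus,
# with no central orchestrator deciding the order.
subscribers = {}

def subscribe(topic, agent):
    subscribers.setdefault(topic, []).append(agent)

def publish(topic, payload):
    for agent in subscribers.get(topic, []):
        agent(payload)

subscribe("data.landed", lambda p: publish("data.summarised", p + ["summarised"]))
subscribe("data.summarised", lambda p: print("done:", p))
publish("data.landed", ["extracted"])  # -> done: ['extracted', 'summarised']
```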
AI Agent interaction patterns
Next I start to think about some of the more detailed pattern diagrams to refine my thinking. I find thinking using maps helps me identify the underlying patterns.
Centralised Context Plane
Context is Pushed from each System to the Centralised Context Plane where it is stored.
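A minimal sketch of the push pattern, assuming a hypothetical ContextPlane class; the systems and context shapes are illustrative:

```python
# Each System sends its own Context to the Context Plane; the plane never
# reaches into the systems.
class ContextPlane:
    def __init__(self):
        self.context = {}            # stored centrally, keyed by system

    def receive(self, system: str, context: dict):
        self.context[system] = context   # the push endpoint

plane = ContextPlane()

# Each system pushes its context when it changes, in a shape it chooses.
plane.receive("crm", {"customers": "one row per customer, PII included"})
plane.receive("billing", {"invoices": "one row per invoice line"})

print(plane.context["crm"])  # the agent reads everything from one place
```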
Federated / Virtualised Context Plane
Context is Pulled from each System by the Federated Context Plane as and when it is needed.
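The same sketch flipped to the federated / virtualised pattern; the plane stores only pointers (callables in this toy version) and pulls on demand:

```python
# The plane holds no Context itself, only "how to fetch it" per System,
# resolved at the moment an agent asks.
class FederatedContextPlane:
    def __init__(self):
        self.resolvers = {}          # system -> how to fetch its context

    def register(self, system: str, resolver):
        self.resolvers[system] = resolver

    def get_context(self, system: str) -> dict:
        return self.resolvers[system]()   # pulled on demand, never stored

plane = FederatedContextPlane()
plane.register("crm", lambda: {"customers": "one row per customer"})

print(plane.get_context("crm"))  # fetched from the system at query time
```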
Historical data at Source
Systems store Historical data, either as part of their primary data store or as a companion data store. This is managed by the System Team / Owner / Vendor, not by a separate Data Team / Platform.
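A toy sketch of a System keeping its own companion history store; tables and columns are illustrative:

```python
# Every change to the primary row is appended to a history table owned by
# the same System, so an Agent can ask the System itself for history.
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, tier TEXT)")
db.execute("""CREATE TABLE customer_history
              (id INTEGER, tier TEXT, changed_at TEXT DEFAULT CURRENT_TIMESTAMP)""")

def set_tier(customer_id: int, tier: str):
    # The System appends to history as part of the same change.
    db.execute("INSERT OR REPLACE INTO customer VALUES (?, ?)", (customer_id, tier))
    db.execute("INSERT INTO customer_history (id, tier) VALUES (?, ?)",
               (customer_id, tier))

set_tier(1, "bronze")
set_tier(1, "gold")
print(db.execute("SELECT tier FROM customer_history WHERE id = 1").fetchall())
# -> [('bronze',), ('gold',)]
```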
Agent Execution via Systems MCP Service
The Agent communicates with the Context Plane to find out where the data it needs lives, and to understand the Context of that data. It then executes directly via the System’s MCP Services.
I can see a raft of alternative patterns for this one; centralised MCP Services for execution, for example.
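A toy sketch of that flow, with the Context Plane and the MCP services stubbed as plain dictionaries and functions; a real agent would use an MCP client session instead:

```python
# 1. Ask the Context Plane where the data lives and which tool serves it.
# 2. Execute directly against that System's MCP service.
CONTEXT_PLANE = {
    "customer_count": {"system": "crm", "tool": "get_customer_count"},
}

MCP_SERVICES = {  # one MCP service per System, stubbed for the sketch
    "crm": {"get_customer_count": lambda segment: 42},
}

def agent_task(need: str, **args):
    entry = CONTEXT_PLANE[need]                      # where does it live?
    tool = MCP_SERVICES[entry["system"]][entry["tool"]]
    return tool(**args)                              # execute at the System

print(agent_task("customer_count", segment="enterprise"))  # -> 42
```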
Agent Execution via Centralised MCP Service
The Agent communicates with the Context Plane to find out where the data it needs lives, and to understand the Context of that data. It then executes via a centralised MCP Service.
Wood from the Trees
Still a way to go before I have a coherent set of Patterns that I can Coach / Mentor / Teach somebody else for the “AI Data Stack”, or present as a robust Architecture map.
But as I have already said, writing my half formed ideas helps me think.