An AI LLM baseline for IT/OT infrastructure documentation and topology
Most infrastructure data that ends up in AI pipelines was never designed to be there.
Raw vendor exports, ad-hoc JSON blobs, and log dumps carry enough noise that training on them teaches a model mostly to reproduce the vendor’s dialect rather than to reason about infrastructure, and they risk carrying sensitive data such as secrets. OSIRIS JSON was designed for a different purpose: to be
- portable
- vendor-neutral
- normalized in its output data
- schema-validated, producing standardized topology snapshots
but that design turns out to be exactly what makes it useful as an AI LLM training baseline. The two stages of the LLM pipeline where OSIRIS JSON fits most naturally are fine-tuning and instruction alignment.
What fine-tuning needs from data
Fine-tuning adapts a pretrained model to a specific domain. The signal comes from the training corpus: every document teaches the model what vocabulary is normal, what structures appear together, what relationships are valid, and what patterns repeat.
For that signal to be useful, the data has to be consistent. A training corpus made of raw Cisco NX-OS exports, Azure ARM templates, and VMware vCenter dumps does not teach infrastructure reasoning. It teaches three separate vendor dialects. A model trained on that mix will reflect those inconsistencies back when asked to reason across boundaries.
OSIRIS JSON aims to remove that problem at the schema level.
Every OSIRIS JSON document, regardless of whether it was produced from a hyperscaler account such as AWS, an on-premises Arista fabric, a series of Cisco Nexus switches, or a Nokia fabric, shares the same outer structure: version, metadata, and topology. Resources always have id, type, and provider. Connection types follow the same dot-notation conventions. Group semantics are consistent.
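As an illustration, that shared outer structure might look like the following. This is a hypothetical sketch modeled as a Python dict: only version, metadata, topology, and the id/type/provider fields come from the description above; every other field name and value is an assumption, not taken from the specification.

```python
# Hypothetical minimal OSIRIS JSON document as a Python dict.
# Field names beyond version/metadata/topology and id/type/provider
# are illustrative assumptions.
minimal_doc = {
    "version": "1.0",
    "metadata": {"producer": "example-producer", "captured_at": "2024-01-01T00:00:00Z"},
    "topology": {
        "resources": [
            {"id": "fw-1", "type": "network.firewall", "provider": "cisco"},
            {"id": "sw-1", "type": "network.switch", "provider": "arista"},
        ],
        "connections": [
            {"source": "fw-1", "target": "sw-1", "type": "connectivity"},
        ],
        "groups": [],
    },
}

# Because the outer structure is identical across producers, structural
# checks are trivial and uniform for every document in a corpus.
assert {"version", "metadata", "topology"} <= minimal_doc.keys()
assert all({"id", "type", "provider"} <= r.keys()
           for r in minimal_doc["topology"]["resources"])
```

The same assertions hold for every document in the corpus, whichever vendor produced it; that uniformity is the training signal.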
A fine-tuning corpus built from OSIRIS documents trains a model on infrastructure concepts rather than on vendor formats, which drift over time as vendors update features. The model learns that a network.firewall sits between network.switch resources and has connectivity connections to the segments it protects. It learns that a three-tier application topology consistently involves a load balancer, compute resources, and an application-layer database. It learns what a valid containment relationship looks like versus a dependency.
That is the right kind of domain knowledge to specialize a model on.
The graph structure is a natural training signal
OSIRIS JSON models infrastructure as an explicit graph: resources are nodes, connections are edges, and groups cluster both into logical or physical boundaries.
Graph structure is unusually valuable in training data because it encodes relationships explicitly. In unstructured text, the model has to infer that a firewall is upstream of the servers it protects. In an OSIRIS JSON document, that relationship is a typed, directional connection with a source, target, and type. The model does not have to guess; it reads a structure that was deliberately constructed to express exactly that topology.
The connection type taxonomy carries additional signal. The difference between a dependency connection and a containment connection is semantic, not syntactic. Training on documents that use those types correctly teaches a model to distinguish functional relationships from structural ones. That distinction matters when a model is asked to answer questions like:
- what breaks if this resource goes down?
- what is logically inside this Amazon AWS VPC?
- can you show the traffic flow of all resources within this Microsoft Azure subscription?
The resource type taxonomy adds another layer. The dot-notation hierarchy (compute.vm, compute.vm.template, network.switch, storage.volume) provides a natural classification that a fine-tuned model can internalize, generalize from, and apply to resources it has never seen before.
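The hierarchy can also be exploited mechanically: the top-level category and ancestry of a type fall out of the dot notation itself. A small sketch, with the example types taken from the list above:

```python
def classify(resource_type):
    """Map a dot-notation type to its top-level category."""
    return resource_type.split(".", 1)[0]

def is_subtype(resource_type, ancestor):
    """True if resource_type equals ancestor or sits below it in the hierarchy."""
    return resource_type == ancestor or resource_type.startswith(ancestor + ".")

assert classify("compute.vm.template") == "compute"
assert is_subtype("compute.vm.template", "compute.vm")
assert not is_subtype("network.switch", "compute")
```

A model that internalizes this convention can place a previously unseen type such as storage.volume.snapshot in the right neighborhood from its prefix alone.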
Why normalization matters for training quality
Producer normalization is one of the least discussed parts of OSIRIS JSON, but it has a direct impact on training data quality.
Before an OSIRIS JSON document reaches the fine-tuning corpus, a producer has already:
- mapped vendor-specific resource types to standard OSIRIS JSON taxonomy entries
- removed credentials, tokens, and secrets through redaction
- resolved duplicate or conflicting identifiers into deterministic stable IDs
- serialized the topology in a consistent, sorted output
The result is that every document in the training corpus is structurally clean with guardrails in place. There are no embedded passwords or secrets that teach the model to associate credentials with topology fields. There is no identifier instability that introduces noise across runs of the same environment. There is no vendor-specific property that conflicts with the same property named differently by another vendor.
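The normalization steps above can be sketched as follows. The sensitive key list, the hashing scheme, and the function names are illustrative assumptions, not producer requirements:

```python
import hashlib
import json

SENSITIVE_KEYS = {"password", "secret", "token", "api_key"}  # assumed list

def redact(obj):
    """Recursively replace values stored under sensitive keys."""
    if isinstance(obj, dict):
        return {k: "[REDACTED]" if k.lower() in SENSITIVE_KEYS else redact(v)
                for k, v in obj.items()}
    if isinstance(obj, list):
        return [redact(v) for v in obj]
    return obj

def stable_id(provider, native_id):
    """Derive a deterministic ID from a provider-native identifier."""
    return hashlib.sha256(f"{provider}:{native_id}".encode()).hexdigest()[:12]

def serialize(doc):
    """Consistent, sorted output: identical input yields identical bytes."""
    return json.dumps(redact(doc), sort_keys=True, separators=(",", ":"))
```

Because serialization is sorted and IDs are deterministic, two runs over the same environment produce byte-identical documents, which is exactly what keeps run-to-run noise out of a training corpus.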
A fine-tuning corpus built from producer-validated OSIRIS JSON documents has an unusually low noise floor for a domain as heterogeneous as infrastructure.
What instruction alignment needs from data
Instruction fine-tuning is about something different. The goal is not to teach domain knowledge; it is to teach the model how to use that knowledge in response to human requests. This is the stage that dramatically improves usability by aligning the model’s behaviour with human expectations, making it more helpful, reliable, and controllable.
OSIRIS JSON documents are grounded artifacts. They describe a specific infrastructure at a specific point in time, validated against a known schema. That groundedness is what makes them useful for constructing instruction pairs.
A well-formed instruction example pairs a natural language question, an OSIRIS JSON document as context, and a correct answer derived from the topology. For example:
- Does this topology have a single point of failure? The model reads the connections, identifies resources with no redundant path, and answers with specific resource IDs and reasoning
- What would be affected if the firewall lost its uplink? The model traverses the connection graph outward from the relevant resource and lists dependent resources
- Summarize this infrastructure for an architecture review? The model reads resource types, counts, and group structure to produce a human-readable summary
- Generate an updated OSIRIS document that adds a standby replica for the primary database? The model extends the topology in a schema-valid way
- Generate an accurate topology of my Microsoft Azure infrastructure and document it in Markdown? Reading the OSIRIS JSON document, the model produces a diagram in draw.io or Mermaid format (for high-level visibility) alongside a Markdown document that summarizes the infrastructure configuration at that point in time
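In practice, an instruction example can be packaged as a (question, context, answer) triple in which the answer is derived programmatically from the topology, so a reviewer can re-derive it. A hypothetical sketch, with a deliberately simplified single-point-of-failure heuristic:

```python
# Hypothetical instruction pair: the answer is computed from the topology,
# so a human reviewer can verify it against the context document.
context = {
    "version": "1.0",
    "topology": {
        "resources": [
            {"id": "lb-1", "type": "network.loadbalancer", "provider": "aws"},
            {"id": "web-1", "type": "compute.vm", "provider": "aws"},
        ],
        "connections": [
            {"source": "web-1", "target": "lb-1", "type": "dependency"},
        ],
    },
}

def single_points_of_failure(topology):
    """Simplified heuristic: a dependency target with no other resource
    of the same type available to take over."""
    types = {r["id"]: r["type"] for r in topology["resources"]}
    type_counts = {}
    for t in types.values():
        type_counts[t] = type_counts.get(t, 0) + 1
    targets = {c["target"] for c in topology["connections"]
               if c["type"] == "dependency"}
    return sorted(t for t in targets if type_counts[types[t]] == 1)

instruction_pair = {
    "question": "Does this topology have a single point of failure?",
    "context": context,
    "answer": f"Yes: {', '.join(single_points_of_failure(context['topology']))}",
}
```

Real producers would use a more careful redundancy analysis; the point is that the answer field is a function of the context, not free-floating text.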
The key property is verifiability. Because the source document is schema-validated and the topology is explicit, a human reviewer can check that the model’s answer is correct. That feedback loop where human evaluators can actually assess model responses against a ground truth is what makes instruction data useful. It is much harder to construct quality instruction pairs for infrastructure when the source data is ambiguous or inconsistently formatted.
Provider attribution preserves traceability without locking in vendor bias
One design decision in OSIRIS JSON that matters for AI use cases is provider attribution.
Resources carry their originating provider information (AWS, Azure, Arista, Cisco, Nokia, HPE), but the core structure is always vendor-neutral. A model trained on OSIRIS JSON documents learns to associate provider context with resource behavior without learning that the core topology language is provider-specific.
This matters for instruction alignment in particular. If a user asks “what kind of infrastructure is this?”, the model should answer based on the topology, not based on recognizing a specific vendor’s export format. Provider attribution in OSIRIS JSON makes that distinction explicit in the training data rather than leaving the model to infer it from lexical patterns.
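The distinction is easy to see mechanically: the structural answer is provider-independent, while the provider field remains available for attribution. A small sketch with assumed data:

```python
# Hypothetical resources from two different vendors.
resources = [
    {"id": "fw-1", "type": "network.firewall", "provider": "cisco"},
    {"id": "fw-2", "type": "network.firewall", "provider": "nokia"},
]

def kinds(resources):
    """Topology-level answer: what kinds of resources exist (vendor-neutral)."""
    return sorted({r["type"] for r in resources})

def providers(resources):
    """Attribution: which providers contributed resources (traceability)."""
    return sorted({r["provider"] for r in resources})

assert kinds(resources) == ["network.firewall"]    # same answer for any vendor
assert providers(resources) == ["cisco", "nokia"]  # provenance still recoverable
```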
A structured baseline for infrastructure-aware LLMs
The direction this points toward is infrastructure-aware LLMs that can reason about topology, surface risks, answer architecture questions, assist with documentation, and propose changes grounded in your real, up-to-date inventory data, not in vague general knowledge or vendor documentation about what infrastructure “usually” looks like.
OSIRIS JSON is not the only ingredient for that. But as a normalized, schema-validated, deterministic interchange format that producers across different vendors and platforms can target, it provides something the AI pipeline would otherwise have to construct from scratch: a consistent structural language for infrastructure.
This is the OSIRIS JSON baseline.
Related reading
- Read the OSIRIS JSON v1.0 specification
- Read the core concepts
- Read the resource type taxonomy
- Read the producer guidelines
- Read about validation levels
Banner icon author: Esri