Content-Addressable Transformers (CATs) is a unified Data Service Collaboration framework for organizations implemented as an edge-computing service that establish a Data Mesh as a scalable self-serviced Data Platform of Data Products with Data Provenance. CATs connect collaborators between organizations on a Data Mesh via the Content-Addressed Storage(CAS) of interoperable and scalable data processing to enable Data Provenance. CAT data processing workloads (CATs) are deployable as parallelized and distributed processes at horizontal & vertical scale to support scalable (big) data processing microservices with Scientific Computing capabilities. CATs are also integration points which enable scaled data processing portability between client-server cloud platforms and mesh (p2p) networks with minimal rework or modification.
CATs are submitted as content-addressed Orders of data processes (transformers) which are Invoiced for verification and logged as Bills-Of-Materials that serve as Data Provenance records. These records are content-addressed as unique identifiers of CAT workloads and their content. CATs content-addresses are also used as URIs that provide a means of data transportation. Therefore, the implementation of CATs' as content-addressed data processes establishes and self-services a scalable Data Platform as a Data Mesh network of interoperable distributed computing workloads deployable on Kubernetes as CATs execution paradigm.
CATs enables the continuous reification of Data Initiatives by cataloging discoverable, accessible, and re-executable workloads as Data Service Collaboration composable records between organizations. These records provide a reliable and efficient way to manage, share, and reference data processes via Content-Addressing Data Provenance records.
Content-Addressing is a method of uniquely identifying and retrieving data based on its content rather than its
location or address. CATs provides verifiable data processing and transport on a Mesh network of CATs interconnected by
Content-Addressing Data Provenance records with IPFS
CIDs (Content-Identifiers) as content addresses issued by IPFS
client to identify and retrieve inputs,
transformations, outputs, and infrastructure (as code [IaC]) for verifying transformation accuracy given CIDs.

CATs' utilizes Ray for interoperable & parallelized distributed computing frameworks deployable on Kubernetes for Big Data processing with Scientific Computing. Ray is a unified compute framework that enables the development of parallel and distributed applications for scalable data transformation, Machine Learning, and AI. Ray provides CATs with interoperable computing frameworks with its ecosystem integrations such as Apache Spark, and PyTorch.
Ray is deployed as an execution middleware on Kubernetes. IPFS serves as CATs' Data Mesh's network layer to provide
parallelized data ingress and egress for IPFS data. This network portability closes the gap between data analysis and
business operations by connecting the network planes of the cloud service model (SaaS, PaaS, IaaS) with IPFS. CATs
connect these network planes by enabling the instantiation of FaaS with cloud services in AWS, GCP, Azure, etc. on a
Data Mesh network of CATs. IPFS enables this connection as p2p distributed-computing job submission in addition to
the client-server job submission provided by Ray.

- Install Dependencies
- Install CATs:
git clone git@github.com:DynamicalSystemsGroup/cats.git cd cats python -m venv venv # Create Virtual Environment source venv/bin/activate # Activate Virtual Environment python -m pip install --upgrade pip pip install dist/*.whl
- Demo: Establish a CAT Mesh
- Test: CAT Mesh Verification
Data Initiatives will be naturally reified as a result of Data Service Collaboration on CATs. CATs will be
compiled and executed as interconnecting services on a Data Mesh that grows naturally when organizations communicate
CATs provenance records within feedback loops of Data Initiatives.

Organizations and collaborators participating will employ CATs for rapid ratification of service agreements within collaborative feedback loops of Data Initiatives. CATs' apply an Architectural Quantum Domain-Driven Design principle described in Data Mesh of Data Products to reify Data Initiatives. (* Design Description)
The Action Plane is the Analytical Data Processing interface. The Action Plane orchestrates and supervises
how virtual resources owned by the Data Product should be managed, routed, and processed and is stored “offmesh”
(“offline”). It supervises the exchange of data between sub-Process components within the Data sub-Plane (Process) in
adherence to Data Contracting Standards of organizations participating in a Data Mesh.

Quantum Architecture Description as a Minimal Federated Operating Model
- Function is a FaaS for scalable Data Processing and analytics executed as CAT Processes. Functions (FaaS) are deployed
on Structure (PaaS) to execute Processes orchestrated by InfraFunctions (FaaS)
- Processes are Functional Data Processors executable by InfraFunctions (FaaS) deployed on Structure (PaaS), and
contextualized with pre and post processed data by InfraFunctions (FaaS). Processes (FaaS) are executed with and made
orchestratable by InfraFunctions (FaaS) to support the following use-cases
- The CAT Order is updated with the inclusion of resulting mutated Functions (FaaS) for execution processed by CATs Factory Client.
- InfraFunction (FaaS) is a Data Processing orchestrator that employs a CAR for the configurable execution of scalable
Processing operated by the Plant (SaaS)
- The CAT Order is updated in alignment CATs Architectural Quantum’s Functionality. This Order will include the resulting updated of Structure (PaaS) with respect to the updated Plant (SaaS) and an updated Function (FaaS) with updated Ingress and Egress subProcesses (FaaS)
- Processes are Functional Data Processors executable by InfraFunctions (FaaS) deployed on Structure (PaaS), and
contextualized with pre and post processed data by InfraFunctions (FaaS). Processes (FaaS) are executed with and made
orchestratable by InfraFunctions (FaaS) to support the following use-cases
- Structure (PaaS as IaC) provisions and maintains the Plant (SaaS) as Function’s (FaaS) scalable execution environment.
- The Plant (SaaS) is InfraStructure’s (IaaS) dynamically scaled execution environment of Function (FaaS)
as an IaC plugin(s)
- The web application codebase is Content Addressed within CAT Orders as Data Contract metadata for Order registration.
- InfraStructure (IaaS) supports the provisioning of dynamically scaled infrastructure for maintaining a Plant (SaaS).
- The CAT Order is updated in alignment with event-driven functionality and operations with the resulting mutation of Structure (PaaS).
- The Plant (SaaS) is InfraStructure’s (IaaS) dynamically scaled execution environment of Function (FaaS)
as an IaC plugin(s)
BOM (Bill of Materials) are CATs' Content-Addressed Data Provenance record for verifiable data processing and transport on a Mesh network of CATs. BOMs are used as CAT’ input & output that contain CATs’ means of data processing.
- BOMs employ CIDs for location-agnostic retrieval based on its content as well as processes and
Data Verification. BOM CIDs can be used to verify the means of processing
data (input, transformation / process, output, infrastructure-as-code (IaC)) they can also make CATs resilient by
enabling re-execution via retrieval. CATs certifies the accuracy of data processing on data products and pipelines by
enabling maintenance and reporting of
data and process lineage & provenance as chains of
evidence using CIDs.

- CAT Mesh is composed by CATs executing BOMs.

CAT Mesh is a self-serviced Data Mesh platform with Data Provenance. CAT Nodes are CAT Mesh peers that enable workloads to be portable between client-server cloud platforms and p2p mesh network with minimal rework or modification.
Multi-disciplinary and cross-functional teams can use CAT Nodes to verify and scale distributed computing workloads. Workloads (CATs) executed by CAT Nodes interface cloud service model (SaaS, PaaS, IaaS) offered by providers such as AWS, GCP, Azure, etc. on a Mesh Network interconnected by IPFS.
CAT Nodes are Data Products - peer-nodes on a mesh network that encapsulate components (*) to function as a service providing access to a domain's analytical data as a product; * code, data & metadata, and infrastructure.
In the following image:
- Large ovals in the image above represent Data Products servicing each other with Data
- "O" ovals are Operational Data web service endpoints
- "D" ovals are Analytical Data web service endpoints
- Source: Data Mesh Principles and Logical Architecture - Zhamak
Dehghani, et al.

- Data Verification - a process for which data is checked for accuracy and inconsistencies before processed
- Data Provenance - a means of proving data lineage using historical records that provide the means of pipeline re-execution and data validation
- Data Lineage - reporting of data lifecyle from source to destination
- Distributed Computing - typically the concurrent and/or parallel execution of job tasks distributed to networked computers processing data
- Bill of Materials (BOM) - an extensive list of raw materials, components, and instructions required to construct, manufacture, or repair a product or service
CATs was developed by the Dynamical Systems Group (DSG) team.
Key contributions:
- Network Architecture & Verified Information Exchange:
- Lead Solutions Architect & Developer / Distributed Systems Engineer
- Testing: Danilo
