Week 5: Data-Oriented Architectures


Christian Cabrera

Abstract:

In this lecture we explore the concept of Data-Oriented Architectures and how designing data-first systems addresses the requirements of data science pipelines.

notutils


This small helper package provides various notebook utilities used below.

The software can be installed using

%pip install notutils

from a notebook cell. From a command prompt with access to your Python installation, use the same command without the leading % sign.

The code is also available on GitHub: https://github.com/lawrennd/notutils

Once notutils is installed, it can be imported in the usual manner.

import notutils

Data-Oriented Architectures


Modern AI systems are increasingly data-driven, which creates new challenges and requirements that traditional service-oriented architectures (SOAs) may not adequately address. This has led to the emergence of data-oriented architectures (DOAs) as an alternative architectural pattern that prioritizes data availability, ownership, traceability and monitoring.

A fundamental challenge is the so-called “Data Dichotomy”: service-oriented architectures are designed to hide data behind service interfaces, while data-driven systems need to expose and share data effectively. This creates a need to rethink how we architect AI systems.
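The dichotomy can be made concrete with a minimal Python sketch. All names here (Reading, HiddenReadingService, shared_readings) are hypothetical and chosen only for illustration. The first design hides its data behind a single query, while the second treats the records themselves as the shared interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Reading:
    """One immutable data record (hypothetical schema for illustration)."""
    sensor_id: str
    value: float


class HiddenReadingService:
    """Service-oriented style: the data is private, only one query is offered."""

    def __init__(self, readings):
        self._readings = list(readings)  # hidden behind the interface

    def average(self):
        return sum(r.value for r in self._readings) / len(self._readings)


# Data-oriented style: the records themselves are the shared interface.
shared_readings = (
    Reading("s1", 2.0),
    Reading("s2", 4.0),
    Reading("s1", 6.0),
)

service = HiddenReadingService(shared_readings)
service_answer = service.average()  # the only question the service can answer

# A new consumer can ask a question nobody anticipated, without changing
# any existing component, because the data is exposed.
max_per_sensor = {}
for r in shared_readings:
    previous = max_per_sensor.get(r.sensor_id, r.value)
    max_per_sensor[r.sensor_id] = max(r.value, previous)
```

In the service-oriented version, answering the per-sensor question would require changing the service's interface. In the data-oriented version it only requires reading the shared records.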

Core DOA Principles


Data-oriented architectures are built on three core principles that address the key requirements of modern AI systems. Each principle targets specific challenges in modern data-driven applications.

Data-First Principle

The data-first principle puts data at the center of system design, ensuring that data flows are prioritized over service interfaces.

Decentralization Principle

Decentralization helps address latency and scalability challenges by distributing both data and processing across the system.

Openness Principle

The openness principle ensures that data can be effectively shared and reused while maintaining appropriate governance.
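As a rough illustration of the data-first and openness ideas, the sketch below lets components communicate only through a shared append-only log rather than by calling each other. The DataLog class and the record fields are assumptions made for this example; in practice a database or message queue would play this role. Decentralization is not shown here, since everything runs in one process.

```python
class DataLog:
    """Append-only shared log; a stand-in for a database or message queue."""

    def __init__(self):
        self._records = []

    def append(self, record):
        # Store a copy so that writers cannot mutate history afterwards.
        self._records.append(dict(record))
        return len(self._records) - 1  # offset of the new record

    def read_from(self, offset):
        # Return copies so that readers cannot mutate the log either.
        return [dict(r) for r in self._records[offset:]]


log = DataLog()

# The producer writes data without knowing who will consume it.
log.append({"sensor": "a", "level_cm": 120})
log.append({"sensor": "a", "level_cm": 135})

# Consumer 1 reads the data at its own pace and derives alerts.
alerts = [r for r in log.read_from(0) if r["level_cm"] > 130]

# More data arrives later.
log.append({"sensor": "b", "level_cm": 90})

# Consumer 2 is added afterwards and can still replay the full history.
sensors_seen = sorted({r["sensor"] for r in log.read_from(0)})
```

Because neither consumer is known to the producer, new consumers can be added without touching existing components, which is the property the data-first principle aims for.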

While few systems fully implement all DOA principles today, various technologies enable partial implementation - databases and message queues for data-first approaches, distributed storage and computing for decentralization, and asynchronous communication for openness.

Case Study: Water Level Monitoring


A practical example of DOA principles being applied can be seen in the water level monitoring project at Dedan Kimathi University of Technology (DeKUT) in Kenya, monitoring the Ewaso Nyiro River. This real-world implementation demonstrates both the benefits and challenges of applying DOA principles.

Figure: Architecture of the water level monitoring system showing key components and data flows

The system demonstrates the three core DOA principles, though with some practical limitations. The data-first approach is evident in the centralized data collection, while openness is implemented through MQTT-based communication. However, the heavy reliance on cloud resources presents opportunities for more decentralization.
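The publish/subscribe style of communication that MQTT provides can be sketched with a small in-memory broker. This is not the project's code and does not use a real MQTT client. The topic names, station identifiers, payload fields and alert threshold are all hypothetical, and the wildcard matching is a simplified version of MQTT's "+" (one level) and "#" (all remaining levels) rules.

```python
def topic_matches(pattern, topic):
    """Simplified MQTT-style topic matching with '+' and '#' wildcards."""
    p_parts = pattern.split("/")
    t_parts = topic.split("/")
    for i, p in enumerate(p_parts):
        if p == "#":
            return True  # multi-level wildcard matches everything that remains
        if i >= len(t_parts):
            return False  # pattern is longer than the topic
        if p != "+" and p != t_parts[i]:
            return False  # literal level differs
    return len(p_parts) == len(t_parts)


class InMemoryBroker:
    """Toy broker: delivers each message to every matching subscriber."""

    def __init__(self):
        self._subscribers = []  # list of (pattern, callback) pairs

    def subscribe(self, pattern, callback):
        self._subscribers.append((pattern, callback))

    def publish(self, topic, payload):
        for pattern, callback in self._subscribers:
            if topic_matches(pattern, topic):
                callback(topic, payload)


broker = InMemoryBroker()
stored = []       # stands in for the central data store
high_levels = []  # topics whose reading exceeded the hypothetical threshold


def store(topic, payload):
    stored.append((topic, payload))


def alert_on_high_level(topic, payload):
    if payload["level_cm"] > 200:  # hypothetical threshold
        high_levels.append(topic)


broker.subscribe("river/+/level", store)
broker.subscribe("river/#", alert_on_high_level)

# Sensors publish readings without knowing which components consume them.
broker.publish("river/station-1/level", {"level_cm": 150})
broker.publish("river/station-2/level", {"level_cm": 230})
```

Delivery here is synchronous and in-process. A real deployment would place a networked broker between the sensors and the consumers, which is where the asynchronous behaviour comes from.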

Empirical Evaluation of DOA


Research comparing DOA to traditional SOA approaches has revealed several advantages, particularly in machine learning deployment contexts. The evaluation focused on key metrics that impact system maintainability and evolution.

Studies have shown that DOA implementations typically require fewer component changes during system evolution, exhibit lower cognitive complexity, and achieve better maintainability scores than equivalent SOA implementations.

One particularly important advantage of DOA is its support for better causality analysis through explicit data flow representation.
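One way to picture an explicit data flow representation is as a graph that maps each data item to the items it was derived from. The sketch below, with hypothetical item names, walks such a graph to find everything a result depends on. Tracing lineage like this is only a starting point for causal analysis, not a causal inference method in itself.

```python
def upstream(flows, item):
    """Return every data item that `item` is derived from, directly or indirectly."""
    seen = set()
    stack = list(flows.get(item, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(flows.get(node, ()))
    return seen


# Each key is a derived data item; its value lists the items it was built from.
flows = {
    "cleaned_levels": ["raw_levels"],
    "features": ["cleaned_levels", "rainfall"],
    "flood_prediction": ["features"],
}

# Which data could have influenced the prediction?
candidate_causes = upstream(flows, "flood_prediction")
```

When data flows are hidden inside service calls, this graph has to be reconstructed from code and logs. When the flows are explicit, it is available directly.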

The ADS Library


The ADS (Access, Assess, Address) library provides a practical implementation of DOA principles for data science pipelines. It structures data science workflows in a way that naturally aligns with DOA principles while maintaining flexibility for different use cases.

This structured approach enables better traceability of data throughout the pipeline, improves reproducibility of analyses, and makes systems easier to maintain over time.
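The staged structure can be illustrated with a minimal sketch. It does not reproduce the ADS library's API: the function names access, assess and address, the sample data, and the trace format are all assumptions made for this example. Each stage is a separate function, and a small trace records what passed through it.

```python
from statistics import mean

trace = []  # hypothetical provenance record: (stage name, number of records)


def access(source):
    """Access: obtain the raw data (here simply from an in-memory list)."""
    records = list(source)
    trace.append(("access", len(records)))
    return records


def assess(records):
    """Assess: check data quality independently of the final question."""
    valid = [r for r in records if r is not None]
    report = {"total": len(records), "missing": len(records) - len(valid)}
    trace.append(("assess", len(valid)))
    return valid, report


def address(valid):
    """Address: answer the specific question, here the mean water level."""
    result = mean(valid)
    trace.append(("address", 1))
    return result


raw = access([120, None, 140, 130])  # made-up readings for illustration
valid, report = assess(raw)
answer = address(valid)
```

Keeping the stages separate means the quality report and the trace remain available for inspection whatever question the final stage answers.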

Summary


Data-Oriented Architectures provide a valuable paradigm for designing modern AI and data science systems. While each system’s specific implementation details will vary based on context, the core principles of DOA provide a robust framework for addressing common challenges in data-driven systems.

The success of DOA implementations in real-world cases like the water monitoring system, combined with empirical evaluations showing benefits in maintainability and analysis, suggests that this architectural approach will become increasingly important as systems become more data-driven.

Thanks!
