Engineering in the Wild

Clean Architecture Concepts: Where Do Frameworks Like ML Trainers Truly Belong?

TLDR

When implementing ML systems with clean architecture, you face a key design decision: should your ML framework abstractions live in the core domain or application layer? This article explores both patterns, providing code examples and architecture diagrams to help you make the right choice for your system.


Building robust, maintainable backend systems often leads us to layered architectures – think Domain-Driven Design, Hexagonal Architecture, or Clean Architecture. The central tenet is usually to protect the core business logic, keeping it pure and independent from the noisy, ever-changing details of infrastructure like databases, external APIs, web frameworks, and specialized libraries... like ML training frameworks.

To be honest, there's massive overlap between these architectures. Not because they're poorly designed, but because different people thought about the same problem and arrived at similar conclusions, each with its own nuances. So instead of focusing on the theory, let's focus on a practical scenario.

What happens when a fundamental business capability, something your system must do to deliver value, is inherently tied to one of these external libraries? Consider the task of "training a model" in an ML-focused application. Training is core to the purpose, yet the actual code relies on Scikit-learn, LightGBM, TensorFlow, or PyTorch. Placing direct calls to these libraries inside your pristine domain layer is a definite no-no – it couples your core to infrastructure.

How do we resolve this tension? The answer lies in abstracting the capability and carefully placing that abstraction in the right layer.

The Ports and Adapters Pattern: Your Framework Firewall

Hexagonal Architecture offers a powerful pattern for this: Ports and Adapters.

flowchart TD
    subgraph "Core Domain"
        Port["Port\n(Interface)"]
    end
    
    subgraph "Infrastructure"
        Adapter1["Adapter A\n(Implementation)"]
        Adapter2["Adapter B\n(Implementation)"]
    end
    
    Adapter1 -->|implements| Port
    Adapter2 -->|implements| Port

You can also think of this as abstract vs. concrete classes, or interfaces vs. implementations. I like to imagine this as a universal power adapter when traveling: your core domain is your device that just wants consistent electricity (capability), while the adapters handle all the messy details of different plug shapes and voltages (frameworks). You never want your laptop to know it's in a European hotel vs. an American office; it should just get power without worrying about the details! Similarly, your domain logic shouldn't care if it's talking to TensorFlow or PyTorch under the hood.

Anyway, the golden rule: Dependencies point inwards. The infrastructure layer depends on the abstract ports/interfaces defined in the layers it serves, but never the other way around.

Applying this to our ML trainer problem, we need a "Model Trainer" abstraction. The question becomes: In which inner layer should the interface for this abstraction live?

Approach 1: The Capability is Core (Port in Core)

This approach is suitable when the capability (like training) is so fundamental that the core domain model itself, or core business rules, conceptually depend on the existence of this capability. The core needs a way to abstractly interact with this function.

flowchart TD
    subgraph "Core Domain"
        ModelTrainerPort["ModelTrainerPort\n(Interface)"]
        Experiment["Experiment\n(Aggregate)"]
        ValueObjects["ModelConfigVO\nTrainingDataVO\nTrainedModelArtifactVO"]
        
        Experiment -->|uses| ModelTrainerPort
        Experiment -->|uses| ValueObjects
    end
    
    subgraph "Infrastructure"
        SKLearnAdapter["SKLearnTrainerAdapter"]
        TensorFlowAdapter["TensorFlowTrainerAdapter"]
        
        SKLearnAdapter -->|implements| ModelTrainerPort
        TensorFlowAdapter -->|implements| ModelTrainerPort
    end
    
    subgraph "Application"
        TrainService["TrainModelService"]
        TrainService -->|uses| Experiment
    end

Rationale: The core domain defines what the business is and how it fundamentally behaves. If "training" is a central verb or concept within the core's responsibilities (e.g., an Experiment aggregate must be able to initiate training runs as part of its lifecycle), the core needs a port for it.

Structure & Code Stubs:

Core Layer: Defines Value Objects representing training inputs/outputs and the Port interface.

# backend/core/valueobjects.py
from typing import Dict, Any
from dataclasses import dataclass

import numpy as np

@dataclass(frozen=True)
class ModelConfigVO: 
    # Configuration for the model
    params: Dict[str, Any]
    name: str

@dataclass(frozen=True)
class TrainingDataVO:  
    # Data for training
    features: np.ndarray
    labels: np.ndarray
    
@dataclass(frozen=True)
class TrainedModelArtifactVO: 
    # Result of training
    model_bytes: bytes
    metadata: Dict[str, Any]

# backend/core/ports/model_trainer_port.py
from typing import Protocol
from backend.core.valueobjects import ModelConfigVO, TrainingDataVO, TrainedModelArtifactVO

class ModelTrainerPort(Protocol):
    """Core Port: Interface for the model training capability needed by the core."""
    def train(self, model_config: ModelConfigVO, data: TrainingDataVO) -> TrainedModelArtifactVO: ...
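
The diagram shows the Experiment aggregate using ModelTrainerPort, which the stubs here leave implicit. A minimal sketch of what that could look like; the aggregate's fields and its run_training method are my own illustration, not part of the original stubs.

# backend/core/aggregates/experiment.py (illustrative sketch)
from dataclasses import dataclass, field
from typing import List

from backend.core.ports.model_trainer_port import ModelTrainerPort
from backend.core.valueobjects import ModelConfigVO, TrainingDataVO, TrainedModelArtifactVO

@dataclass
class Experiment:
    """Core aggregate: owns its training runs and uses the port to start them."""
    experiment_id: str
    config: ModelConfigVO
    artifacts: List[TrainedModelArtifactVO] = field(default_factory=list)

    def run_training(self, data: TrainingDataVO, trainer: ModelTrainerPort) -> TrainedModelArtifactVO:
        # The aggregate only knows the abstract port, never a concrete framework.
        artifact = trainer.train(self.config, data)
        self.artifacts.append(artifact)  # record the run as part of the aggregate's lifecycle
        return artifact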

Infrastructure Layer: Implements the Port interface using a specific framework.

# backend/infra/training/sklearn_trainer_adapter.py
from sklearn.linear_model import LogisticRegression # Framework dependency
import pickle
from backend.core.ports.model_trainer_port import ModelTrainerPort # Depends on core port
from backend.core.valueobjects import ModelConfigVO, TrainingDataVO, TrainedModelArtifactVO

class SKLearnTrainerAdapter(ModelTrainerPort):
    """Infra Adaptor: Implements the core port using SKLearn."""
    def train(self, model_config: ModelConfigVO, data: TrainingDataVO) -> TrainedModelArtifactVO:
        # --- SKLearn specific logic here ---
        model = LogisticRegression(**model_config.params)
        model.fit(data.features, data.labels)
        model_bytes = pickle.dumps(model)
        # ------------------------------------
        return TrainedModelArtifactVO(model_bytes=model_bytes, metadata={'framework': 'sklearn'})

Application Layer: Uses the Port via Dependency Injection.

# backend/application/services/train_model_service.py
from backend.core.ports.model_trainer_port import ModelTrainerPort # Depends on core port
from backend.core.valueobjects import ModelConfigVO, TrainingDataVO, TrainedModelArtifactVO

class TrainModelApplicationService:
    def __init__(self, model_trainer: ModelTrainerPort): # DI the core port
        self.model_trainer = model_trainer

    def execute(self, config: ModelConfigVO, data: TrainingDataVO) -> TrainedModelArtifactVO:
        # Use the port to trigger training
        return self.model_trainer.train(config, data)
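
What neither stub shows is where a concrete adapter is actually chosen. A rough sketch of that composition root follows; the module name and factory function are assumptions, not part of the structure above.

# backend/endpoint/composition_root.py (hypothetical wiring sketch)
from backend.application.services.train_model_service import TrainModelApplicationService
from backend.infra.training.sklearn_trainer_adapter import SKLearnTrainerAdapter

def build_train_model_service() -> TrainModelApplicationService:
    # The only place that names a concrete framework adapter.
    trainer = SKLearnTrainerAdapter()
    return TrainModelApplicationService(model_trainer=trainer)

Swapping Scikit-learn for TensorFlow then touches this one function, not the core or application layers.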

In my experience, placing ports in the core domain makes the most sense when the capability is truly fundamental to how your domain entities operate. This approach signals that the capability isn't just a utility but an essential part of the domain's definition: the core is explicitly declaring its need for external "muscle," which the infrastructure layer provides.

Approach 2: The Capability is for Orchestration (Interface in Application)

This approach positions the interface in the application layer. This is suitable when the capability is needed by the application services to orchestrate use cases, but the core domain model itself doesn't have a direct conceptual dependency on the capability's interface. The core focuses purely on business rules and state transitions, providing data that the application layer then acts upon using external capabilities.

flowchart TD
    subgraph "Core Domain"
        ValueObjects["ModelConfigVO\nTrainingDataVO\nTrainedModelArtifactVO"]
        Experiment["Experiment\n(Aggregate)"]
        
        Experiment -->|uses| ValueObjects
    end
    
    subgraph "Application"
        ModelTrainer["ModelTrainer\n(Interface)"]
        TrainService["TrainModelService"]
        
        TrainService -->|uses| Experiment
        TrainService -->|uses| ModelTrainer
        TrainService -->|uses| ValueObjects
    end
    
    subgraph "Infrastructure"
        SKLearnAdapter["SKLearnTrainerAdapter"]
        TensorFlowAdapter["TensorFlowTrainerAdapter"]
        
        SKLearnAdapter -->|implements| ModelTrainer
        TensorFlowAdapter -->|implements| ModelTrainer
    end

Rationale: The application layer defines and executes use cases. These use cases often require interacting with infrastructure after engaging the core domain, or preparing data to send to the core domain or infrastructure. If a capability is needed to fulfill a specific use case workflow, defining the interface in the application layer makes sense.

Structure & Code Stubs:

Core Layer: Defines Value Objects and core domain logic, but not the Trainer interface.

# backend/core/valueobjects.py
# --- Same VOs as Approach 1 ---
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelConfigVO: ...
@dataclass(frozen=True)
class TrainingDataVO: ...
@dataclass(frozen=True)
class TrainedModelArtifactVO: ...

Application Layer: Defines the interface for the capability and the services that use it.

# backend/application/interfaces/model_trainer.py
from typing import Protocol
from backend.core.valueobjects import ModelConfigVO, TrainingDataVO, TrainedModelArtifactVO # Depends on core VOs

class ModelTrainer(Protocol): # Note: Interface lives in application layer
    """Application Interface: Defines training needed by application services."""
    def train(self, model_config: ModelConfigVO, data: TrainingDataVO) -> TrainedModelArtifactVO: ...

Infrastructure Layer: Implements the Application layer interface.

# backend/infra/training/sklearn_trainer_adapter.py
from sklearn.linear_model import LogisticRegression # Framework dependency
import pickle
from backend.application.interfaces.model_trainer import ModelTrainer # Depends on application interface
from backend.core.valueobjects import ModelConfigVO, TrainingDataVO, TrainedModelArtifactVO # Depends on core VOs

class SKLearnTrainerAdapter(ModelTrainer): # Implements the application interface
    """Infra Adaptor: Implements the application interface using SKLearn."""
    def train(self, model_config: ModelConfigVO, data: TrainingDataVO) -> TrainedModelArtifactVO:
        # --- SKLearn specific logic here (same as Approach 1) ---
        # ...
        # ---------------------------------------------------------
        return TrainedModelArtifactVO(...)

Application Layer: Uses its own defined interface via Dependency Injection.

# backend/application/services/train_model_service.py
from backend.application.interfaces.model_trainer import ModelTrainer # Depends on application interface
from backend.core.valueobjects import ModelConfigVO, TrainingDataVO, TrainedModelArtifactVO # Depends on core VOs

class TrainModelApplicationService:
    def __init__(self, model_trainer: ModelTrainer): # DI the application interface
        self.model_trainer = model_trainer

    def execute(self, config: ModelConfigVO, data: TrainingDataVO) -> TrainedModelArtifactVO:
        # Use the interface to trigger training
        return self.model_trainer.train(config, data)
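
The Approach 2 diagram also shows the service using the Experiment aggregate, which the stub above omits. Here's a hedged orchestration sketch; RunExperimentService, experiment.config, and experiment.record_result are illustrative names I'm assuming, not defined earlier.

# backend/application/services/run_experiment_service.py (hypothetical use case sketch)
from backend.application.interfaces.model_trainer import ModelTrainer
from backend.core.valueobjects import TrainingDataVO, TrainedModelArtifactVO

class RunExperimentService:
    """Orchestrates a use case: the core supplies data/rules, the interface supplies the capability."""
    def __init__(self, model_trainer: ModelTrainer):
        self.model_trainer = model_trainer

    def execute(self, experiment, data: TrainingDataVO) -> TrainedModelArtifactVO:
        # 1. The core aggregate provides validated data/rules (its config), nothing more.
        config = experiment.config  # assumed attribute on the Experiment aggregate
        # 2. The application layer invokes the external capability through its own interface.
        artifact = self.model_trainer.train(config, data)
        # 3. The outcome is handed back to the core as a plain state transition.
        experiment.record_result(artifact)  # assumed method, for illustration
        return artifact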

Comparing and Building Intuition

Here's a table summarizing the key distinctions between approaches:

| Feature | Approach 1: Port in Core (core/ports/) | Approach 2: Interface in App (application/interfaces/) |
| --- | --- | --- |
| Interface Location | backend/core/ports/ | backend/application/interfaces/ |
| Defining Layer | Core Domain Layer | Application Layer |
| Dependent Layer | Core Domain depends on the Port | Application Layer depends on the Interface |
| Primary Driver | Core Domain Model itself needs the capability for its rules/integrity | Application Use Case needs the capability to orchestrate a process/interact externally |
| Core's Role | Core defines a needed capability of the outside world | Core provides data/rules; application layer uses external capabilities with it |
| Best Fit For... | Capabilities fundamental to core domain rules, validation, consistency | Capabilities for orchestration, external interaction based on core outcomes |

flowchart LR
    subgraph "Decision Process"
        Q1["Is capability integral to\ndomain entity behavior?"]
        A1["Use Core Domain Port"]
        A2["Use Application Interface"]
        
        Q1 -->|Yes| A1
        Q1 -->|No| A2
    end

The core question is: Does the abstract need for this capability live with the fundamental business concepts (Core), or with the logic that sequences steps to fulfill a request (Application)?

Let's ground this with a couple of practical examples:

When I've built ML-focused systems, I've typically organized them with services like api_services.py, event_services.py, or dedicated modules like api_services/training.py. This structure naturally aligns with an application layer focused on use case orchestration. These services correspond directly to API endpoints or event handlers because they implement the use cases those endpoints/handlers trigger. This is exactly what the application layer should do: translate external triggers into orchestrated actions involving the core domain and infrastructure.

I've found that placing interfaces like ModelTrainer in an application/interfaces/ folder works well when the application services are the primary consumers of this capability. In several ML projects I've worked on, this pattern has proven effective: it means the application layer depends on an abstraction for training, while the infrastructure layer provides the concrete implementation. This maintains the dependency arrow pointing inward, protecting application logic from direct framework coupling, just as a port in the core would protect the domain.

In my experience implementing both patterns, I've found that the choice between placing an interface in the core or application layer depends on where the abstraction's primary consumer lives and how it fits into the system's overall flow. Both approaches successfully decouple your system from infrastructure. The key is being intentional about which layer truly defines the need for the capability: is it fundamental to your domain concepts or primarily an orchestration concern?
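
One concrete payoff of either placement is testability: the application service can be exercised with a fake trainer and no ML framework installed. A minimal sketch, assuming the original Approach 2 stubs above (the execute(config, data) version); the test file name and FakeTrainer class are my own.

# tests/test_train_model_service.py (illustrative sketch)
from backend.application.services.train_model_service import TrainModelApplicationService
from backend.core.valueobjects import ModelConfigVO, TrainingDataVO, TrainedModelArtifactVO

class FakeTrainer:
    """Satisfies the ModelTrainer Protocol structurally, without importing any ML framework."""
    def train(self, model_config: ModelConfigVO, data: TrainingDataVO) -> TrainedModelArtifactVO:
        return TrainedModelArtifactVO(model_bytes=b"fake", metadata={"framework": "fake"})

def test_execute_returns_artifact_from_trainer():
    service = TrainModelApplicationService(model_trainer=FakeTrainer())
    config = ModelConfigVO(params={}, name="demo")
    data = TrainingDataVO(features=None, labels=None)  # the fake never touches real arrays
    artifact = service.execute(config, data)
    assert artifact.metadata["framework"] == "fake"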

By understanding this distinction and consistently applying the Ports and Adapters pattern, you can effectively integrate necessary frameworks like ML trainers without compromising the clarity and maintainability of your core domain and application logic.

Deep Dives

Directory Structures for ML Systems

When implementing these patterns in real-world ML systems, I've found it helpful to establish consistent directory structures that reflect the architectural boundaries. Let's map out folder structures under a backend/ root organized into the common layers (core/, infra/, application/, endpoint/).

From here, I've seen two effective approaches to organizing subdirectories:

Type-Based Organization

backend/
├── core/
│   ├── aggregates/          # Domain model behavior
│   ├── entities/            # Core business entities
│   ├── valueobjects/        # Immutable value objects
│   └── ports/               # Interfaces for infrastructure needs
├── infra/
│   ├── ml/
│   │   ├── sklearn_trainer.py    # Adapts scikit-learn to port/interface
│   │   └── tensorflow_trainer.py # Adapts TensorFlow to port/interface
│   └── persistence/         # Database adapters
├── application/
│   ├── services/            # Use case handlers
│   ├── interfaces/          # Application-layer abstractions
│   └── commands/            # Input data structures
└── endpoint/                # External interfaces (API, CLI)

Domain-Based Organization

backend/
├── core/
│   ├── model/               # Model domain (valueobjects.py, entities.py, ports.py)
│   └── experiment/          # Experiment domain (valueobjects.py, entities.py, ports.py)
├── infra/
│   ├── ml_frameworks/       # ML framework adapters for different domains
│   └── persistence/         # Database adapters by domain
├── application/
│   ├── training/            # Training-related services and interfaces
│   └── inference/           # Inference-related services and interfaces
└── endpoint/                # API endpoints by domain

Both structures have merits. The type-based approach is often intuitive initially but can scatter related code. The domain-based structure tends to scale better with complexity and team size by co-locating concepts. Adapter implementations consistently belong in infra when they fulfill ports from core, and in endpoint (or application) when they translate external input into domain concepts. The key is consistency within your chosen structure.

Interface Design Patterns

The backend/core/aggregates/ (or backend/core/interfaces/) folder is where the interfaces, or the main public methods, of your aggregate roots are defined. Whether that's a formal abc.ABC or typing.Protocol spelling out the aggregate's contract, or the concrete aggregate class itself serving as its own interface, this is the heart of your domain behavior, and it lives independently of infrastructure.

In backend/core/ports/, you define those abstract interfaces (using abc.ABC or typing.Protocol) for the infrastructure services that the core needs to interact with (like ML frameworks, persistence, or external APIs). These port definitions declare what the core needs, without caring how it's implemented.
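
The examples in this article use typing.Protocol; the same port can be written as an abc.ABC if you prefer explicit inheritance. A minimal sketch of that variant:

# backend/core/ports/model_trainer_port.py (abc.ABC variant of the same port)
from abc import ABC, abstractmethod

from backend.core.valueobjects import ModelConfigVO, TrainingDataVO, TrainedModelArtifactVO

class ModelTrainerPort(ABC):
    """Same contract as the Protocol version, but adapters must inherit explicitly."""
    @abstractmethod
    def train(self, model_config: ModelConfigVO, data: TrainingDataVO) -> TrainedModelArtifactVO:
        ...

Protocols allow structural conformance (an adapter only needs matching methods), while the ABC variant fails fast at instantiation if an adapter forgets to implement train; either way, the dependency arrow still points inward.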

The flow works like this: the core declares a port describing what it needs, an infrastructure adapter implements that port with a concrete framework, and the composition root injects the adapter wherever the port is required.

For ML frameworks specifically, I recommend designing interfaces that (a short sketch follows the list):

  1. Use domain terminology rather than framework-specific concepts
  2. Accept and return domain objects (value objects or entities)
  3. Hide framework-specific error handling behind domain-appropriate exceptions
  4. Include versioning information in training metadata for reproducibility
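
Here's a brief sketch of points 1–3; the TrainingFailedError exception and its module are my own illustration, not part of the earlier stubs.

# backend/core/exceptions.py (hypothetical domain exception)
class TrainingFailedError(Exception):
    """Domain-appropriate error: callers never see framework-specific exceptions."""

# backend/infra/training/sklearn_trainer_adapter.py (error-wrapping sketch)
import pickle
from sklearn.linear_model import LogisticRegression

from backend.core.exceptions import TrainingFailedError
from backend.core.valueobjects import ModelConfigVO, TrainingDataVO, TrainedModelArtifactVO

class SKLearnTrainerAdapter:
    """Satisfies the trainer port structurally; shown here to illustrate error translation."""
    def train(self, model_config: ModelConfigVO, data: TrainingDataVO) -> TrainedModelArtifactVO:
        try:
            model = LogisticRegression(**model_config.params)
            model.fit(data.features, data.labels)
        except ValueError as exc:  # scikit-learn raises ValueError for bad inputs/params
            # Translate the framework error into domain language before it crosses the boundary.
            raise TrainingFailedError(f"Training failed for '{model_config.name}'") from exc
        return TrainedModelArtifactVO(model_bytes=pickle.dumps(model), metadata={"framework": "sklearn"})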

Here's where the beauty of abstraction really shines: during code reviews and stakeholder discussions. When collaborating with business teams or product managers, we can focus solely on the interfaces, which speak the language of the business problem. For more technical audiences, we might explore the entity and value object design, which define the precise "knobs and levers" available to the system. But notice what never enters the conversation? Whether we're using PostgreSQL or MySQL, TensorFlow or PyTorch. Those implementation details are safely contained behind our interfaces.

ML-Specific Considerations

When working with ML frameworks, there are some unique considerations that affect how we design our ports and adapters:

Versioning and Reproducibility

For ML training interfaces, consider explicitly tracking framework versioning:

import sklearn
from datetime import datetime

from backend.core.ports.model_trainer_port import ModelTrainerPort
from backend.core.valueobjects import ModelConfigVO, TrainingDataVO, TrainedModelArtifactVO

class SKLearnTrainerAdapter(ModelTrainerPort):
    """Implements training using scikit-learn."""
    
    def __init__(self):
        self._framework_info = {
            "name": "scikit-learn",
            "version": sklearn.__version__,
            "capabilities": ["linear_models", "tree_models"]
        }
    
    def train(self, model_config: ModelConfigVO, data: TrainingDataVO) -> TrainedModelArtifactVO:
        # ... training implementation ...
        
        # Include framework information in model metadata
        return TrainedModelArtifactVO(
            model_bytes=model_bytes,
            metadata={
                "framework": self._framework_info,
                "training_date": datetime.now().isoformat(),
            }
        )

Performance Optimization Hints

For ML systems with strict performance requirements, you might extend interfaces with optional performance hints:

from typing import Any, Dict, Optional, Protocol

from backend.core.valueobjects import ModelConfigVO, TrainingDataVO, TrainedModelArtifactVO

class ModelTrainer(Protocol):
    def train(
        self, 
        model_config: ModelConfigVO, 
        data: TrainingDataVO,
        performance_hints: Optional[Dict[str, Any]] = None
    ) -> TrainedModelArtifactVO:
        """
        Train a model with optional performance hints like:
        - use_gpu: bool
        - batch_size: int
        - precision: str ('float32', 'float16', etc.)
        """
        ...

This allows the application layer to provide optimization guidance without coupling to specific framework implementations.
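
A short usage sketch from the application side, assuming the extended ModelTrainer Protocol above is the one exported from backend.application.interfaces.model_trainer; the hint values are only examples of the keys documented in the docstring.

from typing import Any, Dict

from backend.application.interfaces.model_trainer import ModelTrainer
from backend.core.valueobjects import ModelConfigVO, TrainingDataVO, TrainedModelArtifactVO

def train_with_hints(trainer: ModelTrainer, config: ModelConfigVO, data: TrainingDataVO) -> TrainedModelArtifactVO:
    # Hints are plain data; no framework-specific types leak across the boundary.
    hints: Dict[str, Any] = {"use_gpu": True, "batch_size": 256, "precision": "float16"}
    return trainer.train(config, data, performance_hints=hints)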

Decoupling core logic from infrastructure via interfaces (ports) is foundational for building testable and maintainable ML systems. Defining ports in the core ensures your domain doesn't know about concrete ML framework details. Dependency Injection then wires the system together at the composition root, typically outside the core layers.