Clean Architecture Concepts: Where Do Frameworks Like ML Trainers Truly Belong?
TLDR
When implementing ML systems with clean architecture, you face a key design decision: should your ML framework abstractions live in the core domain or application layer? This article explores both patterns, providing code examples and architecture diagrams to help you make the right choice for your system.
Building robust, maintainable backend systems often leads us to layered architectures: think Domain-Driven Design, Hexagonal Architecture, or Clean Architecture. The central tenet is usually to protect the core business logic, keeping it pure and independent from the noisy, ever-changing details of infrastructure like databases, external APIs, web frameworks, and specialized libraries... like ML training frameworks.
To be honest, there's massive overlap between these architectures. Not because they're poorly designed, but because different people thought about the same problem and arrived at similar conclusions, each with its own nuances. So instead of focusing on the theory, let's focus on a practical scenario.
What happens when a fundamental business capability, something your system must do to deliver value, is inherently tied to one of these external libraries? Consider the task of "training a model" in an ML-focused application. Training is core to the purpose, yet the actual code relies on Scikit-learn, LightGBM, TensorFlow, or PyTorch. Placing direct calls to these libraries inside your pristine domain layer is a definite no-no: it couples your core to infrastructure.
How do we resolve this tension? The answer lies in abstracting the capability and carefully placing that abstraction in the right layer.
The Ports and Adaptors Pattern: Your Framework Firewall
Hexagonal Architecture offers a powerful pattern for this: Ports and Adaptors.
flowchart TD
subgraph "Core Domain"
Port["Port\n(Interface)"]
end
subgraph "Infrastructure"
Adapter1["Adapter A\n(Implementation)"]
Adapter2["Adapter B\n(Implementation)"]
end
Adapter1 -->|implements| Port
Adapter2 -->|implements| Port
- Ports: These are interfaces (like Python's abc.ABC or typing.Protocol) defined inside your core layers. They represent a contract for a capability that an inner layer needs from an outer layer. Think of them as sockets or plugs on the side of your hexagon.
- Adaptors: These live in the outermost layer (infrastructure). They are concrete implementations of the ports, talking to the specific external technology (a database driver, an HTTP client, an ML library). They "adapt" the technology's API to fit the shape of the port defined by the inner layer.
You can also think of this as abstract and concrete classes, or interfaces and implementations. I like to imagine it as a universal power adapter when traveling: your core domain is your device that just wants consistent electricity (the capability), while the adapters handle all the messy details of different plug shapes and voltages (the frameworks). You never want your laptop to know it's in a European hotel vs. an American office; it should just get power without worrying about the details. Similarly, your domain logic shouldn't care whether it's talking to TensorFlow or PyTorch under the hood.
Anyway, the golden rule: Dependencies point inwards. The infrastructure layer depends on the abstract ports/interfaces defined in the layers it serves, but never the other way around.
Applying this to our ML trainer problem, we need a "Model Trainer" abstraction. The question becomes: In which inner layer should the interface for this abstraction live?
Approach 1: The Capability is Core (Port in Core)
This approach is suitable when the capability (like training) is so fundamental that the core domain model itself, or core business rules, conceptually depend on the existence of this capability. The core needs a way to abstractly interact with this function.
flowchart TD
subgraph "Core Domain"
ModelTrainerPort["ModelTrainerPort\n(Interface)"]
Experiment["Experiment\n(Aggregate)"]
ValueObjects["ModelConfigVO\nTrainingDataVO\nTrainedModelArtifactVO"]
Experiment -->|uses| ModelTrainerPort
Experiment -->|uses| ValueObjects
end
subgraph "Infrastructure"
SKLearnAdapter["SKLearnTrainerAdapter"]
TensorFlowAdapter["TensorFlowTrainerAdapter"]
SKLearnAdapter -->|implements| ModelTrainerPort
TensorFlowAdapter -->|implements| ModelTrainerPort
end
subgraph "Application"
TrainService["TrainModelService"]
TrainService -->|uses| Experiment
end
Rationale: The core domain defines what the business is and how it fundamentally behaves. If "training" is a central verb or concept within the core's responsibilities (e.g., an Experiment aggregate must be able to initiate training runs as part of its lifecycle), the core needs a port for it.
Structure & Code Stubs:
Core Layer: Defines Value Objects representing training inputs/outputs and the Port interface.
# backend/core/valueobjects.py
from typing import Dict, Any
from dataclasses import dataclass
import numpy as np  # needed for the ndarray fields below
@dataclass(frozen=True)
class ModelConfigVO:
# Configuration for the model
params: Dict[str, Any]
name: str
@dataclass(frozen=True)
class TrainingDataVO:
# Data for training
features: np.ndarray
labels: np.ndarray
@dataclass(frozen=True)
class TrainedModelArtifactVO:
# Result of training
model_bytes: bytes
metadata: Dict[str, Any]
# backend/core/ports/model_trainer_port.py
from typing import Protocol
from backend.core.valueobjects import ModelConfigVO, TrainingDataVO, TrainedModelArtifactVO
class ModelTrainerPort(Protocol):
"""Core Port: Interface for the model training capability needed by the core."""
def train(self, model_config: ModelConfigVO, data: TrainingDataVO) -> TrainedModelArtifactVO: ...
Infrastructure Layer: Implements the Port interface using a specific framework.
# backend/infra/training/sklearn_trainer_adapter.py
from sklearn.linear_model import LogisticRegression # Framework dependency
import pickle
from backend.core.ports.model_trainer_port import ModelTrainerPort # Depends on core port
from backend.core.valueobjects import ModelConfigVO, TrainingDataVO, TrainedModelArtifactVO
class SKLearnTrainerAdapter(ModelTrainerPort):
"""Infra Adaptor: Implements the core port using SKLearn."""
def train(self, model_config: ModelConfigVO, data: TrainingDataVO) -> TrainedModelArtifactVO:
# --- SKLearn specific logic here ---
model = LogisticRegression(**model_config.params)
model.fit(data.features, data.labels)
model_bytes = pickle.dumps(model)
# ------------------------------------
return TrainedModelArtifactVO(model_bytes=model_bytes, metadata={'framework': 'sklearn'})
Application Layer: Uses the Port via Dependency Injection.
# backend/application/services/train_model_service.py
from backend.core.ports.model_trainer_port import ModelTrainerPort # Depends on core port
from backend.core.valueobjects import ModelConfigVO, TrainingDataVO, TrainedModelArtifactVO
class TrainModelApplicationService:
def __init__(self, model_trainer: ModelTrainerPort): # DI the core port
self.model_trainer = model_trainer
def execute(self, config: ModelConfigVO, data: TrainingDataVO) -> TrainedModelArtifactVO:
# Use the port to trigger training
return self.model_trainer.train(config, data)
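For completeness, here is a minimal sketch of how these pieces might be wired together at a composition root. The backend/main.py path and the example config values are hypothetical; the point is that only this outermost code ever names the concrete adapter.

```python
# backend/main.py (hypothetical composition root)
import numpy as np

from backend.infra.training.sklearn_trainer_adapter import SKLearnTrainerAdapter
from backend.application.services.train_model_service import TrainModelApplicationService
from backend.core.valueobjects import ModelConfigVO, TrainingDataVO

# Only the composition root knows which concrete adapter is in play.
trainer = SKLearnTrainerAdapter()
service = TrainModelApplicationService(model_trainer=trainer)

config = ModelConfigVO(params={"max_iter": 200}, name="churn-classifier")
data = TrainingDataVO(
    features=np.array([[0.1, 0.2], [0.3, 0.4]]),
    labels=np.array([0, 1]),
)

artifact = service.execute(config, data)
print(artifact.metadata)  # {'framework': 'sklearn'}
```

Swapping scikit-learn for TensorFlow later means touching only this wiring, not the core or application layers.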
In my experience, placing ports in the core domain makes the most sense when the capability is truly fundamental to how your domain entities operate. This approach signals that the capability isn't just a utility but an essential part of the domain's definition: the core is explicitly declaring its need for external "muscle," which the infrastructure layer provides.
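To make that concrete, here is a rough sketch of what the Experiment aggregate from the diagram above might look like. The fields and the run_training method are illustrative assumptions, not a prescribed design; what matters is that the aggregate only ever sees the abstract port.

```python
# backend/core/aggregates/experiment.py (illustrative sketch)
from dataclasses import dataclass, field
from typing import List

from backend.core.ports.model_trainer_port import ModelTrainerPort
from backend.core.valueobjects import ModelConfigVO, TrainingDataVO, TrainedModelArtifactVO

@dataclass
class Experiment:
    """Aggregate root: owns its training runs and enforces its own lifecycle rules."""
    experiment_id: str
    config: ModelConfigVO
    artifacts: List[TrainedModelArtifactVO] = field(default_factory=list)

    def run_training(self, data: TrainingDataVO, trainer: ModelTrainerPort) -> TrainedModelArtifactVO:
        # The aggregate depends on the abstract port, never on a concrete framework.
        artifact = trainer.train(self.config, data)
        self.artifacts.append(artifact)
        return artifact
```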
Approach 2: The Capability is for Orchestration (Interface in Application)
This approach positions the interface in the application layer. This is suitable when the capability is needed by the application services to orchestrate use cases, but the core domain model itself doesn't have a direct conceptual dependency on the capability's interface. The core focuses purely on business rules and state transitions, providing data that the application layer then acts upon using external capabilities.
flowchart TD
subgraph "Core Domain"
ValueObjects["ModelConfigVO\nTrainingDataVO\nTrainedModelArtifactVO"]
Experiment["Experiment\n(Aggregate)"]
Experiment -->|uses| ValueObjects
end
subgraph "Application"
ModelTrainer["ModelTrainer\n(Interface)"]
TrainService["TrainModelService"]
TrainService -->|uses| Experiment
TrainService -->|uses| ModelTrainer
TrainService -->|uses| ValueObjects
end
subgraph "Infrastructure"
SKLearnAdapter["SKLearnTrainerAdapter"]
TensorFlowAdapter["TensorFlowTrainerAdapter"]
SKLearnAdapter -->|implements| ModelTrainer
TensorFlowAdapter -->|implements| ModelTrainer
end
Rationale: The application layer defines and executes use cases. These use cases often require interacting with infrastructure after engaging the core domain, or preparing data to send to the core domain or infrastructure. If a capability is needed to fulfill a specific use case workflow, defining the interface in the application layer makes sense.
Structure & Code Stubs:
Core Layer: Defines Value Objects and core domain logic, but not the Trainer interface.
# backend/core/valueobjects.py
# --- Same VOs as Approach 1 ---
@dataclass(frozen=True)
class ModelConfigVO: ...
@dataclass(frozen=True)
class TrainingDataVO: ...
@dataclass(frozen=True)
class TrainedModelArtifactVO: ...
Application Layer: Defines the interface for the capability and the services that use it.
# backend/application/interfaces/model_trainer.py
from typing import Protocol
from backend.core.valueobjects import ModelConfigVO, TrainingDataVO, TrainedModelArtifactVO # Depends on core VOs
class ModelTrainer(Protocol): # Note: Interface lives in application layer
"""Application Interface: Defines training needed by application services."""
def train(self, model_config: ModelConfigVO, data: TrainingDataVO) -> TrainedModelArtifactVO: ...
Infrastructure Layer: Implements the Application layer interface.
# backend/infra/training/sklearn_trainer_adapter.py
from sklearn.linear_model import LogisticRegression # Framework dependency
import pickle
from backend.application.interfaces.model_trainer import ModelTrainer # Depends on application interface
from backend.core.valueobjects import ModelConfigVO, TrainingDataVO, TrainedModelArtifactVO # Depends on core VOs
class SKLearnTrainerAdapter(ModelTrainer): # Implements the application interface
"""Infra Adaptor: Implements the application interface using SKLearn."""
def train(self, model_config: ModelConfigVO, data: TrainingDataVO) -> TrainedModelArtifactVO:
# --- SKLearn specific logic here (same as Approach 1) ---
# ...
# ---------------------------------------------------------
return TrainedModelArtifactVO(...)
Application Layer: Uses its own defined interface via Dependency Injection.
# backend/application/services/train_model_service.py
from backend.application.interfaces.model_trainer import ModelTrainer # Depends on application interface
from backend.core.valueobjects import ModelConfigVO, TrainingDataVO, TrainedModelArtifactVO # Depends on core VOs
class TrainModelApplicationService:
def __init__(self, model_trainer: ModelTrainer): # DI the application interface
self.model_trainer = model_trainer
def execute(self, config: ModelConfigVO, data: TrainingDataVO) -> TrainedModelArtifactVO:
# Use the interface to trigger training
return self.model_trainer.train(config, data)
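One practical payoff of defining the interface in the application layer is testability: because ModelTrainer is a typing.Protocol, any object with a matching train method satisfies it. Below is a minimal sketch of a unit test with an in-memory fake; the test file name and values are illustrative.

```python
# tests/test_train_model_service.py (illustrative sketch)
import numpy as np

from backend.application.services.train_model_service import TrainModelApplicationService
from backend.core.valueobjects import ModelConfigVO, TrainingDataVO, TrainedModelArtifactVO

class FakeTrainer:
    """Satisfies the ModelTrainer Protocol without touching any ML framework."""
    def train(self, model_config: ModelConfigVO, data: TrainingDataVO) -> TrainedModelArtifactVO:
        return TrainedModelArtifactVO(model_bytes=b"fake", metadata={"framework": "fake"})

def test_execute_returns_artifact_from_trainer():
    service = TrainModelApplicationService(model_trainer=FakeTrainer())
    config = ModelConfigVO(params={}, name="dummy")
    data = TrainingDataVO(features=np.zeros((2, 2)), labels=np.array([0, 1]))

    artifact = service.execute(config, data)

    assert artifact.metadata["framework"] == "fake"
```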
Comparing and Building Intuition
Here's a table summarizing the key distinctions between approaches:
| Feature | Approach 1: Port in Core (core/ports/) | Approach 2: Interface in App (application/interfaces/) |
|---|---|---|
| Interface Location | backend/core/ports/ | backend/application/interfaces/ |
| Defining Layer | Core Domain Layer | Application Layer |
| Dependent Layer | Core Domain depends on the Port | Application Layer depends on the Interface |
| Primary Driver | Core Domain Model itself needs the capability for its rules/integrity | Application Use Case needs the capability to orchestrate a process/interact externally |
| Core's Role | Core defines a needed capability of the outside world | Core provides data/rules; application layer uses external capabilities with it |
| Best Fit For... | Capabilities fundamental to core domain rules, validation, consistency | Capabilities for orchestration, external interaction based on core outcomes |
flowchart LR
subgraph "Decision Process"
Q1["Is capability integral to\ndomain entity behavior?"]
A1["Use Core Domain Port"]
A2["Use Application Interface"]
Q1 -->|Yes| A1
Q1 -->|No| A2
end
The core question is: Does the abstract need for this capability live with the fundamental business concepts (Core), or with the logic that sequences steps to fulfill a request (Application)?
Let's use adjacent examples:
- Sanctions Check (intuition for Port in Core): Imagine an Order aggregate with a strict rule: "Cannot approve if the customer is sanctioned." The Order.approve() method must enforce this, so it needs an is_sanctioned(customerId) -> bool check. The concept of this check is intertwined with the core Order's business rules; the core depends on the abstract ability to perform it. Therefore, define a SanctionsCheckerPort in core/ports/ (sketched below).
- Email Confirmation (intuition for Interface in Application): After an order is placed (a core domain event), the application needs to send an email. The core Order becoming 'Placed' is a domain concern, but sending a notification email isn't; it's an application workflow step triggered by the core event. The application service handling the event needs an EmailSender capability. Therefore, define an EmailSender interface in application/interfaces/ (also sketched below).
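Minimal sketches of those two abstractions, assuming the same project layout as the trainer example; the method signatures are illustrative, not taken from an existing codebase.

```python
# backend/core/ports/sanctions_checker_port.py (illustrative)
from typing import Protocol

class SanctionsCheckerPort(Protocol):
    """Core Port: Order.approve() relies on this check to enforce its rule."""
    def is_sanctioned(self, customer_id: str) -> bool: ...


# backend/application/interfaces/email_sender.py (illustrative)
from typing import Protocol

class EmailSender(Protocol):
    """Application Interface: used by the service reacting to 'Order Placed'."""
    def send(self, recipient: str, subject: str, body: str) -> None: ...
```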
When I've built ML-focused systems, I've typically organized them with services like api_services.py, event_services.py, or dedicated modules like api_services/training.py. This structure naturally aligns with an application layer focused on use case orchestration. These services correspond directly to API endpoints or event handlers because they implement the use cases those endpoints/handlers trigger. This is exactly what the application layer should do: translate external triggers into orchestrated actions involving the core domain and infrastructure.
I've found that placing interfaces like ModelTrainer in an application/interfaces/ folder works well when the application services are the primary consumers of this capability. In several ML projects I've worked on, this pattern has proven effective: it means the application layer depends on an abstraction for training, while the infrastructure layer provides the concrete implementation. This maintains the dependency arrow pointing inward, protecting application logic from direct framework coupling, just as a port in the core would protect the domain.
In my experience implementing both patterns, I've found that the choice between placing an interface in the core or application layer depends on where the abstraction's primary consumer lives and how it fits into the system's overall flow. Both approaches successfully decouple your system from infrastructure. The key is being intentional about which layer truly defines the need for the capability: is it fundamental to your domain concepts, or primarily an orchestration concern?
By understanding this distinction and consistently applying the Ports and Adaptors pattern, you can effectively integrate necessary frameworks like ML trainers without compromising the clarity and maintainability of your core domain and application logic.
Deep Dives
Directory Structures for ML Systems
When implementing these patterns in real-world ML systems, I've found it helpful to establish consistent directory structures that reflect the architectural boundaries. Let's map out folder structures within the common layers (backend/, core/, infra/, application/, endpoint/).
- backend/
  - core/ - Domain models, business logic interfaces (ports)
  - infra/ - Infrastructure implementations (adaptors)
  - application/ - Application services (use cases)
  - endpoint/ - External interfaces (APIs, CLI, etc.)
From here, I've seen two effective approaches to organizing subdirectories:
Type-Based Organization
backend/
├── core/
│   ├── aggregates/              # Domain model behavior
│   ├── entities/                # Core business entities
│   ├── valueobjects/            # Immutable value objects
│   └── ports/                   # Interfaces for infrastructure needs
├── infra/
│   ├── ml/
│   │   ├── sklearn_trainer.py      # Adapts scikit-learn to port/interface
│   │   └── tensorflow_trainer.py   # Adapts TensorFlow to port/interface
│   └── persistence/             # Database adapters
├── application/
│   ├── services/                # Use case handlers
│   ├── interfaces/              # Application-layer abstractions
│   └── commands/                # Input data structures
└── endpoint/                    # External interfaces (API, CLI)
Domain-Based Organization
backend/
├── core/
│   ├── model/           # Model domain (valueobjects.py, entities.py, ports.py)
│   └── experiment/      # Experiment domain (valueobjects.py, entities.py, ports.py)
├── infra/
│   ├── ml_frameworks/   # ML framework adapters for different domains
│   └── persistence/     # Database adapters by domain
├── application/
│   ├── training/        # Training-related services and interfaces
│   └── inference/       # Inference-related services and interfaces
└── endpoint/            # API endpoints by domain
Both structures have merits. The type-based approach is often intuitive initially but can scatter related code. The domain-based structure tends to scale better with complexity and team size by co-locating concepts. Adaptor implementations consistently belong in infra/ when they fulfill ports from core/, and in endpoint/ (or application/) when they translate external input into domain concepts. The key is consistency within your chosen structure.
Interface Design Patterns
The backend/core/aggregates/ or backend/core/interfaces/ folder is where the interface, or the main public methods, of your aggregate roots are defined. Whether this is a formal abc.ABC or typing.Protocol defining the aggregate's contract, or the concrete aggregate class itself serving as its own interface, this is the heart of your domain behavior, and it lives independently of infrastructure.
In backend/core/ports/, you define the abstract interfaces (using abc.ABC or typing.Protocol) for the infrastructure services that the core needs to interact with (like ML frameworks, persistence, or external APIs). These port definitions declare what the core needs, without caring how it's implemented.
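If you prefer abc.ABC over typing.Protocol (for example, to get an error at instantiation time when an adapter forgets to implement a method), the trainer port could equally be written as below; this is a sketch reusing the value objects from earlier.

```python
# backend/core/ports/model_trainer_port.py (abc.ABC variant, illustrative)
from abc import ABC, abstractmethod

from backend.core.valueobjects import ModelConfigVO, TrainingDataVO, TrainedModelArtifactVO

class ModelTrainerPort(ABC):
    """Core Port: same contract as the Protocol version, enforced via inheritance."""

    @abstractmethod
    def train(self, model_config: ModelConfigVO, data: TrainingDataVO) -> TrainedModelArtifactVO:
        """Train a model and return a serialized artifact with metadata."""
        ...
```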
The flow works like this:
- core/aggregates contains your core domain logic and behavior definitions
- This logic needs infrastructure capabilities, so it depends on the interfaces defined in core/ports
- The implementations of those core/ports interfaces live in infra/
- These implementations get dependency-injected into the classes defined in core/aggregates, or into your application services
For ML frameworks specifically, I recommend designing interfaces that:
- Use domain terminology rather than framework-specific concepts
- Accept and return domain objects (value objects or entities)
- Hide framework-specific error handling behind domain-appropriate exceptions
- Include versioning information in training metadata for reproducibility
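Here is a rough sketch of what those recommendations can look like together, assuming a hypothetical TrainingFailedError domain exception alongside the value objects from earlier.

```python
# backend/core/ports/model_trainer_port.py (illustrative, recommendation-oriented variant)
from typing import Protocol

from backend.core.valueobjects import ModelConfigVO, TrainingDataVO, TrainedModelArtifactVO

class TrainingFailedError(Exception):
    """Domain-level exception: adapters translate framework errors into this."""

class ModelTrainerPort(Protocol):
    def train(self, model_config: ModelConfigVO, data: TrainingDataVO) -> TrainedModelArtifactVO:
        """Train a model described purely in domain terms.

        - Accepts and returns domain value objects only.
        - Raises TrainingFailedError instead of framework-specific exceptions.
        - Implementations should record framework name/version in the
          artifact metadata for reproducibility.
        """
        ...
```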
Here's where the beauty of abstraction really shines: during code reviews and stakeholder discussions. When collaborating with business teams or product managers, we can focus solely on the interfaces, which speak the language of the business problem. For more technical audiences, we might explore the entity and value object design, which defines the precise "knobs and levers" available to the system. But notice what never enters the conversation: whether we're using PostgreSQL or MySQL, TensorFlow or PyTorch. Those implementation details are safely contained behind our interfaces.
ML-Specific Considerations
When working with ML frameworks, there are some unique considerations that affect how we design our ports and adapters:
Versioning and Reproducibility
For ML training interfaces, consider explicitly tracking framework versioning:
import sklearn
from datetime import datetime
from backend.core.ports.model_trainer_port import ModelTrainerPort
from backend.core.valueobjects import ModelConfigVO, TrainingDataVO, TrainedModelArtifactVO

class SKLearnTrainerAdapter(ModelTrainerPort):
"""Implements training using scikit-learn."""
def __init__(self):
self._framework_info = {
"name": "scikit-learn",
"version": sklearn.__version__,
"capabilities": ["linear_models", "tree_models"]
}
def train(self, model_config: ModelConfigVO, data: TrainingDataVO) -> TrainedModelArtifactVO:
# ... training implementation ...
# Include framework information in model metadata
return TrainedModelArtifactVO(
model_bytes=model_bytes,
metadata={
"framework": self._framework_info,
"training_date": datetime.now().isoformat(),
}
)
Performance Optimization Hints
For ML systems with strict performance requirements, you might extend interfaces with optional performance hints:
from typing import Any, Dict, Optional, Protocol
from backend.core.valueobjects import ModelConfigVO, TrainingDataVO, TrainedModelArtifactVO

class ModelTrainer(Protocol):
def train(
self,
model_config: ModelConfigVO,
data: TrainingDataVO,
performance_hints: Optional[Dict[str, Any]] = None
) -> TrainedModelArtifactVO:
"""
Train a model with optional performance hints like:
- use_gpu: bool
- batch_size: int
- precision: str ('float32', 'float16', etc.)
"""
...
This allows the application layer to provide optimization guidance without coupling to specific framework implementations.
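As a small illustration of that idea, an adapter can read hints defensively and fall back to safe defaults when they are absent. The helper below is a standalone sketch of that hint-handling, not part of any framework API.

```python
from typing import Any, Dict, Optional

def resolve_training_hints(performance_hints: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Stand-in for the hint resolution an adapter might do inside train()."""
    hints = performance_hints or {}
    return {
        "use_gpu": bool(hints.get("use_gpu", False)),
        "batch_size": int(hints.get("batch_size", 32)),
        "precision": str(hints.get("precision", "float32")),
    }

print(resolve_training_hints())                                           # all defaults
print(resolve_training_hints({"use_gpu": True, "precision": "float16"}))  # partial hints
```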
Decoupling core logic from infrastructure via interfaces (ports) is foundational for building testable and maintainable ML systems. Defining ports in the core ensures your domain doesn't know about concrete ML framework details. Dependency Injection then wires the system together at the composition root, typically outside the core layers.