On AzureML Datastores and Storage Indirection
Migrating our AI/ML workloads over to AzureML for the past few days has forced me to think carefully about storage abstractions and their practical implications. The datastore concept, while simple in principle, reveals interesting design choices when you dig into the mechanics.
The Abstraction Layer
A datastore in AzureML is fundamentally a named pointer to existing storage: typically a blob container in Azure Storage or an ADLS Gen2 file system. It doesn't provision new resources; it registers existing ones. This design decision makes sense from an operational perspective. You maintain control over your storage while AzureML provides a consistent interface for data access.
The relationship maps cleanly: one datastore points to one container. No automatic partitioning, no magic folder creation. You create the container, AzureML connects to it. The simplicity is deliberate and, in practice, reliable.
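To make the indirection concrete, here is a minimal sketch with the v2 azure-ai-ml SDK (the subscription, resource group, workspace, and datastore names are placeholders): fetching a registered datastore returns nothing more than the pointer itself, i.e. which account and container it binds to and what kind of credential it carries.

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Connect to an existing workspace (identifiers below are placeholders)
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# A datastore is registered metadata: account, container, credential type
ds = ml_client.datastores.get("datastore_alpha")
print(ds.account_name, ds.container_name, type(ds.credentials).__name__)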
Storage Patterns in Practice
I've been experimenting with different organizational patterns. The most straightforward approach uses a single storage account with multiple containers, one per logical data boundary. For client work, this translates to one container per client, each registered as a separate datastore.
storage_account/
├── client_alpha/       → datastore_alpha
├── client_beta/        → datastore_beta
└── shared_resources/   → datastore_shared
The isolation is clean at the storage level. Access control happens through Azure's IAM system, not through AzureML's abstractions. This keeps the security model straightforward and auditable.
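Registering that layout is a sketchable loop (v2 SDK; the storage account name is a placeholder and `ml_client` is an authenticated MLClient). No credentials are attached, so access stays governed by Azure IAM on each container:

from azure.ai.ml.entities import AzureBlobDatastore

containers = {
    "client_alpha": "datastore_alpha",
    "client_beta": "datastore_beta",
    "shared_resources": "datastore_shared",
}

for container_name, datastore_name in containers.items():
    ml_client.datastores.create_or_update(
        AzureBlobDatastore(
            name=datastore_name,
            account_name="storage_account",  # placeholder account name
            container_name=container_name,
            # no credentials supplied: identity-based access via Azure IAM
        )
    )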
Path Resolution and Data Layout
Within a datastore's container, you control the hierarchy entirely. AzureML doesn't impose structure or constraints on your blob organization. When referencing data, you specify paths explicitly:
azureml://datastores/client_alpha/paths/market_data/2024/Q1/trades.parquet
The path resolution is deterministic. No magic, no auto-discovery. You specify exactly what you want, when you want it.
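For instance, that URI can be handed straight to a job input and AzureML resolves it against the datastore at run time. A sketch with the v2 SDK; the script, environment, and compute names are placeholders:

from azure.ai.ml import Input, command
from azure.ai.ml.constants import AssetTypes

job = command(
    code="./src",  # placeholder folder containing train.py
    command="python train.py --trades ${{inputs.trades}}",
    inputs={
        "trades": Input(
            type=AssetTypes.URI_FILE,
            path="azureml://datastores/client_alpha/paths/market_data/2024/Q1/trades.parquet",
        )
    },
    environment="azureml:my-environment@latest",  # placeholder environment
    compute="cpu-cluster",                        # placeholder compute target
)
# ml_client.create_or_update(job)  # submit with an authenticated MLClient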
Operational Considerations
The datastore abstraction removes some friction from data access patterns, but it doesn't eliminate the underlying complexity of distributed storage. A few observations:
Connection strings and credentials are stored in the datastore registration. This creates a dependency between your ML workflows and your storage access patterns. If you rotate keys or change access methods, you need to update datastore configurations (a rotation sketch follows below).
Cross-workspace data sharing works by registering the same storage location multiple times. Each workspace maintains its own logical reference, but they point to the same underlying blobs. This is useful for shared datasets but requires coordination on access patterns.
The failure modes are predictable: if the underlying container is deleted or becomes inaccessible, jobs fail cleanly. The datastore doesn't provide any buffering or caching layer; it's a pure indirection.
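On the credential point above, a minimal rotation sketch (v2 SDK; names reused from the examples in this post, `ml_client` assumed): re-issuing the registration with the new SAS token overwrites the stored credential, and subsequent jobs pick it up.

from azure.ai.ml.entities import AzureBlobDatastore, SasTokenConfiguration

# Re-register the datastore with a rotated SAS token; create_or_update replaces the old one
ml_client.datastores.create_or_update(
    AzureBlobDatastore(
        name="datastore_alpha",
        account_name="storage_account",  # placeholder account name
        container_name="client_alpha",
        credentials=SasTokenConfiguration(sas_token="<rotated-sas-token>"),
    )
)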
MLTable Integration
The MLTable concept sits on top of datastores, providing schema and transformation metadata for structured data access. It's essentially a YAML-based descriptor that tells AzureML how to interpret your blobs as tabular data.
The separation is clean: datastores handle storage connectivity, MLTables handle data interpretation. You can have multiple MLTables pointing to the same datastore but interpreting the data differently.
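A sketch of that with the standalone mltable package (the folder path reuses the example above; the column names are hypothetical): two table definitions over the same datastore folder, one reading everything, one trimmed to a couple of columns.

import mltable

folder = {"folder": "azureml://datastores/datastore_alpha/paths/market_data/2024/Q1/"}

# Interpretation 1: read every delimited file in the folder as-is
full_table = mltable.from_delimited_files(paths=[folder], delimiter=",")

# Interpretation 2: same blobs, reduced to two (hypothetical) columns
slim_table = mltable.from_delimited_files(paths=[folder], delimiter=",").keep_columns(
    ["timestamp", "price"]
)

# Materializing requires credentials that can resolve the azureml:// URI
df = slim_table.to_pandas_dataframe()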
Design Reflections
The datastore abstraction reflects a particular philosophy about data management in ML workflows. It assumes you want explicit control over storage, clear boundaries between logical data sets, and predictable access patterns.
This approach trades some convenience for transparency. You can't accidentally create storage resources, and you can't accidentally share data across boundaries you didn't intend. The system forces you to be explicit.
To me, this explicitness is valuable. A simple repository layer on top of it covers any "extras" I need. The abstraction layer is thin enough that you can reason about the underlying storage behavior without surprises.
Implementation Notes
In practice, I've found that designing your container structure upfront pays dividends.
Frequently Asked Questions
Can a datastore point to local files or external sources? Not directly. Datastores are designed to reference cloud-based storage: typically Azure Blob Storage, ADLS Gen2, Azure Files, or OneLake. You can't register a local directory as a datastore, but you can upload local files into a datastore using the SDK:
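# azureml-core (v1) SDK method; `datastore` here is a registered blob datastore object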
datastore.upload(src_dir="./data", target_path="raw/2024/Q2")
For non-Azure sources (like public URLs), you'll need to handle the download manually or register a Data asset with a remote path, depending on your use case.
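For the remote-path option, a sketch of registering an externally hosted file as a v2 Data asset (the asset name and URL are placeholders):

from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# ml_client: an authenticated azure.ai.ml.MLClient
remote_file = Data(
    name="external-reference-data",                   # placeholder asset name
    type=AssetTypes.URI_FILE,
    path="https://example.com/public/reference.csv",  # placeholder public URL
    description="Externally hosted file referenced as a data asset",
)
ml_client.data.create_or_update(remote_file)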
What exactly is an MLTable? An MLTable is a YAML-based data asset that describes how to interpret blobs as structured data: file formats, column types, transformations, and more. It lets you decouple storage from schema logic and provides a consistent interface for training and evaluation pipelines. For example:
paths:
  - folder: azureml://datastores/datastore_alpha/paths/market_data/2024/Q1/
transformations:
  - read_delimited:
      delimiter: ","
      encoding: utf8
Think of it as metadata-plus-query-engine for your tabular data.
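As a usage sketch, assuming the YAML above is saved as a file named MLTable inside a folder called ./market_data_q1, consuming it takes two lines:

import mltable

# Load the definition from the folder containing the MLTable file, then materialize it
tbl = mltable.load("./market_data_q1")
df = tbl.to_pandas_dataframe()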
Can I use the same datastore in multiple AzureML workspaces? Yes. You can register the same blob container as a datastore in any number of workspaces. Each workspace maintains its own datastore registry, so you'll need to re-register manually or via script.
# v2 azure-ai-ml SDK; `ml_client` is an authenticated MLClient for the target workspace.
# No credentials are supplied, so data access falls back to identity-based auth
# (for example, a compute's managed identity), governed by Azure IAM.
from azure.ai.ml.entities import AzureBlobDatastore

ml_client.datastores.create_or_update(
    AzureBlobDatastore(
        name="shared_data",
        account_name="mystorage",
        container_name="shared_resources",
    )
)
This is useful for shared datasets that span environments or teams.
How is access to a datastore controlled? Access is governed at the storage layer, via Azure IAM, access keys, SAS tokens, or identity-based auth (like managed identity or service principal). AzureML doesn't add its own ACL layer on top. The datastore acts as a named credential binding:
# v2 azure-ai-ml SDK; the SAS token is stored with the datastore registration.
from azure.ai.ml.entities import AzureBlobDatastore, SasTokenConfiguration

datastore = AzureBlobDatastore(
    name="client_alpha",
    account_name="storage_xyz",
    container_name="client_alpha",
    credentials=SasTokenConfiguration(sas_token="..."),
)
ml_client.datastores.create_or_update(datastore)
Auditability and security posture are inherited from how you manage access on the storage account itself.
What happens if the container backing a datastore is deleted or changed? Jobs will fail cleanly. The datastore is a logical pointer; it doesn't protect against accidental deletion or mutation of the underlying blobs. There's no caching or shadow copy mechanism in place.
Expect clear and deterministic errors if the path or blob doesn't exist:
FileNotFoundError: 'market_data/2024/Q1/trades.parquet' not found in datastore 'client_alpha'