Stream data from an Azure datastore into a datasets object with `from_generator`

I have thousands of image files in an Azure datastore. I’d like to stream them into a datasets object in an AzureML Jupyter notebook instance.

I’m using the code below, but it is still running after 45 minutes with no end in sight:

from azureml.fsspec import AzureMachineLearningFileSystem
from datasets import Dataset
from PIL import Image

fs = AzureMachineLearningFileSystem('azureml://subscriptions/<sub_id>/resourcegroups/<rg_name>/workspaces/<ws_name>/datastores/<ds_name>')

# List the files in the subdirectory, skipping the first entry
file_lst = fs.ls('datastore_subdir/')[1:]

def process_image():
    # Yield one record per file; each image is opened directly from the remote file handle
    for file in file_lst:
        img = Image.open(fs.open(file))
        yield {'image': img}

image_ds = Dataset.from_generator(process_image)

I already have this working using the answer from this question. However, I would like to understand whether there’s a way to get it to work using the generator approach.

For comparison, the following completes in 1.5 minutes:

import io

img_lst = []

# Read each file fully into memory, then decode it from the in-memory buffer
for counter, file in enumerate(fs.ls('datastore_subdir/')[1:]):
    with fs.open(file) as f:
        print(counter)
        img = Image.open(io.BytesIO(f.read()))
        img_lst.append(img)
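
One visible difference between the two snippets is that the fast loop reads each file completely inside a `with` block, while the generator hands an open remote file handle to `Image.open`. The sketch below is an untested idea for carrying the read-fully pattern over to a generator. The names `iter_image_records` and `InMemoryFS` are hypothetical, and `InMemoryFS` is only a stand-in so the example runs without Azure. Whether this actually removes the slowdown with `Dataset.from_generator` is an assumption I have not verified.

```python
import io
from typing import Any, Dict, Iterable, Iterator


def iter_image_records(fs: Any, paths: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one record per file, reading each file fully before yielding."""
    for path in paths:
        # Read the whole file and close the handle before the record is yielded,
        # mirroring what the fast loop does with f.read().
        with fs.open(path, "rb") as f:
            data = f.read()
        yield {"path": path, "image_bytes": data}


class InMemoryFS:
    """Hypothetical stand-in that mimics only the ls/open calls used above."""

    def __init__(self, files: Dict[str, bytes]) -> None:
        self._files = dict(files)

    def ls(self, prefix: str) -> list:
        # Return the stored paths under the prefix in a stable order.
        return sorted(p for p in self._files if p.startswith(prefix))

    def open(self, path: str, mode: str = "rb") -> io.BytesIO:
        # BytesIO supports the context-manager protocol, like a real file handle.
        return io.BytesIO(self._files[path])


demo_fs = InMemoryFS({
    "datastore_subdir/a.png": b"\x89PNG-a",
    "datastore_subdir/b.png": b"\x89PNG-b",
})
records = list(iter_image_records(demo_fs, demo_fs.ls("datastore_subdir/")))
```

With the real `fs` and `file_lst`, the same generator could presumably be wrapped in a zero-argument function, the way `process_image` is, and the bytes decoded with `Image.open(io.BytesIO(...))` before yielding.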
