I have thousands of image files in an Azure datastore. I'd like to stream them into a `datasets` `Dataset`
object in an AzureML Jupyter notebook instance.
I'm using the code below, but after 45 minutes it is still running with no end in sight:
from azureml.fsspec import AzureMachineLearningFileSystem
from datasets import Dataset
from PIL import Image

fs = AzureMachineLearningFileSystem('azureml://subscriptions/<sub_id>/resourcegroups/<rg_name>/workspaces/<ws_name>/datastores/<ds_name>')
file_lst = fs.ls('datastore_subdir/')[1:]

def process_image():
    for file in file_lst:
        img = Image.open(fs.open(file))
        yield {'image': img}

image_ds = Dataset.from_generator(process_image)
I already have this working using the answer from this question. However, I'd like to understand whether there's a way to get it to work using the generator approach.
For comparison, the following completes in 1.5 minutes:
import io

img_lst = []
for counter, file in enumerate(fs.ls('datastore_subdir/')[1:]):
    with fs.open(file) as f:
        print(counter)
        img = Image.open(io.BytesIO(f.read()))
        img_lst.append(img)
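One difference between the two snippets worth noting: `Image.open` is lazy and keeps reading from the underlying file handle on demand, while `f.read()` pulls all the bytes up front into a seekable in-memory buffer. The sketch below illustrates that eager-read pattern in isolation, using a tiny locally generated PNG as a stand-in for a datastore file (the `eager_open` helper is hypothetical, not part of the original code):

```python
import io
from PIL import Image

# Build a small in-memory PNG to stand in for a file read from the datastore.
buf = io.BytesIO()
Image.new("RGB", (4, 4), color=(255, 0, 0)).save(buf, format="PNG")
raw = buf.getvalue()

def eager_open(data: bytes) -> Image.Image:
    # Hand PIL a seekable in-memory buffer of the already-read bytes,
    # so the original (possibly remote) file handle can be closed.
    img = Image.open(io.BytesIO(data))
    img.load()  # force pixel decode now instead of lazily on first access
    return img

img = eager_open(raw)
print(img.size)  # -> (4, 4)
```

In the fast loop above, the same pattern means each remote file is fetched once in full, rather than PIL issuing incremental reads against the remote handle.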