How to split a Hugging Face dataset in streaming mode without loading it into memory?
I’m working with the Hugging Face datasets library and need to split a dataset into training and validation sets. My main requirement is that the dataset be processed in streaming mode, since I don’t want to load it entirely into memory.
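A minimal sketch of one way to do this: an `IterableDataset` returned by `load_dataset(..., streaming=True)` has no `train_test_split()`, but you can carve out lazy splits with `take()` and `skip()`. The dataset name, split sizes, and shuffle buffer below are placeholders, not anything from the question.

```python
from datasets import load_dataset

# Streaming mode returns an IterableDataset that yields examples lazily
# instead of materializing the whole dataset in memory.
streamed = load_dataset("c4", "en", split="train", streaming=True)  # placeholder dataset

# Approximate shuffle over a fixed-size buffer (never loads the full dataset).
shuffled = streamed.shuffle(seed=42, buffer_size=10_000)

validation_size = 1_000
validation_ds = shuffled.take(validation_size)  # first N examples
train_ds = shuffled.skip(validation_size)       # everything after the first N

# Both splits are still iterators; nothing is read until you consume them.
for example in validation_ds:
    ...
```

Note that `take()`/`skip()` give a deterministic prefix/suffix split of the (shuffled) stream rather than a random per-example assignment, which is usually acceptable when the stream has been shuffled first.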
How to apply a .map() transformation to a Hugging Face dataset and keep it as a lazy iterator, in streaming mode, without loading it into memory?
I’m currently working with the Hugging Face datasets library and need to apply transformations to multiple datasets (such as ds_khan and ds_mathematica) using the .map() function, but in a way that mimics streaming, i.e. without loading the entire datasets into memory. In particular, I want to interleave these transformed datasets while keeping the whole pipeline as lazy as possible, as with streaming=True.
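A sketch of how this can work: when the datasets are loaded with `streaming=True`, `.map()` on an `IterableDataset` is applied lazily at iteration time, and `interleave_datasets()` also operates lazily on iterable datasets. The dataset paths, the `question`/`answer` field names, and the `to_prompt_completion` helper below are all hypothetical placeholders for whatever ds_khan and ds_mathematica actually contain.

```python
from datasets import load_dataset, interleave_datasets

# Placeholder paths: substitute the real repositories for ds_khan and ds_mathematica.
ds_khan = load_dataset("path/to/khan_dataset", split="train", streaming=True)
ds_mathematica = load_dataset("path/to/mathematica_dataset", split="train", streaming=True)

def to_prompt_completion(example):
    # Hypothetical transformation mapping both sources onto a common schema;
    # adjust the field names to match the actual datasets.
    return {"text": example["question"] + "\n" + example["answer"]}

# On an IterableDataset, .map() is lazy: the function only runs
# when examples are actually pulled from the iterator.
ds_khan = ds_khan.map(to_prompt_completion)
ds_mathematica = ds_mathematica.map(to_prompt_completion)

# interleave_datasets() samples lazily from each source according to
# the given probabilities, so the mix stays a stream as well.
mixed = interleave_datasets(
    [ds_khan, ds_mathematica],
    probabilities=[0.5, 0.5],
    seed=42,
)

# Nothing has been loaded yet; iterating drives both the map() calls
# and the interleaving one example at a time.
for example in mixed:
    print(example["text"])
    break
```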