Saving and loading an RDD (pyspark) to a pickle file changes the order of SparseVectors
I trained tf-idf on a pre-tokenized (unigram tokenizer) dataset that I converted from list[list(token1, token2, token3, ...)] to an RDD, using pyspark’s HashingTF and IDF implementations. I then saved the RDD of tf-idf values to a pickle file and loaded it back. The loaded RDD contains the same SparseVectors as the saved one, but in a different order: a seemingly random vector comes first, and the rest follow in their original order.