
Tag Archive for python, python-3.x, pyspark, rdd, tf-idf

Saving and Loading RDD (pyspark) to pickle file is changing order of SparseVectors

I trained tf-idf on a pre-tokenized (unigram tokenizer) dataset, which I converted from list[list(token1, token2, token3, ...)] to an RDD, using pyspark’s HashingTF and IDF implementations. I tried to save the RDD with the tf-idf values, but after saving the output to a file and loading it back, the order has changed. The loaded RDD contains the same SparseVectors as the saved one, but a seemingly random vector now comes first, and the rest follow in their proper order after it.

Saving and Loading RDD (pyspark) to pickle file is randomizing SparseVectors

I trained tf-idf on a pre-tokenized (unigram tokenizer) dataset, which I converted from list[list(token1, token2, token3, ...)] to an RDD, using pyspark’s HashingTF and IDF implementations. I tried to save the RDD with the tf-idf values, but after saving the output to a file and loading it back, the loaded RDD contains the same SparseVectors as the saved one with their order randomized.
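
One possible workaround, sketched here under the assumption that the original order matters downstream and that the save/load round trip cannot be relied on to preserve it: store each vector together with its position, then sort by that position after loading. The helper names (`attach_index`, `save_indexed`, `load_in_order`, `restore_order`) are hypothetical and not part of pyspark. The RDD methods used (`zipWithIndex`, `saveAsPickleFile`, `sortByKey`, `values`) and `SparkContext.pickleFile` are standard pyspark calls. This does not explain why the order changes; it only makes the result independent of it.

```python
def attach_index(rdd):
    """Pair each element with its position as (index, element).

    zipWithIndex yields (element, index), so the pair is swapped to make
    the index usable as a sort key later.
    """
    return rdd.zipWithIndex().map(lambda pair: (pair[1], pair[0]))


def save_indexed(rdd, path):
    """Save the RDD as a pickle file with the original position attached."""
    attach_index(rdd).saveAsPickleFile(path)


def load_in_order(sc, path):
    """Load the indexed pickle file and return elements in their saved order.

    `sc` is assumed to be an active SparkContext.
    """
    return sc.pickleFile(path).sortByKey().values()


def restore_order(indexed_items):
    """Driver-side equivalent for a small, already collected list of
    (index, element) pairs: sort by index and drop it."""
    return [item for _, item in sorted(indexed_items, key=lambda p: p[0])]
```

With a tf-idf RDD called, say, `tfidf_rdd` (a placeholder name), usage would be `save_indexed(tfidf_rdd, path)` followed later by `load_in_order(sc, path)`. The sort adds a shuffle on load, which is the price of not depending on the order in which the part files are read back.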