Do Spark / PySpark ML tree-based algorithms require one-hot encoding?
Tree-based algorithms can, in principle, handle nominal (categorical) features without one-hot encoding, but whether a given library supports this is implementation-specific. I found old answers on Stack Overflow saying that the older MLlib tree algorithms could use the metadata produced by a StringIndexer to treat indexed columns as categorical. Is that still the case in the modern pyspark.ml API? And does VectorAssembler preserve that metadata?