Weird first word when reading a file in Python
I’m reading the first Harry Potter book as a UTF-8 file in Python (I’ve tried both the `io` and `codecs` packages), and the first word that is read, which should be “harry” (lowercase because I first word-tokenize the entire corpus with `nltk`), comes back as `'\ufeffharry'`. I’m guessing this has to do with the encoding, and perhaps with it being the very first word in the file.
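To check that guess, here is a minimal reproduction. It assumes the file starts with a UTF-8 byte order mark (the bytes `EF BB BF`, which decode to U+FEFF); the temporary file and its contents are a hypothetical stand-in for the real book, and `str.split()` stands in for the `nltk` tokenizer. If the assumption holds, opening the file with the standard `utf-8-sig` codec instead of `utf-8` should drop the mark.

```python
import os
import tempfile

# Hypothetical stand-in for the book file: UTF-8 text preceded by a BOM.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as f:
    f.write(b"\xef\xbb\xbfHarry Potter and the Sorcerer's Stone")
    path = f.name

# Plain "utf-8" keeps the BOM as the first character (U+FEFF).
with open(path, encoding="utf-8") as f:
    plain_first = f.read().lower().split()[0]

# "utf-8-sig" strips a leading BOM if one is present.
with open(path, encoding="utf-8-sig") as f:
    sig_first = f.read().lower().split()[0]

os.remove(path)

print(repr(plain_first))
print(repr(sig_first))
```

If the first `repr` shows a leading `\ufeff` and the second does not, the extra characters come from the BOM at the start of the file rather than from the tokenizer.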