Weird first word when reading a file in Python

I’m reading the first Harry Potter book as a UTF-8 file in Python (I’ve tried both the io and codecs packages). The first word that is read should be “harry” (lowercase, because I word-tokenize the entire corpus with nltk first), but it comes back as '\ufeffharry'. I’m guessing this has to do with the encoding, and perhaps with the fact that it’s the first word of the sentence.
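For what it's worth, the stray prefix looks like U+FEFF, the byte order mark (BOM) that some editors write at the very start of a UTF-8 file. Here is a minimal sketch that reproduces the symptom and one possible way around it, assuming the file really does start with a BOM. The sample text and temporary file are hypothetical stand-ins for the book, and plain `str.split()` stands in for the nltk tokenizer so the snippet has no third-party dependencies.

```python
import codecs
import os
import tempfile

# Hypothetical stand-in for the book: a small UTF-8 file that starts with a BOM.
sample_text = "Harry Potter and the Sorcerer's Stone"
path = os.path.join(tempfile.mkdtemp(), "book.txt")
with open(path, "wb") as f:
    f.write(codecs.BOM_UTF8 + sample_text.encode("utf-8"))

# Decoding as plain "utf-8" keeps the BOM as the first character of the text.
with open(path, encoding="utf-8") as f:
    with_bom = f.read()

# The "utf-8-sig" codec skips a leading BOM if one is present.
with open(path, encoding="utf-8-sig") as f:
    without_bom = f.read()

# Simple whitespace tokenization (an assumption; the original post uses nltk).
first_plain = with_bom.lower().split()[0]
first_clean = without_bom.lower().split()[0]

print(repr(first_plain))
print(repr(first_clean))
```

If the file has no BOM, `utf-8-sig` should behave like plain `utf-8` when reading, so switching the encoding is usually a low-risk change.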