Question

I’ve got a CSV file whose strings are encoded inconsistently:

1956;MathÃ© AltÃ©ry;Le Temps Perdu
1963;Alain BarriÃ¨re;Elle Ã©tait si jolie
2024;Gérard truc;Mon éden
2024;GÃ¥te;Ulveham
2023;GergÅ‘ RÃ¡cz;ZÃ© SzabÃ³
2210;Anna MjÃ¶ll;SjÃºbÃdÃº
2200;Ovidijus VyÅ¡niauskas;LopÅ¡inÄ— mylimai

Some accents in the French names appear as UTF-8 byte sequences (Ã©, Ã¨), while others appear as plain characters (‘é’; ASCII?). Also, some characters (‘ő’) don’t seem to exist in UTF-8 but do in Windows-1252.

What I’ve done is convert everything from UTF-8 to Windows-1252:

mb_convert_encoding($text, 'Windows-1252','UTF-8');

which works like a charm, except for the French accents, which are all detected as UTF-8.
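To see what the conversion above actually does, it may help to compare raw bytes rather than what the terminal displays. The following is only a minimal sketch, assuming the mbstring extension is loaded; the `$samples` and `$report` names are made up for illustration, and the strings are written with `\u{...}` escapes so the result does not depend on how the script file itself is saved.

```php
<?php
// Sketch: compare the raw bytes before and after the conversion.
// Assumption: the mbstring extension is available (PHP 7+ for "\u{...}").
$samples = [
    'double-encoded' => "Math\u{C3}\u{A9}", // shows up as "MathÃ©" in the CSV
    'correct UTF-8'  => "G\u{E9}rard",      // shows up as "Gérard" in the CSV
];

$report = [];
foreach ($samples as $label => $value) {
    $converted = mb_convert_encoding($value, 'Windows-1252', 'UTF-8');
    $report[$label] = [
        'before'     => bin2hex($value),
        'after'      => bin2hex($converted),
        // Is the converted byte string itself well-formed UTF-8?
        'valid_utf8' => mb_check_encoding($converted, 'UTF-8'),
    ];
}

foreach ($report as $label => $r) {
    printf("%-15s before=%-18s after=%-14s valid_utf8=%s\n",
        $label, $r['before'], $r['after'], $r['valid_utf8'] ? 'yes' : 'no');
}
```

The double-encoded value should come out ending in the two bytes c3 a9, which is the UTF-8 form of é, while the correctly encoded value should come out with the single byte e9, which is not valid UTF-8 on its own and so cannot be displayed by a UTF-8 terminal.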

Here is sample code:

$csv = array("1956;MathÃ© AltÃ©ry;Le Temps Perdu",
"1963;Alain BarriÃ¨re;Elle Ã©tait si jolie",
"2024;Gérard truc;Mon éden",
"2024;GÃ¥te;Ulveham",
"2023;GergÅ‘ RÃ¡cz;ZÃ© SzabÃ³",
"2210;Anna MjÃ¶ll;SjÃºbÃdÃº",
"2200;Ovidijus VyÅ¡niauskas;LopÅ¡inÄ— mylimai");

foreach ($csv as $row) {
        echo "row=$row\n";
        // Split the row into its semicolon-separated fields
        $row_values = explode(';', $row);

        foreach ($row_values as $value) {
                // Show each field, its detected encoding, and the converted result
                printf("value=%-30s encoding=%-10s W1252=%-20s\n",
                        $value,
                        mb_detect_encoding($value),
                        mb_convert_encoding($value, 'Windows-1252', 'UTF-8'));
        }

        echo "\n";
}

# php ./encodings.php
row=1956;MathÃ© AltÃ©ry;Le Temps Perdu
value=1956                           encoding=ASCII      W1252=1956
value=MathÃ© AltÃ©ry             encoding=UTF-8      W1252=Mathé Altéry
value=Le Temps Perdu                 encoding=ASCII      W1252=Le Temps Perdu

row=1963;Alain BarriÃ¨re;Elle Ã©tait si jolie
value=1963                           encoding=ASCII      W1252=1963
value=Alain BarriÃ¨re              encoding=UTF-8      W1252=Alain Barrière
value=Elle Ã©tait si jolie         encoding=UTF-8      W1252=Elle était si jolie

row=2024;Gérard truc;Mon éden
value=2024                           encoding=ASCII      W1252=2024
value=Gérard truc                   encoding=UTF-8      W1252=G▒rard truc
value=Mon éden                      encoding=UTF-8      W1252=Mon ▒den

row=2024;GÃ¥te;Ulveham
value=2024                           encoding=ASCII      W1252=2024
value=GÃ¥te                        encoding=UTF-8      W1252=Gåte
value=Ulveham                        encoding=ASCII      W1252=Ulveham

row=2023;GergÅ‘ RÃ¡cz;ZÃ© SzabÃ³
value=2023                           encoding=ASCII      W1252=2023
value=GergÅ‘ RÃ¡cz              encoding=UTF-8      W1252=Gergő Rácz
value=ZÃ© SzabÃ³                 encoding=UTF-8      W1252=Zé Szabó

row=2210;Anna MjÃ¶ll;SjÃºbÃdÃº
value=2210                           encoding=ASCII      W1252=2210
value=Anna MjÃ¶ll                  encoding=UTF-8      W1252=Anna Mjöll
value=SjÃºbÃdÃº               encoding=UTF-8      W1252=Sjúbídú

row=2200;Ovidijus VyÅ¡niauskas;LopÅ¡inÄ— mylimai
value=2200                           encoding=ASCII      W1252=2200
value=Ovidijus VyÅ¡niauskas        encoding=UTF-8      W1252=Ovidijus Vyšniauskas
value=LopÅ¡inÄ— mylimai         encoding=UTF-8      W1252=Lopšinė mylimai

You can see that the values that already show correct French accents (Gérard, éden) are detected as UTF-8 too, and converting them to Windows-1252 produces weird characters, meaning the conversion is wrong for them.

Now I can’t figure out how to detect which strings need to be converted and which don’t. mb_detect_encoding doesn’t seem to return reliable results.
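One heuristic that might separate the two cases, offered as a sketch under assumptions and not as a definitive fix: try the conversion, and keep the result only if the converted bytes are themselves valid UTF-8 and the conversion lost nothing. `repair_mojibake` is a hypothetical helper name, not an mbstring function, and the approach assumes each field is either correct UTF-8 or UTF-8 that was misread as Windows-1252 exactly once.

```php
<?php
// Hypothetical helper (not part of mbstring): undo one layer of
// "UTF-8 read as Windows-1252 and saved as UTF-8 again".
function repair_mojibake(string $value): string
{
    // Bytes that are not valid UTF-8 to begin with are left untouched.
    if (!mb_check_encoding($value, 'UTF-8')) {
        return $value;
    }

    $candidate = mb_convert_encoding($value, 'Windows-1252', 'UTF-8');

    // Pure ASCII comes back unchanged: nothing to repair.
    if ($candidate === $value) {
        return $value;
    }

    // Reject lossy conversions (characters missing from Windows-1252
    // get replaced by a substitute), detected by a failed round trip.
    if (mb_convert_encoding($candidate, 'UTF-8', 'Windows-1252') !== $value) {
        return $value;
    }

    // Accept the candidate only if the recovered bytes are valid UTF-8.
    return mb_check_encoding($candidate, 'UTF-8') ? $candidate : $value;
}

$fields = ["Math\u{C3}\u{A9} Alt\u{C3}\u{A9}ry", "G\u{E9}rard truc", "Le Temps Perdu"];
foreach ($fields as $value) {
    printf("%-20s => %s\n", $value, repair_mojibake($value));
}
```

Note that the repaired strings are UTF-8, not Windows-1252. A final conversion to Windows-1252 could follow if that is really the target, but a character such as ő has no Windows-1252 code point. The heuristic can also misfire on a correctly encoded string that happens to look like double-encoded text, so it is worth checking the results by hand.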

Any ideas?

Many thanks, all!

Detect mixed-up encoding in PHP and make everything Windows-1252
