How to replace Microsoft-encoded quotes in PHP

Question

I need to replace the Microsoft Word versions of single and double quotation marks (“ ” ‘ ’) with regular quotes (' and ") because of an encoding issue in my application. I do not need them to be HTML entities, and I cannot change my database schema.

I have two options: use a regular expression or an associative array.

Is there a better way to do this?

Pascal MARTIN · Accepted Answer

Considering that you only want to replace a few specific, well-identified characters, I would go for str_replace with an array: you obviously don't need the heavy artillery that regex would bring ;-)

And if you encounter other special characters (damn copy-paste from Microsoft Word...), you can simply add them to that array whenever necessary, as they are identified.

The best answer I can give to your comment is probably this link: Convert Smart Quotes with PHP

And the associated code (quoting that page):

function convert_smart_quotes($string)
{
    $search = array(chr(145),
                    chr(146),
                    chr(147),
                    chr(148),
                    chr(151));

    $replace = array("'",
                     "'",
                     '"',
                     '"',
                     '-');

    return str_replace($search, $replace, $string);
}

(I don't have Microsoft Word on this computer, so I can't test it myself.)

I don't remember exactly what we used at work (I was not the one who had to deal with that kind of input), but it was the same kind of thing...
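
Extending the array as described above might look like the following sketch. It assumes the input is single-byte Windows-1252 text (not UTF-8). The function name `convert_word_punctuation`, the `$extra` parameter, and the ellipsis and en dash entries are illustrative additions, not part of the quoted page.

```php
<?php
// Sketch only: assumes $string holds single-byte Windows-1252 text.
// Running this on UTF-8 input would corrupt multi-byte characters.
function convert_word_punctuation(string $string, array $extra = []): string
{
    // CP1252 byte value => ASCII replacement. Entries in $extra win on conflict.
    $map = $extra + [
        145 => "'",   // left single quote
        146 => "'",   // right single quote
        147 => '"',   // left double quote
        148 => '"',   // right double quote
        151 => '-',   // em dash
        133 => '...', // ellipsis (added for illustration)
        150 => '-',   // en dash (added for illustration)
    ];

    // Turn the byte values into one-byte search strings.
    $search = array_map('chr', array_keys($map));

    return str_replace($search, array_values($map), $string);
}

// Prints: "Hello" - it's here...
echo convert_word_punctuation("\x93Hello\x94 \x96 it\x92s here\x85"), "\n";
```

Newly identified characters can then be passed in without editing the function, for example `convert_word_punctuation($text, [153 => 'TM'])`.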

Justin Dominic · Answer

I have found an answer to this question. You need just one line of code, using the iconv() function in PHP:

// replace Microsoft Word version of single and double quotation marks (“ ” ‘ ’) with regular quotes (' and ")
$output = iconv('UTF-8', 'ASCII//TRANSLIT', $input);
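
One caveat worth keeping in mind: iconv() returns false when the input contains bytes that are not valid in the source encoding, and what //TRANSLIT produces depends on the iconv implementation PHP was built against. A defensive wrapper could look like this sketch, where the function name and the choice to fall back to the original string are assumptions, not part of the answer above:

```php
<?php
// Hypothetical helper: transliterate UTF-8 to ASCII, falling back to the
// untouched input when iconv() cannot process it.
function to_ascii_or_original(string $input): string
{
    // @ suppresses the notice iconv() raises on illegal input bytes.
    $output = @iconv('UTF-8', 'ASCII//TRANSLIT', $input);

    // iconv() returns false on failure.
    return $output === false ? $input : $output;
}

echo to_ascii_or_original('plain ASCII passes through unchanged'), "\n";
```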

Gumbo · Answer

Your Microsoft-encoded quotes are probably the typographic quotation marks. You can simply replace them with str_replace if you know the encoding of the string in which you want to replace them.

Here is an example for UTF-8, but using a single mapping array with strtr:

$quotes = array(
    "\xC2\xAB"     => '"', // « (U+00AB) in UTF-8
    "\xC2\xBB"     => '"', // » (U+00BB) in UTF-8
    "\xE2\x80\x98" => "'", // ‘ (U+2018) in UTF-8
    "\xE2\x80\x99" => "'", // ’ (U+2019) in UTF-8
    "\xE2\x80\x9A" => "'", // ‚ (U+201A) in UTF-8
    "\xE2\x80\x9B" => "'", // ‛ (U+201B) in UTF-8
    "\xE2\x80\x9C" => '"', // “ (U+201C) in UTF-8
    "\xE2\x80\x9D" => '"', // ” (U+201D) in UTF-8
    "\xE2\x80\x9E" => '"', // „ (U+201E) in UTF-8
    "\xE2\x80\x9F" => '"', // ‟ (U+201F) in UTF-8
    "\xE2\x80\xB9" => "'", // ‹ (U+2039) in UTF-8
    "\xE2\x80\xBA" => "'", // › (U+203A) in UTF-8
);
$str = strtr($str, $quotes);

If you need another encoding, you can use mb_convert_encoding to convert the keys.
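
Converting the keys could be sketched as follows. The helper name `convert_map_keys` is hypothetical, only a subset of the quotes is shown, and Windows-1252 is just one possible target encoding. Characters that do not exist in the target encoding would not convert cleanly, so the resulting map is worth checking before use.

```php
<?php
// UTF-8 byte sequences for a few typographic quotes.
$utf8Quotes = [
    "\xE2\x80\x98" => "'", // U+2018
    "\xE2\x80\x99" => "'", // U+2019
    "\xE2\x80\x9C" => '"', // U+201C
    "\xE2\x80\x9D" => '"', // U+201D
];

// Re-encode each key for the target encoding; the ASCII values stay as-is.
function convert_map_keys(array $map, string $to, string $from = 'UTF-8'): array
{
    $converted = [];
    foreach ($map as $key => $value) {
        $converted[mb_convert_encoding($key, $to, $from)] = $value;
    }
    return $converted;
}

$cp1252Quotes = convert_map_keys($utf8Quotes, 'Windows-1252');

// "\x93" and "\x94" are the CP1252 bytes for the curly double quotes.
echo strtr("\x93quoted\x94", $cp1252Quotes), "\n"; // prints "quoted"
```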

thelastshadow · Answer

If, like me, you end up here with a huge range of broken ASCII/Microsoft Word characters doing strange things to your CMS or RTE, and iconv doesn't work, then this crazy function might be for you.

Make sure your file encoding is UTF-8 when you save this function to a file.

<?php
/**
 * fixMSWord
 *
 * Replace ASCII chars with UTF-8. Note there are ASCII characters that don't
 * correctly map and will be replaced by spaces.
 *
 * @author Robin Cafolla
 * @date 2013-03-22
 */
function fixMSWord($string) {
    $map = Array(
        '33' => '!', '34' => '"', '35' => '#', '36' => '$', '37' => '%', '38' => '&',
        '39' => "'", '40' => '(', '41' => ')', '42' => '*', '43' => '+', '44' => ',',
        '45' => '-', '46' => '.', '47' => '/', '48' => '0', '49' => '1', '50' => '2',
        '51' => '3', '52' => '4', '53' => '5', '54' => '6', '55' => '7', '56' => '8',
        '57' => '9', '58' => ':', '59' => ';', '60' => '<', '61' => '=', '62' => '>',
        '63' => '?', '64' => '@', '65' => 'A', '66' => 'B', '67' => 'C', '68' => 'D',
        '69' => 'E', '70' => 'F', '71' => 'G', '72' => 'H', '73' => 'I', '74' => 'J',
        '75' => 'K', '76' => 'L', '77' => 'M', '78' => 'N', '79' => 'O', '80' => 'P',
        '81' => 'Q', '82' => 'R', '83' => 'S', '84' => 'T', '85' => 'U', '86' => 'V',
        '87' => 'W', '88' => 'X', '89' => 'Y', '90' => 'Z', '91' => '[', '92' => '\\',
        '93' => ']', '94' => '^', '95' => '_', '96' => '`', '97' => 'a', '98' => 'b',
        '99' => 'c', '100'=> 'd', '101'=> 'e', '102'=> 'f', '103'=> 'g', '104'=> 'h',
        '105'=> 'i', '106'=> 'j', '107'=> 'k', '108'=> 'l', '109'=> 'm', '110'=> 'n',
        '111'=> 'o', '112'=> 'p', '113'=> 'q', '114'=> 'r', '115'=> 's', '116'=> 't',
        '117'=> 'u', '118'=> 'v', '119'=> 'w', '120'=> 'x', '121'=> 'y', '122'=> 'z',
        '123'=> '{', '124'=> '|', '125'=> '}', '126'=> '~', '127'=> ' ', '128'=> '&#8364;',
        '129'=> ' ', '130'=> ',', '131'=> ' ', '132'=> '"', '133'=> '.', '134'=> ' ',
        '135'=> ' ', '136'=> '^', '137'=> ' ', '138'=> ' ', '139'=> '<', '140'=> ' ',
        '141'=> ' ', '142'=> ' ', '143'=> ' ', '144'=> ' ', '145'=> "'", '146'=> "'",
        '147'=> '"', '148'=> '"', '149'=> '.', '150'=> '-', '151'=> '-', '152'=> '~',
        '153'=> ' ', '154'=> ' ', '155'=> '>', '156'=> ' ', '157'=> ' ', '158'=> ' ',
        '159'=> ' ', '160'=> ' ', '161'=> '¡', '162'=> '¢', '163'=> '£', '164'=> '¤',
        '165'=> '¥', '166'=> '¦', '167'=> '§', '168'=> '¨', '169'=> '©', '170'=> 'ª',
        '171'=> '«', '172'=> '¬', '173'=> '', '174'=> '®', '175'=> '¯', '176'=> '°',
        '177'=> '±', '178'=> '²', '179'=> '³', '180'=> '´', '181'=> 'µ', '182'=> '¶',
        '183'=> '·', '184'=> '¸', '185'=> '¹', '186'=> 'º', '187'=> '»', '188'=> '¼',
        '189'=> '½', '190'=> '¾', '191'=> '¿', '192'=> 'À', '193'=> 'Á', '194'=> 'Â',
        '195'=> 'Ã', '196'=> 'Ä', '197'=> 'Å', '198'=> 'Æ', '199'=> 'Ç', '200'=> 'È',
        '201'=> 'É', '202'=> 'Ê', '203'=> 'Ë', '204'=> 'Ì', '205'=> 'Í', '206'=> 'Î',
        '207'=> 'Ï', '208'=> 'Ð', '209'=> 'Ñ', '210'=> 'Ò', '211'=> 'Ó', '212'=> 'Ô',
        '213'=> 'Õ', '214'=> 'Ö', '215'=> '×', '216'=> 'Ø', '217'=> 'Ù', '218'=> 'Ú',
        '219'=> 'Û', '220'=> 'Ü', '221'=> 'Ý', '222'=> 'Þ', '223'=> 'ß', '224'=> 'à',
        '225'=> 'á', '226'=> 'â', '227'=> 'ã', '228'=> 'ä', '229'=> 'å', '230'=> 'æ',
        '231'=> 'ç', '232'=> 'è', '233'=> 'é', '234'=> 'ê', '235'=> 'ë', '236'=> 'ì',
        '237'=> 'í', '238'=> 'î', '239'=> 'ï', '240'=> 'ð', '241'=> 'ñ', '242'=> 'ò',
        '243'=> 'ó', '244'=> 'ô', '245'=> 'õ', '246'=> 'ö', '247'=> '÷', '248'=> 'ø',
        '249'=> 'ù', '250'=> 'ú', '251'=> 'û', '252'=> 'ü', '253'=> 'ý', '254'=> 'þ',
        '255'=> 'ÿ'
    );

    // Build a byte => replacement table. strtr() makes a single pass over the
    // input, so bytes produced by an earlier replacement (such as the 0xC2/0xC3
    // lead bytes of the UTF-8 replacements) are not replaced a second time, as
    // they would be with str_replace() and parallel arrays.
    $table = Array();
    foreach ($map as $s => $r) {
        $table[chr((int)$s)] = $r;
    }

    return strtr($string, $table);
}

ceejayoz · Answer

We have used the following. It handles a few additional special characters.

$text = str_replace(chr(130), ',', $text);    // Baseline single quote
$text = str_replace(chr(132), '"', $text);    // Baseline double quote
$text = str_replace(chr(133), '...', $text);  // Ellipsis
$text = str_replace(chr(145), "'", $text);    // Left single quote
$text = str_replace(chr(146), "'", $text);    // Right single quote
$text = str_replace(chr(147), '"', $text);    // Left double quote
$text = str_replace(chr(148), '"', $text);    // Right double quote

$text = mb_convert_encoding($text, 'HTML-ENTITIES', 'UTF-8');

NobleUplift · Answer

Every one of the previous answers, with the exception of Gumbo's, mangles Unicode strings:

echo convert_smart_quotes("This is Yi: ꑑ. Point ⒒ this breaks Yi. Yi broke–why? I need a longer––point. This makes Han 嗗 mad.");

Results in:

This is Yi: ?''. Point ?'' this breaks Yi. Yi broke?"why? I need a longer?"?"point. This makes Han ?-- mad.
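
The breakage happens because chr(145) through chr(148) are single bytes (0x91 to 0x94) that also occur as continuation bytes inside multi-byte UTF-8 characters. A small sketch to make that visible, using the UTF-8 byte sequences of U+A451 (a Yi syllable, consistent with the output above) and U+2013 (en dash):

```php
<?php
$yi     = "\xEA\x91\x91"; // U+A451, a Yi syllable, UTF-8 encoded
$enDash = "\xE2\x80\x93"; // U+2013 EN DASH, UTF-8 encoded

// chr(145) is byte 0x91, so both trailing bytes of the Yi syllable match.
$brokenYi = str_replace(chr(145), "'", $yi);

// chr(147) is byte 0x93, the last byte of the en dash.
$brokenDash = str_replace(chr(147), '"', $enDash);

echo bin2hex($brokenYi), "\n";   // ea2727: a stray lead byte, then two apostrophes
echo bin2hex($brokenDash), "\n"; // e28022: a truncated sequence, then a double quote

// The result is no longer valid UTF-8.
var_dump(mb_check_encoding($brokenYi, 'UTF-8')); // bool(false)
```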

The iconv approach:

$output = iconv('UTF-8', 'ASCII//TRANSLIT', $input);

Results in:

PHP Notice: iconv(): Detected an illegal character in input string in php shell code on line 1

You can change it to //IGNORE, which removes those characters but does not transliterate them.

This is the best way to replace CP1252-encoded Microsoft quotes. If they are in Unicode and you need to replace them, use Gumbo's answer:

function convert_cp1252_to_ascii($input, $default = '') {
    if ($input === null || $input == '') {
        return $default;
    }

    // https://en.wikipedia.org/wiki/UTF-8
    // https://en.wikipedia.org/wiki/ISO/IEC_8859-1
    // https://en.wikipedia.org/wiki/Windows-1252
    // http://www.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT
    $encoding = mb_detect_encoding($input, array('Windows-1252', 'ISO-8859-1'), true);
    if ($encoding == 'ISO-8859-1' || $encoding == 'Windows-1252') {
        /*
         * Use the search/replace arrays if a character needs to be replaced with
         * something other than its Unicode equivalent.
         */
        $replace = array(
            128 => "E",    // http://www.fileformat.info/info/unicode/char/20AC/index.htm EURO SIGN
            129 => "",     // UNDEFINED
            130 => ",",    // http://www.fileformat.info/info/unicode/char/201A/index.htm SINGLE LOW-9 QUOTATION MARK
            131 => "f",    // http://www.fileformat.info/info/unicode/char/0192/index.htm LATIN SMALL LETTER F WITH HOOK
            132 => ",,",   // http://www.fileformat.info/info/unicode/char/201e/index.htm DOUBLE LOW-9 QUOTATION MARK
            133 => "...",  // http://www.fileformat.info/info/unicode/char/2026/index.htm HORIZONTAL ELLIPSIS
            134 => "t",    // http://www.fileformat.info/info/unicode/char/2020/index.htm DAGGER
            135 => "T",    // http://www.fileformat.info/info/unicode/char/2021/index.htm DOUBLE DAGGER
            136 => "^",    // http://www.fileformat.info/info/unicode/char/02c6/index.htm MODIFIER LETTER CIRCUMFLEX ACCENT
            137 => "%",    // http://www.fileformat.info/info/unicode/char/2030/index.htm PER MILLE SIGN
            138 => "S",    // http://www.fileformat.info/info/unicode/char/0160/index.htm LATIN CAPITAL LETTER S WITH CARON
            139 => "<",    // http://www.fileformat.info/info/unicode/char/2039/index.htm SINGLE LEFT-POINTING ANGLE QUOTATION MARK
            140 => "OE",   // http://www.fileformat.info/info/unicode/char/0152/index.htm LATIN CAPITAL LIGATURE OE
            141 => "",     // UNDEFINED
            142 => "Z",    // http://www.fileformat.info/info/unicode/char/017d/index.htm LATIN CAPITAL LETTER Z WITH CARON
            143 => "",     // UNDEFINED
            144 => "",     // UNDEFINED
            145 => "'",    // http://www.fileformat.info/info/unicode/char/2018/index.htm LEFT SINGLE QUOTATION MARK
            146 => "'",    // http://www.fileformat.info/info/unicode/char/2019/index.htm RIGHT SINGLE QUOTATION MARK
            147 => "\"",   // http://www.fileformat.info/info/unicode/char/201c/index.htm LEFT DOUBLE QUOTATION MARK
            148 => "\"",   // http://www.fileformat.info/info/unicode/char/201d/index.htm RIGHT DOUBLE QUOTATION MARK
            149 => "*",    // http://www.fileformat.info/info/unicode/char/2022/index.htm BULLET
            150 => "-",    // http://www.fileformat.info/info/unicode/char/2013/index.htm EN DASH
            151 => "--",   // http://www.fileformat.info/info/unicode/char/2014/index.htm EM DASH
            152 => "~",    // http://www.fileformat.info/info/unicode/char/02DC/index.htm SMALL TILDE
            153 => "TM",   // http://www.fileformat.info/info/unicode/char/2122/index.htm TRADE MARK SIGN
            154 => "s",    // http://www.fileformat.info/info/unicode/char/0161/index.htm LATIN SMALL LETTER S WITH CARON
            155 => ">",    // http://www.fileformat.info/info/unicode/char/203A/index.htm SINGLE RIGHT-POINTING ANGLE QUOTATION MARK
            156 => "oe",   // http://www.fileformat.info/info/unicode/char/0153/index.htm LATIN SMALL LIGATURE OE
            157 => "",     // UNDEFINED
            158 => "z",    // http://www.fileformat.info/info/unicode/char/017E/index.htm LATIN SMALL LETTER Z WITH CARON
            159 => "Y",    // http://www.fileformat.info/info/unicode/char/0178/index.htm LATIN CAPITAL LETTER Y WITH DIAERESIS
        );

        $find = array();
        foreach (array_keys($replace) as $key) {
            $find[] = chr($key);
        }

        $input = str_replace($find, array_values($replace), $input);

        /*
         * Because ISO-8859-1 and CP1252 are identical except for 0x80 through 0x9F
         * and control characters, always convert from Windows-1252 to UTF-8.
         */
        $input = iconv('Windows-1252', 'UTF-8//IGNORE', $input);
    }
    return $input;
}

Taken from this answer, with some modifications. If you want control over what you find and replace, use this function.
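
The rule of thumb above (byte replacement for CP1252 input, Gumbo's sequence replacement for Unicode input) could be combined in one place, roughly like this sketch. The function name `normalize_quotes` is hypothetical, only the four curly quotes are covered, and treating anything that is not valid UTF-8 as Windows-1252 is an assumption that will not hold for every data source.

```php
<?php
// Hypothetical dispatcher: choose the replacement table from the input's encoding.
function normalize_quotes(string $input): string
{
    if (mb_check_encoding($input, 'UTF-8')) {
        // Valid UTF-8: replace whole multi-byte sequences only.
        return strtr($input, [
            "\xE2\x80\x98" => "'", // U+2018
            "\xE2\x80\x99" => "'", // U+2019
            "\xE2\x80\x9C" => '"', // U+201C
            "\xE2\x80\x9D" => '"', // U+201D
        ]);
    }

    // Not valid UTF-8: assume Windows-1252 and replace single bytes.
    return strtr($input, [
        "\x91" => "'",
        "\x92" => "'",
        "\x93" => '"',
        "\x94" => '"',
    ]);
}

echo normalize_quotes("\xE2\x80\x9CUTF-8 input\xE2\x80\x9D"), "\n"; // prints "UTF-8 input"
echo normalize_quotes("\x93CP1252 input\x94"), "\n";                // prints "CP1252 input"
```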