Algorithme Slugify URL en C #?

Question

J'ai donc recherché et parcouru la balise slug sur SO et je n'ai trouvé que deux solutions convaincantes:

Ce ne sont que des solutions partielles au problème. Je pourrais le coder manuellement moi-même, mais je suis surpris qu'il n'y ait pas encore de solution.

Alors, existe-t-il une implémentation d'alrogithme slugify en C # et/ou .NET qui traite correctement les caractères latins, l'unicode et divers autres problèmes de langue?

Marcel · Accepted Answer

http://predicatet.blogspot.com/2009/04/improved-c-slug-generator-or-how-to.html

public static string GenerateSlug(this string phrase) { string str = phrase.RemoveAccent().ToLower(); // invalid chars str = Regex.Replace(str, @"[^a-z0-9\s-]", ""); // convert multiple spaces into one space str = Regex.Replace(str, @"\s+", " ").Trim(); // cut and trim str = str.Substring(0, str.Length <= 45 ? str.Length : 45).Trim(); str = Regex.Replace(str, @"\s", "-"); // hyphens return str; } public static string RemoveAccent(this string txt) { byte[] bytes = System.Text.Encoding.GetEncoding("Cyrillic").GetBytes(txt); return System.Text.Encoding.ASCII.GetString(bytes); }

Joan Caron · Answer

Vous trouverez ici un moyen de générer un slug url en c #. Cette fonction supprime tous les accents (réponse de Marcel), remplace les espaces, supprime les caractères invalides, coupe les tirets de la fin et remplace les doubles occurrences de "-" ou "_"

Code:

public static string ToUrlSlug(string value){ //First to lower case value = value.ToLowerInvariant(); //Remove all accents var bytes = Encoding.GetEncoding("Cyrillic").GetBytes(value); value = Encoding.ASCII.GetString(bytes); //Replace spaces value = Regex.Replace(value, @"\s", "-", RegexOptions.Compiled); //Remove invalid chars value = Regex.Replace(value, @"[^a-z0-9\s-_]", "",RegexOptions.Compiled); //Trim dashes from end value = value.Trim('-', '_'); //Replace double occurences of - or _ value = Regex.Replace(value, @"([-_]){2,}", "$1", RegexOptions.Compiled); return value ; }

dana · Answer

Voici mon interprétation, basée sur les réponses de Joan et Marcel. Les modifications que j'ai apportées sont les suivantes:

Voici le code:

public class UrlSlugger { // white space, em-dash, en-dash, underscore static readonly Regex WordDelimiters = new Regex(@"[\s—–_]", RegexOptions.Compiled); // characters that are not valid static readonly Regex InvalidChars = new Regex(@"[^a-z0-9\-]", RegexOptions.Compiled); // multiple hyphens static readonly Regex MultipleHyphens = new Regex(@"-{2,}", RegexOptions.Compiled); public static string ToUrlSlug(string value) { // convert to lower case value = value.ToLowerInvariant(); // remove diacritics (accents) value = RemoveDiacritics(value); // ensure all Word delimiters are hyphens value = WordDelimiters.Replace(value, "-"); // strip out invalid characters value = InvalidChars.Replace(value, ""); // replace multiple hyphens (-) with a single hyphen value = MultipleHyphens.Replace(value, "-"); // trim hyphens (-) from ends return value.Trim('-'); } /// See: http://www.siao2.com/2007/05/14/2629747.aspx private static string RemoveDiacritics(string stIn) { string stFormD = stIn.Normalize(NormalizationForm.FormD); StringBuilder sb = new StringBuilder(); for (int ich = 0; ich < stFormD.Length; ich++) { UnicodeCategory uc = CharUnicodeInfo.GetUnicodeCategory(stFormD[ich]); if (uc != UnicodeCategory.NonSpacingMark) { sb.Append(stFormD[ich]); } } return (sb.ToString().Normalize(NormalizationForm.FormC)); } }

Cela ne résout toujours pas le problème des caractères non latins. Une solution complètement alternative serait d'utiliser ri.EscapeDataString pour convertir la chaîne sa représentation hexadécimale:

string original = "测试公司"; // %E6%B5%8B%E8%AF%95%E5%85%AC%E5%8F%B8 string converted = Uri.EscapeDataString(original);

Utilisez ensuite les données pour générer un lien hypertexte:

<a href="http://www.example.com/100/%E6%B5%8B%E8%AF%95%E5%85%AC%E5%8F%B8"> 测试公司 </a>

De nombreux navigateurs affichent des caractères chinois dans la barre d'adresse (voir ci-dessous), mais sur la base de mes tests limités, il n'est pas complètement pris en charge.

address bar with Chinese characters

REMARQUE: pour que ri.EscapeDataString fonctionne de cette façon, iriParsing doit être activé.

[~ # ~] modifier [~ # ~]

Pour ceux qui cherchent à générer des URL Slugs en C #, je recommande de vérifier cette question connexe:

Comment Stack Overflow génère-t-il ses URL optimisées pour le référencement?

C'est ce que j'ai fini par utiliser pour mon projet.

Matthew Groves · Answer

Un problème que j'ai eu avec la slugification (nouveau Word!) Est les collisions. Si j'ai un article de blog, par exemple, appelé "Stack-Overflow" et un autre appelé "Stack Overflow", les slugs de ces deux titres sont les mêmes. Par conséquent, mon générateur de slugs doit généralement impliquer la base de données d'une manière ou d'une autre. C'est peut-être pourquoi vous ne voyez pas de solutions plus génériques.

Simon Mourier · Answer

Voici ma photo. Elle supporte:

suppression des signes diacritiques (donc nous ne supprimons pas seulement les caractères "invalides")
longueur maximale pour le résultat (ou avant l'élimination des signes diacritiques - "début tronqué")
séparateur personnalisé entre les morceaux normalisés
le résultat peut être forcé en majuscule ou en minuscule
liste configurable des catégories Unicode prises en charge
liste configurable de plages de caractères autorisés
prend en charge le framework 2.0

Code:

/// <summary> /// Defines a set of utilities for creating slug urls. /// </summary> public static class Slug { /// <summary> /// Creates a slug from the specified text. /// </summary> /// <param name="text">The text. If null if specified, null will be returned.</param> /// <returns> /// A slugged text. /// </returns> public static string Create(string text) { return Create(text, (SlugOptions)null); } /// <summary> /// Creates a slug from the specified text. /// </summary> /// <param name="text">The text. If null if specified, null will be returned.</param> /// <param name="options">The options. May be null.</param> /// <returns>A slugged text.</returns> public static string Create(string text, SlugOptions options) { if (text == null) return null; if (options == null) { options = new SlugOptions(); } string normalised; if (options.EarlyTruncate && options.MaximumLength > 0 && text.Length > options.MaximumLength) { normalised = text.Substring(0, options.MaximumLength).Normalize(NormalizationForm.FormD); } else { normalised = text.Normalize(NormalizationForm.FormD); } int max = options.MaximumLength > 0 ? Math.Min(normalised.Length, options.MaximumLength) : normalised.Length; StringBuilder sb = new StringBuilder(max); for (int i = 0; i < normalised.Length; i++) { char c = normalised[i]; UnicodeCategory uc = char.GetUnicodeCategory(c); if (options.AllowedUnicodeCategories.Contains(uc) && options.IsAllowed(c)) { switch (uc) { case UnicodeCategory.UppercaseLetter: if (options.ToLower) { c = options.Culture != null ? char.ToLower(c, options.Culture) : char.ToLowerInvariant(c); } sb.Append(options.Replace(c)); break; case UnicodeCategory.LowercaseLetter: if (options.ToUpper) { c = options.Culture != null ? char.ToUpper(c, options.Culture) : char.ToUpperInvariant(c); } sb.Append(options.Replace(c)); break; default: sb.Append(options.Replace(c)); break; } } else if (uc == UnicodeCategory.NonSpacingMark) { // don't add a separator } else { if (options.Separator != null && !EndsWith(sb, options.Separator)) { sb.Append(options.Separator); } } if (options.MaximumLength > 0 && sb.Length >= options.MaximumLength) break; } string result = sb.ToString(); if (options.MaximumLength > 0 && result.Length > options.MaximumLength) { result = result.Substring(0, options.MaximumLength); } if (!options.CanEndWithSeparator && options.Separator != null && result.EndsWith(options.Separator)) { result = result.Substring(0, result.Length - options.Separator.Length); } return result.Normalize(NormalizationForm.FormC); } private static bool EndsWith(StringBuilder sb, string text) { if (sb.Length < text.Length) return false; for (int i = 0; i < text.Length; i++) { if (sb[sb.Length - 1 - i] != text[text.Length - 1 - i]) return false; } return true; } } /// <summary> /// Defines options for the Slug utility class. /// </summary> public class SlugOptions { /// <summary> /// Defines the default maximum length. Currently equal to 80. /// </summary> public const int DefaultMaximumLength = 80; /// <summary> /// Defines the default separator. Currently equal to "-". /// </summary> public const string DefaultSeparator = "-"; private bool _toLower; private bool _toUpper; /// <summary> /// Initializes a new instance of the <see cref="SlugOptions"/> class. /// </summary> public SlugOptions() { MaximumLength = DefaultMaximumLength; Separator = DefaultSeparator; AllowedUnicodeCategories = new List<UnicodeCategory>(); AllowedUnicodeCategories.Add(UnicodeCategory.UppercaseLetter); AllowedUnicodeCategories.Add(UnicodeCategory.LowercaseLetter); AllowedUnicodeCategories.Add(UnicodeCategory.DecimalDigitNumber); AllowedRanges = new List<KeyValuePair<short, short>>(); AllowedRanges.Add(new KeyValuePair<short, short>((short)'a', (short)'z')); AllowedRanges.Add(new KeyValuePair<short, short>((short)'A', (short)'Z')); AllowedRanges.Add(new KeyValuePair<short, short>((short)'0', (short)'9')); } /// <summary> /// Gets the allowed unicode categories list. /// </summary> /// <value> /// The allowed unicode categories list. /// </value> public virtual IList<UnicodeCategory> AllowedUnicodeCategories { get; private set; } /// <summary> /// Gets the allowed ranges list. /// </summary> /// <value> /// The allowed ranges list. /// </value> public virtual IList<KeyValuePair<short, short>> AllowedRanges { get; private set; } /// <summary> /// Gets or sets the maximum length. /// </summary> /// <value> /// The maximum length. /// </value> public virtual int MaximumLength { get; set; } /// <summary> /// Gets or sets the separator. /// </summary> /// <value> /// The separator. /// </value> public virtual string Separator { get; set; } /// <summary> /// Gets or sets the culture for case conversion. /// </summary> /// <value> /// The culture. /// </value> public virtual CultureInfo Culture { get; set; } /// <summary> /// Gets or sets a value indicating whether the string can end with a separator string. /// </summary> /// <value> /// <c>true</c> if the string can end with a separator string; otherwise, <c>false</c>. /// </value> public virtual bool CanEndWithSeparator { get; set; } /// <summary> /// Gets or sets a value indicating whether the string is truncated before normalization. /// </summary> /// <value> /// <c>true</c> if the string is truncated before normalization; otherwise, <c>false</c>. /// </value> public virtual bool EarlyTruncate { get; set; } /// <summary> /// Gets or sets a value indicating whether to lowercase the resulting string. /// </summary> /// <value> /// <c>true</c> if the resulting string must be lowercased; otherwise, <c>false</c>. /// </value> public virtual bool ToLower { get { return _toLower; } set { _toLower = value; if (_toLower) { _toUpper = false; } } } /// <summary> /// Gets or sets a value indicating whether to uppercase the resulting string. /// </summary> /// <value> /// <c>true</c> if the resulting string must be uppercased; otherwise, <c>false</c>. /// </value> public virtual bool ToUpper { get { return _toUpper; } set { _toUpper = value; if (_toUpper) { _toLower = false; } } } /// <summary> /// Determines whether the specified character is allowed. /// </summary> /// <param name="character">The character.</param> /// <returns>true if the character is allowed; false otherwise.</returns> public virtual bool IsAllowed(char character) { foreach (var p in AllowedRanges) { if (character >= p.Key && character <= p.Value) return true; } return false; } /// <summary> /// Replaces the specified character by a given string. /// </summary> /// <param name="character">The character to replace.</param> /// <returns>a string.</returns> public virtual string Replace(char character) { return character.ToString(); } }