Understanding utf8mb4_unicode_ci vs. utf8mb4_unicode_520_ci Collations in MariaDB/MySQL

2024-07-27

Character Set: Defines the range of characters a database can store (e.g., basic Latin letters, Asian characters, Cyrillic alphabets). Common character sets in MySQL/MariaDB include utf8 and utf8mb4.
Collation: Determines how characters are ordered and compared. It defines rules for sorting, searching, and comparisons within a character set.

utf8mb4 vs. utf8

Both utf8 and utf8mb4 are character sets that can represent a vast range of characters from various languages.
The key difference lies in how they store characters:
- utf8 uses 1 to 4 bytes per character, but it's limited in representing certain characters beyond basic Latin.
- utf8mb4 consistently uses 4 bytes per character, ensuring wider compatibility with various languages and complex characters.

utf8mb4_unicode_ci vs. utf8mb4_unicode_520_ci Collations

These collations are both used with the utf8mb4 character set for proper handling of multilingual data. The main difference lies in the Unicode version they adhere to for character comparisons:

utf8mb4_unicode_ci:
- Follows the Unicode Collation Algorithm (UCA) version 4.0.0.
- This version might not handle certain character comparisons as accurately as newer versions.
utf8mb4_unicode_520_ci:
- Follows the UCA version 5.2.0.
- Provides more up-to-date and potentially more accurate character comparisons, especially for languages with complex sorting rules.

Choosing the Right Collation

utf8mb4_unicode_ci: A good choice for older databases or if compatibility with MySQL versions before 8.0 is crucial. It's also sufficient for basic multilingual needs where complex sorting isn't a major concern.
utf8mb4_unicode_520_ci: Recommended for most modern use cases, especially if you need accurate sorting and comparisons for a wider range of languages, including those with intricate sorting rules.

Additional Considerations

MySQL 8.0 and Newer: The default collation for utf8mb4 in MySQL 8.0 and later is utf8mb4_0900_ai_ci, which is based on UCA version 9.0.0 and offers even more advanced sorting capabilities.
MariaDB 10.11 and Newer: MariaDB 10.11 introduced uca1400_ai_ci as the default collation, based on UCA version 14.0.0.
Compatibility: If you need to ensure compatibility between MariaDB and MySQL, utf8mb4_unicode_520_ci is a safe choice as it's widely supported in both systems.

CREATE TABLE my_table (
  id INT PRIMARY KEY AUTO_INCREMENT,
  name VARCHAR(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci  -- Use utf8mb4_unicode_ci
  -- OR
  description TEXT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci  -- Use utf8mb4_unicode_520_ci
);

In this example, you can choose either utf8mb4_unicode_ci or utf8mb4_unicode_520_ci depending on your specific needs.

Converting an Existing Table's Collation:

ALTER TABLE my_table
  CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;  -- Use utf8mb4_unicode_ci
  -- OR
  CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_520_ci;  -- Use utf8mb4_unicode_520_ci

This code modifies an existing table named my_table to use the specified collation. Be cautious when converting existing tables, as it might require downtime and data integrity checks.

Specifying Collation in a Query:

SELECT * FROM my_table
WHERE name COLLATE utf8mb4_unicode_ci LIKE '%é%';  -- Use utf8mb4_unicode_ci
  -- OR
  WHERE name COLLATE utf8mb4_unicode_520_ci LIKE '%é%';  -- Use utf8mb4_unicode_520_ci

This query searches the name column in the my_table, but it explicitly specifies the collation to ensure proper character handling, especially for characters with accents like the "é" in this example.

MariaDB and MySQL allow creating custom collations, but this is an advanced technique requiring in-depth knowledge of the Unicode Collation Algorithm (UCA) and character handling. It's generally not recommended unless you have very specific sorting or comparison needs that existing collations don't fulfill.

Normalization (For Specific Cases):

In some cases, you might be able to achieve similar results through data normalization techniques. For instance, if your concern is case-insensitive sorting, you could store all data in lowercase or uppercase consistently, negating the need for a case-insensitive collation. However, this approach can have limitations depending on your specific use case and might not be suitable for all scenarios.

Alternative Character Sets (Limited Use):

If your data primarily deals with a specific language or character set with a simpler sorting scheme, you could explore alternative character sets like latin1_swedish_ci (for Swedish characters) or cp1251_ci (for Cyrillic characters). However, this approach limits compatibility with other languages and reduces the overall flexibility of your database.

Application-Level Handling:

For specific sorting or comparison needs, consider handling them at the application level. This involves writing code in your programming language to process and sort the data according to your desired rules. While this approach offers more control, it adds complexity to your application logic.

Choosing the Right Approach:

The best approach depends on your specific requirements. Here's a general guide:

For most modern multilingual databases: utf8mb4_unicode_520_ci is a good choice due to its wide compatibility and accurate character handling.
For advanced sorting needs or niche use cases: Consider custom collations (with caution) or application-level handling.
For simpler, single-language scenarios: Explore alternative character sets if they meet your specific needs.

mysql unicode mariadb