MySQL支持的字符集简介

摘要:MySQL支持的字符集简介

从根源上讲,计算机只能识别二进制代码,我们利用计算机处理的所有事情,最终都转换成了二进制码。为了使计算机能够处理文字,人们想出了给每个文字符号编码以便于计算机识别处理的办法,这就是字符集的由来。

简单的说,字符集就是一套文字符号及其编码、比较规则的集合。其从20世纪60年代发展到现在,已经非常成熟和稳定,下面是一些常见的字符集。

字符集 是否定长 编码方式 说明
ACSII 单字节7位编码 最早的奠基性字符集
ISO-8859-1/latin1 单字节8位编码 西欧字符集,latin1是ISO-8859-1的别名,有些环境下写作latin-1
GB 2312 双字节编码 全名是GB 2312-1980,早期国标,不再推荐使用
GBK 双字节编码 GBK是GB2312的扩展,非国标,但使用广泛
GB 18030 2字节或4字节编码 国标,是GBK的超集,并与其兼容
UTF-32 4字节编码 也叫UCS-4,使用较少
UCS-2 2字节编码
UTF-16 2字节或4字节编码 不兼容ASCII
UTF-8 1~4字节编码 被广泛使用的Unicode字符集

字符集对数据库来说非常重要,因为数据库存储的大部分都是各种文字,字符集对数据的存储、处理性能、甚至迁移升级等都有影响。

MySQL 5.7目前支持几十种字符集。

root@database-one 13:40:  [(none)]> show character set;
+----------+---------------------------------+---------------------+--------+
| Charset  | Description                     | Default collation   | Maxlen |
+----------+---------------------------------+---------------------+--------+
| big5     | Big5 Traditional Chinese        | big5_chinese_ci     |      2 |
| dec8     | DEC West European               | dec8_swedish_ci     |      1 |
| cp850    | DOS West European               | cp850_general_ci    |      1 |
| hp8      | HP West European                | hp8_english_ci      |      1 |
| koi8r    | KOI8-R Relcom Russian           | koi8r_general_ci    |      1 |
| latin1   | cp1252 West European            | latin1_swedish_ci   |      1 |
| latin2   | ISO 8859-2 Central European     | latin2_general_ci   |      1 |
| swe7     | 7bit Swedish                    | swe7_swedish_ci     |      1 |
| ascii    | US ASCII                        | ascii_general_ci    |      1 |
| ujis     | EUC-JP Japanese                 | ujis_japanese_ci    |      3 |
| sjis     | Shift-JIS Japanese              | sjis_japanese_ci    |      2 |
| hebrew   | ISO 8859-8 Hebrew               | hebrew_general_ci   |      1 |
| tis620   | TIS620 Thai                     | tis620_thai_ci      |      1 |
| euckr    | EUC-KR Korean                   | euckr_korean_ci     |      2 |
| koi8u    | KOI8-U Ukrainian                | koi8u_general_ci    |      1 |
| gb2312   | GB2312 Simplified Chinese       | gb2312_chinese_ci   |      2 |
| greek    | ISO 8859-7 Greek                | greek_general_ci    |      1 |
| cp1250   | Windows Central European        | cp1250_general_ci   |      1 |
| gbk      | GBK Simplified Chinese          | gbk_chinese_ci      |      2 |
| latin5   | ISO 8859-9 Turkish              | latin5_turkish_ci   |      1 |
| armscii8 | ARMSCII-8 Armenian              | armscii8_general_ci |      1 |
| utf8     | UTF-8 Unicode                   | utf8_general_ci     |      3 |
| ucs2     | UCS-2 Unicode                   | ucs2_general_ci     |      2 |
| cp866    | DOS Russian                     | cp866_general_ci    |      1 |
| keybcs2  | DOS Kamenicky Czech-Slovak      | keybcs2_general_ci  |      1 |
| macce    | Mac Central European            | macce_general_ci    |      1 |
| macroman | Mac West European               | macroman_general_ci |      1 |
| cp852    | DOS Central European            | cp852_general_ci    |      1 |
| latin7   | ISO 8859-13 Baltic              | latin7_general_ci   |      1 |
| utf8mb4  | UTF-8 Unicode                   | utf8mb4_general_ci  |      4 |
| cp1251   | Windows Cyrillic                | cp1251_general_ci   |      1 |
| utf16    | UTF-16 Unicode                  | utf16_general_ci    |      4 |
| utf16le  | UTF-16LE Unicode                | utf16le_general_ci  |      4 |
| cp1256   | Windows Arabic                  | cp1256_general_ci   |      1 |
| cp1257   | Windows Baltic                  | cp1257_general_ci   |      1 |
| utf32    | UTF-32 Unicode                  | utf32_general_ci    |      4 |
| binary   | Binary pseudo charset           | binary              |      1 |
| geostd8  | GEOSTD8 Georgian                | geostd8_general_ci  |      1 |
| cp932    | SJIS for Windows Japanese       | cp932_japanese_ci   |      2 |
| eucjpms  | UJIS for Windows Japanese       | eucjpms_japanese_ci |      3 |
| gb18030  | China National Standard GB18030 | gb18030_chinese_ci  |      4 |
+----------+---------------------------------+---------------------+--------+
41 rows in set (0.00 sec)

 

可以看到,每种字符集包括Charset(字符集)和collation(校对规则)两个概念。字符集决定MySQL存储字符串的方式,校对规则决定比较字符串的方式。字符集和校对规则是一对多的关系,每个字符集至少对应一个校对规则。

root@database-one 13:42:  [(none)]> show collation like 'gbk%';
+----------------+---------+----+---------+----------+---------+
| Collation      | Charset | Id | Default | Compiled | Sortlen |
+----------------+---------+----+---------+----------+---------+
| gbk_chinese_ci | gbk     | 28 | Yes     | Yes      |       1 |
| gbk_bin        | gbk     | 87 |         | Yes      |       1 |
+----------------+---------+----+---------+----------+---------+
2 rows in set (0.01 sec)
root@database-one 13:48:  [(none)]> SHOW COLLATION WHERE Charset = 'latin1';
+-------------------+---------+----+---------+----------+---------+
| Collation         | Charset | Id | Default | Compiled | Sortlen |
+-------------------+---------+----+---------+----------+---------+
| latin1_german1_ci | latin1  |  5 |         | Yes      |       1 |
| latin1_swedish_ci | latin1  |  8 | Yes     | Yes      |       1 |
| latin1_danish_ci  | latin1  | 15 |         | Yes      |       1 |
| latin1_german2_ci | latin1  | 31 |         | Yes      |       2 |
| latin1_bin        | latin1  | 47 |         | Yes      |       1 |
| latin1_general_ci | latin1  | 48 |         | Yes      |       1 |
| latin1_general_cs | latin1  | 49 |         | Yes      |       1 |
| latin1_spanish_ci | latin1  | 94 |         | Yes      |       1 |
+-------------------+---------+----+---------+----------+---------+
8 rows in set (0.00 sec)

 

可以看到gbk有两种校对规则,默认使用gbk_chinese_ci。latin1有八种校对规则,默认使用latin1_swedish_ci。

校对规则命名约定:以相关的字符集名开始,中间一般是语言名,以_ci结尾代表大小写不敏感,以_cs结尾代表大小写敏感,以_bin结尾代表基于字符编码值比较。

面对众多的字符集,我们该如何选择呢?一般是结合业务的特征来考虑,比如:业务要给不同语言的国家或地区使用,就应该选择Unicode字符集。业务只在某个区域用,就选相应的国家字符集,会更高效些。

© 2020, morinson. 版权所有. 欢迎转载,但请保留作者及出处。