About UTF-8 Encoding

UTF-8 is a binary encoding for Unicode text: it represents each Unicode character as a sequence of one or more bytes. Because Unicode covers (almost) every written language on Earth, UTF-8 can be used for nearly any text, which is one of the reasons for its popularity.

The UTF-8 text encoder tool stores each Unicode character in a variable number of bytes. Since JavaScript (and many other languages) internally represents strings as 16-bit (two-byte) UTF-16 code units, this conversion utility only handles code points up to U+FFFF:

Byte 1    Byte 2    Byte 3    Comment
0xxxxxxx  -         -         ASCII (7-bit) characters are unmodified
110xxxxx  10xxxxxx  -         Code points up to U+07FF
1110xxxx  10xxxxxx  10xxxxxx  Code points up to U+FFFF
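
The three bit patterns listed above can be sketched in a few lines of Python. This is only an illustration of the byte layout, not the tool's actual source; the function name encode_code_point is made up for this example, and special cases such as surrogate code points are deliberately ignored.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one code point in the range U+0000..U+FFFF as UTF-8.

    Hypothetical helper for illustration; it follows the three byte
    patterns (1, 2, or 3 bytes) and does not treat surrogates specially.
    """
    if cp < 0 or cp > 0xFFFF:
        raise ValueError("this sketch only handles code points up to U+FFFF")
    if cp <= 0x7F:
        # 0xxxxxxx: ASCII is stored unmodified in a single byte.
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx: top 5 bits, then the low 6 bits.
        return bytes([
            0b11000000 | (cp >> 6),
            0b10000000 | (cp & 0b00111111),
        ])
    # 1110xxxx 10xxxxxx 10xxxxxx: top 4 bits, middle 6 bits, low 6 bits.
    return bytes([
        0b11100000 | (cp >> 12),
        0b10000000 | ((cp >> 6) & 0b00111111),
        0b10000000 | (cp & 0b00111111),
    ])


if __name__ == "__main__":
    # U+0041 (A), U+00E9 (e with acute accent), U+20AC (euro sign)
    for cp in (0x41, 0xE9, 0x20AC):
        print(f"U+{cp:04X} -> {encode_code_point(cp).hex(' ')}")
```

For example, U+20AC (the euro sign) falls in the three-byte range and comes out as the bytes e2 82 ac.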

By setting the input encoding to UTF-8, the same tool can also decode a raw UTF-8 byte stream or repair garbled text (for example, UTF-8 text that was read using the wrong encoding).
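
Both uses can be approximated in Python as a rough sketch. The helper names decode_utf8_bytes and repair_garbled are invented for this example, and the repair step assumes the text was garbled by reading UTF-8 bytes as Latin-1 (ISO-8859-1); how the tool itself handles this is not described here.

```python
def decode_utf8_bytes(raw: bytes) -> str:
    """Decode a raw UTF-8 byte stream into text."""
    return raw.decode("utf-8")


def repair_garbled(text: str, wrong_encoding: str = "latin-1") -> str:
    """Undo one round of mis-decoding (hypothetical helper).

    Assumption: the text was originally UTF-8 but was decoded with
    `wrong_encoding`. Re-encoding with that encoding recovers the
    original bytes, which are then decoded as UTF-8.
    """
    return text.encode(wrong_encoding).decode("utf-8")


if __name__ == "__main__":
    # The bytes E2 82 AC are the UTF-8 form of U+20AC (euro sign).
    print(decode_utf8_bytes(b"\xe2\x82\xac") == "\u20ac")
    # U+00C3 U+00A9 is what U+00E9 looks like after being read as Latin-1.
    print(repair_garbled("\u00c3\u00a9") == "\u00e9")
```

If the garbled text did not come from that kind of mix-up, the re-encoding or decoding step raises a UnicodeError instead of returning a wrong result.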

See the Wikipedia article on UTF-8 for more info.