Search results
Results From The WOW.Com Content Network
The dex format defined by Dalvik also uses the same modified UTF-8 to represent string values. [67] Tcl also uses the same modified UTF-8 [68] as Java for internal representation of Unicode data, but uses strict CESU-8 for external data. All known Modified UTF-8 implementations also treat the surrogate pairs as in CESU-8.
Legacy programs can generally handle UTF-8 encoded files, even if they contain non-ASCII characters. For instance, the C printf function can print a UTF-8 string because it only looks for the ASCII '%' character to define a formatting string. All other bytes are printed unchanged.
Start of String SOS U+0099 153 0302 0231: Single Graphic Character Introducer SGCI U+009A 154 0302 0232: Single Character Intro Introducer SCI U+009B 155 0302 0233: Control Sequence Introducer: CSI U+009C 156 0302 0234: String Terminator ST U+009D 157 0302 0235: Operating System Command OSC U+009E 158 0302 0236: Private Message PM U+009F 159 ...
The same character converted to UTF-8 becomes the byte sequence EF BB BF. The Unicode Standard allows the BOM "can serve as a signature for UTF-8 encoded text where the character set is unmarked". [75] Some software developers have adopted it for other encodings, including UTF-8, in an attempt to distinguish UTF-8 from local 8-bit code pages.
For UTF-8, the BOM is optional, while it is a must for the UTF-16 and the UTF-32 encodings. (Note: UTF-16 and UTF-32 without the BOM are formally known under different names, they are different encodings, and thus needs some form of encoding declaration – see UTF-16BE, UTF-16LE, UTF-32LE and UTF-32BE.) The use of the BOM character (U+FEFF ...
[6] [7] [8] The Encoding Standard further stipulates that new formats, new protocols (even when existing formats are used) and authors of new documents are required to use UTF-8 exclusively. [9] Besides UTF-8, the following encodings are explicitly listed in the HTML standard itself, with reference to the Encoding Standard: [8]
Unicode equivalence is the specification by the Unicode character encoding standard that some sequences of code points represent essentially the same character. This feature was introduced in the standard to allow compatibility with pre-existing standard character sets, which often included similar or identical characters.
Although not strictly required, UTF-8 is usually also transfer encoded to avoid problems across seven-bit mail servers. MIME transfer encoding of UTF-8 makes it either unreadable as a plain text (in the case of base64) or, for some languages and types of text, heavily size inefficient (in the case of quoted-printable).