Text Encoding Explained
Discover how computers represent text as numbers. From ASCII to UTF-8, learn why encoding matters for international text.
Why Encoding Matters
Computers don't understand letters or symbols—they only work with numbers (binary: 0s and 1s). Text encoding is the system that maps characters to numbers so computers can store and display text.
When you see the letter "A" on screen, the computer actually stores the number 65. When displaying text, it looks up 65 in an encoding table and shows "A". Different encoding systems use different number-to-character mappings, which is why encoding problems cause gibberish text like "Ã©" instead of "é".
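The round trip from character to number and back is easy to see in Python:

```python
# Characters are stored as numbers; ord() and chr() expose the mapping.
print(ord("A"))             # 65 (the number actually stored)
print(chr(65))              # 'A' (the character looked up for 65)
print("A".encode("utf-8"))  # b'A' (the byte actually written to disk)
```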
Encoding mistakes cause website errors, corrupted emails, and broken database records. Understanding encoding prevents data loss and ensures text displays correctly worldwide.
ASCII: The Beginning
ASCII (American Standard Code for Information Interchange) was created in 1963 and uses 7 bits to represent 128 characters:
- 0-31: Control characters (newline, tab, etc.)
- 32-126: Printable characters (letters, numbers, punctuation)
- 127: Delete character
| Character | ASCII Number (Decimal) |
|---|---|
| A | 65 |
| B | 66 |
| a | 97 |
| 0 | 48 |
| Space | 32 |
| ! | 33 |
ASCII's Limitation
ASCII only covers English characters. It can't represent:
- Accented letters (é, ñ, ü)
- Non-Latin scripts (中文, العربية, Русский)
- Emoji (😀, 🚀)
- Special symbols (€, ©, ™)
Different countries created extended ASCII variants (like Windows-1252, ISO-8859-1) that used the 8th bit for 128 additional characters, but these were incompatible with each other. A file encoded in one extended ASCII couldn't be read correctly in another.
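A short Python sketch makes the incompatibility concrete: the same byte decodes to different characters under different 8-bit code pages.

```python
# One byte, two incompatible extended-ASCII interpretations.
raw = b"\x80"
print(raw.decode("cp1252"))         # '€' in Windows-1252
print(repr(raw.decode("latin-1")))  # '\x80', a control character in ISO-8859-1
```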
Unicode: Universal Characters
Unicode solved the incompatibility problem by creating a single character set for all languages. Unicode assigns a unique number (called a code point) to every character from every writing system.
Unicode Code Points
Code points are written as U+ followed by hexadecimal digits:
| Character | Unicode Code Point |
|---|---|
| A | U+0041 |
| € | U+20AC |
| 中 | U+4E2D |
| 😀 | U+1F600 |
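In Python, `ord()` returns the code point as an integer, and `hex()` shows the familiar U+ notation:

```python
# Code points are plain integers; hex() matches the U+ notation.
print(hex(ord("€")))   # 0x20ac  -> U+20AC
print(hex(ord("😀")))  # 0x1f600 -> U+1F600
print(chr(0x4E2D))     # 中
```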
Unicode contains over 140,000 characters covering 150+ scripts. It's constantly expanding to include historical scripts, emoji, and symbols.
Unicode defines which characters exist and their code points. Encodings like UTF-8 define how to store these code points as bytes in files and memory.
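The distinction is easy to demonstrate: one code point, three different byte sequences depending on the encoding.

```python
# The single code point U+4E2D stored under three different encodings.
ch = "中"
print(ch.encode("utf-8").hex())      # e4b8ad   (3 bytes)
print(ch.encode("utf-16-be").hex())  # 4e2d     (2 bytes)
print(ch.encode("utf-32-be").hex())  # 00004e2d (4 bytes)
```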
UTF-8: The Modern Standard
UTF-8 (8-bit Unicode Transformation Format) is the most popular Unicode encoding. It's used by 98% of websites and is the default for most programming languages.
Why UTF-8 Won
- Backward compatible with ASCII: ASCII characters use the same bytes in UTF-8
- Variable length: Common characters (English) use 1 byte, others use 2-4 bytes
- Self-synchronizing: Can detect character boundaries even if you jump into the middle of a file
- No byte order issues: Unlike UTF-16, no BOM (Byte Order Mark) needed
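The first two properties are easy to check in Python:

```python
# ASCII compatibility: pure ASCII text yields identical bytes either way.
assert "hello".encode("ascii") == "hello".encode("utf-8")

# Self-synchronization: continuation bytes always look like 10xxxxxx,
# so a character boundary can be found from any position in the stream.
for b in "é中".encode("utf-8"):
    kind = "continuation" if 0x80 <= b <= 0xBF else "lead"
    print(f"{b:08b} {kind}")
```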
How UTF-8 Encodes Characters
- 1 byte (U+0000 to U+007F): ASCII characters. Example: "A" → 01000001 (1 byte)
- 2 bytes (U+0080 to U+07FF): Latin extended, Greek, Cyrillic. Example: "é" → 11000011 10101001 (2 bytes)
- 3 bytes (U+0800 to U+FFFF): the rest of the Basic Multilingual Plane, including most CJK characters. Example: "中" → 11100100 10111000 10101101 (3 bytes)
- 4 bytes (U+10000 to U+10FFFF): emoji and other supplementary-plane characters. Example: "😀" → 11110000 10011111 10011000 10000000 (4 bytes)
UTF-8 vs Other Encodings
| Encoding | Bytes per Character | Pros | Cons |
|---|---|---|---|
| ASCII | 1 | Simple, compact for English | Only 128 characters |
| UTF-8 | 1-4 (variable) | Universal, ASCII compatible | Variable length can be slower |
| UTF-16 | 2-4 (variable) | Common in Windows/Java | Wastes space for ASCII text |
| UTF-32 | 4 (fixed) | Fixed width, fast indexing | 4× larger than UTF-8 for English |
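A quick size comparison for an ASCII-only string shows why UTF-8 is so compact for English text:

```python
# Byte counts for the same five-character ASCII string.
text = "hello"
for enc in ("utf-8", "utf-16-be", "utf-32-be"):
    print(enc, len(text.encode(enc)), "bytes")
# utf-8: 5 bytes, utf-16-be: 10 bytes, utf-32-be: 20 bytes
```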
Common Encoding Problems
1. Mojibake (文字化け) - Garbled Text
Problem: "café" displays as "café"
Cause: UTF-8 text interpreted as Windows-1252 or ISO-8859-1.
Solution: Set correct encoding when opening files. Most text editors have "Reopen with Encoding" options.
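When the original bytes are still intact, this particular corruption can be reproduced and reversed in Python:

```python
# Mojibake: UTF-8 bytes misread as Latin-1, then repaired.
good = "café"
mangled = good.encode("utf-8").decode("latin-1")     # 'cafÃ©'
repaired = mangled.encode("latin-1").decode("utf-8")
print(mangled, "->", repaired)  # cafÃ© -> café
```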
2. BOM (Byte Order Mark) Issues
Problem: Files start with weird characters like ""
Cause: UTF-8 BOM (EF BB BF) visible in editors that don't expect it.
Solution: Save as "UTF-8 without BOM" for web files. Only use BOM for Windows text files.
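Python's `utf-8-sig` codec handles the BOM on both ends:

```python
# "utf-8-sig" writes the BOM when encoding and strips it when decoding.
data = "hello".encode("utf-8-sig")
print(data[:3])                    # b'\xef\xbb\xbf' (the BOM)
print(data.decode("utf-8-sig"))    # hello (BOM stripped)
print(repr(data.decode("utf-8")))  # '\ufeffhello' (BOM leaks through)
```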
3. Database Encoding Mismatches
Problem: Text looks fine in app but garbled in database.
Cause: Connection charset doesn't match database charset.
Solution:

```sql
-- MySQL example
SET NAMES utf8mb4;
CREATE DATABASE mydb CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
```
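The same fix on the client side, as a minimal sketch assuming the PyMySQL driver and placeholder credentials:

```python
import pymysql

# charset="utf8mb4" makes the connection charset match the database,
# so text survives the round trip. Host/user/password are placeholders.
conn = pymysql.connect(
    host="localhost",
    user="app",
    password="secret",
    database="mydb",
    charset="utf8mb4",
)
```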
4. Email Encoding
Problem: Subject lines with non-ASCII characters show as "=?UTF-8?Q?..."
Cause: Email headers must be MIME-encoded for international characters.
Solution: Use email libraries that handle encoding automatically (e.g., PHPMailer, Python's email module).
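With Python's standard `email` module, for example, the header encoding happens automatically:

```python
from email.message import EmailMessage

msg = EmailMessage()
msg["Subject"] = "Café ☕ newsletter"  # non-ASCII subject
msg["From"] = "sender@example.com"
msg["To"] = "reader@example.com"
msg.set_content("Body with é and 中.")

# On serialization, the Subject header is MIME-encoded: =?utf-8?q?...?=
print(msg.as_string())
```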
The #1 rule: Be consistent. If you read a file as UTF-8, write it as UTF-8. Mixing encodings in a single file guarantees corruption.
Best Practices
- Always use UTF-8 unless you have a specific reason not to
- Declare encoding in HTML: `<meta charset="UTF-8">`
- Set database charset to UTF-8: use `utf8mb4` in MySQL for full Unicode support
- Specify encoding when opening files: in Python, `open('file.txt', encoding='utf-8')`
- Test with international characters: use é, 中文, 😀 in your tests
- Save source code as UTF-8: configure your IDE to default to UTF-8
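The file-handling rule in practice: pick UTF-8 explicitly and use it on both sides of the round trip.

```python
# Explicit, consistent encoding for writing and reading.
with open("greeting.txt", "w", encoding="utf-8") as f:
    f.write("héllo 中文 😀")

with open("greeting.txt", encoding="utf-8") as f:
    print(f.read())  # héllo 中文 😀
```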
- For the web: UTF-8, always
- For JSON: UTF-8 (required by the spec)
- For CSV: UTF-8 with BOM if opening in Excel
- For databases: `utf8mb4` (MySQL) or UTF-8 (PostgreSQL)
- For code files: UTF-8 without BOM
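For the Excel case, Python's `utf-8-sig` codec writes the BOM for you (the file name here is illustrative):

```python
import csv

# "utf-8-sig" prepends the BOM so Excel auto-detects UTF-8.
with open("report.csv", "w", encoding="utf-8-sig", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "city"])
    writer.writerow(["José", "Zürich"])
```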
Detecting Encoding
If you receive a file with unknown encoding, use detection tools:
- Command line: `file -i filename.txt` (Linux/Mac)
- Python: the `chardet` library
- Online tools: search for "encoding detector"
- Text editors: Notepad++ shows the encoding in its status bar
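A minimal `chardet` sketch (a third-party package, `pip install chardet`; the file name is a placeholder):

```python
import chardet

with open("mystery.txt", "rb") as f:  # read raw bytes, not decoded text
    raw = f.read()

guess = chardet.detect(raw)  # e.g. {'encoding': 'utf-8', 'confidence': 0.99, ...}
print(guess["encoding"], guess["confidence"])
```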
Remember: UTF-8 is not just a technical choice—it's about inclusivity. Using UTF-8 ensures your software works for users worldwide, regardless of their language.