
WinNT missing out on UTF-8 and instead going with UCS-2 for their UNICODE text encoding might have been "the other" billion dollar mistake in the history of computing ;)

There was a window of roughly ten months between the invention of UTF-8 and the first release of WinNT (Sep 1992 to Jul 1993).

But ok fine, UTF-8 didn't really become popular until the web became popular.

But then missing the other opportunity to make the transition with the release of the first consumer version of WinNT (WinXP) nearly a decade later is inexcusable.



"UTF-8 was first officially presented at the USENIX conference in San Diego, from January 25 to 29, 1993" (https://en.wikipedia.org/wiki/UTF-8)

Hey team, we're working to release an ambitious new operating system in about 6 months, but I've decided we should burn the midnight oil to rip out all of the text handling we worked on and redo it with something that was just introduced at a conference.

Oh, and all the folks building their software against the beta for the last few months? Well, they knew what they were getting themselves into; after all, it is a beta (https://books.google.com/books?id=elEEAAAAMBAJ&pg=PA1#v=onep...)

As for Windows XP, so now we're going to add a third version of the A/W APIs?

More background: https://devblogs.microsoft.com/oldnewthing/20190830-00/?p=10...


Interestingly, there is another story on the HN front page about Steve Wozniak doing exactly that for the Apple I:

https://news.ycombinator.com/item?id=45265240


The 6502 and the 6800 are pretty similar. The 6501 was pin-compatible with the 6800 but not software-compatible; it was dropped as part of a settlement with Motorola.

Changing an in-progress system design to a similar chip that was much less expensive ($25 at the convention vs. $175 for a 6800, dropped to $69 the month after the convention) is a leap of faith, but the cost difference is obvious justification, and the Apple I had no legacy to stay compatible with.

It would have been great if Windows NT could have picked up UTF-8, but it's a bigger leap and the benefit wasn't as clear; variable-width encodings are painful in a lot of ways, and 16 bits per code point seemed like it would be enough for anybody.
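(To make "painful" concrete, a minimal illustrative sketch of my own: finding the nth code point in UTF-8 requires a linear scan, whereas a fixed 16-bit encoding is a single array access, ws[n].)

    #include <stddef.h>

    /* Return a pointer to the nth code point (0-based) in a UTF-8
     * string by skipping continuation bytes (10xxxxxx), or NULL if
     * the string has fewer code points. O(length) per lookup. */
    static const char *utf8_index(const char *s, size_t n) {
        for (; *s; s++)
            if (((unsigned char)*s & 0xC0) != 0x80 && n-- == 0)
                return s;   /* a lead byte (or plain ASCII byte) */
        return NULL;
    }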


My takeaway from this story has always been that both MS and Plan 9 simply passively implemented Unicode as received. It was only IBM that had the vision to see that the encoding was wrong and they should make a new one.


But doesn't OS/2 itself use UCS-2 internally? And only years later (1995+)?

Potential source: https://ia802804.us.archive.org/13/items/os2developmentrelat...


The history is more complicated than that. Originally, ISO/IEC 10646 and Unicode were two separate efforts, with only ISO having 31-ish-bit ideas and Unicode being strictly 16 bits [0][1]. UTF-8 as in TFA was clearly developed to cover the 31-bit ISO character collection, with the encoding going up to 6 bytes per character; only years later (RFC 3629, 2003) was it restricted to the 4 bytes sufficient to cover the 21-bit U+0000..U+10FFFF code space that Unicode 2.0 (1996) had established via surrogates. The initial UTF-8 development is therefore somewhat beyond the scope of what Unicode 1.x was about at the time.
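For reference, a minimal sketch of the original 1992 scheme (my illustration, not from TFA); the RFC 3629 restriction simply outlaws everything past the 4-byte row:

    #include <stdio.h>
    #include <stdint.h>

    /* Encode a code point with the original FSS-UTF/UTF-8 ranges,
     * up to 6 bytes for the full 31-bit ISO 10646 space. RFC 3629
     * (2003) later capped valid input at U+10FFFF (4 bytes). */
    static int utf8_encode(uint32_t cp, unsigned char out[6]) {
        int len;
        if      (cp < 0x80)      { out[0] = (unsigned char)cp; return 1; }
        else if (cp < 0x800)     { len = 2; out[0] = 0xC0 | (cp >> 6);  }
        else if (cp < 0x10000)   { len = 3; out[0] = 0xE0 | (cp >> 12); }
        else if (cp < 0x200000)  { len = 4; out[0] = 0xF0 | (cp >> 18); }
        else if (cp < 0x4000000) { len = 5; out[0] = 0xF8 | (cp >> 24); }
        else                     { len = 6; out[0] = 0xFC | (cp >> 30); }
        for (int i = len - 1; i >= 1; i--, cp >>= 6)
            out[i] = 0x80 | (cp & 0x3F);   /* continuation bytes */
        return len;
    }

    int main(void) {
        unsigned char buf[6];
        uint32_t samples[] = { 0x41, 0x3B1, 0x20AC, 0x1F600, 0x7FFFFFFF };
        for (size_t i = 0; i < sizeof samples / sizeof *samples; i++) {
            int n = utf8_encode(samples[i], buf);
            printf("U+%06X -> %d byte(s)\n", (unsigned)samples[i], n);
        }
        return 0;
    }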

Furthermore, the development of Windows NT had already begun in 1989 (then planned as OS/2 3.0) and proceeded in parallel with the finalization of Unicode 1.0 and its eventual adoption by ISO, which led to Unicode 1.1 and ISO/IEC 10646-1:1993. It was natural to adopt that standardization effort.

Once established, the 16-bit encoding used by Windows NT was ingrained in kernel and userspace APIs, notably the BSTR string type used by Visual Basic and COM, and importantly in NTFS. Adopting UTF-8 for Windows XP would have provided little benefit at that point while causing a lot of complications. For backwards compatibility, something like WTF-8 would effectively have been required (since NT-era strings can contain unpaired surrogates that plain UTF-8 cannot represent), and there would have been an additional performance penalty for converting back and forth between the existing WCHAR/BSTR APIs and serializations. It wasn't remotely a viable opportunity for such a far-reaching change.
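To make that conversion tax concrete, a sketch of mine (not anything XP actually shipped) of the UTF-16/UTF-8 bridging every such API boundary would have needed, using the real Win32 conversion calls:

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        /* UTF-16, as NT kernel APIs, BSTRs, and NTFS store text. */
        const wchar_t *wide = L"Grüße";

        /* Pass 1: query the required UTF-8 size (bytes, incl. NUL). */
        int n = WideCharToMultiByte(CP_UTF8, 0, wide, -1, NULL, 0, NULL, NULL);
        char utf8[64];
        if (n <= 0 || n > (int)sizeof utf8) return 1;
        /* Pass 2: the actual conversion. */
        WideCharToMultiByte(CP_UTF8, 0, wide, -1, utf8, n, NULL, NULL);
        printf("UTF-8 size: %d bytes\n", n);  /* 8: ü and ß take 2 bytes each */

        /* ...and back again before calling into the WCHAR/BSTR world. */
        wchar_t back[64];
        MultiByteToWideChar(CP_UTF8, 0, utf8, -1, back, 64);
        return 0;
    }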

Lastly, my recollection is that UTF-8 only became really widespread on the web some time after the release of Windows XP (2001), maybe roughly around Vista.

[0] https://en.wikipedia.org/wiki/Universal_Coded_Character_Set#...

[1] "Internationalization and character set standards", September 1993, https://dl.acm.org/doi/pdf/10.1145/174683.174687


Windows using CP-125x encodings by default in many countries instead of UTF-8 did a lot of damage, at least in my experience.


For many European languages like French or German, the switch from the local CP encodings meant that only a few characters like å, ñ, ç, etc. would require extra bytes, and thus the switch to UTF-8 was a no-brainer.

On the other hand, Cyrillic and Greek are two examples of short alphabets that could be combined with ASCII into a single-byte encoding for countries like Greece, Bulgaria, and Russia. For those locales, switching to UTF-8 meant extra bytes for every character of the local language, and thus higher storage, memory, and bandwidth requirements across all computing. So non-Unicode encodings stuck around there a lot longer.
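The arithmetic is easy to check (my example; "привет" is six Cyrillic letters):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* "привет" in CP-1251: one byte per letter. */
        const char cp1251[] = "\xEF\xF0\xE8\xE2\xE5\xF2";
        /* The same word in UTF-8: two bytes per letter. */
        const char utf8[]   = "\xD0\xBF\xD1\x80\xD0\xB8"
                              "\xD0\xB2\xD0\xB5\xD1\x82";

        printf("CP-1251: %zu bytes\n", strlen(cp1251)); /* 6  */
        printf("UTF-8:   %zu bytes\n", strlen(utf8));   /* 12 */
        return 0;
    }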


And back then Unicode was just 16 bits, so UTF-8 wasn't such an obvious advantage in flexibility.


Can't imagine they would've wanted to change encoding between Win3.1 and NT3.1.


But they did?


UCS-2 support was first released with an add-on for Win3.1 (which otherwise still did most things with multiple character sets).


IIRC Win32s (the subset of Win32 released for Windows 3.1) only added UCS-2 string processing, none of the system-wide character APIs.

I think the actual OS was still all codepage-based (with the "multibyte" versions for things like East Asian languages being pretty much forks), and Windows 95 wasn't really much different.


Win32s brings codepage-to/from-widechar APIs and codepage table files (P_*.NLS), plus an "AnsiCP=" setting in the [NLS] section of win32s.ini.

16-bit IE brings its own MSNLS.DLL for handling codepages other than the ACP (active codepage) on Win3.1x.

And Win9x also works mainly in the ANSI codepage, with some kernel-side Unicode support.


And nowadays developers have to deal with the "A/W" suffix bullshit.
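For anyone who hasn't had the pleasure: every text-taking Win32 function exists twice, and the unsuffixed name is just a preprocessor macro chosen by whether UNICODE is defined at compile time. A minimal sketch:

    #include <windows.h>

    /* In the SDK headers, roughly:
     *   #ifdef UNICODE
     *   #define MessageBox MessageBoxW
     *   #else
     *   #define MessageBox MessageBoxA
     *   #endif
     */
    int main(void) {
        MessageBoxA(NULL, "char* text, interpreted in the ANSI codepage", "A", MB_OK);
        MessageBoxW(NULL, L"UTF-16 text, NT's native form", L"W", MB_OK);
        /* What the bare name means depends on the compile flags: */
        MessageBox(NULL, TEXT("A or W, per -DUNICODE"), TEXT("?"), MB_OK);
        return 0;
    }

(Since Windows 10 1903, an application manifest can set the process ANSI codepage to UTF-8, which finally makes the -A entry points usable as UTF-8 APIs.)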



