When my kids were young, I accidentally flubbed the pronunciation of "Santa Claus" once and said something that sounded a lot like "Centiclops", which I decided to roll with. Centiclops is a lot like a cyclops with one eye, except that, as a reading of the roots clearly indicates, this is a creature with 100 eyes.
Today I learned that Centiclops effectively has a Unicode character. As Centiclops' representative in the world of the non-imaginary, we accept that a Unicode character with a hundred eyes is not practical and we accept the representation with just a few eyes, but generally agree that upgrading from 7 to 10 is a nice improvement, as 7 does not evenly divide into 100 but 10 does. This is important, because... reasons.
From "The House of Asterion" by Jorge Luis Borges:
"It is true that I never leave my house, but it is also true that its doors (whose numbers are infinite) (footnote: The original says fourteen, but there is ample reason to infer that, as used by Asterion, this numeral stands for infinite.) are open day and night to men and to animals as well."
That reminds me of the Nahuatl word centzon, which is used to mean either 400, or an innumerable/infinite number. The Aztecs used a base-20 number system, so 400 = 20*20.
Greek mythology actually did have a "centiclops" -- Argus Panoptes ("all eyes"), who had a hundred eyes all over his body. Hera assigned him to watch over Io, a nymph who had been turned into a cow, so that Zeus wouldn't come and shag her in secret. Argus was slain by Hermes (a Zeus loyalist); to mourn and honor him, Hera had his eyes transferred to the peacock's tail.
The real Greek for a hundred-eyed being would be something like "hekatonoptes", but Argus wasn't called that as far as I know.
The nice thing about ꙮ having ten eyes is that you can now combine ten of them with U+200D ZERO WIDTH JOINER [1] to make a centiclops grapheme, as long as your font has a glyph for that particular ligature. (Readers without centiclops-compatible fonts will simply see ten separate ꙮ glyphs, an acceptable fallback for legacy systems.)
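For the curious, a minimal Python sketch of what that joiner sequence looks like at the code-point level; whether it actually ligates into a single centiclops glyph is, as noted, purely a font question:

```python
# Ten U+A66E MULTIOCULAR O joined with U+200D ZERO WIDTH JOINER.
# Most fonts will simply show ten separate ꙮ, the fallback described above.
ZWJ = "\u200d"
centiclops = ZWJ.join("\ua66e" * 10)

print(centiclops)        # ten ꙮ, possibly ligated by a suitably silly font
print(len(centiclops))   # 19 code points: 10 eyes + 9 joiners
```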
My client finds your proposal offensive and an appropriation of his culture, and also that Dekaclops guy is mean and smells bad and hasn't returned the lawnmower my client lent him even though my client has clearly referred to the need to mow his lawn several times now so he totally doesn't deserve a Unicode character.
My fingers love adding the e's on the end of any worde that can conceivably take them. Also have that problem with any word that can take an "ly" even if I don't meanly it.
Well, santa is a Spanish word meaning "holy" and saint is a cognate French word meaning the same thing. They descend from Latin sanctus; compare sanctify.
When the prayer goes "holy Mary, mother of god", "holy Mary" is an exact equivalent of "santa María".
Might as well mention “Sancta Marīa” in Latin, for example from the Christian Hail Mary[1], a recorded Latin version[2], written Latin next to English and Spanish[3] and of course translated into thousands of languages[4] although unfortunately mostly written using /A-Z/i; I am an atheist interested in languages.
In my mind, the Latin form of Mary is Mariam, because that's what my Latin teacher taught me. (He also commented that, unlike Greek names, Hebrew names never inflected in Latin, so that it would be "Mariam" regardless of what case the name should appear in.)
But it makes sense that Church Latin would be different.
“Santa” means “female saint” in Italian and Spanish. Perhaps the English “santa” came from another language but I always found the name “Santa Claus” just horrible.
It’s actually Sinterklaas (without a space) and we still call him that :) We also ended up re-importing the American Santa Claus, so these days we have two festive holidays in December.
The first mention of this version of Saint Nicholas's name has the form "St. A Claus" and appeared in the New-York Gazette of 20 Dec 1773.[1] The same issue also first reported some incident regarding tea in Boston harbour. Nice coincidence.
Saint is more or less the same as holy, just used as a title. It comes from Old French saint, seinte "holy, pious, devout," from Latin sanctus "holy, consecrated"
> Centiclops is a lot like a cyclops with one eye, except that, as a reading of the roots clearly indicates, this is a creature with 100 eyes.
Not in any normal sense of "roots". Cent is a Latin root meaning 100. ops is a Greek form meaning eye. The -i- indicates that the word is being formed in Latin, and the -cl- is entirely spurious. The original Greek word divides as cycl-ops, not cy-clops.
But it doesn't combine with ops. You'd need to talk about a hecatops or a hecatontops. And even more than it can't combine with ops, it can't combine with clops because there is no such root.
Sure, it does, in English, which stole prefixes, suffixes, and roots from Latin, Greek, and many other languages, and has no problem using them together, without special concern about where it got them from.
By the same reasoning, the 7-eyed O has now been used more than once, so it deserves a glyph! So the right way to do this is to introduce a new character for the correct glyph, and also leave the current one (perhaps changing the title). Otherwise these tweets won't make sense when read by someone who has updated to Unicode 15.0.
Honestly it probably deserves the Pluto treatment: decertification as a character. One historical use in the 1400s doesn't merit a character and never did.
Unicode's mission is to make every document "roundtrip-able". Even if a character is only used once, it should be possible to save a plaintext version of the containing document without losing any information. Roughly, I should be able to put a transcription of that one translation from the 1400s on Wikisource without using images.
You may disagree with me, and that's fine, but it doesn't change Unicode's mission. Besides, there's room for 1,112,064 codepoints[a], and only 149,146 are in use. It's predicted we'll never use it up, so what harm is there in one codepoint no one will ever need?
[a]: U+10'FFFF max; it used to be U+7FFF'FFFF, but UTF-16 and surrogates ruined that
If that was once its mission, it was clearly abandoned long ago. They rejected Klingon characters on the grounds that it has low usage for communication, and that many of the people who do communicate in Klingon use a latinized form.
ꙮ seems to just be a fancy way of writing О. I haven't seen anything that says it has a different meaning. The arguments for excluding Klingon seem to apply even more so to ꙮ.
If you look through the old mailing list postings, the oft-left-implicit problem with Klingon (as well as Tengwar, Everson’s [EDIT: misspelling] pet project) is that it may get people into legal trouble (even though in a reasonable world it shouldn’t be able to). So in the unofficial CSUR / UCSUR they remain.
A weird solitary character from the 1400s isn’t subject to that, and even if it’s a mistake it’s probably not worth breaking compatibility at this point (I think the last such break with code points genuinely changing meanings was to repair a mistaken CJK unification some time in the 00s, and the Consortium may even have tied its own hands in that regard with the ever-more-strict stability policies).
Similarly, for example, old ISO keyboard symbols (the ⌫ for erase backwards, but also a ton of virtually unused ones) were thrown in indiscriminately at the beginning of the project when attempting to cover every existing encoding, but when the ISO decided to extend the repertoire they were told to kindly provide examples of running-text (not iconic) usage in a non-member-body-controlled publication. (Crickets. The ISO keyboard input model itself only vaguely corresponds to how input methods for QWERTY-adjacent keyboards work in existing systems—as an attempt at rationalization, it seems to mostly be a failed one.)
[EDIT: Removed a section about the now-fixed typo]
> I think the last such break with code points genuinely changing meanings was to repair a mistaken CJK unification some time in the 00s, and the Consortium may even have tied its own hands in that regard with the ever-more-strict stability policies[.]
Not exactly, the last break happened between Unicode 1.1 and 2.0 and the new CJK Unified Ideographs Extension A block still contains unified characters. The main reason for break was that both Hangul and CJK(V) ideographs required tons of additional code points and it became clear that 16-bit code space is dangerously insufficient; by 1.1 there was only a single big block of unassigned code points from U+A000 to U+E7FF (18,432 total), and there were 4,516 and 6,582 new Hangul and CJK(V) ideographs in 2.0 (11,098 total).
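For anyone who wants to re-derive the counts quoted above, a quick arithmetic check (nothing new here, just the stated figures):

```python
# Re-deriving the numbers quoted above.
print(0xE7FF - 0xA000 + 1)  # 18432 unassigned code points left in Unicode 1.1
print(4516 + 6582)          # 11098 new Hangul + CJK(V) ideographs in 2.0
```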
Unless it's legitimately someone's native tongue, conlangs shouldn't be in unicode. If there are kids out there that are native Klingon speakers, then you can make the argument it should be included.
I think it makes way more sense to put a conlang in Unicode than it does a peculiar stylistic flourish only ever applied once to a single letter in a single document. If that belongs in Unicode, why not every bit of marginalia ever doodled and every uniquely adorned drop cap / initial letter?
“A linguist has revealed he talked only in Klingon to his son for the first three years of his life to find out if he could learn to speak the 'language'.
[…]
Now 13, Speers' son does not speak Klingon at all.”
I see that near it, there is an ef (Ф) with a very tall stem.
Why should that not be included as a standard unicode character? Surely it is used more often than the multiocular o.
You may say "it's a decorative flourish", which is of course true, but so is the multiocular o. Should we allow every conceivable decorative flourish into unicode? What is the standard for where flourishes become distinct characters?
Today, I wrote a document by hand containing a new symbol that only looks like genitalia if you squint really hard. Where do I apply to have it included in unicode so that it can be digitized properly?
Rule-lawyering wise-asses try to mess with many policies. It's rarely a sensible indictment of a policy, nor is it very effective. Anyone dealing with such people just ignores them.
For as inclusive as that mission is, it seems weird to me how limited in certain areas unicode is. For instance, people use peach emoji since there isn't one for butt, eggplant since there's no penis, etc.
This doesn't contradict the stated goal exactly, but it seems against the spirit of it at least.
One could argue that emoji should have never been added to Unicode in the first place. Peaches and butts are images, pictures, illustrations, whatever - but they are not characters. There's no writing system which has a colored drawing of a peach as a character.
Yes there is: a widely used character set on Japanese 'featurephones' had emoji characters (when Unicode talks about "writing systems" it explicitly includes all the computer character sets used in practice pre-Unicode), so in order to be able to include that character set, Unicode had to add emoji.
They're sort of neither. The peach emoji will render differently on iOS, Android, Windows. And I'm sure emoji-replacement packs are possible on Windows and Android (even though it's also guaranteed to be a virus).
So a peach emoji is not the same thing as the iOS peach-emoji-image. Similar to how changing my font doesn't change the actual characters.
I don't think including emojis was a great idea, but now that it's happened and people everywhere use them, emoji have become characters. I agree with your point, but it's already happened and so now there's not really any going back.
But that doesn't change the fact that most people use them and like them, and there is not much technical disruption. They just chose practicality over purity.
Not only that - people use them in textual communication the way letters traditionally are used. There is probably a much better argument for emoji than for a lot of other things in unicode (but it is a slippery slope)
That wouldn't be practical. It would make fonts too big, and videos aren't a thing that goes inline in text.
However, I could totally see some kind of open source GIF library of a few hundred meme videos and pictures, to standardize the "Reply with a GIF" thing in some P2P chat ecosystem, and maybe it could have a new URL scheme for referring to OpenMemes images.
I tried to reply with just a unicode penis but that got flagged immediately, so I'll be more substantial and leave out the actual penis. It appears in Egyptian hieroglyphs, so actually there is a penis included in unicode.
That's true, good call. I feel like there should be one without the context of Egyptian hieroglyphs, though I'm not exactly sure how that kind of thing works in unicode.
> For instance, people use peach emoji since there isn't one for butt, eggplant since there's no penis, etc.
Personally I think there should be, actually. There's all these other body parts but these are left out. Emoji is almost becoming a language and the good thing is that everyone can understand them, regardless of language. For example I could imagine these could be very useful in an international medical setting. Or for sexting, obviously, we can pretend that's not a thing but that's a bit too Victorian for me.
Of course they're not appropriate in some settings but so are many words.
I REALLY don't like that emojis are beholden to companies. For example, when the emoji for a gun was changed from a pistol to a squirtgun on many platforms, it changed the meaning of its use by a lot. You could argue that it is a good thing, but I see it as a pretty bad direction to go into.
Unicode doesn't have a character for every illuminated initial, nor should it. I'm not clear on why this character should be considered any differently.
Wow, this is probably the most actually useful and interesting comment in this whole discussion, thanks! For anyone interested, the most relevant quotes from the document are in particular:
"This document requests the addition of a number of Cyrillic characters to be added to the UCS. It also requests clarification in the Unicode Standard of four existing characters. This is a large proposal. While all of the characters are either Cyrillic characters (plus a couple which are used with the Cyrillic script), they are used by different communities. Some are used for non-Slavic minority languages and others are used for early Slavic philology and linguistics, while others are used in more recent ecclesiastical contexts. We considered the possibility of dividing the proposal into several proposals, but since this proposal involves changes to glyphs in the main Cyrillic block, adds a character to the main Cyrillic block, adds 16 characters to the Cyrillic Supplement block, adds 10 characters to the new Cyrillic Extended-A block currently under ballot, creates two entirely new Cyrillic blocks with 55 and 26 characters respectively, as well as adding two characters to the Supplementary Punctuation block, it seemed best for reviewers to keep everything together in one document.
(...)
MONOCULAR O Ꙩꙩ, BINOCULAR O Ꙫꙫ, DOUBLE MONOCULAR O Ꙭꙭ, and MULTIOCULAR O ꙮ are used in words which are based on the root for ‘eye’. The first is used when the wordform is singular, as ꙩкꙩ; the second and third are used in the root for ‘eye’ when the wordform is dual, as ꙫчи, ꙭчи; and the last in the epithet ‘many-eyed’ as in серафими многоꙮчитїй ‘many-eyed seraphim’. It has no upper-case form. See Figures 34, 41, 42, 55."
Because it's already been added to unicode. Now it's not a question of whether or not to add, rather to remove, and unicode almost by definition does not remove.
Meanwhile one still can't roundtrip regular Japanese without some kind of funky out-of-band signalling. By itself this kind of thing is harmless, but it speaks to poor prioritization from Unicode.
This is incorrect. I think you defined round-trip as something else, but a character set A providing round-trip compatibility with another set B means that B can be converted to A and back to B without loss. And it is one of Unicode's explicit goals to provide round-trip compatibility with major encodings, including Japanese ones.
Han unification only means that when you convert Japanese encodings (B) to Unicode (A), it is not distinguishable from non-Japanese encodings converted to Unicode. This means that the Unicode text doesn't always follow domestic conventions without out-of-band signaling or IVD or so. But if you know that the text was converted from a particular encoding, you can perfectly recover the original text encoded in that encoding.
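To make the distinction concrete, here is a tiny Python sketch of round-tripping in that sense, with Shift_JIS standing in for "a major Japanese encoding"; the Han unification point is that the resulting Unicode string no longer records which legacy encoding it came from:

```python
# Round-trip in the sense used above: legacy encoding B -> Unicode -> B without loss.
legacy_bytes = "漢字とカナ".encode("shift_jis")   # pretend this came from an old file

text = legacy_bytes.decode("shift_jis")           # B -> Unicode (A)
restored = text.encode("shift_jis")               # Unicode (A) -> B
assert restored == legacy_bytes                   # nothing was lost

# What Han unification costs you: `text` no longer records that it came
# from a Japanese source; that fact has to live out of band.
```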
By that logic any 8-bit encoding is round-trip compatible with all encodings, since however bad the mojibake is, if you know what the original encoding was then you can always just convert back to that.
To be fair they wanted to keep everything representable with 16 bits and that wasn't going to happen without the Han-unification. The mess when everything still had to move to a 32 bit representation has been far reaching, many programming languages went from exposing code points atomically as "char" to some half encoded nonsense value that just happens to also be a valid standalone value in UTF-16 most of the time and a source of bugs when you least expect it.
"Han Unification" - in Unicode many Japanese characters are represented as Chinese characters that look different (and subjectively ugly). The Unicode consortium's answer is that you're supposed to use a different font or something when displaying Japanese, which is pretty unsatisfying (e.g. if you want to have a block of text that contains both Japanese and Chinese, you can't represent that as just a Unicode string, it has to be some kind of rope of segments with their own fonts, at which point frankly you might as well just go back to bytes-with-encoding which at least breaks very clearly and visibly if you get it wrong).
The thing is, this is just a decorative way to write “o”. It’s not a specific letter by any definition.
I can’t speak of other letters that were added in the same batch in 2007. Some of them seem meaningful, I dunno, I don’t speak Old Church Slavonic (although I am told it sounds like Croatian, which I understand a little)
> so what harm is there in one codepoint no one will ever need?
Font bloat (do you want a font with 1 million characters in it? I don’t. Do you want to have to install 1000 fonts of 1000 characters each to be sure to cover the whole Unicode table? I don't).
Lots of issues for everyday programmers (how do you handle weird unicode characters in your validation code?), potentially leading to security issues (bypassing validation rules with close-but-different characters, phishing…)
The artist Prince changed his stage name to an unpronounceable symbol for a few years. It appears in more than one document. Should it be added to Unicode?
Isn’t there an entire Unicode block for the symbols on the Phaistos disc? Yes: https://en.wikipedia.org/wiki/Phaistos_Disc_(Unicode_block) . I suppose those occur in quite a few documents about the disc, even though the disc itself is the only known document written in those symbols.
> One historical use in the 1400s doesn't merit a character and never did
One known and surviving use. It is possible that it exists in other places, since the vast majority of the planet's written work has not been digitized. It may also have been used in other places that have not survived.
Just because it's not important to you does not mean it is not important.
The fact that it survived for 600 years makes it interesting and worth saving. It is infinitely unlikely that anything you do, write, or say will last that long.
Sure it's possible, but there should be a higher bar than "it's possible it's used more than once" for meriting inclusion in the standard keyboard of billions of devices worldwide.
The thing is, looking at the page, there are many other characters that were not added - the large red С-looking characters, for example. But for some "bizarre" reason, those were not included in Unicode...
Of course, the simple answer is that Unicode actually includes any character that someone cares enough to ask to be added, with rare exceptions.
While the origin of 彁 will never be certain, there is a good chance that it came from a misinterpretation of 彊 [1]. Why is this not an accepted theory, though? Because it is still possible that 彁 did appear in some reference source used for the standardization, and neither that source nor a source where 彊 looks like 彁 has been found.
idk. When the word Planet was redefined such that Pluto was no longer a planet, it kind of ruined the word Planet. It suddenly wasn’t nearly as useful a word as it used to be (even though now it has a precise meaning). For most people that use the word, it won’t matter (and is actually rather exciting) that they keep discovering new planets in our solar system.
If they treated the word “character” the same way, it would only serve to confuse and do no favors to the remaining glyphs.
This is temporary though, soon people will look at you funny if you say that Pluto is a planet - and/or they might not even have heard of it (though of course that is still worth learning about in a History of Science context).
We do NOT keep discovering new planets, rather minor planets (I agree that the term is confusing), more than a million of them discovered in the Solar System now, like the 9007 James Bond.
It could go either way, it is not always that the scientific meaning wins out, especially not when even scientists don’t find the new definition useful.
When I think of a planet, I think of a world that has active geology that isn’t a moon (I know excluding moons is arbitrary, and perhaps I shouldn’t do that; but hey, that’s language for you). I honestly don’t care about the orbit, and I bet that when most people think about planets they aren’t thinking about the orbit either, let alone whether the planet has cleared the orbit or not. I doubt that will change.
Not just that, but whether or not Mars is still geologically active is still an open question. If you admit planets on the basis that they have a history of geological activity, then Ceres is a planet too.
I don’t think anybody considers geological activity as particularly useful for classifying things as ‘planet’ or ‘not planet’.
Why shouldn’t Ceres be a planet? If Pluto gets to be a planet then Ceres is definitely a planet.
But there is still active geology on Mars. There is still moisture, winds and ice-caps that are shaping the environment. I consider that to be geologically active.
EDIT: And there are actual experts which consider active geology (or something similar) to be a planet, including Anton Petrov (https://www.youtube.com/watch?v=8-2HxrgqUnM)
Okay, but then you have to go and figure out which other asteroid and kuiper belt objects are planets.
The 'dwarf planet' distinction helps solve this! There are planets - distinctive in that they have clear orbits - and there are dwarf planets, which can be part of belt systems. This is a useful distinction.
Sure it is, but the distinction between terrestrial planets and gas giants are also useful, that doesn’t mean the latter aren’t planets.
I think it is fine that there are more planets than we can meaningfully count. Loads of things in our language act like that. E.g. a bug can be any number of things, and you know what a bug is by just talking about it. If some insect society then comes up with a meaningful definition of bugs which excludes spiders, that definition isn’t really doing the average user of that word any favor.
Yeah, probably strictly... But I’m not a planetary scientist. I’m merely a user of language, and I don’t need to be rigorous in my definitions. And to me the weather patterns on Jupiter are an interesting enough feature to count as geology (even though they are probably not strictly geology).
Theoretically, UTF-8 can encode up to 31 bits (U+7FFF'FFFF)[0], but for compatibility with UTF-16's surrogates, it's officially capped to 21 bits with the max being U+10'FFFF[1]. That decision was made November 2003, so there's two decades of software written with hard caps of U+10'FFFF.
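To spell out the arithmetic behind that cap (this just re-derives the figures already quoted above):

```python
# Code points run U+0000..U+10FFFF, minus the UTF-16 surrogate range U+D800..U+DFFF.
print(0x110000 - 0x800)      # 1112064 usable scalar values

print(chr(0x10FFFF))         # highest code point Python will hand out
try:
    "\ud800".encode("utf-8") # lone surrogates are not encodable
except UnicodeEncodeError as e:
    print("surrogate rejected:", e.reason)
```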
Yes, but this is a change either way, because that codepoint's definition referred to that character. Either the reference or the description of the appearance has to change.
Make a new character. Updating the existing character ruins the meaning of all previous usages.
It's like trying to change an API. Don't disrespect your existing users. Make a new version.
(ꙮ ͜ʖꙮ)
Think of all the ASCII art this botches. That has to have some historical importance to the Unicode standards body.
(⌐ꙮ_ꙮ)
For scholarly digital (unprinted) documents where the correct character rendering matters, erroneous past usages can be trivially found with grep and a date search, and easily corrected. The domain experts will familiarize themselves with this issue and fix the problem. Don't take a shotgun to it!
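As a rough illustration of that "grep and fix" workflow, a minimal Python sketch, assuming a hypothetical folder of UTF-8 transcriptions (the path is made up):

```python
# Find every document that uses U+A66E MULTIOCULAR O.
from pathlib import Path

MULTIOCULAR_O = "\ua66e"
for path in Path("transcriptions").rglob("*.txt"):
    if MULTIOCULAR_O in path.read_text(encoding="utf-8", errors="ignore"):
        print(path)
```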
This message wꙮn't have the ꙮriginally intended meaning if the characters are updated from underneath.
So the text at that point literally talks about ‘many-eyed seraphim’. The eyes symbol is a pure gag—seems to be spliced in place of the letter ‘о’ in the word ‘eye’ just a little down the line. (However, Old Slavonic is a tough read due to no spaces, so I'm not sure about that word. But at least it's not the Glagolitic script, which was just ridiculous and actually had multi-circle letters.)
I don't understand why this character needs to exist given that, at least according to the author, it has only been seen once in the wild, and it's semantically identical to another more widely used character.
I'm glad I'm not responsible for unicode. Clearly I have the wrong mindset for it.
It certainly made sense to include this package in Unicode, and the vast majority of those characters certainly should be in this proposal. You do have to draw the line somewhere, and obviously those close to the line will be debatable, no matter where you choose to draw it, like this particular symbol - but once you've decided that you will include the one-eyed O (small and capital) and the two-eyed O (small and capital), then putting in the many-eyed O as well to complete the set doesn't seem so far-fetched.
Surprisingly many characters in Unicode are only recorded a few times, if not once, before the assignment. Chinese characters for example have a lot of them, because it was relatively common to make a new character for newborns before the modern era, and some of them have survived in literature but otherwise see no use (e.g. 𡸫 U+21E2B only appears once in the Records of the Three Kingdoms 三國志). But they have still received code points because they are considered essential for digitization of historical works, and multiocular O is no different.
I didn't realize that digitization of all historical works was the goal of unicode. There's plenty of space for everything. And only a few fonts out there aim for complete coverage, like noto.
I just don't have the personal fortitude to attempt something so grandiose. Seems like a fool's errand.
Also, keep in mind there's not just one multiocular O. There's a bunch with varying numbers of eyes.
Not every goal needs to be something you can accomplish in a day or a year or even one lifetime.
There are not quite 8000 spoken languages on Earth at the moment, and a lot of them are from cultures that never invented writing. SIL has sent a missionary to most of them to learn the language, invent a writing system for it, teach it to them, and translate the New Testament into it. Most of those are fairly standard alphabets using characters from the Latin scripts, plus perhaps a few new characters or new combinations of character and diacritical. The task is large, but finite.
Imagine you’re a historian from the future studying some old document, and you spot a weird character that you’ve never seen before. Wouldn’t it be useful to be able to search for that character to see if it shows up in any other document? A simple OCR scan will bring up all the information you could ever need for that one weird symbol.
I’m not sure how I feel about this. I’m not an expert by any means.
But something just doesn’t feel right when you’ve got unicode with a character with one known use from forever ago.
Doesn’t this open up the flood gates to just a ridiculous amount of work or else biased gatekeeping?
How much work would it be to implement your own font of the entire unicode set? Or is that not actually a thing and fonts implement as-desired subsets?
There are quite a few such characters in Unicode because academic articles about things like cuneiform need to be digitized too. And because the historical record is so sparse, we often have vanishingly few, or only one example of a character, and perhaps no way to know if it was a misprint or a real character.
Actually this character seems like a scribe's joke, no different from the illustrated characters at the beginning of medieval paragraphs (all of which are represented in Unicode as A, B or whatever). But the point still holds.
It's not just the articles, it's digitization of the texts themselves and email conversations. Using characters offers the opportunity to do computational textual analysis (this allows you to do substitutions first, by replacing this character with 'o' -- much harder on a bunch of tiny images).
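A small Python sketch of the substitution step mentioned above, folding the ocular O variants back to a plain Cyrillic о before analysis (the variant-to-o mapping is my own illustration, not something specified above):

```python
# Fold the ocular O variants back to Cyrillic о for textual analysis.
OCULAR_TO_O = str.maketrans({
    "\ua669": "\u043e",  # ꙩ monocular o
    "\ua66b": "\u043e",  # ꙫ binocular o
    "\ua66d": "\u043e",  # ꙭ double monocular o
    "\ua66e": "\u043e",  # ꙮ multiocular o
})

print("серафими многоꙮчитїй".translate(OCULAR_TO_O))
# -> серафими многоочитїй
```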
Plus there's no shortage of space in the Unicode address space.
> How much work would it be to implement your own font of the entire unicode set? Or is that not actually a thing and fonts implement as-desired subsets?
You can't, and you are not expected to do so. You are limited by OpenType limit (65,535 glyphs), various shaping rules that possibly increase the number of required glyphs, and lack of local or historical typographic convention. Your best bet is either to recruit a large number of experts (e.g. Google Noto fonts) or to significantly sacrifice quality (e.g. GNU Unifont).
A single OpenType font file is limited to 65,535 glyphs. Nothing stops your font from being implemented as a series of .otf files (besides what people think of as a "font" when it comes to usage on computers).
But yes, time constraints are the limiting factor. I don't think anyone is going to dedicate their entire life to making a single font.
While you are right that one logical font can consist of multiple font files (or possibly a OpenType collection), this constraint does affect most typical fonts, and in particular wide-coverage CJK fonts already hit this limit. Fonts supporting only one of Chinese, Japanese and Korean don't need that many glyphs, and probably even two of them will be okay, but fonts with all three sets of glyphs won't. It is therefore common to provide three versions of fonts, all differently named.
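If you want to see how close a given font file sits to that ceiling, a quick check with the third-party fontTools library works; the file name below is only an example:

```python
# Report glyph count against the 65,535-glyph OpenType limit.
from fontTools.ttLib import TTFont

font = TTFont("NotoSansCJKjp-Regular.otf")
print(font["maxp"].numGlyphs, "of 65535 glyph slots used")
```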
You could also go the shady route and just make a font out of all the "reference character sheets" that the Unicode site has. Probably not legal and the result would not be pleasant to read, but that's one way to create a font containing all of Unicode.
I love this character and I love the fact that it is being updated. Just to get this right: at some point some person chose to doodle the letter instead of writing it the correct way and now we have a corresponding Unicode character? Sort of amazing and it also makes you think ...
There was a... "tradition" is a strong word, perhaps "trend" is better. Authors making copies of the Bible or related works in Cyrillic, that the letter O (equivalent to Roman O) at the beginning of the word for "eye" would be stylized to look like an eye. There are a variety of glyphs along these lines: Ꙩ, Ꙫ, Ꙭ. All of them, including ꙮ, were added to Unicode as a single group.
The glyph "ꙮ" was used to refer to an Angel with a whole buncha eyeballs, as one does. In terms of texts that survive today, this specific glyph has exactly one use in a single manuscript from the 1400's. It might have been used more, in texts which don't survive. But it is part of a larger trend, and I bet that its inclusion in Unicode depends strongly on that.
But yeah, in itself the ꙮ character exists solely so that modern computers are capable of a more-faithful rendition of the transcription of a single handwritten copy of the Book of Psalms.
Thank you for describing the missing context. I couldn't understand why this stylized letter deserved a code point more than the uncountable others. I don't necessarily agree still, but the fact that this character sits within a larger trend rather than being a one-off makes it much more reasonable.
> modern computers are capable of a more-faithful rendition of the transcription of a single handwritten copy of the Book of Psalms.
I wonder if there is even a copy of the book transcribed to actual characters or if it only exists as scanned PDF copies? If anyone did transcribe it, would they have any knowledge that the ꙮ character even exists on computers?
The Bible doesn't specify how many eyes seraphim have.
"In the center, around the throne, were four living creatures, and they were covered with eyes, in front and in back. ... Each of the four living creatures had six wings and was covered with eyes all around, even under its wings."
I attended a Unicode meeting (or maybe two? not sure?) and came away with the impression that Unicode is like those open source projects that are used by half of the world and maintained by a handful of skilled and benevolent people.
In Unicode's case I think most of them are paid, at least.
That is what I understood too. It doesn’t seem particularly hard to add new letters to Unicode too if you try a bit.
However that is a bit harder with emojis, which have their own subcommittee, which seems to be more bureaucratic and also more popular than the rest of Unicode. Everyone wants to make a new emoji.
It does raise interesting questions about what counts as decoration/formatting and what counts as part of the actual text. You could view these ocular O characters as purely decorative (like the fancy first character in a paragraph) but they could also be seen as a quirk of spelling which should be represented in unicode.
But the multiocular O really does seem like one monk got bored one time and did some doodling.
This is not exactly a correct description. Unicode does not specify the appearance of characters, only their meaning. It seems what’s changed is the reference presentation of the character in the Unicode tables, not the character itself. Unicode goes to great lengths to preserve backwards compatibility so changing the meaning of a code point would violate that principle. Your OS or application providing Unicode 15.0.0 support will not change the appearance of U+A66E. The appearance is dependent on the font.
There was a joke that U+A66E should retain seven eyes and further eyes should be added with a ZWJ sequence [1]. If that character had somehow become very popular in modern texts, updating its glyph might have caused an interoperability problem, so such a solution would have been needed. But that didn't happen, so the glyph itself has been updated instead.
If you open the proposal [0] it kinda just looks like someone doodled some flowers on the text rather than actually used a particular letter. And given it's the ONLY existing record of this letter, it's very suspect isn't it?
my Old Church Slavonic is pretty rusty (well, nonexistent), but "mnogo" looks like modern Russian много (many), and the -imi I guess would be instrumental plural like -ими? but Russian for "eye" is глаз or око. I'm guessing oč -> око, and it's a compound word? or is the č an infix, something like "ogo" is eye, and mnogoočimi is such because the two -og-s (one from mnog and the other from go) fuse because "mnogoögočimi" would be awkward to pronounce?
I feel like the spelling should be updated to Behꙮlders, or better yet, BehꙨꙮlders, to reflect that (of course, this would only make sense once the glyph update actually hits).
> written in an extinct language, Old Church Slavonic
It’s absolutely not extinct and is used by the Eastern Orthodox Church in their religious texts almost exclusively. It’s taught to children alongside their Sunday school curriculum and, of course, in seminaries.
Generally languages with only liturgical usage are not considered “living” languages, just as the Latin of the Catholic Church is still considered a “dead” language.
Unicode can be ridiculous at times. It contains a character used once in a single manuscript in an extinct language, but not a standardized glyph for an external URL link.
This kind of stupid thing is my problem with Unicode. We have all this baggage for stuff that nobody uses, and we need to deal with it forever. The worst for me is that there is no possible way to encode a grapheme cluster as a constant size, so using Unicode makes it impossible to have simple character access like an old-style C string, no matter how big you make your char, even though it's totally possible with damn near every language that people actually use.
So then we all end up paying this massive complexity tax everywhere to pay for support for some Mongolian script that died out 200 years ago (or multi codepoint encodings of simple things like é - just why, it was so avoidable).
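For the é case specifically, a short Python illustration of why a fixed-size "char" can't capture a grapheme cluster:

```python
# The same visible é can be one code point (precomposed) or two
# (e + combining acute), so "one char, one slot" breaks down.
import unicodedata

nfc = "\u00e9"    # é, precomposed
nfd = "e\u0301"   # e + COMBINING ACUTE ACCENT
print(len(nfc), len(nfd))                         # 1 2
print(nfc == nfd)                                 # False
print(unicodedata.normalize("NFC", nfd) == nfc)   # True
```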
> encode a grapheme cluster as a constant size […] totally possible with damn near every language that people actually use
This is not true. For a concrete example: the languages Hindi and Marathi, with ~500 million speakers, use the Devanagari script (also used by Nepali and Sanskrit), in which a grapheme cluster is (usually) a sequence of consonants followed by a vowel. For instance, something like "bhuktvā" (भुक्त्वा) would be two grapheme clusters, one (भु) for "bhu" and one (क्त्वा) for "ktvā". In Unicode each vowel and consonant (here, bh, u, k, t, v, ā) is separately encoded, which is the only reasonable thing to do, and inevitably means that grapheme clusters can have different lengths (number of code points). The alternative would have been to encode every possible (sequence of consonants + vowel) as a single codepoint, which gets ridiculous quickly: these sequences can be up to 5 consonants long, so you'd end up having to encode (33^5 * 13 ≈ 500M) codepoints for Devanagari alone (or completely prevent certain sequences of consonants from being expressed, which makes no sense either), not to mention that most of the scripts of the Indian subcontinent and south-east Asia follow the same principle and have similar issues (e.g. Bengali with 250M speakers, Telugu, Javanese, Punjabi, Kannada, Gujarati, Thai with over 50M speakers each, etc).
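To make that example concrete, here are the code points behind those two clusters (just listing what the comment above already describes):

```python
# भुक्त्वा = भु ("bhu") + क्त्वा ("ktvā"), eight code points across two clusters.
import unicodedata

bhuktva = "\u092d\u0941\u0915\u094d\u0924\u094d\u0935\u093e"
for ch in bhuktva:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```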
Have you ever written software before Unicode? We had N different encodings for each language, each culture, each country. There were all kinds of bugs creeping up, and software that works perfectly well could be buggy for one random language. Unicode abstracted all of this away from the programmer in a pretty simple fashion. I simply do not see how we're paying the "complexity tax" by using Unicode, unless you're writing a library that handles Unicode (which you shouldn't do, you should use existing libraries) you don't need to know anything about Unicode.
Before Unicode, everyone who came up with a character encoding scheme probably thought their system was good enough for any reasonable use-case. But they all had limitations that made them inadequate for things less obscure than representing some dead Mongolian language.
It would be nice if we could come up with some magical system that optimally encodes all the text that "matters" and ignores everything else, but history has shown that to be very hard. So we're left with Unicode, which takes the approach of giving us (effectively) infinite code points to represent characters, with (effectively) infinite ways to visually represent them. That does lead to a bunch of "unnecessary" baggage and headaches, but it also solves a bunch of real problems that you probably don't know exist.
Unicode is a pain in the ass, but it's a solution to a very hard problem. You can feel free to design your own solution, but you'll probably run head-first into all the problems Unicode was trying to solve from 40 years ago.
I'm getting the impression that this is only "obvious" from a latin-cyrillic-greek alphabet point of view ?
P.S.: Also, even for those, it would seem that one of the big reasons things like combining characters were added to Unicode was to be backwards compatible even with mutually incompatible encodings?
Your notion of character doesn't necessarily match others, and there are many cases where the number of possible "characters" in some notion is unbounded. Unicode provides a very well-defined superset of those notions for you. Collecting characters is only a minor portion of their jobs.
Am I alone in thinking that this is not so much a separate character, as a doodle a bored monk made to relieve a tiny bit of the tedium of copying manuscripts?
I was astonished not to see this mentioned at all when I saw the post earlier! Almost commented about it myself but I wanted to think about something else.
Use a font that contains the previous glyph. This is just an update to the reference glyph, and there is nothing preventing you from using a font that has an upside-down A in the place of U+0041.
There's an emoji for handgun, but Apple and other big tech decided it needed to be a water gun. There is also a rifle character intended to represent the sport of shooting in a pentathlon, but again Apple threw its weight around and, while the character became codified in Unicode, it never became an emoji and no font from big tech supports it.
I guess because the goal of Unicode is to be able to represent every character that's appeared in language. This one is in a published book, while guns and a sexual intercourse symbol aren't.
Emoji was a weird value add that Japanese mobile providers added to their phones before Unicode. To get them to move to Unicode, they had to keep them. That's why there's a Tokyo Tower emoji, but not an Eiffel Tower. That's why the post office has a 〒 on it. That people get any use out of emoji outside of Japan is really pure luck.
I've even heard emoji referred to as "the carrot that keeps the implementations current." Every time a new version of Unicode is published, a few more emoji are tacked on. It acts as incentive for all the cellphone carriers and such to put the money into updating their implementations, because nobody wants to be the one on the block with the one phone that can't render "Mirror Ball".
Incidentally, Windows doesn't have the mirror ball. I guess it is a carrot to get me to upgrade to Windows 11, which I am skipping. (The key with Windows is to only use the good versions; XP, 7, 10, ???. Hoping ??? arrives soon ;)
That seems actually logical when you consider that kanji presumably began as simple depictions of objects that could be drawn quickly. Perhaps the only difference between emoji and kanji is time.
There's a career path to get there. It involves becoming someone who cares deeply about the ways and means of digitizing data stored in analog media. Drill down deep enough, and you'll find yourself in a fascinating world of encoding errors.
There are things like the "ghost characters," which are codepoints in Japanese that map to characters that were basically transcription errors when the team was putting together a full set of Kanji. Some characters with an extra horizontal line snuck into the set; they were likely caused by a transcription error because the character got split onto two pieces of paper by lines of text being copy-pasted into a records book, and the shadow cast by the thin extra layer of paper was misinterpreted as another stroke.
And then people wonder why software developers don't care to support Unicode properly.
The first 60,000+ characters made sense, then a few more were needed, Unicode suddenly got to play with 1,000,000+ code points, and it just went off the rails.