When my kids were young, I accidentally flubbed the pronunciation of "Santa Claus" once and said something that sounded a lot like "Centiclops", which I decided to roll with. Centiclops is a lot like a cyclops with one eye, except that, as a reading of the roots clearly indicates, this is a creature with 100 eyes.
Today I learned that Centiclops effectively has a Unicode character. As Centiclops' representative in the world of the non-imaginary, we accept that a Unicode character with a hundred eyes is not practical and we accept the representation with just a few eyes, but generally agree that upgrading from 7 to 10 is a nice improvement, as 7 does not evenly divide into 100 but 10 does. This is important, because... reasons.
From "The House of Asterion" by Jorge Luis Borges:
"It is true that I never leave my house, but it is also true that its doors (whose numbers are infinite) (footnote: The original says fourteen, but there is ample reason to infer that, as used by Asterion, this numeral stands for infinite.) are open day and night to men and to animals as well."
That reminds me of the Nahuatl word centzon, which is used to mean either 400, or an innumerable/infinite number. The Aztecs used a base-20 number system, so 400 = 20*20.
Greek mythology actually did have a "centiclops" -- Argus Panoptes ("all eyes"), who had a hundred eyes all over his body. Hera assigned him to watch over Io, a nymph who had been turned into a cow, so that Zeus wouldn't come and shag her in secret. Argus was slain by Hermes (a Zeus loyalist); to mourn and honor him, Hera had his eyes transferred to the peacock's tail.
The real Greek for a hundred-eyed being would be something like "hekatonoptes", but Argus wasn't called that as far as I know.
The nice thing about ꙮ having ten eyes is that you can now combine ten of them with U+200D ZERO WIDTH JOINER [1] to make a centiclops grapheme, as long as your font has a glyph for that particular ligature. (Readers without centiclops-compatible fonts will simply see ten separate ꙮ glyphs, an acceptable fallback for legacy systems.)
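For the curious, a minimal Python sketch of what that joiner sequence looks like at the code-point level; whether it actually ligates into a single centiclops glyph is, as noted, purely a font question:

```python
# Ten U+A66E MULTIOCULAR O joined with U+200D ZERO WIDTH JOINER.
# Most fonts will simply show ten separate ꙮ, the fallback described above.
ZWJ = "\u200d"
centiclops = ZWJ.join("\ua66e" * 10)

print(centiclops)        # ten ꙮ, possibly ligated by a suitably silly font
print(len(centiclops))   # 19 code points: 10 eyes + 9 joiners
```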
My client finds your proposal offensive and an appropriation of his culture, and also that Dekaclops guy is mean and smells bad and hasn't returned the lawnmower my client lent him even though my client has clearly referred to the need to mow his lawn several times now so he totally doesn't deserve a Unicode character.
My fingers love adding the e's on the end of any worde that can conceivably take them. Also have that problem with any word that can take an "ly" even if I don't meanly it.
Well, santa is a Spanish word meaning "holy" and saint is a cognate French word meaning the same thing. They descend from Latin sanctus; compare sanctify.
When the prayer goes "holy Mary, mother of god", "holy Mary" is an exact equivalent of "santa María".
Might as well mention “Sancta Marīa” in Latin, for example from the Christian Hail Mary[1], a recorded Latin version[2], written Latin next to English and Spanish[3] and of course translated into thousands of languages[4] although unfortunately mostly written using /A-Z/i; I am an atheist interested in languages.
In my mind, the Latin form of Mary is Mariam, because that's what my Latin teacher taught me. (He also commented that, unlike Greek names, Hebrew names never inflected in Latin, so that it would be "Mariam" regardless of what case the name should appear in.)
But it makes sense that Church Latin would be different.
“Santa” means “female saint” in Italian and Spanish. Perhaps the English “santa” came from another language but I always found the name “Santa Claus” just horrible.
It’s actually Sinterklaas (without a space) and we still call him that :) We also ended up re-importing the American Santa Claus, so these days we have two festive holidays in December.
The first mention of this version of Saint Nicholas's name has the form "St. A Claus" and appeared in the New-York Gazette of 20 Dec 1773.[1] The same issue also first reported some incident regarding tea in Boston harbour. Nice coincidence.
Saint is more or less the same as holy, just used as a title. It comes from Old French saint, seinte "holy, pious, devout," from Latin sanctus "holy, consecrated"
> Centiclops is a lot like a cyclops with one eye, except that, as a reading of the roots clearly indicates, this is a creature with 100 eyes.
Not in any normal sense of "roots". Cent is a Latin root meaning 100. ops is a Greek form meaning eye. The -i- indicates that the word is being formed in Latin, and the -cl- is entirely spurious. The original Greek word divides as cycl-ops, not cy-clops.
But it doesn't combine with ops. You'd need to talk about a hecatops or a hecatontops. And even more than it can't combine with ops, it can't combine with clops because there is no such root.
Sure, it does, in English, which stole prefixes, suffixes, and roots from Latin, Greek, and many other languages, and has no problem using them together, without special concern about where it got them from.
By the same reasoning, the 7-eyed O has now been used more than once, so it deserves a glyph! So the right way to do this is to introduce a new character for the correct glyph, and also leave the current one (perhaps changing the title). Otherwise these tweets won't make sense when read by someone who has updated to Unicode 15.0.
Honestly it probably deserves the Pluto treatment: decertification as a character. One historical use in the 1400s doesn't merit a character and never did.
Unicode's mission is to make every document "roundtrip-able". Even if a character is only used once, it should be possible to save a plaintext version of the containing document without losing any information. Roughly, I should be able to put a transcription of that one translation from the 1400s on Wikisource without using images.
You may disagree with me, and that's fine, but it doesn't change Unicode's mission. Besides, there's room for 1,112,064 codepoints[a], and only 149,146 are in use. It's predicted we'll never use it up, so what harm is there in one codepoint no one will ever need?
[a]: U+10'FFFF max; it used to be U+7FFF'FFFF, but UTF-16 and surrogates ruined that
If that was once its mission, it was clearly abandoned long ago. They rejected Klingon characters on the grounds that it has low usage for communication, and that many of the people who do communicate in Klingon use a latinized form.
ꙮ seems to just be a fancy way of writing О. I haven't seen anything that says it has a different meaning. The arguments for excluding Klingon seem to apply even more so to ꙮ.
If you look through the old mailing list postings, the oft-left-implicit problem with Klingon (as well as Tengwar, Everson’s [EDIT: misspelling] pet project) is that it may get people into legal trouble (even though in a reasonable world it shouldn’t be able to). So in the unofficial CSUR / UCSUR they remain.
A weird solitary character from the 1400s isn’t subject to that, and even if it’s a mistake it’s probably not worth breaking compatibility at this point (I think the last such break with code points genuinely changing meanings was to repair a mistaken CJK unification some time in the 00s, and the Consortium may even have tied its own hands in that regard with the ever-more-strict stability policies).
Similarly, for example, old ISO keyboard symbols (the ⌫ for erase backwards, but also a ton of virtually unused ones) were thrown in indiscriminately at the beginning of the project when attempting to cover every existing encoding, but when the ISO decided to extend the repertoire they were told to kindly provide examples of running-text (not iconic) usage in a non-member-body-controlled publication. (Crickets. The ISO keyboard input model itself only vaguely corresponds to how input methods for QWERTY-adjacent keyboards work in existing systems—as an attempt at rationalization, it seems to mostly be a failed one.)
[EDIT: Removed a section about the now-fixed typo]
> I think the last such break with code points genuinely changing meanings was to repair a mistaken CJK unification some time in the 00s, and the Consortium may even have tied its own hands in that regard with the ever-more-strict stability policies[.]
Not exactly, the last break happened between Unicode 1.1 and 2.0 and the new CJK Unified Ideographs Extension A block still contains unified characters. The main reason for break was that both Hangul and CJK(V) ideographs required tons of additional code points and it became clear that 16-bit code space is dangerously insufficient; by 1.1 there was only a single big block of unassigned code points from U+A000 to U+E7FF (18,432 total), and there were 4,516 and 6,582 new Hangul and CJK(V) ideographs in 2.0 (11,098 total).
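For anyone who wants to re-derive the counts quoted above, a quick arithmetic check (nothing new here, just the stated figures):

```python
# Re-deriving the numbers quoted above.
print(0xE7FF - 0xA000 + 1)  # 18432 unassigned code points left in Unicode 1.1
print(4516 + 6582)          # 11098 new Hangul + CJK(V) ideographs in 2.0
```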
Unless it's legitimately someone's native tongue, conlangs shouldn't be in unicode. If there are kids out there that are native Klingon speakers, then you can make the argument it should be included.
I think it makes way more sense to put a conlang in Unicode than it does a peculiar stylistic flourish only ever applied once to a single letter in a single document. If that belongs in Unicode, why not every bit of marginalia ever doodled and every uniquely adorned drop cap / initial letter?
“A linguist has revealed he talked only in Klingon to his son for the first three years of his life to find out if he could learn to speak the 'language'.
[…]
Now 13, Speers' son does not speak Klingon at all.”
I see that near it, there is an ef (Ф) with a very tall stem.
Why should that not be included as a standard unicode character? Surely it is used more often than the multiocular o.
You may say "it's a decorative flourish", which is of course true, but so is the multiocular o. Should we allow every conceivable decorative flourish into unicode? What is the standard for where flourishes become distinct characters?
Today, I wrote a document by hand containing a new symbol that only looks like genitalia if you squint really hard. Where do I apply to have it included in unicode so that it can be digitized properly?
Rule-lawyering wise-asses try to mess with many policies. It's rarely a sensible indictment of a policy, nor is it very effective. Anyone dealing with such people just ignores them.
For as inclusive as that mission is, it seems weird to me how limited in certain areas unicode is. For instance, people use peach emoji since there isn't one for butt, eggplant since there's no penis, etc.
This doesn't contradict the stated goal exactly, but it seems against the spirit of it at least.
One could argue that emoji should have never been added to Unicode in the first place. Peaches and butts are images, pictures, illustrations, whatever - but they are not characters. There's no writing system which has a colored drawing of a peach as a character.
Yes there is: a widely used character set on Japanese 'featurephones' had emoji characters (when Unicode talks about "writing systems" it explicitly includes all the computer character sets used in practice pre-Unicode), so in order to be able to include that character set, Unicode had to add emoji.
They're sort of neither. The peach emoji will render differently on iOS, Android, Windows. And I'm sure emoji-replacement packs are possible on Windows and Android (even though it's also guaranteed to be a virus).
So a peach emoji is not the same thing as the iOS peach-emoji-image. Similar to how changing my font doesn't change the actual characters.
I don't think including emojis was a great idea, but now that it's happened and people everywhere use them, emoji have become characters. I agree with your point, but it's already happened and so now there's not really any going back.
But that doesn't change the fact that most people use them and like them, and there is not much technical disruption. They just chose practicality over purity.
Not only that - people use them in textual communication the way letters traditionally are used. There is probably a much better argument for emoji than for a lot of other things in unicode (but it is a slippery slope)
That wouldn't be practical. It would make fonts too big, and videos aren't a thing that goes inline in text.
However, I could totally see some kind of open source GIF library of a few hundred meme videos and pictures, to standardize the "Reply with a GIF" thing in some P2P chat ecosystem, and maybe it could have a new URL scheme for referring to OpenMemes images.
I tried to reply with just a unicode penis but that got flagged immediately, so I'll be more substantial and leave out the actual penis. It appears in Egyptian hieroglyphs, so actually there is a penis included in unicode.
That's true, good call. I feel like there should be one without the context of Egyptian hieroglyphs, though I'm not exactly sure how that kind of thing works in unicode.
> For instance, people use peach emoji since there isn't one for butt, eggplant since there's no penis, etc.
Personally I think there should be, actually. There's all these other body parts but these are left out. Emoji is almost becoming a language and the good thing is that everyone can understand them, regardless of language. For example I could imagine these could be very useful in an international medical setting. Or for sexting, obviously, we can pretend that's not a thing but that's a bit too Victorian for me.
Of course they're not appropriate in some settings but so are many words.
I REALLY don't like that emojis are beholden to companies. For example, when the emoji for a gun was changed from a pistol to a squirtgun on many platforms, it changed the meaning of its use by a lot. You could argue that it is a good thing, but I see it as a pretty bad direction to go into.
Unicode doesn't have a character for every illuminated initial, nor should it. I'm not clear on why this character should be considered any differently.
Wow, this is probably the most actually useful and interesting comment in this whole discussion, thanks! For anyone interested, the most relevant quotes from the document are in particular:
"This document requests the addition of a number of Cyrillic characters to be added to the UCS. It also requests clarification in the Unicode Standard of four existing characters. This is a large proposal. While all of the characters are either Cyrillic characters (plus a couple which are used with the Cyrillic script), they are used by different communities. Some are used for non-Slavic minority languages and others are used for early Slavic philology and linguistics, while others are used in more recent ecclesiastical contexts. We considered the possibility of dividing the proposal into several proposals, but since this proposal involves changes to glyphs in the main Cyrillic block, adds a character to the main Cyrillic block, adds 16 characters to the Cyrillic Supplement block, adds 10 characters to the new Cyrillic Extended-A block currently under ballot, creates two entirely new Cyrillic blocks with 55 and 26 characters respectively, as well as adding two characters to the Supplementary Punctuation block, it seemed best for reviewers to keep everything together in one document.
(...)
MONOCULAR O Ꙩꙩ, BINOCULAR O Ꙫꙫ, DOUBLE MONOCULAR O Ꙭꙭ, and MULTIOCULAR O ꙮ are used in words which are based on the root for ‘eye’. The first is used when the wordform is singular, as ꙩкꙩ; the second and third are used in the root for ‘eye’ when the wordform is dual, as ꙫчи, ꙭчи; and the last in the epithet ‘many-eyed’ as in серафими многоꙮчитїй ‘many-eyed seraphim’. It has no upper-case form. See Figures 34, 41, 42, 55."
Because it's already been added to unicode. Now it's not a question of whether or not to add, rather to remove, and unicode almost by definition does not remove.
Meanwhile one still can't roundtrip regular Japanese without some kind of funky out-of-band signalling. By itself this kind of thing is harmless, but it speaks to poor prioritization from Unicode.
This is incorrect. I think you defined round-trip as something else, but a character set A providing round-trip compatibility with another set B means that B can be converted to A and back to B without loss. And it is one of Unicode's explicit goals to provide round-trip compatibility with major encodings, including Japanese ones.
Han unification only means that when you convert Japanese encodings (B) to Unicode (A), it is not distinguishable from non-Japanese encodings converted to Unicode. This means that the Unicode text doesn't always follow domestic conventions without out-of-band signaling or IVD or so. But if you know that the text was converted from a particular encoding, you can perfectly recover the original text encoded in that encoding.
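To make the distinction concrete, here is a tiny Python sketch of round-tripping in that sense, with Shift_JIS standing in for "a major Japanese encoding"; the Han unification point is that the resulting Unicode string no longer records which legacy encoding it came from:

```python
# Round-trip in the sense used above: legacy encoding B -> Unicode -> B without loss.
legacy_bytes = "漢字とカナ".encode("shift_jis")   # pretend this came from an old file

text = legacy_bytes.decode("shift_jis")           # B -> Unicode (A)
restored = text.encode("shift_jis")               # Unicode (A) -> B
assert restored == legacy_bytes                   # nothing was lost

# What Han unification costs you: `text` no longer records that it came
# from a Japanese source; that fact has to live out of band.
```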
By that logic any 8-bit encoding is round-trip compatible with all encodings, since however bad the mojibake is, if you know what the original encoding was then you can always just convert back to that.
To be fair they wanted to keep everything representable with 16 bits and that wasn't going to happen without the Han-unification. The mess when everything still had to move to a 32 bit representation has been far reaching, many programming languages went from exposing code points atomically as "char" to some half encoded nonsense value that just happens to also be a valid standalone value in UTF-16 most of the time and a source of bugs when you least expect it.
"Han Unification" - in Unicode many Japanese characters are represented as Chinese characters that look different (and subjectively ugly). The Unicode consortium's answer is that you're supposed to use a different font or something when displaying Japanese, which is pretty unsatisfying (e.g. if you want to have a block of text that contains both Japanese and Chinese, you can't represent that as just a Unicode string, it has to be some kind of rope of segments with their own fonts, at which point frankly you might as well just go back to bytes-with-encoding which at least breaks very clearly and visibly if you get it wrong).
The thing is, this is just a decorative way to write “o”. It’s not a specific letter by any definition.
I can’t speak of other letters that were added in the same batch in 2007. Some of them seem meaningful, I dunno, I don’t speak Old Church Slavonic (although I am told it sounds like Croatian, which I understand a little)
> so what harm is there in one codepoint no one will ever need?
Font bloat (do you want a font with 1 million characters in it? I don’t. Do you want to have to install 1000 fonts of 1000 characters each to be sure to cover the whole Unicode table? I don't).
Lots of issues for everyday programmers (how do you handle weird unicode characters in your validation code?), potentially leading to security issues (bypassing validation rules with close-but-different characters, phishing…)
The artist Prince changed his stage name to an unpronounceable symbol for a few years. It appears in more than one document. Should it be added to Unicode?
Isn’t there an entire Unicode block for the symbols on the Phaistos disc? Yes: https://en.wikipedia.org/wiki/Phaistos_Disc_(Unicode_block) . I suppose those occur in quite a few documents about the disc, even though the disc itself is the only known document written in those symbols.
> One historical use in the 1400s doesn't merit a character and never did
One known and surviving use. It is possible that it exists in other places, since the vast majority of the planet's written work has not been digitized. It may also have been used in other places that have not survived.
Just because it's not important to you does not mean it is not important.
The fact that it survived for 600 years makes it interesting and worth saving. It is infinitely unlikely that anything you do, write, or say will last that long.
Sure it's possible, but there should be a higher bar than "it's possible it's used more than once" for meriting inclusion in the standard keyboard of billions of devices worldwide.
The thing is, looking at the page, there are many other characters that were not added - the large red С-looking characters, for example. But for some "bizarre" reason, those were not included in Unicode...
Of course, the simple answer is that Unicode actually includes any character that someone cares enough to ask to be added, with rare exceptions.
While the origin of 彁 will never be certain, there is a good chance that it came from a misinterpretation of 彊 [1]. Why is this not an accepted theory, though? Because it is still possible that 彁 did appear in some reference source used for the standardization, and neither that source nor a source where 彊 looks like 彁 has been found.
idk. When the word Planet was redefined such that Pluto was no longer a planet, it kind of ruined the word Planet. It suddenly wasn’t nearly as useful a word as it used to be (even though now it has a precise meaning). For most people that use the word, it won’t matter (and is actually rather exciting) that they keep discovering new planets in our solar system.
If they treated the word “character” the same way, it would only serve to confuse and do no favors to the remaining glyphs.
This is temporary though, soon people will look at you funny if you say that Pluto is a planet - and/or they might not even have heard of it (though of course that is still worth learning about in a History of Science context).
We do NOT keep discovering new planets, rather minor planets (I agree that the term is confusing), more than a million of them discovered in the Solar System now, like the 9007 James Bond.
It could go either way, it is not always that the scientific meaning wins out, especially not when even scientists don’t find the new definition useful.
When I think of a planet, I think of a world that has active geology that isn’t a moon (I know excluding moons is arbitrary, and perhaps I shouldn’t do that; but hey, that’s language for you). I honestly don’t care about the orbit, and I bet that when most people think about planets they aren’t thinking about the orbit either, let alone whether the planet has cleared the orbit or not. I doubt that will change.
Not just that, but whether or not Mars is still geologically active is still an open question. If you admit planets on the basis that they have a history of geological activity, then Ceres is a planet too.
I don’t think anybody considers geological activity as particularly useful for classifying things as ‘planet’ or ‘not planet’.
Why shouldn’t Ceres be a planet? If Pluto gets to be a planet then Ceres is definitely a planet.
But there is still active geology on Mars. There is still moisture, winds and ice-caps that are shaping the environment. I consider that to be geologically active.
EDIT: And there are actual experts which consider active geology (or something similar) to be a planet, including Anton Petrov (https://www.youtube.com/watch?v=8-2HxrgqUnM)
Okay, but then you have to go and figure out which other asteroid and kuiper belt objects are planets.
The 'dwarf planet' distinction helps solve this! There are planets - distinctive in that they have clear orbits - and there are dwarf planets, which can be part of belt systems. This is a useful distinction.
Sure it is, but the distinction between terrestrial planets and gas giants are also useful, that doesn’t mean the latter aren’t planets.
I think it is fine that there are more planets than we can meaningfully count. Loads of things in our language act like that. E.g. a bug can be any number of things, and you know what a bug is by just talking about it. If some insect society then comes up with a meaningful definition of bugs which excludes spiders, that definition isn’t really doing the average user of that word any favor.
Yeah, probably strictly... But I’m not a planetary scientist. I’m merely a user of language, and I don’t need to be rigorous in my definitions. And to me the weather patterns on Jupiter are an interesting enough feature to count as geology (even though they are probably not strictly geology).
Theoretically, UTF-8 can encode up to 31 bits (U+7FFF'FFFF)[0], but for compatibility with UTF-16's surrogates, it's officially capped to 21 bits with the max being U+10'FFFF[1]. That decision was made November 2003, so there's two decades of software written with hard caps of U+10'FFFF.
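To spell out the arithmetic behind that cap (this just re-derives the figures already quoted above):

```python
# Code points run U+0000..U+10FFFF, minus the UTF-16 surrogate range U+D800..U+DFFF.
print(0x110000 - 0x800)      # 1112064 usable scalar values

print(chr(0x10FFFF))         # highest code point Python will hand out
try:
    "\ud800".encode("utf-8") # lone surrogates are not encodable
except UnicodeEncodeError as e:
    print("surrogate rejected:", e.reason)
```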
Yes, but this is a change either way, because that codepoint's definition referred to that character. Either the reference or the description of the appearance has to change.
Make a new character. Updating the existing character ruins the meaning of all previous usages.
It's like trying to change an API. Don't disrespect your existing users. Make a new version.
(ꙮ ͜ʖꙮ)
Think of all the ASCII art this botches. That has to have some historical importance to the Unicode standards body.
(⌐ꙮ_ꙮ)
For scholarly digital (unprinted) documents where the correct character rendering matters, erroneous past usages can be trivially found with grep and a date search, and easily corrected. The domain experts will familiarize themselves with this issue and fix the problem. Don't take a shotgun to it!
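As a rough illustration of that "grep and fix" workflow, a minimal Python sketch, assuming a hypothetical folder of UTF-8 transcriptions (the path is made up):

```python
# Find every document that uses U+A66E MULTIOCULAR O.
from pathlib import Path

MULTIOCULAR_O = "\ua66e"
for path in Path("transcriptions").rglob("*.txt"):
    if MULTIOCULAR_O in path.read_text(encoding="utf-8", errors="ignore"):
        print(path)
```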
This message wꙮn't have the ꙮriginally intended meaning if the characters are updated from underneath.
So the text at that point literally talks about ‘many-eyed seraphim’. The eyes symbol is a pure gag—seems to be spliced in place of the letter ‘о’ in the word ‘eye’ just a little down the line. (However, Old Slavonic is a tough read due to no spaces, so I'm not sure about that word. But at least it's not the Glagolitic script, which was just ridiculous and actually had multi-circle letters.)
I don't understand why this character needs to exist given that, at least according to the author, it has only been seen once in the wild, and it's semantically identical to another more widely used character.
I'm glad I'm not responsible for unicode. Clearly I have the wrong mindset for it.
It certainly made sense to include this package in Unicode, and the vast majority of those characters certainly should be in this proposal. You do have to draw the line somewhere, and obviously those close to the line will be debatable, no matter where you choose to draw it, like this particular symbol - but once you've decided that you will include the one-eyed O (small and capital) and the two-eyed O (small and capital), then putting in the many-eyed O as well to complete the set doesn't seem so far-fetched.
Surprisingly many characters in Unicode are only recorded a few times, if not once, before the assignment. Chinese characters for example have a lot of them, because it was relatively common to make a new character for newborns before the modern era, and some of them have survived in literature but otherwise see no use (e.g. 𡸫 U+21E2B only appears once in the Records of the Three Kingdoms 三國志). But they have still received code points because they are considered essential for digitization of historical works, and multiocular O is no different.
I didn't realize that digitization of all historical works was the goal of unicode. There's plenty of space for everything. And only a few fonts out there aim for complete coverage, like noto.
I just don't have the personal fortitude to attempt something so grandiose. Seems like a fool's errand.
Also, keep in mind there's not just one multiocular O. There's a bunch with varying numbers of eyes.
Not every goal needs to be something you can accomplish in a day or a year or even one lifetime.
There are not quite 8000 spoken languages on Earth at the moment, and a lot of them are from cultures that never invented writing. SIL has sent a missionary to most of them to learn the language, invent a writing system for it, teach it to them, and translate the New Testament into it. Most of those are fairly standard alphabets using characters from the Latin scripts, plus perhaps a few new characters or new combinations of character and diacritical. The task is large, but finite.
Imagine you’re a historian from the future studying some old document, and you spot a weird character that you’ve never seen before. Wouldn’t it be useful to be able to search for that character to see if it shows up in any other document? A simple OCR scan will bring up all the information you could ever need for that one weird symbol.
I’m not sure how I feel about this. I’m not an expert by any means.
But something just doesn’t feel right when you’ve got unicode with a character with one known use from forever ago.
Doesn’t this open up the flood gates to just a ridiculous amount of work or else biased gatekeeping?
How much work would it be to implement your own font of the entire unicode set? Or is that not actually a thing and fonts implement as-desired subsets?
There are quite a few such characters in Unicode because academic articles about things like cuneiform need to be digitized too. And because the historical record is so sparse, we often have vanishingly few, or only one example of a character, and perhaps no way to know if it was a misprint or a real character.
Actually this character seems like a scribe's joke, no different from the illustrated characters at the beginning of medieval paragraphs (all of which are represented in Unicode as A, B or whatever). But the point still holds.
It's not just the articles, it's digitization of the texts themselves and email conversations. Using characters offers the opportunity to do computational textual analysis (this allows you to do substitutions first, by replacing this character with 'o' -- much harder on a bunch of tiny images).
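A small Python sketch of the substitution step mentioned above, folding the ocular O variants back to a plain Cyrillic о before analysis (the variant-to-o mapping is my own illustration, not something specified above):

```python
# Fold the ocular O variants back to Cyrillic о for textual analysis.
OCULAR_TO_O = str.maketrans({
    "\ua669": "\u043e",  # ꙩ monocular o
    "\ua66b": "\u043e",  # ꙫ binocular o
    "\ua66d": "\u043e",  # ꙭ double monocular o
    "\ua66e": "\u043e",  # ꙮ multiocular o
})

print("серафими многоꙮчитїй".translate(OCULAR_TO_O))
# -> серафими многоочитїй
```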
Plus there's no shortage of space in the Unicode address space.
> How much work would it be to implement your own font of the entire unicode set? Or is that not actually a thing and fonts implement as-desired subsets?
You can't, and you are not expected to do so. You are limited by OpenType limit (65,535 glyphs), various shaping rules that possibly increase the number of required glyphs, and lack of local or historical typographic convention. Your best bet is either to recruit a large number of experts (e.g. Google Noto fonts) or to significantly sacrifice quality (e.g. GNU Unifont).
A single OpenType font file is limited to 65,535 glyphs. Nothing stops your font from being implemented as a series of .otf files (besides what people think of as a "font" when it comes to usage on computers).
But yes, time constraints are the limiting factor. I don't think anyone is going to dedicate their entire life to making a single font.
While you are right that one logical font can consist of multiple font files (or possibly a OpenType collection), this constraint does affect most typical fonts, and in particular wide-coverage CJK fonts already hit this limit. Fonts supporting only one of Chinese, Japanese and Korean don't need that many glyphs, and probably even two of them will be okay, but fonts with all three sets of glyphs won't. It is therefore common to provide three versions of fonts, all differently named.
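If you want to see how close a given font file sits to that ceiling, a quick check with the third-party fontTools library works; the file name below is only an example:

```python
# Report glyph count against the 65,535-glyph OpenType limit.
from fontTools.ttLib import TTFont

font = TTFont("NotoSansCJKjp-Regular.otf")
print(font["maxp"].numGlyphs, "of 65535 glyph slots used")
```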
You could also go the shady route and just make a font out of all the "reference character sheets" that the Unicode site has. Probably not legal and the result would not be pleasant to read, but that's one way to create a font containing all of Unicode.
I love this character and I love the fact that it is being updated. Just to get this right: at some point some person chose to doodle the letter instead of writing it the correct way and now we have a corresponding Unicode character? Sort of amazing and it also makes you think ...
There was a... "tradition" is a strong word, perhaps "trend" is better. Authors making copies of the Bible or related works in Cyrillic, that the letter O (equivalent to Roman O) at the beginning of the word for "eye" would be stylized to look like an eye. There are a variety of glyphs along these lines: Ꙩ, Ꙫ, Ꙭ. All of them, including ꙮ, were added to Unicode as a single group.
The glyph "ꙮ" was used to refer to an Angel with a whole buncha eyeballs, as one does. In terms of texts that survive today, this specific glyph has exactly one use in a single manuscript from the 1400's. It might have been used more, in texts which don't survive. But it is part of a larger trend, and I bet that its inclusion in Unicode depends strongly on that.
But yeah, in itself the ꙮ character exists solely so that modern computers are capable of a more-faithful rendition of the transcription of a single handwritten copy of the Book of Psalms.
Thank you for describing the missing context. I couldn't understand why this stylized letter deserved a code point more than the uncountable others. I don't necessarily agree still, but the fact that this character sits within a larger trend rather than being a one-off makes it much more reasonable.
> modern computers are capable of a more-faithful rendition of the transcription of a single handwritten copy of the Book of Psalms.
I wonder if there is even a copy of the book transcribed to actual characters or if it only exists as scanned PDF copies? If anyone did transcribe it, would they have any knowledge that the ꙮ character even exists on computers?
The Bible doesn't specify how many eyes seraphim have.
"In the center, around the throne, were four living creatures, and they were covered with eyes, in front and in back. ... Each of the four living creatures had six wings and was covered with eyes all around, even under its wings."
I attended a Unicode meeting (or maybe two? not sure?) and came away with the impression that Unicode is like those open source projects that are used by half of the world and maintained by a handful of skilled and benevolent people.
In Unicode's case I think most of them are paid, at least.
That is what I understood too. It doesn’t seem particularly hard to add new letters to Unicode too if you try a bit.
However that is a bit harder with emojis, which have their own subcommittee, which seems to be more bureaucratic and also more popular than the rest of Unicode. Everyone wants to make a new emoji.
It does raise interesting questions about what counts as decoration/formatting and what counts as part of the actual text. You could view these ocular O characters as purely decorative (like the fancy first character in a paragraph) but they could also be seen as a quirk of spelling which should be represented in unicode.
But the multiocular O really does seem like one monk got bored one time and did some doodling.
This is not exactly a correct description. Unicode does not specify the appearance of characters, only their meaning. It seems what’s changed is the reference presentation of the character in the Unicode tables, not the character itself. Unicode goes to great lengths to preserve backwards compatibility so changing the meaning of a code point would violate that principle. Your OS or application providing Unicode 15.0.0 support will not change the appearance of U+A66E. The appearance is dependent on the font.
There was a joke that U+A66E should retain seven eyes and further eyes should be added with a ZWJ sequence [1]. If that character had somehow become very popular in modern texts, updating its glyph might have caused an interoperability problem, so such a solution would have been needed. But that didn't happen, so the glyph itself has been updated instead.
If you open the proposal [0] it kinda just looks like someone doodled some flowers on the text rather than actually used a particular letter. And given it's the ONLY existing record of this letter, it's very suspect isn't it?
my Old Church Slavonic is pretty rusty (well, nonexistent), but "mnogo" looks like modern Russian много (many), and the -imi I guess would be instrumental plural like -ими? but Russian for "eye" is глаз or око. I'm guessing oč -> око, and it's a compound word? or is the č an infix, something like "ogo" is eye, and mnogoočimi is such because the two -og-s (one from mnog and the other from go) fuse because "mnogoögočimi" would be awkward to pronounce?
I feel like the spelling should be updated to Behꙮlders, or better yet, BehꙨꙮlders, to reflect that (of course, this would only make sense once the glyph update actually hits).
> written in an extinct language, Old Church Slavonic
It’s absolutely not extinct and is used by the Eastern Orthodox Church in their religious texts almost exclusively. It’s taught to children alongside their Sunday school curriculum and, of course, in seminaries.
Generally languages with only liturgical usage are not considered “living” languages, just as the Latin of the Catholic Church is still considered a “dead” language.
Unicode can be ridiculous at times. It contains a character used once in a single manuscript in an extinct language, but not a standardized glyph for an external URL link.
This kind of stupid thing is my problem with Unicode. We have all this baggage for stuff that nobody uses, and we need to deal with it forever. The worst for me is that there is no possible way to encode a grapheme cluster as a constant size, so using Unicode makes it impossible to have simple character access like an old-style C string, no matter how big you make your char, even though it's totally possible with damn near every language that people actually use.
So then we all end up paying this massive complexity tax everywhere to pay for support for some Mongolian script that died out 200 years ago (or multi codepoint encodings of simple things like é - just why, it was so avoidable).
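For the é case specifically, a short Python illustration of why a fixed-size "char" can't capture a grapheme cluster:

```python
# The same visible é can be one code point (precomposed) or two
# (e + combining acute), so "one char, one slot" breaks down.
import unicodedata

nfc = "\u00e9"    # é, precomposed
nfd = "e\u0301"   # e + COMBINING ACUTE ACCENT
print(len(nfc), len(nfd))                         # 1 2
print(nfc == nfd)                                 # False
print(unicodedata.normalize("NFC", nfd) == nfc)   # True
```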
> encode a grapheme cluster as a constant size […] totally possible with damn near every language that people actually use
This is not true. For a concrete example: the languages Hindi and Marathi, with ~500 million speakers, use the Devanagari script (also used by Nepali and Sanskrit), in which a grapheme cluster is (usually) a sequence of consonants followed by a vowel. For instance, something like "bhuktvā" (भुक्त्वा) would be two grapheme clusters, one (भु) for "bhu" and one (क्त्वा) for "ktvā". In Unicode each vowel and consonant (here, bh, u, k, t, v, ā) is separately encoded, which is the only reasonable thing to do, and inevitably means that grapheme clusters can have different lengths (number of code points). The alternative would have been to encode every possible (sequence of consonants + vowel) as a single codepoint, which gets ridiculous quickly: these sequences can be up to 5 consonants long, so you'd end up having to encode (33^5 * 13 ≈ 500M) codepoints for Devanagari alone (or completely prevent certain sequences of consonants from being expressed, which makes no sense either), not to mention that most of the scripts of the Indian subcontinent and south-east Asia follow the same principle and have similar issues (e.g. Bengali with 250M speakers, Telugu, Javanese, Punjabi, Kannada, Gujarati, Thai with over 50M speakers each, etc).
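To make that example concrete, here are the code points behind those two clusters (just listing what the comment above already describes):

```python
# भुक्त्वा = भु ("bhu") + क्त्वा ("ktvā"), eight code points across two clusters.
import unicodedata

bhuktva = "\u092d\u0941\u0915\u094d\u0924\u094d\u0935\u093e"
for ch in bhuktva:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
```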
Have you ever written software before Unicode? We had N different encodings for each language, each culture, each country. There were all kinds of bugs creeping up, and software that works perfectly well could be buggy for one random language. Unicode abstracted all of this away from the programmer in a pretty simple fashion. I simply do not see how we're paying the "complexity tax" by using Unicode, unless you're writing a library that handles Unicode (which you shouldn't do, you should use existing libraries) you don't need to know anything about Unicode.
Before Unicode, everyone who came up with a character encoding scheme probably thought their system was good enough for any reasonable use-case. But they all had limitations that made them inadequate for things less obscure than representing some dead Mongolian language.
It would be nice if we could come up with some magical system that optimally encodes all the text that "matters" and ignores everything else, but history has shown that to be very hard. So we're left with Unicode, which takes the approach of giving us (effectively) infinite code points to represent characters, with (effectively) infinite ways to visually represent them. That does lead to a bunch of "unnecessary" baggage and headaches, but it also solves a bunch of real problems that you probably don't know exist.
Unicode is a pain in the ass, but it's a solution to a very hard problem. You can feel free to design your own solution, but you'll probably run head-first into all the problems Unicode was trying to solve from 40 years ago.
I'm getting the impression that this is only "obvious" from a latin-cyrillic-greek alphabet point of view ?
P.S.: Also, even for those, it would seem that one of the big reasons things like combining characters were added to Unicode was to be backwards compatible even with mutually incompatible encodings?
Your notion of character doesn't necessarily match others, and there are many cases where the number of possible "characters" in some notion is unbounded. Unicode provides a very well-defined superset of those notions for you. Collecting characters is only a minor portion of their jobs.
Am I alone in thinking that this is not so much a separate character, as a doodle a bored monk made to relieve a tiny bit of the tedium of copying manuscripts?
I was astonished not to see this mentioned at all when I saw the post earlier! Almost commented about it myself but I wanted to think about something else.
Use a font that contains the previous glyph. This is just an update to the reference glyph, and there is nothing preventing you from using a font that has an upside-down A in the place of U+0041.
There's an emoji for handgun, but Apple and other big tech decided it needed to be a water gun. There is also a rifle character intended to represent the sport of shooting in a pentathlon, but again Apple threw its weight around and, while the character became codified in Unicode, it never became an emoji and no font from big tech supports it.
I guess because the goal of Unicode is to be able to represent every character that's appeared in language. This one is in a published book, while guns and a sexual intercourse symbol aren't.
Emoji was a weird value add that Japanese mobile providers added to their phones before Unicode. To get them to move to Unicode, they had to keep them. That's why there's a Tokyo Tower emoji, but not an Eiffel Tower. That's why the post office has a 〒 on it. That people get any use out of emoji outside of Japan is really pure luck.
I've even heard emoji referred to as "the carrot that keeps the implementations current." Every time a new version of Unicode is published, a few more emoji are tacked on. It acts as incentive for all the cellphone carriers and such to put the money into updating their implementations, because nobody wants to be the one on the block with the one phone that can't render "Mirror Ball".
Incidentally, Windows doesn't have the mirror ball. I guess it is a carrot to get me to upgrade to Windows 11, which I am skipping. (The key with Windows is to only use the good versions; XP, 7, 10, ???. Hoping ??? arrives soon ;)
That seems actually logical when you consider that kanji presumably began as simple depictions of objects that could be drawn quickly. Perhaps the only difference between emoji and kanji is time.
There's a career path to get there. It involves becoming someone who cares deeply about the ways and means of digitizing data stored in analog media. Drill down deep enough, and you'll find yourself in a fascinating world of encoding errors.
There are things like the "ghost characters," which are codepoints in Japanese that map to characters that were basically transcription errors when the team was putting together a full set of Kanji. Some characters with an extra horizontal line snuck into the set; they were likely caused by a transcription error because the character got split onto two pieces of paper by lines of text being copy-pasted into a records book, and the shadow cast by the thin extra layer of paper was misinterpreted as another stroke.
And then people wonder why software developers don't care to support Unicode properly.
The first 60,000+ characters made sense, then a few more were needed, Unicode suddenly got to play with 1,000,000+ code points, and it just went off the rails.