Delving into the 7th level of Unicode Hell (Ruby program)

When the three brothers defeated the Titans and had the whole cosmos to divide among themselves, it was the oldest, ᾍδης, who got the Underworld: a kingdom into which I seem to be quickly descending on my innocent trip down Ruby programming. I set out to write a little program that would reverse text like ᾍδης, which would be particularly useful for Hebrew and Arabic. But then I found myself entering...

A Glance into the Unicode Underworld
Charon ferried me across while I was sleeping, because suddenly my little program is facing a huge issue: Ruby doesn't fully support Unicode. As I searched deeper into this Unicode world I found I was in bigger trouble than I originally thought. ANSI, the common encoding of Ruby, is an 8-bit encoding. So I thought, well, Unicode must be a 16-bit encoding, more bits more letters -- problem solved! Just find a way to split off two 8-bit chunks at a time and stack them on a new line and voilà! We have the text reversed. I even found a way of doing that. If you have a string, you can pull characters out of it in order by feeding it a position number and how many characters you want: "string"[position_number, number_of_characters]. But alas...
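To make the slicing idea concrete, here is a minimal sketch, assuming Ruby 1.9 or later, where strings carry an encoding and "string"[position, count] counts characters rather than bytes. The byteslice call stands in for the byte-oriented slicing described above; the variable names are only for illustration.

```ruby
# The Greek word from the post, written with \u escapes so the encoding of
# the source file does not matter.
word = "\u1F8D\u03B4\u03B7\u03C2"

# Four characters, but nine bytes once encoded as UTF-8.
char_count = word.length      # counts characters in Ruby 1.9+
byte_count = word.bytesize    # counts raw bytes

# string[position, count] slices by character in Ruby 1.9+.
first_char = word[0, 1]

# byteslice(position, count) slices by byte. The first character here needs
# three bytes, so taking two leaves an incomplete, invalid sequence.
first_two_bytes = word.byteslice(0, 2)
broken = !first_two_bytes.valid_encoding?

puts char_count, byte_count, broken
```

Grabbing a fixed number of bytes only works when every character happens to take exactly that many bytes, which is not the case here.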

Deeper into the Unicode Mists
If you look at your browser you'll invariably see under the 'View' menu an encoding option for UTF-8. This is the most common encoding method for Unicode and has become the virtual default on the web. And this is the encoding I am using. And here is the rub: it's a multi-byte encoding. A character can be one to three bytes long, as my little test program shows (and up to four for characters outside the range I tested). So I can't just grab 2-byte chunks, because that will invariably result in gibberish. The program needs to read each character and see if it is a one-byte encoding like A, a two-byte encoding like Á, α (alpha), β (beta) or γ (gamma), or a three-byte encoding like ᾍ. Because this affects the positions of subsequent characters, it needs to read each byte, figure out whether it is part of the previous character, and split accordingly; otherwise it can start splitting midway through a character. And I thought this was going to be easy-peasie-Japanesie.
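The byte-by-byte reading described above can be sketched as follows. This is an illustration only, assuming Ruby 1.9 or later and valid UTF-8 input; utf8_sequence_length and split_utf8 are hypothetical helper names, not part of Ruby. The idea is that the first byte of each UTF-8 sequence says how many bytes the character occupies.

```ruby
# Number of bytes in a UTF-8 sequence, judged from its first byte.
# Returns nil for a continuation byte (10xxxxxx), which never starts a character.
def utf8_sequence_length(first_byte)
  if first_byte < 0x80
    1      # 0xxxxxxx: plain ASCII
  elsif first_byte < 0xC0
    nil    # 10xxxxxx: continuation byte
  elsif first_byte < 0xE0
    2      # 110xxxxx
  elsif first_byte < 0xF0
    3      # 1110xxxx
  else
    4      # 11110xxx
  end
end

# Hypothetical helper: split a UTF-8 string into characters by walking its bytes.
# Assumes the input is valid UTF-8 and does no error checking.
def split_utf8(string)
  bytes = string.bytes.to_a
  chars = []
  position = 0
  while position < bytes.length
    length = utf8_sequence_length(bytes[position])
    chars << bytes[position, length].pack("C*").force_encoding("UTF-8")
    position += length
  end
  chars
end

sample = "A\u03B1\u1F8D"   # a 1-byte, a 2-byte and a 3-byte character
pieces = split_utf8(sample)
sizes = pieces.map { |piece| piece.bytesize }
puts sizes.inspect
```

Once the string is split on real character boundaries, reversing it is just pieces.reverse.join.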

Cheating Death?
In reality, my confusion stems from the fact that before UTF-8 there was UCS-2, which was exclusively a 2-byte encoding. Unfortunately for me, UCS-2 was superseded by UTF-16, another variable-length encoding. However, the initial UCS-2 set is part of UTF-16, so as long as my file is UTF-16 encoded and contains no characters outside the UCS-2 subset, I should be able to cheat by telling the program to reverse the text in two-byte chunks. This is a little tricky because the program I was using to create the text samples, Notepad++, doesn't encode in UTF-16.
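The two-byte cheat might look something like this. It is a sketch under two assumptions: Ruby 1.9 or later (for String#encode), and text that stays inside the UCS-2 subset. The string is converted in memory rather than read from a UTF-16 file, and the variable names are illustrative.

```ruby
word = "\u1F8D\u03B4\u03B7\u03C2"

# Re-encode as UTF-16LE (little endian, no byte order mark). Every character
# in this word takes exactly two bytes in that encoding.
utf16_bytes = word.encode("UTF-16LE").bytes.to_a

# Cut the bytes into two-byte chunks, reverse the chunks, and glue them back.
chunks = utf16_bytes.each_slice(2).to_a
reversed_bytes = chunks.reverse.flatten

# Turn the bytes back into a string and convert to UTF-8 for display.
reversed = reversed_bytes.pack("C*").force_encoding("UTF-16LE").encode("UTF-8")
puts reversed
```

A character outside the UCS-2 subset is stored in UTF-16 as two 2-byte units (a surrogate pair), and reversing the chunks would swap those units and corrupt the character, which is why this remains a cheat rather than a general solution.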

Send a Hero
I went to LA's Ruby Group to ask for advice on this, and I got a workaround that solved the issue. Using scan, which turns a string into an array of the pieces that match a pattern, I can use a pattern that matches one Unicode character at a time. That way I go from a string to an array or list made up of elements, each of which is a single Unicode character. This worked only with UTF-8 (not with UTF-16), but it worked flawlessly, allowing me to simply reverse the order of the array. At least for now I can be free of the Chthonic world of Unicode.
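The post does not show the exact pattern the group suggested, so the following is only a guess at its shape: a minimal sketch assuming Ruby 1.9 or later and a UTF-8 string. As far as I know, older byte-oriented Ruby needed the u flag on the pattern (/./mu) to make it UTF-8 aware.

```ruby
word = "\u1F8D\u03B4\u03B7\u03C2"

# /./m matches any single character, newlines included. On a UTF-8 string in
# Ruby 1.9+ each match is a whole character, however many bytes it takes.
characters = word.scan(/./m)

# Reverse the array of characters and join it back into a string.
reversed = characters.reverse.join

# In Ruby 1.9+ the built-in String#reverse also works character by character.
same_as_builtin = (reversed == word.reverse)

puts reversed
```

Note that this reverses code points one at a time, so text that uses separate combining marks would need extra care.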
