UTF-8 in ICU
Trying to understand the UTF-8 page in ICU. It seems to be their attempt to remove the overhead of UTF-16 conversions when the base character set is already UTF-8? The U_CHARSET_IS_UTF8
flag is a little difficult to track in the ICU code: while it is recommended for some platforms in the Recommended Build document, the links go to the wrong files. It is defined in platform.h, not in utypes.h ...
The bit I'm having trouble with is its link to UCONFIG_NO_CONVERSION, which would seem to disable any conversion filter; but we still want to convert into and export from UTF-8 when dealing with the outside world, so I don't see why that is appropriate? Then again, it is the input and output processes that should handle that, and it is only needed when a non-UTF-8 connection is required. The prevalence of UTF-8 these days would suggest that converting legacy material to UTF-8 may be a sensible process anyway?
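If I'm reading the headers right, the split is less drastic than it first looks: UCONFIG_NO_CONVERSION only strips out the legacy codepage converter library, while the hardwired UTF-8 <-> UTF-16 routines remain available. A minimal sketch of what should still build and run (the build flags in the comment are my assumption of the recommended setup):

    // Sketch: UTF-8 round-tripping that should survive UCONFIG_NO_CONVERSION.
    // Assumed build flags: -DU_CHARSET_IS_UTF8=1 -DUCONFIG_NO_CONVERSION=1
    #include <iostream>
    #include <string>
    #include <unicode/unistr.h>

    int main() {
        std::string utf8In = "na\xC3\xAFve caf\xC3\xA9";  // UTF-8 from the outside world

        // fromUTF8/toUTF8String are hardwired rather than converter-based,
        // so they stay available when the converter library is compiled out.
        icu::UnicodeString u = icu::UnicodeString::fromUTF8(utf8In);

        std::string utf8Out;
        u.toUTF8String(utf8Out);  // back out to UTF-8

        std::cout << utf8Out << "\n";
        return 0;
    }

What goes away is conversion to and from legacy codepages (Latin-1, Shift-JIS and friends), which fits the view that non-UTF-8 connections are the only place converters are really needed.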
It would seem that code point based activity is best carried out on platforms where U_WCHAR_IS_UTF32 applies, by using UTF-32 'strings' when looking at any character based processing. This is how I've been viewing 'character' based string handling anyway, rather than introducing the problems UTF-16 seems to create here, but I'm not sure what happens on Windows platforms: UTF-16 is ICU's default internal encoding, which matches the Windows API. Since the code points of a UTF-32 view only require 21 bits, the other bits could be used for additional flags, and I have been thinking that such things as islower, isupper, isaccent, iscontrol and the like could be flagged in the fourth byte for quicker access to tests like 'all upper' or 'all text', and this may be modified for different collations.
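As a sketch of that fourth-byte idea (the flag layout is my own invention, not an ICU facility; only the u_is* property calls are real ICU API):

    // Sketch: cache classification bits in the spare high byte of each
    // UTF-32 unit, since code points need only the low 21 bits.
    // The flag layout is invented for illustration, not an ICU facility.
    #include <cstdint>
    #include <unicode/uchar.h>

    constexpr uint32_t FLAG_UPPER     = 1u << 24;
    constexpr uint32_t FLAG_LOWER     = 1u << 25;
    constexpr uint32_t FLAG_CONTROL   = 1u << 26;
    constexpr uint32_t CODEPOINT_MASK = 0x001FFFFFu;  // low 21 bits

    uint32_t tagCodePoint(UChar32 c) {
        uint32_t v = static_cast<uint32_t>(c) & CODEPOINT_MASK;
        if (u_isupper(c)) v |= FLAG_UPPER;    // real ICU property lookups
        if (u_islower(c)) v |= FLAG_LOWER;
        if (u_iscntrl(c)) v |= FLAG_CONTROL;
        return v;
    }

    // A test like 'all upper' then reduces to AND-ing flags,
    // with no per-character property lookups at query time.
    bool allUpper(const uint32_t* s, int32_t n) {
        for (int32_t i = 0; i < n; ++i)
            if (!(s[i] & FLAG_UPPER)) return false;
        return true;
    }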
My simplistic view is that there are basically three string lengths (a sketch measuring all three follows the list) ...
1/ Number of bytes for the buffer
2/ Number of code points (characters plus controls and embellishments?)
3/ Number of glyphs (with the option to display or hide control codes, as in ASCII)
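A rough sketch measuring all three on one string with ICU4C; I'm using the grapheme (user-perceived character) count as the nearest thing to 'glyphs' that ICU exposes, via its character break iterator:

    // Sketch: the three string lengths for one string.
    #include <iostream>
    #include <memory>
    #include <string>
    #include <unicode/brkiter.h>
    #include <unicode/unistr.h>

    int main() {
        // 'e' + combining acute accent + '!' (the decomposed form of "é!")
        std::string utf8 = "e\xCC\x81!";
        icu::UnicodeString u = icu::UnicodeString::fromUTF8(utf8);

        // 1/ bytes in the UTF-8 buffer
        std::cout << "bytes:       " << utf8.size() << "\n";      // 4

        // 2/ code points (base letters, combining marks and controls all count)
        std::cout << "code points: " << u.countChar32() << "\n";  // 3

        // 3/ user-perceived characters, via the grapheme break iterator
        UErrorCode status = U_ZERO_ERROR;
        std::unique_ptr<icu::BreakIterator> bi(
            icu::BreakIterator::createCharacterInstance(icu::Locale::getRoot(), status));
        if (U_FAILURE(status)) return 1;
        bi->setText(u);
        int32_t graphemes = 0;
        for (int32_t p = bi->first(); p != icu::BreakIterator::DONE; p = bi->next())
            if (p > 0) ++graphemes;  // count boundaries after the start
        std::cout << "graphemes:   " << graphemes << "\n";        // 2

        return 0;
    }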
Interestingly, ICU is still limited to 32-bit integers, or rather 2 GB string lengths!
But this has now been confused by the introduction of NFD/NFC/NFKC/NFKD, which will vary all of the above in some cases? Being somewhat linguistically challenged, while I understand the concepts such as accents, would standardising on say NFD help with actions like lower/upper conversion, or does accenting a character sometimes change its alphabetical order, so that collations need the NFC form to sort by?
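A quick sketch of the effect on the counts, assuming ICU's Normalizer2 API (the byte count changes too; the grapheme count does not):

    // Sketch: NFC vs NFD changes the code-point count for the same text.
    #include <iostream>
    #include <unicode/normalizer2.h>
    #include <unicode/unistr.h>

    int main() {
        UErrorCode status = U_ZERO_ERROR;
        const icu::Normalizer2* nfc = icu::Normalizer2::getNFCInstance(status);
        const icu::Normalizer2* nfd = icu::Normalizer2::getNFDInstance(status);
        if (U_FAILURE(status)) return 1;

        icu::UnicodeString s = icu::UnicodeString::fromUTF8("\xC3\xA9");  // é

        icu::UnicodeString c = nfc->normalize(s, status);  // U+00E9
        icu::UnicodeString d = nfd->normalize(s, status);  // U+0065 U+0301

        std::cout << "NFC code points: " << c.countChar32() << "\n";  // 1
        std::cout << "NFD code points: " << d.countChar32() << "\n";  // 2
        return 0;
    }

On the collation question, my understanding is that ICU's collators are meant to give the same answer for canonically equivalent input, so the choice of NFC or NFD should not change sort order, only the counts above.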
I think what is clear is that while there may be a single 'UTF-8' writing standard, sorting collations are even more diverse than the previous code sets were? Firebird has always managed COLLATION as a separate filter to CHARACTER SET, and allows individual fields to have their own COLLATION, so we can index on different languages within the one table. I'm thinking that this may be required when adding sorting in a UTF-8 based setup? Rather than specifying 'encoding', one simply specifies 'collation' where it varies from the basic rules?
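That maps quite naturally onto ICU, where a collator is created per locale, independently of the encoding, much like Firebird's per-field COLLATION. A sketch (the German/Swedish pairing is just an illustration of the same data sorting differently):

    // Sketch: the same UTF-8 data under two collations.
    #include <iostream>
    #include <memory>
    #include <unicode/coll.h>
    #include <unicode/unistr.h>

    int compareWith(const char* localeName, const icu::UnicodeString& a,
                    const icu::UnicodeString& b) {
        UErrorCode status = U_ZERO_ERROR;
        std::unique_ptr<icu::Collator> coll(
            icu::Collator::createInstance(icu::Locale(localeName), status));
        if (U_FAILURE(status)) return 0;
        return coll->compare(a, b, status);  // UCOL_LESS / EQUAL / GREATER
    }

    int main() {
        icu::UnicodeString a = icu::UnicodeString::fromUTF8("\xC3\xB6");  // ö
        icu::UnicodeString b = icu::UnicodeString::fromUTF8("z");

        // German treats ö as a variant of o, so it sorts before z;
        // Swedish places ö after z at the end of the alphabet.
        std::cout << "de: " << compareWith("de", a, b) << "\n";  // negative
        std::cout << "sv: " << compareWith("sv", a, b) << "\n";  // positive
        return 0;
    }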