I have been working on jieba-rs for the past few weeks and traced the code back to the original Python implementation. I found a wrong assumption in the code: it assumes that the Unicode scalar values of Chinese characters only range from U+4E00 to U+9FD5.
The same kind of mistake is made on Chinese websites, where developers don’t put in even the minimum effort to understand the Unicode standard and therefore assume that CJK characters live only in the BMP. They don’t know that, as of Unicode 12.0, a total of 87,887 CJK Unified Ideographs have been defined. There are characters defined in Extension A through Extension F, with Extension G planned. What falls in the BMP is mainly the result of Han unification.
To correctly define the CJK ranges as of Unicode 12, you have to cover at least the following ranges.
- U+3400…U+4DBF (Extension A)
- U+4E00…U+9FFF (CJK Unified Ideographs, the main BMP block)
- U+F900…U+FAFF (Compatibility Ideographs)
- U+20000…U+2A6DF (Extension B)
- U+2A700…U+2B73F (Extension C)
- U+2B740…U+2B81F (Extension D)
- U+2B820…U+2CEAF (Extension E)
- U+2CEB0…U+2EBEF (Extension F)
- U+2F800…U+2FA1F (Compatibility Supplement)
Covering these ranges handles such test cases, so that your logic is built on the correct foundation.
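The ranges above can be turned into a small membership check. The following is only a minimal sketch: `CJK_RANGES` and `is_cjk_ideograph` are hypothetical names chosen for illustration, not part of jieba-rs, and the table uses whole block ranges, so it also accepts the few code points inside those blocks that are still unassigned.

```rust
/// Inclusive code point ranges from the list above (as of Unicode 12).
/// Hypothetical table for illustration; not taken from jieba-rs.
const CJK_RANGES: [(u32, u32); 9] = [
    (0x3400, 0x4DBF),   // Extension A
    (0x4E00, 0x9FFF),   // CJK Unified Ideographs (BMP)
    (0xF900, 0xFAFF),   // Compatibility Ideographs
    (0x20000, 0x2A6DF), // Extension B
    (0x2A700, 0x2B73F), // Extension C
    (0x2B740, 0x2B81F), // Extension D
    (0x2B820, 0x2CEAF), // Extension E
    (0x2CEB0, 0x2EBEF), // Extension F
    (0x2F800, 0x2FA1F), // Compatibility Supplement
];

/// Returns true if `c` falls inside any of the ranges listed above.
pub fn is_cjk_ideograph(c: char) -> bool {
    let cp = c as u32;
    CJK_RANGES.iter().any(|&(lo, hi)| lo <= cp && cp <= hi)
}
```

A check limited to U+4E00…U+9FD5 would reject a character such as U+20000 (the first code point of Extension B), while `is_cjk_ideograph('\u{20000}')` accepts it. With only nine ranges, a linear scan is likely fine; a binary search over a sorted table would be an alternative if the list grows.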
Not only am I leaving Medium, I am also switching my default browser from Chrome to Firefox, out of concern that Google is creeping in and intruding on my privacy and trust, after reading the news 1 and 2.
It’s not the first time I have tried to switch my main browser to Firefox. The last time was when the first version of Quantum was released. I heard it shipped with the CSS engine from Servo, and I couldn’t help but download it and try it out. However, it wasn’t a successful attempt. There were bugs and issues I ran into, and apart from that, all of the extensions needed to be migrated since the APIs and architecture were totally different. At the time of the release, not many good extensions had been migrated yet. But now most of my commonly used extensions are supported.
Here is the list of the extensions I’ve installed so far.
- Facebook Containers
- Multi-Account Containers
- Firefox Lockwise
- IG Helper
- Neat URL
I found that Facebook Containers is extremely useful; it works pretty much like a sandboxed account on Android. Neat URL is also convenient in that I don’t have to manually remove tracking parameters when I copy and paste a URL to others.
It doesn’t work perfectly, though. I’ve run into the following situations.
- FB Messenger is not able to load; I’m not sure whether it is because I set the content blocking rules too strict.
- Google Drive often hits high CPU usage when folders contain many items.
- Random high CPU spikes when certain web pages are open.
It is not without issues, but it is good enough that I am happy to set it as my main browser and get rid of the sneaky Google.