codePointAt utility. I see it differs slightly from the polyfill outlined on the MDN docs (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/codePointAt). Can you think of any reason why they wouldn’t also consider whether the indexed character is a low surrogate? My sense is that your implementation is correct, but I want to sanity check.
codepointat be only forward-looking is for use during iteration. For example, if I iterate over a string while looking both forward and backward, I would have to account for repeated surrogate pairs (i.e., every surrogate pair would be returned twice). If the algorithm is instead only forward-looking, then I can check for a low surrogate at each iteration and, if needed, skip ahead to the next code point.
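A minimal sketch of that forward-looking iteration, assuming the built-in String.prototype.codePointAt (which never looks backward). The helper names isHighSurrogate, isLowSurrogate, and codePoints are hypothetical and only for illustration; they are not the stdlib implementation.

```javascript
// Hypothetical helpers: classify a UTF-16 code unit.
function isHighSurrogate(cu) {
  return cu >= 0xD800 && cu <= 0xDBFF;
}
function isLowSurrogate(cu) {
  return cu >= 0xDC00 && cu <= 0xDFFF;
}

// Collect code points by visiting every index and skipping any low
// surrogate that completes a pair started at the previous index.
function codePoints(str) {
  const out = [];
  for (let i = 0; i < str.length; i++) {
    const cu = str.charCodeAt(i);
    if (isLowSurrogate(cu) && i > 0 && isHighSurrogate(str.charCodeAt(i - 1))) {
      continue; // already returned as part of the pair at i-1
    }
    out.push(str.codePointAt(i)); // forward-looking only
  }
  return out;
}

// 'a', U+1F600 (a surrogate pair), 'b' yields three code points, not four.
console.log(codePoints('a\uD83D\uDE00b'));
```

Because the lookup at each index only looks forward, each surrogate pair is returned once, and an unpaired low surrogate still comes back as itself.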
This was discussed back in 2012, and the outcome was to simply return the low surrogate. Not clear why: https://esdiscuss.org/topic/march-24-meeting-notes#content-24
This is probably because one can always get the high surrogate using charCodeAt
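To make that concrete, here is a rough sketch of how a caller could recover the full code point when the index lands on a low surrogate, using charCodeAt to read the preceding high surrogate. The function name codePointFromLow is hypothetical and not part of any library.

```javascript
// Hypothetical helper: if idx points at the low half of a surrogate pair,
// look back one code unit with charCodeAt and combine the two halves.
function codePointFromLow(str, idx) {
  const low = str.charCodeAt(idx);
  if (idx > 0 && low >= 0xDC00 && low <= 0xDFFF) {
    const high = str.charCodeAt(idx - 1);
    if (high >= 0xD800 && high <= 0xDBFF) {
      // Standard UTF-16 decoding of a surrogate pair.
      return ((high - 0xD800) * 0x400) + (low - 0xDC00) + 0x10000;
    }
  }
  // Otherwise fall back to the built-in, forward-looking behavior.
  return str.codePointAt(idx);
}

const s = '\uD83D\uDE00'; // U+1F600 as a surrogate pair
console.log(s.codePointAt(1).toString(16));        // built-in returns the low surrogate
console.log(codePointFromLow(s, 1).toString(16));  // helper returns the full code point
```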
@stdlib/string/code-point-unit-count which takes a code point and returns the number of code units. This way, when we want to iterate over a string to return code points, we can first invoke codePointAt to get a code point and then codePointUnitCount to advance the index.
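A sketch of the iteration pattern described here. Both codePointUnitCount and forEachCodePoint below are hypothetical stand-ins written for this example (the proposed package's actual API may differ), and the built-in String.prototype.codePointAt is assumed for the lookup.

```javascript
// Hypothetical stand-in: number of UTF-16 code units needed for a code point.
function codePointUnitCount(cp) {
  return cp > 0xFFFF ? 2 : 1;
}

// Visit each code point once, advancing the index by its code unit count.
function forEachCodePoint(str, fn) {
  let i = 0;
  while (i < str.length) {
    const cp = str.codePointAt(i); // get the code point at the current index
    fn(cp, i);
    i += codePointUnitCount(cp);   // advance past one or two code units
  }
}

forEachCodePoint('a\uD83D\uDE00b', function (cp, i) {
  console.log(i, cp.toString(16));
});
```

With this approach the index never lands on the low half of a well-formed pair, so no low-surrogate check is needed inside the loop.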
backward to be false.
next-extended-grapheme-cluster-break? Not clear to me, based on this thread (https://bugs.python.org/issue30717), whether we’d also want to have a utility for non-extended breaks.
It’s possible that we may not want to care about “legacy” clusters. If we do, we could always do next-legacy-grapheme-cluster-break.
Yeah, the Unicode report explicitly states "the legacy grapheme cluster boundaries are maintained primarily for backwards compatibility with earlier versions of this specification". https://unicode.org/reports/tr29/
In [8]: numGraphemeClusters( 'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞' )
Out[8]: 6
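For comparison, a rough sketch that counts grapheme clusters with the built-in Intl.Segmenter rather than the numGraphemeClusters utility shown above. This is a different technique, not the stdlib implementation; it assumes a runtime that ships Intl.Segmenter, and the function name countGraphemeClusters is hypothetical.

```javascript
// Hypothetical helper: count grapheme clusters using the runtime's segmenter.
// Assumes Intl.Segmenter is available in the environment.
function countGraphemeClusters(str) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
  let count = 0;
  for (const segment of segmenter.segment(str)) {
    count += 1; // each yielded segment is one grapheme cluster
  }
  return count;
}

// A base letter followed by a combining mark is treated as a single cluster.
console.log(countGraphemeClusters('e\u0301'));
```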
Another alternative @congzhangzh is using something like this utility: https://github.com/getify/moduloze
We’ve tried as best we can to make the project as easy to convert to ESM as possible, so a tool like moduloze should work quite easily on stdlib.
@congzhangzh Quick update. The work being done by @rreusser is almost ready for use. We are just working on a few final edge cases.
Thanks to you both for your work. It's great to hear you are close to finishing. I will wait for you to finish ES module support and focus on other parts of my project first.