codePointAt into its own module, do you want me to take that up?
codePointAt be only forward-looking is for iteration. For example, if I iterate over a string while looking both forward and backward, I have to account for repeated surrogate pairs (i.e., every surrogate pair would be returned twice, once at the index of the high surrogate and once at the index of the low surrogate). If the algorithm is instead only forward-looking, then I can check for a low surrogate at each iteration and, when I find one, skip ahead to the next code point.
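To make the difference concrete, here is a minimal sketch comparing the two behaviors. All function names below (`forwardCodePointAt`, `bidirectionalCodePointAt`, `collect`) are hypothetical and written only for illustration; they are not the stdlib implementation.

```javascript
'use strict';

// Hypothetical helpers for illustration only.
function isHigh( c ) {
    return c >= 0xD800 && c <= 0xDBFF;
}
function isLow( c ) {
    return c >= 0xDC00 && c <= 0xDFFF;
}
function combine( hi, lo ) {
    return ( ( hi - 0xD800 ) * 0x400 ) + ( lo - 0xDC00 ) + 0x10000;
}

// Forward-looking only: a low surrogate at `idx` is returned as-is.
function forwardCodePointAt( str, idx ) {
    var hi = str.charCodeAt( idx );
    var lo;
    if ( isHigh( hi ) && idx + 1 < str.length ) {
        lo = str.charCodeAt( idx + 1 );
        if ( isLow( lo ) ) {
            return combine( hi, lo );
        }
    }
    return hi;
}

// Also looks backward: a low surrogate at `idx` resolves to the full pair.
function bidirectionalCodePointAt( str, idx ) {
    var c = str.charCodeAt( idx );
    var hi;
    if ( isLow( c ) && idx > 0 ) {
        hi = str.charCodeAt( idx - 1 );
        if ( isHigh( hi ) ) {
            return combine( hi, c );
        }
    }
    return forwardCodePointAt( str, idx );
}

// Visit every code unit index and record what `at` returns.
function collect( str, at ) {
    var out = [];
    var i;
    for ( i = 0; i < str.length; i++ ) {
        out.push( at( str, i ) );
    }
    return out;
}

var str = 'a\uD83D\uDE00'; // 'a' followed by U+1F600

console.log( collect( str, bidirectionalCodePointAt ) );
// => [ 97, 128512, 128512 ] (the pair is returned twice)

console.log( collect( str, forwardCodePointAt ) );
// => [ 97, 128512, 56832 ] (the trailing low surrogate is detectable and can be skipped)
```

With the forward-looking variant, a caller stepping one code unit at a time can recognize the lone low surrogate (here `0xDE00`) and skip it, whereas the bidirectional variant gives no signal that the second `128512` is a repeat.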
Simply returning the low surrogate was discussed back in 2012. Not clear why: https://esdiscuss.org/topic/march-24-meeting-notes#content-24
This is probably because one can always get the high surrogate using
@stdlib/string/code-point-unit-count, which takes a code point and returns the number of code units. This way, when we want to iterate over a string to return code points, we can first invoke codePointAt to get a code point and then codePointUnitCount to advance the index.
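The iteration pattern described above could look roughly like the following. This is a sketch under assumptions: `codePointUnitCount` here is a hypothetical stand-in for the proposed utility, and the built-in `String.prototype.codePointAt` (which is forward-looking) is used in place of a stdlib `codePointAt`.

```javascript
'use strict';

// Hypothetical stand-in for the proposed utility: the number of UTF-16
// code units needed to encode a code point.
function codePointUnitCount( cp ) {
    return ( cp > 0xFFFF ) ? 2 : 1;
}

// Collect the code points of a string by advancing the index by the
// number of code units each code point occupies.
function codePoints( str ) {
    var out = [];
    var cp;
    var i;

    i = 0;
    while ( i < str.length ) {
        cp = str.codePointAt( i ); // built-in, forward-looking
        out.push( cp );
        i += codePointUnitCount( cp );
    }
    return out;
}

console.log( codePoints( 'a\uD83D\uDE00b' ) );
// => [ 97, 128512, 98 ]
```

Because the index advances by two after a surrogate pair, the loop never lands on the trailing low surrogate, so no explicit low-surrogate check is needed.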
next-extended-grapheme-cluster-break? Not clear to me, based on this thread (https://bugs.python.org/issue30717), whether we’d also want to have a utility for non-extended breaks.
It’s possible that we may not want to care about “legacy” clusters. If we do, we could always do
Yeah, the Unicode report explicitly states "the legacy grapheme cluster boundaries are maintained primarily for backwards compatibility with earlier versions of this specification". https://unicode.org/reports/tr29/
In : numGraphemeClusters( 'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞' ) Out: 6
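For a point of comparison outside stdlib, the built-in `Intl.Segmenter` segments strings by extended grapheme cluster when given `granularity: 'grapheme'`. The sketch below assumes a runtime that provides `Intl.Segmenter`; `countGraphemes` is a hypothetical name and is not the implementation behind `numGraphemeClusters`.

```javascript
'use strict';

// Hypothetical helper: count extended grapheme clusters using the
// built-in Intl.Segmenter (assumes the runtime supports it).
function countGraphemes( str ) {
    var segmenter = new Intl.Segmenter( 'en', {
        'granularity': 'grapheme'
    });
    var iter = segmenter.segment( str )[ Symbol.iterator ]();
    var n = 0;
    while ( !iter.next().done ) {
        n += 1;
    }
    return n;
}

console.log( countGraphemes( 'e\u0301' ) );
// => 1 (base letter plus combining acute accent)

console.log( countGraphemes( 'abc' ) );
// => 3
```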