Where communities thrive


  • Join over 1.5M+ people
  • Join over 100K+ communities
  • Free without limits
  • Create your own community
People
Repo info
Activity
    Athan
    @kgryte
    @ShraddheyaS Looking at the codePointAt utility. I see it differs slightly from the polyfill outlined on the MDN docs (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/codePointAt). Any reason you can think of as to why they wouldn’t also consider whether the indexed character is a low surrogate? My sense is that your implementation is correct, but I want to sanity check.
    Athan
    @kgryte
    Interestingly, the ECMAScript spec simply returns the low surrogate if the provided index does not mark the start of a surrogate pair: https://www.ecma-international.org/ecma-262/#sec-string.prototype.codepointat
    Was discussed back in 2012 to simply return the low surrogate. Not clear why: https://esdiscuss.org/topic/march-24-meeting-notes#content-24
    I am fine with deviating from the ECMAScript spec provided we can justify our reasoning. @ShraddheyaS Perhaps you have some thoughts here?
    Athan
    @kgryte
    In thinking about it, I suppose one reason to have codepointat be only forward looking is for during iteration. For example, if I attempt to iterate over a string while looking forward and backward, I would have to account for repeated surrogate pairs (i.e., every surrogate pair would be returned twice). If the algorithm instead is only forward looking, then I can check for a low surrogate at each iteration, potentially skipping along to the next code point.
    I think we have a few of options here. 1) we can keep as is, with the need to justify our algorithm. 2) we refactor to only be forward looking (similar to the polyfill). 3) we can maintain the ability to look backward, but only when an added (third) parameter indicates that looking backward is desired.
    Naively, I suppose (3) could be appealing in the scenario where I randomly want code points from a string. However, that is also problematic because then surrogate pairs would be overrepresented among other characters.
    So in short, I am leaning toward (2).
    Athan
    @kgryte
    More on the semantics of codepointat: https://www.ecma-international.org/ecma-262/#sec-codepointat
    Shraddheya Shendre
    @ShraddheyaS

    Was discussed back in 2012 to simply return the low surrogate. Not clear why: https://esdiscuss.org/topic/march-24-meeting-notes#content-24

    This is probably because one can always get the high surrogate using charCodeAt

    Athan
    @kgryte
    @ShraddheyaS May also be useful to create a string utility @stdlib/string/code-point-unit-count which takes a code point and returns the number of code units. This way, when we want to iterate over a string to return code points, we can first invoke codePointAt to get a code point and then codePointUnitCount to advance the index.
    Shraddheya Shendre
    @ShraddheyaS
    I admit that the current implementation is over cautious. But if we are to traverse a string backwards, then by not considering the case when we land at a low surrogate, aren't we forcing any inefficient behaviour? The low code unit will be accessed twice?
    Once while deciphering that we have landed at a low surrogate, hence moving backwards in the iteration and then after iterator is at correct start of surrogate pair, another access for returning the appropriate code point after assimilation
    Athan
    @kgryte
    That is a good point. That would seem to make (3) a better option (i.e., giving users the control to look backward). Maybe something to the effect of codePointAt( string, position, backward ) where backward is a boolean.
    The default is for backward to be false.
    Athan
    @kgryte
    I think that should be fine. We’ll just need to document the differences from the built-in codePointAt in the README, as done, e.g., with from-code-point.
    Feel free to create a PR for that utility. Been on our internal TODO list for a while!
    Shraddheya Shendre
    @ShraddheyaS
    Should I submit an issue first? Or directly the PR?
    Athan
    @kgryte
    Create an RFC issue, but no need to wait for us to approve. You can submit a PR immediately after without waiting around.
    Athan
    @kgryte
    As far as next_break. I think its reasonable to move this to a separate package. Maybe @stdlib/string/next-grapheme-cluster-break (alias: nextGraphemeClusterBreak), with signature nextGraphemeClusterBreak( str[, fromIndex] ), similar to indexOf.
    @ShraddheyaS Should it be next-extended-grapheme-cluster-break? Not clear to me, based on this thread (https://bugs.python.org/issue30717), whether we’d also want to have a utility for non-extended breaks.
    Athan
    @kgryte
    It’s possible that we may not want to care about “legacy” clusters. If we do, we could always do next-legacy-grapheme-cluster-break.
    @ShraddheyaS You have any thoughts on whether we should consider updating our implementation (in the future) to use a trie? https://github.com/foliojs/unicode-trie
    Shraddheya Shendre
    @ShraddheyaS
    Whoa, will need to look into this, especially matching algorithm. Won't matching prefixes be a bit hard in a Trie given that we can have multiple ways of encoding same grapheme? Or does this data structure take care of that?

    It’s possible that we may not want to care about “legacy” clusters. If we do, we could always do next-legacy-grapheme-cluster-break.

    Yeah, the Unicode report explicitly states "the legacy grapheme cluster boundaries are maintained primarily for backwards compatibility with earlier versions of this specification". https://unicode.org/reports/tr29/

    Athan
    @kgryte
    This I don’t know. I don’t think this is an urgent question to resolve, but we can keep it in mind for future work, if need be.
    Okay. So, let’s not worry about “legacy” boundaries for now. We can just state in the relevant package README that the package applies to extended grapheme clusters.
    Re: trie. I think the trie is supposed to actually address the fact that there are multiple ways to encode the same grapheme. But would need to confirm.
    Regardless, I think, for now, we can simply go with your implementation and create an issue after the fact that suggests considering using a trie for “fast” lookup.
    Athan
    @kgryte
    @ShraddheyaS FYI: I think I’ll rename the @stdlib/string/len package. Now that I’m thinking about it, we should probably be more explicit; e.g., @stdlib/string/num-grapheme-clusters. Sound okay with you? I’ll do this on my end.
    Athan
    @kgryte
    @ShraddheyaS Wow! Just tried out various “hard” examples. Your contribution for determining the number of grapheme clusters works great!
    For example,
    In [8]: numGraphemeClusters( 'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞' )
    Out[8]: 6
    Athan
    @kgryte
    Another useful utility package which should be straightforward to implement once next-grapheme-cluster-break is implemented will be splitGraphemeClusters, which would split a string into an array of grapheme clusters.
    Athan
    @kgryte
    If anyone would like to chime in on strided array function interfaces (similar in concept to NumPy’s ufuncs), here is a tracking issue. :) stdlib-js/stdlib#364
    Note that this proposal is a bit more limited and lower level, but would provide an important stepping stone for eventual functionality with NumPy feature parity.
    congzhangzh
    @congzhangzh
    @kgryte Hi, does stdlib support javascript es module import mode?
    Athan
    @kgryte
    @congzhangzh Not built-in support, yet. @rreusser has a WIP branch he is working on, but that is not quite ready for release. In the mean time, some on-line JS CDNs support wrapping CommonJS as an ES module, which may be useful if you’re looking to import all of stdlib. If you’re looking to import just a few or just one stdlib package, then you either need to use CJS or you can create a custom bundle and wrap as an ES module. More work, obviously, but that is the current workaround. What is your use case?
    congzhangzh
    @congzhangzh
    @kgryte I just need a few packages, as my project is at very early stage first, it will grow in the future
    @kgryte for the wrapping technology, can you give me some workable links? tks
    Athan
    @kgryte
    @congzhangzh Quick update. The work being done by @rreusser is almost ready for use. We are just working on a few final edge cases.
    For a CDN supporting wrapping a CommonJS module as an ES Module: https://jspm.org/
    Athan
    @kgryte
    If you are open to using an existing bundle, you can find some here: https://www.npmjs.com/org/stdlib
    Athan
    @kgryte

    Another alternative @congzhangzh is using something like this utility: https://github.com/getify/moduloze

    We’ve tried as best we can to make the project as easy to convert to ESM as possible, so a tool like moduloze should work quite easily on stdlib.

    congzhangzh
    @congzhangzh

    @congzhangzh Quick update. The work being done by @rreusser is almost ready for use. We are just working on a few final edge cases.

    Tks for you both works, it's great to hear you are near to finish, I will wait for your guys finish es module support, and focus on other side of my project first.

    Athan
    @kgryte
    Cool. Will circle back once we finish. 🤞
    congzhangzh
    @congzhangzh
    @kgryte FYI. I use some hours to test jspm.org with stdlib, it works very well.
    Athan
    @kgryte
    Awesome!