Where communities thrive


  • Join over 1.5M+ people
  • Join over 100K+ communities
  • Free without limits
  • Create your own community
People
Repo info
Activity
    Athan
    @kgryte
    @ShraddheyaS Reviewed! Thanks for working on this!
    Shraddheya Shendre
    @ShraddheyaS
    Thanks for the extensive review and the suggestions. I was able to incorporate most of them.
    Athan
    @kgryte
    Great!!! Thanks!
    Will take a look again later today.
    Athan
    @kgryte
    @ShraddheyaS Reviewed your updates! Looking good!
    Shraddheya Shendre
    @ShraddheyaS
    @kgryte - Thanks again! I have resolved all comments except the benchmark.js one and one where you suggested using 2 indices and trimming the length. Still not very clear about those.
    Also, regarding refactoring codePointAt into its own module, do you want me to take that up?
    Shraddheya Shendre
    @ShraddheyaS
    After taking a look at #296 and #297, it seems we should also move nextBreak into a separate module. Those 2 issues would be very simple to resolve after that.
    Athan
    @kgryte
    @ShraddheyaS Looking at the codePointAt utility. I see it differs slightly from the polyfill outlined on the MDN docs (https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/codePointAt). Any reason you can think of as to why they wouldn’t also consider whether the indexed character is a low surrogate? My sense is that your implementation is correct, but I want to sanity check.
    Athan
    @kgryte
    Interestingly, the ECMAScript spec simply returns the low surrogate if the provided index does not mark the start of a surrogate pair: https://www.ecma-international.org/ecma-262/#sec-string.prototype.codepointat
    Was discussed back in 2012 to simply return the low surrogate. Not clear why: https://esdiscuss.org/topic/march-24-meeting-notes#content-24
    I am fine with deviating from the ECMAScript spec provided we can justify our reasoning. @ShraddheyaS Perhaps you have some thoughts here?
    Athan
    @kgryte
    In thinking about it, I suppose one reason to have codepointat be only forward looking is for during iteration. For example, if I attempt to iterate over a string while looking forward and backward, I would have to account for repeated surrogate pairs (i.e., every surrogate pair would be returned twice). If the algorithm instead is only forward looking, then I can check for a low surrogate at each iteration, potentially skipping along to the next code point.
    I think we have a few of options here. 1) we can keep as is, with the need to justify our algorithm. 2) we refactor to only be forward looking (similar to the polyfill). 3) we can maintain the ability to look backward, but only when an added (third) parameter indicates that looking backward is desired.
    Naively, I suppose (3) could be appealing in the scenario where I randomly want code points from a string. However, that is also problematic because then surrogate pairs would be overrepresented among other characters.
    So in short, I am leaning toward (2).
    Athan
    @kgryte
    More on the semantics of codepointat: https://www.ecma-international.org/ecma-262/#sec-codepointat
    Shraddheya Shendre
    @ShraddheyaS

    Was discussed back in 2012 to simply return the low surrogate. Not clear why: https://esdiscuss.org/topic/march-24-meeting-notes#content-24

    This is probably because one can always get the high surrogate using charCodeAt

    Athan
    @kgryte
    @ShraddheyaS May also be useful to create a string utility @stdlib/string/code-point-unit-count which takes a code point and returns the number of code units. This way, when we want to iterate over a string to return code points, we can first invoke codePointAt to get a code point and then codePointUnitCount to advance the index.
    Shraddheya Shendre
    @ShraddheyaS
    I admit that the current implementation is over cautious. But if we are to traverse a string backwards, then by not considering the case when we land at a low surrogate, aren't we forcing any inefficient behaviour? The low code unit will be accessed twice?
    Once while deciphering that we have landed at a low surrogate, hence moving backwards in the iteration and then after iterator is at correct start of surrogate pair, another access for returning the appropriate code point after assimilation
    Athan
    @kgryte
    That is a good point. That would seem to make (3) a better option (i.e., giving users the control to look backward). Maybe something to the effect of codePointAt( string, position, backward ) where backward is a boolean.
    The default is for backward to be false.
    Athan
    @kgryte
    I think that should be fine. We’ll just need to document the differences from the built-in codePointAt in the README, as done, e.g., with from-code-point.
    Feel free to create a PR for that utility. Been on our internal TODO list for a while!
    Shraddheya Shendre
    @ShraddheyaS
    Should I submit an issue first? Or directly the PR?
    Athan
    @kgryte
    Create an RFC issue, but no need to wait for us to approve. You can submit a PR immediately after without waiting around.
    Athan
    @kgryte
    As far as next_break. I think its reasonable to move this to a separate package. Maybe @stdlib/string/next-grapheme-cluster-break (alias: nextGraphemeClusterBreak), with signature nextGraphemeClusterBreak( str[, fromIndex] ), similar to indexOf.
    @ShraddheyaS Should it be next-extended-grapheme-cluster-break? Not clear to me, based on this thread (https://bugs.python.org/issue30717), whether we’d also want to have a utility for non-extended breaks.
    Athan
    @kgryte
    It’s possible that we may not want to care about “legacy” clusters. If we do, we could always do next-legacy-grapheme-cluster-break.
    @ShraddheyaS You have any thoughts on whether we should consider updating our implementation (in the future) to use a trie? https://github.com/foliojs/unicode-trie
    Shraddheya Shendre
    @ShraddheyaS
    Whoa, will need to look into this, especially matching algorithm. Won't matching prefixes be a bit hard in a Trie given that we can have multiple ways of encoding same grapheme? Or does this data structure take care of that?

    It’s possible that we may not want to care about “legacy” clusters. If we do, we could always do next-legacy-grapheme-cluster-break.

    Yeah, the Unicode report explicitly states "the legacy grapheme cluster boundaries are maintained primarily for backwards compatibility with earlier versions of this specification". https://unicode.org/reports/tr29/

    Athan
    @kgryte
    This I don’t know. I don’t think this is an urgent question to resolve, but we can keep it in mind for future work, if need be.
    Okay. So, let’s not worry about “legacy” boundaries for now. We can just state in the relevant package README that the package applies to extended grapheme clusters.
    Re: trie. I think the trie is supposed to actually address the fact that there are multiple ways to encode the same grapheme. But would need to confirm.
    Regardless, I think, for now, we can simply go with your implementation and create an issue after the fact that suggests considering using a trie for “fast” lookup.
    Athan
    @kgryte
    @ShraddheyaS FYI: I think I’ll rename the @stdlib/string/len package. Now that I’m thinking about it, we should probably be more explicit; e.g., @stdlib/string/num-grapheme-clusters. Sound okay with you? I’ll do this on my end.
    Athan
    @kgryte
    @ShraddheyaS Wow! Just tried out various “hard” examples. Your contribution for determining the number of grapheme clusters works great!
    For example,
    In [8]: numGraphemeClusters( 'Z͑ͫ̓ͪ̂ͫ̽͏̴̙̤̞͉͚̯̞̠͍A̴̵̜̰͔ͫ͗͢L̠ͨͧͩ͘G̴̻͈͍͔̹̑͗̎̅͛́Ǫ̵̹̻̝̳͂̌̌͘!͖̬̰̙̗̿̋ͥͥ̂ͣ̐́́͜͞' )
    Out[8]: 6
    Athan
    @kgryte
    Another useful utility package which should be straightforward to implement once next-grapheme-cluster-break is implemented will be splitGraphemeClusters, which would split a string into an array of grapheme clusters.
    Athan
    @kgryte
    If anyone would like to chime in on strided array function interfaces (similar in concept to NumPy’s ufuncs), here is a tracking issue. :) stdlib-js/stdlib#364
    Note that this proposal is a bit more limited and lower level, but would provide an important stepping stone for eventual functionality with NumPy feature parity.
    congzhangzh
    @congzhangzh
    @kgryte Hi, does stdlib support javascript es module import mode?
    Athan
    @kgryte
    @congzhangzh Not built-in support, yet. @rreusser has a WIP branch he is working on, but that is not quite ready for release. In the mean time, some on-line JS CDNs support wrapping CommonJS as an ES module, which may be useful if you’re looking to import all of stdlib. If you’re looking to import just a few or just one stdlib package, then you either need to use CJS or you can create a custom bundle and wrap as an ES module. More work, obviously, but that is the current workaround. What is your use case?
    congzhangzh
    @congzhangzh
    @kgryte I just need a few packages, as my project is at very early stage first, it will grow in the future
    @kgryte for the wrapping technology, can you give me some workable links? tks