These are chat archives for rust-lang/rust

19th
Jan 2017
Robyn Speer
@rspeer
Jan 19 2017 02:10
How do I use the result of a group_by from itertools? Everything I do with it tells me that "the type of this value must be known in this context", but the type it appears to be is insane and not valid Rust syntax.
Here's my code:
rustfn postprocess_words(words: UWordBounds) -> Vec<String> { let iter1 = words.scan((0, false), |state, word| { let (idx, joining) = *state; let joinable = script_is_spaceless(word); if joining && joinable { *state = (idx, true); Some((idx, word.to_string())) } else { *state = (idx + 1, joinable); Some((idx, word.to_string())) } }); let iter2 = iter1.group_by(|&elt| elt.0); let iter3 = &iter2.into_iter().map(|item| { let group = item.1; group.collect().concat() }); iter3.collect() }
sorry
trying again
fn postprocess_words(words: UWordBounds) -> Vec<String> {
    let iter1 = words.scan((0, false), |state, word| {
        let (idx, joining) = *state;
        let joinable = script_is_spaceless(word);
        if joining && joinable {
            *state = (idx, true);
            Some((idx, word.to_string()))
        } else {
            *state = (idx + 1, joinable);
            Some((idx, word.to_string()))
        }
    });
    let iter2 = iter1.group_by(|&elt| elt.0);
    let iter3 = &iter2.into_iter().map(|item| {
        let group = item.1;
        group.collect().concat()
    });
    iter3.collect()
}
If I try to just call group.concat(), it says:
no method named `concat` found for type `itertools::Group<'_, {integer}, std::iter::Scan<unicode_segmentation::UWordBounds<'_>, ({integer}, bool), [closure@src/lib.rs:53:40: 63:6]>, [closure@src/lib.rs:64:32: 64:44]>` in the current scope
but of course I can't use that mess as the type.
I'm doing these operations with iterators because I'd eventually like them to work on an entire file, without having to make a Vec of all the input or all the results.
The purpose of iter2 and iter3 is I'd like to group these tuples by idx, and concatenate all the string values with the same idx together.
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 02:23
I suspect your problem is that the Rust compiler doesn’t want to nail down the type of the integer literals for some reason. Try making the initial 0 in the tuple passed to scan into 0usize or 0u64 (depending on intent).
Robyn Speer
@rspeer
Jan 19 2017 02:27
Changing it to 0u64 didn't make a difference.
Neither did changing the iter2 line to let iter2 = iter1.group_by(|&elt| elt.0 as u64);
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 02:30
Huh. Trying it in play.integer32.com right now, then.
Robyn Speer
@rspeer
Jan 19 2017 02:31
I could paste the rest of the code if it helps, but it has dependencies and stuff
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 02:33
Nah, I managed to replicate the error just by importing crates appropriately and supplying dummy implementations of script_is_spaceless and main.
Robyn Speer
@rspeer
Jan 19 2017 02:33
Cool.
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 02:34
Ah, I’m dumb!
You need to annotate the use of collect.
It’s return-type polymorphic, so just using concat on the result doesn’t give rustc enough information to pin down its type.
What type were you hoping to call concat on?
Robyn Speer
@rspeer
Jan 19 2017 02:37
String or maybe &str, but now I'm noticing that all the group keys are in there again
it actually seems to be an iterator of (u64, String)
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 02:37
yeah.
:)
So the correct line is then group.map(|pair| pair.1).collect::<String>().
And then you’ll get some errors from the borrow checker, but your initial problem is solved. :D
Although actually, it looks like the annotation is unnecessary after all and you can just use group.map(|pair| pair.1).collect().
Robyn Speer
@rspeer
Jan 19 2017 02:41
That appears to be true.
So, yeah, now I am down to one error with the borrow checker:
fn postprocess_words(words: UWordBounds) -> Vec<String> {
    let iter1 = words.scan((0u64, false), |state, word| {
        let (idx, joining) = *state;
        let joinable = script_is_spaceless(word);
        if joining && joinable {
            *state = (idx, true);
            Some((idx, word.to_string()))
        } else {
            *state = (idx + 1u64, joinable);
            Some((idx, word.to_string()))
        }
    });
    let iter2 = iter1.group_by(|elt| elt.0 as u64);
    let iter3 = &iter2.into_iter().map(|elt| {
        let group = elt.1;
        let strings: Vec<String> = group.map(|pair| pair.1).collect();
        strings.concat()
    });
    iter3.collect()
}
error[E0507]: cannot move out of borrowed content
  --> src/lib.rs:70:5
   |
70 |     iter3.collect()
   |     ^^^^^ cannot move out of borrowed content
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 02:45
Replicated.
You should note that the extra concat is unnecess
*unnecessary.
The collect call will automatically join all input strings into one if you tell it to.
Robyn Speer
@rspeer
Jan 19 2017 02:46
Huh. I think I don't know what collect actually does, I just assumed it was for turning an iterator into a Vec.
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 02:47
It is for that, but more generally it’s for turning an iterator into some kind of persistent collection.
Strings count as a collection of chars, but also as a collection of strs or of other Strings, at least for the purposes of collect. It’s pretty cool.
Anything you can use with collect will show up as an impl of std::iter::FromIterator, which is the trait that handles this magic behind the scenes.
Robyn Speer
@rspeer
Jan 19 2017 02:48
Indeed, changing the type I'm putting it into to String seems to be working. Or at least not raising any errors on top of the borrow checker complaint.
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 02:48
Robyn Speer
@rspeer
Jan 19 2017 02:49
turbofish :D
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 02:50
:)
One of my most proud contributions to the Rust project so far has been to add those three little FromIterator impls for Cow<str>.
Anyway, trying to solve your lifetime issue.
Robyn Speer
@rspeer
Jan 19 2017 02:51
Collecting Result<T, E> into Result<Collection<T>, E> sounds pretty useful. I assume it returns the first error if there is one?
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 02:51
Yes, it does.
Similar behavior with the impl for Option: None on the first failure.
In both cases, the underlying iterator doesn’t get evaluated any more than necessary to see the first error.
Found the problem borrowck had with your code:
Remove the & in front of iter2 in the assignment to iter3. Taking a reference there is strictly unnecessary, and also was giving the scope sets algorithms fits. :)
Robyn Speer
@rspeer
Jan 19 2017 02:56
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 02:57
In fact, you can just write the entire function as one (large) expression like so:
fn postprocess_words(words: UWordBounds) -> Vec<String> {
    words.scan((0u64, false), |state, word| {
        let (idx, joining) = *state;
        let joinable = script_is_spaceless(word);
        if joining && joinable {
            *state = (idx, true);
            Some((idx, word.to_string()))
        } else {
            *state = (idx + 1u64, joinable);
            Some((idx, word.to_string()))
        }
    }).group_by(|elt| elt.0 as u64).into_iter().map(|elt| {
        let group = elt.1;
        group.map(|pair| pair.1).collect()
    }).collect()
}
And I thinks it’s actually marginally more readable that way too (but YMMV).
Robyn Speer
@rspeer
Jan 19 2017 02:58
Probably. Mostly I was giving error messages more distinct lines to point to.
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 02:58
Ah. :)
Let me see if I can “simplify” your code further; I think there are some things that are not really necessary in there.
Robyn Speer
@rspeer
Jan 19 2017 02:59
Alright, great! It compiles! And it gives the wrong answer compared to the simple version I had that used Vec for everything. Now I've just got to think through the logic of my scan.
I notice that Some((idx, word.to_string())) can be pulled out of the if statement
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 02:59
Yes, it can.
And the use of .to_string() there is also unnecessary (since you’re just joining them into an owned string at the end anyway, you might as well keep them as slices for as long as you can).
So this is the (still wrong, but simpler) function now:
fn postprocess_words(words: UWordBounds) -> Vec<String> {
    words.scan((0usize, false), |state, word| {
        let (idx, joining) = *state;
        let joinable = script_is_spaceless(word);
        if joining && joinable {
            *state = (idx, true)
        } else {
            *state = (idx + 1, joinable)
        }
        Some((idx, word))
    }).group_by(|elt| elt.0).into_iter().map(|elt| {
        let group = elt.1;
        group.map(|pair| pair.1).collect()
    }).collect()
}
What answer are you getting and what answer did you want to get?
Robyn Speer
@rspeer
Jan 19 2017 03:03
If you'd like some context: what I'm doing here is postprocessing the output of unicode_segmentation, because when your text includes words in a script Unicode doesn't know how to segment, Unicode throws a fit and blows it apart into individual characters. (I've previously done this same thing in Python.)
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 03:03
Oh, ok.
Robyn Speer
@rspeer
Jan 19 2017 03:04
#[cfg(test)]
mod tests {
    use super::segment_words;
    #[test]
    fn it_works() {
        let s = "こんにちは world, this is a test case.";
        let words: &[_] = &["こんにちは", " ", "world", ", ", "this", " ", "is", " ", "a", " ", "test", " ", "case", "."];
        assert_eq!(segment_words(s, "en"), words);
    }
}
what I got was almost that, except the first "こ" was in its own segment.
(Left to its own devices, unicode_segmentation would have put all of the Japanese kana in their own segments.)
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 03:05
Oh, I see. My immediate response is to think that you have the iterator equivalent of an “off-by-one” error.
Robyn Speer
@rspeer
Jan 19 2017 03:05
Yep.
It's also occurred to me that maybe I'd get a better, more efficient result if I just forked unicode_segmentation.
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 03:06
Heh.
Try inserting .inspect(|x| println!("{}", x) at various places in the iterator chain and see what the output is.
Sorry, forgot the ending paren: .inspect(|x| println!("{}", x))
Robyn Speer
@rspeer
Jan 19 2017 03:10
well, after iter1, I get "the trait std::fmt::Display is not implemented for (u64, std::string::String)".
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 03:10
Lolwut.
Try replacing {} with {:#?}, then (this is the sequence for pretty-printed debug output).
Robyn Speer
@rspeer
Jan 19 2017 03:12
Aha. And now I see the off-by-one error.
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 03:12
Yay!
Iterator::inspect is really nice for this kind of thing.
Robyn Speer
@rspeer
Jan 19 2017 03:16
Alright, I've just got some logic to think about. Thanks for all your help!
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 03:17
Also, now that I understand better what this is for, you should probably replace all of your u64 annotations with usize ones, since we’re conceptually dealing with “indices” into an in-memory set of strings, and it would be rather strange if the size of all addressable memory was smaller than the space required to store those strings (since they’re in-memory to start with).
Robyn Speer
@rspeer
Jan 19 2017 03:22
Nice, it simplified into this:
fn postprocess_words(words: UWordBounds) -> Vec<String> {
    words.enumerate().map(|(idx, word)| {
        if script_is_spaceless(word) {
            (0usize, word)
        } else {
            (idx + 1, word)
        }
    })
    .group_by(|elt| elt.0)
    .into_iter().map(|elt| {
        let group = elt.1;
        group.map(|pair| pair.1).collect::<String>()
    }).collect()
}
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 03:35
You can also get rid of the let group line by inlining it as elt.1.map(…)
Robyn Speer
@rspeer
Jan 19 2017 03:43
Is it too much to hope for to be able to return the iterator without collecting it?
This of course doesn't work, though I wish it did:
fn postprocess_words(words: UWordBounds) -> Map<_, String> {
    words.enumerate().map(|(idx, word)| {
        if script_is_spaceless(word) {
            (0usize, word)
        } else {
            (idx + 1, word)
        }
    })
    .group_by(|elt| elt.0)
    .into_iter().map(|(_, group)| {
        group.map(|pair| pair.1).collect::<String>()
    })
}
but I have no idea what would fill the blank in the return type
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 03:49
Unfortunately, the answer right now is no, you can’t.
Robyn Speer
@rspeer
Jan 19 2017 03:49
Hmm.
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 03:49
There is an experimental feature (called impl Trait) that would let you do this, but it’s only accessible on nightly and it’s behind a feature flag.
If you want to play with it though, the correct incantation to a nightly rustc is #![feature(conservative_impl_trait)].
Robyn Speer
@rspeer
Jan 19 2017 03:50
Yeah, I heard about impl Trait last night and all the things it will solve about return types
That sounds like it's actually arriving pretty soon, at least.
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 03:51
It should arrive not too un-soon.
Robyn Speer
@rspeer
Jan 19 2017 03:51
Ah.
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 03:51
Simply put, it needs a lot more work to be generally usable.
But there’s a lot of incentive to make it available on stable in at least some form.
So who knows.
Robyn Speer
@rspeer
Jan 19 2017 03:51
It's sounding like I was better off in the version of this code that used no iterators at all.
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 03:52
Nah, your current version is basically ideal as of today’s Rust.
Without the iterators at all, you’d be heap-allocating a bunch of Vec’s you don’t need. :)
If you really wanted something that eliminated that last collect, and were willing to go slightly insane to get it, you could use continuation-passing style, taking advantange of the existential-to-universal translation that that provides.
Robyn Speer
@rspeer
Jan 19 2017 03:55
Here's the version without iterators:
fn postprocess_words(words: UWordBounds) -> Vec<String> {
    let mut joining = false;
    let mut initial = true;
    let mut current_word = "".to_string();
    let mut result = Vec::new();
    for word in words {
        if !has_alphanumeric(word) {
            joining = false;
            continue;
        }
        let join_script = script_is_spaceless(word);
        if joining && join_script {
            current_word.push_str(word);
        } else {
            joining = join_script;
            if !initial {
                result.push(current_word);
            }
            current_word = word.to_string();
        }
        initial = false;
    };
    if !initial {
        result.push(current_word);
    }
    result
}
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 03:56
Yeah, that’s massively inefficient.
Robyn Speer
@rspeer
Jan 19 2017 03:56
(oh, and I hadn't decided to deal with has_alphanumeric separately yet)
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 03:56
Taking advantage of iterators is the right thing to do in Rust.
The whole reason impl Trait exists is so patterns like that are even more widely applicable.
Robyn Speer
@rspeer
Jan 19 2017 03:56
Okay. I'm kind of surprised; I figured that all the closures would have overhead.
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 03:57
Closures in Rust have zero overhead.
Robyn Speer
@rspeer
Jan 19 2017 03:57
and all the constructing tuples and stuff.
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 03:57
They’re compiled into synthesized structs that implement the Fn traits behind the scenes.
You can look at the generated assembly if you want to be sure. :)
Robyn Speer
@rspeer
Jan 19 2017 03:58
Got it.
What's more inefficient about the iterator-less version, though?
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 03:58
Generally, shuffling tuples around has little-to-no cost either (but, again, check the assembly to be sure).
The iterator-less version allocates at least log(n) times for the Vec’s growth pattern.
As well at O(n *log(k)) times for the intermediate strings.
Robyn Speer
@rspeer
Jan 19 2017 04:00
And collecting into a String is smarter about how it allocates?
Alexander Ronald Altman
@pthariensflame
Jan 19 2017 04:01
It’s possible that rustc would optimize some of that away, but iterator’s exist so you don’t have to rely on such shaky things.
collect will immediately allocate exactly enough space as it needs up front, if it possibly can.
In addition, the iterator-based version might omit some bounds checks that the vec-and-string version has, but I’m less sure about that.
In the end, benchmarks are the only way to say for sure. :D
Jonas Matser
@mtsr
Jan 19 2017 13:32
Anyone using VSCode? I'm finding that debug (F5) doesn't bother building the project first and was wondering if there's a quick fix for that?
I know there's prelaunchtask, but RustyCode's Cargo commands don't seem to be available, so I would have to redefine those in tasks.json, which seems odd.