    peytontarry
    @peytontarry
    Hi @LucyMcGowan , I'm a little confused about what "tree depth" means. When we set tree_depth = 10, I think that we are allowing T0 to have a maximum depth of 10, but I'm not quite sure what that really means. Is it that T0 can have a maximum of ten splits/internal nodes?
    Lucy D'Agostino McGowan
    @LucyMcGowan
    I'm updating the exam details (I've deleted my comment above to avoid confusion)
    • The first part of the midterm will be on canvas just like midterm 02, you have until Tuesday, April 28 to complete it and you have 2 hours once you begin
    • The second part is an analysis, so there is no imposed time limit; it is also due Tuesday, April 28
    TaylorFowlks
    @TaylorFowlks
    For exercise 3, I am getting this error message "No tuning parameters have been detected, performance will be evaluated using the resamples with no tuning. Did you want [fit_resamples()]?" How do I fix this?
    jdtrat
    @jdtrat
    What's your code @TaylorFowlks ?
    TaylorFowlks
    @TaylorFowlks
    I think I figured it out, but now I am getting an error message for exercise 5
    jdtrat
    @jdtrat
    Oh good! What error are you getting?
    TaylorFowlks
    @TaylorFowlks
    image.png (screenshot of the exercise 5 error message)
    jdtrat
    @jdtrat
    I believe it should be boost_spec <- boost_tree(parameters)
    TaylorFowlks
    @TaylorFowlks
    I tried that and it is still producing the same error message
    jdtrat
    @jdtrat
    Try running install.packages("xgboost") in your console and then library(xgboost) before rerunning your code
    TaylorFowlks
    @TaylorFowlks
    that still didn't work
    TaylorFowlks
    @TaylorFowlks
    I think I figured it out. I put "trees" instead of "tree"
    jdtrat
    @jdtrat
    Glad it worked!
    TaylorFowlks
    @TaylorFowlks
    Screen Shot 2020-04-22 at 2.19.06 PM.png (screenshot of the exercise 6 error message)
    Now I am getting this error message for exercise 6 @jdtrat
    Lucy D'Agostino McGowan
    @LucyMcGowan
    Thank you @jdtrat! @TaylorFowlks it looks like you need to save your result output, see this sta-363-s20/community#79
    TaylorFowlks
    @TaylorFowlks
    For some odd reason, that didn't work @LucyMcGowan
    Lucy D'Agostino McGowan
    @LucyMcGowan
    @TaylorFowlks - just added an issue to your repo with a suggestion: sta-363-s20/lab-06-ensemble-TaylorFowlks#1
    harrisonpopp
    @harrisonpopp
    @LucyMcGowan I'm confused about how the algorithm for boosted trees builds on itself. Does each B build on the previous one until the rmse is the lowest? And is it every tree that does this?
    traytib
    @traytib
    Anyone else's rstudio cloud getting stuck while knitting?
    Brian White
    @BrianNathanWhite
    I have a question concerning the proportion one should use when splitting their data into training and testing sets. How does one determine a 'good' proportion? To elaborate, I recall there are issues with the simple validation approach (i.e. the variability of the test error). In the class examples I've observed that we split the data into training and testing sets, perform K-fold CV on the training data to tune a parameter, and then see how the optimized model performs on our held-out test data. Is this summary correct? In particular, how does the proportion affect the outcome of this process?
    Feifan-990807
    @Feifan-990807
    @LucyMcGowan I have a question regarding the tree depth function in R. Does the tree depth represent the maximum number of internal nodes (number of predictors) in the larger tree or the maximum number of nodes in total (internal nodes and terminal nodes)? Thank you!
    jdtrat
    @jdtrat

    @Feifan-990807 I think this should answer your question

    Here's some documentation for the code we're writing: it says that tree_depth is the maximum depth of the tree (i.e. number of splits).
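    In code, a minimal sketch of setting the depth on a boosted-tree spec in tidymodels (the values here are just examples, not the ones required for the lab):

    ```r
    library(tidymodels)

    # tree_depth caps the depth of each individual tree, i.e. the number of
    # successive splits from the root down, not the total number of nodes
    boost_spec <- boost_tree(trees = 100, tree_depth = 10) %>%
      set_engine("xgboost") %>%
      set_mode("regression")
    ```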

    Andy-Jiang-483
    @Andy-Jiang-483
    @jdtrat Hey Johnathan, do we need to split the dataframe for 1-4?
    I'm getting conflicting answers
    Lucy D'Agostino McGowan
    @LucyMcGowan
    @Andy-Jiang-483 you do not need to split it, you can just use cross validation
    @BrianNathanWhite, when you have sufficient data, the initial splitting basically provides an additional check on your estimates via cross validation. In practice I very rarely both split and use cross validation, since cross validation is estimating the test error as well. I think as a general rule, you can feel pretty confident in your results acquired by cross validation without needing the extra split into training and testing
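    A minimal sketch of that cross-validation-only workflow in tidymodels (mtcars and a plain linear model are stand-ins here, not the class data):

    ```r
    library(tidymodels)

    set.seed(100)
    folds <- vfold_cv(mtcars, v = 5)   # K-fold CV on the full data, no initial split

    lm_spec <- linear_reg() %>%
      set_engine("lm")

    # fit_resamples() estimates the test error via cross validation
    results <- fit_resamples(lm_spec, mpg ~ ., resamples = folds)
    collect_metrics(results)
    ```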
    @traytib is your file still getting stuck?
    Brian White
    @BrianNathanWhite
    @LucyMcGowan That makes sense, thank you. One follow up question, what constitutes 'sufficient' data?
    Lucy D'Agostino McGowan
    @LucyMcGowan
    It depends on lots of things (like how easy the prediction is, how heterogeneous the observations are, etc) but a rule of thumb I’ve heard is 20,000
    @BrianNathanWhite that rule of thumb is mentioned here: https://www.fharrell.com/post/split-val/, Frank’s posts are a great resource for building really good prediction models for future reference
    Brian White
    @BrianNathanWhite
    @LucyMcGowan Cool! It's bookmarked.
    harrisonpopp
    @harrisonpopp
    @LucyMcGowan I think this question got lost in the comments, but I'm confused about how the algorithm for boosted trees builds on itself. Does each B build on the previous one until the rmse is the lowest? And is it every tree that does this?
    Lucy D'Agostino McGowan
    @LucyMcGowan
    Each B builds on the previous, essentially fitting a tree to the residuals (the parts the previous trees “missed”)
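    That residual-fitting loop can be sketched in a few lines of plain R; each "tree" is replaced by a simple lm() stand-in to keep it short, so this illustrates the idea, not the actual xgboost algorithm:

    ```r
    # Boosting by hand: each new model is fit to the current residuals,
    # and a shrunken copy of its prediction is added to the ensemble
    boost_by_hand <- function(x, y, B = 50, lambda = 0.1) {
      resid <- y
      pred  <- rep(0, length(y))
      for (b in seq_len(B)) {
        fit   <- lm(resid ~ x)          # stand-in for a small tree
        step  <- predict(fit)
        pred  <- pred + lambda * step   # ensemble prediction grows slowly
        resid <- resid - lambda * step  # next model targets what was "missed"
      }
      pred
    }
    ```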
    SangniHuang
    @SangniHuang
    @LucyMcGowan Hi Prof. McGowan, I wonder if a variable importance check is important for improving the accuracy of the predicted model (i.e. decreasing the test error) or if it is just useful for improving interpretability? Is it necessary to perform variable importance on our exam (in this case we have 30 variables)?
    gregorydvor
    @gregorydvor
    @LucyMcGowan If I choose to select a recipe, I would have to use fit_resamples(), right?
    SangniHuang
    @SangniHuang
    @LucyMcGowan Hi Prof. McGowan, when I tried to use the recipe I've specified to fit the model, it gives me an "Error: formula should be a formula object". How should I format the formula (using a recipe) in a different way?
    Lucy D'Agostino McGowan
    @LucyMcGowan
    @SangniHuang, see my thread above from @gregorydvor’s question - it’s fine to enter as is on Canvas but if you want to try the code yourself first, you can create a new data frame from the recipe, specify the formula in the fit, and the data as the new data frame like this:
    new_dat <- your_recipe %>%
      prep() %>%
      juice()
    
    fit(final_spec,
         y ~ , #fill in your formula here
         data = new_dat)
    Noah Handwerk
    @noah-handwerk
    @LucyMcGowan I was trying to look at a lasso model and was wondering if there was any way to see the coefficient values?
    Lucy D'Agostino McGowan
    @LucyMcGowan
    @noah-handwerk, This is probably easiest to see with variable importance. There are several ways to do this, one is via the vip package. Here is some code:
    # install.packages("vip") #uncomment and run this ONCE in the console
    library(vip)
    library(tidymodels) # for linear_reg(), set_engine(), fit(); attaches ggplot2
    
    linear_reg(mixture = 1, penalty = 3) %>%
      set_engine("glmnet") %>%
      fit(mpg ~ . , data = mtcars) -> f
    
    f %>%
      vi(lambda = 3) %>% # use the same penalty you fit with
      mutate(
        Importance = abs(Importance),
        Variable = factor(Variable, levels = Variable[order(Importance)])
      ) %>%
      ggplot(aes(x = Importance, y = Variable, fill = Sign)) +
      geom_col() +
      scale_x_continuous(expand = c(0, 0)) +
      labs(y = NULL)
    erincperry
    @erincperry
    @LucyMcGowan I have been trying to knit my work to an html for the past hour but it keeps stalling and will not fully knit. My code is updated on github and I submitted the quiz on canvas already. Should I keep trying to knit the document or should I leave it for now?
    Error in system(paste(which, shQuote(names[i])), intern = TRUE, ignore.stderr = TRUE) :
    cannot popen '/usr/bin/which 'pdflatex' 2>/dev/null', probable reason 'Cannot allocate memory'
    I keep getting this error
    jdtrat
    @jdtrat
    Are you on RStudio Cloud? @erincperry
    erincperry
    @erincperry
    yes @jdtrat
    jdtrat
    @jdtrat
    I vaguely remember from the beginning of the semester that there was an issue if a lot of people were trying to knit at once, hence the inability to allocate memory. I don't think we need to knit anything, but I'm not positive. I was just going to submit my final model. @erincperry
    erincperry
    @erincperry
    okay thank you! @jdtrat
    Lucy D'Agostino McGowan
    @LucyMcGowan
    @erincperry, I’m sorry you were having trouble with cloud 😕, just submitting the code via canvas is enough, @jdtrat is correct. 👌