Cristian Măgherușan-Stanciu
@cristim
cool, thanks!
mello7tre
@mello7tre
i have changed the code to wait directly for the spot instance to be ready and then use it to replace the on-demand one,
run time is 40 to 50 secs
tomorrow i will try to launch multiple instances for the same asg
currently i have put all the code inside handleNewOnDemandInstanceLaunch, but it would be better to create a dedicated method:
                log.Printf("%s instance %s belongs to an enabled ASG and should be "+
                        "replaced with spot, attempting to launch spot replacement",
                        i.region.name, *i.InstanceId)
                spotInstanceID, err := i.launchSpotReplacement()
                if err != nil {
                        log.Printf("%s Couldn't launch spot replacement for %s",
                                i.region.name, *i.InstanceId)
                        return err
                }

                // Block until the spot instance is running so it can be swapped
                // into the ASG right away.
                log.Printf("Waiting for spot instance %s to be in status running", *spotInstanceID)
                err = r.services.ec2.WaitUntilInstanceRunning(
                        &ec2.DescribeInstancesInput{
                                InstanceIds: []*string{spotInstanceID},
                        })
                if err != nil {
                        // Dereference the ID: printing the pointer would log an address.
                        log.Printf("Issue while waiting for spot instance %s to start: %v",
                                *spotInstanceID, err)
                        return err
                }

                // Scan the new instance so it can be looked up below.
                if err := r.scanInstance(spotInstanceID); err != nil {
                        log.Printf("%s Couldn't scan instance %s: %s", i.region.name,
                                *spotInstanceID, err.Error())
                        return err
                }

                // Swap the spot instance with the on-demand one in the ASG.
                spotInstance := r.instances.get(*spotInstanceID)
                if _, err := spotInstance.swapWithGroupMember(i.asg); err != nil {
                        log.Printf("%s Couldn't perform spot replacement of %s",
                                i.region.name, *i.InstanceId)
                        return err
                }
this way the spot instance's "life" outside the ASG should be nearly zero, but as a downside it will take more time to replace multiple instances belonging to the same ASG
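
A dedicated method for this flow could look roughly like the sketch below. It is only an illustration: the `spotReplacer` interface, its method names, and `fakeReplacer` are hypothetical stand-ins for the AutoSpotting and EC2 calls in the snippet above, so the example runs without AWS. It shows the launch, wait, swap order, and that the swap is skipped when an earlier step fails.

```go
package main

import (
	"errors"
	"fmt"
)

// spotReplacer lists the three steps from the snippet above. The interface
// and its method names are hypothetical, not AutoSpotting's real API.
type spotReplacer interface {
	launchSpot(onDemandID string) (string, error)
	waitUntilRunning(spotID string) error
	swapIntoGroup(spotID, onDemandID string) error
}

// replaceWithSpot launches a spot instance, waits for it to run, then swaps
// it with the on-demand instance. It stops at the first failing step.
func replaceWithSpot(r spotReplacer, onDemandID string) (string, error) {
	spotID, err := r.launchSpot(onDemandID)
	if err != nil {
		return "", fmt.Errorf("launching spot replacement for %s: %w", onDemandID, err)
	}
	if err := r.waitUntilRunning(spotID); err != nil {
		return spotID, fmt.Errorf("waiting for %s to run: %w", spotID, err)
	}
	if err := r.swapIntoGroup(spotID, onDemandID); err != nil {
		return spotID, fmt.Errorf("swapping %s for %s: %w", spotID, onDemandID, err)
	}
	return spotID, nil
}

// fakeReplacer records the calls it receives and can fail the wait step.
type fakeReplacer struct {
	calls    []string
	failWait bool
}

func (f *fakeReplacer) launchSpot(onDemandID string) (string, error) {
	f.calls = append(f.calls, "launch")
	return "i-spot", nil
}

func (f *fakeReplacer) waitUntilRunning(spotID string) error {
	f.calls = append(f.calls, "wait")
	if f.failWait {
		return errors.New("timed out")
	}
	return nil
}

func (f *fakeReplacer) swapIntoGroup(spotID, onDemandID string) error {
	f.calls = append(f.calls, "swap")
	return nil
}

func main() {
	f := &fakeReplacer{}
	spotID, err := replaceWithSpot(f, "i-ondemand")
	fmt.Println(spotID, err, f.calls)
}
```

Because each replacement blocks on the wait step, several instances in the same ASG are handled one after another, which matches the downside mentioned above.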
Cristian Măgherușan-Stanciu
@cristim
that's fine, thanks!
mello7tre
@mello7tre
hi @cristim i noticed a strange thing:
at 08:22 i got an Instance Rebalance Recommendation (IRR), so AS removed the instance from the ASG, but at 09:35 the instance was still running. I think AWS IRRs are not 100% accurate: the instance may or may not be terminated.
the problem is that we remove the launched-for-asg tag, as for spot, but this way, if the instances are not terminated, we end up with unused spot instances that will never be terminated by AS (they are like "ghost" instances)
probably the best approach would be to not remove the launched-for-asg tag if the event is an IRR, what do you think?
mello7tre
@mello7tre
(this happens if the ASG does not have a TerminationLifecycleHook, so we detach the spot instance)
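
The proposal above could be sketched as a small decision helper. The two detail-type strings are the ones EventBridge uses for these EC2 events; the helper name, and the assumption that the tag is still removed on an interruption warning and left alone for any other event, are illustrative and not taken from the AutoSpotting code.

```go
package main

import "fmt"

// EventBridge detail-type values for the two EC2 spot events discussed above.
const (
	rebalanceRecommendation = "EC2 Instance Rebalance Recommendation"
	interruptionWarning     = "EC2 Spot Instance Interruption Warning"
)

// shouldRemoveLaunchedForASGTag is a hypothetical helper. It reports whether
// the launched-for-asg tag should be removed for a given event type.
func shouldRemoveLaunchedForASGTag(detailType string) bool {
	switch detailType {
	case interruptionWarning:
		// Assumption: the instance is really going away, so the tag can go.
		return true
	case rebalanceRecommendation:
		// The instance may keep running, so keep the tag to avoid
		// leaving a "ghost" instance behind.
		return false
	default:
		// Assumption: unknown events leave the tag untouched.
		return false
	}
}

func main() {
	for _, t := range []string{rebalanceRecommendation, interruptionWarning} {
		fmt.Printf("%s: remove tag = %v\n", t, shouldRemoveLaunchedForASGTag(t))
	}
}
```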
Cristian Măgherușan-Stanciu
@cristim

@/all Tomorrow I'll resume the monthly AutoSpotting community calls.

Feel free to join our next session tomorrow at 13:00 UTC if you'd like to get in touch and discuss AutoSpotting.

Agenda:

  • current state of the project and ongoing work
  • near-term plans
  • free discussion about the project

Looking forward to seeing you.
Best regards,
Cristian

Zoom Meeting URL: https://zoom.us/j/6950084333?pwd=RFplNUhydSsrTlFKWFJaMkczdkU4QT09
Meeting ID: 695 008 4333
Passcode: 9gu6f

mello7tre
@mello7tre
@cristim, i ran some real tests with the previously posted code changes (and some others too); it seems to work fine: it replaced 10 instances belonging to the same ASG without any problem.
It took nearly 8 min to replace them all (launch spot + replace).
The only AS runs that can work in parallel for the same ASG are the ones that send messages to the SQS queue or that terminate instances.
All others are triggered by SQS and work in a sequential way (for the same ASG and Region).
i can create a PR so you can have a look at the code, let me know..
(i have changed the logic in waitForInstanceStatus and gained 10 secs per run)
Cristian Măgherușan-Stanciu
@cristim
Sure, thanks!
Have you rebased it onto my other draft PR?
mello7tre
@mello7tre
i have started from sqs-workflow-improvements
Cristian Măgherușan-Stanciu
@cristim
great!
mello7tre
@mello7tre
just discovered that i need to change the cron spot launch (don't launch directly but send a message to the queue), so i think i will create the PR tomorrow morning
Cristian Măgherușan-Stanciu
@cristim
Why is it needed? I thought we can keep it without the queue for the Docker use case
mello7tre
@mello7tre
it will send only if we do not use the queue
i need to do this in order to avoid processing spot launch events that do not come from SQS
on an on-demand event we launch the spot instance, then wait for it and do the replacement, but the spot launch triggers another AS run
Cristian Măgherușan-Stanciu
@cristim
Not sure I can follow 😊
But I'll try it out tomorrow
mello7tre
@mello7tre
i will try to explain it better in the commit descriptions
we should write some workflows for every trigger event to better understand it
tomorrow i am on holiday, i will try to create them
mello7tre
@mello7tre
[attached image: autospotting.jpg]
new workflows
(i have omitted lifecycle and termination events for simplicity)
mello7tre
@mello7tre
@cristim i have created the PR:
AutoSpotting/AutoSpotting#462
it's based on your sqs-workflow-improvements branch, this way it's easier to find the changes
Cristian Măgherușan-Stanciu
@cristim
Thanks, I'll have a look shortly
Cristian Măgherușan-Stanciu
@cristim
Can you join the community call in about two hours?
mello7tre
@mello7tre
yes, i will join it
Cristian Măgherușan-Stanciu
@cristim
Thanks, looking forward to seeing you
Cristian Măgherușan-Stanciu
@cristim
@mello7tre I just fixed the remaining issues from that PR and ran a test against an ASG with 20 instances, it worked like a charm and replaced them all with Spot in about 20min
it's a bit slower than I hoped/expected but it's a much smoother ride than what we had before and the Spot instances were all attached immediately to the ASG
mello7tre
@mello7tre
glad to hear this
Cristian Măgherușan-Stanciu
@cristim
thanks for your help on this!
mello7tre
@mello7tre
:-)
mello7tre
@mello7tre
regarding the moto tests, i asked in the moto Gitter channel whether they prefer a single PR or multiple ones (with my changes), but no one answered, so currently i am on standby... i will update on progress, if any.
Cristian Măgherușan-Stanciu
@cristim
thanks!
I'd probably split it by component to make it easier for them to review
mello7tre
@mello7tre
yes, the only problem is that the changes involve the same files
example:
the change for adding ASG LifeCycleHook (create, describe, delete) involves moto/autoscaling/models.py, and the same file is involved in the fix to attach_instances
and looking at moto's currently open PRs (91) i fear that PR merging will be slow...
Cristian Măgherușan-Stanciu
@cristim
yeah, well in that case make it about the autoscaling service, which should be fine to then add multiple actions to it
Cristian Măgherușan-Stanciu
@cristim
I've merged the SQS change and then I rebased the EBS volume upgrade branch to the current master, the code still works as in my initial tests and for the rest of the week I'll be working on making it more configurable
mello7tre
@mello7tre
ok, as soon as i have some time i will update the autospotting running in my dev account to the latest master (currently it's running with the latest sqs-workflow-improvements)
Cristian Măgherușan-Stanciu
@cristim
the EBS code I mentioned isn't yet in master, you can check it out from the feat/replace_io1_with_new_io2_EBS_volumes branch
mello7tre
@mello7tre
ok, thanks
Cristian Măgherușan-Stanciu
@cristim
I decided to merge it as it is to have more people test it