These are chat archives for ManageIQ/manageiq/performance

3rd Jan 2018
Nick LaMuro
@NickLaMuro
Jan 03 2018 00:02
I am
Keenan Brock
@kbrock
Jan 03 2018 04:27
@Fryguy Did I read this correctly? "this table will be removed?" that is a big win.
Joe Rafaniello
@jrafanie
Jan 03 2018 15:00
@kbrock Yup :clap: :bow: :scissors: :bomb:
Keenan Brock
@kbrock
Jan 03 2018 15:01
let's backport all the way back to D ;)
Keenan Brock
@kbrock
Jan 03 2018 15:44
@NickLaMuro I need to view data by process id, so I'm writing a script to split up logs (evm.log in particular) by pid
Also grouping pids together by worker type
do you already have something like this?
the script I ran numbers with before grouped the logs by worker type - but having too many processes in the same data set really botched things up
Nick LaMuro
@NickLaMuro
Jan 03 2018 15:47
@kbrock this does that I believe

@kbrock though, I think what I did here works probably more to what you would prefer:

https://github.com/NickLaMuro/miq_tools/blob/master/miq_top_output_processor/top_processor.rb

The pid_buffer is effectively what I am talking about in that file
basically, I keep track of a file name based on the pid in the hash, and when the current pid in question changes, I close the file (but leave the file object around), and open the appropriate file Object (create if needed), and write to that
Jason Frey
@Fryguy
Jan 03 2018 15:53
@kbrock Yes, @gtanzillo found a way to delete the vim_performance_tag_values table :heart_eyes:
Dennis Metzger
@dmetzger57
Jan 03 2018 16:30
@kbrock I tried forcing the VC Simulator to not associate all the VMs to the/a global datastore (actually specified 0 global datastores); the result was the simulator had a global datastore and all the VMs were associated with it :worried:
Keenan Brock
@kbrock
Jan 03 2018 16:36
wonder if we want to change the code. will only fix dev. but may be good anyway
Nick LaMuro
@NickLaMuro
Jan 03 2018 16:39
I already started considering that, and even made a PR to start down the road to refactoring it, but determined that it really wouldn't have a noticeable impact to customers, and would just introduce stability risk by doing it

but determined that it really wouldn't have a noticeable impact to customers

For those trying to follow along at home, the Storage.perf_capture method will load all of the Vm records and associated files in its call, and will bloat memory in simulated environments:

https://github.com/ManageIQ/manageiq/blob/694f5f39faf973254e88437ef428a712e2854ec2/app/models/storage.rb#L714-L716

This is only a problem in the simulator because it only creates one storage for all of the VMs in a single provider. No sane real-world system actually has this problem (from what I have gathered), but it causes inaccurate spikes when running the metrics collector when the provider is a VMware simulator.

Keenan Brock
@kbrock
Jan 03 2018 17:13
@NickLaMuro but since we use simulators to improve our performance, I wonder if fixing that will allow us to better test / develop - so helps customers by making it easier for us to work on our code
Keenan Brock
@kbrock
Jan 03 2018 17:21
@NickLaMuro I want different processing for evm_mem_log_processor.rb - so I was thinking just creating a generic splitter may make sense
Nick LaMuro
@NickLaMuro
Jan 03 2018 17:28
@kbrock I am not opposed to that idea
Keenan Brock
@kbrock
Jan 03 2018 17:28
it is taking too long for me to process so many files
took >30 minutes
I had 10 days, and 4 different workers
I now need to split up by pid
Nick LaMuro
@NickLaMuro
Jan 03 2018 17:29
yeah, I could see that...
I forget that I am just doing top_output processing, and that goes much quicker (smaller files)
Keenan Brock
@kbrock
Jan 03 2018 17:29
think my graphs are confusing because all pids are in a single file - especially for the collector
it was all good enough until I tried understanding the collector
guess we can discuss more at 2:30
Nick LaMuro
@NickLaMuro
Jan 03 2018 17:31
oh yeah, one thing I did do was reduce the number of collector/processor workers to 1 when trying to debug those workers in the environment
Keenan Brock
@kbrock
Jan 03 2018 17:31
thanks for links to your scripts
ooh
you do this with the leaking appliances?
Nick LaMuro
@NickLaMuro
Jan 03 2018 17:32
found that running multiple at once was introducing too many variables
@kbrock yeah
Keenan Brock
@kbrock
Jan 03 2018 17:32
yes
agreed
hmm. I'll double check my appliance (101)
Nick LaMuro
@NickLaMuro
Jan 03 2018 17:33
for example, with that spike from the Storage.perf_capture, it only gets run once an hour

so if you have it running on multiple workers, the "spike" looks more sporadic or random, because it might happen on different workers, or never on one of them

(if the leak was based on the fact that we have multiple of the same type of worker, I would then think we should work around this, but I don't think that is the case)

Keenan Brock
@kbrock
Jan 03 2018 17:34
yea - that last graph is based upon your appliances
not sure why it looks so sporadic - was thinking separating by pid would make it better
hmm
I'll look further into this
Nick LaMuro
@NickLaMuro
Jan 03 2018 17:35
@kbrock more that it can create the illusion of a non-deterministic leak
Keenan Brock
@kbrock
Jan 03 2018 17:35
I wanted multiple because it is hard to keep up with the load
100%
ok. I think I want to simplify your script to not produce data but just produce logs by pid
then we can do whatever processing we need from there
Nick LaMuro
@NickLaMuro
Jan 03 2018 17:36

I wanted multiple because it is hard to keep up with the load

Yeah, I have talked with @dmetzger57 about this, and we really need to find a sweet spot where one worker can handle X number of VMs without being overloaded, but there's only so much we can do

ok. I think I want to simplify your script to not produce data but just produce logs by pid

Okay, though that then means having to process them twice

could you possibly change it into something more generic, where you pass in a RegExp instead of using a constant?
and allow a block for processing once the line is matched?

then your output is just:

$ ls some_log_dir
some_evm.log
some_evm_pid_1.data
some_evm_pid_2.data
some_evm_pid_3.data

Instead of:

$ ls some_log_dir
some_evm.log
some_evm_pid_1.log
some_evm_pid_2.log
some_evm_pid_3.log
some_evm_pid_1.data
some_evm_pid_2.data
some_evm_pid_3.data
@kbrock ^
Keenan Brock
@kbrock
Jan 03 2018 19:45
ooh - I was thinking just *.log
Keenan Brock
@kbrock
Jan 03 2018 20:01
@NickLaMuro I was thinking all you do is cut up the log files by pid. maybe do a regex so you only copy across certain lines to the logs
Nick LaMuro
@NickLaMuro
Jan 03 2018 20:05
@kbrock right, I get that, but then your overall algorithm is iterating over the data twice:
  • Once to split the log into logs by pid
  • A second time to format that data into something gnuplot can slurp up
that doesn't mean your algorithm will necessarily be 2x slower, but worst case, if you are trying to digest all of the data from every worker, it will be slower than the previous method
Keenan Brock
@kbrock
Jan 03 2018 21:41
@NickLaMuro do we need the date logic in evm_mem_log_processor.rb ?
@NickLaMuro why the eval in info = eval Regexp.last_match[8]
Nick LaMuro
@NickLaMuro
Jan 03 2018 21:49

@NickLaMuro why the eval in info = eval Regexp.last_match[8]

@kbrock with that strikethrough, did that just become a second ping right after the one above? :trollface:

but to your original question, let me look

@NickLaMuro do we need the date logic in evm_mem_log_processor.rb ?

Can I have you clarify what you mean? Because there are technically two "date logic" parts in that code base.
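On the eval question quoted above, a hedged guess at what it is doing (the captured string's format here is an assumption): if capture group 8 holds an inspect-style Ruby Hash from the log line, eval rehydrates it into a Hash, at the cost of executing whatever the log contains. A parse that avoids executing the string is sketched alongside:

```ruby
# Assumed shape of capture group 8: a Hash printed via #inspect.
captured = "{:rss=>123456, :pid=>42}"

# What the script does: quick, but evaluates the log text as Ruby code.
info = eval(captured)

# A safer sketch for this simple symbol=>integer case: extract the
# pairs with a regexp instead of evaluating the string.
safe = captured.scan(/:(\w+)=>(\d+)/)
               .to_h { |key, value| [key.to_sym, Integer(value)] }
```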

Nick LaMuro
@NickLaMuro
Jan 03 2018 21:55

The one in that file, which is just parsing the date using the regexp (one that I borrowed from you by the way), and is used at the bottom of the file:

https://github.com/NickLaMuro/miq_tools/blob/ca8732a/memory_leak_worker_logging/evm_mem_log_processor.rb#L108

The other is actually in the top_output_processor.rb, which is kinda specific to the way that file is built:

https://github.com/NickLaMuro/miq_tools/blob/ca8732a/miq_top_output_processor/top_processor.rb#L84

That one is basically doing DateTime arithmetic (poorly) to determine the current date from the timestamp that is posted hourly in that file, and using it with the time output from the top output.

There "might" be some use for the second used elsewhere, but not sure what that might be off the top of my head.
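The midnight-handling arithmetic described there could be sketched like so (the function and argument names are made up, not the code from top_processor.rb): top output only carries a time of day, so the current date is carried forward from the last hourly timestamp line, and a clock reading that goes "backwards" means we rolled past midnight.

```ruby
require "date"

# Combine a carried-forward Date with an HH:MM:SS string from top output.
# Zero-padded HH:MM:SS strings compare correctly as plain strings, so a
# "smaller" time than the previous one signals a day rollover.
def resolve_timestamp(current_date, previous_time, top_time)
  current_date += 1 if top_time < previous_time # wrapped past midnight
  [current_date, DateTime.parse("#{current_date}T#{top_time}")]
end
```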