These are chat archives for uwhpsc-2016/uwhpsc-2016

16th May 2016
Chris Kang
@shinwookang
May 16 2016 03:04
I have a general question about computational time, but I'm not sure I'll be able to make it to Monday's office hours, so I'll leave it here for whoever can answer it. Why is Simpson's algorithm computationally faster than the trapezoidal rule? Simpson's feels like it should be slower, because it accesses x[i], x[i+1], and x[i+2] and does more computation per iteration. Is it simply because Simpson's for loop increments faster than the trapezoidal one's? Thank you
Matt
@mostberg1
May 16 2016 19:34
Is homework 2 graded? When should we expect it to be?
Chris Swierczewski
@cswiercz
May 16 2016 22:55
@mostberg1 Graded by tomorrow. It's a lot of homework.
@shinwookang Three doubles definitely fit on one cache line, so memory access time is not as much of an issue as one might think. It might also be that the thread overhead in trapz is outweighing the computational cost. That is, threads don't spend enough time computing before moving on to the next bit of work.
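For illustration, a minimal sketch (not the assignment's actual code; names and signatures are made up) of the two loop structures: over the same N points the Simpson loop advances its index by 2, so it runs roughly half as many iterations as the trapezoidal loop, and x[i], x[i+1], x[i+2] typically land on the same cache line anyway.

    /* Sketch only: composite trapezoid vs. Simpson over N equally spaced points. */
    double trapz_sketch(const double* fvals, const double* x, int N)
    {
        double integral = 0.0;
        for (int i = 0; i < N - 1; ++i)            /* ~N iterations */
            integral += 0.5 * (x[i+1] - x[i]) * (fvals[i] + fvals[i+1]);
        return integral;
    }

    double simps_sketch(const double* fvals, const double* x, int N)
    {
        double integral = 0.0;
        for (int i = 0; i < N - 2; i += 2)         /* ~N/2 iterations, more work each */
        {
            double h = x[i+1] - x[i];              /* assumes uniform spacing */
            integral += (h / 3.0) * (fvals[i] + 4.0*fvals[i+1] + fvals[i+2]);
        }
        return integral;
    }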

Office Hours - Start

I will do my best to answer everyone's questions. If I'm not responding then it's because I'm in a private chat with someone.
Chris Swierczewski
@cswiercz
May 16 2016 23:10
(Slowly getting to Issues questions...)
jasheetz
@jasheetz
May 16 2016 23:13
Hi Chris! I had a question about the number of processes in HW3 and the report section. For the report I also tried 16 and 30, besides the 8. In my analysis, 16 and 30 do better than 8, but worse than the number of processors we actually have access to. Is this expected?
Chris Swierczewski
@cswiercz
May 16 2016 23:15
Strange. I think there are on the order of ~300 cores on the SMC server where our accounts are allocated. Each project should only have access to four cores, though. I would send an email to SMC about this. (Follow the instructions under "Settings".) Part of the homework is to observe the effects of too many threads and not enough cores.
Being able to access all ~300 is quite the security flaw!
(BTW: the contact instructions are under "Settings" --> "Project Control". If you have code demonstrating this effect that is easy for them to compile and run you should include instructions in your email.)
jasheetz
@jasheetz
May 16 2016 23:17
I don't think I'm accessing that number. The performance of 16 and 30 is worse than 4. I have a graph in my report section that shows the point.
Chris Swierczewski
@cswiercz
May 16 2016 23:18
I'll take a look at your graph.
jasheetz
@jasheetz
May 16 2016 23:19
Almost as if virtual 8 gets bogged down, but virtual 16 and 30 are all set to go when the "real" 4 that I have access to are finished?
Chris Swierczewski
@cswiercz
May 16 2016 23:20
That really is strange.
jasheetz
@jasheetz
May 16 2016 23:20
Is the graph clear? I separated the two groups 1,2,4,8 and then 16 and 30.
Chris Swierczewski
@cswiercz
May 16 2016 23:20
Your graphs look more or less like the expected behavior. Though, in some of my own experiments the 8-thread version ended up "catching up" to the 2- and 4-thread versions.
jasheetz
@jasheetz
May 16 2016 23:21
I ran this a few times and you can see the 1000 repeat.
Chris Swierczewski
@cswiercz
May 16 2016 23:22
Wow. With repeat=1000.
Hmm...
jasheetz
@jasheetz
May 16 2016 23:22
It took a little bit, but not too long.
Chris Swierczewski
@cswiercz
May 16 2016 23:23
I realized after posting the homework that it would be more meaningful to take the min of the repeated runs rather than the average.
This is exactly because of the spurious events.
In fact, when we judge the performance of your code we will run with repeat=5 or so and take the min of those runs.
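As an aside, here is a minimal sketch of the min-of-repeats idea in C (the function name simps_parallel and the use of omp_get_wtime are assumptions, not the actual grading script):

    #include <float.h>
    #include <omp.h>

    /* Hypothetical declaration of whatever routine is being timed. */
    double simps_parallel(const double* fvals, const double* x, int N, int num_threads);

    /* Time the routine `repeat` times and keep the fastest run: the min filters
       out spurious slowdowns caused by whatever else the machine is doing. */
    double best_of(const double* fvals, const double* x, int N,
                   int num_threads, int repeat)
    {
        double best = DBL_MAX;
        for (int r = 0; r < repeat; ++r)
        {
            double start = omp_get_wtime();
            simps_parallel(fvals, x, N, num_threads);
            double elapsed = omp_get_wtime() - start;
            if (elapsed < best)
                best = elapsed;
        }
        return best;
    }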
jasheetz
@jasheetz
May 16 2016 23:24
hmm. for chunked and parallel I did the standard things that you mentioned for parallel, no heroic coding.
Chris Swierczewski
@cswiercz
May 16 2016 23:24
Ha ha. Very good.
Chris Kang
@shinwookang
May 16 2016 23:26
Hey @cswiercz, thank you for answering the question! So, continuing with the initial question on the computational time of the trapezoidal rule vs Simpson's: when do you go "over" one cache line? What would be an example?
Chris Swierczewski
@cswiercz
May 16 2016 23:26
@jasheetz See what happens after SMC gets back to you. I really didn't expect this kind of behavior with many more threads. Actually, I'm going to try running locally on my own machine really quickly.
@sin
jasheetz
@jasheetz
May 16 2016 23:26
ok.
Chris Swierczewski
@cswiercz
May 16 2016 23:27
@shinwookang I've been meaning to go over this in class for a while but there is a lot of MPI stuff that we need to cover. Basically, the data stored in RAM is separated into a bunch of "cache lines". Imagine RAM is laid out in a bunch of rows, each row a cache line. When you request a piece of data from RAM the entire line is sent to the cache.
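To make the cache-line picture concrete, here is a small stand-alone sketch (the 64-byte line size is an assumption; it varies by CPU): requesting x[i] pulls the whole line containing it into cache, so a unit-stride sweep reuses each loaded line for several elements, while a stride-8 sweep touches a new line on nearly every access.

    #include <stdio.h>
    #include <stdlib.h>

    /* Sum every `stride`-th element of x. With 64-byte lines (8 doubles),
       stride 1 reuses each cache line 8 times; stride 8 barely reuses it at all. */
    double sum_with_stride(const double* x, size_t n, size_t stride)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i += stride)
            s += x[i];
        return s;
    }

    int main(void)
    {
        size_t n = (size_t)1 << 24;                    /* ~16 million doubles */
        double* x = malloc(n * sizeof(double));
        for (size_t i = 0; i < n; ++i)
            x[i] = 1.0;

        double contiguous = sum_with_stride(x, n, 1);  /* cache-line friendly */
        double strided    = sum_with_stride(x, n, 8);  /* new line almost every access */

        printf("%g %g\n", contiguous, strided);
        free(x);
        return 0;
    }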
jasheetz
@jasheetz
May 16 2016 23:27
thanks! This homework was significantly more enjoyable than hw2. that's all from me.
Chris Swierczewski
@cswiercz
May 16 2016 23:27
There are many interesting ramifications of this layout. I'll try to squeeze in a discussion tomorrow.
Ha ha. That's great!
@jasheetz The students have made it through the fire. I'm trying my best to make the MPI homework "approachable". It's difficult because MPI is a very challenging subject so writing an interesting, yet not impossible, assignment is...interesting.
Shared memory parallelism is easy.
err..."easy"
Chris Kang
@shinwookang
May 16 2016 23:31
@cswiercz I see! Thank you for the information. Sounds like secondary reference material.
Chris Swierczewski
@cswiercz
May 16 2016 23:36
Good idea. I'll find some articles.
(image attached: bar.png)
nicksmithc102
@nicksmithc102
May 16 2016 23:38
Hey Professor, I notice that in the integrate.c files every function originally ends with return 0.0. Should we change that to return the result of the function?
Chris Swierczewski
@cswiercz
May 16 2016 23:38
@jasheetz Actually, ignore that picture. I forgot to actually plot the 32 thread timings.
@nicksmithc102 Definitely. I wrote return 0.0; to remind you that the functions are supposed to return something. (As opposed to "return by reference")
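For illustration, a sketch of the two conventions (names and signatures are made up; match whatever integrate.c actually declares):

    /* Return by value: replace the placeholder `return 0.0;` with the result. */
    double integrate_by_value(const double* fvals, const double* x, int N)
    {
        double integral = 0.0;
        for (int i = 0; i < N - 1; ++i)
            integral += 0.5 * (x[i+1] - x[i]) * (fvals[i] + fvals[i+1]);
        return integral;              /* not `return 0.0;` */
    }

    /* "Return by reference": write the result through an output pointer instead.
       Shown only for contrast; it is not what the homework asks for. */
    void integrate_by_reference(const double* fvals, const double* x, int N,
                                double* result)
    {
        double integral = 0.0;
        for (int i = 0; i < N - 1; ++i)
            integral += 0.5 * (x[i+1] - x[i]) * (fvals[i] + fvals[i+1]);
        *result = integral;
    }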
nicksmithc102
@nicksmithc102
May 16 2016 23:39
Gotcha. Thanks.
Chris Swierczewski
@cswiercz
May 16 2016 23:39
(image attached: bar.png)
@jasheetz Incoming timings for 32 threads on my 4-core machine. Basically, it's an exaggerated version of the 8-thread situation.
Meaning: more overhead with small chunksize and slightly less overhead with larger chunksize (but still more than the 8-thread case.)
So it must be an SMC thing.
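For context, a rough OpenMP sketch of a chunked loop (the dynamic schedule and the names here are assumptions, not the homework's simps_parallel_chunked): the chunk size sets how often threads go back to the scheduler for more work.

    #include <omp.h>

    double simps_chunked_sketch(const double* fvals, const double* x, int N,
                                int num_threads, int chunk_size)
    {
        double integral = 0.0;
        /* chunk_size counts loop iterations handed out per scheduling request. */
        #pragma omp parallel for num_threads(num_threads) \
                schedule(dynamic, chunk_size) reduction(+:integral)
        for (int i = 0; i < N - 2; i += 2)
        {
            double h = x[i+1] - x[i];
            integral += (h / 3.0) * (fvals[i] + 4.0*fvals[i+1] + fvals[i+2]);
        }
        /* Small chunk_size: threads return to the scheduler often (overhead).
           chunk_size near the iteration count: only one chunk to hand out, so
           timings drift back toward the single-thread case.                  */
        return integral;
    }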
jasheetz
@jasheetz
May 16 2016 23:41
Looking at your graph.. I was trying to see if I coded 8 wrong.
Chris Swierczewski
@cswiercz
May 16 2016 23:45
It's possible. I'll take a closer look at your time generation code.
jasheetz
@jasheetz
May 16 2016 23:46
Yours start with .009, mine start with .04 at the top of the graph... But you mentioned this was on your machine.
Chris Swierczewski
@cswiercz
May 16 2016 23:47
BTW: fvals is not a keyword argument. That is, pass in your y data like so:
time_simps_parallel_chunked(y, x, num_threads=p, ...)
I can't recall if you can actually pass keyword arguments as positional arguments in Python and have it do what you expect.
Yes, my machine so the timings will be different.
In fact, the timings will be slightly different every time I run it because I have stuff going on in the background.
jasheetz
@jasheetz
May 16 2016 23:48
hmm. I'll look into that.
Chris Swierczewski
@cswiercz
May 16 2016 23:48
Again, I suggest informing SMC once you're convinced that there aren't any bugs in your code.
@jasheetz Aside from the keyword argument issue your plotting code looks okay, I believe. Though, I just noticed that you're using a set to store the process numbers. Change that to a list, instead. I can't recall exactly right now, but I don't think sets preserve order, either.
Maybe...
That is, set
procs = [1, 2, 4, 8, 16, 32]
jasheetz
@jasheetz
May 16 2016 23:54
ok. I'll look when I get home from work... Thanks for looking! And good to see the multi-thread timings bump back to the 1-processor timing as the chunk size approaches N.
Chris Swierczewski
@cswiercz
May 16 2016 23:58
Yes! That's the key observation.
Every thread doing all the work = no parallelism advantage.

Office Hours - End

Good luck, everyone!
I'm pleased with people's progress so far.
jasheetz
@jasheetz
May 16 2016 23:59
That's what I thought, thanks.
Chris Swierczewski
@cswiercz
May 16 2016 23:59
Of course!