Lecture 5 - (finish) OpenMP
Let's look at how to create a histogram using OpenMP
![[3.1OpenMP-PartII.pdf#page=14]]
The problem with the above code is that all chunks write to the same histogram, which lives in the same piece of shared (global) memory. So when the chunks access it and go to write, there's a race condition. Thus, adding `#pragma omp atomic` could help here for the `++` operation.
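A rough sketch of that fix (assuming byte-valued input so the byte itself is the bucket index; `histogram_atomic`, `data`, and `n` are made-up names, not the slide's code), where the increment is protected with `#pragma omp atomic`:

```c
#include <omp.h>

/* `histogram` must have 256 entries, one per possible byte value.
 * Each thread handles a chunk of `data`, but every thread increments
 * the one shared histogram, so each ++ must be protected. */
void histogram_atomic(const unsigned char *data, long n, long *histogram)
{
    #pragma omp parallel for
    for (long i = 0; i < n; i++) {
        #pragma omp atomic
        histogram[data[i]]++;   /* race-free, but heavily contended */
    }
}
```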
Note that if we change the problem to counting the number of different words, this approach becomes much harder! How many distinct words can we base our chunks on? We don't really know.
![[3.1OpenMP-PartII.pdf#page=15]]
Hence, our proposed change:
![[3.1OpenMP-PartII.pdf#page=16]]
Notice the threaded version is super slow! The problem is the atomic inside the for loop: every increment pays for the locking/unlocking across threads.
Instead, we should have `local_histogram`s (one per thread) and then add them up:
![[3.1OpenMP-PartII.pdf#page=17]]
We:
- First chunk our work
- Then run the chunked work in parallel.
Notice:
- The last `for` loop runs in each thread, to combine that thread's result into the shared histogram (see the sketch after this list).
- We can put the `parallel` directive on the very outside. It's good to do this, and is common, for thread-related code, since it breaks your compiler directives up into relevant chunks.
- The `nowait` clause says not to wait for all the other threads to finish; a thread can just keep going.
- The highest-level `{}` block in this case "creates" the threads, and at its very end joins them.
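A minimal sketch of that structure, under the same byte-valued-input assumption as before and again with made-up names rather than the slide's exact code:

```c
#include <omp.h>
#include <string.h>

#define NUM_BUCKETS 256

void histogram_local(const unsigned char *data, long n, long *histogram)
{
    #pragma omp parallel            /* the outer {} creates (and later joins) the threads */
    {
        long local_histogram[NUM_BUCKETS];
        memset(local_histogram, 0, sizeof local_histogram);

        #pragma omp for nowait      /* chunk the input; no barrier before merging */
        for (long i = 0; i < n; i++)
            local_histogram[data[i]]++;   /* private array: no races, no atomics */

        /* the "last for loop": each thread folds its result into the shared histogram */
        for (int b = 0; b < NUM_BUCKETS; b++) {
            #pragma omp atomic
            histogram[b] += local_histogram[b];
        }
    }   /* implicit join here */
}
```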
Notice that a reduction would be even better here. Having threads combine their results in pairs, then combining those pairs, gives a much better speedup.
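For comparison, a hedged sketch of a reduction version. It assumes OpenMP 4.5+ array-section reductions; each thread gets a private copy of the histogram and the runtime owns the combining when the loop ends:

```c
#include <omp.h>

#define NUM_BUCKETS 256

void histogram_reduction(const unsigned char *data, long n, long *histogram)
{
    /* Each thread gets a private copy of histogram[0..NUM_BUCKETS-1];
     * the runtime merges the copies at the end of the loop. */
    #pragma omp parallel for reduction(+ : histogram[:NUM_BUCKETS])
    for (long i = 0; i < n; i++)
        histogram[data[i]]++;
}
```

The semantics match the hand-written local-histogram version; the difference is that the combining step is delegated to the OpenMP runtime.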
Also notice the `local_histogram[111][num_buckets]`: the odd first dimension makes sure that no two threads share the same cache line, so each thread's accesses to `local_histogram` stay within its own cache lines. You can get this effect by using a "weird" size for your arrays or pieces of data. For example, a common cache-related length is 128, so a common size to try is 128 + 1 = 129; if the 111 were changed to 129, that would likely also work here.
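A sketch of the padding idea in a general form (this is the trick in general, not the slide's exact `[111]` layout; `MAX_THREADS`, `PAD`, and the 64-byte line size are assumptions):

```c
#define NUM_BUCKETS 100     /* stand-in for num_buckets */
#define MAX_THREADS 64      /* assumed upper bound on thread count */
#define PAD 16              /* 16 ints = 64 bytes = one typical cache line */

/* Thread t only touches local_histogram[t][0 .. NUM_BUCKETS-1].  The PAD
 * unused entries at the end of each row keep the bytes thread t writes at
 * least a full cache line away from the bytes thread t+1 writes, so two
 * threads can't end up in the same 64-byte line even without alignment. */
static int local_histogram[MAX_THREADS][NUM_BUCKETS + PAD];
```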
Doing it Again
Consider:
![[3.1OpenMP-PartII.pdf#page=18]]
Notice the only difference is the `__declspec(align(64)) int local_histogram[...]`. The `align(64)` is an MSVC-specific attribute (GCC/Clang use `__attribute__((aligned(64)))`, and C11 has `alignas`) that aligns the array to a 64-byte boundary. With the alignment in place, instead of `[111]` we can use `[num_threads+1][num_buckets]`. This forces there to be no false sharing, without having to allocate a bunch of empty space.
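For reference, a sketch of the same declaration in other toolchains; the slide's `__declspec` form is MSVC syntax, while the alternatives below are GCC/Clang and standard C11 (the sizes here are assumed, not the slide's):

```c
#include <stdalign.h>   /* C11: alignas */

#define NUM_BUCKETS 128
#define NUM_THREADS 8   /* assumed fixed thread count for the sketch */

/* MSVC (as on the slide):
 *   __declspec(align(64)) int local_histogram[NUM_THREADS + 1][NUM_BUCKETS];
 * GCC / Clang:
 *   int local_histogram[NUM_THREADS + 1][NUM_BUCKETS] __attribute__((aligned(64)));
 * Standard C11: */
static alignas(64) int local_histogram[NUM_THREADS + 1][NUM_BUCKETS];
```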
Also, the new pragma has all the threads synchronize before the last `for` loop, and then only one thread does that last loop. The next version fixes some of these synchronization issues, but:
![[3.1OpenMP-PartII.pdf#page=19]]
There's a race condition on the bottom for loop, namely on writing to the global histogram.
In Summary
Atomic operations are expensive, but they are fundamental building blocks. Also, we should only synchronize when we need to: try to be correct before trying to make it faster. We also have to consider hardware details such as the cache.
![[3.1OpenMP-PartII.pdf#page=21]]
A Final Look at a Program
Look at:
![[3.1OpenMP-PartII.pdf#page=24]]
Notice that:
- The `double`s may experience false sharing, so we could align them to the nearest 64 or 128 bytes.
- The code is actually correct. Each thread acts independently on `i`, so thread 0 has `i = 0, 8, ...` and so on.
- Adding `#pragma omp for` on the innermost `for` doesn't give you a speedup. You can chunk per thread, but then the extra threads (the "threads within the threads") have to synchronize their chunks, which would likely be a slow-down.
You can actually take an alternative approach where each thread works on a "tile" and then synchronizes its output.
Another thing: if you go across the columns first, then move on to the next row (matching C's row-major layout), you get more spatial locality, so you're getting more cache hits than misses.
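A small illustration of that loop-order point (assuming a row-major C array; `N`, `a`, and the function names are made up):

```c
#define N 1024

static double a[N][N];

/* Inner loop walks along a row: consecutive j touches consecutive memory
 * (row-major), so each cache line is fully used -> mostly cache hits. */
double sum_row_order(void)
{
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Inner loop walks down a column: each access jumps N doubles ahead,
 * touching a new cache line almost every time -> mostly cache misses. */
double sum_column_order(void)
{
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}
```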