The usual advice these days is to remove all SW prefetch from old code, and only consider putting it back in if profiling shows cache misses and you're not saturating memory bandwidth. Prefetching both sides of the next step of a binary search can still help.
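For instance, here is a minimal sketch of that binary-search trick (my illustration, not code from any of the linked answers), assuming GCC/Clang's __builtin_prefetch and an array much larger than the caches:

```c
#include <stddef.h>

/* Prefetch both elements the *next* iteration might probe, so whichever way
 * the comparison goes, the line is hopefully already on its way.  Only worth
 * it when the array is far bigger than the caches.                          */
size_t search_prefetch(const int *a, size_t n, int key)
{
    size_t lo = 0, hi = n;                       /* half-open range [lo, hi) */
    while (hi - lo > 1) {
        size_t mid = lo + (hi - lo) / 2;
        __builtin_prefetch(&a[lo + (mid - lo) / 2]);   /* next mid if we go left  */
        __builtin_prefetch(&a[mid + (hi - mid) / 2]);  /* next mid if we go right */
        if (a[mid] <= key)
            lo = mid;
        else
            hi = mid;
    }
    return lo;            /* index of the last element <= key (0 if none) */
}
```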
The suggestion (section 6) to use a separate prefetch thread is also outdated. Modern CPUs (Sandybridge-family and Ryzen) are much beefier and should either run a real thread or not use hyperthreading (leave the other logical core idle so the solo thread has the full resources, instead of partitioning the ROB).
Software prefetch has always been "brittle": the right magic tuning numbers to get a speedup depend on the details of the hardware, and maybe system load.
Too early and it's evicted before the demand load. Too late and it doesn't help. See also How to properly use prefetch instructions?
If you need every last drop of performance and you can tune for a specific machine, SW prefetch is worth looking at for sequential access, but it may still be a slowdown if you have enough ALU work to do while coming close to bottlenecking on memory.

Cache line size is still 64 bytes. See How can cache be that fast? Unaligned load instructions have zero penalty if the address is aligned at runtime, but compilers (especially gcc) make better code when autovectorizing if they know about any alignment guarantees.
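As a hedged illustration of that alignment point (my example, not the article's), assuming GCC/Clang's __builtin_assume_aligned and callers that really do pass 64-byte-aligned buffers:

```c
#include <stddef.h>

/* Telling gcc/clang about alignment so the autovectorizer can use aligned
 * accesses and skip the "just in case" scalar prologue.  64 here matches the
 * cache-line size mentioned above; passing a pointer that is NOT actually
 * 64-byte aligned would be the caller's bug (undefined behaviour).          */
void scale(float *dst, const float *src, size_t n, float factor)
{
    float       *d = __builtin_assume_aligned(dst, 64);
    const float *s = __builtin_assume_aligned(src, 64);

    for (size_t i = 0; i < n; i++)   /* plain loop; the compiler vectorizes it */
        d[i] = s[i] * factor;
}
```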
As Ulrich predicted, every multi-socket system is NUMA these days: integrated memory controllers are standard, i.e. there is no external northbridge. Skylake-X (AVX512) no longer has an inclusive L3, but I think there's still a tag directory that lets it check what's cached anywhere on chip (and if so where) without actually broadcasting snoops to all the cores. SKX uses a mesh rather than a ring bus, with generally even worse latency than previous many-core Xeons, unfortunately. Basically all of the advice about optimizing memory placement still applies; just the details of exactly what happens when you can't avoid cache misses or contention vary.
But in real multi-threaded programs, synchronization is kept to a minimum because it's expensive, so contention is low and a CAS-retry loop usually succeeds without having to retry. The hardware transactional memory (TSX) design is still what David Kanter described for Haswell. There's a lock-elision way to use it to speed up code that uses (and can fall back to) a regular lock (especially with a single lock for all elements of a container, so multiple threads in the same critical section often don't collide), or to write code that knows about transactions directly.
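To make the CAS-retry loop mentioned above concrete, here is a minimal sketch using C11 <stdatomic.h>; the function and its "cap" semantics are illustrative, not anything from the article:

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically add to a shared counter only while it stays below a cap.
 * Under low contention the compare_exchange almost always succeeds on the
 * first pass, so the "retry loop" rarely loops.                           */
_Bool add_if_below(_Atomic uint64_t *ctr, uint64_t delta, uint64_t cap)
{
    uint64_t old = atomic_load_explicit(ctr, memory_order_relaxed);
    do {
        if (old + delta > cap)
            return 0;              /* would exceed the cap: give up */
        /* On failure, 'old' is reloaded with the current value and we retry. */
    } while (!atomic_compare_exchange_weak_explicit(
                 ctr, &old, old + delta,
                 memory_order_acq_rel, memory_order_relaxed));
    return 1;
}
```

On x86, gcc and clang compile the compare-exchange down to a lock cmpxchg, which is where the "usually succeeds without retrying" observation comes from.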
Update: Intel has since disabled lock elision on later CPUs, including Skylake, with a microcode update.

A 2MiB-aligned anonymous allocation will use transparent hugepages by default. Some workloads benefit significantly; see the kernel docs. Linux uses "hugepage" for everything larger than the standard page size.
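A hedged sketch of asking for transparent hugepages explicitly, assuming a Linux system where THP is at least in "madvise" mode; the mapping size and touch loop are just for illustration (compile as ordinary C):

```c
#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t len = (size_t)1 << 30;            /* 1 GiB anonymous mapping */
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Ask the transparent-hugepage machinery to back this range with 2MiB
     * pages; harmless no-op (or an error to ignore) where THP is disabled. */
    if (madvise(p, len, MADV_HUGEPAGE) != 0)
        perror("madvise(MADV_HUGEPAGE)");

    /* Touch the memory so it actually gets faulted in. */
    for (size_t i = 0; i < len; i += 4096)
        ((volatile char *)p)[i] = 1;

    munmap(p, len);
    return 0;
}
```

Whether it actually worked can be checked via the AnonHugePages counters in /proc/meminfo or the process's smaps.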
On 32-bit x86, the next size up from 4MiB large pages would have been 4G, but that's the whole address space, and that "level" of translation is the CR3 control register, not a page-directory entry. IDK if that's related to Linux's terminology.

Appendix B (Oprofile): Linux perf has mostly superseded oprofile.
A few years ago, the ocperf.py wrapper was needed to use CPU-specific performance-counter events by their symbolic names. For some examples of using it, see Can x86's MOV really be "free"? Why can't I reproduce this at all?

As far as I remember, Drepper's content describes fundamental concepts about memory: how the CPU cache works, what physical and virtual memory are, and how the Linux kernel deals with that zoo.
Probably there are outdated API references in some examples, but it doesn't matter; that won't affect the relevance of the fundamental concepts. So, any book or article that describes something fundamental cannot be called outdated.

From my quick glance-through it looks quite accurate. The one thing to notice is the portion on the difference between "integrated" and "external" memory controllers. Since this article was written, speeds have gotten higher and the memory controllers have gotten much more intelligent (the i7 will delay writes to RAM until it feels like committing the changes), but not a whole lot has changed.
At least not in any way that a software developer would care about.

For example, we run ML models on our Hadoop cluster which take hours to run distributed across the cluster's nodes.
Let me know when this can be done on a "smallish instance."

I don't work in big data, but here's what I think of your field anyway :-) In general, you optimize software by starting from the slowest things first, and then working your way to the faster things when you run out of things to optimize. And that makes sense, for sure. The thing is: the next level of optimization is not CPU optimization, but memory optimization.
In fact, all CPU-based optimization starts at the main-memory level, because memory is slower than virtually everything inside the CPU core itself. That means memory-level optimization is a far more important skill than any other CPU-based optimization technique.
As such, optimizing RAM access is the most logical next step forward when your programs are running poorly.

I work with scripting languages only and regularly hit this scenario, so I won't say this is completely useless. When you analyze a few hundred 50GB files for specific patterns, you have to go line by line; in those cases, shaving off milliseconds and optimizing how data is accessed becomes valuable.

I've seen code written by a Python web-programmer that, for a given binary matrix of teams and tournaments, produced the top 10 rivals for each team, where rivals are defined as teams that have picked the same tournaments to participate in as your team.
There are tens of thousands of teams and thousands of tournaments, and his code was taking two days to precalculate the results for every team. Rewritten with even a minimal understanding of how slow memory is, the new code takes less than a minute. There was no point in optimizing it further; the algorithmic changes were enough.
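A purely hypothetical sketch (not the poster's code) of what a memory-friendly layout for that problem could look like: pack each team's tournament participation into one contiguous bitset row, so counting shared tournaments for a pair of teams becomes an AND-plus-popcount pass over sequential memory rather than pointer-chasing through per-object structures:

```c
#include <stddef.h>
#include <stdint.h>

/* One packed bitset row per team: bit t is set if the team entered
 * tournament t.  All rows live in one flat allocation, so scoring a pair of
 * teams streams through memory instead of chasing scattered objects.        */
typedef struct {
    size_t   n_teams;
    size_t   words_per_team;      /* = ceil(n_tournaments / 64) */
    uint64_t *bits;               /* n_teams * words_per_team words, row-major */
} participation;

/* Number of tournaments both team a and team b entered. */
static unsigned shared_tournaments(const participation *p, size_t a, size_t b)
{
    const uint64_t *ra = p->bits + a * p->words_per_team;
    const uint64_t *rb = p->bits + b * p->words_per_team;
    unsigned shared = 0;
    for (size_t w = 0; w < p->words_per_team; w++)
        shared += (unsigned)__builtin_popcountll(ra[w] & rb[w]);
    return shared;
}
```

Picking each team's top 10 is then just a linear pass over these scores; with thousands of tournaments each row is only a few hundred bytes, so whole rows stream through the cache.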
Sure, you don't really need to know any of this stuff unless you actually need to optimize code beyond the lowest-hanging fruit. My list presupposes you've hit a wall and need the best performance you can get. That's not always the case, but I certainly wouldn't say that these optimizations don't make a difference in the "real world of software." There could be more code in a single AAA videogame than in the entire Amazon infrastructure.
How do you get that "experienced" feel for which to choose when you're designing your program? Milliseconds matter when the load gets beyond a certain point and your system ends up full of stragglers. How do you get that "experienced" feel for where that point might be and how far you can push a particular architecture without investing in a major rework? Lean, easy-to-manage setups can also sometimes let you afford extra personnel to build stuff faster. There's definitely a cut-off where micro-optimizations wouldn't be necessary.
A lot of efficiency gains are simple, though. Just gotta consistently use what you learn.

Good list. I'd add "Virtual Memory" to that list. Although it's x86-specific, I'd also add that x86-64 has 64-bit pointers but only 48-bit virtual addresses: the top bits are basically ignored by the current virtual memory system. There are lots of things to do with virtual memory. And anyone who actually reads profiler data needs to understand what the heck that TLB Cache Hits performance counter means. False sharing is something you only understand after you understand those other two concepts.
When one core holds a cache line in the Modified state, a 2nd CPU core that attempts to gain access cannot do so until the 1st core releases control by writing the data back and setting its copy of the line to the Invalid state.

I don't think you need to explain the intricacies of the MESI protocol — just explaining the fact that caches need to be consistent is quite sufficient.
Perhaps throw in why they must be consistent. It then becomes clear that the cores need to communicate somehow to maintain this consistency if they're touching data within the same cache line.

MESI isn't really that complicated. Cache lines are either Exclusive (owned), Shared, Invalid, or Modified. CPU cores communicate to each other which lines are owned or unowned, and that's how the caches stay coherent.
If a CPU core wants to change a cache line owned by another core, it has to wait until the line is closed (set to the "Invalid" state) by the other core. I think it's easier to explain cache coherence through MESI, rather than to abstractly just say "caches are coherent".
Which is "wrong", but its "correct enough" to explain the concept. That's what I mean by an abstraction, no CPU today actually does MESI, its simply a concept to introduce to solidify the student's understanding of cache coherency. Its close enough to reality without getting into the tricky CPU-specific details of the real world.
But the system will literally never use those top bits for anything. So some highly optimized code stores data in those top bits and then zeros them out before using the pointer.
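A hypothetical sketch of such pointer tagging, assuming current x86-64's 48-bit virtual addresses; the helper names are made up, and the trick breaks on kernel-half addresses and on hardware/OS combinations that widen the address space (5-level paging):

```c
#include <stdint.h>

/* Stash a 16-bit tag in the top bits of a 64-bit pointer.  Relies on today's
 * 48-bit virtual addresses with the top bits clear for user-space pointers. */
#define TAG_SHIFT 48
#define ADDR_MASK (((uint64_t)1 << TAG_SHIFT) - 1)

static inline uintptr_t tag_ptr(void *p, uint16_t tag)
{
    return ((uintptr_t)p & ADDR_MASK) | ((uintptr_t)tag << TAG_SHIFT);
}

static inline uint16_t ptr_tag(uintptr_t tagged)
{
    return (uint16_t)(tagged >> TAG_SHIFT);
}

static inline void *untag_ptr(uintptr_t tagged)
{
    /* "Zero them out before using": mask the tag off to recover a pointer
     * that is safe to dereference again.                                   */
    return (void *)(uintptr_t)(tagged & ADDR_MASK);
}
```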
IIRC, Lisp machines and various interpreters.

Which pointers exactly? All pointers in user-space are 64 bits. All 64-bit pointers are translated by the page-directory virtual memory system into a real physical location, using only the low 48 bits. There's an extension to use 52 or 57 bits (I forget exactly), but I don't think it's actually been implemented on any CPU yet.
Yeah, I knew it was a weird number. But I guessed wrong earlier.

Oh, I see what you are saying, but that's not really just a pointer or a userspace thing; that 48-bit limit is simply imposed by the x86 CPU vendors, no? With 48 bits you can still address 256 TB of memory. I'm guessing that from a practical and financial point of view it probably made little sense for vendors to build a CPU that enabled addressing the full 64 bits.
At least for now.

Maybe it's because I'm a physicist and not a software engineer, but it's nice to know how things work all the way down, even if, in the end, a phenomenological model is all you need to do your work.

It depends. I find that most programmers not knowing some of this happens to be the bane of most performance and reliability issues.
Take Java for example - pretty sure the Java dev needs to know how to optimize their JVM memory settings, etc. They would need to know about direct memory, etc. The nastier your traffic profile, the more important the tuning described in that document becomes. Like all good "books", I don't remember all of it, but I keep coming back to refer to it.

I'm sure some programmers don't need to know this. For example, an Erlang or Haskell programmer can't do anything about memory layout anyway, so this knowledge would be of no immediate practical use.
Knowing memory layout and behaviour enables a programmer to invent better ways to do things. Of course those algorithms are very application-specific, but at least it increases the solution space. The Erlangers' mantra here is "profile this", so that you know which one is actually better.
Not exactly. There are plenty of ways of looking at GC, and controlling allocation and lifetimes in Haskell is definitely possible. And when necessary, the C FFI is useful for writing small bits of code that take advantage of particular layouts, and driving them from Haskell code. See [0].

No, it isn't supposed to be ironic. The title says "should," after all. Who says every programmer should know these things about memory? Obviously not you.
Ulrich Drepper does - I bet if you asked him, he would say everyone should know these things, but would concede that almost no programmers do and most programmers don't need to.

I work in frontend after switching gears from embedded systems a few years back. Knowing some level of detail about how computers work is invaluable at all layers of the stack: I can make informed trade-offs between practical performance of code running on an actual computer and the cost of high-level language concerns and features.
Can't agree more. If I went to a software job and started reading that stuff I'd get moved into hardware :) I suppose that's a credit to all the engineers who build all the middle layers that allow software engineers to float along at an abstract and more productive level.

Every programmer should know something about memory, because software is becoming bloated [1] [2] because of a new generation of programmers who don't optimize their code.
It really all comes down to what variable one is trying to optimize. I noticed neither article mentioned money as a possible variable to optimize.

That's true, but I don't see how it's relevant to this article. We can simply tell people "Write your programs so they use fewer resources, and they will run faster".
The paper explains, for example, how the column address is transmitted by making it available on the address bus and lowering the CAS line. But regardless of whether the row and column addresses are sent on the same bus or on different buses, the optimization strategies for programmers are exactly the same. The same goes for almost all of the information here. I mean that literally: it's over 100 pages long, and I think the important and relevant points for most programmers could be summarized in a page or two.
Most programmers are working for a business whose goal is to make money. Usually that involves adding more features or producing the product faster (as in development time).