CPU caching understanding

I tested the following code with the Google Benchmark framework to measure memory access latency for different array sizes:



int64_t MemoryAccessAllElements(const int64_t *data, size_t length) {
    for (size_t id = 0; id < length; id++) {
        // volatile forces an actual load that the optimizer cannot remove
        volatile int64_t ignored = data[id];
    }
    return 0;
}

int64_t MemoryAccessEvery4th(const int64_t *data, size_t length) {
    for (size_t id = 0; id < length; id += 4) {
        volatile int64_t ignored = data[id];
    }
    return 0;
}



I get the following results (averaged by Google Benchmark; for big arrays there are about ~10 iterations, and much more work is performed for smaller ones):



[Plot: benchmark results]



A lot of different things are happening in this picture, and unfortunately I can't explain all of the changes in the graph.



I tested this code on a single-core CPU with the following cache configuration:



CPU Caches: 
L1 Data 32K (x1), 8 way associative
L1 Instruction 32K (x1), 8 way associative
L2 Unified 256K (x1), 8 way associative
L3 Unified 30720K (x1), 20 way associative
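
For reference, this summary is what Google Benchmark prints at startup; on Linux the same information can be read from sysfs. A minimal sketch (standard sysfs paths assumed; adjust cpu0 to whichever core the benchmark is pinned to):

#include <fstream>
#include <iostream>
#include <string>

// Walk the cache hierarchy of cpu0 as exposed by Linux sysfs.
int main() {
    for (int index = 0;; index++) {
        std::string dir = "/sys/devices/system/cpu/cpu0/cache/index"
                          + std::to_string(index) + "/";
        std::ifstream level(dir + "level"), type(dir + "type"),
                      size(dir + "size"), ways(dir + "ways_of_associativity");
        if (!level.is_open()) break;  // no more cache levels
        std::string l, t, s, w;
        level >> l; type >> t; size >> s; ways >> w;
        std::cout << "L" << l << " " << t << " " << s
                  << ", " << w << "-way associative\n";
    }
}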


In this picture we can see many changes in the graph's behavior:



  1. There is a spike after 64 bytes of array size. It can be explained by the fact that the cache line is 64 bytes long: with an array larger than 64 bytes we experience one more L1 cache miss (which can be categorized as a compulsory miss). The line size itself can be verified with the sketch after this list.

  2. Also, the latency increases near the cache-size boundaries, which is also easy to explain: at these points we experience capacity misses.
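
A quick way to confirm the 64-byte line size from point 1 (a minimal sketch; std::hardware_destructive_interference_size is C++17 and only available where the standard library defines it):

#include <iostream>
#include <new>

int main() {
    // C++17 hint for the cache line size; typically 64 on x86.
    // If your standard library does not define it, the value can also be
    // read from /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size.
    std::cout << "cache line: "
              << std::hardware_destructive_interference_size << " bytes\n";
}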

But there are a lot of questions about the results that I can't explain:



  1. Why does the latency of MemoryAccessEvery4th decrease after the array exceeds ~1024 bytes?

  2. Why is there another peak for MemoryAccessAllElements around 512 bytes? It is an interesting point, because at that size we start to access more than one set of cache lines (8 * 64 bytes in one set). But is it really caused by this event, and if so, how can it be explained?

  3. Why does the latency increase after passing the L2 cache size when benchmarking MemoryAccessEvery4th, while there is no such change for MemoryAccessAllElements?

I've tried to compare my results with the results from the gallery of processor cache effects and What Every Programmer Should Know About Memory, but I can't fully describe my results using the reasoning from those articles.



Can someone help me understand the internal processes of CPU caching?



UPD:
I use the following code to measure the performance of memory access:



#include <benchmark/benchmark.h>
#include <cstdint>

using namespace benchmark;

// Random is a helper class from the author's repository (not part of Google Benchmark).
void InitializeWithRandomNumbers(int64_t *array, size_t length) {
    auto random = Random(0);
    for (size_t id = 0; id < length; id++) {
        array[id] = static_cast<int64_t>(random.NextLong(0, 1LL << 60));
    }
}

static void MemoryAccessAllElements_Benchmark(State &state) {
    size_t size = static_cast<size_t>(state.range(0));
    // int64_t matches the signature of MemoryAccessAllElements above
    auto array = new int64_t[size];
    InitializeWithRandomNumbers(array, size);
    for (auto _ : state) {
        DoNotOptimize(MemoryAccessAllElements(array, size));
    }
    delete[] array;  // must be delete[], not delete, for an array allocation
}

static void CustomizeBenchmark(benchmark::internal::Benchmark *benchmark) {
    for (int size = 2; size <= (1 << 24); size *= 2) {
        benchmark->Arg(size);
    }
}

BENCHMARK(MemoryAccessAllElements_Benchmark)->Apply(CustomizeBenchmark);
BENCHMARK_MAIN();
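
Only MemoryAccessAllElements is registered above; presumably the strided variant is benchmarked the same way. A minimal companion sketch, reusing the harness:

static void MemoryAccessEvery4th_Benchmark(State &state) {
    size_t size = static_cast<size_t>(state.range(0));
    auto array = new int64_t[size];
    InitializeWithRandomNumbers(array, size);
    for (auto _ : state) {
        DoNotOptimize(MemoryAccessEvery4th(array, size));
    }
    delete[] array;
}
BENCHMARK(MemoryAccessEvery4th_Benchmark)->Apply(CustomizeBenchmark);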


You can find slightly different examples in the repository, but the basic approach for the benchmark in the question is the same.










c++ caching optimization cpu

asked Nov 11 at 13:12 by Nikita Sivukhin, edited Nov 20 at 12:24

  • The cache hierarchy is not all you need to consider. The CPU's prefetcher is also important.
    – Jesper Juhl, Nov 11 at 13:31

  • @JesperJuhl Yes, I know a little about it. But I thought that the prefetcher mainly affects the absolute latency values (instead of paying the L2 hit latency, the prefetcher lets us access memory faster because the fetch into the cache is performed earlier). But I can't understand how the prefetcher can explain the difference in the derivatives of the graphs.
    – Nikita Sivukhin, Nov 11 at 13:44










  • I didn't say it could explain anything. I merely pointed out that it is important and does influence your numbers. Just information. No explanation.
    – Jesper Juhl, Nov 11 at 13:48










  • The y-axis is basically latency per byte, right? In the initial phase the latency per byte will be higher if only the first bytes of a cache line are accessed, because of the probable miss (after the first access misses, the rest go to the cache directly). This seems to repeat 2 times; the third spike (miss) is probably enough for the CPU to trigger the L1 prefetchers, and the latency/B normalizes. When we get into the L2 zone, the L2 prefetchers can keep up with the 4-element stride of the MemoryAccessEvery4th function. After that the L3 zone is normalized again ...
    – Margaret Bloom, Nov 11 at 16:10






  • @Nikita - I'm willing to analyze, but you need the full program, since the small size effects may be related to the surrounding code.
    – BeeOnRope, Nov 20 at 3:42
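
Regarding the y-axis units raised in the comments: one way to get unambiguous per-byte numbers is to let Google Benchmark compute the throughput itself via State::SetBytesProcessed. A minimal sketch on top of the harness above (MemoryAccessAllElements_Bytes is a hypothetical name):

static void MemoryAccessAllElements_Bytes(State &state) {
    size_t size = static_cast<size_t>(state.range(0));
    auto array = new int64_t[size];
    InitializeWithRandomNumbers(array, size);
    for (auto _ : state) {
        DoNotOptimize(MemoryAccessAllElements(array, size));
    }
    // Report throughput so the output shows bytes/second alongside time.
    state.SetBytesProcessed(int64_t(state.iterations()) * int64_t(size) * sizeof(int64_t));
    delete[] array;
}
BENCHMARK(MemoryAccessAllElements_Bytes)->Apply(CustomizeBenchmark);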













