CPU caching understanding

I tested the following code with the Google Benchmark framework to measure memory access latency for different array sizes:



int64_t MemoryAccessAllElements(const int64_t *data, size_t length) {
    for (size_t id = 0; id < length; id++) {
        // volatile forces an actual load that the optimizer cannot remove
        volatile int64_t ignored = data[id];
    }
    return 0;
}

int64_t MemoryAccessEvery4th(const int64_t *data, size_t length) {
    for (size_t id = 0; id < length; id += 4) {
        volatile int64_t ignored = data[id];
    }
    return 0;
}



I get the following results (averaged by Google Benchmark; for big arrays there are about ~10 iterations, and much more work is performed for smaller ones):



[Plot: benchmark results]



A lot of different things are happening in this picture, and unfortunately I can't explain all of the changes in the graph.



I tested this code on a single-core CPU with the following cache configuration:



CPU Caches: 
L1 Data 32K (x1), 8 way associative
L1 Instruction 32K (x1), 8 way associative
L2 Unified 256K (x1), 8 way associative
L3 Unified 30720K (x1), 20 way associative
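
For reference, this summary is what Google Benchmark prints at startup; on Linux the same information can be read from sysfs. A minimal sketch (standard sysfs paths assumed; adjust cpu0 to whichever core the benchmark is pinned to):

#include <fstream>
#include <iostream>
#include <string>

// Walk the cache hierarchy of cpu0 as exposed by Linux sysfs.
int main() {
    for (int index = 0;; index++) {
        std::string dir = "/sys/devices/system/cpu/cpu0/cache/index"
                          + std::to_string(index) + "/";
        std::ifstream level(dir + "level"), type(dir + "type"),
                      size(dir + "size"), ways(dir + "ways_of_associativity");
        if (!level.is_open()) break;  // no more cache levels
        std::string l, t, s, w;
        level >> l; type >> t; size >> s; ways >> w;
        std::cout << "L" << l << " " << t << " " << s
                  << ", " << w << "-way associative\n";
    }
}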


In this picture we can see many changes in the graph's behavior:



  1. There is a spike after 64 bytes of array size. It can be explained by the fact that the cache line is 64 bytes long: with an array larger than 64 bytes we experience one more L1 cache miss (which can be categorized as a compulsory miss). The line size itself can be verified with the sketch after this list.

  2. Also, the latency increases near the cache-size boundaries, which is also easy to explain: at these points we experience capacity misses.
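
A quick way to confirm the 64-byte line size from point 1 (a minimal sketch; std::hardware_destructive_interference_size is C++17 and only available where the standard library defines it):

#include <iostream>
#include <new>

int main() {
    // C++17 hint for the cache line size; typically 64 on x86.
    // If your standard library does not define it, the value can also be
    // read from /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size.
    std::cout << "cache line: "
              << std::hardware_destructive_interference_size << " bytes\n";
}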

But there are a lot of questions about the results that I can't explain:



  1. Why does the latency of MemoryAccessEvery4th decrease after the array exceeds ~1024 bytes?

  2. Why is there another peak for MemoryAccessAllElements around 512 bytes? It is an interesting point, because at that size we start to access more than one set of cache lines (8 * 64 bytes in one set). But is it really caused by this event, and if so, how can it be explained?

  3. Why does the latency increase after passing the L2 cache size when benchmarking MemoryAccessEvery4th, while there is no such change for MemoryAccessAllElements?

I've tried to compare my results with the results from the gallery of processor cache effects and What Every Programmer Should Know About Memory, but I can't fully describe my results using the reasoning from those articles.



Can someone help me understand the internal processes of CPU caching?



UPD:
I use the following code to measure the performance of memory access:



#include <benchmark/benchmark.h>
#include <cstdint>

using namespace benchmark;

// Random is a helper class from the author's repository (not part of Google Benchmark).
void InitializeWithRandomNumbers(int64_t *array, size_t length) {
    auto random = Random(0);
    for (size_t id = 0; id < length; id++) {
        array[id] = static_cast<int64_t>(random.NextLong(0, 1LL << 60));
    }
}

static void MemoryAccessAllElements_Benchmark(State &state) {
    size_t size = static_cast<size_t>(state.range(0));
    // int64_t matches the signature of MemoryAccessAllElements above
    auto array = new int64_t[size];
    InitializeWithRandomNumbers(array, size);
    for (auto _ : state) {
        DoNotOptimize(MemoryAccessAllElements(array, size));
    }
    delete[] array;  // must be delete[], not delete, for an array allocation
}

static void CustomizeBenchmark(benchmark::internal::Benchmark *benchmark) {
    for (int size = 2; size <= (1 << 24); size *= 2) {
        benchmark->Arg(size);
    }
}

BENCHMARK(MemoryAccessAllElements_Benchmark)->Apply(CustomizeBenchmark);
BENCHMARK_MAIN();
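
Only MemoryAccessAllElements is registered above; presumably the strided variant is benchmarked the same way. A minimal companion sketch, reusing the harness:

static void MemoryAccessEvery4th_Benchmark(State &state) {
    size_t size = static_cast<size_t>(state.range(0));
    auto array = new int64_t[size];
    InitializeWithRandomNumbers(array, size);
    for (auto _ : state) {
        DoNotOptimize(MemoryAccessEvery4th(array, size));
    }
    delete[] array;
}
BENCHMARK(MemoryAccessEvery4th_Benchmark)->Apply(CustomizeBenchmark);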


You can find slightly different examples in the repository, but the basic approach for the benchmark in the question is the same.










c++ caching optimization cpu

asked Nov 11 at 13:12 by Nikita Sivukhin, edited Nov 20 at 12:24

  • The cache hierarchy is not all you need to consider. The CPU's prefetcher is also important.
    – Jesper Juhl, Nov 11 at 13:31

  • @JesperJuhl Yes, I know a little about it. But I thought that the prefetcher mainly affects the absolute latency values (instead of paying the L2 hit latency, the prefetcher lets us access memory faster because the fetch into the cache is performed earlier). But I can't understand how the prefetcher can explain the difference in the derivatives of the graphs.
    – Nikita Sivukhin, Nov 11 at 13:44










  • I didn't say it could explain anything. I merely pointed out that it is important and does influence your numbers. Just information. No explanation.
    – Jesper Juhl, Nov 11 at 13:48










  • The y-axis is basically latency per byte, right? In the initial phase the latency per byte will be higher if only the first bytes of a cache line are accessed, because of the probable miss (after the first access misses, the rest go to the cache directly). This seems to repeat 2 times; the third spike (miss) is probably enough for the CPU to trigger the L1 prefetchers, and the latency/B normalizes. When we get into the L2 zone, the L2 prefetchers can keep up with the 4-element stride of the MemoryAccessEvery4th function. After that the L3 zone is normalized again ...
    – Margaret Bloom, Nov 11 at 16:10






  • @Nikita - I'm willing to analyze, but you need the full program, since the small size effects may be related to the surrounding code.
    – BeeOnRope, Nov 20 at 3:42
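
Regarding the y-axis units raised in the comments: one way to get unambiguous per-byte numbers is to let Google Benchmark compute the throughput itself via State::SetBytesProcessed. A minimal sketch on top of the harness above (MemoryAccessAllElements_Bytes is a hypothetical name):

static void MemoryAccessAllElements_Bytes(State &state) {
    size_t size = static_cast<size_t>(state.range(0));
    auto array = new int64_t[size];
    InitializeWithRandomNumbers(array, size);
    for (auto _ : state) {
        DoNotOptimize(MemoryAccessAllElements(array, size));
    }
    // Report throughput so the output shows bytes/second alongside time.
    state.SetBytesProcessed(int64_t(state.iterations()) * int64_t(size) * sizeof(int64_t));
    delete[] array;
}
BENCHMARK(MemoryAccessAllElements_Bytes)->Apply(CustomizeBenchmark);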













