CPU caching understanding
I tested the following code with the Google Benchmark framework to measure memory access latency for different array sizes:
int64_t MemoryAccessAllElements(const int64_t *data, size_t length) {
    for (size_t id = 0; id < length; id++) {
        // volatile keeps the load from being optimized away
        volatile int64_t ignored = data[id];
    }
    return 0;
}
int64_t MemoryAccessEvery4th(const int64_t *data, size_t length) {
    for (size_t id = 0; id < length; id += 4) {
        volatile int64_t ignored = data[id];
    }
    return 0;
}
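The two routines differ only in their stride, so, purely as an illustrative sketch (the helper name MemoryAccessWithStride is mine and does not appear in the original code), they could be folded into one parameterized function:

// Hypothetical generalization of the two functions above: touch every
// `stride`-th element of the array. Requires <cstdint> and <cstddef>.
int64_t MemoryAccessWithStride(const int64_t *data, size_t length, size_t stride) {
    for (size_t id = 0; id < length; id += stride) {
        // volatile keeps each load from being optimized away
        volatile int64_t ignored = data[id];
    }
    return 0;
}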
I got the following results (Google Benchmark averages the measurements; for large arrays it runs about 10 iterations, and considerably more work is performed for smaller ones):
A lot is going on in this picture, and unfortunately I can't explain all of the changes in the graph.
I tested this code on a single-core CPU with the following cache configuration:
CPU Caches:
L1 Data 32K (x1), 8 way associative
L1 Instruction 32K (x1), 8 way associative
L2 Unified 256K (x1), 8 way associative
L3 Unified 30720K (x1), 20 way associative
In this picture we can see several changes in the graph's behavior:
- There is a spike after an array size of 64 bytes. This can be explained by the cache line being 64 bytes long: with an array larger than 64 bytes we incur one more L1 cache miss (which can be categorized as a compulsory miss).
- The latency also increases near the cache-size boundaries, which is also easy to explain: at these points we start to experience capacity misses.
But there are several results that I can't explain:
- Why does the latency for MemoryAccessEvery4th decrease after the array exceeds ~1024 bytes?
- Why is there another peak for MemoryAccessAllElements around 512 bytes? It is an interesting point, because at that size we start to access more than one set of cache lines (8 * 64 bytes in one set). But is it really caused by this, and if so, how can it be explained? (See the set-geometry sketch after this list.)
- Why does the latency increase after passing the L2 cache size when benchmarking MemoryAccessEvery4th, while there is no such difference with MemoryAccessAllElements?
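As a sanity check on the set arithmetic from the second question, here is a small standalone sketch (standard set-associative cache geometry with the L1D parameters reported above plugged in; the printed numbers are arithmetic, not measurements):

#include <cstddef>
#include <cstdio>

int main() {
    // Reported L1D: 32 KiB, 8-way associative, 64-byte lines.
    const size_t cache_bytes = 32 * 1024, ways = 8, line_bytes = 64;
    const size_t sets = cache_bytes / (ways * line_bytes);  // 32768 / 512 = 64 sets
    std::printf("L1D: %zu sets, each set holds up to %zu bytes (8 ways x 64 B)\n",
                sets, ways * line_bytes);

    // Lines (and, for a contiguous array, sets) touched by arrays of various sizes.
    for (size_t n : {64, 512, 1024, 32 * 1024}) {
        size_t lines = (n + line_bytes - 1) / line_bytes;
        std::printf("array of %5zu bytes -> %4zu line(s), %2zu set(s)\n",
                    n, lines, lines < sets ? lines : sets);
    }
}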
I've tried to compare my results with those from the Gallery of Processor Cache Effects and What Every Programmer Should Know About Memory, but I can't fully explain my results using the reasoning from those articles.
Can someone help me understand the internal processes of CPU caching?
UPD:
I use the following code to measure the performance of memory access:
#include <benchmark/benchmark.h>

using namespace benchmark;

void InitializeWithRandomNumbers(long long *array, size_t length) {
    // Random / NextLong is a PRNG helper used here; it is not part of Google Benchmark.
    auto random = Random(0);
    for (size_t id = 0; id < length; id++) {
        array[id] = static_cast<long long>(random.NextLong(0, 1LL << 60));
    }
}

static void MemoryAccessAllElements_Benchmark(State &state) {
    size_t size = static_cast<size_t>(state.range(0));
    auto array = new long long[size];
    InitializeWithRandomNumbers(array, size);
    for (auto _ : state) {
        DoNotOptimize(MemoryAccessAllElements(array, size));
    }
    delete[] array;  // array form of delete, since the buffer came from new[]
}

static void CustomizeBenchmark(benchmark::internal::Benchmark *benchmark) {
    for (int size = 2; size <= (1 << 24); size *= 2) {
        benchmark->Arg(size);
    }
}

BENCHMARK(MemoryAccessAllElements_Benchmark)->Apply(CustomizeBenchmark);
BENCHMARK_MAIN();
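The snippet above only registers the MemoryAccessAllElements benchmark. A matching fixture for the strided routine would presumably look like the following (my sketch, mirroring the code above rather than copied from the repository; in a real file it would go before BENCHMARK_MAIN()):

// Hypothetical companion benchmark for the strided routine.
static void MemoryAccessEvery4th_Benchmark(State &state) {
    size_t size = static_cast<size_t>(state.range(0));
    auto array = new long long[size];
    InitializeWithRandomNumbers(array, size);
    for (auto _ : state) {
        // Same setup as above, but exercising the stride-4 access pattern.
        DoNotOptimize(MemoryAccessEvery4th(array, size));
    }
    delete[] array;
}
BENCHMARK(MemoryAccessEvery4th_Benchmark)->Apply(CustomizeBenchmark);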
You can find slightly different examples in the repository, but the basic approach to the benchmark in the question is the same.
c++ caching optimization cpu
asked Nov 11 at 13:12 by Nikita Sivukhin, edited Nov 20 at 12:24
The cache hierarchy is not all you need to consider. The CPU's prefetcher is also important.
– Jesper Juhl
Nov 11 at 13:31
@JesperJuhl Yes, it is; I know a little about it. But I thought the prefetcher mainly affects the absolute latency values (instead of paying the L2 cache hit latency, the prefetcher lets us access memory faster because the fetch into the cache happens earlier). I can't understand how the prefetcher can explain the difference in the derivative of the graphs.
– Nikita Sivukhin
Nov 11 at 13:44
I didn't say it could explain anything. I merely pointed out that it is important and does influence your numbers. Just information. No explanation.
– Jesper Juhl
Nov 11 at 13:48
The y-axis is basically latency per byte, right? In the initial phase the latency per byte will be higher if only the first bytes of a cache line are accessed, because of the probable miss (after the first access misses, the rest hit the cache directly). This seems to repeat twice; the third spike (miss) is probably enough for the CPU to trigger the L1 prefetchers, and the latency/B normalizes. When we get into the L2 zone, the L2 prefetchers can keep up with the 4-element stride of the MemoryAccessEvery4th
function. After that the L3 zone is normalized again ...
– Margaret Bloom
Nov 11 at 16:10
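To put rough numbers on the amortization argument in the comment above, here is a toy model (the latencies are assumed values for illustration, not measurements from this machine): with 64-byte lines and 8-byte elements, a stride-1 scan pays one miss per 8 accesses, while a stride-4 scan pays one miss per 2 accesses.

#include <cstdio>

int main() {
    const double hit_ns = 1.0;    // assumed cost of a cache hit
    const double miss_ns = 10.0;  // assumed extra penalty of a line miss
    const int line_bytes = 64, elem_bytes = 8;

    for (int stride : {1, 4}) {
        int accesses_per_line = line_bytes / (elem_bytes * stride);  // 8 or 2
        double avg = (miss_ns + (accesses_per_line - 1) * hit_ns) / accesses_per_line;
        std::printf("stride %d: ~%.2f ns per access on average\n", stride, avg);
    }
}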
@Nikita - I'm willing to analyze, but you need the full program, since the small size effects may be related to the surrounding code.
– BeeOnRope
Nov 20 at 3:42