Difference between PREFETCH and PREFETCHNTA instructions
The PREFETCHNTA instruction is used to prefetch data from main memory into the caches, but instructions with the NT suffix are known to skip caches and avoid cache pollution. So what does PREFETCHNTA do that the PREFETCH instruction doesn't?
caching assembly x86 prefetch isa
edited Nov 12 '18 at 22:08
asked Nov 12 '18 at 21:33
Abhishek Nikam
1 Answer
prefetchNTA can't bypass caches, only reduce (not avoid) pollution. It can't break cache coherency or violate the memory-ordering semantics of a WB (Write-Back) memory region. (Unlike NT stores, which do fully bypass caches and are weakly-ordered even on normal WB memory.)
On paper, the x86 ISA doesn't specify how it implements the NT hint.
http://felixcloutier.com/x86/PREFETCHh.html says: "NTA (non-temporal data with respect to all cache levels)—prefetch data into non-temporal cache structure and into a location close to the processor, minimizing cache pollution." How any specific CPU microarchitecture chooses to implement that is completely up to the architects.
prefetchNTA from WB memory¹ on Intel CPUs populates L1d normally, allowing later loads to hit in L1d normally (as long as the prefetch distance is large enough that the prefetch completes, and small enough that it isn't evicted again before the demand load). The correct prefetch distance depends on the system and other factors, and can be fairly brittle.
What it does do on Intel CPUs is skip non-inclusive outer caches. So on Intel before Skylake-AVX512, it bypasses L2 and populates L1d + L3. But on SKX it also skips the L3 cache entirely, because there the L3 is smaller and non-inclusive. See Do current x86 architectures support non-temporal loads (from "normal" memory)?
On Intel CPUs with inclusive L3 caches (which it can't bypass), it reduces L3 pollution by being restricted to prefetching into one "way" of the associative inclusive L3 cache. (Which is usually something like 16-way associative, so the total capacity that can be polluted by prefetchnta is only ~1/16th of total L3 size.)
@HadiBrais commented on this answer with some info on AMD CPUs.
Instead of limiting pollution by fetching into only one way of the cache, apparently AMD allocates lines fetched with NT prefetch with a "quick eviction" marking. Probably this means allocating in the LRU position instead of the Most-Recently-Used position. So the next allocation in that set of the cache will evict the line.
Footnote 1: prefetchNTA from WC memory I think prefetches into an LFB (Line Fill Buffer), allowing SSE4.1 movntdqa loads to hit an already-populated LFB. But note that movntdqa from WB memory is not useful.
Are you sure on SKX the instruction skips the L3? According to Section 7.3.2 of the Intel optimization manual, on Nehalem and later it may/must be fetched into the L3. I think the most important property of prefetchnta is not which cache level it gets fetched into, but rather that it's marked in the cache set for quicker eviction. This applies to most or all Intel and AMD processors that support the instruction. The level into which it gets fetched is also important, however, but that is microarchitecture-dependent across Intel and AMD processors.
– Hadi Brais
Nov 13 '18 at 2:17
Fetching it into the L3 is an intuitive design because the L3 is much larger than the L1, especially when real programs may have low hit rates in the prefetched lines and main memory latency is relatively high in modern systems.
– Hadi Brais
Nov 13 '18 at 2:18
According to the AMD optimization manual for the 17h family, Section 2.6.4, prefetchnta fetches the line into the L2 with a quick-eviction marking. But older AMD processors, like some older Intel processors, fetch into the L1.
– Hadi Brais
Nov 13 '18 at 2:22
If you have an SKX we can easily test this as follows. First, disable all prefetchers, perform clflush on a specific cache line, execute an empty loop for about 20,000 iterations, then perform prefetchnta on that same line, then execute the same loop, then use a demand load to access the same line and measure the access latency. This would tell us the nearest cache level in which the line got filled. Then migrate the thread to a different physical core, perform a demand load to the same line, and measure that latency. We can compare this latency to the L3 latency and...
– Hadi Brais
Nov 13 '18 at 23:24
...to the cache-to-cache or main memory latencies (the Intel Memory Latency Checker tool can be used to measure these). If it was close to the L3 latency, then the line was filled in the L3. If it was close to the C2C latency or main memory latency, then the line was not in the L3. This test would enable us to determine conclusively if the line was filled into the L3 and/or the L1.
– Hadi Brais
Nov 13 '18 at 23:25
edited Nov 13 '18 at 18:07
answered Nov 12 '18 at 23:03
Peter Cordes