Difference between PREFETCH and PREFETCHNTA instructions
The PREFETCHNTA instruction is used to prefetch data from main memory into the caches, but instructions with the NT suffix are known to skip caches and avoid cache pollution. So what does PREFETCHNTA do that the PREFETCH instruction doesn't?
caching assembly x86 prefetch isa
edited Nov 12 '18 at 22:08
asked Nov 12 '18 at 21:33
Abhishek Nikam
1 Answer
prefetchNTA can't bypass caches, only reduce (not avoid) pollution. It can't break cache coherency or violate the memory-ordering semantics of a WB (Write-Back) memory region. (Unlike NT stores, which do fully bypass caches and are weakly-ordered even on normal WB memory.)
On paper, the x86 ISA doesn't specify how it implements the NT hint.
http://felixcloutier.com/x86/PREFETCHh.html says: "NTA (non-temporal data with respect to all cache levels)—prefetch data into non-temporal cache structure and into a location close to the processor, minimizing cache pollution." How any specific CPU microarchitecture chooses to implement that is completely up to the architects.
prefetchNTA from WB memory¹ on Intel CPUs populates L1d normally, allowing later loads to hit in L1d normally (as long as the prefetch distance is large enough that the prefetch completes, and small enough that it isn't evicted again before the demand load). The correct prefetch distance depends on the system and other factors, and can be fairly brittle.
What it does do on Intel CPUs is skip non-inclusive outer caches. So on Intel before Skylake-AVX512, it bypasses L2 and populates L1d + L3. But on SKX it also skips the L3 cache entirely, because there the L3 is smaller and non-inclusive. See Do current x86 architectures support non-temporal loads (from "normal" memory)?
On Intel CPUs with inclusive L3 caches (which it can't bypass), it reduces L3 pollution by being restricted to prefetching into one "way" of the associative inclusive L3 cache. (Which is usually something like 16-way associative, so the total capacity that can be polluted by prefetchnta is only ~1/16th of total L3 size.)
@HadiBrais commented on this answer with some info on AMD CPUs.
Instead of limiting pollution by fetching into only one way of the cache, apparently AMD allocates lines fetched with NT prefetch with a "quick eviction" marking. Probably this means allocating in the LRU position instead of the Most-Recently-Used position. So the next allocation in that set of the cache will evict the line.
Footnote 1: prefetchNTA from WC memory I think prefetches into an LFB (Line Fill Buffer), allowing SSE4.1 movntdqa loads to hit an already-populated LFB. But note that movntdqa from WB memory is not useful.
Are you sure on SKX the instruction skips the L3? According to Section 7.3.2 of the Intel optimization manual, on Nehalem and later it may/must be fetched into the L3. I think the most important property of prefetchnta is not which cache level it gets fetched into, but rather that it's marked in the cache set for quicker eviction. This applies to most or all Intel and AMD processors that support the instruction. The level into which it gets fetched is also important, however, but that is microarchitecture-dependent across Intel and AMD processors.
– Hadi Brais
Nov 13 '18 at 2:17
Fetching it into the L3 is an intuitive design because the L3 is much larger than the L1, especially when real programs may have low hit rates in the prefetched lines and main memory latency is relatively high in modern systems.
– Hadi Brais
Nov 13 '18 at 2:18
According to the AMD optimization manual for the 17h family, Section 2.6.4, prefetchnta fetches the line into the L2 with a quick-eviction marking. But older AMD processors, like some older Intel processors, fetch into the L1.
– Hadi Brais
Nov 13 '18 at 2:22
If you have an SKX we can easily test this as follows. First, disable all prefetchers, perform clflush on a specific cache line, execute an empty loop for about 20,000 iterations, then perform prefetchnta on that same line, then execute the same loop, then use a demand load to access the same line and measure the access latency. This would tell us the nearest cache level in which the line got filled. Then migrate the thread to a different physical core, perform a demand load to the same line, and measure that latency. We can compare this latency to the L3 latency and...
– Hadi Brais
Nov 13 '18 at 23:24
...to the cache-to-cache or main memory latencies (the Intel Memory Latency Checker tool can be used to measure these). If it was close to the L3 latency, then the line was filled in the L3. If it was close to the C2C latency or main memory latency, then the line was not in the L3. This test would enable us to determine conclusively if the line was filled into the L3 and/or the L1.
– Hadi Brais
Nov 13 '18 at 23:25
edited Nov 13 '18 at 18:07
answered Nov 12 '18 at 23:03
Peter Cordes