Enabling OpenMP support in Visual Studio 2017 slows down code

I am trying to use OpenMP to speed up my neural network code. Since I am using Visual Studio 2017, I need to enable OpenMP Support in the property sheets. However, after doing that, some parts of the code slow down by a factor of about 5, even though I have not added any #pragma omp to the code.

I have isolated the sections and found out that this particular function is causing the problem:

void foo(Eigen::Matrix<float,3,Eigen::Dynamic> inputPts)
{
    // Forward pass: store the activation of every layer.
    std::vector<Eigen::MatrixXf> activation;
    activation.reserve(layerNo);
    activation.push_back(inputPts);

    int inputNo = inputPts.cols();

    for (int i = 0; i < layerNo - 2; i++)
        activation.push_back(((weights[i]*activation[i]).colwise()+bias[i]).array().tanh());

    // The output layer is linear (no tanh).
    activation.push_back(((weights[layerNo - 2]*activation[layerNo - 2]).colwise()+bias[layerNo - 2]));

    val = activation[layerNo - 1]/scalingFactor;

    // Backward pass: propagate a row of ones back through the layers.
    std::vector<Eigen::MatrixXf> delta;
    delta.reserve(layerNo);

    Eigen::Matrix<float, 1, Eigen::Dynamic> seed;
    seed.setOnes(1, inputNo);
    delta.push_back(seed);

    for (int i = layerNo - 2; i >= 1; i--)
    {
        Eigen::Matrix<float,Eigen::Dynamic,Eigen::Dynamic>
            d_temp = weights[i].transpose()*delta[layerNo - 2 - i],
            d_temp2 = 1 - activation[i].array().square(),  // derivative of tanh
            deltaLayer = d_temp.cwiseProduct(d_temp2);

        delta.push_back(deltaLayer);
    }

    grad = weights[0].transpose()*delta[layerNo - 2];
}

The two for-loops are the parts that slow down significantly (from ~3 ms to ~20 ms). Strangely, although this function is called many times in the program, only some of the calls are affected.

I have included the header file <omp.h>. I am not sure whether this is due to the Eigen library, which is used everywhere. I tried defining EIGEN_DONT_PARALLELIZE and calling Eigen::initParallel(), as suggested on the official site, but it does not help.

The weird thing is that I have not added any parallel pragma at all, so there should be no overhead from OpenMP. Why is the code still slowing down?
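Editor's sketch: a quick way to check what the build is actually doing. The compiler defines `_OPENMP` when OpenMP support is switched on, and `omp_get_max_threads()` reports how many threads a parallel region would use by default. The helper names `openMpEnabled`, `defaultOpenMpThreads` and `reportOpenMpState` are hypothetical, introduced only for illustration. It may also be worth checking that `EIGEN_DONT_PARALLELIZE` is defined before the first Eigen header is included (for example as a project-wide preprocessor definition), since a macro defined later cannot affect headers that were already processed.

```cpp
#include <cassert>
#include <iostream>
#ifdef _OPENMP
#include <omp.h>   // pulled in only when the OpenMP compiler switch is on
#endif

// Hypothetical helper: true if this translation unit was compiled with OpenMP.
bool openMpEnabled() {
#ifdef _OPENMP
    return true;
#else
    return false;
#endif
}

// Hypothetical helper: number of threads an OpenMP parallel region would use
// by default, or 1 when OpenMP is not enabled.
int defaultOpenMpThreads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

// Prints both values so they can be compared with the physical core count.
void reportOpenMpState() {
    const int threads = defaultOpenMpThreads();
    assert(threads >= 1);
    std::cout << "OpenMP enabled: " << (openMpEnabled() ? "yes" : "no")
              << ", default threads: " << threads << '\n';
}
```

Calling `reportOpenMpState()` once at start-up, in both the fast and the slow build, shows whether the only difference between them is the OpenMP switch.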

asked Nov 14 '18 at 12:04

mjfoo21

visual-studio-2017 openmp eigen

1 Answer

Eigen's matrix-matrix products are multi-threaded by default if OpenMP is enabled. The problem is likely the combination of:

Your CPU is hyper-threaded, e.g., you have 4 physical cores able to run 8 threads.

OpenMP provides no way to query the number of physical cores, so Eigen will launch 8 threads.

Eigen's matrix-matrix product kernel is fully optimized and exploits nearly 100% of the CPU capacity. Consequently, there is no room for running two such threads on a single core, and the performance drops significantly (cache pollution).

The solution is thus to limit the number of OpenMP threads to the number of physical cores, for instance by setting the OMP_NUM_THREADS environment variable. You can also disable Eigen's multithreading by defining the macro EIGEN_DONT_PARALLELIZE at compile time.

More info in the doc.
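A minimal sketch of the thread-limiting step, under stated assumptions. The C++ standard library only reports logical processors (`std::thread::hardware_concurrency()`), so the physical-core count is estimated here by assuming two hardware threads per core, which does not hold on every CPU. `estimatePhysicalCores` and `limitThreadsToPhysicalCores` are hypothetical names. `omp_set_num_threads` and `Eigen::setNbThreads` are existing OpenMP and Eigen calls; they are guarded so the sketch also compiles when either library is absent.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#ifdef _OPENMP
#include <omp.h>
#endif
#if defined(__has_include)
#  if __has_include(<Eigen/Core>)
#    include <Eigen/Core>
#    define SKETCH_HAS_EIGEN 1
#  endif
#endif

// Hypothetical helper: estimate physical cores from the logical CPU count.
// ASSUMPTION: threadsPerCore hardware threads per core (2 with hyper-threading).
int estimatePhysicalCores(unsigned logicalCpus, unsigned threadsPerCore = 2) {
    assert(threadsPerCore >= 1);
    // hardware_concurrency() may return 0 when the value is unknown.
    if (logicalCpus == 0) return 1;
    return static_cast<int>(std::max(1u, logicalCpus / threadsPerCore));
}

// Hypothetical helper: cap OpenMP (and Eigen, if present) at that estimate.
int limitThreadsToPhysicalCores() {
    const int cores = estimatePhysicalCores(std::thread::hardware_concurrency());
#ifdef _OPENMP
    omp_set_num_threads(cores);   // run-time counterpart of OMP_NUM_THREADS
#endif
#ifdef SKETCH_HAS_EIGEN
    Eigen::setNbThreads(cores);   // limits Eigen's own parallel kernels
#endif
    return cores;
}
```

Calling `limitThreadsToPhysicalCores()` once at start-up, before the first matrix product, would be the natural place for it. If the two-threads-per-core assumption is wrong for the target machine, passing the known core count directly is the safer choice.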

More details on how hyper-threading can decrease performance:
With hyper-threading, two threads run in an interleaved fashion on a single core, alternating every instruction. If each thread uses less than half of the CPU's computing resources, that's a win because you exploit more computing units. But if a single thread already uses 100% of the computing units (as in a well-optimized matrix-matrix product), you lose performance because of 1) the natural overhead of managing two threads and 2) the L1 cache now being shared by two different tasks. Matrix-matrix kernels are designed with a precise L1 capacity in mind. With two threads, your L1 cache becomes nearly ineffective: instead of hitting the very fast L1 cache most of the time, you end up accessing the much slower L2 cache, and you get a huge performance drop. Unlike on Linux and Windows, on OSX I don't observe such a performance drop, most likely because the system is able to unschedule the second thread if the CPU is already too busy.
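One way to check whether this effect applies on a given machine is to time the same workload at several thread counts. The sketch below is only an illustration under assumptions: `timeMs` and `sweepThreadCounts` are hypothetical helpers, and the workload is whatever callable is passed in (for example a lambda that calls `foo`). Without OpenMP the requested thread count is simply ignored.

```cpp
#include <cassert>
#include <chrono>
#include <utility>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: wall-clock time of one call to work(), in milliseconds.
template <typename Work>
double timeMs(Work&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical helper: run the workload once per thread count and collect
// (threads, milliseconds) pairs for 1..maxThreads.
template <typename Work>
std::vector<std::pair<int, double>> sweepThreadCounts(int maxThreads, Work&& work) {
    assert(maxThreads >= 1);
    std::vector<std::pair<int, double>> results;
    for (int threads = 1; threads <= maxThreads; ++threads) {
#ifdef _OPENMP
        omp_set_num_threads(threads);  // applies to later parallel regions
#endif
        results.emplace_back(threads, timeMs(work));
    }
    return results;
}
```

A single run per thread count is noisy; repeating each measurement and keeping the minimum or median would give steadier numbers. If the time rises once the thread count exceeds the number of physical cores, that is consistent with the explanation above.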

edited Nov 15 '18 at 12:53

answered Nov 14 '18 at 12:53

ggael


Thanks for the reply. I added the lines omp_set_num_threads(2); Eigen::setNbThreads(1); Eigen::initParallel(); (see link) and the run time returns to normal. Unfortunately, my timing does not improve even though I have added the #pragma statements. I guess parallel threads are not enough in my case. I just need to clarify something: how is hyper-threading causing this issue? Setting OMP_NUM_THREADS to 2 gives me the usual run time, but 4 slows down the code.

– mjfoo21
Nov 15 '18 at 12:10

This is what I tried to explain in my answer; I have extended it with more details.

– ggael
Nov 15 '18 at 12:53
