Enabling OpenMP support in Visual Studio 2017 slows down code
I am trying to use OpenMP to speed up my code for neural-network computation. Since I am using Visual Studio 2017, I enabled OpenMP Support in the project's property sheets. However, after doing that, some parts of the code slow down by around 5 times, even though I did not add any #pragma omp directive to the code.
I have isolated the sections and found that this particular function is causing the problem:
void foo(Eigen::Matrix<float, 3, Eigen::Dynamic> inputPts)
{
    std::vector<Eigen::MatrixXf> activation;
    activation.reserve(layerNo);
    activation.push_back(inputPts);
    int inputNo = inputPts.cols();

    // Forward pass: tanh hidden layers, linear output layer
    for (int i = 0; i < layerNo - 2; i++)
        activation.push_back(((weights[i] * activation[i]).colwise() + bias[i]).array().tanh());
    activation.push_back((weights[layerNo - 2] * activation[layerNo - 2]).colwise() + bias[layerNo - 2]);
    val = activation[layerNo - 1] / scalingFactor;

    // Backward pass: propagate deltas from the output layer
    std::vector<Eigen::MatrixXf> delta;
    delta.reserve(layerNo);
    Eigen::Matrix<float, 1, Eigen::Dynamic> seed;
    seed.setOnes(1, inputNo);
    delta.push_back(seed);
    for (int i = layerNo - 2; i >= 1; i--)
    {
        Eigen::Matrix<float, Eigen::Dynamic, Eigen::Dynamic>
            d_temp = weights[i].transpose() * delta[layerNo - 2 - i],
            d_temp2 = 1 - activation[i].array().square(),
            deltaLayer = d_temp.cwiseProduct(d_temp2);
        delta.push_back(deltaLayer);
    }
    grad = weights[0].transpose() * delta[layerNo - 2];
}
The two for-loops are the ones that slow down significantly (from ~3 ms to ~20 ms). Strangely, although this function is called many times in the program, only some of the calls are affected.
I have included the header file <omp.h>. I am not sure whether it is due to the Eigen library, which is used everywhere. I tried defining EIGEN_DONT_PARALLELIZE and calling Eigen::initParallel() as suggested on the official site, but it does not help.
The strange thing is that I did not even include any parallel pragma at all, so there should not be any overhead from handling OpenMP. Why is it still slowing down?
visual-studio-2017 openmp eigen
asked Nov 14 '18 at 12:04
mjfoo21
1 Answer
Eigen's matrix-matrix products are multi-threaded by default if OpenMP is enabled. The problem is likely a combination of the following:
- Your CPU is hyper-threaded, e.g., you have 4 physical cores able to run 8 threads.
- OpenMP provides no way to query the number of physical cores, so Eigen will launch 8 threads.
- Eigen's matrix-matrix product kernel is fully optimized and exploits nearly 100% of the CPU's capacity. Consequently, there is no room to run two such threads on a single core, and performance drops significantly (cache pollution).
The solution is therefore to limit the number of OpenMP threads to the number of physical cores, for instance by setting the OMP_NUM_THREADS environment variable. You can also disable Eigen's multithreading by defining the macro EIGEN_DONT_PARALLELIZE at compile time.
More info in the doc.
More details on how hyper-threading can decrease performance:
With hyper-threading you have two threads running in an interleaved fashion on a single core; they alternate every instruction. If each of your threads uses less than half of the CPU's compute resources, that's a win, because you exploit more computing units. But if a single thread already uses 100% of the computing units (as in the case of a well-optimized matrix-matrix product), then you lose performance because of 1) the natural overhead of managing two threads and 2) the L1 cache now being shared by two different tasks. Matrix-matrix kernels are designed with a precise L1 capacity in mind, so with two threads your L1 cache becomes nearly ineffective. This means that instead of hitting the very fast L1 cache most of the time, you end up accessing the much slower L2 cache, and thus you get a huge performance drop. Unlike on Linux and Windows, on macOS I don't observe such a performance drop, most likely because the system is able to deschedule the second thread if the CPU is already too busy.
Thanks for the reply. I added the lines omp_set_num_threads(2); Eigen::setNbThreads(1); Eigen::initParallel();
(see link) and the run time returns to normal. Unfortunately, my timing does not improve even though I have added the #pragma
statements. I guess parallel threads are not enough in my case. Just to clarify something: how is hyper-threading causing this issue? Setting OMP_NUM_THREADS to 2 gives me the usual run time, but 4 slows down the code.
– mjfoo21
Nov 15 '18 at 12:10
This is what I tried to explain in my answer, I extended the answer with more details.
– ggael
Nov 15 '18 at 12:53
edited Nov 15 '18 at 12:53
answered Nov 14 '18 at 12:53
ggael