Enabling OpenMP support in Visual Studio 2017 slows down code

I am trying to use OpenMP to speed up my code for neural network computation. As I am using Visual Studio 2017, I need to enable OpenMP Support in the property sheets. However, after doing that, some parts of the code slow down by around 5 times, even though I did not include any #pragma omp in the code.

I have isolated the sections and found that this particular function is causing the problem:

    void foo(Eigen::Matrix<float,3,Eigen::Dynamic> inputPts)
    {
        // Forward pass: hidden layers use tanh, the last layer is linear.
        std::vector<Eigen::MatrixXf> activation;
        activation.reserve(layerNo);
        activation.push_back(inputPts);

        int inputNo = inputPts.cols();

        for (int i = 0; i < layerNo - 2; i++)
            activation.push_back(((weights[i]*activation[i]).colwise()+bias[i]).array().tanh());

        activation.push_back(((weights[layerNo - 2]*activation[layerNo - 2]).colwise()+bias[layerNo - 2]));

        val = activation[layerNo - 1]/scalingFactor;

        // Backward pass: propagate a row of ones back through the layers.
        std::vector<Eigen::MatrixXf> delta;
        delta.reserve(layerNo);

        Eigen::Matrix<float, 1, Eigen::Dynamic> seed;
        seed.setOnes(1, inputNo);
        delta.push_back(seed);

        for (int i = layerNo - 2; i >= 1; i--)
        {
            // d_temp2 is the tanh derivative, 1 - tanh^2.
            Eigen::Matrix<float,Eigen::Dynamic,Eigen::Dynamic>
                d_temp = weights[i].transpose()*delta[layerNo - 2 - i],
                d_temp2 = 1 - activation[i].array().square(),
                deltaLayer = d_temp.cwiseProduct(d_temp2);

            delta.push_back(deltaLayer);
        }

        grad = weights[0].transpose()*delta[layerNo - 2];
    }

The two for-loops are the ones that slow down significantly (from ~3 ms to ~20 ms). Strangely, although this function is called many times in the program, only some of the calls are affected.

I have included the header file <omp.h>. I am not sure whether this is due to the Eigen library, which is used everywhere. I tried defining EIGEN_DONT_PARALLELIZE and calling Eigen::initParallel(), as suggested on the official site, but it does not help.

The weird thing is that I did not include any parallel pragma at all, so there should not be any overhead from handling OpenMP functions. Why is it still slowing down?

visual-studio-2017 openmp eigen

asked Nov 14 '18 at 12:04
mjfoo21

1 Answer

Eigen's matrix-matrix products are multi-threaded by default when OpenMP is enabled. The problem is likely the combination of the following:

1. Your CPU is hyper-threaded, e.g., you have 4 physical cores able to run 8 threads.
2. OpenMP provides no way to query the number of physical cores, so Eigen launches 8 threads.
3. Eigen's matrix-matrix product kernel is fully optimized and exploits nearly 100% of the CPU capacity. Consequently, there is no room for running two such threads on a single core, and performance drops significantly (cache pollution).

The solution is thus to limit the number of OpenMP threads to the number of physical cores, for instance by setting the OMP_NUM_THREADS environment variable. You can also disable Eigen's multi-threading by defining the macro EIGEN_DONT_PARALLELIZE at compile time, before any Eigen header is included.
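
As a rough sketch of doing the same thing in code rather than through OMP_NUM_THREADS: the helpers below guess the physical core count by dividing the logical count reported by `std::thread::hardware_concurrency()` by an assumed hyper-threading factor. The names `guessPhysicalCores` and `limitEigenThreads` and the default factor of 2 are assumptions made for illustration. `Eigen::setNbThreads` is Eigen's call for capping its thread count, and it is only compiled here when the Eigen headers are found.

```cpp
#include <algorithm>
#include <cassert>
#include <thread>

#ifdef __has_include
#  if __has_include(<Eigen/Core>)
#    include <Eigen/Core>
#    define HAVE_EIGEN 1
#  endif
#endif

// Hypothetical helper: estimate the physical core count from the logical one,
// assuming `smtWays` hardware threads per core (2 for typical hyper-threading).
inline int guessPhysicalCores(unsigned logicalCores, unsigned smtWays = 2)
{
    assert(smtWays > 0);
    // hardware_concurrency() may report 0 when the count is unknown; fall back to 1.
    return static_cast<int>(std::max(1u, logicalCores / smtWays));
}

// Hypothetical helper: cap Eigen's thread count at the estimated physical core count.
inline int limitEigenThreads(unsigned smtWays = 2)
{
    const int n = guessPhysicalCores(std::thread::hardware_concurrency(), smtWays);
#ifdef HAVE_EIGEN
    Eigen::setNbThreads(n);  // Eigen's products will use at most n threads
#endif
    return n;
}
```

Calling `limitEigenThreads()` once at program start, before the first matrix product, should be enough. On a CPU without hyper-threading the halving would leave cores idle, so `limitEigenThreads(1)` is the better choice there.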



More info in the Eigen doc.

More details on how hyper-threading can decrease performance:
With hyper-threading, two threads run in an interleaved fashion on a single core, sharing its execution units. If each thread uses less than half of the core's resources (in terms of computation), that is a win because you exploit more computing units. But if a single thread already uses 100% of the computing units (as is the case for a well-optimized matrix-matrix product), you lose performance because of 1) the natural overhead of managing two threads and 2) the L1 cache now being shared by two different tasks. Matrix-matrix kernels are designed with a precise L1 capacity in mind. With two threads, your L1 cache becomes nearly ineffective. Instead of hitting the very fast L1 cache most of the time, you end up accessing the much slower L2 cache, and thus you get a huge performance drop. Unlike on Linux and Windows, on OSX I don't observe such a performance drop, most likely because the system is able to unschedule the second thread if the CPU is already too busy.

edited Nov 15 '18 at 12:53
answered Nov 14 '18 at 12:53
ggael

• Thanks for the reply. I added the lines omp_set_num_threads(2); Eigen::setNbThreads(1); Eigen::initParallel(); (see link) and the run time returned to normal. Unfortunately, my timing does not improve even though I have added the #pragma statements. I guess parallel threads are not enough in my case. I just need to clarify something: how is hyper-threading causing this issue? Setting OMP_NUM_THREADS to 2 gives me the usual run time, but 4 slows down the code.
  – mjfoo21 Nov 15 '18 at 12:10

• This is what I tried to explain in my answer; I have extended the answer with more details.
  – ggael Nov 15 '18 at 12:53
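
The combination mentioned in the first comment (your own OpenMP loop with Eigen's internal threading switched off) might look roughly like the sketch below. `processSample` and `processBatch` are hypothetical names standing in for the real per-call work, and the Eigen call is only compiled when the Eigen headers are available. `omp_set_num_threads` and `Eigen::setNbThreads` are the calls quoted in the comment.

```cpp
#include <vector>

#ifdef _OPENMP
#  include <omp.h>
#endif

#ifdef __has_include
#  if __has_include(<Eigen/Core>)
#    include <Eigen/Core>
#    define HAVE_EIGEN 1
#  endif
#endif

// Hypothetical stand-in for one independent unit of work (e.g. one call to foo()).
inline float processSample(float x) { return 2.0f * x + 1.0f; }

// Outer-level parallelism: each OpenMP thread handles whole samples, so Eigen's
// own threading is switched off to avoid oversubscribing the cores.
inline std::vector<float> processBatch(const std::vector<float>& in, int numThreads)
{
#ifdef HAVE_EIGEN
    Eigen::setNbThreads(1);           // keep Eigen's products single-threaded
#endif
#ifdef _OPENMP
    omp_set_num_threads(numThreads);  // e.g. the number of physical cores
#else
    (void)numThreads;                 // without OpenMP the loop simply runs serially
#endif
    std::vector<float> out(in.size());
    const int n = static_cast<int>(in.size());
    #pragma omp parallel for
    for (int i = 0; i < n; ++i)       // signed index, as MSVC's OpenMP 2.0 requires
        out[i] = processSample(in[i]);
    return out;
}
```

Whether this pays off depends on how much work each iteration does. If a single call takes only a few milliseconds, the cost of starting and synchronizing the threads can cancel out the gain, which may be what the comment observed.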