PyTorch speed comparison - GPU slower than CPU










I was trying to find out whether GPU tensor operations are actually faster than CPU ones, so I wrote the code below to run a simple 2D addition on CPU tensors and then on GPU (CUDA) tensors and compare the speed:



import torch
import time

###CPU
start_time = time.time()
a = torch.ones(4,4)
for _ in range(1000000):
    a += a
elapsed_time = time.time() - start_time

print('CPU time = ', elapsed_time)

###GPU
start_time = time.time()
b = torch.ones(4,4).cuda()
for _ in range(1000000):
    b += b
elapsed_time = time.time() - start_time

print('GPU time = ', elapsed_time)


To my surprise, the CPU time was 0.93 seconds while the GPU time was as high as 63 seconds. Am I doing the CUDA tensor operations properly, or do CUDA tensors only pay off for much more complex operations, like those in neural networks?



Note: My GPU is an NVIDIA 940MX, and torch.cuda.is_available() returns True.
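
A side note on the timing itself: PyTorch launches CUDA kernels asynchronously, so the Python loop can finish queuing work before the GPU has actually executed it. A common pattern is therefore to call torch.cuda.synchronize() before reading the clock. Below is a minimal sketch of the GPU half with that call added; torch.cuda.synchronize() is a standard PyTorch call, everything else mirrors the code above, and this only changes what is being measured (time until the GPU has finished, rather than time until the last kernel has been queued).

import torch
import time

### GPU, stopping the clock only after all queued kernels have finished
b = torch.ones(4,4).cuda()
torch.cuda.synchronize()   # wait until the tensor is ready on the device
start_time = time.time()
for _ in range(1000000):
    b += b                 # each in-place add is queued asynchronously
torch.cuda.synchronize()   # wait for every queued add to complete
elapsed_time = time.time() - start_time

print('GPU time (synchronized) = ', elapsed_time)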










Tags: python, pytorch






asked Nov 15 '18 at 18:01 by Harish
edited Nov 16 '18 at 22:53 by Robert Crovella





















1 Answer






GPU acceleration works through massive parallelization of computation. A GPU has a huge number of cores; each individual core is not very powerful, but the sheer number of them is what matters here.

Frameworks like PyTorch do their best to compute as much as possible in parallel. In general, matrix operations are very well suited to parallelization, but it still isn't always possible to parallelize a computation!

In your example you have a loop:

b = torch.ones(4,4).cuda()
for _ in range(1000000):
    b += b

You have 1000000 operations, but due to the structure of the code it is impossible to parallelize much of this computation. If you think about it, to compute the next b you need to know the value of the previous (or current) b.

So you have 1000000 operations, but each of them has to be computed one after another. The parallelization that is possible is limited by the size of your tensor, and that size is not very large in your example:

torch.ones(4,4)

So you can only parallelize 16 operations (additions) per iteration. Since the CPU has only a few, but much more powerful, cores, it is simply much faster for this example!
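
To make that concrete, here is a small sketch to illustrate the point (not part of the measurements below): it compares the serial 4x4 loop with a single elementwise addition over a tensor that exposes the same 16 million element-additions to the GPU at once. The torch.cuda.synchronize() calls make sure the clock only stops once the GPU has actually finished (see the note under the question about asynchronous kernel launches).

import torch
import time

# Serial: 1000000 dependent additions on a 4x4 tensor.
# Each iteration offers only 16 independent element-additions.
b = torch.ones(4,4).cuda()
torch.cuda.synchronize()
start = time.time()
for _ in range(1000000):
    b += b
torch.cuda.synchronize()
print('serial 4x4 loop:', time.time() - start)

# Parallel: a single addition over a 1000000x4x4 tensor.
# The same 16 million element-additions, but now all independent,
# so the GPU can spread them across its cores in one go.
c = torch.ones(1000000, 4, 4).cuda()
torch.cuda.synchronize()
start = time.time()
c += c
torch.cuda.synchronize()
print('one batched add:', time.time() - start)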



But things change if you increase the size of the tensor: then PyTorch is able to parallelize much more of the overall computation. I changed the number of iterations to 1000 (because I did not want to wait that long :), but you can put in any value you like; the relation between CPU and GPU should stay the same.
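
For reference, here is a sketch of how such a sweep over tensor sizes could be scripted. It is a reconstruction that simply wraps the loop from the question with the size and iteration count as parameters, not the exact script behind the numbers below (and, to mirror the original code, it does not call torch.cuda.synchronize(); see the note under the question):

import torch
import time

def bench(size, iters=1000):
    # CPU timing, as in the question
    a = torch.ones(size, size)
    start = time.time()
    for _ in range(iters):
        a += a
    cpu_time = time.time() - start

    # GPU timing, as in the question
    b = torch.ones(size, size).cuda()
    start = time.time()
    for _ in range(iters):
        b += b
    gpu_time = time.time() - start

    print('torch.ones({0},{0}): CPU time = {1}, GPU time = {2}'.format(size, cpu_time, gpu_time))

for size in (4, 40, 400, 4000):
    bench(size)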



Here are the results for different tensor sizes:

#torch.ones(4,4) - the size you used
CPU time = 0.00926661491394043
GPU time = 0.0431208610534668

#torch.ones(40,40) - CPU gets slower, but still faster than GPU
CPU time = 0.014729976654052734
GPU time = 0.04474186897277832

#torch.ones(400,400) - CPU now much slower than GPU
CPU time = 0.9702610969543457
GPU time = 0.04415607452392578

#torch.ones(4000,4000) - GPU much faster than CPU
CPU time = 38.088677167892456
GPU time = 0.044649362564086914

So as you can see: where it is possible to parallelize the work (here, the addition of the tensor elements), the GPU becomes very powerful. The GPU time does not change at all across these sizes, so the GPU could handle much more (as long as it doesn't run out of memory :).






answered Nov 15 '18 at 20:06 by blue-phoenox
edited Dec 23 '18 at 20:50




























