Streaming multiprocessors, Blocks and Threads (CUDA)





















What is the relationship between a CUDA core, a streaming multiprocessor and the CUDA model of blocks and threads?

What gets mapped to what, and what is parallelized, and how? And which is more efficient: maximizing the number of blocks or the number of threads?

My current understanding is that there are 8 CUDA cores per multiprocessor, that every CUDA core is able to execute one CUDA block at a time, and that all the threads in that block are executed serially on that particular core.

Is this correct?










Tags: cuda nvidia

52 votes. Asked Aug 19 '10 at 7:21 by ExtremeCoder; edited Feb 20 '16 at 8:23 by talonmies.

          4 Answers

















          Accepted answer, 53 votes. Answered Aug 19 '10 at 9:14 by Edric; edited May 8 '14 at 5:42 by Greg Smith.

          The thread / block layout is described in detail in the CUDA programming guide. In particular, chapter 4 states:




          The CUDA architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.




          Each SM contains 8 CUDA cores, and at any one time they're executing a single warp of 32 threads - so it takes 4 clock cycles to issue a single instruction for the whole warp. You can assume that threads in any given warp execute in lock-step, but to synchronise across warps, you need to use __syncthreads().
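
          To make the mapping concrete, here is a minimal sketch (my addition, not from the programming guide): the host launches a grid of blocks, each block lands on one SM and is split into warps of 32 threads, and __syncthreads() is the barrier that covers all warps of that block. The kernel name and sizes are purely illustrative.

          #include <cstdio>
          #include <cuda_runtime.h>

          __global__ void reverse_in_block(const int *in, int *out)
          {
              __shared__ int tile[256];                 // per-block shared memory, lives on the SM
              int t = threadIdx.x;                      // thread index within the block
              int g = blockIdx.x * blockDim.x + t;      // global index: block -> thread

              tile[t] = in[g];                          // each warp fills its slice of the tile
              __syncthreads();                          // wait for *all* warps of this block
              out[g] = tile[blockDim.x - 1 - t];        // safe: the whole tile is now populated
          }

          int main()
          {
              const int n = 1 << 20, threads = 256, blocks = n / threads;
              int *in, *out;
              cudaMallocManaged(&in,  n * sizeof(int));
              cudaMallocManaged(&out, n * sizeof(int));
              for (int i = 0; i < n; ++i) in[i] = i;

              reverse_in_block<<<blocks, threads>>>(in, out);   // grid of blocks, distributed to SMs
              cudaDeviceSynchronize();
              printf("out[0] = %d (expected %d)\n", out[0], threads - 1);

              cudaFree(in); cudaFree(out);
              return 0;
          }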




























            Just one addition: on newer devices there are 32 (Compute Capability 2.0) or 48 (2.1) CUDA cores per SM. The actual number doesn't really make much difference to programming; the warp size is 32 and has the same meaning (i.e. executing in lock-step).
            – Tom
            Aug 20 '10 at 14:48










            And in fact Compute Capability 3.0 (Kepler) now increases cores/SM hugely - to 192!
            – Edric
            Jul 12 '12 at 12:24










            I still don't get it. So it's always 1 warp per core and the number of warps per SM is equal to the number of cores per SM? And how do thread blocks get mapped to warps? Do blocks always consist of whole numbers of warps? If for example each block contains 3 warps, does that mean that I am using 3 cores on a given SM?
            – Matt J
            Apr 6 '14 at 17:43











            Wikipedia says that each SM contains 32 CUDA cores.
            – haccks
            Feb 16 '15 at 22:06










            The number of CUDA cores in an SM depends on the GPU; for example, a GTX 1060 has 9 SMs and 128 processors (CUDA cores) per SM, for a total of 1152 CUDA cores.
            – sgira
            Apr 23 '17 at 13:07

















          Answer, 24 votes. Answered Jun 15 '16 at 3:56 by JoeFox.

          For the GTX 970 there are 13 Streaming Multiprocessors (SMs) with 128 CUDA cores each. CUDA cores are also called Stream Processors (SPs).



          You can define grids, which map blocks to the GPU.

          You can define blocks, which map threads to the Stream Processors (the 128 CUDA cores per SM).

          One warp is always formed by 32 threads, and all threads of a warp are executed simultaneously.



          To use the full possible power of a GPU you need many more threads per SM than the SM has SPs. For each Compute Capability there is a certain number of threads which can reside in one SM at a time. All blocks you define are queued and wait for an SM to have the resources (number of SPs free); then the block is loaded onto the SM, and the SM starts to execute warps. Since one warp only has 32 threads and an SM has, for example, 128 SPs, an SM can execute 4 warps at a given time. The thing is, if the threads do a memory access, each thread will block until its memory request is satisfied. In numbers: an arithmetic calculation on the SP has a latency of 18-22 cycles, while a non-cached global memory access can take up to 300-400 cycles. This means that while the threads of one warp are waiting for data, only a subset of the 128 SPs would be working. Therefore the scheduler switches to execute another warp, if one is available. And if this warp blocks, it executes the next, and so on. This concept is called latency hiding. The number of warps and the block size determine the occupancy (how many warps the SM can choose from to execute). If the occupancy is high, it is less likely that there is no work for the SPs.



          Your statement that each CUDA core will execute one block at a time is wrong. If you talk about Streaming Multiprocessors, they can execute warps from all threads which reside in the SM. If one block has a size of 256 threads and your GPU allows 2048 threads to be resident per SM, each SM would have 8 blocks residing on it, from which the SM can choose warps to execute. All threads of the executed warps are executed in parallel.



          You find numbers for the different Compute Capabilities and GPU Architectures here:
          https://en.wikipedia.org/wiki/CUDA#Limitations



          You can download an occupancy calculation sheet (the CUDA Occupancy Calculator) from Nvidia.
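
          As a sketch of the same calculation done at runtime rather than in the spreadsheet (my addition, not part of the answer; the saxpy kernel and the block size of 256 are just examples), the CUDA runtime's occupancy API reports how many blocks of a given kernel can be resident on one SM:

          #include <cstdio>
          #include <cuda_runtime.h>

          __global__ void saxpy(int n, float a, const float *x, float *y)
          {
              int i = blockIdx.x * blockDim.x + threadIdx.x;
              if (i < n) y[i] = a * x[i] + y[i];
          }

          int main()
          {
              cudaDeviceProp prop;
              cudaGetDeviceProperties(&prop, 0);

              int block_size = 256, blocks_per_sm = 0;
              // How many blocks of this kernel fit on one SM at once (0 bytes of dynamic shared memory)
              cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocks_per_sm, saxpy, block_size, 0);

              int active_warps = blocks_per_sm * block_size / prop.warpSize;
              int max_warps    = prop.maxThreadsPerMultiProcessor / prop.warpSize;
              printf("%d blocks/SM -> %d of %d warps (%.0f%% occupancy)\n",
                     blocks_per_sm, active_warps, max_warps, 100.0 * active_warps / max_warps);
              return 0;
          }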






            Answer, 3 votes. Answered May 26 '17 at 0:59 by Greg Smith.

            The Compute Work Distributor will schedule a thread block (CTA) on an SM only if the SM has sufficient resources for the thread block (shared memory, warps, registers, barriers, ...). Thread-block-level resources such as shared memory are allocated. The allocation creates sufficient warps for all threads in the thread block. The resource manager allocates warps round-robin to the SM sub-partitions. Each SM sub-partition contains a warp scheduler, register file, and execution units. Once a warp is allocated to a sub-partition it will remain on that sub-partition until it completes or is pre-empted by a context switch (Pascal architecture). On context-switch restore, the warp will be restored to the same SM with the same warp-id.



            When all threads in a warp have completed, the warp scheduler waits for all outstanding instructions issued by the warp to complete, and then the resource manager releases the warp-level resources, which include the warp-id and register file.



            When all warps in a thread block have completed, block-level resources are released and the SM notifies the Compute Work Distributor that the block has completed.



            Once a warp is allocated to a sub-partition and all resources are allocated, the warp is considered active, meaning that the warp scheduler is actively tracking the state of the warp. On each cycle the warp scheduler determines which active warps are stalled and which are eligible to issue an instruction. The warp scheduler picks the highest-priority eligible warp and issues 1-2 consecutive instructions from the warp. The rules for dual-issue are specific to each architecture. If a warp issues a memory load it can continue to execute independent instructions until it reaches a dependent instruction. The warp will then report as stalled until the load completes. The same is true for dependent math instructions. The SM architecture is designed to hide both ALU and memory latency by switching per cycle between warps.



            This answer does not use the term CUDA core, as this introduces an incorrect mental model. CUDA cores are pipelined single-precision floating point/integer execution units. The issue rate and dependency latency are specific to each architecture. Each SM sub-partition and SM has other execution units, including load/store units, double-precision floating point units, half-precision floating point units, branch units, etc.



            In order to maximize performance the developer has to understand the trade off of blocks vs. warps vs. registers/thread.



            The term occupancy is the ratio of active warps to maximum warps on an SM. The Kepler through Pascal architectures (except GP100) have 4 warp schedulers per SM. The minimal number of warps per SM should at least be equal to the number of warp schedulers. If the architecture has a dependent execution latency of 6 cycles (Maxwell and Pascal), then you would need at least 6 warps per scheduler, which is 24 per SM (24 / 64 = 37.5% occupancy), to cover the latency. If the threads have instruction-level parallelism then this could be reduced. Almost all kernels issue variable-latency instructions such as memory loads that can take 80-1000 cycles; this requires more active warps per warp scheduler to hide latency. For each kernel there is a trade-off point between the number of warps and other resources such as shared memory or registers, so optimizing for 100% occupancy is not advised, as some other sacrifice will likely be made. The CUDA profiler can help identify instruction issue rate, occupancy, and stall reasons in order to help the developer determine that balance.
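
            As a back-of-envelope version of that arithmetic (my sketch; the 4-scheduler, 6-cycle and 64-warp figures are the answer's Maxwell/Pascal numbers, not values queried from hardware):

            #include <cstdio>

            int main()
            {
                int schedulers_per_sm = 4;   // warp schedulers per SM (Kepler-Pascal, per the answer)
                int dependent_latency = 6;   // cycles before a dependent instruction can issue
                int max_warps_per_sm  = 64;  // resident-warp limit per SM

                int warps_needed = schedulers_per_sm * dependent_latency;     // 4 * 6 = 24 warps
                float occupancy  = 100.0f * warps_needed / max_warps_per_sm;  // 37.5%
                printf("~%d warps/SM (%.1f%% occupancy) to cover ALU latency alone\n",
                       warps_needed, occupancy);
                return 0;
            }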



            The size of a thread block can impact performance. If the kernel has large blocks and uses synchronization barriers, then barrier stalls can become a common stall reason. This can be alleviated by reducing the warps per thread block.






              Answer, 0 votes. Answered Nov 10 at 20:06 by liu km.

              There are multiple streaming multiprocessors (SMs) on one device.

              An SM may hold multiple blocks. Each block may contain several threads.

              An SM has multiple CUDA cores (as a developer, you should not care about this, because it is abstracted by the warp), which work on threads. An SM always works on warps of threads (always 32 threads per warp). A warp only works on threads from the same block.

              Both the SM and the block have limits on the number of threads, the number of registers, and the amount of shared memory.
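
              As a sketch of how to look up those per-block and per-SM limits at runtime (my addition, not part of the answer; it assumes device 0):

              #include <cstdio>
              #include <cuda_runtime.h>

              int main()
              {
                  cudaDeviceProp p;
                  cudaGetDeviceProperties(&p, 0);   // limits for device 0

                  printf("%s: %d SMs, warp size %d\n", p.name, p.multiProcessorCount, p.warpSize);
                  printf("per block: %d threads, %d registers, %zu bytes shared memory\n",
                         p.maxThreadsPerBlock, p.regsPerBlock, p.sharedMemPerBlock);
                  printf("per SM:    %d threads, %d registers, %zu bytes shared memory\n",
                         p.maxThreadsPerMultiProcessor, p.regsPerMultiprocessor,
                         p.sharedMemPerMultiprocessor);
                  return 0;
              }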





