C/C++: relaxed std::atomic<bool> vs unlocked bool on X64 architecture

Is there any efficiency benefit to using an unlocked boolean over using a std::atomic<bool> where the operations are always done with relaxed memory order? I would assume that both eventually compile to the same machine code, since a single byte is actually atomic on X64 hardware. Am I wrong?

Tags: c++, performance, synchronization, x86-64, atomic

asked Nov 11 at 18:10 by tohava (edited Nov 11 at 18:32)

  • "since a single byte is actually atomic in hardware" - that's not a given fact.
    – Jesper Juhl
    Nov 11 at 18:30

  • Not even on X64 architecture? (Note what I wrote in the title)
    – tohava
    Nov 11 at 18:32

  • @JesperJuhl: I doubt there are any architectures where a byte load or store isn't atomic. (Except rare ISAs like early DEC Alpha that don't have byte load/store instructions, only word. Or word-addressable DSPs. But on them, bool would be a word wide, not a byte.)
    – Peter Cordes
    Nov 11 at 19:21


2 Answers

Accepted answer (score 4)

        Yes, there are potentially massive advantages, especially for local variables, or any variable used repeatedly in the same function. An atomic<> variable can't be optimized into a register.

        If you compile without optimization, the code-gen will be similar, but with normal optimization enabled there can be massive differences. Un-optimized code is similar to making every variable volatile.

        Current compilers also never combine multiple reads of an atomic variable into one, as if you'd used volatile atomic<T>, because that's what people expect and the dust hasn't settled yet on how to allow useful optimizations while prohibiting ones you don't want. (See the questions "Why don't compilers merge redundant std::atomic writes?" and "Can and does the compiler optimize out two atomic loads?")

        This isn't a great example, but imagine that checking the boolean is done inside an inlined function, and that there's something else inside the loop. (Otherwise you'd put the if around the loop like a normal person.)



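        #include <atomic>

        std::atomic<bool> atomic_bool;  // shared flag, declared here so the snippet is complete

        int sumarr_atomic(int arr[]) {
            int sum = 0;
            for (int i = 0; i < 10000; i++) {
                // relaxed load, but still performed on every iteration
                if (atomic_bool.load(std::memory_order_relaxed)) {
                    sum += arr[i];
                }
            }
            return sum;
        }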

        See the asm output on Godbolt.



        But with a non-atomic bool, the compiler can make that transformation for you by hoisting the load, and then auto-vectorize the simple sum loop (or not run it at all).



        With atomic_bool, it can't. With atomic_bool, the asm loop is much like the C++ source, actually doing a test and branch on the value of the variable inside every loop iteration. And this of course defeats auto-vectorization.
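For a side-by-side comparison, here is a minimal sketch of both variants that can be pasted into a compiler explorer. The name sumarr_regular and the global declarations are assumptions made for illustration; only sumarr_atomic appears in the answer above.

```cpp
#include <atomic>

// Hypothetical globals; the answer does not show how the flags are declared.
bool regular_bool = true;
std::atomic<bool> atomic_bool{true};

// Plain bool: the compiler is free to load the flag once before the loop
// and then vectorize (or skip) the summation.
int sumarr_regular(const int* arr) {
    int sum = 0;
    for (int i = 0; i < 10000; i++) {
        if (regular_bool) {
            sum += arr[i];
        }
    }
    return sum;
}

// Relaxed atomic: the flag is loaded and tested inside the loop, as written.
int sumarr_atomic(const int* arr) {
    int sum = 0;
    for (int i = 0; i < 10000; i++) {
        if (atomic_bool.load(std::memory_order_relaxed)) {
            sum += arr[i];
        }
    }
    return sum;
}
```

Both functions return the same result when nothing else writes the flag; the difference only shows up in the generated asm at -O2 or -O3.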



        (The C++ as-if rules would allow the compiler to hoist the load because it's relaxed so it can reorder with non-atomic accesses. And merge because reading the same value every time is one possible result of a global order that reads one value. But as I said, compilers don't do that.)




        Loops over an array of bool can auto-vectorize, but not over atomic<bool>.
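As a rough illustration of that point, the following sketch counts the set flags in each kind of array. The function names count_regular and count_atomic are made up for this example, and whether the first loop actually vectorizes depends on the compiler and flags.

```cpp
#include <atomic>
#include <cstddef>

// Plain bool array: a simple reduction the optimizer can vectorize.
int count_regular(const bool* flags, std::size_t n) {
    int count = 0;
    for (std::size_t i = 0; i < n; i++) {
        count += flags[i];
    }
    return count;
}

// atomic<bool> array: each element is read with its own relaxed load.
int count_atomic(const std::atomic<bool>* flags, std::size_t n) {
    int count = 0;
    for (std::size_t i = 0; i < n; i++) {
        count += flags[i].load(std::memory_order_relaxed);
    }
    return count;
}
```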




        Also, inverting a boolean with something like b ^= 1; or b = !b; can be just a regular RMW, not an atomic RMW, so it doesn't have to use lock xor or lock btc. (An x86 atomic RMW is only possible with sequential consistency vs. runtime reordering, i.e. the lock prefix is also a full memory barrier.)



        Code that modifies a non-atomic boolean can optimize away the actual modifications, e.g.



        void loop() {
            for (int i = 0; i < 10000; i++) {
                regular_bool ^= 1;  // plain read-modify-write, no lock prefix needed
            }
        }

        compiles to asm that keeps regular_bool in a register. Unfortunately it doesn't optimize away to nothing (which it could because flipping a boolean an even number of times sets it back to its original value). But it could with a smarter compiler.



        loop():
                movzx   edx, BYTE PTR regular_bool[rip]   # load into a register
                mov     eax, 10000
        .L17:                                             # do {
                xor     edx, 1                            # flip the boolean
                sub     eax, 1
                jne     .L17                              # } while(--i);
                mov     BYTE PTR regular_bool[rip], dl    # store back the result
                ret

        Even if written as atomic_b.store( !atomic_b.load(mo_relaxed), mo_relaxed) (separate atomic loads/stores), you'd still get a store/reload in the loop, creating a 6-cycle loop-carried dependency chain through the store/reload (on Intel CPUs with 5-cycle store-forwarding latency) instead of a 1-cycle dep chain through a register.
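To make that last comparison concrete, here is a minimal sketch of the two toggling loops. atomic_b and the relaxed load/store pair come from the paragraph above (with mo_relaxed read as shorthand for std::memory_order_relaxed); regular_b and both function names are hypothetical.

```cpp
#include <atomic>

// Hypothetical globals for illustration.
bool regular_b = false;
std::atomic<bool> atomic_b{false};

// Plain bool: the value can live in a register for the whole loop.
void toggle_regular(int n) {
    for (int i = 0; i < n; i++) {
        regular_b = !regular_b;
    }
}

// Separate relaxed load and store: no lock prefix, but current compilers
// still store and reload through memory on every iteration.
void toggle_atomic_relaxed(int n) {
    for (int i = 0; i < n; i++) {
        atomic_b.store(!atomic_b.load(std::memory_order_relaxed),
                       std::memory_order_relaxed);
    }
}
```

Note that the load/store pair is not an atomic read-modify-write: another thread could change atomic_b between the load and the store.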







        answered Nov 11 at 19:19 by Peter Cordes (edited Nov 12 at 18:47)

Answer (score 1)

            Checking over at Godbolt, loading a regular bool and a std::atomic<bool> generate different code, although not because of synchronisation issues. Instead, the compiler (gcc) seems unwilling to assume that a std::atomic<bool> is guaranteed to be either 0 or 1. Strange, that.



            Clang does the same thing, although the code generated is slightly different in detail.
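A minimal way to reproduce this kind of comparison is a pair of stand-alone functions that each return the value of a global, so the asm for each load can be read in isolation. The declarations below are assumptions for illustration, not the exact code that was compiled for this answer.

```cpp
#include <atomic>

// Hypothetical globals; any externally visible flags would do.
bool regular_bool = false;
std::atomic<bool> atomic_bool{false};

// Plain load of a bool.
bool load_regular() {
    return regular_bool;
}

// Relaxed load of an atomic<bool>.
bool load_atomic() {
    return atomic_bool.load(std::memory_order_relaxed);
}
```

Because the functions are neither static nor inline, the compiler has to emit a stand-alone definition for each, even if nothing calls them.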







            share|improve this answer














            share|improve this answer



            share|improve this answer








            edited Nov 11 at 18:39

            answered Nov 11 at 18:36
            Paul Sanders











            • Using cout << clutters the code a lot. godbolt.org/z/hFEQ5f is easier to read with separate functions that return the value of the global, like bool load_regular() { return regular_bool; } which compiles to a single movzx. (And the atomic version still booleanizes for no apparent reason.)
              – Peter Cordes
              Nov 11 at 18:39










            • @Peter I did it that way to stop the compiler optimising out the loads. Although I see from your example that moving the load into a separate function generates better code.
              – Paul Sanders
              Nov 11 at 18:40











            • Yeah I know, and my point is that returning a value from a function instead of writing a main solves the same problem much more cleanly. See How to remove "noise" from GCC/clang assembly output?. Remember you're just writing code so you can look at the asm, not run it.
              – Peter Cordes
              Nov 11 at 18:42










            • @Peter Ah, I see you never bother to call the functions so that gcc cannot inline them or optimise them away. A useful trick that.
              – Paul Sanders
              Nov 11 at 18:44










            • Even if you did write callers, you can still look at the stand-alone definition as well, if you don't make them static or inline.
              – Peter Cordes
              Nov 11 at 18:47
































