How can the multiplication be faster than shifting bits to the left?
It is well known that shifting bits to the left is faster than multiplying, because barrel shifters are implemented directly in hardware. Therefore, this simple benchmark should be wrong:
$start = 1;

// Time 10 million left shifts.
$timestart = microtime(1);
for ($i = 0; $i < 10000000; $i++)
    $result2 = $start << 2;
echo microtime(1) - $timestart;

// Time 10 million multiplications.
$timestart = microtime(1);
for ($i = 0; $i < 10000000; $i++)
    $result1 = $start * 4;
echo microtime(1) - $timestart;

echo "\n";
I executed it multiple times, and multiplying was always faster than shifting bits to the left. For example (shift time first, then multiply):
0.73733711242676
0.71091389656067
So either the benchmark is wrong, or the PHP interpreter is doing something here. The test was executed with PHP 7.0.32 running on Ubuntu:
PHP 7.0.32-0ubuntu0.16.04.1 (cli) ( NTS )
CPU: Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz
Edit:
Executing it on a Windows box with almost the same CPU (Intel(R) Core(TM) i5-4460S CPU @ 2.90GHz), the results are as expected:
0.24960112571716
0.28080010414124
The PHP version for this case is different:
PHP 7.1.19 (cli) (built: Jun 20 2018 23:24:42) ( ZTS MSVC14 (Visual C++ 2015) x64 )
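For reference, a slightly more careful variant of this micro-benchmark repeats each loop several times, keeps the best time for each operation, and keeps all output outside the timed regions. This is only a sketch (the trial count and variable names are arbitrary, not taken from the question above):

$iters  = 10000000;
$trials = 5;
$start  = 1;
$bestShift = INF;
$bestMul   = INF;

for ($t = 0; $t < $trials; $t++) {
    // Left-shift loop.
    $t0 = microtime(true);
    for ($i = 0; $i < $iters; $i++) {
        $r = $start << 2;
    }
    $bestShift = min($bestShift, microtime(true) - $t0);

    // Multiply loop.
    $t0 = microtime(true);
    for ($i = 0; $i < $iters; $i++) {
        $r = $start * 4;
    }
    $bestMul = min($bestMul, microtime(true) - $t0);
}

// All output stays outside the timed regions.
printf("shift:    best of %d runs: %.6f s\n", $trials, $bestShift);
printf("multiply: best of %d runs: %.6f s\n", $trials, $bestMul);

Taking the minimum over several trials reduces the influence of CPU frequency ramp-up and other one-time startup costs that could otherwise favor whichever loop happens to run second.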
php benchmarking bitwise-operators
asked Nov 11 at 13:16, edited Nov 12 at 13:02 – Víctor
If you reverse the order of the benchmarks, do you get the opposite result? If so, it's probably because your CPU doesn't ramp up to max clock speed right away, or other startup overhead factors. Also, you have an echo inside the 2nd timed interval (which, if anything, you'd expect to skew the result in the opposite direction).
– Peter Cordes
Nov 11 at 13:25
In which setting would you have to left shift a hundred thousand times in your typical web application? Those benchmarks are mostly measuring the loop performance / comparison / counting rather than the instruction. And even then it's negligible in comparison to the zval overhead to be meaningful.
– mario
Nov 11 at 13:29
@PeterCordes I tried it before asking: Reversing the order gets the same results: multiplying is faster. You are right about the "echo", but it is inside both intervals. Anyway, even removing both, multiplying is still slower.
– Víctor
Nov 11 at 13:30
You didn't update the numbers after changing the code, so now your numbers aren't from the version of the code in the question.
– Peter Cordes
Nov 11 at 13:54
@PeterCordes Now the benchmark runs 10^2 times more operations.
– Víctor
Nov 11 at 14:50
1 Answer
answered Nov 11 at 13:50 – Peter Cordes
Your reasoning about hardware is basically irrelevant. You're using an interpreted language where most of the cost is interpreter overhead.
An asm version of either loop could run at 1 iteration per clock (assuming a fixed-count shift), so the 100k iterations of the original version of this question would take only about 0.033 ms (0.000033 seconds) on a 3 GHz CPU, roughly 250 times faster than your PHP times.
Also, an interpreted loop has to use a variable-count shift (because it can't JIT-compile the shift count into an immediate in the machine code), which is actually more expensive for throughput (3 uops) on Intel CPUs because of x86 legacy baggage (flag semantics); AMD CPUs have single-uop shifts even for variable shift counts (shl reg, cl vs. shl reg, imm8). See "INC instruction vs ADD 1: Does it matter?" for more about why shl reg, cl is 3 uops on Sandybridge-family, and how it can create a false dependency through flags.
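That immediate-versus-variable-count distinction isn't visible from PHP itself, but a quick sketch like the following (illustrative only; the variable names are made up) checks whether writing the shift count as a literal or as a variable changes anything measurable at the PHP level. If interpreter overhead dominates, it shouldn't:

$iters = 10000000;
$start = 1;
$count = 2;

// Shift count written as a literal.
$t0 = microtime(true);
for ($i = 0; $i < $iters; $i++) {
    $r = $start << 2;
}
$literal = microtime(true) - $t0;

// Shift count held in a variable.
$t0 = microtime(true);
for ($i = 0; $i < $iters; $i++) {
    $r = $start << $count;
}
$variable = microtime(true) - $t0;

// In both cases PHP 7 interprets a ZEND_SL opcode; without a JIT
// (none exists before PHP 8.0) the count is never baked into a
// machine-code immediate.
printf("literal count:  %.6f s\nvariable count: %.6f s\n", $literal, $variable);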
Integer multiply is 1 uop, with 1-per-clock throughput and 3-cycle latency, on Intel Sandybridge-family and AMD Ryzen; it's 1 per 2 clocks on AMD Bulldozer-family, not fully pipelined. So yes, multiply has higher latency, but they're both fully pipelined for throughput. Your loop throws away the result, so there's no loop-carried dependency chain, and latency is irrelevant (it's hidden by out-of-order execution).
But that minor difference (2 extra uops) is not enough to account for the measured difference. The actual shift or multiply accounts for only about 1/250th of the total cycles the loop takes. You say switching the order of the loops doesn't change the result, so it's not just a warm-up effect before your CPU ramps up to max clock speed.
You haven't mentioned what CPU microarchitecture you're running on, but the answer probably doesn't depend on how shift vs. multiply instructions decode.
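One way to sanity-check the interpreter-overhead point from PHP itself is to time the same loop with a trivial body and compare it against the shift loop; if most of the time is loop and dispatch overhead, the two figures should be close. A minimal sketch, assuming nothing beyond stock PHP (the variable names are illustrative):

$iters = 10000000;
$start = 1;

// Loop overhead only: the body is a trivial constant assignment.
$t0 = microtime(true);
for ($i = 0; $i < $iters; $i++) {
    $r = 0;
}
$overhead = microtime(true) - $t0;

// Same loop, with the shift in the body.
$t0 = microtime(true);
for ($i = 0; $i < $iters; $i++) {
    $r = $start << 2;
}
$withShift = microtime(true) - $t0;

printf("trivial body: %.6f s\nwith shift:   %.6f s\n", $overhead, $withShift);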
Intel(R) Core(TM) i5-4460 CPU @ 3.20GHz
– Víctor
Nov 11 at 13:54
Ok, that's a Haswell (part of the Sandybridge family), so everything I was saying does apply.
– Peter Cordes
Nov 11 at 13:54