Why does the call latency on clock_gettime(CLOCK_REALTIME, ..) vary so much?
up vote
6
down vote
favorite
I'm trying to time how long clock_gettime(CLOCK_REALTIME,...)
takes to call. "Back in the day" I used to call it once at the top of a loop since it was a fairly expensive call. But now, I was hoping that with vDSO and some clock improvements, it might not be so slow.
I wrote some test code that used __rdtscp
to time repeated calls to clock_gettime
(the rdtscp
calls went around a loop that called clock_gettime
and added the results together, just so the compiler wouldn't optimize too much away).
If I call clock_gettime()
in quick succession, the length of time goes from about 45k clock cycles down to 500 cycles. Some of this I thought could be contributed to the first call having to load the vDSO code (still doesn't full make sense to me), but how it takes a few calls to get 500 I cannot explain at all, and this behavior seems to be constant regardless of how I test it:
42467
1114
1077
496
455
However, if I sleep (for a second or ten, doesn't matter) between calls to clock_gettime, it only reaches a steady state of about 4.7k cycles:
Here at 10 second sleeps:
28293
1093
4729
4756
4736
Here at 1 second sleeps:
61578
855
4753
4741
5645
4753
4732
Cache behavior wouldn't seem to describe this (on a desktop system not doing much of anything). How much should I budget for a call to clock_gettime? Why does it get progressively faster to call? Why does sleeping a small amount of time matter so much?
tl;dr I'm trying to understand the time it takes to call clock_gettime(CLOCK_REALTIME,...)
don't understand why it runs faster when called in quick succession as opposed to a second between calls.
Update: Here's the cpuinfo on proc 0
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 158
model name : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0x84
cpu MHz : 2800.000
cache size : 6144 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs :
bogomips : 5616.00
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
Here is the recreated test code:
#include <time.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>
// compiled gcc -Wall -O3 -o clockt clockt.cpp
// called glockt sleeptime trials loops
unsigned long long now()
struct timespec s;
clock_gettime(CLOCK_REALTIME, &s);
return (s.tv_sec * 1000000000ull) + s.tv_nsec;
int main(int argc, char **argv)
int sleeptime = atoi(argv[1]);
int trials = atoi(argv[2]);
int loops = atoi(argv[3]);
unsigned long long x, y, n = 0;
unsigned int d;
x = __rdtscp(&d);
n = now();
asm volatile("": "+r" (n));
y = __rdtscp(&d);
printf("init run %lldn", (y-x));
for(int t = 0; t < trials; ++t)
if(sleeptime > 0) sleep(sleeptime);
x = __rdtscp(&d);
for(int l = 0; l < loops; ++l)
n = now();
asm volatile("": "+r" (n));
y = __rdtscp(&d);
printf("trial %d took %lldn", t, (y-x));
exit(0);
linux performance optimization timing code-timing
|
show 3 more comments
up vote
6
down vote
favorite
I'm trying to time how long clock_gettime(CLOCK_REALTIME,...)
takes to call. "Back in the day" I used to call it once at the top of a loop since it was a fairly expensive call. But now, I was hoping that with vDSO and some clock improvements, it might not be so slow.
I wrote some test code that used __rdtscp
to time repeated calls to clock_gettime
(the rdtscp
calls went around a loop that called clock_gettime
and added the results together, just so the compiler wouldn't optimize too much away).
If I call clock_gettime()
in quick succession, the length of time goes from about 45k clock cycles down to 500 cycles. Some of this I thought could be contributed to the first call having to load the vDSO code (still doesn't full make sense to me), but how it takes a few calls to get 500 I cannot explain at all, and this behavior seems to be constant regardless of how I test it:
42467
1114
1077
496
455
However, if I sleep (for a second or ten, doesn't matter) between calls to clock_gettime, it only reaches a steady state of about 4.7k cycles:
Here at 10 second sleeps:
28293
1093
4729
4756
4736
Here at 1 second sleeps:
61578
855
4753
4741
5645
4753
4732
Cache behavior wouldn't seem to describe this (on a desktop system not doing much of anything). How much should I budget for a call to clock_gettime? Why does it get progressively faster to call? Why does sleeping a small amount of time matter so much?
tl;dr I'm trying to understand the time it takes to call clock_gettime(CLOCK_REALTIME,...)
don't understand why it runs faster when called in quick succession as opposed to a second between calls.
Update: Here's the cpuinfo on proc 0
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 158
model name : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0x84
cpu MHz : 2800.000
cache size : 6144 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs :
bogomips : 5616.00
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
Here is the recreated test code:
#include <time.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>
// compiled gcc -Wall -O3 -o clockt clockt.cpp
// called glockt sleeptime trials loops
unsigned long long now()
struct timespec s;
clock_gettime(CLOCK_REALTIME, &s);
return (s.tv_sec * 1000000000ull) + s.tv_nsec;
int main(int argc, char **argv)
int sleeptime = atoi(argv[1]);
int trials = atoi(argv[2]);
int loops = atoi(argv[3]);
unsigned long long x, y, n = 0;
unsigned int d;
x = __rdtscp(&d);
n = now();
asm volatile("": "+r" (n));
y = __rdtscp(&d);
printf("init run %lldn", (y-x));
for(int t = 0; t < trials; ++t)
if(sleeptime > 0) sleep(sleeptime);
x = __rdtscp(&d);
for(int l = 0; l < loops; ++l)
n = now();
asm volatile("": "+r" (n));
y = __rdtscp(&d);
printf("trial %d took %lldn", t, (y-x));
exit(0);
linux performance optimization timing code-timing
1
You’re probably misinterpreting your results: you assume that your time units are the same in all cases. You have to, at the very least, validate this assumption first before you depend on it. And without the test code, there isn’t much we can do to investigate. Finally, at the very least one needs a/prof/cpuinfo
dump for 1st cpu to ensure apple’s-to-apples comparison. I also assume that this is not a NUMA machine nor a server motherboard, right?
– Kuba Ober
Nov 11 at 19:08
1
vDSO doesn’t guarantee anything either: you’d need to know that the glibc you’re using actually uses the vDSO interfaces. I
– Kuba Ober
Nov 11 at 19:18
@Kuba glibc-2.26-30.fc27 and I've updated the questions with cpuinfo. What do you mean by time units not the same? These results persist over many tune, and the machine is single socket using rtdsc, so there shouldn't be any measurement issues.
– JasonN
Nov 11 at 21:17
@kuba strace verifies that the clock_gettime calls are not syscalls
– JasonN
Nov 11 at 23:20
1
@terry why not? I've seen plenty of code timing calls like gettimeofday for the specific purpose of timing the vDSO vs syscall overhead. I'm not even sure it is the same hardware components on the chip (I don't think clock_gettime uses rdtsc, but I could be wrong).
– JasonN
Nov 13 at 23:06
|
show 3 more comments
up vote
6
down vote
favorite
up vote
6
down vote
favorite
I'm trying to time how long clock_gettime(CLOCK_REALTIME,...)
takes to call. "Back in the day" I used to call it once at the top of a loop since it was a fairly expensive call. But now, I was hoping that with vDSO and some clock improvements, it might not be so slow.
I wrote some test code that used __rdtscp
to time repeated calls to clock_gettime
(the rdtscp
calls went around a loop that called clock_gettime
and added the results together, just so the compiler wouldn't optimize too much away).
If I call clock_gettime()
in quick succession, the length of time goes from about 45k clock cycles down to 500 cycles. Some of this I thought could be contributed to the first call having to load the vDSO code (still doesn't full make sense to me), but how it takes a few calls to get 500 I cannot explain at all, and this behavior seems to be constant regardless of how I test it:
42467
1114
1077
496
455
However, if I sleep (for a second or ten, doesn't matter) between calls to clock_gettime, it only reaches a steady state of about 4.7k cycles:
Here at 10 second sleeps:
28293
1093
4729
4756
4736
Here at 1 second sleeps:
61578
855
4753
4741
5645
4753
4732
Cache behavior wouldn't seem to describe this (on a desktop system not doing much of anything). How much should I budget for a call to clock_gettime? Why does it get progressively faster to call? Why does sleeping a small amount of time matter so much?
tl;dr I'm trying to understand the time it takes to call clock_gettime(CLOCK_REALTIME,...)
don't understand why it runs faster when called in quick succession as opposed to a second between calls.
Update: Here's the cpuinfo on proc 0
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 158
model name : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0x84
cpu MHz : 2800.000
cache size : 6144 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs :
bogomips : 5616.00
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
Here is the recreated test code:
#include <time.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>
// compiled gcc -Wall -O3 -o clockt clockt.cpp
// called glockt sleeptime trials loops
unsigned long long now()
struct timespec s;
clock_gettime(CLOCK_REALTIME, &s);
return (s.tv_sec * 1000000000ull) + s.tv_nsec;
int main(int argc, char **argv)
int sleeptime = atoi(argv[1]);
int trials = atoi(argv[2]);
int loops = atoi(argv[3]);
unsigned long long x, y, n = 0;
unsigned int d;
x = __rdtscp(&d);
n = now();
asm volatile("": "+r" (n));
y = __rdtscp(&d);
printf("init run %lldn", (y-x));
for(int t = 0; t < trials; ++t)
if(sleeptime > 0) sleep(sleeptime);
x = __rdtscp(&d);
for(int l = 0; l < loops; ++l)
n = now();
asm volatile("": "+r" (n));
y = __rdtscp(&d);
printf("trial %d took %lldn", t, (y-x));
exit(0);
linux performance optimization timing code-timing
I'm trying to time how long clock_gettime(CLOCK_REALTIME,...)
takes to call. "Back in the day" I used to call it once at the top of a loop since it was a fairly expensive call. But now, I was hoping that with vDSO and some clock improvements, it might not be so slow.
I wrote some test code that used __rdtscp
to time repeated calls to clock_gettime
(the rdtscp
calls went around a loop that called clock_gettime
and added the results together, just so the compiler wouldn't optimize too much away).
If I call clock_gettime()
in quick succession, the length of time goes from about 45k clock cycles down to 500 cycles. Some of this I thought could be contributed to the first call having to load the vDSO code (still doesn't full make sense to me), but how it takes a few calls to get 500 I cannot explain at all, and this behavior seems to be constant regardless of how I test it:
42467
1114
1077
496
455
However, if I sleep (for a second or ten, doesn't matter) between calls to clock_gettime, it only reaches a steady state of about 4.7k cycles:
Here at 10 second sleeps:
28293
1093
4729
4756
4736
Here at 1 second sleeps:
61578
855
4753
4741
5645
4753
4732
Cache behavior wouldn't seem to describe this (on a desktop system not doing much of anything). How much should I budget for a call to clock_gettime? Why does it get progressively faster to call? Why does sleeping a small amount of time matter so much?
tl;dr I'm trying to understand the time it takes to call clock_gettime(CLOCK_REALTIME,...)
don't understand why it runs faster when called in quick succession as opposed to a second between calls.
Update: Here's the cpuinfo on proc 0
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 158
model name : Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz
stepping : 9
microcode : 0x84
cpu MHz : 2800.000
cache size : 6144 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 22
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp
bugs :
bogomips : 5616.00
clflush size : 64
cache_alignment : 64
address sizes : 39 bits physical, 48 bits virtual
power management:
Here is the recreated test code:
#include <time.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <x86intrin.h>
// compiled gcc -Wall -O3 -o clockt clockt.cpp
// called glockt sleeptime trials loops
unsigned long long now()
struct timespec s;
clock_gettime(CLOCK_REALTIME, &s);
return (s.tv_sec * 1000000000ull) + s.tv_nsec;
int main(int argc, char **argv)
int sleeptime = atoi(argv[1]);
int trials = atoi(argv[2]);
int loops = atoi(argv[3]);
unsigned long long x, y, n = 0;
unsigned int d;
x = __rdtscp(&d);
n = now();
asm volatile("": "+r" (n));
y = __rdtscp(&d);
printf("init run %lldn", (y-x));
for(int t = 0; t < trials; ++t)
if(sleeptime > 0) sleep(sleeptime);
x = __rdtscp(&d);
for(int l = 0; l < loops; ++l)
n = now();
asm volatile("": "+r" (n));
y = __rdtscp(&d);
printf("trial %d took %lldn", t, (y-x));
exit(0);
linux performance optimization timing code-timing
linux performance optimization timing code-timing
edited Nov 13 at 21:05
asked Nov 11 at 18:53
JasonN
750720
750720
1
You’re probably misinterpreting your results: you assume that your time units are the same in all cases. You have to, at the very least, validate this assumption first before you depend on it. And without the test code, there isn’t much we can do to investigate. Finally, at the very least one needs a/prof/cpuinfo
dump for 1st cpu to ensure apple’s-to-apples comparison. I also assume that this is not a NUMA machine nor a server motherboard, right?
– Kuba Ober
Nov 11 at 19:08
1
vDSO doesn’t guarantee anything either: you’d need to know that the glibc you’re using actually uses the vDSO interfaces. I
– Kuba Ober
Nov 11 at 19:18
@Kuba glibc-2.26-30.fc27 and I've updated the questions with cpuinfo. What do you mean by time units not the same? These results persist over many tune, and the machine is single socket using rtdsc, so there shouldn't be any measurement issues.
– JasonN
Nov 11 at 21:17
@kuba strace verifies that the clock_gettime calls are not syscalls
– JasonN
Nov 11 at 23:20
1
@terry why not? I've seen plenty of code timing calls like gettimeofday for the specific purpose of timing the vDSO vs syscall overhead. I'm not even sure it is the same hardware components on the chip (I don't think clock_gettime uses rdtsc, but I could be wrong).
– JasonN
Nov 13 at 23:06
|
show 3 more comments
1
You’re probably misinterpreting your results: you assume that your time units are the same in all cases. You have to, at the very least, validate this assumption first before you depend on it. And without the test code, there isn’t much we can do to investigate. Finally, at the very least one needs a/prof/cpuinfo
dump for 1st cpu to ensure apple’s-to-apples comparison. I also assume that this is not a NUMA machine nor a server motherboard, right?
– Kuba Ober
Nov 11 at 19:08
1
vDSO doesn’t guarantee anything either: you’d need to know that the glibc you’re using actually uses the vDSO interfaces. I
– Kuba Ober
Nov 11 at 19:18
@Kuba glibc-2.26-30.fc27 and I've updated the questions with cpuinfo. What do you mean by time units not the same? These results persist over many tune, and the machine is single socket using rtdsc, so there shouldn't be any measurement issues.
– JasonN
Nov 11 at 21:17
@kuba strace verifies that the clock_gettime calls are not syscalls
– JasonN
Nov 11 at 23:20
1
@terry why not? I've seen plenty of code timing calls like gettimeofday for the specific purpose of timing the vDSO vs syscall overhead. I'm not even sure it is the same hardware components on the chip (I don't think clock_gettime uses rdtsc, but I could be wrong).
– JasonN
Nov 13 at 23:06
1
1
You’re probably misinterpreting your results: you assume that your time units are the same in all cases. You have to, at the very least, validate this assumption first before you depend on it. And without the test code, there isn’t much we can do to investigate. Finally, at the very least one needs a
/prof/cpuinfo
dump for 1st cpu to ensure apple’s-to-apples comparison. I also assume that this is not a NUMA machine nor a server motherboard, right?– Kuba Ober
Nov 11 at 19:08
You’re probably misinterpreting your results: you assume that your time units are the same in all cases. You have to, at the very least, validate this assumption first before you depend on it. And without the test code, there isn’t much we can do to investigate. Finally, at the very least one needs a
/prof/cpuinfo
dump for 1st cpu to ensure apple’s-to-apples comparison. I also assume that this is not a NUMA machine nor a server motherboard, right?– Kuba Ober
Nov 11 at 19:08
1
1
vDSO doesn’t guarantee anything either: you’d need to know that the glibc you’re using actually uses the vDSO interfaces. I
– Kuba Ober
Nov 11 at 19:18
vDSO doesn’t guarantee anything either: you’d need to know that the glibc you’re using actually uses the vDSO interfaces. I
– Kuba Ober
Nov 11 at 19:18
@Kuba glibc-2.26-30.fc27 and I've updated the questions with cpuinfo. What do you mean by time units not the same? These results persist over many tune, and the machine is single socket using rtdsc, so there shouldn't be any measurement issues.
– JasonN
Nov 11 at 21:17
@Kuba glibc-2.26-30.fc27 and I've updated the questions with cpuinfo. What do you mean by time units not the same? These results persist over many tune, and the machine is single socket using rtdsc, so there shouldn't be any measurement issues.
– JasonN
Nov 11 at 21:17
@kuba strace verifies that the clock_gettime calls are not syscalls
– JasonN
Nov 11 at 23:20
@kuba strace verifies that the clock_gettime calls are not syscalls
– JasonN
Nov 11 at 23:20
1
1
@terry why not? I've seen plenty of code timing calls like gettimeofday for the specific purpose of timing the vDSO vs syscall overhead. I'm not even sure it is the same hardware components on the chip (I don't think clock_gettime uses rdtsc, but I could be wrong).
– JasonN
Nov 13 at 23:06
@terry why not? I've seen plenty of code timing calls like gettimeofday for the specific purpose of timing the vDSO vs syscall overhead. I'm not even sure it is the same hardware components on the chip (I don't think clock_gettime uses rdtsc, but I could be wrong).
– JasonN
Nov 13 at 23:06
|
show 3 more comments
2 Answers
2
active
oldest
votes
up vote
1
down vote
accepted
The very first time clock_gettime
is called, a page fault occurs on the page that contains the instructions of that function. On my system, this is a soft page fault and it takes a few thousands of cycles to handle (up to 10,000 cycles). My CPU is running at 3.4GHz. I think that your CPU is running at a much lower frequency, so handling the page fault on your system would take more time. But the point here is that the first call to clock_gettime
will take much more time than later calls, which is what you are observing.
The second major effect that your code is exhibiting is significant stalls due to instruction cache misses. It might appear that you are only calling two functions, namely now
and printf
, but these functions call other functions and they all compete on the L1 instruction cache. Overall, it depends on how all of these functions are aligned in the physical address space. When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small (you can measure this using the ICACHE.IFETCH_STALL
performance counter). However, when the sleep time is larger than zero seconds, this stall time becomes significantly larger because the OS will schedule some other thread to run on the same core and that thread would different instructions and data. This explains why when you sleep, clock_gettime
takes more time to execute.
Now regarding the second and later measurements. From the question:
42467
1114
1077
496
455
I have observed on my system that the second measurement is not necessarily larger than later measurements. I believe this is also true on your system. In fact, this seems to be the case when you sleep for 10 seconds or 1 second. In the outer loop, the two functions now
and printf
contain thousands of dynamic instructions and they also access the L1 data cache. The variability that you see between the second and later measurements is reproducible. So it's inherent in the functions themselves. Note that the execution time of rdtscp
instruction itself may vary by 4 cycles. See also this.
In practice, the clock_gettime
is useful when the desired precision is at most a million cycles. Otherwise, it can be misleading.
thanks for looking at this, but I don't understand what you are trying to say here: "When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small.... However, when the sleep time is larger than zero seconds, this stall time becomes significantly larger."
– JasonN
Nov 18 at 23:23
if the icache theory is correct, I should see it in the perf counters, but also if I write the timing values to an array and printf them at the end, i should see the time go down to that ~500 cycle level. I'll test it in the next couple days, thanks.
– JasonN
Nov 18 at 23:26
1
@JasonN This explains the impact of callingsleep
and passing an argument larger than zero seconds. It takes more time to callclock_gettime
(about 4k vs. less than 1k) because of instruction cache misses. You can verify using the ICACHE.IFETCH_STALL performance counter.
– Hadi Brais
Nov 18 at 23:26
Well, perf does show an increase of 22k to 50k L1I cache misses when sleep goes from 0 to 1, and along with a few other test that seem to show that thesleep 1
is causing more i-cache misses. I'm still not fully understanding why and need to think about it more. How do I get at that perf counter? I've been suingperf -e L1-icache-load-misses
. That would mean that about 4.5k cycles is the time it would take to call an vdso function during regular operation. ugh.
– JasonN
Nov 20 at 16:10
2
@JasonN But sleeping is not a regular operation. My guess is that the OS is scheduling some other thread to run on the same core when going into sleep. That other thread seems to be evicting the instruction cache lines that containnow
and/or the functions it calls. Thus, when the thread wakes up, it may either run on the same core or some other core. In both cases, it would exhibit significant stalls due to instruction misses.
– Hadi Brais
Nov 20 at 18:48
|
show 2 more comments
up vote
0
down vote
I could not reproduce your results. Even with large sleep times (10 seconds), and a small number of loops (100), I always get timing of less than 100 clocks (less than 38 ns on my 2.6 GHz system).
For example:
./clockt 10 10 100
init run 14896
trial 0 took 8870 (88 cycles per call)
trial 1 took 8316 (83 cycles per call)
trial 2 took 8384 (83 cycles per call)
trial 3 took 8796 (87 cycles per call)
trial 4 took 9424 (94 cycles per call)
trial 5 took 9054 (90 cycles per call)
trial 6 took 8394 (83 cycles per call)
trial 7 took 8346 (83 cycles per call)
trial 8 took 8868 (88 cycles per call)
trial 9 took 8930 (89 cycles per call)
Outside of measurement or user-error (always the most likely reason), the most likely explanation is that your system is not using rdtsc
as a time source, so a system call is made. You can configure the clock source explicitly yourself, or otherwise some heuristic is used which will select rdtsc
-based clock_gettime
only if it seems suitable on the current system.
The second most likely reason is that clock_gettime(CLOCK_REALTIME)
doesn't go through the VDSO on your system, so is a system call even if rdtsc
is ultimately being used. I guess that could be due to an old libc version or something like that.
The third most likely reason is that rdtsc
on your system is slow, perhaps because it is virtualized or disabled on your system, and implement via a VM exit or OS trap.
Single Loop Results
Trying with a single clock_gettime
call per loop, I still get "fast" results after the first few trials. For example, ./clockt 0 20 1
gives:
init run 15932
trial 0 took 352 (352 cycles per call)
trial 1 took 196 (196 cycles per call)
trial 2 took 104 (104 cycles per call)
trial 3 took 104 (104 cycles per call)
trial 4 took 104 (104 cycles per call)
trial 5 took 104 (104 cycles per call)
trial 6 took 102 (102 cycles per call)
trial 7 took 104 (104 cycles per call)
...
Note that I made one modification to the test program, to print the time per call, which seems more useful than the total time. The printf
line was modified to:
printf("trial %d took %lld (%lld cycles per call)n", t, (y-x), (y-x)/loops);
When you average across mutiple calls per trial, you get the numbers you see because the first call takes about 4.7k cycles and the rest are about 50 cycles, so the trial time for 100 calls is4700 + 99*50
which is 96.5 average, close to your numbers. I know there isn't a syscall happening because strace shows none, and I double checked the call address too in gdb. There might be a misconfiguration on my machine, but I'm in full control of it (Fedora 27), and I'm not using a VM. Still a little confused by this whole thing. The I-cache answer still leaves a few questions.
– JasonN
Nov 26 at 0:03
@JasonN - yes my answer wasn't so much based on the transient behavior, but the long run behavior which converges around 25 ns. You mentioned in your question "the length of time goes from about 45k clock cycles down to 500 cycles" so I took 500 cycles (more than 100 ns) to be your "long run, frequency calls" measurement, which is very different than mine. If you are still getting that time, something is unusual. The transient behavior when you have been sleeping etc is mostly an entirely different matter from the best case time, and Hadi had some good ideas there.
– BeeOnRope
Nov 27 at 17:40
You might want to check if the 500 cycle time is still accurate for 1000+ back-to-back calls (no sleeping). If it is, one could certainly dig into that discrepancy.
– BeeOnRope
Nov 27 at 17:41
1
You can reproduce the pattern by passing a value of 1 to the third argumentloops
. If you pass a larger value, then you'll get the execution time of the whole inner loop, not individual calls. The timings mentioned in the question are for the case whereloops
is 1. But note thatinit run
still measures the time for the very first call (which is outside the loop). This particular number varies significantly from one run to another and from one system to another (which makes sense). But the whole pattern is reproducible always.
– Hadi Brais
Nov 27 at 20:15
@HadiBrais - in that case, the results quickly stabilize to ~100 cycles on my system, not 500. Since it includesrdtsc
overhead, this seems within the range of the results with more iterations per loop. So I'm not seeing long results even with only one call every trial. I should clarify that I modified the original code to print time per call as well as total time (see output at the end of my answer), so when I say 65 cycles it is per call.
– BeeOnRope
Nov 27 at 21:34
|
show 6 more comments
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
1
down vote
accepted
The very first time clock_gettime
is called, a page fault occurs on the page that contains the instructions of that function. On my system, this is a soft page fault and it takes a few thousands of cycles to handle (up to 10,000 cycles). My CPU is running at 3.4GHz. I think that your CPU is running at a much lower frequency, so handling the page fault on your system would take more time. But the point here is that the first call to clock_gettime
will take much more time than later calls, which is what you are observing.
The second major effect that your code is exhibiting is significant stalls due to instruction cache misses. It might appear that you are only calling two functions, namely now
and printf
, but these functions call other functions and they all compete on the L1 instruction cache. Overall, it depends on how all of these functions are aligned in the physical address space. When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small (you can measure this using the ICACHE.IFETCH_STALL
performance counter). However, when the sleep time is larger than zero seconds, this stall time becomes significantly larger because the OS will schedule some other thread to run on the same core and that thread would different instructions and data. This explains why when you sleep, clock_gettime
takes more time to execute.
Now regarding the second and later measurements. From the question:
42467
1114
1077
496
455
I have observed on my system that the second measurement is not necessarily larger than later measurements. I believe this is also true on your system. In fact, this seems to be the case when you sleep for 10 seconds or 1 second. In the outer loop, the two functions now
and printf
contain thousands of dynamic instructions and they also access the L1 data cache. The variability that you see between the second and later measurements is reproducible. So it's inherent in the functions themselves. Note that the execution time of rdtscp
instruction itself may vary by 4 cycles. See also this.
In practice, the clock_gettime
is useful when the desired precision is at most a million cycles. Otherwise, it can be misleading.
thanks for looking at this, but I don't understand what you are trying to say here: "When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small.... However, when the sleep time is larger than zero seconds, this stall time becomes significantly larger."
– JasonN
Nov 18 at 23:23
if the icache theory is correct, I should see it in the perf counters, but also if I write the timing values to an array and printf them at the end, i should see the time go down to that ~500 cycle level. I'll test it in the next couple days, thanks.
– JasonN
Nov 18 at 23:26
1
@JasonN This explains the impact of callingsleep
and passing an argument larger than zero seconds. It takes more time to callclock_gettime
(about 4k vs. less than 1k) because of instruction cache misses. You can verify using the ICACHE.IFETCH_STALL performance counter.
– Hadi Brais
Nov 18 at 23:26
Well, perf does show an increase of 22k to 50k L1I cache misses when sleep goes from 0 to 1, and along with a few other test that seem to show that thesleep 1
is causing more i-cache misses. I'm still not fully understanding why and need to think about it more. How do I get at that perf counter? I've been suingperf -e L1-icache-load-misses
. That would mean that about 4.5k cycles is the time it would take to call an vdso function during regular operation. ugh.
– JasonN
Nov 20 at 16:10
2
@JasonN But sleeping is not a regular operation. My guess is that the OS is scheduling some other thread to run on the same core when going into sleep. That other thread seems to be evicting the instruction cache lines that containnow
and/or the functions it calls. Thus, when the thread wakes up, it may either run on the same core or some other core. In both cases, it would exhibit significant stalls due to instruction misses.
– Hadi Brais
Nov 20 at 18:48
|
show 2 more comments
up vote
1
down vote
accepted
The very first time clock_gettime
is called, a page fault occurs on the page that contains the instructions of that function. On my system, this is a soft page fault and it takes a few thousands of cycles to handle (up to 10,000 cycles). My CPU is running at 3.4GHz. I think that your CPU is running at a much lower frequency, so handling the page fault on your system would take more time. But the point here is that the first call to clock_gettime
will take much more time than later calls, which is what you are observing.
The second major effect that your code is exhibiting is significant stalls due to instruction cache misses. It might appear that you are only calling two functions, namely now
and printf
, but these functions call other functions and they all compete on the L1 instruction cache. Overall, it depends on how all of these functions are aligned in the physical address space. When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small (you can measure this using the ICACHE.IFETCH_STALL
performance counter). However, when the sleep time is larger than zero seconds, this stall time becomes significantly larger because the OS will schedule some other thread to run on the same core and that thread would different instructions and data. This explains why when you sleep, clock_gettime
takes more time to execute.
Now regarding the second and later measurements. From the question:
42467
1114
1077
496
455
I have observed on my system that the second measurement is not necessarily larger than later measurements. I believe this is also true on your system. In fact, this seems to be the case when you sleep for 10 seconds or 1 second. In the outer loop, the two functions now
and printf
contain thousands of dynamic instructions and they also access the L1 data cache. The variability that you see between the second and later measurements is reproducible. So it's inherent in the functions themselves. Note that the execution time of rdtscp
instruction itself may vary by 4 cycles. See also this.
In practice, the clock_gettime
is useful when the desired precision is at most a million cycles. Otherwise, it can be misleading.
thanks for looking at this, but I don't understand what you are trying to say here: "When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small.... However, when the sleep time is larger than zero seconds, this stall time becomes significantly larger."
– JasonN
Nov 18 at 23:23
if the icache theory is correct, I should see it in the perf counters, but also if I write the timing values to an array and printf them at the end, i should see the time go down to that ~500 cycle level. I'll test it in the next couple days, thanks.
– JasonN
Nov 18 at 23:26
1
@JasonN This explains the impact of callingsleep
and passing an argument larger than zero seconds. It takes more time to callclock_gettime
(about 4k vs. less than 1k) because of instruction cache misses. You can verify using the ICACHE.IFETCH_STALL performance counter.
– Hadi Brais
Nov 18 at 23:26
Well, perf does show an increase of 22k to 50k L1I cache misses when sleep goes from 0 to 1, and along with a few other test that seem to show that thesleep 1
is causing more i-cache misses. I'm still not fully understanding why and need to think about it more. How do I get at that perf counter? I've been suingperf -e L1-icache-load-misses
. That would mean that about 4.5k cycles is the time it would take to call an vdso function during regular operation. ugh.
– JasonN
Nov 20 at 16:10
2
@JasonN But sleeping is not a regular operation. My guess is that the OS is scheduling some other thread to run on the same core when going into sleep. That other thread seems to be evicting the instruction cache lines that containnow
and/or the functions it calls. Thus, when the thread wakes up, it may either run on the same core or some other core. In both cases, it would exhibit significant stalls due to instruction misses.
– Hadi Brais
Nov 20 at 18:48
|
show 2 more comments
up vote
1
down vote
accepted
up vote
1
down vote
accepted
The very first time clock_gettime
is called, a page fault occurs on the page that contains the instructions of that function. On my system, this is a soft page fault and it takes a few thousands of cycles to handle (up to 10,000 cycles). My CPU is running at 3.4GHz. I think that your CPU is running at a much lower frequency, so handling the page fault on your system would take more time. But the point here is that the first call to clock_gettime
will take much more time than later calls, which is what you are observing.
The second major effect that your code is exhibiting is significant stalls due to instruction cache misses. It might appear that you are only calling two functions, namely now
and printf
, but these functions call other functions and they all compete on the L1 instruction cache. Overall, it depends on how all of these functions are aligned in the physical address space. When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small (you can measure this using the ICACHE.IFETCH_STALL
performance counter). However, when the sleep time is larger than zero seconds, this stall time becomes significantly larger because the OS will schedule some other thread to run on the same core and that thread would different instructions and data. This explains why when you sleep, clock_gettime
takes more time to execute.
Now regarding the second and later measurements. From the question:
42467
1114
1077
496
455
I have observed on my system that the second measurement is not necessarily larger than later measurements. I believe this is also true on your system. In fact, this seems to be the case when you sleep for 10 seconds or 1 second. In the outer loop, the two functions now
and printf
contain thousands of dynamic instructions and they also access the L1 data cache. The variability that you see between the second and later measurements is reproducible. So it's inherent in the functions themselves. Note that the execution time of rdtscp
instruction itself may vary by 4 cycles. See also this.
In practice, the clock_gettime
is useful when the desired precision is at most a million cycles. Otherwise, it can be misleading.
The very first time clock_gettime
is called, a page fault occurs on the page that contains the instructions of that function. On my system, this is a soft page fault and it takes a few thousands of cycles to handle (up to 10,000 cycles). My CPU is running at 3.4GHz. I think that your CPU is running at a much lower frequency, so handling the page fault on your system would take more time. But the point here is that the first call to clock_gettime
will take much more time than later calls, which is what you are observing.
The second major effect that your code is exhibiting is significant stalls due to instruction cache misses. It might appear that you are only calling two functions, namely now
and printf
, but these functions call other functions and they all compete on the L1 instruction cache. Overall, it depends on how all of these functions are aligned in the physical address space. When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small (you can measure this using the ICACHE.IFETCH_STALL
performance counter). However, when the sleep time is larger than zero seconds, this stall time becomes significantly larger because the OS will schedule some other thread to run on the same core and that thread would different instructions and data. This explains why when you sleep, clock_gettime
takes more time to execute.
Now regarding the second and later measurements. From the question:
42467
1114
1077
496
455
I have observed on my system that the second measurement is not necessarily larger than later measurements. I believe this is also true on your system. In fact, this seems to be the case when you sleep for 10 seconds or 1 second. In the outer loop, the two functions now
and printf
contain thousands of dynamic instructions and they also access the L1 data cache. The variability that you see between the second and later measurements is reproducible. So it's inherent in the functions themselves. Note that the execution time of rdtscp
instruction itself may vary by 4 cycles. See also this.
In practice, the clock_gettime
is useful when the desired precision is at most a million cycles. Otherwise, it can be misleading.
edited Nov 19 at 16:22
answered Nov 18 at 23:16
Hadi Brais
9,02211737
9,02211737
thanks for looking at this, but I don't understand what you are trying to say here: "When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small.... However, when the sleep time is larger than zero seconds, this stall time becomes significantly larger."
– JasonN
Nov 18 at 23:23
if the icache theory is correct, I should see it in the perf counters, but also if I write the timing values to an array and printf them at the end, i should see the time go down to that ~500 cycle level. I'll test it in the next couple days, thanks.
– JasonN
Nov 18 at 23:26
1
@JasonN This explains the impact of callingsleep
and passing an argument larger than zero seconds. It takes more time to callclock_gettime
(about 4k vs. less than 1k) because of instruction cache misses. You can verify using the ICACHE.IFETCH_STALL performance counter.
– Hadi Brais
Nov 18 at 23:26
Well, perf does show an increase of 22k to 50k L1I cache misses when sleep goes from 0 to 1, and along with a few other test that seem to show that thesleep 1
is causing more i-cache misses. I'm still not fully understanding why and need to think about it more. How do I get at that perf counter? I've been suingperf -e L1-icache-load-misses
. That would mean that about 4.5k cycles is the time it would take to call an vdso function during regular operation. ugh.
– JasonN
Nov 20 at 16:10
2
@JasonN But sleeping is not a regular operation. My guess is that the OS is scheduling some other thread to run on the same core when going into sleep. That other thread seems to be evicting the instruction cache lines that containnow
and/or the functions it calls. Thus, when the thread wakes up, it may either run on the same core or some other core. In both cases, it would exhibit significant stalls due to instruction misses.
– Hadi Brais
Nov 20 at 18:48
|
show 2 more comments
thanks for looking at this, but I don't understand what you are trying to say here: "When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small.... However, when the sleep time is larger than zero seconds, this stall time becomes significantly larger."
– JasonN
Nov 18 at 23:23
if the icache theory is correct, I should see it in the perf counters, but also if I write the timing values to an array and printf them at the end, i should see the time go down to that ~500 cycle level. I'll test it in the next couple days, thanks.
– JasonN
Nov 18 at 23:26
1
@JasonN This explains the impact of callingsleep
and passing an argument larger than zero seconds. It takes more time to callclock_gettime
(about 4k vs. less than 1k) because of instruction cache misses. You can verify using the ICACHE.IFETCH_STALL performance counter.
– Hadi Brais
Nov 18 at 23:26
Well, perf does show an increase of 22k to 50k L1I cache misses when sleep goes from 0 to 1, and along with a few other test that seem to show that thesleep 1
is causing more i-cache misses. I'm still not fully understanding why and need to think about it more. How do I get at that perf counter? I've been suingperf -e L1-icache-load-misses
. That would mean that about 4.5k cycles is the time it would take to call an vdso function during regular operation. ugh.
– JasonN
Nov 20 at 16:10
2
@JasonN But sleeping is not a regular operation. My guess is that the OS is scheduling some other thread to run on the same core when going into sleep. That other thread seems to be evicting the instruction cache lines that containnow
and/or the functions it calls. Thus, when the thread wakes up, it may either run on the same core or some other core. In both cases, it would exhibit significant stalls due to instruction misses.
– Hadi Brais
Nov 20 at 18:48
thanks for looking at this, but I don't understand what you are trying to say here: "When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small.... However, when the sleep time is larger than zero seconds, this stall time becomes significantly larger."
– JasonN
Nov 18 at 23:23
thanks for looking at this, but I don't understand what you are trying to say here: "When the sleep time is zero seconds, the stall time due to instruction cache misses is actually relatively small.... However, when the sleep time is larger than zero seconds, this stall time becomes significantly larger."
– JasonN
Nov 18 at 23:23
if the icache theory is correct, I should see it in the perf counters, but also if I write the timing values to an array and printf them at the end, i should see the time go down to that ~500 cycle level. I'll test it in the next couple days, thanks.
– JasonN
Nov 18 at 23:26
if the icache theory is correct, I should see it in the perf counters, but also if I write the timing values to an array and printf them at the end, i should see the time go down to that ~500 cycle level. I'll test it in the next couple days, thanks.
– JasonN
Nov 18 at 23:26
1
1
@JasonN This explains the impact of calling
sleep
and passing an argument larger than zero seconds. It takes more time to call clock_gettime
(about 4k vs. less than 1k) because of instruction cache misses. You can verify using the ICACHE.IFETCH_STALL performance counter.– Hadi Brais
Nov 18 at 23:26
@JasonN This explains the impact of calling
sleep
and passing an argument larger than zero seconds. It takes more time to call clock_gettime
(about 4k vs. less than 1k) because of instruction cache misses. You can verify using the ICACHE.IFETCH_STALL performance counter.– Hadi Brais
Nov 18 at 23:26
Well, perf does show an increase of 22k to 50k L1I cache misses when sleep goes from 0 to 1, and along with a few other test that seem to show that the
sleep 1
is causing more i-cache misses. I'm still not fully understanding why and need to think about it more. How do I get at that perf counter? I've been suing perf -e L1-icache-load-misses
. That would mean that about 4.5k cycles is the time it would take to call an vdso function during regular operation. ugh.– JasonN
Nov 20 at 16:10
Well, perf does show an increase of 22k to 50k L1I cache misses when sleep goes from 0 to 1, and along with a few other test that seem to show that the
sleep 1
is causing more i-cache misses. I'm still not fully understanding why and need to think about it more. How do I get at that perf counter? I've been suing perf -e L1-icache-load-misses
. That would mean that about 4.5k cycles is the time it would take to call an vdso function during regular operation. ugh.– JasonN
Nov 20 at 16:10
2
2
@JasonN But sleeping is not a regular operation. My guess is that the OS is scheduling some other thread to run on the same core when going into sleep. That other thread seems to be evicting the instruction cache lines that contain
now
and/or the functions it calls. Thus, when the thread wakes up, it may either run on the same core or some other core. In both cases, it would exhibit significant stalls due to instruction misses.– Hadi Brais
Nov 20 at 18:48
@JasonN But sleeping is not a regular operation. My guess is that the OS is scheduling some other thread to run on the same core when going into sleep. That other thread seems to be evicting the instruction cache lines that contain
now
and/or the functions it calls. Thus, when the thread wakes up, it may either run on the same core or some other core. In both cases, it would exhibit significant stalls due to instruction misses.– Hadi Brais
Nov 20 at 18:48
|
show 2 more comments
up vote
0
down vote
I could not reproduce your results. Even with large sleep times (10 seconds), and a small number of loops (100), I always get timing of less than 100 clocks (less than 38 ns on my 2.6 GHz system).
For example:
./clockt 10 10 100
init run 14896
trial 0 took 8870 (88 cycles per call)
trial 1 took 8316 (83 cycles per call)
trial 2 took 8384 (83 cycles per call)
trial 3 took 8796 (87 cycles per call)
trial 4 took 9424 (94 cycles per call)
trial 5 took 9054 (90 cycles per call)
trial 6 took 8394 (83 cycles per call)
trial 7 took 8346 (83 cycles per call)
trial 8 took 8868 (88 cycles per call)
trial 9 took 8930 (89 cycles per call)
Outside of measurement or user-error (always the most likely reason), the most likely explanation is that your system is not using rdtsc
as a time source, so a system call is made. You can configure the clock source explicitly yourself, or otherwise some heuristic is used which will select rdtsc
-based clock_gettime
only if it seems suitable on the current system.
The second most likely reason is that clock_gettime(CLOCK_REALTIME)
doesn't go through the VDSO on your system, so is a system call even if rdtsc
is ultimately being used. I guess that could be due to an old libc version or something like that.
The third most likely reason is that rdtsc
on your system is slow, perhaps because it is virtualized or disabled on your system, and implement via a VM exit or OS trap.
Single Loop Results
Trying with a single clock_gettime
call per loop, I still get "fast" results after the first few trials. For example, ./clockt 0 20 1
gives:
init run 15932
trial 0 took 352 (352 cycles per call)
trial 1 took 196 (196 cycles per call)
trial 2 took 104 (104 cycles per call)
trial 3 took 104 (104 cycles per call)
trial 4 took 104 (104 cycles per call)
trial 5 took 104 (104 cycles per call)
trial 6 took 102 (102 cycles per call)
trial 7 took 104 (104 cycles per call)
...
Note that I made one modification to the test program, to print the time per call, which seems more useful than the total time. The printf
line was modified to:
printf("trial %d took %lld (%lld cycles per call)n", t, (y-x), (y-x)/loops);
When you average across mutiple calls per trial, you get the numbers you see because the first call takes about 4.7k cycles and the rest are about 50 cycles, so the trial time for 100 calls is4700 + 99*50
which is 96.5 average, close to your numbers. I know there isn't a syscall happening because strace shows none, and I double checked the call address too in gdb. There might be a misconfiguration on my machine, but I'm in full control of it (Fedora 27), and I'm not using a VM. Still a little confused by this whole thing. The I-cache answer still leaves a few questions.
– JasonN
Nov 26 at 0:03
@JasonN - yes my answer wasn't so much based on the transient behavior, but the long run behavior which converges around 25 ns. You mentioned in your question "the length of time goes from about 45k clock cycles down to 500 cycles" so I took 500 cycles (more than 100 ns) to be your "long run, frequency calls" measurement, which is very different than mine. If you are still getting that time, something is unusual. The transient behavior when you have been sleeping etc is mostly an entirely different matter from the best case time, and Hadi had some good ideas there.
– BeeOnRope
Nov 27 at 17:40
You might want to check if the 500 cycle time is still accurate for 1000+ back-to-back calls (no sleeping). If it is, one could certainly dig into that discrepancy.
– BeeOnRope
Nov 27 at 17:41
1
You can reproduce the pattern by passing a value of 1 to the third argumentloops
. If you pass a larger value, then you'll get the execution time of the whole inner loop, not individual calls. The timings mentioned in the question are for the case whereloops
is 1. But note thatinit run
still measures the time for the very first call (which is outside the loop). This particular number varies significantly from one run to another and from one system to another (which makes sense). But the whole pattern is reproducible always.
– Hadi Brais
Nov 27 at 20:15
@HadiBrais - in that case, the results quickly stabilize to ~100 cycles on my system, not 500. Since it includesrdtsc
overhead, this seems within the range of the results with more iterations per loop. So I'm not seeing long results even with only one call every trial. I should clarify that I modified the original code to print time per call as well as total time (see output at the end of my answer), so when I say 65 cycles it is per call.
– BeeOnRope
Nov 27 at 21:34
|
show 6 more comments
up vote
0
down vote
I could not reproduce your results. Even with large sleep times (10 seconds), and a small number of loops (100), I always get timing of less than 100 clocks (less than 38 ns on my 2.6 GHz system).
For example:
./clockt 10 10 100
init run 14896
trial 0 took 8870 (88 cycles per call)
trial 1 took 8316 (83 cycles per call)
trial 2 took 8384 (83 cycles per call)
trial 3 took 8796 (87 cycles per call)
trial 4 took 9424 (94 cycles per call)
trial 5 took 9054 (90 cycles per call)
trial 6 took 8394 (83 cycles per call)
trial 7 took 8346 (83 cycles per call)
trial 8 took 8868 (88 cycles per call)
trial 9 took 8930 (89 cycles per call)
Outside of measurement or user-error (always the most likely reason), the most likely explanation is that your system is not using rdtsc
as a time source, so a system call is made. You can configure the clock source explicitly yourself, or otherwise some heuristic is used which will select rdtsc
-based clock_gettime
only if it seems suitable on the current system.
The second most likely reason is that clock_gettime(CLOCK_REALTIME)
doesn't go through the VDSO on your system, so is a system call even if rdtsc
is ultimately being used. I guess that could be due to an old libc version or something like that.
The third most likely reason is that rdtsc
on your system is slow, perhaps because it is virtualized or disabled on your system, and implement via a VM exit or OS trap.
Single Loop Results
Trying with a single clock_gettime
call per loop, I still get "fast" results after the first few trials. For example, ./clockt 0 20 1
gives:
init run 15932
trial 0 took 352 (352 cycles per call)
trial 1 took 196 (196 cycles per call)
trial 2 took 104 (104 cycles per call)
trial 3 took 104 (104 cycles per call)
trial 4 took 104 (104 cycles per call)
trial 5 took 104 (104 cycles per call)
trial 6 took 102 (102 cycles per call)
trial 7 took 104 (104 cycles per call)
...
Note that I made one modification to the test program, to print the time per call, which seems more useful than the total time. The printf
line was modified to:
printf("trial %d took %lld (%lld cycles per call)n", t, (y-x), (y-x)/loops);
When you average across mutiple calls per trial, you get the numbers you see because the first call takes about 4.7k cycles and the rest are about 50 cycles, so the trial time for 100 calls is4700 + 99*50
which is 96.5 average, close to your numbers. I know there isn't a syscall happening because strace shows none, and I double checked the call address too in gdb. There might be a misconfiguration on my machine, but I'm in full control of it (Fedora 27), and I'm not using a VM. Still a little confused by this whole thing. The I-cache answer still leaves a few questions.
– JasonN
Nov 26 at 0:03
@JasonN - yes my answer wasn't so much based on the transient behavior, but the long run behavior which converges around 25 ns. You mentioned in your question "the length of time goes from about 45k clock cycles down to 500 cycles" so I took 500 cycles (more than 100 ns) to be your "long run, frequency calls" measurement, which is very different than mine. If you are still getting that time, something is unusual. The transient behavior when you have been sleeping etc is mostly an entirely different matter from the best case time, and Hadi had some good ideas there.
– BeeOnRope
Nov 27 at 17:40
You might want to check if the 500 cycle time is still accurate for 1000+ back-to-back calls (no sleeping). If it is, one could certainly dig into that discrepancy.
– BeeOnRope
Nov 27 at 17:41
1
You can reproduce the pattern by passing a value of 1 to the third argumentloops
. If you pass a larger value, then you'll get the execution time of the whole inner loop, not individual calls. The timings mentioned in the question are for the case whereloops
is 1. But note thatinit run
still measures the time for the very first call (which is outside the loop). This particular number varies significantly from one run to another and from one system to another (which makes sense). But the whole pattern is reproducible always.
– Hadi Brais
Nov 27 at 20:15
@HadiBrais - in that case, the results quickly stabilize to ~100 cycles on my system, not 500. Since it includesrdtsc
overhead, this seems within the range of the results with more iterations per loop. So I'm not seeing long results even with only one call every trial. I should clarify that I modified the original code to print time per call as well as total time (see output at the end of my answer), so when I say 65 cycles it is per call.
– BeeOnRope
Nov 27 at 21:34
|
show 6 more comments
up vote
0
down vote
up vote
0
down vote
I could not reproduce your results. Even with large sleep times (10 seconds), and a small number of loops (100), I always get timing of less than 100 clocks (less than 38 ns on my 2.6 GHz system).
For example:
./clockt 10 10 100
init run 14896
trial 0 took 8870 (88 cycles per call)
trial 1 took 8316 (83 cycles per call)
trial 2 took 8384 (83 cycles per call)
trial 3 took 8796 (87 cycles per call)
trial 4 took 9424 (94 cycles per call)
trial 5 took 9054 (90 cycles per call)
trial 6 took 8394 (83 cycles per call)
trial 7 took 8346 (83 cycles per call)
trial 8 took 8868 (88 cycles per call)
trial 9 took 8930 (89 cycles per call)
Outside of measurement or user-error (always the most likely reason), the most likely explanation is that your system is not using rdtsc
as a time source, so a system call is made. You can configure the clock source explicitly yourself, or otherwise some heuristic is used which will select rdtsc
-based clock_gettime
only if it seems suitable on the current system.
The second most likely reason is that clock_gettime(CLOCK_REALTIME)
doesn't go through the VDSO on your system, so is a system call even if rdtsc
is ultimately being used. I guess that could be due to an old libc version or something like that.
The third most likely reason is that rdtsc
on your system is slow, perhaps because it is virtualized or disabled on your system, and implement via a VM exit or OS trap.
Single Loop Results
Trying with a single clock_gettime
call per loop, I still get "fast" results after the first few trials. For example, ./clockt 0 20 1
gives:
init run 15932
trial 0 took 352 (352 cycles per call)
trial 1 took 196 (196 cycles per call)
trial 2 took 104 (104 cycles per call)
trial 3 took 104 (104 cycles per call)
trial 4 took 104 (104 cycles per call)
trial 5 took 104 (104 cycles per call)
trial 6 took 102 (102 cycles per call)
trial 7 took 104 (104 cycles per call)
...
Note that I made one modification to the test program, to print the time per call, which seems more useful than the total time. The printf
line was modified to:
printf("trial %d took %lld (%lld cycles per call)n", t, (y-x), (y-x)/loops);
I could not reproduce your results. Even with large sleep times (10 seconds), and a small number of loops (100), I always get timing of less than 100 clocks (less than 38 ns on my 2.6 GHz system).
For example:
./clockt 10 10 100
init run 14896
trial 0 took 8870 (88 cycles per call)
trial 1 took 8316 (83 cycles per call)
trial 2 took 8384 (83 cycles per call)
trial 3 took 8796 (87 cycles per call)
trial 4 took 9424 (94 cycles per call)
trial 5 took 9054 (90 cycles per call)
trial 6 took 8394 (83 cycles per call)
trial 7 took 8346 (83 cycles per call)
trial 8 took 8868 (88 cycles per call)
trial 9 took 8930 (89 cycles per call)
Outside of measurement or user-error (always the most likely reason), the most likely explanation is that your system is not using rdtsc
as a time source, so a system call is made. You can configure the clock source explicitly yourself, or otherwise some heuristic is used which will select rdtsc
-based clock_gettime
only if it seems suitable on the current system.
The second most likely reason is that clock_gettime(CLOCK_REALTIME)
doesn't go through the VDSO on your system, so is a system call even if rdtsc
is ultimately being used. I guess that could be due to an old libc version or something like that.
The third most likely reason is that rdtsc
on your system is slow, perhaps because it is virtualized or disabled on your system, and implement via a VM exit or OS trap.
Single Loop Results
Trying with a single clock_gettime
call per loop, I still get "fast" results after the first few trials. For example, ./clockt 0 20 1
gives:
init run 15932
trial 0 took 352 (352 cycles per call)
trial 1 took 196 (196 cycles per call)
trial 2 took 104 (104 cycles per call)
trial 3 took 104 (104 cycles per call)
trial 4 took 104 (104 cycles per call)
trial 5 took 104 (104 cycles per call)
trial 6 took 102 (102 cycles per call)
trial 7 took 104 (104 cycles per call)
...
Note that I made one modification to the test program, to print the time per call, which seems more useful than the total time. The printf
line was modified to:
printf("trial %d took %lld (%lld cycles per call)n", t, (y-x), (y-x)/loops);
edited Nov 27 at 21:33
answered Nov 25 at 23:08
BeeOnRope
24.7k874170
24.7k874170
When you average across mutiple calls per trial, you get the numbers you see because the first call takes about 4.7k cycles and the rest are about 50 cycles, so the trial time for 100 calls is4700 + 99*50
which is 96.5 average, close to your numbers. I know there isn't a syscall happening because strace shows none, and I double checked the call address too in gdb. There might be a misconfiguration on my machine, but I'm in full control of it (Fedora 27), and I'm not using a VM. Still a little confused by this whole thing. The I-cache answer still leaves a few questions.
– JasonN
Nov 26 at 0:03
@JasonN - yes my answer wasn't so much based on the transient behavior, but the long run behavior which converges around 25 ns. You mentioned in your question "the length of time goes from about 45k clock cycles down to 500 cycles" so I took 500 cycles (more than 100 ns) to be your "long run, frequency calls" measurement, which is very different than mine. If you are still getting that time, something is unusual. The transient behavior when you have been sleeping etc is mostly an entirely different matter from the best case time, and Hadi had some good ideas there.
– BeeOnRope
Nov 27 at 17:40
You might want to check if the 500 cycle time is still accurate for 1000+ back-to-back calls (no sleeping). If it is, one could certainly dig into that discrepancy.
– BeeOnRope
Nov 27 at 17:41
1
You can reproduce the pattern by passing a value of 1 to the third argumentloops
. If you pass a larger value, then you'll get the execution time of the whole inner loop, not individual calls. The timings mentioned in the question are for the case whereloops
is 1. But note thatinit run
still measures the time for the very first call (which is outside the loop). This particular number varies significantly from one run to another and from one system to another (which makes sense). But the whole pattern is reproducible always.
– Hadi Brais
Nov 27 at 20:15
@HadiBrais - in that case, the results quickly stabilize to ~100 cycles on my system, not 500. Since it includesrdtsc
overhead, this seems within the range of the results with more iterations per loop. So I'm not seeing long results even with only one call every trial. I should clarify that I modified the original code to print time per call as well as total time (see output at the end of my answer), so when I say 65 cycles it is per call.
– BeeOnRope
Nov 27 at 21:34
|
show 6 more comments
When you average across mutiple calls per trial, you get the numbers you see because the first call takes about 4.7k cycles and the rest are about 50 cycles, so the trial time for 100 calls is4700 + 99*50
which is 96.5 average, close to your numbers. I know there isn't a syscall happening because strace shows none, and I double checked the call address too in gdb. There might be a misconfiguration on my machine, but I'm in full control of it (Fedora 27), and I'm not using a VM. Still a little confused by this whole thing. The I-cache answer still leaves a few questions.
– JasonN
Nov 26 at 0:03
@JasonN - yes my answer wasn't so much based on the transient behavior, but the long run behavior which converges around 25 ns. You mentioned in your question "the length of time goes from about 45k clock cycles down to 500 cycles" so I took 500 cycles (more than 100 ns) to be your "long run, frequency calls" measurement, which is very different than mine. If you are still getting that time, something is unusual. The transient behavior when you have been sleeping etc is mostly an entirely different matter from the best case time, and Hadi had some good ideas there.
– BeeOnRope
Nov 27 at 17:40
You might want to check if the 500 cycle time is still accurate for 1000+ back-to-back calls (no sleeping). If it is, one could certainly dig into that discrepancy.
– BeeOnRope
Nov 27 at 17:41
1
You can reproduce the pattern by passing a value of 1 to the third argumentloops
. If you pass a larger value, then you'll get the execution time of the whole inner loop, not individual calls. The timings mentioned in the question are for the case whereloops
is 1. But note thatinit run
still measures the time for the very first call (which is outside the loop). This particular number varies significantly from one run to another and from one system to another (which makes sense). But the whole pattern is reproducible always.
– Hadi Brais
Nov 27 at 20:15
@HadiBrais - in that case, the results quickly stabilize to ~100 cycles on my system, not 500. Since it includesrdtsc
overhead, this seems within the range of the results with more iterations per loop. So I'm not seeing long results even with only one call every trial. I should clarify that I modified the original code to print time per call as well as total time (see output at the end of my answer), so when I say 65 cycles it is per call.
– BeeOnRope
Nov 27 at 21:34
When you average across mutiple calls per trial, you get the numbers you see because the first call takes about 4.7k cycles and the rest are about 50 cycles, so the trial time for 100 calls is
4700 + 99*50
which is 96.5 average, close to your numbers. I know there isn't a syscall happening because strace shows none, and I double checked the call address too in gdb. There might be a misconfiguration on my machine, but I'm in full control of it (Fedora 27), and I'm not using a VM. Still a little confused by this whole thing. The I-cache answer still leaves a few questions.– JasonN
Nov 26 at 0:03
When you average across mutiple calls per trial, you get the numbers you see because the first call takes about 4.7k cycles and the rest are about 50 cycles, so the trial time for 100 calls is
4700 + 99*50
which is 96.5 average, close to your numbers. I know there isn't a syscall happening because strace shows none, and I double checked the call address too in gdb. There might be a misconfiguration on my machine, but I'm in full control of it (Fedora 27), and I'm not using a VM. Still a little confused by this whole thing. The I-cache answer still leaves a few questions.– JasonN
Nov 26 at 0:03
@JasonN - yes my answer wasn't so much based on the transient behavior, but the long run behavior which converges around 25 ns. You mentioned in your question "the length of time goes from about 45k clock cycles down to 500 cycles" so I took 500 cycles (more than 100 ns) to be your "long run, frequency calls" measurement, which is very different than mine. If you are still getting that time, something is unusual. The transient behavior when you have been sleeping etc is mostly an entirely different matter from the best case time, and Hadi had some good ideas there.
– BeeOnRope
Nov 27 at 17:40
@JasonN - yes my answer wasn't so much based on the transient behavior, but the long run behavior which converges around 25 ns. You mentioned in your question "the length of time goes from about 45k clock cycles down to 500 cycles" so I took 500 cycles (more than 100 ns) to be your "long run, frequency calls" measurement, which is very different than mine. If you are still getting that time, something is unusual. The transient behavior when you have been sleeping etc is mostly an entirely different matter from the best case time, and Hadi had some good ideas there.
– BeeOnRope
Nov 27 at 17:40
You might want to check if the 500 cycle time is still accurate for 1000+ back-to-back calls (no sleeping). If it is, one could certainly dig into that discrepancy.
– BeeOnRope
Nov 27 at 17:41
You might want to check if the 500 cycle time is still accurate for 1000+ back-to-back calls (no sleeping). If it is, one could certainly dig into that discrepancy.
– BeeOnRope
Nov 27 at 17:41
1
1
You can reproduce the pattern by passing a value of 1 to the third argument
loops
. If you pass a larger value, then you'll get the execution time of the whole inner loop, not individual calls. The timings mentioned in the question are for the case where loops
is 1. But note that init run
still measures the time for the very first call (which is outside the loop). This particular number varies significantly from one run to another and from one system to another (which makes sense). But the whole pattern is reproducible always.– Hadi Brais
Nov 27 at 20:15
You can reproduce the pattern by passing a value of 1 to the third argument
loops
. If you pass a larger value, then you'll get the execution time of the whole inner loop, not individual calls. The timings mentioned in the question are for the case where loops
is 1. But note that init run
still measures the time for the very first call (which is outside the loop). This particular number varies significantly from one run to another and from one system to another (which makes sense). But the whole pattern is reproducible always.– Hadi Brais
Nov 27 at 20:15
@HadiBrais - in that case, the results quickly stabilize to ~100 cycles on my system, not 500. Since it includes
rdtsc
overhead, this seems within the range of the results with more iterations per loop. So I'm not seeing long results even with only one call every trial. I should clarify that I modified the original code to print time per call as well as total time (see output at the end of my answer), so when I say 65 cycles it is per call.– BeeOnRope
Nov 27 at 21:34
@HadiBrais - in that case, the results quickly stabilize to ~100 cycles on my system, not 500. Since it includes
rdtsc
overhead, this seems within the range of the results with more iterations per loop. So I'm not seeing long results even with only one call every trial. I should clarify that I modified the original code to print time per call as well as total time (see output at the end of my answer), so when I say 65 cycles it is per call.– BeeOnRope
Nov 27 at 21:34
|
show 6 more comments
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Some of your past answers have not been well-received, and you're in danger of being blocked from answering.
Please pay close attention to the following guidance:
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53252050%2fwhy-does-the-call-latency-on-clock-gettimeclock-realtime-vary-so-much%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
1
You’re probably misinterpreting your results: you assume that your time units are the same in all cases. You have to, at the very least, validate this assumption first before you depend on it. And without the test code, there isn’t much we can do to investigate. Finally, at the very least one needs a
/prof/cpuinfo
dump for 1st cpu to ensure apple’s-to-apples comparison. I also assume that this is not a NUMA machine nor a server motherboard, right?– Kuba Ober
Nov 11 at 19:08
1
vDSO doesn’t guarantee anything either: you’d need to know that the glibc you’re using actually uses the vDSO interfaces. I
– Kuba Ober
Nov 11 at 19:18
@Kuba glibc-2.26-30.fc27 and I've updated the questions with cpuinfo. What do you mean by time units not the same? These results persist over many tune, and the machine is single socket using rtdsc, so there shouldn't be any measurement issues.
– JasonN
Nov 11 at 21:17
@kuba strace verifies that the clock_gettime calls are not syscalls
– JasonN
Nov 11 at 23:20
1
@terry why not? I've seen plenty of code timing calls like gettimeofday for the specific purpose of timing the vDSO vs syscall overhead. I'm not even sure it is the same hardware components on the chip (I don't think clock_gettime uses rdtsc, but I could be wrong).
– JasonN
Nov 13 at 23:06