Rsyslog data being bit-shifted on Linux in AWS
I have dozens of c4.xlarge HVM CentOS 6 Linux instances running in AWS. They're part of a distributed measurement network that collects data and forwards it to a central location using rsyslog. All instances run in the same VPC on a flat /24 network. I'm running rsyslog 8.36.0-2 and using the plain syslog protocol, not RELP, TLS, or any of the other fancier transport protocols that rsyslog supports. Each measurement node forwards dozens of events per second to the central collector. All of this works great, except that once every 3-4 months the measurement data from a small number of instances becomes corrupted: individual characters are replaced by nearby characters. For example, one of the fields that these instances forward to the central collector is the hostname of the instance. This instance is named "tagserver-prod-us-west-2-i-0d16b7e4c51e85697.internal.domain", but here are the values reported for its hostname during a 15-minute period when this problem occurred:
tagserver-ppod-us-west-2-i-0d16b7e4c51e85697.internal.domain
tagserver-prod-uq-west-2-i-0d16b7e4c51e85697.internal.domain
tagserver-prod-us-west-0-i-0d16b7e4c51e85697.internal.domain
tagserver-prod-us-west-2-i-0d14b7e4c51e85697.internal.domain
tagserver-prod-us-west-2-i-0d16b5e4c51e85697.internal.domain
tagserver-prod-us-west-2-i-0d16b7e4a51e85697.internal.domain
tagserver-prod-us-west-2-i-0d16b7e4c51e85697.internal.domain
tagserver-prod-us-west-2-i-0d16b7e6c51e85697.internal.domain
tagserver-prod-us-west-2-i-0f16b7e4c51e85697.internal.domain
tagserver-prod-us-wgst-2-i-0d16b7e4c51e85697.internal.domain
tagserver-prof-us-west-2-i-0d16b7e4c51e85697.internal.domain
tagserver-rrod-us-west-2-i-0d16b7e4c51e85697.internal.domain
Each erroneous value differs from the correct hostname by a single character, and that character has been shifted by two positions in the character set, either backwards or forwards. Here are some examples taken from the above list:
r -> p
s -> q
2 -> 0
6 -> 4
7 -> 5
c -> a
d -> f
This problem was observed on the central rsyslog collector when the aggregated data underwent validation before being sent to another server for processing, so I cannot say definitively where the corruption occurred: on the measurement node where rsyslog is running, on the network when rsyslog forwarded the data to the collector, or on the collector itself. Two measurement nodes experienced this issue during the same time period.
I'm at a bit of a loss to determine where this bit shifting is occurring. I've looked through my kernel logs and system logs, trying to correlate the times when the bit shifting occurred with some other issue on the system, but I haven't found anything; everything else on the system appears to have been working fine. I've read that bit shifting can occur as a result of bad memory, but as I understand it, all of the hardware in AWS runs ECC memory, which should prevent faulty memory from causing this. Is there anything in Linux or rsyslog that I should be looking at to determine the cause of this issue or to prevent its recurrence? I've read that there's a kernel module called EDAC in Linux that can help detect memory errors, but I'm not sure that it would function in a VM; I believe it would need direct access to physical memory. Does anyone have any suggestions on what might be causing this issue or where I should focus my troubleshooting?
linux amazon-web-services bit-shift rsyslog
asked Nov 15 '18 at 4:19
Eric Campusano
If you look at the binary representation of the failing characters, in each case it is bit 1 (counting from bit 0 at the right) that is toggling from 0 to 1 or from 1 to 0. So it does look like a hardware problem. If it happens only every few months, perhaps it is when your VM is migrated to a particular, slightly broken CPU. Indeed, ECC should make this impossible, and network traffic is protected by checksums. Is it possible to log the physical AWS CPU ID or something like that to identify a platform?
– meuh
Nov 15 '18 at 9:59
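The bit-1 observation in the comment above can be checked directly with a short script. This is only a sketch: the hostnames are copied from the question, while the function name `bit_flips` and the output format are made up for illustration. It XORs each observed hostname against the expected one, character by character, and reports every position where they differ.

```python
# Expected hostname and the corrupted values, copied from the question.
EXPECTED = "tagserver-prod-us-west-2-i-0d16b7e4c51e85697.internal.domain"

OBSERVED = [
    "tagserver-ppod-us-west-2-i-0d16b7e4c51e85697.internal.domain",
    "tagserver-prod-uq-west-2-i-0d16b7e4c51e85697.internal.domain",
    "tagserver-prod-us-west-0-i-0d16b7e4c51e85697.internal.domain",
    "tagserver-prod-us-west-2-i-0d14b7e4c51e85697.internal.domain",
    "tagserver-prod-us-west-2-i-0d16b5e4c51e85697.internal.domain",
    "tagserver-prod-us-west-2-i-0d16b7e4a51e85697.internal.domain",
    "tagserver-prod-us-west-2-i-0d16b7e6c51e85697.internal.domain",
    "tagserver-prod-us-west-2-i-0f16b7e4c51e85697.internal.domain",
    "tagserver-prod-us-wgst-2-i-0d16b7e4c51e85697.internal.domain",
    "tagserver-prof-us-west-2-i-0d16b7e4c51e85697.internal.domain",
    "tagserver-rrod-us-west-2-i-0d16b7e4c51e85697.internal.domain",
]


def bit_flips(expected, observed):
    """Return (index, expected_char, observed_char, xor_mask) per differing character.

    Hypothetical helper for illustration; the XOR mask has a 1 in every
    bit position where the two characters disagree.
    """
    if len(expected) != len(observed):
        raise ValueError("hostnames differ in length")
    return [
        (i, e, o, ord(e) ^ ord(o))
        for i, (e, o) in enumerate(zip(expected, observed))
        if e != o
    ]


for name in OBSERVED:
    for i, e, o, mask in bit_flips(EXPECTED, name):
        print(f"pos {i:2d}: {e!r} ({ord(e):08b}) -> {o!r} ({ord(o):08b})  xor={mask:08b}")
```

For the values in the question, every reported difference has the XOR mask `00000010`, i.e. exactly one flipped bit per hostname, always bit 1. That matches a single-bit flip rather than a shift, which is why the characters appear to move by two positions in either direction.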