Rsyslog data being bit-shifted on Linux in AWS










I have dozens of c4.xlarge HVM CentOS 6 Linux instances running in AWS. They're part of a distributed measurement network that collects data and forwards it to a central location using rsyslog. All instances run within the same VPC on a flat /24 network. I'm running rsyslog 8.36.0-2 with the plain syslog protocol, not RELP, TLS, or any of the other fancier transports that rsyslog supports. Each measurement node forwards dozens of events per second to the central collector. All of this works great, except that once every 3-4 months the measurement data from a small number of instances becomes corrupted, with a single character in a value replaced by a nearby character. For example, one of the fields that these instances forward to the central collector is the hostname of the instance. This instance is named "tagserver-prod-us-west-2-i-0d16b7e4c51e85697.internal.domain", but here are the values that it reported for its hostname during a 15-minute period when this problem occurred:



tagserver-ppod-us-west-2-i-0d16b7e4c51e85697.internal.domain
tagserver-prod-uq-west-2-i-0d16b7e4c51e85697.internal.domain
tagserver-prod-us-west-0-i-0d16b7e4c51e85697.internal.domain
tagserver-prod-us-west-2-i-0d14b7e4c51e85697.internal.domain
tagserver-prod-us-west-2-i-0d16b5e4c51e85697.internal.domain
tagserver-prod-us-west-2-i-0d16b7e4a51e85697.internal.domain
tagserver-prod-us-west-2-i-0d16b7e4c51e85697.internal.domain
tagserver-prod-us-west-2-i-0d16b7e6c51e85697.internal.domain
tagserver-prod-us-west-2-i-0f16b7e4c51e85697.internal.domain
tagserver-prod-us-wgst-2-i-0d16b7e4c51e85697.internal.domain
tagserver-prof-us-west-2-i-0d16b7e4c51e85697.internal.domain
tagserver-rrod-us-west-2-i-0d16b7e4c51e85697.internal.domain


Each corrupted value differs from the correct hostname by a single character, and in each case that character has moved two positions backwards or forwards in the ASCII table. Here are some examples taken from the above list:



r -> p
s -> q
2 -> 0
6 -> 4
7 -> 5
c -> a
d -> f
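To make the pattern easier to see, here is a minimal Python sketch that XORs the ASCII codes of each character pair. The pairs are read off the hostname list above as (expected, observed); the `flipped_bits` helper is a name made up for this illustration.

```python
# (expected, observed) character pairs read off the hostname list above
pairs = [("r", "p"), ("s", "q"), ("2", "0"), ("6", "4"),
         ("7", "5"), ("c", "a"), ("d", "f"), ("e", "g")]

def flipped_bits(expected: str, observed: str) -> int:
    """XOR the two ASCII codes; each set bit marks a bit that differs."""
    return ord(expected) ^ ord(observed)

for exp, obs in pairs:
    mask = flipped_bits(exp, obs)
    print(f"{exp} {ord(exp):08b} -> {obs} {ord(obs):08b}  xor={mask:08b}")
```

If every pair yields the same XOR mask, the change is a toggle of one fixed bit rather than a shift of the whole byte.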


This problem was detected on the central rsyslog collector, when the aggregated data was validated before being sent to another server for processing. I therefore cannot say definitively where the corruption occurred: on the measurement node where rsyslog is running, on the network while rsyslog was forwarding the data to the collector, or on the collector itself. Two measurement nodes experienced this issue during the same time period.
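One way to narrow down where the corruption happens might be to have the measurement application attach a checksum to each event before handing it to rsyslog, and to verify it during validation on the collector. The sketch below is only an illustration of that idea: the ` crc32=` field and the `tag_with_crc` / `verify_crc` helpers are hypothetical and are not part of rsyslog or of the existing setup.

```python
import zlib

def tag_with_crc(payload: str) -> str:
    """Append a CRC32 of the payload so a later stage can check integrity."""
    crc = zlib.crc32(payload.encode("utf-8"))
    return f"{payload} crc32={crc:08x}"

def verify_crc(message: str) -> bool:
    """Return True only if the trailing crc32 field matches the payload."""
    payload, sep, crc_hex = message.rpartition(" crc32=")
    if not sep:
        return False  # no checksum field found (or the field name was damaged)
    try:
        expected = int(crc_hex, 16)
    except ValueError:
        return False  # checksum field itself is not valid hex
    return zlib.crc32(payload.encode("utf-8")) == expected

# Sender side (hypothetical event format):
event = tag_with_crc("host=tagserver-prod-us-west-2 value=42")
# Collector side, during validation:
print(verify_crc(event))
```

If a message with a wrong hostname still passes the check, the value was already wrong when the application computed the checksum. If the check fails, the damage happened somewhere after that point (rsyslog on the node, the network, or the collector).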



I'm at a bit of a loss to determine where this bit shifting is occurring. I've looked through my kernel logs and system logs, trying to correlate the times when the bit shifting occurred with some issue on the system, but haven't been able to find anything; everything else on the system appears to have been working fine. I've read that bit shifting can occur as a result of bad memory, but as I understand it, all of the hardware in AWS runs ECC memory, which should prevent this from happening due to faulty memory. Is there anything in Linux or rsyslog that I should be looking at to determine the cause of this issue or to prevent its recurrence? I've read that there's a kernel module called EDAC in Linux that can help detect memory errors, but I'm not sure that it would function in a VM; I believe it would need direct access to physical memory. Does anyone have any suggestions on what might be causing this issue or where I should focus my troubleshooting?










  • If you look at the binary representation of the failing characters, in each case it is bit 1 (counting from bit 0 at the right) that is toggling from 0 to 1 or 1 to 0. So it does look like a hardware problem. If it happens only every few months, perhaps it is when your VM is migrated to a particular, slightly broken CPU. Indeed, ECC should make this impossible, and network traffic is protected by checksums. Is it possible to log the physical AWS CPU ID or something like that to identify a platform?

    – meuh
    Nov 15 '18 at 9:59
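The observation in this comment can be checked against whole hostnames. The following is a rough Python sketch that reports, for each corrupted hostname, which character index and which bit differ from the expected name. The `bit_flips` function is a hypothetical helper written for this illustration; the hostnames are copied from the question.

```python
EXPECTED = "tagserver-prod-us-west-2-i-0d16b7e4c51e85697.internal.domain"

def bit_flips(expected: str, observed: str):
    """Return a list of (char index, bit index) for every differing bit."""
    if len(expected) != len(observed):
        raise ValueError("lengths differ; not a pure bit-flip corruption")
    flips = []
    for i, (e, o) in enumerate(zip(expected, observed)):
        diff = ord(e) ^ ord(o)
        bit = 0
        while diff:
            if diff & 1:
                flips.append((i, bit))  # bit 0 is the least significant bit
            diff >>= 1
            bit += 1
    return flips

observed = [
    "tagserver-ppod-us-west-2-i-0d16b7e4c51e85697.internal.domain",
    "tagserver-prod-uq-west-2-i-0d16b7e4c51e85697.internal.domain",
    "tagserver-prod-us-wgst-2-i-0d16b7e4c51e85697.internal.domain",
]
for name in observed:
    print(name, bit_flips(EXPECTED, name))
```

Running the same check on the collector during validation would also record the byte offset of each flip, which may help when comparing incidents.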















linux amazon-web-services bit-shift rsyslog






asked Nov 15 '18 at 4:19









Eric Campusano




