Node status changes to unknown on a high resource requirement pod
I have a Jenkins deployment pipeline that uses the Kubernetes plugin. With the plugin I create a slave pod that builds a Node.js application with Yarn. The pod's CPU and memory requests and limits are set.
When the Jenkins master schedules the slave, sometimes (I haven't spotted a pattern yet) the pod makes the entire node unreachable and the node's status changes to Unknown. On close inspection in Grafana, CPU and memory usage stay well within range with no visible spike. The only spike is in disk I/O, which peaks at ~4 MiB/s.
I am not sure whether that is why the node can no longer report itself as a cluster member. I need help with two things:
a) How can I diagnose in depth why the node leaves the cluster?
b) If the reason is disk IOPS, are there any default requests or limits for IOPS at the Kubernetes level?
PS: I am using EBS (gp2)
kubernetes jenkins-plugins aws-ebs
4 MiB/s doesn't strike me as a very high throughput number for gp2 SSDs, but you can see Amazon's perf guidelines here to check against your disk size / type. Note that you may also be limited on IOPS instead of throughput, so it might be valuable to collect that information in Grafana too if possible. – Dan, Nov 12 at 20:16
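If it helps, here is a rough sketch of two ways to get IOPS numbers next to that throughput figure; the volume ID, time window, and device are placeholders you would substitute for your own:

# On the node itself: per-device reads/writes per second (r/s, w/s) and utilisation, sampled every 5 s
$ iostat -dx 5

# From anywhere with AWS CLI access: write operations CloudWatch recorded for the EBS volume
$ aws cloudwatch get-metric-statistics \
    --namespace AWS/EBS --metric-name VolumeWriteOps \
    --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
    --start-time 2018-11-12T19:00:00Z --end-time 2018-11-12T21:00:00Z \
    --period 300 --statistics Sum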
asked Nov 12 at 14:07 by Arpit Goyal · edited Nov 12 at 22:57 by Rico
1 Answer
As per the docs, for the node to be 'Ready':
True if the node is healthy and ready to accept pods, False if the node is not healthy and is not accepting pods, and Unknown if the node controller has not heard from the node in the last node-monitor-grace-period (default is 40 seconds)
It would seem that when you run your workloads your kube-apiserver doesn't hear from your node (kubelet) within 40 seconds. There could be multiple reasons; some things you can try:
1. To see the 'Events' on your node, run:
$ kubectl describe node <node-name>
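A few related read-only checks can help narrow things down; this is a rough sketch, and <node-name> is a placeholder as above:

# Quick view of the Ready/NotReady state of every node
$ kubectl get nodes -o wide

# Just the Ready condition (status, reason, message) for the affected node
$ kubectl get node <node-name> -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'

# Cluster events that reference the node (NodeNotReady, DiskPressure, etc.)
$ kubectl get events --all-namespaces --field-selector involvedObject.kind=Node,involvedObject.name=<node-name>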
2. To see if there is anything unusual in your kube-apiserver logs, on your active master run:
$ docker logs <container-id-of-kube-apiserver>
3. To see if there is anything unusual in your kube-controller-manager logs when your node goes into the 'Unknown' state, on your active master run:
$ docker logs <container-id-of-kube-controller-manager>
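A small sketch of locating those container IDs and narrowing the log window, assuming Docker is the container runtime on the master (with containerd or CRI-O you would use crictl instead):

# Find the control-plane container IDs
$ docker ps | grep kube-apiserver
$ docker ps | grep kube-controller-manager

# These components log to stderr, so redirect it; --since limits the window
# to roughly when the node went Unknown, and grep filters for that node
$ docker logs --since 30m <container-id-of-kube-controller-manager> 2>&1 | grep -i <node-name>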
4. Increase the --node-monitor-grace-period option in your kube-controller-manager. You can add it to the command line in /etc/kubernetes/manifests/kube-controller-manager.yaml and restart the kube-controller-manager container.
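A rough sketch of that change, assuming a kubeadm-style static-pod layout on the master; the 120s value is only an illustrative choice:

# Check whether the flag is already present
$ sudo grep -n node-monitor-grace-period /etc/kubernetes/manifests/kube-controller-manager.yaml

# Add or adjust it under the container's command:, for example:
#     - --node-monitor-grace-period=120s
# The kubelet watches the static-pod manifest and restarts the container
# automatically once the file is saved; confirm with:
$ docker ps | grep kube-controller-manager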
5. When the node is in the 'Unknown' state, can you ssh into it and see if you can reach the kube-apiserver? Try both the <master-ip>:6443 and the kubernetes.default.svc.cluster.local:443 endpoints.
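A minimal sketch of those connectivity checks; 6443 assumes the default secure API-server port, and the pod name is a placeholder:

# From the affected node: raw TCP reachability to the API server
$ nc -vz <master-ip> 6443

# Unauthenticated health probe (-k skips certificate verification); even a
# 401/403 response still proves the endpoint is reachable
$ curl -k https://<master-ip>:6443/healthz

# kubernetes.default.svc.cluster.local only resolves through the cluster DNS,
# so test it from inside a pod scheduled on that node, e.g.:
$ kubectl exec <some-pod-on-that-node> -- nslookup kubernetes.default.svc.cluster.local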
answered Nov 12 at 23:26 by Rico