Spark on Kubernetes: how is Spark's stateful nature maintained in Kubernetes?
I am experimenting with Spark 2.3 on a K8s cluster and wondering how checkpointing works. Where is it stored? If the driver dies, what happens to the in-flight processing?
When consuming from Kafka, how are the offsets maintained? I tried to look this up online but could not find answers to these questions. Our application consumes a lot of Kafka data, so it is essential that it can restart and pick up from where it stopped.
Are there any gotchas when running Spark Streaming on K8s?
apache-spark kubernetes spark-streaming
edited Nov 14 '18 at 22:33 by Rico
asked Nov 14 '18 at 22:21 by Vijay Ram
1 Answer
The Kubernetes Spark Controller doesn't know anything about checkpointing, AFAIK. It's just a way for Kubernetes to schedule your Spark driver and the workers it needs to run a job.
Storing the offset is really up to your application: you decide where to keep the Kafka offset so that, when the application restarts, it can pick up that offset and resume consuming from there. One example of this approach is storing it in ZooKeeper.
You could, for example, write ZK offset manager functions in Scala. The original answer only outlines them; the bodies and extra parameters below are a minimal sketch filled in with plain Apache Curator calls (checkExists/getData/setData), and the string encoding of the offsets is just an illustrative choice:
import com.metamx.common.scala.Logging
import org.apache.curator.framework.CuratorFramework

object OffsetManager extends Logging {

  // Read the last committed offsets from a znode; None if it has not been written yet
  def getOffsets(client: CuratorFramework, path: String): Option[String] =
    Option(client.checkExists().forPath(path))
      .map(_ => new String(client.getData().forPath(path), "UTF-8"))

  // Write the current offsets back, creating the znode on the first run
  def setOffsets(client: CuratorFramework, path: String, offsets: String): Unit =
    if (client.checkExists().forPath(path) == null)
      client.create().creatingParentsIfNeeded().forPath(path, offsets.getBytes("UTF-8"))
    else
      client.setData().forPath(path, offsets.getBytes("UTF-8"))
}
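For reference, here is how those functions might be called; the ZooKeeper connect string and znode path are illustrative assumptions, not values from the original answer:

import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry

// Connect to ZooKeeper with a standard exponential backoff retry policy
val client = CuratorFrameworkFactory.newClient("zk-host:2181", new ExponentialBackoffRetry(1000, 3))
client.start()

// Commit offsets after each successfully processed batch, read them back on restart
OffsetManager.setOffsets(client, "/consumers/my-group/my-topic", "0:42,1:17")
val restored = OffsetManager.getOffsets(client, "/consumers/my-group/my-topic")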
Another way would be storing your Kafka offsets in something reliable like HDFS.
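If you go the HDFS route, the simplest option is to point Spark Streaming's built-in checkpointing at an HDFS directory, which also persists Kafka offsets for direct streams. A minimal sketch, where the HDFS URL, app name, and batch interval are assumed placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs://namenode:8020/checkpoints/my-streaming-app"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("my-streaming-app")
  val ssc = new StreamingContext(conf, Seconds(10))
  ssc.checkpoint(checkpointDir)  // streaming metadata, including direct-stream Kafka offsets, lands here
  // define the Kafka input stream and processing here
  ssc
}

// On restart, Spark rebuilds the context from the checkpoint if one exists
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()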
answered Nov 14 '18 at 22:50 by Rico
Thanks for the comments. So statefulness is not provided by Kubernetes; it's the application's responsibility to take care of that. Is that right?
– Vijay Ram, Nov 15 '18 at 12:12
Yes, that is correct.
– Rico, Nov 15 '18 at 15:00