Where to run the processing code in Kafka?










I am trying to set up a data pipeline using Kafka.
Data goes in (via producers), gets processed, enriched, and cleaned, and moves out to different databases or storage (via consumers or Kafka Connect).

But where do you run the actual pipeline processing code that enriches and cleans the data? Should it be part of the producers or the consumers? I think I missed something.










      apache-kafka






      asked Nov 15 '18 at 19:33









user2409399

3 Answers






In a data pipeline, a Kafka client can serve as both a consumer and a producer.

For example, if raw data is streamed into ClientA, where it is cleaned before being passed on to ClientB for enrichment, then ClientA acts as a consumer (listening to a topic for raw data) and as a producer (publishing the cleaned data to another topic).

Where you draw those boundaries is a separate question.
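A minimal sketch of that consume-transform-produce loop with the plain Java clients (the topic names, group id, and the trivial trim() "cleaning" step are placeholders, not anything from the question):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringDeserializer;
    import org.apache.kafka.common.serialization.StringSerializer;

    public class CleaningClient {
        public static void main(String[] args) {
            Properties consumerProps = new Properties();
            consumerProps.put("bootstrap.servers", "localhost:9092");
            consumerProps.put("group.id", "cleaning-client");   // hypothetical group id
            consumerProps.put("key.deserializer", StringDeserializer.class.getName());
            consumerProps.put("value.deserializer", StringDeserializer.class.getName());

            Properties producerProps = new Properties();
            producerProps.put("bootstrap.servers", "localhost:9092");
            producerProps.put("key.serializer", StringSerializer.class.getName());
            producerProps.put("value.serializer", StringSerializer.class.getName());

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
                 KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {

                consumer.subscribe(Collections.singletonList("raw-data"));   // hypothetical input topic

                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                    for (ConsumerRecord<String, String> record : records) {
                        String cleaned = record.value().trim();   // stand-in for real cleaning logic
                        producer.send(new ProducerRecord<>("cleaned-data", record.key(), cleaned));
                    }
                }
            }
        }
    }

Such a client can run anywhere that can reach the brokers; it is just an ordinary application.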







                    answered Nov 15 '18 at 20:45









akim

It can be part of either the producer or the consumer.

Or you could set up a dedicated environment for something like Kafka Streams processes or a KSQL cluster.
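For the Kafka Streams route, the processing code lives in a plain application that you deploy wherever you like. A minimal sketch, assuming String keys and values and made-up topic names and transformations:

    import java.util.Properties;
    import org.apache.kafka.common.serialization.Serdes;
    import org.apache.kafka.streams.KafkaStreams;
    import org.apache.kafka.streams.StreamsBuilder;
    import org.apache.kafka.streams.StreamsConfig;
    import org.apache.kafka.streams.kstream.KStream;

    public class EnrichmentApp {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(StreamsConfig.APPLICATION_ID_CONFIG, "pipeline-enrichment");   // hypothetical app id
            props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
            props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

            StreamsBuilder builder = new StreamsBuilder();
            KStream<String, String> raw = builder.stream("raw-data");   // hypothetical input topic

            raw.filter((key, value) -> value != null && !value.isEmpty())   // drop empty records ("cleaning")
               .mapValues(value -> value.trim().toUpperCase())              // stand-in for real enrichment
               .to("enriched-data");                                        // hypothetical output topic

            KafkaStreams streams = new KafkaStreams(builder.build(), props);
            streams.start();
            Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
        }
    }

Kafka Streams handles partitioning, rebalancing, and state for you, so scaling out is just running more instances of the same application.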







                            answered Nov 15 '18 at 22:12









cricket_007

It is possible either way. Consider all the options and choose the one that suits you best. Let's assume you have a source of raw data, in CSV files or some database (Oracle), and you want to do your ETL work and load the result into different datastores.

1) Use Kafka Connect to produce your data to Kafka topics. Have a consumer that consumes off of these topics (it could be Kafka Streams, KSQL, Akka, or Spark), and produce back to a Kafka topic for further use, or to some datastore (basically any sink). This has the benefit of ingesting your data with little or no code, since Kafka Connect source connectors are easy to set up (see the sketch below).

2) Write custom producers and do your transformations in the producers before writing to a Kafka topic, or write directly to a sink unless you want to reuse the produced data for further processing. Then read from the Kafka topic, do any further processing, and write the result back to a persistent store.

It all boils down to your design choices: the throughput you need from the system and how complicated your data structure is.
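As a sketch of the Kafka Connect half of option 1, a standalone source connector configuration could look roughly like this, using the FileStreamSource connector that ships with Kafka (the file path and topic name are placeholders). It would be started with bin/connect-standalone.sh together with a worker properties file:

    # file-source.properties -- hypothetical standalone source connector config
    name=raw-csv-source
    connector.class=org.apache.kafka.connect.file.FileStreamSourceConnector
    tasks.max=1
    file=/tmp/raw-data.csv
    topic=raw-data

For the Oracle case, a JDBC source connector (for example Confluent's kafka-connect-jdbc) plays the same role, again without custom code.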







                                    answered Nov 16 '18 at 6:51









Achilleus
