How can I sum multiple columns in a spark dataframe in pyspark?










I have a list of column names that I want to sum:

columns = ['col1','col2','col3']

How can I add the three columns and put the result in a new column? I want to do it automatically, so that I can change the column list and get new results.

The DataFrame with the result I want:

col1 col2 col3 result
1 2 3 6

Thanks!







python apache-spark pyspark














asked Nov 14 '18 at 10:21 by Manrique (edited Nov 14 '18 at 17:24)












  • Possible duplicate of How do I add a new column to a Spark DataFrame (using PySpark)?

    – Prasad Khode
    Nov 14 '18 at 10:23











  • Thanks for answering! I know how to add columns; I just want an efficient way to add them based on a list of column names.

    – Manrique
    Nov 14 '18 at 10:33





























2 Answers
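Try this:

df = df.withColumn('result', sum(df[col] for col in df.columns))

df.columns is the list of all column names in df. To sum only some columns, iterate over your own list (columns in the question) instead. Note that sum here must be Python's built-in sum, not pyspark.sql.functions.sum.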
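
As a minimal sketch of the same idea restricted to the list from the question, the helper below builds the sum expression from a list of names. The name row_sum_expr and the check for unknown columns are illustrative assumptions, not part of the original answer. builtins.sum is referenced explicitly so that a wildcard import from pyspark.sql.functions cannot replace it with the aggregate function of the same name.

```python
import builtins


def row_sum_expr(df, names):
    """Return df[n1] + df[n2] + ... for the given column names.

    Hypothetical helper: `df` is expected to behave like a Spark DataFrame,
    i.e. expose `.columns` and support `df[name]`.
    """
    missing = [n for n in names if n not in df.columns]
    if missing:
        raise ValueError(f"unknown columns: {missing}")
    # Built-in sum starts from 0 and adds each item with +, so for Spark
    # Columns it yields a single Column expression.
    return builtins.sum(df[n] for n in names)


# Intended usage with a real Spark DataFrame (not executed here):
# columns = ['col1', 'col2', 'col3']
# df = df.withColumn('result', row_sum_expr(df, columns))
```

Changing the contents of columns is then enough to change which columns go into result.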

















answered Nov 14 '18 at 10:25 by Mayank Porwal












  • I replicated this with the DataFrame below and got an error: listA = [(10,20,40,60),(10,10,10,40)]; df = spark.createDataFrame(listA, ['M1','M2','M3','M4']); newdf = df.withColumn('result', sum(df[col] for col in df.columns)). The error is TypeError: 'Column' object is not callable. Am I missing something?

    – vikrant rana
    Dec 4 '18 at 14:38

















[Edited to explain each step]

If you have a static list of columns, you can do this:

df.withColumn("result", col("col1") + col("col2") + col("col3"))

But if you don't want to type out the whole column list, you need to build the expression col("col1") + col("col2") + col("col3") programmatically. For this, you can use the reduce function (from functools) with the add function (from operator):

reduce(add, [col(x) for x in df.columns])



The columns are added two at a time, so you get (col("col1") + col("col2")) + col("col3") instead of col("col1") + col("col2") + col("col3"), but the result is the same.

The col(x) call ensures that you are adding Column objects; reducing over the bare names would just concatenate the strings into col1col2col3.
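
The two points above (the left-to-right grouping, and why bare strings are not enough) can be seen without Spark. The sketch below uses a toy Expr class as a stand-in for a Spark Column; Expr is purely illustrative and is not a PySpark type. Its + operator records the expression instead of computing a number, which is roughly how column arithmetic is built up.

```python
from functools import reduce
from operator import add


class Expr:
    """Toy stand-in for a Spark Column: + builds a description, not a number."""

    def __init__(self, text):
        self.text = text

    def __add__(self, other):
        return Expr(f"({self.text} + {other.text})")


names = ["col1", "col2", "col3"]

# Reducing over wrapped names folds from the left, two operands at a time.
as_columns = reduce(add, [Expr(n) for n in names])

# Reducing over the bare names uses str.__add__, i.e. plain concatenation.
as_strings = reduce(add, names)

print(as_columns.text)  # ((col1 + col2) + col3)
print(as_strings)       # col1col2col3
```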



[TL;DR]

Combining the above steps, you can do this:

from functools import reduce
from operator import add
from pyspark.sql.functions import col

df.na.fill(0).withColumn("result", reduce(add, [col(x) for x in df.columns]))

The df.na.fill(0) part handles nulls in your data: adding null to a number gives null in Spark, so one null would make that row's result null. If you don't have any nulls, you can skip it and do this instead:

df.withColumn("result", reduce(add, [col(x) for x in df.columns]))

To sum only the columns from the question, use columns in place of df.columns.
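
One possible way to package this for the list in the question is sketched below. fold_sum and add_row_sum are hypothetical names. The sketch swaps df.na.fill(0) for a per-column coalesce(col(n), lit(0)), which treats nulls as zero only inside the sum and leaves the original columns unchanged. The pyspark import sits inside the function, so nothing here touches Spark until add_row_sum is called with a real DataFrame.

```python
from functools import reduce
from operator import add


def fold_sum(terms):
    """Left-fold a non-empty sequence with +, i.e. ((t0 + t1) + t2) + ..."""
    terms = list(terms)
    if not terms:
        raise ValueError("need at least one column to sum")
    return reduce(add, terms)


def add_row_sum(df, names, result="result", nulls_as_zero=True):
    """Return df with a new column holding the row-wise sum of `names`."""
    # Imported lazily: only needed when a real Spark DataFrame is passed in.
    from pyspark.sql.functions import coalesce, col, lit

    if nulls_as_zero:
        # Replace null with 0 inside the expression only.
        terms = [coalesce(col(n), lit(0)) for n in names]
    else:
        terms = [col(n) for n in names]
    return df.withColumn(result, fold_sum(terms))


# Intended usage (not executed here):
# columns = ['col1', 'col2', 'col3']
# df = add_row_sum(df, columns)
```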















answered Jan 21 at 5:36 by Dileep Kumar Patchigolla (edited Jan 22 at 5:45)


























