How can I sum multiple columns in a Spark DataFrame in PySpark?
I've got a list of column names I want to sum:
columns = ['col1', 'col2', 'col3']
How can I add the three together and put the result in a new column? (In an automatic way, so that I can change the column list and get new results.)
DataFrame with the result I want:
col1 col2 col3 result
1    2    3    6
Thanks!
python apache-spark pyspark
asked Nov 14 '18 at 10:21 by Manrique (edited Nov 14 '18 at 17:24)
Possible duplicate of How do I add a new column to a Spark DataFrame (using PySpark)?
– Prasad Khode, Nov 14 '18 at 10:23
Thanks for answering! I know how to add columns; I just want an efficient way to add them based on a list of column names.
– Manrique, Nov 14 '18 at 10:33
2 Answers
Try this:
df = df.withColumn('result', sum(df[col] for col in df.columns))
df.columns will be the list of all columns in df, so this sums every column.
answered Nov 14 '18 at 10:25 – Mayank Porwal
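A minimal runnable sketch of this approach (the SparkSession setup and sample data here are illustrative, not from the answer), iterating over the question's explicit list instead of df.columns so that only the named columns are summed:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 2, 3)], ['col1', 'col2', 'col3'])

columns = ['col1', 'col2', 'col3']
# Python's builtin sum() works here because Column overloads +;
# iterating over the explicit list keeps the total limited to those columns.
df = df.withColumn('result', sum(df[c] for c in columns))
df.show()
# +----+----+----+------+
# |col1|col2|col3|result|
# +----+----+----+------+
# |   1|   2|   3|     6|
# +----+----+----+------+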
I have replicated the same with the dataframe below and am getting an error: listA = [(10,20,40,60),(10,10,10,40)]; df = spark.createDataFrame(listA, ['M1','M2','M3','M4']); newdf = df.withColumn('result', sum(df[col] for col in df.columns)). It fails with TypeError: 'Column' object is not callable. Am I missing something?
– vikrant rana, Dec 4 '18 at 14:38
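One plausible cause (an assumption here; the comment doesn't show the session's imports) is that the name sum has been shadowed by pyspark.sql.functions.sum, e.g. via a star import, so the call never reaches Python's builtin. Reaching for the builtin explicitly sidesteps the clash; this sketch assumes a live SparkSession named spark:
import builtins
from pyspark.sql.functions import *  # hypothetical trigger: this shadows the builtin sum()

listA = [(10, 20, 40, 60), (10, 10, 10, 40)]
df = spark.createDataFrame(listA, ['M1', 'M2', 'M3', 'M4'])
# builtins.sum is guaranteed to be Python's own sum, not the Spark aggregate.
newdf = df.withColumn('result', builtins.sum(df[c] for c in df.columns))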
If you have a static list of columns, you can do this:
df.withColumn("result", col("col1") + col("col2") + col("col3"))
But if you don't want to type out the whole column list, you need to build the expression col("col1") + col("col2") + col("col3") iteratively. For this, you can use reduce with the add operator:
reduce(add, [col(x) for x in df.columns])
The columns are added two at a time, so what you actually get is (col("col1") + col("col2")) + col("col3") rather than col("col1") + col("col2") + col("col3"), but the effect is the same.
The col(x) ensures that you are adding Column objects rather than plain strings; reducing over the bare column names would simply concatenate them into the string 'col1col2col3'.
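A quick illustration of that pitfall, using plain Python (no Spark needed) to show what reducing over the bare names would do:
from functools import reduce
from operator import add

# Reducing over the raw strings concatenates the names instead of adding columns:
reduce(add, ['col1', 'col2', 'col3'])  # -> 'col1col2col3'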
TL;DR: combining the above steps, you can do this:
from functools import reduce
from operator import add
from pyspark.sql.functions import col
df.na.fill(0).withColumn("result", reduce(add, [col(x) for x in df.columns]))
The df.na.fill(0) portion handles nulls in your data: a null in any input column would otherwise make the whole sum null. If you don't have any nulls, you can skip it and do this instead:
df.withColumn("result", reduce(add, [col(x) for x in df.columns]))
answered Jan 21 at 5:36 – Dileep Kumar Patchigolla (edited Jan 22 at 5:45)
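As with the first answer, the same pattern can be pointed at the question's explicit list instead of df.columns. A sketch under that assumption, filling nulls only in the columns being summed:
from functools import reduce
from operator import add
from pyspark.sql.functions import col

columns = ['col1', 'col2', 'col3']
# subset= limits the null-fill to the columns that feed the sum.
df = df.na.fill(0, subset=columns).withColumn('result', reduce(add, [col(c) for c in columns]))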