spark drop multiple duplicated columns after join
I am getting many duplicated columns after joining two dataframes,
now I want to drop the columns which comes in the last, below is my printSchema
root
|-- id: string (nullable = true)
|-- value: string (nullable = true)
|-- test: string (nullable = true)
|-- details: string (nullable = true)
|-- test: string (nullable = true)
|-- value: string (nullable = true)
now I want to drop the last two columns
|-- test: string (nullable = true)
|-- value: string (nullable = true)
I tried with df..dropDuplicates() but it dropping all
how to drop the duplicated columns which comes in the last ?
apache-spark apache-spark-sql
add a comment |
I am getting many duplicated columns after joining two dataframes,
now I want to drop the columns which comes in the last, below is my printSchema
root
|-- id: string (nullable = true)
|-- value: string (nullable = true)
|-- test: string (nullable = true)
|-- details: string (nullable = true)
|-- test: string (nullable = true)
|-- value: string (nullable = true)
now I want to drop the last two columns
|-- test: string (nullable = true)
|-- value: string (nullable = true)
I tried with df..dropDuplicates() but it dropping all
how to drop the duplicated columns which comes in the last ?
apache-spark apache-spark-sql
add a comment |
I am getting many duplicated columns after joining two dataframes,
now I want to drop the columns which comes in the last, below is my printSchema
root
|-- id: string (nullable = true)
|-- value: string (nullable = true)
|-- test: string (nullable = true)
|-- details: string (nullable = true)
|-- test: string (nullable = true)
|-- value: string (nullable = true)
now I want to drop the last two columns
|-- test: string (nullable = true)
|-- value: string (nullable = true)
I tried with df..dropDuplicates() but it dropping all
how to drop the duplicated columns which comes in the last ?
apache-spark apache-spark-sql
I am getting many duplicated columns after joining two dataframes,
now I want to drop the columns which comes in the last, below is my printSchema
root
|-- id: string (nullable = true)
|-- value: string (nullable = true)
|-- test: string (nullable = true)
|-- details: string (nullable = true)
|-- test: string (nullable = true)
|-- value: string (nullable = true)
now I want to drop the last two columns
|-- test: string (nullable = true)
|-- value: string (nullable = true)
I tried with df..dropDuplicates() but it dropping all
how to drop the duplicated columns which comes in the last ?
apache-spark apache-spark-sql
apache-spark apache-spark-sql
asked Nov 15 '18 at 10:08
user3607698user3607698
360117
360117
add a comment |
add a comment |
3 Answers
3
active
oldest
votes
You have to use a vararg syntax to get the column names from an array and drop it.
Check below:
scala> dfx.show
+---+---+---+---+------------+------+
| A| B| C| D| arr|mincol|
+---+---+---+---+------------+------+
| 1| 2| 3| 4|[1, 2, 3, 4]| A|
| 5| 4| 3| 1|[5, 4, 3, 1]| D|
+---+---+---+---+------------+------+
scala> dfx.columns
res120: Array[String] = Array(A, B, C, D, arr, mincol)
scala> val dropcols = Array("arr","mincol")
dropcols: Array[String] = Array(arr, mincol)
scala> dfx.drop(dropcols:_*).show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update1:
scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val df2 = df.select("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]
scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 2| 3|
| 5| 4| 3| 1| 4| 3|
+---+---+---+---+---+---+
scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").drop($"t2.B").drop($"t2.C").show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update2:
To remove the columns dynamically, check the below solution.
scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val df2 = Seq((1,9,9),(5,8,8)).toDF("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]
scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner")
df3: org.apache.spark.sql.DataFrame = [A: int, B: int ... 4 more fields]
scala> df3.show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 9| 9|
| 5| 4| 3| 1| 8| 8|
+---+---+---+---+---+---+
scala> val rem1 = Array("B","C")
rem1: Array[String] = Array(B, C)
scala> val rem2 = rem1.map(x=>"t2."+x)
rem2: Array[String] = Array(t2.B, t2.C)
scala> val df4 = rem2.foldLeft(df3) (acc: DataFrame, colName: String) => acc.drop(col(colName))
df4: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> df4.show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update3
Renaming/aliasing in one go.
scala> val dfa = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
dfa: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val dfa2 = dfa.columns.foldLeft(dfa) (acc: DataFrame, colName: String) => acc.withColumnRenamed(colName,colName+"_2")
dfa2: org.apache.spark.sql.DataFrame = [A_2: int, B_2: int ... 2 more fields]
scala> dfa2.show
+---+---+---+---+
|A_2|B_2|C_2|D_2|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
this solution drops both the duplicated columns, my requirement is to drop one and keep other
– user3607698
Nov 15 '18 at 13:04
in that case, you can alias for the table names and a concatenated drop(s) on the resulting df. check my update.
– stack0114106
Nov 15 '18 at 13:36
check my update2 on how to approach it dynamically. let me know if it helps
– stack0114106
Nov 15 '18 at 14:14
Is it possible to do alias for list of columns in one go?
– user3607698
Nov 15 '18 at 14:16
yes..with the foldLeft technique.. you can just rename the columns in one go.. check my update3
– stack0114106
Nov 15 '18 at 14:23
|
show 1 more comment
Suppose if you have two dataframes DF1 and DF2,
You can use either of the ways to join on a particular column
1. DF1.join(DF2,Seq("column1","column2"))
2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2")))
So to drop the duplicate columns you can use
1. DF1.join(DF2,Seq("column1","column2")).drop(DF1("column1")).drop(DF1("column1"),DF1("column2"))
2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2"))).drop(DF1("column1"),DF1("column2"))
In either case you can use drop("columnname") to drop what ever columns you need doesn't matter from which df it comes from as it is equal in this case.
second approach worked, but I have the list of columns to be dropped in the val List("column1", "column2", "columnn"), how to pass this list for this drop(DF1("column1"),DF1("column2"),...... DF1("columnn"))
– user3607698
Nov 15 '18 at 10:30
what do you mean by list of columns ? are they dynamically generated ?
– Sundeep Pidugu
Nov 15 '18 at 11:10
val clmlist = List("column1", "column2", "columnn") df1.join(df2, clmlist, "inner") this is my joining function, I want something like this, df1.join(df2, clmlist, "inner").drop(clmlist)
– user3607698
Nov 15 '18 at 11:17
i never tried this are you getting any any error while running this ? please do let me know
– Sundeep Pidugu
Nov 15 '18 at 11:33
this I tried df1.join(df2, clmlist, "inner"), it works .. but this dropping doesnt work df1.join(df2, clmlist, "inner").drop(clmlist) . so I want a method to drop the all these columns in the clmlist
– user3607698
Nov 15 '18 at 12:26
add a comment |
- df.dropDuplicates() works only for rows.
- You can df1.drop(df2.column("value"))
- You can specify columns you want to select, for example, with df.select(Seq of columns)
df.select gives ambiguous error when two duplicated columns are present, my requirement is to drop one and keep other in duplicated list
– user3607698
Nov 15 '18 at 12:55
Try df2.column("value") instead of new Column("value")
– Serge Harnyk
Nov 15 '18 at 13:06
getting error df2.drop(df1.column("id")). <console>:30: error: value column is not a member of org.apache.spark.sql.DataFrame
– user3607698
Nov 15 '18 at 13:09
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53316976%2fspark-drop-multiple-duplicated-columns-after-join%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
You have to use a vararg syntax to get the column names from an array and drop it.
Check below:
scala> dfx.show
+---+---+---+---+------------+------+
| A| B| C| D| arr|mincol|
+---+---+---+---+------------+------+
| 1| 2| 3| 4|[1, 2, 3, 4]| A|
| 5| 4| 3| 1|[5, 4, 3, 1]| D|
+---+---+---+---+------------+------+
scala> dfx.columns
res120: Array[String] = Array(A, B, C, D, arr, mincol)
scala> val dropcols = Array("arr","mincol")
dropcols: Array[String] = Array(arr, mincol)
scala> dfx.drop(dropcols:_*).show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update1:
scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val df2 = df.select("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]
scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 2| 3|
| 5| 4| 3| 1| 4| 3|
+---+---+---+---+---+---+
scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").drop($"t2.B").drop($"t2.C").show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update2:
To remove the columns dynamically, check the below solution.
scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val df2 = Seq((1,9,9),(5,8,8)).toDF("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]
scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner")
df3: org.apache.spark.sql.DataFrame = [A: int, B: int ... 4 more fields]
scala> df3.show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 9| 9|
| 5| 4| 3| 1| 8| 8|
+---+---+---+---+---+---+
scala> val rem1 = Array("B","C")
rem1: Array[String] = Array(B, C)
scala> val rem2 = rem1.map(x=>"t2."+x)
rem2: Array[String] = Array(t2.B, t2.C)
scala> val df4 = rem2.foldLeft(df3) (acc: DataFrame, colName: String) => acc.drop(col(colName))
df4: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> df4.show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update3
Renaming/aliasing in one go.
scala> val dfa = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
dfa: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val dfa2 = dfa.columns.foldLeft(dfa) (acc: DataFrame, colName: String) => acc.withColumnRenamed(colName,colName+"_2")
dfa2: org.apache.spark.sql.DataFrame = [A_2: int, B_2: int ... 2 more fields]
scala> dfa2.show
+---+---+---+---+
|A_2|B_2|C_2|D_2|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
this solution drops both the duplicated columns, my requirement is to drop one and keep other
– user3607698
Nov 15 '18 at 13:04
in that case, you can alias for the table names and a concatenated drop(s) on the resulting df. check my update.
– stack0114106
Nov 15 '18 at 13:36
check my update2 on how to approach it dynamically. let me know if it helps
– stack0114106
Nov 15 '18 at 14:14
Is it possible to do alias for list of columns in one go?
– user3607698
Nov 15 '18 at 14:16
yes..with the foldLeft technique.. you can just rename the columns in one go.. check my update3
– stack0114106
Nov 15 '18 at 14:23
|
show 1 more comment
You have to use a vararg syntax to get the column names from an array and drop it.
Check below:
scala> dfx.show
+---+---+---+---+------------+------+
| A| B| C| D| arr|mincol|
+---+---+---+---+------------+------+
| 1| 2| 3| 4|[1, 2, 3, 4]| A|
| 5| 4| 3| 1|[5, 4, 3, 1]| D|
+---+---+---+---+------------+------+
scala> dfx.columns
res120: Array[String] = Array(A, B, C, D, arr, mincol)
scala> val dropcols = Array("arr","mincol")
dropcols: Array[String] = Array(arr, mincol)
scala> dfx.drop(dropcols:_*).show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update1:
scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val df2 = df.select("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]
scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 2| 3|
| 5| 4| 3| 1| 4| 3|
+---+---+---+---+---+---+
scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").drop($"t2.B").drop($"t2.C").show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update2:
To remove the columns dynamically, check the below solution.
scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val df2 = Seq((1,9,9),(5,8,8)).toDF("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]
scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner")
df3: org.apache.spark.sql.DataFrame = [A: int, B: int ... 4 more fields]
scala> df3.show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 9| 9|
| 5| 4| 3| 1| 8| 8|
+---+---+---+---+---+---+
scala> val rem1 = Array("B","C")
rem1: Array[String] = Array(B, C)
scala> val rem2 = rem1.map(x=>"t2."+x)
rem2: Array[String] = Array(t2.B, t2.C)
scala> val df4 = rem2.foldLeft(df3) (acc: DataFrame, colName: String) => acc.drop(col(colName))
df4: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> df4.show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update3
Renaming/aliasing in one go.
scala> val dfa = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
dfa: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val dfa2 = dfa.columns.foldLeft(dfa) (acc: DataFrame, colName: String) => acc.withColumnRenamed(colName,colName+"_2")
dfa2: org.apache.spark.sql.DataFrame = [A_2: int, B_2: int ... 2 more fields]
scala> dfa2.show
+---+---+---+---+
|A_2|B_2|C_2|D_2|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
this solution drops both the duplicated columns, my requirement is to drop one and keep other
– user3607698
Nov 15 '18 at 13:04
in that case, you can alias for the table names and a concatenated drop(s) on the resulting df. check my update.
– stack0114106
Nov 15 '18 at 13:36
check my update2 on how to approach it dynamically. let me know if it helps
– stack0114106
Nov 15 '18 at 14:14
Is it possible to do alias for list of columns in one go?
– user3607698
Nov 15 '18 at 14:16
yes..with the foldLeft technique.. you can just rename the columns in one go.. check my update3
– stack0114106
Nov 15 '18 at 14:23
|
show 1 more comment
You have to use a vararg syntax to get the column names from an array and drop it.
Check below:
scala> dfx.show
+---+---+---+---+------------+------+
| A| B| C| D| arr|mincol|
+---+---+---+---+------------+------+
| 1| 2| 3| 4|[1, 2, 3, 4]| A|
| 5| 4| 3| 1|[5, 4, 3, 1]| D|
+---+---+---+---+------------+------+
scala> dfx.columns
res120: Array[String] = Array(A, B, C, D, arr, mincol)
scala> val dropcols = Array("arr","mincol")
dropcols: Array[String] = Array(arr, mincol)
scala> dfx.drop(dropcols:_*).show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update1:
scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val df2 = df.select("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]
scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 2| 3|
| 5| 4| 3| 1| 4| 3|
+---+---+---+---+---+---+
scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").drop($"t2.B").drop($"t2.C").show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update2:
To remove the columns dynamically, check the below solution.
scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val df2 = Seq((1,9,9),(5,8,8)).toDF("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]
scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner")
df3: org.apache.spark.sql.DataFrame = [A: int, B: int ... 4 more fields]
scala> df3.show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 9| 9|
| 5| 4| 3| 1| 8| 8|
+---+---+---+---+---+---+
scala> val rem1 = Array("B","C")
rem1: Array[String] = Array(B, C)
scala> val rem2 = rem1.map(x=>"t2."+x)
rem2: Array[String] = Array(t2.B, t2.C)
scala> val df4 = rem2.foldLeft(df3) (acc: DataFrame, colName: String) => acc.drop(col(colName))
df4: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> df4.show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update3
Renaming/aliasing in one go.
scala> val dfa = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
dfa: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val dfa2 = dfa.columns.foldLeft(dfa) (acc: DataFrame, colName: String) => acc.withColumnRenamed(colName,colName+"_2")
dfa2: org.apache.spark.sql.DataFrame = [A_2: int, B_2: int ... 2 more fields]
scala> dfa2.show
+---+---+---+---+
|A_2|B_2|C_2|D_2|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
You have to use a vararg syntax to get the column names from an array and drop it.
Check below:
scala> dfx.show
+---+---+---+---+------------+------+
| A| B| C| D| arr|mincol|
+---+---+---+---+------------+------+
| 1| 2| 3| 4|[1, 2, 3, 4]| A|
| 5| 4| 3| 1|[5, 4, 3, 1]| D|
+---+---+---+---+------------+------+
scala> dfx.columns
res120: Array[String] = Array(A, B, C, D, arr, mincol)
scala> val dropcols = Array("arr","mincol")
dropcols: Array[String] = Array(arr, mincol)
scala> dfx.drop(dropcols:_*).show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update1:
scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val df2 = df.select("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]
scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 2| 3|
| 5| 4| 3| 1| 4| 3|
+---+---+---+---+---+---+
scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").drop($"t2.B").drop($"t2.C").show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update2:
To remove the columns dynamically, check the below solution.
scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val df2 = Seq((1,9,9),(5,8,8)).toDF("A","B","C")
df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]
scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner")
df3: org.apache.spark.sql.DataFrame = [A: int, B: int ... 4 more fields]
scala> df3.show
+---+---+---+---+---+---+
| A| B| C| D| B| C|
+---+---+---+---+---+---+
| 1| 2| 3| 4| 9| 9|
| 5| 4| 3| 1| 8| 8|
+---+---+---+---+---+---+
scala> val rem1 = Array("B","C")
rem1: Array[String] = Array(B, C)
scala> val rem2 = rem1.map(x=>"t2."+x)
rem2: Array[String] = Array(t2.B, t2.C)
scala> val df4 = rem2.foldLeft(df3) (acc: DataFrame, colName: String) => acc.drop(col(colName))
df4: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> df4.show
+---+---+---+---+
| A| B| C| D|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
Update3
Renaming/aliasing in one go.
scala> val dfa = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
dfa: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]
scala> val dfa2 = dfa.columns.foldLeft(dfa) (acc: DataFrame, colName: String) => acc.withColumnRenamed(colName,colName+"_2")
dfa2: org.apache.spark.sql.DataFrame = [A_2: int, B_2: int ... 2 more fields]
scala> dfa2.show
+---+---+---+---+
|A_2|B_2|C_2|D_2|
+---+---+---+---+
| 1| 2| 3| 4|
| 5| 4| 3| 1|
+---+---+---+---+
scala>
edited Nov 15 '18 at 14:24
answered Nov 15 '18 at 13:01
stack0114106stack0114106
4,2282421
4,2282421
this solution drops both the duplicated columns, my requirement is to drop one and keep other
– user3607698
Nov 15 '18 at 13:04
in that case, you can alias for the table names and a concatenated drop(s) on the resulting df. check my update.
– stack0114106
Nov 15 '18 at 13:36
check my update2 on how to approach it dynamically. let me know if it helps
– stack0114106
Nov 15 '18 at 14:14
Is it possible to do alias for list of columns in one go?
– user3607698
Nov 15 '18 at 14:16
yes..with the foldLeft technique.. you can just rename the columns in one go.. check my update3
– stack0114106
Nov 15 '18 at 14:23
|
show 1 more comment
this solution drops both the duplicated columns, my requirement is to drop one and keep other
– user3607698
Nov 15 '18 at 13:04
in that case, you can alias for the table names and a concatenated drop(s) on the resulting df. check my update.
– stack0114106
Nov 15 '18 at 13:36
check my update2 on how to approach it dynamically. let me know if it helps
– stack0114106
Nov 15 '18 at 14:14
Is it possible to do alias for list of columns in one go?
– user3607698
Nov 15 '18 at 14:16
yes..with the foldLeft technique.. you can just rename the columns in one go.. check my update3
– stack0114106
Nov 15 '18 at 14:23
this solution drops both the duplicated columns, my requirement is to drop one and keep other
– user3607698
Nov 15 '18 at 13:04
this solution drops both the duplicated columns, my requirement is to drop one and keep other
– user3607698
Nov 15 '18 at 13:04
in that case, you can alias for the table names and a concatenated drop(s) on the resulting df. check my update.
– stack0114106
Nov 15 '18 at 13:36
in that case, you can alias for the table names and a concatenated drop(s) on the resulting df. check my update.
– stack0114106
Nov 15 '18 at 13:36
check my update2 on how to approach it dynamically. let me know if it helps
– stack0114106
Nov 15 '18 at 14:14
check my update2 on how to approach it dynamically. let me know if it helps
– stack0114106
Nov 15 '18 at 14:14
Is it possible to do alias for list of columns in one go?
– user3607698
Nov 15 '18 at 14:16
Is it possible to do alias for list of columns in one go?
– user3607698
Nov 15 '18 at 14:16
yes..with the foldLeft technique.. you can just rename the columns in one go.. check my update3
– stack0114106
Nov 15 '18 at 14:23
yes..with the foldLeft technique.. you can just rename the columns in one go.. check my update3
– stack0114106
Nov 15 '18 at 14:23
|
show 1 more comment
Suppose if you have two dataframes DF1 and DF2,
You can use either of the ways to join on a particular column
1. DF1.join(DF2,Seq("column1","column2"))
2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2")))
So to drop the duplicate columns you can use
1. DF1.join(DF2,Seq("column1","column2")).drop(DF1("column1")).drop(DF1("column1"),DF1("column2"))
2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2"))).drop(DF1("column1"),DF1("column2"))
In either case you can use drop("columnname") to drop what ever columns you need doesn't matter from which df it comes from as it is equal in this case.
second approach worked, but I have the list of columns to be dropped in the val List("column1", "column2", "columnn"), how to pass this list for this drop(DF1("column1"),DF1("column2"),...... DF1("columnn"))
– user3607698
Nov 15 '18 at 10:30
what do you mean by list of columns ? are they dynamically generated ?
– Sundeep Pidugu
Nov 15 '18 at 11:10
val clmlist = List("column1", "column2", "columnn") df1.join(df2, clmlist, "inner") this is my joining function, I want something like this, df1.join(df2, clmlist, "inner").drop(clmlist)
– user3607698
Nov 15 '18 at 11:17
i never tried this are you getting any any error while running this ? please do let me know
– Sundeep Pidugu
Nov 15 '18 at 11:33
this I tried df1.join(df2, clmlist, "inner"), it works .. but this dropping doesnt work df1.join(df2, clmlist, "inner").drop(clmlist) . so I want a method to drop the all these columns in the clmlist
– user3607698
Nov 15 '18 at 12:26
add a comment |
Suppose if you have two dataframes DF1 and DF2,
You can use either of the ways to join on a particular column
1. DF1.join(DF2,Seq("column1","column2"))
2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2")))
So to drop the duplicate columns you can use
1. DF1.join(DF2,Seq("column1","column2")).drop(DF1("column1")).drop(DF1("column1"),DF1("column2"))
2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2"))).drop(DF1("column1"),DF1("column2"))
In either case you can use drop("columnname") to drop what ever columns you need doesn't matter from which df it comes from as it is equal in this case.
second approach worked, but I have the list of columns to be dropped in the val List("column1", "column2", "columnn"), how to pass this list for this drop(DF1("column1"),DF1("column2"),...... DF1("columnn"))
– user3607698
Nov 15 '18 at 10:30
what do you mean by list of columns ? are they dynamically generated ?
– Sundeep Pidugu
Nov 15 '18 at 11:10
val clmlist = List("column1", "column2", "columnn") df1.join(df2, clmlist, "inner") this is my joining function, I want something like this, df1.join(df2, clmlist, "inner").drop(clmlist)
– user3607698
Nov 15 '18 at 11:17
i never tried this are you getting any any error while running this ? please do let me know
– Sundeep Pidugu
Nov 15 '18 at 11:33
this I tried df1.join(df2, clmlist, "inner"), it works .. but this dropping doesnt work df1.join(df2, clmlist, "inner").drop(clmlist) . so I want a method to drop the all these columns in the clmlist
– user3607698
Nov 15 '18 at 12:26
add a comment |
Suppose if you have two dataframes DF1 and DF2,
You can use either of the ways to join on a particular column
1. DF1.join(DF2,Seq("column1","column2"))
2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2")))
So to drop the duplicate columns you can use
1. DF1.join(DF2,Seq("column1","column2")).drop(DF1("column1")).drop(DF1("column1"),DF1("column2"))
2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2"))).drop(DF1("column1"),DF1("column2"))
In either case you can use drop("columnname") to drop what ever columns you need doesn't matter from which df it comes from as it is equal in this case.
Suppose if you have two dataframes DF1 and DF2,
You can use either of the ways to join on a particular column
1. DF1.join(DF2,Seq("column1","column2"))
2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2")))
So to drop the duplicate columns you can use
1. DF1.join(DF2,Seq("column1","column2")).drop(DF1("column1")).drop(DF1("column1"),DF1("column2"))
2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2"))).drop(DF1("column1"),DF1("column2"))
In either case you can use drop("columnname") to drop what ever columns you need doesn't matter from which df it comes from as it is equal in this case.
answered Nov 15 '18 at 10:19
Sundeep PiduguSundeep Pidugu
40614
40614
second approach worked, but I have the list of columns to be dropped in the val List("column1", "column2", "columnn"), how to pass this list for this drop(DF1("column1"),DF1("column2"),...... DF1("columnn"))
– user3607698
Nov 15 '18 at 10:30
what do you mean by list of columns ? are they dynamically generated ?
– Sundeep Pidugu
Nov 15 '18 at 11:10
val clmlist = List("column1", "column2", "columnn") df1.join(df2, clmlist, "inner") this is my joining function, I want something like this, df1.join(df2, clmlist, "inner").drop(clmlist)
– user3607698
Nov 15 '18 at 11:17
i never tried this are you getting any any error while running this ? please do let me know
– Sundeep Pidugu
Nov 15 '18 at 11:33
this I tried df1.join(df2, clmlist, "inner"), it works .. but this dropping doesnt work df1.join(df2, clmlist, "inner").drop(clmlist) . so I want a method to drop the all these columns in the clmlist
– user3607698
Nov 15 '18 at 12:26
add a comment |
second approach worked, but I have the list of columns to be dropped in the val List("column1", "column2", "columnn"), how to pass this list for this drop(DF1("column1"),DF1("column2"),...... DF1("columnn"))
– user3607698
Nov 15 '18 at 10:30
what do you mean by list of columns ? are they dynamically generated ?
– Sundeep Pidugu
Nov 15 '18 at 11:10
val clmlist = List("column1", "column2", "columnn") df1.join(df2, clmlist, "inner") this is my joining function, I want something like this, df1.join(df2, clmlist, "inner").drop(clmlist)
– user3607698
Nov 15 '18 at 11:17
i never tried this are you getting any any error while running this ? please do let me know
– Sundeep Pidugu
Nov 15 '18 at 11:33
this I tried df1.join(df2, clmlist, "inner"), it works .. but this dropping doesnt work df1.join(df2, clmlist, "inner").drop(clmlist) . so I want a method to drop the all these columns in the clmlist
– user3607698
Nov 15 '18 at 12:26
second approach worked, but I have the list of columns to be dropped in the val List("column1", "column2", "columnn"), how to pass this list for this drop(DF1("column1"),DF1("column2"),...... DF1("columnn"))
– user3607698
Nov 15 '18 at 10:30
second approach worked, but I have the list of columns to be dropped in the val List("column1", "column2", "columnn"), how to pass this list for this drop(DF1("column1"),DF1("column2"),...... DF1("columnn"))
– user3607698
Nov 15 '18 at 10:30
what do you mean by list of columns ? are they dynamically generated ?
– Sundeep Pidugu
Nov 15 '18 at 11:10
what do you mean by list of columns ? are they dynamically generated ?
– Sundeep Pidugu
Nov 15 '18 at 11:10
val clmlist = List("column1", "column2", "columnn") df1.join(df2, clmlist, "inner") this is my joining function, I want something like this, df1.join(df2, clmlist, "inner").drop(clmlist)
– user3607698
Nov 15 '18 at 11:17
val clmlist = List("column1", "column2", "columnn") df1.join(df2, clmlist, "inner") this is my joining function, I want something like this, df1.join(df2, clmlist, "inner").drop(clmlist)
– user3607698
Nov 15 '18 at 11:17
i never tried this are you getting any any error while running this ? please do let me know
– Sundeep Pidugu
Nov 15 '18 at 11:33
i never tried this are you getting any any error while running this ? please do let me know
– Sundeep Pidugu
Nov 15 '18 at 11:33
this I tried df1.join(df2, clmlist, "inner"), it works .. but this dropping doesnt work df1.join(df2, clmlist, "inner").drop(clmlist) . so I want a method to drop the all these columns in the clmlist
– user3607698
Nov 15 '18 at 12:26
this I tried df1.join(df2, clmlist, "inner"), it works .. but this dropping doesnt work df1.join(df2, clmlist, "inner").drop(clmlist) . so I want a method to drop the all these columns in the clmlist
– user3607698
Nov 15 '18 at 12:26
add a comment |
- df.dropDuplicates() works only for rows.
- You can df1.drop(df2.column("value"))
- You can specify columns you want to select, for example, with df.select(Seq of columns)
df.select gives ambiguous error when two duplicated columns are present, my requirement is to drop one and keep other in duplicated list
– user3607698
Nov 15 '18 at 12:55
Try df2.column("value") instead of new Column("value")
– Serge Harnyk
Nov 15 '18 at 13:06
getting error df2.drop(df1.column("id")). <console>:30: error: value column is not a member of org.apache.spark.sql.DataFrame
– user3607698
Nov 15 '18 at 13:09
add a comment |
- df.dropDuplicates() works only for rows.
- You can df1.drop(df2.column("value"))
- You can specify columns you want to select, for example, with df.select(Seq of columns)
df.select gives ambiguous error when two duplicated columns are present, my requirement is to drop one and keep other in duplicated list
– user3607698
Nov 15 '18 at 12:55
Try df2.column("value") instead of new Column("value")
– Serge Harnyk
Nov 15 '18 at 13:06
getting error df2.drop(df1.column("id")). <console>:30: error: value column is not a member of org.apache.spark.sql.DataFrame
– user3607698
Nov 15 '18 at 13:09
add a comment |
- df.dropDuplicates() works only for rows.
- You can df1.drop(df2.column("value"))
- You can specify columns you want to select, for example, with df.select(Seq of columns)
- df.dropDuplicates() works only for rows.
- You can df1.drop(df2.column("value"))
- You can specify columns you want to select, for example, with df.select(Seq of columns)
answered Nov 15 '18 at 12:52
Serge HarnykSerge Harnyk
374110
374110
df.select gives ambiguous error when two duplicated columns are present, my requirement is to drop one and keep other in duplicated list
– user3607698
Nov 15 '18 at 12:55
Try df2.column("value") instead of new Column("value")
– Serge Harnyk
Nov 15 '18 at 13:06
getting error df2.drop(df1.column("id")). <console>:30: error: value column is not a member of org.apache.spark.sql.DataFrame
– user3607698
Nov 15 '18 at 13:09
add a comment |
df.select gives ambiguous error when two duplicated columns are present, my requirement is to drop one and keep other in duplicated list
– user3607698
Nov 15 '18 at 12:55
Try df2.column("value") instead of new Column("value")
– Serge Harnyk
Nov 15 '18 at 13:06
getting error df2.drop(df1.column("id")). <console>:30: error: value column is not a member of org.apache.spark.sql.DataFrame
– user3607698
Nov 15 '18 at 13:09
df.select gives ambiguous error when two duplicated columns are present, my requirement is to drop one and keep other in duplicated list
– user3607698
Nov 15 '18 at 12:55
df.select gives ambiguous error when two duplicated columns are present, my requirement is to drop one and keep other in duplicated list
– user3607698
Nov 15 '18 at 12:55
Try df2.column("value") instead of new Column("value")
– Serge Harnyk
Nov 15 '18 at 13:06
Try df2.column("value") instead of new Column("value")
– Serge Harnyk
Nov 15 '18 at 13:06
getting error df2.drop(df1.column("id")). <console>:30: error: value column is not a member of org.apache.spark.sql.DataFrame
– user3607698
Nov 15 '18 at 13:09
getting error df2.drop(df1.column("id")). <console>:30: error: value column is not a member of org.apache.spark.sql.DataFrame
– user3607698
Nov 15 '18 at 13:09
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53316976%2fspark-drop-multiple-duplicated-columns-after-join%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown