spark drop multiple duplicated columns after join










-1















I am getting many duplicated columns after joining two dataframes,
now I want to drop the columns which comes in the last, below is my printSchema



root
|-- id: string (nullable = true)
|-- value: string (nullable = true)
|-- test: string (nullable = true)
|-- details: string (nullable = true)
|-- test: string (nullable = true)
|-- value: string (nullable = true)


now I want to drop the last two columns



 |-- test: string (nullable = true)
|-- value: string (nullable = true)


I tried with df..dropDuplicates() but it dropping all



how to drop the duplicated columns which comes in the last ?










share|improve this question


























    -1















    I am getting many duplicated columns after joining two dataframes,
    now I want to drop the columns which comes in the last, below is my printSchema



    root
    |-- id: string (nullable = true)
    |-- value: string (nullable = true)
    |-- test: string (nullable = true)
    |-- details: string (nullable = true)
    |-- test: string (nullable = true)
    |-- value: string (nullable = true)


    now I want to drop the last two columns



     |-- test: string (nullable = true)
    |-- value: string (nullable = true)


    I tried with df..dropDuplicates() but it dropping all



    how to drop the duplicated columns which comes in the last ?










    share|improve this question
























      -1












      -1








      -1








      I am getting many duplicated columns after joining two dataframes,
      now I want to drop the columns which comes in the last, below is my printSchema



      root
      |-- id: string (nullable = true)
      |-- value: string (nullable = true)
      |-- test: string (nullable = true)
      |-- details: string (nullable = true)
      |-- test: string (nullable = true)
      |-- value: string (nullable = true)


      now I want to drop the last two columns



       |-- test: string (nullable = true)
      |-- value: string (nullable = true)


      I tried with df..dropDuplicates() but it dropping all



      how to drop the duplicated columns which comes in the last ?










      share|improve this question














      I am getting many duplicated columns after joining two dataframes,
      now I want to drop the columns which comes in the last, below is my printSchema



      root
      |-- id: string (nullable = true)
      |-- value: string (nullable = true)
      |-- test: string (nullable = true)
      |-- details: string (nullable = true)
      |-- test: string (nullable = true)
      |-- value: string (nullable = true)


      now I want to drop the last two columns



       |-- test: string (nullable = true)
      |-- value: string (nullable = true)


      I tried with df..dropDuplicates() but it dropping all



      how to drop the duplicated columns which comes in the last ?







      apache-spark apache-spark-sql






      share|improve this question













      share|improve this question











      share|improve this question




      share|improve this question










      asked Nov 15 '18 at 10:08









      user3607698user3607698

      360117




      360117






















          3 Answers
          3






          active

          oldest

          votes


















          1














          You have to use a vararg syntax to get the column names from an array and drop it.
          Check below:



          scala> dfx.show
          +---+---+---+---+------------+------+
          | A| B| C| D| arr|mincol|
          +---+---+---+---+------------+------+
          | 1| 2| 3| 4|[1, 2, 3, 4]| A|
          | 5| 4| 3| 1|[5, 4, 3, 1]| D|
          +---+---+---+---+------------+------+

          scala> dfx.columns
          res120: Array[String] = Array(A, B, C, D, arr, mincol)

          scala> val dropcols = Array("arr","mincol")
          dropcols: Array[String] = Array(arr, mincol)

          scala> dfx.drop(dropcols:_*).show
          +---+---+---+---+
          | A| B| C| D|
          +---+---+---+---+
          | 1| 2| 3| 4|
          | 5| 4| 3| 1|
          +---+---+---+---+


          scala>


          Update1:



          scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
          df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

          scala> val df2 = df.select("A","B","C")
          df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]

          scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").show
          +---+---+---+---+---+---+
          | A| B| C| D| B| C|
          +---+---+---+---+---+---+
          | 1| 2| 3| 4| 2| 3|
          | 5| 4| 3| 1| 4| 3|
          +---+---+---+---+---+---+


          scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").drop($"t2.B").drop($"t2.C").show
          +---+---+---+---+
          | A| B| C| D|
          +---+---+---+---+
          | 1| 2| 3| 4|
          | 5| 4| 3| 1|
          +---+---+---+---+


          scala>


          Update2:



          To remove the columns dynamically, check the below solution.



          scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
          df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

          scala> val df2 = Seq((1,9,9),(5,8,8)).toDF("A","B","C")
          df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]

          scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner")
          df3: org.apache.spark.sql.DataFrame = [A: int, B: int ... 4 more fields]

          scala> df3.show
          +---+---+---+---+---+---+
          | A| B| C| D| B| C|
          +---+---+---+---+---+---+
          | 1| 2| 3| 4| 9| 9|
          | 5| 4| 3| 1| 8| 8|
          +---+---+---+---+---+---+

          scala> val rem1 = Array("B","C")
          rem1: Array[String] = Array(B, C)

          scala> val rem2 = rem1.map(x=>"t2."+x)
          rem2: Array[String] = Array(t2.B, t2.C)

          scala> val df4 = rem2.foldLeft(df3) (acc: DataFrame, colName: String) => acc.drop(col(colName))
          df4: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

          scala> df4.show
          +---+---+---+---+
          | A| B| C| D|
          +---+---+---+---+
          | 1| 2| 3| 4|
          | 5| 4| 3| 1|
          +---+---+---+---+


          scala>


          Update3



          Renaming/aliasing in one go.



          scala> val dfa = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
          dfa: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

          scala> val dfa2 = dfa.columns.foldLeft(dfa) (acc: DataFrame, colName: String) => acc.withColumnRenamed(colName,colName+"_2")
          dfa2: org.apache.spark.sql.DataFrame = [A_2: int, B_2: int ... 2 more fields]

          scala> dfa2.show
          +---+---+---+---+
          |A_2|B_2|C_2|D_2|
          +---+---+---+---+
          | 1| 2| 3| 4|
          | 5| 4| 3| 1|
          +---+---+---+---+


          scala>





          share|improve this answer

























          • this solution drops both the duplicated columns, my requirement is to drop one and keep other

            – user3607698
            Nov 15 '18 at 13:04











          • in that case, you can alias for the table names and a concatenated drop(s) on the resulting df. check my update.

            – stack0114106
            Nov 15 '18 at 13:36











          • check my update2 on how to approach it dynamically. let me know if it helps

            – stack0114106
            Nov 15 '18 at 14:14











          • Is it possible to do alias for list of columns in one go?

            – user3607698
            Nov 15 '18 at 14:16











          • yes..with the foldLeft technique.. you can just rename the columns in one go.. check my update3

            – stack0114106
            Nov 15 '18 at 14:23


















          0














          Suppose if you have two dataframes DF1 and DF2,
          You can use either of the ways to join on a particular column



           1. DF1.join(DF2,Seq("column1","column2"))
          2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2")))


          So to drop the duplicate columns you can use



           1. DF1.join(DF2,Seq("column1","column2")).drop(DF1("column1")).drop(DF1("column1"),DF1("column2"))
          2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2"))).drop(DF1("column1"),DF1("column2"))


          In either case you can use drop("columnname") to drop what ever columns you need doesn't matter from which df it comes from as it is equal in this case.






          share|improve this answer























          • second approach worked, but I have the list of columns to be dropped in the val List("column1", "column2", "columnn"), how to pass this list for this drop(DF1("column1"),DF1("column2"),...... DF1("columnn"))

            – user3607698
            Nov 15 '18 at 10:30











          • what do you mean by list of columns ? are they dynamically generated ?

            – Sundeep Pidugu
            Nov 15 '18 at 11:10












          • val clmlist = List("column1", "column2", "columnn") df1.join(df2, clmlist, "inner") this is my joining function, I want something like this, df1.join(df2, clmlist, "inner").drop(clmlist)

            – user3607698
            Nov 15 '18 at 11:17











          • i never tried this are you getting any any error while running this ? please do let me know

            – Sundeep Pidugu
            Nov 15 '18 at 11:33











          • this I tried df1.join(df2, clmlist, "inner"), it works .. but this dropping doesnt work df1.join(df2, clmlist, "inner").drop(clmlist) . so I want a method to drop the all these columns in the clmlist

            – user3607698
            Nov 15 '18 at 12:26


















          0














          1. df.dropDuplicates() works only for rows.

          2. You can df1.drop(df2.column("value"))

          3. You can specify columns you want to select, for example, with df.select(Seq of columns)





          share|improve this answer























          • df.select gives ambiguous error when two duplicated columns are present, my requirement is to drop one and keep other in duplicated list

            – user3607698
            Nov 15 '18 at 12:55











          • Try df2.column("value") instead of new Column("value")

            – Serge Harnyk
            Nov 15 '18 at 13:06











          • getting error df2.drop(df1.column("id")). <console>:30: error: value column is not a member of org.apache.spark.sql.DataFrame

            – user3607698
            Nov 15 '18 at 13:09










          Your Answer






          StackExchange.ifUsing("editor", function ()
          StackExchange.using("externalEditor", function ()
          StackExchange.using("snippets", function ()
          StackExchange.snippets.init();
          );
          );
          , "code-snippets");

          StackExchange.ready(function()
          var channelOptions =
          tags: "".split(" "),
          id: "1"
          ;
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function()
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled)
          StackExchange.using("snippets", function()
          createEditor();
          );

          else
          createEditor();

          );

          function createEditor()
          StackExchange.prepareEditor(
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: true,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: 10,
          bindNavPrevention: true,
          postfix: "",
          imageUploader:
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          ,
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          );



          );













          draft saved

          draft discarded


















          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53316976%2fspark-drop-multiple-duplicated-columns-after-join%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown

























          3 Answers
          3






          active

          oldest

          votes








          3 Answers
          3






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          1














          You have to use a vararg syntax to get the column names from an array and drop it.
          Check below:



          scala> dfx.show
          +---+---+---+---+------------+------+
          | A| B| C| D| arr|mincol|
          +---+---+---+---+------------+------+
          | 1| 2| 3| 4|[1, 2, 3, 4]| A|
          | 5| 4| 3| 1|[5, 4, 3, 1]| D|
          +---+---+---+---+------------+------+

          scala> dfx.columns
          res120: Array[String] = Array(A, B, C, D, arr, mincol)

          scala> val dropcols = Array("arr","mincol")
          dropcols: Array[String] = Array(arr, mincol)

          scala> dfx.drop(dropcols:_*).show
          +---+---+---+---+
          | A| B| C| D|
          +---+---+---+---+
          | 1| 2| 3| 4|
          | 5| 4| 3| 1|
          +---+---+---+---+


          scala>


          Update1:



          scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
          df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

          scala> val df2 = df.select("A","B","C")
          df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]

          scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").show
          +---+---+---+---+---+---+
          | A| B| C| D| B| C|
          +---+---+---+---+---+---+
          | 1| 2| 3| 4| 2| 3|
          | 5| 4| 3| 1| 4| 3|
          +---+---+---+---+---+---+


          scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").drop($"t2.B").drop($"t2.C").show
          +---+---+---+---+
          | A| B| C| D|
          +---+---+---+---+
          | 1| 2| 3| 4|
          | 5| 4| 3| 1|
          +---+---+---+---+


          scala>


          Update2:



          To remove the columns dynamically, check the below solution.



          scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
          df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

          scala> val df2 = Seq((1,9,9),(5,8,8)).toDF("A","B","C")
          df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]

          scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner")
          df3: org.apache.spark.sql.DataFrame = [A: int, B: int ... 4 more fields]

          scala> df3.show
          +---+---+---+---+---+---+
          | A| B| C| D| B| C|
          +---+---+---+---+---+---+
          | 1| 2| 3| 4| 9| 9|
          | 5| 4| 3| 1| 8| 8|
          +---+---+---+---+---+---+

          scala> val rem1 = Array("B","C")
          rem1: Array[String] = Array(B, C)

          scala> val rem2 = rem1.map(x=>"t2."+x)
          rem2: Array[String] = Array(t2.B, t2.C)

          scala> val df4 = rem2.foldLeft(df3) (acc: DataFrame, colName: String) => acc.drop(col(colName))
          df4: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

          scala> df4.show
          +---+---+---+---+
          | A| B| C| D|
          +---+---+---+---+
          | 1| 2| 3| 4|
          | 5| 4| 3| 1|
          +---+---+---+---+


          scala>


          Update3



          Renaming/aliasing in one go.



          scala> val dfa = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
          dfa: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

          scala> val dfa2 = dfa.columns.foldLeft(dfa) (acc: DataFrame, colName: String) => acc.withColumnRenamed(colName,colName+"_2")
          dfa2: org.apache.spark.sql.DataFrame = [A_2: int, B_2: int ... 2 more fields]

          scala> dfa2.show
          +---+---+---+---+
          |A_2|B_2|C_2|D_2|
          +---+---+---+---+
          | 1| 2| 3| 4|
          | 5| 4| 3| 1|
          +---+---+---+---+


          scala>





          share|improve this answer

























          • this solution drops both the duplicated columns, my requirement is to drop one and keep other

            – user3607698
            Nov 15 '18 at 13:04











          • in that case, you can alias for the table names and a concatenated drop(s) on the resulting df. check my update.

            – stack0114106
            Nov 15 '18 at 13:36











          • check my update2 on how to approach it dynamically. let me know if it helps

            – stack0114106
            Nov 15 '18 at 14:14











          • Is it possible to do alias for list of columns in one go?

            – user3607698
            Nov 15 '18 at 14:16











          • yes..with the foldLeft technique.. you can just rename the columns in one go.. check my update3

            – stack0114106
            Nov 15 '18 at 14:23















          1














          You have to use a vararg syntax to get the column names from an array and drop it.
          Check below:



          scala> dfx.show
          +---+---+---+---+------------+------+
          | A| B| C| D| arr|mincol|
          +---+---+---+---+------------+------+
          | 1| 2| 3| 4|[1, 2, 3, 4]| A|
          | 5| 4| 3| 1|[5, 4, 3, 1]| D|
          +---+---+---+---+------------+------+

          scala> dfx.columns
          res120: Array[String] = Array(A, B, C, D, arr, mincol)

          scala> val dropcols = Array("arr","mincol")
          dropcols: Array[String] = Array(arr, mincol)

          scala> dfx.drop(dropcols:_*).show
          +---+---+---+---+
          | A| B| C| D|
          +---+---+---+---+
          | 1| 2| 3| 4|
          | 5| 4| 3| 1|
          +---+---+---+---+


          scala>


          Update1:



          scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
          df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

          scala> val df2 = df.select("A","B","C")
          df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]

          scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").show
          +---+---+---+---+---+---+
          | A| B| C| D| B| C|
          +---+---+---+---+---+---+
          | 1| 2| 3| 4| 2| 3|
          | 5| 4| 3| 1| 4| 3|
          +---+---+---+---+---+---+


          scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").drop($"t2.B").drop($"t2.C").show
          +---+---+---+---+
          | A| B| C| D|
          +---+---+---+---+
          | 1| 2| 3| 4|
          | 5| 4| 3| 1|
          +---+---+---+---+


          scala>


          Update2:



          To remove the columns dynamically, check the below solution.



          scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
          df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

          scala> val df2 = Seq((1,9,9),(5,8,8)).toDF("A","B","C")
          df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]

          scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner")
          df3: org.apache.spark.sql.DataFrame = [A: int, B: int ... 4 more fields]

          scala> df3.show
          +---+---+---+---+---+---+
          | A| B| C| D| B| C|
          +---+---+---+---+---+---+
          | 1| 2| 3| 4| 9| 9|
          | 5| 4| 3| 1| 8| 8|
          +---+---+---+---+---+---+

          scala> val rem1 = Array("B","C")
          rem1: Array[String] = Array(B, C)

          scala> val rem2 = rem1.map(x=>"t2."+x)
          rem2: Array[String] = Array(t2.B, t2.C)

          scala> val df4 = rem2.foldLeft(df3) (acc: DataFrame, colName: String) => acc.drop(col(colName))
          df4: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

          scala> df4.show
          +---+---+---+---+
          | A| B| C| D|
          +---+---+---+---+
          | 1| 2| 3| 4|
          | 5| 4| 3| 1|
          +---+---+---+---+


          scala>


          Update3



          Renaming/aliasing in one go.



          scala> val dfa = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
          dfa: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

          scala> val dfa2 = dfa.columns.foldLeft(dfa) (acc: DataFrame, colName: String) => acc.withColumnRenamed(colName,colName+"_2")
          dfa2: org.apache.spark.sql.DataFrame = [A_2: int, B_2: int ... 2 more fields]

          scala> dfa2.show
          +---+---+---+---+
          |A_2|B_2|C_2|D_2|
          +---+---+---+---+
          | 1| 2| 3| 4|
          | 5| 4| 3| 1|
          +---+---+---+---+


          scala>





          share|improve this answer

























          • this solution drops both the duplicated columns, my requirement is to drop one and keep other

            – user3607698
            Nov 15 '18 at 13:04











          • in that case, you can alias for the table names and a concatenated drop(s) on the resulting df. check my update.

            – stack0114106
            Nov 15 '18 at 13:36











          • check my update2 on how to approach it dynamically. let me know if it helps

            – stack0114106
            Nov 15 '18 at 14:14











          • Is it possible to do alias for list of columns in one go?

            – user3607698
            Nov 15 '18 at 14:16











          • yes..with the foldLeft technique.. you can just rename the columns in one go.. check my update3

            – stack0114106
            Nov 15 '18 at 14:23













          1












          1








          1







          You have to use a vararg syntax to get the column names from an array and drop it.
          Check below:



          scala> dfx.show
          +---+---+---+---+------------+------+
          | A| B| C| D| arr|mincol|
          +---+---+---+---+------------+------+
          | 1| 2| 3| 4|[1, 2, 3, 4]| A|
          | 5| 4| 3| 1|[5, 4, 3, 1]| D|
          +---+---+---+---+------------+------+

          scala> dfx.columns
          res120: Array[String] = Array(A, B, C, D, arr, mincol)

          scala> val dropcols = Array("arr","mincol")
          dropcols: Array[String] = Array(arr, mincol)

          scala> dfx.drop(dropcols:_*).show
          +---+---+---+---+
          | A| B| C| D|
          +---+---+---+---+
          | 1| 2| 3| 4|
          | 5| 4| 3| 1|
          +---+---+---+---+


          scala>


          Update1:



          scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
          df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

          scala> val df2 = df.select("A","B","C")
          df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]

          scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").show
          +---+---+---+---+---+---+
          | A| B| C| D| B| C|
          +---+---+---+---+---+---+
          | 1| 2| 3| 4| 2| 3|
          | 5| 4| 3| 1| 4| 3|
          +---+---+---+---+---+---+


          scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").drop($"t2.B").drop($"t2.C").show
          +---+---+---+---+
          | A| B| C| D|
          +---+---+---+---+
          | 1| 2| 3| 4|
          | 5| 4| 3| 1|
          +---+---+---+---+


          scala>


          Update2:



          To remove the columns dynamically, check the below solution.



          scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
          df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

          scala> val df2 = Seq((1,9,9),(5,8,8)).toDF("A","B","C")
          df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]

          scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner")
          df3: org.apache.spark.sql.DataFrame = [A: int, B: int ... 4 more fields]

          scala> df3.show
          +---+---+---+---+---+---+
          | A| B| C| D| B| C|
          +---+---+---+---+---+---+
          | 1| 2| 3| 4| 9| 9|
          | 5| 4| 3| 1| 8| 8|
          +---+---+---+---+---+---+

          scala> val rem1 = Array("B","C")
          rem1: Array[String] = Array(B, C)

          scala> val rem2 = rem1.map(x=>"t2."+x)
          rem2: Array[String] = Array(t2.B, t2.C)

          scala> val df4 = rem2.foldLeft(df3) (acc: DataFrame, colName: String) => acc.drop(col(colName))
          df4: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

          scala> df4.show
          +---+---+---+---+
          | A| B| C| D|
          +---+---+---+---+
          | 1| 2| 3| 4|
          | 5| 4| 3| 1|
          +---+---+---+---+


          scala>


          Update3



          Renaming/aliasing in one go.



          scala> val dfa = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
          dfa: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

          scala> val dfa2 = dfa.columns.foldLeft(dfa) (acc: DataFrame, colName: String) => acc.withColumnRenamed(colName,colName+"_2")
          dfa2: org.apache.spark.sql.DataFrame = [A_2: int, B_2: int ... 2 more fields]

          scala> dfa2.show
          +---+---+---+---+
          |A_2|B_2|C_2|D_2|
          +---+---+---+---+
          | 1| 2| 3| 4|
          | 5| 4| 3| 1|
          +---+---+---+---+


          scala>





          share|improve this answer















          You have to use a vararg syntax to get the column names from an array and drop it.
          Check below:



          scala> dfx.show
          +---+---+---+---+------------+------+
          | A| B| C| D| arr|mincol|
          +---+---+---+---+------------+------+
          | 1| 2| 3| 4|[1, 2, 3, 4]| A|
          | 5| 4| 3| 1|[5, 4, 3, 1]| D|
          +---+---+---+---+------------+------+

          scala> dfx.columns
          res120: Array[String] = Array(A, B, C, D, arr, mincol)

          scala> val dropcols = Array("arr","mincol")
          dropcols: Array[String] = Array(arr, mincol)

          scala> dfx.drop(dropcols:_*).show
          +---+---+---+---+
          | A| B| C| D|
          +---+---+---+---+
          | 1| 2| 3| 4|
          | 5| 4| 3| 1|
          +---+---+---+---+


          scala>


          Update1:



          scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
          df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

          scala> val df2 = df.select("A","B","C")
          df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]

          scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").show
          +---+---+---+---+---+---+
          | A| B| C| D| B| C|
          +---+---+---+---+---+---+
          | 1| 2| 3| 4| 2| 3|
          | 5| 4| 3| 1| 4| 3|
          +---+---+---+---+---+---+


          scala> df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner").drop($"t2.B").drop($"t2.C").show
          +---+---+---+---+
          | A| B| C| D|
          +---+---+---+---+
          | 1| 2| 3| 4|
          | 5| 4| 3| 1|
          +---+---+---+---+


          scala>


          Update2:



          To remove the columns dynamically, check the below solution.



          scala> val df = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
          df: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

          scala> val df2 = Seq((1,9,9),(5,8,8)).toDF("A","B","C")
          df2: org.apache.spark.sql.DataFrame = [A: int, B: int ... 1 more field]

          scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("A"),"inner")
          df3: org.apache.spark.sql.DataFrame = [A: int, B: int ... 4 more fields]

          scala> df3.show
          +---+---+---+---+---+---+
          | A| B| C| D| B| C|
          +---+---+---+---+---+---+
          | 1| 2| 3| 4| 9| 9|
          | 5| 4| 3| 1| 8| 8|
          +---+---+---+---+---+---+

          scala> val rem1 = Array("B","C")
          rem1: Array[String] = Array(B, C)

          scala> val rem2 = rem1.map(x=>"t2."+x)
          rem2: Array[String] = Array(t2.B, t2.C)

          scala> val df4 = rem2.foldLeft(df3) (acc: DataFrame, colName: String) => acc.drop(col(colName))
          df4: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

          scala> df4.show
          +---+---+---+---+
          | A| B| C| D|
          +---+---+---+---+
          | 1| 2| 3| 4|
          | 5| 4| 3| 1|
          +---+---+---+---+


          scala>


          Update3



          Renaming/aliasing in one go.



          scala> val dfa = Seq((1,2,3,4),(5,4,3,1)).toDF("A","B","C","D")
          dfa: org.apache.spark.sql.DataFrame = [A: int, B: int ... 2 more fields]

          scala> val dfa2 = dfa.columns.foldLeft(dfa) (acc: DataFrame, colName: String) => acc.withColumnRenamed(colName,colName+"_2")
          dfa2: org.apache.spark.sql.DataFrame = [A_2: int, B_2: int ... 2 more fields]

          scala> dfa2.show
          +---+---+---+---+
          |A_2|B_2|C_2|D_2|
          +---+---+---+---+
          | 1| 2| 3| 4|
          | 5| 4| 3| 1|
          +---+---+---+---+


          scala>






          share|improve this answer














          share|improve this answer



          share|improve this answer








          edited Nov 15 '18 at 14:24

























          answered Nov 15 '18 at 13:01









          stack0114106stack0114106

          4,2282421




          4,2282421












          • this solution drops both the duplicated columns, my requirement is to drop one and keep other

            – user3607698
            Nov 15 '18 at 13:04











          • in that case, you can alias for the table names and a concatenated drop(s) on the resulting df. check my update.

            – stack0114106
            Nov 15 '18 at 13:36











          • check my update2 on how to approach it dynamically. let me know if it helps

            – stack0114106
            Nov 15 '18 at 14:14











          • Is it possible to do alias for list of columns in one go?

            – user3607698
            Nov 15 '18 at 14:16











          • yes..with the foldLeft technique.. you can just rename the columns in one go.. check my update3

            – stack0114106
            Nov 15 '18 at 14:23

















          • this solution drops both the duplicated columns, my requirement is to drop one and keep other

            – user3607698
            Nov 15 '18 at 13:04











          • in that case, you can alias for the table names and a concatenated drop(s) on the resulting df. check my update.

            – stack0114106
            Nov 15 '18 at 13:36











          • check my update2 on how to approach it dynamically. let me know if it helps

            – stack0114106
            Nov 15 '18 at 14:14











          • Is it possible to do alias for list of columns in one go?

            – user3607698
            Nov 15 '18 at 14:16











          • yes..with the foldLeft technique.. you can just rename the columns in one go.. check my update3

            – stack0114106
            Nov 15 '18 at 14:23
















          this solution drops both the duplicated columns, my requirement is to drop one and keep other

          – user3607698
          Nov 15 '18 at 13:04





          this solution drops both the duplicated columns, my requirement is to drop one and keep other

          – user3607698
          Nov 15 '18 at 13:04













          in that case, you can alias for the table names and a concatenated drop(s) on the resulting df. check my update.

          – stack0114106
          Nov 15 '18 at 13:36





          in that case, you can alias for the table names and a concatenated drop(s) on the resulting df. check my update.

          – stack0114106
          Nov 15 '18 at 13:36













          check my update2 on how to approach it dynamically. let me know if it helps

          – stack0114106
          Nov 15 '18 at 14:14





          check my update2 on how to approach it dynamically. let me know if it helps

          – stack0114106
          Nov 15 '18 at 14:14













          Is it possible to do alias for list of columns in one go?

          – user3607698
          Nov 15 '18 at 14:16





          Is it possible to do alias for list of columns in one go?

          – user3607698
          Nov 15 '18 at 14:16













          yes..with the foldLeft technique.. you can just rename the columns in one go.. check my update3

          – stack0114106
          Nov 15 '18 at 14:23





          yes..with the foldLeft technique.. you can just rename the columns in one go.. check my update3

          – stack0114106
          Nov 15 '18 at 14:23













          0














          Suppose if you have two dataframes DF1 and DF2,
          You can use either of the ways to join on a particular column



           1. DF1.join(DF2,Seq("column1","column2"))
          2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2")))


          So to drop the duplicate columns you can use



           1. DF1.join(DF2,Seq("column1","column2")).drop(DF1("column1")).drop(DF1("column1"),DF1("column2"))
          2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2"))).drop(DF1("column1"),DF1("column2"))


          In either case you can use drop("columnname") to drop what ever columns you need doesn't matter from which df it comes from as it is equal in this case.






          share|improve this answer























          • second approach worked, but I have the list of columns to be dropped in the val List("column1", "column2", "columnn"), how to pass this list for this drop(DF1("column1"),DF1("column2"),...... DF1("columnn"))

            – user3607698
            Nov 15 '18 at 10:30











          • what do you mean by list of columns ? are they dynamically generated ?

            – Sundeep Pidugu
            Nov 15 '18 at 11:10












          • val clmlist = List("column1", "column2", "columnn") df1.join(df2, clmlist, "inner") this is my joining function, I want something like this, df1.join(df2, clmlist, "inner").drop(clmlist)

            – user3607698
            Nov 15 '18 at 11:17











          • i never tried this are you getting any any error while running this ? please do let me know

            – Sundeep Pidugu
            Nov 15 '18 at 11:33











          • this I tried df1.join(df2, clmlist, "inner"), it works .. but this dropping doesnt work df1.join(df2, clmlist, "inner").drop(clmlist) . so I want a method to drop the all these columns in the clmlist

            – user3607698
            Nov 15 '18 at 12:26















          0














          Suppose if you have two dataframes DF1 and DF2,
          You can use either of the ways to join on a particular column



           1. DF1.join(DF2,Seq("column1","column2"))
          2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2")))


          So to drop the duplicate columns you can use



           1. DF1.join(DF2,Seq("column1","column2")).drop(DF1("column1")).drop(DF1("column1"),DF1("column2"))
          2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2"))).drop(DF1("column1"),DF1("column2"))


          In either case you can use drop("columnname") to drop what ever columns you need doesn't matter from which df it comes from as it is equal in this case.






          share|improve this answer























          • second approach worked, but I have the list of columns to be dropped in the val List("column1", "column2", "columnn"), how to pass this list for this drop(DF1("column1"),DF1("column2"),...... DF1("columnn"))

            – user3607698
            Nov 15 '18 at 10:30











          • what do you mean by list of columns ? are they dynamically generated ?

            – Sundeep Pidugu
            Nov 15 '18 at 11:10












          • val clmlist = List("column1", "column2", "columnn") df1.join(df2, clmlist, "inner") this is my joining function, I want something like this, df1.join(df2, clmlist, "inner").drop(clmlist)

            – user3607698
            Nov 15 '18 at 11:17











          • i never tried this are you getting any any error while running this ? please do let me know

            – Sundeep Pidugu
            Nov 15 '18 at 11:33











          • this I tried df1.join(df2, clmlist, "inner"), it works .. but this dropping doesnt work df1.join(df2, clmlist, "inner").drop(clmlist) . so I want a method to drop the all these columns in the clmlist

            – user3607698
            Nov 15 '18 at 12:26













          0












          0








          0







          Suppose if you have two dataframes DF1 and DF2,
          You can use either of the ways to join on a particular column



           1. DF1.join(DF2,Seq("column1","column2"))
          2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2")))


          So to drop the duplicate columns you can use



           1. DF1.join(DF2,Seq("column1","column2")).drop(DF1("column1")).drop(DF1("column1"),DF1("column2"))
          2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2"))).drop(DF1("column1"),DF1("column2"))


          In either case you can use drop("columnname") to drop what ever columns you need doesn't matter from which df it comes from as it is equal in this case.






          share|improve this answer













          Suppose if you have two dataframes DF1 and DF2,
          You can use either of the ways to join on a particular column



           1. DF1.join(DF2,Seq("column1","column2"))
          2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2")))


          So to drop the duplicate columns you can use



           1. DF1.join(DF2,Seq("column1","column2")).drop(DF1("column1")).drop(DF1("column1"),DF1("column2"))
          2. DF1.join(DF2,DF1("column1") === DF2("column1") && DF1("column2") === DF2("column2"))).drop(DF1("column1"),DF1("column2"))


          In either case you can use drop("columnname") to drop what ever columns you need doesn't matter from which df it comes from as it is equal in this case.







          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Nov 15 '18 at 10:19









          Sundeep PiduguSundeep Pidugu

          40614




          40614












          • second approach worked, but I have the list of columns to be dropped in the val List("column1", "column2", "columnn"), how to pass this list for this drop(DF1("column1"),DF1("column2"),...... DF1("columnn"))

            – user3607698
            Nov 15 '18 at 10:30











          • what do you mean by list of columns ? are they dynamically generated ?

            – Sundeep Pidugu
            Nov 15 '18 at 11:10












          • val clmlist = List("column1", "column2", "columnn") df1.join(df2, clmlist, "inner") this is my joining function, I want something like this, df1.join(df2, clmlist, "inner").drop(clmlist)

            – user3607698
            Nov 15 '18 at 11:17











          • i never tried this are you getting any any error while running this ? please do let me know

            – Sundeep Pidugu
            Nov 15 '18 at 11:33











          • this I tried df1.join(df2, clmlist, "inner"), it works .. but this dropping doesnt work df1.join(df2, clmlist, "inner").drop(clmlist) . so I want a method to drop the all these columns in the clmlist

            – user3607698
            Nov 15 '18 at 12:26

















          • second approach worked, but I have the list of columns to be dropped in the val List("column1", "column2", "columnn"), how to pass this list for this drop(DF1("column1"),DF1("column2"),...... DF1("columnn"))

            – user3607698
            Nov 15 '18 at 10:30











          • what do you mean by list of columns ? are they dynamically generated ?

            – Sundeep Pidugu
            Nov 15 '18 at 11:10












          • val clmlist = List("column1", "column2", "columnn") df1.join(df2, clmlist, "inner") this is my joining function, I want something like this, df1.join(df2, clmlist, "inner").drop(clmlist)

            – user3607698
            Nov 15 '18 at 11:17











          • i never tried this are you getting any any error while running this ? please do let me know

            – Sundeep Pidugu
            Nov 15 '18 at 11:33











          • this I tried df1.join(df2, clmlist, "inner"), it works .. but this dropping doesnt work df1.join(df2, clmlist, "inner").drop(clmlist) . so I want a method to drop the all these columns in the clmlist

            – user3607698
            Nov 15 '18 at 12:26
















          second approach worked, but I have the list of columns to be dropped in the val List("column1", "column2", "columnn"), how to pass this list for this drop(DF1("column1"),DF1("column2"),...... DF1("columnn"))

          – user3607698
          Nov 15 '18 at 10:30





          second approach worked, but I have the list of columns to be dropped in the val List("column1", "column2", "columnn"), how to pass this list for this drop(DF1("column1"),DF1("column2"),...... DF1("columnn"))

          – user3607698
          Nov 15 '18 at 10:30













          what do you mean by list of columns ? are they dynamically generated ?

          – Sundeep Pidugu
          Nov 15 '18 at 11:10






          what do you mean by list of columns ? are they dynamically generated ?

          – Sundeep Pidugu
          Nov 15 '18 at 11:10














          val clmlist = List("column1", "column2", "columnn") df1.join(df2, clmlist, "inner") this is my joining function, I want something like this, df1.join(df2, clmlist, "inner").drop(clmlist)

          – user3607698
          Nov 15 '18 at 11:17





          val clmlist = List("column1", "column2", "columnn") df1.join(df2, clmlist, "inner") this is my joining function, I want something like this, df1.join(df2, clmlist, "inner").drop(clmlist)

          – user3607698
          Nov 15 '18 at 11:17













          i never tried this are you getting any any error while running this ? please do let me know

          – Sundeep Pidugu
          Nov 15 '18 at 11:33





          i never tried this are you getting any any error while running this ? please do let me know

          – Sundeep Pidugu
          Nov 15 '18 at 11:33













          this I tried df1.join(df2, clmlist, "inner"), it works .. but this dropping doesnt work df1.join(df2, clmlist, "inner").drop(clmlist) . so I want a method to drop the all these columns in the clmlist

          – user3607698
          Nov 15 '18 at 12:26





          this I tried df1.join(df2, clmlist, "inner"), it works .. but this dropping doesnt work df1.join(df2, clmlist, "inner").drop(clmlist) . so I want a method to drop the all these columns in the clmlist

          – user3607698
          Nov 15 '18 at 12:26











          0














          1. df.dropDuplicates() works only for rows.

          2. You can df1.drop(df2.column("value"))

          3. You can specify columns you want to select, for example, with df.select(Seq of columns)





          share|improve this answer























          • df.select gives ambiguous error when two duplicated columns are present, my requirement is to drop one and keep other in duplicated list

            – user3607698
            Nov 15 '18 at 12:55











          • Try df2.column("value") instead of new Column("value")

            – Serge Harnyk
            Nov 15 '18 at 13:06











          • getting error df2.drop(df1.column("id")). <console>:30: error: value column is not a member of org.apache.spark.sql.DataFrame

            – user3607698
            Nov 15 '18 at 13:09















          0














          1. df.dropDuplicates() works only for rows.

          2. You can df1.drop(df2.column("value"))

          3. You can specify columns you want to select, for example, with df.select(Seq of columns)





          share|improve this answer























          • df.select gives ambiguous error when two duplicated columns are present, my requirement is to drop one and keep other in duplicated list

            – user3607698
            Nov 15 '18 at 12:55











          • Try df2.column("value") instead of new Column("value")

            – Serge Harnyk
            Nov 15 '18 at 13:06











          • getting error df2.drop(df1.column("id")). <console>:30: error: value column is not a member of org.apache.spark.sql.DataFrame

            – user3607698
            Nov 15 '18 at 13:09













          0












          0








          0







          1. df.dropDuplicates() works only for rows.

          2. You can df1.drop(df2.column("value"))

          3. You can specify columns you want to select, for example, with df.select(Seq of columns)





          share|improve this answer













          1. df.dropDuplicates() works only for rows.

          2. You can df1.drop(df2.column("value"))

          3. You can specify columns you want to select, for example, with df.select(Seq of columns)






          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Nov 15 '18 at 12:52









          Serge HarnykSerge Harnyk

          374110




          374110












          • df.select gives ambiguous error when two duplicated columns are present, my requirement is to drop one and keep other in duplicated list

            – user3607698
            Nov 15 '18 at 12:55











          • Try df2.column("value") instead of new Column("value")

            – Serge Harnyk
            Nov 15 '18 at 13:06











          • getting error df2.drop(df1.column("id")). <console>:30: error: value column is not a member of org.apache.spark.sql.DataFrame

            – user3607698
            Nov 15 '18 at 13:09

















          • df.select gives ambiguous error when two duplicated columns are present, my requirement is to drop one and keep other in duplicated list

            – user3607698
            Nov 15 '18 at 12:55











          • Try df2.column("value") instead of new Column("value")

            – Serge Harnyk
            Nov 15 '18 at 13:06











          • getting error df2.drop(df1.column("id")). <console>:30: error: value column is not a member of org.apache.spark.sql.DataFrame

            – user3607698
            Nov 15 '18 at 13:09
















          df.select gives ambiguous error when two duplicated columns are present, my requirement is to drop one and keep other in duplicated list

          – user3607698
          Nov 15 '18 at 12:55





          df.select gives ambiguous error when two duplicated columns are present, my requirement is to drop one and keep other in duplicated list

          – user3607698
          Nov 15 '18 at 12:55













          Try df2.column("value") instead of new Column("value")

          – Serge Harnyk
          Nov 15 '18 at 13:06





          Try df2.column("value") instead of new Column("value")

          – Serge Harnyk
          Nov 15 '18 at 13:06













          getting error df2.drop(df1.column("id")). <console>:30: error: value column is not a member of org.apache.spark.sql.DataFrame

          – user3607698
          Nov 15 '18 at 13:09





          getting error df2.drop(df1.column("id")). <console>:30: error: value column is not a member of org.apache.spark.sql.DataFrame

          – user3607698
          Nov 15 '18 at 13:09

















          draft saved

          draft discarded
















































          Thanks for contributing an answer to Stack Overflow!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid


          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.

          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function ()
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53316976%2fspark-drop-multiple-duplicated-columns-after-join%23new-answer', 'question_page');

          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          這個網誌中的熱門文章

          How to read a connectionString WITH PROVIDER in .NET Core?

          Node.js Script on GitHub Pages or Amazon S3

          Museum of Modern and Contemporary Art of Trento and Rovereto