Create an array column from other columns after processing the column values

























Let's say I have a Spark dataframe that includes the categorical columns (School, Type, Group):



------------------------------------------------------------
StudentID | School | Type | Group
------------------------------------------------------------
1 | ABC | Elementary | Music-Arts
2 | ABC | Elementary | Football
3 | DEF | Secondary | Basketball-Cricket
4 | DEF | Secondary | Cricket
------------------------------------------------------------


I need to add one more column to the dataframe as below:



--------------------------------------------------------------------------------------
StudentID | School | Type | Group | Combined Array
---------------------------------------------------------------------------------------
1 | ABC | Elementary | Music-Arts | ["School: ABC", "Type: Elementary", "Group: Music", "Group: Arts"]
2 | ABC | Elementary | Football | ["School: ABC", "Type: Elementary", "Group: Football"]
3 | DEF | Secondary | Basketball-Cricket | ["School: DEF", "Type: Secondary", "Group: Basketball", "Group: Cricket"]
4 | DEF | Secondary | Cricket | ["School: DEF", "Type: Secondary", "Group: Cricket"]
----------------------------------------------------------------------------------------


The extra column is a combination of all the categorical columns, but with additional processing on the 'Group' column: its values need to be split on '-'.



All the categorical columns, including 'Group', are provided in a list. The name of the column to be split ('Group') is also passed in as a String. The dataframe has other columns that are not used.
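The required per-row logic can be sketched in plain Scala, independent of Spark (the column names, the split column, and the '-' delimiter are taken from the example above):

```scala
val columns = List("School", "Type", "Group")
val splitColumn = "Group" // the column whose values are split on "-"

// Build the combined array for one row, represented here as a name -> value map.
def combine(row: Map[String, String]): List[String] =
  columns.flatMap { c =>
    if (c == splitColumn) row(c).split("-").toList.map(v => s"$c: $v")
    else List(s"$c: ${row(c)}")
  }
```

For example, `combine(Map("School" -> "ABC", "Type" -> "Elementary", "Group" -> "Music-Arts"))` returns `List("School: ABC", "Type: Elementary", "Group: Music", "Group: Arts")`, matching the first row of the desired output.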



I am looking for the best-performing solution.



If it were a simple array, it could be built with a single 'withColumn' transformation:



val columns = List("School", "Type", "Group")
var df2 = df1.withColumn("CombinedArray", array(columns.map(df1(_)):_*))


However, because of the additional processing on the 'Group' column, the solution doesn't seem straightforward.


































  • Just to be sure: why do you want redundant information in the combined column? I get why you want an array containing the "-"-split of the group, but i am less sure about the other values. I suggest df.withColumn("combined", split($"Group", "-"))

    – Elmar Macek
    Nov 13 '18 at 16:33












  • The column will be fed to countVectorizer, so each entry of the array (category: value) will be identified differently. For instance the same value may be present across different categories.

    – John Subas
    Nov 13 '18 at 16:39











  • Ah I see, well stack0114106 got the correct answer if you add the splitting of the group to it. ;)

    – Elmar Macek
    Nov 13 '18 at 16:49











  • In case you do not wanna wrangle so much with String-concats in order to put identifying prefixes to the different type of informations (which might be a little annoying for the Group category), you could also just do: df.withColumn("combined", split($"Group", "-")).withColumn("SchoolArray", array($"School")).withColumn("TypeArray", array($"Type")) and just apply 3 CountVectorizers for each of the "XYZArrays" and a final VectorAssembler to put all together. This version has the benefit, that you can define different minimum frequencies for each of the CountVectorizers.

    – Elmar Macek
    Nov 13 '18 at 17:04
















scala apache-spark apache-spark-sql rdd






asked Nov 13 '18 at 16:08 by John Subas
edited Nov 13 '18 at 17:16

3 Answers
































Using regexp_replace to prepend the column name at the start of each field and to replace each "-" in between:



import org.apache.spark.sql.functions.{array, regexp_replace}

val df1 = spark.read.option("header","true").csv(filePath)
val columns = List("School", "Type", "Group")
val df2 = df1.withColumn("CombinedArray", array(columns.map { colName =>
  regexp_replace(regexp_replace(df1(colName), "(^)", s"$colName: "), "(-)", s", $colName: ")
}: _*))
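The effect of the nested regexp_replace pair on a single value can be checked with plain String.replaceAll, which follows the same Java regex semantics (a sketch; note that the split pieces end up inside one string per column, not as separate array elements):

```scala
// Prepend "<col>: " at the start of the value, then turn every "-" into ", <col>: ".
def prefixAndSplit(colName: String, value: String): String =
  value.replaceAll("^", s"$colName: ").replaceAll("-", s", $colName: ")
```

Here `prefixAndSplit("Group", "Music-Arts")` yields `"Group: Music, Group: Arts"`, while `prefixAndSplit("School", "ABC")` yields `"School: ABC"`.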




























  • This would work probably. I will need to modify a little to include splits only for selected columns among the category columns. Will try to work it out and post the answer here.

    – John Subas
    Nov 13 '18 at 19:09











  • Reason for unaccepting answer? Works as expected output you mentioned.

    – Shasankar
    Nov 13 '18 at 19:15











  • I will accept it once I am able to work on your code to get the exact solution. The split need to be done for one column only: 'Group', not for all the columns

    – John Subas
    Nov 13 '18 at 19:20











  • The below code would be the accurate answer: var df2 = df.withColumn("CombinedArray", array(columns.map( colName => colName match { case "Group" => regexp_replace(regexp_replace(df(colName),"(^)",s"$colName: "),"(-)",s", $colName: "); case _ => regexp_replace(df(colName),"(^)",s"$colName: ") } ):_*))

    – John Subas
    Nov 13 '18 at 19:25

































Using spark.sql(), check this out:



Seq(("ABC","Elementary","Music-Arts"),("ABC","Elementary","Football"),("DEF","Secondary","Basketball-Cricket"),("DEF","Secondary","Cricket"))
.toDF("School","Type","Group").createOrReplaceTempView("taba")
spark.sql( """ select school, type, group, array(concat('School:',school),concat('type:',type),concat('group:',group)) as combined_array from taba """).show(false)


Output:



+------+----------+------------------+------------------------------------------------------+
|school|type |group |combined_array |
+------+----------+------------------+------------------------------------------------------+
|ABC |Elementary|Music-Arts |[School:ABC, type:Elementary, group:Music-Arts] |
|ABC |Elementary|Football |[School:ABC, type:Elementary, group:Football] |
|DEF |Secondary |Basketball-Cricket|[School:DEF, type:Secondary, group:Basketball-Cricket]|
|DEF |Secondary |Cricket |[School:DEF, type:Secondary, group:Cricket] |
+------+----------+------------------+------------------------------------------------------+


If you need it as a dataframe, then



val df = spark.sql( """ select school, type, group, array(concat('School:',school),concat('type:',type),concat('group:',group)) as combined_array from taba """)
df.printSchema()

root
|-- school: string (nullable = true)
|-- type: string (nullable = true)
|-- group: string (nullable = true)
|-- combined_array: array (nullable = false)
| |-- element: string (containsNull = true)


Update:



Dynamically constructing the SQL column list:



scala> val df = Seq(("ABC","Elementary","Music-Arts"),("ABC","Elementary","Football"),("DEF","Secondary","Basketball-Cricket"),("DEF","Secondary","Cricket")).toDF("School","Type","Group")
df: org.apache.spark.sql.DataFrame = [School: string, Type: string ... 1 more field]

scala> val columns = df.columns.mkString("select ", ",", "")
columns: String = select School,Type,Group

scala> val arr = df.columns.map( x=> s"concat('"+x+"',"+x+")" ).mkString("array(",",",") as combined_array ")
arr: String = "array(concat('School',School),concat('Type',Type),concat('Group',Group)) as combined_array "

scala> val sql_string = columns + " , " + arr + " from taba "
sql_string: String = "select School,Type,Group , array(concat('School',School),concat('Type',Type),concat('Group',Group)) as combined_array from taba "

scala> df.createOrReplaceTempView("taba")

scala> spark.sql(sql_string).show(false)
+------+----------+------------------+---------------------------------------------------+
|School|Type |Group |combined_array |
+------+----------+------------------+---------------------------------------------------+
|ABC |Elementary|Music-Arts |[SchoolABC, TypeElementary, GroupMusic-Arts] |
|ABC |Elementary|Football |[SchoolABC, TypeElementary, GroupFootball] |
|DEF |Secondary |Basketball-Cricket|[SchoolDEF, TypeSecondary, GroupBasketball-Cricket]|
|DEF |Secondary |Cricket |[SchoolDEF, TypeSecondary, GroupCricket] |
+------+----------+------------------+---------------------------------------------------+


scala>
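The dynamic construction in this update is plain string assembly, so it can be sketched (and unit-tested) without a Spark session; the table name taba is the one registered above:

```scala
val cols = Seq("School", "Type", "Group")

// Mirror the answer's approach: one concat('<col>',<col>) per column, wrapped in array(...).
val arrExpr = cols.map(c => s"concat('$c',$c)").mkString("array(", ",", ") as combined_array")
val sqlString = s"select ${cols.mkString(",")} , $arrExpr from taba"
```

`sqlString` comes out as `select School,Type,Group , array(concat('School',School),concat('Type',Type),concat('Group',Group)) as combined_array from taba`, matching the `sql_string` built in the REPL session above.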


Update2:



scala> val df = Seq((1,"ABC","Elementary","Music-Arts"),(2,"ABC","Elementary","Football"),(3,"DEF","Secondary","Basketball-Cricket"),(4,"DEF","Secondary","Cricket")).toDF("StudentID","School","Type","Group")
df: org.apache.spark.sql.DataFrame = [StudentID: int, School: string ... 2 more fields]

scala> df.createOrReplaceTempView("student")

scala> val df2 = spark.sql(""" select studentid, collect_list(concat('Group:', t.sp1)) as sp2 from (select StudentID,School,Type,explode((split(group,'-'))) as sp1 from student where size(split(group,'-')) > 1 ) t group by studentid """)
df2: org.apache.spark.sql.DataFrame = [studentid: int, sp2: array<string>]

scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("studentid"),"LeftOuter")
df3: org.apache.spark.sql.DataFrame = [StudentID: int, School: string ... 3 more fields]

scala> df3.createOrReplaceTempView("student2")

scala> spark.sql(""" select studentid, school,group, type, array(concat('School:',school),concat('type:',type),concat_ws(',',temp_arr)) from (select studentid,school,group,type, case when sp2 is null then array(concat("Group:",group)) else sp2 end as temp_arr from student2) t """).show(false)
+---------+------+------------------+----------+---------------------------------------------------------------------------+
|studentid|school|group |type |array(concat(School:, school), concat(type:, type), concat_ws(,, temp_arr))|
+---------+------+------------------+----------+---------------------------------------------------------------------------+
|1 |ABC |Music-Arts |Elementary|[School:ABC, type:Elementary, Group:Music,Group:Arts] |
|2 |ABC |Football |Elementary|[School:ABC, type:Elementary, Group:Football] |
|3 |DEF |Basketball-Cricket|Secondary |[School:DEF, type:Secondary, Group:Basketball,Group:Cricket] |
|4 |DEF |Cricket |Secondary |[School:DEF, type:Secondary, Group:Cricket] |
+---------+------+------------------+----------+---------------------------------------------------------------------------+


scala>






























  • This solution doesn't address the core issue where the 'Group' values need to be split dynamically

    – John Subas
    Nov 13 '18 at 17:00











  • Ok.. let me work on it

    – stack0114106
    Nov 13 '18 at 17:08











  • @John Subas.. could you please check the update.

    – stack0114106
    Nov 13 '18 at 17:46











  • Thanks, but the update still doesn't address the core issue here: If you look at my output, the first and 3rd row has an array size of 4. We need to split the 'Group' column based on '-' and add multiple elements to array, one for each split.

    – John Subas
    Nov 13 '18 at 18:22












  • @John Subas.. could you pls check Update2

    – stack0114106
    Nov 13 '18 at 19:51
































You need to first add an empty column then map it like so (in Java):



StructType newSchema = df1.schema().add("Combined Array", DataTypes.StringType);

df1 = df1.withColumn("Combined Array", lit(null))
    .map((MapFunction<Row, Row>) row ->
            RowFactory.create(...values...), // add existing values and the new value here
        newSchema);


It should be fairly similar in Scala.






























answered Nov 13 '18 at 18:50 by Shasankar












    • This would work probably. I will need to modify a little to include splits only for selected columns among the category columns. Will try to work it out and post the answer here.

      – John Subas
      Nov 13 '18 at 19:09











    • Reason for unaccepting answer? Works as expected output you mentioned.

      – Shasankar
      Nov 13 '18 at 19:15











    • I will accept it once I am able to work on your code to get the exact solution. The split need to be done for one column only: 'Group', not for all the columns

      – John Subas
      Nov 13 '18 at 19:20











    • The below code would be the accurate answer: var df2 = df.withColumn("CombinedArray", array(columns.map( colName => colName match case "Group" => regexp_replace(regexp_replace(df(colName),"(^)",s"$colName: "),"(-)",s", $colName: ") case _ => regexp_replace(df(colName),"(^)",s"$colName: ") ):_*))

      – John Subas
      Nov 13 '18 at 19:25


















    Using spark.sql(), check this out:



    Seq(("ABC","Elementary","Music-Arts"),("ABC","Elementary","Football"),("DEF","Secondary","Basketball-Cricket"),("DEF","Secondary","Cricket"))
    .toDF("School","Type","Group").createOrReplaceTempView("taba")
    spark.sql( """ select school, type, group, array(concat('School:',school),concat('type:',type),concat('group:',group)) as combined_array from taba """).show(false)


    Output:



    +------+----------+------------------+------------------------------------------------------+
    |school|type |group |combined_array |
    +------+----------+------------------+------------------------------------------------------+
    |ABC |Elementary|Music-Arts |[School:ABC, type:Elementary, group:Music-Arts] |
    |ABC |Elementary|Football |[School:ABC, type:Elementary, group:Football] |
    |DEF |Secondary |Basketball-Cricket|[School:DEF, type:Secondary, group:Basketball-Cricket]|
    |DEF |Secondary |Cricket |[School:DEF, type:Secondary, group:Cricket] |
    +------+----------+------------------+------------------------------------------------------+


    If you need it as a DataFrame:



    val df = spark.sql( """ select school, type, group, array(concat('School:',school),concat('type:',type),concat('group:',group)) as combined_array from taba """)
    df.printSchema()

    root
    |-- school: string (nullable = true)
    |-- type: string (nullable = true)
    |-- group: string (nullable = true)
    |-- combined_array: array (nullable = false)
    | |-- element: string (containsNull = true)


    Update:



    Dynamically constructing the SQL columns:



    scala> val df = Seq(("ABC","Elementary","Music-Arts"),("ABC","Elementary","Football"),("DEF","Secondary","Basketball-Cricket"),("DEF","Secondary","Cricket")).toDF("School","Type","Group")
    df: org.apache.spark.sql.DataFrame = [School: string, Type: string ... 1 more field]

    scala> val columns = df.columns.mkString("select ", ",", "")
    columns: String = select School,Type,Group

    scala> val arr = df.columns.map( x=> s"concat('"+x+"',"+x+")" ).mkString("array(",",",") as combined_array ")
    arr: String = "array(concat('School',School),concat('Type',Type),concat('Group',Group)) as combined_array "

    scala> val sql_string = columns + " , " + arr + " from taba "
    sql_string: String = "select School,Type,Group , array(concat('School',School),concat('Type',Type),concat('Group',Group)) as combined_array from taba "

    scala> df.createOrReplaceTempView("taba")

    scala> spark.sql(sql_string).show(false)
    +------+----------+------------------+---------------------------------------------------+
    |School|Type |Group |combined_array |
    +------+----------+------------------+---------------------------------------------------+
    |ABC |Elementary|Music-Arts |[SchoolABC, TypeElementary, GroupMusic-Arts] |
    |ABC |Elementary|Football |[SchoolABC, TypeElementary, GroupFootball] |
    |DEF |Secondary |Basketball-Cricket|[SchoolDEF, TypeSecondary, GroupBasketball-Cricket]|
    |DEF |Secondary |Cricket |[SchoolDEF, TypeSecondary, GroupCricket] |
    +------+----------+------------------+---------------------------------------------------+


    scala>


    Update2:



    scala> val df = Seq((1,"ABC","Elementary","Music-Arts"),(2,"ABC","Elementary","Football"),(3,"DEF","Secondary","Basketball-Cricket"),(4,"DEF","Secondary","Cricket")).toDF("StudentID","School","Type","Group")
    df: org.apache.spark.sql.DataFrame = [StudentID: int, School: string ... 2 more fields]

    scala> df.createOrReplaceTempView("student")

    scala> val df2 = spark.sql(""" select studentid, collect_list(concat('Group:', t.sp1)) as sp2 from (select StudentID,School,Type,explode((split(group,'-'))) as sp1 from student where size(split(group,'-')) > 1 ) t group by studentid """)
    df2: org.apache.spark.sql.DataFrame = [studentid: int, sp2: array<string>]

    scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("studentid"),"LeftOuter")
    df3: org.apache.spark.sql.DataFrame = [StudentID: int, School: string ... 3 more fields]

    scala> df3.createOrReplaceTempView("student2")

    scala> spark.sql(""" select studentid, school,group, type, array(concat('School:',school),concat('type:',type),concat_ws(',',temp_arr)) from (select studentid,school,group,type, case when sp2 is null then array(concat("Group:",group)) else sp2 end as temp_arr from student2) t """).show(false)
    +---------+------+------------------+----------+---------------------------------------------------------------------------+
    |studentid|school|group |type |array(concat(School:, school), concat(type:, type), concat_ws(,, temp_arr))|
    +---------+------+------------------+----------+---------------------------------------------------------------------------+
    |1 |ABC |Music-Arts |Elementary|[School:ABC, type:Elementary, Group:Music,Group:Arts] |
    |2 |ABC |Football |Elementary|[School:ABC, type:Elementary, Group:Football] |
    |3 |DEF |Basketball-Cricket|Secondary |[School:DEF, type:Secondary, Group:Basketball,Group:Cricket] |
    |4 |DEF |Cricket |Secondary |[School:DEF, type:Secondary, Group:Cricket] |
    +---------+------+------------------+----------+---------------------------------------------------------------------------+


    scala>
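Note that `concat_ws` in Update2 re-joins the collected group entries into a single array element, so rows 1 and 3 end up with a three-element array whose last element contains a comma, rather than the four-element array in the question. The split-and-prefix step itself can be modelled without Spark; the `combinedArray` helper below is an illustrative assumption, not code from the answer.

```scala
// Pure-Scala model of the target row result: split Group on '-' and emit one
// "Group: x" entry per part, mirroring the explode + collect_list pipeline
// but keeping each part as a separate element.
def combinedArray(school: String, tpe: String, group: String): Seq[String] =
  Seq(s"School: $school", s"Type: $tpe") ++
    group.split("-").toSeq.map(g => s"Group: $g")
```

With this model, `combinedArray("ABC", "Elementary", "Music-Arts")` yields a true four-element sequence, matching the question's expected output for row 1.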





    • This solution doesn't address the core issue where the 'Group' values need to be split dynamically

      – John Subas
      Nov 13 '18 at 17:00











    • Ok.. let me work on it

      – stack0114106
      Nov 13 '18 at 17:08











    • @John Subas.. could you please check the update.

      – stack0114106
      Nov 13 '18 at 17:46











    • Thanks, but the update still doesn't address the core issue here: If you look at my output, the first and 3rd row has an array size of 4. We need to split the 'Group' column based on '-' and add multiple elements to array, one for each split.

      – John Subas
      Nov 13 '18 at 18:22












    • @John Subas.. could you pls check Update2

      – stack0114106
      Nov 13 '18 at 19:51















    edited Nov 13 '18 at 19:50 · answered Nov 13 '18 at 16:46 – stack0114106
    You need to first add an empty column then map it like so (in Java):



    StructType newSchema = df1.schema().add("Combined Array", DataTypes.StringType);

    df1 = df1.withColumn("Combined Array", lit(null))
        .map((MapFunction<Row, Row>) row ->
            RowFactory.create(...values...), // add the existing values and the new value here
            RowEncoder.apply(newSchema));    // map(...) takes an Encoder, not a bare schema


    It should be fairly similar in Scala.
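A Scala sketch of what that per-row map might compute, modelled on plain collections rather than a Spark Dataset; the `Student` case class and the `addCombined` helper are assumptions for illustration, not the answer's exact code.

```scala
case class Student(id: Int, school: String, tpe: String, group: String)

// Map each row to the original record plus the combined labelled entries,
// splitting only the group field on '-'.
def addCombined(rows: Seq[Student]): Seq[(Student, Seq[String])] =
  rows.map { s =>
    s -> (Seq(s"School: ${s.school}", s"Type: ${s.tpe}") ++
          s.group.split("-").toSeq.map(g => s"Group: $g"))
  }
```

In the real Dataset version, this per-row logic would live inside the `MapFunction`, with the schema extended to hold the new array column.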






        edited Nov 13 '18 at 16:25 · answered Nov 13 '18 at 16:20 – steven35

























