Create an array column from other columns after processing the column values

Let's say I have a spark dataframe that includes the categorical columns (School, Type, Group)

------------------------------------------------------------
StudentID | School | Type | Group 
------------------------------------------------------------
1 | ABC | Elementary | Music-Arts 
2 | ABC | Elementary | Football 
3 | DEF | Secondary | Basketball-Cricket 
4 | DEF | Secondary | Cricket 
------------------------------------------------------------

I need to add one more column to the dataframe as below:

--------------------------------------------------------------------------------------
StudentID | School | Type | Group | Combined Array
---------------------------------------------------------------------------------------
1 | ABC | Elementary | Music-Arts | ["School: ABC", "Type: Elementary", "Group: Music", "Group: Arts"]
2 | ABC | Elementary | Football | ["School: ABC", "Type: Elementary", "Group: Football"]
3 | DEF | Secondary | Basketball-Cricket | ["School: DEF", "Type: Secondary", "Group: Basketball", "Group: Cricket"]
4 | DEF | Secondary | Cricket | ["School: DEF", "Type: Secondary", "Group: Cricket"]
----------------------------------------------------------------------------------------

The extra column is combination of all categorical columns but includes a different processing on 'Group' column. The values of 'Group' column need to be split on '-'.

All the categorical columns including 'Group' are contained in a list. The 'Group' column is also input as a String as the column to be split on. The data-frame has other columns which are not used.

I am looking for the best performance solution.

If it's a simple array, it can be done with a single 'withColumn' transformation.

val columns = List("School", "Type", "Group")
var df2 = df1.withColumn("CombinedArray", array(columns.map(df1(_)):_*))

However, here because of the additional processing in 'Group' column, the solution doesn't seem straightforward.

edited Nov 13 '18 at 17:16

asked Nov 13 '18 at 16:08

John Subas

165

Just to be sure: why do you want redundant information in the combined column? I get why you want an array containing the "-"-split of the group, but i am less sure about the other values. I suggest df.withColumn("combined", split($"Group", "-"))

– Elmar Macek
Nov 13 '18 at 16:33

The column will be fed to countVectorizer, so each entry of the array (category: value) will be identified differently. For instance the same value may be present across different categories.

– John Subas
Nov 13 '18 at 16:39

Ah I see, well stack0114106 got the correct answer if you add the splitting of the group to it. ;)

– Elmar Macek
Nov 13 '18 at 16:49

In case you do not wanna wrangle so much with String-concats in order to put identifying prefixes to the different type of informations (which might be a little annoying for the Group category), you could also just do: df.withColumn("combined", split($"Group", "-")).withColumn("SchoolArray", array($"School")).withColumn("TypeArray", array($"Type")) and just apply 3 CountVectorizers for each of the "XYZArrays" and a final VectorAssembler to put all together. This version has the benefit, that you can define different minimum frequencies for each of the CountVectorizers.

– Elmar Macek
Nov 13 '18 at 17:04

add a comment |

Let's say I have a spark dataframe that includes the categorical columns (School, Type, Group)

------------------------------------------------------------
StudentID | School | Type | Group 
------------------------------------------------------------
1 | ABC | Elementary | Music-Arts 
2 | ABC | Elementary | Football 
3 | DEF | Secondary | Basketball-Cricket 
4 | DEF | Secondary | Cricket 
------------------------------------------------------------

I need to add one more column to the dataframe as below:

--------------------------------------------------------------------------------------
StudentID | School | Type | Group | Combined Array
---------------------------------------------------------------------------------------
1 | ABC | Elementary | Music-Arts | ["School: ABC", "Type: Elementary", "Group: Music", "Group: Arts"]
2 | ABC | Elementary | Football | ["School: ABC", "Type: Elementary", "Group: Football"]
3 | DEF | Secondary | Basketball-Cricket | ["School: DEF", "Type: Secondary", "Group: Basketball", "Group: Cricket"]
4 | DEF | Secondary | Cricket | ["School: DEF", "Type: Secondary", "Group: Cricket"]
----------------------------------------------------------------------------------------

The extra column is combination of all categorical columns but includes a different processing on 'Group' column. The values of 'Group' column need to be split on '-'.

All the categorical columns including 'Group' are contained in a list. The 'Group' column is also input as a String as the column to be split on. The data-frame has other columns which are not used.

I am looking for the best performance solution.

If it's a simple array, it can be done with a single 'withColumn' transformation.

val columns = List("School", "Type", "Group")
var df2 = df1.withColumn("CombinedArray", array(columns.map(df1(_)):_*))

However, here because of the additional processing in 'Group' column, the solution doesn't seem straightforward.

edited Nov 13 '18 at 17:16

asked Nov 13 '18 at 16:08

John Subas

165

Just to be sure: why do you want redundant information in the combined column? I get why you want an array containing the "-"-split of the group, but i am less sure about the other values. I suggest df.withColumn("combined", split($"Group", "-"))

– Elmar Macek
Nov 13 '18 at 16:33

The column will be fed to countVectorizer, so each entry of the array (category: value) will be identified differently. For instance the same value may be present across different categories.

– John Subas
Nov 13 '18 at 16:39

Ah I see, well stack0114106 got the correct answer if you add the splitting of the group to it. ;)

– Elmar Macek
Nov 13 '18 at 16:49

In case you do not wanna wrangle so much with String-concats in order to put identifying prefixes to the different type of informations (which might be a little annoying for the Group category), you could also just do: df.withColumn("combined", split($"Group", "-")).withColumn("SchoolArray", array($"School")).withColumn("TypeArray", array($"Type")) and just apply 3 CountVectorizers for each of the "XYZArrays" and a final VectorAssembler to put all together. This version has the benefit, that you can define different minimum frequencies for each of the CountVectorizers.

– Elmar Macek
Nov 13 '18 at 17:04

add a comment |

Let's say I have a spark dataframe that includes the categorical columns (School, Type, Group)

------------------------------------------------------------
StudentID | School | Type | Group 
------------------------------------------------------------
1 | ABC | Elementary | Music-Arts 
2 | ABC | Elementary | Football 
3 | DEF | Secondary | Basketball-Cricket 
4 | DEF | Secondary | Cricket 
------------------------------------------------------------

I need to add one more column to the dataframe as below:

--------------------------------------------------------------------------------------
StudentID | School | Type | Group | Combined Array
---------------------------------------------------------------------------------------
1 | ABC | Elementary | Music-Arts | ["School: ABC", "Type: Elementary", "Group: Music", "Group: Arts"]
2 | ABC | Elementary | Football | ["School: ABC", "Type: Elementary", "Group: Football"]
3 | DEF | Secondary | Basketball-Cricket | ["School: DEF", "Type: Secondary", "Group: Basketball", "Group: Cricket"]
4 | DEF | Secondary | Cricket | ["School: DEF", "Type: Secondary", "Group: Cricket"]
----------------------------------------------------------------------------------------

The extra column is combination of all categorical columns but includes a different processing on 'Group' column. The values of 'Group' column need to be split on '-'.

All the categorical columns including 'Group' are contained in a list. The 'Group' column is also input as a String as the column to be split on. The data-frame has other columns which are not used.

I am looking for the best performance solution.

If it's a simple array, it can be done with a single 'withColumn' transformation.

val columns = List("School", "Type", "Group")
var df2 = df1.withColumn("CombinedArray", array(columns.map(df1(_)):_*))

However, here because of the additional processing in 'Group' column, the solution doesn't seem straightforward.

edited Nov 13 '18 at 17:16

asked Nov 13 '18 at 16:08

John Subas

165

Let's say I have a spark dataframe that includes the categorical columns (School, Type, Group)

------------------------------------------------------------
StudentID | School | Type | Group 
------------------------------------------------------------
1 | ABC | Elementary | Music-Arts 
2 | ABC | Elementary | Football 
3 | DEF | Secondary | Basketball-Cricket 
4 | DEF | Secondary | Cricket 
------------------------------------------------------------

I need to add one more column to the dataframe as below:

--------------------------------------------------------------------------------------
StudentID | School | Type | Group | Combined Array
---------------------------------------------------------------------------------------
1 | ABC | Elementary | Music-Arts | ["School: ABC", "Type: Elementary", "Group: Music", "Group: Arts"]
2 | ABC | Elementary | Football | ["School: ABC", "Type: Elementary", "Group: Football"]
3 | DEF | Secondary | Basketball-Cricket | ["School: DEF", "Type: Secondary", "Group: Basketball", "Group: Cricket"]
4 | DEF | Secondary | Cricket | ["School: DEF", "Type: Secondary", "Group: Cricket"]
----------------------------------------------------------------------------------------

The extra column is combination of all categorical columns but includes a different processing on 'Group' column. The values of 'Group' column need to be split on '-'.

All the categorical columns including 'Group' are contained in a list. The 'Group' column is also input as a String as the column to be split on. The data-frame has other columns which are not used.

I am looking for the best performance solution.

If it's a simple array, it can be done with a single 'withColumn' transformation.

val columns = List("School", "Type", "Group")
var df2 = df1.withColumn("CombinedArray", array(columns.map(df1(_)):_*))

However, here because of the additional processing in 'Group' column, the solution doesn't seem straightforward.

scala apache-spark apache-spark-sql rdd

edited Nov 13 '18 at 17:16

asked Nov 13 '18 at 16:08

John Subas

165

edited Nov 13 '18 at 17:16

asked Nov 13 '18 at 16:08

John Subas

165

edited Nov 13 '18 at 17:16

asked Nov 13 '18 at 16:08

John Subas

165

asked Nov 13 '18 at 16:08

John Subas

165

asked Nov 13 '18 at 16:08

John Subas

165

Just to be sure: why do you want redundant information in the combined column? I get why you want an array containing the "-"-split of the group, but i am less sure about the other values. I suggest df.withColumn("combined", split($"Group", "-"))

– Elmar Macek
Nov 13 '18 at 16:33

The column will be fed to countVectorizer, so each entry of the array (category: value) will be identified differently. For instance the same value may be present across different categories.

– John Subas
Nov 13 '18 at 16:39

Ah I see, well stack0114106 got the correct answer if you add the splitting of the group to it. ;)

– Elmar Macek
Nov 13 '18 at 16:49

In case you do not wanna wrangle so much with String-concats in order to put identifying prefixes to the different type of informations (which might be a little annoying for the Group category), you could also just do: df.withColumn("combined", split($"Group", "-")).withColumn("SchoolArray", array($"School")).withColumn("TypeArray", array($"Type")) and just apply 3 CountVectorizers for each of the "XYZArrays" and a final VectorAssembler to put all together. This version has the benefit, that you can define different minimum frequencies for each of the CountVectorizers.

– Elmar Macek
Nov 13 '18 at 17:04

add a comment |

Just to be sure: why do you want redundant information in the combined column? I get why you want an array containing the "-"-split of the group, but i am less sure about the other values. I suggest df.withColumn("combined", split($"Group", "-"))

– Elmar Macek
Nov 13 '18 at 16:33

The column will be fed to countVectorizer, so each entry of the array (category: value) will be identified differently. For instance the same value may be present across different categories.

– John Subas
Nov 13 '18 at 16:39

Ah I see, well stack0114106 got the correct answer if you add the splitting of the group to it. ;)

– Elmar Macek
Nov 13 '18 at 16:49

In case you do not wanna wrangle so much with String-concats in order to put identifying prefixes to the different type of informations (which might be a little annoying for the Group category), you could also just do: df.withColumn("combined", split($"Group", "-")).withColumn("SchoolArray", array($"School")).withColumn("TypeArray", array($"Type")) and just apply 3 CountVectorizers for each of the "XYZArrays" and a final VectorAssembler to put all together. This version has the benefit, that you can define different minimum frequencies for each of the CountVectorizers.

– Elmar Macek
Nov 13 '18 at 17:04

Just to be sure: why do you want redundant information in the combined column? I get why you want an array containing the "-"-split of the group, but i am less sure about the other values. I suggest df.withColumn("combined", split($"Group", "-"))

– Elmar Macek
Nov 13 '18 at 16:33

The column will be fed to countVectorizer, so each entry of the array (category: value) will be identified differently. For instance the same value may be present across different categories.

– John Subas
Nov 13 '18 at 16:39

Ah I see, well stack0114106 got the correct answer if you add the splitting of the group to it. ;)

– Elmar Macek
Nov 13 '18 at 16:49

In case you do not wanna wrangle so much with String-concats in order to put identifying prefixes to the different type of informations (which might be a little annoying for the Group category), you could also just do:

df.withColumn("combined", split($"Group", "-")).withColumn("SchoolArray", array($"School")).withColumn("TypeArray", array($"Type"))

and just apply 3 CountVectorizers for each of the "XYZArrays" and a final VectorAssembler to put all together. This version has the benefit, that you can define different minimum frequencies for each of the CountVectorizers.

– Elmar Macek
Nov 13 '18 at 17:04

df.withColumn("combined", split($"Group", "-")).withColumn("SchoolArray", array($"School")).withColumn("TypeArray", array($"Type"))

add a comment |

3 Answers
3

active

oldest

votes

Using regex replacement to start of each field and to "-" in between:

val df1 = spark.read.option("header","true").csv(filePath)
val columns = List("School", "Type", "Group")
var df2 = df1.withColumn("CombinedArray", array(columns.map
 colName => regexp_replace(regexp_replace(df1(colName),"(^)",s"$colName: "),"(-)",s", $colName: ")
:_*))

answered Nov 13 '18 at 18:50

Shasankar

272310

This would work probably. I will need to modify a little to include splits only for selected columns among the category columns. Will try to work it out and post the answer here.

– John Subas
Nov 13 '18 at 19:09

Reason for unaccepting answer? Works as expected output you mentioned.

– Shasankar
Nov 13 '18 at 19:15

I will accept it once I am able to work on your code to get the exact solution. The split need to be done for one column only: 'Group', not for all the columns

– John Subas
Nov 13 '18 at 19:20

The below code would be the accurate answer: var df2 = df.withColumn("CombinedArray", array(columns.map( colName => colName match case "Group" => regexp_replace(regexp_replace(df(colName),"(^)",s"$colName: "),"(-)",s", $colName: ") case _ => regexp_replace(df(colName),"(^)",s"$colName: ") ):_*))

– John Subas
Nov 13 '18 at 19:25

add a comment |

Using the spark.sql(), Check this out:

Seq(("ABC","Elementary","Music-Arts"),("ABC","Elementary","Football"),("DEF","Secondary","Basketball-Cricket"),("DEF","Secondary","Cricket"))
 .toDF("School","Type","Group").createOrReplaceTempView("taba")
spark.sql( """ select school, type, group, array(concat('School:',school),concat('type:',type),concat('group:',group)) as combined_array from taba """).show(false)

Output:

+------+----------+------------------+------------------------------------------------------+
|school|type |group |combined_array |
+------+----------+------------------+------------------------------------------------------+
|ABC |Elementary|Music-Arts |[School:ABC, type:Elementary, group:Music-Arts] |
|ABC |Elementary|Football |[School:ABC, type:Elementary, group:Football] |
|DEF |Secondary |Basketball-Cricket|[School:DEF, type:Secondary, group:Basketball-Cricket]|
|DEF |Secondary |Cricket |[School:DEF, type:Secondary, group:Cricket] |
+------+----------+------------------+------------------------------------------------------+

If you need it as a dataframe, then

val df = spark.sql( """ select school, type, group, array(concat('School:',school),concat('type:',type),concat('group:',group)) as combined_array from taba """)
df.printSchema()

root
 |-- school: string (nullable = true)
 |-- type: string (nullable = true)
 |-- group: string (nullable = true)
 |-- combined_array: array (nullable = false)
 | |-- element: string (containsNull = true)

Update:

Dynamically constructing the sql columns.

scala> val df = Seq(("ABC","Elementary","Music-Arts"),("ABC","Elementary","Football"),("DEF","Secondary","Basketball-Cricket"),("DEF","Secondary","Cricket")).toDF("School","Type","Group")
df: org.apache.spark.sql.DataFrame = [School: string, Type: string ... 1 more field]

scala> val columns = df.columns.mkString("select ", ",", "")
columns: String = select School,Type,Group

scala> val arr = df.columns.map( x=> s"concat('"+x+"',"+x+")" ).mkString("array(",",",") as combined_array ")
arr: String = "array(concat('School',School),concat('Type',Type),concat('Group',Group)) as combined_array "

scala> val sql_string = columns + " , " + arr + " from taba "
sql_string: String = "select School,Type,Group , array(concat('School',School),concat('Type',Type),concat('Group',Group)) as combined_array from taba "

scala> df.createOrReplaceTempView("taba")

scala> spark.sql(sql_string).show(false)
+------+----------+------------------+---------------------------------------------------+
|School|Type |Group |combined_array |
+------+----------+------------------+---------------------------------------------------+
|ABC |Elementary|Music-Arts |[SchoolABC, TypeElementary, GroupMusic-Arts] |
|ABC |Elementary|Football |[SchoolABC, TypeElementary, GroupFootball] |
|DEF |Secondary |Basketball-Cricket|[SchoolDEF, TypeSecondary, GroupBasketball-Cricket]|
|DEF |Secondary |Cricket |[SchoolDEF, TypeSecondary, GroupCricket] |
+------+----------+------------------+---------------------------------------------------+


scala>

Update2:

scala> val df = Seq((1,"ABC","Elementary","Music-Arts"),(2,"ABC","Elementary","Football"),(3,"DEF","Secondary","Basketball-Cricket"),(4,"DEF","Secondary","Cricket")).toDF("StudentID","School","Type","Group")
df: org.apache.spark.sql.DataFrame = [StudentID: int, School: string ... 2 more fields]

scala> df.createOrReplaceTempView("student")

scala> val df2 = spark.sql(""" select studentid, collect_list(concat('Group:', t.sp1)) as sp2 from (select StudentID,School,Type,explode((split(group,'-'))) as sp1 from student where size(split(group,'-')) > 1 ) t group by studentid """)
df2: org.apache.spark.sql.DataFrame = [studentid: int, sp2: array<string>]

scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("studentid"),"LeftOuter")
df3: org.apache.spark.sql.DataFrame = [StudentID: int, School: string ... 3 more fields]

scala> df3.createOrReplaceTempView("student2")

scala> spark.sql(""" select studentid, school,group, type, array(concat('School:',school),concat('type:',type),concat_ws(',',temp_arr)) from (select studentid,school,group,type, case when sp2 is null then array(concat("Group:",group)) else sp2 end as temp_arr from student2) t """).show(false)
+---------+------+------------------+----------+---------------------------------------------------------------------------+
|studentid|school|group |type |array(concat(School:, school), concat(type:, type), concat_ws(,, temp_arr))|
+---------+------+------------------+----------+---------------------------------------------------------------------------+
|1 |ABC |Music-Arts |Elementary|[School:ABC, type:Elementary, Group:Music,Group:Arts] |
|2 |ABC |Football |Elementary|[School:ABC, type:Elementary, Group:Football] |
|3 |DEF |Basketball-Cricket|Secondary |[School:DEF, type:Secondary, Group:Basketball,Group:Cricket] |
|4 |DEF |Cricket |Secondary |[School:DEF, type:Secondary, Group:Cricket] |
+---------+------+------------------+----------+---------------------------------------------------------------------------+


scala>

edited Nov 13 '18 at 19:50

answered Nov 13 '18 at 16:46

stack0114106

2,8571417

This solution doesn't address the core issue where the 'Group' values need to be split dynamically

– John Subas
Nov 13 '18 at 17:00

Ok.. let me work on it

– stack0114106
Nov 13 '18 at 17:08

@John Subas.. could you please check the update.

– stack0114106
Nov 13 '18 at 17:46

Thanks, but the update still doesn't address the core issue here: If you look at my output, the first and 3rd row has an array size of 4. We need to split the 'Group' column based on '-' and add multiple elements to array, one for each split.

– John Subas
Nov 13 '18 at 18:22

@John Subas.. could you pls check Update2

– stack0114106
Nov 13 '18 at 19:51

add a comment |

You need to first add an empty column then map it like so (in Java):

StructType newSchema = df1.schema().add("Combined Array", DataTypes.StringType);

df1 = df1.withColumn("Combined Array", lit(null))
 .map((MapFunction<Row, Row>) row ->
 RowFactory.create(...values...) // add existing values and new value here
 , newSchema);

It should be fairly similar in Scala.

edited Nov 13 '18 at 16:25

answered Nov 13 '18 at 16:20

steven35

1,4971832

add a comment |

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53285040%2fcreate-an-array-column-from-other-columns-after-processing-the-column-values%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

Using regex replacement to start of each field and to "-" in between:

val df1 = spark.read.option("header","true").csv(filePath)
val columns = List("School", "Type", "Group")
var df2 = df1.withColumn("CombinedArray", array(columns.map
 colName => regexp_replace(regexp_replace(df1(colName),"(^)",s"$colName: "),"(-)",s", $colName: ")
:_*))

answered Nov 13 '18 at 18:50

Shasankar

272310

This would work probably. I will need to modify a little to include splits only for selected columns among the category columns. Will try to work it out and post the answer here.

– John Subas
Nov 13 '18 at 19:09

Reason for unaccepting answer? Works as expected output you mentioned.

– Shasankar
Nov 13 '18 at 19:15

I will accept it once I am able to work on your code to get the exact solution. The split need to be done for one column only: 'Group', not for all the columns

– John Subas
Nov 13 '18 at 19:20

The below code would be the accurate answer: var df2 = df.withColumn("CombinedArray", array(columns.map( colName => colName match case "Group" => regexp_replace(regexp_replace(df(colName),"(^)",s"$colName: "),"(-)",s", $colName: ") case _ => regexp_replace(df(colName),"(^)",s"$colName: ") ):_*))

– John Subas
Nov 13 '18 at 19:25

add a comment |

Using regex replacement to start of each field and to "-" in between:

val df1 = spark.read.option("header","true").csv(filePath)
val columns = List("School", "Type", "Group")
var df2 = df1.withColumn("CombinedArray", array(columns.map
 colName => regexp_replace(regexp_replace(df1(colName),"(^)",s"$colName: "),"(-)",s", $colName: ")
:_*))

answered Nov 13 '18 at 18:50

Shasankar

272310

This would work probably. I will need to modify a little to include splits only for selected columns among the category columns. Will try to work it out and post the answer here.

– John Subas
Nov 13 '18 at 19:09

Reason for unaccepting answer? Works as expected output you mentioned.

– Shasankar
Nov 13 '18 at 19:15

I will accept it once I am able to work on your code to get the exact solution. The split need to be done for one column only: 'Group', not for all the columns

– John Subas
Nov 13 '18 at 19:20

The below code would be the accurate answer: var df2 = df.withColumn("CombinedArray", array(columns.map( colName => colName match case "Group" => regexp_replace(regexp_replace(df(colName),"(^)",s"$colName: "),"(-)",s", $colName: ") case _ => regexp_replace(df(colName),"(^)",s"$colName: ") ):_*))

– John Subas
Nov 13 '18 at 19:25

add a comment |

Using regex replacement to start of each field and to "-" in between:

val df1 = spark.read.option("header","true").csv(filePath)
val columns = List("School", "Type", "Group")
var df2 = df1.withColumn("CombinedArray", array(columns.map
 colName => regexp_replace(regexp_replace(df1(colName),"(^)",s"$colName: "),"(-)",s", $colName: ")
:_*))

answered Nov 13 '18 at 18:50

Shasankar

272310

Using regex replacement to start of each field and to "-" in between:

val df1 = spark.read.option("header","true").csv(filePath)
val columns = List("School", "Type", "Group")
var df2 = df1.withColumn("CombinedArray", array(columns.map
 colName => regexp_replace(regexp_replace(df1(colName),"(^)",s"$colName: "),"(-)",s", $colName: ")
:_*))

answered Nov 13 '18 at 18:50

Shasankar

272310

answered Nov 13 '18 at 18:50

Shasankar

272310

answered Nov 13 '18 at 18:50

Shasankar

272310

answered Nov 13 '18 at 18:50

Shasankar

272310

This would work probably. I will need to modify a little to include splits only for selected columns among the category columns. Will try to work it out and post the answer here.

– John Subas
Nov 13 '18 at 19:09

Reason for unaccepting answer? Works as expected output you mentioned.

– Shasankar
Nov 13 '18 at 19:15

I will accept it once I am able to work on your code to get the exact solution. The split need to be done for one column only: 'Group', not for all the columns

– John Subas
Nov 13 '18 at 19:20

The below code would be the accurate answer: var df2 = df.withColumn("CombinedArray", array(columns.map( colName => colName match case "Group" => regexp_replace(regexp_replace(df(colName),"(^)",s"$colName: "),"(-)",s", $colName: ") case _ => regexp_replace(df(colName),"(^)",s"$colName: ") ):_*))

– John Subas
Nov 13 '18 at 19:25

add a comment |

This would work probably. I will need to modify a little to include splits only for selected columns among the category columns. Will try to work it out and post the answer here.

– John Subas
Nov 13 '18 at 19:09

Reason for unaccepting answer? Works as expected output you mentioned.

– Shasankar
Nov 13 '18 at 19:15

I will accept it once I am able to work on your code to get the exact solution. The split need to be done for one column only: 'Group', not for all the columns

– John Subas
Nov 13 '18 at 19:20

The below code would be the accurate answer: var df2 = df.withColumn("CombinedArray", array(columns.map( colName => colName match case "Group" => regexp_replace(regexp_replace(df(colName),"(^)",s"$colName: "),"(-)",s", $colName: ") case _ => regexp_replace(df(colName),"(^)",s"$colName: ") ):_*))

– John Subas
Nov 13 '18 at 19:25

This would work probably. I will need to modify a little to include splits only for selected columns among the category columns. Will try to work it out and post the answer here.

– John Subas
Nov 13 '18 at 19:09

Reason for unaccepting answer? Works as expected output you mentioned.

– Shasankar
Nov 13 '18 at 19:15

I will accept it once I am able to work on your code to get the exact solution. The split need to be done for one column only: 'Group', not for all the columns

– John Subas
Nov 13 '18 at 19:20

The below code would be the accurate answer: var df2 = df.withColumn("CombinedArray", array(columns.map( colName => colName match case "Group" => regexp_replace(regexp_replace(df(colName),"(^)",s"$colName: "),"(-)",s", $colName: ") case _ => regexp_replace(df(colName),"(^)",s"$colName: ") ):_*))

– John Subas
Nov 13 '18 at 19:25

add a comment |

Using the spark.sql(), Check this out:

Seq(("ABC","Elementary","Music-Arts"),("ABC","Elementary","Football"),("DEF","Secondary","Basketball-Cricket"),("DEF","Secondary","Cricket"))
 .toDF("School","Type","Group").createOrReplaceTempView("taba")
spark.sql( """ select school, type, group, array(concat('School:',school),concat('type:',type),concat('group:',group)) as combined_array from taba """).show(false)

Output:

+------+----------+------------------+------------------------------------------------------+
|school|type |group |combined_array |
+------+----------+------------------+------------------------------------------------------+
|ABC |Elementary|Music-Arts |[School:ABC, type:Elementary, group:Music-Arts] |
|ABC |Elementary|Football |[School:ABC, type:Elementary, group:Football] |
|DEF |Secondary |Basketball-Cricket|[School:DEF, type:Secondary, group:Basketball-Cricket]|
|DEF |Secondary |Cricket |[School:DEF, type:Secondary, group:Cricket] |
+------+----------+------------------+------------------------------------------------------+

If you need it as a dataframe, then

val df = spark.sql( """ select school, type, group, array(concat('School:',school),concat('type:',type),concat('group:',group)) as combined_array from taba """)
df.printSchema()

root
 |-- school: string (nullable = true)
 |-- type: string (nullable = true)
 |-- group: string (nullable = true)
 |-- combined_array: array (nullable = false)
 | |-- element: string (containsNull = true)

Update:

Dynamically constructing the sql columns.

scala> val df = Seq(("ABC","Elementary","Music-Arts"),("ABC","Elementary","Football"),("DEF","Secondary","Basketball-Cricket"),("DEF","Secondary","Cricket")).toDF("School","Type","Group")
df: org.apache.spark.sql.DataFrame = [School: string, Type: string ... 1 more field]

scala> val columns = df.columns.mkString("select ", ",", "")
columns: String = select School,Type,Group

scala> val arr = df.columns.map( x=> s"concat('"+x+"',"+x+")" ).mkString("array(",",",") as combined_array ")
arr: String = "array(concat('School',School),concat('Type',Type),concat('Group',Group)) as combined_array "

scala> val sql_string = columns + " , " + arr + " from taba "
sql_string: String = "select School,Type,Group , array(concat('School',School),concat('Type',Type),concat('Group',Group)) as combined_array from taba "

scala> df.createOrReplaceTempView("taba")

scala> spark.sql(sql_string).show(false)
+------+----------+------------------+---------------------------------------------------+
|School|Type |Group |combined_array |
+------+----------+------------------+---------------------------------------------------+
|ABC |Elementary|Music-Arts |[SchoolABC, TypeElementary, GroupMusic-Arts] |
|ABC |Elementary|Football |[SchoolABC, TypeElementary, GroupFootball] |
|DEF |Secondary |Basketball-Cricket|[SchoolDEF, TypeSecondary, GroupBasketball-Cricket]|
|DEF |Secondary |Cricket |[SchoolDEF, TypeSecondary, GroupCricket] |
+------+----------+------------------+---------------------------------------------------+


scala>

Update2:

scala> val df = Seq((1,"ABC","Elementary","Music-Arts"),(2,"ABC","Elementary","Football"),(3,"DEF","Secondary","Basketball-Cricket"),(4,"DEF","Secondary","Cricket")).toDF("StudentID","School","Type","Group")
df: org.apache.spark.sql.DataFrame = [StudentID: int, School: string ... 2 more fields]

scala> df.createOrReplaceTempView("student")

scala> val df2 = spark.sql(""" select studentid, collect_list(concat('Group:', t.sp1)) as sp2 from (select StudentID,School,Type,explode((split(group,'-'))) as sp1 from student where size(split(group,'-')) > 1 ) t group by studentid """)
df2: org.apache.spark.sql.DataFrame = [studentid: int, sp2: array<string>]

scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("studentid"),"LeftOuter")
df3: org.apache.spark.sql.DataFrame = [StudentID: int, School: string ... 3 more fields]

scala> df3.createOrReplaceTempView("student2")

scala> spark.sql(""" select studentid, school,group, type, array(concat('School:',school),concat('type:',type),concat_ws(',',temp_arr)) from (select studentid,school,group,type, case when sp2 is null then array(concat("Group:",group)) else sp2 end as temp_arr from student2) t """).show(false)
+---------+------+------------------+----------+---------------------------------------------------------------------------+
|studentid|school|group |type |array(concat(School:, school), concat(type:, type), concat_ws(,, temp_arr))|
+---------+------+------------------+----------+---------------------------------------------------------------------------+
|1 |ABC |Music-Arts |Elementary|[School:ABC, type:Elementary, Group:Music,Group:Arts] |
|2 |ABC |Football |Elementary|[School:ABC, type:Elementary, Group:Football] |
|3 |DEF |Basketball-Cricket|Secondary |[School:DEF, type:Secondary, Group:Basketball,Group:Cricket] |
|4 |DEF |Cricket |Secondary |[School:DEF, type:Secondary, Group:Cricket] |
+---------+------+------------------+----------+---------------------------------------------------------------------------+


scala>

edited Nov 13 '18 at 19:50

answered Nov 13 '18 at 16:46

stack0114106

2,8571417

This solution doesn't address the core issue where the 'Group' values need to be split dynamically

– John Subas
Nov 13 '18 at 17:00

Ok.. let me work on it

– stack0114106
Nov 13 '18 at 17:08

@John Subas.. could you please check the update.

– stack0114106
Nov 13 '18 at 17:46

Thanks, but the update still doesn't address the core issue here: If you look at my output, the first and 3rd row has an array size of 4. We need to split the 'Group' column based on '-' and add multiple elements to array, one for each split.

– John Subas
Nov 13 '18 at 18:22

@John Subas.. could you pls check Update2

– stack0114106
Nov 13 '18 at 19:51

add a comment |

Using the spark.sql(), Check this out:

Seq(("ABC","Elementary","Music-Arts"),("ABC","Elementary","Football"),("DEF","Secondary","Basketball-Cricket"),("DEF","Secondary","Cricket"))
 .toDF("School","Type","Group").createOrReplaceTempView("taba")
spark.sql( """ select school, type, group, array(concat('School:',school),concat('type:',type),concat('group:',group)) as combined_array from taba """).show(false)

Output:

+------+----------+------------------+------------------------------------------------------+
|school|type |group |combined_array |
+------+----------+------------------+------------------------------------------------------+
|ABC |Elementary|Music-Arts |[School:ABC, type:Elementary, group:Music-Arts] |
|ABC |Elementary|Football |[School:ABC, type:Elementary, group:Football] |
|DEF |Secondary |Basketball-Cricket|[School:DEF, type:Secondary, group:Basketball-Cricket]|
|DEF |Secondary |Cricket |[School:DEF, type:Secondary, group:Cricket] |
+------+----------+------------------+------------------------------------------------------+

If you need it as a dataframe, then

val df = spark.sql( """ select school, type, group, array(concat('School:',school),concat('type:',type),concat('group:',group)) as combined_array from taba """)
df.printSchema()

root
 |-- school: string (nullable = true)
 |-- type: string (nullable = true)
 |-- group: string (nullable = true)
 |-- combined_array: array (nullable = false)
 | |-- element: string (containsNull = true)

Update:

Dynamically constructing the sql columns.

scala> val df = Seq(("ABC","Elementary","Music-Arts"),("ABC","Elementary","Football"),("DEF","Secondary","Basketball-Cricket"),("DEF","Secondary","Cricket")).toDF("School","Type","Group")
df: org.apache.spark.sql.DataFrame = [School: string, Type: string ... 1 more field]

scala> val columns = df.columns.mkString("select ", ",", "")
columns: String = select School,Type,Group

scala> val arr = df.columns.map( x=> s"concat('"+x+"',"+x+")" ).mkString("array(",",",") as combined_array ")
arr: String = "array(concat('School',School),concat('Type',Type),concat('Group',Group)) as combined_array "

scala> val sql_string = columns + " , " + arr + " from taba "
sql_string: String = "select School,Type,Group , array(concat('School',School),concat('Type',Type),concat('Group',Group)) as combined_array from taba "

scala> df.createOrReplaceTempView("taba")

scala> spark.sql(sql_string).show(false)
+------+----------+------------------+---------------------------------------------------+
|School|Type |Group |combined_array |
+------+----------+------------------+---------------------------------------------------+
|ABC |Elementary|Music-Arts |[SchoolABC, TypeElementary, GroupMusic-Arts] |
|ABC |Elementary|Football |[SchoolABC, TypeElementary, GroupFootball] |
|DEF |Secondary |Basketball-Cricket|[SchoolDEF, TypeSecondary, GroupBasketball-Cricket]|
|DEF |Secondary |Cricket |[SchoolDEF, TypeSecondary, GroupCricket] |
+------+----------+------------------+---------------------------------------------------+


scala>

Update2:

scala> val df = Seq((1,"ABC","Elementary","Music-Arts"),(2,"ABC","Elementary","Football"),(3,"DEF","Secondary","Basketball-Cricket"),(4,"DEF","Secondary","Cricket")).toDF("StudentID","School","Type","Group")
df: org.apache.spark.sql.DataFrame = [StudentID: int, School: string ... 2 more fields]

scala> df.createOrReplaceTempView("student")

scala> val df2 = spark.sql(""" select studentid, collect_list(concat('Group:', t.sp1)) as sp2 from (select StudentID,School,Type,explode((split(group,'-'))) as sp1 from student where size(split(group,'-')) > 1 ) t group by studentid """)
df2: org.apache.spark.sql.DataFrame = [studentid: int, sp2: array<string>]

scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("studentid"),"LeftOuter")
df3: org.apache.spark.sql.DataFrame = [StudentID: int, School: string ... 3 more fields]

scala> df3.createOrReplaceTempView("student2")

scala> spark.sql(""" select studentid, school,group, type, array(concat('School:',school),concat('type:',type),concat_ws(',',temp_arr)) from (select studentid,school,group,type, case when sp2 is null then array(concat("Group:",group)) else sp2 end as temp_arr from student2) t """).show(false)
+---------+------+------------------+----------+---------------------------------------------------------------------------+
|studentid|school|group |type |array(concat(School:, school), concat(type:, type), concat_ws(,, temp_arr))|
+---------+------+------------------+----------+---------------------------------------------------------------------------+
|1 |ABC |Music-Arts |Elementary|[School:ABC, type:Elementary, Group:Music,Group:Arts] |
|2 |ABC |Football |Elementary|[School:ABC, type:Elementary, Group:Football] |
|3 |DEF |Basketball-Cricket|Secondary |[School:DEF, type:Secondary, Group:Basketball,Group:Cricket] |
|4 |DEF |Cricket |Secondary |[School:DEF, type:Secondary, Group:Cricket] |
+---------+------+------------------+----------+---------------------------------------------------------------------------+


scala>

edited Nov 13 '18 at 19:50

answered Nov 13 '18 at 16:46

stack0114106

2,8571417

This solution doesn't address the core issue where the 'Group' values need to be split dynamically

– John Subas
Nov 13 '18 at 17:00

Ok.. let me work on it

– stack0114106
Nov 13 '18 at 17:08

@John Subas.. could you please check the update.

– stack0114106
Nov 13 '18 at 17:46

Thanks, but the update still doesn't address the core issue here: If you look at my output, the first and 3rd row has an array size of 4. We need to split the 'Group' column based on '-' and add multiple elements to array, one for each split.

– John Subas
Nov 13 '18 at 18:22

@John Subas.. could you pls check Update2

– stack0114106
Nov 13 '18 at 19:51

add a comment |

Using the spark.sql(), Check this out:

Seq(("ABC","Elementary","Music-Arts"),("ABC","Elementary","Football"),("DEF","Secondary","Basketball-Cricket"),("DEF","Secondary","Cricket"))
 .toDF("School","Type","Group").createOrReplaceTempView("taba")
spark.sql( """ select school, type, group, array(concat('School:',school),concat('type:',type),concat('group:',group)) as combined_array from taba """).show(false)

Output:

+------+----------+------------------+------------------------------------------------------+
|school|type |group |combined_array |
+------+----------+------------------+------------------------------------------------------+
|ABC |Elementary|Music-Arts |[School:ABC, type:Elementary, group:Music-Arts] |
|ABC |Elementary|Football |[School:ABC, type:Elementary, group:Football] |
|DEF |Secondary |Basketball-Cricket|[School:DEF, type:Secondary, group:Basketball-Cricket]|
|DEF |Secondary |Cricket |[School:DEF, type:Secondary, group:Cricket] |
+------+----------+------------------+------------------------------------------------------+

If you need it as a dataframe, then

val df = spark.sql( """ select school, type, group, array(concat('School:',school),concat('type:',type),concat('group:',group)) as combined_array from taba """)
df.printSchema()

root
 |-- school: string (nullable = true)
 |-- type: string (nullable = true)
 |-- group: string (nullable = true)
 |-- combined_array: array (nullable = false)
 | |-- element: string (containsNull = true)

Update:

Dynamically constructing the sql columns.

scala> val df = Seq(("ABC","Elementary","Music-Arts"),("ABC","Elementary","Football"),("DEF","Secondary","Basketball-Cricket"),("DEF","Secondary","Cricket")).toDF("School","Type","Group")
df: org.apache.spark.sql.DataFrame = [School: string, Type: string ... 1 more field]

scala> val columns = df.columns.mkString("select ", ",", "")
columns: String = select School,Type,Group

scala> val arr = df.columns.map( x=> s"concat('"+x+"',"+x+")" ).mkString("array(",",",") as combined_array ")
arr: String = "array(concat('School',School),concat('Type',Type),concat('Group',Group)) as combined_array "

scala> val sql_string = columns + " , " + arr + " from taba "
sql_string: String = "select School,Type,Group , array(concat('School',School),concat('Type',Type),concat('Group',Group)) as combined_array from taba "

scala> df.createOrReplaceTempView("taba")

scala> spark.sql(sql_string).show(false)
+------+----------+------------------+---------------------------------------------------+
|School|Type |Group |combined_array |
+------+----------+------------------+---------------------------------------------------+
|ABC |Elementary|Music-Arts |[SchoolABC, TypeElementary, GroupMusic-Arts] |
|ABC |Elementary|Football |[SchoolABC, TypeElementary, GroupFootball] |
|DEF |Secondary |Basketball-Cricket|[SchoolDEF, TypeSecondary, GroupBasketball-Cricket]|
|DEF |Secondary |Cricket |[SchoolDEF, TypeSecondary, GroupCricket] |
+------+----------+------------------+---------------------------------------------------+


scala>

Update2:

scala> val df = Seq((1,"ABC","Elementary","Music-Arts"),(2,"ABC","Elementary","Football"),(3,"DEF","Secondary","Basketball-Cricket"),(4,"DEF","Secondary","Cricket")).toDF("StudentID","School","Type","Group")
df: org.apache.spark.sql.DataFrame = [StudentID: int, School: string ... 2 more fields]

scala> df.createOrReplaceTempView("student")

scala> val df2 = spark.sql(""" select studentid, collect_list(concat('Group:', t.sp1)) as sp2 from (select StudentID,School,Type,explode((split(group,'-'))) as sp1 from student where size(split(group,'-')) > 1 ) t group by studentid """)
df2: org.apache.spark.sql.DataFrame = [studentid: int, sp2: array<string>]

scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("studentid"),"LeftOuter")
df3: org.apache.spark.sql.DataFrame = [StudentID: int, School: string ... 3 more fields]

scala> df3.createOrReplaceTempView("student2")

scala> spark.sql(""" select studentid, school,group, type, array(concat('School:',school),concat('type:',type),concat_ws(',',temp_arr)) from (select studentid,school,group,type, case when sp2 is null then array(concat("Group:",group)) else sp2 end as temp_arr from student2) t """).show(false)
+---------+------+------------------+----------+---------------------------------------------------------------------------+
|studentid|school|group |type |array(concat(School:, school), concat(type:, type), concat_ws(,, temp_arr))|
+---------+------+------------------+----------+---------------------------------------------------------------------------+
|1 |ABC |Music-Arts |Elementary|[School:ABC, type:Elementary, Group:Music,Group:Arts] |
|2 |ABC |Football |Elementary|[School:ABC, type:Elementary, Group:Football] |
|3 |DEF |Basketball-Cricket|Secondary |[School:DEF, type:Secondary, Group:Basketball,Group:Cricket] |
|4 |DEF |Cricket |Secondary |[School:DEF, type:Secondary, Group:Cricket] |
+---------+------+------------------+----------+---------------------------------------------------------------------------+


scala>

edited Nov 13 '18 at 19:50

answered Nov 13 '18 at 16:46

stack0114106

2,8571417

Using the spark.sql(), Check this out:

Seq(("ABC","Elementary","Music-Arts"),("ABC","Elementary","Football"),("DEF","Secondary","Basketball-Cricket"),("DEF","Secondary","Cricket"))
 .toDF("School","Type","Group").createOrReplaceTempView("taba")
spark.sql( """ select school, type, group, array(concat('School:',school),concat('type:',type),concat('group:',group)) as combined_array from taba """).show(false)

Output:

+------+----------+------------------+------------------------------------------------------+
|school|type |group |combined_array |
+------+----------+------------------+------------------------------------------------------+
|ABC |Elementary|Music-Arts |[School:ABC, type:Elementary, group:Music-Arts] |
|ABC |Elementary|Football |[School:ABC, type:Elementary, group:Football] |
|DEF |Secondary |Basketball-Cricket|[School:DEF, type:Secondary, group:Basketball-Cricket]|
|DEF |Secondary |Cricket |[School:DEF, type:Secondary, group:Cricket] |
+------+----------+------------------+------------------------------------------------------+

If you need it as a dataframe, then

val df = spark.sql( """ select school, type, group, array(concat('School:',school),concat('type:',type),concat('group:',group)) as combined_array from taba """)
df.printSchema()

root
 |-- school: string (nullable = true)
 |-- type: string (nullable = true)
 |-- group: string (nullable = true)
 |-- combined_array: array (nullable = false)
 | |-- element: string (containsNull = true)

Update:

Dynamically constructing the sql columns.

scala> val df = Seq(("ABC","Elementary","Music-Arts"),("ABC","Elementary","Football"),("DEF","Secondary","Basketball-Cricket"),("DEF","Secondary","Cricket")).toDF("School","Type","Group")
df: org.apache.spark.sql.DataFrame = [School: string, Type: string ... 1 more field]

scala> val columns = df.columns.mkString("select ", ",", "")
columns: String = select School,Type,Group

scala> val arr = df.columns.map( x=> s"concat('"+x+"',"+x+")" ).mkString("array(",",",") as combined_array ")
arr: String = "array(concat('School',School),concat('Type',Type),concat('Group',Group)) as combined_array "

scala> val sql_string = columns + " , " + arr + " from taba "
sql_string: String = "select School,Type,Group , array(concat('School',School),concat('Type',Type),concat('Group',Group)) as combined_array from taba "

scala> df.createOrReplaceTempView("taba")

scala> spark.sql(sql_string).show(false)
+------+----------+------------------+---------------------------------------------------+
|School|Type |Group |combined_array |
+------+----------+------------------+---------------------------------------------------+
|ABC |Elementary|Music-Arts |[SchoolABC, TypeElementary, GroupMusic-Arts] |
|ABC |Elementary|Football |[SchoolABC, TypeElementary, GroupFootball] |
|DEF |Secondary |Basketball-Cricket|[SchoolDEF, TypeSecondary, GroupBasketball-Cricket]|
|DEF |Secondary |Cricket |[SchoolDEF, TypeSecondary, GroupCricket] |
+------+----------+------------------+---------------------------------------------------+


scala>

Update2:

scala> val df = Seq((1,"ABC","Elementary","Music-Arts"),(2,"ABC","Elementary","Football"),(3,"DEF","Secondary","Basketball-Cricket"),(4,"DEF","Secondary","Cricket")).toDF("StudentID","School","Type","Group")
df: org.apache.spark.sql.DataFrame = [StudentID: int, School: string ... 2 more fields]

scala> df.createOrReplaceTempView("student")

scala> val df2 = spark.sql(""" select studentid, collect_list(concat('Group:', t.sp1)) as sp2 from (select StudentID,School,Type,explode((split(group,'-'))) as sp1 from student where size(split(group,'-')) > 1 ) t group by studentid """)
df2: org.apache.spark.sql.DataFrame = [studentid: int, sp2: array<string>]

scala> val df3 = df.alias("t1").join(df2.alias("t2"),Seq("studentid"),"LeftOuter")
df3: org.apache.spark.sql.DataFrame = [StudentID: int, School: string ... 3 more fields]

scala> df3.createOrReplaceTempView("student2")

scala> spark.sql(""" select studentid, school,group, type, array(concat('School:',school),concat('type:',type),concat_ws(',',temp_arr)) from (select studentid,school,group,type, case when sp2 is null then array(concat("Group:",group)) else sp2 end as temp_arr from student2) t """).show(false)
+---------+------+------------------+----------+---------------------------------------------------------------------------+
|studentid|school|group |type |array(concat(School:, school), concat(type:, type), concat_ws(,, temp_arr))|
+---------+------+------------------+----------+---------------------------------------------------------------------------+
|1 |ABC |Music-Arts |Elementary|[School:ABC, type:Elementary, Group:Music,Group:Arts] |
|2 |ABC |Football |Elementary|[School:ABC, type:Elementary, Group:Football] |
|3 |DEF |Basketball-Cricket|Secondary |[School:DEF, type:Secondary, Group:Basketball,Group:Cricket] |
|4 |DEF |Cricket |Secondary |[School:DEF, type:Secondary, Group:Cricket] |
+---------+------+------------------+----------+---------------------------------------------------------------------------+


scala>

edited Nov 13 '18 at 19:50

answered Nov 13 '18 at 16:46

stack0114106

2,8571417

edited Nov 13 '18 at 19:50

answered Nov 13 '18 at 16:46

stack0114106

2,8571417

answered Nov 13 '18 at 16:46

stack0114106

2,8571417

answered Nov 13 '18 at 16:46

stack0114106

2,8571417

This solution doesn't address the core issue where the 'Group' values need to be split dynamically

– John Subas
Nov 13 '18 at 17:00

Ok.. let me work on it

– stack0114106
Nov 13 '18 at 17:08

@John Subas.. could you please check the update.

– stack0114106
Nov 13 '18 at 17:46

Thanks, but the update still doesn't address the core issue here: If you look at my output, the first and 3rd row has an array size of 4. We need to split the 'Group' column based on '-' and add multiple elements to array, one for each split.

– John Subas
Nov 13 '18 at 18:22

@John Subas.. could you pls check Update2

– stack0114106
Nov 13 '18 at 19:51

add a comment |

This solution doesn't address the core issue where the 'Group' values need to be split dynamically

– John Subas
Nov 13 '18 at 17:00

Ok.. let me work on it

– stack0114106
Nov 13 '18 at 17:08

@John Subas.. could you please check the update.

– stack0114106
Nov 13 '18 at 17:46

Thanks, but the update still doesn't address the core issue here: If you look at my output, the first and 3rd row has an array size of 4. We need to split the 'Group' column based on '-' and add multiple elements to array, one for each split.

– John Subas
Nov 13 '18 at 18:22

@John Subas.. could you pls check Update2

– stack0114106
Nov 13 '18 at 19:51

This solution doesn't address the core issue where the 'Group' values need to be split dynamically

– John Subas
Nov 13 '18 at 17:00

Ok.. let me work on it

– stack0114106
Nov 13 '18 at 17:08

@John Subas.. could you please check the update.

– stack0114106
Nov 13 '18 at 17:46

Thanks, but the update still doesn't address the core issue here: If you look at my output, the first and 3rd row has an array size of 4. We need to split the 'Group' column based on '-' and add multiple elements to array, one for each split.

– John Subas
Nov 13 '18 at 18:22

@John Subas.. could you pls check Update2

– stack0114106
Nov 13 '18 at 19:51

add a comment |

You need to first add an empty column then map it like so (in Java):

StructType newSchema = df1.schema().add("Combined Array", DataTypes.StringType);

df1 = df1.withColumn("Combined Array", lit(null))
 .map((MapFunction<Row, Row>) row ->
 RowFactory.create(...values...) // add existing values and new value here
 , newSchema);

It should be fairly similar in Scala.

edited Nov 13 '18 at 16:25

answered Nov 13 '18 at 16:20

steven35

1,4971832

add a comment |

You need to first add an empty column then map it like so (in Java):

StructType newSchema = df1.schema().add("Combined Array", DataTypes.StringType);

df1 = df1.withColumn("Combined Array", lit(null))
 .map((MapFunction<Row, Row>) row ->
 RowFactory.create(...values...) // add existing values and new value here
 , newSchema);

It should be fairly similar in Scala.

edited Nov 13 '18 at 16:25

answered Nov 13 '18 at 16:20

steven35

1,4971832

add a comment |

You need to first add an empty column then map it like so (in Java):

StructType newSchema = df1.schema().add("Combined Array", DataTypes.StringType);

df1 = df1.withColumn("Combined Array", lit(null))
 .map((MapFunction<Row, Row>) row ->
 RowFactory.create(...values...) // add existing values and new value here
 , newSchema);

It should be fairly similar in Scala.

edited Nov 13 '18 at 16:25

answered Nov 13 '18 at 16:20

steven35

1,4971832

You need to first add an empty column then map it like so (in Java):

StructType newSchema = df1.schema().add("Combined Array", DataTypes.StringType);

df1 = df1.withColumn("Combined Array", lit(null))
 .map((MapFunction<Row, Row>) row ->
 RowFactory.create(...values...) // add existing values and new value here
 , newSchema);

It should be fairly similar in Scala.

edited Nov 13 '18 at 16:25

answered Nov 13 '18 at 16:20

steven35

1,4971832

edited Nov 13 '18 at 16:25

answered Nov 13 '18 at 16:20

steven35

1,4971832

answered Nov 13 '18 at 16:20

steven35

1,4971832

answered Nov 13 '18 at 16:20

steven35

1,4971832

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

dr3Sk4z4G JYxF38rRqqOKKi57S waVs,tr50gVHL7 CtGduIV7VqJoImXJKw0cGxNTjUZrbPBS bAt1sOZfe5K xseU,mQxW8pFi0v

搜尋此網誌

Odtnhj