Scala: Delete empty array values from a Spark DataFrame
I'm new to Scala. Given a DataFrame named df as follows:
+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
| [null]| [0.0]| [0.0]| [null]|
| [IND1]| [5.0]| [6.0]| [A]|
| [IND2]| [7.0]| [8.0]| [B]|
| | | | |
+-------+-------+-------+-------+
I'd like to delete rows in which every column is an empty array (the 4th row above).
The expected result is:
+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
| [null]| [0.0]| [0.0]| [null]|
| [IND1]| [5.0]| [6.0]| [A]|
| [IND2]| [7.0]| [8.0]| [B]|
+-------+-------+-------+-------+
I tried isNotNull, for example:
val temp = df.filter(col("Column1").isNotNull && col("Column2").isNotNull && col("Column3").isNotNull && col("Column4").isNotNull).show()
but it still shows all rows.
I found a Python solution that uses a Hive UDF (from a linked answer), but I had a hard time converting it into valid Scala. I'd like to use Scala code similar to the following:
val query = "SELECT * FROM targetDf WHERE {0}".format(" AND ".join("SIZE({0}) > 0".format(c) for c in ["Column1", "Column2", "Column3", "Column4"]))
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.sql(query)
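(For reference, a hedged Scala sketch of that query-building approach, assuming Spark 2.x with a SparkSession named spark and using df and the column names from the example above; none of this is from the original post:)
// Hedged sketch: build the same "SIZE(...) > 0 AND ..." predicate that the
// Python snippet builds, then run it through Spark SQL.
val columns = Seq("Column1", "Column2", "Column3", "Column4")
val condition = columns.map(c => s"SIZE($c) > 0").mkString(" AND ")
val query = s"SELECT * FROM targetDf WHERE $condition"

df.createOrReplaceTempView("targetDf")
spark.sql(query).show()
Joining with " OR " instead would drop only the rows in which all four arrays are empty, which matches the wording of the question more closely; on the sample data both conditions give the same result.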
Any help would be appreciated. Thank you.
arrays scala apache-spark dataframe user-defined-functions
asked Nov 11 at 18:08 by Haven Shi
2 Answers
Using isNotNull or isNull will not work here, because those look for null values in the DataFrame. Your example DF does not contain nulls but empty arrays; that is a different thing.
One option: you could create a new column holding the length of each array and filter out the rows where that length is zero.
import org.apache.spark.sql.functions.size
import spark.implicits._ // for the $"..." column syntax

val dfFil = df
  .withColumn("arrayLengthColOne", size($"Column1"))
  .withColumn("arrayLengthColTwo", size($"Column2"))
  .withColumn("arrayLengthColThree", size($"Column3"))
  .withColumn("arrayLengthColFour", size($"Column4"))
  .filter($"arrayLengthColOne" =!= 0 && $"arrayLengthColTwo" =!= 0
    && $"arrayLengthColThree" =!= 0 && $"arrayLengthColFour" =!= 0)
  .drop("arrayLengthColOne", "arrayLengthColTwo", "arrayLengthColThree", "arrayLengthColFour")
Original DF:
+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
| [A]| [B]| [C]| [d]|
| | | | |
+-------+-------+-------+-------+
New DF:
+-------+-------+-------+-------+
|Column1|Column2|Column3|Column4|
+-------+-------+-------+-------+
| [A]| [B]| [C]| [d]|
+-------+-------+-------+-------+
You could also write a function that maps over all the columns and does this in one go.
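For instance, a hedged sketch of such a helper (not from the original answer; it builds an equivalent size-based condition by folding over the column names instead of adding temporary columns):
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, size}

// Illustrative helper: keep only rows where every listed array column is non-empty.
def dropAllEmpty(df: DataFrame, columns: Seq[String]): DataFrame = {
  val allNonEmpty = columns
    .map(c => size(col(c)) > 0) // one Boolean Column per array column
    .reduce(_ && _)             // AND them together
  df.filter(allNonEmpty)
}

val dfFil = dropAllEmpty(df, Seq("Column1", "Column2", "Column3", "Column4"))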
answered Nov 11 at 18:34 by fletchr, edited Nov 11 at 19:30 (accepted)
Hi fletchr, thanks for the reply. It is a very good direction. My only concern is that this filter condition only checks size($"Column1"), so it may also delete rows like | | [0.0]| [0.0]| [A]|. Is there any way to change the condition to filter on the sum of size($"Column1"), size($"Column2"), size($"Column3") and size($"Column4")?
– Haven Shi
Nov 11 at 19:09
Let me know if this clears up your problem. It is a bit iterative and cumbersome, but I think it will work. I just edited my answer; I'll let you work out a more elegant solution.
– fletchr
Nov 11 at 19:30
Thank you for the update, I think it should be a correct answer. However, on my side it shows "value && is not a member of Int" on every $ symbol. Have you seen that error before?
– Haven Shi
Nov 11 at 19:54
@HavenShi try import spark.implicits._, or instead of using $ you can use col("columnOfInterest")
– fletchr
Nov 11 at 20:52
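A hedged sketch of the sum-of-sizes variant discussed in these comments (not part of the answer above): it keeps a row as long as at least one of the four arrays is non-empty, so rows like | | [0.0]| [0.0]| [A]| survive.
import org.apache.spark.sql.functions.{col, size}

// Note: size() returns -1 for a null column, so this assumes the arrays are
// present (possibly empty) rather than null.
val sizeSum = size(col("Column1")) + size(col("Column2")) +
              size(col("Column3")) + size(col("Column4"))

// Rows where all four arrays are empty sum to 0 and are dropped.
val dfKept = df.filter(sizeSum > 0)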
Another approach (in addition to the accepted answer) would be to use Datasets.
For example, given a case class with an isEmpty method:
case class MyClass(col1: Seq[String],
                   col2: Seq[Double],
                   col3: Seq[Double],
                   col4: Seq[String]) {
  def isEmpty: Boolean = ...
}
You can represent your source as a typed structure:
import spark.implicits._ // needed to provide an implicit encoder/data mapper
val originalSource: DataFrame = ... // provide your source
val source: Dataset[MyClass] = originalSource.as[MyClass] // convert/map it to Dataset
So you could filter like this:
source.filter(element => !element.isEmpty) // calling class's instance method
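Put together, a hedged sketch of how this might look end to end. The isEmpty body and the column renaming are illustrative additions, not part of the original answer; it assumes an active SparkSession named spark and the example df from the question:
import org.apache.spark.sql.Dataset

case class MyClass(col1: Seq[String],
                   col2: Seq[Double],
                   col3: Seq[Double],
                   col4: Seq[String]) {
  // One plausible definition of "empty": every array column is empty.
  def isEmpty: Boolean =
    col1.isEmpty && col2.isEmpty && col3.isEmpty && col4.isEmpty
}

import spark.implicits._ // provides the Encoder[MyClass]

// Rename the example columns so they line up with the case class fields,
// then map the DataFrame to a typed Dataset and filter on the instance method.
val source: Dataset[MyClass] =
  df.toDF("col1", "col2", "col3", "col4").as[MyClass]

source.filter(element => !element.isEmpty).show()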
answered Nov 12 at 10:27 by machine head, edited Nov 12 at 13:25