Using Word2Vec functions inside of a UDF in Apache Spark (v2.3.1)

I have a dataframe which consists of two columns, one an Int and the other
a String:

+-------------+---------------------+
|user_id |token |
+-------------+---------------------+
| 419| Cake|
| 419| Chocolate|
| 419| Cheese|
| 419| Cream|
| 419| Bread|
| 419| Sugar|
| 419| Butter|
| 419| Chicken|
| 419| Baking|
| 419| Grilling|
+-------------+---------------------+

I need to find the 250 closest tokens in the Word2Vec vocabulary, for each token in the "token" column. I attempted to use the findSynonymsArray method in a udf:

def getSyn( w2v : Word2VecModel ) = udf (token : String) => w2v.findSynonymsArray(token, 10)

However, this udf causes NullPointerException when used with withColumn. This exception occurs even if token is hard-coded, and regardless of whether code is run locally or in cluster mode. I used a try-catch inside the udf to catch the null pointer, and it is being raised on every row.

I have queried the dataframe for null values, there are none in either column.

I also tried extracting the words and vectors from the Word2VecModel with getVectors, running my udf on the words on this dataframe, and doing an inner join with my dataframe. The same exception is raised.

I would greatly appreciate any help.

edited Nov 14 '18 at 5:37

Shaido

12.3k112540

asked Nov 14 '18 at 4:15

emg

154

add a comment |

I have a dataframe which consists of two columns, one an Int and the other
a String:

+-------------+---------------------+
|user_id |token |
+-------------+---------------------+
| 419| Cake|
| 419| Chocolate|
| 419| Cheese|
| 419| Cream|
| 419| Bread|
| 419| Sugar|
| 419| Butter|
| 419| Chicken|
| 419| Baking|
| 419| Grilling|
+-------------+---------------------+

I need to find the 250 closest tokens in the Word2Vec vocabulary, for each token in the "token" column. I attempted to use the findSynonymsArray method in a udf:

def getSyn( w2v : Word2VecModel ) = udf (token : String) => w2v.findSynonymsArray(token, 10)

I have queried the dataframe for null values, there are none in either column.

I would greatly appreciate any help.

edited Nov 14 '18 at 5:37

Shaido

12.3k112540

asked Nov 14 '18 at 4:15

emg

154

add a comment |

I have a dataframe which consists of two columns, one an Int and the other
a String:

+-------------+---------------------+
|user_id |token |
+-------------+---------------------+
| 419| Cake|
| 419| Chocolate|
| 419| Cheese|
| 419| Cream|
| 419| Bread|
| 419| Sugar|
| 419| Butter|
| 419| Chicken|
| 419| Baking|
| 419| Grilling|
+-------------+---------------------+

I need to find the 250 closest tokens in the Word2Vec vocabulary, for each token in the "token" column. I attempted to use the findSynonymsArray method in a udf:

def getSyn( w2v : Word2VecModel ) = udf (token : String) => w2v.findSynonymsArray(token, 10)

I have queried the dataframe for null values, there are none in either column.

I would greatly appreciate any help.

edited Nov 14 '18 at 5:37

Shaido

12.3k112540

asked Nov 14 '18 at 4:15

emg

154

I have a dataframe which consists of two columns, one an Int and the other
a String:

+-------------+---------------------+
|user_id |token |
+-------------+---------------------+
| 419| Cake|
| 419| Chocolate|
| 419| Cheese|
| 419| Cream|
| 419| Bread|
| 419| Sugar|
| 419| Butter|
| 419| Chicken|
| 419| Baking|
| 419| Grilling|
+-------------+---------------------+

I need to find the 250 closest tokens in the Word2Vec vocabulary, for each token in the "token" column. I attempted to use the findSynonymsArray method in a udf:

def getSyn( w2v : Word2VecModel ) = udf (token : String) => w2v.findSynonymsArray(token, 10)

I have queried the dataframe for null values, there are none in either column.

I would greatly appreciate any help.

scala apache-spark user-defined-functions word2vec apache-spark-ml

edited Nov 14 '18 at 5:37

Shaido

12.3k112540

asked Nov 14 '18 at 4:15

emg

154

edited Nov 14 '18 at 5:37

Shaido

12.3k112540

asked Nov 14 '18 at 4:15

emg

154

edited Nov 14 '18 at 5:37

Shaido

12.3k112540

edited Nov 14 '18 at 5:37

Shaido

12.3k112540

edited Nov 14 '18 at 5:37

Shaido

12.3k112540

asked Nov 14 '18 at 4:15

emg

154

asked Nov 14 '18 at 4:15

emg

154

asked Nov 14 '18 at 4:15

emg

154

add a comment |

1 Answer
1

active

oldest

votes

This is an expected outcome Word2VecModel is a distributed model, and its methods are implemented using RDD operations. Because of that, it cannot be used inside udf, map or any other executor-side code.

If you want to compute synonyms for the whole DataFrame you'll can try to do it manually.

Load the model directly as DataFrame as shown for example in using Word2VecModel.transform() does not work in map function

Transform the input data.

Join both using approximate join or cross product and filter the result.

answered Nov 14 '18 at 9:22

user10651029

461

Thanks, this turned out to be correct. We decided our particular Word2Vec problem was better addressed in Python, but still used Spark to extract training data.

– emg
Dec 9 '18 at 17:20

add a comment |

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53293114%2fusing-word2vec-functions-inside-of-a-udf-in-apache-spark-v2-3-1%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

If you want to compute synonyms for the whole DataFrame you'll can try to do it manually.

Load the model directly as DataFrame as shown for example in using Word2VecModel.transform() does not work in map function

Transform the input data.

Join both using approximate join or cross product and filter the result.

answered Nov 14 '18 at 9:22

user10651029

461

Thanks, this turned out to be correct. We decided our particular Word2Vec problem was better addressed in Python, but still used Spark to extract training data.

– emg
Dec 9 '18 at 17:20

add a comment |

If you want to compute synonyms for the whole DataFrame you'll can try to do it manually.

Load the model directly as DataFrame as shown for example in using Word2VecModel.transform() does not work in map function

Transform the input data.

Join both using approximate join or cross product and filter the result.

answered Nov 14 '18 at 9:22

user10651029

461

Thanks, this turned out to be correct. We decided our particular Word2Vec problem was better addressed in Python, but still used Spark to extract training data.

– emg
Dec 9 '18 at 17:20

add a comment |

If you want to compute synonyms for the whole DataFrame you'll can try to do it manually.

Load the model directly as DataFrame as shown for example in using Word2VecModel.transform() does not work in map function

Transform the input data.

Join both using approximate join or cross product and filter the result.

answered Nov 14 '18 at 9:22

user10651029

461

If you want to compute synonyms for the whole DataFrame you'll can try to do it manually.

Load the model directly as DataFrame as shown for example in using Word2VecModel.transform() does not work in map function

Transform the input data.

Join both using approximate join or cross product and filter the result.

answered Nov 14 '18 at 9:22

user10651029

461

answered Nov 14 '18 at 9:22

user10651029

461

answered Nov 14 '18 at 9:22

user10651029

461

answered Nov 14 '18 at 9:22

user10651029

461

Thanks, this turned out to be correct. We decided our particular Word2Vec problem was better addressed in Python, but still used Spark to extract training data.

– emg
Dec 9 '18 at 17:20

add a comment |

Thanks, this turned out to be correct. We decided our particular Word2Vec problem was better addressed in Python, but still used Spark to extract training data.

– emg
Dec 9 '18 at 17:20

Thanks, this turned out to be correct. We decided our particular Word2Vec problem was better addressed in Python, but still used Spark to extract training data.

– emg
Dec 9 '18 at 17:20

add a comment |

draft saved

draft discarded

Thanks for contributing an answer to Stack Overflow!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

搜尋此網誌

Odtnhj