Using Word2Vec functions inside of a UDF in Apache Spark (v2.3.1)
I have a dataframe which consists of two columns, one an Int and the other
a String:
+-------------+---------------------+
|user_id |token |
+-------------+---------------------+
| 419| Cake|
| 419| Chocolate|
| 419| Cheese|
| 419| Cream|
| 419| Bread|
| 419| Sugar|
| 419| Butter|
| 419| Chicken|
| 419| Baking|
| 419| Grilling|
+-------------+---------------------+
I need to find the 250 closest tokens in the Word2Vec vocabulary, for each token in the "token" column. I attempted to use the findSynonymsArray
method in a udf:
def getSyn( w2v : Word2VecModel ) = udf (token : String) => w2v.findSynonymsArray(token, 10)
However, this udf causes NullPointerException
when used with withColumn
. This exception occurs even if token is hard-coded, and regardless of whether code is run locally or in cluster mode. I used a try-catch inside the udf to catch the null pointer, and it is being raised on every row.
I have queried the dataframe for null values, there are none in either column.
I also tried extracting the words and vectors from the Word2VecModel
with getVectors
, running my udf on the words on this dataframe, and doing an inner join with my dataframe. The same exception is raised.
I would greatly appreciate any help.
scala apache-spark user-defined-functions word2vec apache-spark-ml
add a comment |
I have a dataframe which consists of two columns, one an Int and the other
a String:
+-------------+---------------------+
|user_id |token |
+-------------+---------------------+
| 419| Cake|
| 419| Chocolate|
| 419| Cheese|
| 419| Cream|
| 419| Bread|
| 419| Sugar|
| 419| Butter|
| 419| Chicken|
| 419| Baking|
| 419| Grilling|
+-------------+---------------------+
I need to find the 250 closest tokens in the Word2Vec vocabulary, for each token in the "token" column. I attempted to use the findSynonymsArray
method in a udf:
def getSyn( w2v : Word2VecModel ) = udf (token : String) => w2v.findSynonymsArray(token, 10)
However, this udf causes NullPointerException
when used with withColumn
. This exception occurs even if token is hard-coded, and regardless of whether code is run locally or in cluster mode. I used a try-catch inside the udf to catch the null pointer, and it is being raised on every row.
I have queried the dataframe for null values, there are none in either column.
I also tried extracting the words and vectors from the Word2VecModel
with getVectors
, running my udf on the words on this dataframe, and doing an inner join with my dataframe. The same exception is raised.
I would greatly appreciate any help.
scala apache-spark user-defined-functions word2vec apache-spark-ml
add a comment |
I have a dataframe which consists of two columns, one an Int and the other
a String:
+-------------+---------------------+
|user_id |token |
+-------------+---------------------+
| 419| Cake|
| 419| Chocolate|
| 419| Cheese|
| 419| Cream|
| 419| Bread|
| 419| Sugar|
| 419| Butter|
| 419| Chicken|
| 419| Baking|
| 419| Grilling|
+-------------+---------------------+
I need to find the 250 closest tokens in the Word2Vec vocabulary, for each token in the "token" column. I attempted to use the findSynonymsArray
method in a udf:
def getSyn( w2v : Word2VecModel ) = udf (token : String) => w2v.findSynonymsArray(token, 10)
However, this udf causes NullPointerException
when used with withColumn
. This exception occurs even if token is hard-coded, and regardless of whether code is run locally or in cluster mode. I used a try-catch inside the udf to catch the null pointer, and it is being raised on every row.
I have queried the dataframe for null values, there are none in either column.
I also tried extracting the words and vectors from the Word2VecModel
with getVectors
, running my udf on the words on this dataframe, and doing an inner join with my dataframe. The same exception is raised.
I would greatly appreciate any help.
scala apache-spark user-defined-functions word2vec apache-spark-ml
I have a dataframe which consists of two columns, one an Int and the other
a String:
+-------------+---------------------+
|user_id |token |
+-------------+---------------------+
| 419| Cake|
| 419| Chocolate|
| 419| Cheese|
| 419| Cream|
| 419| Bread|
| 419| Sugar|
| 419| Butter|
| 419| Chicken|
| 419| Baking|
| 419| Grilling|
+-------------+---------------------+
I need to find the 250 closest tokens in the Word2Vec vocabulary, for each token in the "token" column. I attempted to use the findSynonymsArray
method in a udf:
def getSyn( w2v : Word2VecModel ) = udf (token : String) => w2v.findSynonymsArray(token, 10)
However, this udf causes NullPointerException
when used with withColumn
. This exception occurs even if token is hard-coded, and regardless of whether code is run locally or in cluster mode. I used a try-catch inside the udf to catch the null pointer, and it is being raised on every row.
I have queried the dataframe for null values, there are none in either column.
I also tried extracting the words and vectors from the Word2VecModel
with getVectors
, running my udf on the words on this dataframe, and doing an inner join with my dataframe. The same exception is raised.
I would greatly appreciate any help.
scala apache-spark user-defined-functions word2vec apache-spark-ml
scala apache-spark user-defined-functions word2vec apache-spark-ml
edited Nov 14 '18 at 5:37
Shaido
12.3k112540
12.3k112540
asked Nov 14 '18 at 4:15
emgemg
154
154
add a comment |
add a comment |
1 Answer
1
active
oldest
votes
This is an expected outcome Word2VecModel
is a distributed model, and its methods are implemented using RDD
operations. Because of that, it cannot be used inside udf
, map
or any other executor-side code.
If you want to compute synonyms for the whole DataFrame
you'll can try to do it manually.
- Load the model directly as
DataFrame
as shown for example in using Word2VecModel.transform() does not work in map function - Transform the input data.
- Join both using approximate join or cross product and filter the result.
Thanks, this turned out to be correct. We decided our particular Word2Vec problem was better addressed in Python, but still used Spark to extract training data.
– emg
Dec 9 '18 at 17:20
add a comment |
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53293114%2fusing-word2vec-functions-inside-of-a-udf-in-apache-spark-v2-3-1%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
This is an expected outcome Word2VecModel
is a distributed model, and its methods are implemented using RDD
operations. Because of that, it cannot be used inside udf
, map
or any other executor-side code.
If you want to compute synonyms for the whole DataFrame
you'll can try to do it manually.
- Load the model directly as
DataFrame
as shown for example in using Word2VecModel.transform() does not work in map function - Transform the input data.
- Join both using approximate join or cross product and filter the result.
Thanks, this turned out to be correct. We decided our particular Word2Vec problem was better addressed in Python, but still used Spark to extract training data.
– emg
Dec 9 '18 at 17:20
add a comment |
This is an expected outcome Word2VecModel
is a distributed model, and its methods are implemented using RDD
operations. Because of that, it cannot be used inside udf
, map
or any other executor-side code.
If you want to compute synonyms for the whole DataFrame
you'll can try to do it manually.
- Load the model directly as
DataFrame
as shown for example in using Word2VecModel.transform() does not work in map function - Transform the input data.
- Join both using approximate join or cross product and filter the result.
Thanks, this turned out to be correct. We decided our particular Word2Vec problem was better addressed in Python, but still used Spark to extract training data.
– emg
Dec 9 '18 at 17:20
add a comment |
This is an expected outcome Word2VecModel
is a distributed model, and its methods are implemented using RDD
operations. Because of that, it cannot be used inside udf
, map
or any other executor-side code.
If you want to compute synonyms for the whole DataFrame
you'll can try to do it manually.
- Load the model directly as
DataFrame
as shown for example in using Word2VecModel.transform() does not work in map function - Transform the input data.
- Join both using approximate join or cross product and filter the result.
This is an expected outcome Word2VecModel
is a distributed model, and its methods are implemented using RDD
operations. Because of that, it cannot be used inside udf
, map
or any other executor-side code.
If you want to compute synonyms for the whole DataFrame
you'll can try to do it manually.
- Load the model directly as
DataFrame
as shown for example in using Word2VecModel.transform() does not work in map function - Transform the input data.
- Join both using approximate join or cross product and filter the result.
answered Nov 14 '18 at 9:22
user10651029user10651029
461
461
Thanks, this turned out to be correct. We decided our particular Word2Vec problem was better addressed in Python, but still used Spark to extract training data.
– emg
Dec 9 '18 at 17:20
add a comment |
Thanks, this turned out to be correct. We decided our particular Word2Vec problem was better addressed in Python, but still used Spark to extract training data.
– emg
Dec 9 '18 at 17:20
Thanks, this turned out to be correct. We decided our particular Word2Vec problem was better addressed in Python, but still used Spark to extract training data.
– emg
Dec 9 '18 at 17:20
Thanks, this turned out to be correct. We decided our particular Word2Vec problem was better addressed in Python, but still used Spark to extract training data.
– emg
Dec 9 '18 at 17:20
add a comment |
Thanks for contributing an answer to Stack Overflow!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53293114%2fusing-word2vec-functions-inside-of-a-udf-in-apache-spark-v2-3-1%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown