Using Word2Vec functions inside of a UDF in Apache Spark (v2.3.1)










I have a dataframe which consists of two columns, one an Int and the other
a String:



+-------+---------+
|user_id|    token|
+-------+---------+
|    419|     Cake|
|    419|Chocolate|
|    419|   Cheese|
|    419|    Cream|
|    419|    Bread|
|    419|    Sugar|
|    419|   Butter|
|    419|  Chicken|
|    419|   Baking|
|    419| Grilling|
+-------+---------+


I need to find the 250 closest tokens in the Word2Vec vocabulary, for each token in the "token" column. I attempted to use the findSynonymsArray method in a udf:



def getSyn(w2v: Word2VecModel) = udf((token: String) => w2v.findSynonymsArray(token, 10))


However, this udf throws a NullPointerException when used with withColumn. The exception occurs even if the token is hard-coded, and regardless of whether the code is run locally or in cluster mode. I used a try-catch inside the udf to catch the null pointer, and it is raised on every row.



I have queried the dataframe for null values, and there are none in either column.



I also tried extracting the words and vectors from the Word2VecModel with getVectors, running my udf on the words in that dataframe, and doing an inner join with my dataframe. The same exception is raised.



I would greatly appreciate any help.










      scala apache-spark user-defined-functions word2vec apache-spark-ml






      edited Nov 14 '18 at 5:37









      Shaido











      asked Nov 14 '18 at 4:15









      emg























          1 Answer
          This is an expected outcome. Word2VecModel is a distributed model, and its methods are implemented using RDD operations. Because of that, it cannot be used inside a udf, map, or any other executor-side code.



          If you want to compute synonyms for the whole DataFrame, you can try to do it manually.



          • Load the model directly as a DataFrame, as shown for example in "using Word2VecModel.transform() does not work in map function".

          • Transform the input data.

          • Join both using an approximate join or a cross product, and filter the result.
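
The steps above could look roughly like the following. This is only a sketch under several assumptions: it reads the vocabulary through getVectors instead of loading the saved model files, it uses a plain cross product instead of an approximate join, and it ranks by cosine similarity, which I am assuming is close to what findSynonymsArray computes. The names SynonymJoin, cosine, topSynonyms and k are made up for illustration, and the input column is assumed to be called "token" as in the question.

```scala
import org.apache.spark.ml.feature.Word2VecModel
import org.apache.spark.ml.linalg.Vector
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, udf}

object SynonymJoin {
  // Cosine similarity of two equal-length arrays; 0.0 if either has zero norm.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    if (norm == 0.0) 0.0 else dot / norm
  }

  // Hypothetical helper: the k most similar vocabulary words for each distinct token.
  def topSynonyms(tokens: DataFrame, model: Word2VecModel, k: Int): DataFrame = {
    // The udf only touches plain vectors, never the model itself.
    val cosineUdf = udf((a: Vector, b: Vector) => cosine(a.toArray, b.toArray))

    // Step 1: the vocabulary as an ordinary DataFrame (columns "word" and "vector").
    val vocab = model.getVectors

    // Step 2: attach a vector to each distinct input token.
    // Tokens that are not in the vocabulary drop out of the inner join.
    val queries = tokens.select("token").distinct()
      .join(vocab.select(col("word").as("token"), col("vector").as("tokenVector")), "token")

    // Step 3: cross product with the vocabulary, score each pair, keep the top k per token.
    val candidates = vocab.select(col("word").as("synonym"), col("vector").as("synonymVector"))
    val scored = queries.crossJoin(candidates)
      .where(col("token") =!= col("synonym"))
      .withColumn("similarity", cosineUdf(col("tokenVector"), col("synonymVector")))

    val byToken = Window.partitionBy("token").orderBy(col("similarity").desc)
    scored.withColumn("rank", row_number().over(byToken))
      .where(col("rank") <= k)
      .select("token", "synonym", "similarity")
  }
}
```

Something like SynonymJoin.topSynonyms(df, model, 250) would then give one row per (token, synonym) pair, which can be joined back to user_id. The cross product grows with the number of distinct tokens times the vocabulary size, so for a large vocabulary an approximate join is probably the better choice.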




























          • Thanks, this turned out to be correct. We decided our particular Word2Vec problem was better addressed in Python, but still used Spark to extract training data.

            – emg
            Dec 9 '18 at 17:20










          answered Nov 14 '18 at 9:22









          user10651029
