Why does Spark's Word2Vec return a Vector?

Running Spark's example for Word2Vec, I realized that it takes in an array of strings and gives out a single vector. My question is, shouldn't it return a matrix instead of a vector? I was expecting one vector per input word. But it returns one vector, period!

Or maybe it should have accepted a string (one word) as input instead of an array of strings. Then, yeah sure, it could return one vector as output. But accepting an array of strings and returning one single vector does not make sense to me.

[UPDATE]

Per @Shaido's request, here's the code with my minor change to print the schema for the output:

import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.Word2Vec;
import org.apache.spark.ml.feature.Word2VecModel;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.*;

public class JavaWord2VecExample {
    public static void main(String[] args) {
        SparkSession spark = SparkSession
            .builder()
            .appName("JavaWord2VecExample")
            .getOrCreate();

        // $example on$
        // Input data: Each row is a bag of words from a sentence or document.
        List<Row> data = Arrays.asList(
            RowFactory.create(Arrays.asList("Hi I heard about Spark".split(" "))),
            RowFactory.create(Arrays.asList("I wish Java could use case classes".split(" "))),
            RowFactory.create(Arrays.asList("Logistic regression models are neat".split(" ")))
        );
        // A single non-nullable column "text" holding an array of strings.
        StructType schema = new StructType(new StructField[]{
            new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
        });
        Dataset<Row> documentDF = spark.createDataFrame(data, schema);

        // Learn a mapping from words to Vectors.
        Word2Vec word2Vec = new Word2Vec()
            .setInputCol("text")
            .setOutputCol("result")
            .setVectorSize(7)
            .setMinCount(0);

        Word2VecModel model = word2Vec.fit(documentDF);
        Dataset<Row> result = model.transform(documentDF);

        for (Row row : result.collectAsList()) {
            List<String> text = row.getList(0);
            // My minor change: print the schema of each output row.
            System.out.println("Schema: " + row.schema());
            Vector vector = (Vector) row.get(1);
            System.out.println("Text: " + text + " => \nVector: " + vector + "\n");
        }
        // $example off$

        spark.stop();
    }
}

And it prints:

Schema: StructType(StructField(text,ArrayType(StringType,true),false), StructField(result,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
Text: [Hi, I, heard, about, Spark] => 
Vector: [-0.0033279924420639875,-0.0024428479373455048,0.01406305879354477,0.030621735751628878,0.00792500376701355,0.02839711122214794,-0.02286271695047617]

Schema: StructType(StructField(text,ArrayType(StringType,true),false), StructField(result,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
Text: [I, wish, Java, could, use, case, classes] => 
Vector: [-9.96453288410391E-4,-0.013741840076233658,0.013064394239336252,-0.01155538750546319,-0.010510949650779366,0.004538436819400106,-0.0036846946126648356]

Schema: StructType(StructField(text,ArrayType(StringType,true),false), StructField(result,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
Text: [Logistic, regression, models, are, neat] => 
Vector: [0.012510885251685977,-0.014472834207117558,0.002779599279165268,0.0022389178164303304,0.012743516173213721,-0.02409198731184006,0.017409833287820222]

Please correct me if I'm wrong, but the input is an array of strings and the output is a single vector. And I was expecting each word to be mapped into a vector.

edited Nov 13 '18 at 5:34

Shaido

11.9k112441

asked Nov 13 '18 at 2:08

Mehran

3,795745109

java apache-spark word2vec apache-spark-ml


2 Answers

This is an attempt to justify the rationale of Spark here, and it should be read as a complement to the nice programming explanation already provided as an answer...

To start with, how exactly individual word embeddings should be combined is not in principle a feature of the Word2Vec model itself (which is about, well, individual words), but an issue of concern to "higher order" models, such as Sentence2Vec, Paragraph2Vec, Doc2Vec, Wikipedia2Vec etc (you could name a few more, I guess...).

Having said that, it turns out that a very common first approach to combining word vectors in order to get vector representations of larger pieces of text (phrases, sentences, tweets etc.) is indeed to simply average the vector representations of the constituent words, as Spark ML does.

Starting from the practitioner community, we have:

How to concatenate word vectors to form sentence vector (SO answer):

There are at least three common ways to combine embedding vectors; (a)
summing, (b) summing & averaging or (c) concatenating. [...] See gensim.models.doc2vec.Doc2Vec, dm_concat and
dm_mean - it allows you to use any of those three options

Sentence2Vec : Evaluation of popular theories — Part I (Simple average of word vectors) (blog post):

So what’s first thing that comes to your mind when you have word
vectors and need to calculate sentence vector.

Just average them?

Yes that’s what we are going to do here.

Sentence2Vec (Github repo):

Word2Vec can help to find other words with similar semantic meaning.
However, Word2Vec can only take 1 word each time, while a sentence
consists of multiple words. To solve this, I write the Sentence2Vec,
which is actually a wrapper to Word2Vec. To obtain the vector of a
sentence, I simply get the averaged vector sum of each word in the
sentence.

It certainly seems that, at least for practitioners, this simple averaging of the individual word vectors is far from unexpected.
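To make the three combination strategies quoted above concrete, here is a minimal plain-Java sketch. It does not use Spark at all; the class name `CombineEmbeddings`, its method names, and the two toy 2-dimensional "word vectors" are made up purely for illustration.

```java
import java.util.Arrays;
import java.util.List;

public class CombineEmbeddings {
    // (a) Element-wise sum of equally sized word vectors.
    static double[] sum(List<double[]> vectors) {
        double[] out = new double[vectors.get(0).length];
        for (double[] v : vectors) {
            for (int i = 0; i < out.length; i++) {
                out[i] += v[i];
            }
        }
        return out;
    }

    // (b) Sum, then divide by the number of words (what Spark ML's transform does).
    static double[] average(List<double[]> vectors) {
        double[] out = sum(vectors);
        for (int i = 0; i < out.length; i++) {
            out[i] /= vectors.size();
        }
        return out;
    }

    // (c) Concatenation: the result grows with the number of words.
    static double[] concatenate(List<double[]> vectors) {
        return vectors.stream().flatMapToDouble(Arrays::stream).toArray();
    }

    public static void main(String[] args) {
        // Hypothetical toy vectors, not real embeddings.
        List<double[]> words = Arrays.asList(new double[]{1.0, 2.0}, new double[]{3.0, 6.0});
        System.out.println(Arrays.toString(sum(words)));         // [4.0, 8.0]
        System.out.println(Arrays.toString(average(words)));     // [2.0, 4.0]
        System.out.println(Arrays.toString(concatenate(words))); // [1.0, 2.0, 3.0, 6.0]
    }
}
```

Note that only summing and averaging give a fixed-size output regardless of sentence length, which is presumably part of their appeal for a DataFrame column.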

An expected counter-argument here is that blog posts and SO answers are arguably not that credible sources; what about the researchers and the relevant scientific literature? Well, it turns out that this simple averaging is far from uncommon here, too:

From Distributed Representations of Sentences and Documents (Le & Mikolov, Google, ICML 2014):

[image: excerpt from the paper]

From NILC-USP at SemEval-2017 Task 4: A Multi-view Ensemble for Twitter Sentiment analysis (SemEval 2017, section 2.1.2):

[image: excerpt from the paper]

It should be clear by now that the particular design choice in Spark ML is far from arbitrary, or even uncommon; I have blogged about what certainly seem to be absurd design choices in Spark ML (see Classification in Spark 2.0: “Input validation failed” and other wondrous tales), but it seems that this is not such a case...

edited Nov 28 '18 at 23:10

answered Nov 28 '18 at 23:04

desertnaut

16.6k63567

Thanks for such a well-thought answer. This is definitely a better-fit answer for the question asked. Please don't get me wrong but I still believe whoever made the choice to return a vector instead of a matrix has made a mistake. For one thing, going from a matrix to a vector (if they had returned the matrix) could easily be done in user code. All I'm saying is that they've implemented a really practical algorithm and at the same time, they've ruined it completely by averaging the result. I think they don't know what they have done or someone has made a mistake. Thanks, again.
– Mehran
Nov 28 '18 at 23:30

@Mehran you are very welcome; and personally, I don't trust the Spark ML people that they know what they are doing (checked my blog post?)
– desertnaut
Nov 28 '18 at 23:36

No, I have not read your post yet. Hopefully, this weekend.
– Mehran
Nov 29 '18 at 2:33


To see the vector corresponding to each word you can run model.getVectors. For the dataframe in the question (with a vector size of 3 instead of 7) this gives:

+----------+-----------------------------------------------------------------+
|word      |vector                                                           |
+----------+-----------------------------------------------------------------+
|heard     |[0.14950960874557495,-0.11237259954214096,-0.03993036597967148]  |
|are       |[-0.16390761733055115,-0.14509087800979614,0.11349033564329147]  |
|neat      |[0.13949351012706757,0.08127426356077194,0.15970033407211304]    |
|classes   |[0.03703496977686882,0.05841822177171707,-0.02267565205693245]   |
|I         |[-0.018915412947535515,-0.13099457323551178,0.14300788938999176] |
|regression|[0.1529865264892578,0.060659825801849365,0.07735282927751541]    |
|Logistic  |[-0.12702016532421112,0.09839040040969849,-0.10370948910713196]  |
|Spark     |[-0.053579315543174744,0.14673036336898804,-0.002033260650932789]|
|could     |[0.12216471135616302,-0.031169598922133446,-0.1427609771490097]  |
|use       |[0.08246973901987076,0.002503493567928672,-0.0796264186501503]   |
|Hi        |[0.16548289358615875,0.06477408856153488,0.09229831397533417]    |
|models    |[-0.05683165416121483,0.009706663899123669,-0.033789146691560745]|
|case      |[0.11626788973808289,0.10363516956567764,-0.07028932124376297]   |
|about     |[-0.1500445008277893,-0.049380943179130554,0.03307584300637245]  |
|Java      |[-0.04074851796030998,0.02809843420982361,-0.16281810402870178]  |
|wish      |[0.11882393807172775,0.13347993791103363,0.14399205148220062]    |
+----------+-----------------------------------------------------------------+

So each word does have its own representation. However, when you pass a sentence (an array of strings) to the model's transform, the vectors of all the words in the sentence are averaged together.

From the GitHub implementation:

/**
 * Transform a sentence column to a vector column to represent the whole sentence. The transform
 * is performed by averaging all word vectors it contains.
 */
 @Since("2.0.0")
 override def transform(dataset: Dataset[_]): DataFrame = {
 ...

This can easily be confirmed, for example:

Text: [Logistic, regression, models, are, neat] => 
Vector: [-0.011055880039930344,0.020988055132329465,0.042608972638845444]

The first element is computed by taking the average of the first element of the vectors of the five involved words,

(-0.12702016532421112 + 0.1529865264892578 -0.05683165416121483 -0.16390761733055115 + 0.13949351012706757) / 5

which equals -0.011055880039930344.
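The same check can be done for all three components at once. The following is only a sketch in plain Java, without Spark: the class `SentenceMatrix` and its methods are hypothetical names, and the word vectors are copied by hand from the getVectors table above. It first builds the per-word matrix (one row per word, which is what the question was expecting) and then takes the column-wise mean.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class SentenceMatrix {
    // Word vectors copied from the getVectors table above (vector size 3).
    static Map<String, double[]> exampleVectors() {
        Map<String, double[]> v = new LinkedHashMap<>();
        v.put("Logistic", new double[]{-0.12702016532421112, 0.09839040040969849, -0.10370948910713196});
        v.put("regression", new double[]{0.1529865264892578, 0.060659825801849365, 0.07735282927751541});
        v.put("models", new double[]{-0.05683165416121483, 0.009706663899123669, -0.033789146691560745});
        v.put("are", new double[]{-0.16390761733055115, -0.14509087800979614, 0.11349033564329147});
        v.put("neat", new double[]{0.13949351012706757, 0.08127426356077194, 0.15970033407211304});
        return v;
    }

    // One row per word, in sentence order.
    // Assumes every word is in the vocabulary; an unknown word would leave a null row.
    static double[][] toMatrix(List<String> sentence, Map<String, double[]> vectors) {
        double[][] matrix = new double[sentence.size()][];
        for (int i = 0; i < sentence.size(); i++) {
            matrix[i] = vectors.get(sentence.get(i));
        }
        return matrix;
    }

    // Column-wise mean of the matrix, i.e. the average of the word vectors.
    static double[] columnMeans(double[][] matrix) {
        double[] mean = new double[matrix[0].length];
        for (double[] row : matrix) {
            for (int j = 0; j < mean.length; j++) {
                mean[j] += row[j];
            }
        }
        for (int j = 0; j < mean.length; j++) {
            mean[j] /= matrix.length;
        }
        return mean;
    }

    public static void main(String[] args) {
        List<String> sentence = Arrays.asList("Logistic", "regression", "models", "are", "neat");
        double[][] matrix = toMatrix(sentence, exampleVectors());
        for (int i = 0; i < matrix.length; i++) {
            System.out.println(sentence.get(i) + " -> " + Arrays.toString(matrix[i]));
        }
        System.out.println("mean -> " + Arrays.toString(columnMeans(matrix)));
    }
}
```

Up to floating-point rounding in the last digits, the printed mean agrees with the vector that transform returned for this sentence.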

answered Nov 13 '18 at 5:33

Shaido

11.9k112441

I'm sold but that does not make sense to me. Why the array is averaged? I mean the average has no value. Each input needs to be transformed into a matrix, not a vector. It seems to me this implementation is absolutely useless!
– Mehran
Nov 13 '18 at 13:42

1

@Mehran: Instead of inputting a sentence to transform, simply split it up into words beforehand and input the words separately. Then you will have a matrix.
– Shaido
Nov 14 '18 at 1:08

I believe what you mean is that each row should hold one word (a column of type String with exactly one element). While I thought of the same but I realized that this is not gonna work. You see, if you are going to pass the output of Word2Vec to an RNN (a common scenario) you are interested in sentences as input (words won't do). And if you split up each of your sentences into words beforehand, you cannot go back to sentences since you don't know where the previous sentence ends and where the next one starts. Again, to me, this implementation seems useless. Unless I'm missing something.
– Mehran
Nov 14 '18 at 1:22

1

@Mehran: Maybe you could have an id column marking which sentence the word belongs to (however, then the order will be lost). I don't think there is any easy way to do this natively in Spark... currently the implementation seems more focused on finding word synonyms and summarizing documents.
– Shaido
Nov 14 '18 at 2:28

1

@Mehran I concur (and +1 for your question), but IMHO this is no reason for not accepting a nice answer (which, at the end of the day, did answer the question and pointed out the reason for this behavior, irrespective of whether we like the rationale of the Spark people or not) - cheers...
– desertnaut
Nov 28 '18 at 18:24


2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

This is an attempt to justify the rationale of Spark here, and it should be read as a complement to the nice programming explanation already provided as an answer...

Starting from the practitioner community, we have:

How to concatenate word vectors to form sentence vector (SO answer):

There are at least three common ways to combine embedding vectors; (a)
summing, (b) summing & averaging or (c) concatenating. [...] See gensim.models.doc2vec.Doc2Vec, dm_concat and
dm_mean - it allows you to use any of those three options

Sentence2Vec : Evaluation of popular theories — Part I (Simple average of word vectors) (blog post):

So what’s first thing that comes to your mind when you have word
vectors and need to calculate sentence vector.

Just average them?

Yes that’s what we are going to do here.

Sentence2Vec (Github repo):

Word2Vec can help to find other words with similar semantic meaning.
However, Word2Vec can only take 1 word each time, while a sentence
consists of multiple words. To solve this, I write the Sentence2Vec,
which is actually a wrapper to Word2Vec. To obtain the vector of a
sentence, I simply get the averaged vector sum of each word in the
sentence.

It certainly seems that, at least for practitioners, this simple averaging of the individual word vectors is far from unexpected.

From Distributed Representations of Sentences and Documents (Le & Mikolov, Google, ICML 2014):

enter image description here

From NILC-USP at SemEval-2017 Task 4: A Multi-view Ensemble for Twitter Sentiment analysis (SemEval 2017, section 2.1.2):

enter image description here

edited Nov 28 '18 at 23:10

answered Nov 28 '18 at 23:04

desertnaut

16.6k63567

Thanks for such a well-thought answer. This is definitely a better-fit answer for the question asked. Please don't get me wrong but I still believe whoever made the choice to return a vector instead of a matrix has made a mistake. For one thing, going from a matrix to a vector (if they had returned the matrix) could easily be done in user code. All I'm saying is that they've implemented a really practical algorithm and at the same time, they've ruined it completely by averaging the result. I think they don't know what they have done or someone has made a mistake. Thanks, again.
– Mehran
Nov 28 '18 at 23:30

@Mehran you are very welcome; and personally, I don't trust the Spark ML people that they know what they are doing (checked my blog post?)
– desertnaut
Nov 28 '18 at 23:36

No, I have not read your post yet. Hopefully, this weekend.
– Mehran
Nov 29 '18 at 2:33

add a comment |

This is an attempt to justify the rationale of Spark here, and it should be read as a complement to the nice programming explanation already provided as an answer...

Starting from the practitioner community, we have:

How to concatenate word vectors to form sentence vector (SO answer):

There are at least three common ways to combine embedding vectors; (a)
summing, (b) summing & averaging or (c) concatenating. [...] See gensim.models.doc2vec.Doc2Vec, dm_concat and
dm_mean - it allows you to use any of those three options

Sentence2Vec : Evaluation of popular theories — Part I (Simple average of word vectors) (blog post):

So what’s first thing that comes to your mind when you have word
vectors and need to calculate sentence vector.

Just average them?

Yes that’s what we are going to do here.

Sentence2Vec (Github repo):

Word2Vec can help to find other words with similar semantic meaning.
However, Word2Vec can only take 1 word each time, while a sentence
consists of multiple words. To solve this, I write the Sentence2Vec,
which is actually a wrapper to Word2Vec. To obtain the vector of a
sentence, I simply get the averaged vector sum of each word in the
sentence.

It certainly seems that, at least for practitioners, this simple averaging of the individual word vectors is far from unexpected.

From Distributed Representations of Sentences and Documents (Le & Mikolov, Google, ICML 2014):

enter image description here

From NILC-USP at SemEval-2017 Task 4: A Multi-view Ensemble for Twitter Sentiment analysis (SemEval 2017, section 2.1.2):

enter image description here

edited Nov 28 '18 at 23:10

answered Nov 28 '18 at 23:04

desertnaut

16.6k63567

Thanks for such a well-thought answer. This is definitely a better-fit answer for the question asked. Please don't get me wrong but I still believe whoever made the choice to return a vector instead of a matrix has made a mistake. For one thing, going from a matrix to a vector (if they had returned the matrix) could easily be done in user code. All I'm saying is that they've implemented a really practical algorithm and at the same time, they've ruined it completely by averaging the result. I think they don't know what they have done or someone has made a mistake. Thanks, again.
– Mehran
Nov 28 '18 at 23:30

@Mehran you are very welcome; and personally, I don't trust the Spark ML people that they know what they are doing (checked my blog post?)
– desertnaut
Nov 28 '18 at 23:36

No, I have not read your post yet. Hopefully, this weekend.
– Mehran
Nov 29 '18 at 2:33

add a comment |

This is an attempt to justify the rationale of Spark here, and it should be read as a complement to the nice programming explanation already provided as an answer...

Starting from the practitioner community, we have:

How to concatenate word vectors to form sentence vector (SO answer):

There are at least three common ways to combine embedding vectors; (a)
summing, (b) summing & averaging or (c) concatenating. [...] See gensim.models.doc2vec.Doc2Vec, dm_concat and
dm_mean - it allows you to use any of those three options

Sentence2Vec : Evaluation of popular theories — Part I (Simple average of word vectors) (blog post):

So what’s first thing that comes to your mind when you have word
vectors and need to calculate sentence vector.

Just average them?

Yes that’s what we are going to do here.

Sentence2Vec (Github repo):

Word2Vec can help to find other words with similar semantic meaning.
However, Word2Vec can only take 1 word each time, while a sentence
consists of multiple words. To solve this, I write the Sentence2Vec,
which is actually a wrapper to Word2Vec. To obtain the vector of a
sentence, I simply get the averaged vector sum of each word in the
sentence.

It certainly seems that, at least for practitioners, this simple averaging of the individual word vectors is far from unexpected.

From Distributed Representations of Sentences and Documents (Le & Mikolov, Google, ICML 2014):

enter image description here

From NILC-USP at SemEval-2017 Task 4: A Multi-view Ensemble for Twitter Sentiment analysis (SemEval 2017, section 2.1.2):

enter image description here

edited Nov 28 '18 at 23:10

answered Nov 28 '18 at 23:04

desertnaut

16.6k63567

This is an attempt to justify the rationale of Spark here, and it should be read as a complement to the nice programming explanation already provided as an answer...

Starting from the practitioner community, we have:

How to concatenate word vectors to form sentence vector (SO answer):

There are at least three common ways to combine embedding vectors; (a)
summing, (b) summing & averaging or (c) concatenating. [...] See gensim.models.doc2vec.Doc2Vec, dm_concat and
dm_mean - it allows you to use any of those three options

Sentence2Vec : Evaluation of popular theories — Part I (Simple average of word vectors) (blog post):

So what’s first thing that comes to your mind when you have word
vectors and need to calculate sentence vector.

Just average them?

Yes that’s what we are going to do here.

Sentence2Vec (Github repo):

Word2Vec can help to find other words with similar semantic meaning.
However, Word2Vec can only take 1 word each time, while a sentence
consists of multiple words. To solve this, I write the Sentence2Vec,
which is actually a wrapper to Word2Vec. To obtain the vector of a
sentence, I simply get the averaged vector sum of each word in the
sentence.

It certainly seems that, at least for practitioners, this simple averaging of the individual word vectors is far from unexpected.

From Distributed Representations of Sentences and Documents (Le & Mikolov, Google, ICML 2014):

enter image description here

From NILC-USP at SemEval-2017 Task 4: A Multi-view Ensemble for Twitter Sentiment analysis (SemEval 2017, section 2.1.2):

enter image description here

edited Nov 28 '18 at 23:10

answered Nov 28 '18 at 23:04

desertnaut

16.6k63567

edited Nov 28 '18 at 23:10

answered Nov 28 '18 at 23:04

desertnaut

16.6k63567

answered Nov 28 '18 at 23:04

desertnaut

16.6k63567

answered Nov 28 '18 at 23:04

desertnaut

16.6k63567

Thanks for such a well-thought answer. This is definitely a better-fit answer for the question asked. Please don't get me wrong but I still believe whoever made the choice to return a vector instead of a matrix has made a mistake. For one thing, going from a matrix to a vector (if they had returned the matrix) could easily be done in user code. All I'm saying is that they've implemented a really practical algorithm and at the same time, they've ruined it completely by averaging the result. I think they don't know what they have done or someone has made a mistake. Thanks, again.
– Mehran
Nov 28 '18 at 23:30

@Mehran you are very welcome; and personally, I don't trust the Spark ML people that they know what they are doing (checked my blog post?)
– desertnaut
Nov 28 '18 at 23:36

No, I have not read your post yet. Hopefully, this weekend.
– Mehran
Nov 29 '18 at 2:33


To see the vector corresponding to each word you can run model.getVectors. For the dataframe in the question (with a vector size of 3 instead of 7) this gives:

+----------+-----------------------------------------------------------------+
|word |vector |
+----------+-----------------------------------------------------------------+
|heard |[0.14950960874557495,-0.11237259954214096,-0.03993036597967148] |
|are |[-0.16390761733055115,-0.14509087800979614,0.11349033564329147] |
|neat |[0.13949351012706757,0.08127426356077194,0.15970033407211304] |
|classes |[0.03703496977686882,0.05841822177171707,-0.02267565205693245] |
|I |[-0.018915412947535515,-0.13099457323551178,0.14300788938999176] |
|regression|[0.1529865264892578,0.060659825801849365,0.07735282927751541] |
|Logistic |[-0.12702016532421112,0.09839040040969849,-0.10370948910713196] |
|Spark |[-0.053579315543174744,0.14673036336898804,-0.002033260650932789]|
|could |[0.12216471135616302,-0.031169598922133446,-0.1427609771490097] |
|use |[0.08246973901987076,0.002503493567928672,-0.0796264186501503] |
|Hi |[0.16548289358615875,0.06477408856153488,0.09229831397533417] |
|models |[-0.05683165416121483,0.009706663899123669,-0.033789146691560745]|
|case |[0.11626788973808289,0.10363516956567764,-0.07028932124376297] |
|about |[-0.1500445008277893,-0.049380943179130554,0.03307584300637245] |
|Java |[-0.04074851796030998,0.02809843420982361,-0.16281810402870178] |
|wish |[0.11882393807172775,0.13347993791103363,0.14399205148220062] |
+----------+-----------------------------------------------------------------+

From the github implementation:

/**
 * Transform a sentence column to a vector column to represent the whole sentence. The transform
 * is performed by averaging all word vectors it contains.
 */
@Since("2.0.0")
override def transform(dataset: Dataset[_]): DataFrame = {
  ...

This can easily be confirmed, for example:

Text: [Logistic, regression, models, are, neat] => 
Vector: [-0.011055880039930344,0.020988055132329465,0.042608972638845444]

The first element is the average of the first elements of the five word vectors involved,

(-0.12702016532421112 + 0.1529865264892578 - 0.05683165416121483 - 0.16390761733055115 + 0.13949351012706757) / 5

which equals -0.011055880039930344. The second and third elements follow from the same calculation.
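For anyone who wants to check all three elements at once, here is a minimal sketch in plain Java (no Spark needed). It hard-codes the five word vectors copied from the table above and averages them element-wise. The class name `AverageWordVectors`, the constant `LOGISTIC_SENTENCE` and the method `average` are my own names for illustration, not part of the Spark API.

```java
import java.util.Arrays;

public class AverageWordVectors {

    // Word vectors copied from the model.getVectors output above,
    // in sentence order: Logistic, regression, models, are, neat.
    public static final double[][] LOGISTIC_SENTENCE = {
        {-0.12702016532421112, 0.09839040040969849, -0.10370948910713196},
        {0.1529865264892578, 0.060659825801849365, 0.07735282927751541},
        {-0.05683165416121483, 0.009706663899123669, -0.033789146691560745},
        {-0.16390761733055115, -0.14509087800979614, 0.11349033564329147},
        {0.13949351012706757, 0.08127426356077194, 0.15970033407211304}
    };

    // Element-wise mean of a non-empty set of equally sized vectors.
    public static double[] average(double[][] vectors) {
        int dim = vectors[0].length;
        double[] mean = new double[dim];
        for (double[] v : vectors) {
            for (int i = 0; i < dim; i++) {
                mean[i] += v[i];
            }
        }
        for (int i = 0; i < dim; i++) {
            mean[i] /= vectors.length;
        }
        return mean;
    }

    public static void main(String[] args) {
        // Should agree with the sentence vector shown above,
        // up to floating-point rounding in the last digits.
        System.out.println(Arrays.toString(average(LOGISTIC_SENTENCE)));
    }
}
```

The printed values should match the `Vector:` line above to within rounding, which is consistent with the doc comment saying the transform averages the word vectors.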

answered Nov 13 '18 at 5:33

Shaido

11.9k112441

I'm sold, but that does not make sense to me. Why is the array averaged? I mean, the average has no value. Each input needs to be transformed into a matrix, not a vector. It seems to me this implementation is absolutely useless!
– Mehran
Nov 13 '18 at 13:42

@Mehran: Instead of inputting a sentence to transform, simply split it up into words beforehand and input the words separately. Then you will have a matrix.
– Shaido
Nov 14 '18 at 1:08

I believe what you mean is that each row should hold one word (a column of type array of String with exactly one element). I thought of the same, but I realized that this is not going to work. You see, if you are going to pass the output of Word2Vec to an RNN (a common scenario), you are interested in sentences as input (words won't do). And if you split up each of your sentences into words beforehand, you cannot go back to sentences, since you don't know where the previous sentence ends and where the next one starts. Again, to me, this implementation seems useless. Unless I'm missing something.
– Mehran
Nov 14 '18 at 1:22

@Mehran: Maybe you could have an id column marking which sentence each word belongs to (however, the order will then be lost). I don't think there is an easy way to do this natively in Spark... currently the implementation seems more focused on finding word synonyms and summarizing documents.
– Shaido
Nov 14 '18 at 2:28

@Mehran I concur (and +1 for your question), but IMHO this is no reason for not accepting a nice answer (which, at the end of the day, did answer the question and pointed out the reason for this behavior, irrespective of whether we like the rationale of the Spark people or not) - cheers...
– desertnaut
Nov 28 '18 at 18:24

