Why does Spark's Word2Vec return a Vector?
Running Spark's example for Word2Vec, I realized that it takes in an array of strings and gives out a vector. My question is, shouldn't it return a matrix instead of a vector? I was expecting one vector per input word. But it returns one vector, period!
Or maybe it should have accepted a string (one word) as input, instead of an array of strings. Then, yeah sure, it could return one vector as output. But accepting an array of strings and returning one single vector does not make sense to me.
[UPDATE]
Per @Shaido's request, here's the code with my minor change to print the schema for the output:
import java.util.Arrays;
import java.util.List;

import org.apache.spark.ml.feature.Word2Vec;
import org.apache.spark.ml.feature.Word2VecModel;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.*;

public class JavaWord2VecExample {
  public static void main(String[] args) {
    SparkSession spark = SparkSession
      .builder()
      .appName("JavaWord2VecExample")
      .getOrCreate();

    // $example on$
    // Input data: Each row is a bag of words from a sentence or document.
    List<Row> data = Arrays.asList(
      RowFactory.create(Arrays.asList("Hi I heard about Spark".split(" "))),
      RowFactory.create(Arrays.asList("I wish Java could use case classes".split(" "))),
      RowFactory.create(Arrays.asList("Logistic regression models are neat".split(" ")))
    );
    StructType schema = new StructType(new StructField[]{
      new StructField("text", new ArrayType(DataTypes.StringType, true), false, Metadata.empty())
    });
    Dataset<Row> documentDF = spark.createDataFrame(data, schema);

    // Learn a mapping from words to Vectors.
    Word2Vec word2Vec = new Word2Vec()
      .setInputCol("text")
      .setOutputCol("result")
      .setVectorSize(7)
      .setMinCount(0);

    Word2VecModel model = word2Vec.fit(documentDF);
    Dataset<Row> result = model.transform(documentDF);

    for (Row row : result.collectAsList()) {
      List<String> text = row.getList(0);
      // Minor change: print the schema of each output row.
      System.out.println("Schema: " + row.schema());
      Vector vector = (Vector) row.get(1);
      System.out.println("Text: " + text + " => \nVector: " + vector + "\n");
    }
    // $example off$

    spark.stop();
  }
}
And it prints:
Schema: StructType(StructField(text,ArrayType(StringType,true),false), StructField(result,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
Text: [Hi, I, heard, about, Spark] =>
Vector: [-0.0033279924420639875,-0.0024428479373455048,0.01406305879354477,0.030621735751628878,0.00792500376701355,0.02839711122214794,-0.02286271695047617]
Schema: StructType(StructField(text,ArrayType(StringType,true),false), StructField(result,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
Text: [I, wish, Java, could, use, case, classes] =>
Vector: [-9.96453288410391E-4,-0.013741840076233658,0.013064394239336252,-0.01155538750546319,-0.010510949650779366,0.004538436819400106,-0.0036846946126648356]
Schema: StructType(StructField(text,ArrayType(StringType,true),false), StructField(result,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true))
Text: [Logistic, regression, models, are, neat] =>
Vector: [0.012510885251685977,-0.014472834207117558,0.002779599279165268,0.0022389178164303304,0.012743516173213721,-0.02409198731184006,0.017409833287820222]
Please correct me if I'm wrong, but the input is an array of strings and the output is a single vector. And I was expecting each word to be mapped into a vector.
java apache-spark word2vec apache-spark-ml
edited Nov 13 '18 at 5:34 by Shaido
asked Nov 13 '18 at 2:08 by Mehran
2 Answers
This is an attempt to justify the rationale of Spark here, and it should be read as a complement to the nice programming explanation already provided as an answer...
To start with, how exactly individual word embeddings should be combined is not in principle a feature of the Word2Vec model itself (which is about, well, individual words), but an issue of concern to "higher order" models, such as Sentence2Vec, Paragraph2Vec, Doc2Vec, Wikipedia2Vec etc (you could name a few more, I guess...).
Having said that, it turns out that a very first approach to combining word vectors in order to get vector representations of larger pieces of text (phrases, sentences, tweets etc.) is indeed to simply average the vector representations of the constituent words, as Spark ML does.
Starting from the practitioner community, we have:
How to concatenate word vectors to form sentence vector (SO answer):
There are at least three common ways to combine embedding vectors; (a)
summing, (b) summing & averaging or (c) concatenating. [...] See gensim.models.doc2vec.Doc2Vec, dm_concat and
dm_mean - it allows you to use any of those three options
Sentence2Vec : Evaluation of popular theories — Part I (Simple average of word vectors) (blog post):
So what’s first thing that comes to your mind when you have word
vectors and need to calculate sentence vector.
Just average them?
Yes that’s what we are going to do here.
Sentence2Vec (Github repo):
Word2Vec can help to find other words with similar semantic meaning.
However, Word2Vec can only take 1 word each time, while a sentence
consists of multiple words. To solve this, I write the Sentence2Vec,
which is actually a wrapper to Word2Vec. To obtain the vector of a
sentence, I simply get the averaged vector sum of each word in the
sentence.
It certainly seems that, at least for practitioners, this simple averaging of the individual word vectors is far from unexpected.
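The three combination strategies named in the first quote (summing, averaging, concatenating) can be sketched in a few lines of plain Java. This is only an illustration under simple assumptions: the class and method names are hypothetical, the toy vectors are made up, and nothing here is Spark or gensim code.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Spark or gensim: three ways to pool word vectors.
class CombineWordVectors {

    // (a) Element-wise sum of equally sized vectors.
    static double[] sum(List<double[]> vectors) {
        double[] out = new double[vectors.get(0).length];
        for (double[] v : vectors) {
            for (int i = 0; i < out.length; i++) {
                out[i] += v[i];
            }
        }
        return out;
    }

    // (b) Sum followed by division by the number of vectors.
    static double[] average(List<double[]> vectors) {
        double[] out = sum(vectors);
        for (int i = 0; i < out.length; i++) {
            out[i] /= vectors.size();
        }
        return out;
    }

    // (c) Concatenation in word order; the result grows with sentence length.
    static double[] concat(List<double[]> vectors) {
        int dim = vectors.get(0).length;
        double[] out = new double[dim * vectors.size()];
        for (int k = 0; k < vectors.size(); k++) {
            System.arraycopy(vectors.get(k), 0, out, k * dim, dim);
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy 2-dimensional "word vectors", made up for illustration.
        List<double[]> words = Arrays.asList(
            new double[] {1.0, 2.0},
            new double[] {3.0, 6.0});
        System.out.println(Arrays.toString(sum(words)));     // [4.0, 8.0]
        System.out.println(Arrays.toString(average(words))); // [2.0, 4.0]
        System.out.println(Arrays.toString(concat(words)));  // [1.0, 2.0, 3.0, 6.0]
    }
}
```

Note that only (a) and (b) give a fixed-size output regardless of sentence length, which is presumably part of why averaging is such a common first choice.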
An expected counter-argument here is that blog posts and SO answers are arguably not that credible sources; what about the researchers and the relevant scientific literature? Well, it turns out that this simple averaging is far from uncommon here, too:
From Distributed Representations of Sentences and Documents (Le & Mikolov, Google, ICML 2014):

From NILC-USP at SemEval-2017 Task 4: A Multi-view Ensemble for Twitter Sentiment analysis (SemEval 2017, section 2.1.2):

It should be clear by now that the particular design choice in Spark ML is far from arbitrary, or even uncommon; I have blogged about what certainly seem to be absurd design choices in Spark ML (see Classification in Spark 2.0: “Input validation failed” and other wondrous tales), but it seems that this is not such a case...
Thanks for such a well-thought answer. This is definitely a better-fit answer for the question asked. Please don't get me wrong but I still believe whoever made the choice to return a vector instead of a matrix has made a mistake. For one thing, going from a matrix to a vector (if they had returned the matrix) could easily be done in user code. All I'm saying is that they've implemented a really practical algorithm and at the same time, they've ruined it completely by averaging the result. I think they don't know what they have done or someone has made a mistake. Thanks, again.
– Mehran
Nov 28 '18 at 23:30
@Mehran you are very welcome; and personally, I don't trust the Spark ML people that they know what they are doing (checked my blog post?)
– desertnaut
Nov 28 '18 at 23:36
No, I have not read your post yet. Hopefully, this weekend.
– Mehran
Nov 29 '18 at 2:33
To see the vector corresponding to each word you can run model.getVectors. For the dataframe in the question (with a vector size of 3 instead of 7) this gives:
+----------+-----------------------------------------------------------------+
|word |vector |
+----------+-----------------------------------------------------------------+
|heard |[0.14950960874557495,-0.11237259954214096,-0.03993036597967148] |
|are |[-0.16390761733055115,-0.14509087800979614,0.11349033564329147] |
|neat |[0.13949351012706757,0.08127426356077194,0.15970033407211304] |
|classes |[0.03703496977686882,0.05841822177171707,-0.02267565205693245] |
|I |[-0.018915412947535515,-0.13099457323551178,0.14300788938999176] |
|regression|[0.1529865264892578,0.060659825801849365,0.07735282927751541] |
|Logistic |[-0.12702016532421112,0.09839040040969849,-0.10370948910713196] |
|Spark |[-0.053579315543174744,0.14673036336898804,-0.002033260650932789]|
|could |[0.12216471135616302,-0.031169598922133446,-0.1427609771490097] |
|use |[0.08246973901987076,0.002503493567928672,-0.0796264186501503] |
|Hi |[0.16548289358615875,0.06477408856153488,0.09229831397533417] |
|models |[-0.05683165416121483,0.009706663899123669,-0.033789146691560745]|
|case |[0.11626788973808289,0.10363516956567764,-0.07028932124376297] |
|about |[-0.1500445008277893,-0.049380943179130554,0.03307584300637245] |
|Java |[-0.04074851796030998,0.02809843420982361,-0.16281810402870178] |
|wish |[0.11882393807172775,0.13347993791103363,0.14399205148220062] |
+----------+-----------------------------------------------------------------+
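If what you want is one vector per word for a whole sentence (the "matrix" from the question), a per-word lookup is enough. The following is a minimal sketch in plain Java, assuming the word/vector pairs from the table have already been collected into a map; in a real job that map would presumably be filled from the output of model.getVectors. The class SentenceMatrix and its methods are hypothetical names, not a Spark API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not a Spark API: maps a tokenized sentence to one vector per word.
class SentenceMatrix {

    // Returns one row per known word, in sentence order.
    static double[][] toMatrix(List<String> words, Map<String, double[]> lookup) {
        List<double[]> rows = new ArrayList<>();
        for (String word : words) {
            double[] vector = lookup.get(word);
            if (vector != null) { // skip words that are not in the vocabulary
                rows.add(vector);
            }
        }
        return rows.toArray(new double[0][]);
    }

    // A few entries copied from the table above (vector size 3).
    static Map<String, double[]> sampleLookup() {
        Map<String, double[]> lookup = new HashMap<>();
        lookup.put("Hi", new double[] {0.16548289358615875, 0.06477408856153488, 0.09229831397533417});
        lookup.put("I", new double[] {-0.018915412947535515, -0.13099457323551178, 0.14300788938999176});
        lookup.put("heard", new double[] {0.14950960874557495, -0.11237259954214096, -0.03993036597967148});
        lookup.put("about", new double[] {-0.1500445008277893, -0.049380943179130554, 0.03307584300637245});
        lookup.put("Spark", new double[] {-0.053579315543174744, 0.14673036336898804, -0.002033260650932789});
        return lookup;
    }

    public static void main(String[] args) {
        List<String> sentence = Arrays.asList("Hi", "I", "heard", "about", "Spark");
        double[][] matrix = toMatrix(sentence, sampleLookup());
        for (double[] row : matrix) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Because the rows follow the order of the input list, the word order of the sentence is preserved, which matters if the result is fed to a sequence model.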
So each word does have its own representation. However, what happens when you input a sentence (array of strings) to the model is that all the vectors of the words in the sentence get averaged together.
From the GitHub implementation:
/**
* Transform a sentence column to a vector column to represent the whole sentence. The transform
* is performed by averaging all word vectors it contains.
*/
@Since("2.0.0")
override def transform(dataset: Dataset[_]): DataFrame = {
...
This can easily be confirmed, for example:
Text: [Logistic, regression, models, are, neat] =>
Vector: [-0.011055880039930344,0.020988055132329465,0.042608972638845444]
The first element is computed by taking the average of the first element of the vectors of the five involved words,
(-0.12702016532421112 + 0.1529865264892578 -0.05683165416121483 -0.16390761733055115 + 0.13949351012706757) / 5
which equals -0.011055880039930344.
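The same check can be done for all three components at once. Below is a small plain-Java sketch (the class AverageCheck is a hypothetical name, and this is not Spark code) that takes the five word vectors from the table above and computes their column-wise mean.

```java
import java.util.Arrays;

// Hypothetical check, not Spark code: reproduces the averaged sentence vector by hand.
class AverageCheck {

    // Column-wise mean of the rows (one row per word).
    static double[] columnMean(double[][] rows) {
        double[] mean = new double[rows[0].length];
        for (double[] row : rows) {
            for (int i = 0; i < mean.length; i++) {
                mean[i] += row[i];
            }
        }
        for (int i = 0; i < mean.length; i++) {
            mean[i] /= rows.length;
        }
        return mean;
    }

    // Word vectors copied from the getVectors table, in sentence order.
    static double[][] sentenceRows() {
        return new double[][] {
            {-0.12702016532421112, 0.09839040040969849, -0.10370948910713196},   // Logistic
            {0.1529865264892578, 0.060659825801849365, 0.07735282927751541},     // regression
            {-0.05683165416121483, 0.009706663899123669, -0.033789146691560745}, // models
            {-0.16390761733055115, -0.14509087800979614, 0.11349033564329147},   // are
            {0.13949351012706757, 0.08127426356077194, 0.15970033407211304}      // neat
        };
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(columnMean(sentenceRows())));
    }
}
```

Up to floating-point rounding, the printed values agree with the vector shown for [Logistic, regression, models, are, neat].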
I'm sold but that does not make sense to me. Why is the array averaged? I mean the average has no value. Each input needs to be transformed into a matrix, not a vector. It seems to me this implementation is absolutely useless!
– Mehran
Nov 13 '18 at 13:42
@Mehran: Instead of inputting a sentence to transform, simply split it up into words beforehand and input the words separately. Then you will have a matrix.
– Shaido
Nov 14 '18 at 1:08
I believe what you mean is that each row should hold one word (a column of type String with exactly one element). I thought of the same, but I realized that this is not going to work. You see, if you are going to pass the output of Word2Vec to an RNN (a common scenario) you are interested in sentences as input (words won't do). And if you split up each of your sentences into words beforehand, you cannot go back to sentences since you don't know where the previous sentence ends and where the next one starts. Again, to me, this implementation seems useless. Unless I'm missing something.
– Mehran
Nov 14 '18 at 1:22
@Mehran: Maybe you could have an id column marking which sentence the word belongs to (however, then the order will be lost). I don't think there is any easy way to do this natively in Spark... currently the implementation seems more focused on finding word synonyms and summarizing documents.
– Shaido
Nov 14 '18 at 2:28
@Mehran I concur (and +1 for your question), but IMHO this is not reason for not accepting a nice answer (which, at the end of the day, did answer the question and pointed out the reason of this behavior, irrespectively if we like the rationale of Spark people or not) - cheers...
– desertnaut
Nov 28 '18 at 18:24
|
show 3 more comments
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53272749%2fwhy-does-sparks-word2vec-return-a-vector%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
2 Answers
2
active
oldest
votes
2 Answers
2
active
oldest
votes
active
oldest
votes
active
oldest
votes
This is an attempt to justify the rationale of Spark here, and it should be read as a complement to the nice programming explanation already provided as an answer...
To start with, how exactly individual word embeddings should be combined is not in principle a feature of the Word2Vec model itself (which is about, well, individual words), but an issue of concern to "higher order" models, such as Sentence2Vec, Paragraph2Vec, Doc2Vec, Wikipedia2Vec etc (you could name a few more, I guess...).
Having said that, it turns out indeed that a very first approach in combining word vectors in order to get vector representations of larger pieces of text (phrases, sentences, tweets etc) is indeed to simply average the vector representations of the constituent words, as Spark ML does.
Starting from the practitioner community, we have:
How to concatenate word vectors to form sentence vector (SO answer):
There are at least three common ways to combine embedding vectors; (a)
summing, (b) summing & averaging or (c) concatenating. [...] Seegensim.models.doc2vec.Doc2Vec,dm_concatand
dm_mean- it allows you to use any of those three options
Sentence2Vec : Evaluation of popular theories — Part I (Simple average of word vectors) (blog post):
So what’s first thing that comes to your mind when you have word
vectors and need to calculate sentence vector.
Just average them?
Yes that’s what we are going to do here.
Sentence2Vec (Github repo):
Word2Vec can help to find other words with similar semantic meaning.
However, Word2Vec can only take 1 word each time, while a sentence
consists of multiple words. To solve this, I write the Sentence2Vec,
which is actually a wrapper to Word2Vec. To obtain the vector of a
sentence, I simply get the averaged vector sum of each word in the
sentence.
It certainly seems that, at least for practitioners, this simple averaging of the individual word vectors is far from unexpected.
An expected counter-argument here is that blog posts and SO answers are arguably not that credible sources; what about the researchers and the relevant scientific literature? Well, it turns out that this simple averaging is far from uncommon here, too:
From Distributed Representations of Sentences and Documents (Le & Mikolov, Google, ICML 2014):

From NILC-USP at SemEval-2017 Task 4: A Multi-view Ensemble for Twitter Sentiment analysis (SemEval 2017, section 2.1.2):

It should be clear by now that the particular design choice in Spark ML is far from arbitrary, or even uncommon; I have blogged about what certainly seem as absurd design choices in Spark ML (see Classification in Spark 2.0: “Input validation failed” and other wondrous tales), but it seems that this is not such a case...
Thanks for such a well-thought answer. This is definitely a better-fit answer for the question asked. Please don't get me wrong but I still believe whoever made the choice to return a vector instead of a matrix has made a mistake. For one thing, going from a matrix to a vector (if they had returned the matrix) could easily be done in user code. All I'm saying is that they've implemented a really practical algorithm and at the same time, they've ruined it completely by averaging the result. I think they don't know what they have done or someone has made a mistake. Thanks, again.
– Mehran
Nov 28 '18 at 23:30
@Mehran you are very welcome; and personally, I don't trust the Spark ML people that they know what they are doing (checked my blog post?)
– desertnaut
Nov 28 '18 at 23:36
No, I have not read your post yet. Hopefully, this weekend.
– Mehran
Nov 29 '18 at 2:33
add a comment |
This is an attempt to justify the rationale of Spark here, and it should be read as a complement to the nice programming explanation already provided as an answer...
To start with, how exactly individual word embeddings should be combined is not in principle a feature of the Word2Vec model itself (which is about, well, individual words), but an issue of concern to "higher order" models, such as Sentence2Vec, Paragraph2Vec, Doc2Vec, Wikipedia2Vec etc (you could name a few more, I guess...).
Having said that, it turns out indeed that a very first approach in combining word vectors in order to get vector representations of larger pieces of text (phrases, sentences, tweets etc) is indeed to simply average the vector representations of the constituent words, as Spark ML does.
Starting from the practitioner community, we have:
How to concatenate word vectors to form sentence vector (SO answer):
There are at least three common ways to combine embedding vectors; (a)
summing, (b) summing & averaging or (c) concatenating. [...] Seegensim.models.doc2vec.Doc2Vec,dm_concatand
dm_mean- it allows you to use any of those three options
Sentence2Vec : Evaluation of popular theories — Part I (Simple average of word vectors) (blog post):
So what’s first thing that comes to your mind when you have word
vectors and need to calculate sentence vector.
Just average them?
Yes that’s what we are going to do here.
Sentence2Vec (Github repo):
Word2Vec can help to find other words with similar semantic meaning.
However, Word2Vec can only take 1 word each time, while a sentence
consists of multiple words. To solve this, I write the Sentence2Vec,
which is actually a wrapper to Word2Vec. To obtain the vector of a
sentence, I simply get the averaged vector sum of each word in the
sentence.
It certainly seems that, at least for practitioners, this simple averaging of the individual word vectors is far from unexpected.
An expected counter-argument here is that blog posts and SO answers are arguably not that credible sources; what about the researchers and the relevant scientific literature? Well, it turns out that this simple averaging is far from uncommon here, too:
From Distributed Representations of Sentences and Documents (Le & Mikolov, Google, ICML 2014):

From NILC-USP at SemEval-2017 Task 4: A Multi-view Ensemble for Twitter Sentiment analysis (SemEval 2017, section 2.1.2):

It should be clear by now that the particular design choice in Spark ML is far from arbitrary, or even uncommon; I have blogged about what certainly seem as absurd design choices in Spark ML (see Classification in Spark 2.0: “Input validation failed” and other wondrous tales), but it seems that this is not such a case...
Thanks for such a well-thought answer. This is definitely a better-fit answer for the question asked. Please don't get me wrong but I still believe whoever made the choice to return a vector instead of a matrix has made a mistake. For one thing, going from a matrix to a vector (if they had returned the matrix) could easily be done in user code. All I'm saying is that they've implemented a really practical algorithm and at the same time, they've ruined it completely by averaging the result. I think they don't know what they have done or someone has made a mistake. Thanks, again.
– Mehran
Nov 28 '18 at 23:30
@Mehran you are very welcome; and personally, I don't trust the Spark ML people that they know what they are doing (checked my blog post?)
– desertnaut
Nov 28 '18 at 23:36
No, I have not read your post yet. Hopefully, this weekend.
– Mehran
Nov 29 '18 at 2:33
add a comment |
This is an attempt to justify the rationale of Spark here, and it should be read as a complement to the nice programming explanation already provided as an answer...
This is an attempt to justify the rationale of Spark here, and it should be read as a complement to the nice programming explanation already provided as an answer...
To start with, how exactly individual word embeddings should be combined is not in principle a feature of the Word2Vec model itself (which is about, well, individual words), but an issue of concern to "higher order" models, such as Sentence2Vec, Paragraph2Vec, Doc2Vec, Wikipedia2Vec etc (you could name a few more, I guess...).
Having said that, it turns out that a very common first approach to combining word vectors in order to get vector representations of larger pieces of text (phrases, sentences, tweets etc) is indeed to simply average the vector representations of the constituent words, as Spark ML does.
Starting from the practitioner community, we have:
How to concatenate word vectors to form sentence vector (SO answer):
There are at least three common ways to combine embedding vectors; (a)
summing, (b) summing & averaging or (c) concatenating. [...] See gensim.models.doc2vec.Doc2Vec, dm_concat and
dm_mean - it allows you to use any of those three options
Sentence2Vec : Evaluation of popular theories — Part I (Simple average of word vectors) (blog post):
So what’s first thing that comes to your mind when you have word
vectors and need to calculate sentence vector.
Just average them?
Yes that’s what we are going to do here.
Sentence2Vec (Github repo):
Word2Vec can help to find other words with similar semantic meaning.
However, Word2Vec can only take 1 word each time, while a sentence
consists of multiple words. To solve this, I write the Sentence2Vec,
which is actually a wrapper to Word2Vec. To obtain the vector of a
sentence, I simply get the averaged vector sum of each word in the
sentence.
It certainly seems that, at least for practitioners, this simple averaging of the individual word vectors is far from unexpected.
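To make the three combination strategies mentioned in the quotes above concrete, here is a minimal plain-Java sketch (no Spark dependency). The class name, method names and the toy two-dimensional vectors are made up for illustration; this is not how Spark or gensim implement it internally, only the arithmetic involved.

```java
import java.util.Arrays;
import java.util.List;

final class CombineWordVectors {

    // (a) Summing: element-wise sum of equally sized word vectors.
    static double[] sum(List<double[]> vectors) {
        double[] out = new double[vectors.get(0).length];
        for (double[] v : vectors) {
            for (int i = 0; i < out.length; i++) {
                out[i] += v[i];
            }
        }
        return out;
    }

    // (b) Summing & averaging: the sum divided by the number of words.
    static double[] mean(List<double[]> vectors) {
        double[] out = sum(vectors);
        for (int i = 0; i < out.length; i++) {
            out[i] /= vectors.size();
        }
        return out;
    }

    // (c) Concatenating: the vectors laid end to end, so the result
    // grows with the number of words instead of staying fixed-size.
    static double[] concat(List<double[]> vectors) {
        int total = 0;
        for (double[] v : vectors) {
            total += v.length;
        }
        double[] out = new double[total];
        int offset = 0;
        for (double[] v : vectors) {
            System.arraycopy(v, 0, out, offset, v.length);
            offset += v.length;
        }
        return out;
    }

    public static void main(String[] args) {
        // Toy "word vectors" for a three-word sentence (made-up values).
        List<double[]> words = Arrays.asList(
                new double[] {1.0, 2.0},
                new double[] {3.0, 4.0},
                new double[] {5.0, 6.0});
        System.out.println(Arrays.toString(sum(words)));    // [9.0, 12.0]
        System.out.println(Arrays.toString(mean(words)));   // [3.0, 4.0]
        System.out.println(Arrays.toString(concat(words))); // [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    }
}
```

Note that only (a) and (b) give a result whose size is independent of sentence length, which is presumably part of why averaging is such a popular default.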
An expected counter-argument here is that blog posts and SO answers are arguably not that credible sources; what about the researchers and the relevant scientific literature? Well, it turns out that this simple averaging is far from uncommon here, too:
From Distributed Representations of Sentences and Documents (Le & Mikolov, Google, ICML 2014):

From NILC-USP at SemEval-2017 Task 4: A Multi-view Ensemble for Twitter Sentiment analysis (SemEval 2017, section 2.1.2):

It should be clear by now that the particular design choice in Spark ML is far from arbitrary, or even uncommon; I have blogged about what certainly seem to be absurd design choices in Spark ML (see Classification in Spark 2.0: “Input validation failed” and other wondrous tales), but it seems that this is not such a case...
edited Nov 28 '18 at 23:10
answered Nov 28 '18 at 23:04
desertnaut
Thanks for such a well-thought answer. This is definitely a better-fit answer for the question asked. Please don't get me wrong but I still believe whoever made the choice to return a vector instead of a matrix has made a mistake. For one thing, going from a matrix to a vector (if they had returned the matrix) could easily be done in user code. All I'm saying is that they've implemented a really practical algorithm and at the same time, they've ruined it completely by averaging the result. I think they don't know what they have done or someone has made a mistake. Thanks, again.
– Mehran
Nov 28 '18 at 23:30
@Mehran you are very welcome; and personally, I don't trust the Spark ML people that they know what they are doing (checked my blog post?)
– desertnaut
Nov 28 '18 at 23:36
No, I have not read your post yet. Hopefully, this weekend.
– Mehran
Nov 29 '18 at 2:33
To see the vector corresponding to each word you can run model.getVectors. For the dataframe in the question (with a vector size of 3 instead of 7) this gives:
+----------+-----------------------------------------------------------------+
|word      |vector                                                           |
+----------+-----------------------------------------------------------------+
|heard     |[0.14950960874557495,-0.11237259954214096,-0.03993036597967148]  |
|are       |[-0.16390761733055115,-0.14509087800979614,0.11349033564329147]  |
|neat      |[0.13949351012706757,0.08127426356077194,0.15970033407211304]    |
|classes   |[0.03703496977686882,0.05841822177171707,-0.02267565205693245]   |
|I         |[-0.018915412947535515,-0.13099457323551178,0.14300788938999176] |
|regression|[0.1529865264892578,0.060659825801849365,0.07735282927751541]    |
|Logistic  |[-0.12702016532421112,0.09839040040969849,-0.10370948910713196]  |
|Spark     |[-0.053579315543174744,0.14673036336898804,-0.002033260650932789]|
|could     |[0.12216471135616302,-0.031169598922133446,-0.1427609771490097]  |
|use       |[0.08246973901987076,0.002503493567928672,-0.0796264186501503]   |
|Hi        |[0.16548289358615875,0.06477408856153488,0.09229831397533417]    |
|models    |[-0.05683165416121483,0.009706663899123669,-0.033789146691560745]|
|case      |[0.11626788973808289,0.10363516956567764,-0.07028932124376297]   |
|about     |[-0.1500445008277893,-0.049380943179130554,0.03307584300637245]  |
|Java      |[-0.04074851796030998,0.02809843420982361,-0.16281810402870178]  |
|wish      |[0.11882393807172775,0.13347993791103363,0.14399205148220062]    |
+----------+-----------------------------------------------------------------+
So each word does have its own representation. However, what happens when you input a sentence (array of strings) to the model is that all the vectors of the words in the sentence get averaged together.
From the github implementation:
/**
* Transform a sentence column to a vector column to represent the whole sentence. The transform
* is performed by averaging all word vectors it contains.
*/
@Since("2.0.0")
override def transform(dataset: Dataset[_]): DataFrame = {
...
This can easily be confirmed, for example:
Text: [Logistic, regression, models, are, neat] =>
Vector: [-0.011055880039930344,0.020988055132329465,0.042608972638845444]
The first element is computed by taking the average of the first elements of the vectors of the five words involved,
(-0.12702016532421112 + 0.1529865264892578 -0.05683165416121483 -0.16390761733055115 + 0.13949351012706757) / 5
which equals -0.011055880039930344.
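The same check can be repeated for all three components. The snippet below is a small plain-Java sketch (no Spark needed) that hard-codes the five word vectors from the table above and averages them; the class and method names are made up for illustration, and it assumes a non-empty sentence in which every word has a vector.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

final class SentenceAverageCheck {

    // Word vectors copied from the model.getVectors output above (vector size 3).
    static Map<String, double[]> tableVectors() {
        Map<String, double[]> v = new HashMap<>();
        v.put("Logistic", new double[] {-0.12702016532421112, 0.09839040040969849, -0.10370948910713196});
        v.put("regression", new double[] {0.1529865264892578, 0.060659825801849365, 0.07735282927751541});
        v.put("models", new double[] {-0.05683165416121483, 0.009706663899123669, -0.033789146691560745});
        v.put("are", new double[] {-0.16390761733055115, -0.14509087800979614, 0.11349033564329147});
        v.put("neat", new double[] {0.13949351012706757, 0.08127426356077194, 0.15970033407211304});
        return v;
    }

    // Element-wise average of the vectors of all words in the sentence.
    // Assumption: the sentence is non-empty and every word has a vector.
    static double[] average(String[] sentence, Map<String, double[]> vectors, int size) {
        double[] sum = new double[size];
        for (String word : sentence) {
            double[] v = vectors.get(word);
            if (v == null) {
                throw new IllegalArgumentException("no vector for word: " + word);
            }
            for (int i = 0; i < size; i++) {
                sum[i] += v[i];
            }
        }
        for (int i = 0; i < size; i++) {
            sum[i] /= sentence.length;
        }
        return sum;
    }

    public static void main(String[] args) {
        String[] sentence = {"Logistic", "regression", "models", "are", "neat"};
        // Should agree with the transform output shown above,
        // up to floating-point rounding in the last digits.
        System.out.println(Arrays.toString(average(sentence, tableVectors(), 3)));
    }
}
```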
answered Nov 13 '18 at 5:33
Shaido
I'm sold but that does not make sense to me. Why is the array averaged? I mean the average has no value. Each input needs to be transformed into a matrix, not a vector. It seems to me this implementation is absolutely useless!
– Mehran
Nov 13 '18 at 13:42
@Mehran: Instead of inputting a sentence to transform, simply split it up into words beforehand and input the words separately. Then you will have a matrix.
– Shaido
Nov 14 '18 at 1:08
I believe what you mean is that each row should hold one word (a column of type String with exactly one element). I thought of the same, but I realized that this is not going to work. You see, if you are going to pass the output of Word2Vec to an RNN (a common scenario) you are interested in sentences as input (words won't do). And if you split up each of your sentences into words beforehand, you cannot go back to sentences since you don't know where the previous sentence ends and where the next one starts. Again, to me, this implementation seems useless. Unless I'm missing something.
– Mehran
Nov 14 '18 at 1:22
@Mehran: Maybe you could have an id column marking which sentence the word belongs to (however, then the order will be lost). I don't think there is any easy way to do this natively in Spark... currently the implementation seems more focused on finding word synonyms and summarizing documents.
– Shaido
Nov 14 '18 at 2:28
@Mehran I concur (and +1 for your question), but IMHO this is not a reason for not accepting a nice answer (which, at the end of the day, did answer the question and pointed out the reason for this behavior, irrespective of whether we like the rationale of the Spark people or not) - cheers...
– desertnaut
Nov 28 '18 at 18:24
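For readers who need the per-word matrix discussed in the comments above (e.g. as input to an RNN), one possible workaround is to do the lookup in user code so that sentence boundaries and word order are preserved. The following is only a rough plain-Java sketch: it assumes the (word, vector) pairs from model.getVectors have already been collected into an in-memory map, and the names `SentenceMatrix`, `toMatrix` and the choice of a zero row for unknown words are hypothetical, not part of the Spark API.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class SentenceMatrix {

    // Builds one row per word, in sentence order.
    // Assumption: words without a vector are mapped to an all-zero row.
    static double[][] toMatrix(List<String> sentence, Map<String, double[]> lookup, int vectorSize) {
        double[][] matrix = new double[sentence.size()][];
        for (int i = 0; i < sentence.size(); i++) {
            double[] v = lookup.get(sentence.get(i));
            matrix[i] = (v != null) ? v.clone() : new double[vectorSize];
        }
        return matrix;
    }

    public static void main(String[] args) {
        // A few vectors copied from the model.getVectors output in the answer.
        Map<String, double[]> lookup = new HashMap<>();
        lookup.put("Hi", new double[] {0.16548289358615875, 0.06477408856153488, 0.09229831397533417});
        lookup.put("I", new double[] {-0.018915412947535515, -0.13099457323551178, 0.14300788938999176});
        lookup.put("heard", new double[] {0.14950960874557495, -0.11237259954214096, -0.03993036597967148});

        double[][] matrix = toMatrix(Arrays.asList("Hi", "I", "heard"), lookup, 3);
        for (double[] row : matrix) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Holding the whole vocabulary in memory is only reasonable for small models; for larger ones a distributed join on the word column would be needed, which is outside the scope of this sketch.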