Keras: tweets classification
Hello dear forum members,
I have a data set of 20 million randomly collected individual tweets (no two tweets come from the same account). Let me refer to it as the "general" data set. I also have a "specific" data set of 100,000 tweets collected from drug (opioid) abusers. Each tweet has at least one tag associated with it, e.g., opioids, addiction, overdose, hydrocodone, etc. (max 25 tags).
My goal is to train a model on the "specific" data set using Keras and then use it to tag tweets in the "general" data set, to identify tweets that might have been written by drug abusers.
Following the examples in source1 and source2, I managed to build a simple working version of such a model:
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelBinarizer
from sklearn.metrics import confusion_matrix
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.preprocessing import text
# load opioid-specific data set, where post is a tweet and tags is a single tag associated with a tweet
# how would I include multiple tags to be used in training?
data = pd.read_csv("filename.csv")
# make sure columns are strings (cast before splitting so the slices below use the cast values)
data['post'] = data['post'].astype(str)
data['tags'] = data['tags'].astype(str)
train_size = int(len(data) * .8)
train_posts = data['post'][:train_size]
train_tags = data['tags'][:train_size]
test_posts = data['post'][train_size:]
test_tags = data['tags'][train_size:]
# tokenize tweets
vocab_size = 100000 # what does vocabulary size really mean?
tokenize = text.Tokenizer(num_words=vocab_size)
tokenize.fit_on_texts(train_posts)
x_train = tokenize.texts_to_matrix(train_posts)
x_test = tokenize.texts_to_matrix(test_posts)
# labeling
# is this where I add more columns with tags for training?
encoder = LabelBinarizer()
encoder.fit(train_tags)
y_train = encoder.transform(train_tags)
y_test = encoder.transform(test_tags)
# model building
batch_size = 32
model = Sequential()
model.add(Dense(512, input_shape=(vocab_size,)))
model.add(Activation('relu'))
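# one output unit per distinct tag (the original hard-coded 1865 here)
# was: num_labels = np.max(y_train) + 1 # what does this +1 really mean?
num_labels = len(encoder.classes_)
model.add(Dense(num_labels))
model.add(Activation('softmax'))
# LabelBinarizer produces one-hot rows, so use categorical (not sparse) cross-entropy
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])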
history = model.fit(x_train, y_train, batch_size = batch_size, epochs = 5, verbose = 1, validation_split = 0.1)
# test prediction accuracy
score = model.evaluate(x_test, y_test,
                       batch_size=batch_size, verbose=1)
print('Test score:', score[0])
print('Test accuracy:', score[1])
# make predictions using a test set
for i in range(1000):
    prediction = model.predict(np.array([x_test[i]]))
    text_labels = encoder.classes_
    predicted_label = text_labels[np.argmax(prediction[0])]
    print(test_posts.iloc[i][:50], "...")
    print('Actual label:' + test_tags.iloc[i])
    print("Predicted label: " + predicted_label)
In order to move forward, I would like to clarify a few things:
- Let's say all my training tweets have a single tag, opioids. If I then pass the non-tagged tweets through the model, isn't it likely to tag all of them as opioids, since it doesn't know anything else? Should I use a variety of different tweets/tags for training, then? Are there any general guidelines for selecting tweets/tags for training?
- How can I add more columns with tags for training (not a single one like is used in the code)?
- Once I train the model and achieve appropriate accuracy, how do I pass non-tagged tweets through it to make predictions?
- How do I add a confusion matrix?
Any other relevant feedback is also greatly appreciated.
Thanks!
Examples of "general" tweets:
everybody messages me when im in class but never communicates on the weekends like this when im free. feels like that anyway lol.
i woke up late, and now i look like shit. im the type of person who will still be early to whatever, ill just look like i just woke up.
Examples of "specific" tweets:
$2 million grant to educate clinicians who prescribe opioids
early and regular marijuana use is associated with use of other illicit drugs, including opioids
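On the second question above (training with more than one tag per tweet), one common option is multi-hot encoding: each tweet gets a 0/1 vector with one position per known tag. The sketch below is plain Python and assumes the tags arrive as one comma-separated string per tweet, which the post does not actually specify; the function names are made up for illustration.

```python
def split_tags(raw):
    """Split a comma-separated tag string into clean lowercase tags."""
    return [t.strip().lower() for t in raw.split(",") if t.strip()]


def build_tag_index(tag_lists):
    """Map every distinct tag to a column index, in sorted order."""
    tags = sorted({tag for tags in tag_lists for tag in tags})
    return {tag: i for i, tag in enumerate(tags)}


def multi_hot(tags, tag_index):
    """Return a 0/1 list with a 1 in the column of each tag present."""
    row = [0] * len(tag_index)
    for tag in tags:
        if tag in tag_index:  # tags not seen when building the index are ignored
            row[tag_index[tag]] = 1
    return row


# Assumed input format: one comma-separated tag string per tweet.
raw_tags = ["opioids, addiction", "overdose", "hydrocodone, opioids"]
tag_lists = [split_tags(r) for r in raw_tags]
tag_index = build_tag_index(tag_lists)
y = [multi_hot(tags, tag_index) for tags in tag_lists]

print(tag_index)
print(y)
```

scikit-learn's MultiLabelBinarizer performs the same encoding. With targets like these, the output layer would typically use sigmoid activations with binary cross-entropy instead of softmax, so that several tags can be predicted for one tweet.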
python machine-learning keras text-classification tweets
Give me an example with two tweets: one that should be classified as tweeted by a drug abuser and one that should not!!
– Rahul Agarwal
Nov 14 '18 at 18:37
@RahulAgarwal I have added two examples of each kind of tweet to the post. I guess differentiating between the two types will be one of the main challenges, because general tweets could be about literally anything and are not likely to contain any specific drug-related keywords. My assumption was that Keras would be able to learn from the writing style, spelling, punctuation, and other language-specific cues.
– kiton
Nov 14 '18 at 18:48
edited Nov 14 '18 at 18:46
kiton
asked Nov 14 '18 at 18:29
kiton
1 Answer
My shot at this is:
Create a new dataset with tweets from the general + specific data. Let's say 200K-250K tweets, where 100K is your specific data set and the rest is general.
Take your 25 keywords/tags and write a rule: if one or more of them appears in a tweet, label it DA (Drug Abuser); otherwise label it NDA (Non Drug Abuser). This will be your dependent variable.
Your new dataset will have one column with all the tweets and another column with the dependent variable saying whether each tweet is DA or NDA.
Now divide it into train/test and use Keras or any other algorithm so that it can learn.
Then test the model by plotting a confusion matrix.
Pass the remaining general data set through this model and check the results.
If there are new words, other than the 25 tags, that are not in the specific dataset, the model you built will still try to guess the right category from the groups of words that occur together, tone, etc.
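The steps above can be sketched in plain Python. This is only an illustration under stated assumptions: the keyword set is a small stand-in for the 25 tags, the helper names (`label_tweet`, `train_test_split`, `confusion_counts`) are invented here, the tweets are shortened from the examples in the question, and the `predicted` list is a made-up placeholder for what a trained model would return.

```python
import random
import re

# Stand-in for the 25 tags mentioned in the question (assumption).
KEYWORDS = {"opioids", "addiction", "overdose", "hydrocodone"}


def label_tweet(tweet, keywords=KEYWORDS):
    """Return 'DA' if any keyword occurs as a whole word, else 'NDA'."""
    words = set(re.findall(r"[a-z0-9']+", tweet.lower()))
    return "DA" if words & keywords else "NDA"


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]


def confusion_counts(actual, predicted, positive="DA"):
    """Count true/false positives and negatives for a binary task."""
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            counts["tp" if a == positive else "fp"] += 1
        else:
            counts["fn" if a == positive else "tn"] += 1
    return counts


tweets = [
    "$2 million grant to educate clinicians who prescribe opioids",
    "i woke up late, and now i look like i just woke up",
    "early and regular marijuana use is associated with use of opioids",
    "everybody messages me when im in class",
]
# One column of tweets, one column with the dependent variable.
dataset = [(t, label_tweet(t)) for t in tweets]
train, test = train_test_split(dataset)

# Placeholder labels standing in for real model output on five tweets.
actual = ["DA", "NDA", "DA", "NDA", "DA"]
predicted = ["DA", "NDA", "NDA", "DA", "DA"]
print(confusion_counts(actual, predicted))
```

For real predictions, `sklearn.metrics.confusion_matrix` (already imported in the question) produces the same table as an array. One caveat with rule-based labels: a model trained on them may mostly relearn the keyword rule, so it is worth checking a sample of its DA predictions by hand.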
Thank you very much for such a comprehensive response, Rahul. I will try to implement your suggestions. As an alternative to the "tagging" approach, do you think the following is feasible? Say I run the general and opioid-specific tweets through LIWC (liwc.wpengine.com), which can provide an extensive set of language-specific characteristics (about 20 parameters), and then use these parameters as inputs to differentiate specific tweets from general ones. My assumption here is that DAs might write differently than NDAs.
– kiton
Nov 14 '18 at 23:35
I am not sure about that, because even the same person writes tweets differently for different topics. For example, if you post about your team winning versus a serious topic, your two tweets will be quite different. Having said that, do try the LIWC approach; there will be some learning.
– Rahul Agarwal
Nov 15 '18 at 4:19
If you are satisfied with the answer, do upvote and accept for future users!!
– Rahul Agarwal
Nov 15 '18 at 4:20
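The style-based idea discussed in these comments can be prototyped without LIWC by computing a few surface features per tweet and feeding them to any classifier. The features below are hypothetical stand-ins chosen for illustration; they are not LIWC's categories, and whether they separate the two groups is an open question, as the reply above points out. The sample tweets are shortened from the examples in the question.

```python
import re
import string

# Small, hand-picked pronoun list (assumption, not taken from LIWC).
FIRST_PERSON = {"i", "im", "i'm", "me", "my"}


def style_features(tweet):
    """Return a small dict of surface-level writing-style features."""
    words = re.findall(r"[A-Za-z']+", tweet)
    n_chars = len(tweet)
    n_words = len(words)
    return {
        "n_words": n_words,
        "avg_word_len": sum(len(w) for w in words) / n_words if n_words else 0.0,
        "punct_ratio": sum(c in string.punctuation for c in tweet) / n_chars if n_chars else 0.0,
        "upper_ratio": sum(c.isupper() for c in tweet) / n_chars if n_chars else 0.0,
        "first_person": sum(w.lower() in FIRST_PERSON for w in words),
    }


general = "i woke up late, and now i look like i just woke up."
specific = "$2 million grant to educate clinicians who prescribe opioids"
for tweet in (general, specific):
    print(style_features(tweet))
```

Each dict can be turned into a fixed-length numeric row (one column per feature) and used as model input in place of, or alongside, the bag-of-words matrix.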
answered Nov 14 '18 at 19:15
Rahul Agarwal
Thank you very much for such a comprehensive response, Rahul. I will try to implement your suggestions. Alternatively to "tagging" approach, do you think the following approach is feasible or not. Say, I run the general and opioid-specific tweets through LIWC (liwc.wpengine.com) that can provide an extensive set of language specific characteristics (about 20 parameters). And then use these parameters as inputs that differentiate specific tweets from general. My assumptions here is that DA might be using writing differently than NDAs.
– kiton
Nov 14 '18 at 23:35
1
I am not sure about that because with tweets even same person writes a tweet differently for different topics. For example: If you want to post something about your team winning v/s a serious topic your two tweets would be quite different. Having said that, do try LiWc approach their will be some learning.
– Rahul Agarwal
Nov 15 '18 at 4:19
If you are satisfied with the answer, do upvote and accept for future users!!
– Rahul Agarwal
Nov 15 '18 at 4:20
Give me an example with two tweets: one that you would classify as tweeted by a drug abuser and one that you would not!!
– Rahul Agarwal
Nov 14 '18 at 18:37
@RahulAgarwal I have added two examples of tweets to the post. I guess differentiating between the two types will be one of the main challenges, because general tweets could be about literally anything and are not likely to contain any specific drug-related keywords. My assumption was that Keras would be able to learn from the writing style, spelling, punctuation, and other language-specific cues.
– kiton
Nov 14 '18 at 18:48