Analyzing Bert Output for Sequence Classification - text-classification

Can anyone help me understand the output of BERT in the last hidden layer for sequence classification? I am doing some testing with a huggingface model for sequence classification and accessing the output as follows:
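sequence = "this coffee tastes bad"
tok_sequence = tokenizer(sequence, return_tensors="pt")  # the model expects tensors, not Python lists
output = model(**tok_sequence)  # model inference
# output[1] is the tuple of hidden states (index 0: embedding output, index -1: last layer)
output[1][0]  # accessed by setting config.output_hidden_states = True
output[1][0].shape  # --> (1, 7, 768) --> | CLS | this | coffee | taste | ##s | bad | SEP |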
In the string "this coffee tastes bad", I take the outputs of the last hidden layer, which in this case have shape (1, 7, 768) ([CLS] + 5 word tokens + [SEP]), loop through each token, and average its 768 values. The resulting per-token averages are shown in the charts below.
[Figure: Output of BERT model]
[Figure: Additional BERT output example (longer sentence)]
Any help interpreting these results would be appreciated. My goal is to extract the words that carry the most meaning or that most influenced the output sentiment. Looking at the output, especially the second example, it appears that positive values correspond to less important words and negative values to influential words. This seems to hold for most output examples. However, I may be off in my understanding of this.
Any other suggested methods for text classification would also be helpful.

Related

How to detect numerical value of a text?

We have data for a survey question (e.g. rate us between 1-5) that's supposed to be numerical. However, we find that the responses also include
👍 repeated 5 times
❤️ repeated 4 times
Great!
four
3 and a half
I'd like a way to turn the user response into a numerical value. e.g. the text above should translate into 5, 4, 5, 4, 3.5 respectively. Obviously this won't work 100% of the time so I'm looking for an optimal solution (perhaps a text analysis approach) that gets me over 80%.
If you are solely looking to turn THESE SPECIFIC responses into numerical values, then you can pass them through a series of if statements in a function:
def inputToNumber(string)
  # thumbs up emoji
  if string == "\u{1f44d}"
    return 5
  # the word four
  elsif string == "four"
    return 4
  # etc., etc. with if statements for your other cases
  end
end
But it might make more sense for you to only allow numeric answers to begin with, because:
You can't predict every possible written response
Someone could input malicious code
You didn't provide your code to show how you are accepting input so I can't really offer you specific solutions, but you can look here for some suggestions: Accept only numeric input
Good luck with your project.
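For what it's worth, a slightly more general rule-based version could look like the Python sketch below. It is only an illustration under assumptions: the lookup tables (NUMBER_WORDS, SENTIMENT_WORDS, RATING_EMOJI) and the function name response_to_number are made up for this example, and the rule that "Great!" means 5 is a guess you would need to tune on your own data.

```python
import re

# Hypothetical lookup tables; extend them for your own survey data.
NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
SENTIMENT_WORDS = {"great": 5, "good": 4, "ok": 3, "bad": 2, "terrible": 1}
RATING_EMOJI = {"\U0001F44D", "\u2764"}  # thumbs up, heart


def response_to_number(text):
    """Return a rating between 1 and 5, or None when no rule matches."""
    cleaned = text.strip().lower()

    # Rule 1: a run of rating emoji counts as its length (capped at 5).
    emoji_count = sum(1 for ch in cleaned if ch in RATING_EMOJI)
    if emoji_count:
        return float(min(emoji_count, 5))

    # Rule 2: the first digit or number word, plus 0.5 for "and a half".
    tokens = re.findall(r"\d+(?:\.\d+)?|[a-z]+", cleaned)
    for token in tokens:
        if token in NUMBER_WORDS:
            value = float(NUMBER_WORDS[token])
        elif token[0].isdigit():
            value = float(token)
        else:
            continue
        if "and a half" in cleaned:
            value += 0.5
        return value if 1 <= value <= 5 else None

    # Rule 3: fall back to a small sentiment vocabulary.
    for token in tokens:
        if token in SENTIMENT_WORDS:
            return float(SENTIMENT_WORDS[token])
    return None
```

Responses that return None can be logged and reviewed by hand, which is also a cheap way to discover which rules are still missing.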

Is it possible to search for part of the text using word embeddings?

I have found a successful weighting scheme for adding word vectors which seems to work for sentence comparison in my case:
query1 = vectorize_query("human cat interaction")
query2 = vectorize_query("people and cats talk")
query3 = vectorize_query("monks predicted frost")
query4 = vectorize_query("man found his feline in the woods")
>>> print(1 - spatial.distance.cosine(query1, query2))
0.7154500319
>>> print(1 - spatial.distance.cosine(query1, query3))
0.415183904078
>>> print(1 - spatial.distance.cosine(query1, query4))
0.690741014142
When I add additional information to the sentence, which acts as noise, the similarity decreases:
>>> query4 = vectorize_query("man found his feline in the dark woods while picking white mushrooms and watching unicorns")
>>> print(1 - spatial.distance.cosine(query1, query4))
0.618269123349
Are there any ways to deal with additional information when comparing texts using word vectors, given that I know some subset of the text can provide a better match?
UPD: edited the code above to make it more clear.
In my case, vectorize_query performs so-called smooth inverse frequency weighting: word vectors from a GloVe model (it could equally be word2vec, etc.) are added with weights a/(a+w), where w should be the word frequency. For w I use the word's inverse tf-idf score, i.e. w = 1/tfidf(word). The coefficient a is typically set to 1e-3 in this approach. Using just the tf-idf score as the weight instead of that fraction gives almost the same result. I also played with normalization, etc.
But I wanted to have just "vectorize sentence" in my example so as not to overload the question, as I think it does not depend on how I add word vectors using the weighting scheme. The problem is only that the comparison works best when the sentences have approximately the same number of meaningful words.
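To make the a/(a+w) weighting concrete, here is a minimal pure-Python sketch. The function name sif_vector, the toy two-dimensional vectors, and the frequency values are all assumptions for illustration; in practice the vectors would come from GloVe or word2vec, and w could be the inverse tf-idf score described above instead of a raw frequency.

```python
def sif_vector(words, vectors, frequency, a=1e-3):
    """Average the word vectors using weights a / (a + w)."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    used = 0
    for word in words:
        if word not in vectors:
            continue  # skip out-of-vocabulary words
        weight = a / (a + frequency.get(word, 0.0))
        for i, x in enumerate(vectors[word]):
            total[i] += weight * x
        used += 1
    return [x / used for x in total] if used else total


# Toy data (hypothetical): "the" is frequent, "cat" is rare.
vectors = {"the": [1.0, 0.0], "cat": [0.0, 1.0]}
frequency = {"the": 0.05, "cat": 0.0005}
sentence_vector = sif_vector(["the", "cat"], vectors, frequency)
```

With these numbers the rare word dominates the result, which is the intended effect of the weighting.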
I am aware of another approach, where the distance between a sentence and a text is computed as the sum or mean of minimal pairwise word distances, e.g.
"Obama speaks to the media in Illinois" <-> "The President greets the press in Chicago", where we have dist = d(Obama, President) + d(speaks, greets) + d(media, press) + d(Illinois, Chicago). But this approach does not take into account that an adjective can change the meaning of a noun significantly, etc., which is more or less incorporated in vector models. Adjectives like 'good', 'bad', 'nice', etc. become noise there, as they match in the two texts and contribute zero or low distances, thus decreasing the distance between the sentence and the text.
I played a bit with doc2vec models (the gensim doc2vec implementation, as far as I recall) and with skip-thoughts embeddings, but in my case (matching a short query against a much larger amount of text) the results were unsatisfactory.
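For reference, the pairwise word-distance approach mentioned above can be sketched as follows. This is a simplified illustration, not an implementation of any particular library: the helper names and the three-dimensional toy vectors are invented, and cosine distance is assumed for d.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def min_pairwise_distance(sentence, text, vectors):
    """Sum, over words in `sentence`, of the distance to the closest word in `text`."""
    return sum(
        min(cosine_distance(vectors[w], vectors[t]) for t in text)
        for w in sentence
    )


# Toy vectors (hypothetical values, chosen so that paraphrases lie close together).
vectors = {
    "obama": [1.0, 0.1, 0.0], "president": [0.9, 0.2, 0.0],
    "speaks": [0.0, 1.0, 0.1], "greets": [0.1, 0.9, 0.0],
    "media": [0.0, 0.1, 1.0], "press": [0.1, 0.0, 0.9],
    "frost": [-1.0, -1.0, -1.0],
}
query = ["obama", "speaks", "media"]
close = min_pairwise_distance(query, ["president", "greets", "press"], vectors)
far = min_pairwise_distance(query, ["frost"], vectors)
```

Note that the measure as written is directional: it only asks whether every word of the first argument has a close counterpart in the second, so extra words in the second argument do not increase the distance.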
If you are interested in part-of-speech to trigger similarity (e.g. only interested in nouns and noun phrases and ignore adjectives), you might want to look at sense2vec, which incorporates word classes into the model. https://explosion.ai/blog/sense2vec-with-spacy ...after which you can weight the word class while performing a dot product across all terms, effectively deboosting what you consider the 'noise'.
It's not clear your original result, the similarity decreasing when a bunch of words are added, is 'bad' in general. A sentence that says a lot more is a very different sentence!
If that result is specifically bad for your purposes – you need a model that captures whether a sentence says "the same and then more", you'll need to find/invent some other tricks. In particular, you might need a non-symmetric 'contains-similar' measure – so that the longer sentence is still a good match for the shorter one, but not vice-versa.
Any shallow, non-grammar-sensitive embedding that's fed by word-vectors will likely have a hard time with single-word reversals-of-meaning, as for example the difference between:
After all considerations, including the relevant measures of economic, cultural, and foreign-policy progress, historians should conclude that Nixon was one of the very *worst* Presidents
After all considerations, including the relevant measures of economic, cultural, and foreign-policy progress, historians should conclude that Nixon was one of the very *best* Presidents
The words 'worst' and 'best' will already be quite-similar, as they serve the same functional role and appear in the same sorts of contexts, and may only contrast with each other a little in the full-dimensional space. And then their influence may be swamped in the influence of all the other words. Only more sophisticated analyses may highlight their role as reversing the overall import of the sentence.
While it's not yet an option in gensim, there are alternative ways to calculate the "Word Mover's Distance" that report the unmatched 'remainder' after all the easy pairwise-meaning-measuring is finished. While I don't know any prior analysis or code that'd flesh out this idea for your needs, or prove its value, I have a hunch such an analysis might help better discover cases of "says the same and more", or "says mostly the same but with reversal in a few words/aspects".

How to handle categorical features for Decision Tree, Random Forest in spark ml?

I am trying to build decision tree and random forest classifier on the UCI bank marketing data -> https://archive.ics.uci.edu/ml/datasets/bank+marketing. There are many categorical features (having string values) in the data set.
In the Spark ML documentation, it's mentioned that categorical variables can be converted to numeric ones by indexing with either StringIndexer or VectorIndexer. I chose StringIndexer (VectorIndexer requires a vector feature, and VectorAssembler, which converts features into a vector feature, accepts only numeric types). With this approach, each level of a categorical feature is assigned a numeric value based on its frequency (0 for the most frequent label of a categorical feature).
My question is how the Random Forest or Decision Tree algorithm will know that the new features (derived from categorical features) are different from continuous variables. Will an indexed feature be treated as continuous by the algorithm? Is this the right approach, or should I go ahead with One-Hot-Encoding for categorical features?
I read some of the answers on this forum, but I didn't get clarity on the last part.
One Hot Encoding should be done for categorical variables with more than two categories.
To understand why, you should know the difference between the subcategories of categorical data: ordinal data and nominal data.
Ordinal Data: The values have some sort of ordering between them. Example:
Customer Feedback (excellent, good, neutral, bad, very bad). As you can see, there is a clear ordering between them (excellent > good > neutral > bad > very bad). In this case StringIndexer alone is sufficient for modelling purposes.
Nominal Data: The values have no defined ordering between them.
Example: colours (black, blue, white, ...). In this case StringIndexer alone is NOT sufficient, and One Hot Encoding is required after String Indexing.
After String Indexing, let's assume the output is:
id | colour | categoryIndex
----|----------|---------------
0 | black | 0.0
1 | white | 1.0
2 | yellow | 2.0
3 | red | 3.0
Then, without One Hot Encoding, the machine learning algorithm will assume red > yellow > white > black, which we know is not true.
OneHotEncoder() helps us avoid this situation.
So, to answer your question:
"Will indexed feature be considered as continuous in the algorithm?"
It will be considered a continuous variable.
"Is it the right approach? Or should I go ahead with One-Hot-Encoding for categorical features?"
That depends on your understanding of the data. Although Random Forest and some boosting methods don't require One Hot Encoding, most ML algorithms need it.
Refer: https://spark.apache.org/docs/latest/ml-features.html#onehotencoder
In short, Spark's RandomForest does NOT require OneHotEncoder for categorical features created by StringIndexer or VectorIndexer.
Longer Explanation. In general DecisionTrees can handle both Ordinal and Nominal types of data. However, when it comes to the implementation, it could be that OneHotEncoder is required (as it is in Python's scikit-learn).
Luckily, Spark's implementation of RandomForest honors categorical features if properly handled and OneHotEncoder is NOT required!
Proper handling means that categorical features contain the corresponding metadata so that RF knows what it is working on. Features that have been created by StringIndexer or VectorIndexer contain metadata in the DataFrame about being generated by the Indexer and being categorical.
According to vdep's answer, StringIndexer is enough for ordinal data. However, StringIndexer sorts the data by label frequency, so for example "excellent > good > neutral > bad > very bad" may become "good, excellent, neutral". So StringIndexer is not suitable for ordinal data.
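The frequency ordering can be illustrated without Spark. The sketch below only mimics the "most frequent label gets index 0" behaviour described in the question; the function name frequency_index, the sample data, and the alphabetical tie-breaking are assumptions of this example, not Spark's actual code.

```python
from collections import Counter


def frequency_index(labels):
    """Map each label to an index, most frequent label first (0.0, 1.0, ...)."""
    counts = Counter(labels)
    # Ties are broken alphabetically here; that choice is an assumption of this sketch.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


# Hypothetical survey column in which "good" happens to be the most common answer.
feedback = ["good"] * 5 + ["excellent"] * 3 + ["neutral"] * 2 + ["bad"]
index = frequency_index(feedback)
```

Here "good" receives 0.0 and "excellent" receives 1.0, so the numeric order follows how often each answer occurs rather than the semantic order of the ratings.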
Secondly, for Nominal Data, the document tells us that
for a binary classification problem with one categorical feature with three categories A, B and C whose corresponding proportions of label 1 are 0.2, 0.6 and 0.4, the categorical features are ordered as A, C, B. The two split candidates are A | C, B and A , C | B where | denotes the split.
Is the "corresponding proportions of label 1" the same as the label frequency? I am therefore unsure whether it is feasible to feed StringIndexer output to a DecisionTree in Spark.
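As far as the quoted passage goes, the "proportion of label 1" is computed from the target column within each category (the fraction of rows in that category whose label is 1), so it is a different quantity from how often the category itself occurs. A small sketch of the ordering and the resulting split candidates, using the numbers from the quote (the function name is invented for illustration):

```python
def ordered_split_candidates(proportions):
    """Order categories by proportion of label 1 and list the contiguous splits."""
    ordered = sorted(proportions, key=proportions.get)
    return [(ordered[:i], ordered[i:]) for i in range(1, len(ordered))]


# Proportions of label 1 for categories A, B and C, taken from the quoted example.
candidates = ordered_split_candidates({"A": 0.2, "B": 0.6, "C": 0.4})
# candidates == [(['A'], ['C', 'B']), (['A', 'C'], ['B'])]
```

This reproduces the two split candidates from the quote, A | C, B and A, C | B, regardless of which index each category was given by the indexer.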

Multiple Time Series with Binary Outcome Prediction

I'll preface this with saying I am extremely new to neural networks and their operation. I've done a fair bit of reading, played with a few cloud based tools (Cortana and AWS), but beyond that, I am not well adept in the algorithms, the kind of neural networks etc...
I'm looking for advice on what systems / tools / kinds of algorithms I can use to achieve the below.
Problem Description
I have a data set that contains time series data for a number of users. The data set can contain a variable number of unique users (prob max out at 150), and each user has 4 different sets of time series data for four different variables. Example data set below
V = Variable
User | Time | V1 | V2 | V3 | V4
1 | 12.00am | 13 | 1045 | 12.2 | 52.4
1 | 12.01am | 12 | 1565 | 11.9 | 50.3
2 | 12.00am | 2 | 15434 | 1.93 | 47.2
2 | 12.01am | 2.02 | 17434 | 1.98 | 43.1
And so on for x users and hundreds of data points for each user.
Required Output
By parsing the data, I want to be able to train the system to either give back a binary TRUE or FALSE for a user based on the input, or alternatively, a probability % of the user being TRUE.
The binary is effectively a TRUE or FALSE result. There can only be one TRUE of all 10 users. I think getting back a % of chance of being TRUE is probably the simplest form? I may be wrong.
Input Format
End point is to have an API that I can send the data set to and it returns user and their probability (or the binary TRUE | FALSE result).
Systems
I would prefer to be able to do this on a 3rd party service as opposed to having to build my own systems to do the processing, but not a necessity.
Training Data
I have years of data to be able to train the system, hundreds of thousands of real user sets and so on.
To Wrap It Up
Looking for advice on the what and the how to predict a binary outcome from multiple time series data sets.
Really appreciate any assistance and guidance here.
Thanks
Russ
I'm working on a similar problem (I am no expert either) but I'll share my approach in case it answers your "what" part of the question.
My solution was to transform the dataset so I ended up with a problem that could be solved with traditional classification algorithms (Random Forest, boosting, ...)
This approach requires the data to be labeled. Each row of the transformed dataset represents the information associated with one TRUE or FALSE outcome in the training dataset. Each row is a unique event and has:
1 column with the response
p sets of columns (one set for each of the p original variables)
k variables to indicate seasonality
Each set of the p sets of columns will consist of the variable at time t (time when you recorded the response of that row), the variable at time t-1 (lag1), ..., and the variable at time t-T (lagT).
Example:
Original dataset (I've retained only V1 and V2 and added an outcome variable)
User | Time | V1 | V2 | outcome
1 | 12.00am | 13 | 1045 | FALSE
1 | 12.01am | 12 | 1565 | TRUE
Transformed dataset
ID | V1_lag1 | V1_lag0 | V2_lag1 | V2_lag0 | outcome
event_id | 13 | 12 | 1045 | 1565 | TRUE
With this set up you could fit a model that would predict the probability of TRUE at time t for a new observation, based on V1 and V2 evaluated at time t and V1 and V2 evaluated at lag1 (t-1min).
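The transformation described above can be sketched in plain Python as follows. The function name make_lagged_rows and the dictionary keys are assumptions for illustration, and timestamps are assumed to be zero-padded 24-hour strings so that they sort correctly.

```python
def make_lagged_rows(rows, variables, max_lag):
    """Turn per-user time-ordered rows into one row per event with lag columns."""
    by_user = {}
    for row in rows:
        by_user.setdefault(row["user"], []).append(row)

    events = []
    for user, series in by_user.items():
        series.sort(key=lambda r: r["time"])
        # Start at max_lag so that every lag column has a value.
        for t in range(max_lag, len(series)):
            event = {"user": user, "outcome": series[t]["outcome"]}
            for name in variables:
                for lag in range(max_lag + 1):
                    event[f"{name}_lag{lag}"] = series[t - lag][name]
            events.append(event)
    return events


# The two example rows from the original dataset above.
rows = [
    {"user": 1, "time": "00:00", "V1": 13, "V2": 1045, "outcome": False},
    {"user": 1, "time": "00:01", "V1": 12, "V2": 1565, "outcome": True},
]
events = make_lagged_rows(rows, ["V1", "V2"], max_lag=1)
```

Each resulting dictionary is one training row that a standard classifier can consume once the list is converted to a table.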
You could also create new features that would describe the variables better (See Features for time series classification).
And you should incorporate the seasonality somehow if the variables show a seasonality pattern:
ID | V1_lag1 | V1_lag0 | V2_lag1 | V2_lag0 | day | hour | outcome
event_id | 13 | 12 | 1045 | 1565 | wed | 12am | TRUE
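One simple way to derive the day and hour columns is from the event timestamp. The sketch below assumes ISO-formatted timestamps; the function name and the example date are hypothetical.

```python
from datetime import datetime

DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def seasonality_features(timestamp):
    """Return day-of-week and hour columns for an ISO-formatted timestamp."""
    moment = datetime.fromisoformat(timestamp)
    return {"day": DAYS[moment.weekday()], "hour": moment.hour}


# Hypothetical event recorded just after midnight on a Wednesday.
features = seasonality_features("2024-01-03T00:01:00")
```

These two columns are categorical, so they would typically be encoded before being passed to the classifier.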

Tag based clustering algorithm

I am looking to cluster many feeds based on their tags.
A typical example would be Twitter feeds. Each feed has user-defined tags associated with it. By analyzing the tags, is it possible to cluster the feeds into different groups and report how many feeds fall under which tags?
An example would be -
Feed1 - Earthquake in indonasia #earthquake #asia #bad
Feed2 - There is a large earthquake in my area #earthquake #bad
Feed3 - My parents went to singapore #asia #tour
Feed4 - XYZ company is laying off many people #XYZ #layoff #bear
Feed5 - XYZ is getting bad is planning to layoff #XYZ #layoff #bad
Feed6 - XYZ is in a layoff spree #layoff #XYZ #worst
After clustering
#asia, #earthquake - Feed1, Feed2
#XYZ, #layoff - Feed4, Feed5, Feed6
Here the clustering is based purely on tags.
Is there any good algorithm to achieve this?
If I understand your question correctly, you would like to cluster the tags together and then put the feeds into these clusters based on the tags in the feed.
For this, you could create a similarity measure between the tags based on the number of feeds that the tags appear in together. For your example, this would be something like this
            | #earthquake | #asia | #bad | ...
#earthquake | 1           | 1/2   | 2/2
#asia       | 1/2         | 1     | 1/2
#bad        | 2/3         | 1/3   | 1
...
Here, value at (i,j) equals frequency of (i,j)/frequency of (i).
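A minimal sketch of how this matrix could be computed, using the six example feeds from the question. The function name tag_similarity is made up, and Fraction is used only so that the values print exactly as in the table.

```python
from collections import defaultdict
from fractions import Fraction


def tag_similarity(feeds):
    """similarity[i][j] = (# feeds with both i and j) / (# feeds with i)."""
    count = defaultdict(int)
    pair_count = defaultdict(int)
    for tags in feeds.values():
        for i in tags:
            count[i] += 1
            for j in tags:
                pair_count[(i, j)] += 1
    return {
        i: {j: Fraction(pair_count[(i, j)], count[i]) for j in count}
        for i in count
    }


feeds = {
    "Feed1": {"earthquake", "asia", "bad"},
    "Feed2": {"earthquake", "bad"},
    "Feed3": {"asia", "tour"},
    "Feed4": {"XYZ", "layoff", "bear"},
    "Feed5": {"XYZ", "layoff", "bad"},
    "Feed6": {"layoff", "XYZ", "worst"},
}
similarity = tag_similarity(feeds)
```

The matrix is not symmetric: similarity["bad"]["earthquake"] is 2/3 while similarity["earthquake"]["bad"] is 1, so some clustering algorithms may need it symmetrized first.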
Now you have a similarity matrix between the tags, and you could use virtually any clustering algorithm that suits your needs. Since the number of tags can be very large and estimating the number of clusters before running the algorithm is difficult, I would suggest a hierarchical clustering algorithm such as Fast Modularity clustering, which is also very fast (see some details here). However, if you have an estimate of the number of clusters you would like, then spectral clustering might be useful too (see some details here).
After you cluster the tags together, you could use a simple approach to assign each feed to a cluster. This can be very simple, for example, counting the number of tags from each cluster in a feed and assigning a cluster with the maximum number of matching tags.
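The assignment step could be as small as the sketch below. The cluster names and their tag sets are hypothetical stand-ins for whatever the clustering step produces, and ties are resolved in favour of the first cluster encountered.

```python
def assign_feed(tags, clusters):
    """Return the name of the cluster sharing the most tags with the feed, or None."""
    best_name, best_overlap = None, 0
    for name, cluster_tags in clusters.items():
        overlap = len(set(tags) & set(cluster_tags))
        if overlap > best_overlap:
            best_name, best_overlap = name, overlap
    return best_name


# Hypothetical tag clusters.
clusters = {
    "disaster": {"earthquake", "asia", "bad"},
    "layoffs": {"XYZ", "layoff", "bear", "worst"},
}
feed5_cluster = assign_feed({"XYZ", "layoff", "bad"}, clusters)
```

Feed5 shares one tag with the first cluster and two with the second, so it lands in "layoffs"; a feed with no matching tags returns None.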
If you are flexible on your clustering strategy, then you could also try clustering the feeds together in a similar way by creating a similarity between the feeds based on the number of common tags between the feeds and then applying a clustering algorithm on the similarity matrix.
Interesting question. I'm making things up here, but I think this would work.
Algorithm
For each feed, come up with a complete list of tag combinations (of length >= 2), probably sorted for consistency. For example:
Feed1: (asia-bad), (asia-earthquake), (bad-earthquake), (asia-bad-earthquake)
Feed2: (bad-earthquake)
Feed3: (asia-tour)
Feed4: (bear-layoff), (bear-XYZ), (layoff-XYZ), (bear-layoff-XYZ)
Feed5: (bad-layoff), (bad-XYZ), (layoff-XYZ), (bad-layoff-XYZ)
Feed6: (layoff-worst), (layoff-XYZ), (worst-XYZ), (layoff-worst-XYZ)
Then reverse the mapping:
(asia-bad): Feed1
(asia-earthquake): Feed1
(bad-earthquake): Feed1, Feed2
(asia-bad-earthquake): Feed1
(asia-tour): Feed3
(bear-layoff): Feed4
...
(layoff-XYZ): Feed4, Feed5, Feed6
...
You can then cull all the entries with a frequency below some threshold. In this case, if we take a frequency threshold of 2, then you'd be left with (bad-earthquake) with Feed1 and Feed2, and (layoff-XYZ) with Feed4, Feed5 and Feed6.
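A rough sketch of the whole procedure, run on the six example feeds. The function name and the optional max_length cap on combination size are assumptions for illustration; tags are sorted case-insensitively so the combinations match the listing above.

```python
from collections import defaultdict
from itertools import combinations


def frequent_tag_combinations(feeds, threshold=2, max_length=None):
    """Map each tag combination (length >= 2) to the feeds containing it."""
    index = defaultdict(list)
    for feed, tags in feeds.items():
        ordered = sorted(set(tags), key=str.lower)
        longest = len(ordered) if max_length is None else min(max_length, len(ordered))
        for size in range(2, longest + 1):
            for combo in combinations(ordered, size):
                index[combo].append(feed)
    # Keep only the combinations that occur in at least `threshold` feeds.
    return {combo: names for combo, names in index.items() if len(names) >= threshold}


feeds = {
    "Feed1": ["earthquake", "asia", "bad"],
    "Feed2": ["earthquake", "bad"],
    "Feed3": ["asia", "tour"],
    "Feed4": ["XYZ", "layoff", "bear"],
    "Feed5": ["XYZ", "layoff", "bad"],
    "Feed6": ["layoff", "XYZ", "worst"],
}
groups = frequent_tag_combinations(feeds, threshold=2)
```

With a threshold of 2 this leaves exactly the two groups mentioned above. Setting max_length to 2 or 3 keeps the number of generated combinations manageable for feeds with many tags.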
Performance Concerns
A naive implementation of this would have extremely poor performance -- exponential in the number of tags per feed (not to mention space requirements). However, there are various ways to apply heuristics to improve this. For example:
Determine the most popular X tags by scanning all feeds (or a random selection of X feeds) -- this is linear in the number of tags per feed. Then only consider the Y most popular tags for each feed.
Determine the frequency of all (or most) tags. Then, for each post, only consider the X most popular tags in that post. This prevents situations where you have, say, fifteen tags for some post, resulting in a huge list of combinations, most of which would never occur.
For each post, only consider combinations of length <= X. For example, if a feed had fifteen tags, you could end up with a huge number of combinations, but most of them would have very few occurrences, especially the long ones. So only consider combinations of two or three tags.
Only scan a random selection of X feeds.
Hope this helps!
