I have found a successful weighting scheme for adding word vectors which seems to work for sentence comparison in my case:
query1 = vectorize_query("human cat interaction")
query2 = vectorize_query("people and cats talk")
query3 = vectorize_query("monks predicted frost")
query4 = vectorize_query("man found his feline in the woods")
>>> print(1 - spatial.distance.cosine(query1, query2))
0.7154500319
>>> print(1 - spatial.distance.cosine(query1, query3))
0.415183904078
>>> print(1 - spatial.distance.cosine(query1, query4))
0.690741014142
When I add extra information to the sentence that acts as noise, the similarity decreases:
>>> query4 = vectorize_query("man found his feline in the dark woods while picking white mushrooms and watching unicorns")
>>> print(1 - spatial.distance.cosine(query1, query4))
0.618269123349
Are there any ways to deal with this additional information when comparing sentences using word vectors, given that I know some subset of the text can provide a better match?
UPD: I edited the code above to make it clearer.
vectorize_query in my case performs so-called smooth inverse frequency weighting: word vectors from a GloVe model (it could be word2vec as well, etc.) are summed with weights a/(a+w), where w should be the word frequency. I use the word's inverse tf-idf score there, i.e. w = 1/tfidf(word). The coefficient a is typically taken as 1e-3 in this approach. Taking just the tf-idf score as the weight instead of that fraction gives an almost similar result; I have also played with normalization, etc.
But I wanted to have just "vectorize sentence" in my example so as not to overload the question, since I think the issue does not depend on how I add word vectors using a weighting scheme - the problem is only that the comparison works best when sentences have approximately the same number of meaningful words.
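For reference, a minimal sketch of that weighting (glove and tfidf here are placeholder dicts mapping word to vector and word to tf-idf score; they stand in for however the model and scores are actually loaded):

import numpy as np

A = 1e-3  # typical SIF coefficient

def vectorize_query(sentence, glove, tfidf):
    # Weighted sum of word vectors with weights a/(a + w), where w = 1/tfidf(word)
    vectors = []
    for word in sentence.lower().split():
        if word not in glove:
            continue  # skip out-of-vocabulary words
        w = 1.0 / tfidf.get(word, 1.0)
        vectors.append((A / (A + w)) * np.asarray(glove[word]))
    if not vectors:
        return None
    v = np.mean(vectors, axis=0)
    return v / (np.linalg.norm(v) + 1e-12)  # optional normalization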
I am aware of another approach, in which the distance between a sentence and a text is computed as the sum or mean of minimal pairwise word distances, e.g.
"Obama speaks to the media in Illinois" <-> "The President greets the press in Chicago", where we have dist = d(Obama, president) + d(speaks, greets) + d(media, press) + d(Chicago, Illinois). But this approach does not take into account that an adjective can change the meaning of a noun significantly, etc. - something that is more or less captured by vector models. Words like the adjectives 'good', 'bad', 'nice', etc. become noise there, as they match across the two texts and contribute zero or low distances, thus decreasing the distance between the sentence and the text.
I played a bit with doc2vec models - it was the gensim doc2vec implementation, it seems, and skip-thoughts embeddings - but in my case (matching a short query against a much larger amount of text) I got unsatisfactory results.
If you are interested in using part-of-speech to drive similarity (e.g. you only care about nouns and noun phrases and want to ignore adjectives), you might want to look at sense2vec, which incorporates word classes into the model: https://explosion.ai/blog/sense2vec-with-spacy ...after which you can weight the word classes while performing a dot product across all terms, effectively de-boosting what you consider the 'noise'.
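If you don't want to go all the way to sense2vec, a cruder sketch of the same idea using plain spaCy part-of-speech tags could look roughly like this (the weights are arbitrary illustration values, and a model that ships vectors, e.g. en_core_web_md, is assumed):

import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")  # a model with word vectors
POS_WEIGHTS = {"NOUN": 1.0, "PROPN": 1.0, "VERB": 0.5, "ADJ": 0.1}  # illustrative only

def pos_weighted_vector(text):
    # Average word vectors, down-weighting word classes treated as noise
    vecs, weights = [], []
    for token in nlp(text):
        if token.has_vector and not token.is_stop:
            vecs.append(token.vector)
            weights.append(POS_WEIGHTS.get(token.pos_, 0.1))
    return np.average(vecs, axis=0, weights=weights) if vecs else None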
It's not clear that your original result - the similarity decreasing when a bunch of words are added - is 'bad' in general. A sentence that says a lot more is a very different sentence!
If that result is specifically bad for your purposes - that is, you need a model that captures whether a sentence says "the same and then more" - you'll need to find/invent some other tricks. In particular, you might need a non-symmetric 'contains-similar' measure, so that the longer sentence is still a good match for the shorter one, but not vice versa.
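One possible shape for such a non-symmetric measure, sketched with the same placeholder word-to-vector mapping wv as above: score how well the longer text covers the words of the shorter query, and ignore whatever extra the text says.

import numpy as np

def coverage_similarity(query, text, wv):
    # For each query word, take its best cosine similarity in the text, then average.
    # Asymmetric on purpose: coverage(query, text) != coverage(text, query).
    q = [wv[w] for w in query.lower().split() if w in wv]
    t = [wv[w] for w in text.lower().split() if w in wv]
    if not q or not t:
        return 0.0
    cos = lambda a, b: float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(np.mean([max(cos(qv, tv) for tv in t) for qv in q]))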
Any shallow, non-grammar-sensitive embedding that's fed by word-vectors will likely have a hard time with single-word reversals-of-meaning, as for example the difference between:
After all considerations, including the relevant measures of economic, cultural, and foreign-policy progress, historians should conclude that Nixon was one of the very *worst* Presidents
After all considerations, including the relevant measures of economic, cultural, and foreign-policy progress, historians should conclude that Nixon was one of the very *best* Presidents
The words 'worst' and 'best' will already be quite similar, as they serve the same functional role and appear in the same sorts of contexts, and may only contrast with each other a little in the full-dimensional space. And then their influence may be swamped by the influence of all the other words. Only more sophisticated analyses may highlight their role in reversing the overall import of the sentence.
While it's not yet an option in gensim, there are alternative ways to calculate the "Word Mover's Distance" that report the unmatched 'remainder' after all the easy pairwise meaning-measuring is finished. While I don't know of any prior analysis or code that would flesh out this idea for your needs, or prove its value, I have a hunch such an analysis might help better discover cases of "says the same and more" or "says mostly the same but with a reversal in a few words/aspects".
I have a question about generating a game map in Ruby for a text-based RPG.
I've done some research and came up with two possible options: a 2D array or a Matrix.
My question is which one would be more suitable for implementing these features:
at generation the size of the map is defined
each tile represents a geounit
a geounit is randomly assigned a name and type (housing unit, police station, etc.) => this would enable a different map for each game
random assignment of types must take into account the proximity rule
The proximity rule means that a certain number of units must lie between two geounits of the same type. To make the gameplay harder, for example, two police stations must be 10 geounits apart - of course in any direction (possibly also diagonally).
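To make the proximity rule concrete, here is a minimal sketch of the placement check (written in Python purely for illustration; the same logic maps directly onto a Ruby Array of Arrays), using Chebyshev distance so diagonal steps count the same as straight ones:

import random

def far_enough(grid, row, col, unit_type, min_dist):
    # Chebyshev distance: diagonal neighbours count like straight ones
    for r, line in enumerate(grid):
        for c, cell in enumerate(line):
            if cell == unit_type and max(abs(r - row), abs(c - col)) < min_dist:
                return False
    return True

def generate_map(size, types, min_dist=10):
    grid = [[None] * size for _ in range(size)]
    for r in range(size):
        for c in range(size):
            allowed = [t for t in types if far_enough(grid, r, c, t, min_dist)]
            grid[r][c] = random.choice(allowed or types)  # naive fallback if nothing fits
    return grid

In practice you would probably only constrain a few special types (like police stations) and leave the common ones unconstrained, otherwise the fallback fires constantly on a small map.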
Thank you,
regards,
seba
So I'm building a rock paper scissors bot, and I need people to be able to be sure that the robot doesn't "cheat" and make its selection after the player chooses their throw.
Normally, when a computer does the verifying, this is done by hashing the choice (maybe with a salt) and then revealing the choice+salt. But I want something that can be "instantly" verifiable by a human. If I just hash the choice, people would cry foul about the hash being rigged.
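For reference, that standard commit-reveal scheme is only a few lines (a sketch with SHA-256; the problem is precisely that verifying it needs a computer rather than a glance):

import hashlib
import secrets

def commit(throw):
    # Commit before the player moves; publish only the digest
    salt = secrets.token_hex(16)
    return hashlib.sha256((throw + salt).encode()).hexdigest(), salt

def verify(digest, throw, salt):
    # After the reveal, anyone can recompute the hash -- but not in their head
    return hashlib.sha256((throw + salt).encode()).hexdigest() == digest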
So my idea is to have a "visual hashing algorithm" of sorts -- a hashing algorithm that humans can perform themselves trivially and easily, and verify.
What my idea is right now is to have three boxes: Rock, Paper, and Scissors, and then have three other unlabeled boxes A, B, and C across from the RPS boxes. Then I connect Rock to one of them using tangled lines, Paper to another, and Scissors to another. The lines are tangled so that it would take time to "follow back" the line from box B to, say, Scissors.
When the computer picks its throw, it "highlights" the box corresponding to that throw -- that is, if Scissors' tangled lines lead to Box B, it'll highlight Box B. But it won't reveal that it was Scissors. The human is then given, say, 3 seconds to pick a throw. 3 seconds, hopefully, is not long enough for them to detangle the lines and trace back from Box B to Scissors.
Then, when the human picks their throw, the computer reveals Scissors, and also highlights the tangled line from Scissors to Box B so that it is clear that Scissors has led to Box B this entire time, and the computer couldn't have just cheated.
While this would work, I think it's a little ugly and inelegant. The human can easily verify that the computer didn't cheat, but at the same time it seems unusual or weird and introduces so many UI elements that the screen might look cluttered or untrustworthy. The fewer UI elements and the smaller the graphical footprint, the better.
Are there any existing solutions that solve this issue? The requirements are:
The "hash" of the throw is presented, as well as the hashing algorithm, which takes time (at least 3 seconds) to "undo".
When the throw is revealed, it should be easily, visually, and immediately apparent that the hashing algorithm was performed validly and that the throw does indeed correspond to the hash
It uses as few UI elements as possible and has as small a graphical footprint as possible
This is interesting. An idea in my head would be to display a 10x10 grid (of, say, 5 pixels per square) with a key:
Red: Rock; Blue: Scissors; Green: Paper
Then fill the grid randomly with 33 red, 33 blue and 33 green squares, plus 1 more square whose colour matches the computer's throw. A human would struggle to spot which colour has 34 squares rather than 33 in a short time period, but the count could be revealed on user input, along with optionally expanding the grid, highlighting the cells, etc.
A small UI footprint, and neater than your solution, but whether it's good enough...
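A sketch of how that grid could be produced and checked, assuming the 34th square's colour encodes the committed throw:

import random

COLOUR = {"rock": "red", "scissors": "blue", "paper": "green"}

def make_grid(throw):
    # 100 cells: 33 of each colour plus one extra in the committed throw's colour
    cells = ["red"] * 33 + ["blue"] * 33 + ["green"] * 33 + [COLOUR[throw]]
    random.shuffle(cells)
    return [cells[i * 10:(i + 1) * 10] for i in range(10)]

def committed_throw(grid):
    # Verification: whichever colour appears 34 times is the throw
    flat = [cell for row in grid for cell in row]
    winning_colour = max(set(flat), key=flat.count)
    return {colour: throw for throw, colour in COLOUR.items()}[winning_colour]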
You have 3 seconds: which calculation is correct?
R: 317 * 27 = 8829
P: 297 * 16 = 5605
S: 239 * 38 = 9082
When I tell you my answer, you could quickly verify with a calculator.
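A sketch of that idea, with arbitrary number ranges: only the committed throw's line is a true equation, and checking it afterwards is a single multiplication on a calculator.

import random

def make_puzzle(throw):  # throw is one of "R", "P", "S"
    # Only the committed throw's equation is actually correct
    lines = []
    for t in ("R", "P", "S"):
        a, b = random.randint(100, 999), random.randint(10, 99)
        result = a * b if t == throw else a * b + random.randint(1, 500)
        lines.append(f"{t}: {a} * {b} = {result}")
    return lines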
I'm doing some work processing some statistics for home approvals in a given month. I'd like to be able to show trends - that is, which areas have seen a large relative increase or decrease since the last month(s).
My first naive approach was to just calculate the percentage change between two months, but that has problems when the data is very low - any change at all is magnified:
// diff = (new - old) / old
Area | June | July | Diff |
--------------|--------|--------|--------|
South Sydney | 427 | 530 | +24% |
North Sydney | 167 | 143 | -14% |
Dubbo | 1 | 3 | +200% |
I don't want to just ignore any area or value as an outlier, but I don't want Dubbo's increase of 2 per month to outshine the increase of 103 in South Sydney. Is there a better equation I could use to show more useful trend information?
This data is eventually being plotted on Google Maps. In this first attempt, I'm just converting the difference to a "heatmap colour" (blue = decrease, green = no change, red = increase). Perhaps using some other metric to alter the view of each area might be a solution - for example, changing the alpha channel based on the total number of approvals or something similar. In this case, Dubbo would be bright red but quite transparent, whereas South Sydney would be closer to yellow but quite opaque.
Any ideas on the best way to show this data?
Look into measures of statistical significance. It could be as simple as assuming counting statistics.
In a very simple minded version, the thing you plot is
(A_2 - A_1)/sqrt(A_2 + A_1)
i.e. change over 1 sigma in simple counting statistics.
Which makes the above chart look like:
Area Reduced difference
--------------------------
S.S. +3.3
N.S. -1.3
D. +1.0
which is interpreted as meaning that South Sydney has experienced a significant increase (i.e. important, and possibly related to a real underlying cause), while North Sydney and Dubbo felt relatively minor changes that may or may not point to a trend. Rule of thumb:
1 sigma changes are just noise
3 sigma changes probably point to an underlying cause (and therefore the expectation of a trend)
5 sigma changes almost certainly point to a trend
Areas with very low rates (like Dubbo) will still be volatile, but they won't overwhelm the display.
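A quick sketch that reproduces those reduced differences from the numbers in the question:

from math import sqrt

counts = {"South Sydney": (427, 530), "North Sydney": (167, 143), "Dubbo": (1, 3)}

for area, (old, new) in counts.items():
    reduced = (new - old) / sqrt(new + old)  # change over 1 sigma of counting statistics
    print(f"{area}: {reduced:+.2f}")
# South Sydney: +3.33, North Sydney: -1.36, Dubbo: +1.00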
This is really a statistics question. I'm not a statistician, but I suspect the answer is along the lines of well, you have no data — what do you expect‽
Perhaps you could merge Dubbo with a nearby region? You've sliced your data small enough that your signal has fallen below noise.
You could also just not show Dubbo, or make a color for not enough data.
I kinda like your transparency idea -- the data you're confident about is opaque and the data you're not confident about is transparent. It's easy for the user to understand, but it will look cluttered.
My take: don't use a heatmap. It's for continuous data, while yours is discrete. Use dots. Colour represents the increase/decrease in the surrounding region, and the size of the dot is proportional to the raw volume.
Now, how does the user know which region a dot represents? Where does South Sydney turn into North Sydney? The best approach would be to add Voronoi-like guiding lines between the dots, but smartly placed rectangles will do too.
If you happen to have the area of each region in units such as sq. km, you can normalize your data by calculating home approvals per km^2 to get a home-approval density, and use that in your equation rather than the raw count of home approvals. This would fix the problem if Dubbo has fewer home approvals than other regions because of its size. You could also normalize by population, if you have it, to get the number of home approvals per person.
Maybe you could use the totals. Adding all old and new values gives old = 595 and new = 676, i.e. diff = +13.6%. Then calculate the changes based on the old total, which gives you +17.3% / -4.0% / +0.3% for the three places.
With a heat map you are generally attempting to show easily assimilated information. Anything too complex would probably be counter-productive.
In the case of Dubbo, the reality is that you don't have the data to draw any firm conclusions about it, so I'd color it white, say. You could possibly label it with the difference/current value too.
I think this would be preferable to possibly misleading the users.
I would highly recommend going with a hierarchical model (i.e., partial pooling). Data Analysis Using Regression and Multilevel/Hierarchical Models by Gelman and Hill is an excellent resource on the topic.
You can use an exact test like Fisher's exact test (http://en.wikipedia.org/wiki/Fisher%27s_exact_test), or Student's t-test (http://en.wikipedia.org/wiki/Student%27s_t-test), both of which are designed for small sample sizes.
As a note, the t-test is pretty much the same as a z-test, but in the t-test you don't have to know the standard deviation, nor do you have to approximate it like you would for a z-test.
You can apply a z- or t-test without any justification in 99.99% of cases because of the Central Limit Theorem (http://en.wikipedia.org/wiki/Central_limit_theorem); formally you only need the underlying distribution X to have finite variance. You don't need justification for the Fisher test either: it's exact and does not make any assumptions.
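As a sketch of what that looks like for the data in the question, Fisher's exact test on Dubbo versus everywhere else (scipy.stats.fisher_exact takes a 2x2 contingency table):

from scipy.stats import fisher_exact

# rows: (Dubbo, all other areas); columns: (June, July)
table = [[1, 3],
         [427 + 167, 530 + 143]]
_, p_value = fisher_exact(table)
print(p_value)  # far from significant: Dubbo's jump of 2 could easily be noise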
I am new to Artificial Intelligence. I understand the K nearest neighbour algorithm and how to implement it. However, how do you calculate the distance or weight of things that aren't on a scale?
For example, the distance between ages can be easily calculated, but how do you calculate how near red is to blue? Maybe colour is a bad example because you could still use the frequency. How about a burger to pizza to fries, for example?
I got a feeling there's a clever way to do this.
Thank you in advance for your kind attention.
EDIT: Thank you all for very nice answers. It really helped and I appreciate it. But I am thinking there must be a way out.
Can I do it this way? Let's say I am using my KNN algorithm to predict whether a person will eat at my restaurant, which serves all three of the above foods. Of course there are other factors, but to keep it simple, for the field of favourite food, out of 300 people, 150 love burgers, 100 love pizza, and 50 love fries. Common sense tells me favourite food affects people's decision on whether to eat there or not.
So now a person enters his/her favourite food as burger, and I am going to predict whether he/she is going to eat at my restaurant. Ignoring other factors, and based on my previous (training) knowledge base, common sense tells me that there's a higher chance the k nearest neighbours' distance for this particular field, favourite food, is smaller than if he had entered pizza or fries.
The only problem with that is that I used probability, and I might be wrong because I don't know and probably can't calculate the actual distance. I also worry about this field putting too much or too little weight on my prediction, because the distance probably isn't on the same scale as the other factors (price, time of day, whether the restaurant is full, etc., which I can easily quantify), but I guess I might be able to get around that with some parameter tuning.
Oh, everyone put up a great answer, but I can only accept one. In that case, I'll just accept the one with highest votes tomorrow. Thank you all once again.
Represent all food for which you collect data as a "dimension" (or a column in a table).
Record "likes" for every person on whom you can collect data, and place the results in a table:
        | Burger | Pizza | Fries | Burritos | Likes my food
person1 |   1    |   0   |   1   |    1     |       1
person2 |   0    |   0   |   1   |    0     |       0
person3 |   1    |   1   |   0   |    1     |       1
person4 |   0    |   1   |   1   |    1     |       0
Now, given a new person, with information about some of the foods he likes, you can measure similarity to other people using a simple measure such as the Pearson Correlation Coefficient, or the Cosine Similarity, etc.
Now you have a way to find K nearest neighbors and make a decision.
For more advanced information on this, look up "collaborative filtering" (but I'll warn you, it gets math-y).
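A minimal sketch of that table plus cosine-similarity neighbours, reusing scipy as in the snippets above (the data and k are purely illustrative):

from scipy import spatial

# columns: Burger, Pizza, Fries, Burritos; label: likes my food (1/0)
people = {
    "person1": ([1, 0, 1, 1], 1),
    "person2": ([0, 0, 1, 0], 0),
    "person3": ([1, 1, 0, 1], 1),
    "person4": ([0, 1, 1, 1], 0),
}

def predict(new_person, k=3):
    # Majority vote among the k most cosine-similar people
    sims = sorted(((1 - spatial.distance.cosine(new_person, vec), label)
                   for vec, label in people.values()), reverse=True)
    votes = [label for _, label in sims[:k]]
    return int(sum(votes) > len(votes) / 2)

print(predict([1, 0, 0, 1]))  # someone who likes burgers and burritos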
Well, 'nearest' implies that you have some metric on which things can be more or less 'distant'. Quantification of 'burger', 'pizza', and 'fries' isn't so much a KNN problem as it's about fundamental system modeling. If you have a system where you're doing analysis where 'burger', 'pizza', and 'fries' are terms, the reason for the system to exist is going to determine how they're quantified -- like if you're trying to figure out how to get the best taste and least calories for a given amount of money, then ta-da, you know what your metrics are. (Of course, 'best taste' is subjective, but that's another set of issues.)
It's not up to these terms to have inherent quantifiability and thereby to tell you how to design your system of analysis; it's up to you to decide what you're trying to accomplish and design metrics from there.
This is one of the problems of knowledge representation in AI. Subjectivity plays a big part. Would you and I agree, for example, on the "closeness" of a burger, pizza and fries?
You'd probably need a look-up matrix containing the items to be compared. You may be able to reduce this matrix if you can assume transitivity, but I think even that would be uncertain in your example.
The key may be to try and determine the feature that you are trying to compare on. For example, if you were comparing your food items on health, you may be able to get at something more objective.
If you look at "Collective Intelligence", you'll see that they assign a scale and a value. That's how Netflix is comparing movie rankings and such.
You'll have to define "nearness" by coming up with that scale and assigning values for each.
I would actually present pairs of these attributes to users and ask them to define their proximity. You would present them with a scale ranging from [synonym..very foreign] or similar. Having many people do this, you will end up with a widely accepted proximity function for the non-numeric attribute values.
There is no "best" way to do this. Ultimately, you need to come up with an arbitrary scale.
Good answers. You could just make up a metric, or, as malach suggests, ask some people. To really do it right, it sounds like you need Bayesian analysis.