Why, in a transformer, does a larger attention score between two tokens make them more similar after the final layer? - transformer-model

In the last self-attention layer of a transformer, it seems that the larger the attention score between two tokens, the more similar they will be after that layer, i.e. they end up very close in the vector space. I don't understand the reason. Can someone explain it?

Related

How to calculate vectors with word2vec

Assume I have a number of pictures. Let’s say 10 pictures which are annotated by 50 people each.
So Pic 1 might be "beach, vacation, relax, sand, sun…". I have now trained word2vec on domain-specific content, so I have the vector of each word and can represent them. But what I want now is to create ONE final vector representing each picture, i.e. one vector which represents the 50 annotations (beach, vacation, relax, sand, sun…).
Let's assume each vector has 100 dimensions – do I just add up the first dimension across all 50 vectors, then the 2nd dimension across all 50 vectors, etc.?
I am very thankful for any comments that might help me!
I tried this, but I am not sure if this is the right way to do it.
I also tried doc2vec, but I guess this is problematic because the word order of the annotations is irrelevant – whereas it is relevant for doc2vec?
A few thoughts:
A list of annotations isn't quite like natural-language narrative text – in either the relative frequencies of tokens, or the importance of neighboring-tokens. So you may want to try out an extra-wide range of training parameters. For example, using a giant window (far larger than each of your texts) could essentially negate the (possibly-arbitrary) ordering of the annotations, putting every word in every other word's context. (That'd increase training time, but might help in other ways.) Also, look into the newly-tunable ns_exponent parameter - the paper referenced from the gensim docs suggests values very different from the default may help in certain recommendation contexts.
That said, the most-simple way to combine many vectors into one is to average them all together. Whether that works well for your purposes, you'd have to test. (It unavoidably loses the information of a larger set of independent vectors, but if other aspects of your modeling are strong enough – enough training data, enough dimensions – the really important shared aspects may be retained in the summary vector.)
(You can see some code for averaging word-vectors in another recent answer.)
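(For illustration, here is a minimal sketch of that averaging in plain Java, assuming the 50 annotation vectors of one picture have already been looked up from the trained model as equal-length float arrays; the method name and types are just placeholders.)

// Minimal sketch: average N word vectors (e.g. the 50 annotation vectors of one
// picture) into a single summary vector. Assumes all vectors have the same length.
static float[] averageVectors(java.util.List<float[]> wordVectors) {
    int dims = wordVectors.get(0).length;      // e.g. 100 dimensions
    float[] summary = new float[dims];
    for (float[] vec : wordVectors) {
        for (int d = 0; d < dims; d++) {
            summary[d] += vec[d];              // sum each dimension...
        }
    }
    for (int d = 0; d < dims; d++) {
        summary[d] /= wordVectors.size();      // ...then divide by the number of vectors
    }
    return summary;
}

The resulting per-picture vector can then be compared to other pictures' vectors with cosine similarity, just like individual word vectors.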
You could also try Doc2Vec. It is no more ordering-dependent than Word2Vec – some modes use a window-sized context where neighboring words influence each other. (But other modes don't, or, as mentioned above, an oversized window can essentially make neighboring-distances less-relevant).

How do I mutate a possible solution when the chromosomes are not simple numbers?

I'm just trying to understand GA, so please forgive any incorrect comments or assumptions here. I have basically got the idea of how you encode potential solutions and then combine and/or mutate them to find similar (but hopefully better) solutions.
This seems simple when you have genes that are nice and simple. For example, this tutorial describes how to use a GA to find a sequence of digits and mathematical operators that will hit a target number. Given a couple of potential solutions, I can combine them by taking (say) the first n bits of one and the last (len-n) bits from the other. However I combine and mutate, I'll get something that makes sense.
However, what if I'm trying to solve something where the genes are not so simple? For example, suppose I want to solve a puzzle like this...
The idea is to fit all the pieces inside the frame. I can represent the frame as an 8x8 array, and the pieces as smaller arrays, like this...
int[][] piece1 = new int[2][];
piece1[0] = new int[] {1, 1, 1};
piece1[1] = new int[] {0, 1, 0};
This describes the top left piece in the picture. I could describe the others in a similar manner.
I can then attempt to fit all of my pieces into the frame, and count the number of array elements left over as my error.
However, how would I combine two potential solutions to produce a third? I can't just swap individual array entries, as that would populate the frame with invalid pieces. Similarly, how would I mutate a potential solution?
Sorry if I'm on the wrong track altogether. As I said, I'm very new to this, and trying to learn. Any help would be appreciated.
A good genetic algorithm is all about the encoding and the operators, which are at least as important as the fitness function and selection rule. By choosing a naïve encoding, you can easily end up with an algorithm that takes forever to discover an improvement, and may need some elitism to prevent good solutions from being lost to selection. This, of course, does not mean that your algorithm would never find a solution, but it may make it impractical. Dealing with hard constraints like this is tricky exactly because it may mean rethinking your encoding.
A simple encoding for your problem may be just stating where to put each piece and in what orientation. Because overlaps will happen, you will need to reject candidates which have an overlap, either by giving them a prohibitive fitness (infinity, if available in your data type) or by discarding them immediately. (The former is easier to implement if you want to maintain a fixed population size.) Once you have such a check implemented, it's straightforward to apply it to any result of mutation or crossover as well. Depending on your strategy, you then either produce a candidate which will never be selected, retry, or simply don't generate a candidate from the current operation if it led to an unphysical solution.
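A rough sketch of that encoding and the overlap check, in Java to match your piece arrays: a candidate is one Placement per piece, and the fitness function bails out with a prohibitive value on the first overlap or out-of-bounds cell. The Placement fields and the rotate helper are assumptions for illustration, not a fixed recipe.

// One gene per piece: where its top-left corner goes and how it is rotated (0-3).
class Placement { int row, col, rotation; }

// Returns the number of empty frame cells (the error to minimize), or
// Integer.MAX_VALUE as the prohibitive fitness if any piece overlaps another
// piece or sticks out of the 8x8 frame.
static int fitness(Placement[] candidate, int[][][] pieces) {
    int[][] frame = new int[8][8];
    int filled = 0;
    for (int p = 0; p < candidate.length; p++) {
        int[][] shape = rotate(pieces[p], candidate[p].rotation); // hypothetical helper
        for (int r = 0; r < shape.length; r++) {
            for (int c = 0; c < shape[r].length; c++) {
                if (shape[r][c] == 0) continue;
                int fr = candidate[p].row + r, fc = candidate[p].col + c;
                if (fr < 0 || fr >= 8 || fc < 0 || fc >= 8 || frame[fr][fc] == 1) {
                    return Integer.MAX_VALUE;  // overlap or out of bounds: reject
                }
                frame[fr][fc] = 1;
                filled++;
            }
        }
    }
    return 64 - filled;                        // leftover cells, as in your error measure
}

Mutation can then just nudge one placement's row, column or rotation, and crossover can swap whole Placement entries between two parents; any offspring that scores the prohibitive value is handled as described above.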
Note that you may as well experiment with keeping the unphysical cases around, not with an infinite fitness, just a high one: perhaps a second genetic operation will remove the overlap introduced by the first and produce something good.
Now what could an alternative encoding look like? If you want an encoding that, by its own nature, prevents overlaps, maybe you could, instead of encoding the final position of each piece, encode the tiling by the order in which the pieces are added from the top, Tetris-style. I'm not saying this is better, because that would just trade the hard limit on overlap for a hard limit on the height of the resulting structure, but it's a start. Just like in the previous case, you can then convert the hard limit into a soft one (simply make fitness proportional to the height, and try to push it down to 8), leading to a reformulation of the problem equivalent to minimizing the number of unoccupied positions.
If you never want to even consider candidates that fail to conform in one way or another to the hard rules, and have no intention of softening them, you will need to come up with an encoding that never encodes anything with an overlap and, at the same time, never goes out of bounds. Building on the previous paragraph, you can make a distinction between a genotype and a phenotype: the genotype might be the Tetris-like encoding, and the phenotype the maximum prefix of it, cut just before the piece that would, in Tetris terminology again, lose the game. You would then use the genotype for mutations and crossovers, but the phenotype for fitness evaluation.
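A compressed sketch of that genotype/phenotype split, under the same piece-array assumptions as the previous snippet; dropPiece is a hypothetical helper that drops a piece into the frame Tetris-style and reports how many cells it occupies, or a negative value if it no longer fits.

// Genotype: the sequence in which pieces are dropped in from the top, Tetris-style
// (piece index, target column, rotation). Phenotype: the longest prefix of that
// sequence whose pieces all land inside the 8x8 frame.
class Drop { int piece, column, rotation; }

static int phenotypeFitness(Drop[] genotype, int[][][] pieces) {
    int[][] frame = new int[8][8];
    int filled = 0;
    for (Drop d : genotype) {
        int added = dropPiece(frame, pieces[d.piece], d.column, d.rotation); // hypothetical helper
        if (added < 0) break;   // this piece would "lose the game": cut the phenotype here
        filled += added;        // cells occupied by the successfully dropped piece
    }
    return 64 - filled;         // unoccupied cells left by the phenotype
}

Mutation and crossover operate on the full genotype (reordering drops, changing columns or rotations), while only the phenotype prefix is ever scored.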

Finding last wave in mql4/5

I was wondering if there's an efficient and easy way to determine waves in MQL4, just like the zigzag indicator does.
I was asked to help automate an indicator; for that I need to determine 'waves', essentially the maxima and minima of the price graph over some period of time (which is vague and all relative).
I don't have a clear image of how I want the indicator to work, but it would be something like this:
Find the last wave, i.e. where the direction of the price last changed (neglecting the noise), and then, for example, mark it with a trend line.
Is it possible to use the zigzag structure to find that point where the direction changed? (Possibly not the only one; I might need to find more than just the last point, but also the preceding ones, so I will want to adapt the algorithm.)
I know it's a while since you asked this question and you probably already have an answer, but if not...
I dislike Zigzag and have not found a way to do what I want to do with it, so I will answer the last part of your question with a no, and believe me I tried.
The way I prefer is to find bars that conform to the classic definition of fractals/swing points (i.e. a high with two lower highs on either side, or a low with two higher lows on either side), then try to make up for the shortcomings. E.g. often there will be two high fractals/swings/waves in a row without an intermediate low fractal/swing/wave, so I add the best intermediate low point as a wave, or remove one of the highs (e.g. if the first one wasn't as subjectively significant). Some of the swing points that are identified are 'noisy', to use your term, and not ones that a human trader would have picked, so these need to be dealt with, and so on. If you go down this route it is a long one; computers make many mistakes identifying appropriate swing points, so it is unfortunately not what I would call easy, but it is accurate, and how many easy indicators are there that actually make money over the long run?
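As a rough illustration of that classic swing-point test (plain Java over arrays of bar highs and lows rather than MQL; the layout, oldest bar first, is an assumption):

// Classic 5-bar fractal / swing-point test: bar i is a swing high if its high is
// above the highs of the two bars on each side, and a swing low if its low is
// below the lows of the two bars on each side.
static boolean isSwingHigh(double[] high, int i) {
    return i >= 2 && i <= high.length - 3
        && high[i] > high[i - 1] && high[i] > high[i - 2]
        && high[i] > high[i + 1] && high[i] > high[i + 2];
}

static boolean isSwingLow(double[] low, int i) {
    return i >= 2 && i <= low.length - 3
        && low[i] < low[i - 1] && low[i] < low[i - 2]
        && low[i] < low[i + 1] && low[i] < low[i + 2];
}

The cleanup described above (dropping one of two consecutive swing highs, or inserting the best intermediate low between them) would run as a second pass over the bars these tests accept.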

Decoding Permutated English Strings

A coworker was recently asked this when trying to land a (different) research job:
Given 10 128-character strings which have been permutated in exactly the same way, decode the strings. The original strings are English text with spaces, numbers, punctuation and other non-alpha characters removed.
He was given a few days to think about it before an answer was expected. How would you do this? You can use any computer resource, including character/word level language models.
This is a basic transposition cipher. My question above was simply to determine if it was a transposition cipher or a substitution cipher. Cryptanalysis of such systems is fairly straightforward. Others have already alluded to basic methods. Optimal approaches will attempt to place the hardest and rarest letters first, as these will tend to uniquely identify the letters around them, which greatly reduces the subsequent search space. Simply finding a place to place an "a" (no pun intended) is not hard, but finding a location for a "q", "z", or "x" is a bit more work.
The overarching goal for an algorithm here isn't just to decipher the text, as that can be done by better-than-brute-force methods, nor is it simply to be fast; it should eliminate possibilities as quickly as possible.
Since you can use multiple strings simultaneously, attempting to create words from the rarest characters is going to allow you to test dictionary attacks in parallel. Finding the correct placement of the rarest terms in each string as quickly as possible will decipher that ciphertext PLUS all of the others at the same time.
If you search for cryptanalysis of transposition ciphers, you'll find a bunch with genetic algorithms. These are meant to advance the research cred of people working in GA, as they are not really optimal in practice. Instead, you should look at some basic optimization methods, such as branch and bound, A*, and a variety of statistical methods. (How deep you should go depends on your level of expertise in algorithms and statistics. :) I would switch between deterministic methods and statistical optimization methods several times.)
In any case, the calculations should be dirt cheap and fast, because the scale of initial guesses could be quite large. It's best to have a cheap way to filter out a LOT of possible placements first, then spend more CPU time on sifting through the better candidates. To that end, it's good to have a way of describing the stages of processing and the computational effort for each stage. (At least that's what I would expect if I gave this as an interview question.)
You can even buy a fairly credible reference book on deciphering double transposition ciphers.
Update 1: Take a look at these slides for more ideas on iterative improvements. It's not a great reference set of slides, but it's readily accessible. What's more, although the slides are about GA and simulated annealing (methods that come up a lot in search results for transposition cipher cryptanalysis), the author advocates against such methods when you can use A* or other methods. :)
first, you'd need a test for the correct ordering. something fairly simple like being able to break the majority of texts into words using a dictionary ordered by frequency of use without backtracking.
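a crude sketch of such a test (greedy longest-match rather than frequency-ordered, which is a simplification; the dictionary set is assumed to be loaded already):

// Greedy, no-backtracking segmentation: at each position take the longest
// dictionary word and report how much of the string ends up covered.
// For a correctly ordered English string this fraction should be close to 1.0.
static double coveredFraction(String text, java.util.Set<String> dictionary, int maxWordLen) {
    int covered = 0, i = 0;
    while (i < text.length()) {
        int matched = 0;
        for (int len = Math.min(maxWordLen, text.length() - i); len >= 1; len--) {
            if (dictionary.contains(text.substring(i, i + len))) { matched = len; break; }
        }
        if (matched > 0) { covered += matched; i += matched; }
        else { i++; }                          // no word starts here: skip one character
    }
    return covered / (double) text.length();
}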
once you have that, you can play with various approaches. two i would try are:
using a genetic algorithm, with scoring based on 2 and 3-letter tuples (which you can either get from somewhere or generate yourself). the hard part of genetic algorithms is finding a good description of the process that can be fragmented and recomposed. i would guess that something like "move fragment x to after fragment y" would be a good approach, where the indices are positions in the original text (and so change as the "dna" is read). also, you might need to extend the scoring with something that gets you closer to "real" text near the end - something like the length over which the verification algorithm runs, or complete words found.
using a graph approach. you would need to find a consistent path through the graph of letter positions, perhaps with a beam-width search, using the weights obtained from the pair frequencies. i'm not sure how you'd handle reaching the end of the string and restarting, though. perhaps 10 sentences is sufficient to identify with strong probability good starting candidates (from letter frequency) - wouldn't surprise me.
this is a nice problem :o) i suspect 10 sentences is a strong constraint (for every step you have a good chance of common letter pairs in several strings - you probably want to combine probabilities by discarding the most unlikely, unless you include word start/end pairs) so i think the graph approach would be most efficient.
Frequency analysis would drastically prune the search space. The most-common letters in English prose are well-known.
Count the letters in your encrypted input and put them in most-common order. Matching most-counted to most-counted, translate the ciphertext back into an attempted plaintext. It will be close to right, but likely not exact. By hand, iteratively tune your permutation until plaintext emerges (typically only a few iterations are needed).
If you find checking by hand odious, run attempted plain texts through a spell checker and minimize violation counts.
First you need a scoring function that increases as the likelihood of a correct permutation increases. One approach is to precalculate the frequencies of triplets in standard English (get some data from Project Gutenberg) and add up the frequencies of all the triplets in all ten strings. You may find that quadruplets give a better outcome than triplets.
Second you need a way to produce permutations. One approach, known as hill-climbing, takes the ten strings and enters a loop. Pick two random integers from 1 to 128 and swap the associated letters in all ten strings. Compute the score of the new permutation and compare it to the old permutation. If the new permutation is an improvement, keep it and loop; otherwise keep the old permutation and loop. Stop when the rate of improvement falls below some predetermined threshold. Present the outcome to the user, who may accept it as given, accept it and make changes manually, or reject it, in which case you start again from the original set of strings with a different random seed.
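A minimal sketch of that loop in Java, assuming a score() helper that sums precomputed log-frequencies of the letter triplets in all ten strings (higher is better); it stops after a fixed number of consecutive non-improving swaps, one simple way of detecting that improvements have slowed:

// Hill-climbing over column permutations: swap two positions in all ten strings
// at once, keep the swap only if the trigram score improves, otherwise revert it.
static char[][] hillClimb(char[][] strings, java.util.Random rng, int maxStale) {
    double best = score(strings);              // assumed trigram-frequency scorer
    int stale = 0;
    while (stale < maxStale) {
        int a = rng.nextInt(128), b = rng.nextInt(128);
        for (char[] s : strings) {             // apply the same swap to every string
            char tmp = s[a]; s[a] = s[b]; s[b] = tmp;
        }
        double candidate = score(strings);
        if (candidate > best) { best = candidate; stale = 0; }
        else {                                 // revert the swap and count a failure
            for (char[] s : strings) { char tmp = s[a]; s[a] = s[b]; s[b] = tmp; }
            stale++;
        }
    }
    return strings;
}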
Instead of hill-climbing, you might try simulated annealing. I'll refer you to Google for details, but the idea is that instead of always keeping the better of the two permutations, sometimes you keep the lesser of the two permutations, in the hope that it leads to a better overall outcome. This is done to defeat the tendency of hill-climbing to get stuck at a local maximum in the search space.
By the way, it's "permuted" rather than "permutated."

Data structure for storing entities in game

I have a tiled map which is made of chunks, and each chunk is a rectangle of tiles. Now I want to add entities to this (each entity stands on a specific tile), and every loop all the entities should call their update() function, so I wanted to ask for a suggestion: what data structure should I use for storing their locations?
I don't know yet what kinds of methods I'll need, but I'll probably need a method that gets all the entities in a specific area (maybe at a point), for drawing for example. This is a critical question because there may be a huge map, say 100x100 chunks where each chunk is 30x20 tiles, so 3000x2000 tiles, and lots of entities, for example 1000; if I save them in a list it will be very slow to search for an entity, O(n), and if every entity makes a search it will take O(n^2).
Right now I have a couple of solutions but they are all problematic:
kd-tree (for 2d) - since all the entities can change their locations each loop, the complexity of updating them is the same as rebuilding the whole tree each loop, O(n log n).
each chunk will save the entities that belong to it - my best solution so far, easy updating, but the complexity is higher than with the kd-tree.
So does anybody have a suggestion for this problem?
A dictionary that maps a tile (position) to a list of all entities on that tile. All entities should have a position property and an event notifying when it changes, so that the dictionary can be updated on each movement.
(There should be no list for tiles without entities. The list should be created when an entity moves to that position, and removed when the last entity leaves it.)
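A minimal sketch of that dictionary in Java; the Tile and Entity types, and equals/hashCode on Tile, are assumed to exist:

// Maps each occupied tile to the list of entities standing on it.
// Lists are created lazily and removed again when the last entity leaves.
class EntityIndex {
    private final java.util.Map<Tile, java.util.List<Entity>> byTile = new java.util.HashMap<>();

    // Called from the entity's position-changed event.
    void onEntityMoved(Entity e, Tile from, Tile to) {
        java.util.List<Entity> old = byTile.get(from);
        if (old != null) {
            old.remove(e);
            if (old.isEmpty()) byTile.remove(from);   // keep no empty lists around
        }
        byTile.computeIfAbsent(to, t -> new java.util.ArrayList<>()).add(e);
    }

    // O(1) lookup for drawing or per-tile queries; an area query is a loop over its tiles.
    java.util.List<Entity> entitiesAt(Tile t) {
        return byTile.getOrDefault(t, java.util.Collections.emptyList());
    }
}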
This might be a crude suggestion, and I'm sure it can be improved, but here's a thought:
First, store your positions in such a way that you can access them in constant time given a specific object. For instance, if you want to access them directly through your entities, you can store the position structs in a list/vector and give each entity a pointer/reference to its position.
Second, store an entity pointer/reference or GUID in the same struct as the entity position, so you can identify an entity based on a position object. (There is probably a better way I'm not thinking of right now though.)
Third, utilize some of the principles of sweep and prune/sort and sweep (common in 3D games): Keep two sorted position lists/vectors, one sorted in the x direction and the other sorted in the y direction. One can hold the actual position objects, and the other can hold pointers/references. These lists can take advantage of temporal coherence, so the cost of keeping them sorted shouldn't be too high, unless there's a lot of fast and chaotic movement.
An advantage of this setup is that it's really easy to figure out where every object is relative to each other. Want to know how many objects are within 10 squares of Billy the Elf in either direction? Check Billy's position and iterate forward/backward through both lists until you reach an entity more than 10 squares away in each direction.
If you're interested in the concept, look up sort and sweep (also known as sweep and prune). You'd only be using the first half of the algorithm, but it's used for broad-phase collision detection in practically every major 3D physics engine, so you know that it has to be fast in general. ;) There's a lot of information floating around about it, so you'll probably find much more sophisticated implementation ideas floating around too. (For instance, I don't like the indirection involved in storing a sorted list of pointers/references to position structs; working with the actual structs is more cache-efficient, but then you need to update the position in two places if you want to exploit temporal coherency with persistent arrays. Someone else may have thought of a more clever design that's escaping me right now.)
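A rough sketch of the range query and the per-frame re-sort described above, assuming entities expose an x() accessor and that index is the position of the anchor entity (Billy) in the x-sorted list:

// Walk outward from the anchor entity until the x-distance exceeds the range.
static java.util.List<Entity> withinX(java.util.List<Entity> sortedByX, int index, int range) {
    java.util.List<Entity> result = new java.util.ArrayList<>();
    int x = sortedByX.get(index).x();
    for (int i = index - 1; i >= 0 && x - sortedByX.get(i).x() <= range; i--)
        result.add(sortedByX.get(i));          // walk left until out of range
    for (int i = index + 1; i < sortedByX.size() && sortedByX.get(i).x() - x <= range; i++)
        result.add(sortedByX.get(i));          // walk right until out of range
    return result;
}

// Insertion sort per frame: near-linear when positions barely change between
// frames, which is exactly the temporal coherence mentioned above.
static void resortByX(java.util.List<Entity> sortedByX) {
    for (int i = 1; i < sortedByX.size(); i++) {
        Entity e = sortedByX.get(i);
        int j = i - 1;
        while (j >= 0 && sortedByX.get(j).x() > e.x()) {
            sortedByX.set(j + 1, sortedByX.get(j));
            j--;
        }
        sortedByX.set(j + 1, e);
    }
}

The same pair of functions, with y() instead of x(), covers the second list; intersecting the two results gives the entities within the square neighborhood.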
EDIT: I'd comment on Erik H's idea, but my rep isn't high enough. I just wanted to say that his idea sounds very well suited to your game, especially if you will have a lot of entities tightly packed on the same tile or in a small neighborhood. If I were you, I'd probably try it before the sweep and prune idea. However, it should be accompanied by a well-planned memory management strategy: if you have a dictionary of tile locations that naively maps to vectors of entities, you're going to have a lot of memory being allocated and freed as entities move from one tile to another. Instead, you'll want to implement his idea as something more like a dictionary/linked-list combo:
The dictionary keys would be tile positions, and the dictionary would return a single pointer to a linked list node. This node would be part of a linked list of all entities on the same tile. Whenever an entity moves from one tile to another, it will be removed from its current linked list and added to the new one. If an entity moves to an empty tile, it will be in a linked list all on its own, and it should be added to the dictionary. When the last entity moves from a tile, the entry for that tile should be removed from the dictionary. This will allow you to move around entities without continual dynamic allocation/deallocation, since you're just updating pointers (and the dictionary will probably be pretty memory efficient).
Note that you don't have to store full-blown entities in the linked lists, either; you can easily create your linked list out of lightweight objects (containing a pointer or GUID to the actual entity).
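A minimal sketch of that combo, using a lightweight node per entity as suggested; the Tile type (with equals/hashCode) and the Entity type are assumed:

// One node per entity; moving between tiles is just pointer updates, with no
// per-move container allocation beyond the map entries themselves.
class Node {
    final Entity entity;
    Node prev, next;
    Tile tile;
    Node(Entity e) { entity = e; }
}

class TileIndex {
    private final java.util.Map<Tile, Node> heads = new java.util.HashMap<>();

    void move(Node n, Tile to) {
        if (n.tile != null) {                       // unlink from the old tile's list
            if (n.prev != null) n.prev.next = n.next;
            else if (n.next != null) heads.put(n.tile, n.next);
            else heads.remove(n.tile);              // last entity left: drop the entry
            if (n.next != null) n.next.prev = n.prev;
        }
        Node first = heads.get(to);                 // link at the front of the new list
        n.prev = null;
        n.next = first;
        if (first != null) first.prev = n;
        heads.put(to, n);
        n.tile = to;
    }
}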
