Masked self-attention in tranformer's decoder - transformer-model

I'm writing my thesis about attention mechanisms. In the paragraph in which I explain the decoder of transformer I wrote this:
The first sub-layer is called masked self-attention, in which the masking operation consists in preventing the decoder from paying attention to subsequent words.
That is to say, while training a transformer for translation purposes, it is possible to access the target translation; on the other hand, during the inference, that is the translation of new sentences, it is not possible to access the target translation. Therefore, when calculating the probabilities of the next word in the sequence, the network must not access that word. Otherwise, the translation task would be banal and the network would not learn to predict the translation correctly.
I don't know if I said something wrong also in the previous part, but my professor thinks I made mistakes in the following part:
To understand in a simple way the functioning of the masked self-attention level, let's go back to the example “Orlando Bloom loves Miranda Kerr” (x1 x2 x3 x4 x5).
If we consider the inputs as vectors x1, x2. x3. x4. x5 and we want to translate the word x3 corresponding to "loves", you need to make sure that the following words x4 and x5 do not influence the translation y3. To prevent this influence, masking sets the weights of x4 and x5 to zero. Then a normalization of the weights is performed so that the sum of the elements of each column in the matrix is equal to 1. The result is a matrix with normalized weights in each column.
Can someone please tell me where the miskates are?

Related

Mutually exclusivity in Problog

We have
4 different storage spaces, and
5 different boxes (named b1, b2, b3, b4 and b5) which they wanted to put in this storage spaces.
Each storage space can be filled with only one unique box at a time.
*But B5 has a special condition which allows to be used in multiple storage spaces at the same time.
Each box has specific weight as assign to it (b1:4, b2:6, b3:5, b4:6 and b5:5).
Each box has a specific probability to be filled in to the storage spaces (b1:1, b2:0.6, b3=1, b4=0.8, b5=1).
We try to get the probable content of the storage spaces and their probabilities if the total weight is 22. ! (which we will use this as an evidence mechanism)
For example :
SS1 - b2(6)
SS2 - b5(5)
SS3 - b4(6)
SS4 - b5(5)
Where the total weight will be 22
And the probability of this content.
In my code bellow I get the answer for one of the probable content as totalboxweight(b2, b5, b4, b5, 22) which is okay for me. It means first box b2 is in first storage space, b5 is in second storage space and so on.
Here is my code so far, I add comments also to explain my intentions
But I need help to update it add the probabilities and apply some of the conditions I talked about.
box(b1,4).
box(b2,6).
box(b3,5).
box(b4,6).
box(b5,5). % I tried to define the boxes but I dont know how to assign probabilites to them in this format
total(D1,D2,D3,D4,Sum) :-
Sum is D1+D2+D3+D4. % I defined the sum calculation
totalboxweight(A,B,C,D,Sum) :-
box(A,D1), box(B,D2) , box(C,D3), box(D,D4),
total(D1,D2,D3,D4,Sum). % I am sum up all weights
sumtotal(Sum) :-
box(A,D1), box(B,D2) , box(C,D3), box(D,D4),
total(D1,D2,D3,D4,Sum). % I defined this one to use it as an evidence
evidence(sumtotal(22),true). % if we know the total weight is 22
query(totalboxweight(D1,D2,D3,D4,22)). % what is the probable content
I am using an online Problog editor to test my code. Here is the link.
And I am trying to do it in Problog not Prolog, so the syntax is different.
Right now with the help of answers I overcome some issues, the problems I still have ;
I couldn't apply probabilities
I couldn't apply the condition ( Each storage space can be filled with only one unique box at a time. But B5 has a special condition which allows to be used in multiple storage spaces at the same time.)
Thanks you in advance.

Solving a puzzle using swi-prolog

I've been given as an assignment to write using prolog a solver for
the battleships solitaire puzzle. To those unfamiliar, the puzzle deals
with a 6 by 6 grid on which a series of ships are placed according to the provided
constraints on each row and column, i.e. the first row must contain 3 squares with ships, the second row must contain 1 square with a ship, the third row must contain 0 squares etc for the other rows and columns.
Each puzzle comes with it's own set of constraints and revealed squares, typically two. An example can be seen here:
battleships
So, here's what I've done:
step([ShipCount,Rows,Cols,Tiles],[ShipCount2,Rows2,Cols2,Tiles2]):-
ShipCount2 is ShipCount+1,
nth1(X,Cols,X1),
X1\==0,
nth1(Y,Rows,Y1),
Y1\==0,
not(member([X,Y,_],Tiles)),
pairs(Tiles,TilesXY),
notdiaglist(X,Y,TilesXY),
member(T,[1,2,3,4,5,6]),
append([X,Y],[T],Tile),
append([Tile],Tiles,Tiles2),
dec_elem1(X,Cols,Cols2),dec_elem1(Y,Rows,Rows2).
dec_elem1(1,[A|Tail],[B|Tail]):- B is A-1.
dec_elem1(Count,[A|Tail],[A|Tail2]):- Count1 is Count-1,dec_elem1(Count1,Tail,Tail2).
neib(X1,Y1,X2,Y2) :- X2 is X1,(Y2 is Y1 -1;Y2 is Y1+1; Y2 is Y1).
neib(X1,Y1,X2,Y2) :- X2 is X1-1,(Y2 is Y1 -1;Y2 is Y1+1; Y2 is Y1).
neib(X1,Y1,X2,Y2) :- X2 is X1+1,(Y2 is Y1 -1;Y2 is Y1+1; Y2 is Y1).
notdiag(X1,Y1,X2,Y2) :- not(neib(X1,Y1,X2,Y2)).
notdiag(X1,Y1,X2,Y2) :- neib(X1,Y1,X2,Y2),((X1 == X2,t(Y1,Y2));(Y1 == Y2,t(X1,X2))).
notdiaglist(X1,Y1,[]).
notdiaglist(X1,Y1,[[X2,Y2]|Tail]):-notdiag(X1,Y1,X2,Y2),notdiaglist(X1,Y1,Tail).
t(X1,X2):- X is abs(X1-X2), X==1.
pairs([],[]).
pairs([[X,Y,Z]|Tail],[[X,Y]|Tail2]):-pairs(Tail,Tail2).
I represent a state with a list: [Count,Rows,Columns,Tiles]. The last state must be
[10,[0,0,0,0,0,0],[0,0,0,0,0,0], somelist]. A puzzle starts from an initial state, for example
initial([1, [1,3,1,1,1,2] , [0,2,2,0,0,5] , [[4,4,1],[2,1,0]]]).
I try to find a solution in the following manner:
run:-initial(S),step(S,S1),step(S1,S2),....,step(S8,F).
Now, here's the difficulty: if i restrict myself to one type of ship parts by using member(T,[1])
instead of
member(T,[1,2,3,4,5,6])
it works fine. However, when I use the full range of possible values for T which are needed
later, the query never ends since it runs for too long. this puzzles me, since :
(a) it works for 6 types of ships but only for 8 steps instead of 9
(b) going from a single type of ship to 6 types increases the number
of options for just the last step by a factor of 6, which
shouldn't have such a dramatic effect.
So, what's going on?
To answer your question directly, what's going on is that Prolog is trying to sift through an enormous space of possibilities.
You're correct that altering that line increases the search space of the last call by a factor of six, note that the size of the search space of, say, nine calls, isn't proportional to 9 times the size of one call. Prolog will backtrack on failure, so it's proportional (bounded above, actually) to the size of the possible results of one call raised to the ninth power.
That means we can expect the size of the space Prolog needs to search to grow by at most a factor of 6^9 = 10077696 when we allow T to take on 6 times as many values.
Of course, it doesn't help that (as far as I was able to tell) a solution doesn't exist if we call step 9 times starting with initial anyways. Since that last call is going to fail, Prolog will keep trying until it's exhausted all possibilities (of which there are a great many) before it finally gives up.
As far as a solution goes, I'm not sure I know enough about the problem. If the value if T is the kind of ship that fits in the grid (e.g. single square, half of a 2-square-ship, part of a 3-square-ship) you should note that that gives you a lot more information than the numbers on the rows/columns.
Right now, in pseudocode, your step looks like this:
Find a (X,Y) pair that has non-zero markings on its row/column
Check that there isn't already a ship there
Check that it isn't diagonal to a ship
Pick a kind of ship-part for it to be.
I'd suggest you approach like this:
Finish any already placed ship-bits to form complete ships (if we can't: fail)
Until we're finished:
Find acceptable places to place ship
Check that the markings on the row/column aren't zero
Try to place an entire ship here. (instead of a single part)
By using the most specific information that we have first (in this case, the previously placed parts), we can reduce the amount of work Prolog has to do and make things return reasonably fast.

Online machine learning algorithm for complex dynamical system

I have a complex dynamical system which takes input as x1, x2, x3 and gives output as y1, y2, y3. I don't have any mathematical model of the system. x(k) is the present input to the system and y(k) is the present output of the system.
My Objective is to find x1(k+1),x2(k+1),x3(k+1) to maximize y1(k+1) + y2(k+1) + y3(k+1). I can only access present state(k) and past state(k-1) values.
Intuitively y_i(k)=f(x1(k),x2(k),x3(k)) where i=1,2,3 .
The system only take inputs s.t a < x1,x2,x3 < b.
Is there any online machine learning algorithm which can be applied to solve this problem?
I really appreciate any feedback.
Thanks and regards
Marcella

Regarding the use of the Markov chain algorithm for generating text

I'm reading the book 'The Practice of Programming' by Brian W. Kernighan and Rob Pike. Chapter 3 provides the algorithm for a Markov chain approach that reads a source text and uses it to generate random text that "reads well" (meaning the output is closer to proper-sounding English than gibberish):
set w1 and w2 to the first two words in the source text
print w1 and w2
loop:
randomly choose w3, one of the successors of prefix w1 and w2 in the source text
print w3
replace w1 and w2 by w2 and w3
repeat loop
My question is: what's the standard way to handle the situation where the new values for w2 and w3 have no successor in the source text?
Many thanks in advance!
Here your options:
Choose a word at random? (Always works)
Choose a new W2 at random? (Can conceivably still loop)
Back up to previous W1 and W2? (Can conceivably still loop)
I'd probably go with trying #2 or #3 once, then fallback to #1 -- which will always work.
The situation you are describing considers 3-grams, that is the statistical frequency of a 3-tuple in a given dataset. To create a Markov matrix with no adsorbing states, that is no points where a f_2(w1,w2) -> w3 and f_2(w2,w3) = 0, you'll have to extend the possibilities. A generalized extension to #ThomasW's answers would be:
If the set predictor f_2(w1,w2) != 0 draw from that
If the set predictor f_1(w2) != 0 draw from that
If the set predictor f_0() != 0 draw from that
That is, draw like normally from the 3-gram set, than the 2-gram set than the 1-gram set. At the last step you'll simply be drawing a word at random weighted by it's statistical frequency.
I believe that this is a serious problem in NLP, one without a straightforward solution. One approach is to tag the parts of speech in addition to the actual words, in order to generalize the mappings. Using parts of speech, the program can at least predict what part of speech should follow the words W2 and W3 if there is no precedent for the word sequence. "Once this mapping has been performed on training examples, we can train a tagging model on these training examples. Given a new test sentence we can then recover the sequence of tags from the model, and it is straightforward to identify the entities identified by the model." From Columbia notes.

DSP - Converting a sampled signal from real samples to complex samples and vice versa

How can I convert baseband sampled signal from real-valued samples to complex-valued samples (real,imaginary) and vice-versa.
My samples are integers, and I'm looking for a fast (but accurate) conversion algorithms.
A C++ sample code (real, not complex ;-) would be more than welcome.
Edit: IPP code will be much welcome.
Edit: I'm looking for a method that will convert n real-samples to n/2 complex-samples and vice-versa, without affecting the bandwidth.
Adding zeros as the imaginary is conceptually the first step in what you want to do. Initially you have a real only signal that looks like this in the frequency domain:
[r0, r1, r2, r3, ...]
/-~--------\
DC +Fs/2
If you stuff it with zeros for the imaginary value, you'll see that you really have both positive and negative frequencies as mirror images:
[r0 + 0i, r1 + 0i, r2 + 0i, r3 + 0i, ...]
/--------~-\ /-~--------\
-Fs/2 DC +Fs/2
Next, you multiply that signal in the time domain by a complex tone at -Fs/4 (tuning the signal). Your signal will look like
----~-\ /-~--------\ /------
DC
So now, you filter out the center half and you get:
________/-~--------\________
DC
Then you decimate by two and you end up with:
/-~--------\
Which is what you want.
All of these steps can be performed efficiently in the time domain. If you pay attention to all of the intermediate steps, you'll notice that there are many places where you're multiplying by 0, +1, -1, +i, or -i. Furthermore, the half band low pass filter will have a lot of zeros and some symmetry to exploit. Since you know you're going to decimate by 2, you only have to calculate the samples you intend to keep. If you work through the algebra, you'll find a lot of places to simplify it for a clean and fast implementation.
Ultimately, this is all equivalent to a Hilbert transform, but I think it's much easier to understand when you decompose it into pieces like this.
Converting back to real from complex is similar. You'll stuff it with zeroes for every other sample to undo the decimation. You'll filter the complex signal to remove an alias you just introduced. You'll tune it up by Fs/4, and then throw away the imaginary component. (Sorry, I'm all ascii-arted out... :-)
Note that this conversion is lossy near the boundaries. You'd have to use an infinite length filter to do it perfectly.
If you want to create a complex vector with a strictly real spectrum, just add an imaginary component of 0.0 to every sample. Depending on your data format, this may be as easy as creating a double length memory array, zeroing it, and copying from every element of the source into every other element of the destination.
If you want to convert a complex vector containing complex data (non-zero imaginary components above your required minimum noise floor) into a real vector, you will need to double your bandwidth in order to not lose information, which may or may not make sense, unless you are modulating, demodulating or filtering the signal.
If you want to produce a one-sided signal with a complex spectrum from a real vector, you can use a Hilbert transform (or filter) to create an imaginary component with the same spectrum but conjugate phase (except for DC). This would probably not be both fast and accurate.
I'm not sure if that is what you're looking for but you might want to check the Hilbert Transform, which can be used to find the analytic representation of real-valued signals, i.e., a signal with the same amount of information but with no negative frequency components.
Such representation is mostly useful in Digital Signal Processing techniques employing Spectral Shifting such as Single Sideband Modulation, an efficient form of Amplitude Modulation (AM) that uses half the bandwidth used by the raw AM.
I don't have enough points to vote zml up yet, but his is clearly the right answer. The Hilbert transform essentially converts your real-valued signal into its more natural domain, where the components of sound are complex "phasors" rather than sine waves. It does this by essentially chopping of half of the Fourier spectrum, which involves a single choice of "helicity" (i.e. cw vs ccw) but allows you to do things like perfectly pitch shift by multiplying by a single phasor. The possibilities are endless, and I hope this complex representation of audio catches on!
Intel Performance Primitives (IPP) has a function which does exactly this.
From their documentation:
The ippsHilbert function computes a complex analytic signal,
which contains the original real signal as its real part and
computed Hilbert transform as its imaginary part.
https://software.intel.com/content/www/us/en/develop/documentation/ipp-dev-reference/top/volume-1-signal-and-data-processing/transform-functions/hilbert-transform-functions/hilbert.html#hilbert

Resources