I have a question about how we should do the testing phase in transformers for time series forecasting

In this article https://medium.com/inside-machine-learning/what-is-a-transformer-d07dd1fbec04 the author feeds the decoder an empty sequence, with the start token as its first element, during the inference phase. By "empty sequence" I mean a sequence with the start token as the first element and zeros for the remaining elements. For instance, if the output window has size 3, I consider a sequence of length three (the first element is the start token and the rest are zeros); at each step, one of these zeros is replaced with the newly generated token. This continues until the sequence fed to the decoder is filled with generated tokens. I want to know if I am on the right path. Is the empty sequence I described the right approach?
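To make the procedure concrete, here is a minimal sketch of that decoding loop, assuming a PyTorch seq2seq transformer whose decoder uses a causal mask; the model stub, window sizes, and scalar start value are all hypothetical placeholders, not taken from the article.

```python
import torch

def model(src, tgt):
    # Stand-in for a trained seq2seq transformer with a causal decoder
    # mask; returns dummy per-position predictions so the sketch runs.
    return torch.randn_like(tgt)

input_window, output_window = 10, 3               # hypothetical sizes
encoder_input = torch.randn(1, input_window, 1)   # (batch, len, features)

START = 0.0  # hypothetical start-of-sequence value for a scalar series
decoder_input = torch.zeros(1, output_window, 1)  # zeros = "empty" slots
decoder_input[0, 0, 0] = START                    # start token first

with torch.no_grad():
    for step in range(1, output_window):
        # With a causal mask, the still-zero positions >= step cannot
        # influence the prediction for position step.
        out = model(encoder_input, decoder_input)
        decoder_input[0, step, 0] = out[0, step - 1, 0]

forecast = decoder_input[0, 1:, 0]  # the generated tokens
```

Following the question's scheme, the start token occupies one of the three slots, so the loop fills the remaining zeros one step at a time.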

Related

Search data from a data set without reading each element

I have just started learning algorithms and data structures and I came across an interesting problem.
I need some help in solving the problem.
There is a data set given to me. Within the data set are characters, each with an associated number. I have to evaluate the sum of the largest numbers associated with each of the characters present. The list is not sorted by character; however, each character's entries appear as one contiguous group, with no further instances of that character elsewhere in the data set.
Moreover, the largest number associated with each character always appears at the last position of that character's group. We know the length of the entire data set and we can retrieve an entry by specifying its line number.
For Eg.
C-7
C-9
C-12
D-1
D-8
A-3
M-67
M-78
M-90
M-91
M-92
K-4
K-7
K-10
L-13
length=15
get(3) = D-1 (0-indexed; returned as an object with character D and value 1)
The answer for the above should be 13+10+92+3+8+12 = 138, as those are the highest numbers associated with L, K, M, A, D, C respectively.
The simplest solution is, of course, to go through all of the elements, but what is the most efficient algorithm (one that reads fewer entries than the length of the data set)?
You'll have to go through them one by one, since you can't be certain what the key is.
Just for the sake of easy manipulation, I would loop over the data set and check whether the key at index i is equal to the key at index i+1; if it's not, you have a local max.
Then store that value in a hash or dictionary if there's no existing entry for that key; if there is, check whether the existing value is less than the current one, and overwrite it if so.
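A minimal Python sketch of that dictionary approach, assuming the get(i) accessor from the question returns (character, value) pairs; tracking a running maximum per character is equivalent to the local-max check and also covers unsorted groups:

```python
def sum_of_group_maxima(get, length):
    # Track the best value seen per character; since the largest value
    # ends each contiguous group, this overwrites until the group ends.
    best = {}
    for i in range(length):
        char, value = get(i)
        if char not in best or value > best[char]:
            best[char] = value
    return sum(best.values())

# Usage with the question's example data, via a get(i)-style accessor:
data = [("C", 7), ("C", 9), ("C", 12), ("D", 1), ("D", 8), ("A", 3),
        ("M", 67), ("M", 78), ("M", 90), ("M", 91), ("M", 92),
        ("K", 4), ("K", 7), ("K", 10), ("L", 13)]
print(sum_of_group_maxima(lambda i: data[i], len(data)))  # 138
```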
You could use statistics to optimistically skip some entries - say you read A 1, skip 5 entries, and read A 10 - good. But skip 5 more and read B 3, and you have to go back and also read what is in between.
In reality, though, it won't work. Not on text.
Because IO happens in blocks. Data is stored in chunks of usually around 8 KB, so that is the minimum read size (even if your programming language provides reads of other sizes, they will eventually be translated into reading and buffering whole blocks).
And how do you find the next line? You read until you find a \n...
So you don't save anything on this kind of data. It would be different if you had much larger records (several KB, like files) and an index, but building that index requires reading everything at least once.
So as presented, the fastest approach would likely be to linearly scan the entire data once.

Word find style game, distributed letter generation algorithm

So, I am working on a word find style game, and the way I generate new letters right now just isn't cutting it. It works, but it seems to either generate letters that aren't used very often (see https://en.wikipedia.org/wiki/Letter_frequency) or generate too many of one letter.
Right now I just use a mod on a random number and choose based on that, which again works but is not ideal.
So I have 2 cases:
1) On start, the board is generated with 25 random letters.
2) When a word is found, I remove those letters from the board and generate new letters to replace them.
Is there a known algorithm that, based on https://en.wikipedia.org/wiki/Letter_frequency, generates the letters most used in words?
I could just loop over the existing letters, count them, and determine which letter to generate based on that.
I'd prefer something a little less crazy, and ideally something that could also be used for other languages (though that's not necessary at this point).
Any pointers would be greatly appreciated!
You could create a pool of letters according to frequency, for example the 98 letter tiles of English Scrabble.
When you fill your grid, you remove the picked letters from the pool and place them in the grid. When the player selects a valid word from the grid, do the reverse: Remove the letters from the board and put them back into the pool. Then draw new letters to fill the gaps.
When you want to prefill the grid with some existing words to get the player started, you should also pick letters from the pool.
You can use a simple array for the pool. When you remove a random letter, shorten the array by putting the last element in the place where the picked element was. When you put back elements, just append them to the end of the array.
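Here is a minimal Python sketch of such a pool, assuming the standard English Scrabble letter distribution and using the swap-remove trick described above (the class name and the 25-letter board are illustrative):

```python
import random

# Standard English Scrabble letter distribution (98 letter tiles,
# blanks omitted).
SCRABBLE_COUNTS = {
    'E': 12, 'A': 9, 'I': 9, 'O': 8, 'N': 6, 'R': 6, 'T': 6,
    'L': 4, 'S': 4, 'U': 4, 'D': 4, 'G': 3,
    'B': 2, 'C': 2, 'M': 2, 'P': 2, 'F': 2, 'H': 2, 'V': 2, 'W': 2,
    'Y': 2, 'K': 1, 'J': 1, 'X': 1, 'Q': 1, 'Z': 1,
}

class LetterPool:
    def __init__(self, counts=SCRABBLE_COUNTS):
        # Flat array: each letter appears as many times as its tile count.
        self.pool = [c for letter, n in counts.items() for c in letter * n]

    def draw(self):
        """Remove and return a random letter using swap-remove (O(1))."""
        i = random.randrange(len(self.pool))
        self.pool[i], self.pool[-1] = self.pool[-1], self.pool[i]
        return self.pool.pop()

    def put_back(self, letters):
        """Return letters to the pool (e.g. when a found word is removed)."""
        self.pool.extend(letters)

# Usage: fill a 25-letter board, then replace a found word.
pool = LetterPool()
board = [pool.draw() for _ in range(25)]
word = board[:4]                      # pretend these 4 letters formed a word
pool.put_back(word)
replacements = [pool.draw() for _ in range(len(word))]
```

Because draws deplete the pool, the board's letter mix naturally tracks the target frequencies; supporting another language is just a different counts table.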

unique baggage token generator

There are three containers: small, medium and large. A passenger comes in and checks in their luggage. The baggage should be stored in the appropriate container and a unique token number generated. The passenger should then get the bag back using the same token number.
The trick is: if the small container is full, store the bag in the medium container if space is available, or else in the large one. Now if a large bag comes in and there is by then an empty space in the small container, move the overflowed small bag back to the small container and store the large bag.
How do you generate the unique token number and move the baggage internally without changing the token number?
1) Lookup should be constant time and insertion as cheap as possible.
2) We can use hash tables to store the token numbers, but a token number shouldn't change if the baggage is moved internally, and memory shouldn't be wasted when baggage is removed.
Is there an efficient way to solve this? Thanks in advance.
If you have enough memory, you can simply store an associative array:
f(token) = (container, coordinates in container).
Let tokens be consecutive integers, or assign the least positive integer not yet in use each time, or just assign a large random integer (regenerating while an equal one is already present).
When you get a bag, give it a token, put it in a container and assign f (token) = its container and coordinates.
When you move a bag, update the associative array entry.
When you give a bag back to a passenger, remove the associative array entry.
The underlying implementation of the associative array may be arbitrary (hash table, balanced search tree, etc.).
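A minimal Python sketch of this scheme, with a dict standing in for the associative array; the container sizes, free-slot bookkeeping, and 64-bit random tokens are illustrative assumptions:

```python
import random

class BaggageCheck:
    """Sketch of the token -> (container, slot) associative array."""

    def __init__(self, sizes=None):
        sizes = sizes or {"small": 10, "medium": 10, "large": 10}
        # Free slots per container; a bag occupies one (container, slot).
        self.free = {name: list(range(n)) for name, n in sizes.items()}
        self.location = {}  # token -> (container, slot)

    def check_in(self, preferred_order):
        """Store a bag in the first container with a free slot and
        return a fresh random token."""
        for container in preferred_order:
            if self.free[container]:
                slot = self.free[container].pop()
                token = random.getrandbits(64)
                while token in self.location:  # retry on collision
                    token = random.getrandbits(64)
                self.location[token] = (container, slot)
                return token
        raise RuntimeError("all containers are full")

    def move(self, token, new_container):
        """Relocate a bag internally; the token never changes."""
        old_container, old_slot = self.location[token]
        new_slot = self.free[new_container].pop()  # assumes a free slot
        self.free[old_container].append(old_slot)
        self.location[token] = (new_container, new_slot)

    def retrieve(self, token):
        """O(1) lookup: give the bag back and free its slot."""
        container, slot = self.location.pop(token)
        self.free[container].append(slot)
        return container, slot

# Usage: a small bag prefers small, then medium, then large.
desk = BaggageCheck()
token = desk.check_in(["small", "medium", "large"])
desk.move(token, "large")    # internal move, same token
print(desk.retrieve(token))
```

Freed slots go back onto the free lists, so space is reused rather than wasted when baggage is removed.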

Does a reverse key index help if I use an incremental sequence to insert subsequent values

I understand the basic rationale for a reverse key index: it reduces index contention. Now if I have 3 numbers in the index - 12345, 27999, 30632 - I can see that if I reverse these numbers, the next number in the sequence won't always hit the same leaf block.
But if the numbers were like 12345, 12346, 12347, then the next numbers 12348, 12349 (incremented by 1) would hit the same leaf block even if the index is reversed:
54321, 64321, 74321, 84321, 94321.
So how is the reverse key index helping me? It was supposed to help particularly when using sequences.
If we're talking about a sequence-generated value, you can't look at 5 values and draw too many conclusions. You need to think about the data that has already been inserted and the data that will be inserted in the future.
Assuming that your sequence started at 12345, the first 5 values would be inserted sequentially. But then the sixth value will be 12350. Reverse that and you get 05321 which would go to the far left of the index. Then you'd generate 12351. Reverse that to get 15321 and that's again toward the left-hand side of the index between the first value you generated (54321) and the most recent value (05321). As the sequence generates new values, they'll go further to the right until everything resets every 10 numbers and you're inserting into the far left-hand side of the index again.
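A quick Python illustration of this spreading effect; digit reversal here stands in for the byte-level reversal a real reverse key index performs, so the exact values are only indicative:

```python
def reverse_key(n, width=5):
    # Reverse the decimal digits, zero-padded to a fixed width.
    return int(str(n).zfill(width)[::-1])

seq = range(12345, 12365)               # 20 consecutive sequence values
reversed_keys = [reverse_key(n) for n in seq]
print(reversed_keys[:8])
# [54321, 64321, 74321, 84321, 94321, 5321, 15321, 25321]
# Sorted, the inserts scatter across the whole key space instead of
# always appending to the right-most leaf block:
print(sorted(reversed_keys))
```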

Python code for calculating session id randomness

I have a list of 1000 session ids; the session ids are 32 characters long each. What is the most efficient algorithm I can use to determine the randomness or variation at each character position? I am new to Python; can somebody help me develop a Python code snippet for this?
Just for reference, the Sequencer tool in Burp Suite gives a randomness graph for each of the 10 character positions if the token length is 10 characters (the algorithm is unknown to me).
I don't know how Burp does it, but one way to determine the variation at each character position would be to do character frequency analysis for each position in the session ids.
The premise is that you'd expect all characters to be equally likely to appear at any given position across all session ids (i.e. the distribution of characters is uniform). Say you have collected/generated 100 session ids which are numeric (so the possible characters at each position are 0-9); you'd then expect each digit to appear 100/10 = 10 times at each position.
Now, for each position in the sequences, build a histogram of how many times each character actually appears at that position across all session ids.
To figure out how likely your observed character distribution at each position is, given that you'd expect it to be uniform, you can use a statistical test like the chi-squared test.
I've written a simple Python character count tester using the Chi Squared test here: https://github.com/decbis/salr. I'll add more tests in the future.
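For illustration, here is a minimal standard-library sketch of that per-position analysis (an assumption about the approach, not Burp's actual algorithm); it computes the chi-squared statistic by hand against a uniform expectation over the alphabet observed in the sample:

```python
import random
from collections import Counter

def positional_chi_squared(session_ids):
    """Per-position chi-squared statistic against a uniform expectation
    over the alphabet actually observed in the sample."""
    alphabet = sorted(set("".join(session_ids)))
    expected = len(session_ids) / len(alphabet)
    stats = []
    for pos in range(len(session_ids[0])):
        observed = Counter(sid[pos] for sid in session_ids)
        chi2 = sum((observed.get(ch, 0) - expected) ** 2 / expected
                   for ch in alphabet)
        stats.append(chi2)  # higher = further from uniform
    return stats

# Usage: 1000 hypothetical 32-character hex session ids. Positions that
# barely vary (here, a constant "ab" prefix) stand out with huge values.
ids = ["ab" + "".join(random.choices("0123456789abcdef", k=30))
       for _ in range(1000)]
for pos, chi2 in enumerate(positional_chi_squared(ids)):
    print(f"position {pos:2d}: chi2 = {chi2:8.1f}")
```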
