Teacher-Student System: Training Student With k Target Sequences for Each Input Sequence - returnn

This question is related to Teacher-Student System: Training Student with Top-k Hypotheses List
I want to configure a teacher-student system, where a teacher seq2seq model generates a top-k list of hypotheses, which are used to train a student seq2seq model.
I select the top-k hypotheses list from the teacher’s ChoiceLayer (or output layer) by:
"teacher_hypotheses": {
"class": "copy", "from": ["extra.search:teacherMT_output"],
"register_as_extern_data": "teacher_hypotheses_stack"
}
The output Data of that layer has a batch axis whose length is k=4 times the length of the input Data's batch axis (cf. the doc and code of Data.copy_extend_with_beam and SearchChoices.translate_to_common_search_beam).
teacher_hypotheses_stack is selected as the student’s training target. But this leads to the following error:
TensorFlow exception: assertion failed: [shape[0]:] [92] [!=] [dim:] [23]
[[node studentMT_output/rec/subnet_base/check_seq_len_batch_size/check_input_dim/assert_equal_1/Assert/Assert (defined at home/philipp/Documents/bachelor-thesis/returnn/returnn-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
This is, I assume, because the student's target data (the hypotheses list) has a batch axis that is k=4 times as long as that of the student's input data/encoder state data.
What do I have to do to give the student's decoder k different target sequences for each input sequence?
EDIT (12th June 2020): I took a look at the TensorFlow graph via TensorBoard to inspect the node mentioned in the error. To me it looks like the target's batch axis length is validated against the batch axis length of the student's overall input data (meaning the encoder input data). So this check seems to be independent of what I feed into the student's decoder.
EDIT (15th June 2020): Following Albert's advice, I opened an issue on GitHub, related to my problem: Targeting Beam as Training Target Causes Dimension Error

This might actually be a bug. Via register_as_extern_data, I'm not exactly sure that the logic of translate_to_common_search_beam is correct. I think the code currently expects that a target never has a beam.
So, to answer your question: I think you are already doing it correctly (so we can close this StackOverflow question).
You should open a GitHub issue about this (and then link it here). It would be good to come up with a simple test case:
I.e. where there is some beam (you don't even need RecLayer for that, just a single ChoiceLayer would be enough I think),
then register_as_extern_data on that,
and then some other layer which uses this target in some way (e.g. just with "loss": "ce").
Probably this will cause exactly your problem (a minimal sketch of such a test case follows below).
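Concretely, a test-case network dict along those lines might look as follows. This is only a sketch: the layer names ("probs", "beam", "beam_target", "student_out") and the option values are made up for illustration, and it only relies on standard RETURNN layer options (ChoiceLayer's beam_size/target, register_as_extern_data, "loss": "ce"); it is not a verified reproduction.
network = {
    # some scores over the target classes (no RecLayer needed)
    "probs": {"class": "softmax", "from": "data", "target": "classes"},
    # a single ChoiceLayer is enough to introduce a beam (beam size k=4)
    "beam": {"class": "choice", "from": "probs", "input_type": "prob",
             "target": "classes", "beam_size": 4},
    # register the beam-carrying output as a new extern data key
    "beam_target": {"class": "copy", "from": "beam",
                    "register_as_extern_data": "beam_as_target"},
    # some other layer that uses the registered key as target with cross-entropy
    "student_out": {"class": "softmax", "from": "data",
                    "target": "beam_as_target", "loss": "ce"},
}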

Related

Where to view logged results in Veins 5.1

I'm somewhat new to Veins and I'm trying to record collision statistics within the sample "RSUExampleScenario" provided in the VM. I found this question, which describes what line to add to the .ini file, and I have added it, but I'm unable to find the "ncollisions" value in the results folder, which makes me think I either ran the wrong .ini line or am looking in the wrong place.
Thanks!
Because collision statistics take time to compute (essentially: trying to decode every transmission twice: once while considering interference by other nodes as usual, then trying again while ignoring all interference), Veins 5.1 requires you to explicitly turn collision statistics on. As discussed in https://stackoverflow.com/a/52103375/4707703, this can be achieved by adding a line *.**.nic.phy80211p.collectCollisionStatistics = true to omnetpp.ini.
After altering the Veins 5.1 example simulation this way and running it again (e.g., by running ./run -u Cmdenv -c Default from the command line), the ncollisions field in the resulting .sca file should now (sometimes) have non-zero values.
You can quickly verify this by running (from the command line)
opp_scavetool export --filter 'module("**.phy80211p") and name("ncollisions")' results/Default-\#0.sca -F CSV-R -o collisions.csv
The resulting collisions.csv should now contain a line containing (among other information) param,,,*.**.nic.phy80211p.collectCollisionStatistics,true (indicating that the simulation was executed with the required configuration) as well as many lines containing (among other information) scalar,RSUExampleScenario.node[10].nic.phy80211p,ncollisions,,,1 (indicating that node[10] could have received one more message, had it not been for interference caused by other transmissions in the simulation).
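To tally those values quickly, a small script along the following lines could sum them. This is only a sketch and assumes the CSV-R export contains "type", "name" and "value" columns matching the rows quoted above; adjust the column names to whatever header opp_scavetool actually writes for your version.
import csv

total = 0
with open("collisions.csv", newline="") as f:
    for row in csv.DictReader(f):
        # keep only the scalar rows that carry the ncollisions statistic
        if row.get("type") == "scalar" and row.get("name") == "ncollisions":
            total += int(float(row["value"] or 0))
print("total ncollisions across all nodes:", total)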

R307 Fingerprint Sensor working with more than 1000 fingerprints

I want to integrate a fingerprint sensor into my project. For the moment I have shortlisted the R307, which has a capacity of 1000 fingerprints. But since the project requires more than 1000 prints, I am going to store the prints on the host.
The procedure I understand from reading the datasheet for achieving the project requirements is:
I will register the fingerprint with "GenImg".
I will download the template with "upchr".
Now whenever a fingerprint comes in, I will follow step 1 and step 2.
Then I will start some sort of matching algorithm that matches the recently downloaded template file with the template files stored in the database.
So below are the points on which I want your thoughts:
Is the procedure I have written above correct and optimized?
Is the matching algorithm straightforward, like a simple comparison, or is it somehow tricky? How can I implement it? Please suggest if some sort of library already exists.
The sensor stores the image as 256 * 288 pixels, and if I transfer this file to the host at the maximum data rate it takes ~5 seconds (256 * 288 * 8 / 115200), which seems very long.
Thanks
Abhishek
PS: By "HOST" I mean the device to which I am going to connect the sensor; it can be an Arduino/Pi or any other compute device, which I will select depending on how much computing this task requires.
You most probably figured it out yourself, but for anyone stumbling here in the future:
You're correct for the most part.
You will take the finger image (GenImg)
You will then generate a character file (Img2Tz) in BufferID: 1
You'll repeat the above 2 steps again, but this time store the character file in BufferID: 2
You're now supposed to generate a template file by combining those 2 character files (RegModel).
The device combines them for you, and stores the template in both character buffers
As a last step, you need to store this template in your storage (Store)
For searching the finger: you'll take the finger image once, generate a character file in BufferID: 1 and search the library (Search). This performs a linear search and returns the finger id along with a confidence score.
There's also another method (GR_Identify) that does all of the above automatically.
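Sketched as host-side pseudocode, the enroll/search flow above could look roughly like this. The wrapper functions are hypothetical placeholders for however you issue the datasheet commands (GenImg, Img2Tz, RegModel, Store, Search) over the serial link; only the order of the calls reflects the steps described above.
# Hypothetical wrappers around the R307 datasheet commands; replace them with
# your actual serial implementation (or a library such as Adafruit's).
def gen_img(): ...            # GenImg: capture a finger image on the sensor
def img2tz(buffer_id): ...    # Img2Tz: convert the image to a character file in buffer 1 or 2
def reg_model(): ...          # RegModel: combine buffers 1 and 2 into a template
def store(page_id): ...       # Store: save the template to the module's library
def search(): ...             # Search: match buffer 1 against the library

def enroll(page_id):
    # steps 1-5 of the answer above
    gen_img()
    img2tz(1)
    gen_img()          # second scan of the same finger
    img2tz(2)
    reg_model()
    store(page_id)

def identify():
    # capture once, convert into buffer 1, then linear search of the library
    gen_img()
    img2tz(1)
    return search()    # expected to yield (finger_id, confidence)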
The question about optimization isn't applicable here: you're using a third-party device and you have to follow its working instructions whether they're optimized or not.
The sensor stores the image as 256 * 288 pixels and if I take this file to the host at maximum data rate it takes ~5 seconds (256 * 288 * 8 / 115200), which seems very large.
I don't really get what you mean by this, but the template file (that you intend to upload to your host) is 512 bytes, so I don't think it should take much time.
If you want an overview of how this system is implemented, Adafruit's library is a good reference.

H2O “OUTPUT - CLUSTER MEANS” section does not report metrics correctly

(note: this is related to a question I posted before
H2O (open source) for K-mean clustering)
I am using K-Means for our data set of about 100 features (some of them are timestamps)
(1) I checked the “OUTPUT - CLUSTER MEANS” section and the timestamp field has a value like “1.4144556086883196e+22”. Our timestamps are data from the year 2018, and a 2018 Unix time is something like “1541092918000”. Hence it cannot be a number as big as “1.4144556086883196e+22”. My understanding is that the numbers in the “OUTPUT - CLUSTER MEANS” section should be close to the raw data (before standardization). Right?
(2) About standardization: can you use this example https://github.com/h2oai/h2o-3/blob/master/h2o-genmodel/src/test/resources/hex/genmodel/algos/kmeans/model.ini#L21-L27 and tell me how the input data is converted to a standardized value? Say I have a raw vector of values (a, b, c, d, 1.8); I only keep the last element and omit the others. How can I know if it’s close to center 2 below in this example? Can you show me how H2O converts the raw data using standardize_means, standardize_mults and standardize_modes? I am sure H2O has a way to compute the standardized value from the model output, but I cannot find the place or the formula.
center_2 = [2.0, 0.0, -0.5466317772145349, 0.04096506994984166, 2.1628815416218337]
Thanks.
1) I'm not sure where you are seeing a timestamp in Flow or if you mean your dataset contains a timestamp that H2O-3 has converted. Either way it sounds like you may have encountered a bug. The timestamps you see in H2O-3 are milliseconds since the Unix epoch, so you have to divide by 1000 before using a unix time converter (for example you could use https://currentmillis.com/). But again given that the number is so large, I'm leaning towards a bug - any code you can provide to make it reproducible would be great.
1a) When you check standardize in Flow, then in addition to “OUTPUT - CLUSTER MEANS” (which is not standardized) you will see "OUTPUT - STANDARDIZED CLUSTER MEANS", so the non-standardized output should reflect the units of your input.
2) Standardization in H2O-3 is described here (which says it "standardizes numeric columns to have zero mean and unit variance"). The link you provided points to a model for testing that has been saved as a MOJO, and I'm not sure it makes sense to use it as an example. But in general, standardization in H2O-3 works the way standardization is normally defined.
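If you want to check a raw value against a standardized center yourself, the usual zero-mean/unit-variance transform can be applied per numeric column. The sketch below assumes standardize_means holds the per-column means and standardize_mults the reciprocal standard deviations (1/sigma), which is what the MOJO field names suggest; the numbers are hypothetical and not taken from the linked model.ini.
# Hedged sketch: map a raw numeric value into the standardized space and
# compare it to one coordinate of a standardized cluster center.
def standardize(raw, mean, mult):
    # zero mean, unit variance: (x - mean) / sigma, with mult assumed to be 1/sigma
    return (raw - mean) * mult

# hypothetical numbers, for illustration only
mean, mult = 1.2, 0.8
x_std = standardize(1.8, mean, mult)
center_coord = 2.1628815416218337   # last coordinate of center_2 above
print(abs(x_std - center_coord))    # smaller distance = closer on this feature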

How to implement the equivalent of the Aggregator EIP in Nifi

I'm very experienced with Apache Camel and EIPs and am struggling to understand how to implement equivalents in NiFi. I understand that NiFi uses a different paradigm (flow-based programming), but I don't think what I'm trying to do is unreasonable.
In a nutshell, I want the contents of each file to be sent to many REST services and I want to aggregate the responses into a single document which will be stored in Elasticsearch. I might also do some further processing and cleanup to improve what is stored (but this isn't my immediate issue).
The screenshot is a quick mock-up of what I'm trying to achieve but I don't understand enough about Nifi to know how to implement this pattern correctly.
If you are going to take a single piece of data and then fork to multiple parts of the flow and then converge back, there needs to be a way for MergeContent to know which pieces go together.
There are generally two ways this can be done...
The first is using MergeContent in "defragment mode". Think of this as reversing a split operation that was performed by one of the split processors like SplitText. For example, you split a file of 100 lines into 100 flow files of 1 line each, then do some stuff to each one, then want to converge back. The split processors produce a standard set of split attributes (described in the docs of the processors) and the defragment mode knows how to bin the splits accordingly and merge them back together. This probably doesn't apply to your example since you didn't start with a split processor.
The second approach is the "Correlation Attribute" in MergeConent. This tells merge content to only merge flow files together that have the same value for the attribute specified. In your example, when a file gets picked up by GetFile and sent to 3 InvokeHttp processors, there are 3 flow files created, and they all should have their "filename" attribute set to the name of the file picked up from disk. So telling MergeContent to correlate on filename should do the trick, and probably setting the min and max number of entries to the number you expect like 3, and a maximum time in case one of them fails or hangs.
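As a rough sketch of that second approach, the MergeContent processor could be configured along these lines (property names as they appear in the processor; the exact values are assumptions for a flow with 3 InvokeHttp branches):
Merge Strategy              = Bin-Packing Algorithm
Correlation Attribute Name  = filename
Minimum Number of Entries   = 3
Maximum Number of Entries   = 3
Max Bin Age                 = 5 min   (so a failed or hung branch doesn't hold the bin open forever)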

Vowpal Wabbit: obtaining a readable_model when in --daemon mode

I am trying to stream my data to vw in --daemon mode, and would like to obtain at the end the value of the coefficients for each variable.
Therefore I'd like vw in --daemon mode to either:
- send me back the current value of the coefficients for each line of data I send.
- write the resulting model in the "--readable_model" format.
I know about the dummy example trick save_namemodel | ... to get vw in daemon mode to save the model to a given file, but it isn't enough as I can't access the coefficient values from that file.
Any idea on how I could solve my problem?
Unfortunately, on-demand saving of readable models isn't currently supported in the code, but it shouldn't be too hard to add. Open source software is there for users to improve according to their needs. You may open an issue on GitHub, or better, contribute the change.
See:
this code line where only the binary regressor is saved using save_predictor(). One could envision a "rsave" or "saver" tag/command to store the regressor in readable form as is being done in this code line
As a work-around you may call vw with --audit and parse every audit line for the feature names and their current weights (a rough parsing sketch follows after this list), but this would:
make vw much slower
require parsing every line to get the values rather than on demand
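Here is that rough sketch. It assumes each audit entry has the form name:hash:value:weight (where the name may carry a namespace prefix such as ns^feature), with entries separated by whitespace on the audit lines; check this against the audit output of your own vw version, since the exact format may differ.
import re, sys

# Hypothetical audit-entry pattern: name:hash:value:weight
ENTRY = re.compile(
    r"(?P<name>[^\s:]+):(?P<hash>\d+)"
    r":(?P<value>-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
    r":(?P<weight>-?\d+(?:\.\d+)?(?:[eE][-+]?\d+)?)"
)

weights = {}
for line in sys.stdin:                 # e.g. pipe in the output of: vw --audit ...
    for m in ENTRY.finditer(line):
        # keep the most recently seen weight for each feature name
        weights[m.group("name")] = float(m.group("weight"))

for name, w in sorted(weights.items()):
    print(name, w)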
