StanfordNLP OpenIE: Obtaining best triples to match demo

I'm using StanfordNLP OpenIE to extract simpler sentences from more complex sentences by identifying triples with OpenIE. The simpler sentences are needed to improve performance on downstream NLP tasks such as question answering.
Here are my default properties:
properties = '{"annotators":"tokenize,ssplit,pos,lemma,ner,depparse,natlog,openie","outputFormat":"json"}'
Here is my test sentence:
text = 'The patient was placed in the left lateral position and monitored continuously with ECG tracing, pulse oximetry monitoring, and direct observations.'
Whether I use the docker NLP server or run from the downloaded Java distribution, I get the following result (which makes sense):
patient | was placed in | lateral position
patient | monitored continuously | ECG tracing
patient | pulse | oximetry monitoring
patient | was placed in | left lateral position
patient | was | placed
patient | was placed in | left position
patient | monitored | ECG tracing
patient | was placed in | position
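(For reference, these triples come from querying the running server roughly like this; the localhost:9000 address is just wherever the docker/Java server happens to be listening:)

import requests  # the properties string and text are the ones defined above

response = requests.post('http://localhost:9000/',
                         params={'properties': properties},
                         data=text.encode('utf-8'))
for sentence in response.json()['sentences']:
    for triple in sentence['openie']:
        print(triple['subject'], '|', triple['relation'], '|', triple['object'])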
However, if I run the same test sentence on the demo page at http://corenlp.run/, the Brat visualization appears to return a more concise set of triples:
patient | monitored continuously | ECG tracing
patient | pulse | oximetry monitoring
patient | was placed in | left lateral position
I've experimented with each OpenIE annotation option listed here:
https://nlp.stanford.edu/software/openie.html#Questions, but have failed to produce this more concise result.
Is there an option available for this? If not, any algorithm for obtaining this result would be appreciated.
Thanks

When I try corenlp.run, I get the same expanded set of triples if I hover over each entity.
You can, however, always collapse them into fewer triples by just taking the widest possible subject+object+relation scopes. There are a few ways to do this:
1. Most flexible: if you call the OpenIE annotator programmatically, the relation triple keeps the sentence + token offset of each of its arguments, which you can use to determine the widest spanning arguments for each extracted relation.
2. Heuristically, you can simply ignore relation extractions that are subsets of other extractions (a sketch of this is below).
3. The QA_SRL output format will actually do option (1) for you, to conform to an evaluation that penalizes "duplicate" relation extractions.
However, in many situations it's also harmless/preferable to keep more extractions around. Each of the extractions should be "true", in the sense that they're entailed from the source sentence -- some are just more specific than others.
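For option 2, a minimal sketch over the JSON output could look like this (each openie entry carries subjectSpan / relationSpan / objectSpan token offsets); treat it as illustrative rather than a reference implementation:

def covers(outer, inner):
    # spans are [start, end) token-index pairs; True if outer contains inner
    return outer[0] <= inner[0] and outer[1] >= inner[1]

def is_covered_by(t, other):
    return (covers(other['subjectSpan'], t['subjectSpan'])
            and covers(other['relationSpan'], t['relationSpan'])
            and covers(other['objectSpan'], t['objectSpan']))

def collapse(triples):
    # drop any extraction that is strictly contained in another extraction
    kept = []
    for t in triples:
        dominated = any(o is not t and is_covered_by(t, o) and not is_covered_by(o, t)
                        for o in triples)
        if not dominated:
            kept.append(t)
    return kept

Exact duplicates survive this filter, and extractions whose arguments don't overlap at all are left alone, so it only approximates the shorter list the demo shows.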

Related

R307 Fingerprint Sensor working with more than 1000 fingerprints

I want to integrate a fingerprint sensor into my project. For this I have shortlisted the R307, which has a capacity of 1000 fingerprints. But as the project requires more than 1000 prints, I am going to store the prints on the host.
The procedure I understand from reading the datasheet to achieve the project requirements is:
I will register the fingerprint with "GenImg".
I will download the template with "upchr".
Now whenever a fingerprint comes in, I will follow steps 1 and 2.
Then run some sort of matching algorithm that matches the recently downloaded template file with the template files stored in the database.
So below are the points on which I want your thoughts:
Is the procedure I have written above correct and optimized?
Is the matching algorithm straightforward, like a simple comparison, or is it tricky? How can I implement it? Please suggest a library if one already exists.
The sensor stores the image as 256 * 288 pixels, and if I transfer this file to the host at the maximum data rate it takes ~5 seconds (256 * 288 * 8 / 115200), which seems very long.
Thanks
Abhishek
PS: I just mentioned "HOST" as the device to which I am going to connect the sensor; it could be an Arduino/Pi or any other computing device. I will select one depending on how much computing this task requires.
You have most probably figured it out yourself by now, but for anyone stumbling on this in the future:
You're correct for the most part.
You will take a finger image (GenImg).
You will then generate a character file (Img2Tz) at BufferID: 1
You'll repeat the above 2 steps again, but this time store the character file in BufferID: 2
You're now supposed to generate a template file by combining those 2 character files (RegModel).
The device combines them for you, and stores the template in both character buffers.
As a last step, you need to store this template in your storage (Store).
For searching the finger: you'll take a finger image once, generate a character file in BufferID: 1, and search the library (Search). This performs a linear search and returns the finger ID along with a confidence score.
There's also another method (GR_Identify) that does all of the above automatically.
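If the host ends up being a Pi/PC, a rough Python sketch of both flows using the third-party pyfingerprint wrapper could look like the following (the port, baud rate, address, and password are assumptions for a typical R307 hookup, and the method names are my recollection of that wrapper; the datasheet commands they correspond to are in the comments):

from pyfingerprint.pyfingerprint import PyFingerprint  # third-party wrapper around the R30x command set

# port / baud / address / password are assumptions; adjust to your wiring
sensor = PyFingerprint('/dev/ttyUSB0', 57600, 0xFFFFFFFF, 0x00000000)

def enroll():
    # first scan: GenImg + Img2Tz into char buffer 1
    while not sensor.readImage():
        pass
    sensor.convertImage(0x01)
    # second scan of the same finger: GenImg + Img2Tz into char buffer 2
    while not sensor.readImage():
        pass
    sensor.convertImage(0x02)
    sensor.createTemplate()            # RegModel: merge buffers 1 and 2 into a template
    return sensor.storeTemplate()      # Store: save it in the sensor's library, returns the slot

def identify():
    # GenImg + Img2Tz into char buffer 1, then a linear Search over the library
    while not sensor.readImage():
        pass
    sensor.convertImage(0x01)
    position, score = sensor.searchTemplate()  # position is -1 if nothing matched
    return position, score

If you still want to keep templates on the host (the >1000-print requirement), the wrapper also has a downloadCharacteristics() call which, as far as I recall, corresponds to the UpChar/"upchr" command from the question.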
The question about optimization isn't applicable here: you're using a third-party device and you have to follow its working instructions whether they're optimized or not.
The sensor stores the image as 256 * 288 pixels, and if I transfer this file to the host at the maximum data rate it takes ~5 seconds (256 * 288 * 8 / 115200), which seems very long.
I don't really get what you mean by this, but the template file (the one you intend to upload to your host) is 512 bytes; I don't think it should take much time.
If you want an overview of how this system is implemented; Adafruit's Library is a good reference.

Teacher-Student System: Training Student With k Target Sequences for Each Input Sequence

This question is related to Teacher-Student System: Training Student with Top-k Hypotheses List
I want to configure a teacher-student system, where a teacher seq2seq model generates a top-k list of hypotheses, which are used to train a student seq2seq model.
I select the top-k hypotheses list from the teacher’s ChoiceLayer (or output layer) by:
"teacher_hypotheses": {
"class": "copy", "from": ["extra.search:teacherMT_output"],
"register_as_extern_data": "teacher_hypotheses_stack"
}
The output Data of that layer has a batch axis length batch_size=k=4 times the length of the input Data’s batch axis length (cf. doc and code of: Data.copy_extend_with_beam, SearchChoices.translate_to_common_search_beam).
teacher_hypotheses_stack is selected as the student’s training target. But this leads to the following error:
TensorFlow exception: assertion failed: [shape[0]:] [92] [!=] [dim:] [23]
[[node studentMT_output/rec/subnet_base/check_seq_len_batch_size/check_input_dim/assert_equal_1/Assert/Assert (defined at home/philipp/Documents/bachelor-thesis/returnn/returnn-venv/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
This is, I assume, because the student's target data, the hypotheses list, has a batch axis k=4 times longer than that of the student's input data/encoder state data.
What do I have to do, to enable the student’s decoder to have k different target sequences for each input sequence?
EDIT (12th June 2020): I took a look at the TensorFlow graph via TensorBoard to inspect the node mentioned in the error. To me it looks like the target's batch axis length is validated against the batch axis length of the student's overall input data (meaning the encoder input data). So this check seems to be independent of what I feed into the student's decoder.
EDIT (15th June 2020): Following Albert's advice, I opened an issue on GitHub, related to my problem: Targeting Beam as Training Target Causes Dimension Error
This might actually be a bug. Via register_as_extern_data, I'm not exactly sure that the logic of translate_to_common_search_beam is correct. I think the code currently expects that a target never has a beam.
So, to answer your question: I think you are doing it already correct (so we can close this StackOverflow question).
You should open a GitHub issue about this (and then link it here). It would be good to come up with a simple test case:
I.e. where there is some beam (you don't even need RecLayer for that, just a single ChoiceLayer would be enough I think),
then register_as_extern_data on that,
and then some other layer which uses this target in some way (e.g. just with "loss": "ce").
Probably this will cause exactly your problem.
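Untested, but the test case could be as small as something like this (layer names and dimensions are made up; the point is just a ChoiceLayer with a beam, registered as extern data, and another layer training against it):

network = {
    # a distribution over the target vocabulary
    "output_prob": {"class": "softmax", "from": "data", "n_out": 1000},
    # a beam (beam_size > 1), registered as a new extern data key
    "beam_choice": {"class": "choice", "from": "output_prob", "target": "classes",
                    "beam_size": 4, "register_as_extern_data": "beam_as_target"},
    # some other layer that uses the registered beam as its training target
    "student_out": {"class": "softmax", "from": "data", "n_out": 1000,
                    "loss": "ce", "target": "beam_as_target"},
}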

Vowpal Wabbit: obtaining a readable_model when in --daemon mode

I am trying to stream my data to vw in --daemon mode, and would like to obtain at the end the value of the coefficients for each variable.
Therefore I'd like vw in --daemon mode to either:
- send me back the current value of the coefficients for each line of data I send.
- write the resulting model in the "--readable_model" format.
I know about the dummy example trick save_namemodel | ... to get vw in daemon mode to save the model to a given file, but it isn't enough as I can't access the coefficient values from that file.
Any idea on how I could solve my problem?
Unfortunately, on-demand saving of readable models isn't currently supported in the code, but it shouldn't be too hard to add. Open-source software is there for users to improve according to their needs. You may open an issue on GitHub, or better, contribute the change.
See:
this code line where only the binary regressor is saved using save_predictor(). One could envision a "rsave" or "saver" tag/command to store the regressor in readable form as is being done in this code line
As a work-around, you may call vw with --audit and parse every audit line for the feature names and their current weights (see the sketch after this list), but this would:
make vw much slower
require parsing every line to get the values rather than on demand
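For completeness, a rough sketch of that parsing (the audit format is roughly [namespace^]feature:hash:value:weight per token; audit_output.txt is just a placeholder for wherever you capture the audit lines, e.g. the daemon's replies):

def parse_audit_line(line, weights):
    # audit lines list features as [namespace^]name:hash:value:weight (sometimes with @suffixes)
    for token in line.split():
        parts = token.split(':')
        if len(parts) >= 4:
            try:
                weights[parts[0]] = float(parts[3].split('@')[0])
            except ValueError:
                pass  # not a feature token (e.g. a prediction or label line)

weights = {}
with open('audit_output.txt') as f:  # placeholder: could also be lines read back from the daemon socket
    for line in f:
        parse_audit_line(line, weights)
print(weights)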

In bioinformatics, what is a singleton?

I've quickly realized that bioinformatics is not a subject which has its terms clearly defined and easily accessible. I have an apparent discrepancy with some of my results.
I used samtools view -b -h -f 8 fileName.bam > mateUnmapped.bam on several BAM files. I am under the impression that this command extracts only reads whose partner does not align to the draft genome (it also includes the header; the output is in BAM format).
When I use samtools 'flagstat' on the resulting files, I get an interesting result: the number of 'singletons' does not match the total number of reads... which seems odd to me.
The only reconciliation I can find is here:
http://seqanswers.com/forums/showthread.php?t=46711
One person who replies to the question posed in this forum claims that singletons are sometimes defined as sequences which do not have a partner read at all. However, that still doesn't explain away my result. Flagstat says about 40% of my reads are singletons, but I feel like, based on the 'view' command I used, they should ALL be singletons.
Can a seasoned bioinformatician help me out?
In general genomic assembly, a singleton is a read that did not assemble into a contig or map to a reference. It is a contig of only one read.
In samtools, a singleton refers to a read that mapped but the mate didn't.
Flagstat says about 40% of my reads are singletons, but I feel like, based on the 'view' command I used, they should ALL be singletons.
I'm not a samtools expert, but I think -f 8 means show reads whose mates did not map. That doesn't say anything about the read itself, just its mate. So you are probably getting reads where both mates didn't map at all (60%) AND reads where only one of the mates mapped (40%).
You might want to try running with -f 8 -F 4 to get only the reads that mapped but whose mates did not.
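If it helps, the flag logic is easy to check programmatically; here is a small pysam sketch (file names are placeholders) that keeps exactly what -f 8 -F 4 asks for: the read itself is mapped (flag 4 unset) and its mate is unmapped (flag 8 set):

import pysam  # assumes pysam is installed

with pysam.AlignmentFile('fileName.bam', 'rb') as bam, \
     pysam.AlignmentFile('mateUnmapped_mappedOnly.bam', 'wb', template=bam) as out:
    for read in bam:
        # -f 8: the mate is unmapped; -F 4: the read itself is mapped
        if read.mate_is_unmapped and not read.is_unmapped:
            out.write(read)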

How can I use real time monitoring (tail -f), cut, sort, and uniq together in Unix?

I am trying to delete duplicate text that's being written into a log, while continuously monitoring it.
The only issue is that this particular log is timestamped, so before it's possible to determine if the same text is written twice or three times in a row, the timestamp must be cut.
I'm not a Unix expert, but this is my attempt:
tail -f log.txt | cut -c 28- | sort | uniq
The terminal behaves unexpectedly, and just hangs. Whereas either of the two following commands work on their own:
tail -f log.txt | cut -c 28-
or
tail -f log.txt | uniq
Ideally I'd like to filter out non-adjacent text entries, i.e. I would like to be able to use sort as well, but currently I can't get it to work with the -f flag on tail.
You can't get sorted output of a stream of text before it has ended, as the next item to come in might belong ahead of the first one you've seen so far. This makes the sort | uniq part of your pipeline not useful for your situation.
While it's probably possible to filter out your duplicates with some more complicated shell scripting, you might find it easier to write a script in some other language. Many scripting languages have efficient set data structures that can quickly check whether an item has been seen before. Here's a fairly trivial script that should do the job using Python 3:
#!/usr/bin/env python3
import sys

seen = set()
for line in sys.stdin:
    if line not in seen:
        sys.stdout.write(line)
        sys.stdout.flush()  # emit new lines immediately, even when stdout is not a terminal
        seen.add(line)
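Assuming the script above is saved as, say, dedupe.py (an example name) and made executable, it slots into your pipeline in place of sort | uniq:
tail -f log.txt | cut -c 28- | ./dedupe.py
Note that cut itself may buffer its output when it isn't writing to a terminal, so lines can show up in batches rather than instantly.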
The downside to this approach is that the filtering script will use much more memory than uniq does, since it must remember every unique line it has seen before. So, this might not be an appropriate solution if your pipeline may see a great many different lines in a single run.
