case tag in Universal dependencies - stanford-nlp

I am new to NLP.
While studying the Universal Dependencies output of the Stanford parser, I came across the case tag.
I was unable to find a reference to it in the manual:
root(ROOT-0, transfer-1)
dep(100-3, $-2)
dobj(transfer-1, 100-3)
case(John-5, to-4)
nmod(100-3, John-5)
case(account-8, from-6)
nmod:poss(account-8, my-7)
nmod(transfer-1, account-8)
acl(account-8, ending-9)
case(1234-11, with-10)
nmod(ending-9, 1234-11)
Can someone point me to an updated manual reference or explain the significance of the case tag?

These are documented in the Universal Dependencies manual. The case edge is documented here. For English, the case dependent is almost always the preposition that gives the incoming nmod arc its type. So, "in Canada" would have an incoming edge nmod:in to Canada, and a case edge from Canada to in. The one common special case I've seen is possessives, which are now marked with nmod:poss and have an associated case edge to the "'s" token (e.g., Canada 's hockey team).
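To make the relationship between the two edges concrete, here is a minimal sketch that reads typed dependencies in the format shown in the question and relabels each plain nmod edge with the preposition found on its dependent's case edge. The function names and the regular expression are my own, not part of the Stanford tools, and the sketch assumes one dependency per line.

```python
import re

# Matches one typed dependency such as "case(John-5, to-4)".
DEP_RE = re.compile(r"^([\w:]+)\((.+)-(\d+), (.+)-(\d+)\)$")

def parse_dependencies(text):
    """Return (relation, head, head_index, dependent, dependent_index) tuples."""
    deps = []
    for line in text.strip().splitlines():
        m = DEP_RE.match(line.strip())
        if m:
            rel, head, hi, dep, di = m.groups()
            deps.append((rel, head, int(hi), dep, int(di)))
    return deps

def attach_case_markers(deps):
    """Relabel nmod edges as nmod:<marker> when the dependent has a case edge."""
    # The head of a case edge is the nominal; its dependent is the marker word.
    case_of = {hi: dep.lower() for rel, _, hi, dep, _ in deps if rel == "case"}
    labelled = []
    for rel, head, hi, dep, di in deps:
        if rel == "nmod" and di in case_of:
            rel = f"nmod:{case_of[di]}"
        labelled.append((rel, head, dep))
    return labelled

sample = """
case(John-5, to-4)
nmod(100-3, John-5)
case(account-8, from-6)
nmod:poss(account-8, my-7)
nmod(transfer-1, account-8)
"""

for rel, head, dep in attach_case_markers(parse_dependencies(sample)):
    print(rel, head, dep)
```

On the sample, nmod(100-3, John-5) becomes nmod:to and nmod(transfer-1, account-8) becomes nmod:from, while nmod:poss is left alone.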


How to provide printer instructions in FHIR for specimen labels

We are implementing an API according to the FHIR standard.
Our clinic customers can have orders that include a Specimen, and we also want to provide the code for printing labels for these specimens (including barcodes, customer names and so on) on printers that support the Zebra Programming Language (ZPL).
We have decided to do this in FHIR by using the Device Resource and storing the printer code in the carrierAIDC field as a base64 encoded string.
However, I am not certain that this is the optimal solution. Is there a better way to achieve this?
My answer would depend a bit on what the code for printing labels entails, and what attributes you are considering using to reference the various resources. Are you intending to represent the label attached to the container that identifies the specimen in the container, the patient, etc.? If so, Device.udiCarrier.carrierAIDC would not be the correct one to consider as that identifies the container, not the specimen in the container. The Specimen would seem to be more appropriate as that is what the label primarily would represent (various Specimen attributes). At this time you would need an extension to do that as there is no alternative attribute to handle the barcode/label code representation.
If the label is intended primarily to represent the container, and the specimen, patient and other non-container data are ancillary to that, Device.udiCarrier would be the closest fit, as it is meant to identify the device specifically and has an attribute to represent the label content. It likely would not yet be a full proper UDI, but creating a mostly parallel extension to do the same thing does not seem reasonable. In this case I would suggest asking HL7 to consider clarifying the variety of label information that can be associated with a device and that aims to identify the device (not something else), rather than limiting it to UDI as defined today. If there is a concern that this may not be accepted, then go for an extension for now.
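As a rough illustration of the extension route on Specimen, here is a minimal sketch that builds the resource as a plain Python dict. The extension URL is hypothetical (a real one would be defined in your own implementation guide), and the ZPL string is just a placeholder label; only resourceType, extension, url and valueBase64Binary are standard FHIR element names.

```python
import base64
import json

# Hypothetical extension URL, used here for illustration only.
LABEL_EXTENSION_URL = "https://example.org/fhir/StructureDefinition/specimen-label-zpl"

def specimen_with_label(specimen_id, zpl_code):
    """Build a Specimen resource dict carrying ZPL label code in an extension."""
    encoded = base64.b64encode(zpl_code.encode("utf-8")).decode("ascii")
    return {
        "resourceType": "Specimen",
        "id": specimen_id,
        "extension": [
            {"url": LABEL_EXTENSION_URL, "valueBase64Binary": encoded}
        ],
    }

# Placeholder ZPL: one text field on an otherwise empty label.
zpl = "^XA^FO50,50^FDSpecimen 123^FS^XZ"
resource = specimen_with_label("specimen-123", zpl)
print(json.dumps(resource, indent=2))
```

A client that understands the extension would base64-decode the value and send it to the printer; clients that do not can safely ignore it.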

What is the difference between a concept and a label in XBRL, and do all listed companies share the same US GAAP labels?

Here is Tesla's company facts data from the SEC's RESTful API:
https://data.sec.gov/api/xbrl/companyfacts/CIK0001318605.json
You can see all the labels under facts -> us-gaap, such as:
AccountsAndNotesReceivableNet
AccountsPayableCurrent
AccountsReceivableNetCurrent
AccretionAmortizationOfDiscountsAndPremiumsInvestments
Do all listed companies share the same us-gaap label names?
Can every company create its own customized us-gaap label names?
According to the official glossary, a concept in XBRL is "a taxonomy element that provides the meaning for a fact":
https://www.xbrl.org/guidance/xbrl-glossary/
What is the difference between a concept in XBRL and a us-gaap label?
The short answer is yes.
First, a small detail:
AccountsAndNotesReceivableNet
AccountsPayableCurrent
AccountsReceivableNetCurrent
AccretionAmortizationOfDiscountsAndPremiumsInvestments
These are not labels; they are local names of concepts. Labels are something different: they are human-readable. For example, "Accounts and notes receivable, net" would be a label. Labels are attached to concepts through the label linkbase.
The more complete names (called QNames) of these concepts are:
us-gaap:AccountsAndNotesReceivableNet
us-gaap:AccountsPayableCurrent
us-gaap:AccountsReceivableNetCurrent
us-gaap:AccretionAmortizationOfDiscountsAndPremiumsInvestments
where the us-gaap prefix is bound to the US GAAP namespace, which changes every year and is, for 2021:
http://fasb.org/us-gaap-std/2021-01-31
This makes explicit that these concepts are not maintained by companies, but by the Financial Accounting Standards Board. Thus, all companies filing their reports into the EDGAR system share these concepts.
Two important points:
Companies are allowed to create their own concepts. These are called extension concepts. You will recognize them because they are in a company namespace, not in the US GAAP namespace. Their prefix will not be us-gaap, but some company-specific prefix. These concepts are unique to each company.
An example for Tesla is:
tsla:AccruedAndOtherCurrentLiabilities
Concepts in the US GAAP taxonomy are updated every year, i.e., some get added, some get deprecated, some are removed. However, the FASB tries to maintain consistency across years, i.e., a concept will not suddenly change its semantics from one year to the next.
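To see the concept/label distinction in the companyfacts JSON itself, here is a small sketch. It works on a hand-written sample that mirrors the shape of the response (concept local names as keys, each with a "label" field); the sample values are illustrative, and the list of standard prefixes in is_extension_concept is my own assumption that you would extend as needed.

```python
# Hand-written sample in the shape of the companyfacts response.
# The label strings are illustrative, not copied from a real filing.
sample = {
    "facts": {
        "us-gaap": {
            "AccountsPayableCurrent": {
                "label": "Accounts Payable, Current",
                "units": {"USD": []},
            },
            "AccountsReceivableNetCurrent": {
                "label": "Accounts Receivable, Net, Current",
                "units": {"USD": []},
            },
        }
    }
}

def concepts_and_labels(company_facts, prefix="us-gaap"):
    """Return {QName: label} for every concept reported under one prefix."""
    concepts = company_facts.get("facts", {}).get(prefix, {})
    return {f"{prefix}:{name}": info.get("label") for name, info in concepts.items()}

def is_extension_concept(qname, standard_prefixes=("us-gaap", "dei")):
    """Treat any prefix outside the assumed standard list as company-specific."""
    prefix, _, _ = qname.partition(":")
    return prefix not in standard_prefixes

for qname, label in concepts_and_labels(sample).items():
    print(qname, "->", label)

print(is_extension_concept("tsla:AccruedAndOtherCurrentLiabilities"))
```

The keys are the shared concept names; the "label" values are the human-readable text attached to them.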

How to set priority in Microsoft Luis Patterns?

I am using patterns to catch entities of variable size. Here are the situations that I am trying to catch:
1- {entity1} (has| had| have) [the] {entity2}
2.1- {entity1} (has| had| have) the {entity2}
2.2- {entity1} (has| had| have) {entity2}
I tried pattern 1 on its own, and patterns 2.1 and 2.2 together.
The problem is that when I enter "Person have the properties",
entity2 is marked as "the properties" instead of just "properties".
Is there a way to set a priority or work around this problem?
Sorry for any mistakes in my English; I hope the question is clear enough.
There is no way you can set priority in LUIS patterns. However, given your situation above, where the entity is getting extracted incorrectly, you might want to make use of explicit lists. You can create an explicit list via the authoring API to allow the exceptions when:
Your pattern contains a Pattern.any
That pattern syntax allows for the possibility of an incorrect entity extraction based on the utterance.
Also, make sure to refer to the best practices (https://learn.microsoft.com/en-us/azure/cognitive-services/luis/luis-concept-best-practices#do-and-dont) for LUIS apps so that your app behaves with improved accuracy.
Hope this helps.

How can I do "related tags"?

I have tags on my website, and I input them one by one when I create a blog post. I love Gmail's new feature that asks whether you want to include X in a mail when you type Y's name and you often include both of them in the same messages.
I'd like to do something similar on my website, but I don't know how to represent the tags' "related-ness" in an object or database... thoughts?
It all boils down to creating associations between certain characteristics of your posts and certain tags, and then - when you press the "publish" button - analysing the new post and proposing all tags that match your post's characteristics.
This can be done in several ways from a "totally hard-coded" association to some sort of "learning AI"... and everything in-between.
Hard-coded solutions
These are the simplest algorithms to implement. You should first decide which characteristics of your post are relevant for tagging (e.g. its length if you tag posts "short" or "long", the presence of photos or videos if you tag them "multimedia-content", etc.). The most obvious choice, however, is to focus on which words are used in posts. For example you could build a mapping like this:
tag_hint_words = {
    'code-development': ['programming', 'language', 'python',
                         'function', 'object', 'method'],
    'family': ['Theresa', 'kids', 'uncle Ben', 'holidays'],
}
Then you would check your post for the presence of the words in the list (the code between [ and ] ) and propose the tag (the word before :) as a possible candidate.
A common approach is to give "scores", or in other words to assign a number that indicates how likely a given tag is to be the right one. For example: if your post contained the sentence...
After months of programming, we finally left for the summer holidays at uncle Ben's cottage. Theresa and the kids were ecstatic!
...despite the presence of the word "programming" the program should indicate family as the most likely tag to use, as there are many more words hinting.
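A minimal sketch of that scoring idea, assuming the score is simply the number of hint words that occur in the post (the function name score_tags is mine, and a real version would probably weight words rather than count them):

```python
tag_hint_words = {
    "code-development": ["programming", "language", "python",
                         "function", "object", "method"],
    "family": ["Theresa", "kids", "uncle Ben", "holidays"],
}

def score_tags(post, hints):
    """Count how many hint words of each tag occur in the post.

    Matching is a case-insensitive substring test, so "object" would also
    match "objection"; good enough for a sketch, too loose for production.
    """
    text = post.lower()
    scores = {
        tag: sum(1 for word in words if word.lower() in text)
        for tag, words in hints.items()
    }
    # Highest score first, so the best candidate is at the front.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

post = ("After months of programming, we finally left for the summer holidays "
        "at uncle Ben's cottage. Theresa and the kids were ecstatic!")
print(score_tags(post, tag_hint_words))
# [('family', 4), ('code-development', 1)]
```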
Learning AI's
One of the obvious limitations of the above method is that - say one day you pick up Java besides Python - you would probably need to change your code and include words like "java" or "oracle" too. The same applies if you create new tags.
To circumvent this limitation (and have some fun!!) you could try to implement a learning algorithm. Learning algorithms are those that refine their outcome the more you use them (so they indeed... learn!). Some algorithms require initial training (many spam filters and voice recognition programs need this initial "primer"). Some don't.
I am absolutely no expert on the subject, but two common AI's are: the Naive Bayes Classifier and some flavour of Neural network.
Although the WP pages might look scary, they are surprisingly easy to implement (at least in Python). Here's the recording of a lecture at PyCon 2009 on the subject "Easy AI with Python". I found it very informative and even somehow inspiring! :)
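To give an idea of the size of such an implementation, here is a minimal sketch of a multinomial Naive Bayes classifier over words with add-one smoothing. This is my own illustration, not the code from the lecture; the class name, the whitespace tokenisation and the two training sentences are all assumptions.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTagger:
    """Multinomial naive Bayes over words, with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # tag -> word -> count
        self.tag_counts = Counter()              # tag -> number of training posts
        self.vocabulary = set()

    def train(self, text, tag):
        """Record the words of one post that was given this tag."""
        words = text.lower().split()
        self.word_counts[tag].update(words)
        self.tag_counts[tag] += 1
        self.vocabulary.update(words)

    def predict(self, text):
        """Return the tag with the highest log-probability for the text."""
        words = text.lower().split()
        total_posts = sum(self.tag_counts.values())
        best_tag, best_score = None, -math.inf
        for tag in self.tag_counts:
            # log P(tag) + sum over words of log P(word | tag)
            score = math.log(self.tag_counts[tag] / total_posts)
            total_words = sum(self.word_counts[tag].values())
            for word in words:
                count = self.word_counts[tag][word] + 1
                score += math.log(count / (total_words + len(self.vocabulary)))
            if score > best_score:
                best_tag, best_score = tag, score
        return best_tag

tagger = NaiveBayesTagger()
tagger.train("python function method object", "code-development")
tagger.train("kids holidays uncle cottage", "family")
print(tagger.predict("holidays with the kids"))
```

Every time you publish a post and confirm its tags, you call train again, which is how the suggestions improve with use.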
HTH!
You should have a look at this post:
Any suggestions for a db schema for storing related keywords?
If you're looking for a schema for storing related tags it will help.
Relevancy searches where multiple agents play a part are usually done using Collaborative filtering. You might want to give that a look see.
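One very simple reading of this idea, closer to plain co-occurrence counting than to full collaborative filtering, is to record how often two tags appear on the same post and suggest the most frequent companions. The function names and the sample history below are made up for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_cooccurrence(posts_tags):
    """Count, for every tag, how often each other tag appears on the same post."""
    related = defaultdict(Counter)
    for tags in posts_tags:
        for a, b in combinations(sorted(set(tags)), 2):
            related[a][b] += 1
            related[b][a] += 1
    return related

def suggest(related, chosen_tags, limit=3):
    """Suggest tags that most often co-occur with the ones already chosen."""
    votes = Counter()
    for tag in chosen_tags:
        votes.update(related.get(tag, {}))
    # Never suggest a tag the author has already picked.
    for tag in chosen_tags:
        votes.pop(tag, None)
    return [tag for tag, _ in votes.most_common(limit)]

history = [
    ["python", "django", "web"],
    ["python", "django"],
    ["python", "numpy"],
    ["family", "holidays"],
]
related = build_cooccurrence(history)
print(suggest(related, ["python"]))
```

In a database the same structure is a table with one row per tag pair and a counter column, updated each time a post is saved.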
Look up Clustering (Machine Learning algorithm). Don't be intimidated by math, it's a pretty straightforward algorithm. Check out Machine Learning for Hackers for simpler explanations of many Machine Learning algorithms and methods.

How to detect vulnerable/personal information in CVs programmatically (by means of syntax analysis/parsing etc...)

To make matters more specific:
How to detect people's names (seems like a simple case of named entity extraction?)
How to detect addresses: my best guess - find postcode (regexes); country and town names and take some text around them.
As for phones, emails - they could be probably caught by various regexes + preprocessing
Don't care about education/working experience at this point
Reasoning:
In order to build a fulltext index on resumes all vulnerable information should be stripped out from them.
P.S. any 3rd party APIs/services won't do as a solution.
The problem you're interested in is information extraction from semi-structured sources: http://en.wikipedia.org/wiki/Information_extraction
I think you should download a couple of research papers in this area to get a sense of what can be done and what can't.
I feel it can't be done by a machine.
Every other resume will have a different format and layout.
The best you can do is to design an internal format and manually copy every resume content in there. Or ask candidates to fill out your form (not many will bother).
I think that the problem should be broken up into two search domains:
Finding information relating to proper names
Finding information that is formulaic
Firstly, the information relating to proper names could probably best be found by searching for items that are grammatically significant. For example, English capitalizes only the first word of a sentence and proper nouns. Using this grammatical rule, you could look for all the words whose first letter is capitalized and check each one against a database that contains the word and its type [i.e. Bob - Name, Elon - Place, England - Place].
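A minimal sketch of that lookup, assuming a tiny in-memory table in place of a real database (the entries and the function name are made up for illustration):

```python
import re

# Tiny hand-made lookup table; a real one would be far larger.
GAZETTEER = {"Bob": "Name", "England": "Place", "London": "Place"}

def find_proper_names(text, gazetteer=GAZETTEER):
    """Return (word, type) pairs for capitalized words found in the table."""
    # Candidates: words that start with one capital letter followed by lowercase.
    candidates = re.findall(r"\b[A-Z][a-z]+\b", text)
    return [(word, gazetteer[word]) for word in candidates if word in gazetteer]

print(find_proper_names("My name is Bob and I live in London, England."))
# [('Bob', 'Name'), ('London', 'Place'), ('England', 'Place')]
```

Sentence-initial words such as "My" are also picked up as candidates, which is why the table lookup is needed to filter them out.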
Secondly: information that is formulaic. This is more about email addresses, phone numbers, and physical addresses. All of these have specific formats that don't change much. Use a regex to find candidates and an algorithm to judge the quality of the matches.
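For the formulaic part, a rough sketch with two deliberately loose patterns might look like this. Both regular expressions are my own simplifications and will produce false positives and misses on real resumes, so treat them as a starting point rather than a complete detector.

```python
import re

# Simplified email pattern: local part, "@", domain, and a dotted suffix.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# Loose phone pattern: optional "+", then at least nine characters that
# start and end with a digit and contain digits, spaces, dots, dashes or parentheses.
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text):
    """Replace anything that looks like an email or phone number with a marker."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

print(redact("Reach me at jane.doe@example.com or +44 20 7946 0958."))
# Reach me at [EMAIL] or [PHONE].
```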
Watch out:
The grammatical rules change based on language. German capitalizes EVERY noun. It might be best to detect the language of the document prior to applying your rules. Also, another issue with this [and my resume sometimes] is how it is designed. If the resume was designed with something other than a text editor [designer tools] the text may not line up, or be in a bitmap format.
TL;DR Version: NLP techniques can help you a lot.

Resources