Text mining using Watson Knowledge Studio and Watson Discovery - watson-discovery

Could you please tell me how to extract values with a specified unit using Watson Knowledge Studio and Watson Discovery?
The idea is based on the one using Matplot described in the report "Empirical assessment of published effect sizes and power in the recent cognitive neuroscience and psychology literature".

You can create two types of models with WKS - a Machine Learning model and a Rule-Based model - and both of them can be deployed to Watson Discovery.
If you want to extract values with a specified unit from documents, I think you can use a Rule-Based model.
For example, if your documents contain an expression of a value and unit such as "There are more than 10 percents of people being affected by ....", you can define a rule like "a number that is followed by the phrase 'percents'" to extract the value.
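A minimal sketch of the same logic in plain Python, just to make the rule concrete (this is a regular-expression stand-in, not the actual WKS Rule-Based model syntax; the pattern and sample text are assumptions):

import re

# Stand-in for a WKS rule "a number that is followed by the phrase 'percents'".
percent_value = re.compile(r"(\d+(?:\.\d+)?)\s*percents?\b")

text = "There are more than 10 percents of people being affected by the change."
for match in percent_value.finditer(text):
    print(match.group(1))  # prints "10", the value tied to the 'percents' unit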

Related

How to implement nested entities in new LUIS version: CLU

I've been trying to replicate some of my previous LUIS work in the new Microsoft Language portal (Conversational Language Understanding), and I'm stuck on one thing we used to do frequently that we can no longer do: nested entities.
To clarify my question, suppose we have just asked the user to provide a desired price range for a product and the user says: 'price range from $233 to $400'. With LUIS, we could create a machine learning entity containing two sub-entities, minimumValue and maximumValue, like this:
[image: nested entity in LUIS]
and when we train and test, we get a result like this:
[image: training and results in LUIS]
My question is: how can we implement something similar in CLU?
I have already tried the Quantity.NumberRange prebuilt entity and it does not cover all the possible scenarios: I tested it with many different ways of phrasing a range and it failed on many of them. I also tried combining the prebuilt entity with manual labeling (learned), and when the prebuilt entity did not find the minimum and maximum, the manually labeled entity worked, but I couldn't distinguish the minimum from the maximum because there are no nested entities in CLU. I would really appreciate any help.
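For reference, a minimal Python sketch of the output structure being asked for - a minimumValue/maximumValue pair pulled out of the utterance above with a plain regular expression (the pattern and field names are illustrative assumptions, not a CLU feature):

import re

# Illustrative only: recover the min/max pair the LUIS sub-entities used to give.
price_range = re.compile(r"\$?(\d+(?:\.\d+)?)\s*(?:to|-)\s*\$?(\d+(?:\.\d+)?)")

utterance = "price range from $233 to $400"
match = price_range.search(utterance)
if match:
    result = {"minimumValue": float(match.group(1)),
              "maximumValue": float(match.group(2))}
    print(result)  # {'minimumValue': 233.0, 'maximumValue': 400.0}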

Convention for creating good data set for RASA NER_CRF

I am trying to create a dataset for training RASA ner_crf for one type of entity. Please let me know the minimum number of sentences / variations in sentence formation needed for a good result. When I have only one example of each possible sentence form, ner_crf does not give good results.
Rasa entity extraction depends heavily on the pipeline you have defined. It also depends on the language model and tokenizer, so make sure you use a good tokenizer. For normal English utterances, try using tokenizer_spacy before ner_crf. Also try ner_spacy.
In my experience, 5 to 10 variations of utterances for each case gave a decent result to start with; a sketch of such training data is shown below.
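As a rough illustration of that advice, here is a minimal sketch of Rasa NLU training data in its JSON format, with a handful of phrasing variations for a single entity (the file name, intent name, and 'city' entity are assumptions for illustration):

import json

# A few phrasing variations for one entity; more variations per case generally
# help ner_crf generalise. "end" is the index just past the entity value.
examples = [
    ("book a flight to Berlin", "Berlin"),
    ("I want to travel to Berlin next week", "Berlin"),
    ("find me a hotel in Munich", "Munich"),
    ("is it raining in Munich today", "Munich"),
    ("what can I visit in Hamburg", "Hamburg"),
]

common_examples = []
for text, value in examples:
    start = text.index(value)
    common_examples.append({
        "text": text,
        "intent": "travel_query",  # hypothetical intent name
        "entities": [{"start": start, "end": start + len(value),
                      "value": value, "entity": "city"}],
    })

with open("nlu_data.json", "w") as f:
    json.dump({"rasa_nlu_data": {"common_examples": common_examples}}, f, indent=2)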

Creating Staff Directory Lookup Bot with LUIS Integration

I'm trying to set up LUIS to connect to my Azure Web App Bot. I've been asked by my IT Director to test the bot on a "simple" staff directory lookup (hosted in Azure SQL VMs).
I was trying to configure LUIS to understand intents such as 'Who is in Hospitality', or 'Who is Joe Bloggs', but I'm struggling with how to do this.
Do I use entities for departments and people? Are there Pre-Built Intents for 'Greetings' and other commonly used intents?
Any help would be appreciated.
You have several questions, so I have split my answer into two parts.
Information detection (department, names)
[I want to] understand intents such as 'Who is in Hospitality', or 'Who is Joe Bloggs', but I'm struggling with how to do this.
Do I use entities for departments and people?
Department:
If you have a limited and known list of departments, you can create an Entity whose type is List. It will perform an exact text match on the items of this list (see doc here).
If you don't have this list, use an Entity of type Simple (see doc here) and label this entity in several varied example utterances that you provide. You can improve the detection by also adding a Phrase list in that case: it will help, and it does not perform an exact match on the list. You should keep improving it over time (a sketch of the List entity's exact-match idea is shown after the People paragraph below).
People:
For people name detection, it will be a little bit trickier. You can have a look at the Communication.ContactName pre-built entity. If it doesn't work, create your own simple entity, but I'm not sure the results will be relevant.
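To make the List entity behaviour concrete, here is a minimal Python sketch of the exact-match lookup it performs over a known department list, with synonyms mapped to a canonical name (the departments and synonyms are made-up examples):

import re

# Canonical department names and the exact phrases that should map to them.
departments = {
    "Hospitality": ["hospitality", "catering", "front of house"],
    "IT": ["it", "information technology", "tech support"],
}

def match_department(utterance):
    text = utterance.lower()
    for canonical, synonyms in departments.items():
        for phrase in synonyms:
            if re.search(r"\b" + re.escape(phrase) + r"\b", text):
                return canonical
    return None  # unknown department: fall back to a Simple entity / Phrase list

print(match_department("Who is in Hospitality"))  # -> Hospitality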
"Small talk" part
Are there Pre-Built Intents for 'Greetings' and other commonly used intents?
There are no pre-built intents, but there is a Lab project called Personality Chat that is designed to handle such cases (in English only for the moment): https://labs.cognitive.microsoft.com/en-us/project-personality-chat
It is still a lab version, so you should not use it in production, but it is mostly open source, so you can give it a try and see if it fits your needs.

Google cloud natural language API adding own context classifier

I have been searching for how to create a new entity in the Google Cloud Natural Language API and found nothing. Can anybody help with how to create a new classifier such that, if I pass a sentence, I can detect, say, 'python' as a programming language? Currently the API is classifying 'python' as 'other'.
I have also looked into the Cloud AutoML API for my solution and tried to create and train a model, but it was only able to do sentiment analysis, not entity detection. It was giving me a score rather than telling me that Java is a programming language.
Thanks in advance. Your help will be appreciated.
AutoML content classification classifies your data into the labels specified in the training set. It does not do entity detection. But it seems like what you need is closer to content classification than entity detection. My understanding from your description is that you have content (maybe words, phrases, or short sentences) and you want to classify it into some labels (e.g. programmingLanguage). If you put together a good training set, the AutoML model should be able to do this; a sketch of such a training set is shown below.
The number it provides in eval is not sentiment, it's the probability of the predicted label. As you can see in the eval page you posted, it's telling you that Java is a programmingLanguage with a probability of 1 (so it's very certain about it).
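As a rough sketch of what such a training set could look like, here is a small Python snippet that writes a two-column (text, label) CSV of the kind AutoML Natural Language classification accepts; the rows, label names, and file name are assumptions for illustration, and a real training set would need many more examples per label:

import csv

# Illustrative (text, label) pairs; a usable classifier needs far more rows.
rows = [
    ("python", "programmingLanguage"),
    ("I wrote the backend in Java", "programmingLanguage"),
    ("C++ templates can be hard to read", "programmingLanguage"),
    ("the hotel breakfast was excellent", "other"),
    ("our flight was delayed by two hours", "other"),
]

with open("automl_training.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)  # one "text,label" row per training example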

OpenNLP, Training Named Entity Recognition on unsupported languages: clarifications needed

I want to experiment with NER on a specific domain, namely location name extraction from travel offers in the Italian language.
So far I've gathered that I need to prepare the training set myself, so I'm going to put the
<START:something><END>
tags in some offers from my training set.
But looking at the OpenNLP documentation on how to train for NER, I ended up with a couple of questions:
1) When defining the START/END tags, am I free to use whatever name inside the tags (where I wrote "something" a few lines above), or is there a restricted set to be bound to?
2) I noticed that the call to the training tool
opennlp TokenNameFinderTrainer
takes a string representing the language as the first argument. What is that for? Considering I want to train a model on Italian, which is NOT supported, is there any additional task to be done before I can train for NER?
1) Yes, you are free to choose the type names, and you can also specify multiple types. If the training file contains multiple types, the created model will also be able to detect these multiple types.
2) I think that the "lang" parameter has the same meaning/use as in other commands (e.g. opennlp TokenizerTrainer -lang it ...); an example invocation is sketched below.
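For illustration, a training line and invocation could look like this (the 'location' type and file names are assumptions; each annotated sentence goes on its own line with tokens separated by spaces):
Soggiorno di 7 notti a <START:location> Roma <END> con volo da <START:location> Milano <END> .
opennlp TokenNameFinderTrainer -model it-ner-location.bin -lang it -data travel-offers.train -encoding UTF-8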
