I want to use Elasticsearch for NER - elasticsearch

I want to use Elasticsearch for NER.
Imagine that the Elasticsearch index contains key/value data: the key is a word, and the value is a list of entities.
For example: key: apple, value: [fruit, company]
When I send a query consisting of a sentence, the sentence can contain several candidate keywords. So my question is whether Elasticsearch has functionality that returns results for each candidate keyword in a single query.
Ex)
query: "what is apple pie"
candidate keywords: "what", "what is", "what is apple", "what is apple pie", "is", "is apple", "is apple pie", "apple", "apple pie", "pie"
keys existing in the DB: "apple", "apple pie", "pie"
returned result: "apple": [fruit, company], "apple pie": [food], "pie": [food]
Thanks.
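For what it's worth, this kind of multi-keyword lookup can be done in a single round trip with a terms query. Below is a minimal sketch using the official Python client (8.x), assuming an index named "keywords" whose "key" field is mapped as keyword (so multi-word keys match exactly) and whose "entities" field holds the entity list; all of these names are illustrative.

from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")

candidates = ["what", "what is", "what is apple", "what is apple pie",
              "is", "is apple", "is apple pie", "apple", "apple pie", "pie"]

# A single terms query matches every document whose "key" is one of the candidates.
resp = es.search(
    index="keywords",
    query={"terms": {"key": candidates}},
    size=len(candidates),
)

# Only the keys that actually exist in the index come back,
# e.g. apple: [fruit, company], apple pie: [food], pie: [food].
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["key"], hit["_source"]["entities"])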

In my case, I use CoreNLP to perform the extraction. Given input sent to the NLP REST server, the resulting output for tokenizing, NER, and additional parsing like lemmatization, sentiment, coreference, etc. is stored in Elasticsearch for later discoverability, in terms of how to keep training CoreNLP. This might not be the answer on how to use Elasticsearch to nail NLP tasks, since CoreNLP (or spaCy, which is great too) is the trainable machine-learning tool that should be used for this. So I assume you wanted to say "I want to use Elasticsearch for searching and analyzing extracted NER"; if yes, there you go.
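As a rough sketch of that pipeline: the snippet below sends text to a locally running CoreNLP server and stores the returned NER annotations in Elasticsearch. The index name "ner_docs" and the field layout are my own illustrative choices, and it assumes a CoreNLP server on its default port.

import json

import requests
from elasticsearch import Elasticsearch

CORENLP_URL = "http://localhost:9000"  # default CoreNLP server address
es = Elasticsearch("http://localhost:9200")

text = "Stanford University is located in California."

# Ask the CoreNLP server for tokens and NER tags as JSON.
props = {"annotators": "tokenize,ssplit,ner", "outputFormat": "json"}
resp = requests.post(CORENLP_URL,
                     params={"properties": json.dumps(props)},
                     data=text.encode("utf-8"))
annotation = resp.json()

# Keep only the tokens CoreNLP tagged with an entity type, then index them.
entities = [{"word": t["word"], "ner": t["ner"]}
            for sentence in annotation["sentences"]
            for t in sentence["tokens"]
            if t["ner"] != "O"]
es.index(index="ner_docs", document={"text": text, "entities": entities})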

Related

How to execute search for FHIR patient with multiple given names?

We've implemented the $match operation for patient that takes FHIR parameters with the search criteria. How should this search work when the patient resource in the parameters contains multiple given names? We don't see anything in FHIR that speaks to this. Our best guess is that we treat it as an OR when trying to match on given names in our system.
We do see that composite parameters can be used in the query string as AND or OR, but we are not sure how this translates when using the $match operation.
$match is intrinsically a 'fuzzy' search. Different servers will implement it differently. Many will allow for alternate spellings, common short names (e.g. 'Dick' for 'Richard'), etc. They may also allow for transposition of month and day and all sorts of similar data entry errors. The 'closeness' of the match is reflected in the score the match is given. It's entirely possible to get back a match candidate that doesn't match any of the given names exactly, if the score on other elements is high enough.
So technically, I think SEARCH works this way:
AND
/Patient?given=John&given=Jacob&given=Jingerheimer
The above is an AND clause: there is (or can be) a person named with the multiple given names "John", "Jacob", "Jingerheimer".
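For contrast, FHIR search treats comma-separated values within a single parameter as an OR, so the following should match patients having any one of the three given names:
/Patient?given=John,Jacob,Jingerheimer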
Now I realize SEARCH and MATCH are two different operations, but they are loosely related.
Patient matching is an "art". Be careful: a "false positive" (with a high "score") is, or could be, a very big deal.
But as mentioned by Lloyd, you have a little more flexibility with your implementation of $match.
I have worked on two different "teams".
On one team, we never let "out the door" anything that was below an 80% match score. (How you determine a match score is a deeper discussion.)
On another team, we made $match work as "if you give me enough information to find a SINGLE match, I'll give it to you"; if not, we told people "not enough info to match a single patient".
Patient matching is HARD. Do not let anyone tell you different.
At HIMSS and other events, when people show a demo of moving data, I always ask "how did you match this single person on this side, as being that same person on the other side?" As in, without patient matching, a lot of workflows fall apart at the get-go.
Side note: I actually reported a bug (for SEARCH) with the MS-FHIR-Server, which the team fixed very quickly:
https://github.com/microsoft/fhir-server/issues/760
"name": [
{
"use": "official",
"family": "Kirk",
"given": [
"James",
"Tiberious"
]
},
Sidenote: the HAPI FHIR object to represent this is ca.uhn.fhir.rest.param.TokenAndListParam.
Sidenote: there is a feature request for Patient $match on the MS-FHIR-Server GitHub page:
https://github.com/microsoft/fhir-server/issues/943

ElasticSearch full text search across key/value

I am trying to figure out what would be the best way to search my documents, and right now I am kind of stuck. Bear in mind I am really new to Elasticsearch, and for now I am mostly trying to see if it is a match for my needs.
My dataset is originally made of XML literature files. These files are composed of passages with identifiers (say paragraph 1, paragraph 2... book 1, book 2... section 1, section 2, section 4... [not necessarily continuous or actually numeric; most of the time they match \w]).
The way I thought I'd format my data for Elasticsearch would look like:
"passages": [
{"id": "1.1", "body": "I represent the book 1 section 1 and I am a dog"},
{"id": "1.2", "body": "I am a cat and I represent the book 1 section 2"},
]
My research needs are the following: I need to be able to search across passages (so, if I am looking for "dog" not too far from "cat", I match 1.1-1.2) and retrieve the identifiers of the first and last passages that the query spans.
As far as I have seen, this seems to be an unconventional requirement, and I do not see anything looking this way here. The need to keep the identifiers AND the ability to search across "passages" makes for a bit of a complicated "first" dive into ES...
Thanks for your time reading the question :)
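Not a full answer to the cross-passage identifier problem, but for the proximity part: if the passages of one book were concatenated into a single body field, Elasticsearch's span_near query could express "dog not too far from cat". A sketch in Python (index and field names are illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Match documents where "dog" and "cat" occur within 20 positions
# of each other, in any order.
resp = es.search(index="books", query={
    "span_near": {
        "clauses": [
            {"span_term": {"body": "dog"}},
            {"span_term": {"body": "cat"}},
        ],
        "slop": 20,
        "in_order": False,
    }
})
print(resp["hits"]["total"])

Recovering the first and last passage identifiers would still take extra bookkeeping, e.g. storing passage boundary offsets alongside the concatenated body.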

How to return distinct highlighted words in Elasticsearch

I'm trying to implement autocomplete in my application.
say I have the following documents:
"red smart phone"
"super smart phone"
"small bluetooth speaker"
So when the user types "s" I need to return as suggestions:
"smart"
"small"
Currently I'm using simple highlighting in Elasticsearch to get the matched words (smart, small). However, I get back "smart" twice. Is it possible to configure ES to return only distinct values for the highlights?
On top of that, is it possible to have ES also return the next (n) word(s), e.g.:
"smart phone"
"small bluetooth (speaker)"
The completion suggester seems to be the solution for your need:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters-completion.html
You have to modify your ES mapping to add new properties to your field.
Be careful with your analyzer too.
HTH,
NicolasY.
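To sketch what that might look like in Python for the documents above (index and field names are illustrative): each word is indexed as a completion input, and the suggester's skip_duplicates option covers the distinct-values requirement.

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# A field of type "completion" powers the completion suggester.
es.indices.create(index="products", mappings={
    "properties": {"suggest": {"type": "completion"}}})

# Index each word as a suggestion input so mid-phrase words like "smart" match.
for phrase in ["red smart phone", "super smart phone", "small bluetooth speaker"]:
    es.index(index="products", document={"suggest": phrase.split()})
es.indices.refresh(index="products")

resp = es.search(index="products", suggest={
    "words": {"prefix": "s",
              "completion": {"field": "suggest", "skip_duplicates": True}}})
for option in resp["suggest"]["words"][0]["options"]:
    print(option["text"])  # each matching word once: small, smart, speaker, super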

Crate fulltext query syntax

I'm thinking about migration from Sphinx to Crate, but I can't find any documentation for fulltext query syntax. In Sphinx I can search:
("black cat" -catalog) | (awesome creature)
This stands for: EITHER the exact phrase "black cat" and no term "catalog" in the document, OR both "awesome" and "creature" at any position in the document.
black << big << cat
This requires the document to contain all of the terms "black", "big" and "cat", and also requires the match position of "black" to be less than the match position of "big", and so on.
And I need to search at a specific place in the document. In Sphinx I was able to use the proximity operator as follows:
hello NEAR/10 (mother|father -dear)
This requires the document to contain the term "hello" and either "mother" or "father" at most 10 terms away from "hello"; the term "dear" must not be closer than 10 terms to "hello".
The last construction with NEAR is heavily used in my application. Is it all possible in Crate?
Unfortunately I cannot comment on how it compares to Sphinx, but I will stick to your questions :)
Crate's fulltext search comes with SQL and Lucene's matching power, and therefore should be able to handle complex queries. I'll just provide the queries matching your examples; I think they should be quite readable.
("black cat" -catalog) | (awesome creature)
select *
from mytable
where
(match(indexed_column, 'black cat') using phrase
and not match(indexed_column, 'catalog'))
or match(indexed_column, 'awesome creature') using best_fields with (operator='and');
black << big << cat
select *
from mytable
where
match(indexed_column, 'black big cat') using phrase with (slop=100000);
This one is tricky: there doesn't seem to be an operator that does exactly the same as in Sphinx, but it can be approximated with a "slop" value. Depending on the use case there might be another (better) solution as well...
hello NEAR/10 (mother|father -dear)
select *
from mytable
where
(match(indexed_column, 'hello mother') using phrase with (slop=10)
or match(indexed_column, 'hello father') using phrase with (slop = 10))
and not match(indexed_column, 'hello dear') using phrase with (slop = 10)
They might look a bit clunky compared to Sphinx's language, but they work fine :)
Performance-wise, they should still be super fast, thanks to Lucene.
Cheers, Claus

What's the right database for this? Mongo, SQL, Couch or something else?

Let's say I've got a collection of 10 million documents that look something like this:
{
  "_id": "33393y33y63i6y3i63y63636",
  "Name": "Document23",
  "CreatedAt": "5/23/2006",
  "Tags": ["website", "shopping", "trust"],
  "Keywords": ["hair accessories", "fashion", "hair gel"],
  "ContactVia": ["email", "twitter", "phone"],
  "Body": "Our website is dedicated to making hair products that are..."
}
I would like to be able to query the database for an arbitrary number (including zero) of any of the three attributes Tags, Keywords, and ContactVia. I need to be able to select via ANDs (this document includes BOTH attributes X and Y) or ORs (this document includes attribute X OR Y).
Example queries:
Give me the first 10 documents that have the tags "website" and "shopping", with the keywords matching "hair accessories" or "fashion", and with a ContactVia including "email".
Give me the second 20 documents that have the tags "website" or "trust", matching the keywords "hair gel" or "hair accessories".
Give me the 50 documents that have the tag "website".
I also need to order these by other fields in the documents (score-type) or by created or updated dates. So there are basically four "ranges" that are queried regularly.
I started out SQL-based. Then I moved to Mongo because it had support for arrays and hashes (which I love). But it doesn't support more than one range using indexes, so my Mongo database is slow, because it can't use indexes and has to scan 10 million documents.
Is there a better alternative? This is holding up moving this application into production (and the revenue that comes with it). Any thoughts as to the right database or alternative architectures would be greatly appreciated.
I'm in Ruby/Rails, if that matters.
When needing to do multiple queries on arrays, we found the best solution, at least for us, was to go with Elasticsearch. We get this, plus some other bonuses, and we can reduce the index requirements for Mongo, so it's a win/win.
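To make that concrete, the example queries above map naturally onto Elasticsearch bool queries. A sketch of the first one in Python, assuming the array fields are mapped as keyword and CreatedAt as a date (the index name is illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Tags "website" AND "shopping", Keywords "hair accessories" OR "fashion",
# ContactVia including "email"; first 10 results, newest first.
resp = es.search(
    index="documents",
    query={"bool": {"filter": [
        {"term": {"Tags": "website"}},       # AND: every clause must match
        {"term": {"Tags": "shopping"}},
        {"terms": {"Keywords": ["hair accessories", "fashion"]}},  # OR inside one clause
        {"term": {"ContactVia": "email"}},
    ]}},
    sort=[{"CreatedAt": {"order": "desc"}}],
    from_=0,
    size=10,
)

Paging like "the second 20 documents" is just from_=20, size=20.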
My two cents are for MongoDB. Not only can your data be represented, saved, and loaded as raw Ruby hashes, but Mongo is modern and fast, and really, really easy to get to know. Here's all you need to do to start a Mongo server:
mongod --dbpath /path/to/dir/w/dbs
Then, to get the console (which is just a basic JavaScript console), just invoke mongo. And using Mongo from Ruby is just this simple:
require 'mongo'
db = Mongo::Connection.new['somedb']
db.stuff.find #=> []
db.stuff.insert({id: 'abcd', name: 'Swedish Chef', says: 'Bork bork bork!'})
db.stuff.find #=> [{id: 'abcd', name: 'Swedish Chef', says: 'Bork bork bork!'}]
db.stuff.update({id: 'abcd'}, {'$set' => {says: 'Bork bork bork!!!! (Bork)!'}})
db.stuff.find #=> [{id: 'abcd', name: 'Swedish Chef', says: 'Bork bork bork!!!! (Bork)!'}]
