I am trying to figure out the best way to implement search over my documents, and right now I am a bit stuck. Bear in mind that I am really new to Elasticsearch and for now I am mostly trying to see whether it is a match for my needs.
My dataset is originally made of XML literature files. These files are organized by identifiers (say paragraph 1, paragraph 2... book 1, book 2... section 1, section 2, section 4... [not necessarily contiguous or even numeric; most of the time they match \w]).
The way I thought I'd format my data for Elasticsearch would look like:
"passages": [
{"id": "1.1", "body": "I represent the book 1 section 1 and I am a dog"},
{"id": "1.2", "body": "I am a cat and I represent the book 1 section 2"},
]
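For what it's worth, indexing that structure would presumably look something like the sketch below (the "works" index name, the local node, and the elasticsearch-py 7.x client are assumptions of mine, not requirements):

from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a local node on localhost:9200

doc = {
    "passages": [
        {"id": "1.1", "body": "I represent the book 1 section 1 and I am a dog"},
        {"id": "1.2", "body": "I am a cat and I represent the book 1 section 2"},
    ]
}
# Index the whole work as one document holding all of its passages
es.index(index="works", id="book-1", body=doc)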
My search needs are the following: I need to be able to search across passages (so, if I am looking for "dog" not too far from "cat", I should match 1.1-1.2) and retrieve the identifiers of the first and last passages that the query spans.
As far as I have seen, this seems to be an unconventional requirement and I do not see anything that looks like it here. The need to keep the identifiers AND the ability to search across "passages" seems to make this a bit of a complicated first dive into ES...
Thanks for your time reading the question :)
We've implemented the $match operation for Patient, which takes a FHIR Parameters resource with the search criteria. How should this search work when the Patient resource in the parameters contains multiple given names? We don't see anything in FHIR that speaks to this. Our best guess is that we treat it as an OR when trying to match on given names in our system.
We do see that composite parameters can be used in the query string as AND or OR, but we're not sure how this translates when using the $match operation.
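For illustration, here is roughly what such a call might look like (the base URL is a placeholder; the "resource" and "onlyCertainMatches" parameter names come from the FHIR Patient/$match operation definition):

import requests

# Parameters resource wrapping a Patient with two given names
params = {
    "resourceType": "Parameters",
    "parameter": [
        {
            "name": "resource",
            "resource": {
                "resourceType": "Patient",
                "name": [
                    {"use": "official", "family": "Kirk", "given": ["James", "Tiberious"]}
                ]
            }
        },
        {"name": "onlyCertainMatches", "valueBoolean": False}
    ]
}

# POST to a hypothetical FHIR server's $match endpoint
response = requests.post("https://example.org/fhir/Patient/$match", json=params)
print(response.json())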
$match is intrinsically a 'fuzzy' search. Different servers will implement it differently. Many will allow for alternate spellings, common short names (e.g. 'Dick' for 'Richard'), etc. They may also allow for transposition of month and day and all sorts of similar data entry errors. The 'closeness' of the match is reflected in the score the match is given. It's entirely possible to get back a match candidate that doesn't match any of the given names exactly if the score on other elements is high enough.
So technically, I think SEARCH works this way:
AND
/Patient?given=John&given=Jacob&given=Jingerheimer
The above is an AND clause: it matches a person who has all of the given names "John", "Jacob", and "Jingerheimer".
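For contrast, using the standard given search parameter (a hedged illustration, since servers may vary in what they support): repeating the parameter ANDs the values, while comma-separating values within one parameter ORs them:
/Patient?given=John&given=Jacob (patient must have both given names)
/Patient?given=John,Jacob (patient may have either given name)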
Now I realize SEARCH and MATCH are 2 different operations.
But they are loosely related.
But patient matching is an "art". Be careful: a "false positive" (with a high "score") is, or could be, a very big deal.
But as mentioned by Lloyd... you have a little more flexibility with your implementation of $match.
I have worked on 2 different "teams".
On one team, we never let "out the door" anything that was below an 80% match-score. (How you determine a match-score is a deeper discussion.)
On another team, we made $match work as "IF you give me enough information to find a SINGLE match, I'll give it to you"... but if not, tell people "not enough info to match a single patient".
Patient Matching is HARD. Do not let anyone tell you different.
At HIMSS and other events, when people show a demo of moving data, I always ask "how did you match this single person on this side... to that same person on the other side?"
As in "without patient matching...alot of work-flows fall a part at the get go"
Side note: I actually reported a bug with the MS-FHIR-Server (for SEARCH), which the team fixed very quickly, here:
https://github.com/microsoft/fhir-server/issues/760
"name": [
{
"use": "official",
"family": "Kirk",
"given": [
"James",
"Tiberious"
]
},
Sidenote:
The Hapi-Fhir object to represent this is "ca.uhn.fhir.rest.param.TokenAndListParam"
Sidenote:
There is a feature request for Patient Match on the Ms-Fhir-Server github page:
https://github.com/microsoft/fhir-server/issues/943
I have a large ES index which I intend to populate using various sources. The sources sometimes have the same documents, meaning that I will have duplicate docs differing only by 'source' param.
To perform de-duplication when serving searches, I see 2 ways:
1. Get Elasticsearch to perform the priority filtering.
2. Get everything and filter via Python.
I prefer not to filter at Python level to preserve pagination, so I want to ask if there's a way to tell Elasticsearch to priority filter based on some value in the document (in my case, source).
I want to filter by simple priority (so if my order is A,B,C, I will serve the A document if it exists, then B if doc from source A doesn't exist, followed by C).
An example set of duplicate docs would look like:
{
"id": 1,
"source": "A",
"rest_of": "data",
...
},
{
"id": 1,
"source": "B",
"rest_of": "data",
...
},
{
"id": 1,
"source": "C",
"rest_of": "data",
...
}
But if I want to serve "A" FIRST, then "B" if there is no "A", followed by "C" if neither exists, a search result for "id": 1 should look like:
{
"id": 1,
"source": "A",
"rest_of": "data",
...
}
Note:
Alternatively, I could try to de-duplicate at the population phase, but I'm worried about the performance. Willing to explore this if there's no trivial way to implement solution 1.
I think the best solution is to actually avoid having duplicates in your index. I don't know how frequent they will be in your data, but if you have a lot of them, this will badly influence the term frequencies and may lead to poor search relevance.
A quite simple approach could be to generate the Elasticsearch ID of the document with a consistent method across all sources. You can indeed force the _id when indexing instead of letting ES generate it for you.
What will happen then is that the last source to come in will override the existing document if it exists. Last to come wins. If you don't care about the source, this may work.
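A minimal sketch of that idea, assuming the elasticsearch-py 7.x client and an index name of "my-index" (both assumptions, not from the question):

from elasticsearch import Elasticsearch

es = Elasticsearch()

def index_doc(doc):
    # Force the shared business key as the Elasticsearch _id instead of
    # letting ES auto-generate one: the last source indexed overwrites.
    es.index(index="my-index", id=str(doc["id"]), body=doc)

index_doc({"id": 1, "source": "A", "rest_of": "data"})
index_doc({"id": 1, "source": "B", "rest_of": "data"})  # replaces the "A" version

If the A/B/C priority actually matters, the ingestion side would also need to check the existing document's source before overwriting; that part is beyond this sketch.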
However, this comes with a little performance cost, as stated in this article:
As you have seen in this blog post, it is possible to prevent duplicates in Elasticsearch by specifying a document identifier externally prior to indexing data into Elasticsearch. The type and structure of the identifier can have a significant impact on indexing performance. This will however vary from use case to use case so it is recommended to benchmark to identify what is optimal for you and your particular scenario.
I've been wrestling with this problem for days.
For example, if I have
{doc_id1:"thor marvel"},
{doc_id2:"spiderman thor"},
{doc_id3:"the avengers captain america ironman thor"}
three documents in Elasticsearch, and I do a search query for "thor", I want it to tell me where the keyword "thor" is found in each document, like
{ doc_id1: 1, doc_id2: 2, doc_id3: 6}
as the desired result.
I have two possible solutions off the top of my head now:
1. Figure out a way to put the term vectors info (which includes all the positions/offsets for each token/word of the document) into _source, so that I can directly access the term vectors info in my normal search results. I can then construct the (doc, position) list outside Elasticsearch. Normally, you can only access term vectors for a single document at a time given the index/type/id, which is why it's tricky. That should be the ideal way to achieve the goal.
2. Figure out a way to trigger an action whenever a new document is added. This action would scan through all the tokens/words in the new document, create a new "token" index (if it doesn't exist) and append a (doc_id, position) pair to it, like
{ keyword:"thor" [ doc_id1:1,doc_id2:2,doc_id3:6] }.
So then I would just need to search for "thor" among the keyword indexes and get the (doc, position) lists. This seems to be even harder and less optimal.
Sadly, I don't know how to do either one. I'll appreciate it if someone can give me some help on this. Many thanks!
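For reference, a rough sketch of what option 1 might look like as two round trips with the term vectors API (the "movies" index, the "body" field, and the elasticsearch-py 7.x client are all assumptions; note that Elasticsearch token positions are zero-based, so "thor marvel" would report position 0 rather than 1):

from elasticsearch import Elasticsearch

es = Elasticsearch()

hits = es.search(index="movies", body={"query": {"match": {"body": "thor"}}})["hits"]["hits"]

positions = {}
for hit in hits:
    # Ask for the token positions in the "body" field of this hit;
    # they come from stored term vectors or are computed on the fly.
    tv = es.termvectors(index="movies", id=hit["_id"], fields=["body"],
                        positions=True, offsets=False)
    terms = tv["term_vectors"]["body"]["terms"]
    if "thor" in terms:
        positions[hit["_id"]] = [t["position"] for t in terms["thor"]["tokens"]]

print(positions)  # e.g. {"doc1": [0], "doc2": [1], ...}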
Let's say I've got some documents in an index. One of the fields is a url. Something like...
{"Url": "Server1/Some/Path/A.doc"},
{"Url": "Server1/Some/OtherPath/B.doc"},
{"Url": "Server1/Some/C.doc"},
{"Url": "Server2/A.doc"},
{"Url": "Server2/Some/Path/B.doc"}
I'm trying to extract counts by paths for my search results. This would presumably be query-per-branch.
Eg:
Initial query:
Server1: 3
Server2: 2
Server1 Query:
Some: 3
Server1/Some Query:
Path: 1
OtherPath: 1
Now I can broadly see 2 ways to approach this and I'm not a great fan of either.
Option 1: Scripting. mvel seems to be limited to mathematical operations (at least I can't find a string split in the docs) so this would have to be in Java. That's possible but it feels like a lot of overhead if there are a lot of records.
Option 2: Store the path parts alongside the document...
{"Url": ..., "Parts": ["1|Server1","2|Some","3|Path"]},
{"Url": ..., "Parts": ["1|Server1","2|Some","3|OtherPath"]},
{"Url": ..., "Parts": ["1|Server1","2|Some"]},
{"Url": ..., "Parts": ["1|Server2"]},
{"Url": ..., "Parts": ["1|Server2","2|Some","3|Path"]}
This way I could do something like: find URLs starting with 'Server1/Some', then facet on parts starting with '3|'. This feels horribly hackish.
What's a good way to do this? I can do as much pre-processing as required but need the counts to be coming from ES as it's the count of results from a query that is important.
Given a doc with the url /a/b/c, have a multivalued field url and input (using preprocessing) the values /a, /a/b, /a/b/c.
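A hedged sketch of that idea, using modern terms aggregations rather than the old facets API (the "docs" index name, the keyword mappings, and the Parts field name are assumptions):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Keyword mappings so the path prefixes can be matched and bucketed exactly
es.indices.create(index="docs", body={
    "mappings": {"properties": {
        "Url": {"type": "keyword"},
        "Parts": {"type": "keyword"}
    }}
})

es.index(index="docs", id="1", body={
    "Url": "Server1/Some/Path/A.doc",
    "Parts": ["Server1", "Server1/Some", "Server1/Some/Path"]
})

# Count the stored prefixes under "Server1/Some" across matching docs
resp = es.search(index="docs", body={
    "size": 0,
    "query": {"term": {"Parts": "Server1/Some"}},
    "aggs": {"children": {"terms": {"field": "Parts", "include": "Server1/Some/.*"}}}
})
print(resp["aggregations"]["children"]["buckets"])

Note that the include pattern picks up prefixes at every deeper level, which is the mixing-of-depths problem the edit below addresses.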
Edit:
When you want to constrain showing counts to paths of a certain depth, you could design multiple multivalued fields as described above. Each field would represent a particular depth.
The ES-client should contain logic to decide which depth (and thus which field) to query for facets.
Still feels like a hack though, and indeed without control of data you could end up with lots of fields for this.
Let's say I've got a collection of 10 million documents that look something like this:
{
"_id": "33393y33y63i6y3i63y63636",
"Name": "Document23",
"CreatedAt": "5/23/2006",
"Tags": ["website", "shopping", "trust"],
"Keywords": ["hair accessories", "fashion", "hair gel"],
"ContactVia": ["email", "twitter", "phone"],
"Body": "Our website is dedicated to making hair products that are..."}
I would like to be able to query the database for an arbitrary number (including 0) of any of the 3 attributes Tags, Keywords, and ContactVia. I need to be able to select via ANDs (this document includes BOTH attributes X and Y) or ORs (this document includes attribute X OR Y).
Example queries:
Give me the first 10 documents that have the tags website and
shopping, with the keywords matching "hair accessories or fashion"
and with a contact_via including "email".
Give me the second 20 documents that have the tags "website" or
"trust", matching the keywords "hair gel" or "hair accessories".
Give me the 50 documents that have the tag "website".
I also need to order these by either other fields in the documents
(score-type) or created or updated dates. So there are basically four "ranges" that are queried regularly.
I started out SQL-based. Then I moved to Mongo because it had support for arrays and hashes (which I love). But it doesn't support more than one range using indexes, so my Mongo database is slow... because it can't use indexes and has to scan 10 million documents.
Is there a better alternative? This is holding up moving this application into production (and the revenue that comes with it). Any thoughts as to the right database or alternative architectures would be greatly appreciated.
I'm in Ruby/Rails if that matters.
When needing to do multiple queries on arrays, we found the best solution, at least for us, was to go with ElasticSearch. We get this, plus some other bonuses. And we can reduce the index requirements for Mongo, so it's a win/win.
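To make that concrete, here is a hedged sketch of the first example query as an Elasticsearch bool query (the "documents" index name is an assumption, and it assumes Tags/Keywords/ContactVia are mapped as keyword and CreatedAt as a date):

from elasticsearch import Elasticsearch

es = Elasticsearch()

resp = es.search(index="documents", body={
    "size": 10,  # "first 10 documents"
    "query": {"bool": {"filter": [
        {"term": {"Tags": "website"}},    # AND
        {"term": {"Tags": "shopping"}},   # AND
        {"terms": {"Keywords": ["hair accessories", "fashion"]}},  # OR within the list
        {"term": {"ContactVia": "email"}}
    ]}},
    "sort": [{"CreatedAt": "desc"}]  # or sort on a score-type field
})
for hit in resp["hits"]["hits"]:
    print(hit["_source"]["Name"])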
My two cents are for MongoDB. Not only can your data be represented, saved, and loaded as raw Ruby hashes, but Mongo is modern and fast, and really, really easy to get to know. Here's all you need to do to start the Mongo server:
mongod --dbpath /path/to/dir/w/dbs
Then to get the console, which is just a basic JavaScript console, just invoke mongo. And using Mongo from Ruby is just this simple:
require 'mongo'

# Legacy (pre-2.0) Ruby driver API
db = Mongo::Connection.new['somedb']
stuff = db['stuff']

stuff.find.to_a #=> []
stuff.insert({id: 'abcd', name: 'Swedish Chef', says: 'Bork bork bork!'})
stuff.find.to_a #=> [{id: 'abcd', name: 'Swedish Chef', says: 'Bork bork bork!'}]
stuff.update({id: 'abcd'}, {'$set' => {says: 'Bork bork bork!!!! (Bork)!'}})
stuff.find.to_a #=> [{id: 'abcd', name: 'Swedish Chef', says: 'Bork bork bork!!!! (Bork)!'}]