Let's say I've got some documents in an index. One of the fields is a url. Something like...
{"Url": "Server1/Some/Path/A.doc"},
{"Url": "Server1/Some/OtherPath/B.doc"},
{"Url": "Server1/Some/C.doc"},
{"Url": "Server2/A.doc"},
{"Url": "Server2/Some/Path/B.doc"}
I'm trying to extract counts by paths for my search results. This would presumably be query-per-branch.
Eg:
Initial query:
Server1: 3
Server2: 2
Server1 Query:
Some: 3
Server1/Some Query:
Path: 1
OtherPath: 1
Now I can broadly see 2 ways to approach this and I'm not a great fan of either.
Option 1: Scripting. MVEL seems to be limited to mathematical operations (at least I can't find a string split in the docs), so this would have to be done in Java. That's possible, but it feels like a lot of overhead if there are a lot of records.
Option 2: Store the path parts alongside the document...
{"Url": ..., "Parts": ["1|Server1","2|Some","3|Path"]},
{"Url": ..., "Parts": ["1|Server1","2|Some","3|OtherPath"]},
{"Url": ..., "Parts": ["1|Server1","2|Some"]},
{"Url": ..., "Parts": ["1|Server2"]},
{"Url": ..., "Parts": ["1|Server2","2|Some","3|Path"]}
This way I could do something like: filter to URLs starting with 'Server1/Some' and facet on parts starting with '3|'. This feels so horribly hackish.
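For illustration, the query I have in mind would look something like this (just a sketch, with index name assumed and Url assumed to be not analyzed; a terms aggregation with an include pattern, or a terms facet with a regex on older versions):

GET /docs/_search
{
  "query": {
    "prefix": { "Url": "Server1/Some" }
  },
  "aggs": {
    "level_3": {
      "terms": {
        "field": "Parts",
        "include": "3\\|.*"
      }
    }
  }
}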
What's a good way to do this? I can do as much pre-processing as required but need the counts to be coming from ES as it's the count of results from a query that is important.
Given a doc with url /a/b/c, have a multivalued field url and (using preprocessing) index the values /a, /a/b, /a/b/c.
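A minimal sketch of what that looks like (index name assumed; url mapped as a non-analyzed/keyword field so the paths stay intact; a terms facet on older versions):

PUT /docs/_doc/1
{ "url": ["/a", "/a/b", "/a/b/c"] }

GET /docs/_search
{
  "size": 0,
  "aggs": {
    "path_counts": {
      "terms": { "field": "url" }
    }
  }
}

Each bucket then gives you the count of matching documents under that path prefix.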
Edit:
When you want to constrain the counts shown to paths of a certain depth, you could define multiple multivalued fields as described above, each field representing a particular depth.
The ES client should contain the logic to decide which depth (and thus which field) to query for facets.
Still feels like a hack, though, and indeed without control of the data you could end up with lots of fields this way.
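A sketch of that variant (hypothetical field names, modern mapping syntax):

PUT /docs
{
  "mappings": {
    "properties": {
      "url_depth1": { "type": "keyword" },
      "url_depth2": { "type": "keyword" },
      "url_depth3": { "type": "keyword" }
    }
  }
}

PUT /docs/_doc/1
{
  "url_depth1": "/a",
  "url_depth2": "/a/b",
  "url_depth3": "/a/b/c"
}

To show counts at depth 2, the client would run the terms aggregation against url_depth2.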
Related
We receive data from application forms in JSON and need to be able to search on it - but only the text entered by the user. Some of our data from other sources comes in as XML and this is fine - the html_strip (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-htmlstrip-charfilter.html) character filter does the job.
But is there an equivalent for JSON - you send serialized JSON as text and it strips out the keys and punctuation, leaving just the values?
A very simplistic example:
The application form sends back this data:
{
"ed_hist1": "Glasgow High School",
"ed_hist2": "Edinburgh University"
}
This gets serialized and added to our document as a text field:
{
"Type": "applicationform",
"Id": 1,
"Name": "Margaret Blenkinsop",
"Email": "JohnB232#myCompany.COM",
"Text": "{\"ed_hist1\":\"Glasgow High School\",\"ed_hist2\":\"Edinburgh University\"}"
}
And that gets sent to ES.
When I search the Text field I don't want to be able to find "ed_hist1" or "ed_hist2", only "Glasgow High School" and "Edinburgh University".
Or is the only way to pre-process the JSON? (Which is fine, but I don't want to manually code something if ES will take care of it for me.)
Solution #1
There are a couple of ways of accomplishing what you want. The most "idiomatic" way would be to pre-process the JSON coming from your application, transforming the JSON doc you have into a list.
EG:
{
"ed_hist1": "Glasgow High School",
"ed_hist2": "Edinburgh University"
}
becomes
[
"Glasgow High School",
"Edinburgh University"
]
And then your document would look as follows:
{
"Type": "applicationform",
"Id": 1,
"Name": "Margaret Blenkinsop",
"Email": "JohnB232#myCompany.COM",
"Text": ["Glasgow High School", "Edinburgh University"]
}
You could then search the Text field for the ed_hist you are looking for and have documents returned that match your query, etc. This is the simplest solution for your case, but there are other ways of structuring your data depending on what questions you want to ask of it.
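For example, a plain match (or match_phrase) query against that field would then find the document (a sketch; index name assumed):

GET /applications/_search
{
  "query": {
    "match_phrase": { "Text": "Glasgow High School" }
  }
}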
Solution #2
You seem to be thinking a fair bit about keeping the original JSON document as text in a field of another document. Without additional context I do not particularly like this solution, but I assume you have your reasons. For this second, not recommended use case, I would use char filters in addition to storing your original field value. The process would look like this:
index document
original document `store`d
custom char filter strips out the unwanted JSON language characters
text indexed
At query time you can get back the original stored value, but also leverage full-text search on the text contained in the JSON document after the characters have been stripped. I really believe you can get more out of your data if you go with a Solution #1 variant. Stripping HTML for text search is a good way to search through markup documents, but stripping JSON makes less sense, as JSON is ES's bread and butter and will give you more tools to work with as JSON.
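A sketch of what such a mapping could look like, using a pattern_replace char filter to strip the JSON keys before tokenizing (the regex here is assumed and simplistic, not a robust JSON parser; the standard tokenizer then discards the remaining braces and quotes):

PUT /applications
{
  "settings": {
    "analysis": {
      "char_filter": {
        "strip_json_keys": {
          "type": "pattern_replace",
          "pattern": "\"\\w+\"\\s*:",
          "replacement": " "
        }
      },
      "analyzer": {
        "json_values": {
          "type": "custom",
          "tokenizer": "standard",
          "char_filter": ["strip_json_keys"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "Text": {
        "type": "text",
        "analyzer": "json_values",
        "store": true
      }
    }
  }
}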
EDIT: I forgot to link to the store documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-store.html
EDIT #2: Of additional note is the _source field which is enabled by default: https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-source-field.html Your original JSON document lives there, for free, as well. This can get you back the original document you sent to ES, while still giving you the more idiomatic data structure in the DB.
I just started learning Elasticsearch. My data has company names and their websites, and I have a list which contains all the domain aliases of a company. I am trying to write a query which boosts records whose website appears in the list.
My data looks like:
{"company_name": "Kaiser Permanente",
"website": "http://www.kaiserpermanente.org"},
{"company_name": "Kaiser Permanente - Urgent Care",
"website": "http://kp.org"}.
The list of domain aliases is:
["kaiserpermanente.org","kp.org","kpcomedicare.org", "kp.com"]
The actual list is longer than the above example. I've tried this query:
{
"bool": {
"should": {
"terms": {
"website": [
"kaiserpermanente.org",
"kp.org",
"kpcomedicare.org",
"kp.com"
],
"boost": 20
}
}
}
}
The query doesn't return anything because the "terms" query is an exact match. The domains in the list and the URLs are similar but not the same.
What I expect is that the query should return the two records in my example. I think "match" could work, but I couldn't figure out how to match a value against every similar value in the list.
I found a similar question: How to do multiple "match" or "match_phrase" values in ElasticSearch. The solution works, but my alias list contains more than 50 elements. It would be very verbose to write a "match_phrase" for each element. Is there a more efficient way, like "terms", where I could just pass in a list?
I'd appreciate if anyone can help me out with this, thanks!
What you are observing has been covered in many Stack Overflow posts and the ES docs: the difference between terms and match. When you store that info, I assume you are using the standard analyzer. This means that when you push "http://kp.org", Elasticsearch indexes the broken-out tokens [ "http", "kp", "org" ]. However, when you use terms, it looks for a "kp.org" token, and there is no such token to find matches for, since it was broken down by the analyzer at index time. match, however, will break down what you query for, so that "kp.org" => [ "kp", "org" ], and it is able to find one or both. Phrase matching just requires the tokens to be next to each other, which is probably necessary for what you need.
Unfortunately, there does not appear to be an option that works like match but allows many values to match against like terms. I believe you have three options:
programmatically generate the query as described in the Stack Overflow post that you referenced, which you noted would be verbose, but I think this might be just fine unless you have 1k aliases.
analyze the website field so that index-time analysis transforms "http://www.kaiserpermanente.org" => "kaiserpermanente.org" and "http://kp.org" => "kp.org". With this approach you can successfully use the terms filter when querying. This might be fine given that URLs are structured and the use cases you outline only appear to be concerned with domains. If you do this, use multi fields to analyze one website value in multiple ways (see the sketch after this list). It's nice to have Elasticsearch do this kind of work for you so you don't have to worry about it in your own code.
do this processing beforehand (before pushing data to ES) so that when you store data in Elasticsearch, you store not only the website field but also a domain, paths, and whatever else you need that you calculated beforehand. You get control, at the cost of the effort you have to put in.
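A sketch of option 2 (index, field, and filter names assumed; the pattern is simplistic): a multi field whose analyzer reduces the website value to its bare domain, then terms run against that sub-field.

PUT /companies
{
  "settings": {
    "analysis": {
      "char_filter": {
        "to_domain": {
          "type": "pattern_replace",
          "pattern": "^https?://(www\\.)?|/.*$",
          "replacement": ""
        }
      },
      "analyzer": {
        "domain": {
          "type": "custom",
          "char_filter": ["to_domain"],
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "website": {
        "type": "text",
        "fields": {
          "domain": { "type": "text", "analyzer": "domain" }
        }
      }
    }
  }
}

GET /companies/_search
{
  "query": {
    "bool": {
      "should": {
        "terms": {
          "website.domain": ["kaiserpermanente.org", "kp.org", "kpcomedicare.org", "kp.com"],
          "boost": 20
        }
      }
    }
  }
}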
I have a large ES index which I intend to populate from various sources. The sources sometimes contain the same documents, meaning that I will end up with duplicate docs differing only by the 'source' param.
To perform de-duplication when serving searches, I see 2 ways:
Get Elasticsearch to perform the priority filtering.
Get everything and filter via Python
I prefer not to filter at Python level to preserve pagination, so I want to ask if there's a way to tell Elasticsearch to priority filter based on some value in the document (in my case, source).
I want to filter by simple priority (so if my order is A, B, C, I will serve the A document if it exists, then B if the doc from source A doesn't exist, followed by C).
An example set of duplicate docs would look like:
{
"id": 1,
"source": "A",
"rest_of": "data",
...
},
{
"id": 1,
"source": "B",
"rest_of": "data",
...
},
{
"id": 1,
"source": "C",
"rest_of": "data",
...
}
But if I want to serve "A" first, then "B" if there is no "A", followed by "C" if there is no "B", a search result for "id": 1 would look like:
{
"id": 1,
"source": "A",
"rest_of": "data",
...
}
Note:
Alternatively, I could try to de-duplicate at the population phase, but I'm worried about the performance. Willing to explore this if there's no trivial way to implement solution 1.
I think the best solution is to actually avoid having duplicates in your index. I don't know how frequent they will be in your data, but if you have a lot of them, this will badly skew the term frequencies and may lead to poor search relevance.
A quite simple approach could be to generate the ElasticSearch ID of the document, with a consistent method across all sources. You can indeed force the _id when indexing instead of letting ES generate it for you.
What will happen then is that the last source to arrive will overwrite the existing document if it exists. Last to come wins. If you don't care about the source, this may work.
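For example (a sketch; index name and ID scheme assumed), deriving the _id from the record's own id means every source writes to the same document:

PUT /my-index/_doc/1
{ "id": 1, "source": "A", "rest_of": "data" }

PUT /my-index/_doc/1
{ "id": 1, "source": "B", "rest_of": "data" }

The second request overwrites the first, so a search for "id": 1 returns a single document (here with source B).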
However, this comes with a little performance cost, as stated in this article:
As you have seen in this blog post, it is possible to prevent duplicates in Elasticsearch by specifying a document identifier externally prior to indexing data into Elasticsearch. The type and structure of the identifier can have a significant impact on indexing performance. This will however vary from use case to use case so it is recommended to benchmark to identify what is optimal for you and your particular scenario.
I am trying to figure out what would be the best way to search my documents, and right now I am kind of stuck. Bear in mind I am really new to Elasticsearch; for now I am mostly trying to see if it is a match for my needs.
My dataset is originally made of XML literature files. These files are composed of identified units (say paragraph 1, paragraph 2... book 1, book 2... section 1, section 2, section 4... [not necessarily continuous or even numeric; most of the time they match \w]).
The way I thought I'd format my data for Elasticsearch looks like:
"passages": [
{"id": "1.1", "body": "I represent the book 1 section 1 and I am a dog"},
{"id": "1.2", "body": "I am a cat and I represent the book 1 section 2"},
]
My search needs are the following: I need to be able to search across passages (so, if I am looking for dog not too far from cat, I match 1.1-1.2) and retrieve the identifiers of the first and last passages the query spans.
As far as I have seen, this seems to be an unconventional requirement, and I do not see anything like it here. The need to keep the identifiers AND the ability to search across passages makes for a bit of a complicated "first" dive into ES...
Thanks for your time reading the question :)
I've been wrestling with this problem for days.
For example, if I have
{doc_id1:"thor marvel"},
{doc_id2:"spiderman thor"},
{doc_id3:"the avengers captain america ironman thor"}
three documents in Elasticsearch, and do a search query for "thor", I want it to tell me where the keyword "thor" is found in each document (its word position), like
{ doc_id1: 1, doc_id2: 2, doc_id3: 6}
as the desired result.
I have two possible solutions off the top of my head now:
1. Figure out a way to put the term vector info (which includes all the positions/offsets for each token/word of the document) into _source, so that I can directly access term vector info in my normal search results. I can then construct the (doc, position) list outside Elasticsearch. Normally, you can only access term vectors for a single document at a time, given the index/type/id, which is why it's tricky. That should be the ideal way to achieve the goal.
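For reference, this is the per-document API I mean (a sketch; index and field names assumed, recent-version URL format):

GET /my-index/_termvectors/1?fields=text&positions=true&offsets=true

which returns, per term, the positions and offsets within that one document.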
2. Figure out a way to trigger an action whenever a new document is added. This action would scan through all the tokens/words in the new document, create a new "token" index (if it doesn't exist) and append the (doc_id, position) pair to it, like
{ keyword:"thor" [ doc_id1:1,doc_id2:2,doc_id3:6] }.
So I would just need to search for "thor" among the keyword indexes and then get the (doc, position) lists. This seems to be even harder and less optimal.
Sadly, I don't know how to do either one. I'd appreciate it if someone could give me some help with this. Many thanks!