Elasticsearch Join - elasticsearch

I have two indices. One indices "indications" which have some set of values.
Other is "projects". In this indices, I will add indications value like " indication = oncology".
Now I want to show all indications. Which I can do using terms aggregations. But my issue is that I also want to show count of project in which that indication is used .
So for that, I need to write join query.
Can anyone help me to resolve this issue?
Expected result example:
[{name:"onclogogy",projectCount:"12"}]

You cannot have joins in Elasticsearch. What you can do is store indication name in project index and then apply the term aggregation on project index. That basically will get you the different indications from all the project documents and count of each indication.
Something of the sort:
GET /project/_search
{
"query": {},
"aggs": {
"indcation":{
"terms": {
"field": "indication_name"
}
}
}
}
Elasticsearch does not supports joins. That's the whole point of having NoSQL that you keep the data as denormalised as you can. Make the documents more and more self sufficient.
There are some ways with which you can add some sort of relationship b/w your data. This is a nice blog on it.

Related

How to find what index a field belongs to in elasticsearch?

I am new to elasticsearch. I have to write a query using a given field but I don't know how to find the appropriate index. How would I find this information?
Edit:
Here's an easier/better way using mapping API
GET _mapping/field/<fieldname>
One of the ways you can find is to get records where the field exist
Replace the <fieldName> with your fields name. /_search will search across all indices and return any document that matches or has the field. Set _source to false, since you dont care about document contents but only index name.
GET /_search
{
"_source": false,
"query": {
"exists": {
"field": "<fieldName>"
}
}
}
Another, more visual way to do that is through the kibana Index Management UI (assuming you have privileges to access the site).
There you can click on the indices and open the mappings tab to get all fields of the particular index. Then just search for the desired field.
Summary:
#Polynomial Proton's answer is the way of choice in 90% of the time. I just wanted to show you another way to solve your issue. It will require more manual steps than #Polynomial Proton's answer. Also, if you have a large amount of indices this way is not appropriate.

elasticsearch scoring on multiple indexes

i have an index for any quarter of a year ("index-2015.1","index-2015.2"... )
i have around 30 million documents on each index.
a document has a text field ('title')
my document sorting method is (1)_score (2)created date
the problem is:
when searching for some text on on 'title' field for all indexes ("index-201*"), always the first results is from one index.
lets say if i am searching for 'title=home' and i have 10k documents on "index-2015.1" with title=home and 10k documents on "index-2015.2" with title=home then the first results are all documents from "index-2015.1" (and not from "index-2015.2", or mixed) even that on "index-2015.2" there are documents with "created date" higher then in "index-2015.1".
is there a reason for this?
The reason is probably, that the scores are specific to the index. So if you really have multiple indices, the result score of the documents will be calculated (slightly) different for each index.
Simply put, among other things, the score of a matching document is dependent on the query terms and their occurrences in the index. The score is calculated in regard to the index (actually, by default even to each separate shard). There are some normalizations elasticsearch does, but I don't know the details of those.
I'm not really able to explain it well, but here's the article about scoring. I think you want to read at least the part about TF/IDF. Which I think, should explain why you get different scores.
https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
EDIT:
So, after testing it a bit on my machine, it seems possible to use another search_type, to achieve a score suitable for your case.
POST /index1,index2/_search?search_type=dfs_query_then_fetch
{
"query" : {
"match": {
"title": "home"
}
}
}
The important part is search_type=dfs_query_then_fetch. If you are programming java or something similar, there should be a way to specify it in the request. For details about the search_types, refer to the documentation.
Basically it will first collect the term-frequencies on all affected shards (+ indexes). Therefore the score should be generalized over all these.
according to Andrei Stefan and Slomo, index boosting solve my problem:
body={
"indices_boost" : { "index-2015.4" : 1.4, "index-2015.3" : 1.3,"index-2015.2" : 1.2 ,"index-2015.1" : 1.1 }
}
EDIT:
using search_type=dfs_query_then_fetch (as Slomo described) will solve the problem in better way (depend what is your business model...)

Elasticsearch: Accessing nested document attributes in script

I store log data in elasticsearch and my records, among other data, contain lists of values. First I represented these lists of values with regular arrays in elastic, but soon realised that the flattening in combination with the inverted index in Lucene made average aggregations on a list such as [1,1,1,1,5] came out completely wrong since the inverted index only contained [1,5]. Clearly avg([1,5]) is different from avg([1,1,1,1,5]).
Seeking out solutions I turned to nested documents, which do not flatten the data.
I now have my nested documents in elasticsearch looking something in the line of:
"nested_documents": [
{ "list1": 1, "list2": 2},
{ "list1": 3, "list2": 4}
]
Using the nested aggregation I am able to do aggregations such as:
"aggs": {
"nested_aggregation": {
"nested": {
"path": "nested_documents"
},
"aggs": {
"average_of_list1": {
"avg": {
"field": "nested_documents.list1"
}
}
}
}
Which now give me the correct result over the entire data set. However, I do have another requirements as well.
I would like to achieve things like max(avg(nested_documents.list1)), i.e. I want to have the average value of a field of my nested documents. I imagined I could use a script to achieve this, but I can't find a way to access the nested document in scripts. I did achieve the desired result using script and _source, but this was way too slow to be used in production on my datasets.
The only simple (and fast) solution I can imagine is to calculate the averages before storage, and store them along the actual lists, but that doesn't feel right.
Aggregating over aggregation results are not yet supported in elasticsearch. Apparently there is a concept called reducers that are being developed for 2.0. I would suggest having a look at scripted metric aggregations. Basically, you can create your own aggregation by controlling the collection and computation aspects yourself using scripts.
Have a look at the following question for an example of this aggregation: Elasticsearch: Possible to process aggregation results?

How to retrieve all document ids matching a search, in elastic search?

I'm working on a simple side project, and have a tech stack that involves both a SQL database and ElasticSearch. I only have ElasticSearch because I assumed that as my project grows, my full text searching would be most efficiently performed by ES. My ES schema is very simple - documents that I insert into ES have 2 fields, one being the id and the other being the field with the body of text to search. The id being inserted into ES corresponds to that document's primary key id from the SQL database.
insert record into SQL -> insert record into ES using PK from SQL
Searching would be the reverse of that. Query ES and grab all the matching ids, and then turn around and use those ids to get records from SQL.
search ES can get all PK ids -> use those ids to get documents from SQL
The problem that I am facing though, is that ES can only return documents in a paginated manner. This is a problem because I also have a WHERE clause on my SQL query, beyond just the ids. My SQL query might look like this ...
SELECT * FROM foo WHERE id IN (1,2,3,4,5) AND bar != 'baz'
Well, with ES paginating the results, my WHERE clause will always only be querying a subset of the full results from ES. Even if I utilize ES' skip and take, I'm still only querying SQL using a subset of document ids.
Is there a way to get Elastic Search to only return the entire list of matching document ids? I realize this is here to not allow me to shoot myself in the foot, because doing this across all shards and many many documents is not efficient. Is there no way, though?
After putting in some hours on this project, I've only now realized that I've poorly engineered this, unless I can get all of these ids from ES. Some alternative implementations that I've thought of would be to store the things that I'm filtering on, in SQL, in ES as well. A problem there is that I'd have to update the ES document every time I update the document in SQL. This would require a pretty big rewrite to some of my data access code. I could scrap ElasticSearch all together and just perform searching in Postgres, for now, until I can think of a better way to structure this.
The elasticsearch not support return each and every doc match to you queries. Because it Ll overload the system. Instead of this.. Use scroll concept in elasticsearch.. It's lik cursor concept in db's..
http://www.elasticsearch.org/guide/en/elasticsearch/guide/current/scan-scroll.html
For more examples refer the Github repo. https://github.com/sidharthancr/elasticsearch-java-client
Hope it helps..
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/search-request-fields.html
please have a look into the elastic search document where you can specify only particular fields that return from the match documents
hope this resolves your problem
{
"fields" : ["user", "postDate"],
"query" : {
"term" : { "user" : "kimchy" }
}
}

How to retrieve unique count of a field using Kibana + Elastic Search

Is it possible to query for a distinct/unique count of a field using Kibana? I am using elastic search as my backend to Kibana.
If so, what is the syntax of the query? Heres a link to the Kibana interface I would like to make my query: http://demo.kibana.org/#/dashboard
I am parsing nginx access logs with logstash and storing the data into elastic search. Then, I use Kibana to run queries and visualize my data in charts. Specifically, I want to know the count of unique IP addresses for a specific time frame using Kibana.
For Kibana 4 go to this answer
This is easy to do with a terms panel:
If you want to select the count of distinct IP that are in your logs, you should specify in the field clientip, you should put a big enough number in length (otherwise, it will join different IP under the same group) and specify in the style table. After adding the panel, you will have a table with IP, and the count of that IP:
Now Kibana 4 allows you to use aggregations. Apart from building a panel like the one that was explained in this answer for Kibana 3, now we can see the number of unique IPs in different periods, that was (IMO) what the OP wanted at the first place.
To build a dashboard like this you should go to Visualize -> Select your Index -> Select a Vertical Bar chart and then in the visualize panel:
In the Y axis we want the unique count of IPs (select the field where you stored the IP) and in the X axis we want a date histogram with our timefield.
After pressing the Apply button, we should have a graph that shows the unique count of IP distributed on time. We can change the time interval on the X axis to see the unique IPs hourly/daily...
Just take into account that the unique counts are approximate. For more information check also this answer.
Be aware with Unique count you are using 'cardinality' metric, which does not always guarantee exact unique count. :-)
the cardinality metric is an approximate algorithm. It is based on the
HyperLogLog++ (HLL) algorithm. HLL works by hashing your input and
using the bits from the hash to make probabilistic estimations on the
cardinality.
Depending on amount of data I can get differences of 700+ entries missing in a 300k dataset via Unique Count in Elastic which are otherwise really unique.
Read more here: https://www.elastic.co/guide/en/elasticsearch/guide/current/cardinality.html
Create "topN" query on "clientip" and then histogram with count on "clientip" and set "topN" query as source. Then you will see count of different ips per time.
Unique counts of field values are achieved by using facets. See ES documentation for the full story, but the gist is that you will create a query and then ask ES to prepare facets on the results for counting values found in fields. It's up to you to customize the fields used and even describe how you want the values returned. The most basic of facet types is just to group by terms, which would be like an IP address above. You can get pretty complex with these, even requiring a query within your facet!
{
"query": {
"match_all": {}
},
"facets": {
"terms": {
"field": "ip_address"
}
}
}
Using Aggs u can easily do that.
Writing down query for now.
GET index/_search
{
"size":0,
"aggs": {
"source": {
"terms": {
"field": "field",
"size": 100000
}
}
}
}
This would return the different values of field with there doc counts.
For Kibana 7.x, Unique Count is available in most visualizations.
For example, in Lens:
In aggregation based visualizations:
And even in TSVB (supporting normal fields as well as Runtime Fields, Scripted Fields are not supported):

Resources