Implement equivalent Elasticsearch Titan query using the Elasticsearch Java driver

I asked a question about paginating Elasticsearch results fetched through Titan here:
Pagination with Elastic Search in Titan
and concluded that it isn't supported right now, so I decided to search the Titan index directly using the ES Java client.
Here is the Titan way of fetching ES records:
Iterable<Result<Vertex>> vertices = g.indexQuery("search", "v.testTitle:(mytext)")
        .addParameter(new Parameter("from", 0))
        .addParameter(new Parameter("size", 2)).vertices();
for (Result<Vertex> result : vertices) {
    Vertex tv = result.getElement();
    System.out.println(tv.getProperty("testTitle") + ": " + result.getScore());
}
It returns 1 record, but addParameter() is not honored, so pagination does not work. So I wanted to do the same thing directly from the ES Java client, as below:
Node node = nodeBuilder().node();
Client client = new TransportClient()
        .addTransportAddress(new InetSocketTransportAddress("127.0.0.1", 9300));
SearchResponse response = client.prepareSearch("titan")
        .setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
        .setQuery(QueryBuilders.fieldQuery("testTitle", "mytext")) // query
        .execute()
        .actionGet();
System.out.println(response.getHits().totalHits());
node.close();
It prints zero in my case, but the same query in the Titan code above returns 1 record. Am I missing some Titan-specific parameters or options in this ES Java code?

I think that Titan is sending a QueryStringQuery. That said, I would recommend using a MatchQuery:
QueryBuilders.matchQuery("testTitle", "your text, whatever you want")
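Putting that together, a minimal sketch of the whole search with pagination, using the same old TransportClient API (and the same imports) as the question's snippet. The index name "titan" and the field "testTitle" are taken from the question; whether Titan stores the property under that exact field name in ES is an assumption worth verifying:

Client client = new TransportClient()
        .addTransportAddress(new InetSocketTransportAddress("127.0.0.1", 9300));
SearchResponse response = client.prepareSearch("titan")
        // match query instead of fieldQuery / QueryStringQuery
        .setQuery(QueryBuilders.matchQuery("testTitle", "mytext"))
        .setFrom(0)   // pagination: offset
        .setSize(2)   // pagination: page size
        .execute()
        .actionGet();
System.out.println(response.getHits().totalHits());
client.close();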

Related

Filtering in high level rest java elasticsearch client over multiple fields

I need to filter over a few fields. All is OK if I filter over one field, but adding the second one gives wrong results. The fields are REQ_ID and CATEGORY.
In Kibana, the following returns the expected results:
REQ_ID: '1574778361496' and CATEGORY: 'WARNING'
In Java, I tried:
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
final SimpleQueryStringBuilder sqb = QueryBuilders.simpleQueryStringQuery("REQ_ID: '1574778361496' and CATEGORY: 'WARNING'");
searchSourceBuilder.query(sqb);
searchRequest.source(searchSourceBuilder);
Logically, this yields results equivalent to the query above with or instead of and.
I also tried:
final BoolQueryBuilder bool = QueryBuilders.boolQuery()
.filter(QueryBuilders.termQuery("REQ_ID", "1574778361496"))
.filter(QueryBuilders.termQuery("CATEGORY", "WARNING"))
;
searchSourceBuilder.query(bool);
This yields no results.
Not sure if this is a bug due to the strange nature of the high-level REST Java client, or am I reading the wrong manual?
client version: 7.4.2
ELK version (sebp/elk:740): 7.4.0
Kibana uses the Kibana Query Language (KQL), which is a little different from the Lucene language. You can switch to Lucene at the top right of Kibana. But you need to use the correct Lucene syntax (both in Kibana with the Lucene syntax and in your Java code), with uppercase operators:
REQ_ID: 'id1' AND CATEGORY: 'WARNING'
^^^
And you must use a QueryStringQueryBuilder instead of a SimpleQueryStringBuilder:
SearchSourceBuilder searchSourceBuilder = new SearchSourceBuilder();
final QueryStringQueryBuilder sqb = QueryBuilders.queryStringQuery("REQ_ID: 'id1' AND CATEGORY: 'WARNING'");
searchSourceBuilder.query(sqb);
searchRequest.source(searchSourceBuilder);
You do not need the SimpleQueryString query for your use case; a TermQuery is fine. I recommend using the un-analyzed field here, since you are only matching on an exact value. You can wrap the term queries in filter clauses of a Bool query. Example:
QueryBuilder query = QueryBuilders.boolQuery()
        .filter(QueryBuilders.termQuery("REQ_ID.keyword", "1574778361496"))
        .filter(QueryBuilders.termQuery("CATEGORY.keyword", "WARNING"));

Bulk insert docs to Elasticsearch with nanosecond timestamps

I'm trying to use the nanosecond support provided by Elasticsearch 7.1 (actually introduced in 7.0). I'm not sure how to do this correctly.
Before 7.0, Elasticsearch only supported millisecond timestamps. I use the _bulk API to inject documents:
# bulk post docs to Elasticsearch
def es_bulk_insert(log_lines, batch_size=1000):
    headers = {'Content-Type': 'application/x-ndjson'}
    while log_lines:
        batch, log_lines = log_lines[:batch_size], log_lines[batch_size:]
        batch = '\n'.join([x.es_post_payload for x in batch]) + '\n'
        request = AWSRequest(method='POST', url=f'{ES_HOST}/_bulk', data=batch, headers=headers)
        SigV4Auth(boto3.Session().get_credentials(), 'es', 'eu-west-1').add_auth(request)
        session = URLLib3Session()
        r = session.send(request.prepare())
        if r.status_code > 299:
            raise Exception(f'Received a bad response from Elasticsearch: {r.text}')
The log index is generated per day:
# ex:
#   log-20190804
#   log-20190805
def es_index(self):
    current_date = datetime.strftime(datetime.now(), '%Y%m%d')
    return f'{self.name}-{current_date}'
The timestamp is in nanoseconds, "2019-08-07T23:59:01.193379911Z", and before 7.0 it was automatically mapped to a date type by Elasticsearch:
"timestamp": {
    "type": "date"
},
Now I want to map the timestamp field to the "date_nanos" type. From here, I think I need to create the ES index with the correct mapping before calling the es_bulk_insert() function to upload docs:
GET https://{es_url}/log-20190823
If it does not exist (returns 404):
PUT https://{es_url}/log-20190823
{
    "mappings": {
        "properties": {
            "timestamp": {
                "type": "date_nanos"
            }
        }
    }
}
...
call es_bulk_insert()
...
My questions are:
1. If I do not remap the old data (e.g. log-20190804), the timestamp will have two mappings (date vs date_nanos). Will there be a conflict when I use Kibana to search the logs?
2. I haven't seen many posts about this new feature. Will it hurt performance a lot? Has anyone used it in prod?
3. Kibana doesn't support nanosecond search before 7.3. I'm not sure whether it can sort by nanoseconds correctly; I will try.
Thanks!
You are right: for date_nanos you need to create the mapping explicitly, otherwise the dynamic mapping will fall back to date.
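For illustration, if you create the index from Java rather than raw REST calls, a minimal sketch with the 7.x high-level client could look like this; the index name and client setup are assumptions:

import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.indices.CreateIndexRequest;
import org.elasticsearch.common.xcontent.XContentType;

// create the index with an explicit date_nanos mapping before bulk-inserting
CreateIndexRequest request = new CreateIndexRequest("log-20190823");
request.mapping("{\"properties\":{\"timestamp\":{\"type\":\"date_nanos\"}}}", XContentType.JSON);
client.indices().create(request, RequestOptions.DEFAULT);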
And you are also correct that Kibana supports date_nanos in general from 7.3 on; the relevant ticket is IMO https://github.com/elastic/kibana/issues/31424.
However, sorting doesn't work correctly yet. That is because both date (millisecond precision) and date_nanos (nanosecond precision) are represented as a long counting from the start of the epoch. So the first one will have a value of 1546344630124 and the second one of 1546344630123456789, which doesn't give you the expected sort order.
In Elasticsearch there is a search parameter "numeric_type": "date_nanos" that will cast both to nanosecond precision and thus order them correctly (added in 7.2). However, that parameter isn't used in Kibana yet; I've raised an issue for that now.
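In the Java client the same parameter is exposed on the sort builder. A sketch, assuming Elasticsearch 7.2+ and a timestamp field; the index pattern is an assumption:

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;
import org.elasticsearch.search.sort.FieldSortBuilder;
import org.elasticsearch.search.sort.SortOrder;

SearchSourceBuilder source = new SearchSourceBuilder()
        .query(QueryBuilders.matchAllQuery())
        .sort(new FieldSortBuilder("timestamp")
                .order(SortOrder.DESC)
                .setNumericType("date_nanos")); // cast date and date_nanos to the same precision
SearchRequest request = new SearchRequest("log-*").source(source);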
For performance: The release blog post has some details. Obviously there is overhead (including document size), so I would only use the higher precision if you really need it.

ElasticSearch retrieves documents slowly

I'm using the Java API to retrieve records from Elasticsearch, and it needs approximately 5 seconds to retrieve 100,000 documents in the Java application.
Is that slow for Elasticsearch, or is it normal?
Here are the index settings (screenshot not included).
I tried to get better performance, but without result. Here is what I did:
Set the Elasticsearch heap to 3 GB (it was the 1 GB default): -Xms3g -Xmx3g
Migrated Elasticsearch to an SSD from a 7200 RPM hard drive
Retrieved only one field instead of 30
Here is my Java implementation:
private void getDocuments() {
    int counter = 1;
    try {
        lgg.info("started");
        TransportClient client = new PreBuiltTransportClient(Settings.EMPTY)
                .addTransportAddress(new TransportAddress(InetAddress.getByName("localhost"), 9300));
        SearchResponse scrollResp = client.prepareSearch("ebpp_payments_union")
                .setSearchType(SearchType.DFS_QUERY_THEN_FETCH)
                .setQuery(QueryBuilders.matchAllQuery())
                .setScroll(new TimeValue(1000))
                .setFetchSource(new String[] { "payment_id" }, null)
                .setSize(10000)
                .get();
        do {
            for (SearchHit hit : scrollResp.getHits().getHits()) {
                if (counter % 100000 == 0) {
                    lgg.info(counter + "--" + hit.getSourceAsString());
                }
                counter++;
            }
            scrollResp = client.prepareSearchScroll(scrollResp.getScrollId())
                    .setScroll(new TimeValue(60000))
                    .execute()
                    .actionGet();
        } while (scrollResp.getHits().getHits().length != 0);
        client.close();
    } catch (UnknownHostException e) {
        e.printStackTrace();
    }
}
I know that TransportClient is deprecated; I tried the RestHighLevelClient as well, but it doesn't change anything.
Do you know how to get better performance? Should I change something in Elasticsearch or modify my Java code?
Performance troubleshooting/tuning is hard to do without understanding everything involved, but that does not seem very fast. Because this is a single-node cluster, you're going to run into some performance issues. If this were a production cluster, you would have at least one replica for each shard, which can also be used for reading.
A few other things you can do:
Route your documents based on your most frequently searched attribute - this writes all of the documents with the same attribute to the same shard, so ES does less work reading (this won't help you, since you have a single shard).
Add replica shards so you can fan out the reads across nodes in the cluster (once again, you need an actual cluster for this).
Don't put the master role on the same boxes as your data - in a moderate or large cluster you should have boxes that are neither master nor data but are the ones your app connects to, so they can manage the meta work for searches and let the data nodes focus on data.
Use "query_then_fetch" - unless you are using weighted searches, in which case you should probably stick with DFS.
I see three possible axes for optimization:
1/ Sort your documents on the _doc key (see the sketch after this list):
Scroll requests have optimizations that make them faster when the sort order is _doc. If you want to iterate over all documents regardless of the order, this is the most efficient option.
(documentation source)
2/ Reduce your page size; 10000 seems a high value. Can you run tests with reduced values like 5000/1000?
3/ Remove the source filtering:
.setFetchSource(new String[] { "payment_id" }, null)
Source filtering can be heavy, since the Elasticsearch node needs to read the source, transform it into an object, and then filter it. So can you try removing this? The network load will increase, but it's a trade-off :)
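Putting suggestions 1/ and 2/ together against the code from the question, a sketch of the adjusted scroll setup; only the sort, the search type, and the page size change, everything else (client, index, imports) is the asker's own code:

SearchResponse scrollResp = client.prepareSearch("ebpp_payments_union")
        // default query_then_fetch instead of DFS_QUERY_THEN_FETCH
        .setQuery(QueryBuilders.matchAllQuery())
        .addSort(FieldSortBuilder.DOC_FIELD_NAME, SortOrder.ASC) // _doc: cheapest order for scrolls
        .setScroll(new TimeValue(60000))
        .setFetchSource(new String[] { "payment_id" }, null)
        .setSize(5000) // reduced page size, to be benchmarked
        .get();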

ElasticSearch get only document ids, _id field, using search query on index

For a given query I want to get only the list of _id values, without any other information (no _source, _index, _type, ...).
I noticed that using _source and requesting non-existing fields returns only minimal data, but can I get even less data in return?
Some answers suggest using the hits part of the response, but I do not want the other info.
It's better to use scroll and scan to get the result list, so Elasticsearch doesn't have to rank and sort the results.
With the elasticsearch-dsl python lib this can be accomplished by:
from elasticsearch import Elasticsearch
from elasticsearch_dsl import Search
es = Elasticsearch()
s = Search(using=es, index=ES_INDEX, doc_type=DOC_TYPE)
s = s.fields([]) # only get ids, otherwise `fields` takes a list of field names
ids = [h.meta.id for h in s.scan()]
I suggest using elasticsearch_dsl for Python; it has a nice API.
from elasticsearch_dsl import Document
# s is a Search object, built as in the previous answer
# don't return any fields, just the metadata
s = s.source(False)
results = list(s)
Afterwards you can get the id with:
first_result: Document = results[0]
id: Union[str, int] = first_result.meta.id
Here is the official documentation to get some extra information: https://elasticsearch-dsl.readthedocs.io/en/latest/search_dsl.html#extra-properties-and-parameters
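The same idea in the Elasticsearch Java high-level REST client, for comparison: turn off _source entirely and read the ids from the hit metadata. A sketch; the index name, query, and client are placeholders:

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.SearchHit;
import org.elasticsearch.search.builder.SearchSourceBuilder;

SearchSourceBuilder source = new SearchSourceBuilder()
        .query(QueryBuilders.matchAllQuery()) // placeholder query
        .fetchSource(false);                  // no _source at all; hits still carry _id
SearchRequest request = new SearchRequest("my-index").source(source);
SearchResponse response = client.search(request, RequestOptions.DEFAULT);
for (SearchHit hit : response.getHits()) {
    String id = hit.getId();
}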

How to make a query for Elasticsearch in Spring Boot like we do in the Sense plugin

This is the query I ran in Sense:
GET /_search?q=2016
It searches the whole database and returns all entries that have "2016" in any field.
Giving q as a query parameter runs it as a query string query, so you have to use the Java API for Query String. You can use:
SearchResponse response = client.prepareSearch("index_name")
        .setTypes("type1Name")
        .setQuery(QueryBuilders.queryStringQuery("2016"))
        .execute()
        .actionGet();
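Since TransportClient is deprecated in newer versions, the equivalent with the high-level REST client might look like this; a sketch, where the client setup is assumed and an empty SearchRequest searches all indices, like GET /_search:

import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.builder.SearchSourceBuilder;

SearchSourceBuilder source = new SearchSourceBuilder()
        .query(QueryBuilders.queryStringQuery("2016")); // same as ?q=2016
SearchRequest request = new SearchRequest().source(source); // no index: search everything
SearchResponse response = client.search(request, RequestOptions.DEFAULT);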
