ElasticSearch - specify range for a string field

I am trying to retrieve mentions of years between 1933 and 1949 from a string field called text. However, I cannot seem to find a working range query for that. What I have tried so far fails:
{"query":
{"query_string":
{
"text": [1933 TO 1949]
}
}
}
I have also tried it like this:
{"query":
{"filtered":
{"query":{"match_all":{}},
"filter":{"range":{"text":[1933 TO 1949]}
}
}
}
but it still crashes.
A sample text field looks like the one below, containing a mention of the year 1933:
"Primera División 1933 (Argentinië), seizoen in de Argentijnse voetbalcompetitie\n* Primera Divisió n 1933 (Chili), seizoen in de Chileense voetbalcompetitie\n* Primera División 1933 (Uruguay), seizoen in de Uruguayaanse voetbalcompetitie\n \n "
However, I also have documents that do not contain any years, and I would like to filter the documents so that only the ones mentioning years in a given period are kept. I read here http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html that the range query can be applied to text fields as well, and I don't want to use any intermediate solution to identify dates inside the texts.
What I basically want to achieve is to be able to get the same results as when using a search URI query:
urltomyindex/_search?q=text:%7B1933%20TO%201949%7D%27
which works perfectly.
Is it still possible to achieve my goal? Any help much appreciated!
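For reference, the request-body equivalent of that URI search would be a query_string query along these lines (just a sketch using the same Lucene range syntax on the text field):
{
  "query": {
    "query_string": {
      "default_field": "text",
      "query": "{1933 TO 1949}"
    }
  }
}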

This should do it:
GET index1/type1/_search
{
  "query": {
    "filtered": {
      "filter": {
        "terms": {
          "fieldNameHere": [
            "1933",
            "1934",
            "1935",
            "1936",
            "1937",
            "1938",
            "1939",
            "1940",
            "1941",
            "1942",
            "1943",
            "1944",
            "1945",
            "1946",
            "1947",
            "1948",
            "1949"
          ]
        }
      }
    }
  }
}
If you know you're going to need this kind of search frequently, it would be much better to create a new field such as "yearPublished" so you can search it as a number rather than as text (see the range sketch below).
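For instance, a plain range query on such a numeric field would look like this (a sketch; the yearPublished field name is just the suggestion above, not something that exists in the original mapping):
GET index1/type1/_search
{
  "query": {
    "range": {
      "yearPublished": {
        "gte": 1933,
        "lte": 1949
      }
    }
  }
}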

Related

Elastic Search - Accessing a member of an element inside a list

I'm relatively new to Elasticsearch and have a question about accessing an element inside an element of a list. The structure is as follows:
{
  "TestA": "1",
  "TestB": {
    "TestC": "2",
    "TestD": [
      {
        "TestE": "3",
        "TestF": "4"
      },
      {
        "TestE": "5",
        "TestF": "6"
      }
    ]
  }
}
With the structure above, I want to return all the results from the query in which TestF has a value of 6. I was wondering if this is possible with the following template.
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "TestB.TestD.TestF": "6"
          }
        }
      ]
    }
  }
}
Would {"match" : { "TestB.TestD.TestF": '6'}} be able to search through each element of 'TestD' or would I need to use some other command to iterate through the list? This is with elastic search 5.0. Thanks in advance!
Yes, your match query should find the results you are looking for. Elasticsearch flattens arrays when it puts them in the inverted index. For more information, check out the docs:
https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html#_how_arrays_of_objects_are_flattened
Arrays of inner object fields do not work the way you may expect.
Lucene has no concept of inner objects, so Elasticsearch flattens
object hierarchies into a simple list of field names and values.
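Concretely, the sample document above ends up indexed roughly as the flat fields below, which is why the match on TestB.TestD.TestF finds the value 6. Note, however, that the pairing of TestE and TestF within a single inner object is lost; a query that needs both fields to match inside the same element would require mapping TestD as a nested type and using a nested query:
{
  "TestA": "1",
  "TestB.TestC": "2",
  "TestB.TestD.TestE": ["3", "5"],
  "TestB.TestD.TestF": ["4", "6"]
}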

Format reading ElasticSearch dates

This is my mapping for one of the properties in my ElasticSearch model:
"timestamp":{
"type":"date",
"format":"dd-MM-yyyy||yyyy-MM-dd'T'HH:mm:ss.SSSZ||epoch_millis"
}
I'm not sure if I'm misunderstanding the documentation. It clearly says:
The first format will also act as the one that converts back from milliseconds to a string representation.
And that is exactly what I want. I would like to be able to read directly (if possible) the dates as dd-MM-yyyy.
Unfortunately, when I go to the document itself (so, accessing to the ElasticSearch's endpoint directly, not via the application layer) I still get:
"timestamp" : "2014-01-13T15:48:25.000Z",
What am I missing here?
As @Val mentioned, you get the value back in the format in which it was indexed.
However, if you want to view the date in a particular format regardless of the format in which it was indexed, you can make use of script fields. Note that this is applied at query time.
The query below would do it:
POST <your_index_name>/_search
{
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "timestamp": {
      "script": {
        "inline": "def sf = new SimpleDateFormat(\"dd-MM-yyyy\");def dt = new Date(doc['timestamp'].value);def mydate = sf.format(dt);return mydate;"
      }
    }
  }
}
Let me know how it goes.
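For what it's worth, on more recent Elasticsearch versions the default script language is Painless and the script body goes under source instead of inline. A rough equivalent would be the sketch below; it assumes a version where doc['timestamp'].value is a date object supporting java.time formatting rather than epoch millis:
POST <your_index_name>/_search
{
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "timestamp": {
      "script": {
        "lang": "painless",
        "source": "doc['timestamp'].value.format(DateTimeFormatter.ofPattern(\"dd-MM-yyyy\"))"
      }
    }
  }
}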

ElasticSearch - creating exceptions for fuzzy terms

I have a simple Elasticsearch query that does a simple text-field search with a fuzziness distance of one:
GET /jobs/_search
{
  "query": {
    "fuzzy": {
      "attributes.title": {
        "value": "C#",
        "fuzziness": 1
      }
    }
  }
}
The above query does exactly what it is told to do, but I have cases where I don't want a word to resolve (with fuzziness) to another specific word. In this case, I don't want C# to also return C++ results. Similarly, I don't want cat to return car results.
However, I do still need the fuzziness option in case someone actually misspelled cat. In that case it can return results for both cat and car.
I think this is possible with some bool query combination; it should be something like this (a concrete sketch follows the outline):
bool:
//should
//match query without fuzzy
//bool
//must
//must with fuzzy query
//must_not with match query
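Spelled out, that outline would look roughly like the sketch below (untested, using the attributes.title field from the query above): the plain match clause scores exact matches, while the inner bool allows fuzzy matches but explicitly excludes C++:
GET /jobs/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "attributes.title": "C#"
          }
        },
        {
          "bool": {
            "must": [
              {
                "fuzzy": {
                  "attributes.title": {
                    "value": "C#",
                    "fuzziness": 1
                  }
                }
              }
            ],
            "must_not": [
              {
                "match": {
                  "attributes.title": "C++"
                }
              }
            ]
          }
        }
      ],
      "minimum_should_match": 1
    }
  }
}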

Partitioning aggregates with groups

I'm trying to partition an aggregation similarly to the example in the Elasticsearch documentation, but I'm not getting the example to work.
The index is populated with event-types:
public class Event
{
    public int EventId { get; set; }
    public string SegmentId { get; set; }
    public DateTime Timestamp { get; set; }
}
The EventId is unique, and each event belongs to a specific SegmentId. Each SegmentId can be associated with zero to many events.
The question is:
How do I get the latest EventId for each SegmentId?
I expect the number of unique segments to be in the range of 10 million, and the number of unique events one or two orders of magnitude greater. That's why I don't think using top_hits by itself is appropriate, as suggested here. Hence, partitioning.
Example:
I have set up a demo-index populated with 1313 documents (unique EventId), belonging to 101 distinct SegmentId (i.e. 13 events per segment). I would expect the query below to work, but the exact same results are returned regardless of which partition number I specify.
POST /demo/_search
{
  "size": 0,
  "aggs": {
    "segments": {
      "terms": {
        "field": "segmentId",
        "size": 15,            <-- I want 15 segments from each query
        "include": {
          "partition": 0,      <-- Trying to retrieve the first partition
          "num_partitions": 7  <-- Expecting 7 partitions (7*15 > 101 segments)
        }
      },
      "aggs": {
        "latest": {
          "top_hits": {
            "size": 1,
            "_source": [
              "timestamp",
              "eventId",
              "segmentId"
            ],
            "sort": {
              "timestamp": "desc"
            }
          }
        }
      }
    }
  }
}
If I remove the include and set size to a value greater than 101, I get the latest event for every segment. However, I doubt that is a good approach with a million buckets...
You are essentially trying to scroll an aggregation.
The Scroll API is supported only for search queries, not for aggregations. If you do not want to rely on top_hits alone, as you stated, because of the huge number of documents, you can try one of the following:
A parent/child approach, where each segment is a parent document and its events are child documents. Every time you add a child, you update a timestamp field on the parent document. That way you can just query the parent documents and you will have your segment id plus the last event timestamp.
Another approach would be to get the top hits only for the last 24 hours: add a query that first filters on the last 24 hours and then run the top_hits aggregation on that subset.
It turns out I was investigating the wrong question... My example actually works perfectly.
The problem was my local Elasticsearch node. I don't know what went wrong with it, but when I repeated the example on another machine it worked, while I was unable to get partitioning working on my current installation. After uninstalling and reinstalling Elasticsearch, the example worked there as well.
To answer my original question, the example I provided is the way to go. I solved my problem by using the cardinality aggregation to get an estimate of the total number of segments, from which I derived a suitable number of partitions. Then I looped the query above for each partition and added the documents to a final list.
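For reference, that estimate can be obtained with a cardinality aggregation like the sketch below (against the demo index from the question); num_partitions is then roughly that approximate count divided by the desired partition size, rounded up:
POST /demo/_search
{
  "size": 0,
  "aggs": {
    "segment_count": {
      "cardinality": {
        "field": "segmentId"
      }
    }
  }
}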

java: how to limit score results in mongo

I have this mongo query (java):
TextQuery.queryText(textCriteria).sortByScore().limit(configuration.getSearchResultSize())
which performs a text search and sort by score.
I gave different weights to different fields in the document, and now I'd like to retrieve only those results with a score lower than 10.
Is there a way to add that criterion to the query?
This didn't work:
query.addCriteria(Criteria.where("score").lt(10));
If the only way is to use aggregation, I need a mongoTemplate example for that.
In other words, how do I translate the following mongo shell aggregate command to Java Spring's mongoTemplate command? I can't find anywhere how to use the aggregation's match() API together with the $text search component (the text index covers several different fields):
db.text.aggregate(
  [
    { $match: { $text: { $search: "read" } } },
    { $project: { title: 1, score: { $meta: "textScore" } } },
    { $match: { score: { $lt: 10.0 } } }
  ]
)
Thanks!
Please check the code sample below; it does a MongoDB search with pagination in Java.
BasicDBObject query = new BasicDBObject();
// case-insensitive regex match on the column being searched
query.put(column_name, new BasicDBObject("$regex", searchString).append("$options", "i"));
DBCursor cursor = dbCollection.find(query);
cursor.skip((pageNum - 1) * limit); // skip the documents of the previous pages
cursor.limit(limit);                // cap the result at one page
Call the above code in a loop, passing pageNum values from 1 to n; the limit depends on your requirements. Check whether the cursor is empty: if it is, exit the loop; if not, keep calling the code above.
Hope this will be helpful.
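That said, to run the shell pipeline from the question through mongoTemplate, one hedged option (assuming Spring Data MongoDB 2.x or later, where getCollection returns the driver's MongoCollection<Document>) is to pass the pipeline stages to the driver directly, mirroring the shell command verbatim:
import java.util.Arrays;
import org.bson.Document;
import com.mongodb.client.AggregateIterable;

// Same stages as the shell pipeline: $text match, project the textScore into "score", then filter on it.
AggregateIterable<Document> results = mongoTemplate.getCollection("text").aggregate(Arrays.asList(
        Document.parse("{ $match: { $text: { $search: \"read\" } } }"),
        Document.parse("{ $project: { title: 1, score: { $meta: \"textScore\" } } }"),
        Document.parse("{ $match: { score: { $lt: 10.0 } } }")
));
for (Document doc : results) {
    System.out.println(doc.toJson());
}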
