How can I script a field in Kibana that matches first 4 digits of a field? - elasticsearch

I'm facing a steep learning curve with the syntax, and my data contains PII, so I can't describe it in much more detail.
I need a new field in Kibana on the already indexed documents. This field "C" would be a combination of the first 4 digits of a field "A" (type keyword, containing numbers up to the millions) and a field "B" (type keyword, some large number).
Later I will use this field "C", which is a unique combination, to compare against a list/array of items (I will insert the list in a Query DSL query in Kibana, as I need to build some visualizations and reports from the returned documents).
I saw that I could use Painless to create this new field, but I don't know whether I need regex, or how to use it.
EDIT:
As requested, more info about the mapping, with a concrete example.
"fieldA" : {
  "type" : "text",
  "fields" : {
    "keyword" : {
      "type" : "keyword",
      "ignore_above" : 256
    }
  }
},
"fieldB" : {
  "type" : "text",
  "fields" : {
    "keyword" : {
      "type" : "keyword",
      "ignore_above" : 256
    }
  }
},
Example of values:
FieldA = "9876443320134",
FieldB = "000000001".
I would like to sum the first 4 digits of FieldA (9876) and the full content of FieldB (1). FieldC would then have the value "9877".

The raw query could look like this:
GET combination_index/_search
{
  "script_fields": {
    "a+b": {
      "script": {
        "source": """
          if (doc['fieldA.keyword'].size() == 0 || doc['fieldB.keyword'].size() == 0) {
            return null;
          }
          def a = doc['fieldA.keyword'].value;
          def b = doc['fieldB.keyword'].value;
          def parsed_a = new BigInteger(a);
          def parsed_b = new BigInteger(b);
          return new BigInteger(parsed_a.toString().substring(0, 4)) + parsed_b;
        """
      }
    }
  }
}
Note 1: we're parsing the strings into BigInteger because Integer.MAX_VALUE is seemingly insufficient for these values.
Note 2: we're first parsing fieldA and only then calling .toString on it again in order to handle the edge case of fieldA starting with zeros, like 009876443320134. It's assumed that you're looking for 9876, not 98, which would be the result of first calling .substring and then parsing.
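A quick way to sanity-check that logic outside Elasticsearch is to mirror it in plain code; a minimal Python sketch of the same arithmetic (the helper name is made up, the values come from the example above):

```python
def combine(field_a: str, field_b: str) -> str:
    """Parse first, then take the substring, so leading zeros in
    field_a don't shift which four digits we keep."""
    parsed_a = int(field_a)             # "009876443320134" -> 9876443320134
    first_four = int(str(parsed_a)[:4]) # "9876"
    return str(first_four + int(field_b))

print(combine("9876443320134", "000000001"))    # -> "9877"
print(combine("009876443320134", "000000001"))  # also "9877", not "99"
```

Substringing before parsing would instead yield int("0098") + 1 = 99 for the zero-padded value, which is the edge case Note 2 guards against.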
If you intend to use it in Kibana visualizations, you'll need an index pattern first. Once you have one, add a scripted field to the index pattern, paste the script in, and click save; the new scripted field becomes available in numeric aggregations and queries.


Elasticsearch nested phrase search within a distance

Sample ES Document
{
  // other properties
  "transcript" : [
    {
      "id" : 0,
      "user_type" : "A",
      "phrase" : "hi good afternoon"
    },
    {
      "id" : 1,
      "user_type" : "B",
      "phrase" : "hey"
    },
    {
      "id" : 2,
      "user_type" : "A",
      "phrase" : "hi "
    },
    {
      "id" : 3,
      "user_type" : "B",
      "phrase" : "my name is john"
    }
  ]
}
transcript is a nested field whose mapping looks like
{
  "type": "nested",
  "properties": {
    "id": {
      "type": "integer"
    },
    "phrase": {
      "type": "text",
      "analyzer": "standard"
    },
    "user_type": {
      "type": "keyword"
    }
  }
}
I need to search for two phrases inside transcript that are apart by at max a given distance d.
For example:
If the phrases are hi and name and d is 1, the above document matches, because hi is present in the third nested object and name in the fourth. (Note: hi in the first nested object and name in the fourth is NOT valid, as they are more than d=1 apart.)
If the phrases are good and name and d is 1, the above document does not match, because good and name are 3 apart.
If both phrases are present in the same sentence, the distance is considered to be 0.
Possible Solution:
I can fetch all documents where both phrases are present and, on the application side, discard documents where the phrases are more than the given threshold (d) apart. The problem is that I cannot get the count of matching documents beforehand to show in the UI (e.g. "found in 100 documents out of 1900"), because without application-side processing we can't be sure whether a document is actually a match, and it's not feasible to post-process every document in the index.
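A minimal sketch of that application-side check, assuming the sample document's shape and a hypothetical min_phrase_distance helper (simple substring matching stands in for real text analysis):

```python
def min_phrase_distance(transcript, phrase1, phrase2):
    """Return the smallest |id difference| between a nested entry
    containing phrase1 and one containing phrase2, or None if either
    phrase is absent."""
    pos1 = [t["id"] for t in transcript if phrase1 in t["phrase"]]
    pos2 = [t["id"] for t in transcript if phrase2 in t["phrase"]]
    if not pos1 or not pos2:
        return None
    return min(abs(a - b) for a in pos1 for b in pos2)

transcript = [
    {"id": 0, "phrase": "hi good afternoon"},
    {"id": 1, "phrase": "hey"},
    {"id": 2, "phrase": "hi "},
    {"id": 3, "phrase": "my name is john"},
]
print(min_phrase_distance(transcript, "hi", "name"))    # 1 -> matches for d=1
print(min_phrase_distance(transcript, "good", "name"))  # 3 -> no match for d=1
```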
A second possible solution:
{
  "query": {
    "bool": {
      // suppose d = 2
      // if first phrase occurs at 0th offset, second phrase can occur at
      // ... 0th, 1st or 2nd offset
      // if first phrase occurs at 1st offset, second phrase can occur at
      // ... 1st, 2nd or 3rd offset
      // any one of the above permutations should exist
      "should": [
        {
          // search for 1st permutation
        },
        {
          // search for 2nd permutation
        },
        ...
      ]
    }
  }
}
This is clearly not scalable: if d is large and the transcript is long, the query becomes enormous.
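To make the blow-up concrete, here is a rough sketch of generating those permutation clauses as plain dicts (the nested/term query shapes are an assumption; the point is only how the clause count grows with d and transcript length):

```python
def permutation_clauses(phrase1, phrase2, d, max_offset):
    """One should-clause per (offset of phrase1, offset of phrase2) pair
    with |o2 - o1| <= d; roughly (2d+1) clauses per offset."""
    clauses = []
    for o1 in range(max_offset + 1):
        for o2 in range(max(0, o1 - d), min(max_offset, o1 + d) + 1):
            clauses.append({
                "bool": {"must": [
                    {"nested": {"path": "transcript", "query": {"bool": {"must": [
                        {"term": {"transcript.id": o1}},
                        {"match": {"transcript.phrase": phrase1}}]}}}},
                    {"nested": {"path": "transcript", "query": {"bool": {"must": [
                        {"term": {"transcript.id": o2}},
                        {"match": {"transcript.phrase": phrase2}}]}}}},
                ]}
            })
    return clauses

# already hundreds of clauses for a modest transcript
print(len(permutation_clauses("hi", "name", d=2, max_offset=100)))
```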
Kindly suggest any approach.

Using Regexp Search inside a must bool query vs using must_not bool query

I want to make queries like:
- get all documents containing / not containing "some value" for a given field;
- get all documents having a value equal / not equal to "some value" for a given field.
As per my mapping, the fields are of string type, meaning they support both keyword and full-text search, something like:
"myField" : {
  "type" : "text",
  "fields" : {
    "keyword" : {
      "type" : "keyword",
      "ignore_above" : 256
    }
  }
}
I was initially using regex matching like this (this query is for non-matches):
"bool": {
  "must": [
    {
      "regexp": {
        "myField.keyword": {
          "value": "~(some value)",
          "flags": "ALL"
        }
      }
    }
  ]
}
So basically: ~(word) for not-equals, .*word.* for contains, and ~(.*word.*) for not-contains.
But then I also came across the must_not bool query. I understand I can add a must_not clause for the not-equals cases alongside the must and should clauses (for boolean AND and OR between other fields) in my bigger bool query, but I'm still not sure about the contains and not-contains searches. Can someone definitively explain what the best practice is here, in terms of both performance and accuracy of the returned result set?
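For the exact not-equals case, a plain term query under must_not on the keyword sub-field avoids regex entirely; a sketch of both shapes as Python dicts (a minimal sketch, assuming a wildcard query is acceptable for the contains case):

```python
# not-equals via must_not + term: an exact, non-scoring lookup on the
# keyword sub-field, generally cheaper than a regexp complement
not_equals = {"bool": {"must_not": [
    {"term": {"myField.keyword": "some value"}}]}}

# not-contains: wrap a wildcard (or .*word.* regexp) in must_not;
# leading wildcards are still expensive, like the regexp variant
not_contains = {"bool": {"must_not": [
    {"wildcard": {"myField.keyword": "*some value*"}}]}}

print(not_equals)
print(not_contains)
```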
Elasticsearch version used: currently transitioning from v6.3 to v7.1.1.

How do && and || work constructing queries in NEST?

According to http://nest.azurewebsites.net/concepts/writing-queries.html, the && and || operators can be used to combine two queries using the NEST library to communicate with Elastic Search.
I have the following query set up:
var ssnQuery = Query<NameOnRecordDTO>.Match(
q => q.OnField(f => f.SocialSecurityNumber).QueryString(nameOnRecord.SocialSecurityNumber).Fuzziness(0)
);
which is then combined with a Bool query as shown below:
var result = client.Search<NameOnRecordDTO>(
    body => body.Query(
        query => query.Bool(
            bq => bq.Should(
                q => q.Match(
                    p => p.OnField(f => f.Name.First)
                          .QueryString(nameOnRecord.Name.First).Fuzziness(fuzziness)
                ),
                q => q.Match(
                    p => p.OnField(f => f.Name.Last)
                          .QueryString(nameOnRecord.Name.Last).Fuzziness(fuzziness)
                )
            ).MinimumNumberShouldMatch(2)
        ) || ssnQuery
    )
);
What I think this query means is that if the SocialSecurityNumber matches, or both the Name.First and Name.Last fields match, then the record should be included in the results.
When I execute this query with the following data for the nameOnRecord object used in the calls to QueryString:
{
  "socialSecurityNumber": "123456789",
  "name": {
    "first": "ryan"
  }
}
the results are the person with SSN 123456789, along with anyone with first name ryan.
If I remove the || ssnQuery from the query above, I get everyone whose first name is 'ryan'.
With the || ssnQuery in place and the following query:
{
  "socialSecurityNumber": "123456789",
  "name": {
    "first": "ryan",
    "last": "smith"
  }
}
I appear to get the person with SSN 123456789 along with people whose first name is 'ryan' or last name is 'smith'.
So it does not appear that adding || ssnQuery is having the effect that I expected, and I don't know why.
Here is the definition of the index on object in question:
"nameonrecord" : {
  "properties": {
    "name": {
      "properties": {
        "name.first": {
          "type": "string"
        },
        "name.last": {
          "type": "string"
        }
      }
    },
    "address" : {
      "properties": {
        "address.address1": {
          "type": "string",
          "index_analyzer": "address",
          "search_analyzer": "address"
        },
        "address.address2": {
          "type": "string",
          "analyzer": "address"
        },
        "address.city" : {
          "type": "string",
          "analyzer": "standard"
        },
        "address.state" : {
          "type": "string",
          "analyzer": "standard"
        },
        "address.zip" : {
          "type" : "string",
          "analyzer": "standard"
        }
      }
    },
    "otherName": {
      "type": "string"
    },
    "socialSecurityNumber" : {
      "type": "string"
    },
    "contactInfo" : {
      "properties": {
        "contactInfo.phone": {
          "type": "string"
        },
        "contactInfo.email": {
          "type": "string"
        }
      }
    }
  }
}
I don't think the definition of the address analyzer is important, since the address fields are not being used in the query, but can include it if someone wants to see it.
This was in fact a bug in NEST.
A precursor to how NEST helps translate boolean queries:
NEST allows you to use operator overloading to create verbose bool queries/filters easily, i.e.:
term && term will result in:
bool
  must
    term
    term
A naive implementation of this would rewrite
term && term && term to
bool
  must
    term
    bool
      must
        term
        term
As you can imagine, this becomes unwieldy quite fast as the query grows more complex. NEST can spot these and join them together to become:
bool
  must
    term
    term
    term
Likewise term && term && term && !term simply becomes:
bool
  must
    term
    term
    term
  must_not
    term
Now, if in the previous example you pass in a bool query directly, like so:
bool(must=term, term, term) && !term
it would still generate the same query. NEST will also do the same with should clauses when it sees that the boolean descriptors in play ONLY consist of should clauses. This is because the bool query does not quite follow the same boolean logic you expect from a programming language.
To summarize the latter:
term || term || term
becomes
bool
  should
    term
    term
    term
but
term1 && (term2 || term3 || term4) will NOT become
bool
  must
    term1
  should
    term2
    term3
    term4
This is because, as soon as a boolean query has a must clause, the should clauses start acting as a boosting factor. So in the previous example you could get back results that ONLY contain term1, which is clearly not what you want in the strict boolean sense of the input.
NEST therefore rewrites this query to:
bool
  must
    term1
    bool
      should
        term2
        term3
        term4
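In query-DSL terms, the rewritten term1 && (term2 || term3 || term4) comes out roughly like this (sketched as a Python dict, with a made-up term helper standing in for real term queries):

```python
def term(n):
    """Stand-in for a real term query on some field."""
    return {"term": {"field": n}}

# term1 must match; the OR group is pushed into a nested bool so its
# should clauses stay required rather than becoming mere boosts
rewritten = {"bool": {"must": [
    term("term1"),
    {"bool": {"should": [term("term2"), term("term3"), term("term4")]}},
]}}

print(rewritten)
```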
Now, where the bug came into play: in your situation you have
bool(should=term1, term2, minimum_should_match=2) || term3
NEST identified that both sides of the OR operation only contain should clauses and joined them together, which gave a different meaning to the minimum_should_match parameter of the first boolean query.
I just pushed a fix for this; it will be included in the next release, 0.11.8.0.
Thanks for catching this one!

Elasticsearch dynamic mapping with geopoint

We use Elasticsearch to index schemaless data. The thing is that the majority of the entries that we want to index contain fields like "longitude", "latitude", "lat" or "long".
What would be the best way to index that data so the field type allows search with geo distance filter ?
Thanks a lot.
I know it's been some time since you posted this, but in case someone stumbles upon it like I did, here are some ways to do it.
In our case, we needed a dynamic radius so here's the mapping we have:
"mappings": {
  "mygeopoints": {
    "properties": {
      "geopoint": {
        "type": "geo_point",
        "lat_lon" : true
      },
      "radius": {
        "type": "long"
      }
    }
  }
}
Our documents are indexed using a SQL query that looks like this:
SELECT label, (lat || ',' || lon) as geopoint, radius FROM points;
We're sending the geopoint as a string that contains both latitude and longitude separated by a comma.
To now search through the points you can use the geo_distance filter:
"filter" : {
  "geo_distance" : {
    "geopoint" : [ 5.7, 43.5 ],
    "distance" : "15km"
  }
}
In our case, though, we needed a dynamic radius, so the only solution we found was a script filter:
"filter" : {
  "script" : {
    "script" : "!doc['geopoint'].empty && doc['geopoint'].distanceInKm(43.5,5.7) <= doc['radius'].value"
  }
}
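That script filter is just a great-circle distance compared against each document's own radius; a Python sketch of the same check, assuming a haversine distance (coordinates mirror the geo_distance example):

```python
import math

def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def within_radius(doc, query_lat, query_lon):
    # mirrors: doc['geopoint'].distanceInKm(...) <= doc['radius'].value
    lat, lon = doc["geopoint"]
    return distance_km(lat, lon, query_lat, query_lon) <= doc["radius"]

# point ~13.7 km away from (43.5, 5.7): inside a 15 km radius
print(within_radius({"geopoint": (43.5, 5.7), "radius": 15}, 43.6, 5.8))
```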

Check for id existence in param Array with Elasticsearch custom script field

Is it possible to add a custom script field that is a Boolean and returns true if the document's id exists in an array that is sent as a param?
Something like this https://gist.github.com/2437370
What would be the correct way to do this with mvel?
Update:
Having trouble getting it to work as specified in Imotov's answer.
Mapping:
place: {
  properties: {
    _id: { index: "not_analyzed", store: "yes" }
  }
}
Sort:
:sort=>{:_script=>{:script=>"return friends_visits_ids.contains(_fields._id.value)", :type=>"string", :params=>{:friends_visits_ids=>["4f8d425366eaa71471000011"]}, :order=>"asc"}}
I don't get any errors; the documents just don't get sorted correctly.
Update 2
Oh, and I do get this back on the documents:
"sort"=>["false"]
You were on the right track. It just might be more efficient to store the list of ids in a map instead of an array if the list is large.
"sort" : {
  "_script" : {
    "script" : "return friends_visits_ids.containsKey(_fields._id.value)",
    "type" : "string",
    "params": {
      "friends_visits_ids": { "1" : {}, "2" : {}, "4" : {} }
    }
  }
}
Make sure that the id field is stored. Otherwise _fields._id.value will return null for all records.
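The map-versus-array advice is plain asymptotics: containsKey on a hash map is O(1), while contains on a list is O(n) per document. A Python sketch of the sort key the script computes (the ids are hypothetical):

```python
friends_visits_ids = {"1": {}, "2": {}, "4": {}}  # map instead of array

docs = [{"_id": "3"}, {"_id": "2"}, {"_id": "5"}, {"_id": "1"}]

# mirrors the script: the sort key is whether the doc id is a friend's
# visit; with an ascending string sort, non-matching ids come first
# ('False' < 'True'), matching the "sort"=>["false"] values seen above
docs.sort(key=lambda d: str(d["_id"] in friends_visits_ids))
print([d["_id"] for d in docs])
```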
