Is it possible to set several routing values using Elasticsearch NEST?

I need to query data from several shards. The Elasticsearch REST API makes it possible to send a request with several routing keys:
//https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-routing-field.html#_searching_with_custom_routing
GET my-index-000001/_search?routing=user1,user2
{
  "query": {
    "match": {
      "title": "document"
    }
  }
}
Is it possible to do the same with the NEST client?

Yes, you can pass a comma-separated string to the Routing() method of a search request.
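For example, a minimal sketch with the NEST 7.x fluent syntax, assuming a hypothetical Document POCO with a Title property and the index from the question:

var client = new ElasticClient(); // defaults to http://localhost:9200

// Document is a hypothetical POCO with a string Title property
var response = client.Search<Document>(s => s
    .Index("my-index-000001")
    .Routing("user1,user2")   // several routing values as one comma-separated string
    .Query(q => q
        .Match(m => m
            .Field(f => f.Title)
            .Query("document"))));

This should produce the same request as the REST example above, with routing=user1,user2 sent as a query-string parameter.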

Related

Best practices for writing a PUT endpoint for a REST API

I am building a basic CRUD service with some business logic under the hood, and I'm about to start working on the PUT (update) endpoint. I have already fully written+tested GET (read) and POST (create) for my data object. The data store for my documents is an ElasticSearch instance on AWS.
I have some decisions to make about how I want to architect the PUT, namely, how I want to determine a valid request. My goal is to make it so the POST is only for the creation of new assets, and PUT will only update existing documents. (At the moment, I am POSTing to elastic with /_doc/, however the intent is to move to /_create/ as part of this work)
What I'm a little hung-up on is the "right" way to check that a document exists before making the API call to Elastic to update.
When a user submits a document to PUT, should I first GET from Elastic with the document ID to make sure the document already exists? Or should I simply try to "update" the resource and, if it doesn't exist, have one created?
Obviously there are trade-offs to each strategy. With the latter, PUTting a document that doesn't exist almost completely negates the need for a POST at all, so I'd be more inclined to go with the former - despite the additional REST call - to maintain the integrity of the basic REST definition.
Thoughts?
The consideration of whether to update a doc (with versioning) or create a new one with some shared ID relating it to all previous versions depends on your use case -- either of them is 'correct', but there's too little information to advise on that right now.
With regards to the document-exists strategies -- there are essentially 2 types of IDs in ES -- what I call:
internal ids (_id)
external ids (doc_values-provided ids)
Create an index & a doc:
PUT myindex
PUT myindex/_doc/internal_id_1
{
  "external_id": "1"
}
Internal ID check
GET myindex/_doc/internal_id_1
or
GET myindex/_count
{
  "query": {
    "ids": {
      "values": [
        "internal_id_1"
      ]
    }
  }
}
or
GET myindex/_count
{
  "query": {
    "term": {
      "_id": {
        "value": "internal_id_1"
      }
    }
  }
}
External ID check
GET myindex/_count
{
  "query": {
    "term": {
      "external_id": {
        "value": "1"
      }
    }
  }
}
and many others (terms, match for partial matches, etc.)
Note that I've used the _count endpoint instead of _search -- it's slightly faster.
If you intend to check the _version of a given doc before you proceed to update it, replace _count with _search?version=true and the _version attribute will become available:
{
  "_index": "myindex",
  "_type": "_doc",
  "_id": "internal_id_1",
  "_version": 2,   <---
  "_score": 1.0,
  "_source": {
    "external_id": "1"
  }
}
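For reference, the full request for the hit above could look like this (reusing the ids query from earlier):
GET myindex/_search?version=true
{
  "query": {
    "ids": {
      "values": [
        "internal_id_1"
      ]
    }
  }
}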

Datatype creation in AWS DynamoDB and Elasticsearch for a list of URLs

I have enabled AWS DynamoDB streams and created a Lambda function to index the data into Elasticsearch.
In my DynamoDB table there is a column named URL; in it I am going to store a list of URLs for a single row.
The URLs are typically object URLs of AWS S3 objects.
After streaming, I am indexing the data into Elasticsearch. My question is: what datatype should I use to store multiple URLs in both DynamoDB (a single row) and Elasticsearch (a single document)?
Could someone help me achieve this in the most efficient way? Thanks in advance.
JSON structure
{
  "id": "234561",
  "policyholdername": "xxxxxx",
  "age": "24",
  "claimnumber": "234561",
  "policynumber": "456784",
  "url": "https://dgs-dms.s3.amazonaws.com/G-3114_Textract.pdf",
  "claimtype": "Accident",
  "modified_date": "2020-02-05T17:36:49.053Z",
  "dob": "2020-02-05T17:36:49.053Z",
  "client_address": "no,7 royal avenue thirumullaivoyal chennai"
}
In the future, a single claim number may have multiple URLs.
So, how should I handle this?
I'm not sure about DynamoDB types, but in Elasticsearch there is no dedicated type for lists. To store a list of strings (URLs in your case) you can use the keyword field type.
For example, your data could look like
{
  "id": "234561",
  "policyholdername": "xxxxxx",
  "age": "24",
  "claimnumber": "234561",
  "policynumber": "456784",
  "url": ["https://dgs-dms.s3.amazonaws.com/G-3114_Textract.pdf", "https://foo/bar/foo.pdf"],
  "claimtype": "Accident",
  "modified_date": "2020-02-05T17:36:49.053Z",
  "dob": "2020-02-05T17:36:49.053Z",
  "client_address": "no,7 royal avenue thirumullaivoyal chennai"
}
and the equivalent Elasticsearch mapping could be
{
  "mappings": {
    "_doc": {
      "properties": {
        "url": {
          "type": "keyword"
        }
      }
    }
  }
}
and the search query can be
POST index/_search
{
  "query": {
    "term": {
      "url": "https://foo/bar/foo.pdf"
    }
  }
}

Elasticsearch nested objects with query_string as first-class attributes

I'm trying to index a nested field as a first-class attribute in my document so that I can search them using query_string without dot syntax.
For example, if I have a document like
"data": { "name": "Bob" }
instead of searching for data.name:Bob I would like to be able to search for name:Bob
The root of my issue is that we index a jsonb column that may have varying attributes. In some instances the data property may contain a data.business attribute, etc. I would like users to be able to search on these attributes without needing to "dig" into the object.
The data field does not have to be indexed as a nested type unless necessary; I was indexing it as an object previously.
I have tried to leverage the _all field as suggested in this post.
I have also tried to use include_in_parent:true and set the datatype as nested for my data field as suggested in this post.
I have also looked into the inner_hits feature to no avail.
Here's an example of my mapping for the data attribute.
PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "data": {
          "type": "object"
        }
      }
    }
  }
}
Example document
PUT my_index/_doc/1
{
  "data": {
    "name": "bob",
    "business": "None of yours"
  }
}
And how my query currently looks:
GET my_index/_search
{
  "query": {
    "query_string": {
      "query": "name:bob",
      "fields": ["data.*"]
    }
  }
}
With the current setup I almost get my desired results. I can search on individual properties like data.name:bob and data.business:"None of yours" and get back the correct documents.
However I want to be able to get the exact same results with business:"None of yours" or name:bob.
Thanks in advance for any help!
I figured it out using dynamic templates. For anyone coming across this in the future, here is how I solved the issue:
I used path_match to match the data object (data.*).
Then using copy_to and {name} I dynamically created top-level fields on my parent object.
PUT my_index
{
  "mappings": {
    "my_type": {
      "dynamic_templates": [
        {
          "template_1": {
            "path_match": "data.*",
            "mapping": {
              "copy_to": "{name}"
            }
          }
        }
      ]
    }
  }
}

Which query does the search API execute by default in Elasticsearch?

In Elasticsearch, I can access the default search API like
server:9200/index/_search?q=keyword, but how can I replicate this if I am building the query myself? I've tried multi_match and query_string, but the result set seems a bit different than the default search API.
PS: I am using the Elasticsearch PHP client, if that matters.
The equivalent query to server:9200/index/_search?q=keyword is a query_string query like this one
{
  "query": {
    "query_string": {
      "query": "keyword"
    }
  }
}

Elasticsearch to wildcard search email addresses

I'm trying to use elasticsearch for a project I'm working on. I was wondering if someone could help steer me in the right direction. I'm using an index with 100+ million records.
I need to be able to search with a wildcard query like the following:
b*g#gmail.com
b*g#*.com
*gus#gmail.com
br*gu*#gmail.com
*g*#*
When I try using wildcard and other searches, I don't get the results I expect.
What type of search should I look into implementing with Elasticsearch? Is Elasticsearch even the right tool to be using? The source I'm pulling this data from is MySQL, so if not I may consider using Sphinx or Solr.
I assume that you have tried out the wildcard query as described here.
However, it has very different behaviour if your email is analyzed versus not analyzed. I would suggest you delete your index and change your mapping. e.g.
PUT /emails
{
  "mappings": {
    "email": {
      "properties": {
        "email": {
          "type": "string",
          "index": "not_analyzed"
        }
      }
    }
  }
}
Once you have this, you can just do the normal wildcard query or query_string. e.g.
GET emails/_search
{
  "query": {
    "wildcard": {
      "email": {
        "value": "s*com"
      }
    }
  }
}
As an aside, when you just index email without setting it as not_analyzed, the default mapping actually splits up the email prefix from the domain, and that's why you don't get results when you do s*#gmail.com. You would still get results for s* or *gmail.com, but for your case, using not_analyzed works correctly. If you want to support case insensitivity, then you might want to look at a custom analyzer that uses the uax_url_email tokenizer as described here.
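For illustration, here is a minimal sketch of such a setup, using the same string/not_analyzed-era mapping syntax as above; the analyzer name email_analyzer is just an example:
PUT /emails
{
  "settings": {
    "analysis": {
      "analyzer": {
        "email_analyzer": {
          "type": "custom",
          "tokenizer": "uax_url_email",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "email": {
      "properties": {
        "email": {
          "type": "string",
          "analyzer": "email_analyzer"
        }
      }
    }
  }
}
With this, the whole address is indexed as a single lowercased token, so lowercase wildcard patterns such as s*#gmail.com should match regardless of the original casing.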
