Best practices for writing a PUT endpoint for a REST API - elasticsearch

I am building a basic CRUD service with some business logic under the hood, and I'm about to start working on the PUT (update) endpoint. I have already fully written+tested GET (read) and POST (create) for my data object. The data store for my documents is an ElasticSearch instance on AWS.
I have some decisions to make about how I want to architect the PUT, namely, how I want to determine a valid request. My goal is to make it so the POST is only for the creation of new assets, and PUT will only update existing documents. (At the moment, I am POSTing to elastic with /_doc/, however the intent is to move to /_create/ as part of this work)
What I'm a little hung-up on is the "right" way to check that a document exists before making the API call to Elastic to update.
When a user submits a document to PUT, should I first GET from Elastic with the document ID to make sure the document already exists? Or should I simply try to "update" the resource, and if it doesn't exist, one gets created?
Obviously there are trade-offs to each strategy. With the latter, PUTting a document that doesn't exist almost completely negates the need for a POST at all, so I'd be more inclined to go with the former - despite the additional REST call - to maintain the integrity of the basic REST definition.
Thoughts?

Whether to update a doc (with versioning) or create a new one with some shared ID relating it to all previous versions depends on your use case -- either of them is 'correct', but there's too little information to advise on that right now.
With regard to the document-exists strategies -- there are essentially 2 types of IDs in ES -- what I call:
internal ids (_id)
external ids (doc_values-provided ids)
Create an index & a doc:
PUT myindex

PUT myindex/_doc/internal_id_1
{
  "external_id": "1"
}
Internal ID check
GET myindex/_doc/internal_id_1
or
GET myindex/_count
{
  "query": {
    "ids": {
      "values": [
        "internal_id_1"
      ]
    }
  }
}
or
GET myindex/_count
{
  "query": {
    "term": {
      "_id": {
        "value": "internal_id_1"
      }
    }
  }
}
External ID check
GET myindex/_count
{
  "query": {
    "term": {
      "external_id": {
        "value": "1"
      }
    }
  }
}
and many others (terms, match (for partial matches etc), ...)
Note that I've used the _count endpoint instead of _search -- it's slightly faster.
If you intend to check the _version of a given doc before you proceed to update it, replace _count with _search?version=true and the _version attribute will become available:
{
  "_index": "myindex",
  "_type": "_doc",
  "_id": "internal_id_1",
  "_version": 2,      <---
  "_score": 1.0,
  "_source": {
    "external_id": "1"
  }
}
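Worth noting, as a sketch rather than a prescription: you can also let Elasticsearch enforce the create-only / update-only split itself, instead of issuing a separate existence check. The _create endpoint rejects writes to an existing _id with a 409 conflict, and the _update endpoint returns a 404 (document_missing_exception) if the document is missing, so your PUT handler can simply map that failure to an error response. The field values below are illustrative:
PUT myindex/_create/internal_id_1
{
  "external_id": "1"
}
returns 409 if internal_id_1 already exists (create-only, i.e. your POST), while
POST myindex/_update/internal_id_1
{
  "doc": {
    "external_id": "1"
  }
}
returns 404 if it doesn't exist (update-only, i.e. your PUT).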

Related

Most performant way to update a single document in Elasticsearch via an alias

I have an Elasticsearch setup with an alias that points to many indices. I need to update a single document, but I don't know which index it resides in.
There are two ways I can accomplish this as far as I can see:
_update_by_query:
POST my-alias/_update_by_query
{
  "query": {
    "terms": {
      "_id": ["my-id-to-update"]
    }
  },
  "script": {
    "source": "ctx._source['Field'] = 'new value'"
  }
}
read (which returns the specific index) then write:
GET my-alias/_search
{
  "query": {
    "terms": {
      "_id": ["my-id-to-update"]
    }
  }
}

POST my-index-returned-from-the-get/_update/my-id-to-update
{
  "doc": {
    "Field": "new value"
  }
}
Which method is more performant?
Which method is preferred?
Is there a better way than either of these two?
The performance of both approaches will be roughly the same, with one difference: the first approach only needs to send one request, compared to two for the second. So it is better to use the first approach, as you will cut the number of API calls in half.
Also, in my opinion the first approach is much cleaner and fits the concept of Elasticsearch aliases better, because you are encapsulating the exact index name away from your application: the application doesn't need to have any clue about which exact index its documents live in.
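For context, this is roughly what such an alias setup looks like; the index names here are made up for illustration:
POST _aliases
{
  "actions": [
    { "add": { "index": "my-index-000001", "alias": "my-alias" } },
    { "add": { "index": "my-index-000002", "alias": "my-alias" } }
  ]
}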
An important note about updating a document in Elasticsearch: documents in Elasticsearch never get updated in place. The existing document is flagged as deleted and a new document is created (this is due to the Lucene implementation); the old document is only physically removed later, during Lucene segment merging.
You can find a good blog post about segment merging here.

How to extract and visualize values from a log entry in OpenShift EFK stack

I have an OKD cluster setup with EFK stack for logging, as described here. I have never worked with one of the components before.
One deployment logs requests that contain a specific value that I'm interested in. I would like to extract just this value and visualize it with an area map in Kibana that shows the amount of requests and where they come from.
The content of the message field basically looks like this:
[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}
This plz is a German zip code, which I would like to visualize as described.
My problem here is that I have no idea how to extract this value.
A nice first success would be if I could find it with a regexp, but Kibana doesn't seem to work the way I think it does. Following its docs, I expect this /\"plz\":\"[0-9]{5}\"/ to deliver the result, but I get 0 hits (the time interval is set correctly). And even if this regexp matched, I would only find the log entries that contain it, not the specific value itself. How do I proceed from here?
I guess I also need an external geocoding service, but at which point would I include it? Or does Kibana itself know how to map zip codes to geometries?
A beginner-friendly step-by-step guide would be perfect, but I could settle for some hints that guide me there.
It is possible to parse the message field as the document gets indexed into ES, using an ingest pipeline with a grok processor.
First, create the ingest pipeline like this:
PUT _ingest/pipeline/parse-plz
{
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{POSINT:plz}"
        ]
      }
    }
  ]
}
Then, when you index your data, you simply reference that pipeline:
PUT plz/_doc/1?pipeline=parse-plz
{
  "message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}"""
}
And you will end up with a document like the one below, which now has a field called plz with the 12345 value in it:
{
  "message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}""",
  "plz": "12345"
}
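If you want to sanity-check the grok pattern without indexing anything, you can also dry-run the pipeline with the simulate API (here using the sample message from the question):
POST _ingest/pipeline/parse-plz/_simulate
{
  "docs": [
    {
      "_source": {
        "message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}"""
      }
    }
  ]
}
The response shows the document as it would look after the pipeline ran, so you can verify that the plz field comes out as expected.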
When indexing your document from Fluentd, you can specify a pipeline to be used in the configuration. If you can't or don't want to modify your Fluentd configuration, you can also define a default pipeline for your index that will kick in every time a new document is indexed. Simply run this on your index and you won't need to specify ?pipeline=parse-plz when indexing documents:
PUT index/_settings
{
  "index.default_pipeline": "parse-plz"
}
If you have several indexes, a better approach might be to define an index template instead, so that whenever a new index called project.foo-something is created, the settings are going to be applied:
PUT _template/project-indexes
{
  "index_patterns": ["project.foo*"],
  "settings": {
    "index.default_pipeline": "parse-plz"
  }
}
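Note that _template is the legacy template API; if you're on Elasticsearch 7.8 or later, the equivalent composable index template would look roughly like this:
PUT _index_template/project-indexes
{
  "index_patterns": ["project.foo*"],
  "template": {
    "settings": {
      "index.default_pipeline": "parse-plz"
    }
  }
}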
Now, in order to map that PLZ on a map, you'll first need to find a data set that provides you with geolocations for each PLZ.
You can then add a second processor in your pipeline in order to do the PLZ/ZIP to lat,lon mapping:
PUT _ingest/pipeline/parse-plz
{
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": [
          "%{POSINT:plz}"
        ]
      }
    },
    {
      "script": {
        "lang": "painless",
        "source": "ctx.location = params[ctx.plz];",
        "params": {
          "12345": {"lat": 42.36, "lon": 7.33}
        }
      }
    }
  ]
}
Ultimately, your document will look like this and you'll be able to leverage the location field in a Kibana visualization:
{
  "message": """[fooServiceClient#doStuff] {"somekey":"somevalue", "multivalue-key": {"plz":"12345", "foo": "bar"}, "someotherkey":"someothervalue"}""",
  "plz": "12345",
  "location": {
    "lat": 42.36,
    "lon": 7.33
  }
}
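One caveat: inlining every PLZ/coordinate pair as script params won't scale to thousands of zip codes. Assuming you're on Elasticsearch 7.5 or later, a sketch of an alternative is to load the lookup data into its own index and use an enrich processor instead of the script processor (the plz_geo index and plz-lookup policy names below are made up):
PUT plz_geo/_doc/12345
{
  "plz": "12345",
  "location": { "lat": 42.36, "lon": 7.33 }
}

PUT _enrich/policy/plz-lookup
{
  "match": {
    "indices": "plz_geo",
    "match_field": "plz",
    "enrich_fields": ["location"]
  }
}

POST _enrich/policy/plz-lookup/_execute

PUT _ingest/pipeline/parse-plz
{
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{POSINT:plz}"]
      }
    },
    {
      "enrich": {
        "policy_name": "plz-lookup",
        "field": "plz",
        "target_field": "geo"
      }
    }
  ]
}
The enriched document then carries the coordinates under geo.location; whichever route you take, make sure the coordinate field is mapped as geo_point so Kibana can use it on a map.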
So to sum it all up, it all boils down to only two things:
Create an ingest pipeline to parse documents as they get indexed
Create an index template for all project* indexes whose settings include the pipeline created in step 1

ElasticSearch - Delete documents by specific field

This seemingly simple task is not well-documented in the ElasticSearch documentation:
We have an ElasticSearch instance with an index that has a field in it called sourceId. What API call would I make to first, GET all documents with 100 in the sourceId field (to verify the results before deletion) and then to DELETE same documents?
You probably need to make two API calls here: the first to view the count of matching documents, and the second to perform the deletion.
The query is the same in both cases; only the endpoints differ. I'm also assuming that sourceId is of type keyword.
Query to Verify
POST <your_index_name>/_search
{
  "size": 0,
  "query": {
    "term": {
      "sourceId": "100"
    }
  }
}
Execute the above term query and take note of hits.total in the response.
Remove "size": 0 from the query above if you want to see the matching documents themselves in the response.
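Alternatively, the _count endpoint returns just the number of matching documents, which is all you need for this verification step (same query, different endpoint):
GET <your_index_name>/_count
{
  "query": {
    "term": {
      "sourceId": "100"
    }
  }
}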
Once you have verified the count, you can go ahead and perform the deletion using the same query, as shown below; note that only the endpoint changes.
Query to Delete
POST <your_index_name>/_delete_by_query
{
  "query": {
    "term": {
      "sourceId": "100"
    }
  }
}
Once you execute the delete by query, check the deleted field in the response; it should show you the same number.
I've used term queries here, but you can also make use of a match query or any complex bool query. Just make sure that the query is correct before you delete.
Hope it helps!
To delete all the documents of an index without deleting the mapping and settings:
POST /my_index/_delete_by_query?conflicts=proceed&pretty
{
  "query": {
    "match_all": {}
  }
}
See: https://opster.com/guides/elasticsearch/search-apis/elasticsearch-delete-by-query/

How do I architect my Elasticsearch indexes for rapidly changing data?

I have two models in my MySQL database: Users and Posts
users have geolocation attributes (lat/long)
posts simply have a body of text.
I want to use Elasticsearch to find all posts that match a string of text plus use the user's location as a filter. The problem is -- the user's location always changes (as people walk around the city). I will be frequently updating the lat/long of each user.
This is my current solution:
Index the posts and have a geolocation attribute in each document. When a user changes location, run an elasticsearch batch update on all that user's posts, and modify the geolocation attribute on those documents.
Obviously this is not a scalable solution -- what if the user has 2000 posts and walks around the city? I'd have to update 2000 documents every minute.
Is there a way to "relationally map" the posts to the user's object and use it as a filter, so that when the location changes, I only need to update that user's object instead of all his posts?
Updating 2000 posts per minute is not a big deal, either with the update-by-query plugin or with the upcoming reindex API. However, if you have many users with many posts and you need to update them at short intervals (e.g. every minute), it might indeed not be that scalable. Say it takes 500 milliseconds to update all of one user's posts; you'd start to lag behind at around 120 users.
Clearly, since the users' posts need to "follow" the user rather than keep the location the user had when she posted them, I would first query the users around a given location to get their IDs, and then run a second query on posts filtered by those user IDs and matching the body text.
It is perfectly OK to keep both of your indices simple and only update the location in a single user's document every minute. The two queries I'm suggesting should be quite fast and you should not be worried about running them. People are often worried when they need to run two or more queries in order to find their results; sometimes, trying to tie the documents too tightly together is not the solution, and simply running two queries over two indices works perfectly well.
The query to retrieve users would look similar to the first one below, where you only retrieve the _id property of the user. I'm assuming that your user documents have the id of the user as their ES doc _id, so you do not have to retrieve the _source at all (i.e. "_source": false), which is even faster, and you can simply return the _id with response filtering:
POST /users/_search?filter_path=hits.hits._id
{
  "size": 1000,
  "_source": false,
  "query": {
    "bool": {
      "filter": [
        {
          "geo_distance": {
            "distance": "100m",
            "location": {
              "lat": 32.5362723,
              "lon": -80.3654783
            }
          }
        }
      ]
    }
  }
}
You'll get all the _id values of the users who are currently 100 meters around the desired geographic location. Then the next query consists of filtering the posts by those ids while matching their body text.
POST /posts/_search
{
  "size": 50,
  "query": {
    "bool": {
      "must": {
        "match": {
          "body": "some text"
        }
      },
      "filter": [
        {
          "terms": {
            "user_id": [ 1, 2, 3, 4 ]
          }
        }
      ]
    }
  }
}
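For completeness, these queries assume mappings roughly like the ones below: the geo_distance filter needs location mapped as geo_point, and user_id should be a keyword (or numeric) field. The index and field names simply follow the queries above:
PUT /users
{
  "mappings": {
    "properties": {
      "location": { "type": "geo_point" }
    }
  }
}

PUT /posts
{
  "mappings": {
    "properties": {
      "body": { "type": "text" },
      "user_id": { "type": "keyword" }
    }
  }
}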

ElasticSearch query referencing document

I read some time ago that there was a way to build a query that references another document in your index. At the time, this wasn't helpful to me, but I now have very large GIS areas that I need to query against and sending this data to ElasticSearch in the query body every time seems wasteful.
While my specific use-case relates to GIS, geo_shape, etc, it's a general issue that can be applied to other types of queries.
I have a document type areas that holds all of the predefined search areas (these are things like suburbs, states, etc) and entities that hold all of my search data, including a geo_point type field with lat/lon.
I need to be able to construct a geo_shape query for entities documents that references the mpoly attribute (which is a GeoShape type) on an areas document for its shape coordinates.
Unfortunately, neither Google nor the ElasticSearch docs have proved useful here, because nested documents (related, but not what I'm looking for) are what people generally seem to be more interested in.
Finally found the answer myself while looking for something different. Unfortunately, the information about the GeoShape filter is not in the GeoShape query manual pages:
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-geo-shape-filter.html#_pre_indexed_shape
{
  "filtered": {
    "query": {
      "match_all": {}
    },
    "filter": {
      "geo_shape": {
        "location": {
          "indexed_shape": {
            "id": "DEU",
            "type": "countries",
            "index": "shapes",
            "path": "location"
          }
        }
      }
    }
  }
}
If anyone has better information about how to do this generically, I will happily accept their answer instead.
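For what it's worth, the filtered query above is pre-2.x syntax; on current Elasticsearch versions the rough equivalent is a bool filter with the same indexed_shape reference (and no type parameter, since mapping types are gone). Index and field names below follow the snippet above and the question:
GET /entities/_search
{
  "query": {
    "bool": {
      "filter": {
        "geo_shape": {
          "location": {
            "indexed_shape": {
              "index": "shapes",
              "id": "DEU",
              "path": "location"
            }
          }
        }
      }
    }
  }
}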
