Using numerics as type in Elasticsearch - elasticsearch

I am going to store transaction logs on elasticsearch. I am new to ELK stack and not sure about how I should implement this on ELK stack. My transaction is printing lines of log sequentially(upserts) and instead of logging these to a file I want to store these on ElastichSearch and later I will query the logs by the transactionId I have created.
Normally the URI for querying will be
/bookstore/books/_search
but in my case it must be like
/transactions/transactionId/_search
because I dont want to store lines as array attached to a single transaction record but I am not sure if this is a good practice to create a new type in the beginning of every transaction. I am not even sure if this is possible.
Can you give advices about storing these transaction data on elasticsearch?

if you want to query with a URI like /transactions/transactionId/_search, that means you are planning to create multiple types every time a new transactionid comes. Now , apart from this being a bad design, its not even possible to have more than one type in an index(post version 5.X I guess) and types have been completely removed since version 7.X .
One work-around is if you use the transactionId itself as the document ID while creation. Then you can get the log associated with one transactionId by querying GET transactions/transactionId (read about the length restrictions of the document id though) but this might cause another issue, that being , there can be multiple logs for the same transaction, so each log entry having the same id would simply overwrite the previous entry.
The best solution here will be to change how you query those records.
For this you can put transactionId as one of the fields in the json body, along with maybe a created time stamp at the time of insertion ( let ES create the documents with the auto generated id) and then query all logs associated with a transaction like :
POST transactions/_search
{
"sort": [
{
"createdDate": {
"order": "asc"
}
}
],
"query":{
"bool":{
"must":[
{
"term":{
"transactionId.keyword":"<transaction id>"
}
}
]
}
}
}
Hope, this helps

Related

Query does not return some items from DynamoDB via GraphQL

May I please know what is the reason why are items in DynamoDB not being fetched by GraphQL?
When searching via the DynamoDB console interface, I could easily see and query the item in there but once used in GraphQL, some items are not showing. Mind you, this isn't a connection problem because I could query items its just there's a specific item that is not being returned.
For example, if I query all Posts, it will return all posts in an array but the item is not showing there. However, when I try to query a Post just by targetting it by its ID, it is working well.
Example code that is not working:
listPosts(filter: {groupID: {eq: "25"}}) {
items {
id
content
}
}
but when I do this, it is working well:
getPost(id: "c59ce7e9") {
id
content
}
I had this same issue and can share what i found and worked for me.
The default resolver for the list operation has a limit:20 built in.
{
"version": "2017-02-28",
"operation": "Scan",
"filter": #if($context.args.filter) $util.transform.toDynamoDBFilterExpression($ctx.args.filter) #else null #end,
"limit": $util.defaultIfNull($ctx.args.limit, 20),
"nextToken": $util.toJson($util.defaultIfNullOrEmpty($ctx.args.nextToken, null)),
}
I imagine you could change this or you could add a limit filter to your query like this:
listPosts(filter: {groupID: {eq: "25"}}, limit:100) {
items {
id
content
}
}
The limit should be higher than the number of records.
You can see that this would be an issue because it is using the scan operation meaning it inspects each record for a match. this would hurt performance. you could add pagination or you should craft a query for this. you will need to look into pagination, relations and connection.
https://docs.aws.amazon.com/appsync/latest/devguide/designing-your-schema.html#advanced-relations-and-pagination

How can I let ES support mixed type of a field?

I am saving logs to Elasticsearch for analysis but I found there are mixed types of a particular field which causing error when indexing the document.
For example, I may save below log to the index where uuid is an object.
POST /index-000001/_doc
{
"uuid": {"S": "001"}
}
but from another event, the log would be:
POST /index-000001/_doc
{
"uuid": "001"
}
the second POST will fail because the type of uuid is not an object. so I get this error: object mapping for [uuid] tried to parse field [uuid] as object, but found a concrete value
I wonder what the best solution for that? I can't change the log because they are from different application. The first log is from the data of dynamodb while the second one is the data from application. How can I save both types of logs into ES?
If I disable dynamic mapping, I will have to specify all fields in the index mapping. For any new fields, I am not able to search them. so I do need dynamic mapping.
There will be many cases like that. so I am looking for a solution which can cover all conflict fields.
It's perfectly possible using ingest pipelines which are run before the indexing process.
The following would be a solution for your particular use case, albeit somewhat onerous:
create a pipeline
PUT _ingest/pipeline/uuid_normalize
{
"description" : "Makes sure uuid is a hash map",
"processors" : [
{
"script": {
"source": """
if (ctx.uuid != null && !(ctx.uuid instanceof java.util.HashMap)) {
ctx.uuid = ['S': ctx.uuid]; // hash map init
}
"""
}
}
]
}
run the pipeline when ingesting a new doc
POST /index-000001/_doc
{
"uuid": {"S": "001"}
}
POST /index-000001/_doc?pipeline=uuid_normalize <------
{
"uuid": "001"
}
You could now extend this to be as generic as you like but it is assumed that you know what you expect as input in each and every doc. In other words, unlike dynamic templates, you need to know what you want to safeguard against.
You can read more about painless script operators here.
You just cannot.
You should either normalize all your field in a way or another.
Or use 2 separate field.
I can suggest to use a field like this :
"uuid": {"key": "S", "value": "001"}
and skip the key when not necessary.
But you will have to preprocess your value before ingestion.

Filtering collapsed results in Elasticsearch

I have an elasticsearch index containing documents that represent entities at a given point in time. When an entity changes state, a new document is created with a timestamp. When I need to get the current state of all entities, I can do the following:
GET https://127.0.0.1:9200/myindex/_search
{
"collapse": {
"field": "entity_id"
},
"sort" : [{
"timestamp": {
"order": "desc"
}
}]
}
However, I would like to further filter the result of the collapse. When entities are deleted I create a new document that includes an is_deleted flag along with the timestamp in a nested metadata field. I would like to extend the above query to entirely filter out those entities that have been deleted. Using a term filter on entity_metadata.is_deleted: true obviously does not work, because then my result just includes the last document with that entity_id before it got marked as deleted. How can I filter my results after the collapse is done to exclude any tombstoned entites?
What I would suggest is that instead of adding an is_deleted flag to all entity_id documents, you could add a date_deleted field with the date of the deletion to all documents of that entity, and then when you view a document, given its date and the deleted_date you'd know if the document was LIVE or deleted at that date.
In addition, it would allow you to consider:
all documents that don't have a deleted_date field (i.e. not deleted) and
all documents that have a deleted_date before/after a given date.

Logstash -> Elasticsearch - update denormalized data

Use case explanation
We have a relational database with data about our day-to-day operations. The goal is to allow users to search the important data with a full-text search engine. The data is normalized and thus not in the best form to make full-text queries, so the idea was to denormalize a subset of the data and copy it in real-time to Elasticsearch, which allows us to create a fast and accurate search application.
We already have a system in place that enables Event Sourcing of our database operations (inserts, updates, deletes). The events only contains the changed columns and primary keys (on an update we don't get the whole row). Logstash already gets notified for each event so this part is already handled.
Actual problem
Now we are getting to our problem. Since the plan is to denormalize our data we will have to make sure updates on parent objects are propagated to the denormalized child objects in Elasticsearch. How can we configure logstash to do this?
Example
Lets say we maintain a list of Employees in Elasticsearch. Each Employee is assigned to a Company. Since the data is denormalized (for the purpose of faster search), each Employee also carries the name and address of the Company. An update changes the name of a Company - how can we configure logstash to update the company name in all Employees, assigned to the Company?
Additional explanation
#Darth_Vader:
The problem we are facing is, that we get an event that a Company has changed, but we want to modify documents of type Employee in Elasticsearch, because they carry the data about the company in itself. Your answer expects that we will get an event for every Employee, which is not the case.
Maybe this will make it clearer. We have 3 employees in Elasticsearch:
{type:'employee',id:'1',name:'Person 1',company.cmp_id:'1',company.name:'Company A'}
{type:'employee',id:'2',name:'Person 2',company.cmp_id:'1',company.name:'Company A'}
{type:'employee',id:'3',name:'Person 3',company.cmp_id:'2',company.name:'Company B'}
Then an update happens in the source DB.
UPDATE company SET name = 'Company NEW' WHERE cmp_id = 1;
We get an event in logstash, where it says something like this:
{type:'company',cmp_id:'1',old.name:'Company A',new.name:'Company NEW'}
This should then be propagated to Elasticsearch, so that the resulting employees are:
{type:'employee',id:'1',name:'Person 1',company.cmp_id:'1',company.name:'Company NEW'}
{type:'employee',id:'2',name:'Person 2',company.cmp_id:'1',company.name:'Company NEW'}
{type:'employee',id:'3',name:'Person 3',company.cmp_id:'2',company.name:'Company B'}
Notice that the field company.name changed.
I suggest a similar solution to what I've posted here, i.e. to use the http output plugin in order to issue an update by query call to the Employee index. The query would need to look like this:
POST employees/_update_by_query
{
"script": {
"source": "ctx._source.company.name = params.name",
"lang": "painless",
"params": {
"name": "Company NEW"
}
},
"query": {
"term": {
"company.cmp_id": "1"
}
}
}
So your Logstash config should look like this:
input {
...
}
filter {
mutate {
add_field => {
"[script][lang]" => "painless"
"[script][source]" => "ctx._source.company.name = params.name"
"[script][params][name]" => "%{new.name}"
"[query][term][company.cmp_id]" => "%{cmp_id}"
}
remove_field => ["host", "#version", "#timestamp", "type", "cmp_id", "old.name", "new.name"]
}
}
output {
http {
url => "http://localhost:9200/employees/_update_by_query"
http_method => "post"
format => "json"
}
}

Elasticsearch: document relationship

I'm doing a elastic search autocomplete-as-you-type
I'm using cool features like ngram's and other stuff to create needed analyzer.
currently I break my had around indexing following data.
Let say I have Payments type,
each document in this type looks like this
{
..elastic meta data..
paymentId: 123453425342,
providerAccount : {
id: 123456
firstName: Alex,
lastName: Web
},
consumerAccount : {
id: 7575757,
firstName: John,
lastName: Doe
},
amount: 556,
date : 342523454235345 (some unix timestamp)
}
so basically this document represents not only the payment itself but it also shows the relationship of the payment, the 2 entities which related to the payment.
Payment always have its provider and consumer.
I need this data in payment document because I want to show it in UI.
By indexing it like so, it might be a big pain for handling the updates of Consumer or Provider because each time some of them change its properties I have to update all the payments which has this entity.
Another possible solution is to store only id's of this consumers/providers and make a query on payments and then 2 queries for the entities for retrieving needed fields, but i'm not sure about this because i'm doing ajax requests each time a character entered, so here comes the performance question.
I have also looked into parent/child relationship solution which basically fits my case but I wasn't able to figure out if I can retrieve also the parent(consumer/provider) fields while I querying child(payment).
What would you suggest?
Thanks!
Yes, you can retrieve the parent while querying child using has_child.
Considering payment as child and consumer as parent, You can search all the consumers by :
GET /index_name/consumer/_search
{
"query": {
"has_child": {
"type": "payment",
"query": {
// any query on payment table
},
"inner_hits": {}
}
}
}
This would fetch you all the consumer based on the query on child i.e payment in your case.
inner_hits is what you are looking for. This will retrieve you the children as well. But it was introduced in elasticsearch 1.5.0. So version should be greater than elasticsearch 1.5.0.
You can refer https://www.elastic.co/blog/elasticsearch-1-5-0-released.
Your problem is not an issue. I suppose you want tot freeze data after the pay, right? So you don't need to update the accounts data in existing payment documents.
Further: parent/schild is easy for updating, but less efficient with querying. For auto complete, stay using your current mapping!

Resources