I am interested in using the Kyoto Cabinet key-value store, and I want to perform a range query, preferably through a bulk interface (returning a byte array). I checked Kyoto Cabinet's Java API and couldn't find any method that does this. Since range queries are very common for key-value stores, I was wondering whether anybody has experience doing range queries over Kyoto Cabinet.
Related
I have two indexes that share one common field (essentially a relationship).
Since Elasticsearch does not let us filter across multiple indexes, should we hold the results in memory in a variable and filter them in Node.js (which basically means my application itself is now acting as a database server)?
We previously used MongoDB, which is also a NoSQL database, and we were able to handle this with aggregate queries, but Elasticsearch does not seem to provide an equivalent.
So even if we use both databases together, we have to store their results somewhere to filter them further, because we give users advanced search functionality that lets them filter data from multiple collections.
So should we store the results in memory to filter them further? We currently offer advanced search over 100 million records to customers, but without the advanced text search that Elasticsearch provides; we now plan to offer Elasticsearch text search to customers.
What approach do you suggest for using MongoDB and Elasticsearch together? We are using Node.js to serve data.
Or which of these options should we choose?
Denormalizing: Flatten your data
Application-side joins: Run multiple queries on normalized data
Nested objects: Store arrays of objects
Parent-child relationships: Store multiple documents through joins
https://blog.mimacom.com/parent-child-elasticsearch/
https://spoon-elastic.com/all-elastic-search-post/simple-elastic-usage/denormalize-index-elasticsearch/
Storing things client side in memory is not the solution.
First of all, the simplest way to solve this problem is to make one combined index. It's trivial to do: insert all the documents from index 2 into index 1, and prefix every field coming from index 2 with something like "idx2" so that you won't overwrite any fields with the same name. You can use an ingest pipeline to do this, or just do it client side. You will only ever do this once.
After that you can run aggregations on the single index, since all the data is in one index.
If you are using something other than ES as your primary data store, you need to reconfigure the indexing operation so that everything that previously went into index 2 goes into index 1 as well (with the prefixed fields).
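The client-side prefixing step described above could look roughly like the following. This is only a sketch: the helper name `prefix_fields`, the sample document, and the idea of leaving the shared join field (`customer_id` here) unprefixed are all assumptions made for illustration.

```python
from typing import Any, Dict, Iterable


def prefix_fields(doc: Dict[str, Any], prefix: str,
                  shared: Iterable[str] = ()) -> Dict[str, Any]:
    """Return a copy of doc with every key prefixed, except the shared fields."""
    shared_set = set(shared)
    return {
        (key if key in shared_set else f"{prefix}_{key}"): value
        for key, value in doc.items()
    }


# Hypothetical document from index 2; "customer_id" is the assumed common field.
order = {"customer_id": 42, "status": "shipped", "name": "Order #7"}
merged_ready = prefix_fields(order, "idx2", shared=["customer_id"])
print(merged_ready)
```

The resulting document can then be indexed into index 1 without its "name" or "status" colliding with fields of the same name that already exist there.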
100 million records is trivial for something like Elasticsearch. Doing any kind of "join" client side is NOT RECOMMENDED, as it defeats the entire value of using ES.
If you need any further help with this, feel free to contact me. I have 11 years of experience with ES, and I have seen people struggle with "joins" 99% of the time. :)
The first thing to do when coming from MySQL/PostgreSQL or even MongoDB is to restructure the indices to suit the needs of your queries. Never try to work with multiple indices; ES is not built for that.
HTH.
I am new to DynamoDB and I am looking for suggestions and recommendations. We have a paginated API that has to search for multiple values of an indexed attribute. Since DynamoDB allows only one value of an indexed attribute to be searched in a single query, a batch call would be needed. However, a batch call would complicate pagination, so currently the required IDs are fetched from Elasticsearch for those multiple values (in a paginated way), and the complete documents are then fetched from DynamoDB using those IDs. Is this the correct approach, or is there a better alternative?
I'm trying to decide how to structure the data in Elasticsearch.
I have a system that produces metrics on a daily basis. I would like to put those metrics into ES so I can do some advanced querying and sorting. I also only care about the most recent data that's in there. The system producing the data could also be late.
Currently I can think of two options:
I can have one index with a date field that contains the date the metric was created. I am unsure, however, how to write the query so that, if multiple days' worth of data are in the index, I filter it down to just the most recent set.
I could also split the data into different indexes (recent and past) and have some sort of process that migrates data from the recent index to the past index. I think the challenge with this would be the downtime while data is being moved and/or added into the recent index.
Thoughts?
A common approach to solving this problem with Elasticsearch is to store the data once in a form that allows historic querying, and again in a second form that allows querying the most recent data. For example, if your metric update looked like:
{
"type":"OperationsPerSecond",
"name":"Questions",
"value":10
}
Then it can be indexed into a current-values index using a composite key constructed from the document (obviously, for this to work you need to be able to construct a composite key from your document!). For example, the identity for this document might be the type and name concatenated. You then use the upsert API to write your updates to the same document:
POST current_metrics/_update/OperationsPerSecond-Questions
{
"doc": {
"type":"OperationsPerSecond",
"name":"Questions",
"value":10
},
"doc_as_upsert": true
}
Every time you call this API with the same composite key it will update the existing document, rather than create a new document. This will give you an index that only contains a single record per metric you are monitoring, and you can query that index to get your most recent values.
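As a rough illustration of the composite-key idea, the request path and body could be assembled like this. The helper `build_upsert` is hypothetical, and the sketch assumes the Update API is called with a `doc` plus `doc_as_upsert` so that the document is created the first time and updated afterwards. It only builds the request; sending it is left to whatever HTTP or Elasticsearch client you use.

```python
import json
from typing import Any, Dict, Tuple
from urllib.parse import quote


def build_upsert(metric: Dict[str, Any],
                 index: str = "current_metrics") -> Tuple[str, str]:
    """Build the (path, JSON body) for an upsert keyed on type and name."""
    # Composite key: type and name concatenated, as described in the text.
    doc_id = f"{metric['type']}-{metric['name']}"
    # URL-encode the id in case type or name contain reserved characters.
    path = f"/{index}/_update/{quote(doc_id, safe='')}"
    body = json.dumps({"doc": metric, "doc_as_upsert": True})
    return path, body


metric = {"type": "OperationsPerSecond", "name": "Questions", "value": 10}
path, body = build_upsert(metric)
print("POST", path)
print(body)
```

Because the id is derived purely from the document, a late-arriving update for the same metric lands on the same document instead of creating a second one.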
To store your historic data, you change your primary key strategy. It would probably be most straightforward to use the index API and let Elasticsearch generate a primary key for you:
POST all_metrics/_doc/
{
"type":"OperationsPerSecond",
"name":"Questions",
"value":10
}
This API will create a new document for every request made to it. So as long as you have something in your data that you can use in an elastic range query, such as a field like createdDate with a value that looks like a date time, then you should be able to query historic data.
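A search body for the historic index might be put together along these lines. This is a sketch under assumptions: `createdDate` is the date field mentioned above and must actually be present in each document, `type` and `name` are assumed to be mapped as keyword fields (otherwise `term` filters will not match as expected), and the helper name `history_query` is made up for the example.

```python
from datetime import datetime, timezone
from typing import Any, Dict


def history_query(metric_type: str, name: str,
                  start: datetime, end: datetime) -> Dict[str, Any]:
    """Build a search body for one metric within [start, end), newest first."""
    return {
        "query": {
            "bool": {
                "filter": [
                    # Assumes keyword mappings for "type" and "name".
                    {"term": {"type": metric_type}},
                    {"term": {"name": name}},
                    {"range": {"createdDate": {
                        "gte": start.isoformat(),
                        "lt": end.isoformat(),
                    }}},
                ]
            }
        },
        "sort": [{"createdDate": "desc"}],
    }


body = history_query(
    "OperationsPerSecond", "Questions",
    datetime(2024, 1, 1, tzinfo=timezone.utc),
    datetime(2024, 2, 1, tzinfo=timezone.utc),
)
print(body)
```

The body would be sent to the search endpoint of the historic index (all_metrics in the example above).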
The main thing is: don't worry about duplicating your data for different purposes. Elasticsearch does a good job of compressing this data on disk and in memory. Storing data multiple times is called denormalization, and it is a common technique in data warehousing and big data.
In Elasticsearch, can I associate each document with a (different) function that must be satisfied by parameters I supply on a search, in order to be returned on that search?
The functions I would particularly like to use involve a loop, some kind of simple branching (an if-statement or switch-statement), an array-like data structure, string comparisons, and simple boolean operators.
A couple of key notes here:
At query time:
- If you're looking to shape the relevancy function, meaning the actual relevancy score of each document, you could use a script score query.
- If you're only looking to filter out unwanted documents, you could use a script query, which allows you to do just that.
Both of those solutions enable you to compute a result by comparing incoming query parameters against previously indexed values.
Take note that using scripts at query time can lead to increased memory usage and performance issues.
Elasticsearch can also apply a second batch of filtering rules to the actual query result in the form of a post filter. This can come in handy if you're not in a position to stream-process the output at the API view level.
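To make the script query option more concrete, here is a minimal sketch of a search body that filters documents by comparing an indexed value with a parameter supplied at search time. The field `min_quantity`, the parameter `quantity`, and the helper name `script_filter` are invented for illustration; the script itself is written in Painless.

```python
from typing import Any, Dict


def script_filter(source: str, params: Dict[str, Any]) -> Dict[str, Any]:
    """Wrap a Painless expression in a script query used as a bool filter."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"script": {"script": {
                        "lang": "painless",
                        "source": source,
                        "params": params,
                    }}}
                ]
            }
        }
    }


# Keep only documents whose (hypothetical) min_quantity is at most the
# quantity supplied by the caller.
body = script_filter("doc['min_quantity'].value <= params.quantity",
                     {"quantity": 5})
print(body)
```

Passing values through `params` rather than concatenating them into the script source lets Elasticsearch reuse the compiled script across requests.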
At index time:
There is a feature called script fields that lets you store a function that computes a result based on other fields' values and incoming query parameters. They can be really powerful given that they are assigned at index time. I think they might be what you are looking for.
I would not use those unless I needed the field values compared against query params. The reason is that I like my indexing process to be lean and fast, so I tend to compute those kinds of values at stream level, upstream of the actual bulk indexing request.
Although convenient, the results of those custom scripts are likely to be achievable with a combination of regular queries and filters. In each release, the Elasticsearch team adds new query and field types that let you do what you used to do via scripted queries without the risk of blowing out your memory. A good example of this is the rank feature datatype recently introduced in the 7.x release.
A piece of advice: think of your Elasticsearch service as a regular API in your data layer. As such, you can do query processing before the actual call to Elasticsearch, and you can do data processing on the actual Elasticsearch results. If you really can't fit your business rules in there, scripting would be your last resort.
Feel free to contact me if you still have any questions. All the best.
Is it possible to use Slice via solrTemplate?
Actually, I am struggling to see whether it would even make a difference, because even without Spring there doesn't appear to be any way of telling Solr to exclude its "numFound" (total results) from a query.
And when I use a normal Spring Data Page<..> query and look under the hood, I see only one query issued to Solr, i.e. no extra one for the count. Or is the count simply done inside Solr somehow in an extra step?
confused
Total document count is part of the Solr query. No additional query is required. Therefore, there is no advantage to Slice vs. Page.
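To illustrate, a Solr JSON response already carries the total next to the page of documents, so both the total and a "has next page" flag can be derived from a single response. The sample response and the helper `page_info` below are made up for the example; only the `response`, `numFound`, `start`, and `docs` keys come from Solr's standard JSON output.

```python
from typing import Any, Dict, Tuple


def page_info(solr_response: Dict[str, Any]) -> Tuple[int, bool]:
    """Return (total hits, whether more results follow this page)."""
    resp = solr_response["response"]
    total = resp["numFound"]
    # More results exist if this page ends before the total is reached.
    has_next = resp["start"] + len(resp["docs"]) < total
    return total, has_next


# Hypothetical response for a query with rows=2.
sample = {
    "response": {
        "numFound": 3,
        "start": 0,
        "docs": [{"id": "a"}, {"id": "b"}],
    }
}
total, has_next = page_info(sample)
print(total, has_next)
```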
The only related concern arises when somebody wants to export a significant amount of data, in which case built-in paging becomes slower the deeper into the result set the request goes. For that, Solr has exporting functionality.