Infinispan cache union query - caching

I have two caches with different types inside
Would like to do a paged query on both of them. So would like to pass in a sort/filter values and get content from both caches?
Is there a way how to do it without manually writing the merge and pagination?
Currently I can only do something like this:
val queryFactory = Search.getQueryFactory(cache)
queryFactory.from(Class.getClass)
or
val searchManager = Search.getSearchManager(cache)
searchManager.buildQueryBuilderForClass(Class.getClass).get()

Searching across multiple caches is not supported and there are no concrete plans to support it. Neither the query DSL nor the direct Lucene API allow it. The workaround is to merge the search results yourself.
The main reason for this is each cache has its own separate set of indexes. So a search across caches would have to retrieve data from multiple indexes and perform a merge which is not efficient in current implementation, so this was left out for technical reasons for now.

Related

Is it possible to merge to create a document C from document A and B with ElasticSearch

My problem is that I need to perform sorting from data coming from two different datasource, a MySQL database which contains information about some products and a PostgreSQL that contains some metrics linked to these products.
Because the data resides in two different datasources I cannot out of the box come up with a single performant query that would make the ordering (pagination) at database level.
I need to make two different queries and then manually merge the data and perform sorting and pagination code side.
I would like to avoid as much as possible having to create a custom pagination system and a manual data merging and as much as possible delegate this job to the underlying database.
This is where I thought a system such as ElasticSearch (or Solr, but ES seems to be easier to use) could help.
1) Does ES provide tools or mechanism to merge 2 datasource into 1 document ? Or this job needs to be done by a 3rd party tool that will peridocally pull the data from both datasource and create / update the documents?
2) I'm correct to assume that having 2 indices (or 2 different doc type) is pointless in my case since ES cannot perform join queries ?
3) Apart from creating one single document what other solution do I have that ES can help with? Is it possible 'somehow' that with having datasource1 data in an index1 and datasource2 data in an index2 I can perform multiple search queries using both the index at the same time (since join is a no go).
Does ES provide tools or mechanism to merge 2 datasource into 1 document ? Or this job needs to be done by a 3rd party tool that will peridocally pull the data from both datasource and create / update the documents?
There are two approaches to accomplish this :
An ETL process (Extract, Transform, Load) to load data from both sources into one single document. In the Elastic world you can use logstash to accomplish this
Data Virtualization is supposed to do this without the need to copy the data
3) Apart from creating one single document what other solution do I have that ES can help with? Is it possible 'somehow' that with having datasource1 data in an index1 and datasource2 data in an index2 I can perform multiple search queries using both the index at the same time (since join is a no go).
It's very easy to perform a single query through multiple indices. Answers here

Caching Strategy/Design Pattern for complex queries

We have an existing API with a very simple cache-hit/cache-miss system using Redis. It supports being searched by Key. So a query that translates to the following is easily cached based on it's primary key.
SELECT * FROM [Entities] WHERE PrimaryKeyCol = #p1
Any subsequent requests can lookup the entity in REDIS by it's primary key or fail back to the database, and then populate the cache with that result.
We're in the process of building a new API that will allow searches by a lot more params, will return multiple entries in the results, and will be under fairly high request volume (enough so that it will impact our existing DTU utilization in SQL Azure).
Queries will be searchable by several other terms, Multiple PKs in one search, various other FK lookup columns, LIKE/CONTAINS statements on text etc...
In this scenario, are there any design patterns, or cache strategies that we could consider. Redis doesn't seem to lend itself particularly well to these type of queries. I'm considering simply hashing the query params, and then cache that hash as the key, and the entire result set as the value.
But this feels like a bit of a naive approach given the key-value nature of Redis, and the fact that one entity might be contained within multiple result sets under multiple query hashes.
(For reference, the source of this data is currently SQL Azure, we're using Azure's hosted Redis service. We're also looking at alternative approaches to hitting the DB incl. denormalizing the data, ETLing the data to CosmosDB, hosting the data in Azure Search but there's other implications for doing these including Implementation time, "freshness" of data etc...)
Personally, I wouldn't try and cache the results, just the individual entities. When I've done things like this in the past, I return a list of IDs from live queries, and retrieve individual entities from my cache layer. That way the ID list is always "fresh", and you don't have nasty cache invalidation logic issues.
If you really do have commonly reoccurring searches, you can cache the results (of ids), but you will likely run into issues of pagination and such. Caching query results can be tricky, as you generally need to cache all the results, not just the first "page" worth. This is generally very expensive, and has high transfer costs that exceed the value of the caching.
Additionally, you will absolutely have freshness issues with caching query results. As new records show up, they won't be in the cached list. This is avoided with the entity-only cache, as the list of IDs is always fresh, just the entities themselves can be stale (but that has a much easier cache-expiration methodology).
If you are worried about the staleness of the entities, you can return not only an ID, but also a "Last updated date", which allows you to compare the freshness of each entity to the cache.

Neo4j bulk import and indexing

I'm importing a big dataset (far over 10m nodes) into neo4j using the neo4j-import tool. After importing my data I run several queries over it. One of those queries performs very badly. I optimized it (PROFILING, using relationship types, splitting up for multicore support and so on) as much as I could.
Still it takes too long, so my idea was to tell neo4j to start at a specific type of nodes by using the USING INDEX clause. I then could check how my db hits change and possibly make it work. Right now my database doesn't have indexes though.
I wanted to create indexes when I'm done writing all the queries I need, it seems I need to start using them already though.
I'm wondering if I can create those indexes during the bulk import process. That seems to be a good solution to me. How would I do that?
Also I wonder if it's possible to actually write a statement that would create indexes for an attribute that exists on every single one of my nodes (let's call it "type").
CREATE INDEX ON :(type);
doesn't work (label is missing but I want to omit it)
Indexes are on Labels + Properties. You need indexes right after your import and before you start trying to optimize queries. Anything your query will use to find a starting point should be indexed (user_id, object_id, etc) and probably any dates or properties used for range queries (modified_on, weight, etc).
CREATE INDEX ON :Label(property)
Cypher queries are single threaded so I have no idea what you mean by multi-core support. What did you read about that, got a link? You can multi-thread Neo4j, but at this point you have to do it manually. See https://maxdemarzi.com/2017/01/06/multi-threading-a-traversal/
Most of the time, the queries can be greatly optimized with an index or expressing it differently. But sometimes you need to redo your model to fit the query. Take a look at https://maxdemarzi.com/2015/08/26/modeling-airline-flights-in-neo4j/ for some hints.

What is the most efficient way to filter a search?

I am working with node.js and mongodb.
I am going to have a database setup and use socket.io to have real-time updates that will have the db queried again as well or push the new update to the client.
I am trying to figure out what is the best way to filter the database?
Some more information in regards to what is being queried and what the real time updates are:
A document in the database will include information such as an address, city, time, number of packages, name, price.
Filters include city/price/name/time (meaning only to see addresses within the same city, or within the same time period)
Real-time info: includes adding a new document to the database which will essentially update the admin on the website with a notification of a new address added.
Method 1: Query the db with the filters being searched?
Method 2: Query the db for all searches and then filter it on the client side (Javascript)?
Method 3: Query the db for all searches then store it in localStorage then query localStorage for what the filters are?
Trying to figure out what is the fastest way for the user to filter it?
Also, if it is different than what is the most cost effective way, then the most cost effective as well (which I am assuming is less db queries)...
It's hard to say because we don't see exact conditions of the filter, but in general:
Mongo can use only 1 index in a query condition. Thus whatever fields are covered by this index can be used in an efficient filtering. Otherwise it might do full table scan which is slow. If you are using an index then you are probably doing the most efficient query. (Mongo can still use another index for sorting though).
Sometimes you will be forced to do processing on client side because Mongo can't do what you want or it takes too many queries.
The least efficient option is to store results somewhere just because IO is slow. This would only benefit you if you use them as cache and do not recalculate.
Also consider overhead and latency of networking. If you have to send lots of data back to the client it will be slower. In general Mongo will do better job filtering stuff than you would do on the client.
According to you if you can filter by addresses within time period then you could have an index that cuts down lots of documents. You most likely need a compound index - multiple fields.

solr More than on entity in DataImportHandler

I need to know what is the recommended solution when I want to index my solr data using multiple queries and entities.
I ask because I have to add a new fields into schema.xml configuration. And depends of entity(query) there should be different fields definition.
query_one = "select * from car"
query_two = "select * fromm user"
Tables car and user have differents fields, so I should include this little fact in my schema.xml config (when i will be preparing fields definition).
Maybe someone of you creates a new solr instance for that kind of problem ?
I found something what is call MultiCore. Is it alright solution for my problem ?
Thanks
Solr does not stop you to host multiple entities in a single collection.
You can define the fields for both the entities and have them hosted within the Collection.
You would need to have an identifier to identify the Entities, if you want to filter the results per entity.
If your collections are small or there is a relationship between the User and Car it might be helpful to host them within the same collection
For Solr Multicore Check Answer
Solr Multicore is basically a set up for allowing Solr to host multiple cores.
These Cores which would host a complete different set of unrelated entities.
You can have a separate Core for each table as well.
For e.g. If you have collections for Documents, People, Stocks which are completely unrelated entities you would want to host then in different collections
Multicore setup would allow you to
Host unrelated entities separately so that they don't impact each other
Having a different configuration for each core with different behavior
Performing activities on each core differently (Update data, Load, Reload, Replication)
keep the size of the core in check and configure caching accordingly
Its more a matter of preference and requirements.
The main question for you is whether people will search for cars and users together. If not (they are different domains), you can setup multiple collections/cores. If they are going to be used together (e.g. a search for something that shows up in both cars and people), you may want to merge them into one index.
If you do use single collection for both types, you may want to setup dedicated request handlers returning different sets of fields and possibly tuning the searches. You can see an example of doing that (and a bit more) in the multilingual example from my book.

Categories

Resources