Enrich each existing value in a cache with the data from another cache in an Ignite cluster - performance

What is the most performant way to update a field of each existing value in an Ignite cache with data from another cache in the same cluster (tens of millions of records, about a kilobyte each)?
Pseudo code:
try (mappings = getCache("mappings")) {
    try (entities = getCache("entities")) {
        entities.forEach((key, entity) ->
            entity.setInternalId(mappings.getValue(entity.getExternalId())));
    }
}

I would advise using compute to send a closure to all the nodes in the cache topology. Then, on each node, you would iterate through the local primary entries and do the updates. Even with this approach you would still be better off batching the updates and issuing them with a putAll call (or maybe using IgniteDataStreamer; see the sketch after the pseudo code below).
NOTE: for the example below, it is important that keys in the "mappings" and "entities" caches are either identical or colocated. More information on affinity collocation is here:
https://apacheignite.readme.io/docs/affinity-collocation
The pseudo code would look something like this:
ClusterGroup cacheNodes = ignite.cluster().forCacheNodes("mappings");
IgniteCompute compute = ignite.compute(cacheNodes);

compute.broadcast(() -> {
    Ignite localIgnite = Ignition.localIgnite();
    IgniteCache<K, V1> mappings = localIgnite.cache("mappings");
    IgniteCache<K, V2> entities = localIgnite.cache("entities");

    // Iterate over local primary entries only.
    entities.localEntries(CachePeekMode.PRIMARY).forEach(entry -> {
        V1 mappingVal = mappings.get(entry.getKey());
        V2 entityVal = entry.getValue();
        V2 newEntityVal = /* do enrichment */;

        // It would be better to collect a batch and then call putAll(...).
        // A simple put is used here for simplicity.
        entities.put(entry.getKey(), newEntityVal);
    });
});
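For the batched variant, here is a minimal sketch using IgniteDataStreamer, with the same placeholder key/value types as the pseudo code above; allowOverwrite(true) is required so the streamer updates existing entries instead of skipping them:

compute.broadcast(() -> {
    Ignite localIgnite = Ignition.localIgnite();
    IgniteCache<K, V1> mappings = localIgnite.cache("mappings");
    IgniteCache<K, V2> entities = localIgnite.cache("entities");

    // The streamer batches updates internally and flushes them in bulk.
    try (IgniteDataStreamer<K, V2> streamer = localIgnite.dataStreamer("entities")) {
        streamer.allowOverwrite(true); // update existing entries instead of ignoring them

        entities.localEntries(CachePeekMode.PRIMARY).forEach(entry -> {
            V1 mappingVal = mappings.get(entry.getKey());
            V2 newEntityVal = /* do enrichment */;
            streamer.addData(entry.getKey(), newEntityVal);
        });
    } // close() flushes any remaining buffered updates
});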

Related

How does Apollo paginated "read" and "merge" work?

I was reading through the docs to learn pagination approaches for Apollo. This is the simple example where they explain the paginated read function:
https://www.apollographql.com/docs/react/pagination/core-api#paginated-read-functions
Here is the relevant code snippet:
const cache = new InMemoryCache({
  typePolicies: {
    Query: {
      fields: {
        feed: {
          read(existing, { args: { offset, limit }}) {
            // A read function should always return undefined if existing is
            // undefined. Returning undefined signals that the field is
            // missing from the cache, which instructs Apollo Client to
            // fetch its value from your GraphQL server.
            return existing && existing.slice(offset, offset + limit);
          },

          // The keyArgs list and merge function are the same as above.
          keyArgs: [],
          merge(existing, incoming, { args: { offset = 0 }}) {
            const merged = existing ? existing.slice(0) : [];
            for (let i = 0; i < incoming.length; ++i) {
              merged[offset + i] = incoming[i];
            }
            return merged;
          },
        },
      },
    },
  },
});
I have one major question about this snippet (and about other snippets from the docs that have the same "flaw" in my eyes), but I feel like I'm missing some piece.
Suppose I run a first query with offset=0 and limit=10. The server will return 10 results, which are stored in the cache after passing through the merge function.
Afterwards, I run the query with offset=5 and limit=10. Based on the approach described in the docs and the code snippet above, my understanding is that I will get only the items from 5 through 10 instead of items from 5 to 15, because Apollo will see that the existing variable is present in read (with existing holding the initial 10 items) and will slice the available 5 items for me.
My question is: what am I missing? How will Apollo know to fetch new data from the server? How will new data arrive in the cache after the initial query? Keep in mind keyArgs is set to [], so the results will always be merged into a single item in the cache.
Apollo will not slice anything automatically. You have to define a merge function that keeps the data in the correct order in the cache. One approach is to keep an array with empty slots for data not yet fetched, and place incoming data at their respective indexes. For instance, if you fetch items 30-40 out of a total of 100, your array would have 30 empty slots, then your items, then 60 empty slots. If you subsequently fetch items 70-80, those will be placed at their respective indexes, and so on.
Your read function is where the decision is made on whether a network request is necessary. If you find all the requested data in existing, you return it and no request to the server is made. If any items are missing, you need to return undefined, which triggers a network request; your merge function is then called once the data is fetched, and finally your read function runs again, only this time the data is in the cache and it can return it.
This approach is for the cache-first fetch policy, which is the default.
The logic for returning undefined from your read function has to be implemented by you. There is no Apollo magic under the hood.
If you use the cache-and-network policy, then your read function doesn't need to return undefined when data is missing, since a network request is made regardless.

MongoTemplate, Criteria and Hashmap

Good morning.
I'm starting to learn some Mongo right now.
I'm facing this problem and starting to wonder whether this is the best approach to solve this "task", or whether it is better to turn around and solve the "problem" another way.
My goal is to iterate over a simple map of keys and arrays of values.
My test map will be received by a REST layer.
{
"1":["1","2","3"]
}
Now, after some logic, I need to use the DAO in order to look into the DB.
The key will be "realm"; the values inside the array are "castles".
Every realm has some castles, and every castle has some "rules".
I need to find every rule for each available combination of realm and castle.
AccessLevel is a POJO annotated with @Document, and it has various params, such as castle and realm (both simple ints).
So the idea is to iterate over the map and build a long query covering every key-value combination.
public AccessLevel searchAccessLevel(Map<String, Integer[]> request) {
    Query q = new Query();
    Criteria c = new Criteria();
    request.forEach((k, v) -> {
        for (int i : Arrays.asList(v)) {
            q.addCriteria(c.andOperator(
                    Criteria.where("realm").is(k),
                    Criteria.where("castle").is(i)));
        }
    });
    List<AccessLevel> response = db.find(q, AccessLevel.class);
    for (AccessLevel x : response) {
        System.out.println(x.toString());
    }
As you can see, I'm facing an error concerning $and:
Due to limitations of the org.bson.Document, you can't add a second '$and' expression specified as [...]
It seems Mongo can't handle multiple $and criteria on the same query, something I'm pretty used to abusing in SQL:
select * from a where id = 1 and id = 2 and id = 3 and id = 4
(not the best, since I could use IN(), but SQL allows it).
So, the point is: can Mongo actually work this way and I just need to dig deeper into the problem, or do I need another approach, like using Criteria.in() and making N queries via MongoTemplate, one for every key in my map?
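For reference, a minimal, untested sketch of that Criteria.in() alternative, reusing the field names and the db MongoTemplate from the snippet above (the searchAccessLevels method name and List return type are my own placeholders): one query per realm, with all of its castles in a single IN clause.

public List<AccessLevel> searchAccessLevels(Map<String, Integer[]> request) {
    List<AccessLevel> response = new ArrayList<>();
    request.forEach((realm, castles) -> {
        // One query per realm key: realm = k AND castle IN (v...).
        Query q = new Query(Criteria.where("realm").is(realm)
                .and("castle").in(Arrays.asList(castles)));
        response.addAll(db.find(q, AccessLevel.class));
    });
    return response;
}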

Spring data Neo4j Affected row count

Considering a Spring Boot, Neo4j environment with Spring Data Neo4j 4, I want to perform a delete and get an error message when it fails to delete anything.
My problem is that since Repository.delete() returns void, I have no idea whether the delete modified anything or not.
First question: is there any way to get the number of rows affected by the last query? For example, in PL/SQL I could use SQL%ROWCOUNT.
So anyway, I tried the following code:
public void deletesomething(Long somethingId) {
    somethingRepository.delete(getExistingsomething(somethingId).getId());
}

private something getExistingsomething(Long somethingId, int depth) {
    return Optional.ofNullable(somethingRepository.findOne(somethingId, depth))
            .orElseThrow(() -> new somethingNotFoundException(somethingId));
}
In the code above I query the database to check that the value exists before I delete it.
Second question: do you recommend a different approach?
So now, just to add some complexity, I have a clustered database where db1 can only create, update and delete, and db2 and db3 can only read (this is ensured by the cluster sockets). db2 and db3 receive the data from db1 through the replication process.
From what I've seen so far, replication can take up to 90s, which means that for up to 90s the databases can be in different states.
Looking again at the code above:
public void deletesomething(Long somethingId) {
    somethingRepository.delete(getExistingsomething(somethingId).getId());
}
in debug that means:
getExistingsomething(somethingId).getId() // will hit db2
somethingRepository.delete(...)           // will hit db1
So if replication has not yet inserted the value into db2, this code will throw the exception.
So the second question becomes: without changing those sockets, is there any way for me to perform the delete and return the correct response?
This is not currently supported in Spring Data Neo4j; if you wish, please open a feature request.
In the meantime, perhaps the easiest workaround is to drop down to the OGM level of abstraction:
Create a class that is injected with org.neo4j.ogm.session.Session.
Use the query(...) method on Session, which returns a Result whose query statistics tell you what was affected.
Example (in Kotlin, which is what was on hand):
fun deleteProfilesByColor(color: String) {
    val query = """
        MATCH (n:Profile {color: {color}})
        DETACH DELETE n;
    """
    val params = mutableMapOf(
        "color" to color
    )

    val result = session.query(query, params)
    val statistics = result.queryStatistics() // Use these!
}
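A rough Java equivalent of the same idea, showing how the statistics answer the original question (the Profile label and color property come from the example above; the exception type is just a placeholder):

import org.neo4j.ogm.model.Result;
import org.neo4j.ogm.session.Session;

import java.util.Collections;

public class ProfileDeleter {

    private final Session session;

    public ProfileDeleter(Session session) {
        this.session = session;
    }

    public void deleteProfilesByColor(String color) {
        String cypher = "MATCH (n:Profile {color: {color}}) DETACH DELETE n";
        Result result = session.query(cypher, Collections.singletonMap("color", color));

        // QueryStatistics reports how many nodes the statement actually deleted.
        int deleted = result.queryStatistics().getNodesDeleted();
        if (deleted == 0) {
            throw new IllegalStateException("No Profile nodes deleted for color " + color);
        }
    }
}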

Getting the objects with similar secondary index in Riak?

Is there a way to get all the objects, in key/value format, that share the same secondary index value? I know we can get the list of keys for one secondary index value (bucket/{{bucketName}}/index/{{index_name}}/{{index_val}}), but my requirements are such that I need the objects themselves. I don't want to perform a separate query for each key to get the object details if there is a way around it.
I am completely new to Riak and I am totally a front-end guy, so please bear with me if something I ask is at a novice level.
In Riak, it's sometimes the case that the better way is to do separate lookups for each key. Coming from other databases this seems strange, and likely inefficient; however, you may find that an index query plus a bunch of single-object gets is faster than a map/reduce that returns all the objects in a single go.
Try both approaches and see which turns out fastest for your dataset. Variables that affect this include: the amount of data being queried, the size of each document, the power of your cluster, and the load the cluster is under.
Python code demonstrating the index and separate gets (if the data you're getting is large, this method can be made memory-efficient on the client, as you don't need to store all the objects in memory):
query = riak_client.index("bucket_name", 'myindex', 1)
query.map("""
    function(v, kd, args) {
        return [v.key];
    }"""
)
results = query.run()

bucket = riak_client.bucket("bucket_name")
for key in results:
    obj = bucket.get(key)
    # .. do something with the object
Python code demonstrating a map/reduce for all objects (returns a list of {key:document} objects):
query = riak_client.index("bucket_name", 'myindex', 1)
query.map("""
    function(v, kd, args) {
        var obj = Riak.mapValuesJson(v)[0];
        return [ {
            'key': v.key,
            'data': obj,
        } ];
    }"""
)
results = query.run()

How to get keySet() and size() for entire GridGain cluster?

GridCache.keySet(), .primarySize(), and .size() only return information for that node.
How do I get this information for the whole cluster?
Scanning the entire cluster "works", but all I need is the keys or the count, not the values.
The problem is that a SQL query works if I want to search on an indexed field, but I can't search on the grid cache entry key itself.
My workaround works, but it is far from elegant or performant:
Set<String> ruleIds = FluentIterable.from(cache.queries().createSqlFieldsQuery("SELECT property FROM YagoRule").execute().get())
.<String>transform((it) -> (String) it.iterator().next()).toSet();
This requires that the key be the same as one of the fields, and the field needs to be indexed for performance reasons.
The next release of GridGain (6.2.0) will have globalSize() and globalPrimarySize() methods, which will ask the cluster for the sizes.
For now you can use the following code:
// Only grab nodes on which cache "mycache" is started.
GridCompute compute = grid.forCache("mycache").compute();

Collection<Integer> res = compute.broadcast(
    // This code will execute on every caching node.
    new GridCallable<Integer>() {
        @Override public Integer call() {
            return grid.cache("mycache").size();
        }
    }
).get();

int sum = 0;

for (Integer i : res)
    sum += i;
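The same broadcast pattern works for the keys, at the cost of shipping every key to the caller, which may be impractical for tens of millions of entries. A rough sketch reusing the grid and compute variables above, and assuming String keys as in the workaround from the question; merging into a Set also removes duplicates reported by backup-holding nodes:

Collection<Set<String>> keySets = compute.broadcast(
    new GridCallable<Set<String>>() {
        @Override public Set<String> call() {
            GridCache<String, Object> cache = grid.cache("mycache");
            // Keys held on this node (primaries and backups).
            return new HashSet<>(cache.keySet());
        }
    }
).get();

Set<String> allKeys = new HashSet<>();

for (Set<String> keys : keySets)
    allKeys.addAll(keys);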
