No cache on rust-diesel - caching

I am using MySQL as my database and Diesel to retrieve the data. The data gets updated every second from multiple endpoints. The problem is that when I retrieve the data through Diesel, the results I get are outdated (probably due to a cache on Diesel's side). MySQL runs with SET GLOBAL query_cache_size = 0, so there is no active cache on the DB server side.
Here is the part of my code to retrieve the data:
pub struct Weather {
    pub id: u32,
    pub temperature: f32,
    pub datetime: NaiveDateTime,
}

pub fn get_data() {
    let timewindow = ... // A timewindow I set
    let results = weather
        .filter(datetime.ge(timewindow))
        .load::<Weather>(&db)
        .unwrap();
    println!("{:?}", results);
}
Do you know how I can deactivate the cache on diesel?

Diesel reduces boilerplate for database interactions and offers safe, composable abstractions over queries in Rust. It provides a high-level query builder, and its abstractions are designed to add essentially zero overhead over writing the queries by hand. Diesel has no result cache that could cause outdated reads like this, and it does not support MySQL's SQL_NO_CACHE query option either. The outdated results are more likely caused by something else, such as the query itself or the system setup/design, as also noted in the other comments.
However, if you are using prepared statements, they can improve performance by reusing a statement you prepared earlier (the statement is cached for reuse by future calls to prepare_cached). In that case, you can also set the maximum number of cached prepared statements a particular connection will hold; by default, a connection holds a relatively small number of cached statements.
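To make the statement-cache idea concrete, here is a minimal std-only Rust sketch of a bounded prepared-statement cache. The StatementCache type and its FIFO eviction are illustrative assumptions, not Diesel's (or any driver's) actual internals:

```rust
use std::collections::HashMap;

/// Toy stand-in for a driver-side prepared-statement cache.
/// Real drivers cache handles to server-side prepared statements;
/// here we only cache the SQL text to show the reuse/eviction logic.
pub struct StatementCache {
    capacity: usize,
    statements: HashMap<String, String>,
    insertion_order: Vec<String>, // tracks age for FIFO eviction
}

impl StatementCache {
    pub fn with_capacity(capacity: usize) -> Self {
        StatementCache { capacity, statements: HashMap::new(), insertion_order: Vec::new() }
    }

    /// Returns true if the statement was already prepared (a cache hit).
    pub fn prepare_cached(&mut self, sql: &str) -> bool {
        if self.statements.contains_key(sql) {
            return true; // reuse: no round-trip to re-prepare
        }
        if self.insertion_order.len() == self.capacity {
            // Evict the oldest statement to stay within capacity.
            let oldest = self.insertion_order.remove(0);
            self.statements.remove(&oldest);
        }
        self.statements.insert(sql.to_string(), format!("prepared: {sql}"));
        self.insertion_order.push(sql.to_string());
        false
    }

    pub fn len(&self) -> usize {
        self.statements.len()
    }
}

fn main() {
    let mut cache = StatementCache::with_capacity(2);
    assert!(!cache.prepare_cached("SELECT 1")); // first call prepares
    assert!(cache.prepare_cached("SELECT 1"));  // second call hits the cache
    cache.prepare_cached("SELECT 2");
    cache.prepare_cached("SELECT 3");           // evicts "SELECT 1"
    assert!(!cache.prepare_cached("SELECT 1")); // must be re-prepared
    println!("cache holds {} statements", cache.len());
}
```

The point is only that a hit skips re-preparing the statement, while the capacity bounds how many statements a connection keeps around.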

Related

Retrieving all fields vs only some in graphql, Time Comparison

I am currently working on a project involving GraphQL and I was wondering whether retrieving every element of a given type in a query takes significantly more time than retrieving only some of them, or whether the difference is negligible.
Here is an example:
fragment GlobalProtocolStatsFields on GlobalProtocolStats {
  totalProfiles
  totalBurntProfiles
  totalPosts
  totalMirrors
  totalComments
  totalCollects
  totalFollows
  totalRevenue {
    ...Erc20AmountFields
  }
}
vs
fragment GlobalProtocolStatsFields on GlobalProtocolStats {
  totalProfiles
  totalBurntProfiles
  totalPosts
  totalMirrors
}
Thanks in advance!
The answer highly depends on the implementation on the backend side. Let's look at the three stages the data goes through and how each of them can impact response time.
1. Data fetching from the source
First, the GraphQL server has to fetch the data from the database or a different data source. Some data sources allow you to specify which fields you want to receive. If the GraphQL service is optimised to fetch only the data needed, some time can be saved here. In my experience, it is often not worth doing this, and it is much easier to just fetch all fields that could be needed for an object type. Some GraphQL implementations do this automatically, e.g. Hasura, Postgraphile, or Pothos with the Prisma plugin. What can be more expensive is resolving relationships between entities: often the GraphQL implementation has to do another roundtrip to the data source.
2. Data transformation and business logic
Sometimes, the data has to be transformed before it is returned from the resolver. The resolver model allows this business logic to be called conditionally. Leaving out a field will skip its resolver. In my experience, most business logic is incredibly fast and does not really impact response time.
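The claim that leaving out a field skips its resolver can be sketched with a toy resolver table; the field names and `resolve` helper below are made up for illustration and are not any GraphQL library's API:

```rust
use std::collections::HashMap;

/// Toy resolver table: each field has its own resolver function.
/// Only the fields named in the selection are executed, so the
/// (potentially expensive) resolvers for unselected fields never run.
fn resolve(selection: &[&str]) -> HashMap<String, String> {
    let resolvers: Vec<(&str, fn() -> String)> = vec![
        ("totalProfiles", || "42".to_string()),
        ("totalPosts", || "1337".to_string()),
        // Imagine this one hides heavy business logic:
        ("totalRevenue", || "{ usd: 9000 }".to_string()),
    ];

    resolvers
        .into_iter()
        .filter(|(name, _)| selection.contains(name))
        .map(|(name, f)| (name.to_string(), f())) // resolver runs only here
        .collect()
}

fn main() {
    let partial = resolve(&["totalProfiles", "totalPosts"]);
    assert_eq!(partial.len(), 2);
    assert!(!partial.contains_key("totalRevenue")); // its resolver never ran
}
```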
3. Data serialisation and network
Once all the data is ready on the server side, it has to be serialised to JSON and sent to the client. Serializing large amounts of data can be expensive, especially because GraphQL is hard to serialise in a stream. Sending data to the client can also take a while, if the connection is slow or the data has a large size. This was one of the motivations for GraphQL: Allow the client to select the required fields and reduce unused data transfer.
Summary
As you can see, the response time is mostly related to the amount of data returned from the API and the network connection. Depending on the implementation, real savings are only made on the network, but more advanced implementations can drastically reduce the work done on the server as well.

Specify cache policy for parts of a graphQL query

In Apollo's GraphQL client, there are fetch policies that specify whether a fetch query should obtain data from the server or use the local cache (if any data is available).
In addition, cache normalization allows usage of the cache to cut down on the amount of data that needs to be obtained from the server. For example, if I am requesting object A and object B, but earlier I had requested A and C, then in my current query it will get A from cache, and get B from server.
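The cache-normalization behaviour described here can be modelled in a few lines of std-only Rust; the `NormalizedCache` type below is a toy stand-in for illustration, not Apollo's implementation:

```rust
use std::collections::HashMap;

/// Toy normalized cache: objects are stored by id, independent of the
/// query that fetched them, so later queries can reuse them piecemeal.
pub struct NormalizedCache {
    objects: HashMap<String, String>,
}

impl NormalizedCache {
    pub fn new() -> Self {
        NormalizedCache { objects: HashMap::new() }
    }

    /// Returns (from_cache, from_server) for the requested ids.
    /// Anything missing is "fetched" and then normalized into the cache.
    pub fn query(&mut self, ids: &[&str]) -> (Vec<String>, Vec<String>) {
        let mut from_cache = Vec::new();
        let mut from_server = Vec::new();
        for id in ids {
            if self.objects.contains_key(*id) {
                from_cache.push(id.to_string());
            } else {
                // Simulated network fetch; the result is stored by id.
                self.objects.insert(id.to_string(), format!("data for {id}"));
                from_server.push(id.to_string());
            }
        }
        (from_cache, from_server)
    }
}

fn main() {
    let mut cache = NormalizedCache::new();
    let (hit, miss) = cache.query(&["A", "C"]);
    assert!(hit.is_empty() && miss.len() == 2); // first query: all from server
    let (hit, miss) = cache.query(&["A", "B"]);
    assert_eq!(hit, vec!["A"]);  // A reused from the normalized cache
    assert_eq!(miss, vec!["B"]); // only B goes to the server
}
```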
However, these specify cache policies for the entire query. I want to know if there is a method for specifying TTLs on individual fields.
From a developer standpoint, I want to be able to specify in my query that I want to go to cache for some information that I am requesting, but not others. For example, take the below query:
query PersonInfo($id: String) {
  person(id: $id) {
    birthcertificate // Once this is cached, it is cached forever. I should just always get this info from the cache if it is available.
    age // I want this to have a TTL of a day before invalidating the cached value and going to network
    legalName // I want to always go to network for this information.
  }
}
In other words, for a fixed id value (and assuming this is the only query that touches the person object or its fields):
the first time I make this query, I get all three fields from the server.
now if I make this query again within a few seconds, I should only get the third field (legalName) from the server, and the first two from the cache.
now, if I then wait more than a day and make this query again, I get birthcertificate from the cache, and age + legalName from the server.
Currently, to do this the way I would want to, I end up writing three different queries, one for each TTL. Is there a better way?
Update: there is some progress on cache timing done on the iOS client (https://github.com/apollographql/apollo-ios/issues/142), but nothing specifically on this?
It would be a nice feature, but AFAIK [for now, taking the js/react client; probably the same for ios]:
there is no query normalization, only cache normalization
if any requested field doesn't exist in the cache, the entire query is fetched from the network
no timestamps are stored in cache [normalized] entries (per query/per type)
For now, the [only?] solution is to [save in local state/]store timestamps for each/all/some queries/responses (e.g. in onCompleted) and use them to invalidate/evict entries before fetching. This could probably be automated, e.g. by starting timers within some field policy fn.
You can fetch person data at the start (session), just after login ... any following, more granular person(id: $id) { birthcertificate } query (like in a react subcomponent) can have its "own" 'cache-only' policy. If you always need a fresh legalName, fetch it [separately or not] with a network-only policy.
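The timestamp-based workaround sketched in this answer boils down to storing a write time (and a TTL) next to each cached field. A hedged std-only Rust model follows; `FieldCache` and the per-field TTLs are illustrative, not an Apollo API:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Toy field cache: each field stores a value, the time it was written,
/// and its own TTL. A TTL of None means "cache forever"; a TTL of zero
/// means "always go to the network".
pub struct FieldCache {
    entries: HashMap<String, (String, Instant, Option<Duration>)>,
}

impl FieldCache {
    pub fn new() -> Self {
        FieldCache { entries: HashMap::new() }
    }

    pub fn put(&mut self, field: &str, value: &str, ttl: Option<Duration>) {
        self.entries
            .insert(field.to_string(), (value.to_string(), Instant::now(), ttl));
    }

    /// Returns the cached value only while the field's TTL has not expired.
    pub fn get(&self, field: &str) -> Option<&str> {
        let (value, written, ttl) = self.entries.get(field)?;
        match ttl {
            None => Some(value.as_str()),                              // never expires
            Some(t) if written.elapsed() < *t => Some(value.as_str()), // still fresh
            _ => None,                                                 // expired: refetch
        }
    }
}

fn main() {
    let mut cache = FieldCache::new();
    cache.put("birthcertificate", "cert-123", None);           // forever
    cache.put("age", "30", Some(Duration::from_secs(86_400))); // one day
    cache.put("legalName", "Jane", Some(Duration::ZERO));      // never cached
    assert_eq!(cache.get("birthcertificate"), Some("cert-123"));
    assert_eq!(cache.get("age"), Some("30"));
    assert_eq!(cache.get("legalName"), None); // always forces a network fetch
}
```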

Query result caching in OrientDB and Neo4j

Just out of curiosity: does anybody know whether Neo4j and OrientDB implement caching of query results, that is, storing in a cache a query together with its result, so that subsequent requests for the same query are served without actually computing the result?
Notice that this is different from caching part of the DB since in this case the query would be anyway executed (possibly using only data taken from memory instead of disk).
Starting from release v2.2 (not in SNAPSHOT yet, but it will be RC in a few days), OrientDB supports caching of command results. Caching command results is used by other DBMSs and has proven to dramatically improve the following use cases:
the database is read much more often than it is written
there are a few heavy queries that produce a small result set
you have RAM available to use for caching results
By default, the command cache is disabled. To enable it, set command.cache.enabled=true.
For more information: http://orientdb.com/docs/last/Command-Cache.html.
There are a couple of layers where you can put the caching. You can put it at the highest level behind Varnish ( https://www.varnish-cache.org ) or some other high level cache. You can use a KV store like Redis ( http://redis.io ) and store a result with an expiration. You can also cache within Neo4j using extensions. Both simple things like index look-ups, partial traversals or complete results. See http://maxdemarzi.com/2014/03/23/caching-partial-traversals/ or http://maxdemarzi.com/2015/02/27/caching-immutable-id-lookups-in-neo4j/ for some ideas.
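The "KV store with an expiration" pattern mentioned above reduces to: check the store, else compute and save with a deadline. Here is a std-only Rust sketch, where the HashMap is a stand-in for Redis and `ResultCache` is a made-up type:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Toy result cache with per-entry expiration, mimicking the
/// "store a result in a KV store with an expiration" pattern.
pub struct ResultCache {
    store: HashMap<String, (String, Instant)>, // value + expiry deadline
    pub computations: u32,                     // counts real query executions
}

impl ResultCache {
    pub fn new() -> Self {
        ResultCache { store: HashMap::new(), computations: 0 }
    }

    /// Returns the cached result if its deadline hasn't passed;
    /// otherwise runs the (expensive) query and caches the result.
    pub fn get_or_compute(
        &mut self,
        query: &str,
        ttl: Duration,
        run_query: impl Fn() -> String,
    ) -> String {
        if let Some((value, deadline)) = self.store.get(query) {
            if Instant::now() < *deadline {
                return value.clone(); // served without computing the query
            }
        }
        self.computations += 1;
        let value = run_query();
        self.store
            .insert(query.to_string(), (value.clone(), Instant::now() + ttl));
        value
    }
}

fn main() {
    let mut cache = ResultCache::new();
    let q = "MATCH (n) RETURN count(n)";
    let a = cache.get_or_compute(q, Duration::from_secs(60), || "12345".to_string());
    let b = cache.get_or_compute(q, Duration::from_secs(60), || "12345".to_string());
    assert_eq!(a, b);
    assert_eq!(cache.computations, 1); // second call was served from cache
}
```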

Dumping Azure tables quickly

My task is to dump entire Azure tables with arbitrary unknown schemas. Standard code to do this resembles the following:
TableQuery<DynamicTableEntity> query = new TableQuery<DynamicTableEntity>();
foreach (DynamicTableEntity entity in table.ExecuteQuery(query))
{
    // Write a dump of the entity (row).
}
Depending on the table, this works at a rate of 1000-3000 rows per second on my system. I'm guessing this (lack of) performance has something to do with separate HTTP requests issued to retrieve the data in chunks. Unfortunately, some of the tables are multi-gigabyte in size, so this takes a rather long time.
Is there a good way to parallelize the above or speed it up some other way? It would seem that those HTTP requests could be sent by multiple threads, as in web crawlers and the like. However, I don't see an immediate method to do so.
Unless you know the PartitionKeys of the entities in the table (or some other query criteria that includes the PartitionKey), AFAIK you would need to take the top-down approach you're using right now. For parallel queries to work efficiently, you have to include the PartitionKey in your queries.
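To make the parallel-by-PartitionKey idea concrete, here is a std-only Rust sketch that fans out one worker per known partition; `fetch_partition` merely simulates the per-partition range query that would really be an HTTP request against the table:

```rust
use std::collections::HashMap;
use std::thread;

/// Simulated per-partition fetch; in reality this would issue a
/// range query scoped to "PartitionKey eq '<key>'" over HTTP.
fn fetch_partition(key: &str) -> Vec<String> {
    (0..3).map(|i| format!("{key}/row{i}")).collect()
}

/// Dump several partitions in parallel, one thread per known PartitionKey.
/// This only works because each query is scoped to a single partition.
pub fn dump_partitions(keys: &[&str]) -> HashMap<String, Vec<String>> {
    let handles: Vec<_> = keys
        .iter()
        .map(|k| {
            let key = k.to_string();
            thread::spawn(move || (key.clone(), fetch_partition(&key)))
        })
        .collect();

    // Collect each worker's (key, rows) pair as it finishes.
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    let dump = dump_partitions(&["P1", "P2", "P3"]);
    assert_eq!(dump.len(), 3);
    assert_eq!(dump["P1"].len(), 3);
    println!("dumped {} partitions", dump.len());
}
```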

How to cache queries in EJB and return results efficiently (performance POV)

I use the JBoss EJB 3.0 implementation (JBoss 4.2.3 server).
At the beginning, I created a native query every time, using a construction like
Query query = entityManager.createNativeQuery("select * from _table_");
Of course it is not that efficient; I performed some tests and found out that it really takes a lot of time... Then I found a better way to deal with it: use an annotation to define named native queries:
@NamedNativeQuery(name = "fetchData", query = "select * from _table_", resultClass = Entity.class)
and then just use it
Query query = entityManager.createNamedQuery("fetchData");
the code line above performs two times better than what I started with, but still not as well as I expected... Then I found that I can switch to Hibernate's annotation for NamedNativeQuery (JBoss's EJB implementation is based on Hibernate anyway) and add one more thing:
@NamedNativeQuery(name = "fetchData2", query = "select * from _table_", resultClass = Entity.class, readOnly = true)
readOnly marks whether the results are fetched in read-only mode or not. That sounds good, because in this case of mine I don't need to update the data; I just want to fetch it for a report. When I started the server to measure performance, I noticed that the query without readOnly=true (by default it is false) returned results faster and faster with each iteration, while the other one (fetchData2) performed at a "stable" speed; over time the difference between them got shorter and shorter, and after 5 iterations the speed of both was almost the same...
The questions are:
1) Is there any other way to speed the query up? It seems that named queries should be prepared once, but I can't confirm it... In fact, creating the query once and then just reusing it would be better from a performance point of view, but it is problematic to cache this object, because after creating the query I can set parameters (when I use ":variable" in the query), and that changes the query object (doesn't it?). Well, is there any way to cache them? Or is a named query the best option I can use?
2) Are there any other approaches to make retrieving results faster? For instance, I don't need those entities to be attached and I won't update them; all I need is to fetch a collection of data. Maybe readOnly is the only available way, so I can't speed it up further, but who knows :)
P.S. I am not asking about DB performance; all I need now is to avoid creating the query object every time, to use it efficiently, and to "allow" EJB to do less work while returning the same data.
Added 15.03.2010:
By query I mean the query object (so: how to cache this object for reuse). Caching query results is not a solution for me, because the WHERE clause can be almost unique for each execution due to floating-point parameters. A cache simply will not understand that "a > 50.0001" and "a > 50.00101" can give the same result, but also might not.
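The floating-point objection is easy to demonstrate: a result cache keyed on the literal query text treats near-identical predicates as unrelated entries. A std-only Rust sketch (`QueryCache` and the queries are illustrative):

```rust
use std::collections::HashMap;

/// A result cache keyed by the exact query string.
pub struct QueryCache {
    results: HashMap<String, Vec<u32>>,
}

impl QueryCache {
    pub fn new() -> Self {
        QueryCache { results: HashMap::new() }
    }

    /// Returns true on a cache hit, false when the query had to run.
    pub fn get_or_run(&mut self, sql: &str, run: impl Fn() -> Vec<u32>) -> bool {
        if self.results.contains_key(sql) {
            return true;
        }
        self.results.insert(sql.to_string(), run());
        false
    }
}

fn main() {
    let mut cache = QueryCache::new();
    let rows = || vec![7, 8, 9]; // both predicates happen to match the same rows
    cache.get_or_run("select id from t where a > 50.0001", rows);
    // Nearly the same predicate, same result set: but a different cache key.
    let hit = cache.get_or_run("select id from t where a > 50.00101", rows);
    assert!(!hit); // the cache cannot know the results coincide
}
```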
You could use the second-level cache and the query cache to avoid hitting the database (this works especially well with read-only objects). The second-level cache is supported by Hibernate (with a third-party cache provider), but it is an extension to JPA 1.0.
