These days I have been experimenting with loadCache, localLoadCache, and querying data from the cache. However, I became more and more puzzled. Here are my questions; please help me if you know how to explain them.
What's the difference between loadCache and localLoadCache?
What is the logic behind how the cache stores data? For example, I start a node called 'A', whose cache stores some data (say 10 items) from the Person table in the DB. Then, in the code, I have it query data from the cache every 5 seconds.
Then I start a new node called 'B', whose cache stores 20 other items from the Person table in the DB, and I also have it query the cache every 5 seconds. Why does querying from either 'A' or 'B' now return 30 items (the sum of 10 + 20)?
And if I have 'B' put a new item into the cache, why can 'A' also query that new data? Then, when I shut down 'B', 'A' returns only 10 items again, the same as at the start. Why?
Ignite is a distributed data storage. It partitions your data set and equally distributes it across available nodes. E.g., if you have 30 entries and 2 nodes, you will have approximately 15 entries on each node. The ownership is defined by Ignite automatically, you can't decide where to store a particular entry (well, you can, but this is non-trivial).
Having said that, when a table is loaded into the cache, it is treated as a single data set. When you get an entry from the cache, it will be transparently returned regardless of where it is stored.
As for the loading, the process is the following:
Each node independently fetches the whole table from the DB and iterates through rows.
For each row the CacheStore implementation creates key and value objects and passes them to the cache.
Cache decides whether this particular key-value pair belongs to the local node. If yes, it is saved. If not, it is discarded.
As a result, the table will be fully stored in the cluster in a distributed fashion, and each node will have its own subset of the data.
The localLoadCache method executes this process on the local node only (useful in some specific cases). loadCache is basically just a shortcut that broadcasts a closure and calls localLoadCache on all nodes, so it triggers the distributed data loading.
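For illustration, here is a minimal sketch of how the two methods are typically invoked (the configuration file name, the cache name "personCache", and the Person value class are assumptions; the cache is assumed to be backed by a CacheStore that reads the Person table):

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteCache;
import org.apache.ignite.Ignition;

public class LoadCacheExample {
    /** Minimal stand-in for the cached value type from the Person table. */
    static class Person { }

    public static void main(String[] args) {
        // Assumed XML config that defines "personCache" with a CacheStore
        // pointing at the Person table.
        Ignite ignite = Ignition.start("ignite-config.xml");
        IgniteCache<Long, Person> cache = ignite.cache("personCache");

        // Distributed load: broadcasts to every node; each node invokes its
        // CacheStore and keeps only the entries that belong to its partitions.
        cache.loadCache(null);

        // Local load: runs CacheStore.loadCache() on this node only.
        cache.localLoadCache(null);
    }
}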
Suppose I run a SQL query and the DB makes use of index structures to arrive at a ROWID (assuming an INDEX SCAN, like in Oracle), and now the DB wants to use it to fetch the actual records.
So how does the ROWID help in fast access of the record? I assume the ROWID must somehow be mapped to the internal record storage.
I understand an index is basically a combination of a B-tree and a doubly linked list. But how are the actual records stored such that a ROWID fetches them fast?
A ROWID is simply a 10-byte physical row address that contains the relative file number, the block number within that file, and the row number within that block. See this succinct explanation:
Oracle FAQs - ROWID
With that information, Oracle can make an I/O read request of a single block by calculating the block offset byte position in the file and the length of the block. Then it can use the block's internal row map to jump directly to the byte offset within the block of the desired row. It doesn't have to scan through anything.
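As a simplified, illustrative calculation (assuming an 8 KB block size): reaching block number 151 of a data file means seeking on the order of 151 × 8192 ≈ 1.2 MB into the file and reading a single 8 KB block, rather than scanning everything that precedes it.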
You can pull a human-readable representation of these three components by using this query against any (heap) table, any row:
SELECT ROWIDTOCHAR(rowid) row_id,
dbms_rowid.rowid_relative_fno(rowid) fno,
dbms_rowid.rowid_block_number(rowid) blockno,
dbms_rowid.rowid_row_number(rowid) rowno
FROM [yourtable]
WHERE ROWNUM = 1 -- pick any row
The fast retrieval is also often aided by the fact that single block reads are frequently bypassed altogether because the block is already in the buffer cache. Or if it is not in Oracle's buffer cache, a single block read from many cooked filesystems, unless disabled by the setting of filesystemio_options, may be cached at the OS level and never go to storage. And if you use a storage appliance, it probably has its own caching mechanism. All of these caching mechanisms likely give caching preference to small reads over large ones, so single block reads from Oracle are likely to avoid hitting magnetic disk altogether, more so than multiblock reads associated with table scans.
But be careful: just because ROWID is the fastest way to retrieve a single row does not mean it is the fastest way to retrieve many rows. Because of the overhead of a read call, many single-block calls accumulate a lot of wasted overhead. When pulling large numbers of rows it is frequently more efficient to do a full table scan, especially when Oracle uses direct path reads to do so, than to use ROWIDs either manually or via indexes.
Within my product I use Elasticsearch for storing CDRs (call them txn logs, if you will). My transactions are asynchronous and happen at a very high rate, i.e. around 5000 txns/sec. A transaction involves submitting a request to a network entity, and later, at some other point in time, I receive the response.
The data ingestion technique into ES earlier involved a two-phase operation: 1) add an entry into ES as soon as I submit to the network layer; 2) when I get the response, update the previous entry with additional status such as 'delivery succeeded'.
I am doing this with the bulk insertion method, in which the bulk request contains both inserts and updates. As a result the ingestion is very slow, which ended up hogging/halting my application. Later, we changed the ingestion technique so that we only insert into Elasticsearch when we get the final response; until then we store the data in a Redis store. But this has the disadvantages of data loss and non-realtime reports.
So I was looking at options such as having two indexes for the same record: the parent index would have all the data, and the child record would have the delivery status. I don't know if this is possible. I studied nested queries and has-child/has-parent queries. What I am unsure of is: can I insert the parent and child data at separate points in time, without having to use update? Or should I just create two different records with a common txn-id, without worrying about parent/child?
What is the best way?
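For reference, the two-phase approach described above looks roughly like the following sketch (a hedged illustration assuming the Elasticsearch 7.x Java high-level REST client; the index name "cdr", the document id, and the field names are made up, and both phases are shown in one bulk request only for brevity; in reality they arrive in different batches):

import org.apache.http.HttpHost;
import org.elasticsearch.action.bulk.BulkRequest;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.update.UpdateRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class CdrBulkSketch {
    public static void main(String[] args) throws Exception {
        try (RestHighLevelClient client = new RestHighLevelClient(
                RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            BulkRequest bulk = new BulkRequest();

            // Phase 1: index the CDR as soon as the request is submitted.
            bulk.add(new IndexRequest("cdr").id("txn-1001")
                    .source(XContentType.JSON,
                            "status", "SUBMITTED",
                            "submittedAt", System.currentTimeMillis()));

            // Phase 2: when the asynchronous response arrives, update the same
            // document with the delivery status. Mixing these updates into the
            // bulk stream is what makes ingestion expensive at high rates.
            bulk.add(new UpdateRequest("cdr", "txn-1001")
                    .doc(XContentType.JSON,
                            "status", "DELIVERED",
                            "deliveredAt", System.currentTimeMillis()));

            client.bulk(bulk, RequestOptions.DEFAULT);
        }
    }
}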
I have a use case where I receive some data quite frequently that I need to cache with Infinispan (in a replicated cluster in library mode/in-process). The data is often very similar, but the number of different keys is much larger than the number of possible distinct associated values.
I am worried about the number of data objects being created and replicated unnecessarily, since they are mostly duplicates of each other under different keys.
Is my only option to break my cache into two? e.g.
key -> data hash
data hash -> data
My only problem with this is the possibility of the key -> data hash entry being replicated to the rest of the cluster before the data hash -> data cache. I need the data to be there by the time the key is replicated (as I handle that event).
Or are there any other options available such as intercepting a cache insert to use a pool of these data objects?
There is no feature that would let you de-duplicate the data, so yes, you need to break that into two caches. You could write your own interceptor, but there be lions. Would you iterate through all local entries to find a match?
If you use a non-transactional cache with synchronous replication, you can simply update the dataId -> data cache first, and then the key -> dataId cache. By the time the second operation is invoked, the first write has already been replicated to all nodes.
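A minimal sketch of that write ordering, assuming embedded (library-mode) Infinispan with both caches configured for non-transactional synchronous replication; the cache names dataById and keyToDataId and the content-hash id scheme are illustrative assumptions:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

import org.infinispan.Cache;
import org.infinispan.configuration.cache.CacheMode;
import org.infinispan.configuration.cache.Configuration;
import org.infinispan.configuration.cache.ConfigurationBuilder;
import org.infinispan.configuration.global.GlobalConfigurationBuilder;
import org.infinispan.manager.DefaultCacheManager;

public class DedupCacheSketch {
    public static void main(String[] args) throws Exception {
        // Clustered cache manager; both caches use non-transactional synchronous replication.
        DefaultCacheManager manager =
                new DefaultCacheManager(GlobalConfigurationBuilder.defaultClusteredBuilder().build());
        Configuration replSync =
                new ConfigurationBuilder().clustering().cacheMode(CacheMode.REPL_SYNC).build();
        manager.defineConfiguration("dataById", replSync);
        manager.defineConfiguration("keyToDataId", replSync);

        Cache<String, byte[]> dataById = manager.getCache("dataById");
        Cache<String, String> keyToDataId = manager.getCache("keyToDataId");

        byte[] payload = "frequently repeated payload".getBytes(StandardCharsets.UTF_8);
        String key = "some-key-42";

        // Content-based id, so identical payloads collapse to a single cached entry.
        String dataId = HexFormat.of().formatHex(
                MessageDigest.getInstance("SHA-256").digest(payload));

        // 1) Replicate the payload first; with REPL_SYNC this call returns only
        //    after every node has the value.
        dataById.putIfAbsent(dataId, payload);

        // 2) Only then publish the key -> dataId mapping, so a listener on another
        //    node can already resolve dataId when it sees this key arrive.
        keyToDataId.put(key, dataId);

        manager.stop();
    }
}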
My issue is how to update the cache with new entries from a database table.
My cache holds my Cassandra table data, say up until 3 p.m.
Up to that time the user has purchased 3 items, so my cache has 3 entries associated with that user.
But what if, some time later (say 30 minutes), the user purchases 2 more items?
Since I already have 3 entries in the cache, it won't query the database, so how do I get those 2 new entries when calculating the final bill?
One option I have is to call cache.loadCache(null, null) every 15 minutes, but calling it every time like that doesn't seem feasible.
The better option here is to insert the data not directly into Cassandra, but through Ignite. This gives you always up-to-date data in the cache without running any additional synchronization with the DB.
But if you choose to run loadCache each time, you can add a timestamp to your objects in the DB and implement your own CacheStore with the additional ability to load only the new data from the DB. Here is a link to the documentation; it will help you implement your own CacheStore.
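A hedged sketch of that idea: a custom CacheStore whose loadCache() accepts a "load entries newer than" timestamp argument. The Person value type, the table and column names, and the JDBC data source are illustrative assumptions (the same pattern applies to a Cassandra driver):

import java.io.Serializable;
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import javax.cache.Cache;
import javax.cache.integration.CacheLoaderException;
import javax.sql.DataSource;

import org.apache.ignite.cache.store.CacheStoreAdapter;
import org.apache.ignite.lang.IgniteBiInClosure;

public class PersonCacheStore extends CacheStoreAdapter<Long, PersonCacheStore.Person> {
    private final DataSource dataSource; // configured/injected elsewhere

    public PersonCacheStore(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    /** Loads only rows updated after the timestamp passed via cache.loadCache(null, since). */
    @Override
    public void loadCache(IgniteBiInClosure<Long, Person> clo, Object... args) {
        long since = (args != null && args.length > 0) ? (Long) args[0] : 0L;
        try (Connection conn = dataSource.getConnection();
             PreparedStatement st = conn.prepareStatement(
                     "SELECT id, name FROM person WHERE updated_at > ?")) {
            st.setLong(1, since);
            try (ResultSet rs = st.executeQuery()) {
                while (rs.next())
                    clo.apply(rs.getLong("id"), new Person(rs.getString("name")));
            }
        } catch (SQLException e) {
            throw new CacheLoaderException("Failed to load new entries", e);
        }
    }

    // Read/write-through methods reduced to stubs to keep the sketch short.
    @Override public Person load(Long key) { return null; }
    @Override public void write(Cache.Entry<? extends Long, ? extends Person> entry) { /* no-op */ }
    @Override public void delete(Object key) { /* no-op */ }

    /** Minimal stand-in for the cached value type. */
    public static class Person implements Serializable {
        final String name;
        Person(String name) { this.name = name; }
    }
}

With this in place, calling cache.loadCache(null, lastSyncTimestamp) on a schedule pulls only the rows added or changed since the previous run, instead of reloading the whole table.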
I maintain an application which leverages JCS to hold a cache in a JVM (JVM1). This data is loaded from a database for the first time when the JVM gets started/restarted.
However, the database is also accessed from a different JVM (JVM2), which adds data to the database.
In order to make sure these additional/newly added records are loaded into the cache, we currently need to restart JVM1 for every addition to the database.
Is there a way to refresh/load the cache (only for newly added records) in JVM1 at regular intervals (instead of frequent DB polling)?
Thanks,
Jaya Krishna
Can you not simply have JVM1 first check the in-memory cache, and then, if the item is absent from the in-memory cache, check the database?
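A minimal read-through sketch of that idea, assuming JCS 3.x (the org.apache.commons.jcs3 packages) and a cache region named "items"; the Item record and ItemDao interface are hypothetical placeholders for your entity and database-access code:

import java.io.Serializable;

import org.apache.commons.jcs3.JCS;
import org.apache.commons.jcs3.access.CacheAccess;

public class ItemRepository {

    /** Hypothetical cached entity. */
    public record Item(Long id, String name) implements Serializable { }

    /** Hypothetical DAO backed by the database that JVM2 also writes to. */
    public interface ItemDao {
        Item loadById(Long id);
    }

    private final CacheAccess<Long, Item> cache = JCS.getInstance("items");
    private final ItemDao dao;

    public ItemRepository(ItemDao dao) {
        this.dao = dao;
    }

    public Item find(Long id) {
        Item item = cache.get(id);       // 1) try the in-memory JCS cache first
        if (item == null) {
            item = dao.loadById(id);     // 2) fall back to the database
            if (item != null) {
                cache.put(id, item);     // 3) populate the cache for next time
            }
        }
        return item;
    }
}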
If you, however, need to list all items of some certain type that exist, and don't want to access the database, then for JVM1 to know that there are new items in the database, I suppose that either 1) JVM2 would have to send a network message to JVM1 telling it that there are new entries in the database, or 2) there could be a database trigger that fires when new data is inserted and sends a network message to JVM1 (but having the database send network messages to an application server feels rather weird, I think). Either way, these approaches seem rather complicated.
Have you considered some kind of new-item-ids table that logs the IDs of items recently inserted into the database? It could be updated by a database trigger, or by JVM1 and JVM2 when they write to the database. Then JVM1 would only need to poll this single table, perhaps once per second, to get a list of new IDs, and then it could load the new items from the database.
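A rough sketch of such a poller, assuming a JCS region named "items" (holding just the item name string here for brevity), a hypothetical new_item_ids(id, item_id, item_name) table, and a plain JDBC DataSource:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import javax.sql.DataSource;

import org.apache.commons.jcs3.JCS;
import org.apache.commons.jcs3.access.CacheAccess;

public class NewItemPoller {
    private final DataSource dataSource;                          // connection to the shared DB
    private final CacheAccess<Long, String> cache = JCS.getInstance("items");
    private long lastSeenId = 0L;                                  // highest log id already processed

    public NewItemPoller(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    public void start() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(this::pollOnce, 1, 1, TimeUnit.SECONDS);
    }

    private void pollOnce() {
        try (Connection conn = dataSource.getConnection();
             PreparedStatement st = conn.prepareStatement(
                     "SELECT id, item_id, item_name FROM new_item_ids WHERE id > ? ORDER BY id")) {
            st.setLong(1, lastSeenId);
            try (ResultSet rs = st.executeQuery()) {
                while (rs.next()) {
                    lastSeenId = rs.getLong("id");
                    cache.put(rs.getLong("item_id"), rs.getString("item_name"));
                }
            }
        } catch (Exception e) {
            e.printStackTrace();                                   // keep polling on the next tick
        }
    }
}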
Finally, have you considered a distributed cache? Then JVM1 and JVM2 would share the same cache, and both would write items to this cache when they insert them into the database. (This approach would be somewhat similar to sending network messages between JVM1 and JVM2, but the distributed cache system would send the messages itself, so you wouldn't need to write any new code.)