This question is about a Java JTree, a Windows .NET tree (WinForms), or an Adobe Flex Tree.
In a client-server application (for Flex it's Web, really), I have a tree with hierarchical data (in a Windows Explorer type interface). Right now I lazily load up the tree as the user requests more data from the server. This is fine and will work up to about 750K nodes (empirically tested on .Net Winforms and Adobe Flex), but after that it gets sluggish. But the databases grow fast (mostly because users can paste in huge amounts of nodes) and a database of 20 million nodes is not at all unlikely.
Should I be releasing data from the tree when a branch is collapsed so the Garbage Collector can release the memory? This is fine, but what if the users are not efficient and don't collapse branches? Should I do a memory management module that goes around closing branches that haven't been touched in a while?
This all seems like a lot of work just to avoid running out of memory.
Edit: Should I release data on node collapse? If so, when? The weak-object cache idea is good, but should I just continue filling up the UI until it busts (maybe it's not a bad idea)?
If the users don't collapse branches, then I guess they're going to be scrolling through 750K to 20M nodes, right? Seems rather inefficient to me from a user POV. So, the problem may well be self-enforcing.
In most frameworks I've seen, the tree structure itself is fairly efficient, but if you have a non-trivial object at each tree leaf, it quickly adds up.
The easiest approach would be not to store anything at the tree leaves; instead, in the render/draw/update/whatever method, pick your object from a weak-ref cache. If it's not there, load it from the server. The trick is not to keep any other reference to the object, only the weak one in the cache. That way, it will still be available, but collected when necessary.
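A minimal Java sketch of that weak-ref cache idea, under some assumptions: `WeakNodeCache` and the `loader` function (standing in for the server call) are hypothetical names, not part of any tree toolkit. The cache only ever holds the payload through a `WeakReference`, and a `ReferenceQueue` is used to drop map entries whose payload has already been collected.

```java
import java.lang.ref.Reference;
import java.lang.ref.ReferenceQueue;
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Cache that holds node payloads only through weak references. */
class WeakNodeCache<K, V> {
    /** Weak reference that remembers its key so the map entry can be purged. */
    private static final class Ref<K, V> extends WeakReference<V> {
        final K key;

        Ref(K key, V value, ReferenceQueue<V> queue) {
            super(value, queue);
            this.key = key;
        }
    }

    private final Map<K, Ref<K, V>> refs = new HashMap<>();
    private final ReferenceQueue<V> collected = new ReferenceQueue<>();
    private final Function<K, V> loader; // hypothetical: fetches a payload from the server

    WeakNodeCache(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Call from the render/update path; reloads if the payload was collected. */
    synchronized V get(K nodeId) {
        purgeCollected();
        Ref<K, V> ref = refs.get(nodeId);
        V value = (ref == null) ? null : ref.get();
        if (value == null) {
            value = loader.apply(nodeId);
            refs.put(nodeId, new Ref<>(nodeId, value, collected));
        }
        return value;
    }

    /** Removes map entries whose payload the garbage collector has reclaimed. */
    private void purgeCollected() {
        Reference<? extends V> dead;
        while ((dead = collected.poll()) != null) {
            @SuppressWarnings("unchecked")
            Ref<K, V> ref = (Ref<K, V>) dead;
            refs.remove(ref.key, ref); // only if the entry still points at this reference
        }
    }

    public static void main(String[] args) {
        WeakNodeCache<Integer, String> cache =
                new WeakNodeCache<>(id -> "payload for node " + id); // stand-in for a server call
        System.out.println(cache.get(42));
    }
}
```

In Java specifically, weakly referenced objects can be cleared at the next collection, so swapping `WeakReference` for `SoftReference` (cleared in response to memory pressure) may keep more payloads around between repaints. Which one behaves better here is something to measure rather than assume.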
How do I keep in-memory data structures in sync between two processes? Both are instances of the same server process: one is active and the other is a standby. The standby needs to take over if the active one crashes or otherwise fails, and for that to work the in-memory data structures need to be kept in sync. Can I use virtual synchrony? Will it help? If so, is there a library I can use? I am coding in C++ on Windows (Visual Studio).
If that is not a solution, what is a good solution I can refer to?
TIA
The easiest solution to implement is to store the state in a separate database, so that when you fail over, the standby will just continue using the same database. If you are worried about the database crashing, pay the money and complexity required to have main and standby databases, with the database also failing over. This is an attempt to push the complexity of handling state across failovers onto the database. Of course you may find that the overhead of database transactions becomes a bottleneck. It might be tempting to go NoSQL for this, but remember that you are probably relying on the ACID guarantees you get with a traditional database. If you ditch these, typically getting eventual consistency in return, you will have to think about what this means on failover. Will you lose a small amount of recent information on failover? Do you care?
Virtual synchrony looks interesting. I have searched for similar things and found academic pages like http://www.cs.cornell.edu/ken/, some of which, like this, have links to open source software produced by research groups. I have never used them. I seem to remember reports that they worked pretty well for a small number of machines with very good connectivity, but hit performance problems with scale, which I presume won't be a problem for you.
Once upon a time people built multiprocess systems on Unix machines by having the processes communicate via shared memory, or memory mapped files. For very simple data structures, this can be made to work. One problem you have is if one of the processes crashes halfway through modifying the shared data - will this mess up the other processes? You can solve these problems, but you are in danger of discovering that you have implemented everything inside the database that I described in my first paragraph.
You could go for an in-memory database like memcached or Redis.
In my OS X app, I'm using an NSTreeController to keep track of any changes to a document. The tree controller enables versioning by acting as a source control, which means that documents can create their own branches, etc.
It works fine so far. The problem is that every change to the document adds an NSTreeNode to the tree. Which means that after a few hours of use, the tree has accumulated many nodes, which means tons of objects in memory.
Is there a way I can create an NSTreeController with a capacity (like you'd give to an NSArray) which will automatically trim child nodes? If not, what's the best way to manually flush nodes at an appropriate interval so memory usage doesn't bloat?
I've recently come across the phrase "multi-tier cache" relating to multi-tiered architectures, but without a meaningful explanation of what such a cache would be (or how it would be used).
Relevant online searches for that phrase don't really turn up anything either. My interpretation would be a cache servicing all tiers of some n-tier web app. Perhaps a distributed cache with one cache node on each tier.
Has SO ever come across this term before? Am I right? Way off?
I know this is old, but thought I'd toss in my two cents here since I've written several multi-tier caches, or at least several iterations of one.
Consider this: every application will have different layers, and at each layer a different form of information can be cached. Each cache item will generally expire for one of two reasons: either a period of time has elapsed, or a dependency has been updated.
For this explanation, let's imagine that we have three layers:
Templates (object definitions)
Objects (complete object cache)
Blocks (partial objects / block cache)
Each layer depends on its parent, and we would define those using some form of dependency assignment. So Blocks depend on Objects which depend on Templates. If an Object is changed, any dependencies in Block would be expunged and refreshed; if a Template is changed, any Object dependencies would be expunged, in turn expunging any Blocks, and all would be refreshed.
There are several benefits. Long expiry times are a big one: dependencies ensure that downstream resources are updated whenever their parents are updated, so you won't get stale cached resources. Block caches alone are a big help because, short of whole-page caching (which requires AJAX or Edge Side Includes to avoid caching dynamic content), blocks will be the closest elements to an end user's browser / interface and can save boatloads of pre-processing cycles.
The complication in a multi-tier cache like this, though, is that it generally can't rely on purely DB-based foreign key expunging, unless each tier is 1:1 in relation to its parent (i.e. a Block relies on a single Object, which relies on a single Template). You'll have to programmatically address the expunging of dependent resources. You can either do this via stored procedures in the DB, or in your application layer if you want to dynamically work with expunging rules.
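To make the application-layer variant concrete, here is a rough Java sketch of the Templates -> Objects -> Blocks cascade. `DependencyCache` and the key naming scheme are made up for illustration, and the sketch only expunges; the "refresh" part is assumed to happen on the next cache miss. An entry may list several parents, which is the non-1:1 case that foreign keys alone don't cover.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Cache whose entries are expunged when anything they depend on is expunged. */
class DependencyCache {
    private final Map<String, Object> values = new HashMap<>();
    // parent key -> keys that were built from it (an entry may have several parents)
    private final Map<String, Set<String>> dependents = new HashMap<>();

    void put(String key, Object value, String... parents) {
        values.put(key, value);
        for (String parent : parents) {
            dependents.computeIfAbsent(parent, p -> new HashSet<>()).add(key);
        }
    }

    Object get(String key) {
        return values.get(key);
    }

    /** Removes the key and, transitively, everything that depends on it. */
    void expunge(String key) {
        Deque<String> pending = new ArrayDeque<>();
        pending.push(key);
        while (!pending.isEmpty()) {
            String current = pending.pop();
            values.remove(current);
            Set<String> children = dependents.remove(current);
            if (children != null) {
                children.forEach(pending::push);
            }
        }
    }

    public static void main(String[] args) {
        DependencyCache cache = new DependencyCache();
        cache.put("template:article", "template definition");
        cache.put("object:42", "full object", "template:article");
        cache.put("block:42:header", "rendered header", "object:42");

        cache.expunge("template:article"); // cascades to the object and its block
        System.out.println(cache.get("block:42:header")); // null
    }
}
```

Time-based expiry would sit alongside this; the point of the dependency map is only that a long expiry time stays safe because a change to a parent still clears everything built from it.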
Hope that helps someone :)
Edit: I should add, any one of these tiers can be clustered, sharded, or otherwise in a scaled environment, so this model works in both small and large environments.
After playing around with EhCache for a few weeks it is still not perfectly clear what they mean by the term "multi-tier" cache. I will follow up with what I interpret to be the implied meaning; if at any time down the road someone comes along and knows otherwise, please feel free to answer and I'll remove this one.
A multi-tier cache appears to be a replicated and/or distributed cache that lives on 1+ tiers in an n-tier architecture. It allows components on multiple tiers to gain access to the same cache(s). In EhCache, using a replicated or distributed cache architecture in conjunction with simply referring to the same cache servers from multiple tiers achieves this.
For fun I am writing a fastcgi app. Right now all I do is generate a GUID and display it at the top of the page, then make a DB query based on the URL which pulls data from one of my existing sites.
I would like to attempt to cache everything on the page except for the GUID. What is a good way of doing that? I have heard of Redis but never used it. It appears to be a server, which means it runs in a separate process. Perhaps an in-process solution would be faster? (Unless it's not?)
What is a good solution for page caching? (I'm using C++.)
Your implementation sounds like you need a simple key-value caching mechanism, and you could possibly use a container like std::unordered_map from C++11, or its boost cousin, boost::unordered_map. unordered_map provides a hash table implementation. If you needed even higher performance at some point, you could also look at Boost.Intrusive which provides high performance, standard library-compatible containers.
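One way to get "cache everything except the GUID" out of a plain key-value cache is to cache the page with a placeholder where the GUID goes and fill it in on every request. A minimal sketch, written in Java for brevity (the same shape works with `std::unordered_map` in C++); `PageCache`, the `{{GUID}}` token and the `renderer` function are all hypothetical names for illustration:

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Caches rendered pages by URL, leaving a slot for the per-request GUID. */
class PageCache {
    static final String GUID_SLOT = "{{GUID}}"; // hypothetical placeholder token

    private final Map<String, String> pages = new ConcurrentHashMap<>();
    // hypothetical: runs the DB query and builds the page with GUID_SLOT in it
    private final Function<String, String> renderer;

    PageCache(Function<String, String> renderer) {
        this.renderer = renderer;
    }

    /** Returns the cached page for the URL with a fresh GUID substituted in. */
    String serve(String url) {
        String cached = pages.computeIfAbsent(url, renderer); // renders only on a miss
        return cached.replace(GUID_SLOT, UUID.randomUUID().toString());
    }

    public static void main(String[] args) {
        PageCache cache = new PageCache(
                url -> "<h1>" + GUID_SLOT + "</h1><p>data for " + url + "</p>");
        System.out.println(cache.serve("/items/1"));
        System.out.println(cache.serve("/items/1")); // same body, different GUID
    }
}
```

This version never expires anything, so it only makes sense if the underlying data rarely changes or some expiry policy is added on top.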
If you roll your cache with the suggestions mentioned, a second concern will be expiring cache entries, because of the possibility your cached data will grow stale. I don't know what your data is like, but you can choose to implement a caching strategy like any of these:
after a certain time/number of uses, expire a cached entry
after a certain time/number of uses, expire the entire cache (extreme)
least-recently used - there's a Stack Overflow question concerning this: LRU cache design
Multithreaded/concurrent access may also be a concern, though as suggested in the link above, a possibility would be to lock the cache on access rather than worry about granular locking.
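As an illustration of the least-recently-used option combined with coarse locking, here is a small Java sketch (in C++ the usual equivalent is a `std::unordered_map` plus a `std::list` of keys). `LruCache` is a made-up name; it leans on `LinkedHashMap`'s access-order mode and locks the whole cache on every call rather than attempting granular locking.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Fixed-capacity cache that evicts the least-recently-used entry. */
class LruCache<K, V> {
    private final Map<K, V> entries;

    LruCache(int capacity) {
        // accessOrder=true keeps iteration order from least to most recently used
        this.entries = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > capacity; // drop the eldest entry once over capacity
            }
        };
    }

    // One lock for the whole cache: simple, and often good enough.
    synchronized V get(K key) {
        return entries.get(key);
    }

    synchronized void put(K key, V value) {
        entries.put(key, value);
    }

    synchronized int size() {
        return entries.size();
    }

    public static void main(String[] args) {
        LruCache<String, String> cache = new LruCache<>(2);
        cache.put("/a", "page a");
        cache.put("/b", "page b");
        cache.get("/a");           // touch /a so /b becomes the eldest
        cache.put("/c", "page c"); // evicts /b
        System.out.println(cache.get("/b")); // null
    }
}
```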
Now if you're talking about scaling, and moving up to multiple processes, and distributing server processes across multiple physical machines, the simple in-process caching might not be the way to go anymore (everyone could have different copies of data at any given time, inconsistency of performance if some server has cached data but others don't).
That's where Redis/Memcached/Membase/etc. shine - they are built for scaling and for offloading work from a database. They could be beaten out by a database and in-memory cache in performance (there is latency, after all, and a host of other factors), but when it comes to scaling, they are very useful: they save load from a database and can quickly serve requests. They also come with features like cache expiration (implementations differ between them).
Best of all? They're easy to use and drop in. You don't have to choose redis/memcache from the outset, as caching itself is just an optimization, and you can quickly switch the caching code from, say, an in-memory cache of your own to redis or something else.
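Keeping that swap cheap mostly comes down to hiding the cache behind a small interface. A sketch in Java, with hypothetical names throughout (`KeyValueCache`, `InMemoryTtlCache`): the in-process implementation below does simple time-based expiry, and a Redis- or memcached-backed class could later implement the same two methods without touching the calling code.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** The only thing the rest of the app sees; the backend can change later. */
interface KeyValueCache {
    Optional<String> get(String key);

    void put(String key, String value, long ttlMillis);
}

/** In-process implementation with per-entry time-based expiry. */
class InMemoryTtlCache implements KeyValueCache {
    private static final class Entry {
        final String value;
        final long expiresAt;

        Entry(String value, long expiresAt) {
            this.value = value;
            this.expiresAt = expiresAt;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final LongSupplier clock; // injected so expiry can be tested without sleeping

    InMemoryTtlCache(LongSupplier clock) {
        this.clock = clock;
    }

    @Override
    public Optional<String> get(String key) {
        Entry entry = entries.get(key);
        if (entry == null) {
            return Optional.empty();
        }
        if (clock.getAsLong() >= entry.expiresAt) {
            entries.remove(key, entry); // expired: drop it lazily on read
            return Optional.empty();
        }
        return Optional.of(entry.value);
    }

    @Override
    public void put(String key, String value, long ttlMillis) {
        entries.put(key, new Entry(value, clock.getAsLong() + ttlMillis));
    }

    public static void main(String[] args) {
        KeyValueCache cache = new InMemoryTtlCache(System::currentTimeMillis);
        cache.put("/items/1", "<p>rendered page</p>", 60_000);
        System.out.println(cache.get("/items/1").isPresent());
    }
}
```

Expired entries are only removed when they are read again, so a long-running process with many one-off keys would also want a periodic sweep or a size bound.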
There are still some differences between the caching servers though - membase and memcache distribute their data, while redis has master-slave replication.
For the record: I work in a company where we use memcached servers - we have several of them in the data center with the rest of our servers each having something like 16 GB of RAM allocated completely to cache.
edit:
And for speed comparisons, I'll adapt something from a Herb Sutter presentation I watched long ago:
process in-memory -> really fast
getting data from a local process in-memory data -> still really fast
data from local disk -> depends on your I/O device, SSD can be fast, but mechanical drives are glacial
getting data from remote process (in-memory data) -> fast-ish, and your cache servers better be close
getting data from remote process (disk) -> iceberg
Most of the Lucene documentation advises keeping a single instance of the IndexReader and reusing it because of the overhead of opening a new reader.
However, I find it hard to see what this overhead is based on and what influences it.
Related to this: how much overhead does having an open IndexReader actually cause?
The context for this question is:
We currently run a clustered tomcat stack where we do full-text searches from the servlet container.
These searches are done on separate Lucene indexes for each client because each client only searches his own data. Each of these indexes contains anywhere from a few thousand to (currently) about 100,000 documents.
Because of the clustered tomcat nodes, any client can connect on any tomcat node.
Therefore keeping the IndexReader open would actually mean keeping a few thousand IndexReaders open on each tomcat node. This seems like a bad idea; however, constantly reopening them doesn't seem like a very good idea either.
While it's possible for me to change the way we deploy Lucene somewhat, I'd rather not if it isn't needed.
Usually the field cache is the slowest piece of Lucene to warm up, although other things like filters and segment pointers contribute. The specific amount kept in cache will depend on your usage, especially with stuff like how much data is stored (as opposed to just indexed).
You can use whatever memory usage investigation tool is appropriate for your environment to see how much Lucene itself takes up for your application, but keep in mind that "warm up cost" also refers to the various caches that the OS and file system keep open which will probably not appear in top or whatever you use.
You are right that having thousands of indexes is not a common practice. The standard advice is to have them share an index and use filters to ensure that the appropriate results are returned.
Since you are interested in performance, you should keep in mind that having thousands of indices on the server will result in thousands of files strewn all across the disk, which will lead to tons of seek time that wouldn't happen if you just had one big index. Depending on your requirements, this may or may not be an issue.
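If the per-client indexes have to stay, one possible middle ground between thousands of open readers and constant reopening is to keep only the most recently used readers open on each node and close the rest. The sketch below is generic Java, not Lucene API: `ReaderPool` and the `opener` function are hypothetical, and the only assumption about the reader type is that it is `Closeable`.

```java
import java.io.Closeable;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Keeps at most maxOpen readers open, closing the least-recently-used one. */
class ReaderPool<R extends Closeable> {
    private final Map<String, R> open;
    private final Function<String, R> opener; // hypothetical: opens one client's index

    ReaderPool(int maxOpen, Function<String, R> opener) {
        this.opener = opener;
        // accessOrder=true so the eldest entry is the least recently used reader
        this.open = new LinkedHashMap<String, R>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, R> eldest) {
                if (size() <= maxOpen) {
                    return false;
                }
                closeEvicted(eldest.getValue());
                return true;
            }
        };
    }

    /** Returns the open reader for the client, opening it on first use. */
    synchronized R acquire(String clientId) {
        R reader = open.get(clientId);
        if (reader == null) {
            reader = opener.apply(clientId);
            open.put(clientId, reader);
        }
        return reader;
    }

    private static void closeEvicted(Closeable reader) {
        try {
            reader.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        ReaderPool<Closeable> pool = new ReaderPool<Closeable>(2,
                clientId -> () -> System.out.println("closed reader for " + clientId));
        pool.acquire("client-a");
        pool.acquire("client-b");
        pool.acquire("client-c"); // over the limit: client-a is closed and dropped
    }
}
```

As written, an evicted reader is closed even if a search on another thread is still using it, so a real version would need some form of reference counting before closing. The warm-up cost described above is also paid again each time an evicted client comes back.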
As a side note: it sounds like you may be using a networked file system, which is a big performance hit for Lucene.