Routing Table implementation: linked list vs array

I am trying to implement my own routing protocol on top of the BTLE PHY and link layers to get a multi-hop link over BTLE radio. I am using a Cortex-M0 processor. My routing table structure is basically as follows:
| Neighbour Address | Link Quality Info | Possible Destination Addresses |
The neighbour-address field holds the address of an immediate neighbour, and the possible-destination-addresses field holds the addresses of destinations (within one hop) that can be reached through that particular neighbour (the routing only supports 2-hop communication). In short, the possible destinations are the entries that appear in the neighbour-address field of that neighbour's own table.
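To make the layout concrete, one entry might look something like this in C (the field names, the use of RSSI as the link-quality metric, and the MAX_DESTS cap are just illustrative assumptions, not fixed yet):

    #include <stdint.h>

    #define BLE_ADDR_LEN 6   /* 48-bit BTLE device address */
    #define MAX_DESTS    8   /* assumed cap on 2-hop destinations per neighbour */

    typedef struct {
        uint8_t neighbour_addr[BLE_ADDR_LEN];        /* immediate (1-hop) neighbour      */
        int8_t  link_rssi;                           /* link-quality info, e.g. RSSI     */
        uint8_t dest_count;                          /* entries used in dest_addr below  */
        uint8_t dest_addr[MAX_DESTS][BLE_ADDR_LEN];  /* reachable through this neighbour */
    } route_entry_t;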
I am implementing this in C with the CodeSourcery bare-metal toolchain for ARM. So, for building the routing table, should I use a linked list or an array? Using an array would be easier than implementing a linked list, but then the size of the array is predefined and limited. Also, once initialized, it occupies all the space dedicated to it. Is it actually a good idea to reserve space for the routing table up front so that it will not cause memory problems later? Or should it be a linked list, which is more flexible in how memory is allocated?
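For instance, the fixed-size-array option would mean reserving the whole table up front, something like the following (reusing the route_entry_t sketch above; MAX_NEIGHBOURS is just a guessed cap):

    #define MAX_NEIGHBOURS 16   /* guessed upper bound on 1-hop neighbours */

    static route_entry_t routing_table[MAX_NEIGHBOURS];  /* all RAM reserved at link time */
    static uint8_t       route_count;                    /* entries currently in use      */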

The best approach is to do both. You can allocate the routing-table entries in 'blocks'. Each block contains multiple routing table entries plus a next pointer. After you fill up a block, you allocate the next block and store its address in the previous block's next pointer.
Such a structure gets the best of both worlds: the speed of a simple table scan and the flexibility of a linked list.
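A minimal sketch of that idea, reusing the route_entry_t layout sketched in the question (ENTRIES_PER_BLOCK is an arbitrary choice, and on a Cortex-M0 you would probably carve blocks out of a static pool rather than call malloc()):

    #include <stdint.h>
    #include <stdlib.h>

    #define ENTRIES_PER_BLOCK 4

    typedef struct route_block {
        route_entry_t       entries[ENTRIES_PER_BLOCK];
        uint8_t             used;   /* entries filled in this block   */
        struct route_block *next;   /* NULL until this block fills up */
    } route_block_t;

    /* Return a free slot, chaining a new block onto the list only when needed.
     * head points at a (statically allocated) root block. */
    static route_entry_t *route_table_add(route_block_t *head)
    {
        route_block_t *blk = head;
        while (blk->used == ENTRIES_PER_BLOCK) {
            if (blk->next == NULL) {
                blk->next = calloc(1, sizeof *blk->next);
                if (blk->next == NULL)
                    return NULL;    /* allocation failed */
            }
            blk = blk->next;
        }
        return &blk->entries[blk->used++];
    }

Scanning is then just a nested loop: follow the next pointers from block to block and iterate over the used entries inside each one.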
How large is your network going to be? For very large networks, a simple table will be a performance bottleneck, so you should consider using a hash table of prefixes so that you don't need to traverse the entire table to find a particular neighbor.
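For example, a cheap way to bucket entries by neighbour address might look like the sketch below (the bucket count and the XOR-fold hash are arbitrary, illustrative choices):

    #include <stdint.h>

    #define ROUTE_BUCKETS 16   /* power of two so the mask works */

    /* Map a 6-byte BTLE address to a bucket index. */
    static uint8_t route_bucket(const uint8_t addr[6])
    {
        uint8_t h = 0;
        for (int i = 0; i < 6; i++)
            h ^= addr[i];               /* XOR-fold the 48-bit address */
        return h & (ROUTE_BUCKETS - 1);
    }

Each bucket then holds its own short list (or block chain) of entries, so a lookup only scans the entries that hashed to the same bucket.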

Related

What is Fractal Index (by Tokutek) exactly?

There are three kinds of I/O-optimized, write-optimized data structures (not including LSM) mentioned in connection with the Fractal Index (by Tokutek):
1) Buffered repository trees of any kind. Related publications with the same idea:
http://www.cc.gatech.edu/~bader/COURSES/GATECH/CSE-Algs-Fall2013/papers/Arg03.pdf
http://cs.slu.edu/~goldwasser/publications/SODA2000.pdf
2) COLA (cache-oblivious lookahead array, i.e. with forward pointers):
http://supertech.csail.mit.edu/papers/sbtree.pdf
3) Shuttle trees:
http://supertech.csail.mit.edu/papers/sbtree.pdf
What data structure is actually called a "Fractal Tree Index"?
How exactly are COLAs used in real software? Is a COLA used as a small buffer for a buffered tree, or does it handle terabytes of data in real applications, similar to LSM? Why would someone prefer a COLA over a buffered tree? How does it differ from LSM at terabyte scale?
Speaking of the buffered tree by Lars Arge: as far as I understand, the "buffers" may be stored in external memory and may be as large as the entire RAM; the only requirement is that a buffer fits into memory for sorting before being pushed one level down?
Why would someone prefer such large external-memory "buffers" instead of smaller buffers of size B on every internal node?

How to get all unallocated spaces from extended partition via WMI?

I'm trying to get all unallocated spaces (as a list of <offset,size> pairs) on a disk.
Everything is fine as long as there aren't any extended partitions on the disk: I just list the Win32_DiskPartition instances associated with the selected Win32_DiskDrive and analyze their offsets and sizes to find the gaps between them.
If, however, there is an extended partition, things get complicated: it is like a black box, and the internal partitions aren't among the objects associated with my Win32_DiskDrive. I tried listing the extended partition's associated objects, but there are no "internal" Win32_DiskPartition instances linked to the extended partition, only Win32_LogicalDisk objects, and they don't give me any information about the partitions' actual geometry.
I tried using diskpart for this purpose, but it rounds all partition sizes to GB, and I need them to be exact. Its output is also locale-dependent, which makes it hard to parse (my app needs to be as locale-independent as possible).

max number of couchbase views per bucket

How many views per bucket are too many, assuming a large amount of data in the bucket (>100GB, >100M documents, >12 document types), and assuming each view applies only to one document type? Or, asked another way, at what point should some document types be split into separate buckets to save on the overhead of processing all views on all document types?
I am having a hard time deciding how to split my data into Couchbase buckets, and what the performance implications of the views required on the data are. My data consists of more than a dozen relational DBs, at least half of which have hundreds of millions of rows spread over a number of tables.
The "Using document types" section of http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-writing-bestpractice.html seems to imply that having multiple document types in the same bucket is not ideal, because views on specific document types are updated for all documents, even those that will never match the view. Indeed, it suggests separating data into buckets to avoid this overhead.
Yet there is a limit of 10 buckets per cluster for performance reasons. My only conclusion therefore is that each cluster can handle a maximum of 10 large collections of documents efficiently. Is this accurate?
Tug's advice was right on; allow me to add some perspective as well.
A bucket can be considered most closely related to (though not exactly the same as) a "database instantiation" within the RDBMS world. There will be multiple tables/schemas within that "database", and those can all be combined within a bucket.
Think of a bucket as a logical grouping of data that all shares some common configuration parameters (RAM quota, replica count, etc.); you should only need to split your data into multiple buckets when you need certain datasets to be controlled separately. Other reasons include very different workloads on different datasets, or the desire to track the workload on those datasets separately.
Some examples:
-I want to control the caching behavior for one set of data differently than another. For instance, many customers have a "session" bucket that they want always in RAM whereas they may have a larger, "user profile" bucket that doesn't need all the data cached in RAM. Technically these two data sets could reside in one bucket and allow Couchbase to be intelligent about which data to keep in RAM, but you don't have as much guarantee or control that the session data won't get pushed out...so putting it in its own bucket allows you to enforce that. It also gives you the added benefit of being able to monitor that traffic separately.
-I want some data to be replicated more times than others. While we generally recommend only one replica in most clusters, there are times when our users choose certain datasets that they want replicated an extra time. This can be controlled via separate buckets.
-Along the same lines, I only want some data to be replicated to another cluster/datacenter. This is also controlled per-bucket and so that data could be split to a separate bucket.
-When you have fairly extreme differences in workload (especially around the amount of writes) to a given dataset, it does begin to make sense from a view/index perspective to separate the data into a separate bucket. I mention this because it's true, but I also want to be clear that it is not the common case. You should use this approach after you identify a problem, not before because you think you might.
Regarding this last point, yes, every write to a bucket will be picked up by the indexing engine, but by using document types within the JSON you can abort the processing for a given document very quickly, and it really shouldn't have a detrimental impact to have lots of data coming in that doesn't apply to certain views. If you don't mind, I'm particularly curious which parts of the documentation imply otherwise, since that certainly wasn't our intention.
So in general, we see most deployments with a low number of buckets (2-3) and only a few upwards of 5. Our limit of 10 comes from some known CPU and disk IO overhead of our internal tracking of statistics (the load or lack thereof on a bucket doesn't matter here). We certainly plan to reduce this overhead with future releases, but that still wouldn't change our recommendation of only having a few buckets. The advantages of being able to combine multiple "schemas" into a single logical grouping and apply view/indexes across that still exist regardless.
We are in the process right now of coming up with much more specific guidelines and sizing recommendations (I wrote those first two blogs as a stop-gap until we do).
As an initial approach, you want to try and keep the number of design documents around 4 because by default we process up to 4 in parallel. You can increase this number, but that should be matched by increased CPU and disk IO capacity. You'll then want to keep the number of views within each document relatively low, probably well below 10, since they are each processed in serial.
I recently worked with one user who had a fairly large number of views (around 8 design documents, some with nearly 20 views each) and we were able to bring this down drastically by combining multiple views into one. Obviously it's very application dependent, but you should try to generate multiple different "queries" off of one index. Using reductions, key-prefixing (within the views), and collation, all combined with different range and grouping queries, can make a single index that may appear crowded at first but is actually very flexible.
The fewer design documents and views you have, the less disk space, IO, and CPU you will need. There's never going to be a magic bullet or hard-and-fast guideline number, unfortunately. In the end, YMMV and testing on your own dataset is better than any multi-page response I can write ;-)
Hope that helps, please don't hesitate to reach out to us directly if you have specific questions about your specific use case that you don't want published.
Perry
As you can see from the Couchbase documentation, it is not really possible to provide a "universal" rule that gives you an exact number.
But based on the best-practice document that you have used and some discussion (here), you should be able to design your database/views properly.
Let's start with the last question:
YES, the reason Couchbase advises having a small number of buckets is performance, and more importantly, resource consumption. I invite you to read these blog posts, which help explain what's going on "inside" Couchbase:
Sizing 1: http://blog.couchbase.com/how-many-nodes-part-1-introduction-sizing-couchbase-server-20-cluster
Sizing 2: http://blog.couchbase.com/how-many-nodes-part-2-sizing-couchbase-server-20-cluster
Compaction: http://blog.couchbase.com/compaction-magic-couchbase-server-20
So you will see that most of the "operations" are done per bucket.
So let's now look at the original question:
Yes, most of the time you will organize the design documents and views by document type.
It is NOT a problem to have all the document "types" in a single bucket (or a few); this is in fact the way you work with Couchbase.
The most important things to look at are the size of your documents (to see how long parsing the JSON will take) and how often documents are created/updated, and also deleted, since the JS code of a view is ONLY executed when you create or change a document.
So what you should do:
1 single bucket
how many design documents? (how many types do you have?)
how many views will you have in each design document?
In fact, the most expensive part is not the indexing or querying; it is when you have to rebalance the data and indexes between nodes (adding, removing, or failing nodes).
Finally, though it looks like you already know it, this chapter is quite good for understanding how views work (how the index is created and used):
http://www.couchbase.com/docs/couchbase-manual-2.0/couchbase-views-operation.html
Do not hesitate to add more information if needed.

A better idea/data structure to collect analytics data

I'm collecting analytics data. I'm using a master map that holds many other nested maps.
Considering maps are immutable, many new maps are going to be allocated. (Yes, that is efficient in Clojure).
The basic operation that I'm using is update-in, which is very convenient for updating a value at a given path or creating the binding for a non-existent value.
Once I reach a specific point, I'm going to save that data structure to the database.
What would be a better way to collect this data more efficiently in Clojure? A transient data structure?
As with all optimizations, measure first. If the map update is a bottleneck, then switching to a transient map is a rather unintrusive code change. If you find that GC overhead is the real culprit, as it often is with persistent data structures, and transients don't help enough, then collecting the data into a list and batch-adding it into a transient map, which is made persistent and saved into the DB at the end, may be a more effective, though larger, change. Adding to a list produces very little GC overhead because, unlike adding to a map, the old head does not need to be discarded and GCd.

Storage for Write Once Read Many

I have a list of 1 million digits. Every time the user submits an input, I need to match the input against the list.
As such, the list would have Write Once Read Many (WORM) characteristics?
What would be the best way to implement storage for this data?
I am thinking of several options:
A SQL database, but is it suitable for WORM? (UPDATE: using a VARCHAR field type instead of INT)
One file with the list
A directory structure like /1/2/3/4/5/6/7/8/9/0 (but this one would take up too much space)
A bucket system like /12345/67890/
What do you think?
UPDATE: The application would be a web application.
To answer this question you'll need to think about two things:
Are you trying to minimize storage space, or are you trying to minimize processing time?
Storing the data in memory will give you the fastest processing time, especially if you can optimize the data structure for your most common operation (in this case a lookup) at the cost of memory space. For persistence, you could store the data in a flat file and read it during startup.
SQL Databases are great for storing and reading relational data. For instance storing Names, addresses, and orders can be normalized and stored efficiently. Does a flat list of digits make sense to store in a relational database? For each access you will have a lot of overhead associated with looking up the data. Constructing the query, building the query plan, executing the query plan, etc. Since the data is a flat list, you wouldn't be able to create an effective index (your index would essentially be the values you are storing, which means you would do a table scan for each data access).
Using a directory structure might work, but then your application is no longer portable.
If I were writing the application, I would either load the data during startup from a file and store it in memory in a hash table (which offers constant lookups), or write a simple indexed file accessor class that stores the data in a search optimized order (worst case a flat file).
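A rough sketch of the load-at-startup approach in C, assuming one value per line in a file named numbers.txt and values that fit in an unsigned 64-bit integer (the file name, table size, and hash are illustrative assumptions, not prescriptive):

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define TABLE_SIZE (1u << 21)          /* ~2M slots for ~1M values */

    static unsigned long long table[TABLE_SIZE];
    static bool               used[TABLE_SIZE];

    static size_t slot_for(unsigned long long v)
    {
        /* Fibonacci hashing: keep the top 21 bits of the product. */
        return (size_t)((v * 11400714819323198485ull) >> 43);
    }

    /* Insert with linear probing; re-inserting an existing value is a no-op. */
    static void set_insert(unsigned long long v)
    {
        size_t i = slot_for(v);
        while (used[i] && table[i] != v)
            i = (i + 1) & (TABLE_SIZE - 1);
        table[i] = v;
        used[i]  = true;
    }

    static bool set_contains(unsigned long long v)
    {
        size_t i = slot_for(v);
        while (used[i]) {
            if (table[i] == v)
                return true;
            i = (i + 1) & (TABLE_SIZE - 1);
        }
        return false;
    }

    int main(void)
    {
        FILE *f = fopen("numbers.txt", "r");   /* one number per line */
        if (!f)
            return 1;
        unsigned long long v;
        while (fscanf(f, "%llu", &v) == 1)     /* load everything at startup */
            set_insert(v);
        fclose(f);

        printf("123456 in list: %s\n", set_contains(123456ull) ? "yes" : "no");
        return 0;
    }

After the one-time load, every lookup is a couple of array probes, which is about as fast as a read-mostly workload can get.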
Maybe you are interested in how The Pi Searcher did it. They have 200 million digits to search through, and have published a description on how their indexed searches work.
If you're concerned about speed and don't want to worry about file system storage, SQL is probably your best shot. You can optimize your table indexes, but it will also add another external dependency to your project.
EDIT: It seems MySQL has an ARCHIVE storage engine:
MySQL supports on-the-fly compression since version 5.0 with the ARCHIVE storage engine. Archive is a write-once, read-many storage engine, designed for historical data. It compresses data up to 90%. It does not support indexes. In version 5.1 Archive engine can be used with partitioning.
Two options I would consider:
Serialization - when the memory footprint of your lookup list is acceptable for your application, and the application is persistent (a daemon or server app), create the list once, store it as a binary file, and read that binary file on application startup. Upside - fast lookups. Downside - memory footprint, application initialization time.
SQL storage - when the lookup is amenable to index-based lookup, and you don't want to hold the entire list in memory. Upside - reduced init time, reduced memory footprint. Downside - requires a DBMS (extra app dependency, design expertise); fast, but not as fast as holding the whole list in memory.
If you're concerned about tampering, buy a writable DVD (or a CD if you can find a store which still carries them ...), write the list to it and then put it into a server with only a DVD drive (not a DVD writer/burner). This way, the list can't be modified. Another option would be to buy a USB stick which has a "write protect" switch, but they are hard to come by and the security isn't as good as with a CD/DVD.
Next, write each digit into a file on that disk, one entry per line. When you need to match the numbers, just open the file, read each line, and stop when you find a match. With today's computer speeds and amounts of RAM (and therefore file system cache), this should be fast enough for a once-per-day access pattern.
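As a sketch, that match loop is only a few lines of C (the file name and the maximum line length are assumptions):

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    /* Scan list.txt line by line and stop at the first exact match. */
    bool list_contains(const char *needle)
    {
        FILE *f = fopen("list.txt", "r");
        if (!f)
            return false;

        char line[64];
        bool found = false;
        while (fgets(line, sizeof line, f)) {
            line[strcspn(line, "\r\n")] = '\0';   /* strip the newline */
            if (strcmp(line, needle) == 0) {
                found = true;
                break;
            }
        }
        fclose(f);
        return found;
    }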
Given that 1M numbers is not a huge amount for today's computers, why not just do pretty much the simplest thing that could work: store the numbers in a text file and read them into a hash set on application startup. On my computer, reading 1M numbers from a text file takes under a second, and after that I can do about 13M lookups per second.
