I need to search for a specific record in a large file. The search will be performed on a microprocessor (ESP8266), so I'm working with limited storage and RAM.
The list looks like this:
BSSID,data1,data2
001122334455,float,float
001122334466,float,float
...
I was thinking using an index to speed up the search. The data are static, and the index will be built on a computer and then loaded onto the microcontroller.
What I've done so far is very simplistic.
I created an index of the first byte of the BSSID and points at the first and last values with that BSSID prefix.
The performance is terrible, but the index file is very small and uses very little RAM. I though to go further with this method, taking a look at the first two bytes, but the index table will be 256 times larger, resulting in a table 1/3 the size of the data file.
This is the index with the first method:
00,0000000000,0000139984
02,0000139984,0000150388
04,0000150388,0000158812
06,0000158812,0000160900
08,0000160900,0000171160
What indexing algorithm do you suggest that I use?
EDIT:Sorry I didn't include enough background before.I'm storing the data and index file on the flash memory of the chip. I have at the moment 30000 records, but this number could potentially grow until the chips momery limit is hit. The set is indeed static when is stored on the microcontroller but could be updated in a second moment with the help of a computer.The data isn't spread simmetrically between indexes.My goal is to find a good compromise between search speed, index size and RAM used.
I'm not sure where you're stuck, but I can comment on what you've done so far.
Most of all, the way to determine the "best" method is to
define "best" for your purposes;
research indexing algorithms (basic ones have been published for over 50 years);
choose a handful to implement;
Evaluate those implementations according to your definition of "best".
Keep in mind your basic resource restriction: you have limited RAM. If method requires more RAM than you have, it doesn't work, and is therefore infinitely slower than any method that does work.
You've come close to a critical idea, however: you want your index table to expand to consume any free RAM, using that space as effectively as possible. If you can index 16 bits instead of 8 and still fit the table comfortably into your available space, then you've cut down your linear search time by roughly a factor of 256.
Indexing considerations
Don't put the ending value in each row: it's identical to the starting value in the next row. Omit that, and you save one word in each row of the table, giving you twice the table room.
Will you get better performance if you slice the file into equal parts (same quantity of BSSIDS for each row of your table), and then store the entire starting BSSID with its record number? If your BSSIDs are heavily clumped, this might improve your overall processing, even though your table had fewer rows. You can't use a direct index in this case; you have to search the first column to get the proper starting point.
Does that move you toward a good solution?
Not sure how much memory you got (I am not familiar with that MCU) but do not forget that these tables are static/constant so they can be stored in EEPROM instead of RAM some chips have quite a lot of EEPROM usually way more than RAM...
Assume your file is sorted by the index. So You you got (assuming 32bit address) per each entry:
BYTE ix, DWORD beg,DWORD end
Why not this:
struct entry { DWORD beg,end };
entry ix0[256];
Where the first BYTE is also address in index array. This will spare 1 Byte per entry
Now as Prune suggested you can ignore the end address as you will scan the following entries in file anyway until you hit the correct index or index with different first BYTE. so yo can use:
DWORD ix[256];
where yo have only start address beg.
Now we do not know how many entries you actually have nor how many entries will share the same second BYTE of index. So we can not do any further assumption to improve...
You wanted to do something like:
DWORD ix[65536];
But have not enough memory for it ... how about doing something like this instead:
const N=1024; // number of entries you can store
const dix=(max_index_value+1)/N;
const ix[N]={.....};
so each entry ix[i] will cover all the indexes from i*dix to ((i+1)*dix)-1. So to find index you do this:
i = ix[index/dix];
for (;i<file_size;)
{
read entry from file at i-th position;
update position i;
if (file_index==index) { do your stuff; break; }
if (file_index> index) { index not found; break; }
}
To improve performance you can rewrite this linear scan into binary search between address of ix[index/dix] and ix[(index/dix)+1] or file size for the last index ... assuming each entry in file has the same size ...
Related
I have a question on the utility of slices in Go. I have just seen Why are lists used infrequently in Go? and Why use arrays instead of slices? but had some question which I did not see answered there.
In my application:
I read a CSV file containing approx 10 million records, with 23 columns per record.
For each record, I create a struct and put it into a linked list.
Once all records have been read, the rest of the application logic works with this linked list (the processing logic itself is not relevant for this question).
The reason I prefer a list and not a slice is due to the large amount of contiguous memory an array/slice would need. Also, since I don't know the size of the exact number of records in the file upfront, I can't specify the array size upfront (I know Go can dynamically re-dimension the slice/array as needed, but this seems terribly inefficient for such a large set of data).
Every Go tutorial or article I read seems to suggest that I should use slices and not lists (as a slice can do everything a list can, but do it better somehow). However, I don't see how or why a slice would be more helpful for what I need? Any ideas from anyone?
... approx 10 million records, with 23 columns per record ... The reason I prefer a list and not a slice is due to the large amount of contiguous memory an array/slice would need.
This contiguous memory is its own benefit as well as its own drawback. Let's consider both parts.
(Note that it is also possible to use a hybrid approach: a list of chunks. This seems unlikely to be very worthwhile here though.)
Also, since I don't know the size of the exact number of records in the file upfront, I can't specify the array size upfront (I know Go can dynamically re-dimension the slice/array as needed, but this seems terribly inefficient for such a large set of data).
Clearly, if there are n records, and you allocate and fill in each one once (using a list), this is O(n).
If you use a slice, and allocate a single extra slice entry every time, you start with none, grow it to size 1, then copy the 1 to a new array of size 2 and fill in item #2, grow it to size 3 and fill in item #3, and so on. The first of the n entities is copied n times, the second is copied n-1 times, and so on, for n(n+1)/2 = O(n2) copies. But if you use a multiplicative expansion technique—which Go's append implementation does—this drops to O(log n) copies. Each one does copy more bytes though. It ends up being O(n), amortized (see Why do dynamic arrays have to geometrically increase their capacity to gain O(1) amortized push_back time complexity?).
The space used with the slice is obviously O(n). The space used for the linked list approach is O(n) as well (though the records now require at least one forward pointer so you need some extra space per record).
So in terms of the time needed to construct the data, and the space needed to hold the data, it's O(n) either way. You end up with the same total memory requirement. The main difference, at first glace anyway, is that the linked-list approach doesn't require contiguous memory.
So: What do we lose when using contiguous memory, and what do we gain?
What we lose
The thing we lose is obvious. If we already have fragmented memory regions, we might not be able to get a contiguous block of the right size. That is, given:
used: 1 MB (starting at base, ending at base+1M)
free: 1 MB (starting at +1M, ending at +2M)
used: 1 MB (etc)
free: 1 MB
used: 1 MB
free: 1 MB
we have a total of 6 MB, 3 used and 3 free. We can allocate 3 1 MB blocks, but we can't allocate one 3 MB block unless we can somehow compact the three "used" regions.
Since Go programs tend to run in virtual memory on large-memory-space machines (virtual sizes of 64 GB or more), this tends not to be a big problem. Of course everyone's situation differs, so if you really are VM-constrained, that's a real concern. (Other languages have compacting GC to deal with this, and a future Go implementation could at least in theory use a compacting GC.)
What we gain
The first gain is also obvious: we don't need pointers in each record. This saves some space—the exact amount depends on the size of the pointers, whether we're using singly linked lists, and so on. Let's just assume 2 8 byte pointers, or 16 bytes per record. Multiply by 10 million records and we're looking pretty good here: we've saved 160 MBytes. (Go's container/list implementation uses a doubly linked list, and on a 64 bit machine, this is the size of the per-element threading needed.)
We gain something less obvious at first, though, and it's huge. Because Go is a garbage-collected language, every pointer is something the GC must examine at various times. The slice approach has zero extra pointers per record; the linked-list approach has two. That means that the GC system can avoid examining the nonexistent 20 million pointers (in the 10 million records).
Conclusion
There are times to use container/list. If your algorithm really calls for a list and is significantly clearer that way, do it that way, unless and until it proves to be a problem in practice. Or, if you have items that can be on some collection of lists—items that are actually shared, but some of them are on the X list and some are on the Y list and some are on both—this calls for a list-style container. But if there's an easy way to express something as either a list or a slice, go for the slice version first. Because slices are built into Go, you also get the type safety / clarity mentioned in the first link (Why are lists used infrequently in Go?).
Let us consider the alternative {Index, Tag, Offset}. The usage and the size of each of the field remain the same, e.g. index is used to locate a block in cache, and its bit-length is still determined by the number of cache blocks. The only difference is that we now uses the MSB bits for index, the middle portion for tag, and the last portion for offset.
What do you think is the shortcoming of this scheme?
This will work — and if the cache is fully associative this won't matter (as there is no index, its all tag), but if the associativity is limited it will make (far) less effective use of the cache memory. Why?
Consider an object, that is sufficiently large to cross a cache block boundary.
When accessing the object, the address of some fields vs. the other fields will not be in the same cache block. How will the cache behave?
When the index is in the middle, then the cache block/line index will change, allowing the cache to store different nearby entities even with limited associativity.
When the index is at the beginning (most significant bytes), the tag will have changed between these two addresses, but the index will be the same — thus, there will be an collision at the index, which will use up one of the ways of the set-associativity. If the cache were direct mapped (i.e. 1-way set associative), it could thrash badly on repeated access to the same object.
Let's pretend that we have 12-bit address space, and the index, tag, and offset are each 4 bits.
Let's consider an object of four 32-bit integer fields, and that the object is at location 0x248 so that two integer fields, a, b, are at 0x248 and 0x24c and two other integer fields, c, d, are at 0x250 and 0x254.
Consider what happens when we access either a or b followed by c or d followed by a or b again.
If the tag is the high order hex digit, then the cache index (in the middle) goes from 4 to 5, meaning that even in an direct mapped cache both the a&b fields and the c&d fields can be in the cache at the same time.
For the same access pattern, if the tag is the middle hex digit and the index the high hex digit, then the cache index doesn't change — it stays at 2. Thus, on a 1-way set associative cache, accessing fields a or b followed by c or d will evict the a&b fields, which will result in a miss if/when a or b are accessed later.
So, it really depends on access patterns, but one thing that makes a cache really effective is when the program accesses either something it accessed before or something in the same block as it accessed before. This happens as we manipulate individual objects, and as we allocate objects that end up being adjacent, and as we repeat accesses to an array (e.g. 2nd loop over an array).
If the index is in the middle, we get more variation as we use different addresses of within some block or chunk or area of memory — in our 12-bit address space example, the index changes every 16 bytes, and adjacent block of 16 bytes can be stored in the cache.
But if the index is at the beginning we need to consume more memory before we get to a different index — the index changes only every 256 bytes, so two adjacent 16-byte blocks will often have collisions.
Our programs and compilers are generally written assuming locality is favored by the cache — and this means that the index should be in the middle and the tag in the high position.
Both tag/index position options offer good locality for addresses in the same block, but one favors adjacent addresses in different blocks more than the other.
Suppose I have an int array of 10 elements. With a 64 byte cacheline, it can hold 16 array elements from arr[0] to arr[15].
I would like to know what happens when you fetch, for example, arr[5] from the L1 cache into a register. How does this operation take place? Can the cpu pick an offset into a cacheline and read the next n bytes?
The cache will usually provide the full line (64B in this case), and a separate component in the MMU would rotate and cut the result (usually some barrel shifter), according to the requested offset and size. You would usually also get some error checks (if the cache supports ECC mechanisms) along the way.
Note that caches are often organized in banks, so a read may have to fetch bytes from multiple locations. By providing a full line, the cache can construct the bytes in proper order first (and perform the checks), before letting the MMU pick the relevant part.
Some designs focusing on power saving may decide to implement lower granularity, but this is often only adding complexity as you may have to deal with more cases of line segments being split.
A quick question to make sure I understand the concept behind a "block" and its usage with caches.
If I have a small cache that holds 4 blocks of 4 words each. Let's say its also directly mapped. If I try to access a word at memory address 2, would the block that contains words 0-3 be brought into the first block position of the cache or would it bring in words 2-5 instead?
I guess my question is how "blocks" exist in memory. When a value is accessed and a cache miss is trigger, does the CPU load one block's worth of data (4 words) starting at the accessed value in memory or does it calculate what block that word in memory is in and brings that block instead.
If this question is hard to understand, I can provide diagrams to what I'm trying to explain.
Usually caches are organized into "cache lines" (or, as you put it, blocks). The contents of the cache need to be associatively addressed, ie, accessed by using some portion of the requested address (ie "lookup table key" if you will). If the cache uses a block size of 1 word, the entire address -- all N bits of it -- would be the "key". Each word would be accessible with the granularity just described.
However, this associative key matching process is very hardware intensive, and is the bottleneck in both design complexity (gates used) and speed (if you want to use fewer gates, you take a speed hit in the tradeoff). Certainly, at some point, you cannot minimize gate usage by trading off for speed (delay in accessing the desired element), because a cache's whole purpose is to be FAST!
So, the tradeoff is done a little differently. The cache is organized into blocks (cache "lines" or "rows"). Each block usually starts at some 2^N aligned boundary corresponding to the cache line size. For example, for a cache line of 128 bytes, the cache line key address will always have 0's in the bottom seven bits (2^7 = 128). This effectively eliminates 7 bits from the address match complexity we just mentioned earlier. On the other hand, the cache will read the entire cache line into the cache memory whenever any part of that cache line is "needed" due to a "cache miss" -- the address "key" is not found in the associative memory.
Now, it seems like, if you needed byte 126 in a 128-byte cache line, you'd be twiddling your thumbs for quite a while, waiting for that cache block to be read in. To accomodate that situation, the cache fill can take place starting with the "critical cache address" -- the word that the processor needs to complete the current fetch cycle. This allows the CPU to go on its merry way very quickly, while the cache control unit proceeds onward -- usually by reading data word by word in a modulo N fashion (where N is the cache line size) into the cache memory.
The old MPC5200 PowerPC data book gives a pretty good description of this kind of critical word cache fill ordering. I'm sure it's used elsewhere as well.
HTH... JoGusto.
Is there any algorithms to sort data from serial input using buffer which is smaller than data length?
For example, I have 100 bytes of serial data, which can be read only once, and 40 bytes buffer. And I need to print out sorted bytes.
I need it in Javascript, but any general ideas are appreciated.
This kind of sorting is not possible in a single pass.
Using your example: suppose you have filled your 40 byte buffer, so you need to start printing out bytes in order to make room for the next one. In order to print out sorted data, you must print the smallest byte first. However, if the smallest byte has not been read, you can't possibly print it out yet!
The closest relevant fit to your question may be external sorting algorithms, which take multiple passes in order to sort data that can't fit into memory. That is, if you have peripherals that can store the output of a processing pass, you can sort data larger than your memory in O(log(N/M)) passes, where N is the size of the problem, and M is the size of your memory.
The classic storage peripheral for external sorting is the tape drive; however, the same algorithms work for disk drives (of whatever kind). Also, as cache hierarchies grow in depth, the principles of external sorting become more relevant even for in-memory sorts -- try taking a look at cache-oblivious algorithms.