Intersection of data using CPU Address Bus - cpu

I'm reading a paper, and one part of it has a note about intersecting sets using the address bus. This is the exact quote from the paper:
Fast retrieval methods often rely on intersecting sets of documents
that contain a particular word or feature. Semantic hashing is no
exception. Each of the binary values in the code assigned to a
document represents a set containing about half the entire document
collection. Intersecting such sets would be slow if they were
represented by explicit lists, but all computers come with a special
piece of hardware – the address bus – that can intersect sets in a
single machine instruction. Semantic hashing is simply a way of
mapping the set intersections required for document retrieval directly
onto the available hardware.
I have some basic knowledge of CPU architecture. All I need is an abstract explanation to understand how this operation is done.
P.S. The paper is about the sets, but my question is general (any kind of data).

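Essentially, what he's saying is that you can implement any mapping of input numbers to output numbers in a single instruction if you have enough memory. Simply populate memory with your mapping and then read the address in your mapping corresponding to the input number. In semantic hashing the document's binary code is itself used as the memory address, so one read returns the documents lying in the intersection of all the sets that the code's bits stand for.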
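To make this concrete, here is a minimal Python sketch, assuming 4-bit codes and made-up document names (the paper uses longer codes; table, add, lookup and explicit_intersection are hypothetical names). A list indexed by the code stands in for memory, and indexing it stands in for putting the code on the address bus.

```python
CODE_BITS = 4  # assumed code length, chosen only for illustration

# One slot per possible address; each slot holds the documents with that code.
table = [[] for _ in range(2 ** CODE_BITS)]

def add(doc, code):
    """Store a document at the address given by its binary code."""
    table[code].append(doc)

def lookup(code):
    """One indexed read returns the documents that match every bit of the code."""
    return table[code]

def explicit_intersection(code):
    """The slow way: build one explicit set per bit and intersect them in turn."""
    all_docs = {doc: c for c, bucket in enumerate(table) for doc in bucket}
    result = set(all_docs)
    for bit in range(CODE_BITS):
        wanted = (code >> bit) & 1
        result &= {doc for doc, c in all_docs.items() if (c >> bit) & 1 == wanted}
    return result

add("doc_a", 0b1011)
add("doc_b", 0b1011)
add("doc_c", 0b0011)
```

Both functions give the same documents for a given code, but lookup does it with a single memory access instead of one set intersection per bit.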

Related

Working with a Set that does not fit in memory

Let's say I have a huge list of fixed-length strings, and I want to be able to quickly determine if a new given string is part of this huge list.
If the list remains small enough to fit in memory, I would typically use a set: I would feed it first with the list of strings, and by design, the data structure would allow me to quickly check whether or not a given string is part of the set.
But as far as I can see, the various standard implementations of this data structure store their data in memory, and I already know that the huge list of strings won't fit in memory, and that I'll somehow need to store this list on disk.
I could rely on something like SQLite to store the strings in an indexed table, then query the table to know whether a string is part of the initial set or not. However, using SQLite for this seems unnecessarily heavy to me, as I definitely don't need all the querying features it supports.
Have you guys faced this kind of problem before? Do you know of any library that might be helpful? (I'm quite language-agnostic, feel free to throw whatever you have)
There are multiple solutions to efficiently find if a string is a part of a huge set of strings.
A first solution is to use a trie to make the set much more compact. Many strings will likely share the same prefix, and rewriting it over and over in memory is not space efficient. This may or may not be enough to keep the full set in memory. If not, the root part of the trie can be kept in memory, referencing leaf-like nodes stored on disk. This enables the application to quickly find which part of the leaf-like nodes needs to be loaded, at a relatively small cost. If the number of strings is not too large, most leaf parts of the trie related to a given leaf of the root part can be loaded in one big sequential chunk from the storage device.
Another solution is to use a hash table to find whether a given string exists in the set with low latency (e.g. with only 2 fetches). The idea is to hash the searched string and perform a lookup at a specific item of a big array stored on the storage device. Open addressing can be used to make the structure more compact at the expense of a possibly higher latency, while only 2 fetches are needed with closed addressing (the first gets the location of the item list associated with the given hash and the second gets all the actual items).
One simple way to implement such data structures so that they work on a storage device is to use memory-mapped files. Memory mapping enables you to access data on a storage device transparently, as if it were in memory (whatever the language used). However, the cost of accessing the data is that of the storage device, not that of memory. Thus, the data structure implementation should be adapted to memory mapping for better performance.
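As a rough illustration of the hash-table idea combined with mapped memory, here is a minimal Python sketch of an open-addressing table (linear probing) stored in a file and accessed through the standard mmap module. The record length, table size, CRC32 hash, file name and the function names build and contains are all assumptions made for the example, not a recommendation.

```python
import mmap
import os
import tempfile
import zlib

RECORD = 8                 # assumed fixed string length, in bytes
SLOTS = 64                 # assumed table size; must exceed the number of strings
EMPTY = b"\x00" * RECORD   # assumes no real string consists only of zero bytes

def _first_slot(key):
    return zlib.crc32(key) % SLOTS

def build(path, keys):
    """Create the on-disk table and insert every key with linear probing."""
    with open(path, "wb") as f:
        f.write(EMPTY * SLOTS)
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as table:
        for key in keys:
            i = _first_slot(key)
            for _ in range(SLOTS):
                record = table[i * RECORD:(i + 1) * RECORD]
                if record == EMPTY:
                    table[i * RECORD:(i + 1) * RECORD] = key
                    break
                if record == key:   # already stored
                    break
                i = (i + 1) % SLOTS
            else:
                raise ValueError("table is full")

def contains(path, key):
    """Probe the mapped file; only the touched pages are read from storage."""
    with open(path, "rb") as f, \
            mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as table:
        i = _first_slot(key)
        for _ in range(SLOTS):
            record = table[i * RECORD:(i + 1) * RECORD]
            if record == key:
                return True
            if record == EMPTY:
                return False
            i = (i + 1) % SLOTS
        return False

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "strings.tbl")
    build(path, [b"aaaaaaaa", b"bbbbbbbb", b"cccccccc"])
    found = contains(path, b"bbbbbbbb")
    missing = contains(path, b"zzzzzzzz")
```

A real table would be sized from the expected number of strings and kept mapped across queries rather than reopened for each lookup.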
Finally, you can cache data so that some fetches are much faster. One way to do that is to use Bloom filters. A Bloom filter is a very compact probabilistic hash-based data structure. It can be used to cache data in memory without actually storing any string item. False positive matches are possible, but false negatives are not. Thus, it is good at discarding searched strings that are often not in the set, without the need to do any (slow) fetch on the storage device. A big Bloom filter can provide very good accuracy. This data structure needs to be combined with the ones above if deterministic results are required. LRU/LFU caches might also help, depending on the distribution of the searched items.
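A minimal Bloom filter sketch in Python, to show why false negatives cannot happen. The bit count, the number of hash functions and the use of salted SHA-256 as the hash family are assumptions chosen for brevity; the class and method names are hypothetical.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting one hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(bytes([seed]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(item))
```

A lookup would first call might_contain and only go to the storage device when it returns True, since a True answer may still be a false positive.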

When using 2P CRDT data structures (for example 2P-set), how do you free up space?

A 2P-set allows elements to be removed from a set, but doesn't allow the space those removed elements take up to be freed. In fact, removing an element consumes space rather than freeing it.
What's the algorithm to free up space for 2P structures?
I'm trying to understand which problems I can use CRDT structures for in practice. Without a way to free up space, the 2P CRDT structures seem to have very limited use for real-world tasks.
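For reference, the growth described in the question can be seen in a minimal Python sketch of a 2P-set. This is a simplified state-based version; the class name and the stored helper are hypothetical and exist only to show the space cost.

```python
class TwoPSet:
    def __init__(self):
        self.added = set()
        self.removed = set()   # tombstones: these only ever grow

    def add(self, item):
        self.added.add(item)

    def remove(self, item):
        # Removal records a tombstone instead of deleting anything.
        if item in self.added:
            self.removed.add(item)

    def contains(self, item):
        return item in self.added and item not in self.removed

    def merge(self, other):
        merged = TwoPSet()
        merged.added = self.added | other.added
        merged.removed = self.removed | other.removed
        return merged

    def stored(self):
        """Number of entries kept in memory, including tombstones."""
        return len(self.added) + len(self.removed)
```

After one add and one remove of the same element the set is logically empty, yet stored() reports two entries, and the element can never be added again.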
I cannot speak for 2P-Set, since I still haven't figured out a practical use case for it. However, we can usually apply a few techniques:
Compaction of the metadata used by a CRDT: a lot of CRDTs were initially implemented with a very simple design, and later optimized to meet industry standards. An example is the OR-Set reimplemented on top of dotted version vectors. In this implementation you don't need to keep removed elements in memory: instead we can track added/removed elements using dots that can eventually be compressed into vector clocks. Here I described this problem in more detail.
Pruning can be useful once some of the replicas are no longer needed, e.g. because we reduced the number of nodes or these nodes are no longer available. In such a case, we can merge the payload with metadata as if it had been produced by another replica. Example: given a G-Counter represented by the map {A:1,B:2,C:1} and a dead node B (which can no longer increment its state), we could prune it by merging B's entry into A's, giving {A:3,C:1}, therefore reducing its size while still preserving the correct value. The problem is that the pruning algorithm must guarantee that all replicas converge on this decision independently.
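The G-Counter example can be sketched in Python as follows. The function names value, merge and prune are hypothetical, and the counter is a plain dict from replica id to count; the sketch also shows what goes wrong when a pruned state is merged with one that has not been pruned, which is why all replicas have to agree on the pruning.

```python
def value(counter):
    """The counter's value is the sum of all per-replica entries."""
    return sum(counter.values())

def merge(a, b):
    """Standard G-Counter merge: per-replica maximum."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def prune(counter, dead, heir):
    """Fold the dead replica's entry into the heir's entry (assumes dead != heir)."""
    pruned = dict(counter)
    pruned[heir] = pruned.get(heir, 0) + pruned.pop(dead, 0)
    return pruned

state = {"A": 1, "B": 2, "C": 1}
pruned = prune(state, dead="B", heir="A")   # {"A": 3, "C": 1}, value still 4

# Merging with a replica that has not pruned yet brings B back and
# counts its increments twice: the merged value is 6 instead of 4.
mixed = merge(pruned, state)
```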

How to scale an algorithm/service/system with multiple machines?

I had some interviews recently, and it's quite normal to be asked scaling questions.
For example: you have a long list of words (dict) and a list of characters as inputs; design an algorithm to find the shortest word in dict that contains all the chars in the char list. Then the interviewer asks how to scale your algorithm to multiple machines.
Another example: you have designed a traffic light control system for an intersection in a city. How do you scale this control system to the whole city, which has many intersections?
I never have any idea how to approach this kind of "scale" problem; any suggestions and comments are welcome.
Your first question is completely different from your second question. In fact the control of traffic lights in cities is a local operation. There are boxes nearby that you can tune and optical sensors on top of the lights that detect waiting cars. I guess if you need to optimize for some objective function of flow, you can route information to a server process, and then the question becomes how to scale this server process over multiple machines.
I am no expert in the design of distributed algorithms, which spans a whole field of research. But the questions in undergrad interviews usually are not that specialized. After all they are not interviewing a graduate student specializing in those fields. Take your first question as an example, it is quite generic indeed.
Normally these questions involve multiple data structures (several lists and hashtables) interacting (joining, iterating, etc) to solve a problem. Once you have worked out a basic solution, scaling is basically copying that solution on many machines and running them with partitions of the input at the same time. (Of course, in many cases this is difficult if not impossible, but interview questions won't be that hard)
That is, you have many identical workers splitting the input workload and working at the same time, but those workers are processes on different machines. That brings the problem of communication protocol and network latency etc, but we will ignore these to get to the basics.
The most common way to scale is to let the workers hold copies of the smaller data structures and have them split the larger data structures as workload. In your example (first question), the list of characters is small in size, so you would give each worker a copy of the list, and a portion of the dictionary to work on with the list. Notice that the other way around won't work, because each worker holding a dictionary will consume a large amount of memory in total, and it won't save you anything scaling up.
If your problem gets larger, then you may need more layers of splitting, which also implies you need a way of combining the outputs from the workers taking in the split input. This is the general concept and motivation for the MapReduce framework and its derivatives.
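A minimal single-process Python sketch of this splitting, for the first question. Each call to worker stands in for one machine; the function names and the round-robin partitioning are assumptions, and the char list is treated as a multiset (a char listed twice must appear twice in the word), which is one possible reading of the interview question.

```python
from collections import Counter

def contains_all(word, chars):
    """True if the word has every char at least as often as the list does."""
    need, have = Counter(chars), Counter(word)
    return all(have[c] >= n for c, n in need.items())

def worker(partition, chars):
    """Runs on one machine: shortest matching word in its slice, or None."""
    matches = [w for w in partition if contains_all(w, chars)]
    return min(matches, key=len) if matches else None

def shortest_word(dictionary, chars, num_workers=3):
    # Split the big structure (the dictionary); copy the small one (chars).
    partitions = [dictionary[i::num_workers] for i in range(num_workers)]
    # In a real deployment each call below would run on a different machine.
    partial = [worker(p, chars) for p in partitions]
    # Combine step: pick the shortest among the per-worker results.
    candidates = [w for w in partial if w is not None]
    return min(candidates, key=len) if candidates else None
```

Only one short word per worker travels back over the network, so the combine step stays cheap no matter how large the dictionary is.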
Hope it helps...
For the first question: how to search for words that contain all the chars in the char list, in a way that can run at the same time on different machines (not yet the shortest). I would do it with map-reduce as the base.
First, this problem really can run on different machines at the same time. This is because each word in the database can be checked independently (to check one word you don't have to wait for the previous or the next word, so you can literally send each word to a different computer to be checked).
Using map-reduce, you can map each word as a value and then check whether it contains every char in the char list.
Map(Word, keyout, valueout) {
    // Word comes from the database; keyout & valueout are the input for Reduce
    if (Word contains all chars) {
        sharedOutput(Key, Word) // Basically, you send the word to a shared file.
        // The shared output file should be managed by the framework (e.g. Hadoop)
    }
}
After this Map step has run, all the words you want from the database are located in the shared file. As for the reduce step, you can use a simple step that reduces them based on their length. And ta-da, you get the shortest one.
As for the second question, multithreading comes to mind. The intersections are really problems that are not related to each other. I mean, each intersection has its own timer, right? So to be able to handle tons of intersections, you should use multithreading.
In simple terms, use each core in the processor to control one intersection. Rather than looping through all the intersections one by one, you can allocate them across the cores so that the process is faster.

FPGA logic cells

I have a small presentation about FPGA technology. My question is: if your FPGA has 85k logic cells, does this mean it can run 85k operations simultaneously?
What I am trying to achieve is to shock the audience with some crazy illustrated facts about FPGA technology. The people who will be listening know very little about FPGAs, so I want to impress them.
What's inside a 'cell' can vary per manufacturer, but the Xilinx definition (using this manufacturer as an example, as these are the devices that I'm familiar with) is one four-input look-up table, and one register. Xilinx devices are made up of a number of 'slices', and these contain a number of functional elements. These might include:
Look-up tables
Registers
Multiplexers
Logic for use in carry chains
etc
As an example, a Spartan6 LX4 has 600 slices, and the marketing material claims that this is equivalent to 3840 'logic cells'. You can look in the user guide for a device to determine exactly what is contained inside a slice.
In addition to this, there are other resources such as multipliers, memories, PLLs, etc.
I suppose you could say that one logic cell can perform one operation, but a single cell is only capable of very simple operations, for example an AND gate, 2:1 multiplexer, etc.
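To give a feel for how simple one cell's "operation" is, here is a small Python model of a four-input look-up table: a 16-entry truth table indexed by the input bits. This is only a software illustration of the LUT part of a cell (no register, no carry logic), and make_lut4 and lut_eval are hypothetical names.

```python
def make_lut4(func):
    """Precompute the 16-entry truth table of a 4-input boolean function."""
    return [func((i >> 0) & 1, (i >> 1) & 1, (i >> 2) & 1, (i >> 3) & 1) & 1
            for i in range(16)]

def lut_eval(table, a, b, c, d):
    """Evaluating the LUT is a single table read addressed by the inputs."""
    return table[a | (b << 1) | (c << 2) | (d << 3)]

# The same hardware structure becomes a different gate just by changing
# the stored table.
and4 = make_lut4(lambda a, b, c, d: a & b & c & d)
mux2 = make_lut4(lambda a, b, sel, _unused: b if sel else a)
```

Configuring the FPGA amounts to loading tables like and4 or mux2 into each cell and wiring the cells together.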
I would say no, but it depends on what you mean by an operation. A logic cell has the capability to implement a number of logical functions (and/or/xor), and it has the ability to hold a state with storage elements. These two functions are how every digital system under the sun operates. Even addition and subtraction are higher level constructs built on top of logical functions. As in other answers, FPGA manufacturers publish guides on what is inside of their logic cell. It is this fundamental cell that is stamped repeatedly in the die to create this "array" as in Field Programmable Gate "Array".
This yields a distinctly "more or less" answer. The logic blocks can be used in multiple modes, and you might even be able to pack more than one function into one (including with two independent outputs), but you must also be able to transport meaningful data to work on. It sounds like you have a 7z020 as an example. You may want to note that besides those logic cells, it also has 220 hardware multiply+add blocks. That amount is not random; the surrounding logic is enough to keep them fed in particular cases, every cycle. Looking in the 7 Series FPGAs Configurable Logic Block User Guide (UG474), we find that the Logic Cells number given is an estimate of equivalent 4-LUT+FF configurations. The reason this number is lower than the number of flip-flops (106k) is that the input arguments for the two 5-LUTs you can split a 6-LUT into must overlap.

Techniques for handling arrays whose storage requirements exceed RAM

I am the author of a scientific application that performs calculations on a gridded basis (think finite difference grid computation). Each grid cell is represented by a data object that holds values of state variables and cell-specific constants. Until now, all grid cell objects have been present in RAM at all times during the simulation.
I am running into situations where the people using my code wish to run it with more grid cells than they have available RAM. I am thinking about reworking my code so that information on only a subset of cells is held in RAM at any given time. Unfortunately the grids (or matrices if you prefer) are not sparse, which eliminates a whole class of possible solutions.
Question: I assume that there are libraries out in the wild designed to facilitate this type of data access (i.e. retrieve constants and variables, update variables, store for future reference, wipe memory, move on...) After several hours of searching Google and Stack Overflow, I have found relatively few libraries of this sort.
I am aware of a few options, such as this one from the HSL mathematical library: http://www.hsl.rl.ac.uk/specs/hsl_of01.pdf. I'd prefer to work with something that is open source and is written in Fortran or C. (my code is mostly Fortran 95/2003, with a little C and Python thrown in for good measure!)
I'd appreciate any suggestions regarding available libraries or advice on how to reformulate my problem. Thanks!
Bite the bullet and roll your own?
I deal with too-large data all the time, such as 30,000+ data series of half-hourly data that span decades. Because of the regularity of the data (daylight saving changeovers are a problem, though) it proved quite straightforward to devise a scheme involving a random-access disc file and procedures ReadDay and WriteDay that use a series number and a day number, with further details because series start and stop at different dates. Thus, a day's data in an array might be Array(Run,DayNum) but now is ReturnCode = ReadDay(Run,DayNum,Array) and so forth, the codes indicating presence/absence of that day's data, etc. The key is that a day's data is a convenient size, and a regular (almost) size, and although my program allocates a buffer of one record per series, it runs in ~100MB of memory rather than GB.
Because your array is non-sparse, it is regular. Granted that a grid cell's data are of fixed size, you could devise a random-access disc file with each record holding one cell, or, perhaps a row's worth of cells (or a column's worth of cells) or some worthwhile blob size. I choose to have 4,096 bytes/record as that is the disc file allocation size. Let the computer's operating system and disc storage controller do whatever buffering to real memory they feel up to. Typical execution is restricted to the speed of data transfer however, unless the local data's computation is heavy. Thus, I get cpu use of a few percent until data requests start being satisfied from buffers.
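The record-per-cell scheme can be sketched in a few lines of Python (the same idea carries over to Fortran direct-access files). The record layout of three float64 state variables, the class name CellFile and the file name are assumptions for illustration; the 4,096-byte blocking mentioned above is left out to keep the sketch short.

```python
import os
import struct
import tempfile

CELL = struct.Struct("<3d")   # assumed layout: three float64 values per cell

class CellFile:
    """A 2-D grid of fixed-size records stored in one random-access file."""

    def __init__(self, path, nrows, ncols):
        self.ncols = ncols
        self.f = open(path, "w+b")
        # Reserve space for the whole grid (zero-filled on most systems).
        self.f.truncate(nrows * ncols * CELL.size)

    def _offset(self, i, j):
        return (i * self.ncols + j) * CELL.size

    def read(self, i, j):
        self.f.seek(self._offset(i, j))
        return CELL.unpack(self.f.read(CELL.size))

    def write(self, i, j, values):
        self.f.seek(self._offset(i, j))
        self.f.write(CELL.pack(*values))

    def close(self):
        self.f.close()

with tempfile.TemporaryDirectory() as tmp:
    grid = CellFile(os.path.join(tmp, "grid.dat"), nrows=100, ncols=100)
    grid.write(2, 3, (1.0, 2.0, 3.0))
    cell = grid.read(2, 3)
    grid.close()
```

Only the cells being worked on are ever held in RAM; the operating system's file cache decides which parts of the file stay in memory.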
Because Fortran uses the same syntax for functions as for arrays (unlike, say, Pascal), instead of declaring DIMENSION ARRAY(Big,Big) you would remove that and devise FUNCTION ARRAY(i,j), and all read references in your source file stay as they are. Alas, in the absence of a "palindromic" function declaration, assignments of values to your array will have to be done with a different syntax, so you would devise a subroutine or similar. Possibly a scratchpad array could be collated, worked upon with convenient syntax, and then written back if changed.
