How are null values stored in MonetDB - monetdb

I have a column of 4-byte integers that contains a lot of NULLs. I would like to know how null values are represented in storage. Do they use up 4 bytes as well, or are they handled in a way that doesn't waste space?

It depends on the data type. For example, in the case of numerical types (int, float, etc.), nulls are represented as the minimal value of the type. As a result, no extra space is wasted, but one cannot use the smallest possible value of that type.
For other types, such as boolean columns, some extra space is used, since a single bit is not sufficient to represent true, false and null. (It's not a qubit ;) )
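A minimal sketch of that sentinel scheme in Python (illustrative only; the INT_NIL constant here stands in for MonetDB's internal nil value for 4-byte integers):

# Sentinel representation: NULL stored as the type's minimum value,
# occupying the same 4 bytes as any other int (not MonetDB source code).
INT_NIL = -2**31  # minimum of a signed 32-bit integer, reserved for NULL

def is_null(stored):
    return stored == INT_NIL

column = [42, INT_NIL, 7]  # the NULL costs no extra space...
print([None if is_null(v) else v for v in column])  # [42, None, 7]
# ...but -2**31 itself can no longer be stored as a regular value.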
You can find more information here: https://www.monetdb.org/Documentation/Manuals/SQLreference/BuiltinTypes
This link provides more developer-targeted info (with more details): https://www.monetdb.org/wiki/MonetDB_type_system

Related

Any reference to definition or use of the data structuring technique "hash linking"?

I would like more information about a data structure - or perhaps it is better described as a data structuring technique - that was called hash linking when I read about it in an IBM Research Report a long time ago - in the 70s or early 80s. (The RR may have been from the 60s.)
The idea was to be able to (more) compactly store a table (array, vector) of values when most values fit in a (relatively) small compact range but some values (may) have had unusually large (or small) values out of that range. Instead of making each element of the table wider to hold the entire range you would store, in the table, only those values that fit in the small compact range and put all other entries that didn't fit into a hash table.
One use case I remember being mentioned was for bank accounts - you might determine that 98% of the accounts in your bank had balances under $10,000.00, so they would nicely fit in a 6-digit (decimal) field. To handle the very few accounts at $10,000.00 or over, you would hash-link them.
There were two ways to arrange it: Both involved a table (array, vector, whatever) where each entry would have enough space to fit the 95-99% case of your data values, and a hash table where you would put the ones that didn't fit, as a key-value pair (key was index into table, value was the item value) where the value field could really fit the entire range of the values.
The first way: you would pick a sentinel value, depending on your data type. It might be 0, or it might be the largest representable value. If the value you were trying to store didn't fit the table, you'd stick the sentinel in there and put the (index, actual value) pair into the hash table. To retrieve, you'd get the value by its index, check if it was the sentinel, and if it was, look it up in the hash table.
The second way: you might have no reasonable sentinel value. No problem - you just store the exceptional values in your hash table, and on retrieval you always look in the hash table first. If the index you're trying to fetch isn't there, you're good: just get it out of the table itself.
Benefit was said to be saving a lot of storage while only increasing access time by a small constant factor in either case (due to the properties of a hash table).
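For concreteness, here is a minimal Python sketch of the sentinel arrangement (all names, and the narrow 8-bit field, are illustrative assumptions, not from the original report):

SENTINEL = 255  # largest value of the narrow 8-bit field, reserved

class HashLinkedTable:
    def __init__(self, size):
        self.small = [0] * size  # narrow fixed-width entries (0..254 fit)
        self.overflow = {}       # hash table: index -> full-width value

    def set(self, i, value):
        if 0 <= value < SENTINEL:
            self.small[i] = value
        else:
            self.small[i] = SENTINEL  # mark entry as "look elsewhere"
            self.overflow[i] = value  # exceptional value goes to the hash table

    def get(self, i):
        v = self.small[i]
        return self.overflow[i] if v == SENTINEL else v

t = HashLinkedTable(10)
t.set(3, 100)      # fits in the narrow field
t.set(4, 1000000)  # too big: hash-linked
print(t.get(3), t.get(4))  # 100 1000000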
(A related technique is to work it the other way around if most of your values were a single value and only a few were not that value: Keep a fast searchable table of index-value pairs of the ones that were not the special value and a set of the indexes of the ones that were the very-much-most-common-value. Advantage would be that the set would use less storage: it wouldn't actually have to store the value, only the indexes. But I don't remember if that was described in this report or I read about that elsewhere.)
The answer I'm looking for is a pointer to the original IBM report (though my search on the IBM research site turned up nothing), or to any other information describing this technique or using it to do anything. Or maybe it is a known technique under a different name; that would be good to know too!
Reason I'm asking: I'm using the technique now and I'd like to credit it properly.
N.B.: This is not a question about:
anything related to hash tables as hash tables, especially not linking entries or buckets in hash tables via pointer chains (which is why I specifically did not add the tag hashtable),
an "anchor hash link" - using a # in a URL to point to an anchor tag - which is what "hash link" gets you when you search for it on the intertubes,
hash consing which is a different way to save space, for much different use cases.
Full disclosure: There's a chance it wasn't in fact an IBM report where I read it. During the 70s and 80s I was reading a lot of TRs from IBM and other corporate labs, and MIT, CMU, Stanford and other university departments. It was definitely in a TR (not a journal or ACM SIG publication) and I'm nearly 100% sure it was IBM (I've got this image in my head ...) but maybe, just maybe, it wasn't ...

Why are multiple data types not possible in a single array? - datastructure

In general, I am aware that multiple data types in a single array are typically not allowed in a strongly typed language like Java.
I would like to know why, technically, multiple data types in a single array are not possible in terms of the data structure (how elements are stored in memory).
Do different data types hold different numbers of bits? Is that why indexing elements would be a problem for a multi-type array?
Strongly typed languages lay out the address space at compile time. In strongly typed languages like C and Java, the compiler must know each data type so that it can allocate the address space; it is all compiler dependent. All assignments, whether explicit or via parameter passing in method calls, are checked for type compatibility. There are no automatic corrections or conversions of conflicting types as in some languages.
Thanks,
Ram
Very good question.
An array generally means you can access any element by its index in constant time, i.e. O(1).
In C/C++, if you create an array of 200 32-bit integers, they are all stored in consecutive memory locations. So if you want to access the 10th element, you can compute its memory address as the base address of the array + 4 * 9, then read 4 bytes (32 bits) at that address as your 10th integer. This is what a[9] does.
Suppose you allowed storage of elements of different types in an array: you wouldn't be able to do this address jumping, because different types of items have different sizes and you wouldn't know what to multiply by.
This breaks the promise that items can be found in O(1).
That is why we don't put items of different types directly in an array, even though it is definitely possible to do (e.g., by storing uniformly sized pointers or tagged values).
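As a small illustration of the same fixed-width offset arithmetic in Python, using the standard array and struct modules (the 4-byte int size is platform-typical, an assumption rather than a guarantee):

from array import array
import struct

a = array('i', range(200))  # 200 signed C ints in one contiguous buffer
assert a.itemsize == 4      # 4 bytes per element on typical platforms
offset = 9 * a.itemsize     # element 9 lives at base + 9 * itemsize
tenth = struct.unpack_from('i', a.tobytes(), offset)[0]
print(tenth == a[9])        # True: both read the same 4 bytes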

Does adding an explicit column of SHA-256 hashes of a CLOB field improve the search (exact match) performance on that CLOB field?

We have a requirement to implement a table (probably an Oracle or MS SQL Server table) as follows:
One column stores a string value whose length is highly variable, typically from several bytes to 500 megabytes (occasionally beyond 1 gigabyte).
Based on the above, we decided to use the CLOB type in the database (using the file system is not an option for us).
The table is very large, up to several million records.
One of the most frequent and important operations against this table is searching records by this CLOB column, where the search string must EXACTLY match the CLOB column value.
My question is whether, besides adding an index on the CLOB column, we need to do any particular optimisation to improve the search performance.
One of my team members suggested adding an extra column holding the SHA-256 hash of the CLOB column, and searching by this hash value instead of the CLOB column. In his opinion, the grounds for doing so are that hash values have a fixed length rather than a variable one, so that indexing on them makes the search faster.
However, I don't think this makes a big difference, because if adding an explicit hash improved search performance, the database should be intelligent enough to do it on its own, likely storing the hash value in some hidden place of the database system. Why should we developers bother to do it explicitly? On the other hand, this hash value can theoretically produce collisions, although rarely.
The only benefit I can imagine is that when the client side of the database searches with a very large keyword, you can reduce the network round trip by hashing this large value down to a short one, so the network transfer is faster.
So, any database gurus, please shed light on this question. Many thanks!
Regular indexes don't work on CLOB columns. Instead you would need to create an Oracle Text index, which is primarily for full-text searching of keywords/phrases, rather than exact whole-value matching.
In contrast, by computing a hash of the column data, you can create an index on the hash value, since it's short enough to fit in a standard VARCHAR2 or RAW column. Such a hash can significantly reduce your search space when trying to find exact matches.
Further, your concern over hash collisions, while not unfounded, can be mitigated. First off, hash collisions are relatively rare, and when they do occur the documents are unlikely to be very similar, so a direct text comparison can be used whenever a collision is detected. Alternatively, because of the way hash functions work - small changes to the original document result in significant changes to the hash value, and the same change to different documents affects the hash value differently - you could compute a secondary hash of a subset (or superset) of the original text to act as a collision-avoidance mechanism.
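To make the pattern concrete, here is a hedged Python sketch with SQLite and hashlib standing in for Oracle/MSSQL (syntax and indexing details differ there): index the fixed-length digest, then re-compare the body to rule out collisions.

import hashlib
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT, body_sha256 TEXT)')
conn.execute('CREATE INDEX idx_docs_sha ON docs (body_sha256)')  # short, fixed-width key

def insert(body):
    digest = hashlib.sha256(body.encode()).hexdigest()
    conn.execute('INSERT INTO docs (body, body_sha256) VALUES (?, ?)', (body, digest))

def find_exact(needle):
    digest = hashlib.sha256(needle.encode()).hexdigest()
    # the index narrows candidates to hash matches; re-compare to rule out collisions
    rows = conn.execute('SELECT id, body FROM docs WHERE body_sha256 = ?', (digest,))
    return [rid for rid, body in rows if body == needle]

insert('hello world' * 1000)
insert('something else')
print(find_exact('hello world' * 1000))  # [1]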

Efficient mapping from 2^24 values to a 2^7 index

I have a data structure that stores, among other things, a 24-bit wide value. I have a lot of these objects.
To minimize storage cost, I calculated the 2^7 most important values out of the 2^24 possible values and stored them in a static array. Thus I only have to save a 7-bit index to that array in my data structure.
The problem is: I get these 24-bit values and I have to convert them to my 7-bit index on the fly (no preprocessing possible). The computation is basically a search for which one of the 2^7 values fits best. Obviously, this takes some time for a big number of objects.
An obvious solution would be to create a simple mapping array of bytes with the length 2^24. But this would take 16 MB of RAM. Too much.
One observation of the 16 MB array: On average 31 consecutive values are the same. Unfortunately there are also a number of consecutive values that are different.
How would you implement this conversion from a 24-bit value to a 7-bit index saving as much CPU and memory as possible?
Hard to say without knowing the definition of "best fit". Perhaps a kd-tree would allow a suitable search based on proximity by some metric or other, so that you quickly rule out most candidates and only have to actually test a few of the 2^7 to see which is best?
This sounds similar to the problem that an image processor has when reducing to a smaller colour palette. I don't actually know what algorithms/structures are used for that, but I'm sure they're look-up-able, and might help.
As an idea...
Up the index table to 8 bits, then XOR all 3 bytes of the 24-bit word into it.
Then your table would consist of this 8-bit hash value, plus the index back to the original 24-bit value.
Since your data is RGB-like, a more sophisticated hashing method may be needed.
bit24var & 0xff gives you the rightmost byte.
(bit24var >> 8) & 0xff gives you the one beside it.
(bit24var >> 16) & 0xff gives you the one beside that.
Yes, you are thinking correctly. It is quite likely that two or more of the 24-bit values will hash to the same index, due to the pigeonhole principle.
One method of resolving a hash clash is to use some sort of chaining.
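A minimal Python sketch of that 8-bit XOR hash with chained clash resolution (names are illustrative):

def hash8(v24):
    # XOR all 3 bytes of the 24-bit value into one 8-bit hash
    return (v24 ^ (v24 >> 8) ^ (v24 >> 16)) & 0xff

buckets = [[] for _ in range(256)]  # hash -> chain of (24-bit value, 7-bit index)

def insert(value24, index7):
    buckets[hash8(value24)].append((value24, index7))

def lookup(value24):
    for v, idx in buckets[hash8(value24)]:  # walk the chain on a clash
        if v == value24:
            return idx
    return None

insert(0xABCDEF, 5)
print(lookup(0xABCDEF))  # 5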
Another idea would be to put your important values in a different array, then simply search it first. If you don't find an acceptable answer there, then you can, shudder, search the larger array.
How many of the 2^24 values do you actually have? Could you sort these values and count them by counting runs of consecutive identical values?
Since you already know which of the 2^24 values you need to keep (i.e. the 2^7 values you have determined to be important), we can simply filter the incoming data and assign a value, starting from 0 and going up to 2^7-1, to these values as we encounter them. Of course, we need some way of keeping track of which of the important values we have already seen and assigned a label in [0, 2^7). For that we can use some sort of tree- or hashtable-based dictionary implementation (e.g. std::map in C++, HashMap or TreeMap in Java, or dict in Python).
The code might look something like this (I'm using a much smaller range of values):
import random

def make_mapping(data, important):
    mapping = dict()  # dictionary to hold the final mapping
    next_index = 0    # the next free label that can be assigned to an incoming value
    for elem in data:
        if elem in important:        # check that the element is important
            if elem not in mapping:  # check that it hasn't been assigned a label yet
                mapping[elem] = next_index
                next_index += 1      # the next new important value gets the next label
    return mapping

if __name__ == '__main__':
    important_values = [1, 5, 200000, 6, 24, 33]
    data = list(range(0, 300000))  # list() so the sequence can be shuffled in place
    random.shuffle(data)
    answer = make_mapping(data, important_values)
    print(answer)
You can make the search much faster by using a hash- or tree-based set data structure for the set of important values. That would make the entire procedure O(n*log(k)) (or O(n) if it is a hashtable), where n is the size of the input and k is the number of important values.
Another idea is to represent the 24BitValue array in a bit map. A nice unsigned char can hold 8 bits, so one would need 2^21 array elements. That's 2,097,152 bytes (2 MB). If the corresponding bit is set, then you know that that specific 24BitValue is present in the array and needs to be checked.
One would need an iterator, to walk through the array and find the next set bit. Some machines actually provide a "find first bit" operation in their instruction set.
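A minimal Python sketch of the bit map (2^24 bits packed into 2^21 bytes, matching the math above):

bitmap = bytearray(2**21)  # one bit per possible 24-bit value, 2 MB total

def mark(value24):
    bitmap[value24 >> 3] |= 1 << (value24 & 7)

def present(value24):
    return bool(bitmap[value24 >> 3] & (1 << (value24 & 7)))

mark(0xABCDEF)
print(present(0xABCDEF), present(0x123456))  # True False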
Good luck on your quest.
Let us know how things turn out.
Evil.

Does setting "NOT NULL" on a column in postgresql increase performance?

I know this is a good idea in MySQL. If I recall correctly, in MySQL it allows indexes to work more efficiently.
Setting NOT NULL has no effect per se on performance. A few cycles for the check - irrelevant.
But you can improve performance by actually using NULLs instead of dummy values. Depending on data types, you can save a lot of disk space and RAM, thereby speeding up ... everything.
The null bitmap is only allocated if there are any NULL values in the row. It's one bit for every column in the row (NULL or not). For tables up to 8 columns the null bitmap is effectively completely free, using a spare byte between tuple header and row data. After that, space is allocated in multiples of MAXALIGN (typically 8 bytes, covering 64 columns). The difference is lost to padding. So you pay the full (low!) price for the first NULL value in each row. Additional NULL values can only save space.
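As a back-of-envelope sketch of that allocation rule (plain arithmetic following the description above, with MAXALIGN assumed to be 8 bytes; this is not PostgreSQL's actual code):

MAXALIGN = 8  # typical; platform dependent

def null_bitmap_bytes(ncols, row_has_null):
    # no bitmap at all if the row contains no NULLs
    if not row_has_null:
        return 0
    # up to 8 columns: the bitmap fits in the spare byte, effectively free
    if ncols <= 8:
        return 0
    # beyond that: one bit per column, rounded up to a multiple of MAXALIGN
    bytes_needed = (ncols + 7) // 8
    return ((bytes_needed + MAXALIGN - 1) // MAXALIGN) * MAXALIGN

for n in (8, 9, 64, 65):
    print(n, null_bitmap_bytes(n, True))  # 8->0, 9->8, 64->8, 65->16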
The minimum storage requirement for any non-null value is 1 byte (boolean, "char", ...) or typically much more, plus (possibly) padding for alignment. Read up on data types or check the gory details in the system table pg_type.
More about null storage:
Does not using NULL in PostgreSQL still use a NULL bitmap in the header?
The manual.
It's always a good idea to keep columns from being NULL if you can avoid it, because the semantics of using them are so messy; see What is the deal with NULLs? for a good discussion of how they can get you into trouble.
In versions of PostgreSQL up to 8.2, the software didn't know how to do comparisons on the most common index type (the B-tree) in a way that would include finding NULL values. In the relevant bit of documentation on index types, you can see that described as "but note that IS NULL is not equivalent to = and is not indexable". The effective downside is that if you write a query that requires including NULL values, the planner might not be able to satisfy it using the obvious index for that case. As a simple example, if you have an ORDER BY that could be accelerated with an index, but your query needs to return NULL values too, the optimizer can't use that index, because the result would be missing any NULL data - and therefore be incomplete and useless. The optimizer knows this, and will instead do an unindexed scan of the table, which can be very expensive.
PostgreSQL improved this in 8.3: "an IS NULL condition on an index column can be used with a B-tree index". So the situations where you can be burned by trying to index something with NULL values have been reduced. But since NULL semantics are still really painful, and you might run into a situation where even the 8.3 planner doesn't do what you expect because of them, you should still use NOT NULL whenever possible to lower your chances of running into a badly optimized query.
No, as long as you don't actually store NULLs in the table the indexes will look exactly the same (and equally efficient).
Setting the column to NOT NULL has a lot of other advantages though, so you should always set it to that when you don't plan to store NULLs in it :-)
