Best method to store and retrieve data using multiple keys - algorithm

As part of an application being developed in C, clients send requests containing two 16-byte values plus some data. One of the two 16-byte values is a per-client UID, and the other is a random, unique object UID. So the server would receive the following kind of data from clients:
From client1:
<Client UID1, obj UID1, random-data>
From client2:
<Client UID2, obj UID4, random-data>
From client3:
<Client UID3, obj UID5, random-data>
Upon receiving the above information, the server needs to store it in some kind of table and respond with an 8-byte unique ID for each such request. Clients will come back to the server using either the combination <Client UID, Obj UID> or the unique 8-byte ID (generated by the server) to operate on the random data associated with that <Client UID, Obj UID>.
As the number of such requests would be humongous, I cannot store them in an array and use the index as the 8-byte unique ID; that would also force a linear search to identify the matching <Client UID, Obj UID>.
So, how can I organize the data efficiently so that I can index it using both the 8-byte unique ID and the combination <Client UID, Obj UID>?

I think you need to use a hash table to store and look up the data here. Adding or updating a record is very quick.
If you use a double-hashing algorithm, you can use the hash-table index as the unique 8-byte ID, giving direct access to the table cell without any search. Of course, that index will be shorter than 8 bytes, but I think you can pad it with zeros or the like.
As a basis, you can take my own double-hash implementation in the emerSSL project.
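As a rough illustration, here is a minimal sketch in C of the two-way lookup. It uses separate chaining rather than double hashing, and it hands back the entry's address as the 8-byte ID so the reverse lookup needs no search at all; the table size, the FNV-1a hash, and that ID scheme are illustrative choices, not the only way to do it:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_BITS 20                       /* 2^20 buckets; size to your load */
#define TABLE_SIZE (1u << TABLE_BITS)

typedef struct entry {
    uint8_t client_uid[16];
    uint8_t obj_uid[16];
    void *data;                             /* the random data */
    struct entry *next;                     /* collision chain */
} entry_t;

static entry_t *table[TABLE_SIZE];

/* FNV-1a over the 32-byte composite key <Client UID, Obj UID> */
static uint64_t hash_key(const uint8_t *c, const uint8_t *o)
{
    uint64_t h = 14695981039346656037ULL;
    for (int i = 0; i < 16; i++) { h ^= c[i]; h *= 1099511628211ULL; }
    for (int i = 0; i < 16; i++) { h ^= o[i]; h *= 1099511628211ULL; }
    return h;
}

/* Store an object; the returned 8-byte unique ID is the entry's address. */
uint64_t store(const uint8_t *c, const uint8_t *o, void *data)
{
    entry_t *e = malloc(sizeof *e);
    memcpy(e->client_uid, c, 16);
    memcpy(e->obj_uid, o, 16);
    e->data = data;
    uint64_t bucket = hash_key(c, o) & (TABLE_SIZE - 1);
    e->next = table[bucket];
    table[bucket] = e;
    return (uint64_t)(uintptr_t)e;
}

/* O(1) lookup by ID, no search. A real server must validate the ID instead
 * of trusting a client-supplied pointer, e.g. by encoding a bucket index
 * plus a generation counter in the ID rather than a raw address. */
entry_t *lookup_by_id(uint64_t id)
{
    return (entry_t *)(uintptr_t)id;
}

/* Expected O(1) lookup by <Client UID, Obj UID>. */
entry_t *lookup_by_uids(const uint8_t *c, const uint8_t *o)
{
    uint64_t bucket = hash_key(c, o) & (TABLE_SIZE - 1);
    for (entry_t *e = table[bucket]; e != NULL; e = e->next)
        if (memcmp(e->client_uid, c, 16) == 0 &&
            memcmp(e->obj_uid, o, 16) == 0)
            return e;
    return NULL;
}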

Related

Is getting data back from Redis SETS faster or more performant than HSETS?

I currently have a scenario where we are using Redis to store string field-value pairs in a hash (HSET).
The original reasoning behind using hashes instead of plain sets is the ease of retrieving the records using HSCAN inside a GUI search bar, as opposed to plain SCAN, because it's easier to get the length of a hash to use in the COUNT argument.
I read in the Redis documentation that both the GET and HGET commands execute with O(1) time complexity, but a member of my team thinks that if I store all the values under a single key, then HGET basically returns the entire hash instead of the single field-value that I need.
So for a made up but similar example:
I have a Redis instance with a single Hashed Set called users.
The hashed set has 150,000 field:value pairs of username:email
When I execute hget users coolguy, is the entire hash returned, or just the email for user coolguy?
First of all, HSET does not create a hash set; it creates a hash table. The mechanism behind the hash and the set in Redis is the same (the set is indeed a hash set); the main difference is that the hash table has values.
To answer your question:
When I execute hget users coolguy, is the entire hash returned, or just the email for user coolguy?
Just the email for that user. You can also use HMGET to get the emails of multiple users at once. It's O(1) for each user you fetch, or O(n) for n users.
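You can see this in a redis-cli session (hypothetical data):

127.0.0.1:6379> HSET users coolguy coolguy@example.com
(integer) 1
127.0.0.1:6379> HGET users coolguy
"coolguy@example.com"
127.0.0.1:6379> HMGET users coolguy someotherguy
1) "coolguy@example.com"
2) (nil)

HGET transfers only the one field's value; the other 150,000 pairs never leave the server.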

Is it a good idea to use Hashing (ORA_HASH) to define uniqueness in relational tables?

I have an XML file containing client IDs and addresses, which I need to load into relational tables in an Oracle database.
<Clients>
<Client id="100">
<name>ABC Corporation</name>
<Address>
<Addr1>1 Pine Street</Addr1>
<City>Chennai</City>
<State>Tamil Nadu</State>
<Country>India</Country>
<Postalcode>6000000</Postalcode>
</Address>
<Address>
<Addr1>1 Apple Street</Addr1>
<City>Coimbatore</City>
<State>Tamil Nadu</State>
<Country>India</Country>
<Postalcode>6000101</Postalcode>
</Address>
</Client>
<Client id="101">
....
....
</Client>
I have 2 relational tables defined as below:
Client
CLIENT_ID (Unique Key)
CLIENT_NAME
Client_Location
CLIENT_ID
ADDR1
CITY
STATE
COUNTRY
POSTAL_CODE
Updates to client addresses at the source will be sent in the XML file every day. The ETL is designed in a way that requires a unique key on the table, based on which it identifies each change coming in the XML as an INSERT or an UPDATE and syncs the table to the XML accordingly. Identifying DELETEs is not really necessary.
Question: What should be defined as the unique key for Client_Location to process the incremental changes coming in the XML file every day? There is no identifier for an address in the XML file. I was thinking about creating an additional hash column (using the ORA_HASH function) based on the 3 columns STATE, COUNTRY, POSTAL_CODE. The unique key for the table, which the ETL will use, would then be (CLIENT_ID, <hash column>). The idea is that it is not common for STATE/COUNTRY/POSTAL_CODE to change in an address. Of course, this is a big assumption I'm making. I would like to implement the below:
1) If there is any small change to ADDR1, I want the ETL to pick it up as a "valid" update at the source and sync it to the table.
2) If there is a small change in STATE/COUNTRY/POSTAL_CODE (e.g. a typo correction, or a case change like India to INDIA), then I don't want this to be picked up as a change, because the hash value, which is part of the unique key, would change; that would lead to an INSERT and in turn duplicate rows in the table.
Does the idea of using a hashing column to define uniqueness make sense? Is there a better way to deal with this?
Is there a way to tweak ORA_HASH to produce the results expected in #2 above?
If a client can have only one location, reuse CLIENT_ID as the primary key.
If more locations are possible, add a SEQUENCE key (a sequence number 1..N) to the CLIENT_ID as the PK.
The simplest way to distinguish and identify the locations is to use the XML feature that the order of elements is well defined and has meaning. So the first Address element (Pine Street) becomes sequence 1, the second becomes 2, and so on.
Please check the FOR ORDINALITY clause of XMLTABLE for how to get this identification while parsing the XML; a sketch follows.
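For illustration, assuming the daily file is staged in a table xml_staging with an XMLTYPE column xml_doc (both names made up), the shredding could look like this. ADDR_SEQ restarts at 1 for each client, so (CLIENT_ID, ADDR_SEQ) can serve as the unique key:

SELECT c.client_id,
       a.addr_seq,
       a.addr1, a.city, a.state, a.country, a.postal_code
FROM xml_staging x,
     XMLTABLE('/Clients/Client'
              PASSING x.xml_doc
              COLUMNS client_id   NUMBER        PATH '@id',
                      client_name VARCHAR2(100) PATH 'name',
                      addresses   XMLTYPE       PATH '.') c,
     XMLTABLE('/Client/Address'
              PASSING c.addresses
              COLUMNS addr_seq    FOR ORDINALITY,  -- 1, 2, ... per client
                      addr1       VARCHAR2(100) PATH 'Addr1',
                      city        VARCHAR2(50)  PATH 'City',
                      state       VARCHAR2(50)  PATH 'State',
                      country     VARCHAR2(50)  PATH 'Country',
                      postal_code VARCHAR2(20)  PATH 'Postalcode') a;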
You may also add a TIMESTAMP (as a simple attribute, not part of the key) to keep the timestamp of the change, and a STATUS column to identify deleted locations.
A HASH may be useful to quickly test for a change if you have tons of columns, but for 5 columns it is probably overkill (you may simply compare the column values). I'd not recommend using a HASH as part of a key, as it has no advantage over the proposed solution.

Alternative to ORA_HASH?

We are working with a table in a 3rd party database that does not have a primary key but does have a unique index.
I have therefore been looking at using the ORA_HASH function to produce a de facto unique Id by passing in the values of the columns in the unique index.
Unfortunately, I can already see that we have a few collisions, which means that we can't derive a unique id using this method.
Is there an alternative to ORA_HASH that would provide a unique id for a unique input?
I suppose I could generate an Id using DBMS_CRYPTO.Hash but I'd ideally like to get a numeric value.
Edit
The added complication is that I then need to store these records in another (SQL Server) database and then compare the records from the original and the replica tables. So RANK doesn't help me here, since records can be added to or deleted from the original table.
DBMS_CRYPTO.HASH could be used to generate a hash with enough bits to give you a very low, but not zero, chance of collisions, but it returns RAW, not NUMBER.
To guarantee no collisions ever, you need a one-to-one (injective) function. As far as I know, Oracle does not provide one.
A practical approach would be to create a new table that maps each unique key to a newly generated primary key. E.g., the unique value ("ABC", 123, 888) maps to 838491, where 838491 was generated from a sequence.
You'd have to update the mapping table periodically to account for inserted rows, and that would be a pain, but it would let you generate your own PKs and keep track of them without a lot of complication.
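A sketch of that mapping table, with made-up names (col1, col2, col3 standing in for the columns of your unique index):

CREATE TABLE key_map (
  surrogate_id NUMBER PRIMARY KEY,
  col1 VARCHAR2(50),
  col2 NUMBER,
  col3 NUMBER,
  CONSTRAINT key_map_uk UNIQUE (col1, col2, col3)
);

CREATE SEQUENCE key_map_seq;

-- Periodic refresh: assign a new surrogate id to any key tuple that
-- has appeared in the source table but is not mapped yet.
INSERT INTO key_map (surrogate_id, col1, col2, col3)
SELECT key_map_seq.NEXTVAL, col1, col2, col3
FROM (
  SELECT DISTINCT s.col1, s.col2, s.col3
  FROM source_table s
  WHERE NOT EXISTS (SELECT 1
                    FROM key_map m
                    WHERE m.col1 = s.col1
                      AND m.col2 = s.col2
                      AND m.col3 = s.col3)
);

The DISTINCT sits in the inline view because Oracle does not allow NEXTVAL in the same query block as DISTINCT.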
Have you tried:
DBMS_UTILITY.GET_HASH_VALUE (
name VARCHAR2,
base NUMBER,
hash_size NUMBER)
RETURN NUMBER;
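For example, concatenating the unique-index columns with a delimiter (so that different column splits can't collide trivially):

SELECT DBMS_UTILITY.GET_HASH_VALUE(
         'ABC' || '~' || '123' || '~' || '888',  -- concatenated key columns
         1,                                      -- base of the result range
         POWER(2, 30))                           -- size of the result range
       AS hash_id
FROM dual;

This does return a NUMBER, but it is still a hash, so it cannot guarantee uniqueness any more than ORA_HASH can; only something like the mapping table above gives a collision-free ID.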

SNMP4J dynamic index value

I am trying to build a client to get values from an SNMP-enabled device using SNMP4J. Using the OID and index number, I can fetch the name and serial number of the device. But I have heard that the index number is not constant and keeps changing.
Though I may find the required index number (for example, of a network interface) among the SNMP OIDs, sometimes we cannot completely rely on the index number always staying the same.
Index numbers may be dynamic - they may change over time, and your item may stop working as a consequence.
So I need to find a way to fetch the index number dynamically, or some other way to get the serial number without hard-coding anything.
One OID might have 150 index numbers, each having a different value. I need to get a particular piece of info from that table.
It's (unfortunately) not unusual that index numbers change. For example, some equipment will re-order some tables on reboot.
You probably already realized that if the index values are volatile, you won't be able to fetch the data in one request. But you could do it by "walking" the table.
Using the GetNextRequest, you can start at the column headers, and then proceed through the table, fetching all the data or just individual columns. For a more detailed example, see section 4.2.2.1 of RFC 1905:
https://www.rfc-editor.org/rfc/rfc1905
Assuming there is some column in the table that will identify the correct card, you could either:
Walk only the identifying column, find from its values the index of the card you want, then issue a single GetRequest for the serial number of that card; or
(More efficient) Walk both columns (identifier and serial number) by requesting the two column headers in the first request, and so on; then find the data for your card in the resulting set. A minimal sketch of such a walk follows.
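Here is a rough SNMP4J sketch of walking a single column with TreeUtils. The community string, the address, and the column OID (ifDescr, 1.3.6.1.2.1.2.2.1.2) are placeholders; substitute the identifier and serial-number columns of your device's table:

import java.util.List;
import org.snmp4j.CommunityTarget;
import org.snmp4j.Snmp;
import org.snmp4j.mp.SnmpConstants;
import org.snmp4j.smi.*;
import org.snmp4j.transport.DefaultUdpTransportMapping;
import org.snmp4j.util.DefaultPDUFactory;
import org.snmp4j.util.TreeEvent;
import org.snmp4j.util.TreeUtils;

public class WalkExample {
    public static void main(String[] args) throws Exception {
        Snmp snmp = new Snmp(new DefaultUdpTransportMapping());
        snmp.listen();

        CommunityTarget target = new CommunityTarget();
        target.setCommunity(new OctetString("public"));
        target.setAddress(GenericAddress.parse("udp:192.168.1.1/161"));
        target.setVersion(SnmpConstants.version2c);

        // Walk one column; the last sub-identifier of each returned OID
        // is that row's current (possibly volatile) index.
        TreeUtils treeUtils = new TreeUtils(snmp, new DefaultPDUFactory());
        List<TreeEvent> events =
            treeUtils.getSubtree(target, new OID("1.3.6.1.2.1.2.2.1.2"));

        for (TreeEvent event : events) {
            if (event == null || event.isError()) continue;
            for (VariableBinding vb : event.getVariableBindings()) {
                int index = vb.getOid().last();   // dynamic row index
                System.out.println(index + " -> " + vb.getVariable());
            }
        }
        snmp.close();
    }
}

Once the value you are looking for turns up in the walk, the index taken from its OID can be appended to the serial-number column OID for a single GetRequest.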

Index a column encrypted with pkg_crypto

I'm working on a project that uses pkg_crypto to protect users' personal information. There are several thousand rows (expected to grow to maybe several tens of thousands), and whenever I use a WHERE or ORDER BY clause in a query, the whole table is decrypted before the results are returned. This takes several seconds for a single query, which is usable for development but will probably not be acceptable for release.
Is there a way to create an index that will work on the encrypted columns without compromising security?
The inserts and selects look something like this (with iBatis):
insert:
INSERT INTO "USER_TABLE"
(
"ID"
,"LOGIN"
,"PASSWORD"
,"NAME"
,"EMAIL"
)
VALUES
(
user_table_seq.nextval,
#login#
,#password#
,pkg_crypto.encrypt(#name#, 'key')
,pkg_crypto.encrypt(#email#, 'key')
)
select:
SELECT
"ID"
,"LOGIN"
,"PASSWORD"
,pkg_crypto.decrypt("NAME", 'key') NAME
,pkg_crypto.decrypt("EMAIL", 'key') EMAIL
FROM "USER_TABLE"
WHERE pkg_crypto.decrypt("NAME", 'key') LIKE #name# || '%'
AND pkg_crypto.decrypt("EMAIL", 'key') LIKE '%' || #email#
I'll preemptively put out there that the password is hashed by the servlet before being passed to the db.
Do you need to use PKG_CRYPTO to encrypt the data (which, I'm assuming, is something you wrote that calls either DBMS_CRYPTO or DBMS_OBFUSCATION_TOOLKIT)? Oracle has a feature called Transparent Data Encryption (TDE), though it is an extra-cost option, that would allow Oracle to transparently encrypt the data on disk, decrypt it when it is read from disk, and then use this sort of LIKE predicate on your data.
Substantially, the answer is No.
When each value is encrypted, a random IV (initialization vector) is chosen to go with it. This means you cannot predict what will go into the index: if you re-encrypt the same value (even with the same key), you will get a different result. Therefore you cannot meaningfully use an index on the encrypted value, because you cannot reproduce the encryption of the value you're searching for. The index would, in any case, only be useful for equality searches, since the data would be in a random sequence.
You might do better storing a hash value as well as the encrypted value (sketched below). If you hash the names with a known algorithm, then you can reproduce the hash value on demand and find the rows that match. But simply knowing the hash won't allow you (or an intruder) to determine the value that was hashed, except through pre-computed 'rainbow tables'.
So, you cannot meaningfully index encrypted columns - not even for uniqueness, since the same value would be encrypted in different ways by virtue of the random IV.
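A sketch of the hash-column idea against the USER_TABLE above. STANDARD_HASH requires Oracle 12c or later (on older releases, a wrapper around DBMS_CRYPTO.HASH does the same job), and note that this supports only equality lookups, not the LIKE prefix searches in the original query:

ALTER TABLE user_table ADD (name_hash RAW(32));
CREATE INDEX user_table_name_hash_ix ON user_table (name_hash);

-- On insert, store a hash of the plaintext next to the ciphertext
INSERT INTO "USER_TABLE"
("ID", "LOGIN", "PASSWORD", "NAME", "EMAIL", "NAME_HASH")
VALUES
(user_table_seq.nextval,
 #login#
,#password#
,pkg_crypto.encrypt(#name#, 'key')
,pkg_crypto.encrypt(#email#, 'key')
,STANDARD_HASH(#name#, 'SHA256'))

-- An equality search can now use the index and decrypts nothing
SELECT "ID", "LOGIN", pkg_crypto.decrypt("NAME", 'key') NAME
FROM "USER_TABLE"
WHERE "NAME_HASH" = STANDARD_HASH(#name#, 'SHA256')

Since names have low entropy, a plain hash is still exposed to rainbow-table attacks; hashing with a secret salt, or using a keyed hash (an HMAC via DBMS_CRYPTO.MAC), closes that gap.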
