Select distinct secondary key values in BerkeleyDB JE - berkeley-db-je

I have a Berkeley DB JE setup using the DPL.
I have a secondary key field which is a string, and I would like to retrieve all distinct values for this key. No additional filtering is required; I simply want all the distinct values.
I could iterate over all entries and add the values to a set, but this seems very inefficient, as I have on the order of tens of distinct values across hundreds of thousands of rows.

If you just need the distinct keys (not the distinct entities), you can call SecondaryIndex.keys() to open a cursor over the secondary key values, and then call EntityCursor.nextNoDup() repeatedly to step through the unique values, skipping duplicates.
I recommend posting questions on Berkeley DB Java Edition to its OTN forum.
--mark

Related

Randomly generated public unique ids

Currently I'm generating unique ids for rows in my database using int and auto_increment. These ids are public facing so in the url you can see something like this https://example.com/path/1 or https://example.com/path/2
After talking with another engineer they've advised me that I should use randomly generated ids so that they're not guessable.
How can I generate a unique ID without looping over the database each time to make sure it's unique? Take Stripe, for example: all of their IDs look like price_sdfgsdfg or prod_iisdfgsdfg. What's the best way to generate unique IDs for rows like these?
Without knowing which language or database you're using, the simplest approach is to use UUIDs.
To avoid downloading all existing keys and looping over them, just try to INSERT the new ID into the table, inside a retry loop:
If the insert fails (e.g. raises an exception), the ID is already taken, so generate another one and continue.
If the insert succeeds, break out of the loop.
This only works when the ID column is declared NOT NULL and UNIQUE.
That's how I "know" an ID is free without scanning the whole table or loading all the IDs into local memory.
Using auto_increment won't lead to duplicates either, because the database itself serializes allocation of the next value, so two concurrent inserts never receive the same number.
SQL example (MySQL/MariaDB syntax; SQLite does not accept the ENGINE clause):
CREATE TABLE `my_db`.`my_table` ( `unique_id` INT NOT NULL , UNIQUE (`unique_id`)) ENGINE = InnoDB;
Insert a unique_id:
INSERT INTO `my_table` (`unique_id`) VALUES (999999999);
Great, we have a row. Inserting the same value again fails:
INSERT INTO `my_table` (`unique_id`) VALUES (999999999);
Error:
#1062 - Duplicate entry '999999999' for key 'unique_id'
If you get this error, generate a new ID and retry.
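The insert-and-retry idea can be sketched in Python as follows. This is only an illustration under assumptions: it uses the standard-library sqlite3 module with an in-memory database instead of MySQL, the table and column names follow the SQL example, and the helper name insert_with_random_id and the "prod_" prefix are made up here to mimic the Stripe-style IDs from the question.

```python
import secrets
import sqlite3


def insert_with_random_id(conn, prefix="prod_", max_attempts=5):
    """Insert a row under a fresh random ID, retrying if the ID is taken.

    Hypothetical helper for illustration; not part of any library.
    """
    for _ in range(max_attempts):
        # 16 random bytes from the OS random source, encoded URL-safe.
        candidate = prefix + secrets.token_urlsafe(16)
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(
                    "INSERT INTO my_table (unique_id) VALUES (?)", (candidate,)
                )
            return candidate
        except sqlite3.IntegrityError:
            # UNIQUE constraint violated: the ID exists already, try another.
            continue
    raise RuntimeError("could not find a free ID")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (unique_id TEXT NOT NULL UNIQUE)")
new_id = insert_with_random_id(conn)
```

With 128 random bits per ID, the retry branch should almost never run; it is there so that correctness rests on the UNIQUE constraint rather than on luck.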
If these are public URLs and the content is sensitive, then I definitely do not recommend sequential integers, as someone can trivially guess 1 through 99999999 and so on.
Whatever language you use, draw the randomness from the operating system's secure source (on Unix-like systems, /dev/urandom).
In shell/bash scripts, I might use uuidgen:
9dccd646-043e-4984-9126-3060b4ced180
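In Python, I'll use pandas (note that this derives a hash from each row's contents and index rather than generating a random ID):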
df.set_index(pd.util.hash_pandas_object(df, encoding='utf8'), drop=True, inplace=True)
df.index.rename('hash', inplace=True)
Lastly, UUIDs aren't perfect: their text form uses only hexadecimal digits (0-9 and a-f), so they are long for the amount of randomness they carry. But they are easy to generate, and most languages ship a UUID library.
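For comparison, here is a small Python sketch of both options using only the standard library: a random (version 4) UUID, and a URL-safe token that packs the same 128 bits into fewer characters. The function names are invented for this example.

```python
import secrets
import uuid


def new_uuid():
    """Return a random (version 4) UUID in its usual 36-character text form."""
    return str(uuid.uuid4())


def new_token(nbytes=16):
    """Return nbytes of OS randomness encoded with a URL-safe base64 alphabet."""
    return secrets.token_urlsafe(nbytes)


uid = new_uuid()     # 36 characters: hex digits and hyphens
token = new_token()  # 22 characters for 16 bytes: letters, digits, '-' and '_'
```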
In JavaScript, you may want to look at how some security-conscious open-source apps do it. For example, Jitsi (https://github.com/jitsi/js-utils/blob/master/random/roomNameGenerator.js) combines random words:
E.g. Satisfied-Global-Architectural-Bitter

DynamoDB with AWS Lambda function: order by descending order with scan

I am trying to create an AWS Lambda function using Node.js that scans records from DynamoDB, but it returns the records in random order. I would like to fetch the 5 records most recently added to the table, so I want to sort on the Timestamp attribute and take the latest 5. If anyone has an idea, please help me out.
DynamoDB's scan operation is not designed to support ordering. Ordering is supported in query operations.
To get the behavior you want you can do the following (with one caveat, see below):
Make sure that each record in your table has an attribute (let's call it x) which always holds the same value (it does not matter which value; let's say it is always "y").
Define a global secondary index on your table. The index should use x as the partition key (aka "hash key") and the timestamp field as the sort key.
Then you can issue a query on that index. According to the DynamoDB documentation, "Query results are always sorted by the sort key value", which is exactly what you need.
The caveat: this means that your index will hold all records of your table under the same partition key. This goes against best practices of dynamodb (see Choosing the Right DynamoDB Partition Key). It will not scale for large tables (more than tens of GB).
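As a rough sketch, the parameters for such a query could look like the following. The parameter names (IndexName, KeyConditionExpression, ScanIndexForward, Limit and so on) are the ones the DynamoDB Query API uses, and the same names apply in the Node.js SDK. The table name, the index name, the helper function, and the assumption that x is stored as a string are all placeholders for illustration.

```python
def latest_items_query(table_name, index_name, limit=5):
    """Build the arguments for a DynamoDB Query that returns the newest items.

    Hypothetical helper: assumes an index whose partition key is the
    constant attribute "x" (always "y") and whose sort key is the timestamp.
    """
    return {
        "TableName": table_name,
        "IndexName": index_name,
        # Every item shares the same partition key value, so this matches all.
        "KeyConditionExpression": "#pk = :pk",
        "ExpressionAttributeNames": {"#pk": "x"},
        "ExpressionAttributeValues": {":pk": {"S": "y"}},
        # False means descending sort-key order, i.e. newest timestamp first.
        "ScanIndexForward": False,
        "Limit": limit,
    }


params = latest_items_query("my_table", "x-timestamp-index")

# With boto3 installed and AWS credentials configured, this could be sent as:
#   import boto3
#   items = boto3.client("dynamodb").query(**params)["Items"]
```

Setting ScanIndexForward to False together with Limit is what turns "sorted by sort key" into "latest 5 records".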

Alternative to ORA_HASH?

We are working with a table in a 3rd party database that does not have a primary key but does have a unique index.
I have therefore been looking at using the ORA_HASH function to produce a de facto unique Id by passing in the values of the columns in the unique index.
Unfortunately, I can already see that we have a few collisions, which means that we can't derive a unique id using this method.
Is there an alternative to ORA_HASH that would provide a unique id for a unique input?
I suppose I could generate an Id using DBMS_CRYPTO.Hash but I'd ideally like to get a numeric value.
Edit
The added complication is that I then need to store these records in another (SQL Server) database and then compare the records from the original and the replica tables. So rank doesn't help me here since records can be added or deleted in the original table.
DBMS_CRYPTO.HASH could be used to generate a high-bit hash (high enough to give you a very low, but not zero, chance of collisions), but it returns 'RAW' not 'NUMBER'.
To guarantee no collisions ever, you need a one-to-one hash function. As far as I know, Oracle does not provide one.
A practical approach would be to create a new table to map unique keys to a newly generated primary key. E.g., unique value ("ABC",123, 888) maps to 838491 (where you generated 838491 using a sequence).
You'd have to update the mapping table periodically, to account for inserted rows, and that would be a pain, but it would let you generate your own PKs and keep track of them without a lot of complication.
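To make the mapping-table idea concrete, here is a minimal sketch in Python. It uses the standard-library sqlite3 module purely as a stand-in for Oracle: an AUTOINCREMENT column plays the role of the sequence, and the table name key_map, the column names, and the helper surrogate_id are invented for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE key_map (
        id   INTEGER PRIMARY KEY AUTOINCREMENT,  -- stands in for a sequence
        col1 TEXT    NOT NULL,
        col2 INTEGER NOT NULL,
        col3 INTEGER NOT NULL,
        UNIQUE (col1, col2, col3)                -- the source table's unique key
    )
    """
)


def surrogate_id(conn, key):
    """Return the generated ID for a unique key, creating the mapping if new."""
    with conn:
        # Does nothing when the key is already mapped.
        conn.execute(
            "INSERT OR IGNORE INTO key_map (col1, col2, col3) VALUES (?, ?, ?)",
            key,
        )
    row = conn.execute(
        "SELECT id FROM key_map WHERE col1 = ? AND col2 = ? AND col3 = ?", key
    ).fetchone()
    return row[0]


first = surrogate_id(conn, ("ABC", 123, 888))
again = surrogate_id(conn, ("ABC", 123, 888))  # same key, same ID
other = surrogate_id(conn, ("XYZ", 1, 2))      # new key, new ID
```

Because the IDs are stored rather than computed, the same key always maps to the same ID and two different keys can never collide, which is the property a hash cannot guarantee.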
Have you tried:
DBMS_UTILITY.GET_HASH_VALUE (
name VARCHAR2,
base NUMBER,
hash_size NUMBER)
RETURN NUMBER;

Is there any way to generate an ID without a sequence?

The current application uses JPA to auto-generate the table/entity ID. Now there is a requirement to insert data into the database manually using SQL queries.
So the questions are:
1. Is it worth creating a sequence in this schema just for this small requirement?
2. If the answer to 1 is no, what could be a plan B?
1. Yes. A sequence is trivial - why would you not do it?
2. N/A
A few ways:
Use a UUID. UUIDs are large, (pseudo-)randomly generated identifiers; collisions are so improbable that they are treated as unique in practice.
Does the data have something unique, like a timestamp or an IP address? If so, use that.
A combination of the current timestamp and some less unique value in the data.
A combination of the current timestamp and some integer i that you keep incrementing.
There are others (including generating a checksum, custom random numbers instead of UUIDs, etc.), but those can produce overlaps, so I'm not mentioning them.
Edit: Minor clarifications
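The "timestamp plus incrementing integer" option can be sketched like this in Python. It is only an illustration: the function name next_id is made up, and the uniqueness argument holds only within a single process, since the counter is not shared between processes or machines.

```python
import itertools
import time

# Process-wide counter; every call to next() yields a new integer.
_counter = itertools.count()


def next_id():
    """Combine the current time in nanoseconds with an incrementing integer.

    The counter alone keeps IDs distinct inside one process; the timestamp
    keeps them distinct across restarts as long as the clock moves forward.
    """
    return f"{time.time_ns()}-{next(_counter)}"


ids = [next_id() for _ in range(1000)]
```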
Are you just doing a single data load into an empty table, and there are no other users concurrently inserting data? If so, you can just use ROWNUM to generate the IDs starting from 1, e.g.
INSERT INTO mytable
SELECT ROWNUM AS ID
,etc AS etc
FROM ...

Sort by key in Cassandra

Let's assume I have a keyspace with a column family that stores user objects and the key of these objects is the username.
How can I use Hector to get a list of users sorted by username?
I tried to use a RangeSlicesQuery, paging works fine with this query, but the results are not sorted in any way.
I'm an absolute Cassandra beginner, can anyone point me to a simple example that shows how to sort a column family by key? Please ask if you need more details on my efforts.
Edit:
The result was not sorted because I used the default RandomPartitioner instead of the OrderPreservingPartitioner in cassandra.yaml.
Probably it's better not to rely on the sorting by key but to use a secondary index.
Quoting Cassandra: The Definitive Guide:
"Column names are stored in sorted order according to the value of compare_with. Rows, on the other hand, are stored in an order defined by the partitioner (for example, with RandomPartitioner, they are in random order, etc.)"
I guess you are using RandomPartitioner, which
"... return data in an essentially random order."
You should probably use OrderPreservingPartitioner (OPP), where
"Rows are therefore stored by key order, aligning the physical structure of the data with your sort order."
Be aware of the inefficiency of OPP.
(edit on Mar 07, 2014)
Important:
This answer is very old now.
The partitioner is a system-wide setting, which you can set in cassandra.yaml (see the documentation). Again, OPP is highly discouraged. The documentation I looked at is for version 1.1, where OPP is already marked as deprecated, so it has likely been removed from later versions. If you do want to use OPP, you may want to revisit the architecture.
Or create a row called "meta:userNames" in the same column family and store all the user names in it as a lookup index. Something like this:
Users {
key: "meta:userNames" {david:david, paolo:paolo, victor:victor},
key: "paolo" {password:"*****", locale:"it_it"},
key: "david" {password:"*****", locale:"en_us"},
key: "victor" {password:"*****", locale:"en_uk"}
}
First query the columns of the meta:userNames row (which are sorted) and then use them to fetch the user rows. Don't try to get everything with a single query, as you would in an SQL database. Use Cassandra as a huge hash map that provides rapid random access to its data.
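The two-step lookup can be modelled in plain Python, as below. This is an in-memory stand-in and not Hector or any Cassandra client: a dict plays the role of the column family, and sorted() imitates the fact that Cassandra returns column names in comparator order. The function name is invented for the example.

```python
# In-memory model of the Users column family shown above.
users = {
    "meta:userNames": {"david": "david", "paolo": "paolo", "victor": "victor"},
    "paolo": {"password": "*****", "locale": "it_it"},
    "david": {"password": "*****", "locale": "en_us"},
    "victor": {"password": "*****", "locale": "en_uk"},
}


def users_sorted_by_name(column_family):
    """Read the index row first, then fetch each user row by its key."""
    # Step 1: the column names of the index row, in sorted order.
    names = sorted(column_family["meta:userNames"])
    # Step 2: one key lookup per user name.
    return [(name, column_family[name]) for name in names]


result = users_sorted_by_name(users)
```

The cost of this design is that the index row must be updated whenever a user is added or removed.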
