Looking up a "key" in an 8GB+ text file - algorithm

I have some 'small' text files that contain about 500,000 entries/rows. Each row also has a 'key' column. I need to find these keys in a big file (8GB, at least 219 million entries). When a key is found, I need to append the 'Value' from the big file to the end of the row in the small file, as a new column.
The big file looks like this:
KEY VALUE
"WP_000000298.1" "abc"
"WP_000000304.1" "xyz"
"WP_000000307.1" "random"
"WP_000000307.1" "text"
"WP_000000308.1" "stuff"
"WP_000000400.1" "stuffy"
Simply put, I need to look up 'key' in the big file.
Obviously I need to load the whole table into RAM (but this is not a problem; I have 32GB available). The big file seems to be already sorted. I have to check this.
The problem is that I cannot do a fast lookup using something like TDictionary because as you can see, the key is not unique.
Note: This is probably a one-time computation. I will use the program once, then throw it away. So, it doesn't have to be the BEST algorithm (difficult to implement). It just needs to finish in decent time (like 1-2 days). PS: I prefer doing this without a DB.
I was thinking of this possible solution: TList.BinarySearch. But it seems that TList is limited to only 134,217,727 (MaxInt div 16) items. So TList won't work.
Conclusion:
I chose Arnaud Bouchez's solution. His TDynArray is impressive! I totally recommend it if you need to process large files.
AlekseyKharlanov provided another nice solution, but TDynArray was already implemented.

Instead of re-inventing the wheel of binary search or B-Tree, try with an existing implementation.
Feed the content into a SQLite3 in-memory DB (with the proper index, and with a transaction every 10,000 INSERTs) and you are done. Ensure you target Win64, to have enough space in RAM. You may even use file-based storage: a bit slower to create, but with indexes, queries by Key will be instant. If you do not have SQLite3 support in your edition of Delphi (via the latest FireDAC), you may use our OpenSource unit and its associated documentation.
Using SQLite3 will definitively be faster, and use fewer resources, than a regular client-server SQL database - BTW the "free" edition of MS SQL is not able to handle as much data as you need, AFAIR.
Update: I've written some sample code to illustrate how to use SQLite3, with our ORM layer, for your problem - see this source code file in github.
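For readers without Delphi/mORMot at hand, here is a minimal sketch of the same approach (batched transactions, index created after the load) using Python's built-in sqlite3 module; the file name, parsing and column names are assumptions:

import sqlite3

conn = sqlite3.connect(":memory:")          # or a file path for disk-based storage
conn.execute("CREATE TABLE big (key TEXT, value TEXT)")

def rows(path):
    # "big_file.txt" is a hypothetical name; adapt the parsing to the real format
    with open(path, encoding="utf-8") as f:
        next(f, None)                       # skip the KEY VALUE header line, if present
        for line in f:
            if not line.strip():
                continue
            key, value = line.rstrip("\n").split(maxsplit=1)
            yield key.strip('"'), value.strip('"')

batch = []
for row in rows("big_file.txt"):
    batch.append(row)
    if len(batch) == 10000:                 # one transaction per 10,000 inserts
        with conn:                          # commits the batch
            conn.executemany("INSERT INTO big VALUES (?, ?)", batch)
        batch.clear()
if batch:
    with conn:
        conn.executemany("INSERT INTO big VALUES (?, ?)", batch)

# creating the index after the load is cheaper than maintaining it during inserts
conn.execute("CREATE INDEX idx_big_key ON big (key)")

for key, value in conn.execute("SELECT key, value FROM big WHERE key = ?",
                               ("WP_000000307.1",)):
    print(key, value)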
Here are some benchmark numbers:
with index defined before insertion:
INSERT 1000000 rows in 6.71s
SELECT 1000000 rows per Key index in 1.15s
with index created after insertion:
INSERT 1000000 rows in 2.91s
CREATE INDEX 1000000 in 1.28s
SELECT 1000000 rows per Key index in 1.15s
without the index:
INSERT 1000000 rows in 2.94s
SELECT 1000000 rows per Key index in 129.27s
So for a huge data set, an index is worth it, and creating the index after the data insertion reduces the resources used! Even if the insertion is slower, the gain of an index is huge when selecting per key. You may try to do the same with MS SQL, or using another ORM, and I guess you will cry. ;)

Another answer, since it is with another solution.
Instead of using a SQLite3 database, I used our TDynArray wrapper, and its sorting and binary search methods.
type
  TEntry = record
    Key: RawUTF8;
    Value: RawUTF8;
  end;
  TEntryDynArray = array of TEntry;

const
  // used to create some fake data, with some multiple occurences of Key
  COUNT = 1000000;    // million rows insertion !
  UNIQUE_KEY = 1024;  // should be a power of two

procedure Process;
var
  entry: TEntryDynArray;
  entrycount: integer;
  entries: TDynArray;

  procedure DoInsert;
  var i: integer;
      rec: TEntry;
  begin
    for i := 0 to COUNT-1 do begin
      // here we fill with some data
      rec.Key := FormatUTF8('KEY%',[i and pred(UNIQUE_KEY)]);
      rec.Value := FormatUTF8('VALUE%',[i]);
      entries.Add(rec);
    end;
  end;

  procedure DoSelect;
  var i,j, first,last, total: integer;
      key: RawUTF8;
  begin
    total := 0;
    for i := 0 to pred(UNIQUE_KEY) do begin
      key := FormatUTF8('KEY%',[i]);
      assert(entries.FindAllSorted(key,first,last));
      for j := first to last do
        assert(entry[j].Key=key);
      inc(total,last-first+1);
    end;
    assert(total=COUNT);
  end;
Here are the timing results:
one million rows benchmark:
INSERT 1000000 rows in 215.49ms
SORT ARRAY 1000000 in 192.64ms
SELECT 1000000 rows per Key index in 26.15ms
ten million rows benchmark:
INSERT 10000000 rows in 2.10s
SORT ARRAY 10000000 in 3.06s
SELECT 10000000 rows per Key index in 357.72ms
It is more than 10 times faster than the SQLite3 in-memory solution. The 10 million rows stay in memory of the Win32 process with no problem.
It is also a good sample of how the TDynArray wrapper works in practice, and how its SSE4.2-optimized string comparison functions give good results.
Full source code is available in our github repository.
Edit: with 100,000,000 rows (100 million rows), under Win64, for more than 10GB of RAM used during the process:
INSERT 100000000 rows in 27.36s
SORT ARRAY 100000000 in 43.14s
SELECT 100000000 rows per Key index in 4.14s
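For readers without mORMot at hand, the same technique (sort once, then locate the first/last index of a key with binary search) looks roughly like this in Python with the bisect module; it is only an illustration, not the TDynArray API:

from bisect import bisect_left, bisect_right

entries = [("WP_000000307.1", "random"),
           ("WP_000000298.1", "abc"),
           ("WP_000000307.1", "text")]
entries.sort(key=lambda e: e[0])            # one O(n log n) sort
keys = [k for k, _ in entries]              # parallel list of keys for bisect

def find_all_sorted(key):
    first = bisect_left(keys, key)          # index of the first matching entry
    last = bisect_right(keys, key)          # one past the last matching entry
    return entries[first:last]              # all rows sharing this (non-unique) key

print(find_all_sorted("WP_000000307.1"))    # -> both the 'random' and 'text' rows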

Since this is a one-time task, the fastest way is to load the whole file into memory, scan the memory line by line, parse the key, compare it with the search key(s) and print (save) the found positions.
UPD: If the source file is sorted, and assuming you have 411,000 keys to look up, you can use this trick: sort your search keys in the same order as the source file. Read the first key from both lists and compare them. If they differ, read the next key from the source until they are equal. Save the position; if the next key in the source is also equal, save it too, etc. If the next source key differs, read the next key from the search-key list. Continue until EOF.
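A rough Python sketch of that merge-style scan (assuming both inputs are already sorted lexicographically, the search keys are deduplicated, and parsing details are made up):

# Merge-style scan of two sorted key lists: advance whichever side is smaller,
# emit every big-file row that matches (keys in the big file are not unique).
def merge_lookup(small_keys_sorted, big_rows_sorted):
    big = iter(big_rows_sorted)
    bkey, bval = next(big, (None, None))
    for skey in small_keys_sorted:
        # skip big-file rows whose key is smaller than the current search key
        while bkey is not None and bkey < skey:
            bkey, bval = next(big, (None, None))
        # collect every big-file row matching the search key
        while bkey is not None and bkey == skey:
            yield skey, bval
            bkey, bval = next(big, (None, None))

small = ["WP_000000298.1", "WP_000000307.1"]                      # sorted
big = [("WP_000000298.1", "abc"), ("WP_000000304.1", "xyz"),
       ("WP_000000307.1", "random"), ("WP_000000307.1", "text")]  # sorted
print(list(merge_lookup(small, big)))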

Use memory-mapped files. Just treat the file as if it were already read into memory in its entirety, and do the very binary search in memory that you wanted. Let Windows take care of reading the portions of the file when you do your in-memory search.
https://en.wikipedia.org/wiki/Memory-mapped_file
https://msdn.microsoft.com/en-us/library/ms810613.aspx
https://stackoverflow.com/a/9609448/976391
https://stackoverflow.com/a/726527/976391
http://msdn.microsoft.com/en-us/library/aa366761%28VS.85.aspx
https://msdn.microsoft.com/en-us/library/windows/desktop/aa366537.aspx
You may take any of those sources as a starting point, just do not forget to update them for Win64.
http://torry.net/quicksearchd.php?String=memory+mapped+files&Title=No
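For illustration, a tiny Python sketch of the idea using the mmap module (the file name is made up; the Delphi route would go through the Win32 memory-mapping APIs in the links above):

# Map the big file into memory and let the OS page it in as it is touched.
import mmap

with open("big_file.txt", "rb") as f:                     # hypothetical file name
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    pos = mm.find(b'"WP_000000307.1"')                    # simple scan; only the
    if pos != -1:                                         # touched pages are read
        line = mm[pos:].split(b"\n", 1)[0]
        print(line.decode())
    mm.close()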

A method that needs the file to be sorted, but avoids data structures entirely:
You only care about one line, so why even read the bulk of the file?
Open the file and move the "get pointer" (apologies for talking C) halfway through the file. You'll have to figure out if you're at a number or a word, but a number should be close by. Once you know the closest number, you know if it's higher or lower than what you want, and continue with the binary search.
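A rough sketch of that bisection over the file itself, in Python; it assumes the file is sorted by key, has no header or blank lines, and that keys compare correctly as raw bytes:

# Binary search directly on the sorted big file: jump to the middle, realign to
# the next line boundary, compare the key found there, and halve the range.
def find_all_in_sorted_file(path, wanted):
    with open(path, "rb") as f:
        lo, hi = 0, f.seek(0, 2)                    # hi = file size in bytes
        while hi - lo > 1:
            mid = (lo + hi) // 2
            f.seek(mid)
            f.readline()                            # skip the partial line at mid
            line = f.readline()                     # first full line after mid
            if not line or line.split(None, 1)[0] >= wanted:
                hi = mid
            else:
                lo = mid
        f.seek(lo)
        if lo:
            f.readline()                            # realign to a line boundary
        matches = []
        for line in f:                              # short forward scan
            key = line.split(None, 1)[0]
            if key == wanted:
                matches.append(line.decode().rstrip())
            elif key > wanted:
                break
        return matches

print(find_all_in_sorted_file("big_file.txt", b'"WP_000000307.1"'))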

Idea based on Aleksey Kharlanov answer. I accepted his answer.
I only copy his idea here because he did not elaborate it (no pseudo-code or deeper analysis of the algorithm). I want to confirm it works before implementing it.
We sort both files (once).
We load Big file in memory (once).
We read Small file line by line from disk (once).
Code:
In the code below, sKey is the current key in Small file. bKey is the current key in the Big file:
LastPos := 0
for sKey in SmallFile do
  for CurPos := LastPos to BigFile.Count do
    if sKey = bKey then
    begin
      SearchNext;      // search (down) next entries for possible duplicate keys
      LastPos := CurPos;
    end
    else
      if sKey < bKey then
        break;
It works because I know the last position (in the Big file) of the last key. The next key can only be somewhere UNDER the last position; ON AVERAGE it should be in the next 440 entries. However, I don't even have to always read 440 entries below LastPos, because if my sKey does not exist in the big file, it will be smaller than bKey, so I quickly break the inner loop and move on.
Thoughts?

If I were doing this as a one-time thing, I'd create a set with all the keys I need to look up. Then read the file line-by line, check to see if the key exists in the set, and output the value if so.
Briefly, the algorithm is:
mySet = dictionary of keys to look up
for each line in the file
  key = parse key from line
  if key in mySet
    output key and value
end for
Since Delphi doesn't have a generic set, I'd use TDictionary and ignore the value.
The dictionary lookup is O(1), so should be very fast. Your limiting factor will be file I/O time.
I figure that'd take about 10 minutes to code up, and less than 10 minutes to run.
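A minimal Python rendering of that algorithm (file names and parsing are assumptions; in Delphi the set would be a TDictionary as described):

# Build a set of the keys from the small file, then stream the big file once
# and emit key/value pairs whose key is in the set.
wanted = set()
with open("small_file.txt", encoding="utf-8") as small:
    for line in small:
        wanted.add(line.split()[0])                 # key is assumed to be column 1

with open("big_file.txt", encoding="utf-8") as big, \
     open("matches.txt", "w", encoding="utf-8") as out:
    for line in big:
        key, _, value = line.rstrip("\n").partition(" ")
        if key in wanted:                           # O(1) average lookup
            out.write(f"{key}\t{value}\n")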

Related

Rapidminer: Memory issues transforming nominal to binominal attributes

I want to analyze a large dataset (2,000,000 records, 20,000 customer IDs, 6 nominal attributes) using the Generalized Sequential Pattern algorithm.
This requires all attributes, aside from the time and customer ID attribute, to be binominal. Having 6 nominal attributes which I want to analyze for patterns, I need to transform those into binominal attributes, using the "Nominal to Binominal" Function. This is causing memory problems on my workstation (with 16GB RAM, of which I allocated 12 to the Java instance running rapidminer).
Ideally I would like to set up my project in a way that it writes temporarily to disk or uses temporary tables in my Oracle database, from which my model also reads the data directly. In order to use the "write database" or "update database" function, I need to already have an existing table in my database with boolean columns (if I'm not mistaken).
I tried to write the results of the binominal conversion step by step into CSV files on my local disk. I started with the nominal attribute with the least distinct values, resulting in a CSV file containing my dataset ID and now 7 binominal attributes. I was seriously surprised to see the file size being >200MB already. This is caused by RapidMiner writing the strings "true"/"false" for the binominal values. Wouldn't it be way more memory efficient to just write 0/1?
Is there a way to either use the Oracle database directly or work with 0/1 values instead of "true"/"false"? My next column would have 3000 distinct values to be transformed, which would end in a nightmare...
I'd highly appreciate recommendations on how to use the memory more efficient or work directly in the database. If anyone knows how to easily transform a varchar2-column in Oracle into boolean columns for each distinct value that would also be appreciated!
Thanks a lot,
Holger
edit:
My goal is to get from such a structure:
column_a; column_b; customer_ID; timestamp
value_aa; value_ba; 1; 1
value_ab; value_ba; 1; 2
value_ab; value_bb; 1; 3
to this structure:
customer_ID; timestamp; column_a_value_aa; column_a_value_ab; column_b_value_ba; column_b_value_bb
1; 1; 1; 0; 1; 0
1; 2; 0; 1; 1; 0
1; 3; 0; 1; 0; 1
This answer is too long for a comment.
If you have thousands of levels for the six variables you are interested in, then you are unlikely to get useful results using that data. A typical approach is to categorize the data going in, which results in fewer "binominal" variables. For instance, instead of "1 Gallon Whole Milk", you use "dairy products". This can result in more actionable results. Remember, Oracle only allows 1,000 columns in a table, so the database has other limiting factors.
If you are working with lots of individual items, then I would suggest other approaches, notably an approach based on association rules. This will not limit you by the number of variables.
Personally, I find that I can do much of this work in SQL, which is why I wrote a book on the topic ("Data Analysis Using SQL and Excel").
You can use the operator Nominal to Numeric to convert true and false values to 1 or 0. Set the coding type parameter to unique integers.
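Outside RapidMiner, the wide 0/1 encoding the question is after can be illustrated with pandas, just to show the target shape (this is not a RapidMiner operator; column names follow the example in the question):

import pandas as pd

df = pd.DataFrame({
    "column_a":    ["value_aa", "value_ab", "value_ab"],
    "column_b":    ["value_ba", "value_ba", "value_bb"],
    "customer_ID": [1, 1, 1],
    "timestamp":   [1, 2, 3],
})

# one 0/1 column per distinct value of column_a and column_b
wide = pd.get_dummies(df, columns=["column_a", "column_b"], dtype=int)
print(wide)
# columns: customer_ID, timestamp, column_a_value_aa, column_a_value_ab,
#          column_b_value_ba, column_b_value_bb with 0/1 values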

Any tips to making excel dsum faster?

I am currently using a dsum to calculate some totals and I noticed excel has become really slow (needs 2 seconds per cell change).
This is the situation:
- I am trying to calculate 112 dsums to show in a chart;
- all dsums are queries on a table with 15 columns and 32k+ rows;
- all dsums have multiple criteria (5-6 constraints);
- the criteria uses both numerical and alpha-numerical constraints;
- i have the source table/range sorted;
- excel file is 3.4 mb in size;
(I am using excel 2007 on an 4 year old windows laptop)
Any ideas on what can be done to make it faster?
...other than reducing the number of dsums :P ====>>> already working on that one.
Thanks!
Some options are:
Change Calculation to Manual and press F9 whenever you want to calculate
Try SUMIFS rather than DSUM
Exploit the fact that the data is sorted by using MATCH and COUNTIF to find the first row and count of rows, then use OFFSET or INDEX to get the relevant subset of data to feed to SUMIFS for the remaining constraints
Instead of DSUMs you could also put it all in one or multiple Pivot Tables - and then use GETPIVOTDATA to extract the data you need. The reading of the table will take up a bit of time (though 32k rows should be done in under a second) - and then GETPIVOTDATA is lightning fast!
Downsides:
You need to manually refresh the pivot when you get new data
The pivot(s) need to be laid out so the requested data is shown
File size will increase (unless the Pivot cache is not stored, in which case file loading takes longer)

Windows Azure Paging Large Datasets Solution

I'm using Windows Azure Table Storage to store millions of entities, however I'm trying to figure out the best solution that easily allows for three things:
1) a search on an entity, will retrieve that entity and at least (pageSize) number of entities either side of that entity
2) if there are more entities beyond (pageSize) number of entities either side of that entity, then page next or page previous links are shown, this will continue until either the start or end is reached.
3) the order is reverse chronological order
I've decided that the PartitionKey will be the Title provided by the user, as each container is unique in the system. The RowKey is Steve Marx's lexicographical algorithm:
http://blog.smarx.com/posts/using-numbers-as-keys-in-windows-azure
which when converted to javascript instead of c# looks like this:
pad(new Date(100000000 * 86400000).getTime() - new Date().getTime(), 19) + "_" + uuid()
uuid() is a javascript function that returns a guid and pad adds zeros up to 19 chars in length. So records in the system look something like this:
PK RK
TEST 0008638662595845431_ecf134e4-b10d-47e8-91f2-4de9c4d64388
TEST 0008638662595845432_ae7bb505-8594-43bc-80b7-6bd34bb9541b
TEST 0008638662595845433_d527d215-03a5-4e46-8a54-10027b8e23f8
TEST 0008638662595845434_a2ebc3f4-67fe-43e2-becd-eaa41a4132e2
This pattern allows for every new entity inserted to be at the top of the list which satisfies point number 3 above.
With a nice way of adding new records to the system, I thought I would then create a mechanism that looks at the first half of the RowKey, i.e. the 0008638662595845431_ part, and does a greater-than or less-than comparison depending on which direction from the already found item. In other words, to get the row immediately before 0008638662595845431, I would do a query like so:
var tableService = azure.createTableService();
var minPossibleDateTimeNumber = pad(new Date(-100000000 * 86400000).getTime() - new Date().getTime(), 19);
tableService.getTable('testTable', function (error) {
    if (error === null) {
        var query = azure.TableQuery
            .select()
            .from('testTable')
            .where('PartitionKey eq ?', 'TEST')
            .and('RowKey gt ?', minPossibleDateTimeNumber + '_')
            .and('RowKey lt ?', '0008638662595845431_')
            .and('Deleted eq ?', 'false');
If the results returned are greater than 1000 and Azure gives me a continuation token, then I thought I would remember the last item's RowKey, i.e. the number part 0008638662595845431. So now the next query will have the remembered value as the starting value, etc.
I am using Windows Azure Node.Js SDK and language is javascript.
Can anybody see gotcha's or problems with this approach?
I do not see how this can work effectively and efficiently, especially to get the rows for a previous page.
To be efficient, the prefix of your “key” needs to be a serially incrementing or decrementing value, instead of being based on a timestamp. A timestamp generated value would have duplicates as well as holes, making mapping page size to row count at best inefficient and at worst difficult to determine.
Also, this potential algorithm is dependent on a single partition key, destroying table scalability.
The challenge here would be to have a method of generating a serially incremented key. One solution is to use a SQL database and perform an atomic update on a single row, such that an incrementing or decrementing value is produced in sequence. Something like UPDATE … SET X = X + 1 and return X. Maybe using a stored procedure.
So the key could be a zero left padded serially generated number. Split such that say the first N digits of the number is the partition key and remaining M digits are the row key.
For example
PKey RKey
00001 10321
00001 10322
….
00954 98912
Now, since the rows are in sequence it is possible to write a query with the exact key range for the page size.
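A small sketch of the key splitting and of computing the exact key range for one page, assuming the 5+5 digit layout of the example above (digit widths are just the example's):

# Split a zero-padded serial number into a 5-digit PartitionKey and 5-digit RowKey,
# and compute the exact key range for one page of a given size.
def split_key(serial, pk_digits=5, rk_digits=5):
    s = str(serial).zfill(pk_digits + rk_digits)
    return s[:pk_digits], s[pk_digits:]

def page_range(first_serial, page_size):
    start = split_key(first_serial)
    end = split_key(first_serial + page_size - 1)   # note: a page may span partitions
    return start, end

print(split_key(95498912))          # -> ('00954', '98912'), as in the table above
print(page_range(110321, 2))        # page of 2 rows starting at serial 110321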
Caveat: there is a small risk of a failure occurring between generating a serial key and writing to table storage, in which case there may be holes in the table. However, your paging algorithm should be able to detect and work around such instances quite easily, by specifying a page size slightly larger than necessary or by retrying with an adjusted range.

Removing duplicates from a huge file

We have a huge chunk of data and we want to perform a few operations on them. Removing duplicates is one of the main operations.
Ex.
a,me,123,2631272164
yrw,wq,1237,123712,126128361
yrw,dsfswq,1323237,12xcvcx3712,1sd26128361
These are three entries in a file, and we want to remove duplicates on the basis of the 1st column. So, the 3rd row should be deleted. Each row may have a different number of columns, but the column we are interested in will always be present.
In memory operation doesn't look feasible.
Another option is to store the data in database and removing duplicates from there but it's again not a trivial task.
What design should I follow to dump data into database and removing duplicates?
I am assuming that people must have faced such issues and solved it.
How do we usually solve this problem?
PS: Please consider this as a real life problem rather than interview question ;)
If the number of keys is also infeasible to load into memory, you'll have to do a stable (order-preserving) external merge sort to sort the data and then a linear scan to do duplicate removal. Or you could modify the external merge sort to provide duplicate elimination when merging sorted runs.
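A rough Python sketch of that approach (external sort into temporary runs, then a merge that drops rows whose key was already emitted); the chunk size and the comma-separated first column are assumptions:

import heapq, os, tempfile

def external_dedup(in_path, out_path, chunk_lines=1_000_000):
    # Phase 1: sort fixed-size chunks by key and write each as a temporary run.
    runs = []
    with open(in_path, encoding="utf-8") as f:
        while True:
            chunk = [line for _, line in zip(range(chunk_lines), f)]
            if not chunk:
                break
            chunk.sort(key=lambda line: line.split(",", 1)[0])
            tmp = tempfile.NamedTemporaryFile("w", delete=False,
                                              encoding="utf-8", suffix=".run")
            tmp.writelines(chunk)
            tmp.close()
            runs.append(tmp.name)

    # Phase 2: k-way merge of the sorted runs; equal keys arrive adjacent,
    # so keeping only the first occurrence removes the duplicates.
    files = [open(r, encoding="utf-8") for r in runs]
    merged = heapq.merge(*files, key=lambda line: line.split(",", 1)[0])
    last_key = None
    with open(out_path, "w", encoding="utf-8") as out:
        for line in merged:
            key = line.split(",", 1)[0]
            if key != last_key:
                out.write(line)
                last_key = key
    for f in files:
        f.close()
    for r in runs:
        os.remove(r)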
Since this isn't an interview question, efficiency/elegance seems not to be an issue(?). Write a hack Python script that creates one table with the first field as the primary key. Parse this file and just insert the records into the database, wrapping the insert in a try/except statement. Then perform a SELECT * on the table, parse the data and write it back to a file line by line.
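Roughly, such a script could look like this (table and file names are made up):

# "Hack" dedup via a database primary key: the first row for each key inserts
# fine, later duplicates raise an IntegrityError and are skipped.
import sqlite3

conn = sqlite3.connect("dedup.db")
conn.execute("CREATE TABLE IF NOT EXISTS rows (k TEXT PRIMARY KEY, line TEXT)")

with open("input.csv", encoding="utf-8") as f, conn:
    for line in f:
        key = line.split(",", 1)[0]
        try:
            conn.execute("INSERT INTO rows VALUES (?, ?)", (key, line.rstrip("\n")))
        except sqlite3.IntegrityError:
            pass                                  # duplicate key: drop the row

with open("output.csv", "w", encoding="utf-8") as out:
    for (line,) in conn.execute("SELECT line FROM rows"):
        out.write(line + "\n")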
If you go down the database route, you can load the csv into a database and use 'duplicate key update'
Using MySQL:
Create a table with columns to match your data (you may be able to get away with just 2 columns - id and data)
dump the data using something along the lines of
LOAD DATA LOCAL infile "rs.txt" REPLACE INTO TABLE data_table FIELDS TERMINATED BY ',';
You should then be able to dump out the data back into csv format without duplicates.
If the number of unique keys isn't extremely high, you could simply just do this
(pseudo code, since you're not mentioning a language):
Set keySet;
while (not end_of_input_file)
  read line from input file
  if first column is not in keySet
    add first column to keySet
    write line to output file
end while
If the input is sorted or can be sorted, then one could do this which only needs to store one value in memory:
import sys

r = read_row()
if r is None:
    sys.exit()
last = r[0]
write_row(r)
while True:
    r = read_row()
    if r is None:
        sys.exit()
    if r[0] != last:
        write_row(r)
        last = r[0]
Otherwise:
What I'd do is keep a set of the first column values that I have already seen and drop the row if it is in that set.
import sys

S = set()
while True:
    r = read_row()
    if r is None:
        sys.exit()
    if r[0] not in S:
        write_row(r)
        S.add(r[0])
This will stream over the input using only memory proportional to the size of the set of values from the first column.
If you need to preserve order in your original data, it MAY be sensible to create new data that is a tuple of position and data, then sort on the data you want to de-dup. Once you've sorted by data, de-duplication is (essentially) a linear scan. After that, you can re-create the original order by sorting on the position-part of the tuple, then strip it off.
Say you have the following data: a, c, a, b
With a pos/data tuple, sorted by data, we end up with: 0/a, 2/a, 3/b, 1/c
We can then de-duplicate, trivially being able to choose either the first or last entry to keep (we can also, with a bit more memory consumption, keep another) and get: 0/a, 3/b, 1/c.
We then sort by position and strip that: a, c, b
This would involve three linear scans over the data set and two sorting steps.
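A compact sketch of that position/data-tuple approach, reusing the a, c, a, b example from above:

# Order-preserving dedup via (position, data) tuples: sort by data, drop adjacent
# duplicates, then sort the survivors back by their original position.
data = ["a", "c", "a", "b"]

tagged = sorted(enumerate(data), key=lambda t: t[1])   # sort by the data part
deduped, last = [], object()
for pos, value in tagged:                              # linear scan; keeps the
    if value != last:                                  # first occurrence of each value
        deduped.append((pos, value))
        last = value
deduped.sort()                                         # restore original order
print([value for pos, value in deduped])               # -> ['a', 'c', 'b']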

What would be the best algorithm to find an ID that is not used from a table that has the capacity to hold a million rows

To elaborate ..
a) A table (BIGTABLE) has a capacity to hold a million rows with a primary Key as the ID. (random and unique)
b) What algorithm can be used to arrive at an ID that has not been used so far? This number will be used to insert another row into table BIGTABLE.
Updated the question with more details:
c) This table already has about 100K rows, and the primary key is not set as an identity column.
d) Currently, a random number is generated as the primary key and a row is inserted into this table; if the insert fails, another random number is generated. The problem is that sometimes it goes into a loop: the random numbers generated are pretty random, but unfortunately they already exist in the table. If we retry the random number generation after some time, it works.
e) The Sybase rand() function is used to generate the random number.
Hope this addition to the question helps clarify some points.
The question is of course: why do you want a random ID?
One case where I encountered a similar requirement was for client IDs of a webapp: the client identifies himself with his client ID (stored in a cookie), so it has to be hard to brute-force guess another client's ID (because that would allow hijacking his data).
The solution I went with, was to combine a sequential int32 with a random int32 to obtain an int64 that I used as the client ID. In PostgreSQL:
CREATE FUNCTION lift(integer, integer) returns bigint AS $$
SELECT ($1::bigint << 31) + $2
$$ LANGUAGE SQL;
CREATE FUNCTION random_pos_int() RETURNS integer AS $$
select floor((lift(1,0) - 1)*random())::integer
$$ LANGUAGE sql;
ALTER TABLE client ALTER COLUMN id SET DEFAULT
lift((nextval('client_id_seq'::regclass))::integer, random_pos_int());
The generated IDs are 'half' random, while the other 'half' guarantees you cannot obtain the same ID twice:
select lift(1, random_pos_int()); => 3108167398
select lift(2, random_pos_int()); => 4673906795
select lift(3, random_pos_int()); => 7414644984
...
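The same idea outside the database, as a quick sketch (the 31-bit shift mirrors the SQL lift() above; this is only an illustration):

# Combine a sequential counter with a 31-bit random value into one large ID:
# the sequential high part guarantees uniqueness, the random low part makes IDs hard to guess.
import itertools, random

def lift(seq, rnd):
    return (seq << 31) + rnd

counter = itertools.count(1)

def next_client_id():
    return lift(next(counter), random.getrandbits(31))

print(next_client_id(), next_client_id(), next_client_id())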
Why is the unique ID Random? Why not use IDENTITY?
How was the ID chosen for the existing rows?
The simplest thing to do is probably (Select Max(ID) from BIGTABLE) and then make sure your new "Random" ID is larger than that...
EDIT: Based on the added information I'd suggest that you're screwed.
If it's an option: Copy the table, then redefine it and use an Identity Column.
If, as another answer speculated, you do need a truly random Identifier: make your PK two fields. An Identity Field and then a random number.
If you simply can't change the table's structure, checking whether the id exists before trying the insert is probably your only recourse.
There isn't really a good algorithm for this. You can use this basic construct to find an unused id:
int id;
do {
    id = generateRandomId();
} while (doesIdAlreadyExist(id));
doSomethingWithNewId(id);
Your best bet is to make your key space big enough that the probability of collisions is extremely low, then don't worry about it. As mentioned, GUIDs will do this for you. Or, you can use a pure random number as long as it has enough bits.
This page has the formula for calculating the collision probability.
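As a point of reference (the standard birthday-problem approximation, not quoted from the linked page): with n random IDs drawn uniformly from N possible values, the chance of at least one collision is roughly 1 - e^(-n(n-1)/(2N)). For a million 64-bit random IDs that is about 1 - e^(-(10^6)^2 / (2 * 2^64)) ≈ 2.7e-8, i.e. negligible.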
A bit outside of the box.
Why not pre-generate your random numbers ahead of time? That way, when you insert a new row into bigtable, the check has already been made. That would make inserts into bigtable a constant time operation.
You will have to perform the checks eventually, but that could be offloaded to a second process that doesn’t involve the sensitive process of inserting into bigtable.
Or go generate a few billion random numbers, and delete the duplicates, then you won't have to worry for quite some time.
Make the key field UNIQUE and IDENTITY and you won't have to worry about it.
If this is something you'll need to do often you will probably want to maintain a live (non-db) data structure to help you quickly answer this question. A 10-way tree would be good. When the app starts it populates the tree by reading the keys from the db, and then keeps it in sync with the various inserts and deletes made in the db. So long as your app is the only one updating the db the tree can be consulted very quickly when verifying that the next large random key is not already in use.
Pick a random number, check if it already exists, if so then keep trying until you hit one that doesn't.
Edit: Or
better yet, skip the check and just try to insert the row with different IDs until it works.
First question: is this a planned database or an already functional one? If it already has data inside, then the answer by bmdhacks is correct. If it is a planned database, here is the second question:
Does your primary key really need to be random? If the answer is yes, then use a function to create a random ID from a known seed, plus a counter to know how many IDs have been created. Each ID created will increment the counter.
If you keep the seed secret (i.e., have the seed called and declared private), then no one else should be able to predict the next ID.
If ID is purely random, there is no algorithm to find an unused ID in a similarly random fashion without brute forcing. However, as long as the bit-depth of your random unique id is reasonably large (say 64 bits), you're pretty safe from collisions with only a million rows. If it collides on insert, just try again.
Depending on your database you might have the option of either using a sequencer (Oracle) or an autoincrement (MySQL, MS SQL, etc.). Or, as a last resort, do a SELECT MAX(id) + 1 as the new id - just be careful of concurrent requests so you don't end up with the same max-id twice - wrap it in a lock with the upcoming insert statement.
I've seen this done so many times before via brute force, using random number generators, and it's always a bad idea. Generating a random number outside of the db and attempting to see if it exists will put a lot of strain on your app and database. And it could lead to 2 processes picking the same id.
Your best option is to use MySQL's autoincrement ability. Other databases have similar functionality. You are guaranteed a unique id and won't have issues with concurrency.
It is probably a bad idea to scan every value in that table every time looking for a unique value. I think the way to do this would be to have a value in another table, lock on that table, read the value, calculate the value of the next id, write the value of the next id, release the lock. You can then use the id you read with the confidence your current process is the only one holding that unique value. Not sure how well it scales.
Alternatively use a GUID for your ids, since each newly generated GUID is supposed to be unique.
Is it a requirement that the new ID also be random? If so, the best answer is just to loop over (randomize, test for existence) until you find one that doesn't exist.
If the data just happens to be random, but that isn't a strong constraint, you can just use SELECT MAX(idcolumn), increment in a way appropriate to the data, and use that as the primary key for your next record.
You need to do this atomically, so either lock the table or use some other concurrency control appropriate to your DB configuration and schema. Stored procs, table locks, row locks, SELECT...FOR UPDATE, whatever.
Note that in either approach you may need to handle failed transactions. You may theoretically get duplicate key issues in the first (though that's unlikely if your key space is sparsely populated), and you are likely to get deadlocks on some DBs with approaches like SELECT...FOR UPDATE. So be sure to check and restart the transaction on error.
First check if Max(ID) + 1 is not taken and use that.
If Max(ID) + 1 exceeds the maximum, then select an ordered chunk at the top and start looping backwards looking for a hole. Repeat the chunks until you run out of numbers (in which case throw a big error).
If the "hole" is found, then save the ID in another table and you can use that as the starting point for the next case to save looping.
Skipping the reasoning of the task itself, the only algorithm that:
- will give you an ID not in the table
- that will be used to insert a new line in the table
- will result in a table still having random unique IDs
is generating a random number and then checking if it's already used.
The best algorithm in that case is to generate a random number and do a select to see if it exists, or just try to add it if your database errs out sanely. Depending on the range of your key versus how many records there are, this could take only a small amount of time. It also has the potential to spike and isn't consistent at all.
Would it be possible to run some queries on the BigTable and see if there are any ranges that could be exploited? I.e. between 100,000 and 234,000 there are no IDs yet, so we could add IDs there?
Why not append the current date in seconds to your random number? This way the only way to have an identical ID is if two users are created in the same second and are given the same random number by your generator.
