Convert ms timestamp to sequential unique 32 bit number? - algorithm

I have a table where each record has a field for the timestamp (in ms) from when it was created. This gives a unique ID for each record, as well as sequential ordering. Record 12345678 is different from and comes after 12222222.
There are not records every millisecond, or even every second (although the rate could increase).
My problem is I have a client expecting unique 32-bit IDs. These IDs also need to be numeric, unique and sequential. But the above timestamp currently is ~43 bits.
I could hash them down, but then I lose the sequential and numeric properties. I could chop off the first 10-15 bits or the last, but then I might lose uniqueness. Someone suggested accepting that no record is from before 1 Jan 2010 and taking timestamp - (40 years). I don't love it, and it doesn't work anyway: one year alone has over 3×10^10 milliseconds, which is still more than 2^32.
Any good ways of dealing with this?

If you need to handle records that are only milliseconds apart, there is no way to squeeze the timestamp down to 32 bits without the risk of collisions, simply because there might be more than 2^32 records some day.
As I understand your problem, you need to be able to find the records later by this ID, and you are not able to store a separate 32-bit ID in the records.
Is this right?
I see the following possibilities:
If you can ensure that there is no more than one record every ~4 seconds, you can simply drop the last 12 bits of your 43-bit timestamp (2^12 ms ≈ 4.1 s); see the sketch below. But this will no longer work once your timestamp grows to 44 bits.
If you can modify the timestamp of your records, you can take the above approach, and if two records are too close together, simply modify the timestamp of the later one so that the upper 32 bits of the timestamp are unique. This will work as long as the average rate of records is less than one record every 4 seconds. [Disadvantage: the timestamps of the records are no longer exactly the creation time, but still more or less OK]
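A minimal C# sketch of the first option (the method and class names are illustrative, not from the answer):

using System;

static class RecordIds
{
    // Option 1: drop the low 12 bits of the millisecond timestamp.
    // 2^12 ms ≈ 4.1 s, so IDs stay unique as long as records are at least ~4 s apart.
    public static uint ToShortId(long timestampMs)
    {
        long shifted = timestampMs >> 12;           // 43-bit timestamp -> 31-bit value
        if (shifted > uint.MaxValue)
            throw new OverflowException("Timestamp no longer fits in 32 bits.");
        return (uint)shifted;
    }
}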


How to make simple GROUP BY use index?

I want to get hourly average temperatures from a table of thermometer readings, with row structure: thermometer_id, timestamp (float, Julian days), value (float), plus an ascending index on timestamp.
To get the whole day from 4 days ago, I'm using this query:
SELECT
ROUND(AVG(value), 2), -- average temperature
COUNT(*) -- count of readings
FROM reads
WHERE
timestamp >= (julianday(date('now')) - 5) -- from 5 days ago...
AND
timestamp < (julianday(date('now')) - 4) -- ...up to 4 days ago
GROUP BY CAST(timestamp * 24 as int) -- make hours from floats, group by hours
It works correctly, but very slowly: for a 9 MB database with 355k rows it takes more than half a second to finish, which is confusingly long; it shouldn't take more than a few tens of ms. That is on not particularly fast hardware (though not an SSD), and I'm preparing it to run on a Raspberry Pi, which is quite slow in comparison, plus the table is going to grow by about 80k rows per day.
EXPLAIN QUERY PLAN shows the reason:
"USE TEMP B-TREE FOR GROUP BY"
I've tried adding day and hour columns with indexes just for the sake of quick access, but still, group by didn't use any of the indexes.
How can I tune this query or database to make this query faster?
If an index is used to optimize the GROUP BY, the timestamp search can no longer be optimized (except by using the skip-scan optimization, which your old SQLite might not have). And going through all rows in reads, only to throw most of them away because of a non-matching timestamp, would not be efficient.
If SQLite doesn't automatically do the right thing, even after running ANALYZE, you can try to force it to use a specific index:
CREATE INDEX rhv ON reads(hour, value);
SELECT ... FROM reads INDEXED BY rhv WHERE timestamp ... GROUP BY hour;
But this is unlikely to result in a query plan that is actually faster.
As #colonel-thirty-two commented, the problem was the cast and multiplication in GROUP BY CAST(timestamp * 24 as int). Such grouping completely bypasses the index, hence the slow query time. When I used the hour column for both the time comparison and the grouping, the query finished immediately.
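For reference, a sketch of that final shape (my reconstruction, not the poster's exact query), assuming an integer hour column populated at insert time as CAST(timestamp * 24 AS int) and indexed:

CREATE INDEX reads_hour ON reads(hour);

SELECT
ROUND(AVG(value), 2), -- average temperature
COUNT(*) -- count of readings
FROM reads
WHERE
hour >= CAST((julianday(date('now')) - 5) * 24 AS int)
AND
hour < CAST((julianday(date('now')) - 4) * 24 AS int)
GROUP BY hour; -- same column for filtering and grouping, so one index serves both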

How can I generate an order number with similar results as Amazon when they do it?

Note: I have already read through older questions like "What is the best format for a customer number, order number?", however my question is a little more specific.
Generating pseudo-random numbers encounters the "birthday problem" before long. For example, if I am using a 27-bit field for my order number, after 15,000 entries the chance of a collision rises to about 50%.
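For reference, the standard birthday-problem approximation: p(n) ≈ 1 - e^(-n^2 / (2N)), which crosses 50% at n ≈ 1.177·√N; with N = 2^27 ≈ 1.34 × 10^8 possible values that gives n ≈ 13,600 entries, consistent with the 15,000 figure above.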
I am wondering whether large e-commerce businesses like Amazon generate their order numbers in some other way - for example:
pre-generate the entire set and pick from them randomly (a few hundred GB of database)
Use lexicographical "next_permutation" starting from a particular seed number
MD5 or SHA-1 hash of the date, user-id, etc parameters, truncated to 14 digits
etc
All I want is a non-repeating integer (it doesn't need to be very random, except to obfuscate the total number of orders) of a certain width. Any ideas on how this can be achieved?
I suggest starting with the date in reverse format, then a sequence starting at 1, followed by a check (or random) digit. If you are never likely to exceed 100 orders per day, you need to add two digits plus the check/random digit.
The year need only include the final two digits, possibly only the final digit, depending on how long you keep records of orders: 7 years or so is usually enough, meaning the records from 2009 (beginning with 9) could be deleted during 2018 in preparation for reusing the order numbers in 2019. You could use mmdd for the next 4 digits, or simply number the days through the year and use just 3 digits - it depends how human-friendly you want the number to be. It's also possible to omit the day of the month and restart the sequential numbers at the start of each month, rather than every day.
Suppose today is 2 Nov 2017 and this is order no. 16 today; your order number would be 71102168 (where the 8 is a check digit or random digit). If you're likely to have up to, but not exceeding, a thousand orders per day, you'll need an extra digit, thus: 711020168. To avoid limiting yourself to a fixed number of digits, you might prefer to use a hyphen: 71102-168 … you could include another hyphen before the check/random digit if you wish: 71102-16-8.
If you have several areas dealing with orders, you may wish to include a depot number, perhaps at the beginning or after the date, allowing you to reuse the sequence numbers at each depot - e.g. depot 5 might be: 5-71102-168, 71102-5-168 or 711025168. Again, if you don't use hyphens, you'll need to assess whether you need up to ten, a hundred or a thousand (etc.) possible depot numbers. I hope this helps!
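A minimal C# sketch of this scheme, assuming a 2-digit daily sequence and a simple mod-10 digit-sum check digit (the check-digit algorithm and the persistence of the daily counter are assumptions, not specified above):

using System;

class OrderNumberGenerator
{
    int sequence;                          // per-day counter; persisting it is assumed elsewhere
    DateTime currentDay = DateTime.Today;

    public string Next()
    {
        if (DateTime.Today != currentDay) { currentDay = DateTime.Today; sequence = 0; }
        sequence++;
        // Single-digit year + MM + dd + 2-digit daily sequence, as in the example 71102-16
        string body = currentDay.ToString("yyMMdd").Substring(1) + sequence.ToString("D2");
        return body + CheckDigit(body);    // e.g. 7110216 plus a trailing check digit
    }

    // Simple mod-10 digit-sum check digit (a Luhn digit or a random digit would work just as well)
    static int CheckDigit(string digits)
    {
        int sum = 0;
        foreach (char c in digits) sum += c - '0';
        return (10 - sum % 10) % 10;
    }
}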
This problem has been solved; why not use a UUID? See RFC 4122. These are close enough to globally unique that you can easily combine many systems and never have a duplicate, simply because the number space is so massive.

Oracle Date Range Inconsistency

I am running a fairly large query on a specific range of dates. The query takes about 30 seconds EXCEPT when I do a range of 10/01/2011-10/31/2011. For some reason that range never finishes. For example 01/01/2011-01/31/2011, and pretty much every other range, finishes in the expected time.
Also, I noticed that doing smaller ranges, like a week, takes longer than larger ranges.
When Oracle gathers statistics on a table, it will record the low value and the high value in a date column and use that to estimate the cardinality of a predicate. If you create a histogram on the column, it will gather more detailed information about the distribution of data within the column. Otherwise, Oracle's cost based optimizer (CBO) will assume a uniform distribution.
For example, if you have a table with 1 million rows and a DATE column with a low value of January 1, 2001 and a high value of January 1, 2011, it will assume that approximately 10% of the data is in the range January 1, 2001 - January 1, 2002 and that roughly 0.027% of the data comes from any single day, say March 3, 2008 (1 / (10 years * 365 days per year + leap days)).
So long as your queries use dates from within the known range, the optimizer's cardinality estimates are generally pretty good so its decisions about what plan to use are pretty good. If you go a bit beyond the upper or lower bound, the estimates are still pretty good because the optimizer assumes that there probably is data that is larger or smaller than it saw when it sampled the data to gather the statistics. But when you get too far from the range that the optimizer statistics expect to see, the optimizer's cardinality estimates get too far out of line and it eventually chooses a bad plan. In your case, prior to refreshing the statistics, the maximum value the optimizer was expecting was probably September 25 or 26, 2011. When your query looked for data for the month of October, 2011, the optimizer most likely expected that the query would return very few rows and chose a plan that was optimized for that scenario rather than for the larger number of rows that were actually returned. That caused the plan to be much worse given the actual volume of data that was returned.
In Oracle 10.2, when Oracle does a hard parse and generates a plan for a query that is loaded into the shared pool, it peeks at the bind variable values and uses those values to estimate the number of rows a query will return and thus the most efficient query plan. Once a query plan has been created and until the plan is aged out of the shared pool, subsequent executions of the same query will use the same query plan regardless of the values of the bind variables. Of course, the next time the query has to be hard parsed because the plan was aged out, Oracle will peek and will likely see new bind variable values.
Bind variable peeking is not a particularly well-loved feature (Adaptive Cursor Sharing in 11g is much better) because it makes it very difficult for a DBA or a developer to predict what plan is going to be used at any particular instant because you're never sure if the bind variable values that the optimizer saw during the hard parse are representative of the bind variable values you generally see. For example, if you are searching over a 1 day range, an index scan would almost certainly be more efficient. If you're searching over a 5 year range, a table scan would almost certainly be more efficient. But you end up using whatever plan was chosen during the hard parse.
Most likely, you can resolve the problem simply by ensuring that statistics are gathered more frequently on tables that are frequently queried based on ranges of monotonically increasing values (date columns being by far the most common such column). In your case, it had been roughly 6 weeks since statistics had been gathered before the problem arose so it would probably be safe to ensure that statistics are gathered on this table every month or every couple weeks depending on how costly it is to gather statistics.
You could also use the DBMS_STATS.SET_COLUMN_STATS procedure to explicitly set the statistics for this column on a more regular basis. That requires more coding and work but saves you the time of gathering statistics. That can be hugely beneficial in a data warehouse environment but it's probably overkill in a more normal OLTP environment.
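A minimal sketch of gathering statistics on such a table on a schedule (the schema and table names are placeholders):

BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => 'APP_OWNER',   -- placeholder schema
    tabname => 'ORDERS',      -- placeholder table
    cascade => TRUE);         -- also gather statistics on the table's indexes
END;
/

Running this monthly or every couple of weeks (via DBMS_SCHEDULER or an existing maintenance window) keeps the column's known high value close to the dates your queries actually ask for.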

How to predict table sizes Oracle?

I'm trying to do a growth prediction on some tables I have, and for that I've got to do some calculations on my row sizes, how many rows I generate per day and, well... the maths.
I'm calculating the average size of each row in my table as the sum of the average sizes of its fields. So basically:
SELECT 'COL1' , avg(vsize(COL1)) FROM TABLE union
SELECT 'COL2' , avg(vsize(COL2)) FROM TABLE
Sum that up, multiply by the number of entries of a day and work the predictions from there.
It turns out that for one of the tables I looked at, the resulting size is a lot smaller than I thought it would be, which got me wondering whether my method is right.
Also, I did not consider indexes sizes for my predictions - and of course I should.
My questions are:
Is this method I'm using reliable?
Tips on how could I work the predictions for the Indexes?
I've done my googling, but the methods I find are all about segments and extents, or else calculations based on the whole table. I need the step with the actual rows of my table to do the predictions (I have to analyse the data in the table in order to figure out how many records per day).
And finally, this is an approximation. I know I'm missing some bytes here and there with overheads and stuff. I just want to make sure I'm only missing bytes and not gigas :)
1) Your method of calculating the average size of a row is sound. (Though be aware that if your column contains NULLs, you should use avg(nvl(vsize(col1), 0)) instead of avg(vsize(COL1)).) However, it doesn't take into account the physical arrangement of rows.
First of all, it doesn't take into account the header info (from both blocks and rows): you can't fit 8k data into 8k blocks. See the documentation on data block format for more information.
Then, rows are not always stored neatly packed. Oracle leaves some space in each block so that rows can grow when they are updated (governed by the pctfree parameter). Also, when rows are deleted the empty space is not reclaimed right away (if you're not using ASSM with locally managed tablespaces, the amount of free space required for a block to return to the list of available blocks depends on pctused).
If you already have some representative data in your table, you can estimate the amount of extra space you will need by comparing the space physically used (all_tables.blocks*block_size after having gathered statistics) to the average row length.
By the way Oracle can easily give you a good estimate of the average row length: gather statistics on the table and query all_tables.avg_row_len.
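A sketch of that comparison (the table name and the 8 KB block size are placeholders; run it after gathering statistics):

SELECT table_name,
       num_rows,
       avg_row_len,                         -- Oracle's own average row length estimate
       blocks,                              -- blocks allocated below the high-water mark
       blocks * 8192 AS approx_bytes_used   -- assuming an 8 KB block size
FROM all_tables
WHERE table_name = 'MY_TABLE';

Comparing approx_bytes_used with num_rows * avg_row_len shows how much overhead (block and row headers, pctfree, fragmentation) to add on top of the per-row estimate.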
2) Most of the time (read: unless there is a bug or you have an atypical use of the index), an index will grow proportionally to the number of rows.
If you have representative data, you can have a good estimation of its future size by multiplying its actual size by the relative growth of the number of rows.
The last time Oracle published their formulae for estimating the size of schema objects was in Oracle 8.0, which means the linked document is ten years out of date. However, I don't expect very much has changed in how Oracle reserves segment header, block header, or row header information.

Generating a Not-Quite-Globally Unique Identifier

I've found a number of different questions on generating UIDs, but as far as I can tell, my requirements here are somewhat unique (ha).
To summarize: I need to generate a very short ID that's "locally" unique, but does not have to be "globally" or "universally" unique. The constraints are not simply based on aesthetic or space concerns, but on the fact that this is essentially being used as a hardware tag and is thus subject to the hardware's constraints. Here are the specifications:
Hard Requirements
The ID must contain only decimal digits (the underlying data is a BCD);
The maximum length of the ID is 12 characters (digits).
Must be generated offline - a database/web connection is not always available!
Soft Requirements
We'd like it to begin with the calendar year and/or month. As this does waste a lot of entropy, I don't mind compromising on this or scrapping it entirely (if necessary).
IDs generated from a particular machine should appear sequential.
IDs do not have to sort by machine - for example, it's perfectly fine for machine 1 to spit out [123000, 124000, 125000], and machine 2 to spit out [123500, 123600, 124100].
However, the more sequential-looking in a collective sense, the better. A set of IDs like [200912000001, 200912000002, 200912000003, ...] would be perfect, although this obviously does not scale across multiple machines.
Usage Scenario:
IDs within the scope of this scheme will be generated from 10, maybe 100 different machines at most.
There will not be more than a few million IDs generated, total.
Concurrency is extremely low. A single machine will not generate IDs more often than every 5 minutes or so. Also, most likely no more than 5 machines at a time will generate IDs within the same hour or even the same day. I expect less than 100 IDs to be generated within one day on a given machine and less than 500 for all machines.
A small number of machines (3-5) would most likely be responsible for generating more than 80% of the IDs.
I know that it's possible to encode a timestamp down to 100 ms or even 10 ms precision using less than 12 decimal digits, which is more than enough to guarantee a "unique enough" ID for this application. The reason I am asking this here on SO, is because I would really like to either try to incorporate human-readable year/month in there or encode some piece of information about the source machine, or both.
I'm hoping that someone can either help with a compromise on those soft requirements... or explain why none of them are possible given the other requirements.
(P.S. My "native" language is C# but code in any language or even pseudocode is fine if anybody has any brilliant ideas.)
Update:
Now that I've had the chance to sleep on it, I think what I'm actually going to do is use a timestamp encoding by default, and allow individual installations to switch to a machine-sequential ID by defining their own 2- or 3-digit machine ID. That way, customers who want to mess with the ID and pack in human-readable information can sort out their own method of ensuring uniqueness, and we're not responsible for misuse. Maybe we help out by providing a server utility to handle machine IDs if they happen to be doing all online installations.
"The reason I am asking this here on
SO, is because I would really like to
either try to incorporate
human-readable year/month in there or
encode some piece of information about
the source machine, or both."
Let me start by saying I've dealt with this before and attempting to store useful information into a serial number is a BAD idea long term. A device serial number should be meaningless. Just like the primary key of a database record should be meaningless.
The second you start trying to put real data into your serial number, you've just thrown BUSINESS LOGIC into it and you will be forced to maintain it like any other piece of code. Future you will hate past you. Trust me on this. ;o)
If you attempt to store date/time values, then you'll waste numeric space with invalid time/dates. For instance you'll never have anything greater than 12 in the month field.
A straight epoch / Unix-time counter would be better, but for a machine that only generates a few IDs per minute you'll still waste a lot of space.
12 digits is not a lot of space. Look at the VIN page on Wikipedia. Space for only a few manufacturers, only a few thousand cars. They are now reusing VINs because they ran out of space by packing meaning into it.
http://en.wikipedia.org/wiki/VIN
That's not to say ALL meaning in a serial number is bad, just keep it strictly limited to making sure the numbers don't collide.
Something like this...
Position 1-3: 999 Machines
Position 4-12: Sequential Numbers
That's ALL you need to avoid collisions. If you add a location digit, then you are screwed when you get to 11 locations.
Sorry if this sounds like a rant. I deal with this a lot manufacturing electronics and various machined parts. It has never ended well long term unless there's LOTS of space available, or a secondary tag (which -wow- provides the necessary ID space mentioned before).
How about yyMMddhhmmID?
yy = two-digit year
MM = two-digit month
dd = two-digit day
hh = two-digit hour (24-hour time)
mm = two-digit minute
ID = machine-specific ID
Example: 0912113201 from machine with ID = 01.
Alternatively (if you don't like two-digit years (Y2K lol)), how about yyyyMMIDxxxx?
yyyy = four-digit year
MM = two-digit month
ID = machine-specific ID
xxxx = sequentially-incremented integer
Example: 200912010001 from machine with ID = 01.
As you said that each machine will only generate one identifier maximum every five minutes, this gives you room for 8,928 (24 * 31 * 60 / 5 = 8928) identifiers per month which will fit in xxxx. Here you could squeeze the year down to a three-digit year yyy (009, e.g.) if you needed an extra digit in the xxxx sequence or the machine ID.
Both of these fit timestamp/machine ID as you requested.
We all like concrete code:
class Machine {
    public int ID { get; private set; }

    public Machine(int id) {
        ID = id;
    }
}

class IdentifierGenerator {
    readonly Machine machine;
    int seed;
    const int digits = 4;
    readonly int modulus;
    readonly string seedFormat;

    public IdentifierGenerator(Machine machine) {
        this.machine = machine;
        this.modulus = (int)Math.Pow(10, digits);
        this.seedFormat = new string('0', digits);
    }

    public string Generate() {
        string identifier = DateTime.Now.ToString("yyyyMM")
            + machine.ID.ToString("00")
            + seed.ToString(seedFormat);
        seed = (seed + 1) % modulus;
        return identifier;
    }
}

Machine m = new Machine(1);
IdentifierGenerator gen = new IdentifierGenerator(m);
Console.WriteLine(gen.Generate());
Console.WriteLine(gen.Generate());
Outputs:
200912010000
200912010001
When you install your software, also install a machine ID file/registry key which contains a unique numeric ID. As you only have a few machines, this should not take more than 3 or 4 digits. Use these as the most significant digits. Generate the remaining digits sequentially, starting at 1.
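A minimal C# sketch of this approach (the file locations, the 3-digit machine ID and the 9-digit counter width are assumptions for illustration, not part of the answer):

using System;
using System.IO;

static class LocalIdGenerator
{
    // The machine-id file is written once by the installer; the counter file persists the sequence.
    const string MachineIdFile = @"C:\ProgramData\MyApp\machine-id.txt";
    const string CounterFile = @"C:\ProgramData\MyApp\counter.txt";

    public static string Next()
    {
        int machineId = int.Parse(File.ReadAllText(MachineIdFile).Trim());
        long counter = File.Exists(CounterFile)
            ? long.Parse(File.ReadAllText(CounterFile).Trim()) + 1
            : 1;
        File.WriteAllText(CounterFile, counter.ToString());
        // 3 machine digits + 9 sequential digits = 12 digits total
        return machineId.ToString("D3") + counter.ToString("D9");
    }
}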
I'm gathering you're developing for Windows (re: your comment about "MSI/EXE" in response to Jason's answer). As such, you could use WMI or similar to get some unique hardware attribute (processor or HDD serial number, or the NIC's MAC address, for example) to base a unique machine ID on. An alternative might be using the unique serial number of the hardware you are yourself developing (if it has one).
That would most likely be longer than you need, so you could potentially truncate or hash it to reduce it to (say) 16 bits or so and use that as your machine ID. Obviously, this may cause collisions, but the small number of machines (~100) means this is unlikely, and using the truncated output of a cryptographic hash (say MD5) makes this even less so.
Then, since you have a (most probably unique) machine ID, you can then generate essentially unique IDs using the approaches listed by the other answers.
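A minimal C# sketch of deriving a 16-bit machine ID from a hardware attribute via WMI plus a truncated hash (Win32_Processor.ProcessorId is a standard WMI property; the fallback and the 16-bit width are assumptions):

using System;
using System.Management;   // add a reference to System.Management
using System.Security.Cryptography;
using System.Text;

static class MachineIdProvider
{
    public static ushort Get16BitId()
    {
        // Read the CPU's ProcessorId via WMI; fall back to the machine name if unavailable
        string source = Environment.MachineName;
        using (ManagementObjectSearcher searcher =
                   new ManagementObjectSearcher("SELECT ProcessorId FROM Win32_Processor"))
        {
            foreach (ManagementObject mo in searcher.Get())
            {
                object id = mo["ProcessorId"];
                if (id != null) { source = id.ToString(); break; }
            }
        }

        // Hash and truncate to 16 bits; with ~100 machines collisions are unlikely
        using (MD5 md5 = MD5.Create())
        {
            byte[] hash = md5.ComputeHash(Encoding.UTF8.GetBytes(source));
            return BitConverter.ToUInt16(hash, 0);
        }
    }
}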
There are 864000 100ms ticks in 24 hours, so tacking that onto a date might work 09.12.24.86400.0, but you have to lose the century to fit in 12 digits, and you don't have any space for machine IDs.
Idea number one:
YYMMDDmmnnnn
where
YY is two digit year
MM is two digit month
DD is two digit day
mm is a two digit code unique to that machine (00 - 99)
nnnn is a sequential four digit code for that machine on that day.
~~
Idea number two:
mmmmnnnnnnnn
Where
mmmm is four digit code unique to the machine
nnnnnnnn is a sequential number.
My suggestion would be to combine multiple approaches in a single id. For example: start with the two year digits, the two month digits and then generate a random number with the time as a seed for the next several digits and then a unique machine id for the last couple. Or something like that.
Each machine gets a starting id of DDNNN, where DD is a unique machine identifier and NNN is the current identifier generated by that machine that day. Each machine keeps track of the ids that it has generated on a particular date and allocates the next one when it needs a new one by incrementing the last one by 1. It resets its counter to 0 at the beginning of each day. The date YYYYDOY is prepended to the number generated by each machine (4-digit year, 3-digit day of year). The number is guaranteed unique because the machine identifier is unique.
If you needed more space for more machines, you could drop the millenium from the year and add a digit for the machine id: YYYDOYDDDNNN.
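A minimal C# sketch of the YYYYDOY + DD + NNN scheme described above (persistence of the daily counter across restarts is assumed and left out):

using System;

class DayOfYearIdGenerator
{
    readonly int machineId;       // the unique DD machine identifier (00-99)
    int counter;                  // NNN, reset at the start of each day
    DateTime currentDay = DateTime.Today;

    public DayOfYearIdGenerator(int machineId) { this.machineId = machineId; }

    public string Next()
    {
        if (DateTime.Today != currentDay) { currentDay = DateTime.Today; counter = 0; }
        string id = currentDay.Year.ToString("D4")       // YYYY
            + currentDay.DayOfYear.ToString("D3")        // DOY
            + machineId.ToString("D2")                   // DD
            + counter.ToString("D3");                    // NNN
        counter++;
        return id;    // 4 + 3 + 2 + 3 = 12 digits, e.g. "200933607001"
    }
}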
"A single machine will not generate IDs more often than every 5 minutes or so"
Assuming this is true, then just use the timestamp. (32 bit Unix time has 10 decimal digits but will run out in 2038)
But I think it's rather optimistic to assume there won't be a collision.
"IDs generated from a particular machine should appear sequential."
Then your only option is to use a sequence number.
Which doesn't really seem to match what you say in later constraints?
Concatenate a padded version of the node id to get unique values across the cluster.
Use the MAC address of the machine as a machine ID. You can use it to encode your timestamp, e.g. via XOR, or you can append/prepend it to the generated serial code.
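A minimal C# sketch of reading the MAC address for use as a machine ID (which NIC to prefer is an assumption; packing it into a 48-bit integer is only one way to combine it with a timestamp or serial):

using System;
using System.Linq;
using System.Net.NetworkInformation;

static class MacMachineId
{
    public static long Get()
    {
        // Take the first operational, non-loopback NIC's physical (MAC) address
        byte[] mac = NetworkInterface.GetAllNetworkInterfaces()
            .Where(n => n.OperationalStatus == OperationalStatus.Up
                     && n.NetworkInterfaceType != NetworkInterfaceType.Loopback)
            .Select(n => n.GetPhysicalAddress().GetAddressBytes())
            .FirstOrDefault(b => b.Length == 6);
        if (mac == null) return 0;    // no usable NIC found

        // Pack the 6 MAC bytes into a 48-bit value; this could be XOR-ed with a timestamp
        long id = 0;
        foreach (byte b in mac) id = (id << 8) | b;
        return id;
    }
}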
