Create a sorted list from n sorted sublists (efficiently) - sorting

I was playing around with parallel sorting tonight.
creating sort file
naive-sort ...
1000000
23.61265496
partial-hyper-sort ...
4
7.4924575
simple-hyper-sort ...
1000000
141.7945921
naive-hyper-sort ...
1000000
23.5756172
Two things stand out.
a) naive-hyper-sort is just as fast as ordinary sort
b) The sorting in partial-hyper-sort is 66% faster than ordinary sort.
My problem: partial-hyper-sort is exactly that: "partial". It returns (on my system) 4 sublists, but of course you want just one. My attempt to merge them into one (simple-hyper-sort) is an order of magnitude slower than the sorting itself!
So how do I get this faster? And if someone can explain why naive-hyper-sort is not faster than naive-sort, bonus points and a cookie (seriously, a literal cookie).
create-sortfile
unless "tosort.txt".IO.e;
my $start = DateTime.now;
say "naive-sort ...";
say naive-sort.elems;
say DateTime.now - $start;
$start = DateTime.now;
say "partial-hyper-sort ...";
say partial-hyper-sort.elems;
say DateTime.now - $start;
$start = DateTime.now;
say "simple-hyper-sort ...";
say simple-hyper-sort.elems;
say DateTime.now - $start;
$start = DateTime.now;
say "naive-hyper-sort ...";
say naive-hyper-sort.elems;
say DateTime.now - $start;
sub create-sortfile
{
say "creating sort file";
my $to-sort = "tosort.txt".IO.open(:w);
$to-sort.say( ( 10_000 .. 99_999 ).pick )
for ( 1 .. 1_000_000 );
$to-sort.close;
}
sub simple-hyper-sort
{
my $to-sort = "tosort.txt".IO.open( :r );
my $lines = $to-sort.lines;
my $degrees = $*KERNEL.cpu-cores;
my $batch = $lines.elems div $degrees;
my @parts = $lines.batch( $batch ).hyper( :batch(1) ).map({ .sort });
my @index = 0 xx $degrees;
return gather loop
{
my $smallest = Inf;
my $smallest-index = -1;
my $smallest-degree = -1;
for ^$degrees -> $degree
{
my $index = @index[$degree];
if ( $index < $batch )
{
my $value = @parts[$degree;$index];
if $value < $smallest
{
$smallest = $value;
$smallest-index = $index;
$smallest-degree = $degree;
}
}
}
last if $smallest-index < 0;
@index[$smallest-degree]++;
take $smallest;
}
}
sub partial-hyper-sort
{
my $to-sort = "tosort.txt".IO.open( :r );
my $lines = $to-sort.lines;
my $degrees = $*KERNEL.cpu-cores;
my $batch = $lines.elems div $degrees;
my @parts = $lines.batch( $batch ).hyper( :batch(1) ).map({ .sort });
}
multi sub naive-hyper-sort
{
my $to-sort = "tosort.txt".IO.open( :r );
my $lines = $to-sort.lines;
my $degrees = $*KERNEL.cpu-cores;
my $batch = $lines.elems div $degrees;
$lines.hyper( :$batch, :$degrees ).sort;
}
sub naive-sort {
my $to-sort = "tosort.txt".IO.open( :r );
$to-sort.lines.sort;
}

Using .hyper and .race only results in a speedup if there is a parallel implementation of the operation that follows. At the time of writing, there is no parallel sort implementation in Rakudo, which means that it will fall back to using the regular sort implementation. So, this answers why naive-hyper-sort doesn't come out faster right now (however, it almost certainly will in the future).
The idea in simple-hyper-sort is along the right lines: break the data up into sublists, sort the sublists, and then merge them. We can therefore parallelize the sorting of the sublists. As you've observed, whether this achieves a win depends on the merge operation itself being fast enough, so we'd need to optimize that carefully.
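As an aside, a standard way to keep a k-way merge fast is to hold the current head of each sublist in a min-heap, which makes the merge O(n log k) instead of rescanning every head on each step. A minimal sketch of the idea (in Python, for illustration only; the names are mine):
import heapq

def k_way_merge(sorted_lists):
    # Seed the heap with (value, list-index, element-index) for each non-empty sublist.
    heap = [(lst[0], i, 0) for i, lst in enumerate(sorted_lists) if lst]
    heapq.heapify(heap)
    result = []
    while heap:
        value, i, j = heapq.heappop(heap)
        result.append(value)
        # Push the next element from the sublist we just consumed from.
        if j + 1 < len(sorted_lists[i]):
            heapq.heappush(heap, (sorted_lists[i][j + 1], i, j + 1))
    return result

# Example: merge four sorted chunks back into one sorted list.
print(k_way_merge([[1, 4, 9], [2, 3, 8], [5, 7], [6]]))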
It's much easier to write a tight (not to mention correct!) merge operation if it only needs to merge two sublists. Thus, we need to structure the problem in a way that gives us that. This points to a different approach:
1. Break the list in half
2. Start a task to sort each half
3. Await the two tasks
4. Merge the results of the two tasks
Note that step 2 involves recursion. We stop recursing when the size of a partition is too small, and use the built-in sort on such partitions. (We can choose to define "too small" by dividing the input list size by the number of CPU cores, along the lines of your example.)
Thus we get a solution like this:
sub parallel-merge-sort {
my $to-sort = "tosort.txt".IO.open( :r );
my $lines = $to-sort.lines;
return do-sort $lines, ceiling($lines.elems / $*KERNEL.cpu-cores);
sub do-sort(@in, $limit) {
if @in.elems < $limit {
@in.sort
}
else {
my $pivot = @in.elems div 2;
merge |await
(start do-sort @in[0..$pivot], $limit),
(start do-sort @in[$pivot^..@in.end], $limit)
}
}
sub merge(@a, @b) {
my @result;
my int $a-idx = 0;
my int $a-elems = +@a;
my int $b-idx = 0;
my int $b-elems = +@b;
my int $r-idx = 0;
while $a-idx < $a-elems && $b-idx < $b-elems {
my $a := @a[$a-idx];
my $b := @b[$b-idx];
if $a before $b {
$a-idx++;
@result[$r-idx++] := $a;
}
else {
$b-idx++;
@result[$r-idx++] := $b;
}
}
if $a-idx < $a-elems {
@result[$r-idx++] := $_ for @a[$a-idx..*];
}
elsif $b-idx < $b-elems {
@result[$r-idx++] := $_ for @b[$b-idx..*];
}
return @result;
}
}
I didn't spend terribly long optimizing this (haven't profiled, etc.), but did take care to use natives and binding in order to reduce allocations. On My Machine, this does give a speedup over the serial sorting, however.
One other easy speedup we can get on this - at the cost of a tad more complexity in the code - comes from realizing that we don't need to slice the input in do-sort until the point that we actually need to send it to the built-in sort:
sub do-sort(@in, $limit, $from = 0, $to = @in.end) {
my $elems = $to - $from;
if $elems < $limit {
@in[$from..$to].sort
}
else {
my $pivot = $from + $elems div 2;
merge |await
(start do-sort @in, $limit, $from, $pivot),
(start do-sort @in, $limit, $pivot + 1, $to)
}
}
That saves some work; by this point, I measure a factor-of-two speedup on the machine I'm testing on. It isn't amazing, but given that we have an enforced serial O(n) step, plus a bunch more parallelized O(n) merge steps, on top of the serial sorts of the partitions, it's perhaps not so disappointing after all when compared with the serial sort algorithm.
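As a rough back-of-the-envelope model of why the speedup saturates (my own sketch, assuming p cores, n elements, and perfect scheduling):
\[
T_{\text{parallel}} \approx \underbrace{\frac{n}{p}\log\frac{n}{p}}_{\text{leaf sorts}} + \underbrace{\sum_{d=1}^{\log_2 p} \frac{2^d\,n}{p}}_{\text{merge levels}} \approx \frac{n}{p}\log\frac{n}{p} + 2n,
\qquad
T_{\text{serial}} \approx n\log n
\]
The final merge alone costs about n comparisons on a single thread, so the parallel time stays Ω(n) no matter how many cores are added, and the overall speedup is capped well below p.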


Perl6 : What is the best way for dealing with very big files?

Last week I decided to give Perl6 a try and started to reimplement one of my programs.
I have to say, Perl6 makes object-oriented programming very easy, an aspect that was very painful for me in Perl5.
My program has to read and store big files, such as whole genomes (up to 3 GB and more, see example 1 below) or tabulated data.
The first version of the code was written the Perl5 way, by iterating line by line ("genome.fa".IO.lines). It was very slow and unusable in terms of execution time.
my class fasta {
has Str $.file is required;
has %!seq;
submethod TWEAK() {
my $id;
my $s;
for $!file.IO.lines -> $line {
if $line ~~ /^\>/ {
say $id;
if $id.defined {
%!seq{$id} = sequence.new(id => $id, seq => $s);
}
my $l = $line;
$l ~~ s:g/^\>//;
$id = $l;
$s = "";
}
else {
$s ~= $line;
}
}
%!seq{$id} = sequence.new(id => $id, seq => $s);
}
}
sub MAIN()
{
my $f = fasta.new(file => "genome.fa");
}
So after a little bit of RTFM, I changed to a slurp of the file and a split on \n, which I parsed with a for loop. This way I managed to load the data in 2 minutes. Much better, but not enough. By cheating, I mean by removing a maximum of \n (Example 2), I decreased the execution time to 30 seconds. Quite good, but not totally satisfying, as this fasta format is not the most commonly used one.
my class fasta {
has Str $.file is required;
has %!seq;
submethod TWEAK() {
my $id;
my $s;
say "Slurping ...";
my $f = $!file.IO.slurp;
say "Spliting file ...";
my @lines = $f.split(/\n/);
say "Parsing lines ...";
for @lines -> $line {
if $line !~~ /^\>/ {
$s ~= $line;
}
else {
say $id;
if $id.defined {
%!seq{$id} = seq.new(id => $id, seq => $s);
}
$id = $line;
$id ~~ s:g/^\>//;
$s = "";
}
}
%!seq{$id} = seq.new(id => $id, seq => $s);
}
}
sub MAIN()
{
my $f = fasta.new(file => "genome.fa");
}
So I RTFM'd again and discovered the magic of Grammars. New version, and an execution time of 45 seconds whatever fasta format is used. Not the fastest way, but more elegant and stable.
my grammar fastaGrammar {
token TOP { <fasta>+ }
token fasta {<.ws><header><seq> }
token header { <sup><id>\n }
token sup { '>' }
token id { <[\d\w]>+ }
token seq { [<[ACGTNacgtn]>+\n]+ }
}
my class fastaActions {
method TOP ($/){
my @seqArray;
for $<fasta> -> $f {
@seqArray.push: seq.new(id => $f.<header><id>.made, seq => $f<seq>.made);
}
make @seqArray;
}
method fasta ($/) { make ~$/; }
method id ($/) { make ~$/; }
method seq ($/) { make $/.subst("\n", "", :g); }
}
my class fasta {
has Str $.file is required;
has %!seq;
submethod TWEAK() {
say "=> Slurping ...";
my $f = $!file.IO.slurp;
say "=> Grammaring ...";
my @seqArray = fastaGrammar.parse($f, actions => fastaActions).made;
say "=> Storing data ...";
for @seqArray -> $s {
%!seq{$s.id} = $s;
}
}
}
sub MAIN()
{
my $f = fasta.new(file => "genome.fa");
}
I think that I found a good solution to handle these kinds of big files, but performance is still below that of Perl5.
As a newbie in Perl6, I would be interested to know if there are better ways to deal with big data, or if there is some limitation due to the Perl6 implementation.
As a newbie in Perl6, I would ask two questions:
Are there other Perl6 mechanisms that I'm not aware of yet, or that are not yet documented, for storing huge data from a file (like my genomes)?
Have I reached the maximum performance of the current version of Perl6?
Thanks for reading !
Fasta Example 1 :
>2L
CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATG
ATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGAT
...
>3R
CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATG
ATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGATGAACGAGAT
...
Fasta example 2 :
>2L
GACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCAT...
>3R
TAGGGAGAAATATGATCGCGTATGCGAGAGTAGTGCCAACATATTGTGCT...
EDIT
I applied the advice of @Christoph and @timotimo and tested with this code:
my class fasta {
has Str $.file is required;
has %!seq;
submethod TWEAK() {
say "=> Slurping / Parsing / Storing ...";
%!seq = slurp($!file, :enc<latin1>).split('>').skip(1).map: {
.head => seq.new(id => .head, seq => .skip(1).join) given .split("\n").cache;
}
}
}
sub MAIN()
{
my $f = fasta.new(file => "genome.fa");
}
The program finished in 2.7s, which is great!
I also tried this code on the wheat genome (10 Gb). It finished in 35.2s.
Perl6 is not so slow after all!
Big thanks for the help!
One simple improvement is to use a fixed-width encoding such as latin1 to speed up character decoding, though I'm not sure how much this will help.
As far as Rakudo's regex/grammar engine is concerned, I've found it to be pretty slow, so it might indeed be necessary to take a more low-level approach.
I did not do any benchmarking, but what I'd try first is something like this:
my %seqs = slurp('genome.fa', :enc<latin1>).split('>')[1..*].map: {
.[0] => .[1..*].join given .split("\n");
}
As the Perl6 standard library is implemented in Perl6 itself, it is sometimes possible to improve performance by just avoiding it, writing code in an imperative style such as this:
my %seqs;
my $data = slurp('genome.fa', :enc<latin1>);
my $pos = 0;
loop {
$pos = $data.index('>', $pos) // last;
my $ks = $pos + 1;
my $ke = $data.index("\n", $ks);
my $ss = $ke + 1;
my $se = $data.index('>', $ss) // $data.chars;
my @lines;
$pos = $ss;
while $pos < $se {
my $end = $data.index("\n", $pos);
@lines.push($data.substr($pos..^$end));
$pos = $end + 1
}
%seqs{$data.substr($ks..^$ke)} = @lines.join;
}
However, if the parts of the standard library used have seen some performance work, this might actually make things worse. In that case, the next step to take would be adding low-level type annotations such as str and int and replacing calls to routines such as .index with NQP builtins such as nqp::index.
If that's still too slow, you're out of luck and will need to switch languages, e.g. calling into Perl5 using Inline::Perl5 or into C using NativeCall.
Note that @timotimo has done some performance measurements and wrote an article about it.
If my short version is the baseline, the imperative version improves performance by 2.4x.
He actually managed to squeeze a 3x improvement out of the short version by rewriting it to
my %seqs = slurp('genome.fa', :enc<latin-1>).split('>').skip(1).map: {
.head => .skip(1).join given .split("\n").cache;
}
Finally, rewriting the imperative version using NQP builtins sped things up by a factor of 17x. Given potential portability issues, writing such code is generally discouraged, but it may be necessary for now if you really need that level of performance:
use nqp;
my Mu $seqs := nqp::hash();
my str $data = slurp('genome.fa', :enc<latin1>);
my int $pos = 0;
my str @lines;
loop {
$pos = nqp::index($data, '>', $pos);
last if $pos < 0;
my int $ks = $pos + 1;
my int $ke = nqp::index($data, "\n", $ks);
my int $ss = $ke + 1;
my int $se = nqp::index($data ,'>', $ss);
if $se < 0 {
$se = nqp::chars($data);
}
$pos = $ss;
my int $end;
while $pos < $se {
$end = nqp::index($data, "\n", $pos);
nqp::push_s(@lines, nqp::substr($data, $pos, $end - $pos));
$pos = $end + 1
}
nqp::bindkey($seqs, nqp::substr($data, $ks, $ke - $ks), nqp::join("", @lines));
nqp::setelems(@lines, 0);
}

Algorithm in perl to find the value in array - Absolutely Interview Questions

I am asked to write a perl program to find a value (from user input) in an array. If it matches an element, report "it's ok". If it does not match, find where the value falls among index[0] to index[n], and then report which of the two neighbouring elements it is nearer to, e.g. index[0] or index[1].
Let me explain.
Given array: 10 15 20 25 30;
Get the value from the user: 14 (e.g.)
Hence 14 falls between the two elements 10 (array[0]) and 15 (array[1]).
The constraint is: do not use more than one for loop and never use a while loop. You may use one for loop and as many if conditions as needed.
I got the output with the code below:
use strict;
use warnings;
my @arr1 = qw(10 15 20 25 30);
my $in = <STDIN>;
chomp($in);
if(grep /$in/, @arr1)
{ } #print "S: $in\n"; }
else
{
for(my $i=0; $i<scalar(@arr1); $i++)
{
my $j = $i + 1;
if($in > $arr1[$i] && $in < $arr1[$j])
{
#print "SN: $arr1[$i]\t$arr1[$j]\n";
my ($inc, $dec) = "0";
my $chk1 = $arr1[$i] + 1;
AGAIN1:
if($in == $chk1)
{ }
else
{ $chk1++; $inc++; goto AGAIN1; }
my $chk2 = $arr1[$j] - 1;
AGAIN2:
if($in == $chk2){ }
else
{ $chk2--; $dec++; goto AGAIN2; }
if($inc > $dec)
{ print "Matched value nearest to $arr1[$j]\n"; }
elsif($inc < $dec)
{ print "Matched value nearest to $arr1[$i]\n"; }
}
}
}
However, my question is: is there a better way, algorithmically? If someone can help with this, it would be appreciated.
Thanks in advance.
You seem determined to make this as complicated as possible :-)
Your specification isn't completely clear, but I think this does what you want:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
my @array = qw[10 15 20 25 30];
chomp(my $in = <STDIN>);
if ($in < $array[0]) {
say "$in is less than first element in the array";
exit;
}
if ($in > $array[-1]) {
say "$in is greater than last element in the array";
exit;
}
for (0 .. $#array) {
if ($in == $array[$_]) {
say "$in is in the array";
exit;
}
if ($in < $array[$_]) {
if ($in - $array[$_ - 1] < $array[$_] - $in) {
say "$in is closest to $array[$_ - 1]";
} else {
say "$in is closest to $array[$_]";
}
exit;
}
}
say "Shouldn't get here!";
Using the helper functions any and reduce from the core module List::Util and the built in abs.
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw/reduce any/;
my @arr1 = qw(10 15 20 25 30);
chomp(my $in = <STDIN>);
if (any {$in == $_} @arr1) {
print "$in is in the array\n";
}
else {
my $i = reduce { abs($in - $arr1[$a]) > abs($in - $arr1[$b]) ? $b : $a} 0 .. $#arr1;
print "$in is closest to $arr1[$i]\n";
}
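Since the array is sorted, the nearest-element lookup can also be done with a binary search instead of a linear scan, which is O(log n). A minimal sketch of that idea (shown in Python rather than Perl, purely as an illustration and not part of either answer above):
from bisect import bisect_left

def nearest(sorted_values, target):
    # Find the insertion point for target, then compare its two neighbours.
    pos = bisect_left(sorted_values, target)
    if pos == 0:
        return sorted_values[0]
    if pos == len(sorted_values):
        return sorted_values[-1]
    before, after = sorted_values[pos - 1], sorted_values[pos]
    return before if target - before < after - target else after

print(nearest([10, 15, 20, 25, 30], 14))  # prints 15 (14 is 1 away from 15, 4 away from 10)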

Why iterating with $_ takes longer

Based on many books and articles on the Internet, I was wondering whether $_ is really a faster way of iterating through an array (no instantiation of a new variable), but somehow I always get different results. Here's the performance test code:
#!/usr/bin/perl
use Time::HiRes qw(time);
use strict;
use warnings;
# $_ is a default argument for many operators, and also for some control structures.
my $test_array = [1..1000000];
my $number_of_tests = 100;
my $dollar_wins = 0;
my $dollar_wins_sum = 0;
for (my $i = 1; $i <= $number_of_tests; $i++) {
my $odd_void_array = [];
my $start_time_1 = time();
foreach my $item (@{$test_array}) {
if ($item % 2 == 1) {
push (@{$odd_void_array}, $item);
}
}
foreach my $item_odd (@{$odd_void_array}) {
}
my $end_time_1 = time();
$odd_void_array = [];
my $start_time_2 = time();
foreach (@{$test_array}) {
if ($_ % 2 == 1) {
push (@{$odd_void_array}, $_);
}
}
foreach (@{$odd_void_array}) {
}
my $end_time_2 = time();
my $diff = ($end_time_1-$start_time_1) - ($end_time_2-$start_time_2);
if ($diff > 0) {
$dollar_wins ++;
$dollar_wins_sum += $diff;
print "Dollar won ($dollar_wins out of $i) with diff $diff \n";
}
}
print "=================================\n";
print "When using dollar underscore, execution was faster in $dollar_wins cases (".(($dollar_wins/$number_of_tests)*100)."%), with average difference of ".($dollar_wins_sum/$dollar_wins)."\n";
So, I iterate twice (once assigning to my $item, once without). Mostly I find that iterating with $_ is faster in only about 20-30% of cases.
Shouldn't iterating without a new variable be faster?
You aren't really benchmarking iteration with different variables.
Your timings include array creation and other calculations.
You only record which is faster, not by how much.
You have too few iterations to tell anything reliable.
Let's take this better test that actually benchmarks what you are claiming to benchmark:
use strict;
use warnings;
use Benchmark ':hireswallclock', 'cmpthese';
my @numbers = 1..100_000;
cmpthese -3, {
'$_' => sub {
for (@numbers) {
1;
}
},
'my $x' => sub {
for my $x (@numbers) {
1;
}
},
'$x' => sub {
my $x;
for $x (@numbers) {
1;
}
},
}
Result:
Rate $_ my $x $x
$_ 107/s -- -0% -0%
my $x 107/s 0% -- -0%
$x 108/s 0% 0% --
So they are equally fast on my test system (perl 5.18.2 built for i686-linux-thread-multi-64int).
My suspicion is that using $_ is slightly slower than a lexical, as it's a global variable. However, the speed of iteration is equivalent. Indeed, modifying the benchmark…
use strict;
use warnings;
use Benchmark ':hireswallclock', 'cmpthese';
my @numbers = 1..100_000;
cmpthese -3, {
'$_' => sub {
for (@numbers) {
$_ % 2 == 0;
}
},
'my $x' => sub {
for my $x (@numbers) {
$x % 2 == 0;
}
},
'$x' => sub {
my $x;
for $x (@numbers) {
$x % 2 == 0;
}
},
}
… gives
Rate $_ $x my $x
$_ 40.3/s -- -1% -6%
$x 40.6/s 1% -- -5%
my $x 42.9/s 7% 6% --
but the effects are still too small to draw any solid conclusion.

Performance issue with this AWK - This Code takes around 200 mins on 5GB input file

The code below runs on a 5 GB file and consumes 99% CPU. I wanted to know if I am doing something terribly wrong, or if anything can improve the execution time.
2013-04-03 08:54:19,989 INFO [Logger] 2013-04-03T08:54:19.987-04:00PCMC.common.manage.springUtil<log-message-body><headers><fedDKPLoggingContext id="DKP_DumpDocumentProperties" type="context.generated.FedDKPLoggingContext"><logFilter>7</logFilter><logSeverity>255</logSeverity><schemaType>PCMC.MRP.DocumentMetaData</schemaType><UID>073104c-4e-4ce-bda-694344ee62</UID><consumerSystemId>JTR</consumerSystemId><consumerLogin>jbserviceid</consumerLogin><logLocation>Successful Completion of Service</logLocation></fedDKPLoggingContext></headers><payload>0</payload></log-message-body>
This is the code I am using. I also tried the gz format, but all in vain. I call this awk from bash with the command below.
awk -f mytest.awk <(gzip -dc scanned-file.$yesterday.gz)| gzip > tem.gz
cat mytest.awk
#!/bin/awk -f
function to_ms (time, time_ms, s) {
split(time, s, /:|\,/ )
time_ms = (s[1]*3600+s[2]*60+s[3])*1000+s[4]
#printf ("%s\n", newtime)
return time_ms
}
{
stid = gensub(/.*UID>([^<]+).*/,"\\1","")
}
(stid in starttime) {
etime = to_ms($2)
endtime[stid] = etime
docid[stid] = gensub(/.*id="([^""]+).*/,"\\1","")
consumer[stid]= gensub(/.*schemaType>PNC.([^.]+).*/,"\\1","")
state[stid]= gensub(/.*<logLocation>([^<]+).*/,"\\1","")
next
}
{
stime = to_ms($2)
starttime[stid] = stime
st_hour[stid] = stime/(60*60*1000)
timestamp[stid] = $1" "$2
}
END {
print "Document,Consumer,Hour,ResponseTime,Timestamp,State"
for (x in starttime) {
for (y in endtime) {
if (x==y) {
diff = (endtime[y]-starttime[x])
st = sprintf("%02d", st_hour[x])
print docid[y], consumer[y], st":00", diff, timestamp[x], state[y] |"sort -k3"
delete starttime[x]
delete endtime[y]
delete docid[y]
delete consumer[y]
delete timestamp[x]
delete state[y]
}
}
}
}
Assuming there's just one end time for every stid - don't build up an array of start times and an array of end times and then loop through them both; just process the stid when you hit its end time. I.e. not this, as you do today:
{ stid = whatever }
stid in starttime {
populate endtime[stid]
next
}
{ populate starttime[stid] }
END {
for (x in starttime) {
for (y in endtime) {
if (x == y) {
stid = x
process endtime[stid] - starttime[stid]
}
}
}
}
but this:
{ stid = whatever }
stid in starttime {
process to_ms($2) - starttime[stid]
delete starttime[stid]
next
}
{ populate starttime[stid] }
If you can't do that, e.g. due to there being multiple records with the same stid and you want to get the time stamps from the first and last ones, then change the loop in your END section to just loop through the stids that you got endtimes for (since you already KNOW they have corresponding starttimes) instead of trying to find all the stids in those massive loops on starttime and endtime, e.g.:
{ stid = whatever }
stid in starttime {
populate endtime[stid]
next
}
{ populate starttime[stid] }
END {
for (stid in endtime) {
process endtime[stid] - starttime[stid]
}
}
With either approach you should see a massive performance improvement.
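The underlying pattern is independent of awk: keep one hash of start times keyed by id and emit the difference as soon as the matching end record arrives, so no cross join over two arrays is needed in END. A tiny language-neutral sketch of that single-pass idea (Python here, with made-up record fields, purely as an illustration):
def pair_durations(records):
    # records: iterable of (stid, timestamp_ms, kind) where kind is "start" or "end"
    start_ms = {}
    for stid, ts_ms, kind in records:
        if kind == "start":
            start_ms[stid] = ts_ms
        elif stid in start_ms:
            yield stid, ts_ms - start_ms.pop(stid)  # emit and forget immediately

events = [("a1", 100, "start"), ("b2", 150, "start"), ("a1", 450, "end"), ("b2", 900, "end")]
for stid, diff in pair_durations(events):
    print(stid, diff)  # a1 350, then b2 750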
In the END section it always runs through the whole inner for-loop, even after the y item is found and deleted from the endtime array. I would suggest using a break to jump out of the inner loop.
On the other hand (as I see it), the inner loop is not needed at all! It tries to find an element with a known key in an associative array.
Also, I would suggest not deleting the found items. Finding an item in an associative array is done in constant time (depending on the hash key generation algorithm and how many duplicates there are), so removing items from such an array will not necessarily speed up the process, but the deletions themselves will definitely slow it down.
So I would suggest using this:
for (x in starttime) {
if (x in endtime) {
diff = (endtime[x]-starttime[x])
st = sprintf("%02d", st_hour[x])
print docid[x], consumer[x], st":00", diff, timestamp[x], state[x] |"sort -k3"
}
}
Using gzip will consume even more CPU resources, but you can spare some I/O bandwidth.
@Ed, the first approach is not giving me the expected result. This is what I did:
# end time and diff
(stid in starttime)
{ etime = to_ms($2)
diff = etime - stime
print diff,stid
delete starttime[stid]
next }
# Populate starttime
{
stime = to_ms($2)
starttime[stid] = stime
st_hour[stid] = stime/(60*60*1000)
}
The output looks like this; on the left is the diff, which should come out in milliseconds, together with the stid.
561849 c858591f-e01b-4407-b9f9-48302b65c383
562740 c858591f-e01b-4407-b9f9-48302b65c383
563629 56c71ef3-d952-4261-9711-16b18a32c6ba
564484 56c71ef3-d952-4261-9711-16b18a32c6ba

How do I create a URL shortener? [closed]

I want to create a URL shortener service where you can write a long URL into an input field and the service shortens the URL to "http://www.example.org/abcdef".
Instead of "abcdef" there can be any other string with six characters containing a-z, A-Z and 0-9. That makes 56~57 billion possible strings.
My approach:
I have a database table with three columns:
id, integer, auto-increment
long, string, the long URL the user entered
short, string, the shortened URL (or just the six characters)
I would then insert the long URL into the table. Then I would select the auto-increment value for "id" and build a hash of it. This hash should then be inserted as "short". But what sort of hash should I build? Hash algorithms like MD5 create strings that are too long, so I don't think I'll use those. A self-built algorithm will work, too.
My idea:
For "http://www.google.de/" I get the auto-increment id 239472. Then I do the following steps:
short = '';
if divisible by 2, add "a"+the result to short
if divisible by 3, add "b"+the result to short
... until I have divisors for a-z and A-Z.
That could be repeated until the number isn't divisible any more. Do you think this is a good approach? Do you have a better idea?
Due to the ongoing interest in this topic, I've published an efficient solution to GitHub, with implementations for JavaScript, PHP, Python and Java. Add your solutions if you like :)
I would continue your "convert number to string" approach. However, you will realize that your proposed algorithm fails if your ID is a prime and greater than 52.
Theoretical background
You need a Bijective Function f. This is necessary so that you can find an inverse function g('abc') = 123 for your f(123) = 'abc' function. This means:
There must be no x1, x2 (with x1 ≠ x2) that will make f(x1) = f(x2),
and for every y you must be able to find an x so that f(x) = y.
How to convert the ID to a shortened URL
Think of an alphabet we want to use. In your case, that's [a-zA-Z0-9]. It contains 62 letters.
Take an auto-generated, unique numerical key (the auto-incremented id of a MySQL table for example).
For this example, I will use 125₁₀ (125 with a base of 10).
Now you have to convert 125₁₀ to X₆₂ (base 62).
125₁₀ = 2×62¹ + 1×62⁰ = [2,1]
This requires the use of integer division and modulo. A pseudo-code example:
digits = []
while num > 0
remainder = modulo(num, 62)
digits.push(remainder)
num = divide(num, 62)
digits = digits.reverse
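The same pseudo-code written out as a tiny runnable sketch (Python here, just for illustration; the function name is mine):
def to_base62_digits(num):
    # Repeated divmod by 62; digits come out least-significant first.
    digits = []
    while num > 0:
        num, remainder = divmod(num, 62)
        digits.append(remainder)
    digits.reverse()
    return digits

print(to_base62_digits(125))  # [2, 1], matching the worked example above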
Now map the indices 2 and 1 to your alphabet. This is how your mapping (with an array for example) could look like:
0 → a
1 → b
...
25 → z
...
52 → 0
61 → 9
With 2 → c and 1 → b, you will receive cb₆₂ as the shortened URL.
http://shor.ty/cb
How to resolve a shortened URL to the initial ID
The reverse is even easier. You just do a reverse lookup in your alphabet.
e9a₆₂ will be resolved to "4th, 61st, and 0th letter in the alphabet".
e9a₆₂ = [4,61,0] = 4×62² + 61×62¹ + 0×62⁰ = 19158₁₀
Now find your database-record with WHERE id = 19158 and do the redirect.
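Conversely, here is a short sketch of the reverse lookup (again Python, using the [a-zA-Z0-9] alphabet from above; the names are mine), checking the worked examples:
import string

ALPHABET = string.ascii_lowercase + string.ascii_uppercase + string.digits  # a..z A..Z 0..9

def decode(short):
    # Horner's rule: each character is one base-62 digit.
    num = 0
    for ch in short:
        num = num * 62 + ALPHABET.index(ch)
    return num

print(decode("e9a"))  # 19158, so the lookup would be WHERE id = 19158
print(decode("cb"))   # 125, the id encoded earlier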
Example implementations (provided by commenters)
C++
Python
Ruby
Haskell
C#
CoffeeScript
Perl
Why would you want to use a hash?
You can just use a simple translation of your auto-increment value to an alphanumeric value. You can do that easily by using some base conversion. Say your character space (A-Z, a-z, 0-9, etc.) has 62 characters; convert the id to a base-62 number and use the characters as the digits.
public class UrlShortener {
private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
private static final int BASE = ALPHABET.length();
public static String encode(int num) {
StringBuilder sb = new StringBuilder();
while ( num > 0 ) {
sb.append( ALPHABET.charAt( num % BASE ) );
num /= BASE;
}
return sb.reverse().toString();
}
public static int decode(String str) {
int num = 0;
for ( int i = 0; i < str.length(); i++ )
num = num * BASE + ALPHABET.indexOf(str.charAt(i));
return num;
}
}
Not an answer to your question, but I wouldn't use case-sensitive shortened URLs. They are hard to remember, usually unreadable (many fonts render 1 and l, 0 and O and other characters so similarly that they are nearly impossible to tell apart) and downright error prone. Try to use lower or upper case only.
Also, try to have a format where you mix the numbers and characters in a predefined form. There are studies that show that people tend to remember one form better than others (think phone numbers, where the numbers are grouped in a specific form). Try something like num-char-char-num-char-char. I know this will lower the combinations, especially if you don't have upper and lower case, but it would be more usable and therefore useful.
My approach: Take the Database ID, then Base36 Encode it. I would NOT use both Upper AND Lowercase letters, because that makes transmitting those URLs over the telephone a nightmare, but you could of course easily extend the function to be a base 62 en/decoder.
Here is my PHP 5 class.
<?php
class Bijective
{
public $dictionary = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
public function __construct()
{
$this->dictionary = str_split($this->dictionary);
}
public function encode($i)
{
if ($i == 0)
return $this->dictionary[0];
$result = array();
$base = count($this->dictionary);
while ($i > 0)
{
$result[] = $this->dictionary[($i % $base)];
$i = floor($i / $base);
}
$result = array_reverse($result);
return join("", $result);
}
public function decode($input)
{
$i = 0;
$base = count($this->dictionary);
$input = str_split($input);
foreach($input as $char)
{
$pos = array_search($char, $this->dictionary);
$i = $i * $base + $pos;
}
return $i;
}
}
A Node.js and MongoDB solution
We know the format that MongoDB uses to create a new ObjectId with 12 bytes:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id
a 3-byte counter (in your machine), starting with a random value.
Example (I chose a random sequence):
a1b2c3d4e5f6g7h8i9j1k2l3
a1b2c3d4 represents the seconds since the Unix epoch,
4e5f6g7 represents machine identifier,
h8i9 represents process id
j1k2l3 represents the counter, starting with a random value.
Since the counter will be unique if we are storing the data on the same machine, we can be sure that it will not be duplicated.
So the short URL will be the counter and here is a code snippet assuming that your server is running properly.
const mongoose = require('mongoose');
const Schema = mongoose.Schema;
// Create a schema
const shortUrl = new Schema({
long_url: { type: String, required: true },
short_url: { type: String, required: true, unique: true },
});
const ShortUrl = mongoose.model('ShortUrl', shortUrl);
// The user can request to get a short URL by providing a long URL using a form
app.post('/shorten', function(req ,res){
// Create a new shortUrl
// The submit form has an input with longURL as its name attribute.
const longUrl = req.body["longURL"];
const newUrl = ShortUrl({
long_url : longUrl,
short_url : "",
});
const shortUrl = newUrl._id.toString().slice(-6);
newUrl.short_url = shortUrl;
console.log(newUrl);
newUrl.save(function(err){
console.log("the new URL is added");
})
});
I keep incrementing an integer sequence per domain in the database and use Hashids to encode the integer into a URL path.
static hashids = Hashids(salt = "my app rocks", minSize = 6)
I ran a script to see how long it takes until it exhausts the character length. For six characters it can do 164,916,224 links and then goes up to seven characters. Bitly uses seven characters. Under five characters looks weird to me.
Hashids can decode the URL path back to an integer, but a simpler solution is to use the entire short link sho.rt/ka8ds3 as a primary key.
Here is the full concept:
function addDomain(domain) {
table("domains").insert("domain", domain, "seq", 0)
}
function addURL(domain, longURL) {
seq = table("domains").where("domain = ?", domain).increment("seq")
shortURL = domain + "/" + hashids.encode(seq)
table("links").insert("short", shortURL, "long", longURL)
return shortURL
}
// GET /:hashcode
function handleRequest(req, res) {
shortURL = req.host + "/" + req.param("hashcode")
longURL = table("links").where("short = ?", shortURL).get("long")
res.redirect(301, longURL)
}
C# version:
public class UrlShortener
{
private static String ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
private static int BASE = 62;
public static String encode(int num)
{
StringBuilder sb = new StringBuilder();
while ( num > 0 )
{
sb.Append( ALPHABET[( num % BASE )] );
num /= BASE;
}
StringBuilder builder = new StringBuilder();
for (int i = sb.Length - 1; i >= 0; i--)
{
builder.Append(sb[i]);
}
return builder.ToString();
}
public static int decode(String str)
{
int num = 0;
for ( int i = 0, len = str.Length; i < len; i++ )
{
num = num * BASE + ALPHABET.IndexOf( str[(i)] );
}
return num;
}
}
You could hash the entire URL, but if you just want to shorten the id, do as marcel suggested. I wrote this Python implementation:
https://gist.github.com/778542
Take a look at https://hashids.org/ it is open source and in many languages.
Their page outlines some of the pitfalls of other approaches.
If you don't want to re-invent the wheel ... http://lilurl.sourceforge.net/
// simple approach
$original_id = 56789;
$shortened_id = base_convert($original_id, 10, 36);
$un_shortened_id = base_convert($shortened_id, 36, 10);
alphabet = map(chr, range(97,123)+range(65,91)) + map(str,range(0,10))
def lookup(k, a=alphabet):
if type(k) == int:
return a[k]
elif type(k) == str:
return a.index(k)
def encode(i, a=alphabet):
'''Takes an integer and returns it in the given base with mappings for upper/lower case letters and numbers 0-9.'''
try:
i = int(i)
except Exception:
raise TypeError("Input must be an integer.")
def incode(i=i, p=1, a=a):
# Here to protect p.
if i <= 61:
return lookup(i)
else:
pval = pow(62,p)
nval = i/pval
remainder = i % pval
if nval <= 61:
return lookup(nval) + incode(i % pval)
else:
return incode(i, p+1)
return incode()
def decode(s, a=alphabet):
'''Takes a base 62 string in our alphabet and returns it in base10.'''
try:
s = str(s)
except Exception:
raise TypeError("Input must be a string.")
return sum([lookup(i) * pow(62,p) for p,i in enumerate(list(reversed(s)))])
Here's my version for whoever needs it.
Why not just translate your id to a string? You just need a function that maps a digit between, say, 0 and 61 to a single letter (upper/lower case) or digit. Then apply this to create, say, 4-letter codes, and you've got 14.7 million URLs covered.
Here is a decent URL encoding function for PHP...
// From http://snipplr.com/view/22246/base62-encode--decode/
private function base_encode($val, $base=62, $chars='0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ') {
$str = '';
do {
$i = fmod($val, $base);
$str = $chars[$i] . $str;
$val = ($val - $i) / $base;
} while($val > 0);
return $str;
}
Don't know if anyone will find this useful - it is more of a 'hack n slash' method, yet is simple and works nicely if you want only specific chars.
$dictionary = "abcdfghjklmnpqrstvwxyz23456789";
$dictionary = str_split($dictionary);
// Encode
$str_id = '';
$base = count($dictionary);
while($id > 0) {
$rem = $id % $base;
$id = ($id - $rem) / $base;
$str_id .= $dictionary[$rem];
}
// Decode
$id_ar = str_split($str_id);
$id = 0;
for($i = count($id_ar); $i > 0; $i--) {
$id += array_search($id_ar[$i-1], $dictionary) * pow($base, $i - 1);
}
Did you omit O, 0, and i on purpose?
I just created a PHP class based on Ryan's solution.
<?php
$shorty = new App_Shorty();
echo 'ID: ' . 1000;
echo '<br/> Short link: ' . $shorty->encode(1000);
echo '<br/> Decoded Short Link: ' . $shorty->decode($shorty->encode(1000));
/**
* A nice shortening class based on Ryan Charmley's suggestion; see the link on Stack Overflow below.
* @author Svetoslav Marinov (Slavi) | http://WebWeb.ca
* @see http://stackoverflow.com/questions/742013/how-to-code-a-url-shortener/10386945#10386945
*/
class App_Shorty {
/**
* Explicitly omitted: i, o, 1, 0 because they are confusing. Also use only lowercase ... as
* dictating this over the phone might be tough.
* @var string
*/
private $dictionary = "abcdfghjklmnpqrstvwxyz23456789";
private $dictionary_array = array();
public function __construct() {
$this->dictionary_array = str_split($this->dictionary);
}
/**
* Gets ID and converts it into a string.
* @param int $id
*/
public function encode($id) {
$str_id = '';
$base = count($this->dictionary_array);
while ($id > 0) {
$rem = $id % $base;
$id = ($id - $rem) / $base;
$str_id .= $this->dictionary_array[$rem];
}
return $str_id;
}
/**
* Converts /abc into an integer ID
* @param string
* @return int $id
*/
public function decode($str_id) {
$id = 0;
$id_ar = str_split($str_id);
$base = count($this->dictionary_array);
for ($i = count($id_ar); $i > 0; $i--) {
$id += array_search($id_ar[$i - 1], $this->dictionary_array) * pow($base, $i - 1);
}
return $id;
}
}
?>
public class TinyUrl {
private final String characterMap = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
private final int charBase = characterMap.length();
public String covertToCharacter(int num){
StringBuilder sb = new StringBuilder();
while (num > 0){
sb.append(characterMap.charAt(num % charBase));
num /= charBase;
}
return sb.reverse().toString();
}
public int covertToInteger(String str){
int num = 0;
for(int i = 0 ; i< str.length(); i++)
num += (int) (characterMap.indexOf(str.charAt(i)) * Math.pow(charBase, (str.length() - (i + 1))));
return num;
}
}
class TinyUrlTest{
public static void main(String[] args) {
TinyUrl tinyUrl = new TinyUrl();
int num = 122312215;
String url = tinyUrl.covertToCharacter(num);
System.out.println("Tiny url: " + url);
System.out.println("Id: " + tinyUrl.covertToInteger(url));
}
}
This is what I use:
# Generate a [0-9a-zA-Z] string
ALPHABET = map(str,range(0, 10)) + map(chr, range(97, 123) + range(65, 91))
def encode_id(id_number, alphabet=ALPHABET):
"""Convert an integer to a string."""
if id_number == 0:
return alphabet[0]
alphabet_len = len(alphabet) # Cache
result = ''
while id_number > 0:
id_number, mod = divmod(id_number, alphabet_len)
result = alphabet[mod] + result
return result
def decode_id(id_string, alphabet=ALPHABET):
"""Convert a string to an integer."""
alphabet_len = len(alphabet) # Cache
return sum([alphabet.index(char) * pow(alphabet_len, power) for power, char in enumerate(reversed(id_string))])
It's very fast and can take long integers.
For a similar project, to get a new key, I make a wrapper function around a random string generator that calls the generator until I get a string that hasn't already been used in my hashtable. This method will slow down once your name space starts to get full, but as you have said, even with only 6 characters, you have plenty of namespace to work with.
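A minimal sketch of that retry-until-unique wrapper (Python, with an in-memory set standing in for the hashtable; the names are illustrative, not from the answer above):
import secrets
import string

ALPHABET = string.ascii_letters + string.digits
used_keys = set()  # stand-in for the real hashtable / database table

def new_key(length=6):
    # Keep drawing random keys until we find one that hasn't been handed out yet.
    while True:
        key = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if key not in used_keys:
            used_keys.add(key)
            return key

print(new_key())  # e.g. 'Xk3bQz'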
I have a variant of the problem, in that I store web pages from many different authors and need to prevent discovery of pages by guesswork. So my short URLs add a couple of extra digits to the Base-62 string for the page number. These extra digits are generated from information in the page record itself and they ensure that only 1 in 3844 URLs are valid (assuming 2-digit Base-62). You can see an outline description at http://mgscan.com/MBWL.
Very good answer, I have created a Golang implementation of the bjf:
package bjf
import (
"math"
"strings"
"strconv"
)
const alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
func Encode(num string) string {
n, _ := strconv.ParseUint(num, 10, 64)
t := make([]byte, 0)
/* Special case */
if n == 0 {
return string(alphabet[0])
}
/* Map */
for n > 0 {
r := n % uint64(len(alphabet))
t = append(t, alphabet[r])
n = n / uint64(len(alphabet))
}
/* Reverse */
for i, j := 0, len(t) - 1; i < j; i, j = i + 1, j - 1 {
t[i], t[j] = t[j], t[i]
}
return string(t)
}
func Decode(token string) int {
r := int(0)
p := float64(len(token)) - 1
for i := 0; i < len(token); i++ {
r += strings.Index(alphabet, string(token[i])) * int(math.Pow(float64(len(alphabet)), p))
p--
}
return r
}
Hosted at github: https://github.com/xor-gate/go-bjf
Implementation in Scala:
class Encoder(alphabet: String) extends (Long => String) {
val Base = alphabet.size
override def apply(number: Long) = {
def encode(current: Long): List[Int] = {
if (current == 0) Nil
else (current % Base).toInt :: encode(current / Base)
}
encode(number).reverse
.map(current => alphabet.charAt(current)).mkString
}
}
class Decoder(alphabet: String) extends (String => Long) {
val Base = alphabet.size
override def apply(string: String) = {
def decode(current: Long, encodedPart: String): Long = {
if (encodedPart.size == 0) current
else decode(current * Base + alphabet.indexOf(encodedPart.head),encodedPart.tail)
}
decode(0,string)
}
}
Test example with Scala test:
import org.scalatest.{FlatSpec, Matchers}
class DecoderAndEncoderTest extends FlatSpec with Matchers {
val Alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
"A number with base 10" should "be correctly encoded into base 62 string" in {
val encoder = new Encoder(Alphabet)
encoder(127) should be ("cd")
encoder(543513414) should be ("KWGPy")
}
"A base 62 string" should "be correctly decoded into a number with base 10" in {
val decoder = new Decoder(Alphabet)
decoder("cd") should be (127)
decoder("KWGPy") should be (543513414)
}
}
A function based on Xeoncross' class:
function shortly($input){
$dictionary = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','0','1','2','3','4','5','6','7','8','9'];
if($input===0)
return $dictionary[0];
$base = count($dictionary);
if(is_numeric($input)){
$result = [];
while($input > 0){
$result[] = $dictionary[($input % $base)];
$input = floor($input / $base);
}
return join("", array_reverse($result));
}
$i = 0;
$input = str_split($input);
foreach($input as $char){
$pos = array_search($char, $dictionary);
$i = $i * $base + $pos;
}
return $i;
}
Here is a Node.js implementation of what bit.ly likely does: generate a highly random seven-character string.
It uses Node.js crypto to generate a highly random set of characters (25 random bytes) rather than randomly selecting seven characters directly.
var crypto = require("crypto");
exports.shortURL = new function () {
this.getShortURL = function () {
var sURL = '',
_rand = crypto.randomBytes(25).toString('hex'),
_base = _rand.length;
for (var i = 0; i < 7; i++)
sURL += _rand.charAt(Math.floor(Math.random() * _rand.length));
return sURL;
};
}
My Python 3 version
base_list = list("0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
base = len(base_list)
def encode(num: int):
result = []
if num == 0:
result.append(base_list[0])
while num > 0:
result.append(base_list[num % base])
num //= base
print("".join(reversed(result)))
def decode(code: str):
num = 0
code_list = list(code)
for index, code in enumerate(reversed(code_list)):
num += base_list.index(code) * base ** index
print(num)
if __name__ == '__main__':
encode(341413134141)
decode("60FoItT")
For a quality Node.js / JavaScript solution, see the id-shortener module, which is thoroughly tested and has been used in production for months.
It provides an efficient id / URL shortener backed by pluggable storage defaulting to Redis, and you can even customize your short id character set and whether or not shortening is idempotent. This is an important distinction that not all URL shorteners take into account.
In relation to other answers here, this module implements Marcel Jackwerth's excellent accepted answer above.
The core of the solution is provided by the following Redis Lua snippet:
local sequence = redis.call('incr', KEYS[1])
local chars = '0123456789ABCDEFGHJKLMNPQRSTUVWXYZ_abcdefghijkmnopqrstuvwxyz'
local remaining = sequence
local slug = ''
while (remaining > 0) do
local d = (remaining % 60)
local character = string.sub(chars, d + 1, d + 1)
slug = character .. slug
remaining = (remaining - d) / 60
end
redis.call('hset', KEYS[2], slug, ARGV[1])
return slug
Why not just generate a random string and append it to the base URL? This is a very simplified version of doing this in C#.
static string chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";
static string baseUrl = "https://google.com/";
private static string RandomString(int length)
{
char[] s = new char[length];
Random rnd = new Random();
for (int x = 0; x < length; x++)
{
s[x] = chars[rnd.Next(chars.Length)];
}
Thread.Sleep(10);
return new String(s);
}
Then just add the append the random string to the baseURL:
string tinyURL = baseUrl + RandomString(5);
Remember this is a very simplified version of doing this, and it's possible the RandomString method could create duplicate strings. In production you would want to take duplicate strings into account to ensure you always have a unique URL. I have some code that handles duplicates by querying a database table, which I could share if anyone is interested.
These are my initial thoughts; more thinking can be done, or some simulation can be run, to see whether it works well or needs improvement:
My answer is to remember the long URL in the database, and use the ID 0 to 9999999999999999 (or however large the number is needed).
But the ID 0 to 9999999999999999 can be an issue, because
it can be shorter if we use hexadecimal, or even base62 or base64. (base64 just like YouTube using A-Z a-z 0-9 _ and -)
if it increases from 0 to 9999999999999999 uniformly, then hackers can visit them in that order and know what URLs people are sending each other, so it can be a privacy issue
We can do this:
have one server allocate batches of IDs: it hands 0 to 999 to one server, Server A, so now Server A has 1000 such IDs. So if there are 20 or 200 servers constantly wanting new IDs, they don't have to keep asking for each new ID, but rather ask once for 1000 IDs
for the ID 1, for example, reverse the bits. So 000...00000001 becomes 10000...000, so that, when converted to base64, the IDs will not increase uniformly
use XOR to flip the bits of the final IDs. For example, XOR with 0xD5AA96...2373 (like a secret key), and some bits will be flipped (wherever the secret key has a 1 bit, it flips that bit of the ID). This will make the IDs even harder to guess and appear more random (see the sketch after this answer)
Following this scheme, the single server that allocates the IDs can form the IDs, and so can the 20 or 200 servers requesting the allocation of IDs. The allocating server has to use a lock / semaphore to prevent two requesting servers from getting the same batch (or, if it accepts one connection at a time, that already solves the problem). We don't want the queue of servers waiting for an allocation to be too long, which is why allocating 1000 or 10000 at a time solves the issue.
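A small sketch of the bit-reverse / XOR idea from the list above (Python; the 64-bit width, the example key, and the helper names are my own assumptions, since the answer elides the real key):
SECRET_KEY = 0xD5AA960000002373  # illustrative 64-bit key only; the answer's actual key is elided

def reverse_bits(n, width=64):
    # Mirror the low `width` bits so sequential IDs stop being adjacent.
    out = 0
    for _ in range(width):
        out = (out << 1) | (n & 1)
        n >>= 1
    return out

def obfuscate(sequential_id):
    return reverse_bits(sequential_id) ^ SECRET_KEY

def deobfuscate(public_id):
    return reverse_bits(public_id ^ SECRET_KEY)

print(obfuscate(1))               # a large, hard-to-guess number
print(deobfuscate(obfuscate(1)))  # 1 -- the mapping is reversible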
