Adding Pharmacode to PyZbar - barcode

I have an inspection application in pharma. I need to inspect batch numbers and Pharmacodes on labels using OpenCV. I thought of using PyZbar but it doesn't support Pharmacode. How can I add more codes, like Pharmacode, to PyZbar?
Many thanks

Pharmacode (also known as Code 32, or Italian Pharmacode) is implemented as Code 39 using a radix-32 "compression scheme". So you can use PyZbar, with the understanding that you will get a Code 39 decode from the library and will then have to convert the value from base 32 to base 10.
Note that there is no way of knowing whether a Code 39 "read" is actually Code 32 or just plain Code 39.
This is some JavaScript that converts a Code 39 value to its original Code 32 string:
// code32 radix set (missing the vowels AEIO)
const code32set = '0123456789BCDFGHJKLMNPQRSTUVWXYZ';
function code39toCode32(val) {
    if (/[^0-9BCDFGHJKLMNPQRSTUVWXYZ]/.test(val)) {
        throw new Error("Not code32");
    }
    let res = 0;
    for (let i = 0; i < val.length; i++) {
        res = res * 32 + code32set.indexOf(val[i]);
    }
    let code32 = '' + res;
    if (code32.length < 9) {
        code32 = ('000000000' + code32).slice(-9);
    }
    return code32;
}
That function should translate to python pretty easily.
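A rough Python equivalent of that function (a minimal sketch, mirroring the JavaScript above):

import re

# code32 radix set (missing the vowels A, E, I, O)
CODE32SET = '0123456789BCDFGHJKLMNPQRSTUVWXYZ'

def code39_to_code32(val: str) -> str:
    if re.search(r'[^0-9BCDFGHJKLMNPQRSTUVWXYZ]', val):
        raise ValueError("Not code32")
    res = 0
    for ch in val:
        res = res * 32 + CODE32SET.index(ch)
    # code32 values are zero-padded to 9 digits
    return str(res).zfill(9)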

How to get the result of EXPECT_EXIT() in gtest?

I want to do something according to the results of EXPECT_EXIT. For example:
for (int i = 0; i < 1000; i++) {
    for (int j = 0; j < 10; j++) {
        auto &result = EXPECT_EXIT(Function(i, inputs[j]));
        if (result) {
            // do something
            break; // jump out of the loop
        }
    }
}
From your comment:
"But once it failed, I need record something and skip rest of the input check for this function"
Short answer:
To stop execution at the first mismatch, you need to use the ASSERT version instead of EXPECT, in your case ASSERT_EXIT.
To output some extra data for a failed check, you can just use the check's stream operator:
ASSERT_EXIT(Function(i, j), testing::ExitedWithCode(0), "") << "Some custom data: " << i << "; " << j;
If you want to do something more on failure, you can use the ::testing::Test::HasFailure() function.
Long suggestion:
One of the properties of a good unit test is simplicity, which implies the absence of loops or branches. Setup logic (if any) should be moved to fixtures. Unit tests must also be executed in isolation.
By switching your test to a parameterized one, you cover all of that. Check this:
void Function(int i, int j)
{
    if (j != 3)
        exit(0);
}

class FunTest : public testing::TestWithParam<std::tuple<int, int>> { };

TEST_P(FunTest, Should_Exit)
{
    const auto i = std::get<0>(GetParam());
    const auto j = std::get<1>(GetParam());
    EXPECT_EXIT(Function(i, j), testing::ExitedWithCode(0), "");
}

INSTANTIATE_TEST_SUITE_P(,
    FunTest,
    testing::Combine(
        testing::Range(1, 20),
        testing::Range(1, 10)),
    [](const testing::TestParamInfo<std::tuple<int, int>>& info) -> std::string {
        const auto i = std::get<0>(info.param);
        const auto j = std::get<1>(info.param);
        return std::to_string(i) + '_' + std::to_string(j);
    });
This approach helps with your problem: you can stop execution at the first failure by using --gtest_break_on_failure, and you can control the naming of test cases by implementing a custom printer for the specific TestParamInfo (the lambda in the example).
It also guarantees isolation and provides a certain level of simplicity (all the complex stuff is provided by the framework; IMHO the test code is minimalistic and simple).
It also allows executing a specific test case using gtest_filter
(e.g. --gtest_filter=FunTest.Should_Exit/4_3).

How to convert binary data to utf8 string after dart2js

I'm storing my data in localStorage. Because I store a lot of data, I need to compress it, which I do with GZip. Everything works, but I found out that my conversion of the List<int> that GZip returns to a UTF-8 string is extremely slow. On 500k of data, it can take up to 5 minutes on a fast computer.
The problem appears mainly in the loading function.
Load:
import 'dart:convert';
import 'package:archive/archive.dart';
import 'package:crypto/crypto.dart';
var base64 = CryptoUtils.base64StringToBytes(window.localStorage[storageName]);
var decompress = new GZipDecoder().decodeBytes(base64);
storage = JSON.decode(new Utf8Decoder().convert(decompress));
Save:
var g = JSON.encode(storage);
List<int> bz2 = new GZipEncoder().encode(new Utf8Encoder().convert(g));
window.localStorage[storageName] = CryptoUtils.bytesToBase64(bz2);
Am I doing something wrong? Is there another way of converting a List<int> to a String in Dart? Also, since I do both the save and the load, there is no strict need for UTF-8.
I tried ASCII, but it throws an error for unsupported characters. I got Latin-1 kind of working, with much higher performance, but I still get some unsupported characters.
If UTF-8 is really the bottleneck, and you don't need it, you can just use Dart's native String encoding, which is UTF-16.
var compressed = new GZipEncoder().encode(str.codeUnits);
var decompressed = new GZipDecoder().decodeBytes(compressed);
var finalStr = new String.fromCharCodes(decompressed);
This is just a hack and is not the best, but it works and it's giving me good performance for now:
String convertListIntToUtf8(List<int> codeUnits) {
    var endIndex = codeUnits.length;
    StringBuffer buffer = new StringBuffer();
    int i = 0;
    while (i < endIndex) {
        var c = codeUnits[i];
        if (c < 0x80) {
            // 1-byte sequence (ASCII)
            buffer.write(new String.fromCharCode(c));
            i += 1;
        } else if (c < 0xE0) {
            // 2-byte sequence
            buffer.write(new String.fromCharCode((c << 6) + codeUnits[i + 1] - 0x3080));
            i += 2;
        } else if (c < 0xF0) {
            // 3-byte sequence
            buffer.write(new String.fromCharCode((c << 12) + (codeUnits[i + 1] << 6) + codeUnits[i + 2] - 0xE2080));
            i += 3;
        } else {
            // 4-byte sequence
            buffer.write(new String.fromCharCode((c << 18) + (codeUnits[i + 1] << 12) + (codeUnits[i + 2] << 6) + codeUnits[i + 3] - 0x3C82080));
            i += 4;
        }
    }
    return buffer.toString();
}

How do I send a 2D array from jquery to php?

I have a 2D array which I want to send to a php page with $.ajax.
This is the code which creates the array:
for (var i = 0; i < rowlen; i++) {
    if (breakcheck) {
        break;
    }
    for (var j = 0; j <= columnlen; j++) {
        thtext = columnheads.eq(j).text();
        current_td = $(newrows[i]).find("td").eq(j);
        if (current_td.find("input").length > 0) {
            rowdata[i,thtext] = current_td.find("input").val().trim();
            if (rowdata[i,thtext] == '') {
                alert("You must complete all fields");
                breakcheck = true;
                break;
            }
        } else {
            rowdata[i,thtext] = 'nada';
        }
    } //inner loop
} //outer loop
The array is filled properly by the nested loops, and then I use JSON.stringify to format it. However, when the ajax call is made, all that is sent is an empty object ([]). What's wrong?
I might be wrong, but arr[i,j] is not the way to use multidimensional arrays in C-style languages. That would be arr[i][j].
IMHO what arr[i,j] will do is apply the comma operator and use only j as the index.
OK, I solved this by declaring r as an object (var r = {}) instead of declaring it as an array (var r = []). Thanks for the help.

Lossless compression in small blocks with precomputed dictionary

I have an application where I am reading and writing small blocks of data (a few hundred bytes) hundreds of millions of times. I'd like to generate a compression dictionary based on an example data file and use that dictionary forever as I read and write the small blocks. I'm leaning toward the LZW compression algorithm. The Wikipedia page (http://en.wikipedia.org/wiki/Lempel-Ziv-Welch) lists pseudocode for compression and decompression. It looks fairly straightforward to modify it such that the dictionary creation is a separate block of code. So I have two questions:
Am I on the right track or is there a better way?
Why does the LZW algorithm add to the dictionary during the decompression step? Can I omit that, or would I lose efficiency in my dictionary?
Thanks.
Update: Now I'm thinking the ideal would be to find a library that lets me store the dictionary separately from the compressed data. Does anything like that exist?
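For what it's worth, zlib supports exactly this: a preset dictionary supplied out-of-band on both sides (deflateSetDictionary in the C API, the zdict parameter in Python 3.3+). A minimal Python sketch, assuming sample.bin is a representative sample file (zlib only uses up to the last 32 KB of the dictionary):

import zlib

# Assumption: sample.bin holds data representative of the small blocks.
preset_dict = open("sample.bin", "rb").read()[-32768:]

def compress_block(block: bytes) -> bytes:
    c = zlib.compressobj(level=9, zdict=preset_dict)
    return c.compress(block) + c.flush()

def decompress_block(data: bytes) -> bytes:
    d = zlib.decompressobj(zdict=preset_dict)
    return d.decompress(data) + d.flush()

The dictionary itself is never stored with the compressed blocks, which is what makes this attractive for many small records.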
Update: I ended up taking the code at http://www.enusbaum.com/blog/2009/05/22/example-huffman-compression-routine-in-c and adapting it. I am Chris in the comments on that page. I emailed my mods back to the blog author, but I haven't heard back yet. The compression rates I'm seeing with that code are not at all impressive. Maybe that is due to the 8-bit tree size.
Update: I converted it to 16 bits and the compression is better. It's also much faster than the original code.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.IO;

namespace Book.Core
{
    public class Huffman16
    {
        private readonly double log2 = Math.Log(2);
        private List<Node> HuffmanTree = new List<Node>();

        internal class Node
        {
            public long Frequency { get; set; }
            public byte Uncoded0 { get; set; }
            public byte Uncoded1 { get; set; }
            public uint Coded { get; set; }
            public int CodeLength { get; set; }
            public Node Left { get; set; }
            public Node Right { get; set; }

            public bool IsLeaf
            {
                get { return Left == null; }
            }

            public override string ToString()
            {
                var coded = "00000000" + Convert.ToString(Coded, 2);
                return string.Format("Uncoded={0}, Coded={1}, Frequency={2}", (Uncoded1 << 8) | Uncoded0, coded.Substring(coded.Length - CodeLength), Frequency);
            }
        }

        public Huffman16(long[] frequencies)
        {
            if (frequencies.Length != ushort.MaxValue + 1)
            {
                throw new ArgumentException("frequencies.Length must equal " + (ushort.MaxValue + 1));
            }
            BuildTree(frequencies);
            EncodeTree(HuffmanTree[HuffmanTree.Count - 1], 0, 0);
        }

        public static long[] GetFrequencies(byte[] sampleData, bool safe)
        {
            if (sampleData.Length % 2 != 0)
            {
                throw new ArgumentException("sampleData.Length must be a multiple of 2.");
            }
            var histogram = new long[ushort.MaxValue + 1];
            if (safe)
            {
                for (int i = 0; i <= ushort.MaxValue; i++)
                {
                    histogram[i] = 1;
                }
            }
            for (int i = 0; i < sampleData.Length; i += 2)
            {
                histogram[(sampleData[i] << 8) | sampleData[i + 1]] += 1000;
            }
            return histogram;
        }

        public byte[] Encode(byte[] plainData)
        {
            if (plainData.Length % 2 != 0)
            {
                throw new ArgumentException("plainData.Length must be a multiple of 2.");
            }

            Int64 iBuffer = 0;
            int iBufferCount = 0;

            using (MemoryStream msEncodedOutput = new MemoryStream())
            {
                //Write Final Output Size 1st
                msEncodedOutput.Write(BitConverter.GetBytes(plainData.Length), 0, 4);

                //Begin Writing Encoded Data Stream
                iBuffer = 0;
                iBufferCount = 0;
                for (int i = 0; i < plainData.Length; i += 2)
                {
                    Node FoundLeaf = HuffmanTree[(plainData[i] << 8) | plainData[i + 1]];

                    //How many bits are we adding?
                    iBufferCount += FoundLeaf.CodeLength;

                    //Shift the buffer
                    iBuffer = (iBuffer << FoundLeaf.CodeLength) | FoundLeaf.Coded;

                    //Are there at least 8 bits in the buffer?
                    while (iBufferCount > 7)
                    {
                        //Write to output
                        int iBufferOutput = (int)(iBuffer >> (iBufferCount - 8));
                        msEncodedOutput.WriteByte((byte)iBufferOutput);
                        iBufferCount = iBufferCount - 8;
                        iBufferOutput <<= iBufferCount;
                        iBuffer ^= iBufferOutput;
                    }
                }

                //Write remaining bits in buffer
                if (iBufferCount > 0)
                {
                    iBuffer = iBuffer << (8 - iBufferCount);
                    msEncodedOutput.WriteByte((byte)iBuffer);
                }
                return msEncodedOutput.ToArray();
            }
        }

        public byte[] Decode(byte[] bInput)
        {
            long iInputBuffer = 0;
            int iBytesWritten = 0;

            //Establish Output Buffer to write unencoded data to
            byte[] bDecodedOutput = new byte[BitConverter.ToInt32(bInput, 0)];

            var current = HuffmanTree[HuffmanTree.Count - 1];

            //Begin Looping through Input and Decoding
            iInputBuffer = 0;
            for (int i = 4; i < bInput.Length; i++)
            {
                iInputBuffer = bInput[i];

                for (int bit = 0; bit < 8; bit++)
                {
                    if ((iInputBuffer & 128) == 0)
                    {
                        current = current.Left;
                    }
                    else
                    {
                        current = current.Right;
                    }

                    if (current.IsLeaf)
                    {
                        bDecodedOutput[iBytesWritten++] = current.Uncoded1;
                        bDecodedOutput[iBytesWritten++] = current.Uncoded0;
                        if (iBytesWritten == bDecodedOutput.Length)
                        {
                            return bDecodedOutput;
                        }
                        current = HuffmanTree[HuffmanTree.Count - 1];
                    }
                    iInputBuffer <<= 1;
                }
            }
            throw new Exception();
        }

        private static void EncodeTree(Node node, int depth, uint value)
        {
            if (node != null)
            {
                if (node.IsLeaf)
                {
                    node.CodeLength = depth;
                    node.Coded = value;
                }
                else
                {
                    depth++;
                    value <<= 1;
                    EncodeTree(node.Left, depth, value);
                    EncodeTree(node.Right, depth, value | 1);
                }
            }
        }

        private void BuildTree(long[] frequencies)
        {
            var tiny = 0.1 / ushort.MaxValue;
            var fraction = 0.0;

            SortedDictionary<double, Node> trees = new SortedDictionary<double, Node>();
            for (int i = 0; i <= ushort.MaxValue; i++)
            {
                var leaf = new Node()
                {
                    Uncoded1 = (byte)(i >> 8),
                    Uncoded0 = (byte)(i & 255),
                    Frequency = frequencies[i]
                };
                HuffmanTree.Add(leaf);
                if (leaf.Frequency > 0)
                {
                    trees.Add(leaf.Frequency + (fraction += tiny), leaf);
                }
            }

            while (trees.Count > 1)
            {
                var e = trees.GetEnumerator();
                e.MoveNext();
                var first = e.Current;
                e.MoveNext();
                var second = e.Current;

                //Join smallest two nodes
                var NewParent = new Node();
                NewParent.Frequency = first.Value.Frequency + second.Value.Frequency;
                NewParent.Left = first.Value;
                NewParent.Right = second.Value;

                HuffmanTree.Add(NewParent);

                //Remove the two that just got joined into one
                trees.Remove(first.Key);
                trees.Remove(second.Key);

                trees.Add(NewParent.Frequency + (fraction += tiny), NewParent);
            }
        }
    }
}
Usage examples:
To create the dictionary from sample data:
var freqs = Huffman16.GetFrequencies(File.ReadAllBytes(@"D:\nodes"), true);
To initialize an encoder with a given dictionary:
var huff = new Huffman16(freqs);
And to do some compression:
var encoded = huff.Encode(raw);
And decompression:
var raw = huff.Decode(encoded);
The hard part in my mind is how you build your static dictionary. You don't want to use the LZW dictionary built from your sample data. LZW wastes a bunch of time learning, since it can't build the dictionary faster than the decompressor can (a token will only be used the second time it's seen by the compressor, so the decompressor can add it to its dictionary the first time it's seen). The flip side of this is that it's adding things to the dictionary that may never get used, just in case the string shows up again. (e.g., to have a token for 'stackoverflow' you'll also have entries for 'ac','ko','ve','rf' etc...)
However, looking at the raw token stream from an LZ77 algorithm could work well. You'll only see tokens for strings seen at least twice. You can then build a list of the most common tokens/strings to include in your dictionary.
Once you have a static dictionary, using LZW sans the dictionary update seems like an easy implementation, but to get the best compression I'd consider a static Huffman table instead of the traditional 12-bit fixed-size token (as George Phillips suggested). An LZW dictionary will burn tokens for all the sub-strings you may never actually encode (e.g., if you can encode 'stackoverflow', there will be tokens for 'st', 'sta', 'stac', 'stack', 'stacko' etc.).
At this point it really isn't LZW - what makes LZW clever is how the decompressor can build the same dictionary the compressor used only seeing the compressed data stream. Something you won't be using. But all LZW implementations have a state where the dictionary is full and is no longer updated, this is how you'd use it with your static dictionary.
LZW adds to the dictionary during decompression to ensure it has the same dictionary state as the compressor. Otherwise the decoding would not function properly.
However, if you were in a state where the dictionary was fixed then, yes, you would not need to add new codes.
Your approach will work reasonably well and it's easy to use existing tools to prototype and measure the results. That is, compress the example file and then the example and test data together. The size of the latter less the former will be the expected compressed size of a block.
LZW is a clever way to build up a dictionary on the fly and gives decent results. But a more thorough analysis of your typical data blocks is likely to generate a more efficient dictionary.
There's also room for improvement in how LZW represents compressed data. For instance, each dictionary reference could be Huffman coded to a closer-to-optimal length based on the expected frequency of its use. To be truly optimal, the codes should be arithmetic coded.
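That measurement is easy to script with an off-the-shelf compressor. A Python sketch (the file names are placeholders):

import zlib

# Compress the example alone, then example + block; the difference
# approximates the compressed size of the block given a shared dictionary.
example = open("example.bin", "rb").read()
block = open("block.bin", "rb").read()

base = len(zlib.compress(example, 9))
combined = len(zlib.compress(example + block, 9))
print("expected compressed block size:", combined - base, "bytes")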
I would look at your data to see if there's an obvious reason it's so easy to compress. You might be able to do something much simpler than LZ78. I've done both LZ77 (lookback) and LZ78 (dictionary).
Try running an LZ77 on your data. There's no dictionary with LZ77, so you could use a library without alteration. Deflate is an implementation of LZ77.
Your idea of using a common dictionary is a good one, but it's hard to know whether the files are similar to each other or just self-similar without doing some tests.
The right track is to use a library -- almost every modern language has a compression library: C#, Python, Perl, Java, VB.NET, whatever you use.
LZW saves some space by making the dictionary depend on previous input. It has an initial dictionary, and as you decompress, you add entries to it -- so the dictionary keeps growing. (I am omitting some details here, but this is the general idea.)
You can omit this step by supplying the whole (complete) dictionary as the initial one, but that would cost some space.
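To make the frozen-dictionary variant concrete, here is a minimal, untested Python sketch: greedy longest-match encoding against a fixed string-to-code table, with no updates during encode or decode. It assumes the table is prefix-closed (as LZW dictionaries are) and contains every single byte; building a good table from sample data is the hard part and is not shown.

def encode(data: bytes, table: dict) -> list:
    # table maps byte-strings to integer codes and must contain every
    # single byte so that any input can be encoded.
    out = []
    i = 0
    while i < len(data):
        match = data[i:i + 1]
        j = i + 2
        # Extend the match while the prefix is still in the table
        # (valid because LZW-style tables are prefix-closed).
        while j <= len(data) and data[i:j] in table:
            match = data[i:j]
            j += 1
        out.append(table[match])
        i += len(match)
    return out

def decode(codes: list, table: dict) -> bytes:
    inverse = {code: s for s, code in table.items()}
    return b"".join(inverse[c] for c in codes)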
I find this approach quite interesting for repeated log entries, and it is something I would like to explore.
Can you share the compression statistics for using this approach for your use case, so I can compare it with other alternatives?
Have you considered having the common dictionary grow over time, or is that not a valid option?

How do I create a URL shortener? [closed]

I want to create a URL shortener service where you can write a long URL into an input field and the service shortens the URL to "http://www.example.org/abcdef".
Instead of "abcdef" there can be any other string with six characters containing a-z, A-Z and 0-9. That makes 56~57 billion possible strings.
My approach:
I have a database table with three columns:
id, integer, auto-increment
long, string, the long URL the user entered
short, string, the shortened URL (or just the six characters)
I would then insert the long URL into the table. Then I would select the auto-increment value for "id" and build a hash of it. This hash should then be inserted as "short". But what sort of hash should I build? Hash algorithms like MD5 create strings that are too long, so I don't think I can use them. A self-built algorithm would work, too.
My idea:
For "http://www.google.de/" I get the auto-increment id 239472. Then I do the following steps:
short = '';
if divisible by 2, add "a"+the result to short
if divisible by 3, add "b"+the result to short
... until I have divisors for a-z and A-Z.
That could be repeated until the number isn't divisible any more. Do you think this is a good approach? Do you have a better idea?
Due to the ongoing interest in this topic, I've published an efficient solution to GitHub, with implementations for JavaScript, PHP, Python and Java. Add your solutions if you like :)
I would continue your "convert number to string" approach. However, you will realize that your proposed algorithm fails if your ID is a prime and greater than 52.
Theoretical background
You need a bijective function f. This is necessary so that you can find an inverse function g('abc') = 123 for your f(123) = 'abc' function. This means:
There must be no x1, x2 (with x1 ≠ x2) that make f(x1) = f(x2),
and for every y you must be able to find an x so that f(x) = y.
How to convert the ID to a shortened URL
Think of an alphabet we want to use. In your case, that's [a-zA-Z0-9]. It contains 62 letters.
Take an auto-generated, unique numerical key (the auto-incremented id of a MySQL table, for example).
For this example, I will use 125₁₀ (125 in base 10).
Now you have to convert 125₁₀ to X₆₂ (base 62).
125₁₀ = 2×62¹ + 1×62⁰ = [2,1]
This requires the use of integer division and modulo. A pseudo-code example:
digits = []
while num > 0
    remainder = modulo(num, 62)
    digits.push(remainder)
    num = divide(num, 62)
digits = digits.reverse
Now map the indices 2 and 1 to your alphabet. This is how your mapping (with an array, for example) could look:
0 → a
1 → b
...
25 → z
...
52 → 0
61 → 9
With 2 → c and 1 → b, you will receive cb₆₂ as the shortened URL.
http://shor.ty/cb
How to resolve a shortened URL to the initial ID
The reverse is even easier. You just do a reverse lookup in your alphabet.
e9a₆₂ will be resolved to "4th, 61st, and 0th letter in the alphabet".
e9a₆₂ = [4,61,0] = 4×62² + 61×62¹ + 0×62⁰ = 19158₁₀
Now find your database-record with WHERE id = 19158 and do the redirect.
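A compact Python sketch of the whole scheme (illustrative only; the alphabet order matches the mapping above, so encode(125) yields 'cb' and decode('e9a') yields 19158):

ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

def encode(n: int) -> str:
    # repeated divmod produces the base-62 digits, least significant first
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, 62)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def decode(s: str) -> int:
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n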
Example implementations (provided by commenters)
C++
Python
Ruby
Haskell
C#
CoffeeScript
Perl
Why would you want to use a hash?
You can just use a simple translation of your auto-increment value to an alphanumeric value. You can do that easily by using some base conversion. Say your character space (A-Z, a-z, 0-9, etc.) has 62 characters; convert the id to a base-62 number and use the characters as the digits.
public class UrlShortener {
    private static final String ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private static final int BASE = ALPHABET.length();

    public static String encode(int num) {
        StringBuilder sb = new StringBuilder();
        while (num > 0) {
            sb.append(ALPHABET.charAt(num % BASE));
            num /= BASE;
        }
        return sb.reverse().toString();
    }

    public static int decode(String str) {
        int num = 0;
        for (int i = 0; i < str.length(); i++)
            num = num * BASE + ALPHABET.indexOf(str.charAt(i));
        return num;
    }
}
Not an answer to your question, but I wouldn't use case-sensitive shortened URLs. They are hard to remember, usually unreadable (many fonts render 1 and l, 0 and O, and other characters so similarly that they are nearly impossible to tell apart) and downright error prone. Try to use lowercase or uppercase only.
Also, try to have a format where you mix the numbers and characters in a predefined form. There are studies showing that people tend to remember one form better than others (think of phone numbers, where the numbers are grouped in a specific form). Try something like num-char-char-num-char-char (see the sketch below). I know this will lower the combinations, especially if you don't have upper and lower case, but it would be more usable and therefore more useful.
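A Python sketch of that num-char-char-num-char-char idea, treating the ID as a mixed-radix number (the pattern and alphabets here are assumptions for illustration):

import string

# num-char-char-num-char-char: digit slots alternate with letter slots
PATTERN = [string.digits, string.ascii_lowercase, string.ascii_lowercase,
           string.digits, string.ascii_lowercase, string.ascii_lowercase]
CAPACITY = 10 * 26 * 26 * 10 * 26 * 26  # 45,697,600 distinct codes

def encode_mixed(n: int) -> str:
    assert 0 <= n < CAPACITY, "id out of range for this pattern"
    out = []
    for charset in reversed(PATTERN):  # peel least-significant digit first
        n, rem = divmod(n, len(charset))
        out.append(charset[rem])
    return "".join(reversed(out))

def decode_mixed(code: str) -> int:
    n = 0
    for charset, ch in zip(PATTERN, code):
        n = n * len(charset) + charset.index(ch)
    return n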
My approach: Take the database ID, then Base36-encode it. I would NOT use both upper- AND lowercase letters, because that makes transmitting those URLs over the telephone a nightmare, but you could of course easily extend the function to be a base-62 en/decoder.
Here is my PHP 5 class.
<?php
class Bijective
{
    public $dictionary = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    public function __construct()
    {
        $this->dictionary = str_split($this->dictionary);
    }

    public function encode($i)
    {
        if ($i == 0)
            return $this->dictionary[0];

        // Collect digits in an array (appending [] to a string is a fatal
        // error as of PHP 7.1).
        $result = array();
        $base = count($this->dictionary);

        while ($i > 0)
        {
            $result[] = $this->dictionary[($i % $base)];
            $i = floor($i / $base);
        }

        $result = array_reverse($result);

        return join("", $result);
    }

    public function decode($input)
    {
        $i = 0;
        $base = count($this->dictionary);

        $input = str_split($input);

        foreach ($input as $char)
        {
            $pos = array_search($char, $this->dictionary);
            $i = $i * $base + $pos;
        }

        return $i;
    }
}
A Node.js and MongoDB solution
We know the format that MongoDB uses to create a new 12-byte ObjectId:
a 4-byte value representing the seconds since the Unix epoch,
a 3-byte machine identifier,
a 2-byte process id,
a 3-byte counter (per machine), starting with a random value.
Example (I chose a random sequence):
a1b2c3d4e5f6g7h8i9j1k2l3
a1b2c3d4 represents the seconds since the Unix epoch,
e5f6g7 represents the machine identifier,
h8i9 represents the process id,
j1k2l3 represents the counter, starting with a random value.
Since the counter is unique when the data is stored on the same machine, we can take it and be confident that it will not be duplicated.
So the short URL will be the counter. Here is a code snippet, assuming that your server is running properly.
const mongoose = require('mongoose');
const Schema = mongoose.Schema;

// Create a schema
const shortUrl = new Schema({
    long_url: { type: String, required: true },
    short_url: { type: String, required: true, unique: true },
});
const ShortUrl = mongoose.model('ShortUrl', shortUrl);

// The user can request a short URL by providing a long URL using a form
app.post('/shorten', function(req, res) {
    // Create a new shortUrl.
    // The submit form has an input with longURL as its name attribute.
    const longUrl = req.body["longURL"];
    const newUrl = new ShortUrl({
        long_url: longUrl,
        short_url: "",
    });
    // The last six hex characters of the ObjectId are the 3-byte counter
    const shortUrl = newUrl._id.toString().slice(-6);
    newUrl.short_url = shortUrl;
    console.log(newUrl);
    newUrl.save(function(err) {
        console.log("the new URL is added");
    });
});
I keep incrementing an integer sequence per domain in the database and use Hashids to encode the integer into a URL path.
static hashids = Hashids(salt = "my app rocks", minSize = 6)
I ran a script to see how long it takes until it exhausts the character length. For six characters it can do 164,916,224 links and then moves up to seven characters. Bitly uses seven characters. Under five characters looks weird to me.
Hashids can decode the URL path back to an integer, but a simpler solution is to use the entire short link (sho.rt/ka8ds3) as a primary key.
Here is the full concept:
function addDomain(domain) {
    table("domains").insert("domain", domain, "seq", 0)
}

function addURL(domain, longURL) {
    seq = table("domains").where("domain = ?", domain).increment("seq")
    shortURL = domain + "/" + hashids.encode(seq)
    table("links").insert("short", shortURL, "long", longURL)
    return shortURL
}

// GET /:hashcode
function handleRequest(req, res) {
    shortURL = req.host + "/" + req.param("hashcode")
    longURL = table("links").where("short = ?", shortURL).get("long")
    res.redirect(301, longURL)
}
C# version:
public class UrlShortener
{
    private static String ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private static int BASE = 62;

    public static String encode(int num)
    {
        StringBuilder sb = new StringBuilder();
        while (num > 0)
        {
            sb.Append(ALPHABET[num % BASE]);
            num /= BASE;
        }
        StringBuilder builder = new StringBuilder();
        for (int i = sb.Length - 1; i >= 0; i--)
        {
            builder.Append(sb[i]);
        }
        return builder.ToString();
    }

    public static int decode(String str)
    {
        int num = 0;
        for (int i = 0, len = str.Length; i < len; i++)
        {
            num = num * BASE + ALPHABET.IndexOf(str[i]);
        }
        return num;
    }
}
You could hash the entire URL, but if you just want to shorten the id, do as marcel suggested. I wrote this Python implementation:
https://gist.github.com/778542
Take a look at https://hashids.org/ — it is open source and available in many languages.
Their page outlines some of the pitfalls of other approaches.
If you don't want to re-invent the wheel ... http://lilurl.sourceforge.net/
// simple approach
$original_id = 56789;
$shortened_id = base_convert($original_id, 10, 36);
$un_shortened_id = base_convert($shortened_id, 36, 10);
alphabet = [chr(c) for c in range(97, 123)] + [chr(c) for c in range(65, 91)] + [str(d) for d in range(10)]

def lookup(k, a=alphabet):
    if type(k) == int:
        return a[k]
    elif type(k) == str:
        return a.index(k)

def encode(i, a=alphabet):
    '''Takes an integer and returns it in base 62 with mappings for upper/lower case letters and numbers 0-9.'''
    try:
        i = int(i)
    except Exception:
        raise TypeError("Input must be an integer.")

    def incode(i=i, p=1, a=a):
        # Here to protect p.
        if i <= 61:
            return lookup(i)
        pval = pow(62, p)
        nval = i // pval
        if nval <= 61:
            # zero-pad the remainder to p digits so interior zeros survive
            return lookup(nval) + incode(i % pval).rjust(p, a[0])
        else:
            return incode(i, p + 1)

    return incode()

def decode(s, a=alphabet):
    '''Takes a base 62 string in our alphabet and returns it in base 10.'''
    try:
        s = str(s)
    except Exception:
        raise TypeError("Input must be a string.")
    return sum([lookup(c) * pow(62, p) for p, c in enumerate(reversed(s))])
Here's my version for whoever needs it.
Why not just translate your id to a string? You just need a function that maps a digit between, say, 0 and 61 to a single letter (upper/lower case) or digit. Then apply this to create, say, 4-letter codes, and you've got 14.7 million URLs covered.
Here is a decent URL encoding function for PHP...
// From http://snipplr.com/view/22246/base62-encode--decode/
private function base_encode($val, $base = 62, $chars = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ')
{
    $str = '';
    do {
        $i = fmod($val, $base);
        $str = $chars[$i] . $str;
        $val = ($val - $i) / $base;
    } while ($val > 0);
    return $str;
}
Don't know if anyone will find this useful - it is more of a 'hack n slash' method, yet it is simple and works nicely if you want only specific chars.
$dictionary = "abcdfghjklmnpqrstvwxyz23456789";
$dictionary = str_split($dictionary);
// Encode
$str_id = '';
$base = count($dictionary);
while($id > 0) {
$rem = $id % $base;
$id = ($id - $rem) / $base;
$str_id .= $dictionary[$rem];
}
// Decode
$id_ar = str_split($str_id);
$id = 0;
for($i = count($id_ar); $i > 0; $i--) {
$id += array_search($id_ar[$i-1], $dictionary) * pow($base, $i - 1);
}
Did you omit O, 0, and i on purpose?
I just created a PHP class based on Ryan's solution.
<?php
$shorty = new App_Shorty();

echo 'ID: ' . 1000;
echo '<br/> Short link: ' . $shorty->encode(1000);
echo '<br/> Decoded Short Link: ' . $shorty->decode($shorty->encode(1000));

/**
 * A nice shortening class based on Ryan Charmley's suggestion; see the link on Stack Overflow below.
 * @author Svetoslav Marinov (Slavi) | http://WebWeb.ca
 * @see http://stackoverflow.com/questions/742013/how-to-code-a-url-shortener/10386945#10386945
 */
class App_Shorty {
    /**
     * Explicitly omitted: i, o, 1, 0 because they are confusing. Also use only lowercase ... as
     * dictating this over the phone might be tough.
     * @var string
     */
    private $dictionary = "abcdfghjklmnpqrstvwxyz23456789";
    private $dictionary_array = array();

    public function __construct() {
        $this->dictionary_array = str_split($this->dictionary);
    }

    /**
     * Gets ID and converts it into a string.
     * @param int $id
     */
    public function encode($id) {
        $str_id = '';
        $base = count($this->dictionary_array);

        while ($id > 0) {
            $rem = $id % $base;
            $id = ($id - $rem) / $base;
            $str_id .= $this->dictionary_array[$rem];
        }

        return $str_id;
    }

    /**
     * Converts /abc into an integer ID
     * @param string $str_id
     * @return int
     */
    public function decode($str_id) {
        $id = 0;
        $id_ar = str_split($str_id);
        $base = count($this->dictionary_array);

        for ($i = count($id_ar); $i > 0; $i--) {
            $id += array_search($id_ar[$i - 1], $this->dictionary_array) * pow($base, $i - 1);
        }

        return $id;
    }
}
?>
public class TinyUrl {
    private final String characterMap = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
    private final int charBase = characterMap.length();

    public String covertToCharacter(int num) {
        StringBuilder sb = new StringBuilder();
        while (num > 0) {
            sb.append(characterMap.charAt(num % charBase));
            num /= charBase;
        }
        return sb.reverse().toString();
    }

    public int covertToInteger(String str) {
        int num = 0;
        for (int i = 0; i < str.length(); i++)
            num += characterMap.indexOf(str.charAt(i)) * Math.pow(charBase, (str.length() - (i + 1)));
        return num;
    }
}

class TinyUrlTest {
    public static void main(String[] args) {
        TinyUrl tinyUrl = new TinyUrl();
        int num = 122312215;
        String url = tinyUrl.covertToCharacter(num);
        System.out.println("Tiny url: " + url);
        System.out.println("Id: " + tinyUrl.covertToInteger(url));
    }
}
This is what I use:
# Generate a [0-9a-zA-Z] alphabet
ALPHABET = [str(d) for d in range(10)] + [chr(c) for c in range(97, 123)] + [chr(c) for c in range(65, 91)]

def encode_id(id_number, alphabet=ALPHABET):
    """Convert an integer to a string."""
    if id_number == 0:
        return alphabet[0]
    alphabet_len = len(alphabet)  # Cache
    result = ''
    while id_number > 0:
        id_number, mod = divmod(id_number, alphabet_len)
        result = alphabet[mod] + result
    return result

def decode_id(id_string, alphabet=ALPHABET):
    """Convert a string to an integer."""
    alphabet_len = len(alphabet)  # Cache
    return sum([alphabet.index(char) * pow(alphabet_len, power) for power, char in enumerate(reversed(id_string))])
It's very fast and can take long integers.
For a similar project, to get a new key I wrap a random string generator in a function that calls the generator until I get a string that hasn't already been used in my hashtable (see the sketch below). This method will slow down once your namespace starts to fill up, but as you have said, even with only 6 characters you have plenty of namespace to work with.
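A minimal sketch of such a wrapper (the generator, key length, and the in-memory `used` set are stand-ins for the real store):

import random
import string

used = set()  # stand-in for the persistent hashtable of issued keys

def generate(length=6):
    return "".join(random.choices(string.ascii_letters + string.digits, k=length))

def new_key(length=6):
    # retry until we draw a key that has not been issued yet
    while True:
        key = generate(length)
        if key not in used:
            used.add(key)
            return key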
I have a variant of the problem, in that I store web pages from many different authors and need to prevent discovery of pages by guesswork. So my short URLs add a couple of extra digits to the base-62 string for the page number. These extra digits are generated from information in the page record itself, and they ensure that only 1 in 3844 URLs is valid (assuming 2-digit base-62). You can see an outline description at http://mgscan.com/MBWL.
Very good answer. I have created a Go implementation of the bijective function (bjf):
package bjf

import (
    "math"
    "strconv"
    "strings"
)

const alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

func Encode(num string) string {
    n, _ := strconv.ParseUint(num, 10, 64)
    t := make([]byte, 0)

    /* Special case */
    if n == 0 {
        return string(alphabet[0])
    }

    /* Map */
    for n > 0 {
        r := n % uint64(len(alphabet))
        t = append(t, alphabet[r])
        n = n / uint64(len(alphabet))
    }

    /* Reverse */
    for i, j := 0, len(t)-1; i < j; i, j = i+1, j-1 {
        t[i], t[j] = t[j], t[i]
    }

    return string(t)
}

func Decode(token string) int {
    r := int(0)
    p := float64(len(token)) - 1
    for i := 0; i < len(token); i++ {
        r += strings.Index(alphabet, string(token[i])) * int(math.Pow(float64(len(alphabet)), p))
        p--
    }
    return r
}
Hosted at github: https://github.com/xor-gate/go-bjf
Implementation in Scala:

class Encoder(alphabet: String) extends (Long => String) {
    val Base = alphabet.size

    override def apply(number: Long) = {
        def encode(current: Long): List[Int] = {
            if (current == 0) Nil
            else (current % Base).toInt :: encode(current / Base)
        }
        encode(number).reverse
            .map(current => alphabet.charAt(current)).mkString
    }
}

class Decoder(alphabet: String) extends (String => Long) {
    val Base = alphabet.size

    override def apply(string: String) = {
        def decode(current: Long, encodedPart: String): Long = {
            if (encodedPart.size == 0) current
            else decode(current * Base + alphabet.indexOf(encodedPart.head), encodedPart.tail)
        }
        decode(0, string)
    }
}

Test example with ScalaTest:

import org.scalatest.{FlatSpec, Matchers}

class DecoderAndEncoderTest extends FlatSpec with Matchers {
    val Alphabet = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"

    "A number with base 10" should "be correctly encoded into base 62 string" in {
        val encoder = new Encoder(Alphabet)
        encoder(127) should be ("cd")
        encoder(543513414) should be ("KWGPy")
    }

    "A base 62 string" should "be correctly decoded into a number with base 10" in {
        val decoder = new Decoder(Alphabet)
        decoder("cd") should be (127)
        decoder("KWGPy") should be (543513414)
    }
}
Function based on Xeoncross' class:

function shortly($input) {
    $dictionary = ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z','A','B','C','D','E','F','G','H','I','J','K','L','M','N','O','P','Q','R','S','T','U','V','W','X','Y','Z','0','1','2','3','4','5','6','7','8','9'];

    if ($input === 0)
        return $dictionary[0];

    $base = count($dictionary);

    if (is_numeric($input)) {
        $result = [];
        while ($input > 0) {
            $result[] = $dictionary[($input % $base)];
            $input = floor($input / $base);
        }
        return join("", array_reverse($result));
    }

    $i = 0;
    $input = str_split($input);
    foreach ($input as $char) {
        $pos = array_search($char, $dictionary);
        $i = $i * $base + $pos;
    }
    return $i;
}
Here is a Node.js implementation that is likely close to what bit.ly does: generate a highly random seven-character string.
It uses Node.js crypto to generate a highly random 25-character hex string, then randomly selects seven characters from it.

var crypto = require("crypto");

exports.shortURL = new function () {
    this.getShortURL = function () {
        var sURL = '',
            _rand = crypto.randomBytes(25).toString('hex'),
            _base = _rand.length;
        for (var i = 0; i < 7; i++)
            sURL += _rand.charAt(Math.floor(Math.random() * _rand.length));
        return sURL;
    };
}
My Python 3 version:

base_list = list("0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
base = len(base_list)

def encode(num: int):
    result = []
    if num == 0:
        result.append(base_list[0])
    while num > 0:
        result.append(base_list[num % base])
        num //= base
    print("".join(reversed(result)))

def decode(code: str):
    num = 0
    code_list = list(code)
    for index, code in enumerate(reversed(code_list)):
        num += base_list.index(code) * base ** index
    print(num)

if __name__ == '__main__':
    encode(341413134141)
    decode("60FoItT")
For a quality Node.js / JavaScript solution, see the id-shortener module, which is thoroughly tested and has been used in production for months.
It provides an efficient id / URL shortener backed by pluggable storage defaulting to Redis, and you can even customize your short id character set and whether or not shortening is idempotent. This is an important distinction that not all URL shorteners take into account.
In relation to other answers here, this module implements Marcel Jackwerth's excellent accepted answer above.
The core of the solution is provided by the following Redis Lua snippet:
local sequence = redis.call('incr', KEYS[1])
local chars = '0123456789ABCDEFGHJKLMNPQRSTUVWXYZ_abcdefghijkmnopqrstuvwxyz'
local remaining = sequence
local slug = ''
while (remaining > 0) do
    local d = (remaining % 60)
    local character = string.sub(chars, d + 1, d + 1)
    slug = character .. slug
    remaining = (remaining - d) / 60
end
redis.call('hset', KEYS[2], slug, ARGV[1])
return slug
Why not just generate a random string and append it to the base URL? This is a very simplified version of doing this in C#.
static string chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890";
static string baseUrl = "https://google.com/";

private static string RandomString(int length)
{
    char[] s = new char[length];
    Random rnd = new Random();
    for (int x = 0; x < length; x++)
    {
        s[x] = chars[rnd.Next(chars.Length)];
    }
    // Sleep so consecutive calls don't seed Random with the same time
    Thread.Sleep(10);
    return new String(s);
}
Then just append the random string to the baseUrl:
string tinyURL = baseUrl + RandomString(5);
Remember, this is a very simplified version, and it's possible the RandomString method could create duplicate strings. In production you would want to account for duplicate strings to ensure you always have a unique URL. I have some code that handles duplicate strings by querying a database table, which I could share if anyone is interested.
These are my initial thoughts; more thinking can be done, or some simulation could be run to see whether it works well or needs improvement:
My answer is to store the long URL in the database and use an ID from 0 to 9999999999999999 (or however large a number is needed).
But the ID range 0 to 9999999999999999 raises two issues:
the IDs can be shorter if we use hexadecimal, or even base62 or base64 (base64 just like YouTube, using A-Z, a-z, 0-9, _ and -)
if the IDs increase uniformly from 0 to 9999999999999999, hackers can visit them in that order and see what URLs people are sending each other, so it can be a privacy issue
We can do this:
have one server allocate IDs in batches: give 0 to 999 to Server A, so Server A now owns 1000 such IDs. If there are 20 or 200 servers constantly wanting new IDs, each doesn't have to keep asking for every new ID, but rather asks once for 1000 IDs
for the ID 1, for example, reverse the bits, so 000...00000001 becomes 10000...000; then, when converted to base64, the IDs will be non-uniformly increasing
use XOR to flip the bits of the final IDs, for example XOR with 0xD5AA96...2373 (like a secret key): wherever the secret key has a 1 bit, it flips that bit of the ID. This makes the IDs even harder to guess and appear more random (see the sketch below)
Following this scheme, the single server that allocates the IDs can form the final IDs, and so can the 20 or 200 servers requesting batches. The allocating server has to use a lock / semaphore to prevent two requesting servers from getting the same batch (or, if it accepts one connection at a time, that already solves the problem). We don't want the line (queue) for allocations to grow too long, which is why allocating 1000 or 10000 IDs at a time solves the issue.
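A small Python sketch of the bit-reversal and XOR steps for 64-bit IDs (the constant below is an arbitrary stand-in for the partially elided secret key above; both steps are invertible, so the mapping stays one-to-one):

# Hypothetical illustration of the obfuscation steps described above.
SECRET = 0x0123456789ABCDEF  # stand-in; use your own fixed 64-bit secret

def reverse_bits64(n: int) -> int:
    # Reverse the 64-bit binary representation of n.
    return int(format(n, "064b")[::-1], 2)

def obfuscate(seq_id: int) -> int:
    return reverse_bits64(seq_id) ^ SECRET

def deobfuscate(obf: int) -> int:
    # XOR is self-inverse, and so is the bit reversal.
    return reverse_bits64(obf ^ SECRET)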
