How do I reverse a UTF-8 string in place?

Recently, someone asked about an algorithm for reversing a string in place in C. Most of the proposed solutions had trouble dealing with strings containing multi-byte characters. So I was wondering what a good algorithm for dealing specifically with UTF-8 strings might look like.
I came up with some code, which I'm posting as an answer, but I'd be glad to see other people's ideas or suggestions. I preferred to use actual code, so I've chosen C#, as it seems to be one of the most popular languages on this site, but I don't mind if your code is in another language, as long as it can be reasonably understood by anyone who is familiar with an imperative language. And, as this is intended to show how such an algorithm could be implemented at a low level (by low-level I just mean dealing with bytes), the idea is to avoid using libraries for the core code.
Notes:
I'm interested in the algorithm itself, its performance, and how it could be optimized (I mean algorithm-level optimization, not replacing i++ with ++i and such; I'm not really interested in actual benchmarks either).
I don't mean to actually use it in production code or to "reinvent the wheel". This is just out of curiosity and as an exercise.
I'm using C# byte arrays, so I'm assuming you can get the length of the string without running through it until you find a NUL.
That is, I'm not accounting for the complexity of finding the length of the string. But if you're using C, for instance, you can factor that out by calling strlen() before the core code.
Edit:
As Mike F points out, my code (and other people's code posted here) does not deal with composite characters. Some info about those here. I'm not familiar with the concept, but if that means there are "combining characters", i.e., characters / code points that are only valid in combination with other "base" characters / code points, then a look-up table of such characters could be used to preserve the order of the "global" character ("base" + "combining" characters) when reversing.

I'd make one pass reversing the bytes, then a second pass that reverses the bytes of each multibyte character (which are easily detected in UTF-8) back to their correct order.
You can definitely handle this inline in a single pass, but I wouldn't bother unless the routine became a bottleneck.

This code assumes that the input UTF-8 string is valid and well-formed (i.e., at most 4 bytes per multibyte character):
#include "string.h"
void utf8rev(char *str)
{
/* this assumes that str is valid UTF-8 */
char *scanl, *scanr, *scanr2, c;
/* first reverse the string */
for (scanl= str, scanr= str + strlen(str); scanl < scanr;)
c= *scanl, *scanl++= *--scanr, *scanr= c;
/* then scan all bytes and reverse each multibyte character */
for (scanl= scanr= str; c= *scanr++;) {
if ( (c & 0x80) == 0) // ASCII char
scanl= scanr;
else if ( (c & 0xc0) == 0xc0 ) { // start of multibyte
scanr2= scanr;
switch (scanr - scanl) {
case 4: c= *scanl, *scanl++= *--scanr, *scanr= c; // fallthrough
case 3: // fallthrough
case 2: c= *scanl, *scanl++= *--scanr, *scanr= c;
}
scanr= scanl= scanr2;
}
}
}
// quick and dirty main for testing purposes
#include <stdio.h>

int main(int argc, char *argv[])
{
    char buffer[256];
    buffer[sizeof(buffer) - 1] = '\0';
    while (--argc > 0) {
        strncpy(buffer, argv[argc], sizeof(buffer) - 1); // don't overwrite final null
        printf("%s → ", buffer);
        utf8rev(buffer);
        printf("%s\n", buffer);
    }
    return 0;
}
If you compile this program (example name: so199260.c) and run it in a UTF-8 environment (a Linux installation in this case):
$ so199260 γεια και χαρά français АДЖИ a♠♡♢♣b
a♠♡♢♣b → b♣♢♡♠a
АДЖИ → ИЖДА
français → siaçnarf
χαρά → άραχ
και → ιακ
γεια → αιεγ
If the code is too cryptic, I will happily clarify.

Agree that your approach is the only sane way to do it in place.
Personally, I don't like revalidating UTF-8 inside every function that deals with it, and I generally only do what's needed to avoid crashes; it adds up to a lot less code. Dunno much C#, so here it is in C:
(edited to eliminate strlen)
void reverse( char *start, char *end )
{
    while( start < end )
    {
        char c = *start;
        *start++ = *end;
        *end-- = c;
    }
}

char *reverse_char( char *start )
{
    char *end = start;
    while( (end[1] & 0xC0) == 0x80 ) end++;
    reverse( start, end );
    return end + 1;
}

void reverse_string( char *string )
{
    char *end = string;
    while( *end ) end = reverse_char( end );
    reverse( string, end - 1 );
}
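For reference, a minimal driver for the functions above (my illustration; it assumes a UTF-8 terminal and, as discussed, no combining characters):

#include <stdio.h>

int main(void)
{
    char buf[] = "français";
    reverse_string(buf);
    printf("%s\n", buf);   /* prints "siaçnarf" */
    return 0;
}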

My initial approach can be summarized this way:
1) Reverse the bytes naively.
2) Run through the string backwards and fix the UTF-8 sequences as you go.
Illegal sequences are dealt with in the second step; in the first step, we check whether the string is in "sync" (that is, whether it starts with a legal leading byte).
EDIT: improved validation for leading byte in Reverse()
class UTF8Utils {
    public static void Reverse(byte[] str) {
        int len = str.Length;
        int i = 0;
        int j = len - 1;

        // first, check if the string is "synced", i.e., it starts
        // with a valid leading byte. Will check for illegal
        // sequences thru the whole string later.
        byte leadChar = str[0];

        // if it starts with 10xx xxxx, it's a trailing byte...
        // if it starts with 1111 10xx or 1111 110x,
        // it's out of the 4-byte range.
        // EDIT: added validation for 7-byte seqs and 0xff
        if( (leadChar & 0xc0) == 0x80 ||
            (leadChar & 0xfc) == 0xf8 ||
            (leadChar & 0xfe) == 0xfc ||
            (leadChar & 0xff) == 0xfe ||
            leadChar == 0xff) {
            throw new Exception("Illegal UTF-8 sequence");
        }

        // reverse bytes in-place naïvely
        while(i < j) {
            byte tmp = str[i];
            str[i] = str[j];
            str[j] = tmp;
            i++;
            j--;
        }

        // now, run the string again to fix the multibyte sequences
        UTF8Utils.ReverseMbSequences(str);
    }

    private static void ReverseMbSequences(byte[] str) {
        int i = str.Length - 1;
        byte leadChar = 0;
        int nBytes = 0;

        // loop backwards thru the reversed buffer
        while(i >= 0) {
            // since the first byte in the unreversed buffer is assumed to be
            // a leading byte, it seems safe to assume that the last byte is
            // now a leading byte. (Given that the string is not out of
            // sync -- we checked that already)
            leadChar = str[i];

            // check how many bytes this sequence takes and validate against
            // illegal sequences
            if(leadChar < 0x80) {
                nBytes = 1;
            } else if((leadChar & 0xe0) == 0xc0) {
                if((str[i-1] & 0xc0) != 0x80) {
                    throw new Exception("Illegal UTF-8 sequence");
                }
                nBytes = 2;
            } else if ((leadChar & 0xf0) == 0xe0) {
                if((str[i-1] & 0xc0) != 0x80 ||
                   (str[i-2] & 0xc0) != 0x80 ) {
                    throw new Exception("Illegal UTF-8 sequence");
                }
                nBytes = 3;
            } else if ((leadChar & 0xf8) == 0xf0) {
                if((str[i-1] & 0xc0) != 0x80 ||
                   (str[i-2] & 0xc0) != 0x80 ||
                   (str[i-3] & 0xc0) != 0x80 ) {
                    throw new Exception("Illegal UTF-8 sequence");
                }
                nBytes = 4;
            } else {
                throw new Exception("Illegal UTF-8 sequence");
            }

            // now, reverse the current sequence and then continue
            // with the next one
            int back = i;
            int front = back - nBytes + 1;
            while(front < back) {
                byte tmp = str[front];
                str[front] = str[back];
                str[back] = tmp;
                front++;
                back--;
            }
            i -= nBytes;
        }
    }
}

The best solution:
1) Convert to a wide char string
2) Reverse the new string
Never, never, never, never treat single bytes as characters.
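A minimal sketch of that approach in C (not strictly in place, since it allocates a temporary wide buffer; it assumes a UTF-8 locale and, like the other answers here, ignores combining characters; the name reverse_via_wchar is just for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <locale.h>
#include <wchar.h>

/* reverse str code-point-wise via a wide-char round trip */
int reverse_via_wchar(char *str, size_t bufsize)
{
    size_t n = mbstowcs(NULL, str, 0);            /* count wide chars */
    if (n == (size_t)-1) return -1;               /* invalid sequence */
    wchar_t *wide = malloc((n + 1) * sizeof *wide);
    if (wide == NULL) return -1;
    mbstowcs(wide, str, n + 1);
    for (size_t i = 0, j = n; i + 1 < j; i++, j--) {  /* reverse wide chars */
        wchar_t t = wide[i];
        wide[i] = wide[j - 1];
        wide[j - 1] = t;
    }
    size_t written = wcstombs(str, wide, bufsize);    /* convert back */
    free(wide);
    return written == (size_t)-1 ? -1 : 0;
}

int main(void)
{
    setlocale(LC_ALL, "");    /* pick up the UTF-8 locale */
    char buf[64] = "γεια";
    if (reverse_via_wchar(buf, sizeof buf) == 0)
        printf("%s\n", buf);  /* prints "αιεγ" */
    return 0;
}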

Related

Find word in string buffer/paragraph/text

This was asked in an Amazon telephone interview: "Can you write a program (in your preferred language C/C++/etc.) to find a given word in a big string buffer? i.e., the number of occurrences."
I am still looking for the perfect answer which I should have given to the interviewer. I tried to write a linear search (char-by-char comparison) and obviously I was rejected.
Given 40-45 minutes for a telephone interview, what was the perfect algorithm he/she was looking for?
The KMP algorithm is a popular string-matching algorithm.
KMP Algorithm
Checking char by char is inefficient. If the string has 1000 characters and the keyword has 100 characters, you don't want to perform unnecessary comparisons. The KMP algorithm handles many cases which can occur, but I imagine the interviewer was looking for the case where, when you begin (pass 1), the first 99 characters match but the 100th character doesn't. Now, for pass 2, instead of performing the entire comparison from character 2, you have enough information to deduce where the next possible match can begin.
// C program for implementation of the KMP pattern searching algorithm
#include <stdio.h>
#include <string.h>
#include <stdlib.h>

void computeLPSArray(char *pat, int M, int *lps);

void KMPSearch(char *pat, char *txt)
{
    int M = strlen(pat);
    int N = strlen(txt);

    // create lps[] that will hold the longest prefix-suffix
    // values for the pattern
    int *lps = (int *)malloc(sizeof(int) * M);
    int j = 0; // index for pat[]

    // Preprocess the pattern (calculate lps[] array)
    computeLPSArray(pat, M, lps);

    int i = 0; // index for txt[]
    while (i < N)
    {
        if (pat[j] == txt[i])
        {
            j++;
            i++;
        }
        if (j == M)
        {
            printf("Found pattern at index %d \n", i - j);
            j = lps[j - 1];
        }
        // mismatch after j matches
        else if (i < N && pat[j] != txt[i])
        {
            // Do not match lps[0..lps[j-1]] characters,
            // they will match anyway
            if (j != 0)
                j = lps[j - 1];
            else
                i = i + 1;
        }
    }
    free(lps); // to avoid memory leak
}

void computeLPSArray(char *pat, int M, int *lps)
{
    int len = 0; // length of the previous longest prefix suffix
    int i;

    lps[0] = 0; // lps[0] is always 0
    i = 1;

    // the loop calculates lps[i] for i = 1 to M-1
    while (i < M)
    {
        if (pat[i] == pat[len])
        {
            len++;
            lps[i] = len;
            i++;
        }
        else // (pat[i] != pat[len])
        {
            if (len != 0)
            {
                // This is tricky. Consider the example
                // AAACAAAA and i = 7.
                len = lps[len - 1];
                // Also, note that we do not increment i here
            }
            else // if (len == 0)
            {
                lps[i] = 0;
                i++;
            }
        }
    }
}

// Driver program to test the above functions
int main()
{
    char *txt = "ABABDABACDABABCABAB";
    char *pat = "ABABCABAB";
    KMPSearch(pat, txt);
    return 0;
}
This code is taken from a really good site that teaches algorithms:
Geeks for Geeks KMP
Amazon and similar companies expect knowledge of the Boyer–Moore string search and/or Knuth–Morris–Pratt algorithms.
Those are good if you want to show perfect knowledge. Otherwise, try to be creative and write something relatively elegant and efficient.
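For reference, a minimal Boyer–Moore–Horspool sketch (a common simplification of full Boyer–Moore), counting occurrences the way the question asks; this is an illustration under those assumptions, not a canonical implementation:

#include <stdio.h>
#include <string.h>

/* count occurrences of pat in txt using the Horspool bad-character shift */
int horspool_count(const char *pat, const char *txt)
{
    size_t M = strlen(pat), N = strlen(txt), shift[256];
    if (M == 0 || M > N) return 0;
    for (size_t k = 0; k < 256; k++) shift[k] = M;   /* default: full shift */
    for (size_t k = 0; k + 1 < M; k++)
        shift[(unsigned char)pat[k]] = M - 1 - k;    /* distance from the end */
    int count = 0;
    size_t i = 0;
    while (i + M <= N) {
        if (memcmp(txt + i, pat, M) == 0) count++;   /* window matches */
        i += shift[(unsigned char)txt[i + M - 1]];   /* skip by last window char */
    }
    return count;
}

int main(void)
{
    printf("%d\n", horspool_count("ABABCABAB", "ABABDABACDABABCABAB")); /* 1 */
    return 0;
}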
Did you ask about delimiters before you wrote anything? Knowing them might simplify your task or provide some extra information about the string buffer.
Even the code below could be OK (it's really not) if you provide enough information in advance and properly explain the runtime, the space requirements, and the choice of data containers.
#include <string>
#include <sstream>
#include <unordered_map>

int find( std::string & the_word, std::string & text )
{
    std::stringstream ss( text ); // !!! could be a really bad idea if 'text' is really big
    std::string word;
    std::unordered_map< std::string, int > umap;
    while( ss >> word ) ++umap[word]; // you have to assume that words are separated by whitespace
    return umap[the_word];
}

Algorithm Challenge: Arbitrary in-place base conversion for lossless string compression

It might help to start out with a real-world example. Say I'm writing a web app that's backed by MongoDB, so my records have a long hex primary key, making my URL to view a record look like /widget/55c460d8e2d6e59da89d08d0. That seems excessively long. URLs can use many more characters than that. While there are just under 8 x 10^28 (16^24) possible values in a 24-digit hex number, if you limit yourself just to the characters matched by the regex class [a-zA-Z0-9] (a YouTube video id uses more), 62 characters, you can get past 8 x 10^28 in only 17 characters.
I want an algorithm that will convert any string that is limited to a specific alphabet of characters to any other string with another alphabet of characters, where the value of each character c could be thought of as alphabet.indexOf(c).
Something of the form:
convert(value, sourceAlphabet, destinationAlphabet)
Assumptions
all parameters are strings
every character in value exists in sourceAlphabet
every character in sourceAlphabet and destinationAlphabet is unique
Simplest example
var hex = "0123456789abcdef";
var base10 = "0123456789";
var result = convert("12245589", base10, hex); // result is "bada55";
But I also want it to work to convert War & Peace from the Russian alphabet plus some punctuation to the entire Unicode charset and back again losslessly.
Is this possible?
The only way I was ever taught to do base conversions in Comp Sci 101 was to first convert to a base-ten integer by summing digit * base^position, and then to do the reverse to convert to the target base. Such a method is insufficient for the conversion of very long strings, because the integers get too big.
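For concreteness, here is a sketch of that Comp Sci 101 method (in C rather than JavaScript; the function names are just for illustration). The unsigned 64-bit intermediate is exactly what overflows for long strings:

#include <stdio.h>
#include <string.h>

/* sum digit * base^position into one integer -- overflows for long inputs */
unsigned long long to_int(const char *value, const char *alphabet)
{
    unsigned long long n = 0, base = strlen(alphabet);
    for (const char *p = value; *p; p++)
        n = n * base + (unsigned long long)(strchr(alphabet, *p) - alphabet);
    return n;
}

/* emit the integer as digits of the destination alphabet */
void from_int(unsigned long long n, const char *alphabet, char *out)
{
    unsigned long long base = strlen(alphabet);
    char tmp[65];
    int i = 0;
    do { tmp[i++] = alphabet[n % base]; n /= base; } while (n);
    while (i) *out++ = tmp[--i];   /* digits come out reversed */
    *out = '\0';
}

int main(void)
{
    char result[65];
    from_int(to_int("12245589", "0123456789"), "0123456789abcdef", result);
    printf("%s\n", result);        /* prints "bada55" */
    return 0;
}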
It certainly feels intuitive that a base conversion could be done in place as you step through the string (probably backwards, to maintain standard significant-digit order), keeping track of a remainder somehow, but I'm not smart enough to work out how.
That's where you come in, StackOverflow. Are you smart enough?
Perhaps this is a solved problem, done on paper by some 18th century mathematician, implemented in LISP on punch cards in 1970 and the first homework assignment in Cryptography 101, but my searches have borne no fruit.
I'd prefer a solution in JavaScript with a functional style, but any language or style will do, as long as you're not cheating with some big-integer library. Bonus points for efficiency, of course.
Please refrain from criticizing the original example. The general nerd cred of solving the problem is more important than any application of the solution.
Here is a solution in C that is very fast, using bit shift operations. It assumes that you know what the length of the decoded string should be. The strings are vectors of integers in the range 0..maximum for each alphabet. It is up to the user to convert to and from strings with restricted ranges of characters. As for the "in-place" in the question title, the source and destination vectors can overlap, but only if the source alphabet is not larger than the destination alphabet.
/*
recode version 1.0, 22 August 2015
Copyright (C) 2015 Mark Adler
This software is provided 'as-is', without any express or implied
warranty. In no event will the authors be held liable for any damages
arising from the use of this software.
Permission is granted to anyone to use this software for any purpose,
including commercial applications, and to alter it and redistribute it
freely, subject to the following restrictions:
1. The origin of this software must not be misrepresented; you must not
claim that you wrote the original software. If you use this software
in a product, an acknowledgment in the product documentation would be
appreciated but is not required.
2. Altered source versions must be plainly marked as such, and must not be
misrepresented as being the original software.
3. This notice may not be removed or altered from any source distribution.
Mark Adler
madler#alumni.caltech.edu
*/
/* Recode a vector from one alphabet to another using intermediate
variable-length bit codes. */
/* The approach is to use a Huffman code over equiprobable alphabets in two
directions. First to encode the source alphabet to a string of bits, and
second to encode the string of bits to the destination alphabet. This will
be reasonably close to the efficiency of base-encoding with arbitrary
precision arithmetic. */
#include <stddef.h> // size_t
#include <limits.h> // UINT_MAX, ULLONG_MAX
#if UINT_MAX == ULLONG_MAX
# error recode() assumes that long long has more bits than int
#endif
/* Take a list of integers source[0..slen-1], all in the range 0..smax, and
code them into dest[0..*dlen-1], where each value is in the range 0..dmax.
*dlen returns the length of the result, which will not exceed the value of
*dlen when called. If the original *dlen is not large enough to hold the
full result, then recode() will return non-zero to indicate failure.
Otherwise recode() will return 0. recode() will also return non-zero if
either of the smax or dmax parameters are less than one. The non-zero
return codes are 1 if *dlen is not long enough, 2 for invalid parameters,
and 3 if any of the elements of source are greater than smax.
Using this same operation on the result with smax and dmax reversed reverses
the operation, restoring the original vector. However there may be more
symbols returned than the original, so the number of symbols expected needs
to be known for decoding. (An end symbol could be appended to the source
alphabet to include the length in the coding, but then encoding and decoding
would no longer be symmetric, and the coding efficiency would be reduced.
This is left as an exercise for the reader if that is desired.) */
int recode(unsigned *dest, size_t *dlen, unsigned dmax,
           const unsigned *source, size_t slen, unsigned smax)
{
    // compute sbits and scut, with which we will recode the source with
    // sbits-1 bits for symbols < scut, otherwise with sbits bits (adding scut)
    if (smax < 1)
        return 2;
    unsigned sbits = 0;
    unsigned scut = 1;          // 2**sbits
    while (scut && scut <= smax) {
        scut <<= 1;
        sbits++;
    }
    scut -= smax + 1;

    // same thing for dbits and dcut
    if (dmax < 1)
        return 2;
    unsigned dbits = 0;
    unsigned dcut = 1;          // 2**dbits
    while (dcut && dcut <= dmax) {
        dcut <<= 1;
        dbits++;
    }
    dcut -= dmax + 1;

    // recode a base smax+1 vector to a base dmax+1 vector using an
    // intermediate bit vector (a sliding window of that bit vector is kept in
    // a bit buffer)
    unsigned long long buf = 0; // bit buffer
    unsigned have = 0;          // number of bits in bit buffer
    size_t i = 0, n = 0;        // source and dest indices
    unsigned sym;               // symbol being encoded
    for (;;) {
        // encode enough of source into bits to encode that to dest
        while (have < dbits && i < slen) {
            sym = source[i++];
            if (sym > smax) {
                *dlen = n;
                return 3;
            }
            if (sym < scut) {
                buf = (buf << (sbits - 1)) + sym;
                have += sbits - 1;
            }
            else {
                buf = (buf << sbits) + sym + scut;
                have += sbits;
            }
        }

        // if not enough bits to assure one symbol, then break out to a special
        // case for coding the final symbol
        if (have < dbits)
            break;

        // encode one symbol to dest
        if (n == *dlen)
            return 1;
        sym = buf >> (have - dbits + 1);
        if (sym < dcut) {
            dest[n++] = sym;
            have -= dbits - 1;
        }
        else {
            sym = buf >> (have - dbits);
            dest[n++] = sym - dcut;
            have -= dbits;
        }
        buf &= ((unsigned long long)1 << have) - 1;
    }

    // if any bits are left in the bit buffer, encode one last symbol to dest
    if (have) {
        if (n == *dlen)
            return 1;
        sym = buf;
        sym <<= dbits - 1 - have;
        if (sym >= dcut)
            sym = (sym << 1) - dcut;
        dest[n++] = sym;
    }

    // return recoded vector
    *dlen = n;
    return 0;
}
/* Test recode(). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <assert.h>
// Return a random vector of len unsigned values in the range 0..max.
static void ranvec(unsigned *vec, size_t len, unsigned max) {
    unsigned bits = 0;
    unsigned long long mask = 1;
    while (mask <= max) {
        mask <<= 1;
        bits++;
    }
    mask--;
    unsigned long long ran = 0;
    unsigned have = 0;
    size_t n = 0;
    while (n < len) {
        while (have < bits) {
            ran = (ran << 31) + random();
            have += 31;
        }
        if ((ran & mask) <= max)
            vec[n++] = ran & mask;
        ran >>= bits;
        have -= bits;
    }
}
// Get a valid number from str and assign it to var
#define NUM(var, str) \
    do { \
        char *end; \
        unsigned long val = strtoul(str, &end, 0); \
        var = val; \
        if (*end || var != val) { \
            fprintf(stderr, \
                    "invalid or out of range numeric argument: %s\n", str); \
            return 1; \
        } \
    } while (0)
/* "bet n m len count" generates count test vectors of length len, where each
entry is in the range 0..n. Each vector is recoded to another vector using
only symbols in the range 0..m. That vector is recoded back to a vector
using only symbols in 0..n, and that result is compared with the original
random vector. Report on the average ratio of input and output symbols, as
compared to the optimal ratio for arbitrary precision base encoding. */
int main(int argc, char **argv)
{
    // get sizes of alphabets and length of test vector, compute maximum sizes
    // of recoded vectors
    unsigned smax, dmax, runs;
    size_t slen, dsize, bsize;
    if (argc != 5) { fputs("need four arguments\n", stderr); return 1; }
    NUM(smax, argv[1]);
    NUM(dmax, argv[2]);
    NUM(slen, argv[3]);
    NUM(runs, argv[4]);
    dsize = ceil(slen * ceil(log2(smax + 1.)) / floor(log2(dmax + 1.)));
    bsize = ceil(dsize * ceil(log2(dmax + 1.)) / floor(log2(smax + 1.)));

    // generate random test vectors, encode, decode, and compare
    srandomdev();
    unsigned source[slen], dest[dsize], back[bsize];
    unsigned mis = 0, i;
    unsigned long long dtot = 0;
    int ret;
    for (i = 0; i < runs; i++) {
        ranvec(source, slen, smax);
        size_t dlen = dsize;
        ret = recode(dest, &dlen, dmax, source, slen, smax);
        if (ret) {
            fprintf(stderr, "encode error %d\n", ret);
            break;
        }
        dtot += dlen;
        size_t blen = bsize;
        ret = recode(back, &blen, smax, dest, dlen, dmax);
        if (ret) {
            fprintf(stderr, "decode error %d\n", ret);
            break;
        }
        // compare the full vectors (blen > slen is ok)
        if (blen < slen || memcmp(source, back, slen * sizeof(*source)))
            mis++;
    }
    if (mis)
        fprintf(stderr, "%u/%u mismatches!\n", mis, i);
    if (ret == 0)
        printf("mean dest/source symbols = %.4f (optimal = %.4f)\n",
               dtot / (i * (double)slen), log(smax + 1.) / log(dmax + 1.));
    return 0;
}
As has been pointed out in other StackOverflow answers, try not to think of summing digit * base^position as converting it to base ten; rather, think of it as directing the computer to generate a representation of the quantity represented by the number in its own terms (for most computers probably closer to our concept of base 2). Once the computer has its own representation of the quantity, we can direct it to output the number in any way we like.
By rejecting "big integer" implementations and asking for letter-by-letter conversion you are at the same time arguing that the numerical/alphabetical representation of quantity is not actually what it is, namely that each position represents a quantity of digit * base^position. If the nine-millionth character of War and Peace does represent what you are asking to convert it from, then the computer at some point will need to generate a representation for Д * 33^9000000.
I don't think any solution can work generally, because if n^e != m for every integer e (where n and m are the source and target bases), then there's no way to calculate the value of the target base at a certain place p once n^p > MAX_INT.
You can get away with it in the case where n^e == m for some e, because the problem is recursively doable: the first e digits of the base-n number can be summed and converted into the first digit of the base-m result, then chopped off, and the process repeated.
If you don't have this useful property, then eventually you're going to have to take some part of the original number and perform a modulus with n^p, and n^p is going to be greater than MAX_INT, which means it's impossible.
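To make the n^e == m special case concrete, here is a sketch converting binary to hex (2^4 == 16) without big integers; each group of e source digits becomes one destination digit:

#include <stdio.h>
#include <string.h>

/* destination base is the source base to the power e (here e = 4,
   binary -> hex), so each group of e source digits maps to exactly one
   destination digit and arbitrarily long strings convert without big
   integers */
void bin_to_hex(const char *bin, char *out)
{
    const char *hex = "0123456789abcdef";
    size_t len = strlen(bin);              /* assumed divisible by 4 */
    for (size_t i = 0; i < len; i += 4) {
        int v = 0;
        for (int j = 0; j < 4; j++)
            v = v * 2 + (bin[i + j] - '0'); /* sum e digits into one */
        *out++ = hex[v];
    }
    *out = '\0';
}

int main(void)
{
    char out[32];
    bin_to_hex("101110101101101001010101", out);
    printf("%s\n", out);   /* prints "bada55" */
    return 0;
}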

Huffman algorithm inverse matching

I was wondering, given a binary sequence, whether we can check if it matches a string that was encoded with Huffman's algorithm.
For example, if we have the string "abdcc" and several binary sequences, we can calculate which one is a possible representation of "abdcc" produced by Huffman's algorithm.
Interesting puzzle. As mentioned by j_random_hacker in a comment, it's possible to do this using a backtracking search. There are a few constraints on valid Huffman encodings of the string that we can use to narrow the search down:
No two Huffman codes of lengths n and m can be identical in the first n or m bits (whichever is shorter). This is because otherwise a Huffman decoder wouldn't be able to tell whether it had encountered the longer or the shorter code when decoding. And obviously two codes of the same length cannot be identical. (1)
If at any time there are fewer bits remaining in the bitstream than characters remaining in the string we are matching, then the string cannot match. (2)
If we reach the end of the string and there are still bits remaining in the bitstream, then the string does not match. (3)
If we encounter a character in the string for the second time, and we have already assumed a Huffman code for that same character earlier in the string, then an identical code must be present in the bitstream, or the string cannot match. (4)
We can define a function matchHuffmanString that matches a string against a Huffman-encoded bitstream, with a Huffman code table as part of the global state. To begin with, the code table is empty, and we call matchHuffmanString, passing the start of the string and the start of the bitstream.
When the function is called, it checks if there are enough bits in the stream to match the string, and returns if not. (2)
If the string is empty, then if the bitstream is also empty there is a match, and the code table is output. If the string is empty but the bitstream is not, then there is no match, so the function returns. (3)
If characters remain in the string, then the first character is read. The function checks if there is already an entry in the code table for that character; if so, the same code must be present in the bitstream. If it is not, there is no match, so the function returns (4). If it is, the function calls itself, moving on to the next character and past the matching code in the bitstream.
If there is no matching code for the character, then the possibility that it is represented by a code of every length n from 1 bit to 32 bits (an arbitrary limit) is considered. n bits are read from the bitstream and checked to see if such a code would conflict with any existing codes according to rule (1). If no conflict exists, the code is added to the code table and the function recurses, moving on to the next character and past the assumed code of length n bits. After returning, it backtracks by removing the code from the table.
Simple implementation in C:
#include <stdio.h>

// Huffman table:
// a 01
// b 0001
// c 1
// d 0010
char* string = "abdcc";

// 01 0001 0010 1 1
// reverse bit order (MSB first) and add an extra 0 for padding to stop
// getBits reading past the end of the array:
#define MESSAGE_LENGTH (12)
unsigned int message[] = {0b110100100010, 0};

// can handle messages of >32 bits, even though the above message is only 12 bits long
unsigned int getBits(int start, int n)
{
    return ((message[start>>5] >> (start&31)) | (message[(start>>5)+1] << (32-(start&31)))) & ((1<<n)-1);
}

unsigned int codes[26];
int code_lengths[26];
int callCount = 0;

void outputCodes()
{
    // output the codes:
    int i, j;
    for(i = 0; i < 26; i++)
    {
        if(code_lengths[i] != 0)
        {
            printf("%c ", i + 'a');
            for(j = 0; j < code_lengths[i]; j++)
                printf("%s", codes[i] & (1 << j) ? "1" : "0");
            printf("\n");
        }
    }
}

void matchHuffmanString(char* s, int len, int startbit)
{
    callCount++;
    if(len > MESSAGE_LENGTH - startbit)
        return; // not enough bits left to encode the rest of the message even at 1 bit per char (2)
    if(len == 0) // no more characters to match
    {
        if(startbit == MESSAGE_LENGTH)
        {
            // (3) we exactly used up all the bits, this stream matches.
            printf("match!\n\n");
            outputCodes();
            printf("\nCall count: %d\n", callCount);
        }
        return;
    }

    // read a character from the string (assume 'a' to 'z'):
    int c = s[0] - 'a';

    // is there already a code for this character?
    if(code_lengths[c] != 0)
    {
        // check if the code in the bit stream matches:
        int length = code_lengths[c];
        if(startbit + length > MESSAGE_LENGTH)
            return; // ran out of bits in stream, no match
        unsigned int bits = getBits(startbit, length);
        if(bits != codes[c])
            return; // bits don't match (4)
        matchHuffmanString(s + 1, len - 1, startbit + length);
    }
    else
    {
        // this character doesn't have a code yet, consider every possible length
        int i, j;
        for(i = 1; i < 32; i++)
        {
            // are there enough bits left for a code this long?
            if(startbit + i > MESSAGE_LENGTH)
                continue;
            unsigned int bits = getBits(startbit, i);

            // does this code conflict with an existing code?
            for(j = 0; j < 26; j++)
            {
                if(code_lengths[j] != 0) // check existing codes only
                {
                    // do the two codes match in the first i or code_lengths[j] bits, whichever is shorter?
                    int length = code_lengths[j] < i ? code_lengths[j] : i;
                    if((bits & ((1 << length)-1)) == (codes[j] & ((1 << length)-1)))
                        break; // there's a conflict (1)
                }
            }
            if(j != 26)
                continue; // there was a conflict

            // add the new code to the codes array and recurse:
            codes[c] = bits; code_lengths[c] = i;
            matchHuffmanString(s + 1, len - 1, startbit + i);
            code_lengths[c] = 0; // clear the code (backtracking)
        }
    }
}

int main(void) {
    int i;
    for(i = 0; i < 26; i++)
        code_lengths[i] = 0;
    matchHuffmanString(string, 5, 0);
    return 0;
}
output:
match!
a 01
b 0001
c 1
d 0010
Call count: 42
Ideone.com Demo
The above code could be improved by iterating over the string as long as it encounters characters that already have codes, and only recursing when it finds one that doesn't; see the sketch below. Also, it only works for lowercase letters a-z with no spaces and doesn't do any validation. I'd have to test it to be sure, but I think it's a tractable problem even for long strings, because any possible combinatorial explosion only happens when encountering new characters that don't already have codes in the table, and even then it's subject to constraints.
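A sketch of that improvement (hypothetical and untested against the harness above; it reuses the same globals and constraint numbering):

void matchHuffmanString2(char* s, int len, int startbit)
{
    // fast path: consume characters that already have codes, iterating
    // instead of recursing
    while(len > 0 && code_lengths[s[0] - 'a'] != 0)
    {
        int c = s[0] - 'a';
        int length = code_lengths[c];
        if(startbit + length > MESSAGE_LENGTH)
            return; // ran out of bits in stream, no match
        if(getBits(startbit, length) != codes[c])
            return; // bits don't match (4)
        s++; len--; startbit += length;
    }
    if(len > MESSAGE_LENGTH - startbit)
        return; // (2)
    if(len == 0)
    {
        if(startbit == MESSAGE_LENGTH)
        {
            printf("match!\n\n"); // (3)
            outputCodes();
        }
        return;
    }

    // slow path: first character without a code; same backtracking search
    // over candidate lengths as the original
    int c = s[0] - 'a';
    int i, j;
    for(i = 1; i < 32; i++)
    {
        if(startbit + i > MESSAGE_LENGTH)
            continue;
        unsigned int bits = getBits(startbit, i);
        for(j = 0; j < 26; j++)
        {
            if(code_lengths[j] != 0)
            {
                int length = code_lengths[j] < i ? code_lengths[j] : i;
                if((bits & ((1 << length)-1)) == (codes[j] & ((1 << length)-1)))
                    break; // conflict (1)
            }
        }
        if(j != 26)
            continue;
        codes[c] = bits; code_lengths[c] = i;
        matchHuffmanString2(s + 1, len - 1, startbit + i);
        code_lengths[c] = 0; // backtrack
    }
}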

Converting lower/upper case letters without ctype.h

I just saw that this could technically work. The only mistake I couldn't resolve was the last ASCII character that gets printed every time I test it out. I also tested this without using the name variable; just subtracting 32 from any lowercase letter in ASCII should give me its uppercase one, and it does. But I'm curious why I'm getting an additional char, which from what I see on screen is apparently Û.
#include <stdio.h>

int main(void)
{
    char name[22];
    int i;
    fputs("Type your name ", stdout);
    fgets(name, 22, stdin);
    for (i = 0; name[i] != '\0'; i = i + 1)
        printf("%c", (name[i]) - 32); /* This will convert lower case to upper */
                                      /* using as reference the ASCII table    */
    fflush(stdin);
    getchar();
}
Perhaps there is a line-break character at the end of the string.
You can check the character code, so that you only convert characters that actually are lowercase letters:
for (i = 0; name[i] != '\0'; i = i + 1) {
    char c = name[i];
    if (c >= 97 && c <= 122) {
        c -= 32;
    }
    printf("%c", c);
}
void read_chararray(char in_array[], int* Length)
{
    int Indx = 0, Indx2 = 0, Indx3 = 0;   // int declarations for indexes of some loops
    char cinput = { 0 }, word[255] = { 0 }, word2[255] = { 0 }; // declaration of cinput and first char array before punctuation removed
    for (Indx = 0; (cinput = getchar()) != '\n'; Indx++) { // Loop for getting characters from user, stop at <enter>
        word[Indx] = cinput;              // Place char into array
    }
    Indx2 = Indx;                         // Set Indx2 to Indx for loop operation
    for (Indx = 0; Indx < Indx2; Indx++) { // Loop to check and replace upper characters with lower
        cinput = word[Indx];
        if (cinput >= 65 && cinput <= 90) { // If cinput is within ASCII range 65 to 90, it is an upper-case character
            cinput += 32;                 // Add 32 to cinput to shift to the lower-case range within the ASCII table
            in_array[Indx] = cinput;      // Input new value into array pointer
        }
        else if (cinput >= 97 && cinput <= 122) // If the character is already lower-case ASCII, place it in the array, eradicating punctuation and whitespace
            in_array[Indx] = cinput;      // Input remaining lower case into array pointer
    }
    *Length = Indx;                       // Final size of array set to Length variable for future use
}
#include <stdio.h>

void upper(char);

int main(void)
{
    char ch;
    printf("\nEnter the character in lower case");
    scanf("%c", &ch);
    upper(ch);
    return 0;
}

void upper(char c)
{
    printf("\nUpper Case: %c", c - 32);
}

Looking for more details about "Group varint encoding/decoding" presented in Jeff's slides

I noticed that in Jeff's slides "Challenges in Building Large-Scale Information Retrieval Systems", which can also be downloaded here: http://research.google.com/people/jeff/WSDM09-keynote.pdf, a method of integer compression called "group varint encoding" is mentioned. It was said to be much faster than the 7-bits-per-byte integer encoding (about 2X faster). I am very interested in this and am looking for an implementation of it, or any further details that could help me implement it myself.
I am not a pro and am new to this; any help is welcome!
That's referring to "variable integer encoding", where the number of bytes used to store an integer when serialized is not fixed at 4. There is a good description of varint in the protocol buffer documentation.
It is used in encoding Google's protocol buffers, and you can browse the protocol buffer source code.
The CodedOutputStream contains the exact encoding function WriteVarint32FallbackToArrayInline:
inline uint8* CodedOutputStream::WriteVarint32FallbackToArrayInline(
    uint32 value, uint8* target) {
  target[0] = static_cast<uint8>(value | 0x80);
  if (value >= (1 << 7)) {
    target[1] = static_cast<uint8>((value >> 7) | 0x80);
    if (value >= (1 << 14)) {
      target[2] = static_cast<uint8>((value >> 14) | 0x80);
      if (value >= (1 << 21)) {
        target[3] = static_cast<uint8>((value >> 21) | 0x80);
        if (value >= (1 << 28)) {
          target[4] = static_cast<uint8>(value >> 28);
          return target + 5;
        } else {
          target[3] &= 0x7F;
          return target + 4;
        }
      } else {
        target[2] &= 0x7F;
        return target + 3;
      }
    } else {
      target[1] &= 0x7F;
      return target + 2;
    }
  } else {
    target[0] &= 0x7F;
    return target + 1;
  }
}
The cascading ifs will only add additional bytes onto the end of the target array if the magnitude of the value warrants those extra bytes. The 0x80 sets the high bit of each byte being written, and the value is shifted down 7 bits at a time. The final AND with 0x7F signifies the "last byte of the encoding": when OR'ing with 0x80, the highest bit is always 1, and the last byte then clears its highest bit. So, when reading varints, you read until you get a byte with a zero in the highest bit.
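For illustration, here is a minimal decoding sketch matching that description (my reading of the wire format, not code from the protobuf source):

#include <stdint.h>

/* read one varint32: 7 payload bits per byte, lowest group first,
   stopping at the first byte whose high bit is clear */
const uint8_t *read_varint32(const uint8_t *p, uint32_t *value)
{
    uint32_t result = 0;
    int shift = 0;
    while (*p & 0x80) {                 /* continuation bit set */
        result |= (uint32_t)(*p++ & 0x7F) << shift;
        shift += 7;
    }
    result |= (uint32_t)*p++ << shift;  /* final byte, high bit clear */
    *value = result;
    return p;                           /* points past the varint */
}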
I just realized you asked about group varint encoding specifically. Sorry, that code was about basic varint encoding (i.e., the 7-bits-per-byte encoding itself). The basic idea looks to be similar. Unfortunately, it's not what's being used to store 64-bit numbers in protocol buffers. I wouldn't be surprised if that code was open-sourced somewhere, though.
Using the ideas from varint and the diagrams of "group varint" from the slides, it shouldn't be too hard to cook up your own :)
Here is another page describing Group VarInt compression, which contains decoding code. Unfortunately they allude to publicly available implementations, but they don't provide references.
#include <stdint.h>

typedef unsigned char byte;

void DecodeGroupVarInt(const byte* compressed, int size, uint32_t* uncompressed) {
    const uint32_t MASK[4] = { 0xFF, 0xFFFF, 0xFFFFFF, 0xFFFFFFFF };
    const byte* limit = compressed + size;
    uint32_t current_value = 0;
    while (compressed != limit) {
        const uint32_t selector = *compressed++;

        const uint32_t selector1 = (selector & 3);
        current_value += *((uint32_t*)(compressed)) & MASK[selector1];
        *uncompressed++ = current_value;
        compressed += selector1 + 1;

        const uint32_t selector2 = ((selector >> 2) & 3);
        current_value += *((uint32_t*)(compressed)) & MASK[selector2];
        *uncompressed++ = current_value;
        compressed += selector2 + 1;

        const uint32_t selector3 = ((selector >> 4) & 3);
        current_value += *((uint32_t*)(compressed)) & MASK[selector3];
        *uncompressed++ = current_value;
        compressed += selector3 + 1;

        const uint32_t selector4 = (selector >> 6);
        current_value += *((uint32_t*)(compressed)) & MASK[selector4];
        *uncompressed++ = current_value;
        compressed += selector4 + 1;
    }
}
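Going the other way, here is a hypothetical encoder matching that decoder: one selector byte holding four 2-bit length codes, then four little-endian deltas of 1 to 4 bytes each. Like the decoder, it assumes a little-endian machine; it also assumes the count is a multiple of 4 and the values are non-decreasing, since the decoder accumulates deltas:

#include <stdint.h>
#include <string.h>

typedef unsigned char byte;

static int delta_len(uint32_t v)   /* bytes needed minus one: 0..3 */
{
    return v >= (1u << 24) ? 3 : v >= (1u << 16) ? 2 : v >= (1u << 8) ? 1 : 0;
}

byte* EncodeGroupVarInt(byte* out, const uint32_t* values, int count)
{
    uint32_t prev = 0;
    for (int i = 0; i < count; i += 4) {     /* one selector per group of four */
        byte* selector = out++;
        *selector = 0;
        for (int j = 0; j < 4; j++) {
            uint32_t delta = values[i + j] - prev;  /* decoder accumulates */
            prev = values[i + j];
            int len = delta_len(delta);
            *selector |= (byte)(len << (2 * j));    /* 2-bit length code */
            memcpy(out, &delta, len + 1);           /* little-endian bytes */
            out += len + 1;
        }
    }
    return out;   /* points past the encoded data */
}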
I was looking for the same thing and found this GitHub project in Java:
https://github.com/stuhood/gvi/
Looks promising!
Instead of decoding with bitmasks, in C/C++ you could use predefined structures that correspond to the value in the first byte. A complete example that uses this: http://www.oschina.net/code/snippet_12_5083
Another Java implementation for group varint: https://github.com/catenamatteo/groupvarint
But I suspect the very large switch has some drawbacks in Java.
