Understanding Hadoop VIntWritable compression

I know it's a bit of a shame, but I don't understand the compression technique used in VIntWritable and VLongWritable. Can someone elaborate with an example?
Code snippet from WritableUtils.java:
public static void writeVLong(DataOutput stream, long i) throws IOException {
  // Values between -112 and 127 inclusive are stored in a single byte.
  if (i >= -112 && i <= 127) {
    stream.writeByte((byte)i);
    return;
  }

  // Otherwise the first byte is a length marker: -113..-120 for positive
  // values, -121..-128 for negative values.
  int len = -112;
  if (i < 0) {
    i ^= -1L; // take one's complement
    len = -120;
  }

  // Count how many bytes the magnitude needs.
  long tmp = i;
  while (tmp != 0) {
    tmp = tmp >> 8;
    len--;
  }

  stream.writeByte((byte)len);

  // Recover the byte count from the marker and emit the magnitude,
  // most significant byte first.
  len = (len < -120) ? -(len + 120) : -(len + 112);
  for (int idx = len; idx != 0; idx--) {
    int shiftbits = (idx - 1) * 8;
    long mask = 0xFFL << shiftbits;
    stream.writeByte((byte)((i & mask) >> shiftbits));
  }
}
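To see what those branches produce on the wire, here is a minimal standalone sketch, assuming hadoop-common is on the classpath so that org.apache.hadoop.io.WritableUtils.writeVLong (the method shown above) is available; it prints the encoded bytes for a few sample values.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import org.apache.hadoop.io.WritableUtils;

public class VLongDemo {
    public static void main(String[] args) throws IOException {
        // Values in [-112, 127] fit in a single byte; anything else gets a
        // length-marker byte followed by the magnitude in big-endian order.
        long[] samples = {127, 128, 300, -113};
        for (long value : samples) {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            WritableUtils.writeVLong(new DataOutputStream(bytes), value);
            StringBuilder hex = new StringBuilder();
            for (byte b : bytes.toByteArray()) {
                hex.append(String.format("%02x ", b));
            }
            System.out.println(value + " -> " + hex.toString().trim());
        }
        // Expected output:
        // 127 -> 7f
        // 128 -> 8f 80
        // 300 -> 8e 01 2c
        // -113 -> 87 70
    }
}

The first byte either holds the value itself or encodes the sign together with the number of magnitude bytes that follow, which is how small longs shrink to one or two bytes instead of the fixed eight.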

Related

Simplex solver - issues with getting it working

I'm trying to write a simple simplex solver for linear optimization problems, but I'm having trouble getting it working. Every time I run it I get a vector subscript out of range (which is quite easy to find), but I think it's probably a core issue somewhere else in my implementation.
Here is my simplex solver implementation:
bool pivot(vector<vector<double>>& tableau, int row, int col) {
    int n = tableau.size();
    int m = tableau[0].size();
    double pivot_element = tableau[row][col];
    if (pivot_element == 0) return false;
    // Normalize the pivot row.
    for (int j = 0; j < m; j++) {
        tableau[row][j] /= pivot_element;
    }
    // Eliminate the pivot column from every other row.
    for (int i = 0; i < n; i++) {
        if (i != row) {
            double ratio = tableau[i][col];
            for (int j = 0; j < m; j++) {
                tableau[i][j] -= ratio * tableau[row][j];
            }
        }
    }
    return true;
}

int simplex(vector<vector<double>>& tableau, vector<double>& basic, vector<double>& non_basic) {
    int n = tableau.size() - 1;
    int m = tableau[0].size() - 1;
    while (true) {
        // Entering variable: first column with a positive objective coefficient.
        int col = -1;
        for (int j = 0; j < m; j++) {
            if (tableau[n][j] > 0) {
                col = j;
                break;
            }
        }
        if (col == -1) break;
        // Leaving variable: minimum-ratio test over rows with a positive entry.
        int row = -1;
        double min_ratio = numeric_limits<double>::infinity();
        for (int i = 0; i < n; i++) {
            if (tableau[i][col] > 0) {
                double ratio = tableau[i][m] / tableau[i][col];
                if (ratio < min_ratio) {
                    row = i;
                    min_ratio = ratio;
                }
            }
        }
        if (row == -1) return -1;
        if (!pivot(tableau, row, col)) return -1;
        // Swap the entering and leaving variables.
        double temp = basic[row];
        basic[row] = non_basic[col];
        non_basic[col] = temp;
    }
    return 1;
}

Finding longest sequence of '1's in a binary array by replacing any one '0' with '1'

I have an array consisting only of 0s and 1s. The task is to find the index of a 0 which, when replaced with a 1, produces the longest possible sequence of ones in the array.
The solution has to work in O(n) time and O(1) space.
E.g.:
Array - 011101101001
Answer - 4 (that produces 011111101001)
My approach gives me a result better than O(n²), but it times out on long string inputs.
int findIndex(int[] a){
    int maxlength = 0;
    int maxIndex = -1;
    int n = a.length;
    int i = 0;
    while(true){
        if( a[i] == 0 ){
            int leftLenght = 0;
            int j = i - 1;
            // finding count of 1s to left of this zero
            while(j >= 0){
                if(a[j] != 1){
                    break;
                }
                leftLenght++;
                j--;
            }
            int rightLenght = 0;
            j = i + 1;
            // finding count of 1s to right of this zero
            while(j < n){
                if(a[j] != 1){
                    break;
                }
                rightLenght++;
                j++;
            }
            if(maxlength < leftLenght + rightLenght + 1){
                maxlength = leftLenght + rightLenght + 1;
                maxIndex = i;
            }
        }
        if(i == n - 1){
            break;
        }
        i++;
    }
    return maxIndex;
}
The approach is simple: while iterating through the array, maintain two numbers, the length of the current continuous block of ones and the length of the previous continuous block of ones, the two blocks being separated by a zero.
Note: this solution assumes that there is at least one zero in the array; otherwise it returns -1.
int cal(int[] data){
    int last = 0;
    int cur = 0;
    int max = 0;
    int start = -1;
    int index = -1;
    for(int i = 0; i < data.length; i++){
        if(data[i] == 0){
            if(max < 1 + last + cur){
                max = 1 + last + cur;
                if(start != -1){
                    index = start;
                }else{
                    index = i;
                }
            }
            last = cur;
            start = i;
            cur = 0;
        }else{
            cur++;
        }
    }
    if(cur != 0 && start != -1){
        if(max < 1 + last + cur){
            return start;
        }
    }
    return index;
}
O(n) time, O(1) space
Live demo: https://ideone.com/1hjS25
I believe the problem can be solved by just maintaining a variable which stores the length of the trailing run of 1s seen before reaching a '0'.
int last_trail = 0;
int cur_trail = 0;
int last_seen = -1;
int ans = 0, maxVal = 0;
for(int i = 0; i < a.size(); i++) {
    if(a[i] == '0') {
        if(cur_trail + last_trail + 1 > maxVal) {
            maxVal = cur_trail + last_trail + 1;
            ans = last_seen;
        }
        last_trail = cur_trail;
        cur_trail = 0;
        last_seen = i;
    } else {
        cur_trail++;
    }
}
if(cur_trail + last_trail + 1 > maxVal && last_seen > -1) {
    maxVal = cur_trail + last_trail + 1;
    ans = last_seen;
}
This can be solved with the technique known as two pointers. Most two-pointer solutions use O(1) space and O(n) time.
Code: https://www.ideone.com/N8bznU
#include <iostream>
#include <string>
using namespace std;

int findOptimal(string &s) {
    s += '0'; // add a sentinel 0
    int best_zero = -1;
    int prev_zero = -1;
    int zeros_in_interval = 0;
    int start = 0;
    int best_answer = -1;
    for(int i = 0; i < (int)s.length(); ++i) {
        if(s[i] == '1') continue;
        else if(s[i] == '0' and zeros_in_interval == 0) {
            zeros_in_interval++;
            prev_zero = i;
        }
        else if(s[i] == '0' and zeros_in_interval == 1) {
            int curr_answer = i - start; // [start, i) only contains one 0
            cout << "tried this : [" << s.substr(start, i - start) << "]\n";
            if(curr_answer > best_answer) {
                best_answer = curr_answer;
                best_zero = prev_zero;
            }
            start = prev_zero + 1;
            prev_zero = i;
        }
    }
    cout << "Answer = " << best_zero << endl;
    return best_zero;
}

int main() {
    string input = "011101101001";
    findOptimal(input);
    return 0;
}
This is an implementation in C++. The output looks like this:
tried this : [0111]
tried this : [111011]
tried this : [1101]
tried this : [10]
tried this : [01]
Answer = 4

Processing Image data changes during save

I'm trying to create a program that hides data in an image file. Data bits are hidden in the last bit of every pixel's blue value. The first four bytes (32 pixels) contain the length of the data bytes that follow.
Everything works fine when I encrypt the data into the image and then decrypt it without saving the image in between. However, if I encrypt the data into an image, save it, open the file again and try to decrypt it, decryption fails because the values seem to have changed.
I wonder if something similar is happening as with txt files, where a BOM containing byte-order data is prepended to the file?
The code works if I change color c = crypted.pixels[pos + i]; to color c = original.pixels[pos + i]; in the readByteAt function and run the encrypting function first and then the decryption function. This makes the decryption function run on the just-encrypted image still in program memory instead of reading it from the file.
Any ideas on what causes this or how to prevent it are welcome!
Here is the full (messy) code:
PImage original;
PImage crypted;
int imagesize;
boolean ready = false;

void setup() {
  size(100, 100);
  imagesize = width * height;
}

void draw() {
}

void encrypt()
{
  original = loadImage("image.jpg");
  original.loadPixels();
  println("begin encrypt");
  int pos = 0;
  byte b[] = loadBytes("DATA.txt");
  println("encrypting in image...");
  int len = b.length;
  println("len " + len);
  writeByteAt((len >> (3*8)) & 0xFF, 0);
  writeByteAt((len >> (2*8)) & 0xFF, 8);
  writeByteAt((len >> (1*8)) & 0xFF, 16);
  writeByteAt(len & 0xFF, 24);
  pos = 32;
  for (int i = 3; i < b.length; i++) {
    int a = b[i] & 0xff;
    print(char(a));
    writeByteAt(a, pos);
    pos += 8;
  }
  original.updatePixels();
  println();
  println("done");
  original.save("encrypted.jpg");
}

void writeByteAt(int b, int pos)
{
  println("writing " + b + " at " + pos);
  for (int i = 0; i < 8; i++)
  {
    color c = original.pixels[pos + i];
    int v = int(blue(c));
    if ((b & (1 << i)) > 0)
    {
      v = v | 1;
    } else
    {
      v = v & 0xFE;
    }
    original.pixels[pos+i] = color(red(c), green(c), v);
    //original.pixels[pos+i] = color(255,255,255);
  }
}

int readByteAt(int pos)
{
  int b = 0;
  for (int i = 0; i < 8; i++)
  {
    color c = crypted.pixels[pos + i];
    int v = int(blue(c));
    if ((v & 1) > 0)
    {
      b += (1 << i);
    }
  }
  return b;
}

void decrypt()
{
  crypted = loadImage("encrypted.jpg");
  crypted.loadPixels();
  println("begin decrypt");
  int pos = 0;
  PrintWriter output = createWriter("out.txt");
  println("decrypting...");
  int len = 0;
  len += readByteAt(0) << 3*8;
  len += readByteAt(8) << 2*8;
  len += readByteAt(16) << 1*8;
  len += readByteAt(24);
  pos = 32;
  if (len >= imagesize)
  {
    println("ERROR: DATA LENGTH OVER IMAGE SIZE");
    return;
  }
  println(len);
  while (pos < ((len+1)*8)) {
    output.print(char(readByteAt(pos)));
    print(char(readByteAt(pos)));
    pos += 8;
  }
  output.flush(); // Writes the remaining data to the file
  output.close();
  println("\nDone");
}

void keyPressed()
{
  if (key == 'e')
  {
    encrypt();
  }
  if (key == 'd')
  {
    decrypt();
  }
}
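One way to narrow down where the pixel values change is to take the hiding code out of the equation and simply check whether blue-channel LSBs survive a save/load round trip in a given file format. Below is a rough plain-Java sketch of such a check (my own diagnostic, not part of the sketch above; it uses javax.imageio, the file names are placeholders, and it assumes the image has at least 64 pixels):

import java.awt.image.BufferedImage;
import java.io.File;
import javax.imageio.ImageIO;

public class RoundTripCheck {
    public static void main(String[] args) throws Exception {
        // Load the source image and toggle the blue-channel LSB of the
        // first 64 pixels, the same kind of change writeByteAt makes.
        BufferedImage img = ImageIO.read(new File("image.jpg"));
        int w = img.getWidth();
        for (int i = 0; i < 64; i++) {
            int rgb = img.getRGB(i % w, i / w);
            img.setRGB(i % w, i / w, rgb ^ 0x01); // blue is the lowest byte
        }

        // Save and reload in two formats, then count how many of the
        // stored LSBs differ from what was written.
        for (String format : new String[] {"jpg", "png"}) {
            File out = new File("roundtrip." + format);
            ImageIO.write(img, format, out);
            BufferedImage back = ImageIO.read(out);
            int changed = 0;
            for (int i = 0; i < 64; i++) {
                if ((img.getRGB(i % w, i / w) & 1) != (back.getRGB(i % w, i / w) & 1)) {
                    changed++;
                }
            }
            System.out.println(format + ": " + changed + " of 64 blue LSBs changed");
        }
    }
}

If a lossless format such as PNG reports zero changed bits while another format does not, the mismatch is introduced by that format's compression when the file is saved, not by the reading or writing code.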

How to generate Chase's sequence

In section 7.2.1.3 of the draft of The Art of Computer Programming, "Generating all combinations", Knuth introduces Algorithm C for generating Chase's sequence.
He also mentions a similar algorithm (based on the following equation) that works with an index list, but gives no source code (exercise 45 of the draft).
I finally worked out a C++ version which I think is quite ugly. To generate all C(n, m) combinations, the memory use is about 3(m+1) and the time complexity is bounded by O(m n^m).
#include <cstddef>
#include <vector>
#include <algorithm>

class chase_generator_t{
public:
    using size_type = ptrdiff_t;
    enum class GET : char{ VALUE, INDEX };

    chase_generator_t(size_type _n) : n(_n){}

    void choose(size_type _m){
        m = _m;
        ++_m;
        index.resize(_m);
        threshold.resize(_m + 1);
        tag.resize(_m);
        for (size_type i = 0, j = n - m; i != _m; ++i){
            index[i] = j + i;
            tag[i] = tag_t::DECREASE;
            using std::max;
            threshold[i] = max(i - 1, (index[i] - 3) | 1);
        }
        threshold[_m] = n;
    }

    bool get(size_type &x, size_type &y, GET const which){
        if (which == GET::VALUE) return __get<false>(x, y);
        return __get<true>(x, y);
    }

    size_type get_n() const{
        return n;
    }

    size_type get_m() const{
        return m;
    }

    size_type operator[](size_t const i) const{
        return index[i];
    }

private:
    enum class tag_t : char{ DECREASE, INCREASE };
    size_type n, m;
    std::vector<size_type> index, threshold;
    std::vector<tag_t> tag;

    template<bool GetIndex>
    bool __get(size_type &x, size_type &y){
        using std::max;
        size_type p = 0, i, q;
    find:
        q = p + 1;
        if (index[p] == threshold[q]){
            if (q >= m) return false;
            p = q;
            goto find;
        }
        x = GetIndex ? p : index[p];
        if (tag[p] == tag_t::INCREASE){
            using std::min;
        increase:
            index[p] = min(index[p] + 2, threshold[q]);
            threshold[p] = index[p] - 1;
        }
        else if (index[p] && (i = (index[p] - 1) & ~1) >= p){
            index[p] = i;
            threshold[p] = max(p - 1, (index[p] - 3) | 1);
        }
        else{
            tag[p] = tag_t::INCREASE;
            i = p | 1;
            if (index[p] == i) goto increase;
            index[p] = i;
            threshold[p] = index[p] - 1;
        }
        y = index[p];
        for (q = 0; q != p; ++q){
            tag[q] = tag_t::DECREASE;
            threshold[q] = max(q - 1, (index[q] - 3) | 1);
        }
        return true;
    }
};
Does anyone have a better implementation, i.e. one that runs faster with the same memory or uses less memory at the same speed?
I think that the C code below is closer to what Knuth had in mind. Undoubtedly there are ways to make it more elegant (in particular, I'm leaving some scaffolding in case it helps with experimentation), though I'm skeptical that the array w can be disposed of. If storage is really important for some reason, then steal the sign bit from the a array.
#include <stdbool.h>
#include <stdio.h>

enum {
    N = 10,
    T = 5
};

static void next(int a[], bool w[], int *r) {
    bool found_r = false;
    int j;
    for (j = *r; !w[j]; j++) {
        int b = a[j] + 1;
        int n = a[j + 1];
        if (b < (w[j + 1] ? n - (2 - (n & 1)) : n)) {
            if ((b & 1) == 0 && b + 1 < n) b++;
            a[j] = b;
            if (!found_r) *r = j > 1 ? j - 1 : 0;
            return;
        }
        w[j] = a[j] - 1 >= j;
        if (w[j] && !found_r) {
            *r = j;
            found_r = true;
        }
    }
    int b = a[j] - 1;
    if ((b & 1) != 0 && b - 1 >= j) b--;
    a[j] = b;
    w[j] = b - 1 >= j;
    if (!found_r) *r = j;
}

int main(void) {
    typedef char t_less_than_n[T < N ? 1 : -1];
    int a[T + 1];
    bool w[T + 1];
    for (int j = 0; j < T + 1; j++) {
        a[j] = N - (T - j);
        w[j] = true;
    }
    int r = 0;
    do {
        for (int j = T - 1; j > -1; j--) printf("%x", a[j]);
        putchar('\n');
        if (false) {
            for (int j = T - 1; j > -1; j--) printf("%d", w[j]);
            putchar('\n');
        }
        next(a, w, &r);
    } while (a[T] == N);
}

Euclidean greatest common divisor for more than two numbers

Can someone give an example of an algorithm for finding the greatest common divisor of more than two numbers?
I believe the programming language doesn't matter.
Start with the first pair and get their GCD, then take the GCD of that result and the next number. The obvious optimization is that you can stop if the running GCD ever reaches 1. I'm watching this one to see if there are any other optimizations. :)
Oh, and this can be easily parallelized since the operation is commutative and associative.
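As a rough illustration of that approach, here is a minimal Java sketch (my own, not from the answer above) that folds gcd over the list and bails out as soon as the running GCD reaches 1:

public class GcdOfMany {
    // Classic Euclidean algorithm for two non-negative numbers.
    static long gcd(long a, long b) {
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    // Fold gcd over the whole list, stopping early once the running
    // GCD reaches 1, since it can never grow again after that.
    static long gcdAll(long... values) {
        long g = 0; // gcd(0, x) == x, so 0 works as the starting value
        for (long v : values) {
            g = gcd(g, Math.abs(v));
            if (g == 1) break;
        }
        return g;
    }

    public static void main(String[] args) {
        System.out.println(gcdAll(48, 16, 24, 96)); // prints 8
        System.out.println(gcdAll(7, 13, 21));      // prints 1, stopping early
    }
}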
The GCD of 3 numbers can be computed as gcd(a, b, c) = gcd(gcd(a, b), c). You can apply the Euclidean algorithm, the extended Euclidean algorithm or the binary GCD algorithm iteratively and get your answer. I'm not aware of any other (smarter?) ways to find a GCD, unfortunately.
A little late to the party I know, but a simple JavaScript implementation, utilising Sam Harwell's description of the algorithm:
function euclideanAlgorithm(a, b) {
  if (b === 0) {
    return a;
  }
  const remainder = a % b;
  return euclideanAlgorithm(b, remainder);
}

function gcdMultipleNumbers(...args) { // ES6 used here, change as appropriate
  const gcd = args.reduce((memo, next) => {
    return euclideanAlgorithm(memo, next);
  });
  return gcd;
}

gcdMultipleNumbers(48, 16, 24, 96); // 8
I just updated a Wiki page on this:
https://en.wikipedia.org/wiki/Binary_GCD_algorithm#C.2B.2B_template_class
This takes an arbitrary number of terms.
Use it as GCD(5, 2, 30, 25, 90, 12);
#include <cstdarg>
#include <cstdlib>

template<typename AType> AType GCD(int nargs, ...)
{
    va_list arglist;
    va_start(arglist, nargs);
    AType *terms = new AType[nargs];
    // put values into an array
    for (int i = 0; i < nargs; i++)
    {
        terms[i] = va_arg(arglist, AType);
        if (terms[i] < 0)
        {
            va_end(arglist);
            delete[] terms;
            return (AType)0;
        }
    }
    va_end(arglist);

    int shift = 0;
    int numEven = 0;
    int numOdd = 0;
    int smallindex = -1;
    do
    {
        numEven = 0;
        numOdd = 0;
        smallindex = -1;
        // count number of even and odd
        for (int i = 0; i < nargs; i++)
        {
            if (terms[i] == 0)
                continue;
            if (terms[i] & 1)
                numOdd++;
            else
                numEven++;
            if ((smallindex < 0) || terms[i] < terms[smallindex])
            {
                smallindex = i;
            }
        }
        // check for exit
        if (numEven + numOdd == 1)
            continue;
        // If everything in S is even, divide everything in S by 2,
        // and then multiply the final answer by 2 at the end.
        if (numOdd == 0)
        {
            shift++;
            for (int i = 0; i < nargs; i++)
            {
                if (terms[i] == 0)
                    continue;
                terms[i] >>= 1;
            }
        }
        // If some numbers in S are even and some are odd, divide all the even numbers by 2.
        if (numEven > 0 && numOdd > 0)
        {
            for (int i = 0; i < nargs; i++)
            {
                if (terms[i] == 0)
                    continue;
                if ((terms[i] & 1) == 0)
                    terms[i] >>= 1;
            }
        }
        // If every number in S is odd, then choose an arbitrary element of S and call it k.
        // Replace every other element, say n, with |n - k| / 2.
        if (numEven == 0)
        {
            for (int i = 0; i < nargs; i++)
            {
                if (i == smallindex || terms[i] == 0)
                    continue;
                terms[i] = abs(terms[i] - terms[smallindex]) >> 1;
            }
        }
    } while (numEven + numOdd > 1);

    // only one element remains; multiply the final answer by the 2s at the end.
    for (int i = 0; i < nargs; i++)
    {
        if (terms[i] == 0)
            continue;
        AType result = terms[i] << shift;
        delete[] terms;
        return result;
    }
    delete[] terms;
    return 0;
}
For Go, using the remainder operator:
func GetGCD(a, b int) int {
    for b != 0 {
        a, b = b, a%b
    }
    return a
}

func GetGCDFromList(numbers []int) int {
    var gdc = numbers[0]
    for i := 1; i < len(numbers); i++ {
        number := numbers[i]
        gdc = GetGCD(gdc, number)
    }
    return gdc
}
In Java (not optimal):
public static int GCD(int[] a){
    int j = 0;
    boolean b = true;
    for (int i = 1; i < a.length; i++) {
        if(a[i] != a[i-1]){
            b = false;
            break;
        }
    }
    if(b) return a[0];
    j = LeastNonZero(a);
    System.out.println(j);
    for (int i = 0; i < a.length; i++) {
        if(a[i] != j) a[i] = a[i] - j;
    }
    System.out.println(Arrays.toString(a));
    return GCD(a);
}

public static int LeastNonZero(int[] a){
    int b = 0;
    for (int i : a) {
        if(i != 0){
            if(b == 0 || i < b) b = i;
        }
    }
    return b;
}
