I am trying to decode a websocket frame, but I'm not successful when it comes to decoding the extended payload. Here is what I have achieved so far:
char *in = data;
char *buffer;
unsigned int i;
unsigned char mask[4];
unsigned int packet_length = 0;
int rc;
/* Expect a finished text frame. */
assert(in[0] == '\x81');
packet_length = ((unsigned char) in[1]) & 0x7f;
mask[0] = in[2];
mask[1] = in[3];
mask[2] = in[4];
mask[3] = in[5];
if (packet_length <= 125) { // This decoding works
/* Unmask the payload. */
for (i = 0; i < packet_length; i++)
in[6 + i] ^= mask[i % 4];
rc = asprintf(&buffer, "%.*s", packet_length, in + 6);
} else if (packet_length == 126) { // This decoding does NOT work
/* Unmask the payload. */
for (i = 0; i < packet_length; i++)
in[8 + i] ^= mask[i % 4];
rc = asprintf(&buffer, "%.*s", packet_length, in + 8);
}
What am I doing wrong? How do I decode the extended payload?
The sticking point is payloads larger than 125 bytes.
The format is pretty simple. Let's say you send ten a's in JavaScript:
ws.send("a".repeat(10))
Then the server will receive:
bytes[16]=818a8258a610e339c771e339c771e339
byte 0: The 0x81 is just an indicator that a message was received
byte 1: the 0x8a is the length; subtract 0x80 (the mask flag bit) from it: 0x0A == 10
byte 2, 3, 4, 5: the 4 byte xor key to decrypt the payload
the rest: payload
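(As a quick check: the first payload byte 0xe3 XORed with the first key byte 0x82 is 0x61, which is 'a'.)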
But now let's say you send 126 a's in JavaScript:
ws.send("a".repeat(126))
Then the server will receive:
bytes[134]=81fe007ee415f1e5857490848574908485749084857490848574908485749084857490848574908485749084857490848574908485749084857490848574908485749084857490848574908485749084857490848574908485749084857490848574908485749084857490848574908485749084857490848574908485749084857490848574
If the length of the payload is > 125, byte 1 will have the value 0xfe and the format then changes to:
byte 0: The 0x81 is just an indicator that a message was received
byte 1: will be 0xfe
byte 2, 3: the length of the payload as a big-endian uint16 number
byte 4, 5, 6, 7: the 4 byte xor key to decrypt the payload
the rest: payload
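(Checking against the example above: bytes 2 and 3 are 0x00 0x7e, i.e. 126, and the first payload byte 0x85 XORed with the first key byte 0xe4 is again 0x61, an 'a'.)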
Example code in C#:
List<byte[]> decodeWebsocketFrame(Byte[] bytes)
{
List<Byte[]> ret = new List<Byte[]>();
int offset = 0;
while (offset + 6 < bytes.Length)
{
// format: 0==ascii/binary 1=length-0x80, byte 2,3,4,5=key, 6+len=message, repeat with offset for next...
int len = bytes[offset + 1] - 0x80;
if (len <= 125)
{
//String data = Encoding.UTF8.GetString(bytes);
//Debug.Log("len=" + len + "bytes[" + bytes.Length + "]=" + ByteArrayToString(bytes) + " data[" + data.Length + "]=" + data);
Debug.Log("len=" + len + " offset=" + offset);
Byte[] key = new Byte[] { bytes[offset + 2], bytes[offset + 3], bytes[offset + 4], bytes[offset + 5] };
Byte[] decoded = new Byte[len];
for (int i = 0; i < len; i++)
{
int realPos = offset + 6 + i;
decoded[i] = (Byte)(bytes[realPos] ^ key[i % 4]);
}
offset += 6 + len;
ret.Add(decoded);
} else
{
int a = bytes[offset + 2];
int b = bytes[offset + 3];
len = (a << 8) + b;
//Debug.Log("Length of ws: " + len);
Byte[] key = new Byte[] { bytes[offset + 4], bytes[offset + 5], bytes[offset + 6], bytes[offset + 7] };
Byte[] decoded = new Byte[len];
for (int i = 0; i < len; i++)
{
int realPos = offset + 8 + i;
decoded[i] = (Byte)(bytes[realPos] ^ key[i % 4]);
}
offset += 8 + len;
ret.Add(decoded);
}
}
return ret;
}
If packet_length is 126, the following 2 bytes give the length of data to be read.
If packet_length is 127, the following 8 bytes give the length of data to be read.
The mask is contained in the following 4 bytes (after the length).
The message to be decoded follows this.
The data framing section of the spec has a useful illustration of this.
If you re-order your code to something like
Read packet_length
Check for packet_length of 126 or 127. Reassign packet_length to the value of the following 2 or 8 bytes if required.
Read the mask (the 4 bytes immediately after the length field, i.e. after any additional 2 or 8 length bytes read in the step above).
Decode message (everything after the mask).
then things should work.
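Applied to the C code from the question, a minimal sketch of those steps for the 126 case could look like this (it assumes a complete, masked, unfragmented text frame starting at in, and offset is a new local added just for the sketch; a length byte of 127 would mean an 8-byte big-endian length follows instead):
packet_length = ((unsigned char) in[1]) & 0x7f;
unsigned int offset = 2; /* first byte after the two-byte header */
if (packet_length == 126) {
    /* The real length is the next 2 bytes, big-endian (network byte order). */
    packet_length = ((unsigned char) in[2] << 8) | (unsigned char) in[3];
    offset = 4;
}
/* The mask sits after the (possibly extended) length, not at a fixed position. */
mask[0] = in[offset + 0];
mask[1] = in[offset + 1];
mask[2] = in[offset + 2];
mask[3] = in[offset + 3];
offset += 4;
/* Unmask the payload, which starts right after the mask. */
for (i = 0; i < packet_length; i++)
    in[offset + i] ^= mask[i % 4];
rc = asprintf(&buffer, "%.*s", packet_length, in + offset);
The only real change from the question's code is that both the mask and the payload move two bytes further along when the 2-byte extended length is present.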
I have this string in hexadecimal:
"0000803F00000000000000B4B410D1A90000803FB41051B500000034B41051350000803F000000000000000000C05B400000000000C06B400000000000D07440"
and I know what it contains:
(1, 0, -1.192093e-007),
(-9.284362e-014, 1, -7.788287e-007),
(1.192093e-007, 7.788287e-007, 1),
(111, 222, 333).
And yes, it is a transform matrix!
Decoding the first 72 characters (8 chars per number) was trivial; you only need to split into groups of 8 and use IEEE floating point format, i.e. 0x0000803F = 1.0f.
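(More precisely, the eight hex characters 0000803F are the bytes 00 00 80 3F in memory order; read as a little-endian 32-bit value they form the bit pattern 0x3F800000, which is 1.0f in IEEE 754.)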
So we still have "000000000000000000C05B400000000000C06B400000000000D07440", which contains the fourth vector, but I have never seen this kind of numeric encoding.
Any thoughts on this?
It looks like these are 8-byte IEEE floating point numbers, starting at byte 40. So the layout is:
Bytes 0-11: first vector, 3 single-precision numbers
Bytes 12-23: second vector, 3 single-precision numbers
Bytes 24-35: third vector, 3 single-precision numbers
Bytes 36-39: Unused? (Padding?)
Bytes 40-63: fourth vector, 3 double-precision numbers
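(As a quick check, bytes 40-47 are 00 00 00 00 00 C0 5B 40, which read as a little-endian double is the bit pattern 0x405BC00000000000, i.e. 111.0.)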
The code below shows an example of parsing this in C#. The output of the code is:
(1, 0, -1.192093E-07)
(-9.284362E-14, 1, -7.788287E-07)
(1.192093E-07, 7.788287E-07, 1)
(111, 222, 333)
Sample code:
using System;
class Program
{
static void Main(string[] args)
{
string text = "0000803F00000000000000B4B410D1A90000803FB41051B500000034B41051350000803F000000000000000000C05B400000000000C06B400000000000D07440";
byte[] bytes = ParseHex(text);
for (int i = 0; i < 3; i++)
{
float x = BitConverter.ToSingle(bytes, i * 12);
float y = BitConverter.ToSingle(bytes, i * 12 + 4);
float z = BitConverter.ToSingle(bytes, i * 12 + 8);
Console.WriteLine($"({x}, {y}, {z})");
}
// Final vector
{
double x = BitConverter.ToDouble(bytes, 40);
double y = BitConverter.ToDouble(bytes, 48);
double z = BitConverter.ToDouble(bytes, 56);
Console.WriteLine($"({x}, {y}, {z})");
}
}
// From https://stackoverflow.com/a/854026/9574109
public static byte[] ParseHex(string hex)
{
int offset = hex.StartsWith("0x") ? 2 : 0;
if ((hex.Length % 2) != 0)
{
throw new ArgumentException("Invalid length: " + hex.Length);
}
byte[] ret = new byte[(hex.Length-offset)/2];
for (int i=0; i < ret.Length; i++)
{
ret[i] = (byte) ((ParseNybble(hex[offset]) << 4)
| ParseNybble(hex[offset+1]));
offset += 2;
}
return ret;
}
static int ParseNybble(char c)
{
if (c >= '0' && c <= '9')
{
return c-'0';
}
if (c >= 'A' && c <= 'F')
{
return c-'A'+10;
}
if (c >= 'a' && c <= 'f')
{
return c-'a'+10;
}
throw new ArgumentException("Invalid hex digit: " + c);
}
}
I hope someone can help here.
I have a large byte vector from which I create a small byte vector (based on a mask), which I then process with SIMD.
Currently the mask is an array of baseOffset + submask (byte[256]), optimized for storage as there are > 10^8 of them. I create a max-size subvector, then loop through the mask array, multiply the baseOffset by 256, and for each bit set in the mask load from the large vector and put the values into the smaller vector sequentially. The smaller vector is then processed via a number of VPMADDUBSW instructions and accumulated. I can change this structure, e.g. walk the bits once to build an 8K-bit array buffer and then create the small vector.
Is there a faster way I can create the subarray?
I pulled the code out of the app into a test program, but the original is in a state of flux (moving to AVX2 and pulling more out of C#).
#include "stdafx.h"
#include<stdio.h>
#include <mmintrin.h>
#include <emmintrin.h>
#include <tmmintrin.h>
#include <smmintrin.h>
#include <immintrin.h>
//from
char N[4096] = { 9, 5, 5, 5, 9, 5, 5, 5, 5, 5 };
//W
char W[4096] = { 1, 2, -3, 5, 5, 5, 5, 5, 5, 5 };
char buffer[4096] ;
__declspec(align(2))
struct packed_destination{
char blockOffset;
__int8 bitMask[32];
};
__m128i sum = _mm_setzero_si128();
packed_destination packed_destinations[10];
void process128(__m128i u, __m128i s)
{
__m128i calc = _mm_maddubs_epi16(u, s); // pmaddubsw
__m128i loints = _mm_cvtepi16_epi32(calc);
__m128i hiints = _mm_cvtepi16_epi32(_mm_shuffle_epi32(calc, 0x4e));
sum = _mm_add_epi32(_mm_add_epi32(loints, hiints), sum);
}
void process_array(char n[], char w[], int length)
{
sum = _mm_setzero_si128();
int length128th = length >> 7;
for (int i = 0; i < length128th; i++)
{
__m128i u = _mm_load_si128((__m128i*)&n[i * 128]);
__m128i s = _mm_load_si128((__m128i*)&w[i * 128]);
process128(u, s);
}
}
void populate_buffer_from_vector(packed_destination packed_destinations[], char n[] , int dest_length)
{
int buffer_dest_index = 0;
for (int i = 0; i < dest_length; i++)
{
int blockOffset = packed_destinations[i].blockOffset <<8 ;
// go through mask and copy to buffer
for (int j = 0; j < 32; j++)
{
int joffset = blockOffset + j << 3;
int mask = packed_destinations[i].bitMask[j];
if (mask & 1 << 0)
buffer[buffer_dest_index++] = n[joffset + 1<<0 ];
if (mask & 1 << 1)
buffer[buffer_dest_index++] = n[joffset + 1<<1];
if (mask & 1 << 2)
buffer[buffer_dest_index++] = n[joffset + 1<<2];
if (mask & 1 << 3)
buffer[buffer_dest_index++] = n[joffset + 1<<3];
if (mask & 1 << 4)
buffer[buffer_dest_index++] = n[joffset + 1<<4];
if (mask & 1 << 5)
buffer[buffer_dest_index++] = n[joffset + 1<<5];
if (mask & 1 << 6)
buffer[buffer_dest_index++] = n[joffset + 1<<6];
if (mask & 1 << 7)
buffer[buffer_dest_index++] = n[joffset + 1<<7];
};
}
}
int _tmain(int argc, _TCHAR* argv[])
{
for (int i = 0; i < 32; ++i)
{
packed_destinations[0].bitMask[i] = 0x0f;
packed_destinations[1].bitMask[i] = 0x04;
}
packed_destinations[1].blockOffset = 1;
populate_buffer_from_vector(packed_destinations, N, 1);
process_array(buffer, W, 256);
int val = sum.m128i_i32[0] +
sum.m128i_i32[1] +
sum.m128i_i32[2] +
sum.m128i_i32[3];
printf("sum is %d" , val);
printf("Press Any Key to Continue\n");
getchar();
return 0;
}
Normally mask usage would be 5-15%; for some workloads it would be 25-100%.
MASKMOVDQU is close, but then we would have to re-pack/swl according to the mask before saving.
A couple of optimisations for your existing code:
If your data is sparse then it would probably be a good idea to add a test of each 8-bit mask value prior to testing the individual bits, i.e.
int mask = packed_destinations[i].bitMask[j];
if (mask != 0)
{
if (mask & 1 << 0)
buffer[buffer_dest_index++] = n[joffset + 1<<0 ];
if (mask & 1 << 1)
buffer[buffer_dest_index++] = n[joffset + 1<<1];
...
Secondly your process128 function can be optimised considerably:
inline __m128i process128(const __m128i u, const __m128i s, const __m128i sum)
{
const __m128i vk1 = _mm_set1_epi16(1);
__m128i calc = _mm_maddubs_epi16(u, s);
calc = _mm_madd_epi16(calc, vk1);
return _mm_add_epi32(sum, calc);
}
Note that as well as reducing the SSE instruction count from 6 to 3, I've also made sum a parameter, to get away from any dependency on global variables (it's always a good idea to avoid globals, not only for good software engineering but also because they can inhibit certain compiler optimisations).
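As an illustration only (not part of the original program), a hypothetical process_array2 could keep the accumulator local and return it, stepping 16 bytes per 128-bit load and using the 3-instruction process128 above:
// Sketch: local accumulator instead of the global sum.
static __m128i process_array2(const char n[], const char w[], int length)
{
    __m128i sum = _mm_setzero_si128();
    for (int i = 0; i + 16 <= length; i += 16)
    {
        __m128i u = _mm_load_si128((const __m128i*)&n[i]);
        __m128i s = _mm_load_si128((const __m128i*)&w[i]);
        sum = process128(u, s, sum); // returns the updated accumulator
    }
    return sum;
}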
It would be interesting to see a profile of your code (using a decent sampling profiler, not via instrumentation), since this would help to prioritise any further optimisation efforts.
I'm currently working on a program that stores RGB-info from two images to compare them.
I created two example images with paint.net.
Both are 16x16 and one is BLUE and the other one is RED.
I set the value in paint.net to (255, 0, 0) in the RGB value for RED and in the blue image to (0, 0, 255).
Then I loaded each image into a ByteBuffer and looked inside it.
// Buffer for texture data
ByteBuffer res = BufferUtils.makeByteBufferT4(w * h);
// Convert pixel format
for (int y = 0; y != h; y++) {
for (int x = 0; x != w; x++) {
int pp = bi.getRGB(x, y);
byte a = (byte) ((pp & 0xff000000) >> 24);
byte r = (byte) ((pp & 0x00ff0000) >> 16);
byte g = (byte) ((pp & 0x0000ff00) >> 8);
byte b = (byte) (pp & 0x000000ff);
res.put((y * w + x) * 4 + 0, r);
res.put((y * w + x) * 4 + 1, g);
res.put((y * w + x) * 4 + 2, b);
res.put((y * w + x) * 4 + 3, a);
}
}
public static ByteBuffer makeByteBufferT4(int length){
// As "int" in java has 4 bytes we have to multiply our length with 4 for every single int value
ByteBuffer res = null;
return res = ByteBuffer.allocateDirect(length * 4);
}
Via res.get(0) I expected the value 1, but got -1.
Why is this so, shouldn't it store the value 1?
This is not a problem that affects my code negatively, but more an understanding issue I have.
Is there a way to find the number of rectangular submatrices containing all zeros with a complexity smaller than O(n^3), where n is the dimension of the given matrix?
Here is an O(n² log n) solution.
First, let's convert the main problem to something like this:
For a given histogram, find the number of rectangles that fit under it.
How do we convert it?
For each cell, calculate the height of the column of consecutive zeros ending at that cell (counting upward).
Example:
10010 01101
00111 12000
00001 -> 23110
01101 30020
01110 40001
It can easily be computed in O(n²):
for(int i = 1; i <= n; i++)
for(int j = 1; j <= m; j++)
up[i][j] = arr[i][j] ? 0 : 1 + up[i - 1][j];
Now we can treat each row as a histogram with the given heights.
Let's solve the problem for a single histogram.
Our goal is to traverse the heights from left to right, and at each step update an array L.
For each height, this array holds the maximum width such that a rectangle of that width and that height can be formed from the current position extending to the left.
Consider this example:
0
0   0
0 000
00000 -> heights: 6 3 4 4 5 2
000000
000000
L[6]: 1 0 0 0 0 0
L[5]: 1 0 0 0 1 0
L[4]: 1 0 1 2 3 0
L[3]: 1 2 3 4 5 0
L[2]: 1 2 3 4 5 6
L[1]: 1 2 3 4 5 6
steps: 1 2 3 4 5 6
As you can see, if we add all those numbers we get the answer for this histogram.
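(For the table above, adding every entry row by row gives 1 + 2 + 7 + 15 + 21 + 21 = 67, so there are 67 rectangles that fit under the histogram with heights 6 3 4 4 5 2.)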
We can update array L naively in O(n), but we can also do it in O(log n) using a segment tree (with lazy propagation) that supports adding over an interval, setting a value over an interval, and querying the sum of an interval.
At each step we add 1 to the interval [1, height], set 0 on the interval [height + 1, maxHeight], and take the sum over [1, maxHeight], where height is the height of the current column of the histogram and maxHeight is the maximum column height.
And that's how you get an O(n² log n) solution :)
Here is the main code in C++:
const int MAXN = 1000;
int n;
int arr[MAXN + 5][MAXN + 5]; // stores given matrix
int up[MAXN + 5][MAXN + 5]; // heights of columns of zeros
long long answer;
long long calculate(int *h, int maxh) { // solve it for histogram
clearTree();
long long result = 0;
for(int i = 1; i <= n; i++) {
add(1, h[i]); // add 1 to [1, h[i]]
set(h[i] + 1, maxh); // set 0 in [h[i] + 1, maxh];
result += query(); // get sum from [1, maxh]
}
return result;
}
int main() {
ios_base::sync_with_stdio(0);
cin >> n;
for(int i = 1; i <= n; i++)
for(int j = 1; j <= n; j++)
cin >> arr[i][j]; // read the data
for(int i = 1; i <= n; i++)
for(int j = 1; j <= n; j++)
up[i][j] = arr[i][j] ? 0 : 1 + up[i - 1][j]; // calculate values of up
for(int i = 1; i <= n; i++)
answer += calculate(up[i], i); // calculate for each row
cout << answer << endl;
}
Here is the beginning of the code, the segment tree:
#include <iostream>
using namespace std;
// interval-interval tree that stores sums
const int p = 11;
int sums[1 << p];
int lazy[1 << p];
int need[1 << p];
const int M = 1 << (p - 1);
void update(int node) {
if(need[node] == 1) { // add
sums[node] += lazy[node];
if(node < M) {
need[node * 2] = need[node * 2] == 2 ? 2 : 1;
need[node * 2 + 1] = need[node * 2 + 1] == 2 ? 2 : 1;
lazy[node * 2] += lazy[node] / 2;
lazy[node * 2 + 1] += lazy[node] / 2;
}
} else if(need[node] == 2) { // set
sums[node] = lazy[node];
if(node < M) {
need[node * 2] = need[node * 2 + 1] = 2;
lazy[node * 2] = lazy[node] / 2;
lazy[node * 2 + 1] = lazy[node] / 2;
}
}
need[node] = 0;
lazy[node] = 0;
}
void insert(int node, int l, int r, int lq, int rq, int value, int id) {
update(node);
if(lq <= l && r <= rq) {
need[node] = id;
lazy[node] = value * (r - l + 1);
update(node);
return;
}
int mid = (l + r) / 2;
if(lq <= mid) insert(node * 2, l, mid, lq, rq, value, id);
if(mid + 1 <= rq) insert(node * 2 + 1, mid + 1, r, lq, rq, value, id);
sums[node] = sums[node * 2] + sums[node * 2 + 1];
}
int query() {
return sums[1]; // we only need to know sum of the whole interval
}
void clearTree() {
for(int i = 1; i < 1 << p; i++)
sums[i] = lazy[i] = need[i] = 0;
}
void add(int left, int right) {
insert(1, 0, M - 1, left, right, 1, 1);
}
void set(int left, int right) {
insert(1, 0, M - 1, left, right, 0, 2);
}
// end of the tree
I'm looking for an algorithm to flip a 1-bit bitmap line horizontally. Remember these lines are DWORD aligned!
I'm currently decoding an RLE stream to an 8-bit-per-pixel buffer, then re-encoding to a 1-bit line; however, I would like to try to keep it all in the 1-bit space in an effort to increase speed. Profiling indicates this portion of the program is relatively slow compared to the rest.
Example line (Before Flip):
FF FF FF FF 77 AE F0 00
Example line (After Flip):
F7 5E EF FF FF FF F0 00
Create a conversion table to reverse the bits in a byte:
byte[] convert = new byte[256];
for (int i = 0; i < 256; i++) {
int value = 0;
for (int bit = 1; bit <= 128; bit<<=1) {
value <<= 1;
if ((i & bit) != 0) value++;
}
convert[i] = (byte)value;
}
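For example, convert[0x77] is 0xEE and convert[0xAE] is 0x75.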
Now you can use the table to reverse a byte; then you just have to store the byte in the right place in the result:
byte[] data = { 0xFF, 0xFF, 0xFF, 0xFF, 0x77, 0xAE, 0xF0, 0x00 };
int width = 52;
int shift = data.Length * 8 - width;
int shiftBytes = data.Length - 1 - shift / 8;
int shiftBits = shift % 8;
byte[] result = new byte[data.Length];
for (int i = 0; i < data.Length; i++) {
byte swap = convert[data[i]];
if (shiftBits == 0) {
result[shiftBytes - i] = swap;
} else {
if (shiftBytes - i >= 0) {
result[shiftBytes - i] |= (byte)(swap << shiftBits);
}
if (shiftBytes - i - 1 >= 0) {
result[shiftBytes - i - 1] |= (byte)(swap >> (8 - shiftBits));
}
}
}
Console.WriteLine(BitConverter.ToString(result));
Output:
F7-5E-EF-FF-FF-FF-F0-00
The following code reads and reverses the data in blocks of 32 bits as integers. The bit reversal is split into two parts (ReverseBitsA reverses the bits within each byte, ReverseBitsB reverses the byte order) because on a little-endian machine reading four bytes as a 32-bit integer already reverses the byte order.
private static void Main()
{
var lineLength = 52;
var input = new Byte[] { 0xFF, 0xFF, 0xFF, 0xFF, 0x77, 0xAE, 0xF0, 0x00 };
var output = new Byte[input.Length];
UInt32 lastValue = 0x00000000;
var numberBlocks = lineLength / 32 + ((lineLength % 32 == 0) ? 0 : 1);
var numberBitsInLastBlock = lineLength % 32;
for (Int32 block = 0; block < numberBlocks; block++)
{
var rawValue = BitConverter.ToUInt32(input, 4 * block);
var reversedValue = (ReverseBitsA(rawValue) << (32 - numberBitsInLastBlock)) | (lastValue >> numberBitsInLastBlock);
lastValue = rawValue;
BitConverter.GetBytes(ReverseBitsB(reversedValue)).CopyTo(output, 4 * (numberBlocks - block - 1));
}
Console.WriteLine(BitConverter.ToString(input).Replace('-', ' '));
Console.WriteLine(BitConverter.ToString(output).Replace('-', ' '));
}
private static UInt32 SwapBitGroups(UInt32 value, UInt32 mask, Int32 shift)
{
return ((value & mask) << shift) | ((value & ~mask) >> shift);
}
private static UInt32 ReverseBitsA(UInt32 value)
{
value = SwapBitGroups(value, 0x55555555, 1);
value = SwapBitGroups(value, 0x33333333, 2);
value = SwapBitGroups(value, 0x0F0F0F0F, 4);
return value;
}
private static UInt32 ReverseBitsB(UInt32 value)
{
value = SwapBitGroups(value, 0x00FF00FF, 8);
value = SwapBitGroups(value, 0x0000FFFF, 16);
return value;
}
It is a bit ugly and not robust against errors ... but it is just sample code. And it outputs the following.
FF FF FF FF 77 AE F0 00
F7 5E EF FF FF FF F0 00