LZ77 slow compression speed - algorithm

I'm writing a simple compression program using the LZ77 algorithm. My problem is very slow compression speed on any big file (for a 2 MB image it takes more than a minute with a buffer size of 12 and a dictionary size of 4096). I use the Boyer-Moore-Horspool algorithm to search for the current buffer prefixes in the dictionary. Please tell me what could cause such a slowdown, and whether there are any ways to improve this code's performance.
void findLongestMatch(unsigned char* d, unsigned char* b, short &len, short &off)
{
    short alphabet[256];
    short shift = 0;
    short dict_pos = 0;
    bool found = false;
    if (cur_dict_length == 0) { return; }
    // Try prefixes of the look-ahead buffer from longest to shortest
    for (int prefix_length = cur_buf_length - 1; prefix_length >= 0; prefix_length--)
    {
        found = false;
        // Build the Boyer-Moore-Horspool shift table for this prefix
        for (int j = 0; j < 256; j++)
        {
            alphabet[j] = prefix_length + 1;
        }
        for (int j = prefix_length; j >= 1; j--)
        {
            alphabet[(unsigned char)b[j]] = j;
        }
        shift = 0;
        dict_pos = cur_dict_length - (prefix_length + 1);
        // Scan the dictionary using the shift table
        while (dict_pos >= 0)
        {
            if (memcmp(&d[dict_pos], &b[0], prefix_length + 1) == 0)
            {
                found = true;
                len = prefix_length + 1;
                off = cur_dict_length - dict_pos;
                break;
            }
            shift = alphabet[(unsigned char)d[dict_pos]];
            dict_pos = dict_pos - shift;
        }
        if (found == true) break;
    }
    return;
}
void compressData(long block_size, unsigned char* s, fstream &file_out)
{
    unsigned char buf_out[3];
    unsigned char* dict;
    unsigned char* buf;
    link myLink;
    file_out.seekp(0, ios_base::beg);
    cur_dict_length = 0;
    cur_buf_length = buf_length;
    for (int i = 0; i < block_size; i++)
    {
        buf = &s[i];
        dict = &s[i - cur_dict_length];
        myLink.length = 0; myLink.offset = 0;
        findLongestMatch(dict, buf, myLink.length, myLink.offset);
        if (myLink.length <= buf_length) myLink.next = buf[myLink.length];
        else myLink.next = s[i];
        compactLink(myLink, buf_out);
        for (int j = 0; j < 3; j++)
        {
            file_out << buf_out[j];
        }
        i = i + myLink.length;
        if (cur_dict_length < dict_length) {
            cur_dict_length = cur_dict_length + 1 + myLink.length;
            if (cur_dict_length > dict_length) cur_dict_length = dict_length;
        }
        if (i + cur_buf_length >= block_size) cur_buf_length = cur_buf_length - 1 - (i + cur_buf_length - block_size);
    }
}

As noted, it's the search algorithm that is your issue. You can use a chained hash table, as deflate does, or a suffix tree.
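For illustration, here is a minimal sketch of a deflate-style chained hash match finder in the spirit of that suggestion. It assumes the same 4096-byte dictionary and 12-byte look-ahead buffer as in the question; the names (hash3, findMatchHashed, insertPosition) are made up here and are not part of the original code:
const int WINDOW_SIZE = 4096;              // dictionary size from the question
const int WINDOW_MASK = WINDOW_SIZE - 1;
const int MIN_MATCH   = 3;                 // hash is computed over 3 bytes
const int MAX_MATCH   = 12;                // look-ahead buffer size from the question
const int HASH_SIZE   = 1 << 12;
const int MAX_CHAIN   = 128;               // bound on chain walks, trades speed for match quality

static int head[HASH_SIZE];                // most recent position for each hash (initialize to -1)
static int prevPos[WINDOW_SIZE];           // previous position with the same hash

static int hash3(const unsigned char* p)
{
    return ((p[0] << 8) ^ (p[1] << 4) ^ p[2]) & (HASH_SIZE - 1);
}

// Longest match for data[pos..] among earlier positions inside the window.
void findMatchHashed(const unsigned char* data, long pos, long dataLen, short &len, short &off)
{
    len = 0; off = 0;
    if (pos + MIN_MATCH > dataLen) return;
    long limit = (dataLen - pos < MAX_MATCH) ? (dataLen - pos) : MAX_MATCH;
    int candidate = head[hash3(&data[pos])];
    for (int chain = 0; chain < MAX_CHAIN && candidate >= 0; ++chain)
    {
        if (candidate < pos - WINDOW_SIZE || candidate >= pos) break;  // stale or wrapped entry
        int matchLen = 0;
        while (matchLen < limit && data[candidate + matchLen] == data[pos + matchLen])
            ++matchLen;
        if (matchLen >= MIN_MATCH && matchLen > len)
        {
            len = (short)matchLen;
            off = (short)(pos - candidate);
        }
        candidate = prevPos[candidate & WINDOW_MASK];
    }
}

// Call once for every position that enters the dictionary.
void insertPosition(const unsigned char* data, long pos, long dataLen)
{
    if (pos + MIN_MATCH > dataLen) return;
    int h = hash3(&data[pos]);
    prevPos[pos & WINDOW_MASK] = head[h];
    head[h] = (int)pos;
}
The point is that each position is looked up by hashing its first three bytes, so only dictionary positions that already share those bytes are compared, which is typically far cheaper than scanning the whole dictionary once per prefix length.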

Related

Why is my code printing symbols instead of letters?

I am supposed to write a program with three files (mysource.c, myMain.c, and mysource.h) to create a randomly generated string of characters. The length of the string is decided by the user. After the string is generated, the program will bump all letters in the string to the next letter in the alphabet to create a new offset string. I have most of the code sorted out, but my output is printing "╠╠╠╠". It prints the correct number of characters, but it only prints those symbols. What do I need to do so that the characters print as actual letters rather than these symbols?
Here is my header file:
void generateChars(char *myarr, int len);
void offsetChars(char *myarr, int len);
void printChars(char *myarr, int len);
Here is my source code:
#include <stdio.h>
#include <stdlib.h>
#include "mysource.h"
void generateChars(char* myarr, int len)
{
    int i = 0;
    char letters[26] = {'a','b','c','d','e','f','g','h','i','j','k','l','m',
                        'n','o','p','q','r','s','t','u','v','w','x','y','z'};
    for (i = 0; i < len; i++);
    {
        myarr[i] = letters[rand() % 26];
    }
}
//end generate function
void offsetChars(char *myarr, int len)
{
    char i;
    int j;
    for (j = 0; j < len; j++)
    {
        for (i = 'a'; i <= 'z'; i++)
        {
            if (myarr[j] == i)
            {
                myarr[j] = i + 1;
                break;
            }
            if (myarr[j] == 'z')
            {
                myarr[j] = 'a';
                break;
            }
        }
    }
}
//end offset function
void printChars(char *myarr, int len)
{
    int i;
    for (i = 0; i < len; i++)
    {
        printf("%c", myarr[i]);
    }
} // end of print function
Here is my main code:
#include <stdio.h>
#include <stdlib.h>
#include "mysource.h"
int main()
{
    int n;
    printf("How many random characters do you want to generate?: ");
    scanf_s("%i", &n);
    char myarr[1024];
    printf("\nOriginal Combo:\n");
    generateChars(&myarr, n);
    printChars(&myarr, n);
    printf("\nOffset Combo:\n");
    offsetChars(&myarr, n);
    printChars(&myarr, n);
    return 0;
}
Here is the output I get:
(I don't have enough reputation to embed images, so the output is attached as a picture.)
Yes, there are two source files; the objective of the assignment is to make the program work with both. Any help is appreciated!
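One thing that stands out in the posted code (a guess, not a confirmed diagnosis): the semicolon after the for loop header in generateChars ends the loop with an empty body, so the block below it runs only once, with i already equal to len, and the characters that later get printed are never initialized. A minimal sketch of the generator loop without the stray semicolon:
#include <stdlib.h>

void generateChars(char *myarr, int len)
{
    int i;
    char letters[26] = {'a','b','c','d','e','f','g','h','i','j','k','l','m',
                        'n','o','p','q','r','s','t','u','v','w','x','y','z'};
    for (i = 0; i < len; i++)   /* no semicolon here, so the block below is the loop body */
    {
        myarr[i] = letters[rand() % 26];
    }
}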

CUDA string search in large file, wrong result

I am working on a simple naive string search in CUDA.
I am new to CUDA. It works fine for smaller files (approx. ~1 MB). After I make these files bigger (Ctrl+A, Ctrl+C several times in Notepad++), my program's result is higher (about +1%) than
grep -o text file_name | wc -l
It is a very simple function, so I don't know what could cause this. I need it to work with larger files (~500 MB).
Kernel code ( gpuCount is a __device__ int global variable ):
__global__ void stringSearchGpu(char *data, int dataLength, char *input, int inputLength){
    int id = blockDim.x*blockIdx.x + threadIdx.x;
    if (id < dataLength)
    {
        int fMatch = 1;
        for (int j = 0; j < inputLength; j++)
        {
            if (data[id + j] != input[j]) fMatch = 0;
        }
        if (fMatch)
        {
            atomicAdd(&gpuCount, 1);
        }
    }
}
This is calling the kernel in main function:
int blocks = 1, threads = fileSize;
if (fileSize > 1024)
{
    blocks = (fileSize / 1024) + 1;
    threads = 1024;
}
clock_t cpu_start = clock();
// kernel call
stringSearchGpu<<<blocks, threads>>>(cudaBuffer, strlen(buffer), cudaInput, strlen(input));
cudaDeviceSynchronize();
After this I just copy the result to Host and print it.
Can anyone please help me with this?
First of all, you should always check the return values of CUDA functions for errors. The best way to do so would be the following:
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}
Wrap your CUDA calls, such as:
gpuErrchk(cudaDeviceSynchronize());
Second, your kernel accesses out-of-bounds memory. Suppose dataLength=100, inputLength=7, and id=98. In your kernel code:
if (id < dataLength) // 98 is less than 100, so condition true
{
    int fMatch = 1;
    for (int j = 0; j < inputLength; j++) // j runs from [0 - 6]
    {
        // if j>1 then id+j>=100, which is out of bounds, illegal operation
        if (data[id + j] != input[j]) fMatch = 0;
    }
Change the condition to something like:
if (id < dataLength - inputLength)
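Putting both points together, here is a sketch of the kernel with the guard written as id + inputLength <= dataLength, which expresses the same idea; gpuCount is the __device__ counter described in the question, and the early break inside the loop is just a small extra optimization, not part of the original code:
__device__ int gpuCount;   // global match counter, as described in the question

__global__ void stringSearchGpu(char *data, int dataLength, char *input, int inputLength)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    // Only positions where the whole pattern still fits inside the buffer are compared,
    // so data[id + j] can never run past the end of the buffer.
    if (id + inputLength <= dataLength)
    {
        int fMatch = 1;
        for (int j = 0; j < inputLength; j++)
        {
            if (data[id + j] != input[j]) { fMatch = 0; break; }
        }
        if (fMatch)
        {
            atomicAdd(&gpuCount, 1);
        }
    }
}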

Optimizing N-queen with openmp

I am learning OpenMP and wrote the following code to solve the N-queens problem.
//Full Code: https://github.com/Shafaet/Codes/blob/master/OPENMP/Parallel%20N-Queen%20problem.cpp
int n;
int call(int col, int rowmask, int dia1, int dia2)
{
    if (col == n)
    {
        return 1;
    }
    int row, ans = 0;
    for (row = 0; row < n; row++)
    {
        if (!(rowmask & (1<<row)) & !(dia1 & (1<<(row+col))) & !(dia2 & (1<<((row+n-1)-col))))
        {
            ans += call(col+1, rowmask|1<<row, dia1|(1<<(row+col)), dia2|(1<<((row+n-1)-col)));
        }
    }
    return ans;
}
double parallel()
{
    double st = omp_get_wtime();
    int ans = 0;
    int i;
    int rowmask = 0, dia1 = 0, dia2 = 0;
    #pragma omp parallel for reduction(+:ans) shared(i,rowmask)
    for (i = 0; i < n; i++)
    {
        rowmask = 0;
        dia1 = 0, dia2 = 0;
        int col = 0, row = i;
        ans += call(1, rowmask|1<<row, dia1|(1<<(row+col)), dia2|(1<<((row+n-1)-col)));
    }
    printf("Found %d configuration for n=%d\n", ans, n);
    double en = omp_get_wtime();
    printf("Time taken using openmp %lf\n", en-st);
    return en-st;
}
double serial()
{
    double st = omp_get_wtime();
    int ans = 0;
    int i;
    int rowmask = 0, dia1 = 0, dia2 = 0;
    for (i = 0; i < n; i++)
    {
        rowmask = 0;
        dia1 = 0, dia2 = 0;
        int col = 0, row = i;
        ans += call(1, rowmask|1<<row, dia1|(1<<(row+col)), dia2|(1<<((row+n-1)-col)));
    }
    printf("Found %d configuration for n=%d\n", ans, n);
    double en = omp_get_wtime();
    printf("Time taken without openmp %lf\n", en-st);
    return en-st;
}
int main()
{
    double average = 0;
    int count = 0;
    for (int i = 2; i <= 13; i++)
    {
        count++;
        n = i;
        double stime = serial();
        double ptime = parallel();
        printf("OpenMP is %lf times faster for n=%d\n", stime/ptime, n);
        average += stime/ptime;
        puts("===============");
    }
    printf("On average OpenMP is %lf times faster\n", average/count);
    return 0;
}
The parallel code is already faster than the serial one, but I wonder how I can optimize it further using OpenMP pragmas. I want to know what I should do for better performance and what I should not do.
Thanks in advance.
(Please don't suggest any optimizations that are unrelated to parallel programming.)
Your code seems to use the classic recursive backtracking N-Queens algorithm, which is not the fastest possible way to solve N-Queens, but (due to its simplicity) is the clearest one for practicing parallelism basics.
That being said, the code is very simple, so you shouldn't expect it to naturally demonstrate many advanced OpenMP features beyond a basic "parallel for" and a reduction.
But since you're looking to learn parallelism, and probably want more clarity and a better learning curve, here is one more (of many possible) implementations. It uses the same algorithm but tends to be more readable and instructive from an educational perspective:
void setQueen(int queens[], int row, int col) {
    // check all previously placed rows for attacks
    for (int i = 0; i < row; i++) {
        // vertical attacks
        if (queens[i] == col) {
            return;
        }
        // diagonal attacks
        if (abs(queens[i]-col) == (row-i)) {
            return;
        }
    }
    // column is ok, set the queen
    queens[row] = col;
    if (row == size-1) {
        #pragma omp atomic
        nrOfSolutions++; // Placed final queen, found a solution
    }
    else {
        // try to fill next row
        for (int i = 0; i < size; i++) {
            setQueen(queens, row+1, i);
        }
    }
}

// Function to find all solutions for the nQueens problem on a size x size chessboard.
void solve() {
    #pragma omp parallel for
    for (int i = 0; i < size; i++) {
        // try all positions in first row
        int* queens = new int[size]; // array representing queens placed on a chess board. Index is row position, value is column.
        setQueen(queens, 0, i);
        delete[] queens;
    }
}
This code is one of the Intel Advisor XE samples (available for both C++ and Fortran); the parallelization aspects of the sample are discussed in great detail in Chapter 10 of the accompanying parallel programming book (in fact, that chapter uses N-Queens to demonstrate how to use the tools to parallelize serial code in general).
The Advisor N-Queens sample uses essentially the same algorithm as yours, but it replaces the explicit reduction with a combination of a simple parallel for and an atomic update. This version is expected to be less efficient, but it is more "procedural-style" and more "educational", since it demonstrates the "hidden" data race. If you download the sample code, you will actually find four equivalent N-Queens parallel implementations using TBB, Cilk Plus, and OpenMP (the OpenMP one is available for C++ and Fortran).
I know I am a little late to the party, but you can use task queueing for further optimization (about 7-10% faster results; no idea why). Here's the code that I am using:
#include <iostream> // std::cout, cin, cerr ...
#include <iomanip>  // modify std::out
#include <omp.h>
using namespace std;

int nrOfSolutions = 0;
int size = 0;

void print(int queens[]) {
    cerr << "Solution " << nrOfSolutions << endl;
    for (int row = 0; row < size; row++) {
        for (int col = 0; col < size; col++) {
            if (queens[row] == col) {
                cout << "Q";
            }
            else {
                cout << "-";
            }
        }
        cout << endl;
    }
}
void setQueen(int queens[], int row, int col, int id) {
    for (int i = 0; i < row; i++) {
        // vertical attacks
        if (queens[i] == col) {
            return;
        }
        // diagonal attacks
        if (abs(queens[i]-col) == (row-i)) {
            return;
        }
    }
    // column is ok, set the queen
    queens[row] = col;
    if (row == size-1) {
        // only one thread should be allowed to update the counter (and print) at a time
        {
            // increasing the solution counter is not atomic
            #pragma omp critical
            nrOfSolutions++;
            #ifdef _DEBUG
            #pragma omp critical
            print(queens);
            #endif
        }
    }
    else {
        // try to fill next row
        for (int i = 0; i < size; i++) {
            setQueen(queens, row+1, i, id);
        }
    }
}
void solve() {
    int myid = 0;
    #pragma omp parallel
    #pragma omp single
    {
        for (int i = 0; i < size; i++) {
            /*
            #ifdef _OMP //(???)
            myid = omp_get_thread_num();
            #endif
            #ifdef _DEBUG
            cout << "ThreadNum: " << myid << endl;
            #endif
            */
            // try all positions in first row
            // create separate array for each recursion started here
            #pragma omp task
            setQueen(new int[size], 0, i, myid);
        }
    }
}

int main(int argc, char* argv[]) {
    if (argc != 2) {
        cerr << "Usage: nq-openmp-taskq boardSize.\n";
        return 0;
    }
    size = atoi(argv[1]);
    cout << "Starting OpenMP Task Queue solver for size " << size << "...\n";
    double st = omp_get_wtime();
    solve();
    double en = omp_get_wtime();
    printf("Time taken using openmp %lf\n", en-st);
    cout << "Number of solutions: " << nrOfSolutions << endl;
    return 0;
}

Parallel hashing using openmp

I have a piece of code for parallel hashing; the insert code is as follows:
int main(int argc, char** argv){
    .....
    Entry* table; // hash table
    for (size_t i = 0; i < N; i++) {
        keys[i] = i;
        values[i] = rand(); // random key-value pairs
    }
    int omp_p = omp_get_max_threads();
    #pragma omp parallel for
    for (int p = 0; p < omp_p; p++) {
        size_t start = p*N/omp_p;
        size_t end = (p+1)*N/omp_p; // each thread gets contiguous chunks of the arrays
        for (size_t i = start; i < end; i++) {
            size_t key = keys[i];
            size_t value = values[i];
            if (insert(table, key, value) == 0) {
                printf("Failure!\n");
            }
        }
    }
    ....
    return 0;
}
int insert(Entry* table, size_t key, size_t value){
    Entry entry = (((Entry)key) << 32) + value; // Coalesce key and value into an entry
    /* Use cuckoo hashing */
    size_t location = hash_function_1(key);
    for (size_t its = 0; its < MAX_ITERATIONS; its++) {
        entry = __sync_lock_test_and_set(&table[location], entry);
        key = get_key(entry);
        if (key == KEY_EMPTY)
            return 1;
        /* We have replaced a valid key, try to hash it using the next available hash function */
        size_t location_1 = hash_function_1(key);
        size_t location_2 = hash_function_2(key);
        size_t location_3 = hash_function_3(key);
        if (location == location_1) location = location_2;
        else if (location == location_2) location = location_3;
        else location = location_1;
    }
    return 0;
}
The insert code doesn't scale at all. If I use a single thread for, say, 10M keys, it completes in about 170 ms, whereas with 16 threads it takes more than 500 ms. My suspicion is that this is because the cache lines holding the table[] array are being moved around between the threads during the write operation (__sync_lock_test_and_set(...)), and the invalidation results in a slowdown.
For example, if I modify the insert code to just:
int insert(Entry* table, size_t key, size_t value){
    Entry entry = (((Entry)key) << 32) + value; // Coalesce key and value into an entry
    size_t location = hash_function_1(key);
    table[location] = entry;
    return 1;
}
I still get the same bad performance. Since this is hashing, I cannot control where a particular element hashes to, so any suggestions? Also, if this isn't the right reason, are there any other pointers as to what might be going wrong? I have tried it from 1M to 100M keys, but the single-threaded performance is always better.
I have a few suggestions. First, since the run time of your insert function is not constant, you should use schedule(dynamic). Second, you should let OpenMP divide the tasks and not do it yourself (one reason, though not the main one, is that the way you have it now, N has to be a multiple of omp_p). If you want some control over how it divides the tasks, try changing the chunk size like this: schedule(dynamic,n), where n is the chunk size.
#pragma omp parallel for schedule(dynamic)
for (size_t i = 0; i < N; i++) {
    size_t key = keys[i];
    size_t value = values[i];
    if (insert(table, key, value) == 0) {
        printf("Failure!\n");
    }
}
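If the default dynamic scheduling turns out to be too fine-grained, the chunk-size form mentioned above would look like this (1024 is just an arbitrary example value):
#pragma omp parallel for schedule(dynamic, 1024)
for (size_t i = 0; i < N; i++) {
    // same loop body as above
}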
I would try experimenting with a strategy based on locks, like this simple snippet shows:
#include <omp.h>

#define NHASHES 4
#define NTABLE  1000000

typedef size_t (hash_f)(size_t);

int main(int argc, char** argv) {
    Entry      table [NTABLE ];
    hash_f*    hashes[NHASHES];   // pointers to the hash functions
    omp_lock_t locks [NTABLE ];
    /* ... */
    for (size_t ii = 0; ii < N; ii++) {
        keys  [ii] = ii;
        values[ii] = rand();
    }
    for (size_t ii = 0; ii < NTABLE; ii++) {
        omp_init_lock(&locks[ii]);
    }
    #pragma omp parallel
    {
        #pragma omp for schedule(static)
        for (int ii = 0; ii < N; ii++) {
            size_t key   = keys  [ii];
            size_t value = values[ii];
            Entry  entry = (((Entry)key) << 32) + value;
            for (int jj = 0; jj < NHASHES; jj++) {
                size_t location = hashes[jj](key); // I assume this is the computationally demanding part
                omp_set_lock(&locks[location]);    // Locks the hash table location before working on it
                if (get_key(table[location]) == KEY_EMPTY) {
                    table[location] = entry;
                    omp_unset_lock(&locks[location]); // Unlocks before leaving the loop
                    break;
                }
                omp_unset_lock(&locks[location]);  // Unlocks the hash table location
            }
            // Handle failures here
        }
    } /* pragma omp parallel */
    for (size_t ii = 0; ii < NTABLE; ii++) {
        omp_destroy_lock(&locks[ii]);
    }
    /* ... */
    return 0;
}
With a little more machinery you can handle a variable number of locks ranging from 1 (equivalent to a critical section) to NTABLE (equivalent to an atomic construct) and see if the granularity in-between provides some benefit.
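For instance, that in-between granularity could be obtained by striping the locks, i.e. letting one lock guard many table slots. A minimal sketch under the same assumptions as the snippet above (Entry, get_key, and KEY_EMPTY come from the question; NLOCKS and the helper names are made up here):
#include <omp.h>
#include <cstddef>

// Striped locks: NLOCKS locks shared over NTABLE slots, so the lock count can be
// tuned anywhere between 1 (one big critical section) and NTABLE (one lock per slot).
#define NLOCKS 4096

static omp_lock_t lock_pool[NLOCKS];

static inline omp_lock_t* lock_for(size_t location) {
    return &lock_pool[location % NLOCKS];
}

static void init_lock_pool()    { for (size_t i = 0; i < NLOCKS; i++) omp_init_lock(&lock_pool[i]); }
static void destroy_lock_pool() { for (size_t i = 0; i < NLOCKS; i++) omp_destroy_lock(&lock_pool[i]); }

// Attempt to claim one slot under its striped lock; returns 1 on success.
// Entry, get_key() and KEY_EMPTY are assumed to be defined as in the question.
static int try_claim(Entry* table, size_t location, Entry entry) {
    int claimed = 0;
    omp_lock_t* l = lock_for(location);
    omp_set_lock(l);
    if (get_key(table[location]) == KEY_EMPTY) {
        table[location] = entry;
        claimed = 1;
    }
    omp_unset_lock(l);
    return claimed;
}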

How to reduce page faults in this program?

I'm getting more than 1000 page faults in this program.
Can I reduce them to a smaller number, or even to zero?
Are there any other changes that could speed up the execution?
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char* argv[])
{
    register unsigned int u, v, i;
    register unsigned int arr_size = 0;
    register unsigned int b_size = 0;
    register unsigned int c;
    register unsigned int *b;
    FILE *file;
    register unsigned int *arr;
    file = fopen(argv[1], "r");
    arr = (unsigned int *)malloc(4*10000000);
    while (!feof(file)) {
        ++arr_size;
        fscanf(file, "%u\n", &arr[arr_size-1]);
    }
    fclose(file);
    b = (unsigned int *)malloc(arr_size*4);
    if (arr_size != 0)
    {
        ++b_size;
        b[b_size-1] = 0;
        for (i = 1; i < arr_size; ++i)
        {
            if (arr[b[b_size-1]] < arr[i])
            {
                ++b_size;
                b[b_size-1] = i;
                continue;
            }
            for (u = 0, v = b_size-1; u < v;)
            {
                c = (u + v) / 2;
                if (arr[b[c]] < arr[i]) u = c+1; else v = c;
            }
            if (arr[i] < arr[b[u]])
            {
                b[u] = i;
            }
            if (i > arr_size) break;
        }
    }
    free(arr);
    free(b);
    printf("%u\n", b_size);
    return 0;
}
The line:
arr=(unsigned int *)malloc(4*10000000);
is not good programming style. Are you sure that your file is as big as 40 MB? Try not to allocate all of the memory in the first lines of your program.
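For example, the reading loop could grow the array geometrically instead of reserving 40 MB up front. A minimal sketch (the helper name and the initial capacity are arbitrary, and error handling is kept to the bare minimum):
#include <stdio.h>
#include <stdlib.h>

/* Sketch: grow the array while reading, instead of allocating 40 MB up front. */
unsigned int *read_values(FILE *file, unsigned int *out_size)
{
    unsigned int capacity = 1024, size = 0, value;
    unsigned int *arr = malloc(capacity * sizeof *arr);
    if (arr == NULL) return NULL;

    while (fscanf(file, "%u", &value) == 1) {
        if (size == capacity) {                      /* array full: double it */
            capacity *= 2;
            unsigned int *tmp = realloc(arr, capacity * sizeof *arr);
            if (tmp == NULL) { free(arr); return NULL; }
            arr = tmp;
        }
        arr[size++] = value;
    }
    *out_size = size;
    return arr;
}
This keeps the allocation proportional to the actual input size rather than to a fixed worst-case guess.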
