Longest common SUBSTRING optimization - performance

Can anybody help me with optimizing my LONGEST COMMON SUBSTRING problem? I must read really big files (up to 2 GB), but I can't figure out which structure to use. In C++ there are no hash maps I can use; there is a concurrent hash map in TBB, but it is very complicated to use with this algorithm. I have the problem solved with a full **L matrix, but it is greedy with memory and cannot be used for large inputs. The matrix is full of zeros, and that could be exploited by, e.g., using a map of maps and storing only the non-zero cells, but that is really slow and practically unusable. Speed is very important. Here is the code:
// L[i][j] will contain length of the longest substring
// ending by positions i in refSeq and j in otherSeq
size_t **L = new size_t*[refSeq.length()];
for(size_t i=0; i<refSeq.length(); ++i)
    L[i] = new size_t[otherSeq.length()];

// iteration over the characters of the reference sequence
for(size_t i=0; i<refSeq.length(); i++){
    // iteration over the characters of the sequence to compare
    for(size_t j=0; j<otherSeq.length(); j++){
        // if the characters are the same,
        // increase the consecutive matching score from the previous cell
        if(refSeq[i]==otherSeq[j]){
            if(i==0 || j==0)
                L[i][j] = 1;
            else
                L[i][j] = L[i-1][j-1] + 1;
        }
        // or reset the matching score to 0
        else
            L[i][j] = 0;
    }
}

// output the matches for this sequence
// length must be at least minMatchLength
// and the longest possible.
for(size_t i=0; i<refSeq.length(); i++){
    for(size_t j=0; j<otherSeq.length(); j++){
        if(L[i][j]>=minMatchLength) {
            //this sequence is part of a longer one
            if(i+1<refSeq.length() && j+1<otherSeq.length() && L[i][j]<=L[i+1][j+1])
                continue;
            //this sequence is part of a longer one
            if(i<refSeq.length() && j+1<otherSeq.length() && L[i][j]<=L[i][j+1])
                continue;
            //this sequence is part of a longer one
            if(i+1<refSeq.length() && j<otherSeq.length() && L[i][j]<=L[i+1][j])
                continue;
            cout << i-L[i][j]+2 << " " << i+1 << " " << j-L[i][j]+2 << " " << j+1 << "\n";
            // output the matching sequences for debugging :
            //cout << refSeq.substr(i-L[i][j]+1, L[i][j]) << "\n";
            //cout << otherSeq.substr(j-L[i][j]+1, L[i][j]) << "\n";
        }
    }
}
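For what it's worth, the memory blow-up can be avoided for the basic computation: the recurrence only ever reads L[i-1][j-1], so two rows of length otherSeq.length() are enough (O(m) memory instead of O(n*m)). Below is a minimal sketch of that idea with made-up example strings; it only tracks the single longest match, so reporting every match of length >= minMatchLength would still need the same "is it part of a longer match" bookkeeping as above.

#include <iostream>
#include <string>
#include <utility>
#include <vector>

int main() {
    std::string refSeq   = "xabcdey";     // stand-in data
    std::string otherSeq = "zabcdeq";

    // prev = row i-1 of the L matrix, curr = row i
    std::vector<std::size_t> prev(otherSeq.length(), 0), curr(otherSeq.length(), 0);
    std::size_t best = 0, bestEndI = 0;

    for (std::size_t i = 0; i < refSeq.length(); ++i) {
        for (std::size_t j = 0; j < otherSeq.length(); ++j) {
            if (refSeq[i] == otherSeq[j]) {
                curr[j] = (i == 0 || j == 0) ? 1 : prev[j - 1] + 1;
                if (curr[j] > best) { best = curr[j]; bestEndI = i; }
            } else {
                curr[j] = 0;
            }
        }
        std::swap(prev, curr);            // the current row becomes the previous one
    }
    std::cout << "longest common substring (length " << best << "): "
              << refSeq.substr(bestEndI - best + 1, best) << "\n";
}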

There is an Intel contest about the same problem.
Maybe they will post some solutions when it's over:
http://software.intel.com/fr-fr/articles/AYC-early2012_home/

Related

SHA256 Find Partial Collision

I have two messages:
messageA: "Frank is one of the "best" students topicId{} "
messageB: "Frank is one of the "top" students topicId{} "
I need to find a partial SHA256 collision of these two messages (the first 8 hex digits). That is, the first 8 hex digits of SHA256(messageA) == the first 8 hex digits of SHA256(messageB).
We can put any letters and numbers in {}, and both {} should contain the same string.
I have tried brute force and a birthday attack with a hash table, but they take too much time. I know about cycle-detection algorithms like Floyd's and Brent's, but I have no idea how to construct the cycle for this problem. Are there any other methods to solve it? Thank you so much!
This is pretty trivial to solve with a birthday attack. Here's how I did it in Python (v2):
def find_collision(ntries):
    from hashlib import sha256
    str1 = 'Frank is one of the "best" students topicId{%d} '
    str2 = 'Frank is one of the "top" students topicId{%d} '
    seen = {}
    for n in xrange(ntries):
        h = sha256(str1 % n).digest()[:4].encode('hex')
        seen[h] = n
    for n in xrange(ntries):
        h = sha256(str2 % n).digest()[:4].encode('hex')
        if h in seen:
            print str1 % seen[h]
            print str2 % n

find_collision(100000)
If your attempt took too long to find a solution, then either you simply made a mistake in your coding somewhere, or you were using the wrong data type.
Python's dictionary type is implemented with a hash table. That means you can look up dictionary elements in (average) constant time. If you implemented seen using a list instead of a dict in the above code, then the membership test "if h in seen" would take an awful lot longer.
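The same consideration applies if you port this to C++: std::unordered_set / std::unordered_map give average constant-time lookups, while a plain container forces a linear scan. A tiny illustration (my own example, not part of the original answer):

#include <algorithm>
#include <iostream>
#include <string>
#include <unordered_set>
#include <vector>

int main() {
    std::vector<std::string> asList = {"aabbccdd", "11223344", "deadbeef"};
    std::unordered_set<std::string> asHash(asList.begin(), asList.end());

    std::string h = "deadbeef";
    // O(n) scan: this is what a list-backed "seen" structure amounts to.
    bool foundList = std::find(asList.begin(), asList.end(), h) != asList.end();
    // Average O(1) hash lookup: this is what a dict / unordered_set gives you.
    bool foundHash = asHash.find(h) != asHash.end();
    std::cout << foundList << " " << foundHash << "\n";   // prints "1 1"
}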
Edit:
If the two topicId tokens have to be identical, then, as pointed out in the comments, there is little option but to grind through somewhere on the order of 2^31 values. You will find a collision eventually, but it could take a long time.
Just leave this running overnight and with a bit of luck you'll have an answer in the morning:
def find_collision():
    from hashlib import sha256
    str1 = 'Frank is one of the "best" students topicId{%x} '
    str2 = 'Frank is one of the "top" students topicId{%x} '
    n = 0
    while True:
        if sha256(str1 % n).digest()[:4] == sha256(str2 % n).digest()[:4]:
            print str1 % n
            print str2 % n
            break
        n += 1

find_collision()
If you're in a hurry, you could maybe look into using a GPU to speed up the hash calculations.
I'm assuming the space at the end of the strings in the question was intentional so I left it in.
"Frank is one of the "top" students topicId{59220691223} "
6026d9b323898bcd7ecdbcbcd575b0a1d9dc22fd9e60074aefcbaade494a50ae
"Frank is one of the "best" students topicId{59220691223} "
6026d9b31ba780bb9973e7cfc8c9f74a35b54448d441a61cc9bf8db0fcae5280
It actually took about 7 billion tries to find one using brute force, a lot more than I expected. 2^32 is roughly 4.3 billion, so the chance of not finding any match after 4.3 billion tries is about 36.78%; after the roughly 7 billion tries it actually took, there was less than a 20% chance of still having no match.
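Those figures can be sanity-checked: each try collides with probability 2^-32, so the chance of no collision in N independent tries is (1 - 2^-32)^N, which is roughly exp(-N/2^32). A quick check (my own snippet, not part of the original answer):

#include <cmath>
#include <cstdio>

int main() {
    const double p = std::ldexp(1.0, -32);      // 2^-32, per-try collision probability
    const double tries1 = 4294967296.0;         // 2^32, "roughly 4.3 billion"
    const double tries2 = 7e9;
    // exp(-1)    ~= 0.3679, matching the ~36.78% figure
    // exp(-1.63) ~= 0.1960, i.e. under 20% after 7 billion tries
    std::printf("P(no match after %.2e tries) = %.4f\n", tries1, std::exp(-tries1 * p));
    std::printf("P(no match after %.2e tries) = %.4f\n", tries2, std::exp(-tries2 * p));
}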
This is the C++ code I used, running on 7 threads. Each thread gets a different starting point and quits once a match is found on any thread. Each thread also reports its progress to cout every 1 million attempts.
I've fast-forwarded to where the match was found on threadId = 5, so it takes less than a minute to run; if you change the starting point you can look for other matches.
I'm not sure either how one would use Floyd's or Brent's algorithm here, since the strings have to use the same topicId, so you are locked in on both the prefix and the suffix.
/*
To compile go get picosha2 header file from https://github.com/okdshin/PicoSHA2
Copy this code into same directory as picosha2.h file, save it as hash.cpp for example.
On Linux go to command line and cd to directory where these files are.
To compile it:
    g++ -O2 -o hash hash.cpp -l pthread
And run it:
    ./hash
*/
#include <iostream>
#include <string>
#include <thread>
#include <mutex>
// I used picoSHA2 header only file for the hashing
// https://github.com/okdshin/PicoSHA2
#include "picosha2.h"

// return 1st 4 bytes (8 chars) of SHA256 hash
std::string hash8(const std::string& src_str) {
    std::vector<unsigned char> hash(picosha2::k_digest_size);
    picosha2::hash256(src_str.begin(), src_str.end(), hash.begin(), hash.end());
    return picosha2::bytes_to_hex_string(hash.begin(), hash.begin() + 4);
}

bool done = false;
std::mutex mtxCout;

void work(unsigned long long threadId) {
    std::string a = "Frank is one of the \"best\" students topicId{",
                b = "Frank is one of the \"top\" students topicId{";
    // Each thread gets a different starting point, I've fast forwarded to the part
    // where I found the match so this won't take long to run if you try it, < 1 minute.
    // If you want to run a while drop the last "+ 150000000ULL" term and it will run
    // for about 1 billion total (150 million each thread, assuming 7 threads), take
    // about 30 minutes on Linux.
    // Collision occurred on threadId = 5, so if you change it to use less than 6 threads
    // then your mileage may vary.
    unsigned long long start = threadId * (11666666667ULL + 147000000ULL) + 150000000ULL;
    unsigned long long x = start;
    for (;;) {
        // Not concerned with making the reading/updating "done" flag atomic, unlikely
        // 2 collisions are found at once on separate threads, and writing to cout
        // is guarded anyway.
        if (done) return;
        std::string xs = std::to_string(x++);
        std::string hashA = hash8(a + xs + "} "), hashB = hash8(b + xs + "} ");
        if (hashA == hashB) {
            std::lock_guard<std::mutex> lock(mtxCout);
            std::cout << "*** SOLVED ***" << std::endl;
            std::cout << (x-1) << std::endl;
            std::cout << "\"" << a << (x - 1) << "} \" = " << hashA << std::endl;
            std::cout << "\"" << b << (x - 1) << "} \" = " << hashB << std::endl;
            done = true;
            return;
        }
        if (((x - start) % 1000000ULL) == 0) {
            std::lock_guard<std::mutex> lock(mtxCout);
            std::cout << "thread: " << threadId << " = " << (x-start)
                      << " tries so far" << std::endl;
        }
    }
}

void runBruteForce() {
    const int NUM_THREADS = 7;
    std::thread threads[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++) threads[i] = std::thread(work, i);
    for (int i = 0; i < NUM_THREADS; i++) threads[i].join();
}

int main(int argc, char** argv) {
    runBruteForce();
    return 0;
}

How can I make this modification to Dijkstra's algorithm more efficient?

The problem is a part of my computer science homework. The homework includes 5 different types of students that travel through a given weighted undirected node graph where each student has a different method. The fifth student is the most difficult one and I haven't been able to implement it efficiently.
The fifth student has a secret power: (s)he can teleport between adjacent nodes, so it takes 0 time to travel between them. However, to recharge that secret power, (s)he needs to pass two edges, and (s)he does not have the power at the beginning of the journey. Unlike the other four students, (s)he can use the same edge multiple times, so in the first move (s)he may go N_1->N_2 and N_2->N_1 to recharge the secret power. (S)he cannot store the power and must use it right away after gaining it.
The fifth student wants to know the shortest time to reach the summit. At the start, (s)he does not have any power, so (s)he needs to pass two edges to recharge it.
The method I tried is a modification of Dijkstra's algorithm: instead of moving node by node, from one node it calculates all three possible jumps, only counting the weights of the first two. It considers all cases, such as going to a node and coming back to recharge the power and then jumping over a heavily weighted edge. It does work and I get all the correct answers for the given test cases, but it is SLOW. We are under a two-second limit, and right now my algorithm takes around 4 seconds for test cases with 50,000 nodes and 100,000 edges.
I'm guessing the problem is in reaching the neighbours, since there are 3 nested for loops to reach all possible neighbours 3 jumps away (while also being able to use the same edges more than once), which basically makes this O(n^3) (but I'm not great with big-O notation, so I'm not sure it's actually that).
Does anyone have any ideas to make this algorithm more efficient, or a different algorithm that isn't so slow?
Any help is appreciated!
Here's the code if it's of any help.
long long int HelpStudents::fifthStudent() {
    auto start = std::chrono::system_clock::now();
    set< pair<long long int,int> > setds;
    vector<long long int> dist(totalNodes+15, std::numeric_limits<long long int>::max());
    setds.insert(make_pair(0,1));
    dist[1] = 0;
    bool change = false;
    int counter = 0; // these variables were just for checking some things
    int max_counter = 0;
    int changed_summit = 0;
    int operations_after_last_change = 0;
    int w1;
    int w2;
    int db = 0;
    vector<int> neighbors;
    vector<int> neighbors2;
    vector<int> neighbors3;
    int u;
    while(!setds.empty()){
        pair<long long int,int> tmp = *(setds.begin());
        setds.erase(setds.begin());
        u = tmp.second; // vertex label
        if(dist[u] > dist[summit_no]){
            continue;
        }
        if(!change){
            counter++;
        }else{
            counter = 0; // debugging stuff
        }
        db++;
        //cout << db2 << endl;
        operations_after_last_change++;
        max_counter = max(counter, max_counter);
        //cout << "counter: " << counter << endl;
        change = false;
        neighbors = adjacency_map[u]; // adjacency map holds a vector which contains the adjacent nodes for the given key
        //cout << "processing: " << "(" << tmp.first << "," << tmp.second << ") " << endl;
        for(int nb : neighbors){
            w1 = getWeight(u, nb); // this is one jump, nb is the neighbor
            neighbors2 = adjacency_map[nb];
            //cout << "\t->" << nb << endl;
            if(nb == summit_no){
                if(dist[nb] > dist[u] + (w1)){
                    auto t = setds.find(make_pair(dist[nb], nb));
                    if(t != setds.end()){
                        setds.erase(t);
                    }
                    dist[nb] = dist[u] + (w1);
                    change = true;
                    changed_summit++;
                    operations_after_last_change = 0;
                    //cout << "changed summit to " << (dist[u] + (w1)) << endl;
                    //continue;
                }
            }
            for(int nb2 : neighbors2){ // second jump
                w2 = getWeight(nb, nb2);
                //cout << "\t\t->" << nb2 << endl;
                if(nb2 == summit_no){
                    if(dist[nb2] > dist[u] + (w1+w2)){
                        auto t = setds.find(make_pair(dist[nb2], nb2));
                        if(t != setds.end()){
                            setds.erase(t);
                        }
                        dist[nb2] = dist[u] + (w1+w2);
                        change = true;
                        changed_summit++;
                        operations_after_last_change = 0;
                        //cout << "changed summit to " << (dist[u] + (w1+w2)) << endl;
                        //continue;
                    }
                }
                neighbors3 = adjacency_map[nb2];
                for(int nbf : neighbors3){ // third jump, no weight
                    //cout << "\t\t\t->" << nbf;
                    if(dist[nbf] > dist[u] + (w1+w2)){
                        auto t = setds.find(make_pair(dist[nbf], nbf));
                        if(t != setds.end()) {
                            setds.erase(t);
                        }
                        change = true;
                        dist[nbf] = dist[u] + (w1+w2);
                        if(nbf == summit_no){
                            changed_summit++;
                            operations_after_last_change = 0;
                            //cout << endl;
                        }else{
                            setds.insert(make_pair(dist[nbf], nbf));
                            //cout << "\t\t\t\t inserted (" << dist[nbf] << "," << nbf << ")" << endl;
                        }
                        //cout << "changed " << nbf << " to " << (dist[u] + (w1+w2)) << "; path: " << u << " -> " << nb << " -> " << nb2 << " -> " << nbf << endl;
                        //setds.insert(make_pair(dist[nbf],nbf));
                    }else{
                        //cout << endl;
                    }
                }
            }
        }
    }
    auto end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end - start;
    cout << "time passed: " << elapsed_seconds.count() << " total loop: " << db << endl;
    return dist[summit_no];
}
You make (or, more likely, just imagine) a new directed graph with a node for each unique situation/state that student 5 can be in, that is, each combination of an original graph node and a charge state (0, 1, or 2). Because there are 3 charge states, this graph has 3 times as many nodes as the original.
Then you run a perfectly ordinary Dijkstra's algorithm on this new graph.
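For illustration, here is a minimal sketch of that state-graph idea. It assumes an adjacency list adj[u] of (neighbour, weight) pairs and made-up names (fifthStudentStateDijkstra, start, summit); it is not the homework's actual API. Charge state 0 or 1 means the next edge costs its weight, charge state 2 means the next edge is free and the charge resets.

#include <algorithm>
#include <array>
#include <functional>
#include <iostream>
#include <limits>
#include <queue>
#include <utility>
#include <vector>

using Dist = long long;
const Dist INF = std::numeric_limits<Dist>::max();

// adj[u] holds (v, w) for every undirected edge {u, v} with weight w.
Dist fifthStudentStateDijkstra(const std::vector<std::vector<std::pair<int,int>>>& adj,
                               int start, int summit) {
    const int n = (int)adj.size();
    // dist[u][c] = best time to reach node u with charge state c (0, 1 or 2).
    std::array<Dist,3> init{INF, INF, INF};
    std::vector<std::array<Dist,3>> dist(n, init);
    using Item = std::pair<Dist, std::pair<int,int>>;   // (distance, (node, charge))
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> pq;
    dist[start][0] = 0;
    pq.push({0, {start, 0}});
    while (!pq.empty()) {
        auto [d, state] = pq.top(); pq.pop();
        auto [u, c] = state;
        if (d != dist[u][c]) continue;                  // stale queue entry
        for (auto [v, w] : adj[u]) {
            Dist cost = (c == 2) ? 0 : w;               // fully charged: this move is free
            int  nc   = (c == 2) ? 0 : c + 1;           // the free move uses up the charge
            if (d + cost < dist[v][nc]) {
                dist[v][nc] = d + cost;
                pq.push({dist[v][nc], {v, nc}});
            }
        }
    }
    // The summit may be reached in any charge state.
    return std::min({dist[summit][0], dist[summit][1], dist[summit][2]});
}

int main() {
    // Tiny example: 4 nodes (0..3), start at 0, summit is 3.
    std::vector<std::vector<std::pair<int,int>>> adj(4);
    auto addEdge = [&](int a, int b, int w) {
        adj[a].push_back({b, w});
        adj[b].push_back({a, w});
    };
    addEdge(0, 1, 1);
    addEdge(1, 2, 1);
    addEdge(2, 3, 100);   // the expensive edge gets teleported across
    std::cout << fifthStudentStateDijkstra(adj, 0, 3) << "\n";   // prints 2
}

That is at most 3*N states and 6*M directed transitions (each undirected edge appears in both directions in each of the 3 charge states), so the whole thing is a single O((N + M) log N) Dijkstra run instead of the three-level neighbour expansion.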

c++ vector sorting with row and first column

I have a 2D vector of size 10*100, with 1000 integers in total.
I want to sort the numbers in each row, so that after sorting the first column contains the smallest number of every row.
Then I want to find the smallest value in that first column and store it in a new vector, and at the same time erase this smallest number from its row so that the remaining numbers move forward, and keep going this way. But my code always has problems; I am a beginner and I cannot fix it even after trying more than 20 times.
Please help me out!!!
vector<int> final;
vector<int> firstcol;
for (int j=0; j<vec.size(); j++) {
    firstcol.push_back(vec[j][0]);
    cout << firstcol[j] << endl;
}
int mini = *firstcol.begin();
for(int k=0; k<firstcol.size(); k++){
    while (firstcol[k] < mini) {
        mini = firstcol[k];
        final.push_back(mini);
    }
    vec[k].erase(vec[k].begin());
}
cout << "mini:" << mini << endl;
for (int m=0; m<final.size(); m++) {
    cout << final[m] << endl;
}
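If I've understood the goal correctly, this is essentially a k-way merge of the sorted rows: sort each row, then repeatedly take the smallest leading element over all rows, record it, and erase it from its row. A small self-contained sketch with stand-in data (the 10*100 vec from the question would drop in the same way):

#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    std::vector<std::vector<int>> vec = {
        {5, 3, 9}, {4, 1, 7}, {8, 2, 6}           // small stand-in for the 10x100 data
    };
    // 1. sort every row, so each row's front is its current minimum
    for (auto& row : vec) std::sort(row.begin(), row.end());

    // 2. repeatedly pull the smallest leading element across all rows
    std::vector<int> final_order;
    for (;;) {
        int bestRow = -1;
        for (std::size_t r = 0; r < vec.size(); ++r) {
            if (vec[r].empty()) continue;
            if (bestRow == -1 || vec[r].front() < vec[bestRow].front()) bestRow = (int)r;
        }
        if (bestRow == -1) break;                 // every row is exhausted
        final_order.push_back(vec[bestRow].front());
        vec[bestRow].erase(vec[bestRow].begin()); // remaining numbers move forward
    }
    for (int x : final_order) std::cout << x << ' ';
    std::cout << '\n';                            // prints 1 2 3 4 5 6 7 8 9
}

Erasing from the front of a vector is linear in the row length; for 10*100 data that does not matter, but keeping a per-row index instead of erasing would avoid it.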

Why is problemsPathCoordinate vector of a vector of a pair empty?

I have tried to fill a smaller vector of a vector of pairs with some contents from a bigger vector of a vector of pairs without success. Below is the relevant code with couts and their output. Hopefully this is detailed enough.
/*******************Problems Occur*****************/
int iFirst = problemsStartAt;  // first index to copy
int iLast  = problemsEndAt-1;  // last index -1, 11th stays
int iLen   = iLast-iFirst;     // 10-8=2
//if(problemsStartAt!=0)                    // I.a
if(problemsStartAt!=0 && problemsEndAt!=0)  // I.b
{
    v_problem_temp = allPathCoordinates[problemsStartAt];
    cout << "266:" << v_problem_temp.size() << endl;
    cout << "267:" << allPathCoordinates.at(1).size() << endl;
    for(vector<pair<int,int>>::iterator it2 = v_problem_temp.begin();
        it2 != v_problem_temp.end();
        ++it2)
    {
        apair = *it2;
        point[apair.first][apair.second] = Yellow;
        cout << apair.first << "," << apair.second << endl;
    }
    problemsPathCoordinate.resize(iLen);
    cout << "iLen*sizeof(problemsPathCoordinate):" << iLen*sizeof(problemsPathCoordinate) << endl;
    memcpy(&problemsPathCoordinate[0], &allPathCoordinates[iFirst], iLen*sizeof(problemsPathCoordinate));
    cout << "279:problemsPathCoordinate.size():" << problemsPathCoordinate.size() << endl;
    problemsPathCoordinate.resize(iLen);
    memcpy(&problemsPathCoordinate[0], &allPathCoordinates[iFirst], iLen*sizeof(problemsPathCoordinate));
    cout << "283:problemsPathCoordinate.size():" << problemsPathCoordinate[0].size() << endl;
    cout << "284:problemsPathCoordinate.size():" << problemsPathCoordinate[1].size() << endl;
    cout << "286:allPathCoordinates.size():" << allPathCoordinates.size() << endl;
    cout << "287:allPathCoordinates.size():" << allPathCoordinates.size() << endl;
    // from http://stackoverflow.com/questions/35265577/c-reverse-a-smaller-range-in-a-vector
}
Output:
759: path NOT full-filled, number: 8
755: Problems START here at:8
759: path NOT full-filled, number: 9
700: Problems END here at: 11
266:0
267:0
iLen*sizeof(problemsPathCoordinate):72
279:problemsPathCoordinate.size():3
283:problemsPathCoordinate.size():0
284:problemsPathCoordinate.size():0
286:allPathCoordinates.size():79512
287:allPathCoordinates.size():79512
time:39 seconds
Why are the three problemsPathCoordinate elements empty, and how can I fix it?
Bo
for (vector<vector<pair<int,int> > >::iterator it = allPathCoordinates.begin(); it != allPathCoordinates.end(); ++it)
{
    allPathCoordinates.erase(allPathCoordinates.begin()+5, allPathCoordinates.end()-2);
    v_temp = *it;
    //cout << "v_temp.size():" << v_temp.size() << endl;
    for (vector<pair<int,int> >::iterator it2 = v_temp.begin(); it2 != v_temp.end(); ++it2) {
        //v_temp.erase(v_temp.begin()+2);
        apair = *it2;
        //cout << "(" << apair.first << "," << apair.second << ") ; ";
        openPoints[apair.first][apair.second] = 0;
        closedPoints[apair.first][apair.second] = 1;
        allObstacles[apair.first][apair.second] = Wall;
        point[apair.first][apair.second] = Yellow;
    }
}
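Two things stand out in the output. First, lines 266/267 show that the source elements at those indices are themselves empty, so there may be nothing to copy in the first place. Second, memcpy is the wrong tool here: it copies the raw bytes of the std::vector objects (their internal pointers and sizes), not the pair data they own, which is undefined behaviour and leaves the destination in a broken state; note also that iLen*sizeof(problemsPathCoordinate) is iLen times the size of the vector object itself, not of the data. Copy the elements instead, for example with the range constructor or assign. A small self-contained sketch with stand-in data, using the question's [iFirst, iLast) convention:

#include <iostream>
#include <utility>
#include <vector>

int main() {
    using Path = std::vector<std::pair<int,int>>;
    std::vector<Path> allPathCoordinates(12, Path{{1, 2}, {3, 4}});   // stand-in data
    int iFirst = 8, iLast = 11;                                       // copy [iFirst, iLast)

    // Element-wise copy: each inner vector is deep-copied properly.
    std::vector<Path> problemsPathCoordinate(allPathCoordinates.begin() + iFirst,
                                             allPathCoordinates.begin() + iLast);

    std::cout << problemsPathCoordinate.size() << " paths, first one holds "
              << problemsPathCoordinate[0].size() << " pairs\n";      // 3 paths, 2 pairs
}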

Ternary search recursion isn't correct

I learned about ternary search from Wikipedia. I'm not sure what they mean by the parameter absolute precision. They didn't elaborate. But here is the pseudocode:
def ternarySearch(f, left, right, absolutePrecision):
    # left and right are the current bounds; the maximum is between them
    if (right - left) < absolutePrecision:
        return (left + right)/2
    leftThird = (2*left + right)/3
    rightThird = (left + 2*right)/3
    if f(leftThird) < f(rightThird):
        return ternarySearch(f, leftThird, right, absolutePrecision)
    return ternarySearch(f, left, rightThird, absolutePrecision)
I want to find the max value of a unimodal function. That means I want to print the boundary point between the increasing and the decreasing sequence. If the sequence is
1 2 3 4 5 -1 -2 -3 -4
then I want to print 5 as the output.
Here is my attempt. It isn't giving any output. Can you please help, or give me a link to a good tutorial on ternary search for self-learning?
#include <iostream>
using namespace std;

int ternary_search(int[], int, int, int);
int precval = 1;

int main()
{
    int n, arr[100], target;
    cout << "\t\t\tTernary Search\n\n" << endl;
    //cout << "This program will find max element in an unidomal array." << endl;
    cout << "How many integers: ";
    cin >> n;
    for (int i=0; i<n; i++)
        cin >> arr[i];
    cout << endl << "The max number in the array is: ";
    int res = ternary_search(arr, 0, n-1, precval) + 0;
    cout << res << endl;
    return 0;
}

int ternary_search(int arr[], int left, int right, int precval)
{
    if (right-left <= precval)
        return (arr[right] > arr[left]) ? arr[right] : arr[left];
    int first_third = (left * 2 + right) / 3;
    int last_third  = (left + right * 2) / 3;
    if (arr[first_third] < arr[last_third])
        return ternary_search(arr, first_third, right, precval);
    else
        return ternary_search(arr, left, last_third, precval);
}
Thank you in advance.
Absolute precision means the maximum error between the returned result and the true result, i.e. max |returned_result - true_result|. In that context, f is a continuous function.
Since you are looking at a discrete function, you can't do much better than getting to the point where right - left <= 1. Then just compare the two resulting values and return the larger one (since you're looking for the max).
EDIT
The first partition point, being mathematically 2/3*left + right/3, should be discretized to ceil(2/3*left + right/3), so that the relationship left < first_third <= last_third < right holds.
So first_third = (left*2 + right)/3 should be changed to first_third = (left*2 + right + 2)/3.
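Here is a sketch of the discrete search with both adjustments applied (stop when right - left <= 1, and round the first partition point up). It is written iteratively for simplicity and is illustrative, not the original assignment code:

#include <algorithm>
#include <iostream>
#include <vector>

// Returns the maximum of a unimodal array (increasing, then decreasing).
// first_third is rounded up so that left < first_third <= last_third < right
// always holds and the interval keeps shrinking.
int ternary_search_max(const std::vector<int>& arr, int left, int right) {
    while (right - left > 1) {
        int first_third = (2 * left + right + 2) / 3;  // ceil((2*left + right)/3)
        int last_third  = (left + 2 * right) / 3;
        if (arr[first_third] < arr[last_third])
            left = first_third;
        else
            right = last_third;
    }
    return std::max(arr[left], arr[right]);
}

int main() {
    std::vector<int> a{1, 2, 3, 4, 5, -1, -2, -3, -4};
    std::cout << ternary_search_max(a, 0, (int)a.size() - 1) << "\n";  // prints 5
}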
Try Golden Section search (or Fibonacci search for discrete functions).
It has a smaller number of recursions AND a 50% reduction in evaluations of f, compared to the above ternary search.
