Fastest way to index large sorted hash file - bash

I am building a file-based index for the sorted haveibeenpwned passwords text file, and it got me wondering: what's the fastest way to do this?
I figured a good way to build a quickly grep-able index would be to split the sorted file into 256 files named with the first two hex digits (i.e. FF.txt, FE.txt, etc.). I found ripgrep (rg) to be about 5 times faster than grep on my computer. So I tried something like this:
for i in {255..0}
do
    start=$(date +%s)
    hex="$(printf '%02x' "$i" | tr '[:lower:]' '[:upper:]')"
    rg "^$hex" pwned-passwords-ntlm-ordered-by-hash-v4.txt > "ntlm/$hex-ntlm.txt"
    echo "0x$hex completed in $(($(date +%s) - $start)) seconds"
done
This is the fastest solution I could come up with. ripgrep is able to create each file in 25 seconds, so I'm looking at about 100 minutes to create this index. When I split the job in half and run the halves in parallel, each pair of files is created in 80 seconds, so it seems best to just let ripgrep work its magic and work in series.
Obviously, I won't be indexing this list too often, but it's just fun to think about. Any thoughts on a faster way (aside from using a database) to index this file?

ripgrep, like any other tool that's able to work with unsorted input files at all, is the wrong tool for this job. When you're trying to grep sorted inputs, you want something that can bisect your input file to find a position in logarithmic time. For big enough inputs, even a slow O(log n) implementation will be faster than a highly optimized O(n) one.
pts-line-bisect is one such tool, though of course you're also welcome to write your own. You'll need to write it in a language with full access to the seek() syscall, which is not exposed in bash.
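To make the idea concrete, here is a minimal Python sketch of that kind of seek()-based bisection (the function and path are placeholders, and this is an illustration of the technique, not pts-line-bisect itself): seek to the midpoint, discard the partial line you landed in, and compare the next full line against the prefix you're looking for.

def prefix_in_sorted_file(path, prefix):
    """Return True if some line of the byte-sorted file starts with prefix."""
    needle = prefix.encode()
    with open(path, 'rb') as f:
        lo, hi = 0, f.seek(0, 2)              # hi = file size in bytes
        while lo < hi:
            mid = (lo + hi) // 2
            if mid == 0:
                f.seek(0)
            else:
                f.seek(mid - 1)
                f.readline()                  # skip to the start of the next full line
            line = f.readline()
            if line.startswith(needle):
                return True
            if not line or line[:len(needle)] > needle:
                hi = mid                      # a match, if any, lies before mid
            else:
                lo = mid + 1                  # a match, if any, lies after mid
        return False

You would call it with the path to the sorted file and an uppercase hash or hash prefix; each lookup touches only a few dozen small reads instead of scanning the whole file.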

You are reading through the file 256 times, doing a full file scan every time. Consider an approach that reads the file once, writing each line into an open file descriptor. I'm thinking Python would be an easy choice of implementation (if that's your thing). You could optimize by keeping the output file open until you hit a new hex code at the beginning of the line. If you want to be even more clever, there is no need to go through the sorted file line by line: based on Charles Duffy's hint, you could create a heuristic for sampling the file (using seek()) to get to the next hex value. Once the program has found the byte offset of the next hex value, the whole block of bytes can be written to the new file. However, since this is tagged 'bash', let's keep the solution in that domain (a rough Python sketch of the one-pass approach follows the loop below):
while read -r line
do
    hex=${line:0:2}
    echo "$line" >> "ntlm/$hex-ntlm.txt"
done < pwned-passwords-ntlm-ordered-by-hash-v4.txt
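And here is a minimal sketch of the one-pass Python version hinted at above. It keeps the current output file open until the two-character prefix changes; the ntlm/XX-ntlm.txt naming follows the question, and it assumes the ntlm/ directory already exists.

# One pass over the sorted file, switching output files only when the
# two-character hex prefix at the start of the line changes.
out = None
current = None
with open('pwned-passwords-ntlm-ordered-by-hash-v4.txt') as src:
    for line in src:
        prefix = line[:2]
        if prefix != current:
            if out:
                out.close()
            current = prefix
            out = open('ntlm/%s-ntlm.txt' % prefix, 'w')
        out.write(line)
if out:
    out.close()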

I wrote a Python 3 script that does fast binary-search lookups in the hash file without having to create an index. It doesn't directly address your question (indexing), but it probably solves the underlying problem that you wanted to solve with an index: quickly looking up individual hashes. This script checks hundreds of passwords in seconds.
import argparse
import hashlib

parser = argparse.ArgumentParser(description='Searches passwords in https://haveibeenpwned.com/Passwords database.')
parser.add_argument('passwords', metavar='TEST', type=str, help='text file with passwords to test, one per line, utf-8')
parser.add_argument('database', metavar='DATABASE', type=str, help='the downloaded text file with sha-1:count')
args = parser.parse_args()


def search(f: object, pattern: str) -> str:
    def bisect(left: int, right: int) -> str:
        if left >= right:
            return None
        middle = (left + right) // 2
        if middle == 0:
            f.seek(0, 0)
            test = f.readline()
        else:
            # Seek just before the midpoint and discard the partial line,
            # so the next readline() starts on a line boundary.
            f.seek(middle - 1, 0)
            _ = f.readline()
            test = f.readline()
        if test.upper().startswith(pattern):
            return test
        elif left == middle:
            return None
        elif pattern < test:
            return bisect(left, middle)
        else:
            return bisect(middle, right)

    f.seek(0, 2)                  # jump to the end of the file to learn its size
    return bisect(0, f.tell())


fsource = open(args.passwords)
fdatabase = open(args.database)
source_lines = fsource.readlines()
for l in source_lines:
    line = l.strip()
    hash_object = hashlib.sha1(line.encode("utf-8"))
    pattern = hash_object.hexdigest().upper()
    print("%s:%s" % (line, str(search(fdatabase, pattern)).strip()))
fsource.close()
fdatabase.close()
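For reference, if the script is saved as, say, check_pwned.py (the name is just an example), it would be run as: python3 check_pwned.py my-passwords.txt pwned-passwords-sha1-ordered-by-hash-v4.txt, where the second argument is the downloaded sha-1:count file.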

Related

How can I find both identical and similar strings in a particular field in a text file in Linux?

My apologies ahead of time - I'm not sure that there is an answer for this one using only Linux command-line fu. Please note I am not a programmer, but I have been playing around with bash and python a bit over the last few years.
I have a large text file with rows and columns that resemble the following (note - fields are separated with tabs):
1074 Beetle OOB11061MNH 12/22/16 Confirmed
3430 Hightop 0817BESTYET 08/07/17 Queued
3431 Hightop 0817BESTYET 08/07/17 Queued
3078 Copland 2017GENERAL 07/07/17 Confirmed
3890 Bartok FOODS 09/11/17 Confirmed
5440 Alphapha 00B1106IMNH 01/09/18 Queued
What I want to do is find and output only those rows where the third field is either identical OR similar to another in the list. I don't really care whether the other fields are similar or not, but they should all be included in the output. By similar, I mean no more than [n] characters are different in that particular field (for example, no more than 3 characters are different). So the output I would want would be:
1074 Beetle OOB11061MNH 12/22/16 Confirmed
3430 Hightop 0817BESTYET 08/07/17 Queued
3431 Hightop 0817BESTYET 08/07/17 Queued
5440 Alphapha 00B1106IMNH 01/09/18 Queued
The line beginning 1074 has a third field that differs by 3 characters with 5440, so both of them are included. 3430 and 3431 are included because they are exactly identical. 3078 and 3890 are eliminated because they are not similar.
Through googling the forums I've managed to piece together this rather longish pipeline to be able to find all of the instances where field 3 is exactly identical:
cat inputfile.txt | awk 'BEGIN { OFS=FS="\t" } {if (count[$3] > 1) print $0; else if (count[$3] == 1) { print save[$3]; print $0; } else save[$3] = $0; count[$3]++; }' > outputfile.txt
I must confess I don't really understand awk all that well; I'm just copying and adapting from the web. But that seemed to work great at finding exact duplicates (i.e., it would output only 3430 and 3431 above). But I have no idea how to approach trying to find strings that are not identical but that differ in no more than 3 places.
For instance, in my example above, it should match 1074 and 5440 because they would both fit the pattern:
??B1106?MNH
But I would want it to be able to match also any other random pattern of matches, as long as there are no more than three differences, like this:
20?7G?N?RAL
These differences could be arbitrarily in any position.
The reason for needing this is we are trying to find a way to automatically find typographical errors in a serial-number-like field. There might be a mis-key, or perhaps a letter "O" replaced with a number "0", or the like.
So... any ideas? Thanks for the help!
You can use this script:
$ more hamming.awk
function hamming(x,y,xs,ys,min,max,h) {
    if(x==y) return 0;
    else {
        nx=split(x,xs,"");
        mx=split(y,ys,"");
        min=nx<mx?nx:mx;
        max=nx<mx?mx:nx;
        for(i=1;i<=min;i++) if(xs[i]!=ys[i]) h++;
        return h+(max-min);
    }
}
BEGIN {FS=OFS="\t"}
NR==FNR {
    if($3 in a) nrs[NR];
    for(k in a)
        if(hamming(k,$3)<4) {
            nrs[NR];
            nrs[a[k]];
        }
    a[$3]=NR;
    next
}
FNR in nrs
Usage:
$ awk -f hamming.awk file{,}
It's a double-scan algorithm that computes the Hamming distance (the one you described) between keys. Note that it's an O(n^2) algorithm, so it may not be suitable for very large data sets. However, I'm not sure any other algorithm can do better.
NB: an additional note, based on a comment on the post which I had missed: this algorithm compares the keys character by character, so displacements won't be identified. For example, 123 and 23 will give a distance of 3.
Levenshtein distance, aka "edit distance", suits your task best. The Perl script below requires installing the module Text::Levenshtein (for Debian/Ubuntu: sudo apt install libtext-levenshtein-perl).
use Text::Levenshtein qw(distance);
$maxdist = shift;
@ll = (<>);
@k = map {
    $k = (split /\t/, $_)[2];
    # $k =~ s/O/0/g;
    $k;
} @ll;
for ($i = 0; $i < @ll; ++$i) {
    for ($j = 0; $j < @ll; ++$j) {
        if ($i != $j and distance($k[$i], $k[$j]) <= $maxdist) {
            print $ll[$i];
            last;
        }
    }
}
Usage:
perl lev.pl 3 inputfile.txt > outputfile.txt
The algorithm is the same O(n^2) as in karakfa's answer, but the matching is more flexible.
Also note the commented-out line # $k =~ s/O/0/g;. If you uncomment it, all O's in the keys will become 0's, which will fix keys damaged by an O->0 substitution. When working with damaged data I always use small rules like this to fix the data gradually, refining the rules from run to run, to the point where the data is almost perfect and fuzzy matching is no longer needed.
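If you'd rather stay in Python, the same O(n^2) pairwise filter is easy to sketch with a small dynamic-programming Levenshtein function. The file argument, tab-separated third field, and threshold below mirror the question, but treat this as an illustration rather than a drop-in script.

import sys

def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

maxdist = 3
lines = open(sys.argv[1]).readlines()            # e.g. inputfile.txt
keys = [l.split('\t')[2] for l in lines]
for i, line in enumerate(lines):
    if any(i != j and levenshtein(keys[i], k) <= maxdist
           for j, k in enumerate(keys)):
        print(line, end='')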

How to see if a string exists in a huge (>19GB) sorted file?

I have files that can be 19 GB or greater; they will be huge but sorted. Can I use the fact that they are sorted to my advantage when searching to see if a certain string exists?
I looked at something called sgrep, but I'm not sure if it's what I'm looking for. An example: I will have a 19 GB text file with millions of rows of
ABCDEFG,1234,Jan 21,stackoverflow
and I want to search just the first column of these millions of rows to see if ABCDEFG exists in this huge text file.
Is there a more efficient way than just grepping this file for the string and seeing if a result comes back? I don't even need the line; I just need a boolean, true/false, whether it is in this file.
Actually, sgrep is what I was looking for. The reason I got confused is that "structured grep" has the same name as "sorted grep", and I was installing the wrong package. sgrep is amazing.
I don't know if there are any utilities that would help you out of the box, but it would be pretty straightforward to write an application specific to your problem. A binary search would work well, and should yield your result within 20-30 queries against the file.
Let's say your lines are never more than 100 characters, and the file is B bytes long.
Do something like this in your favorite language:
sub file_has_line(file, target) {
    a = 0
    z = file.length
    while (a < z) {
        m = (a+z)/2
        chunk = file.read(m, 200)
        // That is, read 200 bytes, starting at offset m.
        line = chunk.split(/\n/)[1]
        // Split the chunk on newlines, throw away the partial first piece,
        // and keep the first complete line.
        if line < target
            a = m + 1
        else
            z = m - 1
    }
    return (line == target)
}
If you're only doing a single lookup, this will dramatically speed up your program. Instead of reading ~20 GB, you'll be reading ~20 KB of data.
You could try to optimize this a bit by extrapolating that "Xerox" is going to be at 98% of the file and starting the midpoint there...but unless your need for optimization is quite extreme, you really won't see much difference. The binary search will get you that close within 4 or 5 passes, anyway.
If you're doing lots of lookups (I just saw your comment that you will be), I would look to pump all that data into a database where you can query at will.
So if you're doing 100,000 lookups, but this is a one-and-done process where having it in a database has no ongoing value, you could take another approach...
Sort your list of targets, to match the sort order of the log file. Then walk through each in parallel. You'll still end up reading the entire 20 GB file, but you'll only have to do it once and then you'll have all your answers. Something like this:
sub file_has_lines(file, target_array) {
    target_array = target_array.sort
    target = target_array.shift()
    line = file.readln()
    hits = []
    do {
        if line < target
            line = file.readln()
        elsif line > target
            target = target_array.shift()
        elsif line == target
            hits.push(line)
            line = file.readln()
    } while not file.eof()
    return hits
}
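Here's a rough, runnable version of that merge-style walk in Python (the path, the comma as field separator, and the function name are assumptions based on the sample line in the question):

def file_has_lines(path, targets):
    # Sorted targets plus a sorted file let us advance two cursors in one pass.
    targets = sorted(targets)
    hits = set()
    t = 0
    with open(path) as f:
        for line in f:
            key = line.split(',', 1)[0]
            # Skip past any targets that sort before the current key.
            while t < len(targets) and targets[t] < key:
                t += 1
            if t == len(targets):
                break                      # no targets left to look for
            if targets[t] == key:
                hits.add(key)
    return hits

# Example (file name and keys are placeholders):
# found = file_has_lines('big-sorted.csv', ['ABCDEFG', 'ZZZZZZZ'])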

File with random data but specific size

I am trying to generate a file in ruby that has a specific size. The content doesn't matter.
Here is what I got so far (and it works!):
File.open("done/#{NAME}.txt", 'w') do |f|
contents = "x" * (1024*1024)
SIZE.to_i.times { f.write(contents) }
end
The problem is: once I zip or rar this file, the created archive is only a few KB in size. I guess that's because the data in the file is so repetitive that it compresses extremely well.
How do I create data that is more random, as if it were a normal file (for example, a movie file)? To be specific: how do I create a file with random data that keeps its size when archived?
You cannot guarantee an exact file size when compressing. However, as you suggest in the question, completely random data does not compress.
You can generate a random String using most random number generators. Even simple ones are capable of making hard-to-compress data, but you would have to write your own string-creation code. Luckily for you, Ruby comes with a built-in library that already has a convenient byte-generating method, and you can use it in a variation of your code:
require 'securerandom'
one_megabyte = 2 ** 20 # or 1024 * 1024, if you prefer
# Note use 'wb' mode to prevent problems with character encoding
File.open("done/#{NAME}.txt", 'wb') do |f|
SIZE.to_i.times { f.write( SecureRandom.random_bytes( one_megabyte ) ) }
end
This file is not going to compress much, if at all. Many compressors will detect that and just store the file as-is (making a .zip or .rar file slightly larger than the original).
For a given string size N and compression method c (e.g., from the rubyzip, libarchive or seven_zip_ruby gems), you want to find a string str such that:
str.size == c(str).size == N
I'm doubtful that you can be assured of finding such a string, but here's a way that should come close:
Step 0: Select a number m such that m > N.
Step 1: Generate a random string s with m characters.
Step 2: Compute str = c(s). If str.size <= N, increase m and repeat Step 1; else go to Step 3.
Step 3: Return str[0,N].
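As a rough illustration of those steps in Python (zlib stands in for whatever compression method c you actually use, and os.urandom is the random source; both are assumptions, not part of the original recipe):

import os
import zlib

def hard_to_compress(n):
    m = 2 * n                      # Step 0: pick some m > N
    while True:
        s = os.urandom(m)          # Step 1: random string of m bytes
        out = zlib.compress(s)     # compress it with c
        if len(out) <= n:          # Step 2: too short, so grow m and retry
            m *= 2
            continue
        return out[:n]             # Step 3: keep the first N bytes

In practice the os.urandom output is already essentially incompressible, so this mostly restates the recipe rather than improving on the SecureRandom answer above.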

Ruby String/Array Write program

For a project that I am working on for school, one of the parts of the project asks us to take a collection of all the Federalist papers and run it through a program that essentially splits up the text and writes new files (per different Federalist paper).
The logic I decided to go with is to run a search, and every time the search is positive for "Federalist No." it would save into a new file everything until the next "Federalist No".
This is the algorithm that I have so far:
file_name = "Federalist"
section_number = "1"
new_text = File.open(file_name + section_number, 'w')
i = 0
n= 1
while i < l.length
if (l[i]!= "federalist") and (l[i+1]!= "No")
new_text.puts l[i]
i = i + i
else
new_text.close
section_number = (section_number.to_i +1).to_s
new_text = File.open(file_name + section_number, "w")
new_text.puts(l[i])
new_text.puts(l[i+1])
i=i+2
end
end
After debugging the code as much as I could (I am a beginner at Ruby), the problem that I run into now is that the while condition always holds true, so it never proceeds to the else branch.
In terms of going about this in a different way, my TA suggested the following:
Put the entire text in one string by looping through the array(l) and adding each line to the one big string each time.
Split the string using the split method and the key word "FEDERALIST No." This will create an array with each element being one section of the text:
arrayName = bigString.split("FEDERALIST No.")
You can then loop through this new array to create files for each element using a similar method you use in your program.
But as simple as it may sound, I'm having an extremely difficult time putting even that code together.
i = i + i
i starts at 0, and 0 gets added to it, which gives 0, which will always be less than l.length, whatever that value is/means.
Since this is a school assignment, I hesitate to give you a straight-up answer. That's really not what SO is for, and I'm glad that you haven't solicited a full solution either.
So I'll direct you to some useful methods in Ruby instead that could help.
In Array: .join, .each or .map
In String: .split
FYI, your TA's suggestion is far simpler than the algorithm you've decided to embark on... although technically, yours is not wrong, merely more complex.

Calculating IDs for model runs

I'm running some array jobs on a PBS system (although hopefully no knowledge of PBS systems is needed to answer my question!). I've got 24 runs, but I want to split them up into 5 sub-jobs each, so I need to run my script 120 times.
After giving the PBS option of -t 1-120, I can get the current job-array ID using $PBS_ARRAYID. However, I want to create some output files. It would be best if these output files used the ID that it would have had if there were only 24 runs, together with a sub-run identifier (e.g. output-1a.txt, output-1b.txt ... output-1e.txt, output-2a.txt).
What I therefore need is a way of calculating a way to get the ID (in the range 1-24) together with the sub-run identifier (presumably in a set of if-statements), which can be used in a shell-script. Unfortunately, neither my maths nor my Unix knowledge is quite good enough to figure this out. I assume that I'll need something to do with the quotient/remainder based on the current $PBS_ARRAYID relative to either 120 or 24, but that's as far as I've got...
You just need a little modular division. A quick simulation of this in Ruby would be:
p = Array.new
(1..120).each { |i| p[i] = "Run #{((i-1)/5)+1}-#{(((i-1)%5)+97).chr}" }
What this says is simply that the run should start at 1 and increment after each new section of five, and that the trailing sub-run should be the ASCII character represented by 96 plus the position of the sub-run (e.g., 97 == 'a').
Here it is in Bash:
#!/bin/bash
chr() {
    local tmp
    [ ${1} -lt 256 ] || return 1
    printf -v tmp '%03o' "$1"
    printf \\"$tmp"
}
for ((i = 0; i < ${#ARP[*]}; i++))
do
    charcode=$((($i % 5)+97))
    character=$(chr "$charcode")
    echo "Filename: output-$((($i/5)+1))$character"
done
I just used ARP as the name of the array, but you can obviously substitute that. Good luck!
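If it helps to sanity-check the quotient/remainder arithmetic before wiring it into the PBS script, here is a quick Python sketch of the same mapping for a 1-based $PBS_ARRAYID (the output file names are just examples):

# Map array IDs 1..120 onto 24 runs of 5 sub-runs each: 1a..1e, 2a..2e, ..., 24e
for array_id in range(1, 121):
    run = (array_id - 1) // 5 + 1              # 1..24
    sub = chr((array_id - 1) % 5 + ord('a'))   # a..e
    print("output-%d%s.txt" % (run, sub))

The same (ID - 1) / 5 and (ID - 1) % 5 expressions drop straight into bash arithmetic with $(( ... )).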
