Calculating IDs for model runs - shell

I'm running some array jobs on a PBS system (although hopefully no knowledge of PBS systems is needed to answer my question!). I've got 24 runs, but I want to split them up into 5 sub-jobs each, so I need to run my script 120 times.
After giving the PBS option of -t 1-120, I can get the current job-array ID using $PBS_ARRAYID. However, I want to create some output files. It would be best if these output files used the ID that it would have had if there were only 24 runs, together with a sub-run identifier (e.g. output-1a.txt, output-1b.txt ... output-1e.txt, output-2a.txt).
What I therefore need is a way of calculating the ID (in the range 1-24) together with the sub-run identifier (presumably via some quotient/remainder arithmetic rather than a long set of if-statements), which can be used in a shell script. Unfortunately, neither my maths nor my Unix knowledge is quite good enough to figure this out. I assume I'll need the quotient/remainder of the current $PBS_ARRAYID relative to either 120 or 24, but that's as far as I've got...

You just need a little modular division. A quick simulation of this in Ruby would be:
p = Array.new
(1..120).each { |i| p[i] = "Run #{1 + (i - 1) / 5}-#{((i - 1) % 5 + 97).chr}" }
What this says is simply that the run number should start at 1 and increment after each block of five, and that the trailing sub-run should be the ASCII character represented by 97 plus the zero-based position within the block (e.g. 97 == 'a'). The index is shifted to zero-based (i - 1) so that i = 5 still falls in run 1 as sub-run 'e' rather than spilling into run 2.
Here it is in Bash:
#!/bin/bash
# Print the character corresponding to a given ASCII code
chr() {
    local tmp
    [ "$1" -lt 256 ] || return 1
    printf -v tmp '%03o' "$1"
    printf "\\$tmp"
}

for ((i = 0; i < ${#ARP[*]}; i++))
do
    charcode=$(((i % 5) + 97))
    character=$(chr "$charcode")
    echo "Filename: output-$(((i / 5) + 1))$character"
done
I just used ARP as the name of the array, but you can obviously substitute that. Good luck!
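Since $PBS_ARRAYID runs from 1 to 120 in your case, you can also apply the same arithmetic to it directly, without any array. A minimal sketch (untested on a real PBS system):

#!/bin/bash
# Map a 1-based $PBS_ARRAYID (1..120) to a run number (1..24) and a sub-run letter (a..e).
id=$((PBS_ARRAYID - 1))                               # zero-based index
run=$((id / 5 + 1))                                   # which of the 24 runs
sub=$(printf "\\$(printf '%03o' $((id % 5 + 97)))")   # 'a' through 'e'
echo "output-${run}${sub}.txt"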

Related

Fastest way to index large sorted hash file

I am building a file-based index for the sorted haveibeenpwned passwords text file and it got me wondering what's the fastest way to do this?
I figured a good way to build a quickly grep-able index would be to split the sorted file into 256 files named by the first two hex digits (i.e. FF.txt, FE.txt, etc.). I found ripgrep rg to be about 5 times faster than grep on my computer. So I tried something like this:
for i in {255..0}
do
    start=$(date +%s)
    hex="$(printf '%02x' "$i" | tr '[:lower:]' '[:upper:]')"
    rg "^$hex" pwned-passwords-ntlm-ordered-by-hash-v4.txt > "ntlm/$hex-ntlm.txt"
    echo "0x$hex completed in $(($(date +%s) - start)) seconds"
done
This is the fastest solution I could come up with. ripgrep is able to create each file in 25 seconds. So I'm looking at about 100 minutes to create this index. When I split the job in half, and run them in parallel, each pair of files is created in 80 seconds. So it seems best to just let ripgrep work its magic and work in series.
Obviously, I won't be indexing this list too often, but it's just fun to think about. Any thoughts on a faster way (aside from using a database) to index this file?
ripgrep, like any other tool that's able to work with unsorted input files at all, is the wrong tool for this job. When you're trying to grep sorted inputs, you want something that can bisect your input file to find a position in logarithmic time. For big enough inputs, even a slow O(log n) implementation will be faster than a highly optimized O(n) one.
pts-line-bisect is one such tool, though of course you're also welcome to write your own. You'll need to write it in a language with full access to the seek() syscall, which is not exposed in bash.
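That said, just to illustrate the bisection idea from a shell script, you can approximate seek() by delegating to dd. This is a rough sketch, not a production tool - file and pattern are positional placeholders, and edge cases (matches spanning the window boundary, chunks without a newline) are ignored:

#!/bin/bash
# Sketch: O(log n) prefix lookup in a sorted text file, seeking via dd.
file=$1 pattern=$2
lo=0 hi=$(stat -c%s "$file")
while ((hi - lo > 4096)); do
    mid=$(((lo + hi) / 2))
    # Read a small chunk at the midpoint and drop the partial first line.
    line=$(dd if="$file" bs=1 skip="$mid" count=256 2>/dev/null | sed -n 2p)
    if [[ $line < $pattern ]]; then lo=$mid; else hi=$mid; fi
done
# Linear scan of the remaining small window.
dd if="$file" bs=1 skip="$lo" count=$((hi - lo + 256)) 2>/dev/null | grep "^$pattern"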
You are reading through the file 256 times, doing a full file scan every time. Consider an approach that reads the file once, writing each line out through an open file descriptor. Python would be an easy choice of implementation (if that's your thing); you could optimize by keeping each output file open until you hit a new hex code at the beginning of the line.

If you want to be even more clever, there is no need to go through the sorted file line by line. Based on Charles Duffy's hint, you could create a heuristic for sampling the file (using seek()) to get to the next hex value; once the program has found the byte offset of the next hex value, the whole block of bytes can be written to the new file in one go. However, since this is tagged as 'bash', let's keep the solution set in that domain:
while IFS= read -r line
do
    hex=${line:0:2}
    printf '%s\n' "$line" >> "ntlm/$hex-ntlm.txt"
done < pwned-passwords-ntlm-ordered-by-hash-v4.txt
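If stepping just outside pure bash is acceptable, the same single-pass idea runs far faster in awk, which avoids bash's per-line overhead (gawk has no trouble keeping 256 output files open; some other awk implementations might):

# Single pass: route each line to the file named after its first two hex digits.
awk '{ print > ("ntlm/" substr($0, 1, 2) "-ntlm.txt") }' \
    pwned-passwords-ntlm-ordered-by-hash-v4.txt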
I wrote a Python 3 script that does fast binary-search lookups in the hash file without having to create an index. It doesn't directly address your question (indexing) but probably solves the underlying problem that you wanted the index for - quickly looking up individual hashes. This script checks hundreds of passwords in seconds.
import argparse
import hashlib

parser = argparse.ArgumentParser(description='Searches passwords in https://haveibeenpwned.com/Passwords database.')
parser.add_argument('passwords', metavar='TEST', type=str, help='text file with passwords to test, one per line, utf-8')
parser.add_argument('database', metavar='DATABASE', type=str, help='the downloaded text file with sha-1:count')
args = parser.parse_args()

def search(f: object, pattern: str) -> str:
    def search(left: int, right: int) -> str:
        if left >= right:
            return None
        middle = (left + right) // 2
        if middle == 0:
            f.seek(0, 0)
            test = f.readline()
        else:
            f.seek(middle - 1, 0)
            _ = f.readline()  # skip the partial line we landed in
            test = f.readline()
        if test.upper().startswith(pattern):
            return test
        elif left == middle:
            return None
        elif pattern < test:
            return search(left, middle)
        else:
            return search(middle, right)

    f.seek(0, 2)  # seek to end of file to learn its size
    return search(0, f.tell())

fsource = open(args.passwords)
fdatabase = open(args.database)
source_lines = fsource.readlines()
for l in source_lines:
    line = l.strip()
    hash_object = hashlib.sha1(line.encode("utf-8"))
    pattern = hash_object.hexdigest().upper()
    print("%s:%s" % (line, str(search(fdatabase, pattern)).strip()))
fsource.close()
fdatabase.close()
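Usage is something like this (the script name is hypothetical; note the database must be the SHA-1 version of the list, ordered by hash, since the script hashes the candidate passwords with SHA-1):

python3 pwned-lookup.py passwords-to-test.txt pwned-passwords-sha1-ordered-by-hash-v4.txt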

How can I find both identical and similar strings in a particular field in a text file in Linux?

My apologies ahead of time - I'm not sure that there is an answer for this one using only Linux command-line fu. Please note I am not a programmer, but I have been playing around with bash and python a bit over the last few years.
I have a large text file with rows and columns that resemble the following (note - fields are separated with tabs):
1074 Beetle OOB11061MNH 12/22/16 Confirmed
3430 Hightop 0817BESTYET 08/07/17 Queued
3431 Hightop 0817BESTYET 08/07/17 Queued
3078 Copland 2017GENERAL 07/07/17 Confirmed
3890 Bartok FOODS 09/11/17 Confirmed
5440 Alphapha 00B1106IMNH 01/09/18 Queued
What I want to do is find and output only those rows where the third field is either identical OR similar to another in the list. I don't really care whether the other fields are similar or not, but they should all be included in the output. By similar, I mean no more than [n] characters are different in that particular field (for example, no more than 3 characters are different). So the output I would want would be:
1074 Beetle OOB11061MNH 12/22/16 Confirmed
3430 Hightop 0817BESTYET 08/07/17 Queued
3431 Hightop 0817BESTYET 08/07/17 Queued
5440 Alphapha 00B1106IMNH 01/09/18 Queued
The line beginning 1074 has a third field that differs by 3 characters with 5440, so both of them are included. 3430 and 3431 are included because they are exactly identical. 3078 and 3890 are eliminated because they are not similar.
Through googling the forums I've managed to piece together this rather longish pipeline to be able to find all of the instances where field 3 is exactly identical:
cat inputfile.txt | awk 'BEGIN { OFS=FS="\t" } {if (count[$3] > 1) print $0; else if (count[$3] == 1) { print save[$3]; print $0; } else save[$3] = $0; count[$3]++; }' > outputfile.txt
I must confess I don't really understand awk all that well; I'm just copying and adapting from the web. But that seemed to work great at finding exact duplicates (i.e., it would output only 3430 and 3431 above). But I have no idea how to approach trying to find strings that are not identical but that differ in no more than 3 places.
For instance, in my example above, it should match 1074 and 5440 because they would both fit the pattern:
??B1106?MNH
But I would want it to be able to match also any other random pattern of matches, as long as there are no more than three differences, like this:
20?7G?N?RAL
These differences could be arbitrarily in any position.
The reason for needing this is we are trying to find a way to automatically find typographical errors in a serial-number-like field. There might be a mis-key, or perhaps a letter "O" replaced with a number "0", or the like.
So... any ideas? Thanks for the help!
You can use this script:
$ more hamming.awk
# Hamming distance: number of differing positions, plus any length difference
function hamming(x, y,    xs, ys, min, max, h, i, nx, mx) {
    if (x == y) return 0
    nx = split(x, xs, "")
    mx = split(y, ys, "")
    min = nx < mx ? nx : mx
    max = nx < mx ? mx : nx
    for (i = 1; i <= min; i++)
        if (xs[i] != ys[i]) h++
    return h + (max - min)
}

BEGIN { FS = OFS = "\t" }

NR == FNR {                 # first pass: record the line numbers to print
    if ($3 in a) nrs[NR]
    for (k in a)
        if (hamming(k, $3) < 4) {
            nrs[NR]
            nrs[a[k]]
        }
    a[$3] = NR
    next
}

FNR in nrs                  # second pass: print the recorded lines
usage
$ awk -f hamming.awk file{,}
It's a double-scan algorithm that computes the Hamming distance (the one you described) between keys. Note that it's an O(n^2) algorithm, so it may not be suitable for very large data sets; however, I'm not sure any other algorithm can do better.
NB: An additional note, based on a comment which I had missed under the post: this algorithm compares the keys character by character, so shifts won't be identified. For example, 123 and 23 will give a distance of 3.
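For the sample input from the question, this should print exactly the four expected rows:

$ awk -f hamming.awk inputfile.txt{,}
1074	Beetle	OOB11061MNH	12/22/16	Confirmed
3430	Hightop	0817BESTYET	08/07/17	Queued
3431	Hightop	0817BESTYET	08/07/17	Queued
5440	Alphapha	00B1106IMNH	01/09/18	Queued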
Levenshtein distance, aka "edit distance", suits your task best. The Perl script below requires the Text::Levenshtein module (on Debian/Ubuntu: sudo apt install libtext-levenshtein-perl).
use Text::Levenshtein qw(distance);

$maxdist = shift;
@ll = (<>);                      # slurp all input lines
@k = map {
    $k = (split /\t/, $_)[2];    # key = third tab-separated field
    # $k =~ s/O/0/g;
    $k;
} @ll;
# Print every row whose key is within $maxdist of some other row's key
for ($i = 0; $i < @ll; ++$i) {
    for ($j = 0; $j < @ll; ++$j) {
        if ($i != $j and distance($k[$i], $k[$j]) <= $maxdist) {
            print $ll[$i];
            last;
        }
    }
}
Usage:
perl lev.pl 3 inputfile.txt > outputfile.txt
The algorithm is the same O(n^2) as in karakfa's post, but the matching is more flexible (the comparison is distance <= maxdist, so an argument of 3 means "no more than 3 differences", matching the question's example).
Also note the commented-out line # $k =~ s/O/0/g;. If you uncomment it, then all O's in the keys become 0's, which repairs keys damaged by O/0 mix-ups. When working with damaged data I always use small rules like this to fix the data gradually, refining the rules from run to run, until the data is almost perfect and fuzzy matching is no longer needed.

Need help understanding this implementation of tetris

I am trying to understand this implementation of Tetris.
I have a few questions.
In update_score function,
if ((score > LEVEL_UP * level)); then   # if level should be increased
    ((level++))                         # increment level
    pkill -SIGUSR1 -f "/bin/bash $0"
What is the use of having a separate process at all for adjusting the delay? Why use SIGUSR1 and SIGUSR2?
In the draw_piece function, why multiply by 8? I don't understand how the conversion is taking place or how the concept of "rotation" is implemented here.
for ((i = 0; i < 8; i += 2)) {
    # relative coordinates are retrieved based on orientation and added to absolute coordinates
    ((x = $1 + ${piece[$3]:$((i + $4 * 8 + 1)):1} * 2))
    ((y = $2 + ${piece[$3]:$((i + $4 * 8)):1}))
    xyprint $x $y "$5"
    ...
}
Nor do I understand the syntax involving : here.
In clear_next, why is draw_next ${filled_cell//?/ } necessary instead of just ${filled_cell}? What do the // signify?
I'm a beginner to shell scripting and programming in general and I have been trying to understand this implementation of Tetris [in shell]
Somehow, I suspect you could have found easier programs to start with.
What is the use of having a separate process at all for adjusting the delay? Why use [SIGUSR1] and [SIGUSR2]?
I don't think the separate process is there for adjusting the delay, but for implementing the timer. The timer must run even while the program is waiting for the player to give input, and if the shell doesn't give any way of putting a timeout on read, that work must be delegated to another process. So you get what's at the end of the script: a division into the timer, the user-input handler, and the actual game logic, with output from the first two feeding the last:
(ticker & reader) | (controller)
Bash's read does have the -t flag for timeout, so if the game were implemented in pure Bash, you might not need the extra timer process. However, putting the timer in an external process also makes it independent of user input: a read timeout would instead reset every time the user hits a button. Working around that would require some way of accurately measuring elapsed time (or using a really short timeout on read and counting the ticks).
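As a sketch of that read-with-timeout alternative (not how the linked script actually does it):

# Poll for a key with a one-second timeout. Note the timeout restarts on every
# keypress, which is exactly why a separate ticker process gives a steadier clock.
while true; do
    if read -r -s -n 1 -t 1 key; then
        echo "player pressed: $key"    # handle input here
    else
        echo "tick"                    # timeout expired: advance the game
    fi
done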
SIGUSR1 and SIGUSR2 are just "innocent" signals that don't have a meaning to the system at large, so they can be used here. Of course you could use others, but catching SIGINT or SIGHUP would annoy users if they wanted to stop the game.
In the draw_piece function, why multiply by 8?
((x = $1 + ${piece[$3]:$((i + $4 * 8 + 1)):1} * 2))
The piece array contains the different shapes and orientations of the pieces. A piece is 4 squares large, and each square needs two coordinates, so we get 8 numbers per piece/orientation. For example, the string for the S piece is 0001111201101120, so it has two orientations:
yx yx yx yx yx yx yx yx
00 01 11 12 01 10 11 20
And the piece looks something like this:
012 012
0 xx. 0 .x.
1 .xx 1 xx.
2 ... 2 x..
The ${variable:position:length} notation picks a substring from the given variable, so the program gets the single digits it needs from the bigger string. That's a somewhat weird way of implementing an array.
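To see that substring expansion in action, here is a small sketch that decodes the second orientation of the S piece string from above:

s=0001111201101120   # S piece: two orientations, 8 digits (4 squares x 2 coordinates) each
o=1                  # orientation index (0 or 1)
for ((i = 0; i < 8; i += 2)); do
    y=${s:o*8+i:1} x=${s:o*8+i+1:1}   # y digit first, then x, as in the table above
    echo "square at row $y, col $x"
done

This prints the four (row, col) pairs 0 1, 1 0, 1 1, 2 0 - the right-hand shape in the diagram.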
In clear_next, why is draw_next ${filled_cell//?/ } necessary ...? What do the // signify?
The ${parameter/foo/bar} construct is a pattern replacement (see e.g. Bash's manual on parameter expansion, and look for "replace"). Whatever matches foo in the value of parameter is replaced with bar, and the result is expanded. With a double slash, all matches are replaced; with a single slash, only the first. The question mark matches any single character, as in filename globs, so the expansion effectively produces a string of spaces as long as the original string.
For example:
$ str="hallo hallo"
$ echo "${str/a/e}"
hello hallo
$ echo "${str//a/e}"
hello hello
$ str="abc"
$ echo "x${str//?/ }x"
x x

Adding up variable within loop

I've got a number of files and I need to determine how many of them will fit on a 4 TB drive, knowing only the first filename. The name pattern is 001j00_rf_geo_????$seqn, with a sequential 3-digit number at the end. Say I start with 001j00_rf_geo_????100.
block=4000000000000
shopt -s dotglob
seqn="100"
size=`stat -c%s 001j00_rf_geo_????$seqn`
for ((i = $size; i < $block; seqn++))
do
    ((size += $(stat -c%s 001j00_rf_geo_????$seqn)))
done
echo $size
I am pretty sure the summing-up part in the for loop is wrong. I just couldn't get my head around how to total up the file sizes within the loop.
Look at your for loop: you are not using 'i' at all - it is unneeded. Worse, since 'i' is set once and never updated, the condition i < block can never become false. If you want to use a C-style for loop, you can simply omit the initializer:
for ((; size < block; seqn++))
do
or use a while loop instead
while ((size < block))
do
    ...
    ((seqn++))
done
Of course you can also move the initialization into the for loop and get rid of the standalone assignment above:
for ((seqn = 100; size < block; seqn++))
do
Just note that size should then start at 0 (drop the stat call before the loop); otherwise the first file is counted twice, because the loop body stats file 100 again.
Give either a try and let me know if you have further questions.
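Putting the pieces together, a minimal sketch (assuming each sequence number matches exactly one existing file):

#!/bin/bash
# Count how many bytes the sequentially numbered files occupy, stopping
# once the 4 TB budget is reached or the sequence runs out.
block=4000000000000
size=0
for ((seqn = 100; size < block; seqn++)); do
    f=(001j00_rf_geo_????"$seqn")         # glob should expand to exactly one file
    [[ -e ${f[0]} ]] || break             # stop when the sequence runs out
    ((size += $(stat -c%s "${f[0]}")))    # add that file's size
done
echo "$size bytes after file number $((seqn - 1))"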

What useful GDB scripts have you used/written?

People use gdb on and off for debugging; of course there are lots of other debugging tools across the various OSes, with and without GUIs, and maybe other fancy IDE features.
I would like to know what useful gdb scripts you have written and liked. I don't just mean a dump of commands in a something.gdb file that you source to pull out a bunch of data - though if that made your day, go ahead and talk about it. Let's think conditional processing, control loops, and functions written for more elegant and refined debugging, and maybe even for whitebox testing.
Things get interesting when you start debugging remote systems (say, over a serial/ethernet interface), and even more so when the target is a multi-processor (and multithreaded) system.
Let me put a simple case as an example: say, a script that traverses serially over the entries of a large hash table implemented on an embedded platform, to locate a bad entry. That helped me debug a broken hash table once.
This script, not written by me, pretty prints STL containers, such as vector, map, etc: http://www.yolinux.com/TUTORIALS/src/dbinit_stl_views-1.03.txt
Totally awesome.
When debugging an AOLserver SIGSEGV crash, I used the following script to examine the TCL-level call stack from GDB:
define tcl_stack_dump
    set $interp = *(Interp*)interp
    set $frame = $interp->framePtr
    while ((CallFrame *)$frame->callerPtr != 0)
        set $i = 0
        if 0 != $frame->objv
            while ($i < $frame->objc)
                if (0 != $frame->objv[$i] && 0 != $frame->objv[$i]->bytes)
                    printf " %s", (char *)$frame->objv[$i]->bytes
                end
                set $i = $i + 1
            end
            printf "\n"
        end
        set $frame = (CallFrame *)$frame->callerPtr
    end
end
document tcl_stack_dump
Print a list of TCL procs and arguments currently being called in an
interpreter. Most TCL C API functions beginning with Tcl[^_] will have a
Tcl_Interp parameter. Assumes the `interp' local C variable holds a
Tcl_Interp which is represented internally as an Interp struct.
See:
ptype Interp
ptype CallFrame
ptype Proc
ptype Command
ptype Namespace
ptype Tcl_Obj
end
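To use a macro like this, load it together with the crashed binary and its core dump, then invoke it from a frame where the interp local is in scope (file names here are hypothetical):

gdb -x tcl_stack_dump.gdb ./nsd core.12345
(gdb) frame 3
(gdb) tcl_stack_dump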
1. When trying to get some 3rd-party closed-source DLLs working with our project under Mono, it was giving meaningless errors. Consequently, I resorted to the scripts from the Mono project.
2. I also had a project that could dump its own information to stdout for use in GDB, so at a breakpoint I could run the function, then cut-n-paste its output into GDB.
[Edit]
3. Most of my GCC/G++ use was a while ago, but I also recall using a macro to take advantage of the fact that GDB knew the members of some opaque data I had (the library was compiled with debug info). That was enormously helpful.
4. And I just found this, too. It dumps a list of objects (from a global "headMeterFix" singly linked list) that contain, among other things, dynamic arrays of another object type. One of the few times I've used nested loops in a macro:
define showFixes
    set $i = headMeterFix
    set $n = 0
    while ($i != 0)
        set $p = $i->resolved_list
        set $x = $i->resolved_cnt
        set $j = 0
        printf "%08x [%d] = {", $i, $x
        printf "%3d [%3d] %08x->%08x (D/R): %3d/%-3d - %3d/%-3d {", $n, $i, $x, $i->fix, $i->depend_cnt, dynArySizeDepList($i->depend_list), $i->resolved_cnt, dynArySizeDepList($i->resolved_list)
        while ($j < $x)
            printf " %08x", $p[$j]
            set $j = $j + 1
        end
        printf " }\n"
        set $i = $i->next
        set $n = $n + 1
    end
end
