Problem when renaming files in a for loop - bash

I want to rename different files:
for ((i = 2; i < 10; i++)); do mv /Users/neurolab/Desktop/Stephan/Oncology/dummy/sub-$i/ses-1/anat/Sub* /Users/neurolab/Desktop/Stephan/Oncology/dummy/sub-$i/ses-1/anat/sub-$i_ses-1_T1w.nii.gz; done
But when I write $i_ the shell does not recognize the "$i_ses" part and drops "$i_ses" from the new file name. There is no problem if I use a hyphen, as in "$i-ses".
How can I avoid this problem?
Best
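A likely cause, for what it's worth: underscores are valid in variable names, so bash reads $i_ses as a variable called i_ses, which is unset and expands to nothing. A hyphen cannot be part of a name, which is why $i-ses works. Wrapping the name in braces, ${i}_ses, marks where it ends. A minimal sketch, assuming the paths from the question (target_name is a made-up helper, and echo turns the loop into a dry run):

```shell
#!/bin/bash
# Hypothetical helper: build the new file name for subject number $1.
target_name() {
    local i=$1
    # ${i} ends the variable name before the underscore;
    # $i_ses would be looked up as the (unset) variable "i_ses".
    printf '%s\n' "sub-${i}_ses-1_T1w.nii.gz"
}

base=/Users/neurolab/Desktop/Stephan/Oncology/dummy
for ((i = 2; i < 10; i++)); do
    dir="$base/sub-$i/ses-1/anat"
    # echo only prints the command; remove it to actually rename.
    echo mv "$dir"/Sub* "$dir/$(target_name "$i")"
done
```

Once the printed commands look right, drop the echo to perform the renames.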


How can I find both identical and similar strings in a particular field in a text file in Linux?

My apologies ahead of time - I'm not sure that there is an answer for this one using only Linux command-line fu. Please note I am not a programmer, but I have been playing around with bash and python a bit over the last few years.
I have a large text file with rows and columns that resemble the following (note - fields are separated with tabs):
1074 Beetle OOB11061MNH 12/22/16 Confirmed
3430 Hightop 0817BESTYET 08/07/17 Queued
3431 Hightop 0817BESTYET 08/07/17 Queued
3078 Copland 2017GENERAL 07/07/17 Confirmed
3890 Bartok FOODS 09/11/17 Confirmed
5440 Alphapha 00B1106IMNH 01/09/18 Queued
What I want to do is find and output only those rows where the third field is either identical OR similar to another in the list. I don't really care whether the other fields are similar or not, but they should all be included in the output. By similar, I mean no more than [n] characters are different in that particular field (for example, no more than 3 characters are different). So the output I would want would be:
1074 Beetle OOB11061MNH 12/22/16 Confirmed
3430 Hightop 0817BESTYET 08/07/17 Queued
3431 Hightop 0817BESTYET 08/07/17 Queued
5440 Alphapha 00B1106IMNH 01/09/18 Queued
The line beginning 1074 has a third field that differs by 3 characters with 5440, so both of them are included. 3430 and 3431 are included because they are exactly identical. 3078 and 3890 are eliminated because they are not similar.
Through googling the forums I've managed to piece together this rather longish pipeline to be able to find all of the instances where field 3 is exactly identical:
cat inputfile.txt | awk 'BEGIN { OFS=FS="\t" } {if (count[$3] > 1) print $0; else if (count[$3] == 1) { print save[$3]; print $0; } else save[$3] = $0; count[$3]++; }' > outputfile.txt
I must confess I don't really understand awk all that well; I'm just copying and adapting from the web. But that seemed to work great at finding exact duplicates (i.e., it would output only 3430 and 3431 above). But I have no idea how to approach trying to find strings that are not identical but that differ in no more than 3 places.
For instance, in my example above, it should match 1074 and 5440 because they would both fit the pattern:
??B1106?MNH
But I would want it to be able to match also any other random pattern of matches, as long as there are no more than three differences, like this:
20?7G?N?RAL
These differences could be arbitrarily in any position.
The reason for needing this is we are trying to find a way to automatically find typographical errors in a serial-number-like field. There might be a mis-key, or perhaps a letter "O" replaced with a number "0", or the like.
So... any ideas? Thanks for the help!
You can use this script:
$ more hamming.awk
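function hamming(x,y,   xs,ys,nx,ny,min,max,h,i) {
    # the extra parameters are awk's way of declaring local variables
    if(x==y) return 0;
    nx=split(x,xs,"");   # empty separator: one character per element
    ny=split(y,ys,"");
    min=nx<ny?nx:ny;
    max=nx<ny?ny:nx;
    h=0;
    for(i=1;i<=min;i++) if(xs[i]!=ys[i]) h++;
    return h+(max-min);  # a length difference counts as mismatches
}
BEGIN {FS=OFS="\t"}
# first pass: remember the line numbers whose third field is within 3 of another
NR==FNR {
    if($3 in a) nrs[NR];
    for(k in a)
        if(hamming(k,$3)<4) {
            nrs[NR];
            nrs[a[k]];
        }
    a[$3]=NR;
    next
}
# second pass: print the remembered lines
FNR in nrs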
Usage:
$ awk -f hamming.awk file{,}
It's a double-scan algorithm that finds the Hamming distance (the one you described) between keys. Note that it's an O(n^2) algorithm, so it may not be suitable for very large data sets. However, I'm not sure any other algorithm can do better.
NB: an additional note based on a comment that I missed in the post. This algorithm compares the keys character by character, so displacements won't be identified. For example, 123 and 23 will give a distance of 3.
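To see why displacements are not detected, the same position-by-position comparison can be sketched in plain bash. The function name hamming mirrors the awk script; this is only an illustration of the distance, not a replacement for the script:

```shell
#!/bin/bash
# Position-by-position distance; extra characters in the longer string count as mismatches.
hamming() {
    local x=$1 y=$2 i d=0
    local min=${#x} max=${#y}
    if (( min > max )); then
        min=${#y}
        max=${#x}
    fi
    for ((i = 0; i < min; i++)); do
        if [[ ${x:i:1} != "${y:i:1}" ]]; then
            d=$((d + 1))
        fi
    done
    echo $(( d + max - min ))
}

hamming 123 23                     # shifted by one position
hamming OOB11061MNH 00B1106IMNH    # the two keys from the question
```

Both calls print 3: in the first, every compared position differs because the digits are shifted, and in the second, exactly three characters were mistyped.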
Levenshtein distance, aka "edit distance", suits your task best. The Perl script below requires the Text::Levenshtein module (on Debian/Ubuntu: sudo apt install libtext-levenshtein-perl).
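use Text::Levenshtein qw(distance);
$maxdist = shift;    # maximum allowed edit distance (first argument)
@ll = (<>);          # all input lines
@k = map {
    $k = (split /\t/, $_)[2];    # key = third tab-separated field
    # $k =~ s/O/0/g;
    $k;    # return the key even when the line above is uncommented
} @ll;
for ($i = 0; $i < @ll; ++$i) {
    for ($j = 0; $j < @ll; ++$j) {
        # <= so that "3" means "no more than 3 differences"
        if ($i != $j and distance($k[$i], $k[$j]) <= $maxdist) {
            print $ll[$i];
            last;
        }
    }
}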
Usage:
perl lev.pl 3 inputfile.txt > outputfile.txt
The algorithm is the same O(n^2) as in @karakfa's post, but matching is more flexible.
Also note the commented line # $k =~ s/O/0/g;. If you uncomment it, then all O's in the key will become 0's, which will fix keys damaged by an O->0 substitution. When working with damaged data I always use small rules like this to fix the data gradually, refining the rules from run to run, to the point where the data is almost perfect and fuzzy matching is no longer needed.
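The same kind of small cleanup rule can be tried on a single key in bash before committing to it. A minimal sketch, where normalize_key is a made-up name and the O->0 rule is the only one applied:

```shell
#!/bin/bash
# Hypothetical helper: replace every letter O in a key with the digit 0.
normalize_key() {
    local k=$1
    printf '%s\n' "${k//O/0}"
}

normalize_key OOB11061MNH
normalize_key 00B1106IMNH
```

After this rule the two keys from the question differ in only one position (1 versus I), so a second rule of the same shape could close the remaining gap.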

Adding up variable within loop

I have a number of files and I need to determine how many of them will fit on a 4 TB drive, knowing only the first filename. The name pattern is 001j00_rf_geo_????$seqn, with a sequential 3-digit number at the end. Say I start with 001j00_rf_geo_????100.
block=4000000000000
shopt -s dotglob
seqn="100"
size=`stat -c%s 001j00_rf_geo_????$seqn`
for (( i=$size ;i < $block ; seqn++ ))
do
((size+=$(stat -c%s 001j00_rf_geo_????$seqn)))
done
echo $size
I am pretty sure the summing part in the for loop is wrong. I just couldn't get my head around how to get the total size of the files with the loop part in the code.
Look at your for loop, you are not using 'i' at all -- it is unneeded. If you want to use a C-style for loop, you can simply omit the initializer:
for ((; size < block; seqn++))
do
or use a while loop instead:
while ((size < block))
do
...
((seqn++))
done
Of course, you can also move your initialization into the for loop and get rid of the one above:
for ((seqn = 100; size < block; seqn++))
do
Give either a try and let me know if you have further questions.
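Putting the pieces together, a complete loop might look like the sketch below. It assumes GNU stat (-c%s), that each sequence number matches exactly one file, and that counting should stop before the file that would exceed the limit; count_fitting is a made-up name.

```shell
#!/bin/bash
# Usage: count_fitting BLOCK_BYTES FIRST_SEQN
# Run it in the directory that holds the files.
count_fitting() {
    local block=$1
    local n=$((10#$2))    # force base 10 so a start like 008 is not read as octal
    local size=0 count=0 file fsize
    while :; do
        # compgen -G prints the names matching the glob and fails if there are none
        file=$(compgen -G "001j00_rf_geo_????$(printf '%03d' "$n")") || break
        fsize=$(stat -c%s "$file") || break
        if (( size + fsize > block )); then
            break    # this file would no longer fit
        fi
        size=$((size + fsize))
        count=$((count + 1))
        n=$((n + 1))
    done
    echo "$count files, $size bytes"
}
```

For the case in the question this would be called as count_fitting 4000000000000 100. Unlike the original loop, the first file is counted only once, because size starts at 0 instead of being preloaded with the first file's size.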

Scripting libreoffice calc formula into csv

I have a bash script that writes some data extracted from raw log files into a file in csv format. Now I want to apply libreoffice calc formulas on this dataset. My idea is to write "raw" calc formula in the csv file directly from the bash script (using ';' [semicolon] instead of ',' [comma] to separate data to avoid breaking formulas). So I have a script like that (for example):
#!/bin/bash
for (( i=1; i<=5; i++ ))
do
echo "$i; $((i+1)); =SUM(A$i, B$i)" >> sum.csv
done
Executing this script gives this sum.csv file:
1; 2; =SUM(A1, B1)
2; 3; =SUM(A2, B2)
3; 4; =SUM(A3, B3)
4; 5; =SUM(A4, B4)
5; 6; =SUM(A5, B5)
When I open it with calc, it gives the expected result with each value separated in single cell. But the problem is that the formulas are not evaluated. Even manually editing the cell doesn't trigger an evaluation. The only thing working is to copy the formulas without '=' and manually writing '=', then pasting the formulas.
I tried using INDIRECT() but it didn't help.
Is there a way to force evaluation of the formulas? Or is there some other way to do what I want (without learning a new language...)?
It should work after removing the leading space before the equals sign. The third field currently has the content " =SUM(A1, B1)" (notice the leading space). LO will recognize the formula if the content starts with "=" instead of " =":
1;2;=SUM(A1, B1)
2;3;=SUM(A2, B2)
3;4;=SUM(A3, B3)
4;5;=SUM(A4, B4)
5;6;=SUM(A5, B5)
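On the script side, that only means dropping the spaces after the semicolons in the echo line. A minimal sketch (formula_rows is a made-up wrapper around the loop from the question):

```shell
#!/bin/bash
# Print n rows; with no space after the separators, the third field starts with "=".
formula_rows() {
    local n=$1 i
    for ((i = 1; i <= n; i++)); do
        echo "$i;$((i + 1));=SUM(A$i, B$i)"
    done
}

formula_rows 5
```

Redirecting the output (formula_rows 5 > sum.csv) gives the file shown above.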

Bash Recursive Method not Finishing Method Call Stack

I'm new to Bash, and as a project I'm trying to create a shell script that will create a tree of folders. For example, if I tell it to create a tree that is 4 folders deep and 3 wide, then it would create a level of folders labeled 0, 1, and 2; then inside each of those folders it would create folders 0, 1, and 2, and so on until it reached 4 levels deep. (This would create 3^4 folders at the deepest level.)
Here is the code for the method I created:
function createLevel () { #param1 = number of levels of folders, param2 = number of folders per level
    numLevels=$1
    numPerLevel=$2
    if [ $numLevels -eq 1 ];
    then
        for ((i=0; i < numPerLevel; i++));
        do
            mkdir $i
        done
    else
        for ((i=0; i < numPerLevel; i++));
        do
            mkdir $i
            cd $i
            createLevel $((numLevels - 1)) $numPerLevel
            cd ..
        done
    fi
}
It usually just creates one branch, so for example it will create a 0 folder within a 0 folder within a 0 folder, but it will not trace back out and make the other folders. I feel like it's not finishing the method call stack and instead of going back and finishing the method, it just quits after it calls itself. Any help would be appreciated!
You need to declare your variables local if you call the function recursively:
local i
local numLevels=$1
local numPerLevel=$2
[...]
Otherwise they will be overwritten by the "inner" calls.
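For completeness, here is how the whole function might look with the local declarations in place. This is a sketch that also folds the two branches into one loop and quotes the variables; the behaviour is otherwise meant to match the original:

```shell
#!/bin/bash
# $1 = number of levels of folders, $2 = number of folders per level
createLevel() {
    local i
    local numLevels=$1
    local numPerLevel=$2
    for ((i = 0; i < numPerLevel; i++)); do
        mkdir "$i"
        if [ "$numLevels" -gt 1 ]; then
            cd "$i" || return
            # the inner call gets its own i, numLevels and numPerLevel
            createLevel $((numLevels - 1)) "$numPerLevel"
            cd ..
        fi
    done
}
```

Without local, the inner call leaves i equal to numPerLevel when it returns, so the outer loop stops after its first iteration, which is exactly the single 0/0/0 branch described in the question.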

Calculating IDs for model runs

I'm running some array jobs on a PBS system (although hopefully no knowledge of PBS systems is needed to answer my question!). I've got 24 runs, but I want to split them up into 5 sub-jobs each, so I need to run my script 120 times.
After giving the PBS option of -t 1-120, I can get the current job-array ID using $PBS_ARRAYID. However, I want to create some output files. It would be best if these output files used the ID that it would have had if there were only 24 runs, together with a sub-run identifier (e.g. output-1a.txt, output-1b.txt ... output-1e.txt, output-2a.txt).
What I therefore need is a way to calculate the ID (in the range 1-24) together with the sub-run identifier (presumably in a set of if-statements), which can be used in a shell script. Unfortunately, neither my maths nor my Unix knowledge is quite good enough to figure this out. I assume that I'll need something to do with the quotient/remainder based on the current $PBS_ARRAYID relative to either 120 or 24, but that's as far as I've got...
You just need a little modular division. A quick simulation of this in Ruby would be:
p = Array.new;
(1..120).each {|i| p[i] = "Run #{1+((i-1)/5)}-#{(((i-1)%5)+97).chr}" }
What this says is simply that the run should start at 1 and increment after each new section of five, and that the trailing sub-run should be the ASCII character represented by 97 plus the zero-based position of the sub-run within its section (e.g., 97 == 'a'). Subtracting 1 from i first keeps IDs 1 to 5 together in run 1.
Here it is in Bash:
#!/bin/bash
# Print the character whose ASCII code is $1.
chr() {
    local tmp
    [ "$1" -lt 256 ] || return 1
    printf -v tmp '%03o' "$1"
    printf \\"$tmp"
}
# i is zero-based here, so no "- 1" correction is needed.
for ((i = 0; i < ${#ARP[*]}; i++))
do
    charcode=$(( (i % 5) + 97 ))
    character=$(chr "$charcode")
    echo "Filename: output-$(( (i / 5) + 1 ))$character"
done
I just used ARP as the name of the array, but you can obviously substitute that. Good luck!
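Inside the job script itself there is no array to loop over, only the single value in $PBS_ARRAYID, which starts at 1. A minimal sketch for that case (output_name is a made-up function name, and the .txt suffix follows the file names in the question):

```shell
#!/bin/bash
# Map a 1-based array ID (1-120) to "output-<run><sub-run>.txt".
output_name() {
    local id=$1
    local letters=abcde                            # one letter per sub-run
    local run=$(( (id - 1) / 5 + 1 ))              # IDs 1-5 -> 1, 6-10 -> 2, ...
    local sub=${letters:$(( (id - 1) % 5 )):1}     # IDs 1-5 -> a, b, c, d, e
    echo "output-${run}${sub}.txt"
}

# PBS sets PBS_ARRAYID inside an array job; the 1 is only a fallback for trying this outside PBS.
output_name "${PBS_ARRAYID:-1}"
```

Picking the letter out of a string avoids the chr helper altogether, at the cost of listing the sub-run letters explicitly.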
