I have files in a directory stored as
abc123.0000.pdb
abc123.0001.pdb
abc123.0002.pdb
.
.
.
abc123.0456.pdb
Note "abc123" is arbitrary and so is the number "0456". I figured that I can get the largest filename using
\ls | tail -1
But how do I obtain the digits "456" only without the padded zeros and store it as a variable in a bash script?
awk is a good tool for this:
ls | awk '
{
if(match($0, /^.*\.0*([0-9]+)\.pdb$/, a)) {
if(max <= a[1]) {
max = a[1]
}
}
}END{
print max
}'
Each line of the input (from ls in this case) is run through the regex /^.*\.0*([0-9]+)\.pdb$/, which captures the digits (minus any leading zeros) that sit directly after a . and before the .pdb extension. The three-argument form of match() used here is a GNU awk (gawk) extension. If the match succeeds, the number is stored in a[1] and compared with max. At the end, the largest number is printed, or an empty line if nothing matched.
This can also be run in a single line:
ls | awk '{if(match($0,/^.*\.0*([0-9]+)\.pdb$/,a)){if(max<=a[1]){max=a[1]}}}END{print max}'
This is more robust than your solution of ls | tail -1 | egrep -o [1-9]+ | tail -1, which will fail if:
A file such as z.txt is added to the directory.
The last file has a 0 in the middle or end of the number, such as abc123.0101.pdb or abc123.0010.pdb.
The numbers go above 9999. For example if abc123.9999.pdb and abc123.10000.pdb exist, abc123.9999.pdb may be sorted last by ls.
Try this:
# e.g. for abc123.0000.pdb the capture group holds 0
regex='^[^.]*\.0*([0-9]+)\.pdb$'
for f in *.pdb; do
    [[ $f =~ $regex ]] || continue   # skip names that do not match
    echo "for file $f: ${BASH_REMATCH[1]}"
done
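The question also asks how to store the largest number in a variable. A minimal sketch of one way to do that follows, using only parameter expansion so it needs neither awk nor a bash-specific regex. The function name max_pdb_index is made up for illustration, and the sketch assumes the index is always the last dot-separated field before .pdb.

```shell
# max_pdb_index DIR: print the largest index found in DIR/*.<digits>.pdb
# (hypothetical helper name; its variables are global, so call it via $(...))
max_pdb_index() {
    max=
    for f in "$1"/*.pdb; do
        n=${f%.pdb}                     # drop the extension
        n=${n##*.}                      # keep the field after the last dot
        case $n in
            ''|*[!0-9]*) continue ;;    # not a purely numeric field
        esac
        # strip padding zeros (keeping a lone 0) so nothing is read as octal
        while [ "${#n}" -gt 1 ] && [ "${n#0}" != "$n" ]; do
            n=${n#0}
        done
        if [ -z "$max" ] || [ "$n" -gt "$max" ]; then
            max=$n
        fi
    done
    if [ -n "$max" ]; then printf '%s\n' "$max"; fi
}
```

In a script this would be used as largest=$(max_pdb_index .), which leaves largest empty when the directory holds no matching file.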
I have a set of data files across a number of directories with format
ls lcp01/output/
> dst000.dat dst001.dat ... dst075.dat nn000.dat nn001.dat ... nn036.dat aa000.dat aa001.dat ... aa040.dat
That is to say, there are a set of directories lcp01 through lcp25 with a collection of different data files in their output folders. I want to know what the highest number dstXXX.dat file is in each directory (in the example shown the result would be 75).
I wrote a script which achieves this, but I'm not satisfied with the final step which feels a bit hacky:
#!/bin/bash
for i in `seq -f "%02g" 1 25`; #specify dir extensions 1 through 25
do
echo " "
echo $i
names=($(ls lcp$i/output | grep dst )) #dir containing dst files
NUMS=()
for j in "${names[@]}";
do
temp="$(echo $j | tr -dc '0-9' && printf " ")" # record suffixes for each dst file
NUMS+=("$((10#$temp))") #force base 10 interpretation of dst suffixes
done
numList="$(echo "${NUMS[*]}" | sort -nr | head -n1)"
echo ${numList:(-3)} #print out the last 3 characters of the sorted list - the largest file suffix
done
The final two steps organise the list of output indices, then I show the last 3 characters of that list which will be my largest file number (providing the file numbers are smaller than 100).
Is there a cleaner way of doing this? Ideally I would like more control over the output format, but mainly it's the step of reading the last 3 characters out. I would like to be able to just output the largest number, which should be the last element of the list but I cannot figure out how.
You could do something like the following:
for d in lcp[0-9][0-9]; do find "$d" -name 'dst*.dat' -print | sort -u | tail -n1; done
The command above only works if every number has the same count of digits (dst001.dat through dst999.dat), because the names are sorted as strings. If that is not the case:
for d in lcp[0-9][0-9]; do echo -n "$d: "; find "$d" -name 'dst*.dat' -print | grep -o '[0-9]*\.dat' | sort -n | tail -n1; done
Using filename expansion:
for d in lcp*/output; do
    files=( "$d"/dst*.dat )
    file=${files[-1]}          # last match in glob (lexical) order
    [[ -e $file ]] || continue
    file=${file##*/dst}        # strip the directory and the dst prefix
    echo "${file%.dat}"
done
Or, with the extglob option, restrict the patterns to digits:
shopt -s extglob
... lcp*([0-9])/output
... $d/dst*([0-9]).dat
...
file=${file##*/dst*(0)}
...
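If the files are not guaranteed to share one digit width, a numeric sort on the extracted suffixes avoids relying on glob order. The following is a sketch only: highest_dst is a hypothetical helper name, and it assumes the files are named dst followed by digits and .dat, as in the question.

```shell
# highest_dst DIR: print the largest NNN among DIR/dstNNN.dat, without padding
# (hypothetical helper name)
highest_dst() {
    for f in "$1"/dst*.dat; do
        [ -e "$f" ] || continue               # the glob matched nothing
        n=${f##*/dst}                         # strip the directory and "dst"
        n=${n%.dat}                           # strip the extension
        case $n in
            ''|*[!0-9]*) continue ;;          # ignore non-numeric suffixes
        esac
        printf '%s\n' "$n"
    done | sort -n | tail -n 1 | sed 's/^0*\([0-9]\)/\1/'
}

# Possible use, one line per directory:
#   for d in lcp[0-9][0-9]/output; do
#       printf '%s: %s\n' "$d" "$(highest_dst "$d")"
#   done
```

The final sed removes the zero padding but keeps a lone 0, so dst000.dat on its own reports 0 rather than an empty string.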
I'm writing a hobby script that pulls the numbers out of 4 files holding temperature readings from 4 industrial freezers. It generates the general readouts I wanted. However, when I try to generate a SUM of the temperature readings, I get the printout below in the file. My goal is to print only the final SUM, not the individual characters printed out vertically.
Any help would be greatly appreciated; here's my code:
grep -o "[0.00-9.99]" "/location/$value-1.txt" | awk '{ SUM += $1; print $1} END { print SUM }' >> "/location/$value-1.txt"
here is what I am getting in return
Morningtemp:17.28
Noontemp:17.01
Lowtemp:17.00 Hightemp:18.72
1
7
.
2
8
1
7
.
0
1
1
7
.
0
0
1
8
.
7
2
53
It does generate the SUM, but I don't need the already listed numbers, just the SUM total.
Why not stick with AWK completely? Code:
$ cat > summer.awk
{
while(match($0,/[0-9]+\.[0-9]+/)) # while matches on record
{
sum+=substr($0, RSTART, RLENGTH) # extract matches and sum them
$0=substr($0, RSTART + RLENGTH) # reset to start after previous match
count++ # count matches
}
}
END {
print sum"/"count"="sum/count # print stuff
}
Data:
$ cat > data.txt
Morningtemp:17.28
Noontemp:17.01
Lowtemp:17.00 Hightemp:18.72
Run:
$ awk -f summer.awk data.txt
70.01/4=17.5025
It might work in the winter too.
The regex in grep -o "[0.00-9.99]" "/location/$value-1.txt" is equivalent to [0-9.], but you're probably looking for numbers in the range 0.00 to 9.99. For that, you need a different regex:
grep -o "[0-9]\.[0-9][0-9]" "/location/$value-1.txt"
That looks for a digit, a dot, and two more digits. It was almost tempting to use [.] in place of \.; it would also work. A plain . would not; that would select entries such as 0X87.
Note that the pattern shown ([0-9]\.[0-9][0-9]) will match 192.16.24.231 twice (2.16 and 4.23). If that's not what you want, you have to be a lot more precise. OTOH, it may not matter in the slightest for the actual data you have. If you'd want it to match 192.16 and 24.231 (or .24 and .231), you have to refine your regex.
Your command structure:
grep … filename | awk '…' >> filename
is living dangerously. In the example, it is 'OK' (but there's a huge grimace on my face as I type 'OK') because the awk script doesn't write anything to the file until grep has read it all. But change the >> to > and you have an empty input, or have awk write material before the grep is complete and suddenly it gets very tricky to determine what happens (it depends, in part, on what awk writes to the end of the file).
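Putting these points together, here is a sketch that prints only the total. It assumes readings look like 17.28, with one or more integer digits and exactly two decimals, so the pattern is slightly wider than the single-digit one shown above. The name sum_temps is hypothetical.

```shell
# sum_temps FILE: print only the total of all N.NN-style readings in FILE
# (hypothetical helper name; the pattern is an assumption about the data)
sum_temps() {
    grep -o '[0-9][0-9]*\.[0-9][0-9]' "$1" |
        awk '{ sum += $1 } END { printf "%.2f\n", sum }'
}

# Possible use: capture the total first, then append it, so the file is
# fully read before anything is written to it:
#   total=$(sum_temps "/location/$value-1.txt")
#   echo "Sum:$total" >> "/location/$value-1.txt"
```

Dropping the print $1 from the main awk action is what suppresses the per-match output; only the END block writes anything.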
I have a file with a bunch of paths that look like so:
7 /usr/file1564
7 /usr/file2212
6 /usr/file3542
I am trying to use sort to pull out and print the path(s) with the most occurrences. Here is what I have so far:
cat temp| sort | uniq -c | sort -rk1 > temp
I am unsure how to only print the highest occurrences. I also want my output to be printed like this:
7 1564
7 2212
7 being the total number of occurrences and the other numbers being the file numbers at the end of the name. I am rather new to bash scripting so any help would be greatly appreciated!
To emit only the first line of output (the one with the highest number, provided the sort immediately prior is a reverse numeric one, i.e. sort -rn rather than sort -rk1), pipe through head -n1.
To remove all content which is not either a number or whitespace, pipe through tr -cd '0-9[:space:]'.
To filter for only the values with the highest number, allowing there to be more than one:
{
read -r firstnum name && printf '%s\t%s\n' "$firstnum" "$name"
while read -r num name; do
[[ $num = $firstnum ]] || break
printf '%s\t%s\n' "$num" "$name"
done
} < temp
If you want to avoid sort and you are allowed to use awk, then you can do this:
awk '{
if($1>maxcnt) {s=$1" "substr($2,10,4); maxcnt=$1} else
if($1==maxcnt) {s=s "\n"$1" "substr($2,10,4)}} END{print s}' \
temp
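The substr($2,10,4) call above depends on the exact length of the /usr/file prefix and on four-digit numbers. As a sketch of a variant without those fixed offsets, the following keeps whatever digits end the path. The name top_counts is hypothetical, and the input is assumed to be "count path" lines as in the question.

```shell
# top_counts FILE: for "count path" lines, print "count number" for every
# path tied for the highest count (hypothetical helper name)
top_counts() {
    awk '
        { n = $2; sub(/^.*[^0-9]/, "", n) }   # keep only the trailing digits of the path
        $1 + 0 > max  { max = $1 + 0; out = $1 " " n; next }
        $1 + 0 == max { out = out "\n" $1 " " n }
        END { if (out != "") print out }
    ' "$1"
}
```

Because awk splits on runs of whitespace, this also accepts the indented counts that uniq -c produces.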
For example, I run
sh mycode Manu gg44
and I need to get a file named Manu
with this content:
gg44
192.168.1.2. (second line; I explain this number below)
(The directory DIR=/h/Manu/HOME/hosts already contains a file Alex:
cat Alex
ff55
198.162.1.1. (second line))
So mycode creates a file named Manu with gg44 as the first line and a generated IP on the second line.
BUT to generate the IP it has to compare against the IP in the Alex file, so the second line of Manu has to be 198.162.1.2. If there is more than one file in the directory, we have to check the second lines of all the files and generate the new IP from them.
[CODE]
DIR=/h/Manu/HOME/hosts #this is a directory where i have my files (structure of the files above)
for j in $1 $2 #$1 is Manu; $2 is gg44
do
if [ -d $DIR ] #checking if directory exists (it exists already)
then #if it exists
for i in $* # for every file in this directory do operation
do
sort /h/ManuHOME/hosts/* | tail -2 | head -1 # get second line of every file
IFS="." read A B C D # divide number in second line into 4 parts (our number 192.168.1.1. for example)
if [ "$D" != 255 ] #compare D (which is 1 in our example: if its less than 255)
then
D=` expr $D + 1 ` #then increment it by 1
else
C=` expr $C + 1 ` #otherwise increment C and make D=0
D=0
fi
echo "$2 "\n" $A.$B.$C.$D." >/h/Manu/HOME/hosts/$1
done done #get $2 (which is gg44 in example as a first line and get ABCD as a second line)[/CODE]
As a result it creates a file named Manu with the correct first line, but the second line is totally wrong: it gives me ...1.
There is also an error message:
sort: open failed: /h/u15/c2/00/c2rsaldi/HOME/hosts/yu: No such file or directory
yu n ...1.
#!/bin/bash
dir=/h/Manu/HOME/hosts
filename=$dir/$1
firstline=$2
# find the max IP address from all current files:
maxIP=$(awk 'FNR==2' "$dir"/* | cut -d. -f4 | sort -nr | head -1)
ip=198.162.1.$(( maxIP + 1 ))
cat > "$filename" <<END
$firstline
$ip
END
I'll leave it up to you to decide what to do when you get more than 255 files...
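One possible answer to that last remark is to carry into the third octet, which is what the original script attempted with expr. This is a sketch only: next_ip is a hypothetical helper, and whether .0 and .255 are usable host addresses depends on the network, which this code does not check.

```shell
# next_ip A.B.C.D: print the following address, carrying from D into C
# (and from C into B) once a field passes 255 (hypothetical helper name)
next_ip() {
    echo "$1" | awk -F. '{
        a = $1; b = $2; c = $3; d = $4 + 1
        if (d > 255) { d = 0; c++ }
        if (c > 255) { c = 0; b++ }
        printf "%d.%d.%d.%d\n", a, b, c, d
    }'
}
```

For example, next_ip 198.162.1.1 prints 198.162.1.2, and next_ip 198.162.1.255 prints 198.162.2.0. A trailing dot on the input, as in the files shown in the question, is ignored because only the first four fields are read.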
What's an easy way to read a random line from a file in a shell script?
You can use shuf:
shuf -n 1 $FILE
There is also a utility called rl, which does exactly what you want. In Debian it's in the randomize-lines package, though it is not available in all distros. On its home page it actually recommends the use of shuf instead (which didn't exist when it was created, I believe). shuf is part of the GNU coreutils, rl is not.
rl -c 1 $FILE
Another alternative:
head -$((${RANDOM} % `wc -l < file` + 1)) file | tail -1
sort --random-sort $FILE | head -n 1
(I like the shuf approach above even better though - I didn't even know that existed and I would have never found that tool on my own)
This is simple.
cat file.txt | shuf -n 1
Granted this is just a tad slower than the "shuf -n 1 file.txt" on its own.
perlfaq5: How do I select a random line from a file? Here's a reservoir-sampling algorithm from the Camel Book:
perl -e 'srand; rand($.) < 1 && ($line = $_) while <>; print $line;' file
This has a significant advantage in space over reading the whole file in. You can find a proof of this method in The Art of Computer Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.
using a bash script:
#!/bin/bash
# replace with file to read
FILE=tmp.txt
# count number of lines
NUM=$(wc -l < "${FILE}")
# generate random number in range 1-NUM
X=$(( RANDOM % NUM + 1 ))
# extract X-th line
sed -n ${X}p ${FILE}
Single bash line:
sed -n $((1+$RANDOM%`wc -l test.txt | cut -f 1 -d ' '`))p test.txt
Slight problem: duplicate filename.
Here's a simple Python script that will do the job:
import random, sys
lines = open(sys.argv[1]).readlines()
print(lines[random.randrange(len(lines))], end="")  # the line already ends with a newline
Usage:
python randline.py file_to_get_random_line_from
Another way using 'awk'
awk NR==$((${RANDOM} % `wc -l < file.name` + 1)) file.name
A solution that also works on MacOSX, and should also work on Linux(?):
N=5
awk 'NR==FNR {lineN[$1]; next}(FNR in lineN)' <(jot -r $N 1 $(wc -l < $file)) $file
Where:
N is the number of random lines you want
NR==FNR {lineN[$1]; next}(FNR in lineN) file1 file2
--> save line numbers written in file1 and then print corresponding line in file2
jot -r $N 1 $(wc -l < $file) --> draw N numbers randomly (-r) in range (1, number_of_line_in_file) with jot. The process substitution <() will make it look like a file for the interpreter, so file1 in previous example.
#!/bin/bash
IFS=$'\n' wordsArray=($(<$1))
numWords=${#wordsArray[@]}
sizeOfNumWords=${#numWords}
while [ True ]
do
for ((i=0; i<$sizeOfNumWords; i++))
do
let ranNumArray[$i]=$(( ( $RANDOM % 10 ) + 1 ))-1
ranNumStr="$ranNumStr${ranNumArray[$i]}"
done
if [ $ranNumStr -lt $numWords ]
then
break
fi
ranNumStr=""
done
noLeadZeroStr=$((10#$ranNumStr))
echo ${wordsArray[$noLeadZeroStr]}
Here is what I discovered, since my Mac OS doesn't support all the easy answers. I used the jot command to generate a number, since the $RANDOM-based solutions did not seem very random in my tests. When testing my solution, I saw a wide variance in the output.
RANDOM1=`jot -r 1 1 235886`
#range of jot ( 1 235886 ) found from earlier wc -w /usr/share/dict/web2
echo $RANDOM1
head -n $RANDOM1 /usr/share/dict/web2 | tail -n 1
The echo of the variable is to get a visual of the generated random number.
Using only vanilla sed and awk, and without using $RANDOM, a simple, space-efficient and reasonably fast "one-liner" for selecting a single line pseudo-randomly from a file named FILENAME is as follows:
sed -n $(awk 'END {srand(); r=rand()*NR; if (r<NR) {sub(/\..*/,"",r); r++;}; print r}' FILENAME)p FILENAME
(This works even if FILENAME is empty, in which case no line is emitted.)
One possible advantage of this approach is that it only calls rand() once.
As pointed out by @AdamKatz in the comments, another possibility would be to call rand() for each line:
awk 'rand() * NR < 1 { line = $0 } END { print line }' FILENAME
(A simple proof of correctness can be given based on induction.)
Caveat about rand()
"In most awk implementations, including gawk, rand() starts generating numbers from the same starting number, or seed, each time you run awk."
-- https://www.gnu.org/software/gawk/manual/html_node/Numeric-Functions.html
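Given that caveat, the per-line version needs an explicit srand() call to vary between runs. The following sketch wraps it in a function with an optional seed, which also makes a run reproducible. The name random_line and the seed argument are illustrative assumptions, not part of the answers above.

```shell
# random_line FILE [SEED]: pick one line by reservoir sampling
# (hypothetical helper name; SEED is optional and makes the pick repeatable)
random_line() {
    awk -v seed="${2:-}" '
        BEGIN { if (seed != "") srand(seed); else srand() }
        rand() * NR < 1 { line = $0 }     # line NR replaces the pick with probability 1/NR
        END { if (NR) print line }
    ' "$1"
}
```

Note that srand() without an argument seeds from the time of day, so two calls within the same second can return the same line. Passing a seed from another source avoids that.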