listing of files in a directory - bash

I need to list all the files in a directory like this:
/home/rk/a.root /home/rk/b.root /home/rk/c.root
For that I am using:
$ ls | gawk 'BEGIN{ORS=" "}{print "/home/rk/"$1}'
But the directory contains 2000 files, and I need to list the first 100 on one line, the next 100 on the next line, and so on.
Also, before each line I need to add "hadd result.root".

Try this:
find /home/rk -type f |xargs -n100

Use printf instead of print to prevent automatically adding newlines. Then declare a counter variable in the BEGIN{ } section, increment it for every file and if that (counter % 100) == 0 print a newline and/or the per-line requisite.
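The counter idea described above could look roughly like the sketch below. The function name batch_lines and its two arguments are hypothetical; NR serves as the counter, and the prefix "hadd result.root" is taken from the question. It parses ls output like the original command does, so it assumes file names without newlines.

```shell
# Hypothetical helper: print one "hadd result.root ..." line per batch of files.
# $1 = directory, $2 = batch size (defaults to 100).
batch_lines() {
  dir=$1
  size=${2:-100}
  ls "$dir" | awk -v dir="$dir" -v n="$size" '
    (NR - 1) % n == 0 {              # first file of a new batch
      if (NR > 1) printf "\n"        # terminate the previous line
      printf "hadd result.root"      # per-line prefix from the question
    }
    { printf " %s/%s", dir, $0 }     # append the file name without a newline
    END { if (NR) printf "\n" }      # terminate the last line
  '
}
```

Calling batch_lines /home/rk 100 would then print 20 lines for 2000 files.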

Related

Extracting Number From a File

I'm trying to write a script (with bash) that looks for a word (for example "SOME(X) WORD:") and prints the number that follows it on the same line, which has a "-" in front. To clarify, an example line that I'm looking for in a file is:
SOME(X) WORD: -1.0475392439 ANOTHER.W= -0.0590214433
I want to extract the number after "SOME(X) WORD:", so "-1.0475392439" for this example. I have a similar script to this which extracts the number from the following line (both lines are from the same input file)
A-DESIRED RESULT W( WORD) = -9.68765465413
And the script for this is:
local output="$1"
local ext="log"
local word="W( WORD)"
cd $dir
find "${output}" -type f -name "*.${ext}" -exec awk -v ptn="${word}" 'index($0,ptn) {print $NF,FILENAME}' {} +
But when I change the local word variable from "W( WORD)" to "SOME(X) WORD", it captures the "-0.0590214433" instead of "-1.0475392439" meaning it takes the last number in line. How can I find a solution to this? Thanks in advance!
As you have seen, print $NF outputs the last field of the line. Please modify the find line as:
find "${output}" -type f -name "*.${ext}" -exec awk -v ptn="${word}" 'index($0, ptn) {if (match($0, /-[0-9]+\.[0-9]+/)) print substr($0, RSTART, RLENGTH), FILENAME}' {} +
Then it will output the first number in the line.
Please note it assumes the number always starts with the - sign.
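If the number does not always start with a minus sign, another option is to take whatever token directly follows the search word. This is only a sketch: the function name value_after is hypothetical, and it assumes the value is separated from the word by spaces, tabs, ":" or "=".

```shell
# Hypothetical helper: print the first token after the literal string $1
# in every matching line of file $2, followed by the file name.
value_after() {
  awk -v ptn="$1" '
    (pos = index($0, ptn)) {                  # literal (non-regex) search
      rest = substr($0, pos + length(ptn))    # text after the search word
      sub(/^[ \t:=]+/, "", rest)              # drop separators such as ":" or "="
      split(rest, tok, /[ \t]+/)              # tok[1] is the value we want
      print tok[1], FILENAME
    }' "$2"
}
```

With the example lines from the question, value_after 'SOME(X) WORD' file.log should pick the first number rather than the last field.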

How Can I Use Sort or another bash cmd To Get 1 line from all the lines if 1st 2nd and 3rd Field are The same

I have a file named file.txt
$ cat file.txt
1./abc/cde/go/ftg133333.jpg
2./abc/cde/go/ftg24555.jpg
3./abc/cde/go/ftg133333.gif
4./abt/cte/come/ftg24555.jpg
5./abc/cde/go/ftg133333.jpg
6./abc/cde/go/ftg24555.pdf
MY GOAL: To get only one line from the lines whose first, second and third PATH are the same and which have the same file EXTENSION.
Note that each PATH is separated by a forward slash "/". E.g. in the first line of the list, the first PATH is abc, the second PATH is cde and the third PATH is go.
The file EXTENSION (.jpg, .gif, .pdf, ...) is always at the end of the line.
HERE IS WHAT I TRIED
sort -u -t '/' -k1 -k2 -k3
My thoughts
Using / as a delimiter gives me 4 fields in each line. Sorting them with "-u" will remove all but 1 line with a unique first, second and third field/PATH. But obviously, I didn't take the EXTENSION (jpg, pdf, gif) into account in this case.
MY QUESTION
I need a way to grep only 1 of the lines if the first, second and third fields are the same and the lines have the same EXTENSION, using "/" as a delimiter to divide each line into fields. I want to output the result to another file, say file2.txt.
In file2.txt, how do I add a word, say "KALI", before the extension in each line, so that it looks something like /abc/cde/go/ftg133333KALI.jpg, using line 1 of file.txt above as an example?
Desired Output
/abc/cde/go/ftg133333KALI.jpg
/abt/cte/come/ftg24555KALI.jpg
/abc/cde/go/ftg133333KALI.gif
/abc/cde/go/ftg24555KALI.pdf
COMMENT
Lines 1, 2 and 5 have the same 1st, 2nd and 3rd field and the same file extension ".jpg", so only line 1 should be in the output.
Line 3 is in the output even though it has the same 1st, 2nd and 3rd field as lines 1, 2 and 5, because its extension is different (".gif").
Line 4 has a different 1st, 2nd and 3rd field, hence it is in the output.
Line 6 is in the output even though it has the same 1st, 2nd and 3rd field as lines 1, 2 and 5, because its extension is different (".pdf").
$ awk '{ # using awk
n=split($0,a,/\//) # split by / to get all path components
m=split(a[n],b,".") # split last by . to get the extension
}
m>1 && !seen[a[2],a[3],a[4],b[m]]++ { # if ext exists and is unique with 3 1st dirs
for(i=2;i<=n;i++) # loop component parts and print
printf "/%s%s",a[i],(i==n?ORS:"")
}' file
Output:
/abc/cde/go/ftg133333.jpg
/abc/cde/go/ftg133333.gif
/abt/cte/come/ftg24555.jpg
/abc/cde/go/ftg24555.pdf
I split by / separately from .s in case there are .s in dir names.
Missed the KALI part:
$ awk '{
n=split($0,a,/\//)
m=split(a[n],b,".")
}
m>1&&!seen[a[2],a[3],a[4],b[m]]++ {
for(i=2;i<n;i++)
printf "/%s",a[i]
for(i=1;i<=m;i++)
printf "%s%s",(i==1?"/":(i==m?"KALI.":".")),b[i]
print ""
}' file
Output:
/abc/cde/go/ftg133333KALI.jpg
/abc/cde/go/ftg133333KALI.gif
/abt/cte/come/ftg24555KALI.jpg
/abc/cde/go/ftg24555KALI.pdf
Using awk:
$ awk -F/ '{ split($5, ext, "\\.")
if (!(($2,$3,$4,ext[2]) in files)) files[$2,$3,$4,ext[2]]=$0
}
END { for (f in files) {
sub("\\.", "KALI.", files[f])
print files[f]
}}' input.txt
/abt/cte/come/ftg24555KALI.jpg
/abc/cde/go/ftg133333KALI.gif
/abc/cde/go/ftg24555KALI.pdf
/abc/cde/go/ftg133333KALI.jpg
Another awk:
$ awk -F'[./]' '!a[$2,$3,$4,$NF]++' file
/abc/cde/go/ftg133333.jpg
/abc/cde/go/ftg133333.gif
/abt/cte/come/ftg24555.jpg
/abc/cde/go/ftg24555.pdf
assumes . doesn't exist in directory names (not necessarily true in general).
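A possible way to keep the short form while tolerating dots in directory names, and to add the KALI step at the same time, is to split on / only and derive the extension from the last path component. The function name dedupe_and_tag is hypothetical, and the sketch assumes the three directories are fields 2 to 4, as in the answers above.

```shell
# Hypothetical helper: keep the first line per (dir1, dir2, dir3, extension)
# and insert "KALI" before the extension. $1 = input file.
dedupe_and_tag() {
  awk -F/ '
    $NF ~ /[.]/ {                          # only file names that have an extension
      ext = $NF
      sub(/.*[.]/, "", ext)                # keep the text after the last dot
      if (!seen[$2, $3, $4, ext]++) {      # first time this combination appears
        line = $0
        sub(/[.][^.]*$/, "KALI&", line)    # "&" re-inserts the matched ".ext"
        print line
      }
    }' "$1"
}
```

Usage would be dedupe_and_tag file.txt > file2.txt.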

Find and delete a word followed by the next (n) lines in a file [duplicate]

This question already has answers here:
How to pick multiple fasta sequences from a genes list
(4 answers)
Closed 2 years ago.
I have one file called unclassified. A sample of it looks like this (each is on a new line):
OTU3
OTU9
OTU10
OTU1
OTU6
OTU4
I have another file called OTUcounts. A sample of it looks like this
>OTU4
TACGTACGTAGCTAGTCGATCGTAGTGCTCGTCATCGTGCTGCTGCTAGCTAGCTAGCTCGTCGTACGTACGTACGTCGTAGTACGCTGCATGCATGCATCGTACGTACGTACGCTAGTCGACTGACTAGCTGACTAGCTAGCTAGCTAGCTAGCTACGTACGATCGTACGTACGTACGTAGCTAGCTACGTAGCTAGCTAGTAGCTAGCTACGTACGTCGTCGTGTCGTCGTTTGT
>OTU6
AACGGCTAGCTAGCTAGCTGCTCTACGTCGATCATCGATGTCAGACTGCGGCAGACTCGTACGTACGTCGTCAGTCGCATCATCAGTCAGTAGACTGCTAGCTCAGATCCGCATCGATCAGTCGACTGCATGCATCAGTCAGCTAGCATCAGTCAGTACGCTAGACTAGTAAGGGGGGGGGCGATGATCGTCGTGCTTATTAGTAGTTTGACCGCGGCGCGCGCGAGACTAGTCGTA
How would I search the OTUcounts file and delete the OTUs listed in the unclassified file, to ultimately end up with a new file that looks like OTUcounts but with the unclassifieds removed?
I have started to use:
grep -x -f unclassified OTUcounts > newOTUcounts
but I know it needs more added - I am fairly new to this.
Any ideas?
You could use awk and store the OTU fields of unclassified in an array. When OTUcounts is read, test if the first field is present in the array. If true, then set a flag and skip the next lines until the next OTU is found. Then reset the flag.
awk '
NR==FNR{a[$1]; next}
$1 in a{skip=1; next}
skip{if (/^OTU/){skip=0; print}next} 1
' unclassified OTUcounts > newOTUcounts
Explanation:
awk '
NR==FNR{ # if this is the first file...
a[$1] # save the first field in array `a` as array index
next # continue with the next line
}
$1 in a{ # if the first field is present in array `a`
skip=1 # set a flag to skip the next lines
next # continue with the next line
}
skip{ # if the flag is set
if (/^OTU/){ # if this is the next OTU
skip=0 # reset the flag
print # print the current line
}
next # continue with the next line
}
1 # print the current line
' unclassified OTUcounts > newOTUcounts
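If the header lines in OTUcounts start with ">" as in the sample shown in the question, the ID has to be compared without that character. A minimal sketch under that assumption (the function name drop_unclassified is hypothetical, and the list file is assumed to be non-empty so that NR==FNR only holds for it):

```shell
# Hypothetical helper: $1 = list of IDs to remove, $2 = file whose records
# start with a ">ID" header line followed by one or more sequence lines.
drop_unclassified() {
  awk '
    NR == FNR { drop[$1]; next }               # first file: remember IDs to remove
    /^>/ { skip = (substr($1, 2) in drop) }    # header: decide for the whole record
    !skip                                      # print lines of records that are kept
  ' "$1" "$2"
}
```

Usage would be drop_unclassified unclassified OTUcounts > newOTUcounts. Because the comparison uses array keys, OTU1 in the list does not remove OTU10 or OTU11.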
Try using the -v option of grep:
grep -v -f unclassified OTUcounts > newOTUcounts
We can indeed do it with only grep, by first generating a list of the OTUs to be kept and then using the --after-context option (-A2) to print each matching line plus the two lines after it; finally we have to remove the line containing a group separator (--) which grep places between non-adjacent groups of matches.
grep OTU OTUcounts|grep -vwfunclassified|grep -xf- -A2 OTUcounts|grep -ve-- >newOTUcounts
An alternative approach that uses GNU sed (and a shell like bash or zsh that understands <(command) process substitution):
gsed -f <(while read otu; do echo "/^>${otu}\$/,+2d"; done < unclassified) OTUcounts > newOTUcounts
It turns each line of the unclassified file into a sed command that deletes any case of that OTU and the next two lines - OTU3, for example, is transformed into /^>OTU3$/,+2d
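The same list of sed commands can also be generated without a shell loop, which makes it easy to inspect the script before running it. This is a sketch only: the function name make_delete_script is hypothetical, and the +2 range is copied from the answer above.

```shell
# Hypothetical helper: turn every ID in file $1 into a GNU sed delete command,
# e.g. OTU3 becomes /^>OTU3$/,+2d
make_delete_script() {
  sed 's|.*|/^>&$/,+2d|' "$1"    # "&" stands for the whole input line
}
```

The output could then be saved to a file, say delete.sed, and applied with gsed -f delete.sed OTUcounts > newOTUcounts.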

Grep list (file) from another file

I'm new to bash and am trying to extract the lines that match a list of patterns from a file:
File1.txt
ABC
BDF
GHJ
base.csv (tried comma separated and tab delimited)
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
line 3 .."himk,n,hn.ujj., BDF"
etc
The desired output is something like:
ABC
line 1..
line 2..(whole lines)
BDF
line 3..
and so on for each pattern from file 1.
The code I tried was:
#!/bin/bash
for i in *.txt -# cycle through all files containing pattern lists
do
for q in "$i"; # # cycle through list
do
echo $q >>output.${i};
grep -f "${q}" base.csv >>output.${i};
echo "\n";
done
done
But the output is only the filename and then a list of lines without pattern names, e.g.
File1.txt
line 1...
line 2...
line 3..
so I don't know which pattern each line belongs to and have to check and assign them manually. Can you please point out my errors? Thanks!
grep can process multiple files in one go, and then has the attractive added bonus of indicating which file it found a match in.
grep -f File1.txt base.csv >output.txt
It's not clear what you hope for the inner loop to do; it will just loop over a single token at a time, so it's not really a loop at all.
If you want the output to be grouped per pattern, here's a loop which looks for one pattern at a time:
while read -r pat; do
echo "$pat"
grep "$pat" *.txt
done <File1.txt >output.txt
But the most efficient way to tackle this is to write a simple Awk script which processes all the input files at once, and groups the matches before printing them.
An additional concern is anchoring. grep "ABC" will find a match in 123DEABCXYZ; is this something you want to avoid? You can improve the regex, or, again, turn to Awk which gives you more control over where exactly to look for a match in a structured line.
awk '# Read patterns into memory
NR==FNR { a[++i] = $1; next }
# Loop across patterns
{ for(j=1; j<=i; ++j)
if($0 ~ a[j]) {
print FILENAME ":" FNR ":" $0 >> ("output." a[j])
next }
}' File1.txt base.csv
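For the anchoring concern mentioned above, one option that stays with grep is to match patterns as fixed strings and as whole words. A minimal sketch, assuming one pattern per line in the list; the function name grouped_matches is hypothetical:

```shell
# Hypothetical helper: $1 = pattern list, $2 = data file.
# Prints each pattern followed by the lines that contain it as a whole word.
grouped_matches() {
  while IFS= read -r pat; do
    [ -n "$pat" ] || continue              # skip empty lines in the list
    printf '%s\n' "$pat"                   # header: the pattern itself
    grep -wF -e "$pat" "$2" || true        # no match is not an error here
  done < "$1"
}
```

With -w, ABC matches in "ju,ABC" but not in 123DEABCXYZ; -F keeps characters such as "." in a pattern from being treated as regex operators.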
You're not actually reading the files, you're just handling the filenames. Try this:
#!/bin/bash
for i in *.txt # cycle through all files containing pattern lists
do
while read -r q # read file line by line
do
echo "$q" >>"output.${i}"
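grep -F -e "${q}" base.csv >>"output.${i}" # -e takes the pattern itself; -f would expect a file name
echo >>"output.${i}" # blank line after each pattern block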
done < "${i}"
done
Here is one that splits each line of file2 (the data file, base.csv here) on commas, with quotes and spaces stripped off, and stores in an array (word[]) a comma-separated list of the record names (line 1 etc.) in which each word occurs:
awk '
NR==FNR {
n=split($0,tmp,/[" ]*(,|$)[" ]*/) # split words
for(i=2;i<=n;i++) # after first
if(tmp[i]!="") # non-empties
word[tmp[i]]=word[tmp[i]] (word[tmp[i]]==""?"":",") tmp[1] # hash rownames
record[tmp[1]]=$0 # store records
next
}
($1 in word) { # word found
n=split(word[$1],tmp,",") # get record names
print $1 ":" # output word
for(i=1;i<=n;i++) # and records
print record[tmp[i]]
}' file2 file1
Output:
ABC:
line 1,,,,"hfhf,ferf,ju,ABC"
line 2 ,,,,,"ewy,trggt,gtg,ABC,RFR"
BDF:
line 3 .."himk,n,hn.ujj., BDF"
Thank you for your kind help, my friends.
I tried both variants above but kept getting various errors ("do" expected) or misbehavior (I got the names of the pattern blocks, e.g. ABC, BDF, but no lines).
I gave up for a while and then eventually tried another way.
Since the basic goal was to cycle through the pattern list files, search for the patterns in a huge file and write out specific columns from the lines found, I simply wrote:
for i in *txt # cycle through files with patterns
do
grep -F -f "$i" bigfile.csv >> "${i}.out1" # grep all patterns from the current file
cut -f 2,3,4,7 "${i}.out1" >> "${i}.out2" # cut the columns of interest and write them to another file
done
I'm aware that this code could be improved with some fancy pipeline features, but it works perfectly as is; I hope it'll help somebody in a similar situation. You can easily add some echoes to write out the pattern list names, as I initially requested.

bash script reading lines in every file copying specific values to newfile

I want to write a script helping me to do my work.
Problem: I have many data files in one directory, and from every file I need specific values copied into a new file.
The data files can look like this:
Name abc $desV0
Start MJD56669 opCMS v2
End MJD56670 opCMS v2
...
valueX 0.0456 RV_gB
...
valueY 12063.23434 RV_gA
...
The script should copy valueX and the value that follows it, and also valueY and the value that follows it, into a new file on one line, and then add the name of the source data file to that line. Additionally, the value of valueY should contain only the part before the dot.
The result should look like this:
valueX 0.0456 valueY 12063 name_of_sourcefile
This is what I have so far:
for file in $(find -maxdepth 0 -type f -name *.wt); do
for line in $(cat $file | grep -F vb); do
cp $line >> file_done
done
done
But that doesn't work at all. I also have no idea how to get the data in ONE line in the newfile.
Can anyone help me?
I think you can simplify your script a lot using awk:
awk '/valueX/{x=$2}/valueY/{print "valueX",x,"valueY",$2,FILENAME}' *.wt > file_done
This goes through every .wt file in the current directory. When "valueX" is matched, the value is saved to the variable x. When "valueY" is matched, the line is printed.
This assumes that the line containing "valueX" always comes before the one containing "valueY". If that isn't a valid assumption, the script can easily be changed.
To print only the integer part of "valueY", you can use printf instead of print:
awk '/valueX/{x=$2}/valueY/{printf "valueX %s valueY %d %s\n",x,$2,FILENAME}' *.wt > file_done
%d is the format specifier for an integer.
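If valueY can appear before valueX in a file, one way to change the script is to collect both values and print them once per file. This is a sketch under the assumption that the labels are the first field of their lines; the names collect_values and emit are hypothetical.

```shell
# Hypothetical helper: print "valueX <x> valueY <int(y)> <file>" once per
# input file, regardless of the order in which the two lines appear.
collect_values() {
  awk '
    function emit() {                        # print what was collected, then reset
      if (have) printf "valueX %s valueY %d %s\n", x, y, name
      have = 0; x = ""; y = ""
    }
    FNR == 1 { emit(); name = FILENAME }     # new file: flush the previous one
    $1 == "valueX" { x = $2; have = 1 }
    $1 == "valueY" { y = $2; have = 1 }      # %d above truncates at the dot
    END { emit() }                           # flush the last file
  ' "$@"
}
```

Usage would be collect_values *.wt > file_done. A file that lacks valueY would print 0 for it, since %d converts an empty string to 0.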
If your requirements are more complex and you need to use find, you should use -exec rather than looping through the results, to avoid problems with awkward file names:
find . -maxdepth 1 -iname "5*.par" ! -iname "*_*" -exec \
awk '/valueX/{x=$2}/valueY/{printf "valueX %s valueY %d %s\n",x,$2,FILENAME}' '{}' \; > file_done
Don't fight. I'm really thankful for your help and especially for the fast answers.
This is my final solution, I think:
#!/bin/bash
for file in $(find * -maxdepth 1 -iname "5*.par" ! -iname "*_*"); do
awk '/TASC/{x=$2}/START/{printf "TASC %s MJD %d %s\n",x,$2, FILENAME}' "$file"
done > mjd_vs_tasc
Thanks again to you guys.
Try something like the following (it assumes each file contains exactly one valueX line followed by one valueY line):
grep -EH "valueX|valueY" *.wt | awk -F':| +' 'NR%2 {x = $2 " " $3; next} {printf "%s %s %d %s\n", x, $2, $3, $1}' > file_done
