Naming awk output in loop - bash

I'm relatively new to the world of shell scripts so hopefully this won't be too difficult. I have a file (dirlist) with a list of directories. I want to
1. cat 'dirlist' with the path to each file
2. use a program called samtools to modify the file from dirlist
3. use awk to subset the samtools output on a variable chr17
4. write the output to a file that uses the 8th field of the directory, from 'dirlist' for naming
5. do this for all the files listed in dirlist
I think I have all the pieces here. Items 1-3 are working fine but the loop is simply naming the file "echo".
for i in `cat dirlist`; do samtools depth $i | awk '$1 == "chr17" {print $0}' echo $i | awk -F'[/]' '{print $8}'; done
Any help would be greatly appreciated

A native bash implementation (no awk process is started for every file; samtools still runs once per file) follows:
while IFS= read -r filename; do
while IFS= read -r line; do
if [[ $line = "chr17"[[:space:]]* ]]; then
IFS=/ read -r -a pieces <<<"$filename"
printf '%s\n' "${pieces[7]}"
fi
done < <(samtools depth "$filename")
done <dirlist

I think that's what you want to do
... | awk -v f="$i" 'BEGIN{split(f,fs,"/")} $1=="chr17" {print > fs[8]}'
The final file name is generated by splitting the original path on "/" and using only the 8th segment. That is somewhat unusual and probably needs some error handling.
Not tested, caveat emptor...
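Putting the pieces of this thread together, a minimal sketch might look like the following. It assumes every path in dirlist has at least eight "/"-separated fields (the first one is empty for an absolute path) and that samtools is on the PATH. The helper name filter_chr17 and its optional output-directory argument are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `samtools depth` output on stdin and writes the
# chr17 rows to a file named after the 8th "/"-separated field of the path.
filter_chr17() {
    local path=$1 outdir=${2:-.}
    awk -v f="$path" -v dir="$outdir" '
        BEGIN {
            split(f, fs, "/")
            if (fs[8] == "") {          # path too short: no name for the output file
                print "no 8th field in " f > "/dev/stderr"
                exit 1
            }
            out = dir "/" fs[8]
        }
        $1 == "chr17" { print > out }
    '
}

# Intended use (assumes samtools is installed; not run here):
# while IFS= read -r path; do
#     samtools depth "$path" | filter_chr17 "$path"
# done < dirlist
```

Reading dirlist with while read instead of for i in `cat dirlist` keeps paths containing spaces intact, and the explicit check gives a clear message when a path is too short to supply a name.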

Related

Evaluating a log file using a sh script

I have a log file with a lot of lines with the following format:
IP - - [Timestamp Zone] 'Command Weblink Format' - size
I want to write a script.sh that gives me the number of times each website has been clicked.
The command:
awk '{print $7}' server.log | sort -u
should give me a list which puts each unique weblink in a separate line. The command
grep 'Weblink1' server.log | wc -l
should give me the number of times the Weblink1 has been clicked. I want a command that converts each line created by the Awk command above to a variable and then create a loop that runs the grep command on the extracted weblink. I could use
while IFS='' read -r line || [[ -n "$line" ]]; do
echo "Text read from file: $line"
done
(source: Read a file line by line assigning the value to a variable) but I don't want to save the output of the Awk script in a .txt file.
My guess would be:
while IFS='' read -r line || [[ -n "$line" ]]; do
grep '$line' server.log | wc -l | ='$variabel' |
echo " $line was clicked $variable times "
done
But I'm not really familiar with connecting commands in a loop, as this is my first time. Would this loop work and how do I connect my loop and the Awk script?
Shell commands in a loop connect the same way they do without a loop, and you aren't very close. But yes, this can be done in a loop if you want the horribly inefficient way for some reason such as a learning experience:
awk '{print $7}' server.log |
sort -u |
while IFS= read -r line; do
n=$(grep -c "$line" server.log)
echo "$line" clicked $n times
done
# you only need the read || [ -n ] idiom if the input can end with an
# unterminated partial line (is ill-formed); awk print output can't.
# you don't really need the IFS= and -r because the data here is URLs
# which cannot contain whitespace and shouldn't contain backslash,
# but I left them in as good-habit-forming.
# in general variable expansions should be doublequoted
# to prevent wordsplitting and/or globbing, although in this case
# $line is a URL which cannot contain whitespace and practically
# cannot be a glob. $n is a number and definitely safe.
# grep -c does the count so you don't need wc -l
or more simply
awk '{print $7}' server.log |
sort -u |
while IFS= read -r line; do
echo "$line" clicked $(grep -c "$line" server.log) times
done
However if you just want the correct results, it is much more efficient and somewhat simpler to do it in one pass in awk:
awk '{n[$7]++}
END{for(i in n){
print i,"clicked",n[i],"times"}}' server.log |
sort
# or GNU awk 4+ can do the sort itself, see the doc:
awk '{n[$7]++}
END{PROCINFO["sorted_in"]="@ind_str_asc";
for(i in n){
print i,"clicked",n[i],"times"}}' server.log
The associative array n collects the values from the seventh field as keys, and on each line, the value for the extracted key is incremented. Thus, at the end, the keys in n are all the URLs in the file, and the value for each is the number of times it occurred.
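To see the counting in action, here is a small sketch that wraps the one-pass awk counter in a function and runs it on three made-up log lines. The function name count_clicks and the sample data are assumptions for illustration; only the field position ($7) comes from the question.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the one-pass awk counter.
count_clicks() {
    awk '{ n[$7]++ }
         END { for (url in n) print url, "clicked", n[url], "times" }' "$@" | sort
}

# Made-up sample data in the same layout as the question's log lines.
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - - [01/Jan/2020:10:00:00 +0000] "GET /a.html HTTP/1.1" 200 512
10.0.0.2 - - [01/Jan/2020:10:00:05 +0000] "GET /b.html HTTP/1.1" 200 128
10.0.0.1 - - [01/Jan/2020:10:00:09 +0000] "GET /a.html HTTP/1.1" 200 512
EOF

count_clicks "$log"
# /a.html clicked 2 times
# /b.html clicked 1 times
rm -f "$log"
```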

Read a file in a Bash script

I have a file in my file system that I want to read in a bash script. I only want to read selected values from it, not the whole file, because the file is very large. Below is my file format:
Name=TEST
Add=TEST
LOC=TEST
The file contains data like the above. From it I want to get only the Add date in a variable. Could you please suggest how I can do this?
As of now I am doing this to read the file:
file="data.txt"
while IFS= read line
do
# display $line or do something with $line
echo "$line"
done < "$file"
Use the right tool meant for the job, Awk in this case to speed things up!
dateValue="$(awk -F"=" '$1=="Add"{print $2; exit}' file)"
printf "%s\n" "$dateValue"
TEST
The idea is to split input lines on = as the delimiter. The awk logic checks whether the $1 field equals Add and, if so, prints the value associated with it.
The exit after print is optional. It stops processing as soon as the Add line is found, which helps if the file is as huge as you have indicated.
You could rewrite your loop this way, notice the break after you got your line:
while IFS='=' read -r key value; do
if [[ $key == "Add" ]]; then
# your logic
break
fi
done < "$file"
If your intention is to just get the very first occurrence of "Add=", then you could use grep this way:
value=$(grep -m 1 '^Add=' "$file" | cut -f2 -d=)
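For completeness, the loop-with-break approach can be wrapped in a small function. This is only a sketch: the name get_value is hypothetical, and it assumes each line has the form key=value with no spaces around the = sign.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of the first "key=value" line whose key matches.
get_value() {
    local want=$1 file=$2 key value
    while IFS='=' read -r key value; do
        if [ "$key" = "$want" ]; then
            printf '%s\n' "$value"
            return 0          # stop reading as soon as the key is found
        fi
    done < "$file"
    return 1                  # key not present
}

# Usage with the sample data from the question:
data=$(mktemp)
printf '%s\n' 'Name=TEST' 'Add=TEST' 'LOC=TEST' > "$data"
addValue=$(get_value Add "$data")
printf '%s\n' "$addValue"     # prints TEST
rm -f "$data"
```

Like the awk version with exit, the function stops at the first match, so a huge file is only read up to that line.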

Extract first word in colon separated text file

How do I iterate through a file and print only the first word of each line? Each line is colon-separated, for example:
root:01:02:toor
The file contains several lines. This is what I've done so far, but it doesn't work.
FILE=$1
k=1
while read line; do
echo $1 | awk -F ':'
((k++))
done < $FILE
I'm not good with bash-scripting at all. So this is probably very trivial for one of you..
edit: variable k is to count the lines.
Use cut:
cut -d: -f1 filename
-d specifies the delimiter
-f specifies the field(s) to keep
If you need to count the lines, just
count=$( wc -l < filename )
-l tells wc to count lines
awk -F: '{print $1}' FILENAME
That will print the first word when separated by colon. Is this what you are looking for?
To use a loop, you can do something like this:
$ cat test.txt
root:hello:1
user:bye:2
test.sh
#!/bin/bash
while IFS=':' read -r line || [[ -n $line ]]; do
echo "$line" | awk -F: '{print $1}'
done < test.txt
Example of reading line by line in bash: Read a file line by line assigning the value to a variable
Result:
$ ./test.sh
root
user
A solution using perl
%> perl -F: -ane 'print "$F[0]\n";' [file(s)]
change the "\n" to " " if you don't want a new line printed.
You can get the first word without any external commands in bash like so:
printf '%s' "${line%%:*}"
This expands the variable named line and removes the longest suffix that matches the glob :*, that is, everything from the first colon onward. The longest match is what %% selects; a single % would remove only the shortest match, starting at the last colon.
Though with this solution you do need to do the loop yourself. If this is the only thing you want to do with the variable the cut solution is better so you don't have to do the file iteration yourself.
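A short sketch of the difference between %% and %, using the sample line from the question, followed by the same expansion inside a read loop. The function name print_first_fields is hypothetical.

```shell
#!/usr/bin/env bash
line='root:01:02:toor'

first=${line%%:*}     # longest suffix matching ":*" removed  -> root
shortest=${line%:*}   # shortest suffix matching ":*" removed -> root:01:02

printf '%s\n' "$first" "$shortest"

# The same expansion applied to every line of a file, with no external commands.
print_first_fields() {
    local l
    while IFS= read -r l || [ -n "$l" ]; do
        printf '%s\n' "${l%%:*}"
    done < "$1"
}
```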

Using the output of awk as the list of names in a for loop

How can I pass the output of awk to a for file in loop?
for file in awk '{print $2}' my_file; do echo $file done;
my_file contains the name of the files whose name should be displayed (echoed).
I get just a
>
instead of my normal prompt.
Use backticks or $(...) to substitute the output of a command:
for file in $(awk '{print $2}' my_file)
do
echo "$file"
done
for file in $(awk '{print $2}' my_file); do echo "$file"; done
The notation to use is $(...), known as command substitution.
for file in $(awk '{print $2}' my_file)
do
echo $file
done
Where I assume that you do more in the body of the loop than just echo since you could then leave the loop out altogether:
awk '{print $2}' my_file
Or, if you miss typing semicolons and don't like to spread code over multiple lines for readability, then you can use:
for file in $(awk '{print $2}' my_file); do echo $file; done
You will also find in (mostly older) code the backticks used:
for file in `awk '{print $2}' my_file`
do
echo $file
done
Quite apart from being difficult to use in the Markdown used to format comments (and questions and answers) on Stack Overflow, the backticks are not as friendly, especially when nested, so you should recognize them and understand them but not use them.
Incidentally, the reason you got the > prompt is that this command line:
for file in awk '{print $2}' my_file; do echo $file done;
is missing a semicolon before the done. The shell was still waiting for the done. Had you typed done and return, you would have seen the output:
awk done
{print $2} done
my_file done
Using backticks or $(awk ...) for command substitution is an acceptable solution for a small number of files. For larger sets, consider xargs for single commands or pipelines, or a simple while read ... loop for more complex tasks (it works for simple ones too):
awk '...' |while read FILENAME; do
#do work with each file here using $FILENAME
done
This lets each filename be handled as soon as awk prints it, instead of waiting for the whole awk script to finish, and it avoids building the entire list in memory first, as for x in $(...) has to do. It also sidesteps the word splitting and globbing applied to an unquoted $(...), while allowing the same kinds of operations you would get in a for ... in loop.
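A minimal sketch of that pipeline, with IFS= and -r added so each name is read exactly as awk printed it. The function name process_names and the sample input are made up for illustration; the printf stands in for the real per-file work.

```shell
#!/usr/bin/env bash
# Hypothetical helper: act on each name in column 2 as soon as awk prints it.
process_names() {
    awk '{print $2}' "$1" |
    while IFS= read -r FILENAME; do
        # placeholder for the real work on each file
        printf 'processing %s\n' "$FILENAME"
    done
}

# Made-up input: the second column holds the names.
list=$(mktemp)
printf '%s\n' '1 alpha.txt' '2 beta.txt' > "$list"
process_names "$list"
# processing alpha.txt
# processing beta.txt
rm -f "$list"
```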

"while read LINE do" and grep problems

I have two files.
file1.txt:
Afghans
Africans
Alaskans
...
where file2.txt contains the output from a wget on a webpage, so it's a big sloppy mess, but does contain many of the words from the first list.
Bashscript:
cat file1.txt | while read LINE; do grep $LINE file2.txt; done
This did not work as expected. I wondered why, so I echoed the $LINE variable inside the loop and added a sleep 1, so I could see what was happening:
cat file1.txt | while read LINE; do echo $LINE; sleep 1; grep $LINE file2.txt; done
The output in the terminal looks something like this:
Afghans
Africans
Alaskans
Albanians
Americans
grep: Chinese: No such file or directory
: No such file or directory
Arabians
Arabs
Arabs/East Indians
: No such file or directory
Argentinans
Armenians
Asian
Asian Indians
: No such file or directory
file2.txt: Asian Naruto
...
So you can see it did finally find the word "Asian". But why does it say:
No such file or directory
?
Is there something weird going on or am I missing something here?
What about
grep -f file1.txt file2.txt
@OP, first, use dos2unix as advised. Then use awk
awk 'FNR==NR{a[$1];next}{ for(i=1;i<=NF;i++){ if($i in a) {print $i} } } ' file1 file2_wget
Note: using while loop and grep inside the loop is not efficient, since for every iteration, you need to invoke grep on the file2.
@OP, crude explanation:
For the meaning of FNR and NR, please refer to the gawk manual. FNR==NR{a[$1];next} stores the contents of file1 as keys of array a. When FNR is not equal to NR (which means the second file is now being read), it checks whether each word on the line is in array a and, if so, prints it. (The for loop iterates over each word.)
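A small sketch of the two-file idiom on made-up data. The function name match_words and the sample files are assumptions for illustration, and the word file is assumed to be non-empty (with an empty first file, FNR==NR would also hold for the second file).

```shell
#!/usr/bin/env bash
# Hypothetical helper: print every whitespace-separated word of <data-file>
# that also appears as the first field of a line in <word-file>.
match_words() {
    awk 'FNR == NR { a[$1]; next }                 # first file: remember each word
         { for (i = 1; i <= NF; i++)               # second file: test every word
               if ($i in a) print $i }' "$1" "$2"
}

words=$(mktemp); page=$(mktemp)
printf '%s\n' 'Afghans' 'Asian' > "$words"
printf '%s\n' 'Asian Naruto and Afghans here' 'nothing relevant' > "$page"
match_words "$words" "$page"
# Asian
# Afghans
rm -f "$words" "$page"
```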
Use more quotes and use less cat
while IFS= read -r LINE; do
grep "$LINE" file2.txt
done < file1.txt
As well as the quoting issue, the file you've downloaded contains CRLF line endings which are throwing read off. Use dos2unix to convert file1.txt before iterating over it.
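If dos2unix is not available, one alternative (swapped in here, not part of the answer above) is to strip the carriage returns with tr on the way into the loop. The sketch below assumes the word list has CRLF endings; the function name grep_words and the sample files are made up, and grep -F treats each word as a literal string rather than a regular expression.

```shell
#!/usr/bin/env bash
# Hypothetical helper: search <data-file> for each word in <word-file>,
# tolerating CRLF line endings in the word file.
grep_words() {
    tr -d '\r' < "$1" |
    while IFS= read -r LINE; do
        [ -n "$LINE" ] || continue      # an empty pattern would match every line
        grep -F -- "$LINE" "$2"
    done
}

words=$(mktemp); page=$(mktemp)
printf 'Afghans\r\nAsian\r\n' > "$words"          # CRLF-terminated word list
printf '%s\n' 'Asian Naruto' 'nothing here' > "$page"
grep_words "$words" "$page"
# Asian Naruto
rm -f "$words" "$page"
```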
Although using awk is faster, grep produces a lot more detail with less effort. So, after running dos2unix, use:
grep -F -i -n -f <file_containing_pattern> <file_containing_data_blob>
You will have all the matches + line numbers (case insensitive)
At minimum this will suffice to find all the words from file_containing_pattern:
grep -F -f <file_containing_pattern> <file_containing_data_blob>
