How to optimize bash script parsing multiple gzipped files with multiple patterns?

I have a bash script which iterates over many files: f1.gz, f2.gz, .. fn.gz
Each file contains millions of lines, and each line can match one pattern out of a set: p1, p2, .. pn
Depending on which pattern matches, the line should go to a specific output file. The patterns are obtained with date manipulations.
I wrote a couple of versions of this, but I'm not satisfied at all, and I would like to ask whether a better way/solution can be achieved without resorting to writing anything in a compiled language.
Here's what I have:
for FILE in `ls f*.gz`
do
    echo "uncompressing only once per file -- $FILE: "
    gzcat $FILE > .myfile.txt
    while IFS='' read -r LINE || [[ -n "$LINE" ]]; do
        for DATE in "$@"  # I pass to my script several dates like 20201015, 20201014, etc
        do
            for i in {0..23}
            do
                p="DATE_PATTERNS_$DATE[$i]"  # I prepared these before to avoid running "date" millions of times
                echo $LINE | awk -v pat=${!p} -F '"' '$1 ~ pat {print $2" "$4" "$6}' >> $DATE.txt
            done
        done
    done < .myfile.txt
done
Thanks

If you don't want to replace the code with one awk looping through the dates, you can start by removing the while loop (and opening the output file less often):
for FILE in f*.gz; do
    echo "uncompressing only once per file -- $FILE: "
    gzcat $FILE > .myfile.txt
    # I pass to my script several dates like 20201015, 20201014, etc
    for DATE in "$@"; do
        for i in {0..23}
        do
            p="DATE_PATTERNS_$DATE[$i]"
            awk -v pat=${!p} -F '"' '$1 ~ pat {print $2" "$4" "$6}' .myfile.txt
        done >> $DATE.txt
    done
done
Once you have tried this and still want improvements, consider moving the for DATE and for i loops into awk (a rough sketch follows), and/or starting with gzcat f*.gz > .mycombinedfiles.txt (when disk space is no issue).
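For illustration, moving both loops into a single awk pass per file could look roughly like the sketch below. It assumes the DATE_PATTERNS_* arrays from the question are first flattened into a hypothetical temporary file .patterns.txt holding one pattern<TAB>date pair per line; the field layout ($1 matched against the pattern, $2/$4/$6 printed) is taken from the question, and everything else is untested and illustrative only:
# build pattern -> date pairs once (hypothetical helper step)
for DATE in "$@"; do
    arr="DATE_PATTERNS_$DATE[@]"
    for pat in "${!arr}"; do
        printf '%s\t%s\n' "$pat" "$DATE"
    done
done > .patterns.txt

# one awk pass per file: every pattern is tested against field 1 and
# matches are appended to the corresponding <date>.txt
for FILE in f*.gz; do
    gzcat "$FILE" | awk -F '"' -v patfile=.patterns.txt '
        BEGIN {
            while ((getline line < patfile) > 0) {
                split(line, a, "\t")
                pats[a[1]] = a[2]          # pattern -> date
            }
        }
        {
            for (p in pats)
                if ($1 ~ p)
                    print $2, $4, $6 >> (pats[p] ".txt")
        }'
done
This still tests every pattern against every line, but it does so inside a single awk process per file instead of spawning awk once per line and per pattern.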

Related

Searching user names in multiple files and print if it doesn't exist using bash script

I have a file containing a number of user names, which I need to compare against multiple files and print, per file, any user that is not present in that file.
#!/bin/bash
awk '{print $1}' $1 | while read -r line; do
    if ! grep -q "$line" *.txt
    then
        echo "$line User doesn't exist"
    fi
done
In the above script I pass the user_list file as $1. It works for a single target file, but it fails for multiple files.
File contents:
user_list:
Johnny
Stella
Larry
Jack
One of the multiple files contents:
root:x:0:0:root:/root:/bin/bash
Stella:x:1:1:Admin:/bin:/bin/bash
Jack:x:2:2:admin:/sbin:/bin/bash
Usage:
./myscript user_list.txt
Desired output:
File1:
Stella doesn't exist
Jack doesn't exist
File2:
Larry doesn't exist
Johnny doesn't exist
Any suggestions on how to achieve this for multiple files, printing a filename header for each?
Use a for loop to iterate over each file and execute your code for each file separately.
#!/bin/bash
for f in *.txt; do
    echo $f:
    awk '{print $1}' $1 | while read -r line; do
        if ! grep -q "$line" $f
        then
            echo "$line doesn't exist"
        fi
    done
    echo
done
You could simply do:
for file in *.txt; do
    echo "$file:"
    grep -Fvf <(cut -d: -f1 "$file") "$1"
    echo
done
Here the first colon-separated field of each target file supplies the names that are matched against the user list, so only the missing users are printed.
This might do what you want.
#!/usr/bin/env bash
for f in *.txt; do                              ##: Loop through the *.txt files
    for j; do                                   ##: Loop through the argument files, file1 file2 file3
        printf '\n%s:\n' "$j"                   ##: Print the name of one of the multiple files.
        while read -r lines; do                 ##: Read the *.txt file line by line
            if ! grep -q "$lines" "$j"; then    ##: If grep did not find a match...
                printf "%s doesn't exist.\n" "$lines"   ##: ...print the desired output.
            fi
        done < "$f"
    done
done
The *.txt files should be in the current directory, otherwise add the absolute path, e.g. /path/to/files/*.txt
How to use:
./myscript file1 file2 file3 ...
The downside is that you're running grep line by line against each file, as opposed to what @Quasimodo did.
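If the per-line grep turns out to be too slow, a single awk pass over the user list plus all target files can produce the same report. This is only a sketch under the assumptions visible in the question (user list given as the first argument, passwd-style colon-delimited target files, users identified by the first field); the script name and structure are hypothetical, not one of the answers above:
#!/usr/bin/env bash
# usage: ./report_missing user_list.txt file1 file2 ...   (hypothetical script name)
awk '
    NR == FNR { users[$1]; next }         # first file: collect the user names
    FNR == 1  {                           # start of each target file
        if (prev != "") report(prev)
        prev = FILENAME
        split("", found)                  # reset the per-file "seen" set
    }
    { split($0, a, ":"); if (a[1] in users) found[a[1]] }
    END { if (prev != "") report(prev) }
    function report(f,  u) {
        print f ":"
        for (u in users)
            if (!(u in found)) print u " doesn'\''t exist"
        print ""
    }
' "$@"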

Evaluating a log file using a sh script

I have a log file with a lot of lines with the following format:
IP - - [Timestamp Zone] 'Command Weblink Format' - size
I want to write a script.sh that gives me the number of times each website has been clicked.
The command:
awk '{print $7}' server.log | sort -u
should give me a list which puts each unique weblink in a separate line. The command
grep 'Weblink1' server.log | wc -l
should give me the number of times Weblink1 has been clicked. I want a command that converts each line created by the awk command above into a variable, and then a loop that runs the grep command on the extracted weblink. I could use
while IFS='' read -r line || [[ -n "$line" ]]; do
echo "Text read from file: $line"
done
(source: Read a file line by line assigning the value to a variable) but I don't want to save the output of the Awk script in a .txt file.
My guess would be:
while IFS='' read -r line || [[ -n "$line" ]]; do
grep '$line' server.log | wc -l | ='$variabel' |
echo " $line was clicked $variable times "
done
But I'm not really familiar with connecting commands in a loop, as this is my first time. Would this loop work and how do I connect my loop and the Awk script?
Shell commands in a loop connect the same way they do without a loop, and you aren't very close. But yes, this can be done in a loop if you want the horribly inefficient way for some reason such as a learning experience:
awk '{print $7}' server.log |
sort -u |
while IFS= read -r line; do
n=$(grep -c "$line" server.log)
echo "$line" clicked $n times
done
# you only need the read || [ -n ] idiom if the input can end with an
# unterminated partial line (is illformed); awk print output can't.
# you don't really need the IFS= and -r because the data here is URLs
# which cannot contain whitespace and shouldn't contain backslash,
# but I left them in as good-habit-forming.
# in general variable expansions should be doublequoted
# to prevent wordsplitting and/or globbing, although in this case
# $line is a URL which cannot contain whitespace and practically
# cannot be a glob. $n is a number and definitely safe.
# grep -c does the count so you don't need wc -l
or more simply
awk '{print $7}' server.log |
sort -u |
while IFS= read -r line; do
echo "$line" clicked $(grep -c "$line" server.log) times
done
However if you just want the correct results, it is much more efficient and somewhat simpler to do it in one pass in awk:
awk '{n[$7]++}
END{for(i in n){
print i,"clicked",n[i],"times"}}' server.log |
sort
# or GNU awk 4+ can do the sort itself, see the doc:
awk '{n[$7]++}
END{PROCINFO["sorted_in"]="#ind_str_asc";
for(i in n){
print i,"clicked",n[i],"times"}}' server.log
The associative array n collects the values from the seventh field as keys, and on each line, the value for the extracted key is incremented. Thus, at the end, the keys in n are all the URLs in the file, and the value for each is the number of times it occurred.

How to capture first column values of a command?

I am new to shell scripting. I am trying to write a script that is supposed to run a command and use a for loop to capture the first column of the output and do further processing.
command: tst get files
output of this command is something like
NAME COUNT ADMIN
FileA.txt 30 adminA
FileB.txt 21 local
FileC.txt 9 local
FileD.txt 90 adminA
Here is what I have tried so far (UPDATED: I also want to run additional commands):
#!/bin/bash
for f in $(tst get files)
do
    echo "FILE :[${f}]"
    tst setprimary ${f} && tst get dataload
done
the output I am seeing is something like
FILE :[NAME]
FILE :[COUNT]
FILE :[ADMIN]
FILE :[FileA.txt]
FILE :[30]
FILE :[adminA]
FILE :[FileB.txt]
FILE :[21]
FILE :[local]
FILE :[FileC.txt]
FILE :[9]
FILE :[local]
FILE :[FileD.txt]
FILE :[90]
FILE :[adminA]
I am looking for an output something like
FILE :[FileA.txt]
FILE :[FileB.txt]
FILE :[FileC.txt]
FILE :[FileD.txt]
What should I modify in the shell script to capture only the NAME column values? Am I executing the tst get files command correctly in the for loop, or is there a better way to execute a command and loop through the results?
EDIT (Samuel Kirschner): you can do without the for loop entirely and just use awk to print the lines you're interested in
tst get files | awk 'NR > 1 {print "FILE :[" $1 "]"}'
If you want to keep the for loop for some reason and just extract the file name from the lines while skipping the header, you have a few choices. Awk is probably the easiest because of the NR builtin variable (which counts lines) and automatic field-splitting ($1 refers to the first field in the line, for instance), but you can use sed and cut as well.
You can use awk 'NR > 1 {print $1}' to get the first column (using any whitespace as the delimiter while skipping the first line), or sed 1d | cut -d$'\t' -f1. Note that $'\t' is bash-specific syntax for a literal tab character; if your output is padded with spaces rather than delimited with tabs, you can't use the sed ... | cut ... example.
i.e.
#!/bin/bash
for f in $(tst get files | awk 'NR > 1 {print $1}')
do
    echo "FILE :[${f}]"
done
or
#!/bin/bash
for f in $(tst get files | sed 1d | cut -d$'\t' -f1)
do
    echo "FILE :[${f}]"
done
To avoid unnecessary splitting in the for loop, it's best to set IFS to something specific outside the loop body, so that 'a file with whitespace.txt' is not broken up:
OLD_IFS=$IFS
IFS=$'\n\t'
for f in $(tst get files | sed 1d | cut -d$'\t' -f1)
do
    echo "FILE :[${f}]"
done
IFS=$OLD_IFS
You can just do:
tst get files | awk 'NR > 1 { printf "FILE :[%s]\n", $1 }'
Update: To answer the extended problem described in the OP's comments below:
while read -r file _; do
    tst setprimary "$file" && tst get dataload
done < <(tst get files | tail -n +2)   # tail -n +2 skips the header line
Or perl:
tst ... | perl -lanE 'say "File: [$F[0]]" if $.>1'
The variable $. contains the current line number.

Naming awk output in loop

I'm relatively new to the world of shell scripts so hopefully this won't be too difficult. I have a file (dirlist) with a list of directories. I want to
cat 'dirlist' with the path to each file
use a program called samtools to modify the file from dirlist
use awk to subset the samtools output on a variable chr17
write the output to a file named using the 8th field of the path from 'dirlist'
do this for all the files listed in dirlist
I think I have all the pieces here. Items 1-3 are working fine but the loop is simply naming the file "echo".
for i in `cat dirlist`; do samtools depth $i | awk '$1 == "chr17" {print $0}' echo $i | awk -F'[/]' '{print $8}'; done
Any help would be greatly appreciated
A native bash implementation (just one process, rather than starting an awk for every file) follows:
while IFS= read -r filename; do
    while IFS= read -r line; do
        if [[ $line = "chr17"[[:space:]]* ]]; then
            IFS=/ read -r -a pieces <<<"$filename"
            printf '%s\n' "${pieces[7]}"
        fi
    done < <(samtools depth "$filename")
done <dirlist
I think that's what you want to do
... | awk -v f="$i" 'BEGIN{split(f,fs,"/")} $1=="chr17" {print > fs[8]}'
The final file name will be generated from the original file name split on "/", using only the 8th segment. Kind of unusual; it perhaps needs some error handling.
not tested, caveat emptor...
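For completeness, the full loop that the snippet above plugs into might look something like this untested sketch (assuming dirlist holds one path per line and samtools is on the PATH); note that > truncates each output file once per awk run, so use >> instead if several input files can map to the same output name:
while IFS= read -r path; do
    samtools depth "$path" |
        awk -v f="$path" 'BEGIN { split(f, fs, "/") } $1 == "chr17" { print > fs[8] }'
done < dirlist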

Read a file in a Bash script

I have a file on my file system that I want to read in a bash script. I want to read only selected values from the file, not the whole file, as the file is very huge. Below is my file format:
Name=TEST
Add=TEST
LOC=TEST
The file contains data like the above. From that I want to get only the Add value into a variable. Could you please suggest how I can do this?
As of now I am doing this to read the file:
file="data.txt"
while IFS= read line
do
# display $line or do somthing with $line
echo "$line"
done < "$file"
Use the right tool meant for the job, Awk in this case, to speed things up!
dateValue="$(awk -F"=" '$1=="Add"{print $2; exit}' file)"
printf "%s\n" "$dateValue"
TEST
The idea is to split input lines using = as the delimiter. The awk logic checks whether the $1 field equals Add and, if so, prints the value associated with it.
The exit after print is optional. It quits processing as soon as the Add line is found, which helps speed things up if the file is huge, as you have indicated.
You could rewrite your loop this way; notice the break once you have found your line:
while IFS='=' read -r key value; do
    if [[ $key == "Add" ]]; then
        # your logic; $value now holds the text after "Add="
        break
    fi
done < "$file"
If your intention is to just get the very first occurrence of "Add=", then you could use grep this way:
value=$(grep -m 1 '^Add=' "$file" | cut -f2 -d=)
