Looping input file and find out if line is used - bash

I am using bash to loop through a large input file (constants.txt) that looks like:
searchterm1
searchterm2
searchterm3
...in an effort to remove search terms from the file if they are not used in a code base. I am trying to use grep and awk, but with no success. I also want to exclude the images and constants directories.
#!/bin/bash
while read a; do
    output=`grep -R $a ../website | grep -v ../website/images | grep -v ../website/constants | grep -v ../website/.git`
    if [ -z "$output" ]
    then echo "$a" >> notneeded.txt
    else echo "$a used $($output | wc -l) times" >> needed.txt
    fi
done < constants.txt
The desired effect of this would be two files. One for showing all of the search terms that are found in the code base (needed.txt), and another for search terms that are not found in the code base (notneeded.txt).
needed.txt
searchterm1 used 4 times
searchterm3 used 10 times
notneeded.txt
searchterm2
I've tried awk as well in a similar fashion, but I cannot get it to loop and output as desired.

Not sure but it sounds like you're looking for something like this (assuming no spaces in your file names):
awk '
NR==FNR { terms[$0]; next }
{
    for (term in terms) {
        if ($0 ~ term) {
            hits[term]++
        }
    }
}
END {
    for (term in terms) {
        if (term in hits) {
            print term " used " hits[term] " times" > "needed.txt"
        }
        else {
            print term > "notneeded.txt"
        }
    }
}
' constants.txt $( find ../website -type f -print | egrep -v '\.\.\/website\/(images|constants|\.git)' )
There's probably some find option to make the egrep unnecessary.
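For example, a minimal sketch using find's -prune instead of the egrep filter (untested; the paths assume the same layout as above):
find ../website \( -path '../website/images' -o -path '../website/constants' -o -path '../website/.git' \) -prune -o -type f -print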

Related

Unscramble words Challenge - improve my bash solution

There is a Capture the Flag challenge.
I have two files; one with scrambled text like this, with about 550 entries:
dnaoyt
cinuertdso
bda
haey
tolpap
...
The second file is a dictionary with about 9,000 entries
radar
ccd
gcc
fcc
historical
...
The goal is to find the right, unscrambled version of the word, which is contained in the dictionary file.
My approach is to sort the characters of the first word from the first file, then check whether the first word from the second file has the same length. If so, I sort that one too and compare them.
This is my fully functional bash script, but it is very slow.
#!/bin/bash
while IFS="" read -r p || [ -n "$p" ]
do
    var=0
    ro=$(echo $p | perl -F -lane 'print sort @F')
    len_ro=${#ro}
    while IFS="" read -r o || [ -n "$o" ]
    do
        ro2=$(echo $o | perl -F -lane 'print sort @F')
        len_ro2=${#ro2}
        let "var+=1"
        if [ $len_ro == $len_ro2 ]; then
            if [ $ro == $ro2 ]; then
                echo $o >> new.txt
                echo $var >> whichline.txt
            fi
        fi
    done < dictionary.txt
done < scrambled-words.txt
I have also tried converting all characters to ASCII integers and summing each word, but while comparing I realized that different character patterns can produce the same sum (for example, "ad" and "bc" both sum to 197).
[edit]
For the record:
- no anagrams are contained in the dictionary
- to get the flag, you need to export the unscrambled words as one blob and make a SHA hash out of it (that's the flag)
- link to the CTF for anyone who wants the files: https://challenges.reply.com/tamtamy/user/login.action
You're better off creating a lookup dictionary (keyed by the sorted word) from the dictionary file.
Your loop body is executed 550 * 9,000 = 4,950,000 times (O(N*M)).
The solution I propose executes two loops of at most 9,000 passes each (O(N+M)).
Bonus: It finds all possible solutions at no cost.
#!/usr/bin/perl
use strict;
use warnings qw( all );
use feature qw( say );

my $dict_qfn      = "dictionary.txt";
my $scrambled_qfn = "scrambled-words.txt";

sub key { join "", sort split //, $_[0] }

my %dict;
{
    open(my $fh, "<", $dict_qfn)
        or die("Can't open \"$dict_qfn\": $!\n");
    while (<$fh>) {
        chomp;
        push @{ $dict{key($_)} }, $_;
    }
}

{
    open(my $fh, "<", $scrambled_qfn)
        or die("Can't open \"$scrambled_qfn\": $!\n");
    while (<$fh>) {
        chomp;
        my $matches = $dict{key($_)};
        say "$_ matches @$matches" if $matches;
    }
}
I wouldn't be surprised if this takes only one millionth of the time of your solution for the sizes you provided (and it scales much better than yours if you were to increase the sizes).
I would do something like this with gawk:
gawk '
NR == FNR {
    dict[csort()] = $0
    next
}
{
    print dict[csort()]
}
function csort(    chars, n, i, sorted) {
    n = split($0, chars, "")
    asort(chars)
    for (i = 1; i <= n; i++)
        sorted = sorted chars[i]
    return sorted
}' dictionary.txt scrambled-words.txt
Here's a perl-free solution I came up with using sort and join:
sort_letters() {
    # Splits each letter onto a line, sorts the letters, then joins them
    # e.g. "hello" becomes "ehllo"
    echo "${1}" | fold -w1 | sort | tr -d '\n'
}
# For each input file...
for input in "dict.txt" "words.txt"; do
    # Convert each line to [sorted] [original]
    # then sort and save the results with a .sorted extension
    while read -r original; do
        sorted=$(sort_letters "${original}")
        echo "${sorted} ${original}"
    done < "${input}" | sort > "${input}.sorted"
done
# Join the two files on the [sorted] word
# outputting the scrambled and unscrambled words
join -j 1 -o 1.2,2.2 "words.txt.sorted" "dict.txt.sorted"
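For illustration (hypothetical contents: words.txt holds "lleho" and dict.txt holds "hello"), the intermediate files and the join output would look like this:
# words.txt.sorted contains:  ehllo lleho
# dict.txt.sorted contains:   ehllo hello
# join then prints:           lleho hello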
I tried something very similar, but a bit different.
#!/bin/bash
exec 3<scrambled-words.txt
while read -r line <&3; do
    printf "%s" ${line} | perl -F -lane 'print sort @F'
done >scrambled-words_sorted.txt
exec 3>&-

exec 3<dictionary.txt
while read -r line <&3; do
    printf "%s" ${line} | perl -F -lane 'print sort @F'
done >dictionary_sorted.txt
exec 3>&-

printf "" > whichline.txt
exec 3<scrambled-words_sorted.txt
while read -r line <&3; do
    counter="$((++counter))"
    grep -n -e "^${line}$" dictionary_sorted.txt | cut -d ':' -f 1 | tr -d '\n' >>whichline.txt
    printf "\n" >>whichline.txt
done
exec 3>&-
As you can see I don't create a new.txt file; instead I only create whichline.txt with a blank line where the word doesn't match. You can easily paste them up to create new.txt.
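For instance, a minimal sketch of that reassembly (assuming whichline.txt ends up with at most one dictionary line number per line, blank where nothing matched):
while read -r n; do
    if [ -n "$n" ]; then
        sed -n "${n}p" dictionary.txt
    else
        printf "\n"
    fi
done < whichline.txt > new.txt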
The logic behind the script is nearly the same as yours, except that I call perl fewer times and save two support files.
I think (but I am not sure) that creating them and cycling through only one file will be better than ~5 million calls to perl. This way perl is called "only" ~10k times.
Finally, I decided to use grep because it's (maybe) the fastest regex matcher, and by searching for the entire line the length check is intrinsic to the regex.
Please note that what @benjamin-w said is still valid; in that case grep will behave badly, and I did not handle it!
I hope this helps [:

Fast grep on huge csv files

I have a file (queryids.txt) with a list of 847 keywords to search for. I have to grep the keywords from about 12 huge csv files (the biggest has 2,184,820,000 lines). Eventually we will load the data into a database of some sort, but for now we just want certain keywords to be grepped.
My command is:
LC_ALL=C fgrep -f queryids.txt subject.csv
I am thinking of writing a bash script like this:
#!/bin/bash
for f in *.csv
do
    ( echo "Processing $f"
      filename=$(basename "$f")
      filename="${filename%.*}"
      LC_ALL=C fgrep -f queryids.txt $f > $filename"_goi.csv" ) &
done
and I will run it using: nohup bash myscript.sh &
The queryids.txt looks like this:
ENST00000401850
ENST00000249005
ENST00000381278
ENST00000483026
ENST00000465765
ENST00000269080
ENST00000586539
ENST00000588458
ENST00000586292
ENST00000591459
The subject file looks like this:
target_id,length,eff_length,est_counts,tpm,id
ENST00000619216.1,68,2.65769E1,0.5,0.300188,00065a62-5e18-4223-a884-12fca053a109
ENST00000473358.1,712,5.39477E2,8.26564,0.244474,00065a62-5e18-4223-a884-12fca053a109
ENST00000469289.1,535,3.62675E2,4.82917,0.212463,00065a62-5e18-4223-a884-12fca053a109
ENST00000607096.1,138,1.92013E1,0,0,00065a62-5e18-4223-a884-12fca053a109
ENST00000417324.1,1187,1.01447E3,0,0,00065a62-5e18-4223-a884-12fca053a109
I am concerned this will take a long time. Is there a faster way to do this?
Thanks!
A few things I can suggest to improve the performance:
- No need to spawn a sub-shell using ( ... ) &; you can use braces { ...; } & if needed.
- Use grep -F (fixed-string search, no regex) to make grep run faster.
- Avoid the basename command and use bash string manipulation instead.
Try this script:
#!/bin/bash
for f in *.csv; do
    echo "Processing $f"
    filename="${f##*/}"
    LC_ALL=C grep -Ff queryids.txt "$f" > "${filename%.*}_goi.csv"
done
I suggest you run this on a smaller dataset to compare the performance gain.
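For example, a rough sketch of such a comparison (the slice size and file names here are just for illustration):
# carve off a one-million-line slice of one of the inputs
head -n 1000000 subject.csv > sample.csv
# time the fixed-string search against the slice
time LC_ALL=C grep -Ff queryids.txt sample.csv > sample_goi.csv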
You could try this instead:
awk '
BEGIN {
    while ( (getline line < "queryids.txt") > 0 ) {
        re = ( re=="" ? "" : re "|") line
    }
}
FNR==1 { close(out); out=FILENAME; sub(/\.[^.]+$/,"_goi&",out) }
$0 ~ re { print > out }
' *.csv
It's using a regexp rather than a string comparison - whether or not that matters and, if so, what we can do about it depends on the values in queryids.txt. In fact there may be a vastly faster and more robust way to do this depending on what your files contain, so if you edit your question to include some examples of your file contents we could be of more help.
I see you have now posted some sample input, and indeed we can do this much faster and more robustly by using a hash lookup:
awk '
BEGIN {
    FS="."
    # read the IDs into a hash so each record needs only one lookup
    while ( (getline line < "queryids.txt") > 0 ) {
        ids[line]
    }
}
FNR==1 { close(out); out=FILENAME; sub(/\.[^.]+$/,"_goi&",out) }
# with FS="." the first field is the ID without its version suffix,
# e.g. "ENST00000619216" from "ENST00000619216.1,68,...", which matches the queryids.txt entries
$1 in ids { print > out }
' *.csv

Why is this command within my code giving different result than the same command in terminal?

Edit: Okay, so I've tried implementing everyone's advice so far.
- I've added quotes around each variable, "$1" and "$codon", to avoid whitespace issues.
- I've added the -ioc flags to grep so capitalization doesn't matter.
- I tried using tr -d' ', however that leads to a runtime error because it says -d' ' is an invalid option.
Unfortunately I am still seeing the same problem. Or a different problem, which is that it tells me that every codon appears exactly once. Which is a different kind of wrong.
Thanks for everything so far - I'm still open to new ideas. I've updated my code below.
I have this bash script that is supposed to count all permutations of (A C G T) in a given file.
One line of the script is not giving me the desired result and I don't know why - especially because I can enter the exact same line of code in the command prompt and get the desired result.
The line, executed in the command prompt, is:
cat dnafile | grep -o GCT | wc -l
This line tells me how many times the regular expression "GCT" appears in the file dnafile. When I run this command the result I get is 10 (which is accurate).
In the code itself, I run a modified version of the same command:
cat $1 | grep -o $codon | wc -l
Where $1 is the file name, and $codon is the 3-letter combination. When I run this from within the program, the answer I get is ALWAYS 0 (which is decidedly not accurate).
I was hoping one of you fine gents could enlighten this lost soul as to why this is not working as expected.
Thank you very, very much!
My code:
#!/bin/bash
# countcodons <dnafile> counts occurrences of each codon in the sequence contained within <dnafile>
if [[ $# != 1 ]]
then echo "Format is: countcodons <dnafile>"
     exit
fi

nucleos=(a c g t)
allCods=()

# mix and match nucleotides to create all codons
for x in {0..3}
do
    for y in {0..3}
    do
        for z in {0..3}
        do
            perm=${nucleos[$x]}${nucleos[$y]}${nucleos[$z]}
            allCods=("${allCods[@]}" "$perm")
        done
    done
done

# for each codon, use grep to count # of occurrences in file
len=${#allCods[*]}
for (( n=0; n<len; n++ ))
do
    codon=${allCods[$n]}
    occs=`cat "$1" | grep -ioc "$codon" | wc -l`
    echo "$codon appears: $occs"
    # if (( $occs > 0 ))
    # then
    #     echo "$codon : $occs"
    # fi
done
exit
You're generating your sequences in lowercase. Your code greps for gct, not GCT. You want to add the -i switch to grep. Try:
occs=`grep -ioc $codon $1`
You've got your logic backwards - you shouldn't have to read your input file once for every codon, you should only have to read it once and check each line for every codon.
You didn't supply any sample input or expected output so it's untested but something like this is the right approach:
awk '
BEGIN {
nucleosStr="a c g t"
split(nucleosStr,nucleos)
#mix and match nucleotides to create all codons
for (x in nucleos) {
for (y in nucleos) {
for (z in nucleos) {
perm = nucleos[x] nucleos[y] nucleos[z]
allCodsStr = allCodsStr (allCodsStr?" ":"") perm
}
}
}
split(allCodsStr,allCods)
}
{
#for each codon, count # of occurances in file
for (n in allCods) {
codon = allCods[n]
if ( tolower($0) ~ codon ) {
occs[n]++
}
}
}
END {
for (n in allCods) {
printf "%s appears: %d\n", allCods[n], occs[n]
}
}
' "$1"
I expect you'll see a huge performance improvement with that approach if your file is moderately large.
Try:
occs=`cat $1 | grep -o $codon | wc -l | tr -d ' '`
The problem is that wc indents the output, so $occs has a bunch of spaces at the beginning.
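For illustration, behavior varies by platform: BSD/macOS wc pads the count with leading spaces, while GNU wc reading a pipe usually doesn't; tr -d ' ' makes the result clean either way:
printf 'GCT\nGCT\n' | grep -o GCT | wc -l               # may print "       2" or "2"
printf 'GCT\nGCT\n' | grep -o GCT | wc -l | tr -d ' '   # prints "2" on both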

Show different context on different grep keyword?

I know -A -B -C could be used to show context around the grep keyword.
My question is, how to show different context on different keyword?
For example, how do I show -A 5 for cat, -B 4 for dog, and -C 1 for monkey:
egrep -A3 "cat|dog|monkey" <file>
// this just shows 3 trailing lines for each keyword.
I don't think there's any way to do it with a single grep call, but you could run the file through grep once for each pattern and concatenate the output:
var=$(grep -n -A 5 cat file)$'\n'$(grep -n -B 4 dog file)$'\n'$(grep -n -C 1 monkey file)
var=$(sort -un <(echo "$var"))
Now echo "$var" will produce the same output as you would have gotten from your single command, plus line numbers and context indicators (the : prefix indicates a line that matched the pattern exactly, and the - prefix indicates a line included because of the -A, -B and/or -C options).
The reason I included the line numbers is to preserve the order of the results you would have seen had you managed to do this in one statement. If you like them, great, but if not, you can use the following line to cut them out:
var=$(cut -d: -f2- <(echo "$var") | cut -d- -f2-)
This passes it through once to cut the exact matching lines' prefixes, then again to cut the context lines' prefixes.
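For reference, the prefixed output looks something like this (file contents made up for illustration):
$ grep -n -A1 cat file
3:the cat sat on the mat
4-a context line after the match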
Pretty? No. But it works.
I'm afraid grep won't do that. You'll have to use a different tool. Perhaps write your own program.
Something like this would do it:
awk '
BEGIN { ARGV[ARGC++] = ARGV[1] }
function prtB(nr) { for (i=FNR-nr; i<FNR;    i++) print a[i] }
function prtA(nr) { for (i=FNR+1; i<=FNR+nr; i++) print a[i] }
NR==FNR  { a[NR]=$0; next }
/cat/    { print; prtA(5) }
/dog/    { prtB(4); print }
/monkey/ { prtB(1); print; prtA(1) }
' file
Check the math on the loops in the functions. You didn't say how you'd want to handle lines that contain monkey AND dog, for example.
EDIT: Here's an untested solution that would print the maximum context around any match, lets you specify the contexts on the command line, and won't use as much memory as the cheap and cheerful solution above:
awk -v cxts="cat:0:5\ndog:4:0\nmonkey:1:1" '
BEGIN {
    ARGV[ARGC++] = ARGV[1]
    numCxts = split(cxts,cxtsA,RS)
    for (i=1; i<=numCxts; i++) {
        regex = cxtsA[i]
        n = split(regex,rangeA,/:/)
        sub(/:[^:]+:[^:]+$/,"",regex)
        endA[regex]   = rangeA[n]
        startA[regex] = rangeA[n-1]
        regexA[regex]
    }
}
NR==FNR {
    for (regex in regexA) {
        if ($0 ~ regex) {
            start = NR - startA[regex]
            end   = NR + endA[regex]
            for (i=start; i<=end; i++) {
                prt[i]
            }
        }
    }
    next
}
FNR in prt
' file
Separate the searched-for patterns in the cxts variable with whatever your RS value is, a newline by default.

Better way of extracting data from file for comparison

Problem: Comparison of files from Pre-check status and Post-check status of a node for specific parameters.
With some help from the community, I have written the following solution, which extracts the information from files in the pre and post directories based on the "Node-ID" (which happens to be unique and also has to be extracted from the files). After extracting the data from the Pre/Post folders, I create folders based on the node-id and dump the extracted files into them.
My code to extract the data (the data is extracted from the Pre and Post folders):
FILES=$(find postcheck_logs -type f -name *.log)
for f in $FILES
do
    NODE=`cat $f | grep -m 1 ">" | awk '{print $1}' | sed 's/[>]//g'` ## Generate the node-id
    echo "Extracting Post check information for " $NODE
    mkdir temp/$NODE-post ## create a temp directory
    cat $f | awk 'BEGIN { RS=$NODE"> "; } /^param1/ { foo=RS $0; } END { print foo ; }' > temp/$NODE-post/param1.txt ## extract data
    cat $f | awk 'BEGIN { RS=$NODE"> "; } /^param2/ { foo=RS $0; } END { print foo ; }' > temp/$NODE-post/param2.txt
    cat $f | awk 'BEGIN { RS=$NODE"> "; } /^param3/ { foo=RS $0; } END { print foo ; }' > temp/$NODE-post/param3.txt
done
After this I have a structure as:
/Node1-pre/param1.txt
/Node1-post/param1.txt
and so on.
Now I am stuck on comparing the $NODE-pre and $NODE-post files.
I have tried to do it using recursive grep, but I am not finding a suitable way to do so. What is the best possible way to compare these files using diff?
Moreover, I find the above data extraction program very slow. I believe it's not the most efficient way (using the least resources) to do this. Any suggestions?
Look askance at any instance of cat one-file — you could use I/O redirection on the next command in the pipeline instead.
You can do the whole thing more simply with:
for f in $(find postcheck_logs -type f -name *.log)
do
    NODE=$(sed '/>/{ s/ .*//; s/>//g; p; q; }' $f) ## Generate the node-id
    echo "Extracting Post check information for $NODE"
    mkdir temp/$NODE-post
    awk -v NODE="$NODE" -v DIR="temp/$NODE-post" \
        'BEGIN { RS=NODE"> " }
         /^param1/ { param1 = $0 }
         /^param2/ { param2 = $0 }
         /^param3/ { param3 = $0 }
         END {
             print RS param1 > DIR "/param1.txt"
             print RS param2 > DIR "/param2.txt"
             print RS param3 > DIR "/param3.txt"
         }' $f
done
The NODE finding process is much better done by a single sed command than cat | grep | awk | sed, and you should plan to use $(...) rather than back-quotes everywhere.
The main processing of the log file should be done once; a single awk command is sufficient. The script is passed two variables: NODE and the directory name. The BEGIN is cleaned up; the $ before NODE was probably not what you intended. The main actions are very similar; each looks for the relevant parameter name and saves the line in an appropriate variable. At the end, it writes the saved values to the relevant files, decorated with the value of RS. Semicolons are only needed when there's more than one statement on a line; there's just one statement per line in this expanded script. It looks bigger than the original, but that's only because I'm using vertical space.
As to comparing the before and after files, you can do it in many ways, depending on what you want to know. If you've got a POSIX-compliant diff (you probably do), you can use:
diff -r temp/$NODE-pre temp/$NODE-post
to report on the differences, if any, between the contents of the two directories. Alternatively, you can do it manually:
for file in param1.txt param2.txt param3.txt
do
    if cmp -s temp/$NODE-pre/$file temp/$NODE-post/$file
    then : No difference
    else diff temp/$NODE-pre/$file temp/$NODE-post/$file
    fi
done
Clearly, you can wrap that in a 'for each node' loop. And, if you are going to need to do that, then you probably do want to capture the output of the find command in a variable (as in the original code) so that you do not have to repeat that operation.
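A rough sketch of that wrapping loop (assuming the node IDs can be recovered from the temp/<node>-pre directory names; adjust to however you track them):
# list the node IDs once from the pre-check directories
nodes=$(ls -d temp/*-pre | sed 's|^temp/||; s|-pre$||')
for NODE in $nodes
do
    echo "Comparing pre/post for $NODE"
    diff -r temp/$NODE-pre temp/$NODE-post
done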
