Find two strings in a list of files and get filename [duplicate] - bash

I have the following files:
100005.txt 107984.txt 116095.txt 124152.txt 133339.txt 139345.txt 18147.txt 25750.txt 32647.txt 40390.txt 48979.txt 56502.txt 64234.txt 72964.txt 80311.txt 888.txt 95969.txt
100176.txt 108084.txt 116194.txt 124321.txt 133435.txt 139438.txt 18331.txt 25940.txt 32726.txt 40489.txt 49080.txt 56506.txt 64323.txt 73063.txt 80481.txt 88958.txt 9601.txt
100347.txt 108255.txt 116378.txt 124494.txt 133531.txt 139976.txt 18420.txt 26034.txt 32814.txt 40589.txt 49082.txt 56596.txt 64414.txt 73163.txt 80580.txt 89128.txt 96058.txt
100447.txt 108343.txt 116467.txt 124594.txt 133627.txt 140519.txt 18509.txt 26128.txt 32903.txt 40854.txt 49254.txt 56768.txt 64418.txt 73498.txt 80616.txt 89228.txt 96148.txt
100617.txt 108432.txt 11647.txt 124766.txt 133728.txt 14053.txt 1866.txt 26227.txt 32993.txt 41026.txt 49308.txt 56857.txt 6449.txt 73670.txt 80704.txt 89400.txt 96239.txt
10071.txt 108521.txt 116556.txt 124854.txt 133830.txt 141062.txt 18770.txt 26327.txt 33093.txt 41197.txt 49387.txt 57029.txt 64508.txt 7377.txt 80791.txt 89500.txt 96335.txt
100788.txt 10897.txt 116746.txt 124943.txt 133866.txt 141630.txt 18960.txt 2646.txt 33194.txt 41296.txt 4971.txt 57128.txt 64680.txt 73841.txt 80880.txt 89504.txt 96436.txt
Some of the files look like:
spec:
  annotations:
    name: "ubuntu4"
    labels:
      key: "cont_name"
      value: "ubuntuContainer4"
    labels:
      key: "cont_service"
      value: "UbuntuService4"
  task:
    container:
      image: "ubuntu:latest"
      args: "tail"
      args: "-f"
      args: "/dev/null"
      mounts:
        source: "/home/testVolume"
        target: "/opt"
  replicated:
    replicas: 1
I want to get every filename that contains ubuntu AND replicas.
I have tried awk '/ubuntu/ && /replicas/{print FILENAME}' *.txt but it doesn't seem to work for me.
Any ideas on how to fix this?

Grep can return a list of the files that match a string. You can nest that grep call so that you first get a list of files that match ubuntu, then use that list of files to get a list of files that match replicas.
grep -l replicas $( grep -l ubuntu *.txt )
This does assume that at least one file will match ubuntu. To get around that limitation, you can add a test for the existence of one file first, and then do the combined search:
grep -q ubuntu *.txt && grep -l replicas $( grep -l ubuntu *.txt )
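If any of the filenames could contain spaces or other special characters, a safer variant of the same idea (a sketch assuming GNU grep for -Z and GNU xargs for -r) passes the intermediate file list NUL-separated instead of relying on word splitting:
grep -lZ ubuntu *.txt | xargs -0 -r grep -l replicas
This also covers the "no ubuntu match" case, since xargs -r simply runs nothing when its input is empty.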

Check whether both strings appear in a given file by using a counter for each and then checking whether both were incremented. You can do this with BEGINFILE, available in GNU awk:
awk 'BEGINFILE {ub=0; re=0}
/ubuntu/ {ub++}
/replicas/ {re++}
(ub>0 && re>0) {print FILENAME; nextfile}' *.txt
This sets two counters to 0 when it starts to read a file, one for each string. When one of the patterns is found, its corresponding counter is incremented. The script keeps checking whether both counters have been incremented; if so, it prints the current filename, which is held in the FILENAME variable, and skips the rest of the file with nextfile, since there is no need to keep checking for the patterns.

awk '/ubuntu/ && /replicas/{print FILENAME}' *.txt
looks for both regexps on the same line. To find them both in the same file, but possibly on separate lines, with GNU awk's ENDFILE:
awk '/ubuntu/{u=1} /replicas/{r=1} ENDFILE{if (u && r) print FILENAME; u=r=0}' *.txt
or, more efficiently, adding gawk's nextfile construct and preferably switching to BEGINFILE (as @fedorqui already showed) instead of ENDFILE, since all that remains to do between file reads is to reset the two variables:
awk 'BEGINFILE{u=r=0} /ubuntu/{u=1} /replicas/{r=1} u && r{print FILENAME; nextfile}' *.txt
With other awks it'd be:
awk '
FNR==1{prt()} /ubuntu/{u=1} /replicas/{r=1} END{prt()}
function prt() {if (u && r) print fname; fname=FILENAME; u=r=0}
' *.txt

If no subdirectories have to be visited:
for f in *.txt
do
grep -q -m1 'ubuntu' "$f" && grep -q -m1 'replicas' "$f" && echo "found: $f"
done
or as oneliner:
for f in *.txt ; do grep -q -m1 'ubuntu' "$f" && grep -q -m1 'replicas' "$f" && echo "found: $f" ; done
The -q makes grep quiet, so the matches aren't displayed; the -m1 only searches for one match, so grep can report a match fast.
The && is short circuiting, so if the first grep doesn't find anything, the second isn't tried.
For working on the files further down the pipeline, you will of course eliminate the chatty "found: ".
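For example, a variant of the same loop that prints only the bare filenames (a sketch, with the variable quoted for safety):
for f in *.txt ; do grep -q -m1 'ubuntu' "$f" && grep -q -m1 'replicas' "$f" && printf '%s\n' "$f" ; done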

Related

Input folder / output folder for each file in AWK [duplicate]

I am trying to run (several) awk scripts over a list of files and would like to get an output file for each input file in a different folder. I have already tried several ways but cannot find the solution. The output in the output folder is always a single file called {} which includes the content of all files from the input folder.
Here is my code:
input_folder="/path/to/input"
output_folder="/path/to/output"
find $input_folder -type f -exec awk '! /rrsig/ && ! /dskey/ {print $1,";",$5}' {} >> $output_folder/{} \;
Can you please give me a hint about what I am doing wrong?
The code is called in a .sh script.
I'd probably opt for a (slightly) less complicated find | xargs, e.g.:
find "${input_folder}" -type f | xargs -r \
awk -v outd="${output_folder}" '
FNR==1 { close(outd "/" outf); outf=FILENAME; sub(/.*\//,"",outf) }
! /rrsig/ && ! /dskey/ { print $1,";",$5 > (outd "/" outf) }'
NOTE: the commas in $1,";",$5 will insert spaces between $1, ";" and $5; if the spaces are not desired then use $1 ";" $5 (i.e., remove the commas)
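A quick illustration of the difference (using a hypothetical input line, not one of the question's files):
echo "a b c d e" | awk '{print $1,";",$5}'   # prints: a ; e
echo "a b c d e" | awk '{print $1 ";" $5}'   # prints: a;e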

looping with grep over several files

I have multiple files /text-1.txt, /text-2.txt ... /text-20.txt
and what I want to do is to grep for two patterns and stitch them into one file.
For example:
I have
grep "Int_dogs" /text-1.txt > /text-1-dogs.txt
grep "Int_cats" /text-1.txt> /text-1-cats.txt
cat /text-1-dogs.txt /text-1-cats.txt > /text-1-output.txt
I want to repeat this for all 20 files above. Is there an efficient way to do this in bash, awk, etc.?
#!/bin/bash
count=1
next () {
[[ "${count}" -lt 21 ]] && main
[[ "${count}" -eq 21 ]] && exit 0
}
main () {
file="text-${count}"
grep "Int_dogs" "${file}.txt" > "${file}-dogs.txt"
grep "Int_cats" "${file}.txt" > "${file}-cats.txt"
cat "${file}-dogs.txt" "${file}-cats.txt" > "${file}-output.txt"
count=$((count+1))
next
}
next
grep has some features you seem not to be aware of:
grep can be launched on a list of files, but then the output is different:
For a single file, the output contains only the matching lines, as in this example:
cat text-1.txt
I have a cat.
I have a dog.
I have a canary.
grep "cat" text-1.txt
I have a cat.
For multiple files, the filename is also shown in the output. Let's add another text file:
cat text-2.txt
I don't have a dog.
I don't have a cat.
I don't have a canary.
grep "cat" text-*.txt
text-1.txt:I have a cat.
text-2.txt:I don't have a cat.
grep can be extended to search for multiple patterns in files, using the -E switch. The patterns need to be separated using a pipe symbol:
grep -E "cat|dog" text-1.txt
I have a cat.
I have a dog.
(Summary of the previous two points, plus the remark that grep -E is equivalent to egrep):
egrep "cat|dog" text-*.txt
text-1.txt:I have a cat.
text-1.txt:I have a dog.
text-2.txt:I don't have a dog.
text-2.txt:I don't have a cat.
So, in order to redirect this to an output file, you can simply say:
egrep "cat|dog" text-*.txt >text-1-output.txt
Assuming you're using bash.
Try this:
for i in $(seq 1 20) ;do rm -f text-${i}-output.txt ; grep -E "Int_dogs|Int_cats" text-${i}.txt >> text-${i}-output.txt ;done
Details
This one-line script does the following:
The original files are expected to have the following naming syntax:
text-<INTEGER_NUMBER>.txt - Example: text-1.txt, text-2.txt, ... text-100.txt.
The loop runs from 1 to <N>, where <N> is the number of files you want to process.
Warning: the rm -f text-${i}-output.txt command runs first and removes any existing output file, to ensure that a fresh output file is created on each run.
grep -E "Int_dogs|Int_cats" text-${i}.txt matches either string in the original file, and >> text-${i}-output.txt redirects all matched lines to a newly created output file carrying the same number as the original file. Example: for text-5.txt, the file text-5-output.txt will be created and will contain the matched lines (if any).
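If the original grouping matters (all Int_dogs lines first, then all Int_cats lines, as in the question's cat step), a minimal sketch that keeps that order per file could be:
for i in $(seq 1 20); do
  # run the two greps in the desired order and write both results to the per-file output
  { grep "Int_dogs" "text-${i}.txt"; grep "Int_cats" "text-${i}.txt"; } > "text-${i}-output.txt"
done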

Cleanup a path list in bash, removing all children paths (subdirectories)

I have a file containing a list of paths like the following:
/some/path
/some/path/file
/some/path/subpath/file
/some/otherpath
/some/otherpath/file
As of today, I iterate over that list, check whether the path exists, and if so, delete the path/file.
This works, but isn't very efficient, since once a directory is deleted I can safely assume that all of its children are deleted too.
I also happen to use that list as an rsync exclude list, which is quite CPU intensive because the list can be quite big.
I'd like to clean that list before using it, meaning that if /some/path is in the list, all children paths, i.e. /some/path/*, can be safely removed from the list.
The result list of the example above should look like
/some/path
/some/otherpath
The list is already sorted, meaning there won't be a case like
/some/path/file
/some/path
What's the fastest way to do so in GNU bash?
Thanks.
[EDIT]
Source list is generated as follows:
rsync creates a list of files in paths, using grep and sed to 'clean' rsync output
# rsync operation explanation
# (command || :) = Return code 0 regardless of command return code
# (grep -E "^-|^d|^l" || :) = Be sure line begins with '-' or 'd' or 'l' (rsync semantics for file, directory or symlink)
# (sed -r 's/^.{10} +[0-9,]+ [0-9/]{10} [0-9:]{8} //' || :) = Remove everything before timestamps
# (awk 'BEGIN { FS=" -> " } ; { print $1 }' || :) = Only show output before ' -> ' in order to remove symlink destinations
# (grep -v "^\.$" || :) = Removes line containing current directory sign '.'
rsync --list-only -rlptgoDE8 /path1 | (grep -E "^-|^d|^l" || :) |
(sed -r 's/^.{10} +[0-9,]+ [0-9/]{10} [0-9:]{8} //' || :) |
(awk 'BEGIN { FS=" -> " } ; { print $1 }' || :) |
(grep -v "^\.$" || :) | sort > /tmp/path1_list
rsync --list-only -rlptgoDE8 /path2 | (grep -E "^-|^d|^l" || :) |
(sed -r 's/^.{10} +[0-9,]+ [0-9/]{10} [0-9:]{8} //' || :) |
(awk 'BEGIN { FS=" -> " } ; { print $1 }' || :) |
(grep -v "^\.$" || :) | sort > /tmp/path2_list
comm -23 /tmp/path1_list /tmp/path2_list > final_list
Purpose of final_list is to have a list of files which are present in /path1 but not in /path2
[/EDIT]
[EDIT2]
I use rsync to create the file lists because I need to honor rsync exclusion patterns, which I can't with other utilities, hence the whole rsync decoding used for list generation.
The whole project is about stateful file synchronization, hosted at https://github.com/deajan/osync
[/EDIT2]
[EDIT3]
michael's answer based on awk works great, except for specific corner cases like:
/some/path
/some/path-whatever
/some/path/file
/some/path/subpath/file
/some/otherpath
/some/otherpath/file
Overall, I could "dedupe" some of my lists from 48k lines to 50. Not a perfect solution, but does the job so far.
[/EDIT3]
Considering that a sorted file of paths would have redundant paths listed after matching substrings, e.g.,
/one
/one/two # dupe, matches /one
/one/two/three # dupe, matches /one
/two/three
/two/three/four # dupe, matches /two/three
Then if you go through the file, and the current line contains the substring above it (or, specifically, the shortest substring above it), just skip those lines:
LC_COLLATE=C sort -u file.txt | awk '
BEGIN { prev="^dummy/" }
$0 ~ prev { print "# skip: " $0; next }
$0 !~ prev { print $0; prev="^"$0"/" }'
This prefixes lines to skip with # so you can see what's omitted; feel free to remove that once you verify this works as expected.
Notes:
I'm ignoring the difference between files and directories, because checking would be messy & a lot slower.
Also, /path/to/file should not cause /path/to/file2 to be skipped, even though the former is a substring of the latter; hence I use a regex like ^string/, treating every entry as a directory, to avoid that problem.
Technically, having both pattern matches is probably redundant, unless you need to tweak them.
Maybe make sure there are no blank lines or other oddities in the input, e.g., sort input.txt | grep '^/' | awk ...
I added LC_COLLATE=C so that /path/one/ sorts ahead of /path/one2, otherwise the original assumption above (sorted substrings indicate subdirs) doesn't hold. (It might not hold in other cases as well, but then there will at least be fewer duplicates, though still some.) I just noticed this problem right as I was posting, and perhaps you'll discover other corner cases, so please do test :-)
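For instance, run against the question's sample list (assuming it is saved as paths.txt, a name used here purely for illustration), the command above would print:
/some/otherpath
# skip: /some/otherpath/file
/some/path
# skip: /some/path/file
# skip: /some/path/subpath/file
Note that sort -u under LC_COLLATE=C places /some/otherpath before /some/path.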
You need two while loops. The outer while loop reads the base paths, and the inner while loop reads the lines to drop.
#! /bin/bash
exec 0< full-list
exec 1> reduced-list
read -r line1
while :; do
echo "$line1"
while :; do
if read -r line2; then
case $line2 in
"$line1"/*)
continue
;;
*)
line1=$line2
break
;;
esac
else
exit 0
fi
done
done
The above code assumes that there is at least one line in the full list. If this cannot be guaranteed, you have to add an additional check after the first read.
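For example, a minimal sketch of that guard (equivalent in effect to an if statement) is to bail out when the first read finds no input:
read -r line1 || exit 0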

Finding the file name in a directory with a pattern

I need to find the latest file - filename_YYYYMMDD in the directory DIR.
The below is not working, as the field position shifts each time because of the varying spaces (occurring mostly in the file size field, which differs every time).
Please suggest if there is another way.
report=`ls -ltr $DIR/filename_* 2>/dev/null | tail -1 | cut -d " " -f9`
You can use awk to print the last field, like below:
report=`ls -ltr $DIR/filename_* 2>/dev/null | tail -1 | awk '{print $NF}'`
cut may not be an option here.
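To illustrate why (using a hypothetical ls -l line where two spaces precede the size), the empty field created by consecutive spaces shifts the field numbering for cut, while awk's default field splitting is unaffected:
echo "-rw-r--r-- 1 user group  1234 Jan 01 10:00 filename_20120615" | cut -d " " -f9    # prints: 10:00
echo "-rw-r--r-- 1 user group  1234 Jan 01 10:00 filename_20120615" | awk '{print $NF}' # prints: filename_20120615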
If I understand correctly, you want to loop through each file in the directory and find the largest 'YYYYMMDD' value and the filename associated with it. You can use simple POSIX parameter expansion with substring removal to isolate the 'YYYYMMDD', compare it against a value initialized to zero, and update the latest variable to hold the largest 'YYYYMMDD' as you loop over all files in the directory. You can store the name of the file each time you find a larger 'YYYYMMDD'.
For example, you could do something like:
#!/bin/sh
name=
latest=0
for i in *; do
test "${i##*_}" -gt "$latest" && { latest="${i##*_}"; name="$i"; }
done
printf "%s\n" "$name"
Example Directory
$ ls -1rt
filename_20120615
filename_20120612
filename_20120115
filename_20120112
filename_20110615
filename_20110612
filename_20110115
filename_20110112
filename_20100615
filename_20100612
filename_20100115
filename_20100112
Example Use/Output
$ name=; latest=0; \
> for i in *; do \
> test "${i##*_}" -gt "$latest" && { latest="${i##*_}"; name="$i"; }; \
> done; \
> printf "%s\n" "$name"
filename_20120615
Where the script selects filename_20120615 as the file with the greatest 'YYYYMMDD' of all files in the directory.
Since only tools provided by the shell itself are used, the script doesn't need to spawn subshells for pipes or external utility calls.
Give it a test and let me know if that is what you intended, if your intent was different, or if you have any further questions.

find only the first file from many directories

I have a lot of directories:
13R
613
AB1
ACT
AMB
ANI
Each directory contains a lot of files:
20140828.13R.file.csv.gz
20140829.13R.file.csv.gz
20140830.13R.file.csv.gz
20140831.13R.file.csv.gz
20140901.13R.file.csv.gz
20131114.613.file.csv.gz
20131115.613.file.csv.gz
20131116.613.file.csv.gz
20131117.613.file.csv.gz
20141114.ab1.file.csv.gz
20141115.ab1.file.csv.gz
20141116.ab1.file.csv.gz
20141117.ab1.file.csv.gz
etc..
The purpose is to get the first file from each directory.
The result I expect is:
13R|20140828
613|20131114
AB1|20141114
which is the name of the directory, a pipe, and the date from the filename.
I guess I need find and head plus awk, but I can't make it work. I need your help.
Here is what I have tested:
for f in $(ls -1);do ls -1 $f/ | head -1;done
But the folder name is missing.
By "the first file", I mean the first file returned in alphabetical order within the folder.
Thanks.
You can do this with a Bash loop.
Given:
/tmp/test
/tmp/test/dir_1
/tmp/test/dir_1/file_1
/tmp/test/dir_1/file_2
/tmp/test/dir_1/file_3
/tmp/test/dir_2
/tmp/test/dir_2/file_1
/tmp/test/dir_2/file_2
/tmp/test/dir_2/file_3
/tmp/test/dir_3
/tmp/test/dir_3/file_1
/tmp/test/dir_3/file_2
/tmp/test/dir_3/file_3
/tmp/test/file_1
/tmp/test/file_2
/tmp/test/file_3
Just loop through the directories and form an array from a glob and grab the first one:
prefix="/tmp/test"
cd "$prefix"
for fn in dir_*; do
cd "$prefix"/"$fn"
arr=(*)
echo "$fn|${arr[0]}"
done
Prints:
dir_1|file_1
dir_2|file_1
dir_3|file_1
If your definition of 'first' is different from Bash's, just sort the array arr according to your definition before taking the first element.
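For instance, to pick the file with the numerically smallest leading date rather than relying on the glob's alphabetical order, a sketch (assuming bash 4+ for mapfile and filenames without embedded newlines) could replace arr=(*) with:
mapfile -t arr < <(printf '%s\n' * | sort -n)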
You can also do this with find and awk:
$ find /tmp/test -mindepth 2 -print0 | awk -v RS="\0" '{s=$0; sub(/[^/]+$/,"",s); if (s in paths) next; paths[s]; print $0}'
/tmp/test/dir_1/file_1
/tmp/test/dir_2/file_1
/tmp/test/dir_3/file_1
And insert a sort (or use gawk) to sort as desired
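For example (a sketch assuming GNU sort for the NUL-terminated -z option):
find /tmp/test -mindepth 2 -print0 | sort -z | awk -v RS="\0" '{s=$0; sub(/[^/]+$/,"",s); if (s in paths) next; paths[s]; print $0}'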
sort has a unique option (-u). Only the directory should be unique, so sort on the first field with -k1,1. The solution works when the list of files is already sorted.
printf "%s\n" */* | sort -k1,1 -t/ -u | sed 's#\(.*\)/\([0-9]*\).*#\1|\2#'
You will need to change the sed command if the date field can be directly followed by another number.
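For instance, since the dates here are always eight digits, a slightly stricter sed (a sketch) could match exactly eight digits for the date:
printf "%s\n" */* | sort -k1,1 -t/ -u | sed 's#\(.*\)/\([0-9]\{8\}\).*#\1|\2#'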
This works for me:
for dir in $(find "$FOLDER" -type d); do
FILE=$(ls -1 -p $dir | grep -v / | head -n1)
if [ ! -z "$FILE" ]; then
echo "$dir/$FILE"
fi
done
