How to know file name from a pipeline of commands

I search for some text in some file list. I have the following command to print these lines:
ls -1 *.log | xargs tail --lines=10000 | grep text_for_search
The output contains every occurrence of text_for_search, but it doesn't show which file each occurrence came from. How can I modify the command to include that information?
The log files are gigabytes in size, so it's essential to run tail --lines=10000 on each of them.

You could just use a loop instead, which will keep track of the file name for you:
for file in *.log; do
if tail --lines=-10000 "$file" | grep -q text_for_search; then
echo "$file"
fi
done
The -q switch to grep suppresses the output, returning a 0 (success) exit code if the pattern is matched.
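If you also want the matching lines themselves, each prefixed with the file name, the same loop can be extended. The following is a minimal sketch rather than a definitive implementation; the function name search_tails and its argument order are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name and argument order are illustrative assumptions).
# Usage: search_tails PATTERN LINES FILE...
# Searches only the last LINES lines of each FILE and prints "FILE:matching line".
search_tails() {
  local pattern=$1 lines=$2 file line
  shift 2
  for file in "$@"; do
    # tail reads only the end of the file, so large logs are not scanned in full
    tail -n "$lines" -- "$file" | grep -e "$pattern" |
      while IFS= read -r line; do
        printf '%s:%s\n' "$file" "$line"
      done
  done
}
```

It could be called as search_tails text_for_search 10000 *.log. Because the file name comes from the loop variable, it survives the pipeline without relying on grep to know where its input came from.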

You can use find command:
find . -name "*.log" -exec grep -H text_for_search '{}' \;
With -H, grep outputs the filename along with each matched line (when given a single file, grep omits the name by default). If you just need filenames, add the -l switch to the grep command.
'{}' - macro used for matched file name substitution in find's -exec command,
\; indicates end of arguments for command, called by exec

Replace your tail command with:
awk '{v[NR]=$0}END{for(i=(NR>10000?NR-9999:1);i<=NR;i++)print FILENAME,v[i]}'
This is a replacement for the tail command that also adds the file name at the beginning of each line. It assumes awk is run on one file at a time, since NR and the END block span all input files.

You must avoid parsing ls output and use shell's for loop to iterate through all *.log files:
for f in *.log; do
awk -v c=$(wc -l < "$f") 'NR>c-10000 && /text_for_search/{print FILENAME ":" $0}' "$f"
done
EDIT:
You can use awk to search through all *.log files:
awk 'NR>=10000 && /text_for_search/ {print FILENAME ":" $0}' *.log

Related

how to find every file in my repo that has a specific word in the last line?

In other words, how to combine tail and find/grep command in bash.
I want to find all files (including those in subdirectories) in my repo that have a specific word in the last line, say FIX. I tried grep -Rl "FIX" to list all files containing "FIX", but I don't know how to combine it with the tail command. Can anyone help?

Run tail on all the files at once and then grep the output for FIX. Since tail prepends each line with the corresponding file name when given multiple file names, that's all you have to do.
find -type f -exec tail -n1 {} + | grep FIX
Or use ** to find all files and subdirectories, then run tail on each of them one at a time:
shopt -s globstar
for file in **; do
[[ -f $file ]] && tail -n1 "$file" | grep -q FIX && echo "$file"
done
Or use find to find all matches and pipe it to a while read loop:
find -type f -print0 | while IFS= read -rd '' file; do
tail -n1 "$file" | grep -q FIX && echo "$file"
done
Or do the same thing but with -exec + and an explicit sub-shell:
find -type f -exec sh -c 'for file; do tail -n1 "$file" | grep -q FIX && echo "$file"; done' sh {} +

If you want to know if the last line matches a pattern, use sed and restrict the match to the last line with $. sed doesn't easily give a return value or do pretty printing of the filename like grep, but it gets the job done.
find . -exec sh -c "sed -n '$ { /FIX/p; }' {} | grep -q . " \; -print
Here, we use -n to suppress printing, and then print (with /p) only when the last line matches the pattern /FIX/. The output is piped to grep to get a return value that find uses to decide whether or not to -print the name.
Or, you can avoid using grep for the return by doing something like:
find . -exec awk 'END{ exit ! match($0, "FIX")}' {} \; -print

Using cat and grep commands in Bash

I'm having trouble with trying to achieve this bash command:
Concatenate all the text files in the current directory that have at least one occurrence of the word BOB (in any case) within the text of the file.
Is it correct to do this with the cat command and then use grep to find the occurrences of the word BOB?
cat grep -i [BOB] *.txt > catFile.txt

To handle filenames with whitespace characters correctly:
grep --null -l -i "BOB" *.txt | xargs -0 cat > catFile.txt

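Your issue was the need to pass grep's file names to cat via command substitution:
cat $(grep -l -i "BOB" *.txt) > catFile.txt
$(.....) runs the enclosed command and substitutes its output; because the result is split on whitespace, this only works for file names without spaces, and --null must not be used here
-l returns only filenames of the things that matched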

You could use find with -exec:
find -maxdepth 1 -name '*.txt' -exec grep -qi 'bob' {} \; \
-exec cat {} + > catFile.txt
-maxdepth 1 makes sure you don't search any deeper than the current directory
-name '*.txt' says to look at all files ending with .txt – for the case that there is also a directory ending in .txt, you could add -type f to only look at files
-exec grep -qi 'bob' {} \; runs grep for each .txt file found. If bob is in the file, the exit status is zero and the next directive is executed. -q makes sure the grep is silent.
-exec cat {} + runs cat on all the files that contain bob

You need to remove the square brackets...
grep -il "BOB" *

You can also use the following command that you must run from the directory containing your BOB files.
grep -il BOB *.in | xargs cat > BOB_concat.out
-i is an option used to set grep in case insensitive mode
-l will be used to output only the filename containing the pattern provided as argument to grep
*.in is used to find all the input files in the dir (should be adapted to your folder content)
then you pipe the result of the command to xargs in order to build the arguments that cat will use to produce your file concatenation.
HYPOTHESIS:
Your folder does only contain files without strange characters in their name (e.g. space)
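When file names may contain spaces and xargs -0 is not an option, a plain loop avoids passing names through a pipe at all. This is a hedged sketch, not the method from any answer above; the function name concat_matching is an assumption made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (the name is an illustrative assumption).
# Usage: concat_matching PATTERN FILE... > OUTPUT
# Writes the contents of every FILE that contains PATTERN (case-insensitively) to stdout.
concat_matching() {
  local pattern=$1 file
  shift
  for file in "$@"; do
    # -q: no output, only the exit status; -i: ignore case
    if grep -qi -e "$pattern" -- "$file"; then
      cat -- "$file"
    fi
  done
}
```

For example: concat_matching BOB *.txt > catFile.txt. If catFile.txt already exists it is matched by *.txt as well, so a different output name or directory may be safer.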

List files whose last line doesn't contain a pattern

The very last line of my file should be "#"
If I run tail -n 1 * | grep -L "#", the result is (standard input), obviously because it's being piped.
I was hoping for a grep solution, rather than reading the entire file and just searching the last line.

for i in *; do tail -n 1 "$i" | grep -q -v '#' && echo "$i"; done

You can use sed for that:
sed -n 'N;${/pattern/!p}' file
The above command prints all lines of file if its last line doesn't contain the pattern.
However, it looks like I misunderstood you: you only want to print the names of those files whose last line doesn't match the pattern. In this case I would use find together with the following (GNU) sed command:
find -maxdepth 1 -type f -exec sed -n '${/pattern/!F}' {} \;
The find command iterates over all files in the current folder and executes the sed command. $ marks the last line of input. If /pattern/ isn't found (!), then F prints the file name.
The solution above looks nice and executes fast, but it has a drawback: it does not print the names of empty files, since the last line is never reached and $ never matches.
For a stable solution I would suggest putting the commands into a script:
script.sh
#!/bin/bash
# Check whether the file is empty ...
if [ ! -s "$1" ] ; then
echo "$1"
else
# ... or if the last line contains a pattern
sed -n '${/pattern/!F}' "$1"
# If you don't have GNU sed you can use this
# (($(tail -n1 "$1" | grep -c pattern))) || echo "$1"
fi
make it executable
chmod +x script.sh
And use the following find command:
find -maxdepth 1 -type f -exec ./script.sh {} \;

Consider this one-liner:
while read name ; do tail -n1 "$name" | grep -q \# || echo "$name" does not contain the pattern ; done < <( find -type f )
It uses tail to get the last line of each file and grep to test that line against the pattern. Performance will not be the best on many files because two new processes are started in each iteration.
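The tail-and-grep approach, including the empty-file case raised earlier, can be sketched as a small function. This is an illustration under stated assumptions rather than a definitive implementation: the name list_missing_last_line is hypothetical, and the pattern is treated as a basic regular expression.

```shell
#!/usr/bin/env bash
# Hypothetical helper (the name is an illustrative assumption).
# Usage: list_missing_last_line PATTERN FILE...
# Prints each regular FILE whose last line does not match PATTERN.
# Empty files are reported too, since they have no last line to match.
list_missing_last_line() {
  local pattern=$1 file
  shift
  for file in "$@"; do
    [ -f "$file" ] || continue   # skip directories and other non-regular files
    if ! tail -n 1 -- "$file" | grep -q -e "$pattern"; then
      printf '%s\n' "$file"
    fi
  done
}
```

For example: list_missing_last_line '#' *. Only the last line of each file is read by grep, so the cost per file stays small even for large files.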

greping for a value != *.file extension

Not 100% on how to go about this.
Basically my app creates two file outputs.
file
file.ext
when searching through the directory it always returns both files. I wish only to work with
file
how can I do a grep for just the file? Or is grep even the correct command?
!grep *.ext
pseudo code ^

Why not just:
grep foobar file
or if you want to search your directory
find . -name 'file' | xargs -r grep foobar

You can try using grep 'file$', i.e. select lines ending with file and nothing after that. Alternatively you can use grep -v to invert results.
# (echo "file"; echo "file.txt") | grep file$
file
# (echo "file"; echo "file.ext") | grep -v '\.ext$'
file

I have no idea what you are trying to do with grep, but if you want to check whether a file exists, you do simply this:
if [ -r /some/path/file ]; then
echo "File is readable."
fi

Grep has a -v flag that inverses the meaning (looks for lines that do not contain the pattern):
ls /your/directory | grep -v '\.ext$'
This will exclude every filename ending with .ext.

Try
grep -v option
-v does an inverse match
For details do $ man grep

To generate a list of all files that do not match a pattern, use -not in find. So you want:
find . -not -name '*.ext'
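If the goal is to act on the files rather than just list them, a shell glob with a case test avoids parsing ls output. The following is a minimal sketch; the function name list_without_ext and its arguments are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name and arguments are illustrative assumptions).
# Usage: list_without_ext DIR EXT
# Prints every regular file in DIR whose name does not end in .EXT.
list_without_ext() {
  local dir=$1 ext=$2 path
  for path in "$dir"/*; do
    [ -f "$path" ] || continue     # skip directories and unmatched globs
    case $path in
      *."$ext") ;;                 # name ends in .EXT: ignore it
      *) printf '%s\n' "$path" ;;  # anything else is reported
    esac
  done
}
```

For example: list_without_ext . ext prints ./file but not ./file.ext.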

Delete all but the most recent X files in bash

Is there a simple way, in a pretty standard UNIX environment with bash, to run a command to delete all but the most recent X files from a directory?
To give a bit more of a concrete example, imagine some cron job writing out a file (say, a log file or a tar-ed up backup) to a directory every hour. I'd like a way to have another cron job running which would remove the oldest files in that directory until there are less than, say, 5.
And just to be clear: if there's only one file present, it should never be deleted.

The problems with the existing answers:
inability to handle filenames with embedded spaces or newlines.
in the case of solutions that invoke rm directly on an unquoted command substitution (rm `...`), there's an added risk of unintended globbing.
inability to distinguish between files and directories (i.e., if directories happened to be among the 5 most recently modified filesystem items, you'd effectively retain fewer than 5 files, and applying rm to directories will fail).
wnoise's answer addresses these issues, but the solution is GNU-specific (and quite complex).
Here's a pragmatic, POSIX-compliant solution that comes with only one caveat: it cannot handle filenames with embedded newlines - but I don't consider that a real-world concern for most people.
For the record, here's the explanation for why it's generally not a good idea to parse ls output: http://mywiki.wooledge.org/ParsingLs
ls -tp | grep -v '/$' | tail -n +6 | xargs -I {} rm -- {}
Note: This command operates in the current directory; to target a directory explicitly, use a subshell ((...)) with cd:
(cd /path/to && ls -tp | grep -v '/$' | tail -n +6 | xargs -I {} rm -- {})
The same applies analogously to the commands below.
The above is inefficient, because xargs has to invoke rm separately for each filename.
However, your platform's specific xargs implementation may allow you to solve this problem:
A solution that works with GNU xargs is to use -d '\n', which makes xargs consider each input line a separate argument, yet passes as many arguments as will fit on a command line at once:
ls -tp | grep -v '/$' | tail -n +6 | xargs -d '\n' -r rm --
Note: Option -r (--no-run-if-empty) ensures that rm is not invoked if there's no input.
A solution that works with both GNU xargs and BSD xargs (including on macOS) - though technically still not POSIX-compliant - is to use -0 to handle NUL-separated input, after first translating newlines to NUL (0x0) chars., which also passes (typically) all filenames at once:
ls -tp | grep -v '/$' | tail -n +6 | tr '\n' '\0' | xargs -0 rm --
Explanation:
ls -tp prints the names of filesystem items sorted by how recently they were modified, in descending order (most recently modified items first) (-t), with directories printed with a trailing / to mark them as such (-p).
Note: It is the fact that ls -tp always outputs file / directory names only, not full paths, that necessitates the subshell approach mentioned above for targeting a directory other than the current one ((cd /path/to && ls -tp ...)).
grep -v '/$' then weeds out directories from the resulting listing, by omitting (-v) lines that have a trailing / (/$).
Caveat: Since a symlink that points to a directory is technically not itself a directory, such symlinks will not be excluded.
tail -n +6 skips the first 5 entries in the listing, in effect returning all but the 5 most recently modified files, if any.
Note that in order to exclude N files, N+1 must be passed to tail -n +.
xargs -I {} rm -- {} (and its variations) then invokes rm on all these files; if there are no matches at all, xargs won't do anything.
xargs -I {} rm -- {} defines placeholder {} that represents each input line as a whole, so rm is then invoked once for each input line, but with filenames with embedded spaces handled correctly.
-- in all cases ensures that any filenames that happen to start with - aren't mistaken for options by rm.
A variation on the original problem, in case the matching files need to be processed individually or collected in a shell array:
# One by one, in a shell loop (POSIX-compliant):
ls -tp | grep -v '/$' | tail -n +6 | while IFS= read -r f; do echo "$f"; done
# One by one, but using a Bash process substitution (<(...)),
# so that the variables inside the `while` loop remain in scope:
while IFS= read -r f; do echo "$f"; done < <(ls -tp | grep -v '/$' | tail -n +6)
# Collecting the matches in a Bash *array*:
IFS=$'\n' read -d '' -ra files < <(ls -tp | grep -v '/$' | tail -n +6)
printf '%s\n' "${files[@]}" # print array elements
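The ls -tp pipeline explained above can be wrapped in a function that takes the directory and the number of files to keep. This is a hedged sketch under the same caveat (no newlines in file names); the function name prune_keep_newest and its argument order are assumptions made for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (name and argument order are illustrative assumptions).
# Usage: prune_keep_newest DIR KEEP
# Deletes all but the KEEP most recently modified regular files in DIR.
# Assumes no file name contains a newline.
prune_keep_newest() {
  local dir=$1 keep=$2
  (
    # Work inside a subshell so the caller's working directory is unchanged
    cd -- "$dir" || exit 1
    # Newest first, directories dropped, then skip the first KEEP entries
    ls -tp | grep -v '/$' | tail -n +"$((keep + 1))" |
      while IFS= read -r name; do
        rm -- "$name"
      done
  )
}
```

For example: prune_keep_newest /path/to 5. Replacing rm with echo in the loop gives a dry run that only prints what would be deleted.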

Remove all but 5 (or whatever number) of the most recent files in a directory.
rm `ls -t | awk 'NR>5'`

(ls -t|head -n 5;ls)|sort|uniq -u|xargs rm
This version supports names with spaces:
(ls -t|head -n 5;ls)|sort|uniq -u|sed -e 's,.*,"&",g'|xargs rm

Simpler variant of thelsdj's answer:
ls -tr | head -n -5 | xargs --no-run-if-empty rm
ls -tr displays all the files, oldest first (-t newest first, -r reverse).
head -n -5 displays all but the 5 last lines (ie the 5 newest files).
xargs rm calls rm for each selected file.

find . -maxdepth 1 -type f -printf '%T@ %p\0' | sort -r -z -n | awk 'BEGIN { RS="\0"; ORS="\0"; FS="" } NR > 5 { sub("^[0-9]*(.[0-9]*)? ", ""); print }' | xargs -0 rm -f
Requires GNU find for -printf, and GNU sort for -z, and GNU awk for "\0", and GNU xargs for -0, but handles files with embedded newlines or spaces.

All these answers fail when there are directories in the current directory. Here's something that works:
find . -maxdepth 1 -type f | xargs -x ls -t | awk 'NR>5' | xargs -L1 rm
This:
works when there are directories in the current directory
tries to remove each file even if the previous one couldn't be removed (due to permissions, etc.)
fails safe when the number of files in the current directory is excessive and xargs would normally screw you over (the -x)
doesn't cater for spaces in filenames (perhaps you're using the wrong OS?)

ls -tQ | tail -n+4 | xargs rm
List filenames by modification time, quoting each filename. Exclude first 3 (3 most recent). Remove remaining.
EDIT after helpful comment from mklement0 (thanks!): corrected -n+3 argument, and note this will not work as expected if filenames contain newlines and/or the directory contains subdirectories.

Ignoring newlines is ignoring security and good coding. wnoise had the only good answer. Here is a variation on his that puts the filenames in an array $x
while IFS= read -rd ''; do
x+=("${REPLY#* }");
done < <(find . -maxdepth 1 -printf '%T@ %p\0' | sort -r -z -n )

For Linux (GNU tools), an efficient & robust way to keep the n newest files in the current directory while removing the rest:
n=5
find . -maxdepth 1 -type f -printf '%T@ %p\0' |
sort -z -nrt ' ' -k1,1 |
sed -z -e "1,${n}d" -e 's/[^ ]* //' |
xargs -0r rm -f
For BSD, find doesn't have the -printf predicate, stat can't output NULL bytes, and sed + awk can't handle NULL-delimited records.
Here's a solution that doesn't support newlines in paths but that safeguards against them by filtering them out:
#!/bin/bash
n=5
find . -maxdepth 1 -type f ! -path $'*\n*' -exec stat -f '%.9Fm %N' {} + |
sort -nrt ' ' -k1,1 |
awk -v n="$n" -F'^[^ ]* ' 'NR > n {printf "%s%c", $2, 0}' |
xargs -0 rm -f
note: I'm using bash because of the $'\n' notation. For sh you can define a variable containing a literal newline and use it instead.
Solution for UNIX & Linux (inspired from AIX/HP-UX/SunOS/BSD/Linux ls -b):
Some platforms don't provide find -printf, nor stat, nor support NUL-delimited records with stat/sort/awk/sed/xargs. That's why using perl is probably the most portable way to tackle the problem, because it is available by default in almost every OS.
I could have written the whole thing in perl but I didn't. I only use it for substituting stat and for encoding-decoding-escaping the filenames. The core logic is the same as the previous solutions and is implemented with POSIX tools.
note: perl's default stat has a resolution of a second, but starting from perl-5.8.9 you can get sub-second resolution with the stat function of the module Time::HiRes (when both the OS and the filesystem support it). That's what I'm using here; if your perl doesn't provide it then you can remove the -MTime::HiRes=stat from the command line.
n=5
find . '(' -name '.' -o -prune ')' -type f -exec \
perl -MTime::HiRes=stat -le '
foreach (@ARGV) {
@st = stat($_);
if ( @st > 0 ) {
s/([\\\n])/sprintf( "\\%03o", ord($1) )/ge;
print sprintf( "%.9f %s", $st[9], $_ );
}
else { print STDERR "stat: $_: $!"; }
}
' {} + |
sort -nrt ' ' -k1,1 |
sed -e "1,${n}d" -e 's/[^ ]* //' |
perl -l -ne '
s/\\([0-7]{3})/chr(oct($1))/ge;
s/(["\n])/"\\$1"/g;
print "\"$_\"";
' |
xargs -E '' sh -c '[ "$#" -gt 0 ] && rm -f "$@"' sh
Explanations:
For each file found, the first perl gets the modification time and outputs it along with the encoded filename (each newline and backslash character is replaced with the literal \012 and \134, respectively).
Now each time/filename record is guaranteed to be single-line, so POSIX sort and sed can safely work with this stream.
The second perl decodes the filenames and escapes them for POSIX xargs.
Lastly, xargs calls rm to delete the files. The sh command is a trick that prevents xargs from running rm when there are no files to delete.

I realize this is an old thread, but maybe someone will benefit from this. This command will find files in the current directory:
for F in $(find . -maxdepth 1 -type f -name "*_srv_logs_*.tar.gz" -printf '%T@ %p\n' | sort -r -n | tail -n+5 | awk '{ print $2; }'); do rm $F; done
This is a little more robust than some of the previous answers, as it lets you limit the search to files matching an expression. First, find the files matching whatever conditions you want, and print them with their timestamps next to them.
find . -maxdepth 1 -type f -name "*_srv_logs_*.tar.gz" -printf '%T@ %p\n'
Next, sort them by the timestamps:
sort -r -n
Then, knock off the 4 most recent files from the list:
tail -n+5
Grab the 2nd column (the filename, not the timestamp):
awk '{ print $2; }'
And then wrap that whole thing up into a for statement:
for F in $(); do rm $F; done
This may be a more verbose command, but I had much better luck being able to target conditional files and execute more complex commands against them.

If the filenames don't have spaces, this will work:
ls -C1 -t| awk 'NR>5'|xargs rm
If the filenames do have spaces, something like
ls -C1 -t | awk 'NR>5' | sed -e "s/^/rm '/" -e "s/$/'/" | sh
Basic logic:
get a listing of the files in time order, one column
get all but the first 5 (n=5 for this example)
first version: send those to rm
second version: gen a script that will remove them properly

With zsh
Assuming you don't care about present directories and you will not have more than 999 files (choose a bigger number if you want, or create a while loop).
[ 6 -le `ls *(.)|wc -l` ] && rm *(.om[6,999])
In *(.om[6,999]), the . means files, the o means sort order up, the m means by date of modification (put a for access time or c for inode change), the [6,999] chooses a range of file, so doesn't rm the 5 first.

Adaptation of @mklement0's excellent answer with some parameters and without needing to navigate to the folder containing the files to be deleted...
TARGET_FOLDER="/my/folder/path"
FILES_KEEP=5
ls -tp "$TARGET_FOLDER"**/* | grep -v '/$' | tail -n +$((FILES_KEEP+1)) | xargs -d '\n' -r rm --
[Ref(s).: https://stackoverflow.com/a/3572628/3223785 ]
Thanks! 😉

I found an interesting command in the sed one-liners ("Delete last 3 lines") and found it a good fit for another way to approach this. The idea:
#!/bin/bash
# In the sed command, change the 2 to the number of files you wish to retain
cd /opt/depot
ls -1 MyMintFiles*.zip > BigList
sed -n -e :a -e '1,2!{P;N;D;};N;ba' BigList > DeList
for i in `cat DeList`
do
echo "Deleted $i"
rm -f $i
#echo "File(s) gonzo "
#read junk
done
exit 0

Removes all but the 10 most recent files:
ls -tr1 | head -n $(echo $(ls -1 | wc -l) - 10 | bc) | xargs rm
If there are fewer than 10 files, no file is removed and you will get:
error head: illegal line count -- 0

I needed an elegant solution for BusyBox (on a router); all the xargs or array solutions were useless to me, as no such commands are available there. find with mtime is not the proper answer, as we are talking about 10 items and not necessarily 10 days. Espo's answer was the shortest, cleanest, and likely the most universal one.
The errors with spaces and with no files to delete are both simply solved the standard way:
rm "$(ls -td *.tar | awk 'NR>7')" 2>&-
Bit more educational version: We can do it all if we use awk differently. Normally, I use this method to pass (return) variables from the awk to the sh. As we read all the time that can not be done, I beg to differ: here is the method.
Example for .tar files with no problem regarding the spaces in the filename. To test, replace "rm" with the "ls".
eval $(ls -td *.tar | awk 'NR>7 { print "rm \"" $0 "\""}')
Explanation:
ls -td *.tar lists all .tar files sorted by the time. To apply to all the files in the current folder, remove the "d *.tar" part
awk 'NR>7... skips the first 7 lines
print "rm \"" $0 "\"" constructs a line: rm "file name"
eval executes it
Since we are using rm, I would not use the above command in a script! Wiser usage is:
(cd /FolderToDeleteWithin && eval $(ls -td *.tar | awk 'NR>7 { print "rm \"" $0 "\""}'))
In the case of using ls -t command will not do any harm on such silly examples as: touch 'foo " bar' and touch 'hello * world'. Not that we ever create files with such names in real life!
Sidenote. If we wanted to pass a variable to the sh this way, we would simply modify the print (simple form, no spaces tolerated):
print "VarName="$1
to set the variable VarName to the value of $1. Multiple variables can be created in one go. This VarName becomes a normal sh variable and can be normally used in a script or shell afterwards. So, to create variables with awk and give them back to the shell:
eval $(ls -td *.tar | awk 'NR>7 { print "VarName=\""$1"\"" }'); echo "$VarName"

leaveCount=5
fileCount=$(ls -1 *.log | wc -l)
tailCount=$((fileCount - leaveCount))
# avoid negative tail argument
[[ $tailCount -lt 0 ]] && tailCount=0
ls -t *.log | tail -$tailCount | xargs rm -f

I made this into a bash shell script. Usage: keep NUM DIR where NUM is the number of files to keep and DIR is the directory to scrub.
#!/bin/bash
# Keep last N files by date.
# Usage: keep NUMBER DIRECTORY
echo ""
if [ $# -lt 2 ]; then
echo "Usage: $0 NUMFILES DIR"
echo "Keep last N newest files."
exit 1
fi
if [ ! -e "$2" ]; then
echo "ERROR: directory '$2' does not exist"
exit 1
fi
if [ ! -d "$2" ]; then
echo "ERROR: '$2' is not a directory"
exit 1
fi
pushd "$2" > /dev/null
# tail -n +K starts output at line K, so skipping NUMBER files needs K = NUMBER + 1
ls -tp | grep -v '/$' | tail -n +"$(($1 + 1))" | xargs -I {} rm -- {}
popd > /dev/null
echo "Done. Kept $1 most recent files in $2."
ls "$2"|wc -l

Modified version of the answer of @Fabien if you want to specify a path. Useful if you're running the script elsewhere.
ls -tr /path/foo/ | head -n -5 | xargs -I% --no-run-if-empty rm /path/foo/%
