Find lines longer than x characters and truncate for display - bash

I want to run grep on HTML files to find lines longer than x characters and truncate the displayed output, using grep.
What I know
To find lines longer than 100 characters in HTML files:
find . -name '*.html' -print | xargs grep -on '.\{100\}'
To find lines matching title and limit the display to 40 characters on either side of the match:
find . -name '*.html' -print | xargs grep -onE '.{0,40}title.{0,40}'
What I don't know
How can I find lines that exceed 100 characters and then limit the display of those lines to 40 characters?
MVCE
I have a bunch of html files, which look like
$ cat 1.html
abcdefghijklmnopqrstuv12345675689
12345675689abcdefghijklmnopqrstuv
abcd1234
Now, I'd like to find lines longer than 20 characters, and then cut the display to 15 characters only.
Expected output with favoretti's solution:
$ find . -name '*.html' -print | xargs grep -on '.\{20\}' | cut -c -15
./1.html:1:abcd
./1.html:2:1234
./2.html:1:abcd
./2.html:2:1234

First of all, it's worth mentioning that unless you're very confident you can treat your "HTML" files as a series of line-separated records, you should probably be using an HTML-aware tool (either standalone, or included in a scripting language).
Since you mentioned Awk in an earlier comment:
find . -name '*.html' -exec awk '
length($0) > 20 { print FILENAME, substr($0, 1, 15) }' {} +
This matches lines with length greater than 20 and prints the first 15 characters of each. I put the file name at the start; you can remove that if you like.
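Against the sample 1.html from the MVCE above, the output would look like this (only the two lines of at least 20 characters match):
./1.html abcdefghijklmno
./1.html 12345675689abcd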
It's not clear whether you need find for a recursive search or not - if not, then you might be fine with letting the shell generate the list of files:
awk 'length($0) > 20 { print FILENAME, substr($0, 1, 15) }' *.html
And with globstar enabled (shopt -s globstar), you can use **/*.html for recursive matching in Bash.
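For example, a minimal sketch of the recursive variant under that option:
shopt -s globstar
awk 'length($0) > 20 { print FILENAME, substr($0, 1, 15) }' **/*.html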

If for some reason you want to just use grep
find . -name '*.html' -exec grep -oP '.{40}(?=.{60})' {} /dev/null \;
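Applied to the thresholds from the MVCE (lines of at least 20 characters, showing the first 15), the same lookahead idea would be, assuming a grep built with PCRE support (-P):
find . -name '*.html' -exec grep -onP '.{15}(?=.{5})' {} /dev/null \;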

The first grep works ok I suppose, so if you want to print out just 40 chars, pipe it through cut?
find . -name '*.html' -print | xargs grep -on '.\{100\}' | cut -c 1-40

Related

How to grep files in date order

I can list the Python files in a directory from most recently updated to least recently updated with
ls -lt *.py
But how can I grep those files in that order?
I understand one should never try to parse the output of ls as that is a very dangerous thing to do.
You may use this pipeline to achieve this with GNU utilities:
find . -maxdepth 1 -name '*.py' -printf '%T@:%p\0' |
sort -z -t : -rnk1 |
cut -z -d : -f2- |
xargs -0 grep 'pattern'
This will handle filenames with special characters such as space, newline, glob etc.
find finds all *.py files in the current directory and prints the modification time (epoch value) + : + filename + a NUL byte
the sort command performs a reverse numeric sort on the first column (the timestamp)
the cut command removes the 1st column (the timestamp) from the output
xargs -0 grep searches for the pattern in each file
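As a usage sketch (the pattern TODO is just a placeholder), the same pipeline with grep -l lists only the matching files, still in date order:
find . -maxdepth 1 -name '*.py' -printf '%T@:%p\0' |
sort -z -t : -rnk1 |
cut -z -d : -f2- |
xargs -0 grep -l 'TODO'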
There is a very simple way if you want to get, in chronological order, the list of files that hold the pattern:
grep -sil <searchpattern> <files-to-grep> | xargs ls -ltr
i.e. you grep e.g. "hello world" in *.txt; with -sil you make grep case insensitive (-i), suppress error messages (-s) and just list the matching files (-l); this list you then pass on to ls (via xargs), which prints the files in long format (-l), sorted by modification time (-t), oldest first (-r).
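For instance, with the example from that description:
grep -sil "hello world" *.txt | xargs ls -ltr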

Grep part of a file name and output to a .txt

I'm trying to grep a pattern (the first 8 characters) of all file names in a directory and output it to a .txt using this, but it's not working.
Why is that?
find . -type f -print | grep "^........" > test.txt
It still outputs the whole file name to the .txt.
No need to use grep at all; you can use the cut command to get the first N characters without pattern matching:
find . -type f -print | cut -c1-8 > test.txt
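To illustrate with hypothetical file names, cut simply keeps the first 8 characters of each path that find prints:
$ find . -type f -print
./reports/summary.csv
./notes.txt
$ find . -type f -print | cut -c1-8
./report
./notes.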
You're passing the output of the find command to grep, rather than passing the output as a list of files for grep to search. You can fix it with xargs like this:
find . -type f -print | xargs grep "^........" > test.txt

Count how many files contain a string in the last line

I want to count how many files in the current directory have the string "A" in the last line.
First solution: tail -n 1 * | grep \"A\" | wc -l
This works fine, but when there are more files it fails with bash: /usr/bin/tail: Argument list too long.
Is there a way to get around it?
Bonus points if I can also optionally get which files contain it.
EDIT: my folder contains 343729 files
EDIT2: #tso usefully pointed, in his comment, to the article "I'm getting 'Argument list too long'. How can I process a large list in chunks?".
RESULTS:
#tso solution for f in $(find . -type f); do tail -1 $f|grep \"A\"; done|wc -l takes about 20 minutes
#lars solution grep -P "\"A\"*\Z" -r . | wc -l takes about 20 minutes
#mklement0 solution printf '%s\0' * | xargs -0 sh -c 'tail -q -n 1 "$@" | grep \"A\"' - | wc -l takes about 10 minutes
#james solution (in the comments) for i in * ; do awk 'END{if(/a/)print FILENAME}' "$i" ; done takes about 25 minutes
#codeforester find . -type f -exec tail -n 1 -- {} + | grep -EB 1 '^[^=]+A' | grep -c '^==>' takes >20 minutes.
#mklement0 and #codeforester solutions also have the advantage that if I want to change the grep pattern, the second time I run them it takes almost zero time; I guess it's due to some sort of caching.
I've accepted #mklement0's answer as it seems to be the fastest, but I'd still like to mention #tso and #lars for their contributions and, based on my personal knowledge, an easier and adaptable solution.
xargs is able to overcome the max. command-line length limitation by efficiently batching the invocations into as few calls as possible.
The shell's builtins, such as printf, are not subject to the max. command-line length.
Knowing this, you can use the following approach (which assumes that your xargs implementation supports the -0 option for NUL-terminated input, and that your tail implementation supports multiple file operands and the -q option for suppressing filename headers.
Both assumptions hold for the GNU (Linux) and BSD/macOS implementations of these utilities):
printf '%s\0' * | xargs -0 sh -c 'tail -q -n 1 "$@" | grep \"A\"' - | wc -l
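The trailing - becomes $0 for the inline sh script, so the file names expanded by printf land in "$@". A minimal sketch of the same batching pattern, with wc -l standing in as a placeholder for any per-file command:
printf '%s\0' * | xargs -0 sh -c 'wc -l "$@"' -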
How about using find, tail, and grep this way? This will be more efficient than having to loop through each file. Also, tail -1 will just read the last line of each file and is hence very I/O efficient.
find . -maxdepth 1 -type f -exec tail -n 1 -- {} + | grep -EB 1 '^[^=]+A' | grep -c '^==>'
find will invoke tail -1 in batches, passing as many file names at a time as fit within the ARG_MAX limit
tail will print the last line of each file, preceded by a header of the form "==> file_name <=="
grep -EB 1 '^[^=]+A' will look for pattern A and fetch the previous line as well (it will exclude the file_name lines while looking for the match)
grep -c '^==>' will count the number of files with matching pattern
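For reference, tail's multi-file output looks like this (hypothetical files), which is why counting the '==>' header lines that survive the filter counts the matching files:
==> ./1.txt <==
last line of 1.txt

==> ./2.txt <==
last line of 2.txt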
If you don't need to know the name of the files having a match, but just get the count of files, you could do this:
find . -maxdepth 1 -type f -exec tail -q -n 1 -- {} + | grep -c 'A'
Using GNU awk:
$ cat foo
b
a
$ cat bar
b
b
$ awk 'ENDFILE{if(/a/){c++; print FILENAME}}END{print c}' *
foo
1
Try with find:
for f in $(find . -type f); do tail -1 "$f" | grep PATTERN; done | wc -l
If grep supports the -P option, this might work:
grep -P "A\Z" -r . | wc -l
See man pcrepattern. In short:
\Z matches at the end of the subject, and also before a newline at the end of the subject
\z matches only at the end of the subject
Try \Z and \z.
To see which files match, you would use only the grep part without the pipe to wc.
This will return the number of files:
grep -rlP "A\z" | wc -l
If you want to get the names then simply:
grep -rlP "A\Z"

Recursively find and open files

I want to search through all subdirectories and files to find files with a specific extension. When I find a file with the extension, I need to open it, find a specific string from within the file and store it within a txt file.
This is what I have so far for finding all of the correct files:
find . -name ".ext" ! -path './ExcludeThis*'
This is what I have for opening the file and getting the part of the file I want and storing it:
LINE=$(head .ext | grep search_string)
SUBSTR=$(echo $LINE | cut -f2 -d '"')
echo $SUBSTR >> results.txt
I am struggling with how to combine the two. I have looked at 'for f in **/*' and then running an if statement in there to see if it matches the .ext, removing the need for find altogether, but **/* seems to work on directories only and not files.
A break down of any solutions would be very much appreciated too, I am new to shell scripting. Thanks.
find -name "*.ext" \! -path './ExcludeThis*' -exec head -q '{}' \+ |
grep search_string | cut -f2 -d'"' >> results.txt
find explanation
find -name "*.ext" \! -path './ExcludeThis*' -exec head -q '{}' \+
For each file name matched, it executes head (with \+, the command line is built by appending each selected file name at the end, so the total number of invocations of the command is much smaller than the number of matched files).
Notice I replaced .ext with *.ext (the first form just matches a file named exactly .ext), and ! with \! (to protect it from interpretation by the shell).
The head option -q is necessary because that command prints headers when used with multiple files (due to \+ in this case).
In addition, if no path is given, the default (.) is used, i.e. find . -name is equivalent to find -name.
pipeline explanation
<find ... -exec head> | grep search_string | cut -f2 -d'"' >> results.txt
While head writes the lines (10 by default) of every file into the pipe, grep reads them.
If grep matches search_string in some of them, it writes those lines into the next pipe.
At the same time, cut takes the second field (delimited by ") of every matching line and appends it to results.txt.
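Note that the head-based approach only inspects the first 10 lines of each file; if the string can appear anywhere, a sketch of the same pipeline built on grep alone (keeping the asker's placeholder search_string and the exclusion) might be:
find . -name "*.ext" \! -path './ExcludeThis*' -exec grep -h search_string '{}' \+ | cut -f2 -d'"' >> results.txt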

identifying .mp3 not by name with shell script

I would like to write a script that finds duplicate mp3s by content and not by file name. I am wondering how one goes about examining a file's inner data for the sake of comparison.
Thank you.
cmp can be used to compare binary files.
cmp file1.mp3 file2.mp3
if [[ $? -eq 0 ]]; then echo "Matched"; fi
The cmp command returns 0 if the files are identical, 1 if they differ, and 2 if an error occurs.
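A slightly more compact sketch of the same check, using cmp's -s option to suppress output and testing the exit status directly:
if cmp -s file1.mp3 file2.mp3; then
    echo "Matched"
fi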
This first command line lists all files from the current directory having the same size and the same md5sum:
find . -type f -printf '%11s ' -exec md5sum '{}' ';' |
sort | uniq -w44 --all-repeated=separate
The second command line is
faster because it calculates md5sum solely for the files having the same size
more robust because it handles filenames having special characters such as 'space' or 'newline'
Therefore it is also more complex
find . -type f -printf '%11s %P\0' |
LC_ALL=C sort -z |
uniq -Dzw11 |
while IFS= read -r -d '' line
do
md5sum "${line:12}"
done |
uniq -w32 --all-repeated=separate |
tee duplicated.log
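The grouped output written to duplicated.log looks like this (hypothetical hashes and paths), with a blank line between each group of identical files:
6cbf7d8dca3b0e23c0e4c051f0f6c48b  album/track01.mp3
6cbf7d8dca3b0e23c0e4c051f0f6c48b  backup/track01-copy.mp3

9a0364b9e99bb480dd25e1f0284c8555  album/track02.mp3
9a0364b9e99bb480dd25e1f0284c8555  old/track02.mp3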
Some explanations
# Print file size/md5sum/name in one line (size aligned in 11 characters)
find . -printf '%11s ' -exec md5sum '{}' ';'
# Print duplicated lines considering the first 44 characters only
# 44 characters = size (11 characters) + one space + md5sum (32 characters)
uniq -w44 --all-repeated=separate
# Print size and path/filename terminated by a null character
find . -printf '%11s %P\0'
# Sort lines separated by a null character (-z) instead of a newline character
# based on native byte values (LC_ALL=C) instead of the locale
LC_ALL=C sort -z
# Read lines separated by null character
IFS= read -r -d '' line
# Skip the first 12 characters (size and space)
# in order to obtain the rest: path/filename
"${line:12}"
If the files are really byte-to-byte equivalent, you can start searching for files of the same size. If their size is the same, you can investigate further (e.g. compare their md5sum). If the files just contain the same song, but use a different codec/compression/whatever, bash is probably not the right tool for the task.
I use this script for my photo but it can be used for other files.
First I transfer pictures from my phone/camera to directory newfiles
Then I run this script from my pictures root directory
On detection of duplicated files, the script keeps one file and moves the other ones to directory ../garbage
The script preferentially moves files from newfiles.
Caution: This script does not compare file content, but it detects files having the same size & name (this is OK for camera files). My other answer is based on content comparison (md5sum).
#!/bin/bash
# If a file from directory 'newfiles' has the same size & name
# as another file from another directory,
# then move the file from 'newfiles' to 'garbage'
find newfiles/ -type f -printf '%s %f\n' |
while read SIZE f
do
find . -name "$f" -size ${SIZE}c |
grep -v 'newfiles' &&
find . -name "$f" -size ${SIZE}c -path '*newfiles*' -exec mv -v '{}' ../garbage ';' &&
echo
done
# Detect all other duplicated files
# Keep the first occurrence and move all the others to 'garbage'
find . -type f -printf '%s %f\n' |
LC_ALL=C sort | #LC_ALL=C disables locale => sort is faster
uniq -dc | #keep duplicates and count number of occurrences
while read n SIZE f
do
echo -e "\n_____ $n files\t$SIZE bytes\tname: $f"
find . -name "$f" -size ${SIZE}c |
tail -n +2 |   # skip the first occurrence and move the rest
xargs mv -v -t ../garbage
done
