How to get md5 output but tab separated? - bash

I can use md5 -r foo.txt > md5.txt to create a text file with the md5 of the file, followed by a space and then the local path to that file. But how would I go about getting those two items separated by a TAB character?
For reference and context, the full command I'm using is
find . -type f -exec \
bash -c '
md=$(md5 -r "$0")
siz=$(wc -c <"$0")
echo -e "${md}\t${siz}"
' {} \; \
> listing.txt
Note that the filepath item of md5 output might also contain spaces, like ./path to file/filename, and these should not be converted to tabs.

sed is another option:
find directory/ -type f -exec md5 -r '{}' '+' | sed 's/ /\t/' > listing.txt
This will replace the first space on each line with a tab.
(Note that the file you're redirecting output to should not be in the directory tree being searched by find)
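One caveat worth noting: GNU sed understands \t in the replacement, but BSD/macOS sed (the one you are likely using alongside md5 -r) treats it as a literal t, so there you can pass a real tab instead, for example:
find directory/ -type f -exec md5 -r '{}' '+' | sed "s/ /$(printf '\t')/" > listing.txt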

Try the printf builtin and parameter expansion (P.E.) to split the md variable.
find . -type f -exec sh -c '
md=$(md5 -r "$0") siz=$(wc -c <"$0")
printf "%s\t%s\t%s\n" "${md%% *}" "${md#*"${md%% *}"}" "${siz}"
' {} \; > listing.txt
Output
d41d8cd98f00b204e9800998ecf8427e ./bar.txt 0
d41d8cd98f00b204e9800998ecf8427e ./foo.txt 0
d41d8cd98f00b204e9800998ecf8427e ./more.txt 0
d41d8cd98f00b204e9800998ecf8427e ./baz.txt 0
314a1673b94e05ed5d9757b6ee33e3b1 ./qux.txt 0
See the online bash manual on Parameter Expansion,
or the local man pages if available: PAGER='less +/^[[:space:]]*parameter\ expansion' man bash
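A quick illustration of the two expansions, using a hypothetical value of md:
md='d41d8cd98f00b204e9800998ecf8427e ./path to file/filename'
printf '%s\n' "${md%% *}"           # the hash: everything from the first space onward is stripped
printf '%s\n' "${md#*"${md%% *}"}"  # the remainder: the hash is stripped from the front (the leading space stays)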

Looks like you are simply left with spaces between the hash and file name that you don't want. A quick pass through awk can clean that up for you. By default, awk's input field delimiter is any amount of whitespace. Simply running the line through awk and printing the fields with a new OFS (output field separator) is all you need. In fact, it makes the pass through echo pointless.
time find . -type f -exec bash -c 'md=$(md5 -r "$0"); siz=$(wc -c <"$0"); awk -vOFS="\t" "{print \$1,\$2,\$3}" <<< "${md} ${siz}" ' {} \; > listing.txt
Personally, I would have run the output of that find command through a while loop. This is basically the same as above, but a little easier to follow.
time find . -type f | \
while read -r file; do
md=$(md5 -r "$file")
siz=$(wc -c < "$file")
awk -vOFS="\t" '{print $1,$2,$3}' <<< "${md} ${siz}"
done > listing.txt
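If the file paths could also contain newlines, a NUL-delimited variant of the same idea should work; this sketch additionally keeps paths with embedded spaces intact by splitting the md5 output with parameter expansion rather than awk:
find . -type f -print0 |
while IFS= read -r -d '' file; do
md=$(md5 -r "$file")
siz=$(wc -c < "$file")
printf '%s\t%s\t%s\n' "${md%% *}" "${md#* }" "$siz"
done > listing.txt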


For Loop: Identify Filename Pairs, Input to For Loop

I am attempting to adapt a previously answered question for use in a for loop.
I have a folder containing multiple paired file names that need to be provided sequentially as input to a for loop.
Example Input
WT1_0min-SRR9929263_1.fastq
WT1_0min-SRR9929263_2.fastq
WT1_20min-SRR9929265_1.fastq
WT1_20min-SRR9929265_2.fastq
WT3_20min-SRR12062597_1.fastq
WT3_20min-SRR12062597_2.fastq
Paired file names can be identified with the answer from the previous question:
find . -name '*_1.fastq' -exec basename {} '_1.fastq' \; | xargs -n1 -I{} echo {}_1.fastq {}_2.fastq
I now want to adopt this for use in a for loop so that each output file can be independently piped to subsequent commands, and also so that output file names can be appended.
Input files can be provided as a comma-separated list of files after the -1 and -2 flags respectively. So for this example, the bulk and undesired input would be:
-1 WT1_0min-SRR9929263_1.fastq,WT1_20min-SRR9929265_1.fastq,WT3_20min-SRR12062597_1.fastq
-2 WT1_0min-SRR9929263_2.fastq,WT1_20min-SRR9929265_2.fastq,WT3_20min-SRR12062597_2.fastq
However, I would like to run this as a for loop so that input files are provided sequentially:
Iteration #1
-1 WT1_0min-SRR9929263_1.fastq
-2 WT1_0min-SRR9929263_2.fastq
Iteration #2
-1 WT1_20min-SRR9929265_1.fastq
-2 WT1_20min-SRR9929265_2.fastq
Iteration #3
-1 WT3_20min-SRR12062597_1.fastq
-2 WT3_20min-SRR12062597_2.fastq
Below is an example of the for loop I would like to run, using the xargs code to pull filenames. It currently does not work. I assume I need to somehow save the paired filenames from the xargs code as a variable that can be referenced in the for loop?
find . -name '*_1.fastq' -exec basename {} '_1.fastq' \; | xargs -n1 -I{} echo {}_1.fastq {}_2.fastq
for file in *.fastq
do
bowtie2 -p 8 -x /path/genome \
1- {}_1.fastq \
2- {}_2.fastq \
"../path/${file%%.fastq}_UnMappedReads.fastq.gz" \
2> "../path/${file%%.fastq}_Bowtie2_log.txt" | samtools view -# 7 -b | samtools sort -# 7 -m 5G -o "../path/${file%%.fastq}_Mapped.bam"
done
The expected outputs for the example would be:
WT1_0min-SRR9929263_UnMappedReads.fastq.gz
WT1_20min-SRR9929265_UnMappedReads.fastq.gz
WT3_20min-SRR12062597_UnMappedReads.fastq.gz
WT1_0min-SRR9929263_Bowtie2_log.txt
WT1_20min-SRR9929265_Bowtie2_log.txt
WT3_20min-SRR12062597_Bowtie2_log.txt
WT1_0min-SRR9929263_Mapped.bam
WT1_20min-SRR9929265_Mapped.bam
WT3_20min-SRR12062597_Mapped.bam
I don't know what "bowtie2" or "samtools" are but best I can tell all you need is:
#!/usr/bin/env bash
for file1 in *_1.fastq; do
file2="${file1%_1.fastq}_2.fastq"
echo "$file1" "$file2"
done
Replace echo with whatever you want to do with that pair of files.
If you HAD to use find for some reason then it'd be:
#!/usr/bin/env bash
while IFS= read -r file1; do
file2="${file1%_1.fastq}_2.fastq"
echo "$file1" "$file2"
done < <(find . -type f -name '*_1.fastq' -print)
or if your file names can contain newlines then:
#!/usr/bin/env bash
while IFS= read -r -d $'\0' file1; do
file2="${file1%_1.fastq}_2.fastq"
echo "$file1" "$file2"
done < <(find . -type f -name '*_1.fastq' -print0)
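For the specific pipeline in the question, here is a sketch of how the pair loop might be wired into bowtie2/samtools ("/path/genome" and "../path/" are the question's placeholders, and --un-conc-gz is my assumption about how the unmapped-reads output is passed; adjust to your real invocation):
#!/usr/bin/env bash
for file1 in *_1.fastq; do
file2="${file1%_1.fastq}_2.fastq"
base="${file1%_1.fastq}"
bowtie2 -p 8 -x /path/genome \
-1 "$file1" \
-2 "$file2" \
--un-conc-gz "../path/${base}_UnMappedReads.fastq.gz" \
2> "../path/${base}_Bowtie2_log.txt" \
| samtools view -@ 7 -b \
| samtools sort -@ 7 -m 5G -o "../path/${base}_Mapped.bam"
done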

sed to replace string in file only displayed but not executed

I want to find all files with a certain name (Myfile.txt) that do not contain a certain string (my-wished-string), and then run sed to do a replacement in the found files. I tried with:
find . -type f -name "Myfile.txt" -exec grep -H -E -L "my-wished-string" {} + | sed 's/similar-to-my-wished-string/my-wished-string/'
But this only shows me all files with the wanted name that lack "my-wished-string"; it does not execute the replacement. Am I missing something here?
With a for loop and invoking a shell.
find . -type f -name "Myfile.txt" -exec sh -c '
for f; do
grep -H -E -L "my-wished-string" "$f" &&
sed -i "s/similar-to-my-wished-string/my-wished-string/" "$f"
done' sh {} +
You might want to add a -q to grep and -n to sed to silence the printing/output to stdout
You can do this by constructing two stacks; the first containing the files to search, and the second containing negative hits, which will then be iterated over to perform the replacement.
find . -type f -name "Myfile.txt" > stack1
while read -r line;
do
[ -z "$(sed -n '/my-wished-string/p' "${line}")" ] && echo "${line}" >> stack2
done < stack1
while read -r line;
do
sed -i "s/similar-to-my-wished-string/my-wished-string/" "${line}"
done < stack2
With some versions of sed, you can use -i to edit the file. But don't pipe the list of names to sed, just execute sed in the find:
find . -type f -name Myfile.txt -not -exec grep -q "my-wished-string" {} \; -exec sed -i 's/similar-to-my-wished-string/my-wished-string/g' {} \;
Note that any file which contains similar-to-my-wished-string also contains the string my-wished-string as a substring, so with these exact strings the command is a no-op, but I suppose your actual strings are different than these.
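As a quick dry run (not part of the answer above), you can swap the sed action for -print first, to see which files would be edited:
find . -type f -name Myfile.txt -not -exec grep -q "my-wished-string" {} \; -print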

Find and replace string and print file directory on change

I am using find and sed to replace a string in multiple files. Here is my script:
find ./ -type f -name "*.html" -maxdepth 1 -exec sed -i '' "s/${REPLACE_STRING}/${STRING}/g" {} \; -print
The -print always prints the file, no matter whether something was changed or not. What I would like is to see which files are changed. Ideally I would like the output to be something like this (as the files are changing):
/path/to/file was changed
- REPLACE STRING line 9 was changed
- REPLACE STRING line 12 was changed
- REPLACE STRING line 26 was changed
/path/to/file2 was changed
- REPLACE STRING line 1 was changed
- REPLACE STRING line 6 was changed
- REPLACE STRING line 36 was changed
Is there anyway of doing something like this?
Cool idea. I think -print is a dead end for the reason you mention, so it needs to be done in the exec. I think sed is also a dead end due to the challenge of printing to STDOUT as well as modifying the file. So a natural extension is to wrap some Perl around it.
What if this was your exec statement:
perl -p -i -e '$i=1 if not defined($i); print STDOUT "$ARGV, line $i: $_" if s/REPLACE_STRING/STRING/; $i++' {} \;
-p wraps the Perl statements in a standard while(<>) loop so the file is processed line by line just like sed.
-i does in-place replacement, just like sed.
-e means execute the following Perl statements.
if not defined is a sneaky way of initialising a line count variable, even though it's executed for every line.
STDOUT tells print to output to the console instead of the file.
$ARGV is the current filename, when reading from <>.
$_ is the line being processed.
if means the print only gets executed if a match is found.
For an input file text.txt containing:
line 1
token 2
line 3
token 4
line 5
The statement perl -p -i -e '$i=1 if not defined($i); print STDOUT "$ARGV, line $i: $_" if s/token/sub/; $i++' text.txt gives me:
text.txt, line 2: sub 2
text.txt, line 4: sub 4
Leaving text.txt containing:
line 1
sub 2
line 3
sub 4
line 5
So you don't get your introductory "file was changed" line, but for a one-liner I think it's a pretty good compromise.
Operating on a couple of files it looks like this:
find ./ -type f -name "*.txt" -maxdepth 1 -exec perl -p -i -e '$i=1 if not defined($i); print STDOUT "$ARGV, line $i: $_" if s/token/sub/; $i++' {} \;
.//text1.txt, line 2: sub 2
.//text1.txt, line 4: sub 4
.//text2.txt, line 1: sub 1
.//text2.txt, line 3: sub 3
.//text2.txt, line 5: sub 5
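To use the question's $REPLACE_STRING and $STRING instead of the hard-coded token/sub, one adaptation (a sketch; it assumes both variables are exported, and uses \Q...\E so the search string is treated literally) would be:
find ./ -type f -name "*.html" -maxdepth 1 -exec perl -p -i -e '$i=1 if not defined($i); print STDOUT "$ARGV, line $i: $_" if s/\Q$ENV{REPLACE_STRING}\E/$ENV{STRING}/; $i++' {} \;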
You could chain -exec actions and take advantage of the exit status. For example:
find . \
-maxdepth 1 \
-type f \
-name '*.html' \
-exec grep -Hn "$REPLACE_STRING" {} \; \
-exec sed -i '' "s/${REPLACE_STRING}/${STRING}/g" {} \;
This prints, for each matching file, the path, the line number and the line:
./file1.html:9:contents of line 9
./file1.html:12:contents of line 12
./file1.html:26:contents of line 26
./file2.html:1:contents of line 1
./file2.html:6:contents of line 6
./file2.html:36:contents of line 36
For files without a match, nothing else happens; for files with a match, the sed command will be called.
If you wanted output closer to what you have in your question, you could add a few actions:
find . \
-maxdepth 1 \
-type f \
-name '*.html' \
-exec grep -q "$REPLACE_STRING" {} \; \
-printf '%p was changed\n' \
-exec grep -n "$REPLACE_STRING" {} \; \
-exec sed -i '' "s/${REPLACE_STRING}/${STRING}/g" {} \; \
| sed -E "s/^([[:digit:]]+):.*/ - $REPLACE_STRING line \1 was changed/"
This now first checks if the file contains the string, silently, with grep -q, then prints the filename (-printf), then all the matching lines with line numbers (grep -n), then does the substitution with sed and finally modifies the output slightly with sed.
Since you're using sed -i '', I assume you're on macOS; I'm not sure if the stock find on there supports the printf option.
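If it doesn't, one portable substitute for that action (my sketch, not part of the original) is to let a tiny sh print the path instead:
-exec sh -c 'printf "%s was changed\n" "$1"' sh {} \;
used in place of the -printf '%p was changed\n' line; everything else stays the same.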
By now, we're pretty close to running a complex-ish script on each file that matches, so we might as well do that directly:
shopt -s nullglob
for f in ./*.html; do
if grep -q "$REPLACE_STRING" "$f"; then
printf '%s\n' "$f was changed"
grep -n "$REPLACE_STRING" "$f" \
| sed -E "s/^([[:digit:]]+):.*/ - $REPLACE_STRING line \1 was changed/"
sed -i '' "s/${REPLACE_STRING}/${STRING}/g" "$f"
fi
done
Replace your find+sed command:
find ./ -type f -name "*.html" -maxdepth 1 -exec sed -i '' "s/${REPLACE_STRING}/${STRING}/g" {} \; -print
with this GNU awk command (needs gawk for inplace editing):
gawk -i inplace -v old="$REPLACE_STRING" -v new="$STRING" '
FNR==1 { hdr=FILENAME " was changed\n" }
gsub(old,new) { printf "%s - %s line %d was changed\n", hdr, old, FNR | "cat>&2"; hdr="" }
1' *.html
You could also make it much more robust with awk than with sed if necessary, since awk can support literal strings while sed can't.
Alright, always defer to Ed's awk script for efficiency. But continuing with the sed-plus-helper-script approach, using a preliminary call to grep to determine whether your file contains the word to replace, you could use a short helper script that takes your ${REPLACE_STRING}, ${STRING} and the filename as its first three positional parameters, as follows:
Helper Script named helper.sh
#!/bin/sh
test -z "$1" && exit
test -z "$2" && exit
test -z "$3" && exit
findw="$1"
replw="$2"
fname="$3"
grep -q "$findw" "$fname" || exit
echo "$(readlink -f $fname) was changed"
grep -n "$findw" "$fname" | {
while read -r line; do
printf -- " - REPLACE STRING line %d was changed\n" "${line%%:*}"
done }
sed -i "s/$findw/$replw/g" "$fname"
Then your call to find could be, e.g.:
find . -type f -name "f*" -exec ./helper.sh "dog" "cat" '{}' \;
Example Use/Output
Starting with a couple of files named f containing:
$ cat f
my
dog
dog
has
fleas
In a file structure containing the script in the present directory with a subdirectory d1 and multiple copies of f, e.g.
$ tree .
.
├── d1
│   └── f
├── f
└── helper.sh
Running the script results in the following:
$ find . -type f -name "f*" -exec ./helper.sh "dog" "cat" '{}' \;
/tmp/tmp-david/f was changed
- REPLACE STRING line 2 was changed
- REPLACE STRING line 3 was changed
/tmp/tmp-david/d1/f was changed
- REPLACE STRING line 2 was changed
- REPLACE STRING line 3 was changed
and the contents of f are changed accordingly
$ cat f
my
cat
cat
has
fleas
If there is no search term found in any of the files located by find, the modification times on those files are left unchanged.
Now with all that in mind, if you have gawk available, follow Ed's advice, but -- you can do it with sed and a helper :)
Install Perl (it's easy and free), define your own strings in the bash shell, and test it here. The strings must be exported so the Perl one-liner can see them:
STRING=
REPLACE=
export STRING REPLACE
perl -e 'foreach my $f (`find . -maxdepth 1 -type f -iname "*.html"`) { chomp $f; open IH, "<", $f or die "Error $!"; print "Processing: $f\n"; my $t; while (<IH>) { my $s = $_; $t = s/$ENV{REPLACE}/$ENV{STRING}/; print "$s --> $_" if $t } print "Nothing replaced\n" if !$t; close IH }'
To truly edit the files in place, restructure it around perl -p -i -e as in the earlier answer: -i only applies to files read through Perl's <> loop, not to files opened by hand as above.

Looping over filtered find and performing an operation

I have a garbage dump of a bunch of Wordpress files and I'm trying to convert them all to Markdown.
The script I wrote is:
htmlDocs=($(find . -print | grep -i '.*[.]html'))
for html in "${htmlDocs[#]}"
do
P_MD=${html}.markdown
echo "${html} \> ${P_MD}"
pandoc --ignore-args -r html -w markdown < "${html}" | awk 'NR > 130' | sed '/<div class="site-info">/,$d' > "${P_MD}"
done
As far as I understand, the first line should be making an array of all html files in all subdirectories, then the for loop has a line to create a variable with the Markdown name (followed by a debugging echo), then the actual pandoc command to do the conversion.
One at a time, this command works.
However, when I try to execute it, OSX gives me:
$ ./pandoc_convert.command
./pandoc_convert.command: line 1: : No such file or directory
./pandoc_convert.command: line 1: : No such file or directory
o_0
Help?
There are many reasons why the script may fail, because the way you create the array is incorrect:
htmlDocs=($(find . -print | grep -i '.*[.]html'))
Arrays are assigned in the form: NAME=(VALUE1 VALUE2 ... ), where NAME is the name of the variable and VALUE1, VALUE2, and the rest are fields separated by characters that are present in the $IFS (internal field separator) variable. Suppose you find a file name with spaces. Then the expression will create separate items in the array.
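A minimal illustration of just that splitting, with a hypothetical file name:
a=($(printf '%s\n' 'a b c.html'))
printf '%s\n' "${#a[@]}"    # prints 3: the single name was split into three array items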
Another issue is that the expression doesn't handle globbing, i.e. file name generation based on the shell expansion of special characters such as *:
mkdir dir.html
touch \ *.html
touch a\ b\ c.html
a=($(find . -print | grep -i '.*[.]html'))
for html in "${a[#]}"; do echo ">>>${html}<<<"; done
Output
>>>./a<<<
>>>b<<<
>>>c.html<<<
>>>./<<<
>>>a b c.html<<<
>>>dir.html<<<
>>> *.html<<<
>>>./dir.html<<<
I know two ways to fix this behavior: 1) temporarily disable globbing, and 2) use the mapfile command.
Disabling Globbing
# Disable globbing, remember current -f flag value
[[ "$-" == *f* ]] || globbing_disabled=1
set -f
IFS=$'\n' a=($(find . -print | grep -i '.*[.]html'))
for html in "${a[#]}"; do echo ">>>${html}<<<"; done
# Restore globbing
test -n "$globbing_disabled" && set +f
Output
>>>./ .html<<<
>>>./a b c.html<<<
>>>./ *.html<<<
>>>./dir.html<<<
Using mapfile
mapfile was introduced in Bash 4. The command reads lines from the standard input into an indexed array:
mapfile -t a < <(find . -print | grep -i '.*[.]html')
for html in "${a[@]}"; do echo ">>>${html}<<<"; done
The find Options
The find command selects all types of nodes, including directories. You should use the -type option, e.g. -type f for files.
If you want to filter the result set with a regular expression, use the -regex option, or -iregex for case-insensitive matching:
mapfile -t a < <(find . -type f -iregex '.*\.html$')
for html in "${a[@]}"; do echo ">>>${html}<<<"; done
Output
>>>./ .html<<<
>>>./a b c.html<<<
>>>./ *.html<<<
echo vs. printf
Finally, don't use echo in new software. Use printf instead:
mapfile -t a < <(find . -type f -iregex '.*\.html$')
for html in "${a[@]}"; do printf '>>>%s<<<\n' "$html"; done
Alternative Approach
However, I would rather pipe the find output into a loop with read:
find . -type f -iregex '.*\.html$' | while read -r line
do
printf '>>>%s<<<\n' "$line"
done
In this example, the read command reads a line from the standard input and stores the value in the line variable.
Although I like the mapfile feature, I find the code with the pipe more clear.
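One caveat worth adding (not from the original answer): the piped loop runs in a subshell, so any variables you set inside it are gone when the loop finishes; the process-substitution form keeps them:
count=0
find . -type f -iregex '.*\.html$' | while read -r line; do
count=$((count + 1))
done
printf '%s\n' "$count"    # still 0: the loop ran in a subshell
count=0
while read -r line; do
count=$((count + 1))
done < <(find . -type f -iregex '.*\.html$')
printf '%s\n' "$count"    # the actual number of files found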
Try adding the bash shebang and setting IFS to handle spaces in folders and filenames:
#!/bin/bash
SAVEIFS=$IFS
IFS=$(echo -en "\n\b")
htmlDocs=($(find . -print | grep -i '.*[.]html'))
for html in "${htmlDocs[#]}"
do
P_MD=${html}.markdown
echo "${html} \> ${P_MD}"
pandoc --ignore-args -r html -w markdown < "${html}" | awk 'NR > 130' | sed '/<div class="site-info">/,$d' > "${P_MD}"
done
IFS=$SAVEIFS

Bash script to limit a directory size by deleting files accessed last

I had previously used a simple find command to delete tar files not accessed in the last x days (in this example, 3 days):
find /PATH/TO/FILES -type f -name "*.tar" -atime +3 -exec rm {} \;
I now need to improve this script by deleting in order of access date and my bash writing skills are a bit rusty. Here's what I need it to do:
check the size of a directory /PATH/TO/FILES
if size in 1) is greater than X size, get a list of the files by access date
delete files in order until size is less than X
The benefit here is for cache and backup directories, I will only delete what I need to to keep it within a limit, whereas the simplified method might go over size limit if one day is particularly large. I'm guessing I need to use stat and a bash for loop?
I improved brunner314's example and fixed the problems in it.
Here is a working script I'm using:
#!/bin/bash
DELETEDIR="$1"
MAXSIZE="$2" # in MB
if [[ -z "$DELETEDIR" || -z "$MAXSIZE" || "$MAXSIZE" -lt 1 ]]; then
echo "usage: $0 [directory] [maxsize in megabytes]" >&2
exit 1
fi
find "$DELETEDIR" -type f -printf "%T#::%p::%s\n" \
| sort -rn \
| awk -v maxbytes="$((1024 * 1024 * $MAXSIZE))" -F "::" '
BEGIN { curSize=0; }
{
curSize += $3;
if (curSize > maxbytes) { print $2; }
}
' \
| tac | awk '{printf "%s\0",$0}' | xargs -0 -r rm
# delete empty directories
find "$DELETEDIR" -mindepth 1 -depth -type d -empty -exec rmdir "{}" \;
Here's a simple, easy to read and understand method I came up with to do this:
DIRSIZE=$(du -s /PATH/TO/FILES | awk '{print $1}')
if [ "$DIRSIZE" -gt "$SOMELIMIT" ]
then
for f in `ls -rt --time=atime /PATH/TO/FILES/*.tar`; do
FILESIZE=`stat -c "%s" $f`
FILESIZE=$(($FILESIZE/1024))
DIRSIZE=$(($DIRSIZE - $FILESIZE))
if [ "$DIRSIZE" -lt "$LIMITSIZE" ]; then
break
fi
done
fi
I didn't need to use loops, just some careful application of stat and awk. Details and explanation below, first the code:
find /PATH/TO/FILES -name '*.tar' -type f \
| sed 's/ /\\ /g' \
| xargs stat -f "%a::%z::%N" \
| sort -r \
| awk -v maxsize="$X_SIZE" '
BEGIN{curSize=0; FS="::"}
{curSize += $2}
curSize > maxsize {print $3}
' \
| sed 's/ /\\ /g' \
| xargs rm
Note that this is one logical command line, but for the sake of sanity I split it up.
It starts with a find command based on the one above, without the parts that limit it to files older than 3 days. It pipes that to sed, to escape any spaces in the file names find returns, then uses xargs to run stat on all the results. The -f "%a::%z::%N" tells stat the format to use, with the time of last access in the first field, the size of the file in the second, and the name of the file in the third. I used '::' to separate the fields because it is easier to deal with spaces in the file names that way. Sort then sorts them on the first field, with -r to reverse the ordering.
Now we have a list of all the files we are interested in, in order from latest accessed to earliest accessed. Then the awk script adds up all the sizes as it goes through the list, and begins outputting names once the running total exceeds $X_SIZE. The files that are not output this way are the ones kept; the other file names go to sed again to escape any spaces and then to xargs, which runs rm on them.
