Below is a bash script to move files around and rename them. The problem is it doesn't work when there is more than one file in the directory. I'm assuming because the last parameter in the mv command is a file. Any suggestions?
'#!/bin/bash'
'INPUTDIR="/home/southern-uniontn/S001007420"'
'OUTPUTDIR="/mnt/edi-06/southern-uniontn/flats-in"'
'BACKUPDIR="/backup/southern-uniontn/S001007420"'
YEAR=`date +%Y`
MONTH=`date +%m`
DAY=`date +%d`
HOUR=`date +%H`
MINUTE=`date +%M`
######## Do some error checking #########
# Does backup dir exist?
if [ ! -d $BACKUPDIR/$YEAR ]
then
mkdir $BACKUPDIR/$YEAR
fi
if [ ! -d $BACKUPDIR/$YEAR/$MONTH ]
then
mkdir $BACKUPDIR/$YEAR/$MONTH
fi
if [ ! -d $BACKUPDIR/$YEAR/$MONTH/$DAY ]
then
mkdir $BACKUPDIR/$YEAR/$MONTH/$DAY
fi
if [[ $(find $INPUTDIR -type f | wc -l) -gt 0 ]];
then
###### Rename the file, move it to Backup, then copy to the Output Directory #####
for f in $INPUTDIR/*
do
echo "`date` - Move recurring txt flat file to BackupDir for Union TN from Southern"
mv $INPUTDIR/* $BACKUPDIR/$YEAR/$MONTH/$DAY/UnionTN-S001007420-$YEAR$MONTH$DAY-$HOUR$MINUTE.txt
sleep 2
echo "`date` - Copy backup file to the Union TN Output Directory"
cp $BACKUPDIR/$YEAR/$MONTH/$DAY/UnionTN-S001007420-$YEAR$MONTH$DAY-$HOUR$MINUTE.txt $OUTPUTDIR/
done;
fi
Some notes:
Get out of the habit of using ALLCAPS variable names; leave those as reserved by the shell. One day you'll write PATH=something and then wonder why your script is broken.
mkdir -p can create parent directories, and will not error if the dir already exists
store the filenames in an array. Then the shell does not have to duplicate
the work, and you don't need to count how many there are: if there are no
files, the loop has zero iterations
if you want to keep the same directory hierarchy in the outputdir,
you need to do that by hand.
use read to get the date parts
with bash v4.2+, printf can be used instead of calling out to date
use magic value "-1" to mean "now".
printf '%(%Y-%m-%d)T\n' -1 prints "2021-10-25" (as of the day I write this)
This is, I think, what you want:
#!/bin/bash
inputdir='/home/southern-uniontn/S001007420'
outputdir='/mnt/edi-06/southern-uniontn/flats-in'
backupdir='/backup/southern-uniontn/S001007420'
read year month day hour minute < <(printf '%(%Y %m %d %H %M)T\n' -1)
# create backup dirs if not exists
date_dir="$year/$month/$day"
mkdir -p "$backupdir/$date_dir"
mkdir -p "$outputdir/$date_dir"
mapfile -t files < <(find "$inputdir" -type f)
for f in "${files[@]}"
do
###### Rename the file, move it to Backup, then copy to the Output Directory #####
backup_file="UnionTN-S001007420-$year$month$day-$hour$minute.txt"
printf '%(%c)T - Move recurring txt flat file to backupdir for Union TN from Southern\n' -1
mv "$f" "$backupdir/$date_dir/$backup_file"
printf '%(%c)T - Copy backup file to the Union TN Output Directory\n' -1
cp "$backupdir/$date_dir/$backup_file" "$outputdir/$date_dir/$backup_file"
done
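One thing to watch in the script above: every file is moved to the same $backup_file name, so with several input files each one overwrites the previous one. A minimal sketch of one way around that, assuming a per-file counter in the name is acceptable (the function name move_with_unique_names and the -N suffix are my own invention, not something from the question):

```bash
#!/bin/bash
# Hypothetical helper: move every regular file in $1 into $2, giving each
# one a distinct name so later files do not overwrite earlier ones.
move_with_unique_names() {
    local src=$1 dst=$2 stamp n=0 f
    stamp=$(date +%Y%m%d-%H%M)
    mkdir -p "$dst" || return
    for f in "$src"/*; do
        [ -f "$f" ] || continue      # also skips the literal pattern if the dir is empty
        n=$((n + 1))
        mv -- "$f" "$dst/UnionTN-S001007420-$stamp-$n.txt" || return
    done
    echo "$n"                        # number of files moved
}
```

Called as move_with_unique_names "$inputdir" "$backupdir/$date_dir", it prints how many files were moved.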
When mv is given more than one source, which is what a glob matching several files expands to, the last argument must be an existing directory, and all matching files are moved into that directory.
In your case,
mv $INPUTDIR/* $BACKUPDIR/$YEAR/$MONTH/$DAY/UnionTN-S001007420-$YEAR$MONTH$DAY-$HOUR$MINUTE.txt
tells mv to move every file matching $INPUTDIR/* into a directory named $BACKUPDIR/$YEAR/$MONTH/$DAY/UnionTN-S001007420-$YEAR$MONTH$DAY-$HOUR$MINUTE.txt. That path is not an existing directory, so mv fails as soon as the glob matches more than one file.
I'm not sure what you're trying to do, but I hope this helps.
Some more advice you could use:
Don't put the shebang (the first line, beginning with "#!") and the first three variable declarations inside single-quotes.
Some argue it is more portable and better to write /usr/bin/env bash instead of /bin/bash in the shebang
if [ CONDITION ] /then ACTION /fi statements can be simplified by writing [ CONDITION ] && ACTION
You reduce the likelihood of encountering unexpected behaviour by double-quoting your strings and variables (i.e. write "${year}/${month}/" instead of $year/$month).
No need to call mkdir a, followed by mkdir a/b, then mkdir a/b/c and so on; you can just call mkdir -p a/b/c. The -p flag tells mkdir to create parent directories if they don't already exist.
It is unnecessary to validate the existence of a directory before calling mkdir since mkdir already validates that for you.
As pointed out by commenters, all-caps variables are conventions for special POSIX related variables. You should use another type of casing.
You could use date to do the formatting for you: date +%Y/%m/%d will print 2021/10/25
Strings without interpolation can have single-quotes.
(Optional, prevent undesired behaviors) Put set -e at the beginning of your scripts, after the shebang, to tell bash to halt if an error is encountered
And finally, use man <command_name> for built-in documentation!
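A few of the points above (mkdir -p, letting date do the formatting, quoting) can be combined in a short sketch. The function name make_dated_dir is hypothetical and only for illustration:

```bash
#!/bin/bash
# Hypothetical helper: create <base>/YYYY/MM/DD in one step and print the path.
make_dated_dir() {
    local dir
    dir="$1/$(date +%Y/%m/%d)"       # date builds the year/month/day part
    mkdir -p "$dir" || return        # creates parents; no error if it already exists
    printf '%s\n' "$dir"
}
```

For example, backup_today=$(make_dated_dir "$backupdir") replaces the three if/mkdir blocks from the question.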
I have ca 270 .bz2 log files (25 days of logs) and one text file with ca 1500 usernames. What I need to do is find which of these users have logged in during the last 25 days. So I need to grep for the usernames in the list of files and stop grepping for a username as soon as it is found (at the first match).
My code works, but if in first file match found I do not need to process other files, break and search for another username, if it is found i.e. in third file, break and search for another username:
for i in $(cat /tmp/usernames.txt); do for j in $(ls *.bz2); do
bzgrep -o -m1 $i $j; done; done
Here, if a match is found in the first file, it breaks (-m1 flag) and starts searching for the same username in the second file, but I do not need that anymore.
Problem: I need to inspect users who are not logged in in last 25 days. So I can reduce their permissions in the application. If user is logged at least once in last 25 days, I do not reduce his permissions.
Question is: I need to find which of these usernames exist in my log files. If a username is found in one of the files at least once, stop searching for this user and start searching for the next user.
Example: if user1 is found in file1, print it and stop searching for this user any more in this or other files. If user2 is found in file8, print it one time and stop searching in file9, file10, file11 ... file250. Hope it makes sense.
Can't you just do this to get the list of user names that appear in any of the bzipped files:
bzgrep -o -w -F -f /tmp/usernames.txt *.bz2 | sort -u
and then a diff of that output against usernames.txt to see who has/hasn't logged in? Wrap it in a loop if it turns out to be more efficient to check one .bz2 file at a time:
for file in *.bz2; do
bzgrep -o -w -F -f /tmp/usernames.txt "$file"
done | sort -u
and you could remove found user names from each iteration if that improves performance too:
sort -u /tmp/usernames.txt > /tmp/names.txt
for file in *.bz2; do
bzgrep -o -w -F -f /tmp/names.txt "$file" | sort -u > /tmp/found.txt &&
comm -23 /tmp/names.txt /tmp/found.txt > /tmp/left.txt &&
mv /tmp/left.txt /tmp/names.txt &&
cat /tmp/found.txt
[[ -s /tmp/names.txt ]] || break
done
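The "diff against usernames.txt" step mentioned earlier in this answer is not spelled out, so here is a minimal sketch of it using comm. The function name inactive_users and the idea of saving the bzgrep output to a file first are my assumptions:

```bash
#!/bin/bash
# Hypothetical helper: print the users listed in $1 that do not appear in $2.
# $1: full user list, $2: names found in the logs (one name per line each).
inactive_users() {
    local found_sorted
    found_sorted=$(mktemp) || return
    sort -u "$2" > "$found_sorted"
    # comm -23 keeps the lines that are only in the first (sorted) input
    sort -u "$1" | comm -23 - "$found_sorted"
    rm -f "$found_sorted"
}
```

With the found names saved in, say, /tmp/seen.txt, inactive_users /tmp/usernames.txt /tmp/seen.txt lists the accounts whose permissions could be reduced.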
You could use a conditional:
if [ -n "$var" ]; then
echo "Match!"
break
fi
This structure means that the conditional is True only when $var is not empty. The loop will stop when the condition becomes True.
Good luck!
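To make the break idea concrete, here is a rough sketch of the nested loop with the break placed so that it stops at the first file containing the user. It uses plain grep on text files so it can be tried anywhere; for the .bz2 logs you would presumably swap in bzgrep. The function name users_seen is hypothetical:

```bash
#!/bin/bash
# Hypothetical helper: print each user from list $1 that appears in any of
# the remaining arguments, checking files in order, stopping at the first hit.
users_seen() {
    local list=$1 user file
    shift
    while IFS= read -r user; do
        [ -n "$user" ] || continue               # ignore blank lines in the list
        for file in "$@"; do
            if grep -q -w -F -- "$user" "$file"; then
                printf '%s\n' "$user"
                break                            # skip the remaining files for this user
            fi
        done
    done < "$list"
}
```

Usage would look like users_seen /tmp/usernames.txt file1 file2 file3.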
If disk space isn't a concern I would ask bzip2 to decompress all the archives to a single file and invoke grep -m1 on that file for each username :
bzcat *.bz2 > merged
while IFS='' read -r username; do
grep -om1 "$username" merged
done < /tmp/usernames.txt
rm merged
I am new to Unix scripting. I have been trying to write this script for a week but couldn't get it working. Please help me with this.
I have more than 100 files in a directory, all with different names, and each filename contains a date string (e.g. 20171101). I want to compare the date in each filename with my input date (today minus 10 days, e.g. 20171114), using only the date string in the filename; if the file's date is earlier than my input date, I have to delete the file. Could anyone please help with this? Thanks
My script:
ten_days_ago=$(date -d "10 days ago" +%Y%m%d)
cd "$destination_dir" ;
ls *.* | awk -F '-' '{print $2}'
ls *.* | awk -F '-' '{print $2}' > removal.txt
while read filedate
do
if [ "$filedate" -lt "$ten_days_ago" ] ; then
cd "$destination_dir" ;
rm *-"$filedate"*
echo "deletion done"
fi
done <removal.txt
This script is working fine, but I need to send an email as well: a "pass" email if the deletion has been done, otherwise a "fail" email.
But if I write the email code inside the while loop, it will be sent on every iteration.
You're probably trying to pipe to mail from the middle of your loop. (Your question should really show this code, otherwise we can't say what's wrong.) A common technique is to redirect the loop's output to a file, and then send that. (Using a temporary file is slightly ugly, but avoids sending an empty message when there is no output from the loop.)
Just loop over the files and decide which to remove.
#!/bin/bash
t=$(mktemp -t tendays.XXXXXXXX) || exit
# Remove temp file if interrupted, or when done
trap 'rm -f "$t"' EXIT HUP INT TERM
ten_days_ago=$(date -d "10 days ago" +%Y%m%d)
for file in *-[1-9]*[1-9]-*; do
date=${file#*-} # strip prefix up through first dash
date=${date%-*} # strip from last dash from the previous result
if [ "$date" -lt "$ten_days_ago" ]; then
rm -v "$file"
fi
done >"$t" 2>&1
test -s "$t" || exit # Quit if empty
mail -s "Removed files" recipient@example.net <"$t"
I removed the (repeated!) cd so this can be run in any directory -- just switch to the directory you want before running the script. This also makes it easier to test in a directory with a set of temporary files.
Collecting the script's standard error also means the mail message will contain any error messages if rm fails for some reason or you have other exceptions.
By the by you should basically never use ls in scripts.
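The two parameter expansions in the loop do the real work, so here they are in isolation. This is only a sketch, assuming names shaped like prefix-YYYYMMDD-rest with exactly two dashes; the function names are made up for the example:

```bash
#!/bin/bash
# Hypothetical helper: print the date field of a name like prefix-YYYYMMDD-rest.
date_from_name() {
    local d
    d=${1#*-}            # drop everything up to and including the first dash
    d=${d%-*}            # drop the last dash and everything after it
    printf '%s\n' "$d"
}

# Succeeds when the date in file name $1 is earlier than cutoff $2 (YYYYMMDD).
is_older_than() {
    [ "$(date_from_name "$1")" -lt "$2" ]
}
```

For example, date_from_name report-20171101-final.txt prints 20171101, and is_older_than report-20171101-final.txt 20171114 succeeds.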
I have a directory that has thousands of files in it with various extensions. I also have a drop location where users drop files to be migrated to this directory. I'm looking for a script that will scan the target directory for a duplicate file name, if found, rename the file in the drop folder, then move it to the target directory.
Example:
/target/file.doc
/drop/file.doc
Script will rename file.doc to file1.doc then move it to /target/.
It needs to maintain the file extension too.
for path in /drop/*
do
    fil=$(basename "$path")     # the glob yields full paths; keep only the name
    if test -f "/target/$fil"
    then
        # assumes the name has an extension, e.g. file.doc
        suff=$(awk -F. '{ print "." $NF }' <<<"$fil")
        bdot=$(basename -s "$suff" "$fil")
        mv "/drop/$fil" "/drop/${bdot}1$suff"
        cp "/drop/${bdot}1$suff" "/target/${bdot}1$suff"
    fi
done
Take each file in the drop directory and check whether a file of the same name exists in /target using test -f. If it does, rename it in the drop directory and then copy it to the target.
You have to take a bit more care than simply checking if a file exists before moving in order to provide a flexible solution that can handle files with or without extensions. You also may want to provide a way of forming duplicate filenames that preserves sort order. e.g. if file.txt already exists, you may want to use file_001.txt as the duplicate in target rather than file1.txt as when you reach 10 you will no longer have a canonical sort by filename.
Also, you never want to iterate with for i in $(ls dir); that is fraught with pitfalls. See Bash Pitfalls No. 1
Putting those pieces together, and including detail in the comments below, you could do something similar to the following and have a reasonably flexible solution allowing you to specify only the filename.ext to move or /path/to/drop/filename.ext. You must specify the drop and target directories in the script to suit your circumstances, e.g.
#!/bin/bash
tgt=target ## set target and drop directories as required
drp=drop
declare -i cnt=1 ## counter for filename_$cnt
test -z "$1" && { ## validate one argument given
printf "error: insufficient input\nusage: %s filename\n" "${0##*/}"
exit 1
}
test -w "$1" || test -w "$drp/$1" || { ## validate valid filename is writeable
printf "error: file not found or lack permission to move '%s'.\n" "$1"
exit 1
}
fn="${1##*/}" ## strip any path info from filename
if test "$1" != "${1%.*}" ; then
ext="${fn##*.}" ## get file extension
fnwoe="${fn%."$ext"}" ## get filename without extension
test "$fnwoe" = '' && ext= ## was a dotfile, reset ext
fi
vfn="$fn" ## set valid filename = filename
## form valid filename e.g. "$fn_001.$ext" if duplicate found
while test -e "$tgt/$vfn"; do
if test -n "$ext" ## did we have an extension?
then
printf -v vfn "%s_%03d.%s" "$fnwoe" "$((cnt++))" "$ext"
else
printf -v vfn "%s_%03d" "$fn" "$((cnt++))"
fi
done
mv "$drp/$fn" "$tgt/$vfn" ## move file under non-conflicting name
Example drop and target
$ ls -1 drop
file
file.txt
$ ls -1 target
file.txt
file_001.txt
file_002.txt
Example Use
$ bash mvdrop.sh file
$ bash mvdrop.sh drop/file.txt
Resulting drop and target
$ ls -1 drop
$ ls -1 target
file
file.txt
file_001.txt
file_002.txt
file_003.txt
This will test to see if it exists, preserve the extension (along with any structure before the extension such as in the case of FILE.tar.gz), and move it to the target directory.
#!/bin/bash
TARGET="/target/"
DROP="/drop/"
for F in `ls $DROP`; do
if [[ -f $TARGET$F ]]; then
EXT=`echo $F | awk -F "." '{print $NF}'`
PRE=`echo $F | awk -F "." '{$NF="";print $0}' | sed -e 's/ $//g;s/ /./g'`
mv $DROP$F $DROP$PRE"1".$EXT
F=$PRE"1".$EXT
fi
mv $DROP$F $TARGET
done
Additionally you may want to do some restricting in the ls command, so that you aren't moving entire directories.
Display only regular files (no directories or symbolic links)
ls -p $DROP | grep -v /
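Since parsing ls output breaks on names with spaces, the same filtering can be sketched with a glob and a file test instead. This is an alternative to the ls pipeline, not what the answer above uses, and the function name list_regular_files is hypothetical:

```bash
#!/bin/bash
# Hypothetical helper: print the names of regular files directly inside $1.
list_regular_files() {
    local path
    for path in "$1"/*; do
        # skip directories, symbolic links, and the unexpanded pattern
        [ -f "$path" ] && [ ! -L "$path" ] || continue
        printf '%s\n' "${path##*/}"      # strip the directory part
    done
}
```

In a script you would normally loop over the glob directly rather than over this function's output; it is written as a function here only to show the test.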
I am trying to grep for a pattern in a dozen .tar.gz files, but it's very slow.
I am using:
tar -ztf file.tar.gz | while read FILENAME
do
if tar -zxf file.tar.gz "$FILENAME" -O | grep "string" > /dev/null
then
echo "$FILENAME contains string"
fi
done
If you have zgrep you can use
zgrep -a string file.tar.gz
You can use the --to-command option to pipe files to an arbitrary script. Using this you can process the archive in a single pass (and without a temporary file). See also this question, and the manual.
Armed with the above information, you could try something like:
$ tar xf file.tar.gz --to-command "awk '/bar/ { print ENVIRON[\"TAR_FILENAME\"]; exit }'"
bfe2/.bferc
bfe2/CHANGELOG
bfe2/README.bferc
I know this question is 4 years old, but I have a couple different options:
Option 1: Using tar --to-command grep
The following line will look in example.tgz for PATTERN. This is similar to @Jester's example, but I couldn't get his pattern matching to work.
tar xzf example.tgz --to-command 'grep --label="$TAR_FILENAME" -H PATTERN ; true'
Option 2: Using tar -tzf
The second option is using tar -tzf to list the files, then go through them with grep. You can create a function to use it over and over:
targrep () {
for i in $(tar -tzf "$1"); do
results=$(tar -Oxzf "$1" "$i" | grep --label="$i" -H "$2")
echo "$results"
done
}
Usage:
targrep example.tar.gz "pattern"
Both the below options work well.
$ zgrep -ai 'CDF_FEED' FeedService.log.1.05-31-2019-150003.tar.gz | more
2019-05-30 19:20:14.568 ERROR 281 --- [http-nio-8007-exec-360] DrupalFeedService : CDF_FEED_SERVICE::CLASSIFICATION_ERROR:408: Classification failed even after maximum retries for url : abcd.html
$ zcat FeedService.log.1.05-31-2019-150003.tar.gz | grep -ai 'CDF_FEED'
2019-05-30 19:20:14.568 ERROR 281 --- [http-nio-8007-exec-360] DrupalFeedService : CDF_FEED_SERVICE::CLASSIFICATION_ERROR:408: Classification failed even after maximum retries for url : abcd.html
If this is really slow, I suspect you're dealing with a large archive file. It's going to uncompress it once to extract the file list, and then uncompress it N times--where N is the number of files in the archive--for the grep. In addition to all the uncompressing, it's going to have to scan a fair bit into the archive each time to extract each file. One of tar's biggest drawbacks is that there is no table of contents at the beginning. There's no efficient way to get information about all the files in the archive and only read that portion of the file. It essentially has to read all of the file up to the thing you're extracting every time; it can't just jump to a filename's location right away.
The easiest thing you can do to speed this up would be to uncompress the file first (gunzip file.tar.gz) and then work on the .tar file. That might help enough by itself. It's still going to loop through the entire archive N times, though.
If you really want this to be efficient, your only option is to completely extract everything in the archive before processing it. Since your problem is speed, I suspect this is a giant file that you don't want to extract first, but if you can, this will speed things up a lot:
tar zxf file.tar.gz
for f in hopefullySomeSubdir/*; do
grep -l "string" "$f"
done
Note that grep -l prints the name of any matching file, quits after the first match, and is silent if there's no match. That alone will speed up the grepping portion of your command, so even if you don't have the space to extract the entire archive, grep -l will help. If the files are huge, it will help a lot.
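Once the archive is extracted, the loop can also be collapsed into a single recursive grep. A small sketch, assuming a grep that supports -r (GNU and BSD grep do) and a fixed-string search; the function name files_containing is made up:

```bash
#!/bin/bash
# Hypothetical helper: print every file under directory $2 that contains the
# fixed string $1. With -l, grep stops reading a file at its first match.
files_containing() {
    grep -r -l -F -- "$1" "$2"
}
```

For example, files_containing "string" hopefullySomeSubdir prints one matching path per line and nothing for files without a match.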
For starters, you could start more than one process:
tar -ztf file.tar.gz | while read FILENAME
do
(if tar -zxf file.tar.gz "$FILENAME" -O | grep -l "string"
then
echo "$FILENAME contains string"
fi) &
done
The ( ... ) & creates a new detached (read: the parent shell does not wait for the child)
process.
After that, you should optimize the extracting of your archive. The read is no problem,
as the OS should have cached the file access already. However, tar needs to unpack
the archive every time the loop runs, which can be slow. Unpacking the archive once
and iterating over the result may help here:
tempPath=$(mktemp -d) &&
tar -zxf file.tar.gz -C "$tempPath" &&
find "$tempPath" -type f | {
    while read -r FILENAME
    do
        (if grep -l "string" "$FILENAME"
        then
            echo "$FILENAME contains string"
        fi) &
    done
    wait    # let the background greps finish before the files are removed
} && rm -r "$tempPath"
find is used here, to get a list of files in the target directory of tar, which we're iterating over, for each file searching for a string.
Edit: Use grep -l to speed up things, as Jim pointed out. From man grep:
-l, --files-with-matches
Suppress normal output; instead print the name of each input file from which output would
normally have been printed. The scanning will stop on the first match. (-l is specified
by POSIX.)
I am trying to grep for a pattern in a dozen .tar.gz files, but it's very slow
tar -ztf file.tar.gz | while read FILENAME
do
if tar -zxf file.tar.gz "$FILENAME" -O | grep "string" > /dev/null
then
echo "$FILENAME contains string"
fi
done
That's actually very easy with ugrep option -z:
-z, --decompress
Decompress files to search, when compressed. Archives (.cpio,
.pax, .tar, and .zip) and compressed archives (e.g. .taz, .tgz,
.tpz, .tbz, .tbz2, .tb2, .tz2, .tlz, and .txz) are searched and
matching pathnames of files in archives are output in braces. If
-g, -O, -M, or -t is specified, searches files within archives
whose name matches globs, matches file name extensions, matches
file signature magic bytes, or matches file types, respectively.
Supported compression formats: gzip (.gz), compress (.Z), zip,
bzip2 (requires suffix .bz, .bz2, .bzip2, .tbz, .tbz2, .tb2, .tz2),
lzma and xz (requires suffix .lzma, .tlz, .xz, .txz).
Which requires just one command to search file.tar.gz as follows:
ugrep -z "string" file.tar.gz
This greps each of the archived files to display matches. Archived filenames are shown in braces to distinguish them from ordinary filenames. For example:
$ ugrep -z "Hello" archive.tgz
{Hello.bat}:echo "Hello World!"
Binary file archive.tgz{Hello.class} matches
{Hello.java}:public class Hello // prints a Hello World! greeting
{Hello.java}: { System.out.println("Hello World!");
{Hello.pdf}:(Hello)
{Hello.sh}:echo "Hello World!"
{Hello.txt}:Hello
If you just want the file names, use option -l (--files-with-matches) and customize the filename output with option --format="%z%~" to get rid of the braces:
$ ugrep -z Hello -l --format="%z%~" archive.tgz
Hello.bat
Hello.class
Hello.java
Hello.pdf
Hello.sh
Hello.txt
All of the code above was really helpful, but none of it quite answered my own need: grep all *.tar.gz files in the current directory to find a pattern that is specified as an argument in a reusable script to output:
The name of both the archive file and the extracted file
The line number where the pattern was found
The contents of the matching line
It's what I was really hoping that zgrep could do for me and it just can't.
Here's my solution:
pattern=$1
for f in *.tar.gz; do
echo "$f:"
tar -xzf "$f" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true";
done
You can also replace the tar line with the following if you'd like to test that all variables are expanding properly with a basic echo statement:
tar -xzf "$f" --to-command 'echo "f:`basename $TAR_FILENAME` s:'"$pattern\""
Let me explain what's going on. Hopefully, the for loop and the echo of the archive filename in question is obvious.
tar -xzf: x extract, z filter through gzip, f based on the following archive file...
"$f": The archive file provided by the for loop (such as what you'd get by doing an ls) in double-quotes to allow the variable to expand and ensure that the script is not broken by any file names with spaces, etc.
--to-command: Pass the output of the tar command to another command rather than actually extracting files to the filesystem. Everything after this specifies what the command is (grep) and what arguments we're passing to that command.
Let's break that part down by itself, since it's the "secret sauce" here.
'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
First, we use a single-quote to start this chunk so that the executed sub-command (basename $TAR_FILENAME) is not immediately expanded/resolved. More on that in a moment.
grep: The command to be run on the (not actually) extracted files
--label=: The label to prepend the results, the value of which is enclosed in double-quotes because we do want to have the grep command resolve the $TAR_FILENAME environment variable passed in by the tar command.
basename $TAR_FILENAME: Runs as a command (surrounded by backticks) and removes directory path and outputs only the name of the file
-Hin: H Display filename (provided by the label), i Case insensitive search, n Display line number of match
Then we "end" the first part of the command string with a single quote and start up the next part with a double quote so that the $pattern, passed in as the first argument, can be resolved.
Realizing which quotes I needed to use where was the part that tripped me up the longest. Hopefully, this all makes sense to you and helps someone else out. Also, I hope I can find this in a year when I need it again (and I've forgotten about the script I made for it already!)
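If the quoting is still hard to picture, it can help to print the assembled string instead of handing it to tar. A tiny sketch (the function name build_grep_command is just for illustration):

```bash
#!/bin/bash
# Hypothetical helper: assemble the command string handed to tar --to-command.
build_grep_command() {
    # Single-quoted part: kept literal, so the backticks and $TAR_FILENAME
    # are only expanded later, by the shell that tar starts for each member.
    # Double-quoted part: $1 is expanded right now, by this script.
    printf '%s\n' 'grep --label="`basename $TAR_FILENAME`" -Hin '"$1 ; true"
}
```

Running build_grep_command ford prints the command with ford already substituted while `basename $TAR_FILENAME` is still unexpanded, which is exactly what tar needs to receive.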
It's been a couple of weeks since I wrote the above and it's still super useful... but it wasn't quite good enough, as files have piled up and searching for things has gotten messier. I needed a way to limit what I looked at by the date of the file (only looking at more recent files). So here's that code. Hopefully it's fairly self-explanatory.
if [ -z "$1" ]; then
echo "Look within all tar.gz files for a string pattern, optionally only in recent files"
echo "Usage: targrep <string to search for> [start date]"
exit 1
fi
pattern=$1
startdatein=$2
startdate=$(date -d "$startdatein" +%s)
for f in *.tar.gz; do
filedate=$(date -r "$f" +%s)
if [[ -z "$startdatein" ]] || [[ $filedate -ge $startdate ]]; then
echo "$f:"
tar -xzf "$f" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
fi
done
And I can't stop tweaking this thing. I added an argument to filter by the name of the output files in the tar file. Wildcards work, too.
Usage:
targrep.sh [-d <start date>] [-f <filename to include>] <string to search for>
Example:
targrep.sh -d "1/1/2019" -f "*vehicle_models.csv" ford
while getopts "d:f:" opt; do
case $opt in
d) startdatein=$OPTARG;;
f) targetfile=$OPTARG;;
esac
done
shift "$((OPTIND-1))" # Discard options and bring forward remaining arguments
pattern=$1
echo "Searching for: $pattern"
if [[ -n $targetfile ]]; then
echo "in filenames: $targetfile"
fi
startdate=$(date -d "$startdatein" +%s)
for f in *.tar.gz; do
filedate=$(date -r "$f" +%s)
if [[ -z "$startdatein" ]] || [[ $filedate -ge $startdate ]]; then
echo "$f:"
if [[ -z "$targetfile" ]]; then
tar -xzf "$f" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
else
tar -xzf "$f" --no-anchored "$targetfile" --to-command 'grep --label="`basename $TAR_FILENAME`" -Hin '"$pattern ; true"
fi
fi
done
zgrep works fine for me, but only if all the files inside are plain text.
It looks like nothing works if the tgz file contains gzip files.
You can mount the TAR archive with ratarmount and then simply search for the pattern in the mounted view:
pip install --user ratarmount
ratarmount large-archive.tar mountpoint
grep -r '<pattern>' mountpoint/
This is much faster than iterating over each file and piping it to grep separately, especially for compressed TARs. Here are benchmark results in seconds for a 55 MiB uncompressed and 42 MiB compressed TAR archive containing 40 files:
Compression   Ratarmount     Bash loop over tar -O
none          0.31 +- 0.01   0.55 +- 0.02
gzip          1.1 +- 0.1     13.5 +- 0.1
bzip2         1.2 +- 0.1     97.8 +- 0.2
Of course, these results are highly dependent on the archive size and how many files the archive contains. These test examples are pretty small because I didn't want to wait too long. But, they already exemplify the problem well enough. The more files there are, the longer it takes for tar -O to jump to the correct file. And for compressed archives, it will be quadratically slower the larger the archive size is because everything before the requested file has to be decompressed and each file is requested separately. Both of these problems are solved by ratarmount.
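To put a rough number on "quadratically slower", here is a back-of-the-envelope sketch. It assumes the simplified cost model from the paragraph above (requesting member k of the archive means decompressing the k members up to and including it); real archives will differ, and the function name is made up:

```bash
#!/bin/bash
# Hypothetical cost model: n separate requests, where request k has to
# decompress k members, cost 1 + 2 + ... + n member decompressions in total.
members_decompressed() {
    local n=$1 k=1 total=0
    while [ "$k" -le "$n" ]; do
        total=$((total + k))
        k=$((k + 1))
    done
    echo "$total"
}
```

For the 40-file test archive, members_decompressed 40 prints 820, compared with 40 for a single pass over the archive.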
This is the code for benchmarking:
function checkFilesWithRatarmount()
{
local pattern=$1
local archive=$2
ratarmount "$archive" "$archive.mountpoint"
'grep' -r -l "$pattern" "$archive.mountpoint/"
}
function checkEachFileViaStdOut()
{
local pattern=$1
local archive=$2
tar --list --file "$archive" | while read -r file; do
if tar -x --file "$archive" -O -- "$file" | grep -q "$pattern"; then
echo "Found pattern in: $file"
fi
done
}
function createSampleTar()
{
for i in $( seq 40 ); do
head -c $(( 1024 * 1024 )) /dev/urandom | base64 > $i.dat
done
tar -czf "$1" [0-9]*.dat
}
createSampleTar myarchive.tar.gz
time checkEachFileViaStdOut ABCD myarchive.tar.gz
time checkFilesWithRatarmount ABCD myarchive.tar.gz
sleep 0.5s
fusermount -u myarchive.tar.gz.mountpoint
In my case the tarballs have a lot of tiny files and I want to know what archived file inside the tarball matches. zgrep is fast (less than one second) but doesn't provide the info I want, and tar --to-command grep is much, much slower (many minutes) [1].
So I went the other direction and had zgrep tell me the byte offsets of the matches in the tarball and put that together with the list of offsets in the tarball of all archived files to find the matching archived files.
#!/bin/bash
set -e
set -o pipefail
function tar_offsets() {
# Get the byte offsets of all the files in a given tarball
# based on https://stackoverflow.com/a/49865044/60422
[ $# -eq 1 ]
tar -tvf "$1" -R | awk '
BEGIN{
getline;
f=$8;
s=$5;
}
{
offset = int($2) * 512 - and((s+511), compl(512)+1)
print offset,s,f;
f=$8;
s=$5;
}'
}
function tar_byte_offsets_to_files() {
[ $# -eq 1 ]
# Convert the search results of a tarball with byte offsets
# to search results with archived file name and offset, using
# the provided tar_offsets output (single pass, suitable for
# process substitution)
offsets_file="$1"
prev_offset=0
prev_offset_filename=""
IFS=' ' read -r last_offset last_len last_offset_filename < "$offsets_file"
while IFS=':' read -r search_result_offset match_text
do
while [ $last_offset -lt $search_result_offset ]; do
prev_offset=$last_offset
prev_offset_filename="$last_offset_filename"
IFS=' ' read -r last_offset last_len last_offset_filename < "$offsets_file"
# offsets increasing safeguard
[ $prev_offset -le $last_offset ]
done
# now last offset is the first file strictly after search result offset so prev offset is
# the one at or before it, and must be the one it is in
result_file_offset=$(( $search_result_offset - $prev_offset ))
echo "$prev_offset_filename:$result_file_offset:$match_text"
done
}
# Putting it together e.g.
zgrep -a --byte-offset "your search here" some.tgz | tar_byte_offsets_to_files <(tar_offsets some.tgz)
[1] I'm running this in Git for Windows' minimal MSYS2 fork unixy environment, so it's possible that the launch overhead of grep is much much higher than on any kind of real Unix machine and would make `tar --to-command grep` good enough there; benchmark solutions for your own needs and platform situation before selecting.
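The offset arithmetic inside tar_offsets is easier to follow in plain shell. This is my reading of the awk expression, sketched under the assumption that tar stores data in 512-byte blocks and that the block number printed by -R for the next entry marks where the next header starts; the function names are hypothetical:

```bash
#!/bin/bash
# Round a member size up to the next multiple of 512 (tar's block size).
# This is the same rounding as and((s+511), compl(512)+1) in the awk code.
round_up_512() {
    echo $(( ($1 + 511) & ~511 ))
}

# Byte offset where a member's data starts, given the block number of the
# NEXT header ($1) and this member's size in bytes ($2).
data_offset() {
    echo $(( $1 * 512 - $(round_up_512 "$2") ))
}
```

For example, a 700-byte first member occupies two data blocks after its header block, so the next header sits at block 3 and data_offset 3 700 gives 512.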