Extract a file from tar.gz, without touching disk - bash

Current Process:
I have a tar.gz file. (Actually, I have about 2000 of them, but that's another story).
I make a temporary directory, extract the tar.gz file, revealing 100,000 tiny files (around 600 bytes each).
For each file, I cat it into a processing program, pipe that loop into another analysis program, and save the result.
The temporary space on the machines I'm using can barely handle one of these processes at once, never mind the 16 (hyperthreaded dual quad core) that they get sent by default.
I'm looking for a way to do this process without saving to disk. I believe the performance penalty for individually pulling files using tar -xf $file -O <targetname> would be prohibitive, but it might be what I'm stuck with.
Is there any way of doing this?
EDIT: Since two people have already made this mistake, I'm going to clarify:
Each file represents one point in time.
Each file is processed separately.
Once processed (in this case a variant on Fourier analysis), each gives one line of output.
This output can be combined to do things like autocorrelation across time.
EDIT2: Actual code:
for f in posns/*; do
~/data_analysis/intermediate_scattering_function < "$f"
done | ~/data_analysis/complex_autocorrelation.awk limit=1000 > inter_autocorr.txt

If you do not care about the boundaries between files, then tar --to-stdout -xf $file will do what you want; it will send the contents of each file in the archive to stdout one after the other.
This assumes you are using GNU tar, which is reasonably likely if you are using bash.
[Update]
Given the constraint that you do want to process each file separately, I agree with Charles Duffy that a shell script is the wrong tool.
You could try his Python suggestion, or you could try the Archive::Tar Perl module. Either of these would allow you to iterate through the tar file's contents in memory.

This sounds like a case where the right tool for the job is probably not a shell script. Python has a tarfile module which can operate in streaming mode, letting you make only a single pass through the large archive and process its files, while still being able to distinguish the individual files (which the tar --to-stdout approach will not).

You can use the tar option --to-command=cmd to execute the command for each file. Tar redirects the file content to the standard input of the command, and sets some environment variables with details about the file, such as TAR_FILENAME. More details in Tar Documentation.
e.g.
tar zxf file.tar.gz --to-command='./process.sh'
Note that OSX uses bsdtar by default, which does not have this option. You can explicitly call gnutar instead.

You could use a ramdisk ( http://www.vanemery.com/Linux/Ramdisk/ramdisk.html ) to process and load it from. (me boldly assuming you use Linux but other UNIX systems should have the same type of provisions)

tar zxvf <file.tar.gz> <path_to_extract> --to-command=cat
The above command will show the content of extracted file on shell only. There will be no changes to disk. tar command should be GNU tar.
Sample logs:
$ cat file_a
aaaa
$ cat file_b
bbbb
$ cat file_c
cccc
$ tar zcvf file.tar.gz file_a file_b file_c
file_a
file_b
file_c
$ cd temp
$ ls <== no files in directory
$ tar zxvf ../file.tar.gz file_b --to-command=cat
file_b
bbbb
$ tar zxvf ../file.tar.gz file_a --to-command=cat
file_a
aaaa
$ ls <== Even after tar extract - no files in directory. So, no changes to disk
$ tar --version
tar (GNU tar) 1.25
...
$

Related

Extracting certain files from a tar archive on a remote ssh server

I am running numerous simulations on a remote server (via ssh). The outcomes of these simulations are stored as .tar archives in an archive directory on this remote server.
What I would like to do, is write a bash script which connects to the remote server via ssh and extracts the required output files from each .tar archive into separate folders on my local hard drive.
These folders should have the same name as the .tar file from which the files come (To give an example, say the output of simulation 1 is stored in the archive S1.tar on the remote server, I want all '.dat' and '.def' files within this .tar archive to be extracted to a directory S1 on my local drive).
For the extraction itself, I was trying:
for f in *.tar; do
(
mkdir ../${f%.tar}
tar -x -f "$f" -C ../${f%.tar} "*.dat" "*.def"
)
done
wait
Every .tar file is around 1GB and there is a lot of them. So downloading everything takes too much time, which is why I only want to extract the necessary files (see the extensions in the code above).
Now the code works perfectly when I have the .tar files on my local drive. However, what I can't figure out is how I can do it without first having to download all the .tar archives from the server.
When I first connect to the remote server via ssh username#host, then the terminal stops with the script and just connects to the server.
Btw I am doing this in VS Code and running the script through terminal on my MacBook.
I hope I have described it clear enough. Thanks for the help!
Stream the results of tar back with filenames via SSH
To get the data you wish to retrieve from .tar files, you'll need to pass the results of tar to a string of commands with the --to-command option. In the example below, we'll run three commands.
# Send the files name back to your shell
echo $TAR_FILENAME
# Send the contents of the file back
cat /dev/stdin
# Send EOF (Ctrl+d) back (note: since we're already in a $'' we don't use the $ again)
echo '\004'
Once the information is captured in your shell, we can start to process the data. This is a three-step process.
Get the file's name
note that, in this code, we aren't handling directories at all (simply stripping them away; i.e. dir/1.dat -> 1.dat)
you can write code to create directories for the file by replacing the forward slashes / with spaces and iterating over each directory name but that seems out-of-scope for this.
Check for the EOF (end-of-file)
Add content to file
# Get the files via ssh and tar
files=$(ssh -n <user#server> $'tar -xf <tar-file> --wildcards \'*\' --to-command=$\'echo $TAR_FILENAME; cat /dev/stdin; echo \'\004\'\'')
# Keeps track of what state we're in (filename or content)
state="filename"
filename=""
# Each line is one of these:
# - file's name
# - file's data
# - EOF
while read line; do
if [[ $state == "filename" ]]; then
filename=${line/*\//}
touch $filename
echo "Copying: $filename"
state="content"
elif [[ $state == "content" ]]; then
# look for EOF (ctrl+d)
if [[ $line == $'\004' ]]; then
filename=""
state="filename"
else
# append data to file
echo $line >> <output-folder>/$filename
fi
fi
# Double quotes here are very important
done < <(echo -e "$files")
Alternative: tar + scp
If the above example seems overly complex for what it's doing, it is. An alternative that touches the disk more and requires to separate ssh connections would be to extract the files you need from your .tar file to a folder and scp that folder back to your workstation.
ssh -n <username>#<server> 'mkdir output/; tar -C output/ -xf <tar-file> --wildcards *.dat *.def'
scp -r <username>#<server>:output/ ./
The breakdown
First, we'll make a place to keep our outputted files. You can skip this if you already know the folder they'll be in.
mkdir output/
Then, we'll extract the matching files to this folder we created (if you don't want them to be in a different folder remove the -C output/ option).
tar -C output/ -xf <tar-file> --wildcards *.dat *.def
Lastly, now that we're running commands on our machine again, we can run scp to reconnect to the remote machine and pull the files back.
scp -r <username>#<server>:output/ ./

How to create a file if it does not exist otherwise clear an existing file?

I have a project that opens a file and writes some information to it. The file needs to exist, and needs to be empty. The file needs to be created the first time the program is run, and the file needs to be cleared for each successive run.
Is there an easy way in bash to create a file if it does not exist, else clear the contents of the existing file?
You don't need to use echo or any external utility, but just use the shell built-ins. Just use the shell re-direct operator (> file).
$ mkdir testDir
$ cd testDir/
$ ls -l
total 0
$ > fileThatDoesNotExist
$ ls -1
fileThatDoesNotExist
It will work even if there are some contents in the file, it will truncate its content
$ echo 'whatever' > fileThatDoesNotExist
$ cat fileThatDoesNotExist
whatever
$ > fileThatDoesNotExist
$ cat fileThatDoesNotExist

Parse filenames into touch command on OS X

I have a large number of photos on my machine where I'd like to parse the standard naming convention I have created for each file, and then pipe it to the touch command.
For example, I have these files:
2016-08-06-00h28m34.jpg
2016-08-06-00h28m35.jpg
2016-08-06-00h28m36.jpg
I would like to generate (and then run) the following commands:
touch -t 201608060028.34 2016-08-06-00h28m34.jpg
touch -t 201608060028.35 2016-08-06-00h28m35.jpg
touch -t 201608060028.36 2016-08-06-00h28m36.jpg
I can do this manually in a text editor, but it's extremely time-consuming, due to the number of files in each directory. I could also do it in C# and run it over my LAN, but that seems like overkill. Heck, I can even do this in SQL Server, but ... it's OS X and I'm sure there's a simple command-line thing I'm missing.
I've looked at Windows version of Unix touch command to modify a file's modification date from filename part, and Split Filename Up to Define Variables, but I can't seem to figure out how to add in the period for the seconds portion of the script, plus I don't want to add the batch script to each of the hundreds of folders I have.
Any assistance will be greatly appreciated.
Simple Option
I presume you are trying to set the filesystem time to match the EXIF capture time of thousands of photos. There is a tool for that, which runs on OSX, Linux (and Windows if you must). It is called jhead and I installed it on OSX using homebrew with:
brew install jhead
There may be other ways to install it - jhead website.
Please make a back up before trying this, or try it out on a small subset of your files, as I may have misunderstood your needs!
Basically the command to set the filesystem timestamp to match the EXIF timestamp on a single file is:
jhead -ft SomeFile.jpg
So, if you wanted to set the timestamps for all files in $HOME/photos/tmp and all subdirectories, you would do:
find $HOME/photos/tmp -iname \*.jpg -exec jhead -ft {} \;
Option not requiring any extra software
Failing that, you could do it with Perl which is installed on OSX by default anyway:
find . -name \*.jpg | perl -lne 'my $a=$_; s/.*(\d{4})-(\d+)-(\d+)-(\d+)h(\d+)m(\d+).*/$1$2$3$4$5.$6/ && print "touch -t $_\ \"$a\"" '
which gives this sort of output on my machine:
touch -t 201608060028.34 "./2016-08-06-00h28m34.jpg"
touch -t 201608060028.35 "./2016-08-06-00h28m35.jpg"
touch -t 201501060028.35 "./tmp/2015-01-06-00h28m35.jpg"
and if that looks good on your machine, you could send those commands into bash to be executed like this:
find . -name \*.jpg | perl -lne 'my $a=$_;s/.*(\d{4})-(\d+)-(\d+)-(\d+)h(\d+)m(\d+).*/$1$2$3$4$5.$6/ && print "touch -t $_\ \"$a\"" ' | bash -x
And, for the Perl purists out there, yes, I know Perl could do the touch itself and save invoking a whole touch process per file, but that would require modules and explanation and a heap of other extraneous stuff that is not really necessary for a one-off, or occasional operation.

Redirect file to command, whose stdout is redirected to another command

In a bash script, I want to redirect a file called test to gzip -f, then redirect STDOUT to tar -xfzv, and possibly STDERR to echo This is what I tried:
$ gzip -f <> ./test | tar -xfzv
Here, tar just complains that no file was given. How would I do what I'm attempting?
EDIT: the test file IS NOT a .tar.gz file, sorry about that
EDIT: I should be unzipping, then zipping, not like I had it written here
tar's -f switch tells it that it will be given a filename to read from. Use - for a filename to make tar read from stdout, or omit -f switch. Please read man tar for further information.
I'm not really sure about what you're trying to achieve in general, to be honest. The purpose of read-write redirection and -f gzip switch here is unclear. If the task is to unpack a .tar.gz, better use tar xvzf ./test.tar.gz.
As a side note, you cannot 'redirect stderr to echo', echo is just a built-in, and if we're talking about interactive terminal session, stderr will end up visible on your terminal anyway. You can redirect it to file with 2>$filename construct.
EDIT: So for the clarified version of the question, if you want to decompress a gzipped file, run it through bcrypt, then compress it back, you may use something like
gzip -dc $orig_file.tar.gz | bcrypt [your-switches-here] | gzip -c > $modified_file.tar.gz
where gzip's -d stands for decompression, and -c stands for 'output to stdout'.
If you want to encrypt each file individually instead of encrypting the whole tar archive, things get funnier because tar won't read input from stdin. So you'll need to extract your files somewhere, encrypt them and then tgz them back. This is not the shortest way to do that, but in general it works like this:
mkdir tmp ; cd tmp
tar xzf ../$orig_file.tar.gz
bcrypt [your-switches-here] *
tar czf ../$modified_file.tar.gz *
Please note that I'm not familiar with bcrypt switches and workflow at all.

self extracting tar archive (shell scripting)

I have attempted to write a shell script that creates another self extracting tar archive that is zipped and encoded in base64. I don't know where to go form here and have little to no experience in shell scripting.
As is this script creates tar archive that is zipped and encoded, but the self extracting does not work when i try to run the ./tarName from the terminal. Any advice is appreciated
#!/bin/sh
tarName=$1;
if [ -e $tarName.tar.gz ]
then /bin/echo "$tarName already exists"
exit 0
fi
shift;
for files;
do
tar -czvf tmpTarBall.tar.gz $files;
done
echo "#!/bin/sh" >> $tarName.tar.gz;
echo "base64 -d $tarName.tar.gz" >> $tarName.tar.gz;
echo "tar -xzvf $tarName.tar.gz" >> $tarName.tar.gz;
chmod +x ./$tarName.tar.gz;
base64 tmpTarBall.tar.gz >> $tarName.tar.gz;
rm tmpTarBall.tar.gz;
----------UPDATE
Did some looking around and this is what I have now, still doesn't work. Can anyone explain to me why?
#!/bin/sh
tarName=$1;
if [ -e $tarName.tar.gz ]
then /bin/echo "$tarName already exists"
exit 0
fi
shift;
for files;
do
tar -czvf tmpTarBall.tar.gz $files;
done
cat > extract.sh;
echo "#!/bin/sh" >> extract.sh;
echo "sed '0,/^#TARBALL#$/d' $0 | $tarName.tar.gz | base64 -d | tar -xzv; exit 0" >> extract.sh;
echo "#TARBALL#" >> extract.sh;
cat extract.sh tmpTarBall.tar.gz > $tarName.tar.gz;
chmod +x ./$tarName.tar.gz;
rm extract.sh tmpTarBall.tar.gz;
When I try to run the tarName.tar.gz i get errors:
./tarName.tar.gz: 2: ./tarName.tar.gz: tarName.tar.gz: not found
gzip: stdin: unexpected end of file
tar: Child returned status 1
tar: Error is not recoverable: exiting now
Desired output
In outline, the script you want to generate should look like:
base64 -d <<'EOF' | tar -xzf -
…base-64 encoded data…
EOF
The base64 command decodes its standard input, which is provided as a here document terminated by a line containing just EOF. The output is written to
tar with options to extract gzipped data read from standard input.
Minimal script
So, a minimal generator script looks like:
echo "base64 -d <<'EOF' | tar -czf -"
tar -czf - "$#" | base64 -w 72
echo "EOF"
This echoes the base64 … | tar … line, then uses tar to generate on standard output a zipped tar file containing the files or directories named on the command line, and the output is piped to the GNU coreutils version of base64 with the option to specify that output lines should be 72 characters wide (plus the newline). This is all followed by EOF to mark the end of the here document.
You can add shebang lines (#!/bin/sh) to either or both scripts. There's no need to choose a more specific shell; this uses only core shell scripting constructs that would work back to the days of yore — before POSIX was a gleam in anyone's eye.
Possible complications
Complications that are possible include support for Mac OS X base64 which has a usage message like this:
Usage: base64 [-dhvD] [-b num] [-i in_file] [-o out_file]
-h, --help display this message
-D, --decode decodes input
-b, --break break encoded string into num character lines
-i, --input input file (default: "-" for stdin)
-o, --output output file (default: "-" for stdout)
The -v option and the -d option both generate base64: invalid option -- v (for the appropriate letter), plus the usage. There doesn't seem to be a way to get version information from it. However, GNU's base64 does generate a useful message when you request base64 --version. The first line of standard output will contain something like:
base64 (GNU coreutils) 8.22
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Simon Josefsson.
This is written to standard output. So, you could auto-detect whether you have the GNU base64 and adapt accordingly. You'd need one test in the generator script, and a copy of the test in the generated script. That's definitely a more refined program.
Is it necessary to do this yourself? There is an existing tool called makeself that can do this for you. If you do need to write this yourself, here are some thoughts:
Your output file is an archive with a shell script stuck to the front of it. The extract process runs the entire output file through base64 and tar, not just the archive. The base64 call turns the script portion into garbage, which then confuses tar. What you need to do is to add some code that will separate the script from the archive, then run the remaining commands on just the archive portion. One possible way to do this is to tweak your extract script to something like this:
#!/bin/sh
linenum=$(grep -n "__END_OF_SCRIPT_MARKER__" $tarName.tar.gz | tail -1 | sed -e 's/:.*//')
tail -n +$(($linenum + 1)) $tarName.tar.gz | base64 -d | tar -xzv
exit 0
__END_OF_SCRIPT_MARKER__
Make sure there is nothing in the script portion following the marker text except a newline character (which the markup on this website doesn't make visible). With this, you're using grep to find the line number that contains the marker, then stripping off that many lines with tail. What remains will be the archive portion, which is processed normally by the rest of your code. The exit line ensures that the shell doesn't try to execute the marker text or the archive contents as code. You can keep the extract code in a less compressed format if you'd rather, but you'll end up having to create a temporary file for the archive portion and ensure that it gets deleted.

Resources