Can you help me understand these TAR, SPLIT commands? - bash

What do these command lines mean?
tar cvzf - ./android_4.0.4_origen_final_full/ | split -b 2048m - android_4.0.4_origen_final_full.tar.gz
cat android_4.0.4_origen_final_full.tar.gz* | tar -zxvpf - -C /work

I would suggest Googling "man tar", "man split" and "man cat" for details on options and such.
tar is a program (tar originally was short for "tape archive") which creates a serial archive format. It's used to glob a whole directory structure full of files into a single archive file or onto a backup device (tape, disk, or whatever).
split will take a single file and break it into chunks of a given size.
tar cvzf - ./android_4.0.4_origen_final_full/ | split -b 2048m - android_4.0.4_origen_final_full.tar.gz
This command will create an archive of all the files under ./android_4.0.4_origen_final_full/ and, instead of creating a single archive file, breaks the results up (via split) into several 2,048MB (2GB) files. Specifically, the c option on tar means "create", v means "verbose" (you'll get an output line for each file archived), z means it will be compressed (with gzip), and f indicates the output file. Since the output file is given as -, the output goes to the standard output (thus, it can be piped into split). The split option -b 2048m means the output will be split into 2GB-sized files. So if the archive is 3GB, you'll get one file that's 2GB, and one that's 1GB.
cat android_4.0.4_origen_final_full.tar.gz* | tar -zxvpf - -C /work
This does the opposite of the first command. It concatenates all files in the current folder whose names start with android_4.0.4_origen_final_full.tar.gz and unarchives them with tar. The tar options are the same as above, but x means "extract", p means "preserve" file permissions, the f - means take the input from the standard input (from the cat command in this case), and the -C /work tells tar to change to the /work directory for the extraction.
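As a minimal round-trip sketch of the two commands together (the directory ./data, the 100m size, and /tmp/restore are illustrative names, not from the original question; a trailing dot is added to the split prefix so the pieces read data.tar.gz.aa, data.tar.gz.ab, and so on):
tar cvzf - ./data/ | split -b 100m - data.tar.gz.   # pieces: data.tar.gz.aa, .ab, ...
mkdir -p /tmp/restore
cat data.tar.gz.* | tar -zxvpf - -C /tmp/restore    # reassemble and extract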

The first command creates multiple tar.gz files split into 2GB chunks, while the second command extracts the contents of the tar.gz pieces into the /work directory. It uses pipes "|" to connect one shell command's output to the input (stdin) of another.
tar cvzf - ./android_4.0.4_origen_final_full/
creates a tar archive (a file with all data lined up contiguously) of everything under the android_4.0.4_origen_final_full folder and compresses it with gzip compression, giving it an extra .gz extension. The output is piped (sent) to the following command.
split -b 2048m -
splits the input supplied on stdin (from the prior tar command) into 2GB (2048MB) chunks, creating individual gzipped tar pieces with the base name:
android_4.0.4_origen_final_full.tar.gz
cat android_4.0.4_origen_final_full.tar.gz*
dumps the raw contents of all files matching the name pattern, in order, to the screen (stdout)
tar -zxvpf - -C /work
extracts everything from its input (stdin) into the /work folder

Related

Writing data to a zip file

I have a script which I am running in the Ubuntu terminal (bash). Currently I am directly appending the output of the script to a file using the command below:
./run.sh > a.txt
But for some input files, run.sh may produce output which is large without compression. Is it possible to write this output directly to a zip file without going through an intermediate dump file?
I know it is possible in Java and Python, but I wanted a general method of doing it in bash so that I can keep run.sh the same even if the program I am running changes.
I have tried searching the web but haven't come across anything useful.
In this case, a gzip file would be more appropriate. Unlike zip, which is an archive format, gzip is just a compressed data format and can easily be used in a pipe:
./run.sh | gzip > a.txt.gz
The resulting file can be uncompressed in place using the gunzip command (resulting in a file a.txt), viewed with zmore or listed with zcat which allows you to process the output with a filter without writing the whole decompressed file anywhere.
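As a quick sketch of the whole flow (assuming run.sh simply writes text to stdout):
./run.sh | gzip > a.txt.gz        # compress on the fly, no intermediate file
zcat a.txt.gz | grep ERROR        # filter the output without decompressing to disk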
The 'zip' format is for archiving. The zip program can take an existing file and put a compressed version of it into an archive. For example:
./run.sh > a.txt
zip a.zip a.txt
However, your question asks specifically for a 'streaming' solution (given the file size). There are a few utilities that use 'streaming-happy' formats: gz, bz2, and xz. Each excels at different types of data, but for many cases any of them will work.
./run.sh | gzip > a.txt.gz
./run.sh | bzip2 > a.txt.bz2
./run.sh | xz > a.txt.xz
If you are looking for widest compatibility, gzip is usually your friend.
In bash you can use process substitution.
zip -FI -r file.zip <(./run.sh)
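If your zip build lacks -FI, a hedged alternative is to let zip read the data itself from standard input, which stores it under the entry name - (you can rename that entry afterwards with zipnote):
./run.sh | zip a.zip -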

Copying files from a series of directories based off a list in a text file

I am attempting to use either rsync or cp in a for loop to copy files matching a list of 200 names, stored on new lines in a .txt file, that match filenames with the .pdbqt extension in a series of subdirectories under one parent folder. The .txt file looks as follows:
file01
file02
file08
file75
file45
...
I have attempted to use rsync with the following command:
rsync -a /home/ubuntu/Project/files/pdbqt/*/*.pdbqt \
--files-from=/home/ubuntu/Project/working/output.txt \
/home/ubuntu/Project/files/top/
When I run the rsync command I receive:
rsync error: syntax or usage error (code 1) at options.c(2346) [client=3.1.2]
I have written a bash script as follows in an attempt to get that to work:
#!/bin/bash
for i in "$(cat /home/ubuntu/Project/working/output.txt | tr '\n' '')"; do
cp /home/ubuntu/Project/files/pdbqt/*/"$i".pdbqt /home/ubuntu/Project/files/top/;
done
I understand cat isn't a great command to use but I could not figure out an alternate solution to it, as I am still new to using bash. Running that I get the following error:
tr: when not truncating set1, string2 must be non-empty
cp: cannot stat '/home/ubuntu/Project/files/pdbqt/*/.pdbqt': No such file or directory
I assume that the cp error is thrown as a result of the tr error but I am not sure how else to get rid of the \n that is read from the new line separated list.
The expected results are that from the subdirectories in /pdbqt/ with the 12000 .pdbqt files the 200 files from the output.txt list would be copied from those subdirectories into the /top/ directory.
for loops are good when your data is already in shell variables. When reading in data from a file, while ... read loops work better. In your case, try:
while IFS= read -r file; do cp -i -- /home/ubuntu/Project/files/pdbqt/*/"$file".pdbqt /home/ubuntu/Project/files/top/; done </home/ubuntu/Project/working/output.txt
or, if you find the multiline version more readable:
while IFS= read -r file
do
    cp -i -- /home/ubuntu/Project/files/pdbqt/*/"$file".pdbqt /home/ubuntu/Project/files/top/
done </home/ubuntu/Project/working/output.txt
How it works
while IFS= read -r file; do
This starts a while loop reading one line at a time. IFS= tells bash not to strip leading or trailing whitespace from the line, and -r tells read not to mangle backslashes. The line is stored in the shell variable called file.
cp -i -- /home/ubuntu/Project/files/pdbqt/*/"$file".pdbqt /home/ubuntu/Project/files/top/
This copies the file. -i tells cp to ask before overwriting an existing file.
done </home/ubuntu/Project/working/output.txt
This marks the end of the while loop and tells the shell to get the input for the loop from /home/ubuntu/Project/working/output.txt
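If some names in the list might not match any file, here is a hedged variant of the same loop that reports misses instead of letting cp fail (a sketch, not part of the answer above):
while IFS= read -r file
do
    found=(/home/ubuntu/Project/files/pdbqt/*/"$file".pdbqt)
    if [ -e "${found[0]}" ]
    then
        cp -i -- "${found[@]}" /home/ubuntu/Project/files/top/
    else
        echo "no match for $file" >&2
    fi
done </home/ubuntu/Project/working/output.txt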
Do dirs in Project/files/pdbqt/* or files *.pdbqt have dashes (-) in the name?
The error is showing the line in rsync source code options.c
"Your options have been rejected by the server.\n"
which makes me think that it's interpreting the files/dirs expanded by your glob as rsync options.
for i in $( < /home/ubuntu/Project/working/output.txt LC_CTYPE=C tr '\n' ' ' )
do
    cp /home/ubuntu/Project/files/pdbqt/*/"${i}.pdbqt" /home/ubuntu/Project/files/top/
done
I think your cat tr is missing a space
cat /home/ubuntu/Project/working/output.txt | tr '\n' ' '
John1024's use of while and read are better than mine.
You are thinking correctly to think rsync. rsync provides the option --files-from="yourfile" that will rsync all the files listed in your text file (relative to the base directory you specify next) to the destination (either host:/dest/path, or locally with /dest/path alone).
You will want to specify --no-R to tell rsync not to use relative filenames, since --files-from= takes the base path as the next argument. For example, to transfer all files in your text file to some remote host, where the files specified are located under the current directory, you could use:
rsync -uai --no-R --files-from="textfile" ./ host:/dest/path
Where the command essentially specifies you read the names to transfer from textfile where the files will be found under ./ (the current directory) and you will transfer the files to host:/dest/path on the host you specify. You can see man 1 rsync for full details.
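One caveat for this particular question: the names in output.txt carry neither the subdirectory nor the .pdbqt extension, and --files-from expects paths relative to the source base. A hedged sketch that builds the full relative paths first (/tmp/paths.txt is an illustrative name):
cd /home/ubuntu/Project/files/pdbqt
while IFS= read -r name
do
    ls */"$name".pdbqt
done </home/ubuntu/Project/working/output.txt > /tmp/paths.txt
rsync -uai --no-R --files-from=/tmp/paths.txt ./ /home/ubuntu/Project/files/top/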

How to view the content of a gzipped file “abc.gz” without actually extracting it?

How can I view the content of a gzipped file abc.gz without extracting it?
I tried to find a way to see the content without unzipping, but I did not find one.
You can use the command below to see the file without replacing it with the decompressed content:
gunzip -c filename.gz
Just use zcat to see content without extraction.
zcat abc.gz
From the manual:
zcat is identical to gunzip -c. (On some systems, zcat may be
installed as gzcat to preserve the original link to compress.)
zcat uncompresses either a list of files on the command line or its
standard input and writes the uncompressed data on standard output.
zcat will uncompress files that have the correct magic number whether
they have a .gz suffix or not.
Plain text:
cat abc
Gzipped text:
zcat abc.gz
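The gzip package usually ships a few more z-prefixed helpers that work the same way (availability varies by system):
zless abc.gz             # page through the compressed file
zgrep -i error abc.gz    # search it without extracting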

using split command in shell how to know the number of files generated

I am using the split command on a large file to generate little files which are put in a folder; my problem is that the folder contains other files besides the ones from my split.
I would like to know if there is a way to count how many files were generated by my split alone, not the number of all files in the folder.
My command is split a 2 d. Is there any option I can add to this command to find out?
I know ls -Al | wc -l will give me the number of files in the folder, which is not what I want.
The simplest solution here is to split into a fresh directory.
Assuming that's not possible and you aren't worried about other processes operating on the directory in question you can just count the files before and after. Something like this
$ before=(*)
$ split a 2 d
$ after=(*)
$ echo "Split files: $(( ${#after[@]} - ${#before[@]} ))"
If the other files in the directory can't have the same format as the split files (and presumably they can't, or split would fail or overwrite them) then you could use an appropriate glob to get just the files that match the pattern. Something like splitfiles=(d??).
Failing that, you could see whether the --verbose option to split allows you to use split_count=$(split --verbose a 2 d | wc -l) or similar.
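As a concrete sketch of the glob idea (assuming the prefix d and split's default two-character suffixes):
shopt -s nullglob
splitfiles=(d??)
echo "Split created ${#splitfiles[@]} files"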
To be different, I will be counting the lines with grep utilizing the --verbose option:
split --verbose other_options file | grep -c ""
Example:
$ split --verbose -b 2 file | grep -c ""
60
# yeah, my file is pretty small, splitting on 2 bytes to produce numerous files
You can also use split's naming options to make the generated files easy to identify: -a sets the suffix length and the trailing PREFIX argument sets the file-name prefix (while -l sets the number of lines per file), so a distinctive prefix lets you count the results with a simple glob.

Is it possible to split a huge text file (based on number of lines) unpacking a .tar.gz archive if I cannot extract that file as whole?

I have a .tar.gz file. It contains one 20GB-sized text file with 20.5 million lines. I cannot extract this file as a whole and save to disk. I must do either one of the following options:
Specify a number of lines in each file - say, 1 million, - and get 21 files. This would be a preferred option.
Extract a part of that file based on line numbers, that is, say, from 1000001 to 2000001, to get a file with 1M lines. I will have to repeat this step 21 times with different parameters, which is very bad.
Is it possible at all?
This answer - bash: extract only part of tar.gz archive - describes a different problem.
To extract a file from f.tar.gz and split it into files, each with no more than 1 million lines, use:
tar Oxzf f.tar.gz | split -l1000000
The above will name the output files by the default method. If you prefer the output files to be named prefix.nn where nn is a sequence number, then use:
tar Oxzf f.tar.gz | split -dl1000000 - prefix.
Under this approach:
The original file is never written to disk. tar reads from the .tar.gz file and pipes its contents to split which divides it up into pieces before writing the pieces to disk.
The .tar.gz file is read only once.
split, through its many options, has a great deal of flexibility.
Explanation
For the tar command:
O tells tar to send the output to stdout. This way we can pipe it to split without ever having to save the original file on disk.
x tells tar to extract the file (as opposed to, say, creating an archive).
z tells tar that the archive is in gzip format. On modern tars this is optional, as the compression is detected automatically.
f tells tar to use, as input, the file name specified.
For the split command:
-l tells split to split files limited by number of lines (as opposed to, say, bytes).
-d tells split to use numeric suffixes for the output files.
- tells split to get its input from stdin
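As a quick sanity check on the pieces (a sketch using the prefix. names from the -d example above), the total line count across the pieces should match the original:
tar Oxzf f.tar.gz | wc -l    # lines in the packed file
cat prefix.* | wc -l         # lines across all split pieces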
You can use the --to-stdout (or -O) option in tar to send the output to stdout.
Then use sed to specify which set of lines you want.
#!/bin/bash
l=1
inc=1000000
p=1
while test $l -lt 21000000; do
    e=$(($l + $inc - 1))
    # extract the member to stdout and keep only lines l..e
    tar --to-stdout -xzf myfile.tar.gz file-to-extract.txt |
        sed -n -e "$l,$e p" > part$p.txt
    l=$(($l + $inc))
    p=$(($p + 1))
done
Here's a pure Bash solution for option #1, automatically splitting lines into multiple output files.
#!/usr/bin/env bash
set -eu
filenum=1
chunksize=1000000
ii=0
mkdir -p out
while IFS= read -r line
do
    if [ $ii -ge $chunksize ]
    then
        ii=0
        filenum=$(($filenum + 1))
        > out/file.$filenum
    fi
    printf '%s\n' "$line" >> out/file.$filenum
    ii=$(($ii + 1))
done
This will take any lines from stdin and create files like out/file.1 with the first million lines, out/file.2 with the second million lines, etc. Then all you need is to feed the input to the above script, like this:
tar xfzO big.tar.gz | ./split.sh
This will never save any intermediate file on disk, or even in memory. It is entirely a streaming solution. It's somewhat wasteful of time, but very efficient in terms of space. It's also very portable, and should work in shells other than Bash, and on ancient systems with little change.
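If the shell loop proves too slow for 20 million lines, the same chunking can be done with a short awk one-liner instead (a sketch, not part of the answer above; the out directory must already exist):
tar xfzO big.tar.gz | awk 'NR % 1000000 == 1 { f = "out/file." ++n } { print > f }'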
You can use
sed -n 1,20p /Your/file/Path
where you give the first line number and the last line number of the range you want. For example, this could look like
sed -n 1,20p /Your/file/Path >> file1
You can put the start and end line numbers in variables and use them accordingly.
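For example, a parametrized sketch (the variable names are illustrative; the ${end}q makes sed quit as soon as the range has been printed):
start=1000001
end=2000000
tar Oxzf f.tar.gz | sed -n "${start},${end}p;${end}q" > part.txt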
