Writing data to a zip file - bash

I have a script which I am running in the Ubuntu terminal (bash). Currently I am redirecting the output of the script to a file using the command below:
./run.sh > a.txt
But for some input files, run.sh may produce output that is very large when uncompressed. Is it possible to write this output directly to a zip file, without going through an intermediate dump file?
I know this is possible in Java and Python, but I wanted a general way of doing it in bash so that I can keep run.sh the same even if the underlying program changes.
I have tried searching the web but haven't come across anything useful.

In this case, a gzip file would be more appropriate. Unlike zip, which is an archive format, gzip is just a compressed data format and can easily be used in a pipe:
./run.sh | gzip > a.txt.gz
The resulting file can be uncompressed in place using the gunzip command (producing a file a.txt), viewed with zmore, or streamed with zcat, which lets you process the output with a filter without writing the whole decompressed file anywhere.
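For example, to scan the compressed output for a pattern without writing the decompressed data anywhere ("ERROR" is just a placeholder pattern):
zcat a.txt.gz | grep "ERROR" | head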

The 'zip' format is for archiving. The 'zip' program can take an existing file and put a compressed version into an archive. For example:
./run.sh > a.txt
zip a.zip a.txt
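As an aside, if the intermediate file is only a disk-space concern, zip's -m option deletes a.txt once it has been added to the archive:
zip -m a.zip a.txt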
However, your question asks specifically for a 'streaming' solution (given the file size). There are a few utilities that use 'streaming-friendly' formats: gz, bz2, and xz. Each excels on different kinds of data, but for many cases any of them will work.
./run.sh | gzip > a.txt.gz
./run.sh | bzip2 > a.txt.bz2
./run.sh | xz > a.txt.xz
If you are looking for widest compatibility, gzip is usually your friend.
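Whichever you pick, the compressed stream can be read back the same way when you need to process it later, for example:
gunzip -c a.txt.gz | head
bunzip2 -c a.txt.bz2 | head
unxz -c a.txt.xz | head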

In bash you can use process substitution.
zip -FI -r file.zip <(./run.sh)
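Note that with process substitution the entry stored in the archive will likely be named after the /dev/fd path rather than something meaningful. If the entry name matters, a sketch using a named pipe instead (still relying on zip's -FI/--fifo support, with a.txt as an example name) is:
mkfifo a.txt
./run.sh > a.txt &
zip -FI archive.zip a.txt
rm a.txt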

Related

How to download URLs in a CSV and name outputs based on a column value

1. OS: Linux / Ubuntu x86/x64
2. Task:
Write a Bash shell script to download the URLs in a (large) CSV (as fast/simultaneously as possible) and name each output based on a column value.
2.1 Example Input:
A CSV file containing lines like:
001,http://farm6.staticflickr.com/5342/a.jpg
002,http://farm8.staticflickr.com/7413/b.jpg
003,http://farm4.staticflickr.com/3742/c.jpg
2.2 Example outputs:
Files in a folder, outputs, containing files like:
001.jpg
002.jpg
003.jpg
3. My Try:
I mainly tried two approaches.
1. Using the download tool's built-in support
Take aria2c as an example: it supports the -i option to import a file of URLs to download, and (I think) it will process them in parallel for maximum speed. It does have a --force-sequential option to force downloading in the order of the lines, but I failed to find a way to make the naming part happen.
2. Splitting first
Split the file into pieces and run a script like the following to process each piece:
#!/bin/bash
INPUT=$1
while IFS=, read -r serino url
do
aria2c -c "$url" --dir=outputs --out="$serino.jpg"
done < "$INPUT"
However, this means aria2c is restarted for each line, which seems to cost time and slow down the overall speed.
Though one can run the script multiple times to get 'shell-level' parallelism, that does not seem to be the best way.
Any suggestions?
Thank you,
aria2c supports so-called option lines in input files. From man aria2c:
-i, --input-file=
Downloads the URIs listed in FILE. You can specify multiple sources for a single entity by putting multiple URIs on a single line separated by the TAB character. Additionally, options can be specified after each URI line. Option lines must start with one or more white space characters (SPACE or TAB) and must only contain one option per line.
and later on
These options have exactly same meaning of the ones in the command-line options, but it just applies to the URIs it belongs to. Please note that for options in input file -- prefix must be stripped.
You can convert your csv file into an aria2c input file:
sed -E 's/([^,]*),(.*)/\2\n out=\1/' file.csv | aria2c -i -
This will convert your file into the following format and run aria2c on it.
http://farm6.staticflickr.com/5342/a.jpg
out=001
http://farm8.staticflickr.com/7413/b.jpg
out=002
http://farm4.staticflickr.com/3742/c.jpg
out=003
However this won't create files 001.jpg, 002.jpg, … but 001, 002, … since that's what you specified. Either specify file names with extensions or guess the extensions from the URLs.
If the extension is always jpg you can use
sed -E 's/([^,]*),(.*)/\2\n out=\1.jpg/' file.csv | aria2c -i -
To extract extensions from the URLs use
sed -E 's/([^,]*),(.*)(\..*)/\2\3\n out=\1\3/' file.csv | aria2c -i -
Warning: This works if and only if every URL ends with an extension. For instance, due to the missing extension the line 001,domain.tld/abc would not be converted at all, causing aria2c to fail on the "URL" 001,domain.tld/abc.
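If some URLs may lack an extension, a rough awk sketch (assuming .jpg as the fallback extension and no commas inside the URLs) can handle both cases:
awk -F, '{
  ext = ".jpg"                                  # assumed fallback extension
  if (match($2, /\.[^.\/]+$/)) ext = substr($2, RSTART)
  print $2                                      # URI line
  print " out=" $1 ext                          # option line (leading space required)
}' file.csv | aria2c -i -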
Using all standard utilities you can do this to download in parallel:
tr '\n' ',' < file.csv |
xargs -P 0 -d , -n 2 bash -c 'curl -s "$2" -o "$1.jpg"' -
The -P 0 option tells xargs to run as many commands in parallel as possible.
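If unbounded parallelism is too aggressive for the remote server, a capped variant (8 is an arbitrary limit; the outputs directory matches the question) might look like this:
mkdir -p outputs
tr '\n' ',' < file.csv |
xargs -P 8 -d , -n 2 bash -c 'curl -s "$2" -o "outputs/$1.jpg"' -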

How can you use multiple input streams for input parameters?

I want to run a command line script that requires several parameters. Specifically:
perl prinseq-lite.pl -fastq file1.fq.gz -fastq2 file2.fq.gz \
  -out_good goodfile.out -out_bad badfile.out -log prin.log \
  -ns_max_n 5 ... more_params ...
The problem is that the files are zipped, and must be processed without first unzipping and storing them, because the unzipped file sizes are very large and this command will be run on a large number of files.
So what I need to do is to unzip the input on the fly. Previously, user l0b0 suggested that multiple input streams might be a solution. I have tried the following, but I seem to be passing an empty input stream, as the program claims the input files are empty.
perl prinseq-lite.pl -fastq <(zcat f1.gz) -fastq2 <(zcat f2.gz) ...
perl prinseq-lite.pl -fastq 1< <(zcat f1.gz) -fastq2 2< <(zcat f2.gz) ...
So what I need to do, in short, is provide unzipped input for multiple parameters to this program.
Can someone tell me the proper way to do this, and/or what I'm doing wrong with my current attempts? Thanks in advance for your input.
Well, I think the easiest might be to make named pipes for the output of gunzip, then use those named pipes in the command:
mkfifo file1.fq file2.fq file3.fq ...
gunzip -c file1.fq.gz > file1.fq &
gunzip -c file2.fq.gz > file2.fq &
gunzip -c file3.fq.gz > file3.fq &
Then call your program with those pipes as file names:
perl prinseq-lite.pl -fastq file1.fq -fastq2 file2.fq -fastq3 file3.fq ...
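Once the program has finished, the background gunzip jobs can be reaped and the pipes removed; FIFOs are ordinary filesystem entries, so a small cleanup step is worthwhile:
wait                              # wait for the background gunzip jobs
rm file1.fq file2.fq file3.fq     # remove the named pipes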

How to view the content of a gzipped file “abc.gz” without actually extracting it?

How can I view the content of a gzipped file abc.gz without extracting it?
I tried to find a way to see the content without unzipping, but I did not find one.
You can use the command below to see the file without replacing it with the decompressed content:
gunzip -c filename.gz
Just use zcat to see content without extraction.
zcat abc.gz
From the manual:
zcat is identical to gunzip -c. (On some systems, zcat may be installed as gzcat to preserve the original link to compress.) zcat uncompresses either a list of files on the command line or its standard input and writes the uncompressed data on standard output. zcat will uncompress files that have the correct magic number whether they have a .gz suffix or not.
Plain text:
cat abc
Gzipped text:
zcat abc.gz
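If they are installed (they usually ship alongside gzip), zless and zgrep work the same way without extracting anything ("pattern" is just a placeholder):
zless abc.gz
zgrep "pattern" abc.gz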

Can you help me understand these TAR, SPLIT commands?

What do these command lines mean?
tar cvzf - ./android_4.0.4_origen_final_full/ | split -b 2048m - android_4.0.4_origen_final_full.tar.gz
cat android_4.0.4_origen_final_full.tar.gz* | tar -zxvpf - -C /work
I would suggest Googling "man tar", "man split" and "man cat" for details on options and such.
tar is a program (tar originally was short for "tape archive") which creates a serial archive format. It's used to glob a whole directory structure full of files into a single archive file or onto a backup device (tape, disk, or whatever).
split will take a single file and break it into chunks of a given size.
tar cvzf - ./android_4.0.4_origen_final_full/ | split -b 2048m - android_4.0.4_origen_final_full.tar.gz
This command will create an archive of all the files under ./android_4.0.4_origen_final_full/ and, instead of creating a single archive file, break the result up (via split) into several 2,048MB (2GB) files. Specifically, the c option on tar means "create", v means "verbose" (you'll get an output line for each file archived), z means it will be compressed (with gzip), and f indicates the output file. Since the output file is given as -, the output goes to standard output (thus, it can be piped into split). The split option -b 2048m means the output will be split into 2GB sized files. So if the archive is 3GB, you'll get one file that's 2GB and one that's 1GB.
cat android_4.0.4_origen_final_full.tar.gz* | tar -zxvpf - -C /work
This does the opposite of the first command. It concatenates all files in the current folder whose names start with android_4.0.4_origen_final_full.tar.gz and unarchives them with tar. The tar options are the same as above, but x means "extract", p means "preserve" file permissions, f - means take the input from standard input (from the cat command in this case), and -C /work tells tar to change to the /work directory for the extraction.
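As a sanity check before extracting, you can list the reassembled archive's contents by swapping x for t:
cat android_4.0.4_origen_final_full.tar.gz* | tar -tzvf -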
The first command creates multiple tar.gz pieces split into 2GB chunks, while the second command extracts the contents of the reassembled tar.gz into the /work directory. It uses pipes "|" to connect one shell command's output to the input (stdin) of another.
tar cvzf - ./android_4.0.4_origen_final_full/
creates a tar archive (a file with all the data lined up contiguously) of everything under the android_4.0.4_origen_final_full folder and compresses it with gzip (which is why the pieces carry .tar.gz in their name). The output is piped (sent) to the following command.
split -b 2048m -
splits the input supplied on stdin (from the prior tar command) into 2GB chunks, creating individual pieces of the gzipped tar with the base name:
android_4.0.4_origen_final_full.tar.gz
cat android_4.0.4_origen_final_full.tar.gz*
dumps the raw contents of the matching files to the screen (stdout)
tar -zxvpf - -C /work
extracts everything from its input (stdin) into the /work folder

How do we rename the files after we have downloaded them using wget by reading their links from an external file?

I am trying to download some files using wget. I have stored all the links in a .txt file. When I read that file with wget -i <filename>.txt, the downloads start, but a notice is generated saying that the file name is too long, and after this the download process is terminated.
How can I rename the files so that the file name stays within an acceptable length and the download continues?
Is there something like wget -O <target filename> <URL> for renaming files when they are downloaded from a .txt file?
I do not believe that this functionality exists in wget. You should probably loop through the file in a Perl or shell script, or something similar.
The example below is modified from an example at ubuntuforums.org. With minor modifications you could adapt the output file names to your needs. Currently it limits the file name to the first 50 characters of the link.
#!/bin/bash
while read -r link
do
output=$(echo "$link" | cut -c 1-50)
wget "$link" -O "$output"
done < ./links.txt
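If truncating to the first 50 characters could produce duplicate names, a variant that prefixes a running counter (the %04d format and the use of basename are just illustrative choices) avoids collisions:
#!/bin/bash
n=0
while read -r link
do
    n=$((n + 1))
    # basename keeps only the last path component; cut keeps the name short
    name=$(basename "$link" | cut -c 1-50)
    wget "$link" -O "$(printf '%04d' "$n")_$name"
done < ./links.txt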
Using bash as a helper
for line in $(cat input.txt); do wget "$line"; done
You'll have to determine the output names yourself; otherwise each file will be downloaded to whatever filename is in the URL (e.g. blah.html), or to index.html if the URL ends in a slash.
Dump all the files to one monolithic file
There is another option with wget, which is to use --output-document=file. It concatenates all the downloaded files into one file.
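For example, the following fetches every link in links.txt and writes all responses into a single file (all.out is just a placeholder name):
wget -i links.txt --output-document=all.out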
