Bash splitting a large gzipped file - bash

I've been recently dealing with some file processing, and I am trying to write a bash one-liner, which would look something like:
zcat largefile.gz | split_by_delimiter_into_separate_files
Things I tried:
zcat largefile.gz | awk '{print $0 " //"> "separate_file" NR}' RS='//'
The delimiter I am trying to split upon is "//". I know something like python could probably solve this into a couple of lines, yet my project is not python-based and this is thus not an option..

Try it like this:
zcat largefile.gz | awk -vRS='//' '{print $0 " //"> "separate_file" NR}'

You can use split which I believe does exactly what you need:
zcat largefile.gz | split -p '//' - separate_file_
will create files prefixed with separate_file_ with the content of large file, split on //

Related

How do I write an awk print command in a loop?

I would like to write a loop creating various output files with the first column of each input file, respectively.
So I wrote
for i in $(\ls -d /home/*paired.isoforms.results)
do
awk -F"\t" {print $1}' $i > $i.transcript_ids.txt
done
As an example if there were 5 files in the home directory named
A_paired.isoforms.results
B_paired.isoforms.results
C_paired.isoforms.results
D_paired.isoforms.results
E_paired.isoforms.results
I would like to print the first column of each of these files into a seperate output file, i.e. I would like to have 5 output files called
A.transcript_ids.txt
B.transcript_ids.txt
C.transcript_ids.txt
D.transcript_ids.txt
E.transcript_ids.txt
or any other name as long as it is 5 different names and I can still link them back to the original files.
I understand, that there is a problem with the double usage of $ in both the awk and the loop command, but I don't know how to change that.
Is it possible to write a command like this in a loop?
This should do the job:
for file in /home/*paired.isoforms.results
do
base=${file##*/}
base=${base%%_*}
awk -F"\t" '{print $1}' $file > $base.transcript_ids.txt
done
I assume that there can be spaces in the first field since you set the delimiter explicitly to tab. This runs awk once per file. There are ways to do it running awk once for all files, but I'm not convinced the benefit is significant. You could consider using cut instead of awk '{print $1}', too. Note that using ls as you did is less satisfactory than using globbing directly; it runs foul of file names with oddball characters (spaces, tabs, etc) in the name.
You can do that entirely in awk:
awk -F"\t" '{split(FILENAME,a,"_"); out=a[1]".transcript_ids.txt"; print $1 > out}' *_paired.isoforms.results
If your input files don't have names as indicated in the question, you'd have to split on something else ( as well as use a different pattern match for the input files ).
My original answer is actually doing extra name resolution every time something is printed. Here's a version that only updates the output filename when FILENAME changes:
awk -F"\t" 'FILENAME!=lf{split(FILENAME,a,"_"); out=a[1]".transcript_ids.txt"; lf=FILENAME} {print $1 > out}' *_paired.isoforms.results

awk execute same command on different files one by one

Hi I have 30 txt files in a directory which are containing 4 columns.
How can I execute a same command on each file one by one and direct output to different file.
The command I am using is as below but its being applied on all the files and giving single output. All i want is to call each file one by one and direct outputs to a new file.
start=$1
patterns=''
for i in $(seq -43 -14); do
patterns="$patterns /cygdrive/c/test/kpi/SIGTRAN_Load_$(exec date '+%Y%m%d' --date="-${i} days ${start}")*"; done
cat /cygdrive/c/test/kpi/*$patterns | sed -e "s/\t/,/g" -e "s/ /,/g"| awk -F, 'a[$3]<$4{a[$3]=$4} END {for (i in a){print i FS a[i]}}'| sed -e "s/ /0/g"| sort -t, -k1,2> /cygdrive/c/test/kpi/SIGTRAN_Load.csv
Sth like this
for fileName in /path/to/files/foo*.txt
do
mangleFile "$fileName"
done
will mangle a list of files you give via globbing. If you want to generate the file name patterns as in your example, you can do it like this:
for i in $(seq -43 -14)
do
for fileName in /cygdrive/c/test/kpi/SIGTRAN_Load_"$(exec date '+%Y%m%d' --date="-${i} days ${start}")"*
do
mangleFile "$fileName"
done
done
This way the code stays much more readable, even if shorter solutions may exist.
The mangleFile of course then will be the awk call or whatever you would like to do with each file.
Use the following idiom:
for file in *
do
./your_shell_script_containing_the_above.sh $file > some_unique_id
done
You need to run a loop on all the matching files:
for i in /cygdrive/c/test/kpi/*$patterns; do
tr '[:space:]\n' ',\n' < "$i" | awk -F, 'a[$3]<$4{a[$3]=$4} END {for (i in a){print i FS a[i]}}'| sed -e "s/ /0/g"| sort -t, -k1,2 > "/cygdrive/c/test/kpi/SIGTRAN_Load-$i.csv"
done
PS: I haven't tried much to refactor your piped commands that can probably be shortened too.

How do i count 1 or more items in comma separated input in Shell

Here's my issue, i know how to count the files using the following two strategies but i have a problem with each one.
I am using '.sh' extension.
First:
count=`echo $2 | awk -F, {'print NF'}`
causes my program to throw an error at me: awk: cannot execute - No such file or directory
Secondly:
count=`echo $2 | tr -cd , | wc -c`
Works if you have multiple values separated by commas, however, it will not work if the input is a single item with no commas.
Like i said, this was previously working with the awk but for some reason when i ran it on the physical device instead of the virtual machine it gave me that error.
any ideas?
Thing I know are NOT the issue:
Version of shell is the same.
Try count=$(echo ${2} | awk -F, '{print NF}') instead - you have your braces and quotes inside-out.
Although, it seems your bigger problem is that awk appears to not be executable... You might try which awk and ls -l $(which awk) to see what's up with that...

remove files less than a cetain size and extract filenames

I am working on a cluster remotely and give a few thousands of jobs. Some jobs crash early. I need to move the output files of those jobs (smaller than 1KB) to another folder and start them again. I guess find can move them with something like:
find . -size -1000c -exec mv {} ../crashed \;
but I also need to restart these crashed jobs. Output files in a bunch of folders in output folder and I need folder name and file name(without extantion) seperatly.
I guess sed or/and awk can do this easily but i am not sure how. By the way i am working on BASH shell.
I am trying to use cut, which seems to be working:
for i in $( find . -size -1000c )
do
FOLDER=$(echo "${i%.*}" | cut -d'/' -f2)
FILENAME=$(echo "${i%.*}" | cut -d'/' -f3)
done
But wouldn't it be better using sed or awk? And how?
Sed is a stream editor and since you're not changing anything I wouldn't use it in this case. You could use awk instead of cut like this:
FOLDER=$(echo "${i%.*}" | awk -v FS="/" '{ print $2 }')
where the -v FS="/" specifies that the variable FS (field separator, is a slash, kind of the same as what you do with the -d option in cut) and print $2 tells awk to print only the second field.
Same goes for the other instruction you have there. In your case what you have to do is simple enough, so cut actually cuts it :D
I usually use awk for more complicated tasks, involving multiple files and/or mathematical computations.
Edit:
note that I'm using gawk here (the awk implementation by GNU). I'm not sure you can pass a variable value with the -v option in other implementations, they'll have their way to do it.

Awk replace a column with its hash value

How can I replace a column with its hash value (like MD5) in awk or sed?
The original file is super huge, so I need this to be really efficient.
So, you don't really want to be doing this with awk. Any of the popular high-level scripting languages -- Perl, Python, Ruby, etc. -- would do this in a way that was simpler and more robust. Having said that, something like this will work.
Given input like this:
this is a test
(E.g., a row with four columns), we can replace a given column with its md5 checksum like this:
awk '{
tmp="echo " $2 " | openssl md5 | cut -f2 -d\" \""
tmp | getline cksum
$2=cksum
print
}' < sample
This relies on GNU awk (you'll probably have this by default on a Linux system), and it uses openssl to generate the md5 checksum. We first build a shell command line in tmp to pass the selected column to the md5 command. Then we pipe the output into the cksum variable, and replace column 2 with the checksum. Given the sample input above, the output of this awk script would be:
this 7e1b6dbfa824d5d114e96981cededd00 a test
I copy pasted larsks's response, but I have added the close line, to avoid the problem indicated in this post: gawk / awk: piping date to getline *sometimes* won't work
awk '{
tmp="echo " $2 " | openssl md5 | cut -f2 -d\" \""
tmp | getline cksum
close(tmp)
$2=cksum
print
}' < sample
This might work using Bash/GNU sed:
<<<"this is a test" sed -r 's/(\S+\s)(\S+)(.*)/echo "\1 $(md5sum <<<"\2") \3"/e;s/ - //'
this 7e1b6dbfa824d5d114e96981cededd00 a test
or a mostly sed solution:
<<<"this is a test" sed -r 'h;s/^\S+\s(\S+).*/md5sum <<<"\1"/e;G;s/^(\S+).*\n(\S+)\s\S+\s(.*)/\2 \1 \3/'
this 7e1b6dbfa824d5d114e96981cededd00 a test
Replaces is from this is a test with md5sum
Explanation:
In the first:- identify the columns and use back references as parameters in the Bash command which is substituted and evaluated then make cosmetic changes to lose the file description (in this case standard input) generated by the md5sum command.
In the second:- similar to the first but hive the input string into the hold space, then after evaluating the md5sum command, append the string G to the pattern space (md5sum result) and using substitution arrange to suit.
You can also do that with perl :
echo "aze qsd wxc" | perl -MDigest::MD5 -ne 'print "$1 ".Digest::MD5::md5_hex($2)." $3" if /([^ ]+) ([^ ]+) ([^ ]+)/'
aze 511e33b4b0fe4bf75aa3bbac63311e5a wxc
If you want to obfuscate large amount of data it might be faster than sed and awk which need to fork a md5sum process for each lines.
You might have a better time with read than awk, though I haven't done any benchmarking.
the input (scratch001.txt):
foo|bar|foobar|baz|bang|bazbang
baz|bang|bazbang|foo|bar|foobar
transformed using read:
while IFS="|" read -r one fish twofish red fishy bluefishy; do
twofish=`echo -n $twofish | md5sum | tr -d " -"`
echo "$one|$fish|$twofish|$red|$fishy|$bluefishy"
done < scratch001.txt
produces the output:
foo|bar|3858f62230ac3c915f300c664312c63f|baz|bang|bazbang
baz|bang|19e737ea1f14d36fc0a85fbe0c3e76f9|foo|bar|foobar

Resources