Convert Mainframe SORT to Shell Script - shell

Is there any easy way to convert JCL SORT to Shell Script?
Here is the JCL SORT:
OPTION ZDPRINT
SORT FIELDS=(15,1,CH,A)
SUM FIELDS=(16,8,25,8,34,8,43,8,52,8,61,8),FORMAT=ZD
OUTREC BUILD=(14X,15,54,13X)
Only bytes 15 for a length of 54 are relevant from the input data, which is the key and the source values for the summation. Others bytes from the input are not important.
Assuming the data is printable.
The data is sorted on the one-byte key, and each value for records with the same key is summed, separately, for each of the six numbers. A single record is written, per key, with the summed values and with other data (those one bytes in between and at the end) from the first record. The sort is "unstable" (meaning that the order of records presented to the summation is not reproduceable from one execution to the next) so the byte values should theoretically be the same on all records, or be irrelevant.
The output, for each key, is presented as a record containing 14 blanks (14X) then the 54 bytes starting at position 15 (which is the one-byte key) and then followed by 13 blanks (13X). The numbers should be right-aligned and left-zero-filled [OP to confirm, and amend sample data and expected output].
Assuming the sum will only contain positive number and will not be signed, and that for any number which is less than 999999990 there will be leading zeros for any unused positions (numbers are character, right-aligned and left-zero-filled).
Assuming the one-byte key will only be alphabetic.
The data has already been converted to ASCII from EBCDIC.
Sample Input:
00000000000000A11111111A11111111A11111111A11111111A11111111A111111110000000000000
00000000000000B22222222A22222222A22222222A22222222A22222222A222222220000000000000
00000000000000C33333333A33333333A33333333A33333333A33333333A333333330000000000000
00000000000000A44444444B44444444B44444444B44444444B44444444B444444440000000000000
Expected Output:
A55555555A55555555A55555555A55555555A55555555A55555555
B22222222A22222222A22222222A22222222A22222222A22222222
C33333333A33333333A33333333A33333333A33333333A33333333
(14 preceding blanks and 13 trailing blanks)
Expected Volume: tenth thousands

I have figured an answer:
awk -v FIELDWIDTHS="14 1 8 1 8 1 8 1 8 1 8 1 8 13" \
'{if(!($2 in a)) {a[$2]=$2; c[$2]=$4; e[$2]=$6; g[$2]=$8; i[$2]=$10; k[$2]=$12} \
b[$2]+=$3; d[$2]+=$5; f[$2]+=$7; h[$2]+=$9; j[$2]+=$11; l[$2]+=$13;} END \
{for(id in a) printf("%14s%s%s%s%s%s%s%s%s%s%s%s%s%13s\n","",a[id],b[id],c[id],d[id],e[id],f[id],g[id],h[id],i[id],j[id],k[id],l[id],"");}' input
Explaination:
1) Split the string
awk -v FIELDWIDTHS="14 1 8 1 8 1 8 1 8 1 8 1 8 13"
2) Let $2 be the key and $4, $6, $8, $10, $12 will only set value for the first time
{if(!($2 in a)) {a[$2]=$2; c[$2]=$4; e[$2]=$6; g[$2]=$8; i[$2]=$10; k[$2]=$12}
3) Others will be summed up
b[$2]+=$3; d[$2]+=$5; f[$2]+=$7; h[$2]+=$9; j[$2]+=$11; l[$2]+=$13;} END
4) Print for each key
{for(id in a) printf("%14s%s%s%s%s%s%s%s%s%s%s%s%s%13s\n","",a[id],b[id],c[id],d[id],e[id],f[id],g[id],h[id],i[id],j[id],k[id],l[id],"");}

okay I have tried something
1) extracting duplicate keys from file and storing it in duplicates file.
awk '{k=substr($0,1,15);a[k]++}END{for(i in a)if(a[i]>1)print i}' sample > duplicates
OR
awk '{k=substr($0,1,15);print k}' sample | sort | uniq -c | awk '$1>1{print $2}' > duplicates
2) For duplicates, doing the calculation and creating newfile with specificied format
while read line
do
grep ^$line sample | awk -F[A-Z] -v key=$line '{for(i=2;i<=7;i++)f[i]=f[i]+$i}END{printf("%14s"," ");for(i=2;i<=7;i++){printf("%s%.8s",substr(key,15,1),f[i]);if(i==7)printf("%13s\n"," ")}}' > newfile
done < duplicates
3) for unique ones,format and append to newfile
grep -v -f duplicates sample | sed 's/0/ /g' >> newfile ## gives error if 0 is within data instead of start and end in a row.
OR
grep -v -f duplicates sample | awk '{printf("%14s%s%13s\n"," ",substr($0,15,54)," ")}' >> newfile
if you have any doubt, let me know.

Related

Removing beginnings sequences in fasta from a list with size

I want to remove specific sequence in the list with IDs and extract sequence from large fasta file.
input test.fasta file:
>GHAT8X
MKFNDIRNDGHEDCFNNIIFASKLSSHKNVLKLTGCCLETRIPVIVFESVKNRTLADHIYQNQPHFEPLLLSQRLRIAVHIANAIAYLHIGFSRPILHRKIRPSRIFLDEGYIAKLFDFSLSVSIPEGETCVKDKVTGTMGFLAPEYI
>GHAMNO
MRLIGCCLETENPVLVFEYVEYGTLADRIYHPRQPNFEPVTCSLRLKIAMEIAYGIAYLHVAFSRPIVFRNVKPSNILFQEQSVAKLFDFSYSESIPEGETRIRGRVMGTFGYLPPEYIATGDCNEKCDVYSFGMLLLELLTGQRAVD
>GHAXM6
MYSCLGAIKNSGKEDKEKCIMRNGKNLLENLISSFNDGETHIKDAIPIGIMGFVATEYVTTGDYNEKCDVFSFGVLLLVLLTGQKLYSIDEAGDRHWLLNRVKKHIECNTFDEIVDPVIREELCIQSSEKDKQVQAFVELAVKCVSES
seqid_len.txt file:
GHAT8X 25
GHAMNO 26
GHAXM6 20
Expected output:
>GHAT8X
SSHKNVLKLTGCCLETRIPVIVFESVKNRTLADHIYQNQPHFEPLLLSQRLRIAVHIANA
IAYLHIGFSRPILHRKIRPSRIFLDEGYIAKLFDFSLSVSIPEGETCVKDKVTGTMGFLA
PEYI
>GHAMNO
ADRIYHPRQPNFEPVTCSLRLKIAMEIAYGIAYLHVAFSRPIVFRNVKPSNILFQEQSVA
KLFDFSYSESIPEGETRIRGRVMGTFGYLPPEYIATGDCNEKCDVYSFGMLLLELLTGQR
AVD
>GHAXM6
MRNGKNLLENLISSFNDGETHIKDAIPIGIMGFVATEYVTTGDYNEKCDVFSFGVLLLVL
LTGQKLYSIDEAGDRHWLLNRVKKHIECNTFDEIVDPVIREELCIQSSEKDKQVQAFVEL
AVKCVSES
I tried:
sed 's/_/|/g' seqid_len.txt | while read line;do grep -i -A1 ${line%%[1-9]*} test.fasta | seqkit subseq -r ${line##[a-z]* }:-1 ; done
Only getting GHAT8X 25 and GHAMNO 26 sequence out. However, renaming the header does not work.
Any correction on this or any python solution would be really helpful.
Have a great weekend.
Thanks
Would you please try the following:
#!/bin/bash
awk 'NR==FNR {a[">" $1] = $2 + 0; next} # create an array which maps the header to the starting position of the sequence
$0 in a { # the header matches an array index
start = a[$0] # get the starting position
print # print the header
getline # read the sequence line
print substr($0, start) # print the sequence by removing the beginnings
}
' seqid_len.txt test.fasta | fold -w 60 # wrap the output within 60 columns
Output:
>GHAT8X
SSHKNVLKLTGCCLETRIPVIVFESVKNRTLADHIYQNQPHFEPLLLSQRLRIAVHIANA
IAYLHIGFSRPILHRKIRPSRIFLDEGYIAKLFDFSLSVSIPEGETCVKDKVTGTMGFLA
PEYI
>GHAMNO
ADRIYHPRQPNFEPVTCSLRLKIAMEIAYGIAYLHVAFSRPIVFRNVKPSNILFQEQSVA
KLFDFSYSESIPEGETRIRGRVMGTFGYLPPEYIATGDCNEKCDVYSFGMLLLELLTGQR
AVD
>GHAXM6
IMRNGKNLLENLISSFNDGETHIKDAIPIGIMGFVATEYVTTGDYNEKCDVFSFGVLLLV
LLTGQKLYSIDEAGDRHWLLNRVKKHIECNTFDEIVDPVIREELCIQSSEKDKQVQAFVE
LAVKCVSES
You'll see the 3rd sequence starts with IMR.., one column shifted compared with your expected MRN... If the 3rd one is correct and the 1st and the 2nd sequences should be fixed, tweak the calculation $2 + 0 as $2 + 1.

Multiplication of two variables containing tuples in BASH script

I have two variables containing tuples of same length generated from a PostgreSQL database and several successful follow on calculations, which I would like to multiply to generate a third variable containing the answer tuple. Each tuple contains 100 numeric records. Variable 1 is called rev_p_client_pa and variable 2 is called lawnp_p_client. I tried the following which gives me a third tuple but the answer rows are not calculated correctly:
rev_p_client_pa data is:
0.018183
0.0202814
0.013676
0.0134083
0.0108168
0.014197
0.0202814
lawn_p_client data is:
52.17
45
30.43
50
40
35
50
The command I used in the script:
awk -v var3="$rev_p_client_pa" 'BEGIN{print var3}' | awk -v var4="$lawnp_p_client" -F ',' '{print $(1)*var4}'
The command gives the following output:
0.948607
1.05808
0.713477
0.699511
0.564312
0.740657
1.05808
However when manually calculated in libreoffice calc i get:
0.94860711
0.912663
0.41616068
0.670415
0.432672
0.496895
1.01407
I used this awk structure to multiply a tuple variable with numeric value variable in a previous calculation and it calculated correctly. Does someone know how the correct awk statement should be written or maybe you have some other ideas that might be useful? Thanks for your help.
Use paste to join the two data sets together, forming a list of pairs, each separated by tab.
Then pipe the result to awk to multiply each pair of numbers, resulting in a list of products.
#!/bin/bash
rev_p_client_pa='0.018183
0.0202814
0.013676
0.0134083
0.0108168
0.014197
0.0202814'
lawn_p_client='52.17
45
30.43
50
40
35
50'
paste <(echo "$rev_p_client_pa") <(echo "$lawn_p_client") | awk '{print $1*$2}'
Output:
0.948607
0.912663
0.416161
0.670415
0.432672
0.496895
1.01407
All awk:
$ awk -v rev_p_client_pa="$rev_p_client_pa" \
-v lawn_p_client="$lawn_p_client" ' # "tuples" in as vars
BEGIN {
split(lawn_p_client,l,/\n/) # split the "tuples" by \n
n=split(rev_p_client_pa,r,/\n/) # get count of the other
for(i=1;i<=n;i++) # loop the elements
print r[i]*l[i] # multiply and output
}'
Output:
0.948607
0.912663
0.416161
0.670415
0.432672
0.496895
1.01407

how to add one to all fields in a file

suppose I have file containing numbers like:
1 4 7
2 5 8
and I want to add 1 to all these numbers, making the output like:
2 5 8
3 6 9
is there a simple one-line command (e.g. awk) to realize this?
try following once.
awk '{for(i=1;i<=NF;i++){$i=$i+1}} 1' Input_file
EDIT: As per OP's request without loop, here is a solution(written as per shown sample only).
With hardcoding of number of fields.
awk -v RS='[ \n]' '{ORS=NR%3==0?"\n":" ";print $0+1}' Input_file
OR
Without hardcoding number of fields.
awk -v RS='[ \n]' -v col=$(awk 'FNR==1{print NF}' Input_file) '{ORS=NR%col==0?"\n":" ";print $0+1}' Input_file
Explanation: So in EDIT section 1st solution I have hardcoded the number of fields by mentioning 3 there, in OR solution of EDIT, I am creating a variable named col which will read the very first line of Input_file to get the number of fields. Then it will not read all the Input_file, Now coming onto the code I have set Record separator as space or new line to it will add them without using a loop and it will add space each time after incrementing 1 in their values. It will print new line only when number of lines are completely divided by value of col(which is why we have taken number of fields in -v col section).
In native bash (no awk or other external tool needed):
#!/usr/bin/env bash
while read -r -a nums; do # read a line into an array, splitting on spaces
out=( ) # initialize an empty output array for that line
for num in "${nums[#]}"; do # iterate over the input array...
out+=( "$(( num + 1 ))" ) # ...and add n+1 to the output array.
done
printf '%s\n' "${out[*]}" # then print that output array with a newline following
done <in.txt >out.txt # with input from in.txt and output to out.txt
You can do this using gnu awk:
awk -v RS="[[:space:]]+" '{$0++; ORS=RT} 1' file
2 5 8
3 6 9
If you don't mind Perl:
perl -pe 's/(\d+)/$1+1/eg' file
Substitute any number composed of multiple digits (\d+) with that number ($1) plus 1. /e means to execute the replacement calculation, and /g means globally throughout the file.
As mentioned in the comments, the above only works for positive integers - per the OP's original sample file. If you wanted it to work with negative numbers, decimals and still retain text and spacing, you could go for something like this:
perl -pe 's/([-]?[.0-9]+)/$1+1/eg' file
Input file
Some column headers # words
1 4 7 # a comment
2 5 cat dog # spacing and stray words
+5 0 # plus sign
-7 4 # minus sign
+1000.6 # positive decimal
-21.789 # negative decimal
Output
Some column headers # words
2 5 8 # a comment
3 6 cat dog # spacing and stray words
+6 1 # plus sign
-6 5 # minus sign
+1001.6 # positive decimal
-20.789 # negative decimal

Bash - fill empty cell with following value in the column

I have a long tab-delimited CSV file and I am trying to paste in a cell a value that comes later on the column.
For instance, input.txt:
0
1
1.345 B
2
2.86 A
3
4
I would like an output such as:
0 B
1 B
1.345 B
2 A
2.86 A
3 B
4 B
I've been tinkering with code from other threads like this awk solution, but the problem is that the value I want is not before the empty cell, but after, kind of a .FillUp in Excel.
Additional information:
input file may have different number of lines
"A" and "B" in input file may be at different rows and not evenly separated
second column may have only two values
last cell in second column may not have value
[EDIT] for the last two rows in input.txt, B is known to be in the second column, as all rows after 2.86 are not A.
Thanks in advance.
$ tac input.txt | awk -v V=B '{if ($2) V=$2; else $2=V; print}' | tac
0 B
1 B
1.345 B
2 A
2.86 A
3 B
4 B
tac (cat backwards) prints a file in reverse. Reverse the file, fill in the missing values, and then reverse it again.
This allows you to process the file in a single pass as long as you know the first value to fill. It should be quite a bit faster than reversing the file twice.
awk 'BEGIN {fillvalue="B"} $2 {fillvalue=$2=="A"?"B":"A"} !$2 {$2=fillvalue} 1' input.txt
Note that this assumes knowledge about the nature of that second column being only 'A' or 'B' or blank.

Removing repeated pairs from a very big text file

I have a very big text file (few GB) that has the following format:
1 2
3 4
3 5
3 6
3 7
3 8
3 9
File is already sorted and double lines were removed. There are repeated pairs like '2 1', '4 3' reverse order that I want to remove. Does anybody have any solution to do it in a very resource limited environments, in BASH, AWK, perl or any similar languages? I can not load the whole file and loop between the values.
You want to remove lines where the second number is less than the first?
perl -i~ -lane'print if $F[0] < $F[1]' file
Possible solution:
Scan the file
For any pair where the second value is less than the first, swap the two numbers
Sort the pairs again by first then second number
Remove duplicates
I'm still thinking about more efficient solution in terms of disk sweeps, but this is a basic naive approach
For each value, perform a binary search on the file on the hard drive, without loading it into memory. Delete the duplicate if you see it. Then do a final pass that removes all instances of two or more \n.
Not exactly sure if this works / if it's any good...
awk '{ if ($2 > $1) print; else print $2, $1 }' hugetext | sort -nu -O hugetext
You want remove duplicates considering 1 2 and 2 1 to be the same?
< file.in \
| perl -lane'print "#F[ $F[0] < $F[1] ? (0,1,0,1) : (1,0,0,1) ]"' \
| sort -n \
| perl -lane'$t="#F[0,1]"; print "#F[2,3]" if $t ne $p; $p=$t;' \
> file.out
This can handle arbitrarily large files.
Here's a general O(n) algorithm to do this in 1 pass (no loops or sorting required):
Start with an empty hashset as your blacklist (a set is a map with just keys)
Read file one line at a time.
For each line:
Check to see this pair is in your blacklist already.
If so, ignore it.
If not, append it to your result file; and also add the swapped value to the blacklist (e.g., if you just read "3 4", and "4 3" to the blacklist)
This takes O(n) time to run, and O(n) storage for the blacklist. (No additional storage for the result if you manipulate the file as r/w to remove lines as you check them against the blacklist)
perl -lane '
END{
print for sort {$a<=>$b} keys %h;
}
$key = $F[0] < $F[1] ? "$F[0] $F[1]" : "$F[1] $F[0]";
$h{$key} = "";
' file.txt
Explanations :
I sort the current line in numeric order
I make the hash key variable $key by concatenating first and second value with a space
I defined the $hash{$key} to nothing
At the end, I print all the keys sorted in numeric order.
A hash key is uniq by nature, so no duplicate.
You just need to use Unix redirections to create a new file.

Resources