I have a long tab-delimited file and I am trying to fill each empty cell with a value that appears later in the same column.
For instance, input.txt:
0
1
1.345 B
2
2.86 A
3
4
I would like an output such as:
0 B
1 B
1.345 B
2 A
2.86 A
3 B
4 B
I've been tinkering with code from other threads (an awk fill-down solution, for instance), but the problem is that the value I want is not before the empty cell but after it, like a Fill Up in Excel.
Additional information:
the input file may have a different number of lines
"A" and "B" in the input file may appear at different rows and are not evenly spaced
the second column may have only two values
the last cell in the second column may not have a value
[EDIT] for the last two rows in input.txt, B is known to be in the second column, as all rows after 2.86 are not A.
Thanks in advance.
$ tac input.txt | awk -v V=B '{if ($2) V=$2; else $2=V; print}' | tac
0 B
1 B
1.345 B
2 A
2.86 A
3 B
4 B
tac (cat backwards) prints a file in reverse. Reverse the file, fill in the missing values, and then reverse it again.
Alternatively, you can process the file in a single pass, as long as you know the first value to fill. This should be quite a bit faster than reversing the file twice.
awk 'BEGIN {fillvalue="B"} $2 {fillvalue=$2=="A"?"B":"A"} !$2 {$2=fillvalue} 1' input.txt
Note that this assumes the second column only ever contains 'A', 'B' or blank, and that the non-blank values alternate between the two.
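If the second column could hold values other than a strictly alternating 'A'/'B', a two-pass awk (reading input.txt twice) fills upward without tac. This is a sketch of my own, not part of the answer above; the tail variable seeds the fill value for blank cells below the last non-blank one:
awk -v tail=B '
NR==FNR { v[FNR] = $2; n = FNR; next }              # pass 1: remember column 2
FNR==1  { f = tail                                  # pass 2 starts: pre-compute the fill-up values
          for (i = n; i >= 1; i--) { if (v[i] != "") f = v[i]; up[i] = f } }
$2 == "" { $2 = up[FNR] }                           # blank cell: take the next non-blank value below
1                                                   # print every line
' input.txt input.txt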
I have a tab-delimited text file with two columns and no header. I want to take the mean of each column within blocks of 10 rows: take the first 10 rows, compute the mean of the 10 numbers in each column, and write the two means to another text file. Then take the next 10 rows and do the same, and so on until the end of the file. If fewer than 10 rows are left at the end, just take the mean of the remaining rows.
Input file:
0.32832977 3.50941E-10
0.31647876 3.38274E-10
0.31482627 3.36508E-10
0.31447645 3.36134E-10
0.31447645 3.36134E-10
0.31396809 3.35591E-10
0.31281157 3.34354E-10
0.312004 3.33491E-10
0.31102326 3.32443E-10
0.30771822 3.2891E-10
0.30560062 3.26647E-10
0.30413213 3.25077E-10
0.30373717 3.24655E-10
0.29636685 3.16777E-10
0.29622422 3.16625E-10
0.29590765 3.16286E-10
0.2949896 3.15305E-10
0.29414582 3.14403E-10
0.28841901 3.08282E-10
0.28820667 3.08055E-10
0.28291832 3.02403E-10
0.28243792 3.01889E-10
0.28156429 3.00955E-10
0.28043638 2.9975E-10
0.27872239 2.97918E-10
0.27833349 2.97502E-10
0.27825573 2.97419E-10
0.27669023 2.95746E-10
0.27645657 2.95496E-10
Expected output text file:
0.314611284 3.36278E-10
0.296772974 3.172112E-10
0.279535036 2.987864E-10
I tried this code, but I don't know how to make it restart and print at every 10th row:
awk '{x+=$1;next}END{print x/NR}' file
Here is an awk to do this:
awk -v m=10 -v OFS="\t" '
FNR%m==1{sum1=0;sum2=0}
{sum1+=$1;sum2+=$2}
FNR%m==0{print sum1/m,sum2/m; lfnr=FNR; next}
END{if (FNR%m) print sum1/(FNR-lfnr),sum2/(FNR-lfnr)}' file
Prints:
0.314611 3.36278e-10
0.296773 3.17211e-10
0.279535 2.98786e-10
Or, if you want the same number of decimal digits as the input, you can use printf:
awk -v m=10 -v OFS="\t" '
FNR%m==1{sum1=0;sum2=0}
{sum1+=$1;sum2+=$2}
FNR%m==0{printf("%0.9G%s%0.9G\n",sum1/m,OFS,sum2/m); lfnr=FNR; next}
END{if (FNR%m) printf("%0.9G%s%0.9G\n",sum1/(FNR-lfnr),OFS,sum2/(FNR-lfnr))}' file
Prints:
0.314611284 3.36278E-10
0.296772974 3.172112E-10
0.279535036 2.98786444E-10
Your superpower here is the % modulo operator, which lets you detect every mth line -- in this case every 10th. Your x-ray vision is awk's FNR special variable, which is the line number of the file you are reading.
FNR%10 is always less than 10; when it is 0 you are on the 10th line of a block and it is time to print, and when it is 1 you are on the first line of a block and it is time to reset the sums.
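If your files can have more than two columns, a generalized sketch of the same idea (my own variation, assuming every row has the same number of fields) loops over NF instead of hard-coding sum1 and sum2:
awk -v m=10 -v OFS="\t" '
FNR % m == 1 { for (i = 1; i <= NF; i++) sum[i] = 0 }     # first row of a block: reset the sums
{ for (i = 1; i <= NF; i++) sum[i] += $i; nf = NF }       # accumulate every column
FNR % m == 0 {                                            # block complete: print its means
    for (i = 1; i <= nf; i++) printf "%s%s", sum[i] / m, (i < nf ? OFS : ORS)
    lfnr = FNR
}
END {                                                     # leftover rows (fewer than m) at the end
    if (FNR % m)
        for (i = 1; i <= nf; i++) printf "%s%s", sum[i] / (FNR - lfnr), (i < nf ? OFS : ORS)
}' file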
Say my stream is x*N lines long, where x is the number of records and N is the number of columns per record, and the data arrives column-wise. For example, x=2, N=3:
1
2
Alice
Bob
London
New York
How can I join every line, modulo the number of records, back into columns:
1 Alice London
2 Bob New York
If I use paste, with N -s, I get the transposed output. I could use split, with the -l option equal to N, then recombine the pieces afterwards with paste, but I'd like to do it within the stream without spitting out temporary files all over the place.
Is there an "easy" solution (i.e., rather than invoking something like awk)? I'm thinking there may be some magic join solution, but I can't see it...
EDIT Another example, when x=5 and N=3:
1
2
3
4
5
a
b
c
d
e
alpha
beta
gamma
delta
epsilon
Expected output:
1 a alpha
2 b beta
3 c gamma
4 d delta
5 e epsilon
You are looking for pr to "columnate" the stream:
pr -T -s$'\t' -3 <<'END_STREAM'
1
2
Alice
Bob
London
New York
END_STREAM
1 Alice London
2 Bob New York
pr is in coreutils.
Most systems should include a tool called pr, intended to print files. It's part of POSIX.1 so it's almost certainly on any system you'll use.
$ pr -3 -t < inp1
1 a alpha
2 b beta
3 c gamma
4 d delta
5 e epsilon
Or if you prefer,
$ pr -3 -t -s, < inp1
1,a,alpha
2,b,beta
3,c,gamma
4,d,delta
5,e,epsilon
or
$ pr -3 -t -w 20 < inp1
1 a alpha
2 b beta
3 c gamma
4 d delta
5 e epsilo
Check the POSIX specification for pr for standard usage information, or man pr for the specific options in your operating system.
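If you only know the number of records x (the number of output rows), one way to get the column count that pr needs is to divide the total line count by x. A sketch, assuming the stream is already in a file (inp1 here) so it can be counted:
$ x=5                                # number of records, i.e. output rows
$ n=$(( $(wc -l < inp1) / x ))       # columns = total lines / records
$ pr -"$n" -t -s' ' < inp1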
In order to reliably process the input you need to either know the number of columns in the output file or the number of lines in the output file. If you just know the number of columns, you'd need to read the input file twice.
Hackish coreutils solution
# If you don't know the number of output lines in advance but you do
# know the number of output columns, you can calculate the line count
# using wc -l
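# e.g. (a sketch, where ncols is the known number of output columns):
# olines=$(( $(wc -l < file) / ncols ))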
# Split the file by the number of output lines
split -l"${olines}" file FOO # FOO is a prefix. Choose a better one
paste FOO*
AWK solutions
If you know the number of output columns in advance you can use this awk script:
convert.awk:
BEGIN {
# Split the file into one big record where fields are separated
# by newlines
RS=""
FS="\n"
}
FNR==NR {
# We are reading the file twice (see invocation below)
# When reading it the first time we store the number
# of fields (lines) in the variable n because we need it
# when processing the file.
n=NF
# Skip the remaining rules on the first pass so the
# output is not printed twice
next
}
{
# n / c is the number of output lines
# For every output line ...
for(i=0;i<n/c;i++) {
# ... print the columns belonging to it
for(ii=1+i;ii<=NF;ii+=n/c) {
printf "%s ", $ii
}
print "" # Adds a newline
}
}
and call it like this:
awk -vc=3 -f convert.awk file file # Twice the same file
If you know the number of output lines in advance you can use the following awk script:
convert.awk:
BEGIN {
# Split the file into one big record where fields are separated
# by newlines
RS=""
FS="\n"
}
{
# x is the number of output lines and has been passed to the
# script. For each line in output
for(i=0;i<x;i++){
# ... print the columns belonging to it
for(ii=i+1;ii<=NF;ii+=x){
printf "%s ",$ii
}
print "" # Adds a newline
}
}
And call it like this:
awk -vx=2 -f convert.awk file
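For the x=5, N=3 example from the question, the same script applies with x set to the number of output lines:
awk -vx=5 -f convert.awk file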
I have a dataset with 20 000 probes in two columns, 21 nt each. From this file I need to extract the lines in which the last nucleotide in the Probe 1 column matches the last nucleotide in the Probe 2 column. So far I tried awk's substr function, but didn't get the expected outcome. Here is the one-liner I tried:
awk '{if (substr($2,21,1)==substr($4,21,1)){print $0}}'
Another option would be to anchor the last character in columns 2 and 4 (awk '$2~/[A-Z]$/'), but I can't find a way to match the probes in the two columns using a regex. All suggestions and comments will be very much appreciated.
Example of dataset:
Probe 1 Probe 2
4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
4738 GGAGGATTTGGCCGGAGAGGC C GGAGGAGGAGGAGGACGAGGT
4739 GGAGGAAGAGGAGGGGGAGGT D GGAGGACGAGGAGGAGGAGGC
4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC
Desired output:
4736 GGAGGAAGAGGAGGCGGAGGA A GGAGGACGAGGAGGAGGAGGA
4737 GGAGGAAGAGGAGGGAGAGGG B GGAGGACGAGGAGGAGGAGGG
4740 GGAGGAAGAGGAGGGGGAGGC E GGAGGAGGAGGACGAGGAGGC
This will filter the input, matching lines where the last character of the 2nd column is equal to the last character of the 4th column:
awk 'substr($2, length($2), 1) == substr($4, length($4), 1)'
What I changed compared to your sample script:
Move the if statement out of the { ... } block into a filter
Use length($2) and length($4) instead of hardcoding the value 21
The { print $0 } is not needed, as that is the default action for the matched lines
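If your real file also carries the "Probe 1 Probe 2" header shown in the example and you want to keep it in the output, a small variation (a sketch; probes.txt is a placeholder filename) passes line 1 through unconditionally:
awk 'NR == 1 || substr($2, length($2), 1) == substr($4, length($4), 1)' probes.txt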
Is there any easy way to convert this JCL SORT to a shell script?
Here is the JCL SORT:
OPTION ZDPRINT
SORT FIELDS=(15,1,CH,A)
SUM FIELDS=(16,8,25,8,34,8,43,8,52,8,61,8),FORMAT=ZD
OUTREC BUILD=(14X,15,54,13X)
Only bytes 15 onward, for a length of 54, are relevant from the input data: they contain the key and the source values for the summation. Other bytes from the input are not important.
Assuming the data is printable.
The data is sorted on the one-byte key, and the values for records with the same key are summed, separately, for each of the six numbers. A single record is written per key, with the summed values and with other data (the single bytes in between and at the end) taken from the first record. The sort is "unstable" (meaning that the order of records presented to the summation is not reproducible from one execution to the next), so those byte values should theoretically be the same on all records, or be irrelevant.
The output, for each key, is presented as a record containing 14 blanks (14X) then the 54 bytes starting at position 15 (which is the one-byte key) and then followed by 13 blanks (13X). The numbers should be right-aligned and left-zero-filled [OP to confirm, and amend sample data and expected output].
Assuming the sums will only contain positive numbers and will not be signed, and that any number less than 999999990 will have leading zeros in the unused positions (the numbers are character data, right-aligned and left-zero-filled).
Assuming the one-byte key will only be alphabetic.
The data has already been converted to ASCII from EBCDIC.
Sample Input:
00000000000000A11111111A11111111A11111111A11111111A11111111A111111110000000000000
00000000000000B22222222A22222222A22222222A22222222A22222222A222222220000000000000
00000000000000C33333333A33333333A33333333A33333333A33333333A333333330000000000000
00000000000000A44444444B44444444B44444444B44444444B44444444B444444440000000000000
Expected Output:
A55555555A55555555A55555555A55555555A55555555A55555555
B22222222A22222222A22222222A22222222A22222222A22222222
C33333333A33333333A33333333A33333333A33333333A33333333
(14 preceding blanks and 13 trailing blanks)
Expected volume: tens of thousands of records.
I have figured out an answer:
awk -v FIELDWIDTHS="14 1 8 1 8 1 8 1 8 1 8 1 8 13" \
'{if(!($2 in a)) {a[$2]=$2; c[$2]=$4; e[$2]=$6; g[$2]=$8; i[$2]=$10; k[$2]=$12} \
b[$2]+=$3; d[$2]+=$5; f[$2]+=$7; h[$2]+=$9; j[$2]+=$11; l[$2]+=$13;} END \
{for(id in a) printf("%14s%s%s%s%s%s%s%s%s%s%s%s%s%13s\n","",a[id],b[id],c[id],d[id],e[id],f[id],g[id],h[id],i[id],j[id],k[id],l[id],"");}' input
Explanation:
1) Split each line into fixed-width fields:
awk -v FIELDWIDTHS="14 1 8 1 8 1 8 1 8 1 8 1 8 13"
2) Use $2 as the key; $4, $6, $8, $10 and $12 are only set the first time a key is seen:
{if(!($2 in a)) {a[$2]=$2; c[$2]=$4; e[$2]=$6; g[$2]=$8; i[$2]=$10; k[$2]=$12}
3) The remaining numeric fields are summed up:
b[$2]+=$3; d[$2]+=$5; f[$2]+=$7; h[$2]+=$9; j[$2]+=$11; l[$2]+=$13;} END
4) Print one line per key:
{for(id in a) printf("%14s%s%s%s%s%s%s%s%s%s%s%s%s%13s\n","",a[id],b[id],c[id],d[id],e[id],f[id],g[id],h[id],i[id],j[id],k[id],l[id],"");}
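Two caveats, not addressed in the answer above: FIELDWIDTHS is a GNU awk (gawk) extension, and for (id in a) visits keys in an unspecified order. If the report must come out in key order, as the JCL SORT produces it, a gawk-only sketch (gawk 4.0+ for PROCINFO["sorted_in"]) is:
gawk -v FIELDWIDTHS="14 1 8 1 8 1 8 1 8 1 8 1 8 13" '
!($2 in a) { a[$2]=$2; c[$2]=$4; e[$2]=$6; g[$2]=$8; i[$2]=$10; k[$2]=$12 }
{ b[$2]+=$3; d[$2]+=$5; f[$2]+=$7; h[$2]+=$9; j[$2]+=$11; l[$2]+=$13 }
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"     # visit keys in ascending order
    for (id in a)
        printf("%14s%s%s%s%s%s%s%s%s%s%s%s%s%13s\n",
               "", a[id], b[id], c[id], d[id], e[id], f[id], g[id], h[id], i[id], j[id], k[id], l[id], "")
}' input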
Okay, I have tried something.
1) Extract the duplicate keys from the file and store them in a duplicates file:
awk '{k=substr($0,1,15);a[k]++}END{for(i in a)if(a[i]>1)print i}' sample > duplicates
OR
awk '{k=substr($0,1,15);print k}' sample | sort | uniq -c | awk '$1>1{print $2}' > duplicates
2) For the duplicates, do the calculation and create newfile in the specified format:
while read line
do
grep "^$line" sample | awk -F'[A-Z]' -v key="$line" '{for(i=2;i<=7;i++)f[i]=f[i]+$i}END{printf("%14s"," ");for(i=2;i<=7;i++){printf("%s%.8s",substr(key,15,1),f[i]);if(i==7)printf("%13s\n"," ")}}'
done < duplicates > newfile
3) For the unique ones, format and append to newfile:
grep -v -f duplicates sample | sed 's/0/ /g' >> newfile ## breaks if a 0 appears within the data rather than only in the leading and trailing padding of a row
OR
grep -v -f duplicates sample | awk '{printf("%14s%s%13s\n"," ",substr($0,15,54)," ")}' >> newfile
If you have any doubts, let me know.
I have a set of 10000 text files (file1.txt, file2.txt, ..., file10000.txt). Each one has a different number of rows. I'd like to know the average number of rows across these 10000 files, excluding the last row of each file. For example:
File1:
a
b
c
d
last
File2:
a
b
c
last
File3:
a
b
c
d
e
last
Here I should obtain 4 as the result. I tried with Python, but it takes too much time to read all the files. How could I do this with a shell script?
Here's one way:
$ for i in {1..3}; do seq "$i" > "file${i}.txt"; done
so that file1.txt has 1 line, file2.txt has 2 lines, and so on. Then:
$ for i in {1..3}; do wc -l file${i}.txt; done | awk '{sum+=$1}END{print sum/NR}'
2
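To exclude the last row of each file, as the question asks, subtract one line per file. A sketch, assuming the 10000 files match file*.txt and the glob fits on the command line:
$ wc -l file*.txt | awk '$2 != "total" { sum += $1 - 1; n++ } END { print sum / n }'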