Unix / Shell Add a range of columns to file - shell

So I've been working on the same problem for the last few days, and I'm at a formatting roadblock.
I have a program that will only run if it's working on an equal number of columns. I know the total column count and the number of columns that need to be added with a filler value of 0, but I am not sure how to do this. Is there some type of range option with awk or sed for this?
Input:
A B C D E
A B C D E 1 1 1 1
Output:
A B C D E 0 0 0 0
A B C D E 1 1 1 1
The alphabet columns are always present (with different values), but this "fill in the blank" function is eluding me. I can't use R for this due to the data file size.

One way using awk:
$ awk 'NF!=n{for(i=NF+1;i<=n;i++)$i=0}1' n=9 file
A B C D E 0 0 0 0
A B C D E 1 1 1 1
Just set n to the number of columns you want to pad up to.
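If the target width isn't known in advance, a two-pass variant (not from the original answer, just a sketch) can read the file twice: the first pass finds the widest row, the second pads the shorter ones:

```shell
# Build the sample input
printf 'A B C D E\nA B C D E 1 1 1 1\n' > file

# Pass 1 (NR==FNR): record the largest NF seen.
# Pass 2: pad every shorter row with 0s up to that width.
awk 'NR==FNR{if(NF>n)n=NF; next} {for(i=NF+1;i<=n;i++)$i=0} 1' file file
```

Assigning to `$i` past the last field forces awk to rebuild the record, so the padding appears in the output.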

Related

How to create a repeated seq in Informatica?

How to generate repeated seq using Informatica mapping.
Src file
A
B
C
D
E
F
G
H
I
J
Trg file
A 1
B 1
C 2
D 2
E 3
F 3
G 4
H 4
I 5
J 5
Thank you in advance.
You can use a Sequence Generator, and then an Expression that divides the value of NEXTVAL by 2:
OUT: ROUND(NEXTVAL / 2)
In the Sequence Generator you could set "Start Value" to 1 and check "Reset" so that the mapping always starts with 1 1 2 2 3 3 if that's what you need.
You should be able to achieve this using variable ports in an Expression transformation, as long as your input rows are sorted in the correct order. e.g. (pseudocode)
v_RowCount = v_RowCount + 1
v_Seq = if v_RowCount Mod 2 = 1 then (v_Seq + 1) else v_Seq
(Output port) out_Seq = v_Seq
(with v_Seq initialised to 0, this increments on odd rows, giving 1 1 2 2 3 3 ...)
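Outside Informatica, the same 1 1 2 2 3 3 ... numbering can be sketched with a one-line awk (assuming only row position matters, not the row content):

```shell
# Append int((NR+1)/2) to each line: rows 1,2 -> 1; rows 3,4 -> 2; ...
printf 'A\nB\nC\nD\nE\nF\n' |
awk '{print $0, int((NR+1)/2)}'
```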

Comparison of 2 CSV files having the same column names but different data

I have two CSV files, each having 2 columns with the same column names. 1.csv was generated first and 2.csv was generated an hour later.
So I want to see the Profit % increase or decrease for each business unit compared to the last report. For example: business unit B has an increase of 50% (((15-10)/10)*100).
However, C has a decrease of 50%. Some new business units (AG & JK) were also added in the new hourly report, and can be treated simply as new. A few business units (D) were also removed in the next hour, and can be treated as no longer required.
So basically I need to know how I can compare and extract this data.
Busines Profit %
A 0
B 10
C 10
D 0
E 0
F 1615
G 0
Busines profit %
A 0
B 15
C 5
AG 5
E 0
F 1615
G 0
JK 10
updated requirement:
Business Profits% Old profit % new Variation
A 0 0 0
B 10 15 50%
C 10 5 -50%
D 0 cleared
AG 5 New
E 0 0 0
F 1615 1615 0%
G 0 0 0%
JK 10 New
I'd use awk for the job, something like this:
$ awk 'NR==FNR{ # process file1 (the older report)
a[$1]=$2 # hash second column, key is the first column
next # process the next record of file1
}
{ # process file2 (the newer report)
if($1 in a==0) # if company not found in hash a
p="new" # it must be new
else
p=($2-a[$1])/(a[$1]==0?1:a[$1])*100 # otherwise calculate p%
print $1,p # output company and p%
}' file1 file2
A 0
B 50
C -50
AG new
E 0
F 0
G 0
JK new
One-liner version with appropriate semicolons:
$ awk 'NR==FNR{a[$1]=$2;next}{if($1 in a==0)p="new";else p=($2-a[$1])/(a[$1]==0?1:a[$1])*100;print $1,p}' file1 file2
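The answer above does not report the units that disappeared between reports (D, which the updated requirement marks as "cleared"). A sketch of one way to add that, using a second `seen` array and an END block (file names `old.csv`/`new.csv` are my own, not from the question):

```shell
# old.csv = first report, new.csv = report an hour later (headers stripped)
printf 'A 0\nB 10\nC 10\nD 0\nE 0\nF 1615\nG 0\n'        > old.csv
printf 'A 0\nB 15\nC 5\nAG 5\nE 0\nF 1615\nG 0\nJK 10\n' > new.csv

awk 'NR==FNR { a[$1]=$2; next }       # hash the old report: unit -> profit
     { seen[$1]                        # remember every unit in the new report
       if (!($1 in a)) p = "new"
       else p = ($2 - a[$1]) / (a[$1]==0 ? 1 : a[$1]) * 100
       print $1, p }
     END { for (k in a)                # anything hashed but never seen was removed
             if (!(k in seen)) print k, "cleared" }' old.csv new.csv
```

The "cleared" lines come out after the main listing, in whatever order the awk implementation iterates the array.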

diagonal value in co-occurrence matrix

I am such a newbie, and thank you so much in advance for the advice.
I want to make a co-occurrence matrix, and followed the link below:
How to use R to create a word co-occurrence matrix
but I cannot understand why the value of A-A is 10 in the matrix below.
It should be 4, shouldn't it? Because there are four As.
dat <- read.table(text='film tag1 tag2 tag3
1 A A A
2 A C F
3 B D C ', header=T)
library(qdapTools) # provides mtabulate()
crossprod(as.matrix(mtabulate(as.data.frame(t(dat[, -1])))))
( ) A C F B D
A 10 1 1 0 0
C 1 2 1 1 1
F 1 1 1 0 0
B 0 1 0 1 1
D 0 1 0 1 1
The solution you use presumes each tag appears at most once per film, which jibes with the usual definition of a co-occurrence matrix as far as I can tell. Each A on the first line gets counted as co-occurring with itself and with the other two As (3 × 3 = 9 pairs), and the single A on the second line co-occurs with itself (1 × 1 = 1), for a total of ten co-occurrences.
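That arithmetic can be checked without R; a small awk sketch that counts A per film and sums the squares (the per-film counts are 3, 1, 0):

```shell
# Diagonal A-A = sum over films of (count of "A" in that film)^2
printf '1 A A A\n2 A C F\n3 B D C\n' |
awk '{ n=0; for (i=2; i<=NF; i++) if ($i=="A") n++; total += n*n }
     END { print total }'
```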

gsub many columns simultaneously based on different gsub conditions?

I have a file with the following data-
Input-
A B C D E F
A B B B B B
C A C D E F
A B D E F A
A A A A A F
A B C B B B
If any cell in rows 2 onward has the same letter as the corresponding cell in row 1, it should be changed to 1. Basically, I'm trying to find out how similar each of the rows is to the first row.
Desired Output-
1 1 1 1 1 1
1 1 B B B B
C A 1 1 1 1
1 1 D E F A
1 A A A A 1
1 1 1 B B B
The first row has become all 1 since it is identical to itself (obviously). In the second row, the first and second columns are identical to the first row (A B) and hence they become 1 1. And so on for the other rows.
I have written the following code which does this transformation-
for seq in {1..1} ; #Iterate over the rows (in this case just row 1)
do
for position in {1..6} ; #Iterate over the columns
do
#Define the letter in the first row with which I'm comparing the rest of the rows
aa=$(awk -v pos=$position -v line=$seq 'NR == line {print $pos}' f)
#If it matches, gsub it to 1
awk -v var=$aa -v pos=$position '{gsub (var, "1", $pos)} 1' f > temp
#Save this intermediate file and now act on this
mv temp f
done
done
As you can imagine, this is really slow because that nested loop is expensive. My real data is a 60x10000 matrix and it takes about 2 hours for this program to run on that.
I was hoping you could help me get rid of the inner loop so that I can do all 6 gsubs in a single step. Maybe putting them in an array of their own? My awk skills aren't that great yet.
You can use this simpler awk command to do the job; it completes faster because it avoids both the nested loops in the shell and the repeated awk invocations inside them:
awk '{for (i=1; i<=NF; i++) {if (NR==1) a[i]=$i; if (a[i]==$i) $i=1} } 1' file
1 1 1 1 1 1
1 1 B B B B
C A 1 1 1 1
1 1 D E F A
1 A A A A 1
1 1 1 B B B
EDIT:
As per the comments below here is what you can do to get the sum of each column in each row:
awk '{sum=0; for (i=1; i<=NF; i++) { if (NR==1) a[i]=$i; if (a[i]==$i) $i=1; sum+=$i}
print $0, sum}' file
1 1 1 1 1 1 6
1 1 B B B B 2
C A 1 1 1 1 4
1 1 D E F A 2
1 A A A A 1 2
1 1 1 B B B 3
Input
$ cat f
A B C D E F
A B B B B B
C A C D E F
A B D E F A
A A A A A F
A B C B B B
Output
$ awk 'FNR==1{split($0,a)}{for(i=1;i<=NF;i++)if (a[i]==$i) $i=1}1' f
1 1 1 1 1 1
1 1 B B B B
C A 1 1 1 1
1 1 D E F A
1 A A A A 1
1 1 1 B B B
Explanation
FNR==1{ .. }
When awk reads first record of current file, do things inside braces
split(string, array [, fieldsep [, seps ] ])
Divide string into pieces separated by fieldsep and store the pieces
in array and the separator strings in the seps array.
split($0,a)
split current record or row ($0) into pieces by fieldsep (default space, as
we have not supplied the 3rd argument) and store the pieces in array a
So array a contains data from first row
a[1] = A
a[2] = B
a[3] = C
a[4] = D
a[5] = E
a[6] = F
for(i=1;i<=NF;i++)
Loop through all the fields of each record of the file, till end of file.
if (a[i]==$i) $i=1
if the first row's column value at the current index (i) is equal to the
current column value of the current row, set the current column value to 1 (meaning, modify the current column value).
Now that we have modified the column values, just print the modified row:
}1
1 always evaluates to true, so it performs the default action, {print $0}.
For the update requested in the comments:
Same question here, I have a second part of the program that adds up
the numbers in the rows. I.e. You would get 6, 2, 4, 2, 2, 3 for this
output. Can your program be tweaked to get these values out at this
step itself?
$ awk 'FNR==1{split($0,a)}{s=0;for(i=1;i<=NF;i++)if(a[i]==$i)s+=$i=1;print $0,s}' f
1 1 1 1 1 1 6
1 1 B B B B 2
C A 1 1 1 1 4
1 1 D E F A 2
1 A A A A 1 2
1 1 1 B B B 3

How to separate lines depending on the value in column 1

I have a text file that contains the following (a b c d etc... contains some random values):
1 a
1 b
2 c
2 d
2 e
2 f
6 g
6 h
6 i
12 j
12 k
Is there a way to separate the lines with some characters depending on the content of the first string, knowing that those numbers will always be increasing but may vary as well? The separation would go wherever the first string increments: from 1 to 2, then 2 to 6, etc...
The output would be like this (here I would like to use ---------- as the separator):
1 a
1 b
----------
2 c
2 d
2 e
2 f
----------
6 g
6 h
6 i
----------
12 j
12 k
awk 'NR>1 && old != $1 { print "----------" } { print; old = $1 }'
If it isn't the first line and the value in old isn't the same as in $1, print the separator. Then unconditionally print the current line, and record the value of $1 in old so that we remember for next time. Repeat until done.
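For instance, piping a slice of the sample through the command:

```shell
# Print a separator each time the first column changes value
printf '1 a\n1 b\n2 c\n2 d\n6 g\n' |
awk 'NR>1 && old != $1 { print "----------" } { print; old = $1 }'
```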