How to separate lines depending on the value in column 1 - shell

I have a text file that contains the following (a b c d etc... contains some random values):
1 a
1 b
2 c
2 d
2 e
2 f
6 g
6 h
6 i
12 j
12 k
Is there a way to separate lines with some characters depending on the content of the first string, knowing that those numbers will always be increasing, but may vary as well. The separation would be when first string is incrementing, going from 1 to 2, then 2 to 6 etc...
The output would be like this (here I would like to use ---------- as a separation):
1 a
1 b
----------
2 c
2 d
2 e
2 f
----------
6 g
6 h
6 i
----------
12 j
12 k

awk 'NR>1 && old != $1 { print "----------" } { print; old = $1 }'
If it isn't the first line and the value in old isn't the same as in $1, print the separator. Then unconditionally print the current line, and record the value of $1 in old so that we remember for next time. Repeat until done.

Related

Stata counting observations across multiple columns into one table

I need a table of 3 columns:
Column 1 counts how often the letter a,b,c,d are prevalent in the variable 1 to 3 across all rows
Column 2 counts how often the number a,b,c,d are prevalent in the variable 4 to 6 across all rows
Column 3 subtracts 2 from 1
The data looks like this:
observation
var 1
var 2
var 3
var 4
var 5
var 6
1
a
b
d
c
a
b
2
b
c
d
b
a
d
3
b
d
a
c
d
a
The table should look something like this:
Column 1 (var1-3)
Column 2 (var4-6)
Column 3)
a
2
3
-1
b
3
2
1
c
1
2
-1
d
3
2
1
I am using Stata and I have no idea where to start. I have tried with tabulate, tab1, table but none of it seems to suit my needs.
There are likely to be many other ways to do this.
* Example generated by -dataex-. For more info, type help dataex
clear
input byte observation str1(var1 var2 var3 var4 var5 var6)
1 "a" "b" "d" "c" "a" "b"
2 "b" "c" "d" "b" "a" "d"
3 "b" "d" "a" "c" "d" "a"
end
rename (var4-var6) (war#), addnumber
reshape long var war, i(obs) j(which)
rename (var war) (value=)
reshape long value, i(obs which) j(group) string
contract group value
reshape wide _freq, i(value) j(group) string
char _freqvar[varname] "var1-var3"
char _freqwar[varname] "var4-var6"
gen difference = _freqvar - _freqwar
list, subvarname abbrev(10) noobs
+--------------------------------------------+
| value var1-var3 var4-var6 difference |
|--------------------------------------------|
| a 2 3 -1 |
| b 3 2 1 |
| c 1 2 -1 |
| d 3 2 1 |
+--------------------------------------------+

Comparision of 2 csv files having same column names with different data

I am having two CSV files each having 2 columns with same column name. 1.csv has generated first and 2.csv has generated after 1 hour. S
o I want to see the Profit % increament and decrement for each Business unit comparing to last report. for example: Business unit B has increment of 50%(((15-10)/10)*100).
However for C it has decrease of 50%. Some new business unit(AG & JK) is also added in new hour report which can be considered only for new one. However few businees unit(D) also removed from next hour which can be considered not required.
So basically i need how can i compare and extract this data.
Busines Profit %
A 0
B 10
C 10
D 0
E 0
F 1615
G 0
Busines profit %
A 0
B 15
C 5
AG 5
E 0
F 1615
G 0
JK 10
updated requirement:
Business Profits% Old profit % new Variation
A 0 0 0
B 10 15 50%
C 10 5 -50%
D 0 cleared
AG 5 New
E 0 0 0
F 1615 1615 0%
G 0 0 0%
JK 10 New
I'd use awk for the job, something like this:
$ awk 'NR==FNR{ # process file2
a[$1]=$2 # hash second column, key is the first column
next # process the next record of file2
}
{ # process file1
if($1 in a==0) # if company not found in hash a
p="new" # it must be new
else
p=($2-a[$1])/(a[$1]==0?1:a[$1])*100 # otherwise calculate p%
print $1,p # output company and p%
}' file1 file2
A 0
B 50
C -50
AG new
E 0
F 0
G 0
JK new
One-liner version with appropriate semicolons:
$ awk 'NR==FNR{a[$1]=$2;next}{if($1 in a==0)p="new";else p=($2-a[$1])/(a[$1]==0?1:a[$1])*100;print $1,p}' file1 file2

gsub many columns simultaneously based on different gsub conditions?

I have a file with the following data-
Input-
A B C D E F
A B B B B B
C A C D E F
A B D E F A
A A A A A F
A B C B B B
If any of the other rows starting from row 2 have the same letter as row 1, they should be changed to 1. Basically, I'm trying to find out how similar any of the rows are to the first row.
Desired Output-
1 1 1 1 1 1
1 1 B B B B
C A 1 1 1 1
1 1 D E F A
1 A A A A 1
1 1 1 B B B
The first row has become all 1 since it is identical to itself (obviously). In the second row, the first and second columns are identical to the first row (A B) and hence they become 1 1. And so on for the other rows.
I have written the following code which does this transformation-
for seq in {1..1} ; #Iterate over the rows (in this case just row 1)
do
for position in {1..6} ; #Iterate over the columns
do
#Define the letter in the first row with which I'm comparing the rest of the rows
aa=$(awk -v pos=$position -v line=$seq 'NR == line {print $pos}' f)
#If it matches, gsub it to 1
awk -v var=$aa -v pos=$position '{gsub (var, "1", $pos)} 1' f > temp
#Save this intermediate file and now act on this
mv temp f
done
done
As you can imagine, this is really slow because that nested loop is expensive. My real data is a 60x10000 matrix and it takes about 2 hours for this program to run on that.
I was hoping you could help me get rid of the inner loop so that I can do all 6 gsubs in a single step. Maybe putting them in an array of their own? My awk skills aren't that great yet.
You can use this simpler awk command to do the job which will be faster to complete as we are avoiding nested loops in shell and also invoking awk repeatedly in nested loop:
awk '{for (i=1; i<=NF; i++) {if (NR==1) a[i]=$i; if (a[i]==$i) $i=1} } 1' file
1 1 1 1 1 1
1 1 B B B B
C A 1 1 1 1
1 1 D E F A
1 A A A A 1
1 1 1 B B B
EDIT:
As per the comments below here is what you can do to get the sum of each column in each row:
awk '{sum=0; for (i=1; i<=NF; i++) { if (NR==1) a[i]=$i; if (a[i]==$i) $i=1; sum+=$i}
print $0, sum}' file
1 1 1 1 1 1 6
1 1 B B B B 2
C A 1 1 1 1 4
1 1 D E F A 2
1 A A A A 1 2
1 1 1 B B B 3
Input
$ cat f
A B C D E F
A B B B B B
C A C D E F
A B D E F A
A A A A A F
A B C B B B
Desired o/p
$ awk 'FNR==1{split($0,a)}{for(i=1;i<=NF;i++)if (a[i]==$i) $i=1}1' f
1 1 1 1 1 1
1 1 B B B B
C A 1 1 1 1
1 1 D E F A
1 A A A A 1
1 1 1 B B B
Explanation
FNR==1{ .. }
When awk reads first record of current file, do things inside braces
split(string, array [, fieldsep [, seps ] ])
Divide string into pieces separated by fieldsep and store the pieces
in array and the separator strings in the seps array.
split($0,a)
split current record or row ($0) into pieces by fieldsep (defualt space, as
we have not supplied 3rd argument) and store the pieces in array a
So array a contains data from first row
a[1] = A
a[2] = B
a[3] = C
a[4] = D
a[5] = E
a[6] = F
for(i=1;i<=NF;i++)
Loop through all the fields of for each record of file till end of file.
if (a[i]==$i) $i=1
if first row's column value of current index (i) is equal to
current column value of current row set current column value = 1 ( meaning modify current column value )
Now we modified column value next just print modified row
}1
1 always evaluates to true, it performs default operation {print $0}
For update request on comment
Same question here, I have a second part of the program that adds up
the numbers in the rows. I.e. You would get 6, 2, 4, 2, 2, 3 for this
output. Can your program be tweaked to get these values out at this
step itself?
$ awk 'FNR==1{split($0,a)}{s=0;for(i=1;i<=NF;i++)if(a[i]==$i)s+=$i=1;print $0,s}' f
1 1 1 1 1 1 6
1 1 B B B B 2
C A 1 1 1 1 4
1 1 D E F A 2
1 A A A A 1 2
1 1 1 B B B 3

Find the closest values: Multiple columns conditions

Following my first question here I want to extend the condition of find the closest value from two different files of the first and second column, and print specific columns.
File1
1 2 3 4 a1
1 4 5 6 b1
8 5 9 11 c1
File 2
1 1 3 a
1 2 5 b
1 2.1 4 c
1 4 6 d
2 4 5 e
9 4 1 f
9 5 2 g
9 6 2 h
11 10 14 i
11 15 5 j
So for example I need to find the closest value from $1 in file 2 for each $1 in file 1 but then search the closest also for $2.
Output:
1 2 a1*
1 2 b*
1 4 b1
1 4 d
8 5 c1
9 5 g
* First column file 1 and 2nd column file 2 because for the 1st column (of file 1) the closest value (from the 1st column of file 2) is 1, and the 2nd condition is that also must be the closest value for the second column which is this case is 2. And I print $1,$2,$5 from file 1 and $1,$2,$4 from file 2
For the other output is the same procedure.
The solution to find the closest it is in my other post and was given by #Tensibai.
But any solution will work.
Thanks!
Sounds a little convoluted but works:
function closest(array,searched) {
distance=999999; # this should be higher than the max index to avoid returning null
split(searched,skeys,OFS)
# Get the first part of key
for (x in array) { # loop over the array to get its keys
split(x,mkeys,OFS) # split the array key
(mkeys[1]+0 > skeys[1]+0) ? tmp = mkeys[1] - skeys[1] : tmp = skeys[1] - mkeys[1] # +0 to compare integers, ternary operator to reduce code, compute the diff between the key and the target
if (tmp < distance) { # if the distance if less than preceding, update
distance = tmp
found1 = mkeys[1] # and save the key actually found closest
}
}
# At this point we have the first part of key found, let's redo the work for the second part
distance=999999;
for (x in array) {
split(x,mkeys,OFS)
if (mkeys[1] == found1) { # Filter on the first part of key
(mkeys[2]+0 > skeys[2]+0) ? tmp = mkeys[2] - skeys[2] : tmp = skeys[2] - mkeys[2] # +0 to compare integers, ternary operator to reduce code, compute the diff between the key and the target
if (tmp < distance) { # if the distance if less than preceding, update
distance = tmp
found2 = mkeys[2] # and save the key actually found closest
}
}
}
# Now we got the second field, woot
return (found1 OFS found2) # return the combined key from out two search
}
{
if (NR>FNR) { # If we changed file (File Number Record is less than Number Record) change array
b[($1 OFS $2)] = $4 # make a array with "$1 $2" as key and $4 as value
} else {
key = ($1 OFS $2) # Make the key to avoid too much computation accessing it later
akeys[max++] = key # store the array keys to ensure order at end as for (x in array) does not guarantee the order
a[key] = $5 # make an array with the key stored previously and $5 as value
}
}
END { # Now we ended parsing the two files, print the result
for (i in akeys) { # loop over the array of keys which has a numeric index, keeping order
print akeys[i],a[akeys[i]] # print the value for the first array (key then value)
if (akeys[i] in b) { # if the same key exist in second file
print akeys[i],b[akeys[i]] # then print it
} else {
bindex = closest(b,akeys[i]) # call the function to find the closest key from second file
print bindex,b[bindex] # print what we found
}
}
}
Note I'm using OFS to combine the fields so if you change it for output it will behave properly.
WARNING: This should do with relative short files, but as now the array from second file is traversed twice it will be twice long for each searchEND OF WARNING
There's place for a better search algorithm if your files are sorted (but it was not the case on previous question and you wished to keep the order from the file). First improvement in this case, break the for loop when distance start to be greater than preceding one.
Output from your sample files:
$ mawk -f closest2.awk f1 f2
1 2 a1
1 2 b
1 4 b1
1 4 d
8 5 c1
9 5 g

Unix / Shell Add a range of columns to file

So I've been trying the same problem for the last few days, and I'm at a formatting road block.
I have a program that will only run if its working on an equal number of columns. I know the total column count, and the number needed to add with a filler value of 0, but am not sure how to do this. Is there some time of range option with awk or sed for this?
Input:
A B C D E
A B C D E 1 1 1 1
Output:
A B C D E 0 0 0 0
A B C D E 1 1 1 1
The the alphabet columns are always present (with different values), but this "fill in the blank" function is eluding me. I can't use R for this due to data file size.
One way using awk:
$ awk 'NF!=n{for(i=NF+1;i<=n;i++)$i=0}1' n=9 file
A B C D E 0 0 0 0
A B C D E 1 1 1 1
Just set n to the number of columns you want to pad upto.

Resources