Bash - adding values in rows based on a column

The 2nd column in my csv file has duplicates. I want to sum the column 1 values that share the same column 2 key.
Example csv:
56, cc=DK
49, cc=US
34, cc=GB
32, cc=DE
32, cc=NZ
31, cc=DK
31, cc=GB
31, cc=GB
Example result:
96, cc=GB # where 96 = 34+31+31
87, cc=DK # where 87 = 56+31
32, cc=DE
32, cc=NZ

You can use associative arrays in awk:
awk '{s[$2]+=$1}END{for(k in s)print s[k]", "k}' inFile
Expanding on that for readability, and using sum/key rather than s/k:
{                                   # Do for each line.
    sum[$2] += $1                   # Add first field to accumulator,
                                    #   indexed by second field.
                                    #   Initial value is zero.
}
END {                               # Do this bit when whole file processed.
    for (key in sum)                # For each key like cc=US:
        print sum[key] ", " key     #   output the sum and key.
}
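If you save that expanded version to a file (say sum.awk, a name chosen here just for illustration), you can run it the same way as the one-liner:
awk -f sum.awk inFile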
Here's a sample run on my box:
pax$ echo;echo '56, cc=DK
49, cc=US
34, cc=GB
32, cc=DE
32, cc=NZ
31, cc=DK
31, cc=GB
31, cc=GB' | awk '{s[$2]+=$1}END{for(k in s)print s[k]", "k}'
32, cc=DE
96, cc=GB
32, cc=NZ
49, cc=US
87, cc=DK
This works despite the fact that the first column is of the form 999, (note the trailing comma), because awk, when evaluating a string in a numeric context, uses only the leading prefix that is valid in that context. Hence 45xyzzy would become 45 and, more importantly, 49, becomes 49.
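You can see that coercion directly with a throwaway one-liner (illustration only, not part of the solution):
$ echo '49, cc=US' | awk '{print $1 + 0}'
49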

Perl solution:
perl -ane '$h{ $F[1] } += $F[0] }{ print "$h{$_}\t$_\n" for keys %h' input.csv
Explanation:
-n processes the input line by line
-a splits the input line on whitespace into fields in the @F array
the hash table %h records the sum for each key (2nd column). It just adds the value of the first column to it.
}{ (called "Eskimo greeting") separates what's executed for each line (-n) from the code to be run after the whole input was processed

It's ok to use awk for such a simple task, but if you have a bunch of similar tasks and may need to change them in the future, it's easy to mess something up.
Since this is a typical database problem, consider using sqlite.
You can:
add a header row with column names and remove extra whitespace:
$ cat <(echo "num, name") originalInput.txt | tr -d ' ' > input.csv
import the data into a temporary sqlite db:
$ sqlite3 --batch temp.db <<EOF!
.mode csv
.import input.csv input
EOF!
select from db:
$ sqlite3 temp.db 'SELECT sum(num), name FROM input GROUP BY name'
32|cc=DE
87|cc=DK
96|cc=GB
32|cc=NZ
49|cc=US
It is slightly more code and uses the external sqlite3 command, but it's significantly less error prone and more flexible. You can easily join several csv files, use fancy sorting, and more.
Also, imagine yourself looking at the code six months later and trying to quickly understand what it does.
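Putting the three steps together as a single script (a sketch only; originalInput.txt and temp.db are the names used above):
#!/bin/bash
# Build a header, strip blanks, import into sqlite, then aggregate.
cat <(echo "num, name") originalInput.txt | tr -d ' ' > input.csv
sqlite3 --batch temp.db <<'EOF'
.mode csv
.import input.csv input
EOF
sqlite3 temp.db 'SELECT sum(num), name FROM input GROUP BY name'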

Related

How can I add one of two random numbers to each row of a csv proportionately?

I have a csv that contains 100 rows by three columns of random numbers:
100, 20, 30
746, 82, 928
387, 12, 287.3
12, 47, 2938
125, 198, 263
...
12, 2736, 14
In bash, I need to add another column that will be either a 0 or a 1. However (and here is the hard part), I need 20% of the rows to get a 0 and 80% to get a 1.
Result:
100, 20, 30, 0
746, 82, 928, 1
387, 12, 287.3, 1
12, 47, 2938, 1
125, 198, 263, 0
...
12, 2736, 14, 1
What I have tried:
sed '1~3s/$/0/' mycsv.csv
but I thought I could replace the 1~3 with a random number, and that doesn't work.
Maybe a loop would? Maybe sed or awk?
Using awk and rand() to get random 0s and 1s, with a 20% probability of getting a 0:
$ awk 'BEGIN{OFS=", ";srand()}{print $0,(rand()>0.2)}' file
Output:
100, 20, 30, 1
746, 82, 928, 1
387, 12, 287.3, 1
12, 47, 2938, 0
125, 198, 263, 1
..., 0
12, 2736, 14, 1
Explained:
$ awk '
BEGIN {
    OFS=", "              # set output field separator
    srand()               # time-based seed for rand()
}
{
    print $0,(rand()>0.2) # output 0/1 ~ 20/80
}' file
Since srand() by itself is seeded from the time in seconds, depending on your needs you might want to supply an external seed, for example from Bash:
$ awk -v seed=$RANDOM 'BEGIN{srand(seed)}...'
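For example, combining that external seed with the one-liner above:
$ awk -v seed=$RANDOM 'BEGIN{OFS=", "; srand(seed)}{print $0,(rand()>0.2)}' file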
Update: a version that first counts the lines in the file, calculates how many should be 0s (20%), and then randomly picks a 0 or a 1 while keeping count:
$ awk -v seed=$RANDOM '
BEGIN {
    srand(seed)                                # feed the seed to rand()
}
NR==1 {                                        # while processing the first record
    while((getline line < FILENAME)>0)         # count the lines in the file
        nr++                                   # nr stores the count
    for(i=1;i<=nr;i++)                         # produce
        a[(i>0.2*nr)]++                        # 20 % 0s, 80 % 1s
}
{
    p=a[0]/(a[0]+a[1])                         # probability to pick 0 or 1
    print $0 ", " (a[v=(rand()>p)]?v:v=(!v))   # print record and 0 or 1
    a[v]--                                     # remove the used 0 or 1 from the pool
}' file
Another way to do it is the following:
Create a sequence of 0 and 1's with the correct ratio:
$ awk 'END{for(i=1;i<=FNR;++i) print (i <= 0.8*FNR) }' file
Shuffle the output to randomize it:
$ awk 'END{for(i=1;i<=FNR;++i) print (i <= 0.8*FNR) }' file | shuf
Paste it next to the file with a <comma>-character as delimiter:
$ paste -d, file <(awk 'END{for(i=1;i<=FNR;++i) print (i <= 0.8*FNR) }' file | shuf)
The reason I do not want to use any form of random number generator is that it could lead to 100% ones or 100% zeros, or anything of that nature. The above produces the closest possible split of 80% ones and 20% zeros.
Another method would be a double parse with awk in the following way:
$ awk '(NR==FNR) { next }
(FNR==1) { for(i=1;i<NR;i++) a[i] = (i<0.8*(NR-1)) }
{ for(i in a) { print $0","a[i]; delete a[i]; break } }' file file
The above makes use of the fact that for(i in a) cycles through the array in an undetermined order. You can see this by quickly doing
$ awk 'BEGIN{ORS=","; for(i=1;i<=20;++i) a[i]; for(i in a) print i; printf "\n"}'
17,4,18,5,19,6,7,8,9,10,20,11,12,13,14,1,15,2,16,3,
But this is implementation dependent.
Finally, you could actually use shuf from within awk to get to the desired result:
$ awk '(NR==FNR) { next }
(FNR==1) { cmd = "shuf -i 1-"(NR-1) }
{ cmd | getline i; print $0","(i <= 0.8*(NR-FNR)) }' file file
This seems to be more a problem of algorithm design than of programming. You state in your question: "I need to have 20% of the rows with 0s, and 80% with 1s." So the first question is what to do if the number of rows is not a multiple of 5. If you have 112 rows in total, 20% would be 22.4 rows, and that does not make sense.
Assuming that you can redefine your task to deal with that case, the simplest solution would be to assign a 0 to the first 20% of the rows and a 1 to the remaining ones.
But say that you want some randomness in the distribution of the 0s and 1s. One quick-and-dirty solution would be to create an array consisting of all the zeroes and ones you are going to emit in total, and in each iteration take a random element from this array (and remove it from the array).
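A rough shell sketch of that idea (filenames are my own; building the 0/1 pool up front, shuffling it, and pasting it on amounts to drawing without replacement):
n=$(wc -l < mycsv.csv)    # total number of rows
zeros=$(( n / 5 ))        # roughly 20% of them get a 0
ones=$(( n - zeros ))     # the rest get a 1
paste -d, mycsv.csv <( { yes 0 | head -n "$zeros"; yes 1 | head -n "$ones"; } | shuf )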
Adding to the previous reply, here is a Python 3 way to do this:
#!/usr/local/bin/python3
import csv
import math
import random

totalOflines = len(open('columns.csv').readlines())
newColumn = ([0] * math.ceil(totalOflines * 0.20)) + ([1] * math.ceil(totalOflines * 0.80))
random.shuffle(newColumn)

csvr = csv.reader(open('columns.csv'), delimiter=",")
i = 0
for row in csvr:
    print("{},{},{},{}".format(row[0], row[1], row[2], newColumn[i]))
    i += 1
Regards!

Adding numbers in text file using awk and Bash

I need to take all numbers that appear within a book index and add 22 to them. The index data looks like this (for example):
Ubuntu, 120, 143, 154
Yggdrasil, 144, 170-171
Yood, Charles, 6
Young, Bob, 178-179
Zawinski, Jamie, 204
I am trying to do this with awk using this script:
#!/bin/bash
filename="index"
while read -r line
do
echo $line | awk -v n=22 '{printf($1)}{printf(" " )}{for(i=2;i<=NF;i++)printf(i%2?$i+n:$i+n)", "};{print FS}'
done < "$filename"
It comes close to working but has the following problems:
It doesn't work for page numbers that are part of a range (e.g., "170-171") rather than individual numbers.
For entries where the index term is more than one word (e.g., "X Windows" and "Young, Bob") the output displays only the first word in the term, and the second word ends up being output as the number 22. (I know why this is happening -- my awk command treats $2 as a number, and if it's a string it assumes it has a value of 0 -- but I can't figure out how to solve it.)
Disclosure: I'm by no means an awk expert. I'm just looking for a quick way to modify the page numbers in my index (which is due in a few days) because my publisher decided to change the pagination in the manuscript after I had already prepared the index. awk seems like the best tool for the job to me, but I'm open to other suggestions if someone has a better idea. Basically, I just need a way to say "take all numbers in this file and add 22 to them; don't change anything else."
With GNU awk for multi-char RS and RT:
$ awk -v RS='[0-9]+' '{ORS=(RT=="" ? "" : RT+22)}1' file
Ubuntu, 142, 165, 176
Yggdrasil, 166, 192-193
Yood, Charles, 28
Young, Bob, 200-201
Zawinski, Jamie, 226
With Perl, for example:
perl -plE 's/\b(\d+)\b/$1+22/ge' index
Output:
Ubuntu, 142, 165, 176
Yggdrasil, 166, 192-193
Yood, Charles, 28
Young, Bob, 200-201
Zawinski, Jamie, 226
but it isn't awk
You can use this gnu awk command:
awk 'BEGIN {FS="\f";RS="(, |-|\n)";} /^[0-9]+$/ {$1 = $1 +22} { printf("%s%s", $1, RT);}' yourfile
there is a bit of abuse of FS and RS here to get awk to handle each token in each line as a record of its own, so you don't have to loop over the fields and test whether each field is numeric
RS="(, |-|\n)" configures dash, newline and ", " as record separators
on "records" consisting only of digits: 22 is added
the printf prints the token together with its RT to reconstruct the line from the file
Consider using the following awk script (add_number.awk):
BEGIN { FS=OFS=", "; if (!n) n=22 }     # if the `n` variable hasn't been passed, the default is 22
{
    for (i=1; i<=NF; i++) {             # traversing fields
        if ($i ~ /^[0-9]+$/) {          # if a field contains a single number
            $i += n
        }
        else if (match($i, /^([0-9]+)-([0-9]+)$/, arr)) {  # if a field contains a range of numbers
            $i = arr[1]+n "-" arr[2]+n
        }
    }
    print
}
Usage:
awk -v n=22 -f add_number.awk testfile
The output:
Ubuntu, 142, 165, 176
Yggdrasil, 166, 192-193
Yood, Charles, 28
Young, Bob, 200-201
Zawinski, Jamie, 226

Deleting specific data from a csv

I have a long csv file with 5 columns, but 3 lines have 6 columns. One begins with "tomasluck", another with "peterblack" and the last one with "susanpeeters". In these 3 lines I need to delete the fourth element (column) so that they end up with only 5 columns.
Here is a short example; my file is long and is created automatically.
petergreat, 15, 11-03-2015, 10, 10
tomasluck, 15, 10-03-2015, tl, 10, 10
anaperez, 14, 11-03-2015, 10, 11
and I need
petergreat, 15, 11-03-2015, 10, 10
tomasluck, 15, 10-03-2015, 10, 10
anaperez, 14, 11-03-2015, 10, 11
Exactly; I was thinking of code that selects the lines that begin with tomasluck, peterblack and susanpeeters, and then deletes the 4th field or column.
The tricky thing about this is to keep the formatting intact. The simplest way, I think, is to treat the input as plain text and use sed:
sed '/^tomasluck,/ s/,[^,]*//3' file.csv
This removes, in a line that begins with tomasluck,, the third occurrence of a comma followed by a field (non-comma characters). The filter regex can be amended to include other first fields, such as
sed '/^\(tomasluck\|petergreat\|anaperez\),/ s/,[^,]*//3' file.csv
...but in your input data, those lines don't appear to have a sixth field.
Further ideas that may or may not pertain to your use case:
Removing the fourth field on the basis of the number of fields is a little trickier in sed, largely because sed does not have arithmetic functionality and identifying the lines is a bit tedious:
sed 'h; s/[^,]//g; /.\{5\}/ { x; s/,[^,]*//3; x; }; x' file.csv
That is:
h # copy the line to the hold buffer
s/[^,]//g # remove all non-comma characters
/.\{5\}/ { # if five characters remain (if the line has six or more
# fields)
x # exchange pattern space and hold buffer
s/,[^,]*//3 # remove field
x # swap back again
}
x # finally, swap in the actual data before printing.
The x dance is typical of sed scripts that use the hold buffer; the goal is to make sure that, regardless of whether the substitution takes place, in the end the line (and not the isolated commas) is printed.
Mind you, if you want the selection condition to be that a line has six or more fields, it is worth considering awk, where the condition is easier to formulate but the replacement of the field is more tedious:
awk -F , 'BEGIN { OFS = FS } NF > 5 { for(i = 5; i <= NF; ++i) { $(i - 1) = $i }; --NF; $1 = $1 } 1' file.csv
That is: Split line at commas (-F ,), then
BEGIN { OFS = FS } # output field separator is input FS
NF > 5 { # if there are more than five fields
for(i = 5; i <= NF; ++i) { # shift them back one, starting at the fifth
$(i - 1) = $i
}
--NF # let awk know that there is one less field
$1 = $1 # for BSD awk: force rebuilding of the line
}
1 # whether or not a transformation happened, print.
This should work for most awks; I have tested it with gawk and mawk. However, because nothing is ever easy to do portably, I am told that there is at least one awk out there (on old Solaris, I believe) that doesn't understand the --NF trick. It would be possible to hack something together with sprintf for that, but it's enough of a corner case that I don't expect it to bite you.
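If you do run into such an awk, a minimal portable sketch (my assumption: simply rebuild the record while skipping the fourth field, rather than decrementing NF):
awk -F', ' 'BEGIN { OFS = ", " }
NF > 5 { rec = $1; for (i = 2; i <= NF; i++) if (i != 4) rec = rec OFS $i; $0 = rec }
1' file.csv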
A more generic solution is to check whether we have 5 or 6 fields:
awk -F', ' '{if(NF==6) print $1", "$2", "$3", "$5", "$6; else print $0}' file.csv
You could do this through sed, using a capturing-group-based regex.
$ sed 's/^\(\(tomasluck\|peterblack\|susanpeeters\),[^,]*,[^,]*\),[^,]*/\1/' file
petergreat, 15, 11-03-2015, 10, 10
tomasluck, 15, 10-03-2015, 10, 10
anaperez, 14, 11-03-2015, 10, 11
This captures all the characters up to the third column and matches the fourth column. Replacing the matched characters with the characters inside group 1 gives you the desired output.

Using cat and while read line inside awk

I am getting a syntax error for using cat and while read line inside awk.
Sample code:
awk '{
if( condition )
{
array[FNR]=$1;
cat file1.json | while read LINE; do
print LINE
done;
}
fi
}' /home/user/spfile.txt
My json file:
{
"Section_A": {
"ws/abc-Location01": 24,
"ws/abc-Location02": 67,
"ws/abc-Location03: 101,
},
"Section_B": {
"ws/abc-Location01": 33,
"ws/abc-Location02": 59,
"ws/abc-Location03: 92,
"ws/abc-Location42: 92,
}
}
My array contains locations of various partitions, like below:
array[15742] is nsg -> /ws/abc-Location42/uname/builds_nsg
array[15744] is bfr -> /ws/abc-Location63/uname/builds_bfr
array[15746] is pre -> /ws/abc-Location67/uname/builds_pre
array[15748] is sfjk -> /ws/abc-Location67/uname/builds_sfjk
File2.txt
abc5-blah30a:/vol/local13/abc-Location67
1000 598
abc5-blah30a:/vol/local14/abc-Location68
1000 186
abc5-blah30a:/vol/local14/abc-Location01
1000 256
abc5-blah30a:/vol/local14/abc-Location02
1000 15
abc5-blah30a:/vol/local14/abc-Location03
1000 765
What I'm trying to do:
I need to change only Section B in my json file, and skip all other sections.
I need to check the locations of the partitions in Section B, and for all matches with the array, the numeric value on the right-hand side shouldn't be changed.
For all non-matches, the numeric value on the right-hand side needs to be changed to the corresponding value from another file, file2.txt.
Example
There is a match for Location42 in my json file against the array, so I do NOT change it.
But there is no match against the array for Location01,02,03 in the json file.
So I need to look up the corresponding values for these 3 locations in file2.txt.
And I need to change them to 256, 15, 765.
RTFM.
awk is a powerful tool and can do many things (even if, as chepner said, python, perl or ruby might be better suited to your problem), but it is not a magic tool that you can use without learning it.
You simply cannot use any shell construct within an awk script.
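If the goal is just to read file1.json line by line from inside the awk program, a minimal sketch of the awk-native equivalent of that shell loop (the condition placeholder is taken from the question):
awk '{
    if (condition) {                                  # placeholder condition from the question
        array[FNR] = $1
        while ((getline line < "file1.json") > 0)     # read file1.json line by line
            print line
        close("file1.json")                           # allow it to be re-read for later input lines
    }
}' /home/user/spfile.txt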

Finding gaps in sequential numbers

I don’t do this stuff for a living so forgive me if it’s a simple question (or more complicated than I think). I’ve been digging through the archives and found a lot of tips that are close, but being a novice I’m not sure how to tweak them for my needs, or they are way beyond my understanding.
I have some large data files that I can parse out to generate a list of coordinates that are mostly sequential:
5
6
7
8
15
16
17
25
26
27
What I want is a list of the gaps
1-4
9-14
18-24
I don’t know perl, SQL or anything fancy but thought I might be able to do something that would subtract one number from the next. I could then at least grep the output where the difference was not 1 or -1 and work with that to get the gaps.
With awk:
awk '$1!=p+1{print p+1"-"$1-1}{p=$1}' file.txt
Explanations:
$1 is the first column of the current input line
p holds the value of the first column from the previous line
so ($1!=p+1) is a condition: if $1 is different from the previous value plus 1, then
this part is executed: {print p+1 "-" $1-1}: print the previous value plus 1, the - character, and the current first column minus 1
{p=$1} is executed for each line: the current 1st column is assigned to p
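For example, on the sample data above this prints exactly the requested gaps:
$ awk '$1!=p+1{print p+1"-"$1-1}{p=$1}' file.txt
1-4
9-14
18-24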
Interesting question.
sputnick's awk one-liner is nice; I cannot write a simpler one. I'll just add another way, using diff:
seq $(tail -1 file)|diff - file|grep -Po '.*(?=d)'
the output with your example would be:
1,4
9,14
18,24
I know there is a comma in it instead of -; you could replace the grep with sed to get - (grep cannot change the input text), but the idea is the same.
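A possible sed variant of that last stage (a sketch; it keeps only the deletion hunks of the diff output and swaps the comma for a hyphen):
$ seq $(tail -1 file) | diff - file | sed -n '/d/{s/d.*//; s/,/-/; p;}'
1-4
9-14
18-24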
hope it helps.
A Ruby Answer
Perhaps someone else can give you the Bash or Awk solution you asked for. However, I think any shell-based answer is likely to be extremely localized for your data set, and not very extendable. Solving the problem in Ruby is fairly simple, and provides you with flexible formatting and more options for manipulating the data set in other ways down the road. YMMV.
#!/usr/bin/env ruby

# You could read from a file if you prefer,
# but this is your provided corpus.
nums = [5, 6, 7, 8, 15, 16, 17, 25, 26, 27]

# Find gaps between zero and first digit.
nums.unshift 0

# Create array of arrays containing missing digits.
missing_nums = nums.each_cons(2).map do |array|
  (array.first.succ...array.last).to_a unless
    array.first.succ == array.last
end.compact
# => [[1, 2, 3, 4], [9, 10, 11, 12, 13, 14], [18, 19, 20, 21, 22, 23, 24]]

# Format the results any way you want.
puts missing_nums.map { |ary| "#{ary.first}-#{ary.last}" }
Given your current corpus, this yields the following on standard output:
1-4
9-14
18-24
Just remember the previous number and verify that the current one is the previous plus one:
#! /bin/bash
previous=0
while read n ; do
    if (( n != previous + 1 )) ; then
        echo $(( previous + 1 ))-$(( n - 1 ))
    fi
    previous=$n
done
You might need to add some checking to prevent lines like 28-28 for single-number gaps; one possible tweak is sketched below.
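A hedged tweak for that case, replacing the if in the loop above (assuming a lone missing number should be printed on its own rather than as 9-9):
if (( n != previous + 1 )) ; then
    if (( n - 1 == previous + 1 )) ; then
        echo $(( previous + 1 ))                # single-number gap
    else
        echo $(( previous + 1 ))-$(( n - 1 ))   # multi-number gap
    fi
fi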
Perl solution similar to awk solution from StardustOne:
perl -ane 'if ($F[0] != $p+1) {printf "%d-%d\n",$p+1,$F[0]-1}; $p=$F[0]' file.txt
These command-line options are used:
-n loop around every line of the input file, do not automatically print every line
-a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace. Fields are indexed starting with 0.
-e execute the perl code
Given the input file, use the numinterval utility and paste its output beside the file, then munge it with tr, xargs, sed and printf:
gaps() { paste <(echo; numinterval "$1" | tr 1 '-' | tr -d '[02-9]') "$1" |
tr -d '[:blank:]' | xargs echo |
sed 's/ -/-/g;s/-[^ ]*-/-/g' | xargs printf "%s\n" ; }
Output of gaps file:
5-8
15-17
25-27
How it works. The output of paste <(echo; numinterval file) file looks like:
5
1 6
1 7
1 8
7 15
1 16
1 17
8 25
1 26
1 27
From there we mainly replace things in column #1, and tweak the spacing. The 1s are replaced with -s, and the higher numbers are blanked. Remove some blanks with tr. Replace runs of hyphens like "5-6-7-8" with a single hyphen "5-8", and that's the output.
This one lists the numbers that break the sequence in a list.
Idea taken from @choroba, but done with a for loop.
#! /bin/bash
previous=0
n=$( cat listaNums.txt )
for number in $n
do
    numListed=$(($number - 1))
    if [ $numListed != $previous ] && [ $number != 2147483647 ]; then
        echo $numListed
    fi
    previous=$number
done
