I have a csv that contains 100 rows by three columns of random numbers:
100, 20, 30
746, 82, 928
387, 12, 287.3
12, 47, 2938
125, 198, 263
...
12, 2736, 14
In bash, I need to add another column that will be either a 0 or a 1. However (and here is the hard part), I need to have 20% of the rows with 0s and 80% with 1s.
Result:
100, 20, 30, 0
746, 82, 928, 1
387, 12, 287.3, 1
12, 47, 2938, 1
125, 198, 263, 0
...
12, 2736, 14, 1
What I have tried:
sed '1~3s/$/0/' mycsv.csv
but I thought I could replace the 1~3 with a random number, which doesn't work.
Maybe a loop would work? Maybe sed or awk?
Using awk and rand() to get random 0s and 1s with a 20% probability of getting a 0:
$ awk 'BEGIN{OFS=", ";srand()}{print $0,(rand()>0.2)}' file
Output:
100, 20, 30, 1
746, 82, 928, 1
387, 12, 287.3, 1
12, 47, 2938, 0
125, 198, 263, 1
..., 0
12, 2736, 14, 1
Explained:
$ awk '
BEGIN {
    OFS=", "                  # set output field separator
    srand()                   # time based seed for rand()
}
{
    print $0,(rand()>0.2)     # output 0/1 ~ 20/80
}' file
Since srand() by itself is seeded from the time in seconds, depending on your needs you might want to provide an external seed for it, for example from Bash:
$ awk -v seed=$RANDOM 'BEGIN{srand(seed)}...'
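For instance, combining that external seed with the one-liner above, a complete invocation might look like this (a sketch, nothing beyond the two snippets already shown; the variable name seed is just illustrative):
$ awk -v seed=$RANDOM 'BEGIN{OFS=", "; srand(seed)}{print $0,(rand()>0.2)}' file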
Update: A version that first counts the lines in the file, calculates how many of them should be 0s (20%), and then randomly picks a 0 or a 1 for each record while keeping count:
$ awk -v seed=$RANDOM '
BEGIN {
    srand(seed)                          # feed the seed to random
}
NR==1 {                                  # processing the first record
    while((getline line < FILENAME)>0)   # count the lines in the file
        nr++                             # nr stores the count
    for(i=1;i<=nr;i++)                   # produce
        a[(i>0.2*nr)]++                  # 20% 0s, 80% 1s
}
{
    p=a[0]/(a[0]+a[1])                   # probability to pick 0 or 1
    print $0 ", " (a[v=(rand()>p)]?v:v=(!v))  # print record and a 0 or 1
    a[v]--                               # remove that 0 or 1 from the pool
}' file
Another way to do it is the following:
Create a sequence of 0s and 1s with the correct ratio:
$ awk 'END{for(i=1;i<=FNR;++i) print (i <= 0.8*FNR) }' file
Shuffle the output to randomize it:
$ awk 'END{for(i=1;i<=FNR;++i) print (i <= 0.8*FNR) }' file | shuf
Paste it next to the file with a <comma>-character as delimiter:
$ paste -d, file <(awk 'END{for(i=1;i<=FNR;++i) print (i <= 0.8*FNR) }' file | shuf)
The reason I do not want to use any form of per-row random number generation is that it could lead to 100% ones or 100% zeros, or anything of that nature. The above produces the closest possible split of 80% ones and 20% zeros.
Another method would be a double parse with awk in the following way:
$ awk '(NR==FNR) { next }
(FNR==1) { for(i=1;i<NR;i++) a[i] = (i <= 0.8*(NR-1)) }
{ for(i in a) { print $0","a[i]; delete a[i]; break } }' file file
The above makes use of the fact that for(i in a) cycles through the array in an undetermined order. You can see this by quickly doing
$ awk 'BEGIN{ORS=","; for(i=1;i<=20;++i) a[i]; for(i in a) print i; printf "\n"}'
17,4,18,5,19,6,7,8,9,10,20,11,12,13,14,1,15,2,16,3,
But this is implementation dependent.
Finally, you could actually use shuf from within awk to get to the desired result:
$ awk '(NR==FNR) { next }
(FNR==1) { cmd = "shuf -i 1-"(NR-1) }
{ cmd | getline i; print $0","(i <= 0.8*(NR-FNR)) }' file file
This seems to be more a problem of algorithm design than of programming. You state in your question: "I need to have 20% of the rows with 0s, and 80% with 1s." So the first question is what to do if the number of rows is not a multiple of 5. If you have 112 rows in total, 20% would be 22.4 rows, which does not make sense.
Assuming that you can redefine your task to deal with that case, the simplest solution would be to assign a 0 to the first 20% of the rows and a 1 to the remaining ones.
But say you want some randomness in the distribution of the 0s and 1s. One quick-and-dirty solution would be to create an array holding exactly the zeroes and ones you are going to hand out in total, and in each iteration take a random element from this array (and remove it from the array), as sketched below.
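A rough Bash sketch of that idea might look like this (it assumes the input file is called mycsv.csv as in the question; the labels array name and the integer rounding of the 20% share are illustrative choices):
#!/bin/bash
# Build a pool of labels: 20% zeroes, 80% ones, then hand out
# one randomly chosen label per row and remove it from the pool.
file=mycsv.csv
total=$(wc -l < "$file")
zeros=$(( total / 5 ))                    # 20% of the rows (integer division)

labels=()
for ((i = 0; i < total; i++)); do
    if (( i < zeros )); then labels+=(0); else labels+=(1); fi
done

while IFS= read -r row; do
    idx=$(( RANDOM % ${#labels[@]} ))     # pick a random remaining label
    echo "$row, ${labels[idx]}"
    unset 'labels[idx]'                   # remove it from the pool
    labels=("${labels[@]}")               # reindex the array
done < "$file"
Reindexing the array after each unset keeps the random index calculation simple; for large files an awk- or shuf-based approach will be faster.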
Adding to the previous reply, here is a Python 3 way to do this:
#!/usr/local/bin/python3
import csv
import math
import random

# Count the lines, then build a shuffled column of 20% zeroes and 80% ones.
totalOflines = len(open('columns.csv').readlines())
newColumn = ([0] * math.ceil(totalOflines * 0.20)) + ([1] * math.ceil(totalOflines * 0.80))
random.shuffle(newColumn)

csvr = csv.reader(open('columns.csv'), delimiter=",")
i = 0
for row in csvr:
    print("{},{},{},{}".format(row[0], row[1], row[2], newColumn[i]))
    i += 1
Regards!
I'm very new to bash, and I have a log like this:
10, "a#gmail.com"
2, "b#gmail.com"
3333, "c#hotmail.com", "d#gmail.com"
4, "e#hotmail.com", "f#hotmail.com", "g#gmail.com"
55, "h#gmail.com"
I would like it to be:
10, "a#gmail.com"
2, "b#gmail.com"
3333, "c#hotmail.com"
3333, "d#gmail.com"
4, "e#hotmail.com"
4, "f#hotmail.com"
4, "g#gmail.com"
55, "h#gmail.com"
How do I do it in bash?
The standard UNIX tool for manipulating text is awk:
$ awk 'BEGIN{FS=OFS=", "} {for (i=2;i<=NF;i++) print $1, $i}' file
10, "a#gmail.com"
2, "b#gmail.com"
3333, "c#hotmail.com"
3333, "d#gmail.com"
4, "e#hotmail.com"
4, "f#hotmail.com"
4, "g#gmail.com"
55, "h#gmail.com"
As the first argument, I'm passing the path to the file.
If no argument is passed, I report the error and exit with a non-zero status.
I'm iterating through the file, with the commas changed to spaces.
In each iteration I take the next space-separated word. If the word is a number, I store it and go on to the next word. If it is non-numeric, I print the previously stored number and the current word, separated by a comma. Before the for loop, I'm initializing number with 0, just in case ;)
#!/bin/bash
if [ -z "${1}" ]; then
    echo "No file specified"
    exit 1
else
    file=$1
    echo "Parsing file \"$file\":"
fi

number="0"
for word in $(sed "s#,# #g" "$file"); do
    if [[ $word =~ ^[0-9]+ ]]; then
        number=${word}
        continue
    else
        echo "$number, ${word}"
    fi
done
exit 0
Run:
test#LAPTOP-EQKIVD8A:~$ cat new.txt
10, "a#gmail.com"
2, "b#gmail.com"
3333, "c#hotmail.com", "d#gmail.com"
4, "e#hotmail.com", "f#hotmail.com", "g#gmail.com"
55, "h#gmail.com"
test#LAPTOP-EQKIVD8A:~$ ./script.sh new.txt
Parsing file "new.txt":
10, "a#gmail.com"
2, "b#gmail.com"
3333, "c#hotmail.com"
3333, "d#gmail.com"
4, "e#hotmail.com"
4, "f#hotmail.com"
4, "g#gmail.com"
55, "h#gmail.com"
I have this trajectory data:
EP, 13, 2017071012, 03, AP01, 126, 27.1, -130, 17, 1018, XX, 34, NEQ, 0000, 0000, 0000, 0000
AL, 07, 2017071012, 03, AP01, 132, 27, -131.1, 18, 1018, XX, 34, NEQ, 0000, 0000, 0000, 0000
WP, 19, 2017071012, 03, AP01, 000, 18.5, -116.8, 56, 982, XX, 50, NEQ, 0057, 0047, 0034, 0036
AL, 08, 2017071012, 03, AP01, 132, 27, -132.1, 17, 1018, XX, 34, NEQ, 0000, 0000, 0000, 0000
The information needs to be sorted by the 1st (name) and 2nd (numerical identifier) columns.
Running
sort -k1,2 file.txt
organizes the file into:
AL, 07, 2017071012, 03, AP01, 132, 27, -131.1, 18, 1018, XX, 34, NEQ, 0000, 0000, 0000, 0000
AL, 08, 2017071012, 03, AP01, 132, 27, -132.1, 17, 1018, XX, 34, NEQ, 0000, 0000, 0000, 0000
EP, 13, 2017071012, 03, AP01, 126, 27.1, -130, 17, 1018, XX, 34, NEQ, 0000, 0000, 0000, 0000
WP, 19, 2017071012, 03, AP01, 000, 18.5, -116.8, 56, 982, XX, 50, NEQ, 0057, 0047, 0034, 0036
This is a step toward what is desired.
I need to split the data into separate files based on the second column - how would that be done? I imagine some type of regular expression is needed. Additionally, the second column is always numerical and will not contain negative integers.
(The first column will always start with AL, EP, or WP)
Thank you for your information and help in advance!
sort -k1,2 file.txt | awk -F', *' '{print > ("out" $2)}'
If you are not using GNU awk and your file has a lot of unique "$2" values then you'll need to close the files as you go, e.g. at its simplest:
sort -k1,2 file.txt | awk -F', *' '{f="out" $2; print >> f; close(f)}'
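Either way, with the sample data above this should leave you with one output file per numerical identifier, along the lines of:
$ ls out*
out07  out08  out13  out19
$ cat out07
AL, 07, 2017071012, 03, AP01, 132, 27, -131.1, 18, 1018, XX, 34, NEQ, 0000, 0000, 0000, 0000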
Perl to the rescue:
perl -aF'/,\s/' -ne 'open my $OUT, ">>", $F[1] or die $!;
print {$OUT} $_;' -- sorted-file
-n reads the input line by line
-aF splits each line on the given pattern /,\s/, i.e. comma + space, and populates the @F array with the results
>> means the file is opened for appending
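For example, feeding it the sorted data from above (the intermediate file name sorted.txt is just an illustrative choice; the output files are named after the second column, i.e. 07, 08, 13 and 19):
$ sort -k1,2 file.txt > sorted.txt
$ perl -aF'/,\s/' -ne 'open my $OUT, ">>", $F[1] or die $!;
      print {$OUT} $_;' -- sorted.txt
Because the files are opened for appending, remove any old output files before re-running.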
I have a text file with a certain format, containing lines like:
{
    "297723": [
        [
            1,
            2
        ],
        [
            5,
            10
        ],
        [
            1,
            157
        ]
    ],
    "369258": [
        [
            3,
            4
        ],
        [
            6,
            11
        ],
        [
            30,
            200
        ]
    ]
}
How can I make it look like this?
{"297723": [[1, 2], [5, 10], [1,157]],
{"369258": [[3, 4], [6, 11], [30,200]]}
Of course, there are several blocks; I have just included the first (which starts with "{") and the last (which closes with "}"). In all the rest, there is a number (like "297723" in my example) which marks the new block.
Your input is valid JSON, so you can use the jq tool for this case:
jq -c '.' yourfile | sed 's/,"/,\n"/'
The output:
{"297723":[[1,2],[5,10],[1,157]],
"369258":[[3,4],[6,11],[30,200]]}
-c - print the input in compact-output form (a single line); the sed call then reinserts a line break before the next top-level key
The 2nd column in my csv file has duplicates. I want to add up the associated values from column 1 for rows that share the same value in that column.
Example csv :
56, cc=DK
49, cc=US
34, cc=GB
32, cc=DE
32, cc=NZ
31, cc=DK
31, cc=GB
31, cc=GB
Example result :
96, cc=GB # where 96 = 34+31+31
87, cc=DK # where 87 = 56+31
32, cc=DE
32, cc=NZ
You can use associative arrays in awk:
awk '{s[$2]+=$1}END{for(k in s)print s[k]", "k}' inFile
Expanding on that for readability, and using sum/key rather than s/k:
{                              # Do for each line.
    sum[$2] += $1              # Add first field to accumulator,
                               # indexed by second field.
                               # Initial value is zero.
}
END {                          # Do this bit when whole file processed.
    for (key in sum)           # For each key like cc=US:
        print sum[key] ", " key   # Output the sum and key.
}
Here's a sample run on my box:
pax$ echo;echo '56, cc=DK
49, cc=US
34, cc=GB
32, cc=DE
32, cc=NZ
31, cc=DK
31, cc=GB
31, cc=GB' | awk '{s[$2]+=$1}END{for(k in s)print s[k]", "k}'
32, cc=DE
96, cc=GB
32, cc=NZ
49, cc=US
87, cc=DK
This works despite the fact that the first column is of the form "999," (note the comma at the end), simply because awk, when evaluating a string in a numeric context, uses only the leading prefix that is valid in that context. Hence "45xyzzy" would become 45 and, more importantly, "49," becomes 49.
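You can see this coercion in isolation with a quick test (the sample strings are made up for illustration):
$ echo '49, 45xyzzy abc' | awk '{print $1+0, $2+0, $3+0}'
49 45 0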
Perl solution:
perl -ane '$h{ $F[1] } += $F[0] }{ print "$h{$_}\t$_\n" for keys %h' input.csv
Explanation:
-n processes the input line by line
-a splits the input line on whitespace into fields in the @F array
the hash table %h records the sum for each key (2nd column). It just adds the value of the first column to it.
}{ (called "Eskimo greeting") separates what's executed for each line (-n) from the code to be run after the whole input was processed
It's OK to use awk for such a simple task, but if you have a bunch of similar tasks and may need to change them in the future, it's easy to mess something up.
Since this is a typical database problem, consider using SQLite.
You can:
add a header row with column names and remove extra whitespace:
$ cat <(echo "num, name") originalInput.txt | tr -d ' ' > input.csv
import data to temporary sqlite db:
$ sqlite3 --batch temp.db <<EOF!
.mode csv
.import input.csv input
EOF!
select from db:
$ sqlite3 temp.db 'SELECT sum(num), name FROM input GROUP BY name'
32|cc=DE
87|cc=DK
96|cc=GB
32|cc=NZ
49|cc=US
It is a bit more code and uses the external sqlite3 command, but it's significantly less error-prone and more flexible. You can easily join several csv files, use fancy sorting, and more.
Also, imagine yourself looking at the code six months later and trying to quickly understand what it does.
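As one small illustration of that flexibility, sorting the summary by the totals is just one more clause on the same query (this assumes the temp.db database and input table created above):
$ sqlite3 temp.db 'SELECT sum(num), name FROM input GROUP BY name ORDER BY sum(num) DESC'
which lists the countries with the largest totals first.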