I have a csv that contains 100 rows by three columns of random numbers:
100, 20, 30
746, 82, 928
387, 12, 287.3
12, 47, 2938
125, 198, 263
...
12, 2736, 14
In bash, I need to add another column that will be either a 0 or a 1. However (and here is the hard part), I need 20% of the rows to get a 0 and 80% to get a 1.
Result:
100, 20, 30, 0
746, 82, 928, 1
387, 12, 287.3, 1
12, 47, 2938, 1
125, 198, 263, 0
...
12, 2736, 14, 1
What I have tried:
sed '1~3s/$/0/' mycsv.csv
but I thought I could replace the 1~3 with a random number, and that doesn't work.
Maybe a loop would work? Maybe with sed or awk?
Using awk and rand() to get random 0s and 1s with a 20% probability of getting a 0:
$ awk 'BEGIN{OFS=", ";srand()}{print $0,(rand()>0.2)}' file
Output:
100, 20, 30, 1
746, 82, 928, 1
387, 12, 287.3, 1
12, 47, 2938, 0
125, 198, 263, 1
..., 0
12, 2736, 14, 1
Explained:
$ awk '
BEGIN {
    OFS=", "                  # set output field separator
    srand()                   # time based seed for rand()
}
{
    print $0,(rand()>0.2)     # output 0/1 ~ 20/80
}' file
As srand() without an argument seeds from the current time in whole seconds, you might, depending on your needs, want to introduce an external seed for it, for example from Bash:
$ awk -v seed=$RANDOM 'BEGIN{srand(seed)}...'
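For reference, a minimal sketch of the seeded version spelled out in full (the ... above stands for the rest of the same program shown earlier):
$ awk -v seed=$RANDOM 'BEGIN{OFS=", ";srand(seed)}{print $0,(rand()>0.2)}' file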
Update: A version that first counts the lines in the file, calculates how many of them should be 0s (20%) and 1s (80%), and then randomly picks a 0 or a 1 for each row while keeping count:
$ awk -v seed=$RANDOM '
BEGIN {
    srand(seed)                            # feed the seed to rand()
}
NR==1 {                                    # while processing the first record
    while((getline line < FILENAME)>0)     # count the lines in the file
        nr++                               # nr stores the count
    for(i=1;i<=nr;i++)                     # produce
        a[(i>0.2*nr)]++                    # 20% 0s, 80% 1s
}
{
    p=a[0]/(a[0]+a[1])                     # probability to pick 0 or 1
    print $0 ", " (a[v=(rand()>p)]?v:v=(!v))  # print record and a 0 or 1
    a[v]--                                 # consume that 0 or 1
}' file
Another way to do it is the following:
Create a sequence of 0s and 1s with the correct ratio:
$ awk 'END{for(i=1;i<=FNR;++i) print (i <= 0.8*FNR) }' file
Shuffle the output to randomize it:
$ awk 'END{for(i=1;i<=FNR;++i) print (i <= 0.8*FNR) }' file | shuf
Paste it next to the file with a comma as the delimiter:
$ paste -d, file <(awk 'END{for(i=1;i<=FNR;++i) print (i <= 0.8*FNR) }' file | shuf)
The reason I do not want to use any form of random number generator is that it could lead to 100% ones or 100% zeros, or anything of that nature. The above produces the closest possible split of 80% ones and 20% zeros.
Another method would be a double parse with awk in the following way:
$ awk '(NR==FNR) { next }
(FNR==1) { for(i=1;i<NR;i++) a[i] = (i<0.8*(NR-1)) }
{ for(i in a) { print $0","a[i]; delete a[i]; break } }' file file
The above makes use of the fact that for(i in a) cycles through the array in an undetermined order. You can see this by quickly doing
$ awk 'BEGIN{ORS=","; for(i=1;i<=20;++i) a[i]; for(i in a) print i; printf "\n"}'
17,4,18,5,19,6,7,8,9,10,20,11,12,13,14,1,15,2,16,3,
But this is implementation dependent.
Finally, you could actually use shuf from within awk to get the desired result:
$ awk '(NR==FNR) { next }
(FNR==1) { cmd = "shuf -i 1-"(NR-1) }
{ cmd | getline i; print $0","(i <= 0.8*(NR-FNR)) }' file file
This seems to be more a problem of algorithm than of programming. You state in your question: I need to have 20% of the rows with 0s, and 80% with 1s. So the first question is what to do if the number of rows is not a multiple of 5. If you have 112 rows in total, 20% would be 22.4 rows, which does not make sense.
Assuming that you can redefine your task to deal with that case, the simplest solution would be to assign a 0 to the first 20% of the rows and a 1 to the remaining ones.
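A minimal sketch of that simplest approach, assuming the row count is taken up front with wc (no randomness at all; the first fifth of the rows simply get the 0):
n=$(wc -l < file)                                   # total number of rows
awk -v n="$n" '{ print $0 ", " (FNR > n/5) }' file  # 0 for the first 20%, 1 for the rest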
But say that you want to have some randomness in the distribution of the 0s and 1s. One quick-and-dirty solution would be to create an array holding all the zeroes and ones you are going to hand out in total, and in each iteration take a random element from this array (and remove it from the array).
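A sketch of that pick-and-remove idea in awk, making two passes over the file (the first pass only counts the rows; pool, left and zeros are names introduced here purely for illustration):
awk '
NR==FNR { n++; next }                  # first pass: count the rows
FNR==1  {                              # second pass, first row: build the pool first
    srand()
    zeros = int(n * 0.2)               # how many 0s to hand out in total
    for (i = 1; i <= n; i++)
        pool[i] = (i > zeros)          # the first 20% of slots hold a 0, the rest a 1
    left = n
}
{
    k = int(rand() * left) + 1         # pick a random remaining element
    print $0 ", " pool[k]
    pool[k] = pool[left]               # swap-remove it from the pool
    delete pool[left]
    left--
}' file file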
Adding to the previous replies, here is a Python 3 way to do this:
#!/usr/local/bin/python3
import csv
import math
import random
totalOflines = len(open('columns.csv').readlines())
newColumn = ( [0] * math.ceil(totalOflines * 0.20) ) + ( [1] * math.ceil(totalOflines * 0.80) )
random.shuffle(newColumn)
csvr = csv.reader(open('columns.csv'), delimiter = ",")
i=0
for row in csvr:
    print("{},{},{},{}".format(row[0], row[1], row[2], newColumn[i]))
    i += 1
Regards!
I'm new to shell scripting. I need to get the data between run and Automatic match counts using shell scripting, so that it can be processed as semi-structured data. Please advise.
Using sed -n '/run/,/Automatic/p' filename.txt|sed '1d;$d'|sed '$d;s/ //g' should clean up the data (it removes the first line, the two last lines, and the spaces at the beginning).
shell script - split.sh:
#!/bin/bash
sed -n '/run/,/Automatic/p' "$1" | sed '1d;$d' | sed '$d;s/ //g'
Run it on any file as below to get the output on the console and in a file:
shell> ./split.sh test.txt |tee splitted.dat
United Kingdom: 21/09/2012
Started: 08/02/2013 16:04:44
Finished: 08/02/2013 16:21:23
Time to process: 0 days 0 hours 16 mins 39 secs
Records processed: 37497
Throughput: 135124 records/hour
Time per record: 0.0266 secs
The output will also be stored in the splitted.dat file:
shell> cat splitted.dat
United Kingdom: 21/09/2012
Started: 08/02/2013 16:04:44
Finished: 08/02/2013 16:21:23
Time to process: 0 days 0 hours 16 mins 39 secs
Records processed: 37497
Throughput: 135124 records/hour
Time per record: 0.0266 secs
shell>
Update:
#!/bin/bash
# p - print lines with specified conditions
# !p - print lines except specified in conditions (opposite of p)
# |(pipe) - passes output of first command to the next
# $d - delete last line
# 1d - delete first line ( nd - delete nth line)
# '/run/,/Automatic/!p' - print lines except lines between 'run' to 'Automatic'
# sed '1d;s/ //g'- use output from first sed command and delete the 1st line and replace spaces with nothing
sed -n '/run/,/Automatic/!p' "$1" | sed '1d;s/ //g'
Output:
Verified Correct: 32426 (86.5%)
Good Match: 2102 ( 5.6%)
Good Premise Partial: 862 ( 2.3%)
Tentative Match: 1039 ( 2.8%)
Poor Match: 4 ( 0.0%)
Multiple Matches: 7 ( 0.0%)
Partial Match: 872 ( 2.3%)
Foreign Address: 2 ( 0.0%)
Unmatched: 183 ( 0.5%)
sed -n '/run/,/Automatic/ {//!p }' test.txt
This will print all lines between run and Automatic. The //! excludes the run and Automatic match counts lines themselves from the output.
I created a script that auto-logs in to a router and checks the current CPU load; if the load exceeds a certain threshold I need it to print the current CPU value to standard output.
I would like to search the script output for a certain pattern (the value 80 in this case, which is the threshold for high CPU load), and then for each instance of the pattern check whether the current value is greater than 80; if it is, print the 5 lines before the pattern followed by the current line with the pattern.
Question 1: how do I loop over each instance of the pattern and apply some code to each of them separately?
Question 2: how do I print n lines before the pattern followed by x lines after the pattern?
For example, I used awk to search for the pattern "health" and print 6 lines after it as below:
awk '/health/{x=NR+6}(NR<=x){print}' ./logs/CpuCheck.log
I would like to do the same for the pattern "80", but this time print 5 lines before it and one line after, and only if $3 (representing the current CPU load) exceeds the value 80.
Below is the output of the auto-login script (file name: CpuCheck.log):
ABCD-> show health xxxxxxxxxx
* - current value exceeds threshold
1 Min 1 Hr 1 Hr
Cpu Limit Curr Avg Avg Max
-----------------+-------+------+------+-----+----
01 80 39 36 36 47
WXYZ-> show health xxxxxxxxxx
* - current value exceeds threshold
1 Min 1 Hr 1 Hr
Cpu Limit Curr Avg Avg Max
-----------------+-------+------+------+-----+----
01 80 29 31 31 43
Thanks in advance for the help
Rather than use awk, you could use the -B and -A switches to grep, which print a number of lines before and after a pattern is matched:
grep -E -B 5 -A 1 '^[0-9]+[[:space:]]+80[[:space:]]+(100|9[0-9]|8[1-9])' CpuCheck.log
The pattern matches lines which start with some digits, followed by spaces, followed by 80, followed by a number between 81 and 100. The -E switch enables extended regular expressions (EREs), which are needed if you want to use the + character to mean "one or more". If your version of grep doesn't support EREs, you can instead use the slightly more verbose \{1,\} syntax:
grep -B 5 -A 1 '^[0-9]\{1,\}[[:space:]]\{1,\}80[[:space:]]\{1,\}\(100\|9[0-9]\|8[1-9]\)' CpuCheck.log
If grep isn't an option, one alternative would be to use awk. The easiest way would be to store all of the lines in a buffer:
awk 'f-->0;{a[NR]=$0}/^[0-9]+[[:space:]]+80[[:space:]]+(100|9[0-9]|8[1-9])/{for(i=NR-5;i<=NR;++i)print a[i];f=1}'
This stores every line in an array a. When the third column is greater than 80, it prints the previous 5 lines from the array. It also sets the flag f to 1, so that f-->0 is true for the next line, causing it to be printed.
Originally I had opted for a comparison $3>80 instead of the regular expression, but this isn't a good idea due to the varying format of the lines.
If the log file is really big, meaning that reading the whole thing into memory is unfeasible, you could implement a circular buffer so that only the previous 5 lines are stored, or alternatively read the file twice.
Unfortunately, awk is stream-oriented and doesn't have a simple way to get the lines before the current line. But that doesn't mean it isn't possible:
awk '
BEGIN {
    bufferSize = 6;
}
{
    buffer[NR % bufferSize] = $0;
}
$2 == 80 && $3 > 80 {
    # print the five lines before the match and the line with the match
    for (i = 1; i <= bufferSize; i++) {
        print buffer[(NR + i) % bufferSize];
    }
}
' ./logs/CpuCheck.log
I think the easiest way is with awk, reading the file twice.
This should use essentially 0 memory except whatever is used to store the line numbers.
If there is only one occurrence:
awk 'NR==FNR&&$2=="80"{to=NR+1;from=NR-5}NR!=FNR&&FNR<=to&&FNR>=from' file{,}
If there is more than one occurrence:
awk 'NR==FNR&&$2=="80"{to[++x]=NR+1;from[x]=NR-5}
NR!=FNR{for(i in to)if(FNR<=to[i]&&FNR>=from[i]){print;next}}' file{,}
Input/output
Input
1
2
3
4
5
6
7
8
9
10
11
12
01 80 39 36 36 47
13
14
15
16
17
01 80 39 36 36 47
18
19
20
Output
8
9
10
11
12
01 80 39 36 36 47
13
14
15
16
17
01 80 39 36 36 47
18
How it works
NR==FNR&&$2=="80"{to[++x]=NR+1;from[x]=NR-5}
On the first pass through the file, if the second field is 80, set to and from to the record number plus or minus whatever offsets you want.
Increment the occurrence variable x.
NR!=FNR
On the second pass through the file
for(i in to)
For each occurrence
if(FNR<=to[i]&&FNR>=from[i]){print;next}
If the current record number (in this pass) is between this occurrence's to and from, then print the line. next prevents the line from being printed multiple times if occurrences of the pattern are close together.
file{,}
Use the file twice as two arguments. The {,} expands to file file.
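You can see that expansion quickly (assuming a shell with brace expansion, such as bash):
$ echo file{,}
file file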
I have a manually created log file of the format
date start duration description
2/5 10:00p 1:45 Did this and that.
2/6 2:00a 0:20 Woke up from my slumber.
==============================================
2:05 TOTAL time spent
There are many entries in the log. To avoid manually recomputing total time every time an entry is added, I wrote the following script:
#!/bin/bash
file=`ls | grep log`
head -n -1 $file | egrep -o [0-9]:[0-9]{2}[^ap] \
| awk '{ FS = ":" ; SUM += 60*$1 ; SUM += $2 } END { print SUM }'
First, the script assumes there is exactly one file with log in its name, and that's the file I'm after. Second, it takes all lines other than the line with the current total, greps the time information from the line, and feeds it to awk, which converts it to minutes.
This is where I run into problems. The final sum would always be slightly off. Through trial and error, I discovered that awk will never count the second field of the very first record, i.e. the 45 minutes in this case. It will count the hour; it won't count the minutes. It has no such problem with the other records, but it's always off by the minutes in the first record.
What could be causing this behavior? How do I debug it?
You set FS inside the main block, which is already too late for the first line: the first record has been split with the default whitespace separator by the time the assignment runs, so $2 is empty there and only the leading 1 of 1:45 is counted.
The right way to do it is:
echo -e "1:45\n0:20" | awk 'BEGIN { FS=":" } { SUM += 60*$1 + $2 } END { print SUM }'
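Equivalently, you could set the separator with awk's -F option, which also takes effect before the first record is split:
echo -e "1:45\n0:20" | awk -F: '{ SUM += 60*$1 + $2 } END { print SUM }'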
You did not show us how you expect the output.
Is it like this?
$ cat log
date start duration description
2/5 10:00p 1:45 Did this and that.
2/6 2:00a 0:20 Woke up from my slumber.
==============================================
2:05 TOTAL time spent
Awk Code
awk '$3~/([[:digit:]]):([[:digit:]])/ && !/TOTAL/{
    split($3,A,":")
    sum+=A[1]*60+A[2]
}
END{
    print "Total",sum,"Minutes"
}' log
Resulting output:
Total 125 Minutes
I would like to make a loop that will take 10 lines of my input file at a time and output them to an output file, appending to the output file rather than overwriting it.
This is a sample data:
FilePath Filename Probability ClassifierID HectorFileType LibmagicFileType
/mnt/Hector/Data/benign/binary/benign-pete/ 01d0cd964020a1f498c601f9801742c1 19 S040PDFv02 data.pdf PDF document
/mnt/Hector/Data/benign/binary/benign-pete/ 0299a1771587043b232f760cbedbb5b7 0 S040PDFv02 data.pdf PDF document
I then use this to count each unique file and show how many of each there are:
cut -f 4 input.txt|sort| uniq -c | awk '{print $2, $1}' | sed 1d
So ultimately I just need help making a loop that can run that line of bash and output 10 lines of data at a time to an output file.
If I understand correctly, for every block of 10 lines, you are trying to:
Skip the headers, the first line of the block
Count how many times field #4 (ClassifierID) occurs and output the field, plus the count.
Here is an AWK script which will do it:
FNR % 10 != 1 {
    ++count[$4]
}
FNR % 10 == 0 {
    for (i in count) {
        print i, count[i]
        delete count[i]
    }
}
Discussion
The FNR % 10 != 1 block processes every line except lines 1, 11, 21, ..., i.e. the lines you want to skip. This block keeps a count of field $4.
The FNR % 10 == 0 block prints out a summary for that block and resets (via delete) the count
My script does not sort the fields, so the order might be different.
If you want to tally the whole file, not just blocks of 10, then replace FNR % 10 == 0 with END.
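For reference, a sketch of that whole-file variant, assuming the header only appears on the very first line (so the per-block header skip becomes a single-line skip):
FNR > 1 {                  # skip the single header line
    ++count[$4]
}
END {                      # one summary for the whole file
    for (i in count)
        print i, count[i]
}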