I have a data file (file.txt) that contains the lines below:
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=22:00,dom=sss.co.uk,user2=lis
I'm expecting to get the first column ($1) only if the ETA= number is greater than 15; with the sample above, only the first column of the 2nd and 3rd lines should be printed:
345
456
I tried cat file.txt | awk -F [,TPF=]' '{print $1}' but it prints the whole line for lines that have ETA at the end.
Using awk
$ awk -F"[=, ]" '{for (i=1;i<NF;i++) if ($i=="ETA") if ($(i+1) > 15) print $1}' input_file
345
456
With your shown samples, please try the following GNU awk code. It uses GNU awk's match() function with the regex (^[0-9]+).*\<ETA=([0-9]+):[0-9]+, which creates 2 capturing groups and saves their values into the array arr. If the 2nd element of arr is greater than 15, then the 1st element of arr is printed, as per the requirement.
awk '
match($0,/(^[0-9]+).*\<ETA=([0-9]+):[0-9]+/,arr) && arr[2]+0>15{
print arr[1]
}
' Input_file
I would harness GNU AWK for this task in the following way. Let file.txt content be
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=02:00,dom=sss.co.uk,user2=lis
then
awk 'substr($0,index($0,"ETA=")+4,2)+0>15{print $1}' file.txt
gives output
345
Explanation: I use the string functions index and substr: index finds where ETA= starts, and substr gets the 2 characters after ETA=. The offset of 4 is used because ETA= is 4 characters long and index gives the start position. I use +0 to convert the result to an integer and then compare it with 15. Disclaimer: this solution assumes every row has ETA= followed by exactly 2 digits.
(tested in GNU Awk 5.0.1)
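To see what index and substr return for the first sample line (a standalone illustration of the two calls used above, not part of the solution):
$ awk 'BEGIN { s = "123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com"
               print index(s, "ETA="), substr(s, index(s, "ETA=") + 4, 2) }'
15 12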
Whenever input contains tag=value pairs as yours does, it's best to first create an array of those mappings (v[] below) and then you can just access the values by their tags (names):
$ cat tst.awk
BEGIN {
    FS = "[, =]+"
    OFS = ","
}
{
    delete v
    for ( i=2; i<NF; i+=2 ) {
        v[$i] = $(i+1)
    }
}
v["ETA"]+0 > 15 {
    print $1
}
$ awk -f tst.awk file
345
456
With that approach you can trivially enhance the script in the future to access whatever values you like by their names, test them in whatever combinations you like, output them in whatever order you like, etc. For example:
$ cat tst.awk
BEGIN {
    FS = "[, =]+"
    OFS = ","
}
{
    delete v
    for ( i=2; i<NF; i+=2 ) {
        v[$i] = $(i+1)
    }
}
(v["pro"] ~ /b/) && (v["ETA"]+0 > 15) {
    print $1, v["team"], v["dom"]
}
$ awk -f tst.awk file
345,abc,sbc.int
456,efg,sss.co.uk
Think about how you'd enhance any other solution to do the above or anything remotely similar.
It's unclear why you think your attempt would do anything of the sort. Your attempt uses a completely different field separator and does not compare anything against the number 15.
You'll also want to get rid of the useless use of cat.
When you specify a field separator with -F, that changes what the first column $1 actually means; it is then everything before the first occurrence of the separator. The simplest fix is probably to split the line separately to obtain the first, space-separated column.
awk -F 'ETA=' '$2+0 > 15 { split($0, n, /[ \t]+/); print n[1] }' file.txt
The value in $2 will be the data after the first separator (and up until the next one, if any). The +0 forces a numeric conversion, which uses only the leading digits of the field and ignores the non-numeric text after them; without it, awk would compare the field against 15 as a string, because the field's value is not a pure number. So on the first line, $2 is 12:00, team=xyz,user1=tom,dom=dby.com, which converts to 12, and 12 is not greater than 15.
When the condition is true, we split the original line $0 into the array n on sequences of whitespace, and then print the first element of this array.
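To see the numeric conversion by itself (a standalone illustration):
$ awk 'BEGIN { s = "12:00, team=xyz,user1=tom,dom=dby.com"; print s+0 }'
12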
Using awk you could match ETA= followed by 1 or more digits. Then get the match without the ETA= part and check if the number is greater than 15 and print the first field.
awk 'match($0, /ETA=[0-9]+/) {
  if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file
Output
345
456
If the first field should start with a number:
awk '/^[0-9]/ && match($0, /ETA=[0-9]+/) {
  if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file
I've been reading through other similar questions. I have this working, but it is very slow due to the size of the CSV I'm working with. Are there ways to make this more efficient?
My goal:
I have an incredibly large CSV (>100 GB). I would like to take all of the unique values in a column, extract 10% of these, and then use that 10% to subsample the original CSV.
What I'm doing:
1 - I'm pulling all unique values from column 11 and writing those to a text file:
cat File1.csv | cut -f11 -d , | sort | uniq > uniqueValues.txt
2 - Next, I'm sampling a random 10% of the values in uniqueValues.txt:
cat uniqueValues.txt | awk 'BEGIN {srand()} !/^$/ { if (rand() <= .10) print $0 }' > uniqueValues.10pct.txt
3 - Next, I'm pulling the rows in File1.csv which have column 11 matching values from uniqueValues.10pct.txt:
awk -F, 'NR==FNR{a[$1]=$0;next}($11 in a){print}' uniqueValues.10pct.txt File1.csv > File1_subsample.csv
As far as I can tell, this seems to be working. Does this seem reasonable? Any suggestions on how to improve the efficiency?
Any suggestions on how to improve the efficiency?
Avoid sort in the 1st step, as the 2nd and 3rd steps do not care about order; you might do your whole 1st step using a single awk command as follows:
awk 'BEGIN{FS=","}!arr[$11]++{print $11}' File1.csv > uniqueValues.txt
Explanation: I inform GNU AWK that the field separator (FS) is a comma, then for each line I do arr[$11]++ to count the occurrences of the value in the 11th column and use ! to negate the result, so 0 becomes true, whilst 1 and greater become false. If this holds true, I print the 11th column.
Please test the above against your 1st step on your data and then select whichever is faster.
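To see the !arr[...]++ deduplication idiom in isolation on a toy input (a standalone illustration):
$ printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'
a
b
c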
As for the 3rd step, you might attempt using a non-GNU AWK if you are allowed to install tools on your machine. For example, the author of the article¹ Don't MAWK AWK – the fastest and most elegant big data munging language! found nawk faster than GNU AWK and mawk faster than nawk. After installing, prepare test data and measure times for
gawk -F, 'NR==FNR{a[$1]=$0;next}($11 in a){print}' uniqueValues.10pct.txt File1.csv > File1_subsample.csv
nawk -F, 'NR==FNR{a[$1]=$0;next}($11 in a){print}' uniqueValues.10pct.txt File1.csv > File1_subsample.csv
mawk -F, 'NR==FNR{a[$1]=$0;next}($11 in a){print}' uniqueValues.10pct.txt File1.csv > File1_subsample.csv
then use whichever proves fastest; one way to time the three commands is sketched after the footnote below.
¹ be warned that the values shown pertain to the versions available in September 2009; you might get different times with the versions available in June 2022.
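A minimal timing sketch (assuming a bash shell, that all three awk variants are installed, and that uniqueValues.10pct.txt and File1.csv already exist; output is discarded so only processing time is measured):
# compare wall-clock times of the three awk implementations on the same input
for awkbin in gawk nawk mawk; do
    echo "== $awkbin =="
    time "$awkbin" -F, 'NR==FNR{a[$1]=$0;next}($11 in a){print}' \
        uniqueValues.10pct.txt File1.csv > /dev/null
done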
You might find this to be faster (untested since no sample input/output provided):
cut -f11 -d',' File1.csv |
sort -u > uniqueValues.txt
numUnq=$(wc -l < uniqueValues.txt)
shuf -n "$(( numUnq / 10 ))" uniqueValues.txt |
awk -F',' 'NR==FNR{vals[$1]; next} $11 in vals' - File1.csv
You could try replacing that first cut | sort; numUnq=$(wc...) with
numUnq=$(awk -F',' '!seen[$11]++{print $11 > "uniqueValues.txt"; cnt++} END{print cnt+0}' File1.csv)
to see if that's any faster but I doubt it since cut, sort, and wc are all very fast while awk has to do regexp-based field splitting and store all $11 values in memory (which can get slow as the array size increases due to how dynamic array allocation works).
Create a sample *.csv file:
for ((i=1;i<=10;i++))
do
for ((j=1;j<=100;j++))
do
echo "a,b,c,d,e,f,g,h,i,j,${j},k,l,m"
done
done > large.csv
NOTES:
1,000 total lines
100 unique values in the 11th field
each unique value shows up 10 times in the file
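An optional sanity check of the generated file (line count and number of unique values in the 11th field):
$ wc -l < large.csv
1000
$ cut -d',' -f11 large.csv | sort -u | wc -l
100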
We'll look at a couple awk ideas that:
keep track of unique values as we find them
apply the random percentage check as we encounter a new (unique) value
require just a single pass through the source file
NOTE: both of these awk scripts (below) replace all of OP's current code (cat/cut/sort/uniq/cat/awk/awk)
First idea applies our random percentage check each time we find a new unique value:
awk -F',' '
BEGIN { srand() }
!seen[$11]++ { if (rand() <= 0.10)   # if this is the 1st time we have seen this value and rand() is <= 10% then ...
                   keep[$11]         # add the value to our keep[] array
             }
$11 in keep                          # print current line if $11 is an index in the keep[] array
' large.csv > small.csv
NOTES:
one drawback to this approach is that the total number of unique values is not guaranteed to always be exactly 10% since we're at the mercy of the rand() function, for example ...
sample runs generated 70, 110, 100, 140, and 110 lines (ie, 7, 11, 10, 14 and 11 unique values) in small.csv
A different approach where we pre-generate a random set of modulo 100 values (ie, 0 to 99); as we find a new uniq value we check the count (of uniq values) modulo 100 and if we find a match to our pre-generated set then we print the row:
awk -F',' -v pct=10 '
BEGIN { srand()
        delete mods                     # force awk to treat all "mods" references as an array and not a scalar
        while (length(mods) < pct)      # repeat loop until we have "pct" unique indices in the mods[] array
            mods[int(rand() * 100)]     # generate random integers between 0 and 99
      }
!seen[$11]++ { if ((++uniqcnt % 100) in mods)   # if this is the 1st time we have seen this value then increment our unique value counter and if "modulo 100" is an index in the mods[] array then ...
                   keep[$11]                    # add the value to our keep[] array
             }
$11 in keep                                     # print current line if $11 is an index in the keep[] array
' large.csv > small.csv
NOTES:
for a large pct this assumes the rand() results are evenly distributed between 0 and 1 so that the mods[] array is populated in a timely manner
this has the benefit of keeping very close to exactly 10% of the possible unique values (depending on the number of unique values, the actual percentage will be 10% +/- 1%)
a half dozen sample runs all generated exactly 100 lines (ie, 10 unique values) in small.csv
If OP still needs to generate the two intermediate (sorted) files (uniqueValues.txt and uniqueValues.10pct.txt) then this could be done in the same awk script via an END {...} block, eg:
END { PROCINFO["sorted_in"] = "@ind_num_asc"    # this line of code requires GNU awk, otherwise OP can sort the files at the OS/bash level
      for (i in seen)
          print i > "uniqueValues.txt"
      for (i in keep)
          print i > "uniqueValues.10pct.txt"        # use with 1st awk script
        # print i > "uniqueValues." pct "pct.txt"   # use with 2nd awk script
}
Given a CSV file with contents similar to this:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:01,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
What is the best way, using bash or awk scripting, to tidy it up and remove all useless zeros? By useless I mean: this data will be used for line charts in web pages, and reading the entire CSV file in the web browser via JavaScript/jQuery etc. is very slow, so it would be more efficient to eliminate the useless zeros prior to uploading the file. If I simply removed all the zeros, the lines would more or less show peak to peak to peak instead of real lines from zero up to some larger value and back to zero, followed by a gap until the next value greater than zero.
As you can see, there are 3 groups in the data. Any time the same value appears 3 times in a row for a group (for example GRP1), I'd like to remove the middle (2nd) occurrence. In reality this should work for values greater than zero too: if the same value were found every 10 seconds for, say, 10 rows in a row, it would be good to leave both ends in place and remove items 2 through 9.
The line chart would look the same, but the data would be much smaller to deal with. Ideally I could do this with a shell script on disk prior to reading the input file.
So (just looking at GRP1) instead of:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:31,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:41,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2
The script would eliminate all useless 3 values...and leave only:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2
Or, another expected result, this time using 0 instead of 3 as the repeated consecutive value, for GRP2:
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:21,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:31,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:41,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:51,GRP2,GRP2,TOTAL,0
2017-05-01,00:01:01,GRP2,GRP2,TOTAL,2
The script would eliminate all useless 0 values...and leave only:
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:51,GRP2,GRP2,TOTAL,0
2017-05-01,00:01:01,GRP2,GRP2,TOTAL,2
@karakfa's answer gets me close. I like it, but after applying the awk to one unique group and then eliminating some duplicates that also showed up for some reason, I still end up with this:
2017-05-02,00:05:51,DTE,DTE,TOTAL,2
2017-05-02,00:06:01,DTE,DTE,TOTAL,1
2017-05-02,00:06:51,DTE,DTE,TOTAL,1
2017-05-02,00:07:01,DTE,DTE,TOTAL,1
2017-05-02,00:07:51,DTE,DTE,TOTAL,1
2017-05-02,00:08:01,DTE,DTE,TOTAL,1
2017-05-02,00:08:51,DTE,DTE,TOTAL,1
2017-05-02,00:09:01,DTE,DTE,TOTAL,1
2017-05-02,00:09:51,DTE,DTE,TOTAL,1
2017-05-02,00:10:01,DTE,DTE,TOTAL,1
2017-05-02,00:10:51,DTE,DTE,TOTAL,1
2017-05-02,00:11:01,DTE,DTE,TOTAL,1
2017-05-02,00:11:51,DTE,DTE,TOTAL,1
2017-05-02,00:12:01,DTE,DTE,TOTAL,1
2017-05-02,00:12:51,DTE,DTE,TOTAL,1
2017-05-02,00:13:01,DTE,DTE,TOTAL,1
2017-05-02,00:13:51,DTE,DTE,TOTAL,1
2017-05-02,00:14:01,DTE,DTE,TOTAL,1
2017-05-02,00:14:51,DTE,DTE,TOTAL,1
2017-05-02,00:15:01,DTE,DTE,TOTAL,1
2017-05-02,00:15:11,DTE,DTE,TOTAL,1
2017-05-02,00:15:21,DTE,DTE,TOTAL,9
Would be wonderful to get to this instead:
2017-05-02,00:05:51,DTE,DTE,TOTAL,2
2017-05-02,00:06:01,DTE,DTE,TOTAL,1
2017-05-02,00:15:11,DTE,DTE,TOTAL,1
2017-05-02,00:15:21,DTE,DTE,TOTAL,9
That's one ill-placed question but I'll take a crack at the title, if you don't mind:
$ awk -F, ' {
      if($3 OFS $4 OFS $6 in first)
          last[$3 OFS $4 OFS $6]=$0
      else
          first[$3 OFS $4 OFS $6]=$0 }
  END {
      for(i in first) {
          print first[i]
          if(i in last)
              print last[i] }
  }' file
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:01,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:11,GRP3,GRP3,TOTAL,0
Basically it keeps the first and last (if it exists) occurrence of each unique combination of the 3rd, 4th and 6th fields.
Edit: In light of the word consecutive, how about this awful hack:
$ awk -F, '
  (p!=$3 OFS $4 OFS $6) {
      if(NR>1 && lp<(NR-1))
          print q
      print $0
      lp=NR }
  {
      p=$3 OFS $4 OFS $6
      q=$0 }
  ' file
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:01,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:11,GRP3,GRP3,TOTAL,0
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
and output for the second data:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2
and the third:
2017-05-01,00:00:01,GRP2,GRP2,TOTAL,0
2017-05-01,00:00:51,GRP2,GRP2,TOTAL,0
2017-05-01,00:01:01,GRP2,GRP2,TOTAL,2
Simple awk approach:
awk -F, '$NF!=0' inputfile
The output:
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:11,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:21,GRP1,GRP1,TOTAL,3
$NF!=0 - takes into account only those lines which don't have 0 as their last field value
awk to the rescue!
$ awk -F'[,:]' '$4==pt+10 && $NF==p {pt=$4; pl=$0; next}
pl {print pl}
{pt=$4;p=$NF}1' file
2017-05-01,00:00:01,GRP1,GRP1,TOTAL,3
2017-05-01,00:00:51,GRP1,GRP1,TOTAL,3
2017-05-01,00:01:01,GRP1,GRP1,TOTAL,2