Sort & uniq the values in a specific column - bash

I have data with fields delimited by ::
AA:w_c;w_c;r_c:1;3
BB:sync;sync:4
CC:t_wak;t_wak:6;7;8
I need column 2 to contain only one unique value. If a line has more than one unique value in column 2, that line needs to be printed to another file.
I tried this:
#!/bin/bash
sort -u -t : -k2,2 file >> txt
awk -F: '{gsub(";"," ",$3)}1' txt
Output:
BB:sync;sync:4
CC t_wak;t_wak 6 7 8
AA w_c;w_c;r_c 1 3
What I am actually trying to do is sort and uniq the values in column 2, copying that output to another file called "txt", and then use awk to replace the ; with a space in column 3, but the code above is not working.
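Note: when gsub() modifies $3, awk rebuilds the record using OFS, which defaults to a single space (that's why the BB line, where nothing was substituted, kept its colons). A minimal sketch of the fix, assuming the output should stay colon-delimited:
awk 'BEGIN{FS=OFS=":"} {gsub(";"," ",$3)} 1' txt
With OFS set to ":", reassigning $3 no longer turns the other delimiters into spaces.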
Desired Output 1:
BB:sync:4
CC:t_wak:6 7 8
The two lines above are the expected output because their column 2 contains only one unique value.
The line below needs to be printed to another file because its column 2 contains more than one value.
Desired output 2:
AA:w_c;r_c:1;3
w_c
r_c
Column 2 should hold only one value; if there are more, the line needs to go to another file, listing the values as shown above.

This quick solution should work for the example (the trailing 7 after the closing brace is just an always-true pattern, i.e. shorthand for {print}):
awk 'BEGIN{FS=OFS=":"}
{
    split($2, a, ";")              # split column 2 on ";"
    v=""; delete u
    for(i=1;i<=length(a);i++){
        if( ++u[a[i]]<2)           # keep only the first occurrence of each value
            v=v (i==1?"":";") a[i]
    }
    $2=v                           # column 2 now holds the de-duplicated values
    if(length(u)>1){               # more than one distinct value in column 2
        print > "output2.txt"
        next
    }
}7' input
Let's do a test:
kent$ awk 'BEGIN{FS=OFS=":"}
{
    split($2, a, ";")
    v=""; delete u
    for(i=1;i<=length(a);i++){
        if( ++u[a[i]]<2)
            v=v (i==1?"":";") a[i]
    }
    $2=v
    if(length(u)>1){
        print > "output2.txt"
        next
    }
}7' f
BB:sync:4
CC:t_wak:6;7;8
kent$ cat output2.txt
AA:w_c;r_c:1;3
If you want each value of col2 listed separately in output2.txt:
awk 'BEGIN{FS=OFS=":";out2="output2.txt"}
{
    split($2, a, ";")
    v=""; delete u
    for(i=1;i<=length(a);i++){
        if( ++u[a[i]]<2)
            v=v (i==1?"":";") a[i]
    }
    $2=v
    if(length(u)>1){
        print > out2
        for(x in u)                # also list each distinct col2 value
            print x > out2
        next
    }
}7' input
Then you'll get:
kent$ cat output2.txt
AA:w_c;r_c:1;3
w_c
r_c

Related

awk to get first column if a specific number in the line is greater than a digit

I have a data file (file.txt) that contains the lines below:
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=22:00,dom=sss.co.uk,user2=lis
I'm expecting to get the first column ($1) only if the ETA= number is greater than 15; for the sample above, only the first column of the 2nd and 3rd lines is expected:
345
456
I tried cat file.txt | awk -F [,TPF=]' '{print $1}', but it prints the whole line that has ETA at the end.
Using awk
$ awk -F"[=, ]" '{for (i=1;i<NF;i++) if ($i=="ETA") if ($(i+1) > 15) print $1}' input_file
345
456
With your shown samples, please try the following GNU awk code. It uses GNU awk's match function with the regex (^[0-9]+).*ETA=([0-9]+):[0-9]+, which creates 2 capture groups and saves their values into the array arr. If the 2nd element of arr (plus 0, to force a numeric comparison) is greater than 15, the 1st element of arr is printed, as required.
awk '
match($0,/(^[0-9]+).*\<ETA=([0-9]+):[0-9]+/,arr) && arr[2]+0>15{
    print arr[1]
}
' Input_file
I would harness GNU AWK for this task in the following way. Let file.txt content be
123 pro=tegs, ETA=12:00, team=xyz,user1=tom,dom=dby.com
345 pro=rbs, team=abc,user1=chan,dom=sbc.int,ETA=23:00
456 team=efg, pro=bvy,ETA=02:00,dom=sss.co.uk,user2=lis
then
awk 'substr($0,index($0,"ETA=")+4,2)+0>15{print $1}' file.txt
gives output
345
Explanation: I use the string functions index and substr: index finds where ETA= is, and substr gets the 2 characters after ETA= (the offset of 4 is used because ETA= is 4 characters long and index gives the start position). I use +0 to convert the result to an integer, then compare it with 15. Disclaimer: this solution assumes every row has ETA= followed by exactly 2 digits.
(tested in GNU Awk 5.0.1)
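If the two-digit assumption is a concern, here's a sketch of a variant that instead relies on awk's numeric coercion stopping at the first non-digit (the index guard is an addition, to skip lines without ETA=):
awk '{ p = index($0, "ETA=")               # start of "ETA=", 0 if absent
       if (p && substr($0, p+4)+0 > 15)    # "23:00, team=..." +0 -> 23
           print $1 }' file.txt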
Whenever input contains tag=value pairs as yours does, it's best to first create an array of those mappings (v[] below); then you can just access the values by their tags (names):
$ cat tst.awk
BEGIN {
    FS = "[, =]+"
    OFS = ","
}
{
    delete v
    for ( i=2; i<NF; i+=2 ) {
        v[$i] = $(i+1)
    }
}
v["ETA"]+0 > 15 {
    print $1
}
$ awk -f tst.awk file
345
456
With that approach you can trivially enhance the script in future to access whatever values you like by their names, test them in whatever combinations you like, output them in whatever order you like, etc. For example:
$ cat tst.awk
BEGIN {
    FS = "[, =]+"
    OFS = ","
}
{
    delete v
    for ( i=2; i<NF; i+=2 ) {
        v[$i] = $(i+1)
    }
}
(v["pro"] ~ /b/) && (v["ETA"]+0 > 15) {
    print $1, v["team"], v["dom"]
}
$ awk -f tst.awk file
345,abc,sbc.int
456,efg,sss.co.uk
Think about how you'd enhance any other solution to do the above or anything remotely similar.
It's unclear why you think your attempt would do anything of the sort. Your attempt uses a completely different field separator and does not compare anything against the number 15.
You'll also want to get rid of the useless use of cat.
When you specify a column separator with -F, that changes what the first column $1 actually means; it is then everything before the first occurrence of the separator. So you probably need to separately split the line to obtain the first, space-separated column.
awk -F 'ETA=' '$2+0 > 15 { split($0, n, /[ \t]+/); print n[1] }' file.txt
The value in $2 will be the data after the first separator (and up until the next one); adding 0 forces a numeric comparison, which simply ignores any non-numeric text after the number at the beginning of the field. So, for example, on the first line we are literally checking whether 12:00, team=xyz,user1=tom,dom=dby.com is larger than 15, but the comparison effectively checks whether 12 is larger than 15 (which is obviously false).
When the condition is true, we split the original line $0 into the array n on sequences of whitespace, and then print the first element of this array.
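A quick standalone illustration of that coercion (not part of the solution itself):
echo '345 pro=rbs, team=abc,ETA=23:00' | awk -F 'ETA=' '{ print $2+0 }'
23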
Using awk you could match ETA= followed by 1 or more digits, then take the match without the ETA= part, check if the number is greater than 15, and if so print the first field.
awk 'match($0, /ETA=[0-9]+/) {
    if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file
Output
345
456
If the first field should start with a number:
awk '/^[0-9]/ && match($0, /ETA=[0-9]+/) {
    if(substr($0, RSTART+4, RLENGTH-4)+0 > 15) print $1
}' file

How to remove columns from a file given the columns in another file in Linux?

Suppose I have a file fileA that contains the column numbers that need to be removed (I really have over 500 columns in my input file fileB).
fileA:
2
5
And I want to remove those columns (2 and 5) from fileB:
a b c d e f
g h i j k l
in Linux to get:
a c d f
g i j l
What should I do? I found out that I could eliminate printing those columns with the code:
awk '{$2=$5="";print $0}' fileB
However, there are two problems with this approach: first, it does not really remove those columns, it just replaces them with empty strings; second, instead of manually typing in those column numbers, how can I get them by reading from another file?
Original Question:
Suppose I have a file A that contains the column numbers that need to be removed,
file A:
223
345
346
567
And I want to remove those columns (223, 345, 346, 567) from file B in Linux; what should I do?
If your cut has the --complement option, then you can do:
cut --complement -d ' ' -f "$(echo $(<fileA))" fileB
Here $(echo $(<fileA)) flattens the newline-separated numbers into the blank-separated list 2 5, which cut accepts as a field list (blank-separated lists are allowed alongside comma-separated ones).
$ cat tst.awk
NR==FNR {
    badFldNrs[$1]                          # field numbers to delete
    next
}
FNR == 1 {
    for (inFldNr=1; inFldNr<=NF; inFldNr++) {
        if ( !(inFldNr in badFldNrs) ) {
            out2in[++numOutFlds] = inFldNr # map output field # -> input field #
        }
    }
}
{
    for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
        inFldNr = out2in[outFldNr]
        printf "%s%s", $inFldNr, (outFldNr<numOutFlds ? OFS : ORS)
    }
}
$ awk -f tst.awk fileA fileB
a c d f
g i j l
One awk idea:
awk '
FNR==NR { skip[$1] ; next }     # store field #s to be skipped
{ line=""                       # initialize output variable
  pfx=""                        # first prefix will be ""
  for (i=1;i<=NF;i++)           # loop through the fields in this input line ...
      if ( !(i in skip) ) {     # if field # not mentioned in the skip[] array then ...
          line=line pfx $i      # add to our output variable
          pfx=OFS               # prefix = OFS for 2nd-nth fields to be added to output variable
      }
  if ( pfx == OFS )             # if we have something to print ...
      print line                # print output variable to stdout
}
' fileA fileB
NOTE: OP hasn't provided the input/output field delimiters; OP can add the appropriate FS/OFS assignments as needed (see the comma-delimited sketch after the sample output below).
This generates:
a c d f
g i j l
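Picking up on that NOTE: if, say, fileB were comma-delimited (an assumption for illustration, since the delimiters weren't given), the same idea only needs FS/OFS initialized:
awk '
BEGIN   { FS=OFS="," }          # assumed comma-separated input and output
FNR==NR { skip[$1]; next }
        { line=pfx=""
          for (i=1;i<=NF;i++)
              if ( !(i in skip) ) { line=line pfx $i; pfx=OFS }
          if ( pfx == OFS ) print line
        }
' fileA fileB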
Using awk
$ awk 'NR==FNR {col[$1]=$1;next} {for(i=1;i<=NF;++i) if (i != col[i]) printf("%s ", $i); printf("\n")}' fileA fileB
a c d f
g i j l

Average of first ten numbers of text file using bash

I have a file of two columns. The first column contains dates and the second a corresponding number; the two columns are separated by a comma. I want to take the average of the first three numbers and print it to a new file, then do the same for the 2nd-4th numbers, then the 3rd-5th, and so on. For example:
File1
date1,1
date2,1
date3,4
date4,1
date5,7
Output file
2
2
4
Is there any way to do this using awk or some other tool?
Input
akshay@db-3325:/tmp$ cat file.txt
date1,1
date2,1
date3,4
date4,1
date5,7
akshay@db-3325:/tmp$ awk -v n=3 -v FS=, '{
    x = $2;                   # current value
    i = NR % n;               # slot in circular buffer q
    ma += (x - q[i]) / n;     # update moving average: new value in, oldest out
    q[i] = x;                 # overwrite the oldest value
    if(NR>=n)print ma;        # print once the window is full
}' file.txt
2
2
4
Or the one below, useful for plotting, which keeps the reference axis (in your case the date) at the center of each averaging window:
Script
akshay@db-3325:/tmp$ cat avg.awk
BEGIN {
    m=int((n+1)/2)               # middle position of the window
}
{L[NR]=$2; sum+=$2}              # store value, add to running sum
NR>=m {d[++i]=$1}                # collect the centre date
NR>n {sum-=L[NR-n]}              # drop the value that left the window
NR>=n{
    a[++k]=sum/n                 # window full: store the average
}
END {
    for (j=1; j<=k; j++)
        print d[j],a[j]          # remove d[j], if you just want values only
}
Output
akshay@db-3325:/tmp$ awk -v n=3 -v FS=, -v OFS=, -f avg.awk file.txt
date2,2
date3,2
date4,4
$ awk -F, '{a[NR%3]=$2} (NR>=3){print (a[0]+a[1]+a[2])/3}' file
2
2
4
A little math trick here: $2 is stored into a[NR%3] for each record, so the three array elements are updated cyclically, and the sum of a[0], a[1], a[2] is always the sum of the last 3 numbers.
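The same trick generalizes to any window size n; a sketch with the size passed in via -v (variable names are illustrative):
awk -F, -v n=3 '
{ a[NR%n] = $2 }                  # circular buffer of the last n values
NR>=n {
    s = 0
    for (i=0; i<n; i++) s += a[i] # sum the current window
    print s/n
}' file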
Updated based on helpful feedback from Ed Morton.
Here's a quick-and-dirty script to do what you've asked for. It doesn't have much flexibility, but you can easily figure out how to extend it.
To run it, save it into a file and execute it as an awk script, either with a shebang line or by calling awk -f.
// {
    Numbers[NR]=$2;
    if ( NR >= 3 ) {
        printf("%i\n", (Numbers[NR] + Numbers[NR-1] + Numbers[NR-2])/3)
    }
}
BEGIN {
    FS=","
}
Explanation:
The empty pattern // matches every line ("/" is the match operator, and an empty match means "do this thing on every line"). Numbers[NR]=$2 uses the Record Number (NR) as the key and stores the value from column 2. The if checks that 3 or more values have been read from the file, and the printf does the maths and prints the result as an integer. The BEGIN block changes the Field Separator to a comma ",".

Extract lines having same second column but different third column

I have a file with strings in 3 columns, as below.
a b x
a b y
a b z
a c x
a d y
I want to extract all the lines having same second column but different third column. The output I am expecting for the above example is
a b x
a b y
a b z
I tried uniq -f2 and sort -u -k2, but they aren't working as I expect. Any suggestions, please?
awk '
seen[$2]++ {                            # 2nd field seen before
    if (!seen[$2,$3]++) {               # and this 3rd field is new for it
        printf "%s%s\n", first[$2], $0  # print buffered 1st line (if still buffered) plus this one
    }
    delete first[$2]                    # 1st line for this key is no longer needed
    next
}
{ first[$2] = $0 ORS }                  # buffer the 1st line seen for each 2nd-field value
' file
a b x
a b y
a b z
Note that the above will work in any awk, for any values in your input file, does not retain the whole of the input file in memory, doesn't rely on any external tools for pre/post processing, and will produce the output lines in exactly the same order they appeared in the input.
awk to the rescue!
We need to make sure all records are unique first:
$ sort file | uniq |
awk '{c[$2]++; a[$2]=a[$2]?a[$2]RS$0:$0}
END{for(k in a) if(c[k]>1) print a[k]}'
a b x
a b y
a b z
Explanation: keep a counter of second-field occurrences and aggregate the records per key. At the end, print the records whose counter is greater than one.
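A two-pass variant of the same counting idea, reading the file twice instead of aggregating the groups in memory (duplicate input lines, if present, would be printed as-is by the second pass):
awk '
NR==FNR { if (!seen[$2,$3]++) cnt[$2]++; next }   # 1st pass: count distinct $3 per $2
cnt[$2] > 1                                       # 2nd pass: print lines whose $2 has several
' file file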

Grab nth occurrence in between two patterns using awk or sed

I want to parse through the output from a file and grab the nth occurrence of text in between two patterns, preferably using awk or sed.
category
1
s
t
done
category
2
n
d
done
category
3
r
d
done
category
4
t
h
done
Let's just say for this example I want to grab the third occurrence of text in between category and done; essentially, the output would be:
category
3
r
d
done
This might work for you (GNU sed):
sed -n '/category/{:a;N;/done/!ba;x;s/^/x/;/^x\{3\}$/{x;p;q};x}' file
Turn off automatic printing by using the -n option. Gather up lines between category and done. Store a counter in the hold space and when it reaches 3 print the collection in the pattern space and quit.
Or if you prefer awk:
awk '/^category/,/^done/{if(++m==1)n++;if(n==3)print;if(/^done/)m=0}' file
Try doing this:
awk -v n=3 '/^category/{l++} (l==n){print}' file.txt
Or more cryptic:
awk -v n=3 '/^category/{l++} l==n' file.txt
If your file is big:
awk -v n=3 '/^category/{l++} l>n{exit} l==n' file.txt
If your file doesn't contain any null characters, here's one way using GNU sed. This will find the third occurrence of a pattern range; however, you can easily modify it to get any occurrence you'd like.
sed -n '/^category/ { x; s/^/\x0/; /^\x0\{3\}$/ { x; :a; p; /done/q; n; ba }; x }' file.txt
Results:
category
3
r
d
done
Explanation:
Turn off default printing with the -n switch. Match the word 'category' at the start of a line. Swap the pattern space with the hold space and append a null character to the start of the pattern (the hold space is being used as a counter). If the pattern space then contains three leading null characters, this is the third occurrence: swap the original line back out of the hold space, then loop, printing the pattern space and reading the next line until the last pattern (done) is matched, at which point sed quits. If done is never found, sed simply continues reading lines of input inside the loop.
awk -v tgt=3 '
/^category$/ { fnd=1; rec="" }
fnd {
    rec = rec $0 ORS
    if (/^done$/) {
        if (++cnt == tgt) {
            printf "%s",rec
            exit
        }
        fnd = 0
    }
}
' file
With GNU awk you can set the record separator to a regular expression:
<file awk 'NR==n+1 { print rt, $0 } { rt = RT }' RS='\\<category' ORS='' n=3
Output:
category
3
r
d
done
RT is the matched record separator. Note that the record relative to n will be off by one as the first record refers to what precedes the first RS.
Edit
As per Ed's comment, this will not work when the records have other data in between them, e.g.:
category
1
s
t
done
category
2
n
d
done
foo
category
3
r
d
done
bar
category
4
t
h
done
One way to get around this is to clean up the input with a second (or first) awk:
<file awk '/^category$/,/^done$/' |
awk 'NR==n+1 { print rt, $0 } { rt = RT }' RS='\\<category' ORS='' n=3
Output:
category
3
r
d
done
Edit 2
As Ed has noted in the comments, the above methods do not search for the ending pattern. One way to do this, which hasn't been covered by the other answers, is with getline (note that there are some caveats with awk getline):
<file awk '
/^category$/ {
    v = $0
    while(!/^done$/) {
        if(!getline)
            exit
        v = v ORS $0
    }
    if(++nr == n)
        print v
}' n=3
On one line:
<file awk '/^category$/ { v = $0; while(!/^done$/) { if(!getline) exit; v = v ORS $0 } if(++nr == n) print v }' n=3
