Combining multiple lines into a single line based on column values [duplicate] - shell

This question already has answers here:
Merge values for same key
(4 answers)
Closed 4 years ago.
I have a file with the records below.
File.txt:
APPLE,A,10
APPLE,A,20
APPLE,A,30
GRAPE,B,12
GRAPE,B,13
I want the output to be as given below:
APPLE,A,10|20|30,
GRAPE,B,12|13,
I have tried the method below and got the required output, but I'm looking for something simpler.
awk -F"," '{if(NR<2){if(!seen[$1]++){printf "%-8s|",$3}}else{if(seen[$1]++){printf "%-12s|",$3}else{ printf ",\n%-12s|",$3}}}' File1.txt | awk -F"|" '{for(i=1;i<NF-1;i++){ printf "%-12s|",$i}printf "%-12s,\n", $(NF-1)}'|sed 's/ //g' > O1.txt
awk -F"," '{print $1","$2","}' File1.txt | uniq > O2.txt
paste -d'\0' O2.txt O1.txt

something like this?
$ awk -F, '{k=$1 FS $2; a[k]=((k in a)?a[k]"|":k FS)$3}
END {for(k in a) print a[k] FS}' file
APPLE,A,10|20|30,
GRAPE,B,12|13,
To remove the trailing comma, drop the FS in the print statement. If your file is already sorted, this can be simplified further.
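For example, a minimal sketch for already-sorted input (same key of $1 and $2, no trailing comma, and input order preserved):
awk -F, '{k=$1 FS $2}
k!=prev {if (NR>1) print out; prev=k; out=k FS $3; next}
{out=out "|" $3}
END {if (NR) print out}' file
APPLE,A,10|20|30
GRAPE,B,12|13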

You need something like the below with just standalone awk:
awk -F, 'BEGIN { OFS = FS }{ key = $1","$2 }{ unique[key] = unique[key]?(unique[key]"|"$3):($3) }
END { for (i in unique) print i, unique[i] }' file
If you think you need the extra , at the end, just append "," in the END clause after printing the elements from the array.
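For instance, a minimal sketch of that change (everything else in the command above stays the same), giving APPLE,A,10|20|30, style output:
END { for (i in unique) print i, unique[i] "," }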

Related

awk: comparing two files containing numbers

I'm using this command to compare two files and print out lines in which $1 is different:
awk -F, 'NR==FNR {exclude[$1];next} !($1 in exclude)' old.list new.list > changes.list
The files I'm working with have been sorted numerically with sort -n.
old.list:
30606,10,57561
30607,100,26540
30611,300,35,5.068
30612,100,211,0.035
30613,200,5479,0.005
30616,100,2,15.118
30618,0,1257,0.009
30620,14,8729,0.021
new.list:
30606,10,57561
30607,100,26540
30611,300,35,5.068
30612,100,211,0.035
30613,200,5479,0.005
30615,50,874,00.2
30616,100,2,15.118
30618,0,1257,0.009
30620,14,8729,0.021
30690,10,87,0.021
30800,20,97,1.021
Result:
30615,50,874,00.2
30690,10,87,0.021
30800,20,97,1.021
I'm looking for a way to tweak my command and make awk print lines only if $1 from new.list is not only unique but also > $1 from the last line of old.list.
Expected result:
30690,10,87,0.021
30800,20,97,1.021
because 30690 and 30800 ($1) > 30620 ($1 from the last line of old.list)
In this case, 30615,50,874,00.2 would not be printed because 30615 is admittedly unique to new.list, but it's also < 30620 ($1 from the last line of old.list).
awk -F, '{if ($1 #from new.list > $1 #from_the_last_line_of_old.list) print }'
something like that, but I'm not sure it can be done this way?
Thank you
You can use the awk you have, but then pipe through sort to sort numerically high to low, then pipe to head to get the first:
awk -F, 'FNR==NR{seen[$1]; next} !($1 in seen)' old new | sort -nr | head -n1
30690,10,87,0.021
Or, find the max in awk and use an END block to print:
awk -F, 'FNR==NR{seen[$1]; next}
(!($1 in seen)) {uniq[$1]=$0; max= $1>max ? $1 : max}
END {print uniq[max]}' old new
30690,10,87,0.021
After a cup of coffee and reading your edit, just do this:
awk -F, 'FNR==NR{ref=$1; next} $1>ref' old new
30690,10,87,0.021
30800,20,97,1.021
Since you are only interested in the values greater than the last line of old, there is no need to even look at the other lines of that file.
Just read through the first file and grab the last $1 (since it is already sorted), then compare against $1 in the new file. If old is not sorted, or you simply don't want to rely on it being sorted, you can do:
FNR==NR{ref=$1>ref ? $1 : ref; next}
If you need to deduplicate the values in new, you can do that as part of the sort step you are already doing:
sort -t, -k 1,1 -n -u new
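Putting those pieces together, a hedged sketch that relies neither on old being sorted nor on new being deduplicated (assuming GNU sort and bash process substitution):
awk -F, 'FNR==NR {ref = ($1>ref ? $1 : ref); next} $1>ref' old <(sort -t, -k 1,1 -n -u new)
30690,10,87,0.021
30800,20,97,1.021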
Single-pass awk solution:
mawk 'BEGIN { ___ = log(!(_^= FS = ",")) # set def. value to -inf
} NR==FNR ? __[___=$_] : ($_ in __)<(+___<+$_)' old.txt new.txt
30690,10,87,0.021
30800,20,97,1.021
Since both files are sorted, this command should be more efficient than the other solutions here:
awk -F, 'NR==FNR{x=$1}; $1>x{x=$1; print}' <(tail -n1 old) new
It reads only one line from old
It prints only lines where new.$1 > old[last].$1
It prints only lines with unique $1

AWK script to sort the list in alphabetical order of usernames in getent passwd [duplicate]

This question already has answers here:
How to use awk sort by column 3
(10 answers)
Closed 3 years ago.
I am very new to awk and I am trying to sort on the fifth field, the one that contains names like "Zack E", in alphabetical order. I typed getent passwd in my terminal, which gives lines like:
zack:x:115:120:Zack E:/home/zack:/var/run/bin/bash/false
hp:x:118:7:HPLIP system user:/var/run/hplip:/bin/false
armvad:x:3:3:Ezikon Armvad:/dev:/usr/sbin/nologin
bruh:x:1542:1546:Burh RG:/home/banner:/bin/bash
I also tried this approach:
for i in `sed -e 's/.* \(\d\)*/\1/' passwd.in | sort`; do grep $i passwd.in; done > file_sort.txt
But all it did was sort on the first field instead of the desired one. How can I combine my approach with an awk script? All I know how to do in awk is BEGIN {FS = ":"} $5 == 1000 {print $1}.
Is there any way I can rearrange the tokens so sort can work, and then put them back in order?
Why 'awk' when 'sort' may do the job elegantly?
sort -t : -k5 myFile.txt
or, reversely
sort -r -t : -k5 myFile.txt
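If you do want the rearrange-and-restore approach the question asks about, here is a hedged decorate-sort-undecorate sketch (assuming the fifth field never contains a colon, which holds for passwd format):
awk -F: '{ print $5 FS $0 }' myFile.txt | sort -t: -k1,1 | cut -d: -f2-
The awk prepends $5 as a sort key, sort orders on it, and cut strips the key off again.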
Using GNU awk and its array scanning order (PROCINFO["sorted_in"]) to sort at output time, when $5 is unique for each record:
$ awk -F: '{a[$5]=$0}END{PROCINFO["sorted_in"]="@ind_str_asc";for(i in a)print a[i]}' file
Output:
bruh:x:1542:1546:Burh RG:/home/banner:/bin/bash
armvad:x:3:3:Ezikon Armvad:/dev:/usr/sbin/nologin
hp:x:118:7:HPLIP system user:/var/run/hplip:/bin/false
zack:x:115:120:Zack E:/home/zack:/var/run/bin/bash/false
Explained:
$ awk -F: '{ # set field delimiter
a[$5]=$0 # hash records, use $5 as key
}
END {
PROCINFO["sorted_in"]="@ind_str_asc" # set the for-loop traversal order
for(i in a) # use it
print a[i] # output
}' file
Since the hash key is $5, it needs to be unique to avoid collisions. If the $5 values are not unique but the whole records are (as in this case), you could:
$ awk -F: '{
a[$0]=$5 # changed
}
END {
PROCINFO["sorted_in"]="@val_str_asc" # changed
for(i in a)
print i # changed
}' file
Then there are of course the functions asort and asorti that sort arrays, whereas the above examples sorted at output time. For example, if we expect $5 not to be unique:
$ awk -F: '{
a[$0]=$5 # record is unique, $5 is not expected to be
}
END {
n=asorti(a,b,"@val_str_asc") # asorti to preserve $0 in b but order by a[] value
for(i=1;i<=n;i++) # sorted to b, index rewritten from 1..n
print i, b[i]
}' file
Output now:
1 bruh:x:1542:1546:Burh RG:/home/banner:/bin/bash
2 armvad:x:3:3:Ezikon Armvad:/dev:/usr/sbin/nologin
3 hp:x:118:7:HPLIP system user:/var/run/hplip:/bin/false
4 zack:x:115:120:Zack E:/home/zack:/var/run/bin/bash/false
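An asort-based variant is also possible; a hedged sketch, assuming the fifth field never contains a colon so the sort key can be stripped off again:
$ awk -F: '{
a[NR] = $5 FS $0 # decorate each record with its $5 sort key
}
END {
n = asort(a) # sort the array values in place
for (i=1; i<=n; i++) {
sub(/^[^:]*:/, "", a[i]) # strip the decoration again
print a[i]
}
}' file
This prints the same four records in the same order as the first example, just without the numbering.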

Multiple Big file sort

I have two files where each line is ordered by timestamp, but the files have different structures. I want to merge these files into one single file ordered by timestamp. They look like:
file A (less than 2G)
1,1,1487779199850
2,2,1487779199852
3,3,1487779199854
4,4,1487779199856
5,5,1487779199858
file B (less than 15G)
1,1,10,100,1487779199850
2,2,20,200,1487779199852
3,3,30,300,1487779199854
4,4,40,400,1487779199856
5,5,50,500,1487779199858
How can I accomplish this? Is there any way to make it as fast as possible?
$ awk -F, -v OFS='\t' '{print $NF, $0}' fileA fileB | sort -s -n -k1,1 | cut -f2-
1,1,1487779199850
1,1,10,100,1487779199850
2,2,1487779199852
2,2,20,200,1487779199852
3,3,1487779199854
3,3,30,300,1487779199854
4,4,1487779199856
4,4,40,400,1487779199856
5,5,1487779199858
5,5,50,500,1487779199858
I originally posted the above as just a comment under @VM17's answer but (s)he suggested I make it a new answer.
The above would be more robust and efficient since it uses the default separator for sort+cut (tab), truly sorts only on the first key (his would use the whole line despite the -k1, since sort's field separator, tab, isn't present in the line), uses a stable sort algorithm (sort -s) to preserve input order, and uses cut to strip off the added key field, which is more efficient than invoking awk again since awk does field splitting etc. on each record, and that isn't needed just to remove the leading field(s).
Alternatively you might find something like this more efficient:
$ cat tst.awk
BEGIN { FS = "," }   # split on commas so $NF is each line's timestamp
{ currRec = $0; currKey = $NF }
NR>1 {
print prevRec
printf "%s", saved
while ( (getline < "fileB") > 0 ) {
if ($NF < currKey) {
print
}
else {
saved = $0 ORS
break
}
}
}
{ prevRec = currRec; prevKey = currKey }
END {
print prevRec
printf "%s", saved
while ( (getline < "fileB") > 0 ) {
print
}
}
$ awk -f tst.awk fileA
1,1,1487779199850
1,1,10,100,1487779199850
2,2,1487779199852
2,2,20,200,1487779199852
3,3,1487779199854
3,3,30,300,1487779199854
4,4,1487779199856
4,4,40,400,1487779199856
5,5,1487779199858
5,5,50,500,1487779199858
As you can see, it reads from fileB between reads of lines from fileA, comparing timestamps, so it interleaves the 2 files and doesn't require a subsequent pipe to sort and cut.
Just check the logic, as I didn't think about it very much, and be aware that this is a rare situation where getline might be appropriate for efficiency, but make sure to read http://awk.freeshell.org/AllAboutGetline to understand all its caveats if you're ever considering using it again.
Try this:
awk -F, '{print $NF, $0}' fileA fileB | sort -nk 1 | awk '{print $2}'
Output:
1,1,10,100,1487779199850
1,1,1487779199850
2,2,1487779199852
2,2,20,200,1487779199852
3,3,1487779199854
3,3,30,300,1487779199854
4,4,1487779199856
4,4,40,400,1487779199856
5,5,1487779199858
5,5,50,500,1487779199858
This concatenates the two files and then puts the timestamp at the start of each line. It then sorts according to the timestamp and removes that dummy column.
This will be slow for big files though.
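Since each input is already sorted by its timestamp, a hedged alternative (assuming GNU sort and bash process substitution) is to decorate the two files separately and let sort -m merge them rather than re-sort everything:
sort -m -s -n -k1,1 <(awk -F, -v OFS='\t' '{print $NF, $0}' fileA) <(awk -F, -v OFS='\t' '{print $NF, $0}' fileB) | cut -f2-
sort -m only merges already-sorted inputs, so it avoids the cost of a full sort on the 15G file.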

How to print a range of columns in a CSV in AWK? [duplicate]

This question already has answers here:
Extract specific columns from delimited file using Awk
(8 answers)
Closed 4 years ago.
With awk, I can print any column within a CSV, e.g., this will print the 10th column in file.csv.
awk -F, '{ print $10 }' file.csv
If I need to print columns 5-10, including the comma, I only know this way:
awk -F, '{ print $5","$6","$7","$8","$9","$10 }' file.csv
This method is not so good if I want to print many columns. Is there a simpler syntax for printing a range of columns in a CSV in awk?
The standard way to do this in awk is using a for loop:
awk -v s=5 -v e=10 'BEGIN{FS=OFS=","}{for (i=s; i<=e; ++i) printf "%s%s", $i, (i<e?OFS:ORS)}' file
However, if your delimiter is simple (as in your example), you may prefer to use cut:
cut -d, -f5-10 file
Perl deserves a mention (using -a to enable autosplit mode):
perl -F, -lane '$"=","; print "@F[4..9]"' file
You can use a loop in awk to print columns from 5 to 10:
awk -F, '{ for (i=5; i<=10; i++) print $i }' file.csv
Keep in mind that print will print each column on a new line. If you want to print them on the same line separated by OFS, then use:
awk -F, -v OFS=, '{ for (i=5; i<=10; i++) printf("%s%s", $i, (i<10 ? OFS : ORS)) }' file.csv
With GNU awk for gensub():
$ cat file
a,b,c,d,e,f,g,h,i,j,k,l,m
$
$ awk -v s=5 -v n=6 '{ print gensub("(([^,]+,){"s-1"})(([^,]+,){"n-1"}[^,]+).*","\\3","") }' file
e,f,g,h,i,j
s is the start position and n is the number of fields to print from that point on. Or if you prefer to specify start and end:
$ awk -v s=5 -v e=10 '{ print gensub("(([^,]+,){"s-1"})(([^,]+,){"e-s"}[^,]+).*","\\3","") }' file
e,f,g,h,i,j
Note that this will only work with single-character field separators since it relies on being able to negate the FS in a character class.

creating a ":" delimited list in bash script using awk

I have the following lines:
380:<CHECKSUM_VALIDATION>
393:</CHECKSUM_VALIDATION>
437:<CHECKSUM_VALIDATION>
441:</CHECKSUM_VALIDATION>
I need to format it as below:
CHECKSUM_VALIDATION:380:393
CHECKSUM_VALIDATION:437:441
Is it possible to achieve the above output using awk? [I'm using bash]
Thank you!
Here you go:
awk -F '[:<>/]+' '{ n = $1; getline; print $2 ":" n ":" $1 }'
Explanation:
Set the field separator with -F to be a sequence of a mix of :<>/ characters, this way the first field will be the number, and the second will be CHECKSUM_VALIDATION
Save the first field in variable n and read the next line (which would overwrite $1)
Print the line: a combination of the number from the previous line, and the fields on the current line
Another approach without using getline:
awk -F '[:<>/]+' 'NR % 2 { n = $1 } NR % 2 == 0 { print $2 ":" n ":" $1 }'
This one uses the record counter NR to determine whether it's time to print: if NR is odd, save the first field in n, if NR is even, then print.
You can try this sed:
sed 'N; s/\([0-9]\+\):<\(.*\)>\n\([0-9]\+\):<\(.*\)>/\2:\1:\3/' file.txt
Test:
sat:~$ sed 'N; s/\([0-9]\+\):<\(.*\)>\n\([0-9]\+\):<\(.*\)>/\2:\1:\3/' file.txt
CHECKSUM_VALIDATION:380:393
CHECKSUM_VALIDATION:437:441
Another way:
awk -F: '/<C/ {printf "CHECKSUM_VALIDATION:%d:",$1; next} {print $1}'
Here is one way with GNU awk:
awk -F"[:\n<>]" 'NR==1{print $3,$1,$5;f=$3;next} $3{print f,$3,$7}' OFS=":" RS="</CH" file
CHECKSUM_VALIDATION:380:393
CHECKSUM_VALIDATION:437:441
Based on Jonas's post and avoiding getline, this awk should do:
awk -F '[:<>/]+' '/<C/ {f=$1;next} { print $2,f,$1}' OFS=\: file
CHECKSUM_VALIDATION:380:393
CHECKSUM_VALIDATION:437:441
