In bash, how to transform a multimap<K,V> into a map of <K, {V1,V2}>

I am processing output from a file in bash and need to group values by their keys.
For example, I have the following data
13,47099
13,54024
13,1
13,39956
13,0
17,126223
17,52782
17,4
17,62617
17,0
23,1022724
23,79958
23,80590
23,230
23,1
23,118224
23,0
23,1049
42,72470
42,80185
42,2
42,89199
42,0
54,70344
54,72824
54,1
54,62969
54,1
in a file, and I need to group all values for a particular key onto a single line, as in
13,47099,54024,1,39956,0
17,126223,52782,4,62617,0
23,1022724,79958,80590,230,1,118224,0,1049
42,72470,80185,2,89199,0
54,70344,72824,1,62969,1
There are about 10000 entries in my input file. How do I transform this data in shell?

awk to the rescue!
assuming keys are contiguous...
$ awk -F, 'p!=$1 {if(a) print a; a=p=$1}
{a=a FS $2}
END {print a}' file
13,47099,54024,1,39956,0
17,126223,52782,4,62617,0
23,1022724,79958,80590,230,1,118224,0,1049
42,72470,80185,2,89199,0
54,70344,72824,1,62969,1
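If the keys are not contiguous, a minimal sketch that accumulates into an array instead; the PROCINFO line is GNU-awk-only (it orders the keys numerically), and plain awk would print the groups in unspecified order:
$ awk -F, '{a[$1] = ($1 in a) ? a[$1] FS $2 : $2}
     END {PROCINFO["sorted_in"] = "@ind_num_asc"
          for (k in a) print k FS a[k]}' file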

Here is a breakdown of what @karakfa's code is doing, for us awk beginners. I've written this based on a toy dataset file:
1,X
1,Y
3,Z
p!=$1: check if the pattern p!=$1 is true
checks whether variable p differs from the first field of the current (first) line of file (1 in this case)
since p is undefined at this point it cannot be equal to 1, so p!=$1 is true and we continue with this line of code
if(a) print a: check if variable a is set and non-empty, and print a if it is
since a is undefined at this point the print a command is not executed
a=p=$1: set variables a and p equal to the value of the first field of the current (first) line (1 in this case)
a=a FS $2: set variable a equal to a combined with the value of the second field of the current (first) line separated by the field separator (1,X in this case)
END: since we haven't reached the end of file yet, we skip the rest of this line of code
move to the next (second) line of file and restart the awk code on that line
p!=$1: check if the pattern p!=$1 is true
since p is 1 and the first field of the current (second) line is 1, p!=$1 is false and we skip the rest of this line of code
a=a FS $2: set a equal to the value of a and the value of the second field of the current (second) line separated by the field separator (1,X,Y in this case)
END: since we haven't reached the end of file yet, we skip the rest of this line of code
move to the next (third) line of file and restart the awk code
p!=$1: check if the pattern p!=$1 is true
since p is 1 and $1 of the third line is 3, p!=$1 is true and we continue with this line of code
if(a) print a: check if variable a is set and non-empty, and print a if it is
since a is 1,X,Y at this point, 1,X,Y is printed to the output
a=p=$1: set variables a and p equal to the value of the first field of the current (third) line (3 in this case)
a=a FS $2: set variable a equal to a combined with the value of the second field of the current (third) line separated by the field separator (3,Z in this case)
END {print a}: since we have reached the end of file, execute this code
print a: print the last group a (3,Z in this case)
The resulting output is
1,X,Y
3,Z
Please let me know if there are any errors in this description.
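To reproduce the walkthrough end to end, the toy data can be piped straight into the one-liner:
$ printf '1,X\n1,Y\n3,Z\n' |
  awk -F, 'p!=$1 {if(a) print a; a=p=$1} {a=a FS $2} END {print a}'
1,X,Y
3,Z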

Slight tweak to @karakfa's answer. If you want the separator between the key and the values to be different from the separator between the values, you can use this code:
awk -F, 'p==$1 {a=a "; " $2} p!=$1 {if(a) print a; a=$0; p=$1} END {print a}'
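Run against the same toy dataset, the key and first value stay comma-separated while later values are joined with "; ":
$ printf '1,X\n1,Y\n3,Z\n' | awk -F, 'p==$1 {a=a "; " $2} p!=$1 {if(a) print a; a=$0; p=$1} END {print a}'
1,X; Y
3,Z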

Related

Print all lines between line containing a string and first blank line, starting with the line containing that string

I've tried awk:
awk -v RS="zuzu_mumu" '{print RS $0}' input_file > output_file
The obtained file is exactly the input_file, except that the first line in the file is now zuzu_mumu.
How can my command be corrected?
After solving this, I found the same string/pattern in another arrangement, so I need to save all the records that match it too, in an output file, following this rule:
if the pattern matches on a line, then look at the previous lines and print the first line that follows an empty line; print also the pattern-matching line and an empty line.
record 1
record 2
This is record 3 first line
info 1
info 2
This is one matched zuzu_mumu line
info 3
info 4
info 5
record 4
record 5
...
This is record n-1 first line
info a
This is one matched zuzu_mumu line
info b
info c
record n
...
I should obtain:
This is record 3 first line
This is one matched zuzu_mumu line
This is record n-1 first line
This is one matched zuzu_mumu line
I would use GNU AWK for this task. Let file.txt content be
Able
Baker
Charlie
Dog
Easy
Fox
then
awk 'index($0,"aker"){p=1}p{if(/^$/){exit};print}' file.txt
output
Baker
Charlie
Explanation: use the index string function, which gives either the position of aker in the whole line ($0) or 0, and treat this as the condition; in other words, it asks "is aker inside the line?". Note that using index rather than a regular expression means we do not have to care about characters with special meaning, for example the dot (.). If the condition holds, set p to 1. If p is set, then if the line is empty (it matches start of line followed by end of line) terminate processing (exit); otherwise print the whole line as is.
(tested in gawk 4.2.1)
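As a small illustration of that point about index being literal (a hypothetical one-liner): index looks for the three characters a.c, while the regexp a.c lets the dot match anything:
$ echo 'abc' | awk '{ print index($0, "a.c"), ($0 ~ /a.c/) }'
0 1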
If you don't want to match the same line again, you can record all lines in an array and print the valid lines in the END block.
awk '
BEGIN { entries = 1 }              # Array indexing starts at 1 (it would otherwise default to 0)
f && /zuzu_mumu/ {                 # If already found and found again
  delete ary; entries = 1; next    # Delete the array, reset entries and go to the next record
}
f || /zuzu_mumu/ {                 # If already found or the word of interest matches
  if (/^[[:blank:]]*$/) { exit }   # If the line contains only spaces, exit
  f = 1                            # Mark as found
  ary[entries++] = $0              # Add the current line to the array and increment the entry number
}
END {
  for (j = 1; j < entries; j++)    # Loop over and print the array values
    print ary[j]
}
' file
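For the follow-up rule (print the first line of the record that contains the match, then the matched line and an empty line), here is a minimal sketch. It assumes the records in the real input are separated by blank lines, which the sample as pasted may have lost:
awk '
BEGIN { expect = 1 }                    # the very first line of the file starts a record
/^[[:blank:]]*$/ { expect = 1; next }   # after a blank line, the next line is a record header
expect { header = $0; expect = 0 }      # remember the first line of the current record
/zuzu_mumu/ { print header; print; print "" }   # print header, matched line, empty line
' input_file > output_file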

Prepending letter to field value

I have a file 0.txt containing the following fields of values in parentheses:
(bread,milk,),
(rice,brand B,),
(pan,eggs,Brand C,),
I'm looking on SO and elsewhere for how to prepend the letter x to the beginning of each value between commas, so that my output file becomes (using bash on Unix):
(xbread,xmilk,),
(xrice,xbrand B,),
(xpan,xeggs,xBrand C,),
The only thing I've really tried, though it isn't enough, is:
awk '{gsub(/,/,",x");print}' 0.txt
For all intents and purposes, the prefix should not be applied after the last comma at the end of each line.
With awk
awk 'BEGIN{FS=OFS=","}{$1="(x"substr($1,2);for(i=2;i<=NF-2;i++){$i="x"$i}}1'
Explanation:
# Before you start, set the input and output delimiters
BEGIN {
  FS = OFS = ","
}
# The first field is special: the x has to be inserted
# after the opening (
$1 = "(x" substr($1, 2)
# Prepend 'x' to fields 2 through NF-2 (the last value field)
for (i = 2; i <= NF-2; i++) {
  $i = "x" $i
}
# 1 is always true; awk will print in that case
1
The trick is to anchor the regexp so that it matches the whole comma-terminated substring you want to work with, not just the comma, while excluding the other "special" characters in the syntax (the parentheses).
awk '{ gsub(/[^,()]+,/, "x&") } 1' 0.txt
sed -r 's/([^,()]+,)/x\1/g' 0.txt
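Run against the sample 0.txt, either command prints:
$ sed -r 's/([^,()]+,)/x\1/g' 0.txt
(xbread,xmilk,),
(xrice,xbrand B,),
(xpan,xeggs,xBrand C,),
The final comma of each line survives untouched because the character class requires at least one non-comma, non-parenthesis character before the comma.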

Remove duplicate variables except last occurrence in bash script file

I have the config file on local, whom I am appending some variables from different remote machines. The file content is as:
#!/bin/bash
name=bob
department=(Production)
name=alice
department=(R&D)
name=maggie
department=(Production R&D)
The latest values updated in the file are the last one. So the expected output in the config file should be:
#!/bin/bash
name=maggie
department=(Production R&D)
I want to remove the earlier name and department entries and keep only the latest ones, which come last. But this should happen only if the same variable occurs multiple times.
I referred to the following and tried it for my solution, but I am not getting the expected output:
https://backreference.org/2011/11/17/remove-duplicates-but-keeping-only-the-last-occurrence/
Would you please try the following:
tac file | awk '{ # print "file" reversing the line order: last line first
line = $0 # backup the line
sub(/#.*/, "") # remove comments (not sure if comment line exists)
if (match($0, /([[:alnum:]_]+)=/)) { # look like an assignment to a variable
varname = substr($0, RSTART, RLENGTH - 1)
# extract the variable name (-1 to remove "=")
if (! seen[varname]++) print line # print the line if the variable is seen for the first time
} else { # non-assignment line
print line
}
}' | tac # reverse the lines again
Output:
#!/bin/bash
name=maggie
department=(Production R&D)
Please note that the parser to extract variable names is a lousy one. You may need to tweak the code depending on the actual file.
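For instance, anchoring the match at the start of the line avoids treating something like echo name=bob as an assignment. A hedged tweak of the same pipeline:
tac file | awk '{
  line = $0                                    # backup the line
  sub(/#.*/, "")                               # remove comments
  # anchor at the line start so only real assignments are deduplicated
  if (match($0, /^[[:space:]]*[[:alnum:]_]+=/)) {
    varname = substr($0, RSTART, RLENGTH - 1)  # drop the trailing "="
    sub(/^[[:space:]]*/, "", varname)          # strip leading blanks from the name
    if (! seen[varname]++) print line
  } else {
    print line
  }
}' | tac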

Search a CSV file for a value in the first column, if found shift the value of second column one row down

I have CSV files that look like this:
786,1702
787,1722
-,1724
788,1769
789,1766
I would like to have a bash command that searches the first column for - and, if it is found, shifts the values in the second column down. The - recurs several times in the first column, and the scan would need to start from the top to preserve the order of the second column.
The second column of the - row would then be blank.
Desired output:
786,1702
787,1722
-,
788,1724
789,1769
790,1766
So far I have: awk -F ',' '$1 ~ /^-$/' filename.csv to find the hyphens, but shifting the 2nd column down is tricky...
This assumes that the left column continues with incremental IDs, so the right column keeps shifting until the stack is empty.
awk 'BEGIN{start=0;FS=","}$1=="-"{stack[stacklen++]=$2;print $1",";next}stacklen-start{stack[stacklen++]=$2;print $1","stack[start];delete stack[start++];next}1;END{for (i=start;i<stacklen;i++){print $1-start+i+1","stack[i]}}' filename.csv
# or
<filename.csv awk -F, -v start=0 '$1=="-"{stack[stacklen++]=$2;print $1",";next}stacklen-start{stack[stacklen++]=$2;print $1","stack[start];delete stack[start++];next}1;END{for (i=start;i<stacklen;i++){print $1-start+i+1","stack[i]}}'
Or, explained:
I am here using a shifted stack to avoid rewriting indexes. With start as the pointer to the first useful element of the stack, and stacklen as the last element. This avoids the costly operation of shifting all array elements whenever we want to remove the first element.
# chmod +x shift_when_dash
./shift_when_dash filename.csv
with shift_when_dash being an executable file containing:
#!/usr/bin/awk -f
BEGIN { # Everything in this block is executed once before opening the file
start = 0 # Needed because we are using it in a scalar context before initialization
FS = "," # Input field separator is a comma
}
$1 == "-" { # We match the special case where the first column is a simple dash
stack[stacklen++] = $2 # We store the second column on top of our stack
print $1 "," # We print the dash without a second column as asked by OP
next # We stop processing the current record and go on to the next one
}
stacklen - start { # In case we still have something in our stack
stack[stacklen++] = $2 # We store the current 2nd column on the stack
print $1 "," stack[start] # We print the current ID with the first stacked element
delete stack[start++] # Free up some memory and increment our pointer
next
}
1 # We print the line as-is, without any modification.
# This applies to lines which were not skipped by the
# 'next' statements above, so in our case all lines before
# the first dash is encountered.
END {
for (i=start;i<stacklen;i++) { # For every element remaining in the stack after the last line
print $1-start+i+1 "," stack[i] # We print a new incremental id with the stack element
}
}
next is an awk statement similar to continue in other languages, with the difference that it skips to the next input line instead of the next loop element. It is useful to emulate a switch-case.
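Running the executable version against the sample confirms the desired output:
$ ./shift_when_dash filename.csv
786,1702
787,1722
-,
788,1724
789,1769
790,1766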

Extract each line followed by a line with a different value in column two

Given the following file structure,
9.975 1.49000000 0.295 0 0.4880 0.4929 0.5113 0.5245 2.016726 1.0472 -30.7449 1
9.975 1.49000000 0.295 1 0.4870 0.5056 0.5188 0.5045 2.015859 1.0442 -30.7653 1
9.975 1.50000000 0.295 0 0.5145 0.4984 0.4873 0.5019 2.002143 1.0854 -30.3044 2
is there a way to extract each line in which the value in column two is not equal to the value in column two in the following line?
I.e. from these three lines I would like to extract the second one, since 1.49 is not equal to 1.50.
Maybe with sed or awk?
This is how I do this in MATLAB:
myline = 1;
mynewline = 1;
while myline < length(myfile)
    if myfile(myline,2) ~= myfile(myline+1,2)
        mynewfile(mynewline,:) = myfile(myline,:);
        mynewline = mynewline + 1;
        myline = myline + 1;
    else
        myline = myline + 1;
    end
end
However, my files are so large now that I would prefer to carry out this extraction in terminal before transferring them to my laptop.
Awk should do.
<data awk 'NR > 1 && $2 != prev {print line} {line = $0; prev = $2}'
A brief intro to awk: an awk program consists of a set of condition {code} blocks. It operates line by line. When no condition is given, the block is executed for every line; a BEGIN condition is executed before the first line. Each line is split into fields, which are accessible as $1, $2, and so on. The full line is in $0.
Here I compare the second field to the previous value and, if it does not match, print the whole previous line. In all cases I store the current line into line and the second field into prev. The NR > 1 guard keeps the very first record from printing an empty line while prev is still unset.
And if you really want it right, be careful with float comparisons - you want something like abs($2 - prev) < eps (there is no abs in awk, you need to define it yourself, and eps is some small enough number). awk compares two fields numerically when both look like numbers; if your values are formatted consistently, plain string comparison is also safe.
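A minimal sketch of that epsilon comparison (eps is an arbitrary tolerance; abs is defined by hand since awk has no built-in one):
<data awk -v eps=1e-9 '
function abs(x) { return x < 0 ? -x : x }      # awk has no built-in abs
NR > 1 && abs($2 - prev) > eps { print line }  # column two changed: print the previous line
{ line = $0; prev = $2 }'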
This might work for you (GNU sed):
sed -r 'N;/^((\S+)\s+){2}.*\n\S+\s+\2/!P;D' file
Read two lines at a time. Pattern match on the first two columns and only print the first line when the second column does not match.
Try following command:
awk '$2 != field && field { print line } { field = $2; line = $0 }' infile
It saves the previous line and second field, comparing them with the current line's values on the next cycle. The && field check is useful to avoid a blank line at the beginning of the file, where $2 != field would match because the variable is still empty.
It yields:
9.975 1.49000000 0.295 1 0.4870 0.5056 0.5188 0.5045 2.015859 1.0442 -30.7653 1
