Best way to remove multiple entries from kml file - bash

I have a very large KML file (over 20000 placemarkers). They are named by numbers which go up in increments of 5 starting at about 7000 up to 27000.
<Placemark>
<name>7750</name>
<description><![CDATA[converted by:</br>GridReferenceFinder.com</br>]]></description>
<Point>
<coordinates>-0.99153654,52.225002,0</coordinates>
</Point>
</Placemark>
I would like to remove any placemarker that doesn't end in 00 or 50. Having a placemarker every 5 metres is slowing down some of the lower-end devices on site.
Is there some script, command or whatever that will check the name and if it doesn't end in 00 or 50 delete from <Placemark> to </Placemark> for that entry?
You would literally be saving me 10 hours work deleting them individually.

A Perl one-liner solution!
I would like to remove any placemarker that doesn't end in 00 or 50
First of all, a solution for this part; match anything except names that end with 00 or 50:
^(?:[7-9]\d|[1-2]\d\d)(?!00|50)\d\d$
A quick test (note that this one uses the positive lookahead (?=00|50), so it prints the names that would be kept):
perl -le 'print for grep{ /^(?:[7-9]\d|[1-2]\d\d)(?=00|50)\d\d$/ } 7000..27000'
then read the entire file once:
$/=undef;
then read all matches with a while loop:
while/<Placemark>\s*?<name>(?:[7-9]\d|[1-2]\d\d)(?!00|50)\d\d.*?<\/Placemark>/sg
The s flag lets . match newlines (so the data is treated as a single string), and g makes the search global.
then print the match ($&):
perl -lne '$/=undef;print $& while/<Placemark>\s*?<name>(?:[7-9]\d|[1-2]\d\d)(?!00|50)\d\d.*?<\/Placemark>/sg' file
pattern for match:
<Placemark>\s*?<name>(?:[7-9]\d|[1-2]\d\d)(?!00|50)\d\d.*?<\/Placemark>
NOTE:
Notice the (?!00|50) part: it is a negative lookahead that excludes matches. By flipping it to a positive lookahead you get the opposite, that is:
^(?:[7-9]\d|[1-2]\d\d)(?=00|50)\d\d$
which only matches names that end with 00 or 50.
So you can switch between what you want to keep and what you want to remove.
Print all placemarks whose name does not end with 00 or 50:
perl -lne '$/=undef;print $& while/<Placemark>\s*?<name>(?:[7-9]\d|[1-2]\d\d)(?!00|50)\d\d.*?<\/Placemark>/sg' file
Print all placemarks whose name ends with 00 or 50:
perl -lne '$/=undef;print $& while/<Placemark>\s*?<name>(?:[7-9]\d|[1-2]\d\d)(?=00|50)\d\d.*?<\/Placemark>/sg' file
How to Substitute
If you like, you can use the substitution operator: s/regex-match/replacement-string/
perl -pe '$/=undef;s/<Placemark>\s*?<name>(?:[7-9]\d|[1-2]\d\d)(?!00|50)\d\d.*?<\/Placemark>/==>DELETE<==/sg' file
test:
input:
before...
<Placemark>
<name>7700</name>
<description><![CDATA[converted by:</br>GridReferenceFinder.com</br>]]></description>
<Point>
<coordinates>-0.99153654,52.225002,0</coordinates>
</Point>
</Placemark>
after...
---------
before...
<Placemark>
<name>7701</name>
<description><![CDATA[converted by:</br>GridReferenceFinder.com</br>]]></description>
<Point>
<coordinates>-0.99153654,52.225002,0</coordinates>
</Point>
</Placemark>
after...
--------
before...
<Placemark>
<name>27650</name>
<description><![CDATA[converted by:</br>GridReferenceFinder.com</br>]]></description>
<Point>
<coordinates>-0.99153654,52.225002,0</coordinates>
</Point>
</Placemark>
after...
--------
before...
<Placemark>
<name>27651</name>
<description><![CDATA[converted by:</br>GridReferenceFinder.com</br>]]></description>
<Point>
<coordinates>-0.99153654,52.225002,0</coordinates>
</Point>
</Placemark>
after...
end.
the output:
before...
<Placemark>
<name>7700</name>
<description><![CDATA[converted by:</br>GridReferenceFinder.com</br>]]></description>
<Point>
<coordinates>-0.99153654,52.225002,0</coordinates>
</Point>
</Placemark>
after...
---------
before...
==>DELETE<==
after...
--------
before...
<Placemark>
<name>27650</name>
<description><![CDATA[converted by:</br>GridReferenceFinder.com</br>]]></description>
<Point>
<coordinates>-0.99153654,52.225002,0</coordinates>
</Point>
</Placemark>
after...
--------
before...
==>DELETE<==
after...
end.
NOTE.2:
You can use -i for in-place editing:
perl -i.bak -pe ' ... the rest of the script ...' file
It is better to use Perl 5.22 or a newer version.
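Putting the pieces together for the actual deletion (a minimal sketch rather than the answer's exact command: it assumes the file is called doc.kml and uses -0777 to slurp the whole file instead of setting $/ inside the loop; the trailing \s* only mops up the newline left behind by each removed block):
perl -0777 -i.bak -pe 's/<Placemark>\s*?<name>(?:[7-9]\d|[1-2]\d\d)(?!00|50)\d\d.*?<\/Placemark>\s*//sg' doc.kml   # doc.kml is an assumed filename; the original is kept as doc.kml.bak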

Something like this in awk:
$ awk '
/<Placemark>/ { d=""; b="" } # d is delete, b is buffer, reset both
{ b=b $0 (/<\/Placemark>/?"":ORS) } # gather data to b
!/[50]0</ && /<\/name>/ { d=1 } # if not 50 or 00 set del flag
/<\/Placemark>/ && d!=1 { print b } # print b if not marked delete
' file
It only works with well-formed input, specifically:
...
</Placemark>
<Placemark>
...
<name>1234</name>
...
not:
...
</Placemark><Placemark>
... <!-- or: -->
<name>1234
</name>
...
It's an ugly hack. Someone will probably set you up with something nicer but try it if it suits your needs.
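One more thing to watch for (a hedged tweak, not part of the answer above): only buffered placemarks are printed, so lines outside any <Placemark>, such as the <?xml ...>, <kml> and <Document> header/footer lines, are dropped. A sketch that passes them through, assuming the input is doc.kml:
$ awk '
/<Placemark>/           { p=1; d=""; b="" }                   # entering a placemark: reset flags and buffer
p                       { b=b $0 (/<\/Placemark>/?"":ORS) }   # buffer placemark lines
!p                      { print }                             # pass non-placemark lines straight through
!/[50]0</ && /<\/name>/ { d=1 }                               # name does not end in 00/50: mark for deletion
/<\/Placemark>/         { if (d!=1) print b; p=0 }            # end of placemark: print it if kept
' doc.kml > doc.trimmed.kml                                   # doc.kml and doc.trimmed.kml are assumed names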

awk '
$0 == "<Placemark>" { cnt = cnt + 1 }                # new <Placemark>: bump the element counter
{
    arry[cnt] = arry[cnt] $0 "\n"                    # append every line to the current element
    if ($0 ~ /<name>/) {
        match($0, /[[:digit:]]+/)                    # locate the numeric name
        num = substr($0, RSTART, RLENGTH)
        numbs[num] = cnt                             # map the number to its element counter
    }
}
END {
    for (i in numbs)
        if (substr(i, length(i)-1, length(i)) == "00" || substr(i, length(i)-1, length(i)) == "50")
            print arry[numbs[i]]                     # print only elements whose name ends in 00 or 50
}' filename
An alternate awk solution is shown above. We first increment a counter each time $0 equals <Placemark>. Then, for each Placemark element in the file, we append every line of that element to the array arry under that counter. As we do this we also watch for the <name> line; when we find it, we extract the number with the match function and use it to populate another array, numbs, which maps each number to the counter of its Placemark element. Finally we loop through every element of numbs, checking whether the number ends in 50 or 00. If it does, the corresponding arry entry is printed.

Related

How to replace all occurrences of a string pattern with a number matching the string, depending on the order the strings were found

I need a bash script that searches for any string inside <>. If it finds one that it hasn't found before, it should replace it with the current value of an index counter (0 at the beginning) and increment the counter. If it finds a string inside <> that it already knows, it should look up the index of that string and replace it with the index. This should be done across multiple files, meaning the counter does not reset when multiple files are searched for the patterns; it only resets at program startup.
file_a.txt:
<abc>
<b>
<c>
<c>
<abc>
file_b.txt:
<c>
<b>
Should become
file_a.txt:
0
1
2
2
0
file_b.txt:
2
1
What I got so far:
names=()
for file in folder/*.txt
do
name=$(sed 's/\<[a-zA-Z]*\> /\1 /' file)
for i in "${names[#]}"
do
if [ "$i" -eq "$name" ]
then
#replace string with index of string in array
else
names+=("$name")
fi
done
done
Edit:
What I did not mention in order to simplify the problem is that the patterns that should be replaced are not the only text inside the files, meaning the files look like this:
file_a.txt:
123abc<abc>xyz
efg
<b>ah
a<c>
<c>b
c<abc>
file_b.txt:
xyz<c>xyz
xyz<b>xyz
Should become
file_a.txt:
123abc0xyz
efg
1ah
a2
2b
c0
file_b.txt:
xyz2xyz
xyz1xyz
Because the files can be quite big, they should not be copied, only edited. This should be done for all files inside a folder and for files in subfolders.
You may try this awk script:
mkdir -p tmp
awk 'match($0, /<[^>]+>/) {
    k = substr($0, RSTART, RLENGTH)
    if (!(k in freq))
        freq[k] = n++
    $0 = substr($0, 1, RSTART-1) freq[k] substr($0, RSTART+RLENGTH)
}
{
    print $0 > ("tmp/" FILENAME)
}' file_{a,b}.txt
Modified files will be saved in the tmp/ directory and you can move them back after examining their content.
cat tmp/file_a.txt
123abc0xyz
efg
1ah
a2
2b
c0
cat tmp/file_b.txt
xyz2xyz
xyz1xyz
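The question also asks for all .txt files inside a folder and its subfolders, edited rather than copied. A sketch of one way to do that (assuming GNU awk 4.1+ for -i inplace, GNU find/xargs, and the folder name folder from the question; note that, like the script above, it only replaces the first <...> on each line, and the counter stays consistent only as long as xargs starts a single gawk process):
find folder -type f -name '*.txt' -print0 |
    xargs -0 gawk -i inplace 'match($0, /<[^>]+>/) {
        k = substr($0, RSTART, RLENGTH)          # first <...> token on the line
        if (!(k in freq))
            freq[k] = n++                        # new token: assign the next index
        $0 = substr($0, 1, RSTART-1) freq[k] substr($0, RSTART+RLENGTH)
    }
    { print }'                                   # with -i inplace, print rewrites each file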

bash : to keep all values > 3500 with sed

I have a question concerning the sed command: how do I keep all values > 3500 in a field?
this is my problem:
I have this as output (from a .csv file):
String1;Val1;String2;Val2
I would like to keep only the lines where Val1 > 3500 and Val2 >= 60,00 (and <= 99,99),
so I tried this:
`sed -nr 's/^(.*);
([^([0-9]|[1-9][0-9]|[1-9][0-9]{2}|[1-2][0-9]{3}|3[0-4][0-9]{2}|3500)]);
(.*);
([6-9][0-9],[0-9]*)$
/Dans la ville de \1, \2 votants avec un pourcentage de \4 pour \3/p'
`
but I get this error:
`sed -e expression #1, char 174: Unmatched ) or \)`
I think the problem comes from the search on the second field:
I match all numbers <= 3500 and then negate those tests.
Do you have an idea how I should proceed?
Thanks.
(And sorry for my terrible English.)
Awk is the right way to go in such cases:
awk 'BEGIN{ FS=OFS=";" }$2 > 3500 && ($4 >= 60.00 && $4 <= 99.99)' file
The parsing error is in [^([0-9]|[1-9][0-9]|[1-9][0-9]{2}|[1-2][0-9]{3}|3[0-4]. I'm not entirely sure where exactly, but that doesn't matter since there is an error in your approach:
(Inverted) character classes [^...] do not work on full strings. [^ab|xy] matches all single characters that are not a, b, |, x, or y.
If you want to say »all strings except 0, 1, 2, ..., 3500« you have to use something different, probably a positive formulation like »all strings from 3500, 3501, ...«.
The following regex should work for numbers >= 3500.
0*([1-9][0-9]{4,}|[4-9][0-9]{3}|3[5-9][0-9]{2})
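If you prefer to stay with sed, here is a sketch of how that regex could slot into your filter. It assumes a sed that supports -E (GNU or BSD), that ; is the only separator with no quoting, and that Val2 always has exactly two digits after the comma; note that it keeps 3500 itself since the regex matches >= 3500:
sed -nE '/^[^;]*;0*([1-9][0-9]{4,}|[4-9][0-9]{3}|3[5-9][0-9]{2});[^;]*;[6-9][0-9],[0-9]{2}$/p' file   # prints lines with Val1 >= 3500 and 60,00 <= Val2 <= 99,99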

Using sed to go to a specific line, change pattern then print all between line and another pattern

So I need to change a specific line in a big textfile by something found one line before. What the text looks like:
Nom: some text
Société: some text
Adresse: some text and numb3rs Code Postal: [0-9][0-9][0-9][0-9][0-9] SOME TEXT
Tél. :
numbers
Fax :
numbers
"----------------------"
What I've found so far is (I believe I'm almost done):
K=0
while [ $K -lt 11519 ]; do
let K=K+1
L=`head -n $K file_that_contains_line_numbers_I_want.txt | tail -1`
M=`expr $L - 2`
dept=`head -n $L filename.txt | tail -1 | sed -e 's/Adresse:.*Code Postal: //' -e 's/[0-9]\{3\} .*//'`
sed -n ""$M"{s/Tél. :/$dept/; /----------------------/p; q}" filename.txt >>newfile.csv
done
Where $dept is the first two digits after Code Postal: .
What doesn't yet work is the last sed bit: I want the end file to look like the old file, just with the "Tél." part changed to $dept.
New file:
Nom: some text
Société: some text
Adresse: some text and numb3rs Code Postal: 90000 SOME TEXT
90
numbers
Fax :
numbers
"----------------------"
Obviously this pattern repeats with other names, but sometimes the Tél. line and the one below it are not there.
tl;dr: I want to change a pattern in a file with something found one line up, with that something changing each time.
If you found a different way to get $dept in a different line, I would be very happy to hear about it.
I know my code is far from the most efficient, but I only learned about sed one week ago.
Thanks in advance for helping me/correcting me.
EDIT: As I've been asked to provide some input, here it is:
Nom: JOHN DOE
Société: APERTURE SCIENCE
Adresse: 37 RUE OF PARIS CS 30112 Code Postal: 51726 REIMS CEDEX
Tél. :
12 34 56 78 90
Fax :
12 34 56 78 90
"----------------------"
Nom: OLIVER TWIST
Société: NASA
Adresse: 40 RUE DU GINGEMBRE CS 70999 Code Postal: 67009 STRASBOURG CEDEX
Tél. :
12 34 56 78 90
Fax :
12 34 56 78 90
"----------------------"
Nom: BARACK OBAMA
Société: WHITE HOUSE
Adresse: 124 BOULEVARD DE GAULLE Code Postal: 75017 PARIS
Tél. :
12 34 56 78 90
"----------------------"
Output I want to achieve :
Nom: JOHN DOE
Société: APERTURE SCIENCE
Adresse: 37 RUE OF PARIS CS 30112 Code Postal: 51726 REIMS CEDEX
51
12 34 56 78 90
Fax :
12 34 56 78 90
"----------------------"
Nom: OLIVER TWIST
Société: NASA
Adresse: 40 RUE DU GINGEMBRE CS 70999 Code Postal: 67009 STRASBOURG CEDEX
67
12 34 56 78 90
Fax :
12 34 56 78 90
"----------------------"
Nom: BARACK OBAMA
Société: WHITE HOUSE
Adresse: 124 BOULEVARD DE GAULLE Code Postal: 75017 PARIS
75
12 34 56 78 90
"----------------------"
With sed :
$ sed '/.*Code Postal: \([0-9][0-9]\).*/{p;s//\1/;n;d}' file
Nom: some text
Société: some text
Adresse: some text and numb3rs Code Postal: 90000 SOME TEXT
90
numbers
Fax :
numbers
"----------------------"
/.*Code Postal: \([0-9][0-9]\).*/ : search for line containing Code Postal: followed by two digits
p : print the matching line (i.e. keep a copy of the line containing "Code Postal")
s//\1/ : replace the whole pattern space with the captured digits (\([0-9][0-9]\)); the empty pattern in s// reuses the last regex
n : print those digits and read the next line ("Tél. :"), which d then deletes
I've just seen your edit; you can achieve that with:
sed '/.*Code Postal: \([0-9][0-9]\).*/{p;s//\1/;N;/[0-9]/s/\n/ /;s/Tél\. : *//}' file
Note that the dept number will be output on a single line in the "OLIVER TWIST" block (because Tél.: is on a single line as in first block)
You do not provide sample input to check against, but this should work:
/Code Postal:/ {
    match($0, /Code Postal: *([0-9][0-9])/, result);
    dept = result[1];
}
/^Tél/ { $2 = dept }
{ print }
Save the code to a file, then call awk -f file input_file. It works like this: If the line matches "Code Postal", then save the first two digits of the postal code in the variable dept. If the line starts with "Tél", replace the second field with the value of dept. Then, print any line.
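As a concrete usage sketch (the file names are assumptions on my part; the three-argument match() is a GNU awk extension, so invoke gawk explicitly if plain awk is something else on your system):
gawk -f dept.awk filename.txt > filename.new   # dept.awk holds the program above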
Here is my guess as to what you are trying to accomplish.
awk 'NR==FNR {                          # Store line numbers in a[]
         a[$1] = $1; next }
     FNR in a { m=1 }                   # We are in match range
     /^"-+"$/ { m=0 }                   # Separator: we are out of range
     m && /^Adresse.*Code Postal:/ { c=substr($NF, 1, 2); $NF = 90000 }
     m && /^Tél\. :$/ { $0 = c }
     { print }' file_that_contains_line_numbers_I_want.txt filename > filename.new
This contains some common Awk idioms. The following is a really brief sketch of the script in human terms.
NR is the current line number overall, and FNR is the line number within the current file. When these are equal, it means you are reading the first input file. In this case, we read the line number into the array a and skip to the next line.
If we are falling through, we are reading the second file. When we see a line number which is present in a, we set the flag m to a true (non-zero) value to indicate that we are in a region where a substitution should take place. When we see the dashed separator line, we clear it, because this marks the end of the current record.
Finally, if we are in one of the targeted records (m is true) we look for the patterns and perform the requested extraction and substitution. NF is the number of fields in the current line, and $ selects a field, so $NF = 90000 replaces the last field on the line; and $0 is the entire input line, so when we see Tél. : we replace the whole line with the extracted code.
At the end of the script, we print whatever we are reading; the next in the first block skips the rest of the script, so we are printing only when we are in the second file. The resulting output should (hopefully!) be the result you require.
This should be orders of magnitude faster than reading the same file over and over again, and should work as long as the first file contains less than millions of line numbers (assuming modern hardware; if you have a really small machine with limited memory and no swap, maybe tens of thousands).
It sounds like this might be what you want, using GNU awk for the 3rd arg to match():
$ awk 'match($0,/.*Code Postal: *([0-9][0-9])/,a){$0=$0 ORS a[1]} !/^Tél/' file
or GNU awk for gensub():
$ awk '{$0=gensub(/.*Code Postal: *([0-9][0-9]).*/,"&\n\\1",1)} !/^Tél/' file
Nom: some text
Société: some text
Adresse: some text and numb3rs Code Postal: 90000 SOME TEXT
90
numbers
Fax :
numbers
"----------------------"
The above was run on this input file:
$ cat file
Nom: some text
Société: some text
Adresse: some text and numb3rs Code Postal: 90000 SOME TEXT
Tél. :
numbers
Fax :
numbers
"----------------------"
The above matches the stated regexp, saves the captured 2 digits in array a[1] and adds that preceded by a newline (ORS) to the end of the current line before printing that and any other line that doesn't start with Tél.
Read Effective Awk Programming, 4th Edition, by Arnold Robbins if you'll be doing any text manipulation in UNIX.

I have an issue with a while read line bash script with sed command to delete lines, and following lines

I have a bash script with which I'd like to read a list of lines from SedTest.names and delete that line and the following line in the file SedTest.fasta. The script looks like
while read line; do
sed "/${line}/{N;d}" SedTest.fasta
done < "SedTest.names"
SedTest.fasta looks like
>seq1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>seq2
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
>seq3
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>seq4
DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
>seq5
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
>seq6
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
>seq7
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
>seq8
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
>seq9
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
>seq10
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
And SedTest.names looks like
>seq3
>seq5
>seq6
When I run the script and direct output to test.out, test.out looks like
>seq1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>seq2
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
>seq4
DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
>seq5
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
>seq6
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
>seq7
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
>seq8
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
>seq9
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
>seq10
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
>seq1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>seq2
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
>seq3
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>seq4
DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
>seq6
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
>seq7
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
>seq8
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
>seq9
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
>seq10
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
>seq1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>seq2
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
>seq3
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
>seq4
DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
>seq5
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
>seq7
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
>seq8
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
>seq9
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
>seq10
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
It has removed the line and following line as asked, but it has edited the data in SedTest.fasta separately for each of the 3 lines from SedTest.names. I need a single set of output data with all 3 lines, and their following lines, removed, making an output file 14 lines in size.
I feel very close with this approach but I'm too thick to know why it's not working. Any help much appreciated!
You're not using the in-place flag in sed, so the file is not being saved in place. Use your loop like this:
while read -r line; do
sed -i.bak "/$line/{N;d;}" SedTest.fasta
done < SedTest.names
The problem is that each sed command processes the whole file again for each line in the names file. I prefer awk as an alternative because it can use associative arrays to save values from different files and process everything in one pass, like:
awk '
FNR == NR { names[$1] = 1; next }
{ if ($1 in names) { getline; next } }
{ print }
' SedTest.names SedTest.fasta
In the FNR == NR block we save the names into an array; for the second file we check whether the first field exists in that array, and if so getline reads the following line and next skips both. The rest are printed as normal.
It yields:
>seq1
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>seq2
BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
>seq4
DDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDDD
>seq7
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG
>seq8
HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
>seq9
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
>seq10
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ
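If you want SedTest.fasta itself updated rather than the 14 lines written to stdout, a minimal sketch (assuming it is acceptable to replace the file via a temporary copy):
awk '
FNR == NR { names[$1] = 1; next }
{ if ($1 in names) { getline; next } }
{ print }
' SedTest.names SedTest.fasta > SedTest.fasta.tmp &&
    mv SedTest.fasta.tmp SedTest.fasta           # SedTest.fasta.tmp is an assumed temporary name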

Grouping the data from a column using awk or any other shell command

I could format the data using a Perl script (hash). I am wondering if it can be done with a shell one-liner, so that I don't need to write a Perl script every time the input format changes.
Example Input:
rinku a
rinku b
rinku c
rrs d
rrs e
abc f
abc g
abc h
abc i
xyz j
Example Output:
rinku a,b,c
rrs d,e
abc f,g,h,i
xyz j
Please help me with a command using shell/awk/sed to format the input.
Thanks,
Rinku
How about
$ awk '{arr[$1]=arr[$1]?arr[$1]","$2:$2} END{for (i in arr) print i, arr[i]}' input
rinku a,b,c
abc f,g,h,i
rrs d,e
xyz j
The awk program also has associative arrays, similar to Perl:
awk '{v[$1]=v[$1]","$2}END{for(k in v)print k" "substr(v[k],2)}' inputFile
For each line X Y (key X, value Y), it basically just appends ,Y to the array element indexed by X, taking advantage of the fact that elements all start as empty strings.
Then, since the stored values are of the form ,x,y,z, you just strip off the first character when outputting.
This generates, for your input data (in inputFile):
rinku a,b,c
abc f,g,h,i
rrs d,e
xyz j
As an aside, if you want it as nicely formatted as the original, you can create a program.awk file:
{
    val[$1] = val[$1] "," $2
    if (length($1) > maxlen) {
        maxlen = length($1)
    }
}
END {
    for (key in val) {
        printf "%-*s %s\n", maxlen, key, substr(val[key],2)
    }
}
and run that with:
awk -f program.awk inputFile
and you'll get:
rinku a,b,c
abc f,g,h,i
rrs d,e
xyz j
sed -n ':cycle
$!N
s/^\([^[:blank:]]*\)\([[:blank:]]\{1,\}.*\)\n\1[[:blank:]]\{1,\}/\1\2,/;t cycle
P
s/.*\n//;t cycle' YourFile
Trying not to use the hold buffer (and not loading the full file into memory):
- load the line
- if the first word is the same as the one after the newline, replace the newline and that first word with a ,
- if so, restart at line loading
- if not, print the first line
- replace the current buffer up to the first \n with nothing
- if that succeeds, restart at line loading
POSIX version, so use --posix on GNU sed.
