To split and arrange number in single inverted - bash

I have around 65000 products codes in a text file.I wanted to split those number in group of 999 each .Then-after want each 999 number with single quotes separated by comma.
Could you please suggest how I can achieve above scenario through Unix script.
87453454
65778445
.
.
.
.
Till 65000 productscodes
Need to arrange in below pattern:
'87453454','65778445',

With awk:
awk '
++c == 1 { out = "\047" $0 "\047"; next }
{ out = out ",\047" $0 "\047" }
c == 999 { print out; c = 0 }
END { if (c) print out }
' file
Or, with GNU sed:
sed "
:a
\$bb
N
0~999{
:b
s/\n/','/g
s/^/'/
s/$/'/
b
}
ba" file

With Perl:
perl -ne '
sub pq { chomp; print "\x27$_\x27" } pq;
for (1 .. 998) {
if (defined($_ = <>)) {
print ",";
pq
}
}
print "\n"
' < file
Credit for Mauke perl#libera.chat

65000 isn't that many lines for awk - just do it all in one shot :
mawk 'BEGIN { FS = RS; RS = "^$"; OFS = (_="\47")(",")_
} gsub(/^|[^0-9]*$/,_, $!(NF = NF))'
'66771756','69562431','22026341','58085790','22563930',
'63801696','24044132','94255986','56451624','46154427'
That's for grouping them all in one line. To make 999 ones, try
jot -r 50 10000000 99999999 |
# change "5" to "999" here
rs -C= 0 5 |
mawk 'sub(".*", "\47&\47", $!(NF -= _==$NF ))' FS== OFS='\47,\47'
'36452530','29776340','31198057','36015730','30143632'
'49664844','83535994','86871984','44613227','12309645'
'58002568','31342035','72695499','54546650','21800933'
'38059391','36935562','98323086','91089765','65672096'
'17634208','14009291','39114390','35338398','43676356'
'14973124','19782405','96782582','27689803','27438921'
'79540212','49141859','25714405','42248622','25589123'
'11466085','87022819','65726165','86718075','56989625'
'12900115','82979216','65469187','63769703','86494457'
'26544666','89342693','64603075','26102683','70528492'
_==$NF checks whether right most column is empty or not,
—- i.e. whether there's a trailing edge sep that needds to be trimmed

If your input file only contains short codes as shown in your example, you could use the following hack:
xargs -L 999 bash -c "printf \'%s\', \"\$#\"; echo" . <inputFile >outputFile
Alternatively, you can use this sed command:
sed -Ene"s/(.*)/'\1',/;H" -e{'0~999','$'}'{z;x;s/\n//g;p}' <inputFile >outputFile
s/(.*)/'\1',/ wraps each line in '...',
but does not print it (-n)
instead, H appends the modified line to the so called hold space; basically a helper variable storing a single string.
(This also adds a line break as a separator, but we remove that later).
Every 999 lines (0~999) and at the end of the input file ($) ...
... the hold space is then printed and cleared (z;x;...;p)
while deleting all delimiter-linebreaks (s/\n//g) mentioned earlier.

Related

Replacing numbers with their respective strings in awk

I am a newbie in bash/awk programming and I have a file looks like this:
1 10032154 10032154 A C Leber_congenital_amaurosis_9 criteria_provided,_single_submitter Benign . 1
1 10032184 10032184 A G Retinal_dystrophy|Leber_congenital_amaurosis_9|not_provided criteria_provided,_multiple_submitters,_no_conflicts Pathogenic/Likely_pathogenic . 1,4
1 10032209 10032209 G A not_provided criteria_provided,_single_submitter Likely_benign . 8,64,512
With awk, I want to change the numbers in the last column ($10) with their descriptions. I assigned the numbers and their definitions in two different arrays. The way I was thinking was to change these numbers by iterating the two array together. Here, 0 is "unknown", 1 is "germline", 4 is "somatic" and goes on.
z=(0 1 2 4 8 16 32 64 128 256 512 1024 1073741824)
t=("unknown" "germline" "somatic" "inherited" "paternal" "maternal" "de-novo" "biparental" "uniparental" "not-tested" "tested-inconclusive" "not-reported" "other")
number=$(IFS=,; echo "${z[*]}")
def=$(IFS=,; echo "${t[*]}")
awk -v a="$number" -v b="${def}" 'BEGIN { OFS="\t" } /#/ {next}
{
x=split(a, e, /,/)
y=split(b, f, /,/)
delete c
m=split($10, c, /,/)
for (i=1; i<=m; i++) {
for (j=1; j<=x; j++) {
if (c[i]==e[j]) {
c[i]=f[j]
}
}
$10+=sprintf("%s, ",c[i])
}
print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10
}' input.vcf > output.vcf
The output should look like this:
1 10032154 10032154 A C Leber_congenital_amaurosis_9 criteria_provided,_single_submitter Benign . germline
1 10032184 10032184 A G Retinal_dystrophy|Leber_congenital_amaurosis_9|not_provided criteria_provided,_multiple_submitters,_no_conflicts Pathogenic/Likely_pathogenic . germline,paternal
1 10032209 10032209 G A not_provided criteria_provided,_single_submitter Likely_benign . paternal,biparental,tested-inconclusive
I would be so glad if you could help me!
All the best
Assuming you don't really need to define the lists of numbers and names as 2 shell arrays for some other reason:
$ cat tst.awk
BEGIN {
split("0 1 2 4 8 16 32 64 128 256 512 1024 1073741824",nrsArr)
split("unknown germline somatic inherited paternal maternal de-novo biparental uniparental not-tested tested-inconclusive not-reported other",namesArr)
for (i in nrsArr) {
nr2name[nrsArr[i]] = namesArr[i]
}
}
!/#/ {
n = split($NF,nrs,/,/)
sub(/[^[:space:]]+$/,"")
printf "%s", $0
for (i=1; i<=n; i++) {
printf "%s%s", nr2name[nrs[i]], (i<n ? "," : ORS)
}
}
$ awk -f tst.awk input.vcf
1 10032154 10032154 A C Leber_congenital_amaurosis_9 criteria_provided,_single_submitter Benign . germline
1 10032184 10032184 A G Retinal_dystrophy|Leber_congenital_amaurosis_9|not_provided criteria_provided,_multiple_submitters,_no_conflicts Pathogenic/Likely_pathogenic . germline,inherited
1 10032209 10032209 G A not_provided criteria_provided,_single_submitter Likely_benign . paternal,biparental,tested-inconclusive
The above preserves whatever white space you had in your input file in case that matters.
You may use this awk:
z=(0 1 2 4 8 16 32 64 128 256 512 1024 1073741824)
t=("unknown" "germline" "somatic" "inherited" "paternal" "maternal" "de-novo" "biparental" "uniparental" "not-tested" "tested-inconclusive" "not-reported" "other")
awk -v z="${z[*]}" -v t="${t[*]}" '
BEGIN {
split(z, zarr)
split(t, tarr)
for (i=1; i in zarr; ++i)
map[zarr[i]] = tarr[i]
}
{
split($NF, arr, /,/)
s = ""
for (i=1; i in arr; ++i)
s = s (i == 1 ? "" : ",") map[arr[i]]
$NF = s;
}
1
' file
btw number 4 is mapped to inherited not paternal as you have in your expected output.
Use this short Perl in-line script:
perl -F'\t' -lane '
BEGIN {
#keys = qw( 0 1 2 4 8 16 32 64 128 256 512 1024 1073741824 );
#vals = qw( unknown germline somatic inherited paternal maternal de-novo biparental uniparental not-tested tested-inconclusive not-reported other );
%val = map { $keys[$_] => $vals[$_] } 0..$#keys;
}
print join "\t", #F[0..8], ( join ",", map { $val{$_} } split /,/, $F[9] );
' in_file > out_file
The Perl script uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array #F on whitespace or on the regex specified in -F option.
-F'/\t/' : Split into #F on TAB, rather than on whitespace.
%val = map { $keys[$_] => $vals[$_] } 0..$#keys; : Create %val - a hash lookup table with keys = numeric codes and values = mutation/variant types.
Note that in Perl, arrays are 0-indexed.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
Assumptions:
OP has confirmed beforehand that the z and t arrays are valid (eg, same number of elements in both arrays)
OP may want to (dynamically) change the contents of the z and t arrays so we'll leave the array assignments at the bash level (ie, won't hardcode inside of awk)
the substitution strings could contain white space so we'll keep OP's current method of building comma-delimited strings (from the z and t) arrays; also assumes replacement strings do not contain commas; this should simplify parsing of the replacement strings within awk
while OP has explicitly coded for (awk) field #10, we'll assume this number could change; we'll focus on processing the last field in a row
Small change to initialization code:
# original arrays
z=(0 1 2 4 8 16 32 64 128 256 512 1024 1073741824)
t=("unknown" "germline" "somatic" "inherited" "paternal" "maternal" "de-novo" "biparental" "uniparental" "not-tested" "tested-inconclusive" "not-reported" "other")
# renamed variables (format: x,y,z,...)
nums=$(IFS=,; echo "${z[*]}")
alphas=$(IFS=,; echo "${t[*]}")
One awk idea:
awk -v nums="${nums}" -v alphas="${alphas}" ' # pass comma-delimited variables to awk
BEGIN { OFS="\t" # copied from original code:w
n=split(nums,num,/,/) # split comma-delimted variables
a=split(alphas,alpha,/,/) # into arrays
}
/#/ { next } # copied from original code
{ l=split($NF,lastf,/,/) # split the last (comma-delimited) field
$NF="" # clear the last field
pfx="" # initialize our prefix string
for (i=1; i<=l; i++) # loop through entries in the last field
for (j=1; j<=n; j++) # loop through array of numbers
if ( lastf[i] == num[j] ) # if array entries match ...
{ $NF= $NF pfx alpha[j] # append the associated alpha to the last field
pfx="," # set the prefix to "," for the next item
break # break out one level to process next entry in the last field
}
}
{ print } # print the current line (with modified last field)
' input.vcf
The above generates:
1 10032154 10032154 A C Leber_congenital_amaurosis_9 criteria_provided,_single_submitter Benign . germline
1 10032184 10032184 A G Retinal_dystrophy|Leber_congenital_amaurosis_9|not_provided criteria_provided,_multiple_submitters,_no_conflicts Pathogenic/Likely_pathogenic.germline,inherited
1 10032209 10032209 G A not_provided criteria_provided,_single_submitter Likely_benign . paternal,biparental,tested-inconclusive

Saving lines between "start"s and "end"s to different arrays

I have a file that consists of a bunch of things but what I need are numbers between start and end strings: For example :
ghghgh
start
23
34
22
12
end
ghbd
wodkkh
234
start
14
56
74
end
So, I need two arrays here one containing 23,34,22,12 and one containing 14,56,74. What's the best command to use?
If I only had one start and one end I would be able to use mapfile and awk to obtain that array, but there's many start and ends in the file and I need to save all the arrays.
You can do it with sed.
sed -n '/start/{:a;N;/end/!ba;s/\n/, /g;s/, [^,][a-z][^,]*//Ig;s/start, //p}'
The code will iterate through all chunks between 'start' and 'end' lines.
It will remove all items with non-digit symbols and output each "array" on separate line.
Here is output from your data sample:
23, 34, 22, 12
14, 56, 74
You need to implement a small state machine - switching between in block and out of block:
awk '/end/{block = 0; print a; a = ""} (block) {a = a " " $0} /start/{block = 1}'
If at end, leave block, print and empty the accumulator. If in block, accumulate current line. If at start, mark that we're inside a block.
You can tell awk to change the output file every time a new sequence starts
awk '/start/{i++;f=1;next} /end/{f=0} f{print > "arr"i}' file
For the example file, this will create files: arr1, arr2. Then you can create separated arrays with the lines of these files:
for i in $( ls arr* ); do readarray -t $i < $i; done
note: I have assumed that all lines between matching patterns are numeric and acceptable as in the example.
If you trust your input files enough for an eval:
$ cat tst.sh
eval $(
awk '
f {
if ( /end/ ) {
print "declare arr" ++cnt "=(" vals " )"
vals = ""
f = 0
}
else {
vals = vals OFS $0
}
}
/start/ { f = 1 }
' "$1"
)
printf "arr1:%s\n" "${arr1[#]}"
printf "arr2:%s\n" "${arr2[#]}"
$ ./tst.sh file
arr1:23
arr1:34
arr1:22
arr1:12
arr2:14
arr2:56
arr2:74
Check the quoting and all other shell gotchas...

Find nth row using AWK and assign them to a variable

Okay, I have two files: one is baseline and the other is a generated report. I have to validate a specific string in both the files match, it is not just a single word see example below:
.
.
name os ksd
56633223223
some text..................
some text..................
My search criteria here is to find unique number such as "56633223223" and retrieve above 1 line and below 3 lines, i can do that on both the basefile and the report, and then compare if they match. In whole i need shell script for this.
Since the strings above and below are unique but the line count varies, I had put it in a file called "actlist":
56633223223 1 5
56633223224 1 6
56633223225 1 3
.
.
Now from below "Rcount" I get how many iterations to be performed, and in each iteration i have to get ith row and see if the word count is 3, if it is then take those values into variable form and use something like this
I'm stuck at the below, which command to be used. I'm thinking of using AWK but if there is anything better please advise. Here's some pseudo-code showing what I'm trying to do:
xxxxx=/root/xxx/xxxxxxx
Rcount=`wc -l $xxxxx | awk -F " " '{print $1}'`
i=1
while ((i <= Rcount))
do
record=_________________'(Awk command to retrieve ith(1st) record (of $xxxx),
wcount=_________________'(Awk command to count the number of words in $record)
(( i=i+1 ))
done
Note: record, wcount values are later printed to a log file.
Sounds like you're looking for something like this:
#!/bin/bash
while read -r word1 word2 word3 junk; do
if [[ -n "$word1" && -n "$word2" && -n "$word3" && -z "$junk" ]]; then
echo "all good"
else
echo "error"
fi
done < /root/shravan/actlist
This will go through each line of your input file, assigning the three columns to word1, word2 and word3. The -n tests that read hasn't assigned an empty value to each variable. The -z checks that there are only three columns, so $junk is empty.
I PROMISE you you are going about this all wrong. To find words in file1 and search for those words in file2 and file3 is just:
awk '
NR==FNR{ for (i=1;i<=NF;i++) words[$i]; next }
{ for (word in words) if ($0 ~ word) print FILENAME, word }
' file1 file2 file3
or similar (assuming a simple grep -f file1 file2 file3 isn't adequate). It DOES NOT involve shell loops to call awk to pull out strings to save in shell variables to pass to other shell commands, etc, etc.
So far all you're doing is asking us to help you implement part of what you think is the solution to your problem, but we're struggling to do that because what you're asking for doesn't make sense as part of any kind of reasonable solution to what it sounds like your problem is so it's hard to suggest anything sensible.
If you tells us what you are trying to do AS A WHOLE with sample input and expected output for your whole process then we can help you.
We don't seem to be getting anywhere so let's try a stab at the kind of solution I think you might want and then take it from there.
Look at these 2 files "old" and "new" side by side (line numbers added by the cat -n):
$ paste old new | cat -n
1 a b
2 b 56633223223
3 56633223223 c
4 c d
5 d h
6 e 56633223225
7 f i
8 g Z
9 h k
10 56633223225 l
11 i
12 j
13 k
14 l
Now lets take this "actlist":
$ cat actlist
56633223223 1 2
56633223225 1 3
and run this awk command on all 3 of the above files (yes, I know it could be briefer, more efficient, etc. but favoring simplicity and clarity for now):
$ cat tst.awk
ARGIND==1 {
numPre[$1] = $2
numSuc[$1] = $3
}
ARGIND==2 {
oldLine[FNR] = $0
if ($0 in numPre) {
oldHitFnr[$0] = FNR
}
}
ARGIND==3 {
newLine[FNR] = $0
if ($0 in numPre) {
newHitFnr[$0] = FNR
}
}
END {
for (str in numPre) {
if ( str in oldHitFnr ) {
if ( str in newHitFnr ) {
for (i=-numPre[str]; i<=numSuc[str]; i++) {
oldFnr = oldHitFnr[str] + i
newFnr = newHitFnr[str] + i
if (oldLine[oldFnr] != newLine[newFnr]) {
print str, "mismatch at old line", oldFnr, "new line", newFnr
print "\t" oldLine[oldFnr], "vs", newLine[newFnr]
}
}
}
else {
print str, "is present in old file but not new file"
}
}
else if (str in newHitFnr) {
print str, "is present in new file but not old file"
}
}
}
.
$ awk -f tst.awk actlist old new
56633223225 mismatch at old line 12 new line 8
j vs Z
It's outputing that result because the 2nd line after 56633223225 is j in file "old" but Z in file "new" and the file "actlist" said the 2 files had to be common from one line before until 3 lines after that pattern.
Is that what you're trying to do? The above uses GNU awk for ARGIND but the workaround is trivial for other awks.
Use the below code:
awk '{if (NF == 3) { word1=$1; word2=$2; word3=$3; print "Words are:" word1, word2, word3} else {print "Line", NR, "is having", NF, "Words" }}' filename.txt
I have given the solution as per the requirement.
awk '{ # awk starts from here and read a file line by line
if (NF == 3) # It will check if current line is having 3 fields. NF represents number of fields in current line
{ word1=$1; # If current line is having exact 3 fields then 1st field will be assigned to word1 variable
word2=$2; # 2nd field will be assigned to word2 variable
word3=$3; # 3rd field will be assigned to word3 variable
print word1, word2, word3} # It will print all 3 fields
}' filename.txt >> output.txt # THese 3 fields will be redirected to a file which can be used for further processing.
This is as per the requirement, but there are many other ways of doing this but it was asked using awk.

Extracting multiple parts of a string using bash

I have a caret delimited (key=value) input and would like to extract multiple tokens of interest from it.
For example: Given the following input
$ echo -e "1=A00^35=D^150=1^33=1\n1=B000^35=D^150=2^33=2"
1=A00^35=D^22=101^150=1^33=1
1=B000^35=D^22=101^150=2^33=2
I would like the following output
35=D^150=1^
35=D^150=2^
I have tried the following
$ echo -e "1=A00^35=D^150=1^33=1\n1=B000^35=D^150=2^33=2"|egrep -o "35=[^/^]*\^|150=[^/^]*\^"
35=D^
150=1^
35=D^
150=2^
My problem is that egrep returns each match on a separate line. Is it possible to get one line of output for one line of input? Please note that due to the constraints of the larger script, I cannot simply do a blind replace of all the \n characters in the output.
Thank you for any suggestions.This script is for bash 3.2.25. Any egrep alternatives are welcome. Please note that the tokens of interest (35 and 150) may change and I am already generating the egrep pattern in the script. Hence a one liner (if possible) would be great
You have two options. Option 1 is to change the "white space character" and use set --:
OFS=$IFS
IFS="^ "
set -- 1=A00^35=D^150=1^33=1 # No quotes here!!
IFS="$OFS"
Now you have your values in $1, $2, etc.
Or you can use an array:
tmp=$(echo "1=A00^35=D^150=1^33=1" | sed -e 's:\([0-9]\+\)=: [\1]=:g' -e 's:\^ : :g')
eval value=($tmp)
echo "35=${value[35]}^150=${value[150]}"
To get rid of the newline, you can just echo it again:
$ echo $(echo "1=A00^35=D^150=1^33=1"|egrep -o "35=[^/^]*\^|150=[^/^]*\^")
35=D^ 150=1^
If that's not satisfactory (I think it may give you one line for the whole input file), you can use awk:
pax> echo '
1=A00^35=D^150=1^33=1
1=a00^35=d^157=11^33=11
' | awk -vLIST=35,150 -F^ ' {
sep = "";
split (LIST, srch, ",");
for (i = 1; i <= NF; i++) {
for (idx in srch) {
split ($i, arr, "=");
if (arr[1] == srch[idx]) {
printf sep "" arr[1] "=" arr[2];
sep = "^";
}
}
}
if (sep != "") {
print sep;
}
}'
35=D^150=1^
35=d^
pax> echo '
1=A00^35=D^150=1^33=1
1=a00^35=d^157=11^33=11
' | awk -vLIST=1,33 -F^ ' {
sep = "";
split (LIST, srch, ",");
for (i = 1; i <= NF; i++) {
for (idx in srch) {
split ($i, arr, "=");
if (arr[1] == srch[idx]) {
printf sep "" arr[1] "=" arr[2];
sep = "^";
}
}
}
if (sep != "") {
print sep;
}
}'
1=A00^33=1^
1=a00^33=11^
This one allows you to use a single awk script and all you need to do is to provide a comma-separated list of keys to print out.
And here's the one-liner version :-)
echo '1=A00^35=D^150=1^33=1
1=a00^35=d^157=11^33=11
' | awk -vLST=1,33 -F^ '{s="";split(LST,k,",");for(i=1;i<=NF;i++){for(j in k){split($i,arr,"=");if(arr[1]==k[j]){printf s""arr[1]"="arr[2];s="^";}}}if(s!=""){print s;}}'
given a file 'in' containing your strings :
$ for i in $(cut -d^ -f2,3 < in);do echo $i^;done
35=D^150=1^
35=D^150=2^

Uniq in awk; removing duplicate values in a column using awk

I have a large datafile in the following format below:
ENST00000371026 WDR78,WDR78,WDR78, WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32 WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458, atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
The columns are tab separated. Multiple values within columns are comma separated. I would like to remove the duplicate values in the second column to result in something like this:
ENST00000371026 WDR78 WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32 WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458 atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
I tried the following code below but it doesn't seem to remove the duplicate values.
awk '
BEGIN { FS="\t" } ;
{
split($2, valueArray,",");
j=0;
for (i in valueArray)
{
if (!( valueArray[i] in duplicateArray))
{
duplicateArray[j] = valueArray[i];
j++;
}
};
printf $1 "\t";
for (j in duplicateArray)
{
if (duplicateArray[j]) {
printf duplicateArray[j] ",";
}
}
printf "\t";
print $3
}' knownGeneFromUCSC.txt
How can I remove the duplicates in column 2 correctly?
Your script acts only on the second record (line) in the file because of NR==2. I took it out, but it may be what you intend. If so, you should put it back.
The in operator checks for the presence of the index, not the value, so I made duplicateArray an associative array* that uses the values from valueArray as its indices. This saves from having to iterate over both arrays in a loop within a loop.
The split statement sees "WDR78,WDR78,WDR78," as four fields rather than three so I added an if to keep it from printing a null value which would result in ",WDR78," being printed if the if weren't there.
* In reality all arrays in AWK are associative.
awk '
BEGIN { FS="\t" } ;
{
split($2, valueArray,",");
j=0;
for (i in valueArray)
{
if (!(valueArray[i] in duplicateArray))
{
duplicateArray[valueArray[i]] = 1
}
};
printf $1 "\t";
for (j in duplicateArray)
{
if (j) # prevents printing an extra comma
{
printf j ",";
}
}
printf "\t";
print $3
delete duplicateArray # for non-gawk, use split("", duplicateArray)
}'
Perl:
perl -F'\t' -lane'
$F[1] = join ",", grep !$_{$_}++, split ",", $F[1];
print join "\t", #F; %_ = ();
' infile
awk:
awk -F'\t' '{
n = split($2, t, ","); _2 = x
split(x, _) # use delete _ if supported
for (i = 0; ++i <= n;)
_[t[i]]++ || _2 = _2 ? _2 "," t[i] : t[i]
$2 = _2
}-3' OFS='\t' infile
The line 4 in the awk script is used to preserve the original order of the values in the second field after filtering the unique values.
Sorry, I know you asked about awk... but Perl makes this much more simple:
$ perl -n -e ' #t = split(/\t/);
%t2 = map { $_ => 1 } split(/,/,$t[1]);
$t[1] = join(",",keys %t2);
print join("\t",#t); ' knownGeneFromUCSC.txt
Pure Bash 4.0 (one associative array):
declare -a part # parts of a line
declare -a part2 # parts 2. column
declare -A check # used to remember items in part2
while read line ; do
part=( $line ) # split line using whitespaces
IFS=',' # separator is comma
part2=( ${part[1]} ) # split 2. column using comma
if [ ${#part2[#]} -gt 1 ] ; then # more than 1 field in 2. column?
check=() # empty check array
new2='' # empty new 2. column
for item in ${part2[#]} ; do
(( check[$item]++ )) # remember items in 2. column
if [ ${check[$item]} -eq 1 ] ; then # not yet seen?
new2=$new2,$item # add to new 2. column
fi
done
part[1]=${new2#,} # remove leading comma
fi
IFS=$'\t' # separator for the output
echo "${part[*]}" # rebuild line
done < "$infile"

Resources