I have a large data file in the following format:
ENST00000371026 WDR78,WDR78,WDR78, WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32 WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458, atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
The columns are tab separated. Multiple values within columns are comma separated. I would like to remove the duplicate values in the second column to result in something like this:
ENST00000371026 WDR78 WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 1,WD repeat domain 78 isoform 2,
ENST00000371023 WDR32 WD repeat domain 32 isoform 2
ENST00000400908 RERE,KIAA0458 atrophin-1 like protein isoform a,Homo sapiens mRNA for KIAA0458 protein, partial cds.,
I tried the following code, but it doesn't seem to remove the duplicate values.
awk '
BEGIN { FS="\t" } ;
split($2, valueArray,",");
for (i in valueArray)
if (!( valueArray[i] in duplicateArray))
duplicateArray[j] = valueArray[i];
printf $1 "\t";
for (j in duplicateArray)
if (duplicateArray[j]) {
printf duplicateArray[j] ",";
printf "\t";
print $3
}' knownGeneFromUCSC.txt
How can I remove the duplicates in column 2 correctly?
Your script acts only on the second record (line) in the file because of NR==2. I took it out, but it may be what you intend. If so, you should put it back.
The in operator checks for the presence of an index, not a value, so I made duplicateArray an associative array* that uses the values from valueArray as its indices. This avoids having to iterate over both arrays in a loop within a loop.
The split statement sees "WDR78,WDR78,WDR78," as four fields rather than three (the last one is empty because of the trailing comma), so I added an if to keep it from printing a null value. Without that if, ",WDR78," would be printed.
* In reality all arrays in AWK are associative.
awk '
BEGIN { FS = "\t" }
{
    split($2, valueArray, ",")
    for (i in valueArray)
        if (!(valueArray[i] in duplicateArray))
            duplicateArray[valueArray[i]] = 1
    printf "%s\t", $1
    for (j in duplicateArray)
        if (j)                  # prevents printing an extra comma
            printf "%s,", j
    printf "\t"
    print $3
    delete duplicateArray       # for non-gawk, use split("", duplicateArray)
}' knownGeneFromUCSC.txt
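The two points made above (that `in` tests indices, and that a trailing comma gives split an empty last piece) can be seen together in a small sketch. The function name `dedup_col2` is made up for illustration. Unlike the `for (j in ...)` loop above, it walks the pieces by position, so it assumes you want to keep the first-seen order and drop the trailing comma in column 2.

```shell
# Hypothetical helper: reads tab-separated lines on stdin and removes
# repeated comma-separated values from column 2, keeping first-seen order.
dedup_col2() {
    awk 'BEGIN { FS = OFS = "\t" }
    {
        n = split($2, vals, ",")
        split("", seen)                   # portable way to empty an array
        out = ""
        for (i = 1; i <= n; i++) {
            v = vals[i]
            if (v == "" || (v in seen))   # skip empty pieces and repeats
                continue
            seen[v] = 1                   # the value itself becomes the index
            out = (out == "" ? v : out "," v)
        }
        $2 = out                          # rebuilds the record using OFS
        print
    }'
}
```

It would be called as `dedup_col2 < knownGeneFromUCSC.txt`.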
perl -F'\t' -lane'
$F[1] = join ",", grep !$_{$_}++, split ",", $F[1];
print join "\t", @F; %_ = ();
' infile
awk -F'\t' '{
n = split($2, t, ","); _2 = x
split(x, _) # use delete _ if supported
for (i = 0; ++i <= n;)
_[t[i]]++ || _2 = _2 ? _2 "," t[i] : t[i]
$2 = _2
}-3' OFS='\t' infile
Line 4 of the awk script (the indexed for loop) preserves the original order of the values in the second field while filtering out the duplicates.
Sorry, I know you asked about awk... but Perl makes this much simpler (note that keys %t2 does not preserve the original order):
$ perl -n -e ' @t = split(/\t/);
%t2 = map { $_ => 1 } split(/,/,$t[1]);
$t[1] = join(",",keys %t2);
print join("\t",@t); ' knownGeneFromUCSC.txt
Pure Bash 4.0 (one associative array):
declare -a part                            # parts of a line
declare -a part2                           # parts 2. column
declare -A check                           # used to remember items in part2
while read line ; do
  part=( $line )                           # split line using whitespaces
  IFS=','                                  # separator is comma
  part2=( ${part[1]} )                     # split 2. column using comma
  if [ ${#part2[@]} -gt 1 ] ; then         # more than 1 field in 2. column?
    check=()                               # empty check array
    new2=''                                # empty new 2. column
    for item in ${part2[@]} ; do
      (( check[$item]++ ))                 # remember items in 2. column
      if [ ${check[$item]} -eq 1 ] ; then  # not yet seen?
        new2=$new2,$item                   # add to new 2. column
      fi
    done
    part[1]=${new2#,}                      # remove leading comma
  fi
  IFS=$'\t'                                # separator for the output
  echo "${part[*]}"                        # rebuild line
done < "$infile"
I have around 65000 product codes in a text file. I want to split those codes into groups of 999 each, then print each group of 999 with every code in single quotes, separated by commas.
Could you please suggest how I can achieve this with a Unix script?
Till 65000 productscodes
Need to arrange in below pattern:
With awk:
awk '
++c == 1 { out = "\047" $0 "\047"; next }
{ out = out ",\047" $0 "\047" }
c == 999 { print out; c = 0 }
END { if (c) print out }
' file
Or, with GNU sed:
sed "
ba" file
With Perl:
perl -ne '
  sub pq { chomp; print "\x27$_\x27" } pq;
  for (1 .. 998) {
    if (defined($_ = <>)) {
      print ",";
      pq
    }
  }
  print "\n"
' < file
Credit to Mauke (perl@libera.chat).
65000 isn't that many lines for awk - just do it all in one shot :
mawk 'BEGIN { FS = RS; RS = "^$"; OFS = (_="\47")(",")_
} gsub(/^|[^0-9]*$/,_, $!(NF = NF))'
That's for grouping them all in one line. To make 999 ones, try
jot -r 50 10000000 99999999 |
# change "5" to "999" here
rs -C= 0 5 |
mawk 'sub(".*", "\47&\47", $!(NF -= _==$NF ))' FS== OFS='\47,\47'
_==$NF checks whether the rightmost column is empty,
i.e. whether there's a trailing separator that needs to be trimmed.
If your input file only contains short codes as shown in your example, you could use the following hack:
xargs -L 999 bash -c "printf \'%s\', \"\$@\"; echo" . <inputFile >outputFile
Alternatively, you can use this sed command:
sed -Ene"s/(.*)/'\1',/;H" -e{'0~999','$'}'{z;x;s/\n//g;p}' <inputFile >outputFile
s/(.*)/'\1',/ wraps each line in '...',
but does not print it (-n)
instead, H appends the modified line to the so called hold space; basically a helper variable storing a single string.
(This also adds a line break as a separator, but we remove that later).
Every 999 lines (0~999) and at the end of the input file ($) ...
... the hold space is then printed and cleared (z;x;...;p)
while deleting all delimiter-linebreaks (s/\n//g) mentioned earlier.
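Both commands above leave a comma after the last code on each output line. If that matters, the chunking can be sketched with an explicit counter, along the lines of the awk answer earlier in the thread. The function name `quote_chunks` and its group-size argument are made up for illustration, and the sketch assumes one code per line with no embedded whitespace.

```shell
# Hypothetical helper: quote_chunks N < file
# Emits one line per N input lines; each value is wrapped in single
# quotes and the values are joined by commas (no trailing comma).
quote_chunks() {
    awk -v n="$1" -v q="'" '
    NF == 0 { next }                        # ignore blank lines
    {
        line = line (count ? "," : "") q $1 q
        if (++count == n) {                 # group is full: flush it
            print line
            line = ""
            count = 0
        }
    }
    END { if (count) print line }'          # flush the last, shorter group
}
```

For the question's case this would be `quote_chunks 999 <inputFile >outputFile`.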
I am a newbie in bash/awk programming and I have a file that looks like this:
1 10032154 10032154 A C Leber_congenital_amaurosis_9 criteria_provided,_single_submitter Benign . 1
1 10032184 10032184 A G Retinal_dystrophy|Leber_congenital_amaurosis_9|not_provided criteria_provided,_multiple_submitters,_no_conflicts Pathogenic/Likely_pathogenic . 1,4
1 10032209 10032209 G A not_provided criteria_provided,_single_submitter Likely_benign . 8,64,512
With awk, I want to replace the numbers in the last column ($10) with their descriptions. I assigned the numbers and their definitions to two different arrays. My idea was to replace these numbers by iterating over the two arrays together. Here, 0 is "unknown", 1 is "germline", 2 is "somatic", and so on.
z=(0 1 2 4 8 16 32 64 128 256 512 1024 1073741824)
t=("unknown" "germline" "somatic" "inherited" "paternal" "maternal" "de-novo" "biparental" "uniparental" "not-tested" "tested-inconclusive" "not-reported" "other")
number=$(IFS=,; echo "${z[*]}")
def=$(IFS=,; echo "${t[*]}")
awk -v a="$number" -v b="${def}" 'BEGIN { OFS="\t" } /#/ {next}
x=split(a, e, /,/)
y=split(b, f, /,/)
delete c
m=split($10, c, /,/)
for (i=1; i<=m; i++) {
for (j=1; j<=x; j++) {
if (c[i]==e[j]) {
$10+=sprintf("%s, ",c[i])
print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10
}' input.vcf > output.vcf
The output should look like this:
1 10032154 10032154 A C Leber_congenital_amaurosis_9 criteria_provided,_single_submitter Benign . germline
1 10032184 10032184 A G Retinal_dystrophy|Leber_congenital_amaurosis_9|not_provided criteria_provided,_multiple_submitters,_no_conflicts Pathogenic/Likely_pathogenic . germline,paternal
1 10032209 10032209 G A not_provided criteria_provided,_single_submitter Likely_benign . paternal,biparental,tested-inconclusive
I would be so glad if you could help me!
All the best
Assuming you don't really need to define the lists of numbers and names as 2 shell arrays for some other reason:
$ cat tst.awk
BEGIN {
    split("0 1 2 4 8 16 32 64 128 256 512 1024 1073741824",nrsArr)
    split("unknown germline somatic inherited paternal maternal de-novo biparental uniparental not-tested tested-inconclusive not-reported other",namesArr)
    for (i in nrsArr) {
        nr2name[nrsArr[i]] = namesArr[i]
    }
}
!/#/ {
    n = split($NF,nrs,/,/)
    sub(/[^[:space:]]+$/,"")    # drop the old last field, keep the white space before it
    printf "%s", $0
    for (i=1; i<=n; i++) {
        printf "%s%s", nr2name[nrs[i]], (i<n ? "," : ORS)
    }
}
$ awk -f tst.awk input.vcf
1 10032154 10032154 A C Leber_congenital_amaurosis_9 criteria_provided,_single_submitter Benign . germline
1 10032184 10032184 A G Retinal_dystrophy|Leber_congenital_amaurosis_9|not_provided criteria_provided,_multiple_submitters,_no_conflicts Pathogenic/Likely_pathogenic . germline,inherited
1 10032209 10032209 G A not_provided criteria_provided,_single_submitter Likely_benign . paternal,biparental,tested-inconclusive
The above preserves whatever white space you had in your input file in case that matters.
You may use this awk:
z=(0 1 2 4 8 16 32 64 128 256 512 1024 1073741824)
t=("unknown" "germline" "somatic" "inherited" "paternal" "maternal" "de-novo" "biparental" "uniparental" "not-tested" "tested-inconclusive" "not-reported" "other")
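awk -v z="${z[*]}" -v t="${t[*]}" '
BEGIN {
   split(z, zarr)           # split on white space while FS is still the default
   split(t, tarr)
   for (i=1; i in zarr; ++i)
      map[zarr[i]] = tarr[i]
   FS = OFS = "\t"          # tab-separated input, as in the question
}
{
   split($NF, arr, /,/)
   s = ""
   for (i=1; i in arr; ++i)
      s = s (i == 1 ? "" : ",") map[arr[i]]
   $NF = s;
} 1' file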
btw number 4 is mapped to inherited not paternal as you have in your expected output.
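The lookups above print an empty string for any code that is missing from the map. The sketch below shows one way to make that case visible by passing unknown codes through unchanged. That pass-through is an assumption, not something the question asks for. The function name `map_codes` is made up for illustration, and the sketch skips only lines that start with `#`.

```shell
# Hypothetical helper: map_codes "0,1,2" "unknown,germline,somatic" < input
# Replaces each comma-separated code in the last tab-separated field
# with its name.
map_codes() {
    awk -v nums="$1" -v names="$2" '
    BEGIN {
        FS = OFS = "\t"
        n = split(nums, num, ",")
        split(names, name, ",")
        for (i = 1; i <= n; i++)
            code2name[num[i]] = name[i]
    }
    /^#/ { next }                           # assumption: skip header lines only
    {
        m = split($NF, codes, ",")
        out = ""
        for (i = 1; i <= m; i++) {
            c = codes[i]
            # assumption: a code with no mapping is kept as-is
            label = (c in code2name) ? code2name[c] : c
            out = out (i > 1 ? "," : "") label
        }
        $NF = out
        print
    }'
}
```

With the comma-joined strings from the question it would be called as `map_codes "$number" "$def" < input.vcf`.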
Use this short Perl in-line script:
perl -F'\t' -lane '
@keys = qw( 0 1 2 4 8 16 32 64 128 256 512 1024 1073741824 );
@vals = qw( unknown germline somatic inherited paternal maternal de-novo biparental uniparental not-tested tested-inconclusive not-reported other );
%val = map { $keys[$_] => $vals[$_] } 0..$#keys;
print join "\t", @F[0..8], ( join ",", map { $val{$_} } split /,/, $F[9] );
' in_file > out_file
The Perl script uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in -F option.
-F'/\t/' : Split into @F on TAB, rather than on whitespace.
%val = map { $keys[$_] => $vals[$_] } 0..$#keys; : Create %val - a hash lookup table with keys = numeric codes and values = mutation/variant types.
Note that in Perl, arrays are 0-indexed.
perldoc perlrun: how to execute the Perl interpreter: command line switches
OP has confirmed beforehand that the z and t arrays are valid (eg, same number of elements in both arrays)
OP may want to (dynamically) change the contents of the z and t arrays so we'll leave the array assignments at the bash level (ie, won't hardcode inside of awk)
the substitution strings could contain white space, so we'll keep OP's current method of building comma-delimited strings (from the z and t arrays); this also assumes the replacement strings do not contain commas, which should simplify parsing them within awk
while OP has explicitly coded for (awk) field #10, we'll assume this number could change; we'll focus on processing the last field in a row
Small change to initialization code:
# original arrays
z=(0 1 2 4 8 16 32 64 128 256 512 1024 1073741824)
t=("unknown" "germline" "somatic" "inherited" "paternal" "maternal" "de-novo" "biparental" "uniparental" "not-tested" "tested-inconclusive" "not-reported" "other")
# renamed variables (format: x,y,z,...)
nums=$(IFS=,; echo "${z[*]}")
alphas=$(IFS=,; echo "${t[*]}")
One awk idea:
awk -v nums="${nums}" -v alphas="${alphas}" '    # pass comma-delimited variables to awk
BEGIN { OFS="\t"                                 # copied from original code
        n=split(nums,num,/,/)                    # split comma-delimited variables
        a=split(alphas,alpha,/,/)                # into arrays
      }
/#/   { next }                                   # copied from original code
      { l=split($NF,lastf,/,/)                   # split the last (comma-delimited) field
        $NF=""                                   # clear the last field
        pfx=""                                   # initialize our prefix string
        for (i=1; i<=l; i++)                     # loop through entries in the last field
            for (j=1; j<=n; j++)                 # loop through array of numbers
                if ( lastf[i] == num[j] )        # if array entries match ...
                   { $NF= $NF pfx alpha[j]       # append the associated alpha to the last field
                     pfx=","                     # set the prefix to "," for the next item
                     break                       # break out one level to process next entry in the last field
                   }
      }
      { print }                                  # print the current line (with modified last field)
' input.vcf
The above generates:
1 10032154 10032154 A C Leber_congenital_amaurosis_9 criteria_provided,_single_submitter Benign . germline
1 10032184 10032184 A G Retinal_dystrophy|Leber_congenital_amaurosis_9|not_provided criteria_provided,_multiple_submitters,_no_conflicts Pathogenic/Likely_pathogenic . germline,inherited
1 10032209 10032209 G A not_provided criteria_provided,_single_submitter Likely_benign . paternal,biparental,tested-inconclusive
I have csv file with multiple lines. Each line has the same number of columns. What I need to do is to group those lines by a few specified columns and aggregate data from other columns. Example of input file:
For above example I need to group lines by first two columns. From 3rd column I need to choose the min value, for 4th column max value, and 5th column should have the sum. So, for such input file I need output:
I need to process it in bash (I can use awk or sed as well).
With bash and sort:
# create associative arrays
declare -A month2num=([Jan]=1 [Feb]=2 [Mar]=3 [Apr]=4 [May]=5 [Jun]=6 [Jul]=7 [Aug]=8 [Sep]=9 [Oct]=10 [Nov]=11 [Dec]=12)
declare -A p ds de # date start and date end
declare -A -i sum # set integer attribute
# function to convert 5-Jun-2011 to 20110605
date2num() { local d m y; IFS="-" read -r d m y <<< "$1"; printf "%d%.2d%.2d\n" $y ${month2num[$m]} $d; }
# read all columns to variables p1 p2 d1 d2 s
while IFS="," read -r p1 p2 d1 d2 s; do
  # if associative array is still empty for this entry
  # fill with current strings/value
  if [[ -z ${p[$p1,$p2]} ]]; then
    p[$p1,$p2]="$p1,$p2"
    ds[$p1,$p2]="$d1"
    de[$p1,$p2]="$d2"
    sum[$p1,$p2]="$s"
    continue
  fi
  # compare strings, set new strings and sum value
  if [[ ${p[$p1,$p2]} == "$p1,$p2" ]]; then
    [[ $(date2num "$d1") < $(date2num ${ds[$p1,$p2]}) ]] && ds[$p1,$p2]="$d1"
    [[ $(date2num "$d2") > $(date2num ${de[$p1,$p2]}) ]] && de[$p1,$p2]="$d2"
    sum[$p1,$p2]+=$s
  fi
done < file
# print content of all associative arrays with keys from associative array p
for i in "${!p[@]}"; do echo "${p[$i]},${ds[$i]},${de[$i]},${sum[$i]}"; done
Usage: ./script.sh | sort
Output to stdout:
See: help declare, help read and of course man bash
With awk + sort
awk -F',|-' '
z=sprintf( "%.2d",$3)
y=sprintf("%s",$5 A[$4] z)
if (y < start[$1$2])
x=sprintf( "%.2d",$6)
w=sprintf("%s",$8 A[$7] x)
if(w > end[$1$2] )
for (i in B)print i "," C[i] "," D[i] "," B[i]
' infile | sort
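For reference, the same group-by idea can be sketched in portable awk, with no GNU extensions. The function name `group_min_max_sum` is made up for illustration. The sketch assumes that columns 3 and 4 hold dates such as 5-Jun-2011 (the format mentioned in the bash answer above) and that column 5 is numeric.

```shell
# Hypothetical helper: group_min_max_sum < file.csv
# Groups by columns 1 and 2; keeps the earliest date from column 3,
# the latest date from column 4, and the sum of column 5.
group_min_max_sum() {
    awk -F, -v OFS=, '
    function key(d,    p) {                 # "5-Jun-2011" -> 20110605
        split(d, p, "-")
        return p[3] * 10000 + mon[p[2]] * 100 + p[1]
    }
    BEGIN {
        n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
        for (i = 1; i <= n; i++)
            mon[names[i]] = i
    }
    {
        k = $1 OFS $2
        # first row of a group, or a new extreme: remember the date string
        if (!(k in sum) || key($3) < key(lo[k])) lo[k] = $3
        if (!(k in sum) || key($4) > key(hi[k])) hi[k] = $4
        sum[k] += $5
    }
    END { for (k in sum) print k, lo[k], hi[k], sum[k] }' | sort
}
```

The final sort is there only because awk does not define the order of `for (k in sum)`.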
Extended GNU awk solution:
awk -F, 'function parse_date(d_str){
    split(d_str, d, "-");
    t = mktime(sprintf("%d %d %d 00 00 00", d[3], m[d[2]], d[1]));
    return t
}
BEGIN{ m["Jan"]=1; m["Feb"]=2; m["Mar"]=3; m["Apr"]=4; m["May"]=5; m["Jun"]=6;
       m["Jul"]=7; m["Aug"]=8; m["Sep"]=9; m["Oct"]=10; m["Nov"]=11; m["Dec"]=12;
}
{
    k=$1 SUBSEP $2;
    if (k in a){
        if (parse_date(a[k]["min"]) > parse_date($3)) { a[k]["min"]=$3 }
        if (parse_date(a[k]["max"]) < parse_date($4)) { a[k]["max"]=$4 }
    } else {
        a[k]["min"]=$3; a[k]["max"]=$4
    }
    a[k]["sum"]+= $5
}
END{
    for (i in a) {
        split(i, j, SUBSEP);
        print j[1], j[2], a[i]["min"], a[i]["max"], a[i]["sum"]
    }
}' OFS=',' file
The output:
I have a file that consists of a bunch of things but what I need are numbers between start and end strings: For example :
So, I need two arrays here one containing 23,34,22,12 and one containing 14,56,74. What's the best command to use?
If I only had one start and one end I would be able to use mapfile and awk to obtain that array, but there are many start and end markers in the file and I need to save all the arrays.
You can do it with sed.
sed -n '/start/{:a;N;/end/!ba;s/\n/, /g;s/, [^,][a-z][^,]*//Ig;s/start, //p}'
The code will iterate through all chunks between 'start' and 'end' lines.
It will remove all items with non-digit symbols and output each "array" on separate line.
Here is output from your data sample:
23, 34, 22, 12
14, 56, 74
You need to implement a small state machine - switching between in block and out of block:
awk '/end/{block = 0; print a; a = ""} (block) {a = a " " $0} /start/{block = 1}'
If at end, leave block, print and empty the accumulator. If in block, accumulate current line. If at start, mark that we're inside a block.
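The same three-rule state machine can be written so that each block comes out as one clean line with no leading space. The function name `print_blocks` is made up for illustration. The sketch assumes the markers are lines consisting of exactly `start` and `end`, which is stricter than the regex matches used above.

```shell
# Hypothetical helper: print_blocks < file
# Prints the lines between each start/end pair as one space-separated line.
print_blocks() {
    awk '
    $0 == "end"   { if (inside) print acc; inside = 0; next }   # leave block, flush
    inside        { acc = (acc == "" ? $0 : acc " " $0) }       # accumulate
    $0 == "start" { inside = 1; acc = "" }                      # enter block
    '
}
```

In bash, each output line could then be read into its own array, for example with `read -ra arr1` applied to the first line.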
You can tell awk to change the output file every time a new sequence starts
awk '/start/{i++;f=1;next} /end/{f=0} f{print > "arr"i}' file
For the example file, this will create files: arr1, arr2. Then you can create separated arrays with the lines of these files:
for i in $( ls arr* ); do readarray -t $i < $i; done
note: I have assumed that all lines between matching patterns are numeric and acceptable as in the example.
If you trust your input files enough for an eval:
$ cat tst.sh
eval $(
    awk '
        f {
            if ( /end/ ) {
                print "declare arr" ++cnt "=(" vals " )"
                vals = ""
                f = 0
            }
            else {
                vals = vals OFS $0
            }
        }
        /start/ { f = 1 }
    ' "$1"
)
printf "arr1:%s\n" "${arr1[@]}"
printf "arr2:%s\n" "${arr2[@]}"
$ ./tst.sh file
Check the quoting and all other shell gotchas...
I have file like:
I never know whether the A, B, C or D value is missing from a row. But I need to transform this file like:
So if any value is missing, just print a - mark. My plan is to have the same number of columns in every row for easy parsing. I would prefer an awk solution. Thank you for any advice or help.
My first try was:
awk '{gsub(/[,]/, "\t")}; BEGIN{ FS = OFS = "\t" } { for(i=1; i<=NF; i++) if($i ~ /^ *$/) $i = "-" }; {print $0}'
But then I noticed that some values are missing.
From my header I know that there are values A,B,C,D,E,F...
$ cat file.txt
$ perl -F, -le '@k=(A..F);
$op[0]=$F[0]; @op[1..6]=("-")x6;
$j=0; for($i=1;$i<=$#F;){ if($F[$i] =~ m/$k[$j++]=/){$op[$j]=$F[$i]; $i++} }
print join(",",@op)
' file.txt
-F, split input line on , and save to @F array
-l removes newline from input line, adds newline to output
@k=(A..F); initialize @k array with A, B, etc. up to F
$op[0]=$F[0]; @op[1..6]=("-")x6; initialize @op array with first element of @F and remaining six elements as -
for-loop iterates over @F array, if element matches with @k array element in corresponding index followed by =, change @op element
print join(",",@op) print the @op array with , as separator
Perl to the rescue!
You haven't specified how to obtain the header information, so in the following script, the @header array is populated directly.
The %to_idx hash maps the column names to their indices (A => 0, B => 1, etc.).
Each line is split into fields, each field is compared to the expected one ($next), and dashes are printed if needed. The same happens for missing trailing fields.
use warnings;
use strict;
my @header = qw( A B C D E F );
my %to_idx = map +($header[$_] => $_), 0 .. $#header;
open my $IN, '<', shift or die $!;
while (<$IN>) {
    chomp;
    my @fields = split /,/;
    print shift @fields;
    my $next = 0;
    for my $field (@fields) {
        my ($name, $value) = split /=/, $field;
        print ',-' x ($to_idx{$name} - $next);
        print ",$name=$value";
        $next = $to_idx{$name} + 1;
    }
    print ',-' x (1 + $#header - $next); # Missing trailing fields.
    print "\n"
}
Solution in TXR
(defstruct fill-missing nil
(hash (hash :equal-based))
(:postinit (self)
(each ((s self.strings))
(set [self.hash s] "-")))
(:method add (self str val)
(set [self.hash str] `@str=@val`))
(:method print (self stream)
(put-string `@{(mapcar self.hash self.strings) ","}` stream))))
@ (bind fm @(new fill-missing strings '#"A B C D E F"))
@{label},@(coll)@{sym /[^,=]+/}=@{val /[^,]+/}@(do fm.(add sym val))@(end)
@ (do (put-line `@label,@fm`))
$ txr missing.txr data
BEGIN {
    PROCINFO["sorted_in"]="@ind_str_asc"   # order for for(i in a)
    for(i=65;i<=90;i++)                    # create the whole alphabet to array a[]
        a[sprintf("%c", i)]                # you could read the header and use that as well
}
{
    split($0,b,",")                        # split record by ","
    printf "%s", b[1]                      # printf first element (AA, BB...)
    delete b[1]                            # get rid of it
    for(i in b)
        b[substr(b[i],1,1)]=b[i]           # take the first letter to use as index (A=12)
    for(i in a)                            # go thru alphabet and printf from b[]
        printf "%s%s", OFS, (i in b?b[i]:"-"); print ""
}
awk -v OFS=\, -f parsing.awk tbparsed.txt
It prints "-" for each letter not found in the record. If the data had a header, you could split to 2-D array b[NR] and change the for(i in a) to for(i in b[1]) ... printf ... b[NR][b[1][i]] ... and if you don't need the static first column, remove the first printf and delete.
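If the column names are known up front, as the question says they are from the header, the same fill-in can be sketched in portable awk with the header passed as an argument. The function name `fill_missing` is made up for illustration. The sketch assumes each row is a label followed by `name=value` fields, as in the answers above, and it silently drops any field whose name is not in the header.

```shell
# Hypothetical helper: fill_missing "A,B,C,D,E,F" < file
# Prints the row label, then one column per header name: the original
# name=value field if present, "-" otherwise.
fill_missing() {
    awk -v header="$1" '
    BEGIN { FS = OFS = ","; n = split(header, col, ",") }
    {
        split("", val)                      # reset the per-record lookup
        for (i = 2; i <= NF; i++) {
            eq = index($i, "=")             # position of the first "="
            if (eq > 1)
                val[substr($i, 1, eq - 1)] = $i
        }
        out = $1
        for (i = 1; i <= n; i++)
            out = out OFS ((col[i] in val) ? val[col[i]] : "-")
        print out
    }'
}
```

Because the columns follow the header argument rather than the alphabet, this works for multi-letter names and does not depend on gawk's PROCINFO["sorted_in"].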