change range of letters in a specific column - bash

I want to change in column 11 these characters
!"#$%&'()*+,-.\/0123456789:;<=>?@ABCDEFGHIJ
for these characters:
@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghi
so, if I have in column 11 000@!, it should be PPP_@. I tried awk:
awk '{a = gensub(/[@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghi]/, /[!\"\#$%&'\''()*+,-.\/0123456789:;<=>?@ABCDEFGHIJ]/, "g", $11); print a }' file.txt
but it does not work...

Try Perl.
perl -lane '$F[10] =~ y/!"#$%&'"'"'()*+,-.\/0-9:;<=>?@A-J/@A-Z[\\]^_`a-i/;
print join(" ", @F)'
I am assuming that by "column 11" you mean a string of several characters after the tenth run of successive whitespace, which is what the -a option splits on by default (basically to simulate Awk). Unfortunately, changes to the array @F do not show up in the output directly, so you have to reconstruct the output line from the (modified) @F, which will normalize the field delimiter to just a single space.

Just change f = 2 to f = 11:
$ cat tst.awk
BEGIN {
f = 2
old = "!\"#$%&'()*+,-.\\/0123456789:;<=>?@ABCDEFGHIJ"
new = "@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\\\]^_`abcdefghi"
n = length(old)
for (i=1; i<=n; i++) {
map[substr(old,i,1)] = substr(new,i,1)
}
}
{
n = length($f)
newStr = ""
for (i=1; i<=n; i++) {
oldChar = substr($f,i,1)
newStr = newStr (oldChar in map ? map[oldChar] : oldChar)
}
$f = newStr
print
}
$ cat file
a 000@! b
$ awk -f tst.awk file
a PPP_@ b
Note that you have to escape "s and \s in strings.
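A small demo of what those escapes expand to inside an awk string literal (toy string, not the question's data): `\\` becomes one backslash and `\"` one double quote, so the five visible escape characters below store a 5-character string.

```shell
# "a\\b\"c" in awk source is the 5-character string: a \ b " c
awk 'BEGIN { s = "a\\b\"c"; print s; print length(s) }'
# prints: a\b"c
# then:   5
```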

Related

How to remove columns from a file given the columns in another file in Linux?

Suppose I have a file A containing the column numbers that need to be removed (I really have over 500 columns in my input file fileB),
fileA:
2
5
And I want to remove those columns(2 and 5) from fileB:
a b c d e f
g h i j k l
in Linux to get:
a c d f
g i j l
what should I do? I found out that I could eliminate printing those columns with the code:
awk '{$2=$5="";print $0}' fileB
however, there are two problems with this approach: first, it does not really remove those columns, it just replaces them with empty strings (leaving the delimiters behind); second, instead of manually typing in the column numbers, how can I read them from another file?
Original Question:
Suppose I have a file A containing the column numbers that need to be removed,
file A:
223
345
346
567
And I want to remove those columns (223, 345, 346, 567) from file B in Linux, what should I do?
If your cut has the --complement option then you can do:
cut --complement -d ' ' -f "$(echo $(<fileA))" fileB
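That relies on cut accepting a blank-separated field list; if your cut is picky about that, you could build a comma-separated list instead. A sketch using the question's own file names (--complement itself is GNU-specific either way):

```shell
cd "$(mktemp -d)"                       # scratch directory for the sample files
printf '2\n5\n' > fileA                 # column numbers to drop, one per line
printf 'a b c d e f\ng h i j k l\n' > fileB
# paste -sd, turns "2\n5\n" into the field list "2,5"
cut --complement -d ' ' -f "$(paste -sd, fileA)" fileB
# prints:
# a c d f
# g i j l
```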
$ cat tst.awk
NR==FNR {
badFldNrs[$1]
next
}
FNR == 1 {
for (inFldNr=1; inFldNr<=NF; inFldNr++) {
if ( !(inFldNr in badFldNrs) ) {
out2in[++numOutFlds] = inFldNr
}
}
}
{
for (outFldNr=1; outFldNr<=numOutFlds; outFldNr++) {
inFldNr = out2in[outFldNr]
printf "%s%s", $inFldNr, (outFldNr<numOutFlds ? OFS : ORS)
}
}
$ awk -f tst.awk fileA fileB
a c d f
g i j l
One awk idea:
awk '
FNR==NR { skip[$1] ; next } # store field #s to be skipped
{ line="" # initialize output variable
pfx="" # first prefix will be ""
for (i=1;i<=NF;i++) # loop through the fields in this input line ...
if ( !(i in skip) ) { # if field # not mentioned in the skip[] array then ...
line=line pfx $i # add to our output variable
pfx=OFS # prefix = OFS for 2nd-nth fields to be added to output variable
}
if ( pfx == OFS ) # if we have something to print ...
print line # print output variable to stdout
}
' fileA fileB
NOTE: OP hasn't provided the input/output field delimiters; OP can add the appropriate FS/OFS assignments as needed
This generates:
a c d f
g i j l
Using awk
$ awk 'NR==FNR {col[$1]=$1;next} {for(i=1;i<=NF;++i) if (i != col[i]) printf("%s ", $i); printf("\n")}' fileA fileB
a c d f
g i j l

awk or bash for split lines

I would like to split a csv file which looks like this:
a|b|1,2,3
c|d|4,5
e|f|6,7,8
the goal is this format:
a|b|1
a|b|2
a|b|3
c|d|4
c|d|5
e|f|6
e|f|7
e|f|8
How can I do this in bash or awk?
With bash:
while IFS="|" read -r a b c; do for n in ${c//,/ }; do echo "$a|$b|$n"; done; done <file
Output:
a|b|1
a|b|2
a|b|3
c|d|4
c|d|5
e|f|6
e|f|7
e|f|8
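In case the `${c//,/ }` expansion above is unfamiliar, here it is in isolation: it replaces every comma with a space, and the unquoted expansion then word-splits into the loop's items.

```shell
c='1,2,3'
# ${c//,/ } -> "1 2 3"; left unquoted so the for loop sees three words
for n in ${c//,/ }; do printf '%s\n' "$n"; done
# prints: 1
#         2
#         3
```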
$ cat hm.awk
{
s = $0; p = ""
while (i = index(s, "|")) { # `p': up to the last '|'
# `s': the rest
p = p substr(s, 1 , i)
s = substr(s, i + 1)
}
n = split(s, a, ",")
for (i = 1; i <= n; i++)
print p a[i]
}
Usage:
awk -f hm.awk file.csv
In awk (using split):
$ awk '{n=split($0,a,"[|,]");for(i=3;i<=n;i++) print a[1] "|" a[2] "|" a[i]}' file
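The same one-liner, runnable as-is with the question's sample piped in; split() on the regexp "[|,]" breaks each line at both delimiters, so a[1] and a[2] are the fixed prefix and a[3..n] are the comma-separated values.

```shell
printf 'a|b|1,2,3\nc|d|4,5\n' |
awk '{n=split($0,a,"[|,]");for(i=3;i<=n;i++) print a[1] "|" a[2] "|" a[i]}'
# prints: a|b|1
#         a|b|2
#         a|b|3
#         c|d|4
#         c|d|5
```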
with perl
$ cat ip.csv
a|b|1,2,3
c|d|4,5
e|f|6,7,8
$ perl -F'\|' -lane 'print join "|", @F[0..1],$_ foreach split /,/,$F[2]' ip.csv
a|b|1
a|b|2
a|b|3
c|d|4
c|d|5
e|f|6
e|f|7
e|f|8
splits input line on | into @F array
then for every comma-separated value in 3rd column, print in desired format
For a generic last column,
perl -F'\|' -lane 'print join "|", @F[0..$#F-1],$_ foreach split /,/,$F[-1]' ip.csv

find if two consecutive lines are different and where

How do I find whether two consecutive lines of a fixed-width file differ, and at which positions?
sample file:
cat test.txt
1111111111111111122211111111111111
1111111111111111132211111111111111
output:
it should inform the user that there is a difference between the two lines and that the position of the difference is the 18th character (as in the above example).
It would be really helpful if it could list all the positions in case of multiple variations. For example:
11111111111111111211113111
11111111111111111311114111
Here it should say: difference spotted at the 18th and 23rd characters.
I was trying things in following lines, but seems lost.
while read line
do
echo $line |sed 's/./ &/g' |xargs -n1 # Not able to apply diff (stupid try)
done <test.txt
Perl to the rescue:
$ echo '11131111111111111211113111
11111111111111111211114111' \
| perl -le '$d = <> ^ <>;
print pos $d while $d =~ /[^\0]/g'
4
23
It XORs the two input strings and reports all positions where the result isn't the null byte, i.e. where the strings were different.
You can use an empty field separator to make each character a field in awk and compare entries of every even record with odd numbered record:
awk 'BEGIN{ FS="" } NR%2 {
split($0, a)
next
}
{
print "line # ", NR
for (i=1; i<=NF; i++)
if ($i != a[i])
print "difference spotted in position:", i
}' test.txt
line # 2
difference spotted in position: 18
line # 4
difference spotted in position: 18
difference spotted in position: 23
Where input data is:
cat test.txt
1111111111111111122211111111111111
1111111111111111132211111111111111
11111111111111111211113111
11111111111111111311114111
PS: It will only work on awk versions that split records into chars when FS is null, eg GNU awk, OSX awk etc.
$ cat tst.awk
{ curr = $0 }
(NR%2)==0 {
currLgth = length(curr)
prevLgth = length(prev)
maxLgth = (currLgth > prevLgth ? currLgth : prevLgth)
print "Comparing:"
print prev
print curr
for (i=1; i<=maxLgth; i++) {
prevChar = substr(prev,i,1)
currChar = substr(curr,i,1)
if ( prevChar != currChar ) {
printf "Difference: char %d line %d = \"%s\", line %d = \"%s\"\n", i, NR-1, prevChar, NR, currChar
}
}
print ""
}
{ prev = curr }
$ cat file
1111111111111111122211111111111111
1111111111111111132211111111111111
11111111111111111111111111
11111111111111111111111
$ awk -f tst.awk file
Comparing:
1111111111111111122211111111111111
1111111111111111132211111111111111
Difference: char 18 line 1 = "2", line 2 = "3"
Comparing:
11111111111111111111111111
11111111111111111111111
Difference: char 24 line 3 = "1", line 4 = ""
Difference: char 25 line 3 = "1", line 4 = ""
Difference: char 26 line 3 = "1", line 4 = ""

Unix/Bash: Uniq on a cell

I have a tab-separated fileA where the 12th column (starting from 1) contains several comma-separated identifiers. Some of them in the same row, however, can occur more than once:
GO:0042302, GO:0042302, GO:0042302
GO:0004386,GO:0005524,GO:0006281, GO:0004386,GO:0005524,GO:0006281
....
....
(some with a white-space after the comma, some without).
I would like to only get the unique identifiers and remove the multiples for each row in the 12th column:
GO:0042302
GO:0004386,GO:0005524,GO:0006281
....
....
Here is what I have so far:
for row in `fileA`
do
cut -f12 $row | sed "s/,/\n/" | sort | uniq | paste fileA - | \
awk 'BEGIN {OFS=FS="\t"}{print $1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $13}'
done > out
The idea was to go over each row at a time, cut out the 12th column, replace all commas with newlines and then sort and take uniq to get rid of duplicates, paste it back and print the columns in the right order, skipping the original identifier column.
However, this does not seem to work. Any ideas?
Just for completeness, and because I personally prefer Perl over Awk for this sort of thing, here's a Perl one-liner solution:
perl -F'\t' -le '%u=();@k=split/,/,$F[11];@u{@k}=@k;$F[11]=join",",sort
keys%u;print join"\t",@F'
Explanation:
-F'\t' Loop over input lines, splitting each one into fields at tabs
-l automatically remove newlines from input and append on output
-e get code to execute from the next argument instead of standard input
%u = (); # clear out the hash variable %u
@k = split /,/, $F[11]; # Split 12th field (1st is 0) on comma into array @k
@u{@k} = @k; # Copy the contents of @k into %u as key/value pairs
Because hash keys are unique, that last step means that the keys of %u are now a deduplicated copy of @k.
$F[11] = join ",", sort keys %u; # replace the 12th field with the sorted unique list
print join "\t", @F; # and print out the modified line
If I understand you correctly, then with awk:
awk -F '\t' 'BEGIN { OFS = FS } { delete b; n = split($12, a, /, */); $12 = ""; for(i = 1; i <= n; ++i) { if(!(a[i] in b)) { b[a[i]]; $12 = $12 a[i] "," } } sub(/,$/, "", $12); print }' filename
This works as follows:
BEGIN { OFS = FS } # output FS same as input FS
{
delete b # clear dirty table from last pass
n = split($12, a, /, */) # split 12th field into tokens,
$12 = "" # then clear it out for reassembly
for(i = 1; i <= n; ++i) { # wade through those tokens
if(!(a[i] in b)) { # those that haven't been seen yet:
b[a[i]] # remember that they were seen
$12 = $12 a[i] "," # append to result
}
}
sub(/,$/, "", $12) # remove trailing comma from resulting field
print # print the transformed line
}
The delete b; has been POSIX-conforming for only a short while, so if you're working with an old, old awk and it fails for you, see @MarkReed's comment for another way that ancient awks should accept.
Using field 2 instead of field 12:
$ cat tst.awk
BEGIN{ FS=OFS="\t" }
{
split($2,f,/ *, */)
$2 = ""
delete seen
for (i=1;i in f;i++) {
if ( !seen[f[i]]++ ) {
$2 = $2 (i>1?",":"") f[i]
}
}
print
}
$ cat file
a,a,a GO:0042302, GO:0042302, GO:0042302 b,b,b
c,c,c GO:0004386,GO:0005524,GO:0006281, GO:0004386,GO:0005524,GO:0006281 d,d,d
$ awk -f tst.awk file
a,a,a GO:0042302 b,b,b
c,c,c GO:0004386,GO:0005524,GO:0006281 d,d,d
If your awk doesn't support delete seen you can use split("",seen).
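A quick check that split("", seen) really does empty the array — it "splits" the empty string into it, which is the traditional portable idiom in awks lacking whole-array delete:

```shell
awk 'BEGIN {
  b["x"]                   # populate the array
  split("", b)             # portable whole-array clear
  s = ("x" in b) ? "still there" : "cleared"
  print s
}'
# prints: cleared
```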
Using this awk:
awk -F '\t' -v OFS='\t' '{
delete seen; s="";
split($12, a, /[,; ]+/);
for (i=1; i<=length(a); i++) {
if (!(a[i] in seen)) {
seen[a[i]];
s=sprintf("%s%s,", s, a[i])
}
}
sub(/,$/, "", s);
$12=s} 1' file
GO:0042302
GO:0004386,GO:0005524,GO:0006281
In your example data, the comma followed by a space is the delimiter of the 12th field. Every subfield after that is merely a repeat of the first field. The subfields appear to already be in sorted order.
GO:0042302, GO:0042302, GO:0042302
^^^dup1^^^ ^^^dup2^^^
GO:0004386,GO:0005524,GO:0006281, GO:0004386,GO:0005524,GO:0006281
^^^^^^^^^^^^^^^dup1^^^^^^^^^^^^^
Based on that, you could simply keep the first of the subfields and toss the rest:
awk -F"\t" '{sub(/, .*/, "", $12)} 1' fileA
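A runnable check of that shortcut, reduced to just the 12th-field contents (one column here instead of a 12-column tab-separated line): sub() deletes everything from the first ", " onward, leaving only the first subfield group.

```shell
printf 'GO:0042302, GO:0042302, GO:0042302\n' |
awk '{sub(/, .*/, "", $0)} 1'
# prints: GO:0042302
```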
If instead, you can have different sets of repeated subfields, where keys are not sorted like this:
GO:0042302, GO:0042302, GO:0042302, GO:0062122,GO:0055000, GO:0055001, GO:0062122,GO:0055000
GO:0004386,GO:0005524,GO:0006281, GO:0005525, GO:0004386,GO:0005524,GO:0006281
If you were stuck with a default MacOS awk you could introduce a sort/uniq functions in an awk executable script:
#!/usr/bin/awk -f
BEGIN {FS="\t"}
{
c = uniq(a, split($12, a, /, |,/))
sort(a, c)
s = a[1]
for(i=2; i<=c; i++) { s = s "," a[i] }
$12 = s
}
1 # print out the modified line
# take an indexed arr as from split and de-dup it
function uniq(arr, len, i, uarr) {
for(i=len; i>=1; i--) { uarr[arr[i]] }
delete arr
for(k in uarr) { arr[++i] = k }
return( i )
}
# slightly modified from
# http://rosettacode.org/wiki/Sorting_algorithms/Bubble_sort#AWK
function sort(arr, len, haschanged, tmp, i)
{
haschanged = 1
while( haschanged==1 ) {
haschanged = 0
for(i=1; i<=(len-1); i++) {
if( arr[i] > arr[i+1] ) {
tmp = arr[i]
arr[i] = arr[i + 1]
arr[i + 1] = tmp
haschanged = 1
}
}
}
}
If you had GNU-awk, I think you could swap out the sort(a, c) call with asort(a), and drop the bubble-sort local function completely.
I get the following for the 12th field:
GO:0042302,GO:0055000,GO:0055001,GO:0062122
GO:0004386,GO:0005524,GO:0005525,GO:0006281

Storing a multidimensional array using awk

My input file is
a|b|c|d
w|r|g|h
I want to store the values in an array like
a[1,1] = a
a[1,2] = b
a[2,1] = w
Kindly suggest any way to achieve this in awk/bash.
I have two input files and need to do field-level validation.
Like this
awk -F'|' '{for(i=1;i<=NF;i++)a[NR,i]=$i}
END {print a[1,1],a[2,2]}' file
Output
a r
This parses the file into an awk array:
awk -F \| '{ for(i = 1; i <= NF; ++i) a[NR,i] = $i }' filename
You'll have to add code that uses the array for this to be of any use, of course. Since you didn't say what you wanted to do with the array once it is complete (after the pass over the file), this is all the answer I can give you.
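For instance, here's a sketch of one possible use, dumping every stored cell in exactly the a[row,col] = value notation from the question (sample data is the question's own):

```shell
printf 'a|b|c|d\nw|r|g|h\n' |
awk -F'|' '{ if (NF > maxnf) maxnf = NF          # widest row seen
             for (i = 1; i <= NF; i++) a[NR,i] = $i }
           END { for (r = 1; r <= NR; r++)
                   for (c = 1; c <= maxnf; c++)
                     print "a[" r "," c "] = " a[r,c] }'
# prints: a[1,1] = a
#         a[1,2] = b
#         ... through a[2,4] = h
```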
You're REALLY going to want to get/use gawk 4.* if you're using multi-dimensional arrays as that's the only awk that supports them. When you write:
a[1,2]
in any awk you are actually creating a pseudo-multi-dimensional array, which is a 1-dimensional array indexed by the string formed by the concatenation of
1 SUBSEP 2
where SUBSEP is a control char that's unlikely to appear in your input.
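You can see that mechanics directly: a (1,2) subscript and the (1,2) in a membership test both go through the same SUBSEP concatenation, so they agree.

```shell
awk 'BEGIN {
  a[1,2] = "x"                   # stored under the key 1 SUBSEP 2
  if ((1,2) in a) print "found via (1,2) in a"
}'
# prints: found via (1,2) in a
```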
In GNU awk 4.* you can do:
a[1][2]
(note the different syntax) and that populates an actual multi-dimensional array.
Try this to see the difference:
$ cat tst.awk
BEGIN {
SUBSEP=":" # just to make it visible when printing
oneD[1,2] = "a"
oneD[1,3] = "b"
twoD[1][2] = "c"
twoD[1][3] = "d"
for (idx in oneD) {
print "oneD", idx, oneD[idx]
}
print ""
for (idx1 in twoD) {
print "twoD", idx1
for (idx2 in twoD[idx1]) { # you CANNOT do this with oneD
print "twoD", idx1, idx2, twoD[idx1][idx2]
}
}
}
$ awk -f tst.awk
oneD 1:2 a
oneD 1:3 b
twoD 1
twoD 1 2 c
twoD 1 3 d
