I need to take all numbers that appear within a book index and add 22 to them. The index data looks like this (for example):
Ubuntu, 120, 143, 154
Yggdrasil, 144, 170-171
Yood, Charles, 6
Young, Bob, 178-179
Zawinski, Jamie, 204
I am trying to do this with awk using this script:
#!/bin/bash
filename="index"
while read -r line
do
echo $line | awk -v n=22 '{printf($1)}{printf(" " )}{for(i=2;i<=NF;i++)printf(i%2?$i+n:$i+n)", "};{print FS}'
done < "$filename"
It comes close to working but has the following problems:
It doesn't work for page numbers that are part of a range (e.g., "170-171") rather than individual numbers.
For entries where the index term is more than one word (e.g., "X Windows" and "Young, Bob"), the output displays only the first word of the term; the second word ends up being output as the number 22. (I know why this is happening: my awk command treats $2 as a number, and if it's a string it assumes a value of 0.) But I can't figure out how to solve it.
Disclosure: I'm by no means an awk expert. I'm just looking for a quick way to modify the page numbers in my index (which is due in a few days) because my publisher decided to change the pagination in the manuscript after I had already prepared the index. awk seems like the best tool for the job to me, but I'm open to other suggestions if someone has a better idea. Basically, I just need a way to say "take all numbers in this file and add 22 to them; don't change anything else."
With GNU awk for multi-char RS and RT:
$ awk -v RS='[0-9]+' '{ORS=(RT=="" ? "" : RT+22)}1' file
Ubuntu, 142, 165, 176
Yggdrasil, 166, 192-193
Yood, Charles, 28
Young, Bob, 200-201
Zawinski, Jamie, 226
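If gawk isn't available, the same idea works in any POSIX awk by walking each line with match(); a sketch:
awk -v n=22 '{
  out = ""
  while (match($0, /[0-9]+/)) {
    # copy the text before the number, then the number plus n
    out = out substr($0, 1, RSTART-1) (substr($0, RSTART, RLENGTH) + n)
    $0 = substr($0, RSTART + RLENGTH)
  }
  print out $0
}' file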
You could use Perl, for example:
perl -plE 's/\b(\d+)\b/$1+22/ge' index
output
Ubuntu, 142, 165, 176
Yggdrasil, 166, 192-193
Yood, Charles, 28
Young, Bob, 200-201
Zawinski, Jamie, 226
But it isn't awk.
You can use this GNU awk command:
awk 'BEGIN {FS="\f";RS="(, |-|\n)";} /^[0-9]+$/ {$1 = $1 +22} { printf("%s%s", $1, RT);}' yourfile
There is a bit of abuse of FS and RS here to get awk to treat each token on each line as a record of its own, so you don't have to loop over the fields and test whether each one is numeric:
RS="(, |-|\n)" configures ", ", dash and newline as record separators
22 is added to "records" consisting only of digits
the printf prints each token together with its RT to reconstruct the line from the file
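To see the record splitting in action (multi-character RS is a GNU awk extension), you can number the tokens it produces:
$ awk 'BEGIN{RS="(, |-|\n)"} {print NR": "$0}' file | head -6
1: Ubuntu
2: 120
3: 143
4: 154
5: Yggdrasil
6: 144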
Consider using the following awk script (add_number.awk):
BEGIN{ FS=OFS=", "; if (!n) n=22; } # if `n` variable hasn't been passed the default is 22
{
for (i=1;i<=NF;i++) { # traversing fields
if ($i~/^[0-9]+$/) { # if a field contains a single number
$i+=n;
}
else if (match($i, /^([0-9]+)-([0-9]+)$/, arr)) { # if a field contains `range of numbers`
$i = (arr[1]+n) "-" (arr[2]+n);
}
}
print;
}
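Note that the three-argument match() used for the range branch is a GNU awk extension; for a POSIX awk, a sketch of a portable replacement for that branch:
else if ($i ~ /^[0-9]+-[0-9]+$/) { # if a field contains a range like `170-171`
split($i, r, "-");
$i = (r[1]+n) "-" (r[2]+n);
}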
Usage:
awk -v n=22 -f add_number.awk testfile
The output:
Ubuntu, 142, 165, 176
Yggdrasil, 166, 192-193
Yood, Charles, 28
Young, Bob, 200-201
Zawinski, Jamie, 226
So I need to subset 10 characters from all strings in a particular column of a file, randomly and without repetition (i.e. I want to avoid drawing a character from any given index more than once).
For the sake of simplicity, let's say I have the following string:
ABCDEFGHIJKLMN
For which I should obtain, for example, this result:
DAKLFCHGBI
Notice that no letter occurs twice, which means that no position is extracted more than once.
For this other string:
CCCCCCCCCCCCGG
Analogously, I should never find more than two "G" characters in the output (otherwise it would mean that a "G" character has been sampled more than once), e.g.:
CCGCCCCCCC
Or, in other words, I want to shuffle all characters from each string, and keep the first 10. This can be easily achieved in bash using:
echo "ABCDEFGHIJKLMN" | fold -w1 | shuf -n10 | tr -d '\n'
However, since I need to perform this many times on dozens of files with over a hundred thousand lines each, this is way too slow. So looking around, I've arrived at the following awk code, which seems to work fine whenever the strings are passed to it one by one, e.g.:
awk '{srand(); len=length($1); for(i=1;i<=10;) {k=int(rand()*len)+1; if(!(k in N)) {N[k]; printf "%s", substr($1,k,1); i++}} print ""}' <(echo "ABCDEFGHIJKLMN")
But when I input the following file with a string on each row, awk hangs and the output gets truncated on the second line:
echo "ABCDEFGHIJKLMN" > file.txt
echo "CCCCCCCCCCCCGG" >> file.txt
awk '{srand(); len=length($1); for(i=1;i<=10;) {k=int(rand()*len)+1; if(!(k in N)) {N[k]; printf "%s", substr($1,k,1); i++}} print ""}' file.txt
This other version of the code which samples characters from the string with repetition works fine, so it looks like the issue lies in the part which populates the N array, but I'm not proficient in awk so I'm a bit stuck:
awk '{srand(); len=length($1); for(i=1;i<=10;i++) {k=int(rand()*len)+1; printf "%s", substr($1,k,1)} print ""}'
Anyone can help?
In case this matters: my actual file is more complex than the examples provided here, with several other columns, and unlike the ones in this example, its strings may have different lengths.
Thanks in advance for your time :)
EDIT:
As mentioned in the comments, I managed to make it work by removing the N array (so that it resets before processing each row):
awk 'BEGIN{srand()} {len=length($1); for(i=1;i<=10;) {k=int(rand()*len)+1; if(!(k in N)) {N[k]; printf "%s", substr($1,k,1); i++}} split("", N); print ""}' file.txt
Do note however that if the string in $1 is shorter than 10, this will get stuck in an infinite loop, so make sure that all strings are always longer than the subset target size. The alternative solution provided by Andre Wildberg in the comments doesn't carry this issue.
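One way to guard against the infinite loop is to cap the sample size at the string length; a sketch:
awk -v m=10 'BEGIN{srand()} {
  len = length($1)
  k = (len < m ? len : m)   # never request more positions than the string has
  for (i=1; i<=k;) {
    p = int(rand()*len)+1
    if (!(p in N)) { N[p]; printf "%s", substr($1,p,1); i++ }
  }
  split("", N)              # reset the drawn positions between rows
  print ""
}' file.txt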
I would harness GNU AWK for this task in the following way. Let file.txt content be
ABCDEFGHIJKLMN
CCCCCCCCCCCCGG
then
awk 'function comp_func(i1, v1, i2, v2){return rand()-0.5}BEGIN{FPAT=".";PROCINFO["sorted_in"]="comp_func"}{s="";patsplit($0,arr);for(i in arr){s = s arr[i]};print substr(s,1,10)}' file.txt
might give output
NGLHCKEIMJ
CCCCCCCCGG
Explanation: I use a custom array-traversal-control function which randomly decides which element should be considered greater; -0.5 is used because rand() gives values from 0 to 1. For each line, the array arr is populated with the characters of the line, then traversed in random order to build the string s, i.e. the characters shuffled, and substr takes the first 10 of them. You might elect to add a counter that terminates the for loop early if your lines are very long compared to the number of characters to select, as in the sketch below.
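For instance, a sketch of that early-exit counter (also seeding srand in BEGIN so repeated runs differ):
awk 'function comp_func(i1, v1, i2, v2){ return rand()-0.5 }
BEGIN{ srand(); FPAT="."; PROCINFO["sorted_in"]="comp_func" }
{
  s=""; n=0
  patsplit($0, arr)        # one array element per character
  for (i in arr) {         # traversal order is randomized by comp_func
    s = s arr[i]
    if (++n == 10) break   # stop once 10 characters are collected
  }
  print s
}' file.txt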
(tested in GNU Awk 5.0.1)
Iteratively construct a substring of the remaining letters.
Tested with
awk version 20121220
GNU Awk 4.2.1, API: 2.0
GNU Awk 5.2.1, API 3.2
mawk 1.3.4 20200120
% awk -v size=10 'BEGIN{srand()} {n=length($0); a=$0; x=0;
for(i=1; i<=n; i++){x++; na=length(a); rnd = int(rand() * na + 1)
printf("%s", substr(a, rnd, 1))
a=substr(a, 1, rnd - 1)""substr(a, rnd + 1, na)
if(x >= size){break}}
print ""}' file.txt
CJFMKHNDLA
CGCCCCCCCC
If you run this repeatedly in quick succession, check that srand works the way you expect in your version of awk; it is typically seeded from the current time in seconds, so back-to-back runs can repeat output. If in doubt, seed it from $RANDOM or, better, /dev/urandom.
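For example, one way to seed awk explicitly from /dev/urandom (a sketch; od reads 4 bytes as an unsigned integer):
$ seed=$(od -An -N4 -tu4 /dev/urandom | tr -d ' ')
$ awk -v seed="$seed" 'BEGIN{srand(seed); print rand()}'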
If you don't need to stay strictly within awk, then jot makes it super easy.
Say you want 20 random characters between
"A" (ASCII 65) and "N" (ASCII 78), including repeats of the same characters:
jot -s '' -c -r 20 65 78
ANNKECLDMLMNCLGDIGNL
I have a tab separated text file, call it input.txt
cat input.txt
Begin Annotation Diff End Begin,End
6436687 >ENST00000422706.5|ENSG00000100342.21|OTTHUMG00000030427.9|-|APOL1-205|APOL1|2901|protein_coding| 50 6436736 6436687,6436736
6436737 >ENST00000426053.5|ENSG00000100342.21|OTTHUMG00000030427.9|-|APOL1-206|APOL1|2808|protein_coding| 48 6436784 6436737,6436784
6436785 >ENST00000319136.8|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000075315.5|APOL1-201|APOL1|3000|protein_coding| 51 6436835 6436785,6436835
6436836 >ENST00000422471.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319151.1|APOL1-204|APOL1|561|nonsense_mediated_decay| 11 6436846 6436836,6436846
6436847 >ENST00000475519.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319153.1|APOL1-212|APOL1|600|retained_intron| 11 6436857 6436847,6436857
6436858 >ENST00000438034.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319152.2|APOL1-210|APOL1|566|protein_coding| 11 6436868 6436858,6436868
6436869 >ENST00000439680.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319252.1|APOL1-211|APOL1|531|nonsense_mediated_decay| 10 6436878 6436869,6436878
6436879 >ENST00000427990.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319154.2|APOL1-207|APOL1|624|protein_coding| 12 6436890 6436879,6436890
6436891 >ENST00000397278.8|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319100.4|APOL1-202|APOL1|2795|protein_coding| 48 6436938 6436891,6436938
6436939 >ENST00000397279.8|ENSG00000100342.21|OTTHUMG00000030427.9|-|APOL1-203|APOL1|1564|protein_coding| 28 6436966 6436939,6436966
6436967 >ENST00000433768.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319253.2|APOL1-209|APOL1|541|protein_coding| 11 6436977 6436967,6436977
6436978 >ENST00000431184.1|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319254.1|APOL1-208|APOL1|550|nonsense_mediated_decay| 11 6436988 6436978,6436988
Using the information in input.txt I want to obtain information from a file called Other_File.fa. This file is an annotation file filled with ENST#'s (transcript IDs) and sequences of A's,T's,C's,and G's. I want to store the sequence in a file called Output.log (see example below) and I want to store the command used to retrieve the text in a file called Input.log (see example below).
I have tried to do this using awk and cut so far using a for loop. This is the code I have tried.
for line in `awk -F "\\t" 'NR != 1 {print substr($2,2,17)"#"$5}' input.txt`
do
transcript=`cut -d "#" -f 1 $line`
range=`cut -d "#" -f 2 $line` #Range is the string location in Other_File.fa
echo "Our transcript is ${transcript} and our range is ${range}" >> Input.log
sed -n '${range}' Other_File.fa >> Output.log
done
Here is an example of the 11 lines between ENST00000433768.5 and ENST00000431184.1 in Other_File.fa.
grep -A 11 ENST00000433768.5 Other_File.fa
>ENST00000433768.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319253.2|APOL1-209|APOL1|541|protein_coding|
ATCCACACAGCTCAGAACAGCTGGATCTTGCTCAGTCTCTGCCAGGGGAAGATTCCTTGG
AGGAGCACACTGTCTCAACCCCTCTTTTCCTGCTCAAGGAGGAGGCCCTGCAGCGACATG
GAGGGAGCTGCTTTGCTGAGAGTCTCTGTCCTCTGCATCTGGATGAGTGCACTTTTCCTT
GGTGTGGGAGTGAGGGCAGAGGAAGCTGGAGCGAGGGTGCAACAAAACGTTCCAAGTGGG
ACAGATACTGGAGATCCTCAAAGTAAGCCCCTCGGTGACTGGGCTGCTGGCACCATGGAC
CCAGGCCCAGCTGGGTCCAGAGGTGACAGTGGAGAGCCGTGTACCCTGAGACCAGCCTGC
AGAGGACAGAGGCAACATGGAGGTGCCTCAAGGATCAGTGCTGAGGGTCCCGCCCCCATG
CCCCGTCGAAGAACCCCCTCCACTGCCCATCTGAGAGTGCCCAAGACCAGCAGGAGGAAT
CTCCTTTGCATGAGAGCAGTATCTTTATTGAGGATGCCATTAAGTATTTCAAGGAAAAAG
T
>ENST00000431184.1|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319254.1|APOL1-208|APOL1|550|nonsense_mediated_decay|
The range value in input.txt for this transcript is 6436967,6436977. In my file Input.log for this transcript I hope to get
Our transcript is ENST00000433768.5 and our range is 6436967,6436977
And in Output.log for this transcript I hope to get
>ENST00000433768.5|ENSG00000100342.21|OTTHUMG00000030427.9|OTTHUMT00000319253.2|APOL1-209|APOL1|541|protein_coding|
ATCCACACAGCTCAGAACAGCTGGATCTTGCTCAGTCTCTGCCAGGGGAAGATTCCTTGG
AGGAGCACACTGTCTCAACCCCTCTTTTCCTGCTCAAGGAGGAGGCCCTGCAGCGACATG
GAGGGAGCTGCTTTGCTGAGAGTCTCTGTCCTCTGCATCTGGATGAGTGCACTTTTCCTT
GGTGTGGGAGTGAGGGCAGAGGAAGCTGGAGCGAGGGTGCAACAAAACGTTCCAAGTGGG
ACAGATACTGGAGATCCTCAAAGTAAGCCCCTCGGTGACTGGGCTGCTGGCACCATGGAC
CCAGGCCCAGCTGGGTCCAGAGGTGACAGTGGAGAGCCGTGTACCCTGAGACCAGCCTGC
AGAGGACAGAGGCAACATGGAGGTGCCTCAAGGATCAGTGCTGAGGGTCCCGCCCCCATG
CCCCGTCGAAGAACCCCCTCCACTGCCCATCTGAGAGTGCCCAAGACCAGCAGGAGGAAT
CTCCTTTGCATGAGAGCAGTATCTTTATTGAGGATGCCATTAAGTATTTCAAGGAAAAAG
T
But I am getting the following error, and I am unsure as to why or how to fix it.
cut: ENST00000433768.5#6436967,6436977: No such file or directory
cut: ENST00000433768.5#6436967,6436977: No such file or directory
Our transcript is and our range is
My thought was that each line from the awk would be read as a string, and then cut could split the string along the "#" symbol I added; but cut is treating each line as a file name and throwing an error when it can't locate that file in my directory.
Thanks.
EDIT2: This is a generic solution which compares the two files (input and other_file.fa) and prints whichever range is found, on whichever line it is found; e.g. if the range values are found on line 300 but the range says to print lines 1 to 20, it will work in that case also. Note that this calls the system command, which in turn calls sed (like your use of a range within sed); there are other ways too, such as loading the whole Input_file into an array and then printing, but I am going with this one here. Fair warning: this is not tested with huge files.
awk -F'[>| ]' '
FNR==NR{
arr[$2]=$NF
next
}
($2 in arr){
split(arr[$2],lineNum,",")
print arr[$2]
start=lineNum[1]
end=lineNum[2]
print "sed -n \047" start","end"p \047 " FILENAME
system("sed -n \047" start","end"p\047 " FILENAME)
start=end=0
}
' file1 FS="[>|]" other_file.fa
EDIT: With OP's edited samples, please try the following to print lines based on the other file. It assumes that the lines a range refers to always come after the line on which the range values are found (e.g. the range values are found on the 3rd line and the range is lines 4 to 10).
awk -F'[>| ]' '
FNR==NR{
arr[$2]=$NF
next
}
($2 in arr){
split(arr[$2],lineNum,",")
start=lineNum[1]
end=lineNum[2]
}
FNR>=start && FNR<=end{
print
if(FNR==end){
start=end=0
}
}
' file1 FS="[>|]" other_file.fa
You need not do this with a for loop that calls the awk program once for each line. It can be done in a single awk, considering that you only have to print the values. Written and tested with your shown samples.
awk -F'[>| ]' 'FNR>1{print "Our transcript is:"$3" and our range is:"$NF}' Input_file
NOTE: This will print the transcript and range values for each line of your Input_file; in case you want to further perform some operation with those values, please do mention it.
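For example, with the input.txt shown above, the first data line would print:
Our transcript is:ENST00000422706.5 and our range is:6436687,6436736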
I was trying to solve one of my old assignments and I am stuck on this one. Can anyone help me?
There is a file called "datafile". This file has names of some friends and their
ages. But unfortunately, the names are not in the correct format. They should be
lastname, firstname
But, by mistake they are firstname,lastname
The task of the problem is writing a shell script called fix_datafile
to correct the problem, and sort the names alphabetically. The corrected file is called datafile.fix.
Please make sure the original structure of the file should be kept untouched.
The following is the sample of datafile.fix file:
#personal information
#******** Name ********* ***** age *****
Alexanderovich,Franklin 47
Amber,Christine 54
Applesum,Franky 33
Attaboal,Arman 18
Balad,George 38
Balad,Sam 19
Balsamic,Shery 22
Bojack,Steven 33
Chantell,Alex 60
Doyle,Jefry 45
Farland,Pamela 40
Handerman,jimmy 23
Kashman,Jenifer 25
Kasting,Ellen 33
Lorux,Allen 29
Mathis,Johny 26
Maxter,Jefry 31
Newton,Gerisha 40
Osama,Franklin 33
Osana,Gabriel 61
Oxnard,George 20
Palomar,Frank 24
Plomer,Susan 29
Poolank,John 31
Rochester,Benjami 40
Stanock,Verona 38
Tenesik,Gabriel 29
Whelsh,Elsa 21
If you can use awk (I suppose you can), then here's a script which does what you need:
#!/bin/bash
RESULT_FILE_NAME="datafile.fix"
head -4 datafile > "$RESULT_FILE_NAME"
tail -n +5 datafile | awk -F"[, ]" '{if(!$2){print ""}else{print($2","$1, $3)}}' >> "$RESULT_FILE_NAME"
Passing -F"[, ]" lets awk split columns on both , and space, and all that remains is to print the columns in the needed format. The downsides are that we need an if statement to preserve empty lines, and that the file header has to be treated separately.
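The assignment also asks for the names to be sorted alphabetically, which the script above doesn't do; a minimal tweak (assuming blank lines occur only in the header) is to pipe the awk output through sort before appending:
tail -n +5 datafile | awk -F"[, ]" '{print($2","$1, $3)}' | sort >> "$RESULT_FILE_NAME"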
Another option is using sed:
sed -E 's/([a-zA-Z]+),([a-zA-Z]+) ([0-9]+)/\2,\1 \3/g' datafile > datafile.fix
The downside is that it requires a regex that is not as obvious as the awk syntax.
awk -F'[, ]' '
!/^$/ && !/^#/ {
first=$1;
last=$2;
map[last][first]=last","first" "$3
}
END {
PROCINFO["sorted_in"]="#ind_str_asc";
for (i in map) {
for (j in map[i])
{
print map[i][j]
}
}
}' namesfile > datafile.fix
One liner:
awk -F'[, ]' '!/^$/ && !/^#/ { first=$1;last=$2;map[last][first]=last","first" "$3 } END { PROCINFO["sorted_in"]="@ind_str_asc";for (i in map) { for (j in map[i]) { print map[i][j] } } }' namesfile > datafile.fix
A solution completely in gawk.
Set the field separator to both , and space. Then ignore any lines that are empty or start with #. Capture the first and last variables from the delimited fields and build a two-dimensional array called map, indexed by last and then first name, with the value set to the corrected lastname,firstname age line. At the end, set the traversal order to string-ascending indices and loop through the array, printing the names in order as requested.
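A quick way to see what PROCINFO["sorted_in"] does to traversal order (gawk only):
$ gawk 'BEGIN{ a["b"]=1; a["c"]=2; a["a"]=3; PROCINFO["sorted_in"]="@ind_str_asc"; for (k in a) print k }'
a
b
c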
Completely in bash:
re="^[[:space:]]*([^#]([[:space:]]|[[:alpha:]])+),(([[:space:]]|[[:alpha:]])*[[:alpha:]]) *([[:digit:]]+)"
while IFS= read -r line
do
if [[ ${line} =~ $re ]]
then
echo ${BASH_REMATCH[3]},${BASH_REMATCH[1]} ${BASH_REMATCH[5]}
else
echo "${line}"
fi
done < names.txt
The core of this is to capture, using bash regex matching (the =~ operator of the [[ command), parenthesis groupings, and the BASH_REMATCH array: the name before the comma (([^#]([[:space:]]|[[:alpha:]])+)), the name after the comma ((([[:space:]]|[[:alpha:]])*[[:alpha:]])), and the age ( *([[:digit:]]+)). The first-name regex is constructed so as to exclude comments, and the last-name regex is constructed so as to handle multiple spaces before the age without including them in the name. Commented lines, with or without leading spaces, and lines without a comma are passed through unchanged. Either first names or last names may have internal spaces. Once the last name and first name are isolated, it is easy to print them in reverse order followed by the age (echo ${BASH_REMATCH[3]},${BASH_REMATCH[1]} ${BASH_REMATCH[5]}). Note that the letter/space groupings also count as match groups, which is why we skip 2 and 4.
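A stripped-down demo of the =~ and BASH_REMATCH mechanics (simplified regex, hypothetical input line):
$ re='([[:alpha:]]+),([[:alpha:]]+) +([[:digit:]]+)'
$ [[ 'Franklin,Alexanderovich 47' =~ $re ]] && echo "${BASH_REMATCH[2]},${BASH_REMATCH[1]} ${BASH_REMATCH[3]}"
Alexanderovich,Franklin 47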
I have tried using awk and sed.
Try this and see if it works:
sed 's/ /,/g' datafile | awk -F "," '{print $2,$1,$3}' | sed 's/ /,/' | sed 's/^,//' | sort -u > datafile.fix
So what I'm trying to do is this: I've been using keybr.com to sharpen my typing skills, and on this site you can "provide your own custom text." Now I've been taking chapters out of books to type so it's a little more interesting than just typing groups of letters. Now I want to also insert numbers into the text. Specifically, between each word have something like "393" and random sets smaller and larger than that example.
So I have saved a chapter of a book into a file in my home folder. Now I just need a command to search for spaces and insert a group of numbers plus a space, so a sentence would look like this: The 293 dog 328 is 102 black. 334 The... etc.
I have looked up Linux commands through search engines and I've found out how to replace strings in text files with:
sed -i 's/original/new/g' file.txt
and how to generate random numbers with:
$ shuf -i MIN-MAX -n COUNT
I just cannot figure out how to put together a one-line command that will insert random numbers between each word. I'm still-a-searching, so thanks to anyone who takes the time to read my problem.
Perl to the rescue!
perl -pe 's/ /" " . (100 + int rand 900) . " "/ge' < input.txt > output.txt
-p reads the input line by line, after reading a line, it runs the code and prints the line to the output
s/// is similar to the substitution you know from sed
/g means global, i.e. it substitutes as many times as possible
/e means the replacement part is a code to run. In this case, the code generates a random number (100-999).
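For example, on a single sentence it might print:
$ echo 'The dog is black.' | perl -pe 's/ /" " . (100 + int rand 900) . " "/ge'
The 293 dog 328 is 102 black.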
Given:
$ echo "$txt"
Here is some random words. Please
insert a number a space between each one.
Here is a simple awk to do that:
$ echo "$txt" | awk '{for (i=1;i<=NF;i++) printf "%s %d ", $i, rand()*100; print ""}'
Here 92 is 59 some 30 random 57 words. 74 Please 78
insert 43 a 33 number 77 a 10 space 78 between 83 each 76 one. 49
And here is roughly the same thing in pure Bash:
while read -r line; do
for word in $line; do
printf "%s %s" "$word $((1+$RANDOM % 100))"
done
echo
done < <(echo "$txt")
I am getting a syntax error for using cat and while read line inside awk.
Sample code:
awk '{
if( condition )
{
array[FNR]=$1;
cat file1.json | while read LINE; do
print LINE
done;
}
fi
}' /home/user/spfile.txt
My json file:
{
"Section_A": {
"ws/abc-Location01": 24,
"ws/abc-Location02": 67,
"ws/abc-Location03: 101,
},
"Section_B": {
"ws/abc-Location01": 33,
"ws/abc-Location02": 59,
"ws/abc-Location03: 92,
"ws/abc-Location42: 92,
}
}
My array contains locations of various partitions, like below:
array[15742] is nsg -> /ws/abc-Location42/uname/builds_nsg
array[15744] is bfr -> /ws/abc-Location63/uname/builds_bfr
array[15746] is pre -> /ws/abc-Location67/uname/builds_pre
array[15748] is sfjk -> /ws/abc-Location67/uname/builds_sfjk
File2.txt
abc5-blah30a:/vol/local13/abc-Location67
1000 598
abc5-blah30a:/vol/local14/abc-Location68
1000 186
abc5-blah30a:/vol/local14/abc-Location01
1000 256
abc5-blah30a:/vol/local14/abc-Location02
1000 15
abc5-blah30a:/vol/local14/abc-Location03
1000 765
What I'm trying to do:
I need to change only Section B in my json file, and skip all other sections.
I need to check the locations of the partitions in Section B, and for all matches with the array, the numeric value on the right-hand side shouldn't be changed.
For all non-matches, the numeric value on the right hand side needs to be changed to the corresponding value from another file file2.txt.
Example
There is a match for Location42 in my json file against the array, so I do NOT change it.
But there is no match against the array for Location01,02,03 in the json file.
So I need to look up the corresponding values for these 3 locations in file2.txt.
And I need to change them to 256, 15, 765.
RTFM.
awk is a powerful tool and can do many things (even if, as chepner said, Python, Perl or Ruby might be better suited to your problem), but it is not a magic tool that you can use without learning it.
You simply cannot use shell constructs inside an awk script.
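If you need another file's contents inside awk, the awk-native replacement for that shell loop is getline; a minimal sketch (file names taken from your attempt, the condition left as a placeholder):
awk '{
  if (some_condition) {   # whatever your real test is
    # read file1.json line by line from inside awk
    while ((getline line < "file1.json") > 0)
      print line
    close("file1.json")
  }
}' /home/user/spfile.txt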