Increment all regex matching numbers throughout an HTML file - shell

I have a bunch of HTML files that have anchors structured like:
LinkName
I'm running the files through sed to convert the links into this structure:
LinkName
The last piece of the puzzle that I'm trying to solve is that I need to increment the numbers within the anchor by 10:
#L217 -> #L227 // first link
#cl-217 -> #cl-227 // transformed link
So the final version of the above example link would be:
LinkName
I've gotten close =/
awk 'gsub(/#cl-[0-9]+/, "#cl-ABC")' # just can't get the incremented match in ABC
This one works, but only once, or once per line:
awk '{n = substr($0, match($0, /[0-9]+/), RLENGTH) + 10; sub(/[0-9]+/, n); print }'
(* I don't have gawk, or gnu sed)

Try this:
1- Create a file named replace.sh
for file in /path/to/files/*.html; do
while read line; do
name=$line
[[ $line =~ 'LinkName' ]];
match=${BASH_REMATCH[1]};
replace=$((${BASH_REMATCH[1]} + 10));
perl -i -pe 's!LinkName!LinkName!g' $file
done < $file
done
2- chmod +x replace.sh
3- ./replace.sh

In Bash and ksh you can use let to do arithmetic (let is not part of POSIX; the portable form is $(( ))). First get only the number into a variable, then use let my_var+=10, or my_var=$((my_var + 10)), to add 10 to it.
On the other hand, I'm morally obliged to warn you that manipulating HTML with shell scripts is a maintainability disaster waiting to happen. Python, JavaScript, XSLT or Java would all do a much better job.
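For the awk attempt in the question, one way to bump every match on a line, rather than only the first, is to loop with match() and consume the line piece by piece. The following is a minimal sketch that uses only POSIX awk features (no gawk). The function name bump_anchors and the sample input are made up for illustration, and it only handles the already-transformed #cl-N form.

```shell
# Hypothetical helper: add 10 to every #cl-<number> on each input line.
bump_anchors() {
    awk '{
        rest = $0
        out = ""
        # match() sets RSTART and RLENGTH for the leftmost match in rest
        while (match(rest, /#cl-[0-9]+/)) {
            # skip the 4 characters of "#cl-" to get the number, then add 10
            num = substr(rest, RSTART + 4, RLENGTH - 4) + 10
            out = out substr(rest, 1, RSTART - 1) "#cl-" num
            # continue scanning after the match we just handled
            rest = substr(rest, RSTART + RLENGTH)
        }
        print out rest
    }' "$@"
}

printf '%s\n' 'see #cl-217 and #cl-9' | bump_anchors
```

Because the loop only ever looks at the text after the previous match, a number that has already been incremented is never matched a second time.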


How to use local variables and 'delete last line' in sed

I am trying to delete the last three lines of a file in a Bash shell script.
Since I am using local variables in combination with sed's regex syntax, the answer proposed in "How to use sed to remove the last n lines of a file" does not cover this case. The cases covered there deal with sed in a terminal; they cover neither the syntax in shell scripts nor the use of variables in sed expressions.
The commands I have available are limited, since I am not on Linux but on MINGW64.
sed does a great job so far, but deleting the last three lines gives me some headaches when it comes to formatting the expression.
I use wc to find out how many lines the file has and then subtract three from that with expr.
n=$(wc -l < "$distribution_area")
rel=$(expr $n - 3)
The start point for deleting lines is defined by rel but accessing the local variable happens through the $ and unfortunately the syntax of sed is using the $ to define the end of file. Hence,
sed -i "$rel,$d" "$distribution_area"
won't work, and whatever combination I try, e.g. '"'"$rel"'",$d', gives me sed: -e expression #1, char 1: unknown command: `"' or something similar.
Can somebody show me how to combine the variable with the $d regex syntax of sed?
sed -i "$rel,$d" "$distribution_area"
Here you're missing the variable name (n) for the second arg.
Consider the following example on a file called test that contains the numbers 1-10, one per line:
n=$(wc -l < test)
rel=$((n - 2)) # first of the last three lines
sed "$rel,$n d" test
Result:
1
2
3
4
5
6
7
To make sure the d will not interfere with the $n, you can add a space instead of escaping.
If you have GNU head available (negative line counts are a GNU extension), I'd recommend something like:
head -n -3 test
Can somebody show me how to combine the variable with the $d regex syntax of sed?
Inside double quotes the shell expands $d as a variable named d, so you have to escape it.
"$rel,\$d"
or:
"$rel"',$d'
But I would use:
head -n -3 "$distribution_area" > "$distribution_area".tmp
mv "$distribution_area".tmp "$distribution_area"
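Putting the two ideas together, here is a minimal sketch that needs neither sed -i nor a head that accepts negative counts, which may matter on a restricted MINGW64 install. The function name drop_last_lines and the .tmp suffix are assumptions for illustration. Note that the first line to delete is total - N + 1.

```shell
# Hypothetical helper: drop_last_lines FILE N removes the last N lines of FILE.
# POSIX sh has no "local", so these variables are global.
drop_last_lines() {
    file=$1
    count=$2
    # the outer $(( )) strips the padding some wc implementations print
    total=$(($(wc -l < "$file")))
    [ "$total" -gt 0 ] || return 0
    first=$((total - count + 1))      # first line to delete
    if [ "$first" -lt 1 ]; then
        first=1                       # asked for more lines than exist: empty the file
    fi
    # \$ stops the shell from expanding $d, so sed sees an address like "8,$d"
    sed "${first},\$d" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}
```

With a ten-line file, drop_last_lines file 3 runs sed with the script 8,$d and leaves lines 1 to 7.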
You can remove the last N lines using only pure Bash, without forking additional processes (such as sed). Such scripts look ugly, but they would work in any environment where only Bash runs and nothing else is available, no other binaries like sed, awk etc.
If the entire file fits in RAM, a straightforward solution is to split it by lines and print all but the N trailing ones:
delete_last_n_lines() {
local -ir n="$1"
local -a lines
readarray lines
((${#lines[@]} > n)) || return 0
printf '%s' "${lines[@]::${#lines[@]} - n}"
}
If the file does not fit in RAM, you can keep a FIFO buffer that stores N lines (N + 1 in the “implementation” below, but that’s just a technical detail), let the file (arbitrarily large) flow through the buffer and, after reaching the end of the file, not print out what remains in the buffer (the last N lines to remove).
delete_last_n_lines() {
local -ir n="$1 + 1"
local -a lines
local -i pos i
for ((i = 0; i < n; ++i)); do
IFS= read -r lines[i] || return 0
done
printf '%s\n' "${lines[pos]}"
while IFS= read -r lines[pos++]; do
((pos %= n))
printf '%s\n' "${lines[pos]}"
done
}
The following example gets 10 lines of input, 0 to 9, but prints out only 0 to 6, removing 7, 8 and 9 as desired:
printf '%s' {0..9}$'\n' | delete_last_n_lines 3
Last but not least, this simple hack lacks sed’s -i option to edit files in-place. That could be implemented (e.g.) using a temporary file to store the output and then renaming the temporary file to the original. (A more sophisticated approach would be needed to avoid storing the temporary copy altogether. I don’t think Bash exposes an interface like lseek() to read files “backwards”, so this cannot be done in Bash alone.)

degenerate positions in motifs -bash

I am new to coding and writing a shell script which searches for motifs in protein sequence files and prints their location if present.
But these motifs have degenerate positions.
For example,
A motif can be (psi, psi,x, psi) where psi=(I, L or V) and x can be any of the 20 amino acids.
I would search a set of sequences for the occurrence of this motif. However, my protein sequences are exact sequences, i.e. they have no ambiguity, like:
>
MSGIALSRLAQERKAWRKDHPFGFVAVPTKNPDGTMNLMNWECAIPGKKGTPWEGGL
I would like to search for all possible exact instances of the motif in the protein sequences in a FASTA file.
I have a rough code which I know is wrong.
#!/usr/bin/bash
x=(A C G H I L M P S T V D E F K N Q R W Y)
psi=(I L V)
alpha=(D E)
motif1=($psi,$psi,$x,$psi)
for f in *.fasta ; do
if grep -q "$motif1" <$f ; then
echo $f
grep "^>" $f | tr -d ">"
grep -v ">" $f | grep -aob "$motif1"
fi
done
Appreciate any help in finding my way.
Thanks in advance!
The shell is an excellent tool for orchestrating other tools, but it's not particularly well suited to analyzing the contents of files.
A common arrangement is to use the shell to run Awk over a set of files, and do the detection logic in Awk instead. (Other popular tools are Python and Perl; I would perhaps tackle this in Python if I were to start from scratch.)
Regardless of the language of your script, you should avoid code duplication; refactor to put parameters in variables and then run the code with those parameters, or perhaps move the functionality to a function and then call it with different parameters. For example,
scan () {
local f
for f in *.fasta; do
# Inefficient: refactor to do the grep only once, then decide whether you want to show the output or not
if grep -q "$1" "$f"; then
# Always, always use double quotes around file names
echo "$f"
grep "^>" "$f" | tr -d ">"
grep -v ">" "$f" | grep -aob "$1"
fi
done
}
case $motif in
1) scan "$SIM_Type_1";; # Notice typo in variable name
2) scan "$SIM_Type_2";; # Ditto
3) scan "$SIM_Type_3";; # Ditto
4) scan "$SIM_Type_4";; # Ditto
5) scan "$SIM_TYPE_5";; # Notice inconsistent variable name
alpha) scan "$SIM_Type_alpha";;
beta) scan "$SIM_Type_beta";;
esac
You are declaring the _*Type_* variables (or occasionally *_TYPE_* -- the shell is case sensitive, and you should probably use the same capitalization for all the variables just to make it easier for yourself) as arrays, but then you are apparently attempting to use them as regular scalars. I can only guess as to what you intend for the variables to actually contain; but I'm guessing you want something like
# These are strings which contain regular expressions
x='[ACGHILMPSTVDEFKNQRWY]'
psi='[ILV]'
psi_1='[IV]'
alpha='[DE]'
# These are strings which contain sequences of the above regexes
# The ${variable} braces are not strictly necessary, but IMHO help legibility
SIM_Type_1="${psi}${psi}${x}${psi}"
SIM_Type_2="${psi}${x}${psi}${psi}"
SIM_Type_3="${psi}${psi}${psi}${psi}"
SIM_Type_4="${x}${psi}${x}${psi}"
SIM_TYPE_5="${psi}${alpha}${psi}${alpha}${psi}"
SIM_Type_alpha="${psi_1}${x}${psi_1}${psi}"
SIM_Type_beta="${psi_1}${psi_1}.${x}${psi}"
# You had an empty spot here ^ I guessed you want to permit any character?
If you really wanted these to be arrays, the way to access the contents of the array is "${array[@]}" but then that will not produce something we can directly pass to grep or Awk so I went with declaring these as strings containing regular expressions for the motifs.
But to reiterate, Awk is probably a better language for this, so let's refactor scan to be an Awk script.
# This replaces the function definition above
scan () {
    awk -v re="$1" '
    # Reset the per-file state whenever we move to a new file
    FNR == 1 { printed = 0; len = 0 }
    {
        if (/^>/) label = $0
        # Two-argument match() is portable; the three-argument form is a gawk extension
        else if (idx = match($0, re)) {
            if (! printed) { print FILENAME; printed = 1 }
            print len + idx ":" substr($0, RSTART, RLENGTH)
        }
        len += 1 + length($0) # one for the newline
    }' *.fasta
}
The main complication here is reimplementing the grep -b byte offset calculation. If that is not strictly necessary (perhaps a line number would suffice?) then the Awk script can be reduced to a somewhat more trivial one.
Your use of grep -a suggests that perhaps your input files contain DOS newlines. I think this will work fine regardless of this.
The immediate benefit of this refactoring is that we avoid scanning the potentially large input file twice. We only scan the file once, and print the file name on the first match.
If you want to make the script more versatile, this is probably also a better starting point than the grep | tr solution you had before. But if the script does what you need, and the matches are often near the beginning of the input file, or the input files are not large, perhaps you don't actually want to switch to Awk after all.
Notice also that like your grep logic, this will not work if a sequence is split over several lines in the FASTA file and the match happens to straddle one of the line breaks.
Finally, making the script prompt for interactive input is a design wart. I would suggest you accept the user choice as a command-line argument instead.
motif=$1
So you'd use this as ./scriptname alpha to run the alpha regex against your FASTA files.
Another possible refactoring would be to read all the motif regexes into a slightly more complex Awk script and print all matches for all of them in a form which then lets you easily pick the ones you actually want to examine in more detail. If you have a lot of data to process, looping over it only once is a huge win.
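To address the caveat about matches that straddle a line break, one option is to join each record's sequence lines before matching. The following is a rough sketch in POSIX awk, not a drop-in replacement. The function name scan_fasta, the tab-separated output layout, and the choice to report 1-based residue positions instead of byte offsets are all assumptions made for illustration.

```shell
# Hypothetical helper: scan_fasta REGEX FILE...
# Prints: file <TAB> record label <TAB> 1-based position <TAB> matched text
# Note: awk -v interprets backslash escapes, so a regex containing backslashes
# would need them doubled; plain bracket expressions like [ILV] are fine.
scan_fasta() {
    re=$1
    shift
    awk -v re="$re" '
        # Report every match in the joined sequence of the current record
        function report(   rest, off, pos) {
            rest = seq
            off = 0
            while (rest != "" && match(rest, re)) {
                pos = off + RSTART
                print file "\t" label "\t" pos "\t" substr(rest, RSTART, RLENGTH)
                # advance one residue past the match start so overlapping hits are found
                off += RSTART
                rest = substr(rest, RSTART + 1)
            }
            seq = ""
        }
        FNR == 1 { report(); file = FILENAME }           # flush the last record of the previous file
        /^>/     { report(); label = substr($0, 2); next }
                 { sub(/\r$/, ""); seq = seq $0 }        # strip a DOS carriage return, then join
        END      { report() }
    ' "$@"
}
```

It could be called as scan_fasta "$SIM_Type_1" *.fasta. Holding one record's sequence in memory is usually fine for proteins, but it is a trade-off compared with the line-at-a-time version.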

Batch create files with name and content based on input file

I am a macOS user trying to batch create a bunch of files. I have a text file with a column of several hundred terms/subjects, e.g.:
hydrogen
oxygen
nitrogen
carbon
etcetera
I want to programmatically fill a directory with text files generated from this subject list. For example, "hydrogen.txt" and "oxygen.txt" and so on, with each file created by iterating through the lines of my list_of_names.txt file. Some lines are one word, but other lines are two or three words (eg: "carbon monoxide"). This I have figured out how to do:
awk 'NF>0' list_of_names.txt | while read line; do touch "${line}.txt"; done
Additionally I need to create two lines of content within each of these files, and the content is both static and dynamic...
# filename
#elements/filename
...where in the example above the pound sign ("#") and "elements/" would be the same in all of the files created, but "filename" would be variable (eg: "hydrogen" for "hydrogen.txt" and "oxygen" for "oxygen.txt" etc). One further wrinkle is that if any spaces appear at all on the second line of content, there needs to be a trailing pound sign. For example:
# filename
#elements/carbon monoxide#
...although this last part is not a dealbreaker and I can use grep to modify list_of_names.txt such that phrases like "carbon monoxide" become "carbon_monoxide" and just deal with the repercussions of this later. (But if it is easy to preserve the spaces, I would prefer that.)
After a couple of hours of searching and attempts to use sed, awk, and so on, I am stuck at a directory full of files with the correct filename.txt format, but I can't get further than this. Mostly I think my efforts are failing because the solutions I can find for doing something like this use commands I am not familiar with, are structured for GNU tools, and don't execute correctly in Terminal on macOS.
I am amenable to processing this in multiple steps (ie make all of the files.txt first, then run a second step to populate the content of the files), or as a single command that makes the files and all of their content simultaneously ('simultaneously' from a human timescale).
My horrible pseudocode (IN CAPS) for how this would look as 2 steps:
awk 'NF>0' list_of_names.txt | while read line; do touch "${line}.txt"; done
awk 'NF>0' list_of_names.txt | while read line; OPEN "${line}.txt" AND PRINT "# ${line}\n#elements/${line}"; IF ${line} CONTAINS CHARACTER " " PRINT "#"; done
You could use a simple Bash loop and create the files in one shot:
#!/bin/bash
while read -r name; do # loop through input file content
[[ $name ]] || continue # skip empty lines
output=("# $name") # initialize the array with the first line
trailing=
[[ $name = *" "* ]] && trailing="#" # name has spaces in it
output+=("#elements/$name$trailing") # second line; trailing is empty when the name has no spaces
printf '%s\n' "${output[@]}" > "$name.txt" # write array content to the output file
done < list_of_names.txt
Doing it in awk:
awk '
NF {
trailing = (/ / ? "#" : "")
out=$0".txt"
printf("# %s\n#elements/%s%s\n", $0, $0, trailing) > out
close(out)
}
' list_of_names.txt
Doing the whole job in awk will yield better performance than in bash, which isn't really suited to processing text like this.
It seems to me that this should cover the requirements you've specified:
awk '
{
out=$0 ".txt"
printf "# %s\n#elements/%s%s\n", $0, $0, (/ / ? "#" : "") >> out
close(out)
}
' list_of_subjects.txt
Though you could shrink it to a one-liner:
awk '{printf "# %s\n#elements/%s%s\n",$0,$0,(/ /?"#":"")>($0".txt");close($0".txt")}' list_of_subjects.txt

performance issues in shell script

I have a 200 MB tab separated text file with millions of rows. In this file, I have a column with multiple locations like US , UK , AU etc.
Now I want to split this file on the basis of this column. The code below works, but it has a performance problem: it takes more than 1 hour to split the file into multiple files based on location. Here is the code:
#!/bin/bash
read -p "Please enter the file to split " file
read -p "Enter the Col No. to split " col_no
#set -x
header=`head -1 $file`
cnt=1
while IFS= read -r line
do
if [ $((cnt++)) -eq 1 ]
then
echo "$line" >> /dev/null
else
loc=`echo "$line" | cut -f "$col_no"`
f_name=`echo "file_"$loc".txt"`
if [ -f "$f_name" ]
then
echo "$line" >> "$f_name";
else
touch "$f_name";
echo "file $f_name created.."
echo "$line" >> "$f_name";
sed -i '1i '"$header"'' "$f_name"
fi
fi
done < $file
The logic applied here is that we are reading the entire file only once, and depending on the locations, we are creating and appending the data to it.
Please suggest necessary improvements in the code to enhance its performance.
The following is sample data, separated by colons instead of tabs. The country code is in the 4th column:
ID1:ID2:ID3:ID4:ID5
100:abcd:TEST1:ZA:CCD
200:abcd:TEST2:US:CCD
300:abcd:TEST3:AR:CCD
400:abcd:TEST4:BE:CCD
500:abcd:TEST5:CA:CCD
600:abcd:TEST6:DK:CCD
312:abcd:TEST65:ZA:CCD
1300:abcd:TEST4153:CA:CCD
There are a couple of things to bear in mind:
Reading files using while read is slow
Creating subshells and executing external processes is slow
This is a job for a text processing tool, such as awk.
I would suggest that you use something like this:
# save first line
NR == 1 {
header = $0
next
}
{
filename = "file_" $col ".txt"
# if country code has changed
if (filename != prev) {
# close the previous file
close(prev)
# if we haven't seen this file yet
if (!(filename in seen)) {
print header > filename
}
seen[filename]
}
# print whole line to file
print >> filename
prev = filename
}
Run the script using something along the following lines:
awk -v col="$col_no" -f script.awk file
where $col_no is a shell variable containing the column number with the country codes.
If you don't have too many different country codes, you can get away with leaving all the files open, in which case you can remove the call to close(prev).
You can test the script on the sample provided in the question like this:
awk -F: -v col=4 -f script.awk file
Note that I've added -F: to change the input field separator to :.
I think Tom is on the right track, but I'd simplify this a little.
Awk is magical in some ways. One of those ways is that it will keep all its input and output file handles open unless you explicitly close them. So if you create a variable containing an output file name, you can simply redirect to your variable and trust that awk will send the data to the place you've specified and eventually close the output file when it runs out of input to process.
(N.B. an extension of this magic is that in addition to redirects, you can maintain multiple PIPES. Imagine if you were to cmd="gzip -9 > file_"$4".txt.gz"; print | cmd)
The following splits your file without adding a header to each output file.
awk -F: 'NR>1 {out="file_"$4".txt"; print > out}' inp.txt
If adding the header is important, a little more code is required. But not much.
awk -F: 'NR==1{h=$0;next} {out="file_"$4".txt"} !(out in files){print h > out; files[out]} {print > out}' inp.txt
Or, because this one-liner is now a bit long, we can split it out for explanation:
awk -F: '
NR==1 {h=$0;next} # Capture the header
{out="file_"$4".txt"} # Capture the output file
!(out in files){ # If we haven't seen this output file before,
print h > out; # print the header to it,
files[out] # and record the fact that we've seen it.
}
{print > out} # Finally, print our line of input.
' inp.txt
I tested these two scripts successfully on the input data you provided in your question. With this type of solution, there is no need to sort your input data -- your output in each file will be in the order in which that subset's records appeared in your input data.
Note: different versions of awk permit different numbers of open files. GNU awk (gawk) has a limit in the thousands -- significantly more than the number of countries you might have to deal with. BSD awk version 20121220 (in FreeBSD) appears to run out after 21117 files. BSD awk version 20070501 (in OS X El Capitan) is limited to 17 files.
If you're not confident in your potential number of open files, you can experiment with your version of awk using something like this:
mkdir -p /tmp/i
awk '{o="/tmp/i/file_"NR".txt"; print "hello" > o; printf "\r%d ",NR > "/dev/stderr"}' /dev/random
You can also test the number of open pipes:
awk '{o="cat >/dev/null; #"NR; print "hello" | o; printf "\r%d ",NR > "/dev/stderr"}' /dev/random
(If you have a /dev/yes or something that just spits out lines of text ad nauseam, that would be better than using /dev/random for input.)
I haven't previously come across this limit in my own awk programming because when I've needed to create many many output files, I've always used gawk. :-P

Replacing specific lines in multiple files using bash

I'm relatively new to bash scripting, having started out of the need to manage my simulations on supercomputers. I'm currently stuck on writing a script to change specific lines in my pbs files.
There's 2 stages to my problem. First, I need to replace a number of lines in a text file (another script), and overwrite that file for my later use. The rough idea is:
Replace lines 27, 28 and 29 of 'filename005' with 'text1=000', 'text2=005' and 'text3=010'
Next, I'd like to do that recursively for a set of text files with numbered suffixes, and the numbering influences the replaced text.
My code so far is:
#!/bin/bash
for ((i = 1; i < 10; i++))
do
let NUM=i*5
let OLD=NUM-5
let NOW=NUM
let NEW=NUM+5
let FILE=$(printf "filename%03g" $NUM)
sed "27 c\text1=$OLD" $FILE
sed "28 c\text2=$NOW" $FILE
sed "29 c\text3=$NEW" $FILE
done
I know there are some errors in the last 4 lines of my code, and I'm still studying up on the proper way to implement sed. Appreciate any tips!
Thanks!
CS
Taking the first line of your specification:
Replace lines 27:29 of filename005, with text1=000; text2=005; text3=010
That becomes:
sed -e '27,29c\
text1=000\
text2=005\
text3=010' filename005
Rinse and repeat. The backslashes indicate to sed that the change continues. It's easier on yourself if your actual data lines do not need to end with backslashes.
You can play with:
seq 1 35 |
sed -e '27,29c\
text1=000\
text2=005\
text3=010'
to see what happens without risking damage to precious files. Given the specification lines, you could write a sed script to generate sed scripts from the specification (though I'd be tempted to use Perl or awk instead; indeed, I'd probably do the whole job in Perl).
Okay, I managed to get my code to work after finding out that for in-place replacement, I need to write the output to a temporary file. So, with the loop and multi-line replacement (and other small tweaks):
for ((i = 1; i < 10; i++ ))
do
let NUM=i*5
let OLD=NUM-5
let NOW=NUM
let NEW=NUM+5
FILE=`printf "filename%03g" $NUM`
sed -e "27,29 c\
OLD=$OLD\n\
NOW=$NOW\n\
NEW=$NEW" $FILE >temp.tmp && mv temp.tmp $FILE
done
Do let me know if there is a more elegant way to use sed in this context. Thanks again @Johnathan!
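One possible tidy-up, offered as a sketch rather than the definitive answer: inside double quotes the shell removes each backslash-newline pair, so the version above appears to rely on sed itself turning \n into a newline, which not every sed does. Keeping the sed script in single quotes and splicing the variables in between avoids that. The function name update_files and the .tmp suffix are assumptions; the text1/text2/text3 labels and zero-padded values follow the original question.

```shell
# Hypothetical helper: rewrite lines 27-29 of filename005, filename010, ... filename045.
update_files() {
    i=1
    while [ "$i" -lt 10 ]; do
        num=$((i * 5))
        old=$(printf '%03d' $((num - 5)))   # zero-padded, e.g. 000
        now=$(printf '%03d' "$num")         # e.g. 005
        new=$(printf '%03d' $((num + 5)))   # e.g. 010
        file="filename$now"
        if [ -f "$file" ]; then
            # single-quoted script keeps the backslash-newlines intact for sed;
            # the variables are spliced in as separate double-quoted pieces
            sed -e '27,29c\
text1='"$old"'\
text2='"$now"'\
text3='"$new" "$file" > "$file.tmp" && mv "$file.tmp" "$file"
        fi
        i=$((i + 1))
    done
}
```

The -f check simply skips files that do not exist, so the function can be tried in a scratch directory with a single test file first.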
