How to loop a variable range in cut command - bash

I have a file with 2 columns, and I want to use the values from the second column to set the range in the cut command to select a range of characters from another file. The range I want starts at the position given by the value in the second column and covers the next 10 characters. An example follows.
My files look like this:
File with 2 columns and no blank lines between lines (file1.txt):
NAME1 10
NAME2 25
NAME3 48
NAME4 66
File from which I want to extract the variable ranges of characters (a single very long line with no spaces) (file2.txt):
GATCGAGCGGGATTCTTTTTTTTTAGGCGAGTCAGCTAGCATCAGCTACGAGAGGCGAGGGCGGGCTATCACGACTACGACTACGACTACAGCATCAGCATCAGCGCACTAGAGCGAGGCTAGCTAGCTACGACTACGATCAGCATCGCACATCGACTACGATCAGCATCAGCTACGCATCGAAGAGAGAGC
Desired resulting file, one sequence per line (result.txt):
GATTCTTTTT
GGCGAGTCAG
CGAGAGGCGA
TATCACGACT
The resulting file would have the characters from 10-20, 25-35, 48-58 and 66-76, each range on a new line. So it would always keep a length of 10, but with different start points, and those start points are set by the values in the second column of the first file.
I tried the command:
for i in $(awk '{print $2}' file1.txt);
do
p1=$i;
p2=`expr "$1" + 10`
cut -c$p1-$2 file2.txt > result.txt;
done
I don't get any output or error message.
I also tried:
while read line; do
set $line
p2=`expr "$2" + 10`
cut -c$2-$p2 file2.txt > result.txt;
done <file1.txt
This last command gives me an error message:
cut: invalid range with no endpoint: -
Try 'cut --help' for more information.
expr: non-integer argument

There's no need for cut here; dd can do the job of indexing into a file, and reading only the number of bytes you want. (Note that status=none is a GNUism; you may need to leave it out on other platforms and redirect stderr otherwise if you want to suppress informational logging).
while read -r name index _; do
dd if=file2.txt bs=1 skip="$index" count=10 status=none
printf '\n'
done <file1.txt >result.txt
This approach avoids excessive memory requirements (as present when reading the whole of file2 -- assuming it's large), and has bounded performance requirements (overhead is equal to starting one copy of dd per sequence to extract).

Using awk
$ awk 'FNR==NR{a=$0; next} {print substr(a,$2+1,10)}' file2 file1
GATTCTTTTT
GGCGAGTCAG
CGAGAGGCGA
TATCACGACT

If file2.txt is not too large, then you can read it in memory,
and use Bash sub-strings to extract the desired ranges:
data=$(<file2.txt)
while read -r name index _; do
echo "${data:$index:10}"
done <file1.txt >result.txt
This will be much more efficient than running cut or another process for every single range definition.
(Thanks to @CharlesDuffy for the tip to read data without a useless cat, and the while loop.)

One way to solve it:
#!/bin/bash
while read line; do
pos=$(echo "$line" | cut -f2 -d' ')
x=$(head -c $(( $pos + 10 )) file2.txt | tail -c 10)
echo "$x"
done < file1.txt > result.txt
It's not the solution an experienced bash hacker would use, but it is very good for someone who is new to bash. It uses tools that are very versatile, although somewhat slow if you need high performance. Shell scripting is commonly used by people who rarely write shell scripts but know a few commands and just want to get the job done. That's why I'm including this solution, even if the other answers are superior for more experienced people.
The first line is pretty easy. It just extracts the numbers from file1.txt. The second line uses the very nice tools head and tail. Usually, they are used with lines instead of characters. Nevertheless, I print the first pos + 10 characters with head. The result is piped into tail which prints the last 10 characters.
Thanks to @CharlesDuffy for improvements.
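For comparison, the loop from the question can also be repaired directly. The sketch below is not taken from any of the answers: it assumes the second column is a zero-based offset (which is what the expected output implies) and converts it to the one-based, inclusive range that cut -c expects. The function name extract_ranges is made up for illustration, and the temporary files only stand in for file1.txt and file2.txt.

```shell
#!/bin/bash
# Stand-ins for file1.txt and file2.txt, filled with the question's sample data.
names=$(mktemp)
seq_file=$(mktemp)
printf '%s\n' 'NAME1 10' 'NAME2 25' 'NAME3 48' 'NAME4 66' > "$names"
printf '%s\n' 'GATCGAGCGGGATTCTTTTTTTTTAGGCGAGTCAGCTAGCATCAGCTACGAGAGGCGAGGGCGGGCTATCACGACTACGACTACGACTACAGCATCAGCATCAGCGCACTAGAGCGAGGCTAGCTAGCTACGACTACGATCAGCATCGCACATCGACTACGATCAGCATCAGCTACGCATCGAAGAGAGAGC' > "$seq_file"

extract_ranges() {
    # $1: two-column file (name, offset); $2: file holding the sequence
    local name index
    while read -r name index _; do
        # cut -c counts from 1 and includes both endpoints, so a zero-based
        # offset N with length 10 becomes the range (N+1)-(N+10).
        cut -c "$((index + 1))-$((index + 10))" "$2"
    done < "$1"
}

result=$(extract_ranges "$names" "$seq_file")
printf '%s\n' "$result"
rm -f "$names" "$seq_file"
```

Like the head and tail version, this starts one process per range and rereads the sequence file each time, so the awk and Bash substring answers remain the better choice for long lists.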

Related

Utilising variables in tail command

I am trying to export characters from a reference file in which their byte position is known. To do this, I have a long list of numbers stored as a variable which have been used as the input to a tail command.
For example, the reference file looks like:
ggaaatgcattcaaacatgc
And the list looks like:
5
10
7
15
I have tried using this code:
list=$(<pos.txt)
echo "$list"
cat ref.txt | tail -c +"list" | head -c1 > out.txt
However, it keeps returning "invalid number of bytes: '+5\n10\n7\n15...'"
My expected output would be
a
t
g
a
...
Can anybody tell me what I'm doing wrong? Thanks!
It looks like you are trying to access your list variable in your tail command. You can access it like this: $list rather than just using quotes around it.
Your logic is flawed even after fixing the variable access. The list variable holds all lines of your pos.txt file, including the newline character \n, which is invisible in many UIs and programs but is of course visible when you are manually reading single bytes. You need to feed the lines one by one to make it work properly.
Also unless those numbers are indexes from the end, you need to feed them to head instead of tail.
If I understood what you are attempting to do correctly, this should work:
while read line
do
head -c $line ref.txt | tail -c 1 >> out.txt
done < pos.txt
The reason your command fails is simple. The variable list contains a multi-line string read from the pos.txt file, including newlines. You cannot pass more than one integer value to the -c flag.
Your attempt can be fixed quite easily by removing the call to cat and reading pos.txt one position at a time:
while IFS= read -r lineNo; do
tail -c +"$lineNo" ref.txt | head -c1
done < pos.txt
But if your intention is to print each character on a new line, head does not do that. For your input it produces the string atga on a single line rather than one character per line.
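If one character per line is wanted, a newline can be printed after each extracted byte. A minimal sketch, using the sample data from the question in a temporary file (the function name char_at is made up for illustration):

```shell
#!/bin/bash
# Temporary stand-in for ref.txt.
ref=$(mktemp)
printf '%s\n' 'ggaaatgcattcaaacatgc' > "$ref"

char_at() {
    # $1: one-based byte position; $2: file to read from
    # tail -c +N starts its output at byte N; head -c 1 keeps only that byte.
    tail -c +"$1" "$2" | head -c 1
    printf '\n'
}

out=$(for pos in 5 10 7 15; do char_at "$pos" "$ref"; done)
printf '%s\n' "$out"
rm -f "$ref"
```

In a real script the for loop would be replaced by a while read loop over pos.txt, with the redirection placed after done.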
As Gordon mentions in one of the comments, for much more efficient processing of FASTA files you could use a single invocation of awk (avoiding multiple forks to head/tail). Your input does not contain any headers to skip, so this is as straightforward as
awk ' FNR==NR{ n = split($0,arr,""); for(i=1;i<=n;i++) hash[i] = arr[i] }
( $0 in hash ){ print hash[$0] } ' ref.txt pos.txt
You could use cut instead of tail:
pos=$(<pos.txt)
cut -c ${pos//$'\n'/,} --output-delimiter=$'\n' ref.txt
Or just awk:
awk -F '' 'NR==FNR{c[$0];next} {for(i in c) print $i}' pos.txt ref.txt
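both yield the following (cut always emits characters in the order they occur in the line, and the order of awk's for-in loop is unspecified, so neither follows the order of pos.txt):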
a
g
t
a
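If the output has to follow the order of pos.txt, a sketch using awk's substr keeps it. This variant is an illustration rather than part of the answer above; the temporary files stand in for ref.txt and pos.txt.

```shell
#!/bin/bash
# Temporary stand-ins for ref.txt and pos.txt.
ref=$(mktemp)
pos=$(mktemp)
printf '%s\n' 'ggaaatgcattcaaacatgc' > "$ref"
printf '%s\n' 5 10 7 15 > "$pos"

# First file: remember the sequence. Second file: print the character at
# each listed one-based position, in the order the positions appear.
out=$(awk 'NR==FNR { seq = $0; next } { print substr(seq, $1, 1) }' "$ref" "$pos")
printf '%s\n' "$out"
rm -f "$ref" "$pos"
```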

How to find integer values and compare them then transfer the main files?

I have some output files (5000 files) of .log which are the results of QM computations. Inside each file there are two special lines indicate the number of electrons and orbitals, like this below as an example (with exact spaces as in output files):
Number of electrons = 9
Number of orbitals = 13
I thought about a script (bash or Fortran) as a solution to this problem: it would grep these two lines at the same time, extract the corresponding integer values (9 and 13, for instance), compute the difference between the two values, and finally list the differences in a new text file with the corresponding filenames.
I would really appreciate any help given.
I am posting an attempt in GNU Awk, and have tested it only there.
#!/bin/bash
for file in *.log
do
awk -F'=[[:blank:]]*' '/Number of/{printf "%s%s",$2,(++n%2?" ":RS)}' "$file" | awk 'function abs(v) {return v < 0 ? -v : v} {print abs($1-$2)}' >> output_"$file"
done
The reason I split the AWK logic in two was to avoid the complexity of a single huge command. The first part extracts the numbers from your log file in a columnar format, and the second computes the absolute value of their difference.
I will break down the AWK logic:
-F'=[[:blank:]]*' sets a multi-character field delimiter: an = followed by zero or more [[:blank:]] whitespace characters.
'/Number of/{printf "%s%s",$2,(++n%2?" ":RS)}' searches for lines containing Number of and prints their values in a columnar fashion, i.e. as 9 13 for your sample file. The counter n tracks how many matching lines have been seen, so the first value of each pair is followed by a space and the second by a newline, wherever the two lines sit in the file.
The second part is self-explanatory. I have written a function to get the absolute value of the difference between the two values and print it.
Each output is saved in a file named output_ followed by the name of the log file, for you to process further.
Run the script from your command line as bash script.sh, where script.sh is the name of the file containing the above lines.
Update:
If you are interested in negative values too, i.e. without the absolute function, change the awk statement to
awk -F'=[[:blank:]]*' '/Number of/{printf "%s%s",$2,(++n%2?" ":RS)}' "$file" | awk '{print ($1-$2)}' >> output_"$file"
Bad way to do it (but it will work)-
while read file
do
first=$(awk -F= '/^Number/ {print $2}' "$file" | head -1)
second=$(awk -F= '/^Number/ {print $2}' "$file" | tail -1)
if [ "$first" -gt "$second" ]
then
echo $(("$first" - "$second"))
else
echo $(("$second" - "$first"))
fi > "$file"_answer ;
done < list_of_files
This method picks up the two values with the awk one-liners and compares them.
It then subtracts the smaller from the larger and saves the resulting value in the file called "$file"_answer, i.e. the initial file name with '_answer' as a suffix.
You may need to tweak this code to fit your purposes exactly.
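Since the question asks for a single list with the corresponding filenames, the per-file output above can be folded into one summary. This is only a sketch: the function name summarize_logs, the column layout and the two sample log files are assumptions for illustration, and it expects each log to contain both lines exactly once.

```shell
#!/bin/bash
summarize_logs() {
    # Print "filename electrons orbitals difference" for every file given.
    local f
    for f in "$@"; do
        awk -v name="$f" '
            /Number of electrons/ { e = $NF }   # the integer is the last field
            /Number of orbitals/  { o = $NF }
            END { d = e - o; if (d < 0) d = -d; print name, e, o, d }
        ' "$f"
    done
}

# Two made-up log files in a scratch directory.
dir=$(mktemp -d)
printf '%s\n' 'header' ' Number of electrons = 9' ' Number of orbitals = 13' > "$dir/a.log"
printf '%s\n' ' Number of electrons = 20' 'other text' ' Number of orbitals = 14' > "$dir/b.log"

summary=$(cd "$dir" && summarize_logs *.log)
printf '%s\n' "$summary"
rm -rf "$dir"
```

For the real data, running summarize_logs *.log > summary.txt in the directory holding the 5000 files would collect everything in one text file.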

How to get line WITH tab character using tail and head

I have made a script to practice my Bash, only to realize that this script does not take tabulation into account, which is a problem since it is designed to find and replace a pattern in a Python script (which obviously needs tabulation to work).
Here is my code. Is there a simple way to get around this problem?
pressure=1
nline=$(cat /myfile.py | wc -l) # find the line length of the file
echo $nline
for ((c=0;c<=${nline};c++))
do
res=$( tail -n $(($(($nline+1))-$c)) myfile.py | head -n 1 | awk 'gsub("="," ",$1){print $1}' | awk '{print$1}')
#echo $res
if [ $res == 'pressure_run' ]
then
echo "pressure_run='${pressure}'" >> myfile_mod.py
else
echo $( tail -n $(($nline-$c)) myfile.py | head -n 1) >> myfile_mod.py
fi
done
Basically, it finds the line that has pressure_run=something and replaces it by pressure_run=$pressure. The rest of the file should be untouched. But in this case, all tabulation is deleted.
If you want to just do the replacement as quickly as possible, sed is the way to go as pointed out in shellter's comment:
sed "s/\(pressure_run=\).*/\1$pressure/" myfile.py
For Bash training, as you say, you may want to loop manually over your file. A few remarks for your current version:
Is /myfile.py really in the root directory? Later, you don't refer to it at that location.
cat ... | wc -l is a useless use of cat and better written as wc -l < myfile.py.
Your for loop is executed one more time than you have lines.
To get the next line, you do "show me all lines, but counting from the back, don't show me c lines, and then show me the first line of these". There must be a simpler way, right?
To get the left-hand side of an assignment, you say "in the first space-separated field, replace = with a space, then show me the first space-separated field of the result". There must be a simpler way, right? This is, by the way, where you strip out the leading tabs (your first awk command does it).
To print the unchanged line, you do the same complicated thing as before.
A band-aid solution
A minimal change that would get you the result you want would be to modify the awk command: instead of
awk 'gsub("="," ",$1){print $1}' | awk '{print$1}'
you could use
awk -F '=' '{ print $1 }'
"Fields are separated by =; give me the first one". This preserves leading tabs.
The replacements have to be adjusted a little bit as well; you now want to match something that ends in pressure_run:
if [[ $res == *pressure_run ]]
I've used the more flexible [[ ]] instead of [ ] and added a * to pressure_run (which must not be quoted): "if $res ends in pressure_run, then..."
The replacement has to use $res, which has the proper amount of tabs:
echo "$res='${pressure}'" >> myfile_mod.py
Instead of appending each line each loop (and opening the file each time), you could just redirect output of your whole loop with done > myfile_mod.py.
As in your version, this prints the value wrapped in literal single quotes (for example pressure_run='1'), because single quotes inside double quotes are ordinary characters and do not stop ${pressure} from being expanded. If you want the bare value, remove the single quotes (the braces aren't needed here, but don't hurt):
echo "$res=$pressure" >> myfile_mod.py
This fixes your example, but it should be pointed out that enumerating lines and then getting one at a time with tail | head is a really bad idea. You traverse the file twice for every single line, and the approach is error prone and hard to read. (Thanks to tripleee for suggesting to mention this more clearly.)
A proper solution
This all being said, there are preferred ways of doing what you did. You essentially loop over a file, and if a line matches pressure_run=, you want to replace what's on the right-hand side with $pressure (or the value of that variable). Here is how I would do it:
#!/bin/bash
pressure=1
# Regular expression to match lines we want to change
re='^[[:space:]]*pressure_run='
# Read lines from myfile.py
while IFS= read -r line; do
# If the line matches the regular expression
if [[ $line =~ $re ]]; then
# Print what we matched (with whitespace!), then the value of $pressure
line="${BASH_REMATCH[0]}"$pressure
fi
# Print the (potentially modified) line
echo "$line"
# Read from myfile.py, write to myfile_mod.py
done < myfile.py > myfile_mod.py
For a test file that looks like
blah
test
pressure_run=no_tab
blah
something
	pressure_run=one_tab
		pressure_run=two_tabs
the result is
blah
test
pressure_run=1
blah
something
	pressure_run=1
		pressure_run=1
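The tab-preserving behaviour can be checked without creating files by hand. The sketch below wraps the same loop in a function (the name set_pressure is made up for illustration) and feeds it tab-indented input generated with printf, which turns \t into a real tab:

```shell
#!/bin/bash
set_pressure() {
    # $1: new value; reads lines on stdin and writes them to stdout
    local re='^[[:space:]]*pressure_run=' line
    while IFS= read -r line; do
        if [[ $line =~ $re ]]; then
            # BASH_REMATCH[0] holds the leading whitespace plus "pressure_run="
            line="${BASH_REMATCH[0]}$1"
        fi
        printf '%s\n' "$line"
    done
}

# Sample input with zero, one and two leading tabs.
input=$(printf 'blah\npressure_run=no_tab\n\tpressure_run=one_tab\n\t\tpressure_run=two_tabs\n')
output=$(printf '%s\n' "$input" | set_pressure 1)
printf '%s\n' "$output"
```

With real files the call would be set_pressure "$pressure" < myfile.py > myfile_mod.py.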
Recommended reading
How to read a file line-by-line (explains the IFS= and -r business, which is quite essential to preserve whitespace)
BashGuide

'grep +A': print everything after a match [duplicate]

I have a file that contains a list of URLs. It looks like below:
file1:
http://www.google.com
http://www.bing.com
http://www.yahoo.com
http://www.baidu.com
http://www.yandex.com
....
I want to get all the records after: http://www.yahoo.com, results looks like below:
file2:
http://www.baidu.com
http://www.yandex.com
....
I know that I could use grep to find the line number of where yahoo.com lies using
grep -n 'http://www.yahoo.com' file1
3:http://www.yahoo.com
But I don't know how to get the part of the file after line number 3. Also, I know grep has the -A flag to print the lines after a match. However, you need to specify how many lines you want after the match. I am wondering whether there is a way around that, like:
Pseudocode:
grep -n 'http://www.yahoo.com' -A all file1 > file2
I know we could use the line number I got and wc -l to get the number of lines after yahoo.com, however... it feels pretty lame.
AWK
If you don't mind using AWK:
awk '/yahoo/{y=1;next}y' data.txt
This script has two parts:
/yahoo/ { y = 1; next }
y
The first part states that if we encounter a line with yahoo, we set the variable y=1, and then skip that line (the next command will jump to the next line, thus skip any further processing on the current line). Without the next command, the line yahoo will be printed.
The second part is a short hand for:
y != 0 { print }
Which means, for each line, if variable y is non-zero, we print that line. In AWK, a variable that is referenced before being assigned is created on the spot as either zero or the empty string, depending on context. Before yahoo is encountered, variable y is 0, so the script does not print anything. After yahoo is encountered, y is 1, so every line after that is printed.
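The flag idiom is easy to try on the sample list. In this sketch the temporary file stands in for data.txt:

```shell
#!/bin/bash
# Temporary stand-in for data.txt, holding the URLs from the question.
urls=$(mktemp)
printf '%s\n' http://www.google.com http://www.bing.com http://www.yahoo.com \
    http://www.baidu.com http://www.yandex.com > "$urls"

# y stays unset (false) until the yahoo line is seen; that line itself is
# skipped by next, and every later line is printed because y is then 1.
after=$(awk '/yahoo/ { y = 1; next } y' "$urls")
printf '%s\n' "$after"
rm -f "$urls"
```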
Sed
Or, using sed, the following will delete everything up to and including the line with yahoo:
sed '1,/yahoo/d' data.txt
This is much easier done with sed than grep. sed can apply any of its one-letter commands to an inclusive range of lines; the general syntax for this is
START , STOP COMMAND
except without any spaces. START and STOP can each be a number (meaning "line number N", starting from 1); a dollar sign (meaning "the end of the file"), or a regexp enclosed in slashes, meaning "the first line that matches this regexp". (The exact rules are slightly more complicated; the GNU sed manual has more detail.)
So, you can do what you want like so:
sed -n -e '/http:\/\/www\.yahoo\.com/,$p' file1 > file2
The -n means "don't print anything unless specifically told to", and the script given with -e means "from the first line that matches the regexp /http:\/\/www\.yahoo\.com/ to the end of the file, print."
This will include the line with http://www.yahoo.com/ on it in the output. If you want everything after that point but not that line itself, the easiest way to do that is to invert the operation:
sed -e '1,/http:\/\/www\.yahoo\.com/d' file1 > file2
which means "for line 1 through the first line matching the regexp /http:\/\/www\.yahoo\.com/, delete the line" (and then, implicitly, print everything else; note that -n is not used this time).
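The escaped slashes can be avoided by picking another delimiter for the address: sed accepts \cREGEXc for any character c. A sketch of both variants with # as the delimiter, run against a temporary file that stands in for file1:

```shell
#!/bin/bash
# Temporary stand-in for file1, holding the URLs from the question.
urls=$(mktemp)
printf '%s\n' http://www.google.com http://www.bing.com http://www.yahoo.com \
    http://www.baidu.com http://www.yandex.com > "$urls"

# Inclusive: print from the first matching line to the end of the file.
from_match=$(sed -n '\#http://www\.yahoo\.com#,$p' "$urls")

# Exclusive: delete line 1 through the first matching line, print the rest.
after_match=$(sed '1,\#http://www\.yahoo\.com#d' "$urls")

printf '%s\n' "$from_match" '--' "$after_match"
rm -f "$urls"
```

One caveat with the exclusive form: the end pattern of a range is only looked for from the line after the range starts, so it behaves differently if the match is on line 1 itself.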
awk '/yahoo/ ? c++ : c' file1
Or golfed
awk '/yahoo/?c++:c' file1
Result
http://www.baidu.com
http://www.yandex.com
This is most easily done in Perl:
perl -ne 'print unless 1 .. m(http://www\.yahoo\.com)' file
In other words, print all lines that aren’t between line 1 and the first occurrence of that pattern.
Using this script:
# Get the line number of the first line containing "yahoo"
index=$(grep -n "yahoo" filepath | head -n 1 | cut -d':' -f1)
# Get the total number of lines in the file
totallines=$(wc -l < filepath)
# Number of lines that follow the match
result=$((totallines - index))
# Print the match and everything after it, then drop the matching line itself
grep -A "$result" "yahoo" filepath | tail -n +2

Using bash and sed

Okay, so I'm not too great at this, but I have a bash script to pick a random number, then use sed to read lines off of files.
It's not working and I must have done something wrong. Could anyone correct my code?
I want the code to pull the line (random number) from each of those files, then output it as a single string (with spaces between).
NUMBER=$[ ( $RANDOM % 100 ) + 1 ]
sed -n NUMBER'p' /Users/user/Desktop/Street.txt
sed -n NUMBER'p' /Users/user/Desktop/City.txt
sed -n NUMBER'p' /Users/user/Desktop/State.txt
sed -n NUMBER'p' /Users/user/Desktop/Zip.txt
You probably need to use $NUMBER in your sed commands, rather than just NUMBER (or ${NUMBER} if other text is directly next to it). Example:
sed -n "${NUMBER}p" /Users/user/Desktop/Street.txt
The following script will use the same randomly chosen number to grab that line from each of the 4 input files you specified and concatenate those lines into a single variable called $outstring.
#!/bin/bash
NUMBER=$(((RANDOM % 100)+1))
for file in Street City State Zip; do
outstring+="$(sed -n "${NUMBER}p" "./${file}.txt") "
done
echo $outstring
Note: If you want (potentially) different line numbers from each of the 4 input files, then simply put the NUMBER= statement inside the for-loop.
Using shuf has the advantage of choosing from the whole of each file rather than only the first 100 lines. It picks a line independently for each file, so the line numbers will usually differ.
for f in Street City State Zip
do
printf '%s ' "$(shuf -n 1 "/Users/user/Desktop/$f.txt")"
done
printf '\n'
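Where shuf is not available (it is part of GNU coreutils), the sed approach can still cover the whole file by deriving the upper bound from the file's line count instead of hard-coding 100. A sketch, with a made-up function name random_line and a made-up sample file; since RANDOM only goes up to 32767, it suits files shorter than that:

```shell
#!/bin/bash
random_line() {
    # $1: file to pick from; prints one randomly chosen line
    local total n
    total=$(wc -l < "$1")
    (( total > 0 )) || return 1      # nothing to pick from an empty file
    n=$(( RANDOM % total + 1 ))      # line number between 1 and total
    sed -n "${n}p" "$1"
}

# Made-up sample data standing in for Street.txt.
streets=$(mktemp)
printf '%s\n' 'Main St' 'Oak Ave' 'Elm Rd' > "$streets"
picked=$(random_line "$streets")
printf '%s\n' "$picked"
rm -f "$streets"
```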

Resources