The file I am working on looks like this:
header
//
[25]:0.00843832,469:0.0109533):0.00657864,((((872:0.00120503,((980:0.0001);
[29]:((962:0.000580339,930:0.000580339):0.00543993);
absolute:
gthcont: 5 4 2 1 3 4 543 5 67 657 78 67 8 5645 6
01010010101010101010101010101011111100011
1111010010010101010101010111101000100000
00000000000000011001100101010010101011111
I need it to be split into four files. The first file is
[25]:0.00843832,469:0.0109533):0.00657864,((((872:0.00120503,((980:0.0001);
[29]:((962:0.000580339,930:0.000580339):0.00543993);
The second file has to be
5 4 2 1 3 4 543 5 67 657 78 67 8 5645 6
The next file has to be
01010010101010101010101010101011111100011
1111010010010101010101010111101000100000
00000000000000011001100101010010101011111
So the header and the // have to be excluded before the first file, the absolute: line should be removed, and the gthcont: prefix should not show up either.
Ideally the script would just take the name of the input file and name the outputs first_input, second_input, third_input and so on.
The fourth file should have the numbers from within the brackets in the first file; in this case it would only be
25
29
So my current try is:
awk.awk
BEGIN{body=0}
!body && /^\/\/$/ {body=1}
body && /^\[/ {print > "first_"FILENAME}
body && /^pos/{$1="";print > "second_"FILENAME}
body && /^[01]+/ {print > "third_"FILENAME}
body && /^\[[0-9]+\]/ {
print > "first_"FILENAME
print substr($0, 2, index($0,"]")-2) > "fourth_"FILENAME
}
but it somehow duplicates the lines in the first file, so it would be [25], [25], [29], [29]
Some very minor changes to your script produce the desired output:
!body && /^\/\/$/ {body=1}
body && sub(/^gthcont: */,"") {print > "second_"FILENAME}
body && /^[01]+/ {print > "third_"FILENAME}
body && /^\[[0-9]+\]/ {
print > "first_"FILENAME
print substr($0, 2, index($0,"]")-2) > "fourth_"FILENAME
}
The duplication problem was caused by the fact that you printed to the first file in two places.
I have used sub to remove the first part of the gthcont: line (and changed the pattern too). sub returns true if it makes any replacements, so you can use it as a test as well. The advantage of using a substitution rather than unsetting the first field is that you can also get rid of the leading white space from the line.
As pointed out in the comments, there is no need to initialise body, so I removed the BEGIN block too.
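For reference, running the fixed script would look something like this (assuming it is saved as awk.awk and your data file is named input):
awk -f awk.awk input
This creates first_input, second_input, third_input and fourth_input next to the original file.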
I would just use a shell function for this:
function split3 {
if [[ $# -ne 1 ]]; then echo 'split3: error: require 1 argument.' >&2; return 1; fi;
while read -r; do
line=$REPLY;
if [[ "$line" =~ ^\[([0-9]+)\]: ]]; then
echo "$line" >&3;
echo "${BASH_REMATCH[1]}" >&6;
elif [[ "$line" =~ ^gthcont: ]]; then
echo "${line#gthcont: }" >&4;
elif [[ "$line" =~ ^\s*[01]+\s*$ ]]; then
echo "$line" >&5;
fi;
done <"$1" 3>"first_$1" 4>"second_$1" 5>"third_$1" 6>"fourth_$1";
};
split3 input; echo $?;
## 0
cat first_input;
## [25]:0.00843832,469:0.0109533):0.00657864,((((872:0.00120503,((980:0.0001);
## [29]:((962:0.000580339,930:0.000580339):0.00543993);
cat second_input;
## 5 4 2 1 3 4 543 5 67 657 78 67 8 5645 6
cat third_input;
## 01010010101010101010101010101011111100011
## 1111010010010101010101010111101000100000
## 00000000000000011001100101010010101011111
cat fourth_input;
## 25
## 29
I want to compare a group of words (individuals) in pairs and extract the one with the lowest numeric variable. My files and script are as follows.
Relatedness_3rdDegree.txt (example):
Individual1 Individual2
Individual5 Individual23
Individual50 Individual65
filename.imiss
INDV N_DATA N_GENOTYPES_FILTERED N_MISS F_MISS
Individual1 375029 0 782 0.00208517
Individual2 375029 0 341 0.000909263
Individual3 375029 0 341 0.000909263
Main script:
numlines=$(wc -l Relatedness_3rdDegree.txt|awk '{print $1}')
for line in `seq 1 $numlines`
do
ind1=$(sed -n "${line}p" Relatedness_3rdDegree.txt|awk '{print $1}')
ind2=$(sed -n "${line}p" Relatedness_3rdDegree.txt|awk '{print $2}')
miss1=$(grep $ind1 filename.imiss|awk '{print $5}')
miss2=$(grep $ind2 filename.imiss|awk '{print $5}')
if echo "$miss1 > $miss2" | bc -l | grep -q 1
then
echo $ind1 >> miss.txt
else
echo $ind2 >> miss.txt
fi
echo "$line / $numlines"
done
This last script will echo a series of lines like this:
1 / 208
2 / 208
3 / 208
and so on, until getting to this error:
91 / 208
(standard_in) 1: syntax error
92 / 208
(standard_in) 1: syntax error
93 / 208
If I go to my output (miss.txt), the printed individuals are not correct.
For each pair contained in the file "Relatedness_3rdDegree.txt", it should print only the individual with the lowest value of F_MISS (column 5 of "filename.imiss").
For instance, in the pair "Individual1 Individual2", it should compare their values of F_MISS and print only the individual with the lowest value, which in this example would be Individual2.
I have manually checked the values and the printed individuals, and it looks like it printed a random individual for each pair.
What is wrong in this script?
Bash version:
#!/bin/bash
declare -A imiss
while read -r ind nd ngf nm fm # we'll ignore most of these
do
imiss[$ind]=$fm
done < filename.imiss
while read -r i1 i2
do
if (( $(echo "${imiss[$i1]} > ${imiss[$i2]}" | bc -l) ))
then
echo "$i1"
else
echo "$i2"
fi
done < Relatedness_3rdDegree.txt
Run* it like:
bash-imiss
AWK version:
#!/usr/bin/awk -f
NR == FNR {imiss[$1] = $5; next}
{
if (imiss[$1] > imiss[$2]) {
print $1
} else {
print $2
}
}
Run* it like:
awk-imiss filename.imiss Relatedness_3rdDegree.txt
These two scripts do exactly the same thing in exactly the same way using associative arrays.
* This assumes that you have set the script file executable using chmod and that it's in your PATH and that the data files are in your current directory.
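As for the original "(standard_in) 1: syntax error": a likely cause (an educated guess, not something stated above) is that a plain grep matches by substring, so with the sample names a lookup such as
grep Individual5 filename.imiss | awk '{print $5}'
also matches Individual50 (and any other name containing Individual5), leaving several values, or none, in $miss1; bc then receives a malformed expression. The associative-array lookups above use the exact individual name as the key, which avoids that problem.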
Using the tab-delimited files below, I am trying to validate header line 1 by counting its fields and storing that number in a variable $header to use in a couple of if statements. If $header equals 10, the file has the expected number of fields; if $header is less than 10, the output should read "file is missing header for:" with the missing header fields printed underneath. The bash seems close, and if I use the awk by itself it seems to work perfectly, but I cannot seem to use it in the if. Thank you :).
file.txt
Index Chr Start End Ref Alt Freq Qual Score Input
1 1 1 100 C - 1 GOOD 10 .
2 2 20 200 A C .002 STRAND BIAS 2 .
3 2 270 400 - GG .036 GOOD 6 .
file2.txt
Index Chr Start End Ref Alt Freq Qual Score
1 1 1 100 C - 1 GOOD 10
2 2 20 200 A C .002 STRAND BIAS 2
3 2 270 400 - GG .036 GOOD 6
bash
for f in /home/cmccabe/Desktop/validate/*.txt; do
bname=`basename $f`
pref=${bname%%.txt}
header=$(awk -F'\t' '{print NF, "fields detected in file and they are:" ORS $0; exit}') $f >> ${pref}_output # detect header row in file and store in header and write to output
if [[ $header == "10" ]]; then # display results
echo "file has expected number of fields" # file is validated for headers
else
echo "file is missing header for:" # missing header field ...in file not-validated
echo "$header"
fi # close if.... else
done >> ${pref}_output
desired output for file.txt
file has expected number of fields
desired output for file2.txt
file is missing header for:
Input
You can use awk if you like, but bash is more than capable of handling the first line fields comparison on its own. If you maintain an array of expected field names, you can then easily split the first line into fields, compare against the expected number of fields, and output the identity of the missing field if you read less than the expected number of fields from any given file.
The following is a short example that takes filenames as arguments (you need to take filenames from stdin for a large number of files, or use xargs, as required). The script simply reads the first line in each file, separates the line into fields, checks the field count, and outputs any missing fields in a short error message:
#!/bin/bash
declare -i header=10 ## header has 10 fields
## array of field names (can be read from 1st file)
fields=( "Index"
"Chr"
"Start"
"End"
"Ref"
"Alt"
"Freq"
"Qual"
"Score"
"Input" )
for i in "$#"; do ## for each file given as argument
read -r line < "$i" ## read first line from file into 'line'
oldIFS="$IFS" ## save current Internal Field Separator (IFS)
IFS=$'\t' ## set IFS to word-split on '\t'
fldarray=( $line ); ## fill 'fldarray' with fields in line
IFS="$oldIFS" ## restore original IFS
nfields=${#fldarray[@]} ## get number of fields in 'line'
if (( nfields < header )) ## test against header
then
printf "error: only '%d' fields in file '%s'\nmissing:" "$nfields" "$i"
for j in "${fields[#]}" ## for each expected field
do ## check against those in line, if not present print
[[ $line =~ $j ]] || printf " %s" "$j"
done
printf "\n\n" ## tidy up with newlines
fi
done
Example Input
$ cat dat/hdr.txt
Index Chr Start End Ref Alt Freq Qual Score Input
1 1 1 100 C - 1 GOOD 10 .
2 2 20 200 A C .002 STRAND BIAS 2 .
3 2 270 400 - GG .036 GOOD 6 .
$ cat dat/hdr2.txt
Index Chr Start End Ref Alt Freq Qual Score
1 1 1 100 C - 1 GOOD 10
2 2 20 200 A C .002 STRAND BIAS 2
3 2 270 400 - GG .036 GOOD 6
$ cat dat/hdr3.txt
Index Chr Start End Alt Freq Qual Score Input
1 1 1 100 - 1 GOOD 10 .
2 2 20 200 C .002 STRAND BIAS 2 .
3 2 270 400 GG .036 GOOD 6 .
Example Use/Output
$ bash hdrfields.sh dat/hdr.txt dat/hdr2.txt dat/hdr3.txt
error: only '9' fields in file 'dat/hdr2.txt'
missing: Input
error: only '9' fields in file 'dat/hdr3.txt'
missing: Ref
Look things over; while awk can do many things bash cannot on its own, bash is more than capable of parsing text.
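That said, if you want to keep the awk-based count from your original attempt, note that there the filename $f ended up outside the command substitution, so awk never read the file. A minimal sketch of that approach (field count only, filename inside $(...)) might be:
header=$(awk -F'\t' '{ print NF; exit }' "$f")  ## number of fields in line 1
if [[ $header == "10" ]]; then
    echo "file has expected number of fields"
else
    echo "file is missing header for:"
fi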
Here is one in GNU awk (nextfile):
$ awk '
FNR==NR {
for(n=1;n<=NF;n++)
a[$n]
nextfile
}
NF==(n-1) {
print FILENAME " file has expected number of fields"
nextfile
}
{
for(i=1;i<=NF;i++)
b[$i]
print FILENAME " is missing header for: "
for(i in a)
if (!(i in b))
print i
nextfile
}' file1 file1 file2
file1 file has expected number of fields
file2 is missing header for:
Input
The first file processed by the script defines the headers (stored in a) that the following files should have; each later file's headers (collected in b) are then compared against that set.
This piece of code will do exactly what you are asking. Let me know if it works for you.
for f in ./*.txt; do
[[ $( head -1 $f | awk '{ print NF}' ) -eq 10 ]] && echo "File $f has all the fields on its header" || echo "File $f is missing " $( echo "Index Chr Start End Ref Alt Freq Qual Score Input $( head -1 $f )" | tr ' ' '\n' | sort | uniq -c | awk '/1 / {print $2}' );
done
Output :
File ./file2.txt is missing Input
File ./file.txt has all the fields on its header
I have 2 scripts, #1 and #2. Each works OK by itself. I want to read a 15-row file, row by row, and process it. Script #2 selects rows. Row 0 is indicated as firstline=0, lastline=1. Row 14 would be firstline=14, lastline=15. I see good results from echo. I want to do the same with script #1. Can't get my head around nesting correctly. Code below.
#!/bin/bash
# script 1
filename=slash
firstline=0
lastline=1
i=0
exec <${filename}
while read ; do
i=$(( $i + 1 ))
if [ "$i" -ge "${firstline}" ] ; then
if [ "$i" -gt "${lastline}" ] ; then
break
else
echo "${REPLY}" > slash1
fold -w 21 -s slash1 > news1
sleep 5
fi
fi
done
# script2
firstline=(0 1 2 3 4 5 6 7 8 9 10 11 12 13 14)
lastline=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15)
for ((i=0;i<${#firstline[@]};i++))
do
echo ${firstline[$i]} ${lastline[$i]};
done
Your question is very unclear, but perhaps you are simply looking for some simple function calls:
#!/bin/bash
script_1() {
filename=slash
firstline=$1
lastline=$2
i=0
exec <${filename}
while read ; do
i=$(( $i + 1 ))
if [ "$i" -ge "${firstline}" ] ; then
if [ "$i" -gt "${lastline}" ] ; then
break
else
echo "${REPLY}" > slash1
fold -w 21 -s slash1 > news1
sleep 5
fi
fi
done
}
# script2
firstline=(0 1 2 3 4 5 6 7 8 9 10 11 12 13 14)
lastline=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15)
for ((i=0;i<${#firstline[@]};i++))
do
script_1 ${firstline[$i]} ${lastline[$i]};
done
Note that reading the file this way is extremely inefficient, and there are undoubtedly better ways to handle this, but I am trying to minimize the changes from your code.
Update: Based on your later comments, the following idiomatic Bash code that uses sed to extract the line of interest in each iteration solves your problem much more simply:
Note:
- If the input file does not change between loop iterations, and the input file is small enough (as it is in the case at hand), it's more efficient to buffer the file contents in a variable up front, as is demonstrated in the original answer below.
- As tripleee points out in a comment: if simply reading the input lines sequentially is sufficient (as opposed to extracting lines by specific line numbers), then a single, simple while read -r line; do ... # fold and output, then sleep ... done < "$filename" loop is enough; a sketch of that variant follows the code below.
# Determine the input filename.
filename='slash'
# Count its number of lines.
lineCount=$(wc -l < "$filename")
# Loop over the line numbers of the file.
for (( lineNum = 1; lineNum <= lineCount; ++lineNum )); do
# Use `sed` to extract the line with the line number at hand,
# reformat it, and output to the target file.
fold -w 21 -s <(sed -n "$lineNum {p;q;}" "$filename") > 'news1'
sleep 5
done
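For completeness, the even simpler sequential variant mentioned in the note above could be sketched as follows (keeping the original fold width and output file, and assuming specific line numbers are not needed):
while read -r line; do
  fold -w 21 -s <<<"$line" > 'news1'  # reformat the current line
  sleep 5
done < "$filename"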
A simplified version of what I think you're trying to achieve:
#!/bin/bash
# Split fields by newlines on input,
# and separate array items by newlines on output.
IFS=$'\n'
# Read all input lines up front, into array ${lines[@]}
# In terms of your code, you'd use
# read -d '' -ra lines < "$filename"
read -d '' -ra lines <<<$'line 1\nline 2\nline 3\nline 4\nline 5\nline 6\nline 7\nline 8\nline 9\nline 10\nline 11\nline 12\nline 13\nline 14\nline 15'
# Define the arrays specifying the line ranges to select.
firstline=(0 1 2 3 4 5 6 7 8 9 10 11 12 13 14)
lastline=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15)
# Loop over the ranges and select a range of lines in each iteration.
for ((i=0; i<${#firstline[@]}; i++)); do
extractedLines="${lines[*]: ${firstline[i]}: 1 + ${lastline[i]} - ${firstline[i]}}"
# Process the extracted lines.
# In terms of your code, the `> slash1` and `fold ...` commands would go here.
echo "$extractedLines"
echo '------'
done
Note:
The name of the array variable filled with read -ra is lines; ${lines[@]} is Bash syntax for returning all array elements as separate words (${lines[*]} also refers to all elements, but with slightly different semantics), and this syntax is used in the comments to illustrate that lines is indeed an array variable. (Note that if you were to use simply $lines to reference the variable, you'd implicitly get only the item with index 0, which is the same as ${lines[0]}.)
<<<$'line 1\n...' uses a here-string (<<<) to read an ad-hoc sample document (expressed as an ANSI C-quoted string ($'...')) in the interest of making my example code self-contained.
As stated in the comment, you'd read from $filename instead:
read -d '' -ra lines <"$filename"
extractedLines="${lines[*]: ${firstline[i]}: 1 + ${lastline[i]} - ${firstline[i]}}" extracts the lines of interest; ${firstline[i]} references the current element (index i) from array ${firstline[#]}; since the last token in Bash's array-slicing syntax
(${lines[*]: <startIndex>: <elementCount>}) is the count of elements to return, we must perform a calculation to determine the count, which is what 1 + ${lastline[i]} - ${firstline[i]} does.
By virtue of using "${lines[*]...}" rather than "${lines[@]...}", the extracted array elements are joined by the first character in $IFS, which in our case is a newline ($'\n') (when extracting a single line, that doesn't really matter).
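As a quick illustration of the slicing syntax with a throwaway array (default IFS here, so the extracted elements are joined with spaces rather than newlines):
demo=(zero one two three four)
echo "${demo[*]:1:3}"  # -> one two three (3 elements starting at index 1)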
Consider a plain text file containing the page-breaking ASCII control character "Form Feed" ($'\f'):
alpha\n
beta\n
gamma\n\f
one\n
two\n
three\n
four\n
five\n\f
earth\n
wind\n
fire\n
water\n\f
Note that each page has a random number of lines.
I need a bash routine that returns the page number of a given line number in a text file containing the page-breaking ASCII control character.
After a long time researching the solution I finally came across this piece of code:
function get_page_from_line
{
local nline="$1"
local input_file="$2"
local npag=0
local ln=0
local total=0
while IFS= read -d $'\f' -r page; do
npag=$(( ++npag ))
ln=$(echo -n "$page" | wc -l)
total=$(( total + ln ))
if [ $total -ge $nline ]; then
echo "${npag}"
return
fi
done < "$input_file"
echo "0"
return
}
But, unfortunately, this solution proved to be very slow in some cases.
Any better solution ?
Thanks!
The idea to use read -d $'\f' and then to count the lines is good.
This version might not look elegant: the file may end up being read twice (once by wc -l and once by head).
Give it a try, because it is super fast:
function get_page_from_line ()
{
local nline="${1}"
local input_file="${2}"
if [[ $(wc -l "${input_file}" | awk '{print $1}') -lt nline ]] ; then
printf "0\n"
else
printf "%d\n" $(( $(head -n ${nline} "${input_file}" | grep -c "^"$'\f') + 1 ))
fi
}
Performance of awk is better than that of the above bash version; awk was created for this kind of text processing.
Give this tested version a try:
function get_page_from_line ()
{
awk -v nline="${1}" '
BEGIN {
npag=1;
}
{
if (index($0,"\f")>0) {
npag++;
}
if (NR==nline) {
print npag;
linefound=1;
exit;
}
}
END {
if (!linefound) {
print 0;
}
}' "${2}"
}
When \f is encountered, the page number is increased.
NR is the current line number.
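For instance, with the question's sample file saved as file (pages of 3, 5 and 4 lines), it should behave like this:
get_page_from_line 3 file   # -> 1
get_page_from_line 9 file   # -> 3 (line 9 is "earth", the first line of page 3)
get_page_from_line 99 file  # -> 0 (line number beyond the end of the file)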
----
For the record, there is another bash version.
This version uses only built-in commands to count the lines in the current page.
The speedtest.sh that you provided in the comments showed it is only a little ahead (approx. 20 sec), which makes it roughly equivalent to your version:
function get_page_from_line ()
{
local nline="$1"
local input_file="$2"
local npag=0
local total=0
while IFS= read -d $'\f' -r page; do
npag=$(( npag + 1 ))
IFS=$'\n'
for line in ${page}
do
total=$(( total + 1 ))
if [[ total -eq nline ]] ; then
printf "%d\n" ${npag}
unset IFS
return
fi
done
unset IFS
done < "$input_file"
printf "0\n"
return
}
awk to the rescue!
awk -v RS='\f' -v n=09 '$0~"^"n"." || $0~"\n"n"." {print NR}' file
3
updated anchoring as commented below.
$ for i in $(seq -w 12); do awk -v RS='\f' -v n="$i" '$0~"^"n"." || $0~"\n"n"." {print n,"->",NR}' file; done
01 -> 1
02 -> 1
03 -> 1
04 -> 2
05 -> 2
06 -> 2
07 -> 2
08 -> 2
09 -> 3
10 -> 3
11 -> 3
12 -> 3
A script of similar length can be written in bash itself to locate and respond to the embedded <form-feed>s contained in a file. (It will work in a POSIX shell as well, with a substitute for the string indexing and expr for the math.) For example:
#!/bin/bash
declare -i ln=1 ## line count
declare -i pg=1 ## page count
fname="${1:-/dev/stdin}" ## read from file or stdin
printf "\nln:pg text\n" ## print header
while read -r l; do ## read each line
if [ "${l:0:1}" = $'\f' ]; then ## if form-feed found
((pg++))
printf "<ff>\n%2s:%2s '%s'\n" "$ln" "$pg" "${l:1}"
else
printf "%2s:%2s '%s'\n" "$ln" "$pg" "$l"
fi
((ln++))
done < "$fname"
Example Input File
The simple input file with embedded <form-feed>s was created with:
$ echo -e "a\nb\nc\n\fd\ne\nf\ng\nh\n\fi\nj\nk\nl" > dat/affex.txt
Which when output gives:
$ cat dat/affex.txt
a
b
c
d
e
f
g
h
i
j
k
l
Example Use/Output
$ bash affex.sh <dat/affex.txt
ln:pg text
1: 1 'a'
2: 1 'b'
3: 1 'c'
<ff>
4: 2 'd'
5: 2 'e'
6: 2 'f'
7: 2 'g'
8: 2 'h'
<ff>
9: 3 'i'
10: 3 'j'
11: 3 'k'
12: 3 'l'
With Awk, you can define RS (the record separator, default newline) to form feed (\f) and FS (the input field separator, default any sequence of horizontal whitespace) to newline (\n) and obtain the number of lines as the number of "fields" in a "record", which is a "page".
The placement of form feeds in your data will produce some empty lines within a page, so the counts are off where that happens.
awk -F '\n' -v RS='\f' '{ print NF }' file
You could reduce the number by one if $NF == "", and perhaps pass in the number of the desired page as a variable:
awk -F '\n' -v RS='\f' -v p="2" 'NR==p { print NF - ($NF == "") }' file
To obtain the page number for a particular line, just feed head -n number to the script, or loop over the numbers until you have accrued the sum of lines.
line=1
page=1
for count in $(awk -F '\n' -v RS='\f' '{ print NF - ($NF == "") }' file); do
old=$line
((line += count))
echo "Lines $old through line are on page $page"
((page++)
done
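One way to read the head -n suggestion above is to count the form-feed-separated records instead of the fields (a sketch, using the same RS trick):
line=9
head -n "$line" file | awk -v RS='\f' 'END { print NR }'  # -> 3 for the sample file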
This gnu awk script prints the "page" for the line number given as a command line argument:
BEGIN { ffcount=1;
search = ARGV[2]
delete ARGV[2]
if (!search ) {
print "Please provide linenumber as argument"
exit(1);
}
}
$1 ~ search { printf( "line %s is on page %d\n", search, ffcount) }
/[\f]/ { ffcount++ }
Use it like awk -f formfeeds.awk formfeeds.txt 05 where formfeeds.awk is the script, formfeeds.txt is the file and '05' is a line number.
The BEGIN rule deals mostly with the command line argument. The other rules are simple rules:
$1 ~ search applies when the first field matches the command-line argument stored in search
/[\f]/ applies when there is a formfeed
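Note that the $1 ~ search rule assumes every line starts with its own (zero-padded) line number, as in the numbered test file the other answers appear to use; a hypothetical file of that shape can be created like this:
printf '01 a\n02 b\n03 c\n\f04 d\n05 e\n06 f\n07 g\n08 h\n\f09 i\n10 j\n11 k\n12 l\n' > formfeeds.txt
awk -f formfeeds.awk formfeeds.txt 05
# line 05 is on page 2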
I have a file with several lines. Some of these lines contain LFs (0x0A) and CRs (0x0D), which I want removed. The point is that I want to replace them with a space, but only within a range of characters of every line, e.g. in a file:
30 30 30 30 30 30 30 30 30 30 **0D 0A** 30 30 0A; 0000000000..00
30 30 30 30 30 30 30 30 **0D 0A** 30 30 30 30 0A; 00000000..0000
I want to remove 0x0D and 0x0A from positions 0 to 12 in every line of the file.
I have:
awk '{l=substr($0,1,12);r=substr($0,13);gsub(/\x00-\1F/," ",l);print l r}' ${f} > ${f}.noLF
but this does not seem to work. I guess substr stops at the first 0x0D.
Is there another solution?
awk '/\r$/ && length < 13 {sub(/\r$/,""); printf "%s ", $0; next} {print}' file
Here is something ugly that may work!
Save it as go
#!/bin/bash
while :
do
# Read 13 bytes from stdin, and replace carriage returns and linefeeds with spaces
dd bs=13 count=1 2>/dev/null | tr '\r\n' ' '
# Break out of loop if dd was not successful
[ ${PIPESTATUS[0]} -ne 0 ] && break
# Get rest of conventional line, breaking out of loop if EOF
read rest || break
echo $rest
done
It reads 13 bytes from your file and removes all carriage returns and linefeeds. Then it reads the rest of the conventional line and outputs that.
Use it like this:
chmod +x go
./go < yourfile
Example:
more file
q
wertyuiopqwertyuiop
qwerty
uiopqwertyuiop
./go < file
q wertyuiopqwertyuiop
qwerty uiopqwertyuiop
EDITED TO MATCH FURTHER QUESTIONS
#!/bin/bash
while :
do
# Read 13 bytes from stdin, and replace carriage returns and linefeeds with spaces
first13=$(dd bs=13 count=1 2>/dev/null)
ddexitstatus=$?
if echo "$first13" | grep -q "^KT"; then
echo $first13
else
echo $first13 | tr '\r\n' ' '
fi
# Break out of loop if dd was not successful
[ $ddexitstatus -ne 0 ] && break
# Get rest of conventional line, breaking out of loop if EOF
read rest || break
echo $rest
done