How to check if a line has the right syntax like this? - bash

If I have the following line syntax
FirstName, FamilyName, Address, PhoneNo
and I am reading a data file that contains information, how can I check that a line I read has the right syntax?
UPDATE::
I mean a function that I send each line to (from a while loop); it returns 0 if the line is correct and 1 if it is not.
UPDATE2::
the correct form is
first name(string), last name(string), address(string), phone no.(string)
so if the line is missing a field, or if there are more than 4, it should return 1.
Using Bash,
Good Input is ::
Rami, Jarrar, Jenin - Wadi berqen, 111 111
# Some Cases To Deal With
, Jarrar, Jenin - Wadi berqen, 111 111
- Extra Spaces::
Rami,  Jarrar,  Jenin - Wadi berqen,  111 111
Rami, Jarrar, Jenin - Wadi berqen, 111 111, 213 3123
ALSO ANOTHER UPDATE :)
check(){
    echo "$@" | grep -q '^[^,]\+,[^,]\+,[^,]\+,[^,]\+$'
}
len=$(wc -l < "$file")   # number of lines in the file
i=1
while [ $i -le $len ]; do
    line=$(sed -n "${i}p" "$file")   # read line number i
    #------this is where i call the func-----
    check $line
    if [ $? -eq 1 ]; then
        echo "ERROR"
    else
        echo "Good Line"
    fi
    i=$((i+1))
done
Versions: Bash 2.3.39, grep 2.5.3
UPDATE
Now, if I change the correct format to this ::
string, value, value, value
value : a positive integer
what should this line be replaced with?
echo "$@" | grep -q '^[^,]\+,[^,]\+,[^,]\+,[^,]\+$'

Allows empty fields:
check () { echo "$@" | grep -q '^[^,]*,[^,]*,[^,]*,[^,]*$'; }
Does not allow any field to be empty:
check () { echo "$@" | grep -q '^[^,]\+,[^,]\+,[^,]\+,[^,]\+$'; }
Bourne shell without using external utilities (allows empty fields):
check () { local IFS=,; set -- $@; return $(test -n "$4" -a -z "$5"); }
Bash 3.2 or greater (allows empty fields):
check () { [[ $@ =~ ^[^,]*,[^,]*,[^,]*,[^,]*$ ]]; }
Bash 3.2 or greater (does not allow empty fields):
check () { [[ $@ =~ ^[^,]+,[^,]+,[^,]+,[^,]+$ ]]; }
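For the updated format (string, value, value, value, with each value a positive integer), a minimal variant of the same pattern might be (a sketch, assuming optional spaces after the commas as in your samples):
check () { echo "$@" | grep -q '^[^,]\+, *[0-9]\+, *[0-9]\+, *[0-9]\+$'; }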

is_correct () {
grep -q '^[^ ][^,]\+, [^ ][^,]\+, [^ ][^,]\+, [^ ][^,]\+$' <<< "$@"
}
l=0
while read line ; do
is_correct "$line" && echo line $l ok || echo Invalid syntax on line $l
((l+=1))
done <<<"Rami, Jarrar, Jenin - Wadi berqen, 111 111
, Jarrar, Jenin - Wadi berqen, 111 111
- Extra Spaces::
Rami,  Jarrar,  Jenin - Wadi berqen,  111 111
Rami, Jarrar, Jenin - Wadi berqen, 111 111, 213 3123
A line, containg fields with, many spaces, but otherwise valid
a, b, c, d
aa, bb, cc, dd"
Yields:
line 0 ok
Invalid syntax on line 1
Invalid syntax on line 2
Invalid syntax on line 3
Invalid syntax on line 4
line 5 ok
Invalid syntax on line 6
line 7 ok
Correctly throws out all the bad cases, including the "too many spaces" one, while accepting the good lines. The only place where it fails is when a valid field has only one character in it (line 6 above).
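If single-character fields should also pass, a slightly looser pattern along the same lines might work (a sketch, not tested beyond the cases above):
is_correct () {
    grep -q '^[^ ,][^,]*, [^ ,][^,]*, [^ ,][^,]*, [^ ,][^,]*$' <<< "$@"
}
The leading [^ ,] still rejects empty fields and the extra-space cases, while [^,]* no longer insists on a second character.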

Related

Shell script with grep and sed to extract individuals from a pair after comparing the numerical values of a variable

I want to compare a group of words (individuals) in pairs and extract the one with the lowest value of a numeric variable. My files and script look like this.
Relatedness_3rdDegree.txt (example):
Individual1 Individual2
Individual5 Individual23
Individual50 Individual65
filename.imiss
INDV N_DATA N_GENOTYPES_FILTERED N_MISS F_MISS
Individual1 375029 0 782 0.00208517
Individual2 375029 0 341 0.000909263
Individual3 375029 0 341 0.000909263
Main script:
numlines=$(wc -l Relatedness_3rdDegree.txt|awk '{print $1}')
for line in `seq 1 $numlines`
do
ind1=$(sed -n "${line}p" Relatedness_3rdDegree.txt|awk '{print $1}')
ind2=$(sed -n "${line}p" Relatedness_3rdDegree.txt|awk '{print $2}')
miss1=$(grep $ind1 filename.imiss|awk '{print $5}')
miss2=$(grep $ind2 filename.imiss|awk '{print $5}')
if echo "$miss1 > $miss2" | bc -l | grep -q 1
then
echo $ind1 >> miss.txt
else
echo $ind2 >> miss.txt
fi
echo "$line / $numlines"
done
This last script will echo a series of lines like this:
1 / 208
2 / 208
3 / 208
and so on, until getting to this error:
91 / 208
(standard_in) 1: syntax error
92 / 208
(standard_in) 1: syntax error
93 / 208
If I go to my output (miss.txt), the printed individuals are not correct.
It should print the individuals, within the pairs contained in the file "Relatedness_3rdDegree.txt", that have the lowest value of F_MISS (column $5 of the "filename.imiss").
For instance, in the pair "Individual1 Individual2", it should compare their values of F_MISS and print only the individual with the lowest value, which in this example would be Individual2.
I have manually checked the values against the printed individuals, and it looks like it printed random individuals for each pair.
What is wrong in this script?
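A likely culprit: grep $ind1 filename.imiss does a substring match, so Individual1 also matches Individual10, Individual100 and so on. miss1 then holds several F_MISS values at once, which is what produces the (standard_in) 1: syntax error from bc and makes the comparisons pick seemingly random individuals. A minimal fix inside your loop could be an exact match on the first column (a sketch using awk):
miss1=$(awk -v id="$ind1" '$1 == id {print $5}' filename.imiss)
miss2=$(awk -v id="$ind2" '$1 == id {print $5}' filename.imiss)
That said, the rewrites below read filename.imiss only once instead of twice per pair.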
Bash version:
#!/bin/bash
declare -A imiss
while read -r ind nd ngf nm fm # we'll ignore most of these
do
imiss[$ind]=$fm
done < filename.imiss
while read -r i1 i2
do
if (( $(echo "${imiss[$i1]} > ${imiss[$i2]}" | bc -l) ))
then
echo "$i1"
else
echo "$i2"
fi
done < Relatedness_3rdDegree.txt
Run* it like:
bash-imiss
AWK version:
#!/usr/bin/awk -f
NR == FNR {imiss[$1] = $5; next}
{
if (imiss[$1] > imiss[$2]) {
print $1
} else {
print $2
}
}
Run* it like:
awk-imiss filename.imiss Relatedness_3rdDegree.txt
These two scripts do exactly the same thing in exactly the same way using associative arrays.
* This assumes that you have set the script file executable using chmod and that it's in your PATH and that the data files are in your current directory.

How can a "grep | sed | awk" script merging line pairs be more cleanly implemented?

I have a little script to extract specific data and clean up the output a little. It seems overly messy and I'm wondering if the script can be trimmed down a bit.
The input file consists of pairs of lines -- names, followed by numbers.
Line pairs where the numeric value is not between 80 and 199 should be discarded.
Pairs may sometimes, but will not always, be preceded or followed by blank lines, which should be ignored.
Example input file:
al12t5682-heapmemusage-latest.log
38
al12t5683-heapmemusage-latest.log
88
al12t5684-heapmemusage-latest.log
100
al12t5685-heapmemusage-latest.log
0
al12t5686-heapmemusage-latest.log
91
Example/wanted output:
al12t5683 88
al12t5684 100
al12t5686 91
Current script:
grep --no-group-separator -PxB1 '([8,9][0-9]|[1][0-9][0-9])' inputfile.txt \
| sed 's/-heapmemusage-latest.log//' \
| awk '{$1=$1;printf("%s ",$0)};NR%2==0{print ""}'
Extra input example
al14672-heapmemusage-latest.log
38
al14671-heapmemusage-latest.log
5
g4t5534-heapmemusage-latest.log
100
al1t0000-heapmemusage-latest.log
0
al1t5535-heapmemusage-latest.log
al1t4676-heapmemusage-latest.log
127
al1t4674-heapmemusage-latest.log
53
A1t5540-heapmemusage-latest.log
54
G4t9981-heapmemusage-latest.log
45
al1c4678-heapmemusage-latest.log
81
B4t8830-heapmemusage-latest.log
76
a1t0091-heapmemusage-latest.log
88
al1t4684-heapmemusage-latest.log
91
Extra Example expected output:
g4t5534 100
al1t4676 127
al1c4678 81
a1t0091 88
al1t4684 91
another awk
$ awk -F- 'NR%2{p=$1; next} 80<=$1 && $1<=199 {print p,$1}' file
al12t5683 88
al12t5684 100
al12t5686 91
UPDATE
For input with empty lines as record delimiters (paragraph mode):
$ awk -v RS= '80<=$2 && $2<=199{sub(/-.*/,"",$1); print}' file
al12t5683 88
al12t5684 100
al12t5686 91
Consider implementing this in native bash, as in the following (which can be seen running with your sample input -- including sporadically-present blank lines -- at http://ideone.com/Qtfmrr):
#!/bin/bash
name=; number=
while IFS= read -r line; do
[[ $line ]] || continue # skip blank lines
[[ -z $name ]] && { name=$line; continue; } # first non-blank line becomes name
number=$line # second one becomes number
if (( number >= 80 && number < 200 )); then
name=${name%%-*} # prune everything after first "-"
printf '%s %s\n' "$name" "$number" # emit our output
fi
name=; number= # clear the variables
done <inputfile.txt
The above uses no external commands whatsoever -- so whereas it might be slower to run over large input than a well-implemented awk or perl script, it also has far shorter startup time since no interpreter other than the already-running shell is required.
See:
BashFAQ #1 - How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?, describing the while read idiom.
BashFAQ #100 - How do I do string manipulations in bash?; or The Bash-Hackers' Wiki on parameter expansion, describing how name=${name%%-*} works (a short demo follows this list).
The Bash-Hackers' Wiki on arithmetic expressions, describing the (( ... )) syntax used for numeric comparisons.
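As a quick demo of the prefix-pruning expansion referenced above, with a name taken from the sample input:
name=al12t5683-heapmemusage-latest.log
echo "${name%%-*}"    # prints al12t5683: the longest suffix matching -* is removed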
A perl one-liner that strips the name's suffix and reads the pair's second line inline:
perl -nle's/-.*//; $n=<>; print "$_ $n" if 80<=$n && $n<=199' inputfile.txt
With GNU sed:
sed -E '
  N                                # append the next line to the pattern space: name\nnumber
  /\n[8-9][0-9]$/bA                # number 80-99: branch to :A (keep the pair)
  /\n1[0-9]{2}$/!d                 # otherwise, unless 100-199: delete the pair
  :A
  s/([^-]*).*\n([0-9]+$)/\1 \2/    # keep the part before the first "-", join with the number
' infile

bash to identify and verify file headers

Using the tab-delimited file below, I am trying to validate the header in line 1 and then store that field count in a variable $header to use in a couple of if statements. If $header equals 10, the file has the expected number of fields; if $header is less than 10, the file is missing headers and the missing header fields are printed underneath. The bash seems close, and if I use the awk by itself it seems to work perfectly, but I cannot seem to use it in the if. Thank you :).
file.txt
Index Chr Start End Ref Alt Freq Qual Score Input
1 1 1 100 C - 1 GOOD 10 .
2 2 20 200 A C .002 STRAND BIAS 2 .
3 2 270 400 - GG .036 GOOD 6 .
file2.txt
Index Chr Start End Ref Alt Freq Qual Score
1 1 1 100 C - 1 GOOD 10
2 2 20 200 A C .002 STRAND BIAS 2
3 2 270 400 - GG .036 GOOD 6
bash
for f in /home/cmccabe/Desktop/validate/*.txt; do
bname=`basename $f`
pref=${bname%%.txt}
header=$(awk -F'\t' '{print NF, "fields detected in file and they are:" ORS $0; exit}') $f >> ${pref}_output # detect header row in file and store in header and write to output
if [[ $header == "10" ]]; then # display results
echo "file has expected number of fields" # file is validated for headers
else
echo "file is missing header for:" # missing header field ...in file not-validated
echo "$header"
fi # close if.... else
done >> ${pref}_output
desired output for file.txt
file has expected number of fields
desired output for file2.txt
file is missing header for:
Input
You can use awk if you like, but bash is more than capable of handling the first line fields comparison on its own. If you maintain an array of expected field names, you can then easily split the first line into fields, compare against the expected number of fields, and output the identity of the missing field if you read less than the expected number of fields from any given file.
The following is a short example that takes filenames as arguments (you need to take filenames from stdin for a large number of files, or use xargs, as required). The script simply reads the first line in each file, separates the line into fields, checks the field count, and outputs any missing fields in a short error message:
#!/bin/bash
declare -i header=10 ## header has 10 fields
## array of field names (can be read from 1st file)
fields=( "Index"
"Chr"
"Start"
"End"
"Ref"
"Alt"
"Freq"
"Qual"
"Score"
"Input" )
for i in "$#"; do ## for each file given as argument
read -r line < "$i" ## read first line from file into 'line'
oldIFS="$IFS" ## save current Internal Field Separator (IFS)
IFS=$'\t' ## set IFS to word-split on '\t'
fldarray=( $line ); ## fill 'fldarray' with fields in line
IFS="$oldIFS" ## restore original IFS
nfields=${#fldarray[@]} ## get number of fields in 'line'
if (( nfields < header )) ## test against header
then
printf "error: only '%d' fields in file '%s'\nmissing:" "$nfields" "$i"
for j in "${fields[#]}" ## for each expected field
do ## check against those in line, if not present print
[[ $line =~ $j ]] || printf " %s" "$j"
done
printf "\n\n" ## tidy up with newlines
fi
done
Example Input
$ cat dat/hdr.txt
Index Chr Start End Ref Alt Freq Qual Score Input
1 1 1 100 C - 1 GOOD 10 .
2 2 20 200 A C .002 STRAND BIAS 2 .
3 2 270 400 - GG .036 GOOD 6 .
$ cat dat/hdr2.txt
Index Chr Start End Ref Alt Freq Qual Score
1 1 1 100 C - 1 GOOD 10
2 2 20 200 A C .002 STRAND BIAS 2
3 2 270 400 - GG .036 GOOD 6
$ cat dat/hdr3.txt
Index Chr Start End Alt Freq Qual Score Input
1 1 1 100 - 1 GOOD 10 .
2 2 20 200 C .002 STRAND BIAS 2 .
3 2 270 400 GG .036 GOOD 6 .
Example Use/Output
$ bash hdrfields.sh dat/hdr.txt dat/hdr2.txt dat/hdr3.txt
error: only '9' fields in file 'dat/hdr2.txt'
missing: Input
error: only '9' fields in file 'dat/hdr3.txt'
missing: Ref
Look things over; while awk can do many things bash cannot on its own, bash is more than capable of parsing text.
Here is one in GNU awk (nextfile):
$ awk '
FNR==NR {
for(n=1;n<=NF;n++)
a[$n]
nextfile
}
NF==(n-1) {
print FILENAME " file has expected number of fields"
nextfile
}
{
for(i=1;i<=NF;i++)
b[$i]
print FILENAME " is missing header for: "
for(i in a)
if(!(i in b))
print i
nextfile
}' file1 file1 file2
file1 file has expected number of fields
file2 is missing header for:
Input
The first file processed by the script defines the headers (in a) that the following files should have; each later file's header is collected (in b) and compared against it. That is why file1 is passed twice: the first pass defines the expected header, the second checks file1 itself.
This piece of code will do exactly what you are asking. Let me know if it works for you.
for f in ./*.txt; do
    if [[ $(head -1 "$f" | awk '{print NF}') -eq 10 ]]; then
        echo "File $f has all the fields on its header"
    else
        # every word appearing only once in "expected headers + actual headers" is missing
        echo "File $f is missing " $(echo "Index Chr Start End Ref Alt Freq Qual Score Input $(head -1 "$f")" \
            | tr ' ' '\n' | sort | uniq -c | awk '/1 / {print $2}')
    fi
done
Output :
File ./file2.txt is missing Input
File ./file.txt has all the fields on its header

split file into several sub files

The file I am working on looks like this
header
//
[25]:0.00843832,469:0.0109533):0.00657864,((((872:0.00120503,((980:0.0001);
[29]:((962:0.000580339,930:0.000580339):0.00543993);
absolute:
gthcont: 5 4 2 1 3 4 543 5 67 657 78 67 8 5645 6
01010010101010101010101010101011111100011
1111010010010101010101010111101000100000
00000000000000011001100101010010101011111
I need it to be split into four files. The first file is
[25]:0.00843832,469:0.0109533):0.00657864,((((872:0.00120503,((980:0.0001);
[29]:((962:0.000580339,930:0.000580339):0.00543993);
The second file has to be
5 4 2 1 3 4 543 5 67 657 78 67 8 5645 6
The next file has to be
01010010101010101010101010101011111100011
11110100100101010101010101111010001000001
00000000000000011001100101010010101011111
so the header and the // have to be excluded before the first file, the absolute: line should be removed, and the gthcont: prefix should not show up either.
Ideally the script would just take the input file name and name the outputs first_input, second_input and third_input...
The fourth file should contain the numbers from within the brackets in the first file; in this case it would only be
25
29
so my current try ist
awk.awk
BEGIN{body=0}
!body && /^\/\/$/ {body=1}
body && /^\[/ {print > "first_"FILENAME}
body && /^pos/{$1="";print > "second_"FILENAME}
body && /^[01]+/ {print > "third_"FILENAME}
body && /^\[[0-9]+\]/ {
print > "first_"FILENAME
print substr($0, 2, index($0,"]")-2) > "fourth_"FILENAME
}
but it somehow duplicates the lines in the first file, so it would be [25], [25], [29], [29]
Some very minor changes to your script produce the desired output:
!body && /^\/\/$/ {body=1}
body && sub(/^gthcont: */,"") {print > "second_"FILENAME}
body && /^[01]+/ {print > "third_"FILENAME}
body && /^\[[0-9]+\]/ {
print > "first_"FILENAME
print substr($0, 2, index($0,"]")-2) > "fourth_"FILENAME
}
The duplication problem was caused by the fact that you printed to the first file in two places.
I have used sub to remove the first part of the gthcont: line (and changed the pattern too). sub returns true if it makes any replacements, so you can use it as a test as well. The advantage of using a substitution rather than unsetting the first field is that you can also get rid of the leading white space from the line.
As pointed out in the comments, there is no need to initialise body, so I removed the BEGIN block too.
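As a quick standalone check of that sub-as-a-condition trick (a made-up input line; a pattern with no action prints the now-modified $0):
echo 'gthcont: 5 4 2' | awk 'sub(/^gthcont: */,"")'    # prints "5 4 2"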
I would just use a shell function for this:
function split3 {
if [[ $# -ne 1 ]]; then echo 'split3: error: require 1 argument.' >&2; return 1; fi;
while read -r; do
line=$REPLY;
if [[ "$line" =~ ^\[([0-9]+)\]: ]]; then
echo "$line" >&3;
echo "${BASH_REMATCH[1]}" >&6;
elif [[ "$line" =~ ^gthcont: ]]; then
echo "${line#gthcont: }" >&4;
elif [[ "$line" =~ ^\s*[01]+\s*$ ]]; then
echo "$line" >&5;
fi;
done <"$1" 3>"first_$1" 4>"second_$1" 5>"third_$1" 6>"fourth_$1";
};
split3 input; echo $?;
## 0
cat first_input;
## [25]:0.00843832,469:0.0109533):0.00657864,((((872:0.00120503,((980:0.0001);
## [29]:((962:0.000580339,930:0.000580339):0.00543993);
cat second_input;
## 5 4 2 1 3 4 543 5 67 657 78 67 8 5645 6
cat third_input;
## 01010010101010101010101010101011111100011
## 1111010010010101010101010111101000100000
## 00000000000000011001100101010010101011111
cat fourth_input;
## 25
## 29

Bash: Merging 4 lines in 4 files into one single file

I'm looking for a way to merge 4 lines of DNA probing results into one line.
The problem here is:
I don't want to append the lines, but to associate them.
The 4 lines of DNA probing:
A----A----------A----A-A--AAAA-
-CC----CCCC-C-----CCC-C-------C
------G----G--G--G------G------
---TT--------T-T---------T-----
I need these to be 1 line, not just appended but intermixed without the dashes.
First characters of the result:
ACCTTAGCCCCGC...
This seems to be a kind of general problem, so the language chosen to solve it doesn't matter.
For fun: one bash way:
lines=(
A----A----------A----A-A--AAAA-
-CC----CCCC-C-----CCC-C-------C
------G----G--G--G------G------
---TT--------T-T---------T-----
)
result=""
for ((i=0;i<${#lines};i++)) ;do
chr=- c=()
for ((l=0;l<${#lines[@]};l++)) ;do
[ "${lines[l]:i:1}" != "-" ] &&
chr="${lines[l]:i:1}" &&
c+=($l)
done
[ ${#c[@]} -eq 0 ] && printf 'Char #%d not replaced.\n' $i
[ ${#c[@]} -gt 1 ] && c="${c[*]}" && chr="*" &&
printf "Conflict at char #%d (lines: %s).\n" $i "${c// /, }"
result+=$chr
done
echo $result
With the provided input there is no conflict and all characters are replaced. So the output is:
ACCTTAGCCCCGCTGTAGCCCACAGTAAAAC
Note: the question speaks of 4 different files, so the lines= syntax could be:
lines=($(cat file1 file2 file3 file4))
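On bash 4 or newer, mapfile may be a more robust way to fill the array (an alternative sketch; the array assignment above relies on word splitting, which happens to be safe here because the lines contain no spaces):
mapfile -t lines < <(cat file1 file2 file3 file4)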
But with a wrong input:
lines=(
A----A---A-----A-----A-A--AAAA-
-CC----CCCC-C-----CCC-C-------C
------G----G---G-G------G------
---TT--------T-T---------T-----
)
output could be:
Conflict at char #9 (lines: 0, 1).
Char #14 not replaced.
Conflict at char #15 (lines: 0, 2, 3).
Char #16 not replaced.
and
echo $result
ACCTTAGCC*CGCT-*-GCCCACAGTAAAAC
Small perl filter
But if input are not to be verified, this little perl filter could do the job:
(Thanks @jm666 for the }{ syntax)
perl -nlE 'y+-+\0+;$,|=$_}{say$,' <(cat file1 file2 file3 file4)
where
-n process all input lines without printing them
-l strip the trailing newline on input and add it back on output
y+lhs+rhs+ replace (translate) chars from 'lhs' to 'rhs'
\0 is the null character, binary 0
$, is a Perl built-in variable (the output field separator), used here as an accumulator
|= bitwise OR between it and the current line ($_)
}{ at END, once all lines are processed
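The accumulation works because | between two Perl strings ORs them byte by byte, and the null byte is the identity for OR. A tiny demo with made-up three-character strings:
perl -E 'say "A\0\0" | "\0C\0" | "\0\0G"'    # prints ACG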
Alternative way - not very efficient - but short:
file="./gene"
line1=$(head -1 "$file")
seq ${#line1} | xargs -n1 -I% cut -c% "$file" | paste -s - | tr -cd '[A-Z\n]'
prints:
ACCTTAGCCCCGCTGTAGCCCACAGTAAAAC
Assumption: each line has the same length.
Decomposition:
the line1=$(head -1 "$file") reads the 1st line into the variable line1
the seq ${#line1} generates a sequence of numbers 1..char_count_in_line1, like
1
2
..
31
the xargs -n1 -I% cut -c% "$file" runs the cut command once for each number above, e.g. cut -c22 filename, which extracts the given column from the file, so you will get output like:
A
-
-
-
-
C
-
-
# and so on
the paste -s - will join the above lines into one long line with the \t (tab) separator, like:
A - - - - C - - - C - - - - - T ... etc...
finally, the tr -cd '[A-Z\n]' removes everything that isn't an uppercase character or a newline, so you get the final
ACCTTAGCCCCGCTGTAGCCCACAGTAAAAC
