Processing a tab delimited file with shell script processing - bash

Normally I would use Python/Perl for this procedure, but I find myself (for political reasons) having to pull this off in a bash shell.
I have a large tab delimited file that contains six columns and the second column is integers. I need to shell script a solution that would verify that the file indeed is six columns and that the second column is indeed integers. I am assuming that I would need to use sed/awk here somewhere. Problem is that I'm not that familiar with sed/awk. Any advice would be appreciated.
Many thanks!
Lilly

gawk:
BEGIN {
FS="\t"
}
(NF != 6) || ($2 != int($2)) {
exit 1
}
Invoke as follows:
if awk -f colcheck.awk somefile
then
echo "is valid"
else
echo "is not valid"
fi

Well you can directly tell awk what the field delimiter is (the -F option). Inside your awk script you can tell how many fields are present in each record with the NF variable.
Oh, and you can check the second field with a regex. The whole thing might look something like this:
awk < thefile -F\\t '
{ if (NF != 6 || $2 ~ /[^0123456789]/) print "Format error, line " NR; }
'
That's probably close but I need to check the regex because Linux regex syntax variation is so insane. (edited because grrrr)

Here's how to do it with awk (with the tab set as the field separator):
awk -F'\t' 'NF!=6||$2+0!=$2{print "error"}' file
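The awk answers above can be combined into one check that both reports the offending line and sets an exit status. This is only a minimal sketch, assuming a strict definition of "integer" (an optional sign followed by digits); the function name check_columns and the sample data are made up for illustration.

```shell
# check_columns is a hypothetical helper name, not taken from the answers above.
check_columns() {
    awk -F '\t' '
        NF != 6 {
            printf "line %d: expected 6 fields, found %d\n", NR, NF
            bad = 1
            next
        }
        $2 !~ /^[+-]?[0-9]+$/ {
            printf "line %d: column 2 is not an integer: %s\n", NR, $2
            bad = 1
        }
        END { exit bad + 0 }   # non-zero exit status if any line failed
    ' "$1"
}

# Made-up sample data: line 2 has a non-integer column 2, line 3 is short.
sample=$(mktemp)
printf 'a\t1\tc\td\te\tf\na\t1.5\tc\td\te\tf\na\t3\tc\n' > "$sample"

if check_columns "$sample"; then
    echo "file is valid"
else
    echo "file is not valid"
fi
rm -f "$sample"
```

Unlike the $2+0!=$2 and int($2) comparisons, the regex rejects values such as 1.0, which may or may not be what you want.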

Pure Bash:
infile='column6.dat'
lno=0
while IFS=$'\t' read -r -a line ; do
((lno++))
if [ ${#line[@]} -ne 6 ] ; then
echo -e "line $lno has ${#line[@]} elements"
fi
if ! [[ ${line[1]} =~ ^[0-9]+$ ]] ; then
echo -e "line $lno column 2 : not an integer"
fi
done < "$infile"
Possible output:
line 19 has 5 elements
line 36 column 2 : not an integer
line 38 column 2 : not an integer
line 51 has 3 elements

Related

regex to print lines if value between patterns is greater than number - solution which is independent of column position

2001-06-30T11:33:33,543 DEBUG (Bss-Thread-948:[]) SUNCA#44#77#CALMED#OK#58#NARDE#4356#68654768961#BHR#TST#DEV
2001-06-30T11:33:33,543 DEBUG (Bss-Thread-948:[]) SUNCA#44#77#CALMED#OK#58#NARDE#89034#1234567#BHR#TST#DEV
2001-06-30T11:33:33,543 DEBUG (Bss-Thread-948:[]) SUNCA#44#77#OK#58#BHREDD#234586#4254567#BHR#TST#DEV
2001-06-30T11:33:33,543 DEBUG (Bss-Thread-948:[]) SUNCA#44#77#OK#58#NARDE#89034#1034567#BHR#TST#DEV
I have the log file shown above. I would like to print a line only if the value between the patterns # and #BHR is greater than 1100000.
In my log file I can see lines with the values 68654768961, 1234567, 4254567, and 1034567. Per the requirement, the output should contain only the first 3 lines.
I am looking for a regex to get the desired output.
One question: should the #58#BHR in the third line be ignored? If so, I will take the value between the patterns # and #BHR#.
Normally this would be solved by writing a script that follows the business logic, but you could try this awk one-liner.
awk '{if (0 == system("[ $(echo \"" $0 "\"" " | grep -oP \"" "(?<=#)\\d+(?=#BHR#)\" || echo 0) -gt 1100000 ]")) {print $0}}' log_file
It mainly uses system() to extract the value with grep:
# if grep can't find the pattern, the value is set to 0
echo $one_line | grep -oP "(?<=#)\d+(?=#BHR#)" || echo 0
and compares the value to 1100000 with [ "$value" -gt 1100000 ] in awk.
FYI, if the value is greater than 1100000, the test returns 0.
system(cmd): executes cmd and returns its exit status
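The same filter can also be written without starting a shell and grep for every line. Here is a sketch in plain awk, assuming the value of interest is always the run of digits immediately before #BHR#; the function name filter_bhr and the shortened sample lines are invented for illustration.

```shell
# filter_bhr FILE THRESHOLD: print lines whose number before "#BHR#" exceeds THRESHOLD.
filter_bhr() {
    awk -v min="$2" '
        match($0, /#[0-9]+#BHR#/) {
            # The match is "#" + digits + "#BHR#": skip 1 character, drop 6 in total.
            value = substr($0, RSTART + 1, RLENGTH - 6) + 0
            if (value > min + 0) print
        }
    ' "$1"
}

# Shortened versions of three of the log lines from the question.
log=$(mktemp)
cat > "$log" <<'EOF'
SUNCA#44#77#CALMED#OK#58#NARDE#4356#68654768961#BHR#TST#DEV
SUNCA#44#77#OK#58#BHREDD#234586#4254567#BHR#TST#DEV
SUNCA#44#77#OK#58#NARDE#89034#1034567#BHR#TST#DEV
EOF
filter_bhr "$log" 1100000   # prints the first two sample lines
rm -f "$log"
```

Because the pattern requires a # right after BHR, the #58#BHREDD fragment in the second sample line is not mistaken for the value.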

Print only once if something specific name is in the file

I have a problem. This is my script:
#!/bin/bash
file_name="eq3_luteina_horyzontalna"
file_name2="wiazanie_PO4"
tmp=$(mktemp) || exit 1
for index in {1..405000}
do
if ! [ -s "${file_name}_$index.ndx" ];then
echo "0" >> ${file_name2}_POP42.txt
else
awk '{if($2==/POP42/) print "5"; else print "0"}' ${file_name}_$index.ndx >> ${file_name2}_POP42.txt
fi
done
The problem is here
awk '{if($2==/POP42/) print "5"; else print "0"}' ${file_name}_$index.ndx
I want to only check if POP42 is in the file in the second column and print 5
but I have data like that
162 POP87
1851 POP42
so it will print into my output file ${file_name2}_POP42.txt, something like that:
0
5
but I want to have
5
Another situation
3075 POP42
2911 POP42
It will print to output
5
5
but I want only
5
How can I manage my problem?
awk '$2=="POP42"{s=5; exit} END{print s+0}' file
By the way, $2==/POP42/ doesn't do what you think it does, i.e. look for lines with $2 equal to (or even containing) POP42. Because of the regexp delimiters /.../ you used, it is actually shorthand for $2==($0 ~ /POP42/ ? 1 : 0). That checks whether a string matching the regexp POP42 occurs anywhere on the current line; if it does, it tests whether $2 has the value 1, otherwise it tests whether $2 has the value 0. It's important to know the difference between string (") and regexp (/) delimiters, and between string (e.g. ==) and regexp (e.g. ~) comparison operators, when using awk.
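To show how that one-liner slots into the loop from the question, here is a small sketch. The function name pop42_flag and the sample file names are hypothetical, and the made-up files stand in for the real .ndx files.

```shell
# pop42_flag FILE: print 5 if any line has POP42 in column 2, otherwise 0.
pop42_flag() {
    awk '$2 == "POP42" { found = 5; exit } END { print found + 0 }' "$1"
}

# Made-up sample files standing in for the .ndx files.
dir=$(mktemp -d)
printf '162 POP87\n1851 POP42\n' > "$dir/sample_1.ndx"
printf '3075 POP42\n2911 POP42\n' > "$dir/sample_2.ndx"
: > "$dir/sample_3.ndx"   # empty file

for index in 1 2 3; do
    f="$dir/sample_$index.ndx"
    if [ -s "$f" ]; then
        pop42_flag "$f"
    else
        echo 0
    fi
done    # prints 5, 5, 0 on separate lines
rm -rf "$dir"
```

Each file contributes exactly one line of output, because the exit in the main rule jumps straight to the END block.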

appending text to specific line in file bash

So I have a file that contains lines of text separated by ','. I want to create a script that counts how many parts a line has and, if the line contains 16 parts, adds a new one. So far it's working great. The only thing that is not working is appending the ',' at the end. See my example below:
Original file:
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
Expected result:
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
This is my code:
while read p; do
if [[ $p == "HEA"* ]]
then
IFS=',' read -ra ADDR <<< "$p"
echo ${#ADDR[@]}
arrayCount=${#ADDR[@]}
if [ "${arrayCount}" -eq 16 ];
then
sed -i "/$p/ s/\$/,xx/g" $f
fi
fi
done <$f
Result:
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
,xx
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
,xx
What am I doing wrong? I'm sure it's something small but I can't find it.
It can be done using awk:
awk -F, 'NF==16{$0 = $0 FS "xx"} 1' file
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
-F, sets input field separator as comma
NF==16 is the condition that says execute block inside { and } if # of fields is 16
$0 = $0 FS "xx" appends xx at end of line
1 is an always-true condition, so awk runs its default action and prints the (possibly modified) line
To do this with sed:
Use the ${line_number}s/..../..../ format. To target a specific line, you need to find out the line number first.
Use the special char & to denote the matched string.
The sed statement should look like the following:
sed -i "${line_number}s/.*/&,xx/" "$f"
I would prefer to leave it to you to play around with it but if you would prefer i can give you a full working sample.
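One way to combine the two ideas (find the line numbers first, then let sed do the appending) is sketched below. It is an illustration rather than the answerer's own sample: the function name append_xx is invented, awk is used only to generate a small sed script, and the result goes to standard output instead of editing the file in place.

```shell
# append_xx FILE: print FILE with ",xx" appended to every line that has 16 fields.
append_xx() {
    # awk emits one sed command per matching line, e.g. "1s/$/,xx/".
    script=$(awk -F, 'NF == 16 { printf "%ds/$/,xx/\n", NR }' "$1")
    sed -e "$script" "$1"
}

# Made-up sample: a 16-field line followed by a 6-field line.
sample=$(mktemp)
printf '%s\n' 'a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a' 'b,b,b,b,b,b' > "$sample"
append_xx "$sample"
rm -f "$sample"
```

Addressing lines by number avoids using the line's own text as a sed pattern, which is fragile when the text contains regex metacharacters.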

how to validate if data has a trailing "/"

I have a file containing various information. The fields are delimited by |. One of the fields contains a directory. For example :
blah|blah|blah|/usr/local/etc/|blah|blah
I need to validate that the path field does not end with a "/". I'm using ksh. Any suggestions?
thanks.
Assuming the directory is always in the 4th field
line=0
while IFS='|' read -rA fields; do
let line++
[[ ${fields[3]} == */ ]] && echo line $line: ends with a slash
done < filename
Not ksh, but this is a natural job for awk:
awk -F\| '$4 ~ /\/$/ {
print "Trailing slash in line "NR":", $4
}' ${file:?}
Try this:
if [[ $line =~ (/[^/|]+)+(\||$) ]]
My shell syntax is rusty, so this might need a little massaging into shape
Don't forget special paths like / (root).
The code below leaves / (root) untouched:
echo "blah|blah|blah|/usr/local/etc/|blah|blah|
blah|blah|blah|/|blah|blah
blah|blah|blah|.|blah|blah
blah|blah|blah|/usr/local/etc|blah|blah" |
sed '
/\/|/ {
/|\/|/!s/\/|/|/
}'
Explanation:
/\/|/ selects lines where "/|" appears
/|\/|/! selects lines where "|/|" does not appear (checked only when the previous test matched)
s/\/|/|/ replaces "/|" with "|" (applied only when both tests succeed)
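If the goal is to validate rather than rewrite, the root exception can be folded into the awk approach as well. This is a rough sketch, assuming the directory is always the 4th field; the function name check_trailing_slash is hypothetical.

```shell
# check_trailing_slash FILE: report 4th fields ending in "/", except the root path.
check_trailing_slash() {
    awk -F'|' '
        $4 != "/" && $4 ~ /\/$/ {
            printf "line %d: trailing slash: %s\n", NR, $4
            bad = 1
        }
        END { exit bad + 0 }   # non-zero exit status if anything was reported
    ' "$1"
}

# Same sample lines as in the sed example above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
blah|blah|blah|/usr/local/etc/|blah|blah|
blah|blah|blah|/|blah|blah
blah|blah|blah|.|blah|blah
blah|blah|blah|/usr/local/etc|blah|blah
EOF
check_trailing_slash "$sample" || echo "validation failed"
rm -f "$sample"
```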

Setting a BASH environment variable directly in AWK (in an AWK one-liner)

I have a file that has two columns of floating point values. I also have a C program that takes a floating point value as input and returns another floating point value as output.
What I'd like to do is the following: for each row in the original, execute the C program with the value in the first column as input, and then print out the first column (unchanged) followed by the second column minus the result of the C program.
As an example, suppose c_program returns the square of the input and behaves like this:
$ c_program 4
16
$
and suppose data_file looks like this:
1 10
2 11
3 12
4 13
What I'd like to return as output, in this case, is
1 9
2 7
3 3
4 -3
To write this in really sketchy pseudocode, I want to do something like this:
awk '{print $1, $2 - `c_program $1`}' data_file
But of course, I can't just pass $1, the awk variable, into a call to c_program. What's the right way to do this, and preferably, how could I do it while still maintaining the "awk one-liner"? (I don't want to pull out a sledgehammer and write a full-fledged C program to do this.)
You can just do everything in awk, closing each command after reading its output:
awk '{cmd="c_program "$1; cmd|getline l; close(cmd); print $1,$2-l}' file
This shows how to execute a command in awk:
ls | awk '/^a/ {system("ls -ld " $1)}'
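For reading a command's output back into awk (rather than just running it with system()), the cmd | getline form is the relevant one. The sketch below spells it out with a status check and close(). The c_program here is only a hypothetical stand-in script that squares its argument, as in the question's example, and subtract_program_output is an invented name.

```shell
# Hypothetical stand-in for c_program: prints the square of its argument.
tmpdir=$(mktemp -d)
trap 'rm -rf "$tmpdir"' EXIT
cat > "$tmpdir/c_program" <<'EOF'
#!/bin/sh
awk -v x="$1" 'BEGIN { print x * x }'
EOF
chmod +x "$tmpdir/c_program"
PATH="$tmpdir:$PATH"

# Print column 1, then column 2 minus the program's output for column 1.
subtract_program_output() {
    awk '{
        cmd = "c_program " $1
        if ((cmd | getline result) > 0)
            print $1, $2 - result
        close(cmd)   # release the pipe before the next line
    }' "$1"
}

printf '1 10\n2 11\n3 12\n4 13\n' > "$tmpdir/data_file"
subtract_program_output "$tmpdir/data_file"
```

With the sample data from the question this prints 1 9, 2 7, 3 3 and 4 -3, one pair per line. Without close(cmd), a large input file can exhaust the number of pipes awk is allowed to keep open.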
You could use a bash script instead:
while read line
do
FIRST=`echo $line | cut -d' ' -f1`
SECOND=`echo $line | cut -d' ' -f2`
OUT=`c_program $FIRST`
echo $FIRST `expr $SECOND - $OUT`
done < data_file
The shell is a better tool for this, using a little-used feature. There is a shell variable IFS, the Input Field Separator, that sh uses to split command lines when parsing; it defaults to <Space><Tab><Newline>, which is why ls foo is interpreted as two words.
When set is given arguments not beginning with - it sets the positional parameters of the shell to the contents of the arguments as split via IFS, thus:
#!/bin/sh
while read line ; do
set $line
subtrahend=`c_program $1`
echo $1 `expr $2 - $subtrahend`
done < data_file
Pure Bash, without using any external executables other than your program:
#!/bin/bash
while read num1 num2
do
(( result = num2 - $(c_program "$num1") ))
echo "$num1 $result"
done < data_file
As others have pointed out, awk is not well equipped for this job. Here is a suggestion in bash:
#!/bin/bash
data_file=$1
while read column_1 column_2 the_rest
do
((result=column_2-$(c_program "$column_1")))
echo $column_1 $result "$the_rest"
done < "$data_file"
Save this to a file, say myscript.sh, then invoke it as:
bash myscript.sh data_file
The read command reads each line from the data file (which was redirected to the standard input) and assigns the first 2 columns to the $column_1 and $column_2 variables. The rest of the line, if there is any, is stored in $the_rest.
Next, I calculate the result and print the line based on your requirements. Note that I surround $the_rest with quotes to preserve spacing. Failure to do so will result in multiple spaces in the input file being squeezed into one.
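Since the question mentions floating point values, and both (( )) and expr do integer arithmetic only, a shell loop needs some help for the subtraction. Below is a sketch that keeps the read loop but hands the arithmetic to awk. The c_program function is a hypothetical stand-in that squares its input, and subtract_floats is an invented name.

```shell
# Hypothetical stand-in for the real c_program: prints the square of its argument.
c_program() {
    awk -v x="$1" 'BEGIN { print x * x }'
}

# Read "col1 col2" pairs from stdin; print col1 and col2 - c_program(col1).
subtract_floats() {
    while read -r col1 col2 rest; do
        out=$(c_program "$col1")
        # awk does the floating point subtraction that the shell cannot.
        awk -v a="$col1" -v b="$col2" -v c="$out" 'BEGIN { print a, b - c }'
    done
}

printf '1.5 10\n2 11.25\n' | subtract_floats
```

This starts two extra processes per input line, so for large files the all-awk approach is likely to be faster.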
