Processing a tab delimited file with shell script processing

Processing a tab delimited file with shell script processing - bash

normally I would use Python/Perl for this procedure but I find myself (for political reasons) having to pull this off using a bash shell.
I have a large tab delimited file that contains six columns and the second column is integers. I need to shell script a solution that would verify that the file indeed is six columns and that the second column is indeed integers. I am assuming that I would need to use sed/awk here somewhere. Problem is that I'm not that familiar with sed/awk. Any advice would be appreciated.
Many thanks!
Lilly

gawk:
BEGIN {
FS="\t"
}
(NF != 6) || ($2 != int($2)) {
exit 1
}
Invoke as follows:
if awk -f colcheck.awk somefile
then
# is valid
else
# is not valid
fi

Well you can directly tell awk what the field delimiter is (the -F option). Inside your awk script you can tell how many fields are present in each record with the NF variable.
Oh, and you can check the second field with a regex. The whole thing might look something like this:
awk < thefile -F\\t '
{ if (NF != 6 || $2 ~ /[^0123456789]/) print "Format error, line " NR; }
'
That's probably close but I need to check the regex because Linux regex syntax variation is so insane. (edited because grrrr)

here's how to do it with awk
awk 'NF!=6||$2+0!=$2{print "error"}' file

Pure Bash:
infile='column6.dat'
lno=0
while read -a line ; do
((lno++))
if [ ${#line[#]} -ne 6 ] ; then
echo -e "line $lno has ${#line[#]} elements"
fi
if ! [[ ${line[1]} =~ ^[0-9]+$ ]] ; then
echo -e "line $lno column 2 : not an integer"
fi
done < "$infile"
Possible output:
line 19 has 5 elements
line 36 column 2 : not an integer
line 38 column 2 : not an integer
line 51 has 3 elements

Related

regex to print lines if value between patterns is greater than number - solution which is independent of column position

2001-06-30T11:33:33,543 DEBUG (Bss-Thread-948:[]) SUNCA#44#77#CALMED#OK#58#NARDE#4356#68654768961#BHR#TST#DEV
2001-06-30T11:33:33,543 DEBUG (Bss-Thread-948:[]) SUNCA#44#77#CALMED#OK#58#NARDE#89034#1234567#BHR#TST#DEV
2001-06-30T11:33:33,543 DEBUG (Bss-Thread-948:[]) SUNCA#44#77#OK#58#BHREDD#234586#4254567#BHR#TST#DEV
2001-06-30T11:33:33,543 DEBUG (Bss-Thread-948:[]) SUNCA#44#77#OK#58#NARDE#89034#1034567#BHR#TST#DEV
I have log file mentioned above. I would like to print lines only if value between patterns # and #BHR is greater than 1100000.
I can see in my log file lines with values 68654768961, 1234567, 4254567, 1034567. As per the requirement the output should conatin only first 3 lines.
I am looking for regex to get desired output.

One questions, this #58#BHR should be ignore in third line ? If yes, I will get value between patterns # and #BHR#.
Normally, it should be solved this question by writing scripting according the business logical. But you could try this one line command by awk.
awk '{if (0 == system("[ $(echo \"" $0 "\"" " | grep -oP \"" "(?<=#)\\d+(?=#BHR#)\" || echo 0) -gt 1100000 ]")) {print $0}}' log_file
Mainly, it use system() to scratch the value by grep:
# if can't get the pattern value by grep, the value will assign 0
echo $one_line | grep -oP "(?<=#)\d+(?=#BHR#)" || echo 0`
and compare the value to 1100000 by [ "$value" -gt 1100000 ] in awk.
FYI, so if the value greater than 1100000 it will return 0.
system(cmd): executes cmd and returns its exit status

Print only once if something specific name is in the file

I have a problem. This is my script:
#!/bin/bash
file_name="eq3_luteina_horyzontalna"
file_name2="wiazanie_PO4"
tmp=$(mktemp) || exit 1
for index in {1..405000}
do
if ! [ -s "${file_name}_$index.ndx" ];then
echo "0" >> ${file_name2}_POP42.txt
else
awk '{if($2==/POP42/) print "5"; else print "0"}' ${file_name}_$index.ndx >> ${file_name2}_POP42.txt
fi
done
The problem is here
awk '{if($2==/POP42/) print "5"; else print "0"}' ${file_name}_$index.ndx
I want to only check if POP42 is in the file in the second column and print 5
but I have data like that
162 POP87
1851 POP42
so it will print into my output file ${file_name2}_POP42.txt, something like that:
0
5
but I want to have
5
Another situation
3075 POP42
2911 POP42
It will print to output
5
5
but I want only
5
How can I manage my problem?

awk '$2=="POP42"{s=5; exit} END{print s+0}' file
By the way - $2==/POP42/ doesn't do what you think it does, i.e. look for lines with $2 equal to (or even containing) POP42. It's actually shorthand for $2==($0 ~ /POP42/ ? 1 : 0) courtesy of the regexp delimiters /.../ you used and what THAT does is see if a string matching the regexp POP42 occurs anywhere on the current line and, if it does, then test to see if $2 has the value 1, otherwise test to see if $2 has the value 0. It's important to know the difference between string (") and regexp (/) delimiters and string (e.g. ==) and regexp (e.g. ~) comparison operators when using awk.

appending text to specific line in file bash

So I have a file that contains some lines of text separated by ','. I want to create a script that counts how much parts a line has and if the line contains 16 parts i want to add a new one. So far its working great. The only thing that is not working is appending the ',' at the end. See my example below:
Original file:
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
Expected result:
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
This is my code:
while read p; do
if [[ $p == "HEA"* ]]
then
IFS=',' read -ra ADDR <<< "$p"
echo ${#ADDR[#]}
arrayCount=${#ADDR[#]}
if [ "${arrayCount}" -eq 16 ];
then
sed -i "/$p/ s/\$/,xx/g" $f
fi
fi
done <$f
Result:
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
,xx
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
,xx
What im doing wrong? I'm sure its something small but i cant find it..

It can be done using awk:
awk -F, 'NF==16{$0 = $0 FS "xx"} 1' file
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
-F, sets input field separator as comma
NF==16 is the condition that says execute block inside { and } if # of fields is 16
$0 = $0 FS "xx" appends xx at end of line
1 is the default awk action that means print the output

For using sed answer should be in the following:
Use ${line_number} s/..../..../ format - to target a specific line, you need to find out the line number first.
Use the special char & to denote the matched string
The sed statement should look like the following:
sed -i "${line_number}s/.*/&xx/"
I would prefer to leave it to you to play around with it but if you would prefer i can give you a full working sample.

how to validate if data has a trailing "/"

I have a file containing various information. The fields are delimited by |. One of the fields contains a directory. For example :
blah|blah|blah|/usr/local/etc/|blah|blah
I need to validate that the path field does not end with a "/". I'm using ksh. Any suggestions?
thanks.

Assuming the directory is always in the 4th field
line=0
while IFS='|' read -rA fields; do
let line++
[[ ${fields[3]} == */ ]] && echo line $line: ends with a slash
done < filename

Not ksh, but this is a natural job for awk:
awk -F\| '$4 ~ /\/$/ {
print "Trailing slash in line "NR":", $4
}' ${file:?}

Try this:
if [ line ~= '(/\w+)+(\||$)' ]
My shell syntax is rusty, so this might need a little massaging into shape

Don't forget special path like / (root)
I keep the / (root) in code below
echo "blah|blah|blah|/usr/local/etc/|blah|blah|
blah|blah|blah|/|blah|blah
blah|blah|blah|.|blah|blah
blah|blah|blah|/usr/local/etc|blah|blah" \
sed "
/\/\|/ {
/\|\/\|/ !s/\/|/|/
}"
explaination:
//\|/ treat line where a "/|" appear
//\|/ ! treat line where "|/|" doesn't appear (here in the case of previous test occur)
s//|/|/ replace "/|" by "|" (here when both test occur successfully)

Setting a BASH environment variable directly in AWK (in an AWK one-liner)

I have a file that has two columns of floating point values. I also have a C program that takes a floating point value as input and returns another floating point value as output.
What I'd like to do is the following: for each row in the original, execute the C program with the value in the first column as input, and then print out the first column (unchanged) followed by the second column minus the result of the C program.
As an example, suppose c_program returns the square of the input and behaves like this:
$ c_program 4
16
$
and suppose data_file looks like this:
1 10
2 11
3 12
4 13
What I'd like to return as output, in this case, is
1 9
2 7
3 3
4 -3
To write this in really sketchy pseudocode, I want to do something like this:
awk '{print $1, $2 - `c_program $1`}' data_file
But of course, I can't just pass $1, the awk variable, into a call to c_program. What's the right way to do this, and preferably, how could I do it while still maintaining the "awk one-liner"? (I don't want to pull out a sledgehammer and write a full-fledged C program to do this.)

you just do everything in awk
awk '{cmd="c_program "$1; cmd|getline l;print $1,$2-l}' file

This shows how to execute a command in awk:
ls | awk '/^a/ {system("ls -ld " $1)}'
You could use a bash script instead:
while read line
do
FIRST=`echo $line | cut -d' ' -f1`
SECOND=`echo $line | cut -d' ' -f2`
OUT=`expr $SECOND \* 4`
echo $FIRST $OUT `expr $OUT - $SECOND`
done

The shell is a better tool for this using a little used feature. There is a shell variable IFS which is the Input Field Separator that sh uses to split command lines when parsing; it defaults to <Space><Tab><Newline> which is why ls foo is interpreted as two words.
When set is given arguments not beginning with - it sets the positional parameters of the shell to the contents of the arguments as split via IFS, thus:
#!/bin/sh
while read line ; do
set $line
subtrahend=`c_program $1`
echo $1 `expr $2 - $subtrahend`
done < data_file

Pure Bash, without using any external executables other than your program:
#!/bin/bash
while read num1 num2
do
(( result = $(c_program num2) - num1 ))
echo "$num1 $result"
done

As others have pointed out: awk is not not well equipped for this job. Here is a suggestion in bash:
#!/bin/sh
data_file=$1
while read column_1 column_2 the_rest
do
((result=$(c_program $column_1)-$column_2))
echo $column_1 $result "$the_rest"
done < $data_file
Save this to a file, say myscript.sh, then invoke it as:
sh myscript.sh data_file
The read command reads each line from the data file (which was redirected to the standard input) and assign the first 2 columns to $column_1 and $column_2 variables. The rest of the line, if there is any, is stored in $the_rest.
Next, I calculate the result based on your requirements and prints out the line based on your requirements. Note that I surround $the_rest with quotes to reserve spacing. Failure to do so will result in multiple spaces in the input file to be squeezed into one.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Processing a tab delimited file with shell script processing - bash

gawk: BEGIN { FS="\t" } (NF != 6) || ($2 != int($2)) { exit 1 } Invoke as follows: if awk -f colcheck.awk somefile then # is valid else # is not valid fi

here's how to do it with awk awk 'NF!=6||$2+0!=$2{print "error"}' file

Related

regex to print lines if value between patterns is greater than number - solution which is independent of column position

Print only once if something specific name is in the file

appending text to specific line in file bash

how to validate if data has a trailing "/"

Setting a BASH environment variable directly in AWK (in an AWK one-liner)

Categories

Resources