Processing a tab delimited file with shell script processing - bash
Normally I would use Python or Perl for this, but (for political reasons) I have to pull this off in a bash shell.
I have a large tab delimited file that contains six columns and the second column is integers. I need to shell script a solution that would verify that the file indeed is six columns and that the second column is indeed integers. I am assuming that I would need to use sed/awk here somewhere. Problem is that I'm not that familiar with sed/awk. Any advice would be appreciated.
Many thanks!
Lilly
gawk:
BEGIN {
FS="\t"
}
(NF != 6) || ($2 != int($2)) {
exit 1
}
Invoke as follows:
if awk -f colcheck.awk somefile
then
    echo "is valid"
else
    echo "is not valid"
fi
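If you also want to know which lines are bad, the same idea can be wrapped in a function. This is only a sketch: check_tsv is a made-up name, the sample file is invented for the demo, and the regex assumes "integer" means an optional sign followed by digits.

```shell
#!/usr/bin/env bash
# check_tsv is a hypothetical helper name, not part of the answer above.
# It reports every offending line and exits non-zero if any were found.
check_tsv() {
    awk -F '\t' '
        NF != 6               { printf "line %d: %d fields\n", NR, NF; bad = 1 }
        $2 !~ /^[+-]?[0-9]+$/ { printf "line %d: column 2 is not an integer\n", NR; bad = 1 }
        END                   { exit bad }
    ' "$1"
}

# Invented sample data: one good line, one bad integer, one short line.
sample=$(mktemp)
printf 'a\t1\tc\td\te\tf\n' >  "$sample"
printf 'a\tx\tc\td\te\tf\n' >> "$sample"
printf 'a\t3\tc\n'          >> "$sample"

if report=$(check_tsv "$sample"); then
    status=0
else
    status=$?
fi
rm -f "$sample"
echo "$report"
```

Unlike the exit-on-first-error version, this keeps reading so that all problems show up in one pass.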
Well you can directly tell awk what the field delimiter is (the -F option). Inside your awk script you can tell how many fields are present in each record with the NF variable.
Oh, and you can check the second field with a regex. The whole thing might look something like this:
awk < thefile -F\\t '
{ if (NF != 6 || $2 ~ /[^0123456789]/) print "Format error, line " NR; }
'
That's probably close but I need to check the regex because Linux regex syntax variation is so insane. (edited because grrrr)
Here's how to do it with awk (the -F'\t' makes sure only tabs separate the fields):
awk -F'\t' 'NF!=6||$2+0!=$2{print "error"}' file
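One thing to be aware of: the $2+0!=$2 comparison checks that column 2 is numeric, not that it is an integer, so a value like 1.5 passes. A quick side-by-side with a digits-only regex (the three sample values are made up for the demo):

```shell
#!/usr/bin/env bash
# Feed three sample second-column values through both tests.
results=$(printf 'a\t42\na\t1.5\na\tabc\n' | awk -F '\t' '
    {
        # numeric round-trip test, as in the one-liner above
        roundtrip = ($2 + 0 != $2) ? "reject" : "accept"
        # digits-only test with an optional sign
        regex = ($2 ~ /^[+-]?[0-9]+$/) ? "accept" : "reject"
        print $2, roundtrip, regex
    }')
echo "$results"
```

The columns are: value, verdict of the round-trip test, verdict of the regex test.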
Pure Bash:
infile='column6.dat'
lno=0
while IFS=$'\t' read -r -a line ; do
((lno++))
if [ ${#line[@]} -ne 6 ] ; then
echo -e "line $lno has ${#line[@]} elements"
fi
if ! [[ ${line[1]} =~ ^[0-9]+$ ]] ; then
echo -e "line $lno column 2 : not an integer"
fi
done < "$infile"
Possible output:
line 19 has 5 elements
line 36 column 2 : not an integer
line 38 column 2 : not an integer
line 51 has 3 elements
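A caveat with read -a: tab counts as IFS whitespace in bash, so consecutive tabs collapse and empty fields silently disappear from the array. The snippet below is a small demonstration with an invented line whose second field is empty; awk with -F '\t' still counts it.

```shell
#!/usr/bin/env bash
line=$'a\t\tc\td\te\tf'     # six tab-separated fields, the second one empty

# bash: the two adjacent tabs are treated as a single separator
IFS=$'\t' read -r -a fields <<< "$line"
bash_count=${#fields[@]}

# awk: every tab is a separator, so the empty field is kept
awk_count=$(printf '%s\n' "$line" | awk -F '\t' '{ print NF }')

echo "read -a sees $bash_count fields, awk sees $awk_count"
```

So if empty columns are possible in the data, the awk-based checks are the safer choice.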
Related
regex to print lines if value between patterns is greater than number - solution which is independent of column position
2001-06-30T11:33:33,543 DEBUG (Bss-Thread-948:[]) SUNCA#44#77#CALMED#OK#58#NARDE#4356#68654768961#BHR#TST#DEV
2001-06-30T11:33:33,543 DEBUG (Bss-Thread-948:[]) SUNCA#44#77#CALMED#OK#58#NARDE#89034#1234567#BHR#TST#DEV
2001-06-30T11:33:33,543 DEBUG (Bss-Thread-948:[]) SUNCA#44#77#OK#58#BHREDD#234586#4254567#BHR#TST#DEV
2001-06-30T11:33:33,543 DEBUG (Bss-Thread-948:[]) SUNCA#44#77#OK#58#NARDE#89034#1034567#BHR#TST#DEV
I have the log file shown above. I would like to print a line only if the value between the patterns # and #BHR is greater than 1100000. The values in my log file are 68654768961, 1234567, 4254567 and 1034567, so the output should contain only the first three lines. I am looking for a regex that produces the desired output.
One question: should the #58#BHR in the third line be ignored? If so, I will take the value between the patterns # and #BHR#. Normally this would be solved with a script that follows the business logic, but you could try this awk one-liner:
awk '{if (0 == system("[ $(echo \"" $0 "\"" " | grep -oP \"" "(?<=#)\\d+(?=#BHR#)\" || echo 0) -gt 1100000 ]")) {print $0}}' log_file
It uses system() to extract the value with grep. If grep cannot find the pattern, the value defaults to 0:
echo $one_line | grep -oP "(?<=#)\d+(?=#BHR#)" || echo 0
The value is then compared to 1100000 with [ "$value" -gt 1100000 ]. FYI, system(cmd) executes cmd and returns its exit status, so the test returns 0 when the value is greater than 1100000, and awk prints the line.
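For comparison, here is a sketch that stays inside awk and uses match() instead of spawning a shell and grep for every line. It is a different technique from the system() call above, and it assumes the value of interest is the run of digits immediately before #BHR#.

```shell
#!/usr/bin/env bash
# Sample log copied from the question.
log_file=$(mktemp)
cat > "$log_file" <<'EOF'
2001-06-30T11:33:33,543 DEBUG (Bss-Thread-948:[]) SUNCA#44#77#CALMED#OK#58#NARDE#4356#68654768961#BHR#TST#DEV
2001-06-30T11:33:33,543 DEBUG (Bss-Thread-948:[]) SUNCA#44#77#CALMED#OK#58#NARDE#89034#1234567#BHR#TST#DEV
2001-06-30T11:33:33,543 DEBUG (Bss-Thread-948:[]) SUNCA#44#77#OK#58#BHREDD#234586#4254567#BHR#TST#DEV
2001-06-30T11:33:33,543 DEBUG (Bss-Thread-948:[]) SUNCA#44#77#OK#58#NARDE#89034#1034567#BHR#TST#DEV
EOF

matches=$(awk 'match($0, /#[0-9]+#BHR#/) {
    # The match is "#" + digits + "#BHR#": skip the leading "#",
    # and drop 6 characters ("#" and "#BHR#") from the length.
    value = substr($0, RSTART + 1, RLENGTH - 6)
    if (value + 0 > 1100000) print
}' "$log_file")
rm -f "$log_file"
echo "$matches"
```

Because the regex requires a # right after BHR, the #58#BHREDD fragment in the third line is not picked up.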
Print only once if something specific name is in the file
I have a problem. This is my script:
#!/bin/bash
file_name="eq3_luteina_horyzontalna"
file_name2="wiazanie_PO4"
tmp=$(mktemp) || exit 1
for index in {1..405000}
do
    if ! [ -s "${file_name}_$index.ndx" ];then
        echo "0" >> ${file_name2}_POP42.txt
    else
        awk '{if($2==/POP42/) print "5"; else print "0"}' ${file_name}_$index.ndx >> ${file_name2}_POP42.txt
    fi
done
The problem is here:
awk '{if($2==/POP42/) print "5"; else print "0"}' ${file_name}_$index.ndx
I only want to check whether POP42 appears in the second column of the file and, if so, print 5. But I have data like this:
162 POP87
1851 POP42
so it prints something like this into my output file ${file_name2}_POP42.txt:
0
5
but I want only
5
Another situation:
3075 POP42
2911 POP42
prints
5
5
to the output, but I want only
5
How can I solve this?
awk '$2=="POP42"{s=5; exit} END{print s+0}' file
By the way, $2==/POP42/ doesn't do what you think it does, i.e. look for lines with $2 equal to (or even containing) POP42. It's actually shorthand for
$2==($0 ~ /POP42/ ? 1 : 0)
courtesy of the regexp delimiters /.../ you used. What THAT does is see if a string matching the regexp POP42 occurs anywhere on the current line and, if it does, test whether $2 has the value 1; otherwise it tests whether $2 has the value 0.
It's important to know the difference between string (") and regexp (/) delimiters and string (e.g. ==) and regexp (e.g. ~) comparison operators when using awk.
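To show how that one-liner slots into a per-file loop like the one in the question, here is a small sketch. The pop42_flag name and the sample_N.ndx files are made up for the demo; the missing-file branch mirrors the -s test from the original script.

```shell
#!/usr/bin/env bash
# pop42_flag is a hypothetical helper: it prints 5 once if column 2
# equals POP42 on any line of the file, otherwise 0.
pop42_flag() {
    if [ -s "$1" ]; then
        awk '$2 == "POP42" { found = 5; exit } END { print found + 0 }' "$1"
    else
        echo 0      # missing or empty file
    fi
}

workdir=$(mktemp -d)
printf '162 POP87\n1851 POP42\n'  > "$workdir/sample_1.ndx"
printf '3075 POP42\n2911 POP42\n' > "$workdir/sample_2.ndx"
printf '10 POP87\n'               > "$workdir/sample_3.ndx"
# sample_4.ndx is deliberately missing

flags=$(for index in 1 2 3 4; do
    pop42_flag "$workdir/sample_$index.ndx"
done)
rm -rf "$workdir"
echo "$flags"
```

Each file contributes exactly one line to the output, no matter how many POP42 rows it holds.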
appending text to specific line in file bash
So I have a file that contains some lines of text separated by ','. I want to create a script that counts how many parts a line has and, if the line contains 16 parts, adds a new one. So far it's working great. The only thing that is not working is appending the ',' at the end. See my example below.
Original file:
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
Expected result:
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
This is my code:
while read p; do
    if [[ $p == "HEA"* ]]
    then
        IFS=',' read -ra ADDR <<< "$p"
        echo ${#ADDR[@]}
        arrayCount=${#ADDR[@]}
        if [ "${arrayCount}" -eq 16 ]; then
            sed -i "/$p/ s/\$/,xx/g" $f
        fi
    fi
done <$f
Result:
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a ,xx
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a ,xx
What am I doing wrong? I'm sure it's something small but I can't find it.
It can be done using awk:
awk -F, 'NF==16{$0 = $0 FS "xx"} 1' file
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a
b,b,b,b,b,b
a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,a,xx
-F, sets the input field separator to a comma
NF==16 is the condition: the block inside { and } runs only if the number of fields is 16
$0 = $0 FS "xx" appends ,xx to the end of the line
1 is an always-true pattern with no action, so awk applies its default action and prints the line
To do this with sed:
Use the ${line_number}s/..../..../ form. To target a specific line, you need to find out its line number first.
Use the special character & to refer to the matched string.
The sed statement should look like the following:
sed -i "${line_number}s/.*/&,xx/" "$f"
I would prefer to leave it to you to play around with it, but if you like I can give you a full working sample.
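A minimal working sample along those lines might look like the following. It is a sketch under a few assumptions: the HEA prefix filter from the question is left out, the sample rows are rebuilt from the question's data, and the edit goes through a temporary file instead of -i so it does not depend on GNU sed.

```shell
#!/usr/bin/env bash
four='a,a,a,a'
row16="$four,$four,$four,$four"     # 16 comma-separated parts
row17="$row16,a"                    # 17 parts, must stay unchanged

f=$(mktemp)
printf '%s\n' "$row16" 'b,b,b,b,b,b' "$row17" 'b,b,b,b,b,b' "$row16" > "$f"

line_number=0
while IFS= read -r p; do
    line_number=$((line_number + 1))
    IFS=',' read -r -a parts <<< "$p"
    if [ "${#parts[@]}" -eq 16 ]; then
        # Address the line by number, then append to whatever it holds (&)
        sed "${line_number}s/.*/&,xx/" "$f" > "$f.new" && mv "$f.new" "$f"
    fi
done < "$f"

result=$(cat "$f")
rm -f "$f"
echo "$result"
```

Addressing by line number avoids using the line's own text as a regex, which is where the /$p/ version can go wrong.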
how to validate if data has a trailing "/"
I have a file containing various information. The fields are delimited by |. One of the fields contains a directory. For example : blah|blah|blah|/usr/local/etc/|blah|blah I need to validate that the path field does not end with a "/". I'm using ksh. Any suggestions? thanks.
Assuming the directory is always in the 4th field:
line=0
while IFS='|' read -rA fields; do
    let line++
    [[ ${fields[3]} == */ ]] && echo line $line: ends with a slash
done < filename
Not ksh, but this is a natural job for awk:
awk -F\| '$4 ~ /\/$/ { print "Trailing slash in line "NR":", $4 }' ${file:?}
Try something like this:
if [[ $line =~ (/[[:alnum:]_]+)+(\||$) ]]
My shell syntax is rusty, so this might need a little massaging into shape.
Don't forget special paths like / (root). The code below leaves a bare / (root) untouched:
echo "blah|blah|blah|/usr/local/etc/|blah|blah|
blah|blah|blah|/|blah|blah
blah|blah|blah|.|blah|blah
blah|blah|blah|/usr/local/etc|blah|blah" | sed "
/\/|/ {
/|\/|/ !s/\/|/|/
}"
Explanation:
/\/|/ selects lines where "/|" appears
/|\/|/ ! then keeps only the lines where "|/|" does not appear
s/\/|/|/ replaces "/|" with "|" (only when both tests succeed)
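If the goal is only to validate rather than rewrite, the root special case can be folded into an awk check. This is a sketch that assumes the path is always the 4th field; the sample lines are modelled on the ones above.

```shell
#!/usr/bin/env bash
# Flag field 4 when it ends in "/" but is not the root directory itself.
report=$(printf '%s\n' \
    'blah|blah|blah|/usr/local/etc/|blah|blah' \
    'blah|blah|blah|/|blah|blah' \
    'blah|blah|blah|.|blah|blah' \
    'blah|blah|blah|/usr/local/etc|blah|blah' |
  awk -F'|' '$4 != "/" && $4 ~ /\/$/ { print "line " NR ": " $4 }')
echo "$report"
```

Only the first sample line is reported; the bare / on the second line is accepted.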
Setting a BASH environment variable directly in AWK (in an AWK one-liner)
I have a file that has two columns of floating point values. I also have a C program that takes a floating point value as input and returns another floating point value as output. What I'd like to do is the following: for each row in the original, execute the C program with the value in the first column as input, and then print out the first column (unchanged) followed by the second column minus the result of the C program.
As an example, suppose c_program returns the square of the input and behaves like this:
$ c_program 4
16
$
and suppose data_file looks like this:
1 10
2 11
3 12
4 13
What I'd like to return as output, in this case, is
1 9
2 7
3 3
4 -3
To write this in really sketchy pseudocode, I want to do something like this:
awk '{print $1, $2 - `c_program $1`}' data_file
But of course, I can't just pass $1, the awk variable, into a call to c_program. What's the right way to do this, and preferably, how could I do it while still maintaining the "awk one-liner"? (I don't want to pull out a sledgehammer and write a full-fledged C program to do this.)
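You can just do everything in awk (close(cmd) releases the pipe after each row, so a long file does not run out of open file descriptors):
awk '{cmd="c_program "$1; cmd|getline l; close(cmd); print $1,$2-l}' file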
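Here is the cmd | getline pattern as a runnable sketch. Since the real c_program is not available, a stand-in script that squares its argument is created in a temp directory (an assumption taken from the question's example), and the command prefix is passed in through a variable named prog, which is also just a name chosen for the demo.

```shell
#!/usr/bin/env bash
# Stand-in for the real program: prints the square of its argument.
bindir=$(mktemp -d)
cat > "$bindir/c_program" <<'EOF'
awk -v x="$1" 'BEGIN { print x * x }'
EOF

output=$(printf '1 10\n2 11\n3 12\n4 13\n' |
  awk -v prog="sh $bindir/c_program" '{
      cmd = prog " " $1
      if ((cmd | getline result) > 0)   # read one line of the command output
          print $1, $2 - result
      close(cmd)                        # release the pipe before the next row
  }')
rm -rf "$bindir"
echo "$output"
```

With the real binary on PATH, prog would simply be "c_program".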
This shows how to execute a command in awk:
ls | awk '/^a/ {system("ls -ld " $1)}'
You could use a bash script instead:
while read line
do
    FIRST=`echo $line | cut -d' ' -f1`
    SECOND=`echo $line | cut -d' ' -f2`
    OUT=`expr $SECOND \* 4`
    echo $FIRST $OUT `expr $OUT - $SECOND`
done
The shell is a better tool for this, using a little-used feature. There is a shell variable IFS, the Input Field Separator, that sh uses to split command lines when parsing; it defaults to <Space><Tab><Newline>, which is why ls foo is interpreted as two words. When set is given arguments not beginning with -, it sets the positional parameters of the shell to the contents of the arguments as split via IFS, thus:
#!/bin/sh
while read line ; do
    set $line
    subtrahend=`c_program $1`
    echo $1 `expr $2 - $subtrahend`
done < data_file
Pure Bash, without using any external executables other than your program (note that bash arithmetic is integer-only):
#!/bin/bash
while read num1 num2
do
    (( result = num2 - $(c_program "$num1") ))
    echo "$num1 $result"
done < data_file
As others have pointed out, awk is not well equipped for this job. Here is a suggestion in bash:
#!/bin/bash
data_file=$1
while read column_1 column_2 the_rest
do
    ((result=$column_2-$(c_program $column_1)))
    echo $column_1 $result "$the_rest"
done < $data_file
Save this to a file, say myscript.sh, then invoke it as:
bash myscript.sh data_file
The read command reads each line from the data file (which was redirected to standard input) and assigns the first two columns to the $column_1 and $column_2 variables. The rest of the line, if there is any, is stored in $the_rest. Next, I calculate the result and print the line as you required. Note that I surround $the_rest with quotes to preserve spacing. Failure to do so will cause multiple spaces in the input file to be squeezed into one.
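Since the question mentions floating point values and (( )) only handles integers, one option is to keep the read loop but hand the subtraction to awk. A sketch, assuming a stand-in c_program shell function that squares its input and two invented sample rows:

```shell
#!/usr/bin/env bash
# Stand-in for the real c_program (assumption: prints the square of its argument).
c_program() { awk -v x="$1" 'BEGIN { print x * x }'; }

output=$(printf '1.5 10\n2 11.25\n' | while read -r first second; do
    result=$(c_program "$first")
    # awk performs the subtraction, so fractional values are fine
    diff=$(awk -v a="$second" -v b="$result" 'BEGIN { print a - b }')
    printf '%s %s\n' "$first" "$diff"
done)
echo "$output"
```

This costs one extra awk process per row, which is usually acceptable for small files.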