Discrete to continuous number ranges via awk - bash

Assume a text file, file, which contains multiple discrete number ranges, one per line. Each range is preceded by a string (the range name). The lower and upper bounds of each range are separated by a dash, and each range is terminated by a semicolon. The ranges are sorted (e.g., range 101-297 comes before 1299-1301) and do not overlap.
$ cat file
foo 101-297;
bar 1299-1301;
baz 1314-5266;
Please note that in the example above the three ranges do not form a continuous range that starts at integer 1.
I believe that awk is the appropriate tool to fill the missing number ranges such that all ranges taken together form a continuous range from {1} to {upper bound of the last range}. If so, what awk command/function would you use to perform the task?
$ cat file | sought_awk_command
new1 1-100;
foo 101-297;
new2 298-1298;
bar 1299-1301;
new3 1302-1313;
baz 1314-5266;
--
Edit 1: Upon closer evaluation, the code suggested below fails at another simple example.
$ cat example2
foo 101-297;
bar 1299-1301;
baz 1302-1314; # Notice that ranges "bar" and "baz" are continuous to one another
qux 1399-5266;
$ awk -F'[ -]' '$3-Q>1{print "new"++o,Q+1"-"$3-1";";Q=$4} 1' example2
new1 1-100;
foo 101-297;
new2 298-1298;
bar 1299-1301;
baz 1302-1314;
new3 1302-1398; # ERROR HERE: Notice that range "new3" starts just past the upper bound of "bar", not that of "baz".
qux 1399-5266;
--
Edit 2: Many thanks to RavinderSingh13 for assistance with solving this question. However, the suggested code still generates output inconsistent with the given objective.
$ cat example3
foo 35025-35144;
bar 35259-35375;
baz 35376-35624;
qux 37911-39434;
$ awk -F'[ -]' '$3-Q+0>=1{print "new"++o,Q+1"-"$3-1";";Q=$4} {Q=$4;print}' example3
new1 1-35024;
foo 35025-35144;
new2 35145-35258;
bar 35259-35375;
new3 35376-35375; # ERROR HERE: Notice that range "new3" has been added, even though ranges "bar" and "baz" are contiguous.
baz 35376-35624;
new4 35625-37910;
qux 37911-39434;

Try:
awk -F'[ -]' '$3-Q>1{print "new"++o,Q+1"-"$3-1";";Q=$4} 1' Input_file
EDIT: Adding a non-one-liner solution for the same, now with a proper explanation.
awk -F'[ -]' ' ###Setting field separator as space, dash here.
$3-Q>1{ ###Check whether the difference between the 3rd field and variable Q is greater than 1; if so, perform the following.
print "new"++o,Q+1"-"$3-1";"; ###Print the string "new" with an incrementing counter o, then Q+1, a dash, the current line's $3 minus 1, and a semicolon.
Q=$4 ###Assign the 4th field of the current line to variable Q.
}
1 ###printing the current line here.
' Input_file ###Mentioning the Input_file here too.
EDIT2: Adding one more answer as per the OP's additional condition.
awk -F'[ -]' '$3-Q+0>=1{print "new"++o,Q+1"-"$3-1";";Q=$4} {Q=$4;print}' Input_file

This has no problem with ranges that can overlap, as you showed in your original example2, where bar 1299-1301; and baz 1301-1314; overlapped at 1301.
$ cat tst.awk
{ split($2,curr,/[-;]/); currStart=curr[1]; currEnd=curr[2] }
currStart > (prevEnd+1) { print "new"++cnt, prevEnd+1 "-" currStart-1 ";" }
{ print; prevEnd=currEnd }
$ awk -f tst.awk file
new1 1-100;
foo 101-297;
new2 298-1298;
bar 1299-1301;
new3 1302-1313;
baz 1314-5266;
$ awk -f tst.awk example2
new1 1-100;
foo 101-297;
new2 298-1298;
bar 1299-1301;
baz 1301-1314;
new3 1315-1398;
qux 1399-5266;
$ awk -f tst.awk example3
new1 1-35024;
foo 35025-35144;
new2 35145-35258;
bar 35259-35375;
baz 35376-35624;
new3 35625-37910;
qux 37911-39434;
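For reference, the tst.awk logic can be run self-contained, with the sample ranges fed through a here-document instead of an external file (a sketch; any POSIX awk should do):

```shell
# Gap-filling: for each "name lo-hi;" line, emit a "new" range whenever
# the current start is more than 1 past the previous end, then print the line.
awk '{
  split($2, curr, /[-;]/)                 # curr[1] = start, curr[2] = end
  if (curr[1] > prevEnd + 1)              # gap before this range?
    print "new" ++cnt, prevEnd+1 "-" curr[1]-1 ";"
  print
  prevEnd = curr[2]
}' <<'EOF'
foo 101-297;
bar 1299-1301;
baz 1314-5266;
EOF
```

On the first record prevEnd is unset, so it evaluates to 0 and the leading `new1 1-100;` range is produced automatically.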

$ cat file1
foo 2-100
bar 102-200
$ awk -F' +|[-;]' 'p+1<$2{print "new" ++q, p+1 "-" $2-1 ";"}p=$3' file1
new1 1-1;
foo 2-100
new2 101-101;
bar 102-200
$ cat file2
foo 101-297;
bar 1299-1301;
baz 1314-5266;
$ awk -F' +|[-;]' 'p+1<$2{print "new" ++q, p+1 "-" $2-1 ";"}p=$3' file2
new1 1-100;
foo 101-297;
new2 298-1298;
bar 1299-1301;
new3 1302-1313;
baz 1314-5266;
Explained:
$ awk -F' +|[-;]' ' # FS is ; - or a bunch of spaces
p+1 < $2 { # if previous $3 plus 1 is still less than the new $2
print "new"++q,p+1 "-" $2-1 ";" # print a "new" line
}
p=$3 # set future p and implicit print of record *
' file2 # * as all values are above 0
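As a quick check, the p+1 < $2 variant can be run against the example3 data, where bar and baz are contiguous; no spurious "new" range should be emitted between them:

```shell
# Same one-liner, with the example3 ranges inlined via a here-document.
# p holds the previous upper bound; p=$3 as a pattern also triggers the
# implicit print, since all upper bounds here are non-zero.
awk -F' +|[-;]' 'p+1<$2{print "new" ++q, p+1 "-" $2-1 ";"}p=$3' <<'EOF'
foo 35025-35144;
bar 35259-35375;
baz 35376-35624;
qux 37911-39434;
EOF
```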


Conditions in AWK

I'm filtering some data with awk (version 20070501, on MacOS) but ran into a syntax issue when applying multiple negative match conditions to values in a specific column.
Here's a generic example that I think captures my issue.
Input:
foo,bar
bar,foo
foo,bar
bar,foo
With this code I remove matches for foo in column 2:
awk 'BEGIN { FS=OFS="," } ; { if ($2 !~ /foo/ ) print $0}'
I get this output, which I expected:
foo,bar
foo,bar
Next, I add an additional condition to the if statement, to also remove all values matching bar in column 2:
awk 'BEGIN { FS=OFS="," } ; { if ($2 !~ /foo/ || $2 !~ /bar/) print $0}'
I get this output, which I did not expect:
foo,bar
bar,foo
foo,bar
bar,foo
I expected no rows to be returned, which was my aim. So what's going on?
Are the two conditions cancelling each other out? I read the GNU awk documentation for boolean expressions, which states:
The ‘&&’ and ‘||’ operators are called short-circuit operators because of the way they work. Evaluation of the full expression is “short-circuited” if the result can be determined partway through its evaluation.
From this snippet, I wasn't sure how to make progress. Or is the issue that the syntax isn't correct? Or both?
Update:
After comments and help from @wiktor-stribiżew, here's a better representation of the problem:
1 2 3 4 5
foo bar foo bar FY 2008 Program Totals
foo bar foo bar FY 2009 Program Totals
foo bar foo bar Fiscal Year 2010 Program Totals
foo bar foo bar Fiscal Year 2011 Program Totals
foo bar foo bar Fiscal Year 2012 Program Totals
foo bar foo bar Fiscal Year 2013 Program Totals
foo bar foo bar Fiscal Year 2014 Program Totals
foo bar foo bar Fiscal Year 2015 Program Totals
foo bar foo bar Fiscal Year 2016 Program Totals
foo bar foo bar Fiscal Year 2017 Program Totals
My failing code would be:
awk 'BEGIN { FS=OFS="\t" } ; { if ($5 !~ /Fiscal.*Program Totals/ || $5 !~ /FY.*Program Totals/) print $0}'
The accepted answer below resolves this.
You want to filter out lines where field 2 matches either foo or bar, so you want that field to match neither foo nor bar. Thus, you need the && operator:
awk -F',' '$2 !~ /foo/ && $2 !~ /bar/' file > newfile
# ^^
Note you may also use || if you group the conditions and negate the result:
awk -F\, '!($2 ~ /foo/ || $2 ~ /bar/)' file > newfile
Note you need not set OFS because you only print $0 (whole lines), and since printing is the default action, you need not spell it out when you write the condition as shown above.
All you need is:
awk '$2 !~ /foo|bar/' file
Given your real failing code:
awk 'BEGIN { FS=OFS="\t" } ; { if ($5 !~ /Fiscal.*Program Totals/ || $5 !~ /FY.*Program Totals/) print $0}'
and assuming your fields really are tab-separated as your code implies, you'd write that as just:
awk -F'\t' '$5 !~ /F(iscal|Y).*Program Totals/'
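A minimal sketch of the De Morgan point, using the question's sample data: since $2 can never match /foo/ and /bar/ simultaneously, at least one of the two negations is always true and || lets every line through, while && expresses the intended filter:

```shell
# With '||', every line passes: each $2 fails at least one of the two matches.
printf 'foo,bar\nbar,foo\n' | awk -F',' '$2 !~ /foo/ || $2 !~ /bar/'
# With '&&', a line passes only if $2 matches neither pattern: nothing prints.
printf 'foo,bar\nbar,foo\n' | awk -F',' '$2 !~ /foo/ && $2 !~ /bar/'
```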

Find a line with a certain string, then remove its trailing newline character in bash

I'm trying to prepare a file for a report. What I have is like this:
foo
bar bar oof
bar oof
foo
bar bar
I'm trying to get an output like this:
foo bar bar oof
bar oof
foo bar bar
I wanted to search for a string, in this case 'foo', and within the line where the string is found I have to remove the newline.
I did search, but I can only find solutions where 'foo' is also removed. How can I do this?
Using awk:
awk -v search='foo' '$0 ~ search{printf $0; next}1' infile
You may use printf $0 OFS as below, if your line doesn't already have a trailing space before the newline character:
awk -v search='foo' '$0 ~ search{printf $0 OFS; next}1' infile
Test Results:
$ cat infile
foo
bar bar oof
bar oof
foo
bar bar
$ awk -v search='foo' '$0 ~ search{printf $0; next}1' infile
foo bar bar oof
bar oof
foo bar bar
Explanation:
-v search='foo' - set variable search
$0 ~ search - if the line/record/row matches the regexp/pattern/string held in the variable search
{printf $0; next} - print the current record without the record separator and go to the next line
1 - the 1 at the end performs the default action, i.e. prints the current record/row.
You can do this quite easily with sed, for example:
$ sed '/^foo$/N;s/\n/ /' file
foo bar bar oof
bar oof
foo bar bar
Explanation
/^foo$/ find lines containing only foo
N read/append next line of input into pattern space.
s/\n/ / substitute the '\n' with a space.
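The sed one-liner can be exercised as-is, with the sample input supplied via a here-document:

```shell
# On each exact 'foo' line, pull in the next line (N) and replace the
# embedded newline with a space, joining the two lines.
sed '/^foo$/N;s/\n/ /' <<'EOF'
foo
bar bar oof
bar oof
foo
bar bar
EOF
```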

sed : delete all lines NOT containing string A or B, starting at 2nd line

Please help - I'm stuck.
I need to
delete all lines of a textfile
which does NOT contain foo or Foo
starting from the 2nd line of the file
infile:
first line
foobar
tree
fish
Foo Bar
Football
Foobar
Street
foo bar
outfile:
first line
foobar
Foo Bar
Football
Foobar
foo bar
I tried the following:
sed '2,$/*.foo.*\|.*Foo.*/!d' -i test.txt
The resulting error is:
sed: -e expression #1, char 4: unknown command: `/'
What's my mistake?
(awk would be a possible alternative, too.)
sed approach:
sed -e '2,${/[fF]oo/!d}' file
-e script (--expression=script)
Add the commands in script to the set of commands to be run while
processing the input.
2,$ - an address range, considers the lines from the second line to the end
[fF] - character class, matches either f or F
/[fF]oo/!d - deletes lines which don't contain foo or Foo
awk 'NR==1 || !/foo|Foo/' oldfile > newfile
Note: If you use csh or tcsh, you need to protect the ! with backslash:
awk 'NR==1 || \!/foo|Foo/' oldfile > newfile
A good option is to invert your requirements - delete all lines, but print first line and all lines with foo or Foo.
sed -n '1p; /foo\|Foo/p'
This one will print the first line twice if it contains foo or Foo, but that can be easily fixed if needed.
An alternative without sed or awk (I don't know if they are mandatory in your case): you can copy the first line of your text file into a new file:
head -1 textfile > newfile
Then you can append all the content with foo or Foo to the new file:
grep "foo\|Foo" textfile >> newfile
So you will have the desired content in newfile.
If you want to have it in your original file then you can move it:
mv newfile textfile
If the first line contains foo or Foo it will be printed twice, but as you stated you wanted to keep the first line, I assume that it contains neither foo nor Foo.
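The three steps can be run end-to-end as a sketch in a scratch directory (file names follow the steps above; the \| alternation assumes GNU grep, use grep -E 'foo|Foo' elsewhere):

```shell
cd "$(mktemp -d)"                     # work in a throwaway directory
cat > textfile <<'EOF'
first line
foobar
tree
fish
Foo Bar
Football
Foobar
Street
foo bar
EOF
head -1 textfile > newfile            # keep the first line unconditionally
grep 'foo\|Foo' textfile >> newfile   # append every line containing foo/Foo
cat newfile
```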
In awk. First some test data:
$ cat file
1 begin
2 asd
3 foo
4 sdf
5 Foo
6 end
The code: print the first record, and any record that contains foo or Foo:
$ awk 'NR==1 || /[fF]oo/' file
1 begin
3 foo
5 Foo
another awk in the forest
awk '!a++ || /[fF]oo/' infile
Here is your code:
sed '2,$/*.foo.*\|.*Foo.*/!d'
The issue here is that you are using two forms of line-addressing, namely the numeric (addr1,addr2) and matching (/REGEX/). In addition, there are errors in your regular expression.
Here is how I would solve it:
sed '1b; /[fF]oo/!d' infile
Output:
first line
foobar
Foo Bar
Football
Foobar
foo bar
awk '/^[fF]/&&!/fish/' file
first line
foobar
Foo Bar
Football
Foobar
foo bar

Output only parts from a logfile (a function name and a param value of it)

I didn't find a question covering this specific case. I have a logfile like this:
"foo function1 para1=abc para2=def para3=ghi bar
foo function2 para1=jkl para2=mno para3=pqr bar"
Now I want to execute a one-liner in GNU bash with this output:
function1 def
function2 mno
foo indicates the start of the block with the function name and bar marks its end. So I want to search for the word "foo", extract the next word (the function name), then search for para2 and extract only its value.
How can I do this with a one-liner (not a script)?
If this isn't all you need:
$ awk -F'[ =]' '{print $2, $6}' file
function1 def
function2 mno
then edit your question to clarify your requirements and provide more meaningful and truly representative sample input/output.
@Simi: Try:
awk -F'[ ="]' '{for(i=1;i<=NF;i++){if($i=="foo"){printf("%s",$(i+1))};if($i=="para2"){printf(" %s\n",$(i+1))}}}' Input_file
Here I set the field separator to space, =, or a double quote ("); then I traverse all the fields of each line, searching for the strings foo and para2; whenever a field matches one of them, I print the next field's value, as per your requirement. Let me know if this helps you.
Perl One Liner
perl -lane 'print "$F[1] ",(split(/=/,$F[3]))[1]' logfile
Input
"foo function1 para1=abc para2=def para3=ghi bar
foo function2 para1=jkl para2=mno para3=pqr bar"
Output
function1 def
function2 mno
foo indicates the start for the function name and bar is the sign for
the end of this block. So i want to search for the word "foo", extract
the next word (the function name) and then search for the param2 and
extract only the value.
Under some assumptions: if your log file looks like below, then
Either
$ cat mylog
foo function1 para1=abc para2=def para3=ghi bar foo function2 para1=jkl para2=mno para3=pqr bar
OR with line-break
$ cat mylog
foo function1 para1=abc para2=def para3=ghi bar
foo function2 para1=jkl para2=mno para3=pqr bar
Output
$ awk -F'[ =]' '{for(i=1;i<=NF;i++)if($i=="foo")print $(i+1),$(i+5) }' mylog
function1 def
function2 mno
If in case your para2 is not in order you may use this
$ awk -F'[ =]' '{f=""; for(i=1;i<=NF;i++){if($i=="foo")f=$(i+1); if(f && $i=="para2")print f,$(i+1)}}' mylog
This is how awk sees the fields of a record with -F'[ =]':
foo function1 para1=abc para2=def para3=ghi bar
$1=foo  $2=function1  $3=para1  $4=abc  $5=para2  $6=def  $7=para3  $8=ghi  $9=bar
So when $i equals "foo", $(i+1) is the function name and $(i+5) is the value of para2.
Explanation
awk -F'[ =]' ' # Set field separator space and =
{
# Loop through no of fields in record,
# NF gives no of fields in current record
for(i=1;i<=NF;i++)
# If field equal to foo then
if($i=="foo")
# print next(i+1) and 5th(i+5) field from current field index
print $(i+1),$(i+5)
}
' mylog # input file
if your real input is something else, please post input and expected output
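A sketch of the order-independent variant, run on a hypothetical log where para2 precedes para1 on the second line:

```shell
# Remember the function name whenever 'foo' is seen; print it together
# with the para2 value whenever a para2 field appears, in any order.
awk -F'[ ="]' '{
  f = ""
  for (i = 1; i <= NF; i++) {
    if ($i == "foo")        f = $(i+1)        # word after foo = function name
    if (f && $i == "para2") print f, $(i+1)   # word after para2 = its value
  }
}' <<'EOF'
foo function1 para1=abc para2=def para3=ghi bar
foo function2 para2=mno para1=jkl para3=pqr bar
EOF
```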

Bash/Shell: analyse tab-separated CSV for lines with data in n-th column

I have a tab-separated CSV, too big to download and open locally.
I want to show any lines with data in the n-th column, that is, those lines with anything other than a tab right before the n-th tab of that line.
I'd post what I've tried so far, but my sed knowledge is merely enough to assume that it can be done with sed.
edit1:
sample
id num name title
1 1 foo foo
2 2 bar
3 3 baz baz
If n=3 (name), then I want to output the rows 1+3.
If n=4 (title), then I want to output all the lines.
edit 2:
I found this possible solution:
awk -F '","' 'BEGIN {OFS=","} { if (toupper($5) == "STRING 1") print }' file1.csv > file2.csv
source: https://unix.stackexchange.com/questions/97070/filter-a-csv-file-based-on-the-5th-column-values-of-a-file-and-print-those-reco
But trying
awk -F '"\t"' 'BEGIN {OFS="\t"} { if (toupper($72) != "") print }' data.csv > data-tmp.csv
did not work (the result file was empty), so I probably got the \t wrong? (copy & paste without understanding awk)
I'm not exactly sure I understand your desired behaviour. Is this it?
$ cat file
id num name title
1 1 foo foo
2 2 bar
3 3 baz baz
$ awk -v n=3 -F$'\t' 'NR>1&&$n!=""' file
1 1 foo foo
3 3 baz baz
$ awk -v n=4 -F$'\t' 'NR>1&&$n!=""' file
1 1 foo foo
2 2 bar
3 3 baz baz
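The same filter can be reproduced inline; this assumes row 2 of the sample has an empty name column (two consecutive tabs), which is why n=3 skips it while n=4 would keep it:

```shell
# Print data rows (skipping the header) whose n-th tab-separated column
# is non-empty; n=3 selects the 'name' column here.
printf 'id\tnum\tname\ttitle\n1\t1\tfoo\tfoo\n2\t2\t\tbar\n3\t3\tbaz\tbaz\n' |
awk -v n=3 -F'\t' 'NR>1 && $n != ""'
```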
I'll assume you have enough space on the remote machine:
1) use cut to get the desired column N (delimiter is tab by standard)
cut -f N data.csv > tempfile
2) get line numbers only of non-empty lines
grep -vn '^$' tempfile | sed 's/:.*//' > linesfile
3) use sed to extract lines
while read linenumber ; do
sed -n "${linenumber}p" data.csv >> newdatafile
done < linesfile
Unfortunately the line numbers cannot be extracted by piping the cut output straight into grep, but I am pretty sure there are more elegant solutions.
