I have a big problem filtering an error output file.
The log file:
Important some words flags
Line 1
Line 2
...
Line N
Important some words
Line 1
Line 2
...
Line N
Important some words
Line 1
Line 2
...
Line N
Important some words flags
Line 1
Line 2
...
Line N
So some sections have the word "flags" and others do not.
Desired output file is:
Important some words flags
Line 1
Line 2
...
Line N
Important some words flags
Line 1
Line 2
...
Line N
That is, only the sections whose header line starts with "Important" and ends with "flags".
All sections have a random number of lines.
So I can't use something like this:
grep -B1 -P '!^Important*flags' logfile
Because I don't know how many lines will be after/before that line...
There are more succinct ways to handle it, but this is fairly clear:
awk '/^Important.*flags$/ { p = 1; print; next }
/^Important/ { p = 0; next }
     { if (p) print }' logfile
If the line is important and flagged, set p to 1, print the line, and skip to the next.
Else, if the line is important (but not flagged), set p to 0 and skip to the next.
Otherwise, it is an 'unimportant' line; print it if p is non-zero (which means that the last important line was flagged).
Any lines before the first Important line will find p is 0 anyway, so they won't be printed.
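For reference, one of those more succinct forms folds the two Important rules into a single assignment (a sketch equivalent to the script above, using the same logfile):
awk '/^Important/ { p = /flags$/ } p' logfile
Every Important line resets p to 1 or 0 depending on whether it ends in flags, and the bare p pattern then prints the header and its following lines only while p is set.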
A Perl alternative slurps the whole input and extracts each flagged section in one go:
perl -n0E 'say /(Important\N*flags.*?)(?=Important|$)/sg'
I am trying to figure out how to write a bash script which uses the lines immediately before and after a line as a condition. I will give an example in a python-like pseudocode which makes sense to me.
Basically:
for line in FILE:
    if line_minus_1 == line_plus_one:
        line = line_minus_1
What would be the best way to do this?
So if I have an input file that reads:
3
1
1
1
2
2
1
2
1
1
1
2
2
1
2
my output would be:
3
1
1
1
2
2
2
2
1
1
1
2
2
2
2
Notice that it goes from the first line to the last and respects changes already made to earlier lines, so if I have:
2
1
2
1
2
2
I would get:
2
2
2
2
2
2
and not:
2
1
1
1
2
2
$ awk 'minus2==$0{minus1=$0} NR>1{print minus1} {minus2=minus1; minus1=$0} END{print minus1}' file
3
1
1
1
2
2
2
2
1
1
1
2
2
2
2
How it works
minus2==$0{minus1=$0}
If the line from 2 lines ago is the same as the current line, then set the line from 1 line ago equal to the current line.
NR>1{print minus1}
If we are past the first line, then print the line from 1 line ago.
minus2=minus1; minus1=$0
Update the variables.
END{print minus1}
After we have finished reading the file, print the last line.
Multiple line version
For those who like their code spread over multiple lines:
awk '
minus2==$0{
    minus1=$0
}
NR>1{
    print minus1
}
{
    minus2=minus1
    minus1=$0
}
END{
    print minus1
}
' file
Here is a (GNU) sed solution:
$ sed -r '1N;N;/^(.*)\n.*\n\1$/s/^(.*\n).*\n/\1\1/;P;D' infile
3
1
1
1
2
2
2
2
1
1
1
2
2
2
2
This works with a moving three line window. A bit more readable:
sed -r ' # -r for extended regular expressions: () instead of \(\)
1N # On first line, append second line to pattern space
N # On all lines, append third line to pattern space
/^(.*)\n.*\n\1$/s/^(.*\n).*\n/\1\1/ # See below
P # Print first line of pattern space
D # Delete first line of pattern space
' infile
N;P;D is the idiomatic way to get a moving two line window: append a line, print first line, delete first line of pattern space. To get a moving three line window, we read an additional line, but only once, namely when processing the first line (1N).
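As an aside, the same two-line window is what the classic "emulate uniq" sed one-liner is built on, squeezing runs of duplicate consecutive lines down to one:
sed -r '$!N;/^(.*)\n\1$/!P;D' file
There, P is suppressed whenever the two lines in the window are identical, so only one copy of each run survives.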
The complicated bit is checking if the first and third line of the pattern space are identical, and if they are, replacing the second line with the first line. To check if we have to make the substitution, we use the address
/^(.*)\n.*\n\1$/
The anchors ^ and $ are not really required as we'll always have exactly two newlines in the pattern space, but it makes it clearer that we want to match the complete pattern space. We put the first line into a capture group and see if it is repeated on the third line by using a backreference.
Then, if this is the case, we perform the substitution
s/^(.*\n).*\n/\1\1/
This captures the first line including the newline, matches the second line including the newline, and substitutes with twice the first line. P and D then print and remove the first line.
When reaching the end, the whole pattern space is printed so we're not swallowing any lines.
This also works with the second input example:
$ sed -r '1N;N;/^(.*)\n.*\n\1$/s/^(.*\n).*\n/\1\1/;P;D' infile2
2
2
2
2
2
2
To use this with BSD sed (as found in OS X), you'd either have to use the -E option instead of -r, or use no option at all, i.e., basic regular expressions, and escape all parentheses (\(\)) in the capture groups. The newline matching should work, but I didn't test it. If in doubt, check this great answer laying out all the differences.
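Spelled with -E, the command would be (untested on BSD sed, as noted):
$ sed -E '1N;N;/^(.*)\n.*\n\1$/s/^(.*\n).*\n/\1\1/;P;D' infile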
I have two files:
File with strings (new line terminated)
File with integers (one per line)
I would like to print the lines from the first file indexed by the lines in the second file. My current solution is to do this
while read index
do
sed -n ${index}p $file1
done < $file2
It essentially reads the index file line by line and runs sed to print that specific line. The problem is that it is slow for large index files (thousands or tens of thousands of lines).
Is it possible to do this faster? I suspect awk can be useful here.
I searched SO as best I could, but could only find people trying to print line ranges instead of indexing by a second file.
UPDATE
The index is generally not shuffled. It is expected for the lines to appear in the order defined by indices in the index file.
EXAMPLE
File 1:
this is line 1
this is line 2
this is line 3
this is line 4
File 2:
3
2
The expected output is:
this is line 3
this is line 2
If I understand you correctly, then
awk 'NR == FNR { selected[$1] = 1; next } selected[FNR]' indexfile datafile
should work, under the assumption that the index is sorted in ascending order or you want lines to be printed in their order in the data file regardless of the way the index is ordered. This works as follows:
NR == FNR {            # while processing the first file
    selected[$1] = 1   # remember if an index was seen
    next               # and do nothing else
}
selected[FNR] # after that, select (print) the selected lines.
If the index is not sorted and the lines should be printed in the order in which they appear in the index:
NR == FNR {                # processing the index:
    ++counter
    idx[$0] = counter      # remember that and at which position you saw
    next                   # the index
}
FNR in idx {               # when processing the data file:
    lines[idx[FNR]] = $0   # remember selected lines by the position of
}                          # the index
END {                      # and at the end: print them in that order.
    for(i = 1; i <= counter; ++i) {
        print lines[i]
    }
}
This can be inlined as well (with semicolons after ++counter and idx[$0] = counter), but I'd probably put it in a file, say foo.awk, and run awk -f foo.awk indexfile datafile. With an index file
1
4
3
and a data file
line1
line2
line3
line4
this will print
line1
line4
line3
The remaining caveat is that this assumes that the entries in the index are unique. If that, too, is a problem, you'll have to remember a list of index positions, split it while scanning the data file and remember the lines for each position. That is:
NR == FNR {
    ++counter
    idx[$0] = idx[$0] " " counter   # remember a list here
    next
}
FNR in idx {
    split(idx[FNR], pos)            # split that list
    for(p in pos) {
        lines[pos[p]] = $0          # and remember the line for
                                    # all positions in them.
    }
}
END {
    for(i = 1; i <= counter; ++i) {
        print lines[i]
    }
}
This, finally, is the functional equivalent of the code in the question. How complicated you have to go for your use case is something you'll have to decide.
This awk script does what you want:
$ cat lines
1
3
5
$ cat strings
string 1
string 2
string 3
string 4
string 5
$ awk 'NR==FNR{a[$0];next}FNR in a' lines strings
string 1
string 3
string 5
The first block only runs for the first file, where the line number for the current file FNR is equal to the total line number NR. It sets a key in the array a for each line number that should be printed. next skips the rest of the instructions. For the file containing the strings, if the line number is in the array, the default action is performed (so the line is printed).
Use nl to number the lines in your strings file, then use join to merge the two:
~ $ cat index
1
3
5
~ $ cat strings
a
b
c
d
e
~ $ join index <(nl strings)
1 a
3 c
5 e
If you want the inverse (show lines that are NOT in your index):
$ join -v 2 index <(nl strings)
2 b
4 d
Mind also the comment by @glennjackman: if your files are not lexically sorted, then you need to sort them before passing them in:
$ join <(sort index) <(nl strings | sort -b)
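Note that the joined output then comes out in the index's lexical order. To get the lines back in numeric (line-number) order, you could, for example, re-sort numerically and strip the numbers again:
$ join <(sort index) <(nl strings | sort -b) | sort -n | cut -d' ' -f2-
(That restores the data-file order, not the order of the index file itself.)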
In order to complete the answers that use awk, here's a solution in Python that you can use from your bash script:
cat << EOF | python
lines = []
with open("$file2") as f:
    for line in f:
        lines.append(int(line))
i = 0
with open("$file1") as f:
    for line in f:
        i += 1
        if i in lines:
            print line,
EOF
The only advantage here is that Python is much easier to understand than awk :).
I'm trying to calculate a confidence interval from several files: one contains lines with means, and the others contain values (one per line). I'm trying to read one line from the file that contains the means, plus all the lines from one of the other files (because I have to do some computations). Here is what I've done (of course it's not working):
parameters="some value to move from a file to another one"
while read avg; do
for row in mypath/*_${parameters}*.dat; do
for value in $( awk '{ print $2; }' ${row}); do
read all the lines in first_file.dat (I need only the second column)
read the first line in avg.dat
combine data and calculate the confidence interval
done
done
done < avg.dat
** file avg.dat (not necessarily 100 lines) **
.99
2.34
5.41
...
...
2.88
** firstfile.dat in mypath (100 lines) **
0 13.77
1 2
2 63.123
3 21.109
...
...
99 1.05
** secondfile.dat in mypath (100 lines) **
0 8.56
1 91.663
2 19
3 0
...
...
99 4.34
The first line of avg.dat refers to the firstfile.dat in mypath, the second line of avg.dat refers to the secondfile.dat in mypath, etc... So, in the example above, I have to do some computation using .99 (from avg.dat) with all the numbers in the second column of firstfile.dat. Same with 2.34 and secondfile.dat.
I can't reach my objective because I can't find a way to move to the next line of avg.dat when I've finished reading a file in mypath. Instead I read the first line of avg.dat and all the files in mypath, then the second line of avg.dat and, again, all the files in mypath, and so on. Can you help me find a solution? Thank you all!
In bash I would do this:
exec 3<avg.dat        # open avg.dat on file descriptor 3
shopt -s extglob      # enable !() extended globbing
for file in !(avg).dat; do
    read -u 3 avg     # read the next mean from fd 3, one per file
    while read value; do
        : # do stuff with $value and $avg
    done < <(cut -f 2 -d " " "$file")
done
exec 3<&- # close the file descriptor
I need to write a shell script to append characters to each line in a text to make all lines be the same length. For example, if the input is:
Line 1 has 25 characters.
Line two has 27 characters.
Line 3: all lines must have the same number of characters.
Here "Line 3" has 58 characters (not including the newline character) so I have to append 33 characters to "Line 1" and 31 characters to "Line 2". The output should look like:
Line 1 has 25 characters.000000000000000000000000000000000
Line two has 27 characters.0000000000000000000000000000000
Line 3: all lines must have the same number of characters.
We can assume the max length (58 in the above example) is known.
Here is one way of doing it:
while read -r; do                                  # Read from the file one line at a time
    printf "%s" "$REPLY"                           # Print the line without the newline
    for (( i=1; i<=((58 - ${#REPLY})); i++ )); do  # Find the difference in length to iterate
        printf "%s" "0"                            # Pad 0s
    done
    printf "\n"                                    # Add the newline
done < file
Output:
Line 1 has 25 characters.000000000000000000000000000000000
Line two has 27 characters.0000000000000000000000000000000
Line 3: all lines must have the same number of characters.
Of course this is easy if you know the max length of the line. If you don't, then you need to read the file into an array, keeping track of the length of each line and remembering the length of the longest line in a variable. Once you have completely read the file, you iterate over the array and apply the same for loop shown above.
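A rough two-pass sketch of that in bash (assuming the file fits comfortably in memory, padding with 0s as above):
max=0
lines=()
while IFS= read -r line; do                  # first pass: store lines, track the longest
    lines+=("$line")
    (( ${#line} > max )) && max=${#line}
done < file

for line in "${lines[@]}"; do                # second pass: pad each line up to $max
    printf "%s" "$line"
    for (( i=${#line}; i<max; i++ )); do
        printf "0"
    done
    printf "\n"
done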
awk '{print length($0)}' <file_name> | sort -nr | head -1
This way you would not need a loop to find the greatest line length.
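For example, you could capture that value and use it in place of the hard-coded 58 in the padding loop above:
max=$(awk '{print length($0)}' file | sort -nr | head -1)
and then pad with for (( i=1; i<=((max - ${#REPLY})); i++ )).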
Here's a cryptic one:
perl -lpe '$_.="0"x(58-length)' file
I have a file containing many numbers, each written with leading zeros and with "A" temporarily placed before and "Z" placed after, to make sure scripts do not misidentify the beginning and end of a number. E.g.:
A00000000001Z
A00000000003Z,A00000000004Z;A00000000005Z
A00000000004Z A00000000005Zsome wordsA00000000001Z
A00000000006Z;A00000000005Z
A00000000001Z
I need to search for a particular number, and output only those lines that contain it and on which every other number has already appeared earlier in the file.
For example, if I searched for "0000000001", it would print lines 1, 3, and 5:
A00000000001Z
A00000000004Z A00000000005Zsome wordsA00000000001Z
A00000000001Z
It can print line 3 because the other numbers "00000000004" and "00000000005" previously appeared in line 2.
If I searched for "00000000005", it would print line 3:
A00000000004Z A00000000005Zsome wordsA00000000001Z
It would not print line 2, because the other numbers "00000000003" and "00000000004" never appeared previously.
So far, I have worked out this:
# search for the line and print the previously appearing lines to a temporary file
grep -B 10000000 0000000001 file.txt > output.temp
# send the last line to another file
cat output.temp | tail -1 > output.temp1
sed '$ d' output.temp > output.temp2
# search for numbers appearing in output.temp2
for i in 1 .. 1000000 NOT original number
a=`printf $010d $i`
if [ $a FOUND in output.temp2]
then
# check if was found in the previous line
if [ $a NOT FOUND in output.temp1]
else
fi
fi
done < ./file.txt
How can I print only those lines containing a certain number, while excluding lines that also contain numbers that never previously appeared in the file?
Not strictly bash, but here it is in Python2 that you can run from the shell:
#!/usr/bin/env python
import re
import sys

def find_valid_ids(input_file, target_id):
    with open(input_file) as f:
        found_ids = set()
        for line in f.readlines():
            ids = set(re.findall(r'A\d+Z', line))
            if (target_id in ids and
                    (len(ids - found_ids) == 0 or
                     (len(ids) == 1 and target_id in ids))):
                print line.strip('\n')
            found_ids |= ids

if __name__ == "__main__":
    try:
        find_valid_ids(sys.argv[1], sys.argv[2])
    except IndexError as e:
        print 'Usage: ./find_valid_ids.py input_file target_id'
So if you saved the above as find_valid_ids.py, you'd $ chmod +x find_valid_ids.py and run it like $ ./find_valid_ids.py your_input_file.txt A00000000001Z
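A rough awk sketch of the same idea (assuming, as with the Python script, that the target is passed in its A...Z form):
awk -v target="A00000000001Z" '
{
    ok = (index($0, target) > 0)          # the line must contain the target id
    s = $0; n = 0
    while (match(s, /A[0-9]+Z/)) {        # collect every id on the line
        ids[++n] = substr(s, RSTART, RLENGTH)
        s = substr(s, RSTART + RLENGTH)
    }
    for (i = 1; i <= n; i++)
        if (ids[i] != target && !(ids[i] in seen))
            ok = 0                        # some other id has never appeared before
    if (ok) print
    for (i = 1; i <= n; i++)
        seen[ids[i]] = 1                  # remember every id for later lines
}' file.txt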