How to search for a matching string in a file bottom-to-top without using tac? [closed] - shell

I need to grep through a file, starting at the bottom and working up until I reach the first line containing a date ("2021-04-04" in the example below), and then return that date. I don't want to start from the top and work my way down, as there are thousands of lines in each file.
Example file contents:
random text on first line
random text on second line
2021-01-01
random text on fourth line
2021-02-03
random text on sixth line
2021-03-03
2021-04-04
Random text on ninth line
tac isn't available on macOS, so I can't use it.

"thousands of lines" are nothing, they'll be processed in the blink of an eye. Once you get into 10s of millions of lines THEN you could start thinking about a performance improvement if it became necessary.
All you need is:
awk '/[0-9]{4}(-[0-9]{2}){2}/{line=$0} END{if (line!="") print line}' file
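If you want to print just the date itself rather than the whole line containing it, here's a minimal sketch using awk's match()/substr() to capture only the matched text (same regex as above, so the same awk requirements apply):
awk 'match($0, /[0-9]{4}(-[0-9]{2}){2}/) { date = substr($0, RSTART, RLENGTH) }
     END { if (date != "") print date }' file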
Here's the 3rd-run timing comparison for finding the last line containing 2 or more consecutive 5s in a 100000 line file generated by seq 100000 > file100k, i.e. where the target string is just 45 lines from the end of the input file, with and without tac:
$ time awk '/5{2}/{line=$0} END{if (line!="") print line}' file100k
99955
real 0m0.056s
user 0m0.031s
sys 0m0.000s
$ time tac file100k | awk '/5{2}/{print; exit}'
99955
real 0m0.056s
user 0m0.015s
sys 0m0.030s
As you can see, both ran in a fraction of a second and using tac did nothing to improve the speed of execution. Switching to tac+grep doesn't make it any faster either, it still just takes 1/20th of a second:
$ time tac file100k | grep -m1 '5\{2\}'
99955
real 0m0.057s
user 0m0.015s
sys 0m0.015s
In case you ever do need it in future, though, here's how to implement an efficient tac if you don't have it:
$ mytac() { cat -n "${@:--}" | sort -k1,1rn | cut -d$'\t' -f2-; }
$ seq 5 | mytac
5
4
3
2
1
The above mytac() function just adds line numbers to the input, sorts those in reverse, and then removes them again. If your cat doesn't have -n to add line numbers, you can use nl if you have it, or awk -v OFS='\t' '{print NR, $0}' will always work, as in the sketch below.
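For example, a sketch of the same function with the cat -n step swapped for that awk numbering:
mytac() { awk -v OFS='\t' '{print NR, $0}' "${@:--}" | sort -k1,1rn | cut -d$'\t' -f2-; }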

Use tac:
#!/bin/bash
function process_file_backwords(){
    tac "$1" | while IFS= read -r line; do
        # Grep for xxxx-xx-xx date matching
        first_date=$(echo "$line" | grep '[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}' | awk -F '"' '{ print $2 }')
        # Check if the variable is not empty; if so, print it and break the loop
        [ -n "$first_date" ] && echo "$first_date" && break
    done
}
echo "$(process_file_backwords "$1")"
Note: Make sure you add an empty line at the end of the file so tac will not concatenate the last two lines.
Note: Remove the awk part if the file contains the date strings without surrounding " quotes (as in the example file).
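For example, a sketch of that simplified line inside the loop, using grep -o so only the date itself is captured (this assumes the dates appear unquoted, as in the example file):
first_date=$(echo "$line" | grep -o '[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}')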

On macOS
You can use tail -r, which will do the same thing as tac, but you may have to supply the number of lines you want tail to output from your file. Something like this should work:
tail -r -n $(wc -l myfile.txt | cut -d ' ' -f 1) myfile.txt | grep -m 1 -o -P '\d{4}-\d{2}-\d{2}'
-r tells tail to output its last line first
-n takes a numeric argument telling how many lines tail should output
wc -l outputs the line count and filename of a given file
cut -d ' ' splits the above on the space character and -f 1 takes the first "field" which will be our line count
$ cat myfile.txt
foo
this is a date 2021-04-03
bar
this is another date 2021-04-04 for example
$ tail -r -n $(wc -l myfile.txt | cut -d ' ' -f 1) myfile.txt | grep -m 1 -o -P '\d{4}-\d{2}-\d{2}'
2021-04-04
grep options:
The -m 1 option will quit after the first result.
The -o option will return only the string matching the pattern (i.e. your date).
The -P option uses the Perl regex engine, which is really down to preference, but I personally prefer its syntax (it seems to use fewer backslashes).
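If your grep doesn't support -P (the stock macOS grep is BSD grep and may not), a sketch of the same pipeline using -E (extended regex) instead:
tail -r -n $(wc -l < myfile.txt) myfile.txt | grep -m 1 -o -E '[0-9]{4}-[0-9]{2}-[0-9]{2}'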
On Linux
You can use tac (cat in reverse) and pipe that into your grep. e.g.:
$ tac myfile.txt
this is another date 2021-04-04 for example
bar
this is a date 2021-04-03
foo
$ tac myfile.txt | grep -m 1 -o -P '\d{4}-\d{2}-\d{2}'
2021-04-04

You can use perl to reverse the lines and grep for the 1st match too.
perl -e 'print reverse<>' inputFile | grep -m1 '[0-9]\{4\}-[0-9]\{2\}-[0-9]\{2\}'
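A sketch that does both the reversing and the matching in perl alone (like print reverse<>, this reads the whole file into memory first):
perl -e 'for (reverse <>) { if (/(\d{4}-\d{2}-\d{2})/) { print "$1\n"; last } }' inputFile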

Related

Grep - Getting the character position in the line of each occurrence

According to the manual, the option -b can give the byte offset of a given occurrence, but it seems to start from the beginning of the parsed content.
I need to retrieve the position of each matching content returned by grep. I used this line, but it's quite ugly:
grep '<REGEXP>' | while read -r line ; do echo $line | grep -bo '<REGEXP>' ; done
How to get it done in a more elegant way, with a more efficient use of GNU utils?
Example:
$ echo "abcdefg abcdefg" > test.txt
$ grep 'efg' | while read -r line ; do echo $line | grep -bo 'efg' ; done < test.txt
4:efg
12:efg
(Indeed, this command line doesn't output the line number, but it's not difficult to add it.)
With any awk (GNU or otherwise) in any shell on any UNIX box:
$ awk -v re='efg' -v OFS=':' '{
    end = 0
    while ( match(substr($0,end+1),re) ) {
        print NR, end+=RSTART, substr($0,end,RLENGTH)
        end += RLENGTH-1
    }
}' test.txt
1:5:efg
1:13:efg
All strings, fields, and array indices in awk start at 1, not zero, hence the output doesn't look exactly like yours: to awk, your input string is:
123456789012345
abcdefg abcdefg
rather than:
012345678901234
abcdefg abcdefg
Feel free to change the code above to end+=RSTART-1 and end+=RLENGTH if you prefer 0-indexed strings.
Perl is not a GNU util, but can solve your problem nicely:
perl -nle 'print "$.:$-[0]" while /efg/g'
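For example, run against the test.txt from above it should print 0-indexed offsets (unlike the 1-indexed awk output):
$ perl -nle 'print "$.:$-[0]" while /efg/g' test.txt
1:4
1:12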

Get Average of Found Numbers in Each File to Two Decimal Places

I have a script that searches through all files in the directory and pulls the number next to the word <Overall>. I want to now get the average of the numbers from each file, and output the filename next to the average to two decimal places. I've gotten most of it to work except displaying the average. I should say I think it works; I'm not sure if it's pulling all of the instances in the file, and I'm definitely not sure if it's finding the average, as it's hard to tell without the precision. I'm also sorting by the average at the end. I'm trying to use awk and bc to get the average; there's probably a better method.
What I have now:
path="/home/Downloads/scores/*"
(for i in $path
do
echo `basename $i .dat` `grep '<Overall>' < $i |
head -c 10 | tail -c 1 | awk '{total += $1} END {print total/NR}' | bc`
done) | sort -g -k 2
The output i get is:
John 4
Lucy 4
Matt 5
Sara 5
But it shouldn't be an integer and it should be to two decimal places.
Additionally, the files I'm searching through look like this:
<Student>John
<Math>2
<English>3
<Overall>5
<Student>Richard
<Math>2
<English>2
<Overall>4
In general, your script does not extract all numbers from each file, but only the first digit of the first number. Consider the following file:
<Overall>123 ...
<Overall>4 <Overall>56 ...
<Overall>7.89 ...
<Overall> 0 ...
The command grep '<Overall>' | head -c 10 | tail -c 1 will only extract 1.
To extract all numbers preceded by <Overall> you can use grep -Eo '<Overall> *[0-9.]*' | grep -o '[0-9.]*' or (depending on your version) grep -Po '<Overall>\s*\K[0-9.]*'.
To compute the average of these numbers you can use your awk command or specialized tools like ... | average (from the package num-utils) or ... | datamash mean 1.
To print numbers with two decimal places (that is 1.00 instead of 1 and 2.35 instead of 2.34567) you can use printf.
#! /bin/bash
path=/home/Downloads/scores/
for i in "$path"/*; do
    avg=$(grep -Eo '<Overall> *[0-9.]*' "$i" | grep -o '[0-9.]*' |
          awk '{total += $1} END {print total/NR}')
    printf '%s %.2f\n' "$(basename "$i" .dat)" "$avg"
done |
sort -g -k 2
Sorting works only if file names are free of whitespace (like space, tab, newline).
Note that you can swap out the two lines after avg=$( with any method mentioned above.
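For instance, a sketch of the datamash variant mentioned above (assuming GNU datamash is installed), which replaces the awk averaging step:
avg=$(grep -Eo '<Overall> *[0-9.]*' "$i" | grep -o '[0-9.]*' | datamash mean 1)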
You can use a sed command to retrieve the values and calculate their average with bc:
# Read stdin, store the values in an array and perform a bc call
function avg() { mapfile -t l ; local IFS=+ ; bc <<< "scale=2; (${l[*]})/${#l[@]}" ; }
# Browse the .dat files, then display the average for each file
find . -iname "*.dat" |
while read -r f
do
    f=${f##*/} # Remove the dirname
    # Echo the file basename and a tab (no newline)
    echo -en "${f%.dat}\t"
    # Retrieve all the "Overall" values and pass them to our avg function
    sed -nE 's/.*<Overall> *([0-9.]+).*/\1/p' "$f" | avg
done
Output example:
score-2 1.33
score-3 1.33
score-4 1.66
score-5 .66
The pipeline head -c 10 | tail -c 1 | awk '{total += $1} END {print total/NR}' | bc needs improvement.
head -c 10 | tail -c 1 leaves only the 10th character of the first Overall line from each file; better drop that.
Instead, use awk to "remove" the prefix <Overall> and extract the number; we can do this by using <Overall> for the input field separator.
Also use awk to format the result to two decimal places.
Since awk did the job, there's no more need for bc; drop it.
The above pipeline becomes awk -F'<Overall>' '{total += $2} END {printf "%.2f\n", total/NR}'.
Don't forget to keep the closing backtick ` after it.
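Put together, a rough sketch of that echo line with the new pipeline dropped in (the surrounding loop and sort stay as they were):
echo `basename $i .dat` `grep '<Overall>' < $i | awk -F'<Overall>' '{total += $2} END {printf "%.2f\n", total/NR}'`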

Count number of Special Character in Unix Shell

I have a delimited file that is separated by octal \036 or Hexadecimal value 1e.
I need to count the number of delimiters on each line using a bash shell script.
I was trying to use awk, not sure if this is the best way.
Sample Input (| is a representation of \036)
Example|Running|123|
Expected output:
3
awk -F'|' '{print NF-1}' file
Change | to whatever separator you like. If your file can have empty lines then you need to tweak it to:
awk -F'|' '{print (NF ? NF-1 : 0)}' file
You can try
awk '{print gsub(/\|/,"")}'
Simply try
awk -F"|" '{print substr($3,length($3))}' OFS="|" Input_file
Explanation: set the field separator (-F) to |, then print the last character of the third field with substr($3,length($3)). OFS (the output field separator) is set to |, and Input_file is the name of your input file.
This will work as far as I know
echo "Example|Running|123|" | tr -cd '|' | wc -c
Output
3
This should work for you:
awk -F '\036' '{print NF-1}' file
3
-F '\036' sets input field delimiter as octal value 036
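If you prefer the gsub counting approach shown earlier but against the real delimiter, a sketch using the octal escape inside an awk string, so no literal control character is needed on the command line:
awk '{ print gsub("\036", "") }' file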
Awk may not be the best tool for this. GNU grep has a cool -o option that prints each matching pattern on a separate line. You can then count how many matching lines are generated for each input line, and that's the count of your delimiters. E.g. (where ^^ in the file is actually hex 1e)
$ cat -v i
a^^b^^c
d^^e^^f^^g
$ grep -n -o $'\x1e' i | uniq -c
2 1:
3 2:
If you remove the uniq -c you can see how it's working. You'll get "1" printed twice because there are two matching patterns on the first line. Or try it with some regular ASCII characters and it becomes clearer what the -o and -n options are doing.
If you want to print the line number followed by the field count for that line, I'd do something like:
$ grep -n -o $'\x1e' i | tr -d ':' | uniq -c | awk '{print $2 " " $1}'
1 2
2 3
This assumes that every line in the file contains at least one delimiter. If that's not the case, here's another approach that's probably faster too:
$ tr -d -c $'\x1e\n' < i | awk '{print length}'
2
3
0
0
0
This uses tr to delete (-d) all characters that are not (-c) 1e or \n. It then pipes that stream of data to awk which just counts how many characters are left on each line. If you want the line number, add " | cat -n" to the end.

Replace each } with a }\n in a huge (12GB) file which consists of 1 line?

I have a log file (from a customer). 18 Gigs. All contents of the file are in 1 line.
I want to read the file in logstash. But I get problems because of Memory. The file is read line by line but unfortunately it is all on 1 line.
I tried to split the file into lines so that logstash can process it (the file has a simple JSON format, no nested objects). I wanted to have each JSON object on one line, splitting at } by replacing it with }\n:
sed -i 's/}/}\n/g' NonPROD.log.backup
But sed is killed - I assume also because of memory. How can I resolve this? Can I let sed manipulate the file using other chunks of data than lines? I know by default sed reads line by line.
The following uses only functionality built into the shell:
#!/bin/bash
# as long as there exists another } in the file, read up to it...
while IFS= read -r -d '}' piece; do
    # ...and print that content followed by '}' and a newline.
    printf '%s}\n' "$piece"
done
# print any trailing content after the last }
[[ $piece ]] && printf '%s\n' "$piece"
If you have logstash configured to read from a TCP port (using 14321 as an arbitrary example below), you can run the script <NonPROD.log.backup >"/dev/tcp/127.0.0.1/14321" or similar, and there you are, without needing to have double your original input file's space available on disk, as the other answers given thus far require.
With GNU awk for RT:
$ printf 'abc}def}ghi\n' | awk -v RS='}' '{ORS=(RT?"}\n":"")}1'
abc}
def}
ghi
with other awks:
$ printf 'abc}def}ghi\n' | awk -v RS='}' -v ORS='}\n' 'NR>1{print p} {p=$0} END{printf "%s",p}'
abc}
def}
ghi
I decided to test all of the currently posted solutions for functionality and execution time using an input file generated by this command:
awk 'BEGIN{for(i=1;i<=1000000;i++)printf "foo}"; print "foo"}' > file1m
and here's what I got:
1) awk (both awk scripts above had similar results):
time awk -v RS='}' '{ORS=(RT?"}\n":"")}1' file1m
Got expected output, timing =
real 0m0.608s
user 0m0.561s
sys 0m0.045s
2) shell loop:
$ cat tst.sh
#!/bin/bash
# as long as there exists another } in the file, read up to it...
while IFS= read -r -d '}' piece; do
    # ...and print that content followed by '}' and a newline.
    printf '%s}\n' "$piece"
done
# print any trailing content after the last }
[[ $piece ]] && printf '%s\n' "$piece"
$ time ./tst.sh < file1m
Got expected output, timing =
real 1m52.152s
user 1m18.233s
sys 0m32.604s
3) tr+sed:
$ time tr '}' '\n' < file1m | sed 's/$/}/'
Did not produce the expected output (Added an undesirable } at the end of the file), timing =
real 0m0.577s
user 0m0.468s
sys 0m0.078s
With a tweak to remove that final undesirable }:
$ time tr '}' '\n' < file1m | sed 's/$/}/; $s/}//'
real 0m0.718s
user 0m0.670s
sys 0m0.108s
4) fold+sed+tr:
$ time fold -w 1000 file1m | sed 's/}/}\n\n/g' | tr -s '\n'
Got expected output, timing =
real 0m0.811s
user 0m1.137s
sys 0m0.076s
5) split+sed+cat:
$ cat tst2.sh
mkdir tmp$$
pwd="$(pwd)"
cd "tmp$$"
split -b 1m "${pwd}/${1}"
sed -i 's/}/}\n/g' x*
cat x*
rm -f x*
cd "$pwd"
rmdir tmp$$
$ time ./tst2.sh file1m
Got expected output, timing =
real 0m0.983s
user 0m0.685s
sys 0m0.167s
You can run it through tr, then put the end bracket back on at the end of each line:
$ cat NonPROD.log.backup | tr '}' '\n' | sed 's/$/}/' > tmp$$
$ wc -l NonPROD.log.backup tmp$$
0 NonPROD.log.backup
43 tmp10528
43 total
(My test file only had 43 brackets.)
You could:
Split the file into, say, 1M chunks using split -b 1m file.log
Process all the chunks with sed 's/}/}\n/g' x*
... and redirect the output of sed to combine them back into a single piece
The drawback of this is the doubled storage space.
another alternative with fold, which first breaks the single long line into 1000-character chunks so sed never has to hold the whole line in memory:
$ fold -w 1000 long_line_file | sed 's/}/}\n\n/g' | tr -s '\n'

How to strip a number in the output of an executable?

I run an executable which outputs a lot of lines to stdout. The last line is
Run in 100 seconds
The code in the C program of the executable to write the last line is
printf("Ran in %g seconds\n", time);
So there is a newline character at the end.
I want to strip the last number, e.g. 100, from the stdout, so in bash
./myexecutable > output
How can I then further parse output to get the time number in bash? Do I need additional tools to do that?
Thanks!
You could use grep:
grep -oP 'Ran in \K\d+' output
or
grep -oP '(?<=Ran in )\d+(?= seconds)' output
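If your grep lacks -P, a sed sketch matching the same "Ran in ... seconds" format should work too (the [0-9.] class allows a decimal point, since the time is printed with %g):
sed -n 's/.*Ran in \([0-9.]*\) seconds.*/\1/p' output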
Let's say:
s='Run in 100 seconds'
Using tr:
tr -cd '[:digit:]' <<< "$s"
100
Using sed:
sed 's/[^0-9]*//g' <<< "$s"
100
However, if you want to grab the last number in a line, then use this negative-lookahead regex:
s='Run 10 in 100 seconds'
grep -oP '\d+(?!\D*\d)' <<< "$s"
100
Or, use tail to grab the last line (tail -n 1 <file>) and extract the number by either -
Using sed with three pattern groups and printing the second group match:
tail -n 1 output | sed 's/\(^Run in \)\([0-9]\+\)\( seconds$\)/\2/g'
Using awk to print the third ($3) token:
tail -n 1 output | awk '{print $3}'
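To keep the value for later use in the script, a sketch capturing it into a shell variable (this assumes the time is always the third field of the last line):
runtime=$(tail -n 1 output | awk '{print $3}')
echo "$runtime"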
