cat or grep a html file to find specific text [duplicate] - bash

This question already has answers here:
How to print lines between two patterns, inclusive or exclusive (in sed, AWK or Perl)?
(9 answers)
Extract lines between two patterns from a file [duplicate]
(3 answers)
Closed 4 years ago.
I am trying to use bash to parse an HTML file using grep.
The HTML structure won't change, so I should be able to find the text easily enough.
The HTML will be like this, and I just want the number which will change each time the file changes:
<div class="total">
900 files inspected,
28301 offenses detected:
</div>
grep -E '^<div class="total">.</div>' my_file.html
Ideally I just want to pull the number of offenses so in the example above it would be 28301. I would like to assign it to a variable also.
Am I close?

You can do it with a simple:
a=$(grep -oP '(\d+)(?=\soffenses\sdetected)' abc); echo "$a"
which will give:
28301
-o prints only the matching part of the line
-P enables Perl-compatible regular expressions (GNU grep)
abc is the name of the file
(\d+)(?=\soffenses\sdetected): in this regex we use a positive lookahead to capture only the digits that are followed by " offenses detected"
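Since -P is specific to GNU grep, a portable variant may be useful on other systems. The following is a minimal sketch, assuming the file looks exactly like the sample in the question; the temporary file and the variable names are hypothetical and only serve the illustration.

```shell
#!/bin/sh
# Hypothetical sample file mirroring the HTML snippet from the question.
html_file=$(mktemp)
cat > "$html_file" <<'EOF'
<div class="total">
900 files inspected,
28301 offenses detected:
</div>
EOF

# sed: print only the leading number of the line that mentions "offenses detected".
offenses_sed=$(sed -n 's/^[[:space:]]*\([0-9][0-9]*\) offenses detected.*/\1/p' "$html_file")

# awk: same idea, the number is the first whitespace-separated field of that line.
offenses_awk=$(awk '/offenses detected/ { print $1 }' "$html_file")

echo "sed: $offenses_sed"
echo "awk: $offenses_awk"
rm -f "$html_file"
```

Both commands assign the number to a variable, which is what the question asks for; they rely on the number being the first token on its line.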

If you have GNU grep and GNU sed, you can do:
$ cat file | xargs | grep -Po '<div class=total>\K(.*?)</div>' | sed -E 's/<\/div>//; s/, /\n/'
900 files inspected
28301 offenses detected:
If you have ruby available:
$ ruby -e 'puts readlines.join[/(?<=<div class="total">).+(?=<\/div>)/m].gsub(/^[ \t]+/m,"")' file
900 files inspected,
28301 offenses detected:

Related

How to truncate extraneous output using shell script? [duplicate]

This question already has answers here:
How to insert strings containing slashes with sed? [duplicate]
(11 answers)
Why it's not possible to use regex to parse HTML/XML: a formal explanation in layman's terms
(10 answers)
Closed 3 years ago.
I am trying to eliminate everything before and after the JSON contained in a specific part of a webpage so I can send that to a PHP script. I've tried a number of ways to get rid of the container content but all of them so far have failed, including one method that has worked in the exact same syntax for related purposes:
The characters that are between the two asterisks (**) at the beginning and end I need removed:
**var songs = [**{"timestamp":1555176393000,"title":"Enter Sandman","trackId":"ba_5cbb546d-5c1c-490e-9908-761b89dd5166","artist":"Metallica","artistId":"52_65f4f0c5-ef9e-490c-aee3-909e7ae6b2ab","album":"Metallica","albumId":"d0_6e729716-c0eb-3f50-a740-96ac173be50d","npe_id":"3cc5fe24d0ffcbb9152d861f27ae801660"},{"timestamp":1555176702000,"title":"Start Me Up","trackId":"76_d0b86399-11e5-4d11-b4fe-ce4b3f9a4736","artist":"The Rolling Stones","artistId":"1b_b071f9fa-14b0-4217-8e97-eb41da73f598","album":"Tattoo You","albumId":"d1_778b345b-e8a1-4054-b5ba-c611d3fda421","npe_id":"f0dc0ab12ef99a6e0087cad12886509b7b"},{"timestamp":1555176909000,"title":"Fame","trackId":"4e_cdef4b88-7314-431a-9cdd-d457296a65b7","artist":"David Bowie","artistId":"ab_5441c29d-3602-4898-b1a1-b77fa23b8e50","album":"Best of Bowie","albumId":"21_3709ee5a-d087-370f-afb4-f730092c7a94","npe_id":"2b8b3a170baa77125891d72a0474d3343a"},{"timestamp":1555177158000,"title":"Rocket","trackId":"34_aa5b9053-849e-4788-972f-7941303175b6","artist":"Def Leppard","artistId":"c1_7249b899-8db8-43e7-9e6e-22f1e736024e","album":"Hysteria","albumId":"06_de5cf055-d875-41f8-9261-89b11b7ff145","npe_id":"0d87b580f140a85feaebc7d77f75db2a3d"},{"timestamp":1555177826000,"title":"Mama, I'm Coming Home","trackId":"cb_e5b09171-9527-4d24-8ab6-1e922fdd66d3","artist":"Ozzy Osbourne","artistId":"4b_8aa5b65a-5b3c-4029-92bf-47a544356934","album":"No More Tears","albumId":"66_8f3d5a65-036c-3260-b9bb-36f1d0d80c11","npe_id":"6b766464fe945f275bf478192dcd33cfdc"},{"timestamp":1555178076000,"title":"Gold Dust Woman","trackId":"a4_ef8c1eca-f344-4bfb-82ea-763aa8aeaad9","artist":"Fleetwood Mac","artistId":"66_bd13909f-1c29-4c27-a874-d4aaf27c5b1a","album":"2010-01-08: The Rock Boat X, Lido Deck, Carnival Inspiration","albumId":"80_4f229af0-2afc-431d-87ff-f7f6af66268e","npe_id":"f6417d98fd1fefcca227d82a8ac9b84197"},{"timestamp":1555178363000,"title":"With or Without 
You","trackId":"79_6b9a509f-6907-4a6e-9345-2f12da09ba4b","artist":"U2","artistId":"26_a3cb23fc-acd3-4ce0-8f36-1e5aa6a18432","album":"The Joshua Tree","albumId":"0c_d287c703-5c25-3181-85d4-4d8c1a7d8ecd","npe_id":"23b19420196b28e2156ecda87c11b882e0"},{"timestamp":1555178654000,"title":"Who Are You","trackId":"7d_431b9746-c6ec-489d-9199-c83676171ae8","artist":"The Who","artistId":"22_f2fa2f0c-b6d7-4d09-be35-910c110bb342","album":"Who Are You","albumId":"40_b255da2c-6583-35f9-95e3-ef5f9c14e868","npe_id":"e01896f74f24968bb7727eaafbf6250b8f"},{"timestamp":1555179031000,"title":"Authority Song","trackId":"31_f5ff19f7-95f3-4a22-8996-3788c264e0b8","artist":"John Mellencamp","artistId":"4d_0aad6b52-fd93-4ea4-9c5d-1f66e1bc9f0a","album":"Words & Music: John Mellencamp's Greatest Hits","albumId":"9e_1240c510-7015-4484-baac-ce17f5277ea1","npe_id":"244785e3b1d75effb9fdecbb6df76b009f"},{"timestamp":1555179256000,"title":"Touch Me","trackId":"9d_1dd1f86c-2120-45f3-ac9f-3c87257fe414","artist":"The Doors","artistId":"13_9efff43b-3b29-4082-824e-bc82f646f93d","album":"The Soft Parade","albumId":"db_c29d7552-b5df-42b8-aae7-03d1e250cb3a","npe_id":"1b5d155eb2eeee6fc1fdb50a94b100669c"}]**; <ol class="songs tracks"></ol>**
Here is the shell script which produces the above at present:
#!/bin/sh
curl -v --silent http://player.listenlive.co/41851/en/songhistory >/var/tmp/wklh$1.a.txt
pta=`cat /var/tmp/wklh$1.a.txt | grep songs > /var/tmp/wklh$1.b.txt`
ptb=`cat /var/tmp/wklh$1.b.txt | sed -n -e '/var songs = /,/; <span title/ p' > /var/tmp/wklh$1.c.txt`
ptc=`cat /var/tmp/wklh$1.c.txt | grep songs > /var/tmp/wklh$1.d.txt`
#ptd=`cat /var/tmp/wklh$1.d.txt | sed -i 's/var songs = [//g' /var/tmp/wklh$1.d.txt`
#ptd=`cat /var/tmp/wklh$1.d.txt | sed -i 's/}]; <ol class="songs tracks"></ol>//g' /var/tmp/wklh$1.d.txt`
json=`cat /var/tmp/wklh$1.d.txt`
echo $json
metadata=`php /etc/asterisk/scripts/music/wklh.php $json`
echo $metadata
The commented out lines are what I was trying to use to remove the extraneous content, since it is predictable every time. However, when uncommented, I get the following errors:
sed: -e expression #1, char 18: unterminated `s' command
sed: -e expression #1, char 38: unknown option to `s'
I've examined my sed statement, but I can't find any discrepancies between how I use it here and in other working shell scripts.
Is there actually a syntax error here (or unallowed characters)? Or is there a better way I can do this?
Your shell script has serious issues.
The syntax
variable=`commands`
takes the output of commands and assigns it to variable. But in every case, you are redirecting all output to a file; so the variable will always be empty.
Unless you need the temporary files for reasons which are not revealed in your question (such as maybe being able to check how many bytes of output you got in each temporary file for a monitoring report, or something like that), a pipeline would be much superior.
#!/bin/sh
curl -v --silent http://player.listenlive.co/41851/en/songhistory |
grep songs |
sed -n -e '/var songs = /,/; <span title/ p' |
grep songs |
php /etc/asterisk/scripts/music/wklh.php
This also does away with the useless uses of cat and the useless uses of echo and so also coincidentally removes the quoting errors. The grep x | sed -n 's/y/z/p' is a useless use of grep which can easily be refactored to sed -n '/x/s/y/z/p'
Square brackets are special to sed: an unescaped [ opens a bracket expression, which is why sed reports an unterminated `s' command. Simply escape them.
s/var songs = \[//g
If you use slash / as the regex delimiter, it becomes special, so the / in </ol> ends the command early and causes the unknown option to `s' error. Either escape it or use a different delimiter.
s/}]; <ol class="songs tracks"><\/ol>//g
s|}]; <ol class="songs tracks"></ol>||g
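To see both fixes working together, here is a minimal sketch on a shortened, hypothetical sample line. It deviates slightly from the expressions above: instead of deleting the brackets, it puts them back in the replacement so that the remaining JSON array stays balanced. The variable names are made up for the example.

```shell
#!/bin/sh
# Hypothetical, shortened version of the line the question wants to clean up.
line='var songs = [{"title":"Enter Sandman"},{"title":"Fame"}]; <ol class="songs tracks"></ol>'

# First expression: \[ matches a literal bracket, which is re-inserted by the replacement.
# Second expression: | is the delimiter, so the / in </ol> needs no escaping.
json=$(printf '%s\n' "$line" |
  sed -e 's/^var songs = \[/[/' \
      -e 's|]; <ol class="songs tracks"></ol>$|]|')

printf '%s\n' "$json"
```

The result can then be piped or passed (quoted, as "$json") to the PHP script.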
If your data is in a file named d, try GNU sed:
sed -Ez 's/^\*\*[^\*]+\*\*(.+)]\*\*[^\*]+\*\*\s*$/\1/' d
This removes the last ] too, so that the JSON stays correctly balanced.

Using grep to filter real time output of a process? If so, how to get the line after a match? [duplicate]

This question already has answers here:
How to show only next line after the matched one?
(14 answers)
grep: show lines surrounding each match
(14 answers)
Read from a endless pipe bash [duplicate]
(1 answer)
Closed 4 years ago.
Should I use grep to filter real-time output? I'm not sure whether it is the right tool for output that is still being produced.
Example: command -option | grep --color 'string1\|string2'
If so, how to get also the lines after string1 and string2?
As @shellter mentioned, from man grep:
-A num, --after-context=num
Print num lines of trailing context after each match. See also the -B and -C options.
so you would use command -option | grep -A 1 --color 'string1\|string2' to print matched lines and the line right after them.
There are plenty of other options in the manual for grep, and most other command-line programs, so I suggest getting used to running man cmd as a quick first check.
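A small sketch of the -A option in action follows. The function produce_output is a hypothetical stand-in for command -option, and -E 'string1|string2' is used here as the extended-regex spelling of the 'string1\|string2' alternation.

```shell
#!/bin/sh
# Hypothetical stand-in for "command -option": emits a few lines of output.
produce_output() {
  printf '%s\n' 'boot ok' 'string1 found' 'detail a' 'noise' \
                'string2 found' 'detail b' 'done'
}

# -E: extended regex, so | means "or".
# -A 1: print each matching line plus the one line that follows it.
filtered=$(produce_output | grep -A 1 -E 'string1|string2')

printf '%s\n' "$filtered"
```

For a process that keeps running, GNU grep also offers --line-buffered, which flushes every line as soon as it is matched; this matters mainly when grep's own output goes into another pipe rather than to the terminal.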

Single-line bash command that will count and display the occurrences of a particular string in a file [duplicate]

This question already has answers here:
Count the number of times a word appears in a file
(3 answers)
Closed 4 years ago.
I am a Bash and terminal newbie. I have been given the task of counting the number of entries with a specific area code using a single-line Bash command. Can you please point me in the right direction? I've been using a Bash scripting cheat sheet, but I'm not familiar enough with Bash commands to write a script that iterates over the file and counts the number of times [213] appears in it:
If you are looking for the string 123 anywhere in the file, then:
grep -c 123 file # counts 123 4123 41235 etc
If you are looking for the "word" 123, then:
grep -wc 123 file # counts 123 /123/ #123# etc., but not 1234 4123 ...
If you want multiple occurrences of the word on the same line to be counted separately, then use the -o option:
grep -ow 123 file | wc -l
See also:
Confused about word boundary on Unix & Linux Stack Exchange
grep -o '213' filename | wc -l
In the future, you should try searching for general forms of your command. You would have found a number of similar questions.
See man grep. grep has a count option.
So you want to run grep -c 213 file.
The following awk may help you here too. (It counts the lines of Input_file that contain the string 213 anywhere.)
awk '/213/{count++} END{print count}' Input_file
In case you want to count only those lines that consist of exactly 213, use the following.
awk '/^213$/{count++} END{print count}' Input_file
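The answers in this thread count slightly different things (lines versus occurrences, substrings versus whole words). The sketch below runs them side by side on a small, hypothetical sample file so the differences become visible; the file contents and variable names are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical sample data; the question's real file is not shown.
phone_file=$(mktemp)
cat > "$phone_file" <<'EOF'
213 555 0100
310 555 0199 213
2134 555 0000
213 213
EOF

# Lines that contain 213 anywhere (also inside 2134).
lines_with_string=$(grep -c 213 "$phone_file")
# Lines that contain 213 as a whole word.
lines_with_word=$(grep -wc 213 "$phone_file")
# Every occurrence of 213, several per line if present (tr strips wc's padding).
all_occurrences=$(grep -o 213 "$phone_file" | wc -l | tr -d ' ')
# Whole-word occurrences only.
word_occurrences=$(grep -ow 213 "$phone_file" | wc -l | tr -d ' ')
# awk equivalent of grep -c; count+0 prints 0 instead of an empty line when nothing matches.
awk_lines=$(awk '/213/{count++} END{print count+0}' "$phone_file")

echo "lines with string:  $lines_with_string"
echo "lines with word:    $lines_with_word"
echo "all occurrences:    $all_occurrences"
echo "word occurrences:   $word_occurrences"
echo "awk matching lines: $awk_lines"
rm -f "$phone_file"
```

Which number is the right one depends on whether an entry is a line or a single occurrence, and on whether 2134 should count as a hit.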

How to scrape end of line in grep? [duplicate]

This question already has answers here:
How to find patterns across multiple lines using grep?
(28 answers)
Closed 6 years ago.
I have a file that contains a sequence already broken into lines, something like this:
CGCCCATGGGTCGTATACGTAATGGGAAAACAAAGCATGGTGTAACTATGGTAAGTGCTA
GACAATACAAGAAGGCTGATATTTGTAGAATAATTCATTTGAATTATTATGCTGTAAATA
GCTAGATTATTATGCATAATTACTTTGAGAGGTGATCAATCAATTCGACCCTTGCCAATT
I want to search a specific pattern in this file like GCTGTAAATAGCTAGATTA for example.
The problem is that the pattern may be cut by a newline at an unpredictable place.
I can use:
grep -e "pattern" file
but grep searches line by line, so it does not see past the newline character and gives no result. How can I modify my command to ignore \n in my search?
Edit:
I don't know whether my query exists in the file or not, and if it does, I don't know where it is.
The best solution that came into my mind is
tr -d '\n' < file | grep -e "CTACCCCAGACAAACTGGTCAGATACCAACCATCAGCGAAACTAACCAAACAAA"
but I know there should be more efficient ways to do that.
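pattern="GCTGTAAATA"$'\n'"GCTAGATTA" # $'\n' is Bash's ANSI-C quoting for a newline character
grep -e "$pattern" file # note: grep treats a newline inside the pattern as a separator between alternative patterns, so this prints lines that match either half
OR
pattern="GCTGTAAATA
GCTAGATTA" # with an actual newline at the end of the first line
grep -e "$pattern" file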
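When the position of the line break is unknown, the tr approach from the question remains a straightforward option. Here is a minimal sketch using the three sample lines and the example query from the question; the temporary file and variable names are hypothetical.

```shell
#!/bin/sh
# The three sequence lines quoted in the question, written to a temporary file.
seq_file=$(mktemp)
cat > "$seq_file" <<'EOF'
CGCCCATGGGTCGTATACGTAATGGGAAAACAAAGCATGGTGTAACTATGGTAAGTGCTA
GACAATACAAGAAGGCTGATATTTGTAGAATAATTCATTTGAATTATTATGCTGTAAATA
GCTAGATTATTATGCATAATTACTTTGAGAGGTGATCAATCAATTCGACCCTTGCCAATT
EOF

query='GCTGTAAATAGCTAGATTA'

# Line-by-line search: the query spans two lines, so no single line contains it.
per_line_hits=$(grep -c -F "$query" "$seq_file" || true)

# Delete every newline first, then search the joined sequence.
# -F treats the query as a fixed string, -q only reports success or failure.
if tr -d '\n' < "$seq_file" | grep -q -F "$query"; then
  result='found'
else
  result='not found'
fi

echo "per-line hits: $per_line_hits"
echo "joined search: $result"
rm -f "$seq_file"
```

This reads the whole sequence as one long line, which is simple but may use a lot of memory on very large files.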

grep not working with BOM [duplicate]

This question already has answers here:
Elegant way to search for UTF-8 files with BOM?
(11 answers)
Closed 8 years ago.
I am trying to grep a string from a file but grep returns nothing (even though the string is present in the file). It turned out that the file starts with a ÿþ mark. If I remove it manually then grep works. How do I make grep work without manually removing the BOM?
What about:
strings <file> | grep <pattern>
Alternatively check the man page of your grep command. What's actually happening is that grep is looking at the first few bytes of your file and deciding that it's a binary file and therefore not searchable. You can override this with:
--binary-files=text
You can also use cat with the -v (visible) option:
cat -v file | grep pattern
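Another option, assuming the ÿþ bytes are the UTF-16 little-endian byte order mark (FF FE displayed as Latin-1), is to convert the file to UTF-8 before searching. This is only a sketch under that assumption: the sample file is generated on the fly, and the file name and search word are hypothetical.

```shell
#!/bin/sh
# Build a hypothetical UTF-16 file (with BOM) to stand in for the real one.
utf16_file=$(mktemp)
printf 'header line\nneedle here\n' | iconv -f UTF-8 -t UTF-16 > "$utf16_file"

# Searching the raw file: every character is stored with an extra NUL byte,
# so the plain ASCII pattern is not found.
plain_hits=$(grep -c 'needle' "$utf16_file" || true)

# Convert to UTF-8 first; iconv uses the BOM to detect the byte order.
match=$(iconv -f UTF-16 -t UTF-8 "$utf16_file" | grep 'needle')

echo "raw file hits: $plain_hits"
echo "after iconv:   $match"
rm -f "$utf16_file"
```

If the file turns out to use a different encoding, the -f argument has to be adjusted accordingly.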
