I'm trying to write a bash scraper. I've managed to extract the data, but I'm struggling to fetch, e.g., today's temperature with grep, since the date and the temperature are not on the same line. I would like the result written to a file.
I've tried grep -E -o '[2022]-[11]-[15]' | grep "celsius" | grep -E -o '[0-9]{1,2}.[0-9]{1,2}' > file.txt
API result:
<product class="pointData">
<time datatype="forecast" from="2022-11-14T18:00:00Z" to="2022-11-14T18:00:00Z">
<location altitude="4" latitude="60.3913" longitude="5.3221">
<temperature id="TTT" unit="celsius" value="8.2"/>
<windDirection id="dd" deg="146.5" name="SE"/>
<windSpeed id="ff" mps="0.5" beaufort="1" name="Flau vind"/>
<windGust id="ff_gust" mps="1.2"/>
<humidity unit="percent" value="82.5"/>
<pressure id="pr" unit="hPa" value="1014.5"/>
<cloudiness id="NN" percent="45.1"/>
<fog id="FOG" percent="0.0"/>
<lowClouds id="LOW" percent="4.5"/>
<mediumClouds id="MEDIUM" percent="0.0"/>
<highClouds id="HIGH" percent="39.9"/>
<dewpointTemperature id="TD" unit="celsius" value="5.0"/>
</location>
</time>
<time datatype="forecast" from="2022-11-14T17:00:00Z" to="2022-11-14T18:00:00Z">
<location altitude="4" latitude="60.3913" longitude="5.3221">
<precipitation unit="mm" value="0.0" minvalue="0.0" maxvalue="0.0"/>
<symbol id="PartlyCloud" number="3" code="partlycloudy_night"/>
</location>
</time>
<time datatype="forecast" from="2022-11-14T19:00:00Z" to="2022-11-14T19:00:00Z">
<location altitude="4" latitude="60.3913" longitude="5.3221">
<temperature id="TTT" unit="celsius" value="8.7"/>
<windDirection id="dd" deg="112.5" name="SE"/>
<windSpeed id="ff" mps="0.4" beaufort="1" name="Flau vind"/>
<windGust id="ff_gust" mps="0.8"/>
<humidity unit="percent" value="75.6"/>
<pressure id="pr" unit="hPa" value="1013.8"/>
<cloudiness id="NN" percent="57.5"/>
<fog id="FOG" percent="0.0"/>
<lowClouds id="LOW" percent="1.1"/>
<mediumClouds id="MEDIUM" percent="0.4"/>
<highClouds id="HIGH" percent="55.4"/>
<dewpointTemperature id="TD" unit="celsius" value="4.4"/>
</location>
</time>
The output written to the file should be:
8.2
grep -A3 '2022-11-14' -m1 inputfile.txt | \
grep -P -o "<temperature.*celsius.*\"\K\-?[0-9]{1,2}\.[0-9]{1,2}"
8.2
-A3 print 3 lines after the match
-m1 stop after the first match
-P use Perl-compatible regular expressions
-o print only the matched part
\K discard everything matched before it
-? allow a leading - for negative temperatures
[0-9]{1,2}\.[0-9]{1,2} the temperature in celsius
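Redirecting the pipeline gets the requested file output. A self-contained check against a trimmed copy of the sample (the file names inputfile.txt and file.txt are placeholders):

```shell
# Trimmed sample from the question
cat > inputfile.txt <<'EOF'
<time datatype="forecast" from="2022-11-14T18:00:00Z" to="2022-11-14T18:00:00Z">
<location altitude="4" latitude="60.3913" longitude="5.3221">
<temperature id="TTT" unit="celsius" value="8.2"/>
EOF

# Same pipeline as above, redirected into file.txt
grep -A3 '2022-11-14' -m1 inputfile.txt |
  grep -P -o "<temperature.*celsius.*\"\K-?[0-9]{1,2}\.[0-9]{1,2}" > file.txt

cat file.txt   # 8.2
```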
You can also use xq:
$ date="2022-11-14"
$ xq -r '.product.time[0] | select (."@from" | contains("'$date'")) // null | '\
'.location|.temperature|(if ."@unit" == "celsius" then ."@value" else "error" end)' \
< input.html
8.2
Or, as @AndyLester suggested, using XPath:
$ date="2022-11-14"
$ xmllint --xpath '//time[starts-with(@from,"'$date'")][1]'\
'//temperature[@unit="celsius"]/@value' input.txt |\
grep -Po '[-]?\d+\.\d+'
Related
I'm fetching a URL with cURL, which returns HTML. Using awk, I'm extracting the sensor name and its status.
(curl <MY URL> | awk -F"Sensor<\/th><td>" '{print $2}' | awk -F"<\/td></tr>" '{print $1}'; \
curl <my URL> | awk -F"Status<\/th><td><strong>" '{print $2}' | awk -F"<\/strong>" '{printf $1}' \
) | tr -d '\n' >> output
The cURL input looks like:
<html><head><title>Sensor status for NumberOfThreadsSensor-NumberOfThreads</title></head><body>
<h1>Sensor status for NumberOfThreadsSensor-NumberOfThreads</h1>
<table>
<tr><th>Plugin</th><td>NumberOfThreadsSensor</td></tr><tr><th>Sensor</th><td>NumberOfThreads</td></tr><tr><th>Status</th><td>Ok</td></tr><tr><th>Created</th><td>Fri Aug 14 09:03:10 UTC 2020 (13 seconds ago)</td></tr><tr><th>TTL</th><td>30 seconds</td></tr><tr><th>Short message</th><td>1;14;28</td></tr><tr><th>Long message</th><td>1 [interval: 1 min];14 [interval: 30 min];28 [interval: 60 min]</td></tr></table>
<h2>Formats</h2><p>The status shown on this page is also available in the following machine-friendly formats:</p>
<ul>
<li>A simple status string, Possible values: OK, WARNING, CRITICAL, UNKNOWN.</li>
<li>Nagios plugin output, output formatted for easy integration with Nagios.</li>
<li>Full xml all available data in xml for easy parsing by ad-hoc monitoring tools.</li>
<li>Prometheus output, all available data in prometheus format</li>
</ul>
<p>Please do not rely on the output of this page for automated monitoring, use one of the formats above.</p>
</body></html>
Current output: ScoreProcessorWarning
Expected output: ScoreProcessor Warning
Please help me simplify my shell script; I'm still learning. Thanks for the help.
With the input presented saved in /tmp/input.txt:
<h1>Sensor status for EventProcessorStatus-ScoreProcessor</h1>
<table>
<tr><th>Plugin</th><td>EventProcessorStatus</td></tr><tr><th>Sensor</th><td>ScoreProcessor</td></tr><tr><th>Status</th><td><strong>Warning</strong></td></tr><tr><th>Created</th><td>Fri Aug 10 00:16:23 UTC 2020 (0 seconds ago)</td></tr><tr><th>TTL</th><td>30 seconds</td></tr><tr><th>Short message</th><td>Endpoint is running, but has errors</td></tr><tr><th>Long message</th><td>Endpoint is running, but has errors<br/>
Number of errors in background process (xxxx) logs: 4<br/>
</td></tr></table>
<h2>Performance data</h2><table>
With my very limited knowledge of xmllint I ended up with:
# Extract only the table, get the text of all table cells
xmllint --html --xpath '//table//tr//text()' /tmp/input.txt |
# Each header/value pair comes out on two lines; join them with a tab
sed 'N;s/\n/\t/' |
# Keep only the Sensor and Status values
sed -n '/Sensor\t/{s///;h}; /Status\t/{s///;x;G;p}' |
# Read the sensor name and status into bash variables
{ IFS= read -r name; IFS= read -r status; echo "name=$name status=$status" ;}
which outputs:
name=ScoreProcessor status=Warning
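If xmllint is not available, a grep-only sketch also extracts the two fields for input shaped like the sample. This relies on each header/value pair staying on one line, so it is fragile compared with a real parser:

```shell
# One-line sample row in the shape shown in the question
cat > /tmp/input.txt <<'EOF'
<tr><th>Sensor</th><td>ScoreProcessor</td></tr><tr><th>Status</th><td><strong>Warning</strong></td></tr>
EOF

# \K drops everything matched so far; [^<]+ grabs the cell text
name=$(grep -Po 'Sensor</th><td>\K[^<]+' /tmp/input.txt)
status=$(grep -Po 'Status</th><td>(<strong>)?\K[^<]+' /tmp/input.txt)
echo "name=$name status=$status"   # name=ScoreProcessor status=Warning
```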
Please, is there any simple way to get only the NAME field from the lines where DATE is older than 5 days, and then run rm on those lines with NAME as the argument?
I have the following output from the mega-ls path/ -l (mega.nz) command:
FLAGS VERS SIZE DATE NAME
d--- - - 06Feb2020 05:00:01 bk_20200206050000
d--- - - 07Feb2020 05:00:01 bk_20200207050000
d--- - - 08Feb2020 05:00:01 bk_20200208050000
d--- - - 09Feb2020 05:00:01 bk_20200209050000
d--- - - 10Feb2020 05:00:01 bk_20200210050000
d--- - - 11Feb2020 05:00:01 bk_20200211050000
I tried grep, sort and other approaches, e.g. mega-ls path/ -l | head -n 5, but I don't know how to filter these lines by date.
Thanks a lot.
I tried to find a simple way to do what you ask ;)
mega-ls path/ -l | head -n 5 | tr -s ' ' | cut -d ' ' -f6 | grep -v -e '^$' | grep '^bk_20200206.*' | xargs rm -f
Part 1: your command (returns the folder list with extra data)
mega-ls path/ -l | head -n 5
Part 2: squeeze the repeated spaces out of the Part 1 result
tr -s ' '
Part 3: use cut to split the Part 2 result and keep the folder-name column
cut -d ' ' -f6
Part 4: remove the empty lines from the Part 3 result (produced by the header line)
grep -v -e '^$'
Part 5: your actual request: match folder names by date in yyyymmdd format, e.g. 20200206 (replace 20200206 with the date you need)
grep '^bk_20200206.*'
Part 6: (important!) only add this part if you really want to delete the matching folders
xargs rm -f
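To make the cutoff dynamic rather than hard-coded, note that the names themselves embed the date as bk_YYYYMMDD, so you can compare against a cutoff computed with GNU date. A sketch against a saved copy of the listing (substitute `mega-ls path/ -l` for the heredoc; the awk program is an assumption based on the sample format):

```shell
# Sample listing from the question, standing in for `mega-ls path/ -l`
cat > listing.txt <<'EOF'
FLAGS VERS SIZE DATE NAME
d--- - - 06Feb2020 05:00:01 bk_20200206050000
d--- - - 07Feb2020 05:00:01 bk_20200207050000
d--- - - 11Feb2020 05:00:01 bk_20200211050000
EOF

cutoff=$(date -d '5 days ago' +%Y%m%d)   # GNU date

# Names embed the date as bk_YYYYMMDD...; print those older than the cutoff
awk -v c="$cutoff" '$NF ~ /^bk_/ && substr($NF, 4, 8) < c { print $NF }' listing.txt
```

Once the printed list looks right, pipe it into your delete command; MEGAcmd also ships a mega-rm command, so check whether `| xargs -r -n1 mega-rm` fits your setup better than local rm.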
Best Regards
I need to search for opening and closing HTML tags and print how many of each were found, but the second grep on the same file does not seem to work: it always shows 0 tags. If I move the second block above the first, it shows the right number of tags, but then the block that is now second shows 0.
./s.sh <my.html
TAG=$(grep -oP "<([^>\/]+)>" $1 | wc -l)
echo "<TAG> -" $TAG
CTAG=$(grep -oP "</([^>\/]+)>" $1 | wc -l)
echo "</TAG> -" $CTAG
I'm getting this output:
<TAG> - 13
</TAG> - 0
But should get something like this:
<TAG> - 13
</TAG> - 11
Input example:
<HTML>
<P>Список сотрудников
<TABLE BORDER=0>
<TR><TH>ФИО</TH><TH>Дата</TH></TR>
<TR><TD>Иванов И.И.</TD><TD>10.12.2019</TD></TR>
<TR><TD>Сидоров А.В.</TD><TD>11.11.1977</TD></TR>
</TABLE>
<P>Всего: 2 чел.
</HTML>
No need to escape the slash in the pattern, and you can omit the capturing group and the -P option. Note also that when the script is invoked as ./s.sh <my.html, $1 is empty, so both greps read from stdin; the first one consumes all of the input, which is why whichever block runs second reports 0. Pass the file name as an argument (./s.sh my.html) instead:
$ TAG=$(grep -o "<[^>/]*>" "$1" | wc -l)
$ echo "<TAG>: " $TAG
<TAG>: 13
$ CTAG=$(grep -o "</[^/>]*>" "$1" | wc -l)
$ echo "</TAG>: " $CTAG
</TAG>: 11
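A self-contained check of the two corrected patterns against a trimmed sample (the file name sample.html is a placeholder):

```shell
cat > sample.html <<'EOF'
<HTML>
<P>test
<TABLE BORDER=0>
<TR><TH>a</TH></TR>
</TABLE>
</HTML>
EOF

grep -o "<[^>/]*>" sample.html | wc -l    # opening tags: 5
grep -o "</[^/>]*>" sample.html | wc -l   # closing tags: 4
```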
I use this code and it only works for January. How can I make the sed script work for every month?
var='<table>\n<tr><th colspan="7">'
cal -h | sed '1{s|^|'"${var}"'|;s|$|</th></tr>|};2,${s|\(..\) |<td>\1</td>|g;s|^|<tr>|;s|$|</tr>|};$s|$|\n</table>|' >> file.html
Try with 3 months:
for m in {1..3}
do
cal -m $m -h | sed '1{s|^|'"${var}"'|;s|$|</th></tr>|};2,${s|\(..\) |<td>\1</td>|g;s|^|<tr>|;s|$|</tr>|};$s|$|\n</table>|'
done
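The sed transformation itself can be checked against a fixed sample, independently of which month cal emits (GNU sed is assumed, since the replacement text uses `\n`):

```shell
# Fixed stand-in for one month of `cal -h` output
cat > month.txt <<'EOF'
    January 2022
Su Mo Tu We Th Fr Sa
 2  3  4  5  6  7  8
EOF

var='<table>\n<tr><th colspan="7">'
# Line 1 becomes the table header; later lines become <tr>/<td> rows;
# the last line additionally gets </table> appended
sed '1{s|^|'"${var}"'|;s|$|</th></tr>|};2,${s|\(..\) |<td>\1</td>|g;s|^|<tr>|;s|$|</tr>|};$s|$|\n</table>|' month.txt
```

Note the source quirk that the last column of each row (here "Sa" and "8") is not wrapped in <td>, because the pattern requires a trailing space.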
I want to grep only the URL text (everything from http onward) on every line and write it to a file.
This is the current output from the output stream:
References
1. https://soundcloud.com/sc-opensearch.xml
2. https://m.soundcloud.com/search/sounds?q=L AME IMMORTELLE
3. https://soundcloud.com/
4. http://www.enable-javascript.com/
5. https://soundcloud.com/search
6. https://soundcloud.com/search/sounds
7. https://soundcloud.com/search/sets
8. https://soundcloud.com/search/people
9. https://soundcloud.com/search/groups
10. https://soundcloud.com/thomas-rainer/l-ame-immortelle-banish
11. https://soundcloud.com/outtamyndxmetal-llc/lame-immortelle-the-heart
12. https://soundcloud.com/cyberdelic-mind/l-me-immortelle-dark-mix-i
13. https://soundcloud.com/sawthinzarhtaik/dort-drauben
14. https://soundcloud.com/lagrima-negra/lagrima-tears-in-the-rain
15. https://soundcloud.com/bathony/in-strict-confidence-zauberschlos-lame-immortelle-version
16. https://soundcloud.com/jubej-thos/sirius-5-jahre-lame-immortelle
17. https://soundcloud.com/virul3nt/lamme-immortelle-sag-mir-wann-shiv-r-remix
18. https://soundcloud.com/outtamyndxmetal-llc/lame-immortelle-no-goodbye
19. https://soundcloud.com/usefulrage/das-ich-dem-ich-den-traum
20. http://help.soundcloud.com/customer/portal/articles/552882-the-site-won-t-load-for-me-all-i-see-is-the-soundcloud-logo-what-can-i-do-
21. http://google.com/chrome
22. http://firefox.com/
23. http://apple.com/safari
24. http://windows.microsoft.com/ie
25. http://help.soundcloud.com/
My code, which currently is not grepping anything, is below:
lynx --dump -listonly https://soundcloud.com/search/sounds?q=L%20AME%20IMMORTELLE | \
tr "\t\r\n'" ' "' | \
grep -i -o 'http......HERE I NEED THE GREP STUFF' | \
sed -e 's/^.*"\([^"]\+\)".*$/\1/g' \ >k.txt
You can use grep -E:
grep -i -oE 'https?://soundcloud\.com[^[:blank:]]*'
It worked with:
lynx --dump -listonly https://soundcloud.com/search/sounds?q=L%20AME%20IMMORTELLE | \
tr "\t\r\n'" ' "' | \
grep -i -oE 'https?://[^[:blank:]]+' | \
sed -e 's/^.*"\([^"]\+\)".*$/\1/g' \
>k.txt
I got the appropriate output:
https://soundcloud.com/sc-opensearch.xml
https://m.soundcloud.com/search/sounds?q=L
https://soundcloud.com/
http://www.enable-javascript.com/
https://soundcloud.com/search
https://soundcloud.com/search/sounds
https://soundcloud.com/search/sets
https://soundcloud.com/search/people
https://soundcloud.com/search/groups
https://soundcloud.com/thomas-rainer/l-ame-immortelle-banish
https://soundcloud.com/outtamyndxmetal-llc/lame-immortelle-the-heart
https://soundcloud.com/cyberdelic-mind/l-me-immortelle-dark-mix-i
https://soundcloud.com/sawthinzarhtaik/dort-drauben
https://soundcloud.com/lagrima-negra/lagrima-tears-in-the-rain
https://soundcloud.com/bathony/in-strict-confidence-zauberschlos-lame-immortelle-version
https://soundcloud.com/jubej-thos/sirius-5-jahre-lame-immortelle
https://soundcloud.com/virul3nt/lamme-immortelle-sag-mir-wann-shiv-r-remix
https://soundcloud.com/outtamyndxmetal-llc/lame-immortelle-no-goodbye
https://soundcloud.com/usefulrage/das-ich-dem-ich-den-traum
http://help.soundcloud.com/customer/portal/articles/552882-the-site-won-t-load-for-me-all-i-see-is-the-soundcloud-logo-what-can-i-do-
http://google.com/chrome
http://firefox.com/
http://apple.com/safari
http://windows.microsoft.com/ie
http://help.soundcloud.com/
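The extraction step can be reproduced offline against a fixed sample of the lynx output (the file names refs.txt and k.txt are placeholders):

```shell
# Two lines of the numbered reference list emitted by `lynx --dump -listonly`
cat > refs.txt <<'EOF'
 1. https://soundcloud.com/sc-opensearch.xml
 4. http://www.enable-javascript.com/
EOF

# Strip the numbering by matching only the URL itself
grep -i -oE 'https?://[^[:blank:]]+' refs.txt > k.txt
cat k.txt
```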