How to get remote file size from a shell script? - shell

Is there a way to get the size of a remote file like
http://api.twitter.com/1/statuses/public_timeline.json
in shell script?

You can download the file and get its size. But we can do better.
Use curl to get only the response header using the -I option.
In the response header look for Content-Length: which will be followed by the size of the file in bytes.
$ URL="http://api.twitter.com/1/statuses/public_timeline.json"
$ curl -sI "$URL" | grep -i Content-Length
Content-Length: 134
To get the size, use a filter to extract the numeric part from the output above:
$ curl -sI "$URL" | grep -i Content-Length | awk '{print $2}'
134
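HTTP header lines end in \r\n, so the value printed by awk above still carries a trailing carriage return, and header names may arrive in any letter case. A minimal sketch of a filter that handles both and keeps only the last Content-Length it sees (the function name content_length is made up for illustration):

```shell
# Reads raw HTTP response headers on stdin and prints the value of the
# last Content-Length header, without the trailing carriage return.
# Prints nothing if no such header is present.
content_length() {
    tr -d '\r' |
    awk 'tolower($1) == "content-length:" { len = $2 }
         END { if (len != "") print len }'
}
```

It would be used as: curl -sI "$URL" | content_length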

Two caveats to the other answers:
Some servers don't return the correct Content-Length for a HEAD request, so you might need to do the full download.
You'll likely get an unrealistically large response (compared to a modern browser) unless you specify gzip/deflate headers.
Also, you can do this without grep/awk or piping:
curl 'http://api.twitter.com/1/statuses/public_timeline.json' --location --silent --write-out 'size_download=%{size_download}\n' --output /dev/null
And the same request with compression:
curl 'http://api.twitter.com/1/statuses/public_timeline.json' --location --silent -H 'Accept-Encoding: gzip,deflate' --write-out 'size_download=%{size_download}\n' --output /dev/null

Similar to codaddict's answer, but without the call to grep:
curl -sI http://api.twitter.com/1/statuses/public_timeline.json | awk '/Content-Length/ { print $2 }'

The preceding answers won't work when there are redirections. For example, to get the size of the Debian ISO DVD you must use the --location option; otherwise the reported size may be that of the 302 Moved Temporarily response body, not that of the real file.
Suppose you have the following url:
$ url=http://cdimage.debian.org/debian-cd/8.1.0/amd64/iso-dvd/debian-8.1.0-amd64-DVD-1.iso
With curl, you could obtain:
$ curl --head --location ${url}
HTTP/1.0 302 Moved Temporarily
...
Content-Type: text/html; charset=iso-8859-1
...
HTTP/1.0 200 OK
...
Content-Length: 3994091520
...
Content-Type: application/x-iso9660-image
...
That's why I prefer using HEAD, which is an alias for the lwp-request command from the libwww-perl package (on Debian). Another advantage is that it strips the extra \r characters, which eases subsequent string processing.
So to retrieve the size of the debian iso DVD, one could do for example:
$ size=$(HEAD ${url})
$ size=${size##*Content-Length: }
$ size=${size%%[[:space:]]*}
Please note that:
this method launches only one external process
the ${var##pattern} and ${var%%pattern} expansions used here are specified by POSIX, so this should work in bash and in other POSIX shells such as dash
For shells that lack these expansions, you may have to resort to sed, awk, grep et al.
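As a sketch, the same two expansions can be wrapped in a function that also reports when the header is missing. The function name length_from_headers is an assumption for illustration, and the header text is passed in as an argument so the function itself starts no extra process:

```shell
# Prints the value that follows the last "Content-Length: " in the string $1.
# Returns 1 if the string contains no such header.
length_from_headers() {
    size=${1##*Content-Length: }
    # If nothing was stripped, the header was not present.
    [ "$size" = "$1" ] && return 1
    # Drop everything from the first whitespace character (including \r) on.
    printf '%s\n' "${size%%[[:space:]]*}"
}
```

For example: length_from_headers "$(HEAD ${url})"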

I think the easiest way to do this would be to:
run cURL in silent mode with -s,
pull only the headers with -I (so as to avoid downloading the whole file),
then do a case-insensitive grep with -i,
and print the second field using awk's $2.
The output is in bytes.
Examples:
curl -sI http://api.twitter.com/1/statuses/public_timeline.json | grep -i content-length | awk '{print $2}'
//output: 52
or
curl -sI https://code.jquery.com/jquery-3.1.1.min.js | grep -i content-length | awk '{print $2}'
//output: 86709
or
curl -sI http://download.thinkbroadband.com/1GB.zip | grep -i content-length | awk '{print $2}'
//output: 1073741824
Show as Kilobytes/Megabytes
If you would like to show the size in Kilobytes then change the awk to:
awk '{print $2/1024}'
or Megabytes
awk '{print $2/1024/1024}'
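Instead of picking the divisor by hand, the unit can be chosen from the size itself. A small sketch, assuming binary units (1 KiB = 1024 bytes) and one decimal place; the function name human_size is made up here:

```shell
# Prints a byte count ($1) using the largest binary unit that keeps the
# number at or above 1, e.g. 86709 -> "84.7 KiB".
human_size() {
    awk -v b="$1" 'BEGIN {
        split("B KiB MiB GiB TiB", unit, " ")
        i = 1
        while (b >= 1024 && i < 5) { b /= 1024; i++ }
        if (i == 1) printf "%d %s\n", b, unit[i]
        else        printf "%.1f %s\n", b, unit[i]
    }'
}
```

It takes the byte count as its argument, for example: human_size 1073741824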

The accepted solution was not working for me, but this is (note that it downloads the whole file):
curl -s https://code.jquery.com/jquery-3.1.1.min.js | wc -c

I have a shell function, based on codaddict's answer, which gives a remote file's size in a human-readable format:
remote_file_size () {
    # numfmt is part of the GNU coreutils. On systems that lack them by default
    # (BSD, Mac), install the GNU coreutils and call it as gnumfmt, as below;
    # on Linux, remove the leading `g'.
    printf "%q" "$*" |
    xargs curl -sI |
    grep -i Content-Length |
    awk '{print $2}' |
    tr -d '\040\011\012\015' |
    gnumfmt --to=iec-i --suffix=B
}

This will show detailed information about the ongoing download; you just need to specify a URL, as in the example below.
$ curl -O -w 'We downloaded %{size_download} bytes\n' https://cmake.org/files/v3.8/cmake-3.8.2.tar.gz
output
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 7328k  100 7328k    0     0   244k      0  0:00:29  0:00:29 --:--:--  365k
We downloaded 7504706 bytes
For automated purposes you'll just need to add the command to your
script file.

Combining all of the above, this works for me:
URL="http://cdimage.debian.org/debian-cd/current/i386/iso-dvd/debian-9.5.0-i386-DVD-1.iso"
curl --head --silent --location "$URL" | grep -i "content-length:" | tr -d " \t" | cut -d ':' -f 2
This will return just the content length in bytes:
3767500800

You can kinda do it like this, including auto-following 301/302 redirections :
curl -ILs 'https://twitter.com/i/csp_report?a=ORTGK%3D%3D%3D&ro=fals' |
mawk 'NF*=!_<NF' \
OFS= FS='^[Cc][Oo][Nn][Tt][Ee][Nn][Tt]-[Ll][Ee][Nn][Gg][Tt][Hh]: '
1 41
It's very brute force but gets the job done. Note that it prints whatever raw value the server reports, so you may have to adjust it as you see fit.
The -L flag also follows the switchover from plain http to https. The -g flag in the next example only turns off curl's URL globbing, so it is optional here:
curl -gILs 'http://apple.com' |
mawk 'NF *= !_<NF' OFS= \
FS='^[Cc][Oo][Nn][Tt][Ee][Nn][Tt]-[Ll][Ee][Nn][Gg][Tt][Hh]: '
1 304
2 106049
(I'm guessing the second item is the main site, and the first item was the redirection page.)

The question is old and has been sufficiently answered, but let me expand upon the existing answers. If you want to automate this task (for checking the sizes of multiple files), here's a one-liner.
First write the URLs of the files in a file:
cat url_of_files.txt
https://stpubdata-jwst.stsci.edu/ero/jw02734/jw02734002001/jw02734002001_04101_00001-seg002_nis_x1dints.fits
https://stpubdata-jwst.stsci.edu/ero/jw02734/jw02734002001/jw02734002001_04101_00001-seg003_nis_calints.fits
https://stpubdata-jwst.stsci.edu/ero/jw02734/jw02734002001/jw02734002001_04102_00001-seg001_nis_calints.fits
https://stpubdata-jwst.stsci.edu/ero/jw02734/jw02734002001/jw02734002001_02101_00002-seg001_nis_cal.fits
...
then from the command line (from the same directory as your url_of_files.txt):
eval $(sed -rn '/^https/s/(https.*$)/curl -sI \1/p' url_of_files.txt) | awk '/[Cc]ontent-[Ll]ength/{kb=$2/1024;mb=kb/1024;gb=mb/1024;print ( $2>1024 ? ( kb>1024 ? ( mb>1024 ? gb " G" : mb " M") : kb " K" ) : $2 " B" ) }'
This is for checking file sizes ranging from bytes to GB. I use this line to check the FITS data files being made available by the JWST team.
It checks the file size and, depending on its size, roughly converts it to an appropriate number with a B, K, M, or G suffix denoting bytes, kilobytes, megabytes, or gigabytes.
result:
...
177.188 K
177.188 K
236.429 M
177.188 K
5.95184 M
1.83608 G
1.20326 G
130.059 M
1.20326 G
...
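The same job can also be done without eval, reading the file one URL at a time. This is only a sketch: fetch_headers and list_sizes are hypothetical names, and it prints raw byte counts next to each URL rather than converting units:

```shell
# Hypothetical helper: fetch only the response headers, following redirects.
fetch_headers() {
    curl -sIL "$1"
}

# Prints "<bytes> <url>" for every https URL listed in the file $1.
# Lines that do not start with https:// are skipped.
list_sizes() {
    while IFS= read -r url; do
        case $url in
            https://*) ;;
            *) continue ;;
        esac
        # Keep the last Content-Length seen, with the trailing \r removed.
        size=$(fetch_headers "$url" | tr -d '\r' |
            awk 'tolower($1) == "content-length:" { len = $2 } END { print len }')
        printf '%s %s\n' "${size:-unknown}" "$url"
    done < "$1"
}
```

It would be called as: list_sizes url_of_files.txt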

My solution uses awk's END block to print only the last Content-Length:
function curl2contentlength() {
    curl -sI -L -H 'Accept-Encoding: gzip,deflate' "$1" | grep -i Content-Length | awk 'END{print $2}'
}
curl2contentlength "$1"
./curl2contentlength.sh "https://chrt.fm/track/B63133/stitcher.simplecastaudio.com/ec74d48c-cbf1-4764-923e-7d584dce50fa/episodes/a85954a3-24c3-48ed-bced-ef0607b7149a/audio/128/default.mp3?aid=rss_feed&awCollectionId=ec74d48c-cbf1-4764-923e-7d584dce50fa&awEpisodeId=a85954a3-24c3-48ed-bced-ef0607b7149a&feed=qm_9xx0g"
10806508
Without the END block, the output would have been
0
0
10806508

I match on [Cc]ontent-[Ll]ength: (including the trailing colon) because my server's response mentions Content-Length in more than one header:
curl -sI "http://someserver.com/hls/125454.ts" | grep '[Cc]ontent-[Ll]ength:' | awk '{ print $2 }'
Accept-Ranges: bytes
Access-Control-Expose-Headers: Date, Server, Content-Type, Content-Length
Server: WowzaStreamingEngine/4.5.0
Cache-Control: no-cache
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
Access-Control-Allow-Methods: OPTIONS, GET, POST, HEAD
Access-Control-Allow-Headers: Content-Type, User-Agent, If-Modified-Since, Cache-Control, Range
Date: Tue, 10 Jan 2017 01:56:08 GMT
Content-Type: video/MP2T
Content-Length: 666460

A different solution, if you have SSH access to the remote host:
ssh userName@IP ls -s PATH | grep FILENAME | awk '{print $1}'
This gives you the size in blocks (1 KB each with GNU ls by default).

Related

How to combine multiple awk command and to print then with space

I'm capturing URL content using cURL which gives output in HTML format. Using awk I'm capturing sensor name and its status.
(curl <MY URL> | awk -F"Sensor<\/th><td>" '{print $2}' | awk -F"<\/td></tr>" '{print $1}'; \
curl <my URL> | awk -F"Status<\/th><td><strong>" '{print $2}' | awk -F"<\/strong>" '{printf $1}' \
) | tr -d '\n' >> output
cURL input like,
<html><head><title>Sensor status for NumberOfThreadsSensor-NumberOfThreads</title></head><body>
<h1>Sensor status for NumberOfThreadsSensor-NumberOfThreads</h1>
<table>
<tr><th>Plugin</th><td>NumberOfThreadsSensor</td></tr><tr><th>Sensor</th><td>NumberOfThreads</td></tr><tr><th>Status</th><td>Ok</td></tr><tr><th>Created</th><td>Fri Aug 14 09:03:10 UTC 2020 (13 seconds ago)</td></tr><tr><th>TTL</th><td>30 seconds</td></tr><tr><th>Short message</th><td>1;14;28</td></tr><tr><th>Long message</th><td>1 [interval: 1 min];14 [interval: 30 min];28 [interval: 60 min]</td></tr></table>
<h2>Formats</h2><p>The status shown on this page is also available in the following machine-friendly formats:</p>
<ul>
<li>A simple status string, Possible values: OK, WARNING, CRITICAL, UNKNOWN.</li>
<li>Nagios plugin output, output formatted for easy integration with Nagios.</li>
<li>Full xml all available data in xml for easy parsing by ad-hoc monitoring tools.</li>
<li>Prometheus output, all available data in prometheus format</li>
</ul>
<p>Please do not rely on the output of this page for automated monitoring, use one of the formats above.</p>
</body></html>
Current output ScoreProcessorWarning
expected output ScoreProcessor Warning
Please help me simplify my shell script; I'm still learning. Thanks for the help.
With the input presented saved in /tmp/input.txt:
<h1>Sensor status for EventProcessorStatus-ScoreProcessor</h1>
<table>
<tr><th>Plugin</th><td>EventProcessorStatus</td></tr><tr><th>Sensor</th><td>ScoreProcessor</td></tr><tr><th>Status</th><td><strong>Warning</strong></td></tr><tr><th>Created</th><td>Fri Aug 10 00:16:23 UTC 2020 (0 seconds ago)</td></tr><tr><th>TTL</th><td>30 seconds</td></tr><tr><th>Short message</th><td>Endpoint is running, but has errors</td></tr><tr><th>Long message</th><td>Endpoint is running, but has errors<br/>
Number of errors in background process (xxxx) logs: 4<br/>
</td></tr></table>
<h2>Performance data</h2><table>
With my very limited knowledge of xmllint I ended up with:
# Extract only the tables, get the text from all tables
xmllint --html --xpath '//table//tr//text()' /tmp/input.txt |
# Each row has two cells (name and value), so join each pair of lines together
sed 'N;s/\n/\t/' |
# Filter Sensor and status only
sed -n '/Sensor\t/{s///;h}; /Status\t/{s///;x;G;p}' |
# Read the sensor and status to bash
{ IFS= read -r name; IFS= read -r status; echo "name=$name status=$status" ;}
which outputs:
name=ScoreProcessor status=Warning

zabbix_get and AWK/substr - ZBX_NOTSUPPORTED: Too many parameters

I'm trying to use Zabbix to gather information about 500 errors in Apache logs. I thought, I will use zabbix_get to execute command on Agent's OS, however when trying to test that, I get ZBX_NOTSUPPORTED: Too many parameters. error. At first, I tried to escape all " and other special characters, however without any success. When trying to debug the command (which works well locally, without using zabbix_get) I found that the issue is probably somewhere close to AWK and substr. I would appreciate if anyone could take a look...
tail -n 1000 /content/logs/httpd/*_access.log | awk -v d1="$(date --date='-5 min' '+%_d/%b/%Y:%H:%M:%S')" -v d2="$(date '+%_d/%b/%Y:%H:%M:%S')" 'substr($5,2) > d1 && substr($5,2) < d2 || substr($5,2) ~ d2' | cut -d" " -f10 | grep "500" | wc -l
Sample Input:
IP IP - - [21/Jan/2019:03:14:06 -0500] "GET /path HTTP/1.1" 200 2068 referrer "UserAgent" 7634.
IP IP - - [21/Jan/2019:03:14:06 -0500] "GET /path HTTP/1.1" 500 1 "-" "UserAgent" 1892
Output:
1
I would like to execute command listed above every 5 minutes to count number of entries with 500 error which happened during previous 5 minutes.
I know that I can always create a script but if I could avoid doing that...
Thanks!
Instead of using zabbix_get, you could create a system.run[...] item.
However, you should use the native log file monitoring feature: use the logrt.count or log.count items to extract the number of matching lines.
Assuming your date comparison is correct (in fact it is not: the timestamps contain the month name, so comparing them as strings does not give chronological order),
try this:
tail -n 1000 /content/logs/httpd/*_access.log | awk -v d1="$(date --date='-5 min' '+%_d/%b/%Y:%H:%M:%S')" -v d2="$(date '+%_d/%b/%Y:%H:%M:%S')" '{Ti=substr($5,2)};( Ti > d1 && Ti <= d2 ) && $10 ~ /5[0-9][0-9]/ {Cnt++;print $10} END{print Cnt+0}'
You can also skip the tail and use awk directly; it's fast enough with a few changes:
awk -v d1="$(date --date='-5 min' '+%_d/%b/%Y:%H:%M:%S')" -v d2="$(date '+%_d/%b/%Y:%H:%M:%S')" '($5 < "[" d1){next};{Ti=substr($5,2)};( Ti > d1 && Ti <= d2 ) && $10 ~ /5[0-9][0-9]/ {Cnt++;print $10} END{print Cnt+0}' /content/logs/httpd/*_access.log
You can also limit the *_access.log files with $(find /content/logs/httpd -name '*_access.log' -mmin -6).
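One way around the month-name problem is to rebuild each timestamp as a sortable YYYYMMDDhhmmss key before comparing. The sketch below assumes the log layout from the sample input (timestamp in field 5, status in field 10) and English month abbreviations; the function name count_5xx_between is made up for illustration:

```shell
# Reads access-log lines on stdin and counts 5xx responses whose timestamp
# key is after $1 and not after $2 (both given as YYYYMMDDhhmmss).
count_5xx_between() {
    awk -v from="$1" -v to="$2" '
        BEGIN {
            split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m, " ")
            for (i = 1; i <= 12; i++) mon[m[i]] = sprintf("%02d", i)
        }
        {
            # $5 looks like [21/Jan/2019:03:14:06
            split(substr($5, 2), t, "[/:]")
            key = t[3] mon[t[2]] t[1] t[4] t[5] t[6]
        }
        key > from && key <= to && $10 ~ /^5[0-9][0-9]$/ { n++ }
        END { print n + 0 }
    '
}
```

With GNU date, the two bounds could be produced by date --date='-5 min' '+%Y%m%d%H%M%S' and date '+%Y%m%d%H%M%S'.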

how to pass variable value in POST JSON data in CURL command?

There are two fields which must be unique, so I wrote commands that generate random values every time the POST request is built, store those values in variables, and pass those variables on the curl command line. The script is below.
rcontactno=$(cat /dev/urandom | tr -dc '0-9' | fold -w 10 | head -n 1)
rfirstname=$(cat /dev/urandom | tr -dc 'a-zA-Z' | fold -w 10 | head -n 1)
echo $rcontactno and $rfirstname
STATUS=$(curl -v -X POST -d '{"userName":"$rfirstname","contactNo":$rcontactno}' $1/restaurants/53/managers --header "Content-Type:application/json" --header "Accept:application/json" | grep HTTP | cut -d ' ' -f2 )
#Passing the URL using command-line argument
if [[ STATUS -eq 201 ]]; then
echo “Success”
exit 0
else
echo “Failed”
exit 127
fi
Then I execute the script with
bash manager-post.sh
and I get this type of error:
> POST /restaurants/53/managers HTTP/1.1
> User-Agent: curl/7.37.1
> Host: my-url
> Content-Type:application/json
> Accept:application/json
> Content-Length: 54
>
} [data not shown]
* upload completely sent off: 54 out of 54 bytes
< HTTP/1.1 400 Bad Request
* Server Apache-Coyote/1.1 is not blacklisted
< Server: Apache-Coyote/1.1
< Content-Type: application/json;charset=UTF-8
< Transfer-Encoding: chunked
< Date: Fri, 16 Jan 2015 08:59:01 GMT
< Connection: close
<
{ [data not shown]
* Closing connection 0
“Failed”
But when I run the curl command outside the bash script and explicitly give the values of userName and contactNo, it executes successfully.
Where am I making a mistake?
The single quotes don't allow variable expansion in the shell, so you need to use double quotes instead. Then you need to subsequently escape the double-quotes you want to send as-is.
A useful debugging technique is to add --trace-ascii dump.txt to your command line and inspect dump.txt after invoking curl to see that it matches what you intended to send.
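A minimal sketch of the quoting fix, with the JSON body built in a double-quoted string and the inner double quotes escaped. The function name make_payload is made up; it assumes the values contain no quotes or backslashes, which holds for the random letters and digits generated in the question:

```shell
# Builds the JSON body from a user name ($1) and a contact number ($2).
# The outer double quotes let the shell expand $1 and $2; the escaped
# inner double quotes are sent to the server as-is.
make_payload() {
    payload="{\"userName\":\"$1\",\"contactNo\":$2}"
    printf '%s\n' "$payload"
}
```

It could then be passed to curl as -d "$(make_payload "$rfirstname" "$rcontactno")". One thing to watch for: a random contact number that starts with 0 is not a valid JSON number, so the server may still reject some requests unless the value is quoted as a string.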

BASH grep with multiple parameters + n lines after one of the matches

I have a bunch of text as the output of a command, and I need to display only specific matching lines plus some additional lines after the match "Message:" (the message text is obviously longer than one line).
What I tried was:
grep -e 'Subject:' -e 'Date:' -A50 -e 'Message:'
but it included 50 lines after EACH match, and I need -A50 to apply only to a single pattern. How would I do that?
code with output command:
(<...> | telnet <mailserver> 110 | grep -e 'Subject:' -e 'Date:' -A50 -e 'Message:'
Part of the telnet output:
Date: Tue, 10 Sep 2013 16
Message-ID: <00fb01ceae25$
MIME-Version: 1.0
Content-Type: multipart/alternative;
boundary="----=_NextPart_000_00FC_01CEAE3E.DE32CE40"
X-Mailer: Microsoft Office Outlook 12.0
Thread-Index: Ac6uJWYdA3lUzs1cT8....
Content-Language: lt
X-Mailman-Approved-At: Tue, 10 Sep 2013 16:0 ....
Subject: ...
X-BeenThere: ...
Precedence: list
Try the following:
... | telnet ... > <file>
grep -e 'Subject:' -e 'Date:' <file> && grep -A50 -e 'Message:' <file>
You will need to dump the output to a file first.
This can be done with awk as well, without the need for dumping output to a file.
... | telnet ... | awk '/Date:/ {print}; /Subject:/ {print}; /Message:/ {c=50} c && c--'
With grep it would be hard to do. Better to use awk for this:
awk '/Subject:|Date:/;/Message:/ {while(l<=50){print $0;l++;getline}}'
Here the awk prints 50 lines below the Message: pattern and only one line is printed for all other patterns.
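A counter-based variant of the same idea, sketched with the patterns anchored to the start of the line so that headers such as X-Mailman-Approved-At are not picked up by accident. The function name filter_mail and the configurable line count are assumptions for illustration:

```shell
# Reads text on stdin. Prints every Subject: and Date: line, plus each
# Message: line together with the n lines that follow it (n = $1, default 50).
filter_mail() {
    awk -v n="${1:-50}" '
        /^(Subject|Date):/ { print; next }
        /^Message:/        { c = n + 1 }   # the matching line plus n more
        c > 0              { print; c-- }
    '
}
```

It would sit at the end of the pipeline in place of grep, for example: ... | telnet ... | filter_mail 50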

syntax error: operand expected (error token is ">= 75 ")

#!/bin/bash
CURRENT=$(df -h / | grep / | awk '{ print $4}')
THRESHOLD=75
if (( "$CURRENT" >= "$THRESHOLD" )); then
mail -s "CENTOS-6 localhost 10.10.1.238 Disk Space Alert" sss@abc.net << EOF
Your root partition remaining free space is critically low. Used: $CURRENT%
EOF
fi
I got the following error when I ran the script: syntax error: operand expected (error token is ">= 75 ")
It's because CURRENT will contain a percent sign, so it won't be a valid operand for the comparison operation.
You can remove the last character like this :
CURRENT=${CURRENT%?};
Also make sure that df -h / | grep / | awk '{ print $4}' is correctly returning the usage ratio, on most systems you have to use print $5.
A couple of things:
you don't need grep at all, awk is quite capable of doing its own regex stuff.
if you search for / in the df output, you'll probably get most lines as most mounts have a / somewhere in them. If you just want the root mountpoint, you can use <space>/$.
Check that 4 is the correct field number, on my box it's 5.
In any case, that field is of the form 55% which will not be considered numeric. You can use gsub to get rid of it.
With that in mind, the following snippet can be used to get the percentage:
df -h | awk '$0 ~ / \/$/ { gsub("%","",$5); print $5 }'
And, just as an aside, I'm not that big a fan of here-docs in shell scripts since it either (1) screws up my nicely indented files; or (2) makes me burn half an hour while I try to remember the various syntax options which will allow indented EOF strings :-)
I prefer something like:
(
echo Your root partition remaining free space is critically low: Used: ${CURRENT}%
) | mail -s "CENTOS-6 localhost 10.10.1.238 Disk Space Alert" sss@abc.net
Especially since that means I can put arbitrarily complex commands in the sub-shell to generate whatever info I want in the mail message (rather than just simple text substitutions).
So, bottom line, I'd be looking at something more like:
#!/usr/bin/env bash
# Config section.
LIMIT=75
# Code section.
CURR=$(df -h | awk '$0 ~ / \/$/ { gsub("%","",$5); print $5 }')
if [[ ${CURR} -ge ${LIMIT} ]] ; then
(
echo "Your root partition remaining free space is critically low: Used: ${CURR}%"
) | mail -s "CENTOS-6 localhost 10.10.1.238 Disk Space Alert" sss@abc.net
fi
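Since the field number differs between systems, one option is to ask df for its POSIX output format with -P, which keeps each filesystem on one line with the usage percentage in the fifth column. A small sketch of the parsing step, with a made-up function name usage_percent:

```shell
# Reads the output of `df -P <mountpoint>` on stdin and prints the usage
# percentage of the first filesystem listed, without the % sign.
usage_percent() {
    awk 'NR == 2 { sub(/%$/, "", $5); print $5 }'
}
```

The result of df -P / | usage_percent is a plain integer, so it can be compared directly with -ge.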
Just try:
CURRENT=$(df -h |awk '{print $4}' |sort -n |tail -n1 |sed 's/%//g')
THRESHOLD=90
if [ $THRESHOLD -gt $CURRENT ]
