Download the files that are bigger than a certain size - bash

I have Files.txt, which consists of a list of URLs of Excel files:
http://www.bcheadset.ca/applicant.xlsx
http://www.bcheadset.ca/fte.xlsx
http://www.iutox.org/TRTF_Matrix2012_Oct.xlsx
http://www.journalprices.com/2013_FinalSummaryForWeb.xlsx
http://www.camts.org/7__2013_Aircraft_Checklist.xlsx
http://www.nanotr11.org/poster_program.xlsx
http://www.vliruos.be/media/6352100/nss2015_annex_3_budget.xlsx
...
What I want to do is to download (with wget) only the files that are bigger than 10 MB.
To verify whether a file is bigger than 10 MB before downloading it, we could use curl --head. For instance, curl --head http://www.bcheadset.ca/fte.xlsx returns:
HTTP/1.1 200 OK
Content-Length: 3158847
Content-Type: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Last-Modified: Mon, 27 Jul 2015 22:16:45 GMT
Accept-Ranges: bytes
ETag: "ffb49fecb9c8d01:c05"
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
Date: Tue, 06 Oct 2015 15:14:27 GMT
And from Content-Length, we can check whether it is bigger than 10 MB (here, at about 3 MB, it is not).
My question is how to automate this with a piece of Bash or Perl. In particular, I don't know how to extract the Content-Length and do the comparison...
Could anyone help?

You can use Perl with LWP to send a HEAD request and fetch the response headers. Something like this:
#!/usr/bin/env perl
use strict;
use warnings;
use LWP;    # brings in LWP::UserAgent and HTTP::Request

# Send a HEAD request and read the Content-Length response header.
my $request  = HTTP::Request->new( HEAD => 'http://www.google.com' );
my $agent    = LWP::UserAgent->new;
my $response = $agent->request($request);
print $response->header('Content-Length');
And then use LWP to do a GET instead of a HEAD when you actually want to fetch the file.

In order to adapt to the various forms HTTP headers can take, some type of pattern matching may be prudent. Here is a rough example of how to use glob matching in a shell case statement to accomplish this:
#!/bin/sh
BUFFER=$(curl --head http://www.bcheadset.ca/fte.xlsx)
HOLD=""
TAKENEXT=0

# Walk the header output word by word; the word following "Content-Length:" is the size.
for i in $BUFFER
do
    case "$i" in
        Content-Length:)
            TAKENEXT=1
            ;;
        *)
            if [ "$TAKENEXT" -eq 1 ]
            then
                HOLD="$i"
                break
            fi
            ;;
    esac
done
printf "SIZE was: %s\n" "$HOLD"
Here is a simpler method using grep:
SIZE=$(curl --head http://www.bcheadset.ca/fte.xlsx | grep 'Content-Length:' | awk '{print $2}')
printf "%s\n" "$SIZE"

Related

how to pass variable value in POST JSON data in CURL command?

There are two fields which must be unique, so I wrote a command to generate a random value every time the POST request is built, stored those values in variables, and passed those variables on the curl command line. The script is below.
rcontactno=$(cat /dev/urandom | tr -dc '0-9' | fold -w 10 | head -n 1)
rfirstname=$(cat /dev/urandom | tr -dc 'a-zA-Z' | fold -w 10 | head -n 1)
echo $rcontactno and $rfirstname
STATUS=$(curl -v -X POST -d '{"userName":"$rfirstname","contactNo":$rcontactno}' $1/restaurants/53/managers --header "Content-Type:application/json" --header "Accept:application/json" | grep HTTP | cut -d ' ' -f2 )
#Passing the URL using command-line argument
if [[ STATUS -eq 201 ]]; then
echo “Success”
exit 0
else
echo “Failed”
exit 127
fi
Then I execute the script with
bash manager-post.sh
and I get this kind of error:
> POST /restaurants/53/managers HTTP/1.1
> User-Agent: curl/7.37.1
> Host: my-url
> Content-Type:application/json
> Accept:application/json
> Content-Length: 54
>
} [data not shown]
* upload completely sent off: 54 out of 54 bytes
< HTTP/1.1 400 Bad Request
* Server Apache-Coyote/1.1 is not blacklisted
< Server: Apache-Coyote/1.1
< Content-Type: application/json;charset=UTF-8
< Transfer-Encoding: chunked
< Date: Fri, 16 Jan 2015 08:59:01 GMT
< Connection: close
<
{ [data not shown]
* Closing connection 0
“Failed”
But when I run the curl command outside the bash script and spell out the values of userName and contactNo explicitly, it executes successfully.
Where am I making a mistake?
The single quotes don't allow variable expansion in the shell, so you need to use double quotes instead, and then escape the inner double quotes that you want to send as-is.
A useful debugging technique is to add --trace-ascii dump.txt to your command line and inspect dump.txt after invoking curl to see that it matches what you intended to send.
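For instance, the -d argument from the question could be rewritten along these lines (a sketch only, keeping the question's variable names; contactNo is left unquoted since it is numeric):
STATUS=$(curl -v -X POST \
    -d "{\"userName\":\"$rfirstname\",\"contactNo\":$rcontactno}" \
    --header "Content-Type:application/json" \
    --header "Accept:application/json" \
    "$1/restaurants/53/managers" | grep HTTP | cut -d ' ' -f2)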

BASH grep with multiple parameters + n lines after one of the matches

I have a bunch of text as the output from a command. I need to display only specific matching lines, plus some additional lines after the "Message" match (the message text is obviously longer than one line).
What I tried was:
grep -e 'Subject:' -e 'Date:' -A50 -e 'Message:'
but it included 50 lines after EACH match, and I need that to apply to only a single pattern. How would I do that?
The command with the output:
<...> | telnet <mailserver> 110 | grep -e 'Subject:' -e 'Date:' -A50 -e 'Message:'
Part of the telnet output:
Date: Tue, 10 Sep 2013 16
Message-ID: <00fb01ceae25$
MIME-Version: 1.0
Content-Type: multipart/alternative;
boundary="----=_NextPart_000_00FC_01CEAE3E.DE32CE40"
X-Mailer: Microsoft Office Outlook 12.0
Thread-Index: Ac6uJWYdA3lUzs1cT8....
Content-Language: lt
X-Mailman-Approved-At: Tue, 10 Sep 2013 16:0 ....
Subject: ...
X-BeenThere: ...
Precedence: list
Try the following:
... | telnet ... > <file>
grep -e 'Subject:' -e 'Date:' <file> && grep -A50 -e 'Message:' <file>
You will need to dump the output to a file first.
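For concreteness, a rough sketch of that two-pass approach with a temporary file (the mail-server part stays elided, as in the question):
tmp=$(mktemp)
<...> | telnet <mailserver> 110 > "$tmp"    # dump the session output once
grep -e 'Subject:' -e 'Date:' "$tmp"        # single-line matches
grep -A50 -e 'Message:' "$tmp"              # 50 lines of context only after Message:
rm -f "$tmp"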
This can be done with awk as well, without the need for dumping output to a file.
... | telnet ... | awk '/Date:/ {print}; /Subject:/ {print}; /Message:/ {c=50} c && c--'
With grep this would be hard to do. Better to use awk for this:
awk '/Subject:|Date:/;/Message:/ {while(l<=50){print $0;l++;getline}}'
Here awk prints the Message: line plus the 50 lines below it, while only a single line is printed for each of the other patterns.

Merging CSV files : Appending instead of merging

So basically I want to merge a couple of CSV files. I'm using the following command to do that:
paste -d , *.csv > final.txt
This has worked for me in the past, but this time it doesn't. It puts the data next to each other instead of below each other. For instance, two files that contain records in the following format
CreatedAt ID
Mon Jul 07 20:43:47 +0000 2014 4.86249E+17
Mon Jul 07 19:58:29 +0000 2014 4.86238E+17
Mon Jul 07 19:42:33 +0000 2014 4.86234E+17
When merged give
CreatedAt ID CreatedAt ID
Mon Jul 07 20:43:47 +0000 2014 4.86249E+17 Mon Jul 07 18:25:53 +0000 2014 4.86215E+17
Mon Jul 07 19:58:29 +0000 2014 4.86238E+17 Mon Jul 07 17:19:18 +0000 2014 4.86198E+17
Mon Jul 07 19:42:33 +0000 2014 4.86234E+17 Mon Jul 07 15:45:13 +0000 2014 4.86174E+17
Mon Jul 07 15:34:13 +0000 2014 4.86176E+17
Would anyone know the reason behind this, or what I can do to force the records to be merged below each other?
Assuming that all the CSV files have the same format and all start with the same header, you can write a little script like the following to append all the files into one and keep the header only once.
#!/bin/bash
OutFileName="X.csv"                     # Fix the output name
i=0                                     # Reset a counter
for filename in ./*.csv; do
    if [ "$filename" != "$OutFileName" ]; then      # Avoid recursion
        if [[ $i -eq 0 ]]; then
            head -1 "$filename" > "$OutFileName"    # Copy the header if it is the first file
        fi
        tail -n +2 "$filename" >> "$OutFileName"    # Append from the 2nd line of each file
        i=$(( i + 1 ))                              # Increase the counter
    fi
done
Notes:
The head -1 or head -n 1 command prints the first line of a file (the head).
tail -n +2 prints the tail of a file starting from line number 2 (+2).
The test [ ... ] is used to exclude the output file from the input list.
The output file is rewritten each time.
The command cat a.csv b.csv > X.csv can simply be used to append a.csv and b.csv into a single file (but then the header is copied twice).
The paste command pastes the files side by side, one next to the other. If a file has lines of whitespace, you can get the output you reported above.
The -d , option asks paste to separate fields with a comma, but that is not the format of the files you reported above.
The cat command instead concatenates files and prints them to standard output, which means it writes one file after the other.
Refer to man head or man tail for the syntax of the individual options (some versions allow head -1, others require head -n 1).
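To make the paste-versus-cat difference concrete, here is a tiny hypothetical example with two one-record files:
printf 'CreatedAt,ID\na,1\n' > a.csv
printf 'CreatedAt,ID\nb,2\n' > b.csv

paste -d , a.csv b.csv              # side by side: "CreatedAt,ID,CreatedAt,ID" then "a,1,b,2"
cat a.csv b.csv                     # one after the other, header repeated
tail -n +2 b.csv | cat a.csv -      # appended below, header kept only once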
An alternative simple answer: save this as combine_csv.sh:
#!/bin/bash
{ head -n 1 "$1" && tail -q -n +2 "$@"; }
can be used like this:
pattern="my*filenames*.csv"
combine_csv.sh ${pattern} > result.csv
Thank you so much @wahwahwah.
I used your script to make a Nautilus action, but it works correctly only with these changes:
#!/bin/bash
for last; do true; done                     # After this loop, $last holds the last argument (the target directory)
OutFileName=$last/RESULT_`date +"%d-%m-%Y"`.csv     # Fix the output name
i=0                                         # Reset a counter
for filename in "$last/"*".csv"; do
    if [ "$filename" != "$OutFileName" ]; then      # Avoid recursion
        if [[ $i -eq 0 ]]; then
            head -1 "$filename" > "$OutFileName"    # Copy the header if it is the first file
        fi
        tail -n +2 "$filename" >> "$OutFileName"    # Append from the 2nd line of each file
        i=$(( i + 1 ))                              # Increase the counter
    fi
done

Stripping http response and getting only contents in CURL

I am using the following command to fetch a file from a server:
curl -i -L --user user:pass -o $s.po -F file=@$s -X GET http://address
However, the result is like this:
HTTP/1.1 100 Continue
HTTP/1.1 200 OK
Server: nginx/1.2.1
Date: Wed, 04 Jun 2014 18:00:08 GMT
Content-Type: application/json; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Content-Language: en
Expires: Wed, 04 Jun 2014 18:00:08 GMT
Vary: Authorization, Host, Accept-Language, Cookie
Last-Modified: Wed, 04 Jun 2014 18:00:08 GMT
Cache-Control: max-age=0
X-Frame-Options: SAMEORIGIN
Strict-Transport-Security: max-age=15768000; includeSubDomains
X-UA-Compatible: IE=edge,chrome=1
{
"content": "foo\nbar",
"mimetype": "text/x-po"
}
I would like to write only the foo\nbar part of the response, so that exactly the same file is stored on disk. How can I do that?
After removing the -i option and adding > output.json, I get the following output in the file $s.po:
{
"content": "foo\nbar",
"mimetype": "text/x-po"
}
Remove the -i option, pipe the output to an awk command, and save it to a file:
curl -L --user user:pass -o $s.po -F file=@$s -X GET http://address |
awk -F'[:,]' '$1~/content/{gsub(/[" ]/, "", $2); print $2}' > output.json
As per man curl:
-i, --include
(HTTP) Include the HTTP-header in the output. The HTTP-header includes things like
server-name, date of the document, HTTP-version and more...

How to get remote file size from a shell script?

Is there a way to get the size of a remote file like
http://api.twitter.com/1/statuses/public_timeline.json
in shell script?
You can download the file and get its size. But we can do better.
Use curl to get only the response header using the -I option.
In the response header look for Content-Length: which will be followed by the size of the file in bytes.
$ URL="http://api.twitter.com/1/statuses/public_timeline.json"
$ curl -sI $URL | grep -i Content-Length
Content-Length: 134
To get the size use a filter to extract the numeric part from the output above:
$ curl -sI $URL | grep -i Content-Length | awk '{print $2}'
134
Two caveats to the other answers:
Some servers don't return the correct Content-Length for a HEAD request, so you might need to do the full download.
You'll likely get an unrealistically large response (compared to a modern browser) unless you specify gzip/deflate headers.
Also, you can do this without grep/awk or piping:
curl 'http://api.twitter.com/1/statuses/public_timeline.json' --location --silent --write-out 'size_download=%{size_download}\n' --output /dev/null
And the same request with compression:
curl 'http://api.twitter.com/1/statuses/public_timeline.json' --location --silent -H 'Accept-Encoding: gzip,deflate' --write-out 'size_download=%{size_download}\n' --output /dev/null
Similar to codaddict's answer, but without the call to grep:
curl -sI http://api.twitter.com/1/statuses/public_timeline.json | awk '/Content-Length/ { print $2 }'
The preceding answers won't work when there are redirections. For example, if you want the size of the Debian DVD ISO, you must use the --location option; otherwise, the reported size may be that of the 302 Moved Temporarily response body, not that of the real file.
Suppose you have the following url:
$ url=http://cdimage.debian.org/debian-cd/8.1.0/amd64/iso-dvd/debian-8.1.0-amd64-DVD-1.iso
With curl, you could obtain:
$ curl --head --location ${url}
HTTP/1.0 302 Moved Temporarily
...
Content-Type: text/html; charset=iso-8859-1
...
HTTP/1.0 200 OK
...
Content-Length: 3994091520
...
Content-Type: application/x-iso9660-image
...
That's why I prefer using HEAD, which is an alias for the lwp-request command from the libwww-perl package (on Debian). Another advantage is that it strips the extra \r characters, which eases subsequent string processing.
So to retrieve the size of the debian iso DVD, one could do for example:
$ size=$(HEAD ${url})
$ size=${size##*Content-Length: }
$ size=${size%%[[:space:]]*}
Please note that:
this method requires launching only one process;
it will work only with bash, because of the special expansion syntax used.
For other shells, you may have to resort to sed, awk, grep, et al.
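For example, a portable sketch of the same extraction using awk (matching the header name case-insensitively and keeping only the last value, in case several responses are printed):
size=$(HEAD ${url} | awk 'tolower($1) == "content-length:" {print $2}' | tail -n 1)
echo ${size}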
I think the easiest way to do this would be to:
use cURL in silent mode (-s),
pull only the headers with -I (so as to avoid downloading the whole file),
then do a case-insensitive grep -i,
and return the second field using awk ($2).
The output is returned in bytes.
Examples:
curl -sI http://api.twitter.com/1/statuses/public_timeline.json | grep -i content-length | awk '{print $2}'
//output: 52
or
curl -sI https://code.jquery.com/jquery-3.1.1.min.js | grep -i content-length | awk '{print $2}'
//output: 86709
or
curl -sI http://download.thinkbroadband.com/1GB.zip | grep -i content-length | awk '{print $2}'
//output: 1073741824
Show as Kilobytes/Megabytes
If you would like to show the size in Kilobytes then change the awk to:
awk '{print $2/1024}'
or Megabytes
awk '{print $2/1024/1024}'
The accepted solution was not working for me; this does:
curl -s https://code.jquery.com/jquery-3.1.1.min.js | wc -c
I have a shell function, based on codaddict's answer, which gives a remote file's size in a human-readable format thusly:
remote_file_size () {
    printf "%q" "$*"         |
    xargs curl -sI           |
    grep Content-Length      |
    awk '{print $2}'         |
    tr -d '\040\011\012\015' |
    gnumfmt --to=iec-i --suffix=B
    # Note: the `g' prefix on `numfmt' is only needed on systems that lack the
    # GNU coreutils by default, i.e. non-Linux systems. In other words, if
    # you're on Linux, drop the leading `g'; if you're on BSD or Mac, install
    # the GNU coreutils.
}
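Usage is then, for example:
remote_file_size https://code.jquery.com/jquery-3.1.1.min.js
# prints something like: 85KiB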
This will show you detailed info about the ongoing download; you just need to specify a URL as in the example below.
$ curl -O -w 'We downloaded %{size_download} bytes\n' \
    https://cmake.org/files/v3.8/cmake-3.8.2.tar.gz
output
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 7328k 100 7328k 0 0 244k 0 0:00:29 0:00:29 --:--:-- 365k
We downloaded 7504706 bytes
For automated purposes you'll just need to add the command to your script file.
Combining all of the above, this works for me:
URL="http://cdimage.debian.org/debian-cd/current/i386/iso-dvd/debian-9.5.0-i386-DVD-1.iso"
curl --head --silent --location "$URL" | grep -i "content-length:" | tr -d " \t" | cut -d ':' -f 2
This will return just the content length in bytes:
3767500800
You can kinda do it like this, including auto-following 301/302 redirections:
curl -ILs 'https://twitter.com/i/csp_report?a=ORTGK%3D%3D%3D&ro=fals' |
mawk 'NF*=!_<NF' \
OFS= FS='^[Cc][Oo][Nn][Tt][Ee][Nn][Tt]-[Ll][Ee][Nn][Gg][Tt][Hh]: '
1 41
It's very brute force but gets the job done. Note that this is whatever raw value the server reports, so you may have to adjust it as you see fit.
You may also have to add the -g flag so it can automatically handle the switchover from plain http to https:
curl -gILs 'http://apple.com' |
mawk 'NF *= !_<NF' OFS= \
FS='^[Cc][Oo][Nn][Tt][Ee][Nn][Tt]-[Ll][Ee][Nn][Gg][Tt][Hh]: '
1 304
2 106049
(I'm guessing this might be the main site, and the first item was the redirection page?)
The question is old and has been sufficiently answered, but let me expand upon the existing answers. If you want to automate this task (to check the file sizes of multiple files), here's a one-liner.
First write the URLs of the files into a file:
cat url_of_files.txt
https://stpubdata-jwst.stsci.edu/ero/jw02734/jw02734002001/jw02734002001_04101_00001-seg002_nis_x1dints.fits
https://stpubdata-jwst.stsci.edu/ero/jw02734/jw02734002001/jw02734002001_04101_00001-seg003_nis_calints.fits
https://stpubdata-jwst.stsci.edu/ero/jw02734/jw02734002001/jw02734002001_04102_00001-seg001_nis_calints.fits
https://stpubdata-jwst.stsci.edu/ero/jw02734/jw02734002001/jw02734002001_02101_00002-seg001_nis_cal.fits
...
Then, from the command line (in the same directory as your url_of_files.txt):
eval $(sed -rn '/^https/s/(https.*$)/curl -sI \1/p' url_of_files.txt) | awk '/[Cc]ontent-[Ll]ength/{kb=$2/1024;mb=kb/1024;gb=mb/1024;print ( $2>1024 ? ( kb>1024 ? ( mb>1024 ? gb " G" : mb " M") : kb " K" ) : $2 " B" ) }'
This is for checking file sizes ranging from bytes to gigabytes. I use this line to check the FITS data files being made available by the JWST team.
It checks the file size and, depending on its magnitude, roughly converts it to an appropriate number with a B, K, M, or G suffix denoting bytes, kilobytes, megabytes, or gigabytes.
result:
...
177.188 K
177.188 K
236.429 M
177.188 K
5.95184 M
1.83608 G
1.20326 G
130.059 M
1.20326 G
...
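If the eval feels fragile, roughly the same check can be written as a plain while-read loop (a sketch reusing the awk conversion above):
while read -r url; do
    case "$url" in https*) ;; *) continue ;; esac    # keep only the https lines
    curl -sI "$url" |
    awk '/[Cc]ontent-[Ll]ength/ {kb=$2/1024; mb=kb/1024; gb=mb/1024;
         print ($2>1024 ? (kb>1024 ? (mb>1024 ? gb " G" : mb " M") : kb " K") : $2 " B")}'
done < url_of_files.txt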
My solution uses awk's END block to make sure only the last Content-Length is kept:
function curl2contentlength() {
curl -sI -L -H 'Accept-Encoding: gzip,deflate' $1 | grep -i Content-Length | awk 'END{print $2}'
}
curl2contentlength "$@"
./curl2contentlength.sh "https://chrt.fm/track/B63133/stitcher.simplecastaudio.com/ec74d48c-cbf1-4764-923e-7d584dce50fa/episodes/a85954a3-24c3-48ed-bced-ef0607b7149a/audio/128/default.mp3?aid=rss_feed&awCollectionId=ec74d48c-cbf1-4764-923e-7d584dce50fa&awEpisodeId=a85954a3-24c3-48ed-bced-ef0607b7149a&feed=qm_9xx0g"
10806508
In fact, without it, the output would have been:
0
0
10806508
I use it like this ([Cc]ontent-[Ll]ength:), because the server I query mentions Content-Length in more than one header line of the response:
curl -sI "http://someserver.com/hls/125454.ts" | grep [Cc]ontent-[Ll]ength: | awk '{ print $2 }'
Accept-Ranges: bytes
Access-Control-Expose-Headers: Date, Server, Content-Type, Content-Length
Server: WowzaStreamingEngine/4.5.0
Cache-Control: no-cache
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true
Access-Control-Allow-Methods: OPTIONS, GET, POST, HEAD
Access-Control-Allow-Headers: Content-Type, User-Agent, If-Modified-Since, Cache-Control, Range
Date: Tue, 10 Jan 2017 01:56:08 GMT
Content-Type: video/MP2T
Content-Length: 666460
A different solution:
ssh userName@IP ls -s PATH | grep FILENAME | awk '{print $1}'
This gives you the size in KB.
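If you need the exact byte count rather than the block count reported by ls -s, a hedged alternative (assuming GNU stat is available on the remote host) would be:
ssh userName@IP stat -c %s PATH/FILENAME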
