Ruby output as input for system command - ruby

I am trying to download a ton of files via gsutil (Google Cloud). The documentation says you can pass a list of URLs to download:
You can pass a list of URLs (one per line) to copy on stdin instead of as command line arguments by using the -I option. This allows you to use gsutil in a pipeline to upload or download files / objects as generated by a program, such as:
some_program | gsutil -m cp -I gs://my-bucket
How can I do this from Ruby, from within the program I mean? I tried to output them but that doesn't seem to work.
urls = ["url1", "url2", "url3"]
`echo #{puts urls} | gsutil -m cp -I gs://my-bucket`
Any idea?
A potential workaround would be to save the URLs in a file and use cat file | gsutil -m cp -I gs://my-bucket but that feels like overkill.

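Try interpolating the joined URLs instead:
`echo '#{urls.join("\n")}' | gsutil -m cp -I gs://my-bucket`
puts writes the URLs to your Ruby program's own standard output and returns nil, so #{puts urls} interpolates to an empty string and echo sends only an empty line down the pipe. urls.join("\n") returns the newline-separated string itself, which is what the interpolation needs. Note that the single quotes break if a URL contains a single quote.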

Related

iterate through specific files using webHDFS in a bash script

I want to download specific files in a HDFS directory, with their names starting with "total_conn_data_". Since I've got many files I want to write a bash script.
Here's what I do:
myPatternFile="total_conn_data_*.csv"
for filename in `curl -i -X GET "https://knox.blabla/webhdfs/v1/path/to/the/directory/?OP=LISTSTATUS" -u username`; do
curl -i -X GET "https://knox.blabla/webhdfs/v1/path/to/the/directory/$filename?OP=OPEN" -u username -L -o "./data/$filename" -k;
done
But it does not work, since curl -i -X GET "https://knox.blabla/webhdfs/v1/path/to/the/directory/?OP=LISTSTATUS" -u username sends back JSON text and not file names.
How should I do this? Thanks
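curl only returns what the server sends, and the WebHDFS LISTSTATUS operation responds with JSON. You will have to use other tools such as jq or sed to format that output and get the list of file names. Also drop -i from the listing request, since it adds the HTTP response headers to the output you need to parse.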
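As a rough sketch of the jq route (assuming jq is installed and that the response follows the usual WebHDFS LISTSTATUS layout, with the entries under FileStatuses.FileStatus and each name in pathSuffix), something like the following could replace the loop from the question. The function names list_matching_files and download_matching are made up for illustration; the URL, user name and ./data target are simply reused from the question.

```shell
#!/bin/bash
# Read a LISTSTATUS JSON response on stdin and print the names of regular
# files that start with prefix $1 and end with suffix $2, one per line.
list_matching_files() {
  jq -r --arg prefix "$1" --arg suffix "$2" '
    .FileStatuses.FileStatus[]
    | select(.type == "FILE")
    | .pathSuffix
    | select(startswith($prefix) and endswith($suffix))'
}

# Hypothetical wrapper: list the directory once, then fetch each match.
# Note there is no -i here, so only the JSON body reaches jq.
download_matching() {
  local base="https://knox.blabla/webhdfs/v1/path/to/the/directory"
  curl -s -k -u username "$base/?OP=LISTSTATUS" \
    | list_matching_files "total_conn_data_" ".csv" \
    | while IFS= read -r filename; do
        curl -s -k -L -u username "$base/$filename?OP=OPEN" \
          -o "./data/$filename" </dev/null
      done
}
```

Calling download_matching runs the whole thing; list_matching_files can be tried on its own by piping a saved response into it.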

Split and copy a file from a bucket to another bucket, without downloading it locally

I'd like to split and copy a huge file from a bucket (gs://$SRC_BUCKET/$MY_HUGE_FILE) to another bucket (gs://$DST_BUCKET/), but without downloading the file locally. I expect to do this using only gsutil and shell commands.
I'm looking for something with the same final behaviour as the following commands :
gsutil cp gs://$SRC_BUCKET/$MY_HUGE_FILE my_huge_file_stored_locally
split -l 1000000 my_huge_file_stored_locally a_split_of_my_file_
gsutil -m mv a_split_of_my_file_* gs://$DST_BUCKET/
But, because I'm executing these actions on a Compute Engine VM with limited disk storage capacity, getting the huge file locally is not possible (and anyway, it seems like a waste of network bandwidth).
The file in this example is split by number of lines (-l 1000000), but I will accept answers if the split is done by number of bytes.
I took a look at the docs about streaming uploads and downloads using gsutil to do something like :
gsutil cp gs://$SRC_BUCKET/$MY_HUGE_FILE - | split -1000000 | ...
But I can't figure out how to upload split files directly to gs://$DST_BUCKET/, without creating them locally (creating temporarily only 1 shard for the transfer is OK though).
This can't be done without downloading, but you could use range reads to build the pieces without downloading the full file at once, e.g.,
gsutil cat -r 0-10000 gs://$SRC_BUCKET/$MY_HUGE_FILE | gsutil cp - gs://$DST_BUCKET/file1
gsutil cat -r 10001-20000 gs://$SRC_BUCKET/$MY_HUGE_FILE | gsutil cp - gs://$DST_BUCKET/file2
...
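To avoid typing the ranges by hand, the same idea can be put in a loop. This is only a sketch: byte_ranges and copy_in_chunks are hypothetical helper names, the object size and chunk size are supplied by the caller, and because the split is by bytes a line may be cut in two at a chunk boundary.

```shell
#!/bin/bash
# Print inclusive "start-end" byte ranges that cover $1 bytes in pieces
# of at most $2 bytes, one range per line.
byte_ranges() {
  local total=$1 chunk=$2 start=0 end
  while [ "$start" -lt "$total" ]; do
    end=$(( start + chunk - 1 ))
    if [ "$end" -ge "$total" ]; then
      end=$(( total - 1 ))
    fi
    echo "${start}-${end}"
    start=$(( end + 1 ))
  done
}

# Stream each range from the source object ($1) into its own destination
# object named <$2><5-digit index>. $3 is the object size in bytes and
# $4 the chunk size; both are assumptions the caller has to provide.
copy_in_chunks() {
  local src=$1 dst_prefix=$2 total=$3 chunk=$4 i=0 range
  for range in $(byte_ranges "$total" "$chunk"); do
    gsutil cat -r "$range" "$src" \
      | gsutil cp - "${dst_prefix}$(printf '%05d' "$i")"
    i=$(( i + 1 ))
  done
}
```

For example, copy_in_chunks "gs://$SRC_BUCKET/$MY_HUGE_FILE" "gs://$DST_BUCKET/a_split_of_my_file_" "$size" 100000000, where $size is the object size in bytes (gsutil ls -l reports it). Only one chunk is in flight at a time, and nothing is written to local disk.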

Bash script to wget url starting with a specific character

I have a URL http://example.com/dir that has many subdirectories with files that I want to save. Because it is very large, I want to break this operation into parts,
e.g. download everything from subdirectories starting with A, like
http://example.com/A
http://example.com/Aa
http://example.com/Ab
etc
I have created the following script
#!/bin/bash
for g in A B C
do wget -e robots=off -r -nc -np -R "index.html*" http://example.com/$g
done
but it tries to download only http://example.com/A and not http://example.com/A*
Look at this page, it has all you need to know:
https://www.gnu.org/software/wget/manual/wget.html
1) To get a list of URLs to download, you could use:
wget --spider -nd -r -o outputfile <domain>
--spider does not download the files, it just checks that they are there
-nd prevents wget from creating directories locally
-r recurses through the entire site
-o outputfile sends the log output to a file
2) then parse the outputfile to extract the files, and create smaller lists of links you want to download.
3) then use -i file (== --input-file=file) to download each list, thus limiting how many you download in one execution of wget.
Notes:
- --limit-rate=amount can be used to slow down downloads, to spare your Internet link!
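Steps 2 and 3 could be sketched like this. It assumes the log in outputfile marks each request with a line of the form "--<date> <time>--  <URL>", which is worth checking against your own log before relying on it; urls_with_prefix and download_by_letter are hypothetical helper names, and the A B C letters and example.com come from the question.

```shell
#!/bin/bash
# Print the unique URLs from a wget log (file $1) that begin with the
# prefix $2. Request lines are assumed to start with "--" and to carry
# the URL in the third whitespace-separated field.
urls_with_prefix() {
  awk -v p="$2" '/^--/ && index($3, p) == 1 { print $3 }' "$1" | sort -u
}

# Hypothetical driver: build one list per starting letter, then download
# each list in its own wget run.
download_by_letter() {
  for g in A B C; do
    urls_with_prefix outputfile "http://example.com/$g" > "list_$g.txt"
    # -x recreates the remote directory layout locally,
    # -nc skips files that are already present.
    wget -nc -x -i "list_$g.txt"
  done
}
```

Because the match is on the URL prefix, http://example.com/A also picks up http://example.com/Aa, http://example.com/Ab and so on.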

Curl wildcard delete

I'm trying to use curl to delete files before I upload a new set, but I'm having trouble wildcarding the file names.
The below code works to delete one specific file
curl -v -u usr:"pass" ftp://11.11.11.11/outgoing/ -Q "DELE /outgoing/configuration-1.zip"
But when I try to wildcard the file name with the below
curl -v -u usr:"pass" ftp://11.11.11.11/outgoing/ -Q "DELE /outgoing/configuration-*.zip"
I get the error below
errorconfiguration-*: No such file or directory
QUOT command failed with 550
Can I use wildcards in curl delete?
Thanks
Curl does not support wildcards in any commands on an FTP server. In order to perform the required delete, you'll have to first list the files in the directory on the server, filter down to the files you want, and then issue delete commands for those.
Assuming your files are in the path ftp://11.11.11.11/outgoing, you could do something like:
curl -u usr:"pass" -l ftp://11.11.11.11/outgoing/ \
| grep -E '^configuration-[[:digit:]]+[.]zip$' \
| xargs -I{} -- curl -v -u usr:"pass" ftp://11.11.11.11/outgoing/ -Q "DELE /outgoing/{}"
That command (untested, since I don't have access to your server) does the following:
Outputs a name-only listing (-l) of the outgoing directory on the server. The trailing slash on the URL matters: without it, curl treats outgoing as a file to download.
Filters that listing for file names that start with configuration-, then have one or more digits, and then end with .zip. You may need to adjust this regex for different patterns.
Supplies the matching names to xargs, which substitutes each name for the {} placeholder and runs a curl command that issues DELE for that file, using the same /outgoing/ path as your working single-file command.
You could use one curl command to delete all of the files by passing one -Q "DELE ..." option per matched name, but that would be less legible for use as an example.
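For completeness, the single-command variant might look roughly like the sketch below. It is as untested against a real server as the pipeline above; matching_names and delete_matching are made-up names, the /outgoing/ path in the DELE command is taken from the question, and stripping carriage returns is a precaution in case the listing arrives with CRLF line endings.

```shell
#!/bin/bash
# Read a name-only listing on stdin and print the names that look like
# configuration-<digits>.zip, one per line.
matching_names() {
  tr -d '\r' | grep -E '^configuration-[[:digit:]]+[.]zip$'
}

# Hypothetical wrapper: collect one -Q option per matching file and
# issue all the DELE commands over a single connection.
delete_matching() {
  set --
  for name in $(curl -s -u usr:"pass" -l ftp://11.11.11.11/outgoing/ | matching_names); do
    set -- "$@" -Q "DELE /outgoing/$name"
  done
  # Nothing matched: skip the second connection entirely.
  if [ "$#" -eq 0 ]; then
    return 0
  fi
  curl -v -u usr:"pass" ftp://11.11.11.11/outgoing/ "$@"
}
```

The unquoted $(...) in the loop is safe here only because the regex rules out names containing whitespace.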

How to combine 3 commands into a single process for runit to monitor?

I wrote a script that grabs a set of parameters from two sources using wget commands, stores them in variables and then executes a video transcoding process based on the retrieved parameters. Runit was installed to monitor the process.
The problem is that when I try to stop the process, runit doesn't know that only the last transcoding process needs to be stopped, and therefore it fails to stop it.
How can I combine all the commands in bash script to act as a single process/app?
The commands are something as follows:
wget address/id.html
res=$(cat res_id | grep id.html)
wget address/length.html
time=$(cat length_id | grep length.html)
/root/bin -i video1.mp4 -s $res.....................
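Try wrapping them in a shell and exec the final command, so that the transcoder replaces the shell and becomes the process that runit supervises and signals:
sh -c '
wget address/id.html
res=$(grep id.html res_id)
wget address/length.html
time=$(grep length.html length_id)
exec /root/bin -i video1.mp4 -s $res.....................
'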

Resources