I have two machines, and I want to use GNU Parallel to have multiple processes 'cat' the contents of some text files from both machines.
I have the following setup.
On a local machine, in the same directory, I have the following files:
cmd.sh - a bash file with contents: 'cat "$@"'
test1.txt - a text file with contents: 'Test 1'
test2.txt - a text file with contents: 'Test 2'
test3.txt - a text file with contents: 'Test 3'
nodefile - a text file with the following contents:
2/:
4/dan@192.168.0.3
This follows the nodefile example from the wordpress link (below); my IP is 192.168.0.2.
None of these files are replicated on the remote machine. I want to have multiple processes 'cat' the contents of each of the test?.txt files from both machines.
Preferably, this:
Wouldn't leave any artifacts on the remote machine
Would leave the contents of the local directory intact.
I have been able to execute multiprocessing commands remotely with the nodefile, as per this wordpress example, but nothing that involves echoing files remotely.
So far, I have something like the following:
parallel --sshloginfile nodefile --workdir . --basefile cmd.sh -a cmd.sh --trc ::: test1.txt test2.txt test3.txt
But this isn't working and is removing the files from my directory and not replacing them, as well as giving rsync errors. I (unfortunately) can't provide the errors at the moment, or replicate the setup.
I am very inexperienced with parallel, can anyone guide me on the syntax to accomplish this task? I haven't been able to find the answer (so far) in the man pages or on the web.
Running Ubuntu 16.04 LTS and using latest version of GNU Parallel.
You are making a few mistakes:
-a is used to give an input source. It is basically an alias for :::: (see the small illustration after this list).
you do not give the command to run after the options to GNU Parallel and before the :::
--trc takes an argument (namely the file to transfer back). You do not have a file to transfer back, so use --transfer --cleanup instead.
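As a small illustration of that equivalence (argfile here is a hypothetical file with one argument per line), these two invocations run the same jobs:
parallel echo :::: argfile
parallel -a argfile echo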
So:
chmod +x cmd.sh
parallel --sshloginfile nodefile --workdir . --basefile cmd.sh --transfer --cleanup ./cmd.sh ::: test1.txt test2.txt test3.txt
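For the transferred script to run on the remote side it also has to be a self-contained executable. A minimal sketch of the cmd.sh described in the question (assuming the intent is simply to print the files passed as arguments) would be:
#!/bin/bash
# cmd.sh - print the contents of every file passed as an argument
cat "$@"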
It is unclear if you want to transfer anything to the remote machine, so maybe this is really the correct answer:
parallel --sshloginfile nodefile --nonall --workdir . ./cmd.sh test1.txt test2.txt test3.txt
Related
I'm working on a Python script that verifies the integrity of some downloaded projects.
On my NAS, I have all my compressed folders: folder1.tar.gz, folder2.tar.gz, …
On my Linux computer, the equivalent uncompressed folders: folder1, folder2, …
So, I want to compare the integrity of my files without untarring or downloading anything!
I think I can do it on the NAS with something like this (with md5sum):
sshpass -p 'password' ssh login@my.nas.ip tar -xvf /path/to/my/folder.tar.gz | md5sum | awk '{ print $1 }'
This gives me a hash, but I don't know how to get an equivalent hash to compare with the normal folder on my computer. Maybe the way I am doing it is wrong.
I need one command for the NAS and one for the Linux computer that output the same hash (if the folders are the same, of course).
If you did that, tar xf would actually extract the files. md5sum would only see the file listing, and not the file content.
However, if you have GNU tar on the server and the standard utility paste, you could create checksums this way:
mksums:
#!/bin/bash
data=/path/to/data.tar.gz
sums=/path/to/data.md5
paste \
    <(tar xzf "$data" --to-command=md5sum) \
    <(tar tzf "$data" | grep -v '/$') \
    | sed 's/-\t//' > "$sums"
Run mksums above on the machine with the tar file.
Copy the sums file it creates to the computer with the folders and run:
cd /top/level/matching/tar/contents
md5sum -c "$sums"
paste joins lines of files given as arguments
<(...) runs a command, making its output appear in a FIFO (process substitution)
--to-command is a GNU tar extension which allows running commands which will receive their data from stdin
grep filters out directories from the tar listing
sed removes the extraneous -\t so the checksum file can be understood by md5sum
The above assumes you don't have any very oddly named files (for example, the names can't contain newlines).
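If you would rather not copy a script onto the NAS at all, a one-shot variant is to feed the same pipeline to ssh and collect the sums locally. This is an untested sketch that reuses the placeholder login, host and archive path from the question, and assumes bash and GNU tar are available on the NAS:
sshpass -p 'password' ssh login@my.nas.ip bash -s > folder.md5 <<'EOF'
# runs on the NAS: checksum every file inside the archive without writing it to disk
data=/path/to/my/folder.tar.gz
paste <(tar xzf "$data" --to-command=md5sum) \
      <(tar tzf "$data" | grep -v '/$') \
    | sed 's/-\t//'
EOF
Then verify on the Linux computer with md5sum -c folder.md5 from the top level matching the tar contents.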
I am new to Docker, Debezium, Bash, and Kafka. I am attempting to run the Debezium tutorial/example for MSSQL Server on Windows 10 here:
https://github.com/debezium/debezium-examples/blob/master/tutorial/README.md#using-sql-server
I am able to start the topology, per step one. However, when I go to step two and execute the following command:
cat debezium-sqlserver-init/inventory.sql | docker exec -i tutorial_sqlserver_1 bash -c '/opt/mssql-tools/bin/sqlcmd -U sa -P $SA_PASSWORD'
I get the following error:
bash: C:/Program: No such file or directory
I do not have the foggiest idea why it would even drag C:/Program into this. I do not see it in the command, nor do I see it in the *.sql file. Does anyone know why this is happening and what the fix is?
Note 1: I am already in the current directory where this command should be runnable and there are no spaces in the folder/file path
Note 2: I am running the commands in Git Bash
When using set -x to log how the command is run, there's still no C:/Program anywhere in it, as can be seen by the following log:
$ cat debezium-sqlserver-init/inventory.sql | docker exec -i tutorial_sqlserver_1 bash -c '/opt/mssql-tools/bin/sqlcmd -U sa -P $SA_PASSWORD'
+ cat debezium-sqlserver-init/inventory.sql
+ docker exec -i tutorial_sqlserver_1 bash -c '/opt/mssql-tools/bin/sqlcmd -U sa -P $SA_PASSWORD'
bash: C:/Program: No such file or directory
I had a similar problem yesterday; the solution was adding a backslash before the absolute path, like this:
cat debezium-sqlserver-init/inventory.sql | docker exec -i tutorial_sqlserver_1 bash -c '\/opt/mssql-tools/bin/sqlcmd -U sa -P $SA_PASSWORD'
The leading \/ in \/opt/mssql-tools/bin/sqlcmd prevents Git Bash from converting the path to a Windows path.
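Other workarounds commonly used for this MSYS/Git Bash path conversion (hedged: not tested against this exact setup) are doubling the leading slash, or disabling the conversion for the single invocation with MSYS_NO_PATHCONV:
# a doubled leading slash also keeps Git Bash from rewriting the path
cat debezium-sqlserver-init/inventory.sql | docker exec -i tutorial_sqlserver_1 bash -c '//opt/mssql-tools/bin/sqlcmd -U sa -P $SA_PASSWORD'
# or switch the conversion off for this one command (Git for Windows)
cat debezium-sqlserver-init/inventory.sql | MSYS_NO_PATHCONV=1 docker exec -i tutorial_sqlserver_1 bash -c '/opt/mssql-tools/bin/sqlcmd -U sa -P $SA_PASSWORD'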
I'm trying to download multiple files in parallel using xargs. Things work fine if I only download the files without specifying a name: echo ${links[@]} | xargs -P 8 -n 1 wget. Is there any way that allows me to download with a filename, like wget -O [filename] [URL], but in parallel?
Below is my work. Thank you.
links=(
"https://apod.nasa.gov/apod/image/1901/sombrero_spitzer_3000.jpg"
"https://apod.nasa.gov/apod/image/1901/orionred_WISEantonucci_1824.jpg"
"https://apod.nasa.gov/apod/image/1901/20190102UltimaThule-pr.png"
"https://apod.nasa.gov/apod/image/1901/UT-blink_3d_a.gif"
"https://apod.nasa.gov/apod/image/1901/Jan3yutu2CNSA.jpg"
)
names=(
"file1.jpg"
"file2.jpg"
"file3.jpg"
"file4.jpg"
"file5.jpg"
)
echo ${links[@]} ${names[@]} | xargs -P 8 -n 1 wget
With GNU Parallel you can do:
parallel wget -O {2} {1} ::: "${links[@]}" :::+ "${names[@]}"
If a download fails, GNU Parallel can also retry commands with --retries 3.
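Putting both together for the arrays from the question:
parallel --retries 3 wget -O {2} {1} ::: "${links[@]}" :::+ "${names[@]}"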
I'm writing a custom backup script in bash for personal use. The goal is to compress the contents of a directory via tar/gzip, split the compressed archive, then upload the parts to AWS S3.
On my first try writing this script a few months ago, I was able to get it working via something like:
tar -czf - /mnt/STORAGE_0/dir_to_backup | split -b 100M -d -a 4 - /mnt/SCRATCH/backup.tgz.part
aws s3 sync /mnt/SCRATCH/ s3://backups/ --delete
rm /mnt/SCRATCH/*
This worked well for my purposes, but required /mnt/SCRATCH to have enough disk space to store the compressed directory. Now I wanted to improve this script to not have to rely on having enough space in /mnt/SCRATCH, and did some research. I ended up with something like:
tar -czf - /mnt/STORAGE_0/dir_to_backup | split -b 100M -d -a 4 --filter "aws s3 cp - s3://backups/backup.tgz.part" -
This almost works, but the target filename on my S3 bucket is not dynamic, and it seems to just overwrite the backup.tgz.part file several times while running. The end result is just one 100MB file, vs the intended several 100MB files with endings like .part0001.
Any guidance would be much appreciated. Thanks!
When using split with --filter, you can use the environment variable $FILE to get the generated file name.
See split man page:
--filter=COMMAND
write to shell COMMAND; file name is $FILE
For your use case you could use something like the following:
--filter 'aws s3 cp - s3://backups/backup.tgz.part$FILE'
(the single quotes are needed, otherwise the environment variable substitution will happen immediately)
Which will generate the following file names on aws:
backup.tgz.partx0000
backup.tgz.partx0001
backup.tgz.partx0002
...
Full example:
tar -czf - /mnt/STORAGE_0/dir_to_backup | split -b 100M -d -a 4 --filter 'aws s3 cp - s3://backups/backup.tgz.part$FILE' -
You should be able to get it done quite easily and in parallel using GNU Parallel. It has the --pipe option to split the input data into blocks of size --block and distribute it amongst multiple parallel processes.
So, if you want to use 100MB blocks and use all cores of your CPU in parallel, and append the block number ({#}) to the end of the filename on AWS, your command would look like this:
tar czf - something | parallel --pipe --block 100M --recend '' aws s3 cp - s3://backups/backup.tgz.part{#}
You can use just 4 CPU cores instead of all cores with parallel -j4.
Note that I set the "record end" character to nothing so that it doesn't try to avoid splitting mid-line, which is its default behaviour and better suited to processing text files than binary data like tarballs.
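Putting that together for the directory from the question, limited to 4 jobs:
tar -czf - /mnt/STORAGE_0/dir_to_backup | parallel -j4 --pipe --block 100M --recend '' aws s3 cp - s3://backups/backup.tgz.part{#}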
I am downloading a file using the wget command. But when it downloads to my local machine, I want it to be saved as a different filename.
For example: I am downloading a file from www.examplesite.com/textfile.txt
I want to use wget to save the file textfile.txt in my local directory as newfile.txt. I am using the wget command as follows:
wget www.examplesite.com/textfile.txt
Use the -O file option.
E.g.
wget google.com
...
16:07:52 (538.47 MB/s) - `index.html' saved [10728]
vs.
wget -O foo.html google.com
...
16:08:00 (1.57 MB/s) - `foo.html' saved [10728]
Also notice the order of parameters on the command line. At least on some systems (e.g. CentOS 6):
wget -O FILE URL
works. But:
wget URL -O FILE
does not work.
You would use the command Mechanical snail listed. Notice the uppercase O. Full command line to use could be:
wget www.examplesite.com/textfile.txt --output-document=newfile.txt
or
wget www.examplesite.com/textfile.txt -O newfile.txt
Hope that helps.
wget -O yourfilename.zip remote-storage.url/theirfilename.zip
will do the trick for you.
Note:
a) it's a capital O.
b) only wget -O filename url will work; putting -O last will not.
Either curl or wget can be used in this case. All 3 of these commands do the same thing, downloading the file at http://path/to/file.txt and saving it locally into "my_file.txt":
wget http://path/to/file.txt -O my_file.txt # my favorite--it has a progress bar
curl http://path/to/file.txt -o my_file.txt
curl http://path/to/file.txt > my_file.txt
Notice the first one's -O is the capital letter "O".
The nice thing about the wget command is it shows a nice progress bar.
You can prove the files downloaded by each of the 3 techniques above are exactly identical by comparing their sha512 hashes. Running sha512sum my_file.txt after running each of the commands above, and comparing the results, reveals all 3 files to have the exact same sha hashes (sha sums), meaning the files are exactly identical, byte-for-byte.
See also: How to capture cURL output to a file?
Using CentOS Linux I found that the easiest syntax would be:
wget "link" -O file.ext
where "link" is the web address you want to save and "file.ext" is the filename and extension of your choice.