How do you fix wget from comingling download data when running multiple concurrent instances? - bash

I am running a script that in turn calls another script multiple times in the background with different sets of parameters.
The secondary script first does a wget on an ftp url to get a listing of files at that url. It outputs that to a unique filename.
Simplified example:
Each of these is being called by a separate instance of the secondary script running in the background.
wget --no-verbose 'ftp://foo.com/' -O '/downloads/foo/foo_listing.html' >foo.log
wget --no-verbose 'ftp://bar.com/' -O '/downloads/bar/bar_listing.html' >bar.log
When I run the secondary script one at a time, everything behaves as expected. I get an HTML file with a list of files, links to them, and information about the files, the same way I would when viewing an FTP URL through a browser.
Continued simplified one at a time (and expected) example results:
foo_listing.html:
...
foo1.xml ...
foo2.xml ...
...
bar_listing.html:
...
bar3.xml ...
bar4.xml ...
...
When I run the secondary script many times in the background, some of the resulting files, although they have the correct base URL (the one that was passed in), list files from a different run of wget.
Continued simplified multiprocessing (and actual) example results:
foo_listing.html:
...
bar3.xml ...
bar4.xml ...
...
bar_listing.html
correct, as above
Oddly enough, all other files I download seem to work just fine. It's only these listing files that get jumbled up.
The current workaround is to put in a 5 second delay between backgrounded processes. With only that one change everything works perfectly.
Does anybody know how to fix this?
Please don't recommend using some other method of getting the listing files or not running concurrently. I'd like to actually know how to fix this when using wget in many backgrounded processes if possible.
EDIT:
Note:
I am not referring to the status output that wget spews to the screen. I don't care at all about that (that is actually also being stored in separate log files and is working correctly). I'm referring to the data wget is downloading from the web.
Also, I cannot show the exact code that I am using as it is proprietary for my company. There is nothing "wrong" with my code as it works perfectly when putting in a 5 second delay between backgrounded instances.

Log a bug with GNU, use something else whenever possible for now, and put in time delays between concurrent runs. Possibly create a wrapper for getting FTP directory listings that only allows one listing to be retrieved at a time.
:-/
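A minimal sketch of that wrapper idea, using flock(1) to serialize just the listing fetch across backgrounded instances (the lock-file path, argument handling, and example values are assumptions, not the real script):
#!/usr/bin/env bash
# fetch_listing.sh -- hypothetical wrapper: take an exclusive lock so that only
# one FTP listing download runs at any moment, even with many backgrounded callers.
url="$1"     # e.g. ftp://foo.com/
out="$2"     # e.g. /downloads/foo/foo_listing.html
log="$3"     # e.g. foo.log
(
    # Block until no other instance holds the lock, then fetch the listing.
    flock -x 9
    wget --no-verbose "$url" -O "$out" >"$log" 2>&1
) 9>/tmp/ftp_listing.lock
Each backgrounded instance of the secondary script would call this instead of running wget directly; the regular file downloads, which were never affected, can stay fully concurrent.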

Related

Occasionally Expect send command gets truncated

I have an expect script that logs in to an SBC and runs a command for a particular interface.
I call this script from a shell script to perform the same command on multiple SBCs and multiple interfaces. I run the script 6 times on each SBC, grabbing details for a single interface each time, and the output gets saved to a different file for each SBC/interface combination.
The trouble is, when I run it on SBC A, for example, in two of the files the command is truncated and nothing happens; say, interfaces 2 and 3.
If I run the script again, 5 interfaces work this time and a different interface, interface 4, fails with a truncated command.
I don’t understand what would cause the command to fail randomly. Any thoughts would be appreciated.
OK, I think I have cracked it. Occasionally the command I am entering matches the expected prompt pattern. In reality the command should always match the prompt pattern, so it is strange that it doesn't fail every time.
Have tweaked the expected prompt and re-running script.
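A minimal sketch of that kind of tweak, written here as a bash script with an embedded expect body; the host, prompt style, and command are assumptions, not the poster's actual script. The idea is to anchor the prompt pattern to the end of the buffer so that the echo of the command being sent can no longer satisfy it prematurely:
#!/usr/bin/env bash
# Hypothetical example: use an anchored prompt pattern instead of a loose one.
expect <<'EOF'
spawn ssh admin@sbc-a.example.com
# Match a prompt such as "sbc-a# " only when it sits at the very end of the
# buffer ($), rather than matching "#" anywhere in the output.
expect -re {[\r\n][^\r\n]*# $}
send "show interface gig2\r"
expect -re {[\r\n][^\r\n]*# $}
send "exit\r"
expect eof
EOF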

get_dir_file_info() hangs when run on a large directory

I have made a little function that deletes files based on date. Prior to doing the deletions, it lets the user choose how many days/months back to delete files, telling them how many files it would delete and how much space it would clean up.
It worked great in my test environment, but when I attempted to test it on a larger directory (approximately 100K files), it hangs.
I've stripped everything else from my code to ensure that it is the get_dir_file_info() function that is causing the issue.
// Load the CodeIgniter file helper, which provides get_dir_file_info()
$this->load->helper('file');
$folder = "iPad/images/";
// Remove PHP's execution time limit for this request
set_time_limit(0);
echo "working<br />";
// Hangs here when the directory holds ~100K files
$dirListArray = get_dir_file_info($folder);
echo "still working";
When I run this, the page loads for approximately 60 seconds, then displays only the first message “working” and not the following message “still working”.
It doesn't seem to be a system/PHP memory problem, since the page comes back after 60 seconds, and the server respects my set_time_limit(), as I've had to use that for other processes.
Is there some other memory/time limit I might be hitting that I need to adjust?
from the CI user guide the get_dir_file_info() is:
Reads the specified directory and builds an array containing the filenames, filesize, dates, and permissions. Sub-folders contained within the specified path are only read if forced by sending the second parameter, $top_level_only to FALSE, as this can be an intensive operation.
So if you are saying that you have 100k files, then the best way to do it is to cut it into two steps:
First: use get_filenames('path/to/directory/') to retrieve all your file names without their information.
Second: use get_file_info('path/to/file', $file_information) to retrieve a specific file's info only when you need it, as you might not need all the file information immediately; it can be done on a file-name click or something similar.
The idea here is not to force your server to deal with a large amount of processing while in production. That would kill two things: responsiveness and performance (I haven't found a better word for it), but the idea here is clear.

Broken pipe, closing control connection. while piping small file through funzip using wget

I'm trying to download a small zip file (1159 bytes) and pipe it through funzip. This works great with larger files from that server. However, three small files give me an error:
Broken pipe, closing control connection.
I use the following code:
wget -O - --ftp-user=username --ftp-password=secret ftp://server/small-file.zip | funzip
Also, downloading the file directly works fine; only the piping to funzip doesn't work. I suspect the file is too small.
Anyone knows how to fix this?
Edit: Size doesn't seem to matter (don't let the girls tell you otherwise :)); even files of 400 bytes are not giving errors.
OK, since nobody has been able to answer it, I'll answer it myself.
I found there are two solutions. One is limiting the download rate for wget:
--limit-rate=1000
This works for files of around 1 KB, but now larger files sometimes seem to suffer from the same error. It also slows down the whole process.
Now I just pipe the download through a script that sleeps 1 second at the end. This seems to solve it.
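One reading of that workaround, as a sketch: a tiny pass-through script placed between wget and funzip that copies stdin to stdout and then sleeps for a second before exiting (the script name and exact placement are my assumptions):
#!/usr/bin/env bash
# delayed-pipe.sh -- hypothetical pass-through: copy stdin to stdout, then
# pause briefly before exiting.
cat
sleep 1
Used as: wget -O - --ftp-user=username --ftp-password=secret ftp://server/small-file.zip | ./delayed-pipe.sh | funzip
The pass-through keeps reading wget's output to EOF even if funzip exits as soon as it has what it needs, which is presumably why the broken-pipe error no longer reaches wget.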

Bash piping output and input of a program

I'm running a Minecraft server on my Linux box in a detached screen session. I'm not very fond of screen and would like to be able to constantly pipe the output of the server to a file (like a pipe) and pipe some input from a file to the server (so that I can send input to and read output from the server with remote programs, like a Python script). I'm not very experienced in bash, so could somebody tell me how to do this?
Thanks, NikitaUtiu.
It's not clear if you need screen at all. I don't know the minecraft server, but generally for server software, you can run it from a crontab entry and redirect output to log files.
Assuming your server kills itself at midnight Sunday night (we can discuss changing this if restarting once per week is too little or too much, or if you require ad-hoc restarts), here is a crontab entry that starts the server each Monday at 1 minute after midnight, to give a basic idea of what to do:
01 00 * * 1 dtTm=`/bin/date +\%Y\%m\%d.\%H\%M\%S`; export dtTm; { /usr/bin/mineserver -o ..... your_options_to_run_mineserver_here ... ; } > /tmp/mineserver_trace_log.${dtTm} 2>&1
Consult your man page for crontab to confirm that day-of-week ranges are 0-6 (0 = Sunday), and change the day-of-week value if 0 != Sunday.
Normally I would break the code up so it is easier to read, but crontab entries have to be all on one line (with some weird exceptions), AND there is usually a limit of 1024 bytes to 8 KB on how long the line can be. Note that the ';' just before the closing '}' is super-critical. If this is left out, you'll get undecipherable error messages, or no error messages at all.
Basically, you're redirecting any output into a file (including stderr output). Now you can do a lot of stuff with the output: use more or less to look at the file, grep ERR ${logFile}, write scripts that grep for error messages and then send you emails when errors have been found, etc.
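A minimal sketch of that grep-and-email idea (the log path, the ERR pattern, and the address are assumptions):
#!/usr/bin/env bash
# Hypothetical check: if the dated log contains error lines, mail them out.
logFile=/tmp/mineserver_trace_log.20240101.000100   # assumed example name
if grep -q 'ERR' "${logFile}"; then
    grep 'ERR' "${logFile}" | mail -s "mineserver errors found" admin@example.com
fi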
You may have some sysadmin work on your hands to get the mineserver user set up so it can run crontab entries. Also, if you're not comfortable using the vi or emacs editors, creating a crontab file may require help from others. Post to superuser.com to get answers for problems you have with Linux admin issues.
Finally, there are two points I'd like to make about dated logfiles.
Good:
a. If your app dies, you never have to rerun it just to capture the output and figure out why something has stopped working. For long-running programs this can save you a lot of time.
b. Keeping dated files gives you the ability to prove to yourself, your boss, and others that it used to work just fine: see, here are the log files.
c. Keeping the log files, assuming there is useful information in them, gives you the opportunity to mine those files for facts, e.g. the program used to take 1 sec for processing, now it is taking 1 hr, etc.
Bad:
a. You'll need to set up a mechanism to sweep old log files, otherwise at some point everything will have stopped, AND when you finally figure out what the problem was, you discover that /tmp or whatever dir you chose to use is completely full.
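A minimal sketch of such a sweep, which could itself be a crontab entry (the directory, filename pattern, and 14-day retention are assumptions):
#!/usr/bin/env bash
# Hypothetical cleanup: delete dated mineserver logs older than 14 days.
find /tmp -maxdepth 1 -name 'mineserver_trace_log.*' -mtime +14 -delete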
There is a self-maintaining solution to using dates on the logfiles I can tell you about if you find this approach useful. It will take a little explaining, so I don't want to spend the time writing it up if you don't find the crontab solution useful.
I hope this helps!

Verify whether ftp is complete or not?

I have an application which is polling a folder continuously. Once any file is FTP'd to the folder, the application has to move this file to some other folder for processing.
Here, we don't have any option to verify whether the FTP transfer is complete or not.
One command, "lsof", was suggested in the technical forums. It has a file descriptor (FD) column which gives the file status.
Since this is a FreeBSD command and not present in old versions of Linux, I want to clarify the usage of this command.
Can you guys tell us your experience in file verification and is there any other alternative solution available?
Also, is there any risk in using this utility?
Appreciate your help in advance.
Thanks,
Mathew Liju
We've done this before in a number of different ways.
Method one:
If you can control the process sending the files, have it send the file itself followed by a sentinel file. For example, send the real file "contracts.doc" followed by a one-byte "contracts.doc.sentinel".
Then have your listener process watch for the sentinel files. When one of them is created, you should process the equivalent data file, then delete both.
Any data file that's more than a day old and doesn't have a corresponding sentinel file should be thrown away - it was a failed transmission.
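A minimal sketch of that listener loop (the directory, the .sentinel extension, the polling interval, and the process_file step are assumptions):
#!/usr/bin/env bash
# Hypothetical listener: process a data file only once its sentinel appears,
# and discard day-old data files that never got a sentinel.
incoming=/data/incoming
while true; do
    for sentinel in "$incoming"/*.sentinel; do
        [ -e "$sentinel" ] || continue          # no sentinels right now
        datafile="${sentinel%.sentinel}"
        process_file "$datafile"                # assumed processing step
        rm -f "$datafile" "$sentinel"
    done
    # Treat data files over a day old with no sentinel as failed transmissions.
    find "$incoming" -maxdepth 1 -type f ! -name '*.sentinel' -mtime +1 |
    while read -r f; do
        [ -e "$f.sentinel" ] || rm -f "$f"
    done
    sleep 60
done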
Method two:
Keep an eye on the files themselves (specifically the last modification date/time). Only process files whose modification time is more than N minutes in the past. That increases the latency of processing the files but you can usually be certain that, if a file hasn't been written to in five minutes (for example), it's done.
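A minimal sketch of that check (the directory and the five-minute threshold are assumptions), using find's -mmin test:
# List files not modified in the last 5 minutes, i.e. probably finished transferring.
find /data/incoming -maxdepth 1 -type f -mmin +5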
Conclusion:
Both those methods have been used by us successfully in the past. I prefer the first but we had to use the second one once when we were not allowed to change the process sending the files.
The advantage of the first one is that you know the file is ready when the sentinel file appears. With both lsof (I'm assuming you're treating files that aren't open by any process as ready for processing) and the timestamps, it's possible that the FTP crashed in the middle and you may be processing half a file.
There are normally three approaches to this sort of problem.
providing a signal file so that when your file is transferred, an additional file is sent to mark that transfer is complete
add an entry to a log file within that directory to indicate a transfer is complete (this really only works if you have a single peer updating the directory, to avoid concurrency issues)
parsing the file to determine completeness, e.g. does the file start with a length field, or is it obviously incomplete? For example, parsing an incomplete XML file will result in a parse error due to the lack of an end element. Depending on your file's size and format, this can be trivial, or it can be very time-consuming.
lsof would possibly be an option, although you've identified your Linux portability issue. If you use this, note the -F option, which formats the output so it is suitable for processing by other programs, rather than being human-readable.
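A minimal sketch of using lsof that way (the paths are assumptions); lsof exits non-zero when nothing has the named file open:
#!/usr/bin/env bash
# Hypothetical check: only move the file if no process (e.g. the FTP server
# daemon) still has it open.
f=/data/incoming/somefile.xml
if ! lsof -- "$f" >/dev/null 2>&1; then
    mv "$f" /data/processing/
fi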
EDIT: Pax identified a fourth (!) method I'd forgotten - using the fact that the timestamp of the file hasn't updated in some time.
There is a fifth method. You can also check if the FTP session is still active. This will work if every peer has its own FTP user account. As long as the user is not logged off from FTP, assume the files are not complete.
