Threaded wget - minimizing resources - shell

I have a script that gets the GeoIP locations of various IPs. It runs daily, and I expect to have around 50,000 IPs to look up.
I have a GeoIP system set up; I would just like to eliminate having to run wget 50,000 times per report.
What I was thinking is that there must be some way to have wget open a connection to the URL and then pass it the IPs, so that it doesn't have to re-establish the connection each time.
Any help will be much appreciated.

If you give wget several addresses at once, and consecutive addresses belong to the same HTTP/1.1 server (one that supports Connection: keep-alive), wget will reuse the already-established connection.
If there are too many addresses to list on the command line, you can write them to a file and use the -i/--input-file= option (and, per UNIX tradition, -i-/--input-file=- reads standard input).
There is, however, no way to preserve a connection across different wget invocations.
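For example (a minimal sketch; the host name and file names are illustrative), you could write one lookup URL per line into a file and let a single wget invocation fetch them all, concatenating the responses into one report:
# ips.txt holds one lookup URL per line, all pointing at the same server, e.g.
#   http://geoip.example.com/lookup?ip=192.0.2.1
#   http://geoip.example.com/lookup?ip=192.0.2.2
wget --no-verbose --input-file=ips.txt --output-document=results.txt
A single wget process then needs only one connection (for as long as the server keeps it alive) instead of 50,000.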

You could also write a threaded Ruby script to run wget on multiple input files simultaneously to speed the process up. So if you have 5 files containing 10,000 addresses each, you could use this script:
#!/usr/bin/ruby

# One thread per input file; each thread runs a single wget process that
# fetches every URL listed in that file.
threads = []
for file in ARGV
  threads << Thread.new(file) do |filename|
    system("wget -i #{filename}")
  end
end

# Wait for all downloads to finish.
threads.each { |thrd| thrd.join }
Each of these threads would use one connection to download all the addresses in its file. The following command then needs only 5 connections to the server to download all 50,000 addresses.
./fetch.rb "list1.txt" "list2.txt" "list3.txt" "list4.txt" "list5.txt"
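If you start from one big list, GNU split can produce the per-thread input files; the file names below are illustrative and assume the combined list is in all_urls.txt:
# Cut a 50,000-line list into list1.txt .. list5.txt (10,000 lines each)
split --lines=10000 --numeric-suffixes=1 --suffix-length=1 --additional-suffix=.txt all_urls.txt list
./fetch.rb list1.txt list2.txt list3.txt list4.txt list5.txt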

You could also write a small program (in Java or C or whatever) that sends the whole list as a single POST request and has the server return an object with data about them. That shouldn't be too slow either.

Related

Split rsyslog files

I am trying to find a way to make rsyslog write logs received from the same host into multiple files simultaneously, for example writing to two files and switching on every log line.
I have one host that sends an insane amount of data (600 GB a day), and I want to write these logs to multiple files instead of one.
This is because Splunk, which reads this file for indexing purposes, can't utilize multiple pipelines (i.e. multiple CPU threads) on the same file, and this causes a bottleneck.
I was thinking about making rsyslog write a new file every second, but that does not sound optimal.
Any suggestion is much appreciated :)
I have not been able to find any documentation that describes this ability in rsyslog.
The solution for this is to use part of the subsecond timestamp to split the stream pseudo-randomly (the last 2 digits of the nanoseconds) into 100 files at the same time.
template(name="template_nano_file_format" type="list") {
    property(name="$.path")
    property(name="fromhost-ip")
    constant(value="/")
    property(name="$year")
    property(name="$month")
    property(name="$day")
    constant(value="-")
    property(name="$hour")
    constant(value="-")
    property(name="timegenerated" dateFormat="subseconds" regex.expression="..$" regex.type="ERE")
    constant(value=".log")
}
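To actually route messages through that template, something along these lines can go in the ruleset (a sketch, not tested; the $.path value is an assumption, and createDirs just ensures the per-host directories get created):
# Set the base directory used by the template, then write via dynamic file names
set $.path = "/var/log/split";
action(type="omfile" dynaFile="template_nano_file_format" createDirs="on")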

"tail -F" equivalent in lftp

I'm currently looking for tips to simulate a tail -F in lftp.
The goal is to monitor a log file the same way I could do with a proper ssh connection.
The closest command I have found so far is repeat cat logfile.
It works, but it is not ideal when the file is big, because it displays the whole file every time.
The lftp program specifically will not support this, but if the server supports the extension, it is possible to pull only the last $x bytes from a file with, e.g., curl --range (see this serverfault answer). Combined with some logic to grab only as many bytes as have been added since the last poll, this could let you do it relatively efficiently. I doubt there are any off-the-shelf FTP clients with this functionality, but someone else may know better.
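A rough polling sketch along those lines (untested; the URL, polling interval, and file names are illustrative, and it assumes the server honors byte ranges and the remote log is never truncated):
#!/bin/sh
# "tail -F over FTP": keep a growing local copy and only ever request the
# bytes the remote file has gained since the last poll.
url='ftp://user:pass@host/logs/app.log'
copy=local_app.log

touch "$copy"
while sleep 5; do
    offset=$(wc -c < "$copy")                      # bytes already fetched
    curl --silent --range "${offset}-" "$url" | tee -a "$copy"
done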

Limit the size of the Kismet PCAP output file?

I am running some Kismet captures, and I need to continuously parse the output PCAP files, so I need Kismet to save the current file and start a new one periodically (I use inotify-tools to detect newly created files).
The problem is that I cannot find a way to make Kismet do it. In the man pages I found that the -m option allows limiting the file size by packet size, so I ran it like this:
sudo kismet -c wlan0 -m 10
But that did not create multiple files; it just kept putting all the traffic into one file.
Are there any other ways to make Kismet break the output into different files? I don't really care what criterion is used (time, packet count, file size... I'll take anything).
Thanks!
I think that you can modify it in the kismet.conf file.
There is an option that says
writeinterval=300
It means that the PCAP file is saved and a new file is started every 300 seconds. If you want a shorter interval, you can change it.
Hope it helps

How do you keep wget from commingling download data when running multiple concurrent instances?

I am running a script that in turn calls another script multiple times in the background with different sets of parameters.
The secondary script first does a wget on an ftp url to get a listing of files at that url. It outputs that to a unique filename.
Simplified example:
Each of these is being called by a separate instance of the secondary script running in the background.
wget --no-verbose 'ftp://foo.com/' -O '/downloads/foo/foo_listing.html' >foo.log
wget --no-verbose 'ftp://bar.com/' -O '/downloads/bar/bar_listing.html' >bar.log
When I run the secondary script one at a time, everything behaves as expected. I get an HTML file with a list of files, links to them, and information about the files, the same way I would when viewing an FTP URL in a browser.
Continued simplified one at a time (and expected) example results:
foo_listing.html:
...
foo1.xml ...
foo2.xml ...
...
bar_listing.html:
...
bar3.xml ...
bar4.xml ...
...
When I run the secondary script many times in the background, some of the resulting files, although the base URL is correct (the one that was passed in), list files from a different run of wget.
Continued simplified multiprocessing (and actual) example results:
foo_listing.html:
...
bar3.xml ...
bar4.xml ...
...
bar_listing.html
correct, as above
Oddly enough, all other files I download seem to work just fine. It's only these listing files that get jumbled up.
The current workaround is to put in a 5 second delay between backgrounded processes. With only that one change everything works perfectly.
Does anybody know how to fix this?
Please don't recommend using some other method of getting the listing files or not running concurrently. I'd like to actually know how to fix this when using wget in many backgrounded processes if possible.
EDIT:
Note:
I am not referring to the status output that wget spews to the screen. I don't care at all about that (that is actually also being stored in separate log files and is working correctly). I'm referring to the data wget is downloading from the web.
Also, I cannot show the exact code that I am using as it is proprietary for my company. There is nothing "wrong" with my code as it works perfectly when putting in a 5 second delay between backgrounded instances.
Log a bug with GNU, use something else for now whenever possible, and put time delays between concurrent runs. Possibly create a wrapper for getting FTP directory listings that only allows one at a time to be retrieved (see the sketch below).
:-/
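For the wrapper idea, a minimal sketch using flock(1) so that only one listing fetch runs at a time (the lock file path and argument handling are illustrative; the other, well-behaved wget downloads stay fully concurrent):
#!/bin/sh
# get-listing.sh <ftp-url> <output-file> <log-file>
# flock(1) serializes the listing fetches across all backgrounded instances.
flock /tmp/ftp-listing.lock \
    wget --no-verbose "$1" -O "$2" > "$3"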

Why is curl in Ruby slower than command-line curl?

I am trying to download more than 1M pages (URLs ending in a sequence ID). I have implemented a kind of multi-purpose download manager with a configurable number of download threads and one processing thread. The downloader downloads files in batches:
curl = Curl::Easy.new
batch_urls.each { |url_info|
  curl.url = url_info[:url]
  curl.perform
  file = File.new(url_info[:file], "wb")
  file << curl.body_str
  file.close
  # ... some other stuff
}
I have tried to download a sample of 8,000 pages. When using the code above, I get 1,000 in 2 minutes. When I write all URLs into a file and run this in the shell:
cat list | xargs curl
I get all 8,000 pages in two minutes.
The thing is, I need to have it in Ruby code, because there is other monitoring and processing code.
I have tried:
Curl::Multi - it is somewhat faster, but misses 50-90% of the files (it does not download them and gives no reason/error code)
multiple threads with Curl::Easy - around the same speed as single threaded
Why is a reused Curl::Easy slower than subsequent command-line curl calls, and how can I make it faster? Or what am I doing wrong?
I would prefer to fix my download manager code than to make downloading for this case in a different way.
Before this, I was calling command-line wget and providing it with a file containing the list of URLs. However, not all errors were handled, and it was not possible to specify a separate output file for each URL when using a URL list.
Now it seems to me that the best way would be to use multiple threads with a system call to the 'curl' command. But why, when I can use Curl directly from Ruby?
Code for the download manager is here, if it might help: Download Manager (I have played with timeouts, from not setting them to various values; it did not seem to help).
Any hints appreciated.
This could be a fitting task for Typhoeus
Something like this (untested):
require 'typhoeus'

def write_file(filename, data)
  file = File.new(filename, "wb")
  file.write(data)
  file.close
  # ... some other stuff
end

hydra = Typhoeus::Hydra.new(:max_concurrency => 20)

batch_urls.each do |url_info|
  req = Typhoeus::Request.new(url_info[:url])
  req.on_complete do |response|
    write_file(url_info[:file], response.body)
  end
  hydra.queue req
end

hydra.run
Come to think of it, you might run into memory problems because of the enormous number of files. One way to prevent that would be to never store the data in a variable but instead stream it directly to the file. You could use em-http-request for that.
EventMachine.run {
  http = EventMachine::HttpRequest.new('http://www.website.com/').get
  http.stream { |chunk| print chunk }
  # ...
}
So, if you don't set an on_body handler, curb will buffer the download. If you're downloading files, you should use an on_body handler. If you want to download multiple files using Ruby Curl, try the Curl::Multi.download interface.
require 'rubygems'
require 'curb'

urls_to_download = [
  'http://www.google.com/',
  'http://www.yahoo.com/',
  'http://www.cnn.com/',
  'http://www.espn.com/'
]
path_to_files = [
  'google.com.html',
  'yahoo.com.html',
  'cnn.com.html',
  'espn.com.html'
]

Curl::Multi.download(urls_to_download, {:follow_location => true}, {}, path_to_files) {|c,p|}
If you just want to download a single file:
Curl::Easy.download('http://www.yahoo.com/')
Here is a good resource: http://gist.github.com/405779
There have been benchmarks comparing curb with other libraries such as HTTPClient. The winner, in almost all categories, was HTTPClient. Plus, there have been some documented scenarios where curb does NOT work in multi-threaded situations.
Like you, I've had this experience. I ran curl as a system command in 20+ concurrent threads, and it was 10x faster than running curb in 20+ concurrent threads. No matter what I tried, this was always the case.
I've since switched to HTTPClient, and the difference is huge. It now runs as fast as 20 concurrent curl system commands and uses less CPU as well.
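For comparison, the "curl as a system command in many threads" approach can be approximated straight from the shell with xargs (a sketch; the list file, batch size, and process count are illustrative):
# 20 curl processes at a time, 50 URLs per process, each page saved under
# its remote file name
xargs -P 20 -n 50 curl --silent --remote-name-all < url_list.txt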
First let me say that I know almost nothing about Ruby.
What I do know is that Ruby is an interpreted language; it's not surprising that it's slower than heavily optimised code that's been compiled for a specific platform. Every file operation will probably have checks around it that curl doesn't. The "some other stuff" will slow things down even more.
Have you tried profiling your code to see where most of the time is being spent?
Stiivi,
any chance that Net::HTTP would suffice for simple downloading of HTML pages?
You didn't specify a Ruby version, but threads in 1.8.x are user-space threads, not scheduled by the OS, so the entire Ruby interpreter only ever uses one CPU/core. On top of that there is a Global Interpreter Lock, and probably other locks as well, interfering with concurrency. Since you're trying to maximize network throughput, you're probably underutilizing the CPUs.
Spawn as many processes as the machine has memory for, and limit the reliance on threads.
