ruby scrape with multiple ip addresses

ruby scrape with multiple ip addresses - ruby

I would like to know if it is possible for a Ruby program to possess multiple IP addresses? I am trying to download a lot of data from a site, but it is very slow with only 1 connection at a time.
I intend to multi-thread my program with each thread using its own IP address, but I do not know if it is possible in the first place, any help or hints would be greatly appreciated.

It is definitely possible for a machine or a program to have multiple IP addresses. You can even have multiple network adapters, and tie each of them to different physical connections.
However, it can get really hairy to maintain. The challenge for that is partly in the code, partly in the system maintenance, and partly in the networking required to make that happen.
A better approach that you can take is to design your program so that it can run distributed. As such, you can have several copies of it synchronized and doing the work in parallel. You can then scale it horizontally (build more copies) as required, and over different machines and connections if required.
EDIT: You mentioned that you cannot scale horizontally, and that you prefer to use multiple connections from the same machine.
It's very likely that for this you'll have to go a little bit lower in the network stack, developing yourself the connection through sockets in order to use specific network interfaces.
Check out an introduction to Ruby sockets.
Also, check out these related questions:
How does a socket know which network interface controller to use?
Binding to networking interfaces in ruby
Ruby: Binding a listening socket to a specific interface
Can I make ruby send network traffic over a specific iface?

Related

How to web request filtering - block one particular country

My MVC-5 website gets a lot of false registrations, or real and followed by experiments in escalation of privileges. I can write a request filter - but how to block web requests from one particular country? Are there publicly available list of IPs that I can block ... or what? How do people approach solving this issue?

Depending on the country you might have a little difficulty in blocking everything but there are lists out there (such as the one maintained by nirsoft). Usually it's better to block specific IP addresses where the bad behavior is originating by using software that watches for the behavior and dynamically blocks it. That way you're covered regardless of where it originates. Especially since managing IP address blacklists is a real pain. I've made use of IPTables on linux before for this and it works like a charm.

Existing Event Driven Network Protocols

I am building a set of programs that consist of multiple clients and a single server.
The clients are frequently pushing small packets of data to the server, which will validate the information (returning an error if the data is invalid), and process the received information. The information may then incur the firing of events, which clients will be subscribed to, allowing for clients to be instantly (or as close as possible) notified (along with a small amount of data).
I have some ideas about how to do this, but I am trying to avoid creating a protocol of my own, mainly as I'm sure it would take forever and I would probably make a few errors. So I was wondering if there are any existing protocols that I could implement into my system that would provide such functionality.
The number of clients will initially be quite small, but will be growing over time to potentially include 1000's of clients (with their own subscriptions), and several front end servers (each one handling a subset of subscriptions) parsing the information back and forth with back end servers for improved capability.
So, if anyone knows of any existing protocols that implement these requirements and functionality, that would be fantastic.
EDIT
I am currently looking at the XMPP protocol, and the JXTA protocol suite (for reference, and implement with another language). Both seem quite good and provide the necessary connectivity, but I have not had the opportunity to test each of them out in my environment, or if they are even suitable for what I am attempting.
Additionally, some of the network clients will be outside of the local network and operating over WAN. Security is not so much of an issue, but I need to take into account the increase latency of this, and firewall rules (local to the connection that is hosting the application and ISP firewalls) that could be blocking certain ports or transport protocols (I have read some text that said that some ISPs where blocking UDP packets, but not sure of how wide this goes. I can do it at home, the office, mobile, friends houses, etc and have yet to experience it myself).

I'm sorry if the following is not exactly what you're after but I am slightly confused by your use of the word 'protocol'. I understand a protocol to be a 'communication specification' only, where the implementation is left entirely to you. If that is the case I always find the the following graphic usefull, link.
If on the other hand you are looking for a solution which allows you to easily implement the networking side of your application, helping save time, then checkout the following network libraries, which implement their own custom protocol:
NetworkComms.Net
Lidgren
ZeroMQ
Disclaimer: I'm a developer for NetworkComms.Net

Max numbers of connections / threads on my TCP / IP server?

I am curious about whether my server would work better on Linux or Windows, from what I have read Windows only supports around 2,000 connections/threads while I have not seen much information about how many threads / connections Linux can handle.
Is there any advantages to using Linux over Windows other than stability / security for my TCP /IP server?
Thanks.

Threads and sockets are different resources, the limits for each will depend not just on Linux vs Windows but also which versions of each OS you are using. Also, if you're using a class library instead of raw socket or thread APIs, those might impose a specific limit. As an example early versions of CSocket in MFC created a hidden window for each socket, so you were effectively limited to the number of GDI resources on the system.

Either platform will be fine, and most apps will never get big enough to need more than a single server to run them anyway. Get your project done in whichever way is easier for you.

I would imagine that the primary concern when building a high-scale application is the experience of the engineers on your team, including operations engineers. By all means consider performance when selecting a platform, but the experience and preference of your development and operations engineers is probably more important - after all, they will need to maintain and operate the service respectively.
In any case, if you have a real need for a service with 2000 concurrent clients, it probably has some high availability requirement which means it can't be run on a single server anyway.

How to extract information from client/server communication with no documentation?

What are methods for undocumented client/server communication to be captured and analyzed for information you want and then have your program looking for this information in real time? For example, programs that look at online game client/server communication and get information and use it to do things like show location on a 3rd party map, etc.

Wireshark will allow you to inspect communication between the client-server (assuming you're running one of them on your machine). As you want to perform this snooping in your own application, look at WinPcap. Being able to reverse engineer the protocol is a whole other kettle of fish, mind.

In general, wireshark is an excellent recommendation for traffic/protocol analysis- however, you seem to be looking for something else:
For example, programs that look at online game client/server communication and get information and use it to do things like show location on a 3rd party map, etc.
I assume you are referring to multiplayer games and game servers?
If so, these programs are usually using a dedicated service connection to query the corresponding server for positional updates and other meta information on a different port, they don't actually intercept or inspect client/server communciations at realtime, and they don't really interfere with these updates, either.
So, you'll find that most game servers provide support for a very simply passive connection (i.e. output only), that's merely there for getting certain runtime state, which in turn is often simply polled by a corresponding external script/webpage.
Similarly, there's often also a dedicated administration interface provided on a different port, as well as another one that publishes server statistics, so that these can be easily queried for embedding neat stats in webpages.
Depending on the type of game server, these may offer public/anonymous use, or they may require certain credentials to access such a data port.
More complex systems will also allow you to subscribe only to specific state and updates, so that you can dynamically configure what data you are interested in.
So, even if you had complete documentation on the underlying protocol, you wouldn't really be able to directly inspect client/server communications without being in between these communications. This can however not be easily achieved. In theory, this would basically require a SOCKS proxy server to be set up and used by all clients, so that you can actually inspect the communications going on.
Programs like wireshark will generally only provide useful information for communications going on on your own machine/network, and will not provide any information about communications going on in between machines that you do not have access to.
In other words, even if you used wireshark to a) reverse engineer the protocol, b) come up with a way to inspect the traffic, c) create a positional map - all this would only work for those communications that you have access to, i.e. those that are taking place on your own machine/network. So, a corresponding online map would only show your own position.
Of course, there's an alternative: you can emulate a client, so that you are being provided with server-side updates from other clients, this will mostly have to be in spectator mode.
This in turn would mean that you are a passive client that's just consuming server-side state, but not providing any.
So that you can in turn use all these updates to populate an online map or use it for whatever else is on your mind.
This will however require your spectator/client to be connected to the server all the time, possibly taking up precious game server slots.
Some game servers provide dedicated spectator modes, so that you can observe the whole game play using a live feed. Most game servers will however automatically kick spectators after a certain idle timeout.

bandwidth and traffic simulator for web apps?

Can you suggest how to create a test environment to simulate various types of bandwidths and traffic in a web app?
Or maybe an open source program which does this against localhost?
I think this is a very important subject when programming web apps but it is not a usual topic, the only way i can imagine to create such kind of environment is to use some kind of proxy in a local network but before start looking into the squid documentation i would like to hear your suggestions.

if you're using apache you may want to take a look at apache ab

There are two approaches to shape network traffic to simulate a network link:
Run some software on the client or server that sits somewhere in the networking stack and shapes the traffic between the app and the network interface
Run the traffic shaping software on a dedicated machine with 2 network interfaces through which your traffic is routed
(2) is a better solution if you don't want to install software on the client or server (and possibly impact performance), but requires more hardware fiddling.
Some other features you might want to think about are what shaping parameters can be simulated. Most do delay and packet loss, some do jitter and bandwidth limiting as well. Some solutions can selectively filter traffic (for instance by port number, TCP or UDP etc).
Here is a list of some of the systems I've found:
Open Source or Freeware
DummyNet is an open source BSD Unix-based for dedicated devices. It is not clear if the software is being actively maintained
NistNet is an open source Linux-based system for dedicated devices. The software has not been actively maintained for several years.
Commercial
Apposite Technoligies sell dedicated hardware solutions for simulating WAN links, with a Web based GUI for configuring the settings and collecting traffic measurements
East Coast DataCom sell hardware dedicated simulators for simulating routers and modems
Itrinegy offer both dedicated device solutions, and solutions for running on clients or servers.
Network FX offer several dedicated device products for simulating network impairments between the client & server
NetLimiter is a client side system that allows throttling of individual applications, and includes a firewall.
Shunra Software offer a range of products, from high end enterprise WAN simulation and testing, to a simple client-resident emulator.

The closest I can think of is doing something similar with VEDekstop from Shunra..
Simulating High Latency and Low Bandwidth in Testing of Database Applications
Shunra VE Desktop Standard is a Windows-based client software solution that simulates a wide area network link so that you can test applications under a variety of current and potential network conditions – directly from your desktop.

I wrote a php script awhile back which used CURL to run a sequence of page requests against my server which represented a typical use scenario. I had it output the times that it took for the server to respond to each of the requests. I then had another script which spawned a bunch of these test case scripts simultaneously for a sustained period and correlated the results into a file which I could then look at in a spreadsheet to see average times. This way I could simulate the number of users hitting the site that I wanted. The limitations are that you need to run the test script on a different server to the web server and that the client machine can become too loaded to give meaningful results past a certain point. I've since left the job otherwise I would paste the scripts here.

If you are running a Linux box as your server, Linux box as your client, or have the capability to put (perhaps a VM) a Linux router between your client and server, you can use NetEm.
NetEm is a Linux TC (Traffic Control) discipline which can delay (i.e. add latency) packets leaving a host. Although it's tricky to set up clever rules (e.g. add latency to some traffic, not to others), it's easy to add a simple "delay everything leaving the interface by 50ms" type rules and some recipes are provided.
By sticking a Linux VM between your client and server, you can simulate as much latency as you like. And you can turn it on and off dynamically. Linux has other TC disciplines which can be combined with NetEm to restrict bandwidth (but the script to set this up can be somewhat complicated). NetEm can also randomly drop packets.
I use it and it works a treat :)

Web Application Stress Tool (WAST) from Microsoft is what you need.
http://www.microsoft.com/downloads/details.aspx?familyid=e2c0585a-062a-439e-a67d-75a89aa36495&displaylang=en

I haven't used it for years (lack of need, not because I'd found anything else), but xat webspeed would be the first thing I would point toward

As other people have mentioned, Apache's ab (comes with Apache, so you probably have it already) is good.
Other good options are:
HP's LoadRunner Apache
Jakarta's JMeter
Tsung (if you want to get your erlang on)
I personally like ab and JMeter the best.

We use Loadrunner to do bandwidth and traffic simulation in our App. Loadrunner is can start agents on various machines and you can simulate one machine as running on dialup modem v/s another on DSL v/s another on Cable internet.
We also use Loadrunner to simulate various kinds of traffic conditions from 10 user run to 500 user run. We can also insert think times in the script and simulate a real user executing the http request. The best part is that it comes with a recording studio where it will plug in with Internet explorer and you can record the whole scenario/Usecase that can be as simple as hitting one page to a full blown 50-60 page script or more.

i found this little java program that works great : sloppy
yet not a proffesional solution but it works for simple tests, i guess it uses java streams and buffers to slow down the connection .

Have you looked at Tsung? It's a great utility for seeing if your website will scale in event of attack, I mean massive popularity. We use it for our web frontend, and our internal systems too.

If you're interested in performing your tests out of your browser, there is also a really great Firefox plug-in.

Do not forget about Wanulator (http://www.wanulator.de/).
The name Wanulator comes from "WAN" and "simulator. This pretty much describes what the software does: It simulates different Internet conditions such as delay or packet loss. Furthermore it simulates user access line speeds e.g. modem, ISDN or ADSL.
Wanulator is currently packaged as a Linux boot CD based on SLAX. This will give you a full out of the box experience. You can turn any PC into a test-system within a blink - just by booting the Wanulator CD. The package already includes useful client SW such as web-browser and network sniffer (Wireshark). Nevertheless if the PC has 2 network interfaces the system can run as an intermediate system between your server and your client - as a switch - without any configuration hassles.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio