Flume to fetch logs over network - hadoop

I have been working with Flume to fetch logs from a server machine into HDFS. I was able to achieve this when the server and client machines are connected to the same network. But how can I achieve the same if the server and client are on different networks?
Do I need to write a custom source for this? [I just checked the Twitter example from Cloudera, in which they use their own custom source to fetch tweets.]
Any help would be greatly appreciated.
Thanks,
Kalai

If you have a multihomed host joining the two otherwise non-talking networks you'd like to ship logs across, you can run a Flume agent there to bridge logs arriving from one network and deliver them to the other. Your multihomed host then acts as a sort of proxy. I don't know if this is necessarily a good idea, as that host is probably already busy doing other things if it's the only link between the networks. But if you can set this up, you won't need custom sinks or sources.
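To picture the bridge, here is a minimal sketch of what such an agent's configuration might look like (Flume 1.x properties syntax; the agent name, addresses and ports are all placeholders):

```
# bridge agent running on the multihomed host
bridge.sources = in
bridge.channels = mem
bridge.sinks = out

# listen for Avro events arriving from agents on network A
# (bind to the interface facing network A; address is a placeholder)
bridge.sources.in.type = avro
bridge.sources.in.bind = 10.0.0.5
bridge.sources.in.port = 4141
bridge.sources.in.channels = mem

bridge.channels.mem.type = memory
bridge.channels.mem.capacity = 10000

# forward the same events to a collector reachable on network B (placeholder address)
bridge.sinks.out.type = avro
bridge.sinks.out.hostname = 192.168.1.20
bridge.sinks.out.port = 4141
bridge.sinks.out.channel = mem
```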
If you have two disjoint networks that can both see the internet, you can have one agent post to a web server over HTTP (or TCP for that matter, but it's more work), and another fetch it from the same website. You'll need to write two custom agents (source & sink) for that to work in a performant, reliable and secure fashion, not to mention the web service itself.
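A rough illustration of that relay idea (not the Flume source/sink API itself; the relay URL and payload format are made up for the sketch):

```python
import time
import requests  # third-party HTTP client

RELAY_URL = "https://relay.example.com/logs"  # hypothetical web service both networks can reach

def ship(lines):
    """Network A side: push a batch of log lines to the relay."""
    requests.post(RELAY_URL, json={"lines": lines}, timeout=10).raise_for_status()

def fetch_forever(handle, poll_seconds=30):
    """Network B side: poll the relay and hand each new line to a local sink."""
    while True:
        resp = requests.get(RELAY_URL, params={"since": int(time.time()) - poll_seconds},
                            timeout=10)
        resp.raise_for_status()
        for line in resp.json().get("lines", []):
            handle(line)
        time.sleep(poll_seconds)
```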
Finally, if you have two networks that are completely disconnected (with an air gap), then you may consider writing a custom sink that will, for example, autodetect an inserted tape and copy logs to the tape. Then you take the tape, walk over to another network, plug it in, and have another agent there autodetect it as well and ingest the data :)

Flume agents need to be able to connect to each other in order to transport events. This means they need to be on the same network.
I'm not sure I understand your question. Why would you expect it to work at all?

Related

How to remotely connect to a local Elasticsearch server - in a secure way, of course

I have been playing around with creating a web app that uses Elasticsearch to perform queries. Currently everything is still pre-production and thus runs locally; let's say Elasticsearch runs at 123.123.123.123:9200. All fun and games, but once the web application (React) is finished, it should be able to send its queries to that currently local Elasticsearch database.
I have been reading around on how to get this done in a proper and, above all, secure way. My summary so far:
"First off, exposing an Elasticsearch node directly to the internet without protections in front of it is usually bad, bad news." (see here: Accessing elasticsearch from a public domain name or IP).
Another interesting blog I found: https://code972.com/blog/2017/01/dont-be-ransacked-securing-your-elasticsearch-cluster-properly-107.
The problem with the above-mentioned sources is that they are a bit older, so I am not sure whether they are still up to date.
Therefore the following questions:
Is nginx sufficient to act as a secure middleman, passing the queries from the end users to Elasticsearch?
What is the difference, at that point, compared to writing a backend for the React application (e.g. using Node and Express)?
What is the added value, taking into account the built-in security features of Elasticsearch (usernames, passwords, API keys, certificates, HTTPS, ...)?
I am also reading a lot about using a VPN or tunneling. I have the impression that these solutions are geared more towards a corporate, collaborative setup. Say I am running my front end on a live server: I could use tunneling to show my work to colleagues or an employer. A VPN seems more realistic for allowing employees (wish I had them, I'm just a CS student) to access, for example, the database within my private network (say an employee needs to use Kibana to change an API key; I'm just making something up here); he or she could use a VPN connection for that.
Thank you so much for helping me clarify the above-mentioned points!
TLS, authorisation and access control are free for the Elastic Stack, and have been for a while. I'd start by looking at the docs, as that's an easy way to natively secure your cluster.
As for nginx: it can be useful for rate limiting or for blocking specific queries, for example, but it's another thing to configure and maintain.
From a client point of view it only really matters if you are using the official Elasticsearch clients and nginx changes the way the API responds to the client (e.g. path rewrites, rate limiting).
The added value of the built-in security is that it's free, it's native, and it's easy to manage via Kibana.
Regarding a VPN or tunneling: I'd follow the docs to secure Elasticsearch first and see whether you need this at some point in the future. It would be handled outside Elasticsearch anyway, and you'd still want to secure Elasticsearch itself.
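For instance, once the built-in security is enabled, a client can talk to the cluster over HTTPS with an API key. A minimal sketch using the official Python client (the host, certificate path and key value are placeholders):

```python
from elasticsearch import Elasticsearch  # official client: pip install elasticsearch

# Hypothetical endpoint and credentials; in a real setup the API key is created
# via Kibana or the security API, and the CA certificate comes from your cluster.
es = Elasticsearch(
    "https://123.123.123.123:9200",
    ca_certs="certs/http_ca.crt",      # trust the cluster's certificate authority
    api_key="BASE64_ENCODED_API_KEY",  # placeholder
)

print(es.search(index="my-index", query={"match_all": {}}))
```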
The problem with exposing Elasticsearch nodes directly to the internet is, in principle, the larger attack surface. You should follow the rule of exposing the least possible "surface" of your system to the internet.
A good practice is to hide from the internet whatever doesn't need to be there, even if it is well protected. Any exposed service typically starts attracting cyber attacks within ~20 minutes (see a showcase).
So I suggest you install a private network, such as a traditional VPN or an SDP product such as Shieldoo Mesh.

ElasticSearch replication home/server

I am running a local Elasticsearch server in my own home, but would like to access its content from outside. Since I am on a dynamic IP and, besides that, do not feel comfortable opening ports to the outside, I would like to rent a VPS somewhere, set up Elasticsearch there, and let that server be a read-only copy of the one I have at home.
As I understand it, this should be possible; however, I have been unsuccessful at creating any usable setup that lets another server act as a read-only version of my home ES server.
Can anyone point me to a piece of information, or write a guide, that would help me set this up? I am fairly familiar with using ES, but my setup skills are still shaky.
As I understand it, this should be possible
It might be possible with some workarounds, but it's definitely not built for that:
One cluster needs to be in one physical region; mainly because of latency and the stability of the network connection.
There are no read-only versions. You could only allow read access to a node (via a reverse proxy or the security plugin), but that's only a workaround.
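If you go the security route, the workaround amounts to a role that only grants read privileges plus a user bound to that role. A hedged sketch against the Elasticsearch security API (the host, certificate, names and passwords are placeholders):

```python
import requests

ES = "https://vps.example.com:9200"   # hypothetical VPS node
ADMIN = ("elastic", "CHANGEME")       # superuser credentials (placeholder)
CA = "ca.crt"                         # cluster CA certificate (placeholder)

# Role that can only read, never write or manage indices
requests.put(
    f"{ES}/_security/role/readonly_all", auth=ADMIN, verify=CA,
    json={"indices": [{"names": ["*"], "privileges": ["read", "view_index_metadata"]}]},
).raise_for_status()

# User for the outside world, limited to that role
requests.put(
    f"{ES}/_security/user/reader", auth=ADMIN, verify=CA,
    json={"password": "CHANGEME_TOO", "roles": ["readonly_all"]},
).raise_for_status()
```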

OpenNMS passive node log scanning

I would like to set up an OpenNMS monitoring system where the OpenNMS server does all the work, because I cannot modify the nodes that must be scanned. I can, however, SSH and FTP to those nodes.
I am thinking about using some plugin that will SSH in and tail the logs.
Any suggestions for a plugin I could use, or a good tutorial on how to write my own?
OpenNMS is at its best when you have SNMP access to your nodes. SNMP gives you the ability to monitor, collect and graph; without it, I believe you will need to create a custom collector in order to collect and graph useful information, if that is your aim. There are some standard collectors:
http://www.opennms.org/wiki/Docu-overview#Data_Collection
http://www.opennms.org/wiki/Documentation:Features_DataCollection
With regard to generic approaches to collecting data, you could use expect scripts. Alternatively, you could write some scripts on the clients (if you have the relevant access) that collect data which could then be retrieved by the server. You can use key-based SSH connections to ease the authentication burden, as long as you look after your keys.
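A rough sketch of the SSH-and-tail approach, using the Paramiko library (the hostname, user, key path and log path are made up for the example):

```python
import paramiko  # pip install paramiko

def tail_remote_log(host, user, key_path, log_path="/var/log/syslog"):
    """SSH to the node with a key and stream new log lines back."""
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())  # pin host keys properly in production
    client.connect(host, username=user, key_filename=key_path)
    _, stdout, _ = client.exec_command(f"tail -n 0 -F {log_path}")  # -F follows across log rotation
    for line in stdout:
        yield line.rstrip()

for line in tail_remote_log("node1.example.com", "monitor", "/home/opennms/.ssh/id_ed25519"):
    print(line)  # here you would parse the line and raise an OpenNMS event instead
```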

Personal Internet use monitoring

How could a (Windows) desktop application be created to monitor the amount of time spent on a particular website?
My first idea was to play with the hosts file to intercept requests, log them, and proxy them on. This feels a bit clunky, and I suspect my program would look like malware.
I feel like there must be a smarter way. Any ideas?
There is a tool similar to what you are looking for called K-9 Web Protection. It is mostly used by parents to monitor what their kids are up to on the internet. I installed it on my niece's computer with good results and praise: it blocks sites, filters content and restricts internet times. This may be over the top for your needs, but it is worth a shot, as you can see which sites were visited.
The other option is to use a dedicated firewall/monitoring solution such as IPCop, a Linux-based distribution whose sole purpose is to provide a proxy, a stateful packet inspection (SPI) firewall and an intrusion detection system (IDS).
Hope this helps,
Best regards,
Tom.
You could do this by monitoring active connections via netstat. If you need more detailed data, you can install the Windows Packet Capture Library (WinPcap) and capture anything about network use; inside your desktop app, identify the traffic that corresponds to 'spending time' on a website (which might just be GET requests in your case, but I don't know) and record whatever statistics you need.
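A minimal sketch of the connection-polling approach, using the psutil library (mapping remote IPs back to site names is left out, and the poll interval is arbitrary):

```python
import time
from collections import defaultdict

import psutil  # pip install psutil

POLL = 5  # seconds between samples
seconds_per_remote = defaultdict(int)

while True:
    for conn in psutil.net_connections(kind="tcp"):
        if conn.status == psutil.CONN_ESTABLISHED and conn.raddr:
            # Credit this sample's duration to the remote address; a real tool would
            # resolve the IP to a hostname/site and filter on ports 80/443.
            seconds_per_remote[conn.raddr.ip] += POLL
    time.sleep(POLL)
```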
Route the traffic through a scriptable proxy and change the browser settings to point to that proxy.
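For example, with mitmproxy as the scriptable proxy, a small addon can log which host each request goes to (run it with `mitmdump -s track_time.py` after pointing the browser at the proxy; the file name and the 'time spent' heuristic are just illustrative):

```python
# track_time.py -- hypothetical mitmproxy addon that records when each site is requested
import time
from mitmproxy import http

last_seen: dict[str, float] = {}  # host -> timestamp of the most recent request

def request(flow: http.HTTPFlow) -> None:
    host = flow.request.pretty_host
    last_seen[host] = time.time()
    print(f"{time.strftime('%H:%M:%S')}  {host}")
    # Time spent per site could be estimated from the gaps between consecutive requests.
```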

How to extract information from client/server communication with no documentation?

What are the methods for capturing and analysing undocumented client/server communication, extracting the information you want from it, and then having your own program look for that information in real time? For example, programs that watch an online game's client/server traffic, pull information out of it, and use it to do things like show player locations on a third-party map.
Wireshark will allow you to inspect the communication between client and server (assuming you're running one of them on your machine). Since you want to perform this snooping from your own application, look at WinPcap. Being able to reverse engineer the protocol is a whole other kettle of fish, mind.
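If you do end up capturing the traffic yourself, here is a minimal sketch with Scapy (which sits on top of WinPcap/Npcap on Windows) that grabs packets on a port you believe the game uses; the port number is a placeholder:

```python
from scapy.all import Raw, sniff  # pip install scapy; needs WinPcap/Npcap on Windows

GAME_PORT = 27015  # placeholder: whatever port the client/server traffic uses

def show(pkt):
    if Raw in pkt:
        payload = bytes(pkt[Raw].load)
        print(len(payload), payload[:32].hex())  # dump the first bytes for protocol analysis

# Requires administrator/root privileges to open the capture device
sniff(filter=f"udp port {GAME_PORT} or tcp port {GAME_PORT}", prn=show, store=False)
```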
In general, Wireshark is an excellent recommendation for traffic/protocol analysis; however, you seem to be looking for something else:
For example, programs that look at online game client/server communication and get information and use it to do things like show location on a 3rd party map, etc.
I assume you are referring to multiplayer games and game servers?
If so, these programs usually use a dedicated service connection on a different port to query the corresponding server for positional updates and other meta information; they don't actually intercept or inspect client/server communications in real time, and they don't interfere with those updates either.
So you'll find that most game servers support a very simple passive connection (i.e. output only) that is merely there for exposing certain runtime state, which in turn is often simply polled by a corresponding external script/webpage.
Similarly, there's often also a dedicated administration interface provided on a different port, as well as another one that publishes server statistics, so that these can be easily queried for embedding neat stats in webpages.
Depending on the type of game server, these may offer public/anonymous use, or they may require certain credentials to access such a data port.
More complex systems will also allow you to subscribe only to specific state and updates, so that you can dynamically configure what data you are interested in.
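As a concrete example of such a data port: Source-engine game servers answer a simple UDP "A2S_INFO" query with their name, map and player counts. A rough sketch (the server address is a placeholder, and newer servers may first reply with a challenge that has to be echoed back):

```python
import socket

SERVER = ("game.example.com", 27015)  # placeholder address and port
A2S_INFO = b"\xff\xff\xff\xffTSource Engine Query\x00"

with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
    s.settimeout(3.0)
    s.sendto(A2S_INFO, SERVER)
    data, _ = s.recvfrom(4096)
    if data[4:5] == b"A":  # challenge response: resend the query with the challenge appended
        s.sendto(A2S_INFO + data[5:9], SERVER)
        data, _ = s.recvfrom(4096)
    print(data[:64].hex())  # the reply encodes server name, map, player count, etc.
```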
So, even if you had complete documentation on the underlying protocol, you wouldn't really be able to directly inspect client/server communications without sitting in between them, which is not easily achieved. In theory, it would basically require a SOCKS proxy server to be set up and used by all clients so that you can actually inspect the traffic passing through.
Programs like Wireshark will generally only provide useful information about communications happening on your own machine/network, and will not provide any information about communications between machines that you do not have access to.
In other words, even if you used wireshark to a) reverse engineer the protocol, b) come up with a way to inspect the traffic, c) create a positional map - all this would only work for those communications that you have access to, i.e. those that are taking place on your own machine/network. So, a corresponding online map would only show your own position.
Of course, there's an alternative: you can emulate a client so that the server provides you with the updates it sends about other clients; this will mostly have to be in spectator mode.
Your emulated client is then a passive client that only consumes server-side state without providing any.
You can in turn use all of those updates to populate an online map, or for whatever else you have in mind.
This will however require your spectator/client to be connected to the server all the time, possibly taking up precious game server slots.
Some game servers provide dedicated spectator modes, so that you can observe the whole game play using a live feed. Most game servers will however automatically kick spectators after a certain idle timeout.
