Anomaly detection using MapReduce - Hadoop

I'm new to Apache Hadoop and really looking forward to exploring more of its features. After the basic word count example I wanted to up the ante a little bit, so I sat down with this problem statement, which I got from the Hadoop in Action book.
"Take a web server log file . Write a MapReduce program to
aggregate the number of visits for each IP address. Write another MapReduce
program to find the top K IP addresses in terms of visits. These frequent
visitors may be legitimate ISP proxies (shared among many users) or they
may be scrapers and fraudsters (if the server log is from an ad network)."
Can anybody help me out with how I should start? It's kind of tough to write our own code, since Hadoop only gives word count as a basic example to kick-start with.
Any help greatly appreciated. Thanks.

Write a MapReduce program to aggregate the number of visits for each IP address.
The word count example is not much different from this one. In word count, the map emits ("word", 1) after extracting each word from the input; in the IP address case, the map emits ("192.168.0.1", 1) after extracting the "192.168.0.1" IP address from the log line.
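As a rough sketch only (assuming the IP address is the first whitespace-separated field of each log line, as in the common/combined log format; the class and field names are made up for illustration), the mapper could look like this:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class IpCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text ip = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assumption: the client IP is the first field of the log line.
        String[] fields = line.toString().split("\\s+");
        if (fields.length > 0 && !fields[0].isEmpty()) {
            ip.set(fields[0]);          // e.g. "192.168.0.1"
            context.write(ip, ONE);     // same shape as word count: (ip, 1)
        }
    }
}

The reducer is then the same summing reducer as in word count; it just adds up the 1s per IP address.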
Write another MapReduce program to find the top K IP addresses in terms of visits.
After the first MapReduce job completes, there will be many output files (one per reducer) with content like this:
<visits> <ip address>
All these files have to be merged using the getmerge option, which concatenates them and copies the result to the local filesystem.
Then the local file has to be sorted with the sort command on the 1st column, which is the number of visits.
Then, using the head command, you can take the first n lines to get the top n IP addresses by visits.
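A rough sketch of those three steps (assuming the output format shown above, with the visit count in the first column; the HDFS path and the value of n are only illustrative):

hadoop fs -getmerge /user/me/ip-counts merged-counts.txt    # merge the part files and copy them locally
sort -k1,1 -nr merged-counts.txt > sorted-counts.txt        # numeric sort on the visit count, descending
head -n 10 sorted-counts.txt                                # top 10 IP addresses by visits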
There might be a better approach for the second MR Job.

Related

Understanding file handling in Hadoop

I am new to the Hadoop ecosystem and have only a basic idea of it. Please help with the following queries to start with:
If the file I am trying to copy into HDFS is very big and cannot be accommodated on the available commodity hardware in my Hadoop cluster, what can be done? Will the copy wait until space frees up, or will there be an error?
How can I find out well in advance, or predict, that the above scenario will occur in a Hadoop production environment where we continuously receive files from outside sources?
How do I add a new node to a live HDFS cluster? There are many methods, but which files do I need to alter?
How many blocks does a node have? Suppose a node is a machine with 500 GB of storage, 1 GB of RAM and a dual-core processor. In this scenario, is it roughly 500 GB / 64 MB, assuming each block is configured to be 64 MB?
If I copyFromLocal a 1 TB file into HDFS, which portion of the file will be placed in which block on which node? How can I know this?
How can I find out which record/row of the input file ends up in which of the multiple splits created by Hadoop?
What is the purpose of each of the configured XML files (core-site.xml, hdfs-site.xml and mapred-site.xml)? In a distributed environment, which of these files should be placed on all the slave DataNodes?
How can I know how many map and reduce tasks will run for any read/write activity? Will a write operation always have 0 reducers?
Apologies for asking some basic questions. Kindly suggest ways to find answers to all of the above queries.

HDFS: How to know from which host we get a file

I use the command line and I want to know from which host I get a file (or which replica I get).
Normally it should be the one nearest to me, but I changed a policy for the project, so I want to check the final results to see whether my new policy works correctly.
The following command does not give any information:
hadoop dfs -get /file
And the next one gives me only the replicas' locations, but not which one is preferred for the get:
hadoop fsck /file -files -blocks -locations
HDFS abstracts this information away, as it is not very useful for users to know where they are reading from (the filesystem is designed to stay out of your way as much as possible). Typically, the DFSClient picks up the data from the hosts in the order they are returned to it (moving on to an alternative in case of a failure). The list of hosts is sorted by the NameNode for data or rack locality, and that is how the default scenario works.
While the proper answer to your question would be to write good test cases that can both simulate and assert this, you can also run your program with the Hadoop logger set to DEBUG to check the IPC connections made to the various hosts (including DataNodes) when reading the files, and go through these to verify manually that your host-picking works as intended.
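One rough way to do that from the command line (HADOOP_ROOT_LOGGER is a standard knob in the Hadoop launcher scripts; the local file names and the grep pattern here are only illustrative, and DEBUG output typically lands on stderr via the console appender):

export HADOOP_ROOT_LOGGER=DEBUG,console      # turn on DEBUG logging for the client
hadoop dfs -get /file localcopy 2> client-debug.log
grep -i "connect" client-debug.log           # look for which DataNodes were contacted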
Another way would be to run your client through a debugger and observe the connections that are finally made to retrieve the data (i.e. after the NameNode RPCs).
Thanks,
We finally used network statistics with a simple test case to find out from where Hadoop takes the replicas.
But the easiest way is to print the nodes array modified by this method:
org.apache.hadoop.net.NetworkTopology pseudoSortByDistance( Node reader, Node[] nodes )
As we expected, which replica is used for the get is based on the result of this method: the first items are preferred. Normally the first item is taken, unless there is an error with that node. For more information about this method, see Replication.

Processing logs in Amazon EMR with or without using Hive

I have a lot of log files in my EMR cluster at the path 'hdfs:///logs'. Each log entry spans multiple lines but has starting and ending markers to demarcate two entries.
Now,
a. Not all entries in a log file are useful
b. The entries which are useful need to be transformed, and the output needs to be stored in an output file, so that I can efficiently query (using Hive) the output logs later.
I have a Python script which can simply take a log file and do parts a. and b. mentioned above, but I have not written any mappers or reducers.
Hive takes care of mappers and reducers for its queries. Please tell me if and how it is possible to use the Python script to run it over all the logs and save the output in 'hdfs:///outputlogs'?
I am new to MapReduce and have seen some examples of word count, but all of them have a single input file. Where can I find examples that have multiple input files?
Here I see that you have a two-fold issue:
Having more than one file as input
The same word count example will work if you pass in more than one file as input. In fact, you can very easily pass a folder name as input instead of a file name, in your case hdfs:///logs.
You may even pass a comma-separated list of paths as input; for this, instead of using the following:
FileInputFormat.setInputPaths(conf, new Path(args[0]));
You may use the following:
FileInputFormat.setInputPaths(job, args[0]);
Note that passing the comma-separated list as args[0] will be sufficient, because this overload of setInputPaths takes a comma-separated String of paths rather than a single Path.
How to convert your logic to MapReduce
This does have a steep learning curve, as you will need to think in terms of keys and values. But I feel that you can just put all the logic in the mapper itself and use an IdentityReducer, like this:
conf.setReducerClass(IdentityReducer.class);
If you spend some time reading examples from the following locations, you should be in a better position to make these decisions:
hadoop-map-reduce-examples ( http://hadoop-map-reduce-examples.googlecode.com/svn/trunk/hadoop-examples/src/ )
http://developer.yahoo.com/hadoop/tutorial/module4.html
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
http://kickstarthadoop.blogspot.in/2011/04/word-count-hadoop-map-reduce-example.html
The long-term correct way to do this is, as Amar stated, to write a MapReduce job for it.
However, if this is a one-time thing and the data isn't too enormous, it might be simplest/easiest to do it with a simple bash script, since you already have the Python script:
hadoop fs -text /logs/* > input.log
python myscript.py input.log output.log
hadoop fs -copyFromLocal output.log /outputlogs
rm -f input.log output.log
If this is a repeated process - something you want to be reliable and efficient - or if you just want to learn to use MapReduce better, then stick with Amar's answer.
If you have the logic already written and you want to do parallel processing using EMR and/or vanilla Hadoop, you can use Hadoop Streaming: http://hadoop.apache.org/docs/r0.15.2/streaming.html. In a nutshell, a script that reads data from stdin and writes its output to stdout can become a mapper.
Thus you will run the processing of the data in HDFS using the cluster, without needing to repackage your code.
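A rough sketch of such a streaming invocation (assuming the script is adapted to read log lines from stdin and write the transformed entries to stdout; the jar location and option values vary by Hadoop version and are only illustrative):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input hdfs:///logs \
    -output hdfs:///outputlogs \
    -mapper "python myscript.py" \
    -file myscript.py \
    -numReduceTasks 0    # map-only: the script does all the work, no reducer needed

Note that the default streaming input format feeds the mapper one line at a time, so the multi-line log entries would need to be stitched back together in the script (or handled with a custom input format).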

Hadoop Load and Store

When I try to run a Pig script which has two "store" statements to the same file, like this:
store Alert_Message_Count into 'out';
store Warning_Message_Count into 'out';
It hangs; I mean it does not proceed after showing 50% done.
Is this wrong? Can't we store both of the results in the same file (folder)?
HDFS does not have an append mode. So in most cases where you are running MapReduce programs, the output file is opened once, data is written, and then it is closed. Given this approach, you cannot write data simultaneously to the same file.
Try writing to separate files and check whether the MapReduce programs still hang. If they do, there are some other issues.
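For example, a quick check in the same Pig script, storing each relation into its own output directory (the 'out/alerts' and 'out/warnings' paths are only illustrative):

store Alert_Message_Count into 'out/alerts';
store Warning_Message_Count into 'out/warnings';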
You can look at the results and the MapReduce logs to analyze what went wrong.
[Edit:]
You cannot write to the same file or append to an existing file; the HDFS append feature is a work in progress.
To work around this you can do two things:
1) If Alert_Message_Count and Warning_Message_Count have the same schema, you could use union, as suggested by Chris.
2) Do post-processing when the schemas are not the same, i.e. write a MapReduce program to merge the two separate outputs into one.
Normally Hadoop MapReduce won't allow you to save job output to a folder that already exists, so I would guess that this isn't possible either (seeing as Pig translates the commands into a series of M/R steps) - but I would expect some form of error message rather than it just hanging.
If you open the cluster job tracker and look at the logs for the task, does the log contain anything of note that can help diagnose this further?
It might also be worth checking with the Pig mailing lists (if you haven't already).
If you want to append one dataset to another, use the union keyword:
grunt> All_Count = UNION Alert_Message_Count, Warning_Message_Count;
grunt> store All_Count into 'out';

Using Hadoop to "bucket" data out with a single run

Is it possible to use one Hadoop job run to output data to different directories based on keys?
My use case is server access logs. Say I have them all together, but I want to split them out based on some common URL patterns.
For example,
Anything that starts with /foo/ should go to /year/month/day/hour/foo/file
Anything that starts with /bar/ should go to /year/month/day/hour/bar/file
Anything that doesn't match should go to /year/month/day/hour/other/file
There are two problems here (from my understanding of MapReduce): first, I'd prefer to iterate over my data only once, instead of running one "grep" job per URL type I'd like to match. How would I split up the output, though? If I key the first with "foo", the second with "bar", and the rest with "other", don't they all still go to the same reducers? How do I tell Hadoop to output them into different files?
The second problem is related (maybe the same?): I need to break the output up by the timestamp in the access log line.
I should note that I'm not looking for code to solve this, but rather for the proper terminology and a high-level solution to look into. If I have to do it with multiple runs, that's all right, but I can't run one "grep" for each possible hour (to make a file for that hour); there must be another way.
You need to partition the data just as you describe. Then you need to have multiple output files. See here (Generating Multiple Output files with Hadoop 0.20+).
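One common way to do this in a single job is the MultipleOutputs helper from the org.apache.hadoop.mapreduce API, which lets the reducer write records under different base paths inside the job's output directory. A minimal sketch, assuming the map phase already emits a key that encodes the bucket and hour (e.g. "2013/01/15/10/foo") with the original log line as the value; the class name, key layout and paths here are illustrative, not a definitive implementation:

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class BucketReducer extends Reducer<Text, Text, NullWritable, Text> {
    private MultipleOutputs<NullWritable, Text> out;

    @Override
    protected void setup(Context context) {
        out = new MultipleOutputs<NullWritable, Text>(context);
    }

    @Override
    protected void reduce(Text bucket, Iterable<Text> lines, Context context)
            throws IOException, InterruptedException {
        // The key (e.g. "2013/01/15/10/foo") becomes a sub-directory under the
        // job's output path, so each bucket/hour lands in its own file.
        for (Text line : lines) {
            out.write(NullWritable.get(), line, bucket.toString() + "/part");
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        out.close();   // flush and close all the side outputs
    }
}

The driver still sets the usual output key/value classes and output path. With this layout, a single pass over the logs produces the /year/month/day/hour/{foo,bar,other}/ structure described above; the per-pattern "grep" runs are replaced by the mapper deciding which bucket key to emit for each line.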
