I have 3 FreeSWITCH boxes (FS's). The 1st FS always 'connects' to the 2nd FS, and the 2nd FS always 'redirects' to the 3rd FS. Could anyone please help me write the dialplan code?
I could only find a guide for routing a call through two FreeSWITCH boxes:
https://wiki.freeswitch.org/wiki/Connect_Two_FreeSWITCH_Boxes
Also, calls are only ever initiated at the 1st FS and received at the 3rd FS, never the other way around (but RTP packets have to flow in both directions).
Is it possible to write the output of a MapReduce job directly into the data that was the input of that same job?
Thanks!
I suppose your job produces something new even when dealing with the same data (maybe it takes the execution timestamp into account, or it accesses an external service).
You could try to set the same path for both input and output data:
FileInputFormat.addInputPath(job, new Path(configuration.get("/path/to/data")));
FileOutputFormat.setOutputPath(job, new Path(configuration.get("/path/to/data")));
Since the mappers write their data to a temporary directory first, it could work (caveat: I have never tried it!).
As you mentioned, you don't want duplicate data, which suggests you don't want to perform any further operation/analysis on the data in DFS; MapReduce is only needed when analyzing data, so in that case you can simply read the existing data from the same location repeatedly.
Note: if you are using a tool like Pig/Hive, you need to keep a copy/history of the previous data, as Pig/Hive requires you to clear the location before/after processing. That history location can then be used to read the same data again. :)
We have two clusters, and our requirement is to pull data from one cluster to the other.
The only option available to us is to pull the data through WebHDFS.
Unfortunately, through WebHDFS we can pull only one file at a time, and that requires two commands to be executed for every single file.
My question is simple: is there a way, through WebHDFS, to pull the data of an entire directory?
**Ex:**
**directory structure in the cluster:**
dir1
file1
file2
file3
**currently observed:**
for every file, i.e. file1, file2 and file3, I need to execute two commands to get the data.
**Problem statement:**
Is there a way, through WebHDFS, to get all the files (file1, file2 and file3) from dir1 in a single call?
Can someone please help me with this?
NOTE: DistCp is not a workable option for us due to security reasons.
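For what it's worth, here is a minimal sketch of how such a pull could be scripted against the WebHDFS REST API, assuming an unsecured cluster with simple authentication; the host, port (50070), user name and paths are placeholders, and the directory is assumed to contain only plain files. It asks the NameNode for the directory listing with op=LISTSTATUS and then issues one op=OPEN request per file, so it is still one listing call plus one request per file, but at least it runs as a single program instead of two manual commands per file.

import java.io.FileOutputStream;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WebHdfsDirPull {

    // Read an HTTP response body into a string (fine for the small JSON listing).
    static String get(String url) throws Exception {
        HttpURLConnection c = (HttpURLConnection) new URL(url).openConnection();
        StringBuilder sb = new StringBuilder();
        try (InputStream in = c.getInputStream()) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) > 0) sb.append(new String(buf, 0, n, "UTF-8"));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        String base = "http://namenode:50070/webhdfs/v1";   // placeholder host/port
        String dir  = "/dir1";                               // placeholder path
        String user = "user.name=hadoopuser";                // placeholder user

        // One LISTSTATUS call returns metadata for every entry in the directory.
        String listing = get(base + dir + "?op=LISTSTATUS&" + user);

        // Crude extraction of the entry names from the JSON response.
        List<String> names = new ArrayList<>();
        Matcher m = Pattern.compile("\"pathSuffix\":\"([^\"]+)\"").matcher(listing);
        while (m.find()) names.add(m.group(1));

        // One OPEN call per file; for a GET, HttpURLConnection follows the
        // 307 redirect from the NameNode to the serving DataNode automatically.
        for (String name : names) {
            URL open = new URL(base + dir + "/" + name + "?op=OPEN&" + user);
            HttpURLConnection c = (HttpURLConnection) open.openConnection();
            try (InputStream in = c.getInputStream();
                 FileOutputStream out = new FileOutputStream(name)) {
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) > 0) out.write(buf, 0, n);
            }
        }
    }
}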
I use the command line and I want to know from which host I get a file (i.e. which replica is read).
Normally it should be the replica nearest to me, but I changed a policy for the project, so I want to check the final result to see whether my new policy works correctly.
The following command does not give any information:
hadoop dfs -get /file
And the next one gives me only the replicas' locations, but not which one is preferred for the get:
hadoop fsck /file -files -blocks -locations
HDFS abstracts this information away, as it is not very useful for users to know where they are reading from (the filesystem is designed to stay out of your way as much as possible). Typically, the DFSClient picks up the data in the order of the hosts returned to it (moving on to an alternative in case of a failure). The list of hosts is sorted by the NameNode for data or rack locality, and that is how the default scenario works.
While the proper answer to your question would be to write good test cases that can both simulate and assert this, you can also run your program with the Hadoop logger set to DEBUG to check the IPC connections made to the various hosts (including DataNodes) when reading the files, and go through those logs to assert manually that your host-picking is working as intended.
Another way would be to run your client through a debugger and observe the parts around the connections finally made to retrieve data (i.e. after the NameNode RPCs).
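As a rough illustration of the DEBUG-logging route, here is a minimal client sketch; it assumes the cluster's *-site.xml files are on the classpath, that Hadoop's bundled log4j is in use, and '/file' is a placeholder path. Roughly the same effect can often be had from the shell by exporting HADOOP_ROOT_LOGGER=DEBUG,console before running the hadoop command.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class ReadWithDebugLogging {
    public static void main(String[] args) throws Exception {
        // Raise client-side logging so the DFSClient/IPC messages show which
        // hosts are actually contacted while reading.
        Logger.getLogger("org.apache.hadoop").setLevel(Level.DEBUG);

        Configuration conf = new Configuration();  // picks up *-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);
        byte[] buf = new byte[4096];
        try (FSDataInputStream in = fs.open(new Path("/file"))) {  // placeholder path
            while (in.read(buf) > 0) {
                // discard the data; we only care about the connection logs
            }
        }
    }
}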
Thanks! We finally used network statistics with a simple test case to find out where Hadoop takes the replicas from.
But the easiest way is to print the nodes array modified by this method:
org.apache.hadoop.net.NetworkTopology#pseudoSortByDistance(Node reader, Node[] nodes)
As we expected, the replica used for the get is based on the result of this method: the first items are preferred, and normally the first item is taken unless there is an error with that node. For more information about this method, see Replication.
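To make that concrete, here is a minimal sketch (the host and rack names are made up) that builds a small topology by hand, runs the same sort on it, and prints the resulting order; it assumes the older pseudoSortByDistance signature quoted above, which newer Hadoop releases appear to have superseded with sortByDistance.

import org.apache.hadoop.net.NetworkTopology;
import org.apache.hadoop.net.Node;
import org.apache.hadoop.net.NodeBase;

public class ReplicaOrder {
    public static void main(String[] args) {
        NetworkTopology topo = new NetworkTopology();

        // Made-up reader and replica locations; in a real test these would be
        // the hosts and racks reported by fsck -files -blocks -locations.
        Node reader = new NodeBase("client-host", "/rack1");
        Node r1 = new NodeBase("dn-a", "/rack1");
        Node r2 = new NodeBase("dn-b", "/rack2");
        Node r3 = new NodeBase("dn-c", "/rack3");
        topo.add(reader);
        topo.add(r1);
        topo.add(r2);
        topo.add(r3);

        Node[] replicas = { r3, r2, r1 };
        topo.pseudoSortByDistance(reader, replicas);  // same sort the client relies on

        // The first element printed is the replica the client should prefer.
        for (Node n : replicas) {
            System.out.println(n.getName());
        }
    }
}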
I have a lot of log files in my EMR cluster at the path 'hdfs:///logs'. Each log entry spans multiple lines but has a starting and an ending marker to demarcate two entries.
Now,
a. not all entries in a log file are useful;
b. the entries which are useful need to be transformed, and the output needs to be stored in an output file, so that I can efficiently query (using Hive) the output logs later.
I have a Python script which can take a log file and do parts a. and b. mentioned above, but I have not written any mappers or reducers.
Hive takes care of mappers and reducers for its queries. Please tell me if and how it is possible to use the Python script to run it over all the logs and save the output in 'hdfs:///outputlogs'.
I am new to MapReduce and have seen some examples of word count, but all of them have a single input file. Where can I find examples that have multiple input files?
I see that you have a two-fold issue here:
Having more than one file as input
The same word count example will work if you pass in more than one file as input. In fact, you can very easily pass a folder name as input instead of a file name, in your case hdfs:///logs.
You may even pass a comma-separated list of paths as input; for this, instead of using the following:
FileInputFormat.setInputPaths(conf, new Path(args[0]));
you may use the following:
FileInputFormat.setInputPaths(job, args[0]);
Note that passing just the comma-separated list as args[0] will be sufficient.
How to convert your logic to MapReduce
This does have a steep learning curve, as you will need to think in terms of keys and values. But I feel that you can just have all the logic in the mapper itself and use an IdentityReducer, like this:
conf.setReducerClass(IdentityReducer.class);
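As a rough illustration of that shape (not your actual logic), here is a minimal sketch using the old mapred API: all filtering and transforming happens in the mapper, and the IdentityReducer just passes records through. The "USEFUL" filter and transform() are stand-ins for whatever your Python script does, the paths are taken from your question, and the sketch assumes one log entry per line; your multi-line entries would additionally need a custom InputFormat (or to be joined into single lines beforehand).

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class LogCleaner {

    public static class CleanMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, NullWritable, Text> {
        public void map(LongWritable offset, Text line,
                        OutputCollector<NullWritable, Text> out, Reporter reporter)
                throws IOException {
            String entry = line.toString();
            if (entry.contains("USEFUL")) {              // placeholder "is useful" test
                out.collect(NullWritable.get(), new Text(transform(entry)));
            }
        }
        private String transform(String entry) {         // placeholder transformation
            return entry.trim();
        }
    }

    public static void main(String[] args) throws IOException {
        JobConf conf = new JobConf(LogCleaner.class);
        conf.setJobName("log-cleaner");
        conf.setMapperClass(CleanMapper.class);
        conf.setReducerClass(IdentityReducer.class);     // reducer just passes records through
        conf.setOutputKeyClass(NullWritable.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path("hdfs:///logs"));
        FileOutputFormat.setOutputPath(conf, new Path("hdfs:///outputlogs"));
        JobClient.runJob(conf);
    }
}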
If you spend some time reading examples from the following locations, you should be in a better position to make these decisions:
hadoop-map-reduce-examples ( http://hadoop-map-reduce-examples.googlecode.com/svn/trunk/hadoop-examples/src/ )
http://developer.yahoo.com/hadoop/tutorial/module4.html
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
http://kickstarthadoop.blogspot.in/2011/04/word-count-hadoop-map-reduce-example.html
The long-term correct way to do this is, as Amar stated, to write a MapReduce job to do it.
However, if this is a one-time thing and the data isn't too enormous, it might be simplest and easiest to do it with a small bash script, since you already have the Python script:
hadoop fs -text /logs/* > input.log
python myscript.py input.log output.log
hadoop fs -copyFromLocal output.log /outputlogs
rm -f input.log output.log
If this is a repeated process - something you want to be reliable and efficient - or if you just want to learn to use MapReduce better, then stick with Amar's answer.
If you have the logic already written, and you want to do parallel processing using EMR and/or vanilla Hadoop, you can use Hadoop Streaming: http://hadoop.apache.org/docs/r0.15.2/streaming.html. In a nutshell, a script that reads data from stdin and writes its output to stdout can become a mapper.
That way you can run the processing of the data in HDFS using the cluster, without any need to repackage your code.
I'm new to Apache Hadoop and I'm really looking forward to exploring more of its features. After the basic word count example I wanted to up the ante a little bit, so I sat down with this problem statement, which I got from the Hadoop in Action book.
"Take a web server log file . Write a MapReduce program to
aggregate the number of visits for each IP address. Write another MapReduce
program to find the top K IP addresses in terms of visits. These frequent
visitors may be legitimate ISP proxies (shared among many users) or they
may be scrapers and fraudsters (if the server log is from an ad network)."
Can anybody help me out with how I should start? It's kind of tough to actually write our own code, since Hadoop only gives word count as a basic example to kick-start.
Any help is gratefully appreciated. Thanks.
Write a MapReduce program to aggregate the number of visits for each IP address.
The word count example is not much different from this one. In the word count example the map emits ("word", 1) after extracting the "word" from the input; in the IP address case the map emits ("192.168.0.1", 1) after extracting the "192.168.0.1" IP address from the log files.
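For illustration, here is what that map step could look like with the old mapred API (the log format is an assumption here: the client IP is taken to be the first whitespace-separated field, as in a common Apache access log):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class IpVisitMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text ip = new Text();

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> out, Reporter reporter)
            throws IOException {
        String[] fields = line.toString().split("\\s+");
        if (fields.length > 0 && !fields[0].isEmpty()) {
            ip.set(fields[0]);        // e.g. "192.168.0.1"
            out.collect(ip, ONE);     // emit ("192.168.0.1", 1)
        }
    }
}

The reducer is then the same summing reducer as in word count.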
Write another MapReduce program to find the top K IP addresses in terms of visits.
After the completion of the first MapReduce job there will be a number of output files (one per reducer), each line containing an entry like this:
<ip address> <visits>
All of these files have to be merged using the getmerge option, which merges the files and also copies the result to the local filesystem.
Then the local file has to be sorted with the sort command, numerically and in descending order on the visits column.
Then, using the head command, you can take the first K lines to get the top K IP addresses by visits.
There might be a better approach for the second MR job.
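As a purely illustrative sketch of what that second job could look like (again with the old mapred API; K = 10 and the class names are made up), one option is to send every "ip<TAB>count" line from the first job's output to a single reducer under one key and keep only the K largest counts there:

import java.io.IOException;
import java.util.Iterator;
import java.util.PriorityQueue;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class TopKIp {
    public static final int K = 10;   // arbitrary choice of K

    // Pass every "ip<TAB>count" line through under a single key so that
    // one reduce() call sees them all.
    public static class PassThroughMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, NullWritable, Text> {
        public void map(LongWritable offset, Text line,
                        OutputCollector<NullWritable, Text> out, Reporter reporter)
                throws IOException {
            out.collect(NullWritable.get(), line);
        }
    }

    public static class TopKReducer extends MapReduceBase
            implements Reducer<NullWritable, Text, Text, LongWritable> {
        public void reduce(NullWritable key, Iterator<Text> values,
                           OutputCollector<Text, LongWritable> out, Reporter reporter)
                throws IOException {
            // Min-heap ordered by count, trimmed so it never holds more than K entries.
            PriorityQueue<String[]> heap = new PriorityQueue<String[]>(K,
                    (a, b) -> Long.compare(Long.parseLong(a[1]), Long.parseLong(b[1])));
            while (values.hasNext()) {
                String[] parts = values.next().toString().split("\t");
                if (parts.length < 2) continue;
                heap.add(parts);
                if (heap.size() > K) heap.poll();   // drop the current smallest count
            }
            // Emitted in ascending order of visits; the last line is the top IP.
            while (!heap.isEmpty()) {
                String[] p = heap.poll();
                out.collect(new Text(p[0]), new LongWritable(Long.parseLong(p[1])));
            }
        }
    }
}

The driver would be wired up like the first job, ideally with the number of reduce tasks set to 1.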