Best way to trigger execution at File arrival at NFS using OOZIE - hadoop

Following 1 and 2:
Different types of files arrive in my NFS directory from time to time. I would like to use OOZIE or any other HDFS solution to detect the file arrival event and copy each file into a specific location in HDFS according to its type. What is the best way to do it?

"Best way" is a very subjective term. It largely depends on the kind of data, how often it arrives, and what should happen once the data lands at a specific location.
Apache Flume can monitor a specific folder for new data and push it as-is to a sink such as HDFS. Flume is good for streaming data, but it does only one specific job: moving data from place to place.
On the other hand, look up Oozie Coordinators. Coordinators have a data availability trigger, and with Oozie you can perform all sorts of ETL operations after the data arrives using tools like Spark, Hive, and Pig, and push the results to HDFS using shell actions. You can also schedule jobs to run at specific times or frequencies, or have a job send you an email if something goes wrong.
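Whichever tool detects the arrival, the "copy according to type" step can be quite small. Here is a minimal sketch in Python, assuming the file type can be read from the extension. The target directories and the helper names (HDFS_TARGETS, hdfs_target_for, put_command) are made up for illustration; only the `hdfs dfs -put` command itself is a real Hadoop CLI call.

```python
from pathlib import PurePosixPath
from typing import List

# Hypothetical mapping from file extension to HDFS target directory.
HDFS_TARGETS = {
    ".csv": "/data/incoming/csv",
    ".json": "/data/incoming/json",
    ".log": "/data/incoming/logs",
}
# Hypothetical fallback directory for unrecognized file types.
DEFAULT_TARGET = "/data/incoming/other"


def hdfs_target_for(local_path: str) -> str:
    """Return the HDFS directory for a file, chosen by its extension."""
    suffix = PurePosixPath(local_path).suffix.lower()
    return HDFS_TARGETS.get(suffix, DEFAULT_TARGET)


def put_command(local_path: str) -> List[str]:
    """Build the `hdfs dfs -put` argument list that copies the file into HDFS."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_target_for(local_path)]
```

A shell action or a small watcher script could then pass the result of put_command(path) to subprocess.run for each newly arrived file.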

Related

What is the difference between HUE, YARN and OOZIE

I understand the concepts of HDFS and MapReduce and how important it is to move the processing logic to the data to increase efficiency. I was even able to run a couple of MapReduce jobs on my basic Hadoop cluster. Surrounding these concepts there are a lot of different technologies like YARN, HUE, and OOZIE, all of which seem to do the same thing (at least from a very high level), which is operation visibility and CRUD abilities for jobs (which can be MapReduce or something else).
Am I correct in making this assumption or is there a much more fundamental difference between them?
Thanks
Kay
YARN: MapReduce is an API in which you implement your data processing logic. Once the code is compiled, you submit the job using the hadoop jar command. YARN is the framework that keeps track of cluster resources, runs the job on the cluster, and shows/logs its progress.
OOZIE: Take a data integration example. You might have to get one data set from one database and another data set from a second database, then join and process the data and reload it into a cache or a third database. That involves two Sqoop jobs to pull data from the databases, a Hive/MapReduce job to join and process the data, and then a push into the cache/database. All these jobs depend on each other; for example, we are supposed to process the data only after it has been pulled from the source databases. Hence we need to create a workflow to execute the complete data integration process. Oozie can facilitate that. It is a MapReduce-based workflow tool: the workflow itself is executed as one or more MapReduce jobs.
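To make the dependency idea concrete, here is a small sketch in plain Python (not Oozie itself) that derives a valid execution order for the example above. The job names are hypothetical labels for the steps just described; graphlib is in the standard library from Python 3.9.

```python
from graphlib import TopologicalSorter

# Each key is a job; its value is the set of jobs that must finish first.
# Job names are hypothetical labels for the steps in the example.
dependencies = {
    "sqoop_import_db1": set(),
    "sqoop_import_db2": set(),
    "hive_join": {"sqoop_import_db1", "sqoop_import_db2"},
    "load_to_cache": {"hive_join"},
}


def execution_order(deps):
    """Return the jobs in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    # The two Sqoop imports come first (in either order),
    # then the join, then the load.
    print(execution_order(dependencies))
```

A workflow tool like Oozie does this kind of ordering for you, and on top of that handles launching, retries, and failure paths.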
HUE: There are many tools in the Hadoop ecosystem: HDFS (file system), Sqoop, Hive/Pig to process the data, Impala, HBase, and many more. When running POCs, it can get tedious to connect to the cluster, and doing so needs some Linux skills. To overcome those challenges, the Hadoop ecosystem tools are consolidated under one umbrella, called Hue.

What's the proper way to log big data to organize and store it with Hadoop, and query it using Hive?

So basically I have apps on different platforms that are sending logging data to my server. It's a node server that essentially accepts a payload of log entries and it saves them to their respective log files (as write stream buffers, so it is fast), and creates a new log file whenever one fills up.
The way I'm storing my logs is essentially one file per "endpoint", and each log file consists of space separated values that correspond to metrics. For example, a player event log structure might look like this:
timestamp user mediatype event
and the log entry would then look like this
1433421453 bob iPhone play
Based off of reading documentation, I think this format is good for something like Hadoop. The way I think this works, is I will store these logs on a server, then run a cron job that periodically moves these files to S3. From S3, I could use those logs as a source for a Hadoop cluster using Amazon's EMR. From there, I could query it with Hive.
Does this approach make sense? Are there flaws in my logic? How should I be saving/moving these files around for Amazon's EMR? Do I need to concatenate all my log files into one giant one?
Also, what if I add a metric to a log in the future? Will that mess up all my previous data?
I realize I have a lot of questions, that's because I'm new to Big Data and need a solution. Thank you very much for your time, I appreciate it.
If you have a large volume of log dumps that change periodically, the approach you laid out makes sense. Using EMRFS, you can process the logs directly from S3 (which you probably know).
As you 'append' new log events to Hive, part files will be produced, so you don't have to concatenate the logs before loading them into Hive.
(On day 0, the logs are in some delimited form and are loaded into Hive; part files are produced as a result of various transformations. On subsequent cycles, new events/logs are appended to those part files.)
Adding new fields on an ongoing basis is a challenge. You can create new data sets and Hive tables and join them, but the joins are going to be slow. So you may want to define fillers/placeholders in your schema.
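As a rough illustration of the filler idea, here is a sketch of a parser for the space-separated format from the question. It assumes new metrics are only ever appended at the end of a line; the fifth field, country, is a hypothetical later addition and is not part of the original log format.

```python
# Field list for the player event log. "country" is a hypothetical
# metric added later; older lines simply do not have it.
FIELDS = ["timestamp", "user", "mediatype", "event", "country"]


def parse_line(line, fields=FIELDS):
    """Split a space-separated log line and pad missing trailing fields with None."""
    values = line.split()
    # Older lines have fewer values, so fill the gap with placeholders.
    values += [None] * (len(fields) - len(values))
    return dict(zip(fields, values))
```

With this, parse_line("1433421453 bob iPhone play") yields a record whose country is None, so old and new log lines fit the same schema.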
If you are going to receive streams of logs (lots of small log files/events) and need to run near-real-time analytics, then have a look at Kinesis.
(Also test drive Impala. It is faster.)
My 2c.

Map Reduce program which Caches results and computes automatically when changes affect input dataset

I have a set of input files which are going through changes. Is there any way by which we can run a Map reduce program which caches results. Also, whenever there is any change to the input files the Map Reduce program automatically runs again and the resultset is altered according to changes to input files? Can we use MR to approach this dynamically ?
I can't give code here, but let me give you a fair idea of what can be done.
You can use Flume to watch for changes in the file and use a MapReduce job as the Flume sink.
Whenever the content of the file changes, the Flume agent will be triggered and your MapReduce job, acting as the Flume sink, will be executed.
This way you can achieve your goal.
Cheers
MapReduce is in the realm of batch processing and is not real time. Also, HDFS is an append-only file system: if one record out of a billion changes, then the whole dataset or part file has to be rewritten. That is not good for near-real-time processing, and it can get very compute-intensive if the changes cannot be cached in the Mapper and you need to use a reduce-side join.
For the problem you have described, it would be better to use a combination of Kafka, Storm, and HBase, or just HBase, depending on how the changes to the file are generated.

Understanding more about Hadoop/HDFS Data Loading

I'm researching Hadoop and MapReduce (I'm a beginner!) and have a simple question regarding HDFS. I'm a little confused about how HDFS and MapReduce work together.
Let's say I have logs from System A, Tweets, and a stack of documents from System B. When this is loaded into Hadoop/HDFS, is this all thrown into one big HDFS bucket, or would there be 3 areas (for want of a better word)? If so, what is the correct terminology?
The question stems from understanding how to execute a MapReduce job. If I only wanted to concentrate on the Logs, for example, can this be done, or are all jobs executed on the entire content stored on the cluster?
Thanks for your guidance!
TM
HDFS is a file system. As in your local filesystem, you can organize all your logs and documents into multiple files and directories. When you run MapReduce jobs, you usually specify a directory with your input files. Thus it is possible to execute a job only on the logs from System A or the documents from System B.
However, the input for your mappers is specified by the InputFormat. Most implementations derive from FileInputFormat, which reads files. It is also possible to implement custom InputFormats in order to read data from other sources. You can find an explanation of input and output formats in this Hadoop Tutorial.

Can Hadoop MapReduce run over other filesystems?

I heard that the input for MapReduce jobs need not be in HDFS; it can be on another file system. Can someone please tell me more about this?
I am a little confused about it. In standalone mode, data can be on the local file system. But in cluster mode, how can we point MapReduce jobs to some other file system?
No, it does not need to be in HDFS. For instance, jobs that target HBase using its TableInputFormat pull records over the network from HBase nodes as inputs to their map tasks. The DBInputFormat can be used to pull data from a SQL database into a job. You could build an input format that did something like read data off of an NFS mount.
In practice you want to avoid pulling data over the network if you can. MR performance is much better if your data is local to the nodes where the job is being run, since disk throughput > network throughput.
Based on the InputFormat set on the job, Hadoop can read from any source. Hadoop provides a couple of InputFormats. It's also not difficult to write a custom InputFormat, let's say to provide a proprietary format as input to a job.
On the same lines Hadoop provides a couple of OutputFormats and it shouldn't be difficult to write a custom OutputFormat also.
Here is a nice article on the DBInputFormat.
Another way to achieve this is to put files into HDFS that contain information about where the real data is. The Mapper gets this information and pulls the real data for processing.
For example, we can have several files with URLs of the data to be processed.
What we lose in this case is data locality; otherwise it is fine.
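A minimal sketch of that pointer-file idea, written as a plain Python generator rather than a real Hadoop Mapper. The function name pointer_mapper and the fetch callable are assumptions for illustration; in a real job, fetch would be whatever client reads from the external store.

```python
from typing import Callable, Iterable, Iterator, Tuple


def pointer_mapper(
    lines: Iterable[str], fetch: Callable[[str], object]
) -> Iterator[Tuple[str, object]]:
    """For each URL line of the input file, fetch the real data and emit (url, data)."""
    for line in lines:
        url = line.strip()
        if not url:
            continue  # skip blank lines in the pointer file
        # The actual data is pulled over the network here, which is
        # exactly where data locality is lost.
        yield url, fetch(url)
```

Passing the fetcher in as a parameter keeps the sketch independent of any particular storage client and makes it easy to try out with a stub.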
