Run a Hadoop job without output file

Run a Hadoop job without output file - hadoop

Is it possible to run a hadoop job without specifying output file ?
When i try to run a hadoop job , no output file specified Exception is thrown .
can any one please give any procedure to do so using Java.
I am writing the data processed by reduce to a non relational database so i no longer require it to write to HDFS.

Unfortunately, you can't really do this. Writing output is part of the framework. When you work outside of the framework, you basically have to just deal with the consequences.
You can use NullOutputFormat, which doesn't write any data to HDFS. I think it still creates the folder, though. You could always let Hadoop create the folder, then delete it.

Related

Storing small files in hdfs and archiving them in Nifi Flow

I have an issue with small files and HDFS.
Scenario: I am using NiFi to read messages from the Kafka topic, these are all really small.
Requirement: to store these raw messages of data in HDFS(for replay capability)...before doing further processing on them.
I was thinking using Hadoop Archive (HAR) on them periodically. Is that something i can do through NiFi? the har command seems like a command line thing rather than something that i could execute through Nifi? Would love to know a solution that can achieve my requirement, without bringing down HDFS due to the small files.
Ginil

You can execute command line inside Nifi with ExecuteProcess processor :
http://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.6.0/org.apache.nifi.processors.standard.ExecuteProcess/
You can also take a look at Kafka-connect HDFS for putting kafka records into HDFS.

Simple deeplearning4J Java based Spark example?

I need to run a simple Java based deeplearning4j example in hadoop cluster and I found one here. My need to specify the input from command line (which should be a path on HDFS) and output should go to HDFS, for later view
However, in the example there is no mention, it is hard coding the input from local file system and output goes to local file system.
Can anyone help me here?

Maybe some combination of this recent pull request on our examples:
https://github.com/deeplearning4j/dl4j-examples/pull/384
and Spring-hadoop could help you?
http://projects.spring.io/spring-hadoop/
I mean conceptually all you'd do is change the file system type.
The FileSystem api in hadoop can point to either local or an hdfs url so there shouldn't be much change.

How do I add files to distributed cache in an oozie job

I am implementing an oozie workflow where, in the first job I am reading data from a database using sqoop and writing it to hdfs. In the second job I need to read a large amount of data and use the files I just wrote in job one to process the large data. Here's what I thought of or tried:
Assuming job one writes the files to some directory on hdfs, adding the files to distributed cache in the driver class of job two will not work as oozie workflow knows just about the mapper and reducer classes of the job. (Please correct me if I am wrong here)
I also tried to write to the lib directory of the workflow hoping that the files would then be automatically added to distributed cache but I understood that the lib directory should be read only when the job is running.
I also thought if I could add the files to distributed cache in the setup() of job 2 then I could access them in the mapper/reducer. I am not aware of how one can add files in setup(), is it possible?
How else can I read the output files of the previous job in the subsequent job from distributed cache. I am already using the input directory of job two to read the data that needs to be processed so I cannot use that.
I am using Hadoop 1.2.1, Oozie 3.3.2 on Ubuntu 12.04 virtual machine.

Add the below properties to add files or archives to your map-reduce action . Refer to this documentation for details.
<file>[FILE-PATH]</file>
...
<archive>[FILE-PATH]</archive>
You can also give input at java command line as shown below.
<main-class>org.apache.oozie.MyFirstMainClass</main-class>
<java-opts>-Dblah</java-opts>
<arg>argument1</arg>
<arg>argument2</arg>

Running a hadoop job using java command

I have a simple java program that sets up a MR job. I could successfully execute this in Hadoop infrastructure (hadoop 2x) using 'hadoop jar '. But I want to achieve the same thing using java command as below.
java className
How can I pass hadoop configuration to this className?
What extra arguments do I need to supply?
Any link/documentation would be highly appreciated.

As you run your 'hadoop jar' command with the other parameters, same way you can run using java.
check if, this commands evaluates to hadoop class path
$ hadoop classpath
then whatever your custom jar is should be added in class path
$ java -cp `hadoop classpath`:/my/tools/jar/tools.jar
I am able to get mine working with this, on my hadoop cluster

I don't think you can find a documentation on this. hadoop command is a script, a lot of classes are used there eg. Class for accessing filesystem FsShell, class used when we run a jar RunJar etc. Adding hadoop related libraries, configuration files to classpath are handled in the hadoop command itself.
You better take a look at the hadoop script.

How can you do that? Any jar file execution means, it has to execute in distributed environment where all daemons work together to complete the execution.
We are not running locally or on local file system. So, it needs be executed as per the norms of hdfs so i don't think we can execute like we do in local file system.

Hadoop is a framework which simplifies the distributed computing. Before hadoop also, programmers know about parallel processing and multi threading concepts. But when you deal with multiple machines you need to know how to
Communicate between machines
Network processing
What if one machine fails? fault tolerance
and many more! which is a huge, that's where hadoop simplifies your job. It takes care of all your operating level stuff and you can focus on just your business logic.
So in your case, based on what you are asking, there is no direct answer for that. Because by passing parameters the your program doesn't work. You will need to write lot of libraries to deal with distributed computing. If you want to explore them, then I would suggest go ahead and read hadoop source code.
http://hadoop.apache.org/version_control.html

getting data in and out of hadoop

I need a system to analyze large log files. A friend directed me to hadoop the other day and it seems perfect for my needs. My question revolves around getting data into hadoop-
Is it possible to have the nodes on my cluster stream data as they get it into HDFS? Or would each node need to write to a local temp file and submit the temp file after it reaches a certain size? and is it possible to append to a file in HDFS while also running queries/jobs on that same file at the same time?

Fluentd log collector just released its WebHDFS plugin, which allows the users to instantly stream data into HDFS. It's really easy to install with ease of management.
Fluentd + Hadoop: Instant Big Data Collection
Of course you can import data directly from your applications. Here's a Java example to post logs against Fluentd.
Fluentd: Data Import from Java Applications

A hadoop job can run over multiple input files, so there's really no need to keep all your data as one file. You won't be able to process a file until its file handle is properly closed, however.

HDFS does not support appends (yet?)
What I do is run the map-reduce job periodically and output results to an 'processed_logs_#{timestamp}" folder.
Another job can later take these processed logs and push them to a database etc. so it can be queried on-line

I'd recommend using Flume to collect the log files from your servers into HDFS.

Develop Reference

ruby bash windows laravel spring algorithm oracle macos go visual-studio

Run a Hadoop job without output file - hadoop

Related

Storing small files in hdfs and archiving them in Nifi Flow

Simple deeplearning4J Java based Spark example?

How do I add files to distributed cache in an oozie job

Running a hadoop job using java command

getting data in and out of hadoop

Categories

Resources