Is the pig.temp.dir property mandatory? - hadoop

Pig Execution Mode = Local
In that case, do we still need to set the pig.temp.dir=/temp property, and does this /temp folder need to be present in HDFS?
Note:
Storing Intermediate Results
Pig stores the intermediate data generated between MapReduce jobs in a temporary location on HDFS. This location must already exist on HDFS prior to use. This location can be configured using the pig.temp.dir property. The property's default value is "/tmp" which is the same as the hardcoded location in Pig 0.7.0 and earlier versions.
As per the "Storing Intermediate Results" heading at http://pig.apache.org/docs/r0.14.0/start.html#req

You'll still need some temp directory, but it needs to exist on your local file system. In local mode, Pig (and MapReduce) performs all operations on the local filesystem by default.
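For what it's worth, here is a minimal sketch of the same idea using Pig's Java API (PigServer) in local mode; /tmp/pig_temp, input.txt and output are just placeholder names for a local directory and local files:
import java.util.Properties;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class LocalPigTempDir {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // In local mode this path is resolved against the local filesystem, not HDFS.
        props.setProperty("pig.temp.dir", "/tmp/pig_temp");
        PigServer pig = new PigServer(ExecType.LOCAL, props);
        pig.registerQuery("A = LOAD 'input.txt' AS (line:chararray);");
        pig.store("A", "output");
    }
}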

Related

Why is DATA COPIED and not MOVED while loading data from the local filesystem? - hive hadoop

When we use the following command:
Load data local inpath "mypath"
why is the data copied from the local filesystem into HDFS and not moved?
Since you are moving data between two different file systems (the local filesystem and HDFS), this cannot be a metadata-only operation, as it is in a non-local load.
The data itself has to be copied.
Theoretically this command could also delete the source file afterwards, but what for?
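To illustrate the point, here is a sketch with the Hadoop FileSystem API (this is not Hive's actual code path, and the paths are placeholders): within a single filesystem a load can be a cheap rename, but a local-to-HDFS load has to move bytes between filesystems, so it is a copy.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class LoadDataSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem local = FileSystem.getLocal(conf);
        FileSystem hdfs = FileSystem.get(conf);  // assumes fs.defaultFS points at HDFS

        // Non-local load (HDFS -> warehouse directory): a metadata-level rename, no data movement.
        hdfs.rename(new Path("/staging/mypath"), new Path("/warehouse/mytable/mypath"));

        // LOAD DATA LOCAL INPATH (local -> HDFS): bytes must cross filesystems, so a copy.
        // deleteSource is false, i.e. the local source file is kept.
        FileUtil.copy(local, new Path("/home/user/mypath"),
                      hdfs, new Path("/warehouse/mytable/mypath"),
                      false, conf);
    }
}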

How to save data in HDFS with Spark?

I want to use Spark Streaming to retrieve data from Kafka. Now, I want to save my data to a remote HDFS. I know that I have to use the saveAsTextFile function. However, I don't know precisely how to specify the path.
Is it correct if I write this:
myDStream.foreachRDD(frm->{
frm.saveAsTextFile("hdfs://ip_addr:9000//home/hadoop/datanode/myNewFolder");
});
where ip_addr is the IP address of my remote HDFS server, /home/hadoop/datanode/ is the DataNode directory created when I installed Hadoop (I don't know if I have to specify this directory), and myNewFolder is the folder where I want to save my data.
Thanks in advance.
Yassir
The path has to be a directory in HDFS.
For example, if you want to save the files inside a folder named myNewFolder under the root / path in HDFS, the path to use would be hdfs://namenode_ip:port/myNewFolder/.
When the Spark job executes, this directory myNewFolder will be created.
The datanode data directory given as dfs.datanode.data.dir in hdfs-site.xml is used to store the blocks of the files you store in HDFS; it should not be referenced as an HDFS directory path.
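Reusing myDStream from the question, a hedged version of the corrected call could look like this (namenode_ip:9000 is assumed to match your fs.defaultFS; a per-batch subfolder is used because saveAsTextFile will not overwrite an existing directory):
myDStream.foreachRDD(rdd -> {
    // Rooted in HDFS via the NameNode address, not in the datanode's local data directory.
    rdd.saveAsTextFile("hdfs://namenode_ip:9000/myNewFolder/batch-" + System.currentTimeMillis());
});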

What is the difference between hadoop.tmp.dir, mapred.temp.dir and mapreduce.cluster.temp.dir?

I want to know what the difference is between hadoop.tmp.dir and mapred.temp.dir, and also how mapred.temp.dir [deprecated] differs from mapreduce.cluster.temp.dir.
hadoop.tmp.dir is the highest-level temporary directory. It defaults to /tmp/hadoop-${user.name}.
By default, mapred.temp.dir refers to a directory under hadoop.tmp.dir: ${hadoop.tmp.dir}/mapred/temp.
Logically this makes sense: MapReduce (like Hive, Spark, etc.) runs as a component of the Hadoop stack, so its temporary data lives under Hadoop's temp directory. As for mapreduce.cluster.temp.dir, it is simply the newer name for the same setting: the deprecated mapred.temp.dir key is mapped to mapreduce.cluster.temp.dir in recent Hadoop versions.
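A quick way to see how the keys resolve on your own installation is a small sketch like the following (assuming the MapReduce client jars are on the classpath):
import org.apache.hadoop.mapred.JobConf;

public class TempDirDefaults {
    public static void main(String[] args) {
        // JobConf loads core-default.xml, mapred-default.xml and any site overrides.
        JobConf conf = new JobConf();
        System.out.println("hadoop.tmp.dir             = " + conf.get("hadoop.tmp.dir"));
        System.out.println("mapreduce.cluster.temp.dir = " + conf.get("mapreduce.cluster.temp.dir"));
    }
}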

How can I specify Hadoop XML configuration variables via the Hadoop shell scripts?

I'm writing code to create a temporary Hadoop cluster. Unlike most Hadoop clusters, I need the location for logs, HDFS files, etc, to be in a specific temporary network location that is different each time the cluster is started. This network directory will be generated at runtime; I do not know the directory name at the time I'm checking in the shell scripts like hadoop-env.sh and the XML files like core-default.xml.
At checkin time: I can modify the shell scripts like hadoop-env.sh and the XML files like core-default.xml.
At run time: I generate the temporary directory that I want to use for my data storage.
I can instruct most of Hadoop to use this temporary directory by specifying environment variables like HADOOP_LOG_DIR and HADOOP_PID_DIR, and if necessary I can modify the shell scripts to read those environment variables.
However, HDFS determines the local directories where it stores its data via two properties that are defined in XML files, not environment variables or shell scripts: hadoop.tmp.dir in core-default.xml and dfs.datanode.data.dir in hdfs-default.xml.
Is there any way to edit these XML files to determine the value of hadoop.tmp.dir at runtime? Or, alternatively, is there any way to use environment variables to override the XML-configured value of hadoop.tmp.dir?
We had a similar requirement earlier. Configuring dfs.data.dir and dfs.name.dir as part of HADOOP_OPTS worked well for us. For example:
export HADOOP_OPTS="-Ddfs.name.dir=$NAMENODE_DATA -Ddfs.data.dir=$DFS_DATA"
This method can be used to set other configuration properties as well, such as the namenode URL.
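If a given property can only be picked up from the XML files, another option (different from the HADOOP_OPTS approach above) is to generate a small site file at startup once the runtime directory is known. A rough sketch, where the output location is assumed to be whatever directory HADOOP_CONF_DIR points at:
import java.io.FileOutputStream;
import org.apache.hadoop.conf.Configuration;

public class WriteSiteConfig {
    public static void main(String[] args) throws Exception {
        String tempDir = args[0];   // the network directory generated at runtime
        String confDir = args[1];   // e.g. the directory HADOOP_CONF_DIR points at

        // Empty configuration (no defaults), containing only the overrides we want.
        Configuration conf = new Configuration(false);
        // dfs.datanode.data.dir defaults to a location under hadoop.tmp.dir,
        // so overriding hadoop.tmp.dir is usually enough.
        conf.set("hadoop.tmp.dir", tempDir);

        // Write it out as core-site.xml before the daemons are started.
        try (FileOutputStream out = new FileOutputStream(confDir + "/core-site.xml")) {
            conf.writeXml(out);
        }
    }
}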

Does a file need to be in HDFS in order to use it in distributed cache?

I get
Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: file:/path/to/my.jar, expected: hdfs://ec2-xx-xx-xx-xxx.compute-1.amazonaws.com
if I try to add a local file to distributed cache in hadoop. When the file is on HDFS, I don't get this error (obviously, since it's using the expected FS). Is there a way to use a local file in distributed cache without first copying it to hdfs? Here is a code snippet:
Configuration conf = job.getConfiguration();
FileSystem fs = FileSystem.getLocal(conf);
Path dependency = fs.makeQualified(new Path("/local/path/to/my.jar"));
DistributedCache.addArchiveToClassPath(dependency, conf);
Thanks
It has to be in HDFS first. I'm going to go out on a limb here, but I think it is because the file is "pulled" to the local distributed cache by the slaves, not pushed. Since they are pulled, they have no way to access that local path.
No, I don't think you can put anything in the distributed cache without it being in HDFS first. All Hadoop jobs use input/output paths in relation to HDFS.
The file can be on the local filesystem, HDFS, S3, or another cluster. You need to specify the scheme, e.g.
-files hdfs:// if the file is in HDFS
By default, a path without a scheme is assumed to be on the local filesystem.
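As a rough sketch of the "copy to HDFS first" route described above (using the newer org.apache.hadoop.mapreduce API; the paths are placeholders):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class CacheJarExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "cache-jar-example");

        // Push the local jar into HDFS once, then reference the HDFS copy.
        FileSystem hdfs = FileSystem.get(conf);
        hdfs.copyFromLocalFile(new Path("/local/path/to/my.jar"),
                               new Path("/tmp/cache/my.jar"));

        // Newer-API equivalent of DistributedCache.addArchiveToClassPath.
        job.addArchiveToClassPath(new Path("/tmp/cache/my.jar"));
    }
}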
