Access hdfs file from udf - hadoop

I'd like to access a file from my UDF call. This is my script:
files = LOAD '$docs_in' USING PigStorage(';') AS (id, stopwords, id2, file);
buzz = FOREACH files GENERATE pigbuzz.Buzz(file, id) as file:bag{(year:chararray, word:chararray, count:long)};
The JAR is registered. The path is relative to my HDFS, where the files really exist. The call is made, but it seems the file is not found, maybe because I'm trying to access it on HDFS.
How can I access a file on HDFS from my Java UDF?

Inside an EvalFunc you can get a file from the HDFS via:
FileSystem fs = FileSystem.get(UDFContext.getUDFContext().getJobConf());
FSDataInputStream in = fs.open(new Path(fileName));
BufferedReader br = new BufferedReader(new InputStreamReader(in));
// ... read from br as needed
You might also consider putting the files into the distributed cache; in that case you have to override getCacheFiles() in your EvalFunc class. E.g.:
@Override
public List<String> getCacheFiles() {
    List<String> list = new ArrayList<String>(2);
    list.add("/cache/pig/wordlist1.txt#w1");
    list.add("/cache/pig/wordlist2.txt#w2");
    return list;
}
Then you can just pass the symlinks of the files (w1 and w2) in order to get them from the local file system of each of the worker nodes:
BufferedReader br = new BufferedReader(new FileReader(fileName));
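For reference, here is a minimal sketch of how the two pieces fit together: an EvalFunc that declares a cache file and lazily reads it through its symlink. The class name, the single wordlist path, and the filtering logic are illustrative assumptions, not the original poster's UDF:
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class WordlistFilter extends EvalFunc<String> {
    private Set<String> words;

    @Override
    public List<String> getCacheFiles() {
        // shipped to every worker and exposed locally under the symlink "w1"
        return Arrays.asList("/cache/pig/wordlist1.txt#w1");
    }

    @Override
    public String exec(Tuple input) throws IOException {
        if (words == null) {
            words = new HashSet<String>();
            // "w1" resolves to the cached copy on the worker's local disk
            BufferedReader br = new BufferedReader(new FileReader("w1"));
            String line;
            while ((line = br.readLine()) != null) {
                words.add(line.trim());
            }
            br.close();
        }
        String value = (String) input.get(0);
        return (value != null && words.contains(value)) ? value : null;
    }
}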

Related

Append to file in HDFS (CDH 5.4.5)

Brand new to HDFS here.
I've got this small section of code to test out appending to a file:
val path: Path = new Path("/tmp", "myFile")
val config = new Configuration()
val fileSystem: FileSystem = FileSystem.get(config)
val outputStream = fileSystem.append(path)
outputStream.writeChars("what's up")
outputStream.close()
It is failing with this message:
Not supported
java.io.IOException: Not supported
at org.apache.hadoop.fs.ChecksumFileSystem.append(ChecksumFileSystem.java:352)
at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1163)
I looked at the source for ChecksumFileSystem.java, and it seems to be hardcoded to not support appending:
@Override
public FSDataOutputStream append(Path f, int bufferSize,
        Progressable progress) throws IOException {
    throw new IOException("Not supported");
}
How to make this work? Is there some way to change the default file system to some other implementation that does support append?
It turned out that I needed to actually run a real Hadoop namenode and datanode. I am new to Hadoop and did not realize this. Without them, FileSystem.get() uses your local filesystem, which is a ChecksumFileSystem, and that does not support append. So I followed the blog post here to get HDFS up and running on my system, and now I am able to append.
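A minimal sketch of the same idea in Java, assuming a namenode is reachable at hdfs://localhost:8020 (adjust fs.defaultFS to your cluster). Pointing the Configuration at a running HDFS makes FileSystem.get() return a DistributedFileSystem, which supports append, instead of the local ChecksumFileSystem:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://localhost:8020"); // hypothetical namenode address
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/tmp/myFile");
        if (!fs.exists(path)) {
            fs.create(path).close(); // append requires an existing file
        }
        FSDataOutputStream out = fs.append(path);
        out.writeChars("what's up");
        out.close();
        fs.close();
    }
}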
Appending is done by writing to the FSDataOutputStream returned by fs.append(); FileSystem.get() is just used to connect to your HDFS. First set dfs.support.append to true in hdfs-site.xml:
<property>
    <name>dfs.support.append</name>
    <value>true</value>
</property>
Stop all your daemon services using stop-all.sh and restart them using start-all.sh. Then put this in your main method:
String fileuri = "hdfs/file/path";
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(URI.create(fileuri),conf);
FSDataOutputStream out = fs.append(new Path(fileuri));
PrintWriter writer = new PrintWriter(out);
writer.append("I am appending this to my file");
writer.close();
fs.close();

How to open/stream .zip files through Spark?

I have zip files that I would like to open 'through' Spark. I can open .gzip files with no problem thanks to Hadoop's native codec support, but I am unable to do so with .zip files.
Is there an easy way to read a zip file in your Spark code? I've also searched for zip codec implementations to add to the CompressionCodecFactory, but have been unsuccessful so far.
There was no solution with Python code, and I recently had to read zips in PySpark. While searching for how to do that, I came across this question, so hopefully this will help others.
import zipfile
import io

def zip_extract(x):
    in_memory_data = io.BytesIO(x[1])
    file_obj = zipfile.ZipFile(in_memory_data, "r")
    files = [i for i in file_obj.namelist()]
    return dict(zip(files, [file_obj.open(file).read() for file in files]))

zips = sc.binaryFiles("hdfs:/Testing/*.zip")
files_data = zips.map(zip_extract).collect()
In the above code I return a dictionary with each filename in the zip as a key and the data of each file as the value. You can change it however you want to suit your purposes.
@user3591785 pointed me in the correct direction, so I marked his answer as correct.
For a bit more detail, I was able to search for ZipFileInputFormat Hadoop, and came across this link: http://cotdp.com/2012/07/hadoop-processing-zip-files-in-mapreduce/
Taking the ZipFileInputFormat and its helper ZipfileRecordReader class, I was able to get Spark to perfectly open and read the zip file.
JavaPairRDD<Text, Text> rdd1 = sc.newAPIHadoopFile("/Users/myname/data/compressed/target_file.ZIP", ZipFileInputFormat.class, Text.class, Text.class, new Job().getConfiguration());
The result was a map with one element: the file name as the key and the content as the value, so I needed to transform this into a JavaPairRDD. I'm sure you could probably replace Text with BytesWritable if you want, and replace the ArrayList with something else, but my goal was to first get something running.
JavaPairRDD<String, String> rdd2 = rdd1.flatMapToPair(new PairFlatMapFunction<Tuple2<Text, Text>, String, String>() {
    @Override
    public Iterable<Tuple2<String, String>> call(Tuple2<Text, Text> textTextTuple2) throws Exception {
        List<Tuple2<String, String>> newList = new ArrayList<Tuple2<String, String>>();
        InputStream is = new ByteArrayInputStream(textTextTuple2._2.getBytes());
        BufferedReader br = new BufferedReader(new InputStreamReader(is, "UTF-8"));
        String line;
        while ((line = br.readLine()) != null) {
            Tuple2 newTuple = new Tuple2(line.split("\\t")[0], line);
            newList.add(newTuple);
        }
        return newList;
    }
});
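A hypothetical follow-up showing how rdd2 from the block above might be used, just to sanity-check the result (the take(5) sample size is arbitrary):
// continues from the snippet above; Tuple2 is scala.Tuple2
System.out.println("total lines: " + rdd2.count());
for (Tuple2<String, String> t : rdd2.take(5)) {
    System.out.println(t._1() + " -> " + t._2());
}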
Please try the code below, using the API:
sparkContext.newAPIHadoopRDD(
    hadoopConf,
    InputFormat.class,
    ImmutableBytesWritable.class,
    Result.class)
I had a similar issue and solved it with the following code:
sparkContext.binaryFiles("/pathToZipFiles/*")
  .flatMap { case (zipFilePath, zipContent) =>
    val zipInputStream = new ZipInputStream(zipContent.open())
    Stream.continually(zipInputStream.getNextEntry)
      .takeWhile(_ != null)
      .flatMap { zipEntry => ??? }
  }
This answer only collects the previous knowledge and I share my experience.
ZipFileInputFormat
I tried following @Tinku's and @JeffLL's answers and used the imported ZipFileInputFormat together with the sc.newAPIHadoopFile API, but this did not work for me. I also do not know how I would put the com-cotdp-hadoop lib on my production cluster; I am not responsible for the setup.
ZipInputStream
@Tiago Palma gave good advice, but he did not finish his answer, and I struggled quite some time to actually get the decompressed output.
Before I was able to do so, I had to cover all the theoretical aspects, which you can find in my answer: https://stackoverflow.com/a/45958182/1549135
But the missing part of the mentioned answer is reading the ZipEntry:
import java.util.zip.ZipInputStream
import java.io.BufferedReader
import java.io.InputStreamReader

sc.binaryFiles(path, minPartitions)
  .flatMap { case (name: String, content: PortableDataStream) =>
    val zis = new ZipInputStream(content.open)
    Stream.continually(zis.getNextEntry)
      .takeWhile(_ != null)
      .flatMap { _ =>
        val br = new BufferedReader(new InputStreamReader(zis))
        Stream.continually(br.readLine()).takeWhile(_ != null)
      }
  }
Using the API sparkContext.newAPIHadoopRDD(hadoopConf, InputFormat.class, ImmutableBytesWritable.class, Result.class), the file name should be passed via the configuration:
conf = new Job().getConfiguration()
conf.set(PROPERTY_NAME, "zip file path")   // PROPERTY_NAME comes from your input format
sparkContext.newAPIHadoopRDD(conf, ZipFileInputFormat.class, Text.class, Text.class)
Please find the PROPERTY_NAME your input format uses for setting the input path.
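To make the idea above concrete, here is a hedged Java sketch. It assumes the ZipFileInputFormat from the com-cotdp-hadoop library mentioned earlier and reuses the Text key/value classes from the answers above; the input path is hypothetical, and the key/value classes must match whatever your input format actually declares:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import com.cotdp.hadoop.ZipFileInputFormat; // third-party input format, assumed on the classpath

public class ZipViaInputFormat {
    public static void main(String[] args) throws Exception {
        JavaSparkContext sc = new JavaSparkContext();

        // the input path travels in the Hadoop configuration, as the answer above describes
        Job job = Job.getInstance();
        FileInputFormat.addInputPath(job, new Path("hdfs:///Testing/archive.zip")); // hypothetical path

        JavaPairRDD<Text, Text> entries = sc.newAPIHadoopRDD(
                job.getConfiguration(), ZipFileInputFormat.class, Text.class, Text.class);
        System.out.println("zip entries: " + entries.count());
    }
}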
Try:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.read.text("yourGzFile.gz")

Reading in a parameter file in Amazon Elastic MapReduce and S3

I am trying to run my Hadoop program on the Amazon Elastic MapReduce system. My program takes an input file from the local filesystem which contains parameters needed for the program to run. However, since the file is normally read from the local filesystem with FileInputStream, the task fails when executed in the AWS environment with an error saying that the parameter file was not found. Note that I already uploaded the file into Amazon S3. How can I fix this problem? Thanks. Below is the code that I use to read the parameter file and consequently read the parameters in it.
FileInputStream fstream = new FileInputStream(path);
DataInputStream datain = new DataInputStream(fstream);
BufferedReader br = new BufferedReader(new InputStreamReader(datain));

String[] args = new String[7];
int i = 0;
String strLine;
while ((strLine = br.readLine()) != null) {
    args[i++] = strLine;
}
If you must read the file from the local file system, you can configure your EMR job to run with a bootstrap action. In that action, simply copy the file from S3 to a local file using s3cmd or similar.
You could also go through the Hadoop FileSystem class to read the file, as I'm pretty sure EMR supports direct access like this. For example:
FileSystem fs = FileSystem.get(new URI("s3://my.bucket.name/"), conf);
DataInputStream in = fs.open(new Path("/my/parameter/file"));
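Tying that back to the question, here is a minimal sketch (the bucket and key names are just the placeholders from the snippet above) that reads each line of the parameter file on S3 into an args array:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReadParamsFromS3 {
    public static void main(String[] cliArgs) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(new URI("s3://my.bucket.name/"), conf);
        FSDataInputStream in = fs.open(new Path("/my/parameter/file"));
        BufferedReader br = new BufferedReader(new InputStreamReader(in));

        // collect one parameter per line, as in the original local-filesystem code
        List<String> params = new ArrayList<String>();
        String line;
        while ((line = br.readLine()) != null) {
            params.add(line);
        }
        br.close();
        String[] args = params.toArray(new String[0]);
        System.out.println("read " + args.length + " parameters");
    }
}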
I have not tried Amazon Elastic MapReduce yet; however, it looks like a classical application of the distributed cache. You add the file to the cache using the -files option (if you implement Tool/ToolRunner) or the job.addCacheFile(URI uri) method, and access it as if it existed locally.
You can add this file to the distributed cache as follows :
...
String s3FilePath = args[0];
DistributedCache.addCacheFile(new URI(s3FilePath), conf);
...
Later, in configure() of your mapper/reducer, you can do the following:
...
Path s3FilePath;

@Override
public void configure(JobConf job) {
    s3FilePath = DistributedCache.getLocalCacheFiles(job)[0];
    FileInputStream fstream = new FileInputStream(s3FilePath.toString());
    ...
}

Files not put correctly into distributed cache

I am adding a file to distributed cache using the following code:
Configuration conf2 = new Configuration();
job = new Job(conf2);
job.setJobName("Join with Cache");
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);
Then I read the file into the mappers:
protected void setup(Context context) throws IOException, InterruptedException {
    Configuration conf = context.getConfiguration();
    URI[] cacheFile = DistributedCache.getCacheFiles(conf);
    FSDataInputStream in = FileSystem.get(conf).open(new Path(cacheFile[0].getPath()));
    BufferedReader joinReader = new BufferedReader(new InputStreamReader(in));
    String line;
    try {
        while ((line = joinReader.readLine()) != null) {
            s = line.split("\t");
            // do stuff to s
        }
    } finally {
        joinReader.close();
    }
}
The problem is that I only read in one line, and it is not the file I was putting into the cache. Rather it is: cm9vdA==, or root in base64.
Has anyone else had this problem, or see how I'm using distributed cache incorrectly? I am using Hadoop 0.20.2 fully distributed.
Common mistake in your job configuration:
Configuration conf2 = new Configuration();
job = new Job(conf2);
job.setJobName("Join with Cache");
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);
After you create your Job object, you need to pull back the Configuration object, as Job makes a copy of it; configuring values in conf2 after you create the job will have no effect on the job itself. Try this:
job = new Job(new Configuration());
Configuration conf2 = job.getConfiguration();
job.setJobName("Join with Cache");
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000"), conf2);
You should also check the number of files in the distributed cache; there is probably more than one, and you're opening a random file, which is giving you the value you are seeing.
I suggest you use symlinking, which will make the files available in the local working directory with a known name:
DistributedCache.createSymlink(conf2);
DistributedCache.addCacheFile(new URI("hdfs://server:port/FilePath/part-r-00000#myfile"), conf2);
// then in your mapper setup:
BufferedReader joinReader = new BufferedReader(new FileReader("myfile"));

How to design my mapper?

I have to write a MapReduce job but I don't know how to go about it.
I have a jar, MARD.jar, through which I can instantiate MARD objects.
On such an object I call the mard.normaliseFile(bunch of arguments) method.
This in turn creates a certain output file.
For the normalise method to run, it needs a folder called myMard in the working directory.
So I thought I would give the myMard folder as the input path to the Hadoop job, but I'm not sure that would help, because mard.normaliseFile(bunch of arguments) will search for the myMard folder in the working directory and will not find it; as far as I can tell, the Mapper can only access the contents of files through the "values" obtained from the FileSplit, so it cannot give direct access to the files in the myMard folder.
In short, I have to execute the following code through MapReduce:
File setupFolder = new File(setupFolderName);
setupFolder.mkdirs();
MARD mard = new MARD(setupFolder);
Text valuz = new Text();
IntWritable intval = new IntWritable();
File original = new File("Vca1652.txt");
File mardedxml = new File("Vca1652-mardedxml.txt");
File marded = new File("Vca1652-marded.txt");
mardedxml.createNewFile();
marded.createNewFile();
NormalisationStats stats;
try {
    stats = mard.normaliseFile(original, mardedxml, marded, 50.0);
    // This method requires access to the myMard folder
    System.out.println(stats);
} catch (MARDException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
Please help
