How to write huge data from Java to HDFS

Our long-running Java application generates a huge amount of data, but we are unable to store that data efficiently.
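import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HDFSWriter {
    FSDataOutputStream out = null;
    FileSystem fs = null;
    Configuration conf = null;
    static int linescounter = 0;

    // Opens the output file; called once at the start of the program.
    void CreateHDFSFile() throws IOException {
        Path filePath = new Path("filename.CSV");
        conf = new Configuration();
        fs = FileSystem.get(conf);
        out = fs.create(filePath);
    }

    // Writes one CSV line; called for every line.
    void writeHDFSFile(String csvLine) throws IOException {
        out.writeBytes(csvLine);
        linescounter++;
        if (linescounter >= 500) {
            linescounter = 0;
            // hsync() and hflush() were each tried at this point:
            //out.hsync();
            //out.hflush();
        }
    }

    // Called once at the end of the program.
    void close() throws IOException {
        fs.close();
    }
}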
The CreateHDFSFile method is called at the start of the program.
The writeHDFSFile method is called for each line to be written to the HDFS file.
The close method is called at the end of the program.
Even though I invoke hsync or hflush, the data does not appear in HDFS. It appears only after the whole program has completed, i.e. after fs.close().
How can I make the data available while the HDFS file is still being written, either at a regular time interval or after a particular number of records?
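One possible shape for "flush after a particular number of records" is sketched below with plain java.io, so that it runs without a cluster. The names BatchedLineWriter and IoAction and the batch size are hypothetical. With Hadoop, the stream would be the FSDataOutputStream returned by fs.create(...), and the flush action would be out::hflush or out::hsync. This is a sketch of the batching logic only, not a confirmed fix for the behaviour described above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: writes lines and runs a flush action every batchSize lines. */
final class BatchedLineWriter implements AutoCloseable {

    /** A flush-like action that may throw IOException (for HDFS, e.g. out::hflush). */
    interface IoAction {
        void run() throws IOException;
    }

    private final OutputStream out;
    private final IoAction flushAction;
    private final int batchSize;
    private int pending = 0;     // lines written since the last flush
    private int flushCount = 0;  // number of times the flush action has run

    BatchedLineWriter(OutputStream out, IoAction flushAction, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.out = out;
        this.flushAction = flushAction;
        this.batchSize = batchSize;
    }

    /** Writes the line exactly once, then flushes if the batch is full. */
    void writeLine(String csvLine) throws IOException {
        out.write((csvLine + "\n").getBytes(StandardCharsets.UTF_8));
        pending++;
        if (pending >= batchSize) {
            flushAction.run();
            flushCount++;
            pending = 0;
        }
    }

    int flushCount() {
        return flushCount;
    }

    /** Closes the underlying stream itself, not only the file system. */
    @Override
    public void close() throws IOException {
        out.close();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (BatchedLineWriter writer = new BatchedLineWriter(sink, sink::flush, 500)) {
            for (int i = 0; i < 1200; i++) {
                writer.writeLine("row," + i);
            }
            System.out.println("flushes so far: " + writer.flushCount());
        }
    }
}
```

When checking the result on HDFS, note that a directory listing may keep reporting the old file length until the file is closed, even after a flush, so reading the file (for example with hadoop fs -cat) may be a more reliable check than looking at its size.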

Related

How to load an H2O-trained model in a Java UDF

I am trying to load a trained XGBoost model for use in a custom UDF written in Java. The file is in zip format and stored in HDFS.
I tried to read it using the Path class, but that does not work.
import org.apache.hadoop.fs.Path;

public EasyPredictModelWrapper loadModel(String xgBoostModelFile) {
    if (model == null) {
        synchronized (_lockObject) {
            if (model == null) {
                log.info("Model has not been loaded, loading ...");
                try {
                    Path path = new Path(xgBoostModelFile);
                    // Doesn't compile, since MojoModel.load only takes a String as input.
                    model = new EasyPredictModelWrapper(MojoModel.load(path));
                } catch (IOException e) {
                    log.error("Got an exception while trying to load xgBoostModel \n", e);
                }
            }
        }
    }
    return model;
}
I want to load model.zip successfully.

I got an answer in the H2O Slack community:
FileSystem fs = FileSystem.get(new Configuration());
Path path = new Path(xgBoostModelFile);
FSDataInputStream inputStream = fs.open(path);
MojoReaderBackend mojoReaderBackend = MojoReaderBackendFactory.createReaderBackend(inputStream, CachingStrategy.MEMORY);
model = new EasyPredictModelWrapper(MojoModel.load(mojoReaderBackend));

try with resource printwriter

I am trying to learn how to use try-with-resources. First I tried to put java.io.File myFile = new java.io.File(filename) in the resource parentheses, but NetBeans told me that File is not AutoCloseable. Am I handling this exception properly? I was under the impression that the exception would be thrown on the line where I create the File object.
// This method writes to a CSV or TXT file; specify the full file path (including
// extension). Each value will be on a new line.
public void writeFile(String filename) {
    java.io.File myFile = new java.io.File(filename);
    try (java.io.PrintWriter outfile = new java.io.PrintWriter(myFile)) {
        for (int i = 0; i < size; i++) {
            // print all used elements line by line
            outfile.println(Integer.toString(this.getElement(i)));
        }
    } catch (FileNotFoundException fileNotFoundException) {
        // print error
    }
} // end writeFile(String)
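On the question of where the exception comes from: creating a java.io.File object does not open anything on disk and declares no checked exception, and File does not implement AutoCloseable, so it cannot go in the resource parentheses. The FileNotFoundException is declared by the PrintWriter(File) constructor, which is the call that actually opens the file. A minimal sketch of this follows; the class name, the method name and the boolean return value are hypothetical choices made for illustration.

```java
import java.io.File;
import java.io.FileNotFoundException;
import java.io.PrintWriter;

final class TryWithResourcesDemo {

    /** Writes each value on its own line; returns false if the file cannot be opened. */
    static boolean writeValues(String filename, int[] values) {
        // Building the File object only wraps the path; nothing is opened here.
        File target = new File(filename);
        // PrintWriter(File) opens the file and is what can throw FileNotFoundException.
        try (PrintWriter out = new PrintWriter(target)) {
            for (int value : values) {
                out.println(value);
            }
            return true;
        } catch (FileNotFoundException e) {
            System.err.println("Cannot open " + filename + ": " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        String filename = args.length > 0 ? args[0] : "values.txt";
        System.out.println("written: " + writeValues(filename, new int[] {1, 2, 3}));
    }
}
```

The writer is closed automatically when the try block exits, whether normally or through an exception, so no explicit close() call is needed.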

Writing to a file in S3 from jar on EMR on AWS

Is there any way to write a file from my Java jar to the S3 folder where my reduce output files are written? I have tried something like:
FileSystem fs = FileSystem.get(conf);
FSDataOutputStream FS = fs.create(new Path("S3 folder output path"+"//Result.txt"));
PrintWriter writer = new PrintWriter(FS);
writer.write(averageDelay.toString());
writer.close();
FS.close();
Here Result.txt is the new file which I would want to write.

Answering my own question:
I found my mistake. I should be passing the URI of the S3 folder path to the FileSystem object, like below:
FileSystem fileSystem = FileSystem.get(URI.create(otherArgs[1]),conf);
FSDataOutputStream fsDataOutputStream = fileSystem.create(new Path(otherArgs[1]+"//Result.txt"));
PrintWriter writer = new PrintWriter(fsDataOutputStream);
writer.write("\n Average Delay:"+averageDelay);
writer.close();
fsDataOutputStream.close();

FileSystem fileSystem = FileSystem.get(URI.create(otherArgs[1]), new JobConf(<Your_Class_Name_here>.class));
FSDataOutputStream fsDataOutputStream = fileSystem.create(new Path(otherArgs[1] + "//Result.txt"));
PrintWriter writer = new PrintWriter(fsDataOutputStream);
writer.write("\n Average Delay:" + averageDelay);
writer.close();
fsDataOutputStream.close();
This is how I handled the conf variable in the above code block, and it worked like a charm.

Here's another way to do it in Java by using the AWS S3 putObject directly with a string buffer.
... AmazonS3 s3Client;

public void reduce(Text key, java.lang.Iterable<Text> values, Reducer<Text, Text, Text, Text>.Context context) throws Exception {
    UUID fileUUID = UUID.randomUUID();
    SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
    sdf.setTimeZone(TimeZone.getTimeZone("UTC"));
    String fileName = String.format("nightly-dump/%s/%s-%s", sdf.format(new Date()), key, fileUUID);
    log.info("Filename = [{}]", fileName);
    // Collect all values for this key, one per line.
    StringBuilder content = new StringBuilder();
    int count = 0;
    for (Text value : values) {
        count++;
        content.append(value.toString()).append("\n");
    }
    log.info("Count = {}, S3Lines = \n{}", count, content);
    // Upload the collected lines as a single S3 object.
    PutObjectResult putObjectResult = s3Client.putObject(S3_BUCKETNAME, fileName, content.toString());
    log.info("Put versionId = {}", putObjectResult.getVersionId());
    reduceWriteContext("1", "1");
    context.setStatus("COMPLETED");
}

Hadoop dir/file last modification times

Is there a way to get the last modified times of all directories and files in HDFS? I want to create a page that displays this information, but I have no idea how to collect all the last modification times into one .txt file.

See if this helps:
public class HdfsDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.addResource(new Path("/Users/miqbal1/hadoop-eco/hadoop-1.1.2/conf/core-site.xml"));
        conf.addResource(new Path("/Users/miqbal1/hadoop-eco/hadoop-1.1.2/conf/hdfs-site.xml"));
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Enter the directory name : ");
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        Path path = new Path(br.readLine());
        displayDirectoryContents(fs, path);
        fs.close();
    }

    // Recursively prints every directory and file under rootDir with its modification time.
    private static void displayDirectoryContents(FileSystem fs, Path rootDir) {
        try {
            FileStatus[] status = fs.listStatus(rootDir);
            for (FileStatus file : status) {
                if (file.isDir()) {
                    System.out.println("DIRECTORY : " + file.getPath() + " - Last modification time : " + file.getModificationTime());
                    displayDirectoryContents(fs, file.getPath());
                } else {
                    System.out.println("FILE : " + file.getPath() + " - Last modification time : " + file.getModificationTime());
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
One thing to note, though: getModificationTime() returns the modification time of the file in milliseconds since January 1, 1970 UTC.
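Because the value is in epoch milliseconds, it usually needs converting before it is shown on a page. A minimal sketch using java.time follows; the class name, the choice of UTC and the output pattern are assumptions made for illustration and are not part of the answer above.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

final class ModificationTimeFormatter {

    // Assumed display format; adjust the pattern and zone as needed.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss").withZone(ZoneOffset.UTC);

    /** Converts milliseconds since the epoch into a readable UTC timestamp. */
    static String format(long epochMillis) {
        return FORMAT.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        System.out.println(format(0L)); // prints "1970-01-01 00:00:00"
        System.out.println(format(System.currentTimeMillis()));
    }
}
```

In a listing loop, the value returned by file.getModificationTime() could be passed to format(...) before printing.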

You probably have to iterate through the files and directories to get the status of each path. You can use the code below (just a sample), but I'm not sure how efficient it would be if you have a large set of files and directories.
Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://<namenode_ip_address>:<port>");
conf.set("mapred.job.tracker", "<jobtracker_ip_address>:<port>");
conf.setBoolean("fs.hdfs.impl.disable.cache", true);
FileSystem fs = FileSystem.get(conf);
long modificationTime = fs.getFileStatus(new Path("/your/path")).getModificationTime();

You can also use the hadoop fs -stat shell command; see the FileSystem shell documentation:
https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#stat

BufferedReader and BufferedWriter for reading and writing HDFS files

I'm trying to read an HDFS file line by line, and then create another HDFS file and write to it line by line. The code I use looks like this:
Path FileToRead = new Path(inputPath);
FileSystem hdfs = FileToRead.getFileSystem(new Configuration());
FSDataInputStream fis = hdfs.open(FileToRead);
BufferedReader reader = new BufferedReader(new InputStreamReader(fis));
String line;
line = reader.readLine();
while (line != null) {
    String[] lineElem = line.split(",");
    for (int i = 0; i < 10; i++) {
        MyMatrix[i][Integer.valueOf(lineElem[0]) - 1] = Double.valueOf(lineElem[i + 1]);
    }
    line = reader.readLine();
}
reader.close();
fis.close();
Path FileToWrite = new Path(outputPath + "/V");
FileSystem fs = FileSystem.get(new Configuration());
FSDataOutputStream fileOut = fs.create(FileToWrite);
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(fileOut));
writer.write("check");
writer.close();
fileOut.close();
When I run this code, the file V is not created in my outputPath. But if I swap the reading part and the writing part, the file is created and check is written into it.
Can anyone please help me understand how to use them correctly, so that I can first read the whole file and then write to the other file line by line?
I have also tried other code that reads from one file and writes to another; in that case the file is created, but nothing is written into it.
I run it with something like this:
hadoop jar main.jar program2.Main input output
Then in my first job I read from args[0] and write to a file in args[1]+"/NewV" using MapReduce classes, and it works.
In my other (non-MapReduce) class I use args[1]+"/NewV" as the input path and output+"/V_0" as the output path (I pass these strings to the constructor). Here is the code for the class:
public class Init_V {
    String inputPath, outputPath;

    public Init_V(String inputPath, String outputPath) throws Exception {
        this.inputPath = inputPath;
        this.outputPath = outputPath;
        try {
            FileSystem fs = FileSystem.get(new Configuration());
            Path FileToWrite = new Path(outputPath + "/V.txt");
            Path FileToRead = new Path(inputPath);
            BufferedWriter output = new BufferedWriter(new OutputStreamWriter(fs.create(FileToWrite, true)));
            BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(FileToRead)));
            String data;
            data = reader.readLine();
            while (data != null) {
                output.write(data);
                data = reader.readLine();
            }
            reader.close();
            output.close();
        } catch (Exception e) {
        }
    }
}
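Two details of the copy loop above may be worth checking. readLine() strips the line terminator, so calling output.write(data) alone joins all lines together, and the empty catch block hides any exception that would explain an empty output file. The sketch below shows the same loop with those two points addressed. It takes a plain Reader and Writer so that it runs without a cluster; with Hadoop these would wrap fs.open(...) and fs.create(...). LineCopier and copyLines are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

final class LineCopier {

    /** Copies source to target line by line and returns the number of lines copied. */
    static int copyLines(Reader source, Writer target) throws IOException {
        int count = 0;
        // try-with-resources closes both streams even if an exception is thrown.
        try (BufferedReader reader = new BufferedReader(source);
             BufferedWriter writer = new BufferedWriter(target)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine(); // readLine() dropped the terminator, so add one back
                count++;
            }
        }
        // Exceptions are not swallowed; they propagate to the caller.
        return count;
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        int copied = copyLines(new StringReader("1,0.5\n2,0.7\n"), out);
        System.out.println(copied + " lines copied");
        System.out.print(out);
    }
}
```

Letting the exception propagate, or at least logging it in the catch block, makes it visible whether the read or the write is the step that fails.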

I think you need to understand how Hadoop works. In Hadoop, many things are done by the system: you just give the input and output paths, and Hadoop opens and creates them if the paths are valid. Check the following example:
public int run(String[] args) throws Exception {
    // Expect exactly the two arguments named in the usage message.
    if (args.length != 2) {
        System.err.println("Usage: MapReduce <input path> <output path>");
        ToolRunner.printGenericCommandUsage(System.err);
        return -1;
    }
    Job job = new Job();
    job.setJarByClass(MyClass.class);
    job.setNumReduceTasks(5);
    job.setJobName("myclass");
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(MyMapper.class);
    job.setReducerClass(MyReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    return job.waitForCompletion(true) ? 0 : 1;
}

/* ----------------------main---------------------*/
public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MyClass(), args);
    System.exit(exitCode);
}
As you can see, you only initialize the necessary variables; the reading and writing are done by Hadoop.
Also, in your Mapper class you call context.write(key, value) inside map, and in your Reducer class you do the same; Hadoop writes the output for you.
If you use BufferedWriter/BufferedReader, it will write to your local file system, not to HDFS. To see files in HDFS, run hadoop fs -ls <path>; the files you see with a plain ls command are in your local file system.
EDIT: To read and write files yourself, you should know the following. Say you have N machines in your Hadoop cluster. When you read, you do not know which mapper is doing the reading, and the same goes for writing. So the paths must be accessible to all mappers and reducers; otherwise you will get an exception.
I don't know whether you could use any other class, but for your specific purpose you can use two methods: setup and cleanup. Each is called only once per map or reduce task, so you can open your files in setup and close them in cleanup. The reading and writing themselves are the same as in normal Java code. For example, if you want to record something for each key in a text file, you can do the following:
//in reducer
BufferedWriter bw ..;
void setup(...) {
    bw = new ....;
}
void reduce(...) {
    while (iter.hasNext()) {
        ....;
    }
    bw.write(key, ...);
}
void cleanup(...) {
    bw.close();
}
