Bufferreader and Bufferwriter for reading and writing hdfs files - hadoop

I'm trying to read from a hdfs file line by line and then create a hdfs file and write to it line by line. The code that I use looks like this:
Path FileToRead=new Path(inputPath);
FileSystem hdfs = FileToRead.getFileSystem(new Configuration());
FSDataInputStream fis = hdfs.open(FileToRead);
BufferedReader reader = new BufferedReader(new InputStreamReader(fis));
String line;
line = reader.readLine();
while (line != null){
String[] lineElem = line.split(",");
for(int i=0;i<10;i++){
MyMatrix[i][Integer.valueOf(lineElem[0])-1] = Double.valueOf(lineElem[i+1]);
}
line=reader.readLine();
}
reader.close();
fis.close();
Path FileToWrite = new Path(outputPath+"/V");
FileSystem fs = FileSystem.get(new Configuration());
FSDataOutputStream fileOut = fs.create(FileToWrite);
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(fileOut));
writer.write("check");
writer.close();
fileOut.close();
When I run this code in my outputPath file V has not been created. But if I replace the part for reading with the part for writing the file will be created and check is written into it.
Can anyone please help me understand how to use them correctly to be able to read first the whole file and then write to the file line by line?
I have also tried another code for reading from one file and writing to another one but the file will be created but there is nothing written into it!
I use sth like this:
hadoop jar main.jar program2.Main input output
Then in my first job I read from arg[0] and write to a file in args[1]+"/NewV" using map reduce classes and it works.
In my other class (non map reduce)I use args[1]+"/NewV" as input path and output+"/V_0" as output path (I pass these strings to constructor). here is the code for the class :
public class Init_V {
String inputPath, outputPath;
public Init_V(String inputPath, String outputPath) throws Exception {
this.inputPath = inputPath;
this.outputPath = outputPath;
try{
FileSystem fs = FileSystem.get(new Configuration());
Path FileToWrite = new Path(outputPath+"/V.txt");
Path FileToRead=new Path(inputPath);
BufferedWriter output = new BufferedWriter
(new OutputStreamWriter(fs.create(FileToWrite,
true)));
BufferedReader reader = new
BufferedReader(new InputStreamReader(fs.open(FileToRead)));
String data;
data = reader.readLine();
while ( data != null )
{
output.write(data);
data = reader.readLine();
}
reader.close();
output.close(); }catch(Exception e){
}
}
}

I think, you need to understand how hadoop works properly. In hadoop, many thing is done by the system, you are just giving input and output path, then they are opened and created by hadoop if the paths are valid. Check the following example;
public int run (String[] args) throws Exception{
if(args.length != 3){
System.err.println("Usage: MapReduce <input path> <output path> ");
ToolRunner.printGenericCommandUsage(System.err);
}
Job job = new Job();
job.setJarByClass(MyClass.class);
job.setNumReduceTasks(5);
job.setJobName("myclass");
FileInputFormat.addInputPath(job, new Path(args[0]) );
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
return job.waitForCompletion(true) ? 0:1 ;
}
/* ----------------------main---------------------*/
public static void main(String[] args) throws Exception{
int exitCode = ToolRunner.run(new MyClass(), args);
System.exit(exitCode);
}
As you see here, you only initialize necessary variables and reading&writing is done by hadoop.
Also, in your Mapper class you are saying context.write(key, value) inside map, and similarly in your Reduce class you are doing same, it writes for you.
If you use BufferedWriter/Reader it will write to your local file system not to HDFS. To see files in HDFS you should write hadoop fs -ls <path>, the files you are looking by ls command are in your local file system
EDIT: In order to use read/write you should know the followings: Let say you have N machine in your hadoop network. When you want to read, you will not know which mapper is reading, similarly writing. So, all mappers and reducer should have those paths not to give exception.
I dont know if you could use any other class but you can use two methods for your specific reason: startup and cleanup. These methods are used only once in each map and reduce worker. So if you want to read and write you can use that files. Reading and writing is same as normal java code. For example, you want to see something for each key, and want to write it to a txt. You can do the following:
//in reducer
BufferedReader bw ..;
void startup(...){
bw = new ....;
}
void reduce(...){
while(iter.hasNext()){ ....;
}
bw.write(key, ...);
}
void cleanup(...){
bw.close();
}

Related

How to call .sh script from a JavaFX button`s action handler? [duplicate]

It is quite simple to run a Unix command from Java.
Runtime.getRuntime().exec(myCommand);
But is it possible to run a Unix shell script from Java code? If yes, would it be a good practice to run a shell script from within Java code?
You should really look at Process Builder. It is really built for this kind of thing.
ProcessBuilder pb = new ProcessBuilder("myshellScript.sh", "myArg1", "myArg2");
Map<String, String> env = pb.environment();
env.put("VAR1", "myValue");
env.remove("OTHERVAR");
env.put("VAR2", env.get("VAR1") + "suffix");
pb.directory(new File("myDir"));
Process p = pb.start();
You can use Apache Commons exec library also.
Example :
package testShellScript;
import java.io.IOException;
import org.apache.commons.exec.CommandLine;
import org.apache.commons.exec.DefaultExecutor;
import org.apache.commons.exec.ExecuteException;
public class TestScript {
int iExitValue;
String sCommandString;
public void runScript(String command){
sCommandString = command;
CommandLine oCmdLine = CommandLine.parse(sCommandString);
DefaultExecutor oDefaultExecutor = new DefaultExecutor();
oDefaultExecutor.setExitValue(0);
try {
iExitValue = oDefaultExecutor.execute(oCmdLine);
} catch (ExecuteException e) {
System.err.println("Execution failed.");
e.printStackTrace();
} catch (IOException e) {
System.err.println("permission denied.");
e.printStackTrace();
}
}
public static void main(String args[]){
TestScript testScript = new TestScript();
testScript.runScript("sh /root/Desktop/testScript.sh");
}
}
For further reference, An example is given on Apache Doc also.
I think you have answered your own question with
Runtime.getRuntime().exec(myShellScript);
As to whether it is good practice... what are you trying to do with a shell script that you cannot do with Java?
I would say that it is not in the spirit of Java to run a shell script from Java. Java is meant to be cross platform, and running a shell script would limit its use to just UNIX.
With that said, it's definitely possible to run a shell script from within Java. You'd use exactly the same syntax you listed (I haven't tried it myself, but try executing the shell script directly, and if that doesn't work, execute the shell itself, passing the script in as a command line parameter).
Yes it is possible to do so. This worked out for me.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.omg.CORBA.portable.InputStream;
public static void readBashScript() {
try {
Process proc = Runtime.getRuntime().exec("/home/destino/workspace/JavaProject/listing.sh /"); //Whatever you want to execute
BufferedReader read = new BufferedReader(new InputStreamReader(
proc.getInputStream()));
try {
proc.waitFor();
} catch (InterruptedException e) {
System.out.println(e.getMessage());
}
while (read.ready()) {
System.out.println(read.readLine());
}
} catch (IOException e) {
System.out.println(e.getMessage());
}
}
Here is my example. Hope it make sense.
public static void excuteCommand(String filePath) throws IOException{
File file = new File(filePath);
if(!file.isFile()){
throw new IllegalArgumentException("The file " + filePath + " does not exist");
}
if(isLinux()){
Runtime.getRuntime().exec(new String[] {"/bin/sh", "-c", filePath}, null);
}else if(isWindows()){
Runtime.getRuntime().exec("cmd /c start " + filePath);
}
}
public static boolean isLinux(){
String os = System.getProperty("os.name");
return os.toLowerCase().indexOf("linux") >= 0;
}
public static boolean isWindows(){
String os = System.getProperty("os.name");
return os.toLowerCase().indexOf("windows") >= 0;
}
Yes, it is possible and you have answered it! About good practises, I think it is better to launch commands from files and not directly from your code. So you have to make Java execute the list of commands (or one command) in an existing .bat, .sh , .ksh ... files.
Here is an example of executing a list of commands in a file MyFile.sh:
String[] cmd = { "sh", "MyFile.sh", "\pathOfTheFile"};
Runtime.getRuntime().exec(cmd);
To avoid having to hardcode an absolute path, you can use the following method that will find and execute your script if it is in your root directory.
public static void runScript() throws IOException, InterruptedException {
ProcessBuilder processBuilder = new ProcessBuilder("./nameOfScript.sh");
//Sets the source and destination for subprocess standard I/O to be the same as those of the current Java process.
processBuilder.inheritIO();
Process process = processBuilder.start();
int exitValue = process.waitFor();
if (exitValue != 0) {
// check for errors
new BufferedInputStream(process.getErrorStream());
throw new RuntimeException("execution of script failed!");
}
}
As for me all things must be simple.
For running script just need to execute
new ProcessBuilder("pathToYourShellScript").start();
The ZT Process Executor library is an alternative to Apache Commons Exec. It has functionality to run commands, capturing their output, setting timeouts, etc.
I have not used it yet, but it looks reasonably well-documented.
An example from the documentation: Executing a command, pumping the stderr to a logger, returning the output as UTF8 string.
String output = new ProcessExecutor().command("java", "-version")
.redirectError(Slf4jStream.of(getClass()).asInfo())
.readOutput(true).execute()
.outputUTF8();
Its documentation lists the following advantages over Commons Exec:
Improved handling of streams
Reading/writing to streams
Redirecting stderr to stdout
Improved handling of timeouts
Improved checking of exit codes
Improved API
One liners for quite complex use cases
One liners to get process output into a String
Access to the Process object available
Support for async processes ( Future )
Improved logging with SLF4J API
Support for multiple processes
This is a late answer. However, I thought of putting the struggle I had to bear to get a shell script to be executed from a Spring-Boot application for future developers.
I was working in Spring-Boot and I was not able to find the file to be executed from my Java application and it was throwing FileNotFoundFoundException. I had to keep the file in the resources directory and had to set the file to be scanned in pom.xml while the application was being started like the following.
<resources>
<resource>
<directory>src/main/resources</directory>
<filtering>true</filtering>
<includes>
<include>**/*.xml</include>
<include>**/*.properties</include>
<include>**/*.sh</include>
</includes>
</resource>
</resources>
After that I was having trouble executing the file and it was returning error code = 13, Permission Denied. Then I had to make the file executable by running this command - chmod u+x myShellScript.sh
Finally, I could execute the file using the following code snippet.
public void runScript() {
ProcessBuilder pb = new ProcessBuilder("src/main/resources/myFile.sh");
try {
Process p;
p = pb.start();
} catch (IOException e) {
e.printStackTrace();
}
}
Hope that solves someone's problem.
Here is an example how to run an Unix bash or Windows bat/cmd script from Java. Arguments can be passed on the script and output received from the script. The method accepts arbitrary number of arguments.
public static void runScript(String path, String... args) {
try {
String[] cmd = new String[args.length + 1];
cmd[0] = path;
int count = 0;
for (String s : args) {
cmd[++count] = args[count - 1];
}
Process process = Runtime.getRuntime().exec(cmd);
BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(process.getInputStream()));
try {
process.waitFor();
} catch (Exception ex) {
System.out.println(ex.getMessage());
}
while (bufferedReader.ready()) {
System.out.println("Received from script: " + bufferedReader.readLine());
}
} catch (Exception ex) {
System.out.println(ex.getMessage());
System.exit(1);
}
}
When running on Unix/Linux, the path must be Unix-like (with '/' as separator), when running on Windows - use '\'. Hier is an example of a bash script (test.sh) that receives arbitrary number of arguments and doubles every argument:
#!/bin/bash
counter=0
while [ $# -gt 0 ]
do
echo argument $((counter +=1)): $1
echo doubling argument $((counter)): $(($1+$1))
shift
done
When calling
runScript("path_to_script/test.sh", "1", "2")
on Unix/Linux, the output is:
Received from script: argument 1: 1
Received from script: doubling argument 1: 2
Received from script: argument 2: 2
Received from script: doubling argument 2: 4
Hier is a simple cmd Windows script test.cmd that counts number of input arguments:
#echo off
set a=0
for %%x in (%*) do Set /A a+=1
echo %a% arguments received
When calling the script on Windows
runScript("path_to_script\\test.cmd", "1", "2", "3")
The output is
Received from script: 3 arguments received
It is possible, just exec it as any other program. Just make sure your script has the proper #! (she-bang) line as the first line of the script, and make sure there are execute permissions on the file.
For example, if it is a bash script put #!/bin/bash at the top of the script, also chmod +x .
Also as for if it's good practice, no it's not, especially for Java, but if it saves you a lot of time porting a large script over, and you're not getting paid extra to do it ;) save your time, exec the script, and put the porting to Java on your long-term todo list.
I think with
System.getProperty("os.name");
Checking the operating system on can manage the shell/bash scrips if such are supported.
if there is need to make the code portable.
String scriptName = PATH+"/myScript.sh";
String commands[] = new String[]{scriptName,"myArg1", "myArg2"};
Runtime rt = Runtime.getRuntime();
Process process = null;
try{
process = rt.exec(commands);
process.waitFor();
}catch(Exception e){
e.printStackTrace();
}
Just the same thing that Solaris 5.10 it works like this ./batchstart.sh there is a trick I donĀ“t know if your OS accept it use \\. batchstart.sh instead. This double slash may help.
for linux use
public static void runShell(String directory, String command, String[] args, Map<String, String> environment)
{
try
{
if(directory.trim().equals(""))
directory = "/";
String[] cmd = new String[args.length + 1];
cmd[0] = command;
int count = 1;
for(String s : args)
{
cmd[count] = s;
count++;
}
ProcessBuilder pb = new ProcessBuilder(cmd);
Map<String, String> env = pb.environment();
for(String s : environment.keySet())
env.put(s, environment.get(s));
pb.directory(new File(directory));
Process process = pb.start();
BufferedReader inputReader = new BufferedReader(new InputStreamReader(process.getInputStream()));
BufferedWriter outputReader = new BufferedWriter(new OutputStreamWriter(process.getOutputStream()));
BufferedReader errReader = new BufferedReader(new InputStreamReader(process.getErrorStream()));
int exitValue = process.waitFor();
if(exitValue != 0) // has errors
{
while(errReader.ready())
{
LogClass.log("ErrShell: " + errReader.readLine(), LogClass.LogMode.LogAll);
}
}
else
{
while(inputReader.ready())
{
LogClass.log("Shell Result : " + inputReader.readLine(), LogClass.LogMode.LogAll);
}
}
}
catch(Exception e)
{
LogClass.log("Err: RunShell, " + e.toString(), LogClass.LogMode.LogAll);
}
}
public static void runShell(String path, String command, String[] args)
{
try
{
String[] cmd = new String[args.length + 1];
if(!path.trim().isEmpty())
cmd[0] = path + "/" + command;
else
cmd[0] = command;
int count = 1;
for(String s : args)
{
cmd[count] = s;
count++;
}
Process process = Runtime.getRuntime().exec(cmd);
BufferedReader inputReader = new BufferedReader(new InputStreamReader(process.getInputStream()));
BufferedWriter outputReader = new BufferedWriter(new OutputStreamWriter(process.getOutputStream()));
BufferedReader errReader = new BufferedReader(new InputStreamReader(process.getErrorStream()));
int exitValue = process.waitFor();
if(exitValue != 0) // has errors
{
while(errReader.ready())
{
LogClass.log("ErrShell: " + errReader.readLine(), LogClass.LogMode.LogAll);
}
}
else
{
while(inputReader.ready())
{
LogClass.log("Shell Result: " + inputReader.readLine(), LogClass.LogMode.LogAll);
}
}
}
catch(Exception e)
{
LogClass.log("Err: RunShell, " + e.toString(), LogClass.LogMode.LogAll);
}
}
and for usage;
ShellAssistance.runShell("", "pg_dump", new String[]{"-U", "aliAdmin", "-f", "/home/Backup.sql", "StoresAssistanceDB"});
OR
ShellAssistance.runShell("", "pg_dump", new String[]{"-U", "aliAdmin", "-f", "/home/Backup.sql", "StoresAssistanceDB"}, new Hashmap<>());

How to write huge data from Java to HDFS

Our Java application generates huge data (long running program), but unable to store the data efficiently.
Public class HDFSWriter {
FSDataOutputStream out = null;
FileSystem fs = null;
Configuration conf = null;
static int linescounter = 0;
void CreateHDFSFile() {
Path filePath = new Path("filename.CSV");
conf = new Configuration();
fs = FileSystem.get(conf);
out = fs.create(filePath);
}
void writeHDFSFile(String csvLine) {
out.writeBytes(csvLine);
linescounter++;
if(linescounter>=500) {
linescounter=0;
out.writeBytes(csvLine);
//out.hsync();
//out.hflush();
}
}
void close() {
fs.close();
}
}
CreateHDFSFile method is called start of the program.
writeHDFSFile method is called for each line to insert into HDFS File.
close method is called at end of the program.
Even though I invoke hsync or hflush, data is not appearing in HDFS. It's appearing only after the complete program is completed i.e, after fs.close().
How to make the data available during the HDFS file is created or at every time interval or particular number of records?

Writing to a file in S3 from jar on EMR on AWS

Is there any way in which I can write to a file from my Java jar to an S3 folder where my reduce files would be written ? I have tried something like:
FileSystem fs = FileSystem.get(conf);
FSDataOutputStream FS = fs.create(new Path("S3 folder output path"+"//Result.txt"));
PrintWriter writer = new PrintWriter(FS);
writer.write(averageDelay.toString());
writer.close();
FS.close();
Here Result.txt is the new file which I would want to write.
Answering my own question:-
I found my mistake.I should be passing the URI of S3 folder path to the fileSystem Object like below:-
FileSystem fileSystem = FileSystem.get(URI.create(otherArgs[1]),conf);
FSDataOutputStream fsDataOutputStream = fileSystem.create(new Path(otherArgs[1]+"//Result.txt"));
PrintWriter writer = new PrintWriter(fsDataOutputStream);
writer.write("\n Average Delay:"+averageDelay);
writer.close();
fsDataOutputStream.close();
FileSystem fileSystem = FileSystem.get(URI.create(otherArgs[1]),new JobConf(<Your_Class_Name_here>.class));
FSDataOutputStream fsDataOutputStream = fileSystem.create(new
Path(otherArgs[1]+"//Result.txt"));
PrintWriter writer = new PrintWriter(fsDataOutputStream);
writer.write("\n Average Delay:"+averageDelay);
writer.close();
fsDataOutputStream.close();
This is how I handled the conf variable in the above code block and it worked like charm.
Here's another way to do it in Java by using the AWS S3 putObject directly with a string buffer.
... AmazonS3 s3Client;
public void reduce(Text key, java.lang.Iterable<Text> values, Reducer<Text, Text, Text, Text>.Context context) throws Exception {
UUID fileUUID = UUID.randomUUID();
SimpleDateFormat sdf = new SimpleDateFormat("yyy-MM-dd");
sdf.setTimeZone(TimeZone.getTimeZone("UTC"));
String fileName = String.format("nightly-dump/%s/%s-%s",sdf.format(new Date()), key, fileUUID);
log.info("Filename = [{}]", fileName);
String content = "";
int count = 0;
for (Text value : values) {
count++;
String s3Line = value.toString();
content += s3Line + "\n";
}
log.info("Count = {}, S3Lines = \n{}", count, content);
PutObjectResult putObjectResult = s3Client.putObject(S3_BUCKETNAME, fileName, content);
log.info("Put versionId = {}", putObjectResult.getVersionId());
reduceWriteContext("1", "1");
context.setStatus("COMPLETED");
}

InputStreamReader throws NPE on second iteration of for loop

InputStreamReader throws an NPE as follows when I execute this code on the second iteration of the for loop. The code works perfectly for the first iteration and returns the following NPE on the second iteration. I am using the code snippet to read contents of specific file from an FTP location and display them. Please note all the lines till the new InputStreamReader work perfectly even on second iteration. Any ideas why?
Exception in thread "main" java.lang.NullPointerException
at java.io.Reader.<init>(Reader.java:61)
at java.io.InputStreamReader.<init>(InputStreamReader.java:55)
at com.test.txtweb.server.task.CallBackRetryTask.main(CallBackRetryTask.java:229)
Here is the source code:
public static void main(String[] args){
String strDate = "20130805";
FTPClient ftpClient = new FTPClient();
try {
ftpClient.connect(host);
String pathToFiles = "/path/to/File";
String ftpFileName = "";
List<String> ftpFileNames = null;
InputStream iStream;
if(ftpClient.login(username, password)){
ftpClient.enterLocalPassiveMode();
FTPFile[] ftpFiles = ftpClient.listFiles();
ftpFileNames = new ArrayList<String>();
for (FTPFile ftpFile : ftpFiles) {
ftpFileName = ftpFile.getName();
if(ftpFileName.contains(strDate)){
iStream = ftpClient.retrieveFileStream(pathToFiles + ftpFileName);
System.out.println(ftpClient.getReplyString());
InputStreamReader isr = new InputStreamReader(iStream); //Error on this line on second iteration
BufferedReader br = new BufferedReader(isr);
String line = "";
while ((line = br.readLine()) != null) {
System.out.println(line);
}
}
}
}
Figured out the issue after quite a lot of debugging. I was not calling the completePendingCommand() after transferring the file to check the status of the transfer.
The API for FTPClient.retrieveFileStream() states that to finalize the file transfer you must call FTPClient.completePendingCommand() and check its return value to verify success. After correcting this issue, everything started working fine.

Hadoop dir/file last modification times

Is there a way to get the last modified times of all dirs and files in hdfs? I want to create page that displays the information, but I have no clue how to go about getting the last mod times all in one .txt file.
See if it helps :
public class HdfsDemo {
public static void main(String[] args) throws IOException {
Configuration conf = new Configuration();
conf.addResource(new Path("/Users/miqbal1/hadoop-eco/hadoop-1.1.2/conf/core-site.xml"));
conf.addResource(new Path("/Users/miqbal1/hadoop-eco/hadoop-1.1.2/conf/hdfs-site.xml"));
FileSystem fs = FileSystem.get(conf);
System.out.println("Enter the directory name : ");
BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
Path path = new Path(br.readLine());
displayDirectoryContents(fs, path);
fs.close();
}
private static void displayDirectoryContents(FileSystem fs, Path rootDir) {
// TODO Auto-generated method stub
try {
FileStatus[] status = fs.listStatus(rootDir);
for (FileStatus file : status) {
if (file.isDir()) {
System.out.println("DIRECTORY : " + file.getPath() + " - Last modification time : " + file.getModificationTime());
displayDirectoryContents(fs, file.getPath());
} else {
System.out.println("FILE : " + file.getPath() + " - Last modification time : " + file.getModificationTime());
}
}
} catch (IOException e) {
e.printStackTrace();
}
}
}
One thing to notice though, getModificationTime() returns the modification time of file in milliseconds since January 1, 1970 UTC.
You probably have to iterate through the files and directories, to get the status of each path - you can use the below code (just sample) - but I'm not sure, how efficient that would be, if you have large set of files and directories.
Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://<namenod_ip_address:<port>");
conf.set("mapred.job.tracker", "<jobtracker_ip_address>:<port>");
conf.setBoolean("fs.hdfs.impl.disable.cache", true);
FileSystem lfs = FileSystem.get(l_configuration);
fs.getFileStatus(new Path("/your/path")).getModificationTime();
hadoop fs -stat
#hadoop commands fs
https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#stat

Resources