Hadoop dir/file last modification times - hadoop

Is there a way to get the last modified times of all dirs and files in HDFS? I want to create a page that displays the information, but I have no clue how to go about getting the last modification times all in one .txt file.

See if this helps:
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsDemo {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        // Load the cluster settings so that FileSystem.get() returns HDFS
        conf.addResource(new Path("/Users/miqbal1/hadoop-eco/hadoop-1.1.2/conf/core-site.xml"));
        conf.addResource(new Path("/Users/miqbal1/hadoop-eco/hadoop-1.1.2/conf/hdfs-site.xml"));
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Enter the directory name : ");
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        Path path = new Path(br.readLine());
        displayDirectoryContents(fs, path);
        fs.close();
    }

    // Recursively prints every directory and file under rootDir with its modification time
    private static void displayDirectoryContents(FileSystem fs, Path rootDir) {
        try {
            FileStatus[] status = fs.listStatus(rootDir);
            for (FileStatus file : status) {
                if (file.isDir()) {
                    System.out.println("DIRECTORY : " + file.getPath() + " - Last modification time : " + file.getModificationTime());
                    displayDirectoryContents(fs, file.getPath());
                } else {
                    System.out.println("FILE : " + file.getPath() + " - Last modification time : " + file.getModificationTime());
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
One thing to note, though: getModificationTime() returns the file's modification time in milliseconds since January 1, 1970 UTC.
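Because that value is a raw epoch timestamp, it usually needs formatting before it goes on a page or into a .txt report. Here is a minimal sketch using only java.time. The class and method names (ModTimeFormat, format) are made up for illustration and are not part of the Hadoop API, and UTC is an assumed display zone:

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical helper, not part of Hadoop: formats the long returned by
// FileStatus.getModificationTime() as a readable UTC timestamp.
class ModTimeFormat {
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss 'UTC'").withZone(ZoneOffset.UTC);

    static String format(long epochMillis) {
        return FMT.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        // 0 ms is the epoch itself
        System.out.println(format(0L)); // prints 1970-01-01 00:00:00 UTC
    }
}
```

In the listing code you would call something like ModTimeFormat.format(file.getModificationTime()) instead of printing the long directly.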

You will probably have to iterate through the files and directories to get the status of each path. You can use the code below (just a sample), but I'm not sure how efficient it would be if you have a large set of files and directories.
Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://<namenode_ip_address>:<port>");
conf.set("mapred.job.tracker", "<jobtracker_ip_address>:<port>");
conf.setBoolean("fs.hdfs.impl.disable.cache", true);
FileSystem fs = FileSystem.get(conf);
long modificationTime = fs.getFileStatus(new Path("/your/path")).getModificationTime();

From the command line, use hadoop fs -stat:
https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#stat

Related

read files from hdfs location using spring batch 4 java config

I want to read all the files from an HDFS location and process them sequentially using Spring Batch. Currently I am using the MultiResourceItemReader to read the files from the local file system and process them.
I copy the files from the HDFS location to the local file system, and Spring Batch then reads them from the local file system.
//Read the files from the hdfs to local file system
private Resource[] getMultipleResourceItemreader() {
ArrayList<Resource> resource = new ArrayList<Resource>();
org.apache.hadoop.conf.Configuration configuration= new org.apache.hadoop.conf.Configuration();
configuration.set("fs.defaultFS", "hdfs://localhost:9000");
configuration.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
System.setProperty("HADOOP_USER_NAME", "xxxx");
System.setProperty("hadoop.home.dir", "D:\\rajesh\\softwares\\winutils");
FileSystem fs;
try {
fs = FileSystem.get(URI.create("hdfs://localhost:9000"), configuration);
FileStatus[] files = fs.listStatus(new Path("hdfsfilelocation"));
for (int i=0;i<files.length;i++){
//resource.add( context.getResource(files[i].getPath().toString()));
fs.copyToLocalFile(files[i].getPath(), new Path(batchConfigurationProperties.getCsvFilePath()));
deleteTempFile(batchConfigurationProperties.getCsvFilePath(),".crc");
}
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
File configJsonDirectory = new File(batchConfigurationProperties.getCsvFilePath());
File[] csvFileList = configJsonDirectory.listFiles();
for (File file : csvFileList) {
if (file.isFile()) {
resource.add(new FileSystemResource(file.getPath()));
}
}
return resource.toArray(new Resource[resource.size()]);
}

Writing to a file in S3 from jar on EMR on AWS

Is there any way I can write to a file from my Java jar to the S3 folder where my reduce output files are written? I have tried something like:
FileSystem fs = FileSystem.get(conf);
FSDataOutputStream FS = fs.create(new Path("S3 folder output path"+"//Result.txt"));
PrintWriter writer = new PrintWriter(FS);
writer.write(averageDelay.toString());
writer.close();
FS.close();
Here Result.txt is the new file that I want to write.
Answering my own question:
I found my mistake. I should be passing the URI of the S3 folder path to the FileSystem object, like below:
FileSystem fileSystem = FileSystem.get(URI.create(otherArgs[1]),conf);
FSDataOutputStream fsDataOutputStream = fileSystem.create(new Path(otherArgs[1]+"//Result.txt"));
PrintWriter writer = new PrintWriter(fsDataOutputStream);
writer.write("\n Average Delay:"+averageDelay);
writer.close();
fsDataOutputStream.close();
FileSystem fileSystem = FileSystem.get(URI.create(otherArgs[1]),new JobConf(<Your_Class_Name_here>.class));
FSDataOutputStream fsDataOutputStream = fileSystem.create(new Path(otherArgs[1]+"//Result.txt"));
PrintWriter writer = new PrintWriter(fsDataOutputStream);
writer.write("\n Average Delay:"+averageDelay);
writer.close();
fsDataOutputStream.close();
This is how I handled the conf variable in the above code block, and it worked like a charm.
Here's another way to do it in Java by using the AWS S3 putObject directly with a string buffer.
... AmazonS3 s3Client;
public void reduce(Text key, java.lang.Iterable<Text> values, Reducer<Text, Text, Text, Text>.Context context) throws Exception {
UUID fileUUID = UUID.randomUUID();
SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd");
sdf.setTimeZone(TimeZone.getTimeZone("UTC"));
String fileName = String.format("nightly-dump/%s/%s-%s",sdf.format(new Date()), key, fileUUID);
log.info("Filename = [{}]", fileName);
String content = "";
int count = 0;
for (Text value : values) {
count++;
String s3Line = value.toString();
content += s3Line + "\n";
}
log.info("Count = {}, S3Lines = \n{}", count, content);
PutObjectResult putObjectResult = s3Client.putObject(S3_BUCKETNAME, fileName, content);
log.info("Put versionId = {}", putObjectResult.getVersionId());
reduceWriteContext("1", "1");
context.setStatus("COMPLETED");
}

distcp java api exit the application

I need to copy files between AWS S3 and our local HDFS. I tried to use the DistCp Java API, but the problem is that at the end DistCp calls System.exit(), which stops my app too. So if I have multiple folders/files to copy and use multiple threads, each running a DistCp command, the first thread to finish its DistCp stops the app and with it the rest of the DistCp runs. Is there any way to avoid this? I know I can write my own MR job to do the copying, but I want to know if there are other options.
My code:
List<Future<Void>> calls = new ArrayList<Future<Void>>();
for (String dir : s3Dirs) {
final String[] args = new String[4];
args[0] = "-log";
args[1] = LOG_DIR;
args[2] = S3_DIR;
args[3] = LOCAL_HDFS_DIR;
calls.add(_exec.submit(new Callable<Void>() {
@Override
public Void call() throws Exception {
try {
DistCp.main(args); // DistCp command
} catch (Exception e) {
System.out.println("Failed to copy files from " + args[2] + " to " + args[3]);
}
return null;
}
}));
}
for (Future<Void> f : calls) {
try {
f.get();
} catch (Exception e) {
LOGGER.error("Error while distcp", e);
}
}
DistCp main():
public static void main(String argv[]) {
int exitCode;
try {
DistCp distCp = new DistCp();
Cleanup CLEANUP = new Cleanup(distCp);
ShutdownHookManager.get().addShutdownHook(CLEANUP,
SHUTDOWN_HOOK_PRIORITY);
exitCode = ToolRunner.run(getDefaultConf(), distCp, argv);
}
catch (Exception e) {
LOG.error("Couldn't complete DistCp operation: ", e);
exitCode = DistCpConstants.UNKNOWN_ERROR;
}
System.exit(exitCode); // exits here
}
I have used DistCp before and never faced the System.exit() problem, even with multiple threads. Instead of calling DistCp.main(), try invoking DistCp through ToolRunner, as the DistCp test cases in the hadoop tools package do. ToolRunner.run() returns the exit code to the caller rather than calling System.exit(), so it can be used from multiple threads. Here is the snippet from those test cases:
public void testCopyFromLocalToLocal() throws Exception {
Configuration conf = new Configuration();
FileSystem localfs = FileSystem.get(LOCAL_FS, conf);
MyFile[] files = createFiles(LOCAL_FS, TEST_ROOT_DIR+"/srcdat");
ToolRunner.run(new DistCp(new Configuration()),
new String[] {"file:///"+TEST_ROOT_DIR+"/srcdat",
"file:///"+TEST_ROOT_DIR+"/destdat"});
assertTrue("Source and destination directories do not match.",
checkFiles(localfs, TEST_ROOT_DIR+"/destdat", files));
deldir(localfs, TEST_ROOT_DIR+"/destdat");
deldir(localfs, TEST_ROOT_DIR+"/srcdat");
}
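The threading side of this can be sketched without any Hadoop classes. In the sketch below, ParallelCopies and CopyTask are hypothetical names, and the CopyTask lambda is a stand-in for a call such as ToolRunner.run(...) with a DistCp instance; the point is only that each task returns an exit code to its Future instead of terminating the JVM:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class ParallelCopies {
    // Hypothetical stand-in for "run one copy and return its exit code",
    // e.g. a wrapper around ToolRunner.run(...) in a real application.
    interface CopyTask {
        int run(String[] args) throws Exception;
    }

    // Runs one copy per argument array and returns the exit codes in submission order.
    static List<Integer> runAll(List<String[]> argSets, CopyTask runCopy, int threads)
            throws Exception {
        ExecutorService exec = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> calls = new ArrayList<>();
            for (String[] args : argSets) {
                Callable<Integer> task = () -> runCopy.run(args);
                calls.add(exec.submit(task));
            }
            List<Integer> exitCodes = new ArrayList<>();
            for (Future<Integer> f : calls) {
                exitCodes.add(f.get()); // blocks until that copy finishes
            }
            return exitCodes;
        } finally {
            exec.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String[]> jobs = Arrays.asList(
                new String[] {"src1", "dst1"},
                new String[] {"src2", "dst2"});
        // Fake copy that always succeeds, just to exercise the threading
        System.out.println(runAll(jobs, a -> 0, 2)); // prints [0, 0]
    }
}
```

A non-zero entry in the returned list tells you which copy failed, and the remaining copies still run to completion.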

Calling a method to read from a text file with BufferedReader

I searched around for this but could not find a solution.
Sorry about my bad description; I'm not very good at this.
I have a UI class that calls a "Lotto" class.
The Lotto class's constructor calls a method named readData(), which reads from a file using BufferedReader.
I'm not getting an error message, but it's just not reading.
It gets stuck at BufferedReader file = new BufferedReader... and goes to the catch block.
If it's a file-not-found problem, how would I make it find where my file is? I'm using Eclipse and the program is stored on my USB drive. I need to hand it in to my teacher, so I can't just hard-code a location. Is there code that finds where my program is and then takes the file from that folder?
Here is the code being used.
import java.io.*;
//constructor
public Lotto()
{
try
{
readData();
nc = new NumberChecker();
}
catch(IOException e)
{
System.out.println("There was a problem");
}
}
private void readData() throws IOException
{
//this method reads winning tickets date and pot from a file
BufferedReader file = new BufferedReader (new FileReader("data.txt"));
for(int i=0;i<5;i++)
{
System.out.println("in "+i);
winningNums[i] = file.readLine();
winningDates[i] = file.readLine();
weeksMoney[i] = Integer.parseInt(file.readLine());
System.out.println("out "+i);
}
file.close();
}
If you get an error on this line of code
BufferedReader file = new BufferedReader (new FileReader("data.txt"));
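then it is probably a FileNotFoundException.
A relative name like "data.txt" is resolved against the program's working directory (when launched from Eclipse, by default the project root), so make sure data.txt is there and not next to the .java source.
Alternatively, use a full path to your file, e.g. c:\my\path\data.txt.
And don't forget to escape each backslash as \\ in a Java string literal.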
Try surrounding the BufferedReader in a try/catch and look for a FileNotFoundException as well as an IOException. Also try putting in the fully qualified path name with double backslashes.
BufferedReader file;
try {
file = new BufferedReader (new FileReader("C:\\filepath\\data.txt"));
for(int i=0;i<5;i++)
{
System.out.println("in "+i);
winningNums[i] = file.readLine();
winningDates[i] = file.readLine();
weeksMoney[i] = Integer.parseInt(file.readLine());
System.out.println("out "+i);
}
} catch (FileNotFoundException e) {
// TODO Auto-generated catch block
e.printStackTrace();
} catch (IOException e) {
// TODO Auto-generated catch block
e.printStackTrace();
}
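To see where the program is actually looking, it helps to print the absolute path that a relative name resolves to, and to print the real exception rather than a generic message. A small sketch using only the standard library follows; DataFileCheck, resolve and readLines are illustrative names, and the file name data.txt is taken from the question:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

class DataFileCheck {
    // Resolves a relative name against the JVM's working directory (user.dir).
    static Path resolve(String name) {
        return Paths.get(name).toAbsolutePath();
    }

    // Reads all lines; try-with-resources closes the reader even on failure.
    static List<String> readLines(Path file) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        Path file = resolve("data.txt");
        System.out.println("Looking for: " + file);
        try {
            System.out.println("Read " + readLines(file).size() + " lines");
        } catch (IOException e) {
            // Print the real cause instead of a generic message
            System.out.println("Could not read file: " + e);
        }
    }
}
```

Keeping the path relative means the project still works when it is copied to another machine, as long as data.txt travels with it in the working directory.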

BufferedReader and BufferedWriter for reading and writing HDFS files

I'm trying to read from an HDFS file line by line and then create an HDFS file and write to it line by line. The code that I use looks like this:
Path FileToRead=new Path(inputPath);
FileSystem hdfs = FileToRead.getFileSystem(new Configuration());
FSDataInputStream fis = hdfs.open(FileToRead);
BufferedReader reader = new BufferedReader(new InputStreamReader(fis));
String line;
line = reader.readLine();
while (line != null){
String[] lineElem = line.split(",");
for(int i=0;i<10;i++){
MyMatrix[i][Integer.valueOf(lineElem[0])-1] = Double.valueOf(lineElem[i+1]);
}
line=reader.readLine();
}
reader.close();
fis.close();
Path FileToWrite = new Path(outputPath+"/V");
FileSystem fs = FileSystem.get(new Configuration());
FSDataOutputStream fileOut = fs.create(FileToWrite);
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(fileOut));
writer.write("check");
writer.close();
fileOut.close();
When I run this code, the file V is not created in my outputPath. But if I swap the reading part and the writing part, the file is created and "check" is written into it.
Can anyone please help me understand how to use them correctly, so that I can first read the whole file and then write to the file line by line?
I have also tried other code that reads from one file and writes to another; the file is created, but nothing is written into it!
I run it with something like this:
hadoop jar main.jar program2.Main input output
Then in my first job I read from args[0] and write to a file in args[1]+"/NewV" using MapReduce classes, and it works.
In my other (non-MapReduce) class I use args[1]+"/NewV" as the input path and output+"/V_0" as the output path (I pass these strings to the constructor). Here is the code for the class:
public class Init_V {
String inputPath, outputPath;
public Init_V(String inputPath, String outputPath) throws Exception {
this.inputPath = inputPath;
this.outputPath = outputPath;
try{
FileSystem fs = FileSystem.get(new Configuration());
Path FileToWrite = new Path(outputPath+"/V.txt");
Path FileToRead=new Path(inputPath);
BufferedWriter output = new BufferedWriter(new OutputStreamWriter(fs.create(FileToWrite, true)));
BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(FileToRead)));
String data;
data = reader.readLine();
while ( data != null )
{
output.write(data);
data = reader.readLine();
}
reader.close();
output.close(); }catch(Exception e){
}
}
}
I think you need to understand how Hadoop works. In Hadoop, many things are done by the system: you just give the input and output paths, and Hadoop opens and creates them if the paths are valid. Check the following example:
public int run (String[] args) throws Exception{
if(args.length != 2){
System.err.println("Usage: MapReduce <input path> <output path> ");
ToolRunner.printGenericCommandUsage(System.err);
return -1; // stop here instead of running with bad arguments
}
Job job = new Job();
job.setJarByClass(MyClass.class);
job.setNumReduceTasks(5);
job.setJobName("myclass");
FileInputFormat.addInputPath(job, new Path(args[0]) );
FileOutputFormat.setOutputPath(job, new Path(args[1]));
job.setMapperClass(MyMapper.class);
job.setReducerClass(MyReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
return job.waitForCompletion(true) ? 0:1 ;
}
/* ----------------------main---------------------*/
public static void main(String[] args) throws Exception{
int exitCode = ToolRunner.run(new MyClass(), args);
System.exit(exitCode);
}
As you see here, you only initialize the necessary variables; reading and writing are done by Hadoop.
Also, in your Mapper class you call context.write(key, value) inside map, and similarly in your Reducer class; Hadoop writes the output for you.
If you use BufferedWriter/Reader on a FileSystem that was created without your cluster configuration, it will write to your local file system, not to HDFS. To see files in HDFS you should run hadoop fs -ls <path>; the files you see with a plain ls command are in your local file system.
EDIT: To use read/write yourself, you should know the following. Say you have N machines in your Hadoop cluster. When you read, you will not know which mapper is doing the reading, and the same goes for writing. So those paths must be accessible to all mappers and reducers, or they will throw exceptions.
I don't know whether you could use another class, but for your purpose you can use two methods: setup and cleanup. Each is called only once per map or reduce task. So if you want to read or write extra files, you can do it there. Reading and writing work the same as in normal Java code. For example, if you want to record something for each key and write it to a text file, you can do the following:
//in reducer
BufferedWriter bw ..;
void setup(...){
bw = new ....;
}
void reduce(...){
while(iter.hasNext()){ ....;
}
bw.write(key + ...);
}
void cleanup(...){
bw.close();
}
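For the read-then-write loop in the question itself, two details are easy to miss: readLine() strips the line terminator, so it has to be written back explicitly, and an empty catch block hides whatever went wrong. Here is a sketch of the loop over plain streams, with LineCopy and copyLines as illustrative names. With HDFS you would presumably pass fs.open(path) and fs.create(path), since FSDataInputStream and FSDataOutputStream are ordinary InputStream and OutputStream subclasses; the example below uses in-memory streams so that it runs on its own:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;

class LineCopy {
    // Copies text line by line and returns the number of lines copied.
    static int copyLines(InputStream in, OutputStream out) throws IOException {
        int count = 0;
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
             BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.write('\n'); // readLine() strips the line terminator
                count++;
            }
        } // closing the writer flushes any buffered output
        return count;
    }

    public static void main(String[] args) throws IOException {
        InputStream in = new ByteArrayInputStream("a,1\nb,2\n".getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        System.out.println(copyLines(in, out)); // prints 2
    }
}
```

Letting the IOException propagate, rather than catching and ignoring it, is what makes a wrong path or a wrong file system visible.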

Resources