I want to read all the files from an HDFS location and process them sequentially using Spring Batch. Currently I am using a MultiResourceItemReader to read the files from the local file system and process them.
I copy the files from the HDFS location to the local file system first, and Spring Batch then reads the files from the local file system.
// Copy the files from HDFS to the local file system and expose them as Spring Resources
private Resource[] getMultipleResourceItemreader() {
    ArrayList<Resource> resource = new ArrayList<Resource>();
    org.apache.hadoop.conf.Configuration configuration = new org.apache.hadoop.conf.Configuration();
    configuration.set("fs.defaultFS", "hdfs://localhost:9000");
    configuration.set("fs.hdfs.impl", org.apache.hadoop.hdfs.DistributedFileSystem.class.getName());
    System.setProperty("HADOOP_USER_NAME", "xxxx");
    System.setProperty("hadoop.home.dir", "D:\\rajesh\\softwares\\winutils");

    FileSystem fs;
    try {
        fs = FileSystem.get(URI.create("hdfs://localhost:9000"), configuration);
        FileStatus[] files = fs.listStatus(new Path("hdfsfilelocation"));
        for (int i = 0; i < files.length; i++) {
            //resource.add(context.getResource(files[i].getPath().toString()));
            fs.copyToLocalFile(files[i].getPath(), new Path(batchConfigurationProperties.getCsvFilePath()));
            deleteTempFile(batchConfigurationProperties.getCsvFilePath(), ".crc");
        }
    } catch (IOException e) {
        e.printStackTrace();
    }

    File configJsonDirectory = new File(batchConfigurationProperties.getCsvFilePath());
    File[] csvFileList = configJsonDirectory.listFiles();
    for (File file : csvFileList) {
        if (file.isFile()) {
            resource.add(new FileSystemResource(file.getPath()));
        }
    }
    return resource.toArray(new Resource[resource.size()]);
}
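A minimal sketch of how such a Resource[] is typically plugged into a MultiResourceItemReader with a FlatFileItemReader delegate; the bean names and the pass-through line mapper are illustrative assumptions, not part of the original configuration:

import org.springframework.batch.item.file.FlatFileItemReader;
import org.springframework.batch.item.file.MultiResourceItemReader;
import org.springframework.batch.item.file.mapping.PassThroughLineMapper;
import org.springframework.context.annotation.Bean;

@Bean
public MultiResourceItemReader<String> multiResourceItemReader() {
    MultiResourceItemReader<String> reader = new MultiResourceItemReader<String>();
    reader.setResources(getMultipleResourceItemreader()); // the Resource[] built above
    reader.setDelegate(flatFileItemReader());             // reads the files one after another
    return reader;
}

@Bean
public FlatFileItemReader<String> flatFileItemReader() {
    FlatFileItemReader<String> delegate = new FlatFileItemReader<String>();
    delegate.setLineMapper(new PassThroughLineMapper()); // hand each CSV line through as a String
    return delegate;
}

The MultiResourceItemReader then opens the delegate once per resource, which gives the sequential, file-by-file processing described above.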
In my spring-batch-integration app, file polling invokes the batch job for each file, and this application may be running on multiple servers (nodes), but they are all supposed to read a common directory. So I wrote a custom locker which takes a lock on the file so that no other instance can process the same file. The code is below:
public class MyFileLocker extends AbstractFileLockerFilter {

    private static final ConcurrentMap<File, FileLock> lockCache = new ConcurrentHashMap<File, FileLock>();
    private static final ConcurrentMap<File, FileChannel> channelCache = new ConcurrentHashMap<File, FileChannel>();

    /** Exposes the open channels so that processing code can read the locked file (used further below). */
    public static ConcurrentMap<File, FileChannel> getChannelCache() {
        return channelCache;
    }

    @Override
    public boolean lock(File fileToLock) {
        FileChannel channel;
        FileLock lock;
        try {
            channel = new RandomAccessFile(fileToLock, "rw").getChannel();
            lock = channel.tryLock();
            if (lock == null || !lock.isValid()) {
                System.out.println("Problem in acquiring lock!! " + fileToLock);
                return false;
            }
            lockCache.put(fileToLock, lock);
            channelCache.put(fileToLock, channel);
        } catch (IOException e) {
            e.printStackTrace();
            return false; // lock was not acquired
        }
        return true;
    }

    @Override
    public boolean isLockable(File file) {
        return file.canWrite();
    }

    @Override
    public void unlock(File fileToUnlock) {
        FileLock lock = lockCache.get(fileToUnlock);
        try {
            if (lock != null) {
                lock.release();
                channelCache.get(fileToUnlock).close();
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Now, when I invoke my Spring Batch job and try to read that file using a FlatFileItemReader, it gives me
org.springframework.batch.item.file.NonTransientFlatFileException
which I believe is thrown because the file is locked. I did some googling and found that the NIO locker locks the file in such a way that even the current thread can't read it. I found a link showing how to read the locked file, but it uses a buffer.
How can I make my file accessible to my FlatFileItemReader?
Please suggest.
Yes, you really can get access to the locked file's content only via a ByteBuffer:
FileChannel fileChannel = channelCache.get(lockedFile);
ByteBuffer byteBuffer = ByteBuffer.allocate((int) fileChannel.size());
fileChannel.read(byteBuffer);
System.out.println("Read File " + lockedFile.getName() + " with content: " + new String(byteBuffer.array()));
Oh! Yeah. You really pointed to my repo :-).
So, with the locker you don't have a choice: either copy the file's bytes that way before the FlatFileItemReader runs, or inject a custom BufferedReaderFactory into the same FlatFileItemReader which converts the locked file into an appropriate BufferedReader:
new BufferedReader(new CharArrayReader(byteBuffer.asCharBuffer().array()));
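If you go the BufferedReaderFactory route, a minimal sketch could look like the one below. It assumes the static MyFileLocker.getChannelCache() accessor used later in this thread, and it decodes the bytes with the requested encoding rather than via asCharBuffer(), since the char view of a heap ByteBuffer has no backing array:

import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.StringReader;
import java.io.UnsupportedEncodingException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

import org.springframework.batch.item.file.BufferedReaderFactory;
import org.springframework.batch.item.file.DefaultBufferedReaderFactory;
import org.springframework.core.io.Resource;

public class LockedFileBufferedReaderFactory implements BufferedReaderFactory {

    private final BufferedReaderFactory fallback = new DefaultBufferedReaderFactory();

    @Override
    public BufferedReader create(Resource resource, String encoding)
            throws UnsupportedEncodingException, IOException {
        File file = resource.getFile();
        FileChannel channel = MyFileLocker.getChannelCache().get(file);
        if (channel == null) {
            // Not locked by us: let the default factory handle it.
            return fallback.create(resource, encoding);
        }
        // Read the whole locked file through the channel we already own.
        ByteBuffer buffer = ByteBuffer.allocate((int) channel.size());
        channel.read(buffer, 0); // positional read, does not move the channel's position
        return new BufferedReader(new StringReader(new String(buffer.array(), encoding)));
    }
}

It would then be injected with flatFileItemReader.setBufferedReaderFactory(new LockedFileBufferedReaderFactory()).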
Based on the link that you shared, it looks like you could try the following.
//This is from your link (except the size variable)
FileChannel fileChannel = channelCache.get(lockedFile);
int size = (int) fileChannel.size();
ByteBuffer byteBuffer = ByteBuffer.allocate(size);
fileChannel.read(byteBuffer);
//Additional code that you could try
byteBuffer.flip(); // reset the buffer position to 0 before copying the data back out
byte[] bArray = new byte[size];
//Copy the buffered data into the byte array
byteBuffer.get(bArray);
FlatFileItemReader flatFileItemReader = new FlatFileItemReader();
flatFileItemReader.setResource(new ByteArrayResource(bArray));
//Next you can try reading from your flatFileItemReader as usual
...
Let me know if this doesn't help you make progress.
Solution: I create a temporary file with the content of the locked file and process that instead. Once processing is done I archive the file and remove both the locked file and the temporary file. The key here is to create a new file with the locked file's content. The code is as follows:
File tmpFile = new File(inputFile.getAbsolutePath() + ".lck");
// Reuse the channel we already hold the lock on to read the file's content
FileChannel fileChannel = MyFileLocker.getChannelCache().get(new File(inputFile.getAbsolutePath()));
InputStream inputStream = Channels.newInputStream(fileChannel);
// Guava: copy the locked file's bytes into the temporary file
ByteStreams.copy(inputStream, Files.newOutputStreamSupplier(tmpFile));
Here inputFile is my locked file and tmpFile is the new file with the locked file's content. I have also created a method in my locker class to unlock and delete:
public void unlockAndDelete(File fileToUnlockandDelete) {
    FileLock lock = lockCache.get(fileToUnlockandDelete);
    String fileName = fileToUnlockandDelete.getName();
    try {
        if (lock != null) {
            lock.release();
            channelCache.get(fileToUnlockandDelete).close();
            // remove from cache
            lockCache.remove(fileToUnlockandDelete);
            channelCache.remove(fileToUnlockandDelete);
            boolean isFileDeleted = fileToUnlockandDelete.delete();
            if (isFileDeleted) {
                System.out.println("File deleted successfully: " + fileName);
            } else {
                System.out.println("File is not deleted: " + fileName);
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}
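Put together, the flow around the temporary copy looks roughly like the sketch below; processAndArchive() is a hypothetical placeholder for whatever your job actually does with the copy:

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.file.StandardCopyOption;

public void processLockedFile(File inputFile) throws IOException {
    MyFileLocker locker = new MyFileLocker();
    if (!locker.lock(inputFile)) {
        return; // another node owns this file
    }
    File tmpFile = new File(inputFile.getAbsolutePath() + ".lck");
    FileChannel channel = MyFileLocker.getChannelCache().get(inputFile);
    InputStream in = Channels.newInputStream(channel);
    // Deliberately not closing 'in' here: closing the stream would close the cached
    // channel and drop the lock; unlockAndDelete() closes it when we are done.
    java.nio.file.Files.copy(in, tmpFile.toPath(), StandardCopyOption.REPLACE_EXISTING);

    processAndArchive(tmpFile);        // hypothetical processing/archiving step
    locker.unlockAndDelete(inputFile); // release the lock and delete the original
    tmpFile.delete();                  // remove the temporary copy as well
}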
Our Java application, a long-running program, generates huge amounts of data, but we are unable to store the data efficiently.
public class HDFSWriter {

    FSDataOutputStream out = null;
    FileSystem fs = null;
    Configuration conf = null;
    static int linescounter = 0;

    void CreateHDFSFile() throws IOException {
        Path filePath = new Path("filename.CSV");
        conf = new Configuration();
        fs = FileSystem.get(conf);
        out = fs.create(filePath);
    }

    void writeHDFSFile(String csvLine) throws IOException {
        out.writeBytes(csvLine);
        linescounter++;
        if (linescounter >= 500) {
            linescounter = 0;
            //out.hsync();
            //out.hflush();
        }
    }

    void close() throws IOException {
        fs.close();
    }
}
The CreateHDFSFile method is called at the start of the program.
The writeHDFSFile method is called for each line to be inserted into the HDFS file.
The close method is called at the end of the program.
Even though I invoke hsync or hflush, the data does not appear in HDFS. It appears only after the whole program has completed, i.e., after fs.close().
How can I make the data available while the HDFS file is being written, either at a time interval or after a particular number of records?
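For reference, here is a minimal sketch of how the stream is usually flushed so that readers can see the data before close(), assuming Hadoop 2.x where FSDataOutputStream exposes hflush() and hsync(); the class name and the 500-line threshold are just illustrative, and the file length reported by the NameNode may still lag until the block is completed:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PeriodicHdfsWriter {

    private final FSDataOutputStream out;
    private int linesSinceFlush = 0;

    public PeriodicHdfsWriter(Configuration conf, String file) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        this.out = fs.create(new Path(file));
    }

    public void writeLine(String csvLine) throws IOException {
        out.writeBytes(csvLine);
        if (++linesSinceFlush >= 500) {
            out.hflush();        // make the buffered data visible to new readers now
            // out.hsync();      // stronger: also force it to the datanodes' disks
            linesSinceFlush = 0;
        }
    }

    public void close() throws IOException {
        out.close();             // close the stream itself, not just the FileSystem
    }
}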
I need to copy files between AWS S3 and our local HDFS. I tried to use the DistCp Java API, but the problem is that at the end of the copy it calls System.exit(), which stops my application too. So if I have multiple folders/files to copy and I use multiple threads, each performing a distcp command, the first thread to finish its distcp stops the whole application, and therefore the remaining copies as well. Is there any other way to avoid this? I know I could write my own MR job to do the copying, but I want to know whether there are other options.
My code:
List<Future<Void>> calls = new ArrayList<Future<Void>>();
for (String dir : s3Dirs) {
    final String[] args = new String[4];
    args[0] = "-log";
    args[1] = LOG_DIR;
    args[2] = S3_DIR;
    args[3] = LOCAL_HDFS_DIR;
    calls.add(_exec.submit(new Callable<Void>() {
        @Override
        public Void call() throws Exception {
            try {
                DistCp.main(args); // <-- DistCp invocation
            } catch (Exception e) {
                System.out.println("Failed to copy files from " + args[2] + " to " + args[3]);
            }
            return null;
        }
    }));
}
for (Future<Void> f : calls) {
    try {
        f.get();
    } catch (Exception e) {
        LOGGER.error("Error while distcp", e);
    }
}
DistCp's main():
public static void main(String argv[]) {
    int exitCode;
    try {
        DistCp distCp = new DistCp();
        Cleanup CLEANUP = new Cleanup(distCp);
        ShutdownHookManager.get().addShutdownHook(CLEANUP, SHUTDOWN_HOOK_PRIORITY);
        exitCode = ToolRunner.run(getDefaultConf(), distCp, argv);
    } catch (Exception e) {
        LOG.error("Couldn't complete DistCp operation: ", e);
        exitCode = DistCpConstants.UNKNOWN_ERROR;
    }
    System.exit(exitCode); // <-- exits here, terminating the whole JVM
}
I have used distcp before and never faced the System.exit() problem, even with multiple threads. Instead of invoking DistCp like that, try using ToolRunner to make the distcp call (as is done in the DistCp test cases in the hadoop tools package). The DistCp test cases use ToolRunner to run distcp, and it allows you to run it from multiple threads. I am copying the code snippet from the above link here:
public void testCopyFromLocalToLocal() throws Exception {
Configuration conf = new Configuration();
FileSystem localfs = FileSystem.get(LOCAL_FS, conf);
MyFile[] files = createFiles(LOCAL_FS, TEST_ROOT_DIR+"/srcdat");
ToolRunner.run(new DistCp(new Configuration()),
new String[] {"file:///"+TEST_ROOT_DIR+"/srcdat",
"file:///"+TEST_ROOT_DIR+"/destdat"});
assertTrue("Source and destination directories do not match.",
checkFiles(localfs, TEST_ROOT_DIR+"/destdat", files));
deldir(localfs, TEST_ROOT_DIR+"/destdat");
deldir(localfs, TEST_ROOT_DIR+"/srcdat");
}
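Adapted to the code in the question, each Callable could go through ToolRunner instead of DistCp.main(), so System.exit() is never reached. Depending on your Hadoop version the DistCp constructor takes either a Configuration (as in the test snippet above) or a Configuration plus DistCpOptions, so treat this as a sketch:

import java.util.concurrent.Callable;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.tools.DistCp;
import org.apache.hadoop.util.ToolRunner;

// Inside the loop over s3Dirs, instead of DistCp.main(args):
calls.add(_exec.submit(new Callable<Void>() {
    @Override
    public Void call() throws Exception {
        // ToolRunner.run returns an exit code instead of calling System.exit()
        int rc = ToolRunner.run(new DistCp(new Configuration()), args);
        if (rc != 0) {
            System.out.println("Failed to copy files from " + args[2] + " to " + args[3]);
        }
        return null;
    }
}));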
Is there a way to get the last modified times of all directories and files in HDFS? I want to create a page that displays the information, but I have no clue how to go about getting all of the last modification times into one .txt file.
See if this helps:
public class HdfsDemo {

    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.addResource(new Path("/Users/miqbal1/hadoop-eco/hadoop-1.1.2/conf/core-site.xml"));
        conf.addResource(new Path("/Users/miqbal1/hadoop-eco/hadoop-1.1.2/conf/hdfs-site.xml"));
        FileSystem fs = FileSystem.get(conf);
        System.out.println("Enter the directory name : ");
        BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
        Path path = new Path(br.readLine());
        displayDirectoryContents(fs, path);
        fs.close();
    }

    private static void displayDirectoryContents(FileSystem fs, Path rootDir) {
        try {
            FileStatus[] status = fs.listStatus(rootDir);
            for (FileStatus file : status) {
                if (file.isDir()) {
                    System.out.println("DIRECTORY : " + file.getPath() + " - Last modification time : " + file.getModificationTime());
                    displayDirectoryContents(fs, file.getPath());
                } else {
                    System.out.println("FILE : " + file.getPath() + " - Last modification time : " + file.getModificationTime());
                }
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
One thing to note, though: getModificationTime() returns the modification time of the file in milliseconds since January 1, 1970 UTC.
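If the page needs a human-readable timestamp, the millisecond value from the FileStatus in the loop above can be formatted in the usual way:

import java.text.SimpleDateFormat;
import java.util.Date;

long modTime = file.getModificationTime(); // milliseconds since the epoch, UTC
String readable = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date(modTime));
System.out.println(file.getPath() + " last modified at " + readable);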
You probably have to iterate through the files and directories to get the status of each path. You can use the code below (just a sample), but I'm not sure how efficient it would be if you have a large set of files and directories.
Configuration conf = new Configuration();
conf.set("fs.default.name", "hdfs://<namenode_ip_address>:<port>");
conf.set("mapred.job.tracker", "<jobtracker_ip_address>:<port>");
conf.setBoolean("fs.hdfs.impl.disable.cache", true);
FileSystem fs = FileSystem.get(conf);
fs.getFileStatus(new Path("/your/path")).getModificationTime();
You can also use the shell: hadoop fs -stat "%y %n" <path> prints the modification time and the file name.
https://hadoop.apache.org/docs/r2.4.1/hadoop-project-dist/hadoop-common/FileSystemShell.html#stat
I'm trying to read an HDFS file line by line, and then create an HDFS file and write to it line by line. The code that I use looks like this:
Path FileToRead = new Path(inputPath);
FileSystem hdfs = FileToRead.getFileSystem(new Configuration());
FSDataInputStream fis = hdfs.open(FileToRead);
BufferedReader reader = new BufferedReader(new InputStreamReader(fis));

String line;
line = reader.readLine();
while (line != null) {
    String[] lineElem = line.split(",");
    for (int i = 0; i < 10; i++) {
        MyMatrix[i][Integer.valueOf(lineElem[0]) - 1] = Double.valueOf(lineElem[i + 1]);
    }
    line = reader.readLine();
}

reader.close();
fis.close();

Path FileToWrite = new Path(outputPath + "/V");
FileSystem fs = FileSystem.get(new Configuration());
FSDataOutputStream fileOut = fs.create(FileToWrite);
BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(fileOut));
writer.write("check");
writer.close();
fileOut.close();
When I run this code, the file V is never created in my outputPath. But if I replace the reading part with the writing part, the file is created and "check" is written into it.
Can anyone please help me understand how to use these correctly, so that I can first read the whole file and then write to the other file line by line?
I have also tried other code for reading from one file and writing to another, and the file is created but nothing is written into it. I run it with something like this:
hadoop jar main.jar program2.Main input output
Then in my first job I read from args[0] and write to a file under args[1]+"/NewV" using MapReduce classes, and that works.
In my other (non-MapReduce) class I use args[1]+"/NewV" as the input path and output+"/V_0" as the output path (I pass these strings to the constructor). Here is the code for the class:
public class Init_V {

    String inputPath, outputPath;

    public Init_V(String inputPath, String outputPath) throws Exception {
        this.inputPath = inputPath;
        this.outputPath = outputPath;
        try {
            FileSystem fs = FileSystem.get(new Configuration());
            Path FileToWrite = new Path(outputPath + "/V.txt");
            Path FileToRead = new Path(inputPath);
            BufferedWriter output = new BufferedWriter(new OutputStreamWriter(fs.create(FileToWrite, true)));
            BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(FileToRead)));

            String data;
            data = reader.readLine();
            while (data != null) {
                output.write(data);
                data = reader.readLine();
            }
            reader.close();
            output.close();
        } catch (Exception e) {
            e.printStackTrace(); // don't swallow the exception, or failures stay invisible
        }
    }
}
I think you need to understand how Hadoop works. In Hadoop, many things are done by the framework: you just give the input and output paths, and they are opened and created by Hadoop if the paths are valid. Check the following example:
public int run(String[] args) throws Exception {
    if (args.length != 2) {
        System.err.println("Usage: MapReduce <input path> <output path>");
        ToolRunner.printGenericCommandUsage(System.err);
        return -1;
    }

    Job job = new Job();
    job.setJarByClass(MyClass.class);
    job.setNumReduceTasks(5);
    job.setJobName("myclass");
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(MyMapper.class);
    job.setReducerClass(MyReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    return job.waitForCompletion(true) ? 0 : 1;
}

/* ----------------------main---------------------*/
public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MyClass(), args);
    System.exit(exitCode);
}
As you can see, you only initialize the necessary variables, and the reading and writing are done by Hadoop.
Also, in your Mapper class you call context.write(key, value) inside map(), and you do the same in your Reducer class; it writes the output for you.
If you use a BufferedWriter/Reader with the default configuration it will write to your local file system, not to HDFS. To see files in HDFS you should run hadoop fs -ls <path>; the files you see with the plain ls command are on your local file system.
EDIT: In order to do your own reading/writing you should know the following: say you have N machines in your Hadoop cluster. When you read, you will not know which mapper is doing the reading, and the same holds for writing. So all mappers and reducers must be able to reach those paths, or you will get exceptions.
I don't know whether you can use any other class, but for your specific purpose you can use two methods: setup and cleanup. These methods are invoked only once per map or reduce task, so you can open and close your own files there. Reading and writing is the same as in normal Java code. For example, if you want to record something for each key and write it to a text file, you can do the following:
// In the Reducer
BufferedWriter bw;

@Override
protected void setup(Context context) {
    bw = new BufferedWriter(...); // open your side file once per reduce task
}

@Override
protected void reduce(Text key, Iterable<Text> values, Context context) {
    while (iter.hasNext()) {
        // ...
    }
    bw.write(key + ...); // record whatever you need for this key
}

@Override
protected void cleanup(Context context) {
    bw.close(); // close the side file once per reduce task
}
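A slightly more complete sketch of that idea, writing a side file on HDFS once per reduce task; the side-file path and the per-key count here are illustrative assumptions, not part of the original code:

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SideFileReducer extends Reducer<Text, Text, Text, Text> {

    private BufferedWriter bw;

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        // One side file per reduce task, named after the task attempt id to avoid clashes
        Path sidePath = new Path("/tmp/side-output/" + context.getTaskAttemptID() + ".txt");
        bw = new BufferedWriter(new OutputStreamWriter(fs.create(sidePath, true)));
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long count = 0;
        for (Text value : values) {
            count++;                   // whatever per-key work you need
            context.write(key, value); // normal MapReduce output still goes here
        }
        bw.write(key + "\t" + count);  // side observation for this key
        bw.newLine();
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        bw.close(); // close the side file once per task
    }
}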