How to design my mapper? - hadoop

I have to write a MapReduce job but I don't know how to go about it.
I have a jar, MARD.jar, through which I can instantiate MARD objects.
On such an object I call the normalise-file method, i.e. mard.normaliseFile(bunch of arguments).
This in turn creates a certain output file.
For the normaliseFile method to run it needs a folder called myMard in the working directory.
So I thought I would give the myMard folder as the input path to the Hadoop job, but I'm not sure that would help, because mard.normaliseFile(bunch of arguments) will look for the myMard folder in the working directory and will not find it, since (this is what I think) the Mapper can only access the contents of files through the "values" obtained from the FileSplit; it cannot give direct access to the files in the myMard folder.
In short, I have to execute the following code through MapReduce:
File setupFolder = new File(setupFolderName);
setupFolder.mkdirs();
MARD mard = new MARD(setupFolder);
Text valuz = new Text();
IntWritable intval = new IntWritable();
File original = new File("Vca1652.txt");
File mardedxml = new File("Vca1652-mardedxml.txt");
File marded = new File("Vca1652-marded.txt");
mardedxml.createNewFile();
marded.createNewFile();
NormalisationStats stats;
try {
    stats = mard.normaliseFile(original, mardedxml, marded, 50.0);
    // This method requires access to the myMard folder
    System.out.println(stats);
} catch (MARDException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
}
Please help
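One way this is sometimes approached is to ship the myMard contents to every node with Hadoop's distributed cache (e.g. the -files or -archives generic options) and recreate the folder inside the mapper's setup() method. The following is only a hedged sketch: MARD comes from MARD.jar (its import is omitted because the package is unknown), and the mapper name and key/value types are placeholders, not the asker's actual job.
import java.io.File;
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch: MARD comes from MARD.jar; key/value types are placeholders.
public class MardMapper extends Mapper<LongWritable, Text, Text, Text> {

    private MARD mard;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Recreate the myMard folder in the task's local working directory.
        // Files/archives shipped with -files / -archives are symlinked here,
        // so the folder's contents can be made available on every node.
        File setupFolder = new File("myMard");
        setupFolder.mkdirs();
        mard = new MARD(setupFolder);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The record text arrives as 'value'; it would have to be written to a
        // local java.io.File before being handed to mard.normaliseFile(...).
        context.write(new Text(key.toString()), value);
    }
}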

Related

How to list all children in Google Drive's appfolder and read file contents with Xamarin / c#?

I'm trying to work with text files in the apps folder.
Here's my GoogleApiClient constructor:
googleApiClient = new GoogleApiClient.Builder(this)
    .AddApi(DriveClass.API)
    .AddScope(DriveClass.ScopeFile)
    .AddScope(DriveClass.ScopeAppfolder)
    .UseDefaultAccount()
    .AddConnectionCallbacks(this)
    .EnableAutoManage(this, this)
    .Build();
I'm connecting with:
googleApiClient.Connect()
And after:
OnConnected()
I need to list all files inside the app folder. Here's what I got so far:
IDriveFolder appFolder = DriveClass.DriveApi.GetAppFolder(googleApiClient);
IDriveApiMetadataBufferResult result = await appFolder.ListChildrenAsync(googleApiClient);
Which is giving me the files' metadata.
But after that, I don't know how to read them, edit them or save new files. They are text files created with my app's previous version (native).
I'm following the google docs for drive but the Xamarin API is a lot different and has no docs or examples. Here's the API I'm using: https://components.xamarin.com/view/googleplayservices-drive
Edit:
Here is an example to read file contents from the guide:
DriveFile file = ...
file.open(mGoogleApiClient, DriveFile.MODE_READ_ONLY, null)
.setResultCallback(contentsOpenedCallback);
First I can't find anywhere in the guide what "DriveFile file = ..." means. How do I get this instance? DriveFile seems to be a static class in this API.
I tried:
IDriveFile file = DriveClass.DriveApi.GetFile(googleApiClient, metadata.DriveId);
This has two problems: first, it complains that GetFile is deprecated but doesn't say how to do it properly; second, the file doesn't have an "open" method.
Any help is appreciated.
The Xamarin binding library wraps the Java Drive library (https://developers.google.com/drive/), so all the guides/examples for the Android-based Drive API work if you keep in mind the Binding's Java to C# transformations:
get/set methods -> properties
fields -> properties
listeners -> events
static nested class -> nested class
inner class -> nested class with an instance constructor
So you can list the AppFolder's directories and files by recursing on the Metadata whenever a drive item is a folder.
Get Directory/File Tree Example:
await Task.Run(() =>
{
    async void GetFolderMetaData(IDriveFolder folder, int depth)
    {
        var folderMetaData = await folder.ListChildrenAsync(_googleApiClient);
        foreach (var driveItem in folderMetaData.MetadataBuffer)
        {
            Log.Debug(TAG, $"{(driveItem.IsFolder ? "(D)" : "(F)")}:{"".PadLeft(depth, '.')}{driveItem.Title}");
            if (driveItem.IsFolder)
                GetFolderMetaData(driveItem.DriveId.AsDriveFolder(), depth + 1);
        }
    }
    GetFolderMetaData(DriveClass.DriveApi.GetAppFolder(_googleApiClient), 0);
});
Output:
[SushiHangover.FlightAvionics] (D):AppDataFolder
[SushiHangover.FlightAvionics] (F):.FlightInstrumentationData1.json
[SushiHangover.FlightAvionics] (F):.FlightInstrumentationData2.json
[SushiHangover.FlightAvionics] (F):.FlightInstrumentationData3.json
[SushiHangover.FlightAvionics] (F):AppConfiguration.json
Write a (Text) File Example:
using (var contentResults = await DriveClass.DriveApi.NewDriveContentsAsync(_googleApiClient))
using (var writer = new OutputStreamWriter(contentResults.DriveContents.OutputStream))
using (var changeSet = new MetadataChangeSet.Builder()
    .SetTitle("AppConfiguration.txt")
    .SetMimeType("text/plain")
    .Build())
{
    writer.Write("StackOverflow Rocks\n");
    writer.Write("StackOverflow Rocks\n");
    writer.Close();
    await DriveClass.DriveApi.GetAppFolder(_googleApiClient).CreateFileAsync(_googleApiClient, changeSet, contentResults.DriveContents);
}
Note: Substitute an IDriveFolder for DriveClass.DriveApi.GetAppFolder to save a file in a subfolder of the AppFolder.
Read a (text) File Example:
Note: driveItem in the following example is an existing text/plain-based MetaData object that is found by recursing through the Drive contents (see Get Directory/File list above) or via creating a query (Query.Builder) and executing it via DriveClass.DriveApi.QueryAsync.
var fileContexts = new StringBuilder();
using (var results = await driveItem.DriveId.AsDriveFile().OpenAsync(_googleApiClient, DriveFile.ModeReadOnly, null))
using (var inputStream = results.DriveContents.InputStream)
using (var streamReader = new StreamReader(inputStream))
{
    while (streamReader.Peek() >= 0)
        fileContexts.Append(await streamReader.ReadLineAsync());
}
Log.Debug(TAG, fileContexts.ToString());

Access hdfs file from udf

I'd like to access a file from my UDF call. This is my script:
files = LOAD '$docs_in' USING PigStorage(';') AS (id, stopwords, id2, file);
buzz = FOREACH files GENERATE pigbuzz.Buzz(file, id) as file:bag{(year:chararray, word:chararray, count:long)};
The jar is registered. The path is relative to my HDFS, where the files really exist. The call is made, but it seems the file is not discovered, maybe because I'm trying to access the file on HDFS.
How can I access a file in hdfs, from my UDF java call?
Inside an EvalFunc you can get a file from the HDFS via:
FileSystem fs = FileSystem.get(UDFContext.getUDFContext().getJobConf());
FSDataInputStream in = fs.open(new Path(fileName));
BufferedReader br = new BufferedReader(new InputStreamReader(in));
....
You might also consider putting the files into the distributed cache; in that case you have to override getCacheFiles() in your EvalFunc class.
E.g.:
@Override
public List<String> getCacheFiles() {
    List<String> list = new ArrayList<String>(2);
    list.add("/cache/pig/wordlist1.txt#w1");
    list.add("/cache/pig/wordlist2.txt#w2");
    return list;
}
then you can just pass the symlinks of the files (w1 and w2) in order to get them from the local file system of each of the worker nodes:
BufferedReader br = new BufferedReader(new FileReader(fileName));
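Putting the distributed-cache pieces together, an EvalFunc might look like the following minimal sketch; the class name, wordlist path, and stop-word lookup are purely illustrative, not part of the original question.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class IsStopWord extends EvalFunc<Boolean> {

    private Set<String> words;

    @Override
    public List<String> getCacheFiles() {
        // Ship the HDFS file to every worker; "#w1" is the local symlink name.
        List<String> list = new ArrayList<String>(1);
        list.add("/cache/pig/wordlist1.txt#w1");
        return list;
    }

    @Override
    public Boolean exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0) {
            return null;
        }
        if (words == null) {
            // Lazily read the symlinked copy from the worker's local file system.
            words = new HashSet<String>();
            BufferedReader br = new BufferedReader(new FileReader("w1"));
            String line;
            while ((line = br.readLine()) != null) {
                words.add(line.trim());
            }
            br.close();
        }
        return words.contains((String) input.get(0));
    }
}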

NotifyFilter of FileSystemWatcher not working

I have a Windows service (and verified the code by creating a similar WinForms application) where the NotifyFilter doesn't work. As soon as I remove that line of code, the service works fine and I can see the event handler fire in the WinForms application.
All I'm doing is dropping a text file into the input directory for the FileSystemWatcher to kick off the watcher_FileChanged delegate. When I have the _watcher.NotifyFilter = NotifyFilters.CreationTime; in there, it doesn't work. When I pull it out, it works fine.
Can anyone tell me if I'm doing something wrong with this filter?
Here is the FSW code for the OnStart event.
protected override void OnStart(string[] args)
{
    _watcher = new FileSystemWatcher(@"C:\Projects\Data\Test1");
    _watcher.Created += new FileSystemEventHandler(watcher_FileChanged);
    _watcher.NotifyFilter = NotifyFilters.CreationTime;
    _watcher.IncludeSubdirectories = false;
    _watcher.EnableRaisingEvents = true;
    _watcher.Error += new ErrorEventHandler(OnError);
}
private void watcher_FileChanged(object sender, FileSystemEventArgs e)
{
    // Folder with new files - one or more files
    string folder = @"C:\Projects\Data\Test1";
    System.Console.WriteLine(@"C:\Projects\Data\Test1");
    //Console.ReadKey(true);
    // Folder to delete old files - one or more files
    string output = @"C:\Temp\Test1\";
    System.Console.WriteLine(@"C:\Temp\Test1\");
    //Console.ReadKey(true);
    // Create name to call new zip file by date
    string outputFilename = Path.Combine(output, string.Format("Archive{0}.zip", DateTime.Now.ToString("MMddyyyy")));
    System.Console.WriteLine(outputFilename);
    //Console.ReadKey(true);
    // Save new files into a zip file
    using (ZipFile zip = new ZipFile())
    {
        // Add all files in directory
        foreach (var file in Directory.GetFiles(folder))
        {
            zip.AddFile(file);
        }
        // Save to output filename
        zip.Save(outputFilename);
    }
    DirectoryInfo source = new DirectoryInfo(output);
    // Get info of each file in the output directory to see whether or not to delete it
    foreach (FileInfo fi in source.GetFiles())
    {
        if (fi.CreationTime < DateTime.Now.AddDays(-1))
            fi.Delete();
    }
}
I've been having trouble with this behavior too. If you step through the code (and if you look at the MSDN documentation), you'll find that NotifyFilter starts off with a default value of:
NotifyFilters.FileName | NotifyFilters.DirectoryName | NotifyFilters.LastWrite
So when you say .NotifyFilter = NotifyFilters.CreationTime, you're wiping out those other values, which explains the difference in behavior. I'm not sure why NotifyFilters.CreationTime is not catching the new file... seems like it should, shouldn't it!
You can probably just use the default value for NotifyFilter if it's working for you. If you want to add NotifyFilters.CreationTime, I'd recommend doing something like this to add the new value and not replace the existing ones:
_watcher.NotifyFilter = _watcher.NotifyFilter | NotifyFilters.CreationTime;
I know this is an old post, but file creation time is not always reliable. I came across a problem where a log file was being moved to an archive folder and a new file of the same name was created in its place; however, the file creation date did not change. In fact, the metadata was retained from the previous file (the one that was moved to the archive).
Windows caches certain attributes of a file, including the file creation date. You can read the article here: https://support.microsoft.com/en-us/kb/172190.

Unable to load OpenNLP sentence model in Hadoop map-reduce job

I'm trying to get OpenNLP integrated into a map-reduce job on Hadoop, starting with some basic sentence splitting. Within the map function, the following code is run:
public AnalysisFile analyze(String content) {
    InputStream modelIn = null;
    String[] sentences = null;
    // references an absolute path to en-sent.bin
    logger.info("sentenceModelPath: " + sentenceModelPath);
    try {
        modelIn = getClass().getResourceAsStream(sentenceModelPath);
        SentenceModel model = new SentenceModel(modelIn);
        SentenceDetectorME sentenceBreaker = new SentenceDetectorME(model);
        sentences = sentenceBreaker.sentDetect(content);
    } catch (FileNotFoundException e) {
        logger.error("Unable to locate sentence model.");
        e.printStackTrace();
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        if (modelIn != null) {
            try {
                modelIn.close();
            } catch (IOException e) {
            }
        }
    }
    logger.info("number of sentences: " + sentences.length);
    <snip>
}
When I run my job, I'm getting an error in the log saying "in must not be null!" (source of class throwing error), which means that somehow I can't open an InputStream to the model. Other tidbits:
I've verified that the model file exists in the location sentenceModelPath refers to.
I've added Maven dependencies for opennlp-maxent:3.0.2-incubating, opennlp-tools:1.5.2-incubating, and opennlp-uima:1.5.2-incubating.
Hadoop is just running on my local machine.
Most of this is boilerplate from the OpenNLP documentation. Is there something I'm missing, either on the Hadoop side or the OpenNLP side, that would cause me to be unable to read from the model?
Your problem is the getClass().getResourceAsStream(sentenceModelPath) line. This will try to load a file from the classpath; neither the file in HDFS nor the one on the client's local file system is part of the classpath at mapper / reducer runtime, which is why you're seeing the null error (getResourceAsStream() returns null if the resource cannot be found).
To get around this you have a number of options:
Amend your code to load the file from HDFS:
modelIn = FileSystem.get(context.getConfiguration()).open(
        new Path("/sandbox/corpus-analysis/nlp/en-sent.bin"));
Amend your code to load the file from the local dir, and use the -files GenericOptionsParser option (which copies the file from the local file system to HDFS, and back down to the local directory of the running mapper / reducer); see the fuller sketch after this list:
modelIn = new FileInputStream("en-sent.bin");
Hard-bake the file into the job jar (in the root dir of the jar), and amend your code to include a leading slash:
modelIn = getClass().getResourceAsStream("/en-sent.bin");
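With the second option, for example, the model can be opened once per task in the mapper's setup() method. This is only a hedged sketch: the mapper class, key/value types, and the launch command (something like hadoop jar job.jar Driver -files /local/path/en-sent.bin <in> <out>) are placeholders; the point is that -files makes en-sent.bin available by name in each task's working directory.
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class SentenceMapper extends Mapper<LongWritable, Text, Text, Text> {

    private SentenceDetectorME sentenceBreaker;

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // en-sent.bin was shipped with -files, so it is available by name
        // in the task's working directory on every node.
        InputStream modelIn = new FileInputStream("en-sent.bin");
        try {
            sentenceBreaker = new SentenceDetectorME(new SentenceModel(modelIn));
        } finally {
            modelIn.close();
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String sentence : sentenceBreaker.sentDetect(value.toString())) {
            context.write(new Text("sentence"), new Text(sentence));
        }
    }
}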

Copying files from HDFS to local file system with JAVA

I am trying to copy files from HDFS to the local file system for preprocessing. The below code should work according to the documentation. Although it doesn't give any error messages and the MapReduce job runs smoothly, I cannot see any output on my local hard drive. What do you think the problem is? Thanks.
try {
    Path phdfs_input = new Path("hdfs://master:54310/user/hduser/conninput/" + value.toString());
    Path plocal_input = new Path("/home/hduser/Desktop/" + value.toString());
    FileSystem fs = FileSystem.get(context.getConfiguration());
    fs.copyToLocalFile(phdfs_input, plocal_input);
    /* String localoutput_file = "/home/hduser/Destop/output/"+value.toString();
    String cmd1[] = {"mafia", "-mfi", ".5", "-ascii", "~/Desktop/"+value.toString(), localoutput_file };
    File mafia_dir = new File("/home/hduser/");
    ShellCommandExecutor s = new ShellCommandExecutor(cmd1, mafia_dir);*/
} catch (Exception e) {
    e.printStackTrace();
}
Try using /user/hduser/conninput/"+value.toString() in the Path constructor instead of providing the master:54310 part. It should figure out master:54310 from the Configuration.
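In other words, something along these lines (a minimal sketch of the suggested change, reusing the paths from the question):
// Let the FileSystem resolve the NameNode (master:54310) from the job Configuration
FileSystem fs = FileSystem.get(context.getConfiguration());
Path phdfs_input = new Path("/user/hduser/conninput/" + value.toString());
Path plocal_input = new Path("/home/hduser/Desktop/" + value.toString());
fs.copyToLocalFile(phdfs_input, plocal_input);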
