Does 'hdfs dfs -cp' use /tmp as part of its implementation - hadoop

Trying to investigate an issue where /tmp is filling up and we don't know what's causing it. We do have a recent change that uses the HDFS command line to copy a file to another host (hdfs dfs -cp /source/file hdfs://other.host:port/target/file), and while the copy operation doesn't directly touch or reference /tmp, it could potentially be using it as part of its implementation.
But I can't find anything in the documentation to confirm or refute that theory - does anyone else know the answer?

You could look at the code:
Here's the code for copying using HDFS.
It uses its own internal CommandWithDestination class, and writes everything through another internal class that is really just wrapping java.io classes to complete the actual write. So it's buffering bytes in memory and sending the bytes around; local /tmp is likely not the issue. You could check this by altering the temp directory used by Java (java.io.tmpdir):
export _JAVA_OPTIONS=-Djava.io.tmpdir=/new/tmp/dir
According to the java.io.File Javadoc:
The default temporary-file directory is specified by the system
property java.io.tmpdir. On UNIX systems the default value of this
property is typically "/tmp" or "/var/tmp"; on Microsoft Windows
systems it is typically "c:\temp". A different value may be given to
this system property when the Java virtual machine is invoked, but
programmatic changes to this property are not guaranteed to have any
effect upon the temporary directory used by this method.
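If you want to confirm which directory the JVM actually picks up (with or without _JAVA_OPTIONS set), a small standalone check like this works; the class name is just for illustration:

import java.io.File;
import java.io.IOException;

public class TmpDirCheck {
    public static void main(String[] args) throws IOException {
        // Where the JVM thinks temp files should go
        System.out.println("java.io.tmpdir = " + System.getProperty("java.io.tmpdir"));
        // Creating a temp file confirms the directory actually being used
        File f = File.createTempFile("tmpdir-check", ".tmp");
        System.out.println("temp file created at " + f.getAbsolutePath());
        f.deleteOnExit();
    }
}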
Method used by the HDFS copy:
protected void copyStreamToTarget(InputStream in, PathData target)
    throws IOException {
  if (target.exists && (target.stat.isDirectory() || !overwrite)) {
    throw new PathExistsException(target.toString());
  }
  TargetFileSystem targetFs = new TargetFileSystem(target.fs);
  try {
    PathData tempTarget = direct ? target : target.suffix("._COPYING_");
    targetFs.setWriteChecksum(writeChecksum);
    targetFs.writeStreamToFile(in, tempTarget, lazyPersist, direct); // here's where it writes the stream out to the target (HDFS) file
    if (!direct) {
      targetFs.rename(tempTarget, target);
    }
  } finally {
    targetFs.close(); // last ditch effort to ensure temp file is removed
  }
}
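If it helps to see this outside the shell code, roughly the same copy can be reproduced with the public FileSystem API. This is only a sketch (hostnames, ports and paths are placeholders), but it shows the bytes being streamed through a small in-memory buffer and a ._COPYING_ file on the destination filesystem, never through local /tmp:

import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCopySketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder cluster URIs; adjust host/port for your environment
        FileSystem srcFs = FileSystem.get(URI.create("hdfs://source.host:8020"), conf);
        FileSystem dstFs = FileSystem.get(URI.create("hdfs://other.host:8020"), conf);

        Path tmpTarget = new Path("/target/file._COPYING_");
        try (InputStream in = srcFs.open(new Path("/source/file"));
             OutputStream out = dstFs.create(tmpTarget, true)) {
            // Bytes go through an in-memory buffer; nothing is staged on the local disk
            IOUtils.copyBytes(in, out, 4096);
        }
        // Mimic the shell's behaviour: rename the temp file into place once the copy succeeds
        dstFs.rename(tmpTarget, new Path("/target/file"));
    }
}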

Related

Qt loading deleted/renamed file with Windows 10

I'm seeing some weird behavior when running a test install of my Qt program (tried with Qt 5.5.1 and 5.7). I haven't noticed the issue when running in a debug/development environment, but I do see it when the program is installed into "Program Files (x86)".
The issue is: I'm using QDirIterator to find database files within the "QStandardPaths::DataLocation" locations and loading them via SQLite. What I'm seeing is that files from a previous install (which have been deleted), and ones that have been renamed and then deleted, still show up and are readable in the program. The phantom files are located in Program Files (x86)//Library/.ndat. These "phantom" files have been blocking loading of the up-to-date file. It's really strange - I wonder if anyone has seen the issue?
I'm running Windows 10 Home on an SSD-based machine (if it matters). Same issue with Qt 5.5.1 and 5.7. I've replicated it on a different machine with similar configuration.
Any ideas?
Here's a summary of my code:
QStringList standardPaths = QStandardPaths::locateAll(QStandardPaths::DataLocation, "Library", QStandardPaths::LocateDirectory);
QStringList fileFilters;
fileFilters << "*.ndat";
foreach (const QString &dir, standardPaths) {
    QDirIterator iterator(dir, fileFilters);
    while (iterator.hasNext()) {
        const QString &filePath = iterator.next();
        QString databaseName = QFileInfo(filePath).baseName();
        database_->open(filePath, databaseName); // my function
    }
}
bool DataManager::open(const QString &filePath, const QString &connectionName) {
    QSqlDatabase db = QSqlDatabase::addDatabase("QSQLITE", connectionName);
    db.setDatabaseName(filePath);
    if (!db.open()) {
        ERROR(QString("Cannot open database %1 with error %2")
                  .arg(QFileInfo(filePath).baseName())
                  .arg(db.lastError().text()));
        printError();
        return false;
    }
    databaseNames_.append(connectionName);
    return true;
}
This code seems to read in files that don't exist anymore - and strangely, reads contents of old files that have been overwritten in the same spot. It only seems to happen when the files are located within the "Program Files" directory; not in a user directory or what-not.
For example, version 1 of my code had a database called "database.dat" with 10 entries. Version 2 of my install overwrote the file with a file of the same name with 20 entries. Version 2 of my code finds the database.dat file but only reads in the older version with 10 entries - Really weird!
Update
It appears that these "phantom" files are stored at:
C:\Users\USERNAME\AppData\Local\VirtualStore\Program Files (x86)\PROGRAM NAME\database.dat
My guess is that my program is not opening the file as read-only, so Windows creates a working copy in a user-writable location. Will investigate.
The issue is Windows caching, I think. One can't really tell with software that doesn't provide any way to debug it, such as Windows.
I've heard that this problem can also be solved (or at least reduced) by turning on the "Application Experience" service. I still run into it from time to time, typically when doing too many filesystem writes in too short a time.
I don't know exactly what the cause is, and I'm pretty sure nobody else does or it would have been fixed, but as far as I know there is no fix for it (as of this answer's date).
--
Here's my solution to problems like this that works 100% of the time:
To avoid this problem, append the version number to the end of your database's filename each time you compile; in fact, append it to all your files by using a
#define VERSION "4.22.21"
and then just adding .append(QString("%1").arg(VERSION)); or something.
All you have to really do then is write up some quick code to import all the necessary data from an old database or from wherever, which you should have more or less anyways just from using the database.
Better to avoid situations like that than to try and figure them out -- not to mention you now have a perfect revision system without even trying.
UPDATE
Since there's no use case, no code, and no information about the project, I would have to guess at what you were trying to do:
QStringList standardPaths = QStandardPaths::locateAll(QStandardPaths::DataLocation, "Library", QStandardPaths::LocateDirectory);
QStringList fileFilters;
fileFilters << "*.ndat";
foreach (const QString &dir, standardPaths) {
    QDirIterator iterator(dir, fileFilters);
    while (iterator.hasNext()) {
        const QString &filePath = iterator.next();
        QString databaseName = QFileInfo(filePath).baseName();
        database_->open(filePath, databaseName); // my function
        /* Do your database reading or writing and save the results
         * into a QHash or something, then do this: */
        database_->close(); // super important
    }
}
After a bunch of poking around, I found the source of the problem (and solution).
Windows (for backwards compatibility) has a VirtualStore feature: if a program tries to write to a file it does not have permission to write (e.g. Program Files/Progname/test.txt), Windows copies that file into USER/AppData/Local/VirtualStore/Program Files/.... This new file is not deleted when the program is uninstalled, but to the Qt program it looks as if it resides at its original location.
The solution is to open the Sqlite database in read only mode:
QSqlDatabase db = QSqlDatabase::addDatabase("QSQLITE", connectionName);
if (!writable_)
    db.setConnectOptions(QLatin1String("QSQLITE_OPEN_READONLY"));
db.setDatabaseName (filePath);
Now, I'm running into a problem determining whether the file is writable. This:
writable_ = fInfo.isWritable();
always returns true, even for files in Program Files. Even when enabling NTFS permissions checking:
extern Q_CORE_EXPORT int qt_ntfs_permission_lookup;
qt_ntfs_permission_lookup++; // turn permissions checking on
the permissions check doesn't work. So now I'm simply doing this:
QString appDir = gApp->applicationDirPath();
QString relFilepath = QDir(appDir).relativeFilePath(filePath);
if (!relFilepath.startsWith(".."))
    writable_ = false;
The database is read-only (OK for my application), and nothing gets created within VirtualStore any more.

I'm running a PHP script using WAMP, in this case how can I read files from My Documents?

I have a web application that reads files from its local directory (in wamp/www/). This file needs to be accessed by several users, so I synced and shared it using Dropbox. Now, is there a shortcut I can use in PHP commands such as fwrite so that the code isn't tied to only one computer?
For example, I can't hard-code fwrite("C:\Users\name\My Documents\") because that is pretty specific to one user, and long. I was wondering if there was a shorthand I could use, like %appdata% or %programfiles%?
Try using
$_SERVER['HOMEDRIVE'] and $_SERVER['HOMEPATH']
for the drive and the path to the user folder, respectively.
print_r($_SERVER)
will display all the environment variables. There you can see which one to select.
$fp = fopen("{$_ENV['USERPROFILE']}\My Documents\somefile.txt", 'wb');
See $_ENV on the manual and also getenv().
Please note this will only work in limited circumstances. You could instead expose the proper Windows API through a small PHP extension function, along these lines:
#include <Shlobj.h>

PHP_FUNCTION(win_get_desktop_folder)
{
    char szPath[MAX_PATH];
    if (zend_parse_parameters_none() == FAILURE)
        RETURN_NULL();
    if (SUCCEEDED(SHGetSpecialFolderPathA(NULL, szPath,
                                          CSIDL_MYDOCUMENTS, FALSE))) {
        RETURN_STRING(szPath, 1);
    } else {
        RETURN_FALSE;
    }
}

NodeJS fs.watch on directory only fires when changed by editor, but not shell or fs module

When the code below is run, the watch only triggers if I edit and save tmp.txt manually, using either my IDE, TextEditor.app, or vim.
It is not triggered by the write stream or by manual shell output redirection (typing echo "test" > /path/to/tmp.txt).
However, if I watch the file itself, and not its dirname, then it works.
var fs, Path, file, watchPath, w;
fs = require('fs');
Path = require('path');
file = __dirname + '/tmp.txt';
watchPath = Path.dirname(file); // changing this to just file makes it trigger
w = fs.watch(watchPath, function (e, f) {
    console.log("will not get here by itself");
    w.close();
});
fs.writeFileSync(file, "test", "utf-8");
fs.createWriteStream(file, {
    flags: 'w',
    mode: 0777
}).end('the_date="' + new Date + '";'); // another method fails as well
setTimeout(function () {
    fs.writeFileSync(file, "test", "utf-8");
}, 500); // as does this one
// child_process exec and spawn fail the same way with or without timeout
So the questions are: why? and how to trigger this event programmatically from a node script?
Thanks!
It doesn't trigger because a change to the contents of a file isn't a change to the directory.
Under the covers, at least as of 0.6, fs.watch on Mac uses kqueue, and it's a pretty thin wrapper around kqueue file system notifications. So, if you really want to understand the details, you have to understand kqueue, and inodes and things like that.
But if you want a short "lie-to-children" explanation: What a user thinks of as a "file" is really two separate things—the actual file, and the directory entry that points to the actual file. This is what allows you to have things like hard links, and files that can still be read and written even after you've deleted them, and so on.
In general, when you write to an existing file, this doesn't make any change to the directory entry, so anyone watching the directory won't see any change. That's why echo >tmp.txt doesn't trigger you.
However, if you, e.g., write a new temporary file and then move it over the old file, that does change the directory entry (making it a pointer to the new file instead of the old one), so you will be notified. That's why TextEditor.app does trigger you.
The thing is, you've asked to watch the directory and not the file.
The directory isn't updated when the file is modified, such as via shell redirection; in this case, the file is opened, modified, and closed. The directory isn't changed -- only the file is.
When you use a text editor to modify a file, the usual set of system calls behind the scenes looks something like this:
fd = open("foo.new")
write(fd, new foo contents)
unlink("foo")
rename("foo.new", "foo")
This way, the foo file is either entirely the old file or entirely the new file, and there's no way for there to be a "partial file" with the new contents. The renaming operations do modify the directory, thus triggering the directory watch.
Although the above answers seem reasonable, they are not fully accurate. It is actually a very useful feature to be able to listen to a directory for file changes, not just "renames". I think this feature works as expected on Windows at least, and as of Node 0.9.2 it also works on Mac, since fs.watch switched to the FSEvents API, which supports it:
Version 0.9.2 (Unstable)

How to overwrite/reuse the existing output path for Hadoop jobs again and again

I want to overwrite/reuse the existing output directory when I run my Hadoop job daily.
Actually the output directory will store summarized output of each day's job run results.
If I specify the same output directory it gives the error "output directory already exists".
How to bypass this validation?
What about deleting the directory before you run the job?
You can do this via shell:
hadoop fs -rmr /path/to/your/output/
or via the Java API:
// configuration should contain reference to your namenode
FileSystem fs = FileSystem.get(new Configuration());
// true stands for recursively deleting the folder you gave
fs.delete(new Path("/path/to/your/output"), true);
Jungblut's answer is your direct solution. Since I never trust automated processes to delete stuff (me personally), I'll suggest an alternative:
Instead of trying to overwrite, I suggest you make the output name of your job dynamic, including the time in which it ran.
Something like "/path/to/your/output-2011-10-09-23-04/". This way you can keep your old job output around in case you ever need to revisit it. In my system, which runs 10+ daily jobs, we structure the output as: /output/job1/2011/10/09/job1out/part-r-xxxxx, /output/job1/2011/10/10/job1out/part-r-xxxxx, etc.
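A minimal sketch of deriving such a dated output path in the driver; the base path, date pattern and helper name are just illustrations:

import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DatedOutput {
    // Call this from your driver instead of a fixed setOutputPath()
    static void setDatedOutputPath(Job job, String base) {
        String runId = new SimpleDateFormat("yyyy/MM/dd").format(new Date());
        FileOutputFormat.setOutputPath(job, new Path(base + "/" + runId + "/job1out"));
    }
}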
Hadoop's TextOutputFormat (which I guess you are using) does not allow overwriting an existing directory, probably to spare you the pain of finding out you mistakenly deleted something you (and your cluster) worked very hard on.
However, If you are certain you want your output folder to be overwritten by the job, I believe the cleanest way is to change TextOutputFormat a little like this:
public class OverwriteTextOutputFormat<K, V> extends TextOutputFormat<K, V>
{
    public RecordWriter<K, V>
    getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException
    {
        Configuration conf = job.getConfiguration();
        boolean isCompressed = getCompressOutput(job);
        String keyValueSeparator = conf.get("mapred.textoutputformat.separator", "\t");
        CompressionCodec codec = null;
        String extension = "";
        if (isCompressed)
        {
            Class<? extends CompressionCodec> codecClass =
                getOutputCompressorClass(job, GzipCodec.class);
            codec = (CompressionCodec) ReflectionUtils.newInstance(codecClass, conf);
            extension = codec.getDefaultExtension();
        }
        Path file = getDefaultWorkFile(job, extension);
        FileSystem fs = file.getFileSystem(conf);
        FSDataOutputStream fileOut = fs.create(file, true);
        if (!isCompressed)
        {
            return new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
        }
        else
        {
            return new LineRecordWriter<K, V>(new DataOutputStream(codec.createOutputStream(fileOut)), keyValueSeparator);
        }
    }
}
Now you are creating the FSDataOutputStream (fs.create(file, true)) with overwrite=true.
Hadoop already supports the effect you seem to be trying to achieve by allowing multiple input paths to a job. Instead of trying to have a single directory of files to which you add more files, have a directory of directories to which you add new directories. To use the aggregate result as input, simply specify the input glob as a wildcard over the subdirectories (e.g., my-aggregate-output/*). To "append" new data to the aggregate as output, simply specify a new unique subdirectory of the aggregate as the output directory, generally using a timestamp or some sequence number derived from your input data (e.g. my-aggregate-output/20140415154424).
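A sketch of what that driver setup might look like; the aggregate directory name and the timestamp format are assumptions:

import java.text.SimpleDateFormat;
import java.util.Date;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class AggregateJobSetup {
    // Read every previous run's output via a glob, write this run to a fresh subdirectory
    static void configurePaths(Job job) throws Exception {
        FileInputFormat.addInputPath(job, new Path("my-aggregate-output/*"));
        String stamp = new SimpleDateFormat("yyyyMMddHHmmss").format(new Date());
        FileOutputFormat.setOutputPath(job, new Path("my-aggregate-output/" + stamp));
    }
}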
If one is loading the input file (with, e.g., appended entries) from the local file system into the Hadoop distributed file system like this:
hdfs dfs -put /mylocalfile /user/cloudera/purchase
then one can also overwrite/reuse the existing output with -f. There is no need to delete or re-create the folder:
hdfs dfs -put -f /updated_mylocalfile /user/cloudera/purchase
Hadoop follows the philosophy of Write Once, Read Many times. Thus when you try to write to the directory again, it assumes it has to make a new one (write once), but it already exists, so it complains. You can delete it via hadoop fs -rmr /path/to/your/output/. It's better to create a dynamic directory (e.g., based on a timestamp or hash value) in order to preserve data.
You can create an output subdirectory for each execution by time. For example, let's say you are expecting the output directory from the user and setting it as follows:
FileOutputFormat.setOutputPath(job, new Path(args[1]));
Change this to the following lines:
String timeStamp = new SimpleDateFormat("yyyy.MM.dd.HH.mm.ss", Locale.US).format(new Timestamp(System.currentTimeMillis()));
FileOutputFormat.setOutputPath(job, new Path(args[1] + "/" + timeStamp));
I had a similar use case and used MultipleOutputs to resolve it.
For example, if I want different MapReduce jobs to write to the same directory /outputDir/: job 1 writes to /outputDir/job1-part1.txt, job 2 writes to /outputDir/job1-part2.txt (without deleting existing files).
In the main, set the output directory to a random one (it can be deleted before a new job runs):
FileOutputFormat.setOutputPath(job, new Path("/randomPath"));
In the reducer/mapper, use MultipleOutputs and set the writer to write to the desired directory:
private MultipleOutputs mos;

@Override
public void setup(Context context) {
    mos = new MultipleOutputs(context);
}
and:
mos.write(key, value, "/outputDir/fileOfJobX.txt");
However, my use case was a bit more complicated than that. If it's just about writing to the same flat directory, you can write to a different directory and run a script to migrate the files, like: hadoop fs -mv /tmp/* /outputDir
In my use case, each MapReduce job writes to different sub-directories based on the value of the message being written. The directory structure can be multi-layered, like:
/outputDir/
    messageTypeA/
        messageSubTypeA1/
            job1Output/
                job1-part1.txt
                job1-part2.txt
                ...
            job2Output/
                job2-part1.txt
                ...
        messageSubTypeA2/
            ...
    messageTypeB/
        ...
Each MapReduce job can write to thousands of sub-directories, and the cost of writing to a tmp dir and moving each file to the correct directory is high.
I encountered this exact problem; it stems from the exception raised by checkOutputSpecs in the FileOutputFormat class. In my case, I wanted to have many jobs adding files to directories that already exist, and I guaranteed that the files would have unique names.
I solved it by creating an output format class which overrides only the checkOutputSpecs method and swallows (ignores) the FileAlreadyExistsException that's thrown where it checks whether the directory already exists.
public class OverwriteTextOutputFormat<K, V> extends TextOutputFormat<K, V> {
    @Override
    public void checkOutputSpecs(JobContext job) throws IOException {
        try {
            super.checkOutputSpecs(job);
        } catch (FileAlreadyExistsException ignored) {
            // Swallow the exception
        }
    }
}
And in the job configuration, I used LazyOutputFormat and also MultipleOutputs.
LazyOutputFormat.setOutputFormatClass(job, OverwriteTextOutputFormat.class);
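For context, the driver-side wiring then looks roughly like this; the named output "stats" and the Text key/value classes are placeholders, and OverwriteTextOutputFormat is the class from the answer above:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class DriverSnippet {
    static void configureOutputs(Job job) {
        // Only create output files when something is actually written to them
        LazyOutputFormat.setOutputFormatClass(job, OverwriteTextOutputFormat.class);
        // Register a named output for MultipleOutputs; key/value classes are assumptions
        MultipleOutputs.addNamedOutput(job, "stats", OverwriteTextOutputFormat.class, Text.class, Text.class);
    }
}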
You need to add this setting in your main class:
// Configure the output path from the filesystem into the job
Path outputPath = new Path(args[1]);
FileOutputFormat.setOutputPath(job, outputPath);
// Auto-delete the output dir if it already exists
outputPath.getFileSystem(conf).delete(outputPath, true);

How to create and read directories in Hadoop - Mapreduce Job working directory

I want to create a directory inside the working directory of a MapReduce job in Hadoop.
For example by using:
File setupFolder = new File(setupFolderName);
setupFolder.mkdirs();
in my mapper class to write some intermediate files in it. Is this the right way to do it?
Also, after the job completes, how can I access this directory again if I wish to?
Please advise.
If you are using Java, you can override the setup method and open the file handle there (and close it in cleanup). This handle will then be available throughout that mapper.
I am assuming that you are not writing all the map output here but some debug/stats. With this handle you can read and write as shown in this example (http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample).
If you want to read the whole directory, check out this example: https://sites.google.com/site/hadoopandhive/home/how-to-read-all-files-in-a-directory-in-hdfs-using-hadoop-filesystem-api
Remember that you will not be able to depend on the order of data written to the files.
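A rough sketch of that pattern (not the linked example verbatim); the side-file path, the key/value types, and keying the file by task attempt ID are all assumptions:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SideFileMapper extends Mapper<LongWritable, Text, Text, Text> {
    private FSDataOutputStream sideFile;

    @Override
    protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        // One side file per task attempt so parallel mappers don't clobber each other
        Path p = new Path("/user/hadoop/side-output/" + context.getTaskAttemptID().toString());
        sideFile = fs.create(p, true);
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        sideFile.writeBytes(value.toString() + "\n"); // debug/stats only, not the real map output
        context.write(value, value);
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        sideFile.close();
    }
}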
You can override setup() in your Reducer class, use mkdirs() to create the folder, and use create() to open an output stream for a file:
@Override
protected void setup(Context context) throws IOException {
    Configuration conf = context.getConfiguration();
    FileSystem fs = FileSystem.get(conf);
    fs.mkdirs(new Path("your_path_here"));
}
