how do we compare a local file and hdfs file for consistency - hadoop

public String getDirs() throws IOException {
    fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path("/private/tmp/as"), new Path("/test"));
    LocalFileSystem lfs = FileSystem.getLocal(conf);
    System.out.println("Local Path : " + lfs.getFileChecksum(new Path("/private/tmp/as")));
    System.out.println("HDFS PATH : " + fs.getFileChecksum(new Path("/test/as")));
    return "done";
}
Output is
Local Path : null
HDFS PATH : MD5-of-0MD5-of-512CRC32:a575c5e99b2e08605dc7c6723889519c
Not sure why the checksum is null for the local file.

Hadoop relies on the FileSystem implementation to already have a checksum stored for the file; it does not generate one on the fly when you ask for it.
By default, the LocalFileSystem (or whichever implementation backs file:// paths) does not create or store checksums for the files written through it. You can toggle this behavior via the FileSystem#setWriteChecksum API call; retrieving the checksum for a file written after that call will then work.
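Following that advice, here is a minimal sketch (imports from org.apache.hadoop.conf and org.apache.hadoop.fs assumed; the "/private/tmp/as-crc" copy name is hypothetical, and the two checksums only compare meaningfully when both sides use the same bytes-per-checksum and block size):

Configuration conf = new Configuration();
LocalFileSystem lfs = FileSystem.getLocal(conf);
lfs.setWriteChecksum(true); // record a CRC sidecar file for anything written through lfs

// Re-copy the file through the checksummed local FileSystem so a checksum exists for it
Path src = new Path("/private/tmp/as");
Path withCrc = new Path("/private/tmp/as-crc");
FileUtil.copy(lfs, src, lfs, withCrc, false, conf);

FileChecksum localSum = lfs.getFileChecksum(withCrc);
FileChecksum hdfsSum = FileSystem.get(conf).getFileChecksum(new Path("/test/as"));
System.out.println("local=" + localSum + ", hdfs=" + hdfsSum
        + ", match=" + (localSum != null && localSum.equals(hdfsSum)));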


relative path to folder won't work when jpackaged java

So my project is here: https://github.com/Potat-OS1/project_thingo and I started the project from a template.
The champPortrait section is where I'm having my problem. When I run it in the IDE the path works; as I understand it, it's the relative path from the build folder. Does it not use this path when it's packaged? What path should I be using?
I can getResourceAsStream the contents of the folder, but in this particular case I need the folder itself, so I can put the names of all the files inside the folder into a list.
When the application is bundled with jpackage, all classes and resources are packaged in a jar file. So what you are trying to do is read all the entries in a particular package from a jar file. There is no nice way to do that.
Since the contents of the jar file can't be changed after deployment, the easiest solution is probably just to create a text resource listing the files. You just have to make sure that you update the text file at development time if you change the contents of that resource.
So, e.g., if in your source hierarchy you have
resources
|
|--- images
|
|--- img1.png
|--- img2.png
|--- img3.png
I would just create a text file resources/images/imageList.txt with the content
img1.png
img2.png
img3.png
Then in code you can do:
List<Image> images = new ArrayList<>();
String imageBase = "/images/";
try (BufferedReader br = new BufferedReader(
        new InputStreamReader(getClass().getResourceAsStream(imageBase + "imageList.txt")))) {
    br.lines().forEach(imageName -> {
        URL imageUrl = getClass().getResource(imageBase + imageName);
        Image image = new Image(imageUrl.toExternalForm());
        images.add(image);
    });
} catch (Exception exc) {
    exc.printStackTrace();
}
As mentioned, you will need to keep the text file in sync with the contents of the resource folder before building. If you're feeling ambitious, you could look into automating this as part of your build with your build tool (Gradle, Maven, etc.).
The Java resource API does not provide a supported way to list the resources in a given package. If you aren't using a framework that provides its own solution (e.g., Spring), then probably the easiest and sufficiently robust solution is to do what @James_D demonstrates: create another resource that simply lists the names of the resources in the current package. Then you can read that resource to get the names of the other resources.
For a relatively small number of resources, where the number doesn't change often, creating the "name list" resource manually is probably sufficient. But you've tagged this question with gradle, so another option is to have the build tool create these "name list" resources for you. This can be done in a plugin, or you could do it directly in your build script.
Example
Here's an example of creating the "plugin" in your build script.
Sources
Source structure:
\---src
\---main
+---java
| \---sample
| Main.java
|
\---resources
\---sample
bar.txt
baz.txt
foo.txt
qux.txt
Where each *.txt file in src/main/resources/sample contains a single line which says Hello from <filename>!.
build.gradle.kts (Kotlin DSL):
plugins {
    application // implicitly applies the Java Plugin as well
}

application {
    mainClass.set("sample.Main")
}

// gets the 'processResources' task and augments it to add the desired
// behavior. This task processes resources in the "main" source set.
tasks.processResources {
    // 'doLast' means everything inside happens at the end, or at least
    // near the end, of this task
    doLast {
        /*
         * Get the "main" source set. By default, this essentially
         * represents the files under 'src/main'. There is another
         * source set added by the Java Plugin named "test", which
         * represents the files under 'src/test'.
         */
        val main: SourceSet by sourceSets
        /*
         * Gets *all* the source directories in the main source set
         * used for resources. By default, this will only include
         * 'src/main/resources'. If you add other resource directories
         * to the main source set, then those will be included here as well.
         */
        val source: Set<File> = main.resources.srcDirs
        /*
         * Gets the build output directory for the resources in the
         * main source set. By default, this will be the
         * 'build/resources/main' directory. The '!!' bit at the end
         * of this line of code is a Kotlin language thing, which
         * basically says "I know this won't be null, but fail if it is".
         */
        val target: File = main.output.resourcesDir!!
        /*
         * This calls the 'createResourceListFiles' function for every
         * resource directory in 'source'.
         */
        for (root in source) {
            // the last argument is 'root' because the first package is
            // the so-called "unnamed/default package", which are resources
            // under the "root"
            createResourceListFiles(root, target, root)
        }
    }
}
/**
 * Recursively traverses the package hierarchy of the given resource root and creates
 * a `resource-list.txt` resource in each package containing the absolute names of every
 * resource in that package, with each name on its own line. If a package does not have
 * any resources, then no `resource-list.txt` resource is created for that package.
 *
 * The `resourceRoot` and `targetDir` arguments will never change. Only the `packageDir`
 * argument changes for each recursive call.
 *
 * @param resourceRoot the root of the resources
 * @param targetDir the output directory for resources; this is where the
 * `resource-list.txt` resource will be created
 * @param packageDir the current package directory
 */
fun createResourceListFiles(resourceRoot: File, targetDir: File, packageDir: File) {
    // get all non-directories in the current package; these are the resources
    val resourceFiles: List<File> = listFiles(packageDir, File::isFile)
    // only create a resource-list.txt file if there are resources in this package
    if (resourceFiles.isNotEmpty()) {
        /*
         * Determine the output file path for the 'resource-list.txt' file. This is
         * computed by getting the path of the current package directory relative
         * to the resource root. And then resolving that relative path against
         * the output directory, and finally resolving the filename 'resource-list.txt'
         * against that directory.
         *
         * So, if 'resourceRoot' was 'src/main/resources', 'targetDir' was 'build/resources/main',
         * and 'packageDir' was 'src/main/resources/sample', then 'targetFile' will be resolved
         * to 'build/resources/main/sample/resource-list.txt'.
         */
        val targetFile: File = targetDir.resolve(packageDir.relativeTo(resourceRoot)).resolve("resource-list.txt")
        // opens a BufferedWriter to 'targetFile' and will close it when
        // done (that's what 'use' does; it's like try-with-resources in Java)
        targetFile.bufferedWriter().use { writer ->
            // prints the absolute name of each resource on their own lines
            for (file in resourceFiles) {
                /*
                 * Prepends a forward slash to make the name absolute. Gets the rest of the name
                 * by getting the relative path of the resource file from the resource root. Replaces
                 * any backslashes with forward slashes because Java's resource-lookup API uses forward
                 * slashes (needed on e.g., Windows, which uses backslashes for filename separators).
                 *
                 * So, a resource at 'src/main/resources/sample/foo.txt' would result in
                 * '/sample/foo.txt' being written to the 'resource-list.txt' file.
                 */
                writer.append("/${file.toRelativeString(resourceRoot).replace("\\", "/")}")
                writer.newLine()
            }
        }
    }
    /*
     * Gets all the child directories of the current package directory, as these
     * are the "sub packages", and recursively calls this function for each
     * sub package.
     */
    for (packageSubDir in listFiles(packageDir, File::isDirectory)) {
        createResourceListFiles(resourceRoot, targetDir, packageSubDir)
    }
}
/**
 * @param directory the directory to list the children of
 * @param predicate the filter function; only children for which this function
 * returns `true` are included in the list
 * @return a possibly empty list of files which are the children of `dir`
 */
fun listFiles(directory: File, predicate: (File) -> Boolean): List<File> =
    directory.listFiles()?.filter(predicate) ?: emptyList()
Main.java:
package sample;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.List;
public class Main {

    public static void main(String[] args) throws IOException {
        for (var resource : resources()) {
            System.out.printf("Contents of '%s':%n", resource);
            try (var reader = openResource(resource)) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.printf(" %s%n", line);
                }
                System.out.println();
            }
        }
    }

    public static List<String> resources() throws IOException {
        try (var input = openResource("/sample/resource-list.txt")) {
            return input.lines().toList();
        }
    }

    public static BufferedReader openResource(String name) throws IOException {
        var input = Main.class.getResourceAsStream(name);
        return new BufferedReader(new InputStreamReader(input));
    }
}
Output
After the processResources task runs, you'll have the following /sample/resource-list.txt file in your build output:
/sample/bar.txt
/sample/baz.txt
/sample/foo.txt
/sample/qux.txt
And running the application (./gradlew clean run) will give the following output:
> Task :run
Contents of '/sample/bar.txt':
Hello from bar.txt!
Contents of '/sample/baz.txt':
Hello from baz.txt!
Contents of '/sample/foo.txt':
Hello from foo.txt!
Contents of '/sample/qux.txt':
Hello from qux.txt!
BUILD SUCCESSFUL in 2s
4 actionable tasks: 4 executed
Notes
Note that the resource-list.txt resource(s) will only exist in your build output/deployment. It does not exist in your source directories. Also, the way I implemented this, it will only list resources in your source directories. Any resources generated by, for example, an annotation processor will not be included. You could, of course, modify the code to fix that if it becomes an issue for you.
The above will only run for production resources, not test resources (or any other source set). You can modify the code to change this as needed.
If a package does not have any resources, then the above will not create a resource-list.txt resource for that package.
Each name listed in resource-list.txt is the absolute name. It has a leading /. This will work with Class#getResource[AsStream](String), but I believe to call the same methods on ClassLoader (if you need to for some reason) you'll have to remove the leading / (in code).
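For illustration, the difference looks like this (a minimal sketch; the resource name is borrowed from the example above):

// Class#getResourceAsStream resolves an absolute name with a leading '/'
InputStream a = Main.class.getResourceAsStream("/sample/foo.txt");
// ClassLoader#getResourceAsStream expects the same name without the leading '/'
InputStream b = Main.class.getClassLoader().getResourceAsStream("sample/foo.txt");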
Finally, I wrote the Kotlin code in the build script rather quickly. There may be more efficient, or at least less verbose, ways to do the same thing. And if you want to apply this to multiple projects, or even multiple subprojects of the same project, you can create a plugin. Though it may be that some plugin already exists for this, if you're willing to search for one.

CI4 - Trying to move image but get error "could not move file php6WkH2s to /var/www/example.com/development.example.com/app_dir/public/ ()"

I am trying to upload a file and move it to the public/ folder. The file uploads to the writable folder without problems; however, it is the move to the public folder that fails.
Here is my code:
$update_post->move(ROOTPATH.'public/', $update_post.'.'.$fileType);
The path is correct. When I echo ROOTPATH.'public/' and manually copy/paste the output, I do get to the destination directory.
The permissions are correct. Here are the permissions on the public/ directory:
drwxr-xr-x 9 www-data www-data 4096 Jan 30 01:08 public
Any hints appreciated.
Reason:
It's because the move(string $targetPath, ?string $name = null, bool $overwrite = false) method's $name argument is invalid.
$update_post->move( ... , $update_post.'.'.$fileType);
Explanation:
Concatenating a CodeIgniter\Files\File instance (the class extends SplFileInfo) calls the inherited SplFileInfo::__toString() method, which returns the path to the file as a string.
Note that it does not return the filename, which is what you're interested in.
Solution:
You should pass in the basename instead.
$update_post->move(
    ROOTPATH . 'public/',
    $update_post->getBasename()
);
Alternatively, since you're not changing the destination filename, it's cleaner to just not pass in the second parameter of the move(...) method at all. I.e.:
$update_post->move(
    ROOTPATH . 'public'
);
Addendum:
If you wish to change the destination filename to a new name, try this instead:
guessExtension()
Attempts to determine the file extension based on the trusted
getMimeType() method. If the mime type is unknown, will return null.
This is often a more trusted source than simply using the extension
provided by the filename. Uses the values in app/Config/Mimes.php to
determine extension:
$newFileName = "site_logo"; // New filename, without a file extension suffix.
$fileExtension = $update_post->guessExtension();

$update_post->move(
    ROOTPATH . 'public',
    $newFileName . (empty($fileExtension) ? '' : '.' . $fileExtension)
);
Notes:
The move(...) method returns a new File instance for the relocated file, so you must capture the result if the resulting location is needed: $newRelocatedFileInstance = $update_post->move(...);

How can I use S3InboundFileSynchronizer to synchronize an S3Bucket organized with directories?

I'm trying to use the S3InboundFileSynchronizer to synchronize an S3 bucket to a local directory. The bucket is organised with sub-directories such as:
bucket ->
    2016 ->
        08 ->
            daily-report-20160801.csv
            daily-report-20160802.csv
            etc...
Using this configuration:
@Bean
public S3InboundFileSynchronizer s3InboundFileSynchronizer() {
    S3InboundFileSynchronizer synchronizer = new S3InboundFileSynchronizer(amazonS3());
    synchronizer.setDeleteRemoteFiles(true);
    synchronizer.setPreserveTimestamp(true);
    synchronizer.setRemoteDirectory("REDACTED");
    synchronizer.setFilter(new S3RegexPatternFileListFilter(".*\\.csv$"));
    Expression expression = PARSER.parseExpression("#this.substring(#this.lastIndexOf('/')+1)");
    synchronizer.setLocalFilenameGeneratorExpression(expression);
    return synchronizer;
}
I'm able to get as far as connecting to the bucket and listing its contents. When it comes time to read from the bucket the following exception is thrown:
org.springframework.messaging.MessagingException: Problem occurred while synchronizing remote to local directory; nested exception is
org.springframework.messaging.MessagingException: Failed to execute on session;
nested exception is
java.lang.IllegalStateException: 'path' must in pattern [BUCKET/KEY].
at org.springframework.integration.file.remote.synchronizer.AbstractInboundFileSynchronizer.synchronizeToLocalDirectory(AbstractInboundFileSynchronizer.java:266)
Reviewing the code, it seems it would be impossible to ever synchronize an S3 bucket with sub-directories:
private String[] splitPathToBucketAndKey(String path) {
    Assert.hasText(path, "'path' must not be empty String.");
    String[] bucketKey = path.split("/");
    Assert.state(bucketKey.length == 2, "'path' must in pattern [BUCKET/KEY].");
    Assert.state(bucketKey[0].length() >= 3, "S3 bucket name must be at least 3 characters long.");
    bucketKey[0] = resolveBucket(bucketKey[0]);
    return bucketKey;
}
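To make the failure concrete, a quick sketch (the key below is modeled on the bucket layout above):

String path = "bucket/2016/08/daily-report-20160801.csv";
String[] parts = path.split("/");    // ["bucket", "2016", "08", "daily-report-20160801.csv"] -- length 4, so the assertion fails
String[] fixed = path.split("/", 2); // ["bucket", "2016/08/daily-report-20160801.csv"] -- a limit-2 split keeps the full key intact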
Is there some configuration I'm missing or is this a bug?
(I'm assuming it's a bug until I'm told otherwise, so I've submitted a pull request with a proposed fix.)
Yes, it is a bug, and the submitted pull request is a good fix.
The only workaround in the meantime is a custom SessionFactory<S3ObjectSummary> that returns a custom S3Session extension containing the fix from the PR.
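A hedged sketch of that workaround is below. PatchedS3Session is hypothetical: since splitPathToBucketAndKey is private, it would have to be a local copy of S3Session whose split only breaks on the first '/':

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.S3ObjectSummary;
import org.springframework.integration.file.remote.session.Session;
import org.springframework.integration.file.remote.session.SessionFactory;

public class PatchedS3SessionFactory implements SessionFactory<S3ObjectSummary> {

    private final AmazonS3 amazonS3;

    public PatchedS3SessionFactory(AmazonS3 amazonS3) {
        this.amazonS3 = amazonS3;
    }

    @Override
    public Session<S3ObjectSummary> getSession() {
        // PatchedS3Session (hypothetical): a copy of S3Session with the corrected path splitting
        return new PatchedS3Session(amazonS3);
    }
}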

h2 with custom java alias and javac compiler issues in multi process environment

H2 database with custom function alias defined as:
create alias to_date as $$
java.util.Date toDate(java.lang.String dateString, java.lang.String pattern) {
    try {
        return new java.text.SimpleDateFormat(pattern).parse(dateString);
    } catch (java.text.ParseException e) {
        throw new java.lang.RuntimeException(e);
    }
}
$$;
H2 initialized as:
jdbc:h2:mem:testdb;INIT=runscript from 'classpath:create_alias.sql'
This is used in tests executed concurrently for multiple projects on a Jenkins instance. Sometimes such tests fail with the following error:
Could not get JDBC Connection; nested exception is org.h2.jdbc.JdbcSQLException: Syntax error in SQL statement "javac: file not found: org/h2/dynamic/TO_DATE.java
Usage: javac <options> <source files>
use -help for a list of possible options
"; SQL statement:
create alias to_date as $$
java.util.Date toDate(java.lang.String dateString, java.lang.String pattern) {
....
My guess is that org.h2.util.SourceCompiler assumes there is only one instance of H2 running at a time and writes the generated Java source to 'java.io.tmpdir', which is shared among all processes running under the same account. I propose the following fix:
Index: SourceCompiler.java
===================================================================
--- SourceCompiler.java (revision 5086)
+++ SourceCompiler.java (working copy)
@@ -40,7 +40,15 @@
      */
     final HashMap<String, Class<?>> compiled = New.hashMap();
 
-    private final String compileDir = Utils.getProperty("java.io.tmpdir", ".");
+    private final String compileDir;
+
+    {
+        // use a random folder under java.io.tmpdir so multiple H2 instances can
+        // compile at the same time without overwriting each other's files
+        File tmp = File.createTempFile("h2tmp", ".tmp");
+        tmp.mkdir();
+        compileDir = tmp.getAbsolutePath();
+    }
 
     static {
         Class<?> clazz;
Should I open a support ticket, or are there workarounds for this issue?
You can use the javax.tools.JavaCompiler API and provide your own in-memory JavaFileManager implementation to avoid creating those temp files entirely.
BTW, Janino also supports the javax.tools.JavaCompiler API.
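To show the idea with the JDK's built-in compiler, here is a minimal, self-contained sketch of compiling entirely in memory (the class names are illustrative, not H2's):

import javax.tools.*;
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.net.URI;
import java.util.List;

// Source held in a String instead of a .java file on disk.
class MemorySource extends SimpleJavaFileObject {
    private final String code;
    MemorySource(String className, String code) {
        super(URI.create("string:///" + className.replace('.', '/') + Kind.SOURCE.extension), Kind.SOURCE);
        this.code = code;
    }
    @Override
    public CharSequence getCharContent(boolean ignoreEncodingErrors) {
        return code;
    }
}

// Bytecode captured in memory instead of a .class file on disk.
class MemoryClassFile extends SimpleJavaFileObject {
    final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    MemoryClassFile(String className) {
        super(URI.create("mem:///" + className.replace('.', '/') + Kind.CLASS.extension), Kind.CLASS);
    }
    @Override
    public OutputStream openOutputStream() {
        return bytes;
    }
}

public class InMemoryCompile {
    public static void main(String[] args) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        MemoryClassFile output = new MemoryClassFile("Hello");
        // Redirect the compiler's class-file output into the in-memory object.
        JavaFileManager manager = new ForwardingJavaFileManager<JavaFileManager>(
                compiler.getStandardFileManager(null, null, null)) {
            @Override
            public JavaFileObject getJavaFileForOutput(JavaFileManager.Location location,
                    String className, JavaFileObject.Kind kind, FileObject sibling) {
                return output;
            }
        };
        MemorySource source = new MemorySource("Hello", "public class Hello {}");
        boolean ok = compiler.getTask(null, manager, null, null, null, List.of(source)).call();
        System.out.println("compiled=" + ok + ", bytes=" + output.bytes.size());
    }
}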
I had the same problem running multiple Jenkins executors with an Arquillian/WildFly/H2 integration-test configuration. I found a workaround: set the java.io.tmpdir property to the build directory in the test standalone.xml.

Hadoop distributed cache archive unarchiving in working directory

I'm sending an archive to the distributed cache via -Dmapred.cache.archives=hdfs://host:port/path/archive.zip#foldername -Dmapred.create.symlink=yes and it creates a new folder in the working directory and unarchives the files there. The problem is I need those files in the working directory itself, and I've already tried using . and ./ as the folder name, as well as sending an empty one. Any ideas on how to solve this, short of moving the files explicitly in my Java code?
What's the specific need for the files to be in the working directory (so I can understand, and suggest some alternatives)?
Anyway, it looks like archives in the distributed cache will always be unpacked to a directory, so I don't think you can resolve this using archives. However, depending on the number of files you wish to place in the working directory, you can use files in the DistributedCache.
For example, using the GenericOptionsParser parameters you can specify files and folders to include which are then available in the working directory:
public static class DistCacheMapper extends
        Mapper<LongWritable, Text, NullWritable, NullWritable> {

    @Override
    public void run(Context context) throws IOException,
            InterruptedException {
        Configuration conf = context.getConfiguration();
        System.err.println("Local Files:");
        listFiles(new File("."), "");
    }

    private void listFiles(File dir, String ident) {
        for (File f : dir.listFiles()) {
            System.out.println(ident + (f.isDirectory() ? "d" : "-") + "\t"
                    + f.getName());
            if (f.isDirectory()) {
                listFiles(f, ident + " ");
            }
        }
    }
}
For example, hadoop jar myjar.jar -files pom.xml,.project,.classpath,src dummy.txt gives the following listing (in which you can see the src folder has been taken along):
- .classpath
- .project
d tmp
- pom.xml
d src
 d test
  d resources
  d java
 d main
  d resources
  d java
   d csw
    d sandbox
     - DistCacheJob.java
     - .DistCacheJob.java.crc
- job.jar
- .job.jar.crc
So the long and the short of it is that you'll have to list every file you want in the working directory via the distributed cache -files option; subdirectories can be distributed either as archives or via -files as well.
