UPDATE
dir=/Users/matthew/Workbench
for f in $dir/Data/NYTimes/NYTimesCorpus_4/*/*/*/*.txt; do
[[ $f == *.xml ]] && continue # skip output files
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -filelist "$f" -outputDirectory .
done
This one seems to work better, but I'm getting an IOException: file name too long error. What is that about, and how do I fix it?
I guess the other command in the documentation is dysfunctional.
I was trying to use this script to process my corpus with Stanford CoreNLP, but I keep getting the error
Could not find or load main class .Users.matthew.Workbench.Code.CoreNLP.Stanford-corenlp-full-2015-01-29.edu.stanford.nlp.pipeline.StanfordCoreNLP
This is the script
dir=/Users/matthew/Workbench
for f in $dir/Data/NYTimes/NYTimesCorpus_3/*/*/*/*.txt; do
[[ $f == *.xml ]] && continue # skip output files
java -mx600m -cp $dir/Code/CoreNLP/stanford-corenlp-full-2015-01-29/stanford-corenlp-VV.jar:stanford-corenlp-VV-models.jar:xom.jar:joda-time.jar:jollyday.jar:ejml-VV.jar -Xmx2g /Users/matthew/Workbench/Code/CoreNLP/stanford-corenlp-full-2015-01-29/edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit -file "$f" -outputDirectory $dir/Data/NYTimes/NYTimesCorpus_3/*/*/*/.
done
A very similar one worked for Stanford NER; that one looked like this:
dir=/Users/matthew/Workbench
for f in $dir/Data/NYTimes/NYTimesCorpus_3/*/*/*/*.txt; do
[[ $f == *_NER.txt ]] && continue # skip output files
g="${f%.txt}_NER.txt"
java -mx600m -cp $dir/Code/StanfordNER/stanford-ner-2015-01-30/stanford-ner-3.5.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier $dir/Code/StanfordNER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz -textFile "$f" -outputFormat inlineXML > "$g"
done
I can't figure out why I keep getting that error; it seems I've specified all the paths correctly.
I know there's the -filelist parameter, which points to a file whose content lists all files to be processed (one per line),
but I don't know how exactly that would work in my situation, since my directory structure looks like this: $dir/Data/NYTimes/NYTimesCorpus_3/*/*/*/*.txt, within which there are many files to be processed.
Also, is it possible to specify -outputDirectory dynamically? The docs say you may specify an alternate output directory with that flag, but it seems like it would be set once and then stay static, which would be a nightmare scenario in my case.
I thought maybe I could just write some code to do this, but that doesn't work either. This is what I tried:
// imports needed for this snippet (assuming the CoreNLP jars are on the classpath)
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Properties;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public static void main(String[] args) throws Exception
{
    BufferedReader br = new BufferedReader(new FileReader("/home/matthias/Workbench/SUTD/nytimes_corpus/NYTimesCorpus/2005/01/01/1638802_output.txt"));
    try
    {
        // read the whole file into a single String
        StringBuilder sb = new StringBuilder();
        String line = br.readLine();
        while (line != null)
        {
            sb.append(line);
            sb.append(System.lineSeparator());
            line = br.readLine();
        }
        String everything = sb.toString();
        //System.out.println(everything);

        Annotation doc = new Annotation(everything);

        // create a StanfordCoreNLP object; only tokenization and
        // sentence splitting are enabled here
        Properties props = new Properties();
        props.put("annotators", "tokenize, ssplit");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        pipeline.annotate(doc);
        System.out.println(doc);
    }
    finally
    {
        br.close();
    }
}
By far the best way to process a lot of files with Stanford CoreNLP is to arrange to load the system once - since loading all the various models takes 15 seconds or more depending on your computer before any actual document processing is done - and then to process a bunch of files with it. What you have in your update doesn't do that, because running CoreNLP is inside the for loop; it also passes each document to -filelist, so every line of the document is treated as a file name, which is where your file name too long error comes from. A good solution is to use the for loop to make a file list and then to run CoreNLP once on the file list. The file list is just a text file with one filename per line, so you can make it any way you want (using a script, editor macro, typing it in yourself), and you can and should check that its contents look correct before running CoreNLP. For your example, based on your update example, the following should work:
dir=/Users/matthew/Workbench
for f in $dir/Data/NYTimes/NYTimesCorpus_3/*/*/*/*.txt; do
echo $f >> filelist.txt
done
# You can here check that filelist.txt has in it the files you want
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -filelist filelist
# By default output files are written to the current directory, so you don't need to specify -outputDirectory .
Other notes on earlier tries:
-mx600m isn't a reasonable way to run the full CoreNLP pipeline (right through parsing and coref). The sum of all its models is just too large. -mx2g is fine.
The best way above doesn't fully extend to the NER case. Stanford NER doesn't take a -filelist option, and if you use -textFiles then the files are concatenated and become one output file, which you may well not want. At present, for NER, you may well need to run it inside the for loop, as in your script for that.
I haven't quite decoded how you're getting the error Could not find or load main class .Users.matthew.Workbench.Code.CoreNLP.Stanford-corenlp-full-2015-01-29.edu.stanford.nlp.pipeline.StanfordCoreNLP, but this is happening because you're putting a String (filename?) like that (perhaps with slashes rather than periods) where the java command expects a class name. In that place, there should only be edu.stanford.nlp.pipeline.StanfordCoreNLP as in your updated script or mine.
You can't have a dynamic outputDirectory in one call to CoreNLP. You could get the effect that I think you want reasonably efficiently by making one call to CoreNLP per directory using two nested for loops. The outer for loop would iterate over directories, the inner one make a file list from all the files in that directory, which would then be processed in one call to CoreNLP and written to an appropriate output directory based on the input directory in the outer for loop. Someone with more time or bash-fu than me could try to write that....
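Here is a rough, untested sketch of that nested-loop idea (it assumes your three-level directory layout; $dir/Output is a placeholder for wherever you want results to go, and as in your update, -cp "*" assumes you run this from the CoreNLP distribution directory):
for d in "$dir"/Data/NYTimes/NYTimesCorpus_3/*/*/*/; do
out="$dir/Output/${d#$dir/Data/}" # mirror the input layout under Output
mkdir -p "$out"
ls "$d"*.txt > filelist.txt || continue # skip directories with no .txt files
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -filelist filelist.txt -outputDirectory "$out"
done
This way the models are loaded once per directory rather than once per file, which is usually a good trade-off.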
You can certainly also write your own code to call CoreNLP, but then you're responsible for scanning input directories and writing to appropriate output files yourself. What you have looks basically okay, except the line System.out.println( doc ); won't do anything useful - it just prints out the text you began with. You need something like:
PrintWriter xmlOut = new PrintWriter("outputFileName.xml");
pipeline.xmlPrint(doc, xmlOut);
Related
Is there a Gradle pattern for retrieving the list of files in a folder or set of folders that contain a given string, set of strings, or pattern?
My project produces RPMs and is using the Nebula RPM type (great package!). There are a couple of different kinds of sets of files that need post-processing. I am trying to generate the list of files that contain the strings that are the indicators for post-processing. For example, files that contain "#doc" need to be processed by the doc generator script. Files that contain "#HOSTNAME#" and "#HOSTFQDN#" need to be processed by sed to replace the strings with the actual host name or host fqdn.
The search root in the package will be src\main\resources. With the result, the build script sets up the post-install script commands - something like:
postInstall('/opt/product/bin/postprocess.sh ' + join(filesContainingDocs, " "))
postInstall('/bin/sed -i -e "s/#HOSTNAME#/$(hostname -s)/" -e "s/#HOSTFQDN#/$(hostname)/" ' + join(filesContainingHostname, " "))
I can figure out the postinstall syntax. I'm having difficulty finding the filter for any of the regular Gradle 'things' (i.e., FileTree) that operate on contents of files rather than names of files. How would I populate filesContainingDocs and filesContainingHostname - something along the lines of:
filesContainingDocs = FileTree('src/main/resources', { contents.matches('#doc') })
filesContainingHostname = FileTree('src/main/resources', { contents.matches('#(HOSTNAME|HOSTFQDN)#') })
While the post-process script could simply do the grep, the several RPMs in our product overlay each other and each RPM should only post-process the files it provides, so a general grep over the final installed folder is not workable - it would catch files provided by other RPMs. It seems to me that I ought to be able to, at build time, produce the correct static list of files from the bigger set of source files that comprise the given RPM's project.
It doesn't have to be FileTree - running a command like findstr /s /m /c:"#doc" src\main\resources\*.conf (alas, the build platform is Windows) produces the answer in stdout but I'm not sure how to get that result into an object Gradle can use to expand the result. (I also suspect there is a 'more Gradle way' to do this.)
The set of files, and the contents of those files, is generally fairly small.
I'm having difficulty finding the filter for any of the regular Gradle 'things' (i.e., FileTree) that operate on contents of files rather than names of files.
You can apply any filter you can imagine to a Gradle file tree; in the end it is just Groovy (or Kotlin) code running in the JVM. Each Gradle FileTree is nothing more than a (lazily evaluated) collection of Java File objects. To filter those File objects, you can read their content, e.g. in the same way you would read them in Java. Groovy even provides a JDK enhancement for the Java class File that includes the simple method getText() for this purpose. Now you can easily filter for files that contain a certain string:
filesContainingDocs = fileTree('src/main/resources').filter { file ->
file.text.contains('#doc')
}
Using Groovy, you can call getters like .getText() in the same way as accessing fields (.text in this case).
If a simple contains check is not enough, the Groovy JDK enhancements even provide the method matches(Pattern pattern) on CharSequence/String instances to perform a regular expression check:
filesContainingDocs = fileTree('src/main/resources').filter { file ->
    // (?s) lets '.' match newlines so the pattern can span lines
    file.text.replace('\r\n', '\n').matches('(?s).*some regex.*')
}
I want to schedule my Perl code to run every day at a specific time, so I put the code below in a bash file:
Automate.sh
#!/bin/sh
perl /tmp/Taps/perl.pl
The schedule is specified in the crontab like this:
10 17 * * * sh /tmp/Taps/Automate.sh > /tmp/Taps/result.log
When the time reaches 17:10, the .sh file doesn't run. However, when I run ./Automate.sh manually, it runs and I see the result. I don't know what the problem is.
Perl Code
#!/usr/bin/perl -w
use strict;
use warnings;
use Data::Dumper;
use XML::Dumper;
use TAP3::Tap3edit;
$Data::Dumper::Indent=1;
$Data::Dumper::Useqq=1;
my $dump = new XML::Dumper;
use File::Basename;
my $perl='';
my $xml='';
my $tap3 = TAP3::Tap3edit->new();
foreach my $file(glob '/tmp/Taps/X*')
{
$files= basename($file);
$tap3->decode($files) || die $tap3->error;
}
my $filename=$files.".xml\n";
$perl = $tap3->structure;
$dump->pl2xml($perl, $filename);
print "Done \n";
error:
No such file or directory for file X94 at /tmp/Taps/perl.pl line 22.
X94.xml
foreach my $file (glob 'Taps/X*') -- when you're running from cron, your current directory is /. You'll want to provide the full path to that Taps directory. Also specify the full path for the output .xml file.
Cron uses a minimal environment and a short $PATH, which may not necessarily include the expected path to perl. Try specifying this path fully. Or source your shell settings before running the script.
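As a concrete sketch (using the paths from your question, and assuming perl lives at /usr/bin/perl -- check with which perl), the script could look like this:
#!/bin/sh
# Automate.sh -- cron runs with a minimal environment, so be explicit
PATH=/usr/bin:/bin; export PATH
cd /tmp/Taps || exit 1 # make the working directory explicit
/usr/bin/perl /tmp/Taps/perl.pl
and the crontab entry like this:
10 17 * * * /bin/sh /tmp/Taps/Automate.sh > /tmp/Taps/result.log 2>&1
Redirecting stderr with 2>&1 also captures Perl's error messages in result.log, which makes debugging much easier.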
There are a lot of things that can go wrong here. The most obvious and certain one is that if you use a glob to find the file in directory "Taps", then remove the directory from the file name by using basename, then Perl cannot find the file. Not quite sure what you are trying to achieve there. The file names from the glob will be for example Taps/Xfoo, a relative path to the working directory. If you try to access Xfoo from the working directory, that file will not be found (or the wrong file will be found).
This should also (probably) lead to a fatal error, which should be reported in your error log. (Assuming that the decode function returns a false value upon error, which is not certain.) If no errors are reported in your error log, that is a sign the program does not run at all. Or it could be that decode does not return false on missing file, and the file is considered to be empty.
I assume that when you test the program, you cd to /tmp and run it, or your "Taps" directory is in your home directory. So you are making assumptions about where your program looks for the files. You should be certain where it looks for files, probably by using only absolute paths.
Another simple error might be that crontab does not have permission to execute the file, or no read access to "Taps".
Edit:
Other complications in your code:
You include Data::Dumper, but never actually use that module.
$xml variable is not used.
$files variable not declared (this code would never run with use strict)
Your $files variable is used outside your foreach loop, which means the code after the loop only sees the last value assigned. Since you use glob, I assumed you were reading more than one file, in which case this solution will probably not do what you want. It is also possible that you are using a glob because the file name can change, e.g. X93, X94, etc. In that case you will read the last file name returned by the glob. But this looks like a weak link in your logic.
You add a newline \n to a file name, which is strange.
I have several folders of video files where, due to the download manager I use, they are all named in the following format: "FILENAME.mp4; filename= FILENAME.mp4". All I've been trying to do is remove everything after (and including) ".mp4; filename", but I haven't found a way to do it.
I have tried some free software (such as Renamer, NameChanger, Name Munger for Mac, and Transnomino), but I couldn't get any of them to do what I need.
I'm working on Mac OSX 10.13.6.
Any help with this issue would be appreciated.
You can achieve this using Terminal. Go to the folder where you want to rename files using the cd command, for example:
cd ~/Documents/Videos
And run this command to rename all files recursively:
find . -iname "*.mp4;*" | sed -E 's/(\.[^\.]*)(\.mp4)(.*)/mv "\1\2\3" "\1\2"/' | sh
This command keeps only the FILENAME.mp4 part of a FILENAME.mp4; filename= FILENAME.mp4 file name.
I used to extensively use a Windows rename tool called Renamer 6.0, and it had a "pattern rename" facility called "Multi change" that could have handled this.
In the context of that tool, it would ask for a source pattern (like %a= %b) and a destination pattern (like %b). Everything after the = would be stored in the %b variable, and renaming the file to just %b keeps only that part, dropping everything up to and including the =.
See if your preferred rename tool has a similar facility?
If your tool supports regex, then find: .*?=(.*) and replace with $1
I'm also minded that asking this question on https://unix.stackexchange.com/ might elicit some help crafting a shell script that will perform this rename (though also plenty of shell capable people here, one of them may see it - it's just that it's not quite as hardcore programmer-y a question as most).
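For what it's worth, a small bash loop can also do this rename directly; a sketch, assuming you cd into the affected folder first and that every bad name contains "; filename=":
for f in *'; filename='*; do
mv -i -- "$f" "${f%%;*}" # keep everything before the first ';'
done
The expansion ${f%%;*} strips the first ';' and everything after it, turning "FILENAME.mp4; filename= FILENAME.mp4" into "FILENAME.mp4", and mv -i asks before overwriting anything.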
If you're willing to learn/use Java, that could be another good way to get the problem solved. It would (at a guess) look something like this:
for (final File f : new File("C:\\temp").listFiles()) {
    if (f.isFile()) {
        String n = f.getName();
        if (n.contains("=")) {
            // keep only the part after '=' (trimmed), e.g. "FILENAME.mp4"
            f.renameTo(new File(f.getParentFile(), n.substring(n.indexOf("=") + 1).trim()));
        }
    }
}
This is a theoretical question about minimizing side effects in bash scripting.
I recently used a simple mechanism for formatting a bunch of json files, in a nested directory structure...
for f in $(find . -name '*json'); do echo "$f"; python -mjson.tool "$f" > /tmp/1 && cp /tmp/1 "$f"; done
The mechanism is simply to
format each file using python's mjson.tool,
write it to a tmp location, and
then rewrite it back in place.
Is there a way to do this which is more elegant, i.e. with minimal side effects? I'm assuming bash experts have a better way of doing this sort of thing.
Unix tools work on a streaming basis -- they don't store the full contents of the files in memory at once. Therefore, you have to use an intermediate location, since you would otherwise be overwriting a file that is currently being read from.
You may also consider that your snippet isn't fault tolerant. If you make a mistake, you will have just overwritten all your data. You should store the output in a new location, verify it, and only then move it over the original. :)
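A sketch of a more fault-tolerant variant, using mktemp for a unique temporary file and only overwriting the original when formatting succeeded:
find . -name '*json' -exec sh -c '
for f; do
tmp=$(mktemp) || exit 1
# overwrite the original only if python parsed the file successfully
if python -mjson.tool "$f" > "$tmp"; then
mv "$tmp" "$f"
else
echo "skipped $f" >&2; rm -f "$tmp"
fi
done
' sh {} +
Unlike iterating over find's output with a for loop, -exec hands each path to the inner shell intact, so file names with spaces survive.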
Using the Eclipse IDE, you can format multiple JSON files: import the files into Eclipse, select the files you wish to format (or the folder, for all files), and right-click -> Source -> Format.
I was looking for something similar and just noticed I can select all the JSON files in my VSCode file panel and Ctrl + click > "Format". Works like magic for a one-off operation; it formats the files in place.
(screenshot: VSCode format in action)
Build Rules are documented in the Xcode Build System Guide
They are well adapted to the common case where one input file is transformed into a fixed number (usually one) of output files.
The output files must be described in the "Output Files" area of the build rule definition; one line per output file. Typically the output files have the same name as the input file but have different extensions.
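For example, a rule that turns each foo.y into a generated C file would typically declare something like this in the Output Files area (using Xcode's build-rule variables; the .y extension is just an illustration):
$(DERIVED_FILE_DIR)/$(INPUT_FILE_BASE).c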
In my case, one single input file is transformed into a variable number of files with the same extensions. The number and the names of the output files depend on the content of the input file and are not known in advance.
The output files will have to be further processed later on (they are in this case C files to be compiled).
How can I set up a build rule for such a case?
Any suggestions welcome.
(I asked the same question on the Apple developer forum, but I figured it'd be a good idea to ask here too).
I dealt with this by, instead of generating multiple C files, just concatenating them all together into one file (e.g. "AUTOGENERATED.c"), and specifying that as the output file.
So long as your output files don't contain anything that will conflict (static functions with the same name, conflicting #defines etc.) this works well.
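In the build rule's script, that can be as simple as the following sketch (the generator tool and its flags are placeholders; INPUT_FILE_PATH and DERIVED_FILE_DIR are standard Xcode build-rule variables):
# run the (hypothetical) generator, then merge everything it produced
# into the single output file declared in the Output Files area
/path/to/generator "${INPUT_FILE_PATH}" -o "${DERIVED_FILE_DIR}/gen"
cat "${DERIVED_FILE_DIR}"/gen/*.c > "${DERIVED_FILE_DIR}/AUTOGENERATED.c"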
See this article on Cocoa With Love:
http://cocoawithlove.com/2010/02/custom-build-rules-generated-tables-and.html
This has an example of generating custom C code and using that as input to the normal build process. He's using ${} variable syntax in the output files.
The best way I found to add any number of files to my Xcode project (and do some processing on them) is to write a little PHP script. The script can simply copy files into the bundle. The tricky part is the integration with Xcode; it took me some time to find a clean way. (You can use whatever scripting language you like with this method.)
First, use "Add Run Script" instead of "Add Copy File"
Shell parameter:
/bin/sh
Command parameter:
${SRCROOT}/your_script.php -s ${SRCROOT} -o ${CONFIGURATION_BUILD_DIR}/${UNLOCALIZED_RESOURCES_FOLDER_PATH}
exit $?
(screenshot in Xcode)
${SRCROOT} is your project directory.
${CONFIGURATION_BUILD_DIR}/${UNLOCALIZED_RESOURCES_FOLDER_PATH} is the bundle directory. Exactly what you need :)
This way, your script's return code can stop the Xcode build (use die(0) for success and die(1) for failure), and the output of the script will be visible in Xcode's build log.
Your script will look like this (don't forget to chmod +x it):
#!/usr/bin/php
<?php
error_reporting(E_ALL);
$options = getopt("s:o:");
$src_dir = $options["s"]."/";
$output_dir = $options["o"]."/";
// process_files (...)
die(0);
?>
BONUS: here is my 'add_file' function.
Note the special treatment for PNG (it uses Apple's PNG compression).
Note the filemtime/touch usage to avoid copying files each time.
define("COPY_PNG", "/Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/usr/bin/copypng -compress");
function add_file_to_bundle($output_dir, $filepath) {
// split path
$path_info = pathinfo($filepath);
$output_filepath = $output_dir.$path_info['basename'];
// get file's dates of input and output
$input_date = filemtime($filepath);
$output_date = @filemtime($output_filepath); // '@' suppresses the warning when the output file doesn't exist yet
if ($input_date === FALSE) { echo "can't get input file's modification date"; die(1); }
// skip unchanged files
if ($output_date === $input_date) {
//message("skip ".$path_info['basename']);
return 0;
}
// special copy for png with apple's png compression tool
if (strcasecmp($path_info['extension'], "png") == 0) {
//message($path_info['basename']." is a png");
passthru(COPY_PNG." ".escapeshellarg($filepath)." ".escapeshellarg($output_filepath), $return_var);
if ($return_var != 0) die($return_var);
}
// classic copy
else {
//message("copy ".$path_info['basename']);
passthru("cp ".escapeshellarg($filepath)." ".escapeshellarg($output_filepath), $return_var);
if ($return_var != 0) die($return_var);
}
// important: set output file date with input file date
touch($output_filepath, $input_date, $input_date);
return 1;
}