Apache commons-compress

I am using commons-compress to process tarball files and noticed that even files which are not tar archives seem to be processed. Why is this? Is there a better library for detecting valid tar files?
<dependency>
    <groupId>org.apache.commons</groupId>
    <artifactId>commons-compress</artifactId>
    <version>1.20</version>
</dependency>
bug689.csv is a CSV file, yet the test fails because te.isFile() apparently returns true, and te.getName() seems to return the contents of the CSV. Is this a bug, or am I using the package incorrectly? I'd expect the InputStream not to be successfully converted to a TarArchiveEntry.
@Test
public void testTarball() throws IOException {
    InputStream tarData = this.getClass().getResourceAsStream("/bug689.csv");
    TarArchiveInputStream tis = new TarArchiveInputStream(tarData);
    TarArchiveEntry te = tis.getNextTarEntry();
    assertFalse(te.isFile());
}

If you are not dealing with a tar file, then tis.getNextTarEntry() will be null - so you would have to check for that explicitly.
But if you do have a valid tar file, beware relying on te.isFile(). The first item in your tar may not be a regular file. It may be a directory or something else.
The tar file may even be empty - in which case tis.getNextTarEntry() will again be null.
If you only want to test for a tar containing a single regular file, then I see no issue with using te.isFile().
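As a rough illustration of that null check (a sketch only, not tested; the helper name looksLikeTar and the choice to treat an IOException as "not a tar" are my own assumptions, not anything built into commons-compress):
import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;

public final class TarProbe {
    // Returns true only if the stream yields at least one tar entry.
    // getNextTarEntry() returns null for an empty or unrecognised stream,
    // and may throw IOException on malformed header data.
    static boolean looksLikeTar(InputStream in) {
        try (TarArchiveInputStream tis = new TarArchiveInputStream(in)) {
            TarArchiveEntry entry = tis.getNextTarEntry();
            return entry != null;
        } catch (IOException e) {
            return false;
        }
    }
}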

Related

Moving files inside a tar archive

I have a script that archives a mongo collection:
archive.tar.gz contains:
folder/file.bson
and I need to add an additional top-level folder to that structure, for example:
top-folder/folder/file.bson
It seems that one way is to unpack and re-pack everything, but is there any other solution to this?
The problem is that there's a third-party script that unpacks the archive and fetches the files from top-folder/folder/file.bson, and in the current format the path is wrong.
.tar.gz is exactly what the name suggests: first tar converts a directory structure into a byte stream (i.e. a single file), and this byte stream is then compressed by gzip.
That means changing a file path inside the archive is equivalent to byte-editing a compressed data stream - an unnecessarily difficult thing to do without decompressing the stream.
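Since this page is about commons-compress anyway, here is a rough sketch of that unpack-and-repack step in Java; the file names and the top-folder/ prefix are placeholders, and a plain tar/gzip pipeline on the shell would do the same job:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorOutputStream;
import org.apache.commons.compress.utils.IOUtils;

public class RepackTar {
    public static void main(String[] args) throws IOException {
        try (TarArchiveInputStream in = new TarArchiveInputStream(
                 new GzipCompressorInputStream(Files.newInputStream(Paths.get("archive.tar.gz"))));
             TarArchiveOutputStream out = new TarArchiveOutputStream(
                 new GzipCompressorOutputStream(Files.newOutputStream(Paths.get("repacked.tar.gz"))))) {
            out.setLongFileMode(TarArchiveOutputStream.LONGFILE_GNU); // the prefix can push names past 100 chars
            TarArchiveEntry entry;
            while ((entry = in.getNextTarEntry()) != null) {
                entry.setName("top-folder/" + entry.getName()); // prepend the new top-level folder
                out.putArchiveEntry(entry);
                IOUtils.copy(in, out); // copy the entry's bytes unchanged
                out.closeArchiveEntry();
            }
        }
    }
}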

Create a .tar.bz2 file given an array of files

In a Bash script, I have an array that contains a list of files (in the form of their complete file paths):
declare -a individual_files=("/path/to/a" "/path/to/b" "/path/to/c")
I want to create a tar.bz2 compressed file containing all the files in the array, using the tar command.
So far, I have tried
tar rf files.tar "${individual_files[@]}"
tar cjf files.tar.bz2 files.tar
But for some reason, files.tar.bz2 always contains the last file in the array only.
What is the correct command(s) for doing so, preferably without creating the intermediate .tar file?
UPDATED: using @PanRuochen's answer, this is what I see in the verbose info:
+ tar cfvj /Users/skyork/test.tar.bz2 /Users/skyork/.emacs /Users/skyork/.Rprofile /Users/skyork/.aspell.en.pws /Users/skyork/.bash_profile /Users/skyork/.vimrc /Users/skyork/com.googlecode.iterm2.plist
tar: Removing leading '/' from member names
a Users/skyork/.emacs
a Users/skyork/.Rprofile
a Users/skyork/.aspell.en.pws
a Users/skyork/.bash_profile
a Users/skyork/.vimrc
a Users/skyork/com.googlecode.iterm2.plist
But still, the resulting test.tar.bz2 file has only the last file of the array (/Users/skyork/com.googlecode.iterm2.plist) in it.
My bad, the files are indeed there but hidden.
tar cfvj files.tar.bz2 "${individual_files[@]}"
The v flag should give you verbose information about how the bz2 file is created.
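For completeness, since this page is about commons-compress, the same thing can be sketched in Java without the intermediate .tar file; the paths below are placeholders standing in for the array entries:
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

import org.apache.commons.compress.archivers.ArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveOutputStream;
import org.apache.commons.compress.compressors.bzip2.BZip2CompressorOutputStream;

public class TarBz2FromList {
    public static void main(String[] args) throws IOException {
        // the equivalent of the bash array; example paths only
        List<Path> files = List.of(Paths.get("/path/to/a"), Paths.get("/path/to/b"), Paths.get("/path/to/c"));
        try (TarArchiveOutputStream tar = new TarArchiveOutputStream(
                 new BZip2CompressorOutputStream(Files.newOutputStream(Paths.get("files.tar.bz2"))))) {
            for (Path p : files) {
                // store each file under its base name (tar itself keeps the path minus the leading '/')
                ArchiveEntry entry = tar.createArchiveEntry(p.toFile(), p.getFileName().toString());
                tar.putArchiveEntry(entry);
                Files.copy(p, tar); // stream the file contents into the archive
                tar.closeArchiveEntry();
            }
        }
    }
}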

Read from a tar.gz file without saving the unpacked version

I have a tar.gz file saved on disk and I want to leave it packed there, but I need to open one file within the archive, read from it and save some information somewhere.
File structure:
base_folder
  file_i_need.txt
  other_folder
    other_file
Code (it is not much - I tried 10 million different ways and this is what is left):
def self.open_file(file)
  uncompressed_file = Gem::Package::TarReader.new(Zlib::GzipReader.open(file))
  uncompressed_file.rewind
end
When I run it in a console I get
<Gem::Package::TarReader:0x007fbaac178090>
and I can run commands on the entries. I just haven't figured out how to open an entry and read from it without saving it unpacked to disk. I mainly need the string from the text file.
Any help appreciated. I might just be missing something...
TarReader is Enumerable, returning Entry.
That said, to retrieve the text content from the file by its name, one might:
uncompressed = Gem::Package::TarReader.new(Zlib::GzipReader.open(file))
text = uncompressed.detect do |f|
  f.fullname == 'base_folder/file_i_need.txt'
end.read
#⇒ Hello, I'm the content of the text file, located inside the gzipped tar
Hope it helps.
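For readers coming from the main (Java) question, a comparable sketch with commons-compress that reads a single entry's text without unpacking to disk; the entry name comes from the question above, while the class and method names and the archive path are made up for the example:
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
import org.apache.commons.compress.utils.IOUtils;

public class ReadOneEntry {
    static String readEntry(String tarGzPath, String entryName) throws IOException {
        try (InputStream file = Files.newInputStream(Paths.get(tarGzPath));
             TarArchiveInputStream tar = new TarArchiveInputStream(new GzipCompressorInputStream(file))) {
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                if (entry.getName().equals(entryName)) {
                    // the stream is positioned at this entry's data, so read it fully
                    return new String(IOUtils.toByteArray(tar), StandardCharsets.UTF_8);
                }
            }
            throw new IOException(entryName + " not found in " + tarGzPath);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(readEntry("archive.tar.gz", "base_folder/file_i_need.txt"));
    }
}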

Stanford CoreNLP: process many files with a script

UPDATE
dir=/Users/matthew/Workbench
for f in $dir/Data/NYTimes/NYTimesCorpus_4/*/*/*/*.txt; do
[[ $f == *.xml ]] && continue # skip output files
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -filelist "$f" -outputDirectory .
done
This one seems to work better, but I'm getting an IO exception "file name too long" error. What is that about, and how do I fix it?
I guess the other command in the documentation is dysfunctional.
I was trying to use this script to process my corpus with the Stanford CoreNLP but I keep getting the error
Could not find or load main class .Users.matthew.Workbench.Code.CoreNLP.Stanford-corenlp-full-2015-01-29.edu.stanford.nlp.pipeline.StanfordCoreNLP
This is the script
dir=/Users/matthew/Workbench
for f in $dir/Data/NYTimes/NYTimesCorpus_3/*/*/*/*.txt; do
[[ $f == *.xml ]] && continue # skip output files
java -mx600m -cp $dir/Code/CoreNLP/stanford-corenlp-full-2015-01-29/stanford-corenlp-VV.jar:stanford-corenlp-VV-models.jar:xom.jar:joda-time.jar:jollyday.jar:ejml-VV.jar -Xmx2g /Users/matthew/Workbench/Code/CoreNLP/stanford-corenlp-full-2015-01-29/edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit -file "$f" -outputDirectory $dir/Data/NYTimes/NYTimesCorpus_3/*/*/*/.
done
A very similar one worked for Stanford NER; that one looked like this:
dir=/Users/matthew/Workbench
for f in $dir/Data/NYTimes/NYTimesCorpus_3/*/*/*/*.txt; do
[[ $f == *_NER.txt ]] && continue # skip output files
g="${f%.txt}_NER.txt"
java -mx600m -cp $dir/Code/StanfordNER/stanford-ner-2015-01-30/stanford-ner-3.5.1.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier $dir/Code/StanfordNER/stanford-ner-2015-01-30/classifiers/english.all.3class.distsim.crf.ser.gz -textFile "$f" -outputFormat inlineXML > "$g"
done
I can't figure out why I keep getting that error; it seems I've specified all the paths correctly.
I know there's the -filelist option, whose parameter points to a file whose content lists all files to be processed (one per line),
but I don't know exactly how that would work in my situation, since my directory structure looks like this: $dir/Data/NYTimes/NYTimesCorpus_3/*/*/*/*.txt, within which there are many files to be processed.
Also, is it possible to dynamically specify -outputDirectory? The docs say you may specify an alternate output directory with that flag, but it seems like it would be set once and then stay static, which would be a nightmare scenario in my case.
I thought maybe I could just write some code to do this, but that also doesn't work; this is what I tried:
public static void main(String[] args) throws Exception
{
    BufferedReader br = new BufferedReader(new FileReader("/home/matthias/Workbench/SUTD/nytimes_corpus/NYTimesCorpus/2005/01/01/1638802_output.txt"));
    try
    {
        StringBuilder sb = new StringBuilder();
        String line = br.readLine();
        while (line != null)
        {
            sb.append(line);
            sb.append(System.lineSeparator());
            line = br.readLine();
        }
        String everything = sb.toString();
        //System.out.println(everything);
        Annotation doc = new Annotation(everything);
        StanfordCoreNLP pipeline;
        // creates a StanfordCoreNLP object, with POS tagging, lemmatization,
        // NER, parsing, and coreference resolution
        Properties props = new Properties();
        // configure pipeline
        props.put(
            "annotators",
            "tokenize, ssplit"
        );
        pipeline = new StanfordCoreNLP(props);
        pipeline.annotate(doc);
        System.out.println( doc );
    }
    finally
    {
        br.close();
    }
}
By far the best way to process a lot of files with Stanford CoreNLP is to arrange to load the system once - since loading all the various models takes 15 seconds or more, depending on your computer, before any actual document processing is done - and then to process a bunch of files with it. What you have in your update doesn't do that, because running CoreNLP is inside the for loop. A good solution is to use the for loop to make a file list, and then to run CoreNLP once on the file list. The file list is just a text file with one filename per line, so you can make it any way you want (using a script, editor macro, typing it in yourself), and you can and should check that its contents look correct before running CoreNLP. For your example, based on your update example, the following should work:
dir=/Users/matthew/Workbench
for f in $dir/Data/NYTimes/NYTimesCorpus_3/*/*/*/*.txt; do
    echo $f >> filelist.txt
done
# You can here check that filelist.txt has in it the files you want
java -cp "*" -Xmx2g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -filelist filelist.txt
# By default output files are written to the current directory, so you don't need to specify -outputDirectory .
Other notes on earlier tries:
-mx600m isn't a reasonable way to run the full CoreNLP pipeline (right through parsing and coref). The sum of all its models is just too large. -mx2g is fine.
The best way above doesn't fully extend to the NER case. Stanford NER doesn't take a -filelist option, and if you use -textFiles then the files are concatenated and become one output file, which you may well not want. At present, for NER, you may well need to run it inside the for loop, as in your script for that.
I haven't quite decoded how you're getting the error Could not find or load main class .Users.matthew.Workbench.Code.CoreNLP.Stanford-corenlp-full-2015-01-29.edu.stanford.nlp.pipeline.StanfordCoreNLP, but this is happening because you're putting a String (filename?) like that (perhaps with slashes rather than periods) where the java command expects a class name. In that place, there should only be edu.stanford.nlp.pipeline.StanfordCoreNLP as in your updated script or mine.
You can't have a dynamic outputDirectory in one call to CoreNLP. You could get the effect that I think you want reasonably efficiently by making one call to CoreNLP per directory, using two nested for loops. The outer for loop would iterate over directories, and the inner one would make a file list from all the files in that directory, which would then be processed in one call to CoreNLP and written to an appropriate output directory based on the input directory in the outer for loop. Someone with more time or bash-fu than me could try to write that....
You can certainly also write your own code to call CoreNLP, but then you're responsible for scanning input directories and writing to appropriate output files yourself. What you have looks basically okay, except the line System.out.println( doc ); won't do anything useful - it just prints out the text you began with. You need something like:
PrintWriter xmlOut = new PrintWriter("outputFileName.xml");
pipeline.xmlPrint(doc, xmlOut);
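For completeness, here is a sketch of what that self-written route could look like (my own code, not the answerer's): load the pipeline once, walk the input tree, and write one .xml next to each .txt. The root path mirrors the question and is a placeholder:
import java.io.IOException;
import java.io.PrintWriter;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class AnnotateTree {
    public static void main(String[] args) throws IOException {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos,lemma,ner,parse,dcoref");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props); // load the models once

        Files.walk(Paths.get("/Users/matthew/Workbench/Data/NYTimes/NYTimesCorpus_3"))
             .filter(p -> p.toString().endsWith(".txt"))
             .forEach(p -> {
                 try {
                     Annotation doc = new Annotation(
                             new String(Files.readAllBytes(p), StandardCharsets.UTF_8));
                     pipeline.annotate(doc);
                     // write the XML output next to the input file
                     try (PrintWriter xmlOut = new PrintWriter(
                             p.toString().replaceAll("\\.txt$", ".xml"), "UTF-8")) {
                         pipeline.xmlPrint(doc, xmlOut);
                     }
                 } catch (IOException e) {
                     throw new UncheckedIOException(e);
                 }
             });
    }
}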

How to read gz files in Spark using wholeTextFiles

I have a folder which contains many small .gz files (compressed csv text files). I need to read them in my Spark job, but the thing is I need to do some processing based on info which is in the file name. Therefore, I did not use:
JavaRDD<String> input = sc.textFile(...)
since to my understanding I do not have access to the file name this way. Instead, I used:
JavaPairRDD<String, String> files_and_content = sc.wholeTextFiles(...);
because this way I get a pair of file name and the content.
However, it seems that this way the input reader fails to read the text from the gz file and instead reads binary gibberish.
So I would like to know if I can set it to somehow read the text, or alternatively access the file name using sc.textFile(...).
You cannot read gzipped files with wholeTextFiles because it uses CombineFileInputFormat which cannot read gzipped files because they are not splittable (source proving it):
override def createRecordReader(
    split: InputSplit,
    context: TaskAttemptContext): RecordReader[String, String] = {

  new CombineFileRecordReader[String, String](
    split.asInstanceOf[CombineFileSplit],
    context,
    classOf[WholeTextFileRecordReader])
}
You may be able to use newAPIHadoopFile with wholefileinputformat (not built into hadoop but all over the internet) to get this to work correctly.
UPDATE 1: I don't think WholeFileInputFormat will work since it just gets the bytes of the file, meaning you may have to write your own class possibly extending WholeFileInputFormat to make sure it decompresses the bytes.
Another option would be to decompress the bytes yourself using GZipInputStream.
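A sketch of that do-it-yourself route (my own, not from the answer): sc.binaryFiles keeps the file name, and each file's bytes are decompressed with GZIPInputStream; the directory argument is a placeholder:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.stream.Collectors;
import java.util.zip.GZIPInputStream;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class GzWholeFiles {
    public static JavaPairRDD<String, String> namesAndText(JavaSparkContext sc, String dir) {
        // binaryFiles gives the same (fileName, content) pairing as wholeTextFiles,
        // but leaves the bytes untouched so we can gunzip them ourselves
        return sc.binaryFiles(dir)
                 .mapValues(content -> {
                     try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                             new GZIPInputStream(content.open()), StandardCharsets.UTF_8))) {
                         return reader.lines().collect(Collectors.joining("\n"));
                     }
                 });
    }
}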
UPDATE 2: If you have access to the directory name, as in the OP's comment below, you can get all the files like this:
Path path = new Path("");
FileSystem fileSystem = path.getFileSystem(new Configuration()); //just uses the default one
FileStatus [] fileStatuses = fileSystem.listStatus(path);
ArrayList<Path> paths = new ArrayList<>();
for (FileStatus fileStatus : fileStatuses) paths.add(fileStatus.getPath());
I faced the same issue while using Spark to connect to S3.
My file was a gzipped CSV with no extension.
JavaPairRDD<String, String> fileNameContentsRDD = javaSparkContext.wholeTextFiles(logFile);
This approach returned corrupted values.
I solved it by using the below code:
JavaPairRDD<String, String> fileNameContentsRDD = javaSparkContext.wholeTextFiles(logFile+".gz");
By adding .gz to the S3 URL, Spark automatically picked up the file and read it as a gz file. (This seems like a wrong approach, but it solved my problem.)
