I have a huge .tgz file whose contents are structured like this:
./RandomFoldername1/file1
./RandomFoldername1/file2
./RandomFoldername2/file1
./RandomFoldername2/file2
etc
What I want is to extract each individual file to standard output so that I can pipe it afterwards to another command. While doing this, I also need the RandomFoldername and the file name so that I can deal with them properly from within the second command.
So far, the options I have are:
to either extract the entire tarball and deal with the resulting directory structure, which is not an option since the extracted contents don't fit on the hard drive,
or to make a loop that pattern-matches each file and extracts one file at a time. Although this solves the problem, it is too slow, because the whole tarball is swept each time for only one file.
While searching for a way to solve this, I've started to fear that there is no better alternative.
Using the tar tool itself, I don't believe you have any other options.
A tar library for a language of your choice should allow you to do what you want, though, since it lets you iterate over the entries in the tarball and extract/pipe each file one by one as necessary.
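For example, here is a minimal sketch of that approach using Apache Commons Compress (the archive name and the final piping step are placeholders); the tarball is swept only once, and each entry's folder and file name are available alongside its data:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;

public class StreamTgz {
    public static void main(String[] args) throws Exception {
        // "huge.tgz" is a placeholder for your archive.
        try (TarArchiveInputStream tar = new TarArchiveInputStream(
                new GzipCompressorInputStream(
                    new BufferedInputStream(new FileInputStream("huge.tgz"))))) {
            TarArchiveEntry entry;
            while ((entry = tar.getNextTarEntry()) != null) {
                if (!entry.isFile()) continue;
                // entry.getName() looks like "./RandomFoldername1/file1", so the
                // folder name and file name can be passed to the second command.
                System.err.println("processing " + entry.getName());
                // The stream is positioned at this entry's data; instead of
                // writing to System.out you could pipe it into the stdin of
                // another process (e.g. via ProcessBuilder).
                byte[] buf = new byte[8192];
                int n;
                while ((n = tar.read(buf)) != -1) {
                    System.out.write(buf, 0, n);
                }
                System.out.flush();
            }
        }
    }
}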
Related
I want to read the contents of a single file from a .tar.gz tarball. The file is in the root of the tarball. Is there some easy way to do this? I was thinking about something like data = Tarball.open('myfile.tar.gz').entry('/myentry').content. Is there such a thing?
The problem is that a .tar.gz is not a structured file; it's just a .tar file that has been run through a compression algorithm that knows nothing about tar. So the only way to get data back out of it is to uncompress the whole thing first.
As a less-space-efficient but more time-efficient alternative, you may want to consider exploding the tar file, recompressing each file individually, and then tarring them back up into an (uncompressed) archive. Then extracting individual files is easy using the archive-tar gem, and you can just add a decompression step to recover the originals.
This is a very old question without an accepted answer, so I wonder whether it is still relevant; however, someone might come across it when running into a similar issue. If you know a typical pattern in the name of the file you are looking for, you might be able to use
tar xf yourarchive.tar.gz --wildcards 'yourpattern'
which would extract only the matching files, and then you can use them as you prefer.
Let's say each tarball is expected to contain an "app.conf" file; then you could use a pattern such as '*/app.conf' as 'yourpattern'.
I hope this helps?
Here is the script to optimize jpg images: https://github.com/kormoc/imgopt/blob/master/imgopt
There is a CMS with image files (not mine).
I assume there is a complicated structure of subdirectories and the script just recursively finds all image files in a given folder.
The question is how to mark already-processed files so that on the next run the script won't touch them and will just skip them.
I don't know when the guys will add new files to it and want them processed. Also, I think renaming is not a good choice either.
I was thinking about a hash table or associative array that would be filled from a text file during startup. But is it OK to have a 100K-item array in bash? It seems complicated for a script.
Any other ideas about optimization are also welcome.
I think the easiest thing to do is just output a file with a similar name per processed image file.
For example image1.jpg after being processed would have an empty file with a similar name e.g. .image1.jpg.processed.
Then, when your script runs, it just checks whether, for the current image NAME.EXT, a file .NAME.EXT.processed exists. If the file doesn't exist, then you know the image needs to be processed. There are no memory issues and no hash table needed, granted you will have 100K extra empty files.
Is there any way to run DistCp, but with an option to rename on file name collisions? Maybe it's easiest to explain with an example.
Let's say I'm copying hdfs:///foo to hdfs:///bar, and foo contains these files:
hdfs:///foo/a
hdfs:///foo/b
hdfs:///foo/c
and bar contains these:
hdfs:///bar/a
hdfs:///bar/b
Then after the copy, I'd like bar to contain something like:
hdfs:///bar/a
hdfs:///bar/a-copy1
hdfs:///bar/b
hdfs:///bar/b-copy1
hdfs:///bar/c
If there is no such option, what might be the most reliable/efficient way to do this? Writing my own home-grown version of distcp could certainly get it done, but that seems like a lot of work and pretty error-prone. Basically, I don't care at all about the file names, just their directory, and I want to periodically copy large amounts of data into a "consolidation" directory.
DistCp does not have that option. If you are using the Java API, it can be handled easily by checking whether the destination path exists and changing the path in case it already does. You can check that with a FileSystem object using the method exists(Path p).
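If you go the Java API route, here is a rough sketch of that check (the -copyN suffix and the plain FileUtil.copy are just for illustration; a real job might instead feed the collision-free names to DistCp):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class CollisionSafeCopy {
    // Pick a free destination name: "a", then "a-copy1", "a-copy2", ...
    static Path freeTarget(FileSystem fs, Path dstDir, String name) throws Exception {
        Path candidate = new Path(dstDir, name);
        int i = 1;
        while (fs.exists(candidate)) {
            candidate = new Path(dstDir, name + "-copy" + i++);
        }
        return candidate;
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path src = new Path("hdfs:///foo");
        Path dstDir = new Path("hdfs:///bar");
        FileSystem srcFs = src.getFileSystem(conf);
        FileSystem dstFs = dstDir.getFileSystem(conf);
        for (FileStatus st : srcFs.listStatus(src)) {
            Path target = freeTarget(dstFs, dstDir, st.getPath().getName());
            // Simple single-process copy for illustration only.
            FileUtil.copy(srcFs, st.getPath(), dstFs, target,
                          false /* deleteSource */, conf);
        }
    }
}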
I have a Hadoop application that, depending on a parameter, only needs certain (few!) input files from the input directory. My question is: where is the best place (read: as early as possible) to skip those files? Right now I have customized a RecordReader to take care of that, but I was wondering whether I could skip those files sooner. In my current implementation Hadoop still has a huge overhead due to the irrelevant files.
Maybe I should add that it is very easy to tell whether I need a certain input file: if the file name starts with the parameter, it is needed. Structuring my input directory hierarchically might be a solution, but not a very likely one for my project, since every file would end up alone in its own directory.
I'd propose filtering out the input files by applying the appropriate pattern to the input paths, as mentioned here: https://stackoverflow.com/a/13454344/1050422
Note that this solution doesn't consider subdirectories. Alter it to recursively visit all subdirectories within the base path.
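As a rough illustration of that idea (directory and parameter names here are made up): FileInputFormat expands glob patterns in its input paths, so a pattern built from the parameter keeps irrelevant files from ever producing splits:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class FilteredInputJob {
    public static void main(String[] args) throws Exception {
        String param = args[0]; // e.g. the file-name prefix from the question
        Job job = Job.getInstance(new Configuration(), "filtered-input");
        // Only files whose names start with the parameter are listed as input,
        // so the skipping happens before any splits or RecordReaders exist.
        FileInputFormat.addInputPath(job, new Path("hdfs:///data/input/" + param + "*"));
        // ... configure mapper/reducer/output as usual, then submit the job.
    }
}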
I've had success using the setInputPaths() method on TextInputFormat to specify a single String containing comma-separated file names.
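A small sketch of that variant, with hypothetical file names:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class ExplicitInputs {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "explicit-inputs");
        job.setInputFormatClass(TextInputFormat.class);
        // One String with comma-separated paths, listing only the needed files.
        TextInputFormat.setInputPaths(job,
                "hdfs:///data/input/paramA_part1,hdfs:///data/input/paramA_part2");
        // ... configure mapper/reducer/output as usual, then submit the job.
    }
}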
I'm using the Linux split command to split huge XML files into node-sized ones. The problem is that I now have a directory with hundreds of thousands of files.
I want a way to get a file from the directory (to pass to another process for import into our database) without needing to list everything in it. Is this how Dir.foreach already works? Any other ideas?
You can use Dir.glob to find the files you need. More details here, but basically, you pass it a pattern like Dir.glob 'dir/*.rb' and get back filenames matching that pattern. I assume it's done in a reasonably good way, but it will depend on your platform and implementation.
As for Dir.foreach, this should be efficient too; the concern would be whether it has to process the entire directory for every pass around the loop. But that would be an awful implementation, and it is not the case.