I want to read the contents of a single file from a .tar.gz tarball. The file is in the root of the tarball. Is there an easy way to do this? I was thinking of something like:
data = Tarball.open('myfile.tar.gz').entry('/myentry').content
Is there such a thing?
The problem is that a .tar.gz is not a structured file; it's just a .tar file that has been run through a compression algorithm that knows nothing about tar. So the only way to get data back out of it is to decompress the stream, at least up to the entry you want, and scan the tar contents as you go.
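That said, Ruby's standard library can do the decompress-and-scan in a single pass for you. Here's a minimal sketch using Zlib::GzipReader and Gem::Package::TarReader ('myfile.tar.gz' and 'myentry' are the placeholders from the question):

require 'zlib'
require 'rubygems/package'

data = nil
# Gzip decompression is sequential, so the stream is decompressed
# on the fly while TarReader walks the entry headers.
Zlib::GzipReader.open('myfile.tar.gz') do |gz|
  Gem::Package::TarReader.new(gz) do |tar|
    tar.each do |entry|
      if entry.full_name == 'myentry'
        data = entry.read
        break
      end
    end
  end
end

Note that the stream is still decompressed sequentially up to the entry you want; you just never have to write anything to disk.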
As a less-space-efficient but more time-efficient alternative, you may want to consider exploding the tar file, recompressing each file individually, and then tarring them back up into an (uncompressed) archive. Then extracting individual files is easy using the archive-tar gem, and you can just add a decompression step to recover the originals.
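If it helps, here's a rough Ruby sketch of that explode-and-recompress step, shelling out to tar and gzip (all file and directory names are placeholders, and error handling is minimal):

require 'fileutils'

FileUtils.mkdir_p('exploded')
# 1. Explode the compressed tarball.
system('tar', '-xzf', 'myfile.tar.gz', '-C', 'exploded') or abort 'extract failed'
# 2. Recompress each file individually (gzip replaces each file with file.gz).
Dir.glob('exploded/**/*').each do |path|
  system('gzip', path) if File.file?(path)
end
# 3. Tar the .gz files back up into an uncompressed archive.
system('tar', '-cf', 'myfile-repacked.tar', '-C', 'exploded', '.')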
This is a very old question without an accepted answer, so I don't know whether it's still relevant to the asker; still, someone might land here when running into a similar issue. If you know a typical pattern in the name of the file you are looking for, you can use
tar xf yourarchive.tar.gz --wildcards 'yourpattern'
which extracts only the matching file(s), which you can then use however you prefer.
For example, if each tarball is expected to contain an "app.conf" file, use that name as 'yourpattern' (or '*/app.conf' if it sits inside a folder):
tar xf yourarchive.tar.gz --wildcards 'app.conf'
I hope this helps.
Can anyone please advise which combination of NiFi processors to use, and how to loop through them, for a zip file that contains child zip files, so that the folder structure is maintained after unzipping?
Basically, below is what I have:
ZIPPED File Structure:
--CLASS.zip
    --CLASS1.zip
        --SUBJECT1.zip
        --SUBJECT2.zip
        --SUBJECT3.zip
        --SUBJECT4.zip
    --CLASS2.zip
        --SUBJECT1.zip
        --SUBJECT2.zip
        --SUBJECT3.zip
        --SUBJECT4.zip
    --CLASS3.zip
        --SUBJECT1.zip
        --SUBJECT2.zip
        --SUBJECT3.zip
        --SUBJECT4.zip
The nesting might go even deeper; this is just for illustration. I want to maintain the same directory structure after unzipping, as shown below:
UNZIPPED Directory structure:
--CLASS
    --CLASS1
        --SUBJECT1
        --SUBJECT2
        --SUBJECT3
        --SUBJECT4
    --CLASS2
        --SUBJECT1
        --SUBJECT2
        --SUBJECT3
        --SUBJECT4
    --CLASS3
        --SUBJECT1
        --SUBJECT2
        --SUBJECT3
        --SUBJECT4
Please suggest the best possible way to achieve this.
Hi daggett,
[NiFi flow image]
I applied the NiFi flow you suggested, but it is still not unzipping into the structure described in my previous message. Please advise.
I have a project containing lots of different images. Once in a while we add more images to it, but first we need to check whether an image already exists (because we may have added it previously).
Until now we have been doing this manually, looking for the image in the folders, but as the project has grown, that's become quite time-consuming.
So, I would like to create a script that, given an image, looks in a directory and checks whether it already exists.
Do you know of any command-line tool, or anything else, I could use to build such a script?
There is the fdupes utility, which does byte-by-byte comparison. It has a -d or --delete option that prompts you to choose which files to keep when it finds duplicates. If you don't care about the filename, you can tell it to keep the first of each set without prompting (the directory is a placeholder; --recurse descends into subfolders):
fdupes --recurse --delete --noprompt /path/to/images
If you want to delete images that look the same but are slightly different, that's an image-recognition problem, which I suspect has no equally straightforward solution.
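If you'd rather script the exact-match check yourself, a small Ruby sketch along these lines would work (file and directory names are placeholders; it compares SHA-256 digests, so it only catches byte-identical files):

require 'digest'

# True if any file under dir has the same SHA-256 digest as candidate.
def already_present?(candidate, dir)
  target = Digest::SHA256.file(candidate).hexdigest
  Dir.glob(File.join(dir, '**', '*')).any? do |path|
    File.file?(path) && Digest::SHA256.file(path).hexdigest == target
  end
end

puts already_present?('new_image.png', 'images/') ? 'duplicate' : 'new'

For a large collection you'd want to cache the digests somewhere rather than rehashing the whole directory on every check.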
I have a huge .tgz file which is structured inside like this:
./RandomFoldername1/file1
./RandomFoldername1/file2
./RandomFoldername2/file1
./RandomFoldername2/file2
etc
What I want is to extract each individual file to standard output so that I can pipe it to another command. While doing this, I also need the RandomFoldername and the file name, so that the second command can handle them properly.
So far, the options I have are:
Extract the whole tarball and deal with the resulting directory tree. This is not an option, since the extracted contents don't fit on the hard drive.
Loop over the file names, pattern-matching and extracting one file at a time. This solves the problem, but it is far too slow, because the whole tarball is swept through for every single file.
While searching for a way to solve this, I've started to fear there is no better alternative.
Using the tar tool itself, I don't believe you have any other options.
Using a tar library for a language of your choice should let you do what you want, though: it can iterate over the entries in the tarball one by one and extract/pipe/etc. each file as necessary, as in the sketch below.
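For example, with Ruby's built-in tar support, a sketch might look like this (second_command is a stand-in for whatever you pipe into; Zlib decompresses the stream on the fly, so nothing is extracted to disk):

require 'zlib'
require 'rubygems/package'

Zlib::GzipReader.open('yourarchive.tgz') do |gz|
  Gem::Package::TarReader.new(gz) do |tar|
    tar.each do |entry|
      next unless entry.file?
      folder = File.dirname(entry.full_name)   # e.g. "./RandomFoldername1"
      name   = File.basename(entry.full_name)  # e.g. "file1"
      # Pipe this entry's bytes to the second command, passing the names as arguments.
      # (entry.read loads the whole entry; for very large files, read in chunks.)
      IO.popen(['second_command', folder, name], 'w') do |pipe|
        pipe.write(entry.read)
      end
    end
  end
end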
Is there any way to run DistCp, but with an option to rename on file name collisions? Maybe it's easiest to explain with an example.
Let's say I'm copying hdfs:///foo to hdfs:///bar, and foo contains these files:
hdfs:///foo/a
hdfs:///foo/b
hdfs:///foo/c
and bar contains these:
hdfs:///bar/a
hdfs:///bar/b
Then after the copy, I'd like bar to contain something like:
hdfs:///bar/a
hdfs:///bar/a-copy1
hdfs:///bar/b
hdfs:///bar/b-copy1
hdfs:///bar/c
If there is no such option, what might be the most reliable/efficient way to do this? My own home-grown version of distcp could certainly get it done, but that seems like it could be a lot of work and pretty error-prone. Basically, I don't care at all about the file names, just their directory, and I want to periodically copy large amounts of data into a "consolidation" directory.
DistCp does not have that option. If you are using the Java API for it, though, this is easy to handle: check whether the destination path already exists, and change the name in case it does. You can check that with a FileSystem object, using its exists(Path p) method. The sketch below illustrates the renaming logic.
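To make the "-copyN" scheme concrete, here's a small sketch of that collision-renaming logic (written in Ruby purely for illustration; it's hypothetical helper code, not part of DistCp, and in the Java API the exists check would be a call to FileSystem#exists(Path)):

# Returns a destination path that does not collide, using the "-copyN" scheme.
def non_colliding(dest, exists)
  return dest unless exists.call(dest)
  n = 1
  n += 1 while exists.call("#{dest}-copy#{n}")
  "#{dest}-copy#{n}"
end

# Example with an in-memory set standing in for the filesystem check:
taken  = ['hdfs:///bar/a', 'hdfs:///bar/b']
exists = ->(path) { taken.include?(path) }
non_colliding('hdfs:///bar/a', exists)  # => "hdfs:///bar/a-copy1"
non_colliding('hdfs:///bar/c', exists)  # => "hdfs:///bar/c"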
I'm using the Linux split command to split huge XML files into node-sized ones. The problem is that now I have a directory with hundreds of thousands of files.
I want a way to get a file from the directory (to pass to another process for import into our database) without needing to list everything in it. Is this how Dir.foreach already works? Any other ideas?
You can use Dir.glob to find the files you need. Basically, you pass it a pattern like Dir.glob('dir/*.rb') and get back the filenames matching that pattern. I assume it's implemented reasonably well, but that will depend on your platform and implementation.
As for Dir.foreach, it should be efficient too; the concern would be if it had to process the entire directory on every pass around the loop. That would be an awful implementation, and it is not the case. See the sketch below for the practical difference.
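To illustrate (directory and pattern names are placeholders): Dir.glob builds the complete match list up front, while Dir.foreach streams entries one at a time, so you can stop at the first hit:

# Eager: builds an array of every match before returning the first one.
first = Dir.glob('huge_dir/*.xml').first

# Streaming: yields directory entries one by one; stops at the first match.
first = Dir.foreach('huge_dir').find { |name| name.end_with?('.xml') }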