boost::filtering_streambuf with gzip_decompressor(), how to access line by line from file - iostream

I wrote a Logparser Application and now I want to implement decompression of .gz files. I tried it with boost::iostreams and zlib which seems to work, but I don't know how to handle the input I get from compressed files.
Here's what I do:
input.open(p.source_at(i).c_str(), ios_base::in | ios_base::binary);
boost::iostreams::filtering_streambuf<boost::iostreams::input> in;
in.push(boost::iostreams::gzip_decompressor());
in.push(input);
boost::iostreams::copy(in, cout);
This code runs if my source file has the .gz extension. The last line correctly outputs the decompressed file stream to cout.
But how can I fetch the decompressed file line by line? When the file is not compressed, my program uses getline(input, transfer) to read lines from the input stream.
Now I want to read from the decompressed file the same way, but how can I get the next line from in?
The Boost documentation didn't help me much with this.
Thanks in advance!

OK, I figured it out. I just had to create a std::istream and pass it a pointer to the buffer:
std::istream incoming(&in);
getline(incoming, transfer);

Related

How can I re-compress the DLL files of a Xamarin application

I have a Xamarin application whose files are compressed using lz4. I could easily decompress the files using lz4.block.decompress, but I couldn't compress them again. Can anyone help me with that?
I used the following script to decompress the file, and I could read the content of the DLL file. But when I modified something and tried to recompress it to patch the APK file, I couldn't compress the file in the right format.
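import sys
import lz4.block  # third-party "lz4" package

# input_filepath and output_filepath are set elsewhere in the script
with open(input_filepath, "rb") as xalz_file:
    header = xalz_file.read(8)  # first 8 bytes; the first 4 are the magic
    if header[:4] != b"XALZ":
        sys.exit("The input file does not contain the expected magic bytes, aborting ...")
    payload = xalz_file.read()  # everything after the 8-byte header
    decompressed = lz4.block.decompress(payload)
with open(output_filepath, "wb") as output_file:
    output_file.write(decompressed)  # the with block closes the file
print("result written to file")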
Note that I couldn't decompress or compress the DLL file using lz4tools.

Read from a tar.gz file without saving the unpacked version

I have a tar.gz file saved on disk and I want to leave it packed there, but I need to open one file within the archive, read from it and save some information somewhere.
File structure:
base_folder
    file_i_need.txt
    other_folder
        other_file
code (it is not much - I tried 10mio different ways and this is what is left)
def self.open_file(file)
  uncompressed_file = Gem::Package::TarReader.new(Zlib::GzipReader.open(file))
  uncompressed_file.rewind
end
When I run it in a console I get
<Gem::Package::TarReader:0x007fbaac178090>
and I can run commands on the entries. I just haven't figured out how to open an entry and read from it without saving it unpacked to disk. I mainly need the string from the text file.
Any help appreciated. I might just be missing something...
TarReader is Enumerable and yields Entry objects.
That said, to retrieve the text content of a file by its name, one might do:
require 'rubygems/package'
require 'zlib'
uncompressed = Gem::Package::TarReader.new(Zlib::GzipReader.open(file))
# detect returns the first entry whose full name matches
text = uncompressed.detect do |f|
  f.full_name == 'base_folder/file_i_need.txt'
end.read
#⇒ Hello, I’m content of the text file, located inside gzipped tar
Hope it helps.

Read the file names or the number of files in tar.gz

I have a tar.gz file, which holds multiple csv files archived. I need to read the list of the file names or at least the number of files.
This is what I tried:
require 'zlib'
file = Zlib::GzipReader.open('test/data/file_name.tar.gz')
file.each_line do |line|
  p line
end
but this only prints each line in the csv files, not the file names. I also tried this:
require 'zlib'
Zlib::GzipReader.open('test/data/file_name.tar.gz') { |f|
  p f.read
}
which reads similarly, but character by character instead of line by line.
Any idea how I could get the list of file names or at least the number of files within the archive?
You need to use a tar reader on the uncompressed output.
".tar.gz" means that two processes were applied to generate the file. First a set of files were "tarred" to make a ".tar" file which contains a sequence of (file header block, uncompressed file data) units. Then that was gzipped as a single stream of bytes, to make the ".tar.gz". In reality, the .tar file was very likely never stored anywhere, but generated as a stream of bytes and gzipped on the fly to write out the .tar.gz file directly.
To get the contents, you reverse the process, ungzipping, and then feeding the result of that to a tar reader to interpret the file header blocks and extract the data. Again, you can ungzip and read the tarred file contents on the fly, with no need to store the intermediate .tar file.
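The reverse process described above can be sketched in Ruby with only the standard library. This is a minimal sketch, not a definitive implementation: the method name tar_gz_file_names is made up for illustration, and it assumes the archive is a regular gzipped tar that Gem::Package::TarReader can parse.

```ruby
require 'rubygems/package'
require 'zlib'

# Hypothetical helper: return the names of the regular files inside a .tar.gz.
# The archive is ungzipped and parsed on the fly; nothing is written to disk.
def tar_gz_file_names(path)
  Zlib::GzipReader.open(path) do |gz|
    names = []
    # TarReader interprets the tar header blocks in the decompressed stream
    Gem::Package::TarReader.new(gz).each do |entry|
      names << entry.full_name if entry.file? # skip directory entries
    end
    names
  end
end
```

Calling tar_gz_file_names('test/data/file_name.tar.gz') would give the list of names, and calling .size on the result gives the number of files.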

Convert a PDF to .txt gives me an empty .txt file

Hi, I'm trying to read a PDF in Ruby, and as a first step I want to convert it to a .txt file. path is the path to the PDF. The problem is that the resulting .txt file is empty. Someone told me it is a pdftotext problem, but I don't know how to fix it.
spec = path.sub(/\.pdf$/, '')
`pdftotext #{spec}.pdf`
file = File.new("#{spec}.txt", "w+")
text = []
file.readlines.each do |l|
  if l.length > 0
    text << l
    Rails.logger.info l
  end
end
file.close
What's wrong with my code? Thanks!
It's not possible to extract text from every PDF. Some PDF files use a font encoding that makes it impossible to extract text with simple tools such as pdftotext (and some PDF files are even completely immune to direct text extraction with any tool known to me -- in these cases you'll have to apply OCR first to have a chance to extract text...).
So if you test your code with the same "weird" PDF file all the time, it may well happen that you're getting frustrated over your code while in reality the fault lies with the PDF.
First make sure that command-line usage of pdftotext works well with a given PDF, then test (and further develop) your code with that PDF.
The problem is that you are opening the file in write ("w") mode, which truncates the file. You can see a table of file modes and what they mean at http://ruby-doc.org/core-1.9.3/IO.html.
Try something like this. It uses a pdftotext option to send the text to stdout, which avoids creating a temporary file, and it uses blocks for more idiomatic Ruby.
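require 'shellwords'
# "-" makes pdftotext write to stdout; escaping keeps paths with spaces intact
text = `pdftotext #{Shellwords.escape(path)} -`
# split on newlines; a bare split would break on every run of whitespace
text.split("\n").select { |line|
  line.length > 0
}.each { |line|
  Rails.logger.info(line)
}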
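The truncation this answer describes is easy to check in isolation. The sketch below is only an illustration: read_with_mode is a hypothetical helper, not part of the original code, and it simply opens a file with the given mode and returns whatever can be read from it.

```ruby
# Hypothetical helper: open +path+ with +mode+ and return what can be read.
# A mode that truncates ("w+") empties the file as it is opened, so there is
# nothing left to read.
def read_with_mode(path, mode)
  File.open(path, mode) { |f| f.read }
end
```

For a file that already has content, read_with_mode(path, "r") returns that content, while read_with_mode(path, "w+") returns an empty string and leaves the file empty on disk. That is what happens to the .txt file that pdftotext has just written in the question's code.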
You would need to open the txt file with write permission.
file = File.new("#{spec}.txt", "w")
You could consult How to create a file in Ruby
Update: your code is incomplete and looks buggy.
It is not clear what path is.
It looks like you are trying to read (file.readlines.each) the text file that you intend to write to.
Check the spelling of length; you have l.lenght.
You may want to paste the actual code.
Check this gist https://gist.github.com/4160587
As mentioned, your code is not working because you are reading and writing to the same file.
Example
Ruby code file_write.rb to do the file write operation
pdf_file = File.open("in.txt")
output_file = File.open("out.txt", "w") # file to which you want to write
# iterate over the input file and write its content to the output file
pdf_file.readlines.each do |l|
  output_file.puts(l)
end
output_file.close
pdf_file.close
Sample txt file in.txt
Some text in file
Another line of text
1. Line 1
2. Not really line 2
Once you run file_write.rb, you should see a new file called out.txt with the same content as in.txt. You can change the content of the input file if you want. In your case you would use a PDF reader to get the content and write it to the text file, so basically only the first line of the code will change.

A .gz file written by gzwrite (zlib) is read incorrectly in MapReduce

The .gz file was written by a C program that called gzputs and gzwrite.
I listed the compressed file's contents with gzip -l and found that the reported uncompressed size is incorrect. The value seems to equal the number of bytes that the most recent gzputs or gzwrite call wrote into the .gz file, which makes the compression ratio negative.
An error occurred when these .gz files were used as input to Map/Reduce: it seems that only part of each .gz file can be read in the map phase. (The size of that part seems to equal the uncompressed value above.)
Can someone tell me what I should change in the C program or in Map/Reduce?
Problem solved. The read error in Map/Reduce seems to be a bug in GZIPInputStream.
I found a GZIPInputStream-like class on the Internet that reads the .gz file correctly. Then I extended and customized TextInputFormat and LineRecordReader in Hadoop. It works now.
