I am trying to generate a file in ruby that has a specific size. The content doesn't matter.
Here is what I got so far (and it works!):
File.open("done/#{NAME}.txt", 'w') do |f|
contents = "x" * (1024*1024)
SIZE.to_i.times { f.write(contents) }
end
The problem is: once I zip or rar this file, the created archive is only a few KB in size. I guess that's because the data in the file is so repetitive that it compresses extremely well.
How do I create data that behaves more like a normal file (for example a movie file)? To be specific: how do I create a file with random data that keeps its size when archived?
You cannot guarantee an exact file size when compressing. However, as you suggest in the question, completely random data does not compress.
You can generate a random String using most random number generators. Even simple ones are capable of producing hard-to-compress data, but you would have to write your own string-creation code. Luckily for you, Ruby comes with a built-in library that already has a convenient byte-generating method, and you can use it in a variation of your code:
require 'securerandom'

one_megabyte = 2 ** 20 # or 1024 * 1024, if you prefer

# Note: use 'wb' mode to prevent problems with character encoding
File.open("done/#{NAME}.txt", 'wb') do |f|
  SIZE.to_i.times { f.write(SecureRandom.random_bytes(one_megabyte)) }
end
This file is not going to compress much, if at all. Many compressors will detect that and just store the file as-is (making a .zip or .rar file slightly larger than the original).
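As a quick sanity check (a hedged sketch using Ruby's standard Zlib rather than zip or rar), you can see that random bytes come out of a compressor slightly larger than they went in:

require 'securerandom'
require 'zlib'

data = SecureRandom.random_bytes(1024 * 1024)
compressed = Zlib::Deflate.deflate(data)

puts data.bytesize       # => 1048576
puts compressed.bytesize # => slightly *larger*, due to format overhead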
For a given string size N and compression method c (e.g., from the rubyzip, libarchive or seven_zip_ruby gems), you want to find a string str such that:
str.size == c(str).size == N
I'm doubtful that you can be assured of finding such a string, but here's a way that should come close:
Step 0: Select a number m such that m > N.
Step 1: Generate a random string s with m characters.
Step 2: Compute str = c(s). If str.size < N, increase m and repeat Step 1; else go to Step 3.
Step 3: Return str[0,N].
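For illustration, here is a minimal Ruby sketch of those steps, using Zlib as a stand-in for the compression method c (the gems mentioned above each wrap their own compressors). The trick is that compressed random data is itself essentially incompressible, so its first N bytes serve well:

require 'securerandom'
require 'zlib'

def incompressible_string(n)
  m = n + 1024                            # Step 0: pick m > N
  loop do
    s   = SecureRandom.random_bytes(m)    # Step 1: random string of m bytes
    str = Zlib::Deflate.deflate(s)        # Step 2: compress it
    return str[0, n] if str.size >= n     # Step 3: truncate to exactly N
    m += 1024                             # not enough output yet; grow m
  end
end

puts incompressible_string(2 ** 20).size  # => 1048576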
I am building a file-based index for the sorted haveibeenpwned passwords text file and it got me wondering what's the fastest way to do this?
I figured a good way to build a quickly grep-able index would be to split the sorted file into 256 files named with the first two hex digits (e.g. FF.txt, FE.txt, etc.). I found ripgrep rg to be about 5 times faster than grep on my computer. So I tried something like this:
for i in {255..0}
do
    start=$(date +%s)
    hex="$(printf '%02x' "$i" | tr '[:lower:]' '[:upper:]')"
    rg "^$hex" pwned-passwords-ntlm-ordered-by-hash-v4.txt > "ntlm/$hex-ntlm.txt"
    echo "0x$hex completed in $(($(date +%s) - $start)) seconds"
done
This is the fastest solution I could come up with. ripgrep is able to create each file in 25 seconds, so I'm looking at about 100 minutes to create this index. When I split the job in half and run the halves in parallel, each pair of files is created in 80 seconds. So it seems best to just let ripgrep work its magic and run the jobs in series.
Obviously, I won't be indexing this list too often, but it's just fun to think about. Any thoughts on a faster way (aside from using a database) to index this file?
ripgrep, like any other tool that's able to work with unsorted input files at all, is the wrong tool for this job. When you're trying to grep sorted inputs, you want something that can bisect your input file to find a position in logarithmic time. For big enough inputs, even a slow O(log n) implementation will be faster than a highly optimized O(n) one.
pts-line-bisect is one such tool, though of course you're also welcome to write your own. You'll need to write it in a language with full access to the seek() syscall, which is not exposed in bash.
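For illustration, here's a hedged sketch of that bisection in Ruby, which does expose seek(). It probes O(log n) byte offsets, snapping each probe forward to the next complete line; edge cases such as the very first line of the file are glossed over:

def sorted_file_contains?(path, prefix)
  File.open(path, 'rb') do |f|
    lo, hi = 0, File.size(path)
    while lo < hi
      mid = (lo + hi) / 2
      f.seek(mid)
      f.gets                      # discard the partial line we landed in
      line = f.gets
      return false if line.nil?   # probed past the last full line
      return true if line.start_with?(prefix)
      if line < prefix
        lo = mid + 1
      else
        hi = mid
      end
    end
    false
  end
end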
You are reading through the file 256 times, doing a full file scan every time. Consider an approach that reads the file once, writing each line into an open file descriptor. I'm thinking Python would be an easy choice of implementation (if that's your thing). You could optimize by keeping the output file open until you hit a new hex code at the beginning of a line. If you want to be even more clever, there is no need to go through the sorted file line by line: based on Charles Duffy's hint, you could create a heuristic for sampling the file (using seek()) to jump to the next hex value. Once the program has found the byte offset of the next hex value, the whole block of bytes can be written to the new file. However, since this is tagged 'bash', let's keep the solution set in that domain (a single-pass sketch in another language follows the loop below):
while read -r line
do
    hex=${line:0:2}
    echo "$line" >> "ntlm/$hex-ntlm.txt"
done < pwned-passwords-ntlm-ordered-by-hash-v4.txt
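And, stepping outside bash for a moment, a hedged Ruby sketch of the keep-one-handle-open idea described above (it assumes the input really is sorted, so each two-character prefix occupies one contiguous run of lines):

current_prefix = nil
out = nil
File.foreach('pwned-passwords-ntlm-ordered-by-hash-v4.txt') do |line|
  prefix = line[0, 2]
  if prefix != current_prefix     # prefix changed: rotate output handles
    out&.close
    out = File.open("ntlm/#{prefix}-ntlm.txt", 'w')
    current_prefix = prefix
  end
  out.write(line)
end
out&.close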
I wrote a Python 3 script that performs fast binary-search lookups directly against the hash file, without having to create an index. It doesn't directly address your question (indexing) but probably solves the underlying problem that you wanted to solve with an index: quickly looking up individual hashes. This script checks hundreds of passwords in seconds.
import argparse
import hashlib

parser = argparse.ArgumentParser(description='Searches passwords in https://haveibeenpwned.com/Passwords database.')
parser.add_argument('passwords', metavar='TEST', type=str, help='text file with passwords to test, one per line, utf-8')
parser.add_argument('database', metavar='DATABASE', type=str, help='the downloaded text file with sha-1:count')
args = parser.parse_args()

def search(f: object, pattern: str) -> str:
    def search(left: int, right: int) -> str:
        if left >= right:
            return None
        middle = (left + right) // 2
        if middle == 0:
            f.seek(0, 0)
            test = f.readline()
        else:
            # Seek just before the midpoint, discard the partial line,
            # then read the first complete line after it.
            f.seek(middle - 1, 0)
            _ = f.readline()
            test = f.readline()
        if test.upper().startswith(pattern):
            return test
        elif left == middle:
            return None
        elif pattern < test:
            return search(left, middle)
        else:
            return search(middle, right)
    # Bisect over byte offsets, starting with the whole file.
    f.seek(0, 2)
    return search(0, f.tell())

fsource = open(args.passwords)
fdatabase = open(args.database)
source_lines = fsource.readlines()
for l in source_lines:
    line = l.strip()
    # The database stores uppercase SHA-1 hex digests.
    hash_object = hashlib.sha1(line.encode("utf-8"))
    pattern = hash_object.hexdigest().upper()
    print("%s:%s" % (line, str(search(fdatabase, pattern)).strip()))
fsource.close()
fdatabase.close()
I am working with large matrices of data (Nrow x Ncol) that are too large to be stored in memory. Instead, it is standard in my field of work to save the data into a binary file. Due to the nature of the work, I only need to access 1 column of the matrix at a time. I also need to be able to modify a column and then save the updated column back into the binary file. So far I have managed to figure out how to save a matrix as a binary file and how to read 1 'column' of the matrix from the binary file into memory. However, after I edit the contents of a column I cannot figure out how to save that column back into the binary file.
As an example, suppose the data file is a 32-bit identity matrix that has been saved to disk.
Nrow = 500
Ncol = 325
data = eye(Float32,Nrow,Ncol)
stream_data = open("data","w")
write(stream_data,data[:])
close(stream_data)
Reading the entire file from disk and then reshaping back into the matrix is straightforward:
stream_data = open("data","r")
data_matrix = read(stream_data,Float32,Nrow*Ncol)
data_matrix = reshape(data_matrix,Nrow,Ncol)
close(stream_data)
As I said before, the data-matrices I am working with are too large to read into memory and as a result the code written above would normally not be possible to execute. Instead, I need to work with 1 column at a time. The following is a solution to read 1 column (e.g. the 7th column) of the matrix into memory:
icol = 7
stream_data = open("data","r")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
data_col = read(stream_data,Float32,Nrow)
close(stream_data)
Note that the coefficient '4' in the 'position_data' variable is because I am working with Float32, which occupies 4 bytes per element (so for icol = 7 the offset is 4*500*6 = 12000 bytes, the start of the 7th column). Also, I don't fully understand what the seek command is doing here, but it seems to be giving me the correct output based on the following tests:
data == data_matrix # true
data[:,7] == data_col # true
For the sake of this problem, lets say I have determined that the column I loaded (i.e. the 7th column) needs to be replaced with zeros:
data_col = zeros(Float32,size(data_col))
The problem now, is to figure out how to save this column back into the binary file without affecting any of the other data. Naturally I intend to use 'write' to perform this task. However, I am not entirely sure how to proceed. I know I need to start by opening up a stream to the data; however I am not sure what 'mode' I need to use: "w", "w+", "a", or "a+"? Here is a failed attempt using "w":
icol = 7
stream_data = open("data","w")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
write(stream_data,data_col)
close(stream_data)
The original binary file (before my failed attempt to edit the binary file) occupied 650000 bytes on disk. This is consistent with the fact that the matrix is size 500x325 and Float32 numbers occupy 4 bytes (i.e. 4*500*325 = 650000). However, after my attempt to edit the binary file I have observed that the binary file now occupies only 14000 bytes of space. Some quick mental math shows that 14000 bytes corresponds to 7 columns of data (4*500*7 = 14000). A quick check confirms that the binary file has replaced all of the original data with a new matrix with size 500x7, and whose elements are all zeros.
stream_data = open("data","r")
data_new_matrix = read(stream_data,Float32,Nrow*7)
data_new_matrix = reshape(data_new_matrix,Nrow,7)
sum(abs(data_new_matrix)) # 0.0f0
What do I need to do/change in order to modify only the 7th 'column' in the binary file?
Instead of
icol = 7
stream_data = open("data","w")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
write(stream_data,data_col)
close(stream_data)
in the OP, write
icol = 7
stream_data = open("data","r+")
position_data = 4*Nrow*(icol-1)
seek(stream_data,position_data)
write(stream_data,data_col)
close(stream_data)
i.e. replace "w" with "r+" and everything works. The difference is that "w" truncates the existing file to zero length as soon as it is opened, while "r+" opens it for both reading and writing without touching its contents; seek then just moves the stream position to the given byte offset, and write overwrites exactly Nrow*4 bytes in place. (That also explains the 14000-byte file of zeros you observed: after truncation, seeking to offset 12000 and writing left a zero-filled gap behind the new column.)
The reference for open is http://docs.julialang.org/en/release-0.4/stdlib/io-network/#Base.open and it explains the various modes. Preferably, open shouldn't be used with the original string mode parameter at all; it is somewhat confusing, and definitely slower than the explicit boolean-argument form.
You can use SharedArrays for the need you describe:
data = SharedArray("/some/absolute/path/to/a/file", Float32, (Nrow, Ncol))
# do something with data
data[:, 1] = data[:, 1] .+ 1
exit()
# restart julia
data = SharedArray("/some/absolute/path/to/a/file", Float32, (Nrow, Ncol))
@show data[1, 1]
# prints 2.0f0
Now, be mindful that you're supposed to handle synchronisation to read/write from/to this file (if you have async workers) and that you're not supposed to change the size of the array (unless you know what you're doing).
I am trying to generate thumbnails with Ruby on a Linux machine.
The process includes determining which of the 5 thumbnails already generated is the most meaningful (by meaningful, here, I mean the one with the largest file size, since a bigger size means more detail).
Afterwards I want to rename the file having the biggest size to a generic name in order to use it later. The code doesn't seem to be working for me, and I can't understand the reason; are there any suggestions to improve it?
Thank you in advance.
Here is my code:
For your possible needs, the variable thumb_dir contains the path of the directory we are reading the thumbnails from.
max = File.size("#{thumb_dir}/thumb01.jpg").to_f
name = "thumb01.jpg"
for i in 2..5
  if max < File.size("#{thumb_dir}/thumb0#{i}.jpg").to_f
    max = File.size("#{thumb_dir}/thumb0#{i}.jpg").to_f
    name = "thumb0#{i}.jpg"
  end
end
File.rename("#{thumb_dir}/#{name}", "thumbnail.jpg")
i = (1..5).map { |i| File.size("#{thumb_dir}/thumb0#{i}.jpg").to_f }.each_with_index.max[1]
File.rename("#{thumb_dir}/thumb0#{i + 1}.jpg", "thumbnail.jpg")
What does it do?
(1..5).map { |i| File.size("#{thumb_dir}/thumb0#{i}.jpg").to_f }
We get an array of file sizes for thumb01.jpg up to thumb05.jpg.
array.each_with_index.max[1]
Used to get the index of the greatest value of the array.
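For example (the values here are made up purely to show the mechanics): each_with_index yields [value, index] pairs, max compares those pairs by value first, and [1] extracts the index:

[3.0, 9.0, 5.0].each_with_index.max    # => [9.0, 1]
[3.0, 9.0, 5.0].each_with_index.max[1] # => 1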
File.rename("#{thumb_dir}/thumb0#{i + 1}.jpg", "thumbnail.jpg")
Now that we know that i is the index of the greatest value in the array, thumb0#{i + 1}.jpg is the file with the greatest size, so that's the one we want to rename.
Remember that in Ruby there's a number of useful methods in Enumerable that make this sort of thing pretty straightforward:
# Expand to a list of possible thumbnail paths
thumbnails = (1..5).map { |n| '%s/thumb%02d.jpg' % [ thumb_dir, n ] }

# Find the biggest thumbnail by...
biggest_thumbnail = thumbnails.select do |path|
  # ...only dealing with those files that exist...
  File.exist?(path)
end.max_by do |path|
  # ...and looking for the one with the maximum size.
  File.size(path)
end
That should return the largest file if one exists. If not you get nil.
You can use that to rename:
if (biggest_thumbnail)
  File.rename(biggest_thumbnail, 'thumbnail.jpg')
end
You'll want to back up your original images before unleashing something like this on them, since it could potentially clobber a bunch of files.
I am building a tool to help me reverse engineer database files. I am targeting my tool towards fixed record length flat files.
What I know:
1) Each record has an index(ID).
2) Each record is separated by a delimiter.
3) Each record is fixed width.
4) Each column in each record is separated by at least one x00 byte.
5) The file header is at the beginning (I say this because the header does not contain the delimiter..)
Delimiters I have found in other files are ( xFAxFA, xFExFE, xFDxFD ), but this is kind of irrelevant, considering that I may use the tool on a different database in the future. So I will need something that will be able to pick out a 'pattern' regardless of how many bytes it is made of. Probably no more than 6 bytes? It would probably eat up too much data if it were more. But my experience doing this is limited.
So I guess my question is: how would I find UNKNOWN delimiters in a large file? I feel that, given 'what I know', I should be able to program something; I just don't know where to begin...
# Really loose pseudo code
def begin_some_how
  # THIS IS THE PART I NEED HELP WITH...
  # find all non-zero non-ascii sets of 2 or more bytes that repeat more than twice.
end

def check_possible_record_lengths
  possible_delimiter = begin_some_how
  # test if any of the above are always the same number of bytes apart
  # from each other (except one instance, the header...)
  possible_records = file.split(possible_delimiter)
  rec_length_count = possible_records.map { |record| record.length }.uniq.count
  if rec_length_count == 2 # The header will most likely not be the same size.
    puts "Success! We found the fixed record delimiter: #{possible_delimiter}"
  else
    puts "Wrong delimiter found"
  end
end
possible = [",", "."]
result = [0, ""]
possible.each do |delimiter|
  sizes = file.split(delimiter).map { |record| record.size }
  next if sizes.size < 2
  # Mean record size; this should be the record length if this is the right delimiter
  average = sizes.inject(0.0) { |sum, x| sum + x } / sizes.size
  # Sum of squared deviations from the mean: near zero for fixed-width records
  deviation = sizes.inject(0.0) { |sum, x| sum + (x - average) ** 2 }
  # Low deviation relative to the average scores higher
  matching_value = average / (deviation ** 2)
  if matching_value > result[0]
    result[0] = matching_value
    result[1] = delimiter
  end
end
Take advantage of the fact that the records have constant size. Take every possible delimiter and check how much each record deviates from the usual record length. If the header is small enough compared to the rest of the file, this should work.
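As for the begin_some_how placeholder in the question, here's a hedged sketch (the file name and the 1 MB sample size are assumptions for illustration) that tallies candidate two-byte sequences which contain no x00 bytes, no printable ASCII, and repeat more than twice:

# Tally every two-byte window in a sample of the raw bytes, then keep
# candidates with no x00 bytes, no printable ASCII, occurring > 2 times.
def candidate_delimiters(bytes)
  counts = Hash.new(0)
  (0..bytes.bytesize - 2).each { |i| counts[bytes.byteslice(i, 2)] += 1 }
  counts.select do |pair, n|
    n > 2 && pair.bytes.none? { |b| b.zero? || (0x20..0x7e).cover?(b) }
  end
end

sample = File.binread('unknown.db', 1_000_000)   # hypothetical input file
p candidate_delimiters(sample).sort_by { |_, n| -n } # most frequent first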
I have files that can be 19GB or greater, they will be huge but sorted. Can I use the fact that they are sorted to my advantage when searching to see if a certain string exists?
I looked at something called sgrep but I'm not sure if it's what I'm looking for. An example is I will have a 19GB text file with millions of rows of
ABCDEFG,1234,Jan 21,stackoverflow
and I want to search just the first column of these millions of row to see if ABCDEFG exists in this huge text file.
Is there a more efficient way than just grepping this file for the string and seeing if a result comes back? I don't even need the line; I just need, essentially, a boolean: true/false, whether it is inside this file.
Actually sgrep is what I was looking for. The reason I got confused was because structured grep has the same name as sorted grep and I was installing the wrong package. sgrep is amazing
I don't know if there are any utilities that would help you out of the box, but it would be pretty straightforward to write an application specific to your problem. A binary search would work well, and should yield your result within 20-30 queries against the file.
Let's say your lines are never more than 100 characters, and the file is B bytes long.
Do something like this in your favorite language:
sub file_has_line(file, target) {
    a = 0
    z = file.length
    while (a < z) {
        m = (a + z) / 2
        chunk = file.read(m, 200)
        // That is, read 200 bytes, starting at offset m.
        line = chunk.split(/\n/)[1]
        // Split the chunk on newlines; piece [1] is the first *complete*
        // line, since piece [0] is usually the tail of a line we landed in.
        if (line == target)
            return true
        else if (line < target)
            a = m + 1   // target sorts later in the file
        else
            z = m - 1   // target sorts earlier in the file
    }
    return false
}
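A hedged, runnable Ruby rendering of the same idea (it assumes lines under ~100 characters and checks only the first comma-separated column, per the question; boundary cases such as the very first line of the file are glossed over):

def file_has_line?(path, target)
  File.open(path, 'rb') do |f|
    a, z = 0, File.size(path)
    while a < z
      m = (a + z) / 2
      f.seek(m)
      chunk = f.read(200).to_s
      line = chunk.split("\n")[1]          # first complete line after m
      return false if line.nil?            # probed past the last line
      return true if line.start_with?("#{target},")
      if line < target
        a = m + 1
      else
        z = m - 1
      end
    end
    false
  end
end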
If you're only doing a single lookup, this will dramatically speed up your program. Instead of reading ~20 GB, you'll be reading ~20 KB of data.
You could try to optimize this a bit by extrapolating that "Xerox" is going to be at 98% of the file and starting the midpoint there...but unless your need for optimization is quite extreme, you really won't see much difference. The binary search will get you that close within 4 or 5 passes, anyway.
If you're doing lots of lookups (I just saw your comment that you will be), I would look to pump all that data into a database where you can query at will.
So if you're doing 100,000 lookups, but this is a one-and-done process where having it in a database has no ongoing value, you could take another approach...
Sort your list of targets, to match the sort order of the log file. Then walk through each in parallel. You'll still end up reading the entire 20 GB file, but you'll only have to do it once and then you'll have all your answers. Something like this:
sub file_has_lines(file, target_array) {
    target_array = target_array.sort
    hits = []
    line = file.readln()
    target = target_array.shift()
    // Both streams are sorted ascending, so always advance whichever
    // side is behind, like the merge step of a mergesort.
    while (not file.eof() and target != null) {
        if (line < target)
            line = file.readln()
        else if (line > target)
            target = target_array.shift()
        else {
            hits.push(line)
            line = file.readln()
        }
    }
    return hits
}
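And a hedged Ruby sketch of that merge walk, again comparing only the first column per the question (one linear pass, however many targets you have):

def matching_keys(path, targets)
  targets = targets.sort
  hits = []
  File.open(path, 'r') do |f|
    line = f.gets
    target = targets.shift
    while line && target
      key = line[/\A[^,]*/]       # first comma-separated column
      if key < target
        line = f.gets             # file side is behind; advance it
      elsif key > target
        target = targets.shift    # target side is behind; advance it
      else
        hits << target            # hit: record it, move to next target
        target = targets.shift
      end
    end
  end
  hits
end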