How would I find an unknown pattern in an array of bytes? - ruby

I am building a tool to help me reverse engineer database files. I am targeting my tool towards fixed record length flat files.
What I know:
1) Each record has an index(ID).
2) Each record is separated by a delimiter.
3) Each record is fixed width.
4) Each column in each record is separated by at least one \x00 byte.
5) The file header is at the beginning (I say this because the header does not contain the delimiter).
Delimiters I have found in other files are \xFA\xFA, \xFE\xFE and \xFD\xFD, but this is somewhat irrelevant because I may use the tool on a different database in the future. So I need something that can pick out a 'pattern' regardless of how many bytes it is made of. Probably no more than 6 bytes? A longer delimiter would eat up too much space. But my experience doing this is limited.
So I guess my question is: how would I find UNKNOWN delimiters in a large file? I feel that, given 'what I know', I should be able to program something; I just don't know where to begin...
# Really loose pseudo code
def begin_some_how
  # THIS IS THE PART I NEED HELP WITH...
  # find all non-zero non-ascii sets of 2 or more bytes that repeat more than twice.
end
def check_possible_record_lengths
  possible_delimiter = begin_some_how
  # test if any of the above are always the same number of bytes apart from each other (except one instance, the header...)
  possible_records = file.split(possible_delimiter)
  rec_length_count = possible_records.map { |record| record.length }.uniq.count
  if rec_length_count == 2 # The header will most likely not be the same size.
    puts "Success! We found the fixed record delimiter: #{possible_delimiter}"
  else
    puts "Wrong delimiter found"
  end
end

# `file` is the file's contents as a String
possible = [",", "."]
result = [0, ""] # [best score so far, delimiter that produced it]
possible.each do |delimiter|
  sizes = file.split(delimiter).map { |record| record.size }
  next if sizes.size < 2
  average = sizes.inject(0.0) { |sum, x| sum + x }
  average /= sizes.size # This should be the record length if this is the right delimiter
  # Seed with 0.0 so the first size contributes its squared deviation too
  deviation = sizes.inject(0.0) { |sum, x| sum + (x - average)**2 }
  matching_value = average / (deviation**2) # Infinity when every piece has the same size
  if matching_value > result[0]
    result[0] = matching_value
    result[1] = delimiter
  end
end
Take advantage of the fact that the records have constant size. Take every possible delimiter and check how much each record deviates from the usual record length. If the header is small enough compared to the rest of the file, this should work.
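To get candidates when the delimiter is not known in advance, one option is to collect every short run of high bytes that repeats, then rank the candidates by how regular the spacing between their occurrences is. This is only a sketch under assumptions that are not in the question: "non-ASCII" is taken to mean bytes 0x80..0xFF, the whole file fits in memory, and the helper names offsets_of, candidate_delimiters and best_delimiter are made up for illustration.

```ruby
# Offsets of every (possibly overlapping) occurrence of pattern in data.
def offsets_of(data, pattern)
  offsets = []
  pos = 0
  while (found = data.index(pattern, pos))
    offsets << found
    pos = found + 1
  end
  offsets
end

# Every 2..6 byte sequence made only of bytes 0x80..0xFF that occurs
# at least min_repeats times. The byte range is an assumption.
def candidate_delimiters(data, lengths: 2..6, min_repeats: 3)
  counts = Hash.new(0)
  data.scan(/[\x80-\xFF]{2,}/n) do |run|
    lengths.each do |n|
      run.bytes.each_cons(n) { |gram| counts[gram.pack("C*")] += 1 }
    end
  end
  counts.select { |_, count| count >= min_repeats }.keys
end

# Returns [delimiter, most common gap, share of gaps equal to it], or nil.
def best_delimiter(raw)
  data = raw.b # work on raw bytes, ignoring any text encoding
  scored = candidate_delimiters(data).map do |delim|
    gaps = offsets_of(data, delim).each_cons(2).map { |a, b| b - a }
    gap, same = gaps.group_by(&:itself).max_by { |_, group| group.size }
    [delim, gap, same.size.to_f / gaps.size]
  end
  # Prefer the most regular spacing, then the longer pattern.
  scored.max_by { |delim, _, regularity| [regularity, delim.bytesize] }
end
```

Calling best_delimiter on the contents read with File.binread would return the winning byte string together with the most common distance between its occurrences, which for a fixed-width file should be the record length including the delimiter. Data bytes that happen to repeat at the same position in every record can tie with the real delimiter, so the result is a candidate to inspect, not a proof.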

Related

In Ruby, what is the best way to convert alphanumeric entries to integers for a column of a CSV containing a huge number of rows?

My CSV contains about 60 million rows. The 10th column contains some alphanumeric entries, some of which repeat, that I want to convert into integers with a one-to-one mapping. That is, I don't want the same entry in Original.csv to have multiple corresponding integer values in Processed.csv. So, initially, I wrote the following code:
require 'csv'
udids = []
# Read Original.csv and write one integer per row to Processed.csv
CSV.open('Processed.csv', "wb") do |csv|
  CSV.foreach('Original.csv', :headers=>true) do |row|
    unless udids.include?(row[9])
      udids << row[9]
    end
    udid = udids.index(row[9]) + 1
    array = [udid]
    csv << array
  end
end
But, the program was taking a lot of time, which I soon realized was because it had to check all the previous rows to make sure only the new values get assigned a new integer value, and the existing ones are not assigned any new value.
So, I thought of hashing them, because when exploring the web about this issue, I learnt that hashing is faster than sequential comparing, somehow (I have not read the details about the how, but anyway...) So, I wrote the following code to hash them:
arrayUDID = []
arrayUser = []
arrayHash = []
array1 = []
f = File.open("Original.csv", "r")
f.each_line { |line|
  row = line.split(",")
  arrayUDID << row[9]
  arrayUser << row[9]
}
arrayUser = arrayUser.uniq
arrayHash = []
for i in 0..arrayUser.size-1
  arrayHash << arrayUser[i]
  arrayHash << i
end
hash = Hash[arrayHash.each_slice(2).to_a]
array1 = hash.values_at(*arrayUDID)
logfile = File.new("Processed.csv", "w")
for i in 0..array1.size-1
  logfile.print("#{array1[i]}\n")
end
logfile.close
But here again, I observed that the program was taking a lot of time, which I realized must be due to the hash array (or hash table) running out of memory.
So, can you kindly suggest any method that will work for my huge file in a reasonable amount of time? By reasonable amount, I mean within 10 hours, because I realize that it's going to take some hours at least as it took about 5 hours to extract that dataset from an even bigger dataset. So, with my aforementioned codes, it was not getting finished even after 2 days of running the programs. So, if you can suggest a method which can do the task by leaving the computer on overnight, that would be great. Thanks.
I think this should work:
require 'csv'
udids = {} # maps each distinct value to the integer assigned to it
unique_count = 1
output_csv = CSV.open("Processed.csv", "w")
CSV.foreach("Original.csv").with_index do |row, i|
  output_csv << row and next if i == 0 # copy the first row (header info) through unchanged
  val = row[9]
  if udids[val.to_sym]
    row[9] = udids[val.to_sym]
  else
    udids[val.to_sym] = unique_count
    row[9] = unique_count
    unique_count += 1
  end
  output_csv << row
end
output_csv.close
The performance depends heavily on how many duplicates there are (the more the better), but basically it keeps track of each value as a key in a hash, and checks to see if it has encountered that value yet or not. If so, it uses the corresponding value, and if not it increments a counter, stores that count as the new value for that key and continues.
I was able to process a 10 million line test CSV file in about 3 minutes.
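The reason the hash version is so much faster than the udids.include? version in the question is that an Array lookup scans every element seen so far, while a Hash lookup takes roughly constant time. The same bookkeeping can be written more compactly with a Hash default block. This is a sketch only; new_id_lookup and each_remapped_row are names invented here, and rows is assumed to be any Enumerable of arrays, such as the enumerator returned by CSV.foreach.

```ruby
# Hands out 1, 2, 3, ... to distinct values in the order they first appear.
def new_id_lookup
  Hash.new { |ids, value| ids[value] = ids.size + 1 }
end

# Yields each row with the given column replaced by its integer id.
# Returns the number of distinct values that were seen.
def each_remapped_row(rows, column)
  ids = new_id_lookup
  rows.each do |row|
    row[column] = ids[row[column]]
    yield row
  end
  ids.size
end
```

Because rows are handled one at a time, memory use grows with the number of distinct values and not with the number of rows.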

How to see if a string exists in a huge (>19GB) sorted file?

I have files that can be 19GB or greater; they will be huge but sorted. Can I use the fact that they are sorted to my advantage when searching to see if a certain string exists?
I looked at something called sgrep, but I'm not sure if it's what I'm looking for. For example, I will have a 19GB text file with millions of rows of
ABCDEFG,1234,Jan 21,stackoverflow
and I want to search just the first column of these millions of rows to see if ABCDEFG exists in this huge text file.
Is there a more efficient way than just grepping this file for the string and seeing if a result comes back? I don't even need the line, I just need a boolean, true/false, for whether it is inside this file.
Actually sgrep is what I was looking for. The reason I got confused was that structured grep has the same name as sorted grep and I was installing the wrong package. sgrep is amazing.
I don't know if there are any utilities that would help you out of the box, but it would be pretty straightforward to write an application specific to your problem. A binary search would work well, and should yield your result within 20-30 queries against the file.
Let's say your lines are never more than 100 characters, and the file is B bytes long.
Do something like this in your favorite language:
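sub file_has_line(file, target) {
  a = 0
  z = file.length
  while (a < z) {
    m = (a+z)/2
    chunk = file.read(m, 200)
    // That is, read 200 bytes, starting at m.
    line = chunk.split(/\n/)[1]
    // Split the chunk on newlines and keep only the second piece,
    // which is the first complete line after m.
    if line == target
      return true
    else if line < target
      a = m + 1   // target sorts later: search the upper half
    else
      z = m - 1   // target sorts earlier: search the lower half
  }
  return false
}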
If you're only doing a single lookup, this will dramatically speed up your program. Instead of reading ~20 GB, you'll be reading ~20 KB of data.
You could try to optimize this a bit by extrapolating that "Xerox" is going to be at 98% of the file and starting the midpoint there...but unless your need for optimization is quite extreme, you really won't see much difference. The binary search will get you that close within 4 or 5 passes, anyway.
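The binary search described above could look roughly like this in Ruby. It is a sketch under assumptions: the file is sorted in plain byte order (for example with LC_ALL=C sort), the key is everything before the first comma, and the helper names line_at_or_after, first_field and sorted_file_has_key? are invented for this example. Instead of reading a fixed 200-byte chunk it seeks to an offset and lets gets find the next line boundary, so no maximum line length has to be assumed.

```ruby
# First line starting at or after byte offset, or nil past the last line.
def line_at_or_after(io, offset)
  if offset.zero?
    io.seek(0)
  else
    io.seek(offset - 1)
    io.gets # discard the rest of the line that contains byte offset - 1
  end
  io.gets
end

# Key of a line: everything before the first comma.
def first_field(line)
  line.chomp[/\A[^,]*/]
end

def sorted_file_has_key?(path, key)
  key = key.b # compare as raw bytes, like the binary-mode reads below
  File.open(path, "rb") do |io|
    lo = 0
    hi = io.size
    # Find the smallest offset whose following line has a key >= the target.
    while lo < hi
      mid = (lo + hi) / 2
      line = line_at_or_after(io, mid)
      if line.nil? || first_field(line) >= key
        hi = mid
      else
        lo = mid + 1
      end
    end
    line = line_at_or_after(io, lo)
    !line.nil? && first_field(line) == key
  end
end
```

Each probe reads at most two lines, so a lookup touches only a few dozen lines even in a multi-gigabyte file.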
If you're doing lots of lookups (I just saw your comment that you will be), I would look to pump all that data into a database where you can query at will.
So if you're doing 100,000 lookups, but this is a one-and-done process where having it in a database has no ongoing value, you could take another approach...
Sort your list of targets, to match the sort order of the log file. Then walk through each in parallel. You'll still end up reading the entire 20 GB file, but you'll only have to do it once and then you'll have all your answers. Something like this:
sub file_has_lines(file, target_array) {
  target_array = target_array.sort
  target = target_array.shift()   // smallest target first
  line = file.readln()
  hits = []
  do {
    if line < target
      line = file.readln()
    elsif line > target
      target = target_array.shift()
    elsif line == target
      hits.push(line)
      line = file.readln()
  } while not file.eof() and target != null
  return hits
}
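As a runnable Ruby sketch of the same parallel walk: the name keys_present is made up here, the key is assumed to be the text before the first comma, and both the file and the target list are assumed to be ordered by plain byte comparison.

```ruby
# Returns the targets that appear as the first comma-separated field of some
# line in io. io must yield lines in ascending byte order of that field.
def keys_present(io, targets)
  wanted = targets.map(&:b).uniq.sort
  hits = []
  i = 0
  io.each_line do |line|
    key = line.chomp[/\A[^,]*/].b
    # Skip targets that sort before the current line: they cannot appear later.
    i += 1 while i < wanted.size && wanted[i] < key
    break if i == wanted.size # every target has been resolved
    if wanted[i] == key
      hits << wanted[i]
      i += 1
    end
  end
  hits
end
```

The file is read once from the start and the loop stops early as soon as the last target has been passed, so it only reads the whole file when the largest target sorts near the end.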

How do I convert a spreadsheet "letternamed" column coordinate to an integer?

In spreadsheets I have cells named like "F14", "BE5" or "ALL1". I have the first part, the column coordinate, in a variable and I want to convert it to a 0-based integer column index.
How do I do it, preferably in an elegant way, in Ruby?
I can do it using a brute-force method: I can imagine looping through all letters, converting them to ASCII and adding to a result, but I feel there should be something more elegant/straightforward.
Edit: Example: To simplify, I am only talking about the column coordinate (letters). Therefore in the first case (F14) I have "F" as the input and I expect the result to be 5. In the second case I have "BE" as input and I expect to get 56; for "ALL" I want to get 999.
Not sure if this is any clearer than the code you already have, but it does have the advantage of handling an arbitrary number of letters:
class String
  def upcase_letters
    self.upcase.split(//)
  end
end
module Enumerable
  # Pairs each element with its distance from the end, so the last letter gets position 0.
  def reverse_with_index
    self.to_a.reverse.each_with_index.to_a
  end
  def sum
    self.reduce(0, :+)
  end
end
def indexFromColumnName(column_str)
  start = 'A'.ord - 1
  column_str.upcase_letters.map do |c|
    c.ord - start
  end.reverse_with_index.map do |value, digit_position|
    value * (26 ** digit_position)
  end.sum - 1
end
I've added some methods to String and Enumerable because I thought it made the code more readable, but you could inline these or define them elsewhere if you don't like that sort of thing.
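We can use modulo and the length of the input. The last character will
be used to calculate the exact "position", and the remaining characters to count
how many "laps" we did in the alphabet. Note that this counts one lap per extra
letter regardless of which letter it is, so it is only correct for columns A
through AZ: "BE", for example, comes out as 30 rather than 56. E.g.
def column_to_integer(column_name)
  letters = /[A-Z]+/.match(column_name).to_s.split("")
  laps = (letters.length - 1) * 26
  position = ((letters.last.ord - 'A'.ord) % 26)
  laps + position
end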
Using the decimal representation (ord) and the math tricks seems a neat
solution at first, but it has some pain points regarding the
implementation. We have magic numbers (26) and constants ('A'.ord) all
over.
One solution is to give our code better knowledge about our domain, i.e.
the alphabet. In that case, we can replace the modulo with the position of
the last character in the alphabet (because it's already sorted in a zero-based array), e.g.
ALPHABET = ('A'..'Z').to_a
def column_to_integer(column_name)
  letters = /[A-Z]+/.match(column_name).to_s.split("")
  laps = (letters.length - 1) * ALPHABET.size
  position = ALPHABET.index(letters.last)
  laps + position
end
The final result:
> column_to_integer('F5')
=> 5
> column_to_integer('AK14')
=> 36
HTH. Best!
I have found a particularly neat way to do this conversion:
def index_from_column_name(colname)
  s = colname.size
  (colname.to_i(36)-(36**s-1).div(3.5)).to_s(36).to_i(26)+(26**s-1)/25-1
end
Explanation of why it works
(warning: spoiler ahead ;)). Basically we are doing this:
(colname.to_i(36)-('A'*colname.size).to_i(36)).to_s(36).to_i(26)+('1'*colname.size).to_i(26)-1
which in plain English means that we are interpreting colname as a base-26 number. Before we can do that we need to interpret all A's as 1, B's as 2, etc. If that were all that is needed, it would be even simpler, namely
(colname.to_i(36) - ('9'*colname.size).to_i(36)).to_s(36).to_i(26)-1
Unfortunately there are Z characters present, which would need to be interpreted as 10 (base 26), so we need a little trick. We shift every digit 1 more than needed and then add it back at the end (to every digit in the original colname).
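For comparison, the conversion can also be written as a plain left-to-right fold, treating the letters as digits 1..26 of a base-26 number and subtracting 1 at the end to make the result 0-based. This is a minimal sketch; the name column_index is hypothetical, and accepting lowercase letters and trailing row digits is an assumption that goes beyond what the question asks for.

```ruby
# 0-based column index for a spreadsheet column name such as "F", "BE" or "ALL".
# Any row number after the letters (as in "F14") is ignored.
def column_index(column_name)
  letters = column_name[/\A[A-Za-z]+/]
  raise ArgumentError, "no column letters in #{column_name.inspect}" if letters.nil?

  one_based = letters.upcase.each_char.reduce(0) do |acc, ch|
    acc * 26 + (ch.ord - "A".ord + 1) # A => 1, B => 2, ..., Z => 26
  end
  one_based - 1
end

puts column_index("F")   # 5
puts column_index("BE")  # 56
puts column_index("ALL") # 999
```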

Can't convert nil into string--Ruby Secret Santa

I wrote a Secret Santa program (ala Ruby Quiz...ish), but occasionally when the program runs, I get an error.
Stats: If there's 10 names in the pot, the error comes up about 5% of the time. If there's 100 names in the pot, it's less than 1%. This is on a trial of 1000 times in bash. I've determined that the gift arrays are coming up nil at some point, but I'm not sure why or how to avoid it.
Providing code...
0.upto($lname.length-1).each do |i|
j = rand($giftlname.length) # should be less each time.
while $giftlname[j] == $lname[i] # redo random if it picks same person
if $lname[i] == $lname.last # if random gives same output again, means person is left with himself; needs to switch with someone
$giftfname[j], $fname[i] = $giftfname[i], $fname[j]
$giftlname[j], $lname[i] = $giftlname[i], $lname[j]
$giftemail[j], $email[i] = $giftemail[i], $email[j]
else
j = rand($giftlname.length)
end
end
$santas.push('Santa ' + $fname[i] + ' ' + $lname[i] + ' sends gift to ' + $giftfname[j] + ' ' + $giftlname[j] + ' at ' + '<' + $giftemail[j] + '>.') #Error here, something is sometimes nil
$giftfname.delete_at(j)
$giftlname.delete_at(j)
$giftemail.delete_at(j)
end
Thanks SO!
I think your problem is right here:
$giftfname[j], $fname[i] = $giftfname[i], $fname[j]
Your i values range from zero to the last index in $fname (inclusive) and, presumably, your $giftfname starts off as a clone of $fname (or at least another array with the same length). But, as you spin through the each, you're shrinking $giftfname, so $giftfname[i] will eventually be nil and the swap operation above will put nil into $giftfname[j] (which is supposed to be a useful entry of $giftfname). Similar issues apply to $giftlname and $giftemail.
I'd recommend using one array of three-element objects (first name, last name, email) instead of your three parallel arrays. There's also a shuffle method on Array that might be of use to you:
1) Start with an array of people.
2) Make a copy of that array.
3) Shuffle the copy until it is different at every index from the original array.
4) Then zip the two together to get your final list of giver/receiver pairs.
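The steps above might be sketched like this. The Person struct, the secret_santa_pairs name and the rng keyword are assumptions made for the example, not part of the original program.

```ruby
Person = Struct.new(:first_name, :last_name, :email)

# Returns [giver, receiver] pairs in which nobody is paired with themselves.
def secret_santa_pairs(people, rng: Random.new)
  raise ArgumentError, "need at least two people" if people.size < 2

  loop do
    receivers = people.shuffle(random: rng)
    pairs = people.zip(receivers)
    # Accept the shuffle only if it differs from the original at every index.
    return pairs if pairs.none? { |giver, receiver| giver.equal?(receiver) }
  end
end
```

Since nothing is ever deleted from an array while it is being indexed, no lookup can come back nil. A bit more than a third of random shuffles have no one matched with themselves, so the loop usually ends after a few tries.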
Figured it out and used the retry statement. The if statement now looks like this (all other variables have been edited to be non-global as well):
if lname[i] == lname.last
santas = Array.new
giftfname = fname.clone
giftlname = lname.clone
giftemail = email.clone
retry
That, aside from a few other edits, created the solution I needed without breaking apart the code too much again. Will definitely try out mu's solution as well, but I'm just glad I have this running error-free for now.

Ruby, Count syllables

I am using Ruby to calculate the Gunning Fog Index of some content that I have. I can successfully implement the algorithm described here:
Gunning Fog Index
I am using the below method to count the number of syllables in each word:
Tokenizer = /([aeiouy]{1,3})/
def count_syllables(word)
len = 0
if word[-3..-1] == 'ing' then
len += 1
word = word[0...-3]
end
got = word.scan(Tokenizer)
len += got.size()
if got.size() > 1 and got[-1] == ['e'] and
word[-1].chr() == 'e' and
word[-2].chr() != 'l' then
len -= 1
end
return len
end
It sometimes picks up words with only 2 syllables as having 3 syllables. Can anyone give any advice or is aware of a better method?
text = "The word logorrhoea is often used pejoratively to describe prose that is highly abstract and contains little concrete language. Since abstract writing is hard to visualize, it often seems as though it makes no sense and all the words are excessive. Writers in academic fields that concern themselves mostly with the abstract, such as philosophy and especially postmodernism, often fail to include extensive concrete examples of their ideas, and so a superficial examination of their work might lead one to believe that it is all nonsense."
# used to get rid of any punctuation
text = text.gsub(/\W+/, ' ') # gsub! would return nil if nothing were replaced
word_array = text.split(' ')
word_array.each do |word|
puts word if count_syllables(word) > 2
end
"themselves" is being counted as 3 but it's only 2
The function I gave you before is based upon these simple rules outlined here:
Each vowel (a, e, i, o, u, y) in a word counts as one syllable subject to the following sub-rules:
1) Ignore final -ES, -ED, -E (except for -LE)
2) Words of three letters or less count as one syllable
3) Consecutive vowels count as one syllable.
Here's the code:
def new_count(word)
word.downcase!
return 1 if word.length <= 3
word.sub!(/(?:[^laeiouy]es|ed|[^laeiouy]e)$/, '')
word.sub!(/^y/, '')
word.scan(/[aeiouy]{1,2}/).size
end
Obviously, this isn't perfect either, but all you'll ever get with something like this is a heuristic.
EDIT:
I changed the code slightly to handle a leading 'y' and fixed the regex to handle 'les' endings better (such as in "candles").
Here's a comparison using the text in the question:
# used to get rid of any punctuation
text = text.gsub(/\W+/, ' ') # gsub! would return nil if nothing were replaced
words = text.split(' ')
words.each do |word|
old = count_syllables(word.dup)
new = new_count(word.dup)
puts "#{word}: \t#{old}\t#{new}" if old != new
end
The output is:
logorrhoea: 3 4
used: 2 1
makes: 2 1
themselves: 3 2
So it appears to be an improvement.
One thing you ought to do is teach your algorithm about diphthongs. If I'm reading your code correctly, it would incorrectly flag "aid" as having two syllables.
You can also add "es" and the like to your special-case endings (you already have "ing") and just not count it as a syllable, but that might still result in some miscounts.
Finally, for best accuracy, you should convert your input to a spelling scheme or alphabet that has a definite relationship to the word's pronunciation. With your "themselves" example, the algorithm has no reliable way to know that the "e" in "ves" is dropped. However, if you respelled it as "themselvz", or taught the algorithm the IPA and fed it [ðəmsɛlvz], it becomes very clear that the word is only pronounced with two syllables. That, of course, assumes you have control over the input, and is probably more work than just counting the syllables yourself.
To begin with, it seems you should decrement len for the suffixes that should be excluded.
len -= 1 if /(?:ing|es|ed)$/.match(word)
You could also check out Lingua::EN::Readability.
It can also calculate several readability measures, such as a Fog Index and a Flesch-Kincaid level.
PS. I think I know where you got the function from. DS.
There is also a rubygem called Odyssey that calculates Gunning Fog, along with some of the other popular ones (Flesch-Kincaid, SMOG, etc.)

Resources