How do I skip certain columns when parsing a text file with Ruby?

I have to parse a tab-delimited text file with Ruby to extract some data from it. For some unknown reason some columns aren't used and are just essentially spacers; I'd like to ignore these columns since I don't need their output (however I can't just ignore all empty columns since some legitimate columns have empty values). I know the indices of these columns already (e.g. columns 6, 14, 24 and 38).
While I could just add a conditional while I'm parsing the file and say parse this unless it's one of those columns, this doesn't seem very "Rubyish" - is there a better and more elegant way to handle this? RegExps, perhaps? I thought of doing something like [6, 14, 24, 38].each { |x| columns.delete_at(x) } to remove the unused columns, but this will force me to redetermine the indices of the columns which I actually need. What I'd really like to do is just loop through the whole thing, checking the index of the current column and ignore it if it's one of the "bad" ones. However it seems very ugly to have code like unless x == 6 || x == 14 || x == 24 || x == 38

No need for a massive conditional like that.
bad_cols = [6, 14, 24, 38]
columns.each_with_index do |val, idx|
  next if bad_cols.include?(idx)
  # process the data
end
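If the goal is a new row without the spacer columns, the same index check can also be used as a filter. A minimal sketch, assuming each line is split on tabs; `keep_columns` and `BAD_COLS` are illustrative names, not part of the answer above:

```ruby
require 'set'

# Indices of the spacer columns (taken from the question).
BAD_COLS = Set[6, 14, 24, 38]

# Hypothetical helper: split a tab-delimited line and drop the unwanted indices.
# The -1 limit keeps trailing empty fields, so legitimate empty columns survive.
def keep_columns(line, bad_cols = BAD_COLS)
  line.chomp.split("\t", -1).reject.with_index { |_, idx| bad_cols.include?(idx) }
end

line = (0..7).map(&:to_s).join("\t")
p keep_columns(line) # column 6 is dropped, the rest keep their order
```

Because the filtering happens in one pass, the indices in `BAD_COLS` always refer to positions in the original line, which avoids the shifting problem of repeated `delete_at` calls.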

Related

How would I find an unknown pattern in an array of bytes?

I am building a tool to help me reverse engineer database files. I am targeting my tool towards fixed record length flat files.
What I know:
1) Each record has an index(ID).
2) Each record is separated by a delimiter.
3) Each record is fixed width.
4) Each column in each record is separated by at least one \x00 byte.
5) The file header is at the beginning (I say this because the header does not contain the delimiter..)
Delimiters I have found in other files are \xFA\xFA, \xFE\xFE and \xFD\xFD, but this is somewhat irrelevant, since I may use the tool on a different database in the future. So I need something that can pick out a 'pattern' regardless of how many bytes it is made of. Probably no more than 6 bytes? A longer delimiter would probably eat up too much data. But my experience doing this is limited.
So I guess my question is: how would I find UNKNOWN delimiters in a large file? Given 'what I know', I feel I should be able to program something; I just don't know where to begin...
# Really loose pseudo code
def begin_some_how
  # THIS IS THE PART I NEED HELP WITH...
  # find all non-zero non-ascii sets of 2 or more bytes that repeat more than twice.
end
def check_possible_record_lengths
  possible_delimiter = begin_some_how
  # test if any of the above are always the same number of bytes apart from each other (except one instance, the header...)
  possible_records = file.split(possible_delimiter)
  rec_length_count = possible_records.map { |record| record.length }.uniq.count
  if rec_length_count == 2 # The header will most likely not be the same size.
    puts "Success! We found the fixed record delimiter: #{possible_delimiter}"
  else
    puts "Wrong delimiter found"
  end
end
possible = [",", "."]
result = [0, ""]
possible.each do |delimiter|
  sizes = file.split(delimiter).map { |record| record.size }
  next if sizes.size < 2
  average = sizes.inject(0.0) { |sum, x| sum + x }
  average /= sizes.size # This should be the record length if this is the right delimiter
  # Start from 0.0 so the first size also contributes its squared deviation
  deviation = sizes.inject(0.0) { |sum, x| sum + (x - average)**2 }
  matching_value = average / (deviation**2)
  if matching_value > result[0]
    result[0] = matching_value
    result[1] = delimiter
  end
end
Take advantage of the fact that the records have constant size. Try every candidate delimiter and check how much the resulting record lengths deviate from the average record length. If the header is small enough compared to the rest of the file, this should work.
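The same idea can be sketched on raw bytes. This is only an illustration under stated assumptions: the candidate delimiters are supplied by the caller, the first chunk is treated as the header and ignored, and `best_delimiter` is a made-up name:

```ruby
# Hypothetical helper: return the candidate whose records are most uniform in size.
def best_delimiter(data, candidates)
  scored = candidates.map do |delim|
    sizes = data.split(delim, -1).map(&:bytesize)
    sizes.shift                # assume the first chunk is the header
    next if sizes.size < 2     # too few records to judge this candidate
    mean = sizes.sum.to_f / sizes.size
    variance = sizes.sum { |s| (s - mean)**2 } / sizes.size
    [variance, delim]
  end
  best = scored.compact.min_by(&:first)
  best && best.last
end

# Tiny synthetic file: a header followed by three 4-byte records.
delim = "\xFA\xFA".b
data  = ["HEADER", "AAAA", "BBBB", "CCCC"].map(&:b).join(delim)
p best_delimiter(data, [delim, "A".b, "\xFE\xFE".b])
```

Picking the lowest variance, rather than dividing by the deviation, sidesteps the division by zero that a perfect match would otherwise cause.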

How do I modify multiple columns in a CSV, and then copy them to a new CSV using Ruby?

Out of the 10 columns there in the original CSV, I have 4 columns which I need to make integers (to process with MATLAB later; the other 6 columns already contain integer values). These 4 columns are: (1) platform (2) push (3) timestamp, and (4) udid.
An example input is: #other_column, Android, Y, 10-05-2015 3:59:59 PM, #other_column, d0155049772de9, #other_columns
The corresponding output should be: #other_column, 2, 1, 1431273612198, #other_column, 17923, #other_columns
So, I wrote the following code:
require 'csv'
CSV.open('C:\Users\hp1\Desktop\Datasets\NewColumns2.csv', "wb") do |csv|
  CSV.foreach('C:\Users\hp1\Desktop\Datasets\NewColumns.csv', :headers=>true).map do |row|
    if row['platform']=='Android'
      row['platform']=2
    elsif row['platform']=='iPhone'
      row['platform']=1
    end
    if row['push']=='Y'
      row['push']=1
    elsif row['push']=='N'
      row['push']=0
    end
    row['timestamp'].to_time.to_i
    row['udid'].to_i
    csv<<row
  end
end
Now, the first 3 columns, weekday, platform and push, are having a small number of unique values for the whole file (i.e., 7, 2 and 2 respectively), which is why I used the above approach. However, the other 2 columns, timestamp and udid, are different - they have several values, a few of them common to some rows in the CSV, but there are thousands of unique values. And hence I thought of converting them to integers in the manner I showed above.
Anyhow, none of the columns are getting converted at all. Plus, there is another problem with the datetime column as it is in a format which Ruby apparently does not recognize as a legitimate time format (a sample looks like this: 10-05-2015 3:59:59 PM). So, what should I do? Thanks.
Edit - Redo, I missed part of the problem with the udids
Problems
You are using map when you don't need to. CSV.foreach already iterates through all of the rows, so remove the map.
Date - require the Ruby standard time library (require 'time') and parse the timestamp with Time.parse.
Unique ids - it sounds like you want to convert the udid into a shorter unique id, since there may be more than one entry per mobile device. Use an array to build a collection without repeats, and use the index of the device udid in that array as your new, shorter unique id.
I used this as my input csv:
othercol1,platform,push,timestamp,othercol2,udid,othercol3,othercol4,othercol5,othercol6
11,Android, N, 10-05-2015 3:59:59 PM,22, d0155049772de9,33,44,55,66
11,iPhone, N, 10-05-2015 5:59:59 PM,22, d0155044772de9,33,44,55,66
11,iPhone, Y, 10-06-2015 3:59:59 PM,22, d0155049772de9,33,44,55,66
11,Android, Y, 11-05-2015 3:59:59 PM,22, d0155249772de9,33,44,55,66
Here is my output csv:
11,2,0,1431298799,22,1,33,44,55,66
11,1,0,1431305999,22,2,33,44,55,66
11,1,1,1433977199,22,1,33,44,55,66
11,2,1,1431385199,22,3,33,44,55,66
Here is the script I used:
require 'time' # use ruby standard time library to parse for you
require 'csv'
udids = [] # turn the udid in to a shorter unique id
CSV.open('new.csv', "wb") do |csv|
  CSV.foreach('old.csv', headers: true) do |row|
    if row['platform'] == 'Android'
      row['platform'] = 2
    elsif row['platform'] == 'iPhone'
      row['platform'] = 1
    end
    if row['push'].strip == 'Y'
      row['push'] = 1
    elsif row['push'].strip == 'N'
      row['push'] = 0
    end
    row['timestamp'] = Time.parse(row['timestamp']).to_i
    # turn the udid in to a shorter unique id
    unless udids.include?(row['udid'])
      udids << row['udid']
    end
    row['udid'] = udids.index(row['udid']) + 1
    csv << row
  end
end
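With thousands of unique udids, the repeated Array#include? and Array#index calls scan the whole array for every row. One possible alternative, shown here only as a sketch, is a Hash with a default block that hands out the next number the first time a udid is seen; `short_ids` is an illustrative name:

```ruby
# Assign 1, 2, 3, ... to each udid the first time it is looked up.
short_ids = Hash.new { |ids, udid| ids[udid] = ids.size + 1 }

udids = %w[d0155049772de9 d0155044772de9 d0155049772de9 d0155249772de9]
p udids.map { |udid| short_ids[udid] } # => [1, 2, 1, 3]
```

In the script above this would amount to `row['udid'] = short_ids[row['udid']]`, which yields the same numbering as the array-based version.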
This is the wrong use of map. Map is for applying a function to every value in a collection and returning the resulting array. What you are doing is iterating, making some changes, and then pushing each modified row into a new CSV. You can just iterate; there is no need for map:
CSV.foreach('C:\Users\hp1\Desktop\Datasets\NewColumns.csv', :headers=>true) instead of CSV.foreach('C:\Users\hp1\Desktop\Datasets\NewColumns.csv', :headers=>true).map
About the date: you can use strptime to turn the string into a date: DateTime.strptime("10-05-2015 3:59:59 PM", "%d-%m-%Y %l:%M:%S %p"). Here are the docs: http://ruby-doc.org/stdlib-1.9.3/libdoc/date/rdoc/DateTime.html
Add :converters => :all to your options, so that the dates and numbers are automatically converted. Then, instead of
row['timestamp'].to_time.to_i
which does the conversion but doesn't put it anywhere (it is not in-place), do this:
row['timestamp'] = row['timestamp'].to_time.to_i
Note that this only works with converters; otherwise row['timestamp'] is a String, and String has no .to_time method in plain Ruby.
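The strptime route can be sketched without converters. This assumes the format string from the answer above and that the timestamp carries no time zone, so DateTime treats it as UTC (Time.parse, by contrast, uses the local zone); `FORMAT` and `timestamp_to_epoch` are illustrative names:

```ruby
require 'date'

# Format taken from the sample in the question: day-month-year, 12-hour clock.
FORMAT = '%d-%m-%Y %l:%M:%S %p'

# Hypothetical helper: parse the timestamp and return seconds since the epoch.
# strip removes the leading space that the sample CSV fields carry.
def timestamp_to_epoch(text)
  DateTime.strptime(text.strip, FORMAT).to_time.to_i
end

p timestamp_to_epoch(' 10-05-2015 3:59:59 PM')
```

Assigning the result back, as in `row['timestamp'] = timestamp_to_epoch(row['timestamp'])`, is what actually changes the row.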

How to implement slice_after (or group certain elements with certain subsequent ones) in Ruby?

The Enumerable#slice_before method is quite useful, and it does exactly what it says on the tin - slice an array before an element if a certain condition on the element is met. For example, I am using it to group certain numbers to the following ones.
In my case, the IDs 0xF0 to 0xFB should be grouped with the IDs that come after them, including multiple of these IDs in a row (they are "modifier" flags in the thing I'm making). I'm doing it like this:
# example code (for SSCCE)
code = '00FF1234F0AAF0BBF0CCCCF3F4F5AAAAAA'.split('').each_slice(2).map{|n| n.join.to_i 16 }
# grouping the modifiers (more may be added later, so array is used)
code = code.slice_before {|tkn| ![*0xF0..0xFB].include? tkn }.to_a
The result of code after this is
[[0], [255], [18], [52, 240], [170, 240], [187, 240], [204], [204, 243, 244, 245], [170], [170], [170]]
However, the desired result is
[[0], [255], [18], [52], [240, 170], [240, 187], [240, 204], [204], [243, 244, 245, 170], [170], [170]]
I found this entry on bugs.ruby-lang.org, and the response was
The main reason [that this is not implemented] is no one requested.
I have not enough time to implement it now.
Therefore, how can I implement it myself?
Enumerable#slice_after is available if you are using Ruby 2.2.0 or later, so you can just use it:
modifiers = 0xF0..0xFB
hex_code = '00FF1234F0AAF0BBF0CCCCF3F4F5AAAAAA'
bytes = [hex_code].pack('H*').bytes
code = bytes.slice_after { |b| !modifiers.include? b }.to_a
p code # => [[0], [255], [18], [52], [240, 170], ...
It's not the elegant one-liner I'd like to have, but this gets the job done :)
target = []
code.each do |i|
  # if there is a previous element and it was one of the "modifiers"
  if target.last && [*0xF0..0xFB].include?(target.last.last)
    # append it to the current subarray
    target.last << i
  else
    # otherwise, append a new subarray
    target << [i]
  end
end
You'll find the desired array in target with code being unchanged.
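For Ruby versions older than 2.2.0, the loop above can be wrapped in a reusable method that takes the condition as a block. A minimal sketch; `slice_after_fallback` is a made-up name, and it assumes the same semantics as slice_after (cut after every element for which the block is true):

```ruby
# Hypothetical helper: start a new group whenever the previous element
# satisfied the block, otherwise extend the current group.
def slice_after_fallback(items)
  items.each_with_object([]) do |item, groups|
    if groups.empty? || yield(groups.last.last)
      groups << [item]
    else
      groups.last << item
    end
  end
end

modifiers = 0xF0..0xFB
bytes = ['00FF1234F0AAF0BBF0CCCCF3F4F5AAAAAA'].pack('H*').bytes
p slice_after_fallback(bytes) { |b| !modifiers.include?(b) }
```

The input array is left unchanged, and the block reads the same way as the one passed to slice_after.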

Best approach for deleting when using Array.combination()?

I want to compare every object in lectures with each other and if some_condition is true, the second object has to be deleted:
toDelete = []
lectures.combination(2).each do |first, second|
  if (some_condition)
    toDelete << second
  end
end
toDelete.uniq!
lectures = lectures - toDelete
I got some weird errors while trying to delete inside the .each loop, so I came up with this approach.
Is there a more efficient way to do this?
EDIT after first comments:
I wanted to keep the source code free of unnecessary things, but now that you ask:
The elements of the lectures array are hashes containing data about different university lectures, such as the name, the room, the calendar weeks in which they are taught, and the begin and end times.
I parse the timetables of all student groups to get this data, but because some lectures are held in more than one student group and these sometimes differ in the weeks they are taught, I compare them with each other. If the compared ones only differ in certain values, I add the values from the second object to the first object and delete the second object. That's why.
The errors when deleting while in .each-loop: When using the Rails Hash.diff method, I got something like "Cannot convert Symbol to Integer". Turns out there was suddenly an Integer value of 16 in the array, although I tested before the loop that there are only hashes in the array...
Debugging is really hard if you have 9000 hashes.
EDIT:
Sample Data:
lectures = [ {:day=>0, :weeks=>[11, 12, 13, 14], :begin=>"07:30", :end=>"09:30", :rooms=>["Li201", "G221"], :name=>"TestSubject1", :kind=>"Vw", :lecturers=>["WALDM"], :tut_groups=>["11INM"]},
{:day=>0, :weeks=>[11, 12, 13, 14], :begin=>"07:30", :end=>"09:30", :rooms=>["Li201", "G221"], :name=>"TestSubject1", :kind=>"Vw", :lecturers=>["WALDM"], :tut_groups=>["11INM"]} ]
You mean something like this?
cleaned_lectures = lectures.combination(2).reject{|first, second| some_condition}
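If the goal is a list of lectures rather than a list of pairs, one option is to run the comparison over indices and drop the marked positions afterwards. This is a sketch, not the answerer's method: `drop_second_of_matching_pairs` is a made-up name and the block stands in for some_condition. Working on indices also matters for the sample data, because Array#- removes every element equal to one in the other array, so two identical hashes would both disappear:

```ruby
# Hypothetical helper: compare every pair once and drop the second element
# of each pair for which the block returns true.
def drop_second_of_matching_pairs(items)
  doomed = {}
  items.each_index.to_a.combination(2) do |i, j|
    doomed[j] = true if yield(items[i], items[j])
  end
  items.reject.with_index { |_, idx| doomed[idx] }
end

lectures = [
  { name: 'TestSubject1', weeks: [11, 12] },
  { name: 'TestSubject1', weeks: [11, 12] },
  { name: 'TestSubject2', weeks: [13] }
]
p drop_second_of_matching_pairs(lectures) { |a, b| a[:name] == b[:name] }
```

Nothing is deleted while the pairs are being enumerated, which avoids the errors described in the question.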

about ruby range?

Given a range like this:
range = (0..10)
how can I get numbers like this:
0 5 10
that is, adding five each time without going past 10?
If range = (0..20), then I should get this:
0 5 10 15 20
Try using .step() to go through at a given step.
(0..20).step(5) do |n|
  print n, ' '
end
gives...
0 5 10 15 20
As mentioned by dominikh, you can add .to_a on the end to get a storable form of the list of numbers: (0..20).step(5).to_a
Like Dav said, but add to_a:
(0..20).step(5).to_a # [0, 5, 10, 15, 20]
The step method described in http://ruby-doc.org/core/classes/Range.html should do the job, but it may seriously harm readability.
Just consider:
(0..20).step(5){|n| print ' first ', n }.each{|n| print ' second ',n }
You may think that step(5) kind of produces a new Range, like why_'s question initially intended. But the each is called on the (0..20) and has to be replaced by another step(5) if you want to "reuse" the 0-5-10-15-20 range.
Maybe you will be fine with something like (0..4).map{|i| i*5}?
But "persisting" the step method's results with .to_a should also work fine.
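The difference between the two forms of step can be checked directly. A small sketch; the variable names are only for illustration:

```ruby
# Without a block, step returns an enumerator that to_a turns into an array.
stepped = (0..20).step(5).to_a
p stepped  # => [0, 5, 10, 15, 20]

# With a block, step returns the original range, not the stepped values,
# which is why a chained each would visit every number from 0 to 20.
returned = (0..20).step(5) { |n| n }
p returned # => 0..20
```

Storing the array once and reusing it keeps later iterations on the stepped values.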

Resources