How to modify this kind of file in Ruby

I have a file like:
Fruit.Store={
#order:123, order:345, order:456
#order:789
"customer-id:12345,item:store/apple" = 10;
"customer-id:23456,item:store/banana" = 10;
"customer-id:23456,item:store/watermelon" = 10;
#order:987
"customer-id:67890,item:store/pear" = 10;
}
Except for the comment lines, each line has the same format: customer-id and item:store/ are fixed, and customer-id is a 5-digit number. There are about 1000 unique lines in the file.
When a new order is placed for the same customer-id and fruit type but with a different quantity, I want the order id to be added to the comment line above and the quantity updated. For example, if a new order 001 is placed with the information "customer-id:23456,item:store/watermelon" = 5; then we should have a new file:
Fruit.Store={
#order:123, order:345, order:456
#order:789, order:000
"customer-id:12345,item:store/apple" = 10;
"customer-id:23456,item:store/banana" = 10;
"customer-id:23456,item:store/watermelon" = 5;
#order:987
"customer-id:67890,item:store/pear" = 10;
}
Is it possible to do this in an efficient way? Because the file has to be read and written line by line, how can we detect the matching line and go back to the previous line to modify it? Thank you.

In short: no, it is not possible to do so in an efficient way. Your best bet is to open separate files for reading and writing, but even then you'll effectively be rewriting the entire file over and over.
Ultimately, you should be using some sort of relational database, like SQL. Those databases were practically invented for this exact use case.
I really don't like to tell people that the only solution is to do something entirely different from what they're doing, but in this case I can't stress enough how poorly text files scale for managing data.
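For completeness, here is what the rewrite-the-whole-file approach looks like, and with only ~1000 lines it is cheap. This is a minimal sketch (in Python rather than Ruby, purely to illustrate the idea; update_order and its parameters are made-up names): read every line, patch the matching item line and the comment line above it, then write the whole file back out.
def update_order(path, customer_id, item, quantity, new_order_id):
    key = '"customer-id:%s,item:store/%s"' % (customer_id, item)
    with open(path) as f:
        lines = f.read().splitlines()

    last_comment = None                     # index of the most recent "#order:..." line
    for i, line in enumerate(lines):
        stripped = line.lstrip()
        if stripped.startswith('#order'):
            last_comment = i
        elif stripped.startswith(key):
            lines[i] = '%s = %d;' % (key, quantity)
            if last_comment is not None:
                lines[last_comment] += ', order:%s' % new_order_id
            break

    with open(path, 'w') as f:              # rewrite the entire file
        f.write('\n'.join(lines) + '\n')
For the watermelon example above, the call would look like update_order('fruit.store', 23456, 'watermelon', 5, '000'), where the file name is a placeholder for whatever your file is actually called.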

Related

Random unique string against a blacklist

I want to create a random string of a fixed length (8 chars in my use case), and the generated string has to be case sensitive and unique against a blacklist. I know this sounds like a UUID, but I have a specific requirement that prevents me from using them: some characters are disallowed because they are lookalikes, e.g. I, l and 1, and likewise O and 0.
My initial implementation is solid and solves the task but performs poorly. And by poorly I mean it is doomed to be slower and slower every day.
This is my current implementation I want to optimize:
private function uuid()
{
    $chars = 'ABCDEFGHJKLMNPQRSTVUWXYZabcdefghijkmnopqrstvuwxyz23456789';
    $uuid = null;
    while (true) {
        $uuid = substr(str_shuffle($chars), 0, 8);
        if (null === DB::table('codes')->select('id')->whereRaw('BINARY uuid = ?', [$uuid])->first()) {
            break;
        }
    }
    return $uuid;
}
Please spare me the critique, we live in an agile world and this implementation is functional and is quick to code.
With a small set of data it works beautifully. However, if I have 10 million entries in the blacklist and try to create 1000 more, it falls flat, taking 30+ minutes.
A real use case would be to have 10+ million entries in the DB and to attempt to create 20 thousand new unique codes.
I was thinking of pre-seeding all allowed values but this would be insane:
(24+24+8)^8 = 9.6717312e+13
It would be great if the community can point me in the right direction.
Best,
Nikola
Two options:
Just use a hash of something unique, and truncate it so it fits in the width of your identifier. Hashes sometimes collide, so you will still need to check the database and retry if a code is already in use.
s = "This is a string that uniquely identifies voucher #1. Blah blah."
h = hash(s)
guid = truncate(h)
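To make that concrete, here is a sketch in Python using the allowed alphabet from the question (hashed_code is a made-up name; the modulo mapping introduces a small bias that is acceptable here):
import hashlib

CHARS = 'ABCDEFGHJKLMNPQRSTVUWXYZabcdefghijkmnopqrstvuwxyz23456789'

def hashed_code(seed):
    # Hash something unique about the voucher and map the first 8 digest
    # bytes into the allowed alphabet.
    digest = hashlib.sha256(seed.encode()).digest()
    return ''.join(CHARS[b % len(CHARS)] for b in digest[:8])
You would still run the result through a duplicate check (or the batch insert described further down) before handing it out.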
Generate five of the characters from an incrementing counter and three randomly. A thief will have a worse than 1 in 140,000 chance of guessing a code, depending on your character set.
u = Db.GetIncrementingCounter()
p = Random.GetCharacters(3)
guid = u + p
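And a sketch of the second option along the same lines; the counter is assumed to come from the database (for example an AUTO_INCREMENT value), and make_code is a made-up name:
import random

CHARS = 'ABCDEFGHJKLMNPQRSTVUWXYZabcdefghijkmnopqrstvuwxyz23456789'

def make_code(counter):
    # Encode the counter in base-len(CHARS), padded to 5 characters...
    base = len(CHARS)
    encoded = ''
    for _ in range(5):
        counter, r = divmod(counter, base)
        encoded = CHARS[r] + encoded
    # ...then append 3 random characters so codes cannot be guessed in sequence.
    suffix = ''.join(random.choice(CHARS) for _ in range(3))
    return encoded + suffix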
I ended up modifying the approach: instead of checking for uuid existence on every loop, e.g. 50K DB checks, I now split the generated codes into multiple chunks of 1000 codes and issue an INSERT IGNORE batch query within a transaction.
If the affected rows are as many as the items (1000 in this case) I know there wasn't a collision and I can commit the transaction. Otherwise I need to rollback the chunk and generate another 1000 codes.
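For reference, the batch idea looks roughly like this (sketched here with Python and pymysql rather than Laravel, so the client code is an assumption; the codes table and uuid column come from the question, and a case-sensitive unique index on uuid is assumed):
import pymysql

def insert_chunk(conn, codes):
    # Try to insert a whole chunk at once; INSERT IGNORE skips duplicates.
    with conn.cursor() as cur:
        cur.executemany(
            'INSERT IGNORE INTO codes (uuid) VALUES (%s)',
            [(c,) for c in codes])
        affected = cur.rowcount
    if affected == len(codes):
        conn.commit()        # every row was new: keep the chunk
        return True
    conn.rollback()          # at least one collision: regenerate this chunk
    return False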

Graphlab: How to avoid manually duplicating functions that differ only in a string variable?

I imported my dataset with SFrame:
products = graphlab.SFrame('amazon_baby.gl')
products['word_count'] = graphlab.text_analytics.count_words(products['review'])
I would like to do sentiment analysis on a set of words shown below:
selected_words = ['awesome', 'great', 'fantastic', 'amazing', 'love', 'horrible', 'bad', 'terrible', 'awful', 'wow', 'hate']
Then I would like to create a new column in the products SFrame for each of the selected words, where the entry is the number of times that word occurs, so I created a function for the word "awesome":
def awesome_count(word_count):
    if 'awesome' in word_count:
        return word_count['awesome']
    else:
        return 0

products['awesome'] = products['word_count'].apply(awesome_count)
So far so good, but I need to manually create other functions for each of the selected words in this way, e.g. great_count, etc. How can I avoid this manual effort and write cleaner code?
I think the SFrame.unpack command should do the trick. In fact, the limit parameter will accept your list of selected words and keep only these results, so that part is greatly simplified.
I don't know precisely what's in your reviews data, so I made a toy example:
# Create the data and convert to bag-of-words.
import graphlab
products = graphlab.SFrame({'review': ['this book is awesome',
                                       'I hate this book']})
products['word_count'] = \
    graphlab.text_analytics.count_words(products['review'])

# Unpack the bag-of-words into separate columns.
selected_words = ['awesome', 'hate']
products2 = products.unpack('word_count', limit=selected_words)

# Fill in zeros for the missing values.
for word in selected_words:
    col_name = 'word_count.{}'.format(word)
    products2[col_name] = products2[col_name].fillna(value=0)
I also can't help but point out that GraphLab Create does have its own sentiment analysis toolkit, which could be worth checking out.
I actually found an easier way to do this:
def wordCount_select(wc, selectedWord):
    if selectedWord in wc:
        return wc[selectedWord]
    else:
        return 0

for word in selected_words:
    products[word] = products['word_count'].apply(lambda wc: wordCount_select(wc, word))

How to see if a string exists in a huge (>19GB) sorted file?

I have files that can be 19GB or greater; they will be huge but sorted. Can I use the fact that they are sorted to my advantage when searching to see if a certain string exists?
I looked at something called sgrep, but I'm not sure if it's what I'm looking for. For example, I will have a 19GB text file with millions of rows of
ABCDEFG,1234,Jan 21,stackoverflow
and I want to search just the first column of these millions of rows to see if ABCDEFG exists in this huge text file.
Is there a more efficient way than just grepping this file for the string and seeing if a result comes back? I don't even need the line; I just need a boolean: true/false, whether it is inside this file.
Actually, sgrep is what I was looking for. The reason I got confused was that structured grep has the same name as sorted grep and I was installing the wrong package. sgrep is amazing.
I don't know if there are any utilities that would help you out of the box, but it would be pretty straightforward to write an application specific to your problem. A binary search would work well, and should yield your result within 20-30 queries against the file.
Let's say your lines are never more than 100 characters, and the file is B bytes long.
Do something like this in your favorite language:
sub file_has_line(file, target) {
    a = 0
    z = file.length
    while (a <= z) {
        m = (a + z) / 2
        chunk = file.read(m, 200)
        // That is, read 200 bytes, starting at offset m.
        line = chunk.split(/\n/)[1]
        // Split the chunk on newlines and keep the second piece,
        // i.e. the first complete line after the partial one we landed in.
        if line == target
            return true
        else if line < target
            a = m + 1      // target must be further down the file
        else
            z = m - 1      // target must be earlier in the file
    }
    return false
}
If you're only doing a single lookup, this will dramatically speed up your program. Instead of reading ~20 GB, you'll be reading ~20 KB of data.
You could try to optimize this a bit by extrapolating that "Xerox" is going to be at 98% of the file and starting the midpoint there...but unless your need for optimization is quite extreme, you really won't see much difference. The binary search will get you that close within 4 or 5 passes, anyway.
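A rough, runnable version of the same idea in Python for the single-lookup case. It assumes the file is sorted by whole lines, the first field is always followed by a comma, and the key is passed as bytes; file_has_key and the file name in the usage comment are placeholders:
def file_has_key(path, key):
    prefix = key + b','
    with open(path, 'rb') as f:
        f.seek(0, 2)
        lo, hi = 0, f.tell()                # binary search over byte offsets
        while lo < hi:
            mid = (lo + hi) // 2
            f.seek(mid)
            if mid > 0:
                f.readline()                # skip the partial line we landed in
            line = f.readline()
            if line and line < prefix:      # probe sorts before the key
                lo = mid + 1
            else:                           # probe is >= key, or end of file
                hi = mid
        f.seek(lo)
        if lo > 0:
            f.readline()
        return f.readline().startswith(prefix)

# file_has_key('huge_sorted.txt', b'ABCDEFG')  ->  True or False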
If you're doing lots of lookups (I just saw your comment that you will be), I would look to pump all that data into a database where you can query at will.
So if you're doing 100,000 lookups, but this is a one-and-done process where having it in a database has no ongoing value, you could take another approach...
Sort your list of targets, to match the sort order of the log file. Then walk through each in parallel. You'll still end up reading the entire 20 GB file, but you'll only have to do it once and then you'll have all your answers. Something like this:
sub file_has_lines(file, target_array) {
    target_array = target_array.sort
    target = target_array.shift()
    hits = []
    line = file.readln()
    do {
        if line < target
            line = file.readln()
        elsif line > target
            target = target_array.shift()
        else  // line == target
            hits.push(line)
            line = file.readln()
    } while (not file.eof() and target != null)
    return hits
}

Are there names for these two approaches to structuring code?

I do a good bit of programming in C for embedded systems. A while back I was explaining my code to a coworker, so as to reduce our bus factor. We got into a discussion about the way I structured my code. My code tends to be more like this:
while(1){
    //read the inputs
    input1 = pin4;
    input2 = pin5;

    //define the mode
    if (input1) mode = CHARGE;
    else if (input2) mode = BOOST;
    else mode = STANDBY;

    //define outputs
    if (mode == CHARGE) output1 = 1;
    else output1 = 0;
    if (mode == BOOST) output2 = 1;
    else output2 = 0;
}
His code tends to be more like this:
while(1){
    //handle first mode
    if (input1){
        mode = CHARGE;
        output1 = 1;
        output2 = 0;
    }
    //handle second mode
    else if (input2){
        mode = BOOST;
        output1 = 0;
        output2 = 1;
    }
}
The two are semantically identical, but the way you get from A to B is totally different.
In essence, mine is structured around making sure that any given variable is only set in exactly one place in the code wherever possible. Obviously there are some cases where that can't be, like the results of long strings of sequential calculations. But in general, I find that this makes my code much easier to debug. If something is wrong with the value of one particular variable, that problem can only exist in exactly one place. And if I find I need to insert intermediate flags between one variable and another, it's much easier if there's only one place to do that.
(I'm not sure how I got to this point. I didn't used to program this way. I think I learned it from lots of VHDL pain, back in the day.)
I wonder, are there names for these two approaches to structuring code? Further reading on their advantages and disadvantages would be of interest.
I can't come up with a particular name/definition for this other than coding style (which may have other meanings too).
I must say I prefer to read the second version of the code.
I understand why you say version 1 is easier to debug, but as the project grows you'll probably prefer something that is easier to understand over something that is easier to debug when you run into problems because you misunderstood it... :)

How to rename Pipe fields in cascading?

On two separate occasions, I've had to rename all the fields in a Pipe in order to join (using Merge or CoGroup). What I have done recently is:
//These two pipes contain similar values but different Field Names
Pipe papa = new Retain(papa, fieldsFrom);
Pipe pepe = new Retain(pepe, fieldsTo);
//Where fieldsFrom.size() == fieldsTo.size() and the fields positions match
for (int i = 0; i < fieldsFrom.size(); i++){
    pepe = new Rename(pepe, fieldsFrom.select(new Fields(i)),
                      fieldsTo.select(new Fields(i)));
}
//this allows me to do this
Pipe retVal = new Merge(papa, pepe);
Obviously this is pretty fragile, since I need to ensure the field positions in fieldsFrom and fieldsTo remain constant, that they are the same size, etc.
Is there a better - less fragile way to merge without going through all the ceremony above?
You can eliminate some ceremony by utilizing Rename's ability to handle aligned from/to fields like this:
pepe = new Rename(pepe, fieldsFrom, fieldsTo);
But this only eliminates the for loop; yes, you must ensure fieldsFrom and fieldsTo are the same size and aligned to correctly express the rename.
cascading.jruby addresses this by wrapping renaming in a function that accepts a mapping rather than aligned from/to fields.
It is also the case that Merge requires incoming pipes to declare the same fields, but CoGroup only requires that you provide declaredFields to ensure there are no name collisions on the output (all fields propagate through, even grouping keys from all inputs).
