Writing Capybara expectations to verify phone numbers - ruby

I'm using AWS Textract to pull information from PDF documents. After the scanned text is returned from AWS and persisted to a var, I'm doing this:
phone_number = '(555) 123-4567'
scanned_pdf_text.should have_text phone_number
But this fails about 20% of the time because of the non-deterministic way that AWS returns the scanned PDF text. On occasion, the phone numbers can appear in either of these two ways:
(555)123-4567 or (555) 123-4567
Some of this scanned text is very large, and I'd prefer not to go through the exercise of sanitizing the text coming back if I can avoid it (I'm also not good with regexes). I also think using OR logic to handle both cases seems a little heavy-handed just to check text that is so similar (and clearly near-identical to the human eye).
Is there an RSpec matcher that'll allow me to check for this text? I'm also using Capybara.default_normalize_ws = true, but that doesn't seem to help in this case.

Assuming scanned_pdf_text is a string and the only difference you're seeing is in the spaces, then you can just get rid of the spaces and compare:
scanned_pdf_text.gsub(/\s+/, '').should eq('(555)123-4567') # exact
scanned_pdf_text.gsub(/\s+/, '').should include('(555)123-4567') # partial (match would treat the parens as regex metacharacters)
scanned_pdf_text.gsub(/\s+/, '').should have_text('(555)123-4567') # partial
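If you'd rather not repeat the literal with its spaces pre-stripped, a small variation on the same idea (plain RSpec, assuming scanned_pdf_text is a String) is to normalize both sides:
phone_number = '(555) 123-4567'
scanned_pdf_text.gsub(/\s+/, '').should include(phone_number.gsub(/\s+/, ''))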

Related

Gensim most_similar() with Fasttext word vectors return useless/meaningless words

I'm using Gensim with FastText word vectors to return similar words.
This is my code:
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format('cc.it.300.vec')
words = model.most_similar(positive=['sole'],topn=10)
print(words)
This will return:
[('sole.', 0.6860659122467041), ('sole.Ma', 0.6750558614730835), ('sole.Il', 0.6727924942970276), ('sole.E', 0.6680260896682739), ('sole.A', 0.6419174075126648), ('sole.È', 0.6401025652885437), ('splende', 0.6336565613746643), ('sole.La', 0.6049465537071228), ('sole.I', 0.5922051668167114), ('sole.Un', 0.5904430150985718)]
The problem is that "sole" ("sun" in English) returns a series of words with a dot in them (like sole., sole.Ma, etc.). Where is the problem? Why does most_similar return these meaningless words?
EDIT
I tried with the English word vectors, and the word "sun" returns this:
[('sunlight', 0.6970556974411011), ('sunshine', 0.6911839246749878), ('sun.', 0.6835992336273193), ('sun-', 0.6780728101730347), ('suns', 0.6730450391769409), ('moon', 0.6499731540679932), ('solar', 0.6437565088272095), ('rays', 0.6423950791358948), ('shade', 0.6366724371910095), ('sunrays', 0.6306195259094238)] 
Is it impossible to reproduce results like relatedwords.org?
Perhaps the bigger question is: why does the Facebook FastText cc.it.300.vec model include so many meaningless words? (I haven't noticed that before – is there any chance you've downloaded a peculiar model that has decorated words with extra analytical markup?)
To gain the unique benefits of FastText – including the ability to synthesize plausible (better-than-nothing) vectors for out-of-vocabulary words – you may not want to use the general load_word2vec_format() on the plain-text .vec file, but rather a Facebook-FastText specific load method on the .bin file. See:
https://radimrehurek.com/gensim/models/fasttext.html#gensim.models.fasttext.load_facebook_vectors
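For example, a minimal sketch (assuming you've downloaded the matching cc.it.300.bin file):
from gensim.models.fasttext import load_facebook_vectors

# loads the native FastText model, subword n-grams included
wv = load_facebook_vectors('cc.it.300.bin')
print(wv.most_similar(positive=['sole'], topn=10))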
(I'm not sure that will help with these results, but if you're choosing to use FastText, you may be interested in using it "fully".)
Finally, given the source of this training – common-crawl text from the open web, which may contain lots of typos/junk – these might be legitimate word-like tokens, essentially typos of sole, that appear often enough in the training data to get word-vectors. (And because they really are typo-synonyms for 'sole', they're not necessarily bad results for all purposes, just for your desired purpose of only seeing "real-ish" words.)
You might find it helpful to try using the restrict_vocab argument of most_similar(), to only receive results from the leading (most-frequent) part of all known word-vectors. For example, to only get results from among the top 50000 words:
words = model.most_similar(positive=['sole'], topn=10, restrict_vocab=50000)
Picking the right value for restrict_vocab might help in practice to leave out long-tail 'junk' words, while still providing the real/common similar words you seek.

Script for Large CSV File to Convert IPv6 addresses to Number (or String)

So I have a big CSV file, over 1 GB. There's a column with IP addresses in IPv4 and IPv6. I want to convert the IPv6 addresses into numbers, but there are too many rows for LibreOffice Calc. So I'm wondering if it's possible to use Python in the terminal to convert all the IPv6 addresses.
Also, I could split the file up into smaller pieces, then use LibreOffice Calc, but same problem--I wouldn't know how to script that either.
EDIT:
I don't mind, it might get more complicated though. Also not sure how this should be formatted, but I hope people get the idea... So I have one table with IPv6 addresses like these examples:
2001:db8::cafe:1111
2001:db8:0:a:1:2:3:4
2001:db8:aaaa::c
2001:db8:0:0:1::4
There are a bunch of different rules that govern the formatting--way too hard for me. I've heard that Python has a function that will specifically return the conversion, but I'm not sure about the rest (how to get the returned values back into the CSV correctly, with formatting unbroken, etc.). Anyway, here's a row from the other table:
"58569107296622255421594597096899477504","58569107375850417935858934690443427839","NG","Nigeria","Abuja Federal Capital Territory","Abuja","9.057350","7.489760"
So the part I need to match is the first two numbers (first two columns), where there are several ranges from
"0","340282366920938463463374607431768211455"
So I wanted to take the IPv6 addresses, convert them to IP numbers, then sort them into their respective ranges.
Yes, this is something you can do in Python. I'll demonstrate with a few short snippets and links to documentation that will fall short of a full solution in favor of empowering you with the resources that you need to put the pieces together yourself.
First off, if you want to load one CSV file line-by-line and write to a second one, this is how you would do it:
>>> import csv
>>> with open('eggs.csv', newline='') as fin, open('omelette.csv', 'w', newline='') as fout:
...     r = csv.reader(fin)
...     w = csv.writer(fout)
...     for row in r:
...         row[0] = ipToNum(row[0])  # convert the first two columns
...         row[1] = ipToNum(row[1])
...         w.writerow(row)  # csv.writer stringifies the numbers on output
(ipToNum is defined below.)
The original on which this example was based, and additional information about Python's built-in CSV capabilities, can be found here:
https://docs.python.org/3/library/csv.html
You will probably need to make adjustments depending on the exact formatting of your particular CSV file. Now, to convert IP addresses to numbers you can do something like the following:
import socket

def ipToNum(ip):
    "convert an IPv4/IPv6 string to an integer"
    family = socket.AF_INET6 if ':' in ip else socket.AF_INET
    # inet_pton needs the address family; IPv6 packs to 16 bytes, IPv4 to 4
    return int.from_bytes(socket.inet_pton(family, ip), 'big')

def numToDottedip(n, family=socket.AF_INET6):
    "convert an integer back to an IPv4/IPv6 string"
    length = 16 if family == socket.AF_INET6 else 4
    return socket.inet_ntop(family, n.to_bytes(length, 'big'))
This example is adapted from what I found here:
https://www.oreilly.com/library/view/python-cookbook/0596001673/ch10s06.html
You will have to modify it to fit your exact data and needs.
Also, if you want to learn more about the socket module (and the struct module used in the original recipe), here is the documentation:
https://docs.python.org/3/library/socket.html
https://docs.python.org/3/library/struct.html
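If you'd rather sidestep socket and struct entirely, a simpler sketch (my suggestion, not part of the original recipe) uses the stdlib ipaddress module, which knows all of the IPv6 formatting rules, including '::' compression:
import ipaddress

def ip_to_num(ip):
    # works for both IPv4 and IPv6 strings
    return int(ipaddress.ip_address(ip))

ip_to_num('2001:db8::cafe:1111')  # returns the address as a (possibly huge) integer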
You shouldn't need to split the file up, since the CSV reader object only returns one line at a time rather than reading the whole file in at once. Of course, you probably also want to actually do something with those numbers once you've read them in, but since you didn't specify, I'll leave figuring that out to you.
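That said, if you want a head start on the range-matching step from the question, the stdlib bisect module is one direction (the data below is made up for illustration):
import bisect

# hypothetical: the ranges table loaded as sorted (start, end, country) tuples
ranges = [(0, 99, 'AA'), (100, 199, 'BB')]
starts = [r[0] for r in ranges]

def find_range(ip_num):
    i = bisect.bisect_right(starts, ip_num) - 1  # rightmost range starting at or below ip_num
    if i >= 0 and ip_num <= ranges[i][1]:
        return ranges[i]
    return None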
Also note that I haven't tried any of this code. It's worth repeating the old metaphor here: I'm trying to teach you to fish rather than just giving you a fish. It's in your best interest to take this advice and wrestle with getting it to work yourself, as that would be your first step toward actually being a programmer.

Yahoo Pipes: Extracting number from feed item for use in URL builder

Been looking all over the place for a solution to this issue. I have a Yahoo Pipe (http://pipes.yahoo.com/pipes/pipe.info?_id=e5420863cfa494ee40e4c9be43f0e812) that I've created to pull back image content from the Bing Search API. The URL builder includes a $skip attribute that takes an integer and uses it to select the starting (index) point for the result set that the query returns.
My initial plan had been to use the math engine in the Wolfram Alpha API to generate a random number (randomInteger[1000]) that I could use to seed the $skip value each time the pipe is run. I have an earlier version of the pipe where I was able to get the query / result steps working using either "XPath Fetch" or "Fetch Data". However, regardless of how I fetch the result, the response comes back as an attribute / value pair in a list item. Even when I use "Emit items as string" in XPath Fetch, I still get a list with a single item, when what I really want is the integer that I can plug into my $skip attribute.
I've tried everything in Pipes I can think of, and spent a lot of time online looking for an answer. Is there any way to extract text (in this case, a number) from a single list item and then use the output as input to "wire" a text parameter in another Pipes block? Any suggestions / ideas welcome. In the meantime, I'm generating a sorta-random number by manipulating a timecode hash, but it just feels tacky :-)
Thanks!
All the sources are for repeated items. You can't have a source that just makes a single number.
I'm not really clear what you're trying to do. You want to put a random number into part of the URL string that gets an RSS feed?

Convert to E164 only if possible?

Can I determine if the user entered a phone number that can be safely formatted into E164?
For Germany, this requires that the user started his entry with a local area code. For example, 123456 may be a subscriber number in his city, but it cannot be formatted into E164, because we don't know his local area code. Then I would like to keep the entry as it is. In contrast, the input 089123456 is independent of the area code and could be formatted into E164, because we know he's from Germany and we could convert this into +4989123456.
You can simply convert your number into E164 using libphonenumber, and after the conversion check whether the two strings are the same. If they are the same, the number could not be formatted; otherwise, the string you get back from the library is the number formatted in E164.
Here's how you can convert (the raw input has to be parsed into a PhoneNumber object first; "DE" is the region code for Germany):
PhoneNumberUtil phoneUtil = PhoneNumberUtil.getInstance();
Phonenumber.PhoneNumber inputNumber = phoneUtil.parse(userInput, "DE");
String formattedNumber = phoneUtil.format(inputNumber, PhoneNumberFormat.E164);
Finally, compare formattedNumber with the original input string.
It looks as though you'll need to play with isValidNumber and isPossibleNumber for your case. format is certainly not guaranteed to give you something actually dialable, see the javadocs. This is suggested by the demo as well, where formatting is not displayed when isValidNumber is false.
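A sketch of that gating (my assembly of the pieces, not from the docs; it assumes Germany, "DE", as the default region):
PhoneNumberUtil util = PhoneNumberUtil.getInstance();
String result;
try {
    Phonenumber.PhoneNumber parsed = util.parse(userInput, "DE");
    result = util.isValidNumber(parsed)
            ? util.format(parsed, PhoneNumberUtil.PhoneNumberFormat.E164)
            : userInput;  // e.g. a bare subscriber number: keep it as entered
} catch (NumberParseException e) {
    result = userInput;  // not parseable at all, keep the original entry
}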
I'm also dealing with this, FWIW. In the context of US numbers: the issue is I'd like to parse using isPossibleNumber in order to be as lenient as possible, and store the number in E164. However, then we accept e.g. +15551212. This string even passes isPossibleNumber despite clearly (I think) not being dialable anywhere.

Ruby on Rails - generating bit.ly style identifiers

I'm trying to generate UUIDs with the same style as bit.ly urls like:
http://bit [dot] ly/aUekJP
or cloudapp ones:
http://cl [dot] ly/1hVU
which are even smaller
how can I do it?
I'm now using the UUID gem for Ruby, but I'm not sure if it's possible to limit the length and get something like this.
I am currently using this:
UUID.generate.split("-")[0] # => "b9386070"
But I would like something even smaller, while knowing that it will be unique.
Any help would be pretty much appreciated :)
edit note: replaced dot letters with [dot] for workaround of banned short link
You are confusing two different things here. A UUID is a universally unique identifier. It has a very high probability of being unique even if millions of them were being created all over the world at the same time. It is generally displayed as a 36-digit string. You cannot chop off the first 8 characters and expect it to be unique.
Bitly, TinyURL et al. store links and generate a short code to represent each link. They do not reconstruct the URL from the code; they look it up in a data-store and return the corresponding URL. These are not UUIDs.
Without knowing your application it is hard to advise on which method you should use, however you could store whatever you are pointing at in a data-store with a numeric key and then rebase the key to base32 using the 10 digits and 22 lowercase letters, perhaps avoiding the obvious typo problems like 'o', 'i', 'l', etc.
EDIT
On further investigation, there is a Ruby base32 gem available that implements Douglas Crockford's Base32 encoding.
A 5 character Base32 string can represent over 33 million integers and a 6 digit string over a billion.
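For instance, a quick sketch (the require path and module names here assume the base32-crockford gem; check whichever gem you pick):
require 'base32/crockford'

Base32::Crockford.encode(33554431) # => "ZZZZZ", the largest 5-character value
Base32::Crockford.decode("ZZZZZ") # => 33554431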
If you are working with numbers, you can use the built-in Ruby methods:
6175601989.to_s(30)
=> "8e45ttj"
to go back
"8e45ttj".to_i(30)
=> 6175601989
So you don't have to store anything, you can always decode an incoming short_code.
This works ok for proof of concept, but you aren't able to avoid ambiguous characters like: 1lji0o. If you are just looking to use the code to obfuscate database record IDs, this will work fine. In general, short codes are supposed to be easy to remember and transfer from one medium to another, like reading it on someone's presentation slide, or hearing it over the phone. If you need to avoid characters that are hard to read or hard to 'hear', you might need to switch to a process where you generate an acceptable code, and store it.
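A sketch of that generate-and-store approach (ShortLink is a hypothetical ActiveRecord model; the alphabet just drops the usual look-alikes):
# alphabet without the easily confused characters (0/o, 1/l/i)
ALPHABET = (('a'..'z').to_a + ('2'..'9').to_a) - %w[i l o]

def generate_code(length = 6)
  Array.new(length) { ALPHABET.sample }.join
end

# regenerate on collision, then persist the code with the record
begin
  code = generate_code
end while ShortLink.exists?(code: code)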
I found this to be short and reliable:
def create_uuid(prefix=nil)
  time = (Time.now.to_f * 10_000_000).to_i
  jitter = rand(10_000_000)
  key = "#{jitter}#{time}".to_i.to_s(36)
  [prefix, key].compact.join('_')
end
This spits out unique keys that look like this: '3qaishe3gpp07w2m'
Reduce the 'jitter' size to reduce the key size.
Caveat:
This is not guaranteed unique (use SecureRandom.uuid for that), but it is highly reliable:
10_000_000.times.map {create_uuid}.uniq.length == 10_000_000
The only way to guarantee uniqueness is to keep a global count and increment it for each use: 0000, 0001, etc.
