regular expression for creating email - ruby

I am new to programming. I need a regular expression for an email address of the form
username@domain.extension
daniel@gmail.com
What are the minimum and maximum numbers of characters in the username?
What are the minimum and maximum numbers of characters in the domain?
What are the minimum and maximum numbers of characters in the extension?
Which characters are commonly accepted in each part of the email address?
This is my regular expression:
/^([a-z 0-9_\.-]{3,10})@([\da-z\.-]+)\.([a-z\.]{2,6})$/
I'd like to improve it.

If you want to build this from scratch, start by reading the RFC for email (RFC 2822, since superseded by RFC 5322):
https://www.rfc-editor.org/rfc/rfc2822
Otherwise, either stick with something simple like what you have, or use a gem whose authors have already worked through that document:
https://github.com/validates-email-format-of/validates_email_format_of
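As a minimal sketch of the "something simple" option, the Ruby below pairs a deliberately permissive pattern with the length limits from the RFCs: 64 characters for the local part (RFC 5321), 255 for the domain and 63 per label (RFC 1035). The names SIMPLE_EMAIL and plausible_email? are hypothetical, and the accepted character set is an assumption rather than the full RFC grammar. It uses \A and \z because in Ruby ^ and $ match at line breaks, not only at the ends of the string.

```ruby
# Hypothetical, simplified pattern: local part, "@", dotted domain, alphabetic TLD.
SIMPLE_EMAIL = /\A([a-z0-9_.+-]+)@((?:[a-z0-9-]+\.)+[a-z]{2,})\z/i

MAX_LOCAL  = 64   # RFC 5321 limit for the part before "@"
MAX_DOMAIN = 255  # RFC 1035 limit for the whole domain name
MAX_LABEL  = 63   # RFC 1035 limit for each dot-separated label

# Returns true when the address matches the simple pattern and the length limits.
def plausible_email?(address)
  match = SIMPLE_EMAIL.match(address)
  return false unless match

  local = match[1]
  domain = match[2]
  return false if local.length > MAX_LOCAL || domain.length > MAX_DOMAIN

  domain.split(".").all? { |label| label.length.between?(1, MAX_LABEL) }
end
```

For example, plausible_email?("daniel@gmail.com") passes, while an address with a 65-character username or without a dot in the domain is rejected. A check like this only says the string looks like an address, not that the mailbox exists.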

Related

how to detect if the barcode is for Weight Scale Item

I wonder how we can detect whether a barcode read by a barcode reader belongs to an item sold by weight or to a regular item (in EAN-13 or other formats). Is there any part of the code that shows that it is a weighted item?
Barcodes are just strings of characters (mostly numbers and letters) and most barcode readers/scanners do not indicate the type of barcode. They just send the value. But some values, such as an EAN13, have embedded check digits that can be used to auto-discriminate. For example, if you see a 13-digit number and calculate the mod10 check digit over the first 12 digits and it matches the 13th digit, you can be fairly certain you have an EAN13.
Alternatively, if you have control over the creation of the barcodes, you can use GS1 application identifiers to prefix each value. (GS1 barcodes can actually contain multiple values in a single symbol.) See https://www.gs1.org/standards/barcodes/application-identifiers?lang=en for more information on the standard ids. Application ids are routinely used in logistics but are fairly rare in retail channels.
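The check-digit test described above can be sketched in Ruby as follows. The method name ean13? is hypothetical. The weights (1 and 3, alternating from the left over the first 12 digits) are the standard EAN-13 mod-10 scheme. A passing check only suggests the value is an EAN-13; it does not by itself say whether the item is sold by weight.

```ruby
# Returns true when the value is 13 digits and the 13th digit equals the
# mod-10 check digit computed over the first 12 digits.
def ean13?(value)
  return false unless value.match?(/\A\d{13}\z/)

  digits = value.chars.map(&:to_i)
  # Digits at even indexes (0, 2, ...) have weight 1, the others weight 3.
  sum = digits.first(12).each_with_index.sum { |d, i| i.even? ? d : d * 3 }
  (10 - sum % 10) % 10 == digits.last
end
```

With the sample value "4006381333931" the weighted sum is 89, so the expected check digit is 1 and the test passes.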

Minimum Length for Domain Name

I need to validate the length of a domain name in my PHP program. Can you suggest the minimum and maximum lengths required for a domain name? I searched on Google but could not find a proper answer.
According to this resource, you can find websites with a single letter followed by .com. I hope that helps.
The max limit is set out here https://www.rfc-editor.org/rfc/rfc1035#section-2.3.4.
Label limit 63 characters (strings between periods)
Name limit 255 characters (total of the strings combined together)
The minimum domain name length is 3 characters and the maximum is 63. Names of only 1 or 2 characters are very expensive, and you may not want to allow them because they are not yet open for public registration.
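A minimal sketch of the length check, written in Ruby for illustration (the same logic ports directly to PHP). The method name domain_length_ok? and the min_label parameter are hypothetical. The 63 and 255 limits are the RFC 1035 values quoted above, and the minimum label length is left configurable because the answers disagree on it (1 versus 3).

```ruby
MAX_LABEL_LENGTH = 63   # per label, RFC 1035
MAX_NAME_LENGTH  = 255  # whole name, RFC 1035

# Checks only lengths, not which characters are allowed.
# Assumes the name is given without a trailing dot.
def domain_length_ok?(name, min_label: 1)
  return false if name.empty? || name.length > MAX_NAME_LENGTH

  # The -1 limit keeps empty labels, so "example..com" is rejected.
  labels = name.split(".", -1)
  labels.all? { |label| label.length.between?(min_label, MAX_LABEL_LENGTH) }
end
```

Calling domain_length_ok?("x.com") accepts a single-letter name, while domain_length_ok?("x.com", min_label: 3) applies the stricter rule from the last answer.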

Substitution cipher decryption using letter frequency analysis for text without blanks and special characters

I need to find the plain text for given cipher text. I also have statistics (in an Excel document) for the letters in the given language e.g. I have the frequencies of the letters and also of the digraphs.
I tried this approach so far: I evaluated the frequency of each letter in the cipher text I received. Then I sorted the letters in descending order by their frequencies and mapped each letter with the corresponding letter from the Excel document. The problem with this approach is that it gives me some text that has no meaning at all. That is because my text is pretty small (only 1500 characters long).
I considered trying a limited number of permutations, but I have no idea what I could use to evaluate how good a given permutation is. I think a good evaluation function would solve my problem.
Be aware that all special characters and white spaces are removed from the text. Also there are no numbers.
Thank you in advance.
For fully automated decryption you need a dictionary of commonly used words to compare against. The candidate solution that yields the most dictionary words is probably the right one.
Letter probabilities come with a few problems. They are derived from common texts, so if your encrypted text is, for example, a technical paper rather than fiction, or if it includes equations or tables, the overall letter frequencies can be skewed.
So do it like this:
1. Compute the probabilities of the letters.
2. Divide the letters into groups by probability.
   Commonly used (high-probability) letters go together (group A), less commonly used (mid-probability) letters go together (group B), and the rest (low-probability) form group C.
3. Substitute group A.
   First check whether the group A probabilities match your language. If they do not, the text is in a different language or style/form, or it is not plain text at all, and in that case you cannot proceed safely. If they match, substitute the letters from group A. They should be correct on the first run.
4. Try to substitute group B.
   You know all the letters in group B (encrypted and decrypted), so generate all permutations of the substitutions. For each one, decipher the text and search for dictionary words in the result (ignoring letters not yet substituted). Compute the percentage of words found and remember the best permutation (or the top few).
5. Try to substitute group C.
   Do it the same way as step 4.
6. Corrections.
   The final result will probably still have a few letters mixed up, and there are ways to handle this too. You can keep a table of letters that are easily confused with each other, permute those, and test against your dictionary. Or you can find words in your text with 1-2 wrong letters per word (for longer words of 5 or more letters) and permute or correct the substitution of the wrong letters if enough such words are found.
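The first two steps above (computing letter probabilities and splitting the letters into groups) could be sketched in Ruby like this. The method names and the two thresholds (0.06 and 0.02) are assumptions for illustration only; sensible cut-offs depend on the language statistics you have.

```ruby
# Relative frequency of each letter in the text (a-z only, as in the question).
def letter_frequencies(text)
  counts = text.downcase.scan(/[a-z]/).tally
  total = counts.values.sum.to_f
  counts.transform_values { |n| n / total }
end

# Split letters into groups A (high), B (mid), and C (low probability),
# each sorted from most to least frequent. Thresholds are hypothetical.
def frequency_groups(freqs, high: 0.06, low: 0.02)
  sorted = freqs.sort_by { |_, f| -f }.map(&:first)
  {
    a: sorted.select { |c| freqs[c] >= high },
    b: sorted.select { |c| freqs[c] < high && freqs[c] >= low },
    c: sorted.select { |c| freqs[c] < low }
  }
end
```

Running the same two functions on the cipher text and on a reference text in the target language gives two sets of groups that can be compared before any substitution is attempted.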
[notes]
You can obtain dictionaries from translators; I have also seen some plain-text translator tables online.
The groups should have distinct probability differences from each other.
The number of groups can change with the language.
I had the best results for this task with a semi-automated approach: steps 5 and 6 can use user input.
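The evaluation function the question asks for, and the permutation search over one group, might look like the sketch below. It is one possible reading of "search for words and compute the percentage": since the text has no spaces, it scores a candidate by the fraction of characters covered by some dictionary word. All names are hypothetical, the dictionary is assumed to hold non-empty lowercase words, and the search is factorial in the group size, so it is only practical for small groups.

```ruby
# Apply a partial substitution; cipher letters without a mapping become "?".
def decipher(cipher_text, mapping)
  cipher_text.each_char.map { |c| mapping.fetch(c, "?") }.join
end

# Fraction of characters in the text covered by at least one dictionary word.
def word_coverage(text, dictionary)
  return 0.0 if text.empty?

  covered = Array.new(text.length, false)
  dictionary.each do |word|
    start = 0
    while (pos = text.index(word, start))
      (pos...(pos + word.length)).each { |i| covered[i] = true }
      start = pos + 1
    end
  end
  covered.count(true).to_f / text.length
end

# Try every assignment of plain letters to one group's cipher letters.
# Returns [best_score, best_mapping].
def best_group_mapping(cipher_text, known, cipher_letters, plain_letters, dictionary)
  scored = plain_letters.permutation(cipher_letters.length).map do |perm|
    candidate = known.merge(cipher_letters.zip(perm).to_h)
    [word_coverage(decipher(cipher_text, candidate), dictionary), candidate]
  end
  scored.max_by(&:first)
end
```

The mapping passed as known is the substitution already fixed by earlier groups, so the result for group B can be fed in as known when working on group C.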

What is the maximum length of a display name in an email address

According to the question "What is the maximum length of a valid email address?", the maximum length of the address is 254. But I would like to know the maximum length of the display name:
Display Name <my@examplemailaddress.net>
According to https://www.ietf.org/mail-archive/web/ietf-822/current/msg00086.html the size is unlimited, but in practice, according to https://www.ietf.org/mail-archive/web/ietf-822/current/msg00088.html, it would be 72 characters. I suspect this answer is a bit outdated. What would be a reasonable limit today?
If you are asking about the maximum length allowed by the specs (the normative source is currently RFC 5322), then there is indeed no limit, since folding lets a field be of unlimited length (while each line still respects the advised 78-character [or the larger 998-character] limit).
The practical limit is very subjective, since "practical" would mean the length that is fully displayed by "most" clients and environments, and that is pretty hard to determine.
I would say the upper limit of practicality is a total length of 78 characters from the "From:" header up to the last ">" character of the email address, since longer ones will probably get broken up when displayed in almost all environments. That leaves around 40 characters for the display name, even with longer email addresses.
Most clients, however, probably expect to display around 20-25 characters under normal circumstances.
These are all counts of displayed characters, not the length in bytes of the address in whatever encoding is used (especially for long UTF-8 sequences).
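The "around 40 characters" estimate can be reproduced with a little arithmetic. The Ruby sketch below assumes the whole header should fit on one 78-character line in the form `From: Display Name <address>`. The method name display_name_budget is hypothetical, and it counts characters rather than encoded bytes, in line with the caveat above.

```ruby
LINE_LIMIT = 78 # advised line length from RFC 5322, excluding the line break

# Characters left for the display name once the header name, the space,
# the angle brackets, and the address itself are accounted for.
def display_name_budget(address, header: "From")
  LINE_LIMIT - "#{header}: ".length - " <#{address}>".length
end
```

For the 25-character address in the question, display_name_budget("my@examplemailaddress.net") leaves 44 characters for the display name.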

How to recognize words in text with non-word tokens?

I am currently parsing a bunch of mails and want to get words and other interesting tokens out of mails (even with spelling errors or combination of characters and letters, like "zebra21" or "customer242"). But how can I know that "0013lCnUieIquYjSuIA" and "anr5Brru2lLngOiEAVk1BTjN" are not words and not relevant? How to extract words and discard tokens that are encoding errors or parts of pgp signature or whatever else we get in mails and know that we will never be interested in those?
You need to decide on good-enough criteria for a word and write a regular expression or manual checks to enforce them.
A few rules that can be extrapolated from your examples:
Words can start with a capital letter or be all capital letters, but if a token has more than, say, 2 uppercase letters and more than 2 lowercase letters, it's not a word.
If it has numbers inside, it's not a word.
If it's longer than, say, 20 characters, it's not a word.
There's no magic trick. You need to decide what you want the rules to be and make them happen.
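The three rules above translate almost directly into code. The Ruby below is a minimal sketch with a hypothetical method name; the thresholds 20 and 2 are the "say" values from the rules and are meant to be tuned. Note that, taken literally, the digit rule also rejects tokens such as "zebra21".

```ruby
# Applies the three heuristic rules; returns false as soon as one is violated.
def looks_like_word?(token, max_length: 20, max_mixed: 2)
  return false if token.length > max_length   # too long
  return false if token.match?(/\d/)          # contains a digit

  upper = token.count("A-Z")
  lower = token.count("a-z")
  # Many uppercase AND many lowercase letters suggests an encoded blob.
  !(upper > max_mixed && lower > max_mixed)
end
```

With these defaults "Zebra" and "NASA" are accepted, while both sample tokens from the question are rejected.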
An alternative is to train some kind of hidden Markov model to recognize things that sound like words, but I think that is overkill for what you want to do.
http://en.wikipedia.org/wiki/English_words_with_uncommon_properties
You can make rules that reject anything with these "uncommon properties" to build a system that accepts most actual words.
Although I generally agree with shoosh's answer, his approach makes it easy to achieve high recall but also low precision, i.e. you would get almost all real words but also a lot of non-words. If your definition of a word is too restrictive, it's the other way around, but that's also not what you want, since then you would miss cases like 'zebra123'. So here are a few ideas about how to improve precision:
It may be worthwhile to think about whether you can determine which parts of an email belong to the main text and which are footers like PGP signatures. I'm sure it's possible to find some simple heuristics that match most cases, e.g. cut off everything below a line that consists only of '-' characters.
Depending on your performance criteria, you may want to check whether a token is a real word or contains a real word by matching against a simple word list. It's easy to find quite exhaustive lists of English words on the web, and you could also compile one yourself by extracting words from a large and clean text corpus.
Using a lexical analyser, you could filter out every token that is marked as unknown.
Some simple statistics may tell you how likely it is that something is a word. Tokens that occur with high frequency are most probably words. Tokens that appear only once, or whose count is below a certain threshold, are very probably not words. Common spelling errors should appear more than once, and uncommon ones may be ignored.
Some of these suggestions clearly don't work for cases like 'zebra123'. Again, simply cutting off, or splitting on, in-word numbers may do the trick.
My general approach would be to first identify tokens that certainly are words (using the suggestions above), then identify tokens that certainly are not words (using a regular expression), and then look (with your eyes) at the few hundred or thousand remaining tokens to find common characteristics and handle these separately.
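A few of these ideas (cutting off the footer, matching against a word list after stripping trailing digits, and a frequency threshold) can be combined in a short Ruby sketch. The method names, the bucket labels, and the min_count default are hypothetical, and the word list is assumed to hold lowercase words.

```ruby
require "set"

# Keep only the lines above the first line made up solely of '-' characters.
def strip_footer(body)
  body.lines.take_while { |line| !line.match?(/\A-+\s*\z/) }.join
end

# Sort unique tokens into :word, :frequent, and :unknown buckets.
def classify_tokens(tokens, word_list, min_count: 2)
  counts = tokens.tally
  tokens.uniq.group_by do |token|
    stem = token.downcase.sub(/\d+\z/, "") # "zebra123" -> "zebra"
    if word_list.include?(stem)
      :word
    elsif counts[token] >= min_count
      :frequent
    else
      :unknown
    end
  end
end
```

The :unknown bucket is the set left over for manual inspection, as suggested in the last paragraph above.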

Resources