I just went through a problem, where input is a string which is a single word.
This line is not readable,
Like, I want to leave is written as Iwanttoleave.
The problem is of separating out each of the tokens(words, numbers, abbreviations, etc)
I have no idea where to start
The first thought that came to my mind is making a dictionary and then mapping accordingly but I think making a dictionary is not at all a good idea.
Can anyone suggest some algorithm to do it ?
First of all, create a dictionary which helps you to identify if some string is a valid word or not.
bool isValidString(String s){
if(dictionary.contains(s))
return true;
return false;
}
Now, you can write a recursive code to split the string and create an array of actually useful words.
ArrayList usefulWords = new ArrayList<String>; //global declaration
void split(String s){
int l = s.length();
int i,j;
for(i = l-1; i >= 0; i--){
if(isValidString(s.substr(i,l)){ //s.substr(i,l) will return substring starting from index `i` and ending at `l-1`
usefulWords.add(s.substr(i,l));
split(s.substr(0,i));
}
}
}
Now, use these usefulWords to generate all possible strings. Maybe something like this:
ArrayList<String> splits = new ArrayList<String>[10]; //assuming max 10 possible outputs
ArrayList<String>[] allPossibleStrings(String s, int level){
for(int i = 0; i < s.length(); i++){
if(usefulWords.contains(s.substr(0,i)){
splits[level].add(s.substr(0,i));
allPossibleStrings(s.substr(i,s.length()),level);
level++;
}
}
}
Now, this code gives you all possible splits in a somewhat arbitrary manner. eg.
dictionary = {cat, dog, i, am, pro, gram, program, programmer, grammer}
input:
string = program
output:
splits[0] = {pro, gram}
splits[1] = {program}
input:
string = iamprogram
output:
splits[0] = {i, am, pro, gram} //since `mer` is not in dictionary
splits[1] = {program}
I did not give much thought to the last part, but I think you should be able to formulate a code from there as per your requirement.
Also, since no language is tagged, I've taken the liberty of writing the code in JAVA-like syntax as it is really easy to understand.
Instead of using a Dictionary, I'd suggest you use a Trie with all your valid words (the whole English dictionary?). Then you can start moving one letter at a time in your input line and the trie at the same time. If the letter leads to more results in the trie, you can continue expanding the current word, and if not, you can start looking for a new word in the trie.
This won't be a forward only search for sure, so you'll need some sort of backtracking.
// This method Generates a list with all the matching phrases for the given input
List<string> CandidatePhrases(string input) {
Trie validWords = BuildTheTrieWithAllValidWords();
List<string> currentWords = new List<string>();
List<string> possiblePhrases = new List<string>();
// The root of the trie has an empty key that points to all the first letters of all words
Trie currentWord = validWords;
int currentLetter = -1;
// Calls a backtracking method that creates all possible phrases
FindPossiblePhrases(input, validWords, currentWords, currentWord, currentLetter, possiblePhrases);
return possiblePhrases;
}
// The Trie structure could be something like
class Trie {
char key;
bool valid;
List<Trie> children;
Trie parent;
Trie Next(char nextLetter) {
return children.FirstOrDefault(c => c.key == nextLetter);
}
string WholeWord() {
Debug.Assert(valid);
string word = "";
Trie current = this;
while (current.Key != '\0')
{
word = current.Key + word;
current = current.parent;
}
}
}
void FindPossiblePhrases(string input, Trie validWords, List<string> currentWords, Trie currentWord, int currentLetter, List<string> possiblePhrases) {
if (currentLetter == input.Length - 1) {
if (currentWord.valid) {
string phrase = ""
foreach (string word in currentWords) {
phrase += word;
phrase += " ";
}
phrase += currentWord.WholeWord();
possiblePhrases.Add(phrase);
}
}
else {
// The currentWord may be a valid word. If that's the case, the next letter could be the first of a new word, or could be the next letter of a bigger word that begins with currentWord
if (currentWord.valid) {
// Try to match phrases when the currentWord is a valid word
currentWords.Add(currentWord.WholeWord());
FindPossiblePhrases(input, validWords, currentWords, validWords, currentLetter, possiblePhrases);
currentWords.RemoveAt(currentWords.Length - 1);
}
// If either the currentWord is a valid word, or not, try to match a longer word that begins with current word
int nextLetter = currentLetter + 1;
Trie nextWord = currentWord.Next(input[nextLetter]);
// If the nextWord is null, there was no matching word that begins with currentWord and has input[nextLetter] as the following letter.
if (nextWord != null) {
FindPossiblePhrases(input, validWords, currentWords, nextWord, nextLetter, possiblePhrases);
}
}
}
I have a problem in regards of extracting signed int from string in c++.
Assuming that i have a string of images1234, how can i extract the 1234 from the string without knowing the position of the last non numeric character in C++.
FYI, i have try stringstream as well as lexical_cast as suggested by others through the post but stringstream returns 0 while lexical_cast stopped working.
int main()
{
string virtuallive("Images1234");
//stringstream output(virtuallive.c_str());
//int i = stoi(virtuallive);
//stringstream output(virtuallive);
int i;
i = boost::lexical_cast<int>(virtuallive.c_str());
//output >> i;
cout << i << endl;
return 0;
}
How can i extract the 1234 from the string without knowing the position of the last non numeric character in C++?
You can't. But the position is not hard to find:
auto last_non_numeric = input.find_last_not_of("1234567890");
char* endp = &input[0];
if (last_non_numeric != std::string::npos)
endp += last_non_numeric + 1;
if (*endp) { /* FAILURE, no number on the end */ }
auto i = strtol(endp, &endp, 10);
if (*endp) {/* weird FAILURE, maybe the number was really HUGE and couldn't convert */}
Another possibility would be to put the string into a stringstream, then read the number from the stream (after imbuing the stream with a locale that classifies everything except digits as white space).
// First the desired facet:
struct digits_only: std::ctype<char> {
digits_only(): std::ctype<char>(get_table()) {}
static std::ctype_base::mask const* get_table() {
// everything is white-space:
static std::vector<std::ctype_base::mask>
rc(std::ctype<char>::table_size,std::ctype_base::space);
// except digits, which are digits
std::fill(&rc['0'], &rc['9'], std::ctype_base::digit);
// and '.', which we'll call punctuation:
rc['.'] = std::ctype_base::punct;
return &rc[0];
}
};
Then the code to read the data:
std::istringstream virtuallive("Images1234");
virtuallive.imbue(locale(locale(), new digits_only);
int number;
// Since we classify the letters as white space, the stream will ignore them.
// We can just read the number as if nothing else were there:
virtuallive >> number;
This technique is useful primarily when the stream contains a substantial amount of data, and you want all the data in that stream to be interpreted in the same way (e.g., only read numbers, regardless of what else it might contain).
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
(Edit: What is Code Golf: Code Golf are challenges to solve a specific problem with the shortest amount of code by character count in whichever language you prefer. More info here on Meta StackOverflow. )
Code Golfers, here's a challenge on string operations.
Email Address Validation, but without regular expressions (or similar parsing library) of course. It's not so much about the email addresses but how short you can write the different string operations and constraints given below.
The rules are the following (yes, I know, this is not RFC compliant, but these are going to be the 5 rules for this challenge):
At least 1 character out of this group before the #:
A-Z, a-z, 0-9, . (period), _ (underscore)
# has to exist, exactly one time
john#smith.com
^
Period (.) has to exist exactly one time after the #
john#smith.com
^
At least 1 only [A-Z, a-z] character between # and the following . (period)
john#s.com
^
At least 2 only [A-Z, a-z] characters after the final . period
john#smith.ab
^^
Please post the method/function only, which would take a string (proposed email address) and then return a Boolean result (true/false) depending on the email address being valid (true) or invalid (false).
Samples:
b#w.org (valid/true) #w.org (invalid/false)
b#c#d.org (invalid/false) test#org (invalid/false)
test#%.org (invalid/false) s%p#m.org (invalid/false)
j_r#x.c.il (invalid/false) j_r#x.mil (valid/true)
r..t#x.tw (valid/true) foo#a%.com (invalid/false)
Good luck!
C89 (166 characters)
#define B(c)isalnum(c)|c==46|c==95
#define C(x)if(!v|*i++-x)return!1;
#define D(x)for(v=0;x(*i);++i)++v;
v;e(char*i){D(B)C(64)D(isalpha)C(46)D(isalpha)return!*i&v>1;}
Not re-entrant, but can be run multiple times. Test bed:
#include<stdio.h>
#include<assert.h>
main(){
assert(e("b#w.org"));
assert(e("r..t#x.tw"));
assert(e("j_r#x.mil"));
assert(!e("b#c#d.org"));
assert(!e("test#%.org"));
assert(!e("j_r#x.c.il"));
assert(!e("#w.org"));
assert(!e("test#org"));
assert(!e("s%p#m.org"));
assert(!e("foo#a%.com"));
puts("success!");
}
J
:[[/%^(:[[+-/^,&i|:[$[' ']^j+0__:k<3:]]
C89, 175 characters.
#define G &&*((a+=t+1)-1)==
#define H (t=strspn(a,A
t;e(char*a){char A[66]="_.0123456789Aa";short*s=A+12;for(;++s<A+64;)*s=s[-1]+257;return H))G 64&&H+12))G 46&&H+12))>1 G 0;}
I am using the standard library function strspn(), so I feel this answer isn't as "clean" as strager's answer which does without any library functions. (I also stole his idea of declaring a global variable without a type!)
One of the tricks here is that by putting . and _ at the start of the string A, it's possible to include or exclude them easily in a strspn() test: when you want to allow them, use strspn(something, A); when you don't, use strspn(something, A+12). Another is assuming that sizeof (short) == 2 * sizeof (char), and building up the array of valid characters 2 at a time from the "seed" pair Aa. The rest was just looking for a way to force subexpressions to look similar enough that they could be pulled out into #defined macros.
To make this code more "portable" (heh :-P) you can change the array-building code from
char A[66]="_.0123456789Aa";short*s=A+12;for(;++s<A+64;)*s=s[-1]+257;
to
char*A="_.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";
for a cost of 5 additional characters.
Python (181 characters including newlines)
def v(E):
import string as t;a=t.ascii_letters;e=a+"1234567890_.";t=e,e,"#",e,".",a,a,a,a,a,"",a
for c in E:
if c in t[0]:t=t[2:]
elif not c in t[1]:return 0>1
return""==t[0]
Basically just a state machine using obfuscatingly short variable names.
C (166 characters)
#define F(t,u)for(r=s;t=(*s-64?*s-46?isalpha(*s)?3:isdigit(*s)|*s==95?4:0:2:1);++s);if(s-r-1 u)return 0;
V(char*s){char*r;F(2<,<0)F(1=)F(3=,<0)F(2=)F(3=,<1)return 1;}
The single newline is required, and I've counted it as one character.
Python, 149 chars (after putting the whole for loop into one semicolon-separated line, which I haven't done here for "readability" purposes):
def v(s,t=0,o=1):
for c in s:
k=c=="#"
p=c=="."
A=c.isalnum()|p|(c=="_")
L=c.isalpha()
o&=[A,k|A,L,L|p,L,L,L][t]
t+=[1,k,1,p,1,1,0][t]
return(t>5)&o
Test cases, borrowed from strager's answer:
assert v("b#w.org")
assert v("r..t#x.tw")
assert v("j_r#x.mil")
assert not v("b#c#d.org")
assert not v("test#%.org")
assert not v("j_r#x.c.il")
assert not v("#w.org")
assert not v("test#org")
assert not v("s%p#m.org")
assert not v("foo#a%.com")
print "Yeah!"
Explanation: When iterating over the string, two variables keep getting updated.
t keeps the current state:
t = 0: We're at the beginning.
t = 1: We where at the beginning and have found at least one legal character (letter, number, underscore, period)
t = 2: We have found the "#"
t = 3: We have found at least on legal character (i.e. letter) after the "#"
t = 4: We have found the period in the domain name
t = 5: We have found one legal character (letter) after the period
t = 6: We have found at least two legal characters after the period
o as in "okay" starts as 1, i.e. true, and is set to 0 as soon as a character is found that is illegal in the current state.
Legal characters are:
In state 0: letter, number, underscore, period (change state to 1 in any case)
In state 1: letter, number, underscore, period, at-sign (change state to 2 if "#" is found)
In state 2: letter (change state to 3)
In state 3: letter, period (change state to 4 if period found)
In states 4 thru 6: letter (increment state when in 4 or 5)
When we have gone all the way through the string, we return whether t==6 (t>5 is one char less) and o is 1.
Whatever version of C++ MSVC2008 supports.
Here's my humble submission. Now I know why they told me never to do the things I did in here:
#define N return 0
#define I(x) &&*x!='.'&&*x!='_'
bool p(char*a) {
if(!isalnum(a[0])I(a))N;
char*p=a,*b=0,*c=0;
for(int d=0,e=0;*p;p++){
if(*p=='#'){d++;b=p;}
else if(*p=='.'){if(d){e++;c=p;}}
else if(!isalnum(*p)I(p))N;
if (d>1||e>1)N;
}
if(b>c||b+1>=c||c+2>=p)N;
return 1;
}
Not the greatest solution no doubt, and pretty darn verbose, but it is valid.
Fixed (All test cases pass now)
static bool ValidateEmail(string email)
{
var numbers = "1234567890";
var uppercase = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
var lowercase = uppercase.ToLower();
var arUppercase = uppercase.ToCharArray();
var arLowercase = lowercase.ToCharArray();
var arNumbers = numbers.ToCharArray();
var atPieces = email.Split(new string[] { "#"}, StringSplitOptions.RemoveEmptyEntries);
if (atPieces.Length != 2)
return false;
foreach (var c in atPieces[0])
{
if (!(arNumbers.Contains(c) || arLowercase.Contains(c) || arUppercase.Contains(c) || c == '.' || c == '_'))
return false;
}
if(!atPieces[1].Contains("."))
return false;
var dotPieces = atPieces[1].Split('.');
if (dotPieces.Length != 2)
return false;
foreach (var c in dotPieces[0])
{
if (!(arLowercase.Contains(c) || arUppercase.Contains(c)))
return false;
}
var found = 0;
foreach (var c in dotPieces[1])
{
if ((arLowercase.Contains(c) || arUppercase.Contains(c)))
found++;
else
return false;
}
return found >= 2;
}
C89 character set agnostic (262 characters)
#include <stdio.h>
/* the 'const ' qualifiers should be removed when */
/* counting characters: I don't like warnings :) */
/* also the 'int ' should not be counted. */
/* it needs only 2 spaces (after the returns), should be only 2 lines */
/* that's a total of 262 characters (1 newline, 2 spaces) */
/* code golf starts here */
#include<string.h>
int v(const char*e){
const char*s="0123456789._abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
if(e=strpbrk(e,s))
if(e=strchr(e+1,'#'))
if(!strchr(e+1,'#'))
if(e=strpbrk(e+1,s+12))
if(e=strchr(e+1,'.'))
if(!strchr(e+1,'.'))
if(strlen(e+1)>1)
return 1;
return 0;
}
/* code golf ends here */
int main(void) {
const char *t;
t = "b#w.org"; printf("%s ==> %d\n", t, v(t));
t = "r..t#x.tw"; printf("%s ==> %d\n", t, v(t));
t = "j_r#x.mil"; printf("%s ==> %d\n", t, v(t));
t = "b#c#d.org"; printf("%s ==> %d\n", t, v(t));
t = "test#%.org"; printf("%s ==> %d\n", t, v(t));
t = "j_r#x.c.il"; printf("%s ==> %d\n", t, v(t));
t = "#w.org"; printf("%s ==> %d\n", t, v(t));
t = "test#org"; printf("%s ==> %d\n", t, v(t));
t = "s%p#m.org"; printf("%s ==> %d\n", t, v(t));
t = "foo#a%.com"; printf("%s ==> %d\n", t, v(t));
return 0;
}
Version 2
Still C89 character set agnostic, bugs hopefully corrected (303 chars; 284 without the #include)
#include<string.h>
#define Y strchr
#define X{while(Y
v(char*e){char*s="0123456789_.abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
if(*e!='#')X(s,*e))e++;if(*e++=='#'&&!Y(e,'#')&&Y(e+1,'.'))X(s+12,*e))e++;if(*e++=='.'
&&!Y(e,'.')&&strlen(e)>1){while(*e&&Y(s+12,*e++));if(!*e)return 1;}}}return 0;}
That #define X is absolutely disgusting!
Test as for my first (buggy) version.
VBA/VB6 - 484 chars
Explicit off
usage: VE("b#w.org")
Function V(S, C)
V = True
For I = 1 To Len(S)
If InStr(C, Mid(S, I, 1)) = 0 Then
V = False: Exit For
End If
Next
End Function
Function VE(E)
VE = False
C1 = "abcdefghijklmnopqrstuvwxyzABCDEFGHILKLMNOPQRSTUVWXYZ"
C2 = "0123456789._"
P = Split(E, "#")
If UBound(P) <> 1 Then GoTo X
If Len(P(0)) < 1 Or Not V(P(0), C1 & C2) Then GoTo X
E = P(1): P = Split(E, ".")
If UBound(P) <> 1 Then GoTo X
If Len(P(0)) < 1 Or Not V(P(0), C1) Or Len(P(1)) < 2 Or Not V(P(1), C1) Then GoTo X
VE = True
X:
End Function
Java: 257 chars (not including the 3 end of lines for readability ;-)).
boolean q(char[]s){int a=0,b=0,c=0,d=0,e=0,f=0,g,y=-99;for(int i:s)
d=(g="#._0123456789QWERTYUIOPASDFGHJKLZXCVBNMqwertyuiopasdfghjklzxcvbnm".indexOf(i))<0?
y:g<1&&++e>0&(b<1|++a>1)?y:g==1&e>0&(c<1||f++>0)?y:++b>0&g>12?f>0?d+1:f<1&e>0&&++c>0?
d:d:d;return d>1;}
Passes all the tests (my older version was incorrect).
Erlang 266 chars:
-module(cg_email).
-export([test/0]).
%%% golf code begin %%%
-define(E,when X>=$a,X=<$z;X>=$A,X=<$Z).
-define(I(Y,Z),Y([X|L])?E->Z(L);Y(_)->false).
-define(L(Y,Z),Y([X|L])?E;X>=$0,X=<$9;X=:=$.;X=:=$_->Z(L);Y(_)->false).
?L(e,m).
m([$#|L])->a(L);?L(m,m).
?I(a,i).
i([$.|L])->l(L);?I(i,i).
?I(l,c).
?I(c,g).
g([])->true;?I(g,g).
%%% golf code end %%%
test() ->
true = e("b#w.org"),
false = e("b#c#d.org"),
false = e("test#%.org"),
false = e("j_r#x.c.il"),
true = e("r..t#x.tw"),
false = e("test#org"),
false = e("s%p#m.org"),
true = e("j_r#x.mil"),
false = e("foo#a%.com"),
ok.
Ruby, 225 chars.
This is my first Ruby program, so it's probably not very Ruby-like :-)
def v z;r=!a=b=c=d=e=f=0;z.chars{|x|case x when'#';r||=b<1||!e;e=!1 when'.'
e ?b+=1:(a+=1;f=e);r||=a>1||(c<1&&!e)when'0'..'9';b+=1;r|=!e when'A'..'Z','a'..'z'
e ?b+=1:f ?c+=1:d+=1;else r=1 if x!='_'||!e|!b+=1;end};!r&&d>1 end
'Using no regex':
PHP 47 Chars.
<?=filter_var($argv[1],FILTER_VALIDATE_EMAIL);
Haskell (GHC 6.8.2), 165 161 144C Characters
Using pattern matching, elem, span and all:
a=['A'..'Z']++['a'..'z']
e=f.span(`elem`"._0123456789"++a)
f(_:_,'#':d)=g$span(`elem`a)d
f _=False
g(_:_,'.':t#(_:_:_))=all(`elem`a)t
g _=False
The above was tested with the following code:
main :: IO ()
main = print $ and [
e "b#w.org",
e "r..t#x.tw",
e "j_r#x.mil",
not $ e "b#c#d.org",
not $ e "test#%.org",
not $ e "j_r#x.c.il",
not $ e "#w.org",
not $ e "test#org",
not $ e "s%p#m.org",
not $ e "foo#a%.com"
]