Loop over tweet text and, for each word not in the stop-word list, correct spelling, lemmatize, and stem

The code below should loop through the text column of a tweet dataset and, for each word that is not in the stop-word list, correct its spelling, lemmatize it, and then stem it. It is not working properly. Can you help me fix it? Please check the error in the attached image.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from spellchecker import SpellChecker

pstem = PorterStemmer()
lem = WordNetLemmatizer()
spell = SpellChecker()
stop_words = stopwords.words('english')
for i in range(len(df.index)):
    text = df.loc[i]['text']
    tokens = nltk.word_tokenize(text)
    tokens = [word for word in tokens if word not in stop_words]
    for j in range(len(tokens)):
        tokens[j] = spell.correction(tokens[j])
        tokens[j] = lem.lemmatize(tokens[j])
        tokens[j] = pstem.stem(tokens[j])
    tokens_sent = ' '.join(tokens)
    df.at[i, "text"] = tokens_sent


I wrote code to fix the letter case of the first name in Zoho, but it's not working

Here's the Deluge script that should capitalize the first letter of the name and make the other letters lowercase, but it isn't working:
a = zoho.crm.getRecordById("Contacts",input.ID);
d = a.get("First_Name");
firstChar = d.subString(0,1);
otherChars = d.removeFirstOccurence(firstChar);
Name = firstChar.toUppercase() + otherChars.toLowerCase();
mp = map();
mp.put("First_Name",d);
b = zoho.crm.updateRecord("Contacts", Name,{"First_Name":"Name"});
info Name;
info b;
I tried capitalizing the first letter and making the other letters lowercase, but it isn't working as expected.
Try using concat
Name = firstChar.toUppercase().concat( otherChars.toLowerCase() );
Try removing the double quotes from the Name value in the following statement. Name is a variable holding the case-adjusted name, whereas "Name" is the literal string "Name".
From:
b = zoho.crm.updateRecord("Contacts", Name,{"First_Name":"Name"});
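To:
b = zoho.crm.updateRecord("Contacts", Name,{"First_Name":Name});
Note also that the second argument of zoho.crm.updateRecord is the record ID, so it should probably be input.ID (the ID used to fetch the record) rather than Name.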

How to find text across HTML tag boundaries?

I have HTML like this:
<div>Lorem ipsum <b>dolor sit</b> amet.</div>
How can I find a plain text based match for my search string ipsum dolor in this HTML? I need the start and end XPath node pointers for the match, plus character indexes to point inside these start and stop nodes. I use Nokogiri to work with the DOM, but any solution for Ruby is fine.
Difficulty:
I can't node.traverse {|node| … } through the DOM and do a plain text search whenever a text node comes across, because my search string can cross tag boundaries.
I can't do a plain text search after converting the HTML to plain text, because I need the XPath indexes as result.
I could implement it myself with basic tree traversal, but before I do I'm asking if there is a Nokogiri function or trick to do it more comfortably.
You could do something like:
doc.search('div').find{|div| div.text[/ipsum dolor/]}
In the end, we used the code below. It is shown for the example given in the question, but it also works in the generic case of arbitrarily deep HTML tag nesting, which is what we need.
In addition, we implemented it so that it tolerates runs of two or more whitespace characters. That is why we have to search for the end of the match instead of adding the length of the search string (quote) to the start position: the number of whitespace characters in the search string and in the matched text might differ.
require 'nokogiri'

doc = Nokogiri::HTML.fragment("<div>Lorem ipsum <b>dolor sit</b> amet.</div>")
quote = 'ipsum dolor'
# (1) Find search string in document text, "plain text in plain text".
quote_query =
  quote.split(/[[:space:]]+/).map { |w| Regexp.quote(w) }.join('[[:space:]]+')
start_index = doc.text.index(/#{quote_query}/i)
end_index = start_index + doc.text[/#{quote_query}/i].size
# (2) Find XPath values and character indexes for our search match.
#
# To do this, walk through all text nodes and count characters until
# encountering both the start_index and end_index character counts
# of our search match.
start_xpath, start_offset, end_xpath, end_offset = nil
i = 0
doc.xpath('.//text() | text()').each do |x|
  offset = 0
  x.text.split('').each do
    if i == start_index
      e = x.previous
      sum = 0
      while e
        sum += e.text.size
        e = e.previous
      end
      start_xpath = x.path.gsub(/^\?/, '').gsub(
        /#{Regexp.quote('/text()')}.*$/, ''
      )
      start_offset = offset + sum
    elsif i + 1 == end_index
      e = x.previous
      sum = 0
      while e
        sum += e.text.size
        e = e.previous
      end
      end_xpath = x.path.gsub(/^\?/, '').gsub(
        /#{Regexp.quote('/text()')}.*$/, ''
      )
      end_offset = offset + 1 + sum
    end
    offset += 1
    i += 1
  end
end
At this point, we have the XPath values for the start and end of the search match, plus character offsets that point to the exact character inside each XPath-designated element. We get:
puts start_xpath
/div
puts start_offset
6
puts end_xpath
/div/b
puts end_offset
5
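The whitespace-tolerant matching from step (1) can be tried on its own, without Nokogiri. The following is only a minimal sketch on plain strings: the helper name find_quote is made up for this illustration and is not part of the code above. It shows why the end of the match has to be taken from the match itself rather than from the length of the search string.

```ruby
# Hypothetical helper for illustration only (not part of the answer's code).
# Returns [start_index, end_index] of the first whitespace-tolerant,
# case-insensitive match of quote in text, or nil if there is no match.
def find_quote(text, quote)
  # Escape each word and allow any run of whitespace between words.
  pattern = quote.split(/[[:space:]]+/)
                 .map { |w| Regexp.quote(w) }
                 .join('[[:space:]]+')
  match = text.match(/#{pattern}/i)
  return nil unless match

  # begin(0)/end(0) are character offsets of the whole match.
  [match.begin(0), match.end(0)]
end

p find_quote("Lorem ipsum dolor sit amet.", "ipsum dolor")      # => [6, 17]
# Extra whitespace in the text makes the match longer than the quote itself.
p find_quote("Lorem ipsum \n  dolor sit amet.", "ipsum dolor")  # => [6, 20]
```

In the second call the quote is 11 characters long but the match spans 14, so start_index plus the quote length would point into the middle of "dolor".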

Regex to match a specific sequence of strings

Assuming I have two arrays of strings:
position1 = ['word1', 'word2', 'word3']
position2 = ['word4', 'word1']
Inside a text, I want to check whether the substring #{target} is followed by one of the words in position1, preceded by one of the words in position2, or both at the same time. In other words, I want to look to the left and right of #{target}.
For example, in the sentence "Writing reports and inputting data onto internal systems, with regards to enforcement and immigration papers", if the target word is data, I would like to check whether the word to its left (inputting) and the word to its right (onto) are included in the arrays, or whether one of the words in the arrays makes the regex match. Any suggestions? I am using Ruby and I have tried some regexes, but I can't make it work yet. I also have to ignore any special characters in between.
One of them:
/^.*\b(#{joined_position1})\b.*$[\s,.:-_]*\b#{target}\b[\s,.:-_\\\/]*^.*\b(#{joined_position2})\b.*$/i
Edit:
I figured out this way with regex to capture the word left and right:
(\S+)\s*#{target}\s*(\S+)
However, what could I change if I would like to capture more than one word to the left and right?
If you have two arrays of strings, what you can do is something like this:
matches = /^.+ (\S+) #{target} (\S+) .+$/.match(text)
if matches and (position1.include?(matches[1]) or position2.include?(matches[2]))
  do_something()
end
What this regex does is match the target word in your text and extract the words next to it using capture groups. The code then compares those words against your arrays, and does something if they're in the right places. A more general version of this might look like:
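def checkWords(target, text, leftArray, rightArray, numLeft = 1, numRight = 1)
  # Build the regex. Single quotes keep the backslash in \S intact
  # (inside double quotes, "\S" would collapse to a plain "S").
  regex = '^.+'
  regex += ' (\S+)' * numLeft
  regex += " #{Regexp.escape(target)}"
  regex += ' (\S+)' * numRight
  regex += ' .+$'
  pattern = Regexp.new(regex)
  matches = pattern.match(text)
  return false if !matches
  # Every captured word on the left must appear in leftArray ...
  for i in 1..numLeft
    return false if (!leftArray.include?(matches[i]))
  end
  # ... and every captured word on the right in rightArray.
  for i in 1..numRight
    return false if (!rightArray.include?(matches[numLeft + i]))
  end
  return true
end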
Which can then be invoked like this:
do_something() if checkWords("data", text, position1, position2, 2, 2)
I'm pretty sure it's not terribly idiomatic, but it gives you a general sense of how you would do what you want in a more general way.
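If the regex gets unwieldy, another option is to skip the big pattern and tokenize the text first, then look at the neighbours of the target by index. This is a different technique from the regex above, shown only as a rough sketch: the method name neighbours_match? and its keyword parameters are made up for this example, and it assumes the word lists are lowercase and that a match on either side is enough.

```ruby
# Hypothetical helper for illustration only.
# Returns true if any of the num_left words before target is in left_words,
# or any of the num_right words after target is in right_words.
# Assumes left_words and right_words are already lowercase.
def neighbours_match?(text, target, left_words, right_words, num_left: 1, num_right: 1)
  # Keep only runs of letters, digits and apostrophes, so commas,
  # colons and similar characters between words are ignored.
  words = text.downcase.scan(/[[:alnum:]']+/)

  words.each_index.any? do |i|
    next false unless words[i] == target.downcase

    left  = words[[i - num_left, 0].max...i]  # up to num_left words before
    right = words[i + 1, num_right] || []     # up to num_right words after
    (left & left_words).any? || (right & right_words).any?
  end
end

text = "Writing reports and inputting data onto internal systems, " \
       "with regards to enforcement and immigration papers"
p neighbours_match?(text, "data", ["inputting"], ["word1"])        # => true
p neighbours_match?(text, "data", ["word1"], ["word4"])            # => false
p neighbours_match?(text, "data", ["and"], [], num_left: 2)        # => true
```

To require a match on both sides, replace the || in the last line of the block with &&.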

Prevent string data from being padded with spaces from the left

I have an input box that I use to enter alphanumeric account numbers into a database. The box accepts up to 25 characters. However, not every account number is 25 characters long. In that case, the account number is saved with blank spaces in front of it instead of being left-aligned in the column. How can I solve this?
I would like each number to be saved like the two hyphenated numbers and not with a space like the first record.
Code summary:
Set objDB = New db.Detail_Data
objDB.ConnectionString = CONNECTSTRING
With objDB
    .summary_code = CDbl(mvarSumcode)
    .charge_code = UCase$(Me.txtChargeCode)
    .clientID = UCase$(Me.txtClientID)
    .JobID = UCase$(Me.txtJobID)
    .Invno = UCase$(Me.txtInvno.Text)
    .TransAmt = CCur(Me.txtTransAmt)
    .Gl_accno = Format(Me.txtGL, "#########################")
    .Description = Me.txtDescription
    blnStatus = .AddDetail
End With
Looks like it works as coded. Your line:
.Gl_accno = Format(Me.txtGL, "#########################")
Format with the # symbol right-justifies the string, filling in spaces on the left, unless you add a ! like so:
.Gl_accno = Format(Me.txtGL, "!#########################")

TextPad syntax highlighting confused by apostrophe

I'd like to know what to put in TextPad's syntax file to fix the issue where, say, in an HTML file, you're writing a paragraph and an apostrophe starts syntax highlighting that runs until the next apostrophe.
Ex:
<p>Hi, I'm an example.
lol text here placeholder lorem ipsum I've died.</p>
I've placed in bold what would be color highlighted in TextPad, for lack of Stack Overflow coloring knowledge. :P It would be seen as similar to <a href='http://string.lol'> where you would normally use a pair of apostrophes or quotes. I realize that the issue may be in the way the syntax file is set up, where it's matching any apostrophe instead of matching only an apostrophe that is not separated by a space. Ideally it would also need to match equal signs and other common characters that appear directly next to an apostrophe or quote.
Here's where I believe it could be found inside the syntax file:
[Syntax]
Namespace1 = 6
IgnoreCase = No
InitKeyWordChars = A-Za-z_
KeyWordChars = A-Za-z0-9_
OperatorChars = -+*/!~%^&|=#`.,;:
KeyWordLength =
BracketChars = {[()]}
PreprocStart = #
HexPrefix = 0x
SyntaxStart =
SyntaxEnd =
CommentStart = /*
CommentEnd = */
CommentStartAlt = <!--
CommentEndAlt = -->
SingleComment = //
SingleCommentCol =
SingleCommentAlt =
SingleCommentColAlt =
SingleCommentEsc =
StringsSpanLines = Yes
StringStart = "
StringEnd = "
StringAlt = '
StringEsc = \
CharStart = '
CharEnd = '
CharEsc = \
You have your String options at the bottom, but is TextPad capable of accepting some kind of expression matching or regex, and if so, how would I best do this? I've looked on Google and here, and the keywords are just too vague to find anything that exists on the topic, if anything does.
Thank you for any help you can provide.
I fixed this problem by editing the line in perl5.syn that reads
StringAlt = '
to instead be
; StringAlt = '
(the leading semi-colon comments out the StringAlt setting on that line; or you could just delete that line outright).
You need to use
SyntaxStart = <
SyntaxEnd = >
This will restrict syntax highlighting to only be within tags, and it's the best you can do with TextPad.
