I would like to translate all the text inside a Microsoft Word .doc or .docx file without changing the formatting of the file itself.
Are there any gems or libraries that can help me with this?
The general case is extremely complicated, but for translating continuous runs of text that are formatted the same, you can use WIN32OLE to automate Word itself, so long as you are on Windows and have a copy of Word installed.
Microsoft publishes documentation on Word's object model. You can also use the built-in Object Browser (open the macro editor with Alt+F11 and press F2).
The following short script can form the starting point for your exploration:
require 'win32ole'
file = ENV['USERPROFILE'] + '/Desktop/' + 'This is a test.docx'
word = WIN32OLE.new('Word.Application') # start (or attach to) Word
word.visible = true # show the Word window so you can watch it work
doc = word.Documents.Open(file)
doc.paragraphs.each { |p| puts p.Range.Text } # print each paragraph's text
doc.Close()
word.Quit()
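If you also want to write translated text back, here is a minimal sketch, assuming a hypothetical translate method you would supply (for example, a call to a translation API). It rewrites the document, so try it on a copy first:
require 'win32ole'

# Hypothetical translation routine you would supply; here it just
# returns its input unchanged.
def translate(text)
  text
end

file = ENV['USERPROFILE'] + '/Desktop/' + 'This is a test.docx'
word = WIN32OLE.new('Word.Application')
word.visible = true
doc = word.Documents.Open(file)
doc.paragraphs.each do |p|
  source = p.Range.Text
  # Note: p.Range.Text includes the trailing paragraph mark ("\r").
  # Assigning to Range.Text replaces the run's text in place; Word keeps
  # the formatting as long as the run is uniformly formatted.
  p.Range.Text = translate(source)
end
doc.Save()
doc.Close()
word.Quit()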
I am generating a CSV file from a Microsoft SQL database that was provided to me, but somehow there are invalid characters in about two dozen places throughout the text (there are many thousands of lines of data). When I open the CSV in my text editor, they display as red, upside-down question marks.
When I copy the character and view the "find/replace" dialog in my text editor, I see this:
\x{0D}
...but I have no idea what that means. I need to modify my script that generates the CSV so it strips these characters out, but I don't know how to identify them. My script is written in Classic ASP.
You can also use RegEx to remove unwanted characters:
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.Global = True
objRegEx.Pattern = "[^A-Za-z]"
strCSV = objRegEx.Replace(strCSV, "")
This code is from the following article, which explains in detail what it does:
How Can I Remove All the Non-Alphabetic Characters in a String?
In your case you will want to add more characters inside the negated class, so that only genuinely unwanted characters are stripped. Note that the ^ belongs inside the brackets; with an anchored whitelist like ^[...]*$ the Replace call would wipe out entire valid strings. For example:
[^a-zA-Z0-9!#$&()\-`.+,/" ]
Note that \x{0D} is a carriage return; the red ¿ is just how your editor renders it. In VBScript that character is Chr(13), so you can simply use the Replace function:
Replace(yourCSV, Chr(13), "")
This will remove the character. If you need to replace it with something else, change the last parameter from "" to a different value (" " for example).
In general, you can use charmap.exe (Character Map) from the Run menu: select a font such as Arial, find the symbol, and copy it to the clipboard. You can then check its value with Asc("¿"), which returns the character code to use with Chr().
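For reference, the equivalent lookup in Ruby is a one-liner:
'¿'.ord  #=> 191, the code to pass to Chr() in VBScript
13.chr   #=> "\r", the carriage return that \x{0D} denotes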
I have a script like
a <- 1
# A very long comment, perhaps copy paste from somewhere containing the word fit.
and I want to search it for non-ASCII (possibly mis-encoded) characters. How can I do this in RStudio?
I realized the answer is really simple: just go to Edit => Find (Ctrl + F) and search for [^\x00-\x7F] with the Regex checkbox enabled in the search bar.
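If you want the same check outside RStudio, the identical character class works in a short Ruby script (a sketch; 'script.R' is a placeholder filename):
# Print every line that contains a byte outside the ASCII range.
# Reading as ASCII-8BIT avoids errors on invalid UTF-8 sequences.
File.foreach('script.R', encoding: 'ASCII-8BIT').with_index(1) do |line, number|
  puts "#{number}: #{line}" if line =~ /[^\x00-\x7F]/n
end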
I'm writing tests that will be confirming a lot of text on the page. It's for Terms and Conditions pages, Cookies, Privacy Policy, etc. It's not what I'd like to do, but it's a requirement I can't avoid. I've heard that Cucumber can open a text file like .txt or .doc and compare the text on screen.
I've tried to find any reference to this but have come up short. Is anyone able to point me in the right direction please?
Thanks
Cucumber/aruba has a Given step that goes like this:
Given a file named "foo" with:
"""
hello world
"""
You would then be able to check that your webpage has content with:
Then I should see "hello world"
And your step definition:
Then /^I should see "([^"]*)"$/ do |text|
  page.should have_content(text)
end
I continued to play and found this works for a .txt file:
File.foreach(doc) do |line|
  line = line.strip
  within_window(->{ page.title == 'Cookies' }) do
    page_content = find('#main').text
    page_content.gsub!('‘', "'") # normalize curly quotes to match the .txt file
    page_content.gsub!('’', "'")
    expect(page_content).to have_text line
  end
end
doc is my filename variable. I used strip to remove the newline characters created when pasting into the .txt file. The cookies open up in a new tab, so I navigated to that with within_window. The page had curly quotes that didn't match the plain .txt file, so I just used gsub on those.
I'm sure there's a much better way to do this, but it works for now.
I'd recommend using approval tests: a person approves the original, then the automated test verifies nothing has changed. If the license text changes, the test fails. If the change is intended, the tester approves the new text, and it's automated thereafter.
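A minimal sketch of that idea in plain RSpec/Capybara terms (the file paths and the #main selector are assumptions carried over from the snippet above):
# Approval-style check: compare the live page text against an approved copy
# on disk; on mismatch, save the received text so a human can review it and,
# if the change is intended, promote it to the approved file.
approved_path = 'approvals/cookies.approved.txt'
received = find('#main').text

approved = File.exist?(approved_path) ? File.read(approved_path) : ''
unless received == approved
  File.write('approvals/cookies.received.txt', received)
end
expect(received).to eq(approved)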
Sorry if this has already been asked.
I have about 1 million text documents stored in PostgreSQL.
I am trying to see if they contain certain words, for example cancer, died, heart_attack, etc. This list of words is also quite long.
The document only needs to contain one of the words.
If they contain a word, I then try to copy them to a different folder.
My current code is:
directory = "disease" #Creates a directory called heart attacks
FileUtils.mkpath(directory) # Makes the directory if it doesn't exists
cancer = Eightk.where("text ilike '%cancer%'")
died = Eightk.where("text ilike '%died%'")
cancer.each do |filing| #filing can be used instead of eightks
filename = "#{directory}/#{filing.doc_id}.html"
File.open(filename,"w").puts filing.text
puts "Storing #{filing.doc_id}..."
died.each do |filing| #filing can be used instead of eightks
filename = "#{directory}/#{filing.doc_id}.html"
File.open(filename,"w").puts filing.text
puts "Storing #{filing.doc_id}..."
end
end
But this is not working, for the following reasons:
It doesn't match the exact word.
It is very time consuming, since it involves copying the same code over and over with just one word changed.
So I have tried using Regexp.union as follows, but am a bit lost:
directory = "disease" # Creates a directory called disease
FileUtils.mkpath(directory) # Makes the directory if it doesn't exist
keywords = [/dead/, /killed/, /cancer/]
re = Regexp.union(keywords)
So I am trying to search the text files for these keywords and then copy the text documents.
Any help is really appreciated.
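For what it's worth, here is roughly where I am trying to get to, combining the pieces above (an untested sketch):
require 'fileutils'

directory = "disease"
FileUtils.mkpath(directory)

keywords = %w[dead killed cancer]
# \b makes each keyword match only as a whole word; Regexp.escape guards
# against regex metacharacters in the keyword list.
re = Regexp.union(keywords.map { |w| /\b#{Regexp.escape(w)}\b/i })

# find_each batches through all the rows instead of loading them at once.
Eightk.find_each do |filing|
  next unless filing.text =~ re
  File.write("#{directory}/#{filing.doc_id}.html", filing.text)
end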
Since you said:
I have about 1 million text documents stored in PostgreSQL
and you use the ILIKE operator to search for words in those documents:
IMHO, that is an inefficient implementation. A pattern like '%cancer%' cannot use an ordinary index, so every search has to scan all 1 million documents, and it will be very slow.
Before moving forward, I think you should take a look at PG Full Text Searching first (if you simply want to use the built-in full text search in PG), or you could also look at products like Elasticsearch, Solr, etc. that are dedicated to the text search problem.
Regarding PG full text search, in Ruby you could use the pg_search gem. If you use Rails, I wrote a post about a simple full text search implementation with PG in Rails.
I hope you find this useful.
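For example, with pg_search a scope along these lines would push the matching into Postgres (a sketch; mentioning_any is a made-up scope name, and any_word: true makes it match documents containing any one of the words):
# Gemfile: gem 'pg_search'
class Eightk < ActiveRecord::Base
  include PgSearch::Model # older pg_search versions use `include PgSearch`
  pg_search_scope :mentioning_any,
                  against: :text,
                  using: { tsearch: { any_word: true } } # match any one word
end

Eightk.mentioning_any('cancer died heart_attack').find_each do |filing|
  File.write("disease/#{filing.doc_id}.html", filing.text)
end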
Ruby newbie here. I'm using Ruby version 1.9.2. I'm working at a military facility, and whenever we need to send support data to our vendors it needs to be scrubbed of identifying IP and hostname info. This is a new role for me, and the task of scrubbing files (both text and binary) now falls on me when handling support issues.
I created the following script to scrub plain text files of IP address info:
File.open("subnet.htm", 'r+') do |f|
text = f.read
text.gsub!(/\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/, "000.000.000.000")
f.rewind
f.write(text)
end
I need to modify my script to search and replace hostname AND IP address information in text files AND .dat binary files. I'm looking for something really simple like my little script above, and I'd like to keep the processing of txt and dat files as separate scripts. Creating one script to do both is a learning exercise I'd like to take up later, starting from the two separate scripts. Right now I'm under certain time constraints to scrub the support files and send them out.
The priority for me is to scrub my binary .dat trace files, which contain XML data. These are binary performance trace files from our storage arrays, and they need to have the identifying IP address information scrubbed out before being sent off to support for analysis.
I've searched stackoverflow.com somewhat extensively and haven't found a question with an answer that addresses my specific need, and I'm simply having a hard time trying to figure out String#unpack.
Thanks.
In general Ruby processes binary files the same as other files, with two caveats:
On Windows reading files normally translates CRLF pairs into just LF. You need to read in binary mode to ensure no conversion:
File.open('foo.bin','rb'){ ... }
In order to ensure that your binary data is not interpreted as text in some other encoding under Ruby 1.9+ you need to specify the ASCII-8BIT encoding:
File.open('foo.bin','r:ASCII-8BIT'){ ... }
However, as noted in this post, setting the 'b' flag as shown above also sets the encoding for you. Thus, just use the first code snippet above.
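You can confirm this quickly in irb (assuming the foo.bin from the snippets above exists):
File.open('foo.bin', 'rb') { |f| f.external_encoding } #=> #<Encoding:ASCII-8BIT>
File.open('foo.bin', 'rb') { |f| f.read.encoding }     #=> #<Encoding:ASCII-8BIT>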
However, as noted in the comment by @ennuikiller, I suspect that you don't actually have true binary data. If you're really reading text files with a non-ASCII encoding (e.g. UTF-8), there is a small chance that treating them as binary will accidentally match only part of a multi-byte character and corrupt the resulting file.
Edit: To use Nokogiri on XML files, you might do something like the following:
require 'nokogiri'
File.open("foo.xml", 'r+') do |f|
  doc = Nokogiri.XML(f.read)
  doc.xpath('//text()').each do |text_node|
    # You cannot use gsub! here; assign the result back instead
    text_node.content = text_node.content.gsub /.../, '...'
  end
  f.rewind
  f.write doc.to_xml
  f.truncate(f.pos) # the serialized XML may be shorter than the original
end
I've done some binary file parsing, and this is how I read it in and cleaned it up:
# Read the whole file as bytes; keep only tab, LF, CR, and printable ASCII.
data = File.open("file", 'rb') { |io| io.read }.unpack("C*").map do |val|
  val if val == 9 || val == 10 || val == 13 || (val > 31 && val < 127)
end
For me, my binary file didn't have sequential character strings, so I had to do some shifting and filtering before I could read it (hence the .map do |val| ... end). Unpacking with the "C" directive (see http://www.ruby-doc.org/core-1.9.2/String.html#method-i-unpack) gives ASCII character codes rather than the characters themselves, so call val.chr if you'd like the interpreted character instead.
I'd suggest that you open your files in a binary editor and look through them to determine how to best handle the data parsing. If they are XML, you might consider parsing them with Nokogiri or a similar XML tool.
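Putting those pieces together for the XML-in-.dat case, a sketch like this might be a starting point ('trace.dat' and 'scrubbed.dat' are placeholder names, and it assumes the files really are well-formed XML):
require 'nokogiri'

# Read in binary mode so no newline or encoding conversion takes place.
raw = File.open('trace.dat', 'rb') { |io| io.read }
doc = Nokogiri::XML(raw)

doc.xpath('//text()').each do |node|
  scrubbed = node.content.gsub(/\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/, '000.000.000.000')
  # A second gsub with your hostname pattern would go here.
  node.content = scrubbed
end

File.open('scrubbed.dat', 'wb') { |io| io.write(doc.to_xml) }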