I would like to replace every non-capitalised word in a text with dashes, i.e. "-" * length of the word.
For instance I have the following Text (German):
Florian Homm wuchs als Sohn des mittelständischen Handwerksunternehmers Joachim Homm und seiner Frau Maria-Barbara „Uschi“ Homm im hessischen Bad Homburg vor der Höhe auf. Sein Großonkel mütterlicherseits war der Unternehmer Josef Neckermann. Nach einem Studium an der Harvard University, das er mit einem Master of Business Administration an der Harvard Business School abschloss, begann Homm seine Tätigkeit in der US-amerikanischen Finanzwirtschaft bei der Investmentbank Merrill Lynch, danach war er bei dem US-Fondsanbieter Fidelity Investments, der Schweizer Privatbank Julius Bär und dem US-Vermögensverwalter Tweedy Browne....
should be transformed into
Florian Homm ---- --- Sohn --- ------------ Handwerksunternehmers Joachim Homm --- ------ Frau Maria-Barbara „Uschi“ Homm -- ---------- Bad Homburg --- Höhe ---. ....
▶ input.gsub(/\p{L}+/) { |m| m[0] != m[0].upcase ? '-'*m.length : m }
#⇒ "Florian Homm ----- --- Sohn --- ------------------ Handwerksunternehmers..."
A cleaner solution (credit to Cary):
▶ input.gsub(/(?<!\p{L})\p{Lower}+(?!\p{L})/) { |m| '-' * m.length }
Try something like this
s.split.map { |word| ('A'..'Z').include?(word[0]) ? word : '-' * word.length }.join(' ')
You can try something like this for small input size:
Basically, I:
Split the input string on whitespace character
Map the array to either the word itself (if capitalized) or the word replaced with dashes (if not capitalized)
join with whitespaces.
Like so
s = "Florian Homm wuchs als Sohn des mittelständischen Handwerksunternehmers Joachim Homm und seiner Frau Maria-Barbara „Uschi“ Homm im hessischen Bad Homburg vor der Höhe auf. Sein Großonkel mütterlicherseits war der Unternehmer Josef Neckermann. Nach einem Studium an der Harvard University, das er mit einem Master of Business Administration an der Harvard Business School abschloss, begann Homm seine Tätigkeit in der US-amerikanischen Finanzwirtschaft bei der Investmentbank Merrill Lynch, danach war er bei dem US-Fondsanbieter Fidelity Investments, der Schweizer Privatbank Julius Bär und dem US-Vermögensverwalter Tweedy Browne...."
s.split(/[[:space:]]/).map { |word| word.capitalize == word ? word : '-' * word.length }.join(' ')
Does that apply to your problem?
Cheers!
Edit: For a more memory efficient solution you can use regex replace gsub, check out this other answer by mudasobwa https://stackoverflow.com/a/41570686/4411941
r = /
(?<![[:alpha:]]) # do not match a letter (negative lookbehind)
[[:lower:]] # match a lowercase letter
[[:alpha:]]* # match zero or more letters
/x # free-spacing regex definition mode
str = "Frau Maria-Barbara „Uschi“ Homm im hessischen Bad Homburg vor der Höhe auf."
str.gsub(r) { |m| '-'*m.size }
#=> "Frau Maria-Barbara „Uschi“ Homm -- ---------- Bad Homburg --- --- Höhe ---."
"die Richter/-innen".gsub(r) { |m| '-'*m.size }
#=> "--- Richter/------"
"Jede(r) Anwältin und Anwalt".gsub(r) { |m| '-'*m.size }
#=> "Jede(-) Anwältin --- Anwalt"
Solution
This problem is harder than it looks!
This code might be more memory hungry than others, but I dare say it works for a wider range of (weird) German words :
def hide_non_capitalized(text)
  text.split(/[[:space:]]/).map do |chars|
    first_letter = chars[/[[:alpha:]]/]
    if first_letter && first_letter == first_letter.downcase
      ## Keep non-letters :
      chars.gsub(/[[:alpha:]]/,'-')
      ## Replace every character :
      # '-' * chars.size
    else
      chars
    end
  end.join(' ')
end
It splits the text into whitespace-separated blocks and replaces all the letters of a block if its first letter is lowercase. This code requires Ruby 2.4 or later, because up to Ruby 2.3 case conversion only handled ASCII letters: 'ä'.upcase is still 'ä' and 'Ä'.downcase is still 'Ä'.
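If upgrading Ruby is not an option, one possible workaround is to test the first letter against the Unicode property \p{Lower} instead of comparing it with its downcased form. This is only a sketch and not part of the function above; starts_lowercase? is a made-up helper name.

```ruby
# Hypothetical helper (not part of hide_non_capitalized above).
# Returns true when the first letter of the block is a lowercase letter.
def starts_lowercase?(chars)
  first_letter = chars[/\p{L}/]      # first Unicode letter, or nil
  return false if first_letter.nil?  # blocks such as "1234" contain no letter
  !(first_letter =~ /\p{Lower}/).nil?
end

p starts_lowercase?("öfters.")    # lowercase umlaut
p starts_lowercase?("Änderung.")  # uppercase umlaut
p starts_lowercase?("„word“")     # the leading quotation mark is skipped
```

Because the check relies on the regex engine rather than on String#downcase, it should not depend on the case-mapping behaviour of the Ruby version in use.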
Test
puts hide_non_capitalized(text)
#=> Florian Homm ----- --- Sohn --- ----------------- Handwerksunternehmers Joachim Homm --- ------ Frau Maria-Barbara „Uschi“ Homm -- ---------- Bad Homburg --- --- Höhe ---. Sein Großonkel ----------------- --- --- Unternehmer Josef Neckermann. Nach ----- Studium -- --- Harvard University, --- -- --- ----- Master -- Business Administration -- --- Harvard Business School ---------, ------ Homm ----- Tätigkeit -- --- US-amerikanischen Finanzwirtschaft --- --- Investmentbank Merrill Lynch, ------ --- -- --- --- US-Fondsanbieter Fidelity Investments, --- Schweizer Privatbank Julius Bär --- --- US-Vermögensverwalter Tweedy Browne....
hide_none = "Änderung. „Uschi“, Attaché-case Maria-Barbara US-Fondsanbieter. Die Richter/-innen. Jede(r) 1234 \"#+?\""
puts hide_non_capitalized(hide_none)
#=> Änderung. „Uschi“, Attaché-case Maria-Barbara US-Fondsanbieter. Die Richter/-innen. Jede(r) 1234 "#+?"
hide_all = "öfters. „word“ lowercase-Uppercase jede(r) not/exactly/a/word"
puts hide_non_capitalized(hide_all)
#=> ------. „----“ ------------------- ----(-) ---/-------/-/----
I am using R Markdown as per the below:
---
title: title
author:
  - Name1:
      email: email
      institute: med
      correspondence: yes
  - name: name2
    institute: med
date: date
bibliography: ref_file.bib
bib-humanities: true
output:
  pdf_document:
    includes:
      in_header: header.tex
    number_sections: yes
    toc: no
    pandoc_args:
      - --lua-filter=scholarly-metadata.lua
      - --lua-filter=author-info-blocks.lua
  word_document:
    toc: no
    pandoc_args:
      - --lua-filter=scholarly-metadata.lua
      - --lua-filter=author-info-blocks.lua
  html_document:
    toc: no
    df_print: paged
header-includes: \usepackage{amsmath}
institute:
  - med: etc etc
---
@RN36382 defined...
My ref_file.bib shows:
@article{RN36382,
  author = {van der Laan, M. J.},
  title = {Statistical Inference for Variable Importance},
  journal = {The International Journal of Biostatistics},
  volume = {2},
  number = {1},
  year = {2006},
  type = {Journal Article}
}
My PDF output is:
"Laan (2006) defined ...", but I was expecting "van der Laan (2006) defined ..."
How can I fix this? Thanks!
You should add double brackets ;)
...
author = {{van der Laan, M. J.}}
...
I have this text :
#Heurtebise (Il ramasse son sac)
Vous regretterez de m'avoir fait du mal.
(Silence.) Vous me chassez ?
#Eurydice
Le mystère est mon ennemi. Je suis décidée à le combattre.
oui oui.
I want 2 matches of 2 groups, the result I want is :
Match 1
1. #Heurtebise (Il ramasse son sac)
2. Vous regretterez de m'avoir fait du mal.
(Silence.) Vous me chassez ?
Match 2
1. #Eurydice
2. Le mystère est mon ennemi. Je suis décidée à le combattre.
oui oui.
And I can't understand why my regex /^(\#.+)$([^(\#|\#)]+)/ does not match the line beginning with a parenthesis. This is the result I get:
Match 1
1. #Heurtebise (Il ramasse son sac)
2. Vous regretterez de m'avoir fait du mal.
Match 2
1. #Eurydice
2. Le mystère est mon ennemi. Je suis décidée à le combattre.
oui oui.
Notice how it skips the line (Silence.) Vous me chassez ? in match 1. Can't understand why !
See the full case here : http://rubular.com/r/RR2eDc4ZBQ
Can someone help ? Thanks.
You may use
/^(#.+)((?:\R(?![##]).*)*)$/
It matches any line starting with #, and then all consecutive lines that do not start with #.
Details
^ - start of a line
(#.+) - Group 1: # and the rest of the line
((?:\R(?![##]).*)*) - Group 2: 0 or more occurrences of:
\R(?![##]) - a line break sequence not followed by #
.* - the rest of the line
$ - end of line (not needed though).
The error is in the character class meant to exclude a line starting with #:
[^(\#|\#)] excludes # but also excludes (, | and ), which is why the line starting with ( is skipped. A character class needs neither alternation nor parentheses. Using [^#] makes your sample code work for me.
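The character-class fix can be checked against the sample text. This is only a sketch: it assumes the speaker lines start with # exactly as shown in the question, and the variable names are mine.

```ruby
text = <<~'TEXT'
  #Heurtebise (Il ramasse son sac)
  Vous regretterez de m'avoir fait du mal.
  (Silence.) Vous me chassez ?
  #Eurydice
  Le mystère est mon ennemi. Je suis décidée à le combattre.
  oui oui.
TEXT

# Group 1: the speaker line. Group 2: everything up to the next "#".
# Note that [^#]+ also stops at a "#" in the middle of a line.
matches = text.scan(/^(#.+)$([^#]+)/).map { |head, body| [head, body.strip] }

matches.each_with_index do |(head, body), i|
  puts "Match #{i + 1}"
  puts "1. #{head}"
  puts "2. #{body}"
end
```

With the corrected class, the second group of the first match keeps the line beginning with a parenthesis.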
I have to take the right part, clean it, compare it with the middle part, and save it if the two are equal.
#!/usr/bin/env ruby
require 'rubygems'
require 'levenshtein'
require 'csv'

# Extending String class for blank? method
class String
  def blank?
    self.strip.empty?
  end
end

# In
lines = CSV.read('entrada.csv')
lines.each do |line|
  id = line[0].upcase.strip
  left = line[1].upcase.strip
  right = line[2].upcase.strip
  eduardo = line[2].upcase.split(' ','de')
  line[0] = id
  line[1] = left
  line[2] = right
  line[4] = eduardo[0]+eduardo[1]
  distance = Levenshtein.distance left, right
  line << 99 if (left.blank? or right.blank?)
  line << distance unless (left.blank? or right.blank?)
end

# Out
# counter = 0
CSV.open('salida.csv', 'w') do |csv|
  lines.each do |line|
    # counter = counter + 1 if line[3] <= 3
    csv << line
  end
end
# p counter
The middle part is the correct one; the right part is the one I have to correct.
Some examples:
Eduardo | Abner | Herrera | Herrera -> Eduardo Herrera
Angel | De | Leon -> Angel De Leon
Maira | Angelina | de | Leon -> Maira De Leon
Marquilla | Gutierrez | Petronilda |De | Leon -> Marquilla Petronilda
First order of business is to come up with some rules. Based on your examples, and Spanish naming customs, here's my stab at the rules.
A name has a forename, paternal surname, and optional maternal surname.
A forename can be multiple words.
A surname can be multiple words linked by a de, y, or e.
So ['Marquilla', 'Gutierrez', 'Petronilda', 'De', 'Leon'] should be { forename: 'Marquilla', paternal_surname: 'Gutierrez', maternal_surname: 'Petronilda de Leon' }
To simplify the process, I'd first join any composite surnames into one field. ['Marquilla', 'Gutierrez', 'Petronilda', 'De', 'Leon'] becomes ['Marquilla', 'Gutierrez', 'Petronilda De Leon']. Watch out for cases like ['Angel', 'De', 'Leon'] in which case the surname is probably De Leon.
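A minimal sketch of that joining step, under my own assumptions: the particle list (de, y, e) comes from the rules above, while the helper name join_composite_surnames and the rule "never merge a particle into the first word, so a forename is always left over" are hypothetical choices made for illustration.

```ruby
# Hypothetical helper; the merge rule is an assumption, not a fixed standard.
PARTICLES = %w[de y e].freeze

def join_composite_surnames(parts)
  joined = []
  i = 0
  while i < parts.length
    word = parts[i]
    following = parts[i + 1]
    if PARTICLES.include?(word.downcase) && following
      if joined.length > 1
        # Link the previous word and the next word into one surname.
        joined[-1] = "#{joined[-1]} #{word} #{following}"
      else
        # Only the forename precedes the particle: keep it separate.
        joined << "#{word} #{following}"
      end
      i += 2
    else
      joined << word
      i += 1
    end
  end
  joined
end

p join_composite_surnames(%w[Marquilla Gutierrez Petronilda De Leon])
p join_composite_surnames(%w[Angel De Leon])
```

The second call is the "watch out" case: the particle is attached only to the following word, so Angel stays the forename and De Leon becomes the surname.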
Once that's done, figuring out which part is which becomes easier.
name = {}
# A single part cannot be split into forename and surname
if parts.length < 2
  raise ArgumentError, "expected at least a forename and a surname"
# The special case of only two parts: forename paternal_surname
elsif parts.length == 2
  name = {
    forename: parts[0],
    paternal_surname: parts[1]
  }
# forename paternal_surname maternal_surname
else
  # The forename can have multiple parts, so work from the
  # end and whatever's left is their forename.
  name[:maternal_surname] = parts.pop
  name[:paternal_surname] = parts.pop
  name[:forename] = parts.join(" ")
end
There's a lot of ambiguity in Spanish naming, so this can only be an educated guess at what their actual name is. You'll probably have to tweak the rules as you learn more about the dataset. For example, I'm pretty sure handling of de is not that simple. For example...
One Leocadia Blanco Álvarez, married to a Pedro Pérez Montilla, may be addressed as Leocadia Blanco de Pérez or as Leocadia Blanco Álvarez de Pérez
In that case ['Marquilla', 'Gutierrez', 'Petronilda', 'De', 'Leon'] becomes ['Marquilla', 'Gutierrez', 'Petronilda', 'De Leon'], which is { forename: 'Marquilla', paternal_surname: 'Gutierrez', maternal_surname: 'Petronilda', married_to: 'Leon' }, or 'Marquilla Gutierrez Petronilda', who is married to someone whose paternal surname is Leon.
Good luck.
I would add more columns to the database, like last_name1, last_name2, last_name3, etc, and make them optional (don't put validations on those attributes). Hope that answers your question!
I have two addresses side-by-side in a multi-line string:
Adresse de prise en charge :          Adresse d'arrivée :
rue des capucines                     rue des tilleuls
92210 Saint Cloud                     67000 Strasbourg
Tél.:                                 Tél.:
I need to extract the addresses on the left and right with a regexp, and assign them to variables. I need to match:
address1: "rue des capucines 92210 Saint Cloud"
address2: "rue des tilleuls 67000 Strasbourg"
I thought of separating them by the spaces, but I can't find any regexp to count the spaces. I tried:
en\s*charge\s*:\s*((.|\n)*)\s*
and similar, but that gives me both addresses, and is not what I'm looking for. Any help will be deeply appreciated.
I'd do something like this:
str = <<EOT
Adresse de prise en charge : Adresse d'arrivée :
rue des capucines rue des tilleuls
92210 Saint Cloud 67000 Strasbourg
Tél.: Tél.:
EOT
left_addr = []
right_addr = []
lines = str.squeeze("\n").gsub(':', '').lines.map(&:strip) # => ["Adresse de prise en charge Adresse d'arrivée", "rue des capucines rue des tilleuls", "92210 Saint Cloud 67000 Strasbourg", "Tél. Tél."]
center_line_pos = lines.max.length / 2 # => 35
lines.each do |l|
left_addr << l[0 .. (center_line_pos - 1)].strip
right_addr << l[center_line_pos .. -1].strip
end
At this point left_addr and right_addr look like:
left_addr # => ["Adresse de prise en charge", "rue des capucines", "92210 Saint Cloud", "Tél."]
right_addr # => ["Adresse d'arrivée", "rue des tilleuls", "67000 Strasbourg", "Tél."]
And here's what they contain:
puts left_addr
puts '------'
puts right_addr
# >> Adresse de prise en charge
# >> rue des capucines
# >> 92210 Saint Cloud
# >> Tél.
# >> ------
# >> Adresse d'arrivée
# >> rue des tilleuls
# >> 67000 Strasbourg
# >> Tél.
If you need the results all in one line without the 'Tel:':
puts left_addr[0..-2].join(' ').squeeze(' ')
puts '------'
puts right_addr[0..-2].join(' ').squeeze(' ')
# >> Adresse de prise en charge rue des capucines 92210 Saint Cloud
# >> ------
# >> Adresse d'arrivée rue des tilleuls 67000 Strasbourg
Here's a breakdown of what is going on:
str.squeeze("\n") # => " Adresse de prise en charge : Adresse d'arrivée :\n rue des capucines rue des tilleuls\n 92210 Saint Cloud 67000 Strasbourg\n Tél.: Tél.:\n"
.gsub(':', '') # => " Adresse de prise en charge Adresse d'arrivée \n rue des capucines rue des tilleuls\n 92210 Saint Cloud 67000 Strasbourg\n Tél. Tél.\n"
.lines # => [" Adresse de prise en charge Adresse d'arrivée \n", " rue des capucines rue des tilleuls\n", " 92210 Saint Cloud 67000 Strasbourg\n", " Tél. Tél.\n"]
.map(&:strip) # => ["Adresse de prise en charge Adresse d'arrivée", "rue des capucines rue des tilleuls", "92210 Saint Cloud 67000 Strasbourg", "Tél. Tél."]
Assuming that each address section in each line is indented as much as or further than the corresponding "Adresse" in the first line, the following can extract not only two addresses aligned sidewards, but n addresses in general.
lines = string.split(/#{$/}+/)
# => [
# => "Adresse de prise en charge : Adresse d'arrivée :",
# => " rue des capucines rue des tilleuls",
# => " 92210 Saint Cloud 67000 Strasbourg",
# => " Tél.: Tél.:"
# => ]
break_points = []
lines.first.scan(/\bAdresse\b/){break_points.push($~.begin(0))}
ranges = break_points.push(0).each_cons(2).map{|s, e| s..(e - 1)}
# => [0..53, 54..-1]
address1, address2 =
lines[1..-2]
.map{|s| ranges.map{|r| s[r]}}
.transpose
.map{|a| a.join(" ").strip.squeeze(" ")}
# => [
# => "rue des capucines 92210 Saint Cloud",
# => "rue des tilleuls 67000 Strasbourg"
# => ]
str =
" Adresse de prise en charge : Adresse d'arrivée :
rue des capucines rue des tilleuls
92210 Saint Cloud 67000 Strasbourg
Tél.: Tél.:"
adr_prise, adr_arr = str.lines[3].strip.split(/ {2,}/) #split on 2+ spaces
code_prise, cite_prise, code_arr, cite_arr = str.lines[6].strip.split(/ {2,}/)
Assumptions
I have assumed that the first and last lines are not wanted and the street names are separated by at least two spaces, and the same for the postal code/city strings. This permits the street name (and postal code/city pair) for "prise en charge" to end below "Adresse d'arrivée :".
Code
def parse_text(text)
text.split(/\n+\s+/)[1..-2].
map { |s| s.gsub(/\d+\K\s+/,' ').split(/\s{2,}/) }.
transpose.
map { |a| a.join(' ') }
end
Examples
Example 1
text = <<BITTER_END
Adresse de prise en charge : Adresse d'arrivée :
rue des capucines rue des tilleuls
92210 Saint Cloud 67000 Strasbourg
Tél.: Tél.:
BITTER_END
parse_text(text)
#=> ["rue des capucines 92210 Saint Cloud",
# "rue des tilleuls 67000 Strasbourg"]
Example 2
text = <<_
Adresse 1 : Adresse 2 : Adresse 3 :
rue nom le plus long du monde par un mile rue gargouilles rue des tilleuls
92210 Saint Cloud 31400 Nice 67000 Strasbourg
France France France
Tél.: Tél.: Tél.:
_
parse_text(text)
#=> ["rue nom le plus long du monde par un mile 92210 Saint Cloud France",
# "rue gargouilles 31400 Nice France",
# "rue des tilleuls 67000 Strasbourg France"]
Explanation
The steps for text given in the question:
Split into lines, removing blank lines and leading spaces:
a1 = text.split(/\n+\s+/)
#=> ["Adresse de prise en charge : Adresse d'arrivée :",
# "rue des capucines rue des tilleuls",
# "92210 Saint Cloud 67000 Strasbourg",
# "Tél.: Tél.:\n"]
Remove first and last lines:
a2 = a1[1..-2]
#=> ["rue des capucines rue des tilleuls",
# "92210 Saint Cloud 67000 Strasbourg"]
Remove extra spaces between the postal codes and cities and split each line on two or more spaces:
r = /
\d+ # match one or more digits
\K # forget everything matched so far
\s+ # match one or more whitespace characters
/x # extended/free-spacing regex definition mode
a3 = a2.map { |s| s.gsub(/\d+\K\s+/,' ').split(/\s{2,}/) }
#=> [["rue des capucines", "rue des tilleuls"],
# ["92210 Saint Cloud", "67000 Strasbourg"]]
Group by column:
a4 = a3.transpose
#=> [["rue des capucines", "92210 Saint Cloud"],
# ["rue des tilleuls", "67000 Strasbourg"]]
Join strings:
a4.map { |a| a.join(' ') }
#=> ["rue des capucines 92210 Saint Cloud",
# "rue des tilleuls 67000 Strasbourg"]
Inspired by @steenslag's very pragmatic answer, here's a pretty dense one-liner just for fun.
# Assume the input data is in the variable `text`
left_addr, right_addr = text.lines.values_at(3, 6).map do |line|
line.scan(/(?:\d+ +)?\S+(?: \S+)*/)
.map {|part| part.squeeze(' ') }
end
.transpose
.map {|addr| addr.join(' ') }
puts left_addr
# => rue des capucines 92210 Saint Cloud
puts right_addr
# => rue des tilleuls 67000 Strasbourg
Like @steenslag's answer, this assumes that the desired data is always on lines 3 and 6. It also assumes that on line 6 both columns will have a postal code and city and that the postal code will always start with a digit.
Because it's a pretty dense one-liner and because it makes a lot of assumptions, I don't think this is the best answer and I'm marking it Community Wiki.
If I have time I'll come back and unpack this later.
Assuming that the "center line position" is known, this would work:
left_lines, right_lines = str.scan(/^(.{50})(.*)$/).transpose
The regular expression captures 50 characters at the beginning of each line plus the remaining characters until the line's end.
scan returns a nested array: (I'm using placeholders because the actual lines are too long)
[
['1st left line', '1st right line'],
['2nd left line', '2nd right line'],
...
]
transpose converts it to:
[
['1st left line', '2nd left line', ...], # <- assigned to left_lines
['1st right line', '2nd right line', ...] # <- assigned to right_lines
]
The lines (excluding the first and last line) have to be joined and spaces have to be removed: (see strip and squeeze)
left_lines[1..-2].join(' ').strip.squeeze(' ')
#=> "rue des capucines 92210 Saint Cloud"
Same for right_lines:
right_lines[1..-2].join(' ').strip.squeeze(' ')
#=> "rue des tilleuls 67000 Strasbourg"
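If the split position is not known in advance, it can be read from the header line instead. The following is a self-contained sketch under assumptions of mine: the columns are padded with spaces, every header starts with the word "Adresse", and the sample layout is rebuilt with ljust(40), where 40 is an arbitrary width chosen for illustration.

```ruby
# Sample input rebuilt with a fixed left-column width (40 is arbitrary).
text = [
  "Adresse de prise en charge :".ljust(40) + "Adresse d'arrivée :",
  "rue des capucines".ljust(40) + "rue des tilleuls",
  "92210 Saint Cloud".ljust(40) + "67000 Strasbourg",
  "Tél.:".ljust(40) + "Tél.:"
].join("\n")

lines = text.lines.map(&:chomp)

# Start of the second column: the second occurrence of "Adresse" in the header.
split_at = lines.first.index("Adresse", 1)

# Cut every line except the first and last at that position, then group by column.
left, right = lines[1..-2].map do |line|
  [line[0...split_at].to_s.strip, line[split_at..-1].to_s.strip]
end.transpose

address1 = left.join(' ')
address2 = right.join(' ')

puts address1
puts address2
```

The to_s calls guard against lines that are shorter than the split position, for which the slice returns nil.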
I have a problem with some regular expressions in Ruby. This is the situation:
Input text:
"NU POSTA aşa ceva pe Facebook! „Prostia se plăteşte”
Publicat la: 10.02.2015 10:20 Ultima actualizare: 10.02.2015 10:35
Adresa de e-mail la care vrei sa primesti STIREA atunci cand se intampla
Abonează-te
---- Here is some usefull text ---
Abonează-te
× Citeşte mai mult »
Adauga un comentariu"
I need a regular expression which can extract only the useful text between the two occurrences of the word "Abonează-te".
I tried result = result.gsub(/^[.]{*}\nAbonează-te/, '') to remove the text from the start of the string up to the word 'Abonează-te', but this does not work. I have no idea how to solve this. Can you help me?
Instead of using regular expression, you can use String#split, then take the second part:
s = "NU POSTA aşa ceva pe Facebook! „Prostia se plăteşte”
Publicat la: 10.02.2015 10:20 Ultima actualizare: 10.02.2015 10:35
Adresa de e-mail la care vrei sa primesti STIREA atunci cand se intampla
Abonează-te
---- Here is some usefull text ---
Abonează-te
× Citeşte mai mult »
Adauga un comentariu"
s.split('Abonează-te', 3)[1].strip # 3: at most 3 parts
# => "---- Here is some usefull text ---"
UPDATE
If you want to get multiple matches:
s = "NU
Abonează-te
-- Here's some
Abonează-te
text --
Abonează-te
comentariu"
s.split('Abonează-te')[1..-2].map(&:strip)
# => ["-- Here's some", "text --"]
You could use String#scan. There is no need for String#gsub when you want to extract a particular piece of text.
> s = "NU POSTA aşa ceva pe Facebook! „Prostia se plăteşte”
" Publicat la: 10.02.2015 10:20 Ultima actualizare: 10.02.2015 10:35
" Adresa de e-mail la care vrei sa primesti STIREA atunci cand se intampla
" Abonează-te
" ---- Here is some usefull text ---
" Abonează-te
" × Citeşte mai mult »
" Adauga un comentariu"
=> "NU POSTA aşa ceva pe Facebook! „Prostia se plăteşte”\nPublicat la: 10.02.2015 10:20 Ultima actualizare: 10.02.2015 10:35\nAdresa de e-mail la care vrei sa primesti STIREA atunci cand se intampla\nAbonează-te\n---- Here is some usefull text --- \nAbonează-te\n× Citeşte mai mult »\nAdauga un comentariu"
irb(main):010:0> s.scan(/(?<=Abonează-te\n)[\s\S]*?(?=\nAbonează-te)/)
=> ["---- Here is some usefull text --- "]
Remove the newline \n character present inside the lookarounds if necessary. [\s\S]*? will do a non-greedy match of space or non-space characters zero or more times.
Your regex syntax is incorrect: a . inside a character class matches a literal dot, and {*} matches an opening curly brace zero or more times, followed by a closing curly brace.
You can match instead of replacing here.
s.match(/Abonează-te(.*?)Abonează-te/m)[1].strip()
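Note that match returns nil when the marker does not occur twice, so the [1] above would raise. A slightly more defensive sketch uses String#[] with a capture index; text_between is a hypothetical helper name and the shortened sample string is made up for illustration.

```ruby
# Hypothetical helper: returns the stripped text between the first two
# occurrences of marker, or nil when there is no such pair.
def text_between(s, marker)
  quoted = Regexp.escape(marker)
  captured = s[/#{quoted}(.*?)#{quoted}/m, 1]  # nil if the pattern does not match
  captured && captured.strip
end

s = "intro\nAbonează-te\n---- Here is some usefull text ---\nAbonează-te\nfooter"
p text_between(s, 'Abonează-te')
p text_between("no marker here", 'Abonează-te')
```

Regexp.escape is not strictly needed for this marker, but it keeps the helper safe for markers that contain regex metacharacters.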