Regex order when matching single square bracket - elisp

Hello to all of you,
I have a question about a regex that behaves this way specifically in Elisp. I'm trying to match a single square bracket, and ielm gives me this:
(string-match "[\]\[]" "[") ; ===> 0
(string-match "[\[\]]" "[") ; ===> nil
(string-match "[\]\[]" "]") ; ===> 0
(string-match "[\[\]]" "]") ; ===> nil
(string-match "[\[\]]" "[]") ; ===> 0
(string-match "[\]\[]" "[]") ; ===> 0
(string-match "[\]\[]" "][") ; ===> 0
(string-match "[\]\[]" "][") ; ===> 0
Whereas in JS, all of these match:
'['.match(/[\[\]]/) // ===>['[']
'['.match(/[\]\[]/) // ===>['[']
']'.match(/[\[\]]/) // ===>[']']
']'.match(/[\]\[]/) // ===>[']']
'[]'.match(/[\[\]]/) // ===>['[']
'[]'.match(/[\]\[]/) // ===>['[']
']['.match(/[\[\]]/) // ===>[']']
']['.match(/[\]\[]/) // ===>[']']
Here's a regex101: https://regex101.com/r/e8sLXr/1
I don't understand why the order of my square brackets in Elisp matters. I've tried using double backslashes but it doesn't help. Actually, it gives me more nils on these regexes, whereas I thought the proper way to escape a backslash in a string for the regex to process was to double it: https://www.gnu.org/software/emacs/manual/html_node/elisp/Regexp-Example.html#Regexp-Example
Does anyone know what I'm missing and could help me?
Cheers,
Thomas
EDIT: grammar

Firstly, let's ditch the backslashes. [ and ] are not special to strings(*), and therefore escaping them does not change them. So the following is equivalent, and easier to read:
(string-match "[][]" "[") ; ===> 0
(string-match "[][]" "]") ; ===> 0
(string-match "[][]" "[]") ; ===> 0
(string-match "[][]" "][") ; ===> 0
(string-match "[][]" "][") ; ===> 0
This pattern matches either ] or [, and all the strings being tested have one of those characters at the start; hence we match at position 0 in each case.
Critically, to include a ] in a character alternative it must be the first character. Hence the following did not do what you wanted:
(string-match "[[]]" "[") ; ===> nil
(string-match "[[]]" "]") ; ===> nil
(string-match "[[]]" "[]") ; ===> 0
This pattern matches exactly [], because [[] is a character alternative matching anything in the set comprising the single-character [; and that character alternative is then followed by ] (which, when it is not ending a character alternative, just matches itself).
You will want to read the "character alternative" details at:
C-h i g (elisp)Regexp Special RET
(*) Note also that backslashes are not special to a regexp when they are within a character alternative.
Your regexps didn't have any backslashes -- because in double-quoted string format you would have needed to double the backslashes to include those in the regexp -- but if you had done that, and if they were also inside the character alternative, it would just mean that a backslash would be one of the characters matched by that set.
e.g. "[\\]\\[]" is the regexp [\]\[] which matches \[]
(Remembering that ] cannot appear in a character alternative unless it is the first character.)
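The string-reader effect is not unique to Elisp. As a rough analogy in JavaScript (chosen because the question already compares against it; the variable names are made up for illustration), building a RegExp from a string literal drops the same unneeded backslashes, so the engine ends up seeing [[]] just like Emacs does:

```javascript
// Analogy only: JavaScript string literals, like Elisp ones, drop a
// backslash that precedes a non-special character such as [ or ].
const fromString = new RegExp("[\[\]]"); // the regexp engine sees [[]]

console.log(fromString.source);     // [[]]
console.log(fromString.test("["));  // false: "[" is not followed by "]"
console.log(fromString.test("[]")); // true: the set [[] followed by a literal ]

// A regex literal keeps the backslashes, so JS builds a two-character set.
const fromLiteral = /[\[\]]/;
console.log(fromLiteral.test("[")); // true
console.log(fromLiteral.test("]")); // true
```

The analogy stops there: JavaScript treats a backslash inside a character class as an escape, whereas Emacs does not, which is why ] has to come first in an Emacs character alternative.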

Related

Trailing backslash error web-mode content type

I get this error when trying to set content type in web-mode: File mode specification error: (invalid-regexp Trailing backslash)
I have had a hard time debugging this. I'm very new to Emacs, so I need some help setting up web-mode. I have been following the documentation at web-mode.org, but it has been difficult to decipher. Thanks.
(use-package
web-mode
:defer 2
:ensure t
:mode ("\\.html?\\"
"\\.hbs$\\"
"\\.vue$\\"
"\\.css?\\"
"components/.*\\.js[x]?\\'"
"containers/.*\\.js[x]?\\'")
:config (progn
(setq web-mode-enable-auto-closing t
web-mode-enable-auto-opening t
web-mode-enable-auto-pairing t
web-mode-enable-auto-indentation t
web-mode-enable-auto-quoting t
;; right now paired with AutoComplete
web-mode-ac-sources-alist
'(("css" . (ac-source-css-property))
("vue" . (ac-source-words-in-buffer ac-source-abbrev))
("html" . (ac-source-words-in-buffer ac-source-abbrev)))
web-mode-content-types-alist
'(("jsx" . "components/.*\\.js[x]?\\'")
("jsx" . "containers/.*\\.js[x]?\\'")))))
;; usually I set them in containers/ or components/ directories
;; to keep them separate from plain JS
;; adjust indents for web-mode to 2 spaces
(defun my-web-mode-hook ()
"Hooks for Web mode. Adjust indents"
;;; http://web-mode.org/
(setq web-mode-markup-indent-offset 2)
(setq web-mode-css-indent-offset 2)
(setq web-mode-code-indent-offset 2))
(add-hook 'web-mode-hook 'my-web-mode-hook)
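In the list of regexps after :mode, make sure that they all end with \\'. Currently two of them do, but four of them lost the final ' character, which leaves a lone backslash at the end of the regexp. Also drop the $ in front of \\': in Emacs regexps $ only acts as an anchor at the end of the pattern or before \) or \|, so before \' it would match a literal dollar sign.
:mode ("\\.html?\\'"
"\\.hbs\\'"
"\\.vue\\'"
"\\.css?\\'"
"components/.*\\.js[x]?\\'"
"containers/.*\\.js[x]?\\'")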
\' is a special regexp construct that "matches the empty string, but only at the end of the buffer or string being matched against".
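A trailing backslash is rejected by most regexp engines, not only Emacs. As a loose illustration in JavaScript (an analogy, not web-mode code; isValidPattern is a hypothetical helper), a pattern string that ends in a lone backslash fails to compile, while the same pattern with an end anchor works. JavaScript has no \' construct, so $ without the m flag stands in for it here:

```javascript
// Hypothetical helper: report whether a pattern string compiles.
function isValidPattern(source) {
  try {
    new RegExp(source);
    return true;
  } catch (err) {
    if (err instanceof SyntaxError) return false;
    throw err;
  }
}

// "\\.html?\\" is the pattern \.html?\ : it ends in a lone backslash.
console.log(isValidPattern("\\.html?\\")); // false

// With an end anchor instead, the pattern is complete.
console.log(isValidPattern("\\.html?$"));  // true
console.log(new RegExp("\\.html?$").test("index.html"));     // true
console.log(new RegExp("\\.html?$").test("index.html.bak")); // false
```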

Ruby Regex find numbers not surrounded by alphabetical characters [duplicate]

I have a regex expression that I'm using to find all the words in a given block of content, case insensitive, that are contained in a glossary stored in a database. Here's my pattern:
/($word)/i
The problem is, if I use /(Foo)/i then words like Food get matched. There needs to be whitespace or a word boundary on both sides of the word.
How can I modify my expression to match only the word Foo when it is a word at the beginning, middle, or end of a sentence?
Use word boundaries:
/\b($word)\b/i
Or if you're searching for "S.P.E.C.T.R.E." like in Sinan Ünür's example:
/(?:\W|^)(\Q$word\E)(?:\W|$)/i
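For comparison, here is a minimal JavaScript sketch of both patterns. JavaScript has no \Q...\E, so a small escaping function stands in for it; escapeRegExp, wholeWord and delimitedWord are names invented for this example, and the sample sentences are made up:

```javascript
// Escape regex metacharacters so the glossary word is matched literally
// (a stand-in for Perl's \Q...\E).
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: case-insensitive whole-word match using \b.
function wholeWord(word) {
  return new RegExp("\\b(" + escapeRegExp(word) + ")\\b", "i");
}

// Hypothetical helper: the variant delimited by non-word characters.
function delimitedWord(word) {
  return new RegExp("(?:\\W|^)(" + escapeRegExp(word) + ")(?:\\W|$)", "i");
}

console.log(wholeWord("Foo").test("I like Food"));    // false
console.log(wholeWord("Foo").test("A foo walks in")); // true

// \b needs a word character on one side, so it fails after the final "."
const text = "S.P.E.C.T.R.E. is a fictional organisation";
console.log(wholeWord("S.P.E.C.T.R.E.").test(text));     // false
console.log(delimitedWord("S.P.E.C.T.R.E.").test(text)); // true
```

Note that the second form consumes the delimiter characters, so adjacent occurrences separated by a single space may need lookarounds instead if you match repeatedly.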
To match any whole word you would use the pattern (\w+)
Assuming you are using PCRE or something similar, see this live example: http://regex101.com/r/cU5lC2
Matching any whole word on the commandline with (\w+)
I'll be using the phpsh interactive shell on Ubuntu 12.10 to demonstrate the PCRE regex engine through the method known as preg_match
Start phpsh, put some content into a variable, match on word.
el@apollo:~/foo$ phpsh
php> $content1 = 'badger';
php> $content2 = '1234';
php> $content3 = '$%^&';
php> echo preg_match('(\w+)', $content1);
1
php> echo preg_match('(\w+)', $content2);
1
php> echo preg_match('(\w+)', $content3);
0
The preg_match function used the PCRE engine within the PHP language to analyze the variables $content1, $content2 and $content3 with the (\w+) pattern.
$content1 and $content2 contain at least one word; $content3 does not.
Match a number of literal words on the commandline with (dart|fart)
el@apollo:~/foo$ phpsh
php> $gun1 = 'dart gun';
php> $gun2 = 'fart gun';
php> $gun3 = 'farty gun';
php> $gun4 = 'unicorn gun';
php> echo preg_match('(dart|fart)', $gun1);
1
php> echo preg_match('(dart|fart)', $gun2);
1
php> echo preg_match('(dart|fart)', $gun3);
1
php> echo preg_match('(dart|fart)', $gun4);
0
The variables $gun1 and $gun2 contain the string dart or fart; $gun4 does not. However, it may be a problem that looking for the word fart also matches farty in $gun3. To fix this, enforce word boundaries in the regex.
Match literal words on the commandline with word boundaries.
el@apollo:~/foo$ phpsh
php> $gun1 = 'dart gun';
php> $gun2 = 'fart gun';
php> $gun3 = 'farty gun';
php> $gun4 = 'unicorn gun';
php> echo preg_match('(\bdart\b|\bfart\b)', $gun1);
1
php> echo preg_match('(\bdart\b|\bfart\b)', $gun2);
1
php> echo preg_match('(\bdart\b|\bfart\b)', $gun3);
0
php> echo preg_match('(\bdart\b|\bfart\b)', $gun4);
0
So it's the same as the previous example except that the word fart with a \b word boundary does not exist in the content: farty.
Using \b can yield surprising results. You would be better off figuring out what separates a word from its definition and incorporating that information into your pattern.
#!/usr/bin/perl
use strict; use warnings;
use re 'debug';
my $str = 'S.P.E.C.T.R.E. (Special Executive for Counter-intelligence,
Terrorism, Revenge and Extortion) is a fictional global terrorist
organisation';
my $word = 'S.P.E.C.T.R.E.';
if ( $str =~ /\b(\Q$word\E)\b/ ) {
print $1, "\n";
}
Output:
Compiling REx "\b(S\.P\.E\.C\.T\.R\.E\.)\b"
Final program:
1: BOUND (2)
2: OPEN1 (4)
4: EXACT (9)
9: CLOSE1 (11)
11: BOUND (12)
12: END (0)
anchored "S.P.E.C.T.R.E." at 0 (checking anchored) stclass BOUND minlen 14
Guessing start of match in sv for REx "\b(S\.P\.E\.C\.T\.R\.E\.)\b" against "S.P
.E.C.T.R.E. (Special Executive for Counter-intelligence,"...
Found anchored substr "S.P.E.C.T.R.E." at offset 0...
start_shift: 0 check_at: 0 s: 0 endpos: 1
Does not contradict STCLASS...
Guessed: match at offset 0
Matching REx "\b(S\.P\.E\.C\.T\.R\.E\.)\b" against "S.P.E.C.T.R.E. (Special Exec
utive for Counter-intelligence,"...
0 | 1:BOUND(2)
0 | 2:OPEN1(4)
0 | 4:EXACT (9)
14 | 9:CLOSE1(11)
14 | 11:BOUND(12)
failed...
Match failed
Freeing REx: "\b(S\.P\.E\.C\.T\.R\.E\.)\b"
For those who want to validate an enum in their code, you can follow this guide.
In regex, you can use ^ to anchor the start of a string and $ to anchor the end. Using them in combination with | could be what you want:
^(Male)$|^(Female)$
It will return true only for Male or Female case.
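A quick way to check that claim, sketched in JavaScript (the test strings are made up for illustration):

```javascript
// Anchored alternation: the whole string must be exactly one of the values.
const gender = /^(Male)$|^(Female)$/;

console.log(gender.test("Male"));   // true
console.log(gender.test("Female")); // true
console.log(gender.test("Males"));  // false: extra character before the end
console.log(gender.test("male"));   // false: the pattern is case-sensitive

// An equivalent, shorter form with a single pair of anchors.
const genderShort = /^(?:Male|Female)$/;
console.log(genderShort.test("Female")); // true
```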
If you are doing it in Notepad++
[\w]+
Would give you the entire word, and you can add parentheses to get it as a group. Example: conv1 = Conv2D(64, (3, 3), activation=LeakyReLU(alpha=a), padding='valid', kernel_initializer='he_normal')(inputs). I would like to move LeakyReLU into its own line as a comment, and replace the current activation. In Notepad++ this can be done using the following find command:
([\w]+)( = .+)(LeakyReLU.alpha=a.)(.+)
and the replace command becomes:
\1\2'relu'\4 \n # \1 = LeakyReLU\(alpha=a\)\(\1\)
The spaces are there to keep the right formatting in my code. :)
Use word boundaries (\b).
The following (using four escapes) works in my environment: Mac, Safari Version 10.0.3 (12602.4.8)
var myReg = new RegExp('\\\\b' + variable + '\\\\b', 'g')
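How many backslashes are needed depends on how many layers of escaping the string passes through. In a plain JavaScript string literal, two are normally enough; four would only be needed if something else unescapes the string first, which is a guess about that environment. A small sketch to check this yourself (the value of variable is a made-up search term):

```javascript
const variable = "cat"; // hypothetical search term without metacharacters

// Two backslashes in the string literal yield the regex token \b.
const twoEscapes = new RegExp("\\b" + variable + "\\b", "g");
console.log(twoEscapes.source);                  // \bcat\b
console.log("cat concat cat".match(twoEscapes)); // [ 'cat', 'cat' ]

// Four backslashes yield \\b: a literal backslash followed by "b".
const fourEscapes = new RegExp("\\\\b" + variable + "\\\\b", "g");
console.log(fourEscapes.source);                  // \\bcat\\b
console.log("cat concat cat".match(fourEscapes)); // null
```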
Get all "words" in a string
/([^\s]+)/g
Basically [^\s]+ means "match runs of non-whitespace characters", i.e. break on whitespace.
Don't forget the g flag (global), so that all matches are returned rather than just the first.
Try it:
"Not the answer you're looking for? Browse other questions tagged regex word-boundary or ask your own question.".match(/([^\s]+)/g)
→ (17) ['Not', 'the', 'answer', "you're", 'looking', 'for?', 'Browse', 'other', 'questions', 'tagged', 'regex', 'word-boundary', 'or', 'ask', 'your', 'own', 'question.']

CLOS make-instance is really slow and causes heap exhaustion in SBCL

I'm writing a multi-architecture assembler/disassembler in Common Lisp (SBCL 1.1.5 on 64-bit Debian GNU/Linux). Currently the assembler produces correct code for a subset of x86-64. For assembling x86-64 assembly code I use a hash table in which assembly instruction mnemonics (strings) such as "jc-rel8" and "stosb" are keys that return a list of 1 or more encoding functions, like the ones below:
(defparameter *emit-function-hash-table-x64* (make-hash-table :test 'equalp))
(setf (gethash "jc-rel8" *emit-function-hash-table-x64*) (list #'jc-rel8-x86))
(setf (gethash "stosb" *emit-function-hash-table-x64*) (list #'stosb-x86))
The encoding functions are like these (some are more complicated, though):
(defun jc-rel8-x86 (arg1 &rest args)
(jcc-x64 #x72 arg1))
(defun stosb-x86 (&rest args)
(list #xaa))
Now I am trying to incorporate the complete x86-64 instruction set by using NASM's (NASM 2.11.06) instruction encoding data (file insns.dat) converted to Common Lisp CLOS syntax. This would mean replacing regular functions used for emitting binary code (like the functions above) with instances of a custom x86-asm-instruction class (a very basic class so far, some 20 slots with :initarg, :reader, :initform etc.), in which an emit method with arguments would be used for emitting the binary code for given instruction (mnemonic) and arguments. The converted instruction data looks like this (but it's more than 40'000 lines and exactly 7193 make-instance's and 7193 setf's).
;; first mnemonic + operand combination instances (:is-variant t).
;; there are 4928 such instances for x86-64 generated from NASM's insns.dat.
(eval-when (:compile-toplevel :load-toplevel :execute)
(setf Jcc-imm-near (make-instance 'x86-asm-instruction
:name "Jcc"
:operands "imm|near"
:code-string "[i: odf 0f 80+c rel]"
:arch-flags (list "386" "BND")
:is-variant t))
(setf STOSB-void (make-instance 'x86-asm-instruction
:name "STOSB"
:operands "void"
:code-string "[ aa]"
:arch-flags (list "8086")
:is-variant t))
;; then, container instances which contain (or could instead refer to)
;; the possible variants of each instruction.
;; there are 2265 such instances for x86-64 generated from NASM's insns.dat.
(setf Jcc (make-instance 'x86-asm-instruction
:name "Jcc"
:is-container t
:variants (list Jcc-imm-near
Jcc-imm64-near
Jcc-imm-short
Jcc-imm
Jcc-imm
Jcc-imm
Jcc-imm)))
(setf STOSB (make-instance 'x86-asm-instruction
:name "STOSB"
:is-container t
:variants (list STOSB-void)))
;; thousands of objects more here...
) ; this bracket closes (eval-when (:compile-toplevel :load-toplevel :execute)
I have converted NASM's insns.dat to Common Lisp syntax (like above) using a trivial Perl script (further below, but there's nothing of interest in the script itself), and in principle it works. However, compiling those 7193 objects is really slow and commonly causes heap exhaustion. On my Linux Core i7-2760QM laptop with 16G of memory, compiling an (eval-when (:compile-toplevel :load-toplevel :execute) code block with 7193 objects like the ones above takes more than 7 minutes and sometimes causes heap exhaustion, like this:
;; Swank started at port: 4005.
* Heap exhausted during garbage collection: 0 bytes available, 32 requested.
Gen StaPg UbSta LaSta LUbSt Boxed Unboxed LB LUB !move Alloc Waste Trig WP GCs Mem-age
0: 0 0 0 0 0 0 0 0 0 0 0 41943040 0 0 0.0000
1: 0 0 0 0 0 0 0 0 0 0 0 41943040 0 0 0.0000
2: 0 0 0 0 0 0 0 0 0 0 0 41943040 0 0 0.0000
3: 38805 38652 0 0 49474 15433 389 416 0 2144219760 9031056 1442579856 0 1 1.5255
4: 127998 127996 0 0 45870 14828 106 143 199 1971682720 25428576 2000000 0 0 0.0000
5: 0 0 0 0 0 0 0 0 0 0 0 2000000 0 0 0.0000
6: 0 0 0 0 1178 163 0 0 0 43941888 0 2000000 985 0 0.0000
Total bytes allocated = 4159844368
Dynamic-space-size bytes = 4194304000
GC control variables:
*GC-INHIBIT* = true
*GC-PENDING* = in progress
*STOP-FOR-GC-PENDING* = false
fatal error encountered in SBCL pid 9994(tid 46912556431104):
Heap exhausted, game over.
Welcome to LDB, a low-level debugger for the Lisp runtime environment.
ldb>
I had to add the --dynamic-space-size 4000 parameter to SBCL to get it compiled at all, but even with 4 gigabytes of dynamic space the heap sometimes gets exhausted. Even if the heap exhaustion were solved, more than 7 minutes for compiling 7193 instances after only adding a slot to the class (the x86-asm-instruction class used for these instances) is way too much for interactive development in the REPL (I use slimv, if that matters).
Here's the (time (compile-file ...)) output:
; caught 18636 WARNING conditions
; insns.fasl written
; compilation finished in 0:07:11.329
Evaluation took:
431.329 seconds of real time
238.317000 seconds of total run time (234.972000 user, 3.345000 system)
[ Run times consist of 6.073 seconds GC time, and 232.244 seconds non-GC time. ]
55.25% CPU
50,367 forms interpreted
784,044 lambdas converted
1,031,842,900,608 processor cycles
19,402,921,376 bytes consed
Using OOP (CLOS) would enable incorporating the instruction mnemonic (such as jc or stosb above, :name), allowed operands of the instruction (:operands), instruction's binary encoding (such as #xaa for stosb, :code-string) and possible architecture limitations (:arch-flags) of the instruction in one object. But it seems that at least my 3-year-old computer is not efficient enough to compile around 7000 CLOS object instances quickly.
My question is: Is there some way to make SBCL's make-instance faster, or should I keep assembly code generation in regular functions like the examples further above? I'd be also very happy to know about any other possible solutions.
Here's the Perl script, just in case:
#!/usr/bin/env perl
use strict;
use warnings;
# this program converts NASM's `insns.dat` to Common Lisp Object System (CLOS) syntax.
my $firstchar;
my $line_length;
my $are_there_square_brackets;
my $mnemonic_and_operands;
my $mnemonic;
my $operands;
my $code_string;
my $flags;
my $mnemonic_of_current_mnemonic_array;
my $clos_object_name;
my $clos_mnemonic;
my $clos_operands;
my $clos_code_string;
my $clos_flags;
my @object_name_array = ();
my @mnemonic_array = ();
my @operands_array = ();
my @code_string_array = ();
my @flags_array = ();
my @each_mnemonic_only_once_array = ();
my @instruction_variants_array = ();
my @instruction_variants_for_current_instruction_array = ();
open(FILE, 'insns.dat');
$mnemonic_of_current_mnemonic_array = "";
# read one line at once.
while (<FILE>)
{
$firstchar = substr($_, 0, 1);
$line_length = length($_);
$are_there_square_brackets = ($_ =~ /\[.*\]/);
chomp;
if (($line_length > 1) && ($firstchar =~ /[^\t ;]/))
{
if ($are_there_square_brackets)
{
($mnemonic_and_operands, $code_string, $flags) = split /[\[\]]+/, $_;
$code_string = "[" . $code_string . "]";
($mnemonic, $operands) = split /[\t ]+/, $mnemonic_and_operands;
}
else
{
($mnemonic, $operands, $code_string, $flags) = split /[\t ]+/, $_;
}
$mnemonic =~ s/[\t ]+/ /g;
$operands =~ s/[\t ]+/ /g;
$code_string =~ s/[\t ]+/ /g;
$flags =~ s/[\t ]+//g;
# we don't want non-x86-64 instructions here.
unless ($flags =~ "NOLONG")
{
# ok, the content of each field is now filtered,
# let's convert them to a suitable Common Lisp format.
$clos_object_name = $mnemonic . "-" . $operands;
# in Common Lisp object names `|`, `,`, and `:` must be escaped with a backslash `\`,
# but that would get too complicated.
# so we'll simply replace them:
# `|` -> `-`.
# `,` -> `.`.
# `:` -> `.`.
$clos_object_name =~ s/\|/-/g;
$clos_object_name =~ s/,/./g;
$clos_object_name =~ s/:/./g;
$clos_mnemonic = "\"" . $mnemonic . "\"";
$clos_operands = "\"" . $operands . "\"";
$clos_code_string = "\"" . $code_string . "\"";
$clos_flags = "\"" . $flags . "\""; # add first and last double quotes.
$clos_flags =~ s/,/" "/g; # make each flag its own Common Lisp string.
$clos_flags = "(list " . $clos_flags. ")"; # convert to `list` syntax.
push @object_name_array, $clos_object_name;
push @mnemonic_array, $clos_mnemonic;
push @operands_array, $clos_operands;
push @code_string_array, $clos_code_string;
push @flags_array, $clos_flags;
if ($mnemonic eq $mnemonic_of_current_mnemonic_array)
{
# ok, same mnemonic as the previous one,
# so the current object name goes to the list.
push @instruction_variants_for_current_instruction_array, $clos_object_name;
}
else
{
# ok, this is a new mnemonic.
# so we'll mark this as current mnemonic.
$mnemonic_of_current_mnemonic_array = $mnemonic;
push @each_mnemonic_only_once_array, $mnemonic;
# we first push the old array (unless it's empty), then clear it,
# and then push the current object name to the cleared array.
if (@instruction_variants_for_current_instruction_array)
{
# push the variants array, unless it's empty.
push @instruction_variants_array, [ @instruction_variants_for_current_instruction_array ];
}
@instruction_variants_for_current_instruction_array = ();
push @instruction_variants_for_current_instruction_array, $clos_object_name;
}
}
}
}
# the last instruction's instruction variants must be pushed too.
if (@instruction_variants_for_current_instruction_array)
{
# push the variants array, unless it's empty.
push @instruction_variants_array, [ @instruction_variants_for_current_instruction_array ];
}
close(FILE);
# these objects need be created already during compilation.
printf("(eval-when (:compile-toplevel :load-toplevel :execute)\n");
# print the code to create each instruction + operands combination object.
for (my $i=0; $i <= $#mnemonic_array; $i++)
{
$clos_object_name = $object_name_array[$i];
$mnemonic = $mnemonic_array[$i];
$operands = $operands_array[$i];
$code_string = $code_string_array[$i];
$flags = $flags_array[$i];
# print the code to create a variant object.
# each object here is a variant of a single instruction (or a single mnemonic).
# actually printed as 6 lines to make it easier to read (for us humans, I mean), with an empty line in the end.
printf("(setf %s (make-instance 'x86-asm-instruction\n:name %s\n:operands %s\n:code-string %s\n:arch-flags %s\n:is-variant t))",
$clos_object_name,
$mnemonic,
$operands,
$code_string,
$flags);
printf("\n\n");
}
# print the code to create each instruction + operands combination object.
# for (my $i=0; $i <= $#each_mnemonic_only_once_array; $i++)
for my $i (0 .. $#instruction_variants_array)
{
$mnemonic = $each_mnemonic_only_once_array[$i];
# print the code to create a container object.
printf("(setf %s (make-instance 'x86-asm-instruction :name \"%s\" :is-container t :variants (list \n", $mnemonic, $mnemonic);
@instruction_variants_for_current_instruction_array = @{ $instruction_variants_array[$i] };
# for (my $j=0; $j <= $#instruction_variants_for_current_instruction_array; $j++)
for my $j (0 .. $#{$instruction_variants_array[$i]} )
{
printf("%s", $instruction_variants_array[$i][$j]);
# print 3 closing brackets if this is the last variant.
if ($j == $#{$instruction_variants_array[$i]})
{
printf(")))");
}
else
{
printf(" ");
}
}
# if this is not the last instruction, print two newlines.
if ($i < $#instruction_variants_array)
{
printf("\n\n");
}
}
# print the closing bracket to close `eval-when`.
print(")");
exit;
18636 warnings looks really bad. Start by getting rid of all the warnings.
I would start by getting rid of the EVAL-WHEN around all that. Does not make much sense to me. Either load the file directly, or compile and load the file.
Also note that SBCL does not like (setf STOSB-void ...) when the variable is undefined. New top-level variables are introduced with DEFVAR or DEFPARAMETER. SETF just sets them, but does not define them. That should help to get rid of the warnings.
Also :is-container t and :is-variant t smell like these properties should be converted into classes to inherit from (for example as a mixin). A container has variants. A variant does not have variants.

Emacs 23.3.1: whitespace style

I've just upgraded to Kubuntu 11.10. After that, the way Emacs represents whitespace in whitespace minor mode changed. It used to show shaded rectangles, and now Emacs puts dots in place of whitespace:
I tried to change it through the M-x customize-group and then whitespace -- but there's no such thing as a dot. It says that whitespaces are represented by shading (see the pic above) - but they are not (see the same pic).
Here's the value of Whitespace Space face:
I also asked this question at superuser but since I got 0 replies there -- I decided to consult another community.
Edit 1:
Following Luke's solution gives no coloring to spaces or tabs (unless I've done something wrong):
Edit 2:
Adding face here fixes Luke's solution. Thanks to Sergey.
(setq whitespace-style (quote
( face spaces tabs newline space-mark tab-mark newline-mark)))
Edit 3:
Currently I'm using:
(custom-set-variables
'(whitespace-line-column 9999999)
'(whitespace-tab-width 4 t)
'(whitespace-display-mappings '(
(space-mark ?\ [?\u00B7] [?.]) ; space - centered dot
(space-mark ?\xA0 [?\u00A4] [?_]) ; hard space - currency
(newline-mark ?\n [?$ ?\n]) ; eol - dollar sign
(tab-mark ?\t [?\u00BB ?\t] [?\\ ?\t]) ; tab - right-pointing double angle quote
))
'(whitespace-style '(face spaces tabs newline space-mark tab-mark newline-mark))
)
(custom-set-faces
'(default ((t (:inherit nil :stipple nil :background "#ffffb1" :foreground "#141312" :inverse-video nil :box nil :strike-through nil :overline nil :underline nil :slant normal :weight normal :height 125 :width normal :foundry "monotype" :family "DejaVu Sans Mono"))))
'(whitespace-trailing ((t (:background "grey99"))))
)
on Emacs 24.3.50.1.
There's probably a better way to do this, but adding this to your .emacs should work:
(setq whitespace-display-mappings
'(
(space-mark ?\ [? ]) ;; use space not dot
(space-mark ?\xA0 [?\u00A4] [?_])
(space-mark ?\x8A0 [?\x8A4] [?_])
(space-mark ?\x920 [?\x924] [?_])
(space-mark ?\xE20 [?\xE24] [?_])
(space-mark ?\xF20 [?\xF24] [?_])
(newline-mark ?\n [?$ ?\n])
(tab-mark ?\t [?\u00BB ?\t] [?\\ ?\t])))
(custom-set-faces
'(whitespace-space
((((class color) (background dark)) (:background "red" :foreground "white"))
(((class color) (background light)) (:background "yellow" :foreground "black"))
(t (:inverse-video t)))))
The standard value of whitespace-display-mappings uses a 'middle dot' for a space, the code above uses a standard space. You can change the colours for whitespace-space as required.
All you need to do is add the 'face' keyword alongside the others in whitespace-style.
E.g.:
(setq whitespace-style (quote
( face spaces tabs newline space-mark tab-mark newline-mark)))
After following Luke Girvin's advice and starting Emacs with the -q flag, Luke's solution worked. I found that the problem was
;; make whitespace-mode use just basic coloring
(setq whitespace-style (quote
( spaces tabs newline space-mark tab-mark newline-mark)))
these lines in .emacs. So I removed them, and then used customize-group -> whitespace to make things this way:
So the problem is solved. Thanks Luke!

Extracting URLs from an Emacs buffer?

How can I write an Emacs Lisp function to find all hrefs in an HTML file and extract all of the links?
Input:
<html>
<a href="http://www.stackoverflow.com" _target="_blank">StackOverFlow</a>
<h1>Emacs Lisp</h1>
<a href="http://news.ycombinator.com" _target="_blank">Hacker News</a>
</html>
Output:
http://www.stackoverflow.com|StackOverFlow
http://news.ycombinator.com|Hacker News
I've seen the re-search-forward function mentioned several times during my search. Here's what I think that I need to do based on what I've read so far.
(defun extra-urls (file)
...
(setq buffer (...
(while
(re-search-forward "http://" nil t)
(when (match-string 0)
...
))
I took Heinzi's solution and came up with the final solution that I needed. I can now take a list of files, extract all URLs and titles, and place the results in one output buffer.
(defun extract-urls (fname)
"Extract HTML href url's,titles to buffer 'new-urls.csv' in | separated format."
(setq in-buf (set-buffer (find-file fname))); Save for clean up
(beginning-of-buffer); Need to do this in case the buffer is already open
(setq u1 '())
(while
(re-search-forward "^.*<a href=\"\\([^\"]+\\)\"[^>]+>\\([^<]+\\)</a>" nil t)
(when (match-string 0) ; Got a match
(setq url (match-string 1) ) ; URL
(setq title (match-string 2) ) ; Title
(setq u1 (cons (concat url "|" title "\n") u1)) ; Build the list of URLs
)
)
(kill-buffer in-buf) ; Don't leave a mess of buffers
(progn
(with-current-buffer (get-buffer-create "new-urls.csv"); Send results to new buffer
(mapcar 'insert u1))
(switch-to-buffer "new-urls.csv"); Finally, show the new buffer
)
)
;; Create a list of files to process
;;
(mapcar 'extract-urls '(
"/tmp/foo.html"
"/tmp/bar.html"
))
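To experiment with the regexp outside Emacs, the same capture-group idea can be sketched in JavaScript. This is only an illustration: extractLinks is a made-up name, the pattern uses [^>]* where the Elisp one uses [^>]+, and like any regex approach it is not a real HTML parser:

```javascript
// Hypothetical helper: collect "url|title" lines from anchor tags.
// Group 1 captures the href value, group 2 the link text.
function extractLinks(html) {
  const pattern = /<a href="([^"]+)"[^>]*>([^<]+)<\/a>/g;
  const lines = [];
  for (const match of html.matchAll(pattern)) {
    lines.push(match[1] + "|" + match[2]);
  }
  return lines;
}

const sample = [
  "<html>",
  '<a href="http://www.stackoverflow.com" _target="_blank">StackOverFlow</a>',
  "<h1>Emacs Lisp</h1>",
  '<a href="http://news.ycombinator.com" _target="_blank">Hacker News</a>',
  "</html>",
].join("\n");

console.log(extractLinks(sample).join("\n"));
// http://www.stackoverflow.com|StackOverFlow
// http://news.ycombinator.com|Hacker News
```

Because the pattern is not anchored to line starts, it also picks up several links on one line.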
If there is at most one link per line and you don't mind some very ugly regular expression hacking, run the following code on your buffer:
(defun getlinks ()
(beginning-of-buffer)
(replace-regexp "^.*<a href=\"\\([^\"]+\\)\"[^>]+>\\([^<]+\\)</a>.*$" "LINK:\\1|\\2")
(beginning-of-buffer)
(replace-regexp "^\\([^L]\\|\\(L[^I]\\)\\|\\(LI[^N]\\)\\|\\(LIN[^K]\\)\\).*$" "")
(beginning-of-buffer)
(replace-regexp "
+" "
")
(beginning-of-buffer)
(replace-regexp "^LINK:\\(.*\\)$" "\\1")
)
It replaces all links with LINK:url|description, deletes all lines containing anything else, deletes empty lines, and finally removes the "LINK:".
Detailed HOWTO: (1) Correct the bug in your example html file by replacing <href with <a href, (2) copy the above function into Emacs scratch, (3) hit C-x C-e after the final ")" to load the function, (4) load your example HTML file, (5) execute the function with M-: (getlinks).
Note that the linebreaks in the third replace-regexp are important. Don't indent those two lines.
You can use the 'xml library, examples of using the parser are found here. To parse your particular file, the following does what you want:
(defun my-grab-html (file)
(interactive "fHtml file: ")
(let ((res (car (xml-parse-file file)))) ; 'car because xml-parse-file returns a list of nodes
(mapc (lambda (n)
(when (consp n) ; don't operate on the whitespace, xml preserves whitespace
(let ((link (cdr (assq 'href (xml-node-attributes n)))))
(when link
(insert link)
(insert "|")
(insert (car (xml-node-children n))) ;# grab the text for the link
(insert "\n")))))
(xml-node-children res))))
This does not recursively parse the HTML to find all the links, but it should get you started in the direction of the general solution.
