Get current position in stream from net/html tokenizer - go

I'm trying to figure out if there's a way to get the current character position of a tag using the golang.org/x/net/html tokenizer library?
Simplified code looks like:
func LookForForm(body string) {
reader := strings.NewReader(body)
tokenizer := html.NewTokenizer(reader)
idx := 0
lastIdx := 0
for {
token := tokenizer.Next()
lastIdx = idx
idx = int(reader.Size()) - int(reader.Len())
switch token {
case html.ErrorToken:
return
case html.StartTagToken:
t := tokenizer.Token()
tagName := strings.ToLower(t.Data)
if tagName == "form" {
fmt.Printf("found at form at %d\n", lastIdx)
return
}
}
}
}
This doesn't work (I think) because reader is not reading character-by-character but by chunks so my calculation of Size - Len is invalid. tokenizer maintains two private span structs ( https://github.com/golang/net/blob/master/html/token.go line 147) but I am unaware of how to access them.
One possible solution that just occurred to me is to make a "reader" that only reads a single character at a time so my Size and Len calculations are always correct. But, that seems like a hack and any suggestions would be appreciated.

You might be able to approximate what you are trying to do with careful arithmetic using the Tokenizer's Buffered method, which returns the slice of bytes currently in the buffer that have not yet been tokenized. But I don't think you will get what you wanted, as <div><form></form></div> would probably be buffered in full before giving you the first div token. In that case the amount the reader has consumed is, on its own, not helpful in calculating the position.
Tokenizing a markup language with nested structure will almost always need to buffer the input to work. The private span fields are of little use here, as they only index into the tokenizer's internal buffer and are not absolute positions in the reader.
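The buffering effect described above can be reproduced with the standard library alone. The sketch below is an illustration under assumptions, not the tokenizer itself: bufio.Reader stands in for the tokenizer's internal buffer, and the helper name consumed is made up for the example. It reads a single byte and then reports how much the underlying strings.Reader has actually handed over, once directly and once through testing/iotest.OneByteReader.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
	"testing/iotest"
)

// consumed reads one byte through a bufio.Reader layered on a strings.Reader
// and returns how many bytes were pulled from the strings.Reader (Size - Len).
// With oneByte set, the source is wrapped so each Read yields a single byte.
// It returns -1 if nothing could be read.
func consumed(body string, oneByte bool) int64 {
	src := strings.NewReader(body)
	var r io.Reader = src
	if oneByte {
		r = iotest.OneByteReader(src)
	}
	br := bufio.NewReader(r)
	if _, err := br.ReadByte(); err != nil {
		return -1
	}
	return src.Size() - int64(src.Len())
}

func main() {
	body := "<div><form></form></div>"
	fmt.Println("chunked read consumed: ", consumed(body, false))
	fmt.Println("one-byte read consumed:", consumed(body, true))
}
```

For a short input the chunked case drains the whole string on the first read, which is why Size - Len says nothing about where the current token is.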
Since the html Tokenizer does not provide an API for the position of a tag in the original data, to get what you want I would probably just do a strings.Index or bytes.Index with the raw bytes of the token to find the position:
strings.Index(body, string(tokenizer.Raw()))
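One caveat with a plain strings.Index is that it always reports the first occurrence, so a tag whose raw text also appears earlier in the document would be located at the earlier spot. A possible refinement, sketched here with the standard library only, is to search from a running offset. The rawTokens slice is a hypothetical stand-in for successive tokenizer.Raw() results, and indexFrom is a helper invented for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// indexFrom returns the byte offset of the first occurrence of raw in body
// at or after from, or -1 if there is none or from is out of range.
func indexFrom(body, raw string, from int) int {
	if from < 0 || from > len(body) {
		return -1
	}
	i := strings.Index(body[from:], raw)
	if i < 0 {
		return -1
	}
	return from + i
}

func main() {
	body := "<p>x</p><form></form><p>x</p>"
	// Hypothetical stand-in for what tokenizer.Raw() yields token by token.
	rawTokens := []string{"<p>", "x", "</p>", "<form>", "</form>", "<p>", "x", "</p>"}

	offset := 0
	for _, raw := range rawTokens {
		pos := indexFrom(body, raw, offset)
		if pos < 0 {
			break
		}
		fmt.Printf("%q starts at %d\n", raw, pos)
		offset = pos + len(raw) // the next token cannot start before this
	}
}
```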

A non-buffering reader ended up working ok for me. The implementation of the reader looks something like:
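package rule

import "io"

// Reader is a string reader that hands out one byte per Read call,
// so Pos always reflects exactly how much the consumer has pulled.
type Reader struct {
	s        string
	i        int64 // current reading index
	z        int64 // total size of s
	prevRune int64 // index of the previously read rune or -1
}

// NewReader returns a Reader over s.
func NewReader(s string) *Reader {
	return &Reader{s: s, z: int64(len(s)), prevRune: -1}
}

func (r *Reader) String() string {
	return r.s
}

// Len returns the number of unread bytes.
func (r *Reader) Len() int {
	if r.i >= r.z {
		return 0
	}
	return int(r.z - r.i)
}

// Size returns the total length of the underlying string.
func (r *Reader) Size() int64 {
	return r.z
}

// Pos returns the number of bytes read so far.
func (r *Reader) Pos() int64 {
	return r.i
}

// Read copies at most one byte into b, regardless of len(b).
func (r *Reader) Read(b []byte) (int, error) {
	if len(b) == 0 {
		return 0, nil
	}
	if r.i >= r.z {
		return 0, io.EOF
	}
	r.prevRune = -1
	b[0] = r.s[r.i]
	r.i++
	return 1, nil
}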
Then the position is fairly easy to calculate in the tokenizer loop:
reader := NewReader(body)
tokenizer := html.NewTokenizer(reader)
lastIdx := 0 // offset at which the current token starts
tokenLoop:
for {
	token := tokenizer.Next()
	switch token {
	case html.ErrorToken:
		break tokenLoop
	case html.StartTagToken:
		t := tokenizer.Token()
		tagName := strings.ToLower(t.Data)
		if tagName == "form" {
			fmt.Printf("found form at %d\n", lastIdx)
			return
		}
	}
	// The tokenizer may read slightly past the token it returns (for example
	// to tell text from the start of a tag), so subtract what is still
	// buffered. The result is the end of this token, where the next one starts.
	lastIdx = int(reader.Pos()) - len(tokenizer.Buffered())
}

Related

Map seems to drop values in recursion

I've been working on a problem and I figured I would demonstrate it using a pokemon setup. I am reading from a file, parsing the file and creating objects/structs from them. This normally isn't a problem except now I need to implement interface like inheriting of traits. I don't want there to be duplicate skills in there so I figured I could use a map to replicate a set data structure. However it seems that in the transitive phase of my recursive parsePokemonFile function (see the implementsComponent case), I appear to be losing values in my map.
I am using the inputs like such:
4 files
Ratatta:
name=Ratatta
skills=Tackle:normal,Scratch:normal
Bulbosaur:
name=Bulbosaur
implements=Ratatta
skills=VineWhip:leaf
Oddish:
name=Oddish
implements=Ratatta
skills=Acid:poison
Venosaur:
name=Venosaur
implements=bulbosaur,oddish
I'm expecting the output for the following code to be something like
Begin!
{Venosaur [{VineWhip leaf} {Acid poison} {Tackle normal} {Scratch normal}]}
but instead I get
Begin!
{Venosaur [{VineWhip leaf} {Acid poison}]}
What am I doing wrong? Could it be a logic error? Or am I making an assumption about the map holding values that I shouldn't?
package main
import (
"bufio"
"fmt"
"os"
"strings"
)
// In order to create a set of pokemon abilities and for ease of creation and lack of space being taken up
// We create an interfacer capability that imports the skills and attacks from pokemon of their previous evolution
// This reduces the amount of typing of skills we have to do.
// Algorithm is simple. Look for the name "implements=x" and then add x into set.
// Unfortunately it appears that the set is dropping values on transitive implements interfaces
func main() {
fmt.Println("Begin!")
dex, err := parsePokemonFile("Venosaur")
if err != nil {
fmt.Printf("Got error: %v\n", err)
}
fmt.Printf("%v\n", dex)
}
type pokemon struct {
Name string
Skills []skill
}
type skill struct {
SkillName string
Type string
}
func parsePokemonFile(filename string) (pokemon, error) {
file, err := os.Open(filename)
if err != nil {
return pokemon{}, err
}
defer file.Close()
scanner := bufio.NewScanner(file)
var builtPokemon pokemon
for scanner.Scan() {
component, returned := parseLine(scanner.Text())
switch component {
case nameComponent:
builtPokemon.Name = returned
case skillsComponent:
skillsStrings := strings.Split(returned, ",")
var skillsArr []skill
// split skills and add them into pokemon skillset
for _, skillStr := range skillsStrings {
skillPair := strings.Split(skillStr, ":")
skillsArr = append(skillsArr, skill{SkillName: skillPair[0], Type: skillPair[1]})
}
builtPokemon.Skills = append(builtPokemon.Skills, skillsArr...)
case implementsComponent:
implementsArr := strings.Split(returned, ",")
// create set to remove duplicates
skillsSet := make(map[*skill]bool)
for _, val := range implementsArr {
// recursively call the pokemon files and get full pokemon
implementedPokemon, err := parsePokemonFile(val)
if err != nil {
return pokemon{}, err
}
// sieve out the skills into a set
for _, skill := range implementedPokemon.Skills {
skillsSet[&skill] = true
}
}
// append final set into the currently being built pokemon
for x := range skillsSet {
builtPokemon.Skills = append(builtPokemon.Skills, *x)
}
}
}
return builtPokemon, nil
}
type component int
// components to denote where to put our strings when it comes time to assemble what we've parsed
const (
nameComponent component = iota
implementsComponent
skillsComponent
)
func parseLine(line string) (component, string) {
arr := strings.Split(line, "=")
switch arr[0] {
case "name":
return nameComponent, arr[1]
case "implements":
return implementsComponent, arr[1]
case "skills":
return skillsComponent, arr[1]
default:
panic("Invalid field found")
}
}
This has nothing to do with Golang maps dropping any values.
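The problem is that you are using a map of skill pointers and not skills, so the set compares addresses rather than skill contents. In the loop, &skill is the address of the range variable. Before Go 1.22 that is a single variable reused on every iteration, so each loop adds just one key and only the last skill copied into it survives, which is exactly the output you see. From Go 1.22 on, every iteration gets a fresh variable, so no two keys are ever equal and nothing is deduplicated. Either way, two pointers to the same skill content can be different.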
skillsSet := make(map[*skill]bool)
If you change this to map[skill]bool, this should work. You may try it out!
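As a small self-contained illustration of that difference (a sketch, with the helper names keyCounts and dedupe invented for the example), the program below builds a pointer-keyed and a value-keyed set from two skills with equal contents, and then deduplicates a slice with a value-keyed map while keeping first-seen order. The skill type can be a map key because all of its fields are comparable.

```go
package main

import "fmt"

type skill struct {
	SkillName string
	Type      string
}

// keyCounts stores two skills with identical contents in a pointer-keyed
// set and in a value-keyed set and returns the size of each.
func keyCounts() (ptrKeys, valKeys int) {
	a := skill{SkillName: "Tackle", Type: "normal"}
	b := skill{SkillName: "Tackle", Type: "normal"}
	ptrSet := map[*skill]bool{&a: true, &b: true} // keys are addresses
	valSet := map[skill]bool{a: true, b: true}    // keys are contents
	return len(ptrSet), len(valSet)
}

// dedupe returns skills with duplicates removed, keeping first-seen order.
func dedupe(skills []skill) []skill {
	seen := make(map[skill]bool)
	var out []skill
	for _, s := range skills {
		if !seen[s] {
			seen[s] = true
			out = append(out, s)
		}
	}
	return out
}

func main() {
	p, v := keyCounts()
	fmt.Println("pointer keys:", p, "value keys:", v)
	fmt.Println(dedupe([]skill{
		{SkillName: "Tackle", Type: "normal"},
		{SkillName: "Scratch", Type: "normal"},
		{SkillName: "Tackle", Type: "normal"},
	}))
}
```

Using a slice plus a seen map, rather than ranging over the map at the end, also keeps the output order stable, since Go map iteration order is unspecified.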

Using default value in golang func

I'm trying to implement a default value according to the option 1 of the post Golang and default values. But when I try to do go install the following error pops up in the terminal:
not enough arguments in call to test.Concat1
have ()
want (string)
Code:
package test
func Concat1(a string) string {
if a == "" {
a = "default-a"
}
return fmt.Sprintf("%s", a)
}
// other package
package main
func main() {
test.Concat1()
}
Thanks in advance.
I don't think what you are trying to do will work that way. You may want to opt for option #4 from the page you cited, which uses variadic parameters. In your case it looks to me like you want just a string, so it'd be something like this:
func Concat1(a ...string) string {
	if len(a) == 0 {
		return "default-a"
	}
	return a[0]
}
Go does not have optional defaults for function arguments.
You may emulate them to some extent by having a special type to contain the set of parameters for a function.
In your toy example that would be something like
type Concat1Args struct {
	A string // exported so that callers in other packages can set it
}
func Concat1(args Concat1Args) string {
	if args.A == "" {
		args.A = "default-a"
	}
	return args.A
}
The "trick" here is that in Go each type has its respective "zero value", and when producing a value of a composite type using a so-called composite literal, it's possible to initialize only some of the type's fields, so in our example that would be
s := Concat1(Concat1Args{})
vs
s := Concat1(Concat1Args{A: "whatever"})
I know that looks clumsy, and I have shown this mostly for demonstration purposes. In real production code, where a function might have a dozen parameters or more, having them packed in a dedicated composite type is usually the only sensible way to go, but for a case like yours it's better to just explicitly pass "" to the function.
Golang does not support default parameters. Accordingly, variadic arguments by themselves are not analogous. However, variadic functions with the use of error handling can 'resemble' the pattern. Try the following as a simple example:
package main
import (
"errors"
"log"
)
func createSeries(p ...int) ([]int, error) {
usage := "Usage: createSeries(<length>, <optional starting value>), length should be > 0"
if len(p) == 0 {
return nil, errors.New(usage)
}
n := p[0]
if n <= 0 {
return nil, errors.New(usage)
}
var base int
if len(p) == 2 {
base = p[1]
} else if len(p) > 2 {
return nil, errors.New(usage)
}
vals := make([]int, n)
for i := 0; i < n; i++ {
vals[i] = base + i
}
return vals, nil
}
func main() {
answer, err := createSeries(4, -9)
if err != nil {
log.Fatal(err)
}
log.Println(answer)
}
Go has no default parameters in the way other languages do. A function can have one variadic parameter (the ellipsis), always at the end, which collects a slice of values of the same type, so in your case this would be:
func Concat1(a ...string) string {
but that means that the caller may pass in any number of arguments >= 0. You also need to check whether the slice is empty and assign the default yourself, because Go has no special syntax for default values. What you can do is
if len(a) == 0 {
	a = []string{"default value"}
}
If you want to make sure that the user passes either zero or one strings, just create two functions in your API, e.g.
func Concat(a string) string { // ...
func ConcatDefault() string {
	return Concat("default value")
}
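Both approaches from this answer can be put side by side in one runnable sketch. The body of Concat is a placeholder (the original elides it), and ConcatOpt and the defaultValue constant are names made up for this example.

```go
package main

import "fmt"

// defaultValue is a hypothetical default, used when the caller passes nothing.
const defaultValue = "default value"

// Concat is the one-argument form; its body here is only a placeholder.
func Concat(a string) string {
	return "got: " + a
}

// ConcatDefault is the zero-argument companion that supplies the default.
func ConcatDefault() string {
	return Concat(defaultValue)
}

// ConcatOpt is the variadic alternative: zero arguments selects the default,
// and any arguments after the first are ignored.
func ConcatOpt(a ...string) string {
	if len(a) == 0 {
		return Concat(defaultValue)
	}
	return Concat(a[0])
}

func main() {
	fmt.Println(ConcatDefault())
	fmt.Println(ConcatOpt())
	fmt.Println(ConcatOpt("custom"))
}
```

The two-function form lets the compiler reject calls with too many arguments, whereas the variadic form accepts them silently, which is the trade-off the answer points at.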

Filter values in range loop in golang

I have the following code, which runs OK. I'm looping over mStr and printing values to a file:
func setFile(file io.Writer, mStr []*mod.M, mdl []string) {
	for i, mod := range mStr {
		fmt.Fprint(file, "app")
		fmt.Fprint(file, "app1")
		…
	}
}
Now I need to filter inside the range loop, e.g. print to the file only if mod.Name == "app":
func setFile(file io.Writer, mStr []*mod.M, mdl []string) {
	for i, mod := range mStr {
		if mod.Name == mdl[i] {
			fmt.Fprint(file, "app")
			fmt.Fprint(file, "app1")
			…
		}
	}
}
While this could work, it introduces a few if/else forks in the code to support the following:
If mdl is empty (it may hold no values), loop over all the mStr values and print them all.
If mdl contains values, print only when mod.Name == mdl[i]; otherwise don't.
Is there a cleaner way to do this kind of filtering in a loop in Go?
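Check the length of the slice you pass to the function; if the slice is empty (or nil), its length is zero. A single condition then covers both cases: print everything when mdl is empty, otherwise print only on a match. The i < len(mdl) check guards against mdl being shorter than mStr.
if len(mdl) == 0 || (i < len(mdl) && mod.Name == mdl[i]) {
	fmt.Fprint(file, "app")
	fmt.Fprint(file, "app1")
	// code
}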
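One way to keep the loop body flat is to move the decision into a small predicate and skip non-matching items with continue. This is only a sketch under assumptions: M is a hypothetical stand-in for mod.M, wanted is a helper invented here, and the positional comparison with mdl[i] is taken from the question as written.

```go
package main

import (
	"fmt"
	"io"
	"os"
)

// M is a hypothetical stand-in for mod.M from the question.
type M struct {
	Name string
}

// wanted reports whether the i-th item passes the filter: an empty mdl
// matches everything, otherwise the name must equal mdl[i].
func wanted(name string, i int, mdl []string) bool {
	if len(mdl) == 0 {
		return true
	}
	return i < len(mdl) && name == mdl[i]
}

func setFile(w io.Writer, mStr []*M, mdl []string) {
	for i, m := range mStr {
		if !wanted(m.Name, i, mdl) {
			continue
		}
		fmt.Fprintln(w, m.Name)
	}
}

func main() {
	mods := []*M{{Name: "app"}, {Name: "db"}}
	setFile(os.Stdout, mods, nil)                    // no filter: both names
	setFile(os.Stdout, mods, []string{"app", "web"}) // positional filter
}
```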

strings - get characters before a digit

I have some strings such as E2 9NZ, N29DZ, EW29DZ. I need to extract the characters before the first digit; for the examples above: E, N, EW.
Am I supposed to use a regex? The strings package looks really nice but just doesn't seem to handle this case (extract everything before a specific type of character).
Edit:
To clarify the "question": I'm wondering which method is more idiomatic Go and perhaps likely to perform better.
For example,
package main
import (
"fmt"
"unicode"
)
func DigitPrefix(s string) string {
for i, r := range s {
if unicode.IsDigit(r) {
return s[:i]
}
}
return s
}
func main() {
fmt.Println(DigitPrefix("E2 9NZ"))
fmt.Println(DigitPrefix("N29DZ"))
fmt.Println(DigitPrefix("EW29DZ"))
fmt.Println(DigitPrefix("WXYZ"))
}
Output:
E
N
EW
WXYZ
If there is no digit, example "WXYZ", and you don't want anything returned, change return s to return "".
Not sure why almost everyone provided answers in everything but Go. Here is a regex-based Go version:
package main
import (
"fmt"
"regexp"
)
func main() {
pattern, err := regexp.Compile("^[^\\d]*")
if err != nil {
panic(err)
}
part := pattern.Find([]byte("EW29DZ"))
if part != nil {
fmt.Printf("Found: %s\n", string(part))
} else {
fmt.Println("Not found")
}
}
Running:
% go run main.go
Found: EW
We don't need a regex for this problem. You can simply walk through a slice of runes and check each character with unicode.IsDigit(): if it's a digit, return everything before it; if it isn't, continue the loop. If there are no digits, return the argument unchanged.
Code
package main
import (
"fmt"
"unicode"
)
func UntilDigit(r []rune) []rune {
var i int
for _, v := range r {
if unicode.IsDigit(v) {
return r[0:i]
}
i++
}
return r
}
func main() {
fmt.Println(string(UntilDigit([]rune("E2 9NZ"))))
fmt.Println(string(UntilDigit([]rune("N29DZ"))))
fmt.Println(string(UntilDigit([]rune("EW29DZ"))))
}
I think the best option is to use the index returned from strings.IndexAny, which returns the index of the first occurrence in a string of any character from a given set.
func BeforeNumbers(str string) string {
value := strings.IndexAny(str,"0123456789")
if value >= 0 && value <= len(str) {
return str[:value]
}
return str
}
This slices the string and returns the part up to (but not including) the first character that appears in the set "0123456789", i.e. the first digit.
Way later edit:
It would probably be better to use IndexFunc rather than IndexAny:
func BeforeNumbers(str string) string {
indexFunc := func(r rune) bool {
return r >= '0' && r <= '9'
}
value := strings.IndexFunc(str,indexFunc)
if value >= 0 && value <= len(str) {
return str[:value]
}
return str
}
This is more or less equivalent to the loop version, and it avoids searching the set string "0123456789" for a match on every character, as my previous answer does. I also think it looks cleaner than the loop version, though that is obviously a matter of taste.
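A slightly shorter variant, offered as a sketch rather than a recommendation, passes unicode.IsDigit straight to strings.IndexFunc. The function name beforeDigit is made up for the example. Note one behavioural difference from the '0' to '9' comparison: unicode.IsDigit also reports true for decimal digits outside ASCII.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// beforeDigit returns the part of s that precedes its first digit,
// or all of s if it contains no digit.
func beforeDigit(s string) string {
	if i := strings.IndexFunc(s, unicode.IsDigit); i >= 0 {
		return s[:i]
	}
	return s
}

func main() {
	for _, s := range []string{"E2 9NZ", "N29DZ", "EW29DZ", "WXYZ"} {
		fmt.Printf("%q -> %q\n", s, beforeDigit(s))
	}
}
```

Because IndexFunc returns a byte index, slicing with s[:i] stays correct for multi-byte input as well.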
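The code below (Java rather than Go) keeps grabbing characters until it reaches a digit or the end of the string.
int i = 0;
String string2test = "EW29DZ";
String stringOutput = "";
// The length check stops charAt from throwing when the string has no digit.
while (i < string2test.length() && !Character.isDigit(string2test.charAt(i)))
{
    stringOutput = stringOutput + string2test.charAt(i);
    i++;
}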

In Go Language, how do I unmarshal json to array of object?

I have the following JSON, and I want to parse it into array of class:
{
"1001": {"level":10, "monster-id": 1001, "skill-level": 1, "aimer-id": 301}
"1002": {"level":12, "monster-id": 1002, "skill-level": 1, "aimer-id": 302}
"1003": {"level":16, "monster-id": 1003, "skill-level": 2, "aimer-id": 303}
}
Here is what I am trying to do, but it fails:
type Monster struct {
MonsterId int32
Level int32
SkillLevel int32
AimerId int32
}
type MonsterCollection struct {
Pool map[string]Monster
}
func (mc *MonsterCollection) FromJson(jsonStr string) {
var data interface{}
b := []byte(jsonStr)
err := json.Unmarshal(b, &data)
if err != nil {
return
}
m := data.(map[string]interface{})
i := 0
for k, v := range m {
monster := new(Monster)
monster.Level = v["level"]
monster.MonsterId = v["monster-id"]
monster.SkillLevel = v["skill-level"]
monster.AimerId = v["aimer-id"]
mc.Pool[i] = monster
i++
}
}
The compiler complains about v["level"]:
invalid operation: index of type interface{}.
This code has many errors in it. To start with, the JSON isn't valid JSON. You are missing the commas between the key/value pairs in your top-level object. I added the commas and pretty-printed it for you:
{
"1001":{
"level":10,
"monster-id":1001,
"skill-level":1,
"aimer-id":301
},
"1002":{
"level":12,
"monster-id":1002,
"skill-level":1,
"aimer-id":302
},
"1003":{
"level":16,
"monster-id":1003,
"skill-level":2,
"aimer-id":303
}
}
Your next problem (the one you asked about) is that m := data.(map[string]interface{}) makes m a map[string]interface{}. That means when you index it such as the v in your range loop, the type is interface{}. You need to type assert it again with v.(map[string]interface{}) and then type assert each time you read from the map.
I also notice that you next attempt mc.Pool[i] = monster when i is an int and mc.Pool is a map[string]Monster. An int is not a valid key for that map.
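For comparison, this is roughly what the type-assertion route looks like when done by hand. It is a minimal sketch: levelOf is a helper invented for the example, and it relies on the documented behaviour that encoding/json decodes JSON numbers into float64 when the target is interface{}.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// levelOf decodes jsonStr generically and digs out the "level" field of the
// entry stored under id, asserting the dynamic type at every step.
func levelOf(jsonStr, id string) (int32, error) {
	var data interface{}
	if err := json.Unmarshal([]byte(jsonStr), &data); err != nil {
		return 0, err
	}
	m, ok := data.(map[string]interface{})
	if !ok {
		return 0, errors.New("top level is not a JSON object")
	}
	entry, ok := m[id].(map[string]interface{})
	if !ok {
		return 0, fmt.Errorf("no object stored under %q", id)
	}
	// With interface{} targets, JSON numbers arrive as float64.
	level, ok := entry["level"].(float64)
	if !ok {
		return 0, fmt.Errorf("%q has no numeric level", id)
	}
	return int32(level), nil
}

func main() {
	const doc = `{"1001": {"level": 10, "monster-id": 1001}}`
	fmt.Println(levelOf(doc, "1001"))
	fmt.Println(levelOf(doc, "9999"))
}
```

Every field needs its own assertion and error path, which is the bookkeeping that decoding into a typed value avoids.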
Your data looks very rigid, so I would make Unmarshal do most of the work for you. Instead of providing it a map[string]interface{}, you can provide it a map[string]Monster.
Here is a quick example. As well as changing how the unmarshalling works, I also added an error return. The error return is useful for finding bugs. That error return is what told me you had invalid json.
type Monster struct {
MonsterId int32 `json:"monster-id"`
Level int32 `json:"level"`
SkillLevel int32 `json:"skill-level"`
AimerId int32 `json:"aimer-id"`
}
type MonsterCollection struct {
Pool map[string]Monster
}
func (mc *MonsterCollection) FromJson(jsonStr string) error {
var data = &mc.Pool
b := []byte(jsonStr)
return json.Unmarshal(b, data)
}
I posted a working example to goplay: http://play.golang.org/p/4EaasS2VLL
Slightly off to one side: you asked for an array of objects when what you needed was a map.
If you need an array (actually a slice), see
http://ioblocks.blogspot.com/2014/09/loading-arrayslice-of-objects-from-json.html
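If the JSON has to stay keyed by ID but a slice is wanted in the program, one option is to decode into a map first and then flatten it. The sketch below assumes the Monster field tags used earlier in this thread; sortedMonsters is a helper invented for the example. Keys are sorted because Go map iteration order is unspecified, and the sort is lexicographic on the string keys, not numeric.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

type Monster struct {
	MonsterId  int32 `json:"monster-id"`
	Level      int32 `json:"level"`
	SkillLevel int32 `json:"skill-level"`
	AimerId    int32 `json:"aimer-id"`
}

// sortedMonsters flattens the pool into a slice ordered by map key.
func sortedMonsters(pool map[string]Monster) []Monster {
	keys := make([]string, 0, len(pool))
	for k := range pool {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	out := make([]Monster, 0, len(pool))
	for _, k := range keys {
		out = append(out, pool[k])
	}
	return out
}

func main() {
	const doc = `{
		"1002": {"level": 12, "monster-id": 1002, "skill-level": 1, "aimer-id": 302},
		"1001": {"level": 10, "monster-id": 1001, "skill-level": 1, "aimer-id": 301}
	}`
	var pool map[string]Monster
	if err := json.Unmarshal([]byte(doc), &pool); err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	for _, m := range sortedMonsters(pool) {
		fmt.Printf("%+v\n", m)
	}
}
```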
