What would be an efficient way (in terms of performance and readability) of parsing lines in a log file and extracting points of interest?
For example:
*** Time: 2/1/2019 13:51:00
17.965 Pump 10 hose FF price level 1 limit 0.0000 authorise pending (Type 00)
17.965 Pump 10 State change LOCKED_PSTATE to CALLING_PSTATE [31]
38.791 Pump 10 delivery complete, Hose 1, price 72.9500, level 1, value 100.0000, volume 1.3700, v-total 8650924.3700, m-total 21885705.8800, T13:51:38
The things I need to extract are 10 (for pump 10), the price level, the limit, the _PSTATE changes, the values from the delivery complete line, etc.
Currently I'm using a regular expression to capture each one and using capture groups. But it feels inefficient and there is quite a bit of duplication.
For example, I have a bunch of these:
reStateChange := regexp.MustCompile(`^(?P<offset>.*) Pump (?P<pump>\d{2}) State change (?P<oldstate>\w+_PSTATE) to (?P<newstate>\w+)_PSTATE`)
Then inside a while loop
if match := reStateChange.FindStringSubmatch(text); len(match) > 0 {
matched = true
for i, name := range match {
result[reStateChange.SubexpNames()[i]] = name
}
} else if match := otherReMatch.FindStringSubmatch(text); len(match) > 0 {
matched = true
for i, name := range match {
result[otherReMatch.SubexpNames()[i]] = name
}
} else if strings.Contains(text, "*** Time:") {
}
It feels that there could be a much better way to do this. I would trade some performance for readability. The log files are only really 10MB max. Often smaller.
I'm after some suggestions on how to make this better in golang.
If all your log lines are similar to that sample you posted, they seem quite structured so regular expressions might be a bit overkill and hard to generalize.
Another option would be for you to transform each of those lines to a slice of strings ([]string) by using strings.Fields, or even strings.FieldsFunc so that you can strip both white space and commas.
Then you can design an interface like:
type LogLineProcessor interface {
CanParse(line []string) bool
GetResultFrom(line []string) LogLineResult
}
Where LogLineResult is a struct containing the extracted information.
You can then define multiple structs with methods that implement LogLineProcessor (each implementation would look at specific positions on that []string to realize if it is a line it can process or not, like looking for the words "hose", "FF" and "price" in the positions it expects to find them).
The GetResultFrom implementations would also extract each data point from specific positions in the []string (it can rely on that information being there if it already determined it was one of the lines it can process).
You can create a var processors []LogLineProcessor, put all your processors in there and then just iterate over that slice:
line := strings.Fields(text)
for _, processor := range processors {
if processor.CanParse(line) {
result := processor.GetResultFrom(line)
// do whatever needed with the result
}
}
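As a rough illustration of the approach above, here is a minimal sketch of one processor for the "State change" line from the question. The type stateChangeProcessor, the helper splitLine and the fields of LogLineResult are hypothetical names chosen for this example, and CanParse is assumed to return a bool. The field positions are taken from the single sample line in the question, so they may need adjusting for real logs.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// LogLineResult holds the extracted data; these field names are illustrative only.
type LogLineResult struct {
	Pump     string
	OldState string
	NewState string
}

type LogLineProcessor interface {
	CanParse(line []string) bool
	GetResultFrom(line []string) LogLineResult
}

// stateChangeProcessor (hypothetical) handles lines such as:
// 17.965 Pump 10 State change LOCKED_PSTATE to CALLING_PSTATE [31]
type stateChangeProcessor struct{}

// CanParse checks for the fixed words at the positions where they are expected.
func (stateChangeProcessor) CanParse(line []string) bool {
	return len(line) >= 8 &&
		line[1] == "Pump" && line[3] == "State" &&
		line[4] == "change" && line[6] == "to"
}

// GetResultFrom relies on CanParse having accepted the line already.
func (stateChangeProcessor) GetResultFrom(line []string) LogLineResult {
	return LogLineResult{Pump: line[2], OldState: line[5], NewState: line[7]}
}

// splitLine splits on white space and commas.
func splitLine(text string) []string {
	return strings.FieldsFunc(text, func(r rune) bool {
		return r == ',' || unicode.IsSpace(r)
	})
}

func main() {
	processors := []LogLineProcessor{stateChangeProcessor{}}
	text := "17.965 Pump 10 State change LOCKED_PSTATE to CALLING_PSTATE [31]"
	line := splitLine(text)
	for _, p := range processors {
		if p.CanParse(line) {
			fmt.Printf("%+v\n", p.GetResultFrom(line))
			break // at most one processor is expected to match a line
		}
	}
}
```

Each further line type would get its own small struct, so the matching rules stay next to the extraction logic instead of being spread over a chain of else-if branches.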
I just started learning go. I have a question about pointers.
In the code below, the following line in the code doesn't do what I expect:
last_line.Next_line = &line // slice doesn't change
I want the slice to be changed as well, not only the local variable last_line.
What am I doing wrong?
type Line struct {
Text string
Prev_line *Line
Next_line *Line
}
var (
lines []Line
last_line *Line
)
for i, record := range records {
var prev_line *Line = nil
text := record[0]
if i > 0 {
prev_line = &lines[i-1]
}
line := Line{
Text: text,
Prev_line: prev_line,
Next_line: nil}
if last_line != nil {
last_line.Next_line = &line // slice doesn't change
}
lines = append(lines, line)
last_line = &line
}
Your Line type is a fairly standard-looking doubly linked list. Your lines variable holds a slice of these objects. Combining these two is a bit unusual—not wrong, to be sure, just unusual. And, as Matt Oestreich notes in a comment, we don't know quite what is in records (just that range can be used on it and that after doing so, we can use record[0] to get to a single string value), so there might be better ways to deal with things.
If records itself is a slice or has a sensible len, we can allocate a slice of Line instances all at once, of the appropriate size:
lines = make([]Line, len(records))
Here is a sample on the Go Playground that does it this way.
If we can't really get a suitable len—e.g., if records is a channel whose length is not really relevant—then we might indeed want to allocate individual lines, but in this case, it may be more sensible to avoid keeping them as a slice in the first place. The doubly linked list alone will suffice.
Finally, if you really do want both a slice and this doubly linked list, note that using append may copy the slice's elements to a new, larger slice. If and when it does so, the pointers in any elements you set up earlier will point into the old, smaller slice. This is not invalid in terms of the language itself—those objects still exist and your pointers are keeping them "alive"—but it may not be what you intended at all. In this case, it makes more sense to set all the pointers at the end, after building up the lines slice, just as in the sample code I provided.
(The sample I wrote is deliberately slightly weird in a way that is likely to get your homework or test grade knocked down a bit, if this was an attempt to cheat on homework or a test. :-) )
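Separately from the Playground sample mentioned above, the allocate-once-then-link idea can be sketched as follows. This is only a sketch under the assumption that records is a [][]string (for instance the result of reading a CSV file) and that every record has at least one field; buildLines is a name invented for this example.

```go
package main

import "fmt"

type Line struct {
	Text      string
	Prev_line *Line
	Next_line *Line
}

// buildLines allocates the whole slice up front and then links neighbours
// by taking the addresses of slice elements. The pointers stay valid
// because the slice is never appended to afterwards.
func buildLines(records [][]string) []Line {
	lines := make([]Line, len(records))
	for i, record := range records {
		lines[i].Text = record[0]
		if i > 0 {
			lines[i].Prev_line = &lines[i-1]
			lines[i-1].Next_line = &lines[i]
		}
	}
	return lines
}

func main() {
	lines := buildLines([][]string{{"a"}, {"b"}, {"c"}})
	// Walk the list through the pointers rather than the slice index.
	for l := &lines[0]; l != nil; l = l.Next_line {
		fmt.Println(l.Text)
	}
}
```

Because the pointers refer to elements of the slice itself, a change made through last_line-style pointers is visible through lines[i] as well, which is the behaviour the question was after.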
I have several strings that include various symbols like the following two examples:
z=y+x
#symbol
and I want to split the strings such that I have the resulting slices:
[z = y + x]
[# symbol]
A few things I've looked at and tried:
I've looked at this question but it seems as though golang doesn't support lookarounds.
I know this solution exists using strings.SplitAfter, but I'm looking to have the delimiters as separate elements.
I tried replacing the symbol (e.g. "+") with some variant (e.g. "~+~") and doing a split on the surrounding characters (e.g. "~"), but this solution is far from elegant and runs into problems if I need to do a conditional replacement depending on the symbol (which golang doesn't seem to support either).
Perhaps I've misunderstood some of the previous question and their respective solutions.
I used a modified version of Go's strings.Split implementation https://golang.org/src/strings/strings.go?s=7505:7539#L245
func Test(te *testing.T) {
t := tester.New(te)
t.Assert().Equal(splitCharsInclusive("z=y+x", "=+"), []string{"z", "=", "y", "+", "x"})
t.Assert().Equal(splitCharsInclusive("#symbol", "#"), []string{"", "#", "symbol"})
}
func splitCharsInclusive(s, chars string) (out []string) {
for {
m := strings.IndexAny(s, chars)
if m < 0 {
break
}
out = append(out, s[:m], s[m:m+1])
s = s[m+1:]
}
out = append(out, s)
return
}
This is limited to single characters to split on. And passing something like splitCharsInclusive("(z)(y)(x)", "()") might not get you the output you want, as you'd get a few empty strings in the response. But hopefully this is a good starting point for the modifications you need.
Also, the Go version that I've linked calculates the length of the output slice in advance. This is a nice optimization that I've decided to omit, but it would likely be good to add back.
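If the empty strings are unwanted, one possible modification is sketched below. splitKeepDelims is a hypothetical name; it assumes the delimiters are single-byte (ASCII) characters, drops empty pieces, and reserves capacity up front in the spirit of the standard library version.

```go
package main

import (
	"fmt"
	"strings"
)

// splitKeepDelims splits s at any byte found in chars, keeps each delimiter
// as its own element, and omits empty pieces. Assumes ASCII delimiters.
func splitKeepDelims(s, chars string) []string {
	// Count delimiters to get an upper bound on the result size:
	// each delimiter contributes at most two elements, plus one tail.
	n := 0
	for i := 0; i < len(s); i++ {
		if strings.IndexByte(chars, s[i]) >= 0 {
			n++
		}
	}
	out := make([]string, 0, 2*n+1)
	for {
		m := strings.IndexAny(s, chars)
		if m < 0 {
			break
		}
		if m > 0 {
			out = append(out, s[:m]) // text before the delimiter
		}
		out = append(out, s[m:m+1]) // the delimiter itself
		s = s[m+1:]
	}
	if s != "" {
		out = append(out, s)
	}
	return out
}

func main() {
	fmt.Println(splitKeepDelims("z=y+x", "=+"))
	fmt.Println(splitKeepDelims("#symbol", "#"))
	fmt.Println(splitKeepDelims("(z)(y)(x)", "()"))
}
```

With this variant "#symbol" yields two elements rather than three, since the leading empty string is dropped.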
I have started a project for creating reports by utilizing excel data and the various Go excel libraries (excelize, tealeg's xlsx)
One of the biggest frustrations I have found is working with slices which have some nil indexes depending on the source of data (blank rows in the input data transfer as "nil" slice indexes when I use the xlsx library to pull data)
These nil slice indexes obviously throw an "index out of range" if I ever try to use them in one of my many for loops, which leads me to the painstaking task of ensuring, each time I want to work with a slice index, that it isn't actually nil by using len() and cap() to death (excerpt of code below to illustrate)
//example code excerpt
for rowNumber, cellStringSlice := range inputSlice {
for rowColumn, cellString := range cellStringSlice {
//loop var declaration
rowColumnHeading := 2
rowNumberInc := rowNumber + 1
rowNumberDec := rowNumber - 1
if rowNumber > 0 {
if len(inputSlice[rowNumber]) != 0 { //len check to stop index out of range issue with slice
previousColACellValue = inputSlice[rowNumber][rowColumn]
continue
}
if len(inputSlice[rowNumber+1]) != 0 { //len check to stop index out of range issue with slice
nextColACellValue = inputSlice[rowNumber+1][rowColumn]
continue
}
}
}
I should specify that in this 2D slice I am using:
inputSlice[rowNumber][rowColumn]
the proximal slice (rowNumber) is never nil (there is always a row); however, the second, distal slice it indexes (rowColumn) can be nil in some instances. That is why, in this scenario, my overall loop always enters the second inner loop even when it is iterating through a row with no column data (i.e. inputSlice[rowNumber][rowColumn] = nil), which brings a frequent need for me to handle index out of range issues
I can't just remove all the nil indexes and shift everything up, as these are representing "blank rows" in the final excel doc I output these rows to.
So my question is, are there any useful go functions or libraries which take care of nil indexes by swapping all nils for "" in slices and 2d/3d slices of type string? Or is it a task for the programmer to always "sanitise" his slices by removing these nils or check for them each time they ever want to access an element?
I appreciate I could write a for loop myself to swap all these nils for a "", but writing a function to do this each time I work with slices of strings containing/possibly containing nil's would seem a little bizarre to me
Your outer loop is on inputSlice, so inputSlice[rowNumber] is always valid, and since the inner loop ranges over that row, the row's length is never zero inside the inner loop. Thus the first check is unnecessary. If you have a nil or empty slice for inputSlice[rowNumber], the inner for loop will not even be entered.
The second check is necessary, but wrong:
if len(inputSlice[rowNumber+1]) != 0 {
If rowNumber is the last row, then inputSlice[rowNumber+1] is not valid, as no such row exists. You have to check that the next row exists before indexing it:
if rowNumber+1 < len(inputSlice) {
...
}
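If these checks keep recurring, one option is to put them behind a small accessor. The sketch below is only an illustration: cellAt is a hypothetical helper, not part of any Excel library, and it treats any missing row or column as an empty string.

```go
package main

import "fmt"

// cellAt returns rows[r][c], or "" when either index is out of range.
// A nil or short row is therefore read as a row of blank cells.
func cellAt(rows [][]string, r, c int) string {
	if r < 0 || r >= len(rows) {
		return ""
	}
	if c < 0 || c >= len(rows[r]) {
		return ""
	}
	return rows[r][c]
}

func main() {
	// The middle row is nil, standing in for a blank spreadsheet row.
	input := [][]string{{"a", "b"}, nil, {"c"}}
	for r := range input {
		fmt.Printf("row %d: prev=%q cur=%q next=%q\n",
			r, cellAt(input, r-1, 0), cellAt(input, r, 0), cellAt(input, r+1, 0))
	}
}
```

The blank rows stay in the data, so the output layout is unaffected, but callers no longer need a len check before every neighbouring-cell lookup.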
I have some extremely old legacy procedural code which takes 10 or so enumerated inputs [ i0, i1, i2, ... i9 ] and generates 170 odd enumerated outputs [ r0, r1, ... r168, r169 ]. By enumerated, I mean that each individual input & output has its own set of distinct value sets e.g. [ red, green, yellow ] or [ yes, no ] etc.
I’m putting together the entire state table using the existing code, and instead of puzzling through them by hand, I was wondering if there was an algorithmic way of determining an appropriate function to get to each result from the 10 inputs. Note, not all input columns may be required to determine an individual output column, i.e. r124 might only be dependent on i5, i6 and i9.
These are not continuous functions, and I expect I might end up with some sort of hashing function approach, but I wondered if anyone knew of a more repeatable process I should be using instead? (If only there was some Karnaugh map like approach for multiple value non-binary functions ;-) )
If you are willing to actually enumerate all possible input/output sequences, here is a theoretical approach to tackle this that should be fairly effective.
First, consider the entropy of the output. Suppose that you have n possible input sequences, and x[i] is the number of ways to get i as an output. Let p[i] = float(x[i])/float(n) and then the entropy is - sum(p[i] * log(p[i]) for i in outputs). (Note, since p[i] <= 1, log(p[i]) is never positive, and therefore the entropy is never negative. Also note, if p[i] = 0 then we assume that p[i] * log(p[i]) is also zero.)
The amount of entropy can be thought of as the amount of information needed to predict the outcome.
Now here is the key question. What variable gives us the most information about the output per information about the input?
If a particular variable v has in[v] possible values, the amount of information in specifying v is log(float(in[v])). I already described how to calculate the entropy of the entire set of outputs. For each possible value of v we can calculate the entropy of the entire set of outputs for that value of v. The amount of information given by knowing v is the entropy of the total set minus the average of the entropies for the individual values of v.
Pick the variable v which gives you the best ratio of information_gained_from_v/information_to_specify_v. Your algorithm will start with a switch on the set of values of that variable.
Then for each value, you repeat this process to get cascading nested if conditions.
This will generally lead to a fairly compact set of cascading nested if conditions that will focus on the input variables that tell you as much as possible, as quickly as possible, with as few branches as you can manage.
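The selection step described above could be sketched roughly as follows. This is an illustration under assumptions, not a definitive implementation: the row type and the functions entropy and bestVariable are names invented here, values are represented as strings, and logarithms are taken base 2. The per-value entropies are weighted by group size, which equals the plain average when the table is a full enumeration.

```go
package main

import (
	"fmt"
	"math"
)

// row is one line of the state table: the inputs and a single output column.
type row struct {
	in  []string
	out string
}

// entropy returns -sum(p*log2(p)) over the distribution of outputs in rows.
func entropy(rows []row) float64 {
	counts := map[string]int{}
	for _, r := range rows {
		counts[r.out]++
	}
	h := 0.0
	for _, c := range counts {
		p := float64(c) / float64(len(rows))
		h -= p * math.Log2(p)
	}
	return h
}

// bestVariable returns the index of the input with the highest ratio of
// information gained to information needed to specify it, or -1 if no
// input reduces the entropy.
func bestVariable(rows []row, numInputs int) int {
	total := entropy(rows)
	best, bestRatio := -1, 0.0
	for v := 0; v < numInputs; v++ {
		groups := map[string][]row{}
		for _, r := range rows {
			groups[r.in[v]] = append(groups[r.in[v]], r)
		}
		if len(groups) < 2 {
			continue // a variable with a single value carries no information
		}
		remaining := 0.0
		for _, g := range groups {
			remaining += float64(len(g)) / float64(len(rows)) * entropy(g)
		}
		ratio := (total - remaining) / math.Log2(float64(len(groups)))
		if ratio > bestRatio {
			best, bestRatio = v, ratio
		}
	}
	return best
}

func main() {
	// The output depends only on the second input.
	rows := []row{
		{[]string{"red", "yes"}, "on"},
		{[]string{"red", "no"}, "off"},
		{[]string{"green", "yes"}, "on"},
		{[]string{"green", "no"}, "off"},
	}
	fmt.Println(bestVariable(rows, 2)) // prints 1
}
```

To build the nested conditions, the same function would be applied again to each group of rows that shares a value of the chosen variable, stopping when a group's entropy reaches zero.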
Now this assumed that you had a comprehensive enumeration. But what if you don't?
The answer to that is that the analysis that I described can be done for a random sample of your possible set of inputs. So if you run your code with, say, 10,000 random inputs, then you'll come up with fairly good entropies for your first level. Repeat with 10,000 for each of your branches on your second level, and the same will happen. Continue as long as it is computationally feasible.
If there are good patterns to find, you will quickly find a lot of patterns of the form, "If you put in this that and the other, here is the output you always get." If there is a reasonably short set of nested ifs that give the right output, you're probably going to find it. After that, you have the question of deciding whether to actually verify by hand that each bucket is reliable, or to trust that if you couldn't find any exceptions with 10,000 random inputs, then there are none to be found.
Tricky approach for the validation. If you can find fuzzing software written for your language, run the fuzzing software with the goal of trying to tease out every possible internal execution path for each bucket you find. If the fuzzing software decides that you can't get different answers than the one you think is best from the above approach, then you can probably trust it.
The algorithm is pretty straightforward. Given the possible values for each input, we can generate all possible input vectors. Then, for each output, we can eliminate the inputs that do not matter for that output. As a result, for each output we get a matrix showing the output values for all input combinations, excluding the inputs that do not matter for the given output.
Sample input format (for the code snippet below):
var schema = new ConvertionSchema()
{
InputPossibleValues = new object[][]
{
new object[] { 1, 2, 3, }, // input #0
new object[] { 'a', 'b', 'c' }, // input #1
new object[] { "foo", "bar" }, // input #2
},
Converters = new System.Func<object[], object>[]
{
input => input[0], // output #0
input => (int)input[0] + (int)(char)input[1], // output #1
input => (string)input[2] == "foo" ? 1 : 42, // output #2
input => input[2].ToString() + input[1].ToString(), // output #3
input => (int)input[0] % 2, // output #4
}
};
Sample output:
The heart of the backward conversion is below. The full code, in the form of a LINQPad snippet, is here: http://share.linqpad.net/cknrte.linq.
public void Reverse(ConvertionSchema schema)
{
// generate all possible input vectors and record the result for each case
// then for each output we can figure out which inputs matter
object[][] inputs = schema.GenerateInputVectors();
// reversal path
for (int outputIdx = 0; outputIdx < schema.OutputsCount; outputIdx++)
{
List<int> inputsThatDoNotMatter = new List<int>();
for (int inputIdx = 0; inputIdx < schema.InputsCount; inputIdx++)
{
// find all groups for input vectors where all other inputs (excluding current) are the same
// if across these groups outputs are exactly the same, then it means that current input
// does not matter for given output
bool inputMatters = inputs.GroupBy(input => ExcudeByIndexes(input, new[] { inputIdx }), input => schema.Convert(input)[outputIdx], ObjectsByValuesComparer.Instance)
.Where(x => x.Distinct().Count() > 1)
.Any();
if (!inputMatters)
{
inputsThatDoNotMatter.Add(inputIdx);
Util.Metatext($"Input #{inputIdx} does not matter for output #{outputIdx}").Dump();
}
}
// mapping table (only inputs that matters)
var mapping = new List<dynamic>();
foreach (var inputGroup in inputs.GroupBy(input => ExcudeByIndexes(input, inputsThatDoNotMatter), ObjectsByValuesComparer.Instance))
{
dynamic record = new ExpandoObject();
object[] sampleInput = inputGroup.First();
object output = schema.Convert(sampleInput)[outputIdx];
for (int inputIdx = 0; inputIdx < schema.InputsCount; inputIdx++)
{
if (inputsThatDoNotMatter.Contains(inputIdx))
continue;
AddProperty(record, $"Input #{inputIdx}", sampleInput[inputIdx]);
}
AddProperty(record, $"Output #{outputIdx}", output);
mapping.Add(record);
}
// input x, ..., input y, output z form is needed
mapping.Dump();
}
}
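For readers working in Go, the core test ("does this input matter for this output?") can be sketched as below. This is a loose re-expression of the idea, not a port of the LINQPad code: inputMatters is a hypothetical name, all values are represented as strings, and the grouping key assumes no value contains a NUL byte.

```go
package main

import (
	"fmt"
	"strings"
)

// inputMatters reports whether changing input idx alone can ever change the
// output of f, checked over every combination of the given domains.
// f must not retain the slice it is given, as it is reused between calls.
func inputMatters(domains [][]string, f func([]string) string, idx int) bool {
	seen := map[string]string{} // other inputs -> output first seen for them
	vec := make([]string, len(domains))
	var walk func(pos int) bool
	walk = func(pos int) bool {
		if pos == len(domains) {
			// Build a key from every input except idx, without mutating vec.
			rest := make([]string, 0, len(vec))
			rest = append(rest, vec[:idx]...)
			rest = append(rest, vec[idx+1:]...)
			key := strings.Join(rest, "\x00")
			out := f(vec)
			if prev, ok := seen[key]; ok && prev != out {
				return true // same other inputs, different output
			}
			seen[key] = out
			return false
		}
		for _, v := range domains[pos] {
			vec[pos] = v
			if walk(pos + 1) {
				return true
			}
		}
		return false
	}
	return walk(0)
}

func main() {
	domains := [][]string{{"1", "2", "3"}, {"a", "b", "c"}, {"foo", "bar"}}
	// Mirrors output #2 of the sample schema: depends only on input #2.
	f := func(in []string) string {
		if in[2] == "foo" {
			return "1"
		}
		return "42"
	}
	for i := range domains {
		fmt.Printf("input #%d matters: %v\n", i, inputMatters(domains, f, i))
	}
}
```

Inputs for which this returns false can be dropped from the mapping table for that output, which is what keeps the resulting tables small.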
I'm parsing loads of HTTP logs, with the goal of telling how many requests each IP address generated.
The first thing I did is:
var hits = make(map[string]uint)
// so I could populate it with
hits[ipAddr]++
However, I would like to make it "typed", so that it would be immediately clear that hits[string]uint uses an IP address as a string identifier. I thought, well maybe a struct can help me:
type Hit struct {
IP string
Count uint
}
But that way (I think) I'm losing performance, because now I have to actually look for the specific Hit to increment its count. I accept that I could be paranoid here, and could simply go for the loop:
var hits []Hit
// TrackHit just damn tracks it
func TrackHit(ip string) {
// range by index so the increment changes the slice element, not a copy
for i := range hits {
if hits[i].IP == ip {
hits[i].Count++
return
}
}
hits = append(hits, Hit{
IP: ip,
Count: 1,
})
}
But that just looks ... suboptimal. I think everything that could be written in 1 line makes you shine as professional, and when 1 line turns into 13, I tend to feel "whaaa? Doing something wrong here, mom?"
Any typed one-liners here in Go?
Thanks
As Uvelichitel pointed out, you can use a typed string:
type IP string
var hits = make(map[IP]uint)
hits[IP("127.0.0.1")]++
Or you could use a standard library IP type. Note that net.IP is a byte slice, and slices cannot be used as map keys, so map[net.IP]uint does not compile; netip.Addr (package net/netip, Go 1.18 and later) is comparable and works as a key:
var hits = make(map[netip.Addr]uint)
hits[netip.MustParseAddr("127.0.0.1")]++
Either would make it clear that you're referring to IPs, without the overhead introduced by looping over a slice of structs for every increment. The latter has the advantage of giving you full stdlib support for any other IP manipulation you need to do, and of validating each address as it is parsed, at the cost of parsing the strings. Which one is better will depend on your specific use case.
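A minimal runnable sketch of the standard-library variant follows. Since net.IP is a byte slice and cannot be a map key, the sketch uses netip.Addr from net/netip (Go 1.18 or later), which is comparable. countHits is a hypothetical helper, and silently skipping unparseable addresses is an assumption made for brevity; real log processing may want to report them.

```go
package main

import (
	"fmt"
	"net/netip"
)

// countHits tallies requests per address. Strings that do not parse as an
// IP address are skipped.
func countHits(addrs []string) map[netip.Addr]uint {
	hits := make(map[netip.Addr]uint)
	for _, s := range addrs {
		ip, err := netip.ParseAddr(s)
		if err != nil {
			continue
		}
		hits[ip]++ // a missing key starts at zero, so no lookup loop is needed
	}
	return hits
}

func main() {
	hits := countHits([]string{"127.0.0.1", "10.0.0.7", "127.0.0.1", "not-an-ip"})
	for ip, n := range hits {
		fmt.Println(ip, n)
	}
}
```

The increment stays a one-liner, and the map's key type documents that the counter is indexed by IP address.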