Way to check for duplicates before writing into a file? - go

So I wrote a small script that takes text files as an input, reads every line and tries to validate it as an email. If it passes, it writes the line into a new ('clean') file, if it doesn't pass, it strips it of spaces and tries to validate it again. Now, if it passes this time, it writes the line into a new file and if it fails, it ignores the line.
Thing is, such as it is, my script may write duplicate emails into the output files. How should I go around that and check for duplicates present in the output file before writing?
Here's the relevant code:
// create reading and writing buffers
scanner := bufio.NewScanner(r)
writer := bufio.NewWriter(w)
for scanner.Scan() {
email := scanner.Text()
// validate each email
if !correctEmail.MatchString(email) {
// if validation didn't pass, strip and lowercase the email and store it
email = strings.Replace(email, " ", "", -1)
// validate the email again after cleaning
if !correctEmail.MatchString(email) {
// if validation didn't pass, ignore this email
continue
} else {
// if validation passed, write clean email into file
_, err = writer.WriteString(email + "\r\n")
if err != nil {
return err
}
}
} else {
// if validation passed, write the email into file
_, err = writer.WriteString(email + "\r\n")
if err != nil {
return err
}
}
}
err = writer.Flush()
if err != nil {
return err
}

Create a type that implements writer then create a custom WriteString
Inside WriteString open the file where you store your emails, iterate over each email and save the new emails.

You may use a Go built-in map as a set like this:
package main
import (
"fmt"
)
var emailSet map[string]bool = make(map[string]bool)
func emailExists(email string) bool {
_, ok := emailSet[email]
return ok
}
func addEmail(email string) {
emailSet[email] = true
}
func main() {
emails := []string{
"duplicated#golang.org",
"abc#golang.org",
"stackoverflow#golang.org",
"duplicated#golang.org", // <- Duplicated!
}
for _, email := range emails {
if !emailExists(email) {
fmt.Println(email)
addEmail(email)
}
}
}
Here is the output:
duplicated#golang.org
abc#golang.org
stackoverflow#golang.org
You may try the same code at The Go Playground.

Related

Parsing prometheus metrics from file and updating counters

I've a go application that gets run periodically by a batch. Each run, it should read some prometheus metrics from a file, run its logic, update a success/fail counter, and write metrics back out to a file.
From looking at How to parse Prometheus data as well as the godocs for prometheus, I'm able to read in the file, but I don't know how to update app_processed_total with the value returned by expfmt.ExtractSamples().
This is what I've done so far. Could someone please tell me how should I proceed from here? How can I typecast the Vector I got into a CounterVec?
package main
import (
"fmt"
"net/http"
"strings"
"time"
"github.com/prometheus/client_golang/prometheus"
"github.com/prometheus/client_golang/prometheus/promhttp"
dto "github.com/prometheus/client_model/go"
"github.com/prometheus/common/expfmt"
"github.com/prometheus/common/model"
)
var (
fileOnDisk = prometheus.NewRegistry()
processedTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
Name: "app_processed_total",
Help: "Number of times ran",
}, []string{"status"})
)
func doInit() {
prometheus.MustRegister(processedTotal)
}
func recordMetrics() {
go func() {
for {
processedTotal.With(prometheus.Labels{"status": "ok"}).Inc()
time.Sleep(5 * time.Second)
}
}()
}
func readExistingMetrics() {
var parser expfmt.TextParser
text := `
# HELP app_processed_total Number of times ran
# TYPE app_processed_total counter
app_processed_total{status="ok"} 300
`
parseText := func() ([]*dto.MetricFamily, error) {
parsed, err := parser.TextToMetricFamilies(strings.NewReader(text))
if err != nil {
return nil, err
}
var result []*dto.MetricFamily
for _, mf := range parsed {
result = append(result, mf)
}
return result, nil
}
gatherers := prometheus.Gatherers{
fileOnDisk,
prometheus.GathererFunc(parseText),
}
gathering, err := gatherers.Gather()
if err != nil {
fmt.Println(err)
}
fmt.Println("gathering: ", gathering)
for _, g := range gathering {
vector, err := expfmt.ExtractSamples(&expfmt.DecodeOptions{
Timestamp: model.Now(),
}, g)
fmt.Println("vector: ", vector)
if err != nil {
fmt.Println(err)
}
// How can I update processedTotal with this new value?
}
}
func main() {
doInit()
readExistingMetrics()
recordMetrics()
http.Handle("/metrics", promhttp.Handler())
http.ListenAndServe("localhost:2112", nil)
}
I believe you would need to use processedTotal.WithLabelValues("ok").Inc() or something similar to that.
The more complete example is here
func ExampleCounterVec() {
httpReqs := prometheus.NewCounterVec(
prometheus.CounterOpts{
Name: "http_requests_total",
Help: "How many HTTP requests processed, partitioned by status code and HTTP method.",
},
[]string{"code", "method"},
)
prometheus.MustRegister(httpReqs)
httpReqs.WithLabelValues("404", "POST").Add(42)
// If you have to access the same set of labels very frequently, it
// might be good to retrieve the metric only once and keep a handle to
// it. But beware of deletion of that metric, see below!
m := httpReqs.WithLabelValues("200", "GET")
for i := 0; i < 1000000; i++ {
m.Inc()
}
// Delete a metric from the vector. If you have previously kept a handle
// to that metric (as above), future updates via that handle will go
// unseen (even if you re-create a metric with the same label set
// later).
httpReqs.DeleteLabelValues("200", "GET")
// Same thing with the more verbose Labels syntax.
httpReqs.Delete(prometheus.Labels{"method": "GET", "code": "200"})
}
This is taken from the Promethus examples on Github
To use the value of vector you can do the following:
vectorFloat, err := strconv.ParseFloat(vector[0].Value.String(), 64)
if err != nil {
panic(err)
}
processedTotal.WithLabelValues("ok").Add(vectorFloat)
This is assuming you will only ever get a single vector value in your response. The value of the vector is stored as a string but you can convert it to a float with the strconv.ParseFloat method.

How to pass byte slice between go routines using channels

I have a function that reads data from a source and send them to destination. Source and destination could be anything, lets say for this example source is database (any MySQL, PostgreSQL...) and destination is distributed Q (any... ActiveMQ, Kafka). Messages are stored in bytes.
This is main function. idea is it will spin a new go routine and will wait for messages to be returned for future processing.
type Message []byte
func (p *ProcessorService) Continue(dictId int) {
level.Info(p.logger).Log("process", "message", "dictId", dictId)
retrieved := make(chan Message)
go func() {
err := p.src.Read(retrieved, strconv.Itoa(p.dictId))
if err != nil {
level.Error(p.logger).Log("process", "read", "message", "err", err)
}
}()
for r := range retrieved {
go func(message Message) {
level.Info(p.logger).Log("message", message)
if len(message) > 0 {
if err := p.dst.sendToQ(message); err != nil {
level.Error(p.logger).Log("failed", "during", "persist", "err", err)
}
} else {
level.Error(p.logger).Log("failed")
}
}(r)
}
}
and this is read function itself
func (s *Storage) Read(out chan<- Message, opt ...string) error {
// I just skip some basic database read operations here
// but idea is simple, read data from the table / file row by row and
//
for _, value := range dataFromDB {
message, err := value.row
if err == nil {
out <- message
} else {
errorf("Unable to get data %v", err)
out <- make([]byte, 0)
}
}
})
close(out)
if err != nil {
return err
}
return nil
}
As you can see communication done via out chan<- Message channel.
My concern in Continue function, specifically here
for r := range retrieved {
go func(message Message) {
// basically here message and r are pointing to the same underlying array
}
}
When data received var r is a type of slice byte. Then it passed to go func(message Message) everything passed by value in go, in this case var r will be passed as copy to anonymous func, however it will still have a pointer to underlying slice data. I am curious if it could be a problem during p.dst.sendToQ(message); execution and at the same time read function will send something to out channel causing slice data structure to be overridden with a new information. Should I copy byte slice r into the new byte slice before passing to anonymous function, so underlying arrays will be different? I tested it, but couldn't really cause this behavior. Not sure if I am paranoid or have to worry about it.
The message in p.dst.sendToQ(message) is the same slice as value.row when you get data from the db. So, as long as each value.row has a different underlying array, you should be good. So, I suggest you check the source and make sure it does not use a common byte array and keeps rewriting to it.

Idiomatic way to return multiple errors or handle them accordingly

I have this code and I don't like the way it feels not to mention golint doesn't like it with error should be the last type when returning multiple items (golint). To explain this code:
I want to leave it to the user to decide whether they care about any of the errors returned
Particularly in this code the Audio file is sometimes not needed or required and it can be ignored
Video and Outputfile are likely always going to be required for whatever the user is doing
I am open to refactoring this in any way (be it breaking it apart, moving things around, etc.) Is there a more idiomatic way in Go to accomplish something like this?
// Download will download the audio and video files to a particular path
func (r *RedditVideo) Download() (outputVideoFileName string, outputAudioFileName string, errVideo error, errAudio error) {
outputVideoFile, errVideo := downloadTemporaryFile(
r.VideoURL,
TemporaryVideoFilePrefix,
)
if errVideo == nil {
outputVideoFileName = outputVideoFile.Name()
}
outputAudioFile, errAudio := downloadTemporaryFile(
r.AudioURL,
TemporaryAudioFilePrefix,
)
if errAudio == nil {
outputAudioFileName = outputAudioFile.Name()
}
return outputVideoFileName, outputAudioFileName, errVideo, errAudio
}
Similarly I found myself using this same pattern again later in code:
// SetFiles sets up all of the input files and the output file
func (l *localVideo) SetFiles(inputVideoFilePath string, inputAudioFilePath string, outputVideoFilePath string) (errVideo error, errAudio error, errOutputVideo error) {
// Set input video file
l.inputVideoFile, errVideo = os.Open(inputVideoFilePath)
if errVideo != nil {
return
}
if errVideo = l.inputVideoFile.Close(); errVideo != nil {
return
}
// Set output video file
l.outputVideoFile, errOutputVideo = os.Create(outputVideoFilePath)
if errOutputVideo != nil {
return
}
if errOutputVideo = l.outputVideoFile.Close(); errOutputVideo != nil {
return
}
// IMPORTANT! Set input audio file LAST incase it is nil
l.inputAudioFile, errAudio = os.Open(inputAudioFilePath)
if errAudio != nil {
return
}
if errAudio = l.inputAudioFile.Close(); errVideo != nil {
return
}
return
}
This time in this code again some of the same is true like above:
We care that the Video and Output are set and may or may not care about the Audio
There are multiple errors are there to handle that are left up to the user
What you can see quite a bit in the standard library are specific functions which wrap more generic non-exported functions. See the commented code below.
// download is a rather generic function
// and probably you can refactor downloadTemporaryFile
// so that even this function is not needed any more.
func (r *RedditVideo) download(prefix, url string) (filename string, error error) {
outputFile, err := downloadTemporaryFile(
r.VideoURL,
prefix,
)
if err == nil {
return "", fmt.Errorf("Error while download: %s", err)
}
return outputFile.Name(), nil
}
// DownloadVideo wraps download, handing over the values specific
// to the video download
func (r *RedditVideo) DownloadVideo() (filename string, averror error) {
return r.download(TemporaryVideoFilePrefix, r.VideoURL)
}
// DownloadAudio wraps download, handing over the values specific
// to the audio download
func (r *RedditVideo) DownloadAudio() (filename string, averror error) {
return r.download(TemporaryAudioFilePrefix, r.AudioURL)
}
func main() {
r := RedditVideo{
VideoURL: os.Args[1],
AudioURL: os.Args[2],
}
var videoerror, audioerror error
var videofileName, audiofileName string
if videofileName, videoerror = r.DownloadVideo(); videoerror != nil {
fmt.Println("Got an error downloading the video")
}
if audiofileName, audioerror = r.DownloadAudio(); audioerror != nil {
fmt.Println("Got an error downloading the audio")
}
fmt.Printf("Video:\t%s\nAudio:\t%s", videofileName, audiofileName)
}
This way, it is obvious from which download the returned error stems from.
From https://blog.golang.org/error-handling-and-go:
Go code uses error values to indicate an abnormal state.
In your situation, Audio is optional and Video is required. Therefore only Video download should return error:
// Download will download the audio and video files to a particular path
// empty outputAudioFileName indicates inability to download Audio File
func (r *RedditVideo) Download() (outputVideoFileName string, outputAudioFileName string, err error) {
outputVideoFile, err := downloadTemporaryFile(
r.VideoURL,
TemporaryVideoFilePrefix,
)
if err != nil {
return "", "", err
}
outputVideoFileName := outputVideoFile.Name()
outputAudioFile, errAudio := downloadTemporaryFile(
r.AudioURL,
TemporaryAudioFilePrefix,
)
outputAudioFileName := ""
if errAudio == nil {
outputAudioFileName = outputAudioFile.Name()
} else {
// Log error, it is still important to fix it
}
return outputVideoFileName, outputAudioFileName, nil
}
Rule of thumb - any code that can generate abnormal state should have in next line:
if err != nil {
return funcResult, err
}

Output of GET request different to view source

I'm trying to extract match data from whoscored.com. When I view the source on firefox, I find on line 816 a big json string with the data I want for that matchid. My goal is to eventually get this json.
In doing this, I've tried to download every page of https://www.whoscored.com/Matches/ID/Live where ID is the id of the match. I wrote a little Go program to GET request each ID up to a certain point:
package main
import (
"fmt"
"io/ioutil"
"net/http"
"os"
)
// http://www.whoscored.com/Matches/614052/Live is the match for
// Eveton vs Manchester
const match_address = "http://www.whoscored.com/Matches/"
// the max id we get
const max_id = 300
const num_workers = 10
// function that get the bytes of the match id from the website
func match_fetch(matchid int) {
url := fmt.Sprintf("%s%d/Live", match_address, matchid)
resp, err := http.Get(url)
if err != nil {
fmt.Println(err)
return
}
// if we sucessfully got a response, store the
// body in memory
defer resp.Body.Close()
body, err := ioutil.ReadAll(resp.Body)
if err != nil {
fmt.Println(err)
return
}
// write the body to memory
pwd, _ := os.Getwd()
filepath := fmt.Sprintf("%s/match_data/%d", pwd, matchid)
err = ioutil.WriteFile(filepath, body, 0644)
if err != nil {
fmt.Println(err)
return
}
}
// data type to send to the workers,
// last means this job is the last one
// matchid is the match id to be fetched
// a matchid of -1 means don't fetch a match
type job struct {
last bool
matchid int
}
func create_worker(jobs chan job) {
for {
next_job := <-jobs
if next_job.matchid != -1 {
match_fetch(next_job.matchid)
}
if next_job.last {
return
}
}
}
func main() {
// do the eveton match as a reference
match_fetch(614052)
var joblist [num_workers]chan job
var v int
for i := 0; i < num_workers; i++ {
job_chan := make(chan job)
joblist[i] = job_chan
go create_worker(job_chan)
}
for i := 0; i < max_id; i = i + num_workers {
for index, c := range joblist {
if i+index < max_id {
v = i + index
} else {
v = -1
}
c <- job{false, v}
}
}
for _, c := range joblist {
c <- job{true, -1}
}
}
The code seems to work in that it fills a directory called match_data with html. The problem is that this html is completely different to what I get in the browser! Here is the section which I think does this: (from the body of the GET request of http://www.whoscored.com/Matches/614052/Live.
(function() {
var z="";var b="7472797B766172207868723B76617220743D6E6577204461746528292E67657454696D6528293B766172207374617475733D227374617274223B7661722074696D696E673D6E65772041727261792833293B77696E646F772E6F6E756E6C6F61643D66756E6374696F6E28297B74696D696E675B325D3D22723A222B286E6577204461746528292E67657454696D6528292D74293B646F63756D656E742E637265617465456C656D656E742822696D6722292E7372633D222F5F496E63617073756C615F5265736F757263653F4553324C555243543D363726743D373826643D222B656E636F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B69662877696E646F772E584D4C4874747052657175657374297B7868723D6E657720584D4C48747470526571756573747D656C73657B7868723D6E657720416374697665584F626A65637428224D6963726F736F66742E584D4C4854545022297D7868722E6F6E726561647973746174656368616E67653D66756E6374696F6E28297B737769746368287868722E72656164795374617465297B6361736520303A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2072657175657374206E6F7420696E697469616C697A656420223B627265616B3B6361736520313A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2073657276657220636F6E6E656374696F6E2065737461626C6973686564223B627265616B3B6361736520323A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2072657175657374207265636569766564223B627265616B3B6361736520333A7374617475733D6E6577204461746528292E67657454696D6528292D742B223A2070726F63657373696E672072657175657374223B627265616B3B6361736520343A7374617475733D22636F6D706C657465223B74696D696E675B315D3D22633A222B286E6577204461746528292E67657454696D6528292D74293B6966287868722E7374617475733D3D323030297B706172656E742E6C6F636174696F6E2E72656C6F616428297D627265616B7D7D3B74696D696E675B305D3D22733A222B286E6577204461746528292E67657454696D6528292D74293B7868722E6F70656E2822474554222C222F5F496E63617073756C615F5265736F757263653F535748414E45444C3D313536343032333530343538313538333938362C31373139363833393832313930303534313833392C31333935303737313737393531363432383234342C3132363636222C66616C7365293B7868722E73656E64286E756C6C297D63617463682863297B7374617475732B3D6E6577204461746528292E67657454696D6528292D742B2220696E6361705F6578633A20222B633B646F63756D656E742E637265617465456C656D656E742822696D6722292E7372633D222F5F496E63617073756C615F5265736F757263653F4553324C555243543D363726743D373826643D222B656E636F6465555249436F6D706F6E656E74287374617475732B222028222B74696D696E672E6A6F696E28292B222922297D3B";for (var i=0;i<b.length;i+=2){z=z+parseInt(b.substring(i, i+2), 16)+",";}z = z.substring(0,z.length-1); eval(eval('String.fromCharCode('+z+')'));})();
The reason I think this is the case is that the javascript in the page fetches and edits the DOM to what I see on view source. How can I get golang to run the javascript? Is there are library to do this? Better still, could I directly grab the JSON from the servers?
This can be done with https://godoc.org/github.com/sourcegraph/webloop#View.EvaluateJavaScript
Read their main example https://github.com/sourcegraph/webloop
What you need is a "headless browser" in general.
In general it is better to use an Web API vs. scraping. For example, whoscored themselves use OPTA which you should be able to access directly.
http://www.jokecamp.com/blog/guide-to-football-and-soccer-data-and-apis/#opta

Delete objects in s3 using wildcard matching

I have the following working code to delete an object from Amazon s3
params := &s3.DeleteObjectInput{
Bucket: aws.String("Bucketname"),
Key : aws.String("ObjectKey"),
}
s3Conn.DeleteObjects(params)
But what i want to do is to delete all files under a folder using wildcard **. I know amazon s3 doesn't treat "x/y/file.jpg" as a folder y inside x but what i want to achieve is by mentioning "x/y*" delete all the subsequent objects having the same prefix. Tried amazon multi object delete
params := &s3.DeleteObjectsInput{
Bucket: aws.String("BucketName"),
Delete: &s3.Delete{
Objects: []*s3.ObjectIdentifier {
{
Key : aws.String("x/y/.*"),
},
},
},
}
result , err := s3Conn.DeleteObjects(params)
I know in php it can be done easily by s3->delete_all_objects as per this answer. Is the same action possible in GOlang.
Unfortunately the goamz package doesn't have a method similar to the PHP library's delete_all_objects.
However, the source code for the PHP delete_all_objects is available here (toggle source view): http://docs.aws.amazon.com/AWSSDKforPHP/latest/#m=AmazonS3/delete_all_objects
Here are the important lines of code:
public function delete_all_objects($bucket, $pcre = self::PCRE_ALL)
{
// Collect all matches
$list = $this->get_object_list($bucket, array('pcre' => $pcre));
// As long as we have at least one match...
if (count($list) > 0)
{
$objects = array();
foreach ($list as $object)
{
$objects[] = array('key' => $object);
}
$batch = new CFBatchRequest();
$batch->use_credentials($this->credentials);
foreach (array_chunk($objects, 1000) as $object_set)
{
$this->batch($batch)->delete_objects($bucket, array(
'objects' => $object_set
));
}
$responses = $this->batch($batch)->send();
As you can see, the PHP code will actually make an HTTP request on the bucket to first get all files matching PCRE_ALL, which is defined elsewhere as const PCRE_ALL = '/.*/i';.
You can only delete 1000 files at once, so delete_all_objects then creates a batch function to delete 1000 files at a time.
You have to create the same functionality in your go program as the goamz package doesn't support this yet. Luckily it should only be a few lines of code, and you have a guide from the PHP library.
It might be worth submitting a pull request for the goamz package once you're done!
Using the mc tool you can do:
mc rm -r --force https://BucketName.s3.amazonaws.com/x/y
it will delete all the objects with the prefix "x/y"
You can achieve the same with Go using minio-go like this:
package main
import (
"log"
"github.com/minio/minio-go"
)
func main() {
config := minio.Config{
AccessKeyID: "YOUR-ACCESS-KEY-HERE",
SecretAccessKey: "YOUR-PASSWORD-HERE",
Endpoint: "https://s3.amazonaws.com",
}
// find Your S3 endpoint here http://docs.aws.amazon.com/general/latest/gr/rande.html
s3Client, err := minio.New(config)
if err != nil {
log.Fatalln(err)
}
isRecursive := true
for object := range s3Client.ListObjects("BucketName", "x/y", isRecursive) {
if object.Err != nil {
log.Fatalln(object.Err)
}
err := s3Client.RemoveObject("BucketName", object.Key)
if err != nil {
log.Fatalln(err)
continue
}
log.Println("Removed : " + object.Key)
}
}
Since this question was asked, the AWS GoLang lib for S3 has received some new methods in S3 Manager to handle this task (in response to #Itachi's pr).
See Github record: https://github.com/aws/aws-sdk-go/issues/448#issuecomment-309078450
Here is their example in v1: https://github.com/awsdocs/aws-doc-sdk-examples/blob/main/go/s3/DeleteObjects/DeleteObjects.go#L36
To get "wildcard matching" on paths inside the bucket, add the Prefix param to the example's ListObjectsInput call, as shown here:
iter := s3manager.NewDeleteListIterator(svc, &s3.ListObjectsInput{
Bucket: bucket,
Prefix: aws.String("somePathString"),
})
A bit late in the game, but since I was having the same problem, I created a small pkg that you can copy to your code base and import as needed.
func ListKeysInPrefix(s s3iface.S3API, bucket, prefix string) ([]string, error) {
res, err := s.Client.ListObjectsV2(&s3.ListObjectsV2Input{
Bucket: aws.String(bucket),
Prefix: aws.String(prefix),
})
if err != nil {
return []string{}, err
}
var keys []string
for _, key := range res.Contents {
keys = append(keys, *key.Key)
}
return keys, nil
}
func createDeleteObjectsInput(keys []string) *s3.Delete {
rm := []*s3.ObjectIdentifier{}
for _, key := range keys {
rm = append(rm, &s3.ObjectIdentifier{Key: aws.String(key)})
}
return &s3.Delete{Objects: rm, Quiet: aws.Bool(false)}
}
func DeletePrefix(s s3iface.S3API, bucket, prefix string) error {
keys, err := s.ListKeysInPrefix(bucket, prefix)
if err != nil {
panic(err)
}
_, err = s.Client.DeleteObjects(&s3.DeleteObjectsInput{
Bucket: aws.String(bucket),
Delete: s.createDeleteObjectsInput(keys),
})
if err != nil {
return err
}
return nil
}
So, in the case you have a bucket called "somebucket" with the following structure: s3://somebucket/foo/some-prefixed-folder/bar/test.txt and wanted to delete from some-prefixed-folder onwards, usage would be:
func main() {
// create your s3 client here
// client := ....
err := DeletePrefix(client, "somebucket", "some-prefixed-folder")
if err != nil {
panic(err)
}
}
This implementation only allows to delete a maximum of 1000 entries from the given prefix due ListObjectsV2 implementation - but it is paginated, so it's a matter of adding the functionality to keep refreshing results until results are < 1000.

Resources