go-elasticsearch fetch all documents - go

I am trying to use go-elasticsearch to fetch all data from Elasticsearch. I am importing:
"github.com/elastic/go-elasticsearch/v8"
"github.com/elastic/go-elasticsearch/v8/esapi"
So far I have written this code:
var r map[string]interface{}

cfg := elasticsearch.Config{
	Addresses: []string{
		fmt.Sprint(viper.Get("ELASTICSEARCH_URL")),
	},
}
es, err := elasticsearch.NewClient(cfg)
if err != nil {
	log.Fatalf("Error creating the client: %s", err)
}

res, err := es.Search(
	es.Search.WithContext(context.Background()),
	es.Search.WithIndex(index_name),
	es.Search.WithTrackTotalHits(true),
	es.Search.WithPretty(),
	es.Search.WithFrom(0),
	es.Search.WithSize(1000),
)
if err != nil {
	log.Printf("Error getting response: %s", err)
}
defer res.Body.Close()

if res.IsError() {
	var e map[string]interface{}
	if err := json.NewDecoder(res.Body).Decode(&e); err != nil {
		log.Printf("Error parsing the response body: %s", err)
	} else {
		// Print the response status and error information.
		log.Printf("[%s] %s: %s",
			res.Status(),
			e["error"].(map[string]interface{})["type"],
			e["error"].(map[string]interface{})["reason"],
		)
	}
}

if err := json.NewDecoder(res.Body).Decode(&r); err != nil {
	log.Printf("Error parsing the response body: %s", err)
}

// Print the response status, number of results, and request duration.
log.Printf(
	"[%s] %d hits; took: %dms",
	res.Status(),
	int(r["hits"].(map[string]interface{})["total"].(map[string]interface{})["value"].(float64)),
	int(r["took"].(float64)),
)

fmt.Println("length", len(r["hits"].(map[string]interface{})["hits"].([]interface{})))

domain := fmt.Sprint(viper.Get("DOMAIN"))

for _, hit := range r["hits"].(map[string]interface{})["hits"].([]interface{}) {
	doc := hit.(map[string]interface{})
	source := doc["_source"]
}
I want to fetch all documents from this index, but I won't know in advance how many documents the index contains. How do I do that? By default, the maximum number of documents that can be fetched in a single request is 10,000.

You can do that by scrolling through the search results.
Prefer not to scroll unless you need to, as scrolling is a costly operation.
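For example, here is a rough sketch of a scroll loop with go-elasticsearch v8; the helper name, page size, and one-minute keep-alive are placeholders, and most error handling is trimmed:

func fetchAll(es *elasticsearch.Client, indexName string) ([]interface{}, error) {
	var all []interface{}

	// Open the initial search and ask Elasticsearch to keep a scroll
	// context alive for one minute.
	res, err := es.Search(
		es.Search.WithContext(context.Background()),
		es.Search.WithIndex(indexName),
		es.Search.WithSize(1000),
		es.Search.WithScroll(time.Minute),
	)
	if err != nil {
		return nil, err
	}

	for {
		var r map[string]interface{}
		err := json.NewDecoder(res.Body).Decode(&r)
		res.Body.Close()
		if err != nil {
			return nil, err
		}

		hits := r["hits"].(map[string]interface{})["hits"].([]interface{})
		if len(hits) == 0 {
			break // no more pages
		}
		all = append(all, hits...)

		// Continue the scroll with the ID returned by the previous page.
		scrollID := r["_scroll_id"].(string)
		res, err = es.Scroll(
			es.Scroll.WithScrollID(scrollID),
			es.Scroll.WithScroll(time.Minute),
		)
		if err != nil {
			return nil, err
		}
	}

	// (A complete version would also release the scroll context with
	// es.ClearScroll once it is done.)
	return all, nil
}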

Related

XACK is not deleting the message, even if it is processed successfully?

I am trying to implement a Redis stream, where we have a producer:
package producer

import (
	"RedisStream/models"
	"encoding/json"
	"fmt"

	"github.com/garyburd/redigo/redis"
)

type Producer struct {
	streamName string
}

func NewProducer(streamName string) *Producer {
	return &Producer{streamName: streamName}
}

func (p *Producer) WriteEvents(conn redis.Conn, key string) {
	// Create a new struct
	employee := models.Employee{
		Name:     "ashutosh",
		Employer: "self-employee",
	}
	// Convert struct to JSON
	e, _ := json.Marshal(employee)
	// Send key and value to Redis stream
	_, err := conn.Do("XADD", p.streamName, "*", key, e)
	if err != nil {
		fmt.Println(err)
	}
	fmt.Println("Successfully sent data to Redis stream")
}
Then I have implemented a consumer:
func (c *Consumer) ReadEventsCons1() {
	// Connect to Redis
	conn, err := redis.Dial("tcp", ":6379")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer conn.Close()

	for {
		// Read key and value from Redis stream
		reply, err := conn.Do("XREADGROUP", "GROUP", c.groupName[0], "ashu", "COUNT", "1", "STREAMS", c.streamName, ">")
		vs, err := redis.Values(reply, err)
		if err != nil {
			if errors.Is(err, redis.ErrNil) {
				continue
			}
			fmt.Printf("Error: %+v", err)
		}

		// Get the first and only value in the array since we're only
		// reading from one stream "some-stream-name" here.
		vs, err = redis.Values(vs[0], nil)
		if err != nil {
			fmt.Printf("Error: %+v", err)
		}

		// Ignore the stream name as the first value as we already have
		// that in hand! Just get the second value which is guaranteed to
		// exist per the docs, and parse it as some stream entries.
		res, err := entries(vs[1], nil)
		if err != nil {
			fmt.Errorf("error parsing entries: %w", err)
		}

		for _, val := range res {
			for k, v := range val.Fields {
				empl := &models.Employee{}
				_ = json.Unmarshal(v, empl)
				fmt.Printf("From Consumer Ashu: Key: %s and val: %+v \n", k, empl)
			}
			reply, err := redis.Int(conn.Do("XACK", c.streamName, c.groupName[0], val.ID))
			if reply != 1 {
				fmt.Printf("failed to ack: err: %+v", err)
			}
		}
	}
}
Once a consumer from a consumer group has successfully processed a message, I send an acknowledgement to Redis. But the messages still reside in the Redis stream: after running
XLEN streamName
I can see the length keeps growing. This may create a memory problem, since the messages reside there in perpetuity. Is there an intelligent way to handle this issue?
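Note that XACK only removes an entry from the consumer group's pending entries list; it never deletes the entry from the stream itself, which is why XLEN keeps growing. The stream has to be trimmed separately. A rough sketch with redigo, reusing the Producer type from above (the 10000 threshold is purely illustrative):

// Cap the stream at roughly maxLen entries while producing; Redis trims the
// oldest entries as new ones are added ("~" allows approximate trimming,
// which is cheaper than an exact one).
func (p *Producer) WriteEventsCapped(conn redis.Conn, key string, value []byte, maxLen int) error {
	_, err := conn.Do("XADD", p.streamName, "MAXLEN", "~", maxLen, "*", key, value)
	return err
}

// Alternatively, trim explicitly, e.g. from a periodic maintenance job.
func (p *Producer) TrimStream(conn redis.Conn, maxLen int) error {
	_, err := conn.Do("XTRIM", p.streamName, "MAXLEN", "~", maxLen)
	return err
}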

Go Elasticsearch: bulk-insert incoming data from Pulsar

I have to use the go-elasticsearch library to bulk-insert data coming in from Pulsar, but I have a problem.
Pulsar sends the data in partial batches of 1000 records. When I insert them into Elasticsearch, I sometimes get the error attached below, which causes data loss. Thanks for any answer.
ERROR: circuit_breaking_exception: [parent] Data too large, data for [indices:data/write/bulk[s]] would be [524374312/500mb], which is larger than the limit of [510027366/486.3mb], real usage: [524323448/500mb], new bytes reserved: [50864/49.6kb], usages [request=0/0b, fielddata=160771183/153.3mb, in_flight_requests=50864/49.6kb, model_inference=0/0b, eql_sequence=0/0b, accounting=6898128/6.5mb]
This is the bulk-indexing code:
func InsertElastic(y []models.CP, ElasticStruct *config.ElasticStruct) {
	fmt.Println("------------------")

	bi, err := esutil.NewBulkIndexer(esutil.BulkIndexerConfig{
		Index:      enum.IndexName,
		Client:     ElasticStruct.Client,
		FlushBytes: 10e+6,
	})
	if err != nil {
		panic(err)
	}

	// i counts successfully indexed documents, x counts documents received from Pulsar.
	var i, x int

	start := time.Now().UTC()

	for _, cp := range y {
		data, err := json.Marshal(cp)
		if err != nil {
			panic(err)
		}

		err = bi.Add(
			context.Background(),
			esutil.BulkIndexerItem{
				Action: "index",
				Body:   bytes.NewReader(data),
				OnSuccess: func(ctx context.Context, item esutil.BulkIndexerItem, res esutil.BulkIndexerResponseItem) {
					i++
				},
				OnFailure: func(ctx context.Context, item esutil.BulkIndexerItem, res esutil.BulkIndexerResponseItem, err error) {
					if err != nil {
						log.Printf("ERROR: %s", err)
					} else {
						log.Printf("ERROR: %s: %s", res.Error.Type, res.Error.Reason)
					}
				},
			},
		)
		if err != nil {
			log.Fatalf("Unexpected error: %s", err)
		}

		x++
	}

	if err := bi.Close(context.Background()); err != nil {
		log.Fatalf("Unexpected error: %s", err)
	}

	dur := time.Since(start)

	fmt.Println(dur)
	fmt.Println("Success writing data to elastic : ", i)
	fmt.Println("Success incoming data from pulsar : ", x)
	fmt.Println("Difference : ", x-i)
	fmt.Println("Now : ", time.Now().UTC().String())

	if i < x {
		fmt.Println("FATAL")
	}

	fmt.Println("------------------")
}
TL;DR: it seems like you do not have enough JVM heap on your node.
You are hitting a circuit breaker that prevents Elasticsearch from going out of memory (OOM).
Solution(s):
Increase the JVM heap; the Elasticsearch documentation on sizing your nodes will help here.
Send smaller bulk requests.
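For the second option, a rough sketch using the same esutil bulk indexer and identifiers as the question's code; the worker count, flush size, and flush interval below are illustrative values, not tuned recommendations. Smaller and less concurrent flushes keep each bulk request well below the circuit-breaker limit:

bi, err := esutil.NewBulkIndexer(esutil.BulkIndexerConfig{
	Index:         enum.IndexName,
	Client:        ElasticStruct.Client,
	NumWorkers:    2,                // fewer bulk requests in flight at once
	FlushBytes:    1e+6,             // flush roughly every 1 MB instead of 10 MB
	FlushInterval: 10 * time.Second, // also flush periodically
})
if err != nil {
	panic(err)
}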

rows.Next() halts after some number of rows

I'm a newbie in Golang, so this may be simple for professionals, but I got stuck with no idea what to do next.
I'm writing a migration app that extracts some data from an Oracle DB and, after some conversion, inserts it into Postgres one by one.
The same query run natively in the DB console returns about 400k rows and takes about 13 seconds.
Extracting the data from Oracle with rows.Next() shows some strange behavior: the first 25 rows are extracted quickly enough, then there is a pause of a few seconds, then another 25 rows, and so on until it pauses "forever".
Here is the function:
func GetHrTicketsFromOra() (*sql.Rows, error) {
	rows, err := oraDB.Query("select id,STATE_ID,REMEDY_ID,HEADER,CREATE_DATE,TEXT,SOLUTION,SOLUTION_USER_LOGIN,LAST_SOLUTION_DATE from TICKET where SOLUTION_GROUP_ID = 5549")
	if err != nil {
		println("Error while getting rows from Ora")
		return nil, err
	}
	log.Println("Finished legacy tickets export")
	return rows, err
}
And here is where I export the data:
func ConvertRows(rows *sql.Rows, c chan util.ArchTicket, m chan int) error {
	log.Println("Conversion start")
	defer func(rows *sql.Rows) {
		err := rows.Close()
		if err != nil {
			log.Println("ORA connection closed", err)
			return
		}
	}(rows)

	for rows.Next() {
		log.Println("Reading the ticket")
		ot := util.OraTicket{}
		at := util.ArchTicket{}
		err := rows.Scan(&ot.ID, &ot.StateId, &ot.RemedyId, &ot.Header, &ot.CreateDate, &ot.Text, &ot.Solution, &ot.SolutionUserLogin, &ot.LastSolutionDate)
		if err != nil {
			log.Println("Error while reading row", err)
			return err
		}
		at = convertLegTOArch(ot)
		c <- at
	}
	if err := rows.Err(); err != nil {
		log.Println("Error while reading row", err)
		return err
	}
	m <- 1
	return nil
}
UPD: I use the "github.com/sijms/go-ora/v2" driver.
UPD2: It seems like the root cause of the problem is the TEXT and SOLUTION fields of the result rows. They are varchar and can be quite big. Removing them from the query changes the execution time from 13 s to 258 ms. But I still have no idea what to do about that.
UPD3: Minimal reproducible example:
package main

import (
	"database/sql"
	"log"

	_ "github.com/sijms/go-ora/v2"
)

var oraDB *sql.DB

var con = "oracle://login:password@ora_db:1521/database"

func InitOraDB(dataSourceName string) error {
	var err error
	oraDB, err = sql.Open("oracle", dataSourceName)
	if err != nil {
		return err
	}
	return oraDB.Ping()
}

func GetHrTicketsFromOra() {
	var ot string
	rows, err := oraDB.Query("select TEXT from TICKET where SOLUTION_GROUP_ID = 5549")
	if err != nil {
		println("Error while getting rows from Ora")
	}
	for rows.Next() {
		log.Println("Reading the ticket")
		err := rows.Scan(&ot)
		if err != nil {
			log.Println("Reading failed", err)
		}
		log.Println("Read:")
	}
	log.Println("Finished legacy tickets export")
}

func main() {
	err := InitOraDB(con)
	if err != nil {
		log.Println("Error connection Ora")
	}
	GetHrTicketsFromOra()
}

Kafka retries many times when I download a large file

I am a newbie with Kafka; I am trying to build a service that sends mail with attached files.
Execution flow:
Kafka receives a message telling the service to send a mail.
A download function fetches each file from its URL, scales the image, and saves the file.
When sending the mail, the files are read from the folder and attached to the form.
Issues:
When I send a mail with large files, Kafka retries many times and I receive many copies of the mail.
Kafka error: "kafka server: The provided member is not known in the current generation"
I adjusted MaxProcessingTime, but when I test a single mail with a large file it still works fine.
Kafka setup: 1 broker, 3 consumers.
func (s *customerMailService) SendPODMail() error {
	filePaths, err := DownloadFiles(podURLs, orderInfo.OrderCode)
	if err != nil {
		countRetry := 0
		for countRetry <= NUM_OF_RETRY {
			filePaths, err = DownloadFiles(podURLs, orderInfo.OrderCode)
			if err == nil {
				break
			}
			countRetry++
		}
	}
	err = s.sendMailService.Send(ctx, orderInfo.CustomerEmail, tmsPod, content, filePaths)
}
The download function:
func DownloadFiles(files []string, orderCode string) ([]string, error) {
	var filePaths []string
	err := os.Mkdir(tempDir, 0750)
	if err != nil && !os.IsExist(err) {
		return nil, err
	}
	tempDirPath := tempDir + "/" + orderCode
	err = os.Mkdir(tempDirPath, 0750)
	if err != nil && !os.IsExist(err) {
		return nil, err
	}
	for _, fileUrl := range files {
		fileUrlParsed, err := url.ParseRequestURI(fileUrl)
		if err != nil {
			logrus.WithError(err).Infof("Pod url is invalid %s", orderCode)
			return nil, err
		}
		extFile := filepath.Ext(fileUrlParsed.Path)
		dir, err := os.MkdirTemp(tempDirPath, "tempDir")
		if err != nil {
			return nil, err
		}
		f, err := os.CreateTemp(dir, "tmpfile-*"+extFile)
		if err != nil {
			return nil, err
		}
		defer f.Close()
		response, err := http.Get(fileUrl)
		if err != nil {
			return nil, err
		}
		defer response.Body.Close()
		contentTypes := response.Header["Content-Type"]
		isTypeAllow := false
		for _, contentType := range contentTypes {
			if contentType == "image/png" || contentType == "image/jpeg" {
				isTypeAllow = true
			}
		}
		if !isTypeAllow {
			logrus.WithError(err).Infof("Pod image type is invalid %s", orderCode)
			return nil, errors.New("Pod image type is invalid")
		}
		decodedImg, err := imaging.Decode(response.Body)
		if err != nil {
			return nil, err
		}
		resizedImg := imaging.Resize(decodedImg, 1024, 0, imaging.Lanczos)
		imaging.Save(resizedImg, f.Name())
		filePaths = append(filePaths, f.Name())
	}
	return filePaths, nil
}
The send-mail function:
func (s *tikiMailService) SendFile(ctx context.Context, receiver string, templateCode string, data interface{}, filePaths []string) error {
	path := "/v1/emails"
	fullPath := fmt.Sprintf("%s%s", s.host, path)
	formValue := &bytes.Buffer{}
	writer := multipart.NewWriter(formValue)
	_ = writer.WriteField("template", templateCode)
	_ = writer.WriteField("to", receiver)
	if data != nil {
		b, err := json.Marshal(data)
		if err != nil {
			return errors.Wrapf(err, "Cannot marshal mail data to json with object %+v", data)
		}
		_ = writer.WriteField("params", string(b))
	}
	for _, filePath := range filePaths {
		part, err := writer.CreateFormFile(filePath, filepath.Base(filePath))
		if err != nil {
			return err
		}
		pipeReader, pipeWriter := io.Pipe()
		go func() {
			defer pipeWriter.Close()
			file, err := os.Open(filePath)
			if err != nil {
				return
			}
			defer file.Close()
			io.Copy(pipeWriter, file)
		}()
		io.Copy(part, pipeReader)
	}
	err := writer.Close()
	if err != nil {
		return err
	}
	request, err := http.NewRequest("POST", fullPath, formValue)
	if err != nil {
		return err
	}
	request.Header.Set("Content-Type", writer.FormDataContentType())
	resp, err := s.doer.Do(request)
	if err != nil {
		return errors.Wrap(err, "Cannot send request to send email")
	}
	defer resp.Body.Close()
	b, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		return err
	}
	if resp.StatusCode != http.StatusOK {
		return errors.New(fmt.Sprintf("Send email with code %s error: status code %d, response %s",
			templateCode, resp.StatusCode, string(b)))
	} else {
		logrus.Infof("Send email with attachment, code %s success with response %s, box-code %v", templateCode, string(b), filePaths)
	}
	return nil
}
Thanks.
My team found my problem: when I redeploy the k8s pods, that leads to a conflict over the leader partition, causing a rebalance. The consumer then tries to process the messages remaining in the pod's buffer again.
Solution: I don't fetch many messages into the buffer; I get just one message and process it, via the config:
ChannelBufferSize = 0
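A rough sketch of that consumer configuration, assuming the Shopify/sarama client (which the MaxProcessingTime and ChannelBufferSize settings suggest); the five-minute processing time is an illustrative value:

import (
	"time"

	"github.com/Shopify/sarama"
)

func newConsumerConfig() *sarama.Config {
	cfg := sarama.NewConfig()
	// Do not buffer messages ahead of processing: fetch one, process it,
	// commit, then fetch the next.
	cfg.ChannelBufferSize = 0
	// Allow enough time per message for large downloads and attachments.
	cfg.Consumer.MaxProcessingTime = 5 * time.Minute
	return cfg
}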
Example of a leader-partition conflict:
consumer A and consumer B start up at the same time
consumer A registers itself as leader and owns the topic with all partitions
consumer B registers itself as leader, then begins to rebalance and takes over all partitions
consumer A rebalances and obtains all partitions, but cannot consume because its memberId is old and it needs a new one
consumer B rebalances again and owns the topic with all partitions, but they are already held by consumer A
My two cents: in the case of very big attachments, the consumer takes quite a lot of time to read the file and send it as an attachment.
This increases the amount of time between two poll() calls. If that time is greater than max.poll.interval.ms, the consumer is considered failed and the partition offset is not committed. As a result, the message is processed again and, eventually, if by chance the execution time stays below the poll interval, the offset is committed. The effect is multiple emails being sent.
Try increasing max.poll.interval.ms on the consumer side.

Loading CSV file into BigQuery after os.Create() doesn't load data

I'm trying to run the following flow:
Get data from somewhere
Create a new local CSV file and write the data into that file
Upload the CSV to BigQuery
Delete the local file
But it seems to load empty data.
This is the code:
func (c *Client) Do(ctx context.Context) error {
	bqClient, err := bigquerypkg.NewBigQueryUtil(ctx, "projectID", "datasetID")
	if err != nil {
		return err
	}
	data, err := c.GetSomeData(ctx)
	if err != nil {
		return err
	}
	file, err := os.Create("example.csv")
	if err != nil {
		return err
	}
	defer file.Close()
	// the file also needs to be deleted afterwards
	writer := csv.NewWriter(file)
	defer writer.Flush()
	timestamp := time.Now().UTC().Format("2006-01-02 03:04:05.000000000")
	for _, d := range data {
		csvRow := []string{
			d.ID,
			d.Name,
			timestamp,
		}
		err = writer.Write(csvRow)
		if err != nil {
			log.Printf("error writing data to CSV: %v\n", err)
		}
	}
	source := bigquery.NewReaderSource(file)
	source.Schema = bigquery.Schema{
		{Name: "id", Type: bigquery.StringFieldType},
		{Name: "name", Type: bigquery.StringFieldType},
		{Name: "createdAt", Type: bigquery.TimestampFieldType},
	}
	if _, err = bqClient.LoadCsv(ctx, "tableID", source); err != nil {
		return err
	}
	return nil
}
LoadCsv() looks like this:
func (c *Client) LoadCsv(ctx context.Context, tableID string, src bigquery.LoadSource) (string, error) {
	loader := c.bigQueryClient.Dataset(c.datasetID).Table(tableID).LoaderFrom(src)
	loader.WriteDisposition = bigquery.WriteTruncate

	job, err := loader.Run(ctx)
	if err != nil {
		return "", err
	}
	status, err := job.Wait(ctx)
	if err != nil {
		return job.ID(), err
	}
	if status.Err() != nil {
		return job.ID(), fmt.Errorf("job completed with error: %v", status.Err())
	}
	return job.ID(), nil
}
After running this, BigQuery does create the schema, but with no data.
If I change os.Create() to os.Open(), and the file already exists, everything works. It's as if, when loading the CSV, the file data has not yet been written (?)
What's the reason?
The problem I see here is that you don't rewind the file handle's cursor to the beginning of the file. Thus, the next read will be at the end of the file and will be a 0-byte read. That explains why it seems like there's no content in the file.
https://pkg.go.dev/os#File.Seek can handle this for you.
Actually, the Flush is not relevant, because you're using the same file handle to read the file as you did to write it, so you'll see your own written bytes even without a flush. This would not be the case if the file was opened by a different process or was reopened.
Edit: the OP claims the flush was necessary in their case, and I cannot provide evidence to disagree. Flush will not hurt things either.
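Applied to the question's code, a rough sketch of the fix (reusing the writer, file, and source variables from above, plus the io package) is to flush the CSV writer and rewind the file before handing it to the loader:

writer.Flush()
if err := writer.Error(); err != nil {
	return err
}
// Rewind to the beginning so the loader reads the bytes just written.
if _, err := file.Seek(0, io.SeekStart); err != nil {
	return err
}
source := bigquery.NewReaderSource(file)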
Demonstration:
package main

import (
	"fmt"
	"io"
	"os"
)

func main() {
	f, err := os.CreateTemp("", "data.csv")
	if err != nil {
		panic(err)
	} else {
		defer f.Close()
		defer os.Remove(f.Name())
	}

	fmt.Fprintf(f, "hello, world")

	fmt.Fprintln(os.Stderr, "Before rewind: ")
	if _, err := io.Copy(os.Stderr, f); err != nil {
		panic(err)
	}

	f.Seek(0, io.SeekStart)

	fmt.Fprintln(os.Stderr, "\nAfter rewind: ")
	if _, err := io.Copy(os.Stderr, f); err != nil {
		panic(err)
	}
	fmt.Fprintln(os.Stderr, "\n")
}
% go run t.go
Before rewind:
After rewind:
hello, world
