I am importing data into Neo4j using neoism, and large imports are slow: 1000 nodes take about 8 s. Here is the part of the code that imports 100 nodes.
The code is quite basic and needs improvement. Can anyone help me improve it?
var wg sync.WaitGroup
for _, itemProps := range items {
wg.Add(1)
go func(i interface{}) {
s := time.Now()
cypher := neoism.CypherQuery{
Statement: fmt.Sprintf(`
CREATE (%v)
SET i = {Props}
RETURN i
`, ItemLabel),
Parameters: neoism.Props{"Props": i},
}
if err := database.ExecuteCypherQuery(cypher); err != nil {
utils.Error(fmt.Sprintf("error ImportItemsNeo4j! %v", err))
wg.Done()
return
}
utils.Info(fmt.Sprintf("import Item success! took: %v", time.Since(s)))
wg.Done()
}(itemProps)
}
wg.Wait()
AFAIK neoism still uses the old APIs; you should use cq instead: https://github.com/go-cq/cq
You should also batch your creates:
either send multiple statements per request, e.g. 100 statements per request,
or, even better, send a list of parameters to a single Cypher query,
e.g. where {data} is a list such as [{id:1},{id:2},...]:
UNWIND {data} as props
CREATE (n:Label) SET n = props
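The batching advice above can be sketched in Go as follows. This is only a minimal illustration, not neoism's or cq's API: the helper name batches, the label Item, and the batch size of 100 are assumptions, and the actual call that sends the statement to Neo4j is left as a comment because it depends on the driver you choose.

```go
package main

import "fmt"

// unwindCreate creates one node per element of the {data} parameter.
// The label "Item" is a placeholder for your own label.
const unwindCreate = `UNWIND {data} AS props CREATE (n:Item) SET n = props`

// batches splits items into consecutive slices of at most size elements.
// (Hypothetical helper, not part of any Neo4j driver.)
func batches(items []map[string]interface{}, size int) [][]map[string]interface{} {
	var out [][]map[string]interface{}
	for size > 0 && len(items) > 0 {
		n := size
		if len(items) < n {
			n = len(items)
		}
		out = append(out, items[:n])
		items = items[n:]
	}
	return out
}

func main() {
	items := make([]map[string]interface{}, 250)
	for i := range items {
		items[i] = map[string]interface{}{"id": i}
	}
	for _, b := range batches(items, 100) {
		// With a real driver, send unwindCreate here with the
		// parameter map {"data": b}: one request per batch.
		fmt.Printf("%q with %d nodes\n", unwindCreate, len(b))
	}
}
```

With 250 items and a batch size of 100 this issues three requests instead of 250, which is usually where most of the time goes.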
I am retrieving payloads from a REST API, which I then want to insert into a Snowflake table.
My current process is to use the Snowflake DB connection and iterate over a slice of structs (which contain my data from the API). However, this doesn't seem to be efficient or optimal. Everything loads successfully, but I am trying to figure out how to optimize a large number of inserts, potentially thousands of records. Perhaps there needs to be a separate channel for insertions instead of inserting synchronously?
General code flow:
import (
"database/sql"
"fmt"
"sync"
"time"
_ "github.com/snowflakedb/gosnowflake"
)
func ETL() {
var wg sync.WaitGroup
ch := make(chan []*Response)
defer close(ch)
// Create requests to API
for _, req := range requests {
// All of this flows fine without issue
wg.Add(1)
go func(request Request) {
defer wg.Done()
resp, _ := request.Get()
ch <- resp
}(request)
}
// Connect to snowflake
// This is not a problem
connString := fmt.Sprintf(config...)
db, _ := sql.Open("snowflake", connString)
defer db.Close()
// Collect responses from our channel
results := make([][]*Response, len(requests))
for i := range results {
results[i] = <-ch
for _, res := range results[i] {
// transform is just a function to flatten my structs into entries that I would like to insert into Snowflake. This is not a bottleneck.
entries := transform(res)
// Load the data into snowflake, passing the entries that have been
// Flattened as well as the db connection
if err := load(entries, db); err != nil {
fmt.Println(err)
}
}
}
}
type Entry struct {
field1 string
field2 string
statusCode int
}
func load(entries []*Entry, db *sql.DB) error {
start := time.Now()
for i, entry := range entries {
fmt.Printf("Loading entry %d\n", i)
stmt := `INSERT INTO tbl (field1, field2, updated_date, status_code)
VALUES (?, ?, CURRENT_TIMESTAMP(), ?)`
_, err := db.Exec(stmt, entry.field1, entry.field2, entry.statusCode)
if err != nil {
fmt.Println(err)
return err
}
}
fmt.Println("Load time: ", time.Since(start))
return nil
}
Instead of INSERTing individual rows, collect the rows in files; each time you push one of these files to S3/GCS/Azure, it will be loaded immediately.
I wrote a post detailing these steps:
https://medium.com/snowflake/lightweight-batch-streaming-to-snowflake-with-gcp-pub-sub-1790ab76da31
With the appropriate storage integration, this would auto-ingest the files:
create pipe temp.public.meetup202011_pipe
auto_ingest = true
integration = temp_meetup202011_pubsub_int
as
copy into temp.public.meetup202011_rsvps
from @temp_fhoffa_gcs_meetup;
Also check these considerations:
https://www.snowflake.com/blog/best-practices-for-data-ingestion/
Soon: if you want to send individual rows and ingest them into Snowflake in real time, that's in development (https://www.snowflake.com/blog/snowflake-streaming-now-hiring-help-design-and-build-the-future-of-big-data-and-stream-processing/).
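As a rough illustration of the "collect rows in files" step, here is a minimal Go sketch that serializes a batch of entries as CSV. The Entry fields mirror the question; writeBatch and renderBatch are hypothetical helper names, and uploading the resulting file to the bucket behind your stage is left out because it depends on your cloud provider.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// Entry mirrors the struct from the question, with exported fields.
type Entry struct {
	Field1     string
	Field2     string
	StatusCode int
}

// writeBatch writes one CSV row per entry to w.
func writeBatch(w io.Writer, entries []Entry) error {
	cw := csv.NewWriter(w)
	for _, e := range entries {
		row := []string{e.Field1, e.Field2, strconv.Itoa(e.StatusCode)}
		if err := cw.Write(row); err != nil {
			return err
		}
	}
	cw.Flush()
	return cw.Error()
}

// renderBatch returns the CSV for a batch as a string.
func renderBatch(entries []Entry) (string, error) {
	var sb strings.Builder
	err := writeBatch(&sb, entries)
	return sb.String(), err
}

func main() {
	out, err := renderBatch([]Entry{{"a", "b", 200}, {"c", "d", 404}})
	if err != nil {
		fmt.Println(err)
		return
	}
	// In real code, write this to a file and upload it to the staged bucket.
	fmt.Print(out)
}
```

Writing to an io.Writer keeps the same function usable for a local file, an in-memory buffer, or an upload stream.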
I have an object I'm using to make paged SQL queries that allows for the queries to be run asynchronously:
type PagedQuery[T any] struct {
Results chan []*T
Errors chan error
Done chan error
Quit chan error
client *sql.DB
}
func NewPagedQuery[T any](client *sql.DB) *PagedQuery[T] {
return &PagedQuery[T]{
Results: make(chan []*T, 1),
Errors: make(chan error, 1),
Done: make(chan error, 1),
Quit: make(chan error, 1),
client: client,
}
}
func (paged *PagedQuery[T]) requestAsync(ctx context.Context, queries ...*Query) {
conn, err := paged.client.Conn(ctx)
if err != nil {
paged.Errors <- err
return
}
defer func() {
conn.Close()
paged.Done <- nil
}()
for _, query := range queries {
select {
case <-ctx.Done():
return
case <-paged.Quit:
return
default:
}
rows, err := conn.QueryContext(ctx, query.String, query.Arguments...)
if err != nil {
paged.Errors <- err
return
}
data, err := sql.ReadRows[T](rows)
if err != nil {
paged.Errors <- err
return
}
paged.Results <- data
}
}
I'm trying to test this code, specifically the cancellation part. My test code looks like this:
svc, mock := createServiceMock("TEST_DATABASE", "TEST_SCHEMA")
mock.ExpectQuery(regexp.QuoteMeta("TEST QUERY")).
WithArgs(...).
WillReturnRows(mock.NewRows([]string{"t", "v", "o", "c", "h", "l", "vw", "n"}))
ctx, cancel := context.WithCancel(context.Background())
go svc.requestAsync(ctx, query1, query2, query3, query4)
time.Sleep(50 * time.Millisecond)
cancel()
results := make([]data, 0)
loop:
for {
select {
case <-query.Done:
break loop
case err := <-query.Errors:
Expect(err).ShouldNot(HaveOccurred())
case r := <-query.Results:
results = append(results, r...)
}
}
Expect(results).Should(BeEmpty())
Expect(mock.ExpectationsWereMet()).ShouldNot(HaveOccurred()) // fails here
The issue I'm running into is that this test occasionally fails at the line indicated by my comment, because when cancel() is called, execution isn't guaranteed to be at the select statement where I check for <-ctx.Done() or <-paged.Quit. Execution could be anywhere in the loop up to the point where I send the results to the Results channel. Except that doesn't make sense, because execution should block until I receive data from the Results channel, which I don't do until after I call cancel(). Furthermore, I'm relying on the sqlmock package for SQL testing, which doesn't allow for any sort of fuzzy checking where SQL queries are concerned. Why am I getting this failure and how can I fix it?
My issue was the result of my own misunderstanding of Go channels. I assumed that creating the channel with make(chan []*T, 1) meant that a send would block as soon as the channel was full (i.e. when it contained a single item), but that is not the case. Rather, a send blocks only when the buffer is already full, so the first send returned immediately and the loop ran on to the next query before anything had been received. So, by modifying Results like this:
func NewPagedQuery[T any](client *sql.DB) *PagedQuery[T] {
return &PagedQuery[T]{
Results: make(chan []*T), // Remove buffer here
Errors: make(chan error, 1),
Done: make(chan error, 1),
Quit: make(chan error, 1),
client: client,
}
}
I can ensure that each send blocks until the data has been received. This one change fixed all the problems with testing.
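The difference between the two channel types can be seen with a small standalone sketch. The helper trySend is a hypothetical name; it uses select with a default case to report whether a send would go through without blocking.

```go
package main

import "fmt"

// trySend attempts a non-blocking send and reports whether it succeeded.
// A false result means a plain send would have blocked at this point.
func trySend(ch chan int, v int) bool {
	select {
	case ch <- v:
		return true
	default:
		return false
	}
}

func main() {
	buffered := make(chan int, 1)
	unbuffered := make(chan int)

	fmt.Println(trySend(buffered, 1))   // true: the buffer had a free slot
	fmt.Println(trySend(buffered, 2))   // false: the buffer is now full
	fmt.Println(trySend(unbuffered, 1)) // false: no receiver is waiting
}
```

With a buffer of 1 the first send succeeds with no receiver at all, which is why the producer in the question could start the next query before the test had read anything.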
I'm new to Go, so sorry for the silly question in advance!
I'm using Gin framework and want to make multiple queries to the database within the same handler (database/sql + lib/pq)
userIds := []int{}
bookIds := []int{}
var id int
/* Handling first query here */
rows, err := pgClient.Query(getUserIdsQuery)
defer rows.Close()
if err != nil {
return
}
for rows.Next() {
err := rows.Scan(&id)
if err != nil {
return
}
userIds = append(userIds, id)
}
/* Handling second query here */
rows, err = pgClient.Query(getBookIdsQuery)
defer rows.Close()
if err != nil {
return
}
for rows.Next() {
err := rows.Scan(&id)
if err != nil {
return
}
bookIds = append(bookIds, id)
}
I have a couple of questions regarding this code (any improvements and best practices would be appreciated)
1. Does Go properly handle defer rows.Close() in such a case? I mean, I reassign the rows variable later in the code, so will the compiler track both and properly close them at the end of the function?
2. Is it OK to reuse the shared id variable, or should I redeclare it while iterating within the rows.Next() loop?
3. What's the better approach when having even more queries within one handler? Should I have some kind of Writer that accepts a query and a slice and populates it with the ids retrieved?
Thanks.
I've never worked with the go-pg library, so my answer mostly focuses on the other points, which are generic and not specific to Go or go-pg.
Both result sets do get closed: the receiver of a deferred call is evaluated when the defer statement executes, so each defer rows.Close() is bound to the rows value assigned at that point, and reassigning rows later does not affect it. You should, however, check err before deferring rows.Close(), because rows is nil when the query fails. Even so, defining two variables, such as userRows and bookRows, is cleaner.
Although I have not worked with go-pg, I believe you won't need to iterate through the rows and scan each id manually; from a quick look at the documentation, the library seems to provide an API like this:
userIds := []int{}
_, err := pgClient.Query(&userIds, "select id from users where ...", args...)
Regarding your second question, it depends on what you mean by "ok". Since you're doing synchronous iteration, I don't think it would result in bugs, but when it comes to coding style, personally, I wouldn't do this.
I think that the best thing to do in your case is this:
// repo layer
func getUserIds(args whatever) ([]int, error) {...}
// these can be exposed, based on your packaging logic
func getBookIds(args whatever) ([]int, error) {...}
// service layer, or wherever you want to aggregate both queries
func getUserAndBookIds() ([]int, []int, error) {
userIds, err := getUserIds(...)
// potential error handling
bookIds, err := getBookIds(...)
// potential error handling
return userIds, bookIds, nil // you have done err handling earlier
}
I think this code is easier to read/maintain. You won't face the variable reassignment and other issues.
You can take a look at the go-pg documentation for more details on how to improve your query.
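On the deferred rows.Close() question, the behavior can be checked without a database. The sketch below uses a hypothetical fakeRows type that records its name when closed; it stands in for *sql.Rows only to show which values the deferred calls act on.

```go
package main

import "fmt"

// fakeRows stands in for *sql.Rows and records when it is closed.
type fakeRows struct {
	name   string
	closed *[]string
}

func (r *fakeRows) Close() error {
	*r.closed = append(*r.closed, r.name)
	return nil
}

// runQueries reuses one variable for two result sets, as in the question,
// and returns the names of the result sets in the order they were closed.
func runQueries() []string {
	var closed []string
	func() {
		rows := &fakeRows{name: "users", closed: &closed}
		defer rows.Close() // bound to the "users" value right now

		rows = &fakeRows{name: "books", closed: &closed}
		defer rows.Close() // bound to the "books" value
	}()
	return closed
}

func main() {
	fmt.Println(runQueries()) // [books users]
}
```

Both values are closed, in reverse order of the defer statements, because the receiver is evaluated when each defer statement runs rather than when the function returns.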
The objective of my backend service is to process 90 million records, at least 10 million of them per day.
My system config:
RAM: 2000 MB
CPU: 2 cores
What I am doing right now is something like this:
var wg sync.WaitGroup
//length of evs is 4455
for range evs {
wg.Add(1)
go migrate(&wg)
}
wg.Wait()
func migrate(wg *sync.WaitGroup) {
defer wg.Done()
//processing
time.Sleep(time.Second)
}
Without knowing more about the type of work you need to do, your approach seems good. Some things to think about:
Reuse variables and clients in your processing loop; for example, reuse an HTTP client instead of recreating one.
Depending on how your use case needs to handle failures, it might be efficient to use errgroup. It's a convenience wrapper whose Wait returns the first error; when the group is created with errgroup.WithContext, that first error also cancels the shared context, so goroutines that watch the context can stop early, possibly saving you a lot of time.
In the migrate function, be aware of the caveats regarding closures and goroutines.
func main() {
g := new(errgroup.Group)
var urls = []string{
"http://www.someasdfasdfstupidname.com/",
"ftp://www.golang.org/",
"http://www.google.com/",
}
for _, url := range urls {
url := url // https://golang.org/doc/faq#closures_and_goroutines
g.Go(func() error {
resp, err := http.Get(url)
if err == nil {
resp.Body.Close()
}
return err
})
}
fmt.Println("waiting")
if err := g.Wait(); err == nil {
fmt.Println("Successfully fetched all URLs.")
} else {
fmt.Println(err)
}
}
I found a solution. To handle this volume of processing, I limited the number of goroutines to 50 and increased the number of cores from 2 to 5.
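One common way to cap the number of goroutines is a buffered channel used as a counting semaphore. The following is a minimal sketch of that technique, not the poster's actual code: forEachLimited and measure are hypothetical names, and the work function is a no-op standing in for migrate.

```go
package main

import (
	"fmt"
	"sync"
)

// forEachLimited calls work(i) for every i in [0, n) while keeping at most
// limit goroutines in flight. limit must be at least 1.
func forEachLimited(n, limit int, work func(i int)) {
	sem := make(chan struct{}, limit) // counting semaphore
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		sem <- struct{}{} // blocks while limit goroutines are running
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }() // free the slot when done
			work(i)
		}(i)
	}
	wg.Wait()
}

// measure runs n no-op tasks and reports how many ran and the highest
// number that were in flight at the same time.
func measure(n, limit int) (processed, peak int) {
	var mu sync.Mutex
	inFlight := 0
	forEachLimited(n, limit, func(int) {
		mu.Lock()
		inFlight++
		if inFlight > peak {
			peak = inFlight
		}
		processed++
		mu.Unlock()

		// Real processing would happen here.

		mu.Lock()
		inFlight--
		mu.Unlock()
	})
	return processed, peak
}

func main() {
	processed, peak := measure(4455, 50)
	fmt.Println(processed, peak <= 50)
}
```

The loop itself blocks once 50 tasks are running, so memory use stays bounded no matter how long the input is.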
I am getting started with RethinkDB, which I have never used before. I am trying it together with GoRethink, following this tutorial.
To sum up this tutorial, there are two programs:
The first one updates entries infinitely.
for {
var scoreentry ScoreEntry
pl := rand.Intn(1000)
sc := rand.Intn(6) - 2
res, err := r.Table("scores").Get(strconv.Itoa(pl)).Run(session)
if err != nil {
log.Fatal(err)
}
err = res.One(&scoreentry)
scoreentry.Score = scoreentry.Score + sc
_, err = r.Table("scores").Update(scoreentry).RunWrite(session)
}
And the second one receives these changes and logs them.
res, err := r.Table("scores").Changes().Run(session)
var value interface{}
if err != nil {
log.Fatalln(err)
}
for res.Next(&value) {
fmt.Println(value)
}
In the statistics that RethinkDB shows, I can see that there are 1.5K reads and writes per second. But in the console of the second program, I see 1 or 2 changes per second approximately.
Why does this occur? Am I missing something?
This code:
r.Table("scores").Update(scoreentry).RunWrite(session)
Probably doesn't do what you think it does. This attempts to update every document in the table by merging scoreentry into it. This is why the RethinkDB console is showing so many writes per second: every time you run that query it's resulting in thousands of writes.
Usually you want to update documents inside of ReQL, like so:
r.Table("scores").Get(strconv.Itoa(pl)).Update(func(row r.Term) interface{} {
return map[string]interface{}{"Score": row.Field("Score").Add(sc)}
})
If you need to do the update in Go code, though, you can replace just that one document like so:
r.Table("scores").Get(strconv.Itoa(pl)).Replace(scoreentry)
I'm not sure why it is quite that slow; it could be because, by default, each query blocks until the write has been completely flushed. I would first add some kind of instrumentation to see which operation is slow. There are also a couple of ways that you can improve the performance:
Set the Durability of the write using UpdateOpts
_, err = r.Table("scores").Update(scoreentry, r.UpdateOpts{
Durability: "soft",
}).RunWrite(session)
Execute each query in a goroutine to allow your code to execute multiple queries in parallel (you may need to use a pool of goroutines instead but this code is just a simplified example)
for {
go func() {
var scoreentry ScoreEntry
pl := rand.Intn(1000)
sc := rand.Intn(6) - 2
res, err := r.Table("scores").Get(strconv.Itoa(pl)).Run(session)
if err != nil {
log.Fatal(err)
}
err = res.One(&scoreentry)
scoreentry.Score = scoreentry.Score + sc
_, err = r.Table("scores").Update(scoreentry).RunWrite(session)
}()
}