Has anyone used Ruby neo4j-core to mass process data? Specifically, I am looking at taking in about 500k lines from a relational database and inserting them via something like:
Neo4j::Session.current.transaction.query
  .merge(m: { Person: { token: person_token } })
  .merge(i: { IpAddress: { address: ip, country: country,
                           city: city, state: state } })
  .merge(a: { UserToken: { token: token } })
  .merge(r: { Referrer: { url: referrer } })
  .merge(c: { Country: { name: country } })
  .break # This will make sure the query is not reordered
  .create_unique("m-[:ACCESSED_FROM]->i")
  .create_unique("m-[:ACCESSED_FROM]->a")
  .create_unique("m-[:ACCESSED_FROM]->r")
  .create_unique("a-[:ACCESSED_FROM]->i")
  .create_unique("a-[:ACCESSED_FROM]->r")
  .create_unique("i-[:IN]->c")
  .exec
However, doing this locally takes hours on hundreds of thousands of events. So far, I have attempted the following:
* Wrapping Neo4j::Connection in a ConnectionPool and multi-threading it - I did not see much of a speed improvement here.
* Doing tx = Neo4j::Transaction.new and tx.close every 1000 events processed - looking at a TCP dump, I am not sure this actually does what I expected. It makes the exact same requests, with the same frequency, but just gets a different response.
With Neo4j::Transaction I see a POST every time the .query(...).exec is called:
Request: {"statements":[{"statement":"MERGE (m:Person{token: {m_Person_token}}) ...{"m_Person_token":"AAA"...,"resultDataContents":["row","REST"]}]}
Response: {"commit":"http://localhost:7474/db/data/transaction/868/commit","results":[{"columns":[],"data":[]}],"transaction":{"expires":"Tue, 10 May 2016 23:19:25 +0000"},"errors":[]}
With Non-Neo4j::Transactions I see the same POST frequency, but this data:
Request: {"query":"MERGE (m:Person{token: {m_Person_token}}) ... {"m_Person_token":"AAA"..."c_Country_name":"United States"}}
Response: {"columns" : [ ], "data" : [ ]}
(Not sure if that is the intended behavior, but it looks like less data is transmitted via the non-Neo4j::Transaction technique - it's highly possible I am doing something incorrectly.)
Some other ideas I had:
* Post-process into a CSV, SCP it up, and then use the neo4j-import command-line utility (although that seems kinda hacky).
* Combine both of the techniques I tried above.
Has anyone else run into this / have other suggestions?
Ok!
So you're absolutely right. With neo4j-core you can only send one query at a time. With transactions, all you're really getting is the ability to roll back. Neo4j does have a nice HTTP JSON API for transactions which allows you to send multiple Cypher requests in the same HTTP request, but neo4j-core doesn't currently support that (I'm working on a refactor for the next major version which will allow this). So there are a number of options:
You can submit your requests via raw HTTP JSON to the APIs. If you still want to use the Query API, you can use the to_cypher and merge_params methods to get the Cypher and params for that (merge_params is a private method currently, so you'd need to send(:merge_params)).
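For illustration, here is a rough sketch of that approach. The helper below is not part of neo4j-core; the endpoint URL assumes a default local install, and the helper name and 1,000-statement batch size are my own placeholders.

require 'net/http'
require 'json'

# Hypothetical helper: POST a whole batch of statements to Neo4j's
# transactional commit endpoint in a single HTTP request.
def post_statement_batch(queries, uri: URI('http://localhost:7474/db/data/transaction/commit'))
  statements = queries.map do |query|
    {
      statement:  query.to_cypher,
      parameters: query.send(:merge_params) # private method, as noted above
    }
  end

  request = Net::HTTP::Post.new(uri.path, 'Content-Type' => 'application/json',
                                          'Accept' => 'application/json')
  request.body = { statements: statements }.to_json
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end

# Build your Query objects as before, then send them, say, 1,000 at a time:
# queries.each_slice(1000) { |batch| post_statement_batch(batch) }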
You can load via CSV as you said. You can either:
* use the neo4j-import command, which allows you to import very fast but requires you to put your CSV in a specific format, requires that you be creating a DB from scratch, and requires that you create indexes/constraints after the fact, or
* use the LOAD CSV command, which isn't as fast but is still pretty fast (a rough sketch follows below).
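For reference, a LOAD CSV run might look roughly like this. The CSV path, column names, and the 1000-row PERIODIC COMMIT batch size are placeholders, not anything taken from your data:

# Sketch only: the Cypher string is the important part. You could run it from
# the Neo4j browser or shell, or POST it with the batching helper sketched above.
load_csv_cypher = <<~CYPHER
  USING PERIODIC COMMIT 1000
  LOAD CSV WITH HEADERS FROM 'file:///events.csv' AS row
  MERGE (m:Person    {token: row.person_token})
  MERGE (i:IpAddress {address: row.ip})
  MERGE (m)-[:ACCESSED_FROM]->(i)
CYPHER

# If your neo4j-core session accepts a raw Cypher string (an assumption here),
# something like this would kick it off:
# Neo4j::Session.current.query(load_csv_cypher)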
You can use the neo4apis gem to build a DSL to import your data. The gem will create Cypher queries under the covers and will batch them for performance. See examples of the gem in use via neo4apis-twitter and neo4apis-github
If you are a bit more adventurous, you can use the new Cypher API in neo4j-core via the new_cypher_api branch on the GitHub repo. The README in that branch has some documentation on the API, but also feel free to drop by our Gitter chat room if you have questions on this or anything else.
If you're implementing a solution which is going to make queries like the above, where you have multiple MERGE clauses, you'll probably want to profile your queries to make sure that you are avoiding the Eager operator (that post is a bit old and newer versions of Neo4j have alleviated some of the need for care, but you can still look for Eager in your PROFILE output).
Also worth a look: Max De Marzi's post on Scaling Cypher Writes
According to the README on GitHub, Ruby Whois can be used "as a standalone library to parse WHOIS records fetched previously and/or from different WHOIS clients."
I know how to use the library to directly perform a WHOIS query and parse the returned result. But I cannot find anywhere (Stack Overflow included) how I can use this library to parse WHOIS data that was fetched previously.
I think it's not important, but this is how I get my data anyway: it is fetched through the Linux whois command and stored in separate files, each file containing one WHOIS query result.
The manual pages on https://whoisrb.org/ are 404. Even the code on the homepage is outdated and thus wrong, and the doc pages provide little information.
I tried to scan the source code on GitHub (https://github.com/weppos/whois-parser and https://github.com/weppos/whois). I tried to find the answer on RubyDoc (https://www.rubydoc.info/gems/whois-parser/Whois/Parser, https://www.rubydoc.info/gems/whois/Whois/Record and some related pages). Both failed, partly because this is the first time I have used Ruby.
So could anyone help me? I'm really desperate and I'll definitely appreciate any help.
Try it like this,
require 'whois-parser'
domain = 'google.com'
data = 'WHOIS DATA THAT YOU ALREADY HAVE'
whois_server = Whois::Server.guess domain
whois_data = [Whois::Record::Part.new(body: data, host: whois_server.host)]
record = Whois::Record.new(whois_server, whois_data)
parser = record.parser
parser.available? #=> false
parser.registered? #=> true
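Since your WHOIS responses are saved one per file, you can read each file in as the body. Here is a small sketch building on the code above; the whois_dumps/ directory and the domain-named .txt files are assumptions about your layout:

require 'whois-parser'

# Assumes files like whois_dumps/google.com.txt, one WHOIS response per file,
# with the file name doubling as the domain name.
Dir.glob('whois_dumps/*.txt').each do |path|
  domain = File.basename(path, '.txt')
  body   = File.read(path)

  server = Whois::Server.guess(domain)
  part   = Whois::Record::Part.new(body: body, host: server.host)
  parser = Whois::Record.new(server, [part]).parser

  puts "#{domain}: registered=#{parser.registered?}"
end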
I am trying to prototype a trigger using the Zapier CLI, and I am running into an issue with the 'Pull In Samples' section when setting up the trigger in the UI.
This tries to pull in a live sample of data to use, however the documentation states that if no results are returned it will use the sample data that is configured for the trigger.
In most cases there will be no live data, so I would ideally prefer the sample data to be used in the first instance; however, my trigger never seems to use the sample, and I have not been able to find a concrete example of a 'no results' response.
The API I am using returns XML so I am manipulating the result into JSON which works fine if there is data.
When there are no results, so far I have tried returning '[]', but that just hangs, and if I check the Zapier HTTP logs it's looping HTTP requests until I cancel the sample check.
Returning '[{}]' returns an error that I need an 'id' field.
The definition I am using is:
module.exports = {
  key: 'getsmsinbound',
  noun: 'GetSMSInbound',
  display: {
    label: 'Get Inbound SMS',
    description: 'Check for inbound SMS'
  },
  operation: {
    inputFields: [
      { key: 'number', required: true, type: 'string', helpText: 'Enter the inbound number' },
      { key: 'keyword', required: false, type: 'string', helpText: 'Optional if you have configured a keyword and you wish to check for specific keyword messages.' },
    ],
    perform: getsmsinbound,
    sample: {
      id: 1,
      originator: '+447980123456',
      destination: '+447781484146',
      keyword: '',
      date: '2009-07-08',
      time: '10:38:55',
      body: 'hello world',
      network: 'Orange'
    }
  }
};
I'm hoping it's something obvious, as I've had no luck scouring the web and the Zapier documentation!
Sample data must be provided from your app; the sample payload is not used for this polling step specifically. From the docs:
Sample results will NOT be used for a user's Zap testing step. That step requires data to be received by an event or returned from a polling URL. If a user chooses to "Skip Test", then the sample result, if provided, will be used.
Personally, I have never seen "Skip Test" show up. A while back I asked support about this:
That's a great question! It's definitely one of those "chicken and egg" situations when using REST Hooks - if there isn't a sample available, then everything just stalls.
When the Zap editor tries to obtain a "sample result", there are three places where it's going to look:
1. The Polling endpoint (in Step #3 of your trigger's setup) is invoked for the current user. If that returns "nothing", then the Zap editor will try the next step.
2. The "most recent record/data" in the Zap's history. Since this is a brand new Zap, there won't be anything present.
3. The Sample result (in Step #4 of your trigger's setup). The Zap editor will tell the user that there's "nothing to show", and will give the user the option to "skip test and continue", which will use the sample JSON that you've provided here.
In reality, it will just continue to retry the request over and over and never provide the user with a "skip test and continue" option. I just emailed again asking if anything has changed since then, but it looks like existing sample data is a requirement.
Perhaps create a record in your API by default and hide it from normal use and just send back that one?
Or send back dummy data even though Zapier says not to. Not sure, but I don't know how people can set up a Zap when no data has been created yet (and Zapier says not many of their apps have this issue, but nearly every trigger I've created and every use case for other applications would hint to me otherwise).
Perhaps I've somewhat missed the point of Protobufs, but I spent some time implementing it because I was hoping to gain raw speed compared to my current JSON setup.
My use case is like this: a large, complicated PHP application (not a website), in production and being used heavily. We're now trying to split our application into smaller parts, written in a suitable language for each problem. The first service I have split out does processing and transformations on strings; it is very domain specific and not very interesting, and involves lots of regexes, custom parsing, etc.
I implemented my domain logic in Go, which works beautifully and was very easy to pick up. I attached my logic to a simple JSON API using Go-Kit. It's a very simple transformation, JSON-encoding to something like {"v":"some string usually 10-100 chars"}.
The performance was worse than native PHP, which I consider quite acceptable considering the overhead of JSON and the addition of transmitting over a network layer.
However, what really surprised me is that Protobuf has not only been no faster than JSON, but actually slower by 30-50%.
My .proto:
syntax = "proto3";
package pb;
option optimize_for = SPEED;
service StringStuff {
rpc DoStringStuff (StringReq) returns (StringRes) {}
}
message StringReq {
string in = 1;
}
message StringRes {
string out = 1;
}
I used https://github.com/stanley-cheung/Protobuf-PHP and the generated proto PHP code. My PHP client code is like this:
$client = new StringClient('localhost:50051', [
    'credentials' => \Grpc\ChannelCredentials::createInsecure()]);
$string = new StringReq();
$string->setIn("some string...");
list($reply, $status) = $client->DoStringStuff($string)->wait();
It works but to my surprise it is a lot slower than JSON.
My only guess: is it possible that the PHP implementation of Protobuf is so much slower than json_decode that PHP currently makes a very poor client for Protobuf?
Or is it normal for small, simple uses like transmitting a single string that JSON should outperform Protobuf?
Thank you for any and all thoughts.
The native PHP implementation of protobuf, which you install with composer require google/protobuf, is much slower than the protobuf C extension. To get any real performance out of gRPC you need to install the protobuf C extension:
pecl install protobuf
and enable it in php.ini
extension=protobuf.so
This does all the serialization/deserialization in C rather than in PHP, which is going to be many times faster than the PHP version.
I'm using SerilogMetrics's BeginTimedOperation() in a Web API, and it would be really great to be able to use the HttpRequestNumber or HttpRequestId properties (from the respective Serilog.Extra.Web enrichers) as the identifier, making it super easy to correlate timing-related log entries with others across a request.
Something like:
using (logger.BeginTimedOperation("doing some work", HttpRequestNumberEnricher.CurrentRequestNumber))
{ ... }
Short of poking around in HttpContext.Current for the magically-named (i.e. non-public) properties, is this achievable? Thanks!
If you begin a timed operation during a web request, the operation's events will already be tagged with the HttpRequestId.
You'll see it when logging to a structured log server like Seq, but if you're writing it out to a text file or trace then the property won't be included in the output message by default. To show it there, use something like:
.WriteTo.File(...,
    outputTemplate: "{Timestamp} [{Level}] ({HttpRequestId}) {Message} ...")
The logging methods use a default template you can draw on for inspiration, and there's some info spread around the wiki though there's no definitive reference.
Since I installed the Google Fit app on my Nexus 5 it has been tracking my step count and time spent walking. I'd like to retrieve this info via the Google Fitness REST api (docs) but I can't work out how to get any of that data from the REST api.
I've used the OAuth 2.0 playground to successfully list dataSources, but none of the examples I have tried have returned any fitness data whatsoever. I feel like I need to use something similar to a DataReadRequest from the Android SDK, but I'm not building an Android app -- I just want to access fitness data already stored by the Google Fit app.
Is it even possible to get the data gathered by the Google Fit app? If so, how can I read and aggregate step count data using the REST api?
It turns out that the answer is in the docs after all. Here is the format of the request.
GET https://www.googleapis.com/fitness/v1/users/{userId}/dataSources/{dataSourceId}/datasets/{datasetId}
The only supported {userId} value is me (with authentication).
Possible values for {dataSourceId} are available by running a different request.
The bit I missed was that {datasetId} is not really an ID, but actually where you define the timespan in which you are interested. The format for that variable is {startTime}-{endTime} where the times are in nanoseconds since the epoch.
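To make the arithmetic concrete, here is a tiny Ruby illustration of building such a URL. The dates are arbitrary examples, and the dataSourceId is just one of the estimated_steps sources mentioned elsewhere in this thread:

# Convert ordinary Unix timestamps (seconds since the epoch) to nanoseconds
# and join them with a dash to form the {datasetId} path segment.
start_ns = Time.utc(2016, 8, 6).to_i  * 1_000_000_000
end_ns   = Time.utc(2016, 8, 13).to_i * 1_000_000_000

data_source = 'derived:com.google.step_count.delta:com.google.android.gms:estimated_steps'
url = "https://www.googleapis.com/fitness/v1/users/me/dataSources/#{data_source}/datasets/#{start_ns}-#{end_ns}"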
I was able to get this working by going through the Google PHP client and noticed that they append their start and finish times for the GET request with extra 0's - nine, in fact.
Use the same GET request format as mentioned in an answer above:
https://www.googleapis.com/fitness/v1/users/{userId}/dataSources/{dataSourceId}/datasets/{datasetId}
Now here is an example with the Unix timestamp (PHP's time() function uses this):
https://www.googleapis.com/fitness/v1/users/me/dataSources/derived:com.google.step_count.delta:com.google.android.gms:estimated_steps/datasets/1470475368-1471080168
This is the response I get:
{
  "minStartTimeNs": "1470475368",
  "maxEndTimeNs": "1471080168",
  "dataSourceId": "derived:com.google.step_count.delta:com.google.android.gms:estimated_steps"
}
However, if you append nine 0's to the start and finish times that you put in your GET request and shape it like this:
https://www.googleapis.com/fitness/v1/users/me/dataSources/derived:com.google.step_count.delta:com.google.android.gms:estimated_steps/datasets/1470475368000000000-1471080168000000000
It worked - this is the response I got:
{
  "minStartTimeNs": "1470475368000000000",
  "maxEndTimeNs": "1471080168000000000",
  "dataSourceId": "derived:com.google.step_count.delta:com.google.android.gms:estimated_steps",
  "point": [
    {
      "modifiedTimeMillis": "1470804762704",
      "startTimeNanos": "1470801347560000000",
      "endTimeNanos": "1470801347567000000",
      "value": [
        {
          "intVal": -3
        }
      ],
      "dataTypeName": "com.google.step_count.delta",
      "originDataSourceId": "raw:com.google.step_count.delta:com.dsi.ant.plugins.antplus:AntPlus.0.124"
    },
The response is a lot longer but I truncated it for the sake of this post. So when passing your datasets parameter into the request:
1470475368-1471080168 will not work, but 1470475368000000000-1471080168000000000 will.
This did the trick for me, hope it helps someone!
I tried the POST method with the URL and body below. This works; please check the inline comments too (note that real JSON does not allow comments, so strip them before sending).
Use URL: https://www.googleapis.com/fitness/v1/users/me/dataset:aggregate
Method: POST
Body:
{
  "aggregateBy": [{
    "dataTypeName": "com.google.step_count.delta",
    "dataSourceId": "derived:com.google.step_count.delta:com.google.android.gms:estimated_steps"
  }],
  "bucketByTime": { "durationMillis": 86400000 }, // This is 24 hours
  "startTimeMillis": 1504137600000,               // start time
  "endTimeMillis": 1504310400000                  // end time
}
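For what it's worth, here is a rough Ruby sketch of sending that same aggregate request. The access token below is a placeholder for a valid OAuth 2.0 token with the fitness scopes, obtained however you manage auth:

require 'net/http'
require 'json'

# Placeholder: supply a valid OAuth 2.0 access token.
access_token = ENV['GOOGLE_FIT_ACCESS_TOKEN']

uri = URI('https://www.googleapis.com/fitness/v1/users/me/dataset:aggregate')

body = {
  aggregateBy: [{
    dataTypeName: 'com.google.step_count.delta',
    dataSourceId: 'derived:com.google.step_count.delta:com.google.android.gms:estimated_steps'
  }],
  bucketByTime: { durationMillis: 86_400_000 },  # 24-hour buckets
  startTimeMillis: 1_504_137_600_000,            # start time
  endTimeMillis: 1_504_310_400_000               # end time
}

request = Net::HTTP::Post.new(uri, 'Content-Type' => 'application/json',
                                   'Authorization' => "Bearer #{access_token}")
request.body = body.to_json

response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
puts response.body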