I am perplexed at this point. I spent a day or three in the deep end of InfluxDB and Grafana to get some graphs plotted that are crucial to my needs. However, for the last one I need to total up two metrics (two increment counts, each stored in the column value); let's call them notifications.one and notifications.two. In the graph I would like them displayed as a single line showing the total (notifications.one + notifications.two) rather than as two separate series.
I tried the usual SELECT sum(value) across the two, but I get no data back (even though the data does exist!). The InfluxDB documentation also mentions merge(), but I cannot get that to work either.
The documentation for merge requires something like:
SELECT mean(value) FROM /notifications.*/ WHERE ...
This, too, comes back as a flat zero line.
I hope my question makes sense; I have far from enough knowledge to convey the problem as well as I would like.
Thank you.
With InfluxDB 0.12 you can write:
SELECT MEAN(usage_system) + MEAN(usage_user) + MEAN(usage_irq) AS cpu_total
FROM cpu
WHERE time > now() - 10s
GROUP BY host;
These features are not really documented yet, but you can have a look at supported mathematical operators.
In InfluxDB 0.9 there is no way to merge query results across measurements. Within a measurement all series are merged by default, but no series can be merged across measurements. See https://influxdb.com/docs/v0.9/concepts/08_vs_09.html#joins for more detail.
A better schema for 0.9 is, instead of two measurements notifications.one and notifications.two, to have one measurement notifications with foo=one and foo=two as tags on that single measurement. Then the query for the merged values is just SELECT MEAN(value) FROM notifications, and the per-series query is SELECT MEAN(value) FROM notifications GROUP BY foo.
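For instance, with that schema, queries along these lines would give Grafana the combined line and the per-tag lines respectively (the one-hour window and one-minute interval are purely illustrative values, not from the question):
SELECT MEAN(value) FROM notifications WHERE time > now() - 1h GROUP BY time(1m)
SELECT MEAN(value) FROM notifications WHERE time > now() - 1h GROUP BY time(1m), foo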
I think, as per the question, it is possible to combine queries much like nested queries in an RDBMS. This can be achieved using Continuous Queries in InfluxDB. The documentation linked below explains it clearly.
Basically, a continuous query writes the result of one query into a new series, and you then query that newly created series to fetch the data; a minimal sketch follows the link below.
https://docs.influxdata.com/influxdb/v1.1/query_language/continuous_queries/#substituting-for-nested-functions
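A minimal sketch of that pattern (the database name mydb, the measurement notifications, and the intervals are all illustrative assumptions, not taken from the question):
CREATE CONTINUOUS QUERY "cq_notification_counts" ON "mydb"
BEGIN
  SELECT count(value) AS value_count
  INTO "notification_counts"
  FROM "notifications"
  GROUP BY time(1h)
END
SELECT mean(value_count) FROM "notification_counts" WHERE time > now() - 7d
The continuous query keeps notification_counts up to date in the background; the second query then reads the precomputed counts, which is the substitute for the unsupported nested mean(count(value)).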
A little context: I'm experimenting with Neo4j (as a newbie, but experienced in other database technologies) for possible use as a master data management system within our business of identity intelligence, in particular looking at building up a graph of places and identity attributes (e.g. email addresses, telephone numbers, electoral roll data, etc.) with relationships between these nodes that express something meaningful, for example where an email address has been used, or where a telephone number is registered.
Desired system properties: I would like this system to have some specific properties that are valuable to us:
Fast ingestion of information from a significant number of providers (100+); this precludes lengthy (multi-hour) ETL processes, though short ones are OK!
Online at all times; this precludes use of the batch importer. We are most likely to use a fault-tolerant cluster, and sharding would be good :)
Capacity to eventually ingest ~30G records / year (~1000/second) and retain them, plus creation and retention of ~100G relationships / year; right now we are ingesting ~1/10 of this load.
Where I'm stuck: I have been experimenting with a single node in Azure (32GB RAM, 4 cores, non-local disk) running Debian 8 and Neo4j 3.1.1. This happily ingests and relates back together the UK postal address file (PAF), around 29M records, in a few tens of minutes using either LOAD CSV or home-brew Java and Bolt. I have also ingested, but not yet related, a test set of email address data, around 20M records, and now need to build relationships based on matching postcodes, building numbers, and possibly other fields between the two data sets. This is where things get much slower when using Cypher; here's the fastest query I have been able to create thus far:
UNWIND {list} AS i
MATCH(e:DDSEMAIL) WHERE ID(e) = i WITH e
MATCH(s:SUBBNAME) USING INDEX s:SUBBNAME(SBNA)
WHERE upper(e.Building) = s.SBNA WITH e,s
MATCH(m:MAINFILE)
WHERE trim(split(e.Postcode,' ')[0]) = m.OUTC AND
trim(split(e.Postcode,' ')[1]) = m.INCO AND
right('0000'+e.HouseNo,4) = m.BNUM AND
(m)-[:IS_SUBBNAME]->(s)
CREATE (e)-[r:USED_AT]->(m)
RETURN COUNT(r);
Indexes are:
ON :DDSEMAIL(HouseNo) ONLINE
ON :DDSEMAIL(Postcode) ONLINE
ON :DDSEMAIL(Building) ONLINE
ON :MAINFILE(OUTC) ONLINE
ON :MAINFILE(INCO) ONLINE
ON :MAINFILE(BNUM) ONLINE
ON :SUBBNAME(SBNA) ONLINE
Please note that the {list} parameter is being supplied through bolt from a Java client that has already enumerated all the ~20M DDSEMAIL nodes, and is batching into transactions (typically 1000 IDs at a time).
This is taking between 100 and 200 ms per ID; over a test run of 157,000 IDs it took 7.3 hours, indicating a full execution time of ~760 hours, or more than a month. The underlying machine appears CPU-bound (no significant IO wait time).
Looking at the EXPLAIN for this query, there are no full scans; it's all schema index matching (once I had included the explicit index hint), so I'm not sure where to look for more speed.
(edited to add this PROFILE output):
PROFILE part 1
PROFILE part 2
This shows that the match to both parts of the postcode is filtering a lot of rows (56k), it may be better to re-order these fields to reduce the filter input size.
(end of edit)
As a (very unfair) comparison, I pushed both sets of data from CSV files into a custom Bloom filter written in C#/.NET, which performs similar field reformatting to the above, then concatenates to generate textual keys and matches these keys together. This completed convolving all 20M email records against all 29M PAF records in under 5 minutes on a single core of my laptop. It was largely IO-bound.
Right now I'm considering using an external application or a user procedure to perform the record matching, and just creating relationships using Cypher, but it feels wrong to avoid a well-written query engine that should be able to do this much, much quicker than it is.
What should I be looking at to improve performance please?
If I recall correctly, the index won't be utilized correctly when there are transformations occurring on the comparison values (such as UPPER() or LOWER() or TRIM()) when they're sourced from another node property. You may need to perform these operations first and alias them, then do the match.
Providing the index hint gets around this, I think, so your match to s.SBNA should be correctly using the index, but if there's an index on any of the matched properties on m:MAINFILE, that may not be using the index.
Test to see if this makes a difference, comparing this query to the older query on a smaller data set:
UNWIND {list} AS i
MATCH(e:DDSEMAIL) WHERE ID(e) = i
WITH e, upper(e.Building) as SBNA
MATCH(s:SUBBNAME)
WHERE s.SBNA = SBNA
WITH e,s, trim(split(e.Postcode,' ')[0]) as OUTC,
trim(split(e.Postcode,' ')[1]) as INCO,
right('0000'+e.HouseNo,4) as BNUM
MATCH(m:MAINFILE)
WHERE OUTC = m.OUTC AND
INCO = m.INCO AND
BNUM = m.BNUM AND
(m)-[:IS_SUBBNAME]->(s)
CREATE (e)-[r:USED_AT]->(m)
RETURN COUNT(r);
Also, if you could add a screenshot of a PROFILE or EXPLAIN of the query to your description (after expanding all plan nodes) that may help to see where things could improve.
EDIT
As you mentioned in your description, batching these may be a good idea. APOC Procedures has apoc.periodic.iterate(), which may help here.
Let's see if we can apply that to your query. Try this out:
WITH {list} AS list
CALL apoc.periodic.iterate('
UNWIND {list} as list
RETURN list
', '
WITH {list} as i
MATCH(e:DDSEMAIL) WHERE ID(e) = i
WITH e, upper(e.Building) as SBNA
MATCH(s:SUBBNAME)
WHERE s.SBNA = SBNA
WITH e,s, trim(split(e.Postcode,' ')[0]) as OUTC,
trim(split(e.Postcode,' ')[1]) as INCO,
right('0000'+e.HouseNo,4) as BNUM
MATCH(m:MAINFILE)
WHERE OUTC = m.OUTC AND
INCO = m.INCO AND
BNUM = m.BNUM AND
(m)-[:IS_SUBBNAME]->(s)
MERGE (e)-[:USED_AT]->(m)
', {batchSize:1000, iterateList:true, params:{list:list}}) YIELD batches, total, committedOperations, failedOperations, failedBatches, errorMessages
RETURN batches, total, committedOperations, failedOperations, failedBatches, errorMessages
We have to sacrifice returning the total number of relationships created, however, as we can't return values from the batched query.
When merging two tables in Power Query, an evaluation is run to determine the possible number of matches. I work with pretty large tables (merging a 10K-record table with a 500K-record table), so this can take a long time.
I know there will be matches because I have done this before and I am not a beginner, yet Power Query insists on running this evaluation.
Is there any way to bypass this step? It feels a bit like needing to turn automatic calculation off in Excel so that you can get on with actually doing something.
Any ideas?
I would add an upstream filter to limit the rows, e.g. Keep Rows / Keep Top Rows / 100. You may need to do this on both queries. Ideally you keep enough rows, or use a specific filter, to get some matches to help your downstream query design work.
Then once the query design is finished, I would remove the filter(s) and let it rip.
This is what PQ should be doing in the Query Editor, but it does seem to go rogue on Merge in particular.
Say I have a collection with 100 million records/documents in it.
I want to create a series of reports that involve summing of values in certain columns and grouping by various columns.
What references for XQuery and/or MarkLogic can anyone point me to that will allow me to do this quickly?
I saw cts:avg-aggregate, which looks fine, but then I need to group as well.
Also, since I am dealing with a large amount of data and it will take some time to go through it all, I am thinking about setting this up as a job that runs at night to update the report.
I thought of using corb to run through the records and then do something with the output from that. Is this the right approach with MarkLogic and reporting?
Perhaps this guide would help:
http://developer.marklogic.com/blog/group-by-the-marklogic-way
You have several options, which are discussed in that guide:
cts:estimate
cts:element-value-co-occurrences
cts:value-tuples + cts:frequency
I have an application that aggregates data from different social network sites. The back-end processes are done in Java and work great.
The front end is a Rails application; the deadline for some analytics filter and report tasks was 3 weeks, and with a few days left it is almost complete.
When I started, I implemented map-reduce for different states and it worked great over 100,000 records on my local machine.
Then my colleague gave me the current, updated database, which has 2.7 million records. My expectation was that it would still run fine, since I specify a date range and filter before the map_reduce execution; my belief was that map-reduce would only operate on the result set of that filter, but that is not the case.
Example
I have a query that just shows stats for records loaded in the last 24 hours.
The result is 0 records found, but with 2.7 million records it now takes 200 seconds; before, it came back in milliseconds.
CODE EXAMPLE BELOW
filter is a hash of conditions expected to be applied before map_reduce
map function
reduce function
SocialContent.where(filter).map_reduce(map, reduce).out(inline: true).entries
Suggestions please: what would be the ideal solution in the remaining time frame, given that the database is growing exponentially by the day?
I would suggest you look at a few different things:
Does all your data still fit in memory? You have a lot more records now, which could mean that MongoDB needs to go to disk a lot more often.
M/R cannot make use of indexes. You have not shown your Map and Reduce functions, so it's not possible to point out mistakes. Update the question with those functions and what they are supposed to do, and I'll update the answer.
Look at using the Aggregation Framework instead, it can make use of indexes, and also run concurrently. It's also a lot easier to understand and debug. There is information about it at http://docs.mongodb.org/manual/reference/aggregation/
I have a table that I've created a Full Text Catalog on. The table has just over 6000 rows. I've added two columns to the index. The first could be considered a unique identifier of sorts and the second could be considered the content for that item (there are 11 other columns in my table that aren't part of the Full Text Catalog). Here is an example of a couple of rows:
TABLE: data_variables
ROW unique_id label
1 A100d1 Personal preference of online shopping sites
2 A100d2 Shopping behaviors for adults in household
In my web application on the front end, I have a text box that the user can type into to get a list of items that match whatever terms they're searching for in the UNIQUE ID or LABEL columns. So, for example, if the user typed in sho or a100 then a list would be populated with both of the rows above. If they typed in behav then a list would be populated with only row 2 above.
This is done via an Ajax request on each keyup. PHP calls a Stored Procedure on the SQL server that looks like:
SELECT TOP 50 dv.id, dv.id + ': ' + dv.label,
dv.type_id, dv.grouping, dv.friendly_label
FROM data_variables dv
WHERE (CONTAINS((dv.unique_id, dv.label), @search))
(@search is the text from the user that is passed into the stored procedure.)
I've noticed that this gets pretty sluggish, especially when I wasn't using TOP 50 in the query.
What I'm looking for is a way to speed this up either directly on the SQL Server or by abandoning the full-text indexing idea and using jQuery to search through an array of the searchable items on the client-side. I've looked a bit into the jQuery AutoComplete stuff and some other jQuery plugins for AutoComplete, but haven't yet tried to mock up anything. That would be my next step, but I wanted to check here first to see what advice I would get.
Thanks in advance.
Several suggestions, based around the fact that you have only 6000 rows, so the database should eat this alive.
A. Try using the LIKE operator, just in case it helps. I'm not expecting it to, but it's trivial to try. Something else must be going on overall for this to feel slow at such small volumes.
B. Can you cache query results in advance? With 6000 rows, there are probably only 36*36 combinations of two-character queries, which should take virtually no memory and would save the database any work.
C. Moving the selection out to the client is a good idea; it depends on how big the 6000 rows are overall versus the network latency of individual lookups.
D. Combining B and C should give you really good performance, I suspect, but with some coding effort required. If the server keeps all single-character result sets in a cache, and clients download the relevant letter's cache set after the first keystroke, then they hold a subset of all rows and won't need any more network IO for additional keystrokes.
I would advise against LIKE unless you're using a left-anchored (left-to-right) index and doing prefix queries such as LIKE 'word%'. If you're doing something like LIKE '%word%', a regular index isn't going to help you. You typically want to use a full-text index when you want to search for words inside a paragraph.
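For example, a quick sketch against the table from the question, assuming an ordinary (non-full-text) index existed on label; only the first form could seek on that index:
SELECT id FROM data_variables WHERE label LIKE 'shop%'   -- prefix match, index seek possible
SELECT id FROM data_variables WHERE label LIKE '%shop%'  -- leading wildcard forces a scan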
With a lot of data, the built-in full-text engines in databases typically aren't very stellar. For the best performance you usually have to go with an external solution built specifically for full-text search.
Some options are Sphinx, Solr, and Elasticsearch, just to name a few. I wouldn't say that any one of these options is better than the others. There are definitely pros and cons to consider:
What kind of data do you have?
What language support do these solutions have?
What database engines do these solutions support?
The best thing you can do is benchmark these solutions against your existing data. Testing each and every individual component (unit testing) can help you identify the real problems and help you find good solutions.
I had the same problem and went for the LIKE solution. I also found the OR operator to be too taxing, so I divided the query into two SELECTs combined with a UNION ALL (fastest, and in my scenario it was impossible for the same text to appear in both the index column and the data).
Yours would look like:
SELECT TOP 50 * FROM (
    SELECT dv.id, dv.id + ': ' + dv.label AS display_label,
           dv.type_id, dv.grouping, dv.friendly_label
    FROM data_variables dv
    WHERE dv.unique_id LIKE '%' + @search + '%'
    UNION ALL
    SELECT dv.id, dv.id + ': ' + dv.label,
           dv.type_id, dv.grouping, dv.friendly_label
    FROM data_variables dv
    WHERE dv.label LIKE '%' + @search + '%'
) AS results
Oh, and test the performance in SQL Server itself, not through the web!
If you plan to grow the amount of data, the best approach is to use an inverted (reverse) index for full-text searching.
Look at Apache Solr, the best full-text search engine at the moment.
You can simply index your database data periodically and use Solr as the search engine; it provides a simple AJAX API and can be queried directly from the front end.
If you really need performance, you may want to look at FTS3 and FTS4.
Snipped from another forum:
For example, if each of the 517430 documents in the "Enron E-Mail Dataset" is inserted into both an FTS table and an ordinary SQLite table created using the following SQL script:
Code:
CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT); /* FTS3 table */
CREATE TABLE enrondata2(content TEXT); /* Ordinary table */
Then either of the two queries below may be executed to find the number of documents in the database that contain the word "linux" (351). Using one desktop PC hardware configuration, the query on the FTS3 table returns in approximately 0.03 seconds, versus 22.5 seconds for querying the ordinary table.
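The two queries being compared are, per that FTS3 page, essentially:
SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux';  /* FTS3 table, ~0.03 seconds */
SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%'; /* ordinary table, ~22.5 seconds */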
see...
http://www.sqlite.org/fts3.html