ArangoDB: How to run 2 queries in parallel in community edition - parallel-processing

Hi, I have written the two queries below and would like to run them in parallel rather than sequentially. Is it possible to execute them in parallel in the Community Edition of ArangoDB?
FOR d IN Transaction
  FILTER d._to == "Account/123"
  COLLECT AGGREGATE length = COUNT_UNIQUE(d._id),
                    totamnt = SUM(d.Amount),
                    daysactive = COUNT_UNIQUE(DATE_TRUNC(d.Time, "day"))
  RETURN {
    "Incoming Accounts": length,
    "Days Active": LENGTH(daysactive),
    "Total Amount": totamnt
  }

FOR d IN Transaction
  FILTER d._from == "Account/123"
  COLLECT AGGREGATE length = COUNT_UNIQUE(d._id),
                    totamnt = SUM(d.Amount),
                    daysactive = COUNT_UNIQUE(DATE_TRUNC(d.Time, "day"))
  RETURN {
    "Outgoing Accounts": length,
    "Days Active": LENGTH(daysactive),
    "Total Amount": totamnt
  }

Of course it is possible to run multiple requests in parallel. Just fire two curl calls at _api/cursor, or use two different arangosh shells.
Or run two curl calls from the same shell and use the x-arango-async header on each request to retrieve the results asynchronously, as documented here: https://www.arangodb.com/docs/stable/http/async-results-management.html#async-execution-and-later-result-retrieval
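A minimal sketch of the parallel client-side approach (not from the original answer; it assumes the server is reachable at http://localhost:8529, the _system database, root credentials, and it simplifies the RETURN clauses). Two HTTP requests to /_api/cursor are sent from separate threads, so the server processes both queries at the same time:

import concurrent.futures
import requests

CURSOR_URL = "http://localhost:8529/_db/_system/_api/cursor"  # assumed endpoint
AUTH = ("root", "")                                            # assumed credentials

INCOMING = """
FOR d IN Transaction
  FILTER d._to == "Account/123"
  COLLECT AGGREGATE length = COUNT_UNIQUE(d._id),
                    totamnt = SUM(d.Amount),
                    daysactive = COUNT_UNIQUE(DATE_TRUNC(d.Time, "day"))
  RETURN { "Incoming Accounts": length, "Days Active": daysactive, "Total Amount": totamnt }
"""
# Same aggregation over the outgoing edges.
OUTGOING = INCOMING.replace("d._to", "d._from").replace("Incoming", "Outgoing")

def run_query(aql):
    # Each call opens its own cursor; the server handles the two requests independently.
    response = requests.post(CURSOR_URL, json={"query": aql}, auth=AUTH)
    response.raise_for_status()
    return response.json()["result"]

# Fire both requests at the same time instead of one after the other.
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
    incoming, outgoing = pool.map(run_query, [INCOMING, OUTGOING])

print(incoming, outgoing)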

Related

Google Fit API - International users sync issue

I'm using the Google Fit REST API.
Here is the data I'm retrieving from Google Fit using the API:
2021-03-21 29989 Steps
2021-03-20 12 Steps
Here is the data the user exported from Google:
3/22/2021 16,480 Steps
3/21/2021 13,521 Steps
In both cases, the steps add up to 30,001.
The dates are clearly off by one day because of the time zone. The daily counts are also off for the same reason; however, they add up to the same total.
What general approach/strategy can I take so that the steps obtained from the API match those on Google Fit when I don't have a timezone?
My API currently loops through the database and syncs all user data, without distinguishing domestic from international users.
Here is the code snippet used to get steps:
//***** Get steps
case DATATYPE_STEP_COUNT_DELTA:
    if ($dataStreamId == 'derived:com.google.step_count.delta:com.google.android.gms:estimated_steps') {
        $listDatasets = $dataSets->get("me", $dataStreamId, $startTime . '000000000' . '-' . $endTime . '000000000');
        if ($debug == 1) PrintR($listDatasets, "DATATYPE_STEP_COUNT_DELTA");
        $step_count = 0;
        foreach ($listDatasets as $dataSet) {
            if ($dataSet['startTimeNanos']) {
                $sec = $dataSet['startTimeNanos'] / 1000000000;
                $activity_date = date('Y-m-d', $sec);
                $dataSetValues = $dataSet['value'];
                if ($dataSetValues && is_array($dataSetValues)) {
                    foreach ($dataSetValues as $dataSetValue) {
                        if (!isset($stepsArr[$studentencodedid][$activity_date])) $stepsArr[$studentencodedid][$activity_date] = 0;
                        $stepsArr[$studentencodedid][$activity_date] += $dataSetValue['intVal'];
                        $step_count += $dataSetValue['intVal'];
                    }
                }
            }
        }
    }
    break;
//***** End get steps
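The snippet converts startTimeNanos with PHP's date(), which formats using the server's default timezone rather than the user's. A small illustration of how the same bucket start can land on different calendar dates depending on which timezone is used for the conversion (a sketch with a made-up timestamp and an assumed user timezone of America/New_York):

from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# 2021-03-22 00:30 UTC expressed in nanoseconds (illustrative value only).
start_time_nanos = 1_616_373_000 * 1_000_000_000

ts = datetime.fromtimestamp(start_time_nanos / 1_000_000_000, tz=timezone.utc)
print(ts.date())                                            # 2021-03-22 (UTC / server view)
print(ts.astimezone(ZoneInfo("America/New_York")).date())   # 2021-03-21 (user's local view)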

How to loop CSV file values using Ultimate Thread Group?

I have this links.csv file:
METHOD,HOST,PATH,HITS
GET,google.com,/,7
GET,facebook.com,/,3
I want to create a JMeter test plan using the Ultimate Thread Group (UTG) that randomizes the hits based on the last column in the CSV above (HITS).
When viewing the results tree, I want to see something like this:
1. google.com
2. google.com
3. facebook.com
4. google.com
5. google.com
6. google.com
7. google.com
8. google.com
9. facebook.com
10. facebook.com
Ideally, I want to set the UTG to use the following settings:
Start Threads Count = sum of all hits in the CSV file (e.g. 7 + 3)
Initial Delay = 0
Startup Time = 60
Hold Load For = 30
Shutdown Time = 0
How can I achieve this? I'd appreciate code samples and screenshots since I'm still new to JMeter.
I can only think of generating a new CSV file out of your original one in order to:
Get the "sum" of "HITS"
Generate a line containing method, host and path per "hit"
In order to achieve this:
Add setUp Thread Group to your Test Plan
Add JSR223 Sampler to the Thread Group
Put the following code into "Script" area:
def entries = new File('/path/to/original.csv').readLines().drop(1)
def sum = 0
def newCSV = new File('/path/to/generated.csv')
newCSV << 'METHOD,HOST,PATH' << System.getProperty('line.separator')
entries.each { entry ->
    def values = entry.split(',')
    def hits = values[3] as int
    sum += hits
    1.upto(hits, {
        newCSV << values[0] << ',' << values[1] << ',' << values[2] << System.getProperty('line.separator')
    })
}
props.put('threads', sum as String)
Use the __P() function like ${__P(threads,)} as the Start Threads Count in the Ultimate Thread Group
Use the new "generated" CSV file in the CSV Data Set Config in the Ultimate Thread Group (an example of the generated file is shown below)
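For the sample links.csv above, the script would set the threads property to 10 (7 + 3) and the generated CSV would contain one line per hit:
METHOD,HOST,PATH
GET,google.com,/
GET,google.com,/
GET,google.com,/
GET,google.com,/
GET,google.com,/
GET,google.com,/
GET,google.com,/
GET,facebook.com,/
GET,facebook.com,/
GET,facebook.com,/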

Redis Sorted Set: Bulk ZSCORE

How can I get a list of members, based on their IDs, from a sorted set instead of just one member at a time?
I would like to build a subset of the actual sorted set from a set of IDs.
I am using a Ruby client for Redis and do not want to iterate one by one, because there could be more than 3,000 members that I want to look up.
Here is the issue tracker for a new command, ZMSCORE, to do bulk ZSCORE lookups.
There is no variadic form of ZSCORE yet - see the discussion at: https://github.com/antirez/redis/issues/2344
That said, for the time being you could use a Lua script for that. For example:
local scores = {}
while #ARGV > 0 do
    scores[#scores + 1] = redis.call('ZSCORE', KEYS[1], table.remove(ARGV, 1))
end
return scores
Running this from the command line would look like:
$ redis-cli ZADD foo 1 a 2 b 3 c 4 d
(integer) 4
$ redis-cli --eval mzscore.lua foo , b d
1) "2"
2) "4"
EDIT: In Ruby, it would probably be something like the following, although you'd be better off using SCRIPT LOAD and EVALSHA and loading the script from an external file (instead of hardcoding it in the app):
require 'redis'
script = <<LUA
local scores = {}
while #ARGV > 0 do
    scores[#scores + 1] = redis.call('ZSCORE', KEYS[1], table.remove(ARGV, 1))
end
return scores
LUA
redis = ::Redis.new()
reply = redis.eval(script, ["foo"], ["b", "d"])
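A rough sketch of the SCRIPT LOAD / EVALSHA variant mentioned above (assuming the Lua source lives in an external file, here hypothetically named mzscore.lua):

require 'redis'

redis = ::Redis.new()
script = File.read('mzscore.lua')   # load the Lua source from a file instead of hardcoding it
sha = redis.script(:load, script)   # register the script once and keep its SHA1
reply = redis.evalsha(sha, keys: ["foo"], argv: ["b", "d"])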
Lua script to get scores with member IDs:
local scores = {}
while #ARGV > 0 do
    local member_id = table.remove(ARGV, 1)
    local member_score = {}
    member_score[1] = member_id
    member_score[2] = redis.call('ZSCORE', KEYS[1], member_id)
    scores[#scores + 1] = member_score
end
return scores

How to limit the number of concurrent jobs started by a Pig script?

I am trying to implement a simple data processing flow for a POC in Pig, using the Hortonworks sandbox.
The idea is as follows: there is a set of already processed data, and a new data set should be added to the old data without duplicates.
For testing purposes I use very small data sets (less than 10 KB).
For the virtual machine I've allocated 4 GB of RAM and 2 of 4 processor cores.
Here is my Pig script:
-- CONFIGURABLE PROPERTIES
%DEFAULT atbInput '/user/hue/ATB_Details/in/1'
%DEFAULT atbOutputBase '/user/hue/ATB_Details/out/1'
%DEFAULT atbPrevOutputBase '/user/hue/ATB_Details/in/empty'
%DEFAULT validData 'valid'
%DEFAULT invalidData 'invalid'
%DEFAULT billDateDimensionName 'tmlBillingDate'
%DEFAULT admissionDateDimensionName 'tmlAdmissionDate'
%DEFAULT dischargeDateDimensionName 'tmlDischargeDate'
%DEFAULT arPostDateDimensionName 'tmlARPostDate'
%DEFAULT patientTypeDimensionName 'dicPatientType'
%DEFAULT patientTypeCodeDimensionName 'dicPatientTypeCode'
REGISTER bdw-all-deps-1.0.jar;
DEFINE toDateDimension com.epam.bigdata.etl.udf.ToDateDimension();
DEFINE toCodeDimension com.epam.bigdata.etl.udf.ToCodeDimension();
DEFINE isValid com.epam.bigdata.etl.udf.atbdetails.IsValidFunc();
DEFINE isGarbage com.epam.bigdata.etl.udf.atbdetails.IsGarbageFunc();
DEFINE toAccounntBalanceCategory com.epam.bigdata.etl.udf.atbdetails.ToBalanceCategoryFunc();
DEFINE isEndOfMonth com.epam.bigdata.etl.udf.IsLastDayOfMonthFunc();
DEFINE toBalanceCategoryId com.epam.bigdata.etl.udf.atbdetails.ToBalanceCategoryIdFunc();
rawData = LOAD '$atbInput';
--CLEANSING
SPLIT rawData INTO garbage IF isGarbage($0),
    cleanLines OTHERWISE;
splitRecords = FOREACH cleanLines GENERATE FLATTEN(STRSPLIT($0, '\\|'));
cleanData = FOREACH splitRecords GENERATE
$0 AS Id:LONG,
$1 AS FacilityName:CHARARRAY,
$2 AS SubFacilityName:CHARARRAY,
$3 AS PeriodDate:CHARARRAY,
$4 AS AccountNumber:CHARARRAY,
$5 AS RAC:CHARARRAY,
$6 AS ServiceTypeCode:CHARARRAY,
$7 AS ServiceType:CHARARRAY,
$8 AS AdmissionDate:CHARARRAY,
$9 AS DischargeDate:CHARARRAY,
$10 AS BillDate:CHARARRAY,
$11 AS PatientTypeCode:CHARARRAY,
$12 AS PatientType:CHARARRAY,
$13 AS InOutType:CHARARRAY,
$14 AS FinancialClassCode:CHARARRAY,
$15 AS FinancialClass:CHARARRAY,
$16 AS SystemIPGroupCode:CHARARRAY,
$17 AS SystemIPGroup:CHARARRAY,
$18 AS CurrentInsuranceCode:CHARARRAY,
$19 AS CurrentInsurance:CHARARRAY,
$20 AS InsuranceCode1:CHARARRAY,
$21 AS InsuranceBalance1:DOUBLE,
$22 AS InsuranceCode2:CHARARRAY,
$23 AS InsuranceBalance2:DOUBLE,
$24 AS InsuranceCode3:CHARARRAY,
$25 AS InsuranceBalance3:DOUBLE,
$26 AS InsuranceCode4:CHARARRAY,
$27 AS InsuranceBalance4:DOUBLE,
$28 AS InsuranceCode5:CHARARRAY,
$29 AS InsuranceBalance5:DOUBLE,
$30 AS AgingBucket:CHARARRAY,
$31 AS AccountBalance:DOUBLE,
$32 AS TotalCharges:DOUBLE,
$33 AS TotalPayments:DOUBLE,
$34 AS EstimatedRevenue:DOUBLE,
$35 AS CreateDateTime:CHARARRAY,
$36 AS UniqueFileId:LONG,
$37 AS PatientBalance:LONG,
$38 AS VendorCode:CHARARRAY;
--VALIDATION
SPLIT cleanData INTO validData IF isValid(*),
    invalidData OTHERWISE;
--Dimension update--
--MACROS
DEFINE mergeDateDimension(validDataSet, dimensionFieldName, previousDimensionFile) RETURNS merged {
    dates = FOREACH $validDataSet GENERATE $dimensionFieldName;
    oldDimensions = LOAD '$previousDimensionFile' USING PigStorage('|') AS (
        id:LONG,
        monthName:CHARARRAY,
        monthId:INT,
        year:INT,
        fiscalYear:INT,
        originalDate:CHARARRAY);
    oldOriginalDates = FOREACH oldDimensions GENERATE originalDate;
    allDates = UNION dates, oldOriginalDates;
    uniqueDates = DISTINCT allDates;
    $merged = FOREACH uniqueDates GENERATE toDateDimension($0);
};
DEFINE mergeCodeDimension(validDataSet, dimensionFieldName, previousDimensionFile, outputIdField) RETURNS merged {
    newCodes = FOREACH $validDataSet GENERATE $dimensionFieldName as newCode;
    oldDim = LOAD '$previousDimensionFile' USING PigStorage('|') AS (
        id:LONG,
        code:CHARARRAY);
    allCodes = COGROUP oldDim BY code, newCodes BY newCode;
    grouped = FOREACH allCodes GENERATE
        (IsEmpty(oldDim) ? 0L : SUM(oldDim.id)) as id,
        group AS code;
    ranked = RANK grouped BY id DESC, code DESC DENSE;
    $merged = FOREACH ranked GENERATE
        ((id == 0L) ? $0 : id) as $outputIdField,
        code AS $dimensionFieldName;
};
--DATE DIMENSIONS
billDateDim = mergeDateDimension(validData, BillDate, '$atbPrevOutputBase/dimensions/$billDateDimensionName');
STORE billDateDim INTO '$atbOutputBase/dimensions/$billDateDimensionName';
admissionDateDim = mergeDateDimension(validData, AdmissionDate, '$atbPrevOutputBase/dimensions/$admissionDateDimensionName');
STORE admissionDateDim INTO '$atbOutputBase/dimensions/$admissionDateDimensionName';
dischDateDim = mergeDateDimension(validData, DischargeDate, '$atbPrevOutputBase/dimensions/$dischargeDateDimensionName');
STORE dischDateDim INTO '$atbOutputBase/dimensions/$dischargeDateDimensionName';
arPostDateDim = mergeDateDimension(validData, PeriodDate, '$atbPrevOutputBase/dimensions/$arPostDateDimensionName');
STORE arPostDateDim INTO '$atbOutputBase/dimensions/$arPostDateDimensionName';
--CODE DIMENSION
patientTypeDim = mergeCodeDimension(validData, PatientType, '$atbPrevOutputBase/dimensions/$patientTypeDimensionName', PatientTypeId);
STORE patientTypeDim INTO '$atbOutputBase/dimensions/$patientTypeDimensionName' USING PigStorage('|');
patientTypeCodeDim = mergeCodeDimension(validData, PatientTypeCode, '$atbPrevOutputBase/dimensions/$patientTypeCodeDimensionName', PatientTypeCodeId);
STORE patientTypeCodeDim INTO '$atbOutputBase/dimensions/$patientTypeCodeDimensionName' USING PigStorage('|');
The problem is that when I run this script, it never completes (it gets stuck).
In the Job Browser I can see one completed job and multiple jobs with 0% progress.
If I comment out the processing of the last three files, everything works fine (i.e. three parallel jobs succeed).
I've tried a few approaches to fix this issue:
The -no_multiquery Pig parameter - it lets the script run to completion, using only one job at a time. The main disadvantage is the huge number of generated jobs (26) and the very long execution time (nearly 15 minutes for the script above and almost 40 minutes for a more complicated version).
Working only with the parts I'm currently developing and testing, commenting out the rest - this is not an option in the long term.
Changing the mapred.capacity-scheduler.maximum-system-jobs property in mapred-site.xml so that there should be fewer than three jobs at once, as described here.
Changing mapred.capacity-scheduler.queue.default.maximum-capacity in capacity-scheduler.xml to configure the default queue. This approach didn't work for me either.
Allocating more memory to the sandbox virtual machine and to the mappers and reducers - no effect.
So my question is: how can I limit the number of concurrent jobs started by a Pig script?
Or maybe there is another configuration fix that allows concurrent execution of multiple jobs?
[UPDATE]
If I run the same script with the same input data from the shell console, everything works fine.
So I assume there is some issue with Hue.
[UPDATE]
If I start a more complex script from the console, it also gets stuck, but in this case the number of parallel jobs is 8.
The last time we saw this, it was because the cluster had only one map task. You can use EXEC as described here:
http://pig.apache.org/docs/r0.11.1/perf.html#Implicit-Dependencies
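A rough sketch of the idea (not from the original answer; check the exact behaviour against the linked documentation): placing EXEC between groups of STORE statements forces Pig to run everything up to that point before it continues, so fewer jobs are submitted at once than with full multi-query execution:

-- first batch: run these two STOREs before going any further
STORE billDateDim INTO '$atbOutputBase/dimensions/$billDateDimensionName';
STORE admissionDateDim INTO '$atbOutputBase/dimensions/$admissionDateDimensionName';
EXEC;
-- second batch
STORE dischDateDim INTO '$atbOutputBase/dimensions/$dischargeDateDimensionName';
STORE arPostDateDim INTO '$atbOutputBase/dimensions/$arPostDateDimensionName';
EXEC;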

How can we combine multiple rows into a single row in Pig?

I need to combine these multiple tuples into a single one using a Pig script. Could you please provide some guidelines?
dump requestFile;
Current Output
(Logging Transaction ID:21214,/var/log/tibco/,NESS-A-1-LPNameRequesttoNESS.log,tibcoTest log)
(Default Data:LP Name Request Message Executed Successfully)
(LoanPath Request ID: 88128640)
(RequestGroupID#: )
(SplitCount#: 2 )
(SplitIndex: 1)
(Correlation ID : 88128640-1 )
Desired output
(Logging Transaction ID:21214,/var/log/tibco/,NESS-A-1-LPNameRequesttoNESS.log,tibcoTest log,Default Data:LP Name Request Message Executed Successfully,LoanPath Request ID: 88128640,RequestGroupID#: ,SplitCount#: 2,SplitIndex: 1)
(Correlation ID : 88128640-1 )
What about:
requestFile = FOREACH requestFile GENERATE FLATTEN($0);
G = GROUP requestFile ALL;
F = FOREACH G GENERATE requestFile;
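If the goal is literally one tuple with all the values joined together, a variation using the builtin BagToString might look like this sketch (assuming a reasonably recent Pig version that ships BagToString, and that each input row is a single chararray field):

G = GROUP requestFile ALL;
F = FOREACH G GENERATE BagToString(requestFile, ',');
DUMP F;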
