Pinot batch ingestion removing old data - apache-pinot

I am playing with Pinot and have set it up locally using ./bin/pinot-admin.sh QuickStart -type batch,
and have also added a table with a single multi-value column (named values).
I then created a CSV file with the following data (NOTE: I am using '-' as the multi-value delimiter):
values
a-b
a
b
and ingested it using standalone batch ingestion with the following job spec:
executionFrameworkSpec:
  name: 'standalone'
  segmentGenerationJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentGenerationJobRunner'
  segmentTarPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentTarPushJobRunner'
  segmentUriPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentUriPushJobRunner'
  segmentMetadataPushJobRunnerClassName: 'org.apache.pinot.plugin.ingestion.batch.standalone.SegmentMetadataPushJobRunner'
# Recommended to set jobType to SegmentCreationAndMetadataPush for production environment where Pinot Deep Store is configured
jobType: SegmentCreationAndTarPush
inputDirURI: '.'
includeFileNamePattern: 'glob:**/*.csv'
outputDirURI: './csv/segments/'
overwriteOutput: true
pinotFSSpecs:
  - scheme: file
    className: org.apache.pinot.spi.filesystem.LocalPinotFS
recordReaderSpec:
  dataFormat: 'csv'
  className: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReader'
  configClassName: 'org.apache.pinot.plugin.inputformat.csv.CSVRecordReaderConfig'
  configs:
    multiValueDelimiter: '-'
tableSpec:
  tableName: 'exp'
pinotClusterSpecs:
  - controllerURI: 'http://localhost:9000'
pushJobSpec:
  pushAttempts: 2
  pushRetryIntervalMillis: 1000
Now, the first time I add the data using ./bin/pinot-admin.sh LaunchDataIngestionJob -jobSpecFile ingestion-job.yaml, I see all three values in the table. When I run the same job again, I don't see 6 rows; I still see only 3. I then changed the CSV file to a single row with the value x, and after launching the job the table shows just that one row. It seems that every time I run the ingestion job, the previously ingested data is deleted and only the newly ingested data remains.
I expected batch ingestion to add to the existing data. Am I missing something here?

A huge maybe here, but have you tried setting the following configs to APPEND?
"batchIngestionConfig": {
  "segmentIngestionType": "APPEND",
  "segmentIngestionFrequency": "DAILY"
}
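For reference, this block sits under "ingestionConfig" in the offline table config. Below is a minimal Python sketch of one way to apply it against the quickstart controller; it assumes the requests package is available and that GET/PUT /tables/{tableName} on the controller REST API behave as described in the comments (editing the config in the controller UI works just as well).

import requests  # assumption: the 'requests' package is installed

CONTROLLER = "http://localhost:9000"  # controllerURI from the job spec
TABLE = "exp"                         # tableName from the job spec

# Fetch the current config; GET /tables/{name} returns it keyed by table type.
table_config = requests.get(f"{CONTROLLER}/tables/{TABLE}").json()["OFFLINE"]

# Add (or overwrite) the suggested batch ingestion settings.
table_config.setdefault("ingestionConfig", {})["batchIngestionConfig"] = {
    "segmentIngestionType": "APPEND",
    "segmentIngestionFrequency": "DAILY",
}

# Push the updated config back to the controller.
resp = requests.put(f"{CONTROLLER}/tables/{TABLE}", json=table_config)
resp.raise_for_status()
print("table config updated:", resp.status_code)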

Related

Append multiple CSVs into one single file with Apache Nifi

I have a folder of CSV files that share the same first 3 columns and differ in their last N columns, where N is at least 2 and up to 11.
The last N columns have numbers as headers, for example:
File 1:
AAA,BBB,CCC,0,10,15
1,India,c,0,28,54
2,Taiwan,c,0,23,52
3,France,c,0,26,34
4,Japan,c,0,27,46
File 2:
AAA,BBB,CCC,0,5,15,30,40
1,Brazil,c,0,20,64,71,88
2,Russia,c,0,20,62,72,81
3,Poland,c,0,21,64,78,78
4,Litva,c,0,22,66,75,78
Desired output:
AAA,BBB,CCC,0,5,10,15,30,40
1,India,c,0,null,28,54,null,null
2,Taiwan,c,0,null,23,52,null,null
3,France,c,0,null,26,34,null,null
4,Japan,c,0,null,27,46,null,null
1,Brazil,c,0,20,null,64,71,88
2,Russia,c,0,20,null,62,72,81
3,Poland,c,0,21,null,64,78,78
4,Litva,c,0,22,null,66,75,78
Is there a way to append these files together with NiFi so that a new column gets created (even if I do not know the column name beforehand) whenever a file with additional data is present in the folder?
I tried the MergeContent processor, but by default it just concatenates the content of all my files without minding the headers (every header row gets appended as well).
What you could do is write a script to combine the rows and columns and call it using the ExecuteStreamCommand processor. This lets you write a custom script in whatever language you want.
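As a rough illustration (not NiFi-specific), here is the kind of merge logic such a script could implement in Python, assuming the files all share the first three columns and live in a hypothetical input/ folder; missing columns are filled with "null" as in the desired output above. You would then invoke this (or an equivalent version that reads stdin and writes stdout) from ExecuteStreamCommand.

import csv
import glob

FIXED = 3  # number of leading columns shared by every file (AAA,BBB,CCC above)

fixed_names = None
extra_headers = set()
rows = []

for path in sorted(glob.glob("input/*.csv")):  # hypothetical input folder
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        if fixed_names is None:
            fixed_names = reader.fieldnames[:FIXED]
        extra_headers.update(reader.fieldnames[FIXED:])
        rows.extend(reader)

# Sort the variable headers numerically so 0,5,10,15,30,40 come out in order.
merged_extra = sorted(extra_headers, key=float)

with open("merged.csv", "w", newline="") as out:  # hypothetical output file
    writer = csv.writer(out)
    writer.writerow(fixed_names + merged_extra)
    for record in rows:
        line = [record[name] for name in fixed_names]
        line += [record.get(name, "null") for name in merged_extra]
        writer.writerow(line)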

How to set start and end row or interval rows for CSV in Nifi?

I want to get a particular part of an Excel file in NiFi. My NiFi template looks like this:
GetFileProcessor
ConvertExcelToCSVProcessor
PutDatabaseRecordProcessor
I need to parse the data between steps 2 and 3.
Is there a solution for getting specific rows and columns?
Note: if there is an option for trimming the output of ConvertExcelToCSVProcessor, that will work for me.
You can use Record processors between ConvertExcelToCSV and PutDatabaseRecord.
To remove or override a column, use UpdateRecord. This processor can receive your data via a CSVReader and prepare an output for PutDatabaseRecord or QueryRecord. Check View usage -> Additional Details...
To filter by column, use QueryRecord.
Here is an example: it receives data through a CSVReader and makes some aggregations; you can do some filtering in the same way, according to the docs.
This post also helped me to understand Records in NiFi.

JMeter Unique once feature

I want to use a "unique once" setting in JMeter, as used in LoadRunner, for a project.
The data is provided through a CSV file and is parameterized.
If the script is run with multiple users, the data should still be unique, meaning each value can be used only once.
Users in JMeter equal threads, so just use the CSV Data Set Config to read your CSV.
Sharing mode: All threads - (the default) the file is shared between all the threads.
You can also consider using the HTTP Simple Table Server plugin. It has a KEEP option; if you set it to FALSE, once a record is read from the source file it is removed, guaranteeing uniqueness even if you run your tests in distributed mode.
You can install the HTTP Simple Table Server plugin using the JMeter Plugins Manager.
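To illustrate what the KEEP option does, here is a small Python sketch of the Simple Table Server's HTTP interface, run from outside JMeter (in a test plan you would hit the same endpoints with HTTP Request samplers). It assumes the STS is running on its default port 9191 and that a hypothetical users.csv sits in its data directory.

import requests  # assumption: the 'requests' package is installed

STS = "http://localhost:9191/sts"  # default Simple Table Server address

# Load users.csv from the STS data directory into memory.
requests.get(f"{STS}/INITFILE", params={"FILENAME": "users.csv"})

# Read the first record and delete it (KEEP=FALSE), so no thread or injector
# can ever be handed the same record again.
resp = requests.get(
    f"{STS}/READ",
    params={"READ_MODE": "FIRST", "KEEP": "FALSE", "FILENAME": "users.csv"},
)
print(resp.text)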
I have come up with a solution for the "unique once" Vuser setting in JMeter:
1) Create the data file and insert the data column-wise, i.e. lay all of the data out horizontally across columns.
2) Use the code below in place of your unique parameter:
${__CSVRead(filePath,${__threadNum})}
It will then pick unique data for each thread:
Thread_1 Iteration_1 --- Data from col 1
Thread_1 Iteration_2 --- Data from col 1
Thread_1 Iteration_3 --- Data from col 1
Thread_2 Iteration_1 --- Data from col 2
Thread_2 Iteration_2 --- Data from col 2
Thread_2 Iteration_3 --- Data from col 2
Although every answer above works perfectly fine, here is a different way, using an If Controller.
CSV Data Set Config settings:
If Controller validation code: ${__groovy(${__jm__Thread Group__idx} == 0,)}
Inside this If Controller we keep a JSR223 Sampler containing one or more lines of the following code: vars.put("VarName",vars.get("CSV_input"))
where VarName is replaced with the variable name we want to use in the script and CSV_input with the variable name from the CSV Data Set Config input.

neo4j performance for Merge queries on 100 thousand nodes

I started working with neo4j recently and I have a performance problem with the MERGE query used to create my graph.
I have a CSV file with 100,000 records and want to load the data from this file.
My loading query is as follows:
//Script to import global Actors data
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "file:///D:/MOT/test_data.csv" AS row
MERGE (c:Country {Name:row.Country})
MERGE (a:Actor {Name: row.ActorName, Aliases: row.Aliases, Type:row.ActorType})
My system configuration:
8.00 GB RAM and Core i5-3330 CPU.
my neo4j config is as follows:
neostore.nodestore.db.mapped_memory=50M
neostore.relationshipstore.db.mapped_memory=50M
neostore.propertystore.db.mapped_memory=90M
neostore.propertystore.db.strings.mapped_memory=130M
neostore.propertystore.db.arrays.mapped_memory=130M
mapped_memory_page_size=1048576
label_block_size=60
array_block_size=120
node_auto_indexing=False
string_block_size=120
When I run this query in the neo4j browser it takes more than a day. Would you please help me solve the problem? For example, please let me know whether I should change my JVM configuration, change my query, or something else, and how.
To increase the speed of MERGE queries you should create indexes on your MERGE properties:
CREATE INDEX ON :Country(Name)
CREATE INDEX ON :Actor(Name)
If you have unique node properties, you can increase performance even more by using uniqueness constraints instead of normal indexes:
CREATE CONSTRAINT ON (node:Country) ASSERT node.Name IS UNIQUE
CREATE CONSTRAINT ON (node:Actor) ASSERT node.Name IS UNIQUE
In general your query will be faster if you MERGE on a single, indexed property only:
//Script to import global Actors data
USING PERIODIC COMMIT 500
LOAD CSV WITH HEADERS FROM "file:///D:/MOT/test_data.csv" AS row
MERGE (c:Country {Name:row.Country})
MERGE (a:Actor {Name: row.ActorName})
// if necessary, you can set properties here
ON CREATE SET a.Aliases = row.Aliases, a.Type = row.ActorType
As already answered on the Google group:
It should just take a few seconds.
I presume:
you use Neo4j 2.3.2?
you created indexes / constraints for the things you merge on?
you configured your neo4j instance to run with at least 4G of heap?
you are using PERIODIC COMMIT?
I suggest that you run a profile on your statement to see where the biggest issues show up.
Otherwise it is highly recommended to split it up,
e.g. like this:
CREATE CONSTRAINT ON (c:Country) ASSERT c.Name IS UNIQUE;
CREATE CONSTRAINT ON (o:Organization) ASSERT o.Name IS UNIQUE;
CREATE CONSTRAINT ON (a:Actor) ASSERT a.Name IS UNIQUE;
LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
WITH distinct row.Country as Country
MERGE (c:Country {Name:Country});
LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
WITH distinct row.AffiliationTo as AffiliationTo
MERGE (o:Organization {Name: AffiliationTo});
LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
MERGE (a:Actor {Name: row.ActorName}) ON CREATE SET a.Aliases=row.Aliases, a.Type=row.ActorType;
LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
WITH distinct row.Country as Country, row.ActorName as ActorName
MATCH (c:Country {Name:Country})
MATCH (a:Actor {Name:ActorName})
MERGE(c)<-[:IS_FROM]-(a);
LOAD CSV WITH HEADERS FROM "file:///E:/datasets/Actors_data_all.csv" AS row
MATCH (o:Organization {Name: row.AffiliationTo})
MATCH (a:Actor {Name: row.ActorName})
MERGE (a)-[r:AFFILIATED_TO]->(o)
ON CREATE SET r.Start=row.AffiliationStartDate, r.End=row.AffiliationEndDate;

Cassandra Hadoop map reduce with wide rows ignores slice predicate

I have a wide-row column family which I'm trying to run a map reduce job against. The CF is a time-ordered collection of events, where the column names are essentially timestamps. I need to run the MR job against a specific date range in the CF.
When I run the job with the widerow property set to false, the expected slice of columns is passed into the mapper class. But when I set widerow to true, the entire column family is processed, ignoring the slice predicate.
The problem is that I have to use widerow support, as the number of columns in the slice can grow very large and consume all the memory if loaded in one go.
I've found this JIRA task which outlines the issue, but it has been closed as "cannot reproduce": https://issues.apache.org/jira/browse/CASSANDRA-4871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel
I'm running Cassandra 1.2.6 and using cassandra-thrift 1.2.4 & hadoop-core 1.1.2 in my jar. The CF has been created using CQL3.
It's worth noting that this occurs regardless of whether I use a SliceRange or specify the columns using setColumn_names() - it still processes all of the columns.
Any help will be massively appreciated.
So it seems that this is by design. In the word_count example on GitHub, the following comment exists:
// this will cause the predicate to be ignored in favor of scanning everything as a wide row
ConfigHelper.setInputColumnFamily(job.getConfiguration(), KEYSPACE, COLUMN_FAMILY, true);
Urrrrgh. Fair enough then. It seems crazy that there is no way to limit the columns when using wide rows, though.
UPDATE
Apparently the solution is to use the new org.apache.cassandra.hadoop.cql3 library. See the new example on GitHub for reference: https://github.com/apache/cassandra/blob/trunk/examples/hadoop_cql3_word_count/src/WordCount.java
Sorry to add this comment as an answer, but we are trying to do the same thing. You mentioned that "When I run the job with the widerow property set to false, the expected slice of columns are passed into the mapper class", but when we set the widerow property to false we are still getting errors. How did you pass the timestamp range in the slice predicate?
The CF that we use is a timeline of events with uid as the partition key and event_timestamp as the composite column. The equivalent CQL is:
CREATE TABLE testcf (
  uid varchar,
  event_timestamp timestamp,
  event varchar,
  PRIMARY KEY (uid, event_timestamp));
Map reduce code - to send only events within the start and end dates (note: we can query from the cassandra-client and cqlsh on the timestamp composite column and get the desired events):
// Setting widerow to false
config.setInputColumnFamily(Constants.KEYSPACE_TRACKING, Constants.CF_USER_EVENTS, false);
DateTime start = getStartDate(); // e.g., July 30th 2013
DateTime end = getEndDate();     // e.g., Aug 6th 2013
SliceRange range = new SliceRange(
    ByteBufferUtil.bytes(start.getMillis()),
    ByteBufferUtil.bytes(end.getMillis()),
    false, Integer.MAX_VALUE);
SlicePredicate predicate = new SlicePredicate().setSlice_range(range);
config.setInputSlicePredicate(predicate);
But the above code doesn't work. We get the following error:
java.lang.RuntimeException: InvalidRequestException(why:Invalid bytes remaining after an end-of-component at component0)
at org.apache.cassandra.hadoop.ColumnFamilyRecordReader$StaticRowIterator.maybeInit(ColumnFamilyRecordReader.java:384)
We are wondering if we are sending incorrect data in the start and end parameters of the slice range.
Any hint or help is appreciated.
