Problem description:
I will have regular inserts into a table (let's call it 'daily'). I am planning to build a materialized view (MV) on top of it, and I need data from another static table ('metadata').
Option 1: Do a join, but that would load the 'metadata' table into memory.
Option 2: Use a dictionary instead. But how do I keep the dictionary on disk, with only 'X' bytes loaded into memory?
The dictionary has 600 million rows and takes around 100 GB of memory. I am using low-end machines and do not want to use that much RAM.
I am okay with higher latency.
How do I solve this? Is there a setting for it?
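This looks like ClickHouse (materialized views plus dictionaries), so one setting worth checking is the ssd_cache dictionary layout, which keeps the bulk of the dictionary in files on disk and holds only a bounded cache in RAM. Below is a rough, non-authoritative sketch using the Python clickhouse-driver package; the dictionary name, columns, lifetime, and cache sizes are all assumptions to be tuned against the actual 'metadata' schema and verified against the ClickHouse docs.

# Sketch: define a disk-backed dictionary over the static 'metadata' table.
# All names and sizes below are placeholders.
from clickhouse_driver import Client

client = Client("localhost")

client.execute("""
    CREATE DICTIONARY metadata_dict
    (
        id UInt64,
        attribute String
    )
    PRIMARY KEY id
    SOURCE(CLICKHOUSE(TABLE 'metadata'))
    LIFETIME(MIN 300 MAX 600)
    LAYOUT(SSD_CACHE(
        BLOCK_SIZE 4096
        FILE_SIZE 16777216
        READ_BUFFER_SIZE 1048576
        PATH '/var/lib/clickhouse/user_files/metadata_dict_cache'
    ))
""")

# The MV's SELECT can then use dictGet('metadata_dict', 'attribute', key)
# instead of a join, so the full 100 GB never has to sit in RAM.

Whether these cache sizes are sensible for 600 million rows is something to benchmark; the point is only that the cache-style layouts bound memory at the cost of extra disk reads, which matches the "okay with latency" constraint.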
Related
I have a Spark job that writes to an S3 bucket, with an Athena table on top of that location.
The table is partitioned. Spark was writing a single 1 GB file per partition. We experimented with the maxRecordsPerFile option so that only 500 MB of data is written per file; in that case we ended up with two files of 500 MB each.
This saved 15 minutes of runtime on EMR.
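(For reference, the option was set on the DataFrame writer roughly as in the sketch below; the paths, record count, and partition columns here are placeholders, not the exact job code.)

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("daily-load").getOrCreate()
df = spark.read.parquet("s3://bucket/staging/table/")   # placeholder input

(df.write
   .mode("overwrite")
   .option("maxRecordsPerFile", 500000)                 # cap rows per output file
   .partitionBy("source_system", "execution_date", "year_month_day")
   .parquet("s3://bucket/dw/table/"))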
However, there was a problem with Athena: query CPU time got worse with the new file sizes.
I compared the same query over the same data before and after the change, and this is what I found:
Partition columns = source_system, execution_date, year_month_day
Query we tried:
select *
from dw.table
where source_system = 'SS1'
and year_month_day = '2022-09-14'
and product_vendor = 'PV1'
and execution_date = '2022-09-14'
and product_vendor_commission_amount is null
and order_confirmed_date is not null
and filter = 1
order by product_id
limit 100;
Execution time:
Before: 6.79s
After: 11.102s
Explain analyze showed that the new structure had to scan more data.
Before: CPU: 13.38s, Input: 2619584 rows (75.06MB), Data Scanned: 355.04MB; per task: std.dev.: 77434.54, Output: 18 rows (67.88kB)
After: CPU: 20.23s, Input: 2619586 rows (74.87MB), Data Scanned: 631.62MB; per task: std.dev.: 193849.09, Output: 18 rows (67.76kB)
Can you please help me understand why this takes almost double the time? What are the things to look out for? Is there a sweet spot for file size that would be optimal for the Spark and Athena combination?
One hypothesis is that pushdown filters are more effective with the single file strategy.
From AWS Big Data Blog's post titled Top 10 Performance Tuning Tips for Amazon Athena:
Parquet and ORC file formats both support predicate pushdown (also called predicate filtering). Both formats have blocks of data that represent column values. Each block holds statistics for the block, such as max/min values. When a query is being run, these statistics determine whether the block should be read or skipped depending on the filter value used in the query. This helps reduce data scanned and improves the query runtime. To use this capability, add more filters in the query (for example, using a WHERE clause).
One way to optimize the number of blocks to be skipped is to identify and sort by a commonly filtered column before writing your ORC or Parquet files. This ensures that the range between the min and max of values within the block are as small as possible within each block. This gives it a better chance to be pruned and also reduces data scanned further.
To test this, I would suggest another experiment if possible: change the Spark job to sort the data before persisting it into the two files, using the following column order: source_system, execution_date, year_month_day, product_vendor, product_vendor_commission_amount, order_confirmed_date, filter, and product_id. Then check the query statistics.
That way the dataset would at least be optimised for the presented use case; otherwise, adjust the sort order according to the heaviest queries.
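A minimal PySpark sketch of that experiment, assuming Parquet output and that the columns above exist in the DataFrame (paths and file-size settings are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sorted-write").getOrCreate()
df = spark.read.parquet("s3://bucket/staging/table/")     # placeholder input

sort_cols = ["source_system", "execution_date", "year_month_day",
             "product_vendor", "product_vendor_commission_amount",
             "order_confirmed_date", "filter", "product_id"]

# Sorting within each output partition tightens the min/max statistics of the
# Parquet row groups, giving the engine a better chance to skip blocks.
(df.sortWithinPartitions(*sort_cols)
   .write
   .mode("overwrite")
   .option("maxRecordsPerFile", 500000)
   .partitionBy("source_system", "execution_date", "year_month_day")
   .parquet("s3://bucket/dw/table/"))

A full orderBy over the sort columns would give even tighter ranges at the cost of a shuffle; sortWithinPartitions avoids the shuffle and is usually a reasonable first experiment.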
The post comments on optimal file sizes too and gives a general rule of thumb. In my experience, Spark works well with file sizes between 128 MB and 2 GB, and that range should also be fine for other query engines such as the Presto engine behind Athena.
My suggestion would be to break year_month_day/execution_date (as they are used in most queries) into year, month, and day partitions, which would reduce the amount of data scanned and allow more efficient filtering.
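A hedged sketch of what that repartitioning could look like on the Spark side; the column derivations and output path are assumptions:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ymd-partitions").getOrCreate()
df = spark.read.parquet("s3://bucket/staging/table/")      # placeholder input

# Derive year/month/day from the existing 'YYYY-MM-DD' string column
df = (df.withColumn("year",  F.substring("year_month_day", 1, 4))
        .withColumn("month", F.substring("year_month_day", 6, 2))
        .withColumn("day",   F.substring("year_month_day", 9, 2)))

(df.write
   .mode("overwrite")
   .partitionBy("source_system", "year", "month", "day")
   .parquet("s3://bucket/dw/table_ymd/"))

Queries would then filter on year = '2022' AND month = '09' AND day = '14', so Athena can prune partitions at a finer granularity.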
I've been looking at the Athena and PrestoDB documentation and can't find any reference to a limit on the number of elements in an array column and/or a maximum total size. Files will be in Parquet format, but that's negotiable if Parquet is the limiting factor.
Is this known?
MORE CONTEXT:
I'm going to be pushing data into a Firehose delivery stream that will emit Parquet files to S3, which I plan to query using Athena. The data is a one-to-many mapping of S3 URIs to a set of IDs, e.g.
s3://bucket/key_one, 123
s3://bucket/key_one, 456
....
s3://bucket/key_two, 321
s3://bucket/key_two, 654
...
Alternatively, I could store it in this form:
s3://bucket/key_one, [123, 456, ...]
s3://bucket/key_two, [321, 654, ...]
Since Parquet is compressed, I'm not concerned about the size of the files on S3; the repeated URIs should be taken care of by the compression.
What's of more concern is the number of calls I need to make to Firehose in order to insert records. In the first case there is one record per (object, ID) tuple, of which there are approximately 6000 per object. There is a batch call, but it's limited to 500 records per batch, so I would end up making multiple calls. This code will execute in a Lambda function, and I'm trying to save execution time wherever possible.
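(For what it's worth, chunking the flat records into 500-entry PutRecordBatch calls with boto3 would look roughly like the sketch below; the stream name and record shape are assumptions.)

import json
import boto3

firehose = boto3.client("firehose")

def put_records(stream_name, pairs, batch_size=500):
    """Send (uri, id) tuples to Firehose in batches of at most 500 records."""
    for start in range(0, len(pairs), batch_size):
        batch = [{"Data": (json.dumps({"uri": uri, "id": id_}) + "\n").encode()}
                 for uri, id_ in pairs[start:start + batch_size]]
        # A production version should inspect FailedPutCount in the response and retry.
        firehose.put_record_batch(DeliveryStreamName=stream_name, Records=batch)

put_records("my-delivery-stream", [("s3://bucket/key_one", 123),
                                   ("s3://bucket/key_one", 456)])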
There should not be any explicit limit on the Presto/Athena side for the number of elements in an array column; eventually it comes down to a JVM limitation, which will be huge. Just make sure that you have enough node memory available to handle these fields. It would be good to review your use case and avoid storing very large array-type column values.
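If the aggregated form is chosen, producing it in Spark and reading it back in Athena could look roughly like this (all table, column, and path names are placeholders):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("uri-ids").getOrCreate()

# Flat (uri, id) pairs as delivered by Firehose -- placeholder input path
pairs = spark.read.parquet("s3://bucket/raw/uri_ids/")

# One row per URI with an array<bigint> of ids (roughly 6000 elements per URI here)
aggregated = pairs.groupBy("uri").agg(F.collect_list("id").alias("ids"))

aggregated.write.mode("overwrite").parquet("s3://bucket/curated/uri_ids/")

# In Athena the array can be flattened again with something like:
#   SELECT uri, id FROM curated_uri_ids CROSS JOIN UNNEST(ids) AS t(id);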
My project is to implement interactive queries that let users explore the data. For example, we have a list of columns the user can choose from; the user adds columns to a list and presses 'view data'. The data is currently stored in Cassandra, and we use Spark SQL to query it.
The data flow is: raw logs are processed by Spark and stored into Cassandra. The data is a time series with more than 20 columns and 4 metrics. In my current tests, because more than 20 dimensions go into the clustering key, writes to Cassandra are quite slow.
The idea here is to load all the data from Cassandra into Spark and cache it in memory, then provide an API to the client and run queries against the Spark cache.
But I don't know how to keep that cached data persistent. I am trying to use spark-jobserver, which has a feature called shared objects, but I am not sure it works.
We can provide a cluster with more than 40 CPU cores and 100 GB of RAM. We estimate the data to query is about 100 GB.
What I have already tried:
Storing the data in Alluxio and loading it into Spark from there, but loading is slow: for 4 GB of data Spark has to do two things, first read from Alluxio (which takes more than 1 minute) and then spill to disk for the Spark shuffle (which costs another 2 or 3 minutes). That is over our target of under 1 minute. We tested one job on 8 CPU cores.
Storing the data in MemSQL, but it is rather costly: one day of data costs 2 GB of RAM, and I am not sure the speed will hold up as we scale.
Using Cassandra directly, but Cassandra does not support GROUP BY.
So, what I really want to know is whether my direction is right or not. What can I change to achieve the goal (MySQL-like queries with lots of GROUP BY, SUM, and ORDER BY) and return the results to the client through an API?
If you explicitly call cache or persist on a DataFrame, it will be saved in memory (and/or disk, depending on the storage level you choose) until the context is shut down. This is also valid for sqlContext.cacheTable.
So, as you are using Spark JobServer, you can create a long running context (using REST or at server start-up) and use it for multiple queries on the same dataset, because it will be cached until the context or the JobServer service shuts down. However, using this approach, you should make sure you have a good amount of memory available for this context, otherwise Spark will save a large portion of the data on disk, and this would have some impact on performance.
Additionally, the Named Objects feature of JobServer is useful for sharing specific objects among jobs, but this is not needed if you register your data as a temp table (df.registerTempTable("name")) and cache it (sqlContext.cacheTable("name")), because you will be able to query your table from multiple jobs (using sqlContext.sql or sqlContext.table), as long as these jobs are executed on the same context.
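A minimal sketch of the temp-table-plus-cacheTable approach described above, using the Spark 1.x-style API the answer refers to (keyspace, table, and query are placeholders):

from pyspark import SparkContext
from pyspark.sql import SQLContext

# In JobServer this would be the shared, long-running context; it is created
# directly here only to illustrate the calls.
sc = SparkContext(appName="cassandra-cache")
sqlContext = SQLContext(sc)

df = (sqlContext.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="analytics", table="events")     # placeholder keyspace/table
      .load())

df.registerTempTable("events")
sqlContext.cacheTable("events")      # materialised in memory on first use

# Any later job running on the same context can query the cached table:
result = sqlContext.sql(
    "SELECT dim1, SUM(metric1) AS total FROM events GROUP BY dim1 ORDER BY total DESC")
result.show()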
I am using the tFileInputJson and tMongoDBOutput components to store JSON data in a MongoDB database.
When trying this with a small amount of data (nearly 100k JSON objects), the data can be stored in the database without any problems.
Now my requirement is to store nearly 300k JSON objects in the database, and my JSON objects look like this:
{
    "LocationId": "253b95ec-c29a-430a-a0c3-614ffb059628",
    "Sdid": "00DlBlqHulDp/43W3eyMUg",
    "StartTime": "2014-03-18 22:22:56.32",
    "EndTime": "2014-03-18 22:22:56.32",
    "RegionId": "10d4bb4c-69dc-4522-801a-b588050099e4",
    "DeviceCategories": [
        "ffffffff-ffff-ffff-ffff-ffffffffffff",
        "00000000-0000-0000-0000-000000000000"
    ],
    "CheckedIn": false
}
While performing this operation I get the following exception:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
[statistics] disconnected
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
at java.lang.StringBuffer.append(StringBuffer.java:237)
at org.json.simple.JSONArray.toJSONString(Unknown Source)
at org.json.simple.JSONArray.toJSONString(Unknown Source)
at org.json.simple.JSONArray.toString(Unknown Source)
at samplebigdata.retail_store2_0_1.Retail_Store2.tFileInputJSON_1Process(Retail_Store2.java:1773)
at samplebigdata.retail_store2_0_1.Retail_Store2.runJobInTOS(Retail_Store2.java:2469)
at samplebigdata.retail_store2_0_1.Retail_Store2.main(Retail_Store2.java:2328)
Job Retail_Store2 ended at 15:14 10/11/2014. [exit code=1]
My current job looks like:
How can I store so much data into the database in a single job?
The issue here is that you're printing the JSON object to the console (with your tLogRow). This requires all of the JSON objects to be held in memory before finally being dumped all at once to the console once the "flow" is completed.
If you remove the tLogRow components then (in a job as simple as this) Talend should only hold whatever the batch size is for your tMongoDBOutput component in memory and keep pushing batches into MongoDB.
As an example, here's a screenshot of me successfully loading 100000000 rows of randomly generated data into a MySQL database:
The data set represents about 2.5 GB on disk as a CSV but was comfortably handled in memory with a max heap space of 1 GB: each insert is 100 rows, so the job only really needs to keep 100 rows of the CSV (plus any associated metadata and any Talend overheads) in memory at any one point.
In reality, it will probably keep significantly more than that in memory and simply garbage collect the rows that have been inserted into the database when the max memory is close to being reached.
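Outside Talend, the same streaming/batched-insert pattern looks roughly like this in Python with pymongo, just to illustrate why memory stays bounded (file path, database name, and batch size are assumptions):

import json
from pymongo import MongoClient

collection = MongoClient("mongodb://localhost:27017")["retail"]["locations"]

batch, batch_size = [], 100
with open("locations.json") as f:           # assumes one JSON object per line
    for line in f:
        batch.append(json.loads(line))
        if len(batch) == batch_size:
            collection.insert_many(batch)   # only ~100 documents held in memory
            batch = []
if batch:
    collection.insert_many(batch)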
If you have an absolute requirement for logging the JSON records that are successfully put into the database, then you might try outputting to a file instead and streaming the output.
As long as you aren't getting too many invalid JSON objects in your tFileInputJson, you can probably keep the reject-linked tLogRow, as it will only receive the rejected/invalid JSON objects and so shouldn't run out of memory. Since you are restricted to small amounts of memory due to being on a 32-bit system, you might need to be wary that if the number of invalid JSON objects grows you will quickly exceed your memory space.
If you simply want to load a large number of JSON objects into a MongoDB database, then you will probably be best off using the tMongoDBBulkLoad component. This takes a flat file (either .csv, .tsv, or .json) and loads it directly into a MongoDB database. The documentation I just linked to shows all the relevant options, but you might be particularly interested in the --jsonArray additional argument that can be passed to the database. There is also a basic example of how to use the component.
This would mean you couldn't do any processing midway through the load, and you would have to use a pre-prepared JSON/CSV file to load the data, but if you just want a quick way to load data into the database using Talend then this should cover it.
If you need to process chunks of the file at a time, then you might want to look at a much more complicated job with a loop: load n records from your input, process them, and then restart the processing part of the loop, selecting n records but skipping a header of n records, then repeating with a header of 2n records, and so on...
Garpmitzn's answer pretty much covers how to change JVM settings to increase memory space, but for something as simple as this you just want to avoid keeping data in memory for no good reason.
As an aside, if you're paying for an Enterprise licence of Talend then you should be able to get yourself a 64-bit box with 16 GB of RAM easily enough, and that will drastically help with your development. I'd at least hope that your production job execution server has a decent amount of memory.
I feel you are reading everything into Talend's memory. You have to play with the Java JVM parameters Xms and Xmx: increase Xmx to a bigger size than it is currently set to, e.g. if it is set to -Xmx2048m then increase it to -Xmx4096m, and so on.
These parameters are available in the .bat/.sh file of an exported job, or in Talend Studio you can find them under the Run Job tab > Advanced settings > JVM Settings.
But it is advisable to design the job in such a way that you don't load too much into memory.
I am working on a single-node Cassandra setup. The system I am using has a 4-core CPU and 8 GB of RAM.
The properties of the column family I am using are:
Keyspace: keyspace1:
Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
Durable Writes: true
Options: [datacenter1:1]
Column Families:
ColumnFamily: colfamily (Super)
Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type
Default column value validator: org.apache.cassandra.db.marshal.UTF8Type
Columns sorted by: org.apache.cassandra.db.marshal.UTF8Type/org.apache.cassandra.db.marshal.BytesType
Row cache size / save period in seconds / keys to save : 100000.0/0/all
Row Cache Provider: org.apache.cassandra.cache.ConcurrentLinkedHashCacheProvider
Key cache size / save period in seconds: 200000.0/14400
GC grace seconds: 864000
Compaction min/max thresholds: 4/32
Read repair chance: 1.0
Replicate on write: true
Built indexes: []
Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
I tried to insert 1 million rows into a column family. The write throughput is around 2,500 per second and the read throughput is around 380 per second.
How can I improve both the read and write throughput?
380 reads per second means that you are reading data from the hard drive with a low cache hit rate, or the OS is swapping. Check the Cassandra statistics to find out the cache usage:
./nodetool -host <IP> cfstats
You have enabled both the row cache and the key cache. The row cache reads the whole row into RAM, meaning all columns for a given row key. In this case you can disable the key cache, but make sure that you have enough free RAM to handle row caching.
If you have Cassandra with an off-heap cache (the default since 1.x), it is possible that the row cache is very large and the OS has started swapping; check the swap size, as this can decrease performance.