A client just had ~1000 rows of data (the most recent, of course) go missing from one of their tables. Doing some forensics, I found that the "last_updated_date" in all of the table's remaining rows was also set to roughly the same time the deletion occurred. This is not one of their larger tables.
Another oddity is that the mysqldumps for the last week are all exactly the same size -- 10375605093 bytes. Previous dumps grew by about 0.5 GB each. The mysqldump command is standard:
/path/to/mysqldump -S /path/to/mysqld2.sock --lock-all-tables -u username -ppassword database > /path-to-backup/$(date +%Y%m%d)_live_data.mysqldump
df -h on the box shows plenty of space (at least 50%) in every directory.
The data loss, combined with the fact that the dumps have stopped growing, has me worried that we're somehow hitting some hardcoded limit in MySQL and (God I hope I'm wrong) data is getting corrupted. Has anyone ever heard of anything like this? How can we explain the mysqldump sizes?
50% free space doesn't mean much if you're writing multiple multi-gigabyte dumps and running out of space partway through. Unless you're storing binary data, the dumps are quite compressible, so I'd suggest piping mysqldump's output through gzip before writing it to a file:
mysqldump .... | gzip -9 > /path_to_backup/....
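This also gives you a chance to catch a dump that dies partway through (for example, from running out of disk) instead of silently keeping a truncated file. A minimal sketch assuming bash, reusing your existing paths (the error-log path is just illustrative):

set -o pipefail  # make the pipeline fail if mysqldump fails, not only gzip
/path/to/mysqldump -S /path/to/mysqld2.sock --lock-all-tables -u username -ppassword database \
  | gzip -9 > /path-to-backup/$(date +%Y%m%d)_live_data.mysqldump.gz \
  || echo "mysqldump failed at $(date)" >> /path-to-backup/backup_errors.log

With pipefail set, a failure in mysqldump itself returns a non-zero exit status for the whole pipeline, so you can alert on it rather than discovering a week of identically sized dumps after the fact.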
MySQL itself doesn't have any arbitrary limits that say "no more after X gigs", but there are limits imposed by the platform it's running on (file-size limits of the underlying filesystem, for example), detailed in the MySQL documentation on limits.
There is no hardcoded limit to the amount of data MySQL can handle.
I use PDI (Kettle) to extract data from MongoDB into Greenplum. When I tested extracting from MongoDB to a file, it was fast, about 10,000 rows per second. But extracting into Greenplum only manages about 130 rows per second.
I modified the following Greenplum parameters, but there was no significant improvement:
gpconfig -c log_statement -v none
gpconfig -c gp_enable_global_deadlock_detector -v on
And if I increase the number of output table steps, the job seems to hang and no data is inserted for a long time. I don't know why.
How can I increase the performance of inserting data from MongoDB into Greenplum with PDI (Kettle)?
Thank you.
There are a variety of factors that could be at play here.
Is PDI loading via an ODBC or JDBC connection?
What is the size of data? (row count doesn't really tell us much)
What is the size of your Greenplum cluster (number of hosts and number of segments per host)?
Is the table you are loading into indexed?
What is the network connectivity between Mongo and Greenplum?
The best bulk-load performance with data integration tools such as PDI, Informatica PowerCenter, IBM DataStage, etc. will be achieved using Greenplum's native bulk-loading utilities, gpfdist and gpload.
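As a rough sketch of what that mechanism looks like outside of PDI (host names, paths, database, table and column definitions below are all hypothetical), you stage the extracted file, serve it with gpfdist, and load it through an external table:

# serve the staging directory so the Greenplum segments can pull from it in parallel
gpfdist -d /data/staging -p 8081 &
# point an external table at the served file and insert from it
psql -d mydb -c "CREATE EXTERNAL TABLE ext_mongo_stage (id text, payload text) LOCATION ('gpfdist://etl-host:8081/mongo_extract.txt') FORMAT 'TEXT';"
psql -d mydb -c "INSERT INTO target_table SELECT * FROM ext_mongo_stage;"

If your PDI version ships a Greenplum bulk loader step, it wraps this same gpload/gpfdist mechanism and will be far faster than row-by-row inserts through a Table output step.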
Greenplum loves batches.
a) You can modify the batch size in the transformation with "Nr of rows in rowset".
b) You can modify the commit size in the Table output step.
I think a and b should match.
Find your optimum values (for example, we use 1000 for rows with big JSON objects inside).
Now, use the following connection property:
reWriteBatchedInserts=true
It rewrites the SQL from single inserts into batched inserts. It increased insert performance ten times in my scenario.
https://jdbc.postgresql.org/documentation/94/connect.html
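For reference, since Greenplum can be reached through the standard PostgreSQL JDBC driver, the property can also be appended directly to the connection URL (host and database names here are hypothetical):

jdbc:postgresql://gp-master:5432/mydb?reWriteBatchedInserts=true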
Thank you guys!
My question is similar to this one: essentially, I forgot a clause in a join when using MonetDB, which produced an enormous result that filled the disk on my computer. MonetDB didn't clean up after this, and despite freeing space and waiting 24 hours, the disk is still much fuller than it should be.
See below for the size of the database as reported by MonetDB (in GB):
sql>SELECT CAST(SUM(columnsize) / POWER(1024, 3) AS INT) columnSize FROM STORAGE();
+------------+
| columnsize |
+============+
| 851 |
+------------+
1 tuple
And the size of the farm on disk:
sudo du -hs ./*
3,2T ./data_warehouse
5,5M ./merovingian.log
The difference in size is unexplained and appeared suddenly after launching the query that generated an extremely large result.
I can track these files down to the merovingian.log file and the BAT directory inside the warehouse, where many large files named after integers with .tail or .theap extensions can be found:
sudo du -hs ./*
2,0T ./data_warehouse
1,3T ./merovingian.log
4,0K ./merovingian.pid
My question is how can I manually free this disk space without corrupting the database? Can any of these files be safely deleted or is there a command that can be launched to get MonetDB to free this space?
So far I've tried the following with no effect:
Restarting the database
Installing the latest version of the database (as I did the last time this happened); my current version is: MonetDB Database Server Toolkit v11.37.11 (Jun2020-SP1)
Various VACUUM and FLUSH commands documented here (note that VACUUM doesn't run on my version)
Checking online and reading the mailing list
Many thanks in advance for any assistance.
Normally, during the query execution, MonetDB will free up memory/files that are no longer needed. But if that doesn't happen, you can try the following manual clean up.
First, lock and stop the database (it's called warehouse?):
monetdb lock warehouse
monetdb stop warehouse
You can fairly safely remove merovingian.log to regain 1.3T (this log file can contain useful information for debugging, but at its current size it's a bit difficult to use). The kill command tells monetdbd to start a new log file:
rm /<path-to>/merovingian.log
kill -HUP `pgrep monetdbd`
Then restart the database:
monetdb release warehouse
monetdb start warehouse
During the start-up, the MonetDB server should clean up the left-over transient data files from the previous session.
Concerning the size difference between SUM(columnsize) and on-disk size:
there can be index files and string heap files. Their sizes are reported in separate columns returned by storage().
In your case, the database directory probably contains a lot of intermediate data files generated for the computation of your query.
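To see how much of the on-disk footprint the persistent data actually accounts for, you can sum those other size columns as well. A minimal sketch using mclient, assuming your version's storage() reports heapsize, hashes and imprints sizes in bytes alongside columnsize:

mclient -d warehouse -s "SELECT CAST(SUM(columnsize + heapsize + hashes + imprints) / POWER(1024, 3) AS INT) AS total_gb FROM storage();"

Whatever remains unaccounted for after that is most likely the left-over transient/intermediate data that the restart described above should clean up.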
I am trying to export data from files stored in HDFS into Vertica using Sqoop. For around 10k rows, the files get loaded within a few minutes. But when I try to run tens of millions of rows, only about 0.5% is loaded within 15 minutes or so. I have tried increasing the number of mappers, but it does nothing to improve efficiency. Even setting the chunk size to increase the number of mappers does not increase it.
Please help.
Thanks!
As you are using batch export, try increasing the records-per-transaction and records-per-statement parameters using the following properties (an example command follows):
sqoop.export.records.per.statement: aggregates multiple rows into a single insert statement.
sqoop.export.records.per.transaction: how many insert statements will be issued per transaction.
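A hedged sketch of how these can be passed on the command line; the connection string, credentials, table and export directory below are placeholders for your own:

sqoop export \
  -Dsqoop.export.records.per.statement=100 \
  -Dsqoop.export.records.per.transaction=100 \
  --connect jdbc:vertica://vertica-host:5433/mydb \
  --driver com.vertica.jdbc.Driver \
  --username user --password '****' \
  --table target_table \
  --export-dir /user/hdfs/export_dir \
  --batch

The generic -D options must come before the Sqoop-specific arguments, and --batch is what switches the export to JDBC batch mode in the first place.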
I hope this solves the issue.
Most MPP/RDBMS vendors provide Sqoop connectors to exploit parallelism and increase the efficiency of data transfer between HDFS and the MPP/RDBMS. It seems Vertica has taken this approach: http://www.vertica.com/2012/07/05/teaching-the-elephant-new-tricks/
https://github.com/vertica/Vertica-Hadoop-Connector
Is this a "wide" dataset? It might be a sqoop bug https://issues.apache.org/jira/browse/SQOOP-2920 if number of columns is very high (in hundreds), sqoop starts choking (very high on cpu). When number of fields is small, it's usually other way around - when sqoop is bored and rdbms systems can't keep up.
I am using the tFileInputJson and tMongoDBOutput components to store JSON data into a MongoDB Database.
When trying this with a small amount of data (nearly 100k JSON objects), the data can be stored into the database without any problems.
Now my requirement is to store nearly 300k JSON objects into the database and my JSON objects look like:
{
    "LocationId": "253b95ec-c29a-430a-a0c3-614ffb059628",
    "Sdid": "00DlBlqHulDp/43W3eyMUg",
    "StartTime": "2014-03-18 22:22:56.32",
    "EndTime": "2014-03-18 22:22:56.32",
    "RegionId": "10d4bb4c-69dc-4522-801a-b588050099e4",
    "DeviceCategories": [
        "ffffffff-ffff-ffff-ffff-ffffffffffff",
        "00000000-0000-0000-0000-000000000000"
    ],
    "CheckedIn": false
}
While I am performing this operation, I get the following exception:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
[statistics] disconnected
at java.util.Arrays.copyOf(Arrays.java:2367)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:130)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:114)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:415)
at java.lang.StringBuffer.append(StringBuffer.java:237)
at org.json.simple.JSONArray.toJSONString(Unknown Source)
at org.json.simple.JSONArray.toJSONString(Unknown Source)
at org.json.simple.JSONArray.toString(Unknown Source)
at samplebigdata.retail_store2_0_1.Retail_Store2.tFileInputJSON_1Process(Retail_Store2.java:1773)
at samplebigdata.retail_store2_0_1.Retail_Store2.runJobInTOS(Retail_Store2.java:2469)
at samplebigdata.retail_store2_0_1.Retail_Store2.main(Retail_Store2.java:2328)
Job Retail_Store2 ended at 15:14 10/11/2014. [exit code=1]
My current job looks like:
How can I store so much data into the database in a single job?
The issue here is that you're printing the JSON object to the console (with your tLogRow). This requires all of the JSON objects to be held in memory before finally being dumped all at once to the console once the "flow" is completed.
If you remove the tLogRow components then (in a job as simple as this) Talend should only hold whatever the batch size is for your tMongoDBOutput component in memory and keep pushing batches into MongoDB.
As an example, here's a screenshot of me successfully loading 100,000,000 rows of randomly generated data into a MySQL database:
The data set is about 2.5 GB on disk as a CSV, but it was comfortably handled in memory with a max heap space of 1 GB: each insert is 100 rows, so the job only really needs to keep 100 rows of the CSV (plus any associated metadata and Talend overheads) in memory at any one point.
In reality, it will probably keep significantly more than that in memory and simply garbage collect the rows that have been inserted into the database when the max memory is close to being reached.
If you have an absolute requirement to log the JSON records that are successfully put into the database, then you might try writing them to a file instead and streaming that output.
As long as you aren't getting too many invalid JSON objects in your tFileInputJson, you can probably keep the reject-linked tLogRow, as it will only receive the rejected/invalid JSON objects and so shouldn't run out of memory. Since you are restricted to small amounts of memory by being on a 32-bit system, you might need to be wary that if the number of invalid JSON objects grows, you will quickly exceed your memory space.
If you simply want to load a large amount of JSON objects into a MongoDB database, then you will probably be best off using the tMongoDBBulkLoad component. This takes a flat file (either .csv, .tsv or .json) and loads it directly into a MongoDB database. The documentation I just linked to shows all the relevant options, but you might be particularly interested in the --jsonArray additional argument that can be passed to the database. There is also a basic example of how to use the component.
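As far as I know, the component drives MongoDB's own mongoimport utility, so if you want to sanity-check the file and options outside of Talend first, a rough equivalent would be the following (database, collection and file names are just examples):

mongoimport --db mydb --collection locations --file /path/to/objects.json --jsonArray

Here --jsonArray tells mongoimport the file is one large JSON array of documents rather than one document per line.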
This means you couldn't do any processing midway through the load and you'd have to use a pre-prepared JSON/CSV file to load the data, but if you just want a quick way to load data into the database using Talend then this should cover it.
If you need to process chunks of the file at a time, then you might want to look at a much more complicated job with a loop where you load n records from your input, process them, and then restart the processing part of the loop, selecting the next n records by skipping a header of n records, then repeating with a header of 2n records, and so on...
Garpmitzn's answer pretty much covers how to change the JVM settings to increase memory space, but for something as simple as this you just want to reduce the amount you're keeping in memory for no good reason.
As an aside, if you're paying for an Enterprise licence of Talend then you should be able to get yourself a 64-bit box with 16 GB of RAM easily enough, and that will drastically help with your development. I'd at least hope that your production job execution server has a decent amount of memory.
I feel you are reading everything into Talend's memory. You have to play with the JVM parameters Xms and Xmx: you can increase Xmx to a bigger size than it is currently set to, e.g. if it's set to -Xmx2048 then increase it to -Xmx4096, or more.
These parameters are available in the .bat/.sh file of the exported job, or in Talend Studio you can find them under the Run Job tab > Advanced settings > JVM Settings.
But it is advisable to design the job in such a way that you don't load too much into memory.
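For example, in the .sh exported for this job the launch line would look something like the following (the real classpath is much longer and is left as a placeholder here; only the -Xms/-Xmx values need changing):

java -Xms1024M -Xmx4096M -cp <exported classpath> samplebigdata.retail_store2_0_1.Retail_Store2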
As the title says: how do you sort the file if your PC's memory is just 2 GB but there are ten billion URLs (assume that the longest URL is 256 chars)?
Your question is a little vague, but I'm assuming:
You have a flat file containing many URLs.
The URLs are delimited somehow, I'm assuming newlines.
You want to create a separate file without duplicates.
Possible solutions :
Write code to read each URL in turn from the file and insert it into a relational database. Make the URL the primary key, and any duplicates will be rejected.
Build your own index. This is a little more complex. You would need to use something like a disk-based B-tree implementation, then read each URL and add it to the B-tree, again checking for duplicates as you add to the tree.
However, given all the free database systems out there, solution 1 is probably the way to go (though see the simpler shell-based sketch below).
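If all you actually need is the sorted (or de-duplicated) file, there's an even simpler route: GNU sort already performs an external merge sort on disk, so 2 GB of RAM is enough as long as you have free disk space for the temporary runs. A sketch with illustrative file names:

# -S caps the in-memory buffer, -T puts the temporary merge files on a disk with space,
# -u drops duplicate lines (URLs) while merging
sort -u -S 1G -T /path/to/tmp urls.txt > urls_sorted_dedup.txt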
If you've got a lot of data, then Hadoop either is, or should be on your radar.
HDFS is used to store huge volumes of data, and there are a lot of tools for querying that data.
Data processing on Hadoop is very effective and fast. You can use tools like Hive, Pig, etc.
Yahoo uses this big-data technology to process huge amounts of data. Hadoop is also open source.
Refer to http://hadoop.apache.org/ for more.