How to pass an SQL string to a Paradox database using pypxlib?

I am currently trying to work with data files from a measuring system using Python.
The data is saved in the Paradox database format (file.db). I would like to find a fast way to read these files into pandas DataFrames.
What I have done so far is:
import pandas as pd
from pypxlib import Table

table = Table(path_to_file)
data_dict = {}
# Build one list per column; note that this iterates the whole table once per field.
for tab in table.fields:
    data_dict[tab] = []
    for row in table:
        data_dict[tab] += [row[tab]]
data_bas = pd.DataFrame.from_dict(data_dict)
This works, but since the tables are huge and I don't need all of the data, I would like to filter them. Is there a fast way to prefilter the data? Currently I need 11.3 seconds for about 130,000 rows in 13 columns, which is quite a lot when you have several hundred files. Maybe there is an option to connect SQL and pypxlib? I am grateful for any suggestions.
Thank you and best wishes, Daniel
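As far as I know, pypxlib does not expose an SQL interface, so any prefiltering has to happen in Python while iterating the table. A minimal sketch of filtering in a single pass before building the DataFrame (the field names Timestamp and Temperature and the threshold are hypothetical placeholders):

import pandas as pd
from pypxlib import Table

# Keep only the columns that are actually needed.
wanted_fields = ["Timestamp", "Temperature"]

table = Table(path_to_file)
rows = [
    {field: row[field] for field in wanted_fields}
    for row in table                    # single pass over the table
    if row["Temperature"] > 20.0        # prefilter before building the DataFrame
]
data_bas = pd.DataFrame(rows, columns=wanted_fields)

Reading each row only once and keeping only the needed columns already avoids the per-field re-iteration in the snippet above; how much of the 11.3 seconds that saves depends on how selective the predicate is.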

Related

Serializing SDO_GEOMETRY type to text really slow

I have been trying for a couple of days now to extract SDO_GEOMETRY records from an Oracle table into a CSV file via Microsoft Azure Data Factory (gen2). My select statement looks like this:
select t.MY_GEOM.get_WKT() from my_table t
where the MY_GEOM column is of type SDO_GEOMETRY. It works, but it's really, really slow: about 2 hours to pull 74000 records via this method.
Without that conversion (so, a plain select without .get_WKT()) it takes about 32 seconds, but of course the result is rubbish and unusable.
Is there some way to speed up the process? My guess is that the problem is on the server side, but I'm not a DBA and don't have direct access to it. I can connect to it via SQL Developer or from Data Factory.
The data contained there is just some LINESTRING(x1 y1, x2 y2, ...)
I also tried running SDO_UTIL.TO_WKTGEOMETRY to convert it, but it's equally slow.
If you have any suggestions, please let me know.
Kind regards,
Tudor
As far as I know, no additional burden is imposed on data sources or sinks by ADF, so it looks like the performance bottleneck is on the DB side, in the get_WKT() method.
Of course, you could refer to the tuning guides in this link to improve your transfer performance, especially the section on parallel copy. For each copy activity run, Azure Data Factory determines the number of parallel copies to use to copy data from the source data store to the destination data store; that is based on the DIUs.
I found a nice solution while searching for different approaches. As stated in some comments above, the solution that works for me consists of two steps:
Split the SDO_GEOMETRY LINESTRING entry into its coordinates via the following select statement:
SELECT t.id, nt.COLUMN_VALUE AS coordinates, rownum FROM my_table t, TABLE(t.SDO_GEOMETRY.SDO_ORDINATES) nt
I just use it in a plain Copy Activity in Azure Data Factory to save my raw files as CSVs into a Data Lake. The files are quite large, about 4 times bigger than the final version created by the next step.
Aggregate the coordinates back into a string via some Databricks Scala Spark code:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_list, max, udf}
import spark.implicits._

// concatenate the collected coordinate values into one comma-separated string
val mergeList = udf { strings: Seq[String] => strings.mkString(", ") }

val result = df
  .withColumn(
    "collected",
    collect_list($"coordinates").over(Window.partitionBy("id").orderBy("rownum"))
  )
  .groupBy("id")
  .agg(max($"collected").as("collected"))
  .withColumn("final_coordinates", mergeList($"collected"))
  .select("id", "final_coordinates")

val outputFilePrefix = s"$dataLakeFolderPath/$tableName"
val tmpOutputFolder = s"$outputFilePrefix.tmp"

// write a single CSV part file into a temporary folder
result
  .coalesce(1)
  .write
  .option("header", "true")
  .csv(tmpOutputFolder)

// copy the single part-*.csv out of the temporary folder, then clean up
val partitionPath = dbutils.fs.ls(tmpOutputFolder).map(_.path).filter(_.endsWith(".csv")).head
dbutils.fs.cp(partitionPath, s"$outputFilePrefix.csv")
dbutils.fs.rm(tmpOutputFolder, recurse = true)
The final_coordinates column contains my coordinates in the proper order (I had some issues with this), and I can simply save the file back into my storage account. In the end, I only keep the proper CSV file that I am interested in.
As I said, it's quite fast: about 2.5 minutes for the first step and a couple of seconds for the second one, compared to 2 hours before, so I'm quite happy with this solution.

Which performs better: an Excel formula, a pivot table, or VBA code?

I have a (growing) table of data with 40,000 rows and 20 columns.
I need to group these data (by month and week) and perform some simple operations (+ and /) between rows/columns.
I must be able to change the period in question and the specific rows to sum up. I know how to write macros, pivot tables, and formulas, but I haven't started yet, and I would like the recalculation process to be as fast as possible; I don't want to click a button and have everything freeze for minutes.
Do you have any idea on what could be the most efficient solution?
Thank you
Excel has its limits for storing and analyzing data at the same time.
If you're planning to build a growing database in MS Excel, at some point you will add so much data that the Excel files will stop working (or using them won't be time-effective).
Before you get to that point you should be looking for alternative storage options as a scalable data solution.
They can be simple, like an Access DB, SQLite, PostgreSQL, MariaDB, or even PowerPivot (though this can have its own issues).
Or more complex, like storing the data in a database, then adding an analysis cube and pulling smaller slices of data from these databases into Excel for analysis and reporting.
Regardless of what you end up doing, you will have to change how Excel interacts with the data.
You need to move all of the raw data to another system (Access or SQL are the easiest, but Excel supports a lot of other DB options) and pull smaller chunks of data back into Excel for time-effective analysis.
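As a rough illustration of the "store the raw data outside Excel, pull back small slices" idea, here is a minimal Python sketch using SQLite and pandas (the file names and the date/amount columns are hypothetical placeholders for the real 20-column layout):

import pandas as pd
import sqlite3

con = sqlite3.connect("measurements.db")

# One-off (or scheduled) load of the raw rows out of the workbook into the database.
raw = pd.read_excel("data.xlsx")
raw.to_sql("raw_data", con, if_exists="replace", index=False)

# Pull back only a small, pre-aggregated slice for the reporting workbook.
monthly = pd.read_sql_query(
    """
    SELECT strftime('%Y-%m', date) AS month,
           SUM(amount)             AS total_amount
    FROM raw_data
    WHERE date BETWEEN :start AND :end
    GROUP BY month
    ORDER BY month
    """,
    con,
    params={"start": "2023-01-01", "end": "2023-12-31"},
)
monthly.to_excel("monthly_report.xlsx", index=False)

Changing the reporting period then only means re-running a small query instead of recalculating formulas over the full table.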
Useful Links:
SQL Databases vs Excel
Using Access or Excel to manage your data

How to implement ORACLE to VERTICA replication?

I am in the process of creating an Oracle to Vertica process!
We are looking to create a Vertica DB that will run heavy reports. For now all is cool: Vertica is fast, space use is great, and all is well and nice, until we get to the main part: getting the data from Oracle to Vertica.
OK, the initial load is fine: dump to CSV from Oracle, load into Vertica, and the load times are a joke; no problem so far, everybody thinks it's a bad joke or there's some magic going on! Well, it is simply fast.
Now the bad part -> both databases are up and running (ORACLE/VERTICA), and I have data getting altered in ORACLE, so I need to replicate my data in VERTICA. What now?
From my tests and from what I understand about Vertica, inserts and updates are not to be used beyond maybe 20 per second at most, so real-time replication is out of the question.
So I was thinking of reading the archive log from Oracle and ETL-ing it to create CSV data with the new data, altered data, and deleted/changed values, and then applying it to VERTICA, but I cannot get a change list like this, because explicit data changes in VERTICA lead to slow performance.
So I am looking for some ideas about how I can solve this issue, knowing I cannot:
Alter my ORACLE production structure.
Use ORACLE env resources for filtering the data.
Use insert, update or delete statements in my VERTICA load process.
Things I depend on:
The use of the COPY command
Data consistency
A max 60 min window (every 60 min, new/altered data needs to go to VERTICA).
I have seen Continuent data replication, but it seems that nobody wants to sell their product; I cannot get in touch with them.
Would loading the whole data set into a new table and then swapping the tables be acceptable?
copy new() ...
-- you can swap tables in one command:
alter table old,new,swap rename to swap,old,new;
truncate table new;
Extract the data from Oracle (in .csv format) and load it using the Vertica COPY command. Write a simple shell script to automate this process.
I used to use Talend (ETL), but it was very slow, so I moved to the conventional process and it has really worked for me. Currently I am processing 18M records and my entire process takes less than 2 minutes.
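A minimal sketch of that extract-and-COPY cycle in Python rather than shell (assuming the cx_Oracle and vertica_python client libraries; the connection details, table names, and change filter are hypothetical):

import csv
import cx_Oracle
import vertica_python

ora = cx_Oracle.connect("user/password@oracle-host/ORCLPDB1")
vert = vertica_python.connect(host="vertica-host", user="dbadmin",
                              password="secret", database="reports")

# 1) Dump the new/altered rows from Oracle into a CSV file.
ora_cur = ora.cursor()
ora_cur.execute("SELECT * FROM source_table WHERE last_modified > SYSDATE - 1/24")
with open("delta.csv", "w", newline="") as f:
    csv.writer(f).writerows(ora_cur)

# 2) Bulk-load the CSV into Vertica with COPY (no row-by-row inserts).
vert_cur = vert.cursor()
with open("delta.csv", "r") as f:
    vert_cur.copy("COPY staging_table FROM STDIN DELIMITER ',' ABORT ON ERROR", f)
vert.commit()

ora.close()
vert.close()

From the staging table, the swap/rename approach shown above can move the rows into the reporting table without explicit insert, update, or delete statements.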

Deleting large number of rows of an Oracle table

I have a data table from my company which is 250 GB in size and has 35 columns. I need to delete around 215 GB of data, which is obviously a large number of rows to delete from the table. This table has no primary key.
What could be the fastest method to delete data from this table? Are there any tools in Oracle for such large deletion processes?
Please suggest the fastest way to do this using Oracle.
As said in the answer above, it's better to move the rows to be retained into a separate table and truncate the original table, because of a thing called the HIGH WATERMARK. More details can be found here: http://sysdba.wordpress.com/2006/04/28/how-to-adjust-the-high-watermark-in-oracle-10g-alter-table-shrink/. A plain delete operation will overwhelm what is called your UNDO TABLESPACE.
The "recovery model" term is rather applicable to MSSQL, I believe :).
Hope this clarifies the matter a bit.
Thanks.
Do you know which records need to be retained? How will you identify each record?
A solution might be to move the records to be retained to a temporary table, and then truncate the big table. Afterwards, move the retained records back.
Beware that the transaction log file might become very big because of this (but that depends on your recovery model).
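A rough sketch of that keep/truncate/re-insert sequence, driven from Python with cx_Oracle (the connection string, table name, and retention predicate are hypothetical, and in practice you would also have to deal with indexes, constraints, grants, and the fact that TRUNCATE cannot be rolled back):

import cx_Oracle

conn = cx_Oracle.connect("user/password@db-host/ORCLPDB1")
cur = conn.cursor()

# 1) Copy the rows to keep (~35 GB here) into a side table via CTAS.
cur.execute("""
    CREATE TABLE big_table_keep AS
    SELECT * FROM big_table
    WHERE created_date >= DATE '2023-01-01'
""")

# 2) Truncate the original table; this resets the high watermark and generates
#    almost no undo/redo, but it is DDL and cannot be rolled back.
cur.execute("TRUNCATE TABLE big_table")

# 3) Move the retained rows back with a direct-path insert, then commit.
cur.execute("INSERT /*+ APPEND */ INTO big_table SELECT * FROM big_table_keep")
conn.commit()

# 4) Drop the side table once everything has been verified.
cur.execute("DROP TABLE big_table_keep")
conn.close()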
We had a similar problem a long time ago. We had a table with 1 billion rows in it but had to remove a very large proportion of the data based on certain rules. We solved it by writing a Pro*C job to extract the data that we wanted to keep, apply the rules, and sprintf the data to be kept to a CSV file.
Then we created a sqlldr control file to upload the data using a direct-path load (which won't create undo/redo; and if you need to recover the table, you still have the CSV file until you do your next backup anyway).
The sequence was
Run the Pro*C job to create CSV files of the data to keep
Generate DDL for the indexes
Drop the indexes
Run SQL*Loader using the CSV files
Recreate the indexes using a parallel hint
Analyse the table using degree(8)
The amount of parallelism depends on the CPUs and memory of the DB server - we had 16 CPUs and a few gigs of RAM to play with, so it was not a problem.
The extract of the correct data was the longest part of this.
After a few trial runs, SQL*Loader was able to load the full 1 billion rows (that's a US billion, or 1000 million rows) in under an hour.
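For the load step, a minimal sketch of generating the SQL*Loader control file and invoking sqlldr with a direct-path load from Python (the column list, file names, and connection string are hypothetical; the real table has 35 columns):

import subprocess

control_file = """\
LOAD DATA
INFILE 'keep_rows.csv'
APPEND INTO TABLE big_table
FIELDS TERMINATED BY ','
(col1, col2, col3)
"""

with open("load_keep_rows.ctl", "w") as f:
    f.write(control_file)

# direct=true requests a direct-path load, which bypasses most undo/redo.
subprocess.run(
    ["sqlldr", "userid=user/password@db-host/ORCLPDB1",
     "control=load_keep_rows.ctl", "direct=true"],
    check=True,
)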

CSV viewer on a Windows environment for a 10MM-line file

We need a CSV viewer which can handle 10MM-15MM rows in a Windows environment, where each column has some filtering capability (some regex or text searching is fine).
I strongly suggest using a database instead, and running queries (e.g., with Access). With proper SQL queries you should be able to filter on the columns you need to see, without handling such huge files all at once. You may need to have someone write a script to input each row of the CSV file (and future CSV file changes) into the database.
I don't want to be the end user of that app. Store the data in SQL. Surely you can define criteria to query on before generating a .csv file. Give the user an online interface with the column headers and filters to apply. Then generate a query based on the selected filters, providing the user only with the lines they need.
This will save many people time, headaches and eye sores.
We had this same issue and used a 'report builder' to build the criteria for the reports prior to actually generating the downloadable csv/Excel file.
As others suggested, I would also choose an SQL database. It's already optimized to perform queries over large data sets. There are a couple of embedded databases, like SQLite or FirebirdSQL (embedded).
http://www.sqlite.org/
http://www.firebirdsql.org/manual/ufb-cs-embedded.html
You can easily import a CSV file into an SQL database with just a few lines of code and then build SQL queries, instead of writing your own solution to filter large tabular data.
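A minimal sketch of that import-and-query approach with Python's built-in sqlite3 module (the file name and the three column names are hypothetical placeholders for the real CSV header):

import csv
import sqlite3

con = sqlite3.connect("big_file.db")
con.execute("CREATE TABLE IF NOT EXISTS rows (id TEXT, name TEXT, value TEXT)")

# Stream the CSV into the database once; later runs can reuse big_file.db.
with open("big_file.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    con.executemany("INSERT INTO rows VALUES (?, ?, ?)", reader)
con.commit()

# Filter per column instead of scrolling through 10-15 million lines.
for row in con.execute("SELECT * FROM rows WHERE name LIKE ?", ("%search term%",)):
    print(row)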
