Serializing SDO_GEOMETRY type to text really slow - Oracle

I am trying for a couple of days now to extract SDO_GEOMETRY records from an Oracle table into a CSV file via Microsoft Azure Data Factory (gen2). My select statement looks like this:
select t.MY_GEOM.get_WKT() from my_table t
where the MY_GEOM column is of type SDO_GEOMETRY. It works, but it's really, really slow: about 2 hours to pull 74,000 records this way.
Without that conversion (a plain select without .get_WKT()) it takes about 32 seconds, but of course the result is rubbish and unusable.
Is there some way to speed up the process? My guess is that the problem is on the server side, but I'm not a DBA and don't have direct access to it. I can connect to it via SQL Developer or from Data Factory.
The data contained there is just some LINESTRING(x1 y1, x2 y2, ...)
I also tried running SDO_UTIL.TO_WKTGEOMETRY to convert it, but it's equally slow.
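For reference, the TO_WKTGEOMETRY attempt looked roughly like this (same table and column as above; treat it as a sketch):
select SDO_UTIL.TO_WKTGEOMETRY(t.MY_GEOM) from my_table t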
If you have any suggestions, please let me know.
Kind regards,
Tudor

As far as I know, ADF itself does not put any additional burden on data sources or sinks, so this looks like a performance bottleneck on the database side caused by the get_WKT() method.
Of course, you could refer to the tuning guides in this link to improve your transfer performance, especially the section on parallel copy. For each copy activity run, Azure Data Factory determines the number of parallel copies to use to copy data from the source data store to the destination data store; that is based on the DIU.

I found a nice solution while searching for different approaches. As stated in some of the comments above, the solution that works for me consists of two steps:
Split the SDO_GEOMETRY LINESTRING entry into its coordinates via the following select statement:
SELECT t.id, nt.COLUMN_VALUE AS coordinates, rownum FROM my_table t, TABLE(t.MY_GEOM.SDO_ORDINATES) nt
I just use it in a plain Copy Activity in Azure Data Factory to save the raw files as CSVs into a Data Lake. The files are quite large, about 4 times bigger than the final version created by the next step.
Aggregate the coordinates back into a string via some Databricks Scala Spark code
// df is the DataFrame read from the step 1 CSV files (columns: id, coordinates, rownum)
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._   // for the $"column" syntax

// join the exploded coordinate values back into one comma-separated string
val mergeList = udf { strings: Seq[String] => strings.mkString(", ") }

val result = df
  .withColumn("collected",
    collect_list($"coordinates").over(Window.partitionBy("id").orderBy("rownum")))
  .groupBy("id")
  .agg(max($"collected").as("collected"))   // the last window row per id holds the complete ordered list
  .withColumn("final_coordinates", mergeList($"collected"))
  .select("id", "final_coordinates")

val outputFilePrefix = s"$dataLakeFolderPath/$tableName"
val tmpOutputFolder = s"$outputFilePrefix.tmp"

result
  .coalesce(1)               // write a single part file
  .write
  .option("header", "true")
  .csv(tmpOutputFolder)

// Spark writes a folder of part files; copy the single part file out under a proper name
val partition_path = dbutils.fs.ls(tmpOutputFolder).filter(_.name.startsWith("part-")).head.path
dbutils.fs.cp(partition_path, s"$outputFilePrefix.csv")
dbutils.fs.rm(tmpOutputFolder, recurse = true)
The final_coordinates column contains my coordinates in the proper order (I had some issues with this), and I can simply save the file back into my storage account. In the end, I only keep the proper CSV file that I am interested in.
As I said, it's quite fast: about 2.5 minutes for the first step and a couple of seconds for the second one, compared to 2 hours, so I'm quite happy with this solution.

Related

How to pass an SQL string to a paradox database using pypxlib?

I am currently trying to work with data files from a measuring system using Python.
The data is saved in the Paradox database format (file.db). I would like to find a fast way to read these files into pandas DataFrames.
What I did so far is:
import pandas as pd
from pypxlib import Table

table = Table(path_to_file)
data_dict = {}
for tab in table.fields:
    data_dict[tab] = []
    for row in table:          # iterates over the whole table once per column
        data_dict[tab] += [row[tab]]
data_bas = pd.DataFrame.from_dict(data_dict)
This works, but since the data tables are huge and I don't need all of the data, I would like to filter them. Is there a fast way to pre-filter the data? Currently I need 11.3 seconds for about 130,000 rows in 13 columns, which is quite a lot if you have several hundred files. Maybe there is an option to connect SQL and pypxlib? I am grateful for any suggestions.
Thank you and best wishes, Daniel

How to avoid data duplicates in ClickHouse

I already read this but I still have questions. I only have one VM with 16 GB of RAM, 4 cores and a 100 GB disk, with only ClickHouse and a light web API running on it.
I'm storing leaked credentials in a database:
CREATE TABLE credential (
    user String,
    domain String,
    password String,
    first_seen Date,
    leaks Array(UInt64)
) ENGINE = ReplacingMergeTree
PARTITION BY first_seen
ORDER BY (user, domain, password, first_seen)
It happens that some credentials appear more than once (within a file or across several files).
My long-term objective is (was) the following:
- when inserting a credential that is already in the database, I want to keep the smaller first_seen and add the new leak id to the leaks field.
I have tried the ReplacingMergeTree engine: I inserted the same data twice ($ cat "data.csv" | clickhouse-client --query 'INSERT INTO credential FORMAT CSV') and then ran OPTIMIZE TABLE credential to force the replacing engine to do its asynchronous job, as described in the documentation. Nothing happens; the data is in the database twice.
So I wonder:
- what did I miss with the ReplacingMergeTree engine?
- how does OPTIMIZE work, and why doesn't it do what I was expecting from it?
- is there a real solution for avoiding duplicated data on a single instance of ClickHouse?
I have already tried to do it manually. My problem is that I have 4.5 billion records in my database, and identifying duplicates inside a 100k-entry sample takes almost 5 minutes with the following query: SELECT DISTINCT user, domain, password, count() as c FROM credential WHERE has(leaks, 0) GROUP BY user, domain, password HAVING c > 1. This query obviously does not work on the 4.5b entries, as I do not have enough RAM.
Any ideas will be tried.
Multiple things are going wrong here:
You partition very granularly... you should partition by something like a month of data. Right now ClickHouse has to scan lots of files.
You don't provide the table engine with a version. The problem here is that ClickHouse is not able to find out which row should replace the other.
I suggest you use the "version" parameter of the ReplacingMergeTree, as it allows you to provide an incremental version as a number or, if this works better for you, the current DateTime (where the latest DateTime always wins).
You should never design your solution to require OPTIMIZE to be called to make your result sets consistent; it is not designed for this.
ClickHouse always allows you to write a query that provides (eventual) consistency without using OPTIMIZE beforehand.
Another reason for avoiding OPTIMIZE, besides it being really slow and heavy on your DB, is that you could end up with race conditions, where other clients of the database (or replicating ClickHouse nodes) invalidate your data between the moment OPTIMIZE finishes and the moment the SELECT is done.
Bottom line, as a solution:
What you should do here is add a version column. Then, when inserting rows, insert the current timestamp as the version.
Then select, for each row, only the one that has the highest version in your result, so that you do not depend on OPTIMIZE for anything other than garbage collection.
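A minimal sketch of what that could look like, reusing the table from the question (the version column, the monthly partitioning, and the read-time query are the suggested changes; treat this as a sketch, not a drop-in schema):
CREATE TABLE credential (
    user String,
    domain String,
    password String,
    first_seen Date,
    leaks Array(UInt64),
    version DateTime DEFAULT now()
) ENGINE = ReplacingMergeTree(version)
PARTITION BY toYYYYMM(first_seen)    -- month-sized partitions instead of one per day
ORDER BY (user, domain, password);   -- first_seen left out so duplicate credentials share a sorting key

-- At read time, keep only the newest version per credential yourself instead of
-- depending on OPTIMIZE; background merges only deduplicate within a partition anyway.
SELECT
    user, domain, password,
    argMax(first_seen, version) AS first_seen,  -- min(first_seen) would keep the earliest sighting instead
    argMax(leaks, version)      AS leaks
FROM credential
GROUP BY user, domain, password;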

Comparing delimited files for database update

Any suggestions on packages (or methodologies) that might help with this? I need to take a ~40 MB file we receive weekly and determine what changed from the previous file to the current one. Whatever those changes are then need to be applied to a single, simple database table. In a previous life I accomplished something similar via Linux diff with the -Hae parameters, resulting in an "ed script". The contents were then handled by a Perl program, using Tie::File to reference the changed records in the previous file. In an effort to strengthen my Go skills I'm trying to use it for this task. https://github.com/sergi/go-diff looks like it might be the ticket, but I'm not sure the "patch" output will quite do what I need (easily).
Fixed-width and/or delimited text files are still commonly used; does anyone have samples, pointers, or suggestions on packages that might help in dealing with them this way?
Are you sure you need the intermediate step? 40 MB is not very much, and your database engine is specialized in handling data like that.
For instance with postgresql just load the new data into a temporary table:
create table temptable (
    a varchar,
    b varchar,
    c varchar
);
copy temptable from '/path/to/csv/newdata.txt' delimiter ',' csv;
Then you can use simple SQL query to get the lines that do not have exact match in the old data, for example:
select *
from temptable t
where not exists (
    select 1
    from oldtable o
    where t.a=o.a and t.b=o.b and t.c=o.c
)
If you did not save the data from the previous week's batch, then just remember to copy it to another table for storage. Now the real question is what you want to do with the information, but you should be able to handle most scenarios.
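A sketch of the complementary direction, under the same assumed table names (temptable holds this week's file, oldtable last week's): rows that disappeared since last week, plus rolling the tables over for the next run.
select *
from oldtable o
where not exists (
    select 1
    from temptable t
    where t.a=o.a and t.b=o.b and t.c=o.c
);

-- afterwards, keep this week's data as next week's baseline
drop table oldtable;
alter table temptable rename to oldtable;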

Full-text indexing sluggish. Looking for alternatives

I have a table that I've created a Full Text Catalog on. The table has just over 6000 rows. I've added two columns to the index. The first could be considered a unique identifier of sorts and the second could be considered the content for that item (there are 11 other columns in my table that aren't part of the Full Text Catalog). Here is an example of a couple of rows:
TABLE: data_variables
ROW unique_id label
1 A100d1 Personal preference of online shopping sites
2 A100d2 Shopping behaviors for adults in household
In my web application on the front end, I have a text box that the user can type into to get a list of items that match whatever terms they're searching for in the UNIQUE ID or LABEL columns. So, for example, if the user typed in sho or a100 then a list would be populated with both of the rows above. If they typed in behav then a list would be populated with only row 2 above.
This is done via an Ajax request on each keyup. PHP calls a Stored Procedure on the SQL server that looks like:
SELECT TOP 50 dv.id, dv.id + ': ' + dv.label,
dv.type_id, dv.grouping, dv.friendly_label
FROM data_variables dv
WHERE (CONTAINS((dv.unique_id, dv.label), @search))
(@search is the text from the user that is passed into the stored procedure.)
I've noticed that this gets pretty sluggish, especially when I wasn't using TOP 50 in the query.
What I'm looking for is a way to speed this up either directly on the SQL Server or by abandoning the full-text indexing idea and using jQuery to search through an array of the searchable items on the client-side. I've looked a bit into the jQuery AutoComplete stuff and some other jQuery plugins for AutoComplete, but haven't yet tried to mock up anything. That would be my next step, but I wanted to check here first to see what advice I would get.
Thanks in advance.
Several suggestions, based around the fact that you have only 6000 rows, so the database should eat this alive.
A. Try using the LIKE operator, just in case it helps. I'm not expecting it to, but it's pretty trivial to try. Something else is going on here for this to feel slow at such small volumes.
B. Can you cache queries in advance? With 6000 rows, there are probably only 36*36 combinations of 2-character queries, which should take virtually no memory and would save the database any work (see the sketch after this list).
C. Moving the selection out to the client is a good idea; it depends on how big the 6000 rows are overall versus the network latency of individual lookups.
D. Combining B and C will give you really good performance, I suspect, but it requires some coding effort. If the server maintains a cached list of all single-character results, and clients download that letter's cache set after the first keystroke, then they potentially have a subset of all rows and won't need to do more network I/O for additional keystrokes.
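As a rough sketch of option B (the search_cache table and its columns are invented for illustration; 'sh' stands in for whatever two characters were typed):
-- hypothetical cache table keyed by the two characters the user typed
CREATE TABLE search_cache (
    prefix varchar(2) NOT NULL,
    id int NOT NULL,
    label varchar(255) NOT NULL
);

-- populate one cache entry; repeat (or loop) over the 36*36 combinations
INSERT INTO search_cache (prefix, id, label)
SELECT 'sh', dv.id, dv.label
FROM data_variables dv
WHERE dv.unique_id LIKE '%sh%' OR dv.label LIKE '%sh%';

-- the Ajax handler then reads from the cache instead of hitting the full-text index
SELECT TOP 50 id, label FROM search_cache WHERE prefix = 'sh';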
I would advise against LIKE unless the pattern is anchored on the left (e.g. LIKE 'word%') so a regular index can be used. If you're doing something like LIKE '%word%', a regular index isn't going to help you. You typically want to use a full-text index when you want to search for words inside a paragraph.
With a lot of data, the built-in full-text engines in databases typically aren't very stellar. For the best performance you typically have to go with an external solution that is built specifically for full-text search.
Some options are Sphinx, Solr, and elasticsearch, just to name a few. I wouldn't say that any of these options are better than the other. There are definitely pros and cons to consider:
What kind of data do you have?
What language support do these solutions have?
What database engines do these solutions support?
The best thing you can do is benchmark these solutions against your existing data. Testing each and every individual component (unit testing) can help you identify the real problems and help you find good solutions.
I had the same problem and went for the LIKE solution. I also found the OR operator to be too taxing, so I divided the query into two selects combined with a UNION ALL (fastest, and in my scenario it was impossible for the same text to appear in both the index column and the data).
Yours would look something like:
SELECT TOP 50 *
FROM (
    SELECT dv.id, dv.id + ': ' + dv.label AS display_label,
           dv.type_id, dv.grouping, dv.friendly_label
    FROM data_variables dv
    WHERE dv.unique_id LIKE '%' + @search + '%'
    UNION ALL
    SELECT dv.id, dv.id + ': ' + dv.label,
           dv.type_id, dv.grouping, dv.friendly_label
    FROM data_variables dv
    WHERE dv.label LIKE '%' + @search + '%'
) AS results -- the derived table needs an alias, and its columns need names
Oh!! And test the performance in SQL Server, not the web!
If you plan to increase the amount of data, the best approach is to use an inverted index for full-text searching.
Look at Apache Solr, one of the best full-text search engines at the moment.
You can simply index your database data periodically and use Solr as the search engine;
it provides a simple Ajax API and can be queried directly from the front end.
If you really need performance, you may want to look at SQLite's FTS3 and FTS4.
A snippet from another forum:
For example, if each of the 517430 documents in the "Enron E-Mail Dataset" is inserted into both an FTS table and an ordinary SQLite table created using the following SQL script:
Code:
CREATE VIRTUAL TABLE enrondata1 USING fts3(content TEXT); /* FTS3 table */
CREATE TABLE enrondata2(content TEXT);                    /* Ordinary table */
Then either of the two queries below may be executed to find the number of documents in the database that contain the word "linux" (351). Using one desktop PC hardware configuration, the query on the FTS3 table returns in approximately 0.03 seconds, versus 22.5 seconds for querying the ordinary table.
see...
http://www.sqlite.org/fts3.html
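For reference, the two queries being timed in that comparison look roughly like this (enrondata1 and enrondata2 are the tables from the snippet above):
SELECT count(*) FROM enrondata1 WHERE content MATCH 'linux';   /* ~0.03 seconds, uses the FTS3 index */
SELECT count(*) FROM enrondata2 WHERE content LIKE '%linux%';  /* ~22.5 seconds, full table scan */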

Entity framework and performance

I am trying to develop my first web project using Entity Framework. While I love the way you can use LINQ instead of writing SQL, I do have some severe performance issues. I have a lot of unhandled data in a table which I would like to do a few transformations on and then insert into another table. I run through all the objects and then insert them into my new table. I need to do some small comparisons (which is why I need to insert the data into another table), but for the performance tests I have removed them. The following code (with approximately 12-15 properties to set) took 21 seconds, which is quite a long time. Is it usually this slow, and what might I be doing wrong?
DataLayer.MotorExtractionEntities mee = new DataLayer.MotorExtractionEntities();
List<DataLayer.CarsBulk> carsBulkAll = ((from c in mee.CarsBulk select c).Take(100)).ToList();
foreach (DataLayer.CarsBulk carBulk in carsBulkAll)
{
    DataLayer.Car car = new DataLayer.Car();
    car.URL = carBulk.URL;
    car.color = carBulk.SellerCity.ToString();
    // car.year = ...  (roughly 12-15 more properties are set the same way)
    mee.AddToCar(car);
}
mee.SaveChanges();
You cannot create batch updates using Entity Framework.
Imagine you need to update rows in a table with a SQL statement like this:
UPDATE table SET col1 = @a WHERE col2 = @b
Using SQL this is just one roundtrip to the server. Using Entity Framework, you have (at least) one roundtrip to the server loading all the data, then you modify the rows on the client, then it will send it back row by row.
This will slow things down especially if your network connection is limited, and if you have more than just a couple of rows.
So for this kind of updates a stored procedure is still a lot more efficient.
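For the insert scenario in the question, the same idea can be sketched as one set-based statement (table and column names are taken from the question's code; the remaining columns and any needed conversions are left as placeholders):
INSERT INTO Car (URL, color /* , year, ... the other 12-15 columns */)
SELECT cb.URL, cb.SellerCity /* , cb.year, ... */
FROM CarsBulk cb;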
I have been experimenting with the entity framework quite a lot and I haven't seen any real performance issues.
Which row of your code is causing the big delay, have you tried debugging it and just measuring which method takes the most time?
Also, the complexity of your database structure could slow down the entity framework a bit, but not to the speed you are saying. Are there some 'infinite loops' in your DB structure? Without the DB structure it is really hard to say what's wrong.
Can you try the same in straight SQL?
The problem might be related to your database and not the Entity Framework. For example, if you have massive indexes and lots of check constraints, inserting can become slow.
I've also seen problems at insert with databases which had never been backed-up. The transaction log could not be reclaimed and was growing insanely, causing a single insert to take a few seconds.
Trying this in SQL directly would tell you if the problem is indeed with EF.
I think I solved the problem. I had been running the app locally while the database is in another country (a neighbouring one, but nevertheless). I loaded the application onto the server and ran it from there, and it then took only 2 seconds instead of 20. I also tried transferring 1000 records, which took 26 seconds; that is quite an improvement, though I don't know whether this is the "regular" speed for saving 1000 records to the database.
