Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 7 years ago.
As part of my development I need to process some .csv files.
For what it's worth, I am writing a very fast CSV parser in Java.
I would like to ask if somebody can name some websites where I can find good CSV files to test my app with.
Please don't flag this question as inappropriate; I think developers would benefit from a list of good sites for finding sample data.
The baseball archive can be downloaded in CSV format. The batting statistics file contains a little over 90,000 rows of data which should be helpful in performance testing your app.
You can download the Sample CSV Data Files from this site.
Examples:
Sample Insurance Data
Real Estate Data
Sales Transactions Data
See also this question on sample data.
I've used http://www.fakenamegenerator.com for these purposes in the past.
Another good source is baseball reference. Pick whatever baseball player or manager you can think of.
http://www.baseball-reference.com/managers/coxbo01.shtml
This is a site that is in beta that can give you data in JSON, XML or CSV. All lists are customizable. This is a sample call to return data as CSV: http://mysafeinfo.com/api/data?list=dowjonescompanies&format=csv
Lists, formats and options are described in the documentation: http://mysafeinfo.com/content/documentation
Over 80 data sets are available; see the full list under Datasets on the main menu.
If you're looking for some large CSV files with real-world data, try http://www.baseball-databank.org.
Several very nice CSV files for testing: http://support.spatialkey.com/spatialkey-sample-csv-data/
Sample insurance portfolio,
Real estate transactions,
Sales transactions,
Company Funding Records,
Crime Records
Thank you for the question !
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 2 years ago.
I am studying a use case where we are going to move data from a SQL database (600TB, ~100 tables) into Hadoop in a transformed format. We don't have logs enabled in the SQL DB. We decided to copy the data as a datamart view and to refresh this view every week. The copied data will be erased and rewritten every week.
This SQL DB is used for reporting that is derived from the data lake. This OLTP database is an old system that we are progressively replacing. The copied dataset is deleted every week and copied again (refreshed).
80% of the data is copied as-is, with no transformation.
20% is redesigned.
We identified 3 options:
Airflow + Beam for the processing
ETL (Informatica) was excluded
Kafka (Connect, Streams, sink into Hadoop), optionally with Debezium for CDC
What do you think is the best approach with regard to performance, overall time to deliver, and data architecture?
Thanks for your help!
My thoughts - for what they are worth:
I would definitely not be looking to copy 600TB per week. Given that the majority of this data will not have changed from week to week (I assume), you should be looking to copy across only the data that has changed. As your data in Hadoop will be partitioned, you would mainly be inserting new data into new partitions; for those records that have changed you would just be dropping and reloading a few partitions.
I would copy all the necessary data into a staging area in Hadoop as-is (without transformation) and then process it on the Hadoop platform to produce the data you actually need. You can then drop the staging area data if you want.
Data processing tool - if you already have experience of a specific toolset within your company then use that; don't multiply the toolsets in use unless there is critical functionality required that is not available within existing tools. If this one process is all you are going to be using this toolset for then it probably doesn't matter which one you use - pick one that is quickest to learn/deploy. If this toolset is going to be expanded to other use cases then I would definitely use a dedicated ETL/ELT tool rather than use a coding solution (why have you discarded Informatica as a solution?)
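To make the "only reload the partitions that changed" idea from the first point a bit more concrete, here is a rough Python sketch. It assumes you can compute some cheap fingerprint per partition on both sides (for example a row count plus a checksum); the function name, the fingerprint format and the sample keys are all made up for illustration and do not come from any particular tool.

```python
def plan_refresh(source_fp, target_fp):
    """Compare per-partition fingerprints and decide what to reload or drop.

    Both arguments map a partition key (e.g. a date string) to a fingerprint
    such as (row_count, checksum). How the fingerprint is computed is an
    assumption left to the source system.
    """
    # Reload partitions that are new in the source or whose fingerprint differs.
    to_reload = sorted(k for k, fp in source_fp.items() if target_fp.get(k) != fp)
    # Drop partitions that no longer exist in the source.
    to_drop = sorted(k for k in target_fp if k not in source_fp)
    return to_reload, to_drop


# Hypothetical fingerprints: one unchanged, one changed, one new, one removed.
source = {"2023-01-01": (100, "a1"), "2023-01-02": (120, "ff"), "2023-01-03": (90, "c3")}
target = {"2023-01-01": (100, "a1"), "2023-01-02": (118, "b2"), "2022-12-31": (80, "d4")}
to_reload, to_drop = plan_refresh(source, target)
print("reload:", to_reload)
print("drop:", to_drop)
```

Everything that is not in either list is left alone, which is where the saving over a full weekly copy would come from.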
The following is definitely an opinion...
If you are building a new analytical platform, I am surprised that you are using Hadoop. Hadoop is legacy technology that has been superseded by more modern and capable Cloud data platforms (Snowflake, etc.).
Also, Hadoop is a horrible platform to try and run analytics on (it's ok as just a data lake to hold data while you decide what you want to do with it). Trying to run queries on it that don't align with how that data is partitioned gives really bad performance (for non-trivial dataset sizes). For example, if your transactions are partitioned by date then running a query to sum transaction values in the last week will run quickly. However, running a query to sum transactions for a specific account (or group of accounts) will perform very badly
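A toy illustration of the partitioning point, in plain Python rather than Hadoop: the data is grouped into one "partition" per date, and each query reports how many partitions it had to read. The data and function names are invented for the example; it is only meant to show why a date-range query can skip most partitions while an account query cannot.

```python
from collections import defaultdict
from datetime import date, timedelta


def build_partitions(transactions):
    """Group (tx_date, account, amount) rows into one partition per date."""
    partitions = defaultdict(list)
    for tx_date, account, amount in transactions:
        partitions[tx_date].append((account, amount))
    return partitions


def sum_for_dates(partitions, start, end):
    """Sum amounts in [start, end]; only matching partitions are read."""
    total, scanned = 0, 0
    for tx_date, rows in partitions.items():
        if start <= tx_date <= end:  # partition pruning on the key
            scanned += 1
            total += sum(amount for _, amount in rows)
    return total, scanned


def sum_for_account(partitions, account):
    """Sum amounts for one account; every partition has to be read."""
    total, scanned = 0, 0
    for rows in partitions.values():
        scanned += 1
        total += sum(amount for acc, amount in rows if acc == account)
    return total, scanned


# 30 days of made-up data, two accounts per day.
first_day = date(2023, 1, 1)
transactions = []
for offset in range(30):
    day = first_day + timedelta(days=offset)
    transactions.append((day, "A", 1))
    transactions.append((day, "B", 2))

partitions = build_partitions(transactions)
last_day = first_day + timedelta(days=29)
print(sum_for_dates(partitions, last_day - timedelta(days=6), last_day))
print(sum_for_account(partitions, "A"))
```

The second number of each printed pair is the count of partitions touched: the last-week query reads 7 of them, the account query reads all 30.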
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 2 years ago.
The Common Log Format is a standardized text file format used by web servers when generating server log files. Example:
127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326
Suppose that an Oracle database is used to store the access log of an e-commerce website with gigabytes of log data over the past six months. Discuss the options we may adopt and the steps involved such that the user can efficiently query all the IP addresses and the files accessed within any given time interval (with specified start time and end time).
If every log entry (such as the one you presented) is stored as one row in an Oracle table, see whether you can split it so that the IP address and the date value are stored in separate columns (this shouldn't be difficult if the format is fixed). Then index those columns; queries by time interval become simpler and faster.
If that's not the case, investigate Oracle Text capabilities.
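As a rough sketch of the "split into columns, then index" approach: the Python below parses a Common Log Format line into IP, timestamp and path, and loads the result into an indexed table. It uses an in-memory SQLite database purely as a stand-in so the example runs anywhere; in Oracle you would use a DATE or TIMESTAMP column and a regular index instead. The table, column and function names are my own assumptions.

```python
import re
import sqlite3
from datetime import datetime, timezone

# host ident authuser [timestamp] "method path protocol" status bytes
CLF_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "\S+ (\S+)[^"]*" \d{3} \S+$')


def parse_clf(line):
    """Return (ip, utc_timestamp_text, path), or None if the line does not match."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    ip, raw_ts, path = match.groups()
    # Normalise to UTC so that text comparison orders rows chronologically.
    when = datetime.strptime(raw_ts, "%d/%b/%Y:%H:%M:%S %z").astimezone(timezone.utc)
    return ip, when.strftime("%Y-%m-%d %H:%M:%S"), path


def load(conn, lines):
    """Create the (hypothetical) access_log table and insert the parsed lines."""
    conn.execute("CREATE TABLE IF NOT EXISTS access_log (ip TEXT, accessed_at TEXT, path TEXT)")
    conn.execute("CREATE INDEX IF NOT EXISTS access_log_time_idx ON access_log (accessed_at)")
    rows = [row for row in map(parse_clf, lines) if row is not None]
    conn.executemany("INSERT INTO access_log VALUES (?, ?, ?)", rows)


def accessed_between(conn, start, end):
    """Return (ip, path) pairs accessed within [start, end], both given in UTC."""
    query = ("SELECT ip, path FROM access_log "
             "WHERE accessed_at BETWEEN ? AND ? ORDER BY accessed_at")
    return conn.execute(query, (start, end)).fetchall()


sample = '127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326'
conn = sqlite3.connect(":memory:")
load(conn, [sample])
print(accessed_between(conn, "2000-10-10 00:00:00", "2000-10-10 23:59:59"))
```

With the timestamp in its own indexed column, the interval query is a simple range scan rather than a text search over the raw log line.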
Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 8 years ago.
I would like to ask whether you write unit tests for your database and, if so, what your experiences are.
Are the tests worth the effort? Do you test only high-level procedures, or functions as well? What are the best practices?
Testing best practices for PL/SQL or any DB for that matter:
Software 101: the earlier you catch a bug, the less expensive it is to fix. By that adage, all code going into production should be tested, and PL/SQL is no exception. Testing is always worth the effort; there is no ambiguity there.
Database testing should be done at two levels - for the data and about the data
For the data: this covers metrics about the data loaded and about the process. For example, define a sample data set and calculate the row counts you expect in the target tables after the test case is executed.
Second, performance test cases: these test the process. For example, if you load the full production set, how long does it take? Again, you don't want to uncover a performance issue in production.
About the data: this is more business testing, i.e. whether the loaded data matches the expected functionality. For example, if you are aggregating sales reps to their parent companies, is the one-to-many relationship between company and sales rep still valid after you run the test case?
Always create a test query that results in a number, e.g. select the count of sales reps that are not associated with any company. If the count > 0, the test fails.
It's a good idea to put test cases, their results, test query and actual result in a table so that you can review them and slice and dice if required.
You can write a stored procedure to automate running the test queries from the table; this can be repeated very easily and can even be embedded in a batch job or a GUI screen.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 7 years ago.
I have been asked to extract the Oracle database data dictionary using a tool. They used to do that with PowerDesigner 12.5: they generate a report in HTML format. This report includes all table and column information, and programmers can easily read it. The bad thing about it is that it takes about a week to produce such a report (reverse engineering, customizing...). They are trying to find a fast tool so they can generate the data dictionary daily.
For now I have found Oracle Data Modeler, and I will download it to see whether it is fast enough.
My question: do you know a fast tool to quickly generate a data dictionary report?
Oracle's SQL Developer tool will produce an html formatted data dictionary very quickly and easily, as I recall. The data modeller functionality is probably more complex than you need.
The poor man's solution: generate an HTML report from SQL*Plus.
-- Print each owner and table name once, with a blank line between tables
break on owner on table_name skip 1
-- Emit query output as HTML
set markup html on
spool dailyDataDictionaryReport.html
select owner, table_name, column_name, data_type, data_length, data_precision
from all_tab_columns
where owner not in ('SYS','SYSOPER','XDB')
order by owner, table_name, column_id;
spool off
set markup html off
I think TOAD might have a good wizard for that.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 3 years ago.
I'm looking for a document-oriented db with a Ruby API that has SQLite-like properties:
self-contained,
serverless,
zero-configuration.
Are there light alternatives to MongoDB or CouchDB?
Is RDDB a possibility?
If not, what are the best paths to walk then?
I know, the question was asked 5 years ago, but just for completeness' sake, embedded MongoDB has happened since:
https://github.com/hamiltop/MongoLiteDB
It's not ready yet, but an embeddable version of CouchDB is on the long-term roadmap.
Replication is intended to enable offline applications with CouchDB. If you ended up with very specific needs, you could replicate data from CouchDB to a local data structure, store it locally, update it, and push the data back via replication, but it would take some code.
If you were using Perl, I'd recommend DBM::Deep, which stores arbitrary data structures on disk, including transactions with commit/rollback, and it's a non-C one-Perl-module install. Doesn't get much lighter than that.
I almost feel you could do some sort of hack to achieve this.
Have a table that uses SQLite's row ids, along with a field for the collection name and a text blob holding the document as JSON.
Have another table for indexing the fields in a collection (collection name, field name, field value, document row id).
You could do some wrapper class to handle things like updates and lookups. Would be interesting.
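For what it's worth, here is roughly what that hack could look like. The sketch is in Python with the standard sqlite3 module rather than Ruby, only because it runs without installing anything; the same two-table layout should carry over to Ruby's SQLite bindings. The class name, table names and method names are invented for illustration, and only top-level scalar fields are indexed.

```python
import json
import sqlite3


class DocStore:
    """Tiny document store on SQLite: one table of JSON blobs, one index table."""

    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)
        self.conn.executescript("""
            CREATE TABLE IF NOT EXISTS documents (
                id INTEGER PRIMARY KEY, collection TEXT, body TEXT);
            CREATE TABLE IF NOT EXISTS doc_index (
                collection TEXT, field TEXT, value TEXT, doc_id INTEGER);
            CREATE INDEX IF NOT EXISTS doc_index_lookup
                ON doc_index (collection, field, value);
        """)

    def _index(self, collection, doc_id, doc):
        # Index only top-level scalar fields; values are stored JSON-encoded
        # so that 1 and "1" stay distinct.
        rows = [(collection, field, json.dumps(value), doc_id)
                for field, value in doc.items()
                if value is None or isinstance(value, (str, int, float, bool))]
        self.conn.executemany("INSERT INTO doc_index VALUES (?, ?, ?, ?)", rows)

    def insert(self, collection, doc):
        cur = self.conn.execute(
            "INSERT INTO documents (collection, body) VALUES (?, ?)",
            (collection, json.dumps(doc)))
        self._index(collection, cur.lastrowid, doc)
        return cur.lastrowid

    def update(self, collection, doc_id, doc):
        # Replace the blob and rebuild the index entries for this document.
        self.conn.execute("UPDATE documents SET body = ? WHERE id = ?",
                          (json.dumps(doc), doc_id))
        self.conn.execute("DELETE FROM doc_index WHERE doc_id = ?", (doc_id,))
        self._index(collection, doc_id, doc)

    def find(self, collection, field, value):
        rows = self.conn.execute(
            "SELECT d.id, d.body FROM doc_index i "
            "JOIN documents d ON d.id = i.doc_id "
            "WHERE i.collection = ? AND i.field = ? AND i.value = ? "
            "ORDER BY d.id",
            (collection, field, json.dumps(value)))
        return [(doc_id, json.loads(body)) for doc_id, body in rows]


store = DocStore()
alice = store.insert("people", {"name": "Alice", "city": "Oslo"})
store.insert("people", {"name": "Bob", "city": "Oslo"})
store.update("people", alice, {"name": "Alice", "city": "Bergen"})
print(store.find("people", "city", "Oslo"))
```

It stays self-contained, serverless and zero-configuration because SQLite does all the storage work; the obvious gaps are nested fields, range queries and deletes.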