Replacing characters & numbers in parquet files - parquet

We'd like to use production data to test performance in our dev environment, but the company doesn't allow us to use production data in dev. So we're looking to replace characters in specific fields of the parquet files to the point where the result is no longer considered production data, so we can test with it in dev. The requirement is that the data structure/characteristics must remain the same: if we change "John" to "Alex", then every "John" must become "Alex", and the same goes for addresses, phone numbers, IP addresses, etc.
Can we easily achieve this somehow with parquet files?
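One way to approach this (a minimal sketch, not a hardened anonymisation tool) is to rewrite the sensitive columns with a keyed, deterministic hash, so that identical inputs always map to identical outputs and the consistency requirement is preserved. The sketch below assumes pandas with a pyarrow backend and that the file fits in memory; the column names and the key are hypothetical.

    import hashlib
    import hmac

    import pandas as pd

    SECRET_KEY = b"keep-this-out-of-dev"                          # hypothetical key
    SENSITIVE_COLUMNS = ["first_name", "address", "phone", "ip"]  # hypothetical columns

    def pseudonymise(value):
        """Map a value to a stable, irreversible token (same input -> same output)."""
        if pd.isna(value):
            return value
        digest = hmac.new(SECRET_KEY, str(value).encode("utf-8"), hashlib.sha256).hexdigest()
        # Keep a similar length so the field characteristics stay roughly comparable.
        return digest[: max(len(str(value)), 8)]

    df = pd.read_parquet("production.parquet")
    for col in SENSITIVE_COLUMNS:
        if col in df.columns:
            df[col] = df[col].map(pseudonymise)
    df.to_parquet("dev_safe.parquet", index=False)

Note that a truncated HMAC gives pseudonymisation rather than a formal anonymisation guarantee; whether that is enough to count as "not production data" is a policy question, not a technical one.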

Related

Do all my log sources need to use the same field names with the ELK stack?

If I have a number of different sources for logs all logging to the same Elasticsearch cluster, do all those sources need to use the same field names for the same information if I want to make correlations between different sources?
For example, say three different sources used the names "Machine", "MachineName", and "Computer" to identify the computer that sent the log message. Is there any way to associate these values with one another? I didn't see anything like a "field alias" in the documentation.
Is there any way around having to keep track of all these property names between sources?
Are there any documented conventions for field names?

Is it still best to store images and files on a file system or non-RDBMS

I am working on a system which will store users' pictures and, in the future, some soft documents as well.
Number of users: 4000+
Transcripts and other documents per user: 10 MB
Total system requirement in first year: 40 GB
Additional increment each year: 10%
Reduction due to archiving each year: 10%
Saving locally on an Ubuntu Linux system without any fancy RAID.
Using MySQL community edition for application.
Simultaneous Users: 10 to 20
Documents are for historical purposes and will not be accessed frequently.
I always thought it was cumbersome to store files in an RDBMS due to the multiple layers needed to access them. However, since non-RDBMS databases use key/value pairs, is it still better to store the documents in the file system or in the DB? Thanks for any pointers.
A similar question was asked about 7 years ago (storing uploaded photos and documents - filesystem vs database blob). I hope the technology has changed since then, with all the NoSQL databases in the mix; hence I am asking this again.
Please correct me if I should be doing something else instead of raising a fresh question.
It really depends: notably on the DBMS considered, the file system, whether the data is remote or local, the total size of the data (petabytes are not the same as gigabytes), the number of users/documents, etc.
If the data is remote on 1 Gb/s Ethernet, the network is the bottleneck, so using a DBMS won't add significant additional overhead. See the answers section of this interesting webpage, or STFW for "Approximate timing for various operations on a typical PC".
If the data is local, things matter much more (but few computers have a petabyte of SATA disks). Most filesystems on Linux use some minimal block size (e.g. 1 KB, 4 KB, ...) per file.
A possible approach might be to have some threshold (typically 4 or 8 KB, or perhaps even 64 KB, i.e. several pages; YMMV). Data smaller than the threshold could be stored directly in a database field; data bigger than it could live in a file, with the database containing the file path. Read about BLOBs in databases.
Consider not only RDBMSs like PostgreSQL, but also NoSQL solutions à la MongoDB, key-value stores à la Redis, etc.
For a local data approach, consider not only plain files, but also SQLite, GDBM, etc. If you use a file system, avoid very wide directories: instead of having widedir/000001.jpg ... widedir/999999.jpg, organise it as dir/subdir000/001.jpg ... dir/subdir999/999.jpg, and have no more than a thousand entries per directory.
If you use a local MySQL database and don't have a lot of data (e.g. less than a terabyte), you might store any raw data smaller than, say, 64 KB directly in the database, and store bigger data in individual files (whose paths go into the database); but you should still avoid very wide directories for those files.
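A minimal sketch of that threshold-plus-sharding scheme follows. The 64 KB cutoff, the documents table layout and the storage root are illustrative assumptions, and db stands for whatever DB-API cursor or wrapper you use:

    import hashlib
    import os

    THRESHOLD = 64 * 1024                # ~64 KB, as suggested above
    STORAGE_ROOT = "/srv/app/documents"  # hypothetical storage root

    def sharded_path(doc_id, filename):
        """Derive STORAGE_ROOT/xx/yy/filename (256 x 256 fan-out) so no directory gets too wide."""
        h = hashlib.sha1(doc_id.encode("utf-8")).hexdigest()
        return os.path.join(STORAGE_ROOT, h[:2], h[2:4], filename)

    def store_document(db, doc_id, filename, data):
        if len(data) <= THRESHOLD:
            # Small payloads go straight into the database row.
            db.execute(
                "INSERT INTO documents (id, name, content, file_path) VALUES (%s, %s, %s, NULL)",
                (doc_id, filename, data),
            )
        else:
            # Large payloads go to a sharded path; only the path is stored in the row.
            path = sharded_path(doc_id, filename)
            os.makedirs(os.path.dirname(path), exist_ok=True)
            with open(path, "wb") as fh:
                fh.write(data)
            db.execute(
                "INSERT INTO documents (id, name, content, file_path) VALUES (%s, %s, NULL, %s)",
                (doc_id, filename, path),
            )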
Of course, don't forget to define and apply (human decided) backup procedures.

FileNet Content Engine - Database Table for Physical path

I realize this is possible with the FileNet P8 API; however, I'm looking for a way to find the physical document path within the database. Specifically, there are two levels of subfolders in the FileStore, like FN01\FN13\DocumentID, but I can't find a reference to FN01 or FN13 anywhere.
You will not find the names of the folders anywhere in the FN databases. The folder structure is determined by a hashing function. Here is an excerpt from this page on filestores:
Documents are stored among the directories at the leaf level using a hashing algorithm to evenly distribute files among these leaf directories.
The IBM answer is correct only from a technical standpoint of intended functionality.
If you really, really need to find the document file name and folder location, disable your actual file store(s) by making the file store folder unavailable to Content Engine. I did that for each file store by simply changing the root FN#'s to FN#a; for instance, FN3 became FN3a. Once done, I changed the top tree folder back. I used that method so the log files would not exceed the tool's maximum output. Any method that leaves a storage location (e.g. drive, share, etc.) accessible and searchable, but renders the individual files unavailable, should produce the same results.
Then, run the Content Engine Consistency Checker. It will provide you with a full list of all files, IDs and locations.
After that, you can match the entries to the OBJECT_ID fields in the database tables. In non-MSSQL databases, the byte ordering is reversed for the first few octets of the UUID. You need to account for that and fix the byte ordering to match the CCC output.
"...needs to be byte reversed so that it can be queried upon in Oracle. When querying on GUIDs, GUIDs are stored in byte-reversed form in Oracle and DB2 (not MS SQL), whereby the first three sections are pair reversed and the last two are left alone."
Thus, the same applies in reverse: to match the Content Consistency Checker output against the database, one must apply the same byte-order reversal.
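As a small illustration of that pair reversal (a sketch only; the example GUID is made up), the transformation can be written as:

    def fem_guid_to_db_hex(guid):
        """Convert a GUID as displayed by FileNet Enterprise Manager into the
        byte-reversed hex form used for matching OBJECT_ID in Oracle/DB2:
        the first three dash-separated sections are pair-reversed, the last
        two are left alone."""
        hexstr = guid.strip("{}").replace("-", "")
        p1, p2, p3, rest = hexstr[0:8], hexstr[8:12], hexstr[12:16], hexstr[16:]

        def swap_pairs(s):
            # "D60173E0" -> "E07301D6": reverse the order of the byte pairs
            return "".join(reversed([s[i:i + 2] for i in range(0, len(s), 2)]))

        return (swap_pairs(p1) + swap_pairs(p2) + swap_pairs(p3) + rest).upper()

    # Hypothetical example:
    # fem_guid_to_db_hex("{D60173E0-78D2-4D71-B7F1-6A9A48D1AFE3}")
    # -> "E07301D6D278714DB7F16A9A48D1AFE3"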
See this IBM Tech Doc and the answer linked below for details:
IBM Technote: https://www.ibm.com/support/pages/node/469173
Stack Answer: https://stackoverflow.com/a/53319983/1854328
More detailed information on the storage mechanisms is located here:
IBM Technote: "How to translate the unique identifier as displayed within FileNet Enterprise Manager so that it matches what is stored in the Oracle and DB2 databases"
I do not suggest using this for anything but catastrophic need, such as rebuilding and rewriting an entire file store that got horrendously corrupted by your predecessor when they destroyed an NTFS (or some similarly nasty situation).
It is a workaround that bypasses the hashing FileNet uses to obfuscate content information from anyone looking at the file system.

HBase schema/key for real-time analytics solution

We are looking at using HBase for real-time analytics.
Before the data reaches HBase, we will run a Hadoop MapReduce job over our log files to aggregate the data, and store the fine-grained aggregate results in HBase to enable real-time analytics and queries on the aggregated data. So the HBase tables will hold pre-aggregated data (by date).
My question is: how best to design the schema and row key for the HBase database to enable fast but flexible queries.
For example, assume that we store the following lines in a database:
timestamp, client_ip, url, referrer, useragent
and say our map-reduce job produces three different output fields, each of which we want to store in a separate "table" (HBase column family):
date, operating_system, browser
date, url, referrer
date, url, country
(our map-reduce job obtains the operating_system, browser and country fields from the user agent and client_ip data.)
My question is: how can we structure the HBase schema to allow fast, near-realtime and flexible lookups for any of these fields, or a combination? For instance, the user must be able to specify:
operating_system by date ("How many iPad users in this date range?")
url by country and date ("How many users to this url from this country for the last month?")
and basically any other custom query?
Should we use keys like this:
date_os_browser
date_url_referrer
date_url_country
and if so, can we fulfill the sort of queries specified above?
You've got the gist of it, yes. Both of your example queries filter by date, and that's a natural "primary" dimension in this domain (event reporting).
A common note you'll get about starting your keys with a date is that it will cause "hot spotting" problems; the essence of that problem is that date ranges that are contiguous in time will also be contiguous on the servers, so if you're always inserting and querying data that happened "now" (or "recently"), one server will get all the load while the others sit idle. This doesn't sound like it'd be a huge concern on insert, since you'll be batch loading exclusively, but it might be a problem on read; if all of your queries go to one of your 20 servers, you'll effectively be at 5% capacity.
OpenTSDB gets around this by prepending a 3-byte "metric id" before the date, and that works well to spray updates across the whole cluster. If you have something similar that you know you always (or usually) filter on in your queries, you could use that. Or you could prepend a hash of some higher-order part of the date (like the month), and then at least your reads would be a little more spread out.
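For example, the month-hash variant might look roughly like this (a sketch; the bucket count, the separator and the "YYYYMMDD" date format are assumptions):

    import hashlib

    NUM_BUCKETS = 20  # hypothetical: roughly one bucket per region server

    def make_row_key(date, *dimensions):
        """Build a key like b'NN|20240115|iPad|Safari' for the date_os_browser table,
        where NN is a bucket derived from the month."""
        month = date[:6]  # "YYYYMM": the higher-order part of the date
        bucket = int(hashlib.md5(month.encode("utf-8")).hexdigest(), 16) % NUM_BUCKETS
        return "|".join([format(bucket, "02d"), date, *dimensions]).encode("utf-8")

    # Within one month all keys share a prefix, so a date-range scan inside that
    # month is still a single contiguous scan; a query spanning several months
    # issues one scan per month bucket and merges the results.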

Upload DB2 data into an Oracle database - fixing junk data

I've been given a DB2 export of data (around 7 GB) with associated DB2 control files. My goal is to upload all of the data into an Oracle database. I've almost succeeded in this - I took the route of converting the control files into SQL*Loader CTL files and it has worked for the most part.
However, I have found that some of the data files contain terminators and junk data in some of the columns, which gets loaded into the database, causing obvious issues with matching on that data. E.g., a column that should contain '9930027130' will show length(trim(col)) = 14, i.e. 4 bytes of junk data.
My question is: what is the best way to eliminate this junk data from the system? I hope there's a simple addition to the CTL file that allows it to replace the junk with spaces; otherwise I can only think of writing a script that analyses the data and replaces nulls/junk with spaces before running SQL*Loader.
What, exactly, is your definition of "junk"?
If you know that a column should only contain 10 characters of data, for example, you can add a NULLIF( LENGTH( <<column>> ) > 10 ) to your control file. If you know that the column should only contain numeric characters (or alphanumerics), you can write a custom data cleansing function (e.g. STRIP_NONNUMERIC) and call that from your control file, e.g.
COLUMN_NAME position(1:14) CHAR "STRIP_NONNUMERIC(:COLUMN_NAME)",
Depending on your requirements, these cleansing functions and the cleansing logic can get rather complicated. In data warehouses that load and cleanse large amounts of data every night, data is generally moved through a series of staging tables as successive rounds of data cleansing and validation rules are applied, rather than trying to load and cleanse all the data in a single step. A common approach would be, for example, to load all the data into VARCHAR2(4000) columns with no cleansing via SQL*Loader (or external tables). Then a separate process would move the data to a staging table with the proper data types, NULL-ing out data that couldn't be converted (e.g. non-numeric data in a NUMBER column, impossible dates, etc.). Another process would come along and move the data to another staging table where you apply domain rules: things like a social security number having to be 9 digits, a latitude having to be between -90 and 90 degrees, or a state code having to be in the state lookup table. Depending on the complexity of the validations, you may have more processes that move the data to additional staging tables to apply ever stricter sets of validation rules.
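If you do end up with the pre-load cleansing script the question mentions, a minimal sketch might look like this (it assumes the export is a plain text data file and defines "junk" as anything outside printable ASCII; adjust that definition to whatever the data actually contains):

    import re
    import sys

    JUNK = re.compile(rb"[^\x20-\x7e\r\n]")  # bytes outside printable ASCII

    def cleanse(src_path, dst_path):
        """Copy the data file, replacing junk bytes with spaces so the field
        widths (and therefore the POSITION clauses in the CTL file) are preserved."""
        with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
            for line in src:
                dst.write(JUNK.sub(b" ", line))

    if __name__ == "__main__":
        cleanse(sys.argv[1], sys.argv[2])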
"A column should contain '9930027130', will show length(trim(col)) = 14 : 4 Bytes of junk data. "
Do a SELECT DUMP(col) to determine the strange characters. Then decide whether that are always invalid, valid in some cases or valid but interpreted wrong.
