How does ORC delimit fields? - hadoop

I know this must be a silly question, but after hours' googling, I cannot get the answer.
It's easy to understand how delimiters work in a plain-text format such as CSV. But in ORC, since the data is stored as binary in HDFS, what would act as the delimiter for a field? I was told that there is no delimiter in ORC, but I have my doubts about that statement.
Even if it is stored as row groups, a single column within a row group contains many data fields, so how is each field distinguished from the next one? How is each row separated from the next? Is there a delimiter to achieve this?
Thank you for any comments!

There is no delimiter. ORC uses stripes (with index strides inside them):
The body of the file is divided into stripes. Each stripe is self
contained and may be read using only its own bytes combined with the
file’s Footer and Postscript. Each stripe contains only entire rows so
that rows never straddle stripe boundaries. Stripes have three
sections: a set of indexes for the rows within the stripe, the data
itself, and a stripe footer. Both the indexes and the data sections
are divided by columns so that only the data for the required columns
needs to be read.
Refer: ORC
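For illustration, here is a minimal Python sketch (assuming pyarrow with ORC support is installed; the file and column names are placeholders) showing that the reader works from stripe and footer metadata rather than from any delimiter:

import pyarrow.orc as orc

# Open a local copy of the file; row and stripe counts come from the file footer.
reader = orc.ORCFile("example.orc")          # hypothetical file name
print("rows:", reader.nrows)
print("stripes:", reader.nstripes)

# Each stripe is self-contained, and only the requested columns are decoded.
batch = reader.read_stripe(0, columns=["some_column"])   # hypothetical column name
print("rows in first stripe:", batch.num_rows)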

Related

Copying Data in Excel and replacing in Word Document using UIPATH

I would like to copy data from Excel and replace text in a Word document.
However, the cell references that contain the data are NOT fixed, as they depend on the number of debtors and the number of table rows the user wants.
Any help would be appreciated!
You can try reading the Excel data using the Read Range activity and then looking up the exact values you want to replace in the resulting DataTable.
You can then read the Word file and replace the required values in it from the DataTable.
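Not UiPath itself, but the same logic sketched in Python (assuming openpyxl and python-docx are installed; the file names and the "<ColumnName>" placeholder convention are made up for illustration):

from openpyxl import load_workbook
from docx import Document

# Read the used range of the sheet; the number of rows/columns is not fixed.
ws = load_workbook("debtors.xlsx").active                 # hypothetical workbook
rows = list(ws.iter_rows(values_only=True))
header, record = rows[0], rows[1]                         # values to substitute
values = dict(zip(header, record))

# Replace "<ColumnName>" placeholders in the Word document with the looked-up values.
doc = Document("template.docx")                           # hypothetical template
for paragraph in doc.paragraphs:
    for name, value in values.items():
        placeholder = f"<{name}>"
        if placeholder in paragraph.text:
            paragraph.text = paragraph.text.replace(placeholder, str(value))
doc.save("output.docx")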

data factory special character in column headers

I have a file I am reading into a blob via Data Factory.
It's formatted in Excel. Some of the column headers have special characters and spaces, which isn't good if you want to take the data to CSV or Parquet and then SQL.
Is there a way to correct this in the pipeline?
Example
"Activations in last 15 seconds high+Low" "first entry speed (serial T/a)"
Thanks
Normally, Data Flow can handle this for you by adding a Select transformation with a Rule:
Uncheck "Auto mapping".
Click "+ Add mapping"
For the matching condition, enter "true()" so the rule applies to all columns.
Enter an appropriate expression to rename the columns. This example uses regular expressions to remove any character that is not a letter.
SPECIAL CASE
There may be an issue with this if the column name contains forward slashes ("/"). I accidentally came across this in my testing:
Every one of the columns not mapped contains forward slashes. Unfortunately, I cannot explain why this is the case, as Data Flow is clearly aware of the column name. It can be addressed manually by adding a Fixed rule for EACH offending column, which is obviously less than ideal.
ANOTHER OPTION
The other thing you could try is to pre-process the text file with another Data Flow using a Source dataset that has no delimiters. This would give you the contents of each row as a single column. If you could get a handle on just the first row, you could remove the special characters.
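Outside Data Factory, the same header clean-up can also be sketched in pandas if you prefer to pre-process the file before the pipeline picks it up (assuming pandas with openpyxl is installed; file names are placeholders, not from the original question):

import re
import pandas as pd

df = pd.read_excel("source.xlsx")            # hypothetical input workbook
# Keep only letters in each header, mirroring the "remove any character that is not a letter" rule above.
df.columns = [re.sub(r"[^A-Za-z]", "", str(c)) for c in df.columns]
df.to_csv("clean.csv", index=False)          # now safe to take to CSV/Parquet/SQL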

How to find columns count of csv(Excel) sheet in ETL?

To count the rows of a CSV file we can use the Get Files Rows Count input in ETL. How can I find the number of columns of a CSV file?
Just read the first row of the CSV file using Text-File-Input setting header rows to 0. Usually, the first row contains field names. If you read the whole row into a single field, you can use Split-Field-To-Rows to have a single fieldname per row and the number of rows tells you the number of fields. There are other ways, but this one easily prepares for a subsequent metadata injection - if that's what you have in mind.
No need for metadata injection. In Split-Field-To-Rows, check "Include rownum in output" and give that variable a name. Then apply Sort Rows on that variable and use Sample Rows, and you will get the number of fields present in the file.
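If a quick check outside the ETL tool is acceptable, the same idea (read just the header row and count its fields) is a few lines of Python; the file name is a placeholder:

import csv

with open("input.csv", newline="") as f:     # hypothetical file name
    first_row = next(csv.reader(f))          # header row only
print("number of columns:", len(first_row))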

Does column name length count for each cell size when counting the column size in Google bigquery?

I know that in HBase, for example, you need to keep column names as short as possible to minimize storage size.
Is it the same in Google BigQuery? Should I keep column names as short as possible?
Good news: in BigQuery you don't need to worry about column name length. Be as descriptive as you'd like, since the column name is part of the table description and not of each record.

How to increase maximum size of csv field in Magento, where is this located

I have one field when importing that can contain a lot of data, and it seems that CSV has an unofficial limitation of about 65000 (likely 65535*) characters, as both LibreOffice Calc and Magento truncate the data for that particular field. I have investigated thoroughly and I'm certain it is not because of a special character or quotes; the data is pretty straightforward and the lines are similar in format to each other.
Question: How can I increase that size, or at least where should I look to find it?
Note: I counted in LibreOffice Writer and it was about 65040, but with carriage return characters it could probably reach 65535.
I changed:
1) in table catalog_category_entity_text, the type of the "value" field from "text" to "longtext"
2) in file app/code/core/Mage/ImportExport/Model/Import/Entity/Abstract.php
const DB_MAX_TEXT_LENGTH = 65536;
to
const DB_MAX_TEXT_LENGTH = 16777215;
and everything worked.
You are right, there is a limitation in Magento, because it creates text fields as TEXT in the MySQL database and, according to the MySQL docs, that column type supports a maximum of 65535 characters.
http://dev.mysql.com/doc/refman/5.0/es/storage-requirements.html
So you could change the column type in your Magento database to use MEDIUMTEXT. I guess the correct place is in the catalog_product_entity_text table, where you should modify the 'value' field type to match your needs. But please, keep in mind this is dangerous. Make a full backup before trying. And you may even need to play with core files... not recommended!
I'm having the same issue with 8 products from a list of more than 400, and I think I'm not going to mess with the Magento core and database; we can just shorten the descriptions for those few products.
The CSV format itself couldn't care less. Since Microsoft Access allows Memo fields, which can hold quite a bit of data, I've exported 2-3k descriptions in CSV format to be imported into Magento quite successfully.
Your limitation is either because you are using a spreadsheet that has a cell or export size limit, or because the field you are trying to import into has a maximum character limit set in its table.
You can determine the latter by using phpMyAdmin to see what the maximum character setting is for that field.
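To confirm whether the truncation happens before the data ever reaches MySQL, a rough Python sketch like this (the file name is a placeholder) can report the longest value per column in the exported CSV:

import csv

longest = {}
with open("export.csv", newline="", encoding="utf-8") as f:   # hypothetical export file
    for row in csv.DictReader(f):
        for column, value in row.items():
            longest[column] = max(longest.get(column, 0), len(value or ""))

# Columns whose longest value sits right around 65535 were likely truncated upstream.
for column, length in sorted(longest.items(), key=lambda kv: -kv[1]):
    print(column, length)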
