CLI "bq load" - how to use non-printable character as delimiter? - bash

I'm having trouble loading data into BigQuery as single-column rows. I wish BigQuery offered "no delimiter" as an option, but in the meantime I need to choose the most obscure ASCII delimiter I can find so that my single-column rows are not split into columns.
The CLI won't let me type such characters directly, so I have had to fall back to the API through Python or other channels.
How can I use the CLI instead, with a non-printable character?
A Python example from the post "BigQuery lazy data loading: DDL, DML, partitions, and half a trillion Wikipedia pageviews":
#!/usr/bin/env python
from google.cloud import bigquery

bq_client = bigquery.Client(project='fh-bigquery')
table_ref = bq_client.dataset('views').table('wikipedia_views_gcs')
table = bigquery.Table(table_ref, schema=SCHEMA)  # SCHEMA is defined earlier in the post

# Read each line as a single STRING column: u'\u00ff' is a byte that never
# appears in the data, and an empty quote character disables quote handling.
extconfig = bigquery.ExternalConfig('CSV')
extconfig.schema = [bigquery.SchemaField('line', 'STRING')]
extconfig.options.field_delimiter = u'\u00ff'
extconfig.options.quote_character = ''
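That snippet defines an external table; for a regular load job through the same Python API, the delimiter trick carries over via LoadJobConfig. A minimal sketch, with placeholder bucket, dataset, and table names:

from google.cloud import bigquery

client = bigquery.Client()  # assumes default project and credentials

job_config = bigquery.LoadJobConfig()
job_config.source_format = 'CSV'
job_config.field_delimiter = u'\u00ff'  # a byte that should never occur in the data
job_config.quote_character = ''         # disable quote handling entirely
job_config.schema = [bigquery.SchemaField('line', 'STRING')]

load_job = client.load_table_from_uri(
    'gs://<bucket_name>/<file_name>.csv',                # placeholder URI
    client.dataset('<bq_dataset>').table('<bq_table>'),  # placeholder table
    job_config=job_config)
load_job.result()  # wait for the load to complete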

To pass a non-printable character to bq load, you can use command substitution with echo in bash:
bq load \
  --source_format=CSV \
  --field_delimiter=$(echo -en "\x01") \
  --noreplace --max_bad_records=100 \
  <bq_dataset>.<bq_table> gs://<bucket_name>/<file_name>.csv
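Equivalently, bash's ANSI-C quoting writes the byte inline without a subshell (note this is a bash feature, so it won't work in plain sh):
bq load --source_format=CSV --field_delimiter=$'\x01' <bq_dataset>.<bq_table> gs://<bucket_name>/<file_name>.csv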

Related

How to clean a CSV file for reading text in double quotes as one column

I am working with a dataset of Chicago crime data, using Scala and Apache Spark.
A few lines contain multiple values separated by commas inside double quotes. Is there a way to clean the data so that the text inside the double quotes is read as one column?
The text is below; the part in bold is what I want to read as a single column:
10366565,HZ102660,01/03/2016 01:50:00 PM,020XX S WABASH AVE,1310,CRIMINAL DAMAGE,TO PROPERTY,**"SCHOOL, PRIVATE, BUILDING"**,false,false,0131,001,3,33,14,1177070,1890608,2016,01/10/2016 08:46:55 AM,41.855167994,-87.625552607,"(41.855167994, -87.625552607)"
The desired output is below: with the commas inside quotes replaced, the quoted text can be read as a single string:
10366565,HZ102660,01/03/2016 01:50:00 PM,020XX S WABASH AVE,1310,CRIMINAL DAMAGE,TO PROPERTY,**"SCHOOL|PRIVATE|BUILDING"**,false,false,0131,001,3,33,14,1177070,1890608,2016,01/10/2016 08:46:55 AM,41.855167994,-87.625552607,**"(41.855167994|-87.625552607)"**
Is there a way to do it in Scala or transform it into a new file using shell scripting?
By default, Spark reads a quoted string in a CSV file as a single column (regardless of whether it contains commas), so you can process the quoted content after it's read into a DataFrame:
Sample CSV data:
10366565,01/03/2016 01:50:00 PM,"SCHOOL, PRIVATE, BUILDING"
10366700,01/04/2016 12:30:00 PM,"SCHOOL, PRIVATE, BUILDING"
Sample code:
val df = spark.read.csv("/path/to/csvfile")
df.show(truncate=false)
+--------+----------------------+-------------------------+
|_c0     |_c1                   |_c2                      |
+--------+----------------------+-------------------------+
|10366565|01/03/2016 01:50:00 PM|SCHOOL, PRIVATE, BUILDING|
|10366700|01/04/2016 12:30:00 PM|SCHOOL, PRIVATE, BUILDING|
+--------+----------------------+-------------------------+
// A UDF that replaces matches of ",\s*" with "|"
def commaToPipe = udf( (s: String) =>
  """,\s*""".r.replaceAllIn(s, "|")
)
df.select($"_c0", commaToPipe($"_c2")).show(truncate=false)
+--------+-----------------------+
|_c0     |UDF(_c2)               |
+--------+-----------------------+
|10366565|SCHOOL|PRIVATE|BUILDING|
|10366700|SCHOOL|PRIVATE|BUILDING|
+--------+-----------------------+
[UPDATE]
As pointed out by a commenter, the built-in regexp_replace eliminates the need for a UDF:
import org.apache.spark.sql.functions.regexp_replace
df.select($"_c0", regexp_replace($"_c2", """,\s*""", "|"))
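If the goal is a cleaned file rather than a DataFrame, the same substitution can be applied to every column and written back out. A minimal sketch (the output path is a placeholder; Spark writes a directory of part files):

import org.apache.spark.sql.functions.regexp_replace

// Apply the same ",\s*" -> "|" substitution to all columns, then save as CSV.
val cleaned = df.columns.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, regexp_replace(acc(c), """,\s*""", "|"))
}
cleaned.write.csv("/path/to/cleaned")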

Split characters inside Pig field

I have text input with a '|' separator:
0.0000|25000| |BM|BM901002500109999998|SZ
which I split using PigStorage:
A = LOAD '/user/hue/data.txt' using PigStorage('|');
Now I need to split the field BM901002500109999998 into separate fields by position, e.g. characters 0-2 = BM as Field1, and so on.
After this step I should get BM, 90100, 2500, 10, 9999998.
Is there any way to achieve this in a Pig script? Otherwise I plan to write a UDF that inserts separators at the required positions.
Thanks.
You are looking for SUBSTRING:
A = LOAD '/user/hue/data.txt' using PigStorage('|');
B = FOREACH A GENERATE SUBSTRING($4,0,2) AS FIELD_1, SUBSTRING($4,2,7) AS FIELD_2, SUBSTRING($4,7,11) AS FIELD_3, SUBSTRING($4,11,13) AS FIELD_4, SUBSTRING($4,13,20) AS FIELD_5;
The output would be:
dump B;
(BM,90100,2500,10,9999998)
You can find more info about this function in the Pig documentation.
I think it would be much more efficient to use the built-in UDF REGEX_EXTRACT_ALL.
You can get an idea of how to use it from:
http://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#REGEX_EXTRACT_ALL
STRSPLIT and REGEX_EXTRACT_ALL in PigLatin
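For reference, a minimal sketch of the REGEX_EXTRACT_ALL approach; the capture groups mirror the fixed positions used above:

A = LOAD '/user/hue/data.txt' USING PigStorage('|');
-- REGEX_EXTRACT_ALL matches the whole field and returns a tuple of the groups
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL($4, '(.{2})(.{5})(.{4})(.{2})(.{7})'))
    AS (f1:chararray, f2:chararray, f3:chararray, f4:chararray, f5:chararray);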

PigStorage and Variable Schemas from Input

I have a comma separated text file like
1,abc,1,
2,def,1,2,3,4
3,ghi,1,2
4,jkl,1,5,6,7,8,9
5,mno
The text file will always have the first two values, but will have 0 or more values after the second comma.
How can I load this data and give an alias to the first two values?
I can load it and not give an alias to the first two values via:
A = LOAD 'data.txt' USING PigStorage(',');
From here, I can do B = FOREACH A GENERATE $0 AS foo:chararray, $1 AS bar:chararray; but that would discard the rest. It would be nice to use a wildcard and put the rest in a tuple.
Is there any way to do this?
Try this (range projections like $2.. need a reasonably recent Pig version):
B = foreach A generate $0 as foo:chararray, $1 as bar:chararray, $2..;
Reference: Drop single column in Pig
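If you specifically want the trailing values bundled into one tuple field, the built-in TOTUPLE can wrap the range expression; a minimal sketch, assuming a Pig version that accepts range projections as UDF arguments:

B = FOREACH A GENERATE $0 AS foo:chararray, $1 AS bar:chararray, TOTUPLE($2 ..) AS rest;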
I am not sure exactly what you need. Try this:
A = LOAD 'data.txt' USING PigStorage(',') AS (foo:chararray, bar:chararray);
This will ignore the values after the second comma in your file.
Or you can create a map for the remaining fields.

Sql loader - second enclosure string not present

I am loading .csv data into an Oracle table through SQL*Loader. One of the fields has a newline (CRLF) in its data, so I'm getting the error below:
second enclosure string not present
This is my control file:
load data
characterset UTF8
infile 'C:\Users\lab.csv'
truncate
into table test_labinal
fields terminated by ";" optionally enclosed by '"'
TRAILING NULLCOLS
(
STATEMENT_STATUS ,
MANDATORY_TASK ,
COMMENTS CHAR(9999) "SubStr(:Comments, 0, 1000)"
)
The COMMENTS field has a newline character in one of its records. Can anyone suggest a solution?
Thanks.
If your last field is always present (though trailing nullcols suggests it isn't) and you have some control over the formatting, you can use the CONTINUEIF directive to treat the second line as part of the same logical record.
If the comments field is always present and enclosed in double-quotes then you can do:
...
truncate
continueif last != x'22'
into table ...
Which would handle data records like:
S;Y;"Test 1"
F;N;"Test 2"
P;Y;"Test with
new line"
P;N;""
Or if you always have a delimiter after the comments field, whether it is populated or not:
...
truncate
continueif last != ';'
into table ...
Which would handle:
S;Y;Test 1;
F;N;"Test 2";
P;Y;Test with
new line;
P;N;;
Both ways will load the data as:
S M COMMENTS
- - ------------------------------
S Y Test 1
F N Test 2
P Y Test withnew line
P N
But this loses the new line from the data. To keep that you need the terminating field delimiter to be present, and instead of CONTINUEIF you can change the record separator using the stream record format:
...
infile 'C:\Users\lab.csv' "str ';\n'"
truncate
into table ...
The "str ';\n'" defines the terminator as the combination of the field terminator and a new line character. Your split comment only has that combination on the final line. With the same data file as the previous version, this gives:
S M COMMENTS
- - ------------------------------
S Y Test 1
F N Test 2
P Y Test with
new line
P N
4 rows selected.
Since you're on Windows you might have to include \r in the format as well, e.g. "str ';\r\n'", but I'm not able to check that.
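Putting the pieces together, the full control file for the stream-record-format approach would look something like this (an untested sketch based on the original control file; add \r to the str clause on Windows if needed):

load data
characterset UTF8
infile 'C:\Users\lab.csv' "str ';\n'"
truncate
into table test_labinal
fields terminated by ";" optionally enclosed by '"'
TRAILING NULLCOLS
(
STATEMENT_STATUS ,
MANDATORY_TASK ,
COMMENTS CHAR(9999) "SubStr(:Comments, 0, 1000)"
)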
Alternatively, you can strip the embedded line breaks as the data is loaded:
load data
characterset UTF8
infile 'C:\Users\lab.csv'
truncate
into table test_labinal
fields terminated by ";" optionally enclosed by '"'
TRAILING NULLCOLS
(
STATEMENT_STATUS ,
MANDATORY_TASK ,
COMMENTS CHAR(9999) "SubStr(REPLACE(REPLACE(:Comments,CHR(13)),CHR(10)), 0, 1000)"
)
Note: CHR(13) is the ASCII carriage return and CHR(10) is the ASCII line feed. Calling Oracle's REPLACE function without a replacement value removes every carriage return and/or line feed embedded in the data, which is probably what you want here, since the comment field is the last field in the CSV file.
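As a quick sanity check of what the nested REPLACE does, you can run it against dual:

SELECT REPLACE(REPLACE('line one' || CHR(13) || CHR(10) || 'line two', CHR(13)), CHR(10)) AS cleaned
FROM dual;
-- returns: line oneline two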
You can use replace(replace(column_name, chr(10)), chr(13)) to remove newline characters, or regexp_replace(column_name, '\s+') during loading (note that the latter strips all whitespace, including ordinary spaces).
I found this the best way to load .csv files whose fields contain newlines and commas. Run the macro below over the .csv file, then load it with SQL*Loader:
Sub remove()
    Dim row As Integer
    Dim oneCell As Range
    Dim oxcel As Excel.Application
    Dim wbk As Excel.Workbook
    Set oxcel = New Excel.Application
    Set wbk = oxcel.Workbooks.Open("filename.csv", 0, True)
    row = 0
    With oxcel
        .ActiveSheet.Select
        ' Assume the first column is the PK, so an empty cell marks the last row
        Do
            row = row + 1
        Loop Until IsEmpty(Cells(row, 1)) Or IsNull(Cells(row, 1))
        ' Column 24 holds the free-text field; adjust to suit your file
        Range(Cells(1, 24), Cells(row - 1, 24)).Select
        For Each oneCell In Selection
            ' Normalise line feeds to carriage returns, then replace both
            ' the carriage returns and any commas with "-"
            oneCell.Value = Application.Substitute( _
                Application.Substitute( _
                    Application.Substitute(CStr(oneCell.Value), vbLf, vbCr), _
                vbCr, "-"), ",", "-")
        Next oneCell
    End With
End Sub
It runs perfectly for me.
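One caveat worth checking before relying on the macro: Workbooks.Open is called with ReadOnly (the third argument) set to True, and the workbook is never saved, so as written the substitutions won't persist to disk. A sketch of the fix is to open the file writable and save on close:

Set wbk = oxcel.Workbooks.Open("filename.csv", 0, False)  ' open writable
' ... run the substitutions as above ...
wbk.Close SaveChanges:=True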

Expecting QUOTED STRING in pig script

I have written a Pig script to select from vsql:
LOAD 'sql://{select * from sandesh.insights_voice_day
WHERE Observation_date BETWEEN '2011-11-22' AND '2011-11-23' AND
Type='total'
ORDER BY Observation_date}'
It throws an exception: Expecting QUOTEDSTRING. What is the problem?
Pig expects a quoted string after LOAD, containing the name of the file you are loading. Pig is not SQL, so you have to do something like first dumping your query results into a file and then:
A = LOAD 'your_file' AS (column1:datatype, column2:datatype);
B = FILTER A BY observation_date > '2011-11-22' AND observation_date < '2011-11-23' AND Type == 'total';
C = ORDER B BY observation_date;
DUMP C;
Note that this will order the dates as strings, so depending on the version of Pig you're using, you'll need to handle the timestamps with an appropriate function, something like:
http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/evaluation/datetime/convert/CustomFormatToISO.html
The problem seems to be the multiple uses of single quotes. The following, written on a single line, compiles (pig -c test.pig):
A = LOAD 'sql://{select * from sandesh.insights_voice_day WHERE Observation_date BETWEEN "2011-11-22" AND "2011-11-23" AND Type="total" ORDER BY Observation_date}';
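Another option, assuming the loader passes the string through unmodified, is to keep the SQL's single quotes by backslash-escaping them inside Pig's string literal:

A = LOAD 'sql://{select * from sandesh.insights_voice_day WHERE Observation_date BETWEEN \'2011-11-22\' AND \'2011-11-23\' AND Type=\'total\' ORDER BY Observation_date}';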
