Setting textinputformat.record.delimiter in Spark SQL (Hadoop)

In Spark 2.0.1, Hadoop 2.6.0, I have many files delimited with '!#!\r' rather than the usual newline \n, for example:
=========================================
2001810086 rongq 2001 810!#!
2001810087 hauaa 2001 810!#!
2001820081 hello 2001 820!#!
2001820082 jaccy 2001 820!#!
2002810081 cindy 2002 810!#!
=========================================
I tried to extract the data following "Setting textinputformat.record.delimiter in Spark".
I set textinputformat.record.delimiter='!#!\r'; or set textinputformat.record.delimiter='!#!\n'; but still cannot extract the data.
In spark-sql, I do this:
=========================================
create table ceshi(id int,name string, year string, major string)
row format delimited
fields terminated by '\t';
load data local inpath '/data.txt' overwrite into table ceshi;
select count(*) from ceshi;
The result is 5. But when I set textinputformat.record.delimiter='!#!\r'; and then run select count(*) from ceshi; the result is 1; the delimiter does not work.
I also checked the Hadoop 2.6.0 source, specifically the RecordReader created in TextInputFormat.java. I noticed that the default textinputformat.record.delimiter is null, so LineReader.java uses the readDefaultLine method to read a line terminated by one of CR, LF, or CRLF (CR = '\r', LF = '\n').

You should use the sparkContext's hadoopConfiguration API to set textinputformat.record.delimiter as
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "!#!\r")
Then if you read the text file using sparkContext as
sc.textFile("the input file path")
you should be fine.
Updated
I have noticed that a text file with a \r delimiter is changed to a \n delimiter when saved.
So the following should work for you, as it did for me:
sc.hadoopConfiguration.set("textinputformat.record.delimiter", "!#!\n")
import spark.implicits._  // needed for .toDF if not already imported (spark-shell imports it automatically)
val data = sc.textFile("the input file path")
val df = data.map(line => line.split("\t"))
  .map(array => ceshi(array(0).toInt, array(1), array(2), array(3)))
  .toDF
A case class called ceshi is needed:
case class ceshi(id: Int, name: String, year: String, major :String)
which should give a dataframe like
+----------+-----+-----+-----+
|        id| name| year|major|
+----------+-----+-----+-----+
|2001810086|rongq| 2001|  810|
|2001810087|hauaa| 2001|  810|
|2001820081|hello| 2001|  820|
|2001820082|jaccy| 2001|  820|
|2002810081|cindy| 2002|  810|
+----------+-----+-----+-----+
Now you can call the count function as
import org.apache.spark.sql.functions._
df.select(count("*")).show(false)
which would give output as
+--------+
|count(1)|
+--------+
|       5|
+--------+
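As a side note, if you happen to be reading the file from PySpark, the same delimiter can be supplied for a single read through newAPIHadoopFile, without touching the shared Hadoop configuration. This is only a minimal sketch, reusing the placeholder path and the "!#!\n" delimiter from above:

# Minimal PySpark sketch (assumes an existing SparkContext named sc):
# pass the record delimiter for this read only, instead of setting it
# on the shared hadoopConfiguration.
rdd = sc.newAPIHadoopFile(
    "the input file path",
    "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
    "org.apache.hadoop.io.LongWritable",
    "org.apache.hadoop.io.Text",
    conf={"textinputformat.record.delimiter": "!#!\n"},
).map(lambda key_value: key_value[1])  # keep the record text, drop the byte offset

print(rdd.count())  # should report 5 records for the sample data above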

Related

UCanAccess keeps repeating column names as data

I was using UCanAccess 5.0.1 in Databricks 9.1 LTS (Spark 3.1.2, Scala 2.12), and for whatever reason, when I use the following code to read in a single-record Access DB table, it keeps treating the column names as the record itself (I've tried adding more records and got the same results).
The Access db table looks like this (2 records):
ID  Field1  Field2
2   key1    column2
3   key2    column2-1
The Python code looks like this:
connectionProperties = {
    "driver": "net.ucanaccess.jdbc.UcanaccessDriver"
}
url = "jdbc:ucanaccess:///dbfs/mnt/internal/Temporal/Database1.accdb"
df = spark.read.jdbc(url=url, table="Table1", properties=connectionProperties)
And the result looks like this:
df.printSchema()
df.count()
root
|-- Field1: string (nullable = true)
|-- Field2: string (nullable = true)
|-- ID: string (nullable = true)
Out[21]: 2
df.show()
+------+------+---+
|Field1|Field2| ID|
+------+------+---+
|Field1|Field2| ID|
|Field1|Field2| ID|
+------+------+---+
Any idea/suggestion?
If your data has column names in the first row, you can try header = True to set the first row as column headers.
Sample code –
df = spark.read.jdbc( url = url, table="Table1", header = true, properties= connectionProperties)
But if your data does not have column headers, you need to explicitly define column names and then assign them as column headers.
Sample code –
columns = ["column_name_1", "column_name_2", "column_name_3"]
df = spark.read.jdbc(url=url, table="Table1", schema=columns, properties=connectionProperties)
You can also refer to this answer by Alberto Bonsanto.
Reference - https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/sql-databases#read-data-from-jdbc
It turns out that there was a bug in the JDBC code (https://stackoverflow.com/questions/63177736/spark-read-as-jdbc-return-all-rows-as-columns-name).
I added the following code and now the UCanAccess driver works fine:
%scala
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.jdbc.JdbcDialects

private case object HiveDialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:ucanaccess")
  override def quoteIdentifier(colName: String): String = {
    colName.split('.').map(part => s"`$part`").mkString(".")
  }
}

JdbcDialects.registerDialect(HiveDialect)
Then display(df) would show
|Field1 |Field2    |ID |
|:------|:---------|:--|
|key1   |column2   |2  |
|key2   |column2-1 |3  |
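For reference, once the dialect is registered in the %scala cell, the original Python read from the question can simply be re-run. A minimal sketch, reusing the URL, table name, and connection properties from the question:

# With HiveDialect registered above, UCanAccess identifiers are quoted with
# backticks and the rows come back as data instead of repeated column names.
url = "jdbc:ucanaccess:///dbfs/mnt/internal/Temporal/Database1.accdb"
connectionProperties = {"driver": "net.ucanaccess.jdbc.UcanaccessDriver"}

df = spark.read.jdbc(url=url, table="Table1", properties=connectionProperties)
df.show()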

regex pattern not working in pyspark after applying the logic

I have data as below:
>>> df1.show()
+-----------------+--------------------+
|     corruptNames|       standardNames|
+-----------------+--------------------+
|Sid is (Good boy)|     Sid is Good Boy|
|    New York Life| New York Life In...|
+-----------------+--------------------+
So, as per the above data, I need to apply a regex, create a new column, and get the data as in the second column, i.e. standardNames. I tried the below code:
spark.sql("select *, case when corruptNames rlike '[^a-zA-Z ()]+(?![^(]*))' or corruptNames rlike 'standardNames' then standardNames else 0 end as standard from temp1").show()
It throws the following error:
pyspark.sql.utils.AnalysisException: "cannot resolve '`standardNames`' given input columns: [temp1.corruptNames, temp1. standardNames];
Try this example without the select SQL. I am assuming you want to create a new column called standardNames based on corruptNames if the regex pattern matches, otherwise "do something else...".
Note: your pattern won't compile because you need to escape the second-to-last ) with \.
pattern = '[^a-zA-Z ()]+(?![^(]*))' #this won't compile
pattern = r'[^a-zA-Z ()]+(?![^(]*\))' #this will
Code
import pyspark.sql.functions as F
df_text = spark.createDataFrame([('Sid is (Good boy)',),('New York Life',)], ('corruptNames',))
pattern = r'[^a-zA-Z ()]+(?![^(]*\))'
# Assign the DataFrame first; calling .show() inside the assignment would leave df as None.
df = (df_text.withColumn('standardNames',
                         F.when(F.col('corruptNames').rlike(pattern), F.col('corruptNames'))
                          .otherwise('Do something else')))
df.show()
#+-----------------+-----------------+
#|     corruptNames|    standardNames|
#+-----------------+-----------------+
#|Sid is (Good boy)|Do something else|
#|    New York Life|Do something else|
#+-----------------+-----------------+
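If you would rather stay with spark.sql, as in the original attempt, the same logic can be applied to a temp view. A minimal sketch using the df_text DataFrame built above; note that the backslash in the pattern is doubled because Spark's SQL parser processes escape sequences in string literals by default:

# Same CASE WHEN logic as the DataFrame version above, expressed in SQL.
df_text.createOrReplaceTempView("temp1")

spark.sql(r"""
    SELECT corruptNames,
           CASE WHEN corruptNames RLIKE '[^a-zA-Z ()]+(?![^(]*\\))'
                THEN corruptNames
                ELSE 'Do something else'
           END AS standardNames
    FROM temp1
""").show()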

Insert collect_set values into Elasticsearch with PIG

I have a Hive table which contains 3 columns: "id" (string), "booklist" (array of string), and "date" (string), with the following data:
----------------------------------------------------
id | booklist            | date
----------------------------------------------------
1  | ["Book1" , "Book2"] | 2017-11-27T01:00:00.000Z
2  | ["Book3" , "Book4"] | 2017-11-27T01:00:00.000Z
When trying to insert into Elasticsearch with this Pig script:
-------------------------Script begins------------------------------------------------
SET hive.metastore.uris 'thrift://node:9000';
REGISTER hdfs://node:9001/library/elasticsearch-hadoop-5.0.0.jar;
DEFINE HCatLoader org.apache.hive.hcatalog.pig.HCatLoader();
DEFINE EsStore org.elasticsearch.hadoop.pig.EsStorage(
    'es.nodes = elasticsearch.service.consul',
    'es.port = 9200',
    'es.write.operation = upsert',
    'es.mapping.id = id',
    'es.mapping.pig.tuple.use.field.names=true'
);
hivetable = LOAD 'default.reading' USING HCatLoader();
hivetable_flat = FOREACH hivetable
    GENERATE
        id AS id,
        booklist AS bookList,
        date AS date;
STORE hivetable_flat INTO 'readings/reading' USING EsStore();
-------------------------Script Ends------------------------------------------------
When running the above, I got an error saying:
ERROR 2999:Unexpected internal error. Found unrecoverable error [ip:port] returned Bad Request(400) - failed to parse [bookList]; Bailing out..
Can anyone shed any light on how to parse an ARRAY of STRING into ES and get the above to work?
Thank you!

cloudera impala jdbc query doesn't see array<string> Hive column

I have a table in Hive that has the following structure:
> describe volatility2;
Query: describe volatility2
+------------------+---------------+---------+
| name             | type          | comment |
+------------------+---------------+---------+
| version          | int           |         |
| unmappedmkfindex | int           |         |
| mfvol            | array<string> |         |
+------------------+---------------+---------+
It was created by Spark HiveContext code using the DataFrame API like this:
val volDF = hc.createDataFrame(volRDD)
volDF.saveAsTable(volName)
which carried over the RDD structure that was defined in the schema:
def schemaVolatility: StructType = StructType(
  StructField("Version", IntegerType, false) ::
  StructField("UnMappedMKFIndex", IntegerType, false) ::
  StructField("MFVol", DataTypes.createArrayType(StringType), true) :: Nil)
However, when I try to select from this table using the latest Impala JDBC driver, the last column is not visible to it. My query is very simple, just printing the data to the console, exactly like the example code provided with the driver download:
String sqlStatement = "select * from default.volatility2";
Class.forName(jdbcDriverName);
con = DriverManager.getConnection(connectionUrl);
Statement stmt = con.createStatement();
ResultSet rs = stmt.executeQuery(sqlStatement);
System.out.println("\n== Begin Query Results ======================");
ResultSetMetaData metadata = rs.getMetaData();
for (int i = 1; i <= metadata.getColumnCount(); i++) {
    System.out.println(rs.getMetaData().getColumnName(i) + ":" + rs.getMetaData().getColumnTypeName(i));
}
System.out.println("== End Query Results =======================\n\n");
The console output is this:
== Begin Query Results ======================
version:version
unmappedmkfindex:unmappedmkfindex
== End Query Results =======================
Is it a driver bug, or am I missing something?
I found the answer to my own question. Posting it here so it may help others and save time searching. Apparently Impala recently introduced so-called "complex types" support in its SQL, which includes array among others. The link to the documentation is this:
http://www.cloudera.com/documentation/enterprise/5-5-x/topics/impala_complex_types.html#complex_types_using
According to this, what I had to do was change the query to look like this:
select version, unmappedmkfindex, mfvol.ITEM from volatility2, volatility2.mfvol
and I got the expected results back.

WebFocus, two title columns and merging cells

If I have a table in a WebFocus report design:
+--------+---------+--------+---------+
| left_1 | right_1 | left_2 | right_2 |
+--------+---------+--------+---------+
| v11    | p11     | v21    | v21     |
+--------+---------+--------+---------+
| v12    | p12     | v22    | v22     |
....
How do I make such a table with grouped (multi-level) column titles:
+-------+-------+-------+-------+
|      One      |      Two      |
+-------+-------+-------+-------+
| left  | right | left  | right |
+-------+-------+-------+-------+
| v11   | p11   | v21   | v21   |
+-------+-------+-------+-------+
| v12   | p12   | v22   | v22   |
....
Thank you
Sorry for the delay in answering :)
To rename columns, use the AS command. Example:
TABLE FILE SYSTABLE
PRINT NAME
COMPUTE LEFT1/A3 = 'v11'; AS 'left';
COMPUTE RIGHT1/A3 = 'p11'; AS 'right';
COMPUTE LEFT2/A3 = 'v21'; AS 'left';
COMPUTE RIGHT2/A3 = 'p21'; AS 'right';
IF RECORDLIMIT EQ 10
END
To add the heading columns, you can work with the ACROSS command, but that is trickier than simply using SUBHEAD. With the same example:
TABLE FILE SYSTABLE
PRINT NAME NOPRINT
COMPUTE LEFT1/A3 = 'v11'; AS 'left';
COMPUTE RIGHT1/A3 = 'p11'; AS 'right';
COMPUTE LEFT2/A3 = 'v21'; AS 'left';
COMPUTE RIGHT2/A3 = 'p21'; AS 'right';
IF RECORDLIMIT EQ 10
ON TABLE SUBHEAD
"<+0>One<+0> Two"
ON TABLE PCHOLD FORMAT HTML
ON TABLE SET HTMLCSS ON
ON TABLE SET STYLE *
UNITS=IN, PAGESIZE='Letter',
LEFTMARGIN=0.500000, RIGHTMARGIN=0.500000,
TOPMARGIN=0.500000, BOTTOMMARGIN=0.500000,
SQUEEZE=ON, GRID=OFF, ORIENTATION=LANDSCAPE, $
TYPE=REPORT,FONT='ARIAL',SIZE=9,$
TYPE=TABHEADING,HEADALIGN=BODY,$
TYPE=TABHEADING, LINE=1, ITEM=1, COLSPAN=2, SQUEEZE=ON,$
TYPE=TABHEADING, LINE=1, ITEM=2, COLSPAN=2, SQUEEZE=ON,$
ENDSTYLE
END
Hope it helps!
I'm not entirely sure if you load the headers as a field or if that is the field name, but this might help you:
Define fields
TITL1/A3 = 'One';
TITL2/A3 = 'Two';
BLANK/A1 = '';
Edit the Left and Right title fields to remove the _1 or _2
Print the fields BY BLANK NOPRINT
Add
ON BLANK SUBHEAD
"
You can also add more rows to the subhead if you need more titles
You can also do it easily by embedding HTML/CSS in the report (.fex) file.
Just add the HTML/CSS code at the end of the file.
For example:
-HTMLFORM BEGIN          // start styling the generated report table with HTML/CSS
TABLE tr
td:first-child           // applies to the 1st row only; it can be td or th
{
    colspan = "2";       // to merge 2 columns
}
-HTMLFORM END            // end HTML
So the first row must have two cells, titled "ONE" and "TWO" (in your case), and both cells must have the property colspan="2".
You can also refer to:
the colspan property, from here
manipulating the first row of a table, from here
A second option is to write the whole code in a separate file, save it in .htm/.html format, and just include that file in the WebFOCUS (.fex) file. For example:
-HTMLFORM BEGIN
-INCLUDE HTML_FILE.HTML
-HTMLFORM END
Hope it helps. Thanks.
