I have an input file with one column as an ID and another column as a counter value. Based on the counter value, I am filtering data from the input file to an output file. I made a task in DMExpress that checks the counter and the ID. I have 10 rows for each ID in the input file. If the counter value for an ID is 3, I extract the top 3 rows for that ID and then move on to the next ID. While running this task in Hadoop, Hadoop takes the first 3 records for several IDs and then, when the desired file size is reached, creates a new part file for the remaining IDs.
Now, when Hadoop writes the records to file 0, it extracts 3 records for ID X; but when it writes the next part of the output (file 1), it writes one more record for ID X, the ID whose records sat at the last lines of file 0. That record is a 4th record for ID X, which in turn increases the record count in my output file.
Example: these are the records in the input file.
..more records..
1|XXXX|3|NNNNNNN
2|XXXX|3|MMMMMMM
3|XXXX|3|AAAAAAA
4|XXXX|3|BBBBBBB
5|XXXX|3|NNNDDDD
6|YYYY|3|QQQQQQQ
7|YYYY|3|4444444
8|YYYY|3|1111111
..more records..
The output files that Hadoop creates are as below:
file 0 :
..more records..
1|XXXX|3|NNNNNNN
2|XXXX|3|MMMMMMM
3|XXXX|3|AAAAAAA
file 1:
4|XXXX|3|BBBBBBB
6|YYYY|3|QQQQQQQ
7|YYYY|3|4444444
8|YYYY|3|1111111
..more records..
*Line 4 for ID XXXX should not be there!
Why is Hadoop not filtering on the counter correctly?
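To make the intent concrete, here is a minimal plain-Python sketch (not DMExpress, and the file names are hypothetical) of the behaviour I expect: at most "counter" rows are kept per ID, no matter how the output is later split into part files.
# Sketch only: field 1 is the ID and field 2 is the counter value,
# following the sample records above (e.g. 1|XXXX|3|NNNNNNN).
seen = {}  # number of rows already kept per ID

def keep(line):
    fields = line.rstrip("\n").split("|")
    record_id, counter = fields[1], int(fields[2])
    seen[record_id] = seen.get(record_id, 0) + 1
    return seen[record_id] <= counter

with open("input.txt") as src, open("output.txt", "w") as dst:  # hypothetical paths
    for line in src:
        if keep(line):
            dst.write(line)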
I have data in one column of my CSV file as:
`column1`
row1 : {'name':'Steve Jobs','location':'America','status':'none'}
row2 : {'name':'Mark','location':'America','status':'present'}
row3 : {'name':'Elan','location':'Canada','status':'present'}
I want the output for that column to be:
`name` `location` `status`
Steve Jobs America none
Mark America present
Elan Canada present
But sometimes I have a row value like {'name':'Steve Jobs','location':'America','status':'none'},{'name':'Mark','location':'America','status':'present'}
Please help !
You have to use the tMap and tExtractDelimitedFields components.
The flow is: input -> tMap (substring the value) -> tExtractDelimitedFields -> tMap (clean up the values) -> output.
Below is the step-by-step explanation:
1. Original data - row1 : {'name':'Steve Jobs','location':'America','status':'none'}
2. Substring the value inside the braces using the function below (substring's end index is exclusive, so indexOf("}") stops just before the closing brace):
row1.Column0.substring(row1.Column0.indexOf("{")+1, row1.Column0.indexOf("}"))
Now the result is - 'name':'Steve Jobs','location':'America','status':'none'
3. Extract the single column into multiple columns using tExtractDelimitedFields. Since the fields are separated by commas, set the delimiter to a comma, and since there are 3 fields in the data, create 3 fields in the component schema.
Now the result is,
name location status
'name':'Steve Jobs' 'location':'America' 'status':'none'
'name':'Mark' 'location':'America' 'status':'present'
'name':'Elan' 'location':'Canada' 'status':'present'
4. Again, using one more tMap, strip the column-name prefixes and the single quotes from the data:
row2.name.replaceAll("'name':", "").replaceAll("'", "")
row2.location.replaceAll("'location':", "").replaceAll("'", "")
row2.status.replaceAll("'status':", "").replaceAll("'", "")
Your final result is the clean name, location and status values shown in the expected output above.
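For reference, the same substring/split/replace steps can be sketched outside Talend in a few lines of plain Python (purely an illustration using the sample rows from the question, not a Talend component):
# Sample rows from the question.
rows = [
    "{'name':'Steve Jobs','location':'America','status':'none'}",
    "{'name':'Mark','location':'America','status':'present'}",
    "{'name':'Elan','location':'Canada','status':'present'}",
]
for row in rows:
    inner = row[row.index("{") + 1 : row.index("}")]          # step 2: text between the braces
    fields = inner.split(",")                                  # step 3: split on the comma delimiter
    values = [f.split(":", 1)[1].strip("'") for f in fields]   # step 4: drop prefixes and quotes
    print("\t".join(values))                                   # e.g. Steve Jobs  America  none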
Using parquet-mr 1.11.0, I have a schema such as:
message page {
  required binary url (STRING);
  optional binary content (STRING);
}
I'm doing a single-row lookup by url to retrieve the associated content.
Rows are ordered by url.
The file was created with:
parquet.block.size: 256 MB
parquet.page.size: 10 MB
Using parquet-tools, I was able to verify that I do indeed have a column index and/or offset index for my columns:
column index for column url:
Boundary order: ASCENDING
null count min max
page-0 0 http://materiais.(...)delos-de-curriculo https://api.quero(...)954874/toogle_like
page-1 0 https://api.quero(...)880/toogle_dislike https://api.quero(...)ior-online/encceja
page-2 0 https://api.quero(...)erior-online/todos https://api.quero(...)nte-em-saude/todos
offset index for column url:
offset compressed size first row index
page-0 4 224274 0
page-1 224278 100168 20000
page-2 324446 67778 40000
column index for column content:
NONE
offset index for column content:
offset compressed size first row index
page-0 392224 504412 0
page-1 896636 784246 125
page-2 1680882 641212 200
page-3 2322094 684826 275
[... truncated ...]
page-596 256651848 183162 53100
Using a reader configured as:
AvroParquetReader
    .<GenericRecord>builder(HadoopInputFile.fromPath(path, conf))
    .withFilter(FilterCompat.get(
        FilterApi.eq(
            FilterApi.binaryColumn(urlKey),
            Binary.fromString(url)
        )
    ))
    .withConf(conf)
    .build();
Thanks to the column index and the offset index, I was expecting the reader to read only 2 pages:
the url page whose min/max range contains the requested url, found via the column index;
then the content page containing the matching row index, found via the offset index.
But what I see is that the reader reads and decodes hundreds of pages (~250 MB) for the content column. Am I missing something about how PageIndex is supposed to work in parquet-mr?
Looking at the 'loading page' and 'skipping record' log lines, it seems to build the whole record before applying the filter on url, which, in my opinion, defeats the purpose of PageIndex.
I tried to look online and dive into how the reader works but I could not find anything.
Edit:
I found an open PR from 2015 on parquet-column hinting that the reader (at the time, at least) does indeed build the whole record with all the required columns before applying the predicate:
https://github.com/apache/parquet-mr/pull/288
But in this context I fail to see the purpose of the column offsets.
It turns out that, even though this is not what I expected from reading the specs, it is working as intended.
From this issue I quote:
The column url has 3 pages. Your filter finds out that page-0 matches. Based on the offset index it is translated to the row range [0..19999]. Therefore, we need to load page-0 for the column url and all the pages are in the row range [0..19999] for column content.
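In other words, each content page holds only roughly 90 rows (see the first-row indexes in the offset index above), so the row range [0..19999] selected by the url filter overlaps a couple of hundred content pages, which is consistent with the hundreds of pages / ~250 MB being read. A rough back-of-the-envelope check in Python, using approximate numbers read off the offset index output above:
# Approximate values taken from the offset index of column 'content' above:
# page-596 starts at row 53100, so on average each page holds ~89 rows.
rows_per_page = 53100 / 596            # ~89 rows per content page

# The url filter matches page-0 of 'url', i.e. the row range [0..19999],
# so every content page overlapping that range has to be loaded.
matching_rows = 20000
print(matching_rows / rows_per_page)   # ~224 content pages to load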
I have the following Pig code, in which I am checking the fields srcgt and destgt in relation record (loaded from the main files) against the values listed in another file (intlgt.txt), which contains 338, 918299, 181, 238. However, it throws the error shown below. Can you please suggest how to overcome this on Apache Pig version 0.15.0 (r1682971)?
Pig code:
record = LOAD '/u02/20160201*.SMS' USING PigStorage('|','-tagFile') ;
intlgtrec = LOAD '/u02/config/intlgt.txt' ;
intlgt = foreach intlgtrec generate $0 as intlgt;
cdrfilter = foreach record generate (chararray) $1 as aparty, (chararray) $2 as bparty,(chararray) $3 as dt,(chararray)$4 as timestamp,(chararray) $29 as status,(chararray) $26 as srcgt,(chararray) $27 as destgt,(chararray)$0 as cdrfname ,(chararray) $13 as prepost;
intlcdrs = FILTER cdrfilter by ( STARTSWITH(srcgt,intlgt::intlgt) or STARTSWITH(destgt,intlgt::intlgt) ) ;
Error is:
WARN org.apache.hadoop.mapred.LocalJobRunner - job_local1939982195_0002
java.lang.Exception: org.apache.pig.backend.executionengine.ExecException: ERROR 0: Scalar has more than one row in the output. 1st : (338), 2nd :(918299) (common cause: "JOIN" then "FOREACH ... GENERATE foo.bar" should be "foo::bar")
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
When you use
intlcdrs = FILTER cdrfilter by ( STARTSWITH(srcgt,intlgt::intlgt) or STARTSWITH(destgt,intlgt::intlgt) );
Pig is looking for a scalar: a number or a chararray, but a single value. So Pig assumes your intlgt::intlgt is a relation with exactly one row, e.g. the result of
intlgt = foreach (group intlgtrec all) generate COUNT_STAR(intlgtrec.$0)
(this would generate a single row, containing the count of records in the original relation).
In your case, intlgt contains more than one row, since you have not done any grouping on it.
Based on your code, you're trying to find SMS messages that have an intlgt value on either end. Possible solutions:
If your intlgt entries all have the same length (e.g. 3), then generate SUBSTRING(srcgt, 0, 3) as srcgtshort, and JOIN intlgt::intlgt with record::srcgtshort. This will give you the records where srcgt begins with a value from intlgt. Then repeat this for destgt.
If they have a small number of distinct lengths (e.g. some entries have length 3, some length 4, and some length 5), you can do the same thing, but it becomes more laborious (a field is required for each length).
If the number of rows in the two relations is not too big, do a CROSS between them, which creates all possible combinations of rows from record and rows from intlgt. Then you can filter by STARTSWITH(srcgt, intlgt::intlgt), because the two of them are now fields of the same relation (see the sketch below). Beware of this approach, as the number of records can get HUGE!
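To make the CROSS-then-FILTER idea concrete outside Pig, here is a tiny plain-Python sketch of the same logic (the srcgt/destgt sample values are made up; the prefixes are the ones from intlgt.txt in the question):
intlgt = ["338", "918299", "181", "238"]              # values from intlgt.txt
records = [("3381234567", "4471234567"),              # hypothetical (srcgt, destgt) pairs
           ("9182991111", "3382222222")]

# CROSS pairs every record with every prefix; the filter keeps the matches.
# Note: a record appears once per matching prefix, just like after a Pig CROSS.
intlcdrs = [(srcgt, destgt, prefix)
            for (srcgt, destgt) in records
            for prefix in intlgt
            if srcgt.startswith(prefix) or destgt.startswith(prefix)]
print(intlcdrs)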
To be concrete, say we have a folder with 10k tab-delimited CSV files with the following attribute format (each CSV file is about 10 GB):
id name address city...
1 Matt add1 LA...
2 Will add2 LA...
3 Lucy add3 SF...
...
And we have a lookup table based on "name" above
name gender
Matt M
Lucy F
...
Now we want to output the top 100,000 rows of each CSV file in the following format:
id name gender
1 Matt M
...
Can we use pyspark to efficiently handle this?
How can we handle these 10k CSV files in parallel?
You can do that in Python to take the first 1000 lines of your file:
top1000 = sc.textFile("YourFile.csv").map(lambda line: line.split("CsvSeparator")).take(1000)
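Going a bit further, here is a sketch of the whole flow with the DataFrame API (Spark 2.x+). The paths, the header option and the way "first 100,000 rows per file" is approximated below are assumptions, not something stated in the question:
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("top-rows-with-gender").getOrCreate()

# Hypothetical paths; the files are assumed to have a header row.
data = (spark.read.option("sep", "\t").option("header", True)
        .csv("/data/input/*.csv")
        .withColumn("src_file", F.input_file_name()))
lookup = (spark.read.option("sep", "\t").option("header", True)
          .csv("/data/lookup.tsv"))          # columns: name, gender

# Keep roughly the first 100,000 rows of each file. "First" is approximated
# by a monotonically increasing id, which follows the order rows are read.
w = Window.partitionBy("src_file").orderBy(F.monotonically_increasing_id())
top = (data.withColumn("rn", F.row_number().over(w))
           .where(F.col("rn") <= 100000))

# The lookup table is small, so broadcast it and join on name.
result = (top.join(F.broadcast(lookup), on="name", how="inner")
             .select("id", "name", "gender"))

result.write.option("sep", "\t").option("header", True).csv("/data/output")
Because the 10k files are read as one distributed dataset, Spark parallelises the work across them automatically.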
I have a scenario in Pig that involves comparing 2 CSV files. Basically, it should read the 2 CSV files, compare them to each other, and create a log file that contains the row number and, if possible, the column number of each differing value.
Sample output :
Found 1 different value :
Row : #8764
Column : #67
Expected : 8984954
Actual : 0
Is there a way in PIG to do this?
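To make the expected behaviour concrete, here is a small plain-Python sketch of the comparison I am after (the file names are placeholders, and both files are assumed to have the same shape); the question remains whether the same thing can be done in Pig:
import csv

# Placeholder file names; both files are assumed to have the same row/column count.
with open("expected.csv") as f1, open("actual.csv") as f2:
    expected, actual = list(csv.reader(f1)), list(csv.reader(f2))

diffs = []
for row_no, (exp_row, act_row) in enumerate(zip(expected, actual), start=1):
    for col_no, (exp_val, act_val) in enumerate(zip(exp_row, act_row), start=1):
        if exp_val != act_val:
            diffs.append((row_no, col_no, exp_val, act_val))

print("Found %d different value(s) :" % len(diffs))
for row_no, col_no, exp_val, act_val in diffs:
    print("Row : #%d" % row_no)
    print("Column : #%d" % col_no)
    print("Expected : %s" % exp_val)
    print("Actual : %s" % act_val)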