How can I merge two files using PIG script? - hadoop

I have two files, and I want to merge them sequentially. How can I do so using a Pig/Pig Latin script?
f1.csv
1,aa
1,aa
1,ab
1,ac
2,bd
2,bd
2,bd
4,ab
4,bc
f2.csv
1,xxx
1,xxy
1,xyx
1,yxx
1,xyy
1,yyx
2,pqr
2,pq
2,pqrs
2,pqs
3,def
And the output I need is:
1,aa,1,xxy
1,aa,1,xyx
1,ab,1,yxx
1,ac,1,xyy
2,bd,2,pqr
2,bd,2,pq
2,bd,2,pqrs
Can anyone tell me which join should be used and how to get this?

1) LOAD each file.
2) Then UNION them together
http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#UNION
3) STORE the new unioned alias.
P.S. You can use SET default_parallel 1; to make sure you output only one file.
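For example, a minimal sketch of those three steps (the file names and the two-column schema are taken from the question; 'merged_out' is a hypothetical output path). Note that UNION does not guarantee any row ordering:
SET default_parallel 1;                                -- single reducer => single output file
f1 = LOAD 'f1.csv' USING PigStorage(',') AS (id:int, val:chararray);
f2 = LOAD 'f2.csv' USING PigStorage(',') AS (id:int, val:chararray);
merged = UNION f1, f2;                                 -- no ordering guarantee
STORE merged INTO 'merged_out' USING PigStorage(',');  -- hypothetical output path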

Related

How to loop over multiple folders to concatenate FastQ files?

I have received multiple fastq.gz files from Illumina sequencing for 100 samples, but all the fastq.gz files for the respective samples are in separate folders named after the sample ID. Moreover, I have multiple (8-16) R1.fastq.gz and R2.fastq.gz files for each sample. So I used the following code to concatenate all the R1.fastq.gz and R2.fastq.gz files into a single R1.fastq.gz and a single R2.fastq.gz:
cat V350043117_L04_some_digits-525_1.fq.gz V350043117_L04_some_digits-525_1.fq.gz V350043117_L04_some_digits-525_1.fq.gz > sample_R1.fq.gz
So the naming of the sequencing files follows the structure in the code above: for each sample, the string starting with V has a different number, then L has a different number, and then there is another string of digits before the _1 and _2. The numbers change from sample to sample.
My question is: how can I create a loop that goes over all the folders at once, takes the different file numbering into account, and concatenates the multiple fq.gz files into a single R1 and a single R2 file per sample?
Surely, I cannot just concatenate one by one by going into each sample folder.
Please give some helpful tips. Thank you.
The folder structure is the following:
/data/Sample_1/....._525_1_fq.gz /....._525_2_fq.gz /....._526_1_fq.gz /....._526_2_fq.gz
/data/Sample_2/....._580_1_fq.gz /....._580_2_fq.gz /....._589_1_fq.gz /....._589_2_fq.gz
/data/Sample_3/....._690_1_fq.gz /....._690_2_fq.gz /....._645_1_fq.gz /....._645_2_fq.gz
Based on the provided file structure, would you please try:
#!/bin/bash
for d in Raw2/C*/; do
(
    cd "$d" || exit              # guard: skip the cat commands if cd fails
    id=${d%/}; id=${id##*/}      # extract the ID from the directory name
    cat V*_1.fq.gz > "${id}_R1.fq.gz"
    cat V*_2.fq.gz > "${id}_R2.fq.gz"
)
done
The syntax for d in Raw2/C*/ loops over the subdirectories of Raw2 whose names start with C.
The parentheses run the inner commands in a subshell, so we don't have to care about returning from cd "$d" (at the expense of a small amount of extra execution time).
The variable id is assigned the ID extracted from the directory name.
cat V*_1.fq.gz, for example, expands to V350028825_L04_581_1.fq.gz V350028825_L04_582_1.fq.gz V350028825_L04_583_1.fq.gz ... according to the files in the directory, and those files are concatenated into ${id}_R1.fq.gz. The same goes for ${id}_R2.fq.gz.
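If your folders follow the /data/Sample_N layout shown in the question rather than the Raw2/C*/ tree above, the same loop should only need a different glob (a sketch, untested against your actual tree):
#!/bin/bash
for d in /data/Sample_*/; do
(
    cd "$d" || exit              # skip this sample if cd fails
    id=${d%/}; id=${id##*/}      # e.g. /data/Sample_1/ -> Sample_1
    cat V*_1.fq.gz > "${id}_R1.fq.gz"
    cat V*_2.fq.gz > "${id}_R2.fq.gz"
)
done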

Combine CSV files with condition

I need to combine all the CSV files (.csv) in some directory, but only those that are accompanied by another file with the same name and a different extension (.csv.done).
If a csv file doesn't have a matching .done file, then I don't need it in the combined output.
What is the best way to do this using Bash?
This approach is a solution to your problem. I see you've commented that it "didn't work", but whatever the reason is for it not working, it's likely simple to fix e.g. if you forgot to include key details, or failed to adapt it appropriately to suit your specific situation. If you need further help troubleshooting, add more info to your question.
The approach:
for f in *.csv.done
do
    cat "${f%.*}" >> combined_file.csv    # ${f%.*} strips the .done suffix
done
How it works:
In your example, you have 3 files named 1.csv 2.csv 3.csv and two 'done' files named 1.csv.done 2.csv.done.
This script begins by making a list of all files that end in .csv.done (two files: 1.csv.done 2.csv.done).
It then uses a parameter expansion, specifically ${parameter%word}, to 'shorten' the name of the two files in the list to .csv (instead of .csv.done).
Then it 'prints' the content of the two 'shortened' filenames (1.csv and 2.csv) into a 'combined' file.
It doesn't 'print' the content of 1.csv.done or 2.csv.done, or 3.csv, because these files weren't in the original 'list'.
If you run this script multiple times, it will keep appending the contents of 1.csv and 2.csv to the 'combined' file (only run it once, or delete the 'combined' file before running it again).
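If you do need to rerun it, a small variant (a sketch, not part of the original answer) truncates the combined file first and also skips any .done file whose matching .csv has been removed:
: > combined_file.csv                      # start fresh so reruns don't duplicate data
for f in *.csv.done
do
    csv="${f%.done}"                       # 1.csv.done -> 1.csv
    [ -f "$csv" ] && cat "$csv" >> combined_file.csv
done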

How to merge files based on a keyword in the filename?

I have many files in a folder, containing:
4628_group_1
3643_group_0
7578_group_1
4684_group_0
Finally, I want to merge the files into 2 groups:
Group1.csv is merged from 4628_group_1 and 7578_group_1
Group0.csv is merged from 3643_group_0 and 4684_group_0
Depending on what you mean by merge, you may be able to achieve this with two simple cat commands.
cat *_group_0 > Group0.csv
cat *_group_1 > Group1.csv

export data to csv using hive sql

How do I export a Hive table/select query to CSV? I have tried the command below, but it creates the output as multiple files. Are there any better methods?
INSERT OVERWRITE LOCAL DIRECTORY '/mapr/mapr011/user/output/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT field1,field2,field3 FROM table1
Hive creates as many files as there were reducers running; this is fully parallel.
If you want a single file, then add ORDER BY to force everything through a single reducer, or try increasing the bytes-per-reducer configuration parameter:
SELECT field1,field2,field3 FROM table1 ORDER BY field1
OR
set hive.exec.reducers.bytes.per.reducer=67108864; --increase accordingly
Also you can try to merge files:
set hive.merge.smallfiles.avgsize=500000000;
set hive.merge.size.per.task=500000000;
set hive.merge.mapredfiles=true;
Also, you can concatenate the files using cat after getting them from Hadoop. You can use the command
hadoop fs -cat /hdfspath > some.csv
to get the output in one file.
If you want a header, then you can use sed along with Hive. See this link, which discusses various options for exporting Hive to CSV:
https://medium.com/@gchandra/best-way-to-export-hive-table-to-csv-file-326063f0f229
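For example, a minimal sketch of the header trick (the column names are assumed from the query above, and the one-line 1i form is GNU sed syntax):
# prepend a header line with GNU sed; field1,field2,field3 are assumed column names
hadoop fs -cat /hdfspath | sed '1i field1,field2,field3' > some.csv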

MSBUILD: ReadLinesFromFile doesn't read duplicated lines

I'm using MSBuild to read all the SQL post-deployment files on which my database project depends, and I write this data to one main script, which is then loaded.
I get all needed files:
<ReadLinesFromFile File="$(ScriptsList)">
  <Output TaskParameter="Lines" ItemName="IncludedFiles" />
</ReadLinesFromFile>
And then I batch them (reading all files, line by line, into ListedData)
<ReadLinesFromFile File="$(ScriptDirectory)$([System.String]::Copy('%(IncludedFiles.Identity)'))"
                   Condition="$([System.String]::Copy('%(IncludedFiles.Identity)').Substring(0,2)) == ':r'">
  <Output TaskParameter="Lines" ItemName="ListedData" />
</ReadLinesFromFile>
All the files are found without a problem, and then I write the result to output.sql.
But the file is missing several lines, which makes output.sql impossible for sqlcmd to parse.
SOURCE:
INSERT INTO [Characteristics] (
[CharacteristicID],
[CharName],
[RuleName],
[ActionRuleName],
[CriteriaSetID],
[ActionCriteriaSetID],
[ListCodeID],
[LocalID],
[BomCategory]
)
SELECT ...something,something... from Characteristics
INSERT INTO [CharacteristicDomain] (
[RuleSet],
[CharName],
[CharSlot],
[Description],
[Seq],
[ValueInteger],
[ValueFloat],
[ValueDate],
[ValueString]
)
SELECT ...something,something... from CharacteristicsDomain
As you can see, there are several lines containing only a single ')' bracket, and the task reads only the first such line and then ignores all the duplicates (because it's an item group, not a list). So in effect I get a file looking like this:
OUTPUT:
INSERT INTO [Characteristics] (
[CharacteristicID],
[CharName],
[RuleName],
[ActionRuleName],
[CriteriaSetID],
[ActionCriteriaSetID],
[ListCodeID],
[LocalID],
[BomCategory]
)
SELECT ...something,something... from Characteristics
INSERT INTO [CharacteristicDomain] (
[RuleSet],
[CharName],
[CharSlot],
[Description],
[Seq],
[ValueInteger],
[ValueFloat],
[ValueDate],
[ValueString]
SELECT ...something,something... from CharacteristicsDomain
Does someone know a way to read lines from files using MSBuild without losing duplicate lines?
I thought maybe there is some way to use the Exec task? I can't write my own tasks, and I'm also not allowed to modify the SQL files (I can't rely on users to format the files the way I need). I need to read the files with MSBuild, because I modify some of them before I push them to sqlcmd.
How are you writing to output.sql? If you are batching on %(ListedData.Identity), then that will give you only unique lines. Use it as @(ListedData) and it should be fine.
Your second ReadLinesFromFile, the one that creates @(ListedData), is at fault. It is using task batching with %(IncludedFiles.Identity), so both lines with the ")" will be placed into a single batch.
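For the writing step, a minimal sketch with the standard WriteLinesToFile task ($(OutputDir) is a hypothetical property):
<!-- @(ListedData) expands to all collected lines, duplicates included;
     batching on %(ListedData.Identity) would collapse identical lines -->
<WriteLinesToFile File="$(OutputDir)output.sql"
                  Lines="@(ListedData)"
                  Overwrite="true" />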
