Transpose a list based on specific text in Ruby

I have a file containing one long list of patient samples. Each sample always starts with a line beginning with "SLX", as shown below.
I would like to transpose each sample into one row of a CSV, with the output shown below. I know the CSV library can probably do this, but I don't know how to approach it, since I have to start a new row only when a line beginning with SLX is matched.
Input:
SLX.1767356.fdfsIH.fq.gz
Sequences: 160220
With index: 139019
Sufficient length: 139018
Min index: 0
Max index: 83
Unique: 48932
# reads processed: 48932
# reads with at least one reported alignment: 21172 (43.27%)
# reads that failed to align: 27022 (55.22%)
# reads with alignments suppressed due to -m: 738 (1.51%)
Reported 21172 alignments to 1 output stream(s)
SLX.94373.GHDUA_.fq.gz
Sequences: 28232
With index: 24875
Sufficient length: 24875
Min index: 3
Max index: 41
Unique: 14405
# reads processed: 14405
# reads with at least one reported alignment: 8307 (57.67%)
# reads that failed to align: 5776 (40.10%)
# reads with alignments suppressed due to -m: 322 (2.24%)
Reported 8307 alignments to 1 output stream(s)
SLX.73837.BLABLA_Control.fq.gz
Sequences: 248466
With index: 230037
Sufficient length: 230036
Min index: 0
Max index: 98
Unique: 64883
# reads processed: 64883
# reads with at least one reported alignment: 24307 (37.46%)
# reads that failed to align: 39764 (61.29%)
# reads with alignments suppressed due to -m: 812 (1.25%)
Reported 24307 alignments to 1 output stream(s)
Output:
SLX.10456.FastSeqI_Control_OC_AH_094.fq.gz Sequences: 160220 With index: 139019 Sufficient length: 139018 Min index: 0 Max index: 83 Unique: 48932 # reads processed: 48932 # reads with at least one reported alignment: 21172 (43.27%) # reads that failed to align: 27022 (55.22%) # reads with alignments suppressed due to -m: 738 (1.51%) Reported 21172 alignments to 1 output stream(s)
SLX.10456.FastSeqI_Control_OC_ED_008_F1_.fq.gz Sequences: 28232 With index: 24875 Sufficient length: 24875 Min index: 3 Max index: 41 Unique: 14405 # reads processed: 14405 # reads with at least one reported alignment: 8307 (57.67%) # reads that failed to align: 5776 (40.10%) # reads with alignments suppressed due to -m: 322 (2.24%) Reported 8307 alignments to 1 output stream(s)
SLX.10456.FastSeqJ_OC_AH_086_F1_Control.fq.gz Sequences: 248466 With index: 230037 Sufficient length: 230036 Min index: 0 Max index: 98 Unique: 64883 # reads processed: 64883 # reads with at least one reported alignment: 24307 (37.46%) # reads that failed to align: 39764 (61.29%) # reads with alignments suppressed due to -m: 812 (1.25%) Reported 24307 alignments to 1 output stream(s)

OK, it’s so easy that I will post an answer.
input.scan(/^SLX.*?(?=^SLX|\z)/m)
     .map { |p| p.split($/).map { |e| %Q|"#{e}"| }.join(', ') }
     .join($/)


Can someone explain the output of orcfiledump?

My table test_orc contains (for one partition):
col1 col2 part1
abc def 1
ghi jkl 1
mno pqr 1
koi hai 1
jo pgl 1
hai tre 1
By running
hive --orcfiledump /hive/user.db/test_orc/part1=1/000000_0
I get the following:
Structure for /hive/a0m01lf.db/test_orc/part1=1/000000_0 .
2018-02-18 22:10:24 INFO: org.apache.hadoop.hive.ql.io.orc.ReaderImpl - Reading ORC rows from /hive/a0m01lf.db/test_orc/part1=1/000000_0 with {include: null, offset: 0, length: 9223372036854775807} .
Rows: 6 .
Compression: ZLIB .
Compression size: 262144 .
Type: struct<_col0:string,_col1:string> .
Stripe Statistics:
Stripe 1:
Column 0: count: 6 .
Column 1: count: 6 min: abc max: mno sum: 17 .
Column 2: count: 6 min: def max: tre sum: 18 .
File Statistics:
Column 0: count: 6 .
Column 1: count: 6 min: abc max: mno sum: 17 .
Column 2: count: 6 min: def max: tre sum: 18 .
Stripes:
Stripe: offset: 3 data: 58 rows: 6 tail: 49 index: 67 .
Stream: column 0 section ROW_INDEX start: 3 length 9 .
Stream: column 1 section ROW_INDEX start: 12 length 29 .
Stream: column 2 section ROW_INDEX start: 41 length 29 .
Stream: column 1 section DATA start: 70 length 20 .
Stream: column 1 section LENGTH start: 90 length 12 .
Stream: column 2 section DATA start: 102 length 21 .
Stream: column 2 section LENGTH start: 123 length 5 .
Encoding column 0: DIRECT .
Encoding column 1: DIRECT_V2 .
Encoding column 2: DIRECT_V2 .
What does the part about stripes mean?
First, let's see what an ORC file looks like: it is divided into stripes (each with index data, row data, and a stripe footer), followed by a file footer and a postscript.
Now, some keywords from that layout that also appear in your question!
Stripe - A chunk of data stored in an ORC file. Every ORC file is divided into these chunks, called stripes, each around 250 MB by default, holding index data, the actual data, and some metadata about the actual data stored in that stripe.
Compression - The compression codec used to compress the data stored. ZLIB is the default for ORC.
Index Data - includes min and max values for each column and the row positions within each column. (A bit field or bloom filter could also be included.) Row index entries provide offsets that enable seeking to the right compression block and byte within a decompressed block. Note that ORC indexes are used only for the selection of stripes and row groups and not for answering queries.
Row data - The actual data, used in table scans.
Stripe Footer - The stripe footer contains the encoding of each column and the directory of the streams including their location. To describe each stream, ORC stores the kind of stream, the column id, and the stream’s size in bytes. The details of what is stored in each stream depends on the type and encoding of the column.
Postscript - holds compression parameters and the size of the compressed footer.
File Footer - The file footer contains a list of stripes in the file, the number of rows per stripe, and each column's data type. It also contains column-level aggregates count, min, max, and sum.
Now! Talking about your output from orcfiledump.
First is general information about your file. The name, location, compression codec, compression size etc.
Stripe statistics list all the stripes in your ORC file and their corresponding information. You can see counts and, where applicable, statistics such as min, max, and sum for each column.
File statistics are similar to the stripe statistics, just aggregated over the complete file rather than per stripe.
The last part, the Stripes section, talks about each column in your file and the corresponding index, data, and encoding info for each one.
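If you ever need to post-process a dump programmatically (in the spirit of the Ruby questions on this page), the general-information lines are easy to pick apart; a small sketch, with the dump text inlined for illustration:

```ruby
# A saved report, e.g. from `hive --orcfiledump ... > dump.txt`;
# inlined here for illustration.
dump = <<~TXT
  Rows: 6 .
  Compression: ZLIB .
  Compression size: 262144 .
TXT

# Strip the trailing " ." that orcfiledump prints, then split "key: value".
summary = dump.lines(chomp: true).to_h do |line|
  line.sub(/ \.\z/, "").split(": ", 2)
end
p summary  # => {"Rows"=>"6", "Compression"=>"ZLIB", "Compression size"=>"262144"}
```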
Also, you can use various options with orcfiledump to get the results you want. Here is a handy guide.
// Hive version 0.11 through 0.14:
hive --orcfiledump <location-of-orc-file>
// Hive version 1.1.0 and later:
hive --orcfiledump [-d] [--rowindex <col_ids>] <location-of-orc-file>
// Hive version 1.2.0 and later:
hive --orcfiledump [-d] [-t] [--rowindex <col_ids>] <location-of-orc-file>
// Hive version 1.3.0 and later:
hive --orcfiledump [-j] [-p] [-d] [-t] [--rowindex <col_ids>] [--recover] [--skip-dump]
[--backup-path <new-path>] <location-of-orc-file-or-directory>
Here is a quick guide to the options used in the commands above.
Specifying -d in the command will cause it to dump the ORC file data
rather than the metadata (Hive 1.1.0 and later).
Specifying --rowindex with a comma separated list of column ids will
cause it to print row indexes for the specified columns, where 0 is
the top level struct containing all of the columns and 1 is the first
column id (Hive 1.1.0 and later).
Specifying -t in the command will print the timezone id of the
writer.
Specifying -j in the command will print the ORC file metadata in JSON
format. To pretty print the JSON metadata, add -p to the command.
Specifying --recover in the command will recover a corrupted ORC file
generated by Hive streaming.
Specifying --skip-dump along with --recover will perform recovery
without dumping metadata.
Specifying --backup-path with a new-path will let the recovery tool
move corrupted files to the specified backup path (default: /tmp).
<location-of-orc-file> is the URI of the ORC file.
<location-of-orc-file-or-directory> is the URI of the ORC file or
directory. From Hive 1.3.0 onward, this URI can be a directory
containing ORC files.
Hope that helps!

Plugin config won't load correctly

I'm running two plugins on Spigot 1.8.9 called minigameslib and openskywars. To set up the randomizing of chests, it's required to set up a config:
# Just copy paste if you want more chests. The percentages must add up to 100!
config:
  enabled: true
  chests:
    chest1:
      items: 5*64;5*64;5*64;5*64;262*64;278*1;5*64%30
      percentage: 5
    chest2:
      items: 5*64;262*64;267*1
      percentage: 20
    chest3:
      items: 5*64;262*64
      percentage: 25
    chest4:
      items: 5*64
      percentage: 50
That's the default config file. It's named chests.yml.
I'm attempting to change the file to contain the following:
# Just copy paste if you want more chests. The percentages must add up to 100!
config:
enabled: true
chests:
chest1:
items:298*1;303*1;304*1;301*1;276*1;3*64;1*64;368*2;262*16;322*4;
percentage: 4
chest2:
items:298*1;315*1;300*1;317*1;367*1;3*64;1*64;322*4;364*12
percentage: 4
chest3:
items:298*1;299*1;312*1;305*1;272*1;1*64;79*1;261:3#ARROW_DAMAGE*1
percentage: 4
chest4:
items:298*1;307*1;308*1;309*1;272*1;3*64;261*1;364*12
percentage: 4
chest5:
items:298*1;311*1;316*1;313*1;283*1;1*64;326*1;262*16
percentage: 4
chest6:
items:302*1;315*1;312*1;301*1;272*1;33:5#KNOCKBACK*1
percentage: 4
chest7:
items:302*1;299*1;304*1;309*1;272*1;3*64;79*1;364*12
percentage: 4
chest8:
items:302*1;303*1;316*1;313*1;283*1;1*64;261:5#ARROW_DAMAGE*1;364*12
percentage: 4
chest9:
items:302*1;311*1;308*1;305*1;367*1;3*64;79*1;368*5;33:
percentage: 4
chest10:
items:302*1;307*1;300*1;317*1;276*1;1*64;326*1;262*16;364*12
percentage: 4
chest11:
items:306*1;299*1;316*1;309*1;283*1;3*64;322*4
percentage: 4
chest12:
items:306*1;307*1;308*1;301*1;272*1;3*64;262*16;261:1#ARROW_DAMAGE*1;364*12
percentage: 4
chest13:
items:306*1;303*1;312*1;313*1;276*1;1*64;326*1
percentage: 4
chest14:
items:306*1;315*1;304*1;305*1;367*1;3*64;368*1
percentage: 4
chest15:
items:306*1;311*1;300*1;317*1;272*1;322*4
percentage: 4
chest16:
items:310*1;307*1;316*1;301*1;276*1;261*1
percentage: 4
chest17:
items:310*1;311*1;304*1;309*1;272*1;3*64;261:1#ARROW_DAMAGE*1
percentage: 4
chest18:
items:310*1;315*1;312*1;305*1;283*1;262*16;322*4;364*12
percentage: 4
chest19:
items:310*1;303*1;308*1;317*1;367*1;3*64;79*1
percentage: 4
chest20:
items:310*1;299*1;300*1;313*1;272*1;1*64;364*12
percentage: 4
chest21:
items:314*1;303*1;312*1;317*1;367*1;3*64;368*2;33:5#KNOCKBACK*1;364*12
percentage: 4
chest22:
items:314*1;307*1;316*1;305*1;283*1;326*1;364*12
percentage: 4
chest23:
items:314*1;311*1;300*1;301*1;276*1;1*64;261:1#ARROW_DAMAGE*1
percentage: 4
chest24:
items:314*1;299*1;304*1;313*1;272*1;3*64;262*16;364*12
percentage: 4
chest25:
items:314*1;315*1;308*1;309*1;272*1;79*1;261*1;322*4
percentage: 4
I'm not sure if my YAML syntax is wrong or the item IDs are wrong. The enchanting IDs are here, and the plugin page is here. The plugin resets back to the original config every time it's run; if I make small changes it's fine. I would like to get this long list working if I can.
I hope you have more luck than I do. Thanks in advance.
A YAML file needs a space after the colon.
Also, indentation dictates which object belongs to what.
You made chests a top-level object.
You have
# Just copy paste if you want more chests. The percentages must add up to 100!
config:
enabled: true
chests:
chest1:
items:298*1;303*1;304*1;301*1;276*1;3*64;1*64;368*2;262*16;322*4;
percentage: 4
it should be
# Just copy paste if you want more chests. The percentages must add up to 100!
# top level. no spaces
config:
  # secondary, two spaces. Could also be one space.
  # All following secondary level elements need to have an equal amount of spaces
  enabled: true
  # secondary, two spaces
  chests:
    # Tertiary: 4 spaces. All following tertiary elements under this secondary
    # element need to have 4 spaces.
    chest1:
      # Quaternary element. 6 spaces. All following quaternary elements under this
      # tertiary element need to have 6 spaces.
      # Also note the space after the colon: YAML needs this to discern where the
      # value starts
      items: 298*1;303*1;304*1;301*1;276*1;3*64;1*64;368*2;262*16;322*4;
      percentage: 4
Without clarifying comments
# Just copy paste if you want more chests. The percentages must add up to 100!
config:
  enabled: true
  chests:
    chest1:
      items: 298*1;303*1;304*1;301*1;276*1;3*64;1*64;368*2;262*16;322*4;
      percentage: 4
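As a side note, since the plugin silently resets a bad file, it can help to validate chests.yml yourself before deploying it. A hedged sketch (the structure comes from the question; the YAML is inlined here for illustration):

```ruby
require 'yaml'

# Inline YAML for illustration; in practice: YAML.load_file("chests.yml").
# Note that a missing space after "items:" can make the file fail to parse at
# all (Psych::SyntaxError), which is one reason the plugin falls back to defaults.
yaml = <<~YAML
  config:
    enabled: true
    chests:
      chest1:
        items: 5*64;262*64
        percentage: 40
      chest2:
        items: 5*64
        percentage: 60
YAML

config = YAML.safe_load(yaml)
chests = config.fetch("config").fetch("chests")
total  = chests.sum { |_, chest| chest.fetch("percentage") }
raise "percentages add up to #{total}, not 100" unless total == 100
puts "#{chests.size} chests, percentages sum to #{total}"
```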

Matching all lines between two lines recursively in ruby

I would like to match all lines (including the first line) between two lines that start with 'SLX-', convert them to a comma separated line and then append them to a text file.
A truncated version of the original text file looks like:
SLX-9397._TC038IV_L_FLD0214.Read1.fq.gz
Sequences: 1406295
With index: 1300537
Sufficient length: 1300501
Min index: 0
Max index: 115
0 1299240
1 71
2 1
4 1
Unique: 86490
# reads processed: 86490
# reads with at least one reported alignment: 27433 (31.72%)
# reads that failed to align: 58544 (67.69%)
# reads with alignments suppressed due to -m: 513 (0.59%)
Reported 27433 alignments to 1 output stream(s)
SLX-9397._TC044II_D_FLD0197.Read1.fq.gz
Sequences: 308905
With index: 284599
Sufficient length: 284589
Min index: 0
Max index: 114
0 284290
1 16
Unique: 32715
# reads processed: 32715
# reads with at least one reported alignment: 13114 (40.09%)
# reads that failed to align: 19327 (59.08%)
# reads with alignments suppressed due to -m: 274 (0.84%)
Reported 13114 alignments to 1 output stream(s)
SLX-9397._TC047II_D_FLD0220.Read1.fq.gz
I imagine the Ruby would look like:
Convert all \n between two lines starting with SLX- to commas.
Save the original text file as a new text file (or, even better, a CSV file).
I think I specifically have a problem with how to find and replace between two specific lines.
I guess I could do this without using Ruby, but seeing as I'm trying to get into Ruby...
Assuming that you have your string in str:
require 'csv'

CSV.open("/tmp/file.csv", "wb") do |csv|
  str.scan(/^SLX-.*?(?=^SLX-|\z)/m).map do |record| # break into records at SLX-
    record.split($/).map do |el|                    # split on line separators
      "'#{el}'"                                     # quote values
    end
  end.each do |line|                                # iterate
    csv << line                                     # fill the csv
  end
end
I don't know much about Ruby, but this should work. You should read the entire file into a String. Use this regex - (\RSLX-) - to match all SLX- (all but the first one) and replace it with ,SLX-. For an explanation of the regex, go to https://regex101.com/r/pP3pP3/1
This question - Ruby replace string with captured regex pattern - might help you to understand how to replace in ruby
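To make the replace-based idea concrete, here is one hedged way to do it in two passes: first mark the record boundaries, then turn the remaining newlines into commas (sample text inlined for illustration):

```ruby
input = <<~TEXT
  SLX-9397.A.fq.gz
  Sequences: 100
  Unique: 10
  SLX-9397.B.fq.gz
  Sequences: 200
  Unique: 20
TEXT

# Pass 1: replace each newline that precedes "SLX-" with a placeholder
# (every record boundary except the very first line).
marked = input.gsub(/\R(?=SLX-)/, "\u0000")
# Pass 2: commas for the newlines inside records, then restore the boundaries.
flat = marked.strip.gsub(/\R/, ",").tr("\u0000", "\n")
puts flat
```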

Bash Parsing of output from command with filter

I have a command that outputs hard drive status.
I am planning to run this in a script for monitoring purposes.
I would like to pull out certain rows, such as Slot Number, PD Type, Raw Size, and Drive's position, and display them.
How would I do this? (I'm assuming it would be some sort of awk statement.)
The output is as follows (note that the "(\n)" lines denote blank lines, not a formatting choice):
Enclosure Device ID: 252
Slot Number: 3
Drive's postion: DiskGroup: 1, Span: 0, Arm: 1
Enclosure position: 0
Device Id: 7
WWN: 5000C50034BB0CD8
Sequence Number: 2
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 2.728 TB [0x15d50a3b0 Sectors]
Non Coerced Size: 2.728 TB [0x15d40a3b0 Sectors]
Coerced Size: 2.727 TB [0x15d3ef000 Sectors]
Firmware state: Online, Spun Up
Device Firmware Level: 0003
Connected Port Number: 2(path0)
Inquiry Data: SEAGATE ST33000650SS 0003Z290VK2V
(\n)
(\n)
(\n)
Enclosure Device ID: 252
Slot Number: 4
Drive's postion: DiskGroup: 1, Span: 0, Arm: 1
Enclosure position: 0
Device Id: 8
WWN: 5000C50034BB0CD8
Sequence Number: 2
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SAS
Raw Size: 2.728 TB [0x15d50a3b0 Sectors]
(continues like this)
EDIT:
I would like to display them as
Slot Number: 3
Drive's postion: DiskGroup: 1, Span: 0, Arm: 1
Enclosure position: 0
Device Id: 7
PD Type: SAS
Raw Size: 2.728 TB [0x15d50a3b0 Sectors]
Firmware state: Online, Spun Up
Device Firmware Level: 0003
(\n)
Slot Number: 4
Drive's postion: DiskGroup: 1, Span: 0, Arm: 1
Enclosure position: 0
Device Id: 8
PD Type: SAS
Raw Size: 2.728 TB [0x15d50a3b0 Sectors]
Firmware state: Online, Spun Up
Device Firmware Level: 0003
If you just want to extract the lines of interest, you can use egrep:
cmd | egrep '^((Slot Number)|(PD Type)|(Raw Size)):' \
| sed 's/^Slot Number/\n&/'
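If the fixed grep pattern gets unwieldy as the field list grows, the same filtering can be sketched in Ruby (the field names come from the question, including the tool's own "postion" spelling; the input is inlined for illustration):

```ruby
# Fields to keep, verbatim as the tool prints them (including its "postion" typo).
WANTED = [
  "Slot Number", "Drive's postion", "Enclosure position", "Device Id",
  "PD Type", "Raw Size", "Firmware state", "Device Firmware Level"
].freeze

# Inline sample; in practice, text = `your-raid-tool ...` or File.read(...).
text = <<~OUT
  Enclosure Device ID: 252
  Slot Number: 3
  PD Type: SAS

  Enclosure Device ID: 252
  Slot Number: 4
  PD Type: SAS
OUT

# Split into per-drive blocks on blank lines, keep only the wanted rows.
blocks = text.split(/\n{2,}/).map do |block|
  block.lines(chomp: true).select { |l| WANTED.any? { |w| l.start_with?("#{w}:") } }
end
puts blocks.map { |b| b.join("\n") }.join("\n\n")
```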

Btrieve GetNextExtended Status 62

I'm having trouble getting the GetNextExtended(36) operation working
in Btrieve. Here is the call which returns the status code 62 :
intStatus = BTRCALL(B_GETNEXTEXTENDED, _
m_byteFilePosBlk, _
m_byteRecordBuffer(0), _
lngDataBufferLen, _
ByVal strKeyBuffer, _
intKeyBufferLen, _
m_intKeyNum)
After searching for the code, I found numerous sites stating that it indicates an error in the data buffer, stored in m_byteRecordBuffer.
Here are the values stored in that variable:
m_byteRecordBuffer(0) 16 'These two bytes indicate the total size of'
m_byteRecordBuffer(1) 0 'data buffer'
m_byteRecordBuffer(2) 67 'These two bytes indicate the characters 'UC''
m_byteRecordBuffer(3) 85
m_byteRecordBuffer(4) 0 'These two bytes indicate the maximum reject'
m_byteRecordBuffer(5) 0 'count, which if set to 0 defaults to 4,095'
m_byteRecordBuffer(6) 0 'These two bytes indicate the number of terms'
m_byteRecordBuffer(7) 0 'which has been set to zero'
m_byteRecordBuffer(8) 1 'These two bytes indicate the number of'
m_byteRecordBuffer(9) 0 'records to return'
m_byteRecordBuffer(10) 1 'These two bytes indicate the number of fields'
m_byteRecordBuffer(11) 0 'to extract'
m_byteRecordBuffer(12) 2 'These two bytes indicate the length of the'
m_byteRecordBuffer(13) 0 'field to extract'
m_byteRecordBuffer(14) 1 'These two bytes indicate the field offset'
m_byteRecordBuffer(15) 0
I hope I am just missing something simple. Any help would be greatly appreciated.
In the record buffer, try swapping the position of the UC characters.
Put 'U' (85) in position 2 and 'C' (67) in position 3.
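To sanity-check the layout, the same 16-byte header can be built with Ruby's Array#pack as a hedged cross-check (little-endian 16-bit fields; the field order is taken from the question's comments):

```ruby
# "v" = 16-bit little-endian, "a2" = the two ASCII signature bytes.
# Fields: total size, "UC", reject count, term count, records to return,
# fields to extract, field length, field offset.
buffer = [16, "UC", 0, 0, 1, 1, 2, 1].pack("va2v6")

p buffer.bytes
# => [16, 0, 85, 67, 0, 0, 0, 0, 1, 0, 1, 0, 2, 0, 1, 0]
```

Note that bytes 2 and 3 come out as 85 ('U') then 67 ('C'), the order recommended here, rather than the 67, 85 shown in the question's buffer.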
