I am trying to parse a bunch of log data using Pig. Unfortunately the data for one command is spread across multiple lines (an audit log). I know there is an id that correlates all of the log messages, and that there are different types each containing a piece of the whole, but I am unsure how to gather them all into one message.
I split the messages based on type and then joined based on the id, but since there is a one-to-many relationship between SYSCALL and PATH, this doesn't gather all of the information onto one line. I can group by id, but then I want to pull out the same field (name) from every PATH tuple, and I don't know of any way to do that.
Should I just write my own UDF? A FOREACH doesn't keep track of state, so I can't use it to concatenate the name field from each tuple.
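If a UDF is the route, here is roughly the shape I have in mind: after a GROUP by id, a Jython UDF takes the bag of PATH tuples and concatenates their name fields. This is only a sketch under my own assumptions (script name, alias, and field position are mine, untested):

```python
# concat_names.py -- a hypothetical Jython UDF; in the Pig script:
#   REGISTER 'concat_names.py' USING jython AS myudfs;
#   ... FOREACH grouped GENERATE group, myudfs.concat_names(paths.name);
# Pig supplies the outputSchema decorator when running under Jython.
@outputSchema("names:chararray")
def concat_names(bag):
    # the bag arrives as a list of tuples; assume name is the first field
    return ",".join(t[0] for t in bag if t and t[0] is not None)
```

(Though perhaps Pig's builtin BagToString already covers the plain concatenation case without a custom UDF.)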
Edited to add example:
{"message":"Jan 6 15:30:11 r01sv06 auditd: node=r01sv06 type=SYSCALL
msg=audit(1389047402.069:4455727): arch=c000003e syscall=59
success=yes exit=0 a0=7fff8ef30600 a1=7fff8ef30630 a2=270f950
a3=fffffffffffffff0 items=2 ppid=1493 pid=1685 auid=0 uid=0 gid=0
euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) ses=8917
comm=\"ip\" exe=\"/sbin/ip\"
key=\"command\"","#timestamp":"2014-01-06T22:30:14.642Z","#version":"1","type":"audit","host":"r01sv09a","path":"/data/logs/audit.log","syslog_timestamp":"Jan
6 15:30:11","syslog_program":"auditd","received_at":"2014-01-06
22:30:14 UTC", "received_from":"r01sv06" ,"syslog_severity_code":5
,"syslog_facility_code":1
,"syslog_facility":"user-level","syslog_severity":"notice","#source_host":"r01sv06"}
{"message":"Jan 6 15:30:11 r01sv06 auditd: node=r01sv06 type=EXECVE
msg=audit(1389047402.069:4455727): argc=4 a0=\"/sbin/ip\" a1=\"link\"
a2=\"show\"
a3=\"lo\"","#timestamp":"2014-01-06T22:30:14.643Z","#version":"1","type":"audit","host":"r01sv09a","path":"/data/logs/audit.log","syslog_timestamp":"Jan
6 15:30:11","syslog_program":"auditd","received_at":"2014-01-06
22:30:14 UTC", "received_from":"r01sv06", "syslog_severity_code":5,
"syslog_facility_code":1,"syslog_facility":"user-level",
"syslog_severity":"notice","#source_host":"r01sv06"}
{"message":"Jan 6 15:30:11 r01sv06 auditd: node=r01sv06 type=CWD
msg=audit(1389047402.069:4455727):
cwd=\"/root\"","#timestamp":"2014-01-06T22:30:14.644Z","#version":"1","type":"audit","host":"r01sv09a","path":"/data/logs/audit.log","syslog_timestamp":"Jan
6 15:30:11","syslog_program":"auditd","received_at":"2014-01-06
22:30:14 UTC","received_from":"r01sv06", "syslog_severity_code":5,
"syslog_facility_code":1, "syslog_facility":"user-level",
"syslog_severity":"notice", "#source_host":"r01sv06"}
{"message":"Jan 6 15:30:11 r01sv06 auditd: node=r01sv06 type=PATH
msg=audit(1389047402.069:4455727): item=0 name=\"/sbin/ip\"
inode=1703996 dev=08:02 mode=0100755 ouid=0 ogid=0
rdev=00:00","#timestamp":"2014-01-06T22:30:14.645Z","#version":"1","type":"audit","host":"r01sv09a","path":"/data/logs/audit.log","syslog_timestamp":"Jan
6 15:30:11","syslog_program":"auditd","received_at":"2014-01-06
22:30:14 UTC", "received_from":"r01sv06", "syslog_severity_code":5,
"syslog_facility_code":1,"syslog_facility":"user-level",
"syslog_severity":"notice", "#source_host":"r01sv06",}
I'm hoping to take these two box plots and combine them into one image:
[![These are two data files I was able to make box and whisker charts for easily using Seaborn boxplot][1]][1]
The data file I am using comes from multiple Excel spreadsheets and looks like this:
0    1    2    3    4    5    6    ...
5    2    3    5    6    2    5    ...
2    3    4    6    1    2    1    ...
1    2    4    6    7    8    9    ...
...  ...  ...  ...  ...  ...  ...  ...
The column headers represent hours, and the column values are what I want to use to create the box-and-whisker plots.
Currently my code is this:
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

xls = pd.ExcelFile('ControlDayVar.xlsx')
df1 = pd.read_excel(xls, 'DE_ControlDays').assign(Location=1)
df2 = pd.read_excel(xls, 'DE_FestDays').assign(Location=2)
DE_all = pd.concat([df1, df2])
DE = pd.melt(DE_all, id_vars=['Location'], var_name='Hours', value_name='Concentration')
ax = sns.boxplot(x='Hours', y='Concentration', hue='Location', data=DE)
plt.show()
The result I get is this:
[![Yikes][2]][2]
I expect my issue has to do with the format of my data files, but any help would be appreciated. Thanks!
[1]: https://i.stack.imgur.com/dXo6F.jpg
[2]: https://i.stack.imgur.com/NEpi7.jpg
This can happen if the Concentration values are no longer recognized as a numeric data type.
In that case the y-axis can no longer be treated as continuous, which leads to that "yikes" result.
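One way to check, and a possible fix, assuming the `DE` frame from the question (`pd.to_numeric` coerces anything unparsable to NaN, which can then be dropped):

```python
print(DE.dtypes)  # Concentration should show int64 or float64

# coerce Concentration to numeric; unparsable entries become NaN
DE['Concentration'] = pd.to_numeric(DE['Concentration'], errors='coerce')
DE = DE.dropna(subset=['Concentration'])

ax = sns.boxplot(x='Hours', y='Concentration', hue='Location', data=DE)
plt.show()
```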
I need to get two things from a TSV input file:
1- Find how many unique strings there are in a given column, where the individual values are comma separated. For this I used the command below, which gave me unique values:
$ awk -F'\t' '{print $5}' input.tsv | sort | uniq | wc -l
Input file example with header (6 columns) and 10 rows:
$ cat hum1003.tsv
p-Value Score Disease-Id Disease-Name Gene-Symbols Entrez-IDs
0.0463 4.6263 OMIM:117000 #117000 CENTRAL CORE DISEASE OF MUSCLE;;CCD;;CCOMINICORE MYOPATHY, MODERATE, WITH HAND INVOLVEMENT, INCLUDED;;MULTICORE MYOPATHY, MODERATE, WITH HAND INVOLVEMENT, INCLUDED;;MULTIMINICORE DISEASE, MODERATE, WITH HAND INVOLVEMENT, INCLUDED;;NEUROMUSCULAR DISEASE, CONGENITAL, WITH UNIFORM TYPE 1 FIBER, INCLUDED;CNMDU1, INCLUDED RYR1 (6261) 6261
0.0463 4.6263 OMIM:611705 MYOPATHY, EARLY-ONSET, WITH FATAL CARDIOMYOPATHY TTN (7273) 7273
0.0513 4.6263 OMIM:609283 PROGRESSIVE EXTERNAL OPHTHALMOPLEGIA WITH MITOCHONDRIAL DNA DELETIONS,AUTOSOMAL DOMINANT, 2 POLG2 (11232), SLC25A4 (291), POLG (5428), RRM2B (50484), C10ORF2 (56652) 11232, 291, 5428, 50484, 56652
0.0539 4.6263 OMIM:605637 #605637 MYOPATHY, PROXIMAL, AND OPHTHALMOPLEGIA; MYPOP;;MYOPATHY WITH CONGENITAL JOINT CONTRACTURES, OPHTHALMOPLEGIA, ANDRIMMED VACUOLES;;INCLUSION BODY MYOPATHY 3, AUTOSOMAL DOMINANT, FORMERLY; IBM3, FORMERLY MYH2 (4620) 4620
0.0577 4.6263 OMIM:609284 NEMALINE MYOPATHY 1 TPM2 (7169), TPM3 (7170) 7169, 7170
0.0707 4.6263 OMIM:608358 #608358 MYOPATHY, MYOSIN STORAGE;;MYOPATHY, HYALINE BODY, AUTOSOMAL DOMINANT MYH7 (4625) 4625
0.0801 4.6263 OMIM:255320 #255320 MINICORE MYOPATHY WITH EXTERNAL OPHTHALMOPLEGIA;;MINICORE MYOPATHY;;MULTICORE MYOPATHY;;MULTIMINICORE MYOPATHY MULTICORE MYOPATHY WITH EXTERNAL OPHTHALMOPLEGIA;;MULTIMINICORE DISEASE WITH EXTERNAL OPHTHALMOPLEGIA RYR1 (6261) 6261
0.0824 4.6263 OMIM:256030 #256030 NEMALINE MYOPATHY 2; NEM2 NEB (4703) 4703
0.0864 4.6263 OMIM:161800 #161800 NEMALINE MYOPATHY 3; NEM3MYOPATHY, ACTIN, CONGENITAL, WITH EXCESS OF THIN MYOFILAMENTS, INCLUDED;;NEMALINE MYOPATHY 3, WITH INTRANUCLEAR RODS, INCLUDED;;MYOPATHY, ACTIN, CONGENITAL, WITH CORES, INCLUDED ACTA1 (58) 58
0.0939 4.6263 OMIM:602771 RIGID SPINE MUSCULAR DYSTROPHY 1 MYH7 (4625), SEPN1 (57190), TTN (7273), ACTA1 (58) 4625, 57190, 7273, 58
So in this case the strings are gene names, and I want to count the unique strings across the entire stretch of the 5th column, where they are separated by a comma and a space.
2- Next, the data is in a fixed order, ranked by the score in column 2. I want to know where the gene of interest is placed in this ranked list within column 5 (Gene-Symbols). This has to be done after removing duplicates, since the same genes repeat due to other parameters in the remaining columns, which don't concern my final output; I only need the ranked list as per column 2. How do I do that? Is there a command I can pipe onto the one above to get the position of a given value?
Expected output:
The command from point 1 should give me the number of unique genes in column 5. I have 18 genes in total in column 5, but only 14 unique values. If the gene of interest is TTN, its first occurrence is at the second position in the original ranked list, so the expected answer for where my gene of interest is located is 2.
$14
$2
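For reference, this is the logic I'm after, sketched in Python (the file name, column index, and gene name come from my example above; a shell pipeline would be equally welcome):

```python
import csv

GENE = "TTN"  # gene of interest (example)

symbols = []  # gene symbols from column 5, in the file's ranked order
with open("hum1003.tsv") as f:
    rows = csv.reader(f, delimiter="\t")
    next(rows)  # skip the header line
    for row in rows:
        # column 5 holds entries like "RYR1 (6261), TTN (7273)"
        for entry in row[4].split(","):
            symbols.append(entry.strip().split(" ")[0])

unique = list(dict.fromkeys(symbols))  # de-duplicate, keeping first-seen order
print(len(unique))                     # 14 for the sample above
print(unique.index(GENE) + 1)          # 2 -> TTN's position after de-duplication
```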
Thanks
Hi, I have two tables in my DB. The first table is given below.
Table name:
t_hcsy_details
Class name in model:
class THcsyDetails < ActiveRecord::Base
end
The values inside the table are given below.
HCSY_Details_ID  HCSY_ID  HCSY_Fund_Type_ID  Amount
1                2        1                  1125
2                2        2                  390
3                2        3                  285
4                2        4                  100
5                2        5                  60
6                2        6                  40
My second table is given below.
Table Name:
t_hcsy_fund_type_master
class in model:
class THcsyFundTypeMaster < ActiveRecord::Base
end
Table values are given below.
HCSY_Fund_Type_ID  Fund_Type_Code  Fund_Type_Name  Amount
1                  1               woods           1125
2                  2               Burning         390
3                  3               goods           285
4                  4               brahmin         100
5                  5               swd             60
6                  6               Photo           40
I only know the HCSY_ID value (i.e. 2) in the first table, but I need Fund_Type_Name and Amount from the second table. As you can see, one HCSY_ID has 6 different records, and I need the Fund_Type_Name and Amount for all of them. Please help me resolve this by creating objects for the two classes shown above.
You haven't specified any relationships in your setup, so it is easier to split this into two queries:
# you already have hcsy_id
fund_type_ids = THcsyDetails.where(hcsy_id: hcsy_id).pluck(:hcsy_fund_type_id)
# use hcsy_fund_type_id here instead of id if the model's primary key is not id
fund_types = THcsyFundTypeMaster.where(id: fund_type_ids)
fund_types.group(:fund_type_name).sum(:amount)
If you had proper relationships set up, the above would simplify to:
THcsyDetails.
joins(association_name). # THcsyFundTypeMaster
where(hcsy_id: hcsy_id).
group("#{t = THcsyFundTypeMaster.table_name}.fund_type_name").
sum("#{t}.amount")
I created this device which sends a point to my web server. My web server stores a Point instance, which has a created_at attribute to reflect the point's creation time. My device consistently sends a request to my server at a 180 s interval.
Now I want to see the periods of time my device has experienced outages in the last 7 days.
As an example, let's pretend it's August 3rd (08/03). I can query my Points table for points from the last 3 days, sorted by created_at:
Points = [ point(name=p1, created_at="08/01 00:00:00"),
point(name=p2, created_at="08/01 00:03:00"),
point(name=p3, created_at="08/01 00:06:00"),
point(name=p4, created_at="08/01 00:20:00"),
point(name=p5, created_at="08/03 00:01:00"),
... ]
I would like to write an algorithm that can list out the following outages:
outages = {
"08/01": [ "00:06:00-00:20:00", "00:20:00-23:59:59" ],
"08/02": [ "00:00:00-23:59:59" ],
"08/03": [ "00:00:00-00:01:00" ],
}
Is there an elegant way to do this?
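A sketch of one approach (assuming timestamps already sorted ascending and the fixed 180 s heartbeat; the threshold and helper name are mine): treat the query window's edges as virtual heartbeats, flag any gap wider than a threshold, and split gaps that cross midnight into per-day segments.

```python
from collections import defaultdict
from datetime import datetime, timedelta

THRESHOLD = timedelta(seconds=360)  # two missed 180 s heartbeats

def find_outages(timestamps, window_start, window_end):
    """Return {"MM/DD": ["HH:MM:SS-HH:MM:SS", ...]} for every gap > THRESHOLD."""
    outages = defaultdict(list)
    # treat the window edges as virtual heartbeats so leading and
    # trailing silence is reported too
    beats = [window_start] + sorted(timestamps) + [window_end]
    for prev, curr in zip(beats, beats[1:]):
        if curr - prev <= THRESHOLD:
            continue
        day = prev.date()
        while day <= curr.date():  # split gaps that cross midnight
            start = prev if day == prev.date() else datetime.combine(day, datetime.min.time())
            end = curr if day == curr.date() else datetime.combine(day, datetime.max.time())
            outages[day.strftime("%m/%d")].append(f"{start:%H:%M:%S}-{end:%H:%M:%S}")
            day += timedelta(days=1)
    return dict(outages)

# With the sample points above and a window of 08/01 00:00:00 to
# 08/03 00:01:00, this yields the expected outages dict.
```

The midnight split is what produces entries like "00:20:00-23:59:59" on one day followed by a full-day "00:00:00-23:59:59" on the next.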
I have the following Redis keys:
REDIS.del "weekly:activity"
REDIS.del "2013-02-27:activity"
REDIS.del "2013-02-28:activity"
REDIS.sadd "2013-02-27:activity", 1
REDIS.sadd "2013-02-27:activity", 2
REDIS.sadd "2013-02-27:activity", 3
REDIS.sadd "2013-02-28:activity", 4
REDIS.sadd "2013-02-28:activity", 1
REDIS.sadd "2013-02-28:activity", 1
REDIS.sadd "2013-02-28:activity", 6
REDIS.sunionstore "weekly:activity", "2013-02-27:activity", "2013-02-28:activity"
REDIS.scard "weekly:activity"
What would be the best way to recognize the first day of the current week and sum the stats from the current week?
Can I do that using Redis?
Or should I do it in Ruby?
Instead of summing regularly, I suggest you collect more data up front and create keys for weeks and months.
So every time a user logs in, you do:
REDIS.incr "#{today}:activity"
REDIS.incr "#{this_week}:weekly:activity"
REDIS.incr "#{this_month}:monthly:activity"
This way, there is no reporting step to do. I'll leave it up to you to compute this_week and this_month.
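For reference, one way to derive those period keys (a sketch; Python shown, but Ruby's Time#strftime accepts the same format codes). The Monday computation also answers the question's "first day of the current week" for the original set-based approach:

```python
import datetime

now = datetime.date.today()
today      = now.strftime("%Y-%m-%d")  # e.g. "2013-02-28"
this_week  = now.strftime("%G-W%V")    # ISO year-week, e.g. "2013-W09"
this_month = now.strftime("%Y-%m")     # e.g. "2013-02"

# first day (Monday) of the current week, for the set-based approach:
monday = now - datetime.timedelta(days=now.weekday())
week_keys = [f"{(monday + datetime.timedelta(days=i)).isoformat()}:activity"
             for i in range(7)]
# then: REDIS.sunionstore("weekly:activity", *week_keys)
#       REDIS.scard("weekly:activity")
```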