How to MapReduce in this situation in Hadoop?

I want to analyze a text file.
The text file's format is like this:
<msg time='2015-07-30T16:37:48.408+09:00' org_id='oracle' comp_id='rdbms'
msg_id='opiexe:3056:2780954927' client_id='' type='NOTIFICATION'
group='admin_ddl' level='16' host_id='TEST_DB1'
host_addr='127.0.0.1' module='sqlplus#TEST_DB1 (TNS V1-V3)' pid='24436'>
<txt>ORA-1543 signalled during: create tablespace TS_MODULE_I datafile &apos;/data001/orasvc01/NEWDB/ts_module_i_01.dbf&apos; size 20m...
</txt>
</msg>
<msg time='2015-07-30T16:39:13.173+09:00' org_id='oracle' comp_id='rdbms'
client_id='' type='UNKNOWN' level='16'
host_id='TEST_DB1' host_addr='127.0.0.1' module=''
pid='23242'>
<txt>Errors in
file /logs001/orasvc01/diag/rdbms/newdb/NEWDB/trace/NEWDB_smon_23242.trc:
ORA-01116: error in opening database file 6
ORA-01110: data file 6:
&apos;/data001/orasvc01/NEWDB/ts_module_d_01.dbf&apos;
ORA-27041: unable to open file
Linux-x86_64 Error: 2: No such file or directory
Additional information: 3
</txt>
</msg>
....
Sometimes a record includes 7 lines, but others include 10 lines.
In this situation I want an output like:
(column[0]) (column[1]) sum of errors
2015-07-31 ora-1051 7
What should I do?

Your input file is XML. If you had the entire XML record as a string on each line, you could use straightforward MapReduce. However, your input is in a different form: getting a record depends on a start tag and an end tag.
So you should use a custom record reader and create your own input format for MapReduce: XmlInputFormat. The good news is that it already exists, and you only have to customize it. You can search for "xmlinputformat mahout" for the actual class. An even easier way is to look at an example which uses that format. You can find it here. Once your mapper recognizes a record and you get hold of the contents inside, the rest is straightforward, and it is up to you which details are sent to the output. Happy coding.
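To make this concrete, below is a minimal sketch (not a drop-in solution), assuming you have copied Mahout's XmlInputFormat class into your project; that class takes its record delimiters from the xmlinput.start and xmlinput.end configuration keys. The regexes and class names are only illustrative.

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Sketch only: XmlInputFormat is the class copied/adapted from Mahout,
// assumed to live in the same package as this driver.
public class OraErrorCount {

  public static class MsgMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    // time='2015-07-30T16:37:48...' -> capture the date part
    private static final Pattern TIME = Pattern.compile("time='(\\d{4}-\\d{2}-\\d{2})");
    // ORA-1543, ORA-01116, ... anywhere in the record
    private static final Pattern ORA = Pattern.compile("(ORA-\\d+)");
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String record = value.toString();            // one whole <msg>...</msg> block
      Matcher t = TIME.matcher(record);
      Matcher o = ORA.matcher(record);
      if (t.find() && o.find()) {
        context.write(new Text(t.group(1) + "\t" + o.group(1).toLowerCase()), ONE);
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));     // e.g. "2015-07-30  ora-1543  7"
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Mahout's XmlInputFormat reads the record delimiters from these keys
    conf.set("xmlinput.start", "<msg");
    conf.set("xmlinput.end", "</msg>");

    Job job = Job.getInstance(conf, "ora error count");
    job.setJarByClass(OraErrorCount.class);
    job.setInputFormatClass(XmlInputFormat.class);
    job.setMapperClass(MsgMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

With the sample input above this emits one line per (date, ORA code) pair, e.g. "2015-07-30 ora-1543 1", which matches the requested output once the counts are summed across the whole file.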

Related

Sort loki logs in grafana

Sort loki logs
<timestamp><level> <msg in json>
<timestamp><level> <msg in json>
<timestamp><level> <msg in json>
<timestamp><level> <msg in json>
<timestamp><level> <msg in json>
This is how my logs look.
In the msg JSON there is duration:20.32ms, so I am looking to sort the logs by duration when displaying them in Grafana.

Input / Output error when using HDFS NFS Gateway

Getting "Input / output error" when trying work with files in mounted HDFS NFS Gateway. This is despite having set dfs.namenode.accesstime.precision=3600000 in Ambari. For example, doing something like...
$ hdfs dfs -cat /hdfs/path/to/some/tsv/file | sed -e "s/$NULL_WITH_TAB/$TAB/g" | hadoop fs -put -f - /hdfs/path/to/some/tsv/file
$ echo -e "Lines containing null (expect zero): $(grep -c "\tnull\t" /nfs/hdfs/path/to/some/tsv/file)"
to remove nulls from a TSV and then inspect that TSV for nulls via the NFS location throws the error, but I am seeing it in many other places (again, dfs.namenode.accesstime.precision=3600000 is already set). Does anyone have any idea why this may be happening, or debugging suggestions? Can anyone explain what exactly "access time" is in this context?
From discussion on the apache hadoop mailing list:
I think access time refers to the POSIX atime attribute for files, the “time of last access” as described here for instance (https://www.unixtutorial.org/atime-ctime-mtime-in-unix-filesystems). While HDFS keeps a correct modification time (mtime), which is important, easy and cheap, it only keeps a very low-resolution sense of last access time, which is less important, and expensive to monitor and record, as described here (https://issues.apache.org/jira/browse/HADOOP-1869) and here (https://superuser.com/questions/464290/why-is-cat-not-changing-the-access-time).
However, to have a conforming NFS api, you must present atime, and so the HDFS NFS implementation does. But first you have to configure it on. [...] many sites have been advised to turn it off entirely by setting it to zero, to improve HDFS overall performance. See for example here (https://community.hortonworks.com/articles/43861/scaling-the-hdfs-namenode-part-4-avoiding-performa.html, section "Don't let Reads become Writes"). So if your site has turned off atime in HDFS, you will need to turn it back on to fully enable NFS. Alternatively, you can maintain optimum efficiency by mounting NFS with the "noatime" option, as described in the document you reference.
[...] check under /var/log, eg with find /var/log -name ‘*nfs3*’ -print
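If it helps to see what HDFS itself records, here is a small sketch using the standard Hadoop FileSystem API (the path is simply whatever you pass as the first argument); it prints the precise mtime and the coarse atime that the NFS gateway ultimately has to present:

import java.util.Date;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowAtime {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    FileStatus st = fs.getFileStatus(new Path(args[0]));

    // mtime is tracked precisely; atime is only updated as often as
    // dfs.namenode.accesstime.precision allows (0 disables it entirely)
    System.out.println("mtime: " + new Date(st.getModificationTime()));
    System.out.println("atime: " + new Date(st.getAccessTime()));
    System.out.println("accesstime precision (ms): "
        + conf.getLong("dfs.namenode.accesstime.precision", 3600000L));
  }
}

If the reported atime never changes after reads, that is a good hint that access-time tracking is disabled on the cluster.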

What is the file ORA_DUMMY_FILE.f in Oracle?

oracle version: 12.2.0.1
As you know, these are the Unix processes for the parallel servers in Oracle:
ora_p000_ora12c
ora_p001_ora12c
....
ora_p???_ora12c
They can also be seen with the view gv$px_process; the spid for each parallel server can be obtained from there.
Then I look for the open files associated with the parallel server here:
ls -l /proc/<spid>/fd
And I'm finding around 500-10000 file descriptors for several parallel servers, like this one:
991 -> /u01/app/oracle/admin/ora12c/dpdump/676185682F2D4EA0E0530100007FFF5E/ORA_DUMMY_FILE.f (deleted)
I've deleted them using the following (actually I've created a small script to do it, because there are thousands of them):
gdb -p <spid>
gdb> p close(<fd_id>)
But after some hours the file descriptors start being created again (hundreds every day).
If they are not deleted, then eventually the Linux limit is reached and any parallel query throws an error like this:
ORA-12801: error signaled in parallel query server P001
ORA-01116: error in opening database file 132
ORA-01110: data file 132: '/u02/oradata/ora12c/pdbname/tablespacenaname_ts_1.dbf'
ORA-27077: too many files open
Does anyone have any idea how and why these file descriptors are being created, and how to avoid it?
Edited: Added some more information that could be useful.
I've verified that when a new PDB is created, a directory DATA_PUMP_DIR is created in it (select * from all_directories), pointing to:
/u01/app/oracle/admin/ora12c/dpdump/<xxxxxxxxxxxxx>
The linux directory is also created.
Also, one file descriptor is created pointing to ORA_DUMMY_FILE.f in the new dpdump subdirectory, like the ones described initially:
lsof | grep "ORA_DUMMY_FILE.f (deleted)"
/u01/app/oracle/admin/ora12c/dpdump/<xxxxxxxxxxxxx>/ORA_DUMMY_FILE.f (deleted)
This may be OK; the problem I face is the continuous growth of the file descriptors pointing to ORA_DUMMY_FILE.f, which eventually reaches the Linux limits.

Cronjob with Jelastic and Glassfish

I am running a web application (MyCronTest) on a Glassfish server in a Jelastic environment. This web application contains a servlet (/test) that I would like to call regularly with a cron job.
So I followed this tutorial from the Jelastic docs, but they use Tomcat instead of Glassfish and I am not so sure about the paths and where to put which file...and now I am lost ;)
the servlet
When calling the servlet directly in my browser it prints out the following line to System.out:
test executed at 05/03/2014 15:00
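(For completeness, the servlet itself is trivial; here is a minimal sketch of what such a /test servlet might look like - the class name and log message are illustrative, not the actual code.)

import java.io.IOException;
import java.text.SimpleDateFormat;
import java.util.Date;

import javax.servlet.annotation.WebServlet;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Illustrative only: a servlet mapped to /test that logs each time it is hit
@WebServlet("/test")
public class TestServlet extends HttpServlet {
  @Override
  protected void doGet(HttpServletRequest req, HttpServletResponse resp) throws IOException {
    String stamp = new SimpleDateFormat("dd/MM/yyyy HH:mm").format(new Date());
    System.out.println("test executed at " + stamp);  // ends up in the Glassfish server log
    resp.getWriter().println("test executed at " + stamp);
  }
}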
the bash file to execute
I created a bash script called myCronJob.sh and put it in the directory glassfish3/temp:
#!/bin/bash
curl http://myGlassfish.jelastic.dogado.eu/MyCronTest/test;
I tested it of course, it is executable and it works (at least when I execute it on my computer).
the cron event scheduler
According to the tutorial there is a file /cron/tomcat I need to edit. Well, I found a /cron/glassfish which (I am guessing) should do the same.
# IMPORTANT NOTE!
# Please make sure there is a blank line after the last cronjob entry.
*/1 * * * * /opt/glassfish3/temp/myCronJob.sh
I added an empty line at the end, as they told me to. I even tried it with
*/1 * * * * /bin/bash /opt/glassfish3/temp/myCronJob.sh
as they suggested in the tutorial. But still no output. No error... just empty log files.
Does anyone have an idea what I am missing here? Am I doing something wrong?
Solution / Edit
Thanks to Damien's Answer I was finally able to narrow down my problem. It was actually the line in my bash-script that caused the problem:
curl http://myGlassfish.jelastic.dogado.eu/MyCronTest/test;
should have been
curl http://localhost/MyCronTest/test;
since I was blocked by a firewall. Lucky for me, my Glassfish is running on the same machine / environment, so localhost works.
Everything else is correct.
Well, I found a /cron/glassfish which (I am guessing) should do the same.
Correct.
But still no output. No error.. just empty log files.
Assuming that you have correctly uploaded your file to /opt/glassfish3/temp/myCronJob.sh, I recommend that you try to direct the cron output to your own log file or email it to you:
MAILTO="your#email.com"
*/1 * * * * /opt/glassfish3/temp/myCronJob.sh 2&1 > /opt/glassfish3/glassfish/nodes/localhost-domain1/instance-168458181/logs/cronoutput.log
Note that the email may be filtered by your spam filters due to things like a missing PTR (reverse DNS) and so on - but it's OK to use it like this for testing/debugging purposes (just don't rely on these emails getting through for anything critical!).
If these tips don't help you, then I recommend contacting your hosting provider's support team to verify the .sh file's permissions, output when executed manually, and the cron log file contents (all of which only they can help you with).

FTP Adapter Oracle SOA

I want to read files with a gap of 3 minutes each, so my BPEL FTP adapter reads one file every 3 minutes. E.g. I have 5 files in a directory; the FTP adapter reads the 1st file, after 3 minutes it reads the 2nd, and so on.
In the BPEL FTP/File Adapter Configuration Wizard, in the Get File or Read operation, on the step next to the file name, you can set "Polling Frequency" to 3 minutes so that it will poll for a file every 3 minutes.
After that, check the .jca file; it has the property below, and you can also edit the polling frequency from there:
<property name="PollingFrequency" value="180"/>
You can use the following properties for your file adapter configuration:
<property name="PollingFrequency" value="180"/>
<property name="MaxRaiseSize" value="1"/>
