Extract individual column from a HIVE table - hadoop

Below is a select query from a HIVE table:
select * from test_aviation limit 5;
OK
2015 1 1 1 4 2015-01-01 AA 19805 AA N787AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 06 California 91 0900 0855 -5.00 0.00 0.00 -1 0900-0959 17.00 0912 1230 7.00 1230 1237 7.00 7.00 0.00 0 1200-1259 0.00 0.00 390.00 402.00 378.00 1.00 2475.00 10
2015 1 1 2 5 2015-01-02 AA 19805 AA N795AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 06 California 91 0900 0850 -10.00 0.00 0.00 -1 0900-0959 15.00 0905 1202 9.00 1230 1211 -19.00 0.00 0.00 -2 1200-1259 0.00 0.00 390.00 381.00 357.00 1.00 2475.00 10
2015 1 1 3 6 2015-01-03 AA 19805 AA N788AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 06 California 91 0900 0853 -7.00 0.00 0.00 -1 0900-0959 15.00 0908 1138 13.00 1230 1151 -39.00 0.00 0.00 -2 1200-1259 0.00 0.00 390.00 358.00 330.00 1.00 2475.00 10
2015 1 1 4 7 2015-01-04 AA 19805 AA N791AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 06 California 91 0900 0853 -7.00 0.00 0.00 -1 0900-0959 14.00 0907 1159 19.00 1230 1218 -12.00 0.00 0.00 -1 1200-1259 0.00 0.00 390.00 385.00 352.00 1.00 2475.00 10
2015 1 1 5 1 2015-01-05 AA 19805 AA N783AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 06 California 91 0900 0853 -7.00 0.00 0.00 -1 0900-0959 27.00 0920 1158 24.00 1230 1222 -8.00 0.00 0.00 -1 1200-1259 0.00 0.00 390.00 389.00 338.00 1.00 2475.00 10
Time taken: 0.067 seconds, Fetched: 5 row(s)
Structure of HIVE table
hive> describe test_aviation;
OK
col_value string
Time taken: 0.221 seconds, Fetched: 1 row(s)
I want to split the entire table into separate columns. I have written the query below to extract the 12th column:
SELECT regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 12) from test_aviation;
Output:
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1437067221195_0008, Tracking URL = http://localhost:8088/proxy/application_1437067221195_0008/
Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1437067221195_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-07-17 02:46:56,215 Stage-1 map = 0%, reduce = 0%
2015-07-17 02:47:27,650 Stage-1 map = 100%, reduce = 0%
Ended Job = job_1437067221195_0008 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://localhost:8088/proxy/application_1437067221195_0008/
Examining task ID: task_1437067221195_0008_m_000000 (and more) from job job_1437067221195_0008
Task with the most failures(4):
-----
Task ID:
task_1437067221195_0008_m_000000
URL:
http://localhost:8088/taskdetails.jsp?jobid=job_1437067221195_0008&tipid=task_1437067221195_0008_m_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"col_value":"2015\t1\t1\t1\t4\t2015-01-01\tAA\t19805\tAA\tN787AA\t1\tJFK\tNew York\t NY\tNY\t36\tNew York\t22\tLAX\tLos Angeles\t CA\tCA\t06\tCalifornia\t91\t0900\t0855\t-5.00\t0.00\t0.00\t-1\t0900-0959\t17.00\t0912\t1230\t7.00\t1230\t1237\t7.00\t7.00\t0.00\t0\t1200-1259\t0.00\t\t0.00\t390.00\t402.00\t378.00\t1.00\t2475.00\t10\t\t\t"}
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:195)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"col_value":"2015\t1\t1\t1\t4\t2015-01-01\tAA\t19805\tAA\tN787AA\t1\tJFK\tNew York\t NY\tNY\t36\tNew York\t22\tLAX\tLos Angeles\t CA\tCA\t06\tCalifornia\t91\t0900\t0855\t-5.00\t0.00\t0.00\t-1\t0900-0959\t17.00\t0912\t1230\t7.00\t1230\t1237\t7.00\t7.00\t0.00\t0\t1200-1259\t0.00\t\t0.00\t390.00\t402.00\t378.00\t1.00\t2475.00\t10\t\t\t"}
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:550)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:177)
... 8 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public java.lang.String org.apache.hadoop.hive.ql.udf.UDFRegExpExtract.evaluate(java.lang.String,java.lang.String,java.lang.Integer) on object org.apache.hadoop.hive.ql.udf.UDFRegExpExtract#4def4616 of class org.apache.hadoop.hive.ql.udf.UDFRegExpExtract with arguments {2015 1 1 1 4 2015-01-01 AA 19805 AA N787AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 06 California 91 0900 0855 -5.00 0.00 0.00 -1 0900-0959 17.00 0912 1230 7.00 1230 1237 7.00 7.00 0.00 0 1200-1259 0.00 0.00 390.00 402.00 378.00 1.00 2475.00 10 :java.lang.String, ^(?:([^,]*),?){1}:java.lang.String, 12:java.lang.Integer} of size 3
at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:1243)
at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.evaluate(GenericUDFBridge.java:182)
at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:166)
at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:65)
at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:79)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:793)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:92)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:793)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:540)
... 9 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:1219)
... 18 more
Caused by: java.lang.IndexOutOfBoundsException: No group 12
at java.util.regex.Matcher.group(Matcher.java:487)
at org.apache.hadoop.hive.ql.udf.UDFRegExpExtract.evaluate(UDFRegExpExtract.java:56)
... 23 more
Please help me to extract different columns from a HIVE table.

Your regexp_extract call fails because the pattern ^(?:([^,]*),?){1} defines only one capturing group, so asking for group 12 throws "No group 12". A simpler way to pull out a field is split():
select split(col_value,' ')[11] as column_12 from test_aviation;
That assumes space delimiters; use '\\t' for tab, '\\|' for pipe, ':' for colon, and so on.
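Judging by the \t separators in the failing row shown in your error output, this data actually looks tab-delimited, so a sketch of the tab variant would be:
-- hedged sketch: assumes the raw line is tab-separated, as the \t characters in the error output suggest
select split(col_value, '\\t')[11] as column_12   -- 12th tab-separated field (JFK in the sample row)
from test_aviation;
Longer term it may be cleaner to reload the data into a table declared with ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t', so each field becomes a real column instead of one long string.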

Related

Removing lines depending upon keyword occurrence

I have 7,000 files (sade1.pdbqt ... sade7200.pdbqt). Only some of the files contain a second (or later) occurrence of the keyword TORSDOF. For a given file, if there is a second occurrence of TORSDOF, I want to remove all lines following the first occurrence, while preserving the file name. Can somebody please provide a sample snippet? Thank you.
$ cat FileWith2ndOccurance.txt
ashu
vishu
jyoti
TORSDOF
Jatin
Vishal
Shivani
TORSDOF
Sushil
Kiran
After the script runs:
$ cat FileWith2ndOccurance.txt
ashu
vishu
jyoti
TORSDOF
EDIT 1: Actual file content:
REMARK Name = 17-DMAG.cdx
REMARK 8 active torsions:
REMARK status: ('A' for Active; 'I' for Inactive)
REMARK 1 A between atoms: C_1 and N_8
REMARK 2 A between atoms: N_8 and C_9
REMARK 3 A between atoms: C_9 and C_10
REMARK 4 A between atoms: C_10 and N_11
REMARK 5 A between atoms: C_15 and O_17
REMARK 6 A between atoms: C_25 and O_28
REMARK 7 A between atoms: C_27 and O_33
REMARK 8 A between atoms: O_28 and C_29
REMARK x y z vdW Elec q Type
REMARK _______ _______ _______ _____ _____ ______ ____
ROOT
ATOM 1 C UNL 1 7.579 11.905 0.000 0.00 0.00 +0.000 C
ATOM 2 C UNL 1 7.579 10.500 0.000 0.00 0.00 +0.000 C
ATOM 30 O UNL 1 8.796 8.398 0.000 0.00 0.00 +0.000 OA
ENDROOT
BRANCH 21 31
ATOM 31 O UNL 1 13.701 7.068 0.000 0.00 0.00 +0.000 OA
ATOM 32 C UNL 1 12.306 6.953 0.000 0.00 0.00 +0.000 C
ENDBRANCH 41 42
ENDBRANCH 19 41
TORSDOF 8
REMARK Name = 17-DMAG.cdx
REMARK 8 active torsions:
REMARK status: ('A' for Active; 'I' for Inactive)
REMARK 1 A between atoms: C_1 and N_8
REMARK 2 A between atoms: N_8 and C_9
REMARK x y z vdW Elec q Type
REMARK _______ _______ _______ _____ _____ ______ ____
ROOT
ATOM 1 CL UNL 1 0.000 11.656 0.000 0.00 0.00 +0.000 Cl
ENDROOT
TORSDOF 0
What I would do:
#!/bin/bash
for file in sade*.pdbqt; do
    # count the lines that start with TORSDOF
    count=$(grep -c '^TORSDOF' "$file")
    if ((count > 1)); then
        # keep everything up to and including the first TORSDOF, then overwrite the file
        awk '/^TORSDOF/{print;exit}1' "$file" > /tmp/.torsdof &&
            mv /tmp/.torsdof "$file"
    fi
done

Open txt file inside windows 10 container

I have a simple Docker container that is based on a Windows image:
FROM mcr.microsoft.com/windows:1903
WORKDIR /app1/
ENTRYPOINT powershell.exe
I run it interactively, using:
docker run -it -v c:\app1:c:\app1 test-image:1.0
There is a file called 1.txt inside app1 folder.
When I run:
.\app1\1.txt
I see no notepad.exe process; instead I can spot an OpenWith process:
Handles NPM(K) PM(K) WS(K) CPU(s) Id SI ProcessName
------- ------ ----- ----- ------ -- -- -----------
78 5 1056 4452 0.02 1940 1 CExecSvc
74 5 5360 3792 0.02 1904 1 cmd
81 5 904 1364 0.00 1844 1 CompatTelRunner
156 10 6532 6088 0.00 1728 1 conhost
97 7 1196 4980 0.05 1896 1 conhost
286 13 1836 4976 0.27 984 1 csrss
37 6 1348 3356 0.06 524 1 fontdrvhost
0 0 60 8 0 0 Idle
831 22 4748 13844 0.17 460 1 lsass
546 25 13156 28920 0.17 1952 1 OfficeClickToRun
420 24 7400 28844 0.13 2472 1 OpenWith
376 22 6732 27168 0.13 2536 1 OpenWith
I suspect that some mapping might be missing, even though assoc shows that the .txt extension is associated with notepad.exe:
assoc .txt
.txt=txtfile
ftype txtfile
txtfile=%SystemRoot%\system32\NOTEPAD.EXE %1
What might be the problem here? Am I missing some registry value?

Grep not parsing the whole file

I want to use grep to pick out the lines not containing "WAT" from a file with 425409 lines, 26.8 MB in size, UTF-8 encoded.
The file looks like this
>ATOM 1 N ALA 1 9.979 -15.619 28.204 1.00 0.00
>ATOM 2 H1 ALA 1 9.594 -15.053 28.938 1.00 0.00
>ATOM 3 H2 ALA 1 9.558 -15.358 27.323 1.00 0.00
>ATOM 12 O ALA 1 7.428 -16.246 28.335 1.00 0.00
>ATOM 13 N HID 2 7.563 -18.429 28.562 1.00 0.00
>ATOM 14 H HID 2 6.557 -18.369 28.638 1.00 0.00
>ATOM 15 CA HID 2 8.082 -19.800 28.535 1.00 0.00
>ATOM 24 HE1 HID 2 8.603 -23.670 33.041 1.00 0.00
>ATOM 25 NE2 HID 2 8.012 -23.749 30.962 1.00 0.00
>ATOM 29 O HID 2 5.854 -20.687 28.537 1.00 0.00
>ATOM 30 N GLN 3 7.209 -21.407 26.887 1.00 0.00
>ATOM 31 H GLN 3 8.168 -21.419 26.566 1.00 0.00
>ATOM 32 CA GLN 3 6.271 -22.274 26.157 1.00 0.00
**16443 lines**
>ATOM 16425 C116 PA 1089 -34.635 6.968 -0.185 1.00 0.00
>ATOM 16426 H16R PA 1089 -35.669 7.267 -0.368 1.00 0.00
>ATOM 16427 H16S PA 1089 -34.579 5.878 -0.218 1.00 0.00
>ATOM 16428 H16T PA 1089 -34.016 7.366 -0.990 1.00 0.00
>ATOM 16429 C115 PA 1089 -34.144 7.493 1.177 1.00 0.00
>ATOM 16430 H15R PA 1089 -33.101 7.198 1.305 1.00 0.00
>ATOM 16431 H15S PA 1089 -34.179 8.585 1.197 1.00 0.00
>ATOM 16432 C114 PA 1089 -34.971 6.910 2.342 1.00 0.00
>ATOM 16433 H14R PA 1089 -35.147 5.847 2.166 1.00 0.00
**132284 lines**
>ATOM 60981 O WAT 7952 -46.056 -5.515 -56.245 1.00 0.00
>ATOM 60982 H1 WAT 7952 -45.185 -5.238 -56.602 1.00 0.00
>ATOM 60983 H2 WAT 7952 -46.081 -6.445 -56.561 1.00 0.00
>TER
>ATOM 60984 O WAT 7953 -51.005 -3.205 -46.712 1.00 0.00
>ATOM 60985 H1 WAT 7953 -51.172 -3.159 -47.682 1.00 0.00
>ATOM 60986 H2 WAT 7953 -51.051 -4.177 -46.579 1.00 0.00
>TER
>ATOM 60987 O WAT 7954 -49.804 -0.759 -49.284 1.00 0.00
>ATOM 60988 H1 WAT 7954 -48.962 -0.677 -49.785 1.00 0.00
>ATOM 60989 H2 WAT 7954 -49.868 0.138 -48.903 1.00 0.00
**many lines until the end**
>TER
>END
I have used grep -v 'WAT' file.txt, but it only returned the first 16179 lines not containing "WAT", and I can see that there are more lines without "WAT". For instance, the following line (and many others) does not appear in the output:
> ATOM 16425 C116 PA 1089 -34.635 6.968 -0.185 1.00 0.00
In order to figure out what was happening, I tried grep ' ' file.txt. This command should return every line in the file, but it also only returned the first 16179 lines.
I have also tried tail -408977 file.txt | grep ' ' and it returned all the lines selected by tail. Then I tried tail -408978 file.txt | grep ' ' and the output was totally empty, zero lines.
I am working on a "normal" 64-bit system, Kubuntu.
Thanks a lot for the help!
When I try, I get:
$: grep WAT file.txt
Binary file file.txt matches
grep is treating it as a binary file. Add -a:
-a, --text equivalent to --binary-files=text
$: grep -a WAT file.txt|head -3
ATOM 29305 O WAT 4060 -75.787 -79.125 25.925 1.00 0.00 O
ATOM 29306 H1 WAT 4060 -76.191 -78.230 25.936 1.00 0.00 H
ATOM 29307 H2 WAT 4060 -76.556 -79.670 25.684 1.00 0.00 H
Your file has two NUL bytes at the end of each of lines 16426, 16428, 16430, and 16432.
$: tr '\0' '#' < file.txt | grep -n '#'
16426:ATOM 16421 KA CAL 1085 -20.614 -22.960 18.641 1.00 0.00 ##
16428:ATOM 16422 KA CAL 1086 20.249 21.546 19.443 1.00 0.00 ##
16430:ATOM 16423 KA CAL 1087 22.695 -19.700 19.624 1.00 0.00 ##
16432:ATOM 16424 KA CAL 1088 -22.147 19.317 17.966 1.00 0.00 ##
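If you would rather clean the file than pass -a every time, stripping the NUL bytes is one option (a sketch; file_clean.txt is just a placeholder name):
# delete NUL bytes so grep no longer treats the file as binary
tr -d '\0' < file.txt > file_clean.txt
grep -v 'WAT' file_clean.txt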

Discrepancy between classification report and confusion matrix

Maybe I'm reading the classification report or the confusion matrix wrong (or both!), but after training my classifier and running it on my test set, I get the following report:
             precision    recall  f1-score   support

          0       0.71      0.67      0.69      5086
          1       0.64      0.54      0.59      2244
          2       0.42      0.25      0.31       598
          3       0.65      0.22      0.33       262
          4       0.53      0.42      0.47       266
          5       0.42      0.15      0.22       466
          6       0.35      0.25      0.29       227
          7       0.07      0.05      0.06       127
          8       0.39      0.14      0.21       376
          9       0.35      0.25      0.29       167
         10       0.25      0.14      0.18       229

avg / total       0.61      0.52      0.55     10048
Which is good and all, but when I create my confusion matrix:
0 1 2 3 4 5 6 7 8 9 10
[[4288 428 80 16 44 58 33 38 47 21 33]
[ 855 1218 54 8 12 17 25 15 15 12 13]
[ 291 72 147 1 12 10 20 2 2 17 24]
[ 173 21 3 57 1 3 0 1 1 1 1]
[ 102 20 4 0 113 0 0 6 4 9 8]
[ 331 40 10 3 7 68 3 0 2 1 1]
[ 104 30 17 0 1 0 56 2 1 10 6]
[ 85 19 4 2 5 0 2 6 4 0 0]
[ 270 29 4 1 6 2 2 7 53 1 1]
[ 63 17 11 0 8 3 14 1 1 42 7]
[ 138 13 19 0 5 2 7 3 6 5 31]]
Am I wrong in assuming that it has correctly predicted 4288 samples of class 0 out of a total of 5086, which should result in a recall of 84.3% (0.843)? That's not the number the report spits out. The precision seems wrong as well, unless I'm miscalculating when I divide the correct predictions (4288) by the sum of the rest of column 0, which gives me 0.563, not 0.71.
What am I misunderstanding?
It might be worth noting that I use sklearn's classification_report and confusion_matrix for these.
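As a sanity check, here is a minimal sketch (y_test and y_pred stand in for whatever arrays you passed to classification_report) showing how the per-class numbers relate to the matrix, where rows are true labels and columns are predictions:
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# y_test, y_pred: placeholders for the same arrays used to build the report
cm = confusion_matrix(y_test, y_pred)

recall_per_class = np.diag(cm) / cm.sum(axis=1)     # correct / actual support (row sums)
precision_per_class = np.diag(cm) / cm.sum(axis=0)  # correct / predicted count (column sums)

print(recall_per_class)
print(precision_per_class)
print(classification_report(y_test, y_pred))
If the values derived from the matrix do not match the report, the two were most likely computed from different prediction arrays.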

How to combine columns from two tables in Oracle?

Table 1:
CURRENCY CODE ER GUARANTOR ID G AMOUNT
USD 1.2986 117 750
AED 4.76976 117 5750
ZAR 11.4717 117 234
INR 70.676 117 1243
AMD 526.5823 117 500000
EUR 1 117 12435
ALL 139.63197 117 2000000
EUR 1 173 200000
EUR 1 217 20000000
INR 70.676 26 100000
AED 4.76976 43 1000000
EUR 1 53 10000
Table 2:
CURRENCY CODE ER GUARANTOR ID F AMOUNT
USD 1.2986 117 450
AED 4.76976 117 7900
INR 70.676 117 2237.4
ZAR 11.4717 117 140.4
AMD 526.5823 117 500000
EUR 1 117 6961
ALL 139.63197 117 2000000
EUR 1 173 20000
EUR 1 217 14000000
INR 70.676 26 300000
AED 4.76976 43 2000000
EUR 1 53 10000
Result:
CURRENCY CODE ER GUARANTOR ID G AMOUNT F AMOUNT
USD 1.2986 117 750 450
AED 4.76976 117 5750 7900
ZAR 11.4717 117 234 2237.4
INR 70.676 117 1243 140.4
AMD 526.5823 117 500000 500000
EUR 1 117 12435
ALL 139.63197 117 2000000
EUR 1 173 200000
EUR 1 217 20000000
INR 70.676 26 100000
AED 4.76976 43 1000000
EUR 1 53 10000
I want to combine both tables: I need all the columns from Table 1 plus the F AMOUNT column from Table 2. How do I achieve this?
Thanks in advance.
Use the query below. This assumes the two tables match on currency code and guarantor id, as they do in your sample; quote the identifiers if the column names really contain spaces:
select t1."CURRENCY CODE",
       t1.ER,
       t1."GUARANTOR ID",
       t1."G AMOUNT",
       t2."F AMOUNT"
from   table1 t1
join   table2 t2
  on   t1."CURRENCY CODE" = t2."CURRENCY CODE"
 and   t1."GUARANTOR ID"  = t2."GUARANTOR ID";
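If, as the blank F AMOUNT cells in the posted result suggest, some Table 1 rows can lack a match in Table 2, a left outer join keeps them (a sketch under the same column-name assumptions as above):
-- keeps every Table 1 row; F AMOUNT is NULL where Table 2 has no match
select t1.*, t2."F AMOUNT"
from   table1 t1
left join table2 t2
  on   t1."CURRENCY CODE" = t2."CURRENCY CODE"
 and   t1."GUARANTOR ID"  = t2."GUARANTOR ID";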
