Below is a select query from a Hive table:
select * from test_aviation limit 5;
OK
2015 1 1 1 4 2015-01-01 AA 19805 AA N787AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 06 California 91 0900 0855 -5.00 0.00 0.00 -1 0900-0959 17.00 0912 1230 7.00 1230 1237 7.00 7.00 0.00 0 1200-1259 0.00 0.00 390.00 402.00 378.00 1.00 2475.00 10
2015 1 1 2 5 2015-01-02 AA 19805 AA N795AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 06 California 91 0900 0850 -10.00 0.00 0.00 -1 0900-0959 15.00 0905 1202 9.00 1230 1211 -19.00 0.00 0.00 -2 1200-1259 0.00 0.00 390.00 381.00 357.00 1.00 2475.00 10
2015 1 1 3 6 2015-01-03 AA 19805 AA N788AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 06 California 91 0900 0853 -7.00 0.00 0.00 -1 0900-0959 15.00 0908 1138 13.00 1230 1151 -39.00 0.00 0.00 -2 1200-1259 0.00 0.00 390.00 358.00 330.00 1.00 2475.00 10
2015 1 1 4 7 2015-01-04 AA 19805 AA N791AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 06 California 91 0900 0853 -7.00 0.00 0.00 -1 0900-0959 14.00 0907 1159 19.00 1230 1218 -12.00 0.00 0.00 -1 1200-1259 0.00 0.00 390.00 385.00 352.00 1.00 2475.00 10
2015 1 1 5 1 2015-01-05 AA 19805 AA N783AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 06 California 91 0900 0853 -7.00 0.00 0.00 -1 0900-0959 27.00 0920 1158 24.00 1230 1222 -8.00 0.00 0.00 -1 1200-1259 0.00 0.00 390.00 389.00 338.00 1.00 2475.00 10
Time taken: 0.067 seconds, Fetched: 5 row(s)
Structure of the Hive table:
hive> describe test_aviation;
OK
col_value string
Time taken: 0.221 seconds, Fetched: 1 row(s)
I want to segregate the entire table into different columns. I have written the query below to extract the 12th column:
SELECT regexp_extract(col_value, '^(?:([^,]*)\,?){1}', 12) from test_aviation;
Output:
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1437067221195_0008, Tracking URL = http://localhost:8088/proxy/application_1437067221195_0008/
Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1437067221195_0008
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2015-07-17 02:46:56,215 Stage-1 map = 0%, reduce = 0%
2015-07-17 02:47:27,650 Stage-1 map = 100%, reduce = 0%
Ended Job = job_1437067221195_0008 with errors
Error during job, obtaining debugging information...
Job Tracking URL: http://localhost:8088/proxy/application_1437067221195_0008/
Examining task ID: task_1437067221195_0008_m_000000 (and more) from job job_1437067221195_0008
Task with the most failures(4):
-----
Task ID:
task_1437067221195_0008_m_000000
URL:
http://localhost:8088/taskdetails.jsp?jobid=job_1437067221195_0008&tipid=task_1437067221195_0008_m_000000
-----
Diagnostic Messages for this Task:
Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"col_value":"2015\t1\t1\t1\t4\t2015-01-01\tAA\t19805\tAA\tN787AA\t1\tJFK\tNew York\t NY\tNY\t36\tNew York\t22\tLAX\tLos Angeles\t CA\tCA\t06\tCalifornia\t91\t0900\t0855\t-5.00\t0.00\t0.00\t-1\t0900-0959\t17.00\t0912\t1230\t7.00\t1230\t1237\t7.00\t7.00\t0.00\t0\t1200-1259\t0.00\t\t0.00\t390.00\t402.00\t378.00\t1.00\t2475.00\t10\t\t\t"}
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:195)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:450)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:415)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row {"col_value":"2015\t1\t1\t1\t4\t2015-01-01\tAA\t19805\tAA\tN787AA\t1\tJFK\tNew York\t NY\tNY\t36\tNew York\t22\tLAX\tLos Angeles\t CA\tCA\t06\tCalifornia\t91\t0900\t0855\t-5.00\t0.00\t0.00\t-1\t0900-0959\t17.00\t0912\t1230\t7.00\t1230\t1237\t7.00\t7.00\t0.00\t0\t1200-1259\t0.00\t\t0.00\t390.00\t402.00\t378.00\t1.00\t2475.00\t10\t\t\t"}
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:550)
at org.apache.hadoop.hive.ql.exec.mr.ExecMapper.map(ExecMapper.java:177)
... 8 more
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to execute method public java.lang.String org.apache.hadoop.hive.ql.udf.UDFRegExpExtract.evaluate(java.lang.String,java.lang.String,java.lang.Integer) on object org.apache.hadoop.hive.ql.udf.UDFRegExpExtract#4def4616 of class org.apache.hadoop.hive.ql.udf.UDFRegExpExtract with arguments {2015 1 1 1 4 2015-01-01 AA 19805 AA N787AA 1 JFK New York NY NY 36 New York 22 LAX Los Angeles CA CA 06 California 91 0900 0855 -5.00 0.00 0.00 -1 0900-0959 17.00 0912 1230 7.00 1230 1237 7.00 7.00 0.00 0 1200-1259 0.00 0.00 390.00 402.00 378.00 1.00 2475.00 10 :java.lang.String, ^(?:([^,]*),?){1}:java.lang.String, 12:java.lang.Integer} of size 3
at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:1243)
at org.apache.hadoop.hive.ql.udf.generic.GenericUDFBridge.evaluate(GenericUDFBridge.java:182)
at org.apache.hadoop.hive.ql.exec.ExprNodeGenericFuncEvaluator._evaluate(ExprNodeGenericFuncEvaluator.java:166)
at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:77)
at org.apache.hadoop.hive.ql.exec.ExprNodeEvaluator.evaluate(ExprNodeEvaluator.java:65)
at org.apache.hadoop.hive.ql.exec.SelectOperator.processOp(SelectOperator.java:79)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:793)
at org.apache.hadoop.hive.ql.exec.TableScanOperator.processOp(TableScanOperator.java:92)
at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:793)
at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:540)
... 9 more
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.hive.ql.exec.FunctionRegistry.invoke(FunctionRegistry.java:1219)
... 18 more
Caused by: java.lang.IndexOutOfBoundsException: No group 12
at java.util.regex.Matcher.group(Matcher.java:487)
at org.apache.hadoop.hive.ql.udf.UDFRegExpExtract.evaluate(UDFRegExpExtract.java:56)
... 23 more
Please help me extract the different columns from a Hive table.
Try this:
select split(col_value, ' ')[11] as column_12 from test_aviation;
This assumes space delimiters. Use '\\t' for tab, '\\|' for pipe, ':' for colon, and so on.
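The failure and the fix can be sketched in Python, which has the same regex group semantics as Java's: the original pattern defines only one capture group, so asking for group 12 raises the same "no group 12" error seen in the Hive log, while splitting on the real delimiter (tab, per the error message) and indexing from 0 retrieves column 12 at index 11.

```python
import re

# First 12 tab-separated fields of the sample row from the error log
row = "2015\t1\t1\t1\t4\t2015-01-01\tAA\t19805\tAA\tN787AA\t1\tJFK"

# The pattern has a single capture group, so group 12 does not exist --
# the analogue of Hive's IndexOutOfBoundsException: No group 12.
m = re.match(r'^(?:([^,]*),?){1}', row)
try:
    m.group(12)
except IndexError as e:
    print(e)

# What the split() answer does: column 12 is index 11 after splitting
# on the actual delimiter.
print(row.split('\t')[11])  # JFK
```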
Related
I have 7,000 files (sade1.pdbqt ... sade7200.pdbqt). Only some of the files contain a second (or later) occurrence of the keyword TORSDOF. For a given file, if there is a second occurrence of TORSDOF, I want to remove all lines following the first occurrence, while preserving the file name. Can somebody please provide a sample snippet? Thank you.
$ cat FileWith2ndOccurance.txt
ashu
vishu
jyoti
TORSDOF
Jatin
Vishal
Shivani
TORSDOF
Sushil
Kiran
After the script runs:
$ cat FileWith2ndOccurance.txt
ashu
vishu
jyoti
TORSDOF
EDIT1: Actual file content:
REMARK Name = 17-DMAG.cdx
REMARK 8 active torsions:
REMARK status: ('A' for Active; 'I' for Inactive)
REMARK 1 A between atoms: C_1 and N_8
REMARK 2 A between atoms: N_8 and C_9
REMARK 3 A between atoms: C_9 and C_10
REMARK 4 A between atoms: C_10 and N_11
REMARK 5 A between atoms: C_15 and O_17
REMARK 6 A between atoms: C_25 and O_28
REMARK 7 A between atoms: C_27 and O_33
REMARK 8 A between atoms: O_28 and C_29
REMARK x y z vdW Elec q Type
REMARK _______ _______ _______ _____ _____ ______ ____
ROOT
ATOM 1 C UNL 1 7.579 11.905 0.000 0.00 0.00 +0.000 C
ATOM 2 C UNL 1 7.579 10.500 0.000 0.00 0.00 +0.000 C
ATOM 30 O UNL 1 8.796 8.398 0.000 0.00 0.00 +0.000 OA
ENDROOT
BRANCH 21 31
ATOM 31 O UNL 1 13.701 7.068 0.000 0.00 0.00 +0.000 OA
ATOM 32 C UNL 1 12.306 6.953 0.000 0.00 0.00 +0.000 C
ENDBRANCH 41 42
ENDBRANCH 19 41
TORSDOF 8
REMARK Name = 17-DMAG.cdx
REMARK 8 active torsions:
REMARK status: ('A' for Active; 'I' for Inactive)
REMARK 1 A between atoms: C_1 and N_8
REMARK 2 A between atoms: N_8 and C_9
REMARK x y z vdW Elec q Type
REMARK _______ _______ _______ _____ _____ ______ ____
ROOT
ATOM 1 CL UNL 1 0.000 11.656 0.000 0.00 0.00 +0.000 Cl
ENDROOT
TORSDOF 0
What I would do:
#!/bin/bash
for file in sade*.pdbqt; do
    # Count lines starting with TORSDOF
    count=$(grep -c '^TORSDOF' "$file")
    if ((count > 1)); then
        # Print every line up to and including the first TORSDOF, then stop
        # (the first rule prints the TORSDOF line and exits; the bare `1`
        # prints every preceding line), then replace the original file
        awk '/^TORSDOF/{print;exit}1' "$file" > /tmp/.torsdof &&
        mv /tmp/.torsdof "$file"
    fi
done
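The same truncation logic can be sketched in Python (an illustrative helper, not part of the original script): keep lines up to and including the first TORSDOF line, but only when the keyword occurs more than once.

```python
def truncate_after_first(text, keyword="TORSDOF"):
    """If `keyword` starts more than one line, keep only the lines up
    to and including its first occurrence; otherwise return unchanged."""
    lines = text.splitlines()
    hits = [i for i, line in enumerate(lines) if line.startswith(keyword)]
    if len(hits) > 1:
        lines = lines[:hits[0] + 1]
    return "\n".join(lines)

sample = "ashu\nvishu\njyoti\nTORSDOF\nJatin\nShivani\nTORSDOF\nSushil"
print(truncate_after_first(sample))  # ends at the first TORSDOF
```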
I have a simple Docker container that is based on a Windows image:
FROM mcr.microsoft.com/windows:1903
WORKDIR /app1/
ENTRYPOINT powershell.exe
I run it interactively, using:
docker run -it -v c:\app1:c:\app1 test-image:1.0
There is a file called 1.txt inside the app1 folder.
When I run:
.\app1\1.txt
I see no notepad.exe process; instead I can spot OpenWith processes:
Handles NPM(K) PM(K) WS(K) CPU(s) Id SI ProcessName
------- ------ ----- ----- ------ -- -- -----------
78 5 1056 4452 0.02 1940 1 CExecSvc
74 5 5360 3792 0.02 1904 1 cmd
81 5 904 1364 0.00 1844 1 CompatTelRunner
156 10 6532 6088 0.00 1728 1 conhost
97 7 1196 4980 0.05 1896 1 conhost
286 13 1836 4976 0.27 984 1 csrss
37 6 1348 3356 0.06 524 1 fontdrvhost
0 0 60 8 0 0 Idle
831 22 4748 13844 0.17 460 1 lsass
546 25 13156 28920 0.17 1952 1 OfficeClickToRun
420 24 7400 28844 0.13 2472 1 OpenWith
376 22 6732 27168 0.13 2536 1 OpenWith
I suspect that some mapping might be missing, even though assoc shows that .txt files are associated with notepad.exe:
assoc .txt
.txt=txtfile
ftype txtfile
txtfile=%SystemRoot%\system32\NOTEPAD.EXE %1
What might be the problem here? Am I missing some registry value?
I want to use grep to pick lines not containing "WAT" from a file with 425409 lines, a size of 26.8 MB, and UTF-8 encoding.
The file looks like this:
>ATOM 1 N ALA 1 9.979 -15.619 28.204 1.00 0.00
>ATOM 2 H1 ALA 1 9.594 -15.053 28.938 1.00 0.00
>ATOM 3 H2 ALA 1 9.558 -15.358 27.323 1.00 0.00
>ATOM 12 O ALA 1 7.428 -16.246 28.335 1.00 0.00
>ATOM 13 N HID 2 7.563 -18.429 28.562 1.00 0.00
>ATOM 14 H HID 2 6.557 -18.369 28.638 1.00 0.00
>ATOM 15 CA HID 2 8.082 -19.800 28.535 1.00 0.00
>ATOM 24 HE1 HID 2 8.603 -23.670 33.041 1.00 0.00
>ATOM 25 NE2 HID 2 8.012 -23.749 30.962 1.00 0.00
>ATOM 29 O HID 2 5.854 -20.687 28.537 1.00 0.00
>ATOM 30 N GLN 3 7.209 -21.407 26.887 1.00 0.00
>ATOM 31 H GLN 3 8.168 -21.419 26.566 1.00 0.00
>ATOM 32 CA GLN 3 6.271 -22.274 26.157 1.00 0.00
**16443 lines**
>ATOM 16425 C116 PA 1089 -34.635 6.968 -0.185 1.00 0.00
>ATOM 16426 H16R PA 1089 -35.669 7.267 -0.368 1.00 0.00
>ATOM 16427 H16S PA 1089 -34.579 5.878 -0.218 1.00 0.00
>ATOM 16428 H16T PA 1089 -34.016 7.366 -0.990 1.00 0.00
>ATOM 16429 C115 PA 1089 -34.144 7.493 1.177 1.00 0.00
>ATOM 16430 H15R PA 1089 -33.101 7.198 1.305 1.00 0.00
>ATOM 16431 H15S PA 1089 -34.179 8.585 1.197 1.00 0.00
>ATOM 16432 C114 PA 1089 -34.971 6.910 2.342 1.00 0.00
>ATOM 16433 H14R PA 1089 -35.147 5.847 2.166 1.00 0.00
**132284 lines**
>ATOM 60981 O WAT 7952 -46.056 -5.515 -56.245 1.00 0.00
>ATOM 60982 H1 WAT 7952 -45.185 -5.238 -56.602 1.00 0.00
>ATOM 60983 H2 WAT 7952 -46.081 -6.445 -56.561 1.00 0.00
>TER
>ATOM 60984 O WAT 7953 -51.005 -3.205 -46.712 1.00 0.00
>ATOM 60985 H1 WAT 7953 -51.172 -3.159 -47.682 1.00 0.00
>ATOM 60986 H2 WAT 7953 -51.051 -4.177 -46.579 1.00 0.00
>TER
>ATOM 60987 O WAT 7954 -49.804 -0.759 -49.284 1.00 0.00
>ATOM 60988 H1 WAT 7954 -48.962 -0.677 -49.785 1.00 0.00
>ATOM 60989 H2 WAT 7954 -49.868 0.138 -48.903 1.00 0.00
**many lines until the end**
>TER
>END
I have used grep -v 'WAT' file.txt, but it only returned the first 16179 lines not containing "WAT", and I can see that there are more lines not containing "WAT". For instance, the following line (and many others) does not appear in the output:
> ATOM 16425 C116 PA 1089 -34.635 6.968 -0.185 1.00 0.00
In order to figure out what was happening I tried grep ' ' file.txt. This command should return every line in the file, but it only returned the first 16179 lines too.
I've also tried tail -408977 file.txt | grep ' ', and it returned all the lines recalled by tail. Then I tried tail -408978 file.txt | grep ' ', and the output was totally empty, zero lines.
I am working on a "normal" 64-bit system, Kubuntu.
Thanks a lot for the help!
When I try I get
$: grep WAT file.txt
Binary file file.txt matches
grep assumes it's a binary file; add -a:
-a, --text equivalent to --binary-files=text
$: grep -a WAT file.txt|head -3
ATOM 29305 O WAT 4060 -75.787 -79.125 25.925 1.00 0.00 O
ATOM 29306 H1 WAT 4060 -76.191 -78.230 25.936 1.00 0.00 H
ATOM 29307 H2 WAT 4060 -76.556 -79.670 25.684 1.00 0.00 H
Your file has two NUL bytes at the end of each of lines 16426, 16428, 16430, and 16432, and those NULs are what make grep treat the file as binary:
$: tr '\0' '#' <file.txt | grep -n '#'
16426:ATOM 16421 KA CAL 1085 -20.614 -22.960 18.641 1.00 0.00 ##
16428:ATOM 16422 KA CAL 1086 20.249 21.546 19.443 1.00 0.00 ##
16430:ATOM 16423 KA CAL 1087 22.695 -19.700 19.624 1.00 0.00 ##
16432:ATOM 16424 KA CAL 1088 -22.147 19.317 17.966 1.00 0.00 ##
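The same diagnosis can be sketched in Python (a hypothetical helper for illustration): scan the raw bytes and report which lines contain a NUL, the byte that triggers grep's binary-file heuristic.

```python
def nul_lines(data: bytes):
    """Return the 1-based line numbers whose content contains a NUL
    byte -- the bytes that make grep treat a file as binary."""
    return [i for i, line in enumerate(data.split(b"\n"), start=1)
            if b"\x00" in line]

# Tiny stand-in for the PDB-style file: line 2 ends in two NULs
sample = b"ATOM 1\nATOM 2\x00\x00\nATOM 3\n"
print(nul_lines(sample))  # [2]
```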
Maybe I'm reading the classification report or the confusion matrix wrong (or both!), but after having trained my classifier and run on it my test set, I get the following report:
precision recall f1-score support
0 0.71 0.67 0.69 5086
1 0.64 0.54 0.59 2244
2 0.42 0.25 0.31 598
3 0.65 0.22 0.33 262
4 0.53 0.42 0.47 266
5 0.42 0.15 0.22 466
6 0.35 0.25 0.29 227
7 0.07 0.05 0.06 127
8 0.39 0.14 0.21 376
9 0.35 0.25 0.29 167
10 0.25 0.14 0.18 229
avg / total 0.61 0.52 0.55 10048
Which is good and all, but when I create my confusion matrix:
0 1 2 3 4 5 6 7 8 9 10
[[4288 428 80 16 44 58 33 38 47 21 33]
[ 855 1218 54 8 12 17 25 15 15 12 13]
[ 291 72 147 1 12 10 20 2 2 17 24]
[ 173 21 3 57 1 3 0 1 1 1 1]
[ 102 20 4 0 113 0 0 6 4 9 8]
[ 331 40 10 3 7 68 3 0 2 1 1]
[ 104 30 17 0 1 0 56 2 1 10 6]
[ 85 19 4 2 5 0 2 6 4 0 0]
[ 270 29 4 1 6 2 2 7 53 1 1]
[ 63 17 11 0 8 3 14 1 1 42 7]
[ 138 13 19 0 5 2 7 3 6 5 31]]
Am I wrong in assuming that it has predicted 4288 samples of class label 0 correctly out of a total of 5086, which should result in a recall value of 84.3% (0.843)? But that's not the number the report spits out. The precision seems wrong as well, unless I'm mistaken when I calculate the percentage of correct predictions (4288) against the sum of the rest of column 0, which results in 0.563, and not 0.71.
What am I misunderstanding?
It might be worth noting that I use sklearn's classification_report and confusion_matrix for these.
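The arithmetic in the question can be checked directly from the matrix, assuming sklearn's convention (rows are true labels, columns are predicted labels): per-class recall is the diagonal divided by the row sums, and per-class precision is the diagonal divided by the column sums. If these values disagree with the printed report, one plausible explanation (an assumption, not verified here) is that the report and the matrix were computed from different prediction arrays or label orderings.

```python
import numpy as np

# Confusion matrix from the question: rows = true labels, columns = predicted
cm = np.array([
    [4288,  428,   80,  16,  44,  58,  33,  38,  47,  21,  33],
    [ 855, 1218,   54,   8,  12,  17,  25,  15,  15,  12,  13],
    [ 291,   72,  147,   1,  12,  10,  20,   2,   2,  17,  24],
    [ 173,   21,    3,  57,   1,   3,   0,   1,   1,   1,   1],
    [ 102,   20,    4,   0, 113,   0,   0,   6,   4,   9,   8],
    [ 331,   40,   10,   3,   7,  68,   3,   0,   2,   1,   1],
    [ 104,   30,   17,   0,   1,   0,  56,   2,   1,  10,   6],
    [  85,   19,    4,   2,   5,   0,   2,   6,   4,   0,   0],
    [ 270,   29,    4,   1,   6,   2,   2,   7,  53,   1,   1],
    [  63,   17,   11,   0,   8,   3,  14,   1,   1,  42,   7],
    [ 138,   13,   19,   0,   5,   2,   7,   3,   6,   5,  31],
])

recall = np.diag(cm) / cm.sum(axis=1)      # true positives / row total (support)
precision = np.diag(cm) / cm.sum(axis=0)   # true positives / column total
print(round(recall[0], 3))     # 0.843
print(round(precision[0], 3))  # 0.64
```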
Table 1:
CURRENCY CODE ER GUARANTOR ID G AMOUNT
USD 1.2986 117 750
AED 4.76976 117 5750
ZAR 11.4717 117 234
INR 70.676 117 1243
AMD 526.5823 117 500000
EUR 1 117 12435
ALL 139.63197 117 2000000
EUR 1 173 200000
EUR 1 217 20000000
INR 70.676 26 100000
AED 4.76976 43 1000000
EUR 1 53 10000
Table 2:
CURRENCY CODE ER GUARANTOR ID F AMOUNT
USD 1.2986 117 450
AED 4.76976 117 7900
INR 70.676 117 2237.4
ZAR 11.4717 117 140.4
AMD 526.5823 117 500000
EUR 1 117 6961
ALL 139.63197 117 2000000
EUR 1 173 20000
EUR 1 217 14000000
INR 70.676 26 300000
AED 4.76976 43 2000000
EUR 1 53 10000
Result:
CURRENCY CODE ER GUARANTOR ID G AMOUNT F AMOUNT
USD 1.2986 117 750 450
AED 4.76976 117 5750 7900
ZAR 11.4717 117 234 2237.4
INR 70.676 117 1243 140.4
AMD 526.5823 117 500000 500000
EUR 1 117 12435 6961
ALL 139.63197 117 2000000 2000000
EUR 1 173 200000 20000
EUR 1 217 20000000 14000000
INR 70.676 26 100000 300000
AED 4.76976 43 1000000 2000000
EUR 1 53 10000 10000
I want to combine both tables: I need all the columns from table 1 plus the F AMOUNT column from table 2. How can I achieve this?
Thanks in advance.
Use the query below. Note that identifiers containing spaces must be renamed or quoted, and the join should include the guarantor ID as well, since a currency code alone is not unique across rows:
select t1.currency_code
     , t1.er
     , t1.guarantor_id
     , t1.g_amount
     , t2.f_amount
  from table1 t1
  join table2 t2
    on t1.currency_code = t2.currency_code
   and t1.guarantor_id = t2.guarantor_id
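A minimal runnable sketch with Python's sqlite3 (the underscored column names and the sample rows are assumptions for illustration; identifiers containing spaces would need quoting):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE table1 (currency_code TEXT, er REAL,"
            " guarantor_id INTEGER, g_amount REAL)")
cur.execute("CREATE TABLE table2 (currency_code TEXT, er REAL,"
            " guarantor_id INTEGER, f_amount REAL)")
cur.executemany("INSERT INTO table1 VALUES (?,?,?,?)",
    [("USD", 1.2986, 117, 750), ("EUR", 1, 173, 200000), ("EUR", 1, 53, 10000)])
cur.executemany("INSERT INTO table2 VALUES (?,?,?,?)",
    [("USD", 1.2986, 117, 450), ("EUR", 1, 173, 20000), ("EUR", 1, 53, 10000)])

# Join on currency_code AND guarantor_id; joining on the currency alone
# would cross-match the repeated EUR rows and duplicate output.
rows = cur.execute("""
    SELECT t1.currency_code, t1.er, t1.guarantor_id, t1.g_amount, t2.f_amount
      FROM table1 t1
      JOIN table2 t2
        ON t1.currency_code = t2.currency_code
       AND t1.guarantor_id = t2.guarantor_id
""").fetchall()
print(rows)  # one row per table1 row, with f_amount appended
```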