Store value of impala results in a variable in linux - bash

I have a requirement to retrieve 239, 631, etc. from the output below and store them in a variable in Linux -- this is the output of an Impala query.
+-----------------+
| organization_id |
+-----------------+
| 239 |
| 631 |
| 632 |
| 633 |
+-----------------+
Below is the command I am running:
x=$(impala-shell -q "${ORG_ID}" -ki "${impalaserver}");
How can I do this?

Could you please try the following. This removes duplicates across the whole input file (once an id has appeared in one organization_id stanza, it will NOT be printed again in any later stanza):
your_command | awk -v s1="'" 'BEGIN{OFS=","} /---/{flag=""} /organization_id/{flag=1;getline;next} flag && !a[$2]++{val=val?val OFS s1 $2 s1:s1 $2 s1} END{print val}'
In case you need to print the ids per stanza (so an id that appears in one organization_id stanza can appear again in another), then try the following:
your_command | awk -v s1="'" 'BEGIN{OFS=","} /---/{print val;val=flag="";delete a} /organization_id/{flag=1;getline;next} flag && !a[$2]++{val=val?val OFS s1 $2 s1:s1 $2 s1} END{if(val){print val}}'
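For instance, run against the sample table in the question, the first one-liner should collect each id once and print a single quoted, comma-separated list:
'239','631','632','633'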

What about this:
x=$(impala-shell -B -q "${ORG_ID}" -ki "${impalaserver}")
I just added the -B option, which removes the pretty-printing and the header.
If you want comma-separated values, you can pipe the result to tr:
echo $x | tr ' ' ','
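If you also want that comma-separated list stored in a variable, the same idea works inside a command substitution (a sketch; the unquoted $x deliberately relies on word splitting to turn newlines into spaces):
ids=$(echo $x | tr ' ' ',')
echo "$ids"    # e.g. 239,631,632,633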

Related

Bash Shell: How do I sort by values on last column, but ignoring the header of a file?

file
ID First_Name Last_Name(s) Average_Winter_Grade
323 Popa Arianna 10
317 Tabarcea Andreea 5.24
326 Balan Ionut 9.935
327 Balan Tudor-Emanuel 8.4
329 Lungu Iulian-Gabriel 7.78
365 Brailean Mircea 7.615
365 Popescu Anca-Maria 7.38
398 Acatrinei Andrei 8
How do I sort it by the last column, except for the header?
This is what file should look like after the changes:
ID First_Name Last_Name(s) Average_Winter_Grade
323 Popa Arianna 10
326 Balan Ionut 9.935
327 Balan Tudor-Emanuel 8.4
398 Acatrinei Andrei 8
329 Lungu Iulian-Gabriel 7.78
365 Brailean Mircea 7.615
365 Popescu Anca-Maria 7.38
317 Tabarcea Andreea 5.24
If it's always the 4th column:
head -n 1 file; tail -n +2 file | sort -n -r -k 4,4
If all you know is that it's the last column:
head -n 1 file; tail -n +2 file | awk '{print $NF,$0}' | sort -n -r | cut -f2- -d' '
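Note that each of these is two separate commands, so to redirect the combined output to a file you need to group them, e.g. (a sketch):
( head -n 1 file; tail -n +2 file | sort -n -r -k 4,4 ) > sorted_file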
You'd like to just sort by the last column, but sort doesn't allow you to do that easily. So rewrite the data with the column to be sorted at the beginning of each line:
Ignoring the header for the moment (although this will often work by itself):
awk '{print $NF, $0 | "sort -nr" }' input | cut -d ' ' -f 2-
If you do need to keep the header out of the sort (e.g., it's getting mixed in), you can do things like:
< input awk 'NR==1; NR>1 {print $NF, $0 | "sh -c \"sort -nr | cut -d \\\ -f 2-\"" }'
or
awk 'NR==1{ print " ", $0} NR>1 {print $NF, $0 | "sort -nr" }' OFS=\; input | cut -d \; -f 2-

CONCAT columns within a file

I'd like to concatenate column2 through column4.
Example (first.txt):
|ID|column2|column3|column4|
|1 | a | b | c |
|2 | d | e | f |
To this (mynewfile.txt):
ID|column2
1 | a b c
2 | d e f
This is my script in Cygwin:
$ awk '{print $2" "$3" "$4 }' first.txt > mynewfile.txt
Of course, it is not working out well. How do I improve the script?
You need to set the field separator so that a pipe with optional whitespace around it is the field delimiter.
The pipe at the beginning of the line causes an empty field 1 before the pipe, so the ID is field 2, and columns 2-4 are fields 3-5. So it should be:
awk -F' *\\| *' 'NR == 1 {print "ID|column2|"} NR > 1 {printf("%d | %s %s %s |\n", $2, $3, $4, $5)}' first.txt > mynewfile.txt
Not especially general GNU sed method:
sed 's/^[|]//;1s/2.*/2/;1!{s/|/ /2g;s/  */ /2g}' first.txt
Output:
ID|column2
1 | a b c
2 | d e f

awk query with numbers vs. strings

I am writing a function in R that will generate an awk script to pull in rows from a csv according to conditions that a user selected through a UI.
This is an example of the command string generated by the function:
$ tail -n +2 ../data/faults_main_only_dp_1_shopFlag.csv |
> parallel -k -q --block 500M --pipe \
> awk -F , '$5 > "2013-01-01" && $5 < "2015-11-05" && ($3 == "20116688") && ($20 == "Disregard") {print $1 "," $3 "," $17 "," $20 }' |
> head | csvlook
It doesn’t return anything because $3 is a numeric variable. Neither does:
$ tail -n +2 ../data/faults_main_only_dp_1_shopFlag.csv |
> parallel -k -q --block 500M --pipe \
> awk -F , '$5 > "2013-01-01" && $5 < "2015-11-05" && ($3 == 20116688) && ($20 == Disregard) {print $1 "," $3 "," $17 "," $20 }' |
> head | csvlook
… because $20 is a string.
This returns a portion of the dataset:
$ tail -n +2 ../data/faults_main_only_dp_1_shopFlag.csv |
> parallel -k -q --block 500M --pipe \
> awk -F , '$5 > "2013-01-01" && $5 < "2015-11-05" && ($3 == 20116688) && ($20 == "Disregard") {print $1 "," $3 "," $17 "," $20 }' |
> head | csvlook
|---------+------------+------+------------|
| 5058.0 | 20116688.0 | 4162 | Disregard |
|---------+------------+------+------------|
| 5060.0 | 20116688.0 | 3622 | Disregard |
| 5060.0 | 20116688.0 | 3619 | Disregard |
| 5061.0 | 20116688.0 | 766 | Disregard |
| 5059.0 | 20116688.0 | 3603 | Disregard |
| 5055.0 | 20116688.0 | 1013 | Disregard |
| 5058.0 | 20116688.0 | 1012 | Disregard |
| 5055.0 | 20116688.0 | 4163 | Disregard |
| 5060.0 | 20116688.0 | 4225 | Disregard |
| 5061.0 | 20116688.0 | 3466 | Disregard |
|---------+------------+------+——————|
Unfortunately, I don't currently have a way of anticipating which of the variables the user selects through the UI will be numeric and which will be strings (I know how to work that out, but it would take time that I'd rather not spend if there is a workaround). Is there a way to cast each variable to a string before the comparison, or some other way of dealing with this issue?
Edit This is what the raw data look like:
$ csvcut -c15:20 faults_main_only_dp_1_shopFlag.csv | head
faultActiveLongitude,faultActiveAltitude,faultCode,faultSoftwareVersion,stateID,stateName
-0.8100106,-1.0,3604,25.07.01 11367,2.0,Work Item
-0.81860137,840.0,766,25.07.01 11367,5.0,Disregard
-0.8100140690000001,-1.0,4279,25.07.01 11367,2.0,Work Item
-0.8100509640000001,-2.0,4279,25.07.01 11367,2.0,Work Item
-0.8102342,14.0,3604,25.07.01 11367,2.0,Work Item
-0.8181563620000001,831.0,3604,25.07.01 11367,5.0,Disregard
-0.81022054,11.0,3604,25.07.01 11367,2.0,Work Item
-0.8102272,11.0,4279,25.07.01 11367,2.0,Work Item
-0.8083836999999999,17.0,766,25.07.01 11367,5.0,Disregard
awk can do the int <-> string comparison if the token can be converted. Note that you're using comma as the field separator, so any surrounding spaces become part of the fields. Assuming it's not a decimal-point issue and your numbers are integers, check these three cases:
$ echo "42,42" | awk -F, '$1=="42" && $2==42{print "works";next} {print "does not work"}'
works
$ echo "42, 42" | awk -F, '$1=="42" && $2==42{print "works";next} {print "does not work"}'
works
$ echo "42 , 42" | awk -F, '$1=="42" && $2==42{print "works";next} {print "does not work"}'
does not work
The string comparison on the first field fails because the field still contains the space!
You can try setting your field separator to " *, *":
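For example, with the separator set that way, the third case above works as well:
$ echo "42 , 42" | awk -v FS=" *, *" '$1=="42" && $2==42{print "works";next} {print "does not work"}'
works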
UPDATE: If your integers get .0 floating-point extensions which you can ignore, convert them to int before the comparison:
$ echo "42.0 , 42" | awk -v FS=" *, *" 'int($1)=="42" && $2=="42"{print "works";next} {print "does not work"}'
works
Here your generic value can stay quoted, but the field is converted to int before the string comparison. You still need to know which fields are numeric and which are strings, though.
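If you would rather not track types at all, one workaround (a sketch, assuming the only numeric quirk is a trailing .0; the date-range conditions are left out for brevity) is to normalize every field once per line and then compare everything as strings:
awk -v FS=" *, *" '
  { for (i = 1; i <= NF; i++) sub(/\.0$/, "", $i) }   # normalize e.g. 20116688.0 -> 20116688
  $3 == "20116688" && $20 == "Disregard" { print $1 "," $3 "," $17 "," $20 }
' faults_main_only_dp_1_shopFlag.csv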

replace string in comma delimiter file using nawk

I need to add a condition to the nawk command below so that it only processes the third column if it has more than three digits. Please help me see what I am doing wrong with the command, as it is not working.
inputfile.txt
123 | abc | 321456 | tre
213 | fbc | 342 | poi
outputfile.txt
123 | abc | 321### | tre
213 | fbc | 342 | poi
cat inputfile.txt | nawk 'BEGIN {FS="|"; OFS="|"} {if($3 > 3) $3=substr($3, 1, 3)"###" print}'
Try:
awk 'length($3) > 3 { $3=substr($3, 1, 3)"###" } 1 ' FS=\| OFS=\| test1.txt
This works with gawk:
awk -F '[[:blank:]]*\\|[[:blank:]]*' -v OFS=' | ' '
$3 ~ /^[[:digit:]]{4,}/ {$3 = substr($3,1,3) "###"}
1
' inputfile.txt
It won't preserve the whitespace, so you might want to pipe through column -t.
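For instance, re-aligning with column -t (a sketch; plain column -t re-aligns on whitespace):
awk -F '[[:blank:]]*\\|[[:blank:]]*' -v OFS=' | ' '
$3 ~ /^[[:digit:]]{4,}/ {$3 = substr($3,1,3) "###"}
1
' inputfile.txt | column -t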

retrieve and add two numbers of files

In my file I have the following structure:
A | 12 | 10
B | 90 | 112
C | 54 | 34
I have to add column 2 and column 3 and print the result along with column 1.
Output:
A | 22
B | 202
C | 88
I can retrieve the two columns but don't know how to add them.
What I did is:
cut -d ' | ' -f3,5 myfile.txt
How do I add those columns and display the result?
A Bash solution:
#!/bin/bash
while IFS="|" read f1 f2 f3
do
    echo $f1 "|" $((f2+f3))
done < file
You can do this easily with awk:
awk '{print $1," | ",($3+$5)}' myfile.txt
will work, perhaps.
You can do this with awk:
awk 'BEGIN{FS="|"; OFS="| "} {print $1 OFS $2+$3}' input_filename
Input:
A | 12 | 10
B | 90 | 112
C | 54 | 34
Output:
A | 22
B | 202
C | 88
Explanation:
awk: invoke the awk tool
BEGIN{...}: do things before starting to read lines from the file
FS="|": FS stands for Field Separator. Think of it as the delimiter that separates each line of your file into fields
OFS="| ": OFS stands for Output Field Separator. Same idea as above, but for output. FS differs from OFS in this case due to formatting.
{print $1 OFS $2+$3}: For each line that awk reads, print the first field (the letter), followed by a delimiter specified by OFS, then the sum of field 2 and field 3.
input_filename: awk accepts the input file name as an argument here.
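If you want the spacing to come out uniformly no matter how the input was padded, a variant (a sketch) makes the separator regex swallow the spaces around the pipe:
awk -F' *\| *' '{print $1 " | " ($2+$3)}' input_filename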
