I want to fix the issue below in a CSV file using Unix tools. I don't have access to the source, so I have to fix the CSV file itself. I need the desired output shown below. Is this achievable? Please help.
I tried the code below, but it doesn't work.
perl -p00e 's/\n|/|/g' test.csv
Issue:
DATECODE|SUBCLASSCODE|SUBCLASS_NAME|CLASS
2021-05-25|2202|Bras|1310
2021-05-25|1119|No Longer in Use - Depleted by 2019 Reclass|0805
2021-05-25|0949|No Longer in Use - Depleted by 2021 Reclass|0231
2021-05-25|1928|Fishing Gloves|1155
2021-05-25|1604|Training FW|1080
2021-05-25|0894|Hunting Waders|0894
2021-05-25|1873|Small Game|0326
2021-05-25|9950|EVENT
REGISTRATION FEE|9950
2021-05-25|0476|Regular Golf Gloves|0476
2021-05-25|1366|
Shorts|0988
2021-05-25|1914|Wade Shoes|0894
2021-05-25|0537|No Longer in Use - Depleted by 2019 Reclass|0537
2021-05-25|1635|Pickleball FW|
0021
2021-05-25|0679|Case Sunglasses|0679
2021-05-25|1544|Sandals|0001
2021-05-25|
1527|Golf/Tennis Accessories|1059
2021-05-25|1582|Lifestyle FW|0502
Desired Result:
DATECODE|SUBCLASSCODE|SUBCLASS_NAME|CLASS
2021-05-25|2202|Bras|1310
2021-05-25|1119|No Longer in Use - Depleted by 2019 Reclass|0805
2021-05-25|0949|No Longer in Use - Depleted by 2021 Reclass|0231
2021-05-25|1928|Fishing Gloves|1155
2021-05-25|1604|Training FW|1080
2021-05-25|0894|Hunting Waders|0894
2021-05-25|1873|Small Game|0326
2021-05-25|9950|EVENT REGISTRATION FEE|9950
2021-05-25|0476|Regular Golf Gloves|0476
2021-05-25|1366|Shorts|0988
2021-05-25|1914|Wade Shoes|0894
2021-05-25|0537|No Longer in Use - Depleted by 2019 Reclass|0537
2021-05-25|1635|Pickleball FW|0021
2021-05-25|0679|Case Sunglasses|0679
2021-05-25|1544|Sandals|0001
2021-05-25|1527|Golf/Tennis Accessories|1059
2021-05-25|1582|Lifestyle FW|0502
You can fix the output fairly simply with awk using three rules. Specifically, you check whether each line has four fields and ends with four digits (the 4th field, $4). If so, just print the line (rule 1). If not, and the line begins with a date in your format, output it without a '\n' so the next line is appended to it (rule 2). If you reach a line that satisfies neither rule 1 nor rule 2, it is the end of the previous line, so output it with a '\n' to complete the previous line (rule 3).
That can be done with:
awk -F'|' '
NF==4 && $4~/^[[:digit:]]{4}$/ { print; next }
$1~/[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}/ {
printf "%s",$0
next
}
{ print }
' f.csv
Example Use/Output
With your input file in f.csv you would obtain:
$ awk -F'|' '
> NF==4 && $4~/^[[:digit:]]{4}$/ { print; next }
> $1~/[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}/ {
> printf "%s",$0
> next
> }
> { print }
> ' f.csv
DATECODE|SUBCLASSCODE|SUBCLASS_NAME|CLASS
2021-05-25|2202|Bras|1310
2021-05-25|1119|No Longer in Use - Depleted by 2019 Reclass|0805
2021-05-25|0949|No Longer in Use - Depleted by 2021 Reclass|0231
2021-05-25|1928|Fishing Gloves|1155
2021-05-25|1604|Training FW|1080
2021-05-25|0894|Hunting Waders|0894
2021-05-25|1873|Small Game|0326
2021-05-25|9950|EVENTREGISTRATION FEE|9950
2021-05-25|0476|Regular Golf Gloves|0476
2021-05-25|1366|Shorts|0988
2021-05-25|1914|Wade Shoes|0894
2021-05-25|0537|No Longer in Use - Depleted by 2019 Reclass|0537
2021-05-25|1635|Pickleball FW|0021
2021-05-25|0679|Case Sunglasses|0679
2021-05-25|1544|Sandals|0001
2021-05-25|1527|Golf/Tennis Accessories|1059
2021-05-25|1582|Lifestyle FW|0502
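This matches the output you specified, except that EVENT REGISTRATION FEE comes out as EVENTREGISTRATION FEE, because the line break inside that field is removed rather than replaced with a space.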
You can write it in condensed form with one rule per-line as:
awk -F'|' '
NF==4 && $4~/^[[:digit:]]{4}$/ { print; next }
$1~/[[:digit:]]{4}-[[:digit:]]{2}-[[:digit:]]{2}/ { printf "%s",$0; next }
{ print }
' f.csv
Look things over and let me know if you have further questions.
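If the space inside a split field matters (EVENT REGISTRATION FEE in the desired result), one possible variant is to buffer lines until a record has all four fields, and to restore a space only when the break did not fall right after a '|'. This is a minimal sketch, not the answer's original method: the function name rejoin_rows is made up for illustration, and it assumes every record has exactly four fields and that the last field is never split in the middle.

```shell
# Hypothetical helper: rejoin records that were broken across lines.
# Assumes 4 pipe-separated fields per record and an unbroken last field.
rejoin_rows() {
    awk '
    {
        if (buf == "")          buf = $0           # start of a new record
        else if (buf ~ /\|$/)   buf = buf $0       # break fell right after a delimiter
        else                    buf = buf " " $0   # break fell inside a field: restore the space
        n = split(buf, part, "|")
        if (n >= 4 && part[4] != "") {             # record is complete once field 4 has content
            print buf
            buf = ""
        }
    }
    END { if (buf != "") print buf }               # flush an incomplete trailing record
    ' "$@"
}
```

It would be called as rejoin_rows f.csv, or used at the end of a pipeline.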
There is also a very simple solution:
perl -pe 's/\n/ /g;s/2021-/\n2021-/g;s/\| */|/g' input.txt
gives you
+------------+--------------+---------------------------------------------+--------+
| DATECODE | SUBCLASSCODE | SUBCLASS_NAME | CLASS |
+------------+--------------+---------------------------------------------+--------+
| 2021-05-25 | 2202 | Bras | 1310 |
| 2021-05-25 | 1119 | No Longer in Use - Depleted by 2019 Reclass | 0805 |
| 2021-05-25 | 0949 | No Longer in Use - Depleted by 2021 Reclass | 0231 |
| 2021-05-25 | 1928 | Fishing Gloves | 1155 |
| 2021-05-25 | 1604 | Training FW | 1080 |
| 2021-05-25 | 0894 | Hunting Waders | 0894 |
| 2021-05-25 | 1873 | Small Game | 0326 |
| 2021-05-25 | 9950 | EVENT REGISTRATION FEE | 9950 |
| 2021-05-25 | 0476 | Regular Golf Gloves | 0476 |
| 2021-05-25 | 1366 | Shorts | 0988 |
| 2021-05-25 | 1914 | Wade Shoes | 0894 |
| 2021-05-25 | 0537 | No Longer in Use - Depleted by 2019 Reclass | 0537 |
| 2021-05-25 | 1635 | Pickleball FW | 0021 |
| 2021-05-25 | 0679 | Case Sunglasses | 0679 |
| 2021-05-25 | 1544 | Sandals | 0001 |
| 2021-05-25 | 1527 | Golf/Tennis Accessories | 1059 |
| 2021-05-25 | 1582 | Lifestyle FW | 0502 |
+------------+--------------+---------------------------------------------+--------+
My file looks as follows:
+------------------------------------------+---------------+----------------+------------------+------------------+-----------------+
| Message | Status | Adress | Changes | Test | Calibration |
|------------------------------------------+---------------+----------------+------------------+------------------+-----------------|
| Hello World | Active | up | 1 | up | done |
| Hello Everyone Here | Passive | up | 2 | down | none |
| Hi there. My name is Eric. How are you? | Down | up | 3 | inactive | done |
+------------------------------------------+---------------+----------------+------------------+------------------+-----------------+
+----------------------------+---------------+----------------+------------------+------------------+-----------------+
| Message | Status | Adress | Changes | Test | Calibration |
|----------------------------+---------------+----------------+------------------+------------------+-----------------|
| What's up? | Active | up | 1 | up | done |
| Hi. I'm Otilia | Passive | up | 2 | down | none |
| Hi there. This is Marcus | Up | up | 3 | inactive | done |
+----------------------------+---------------+----------------+------------------+------------------+-----------------+
I want to extract a specific column using AWK.
I can do it with cut; however, because the width of each table varies with the number of characters in each column, I don't get the desired output.
cat File.txt | cut -c -44
+------------------------------------------+
| Message |
|------------------------------------------+
| Hello World |
| Hello Everyone Here |
| Hi there. My name is Eric. How are you? |
+------------------------------------------+
+----------------------------+--------------
| Message | Status
|----------------------------+--------------
| What's up? | Active
| Hi. I'm Otilia | Passive
| Hi there. This is Marcus | Up
+----------------------------+--------------
or
cat File.txt | cut -c 44-60
+---------------+
| Status |
+---------------+
| Active |
| Passive |
| Down |
+---------------+
--+--------------
| Adress
--+--------------
| up
| up
| up
--+--------------
I tried using AWK but I don't know how to add 2 different delimiters which would take care of all the lines.
cat File.txt | awk 'BEGIN {FS="|";}{print $2,$3}'
Message Status
------------------------------------------+---------------+----------------+------------------+------------------+-----------------
Hello World Active
Hello Everyone Here Passive
Hi there. My name is Eric. How are you? Down
Message Status
----------------------------+---------------+----------------+------------------+------------------+-----------------
What's up? Active
Hi. I'm Otilia Passive
Hi there. This is Marcus Up
The output I'm looking for:
+------------------------------------------+
| Message |
|------------------------------------------+
| Hello World |
| Hello Everyone Here |
| Hi there. My name is Eric. How are you? |
+------------------------------------------+
+----------------------------+
| Message |
|----------------------------+
| What's up? |
| Hi. I'm Otilia |
| Hi there. This is Marcus |
+----------------------------+
or
+------------------------------------------+---------------+
| Message | Status |
|------------------------------------------+---------------+
| Hello World | Active |
| Hello Everyone Here | Passive |
| Hi there. My name is Eric. How are you? | Down |
+------------------------------------------+---------------+
+----------------------------+---------------+
| Message | Status |
|----------------------------+---------------+
| What's up? | Active |
| Hi. I'm Otilia | Passive |
| Hi there. This is Marcus | Up |
+----------------------------+---------------+
or random other columns
+------------------------------------------+----------------+------------------+
| Message | Adress | Test |
|------------------------------------------+----------------+------------------+
| Hello World | up | up |
| Hello Everyone Here | up | down |
| Hi there. My name is Eric. How are you? | up | inactive |
+------------------------------------------+----------------+------------------+
+----------------------------+---------------+------------------+
| Message |Adress | Test |
|----------------------------+---------------+------------------+
| What's up? |up | up |
| Hi. I'm Otilia |up | down |
| Hi there. This is Marcus |up | inactive |
+----------------------------+---------------+------------------+
Thanks in advance.
One idea using GNU awk:
awk -v fldlist="2,3" '
BEGIN { fldcnt=split(fldlist,fields,",") } # split fldlist into array fields[]
{ split($0,arr,/[|+]/,seps) # split current line on dual delimiters "|" and "+"
for (i=1;i<=fldcnt;i++) # loop through our array of fields (fldlist)
printf "%s%s", seps[fields[i]-1], arr[fields[i]] # print leading separator/delimiter and field
printf "%s\n", seps[fields[fldcnt]] # print trailing separator/delimiter and terminate line
}
' File.txt
NOTES:
requires GNU awk for the 4th argument to the split() function (seps == array of separators; see gawk string functions for details)
assumes our field delimiters (|, +) do not show up as part of the data
the input variable fldlist is a comma-delimited list of columns that mimics what would be passed to cut (eg, when a line starts with a delimiter then field #1 is blank)
For fldlist="2,3" this generates:
+------------------------------------------+---------------+
| Message | Status |
|------------------------------------------+---------------+
| Hello World | Active |
| Hello Everyone Here | Passive |
| Hi there. My name is Eric. How are you? | Down |
+------------------------------------------+---------------+
+----------------------------+---------------+
| Message | Status |
|----------------------------+---------------+
| What's up? | Active |
| Hi. I'm Otilia | Passive |
| Hi there. This is Marcus | Up |
+----------------------------+---------------+
For fldlist="2,4,6" this generates:
+------------------------------------------+----------------+------------------+
| Message | Adress | Test |
|------------------------------------------+----------------+------------------+
| Hello World | up | up |
| Hello Everyone Here | up | down |
| Hi there. My name is Eric. How are you? | up | inactive |
+------------------------------------------+----------------+------------------+
+----------------------------+----------------+------------------+
| Message | Adress | Test |
|----------------------------+----------------+------------------+
| What's up? | up | up |
| Hi. I'm Otilia | up | down |
| Hi there. This is Marcus | up | inactive |
+----------------------------+----------------+------------------+
For fldlist="4,3,2" this generates:
+----------------+---------------+------------------------------------------+
| Adress | Status | Message |
+----------------+---------------|------------------------------------------+
| up | Active | Hello World |
| up | Passive | Hello Everyone Here |
| up | Down | Hi there. My name is Eric. How are you? |
+----------------+---------------+------------------------------------------+
+----------------+---------------+----------------------------+
| Adress | Status | Message |
+----------------+---------------|----------------------------+
| up | Active | What's up? |
| up | Passive | Hi. I'm Otilia |
| up | Up | Hi there. This is Marcus |
+----------------+---------------+----------------------------+
Say that again? (fldlist="3,3,3"):
+---------------+---------------+---------------+
| Status | Status | Status |
+---------------+---------------+---------------+
| Active | Active | Active |
| Passive | Passive | Passive |
| Down | Down | Down |
+---------------+---------------+---------------+
+---------------+---------------+---------------+
| Status | Status | Status |
+---------------+---------------+---------------+
| Active | Active | Active |
| Passive | Passive | Passive |
| Up | Up | Up |
+---------------+---------------+---------------+
And if you make the mistake of trying to print the '1st' column, ie, fldlist="1":
+
|
|
|
|
|
+
+
|
|
|
|
|
+
If GNU awk is available, please try markp-fuso's nice solution.
If not, here is a posix-compliant alternative:
#!/bin/bash
# define bash variables
cols=(2 3 6) # bash array of desired columns
col_list=$(IFS=,; echo "${cols[*]}") # create a csv string
awk -v cols="$col_list" '
NR==FNR {
if (match($0, /^[|+]/)) { # the record contains a table
if (match($0, /^[|+]-/)) # horizontally ruled line
n = split($0, a, /[|+]/) # split into columns
else # "cell" line
n = split($0, a, /\|/)
len = 0
for (i = 1; i < n; i++) {
len += length(a[i]) + 1 # accumulated column position
pos[FNR, i] = len
}
}
next
}
{
n = split(cols, a, /,/) # split the variable `cols` on comma into an array
for (i = 1; i <= n; i++) {
col = a[i]
if (pos[FNR, col] && pos[FNR, col+1]) {
printf("%s", substr($0, pos[FNR, col], pos[FNR, col + 1] - pos[FNR, col]))
}
}
print(substr($0, pos[FNR, col + 1], 1))
}
' file.txt file.txt
Result with cols=(2 3 6) as shown above:
+---------------+----------------+-----------------+
| Status | Adress | Calibration |
+---------------+----------------+-----------------|
| Active | up | done |
| Passive | up | none |
| Down | up | done |
+---------------+----------------+-----------------+
+---------------+----------------+-----------------+
| Status | Adress | Calibration |
+---------------+----------------+-----------------|
| Active | up | done |
| Passive | up | none |
| Up | up | done |
+---------------+----------------+-----------------+
It detects the column width in the 1st pass then splits the line on the column position in the 2nd pass.
You can control which columns are printed with the bash array cols, which is assigned at the beginning of the script. Please set the array to the list of desired column numbers in increasing order. If you want to use the bash variable in a different way, please let me know.
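For comparison, the same selection can also be sketched in a single pass with POSIX awk, by splitting each table line on '|' and '+' and remembering which delimiter character sits in front of each cell. This is only an illustration under assumptions: the function name pick_columns is hypothetical, column 1 means the first visible column (Message), and neither '|' nor '+' may appear inside the data.

```shell
# Hypothetical helper: pick_columns "2,3,6" File.txt
# Assumes "|" and "+" never occur inside cell data.
pick_columns() {
    cols=$1; shift                        # $1 = comma-separated column numbers
    awk -v cols="$cols" '
    BEGIN { n = split(cols, want, ",") }
    /^[|+]/ {
        m = split($0, cell, /[|+]/)       # cell[1] is the empty text before the first delimiter
        split("", delim)                  # clear delimiters left over from the previous line
        pos = 1
        for (i = 2; i <= m; i++) {        # delim[i] = delimiter character in front of cell[i]
            delim[i] = substr($0, pos, 1)
            pos += length(cell[i]) + 1
        }
        out = ""
        for (k = 1; k <= n; k++)
            out = out delim[want[k] + 1] cell[want[k] + 1]
        print out delim[want[n] + 2]      # closing delimiter after the last requested column
        next
    }
    { print }                             # pass non-table lines through unchanged
    ' "$@"
}
```

Because each line is rebuilt from its own cells, the columns may be listed in any order or repeated.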
my data examples are
1.txt
MTQZ3CODT0SQKGE3QE6B | j t | j | t | 22312 | stimpy | EST | 8 | 20 | text | list | 0 | | 2002-08-22 13:07:05
2.txt
MTQZ3CODT0SQKGE3QE6B | joe#example.com
desired output
joe#example.com | j t | j | t | 22312 | stimpy | EST | 8 | 20 | text | list | 0 | | 2002-08-22 13:07:05
I want to match the 1st column of 1.txt and replace it with the 2nd column of 2.txt.
So far I have tried:
awk 'BEGIN { while((getline < "file2.txt") > 0) a[$1]=$3 } { $1 = a[$1] } 1' file1.txt
It works, but after 12 hours of running it had processed only 1 GB, which is very slow.
INFO: file1.txt = 7 GB, file2.txt = 4 GB, and I have 16 GB of memory.
I'm not sure what causes the slowness, but a faster alternative to my awk command would be helpful.
Thanks!
Note: I'm running out of memory. Is there a way to do this without an array at all?
Also, in my case the lines are in random order, and matching keys are not on the same line numbers in the two files.
$ join <(sort 2.txt) <(sort 1.txt) | cut -d' ' -f3-
joe#example.com | j t | j | t | 22312 | stimpy | EST | 8 | 20 | text | list | 0 | | 2002-08-22 13:07:05
If that's not all you need then edit your question to provide more truly representative sample input/output including cases that this doesn't work for.
You may use this awk:
awk -F ' *\\| *' -v OFS=' | ' '
FNR == NR {
map[$1]=$2
next
}
$1 in map {
$1 = map[$1]
} 1' 2.txt 1.txt
joe#example.com | j t | j | t | 22312 | stimpy | EST | 8 | 20 | text | list | 0 | | 2002-08-22 13:07:05
I am currently trying to find a way to redirect the standard output from beeline shell to text file without the grid. The biggest problem I am facing right now is that my columns have negative values and when I'm using regex to remove the '-', it is affecting the column values.
+-------------------+
| col |
+-------------------+
| -100 |
| 22 |
| -120 |
| -190 |
| -800 |
+-------------------+
Here's what I'm doing:
beeline -u jdbc:hive2://localhost:10000/default \
-e "SELECT * FROM $db.$tbl;" | sed 's/\+//g' | sed 's/\-//g' | sed 's/\|//g' > table.txt
I am trying to clean this file so I can read all the data into a variable.
Assuming all your data follows the same pattern, where the non-significant '-' characters are enclosed between '+' characters:
[root#machine]# cat boo
+-------------------+
| col |
+-------------------+
| -100 |
| 22 |
| -120 |
| -190 |
| -800 |
+-------------------+
[root#machine]# cat boo | sed 's/\+-*+//g' | sed 's/\--//g' | sed 's/|//g'
col
-100
22
-120
-190
-800
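If chaining sed substitutions proves fragile, the same cleanup can be sketched in awk by treating '|' as the field separator, so the minus signs inside cells are never touched. This is a rough sketch under assumptions: the function name strip_grid is made up, cells are assumed not to contain '|', and multiple columns are joined with tabs.

```shell
# Hypothetical helper: strip the beeline grid, keep cell values (including negatives).
strip_grid() {
    awk -F'|' '
    /^\+/  { next }                              # skip border lines such as +------+
    NF < 3 { next }                              # skip anything that is not a table row
    {
        out = ""
        for (i = 2; i < NF; i++) {               # fields 1 and NF lie outside the outer pipes
            cell = $i
            gsub(/^[ \t]+|[ \t]+$/, "", cell)    # trim padding only; a leading "-" stays
            out = out (i > 2 ? "\t" : "") cell
        }
        print out
    }' "$@"
}
```

It can be placed at the end of the pipeline in place of the sed commands, for example ... | strip_grid > table.txt.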
I have a problem with masscan's -oL output option ("grep-able" output); for instance, it outputs this:
| 14.138.12.21:123 | unknown | disabled |
| 14.138.184.122:123 | unknown | disabled |
| 14.138.179.27:123 | unknown | disabled |
| 14.138.20.65:123 | unknown | disabled |
| 14.138.12.235:123 | unknown | disabled |
| 14.138.178.97:123 | unknown | disabled |
| 14.138.182.153:123 | unknown | disabled |
| 14.138.178.124:123 | unknown | disabled |
| 14.138.201.191:123 | unknown | disabled |
| 14.138.180.26:123 | unknown | disabled |
| 14.138.13.129:123 | unknown | disabled |
The above is neither very readable nor easy to understand.
How can I use Linux command-line utilities, e.g. sed, awk, or grep, to output something as follows, using the file above?
output
14.138.12.21
14.138.184.122
14.138.179.27
14.138.20.65
14.138.12.235
Use awk with space and : as field separators and print the second field:
awk -F '[ :]' '{print $2}' file.txt
Example:
% cat file.txt
| 14.138.12.21:123 | unknown | disabled |
| 14.138.184.122:123 | unknown | disabled |
| 14.138.179.27:123 | unknown | disabled |
| 14.138.20.65:123 | unknown | disabled |
| 14.138.12.235:123 | unknown | disabled |
| 14.138.178.97:123 | unknown | disabled |
| 14.138.182.153:123 | unknown | disabled |
| 14.138.178.124:123 | unknown | disabled |
| 14.138.201.191:123 | unknown | disabled |
| 14.138.180.26:123 | unknown | disabled |
| 14.138.13.129:123 | unknown | disabled |
% awk -F '[ :]' '{print $2}' file.txt
14.138.12.21
14.138.184.122
14.138.179.27
14.138.20.65
14.138.12.235
14.138.178.97
14.138.182.153
14.138.178.124
14.138.201.191
14.138.180.26
14.138.13.129
AWK is perfect for cases where you want to split the file into "columns" and you know that the order of the values/columns is constant. AWK splits each line by a field separator (which can be a regular expression such as '[: ]'). The columns are accessible by their positions from the left: $1, $2, $3, etc.:
awk -F '[ :]' '{print $2}' src.log
awk -F '[ :|]' '{print $3}' src.log
awk 'BEGIN {FS="[ :|]"} {print $3}' src.log
You can also filter the lines with a regular expression:
awk -F '[ :]' '/138\.179\./ {print $2}' src.log
However, POSIX awk cannot capture substrings with regular-expression groups; GNU awk can, through the optional array argument of match() or through gensub().
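Even without capture groups, POSIX awk's match() sets RSTART and RLENGTH, so the matched text itself can be pulled out with substr(). A minimal sketch, assuming the first dotted quad on each line is the address you want; the function name extract_ips is made up for illustration:

```shell
# Hypothetical helper: print the first dotted-quad found on each line.
extract_ips() {
    awk 'match($0, /[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+/) {
        print substr($0, RSTART, RLENGTH)   # RSTART/RLENGTH describe the matched span
    }' "$@"
}
```

It would be called as extract_ips src.log; lines without a match are skipped.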
SED is more flexible in regard to regular expressions:
sed -r 's/^[^0-9]*([0-9\.]+)\:.*/\1/' src.log
However, it lacks many useful features of the Perl-like regular expressions we are used to in everyday programming. For example, even the extended syntax (-r) fails to interpret \d as a digit.
Perl is perhaps the most flexible tool for parsing files. You can opt for simple expressions:
perl -n -e '/^\D*([^:]+):/ and print "$1\n"' src.log
or make the matching as strict as you like:
perl -n -e '/^\D*((?:\d{1,3}\.){3}\d{1,3}):/ and print "$1\n"' src.log
Using sed:
sed -r 's/^ *[|] *([0-9]+[.][0-9]+[.][0-9]+[.][0-9]+):[0-9]{3}.*/\1/' file.txt
My source:
+-----------+-------+----------------------+----------------------------------------------------------------------------------+
| positives | total | scan_date | url |
+===========+=======+======================+==================================================================================+
| 4 | 65 | 2015-09-21 23:29:33 | http://thebackpack.fr/wp-content/themes/salient/wpbakery/js_composer/assets/lib/ |
| | | | prettyphoto/images/prettyPhoto/light_rounded/66836487162.txt |
+-----------+-------+----------------------+----------------------------------------------------------------------------------+
| 1 | 64 | 2015-09-17 19:28:50 | http://thebackpack.fr/ |
+-----------+-------+----------------------+----------------------------------------------------------------------------------+
| 1 | 64 | 2015-09-17 08:44:16 | http://thebackpack.fr/wp-content/themes/salient/wpbakery/js_composer/assets/lib/ |
| | | | prettyphoto/images/prettyPhoto/light_rounded/ |
+-----------+-------+----------------------+----------------------------------------------------------------------------------+
I would like to extract the full URLs (Full URL in one line):
hxxp://thebackpack.fr/wp-content/themes/salient/wpbakery/js_composer/assets/lib/prettyphoto/images/prettyPhoto/light_rounded/66836487162.txt
hxxp://thebackpack.fr/
hxxp://thebackpack.fr/wp-content/themes/salient/wpbakery/js_composer/assets/lib/prettyphoto/images/prettyPhoto/light_rounded/
The multi-line URLs are my problem. I tried, for example: awk '{print $9}'
Thanks in advance for your help!
You can use this awk command:
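awk -F '[[:blank:]]*\\|[[:blank:]]*' 'NR<3 || NF<5{next}   # skip borders and the header row
$2 != ""{if (url) print url; url=$5; next}                 # non-empty positives cell (even 0) starts a new URL
{url=url $5}                                               # otherwise append the continuation piece
END{print url}' file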
Output:
http://thebackpack.fr/wp-content/themes/salient/wpbakery/js_composer/assets/lib/prettyphoto/images/prettyPhoto/light_rounded/66836487162.txt
http://thebackpack.fr/
http://thebackpack.fr/wp-content/themes/salient/wpbakery/js_composer/assets/lib/prettyphoto/images/prettyPhoto/light_rounded/