Removing unwanted characters and empty lines with SED, TR and/or awk - shell

I need to remove some unknown characters and the remaining empty lines from a file. It should be simple, and I'm feeling really stupid that I haven't managed to do it yet.
Here's the file contents (readable):
136;2014-09-07 13:41:25;2014-09-07 13:41:55
136;2014-09-07 13:41:55;2014-09-07 13:42:25
136;2014-09-07 13:42:25;2014-09-07 13:42:55
(empty line)
(empty line)
For some reason, this file comes with several unwanted/unknown chars. The HEX is:
fffe 3100 3300 3600 3b00 3200 3000 3100 3400 2d00 3000 3900 :..1.3.6.;.2.0.1.4.-.0.9.
2d00 3000 3700 2000 3100 3300 3a00 3400 3100 3a00 3200 3500 :-.0.7. .1.3.:.4.1.:.2.5.
3b00 3200 3000 3100 3400 2d00 3000 3900 2d00 3000 3700 2000 :;.2.0.1.4.-.0.9.-.0.7. .
3100 3300 3a00 3400 3100 3a00 3500 3500 0d00 0a00 3100 3300 :1.3.:.4.1.:.5.5.....1.3.
3600 3b00 3200 3000 3100 3400 2d00 3000 3900 2d00 3000 3700 :6.;.2.0.1.4.-.0.9.-.0.7.
2000 3100 3300 3a00 3400 3100 3a00 3500 3500 3b00 3200 3000 : .1.3.:.4.1.:.5.5.;.2.0.
3100 3400 2d00 3000 3900 2d00 3000 3700 2000 3100 3300 3a00 :1.4.-.0.9.-.0.7. .1.3.:.
3400 3200 3a00 3200 3500 0d00 0a00 3100 3300 3600 3b00 3200 :4.2.:.2.5.....1.3.6.;.2.
3000 3100 3400 2d00 3000 3900 2d00 3000 3700 2000 3100 3300 :0.1.4.-.0.9.-.0.7. .1.3.
3a00 3400 3200 3a00 3200 3500 3b00 3200 3000 3100 3400 2d00 ::.4.2.:.2.5.;.2.0.1.4.-.
3000 3900 2d00 3000 3700 2000 3100 3300 3a00 3400 3200 3a00 :0.9.-.0.7. .1.3.:.4.2.:.
3500 3500 0d00 0a00 0000 0d00 0a00 :5.5...........
So, as you can see, the first 2 bytes are xFF and xFE, and there is an x00 after each character. The line endings are the sequence 0D00 + 0A00, i.e. carriage return and linefeed (\r\n) with each byte followed by an x00.
I want to remove those x00 bytes, the first 2 bytes (xFF xFE) and the last 4, and convert the CRLF line endings to LF.
I could do that by using head, tail and tr:
tr -d '\15\00' < 2014.log | tail -c +3 | head -c -2 > 3.log
The problem is, I'm not sure if the file will always arrive like this, so I need to build a more generic method. I ended up with:
sed 's/\xFF\xFE//g; s/\x00//g; s/\x0D//g' 2014.log > 2.log
or
tr -d '\377\376\00\15' < 2014.log > 2.log
Now I need to remove the last two empty lines, which as I said in the beginning, should be easy, but I can't accomplish that.
I've tried:
sed '/^\s*$/d'
sed '/^$/d'
awk 'NF > 0'
egrep -v "^$"
Other stuff
But in the end it removes only one of the blank lines; I still have one x0A at the end. I tried to replace the x0Ax0A sequence with sed, even using \n\n, but it didn't work.
I can't remove all \n because I need the normal lines; I just want to remove them when they appear at least twice in sequence. Again, I could use tail or head to remove them, but I would be assuming that all files arrive that way, and that's not true.
I see it as simple find-and-replace stuff, but it seems it doesn't work that way when we are dealing with linefeeds.
For information purposes:
file -i 2014-09-07-13-46-51.log
2014-09-07-13-46-51.log: application/octet-stream; charset=binary
It's not recognized as a text file... this file is extracted from a Flash shared object (.sol).
As new files may not be like this and may arrive as normal text files, I can't simply cut the files; I need to treat only those that are problematic.

The "fffe" at the beginning of the file is a byte order mark (http://en.wikipedia.org/wiki/Byte_order_mark) and for me an indication that you have a unicode type file. In that kind of file 'normal' ascii characters are represented by 2 bytes.
In another stackoverflow question/aswer the file is first converted to UTF-8... (grepping binary files and UTF16)
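A minimal sketch along those lines, hedged on the assumption that the problematic files always start with the FFFE byte-order mark (clean text files are passed through untouched; iconv consumes the BOM and the x00 padding, and the rest of the pipeline strips the stray NULs, carriage returns and blank lines; the output name 2014-clean.log is just an example):
if [ "$(head -c 2 2014.log | od -An -tx1 | tr -d ' \n')" = "fffe" ]; then
    iconv -f UTF-16 -t UTF-8 2014.log     # BOM-aware decode of the UTF-16 file
else
    cat 2014.log                          # already plain text, pass through unchanged
fi | tr -d '\r\000' | sed '/^$/d' > 2014-clean.log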

I finally made it, but I really didn't like the solution. I replaced all linefeeds with another character, a pipe (|), then removed them whenever two appeared in sequence (||), and then converted the pipes (|) back to \n:
sed 's/\xFF\xFE//g; s/\x00//g; s/\x0D//g' 2014.log | tr '\n' '|' | sed 's/||//g;' | sed 's/|/\x0A/g' > 5.log
-- #Luciano

Wow, I solved the problem back then but forgot to post the answer, so here it is!
Using only the tr command I could accomplish it like this:
tr -d '\377\376\015\000\277\003' < logs.csv | tr -s '\n'
tr removed all the unwanted characters and empty lines, and it was really, really fast, much faster than the options using sed and awk

If you just want the ASCII characters out of the file you might try iconv
You probably can identify the file's encoding with file -i
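For example, a sketch of that approach (file -i was already shown to report charset=binary for this sample, so the UTF-16LE source encoding here is an assumption based on the FFFE byte-order mark; //IGNORE makes iconv drop anything it cannot map to ASCII, including the BOM):
iconv -f UTF-16LE -t ASCII//IGNORE 2014.log | tr -d '\r\000' | sed '/^$/d' > clean.log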

I know you asked for sed, tr or awk, but on the off chance it will change your mind, this is how easy it is to get Perl to do the heavy lifting:
perl -e 'open my $fh, "<:encoding(utf16)", $ARGV[0] or die "Error reading $ARGV[0]: $!"; while (<$fh>) { s{\x0d\x0a}{\n}g; s{\x00\n}{}g; print $_; }' input_filename

Related

Get previous 4 digits of a hex pattern search

I'm trying to do a hex search for a pattern.
I have a file and I search for a pattern on the file with...
xxd -g 2 -c 32 -u file | grep "0045 5804 0001 0000"
This returns the lines that contain that pattern:
FFFF FFFF FFFF 4556 4E54 0000 0116 0100 08B9 0045 5804 0001 0000 2008 0000 0001
But I want it to return the 4 digits before that pattern, which are 08B9 in this case. How could I do it?
With GNU grep and a Perl-compatible regular expression:
xxd -g 2 -c 32 -u file | grep -Po '....(?= 0045 5804 0001 0000)'
Output:
08B9
Don't use grep, use sed, e.g. using any sed:
$ xxd whatever | sed -n 's/.*\(....\) 0045 5804 0001 0000.*/\1/p'
08B9
A not very elegant but intuitively simple approach might be to pipe your grep result into sed and use a simple regex to replace everything from your search term to the end of the line with an empty string. This leaves the block you want as the last space-separated 'word' of the result, which can be retrieved by piping into awk and printing the last field (steps shown on separate lines for presentation; join them):
xxd -g 2 -c 32 -u file |
grep "0045 5804 0001 0000" |
sed 's/0045 5804 0001 0000.*//' |
awk '{print $NF}'
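Joined into a single line, that would be:
xxd -g 2 -c 32 -u file | grep "0045 5804 0001 0000" | sed 's/0045 5804 0001 0000.*//' | awk '{print $NF}'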
nawk 'sub(".* ",_, $!--NF)^_' OFS= FS=' 0045 5804 0001 0000.*$'
mawk '$!NF = $--NF' FS=' 0045 5804 0001 0000.*$| '
gawk ' $_ = $--NF' FS=' 0045 5804 0001 0000.*$| '
08B9
I would harness GNU AWK for this task in the following way. Let file.txt content be
FFFF FFFF FFFF 4556 4E54 0000 0116 0100 08B9 0045 5804 0001 0000 2008 0000 0001
then
awk 'match($0, /[[:xdigit:]]{4} 0045 5804 0001 0000/){print substr($0,RSTART,4)}' file.txt
gives output
08B9
Explanation: I use two String Functions: match to check if the current line ($0) contains the pattern and to set the RSTART variable, then substr to get the first 4 characters of the match. [[:xdigit:]] denotes a base-16 digit and {4} the number of repeats.
(tested in gawk 4.2.1)
My xxd prints an 8-digit address, a :, 16x 4-digit hex codes (separated by spaces), and finally the corresponding raw data from the file, eg:
$ xxd -g 2 -c 32 -u file
         1         2         3         4         5         6         7         8         9        10        11        12        13
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
00000000: 4120 302E 3730 3220 6173 646C 666B 6A61 7364 666C 6B61 6A73 6466 3B6C 6B61 736A A 0.702 asdlfkjasdflkajsdf;lkasj
00000020: 6466 6C6B 6173 6A64 660A 4220 302E 3836 3820 6173 646C 666B 6A61 7364 666C 6B61 dflkasjdf.B 0.868 asdlfkjasdflka
00000040: 322E 3135 3220 6173 646C 666B 6A61 7364 666C 6B61 6A73 6466 3B6C 6B61 736A 6466 2.152 asdlfkjasdflkajsdf;lkasjdf
00000060: 6C6B 6173 6A64 660A lkasjdf.
NOTE: the 1st two lines (a ruler) added to show column numbering
OP appears to be interested solely in the 4-digit hex codes which means we're interested in the data in columns 11-89 (inclusive).
From here we need to address 4x different scenarios:
match could occur at the very beginning of the xxd output, in which case there is no preceding 4-digit hex code
match occurs at the beginning of a line, so we're interested in the 4-digit hex code at the end of the previous line
match occurs in the middle of a line, in which case we're interested in the 4-digit hex code just prior to the match
match spans two lines, in which case we're interested in the 4-digit hex code just prior to the match on the 1st line
A contrived set of xxd output to demonstrate all 4x scenarios:
$ cat xxd.out
00000000: 0045 5804 0001 0000 6173 646C 666B 6A61 7364 666C 6B61 6A73 6466 3B6C 6B61 736A A 0.702 asdlfkjasdflkajsdf;lkasj
#         ^^^^^^^^^^^^^^^^^^^
00000020: 0045 5804 0001 0000 660A 4220 0045 5804 0001 0000 646C 666B 6A61 7364 0045 5804 dflkasjdf.B 0.868 asdlfkjasdflka
#         ^^^^^^^^^^^^^^^^^^^           ^^^^^^^^^^^^^^^^^^^                     ^^^^^^^^^
00000040: 0001 0000 3B6C 6B61 736A 6466 6C6B 6173 6A64 660A 4320 332E 3436 3720 6173 646C jsdf;lkasjdflkasjdf.C 3.467 asdl
#         ^^^^^^^^^
NOTE: comments added to highlight our matches
One idea using awk:
x='0045 5804 0001 0000'
cat xxd.out | # simulate feeding xxd output to awk
awk -v x="${x}" '
function parse_string() {
    while ( length(string) > (2 * lenx) ) {
        pos = index(string,x)
        if (pos) {
            if (pos == 1) output = "NA (at front of file)"
            else          output = substr(string, pos - 5, 4)
            cnt++
            printf "Match #%s: %s\n", cnt, output
            string = substr(string, pos + lenx)
        }
        else {
            string = substr(string, length(string) - (2 * lenx))
            break
        }
    }
}
BEGIN { lenx = length(x) }
{ string = string substr($0,11,80)    # strip off address & raw data, append 4-digit hex codes into one long string
  if ( length(string) > (1000 * lenx) )
      parse_string()
}
END { parse_string() }
'
NOTE: the parse_string() function and the assorted if (length(string) > ...) tests allow us to limit memory usage to roughly 1000x the length of our search pattern (in this example 1000 x 19 = 19,000 characters); granted, this is overkill for small files, but it lets us process larger files without worrying about hogging memory (or, in the worst case, an OOM - Out Of Memory - error).
This generates:
Match #1: NA (at front of file)
Match #2: 736A
Match #3: 4220
Match #4: 7364
Just make a lookahead and print only the matched string
$ xxd -g 2 -c 32 -u file | grep -Po "[0-9A-F]{4} (?=0045 5804 0001 0000)"
$ xxd -g 2 -c 32 -u file | perl -lne 'print for /([0-9A-F]{4}) (?=0045 5804 0001 0000)/'
But searching the hex representation like that is just silly because:
It won't work when the pattern 0045 5804 0001 0000 is at the beginning of the line (i.e. the output is on the previous line)
It'll be much slower than searching directly in binary
So just search directly with grep then decode like this
grep -Pao "..\x00\x45\x58\x04\x00\x01\x00\x00" file | xxd -p -u -l 2
It matches 2 bytes followed by your byte pattern, then prints the first 2 bytes as hex.
grep -ao $'..\x12\x34<remaining bytes of hex pattern>' file | xxd -p -u -l 2 also works but not in every case due to the handling of null bytes
If the pattern contains LF \n then you'll also need the -z option
grep -Pzao "..<hex pattern>" file | xxd -p -u -l 2
grep -zao $'..<hex pattern>' file | xxd -p -u -l 2
See also
Using grep to search for hex strings in a file
How can I grep a hex value in a string in a binary file?

Replacing the value of specific field in a table-like string stored as bash variable

I am looking for a way to replace (with 0) a specific value (1043252782) in a "table-like" string stored as a bash variable. The output of echo "$var" looks like this:
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 090 060 045 Pre-fail Always - 1043252782
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
After the replacement echo "$var" should look like this:
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 090 060 045 Pre-fail Always - 0
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
Is there a way to do this without saving the content of $var to a file and directly manipulating it within the bash (shell script)?
Maybe with awk? I can select the value in the 10th field of the second record with awk and pattern matching ("7 Seek_Error_Rate ....") like this:
echo "$var" | awk '/^ 7/{print $10}'
Maybe there is some way of doing it with awk (or another CLI tool) to replace the value and store the result back into $var? Also, the value changes over time, but the structure remains the same (some record with the value in the 10th field).
You can change a specific string directly in the shell:
var=${var/1043252782/0}
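If the value could occur more than once in $var, the doubled-slash form of the same expansion replaces every occurrence (plain bash, no external tools):
var=${var//1043252782/0}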
To replace the final number of the second line, you could use awk or sed:
var=$(awk 'NR==2 { sub(/[0-9]+$/,0) }1' <<<"$var")
var=$(sed '2s/[0-9][0-9]*$/0/' <<<"$var")
If you don't know which line it will be, you can match a known string:
var=$(awk '/Seek_Error_Rate/{ sub(/[0-9]+$/,0) }1' <<<"$var")
var=$(sed '/Seek_Error_Rate/s/[0-9][0-9]*$/0/' <<<"$var")
You can use a here-string to feed the variable as input to awk.
Use sub() to perform a regular expression replacement.
var=$(awk '{sub(/1043252782$/, "0")}1' <<<"$var")
Using sed
$ var=$(sed '/1043252782$/s//0/' <<< "$var")
$ echo "$var"
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 090 060 045 Pre-fail Always - 0
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
If you don't want to ruin the formatting of tabs and spaces:
{m,g}wk NF=NF FS=' 1043252782$' OFS=' 0'
:
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
7 Seek_Error_Rate 0x000f 090 060 045 Pre-fail Always - 0
10 Spin_Retry_Count 0x0013 100 100 097 Pre-fail Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
or doing the whole file in one single shot :
awk NF=NF FS=' 1043252782\n' OFS=' 0\n' RS='^$' ORS=
awk NF=NF FS=' 1043252782\n' OFS=' 0\n' RS= -- (This might work too but I'm not too well versed in any side effects for blank RS)

How to sum the column with matching characters in a specific location?

I would like to add up the du output for all subfolders that share the same characters in a certain part of the path.
I have tried (example)
du -s /aa/bb/cc/*/ | sort -k2.11,2.14
where I got the output sorted
2000 /aa/bb/cc/1234/
1000 /aa/bb/dd/1234/
2000 /aa/bb/ff/1234/
2000 /aa/bb/cc/5678/
2000 /aa/bb/dd/5678/
3000 /aa/bb/ee/5678/
1000 /aa/bb/gg/5678/
Now I would like to add up all the ones with 1234 and all the ones with 5678.
Expected result
5000 -- 1234
8000 -- 5678
You can use awk to sum the first field into an array a, keyed on the second-to-last path component:
du -s /aa/bb/cc/*/ | sort -k2.11,2.14 |awk -F'/' '{a[$(NF-1)]+=$1}END{for(i in a) print a[i],i}'
8000 5678
5000 1234
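If you also want the output in the exact format and order shown in the question, a small variation of the same idea (a sketch; it assumes, as in the example, that the group key is always the last path component, i.e. the 1234/5678 part):
du -s /aa/bb/cc/*/ | sort -k2.11,2.14 |
awk -F'/' '{sum[$(NF-1)] += $1} END{for (d in sum) printf "%d -- %s\n", sum[d], d}' |
sort -k3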

Bash- Sort a file based on a list in another file [closed]

I know that similar questions have been asked about sorting a file by a specific column, but none of them seem to answer my question.
My Input file looks like
OHJ07_1_contig_10 0 500 130 500 500 1.0000000
OHJ07_1_contig_10 500 1000 180 500 500 1.0000000
OHJ07_1_contig_10 1000 1500 171 500 500 1.0000000
OHJ07_1_contig_10 1500 2000 79 380 500 0.7600000
OHJ07_1_contig_10 2000 2500 62 500 500 1.0000000
OHJ07_1_contig_10 2500 3000 96 500 500 1.0000000
OHJ07_1_contig_10 3000 3500 76 500 500 1.0000000
OHJ07_1_contig_10 3500 4000 87 500 500 1.0000000
OHJ07_1_contig_10 4000 4500 60 500 500 1.0000000
OHJ07_1_contig_10 4500 5000 64 500 500 1.0000000
OHJ07_1_contig_10 5000 5468 213 468 468 1.0000000
OHJ07_1_contig_100 0 500 459 500 500 1.0000000
OHJ07_1_contig_100 500 1000 156 500 500 1.0000000
OHJ07_1_contig_100 1000 1314 77 305 314 0.9713376
OHJ07_1_contig_1000 0 500 239 500 500 1.0000000
OHJ07_1_contig_1000 500 1000 226 500 500 1.0000000
OHJ07_1_contig_1000 1000 1500 238 500 500 1.0000000
OHJ07_1_contig_1000 1500 2000 263 500 500 1.0000000
The program that generated it sorted alphanumerically based on the name in the first column, but I would like to sort it based on a list of names in another file and keep all the other data. The other file has other information, like the contig length in column 2 (it was produced with samtools faidx).
OHJ07_1_contig_25270 888266 96530655 60 61
OHJ07_1_contig_36751 583964 120924448 60 61
OHJ07_1_contig_44057 504884 134192571 60 61
OHJ07_1_contig_21721 415942 87354744 60 61
OHJ07_1_contig_46339 411691 143341916 60 61
OHJ07_1_contig_44022 330441 133783765 60 61
Since each name has a different number of entries in the first file, what's the easiest way to deal with this? Preferably using bash
I haven't tried anything because I have literally no way to tackle this.
I would prepend each line of the file that determines the order (from now on named index) with its line number. There is a way to do this using awk; I used the answer written here https://superuser.com/questions/10201/how-can-i-prepend-a-line-number-and-tab-to-each-line-of-a-text-file (assuming your index file is named index and your data file is named data.txt):
awk '{printf "%d,%s\n", NR, $0}' < index > index-numbered
In this way index-numbered gives you a correspondence between the arbitrary name order you decided on and line numbers.
You can then use a while loop over the data file that prepends to each line the matching index line number and a comma (keeping the name), for example:
57,OHJ07_1_contig_46339 411691 143341916 60 61
In this way you will be able to sort on the first field, the number, which translates your arbitrary order into a numeric order.
The while loop that creates a new data file with the numbers prepended as above:
while IFS= read -r line
do
    # first whitespace-separated field of the data line is the contig name
    key=$(echo "$line" | awk '{print $1}')
    # look the name up in index-numbered; anchor on ",name<space>" so contig_10 does not also match contig_100
    n=$(grep ",${key}[[:space:]]" index-numbered | cut -d, -f1)
    echo "$n,$line" >> indexed-data.txt
done < data.txt
Then you can simply sort your modified data file (indexed-data.txt) with sort, using the inserted line number as the sort key:
sort -k1 -n -t, indexed-data.txt >sorted-data.txt
If you want to hide the line numbers in the final output, you can cut them off by modifying the preceding command like this:
sort -k1 -n -t, indexed-data.txt | cut -d, -f2 > sorted-data.txt
Your final output will be in the file sorted-data.txt.
I'm sure this is not the best solution, maybe others can answer better than me.
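For what it's worth, here is a more compact sketch that avoids running grep once per data line (it assumes the contig name is the first field of both files and contains no whitespace; data lines whose name is missing from index would sort to the top):
awk 'NR==FNR {order[$1] = NR; next} {print order[$1], $0}' index data.txt |
sort -s -n -k1,1 |
cut -d' ' -f2- > sorted-data.txt
The first pass over index records each name's position, the second pass prepends that position to every data line; a stable numeric sort then puts the lines in index order, and cut strips the helper number off again.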

How can I redirect fixed lines to a new file with shell

I know we can use > to redirect output to a file, but now I want to write a fixed number of lines to each of several files.
For example,
more something outputs 3210 lines, and then I want
line 1~1000 in file1
line 1001~2000 in file2
line 2001~3000 in file3
line 3001~3210 in file4.
How can I do it with a shell script?
Thx.
The split command is what you need.
split -l 1000 your_file.txt "prefix"
Where:
-l - split by number of lines.
1000 - The number of lines per output file.
your_file.txt - The file you want to split.
prefix - A prefix to the output files' names.
Example for a file of 3210 lines:
# Generate the file
$ seq 3210 > your_file.txt
# Split the file
$ split -l 1000 your_file.txt "prefix"
# Check the output files' names
$ ls prefix*
prefixaa prefixab prefixac prefixad
# Check all files' ending
$ tail prefixa*
==> prefixaa <==
991
992
993
994
995
996
997
998
999
1000
==> prefixab <==
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
==> prefixac <==
2991
2992
2993
2994
2995
2996
2997
2998
2999
3000
==> prefixad <==
3201
3202
3203
3204
3205
3206
3207
3208
3209
3210
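If you want the output files named file1 ... file4 exactly as described in the question, rather than split's default prefix names, an awk sketch can do the same job (assuming the input file is called something, as in the example):
awk 'NR <= 1000 {print > "file1"; next}
     NR <= 2000 {print > "file2"; next}
     NR <= 3000 {print > "file3"; next}
                {print > "file4"}' something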
