computer basics problem - ascii

Hi everybody, can anyone tell me the answer to this question?
I created a simple txt file. It contains only two words, and the words are hello word. From what I have studied, a computer uses ASCII code to store text on disk or in memory. In ASCII, each letter or symbol is represented by one byte; in simple words, one byte is used to store one symbol.
Now the problem is this: whenever I look at the size of the file, it shows 11 bytes. I understand 9 bytes for the words and one byte for the space, which makes a total of 10, so why is it showing a size of 11 bytes? I tried different things, such as changing the name of the file and saving it with the shortest or longest name possible, but the total size did not change.
So can anybody explain why this is happening? I tried this on Windows and Linux (Ubuntu, CentOS) systems and the result is the same.

pax> echo hello word >outfile.txt
pax> ls -al outfile.txt
-rw-r--r-- 1 pax pax 11 2010-11-19 15:34 outfile.txt
pax> od -xcb outfile.txt
0000000 6568 6c6c 206f 6f77 6472 000a
h e l l o w o r d \n
150 145 154 154 157 040 167 157 162 144 012
pax> hd outfile.txt
00000000 68 65 6c 6c 6f 20 77 6f 72 64 0a |hello word.|
0000000b
As per above, you're storing "hello word" and the newline character. That's 11 characters in total. If you don't want the newline, you can use something like the -n option of echo (which doesn't add the newline):
pax> echo -n hello word >outfile.txt
pax> ls -al outfile.txt
-rw-r--r-- 1 pax pax 10 2010-11-19 15:36 outfile.txt
pax> od -xcb outfile.txt
0000000 6568 6c6c 206f 6f77 6472
h e l l o w o r d
150 145 154 154 157 040 167 157 162 144
pax> hd outfile.txt
00000000 68 65 6c 6c 6f 20 77 6f 72 64 |hello word|
0000000a
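The byte counts above can also be checked without creating a file, by piping the text into wc -c. This is only a minimal sketch, assuming a POSIX shell; the variable names are made up for illustration.

```shell
# printf writes exactly the bytes it is given; "\n" adds the newline explicitly.
# tr -d ' ' removes the padding some wc implementations put before the number.
with_newline=$(printf 'hello word\n' | wc -c | tr -d ' ')
without_newline=$(printf 'hello word' | wc -c | tr -d ' ')

echo "with newline:    $with_newline bytes"
echo "without newline: $without_newline bytes"
```

The one-byte difference between the two counts is the newline character.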

If you want to see the contents of the file, you can perform an octal dump of it using the "od" command under Linux ("od "). Most probably what you will see at the end is an LF (linefeed), possibly preceded by a CR (carriage return) if the file was saved with DOS/Windows line endings.
The name of the file has nothing to do with its size.
Luis

Did you add a new line to the text file (\n)? Just because this character cannot be seen does not mean it is not there.

Related

tcsh if/then statement gives error

I'm trying to do a simple tcsh script to look for a folder, then navigate to it if it exists. The statement evaluates properly, but if it evaluates false, I get an error "then: then/endif not found". If it evaluates true, no problem. Where am I going wrong?
#!/bin/tcsh
set icmanagedir = ""
set workspace = `find -maxdepth 1 -name "*$user*" | sort -r | head -n1`
if ($icmanagedir != "" && $workspace != "") then
setenv WORKSPACE_DIR `readlink -f $workspace`
echo "Navigating to workspace" $WORKSPACE_DIR
cd $WORKSPACE_DIR
endif
($icmanagedir is initialized elsewhere, but I get the error regardless of which variable is empty)
The problem is that tcsh needs to have every line end in a newline, including the last line; it uses the newline as the "line termination character", and if it's missing it errors out.
You can use a hex editor/viewer to check if the file ends with a newline:
$ hexdump -C x.tcsh
00000000 69 66 20 28 22 78 22 20 3d 20 22 78 22 29 20 74 |if ("x" = "x") t|
00000010 68 65 6e 0a 09 65 63 68 6f 20 78 0a 65 6e 64 69 |hen..echo x.endi|
00000020 66 |f|
Here the last character is f (0x66), not a newline. A correct file has 0x0a as the last character (represented by a .):
$ hexdump -C x.tcsh
00000000 69 66 20 28 22 78 22 20 3d 20 22 78 22 29 20 74 |if ("x" = "x") t|
00000010 68 65 6e 0a 09 65 63 68 6f 20 78 0a 65 6e 64 69 |hen..echo x.endi|
00000020 66 0a |f.|
Ending the last line in a file with a newline is a common UNIX idiom, and some shell tools expect this. See "What's the point in adding a new line to the end of a file?" for some more info on this.
Most UNIX editors (such as Vim, Nano, Emacs, etc.) do this by default. Some editors and IDEs don't, but almost all of them have a setting through which it can be enabled.
The best solution is to enable this setting in your editor. If you can't do this then adding a blank line at the end also solves your problem.
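If changing the editor setting is not an option, the check and the fix can also be done from a shell. The following is a rough sketch, not a definitive recipe: the sample file is created with mktemp purely for demonstration, and the variable name script is an assumption.

```shell
# Create a sample script whose last line lacks a newline (demonstration only).
script=$(mktemp)
printf 'if ("x" == "x") then\n\techo x\nendif' > "$script"

# $(...) strips trailing newlines, so the result is non-empty only when
# the last byte of the file is something other than a newline.
if [ -n "$(tail -c 1 "$script")" ]; then
    echo >> "$script"   # append the missing newline
fi

# Show the last byte of the file as a character.
tail -c 1 "$script" | od -An -c
```

Running the same check on a file that already ends in a newline leaves it unchanged.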

Grep from file fails but grep with individual lines from the file works

I am trying to extract lines from file genome.gff that contain a line from file suspicious.txt. suspicious.txt was derived from genome.gff and every line should match.
Using grep on a single line from suspicious.txt works as expected:
grep 'gene10002' genome.gff
NC_007082.3 Gnomon gene 1269632 1273520 . + . ID=gene10002;Dbxref=BEEBASE:GB54789,GeneID:409846;Name=bur;gbkey=Gene;gene=bur;gene_biotype=protein_coding
NC_007082.3 Gnomon mRNA 1269632 1273520 . + . ID=rna21310;Parent=gene10002;Dbxref=GeneID:409846,Genbank:XM_393336.5,BEEBASE:GB54789;Name=XM_393336.5;gbkey=mRNA;gene=bur;product=burgundy;transcript_id=XM_393336.5
But every variation on using grep from a file that I've been able to think of or find online produces no output or an empty file:
grep -f suspicious.txt genome.gff
grep -F -f suspicious.txt genome.gff
while read line; do grep "$line" genome.gff; done<suspicious.txt
while read line; do grep '$line' genome.gff; done<suspicious.txt
while read line; do grep "${line}" genome.gff; done<suspicious.txt
cat suspicious.txt | while read line; do grep '$line' genome.gff; done
cat suspicious.txt | while read line; do grep '$line' genome.gff >> suspicious.gff; done
cat suspicious.txt | while read line; do grep -e "${line}" genome.gff >> suspicious.gff; done
cat "$(cat suspicious_bee_geneIDs_test.txt)" | while read line; do grep -e "${line}" genome.gff >> suspicious.gff; done
Running it as a script also produces an empty file:
#!/bin/bash
SUSP=$1
GFF=$2
while read -r line; do
grep -e "${line}" $GFF >> suspicious_bee_genes.gff
done<$SUSP
This is what the files look like:
head genome.gff
##gff-version 3
#!gff-spec-version 1.21
#!processor NCBI annotwriter
#!genome-build Amel_4.5
#!genome-build-accession NCBI_Assembly:GCF_000002195.4
##sequence-region NC_007070.3 1 29893408
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=7460
NC_007070.3 RefSeq region 1 29893408 . + . ID=id0;Dbxref=taxon:7460;Name=LG1;gbkey=Src;genome=chromosome;linkage-group=LG1;mol_type=genomic DNA;strain=DH4
NC_007070.3 Gnomon gene 181 211962 . - . ID=gene0;Dbxref=BEEBASE:GB42164,GeneID:726912;Name=cort;gbkey=Gene;gene=cort;gene_biotype=protein_coding
NC_007070.3 Gnomon mRNA 181 71559 . - . ID=rna0;Parent=gene0;Dbxref=GeneID:726912,Genbank:XM_006557348.1,BEEBASE:GB42164;Name=XM_006557348.1;gbkey=mRNA;gene=cort;product=cortex%2C transcript variant X2;transcript_id=XM_006557348.1
wc -l genome.gff
457742
head suspicious.txt
gene10002
gene1001
gene1003
gene10038
gene10048
gene10088
gene10132
gene10134
gene10181
gene10209
wc -l suspicious.txt
928
Does anyone know what's going wrong here?
This can happen when the input file is in DOS format: each line will have a trailing CR character at the end, which will break the matching.
One way to check if this is the case is using hexdump, for example (just the first few lines):
$ hexdump -C suspicious.txt
00000000 67 65 6e 65 31 30 30 30 32 0d 0a 67 65 6e 65 31 |gene10002..gene1|
00000010 30 30 31 0d 0a 67 65 6e 65 31 30 30 33 0d 0a 67 |001..gene1003..g|
00000020 65 6e 65 31 30 30 33 38 0d 0a 67 65 6e 65 31 30 |ene10038..gene10|
In the ASCII representation at the right, notice the .. after each gene. These dots correspond to 0d and 0a. The 0d is the CR character.
Without the CR character, the output should look like this:
$ hexdump -C <(tr -d '\r' < suspicious.txt)
00000000 67 65 6e 65 31 30 30 30 32 0a 67 65 6e 65 31 30 |gene10002.gene10|
00000010 30 31 0a 67 65 6e 65 31 30 30 33 0a 67 65 6e 65 |01.gene1003.gene|
00000020 31 30 30 33 38 0a 67 65 6e 65 31 30 30 34 38 0a |10038.gene10048.|
Just one . after each gene, corresponding to 0a, and no 0d.
Another way to see the DOS line endings is in the vi editor. If you open the file with vi, the status line will show [dos], or you can run the ex command :set ff? to make it tell you the file format (the status line will say fileformat=dos).
You can remove the CR characters on the fly like this:
grep -f <(tr -d '\r' < suspicious.txt) genome.gff
Or you could remove them in vi by running the ex command :set ff=unix and then saving the file. There are other command-line tools, too, that can remove DOS line endings.
Another possibility is that instead of a trailing CR character, you might have trailing whitespace. The output of hexdump -C should make that perfectly clear. After the trailing whitespace characters are removed, the grep -f should work as expected.
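The effect of the trailing CR can be reproduced on a small scale. The sketch below is only an illustration under stated assumptions: the two files are tiny, made-up stand-ins for suspicious.txt and genome.gff, created in a temporary directory.

```shell
# Build small sample files (hypothetical stand-ins for the real data).
workdir=$(mktemp -d)
printf 'gene10002\r\ngene1001\r\n' > "$workdir/suspicious.txt"   # DOS line endings
printf 'x\tID=gene10002;foo\ny\tID=gene99;bar\n' > "$workdir/genome.gff"

# Count the pattern lines that contain a carriage return.
cr=$(printf '\r')
cr_lines=$(grep -c "$cr" "$workdir/suspicious.txt" || true)

# With the raw pattern file nothing matches, because every pattern ends in CR.
# (grep -c prints 0 but exits non-zero on no match, hence the "|| true".)
raw_hits=$(grep -c -F -f "$workdir/suspicious.txt" "$workdir/genome.gff" || true)

# After stripping the CRs the same patterns match as expected.
tr -d '\r' < "$workdir/suspicious.txt" > "$workdir/suspicious.unix.txt"
clean_hits=$(grep -c -F -f "$workdir/suspicious.unix.txt" "$workdir/genome.gff" || true)

echo "lines with CR: $cr_lines, hits before: $raw_hits, hits after: $clean_hits"
```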

How to add a NUL character separator using AWK in bash? [duplicate]

This question already has answers here:
How can I output null-terminated strings in Awk?
I have a function which outputs some file paths. I need these paths to be separated by a NUL character instead of a newline (\n) character. I tried the following code:
function myfunc
{
declare -a DUPS
# some commands to fill DUPS with appropriate file/folder paths
( for i in "${DUPS[@]}"; do echo "$i"; done )|sort|uniq|awk 'BEGIN{ORS="\x00";} {print substr($0, index($0, $2))}'
}
But if I pipe its output to hexdump or hd, no NUL character is displayed. It seems that the NUL character is not included in the awk output:
myfunc | hd
Will print:
00000000 2f 70 61 74 68 2f 6e 75 6d 62 65 72 2f 6f 6e 65 |/path/number/one|
00000010 2f 2f 70 61 74 68 2f 6e 75 6d 62 65 72 2f 74 77 |//path/number/tw|
00000020 6f 2f 2f 70 61 74 68 2f 6e 75 6d 62 65 72 2f 74 |o//path/number/t|
00000030 68 72 65 65 2f |hree/|
00000035
My awk version is:
~$ awk -W version
mawk 1.3.3 Nov 1996, Copyright (C) Michael D. Brennan
compiled limits:
max NF 32767
sprintf buffer 2040
Also any solution with other commands such as sed is acceptable for me.
NOTE: My question is not a duplicate of the linked question, because that one asks for a solution that works on different machines with different awks. I just need a solution that works on my own machine, so I could use any version of awk that can be installed on Ubuntu 14.04.
GNU Awk v4.0.1 works just fine with your original program, but all the other awks I have kicking around (mawk, original-awk and busybox awk) produce the same NUL-free output as you seem to be experiencing. It appears that with those awks, using either print or printf to print a string with embedded NULs causes the NUL to be treated as a string terminator.
However, mawk and original-awk will output a real NUL if you use printf "%c", 0;. So if you are using one of those, you could set ORS to the empty string and add {printf "%c", 0;} to the end of your awk program. (You'd need other more invasive modifications if your awk program uses next).
I don't know any way to convince busybox awk to print a NUL byte, so if that is what you are using you might want to consider choosing a real awk.
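Since the question also accepts solutions that do not use awk, one option is to let printf and sort produce the NUL bytes. This is a sketch under assumptions: it relies on sort supporting the -z option (available in GNU and BSD sort, not required by POSIX), and the sample paths are passed as arguments only to keep the example short; in the original function one would pass "${DUPS[@]}" instead.

```shell
myfunc() {
    # printf '%s\0' emits each argument followed by a NUL byte;
    # sort -z reads and writes NUL-terminated records, -u drops duplicates.
    printf '%s\0' "$@" | sort -zu
}

# Hypothetical sample paths, including one duplicate.
nul_count=$(myfunc "/path/number/two/" "/path/number/one/" "/path/number/two/" \
    | tr -cd '\0' | wc -c | tr -d ' ')
first_path=$(myfunc "/path/number/two/" "/path/number/one/" "/path/number/two/" \
    | tr '\0' '\n' | head -n 1)

echo "NUL-terminated records: $nul_count, first: $first_path"
```

Piping myfunc into hd should then show a 00 byte after each path.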

bash script to sort

I have this file I created:
Kuala Lumpur 78 56
Seoul 86 66
Karachi 95 75
Tokyo 85 60
Lahore 85 75
Manila 90 85
On the command line I can sort it with no problem using sort -t with a tab as the delimiter, but now I'm trying to write a script to read this in and print out different sorts. If I read it into an array and tell it to split on the tab, the "Kuala Lumpur" line is thrown off, and then so is the sort. What do I do about that space? I don't want to take it out or replace it with a comma, but if I have to I will.
#!/bin/bash
# $'\t' is a literal tab character (bash ANSI-C quoting)
cat asiapac-temps | sort -t$'\t' -k 1,1d
echo ""
cat asiapac-temps | sort -t$'\t' -k 2,2n
echo ""
cat asiapac-temps | sort -t$'\t' -k 3
This is what I'm using now. I was trying to do this a different way so as not to run sort over and over.
The output is:
By city:
Karachi 95 75
Kuala Lumpur 78 56
Lahore 85 75
Manila 90 85
Seoul 86 66
Tokyo 85 60
by high temp (col2)
Kuala Lumpur 78 56
Lahore 85 75
Tokyo 85 60
Seoul 86 66
Manila 90 85
Karachi 95 75
by low temp (col3)
Kuala Lumpur 78 56
Tokyo 85 60
Seoul 86 66
Karachi 95 75
Lahore 85 75
Manila 90 85
Since feature requests to mark a comment as an answer remain declined, I copy the above solution here.
You can't sort anything once and output 3 different results. Any time you write a loop in shell you've probably got the wrong approach (shell is primarily an environment from which to call tools, not a programming language). Just calling sort each time you want to produce sorted output will almost certainly be simpler and more efficient than any approach you can come up with involving array indexing. – Ed Morton
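If your question is "how do I input the tab character from the command line", the answer is "you often don't need to" -- sort treats blanks (spaces and tabs) as field separators by default. That default also splits "Kuala Lumpur" into two fields, though, so for this file an explicit tab separator is still needed.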
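One way to pass a tab to sort without typing it literally is to capture it in a variable. The sketch below assumes a POSIX shell and uses a three-line sample file created with mktemp; the variable names are illustrative only.

```shell
# Sample data with a literal tab between columns (printf expands \t).
temps=$(mktemp)
printf 'Kuala Lumpur\t78\t56\nSeoul\t86\t66\nKarachi\t95\t75\n' > "$temps"

tab=$(printf '\t')   # a literal tab character, portable across shells

# With an explicit tab separator the space in "Kuala Lumpur" stays inside field 1.
by_city=$(sort -t "$tab" -k 1,1d "$temps")
by_high=$(sort -t "$tab" -k 2,2n "$temps")
by_low=$(sort -t "$tab" -k 3,3n "$temps")

printf 'By city:\n%s\n\nBy high temp:\n%s\n\nBy low temp:\n%s\n' \
    "$by_city" "$by_high" "$by_low"
```

This still calls sort once per ordering, in line with the comment above that one sort cannot produce three different results.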

How do you delete all lines that contain double quotes in sh?

I tried sed -ne '/\"/!p' theinput > theproduct but that got me nowhere. It didn't do anything. What can I try?
You don't need to escape the quote. Write:
sed '/"/d' theinput > theproduct
or
sed -i '/"/d' theinput
to alter the file directly.
In case you have other quotes, as @Jonathan Leffler suggests, you have to find out which ones. Then, using \x, you can achieve what you want. \x is used to specify hexadecimal values.
sed -i '/\x22/d' theinput
The line above would delete all rows in theinput containing the ordinary (ASCII 34) quote. You'll have to try the code points Jonathan suggested.
Try this:
grep -v '"' theinput > theproduct
The command you showed us should have worked.
$ cat theinput
foo"bar
foo.bar
$ sed -ne '/\"/!p' theinput > theproduct
$ cat theproduct
foo.bar
$
unless you're using csh or tcsh as your interactive shell. In that case, you'd need to escape the ! character, even within quotation marks:
% cat theinput
foo"bar
foo.bar
% sed -ne '/\"/!p' theinput > theproduct
sed -ne '/"/pwd' theinput > theproduct
sed: -e expression #1, char 5: extra characters after command
% rm theproduct
% sed -ne '/\"/\!p' theinput > theproduct
% cat theproduct
foo.bar
%
But that's inconsistent with your statement that "It didn't do anything", so it's not clear what's really going on (and the question is tagged bourne-shell anyway).
But there are much simpler ways to accomplish the same task, particularly the grep command suggested by @Mike Sokolov.
Are you sure you have 'ASCII' input? Could you have Unicode (UTF-8) with characters that are not ASCII 34, or Unicode U+0022, but something else?
Alternative Unicode 'double quotes' could be:
U+2033 DOUBLE PRIME;
U+201C LEFT DOUBLE QUOTATION MARK;
U+201D RIGHT DOUBLE QUOTATION MARK;
U+201F DOUBLE HIGH-REVERSED-9 QUOTATION MARK;
U+02DD DOUBLE ACUTE ACCENT;
(and there could easily be others I've left out).
You can look to debug this with the od command:
$ cat theinput
No double quote here
Double quote " here
Unicode pseudo-double-quotes include “”‟″˝.
$ od -c theinput
0000000 N o d o u b l e q u o t e
0000020 h e r e \n D o u b l e q u o t
0000040 e " h e r e \n U n i c o d e
0000060 p s e u d o - d o u b l e - q
0000100 u o t e s i n c l u d e “ **
0000120 ** ” ** ** ‟ ** ** ″ ** ** ˝ ** . \n
0000136
$ od -x theinput
0000000 6f4e 6420 756f 6c62 2065 7571 746f 2065
0000020 6568 6572 440a 756f 6c62 2065 7571 746f
0000040 2065 2022 6568 6572 550a 696e 6f63 6564
0000060 7020 6573 6475 2d6f 6f64 6275 656c 712d
0000100 6f75 6574 2073 6e69 6c63 6475 2065 80e2
0000120 e29c 9d80 80e2 e29f b380 9dcb 0a2e
0000136
$ odx theinput
0x0000: 4E 6F 20 64 6F 75 62 6C 65 20 71 75 6F 74 65 20 No double quote
0x0010: 68 65 72 65 0A 44 6F 75 62 6C 65 20 71 75 6F 74 here.Double quot
0x0020: 65 20 22 20 68 65 72 65 0A 55 6E 69 63 6F 64 65 e " here.Unicode
0x0030: 20 70 73 65 75 64 6F 2D 64 6F 75 62 6C 65 2D 71 pseudo-double-q
0x0040: 75 6F 74 65 73 20 69 6E 63 6C 75 64 65 20 E2 80 uotes include ..
0x0050: 9C E2 80 9D E2 80 9F E2 80 B3 CB 9D 2E 0A ..............
0x005E:
$ sed '/"/d' theinput > theproduct
$ cat theproduct
No double quote here
Unicode pseudo-double-quotes include “”‟″˝.
$
(odx is my own command for dumping data in hex.)
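Once od has shown which quote-like characters are actually present, they can be deleted together with the ASCII quote. The following is a sketch only: it assumes the script itself is saved as UTF-8, handles just U+201C and U+201D as examples, and uses a made-up sample file.

```shell
# Sample input: the second line contains U+0022, the third U+201C and U+201D.
input=$(mktemp)
printf '%s\n' 'No double quote here' 'Double quote " here' 'Curly “quotes” here' > "$input"

# Each -e expression deletes the lines containing one kind of quote.
# Separate expressions are used instead of a bracket expression so the
# multi-byte characters are matched as whole byte sequences in any locale.
result=$(sed -e '/"/d' -e '/“/d' -e '/”/d' "$input")

echo "$result"
```

Further -e expressions can be added for any other code points that turn up in the dump.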

Resources