Opening a file with an accented character in its name, in Python 2 on Windows

In a directory in Windows I have 2 files, each with an accented character in its name: t1û.fn and t2ű.fn. The dir command in the Command Prompt shows both correctly:
S:\p>dir t*.fn
Volume in drive S is q
Volume Serial Number is 05A0-8823
Directory of S:\p
2017-09-03 14:54 4 t1û.fn
2017-09-03 14:54 4 t2ű.fn
2 File(s) 8 bytes
0 Dir(s) 19,110,621,184 bytes free
However, Python can't see both files:
S:\p>python -c "import os; print [(fn, os.path.isfile(fn)) for fn in os.listdir('.') if fn.endswith('.fn')]"
[('t1\xfb.fn', True), ('t2u.fn', False)]
It looks like Python 2 uses a single-byte API for filenames, thus the accented character in t1û.fn is mapped to the single byte \xfb, and the accented character in t2ű.fn is mapped to the unaccented ASCII single byte u.
How is it possible to use a multi-byte API for filenames on Windows in Python 2? I want to open both files in the console version of Python 2 on Windows.

Use a unicode string:
f1 = open(u"t1\u00fb.fn") # t1û.fn
f2 = open(u"t2\u0171.fn") # t2ű.fn
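The same idea applies to listing the directory: in Python 2, passing a unicode path to os.listdir makes it return unicode names, which can then be passed to os.path.isfile and open. The following is a minimal sketch, not a definitive recipe; the helper name list_fn_files and the scratch directory are hypothetical, and the code is written to run under Python 3 as well, where str is already unicode.

```python
# -*- coding: utf-8 -*-
import io
import os
import sys
import tempfile


def list_fn_files(directory):
    """Return (name, isfile) pairs for *.fn entries, using unicode names."""
    # In Python 2 a byte-string path makes os.listdir return byte-string
    # names; decoding the path first makes it return unicode names instead.
    if isinstance(directory, bytes):
        directory = directory.decode(sys.getfilesystemencoding())
    names = sorted(os.listdir(directory))
    return [(name, os.path.isfile(os.path.join(directory, name)))
            for name in names if name.endswith(u".fn")]


# Hypothetical scratch directory holding the two files from the question.
workdir = tempfile.mkdtemp()
for name in (u"t1\u00fb.fn", u"t2\u0171.fn"):
    with io.open(os.path.join(workdir, name), "w", encoding="utf-8") as f:
        f.write(u"data")

result = list_fn_files(workdir)
```

For the one-liner in the question, the corresponding change would be os.listdir(u'.') instead of os.listdir('.').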

Related

How to create a file name with UTF-8 characters in Cygwin [closed]

Using a shell script run under Cygwin, I want to create a file name which contains Danish characters (Ø, Æ, and Å). I have a bash script which basically does this: echo "some data" > "file name with Danish letters.txt". After running such a script, every Danish letter appears as a dot in the file name. I have tested this using Cygwin 32 under Windows 7 and Cygwin 64 under Windows 10. The locale command produces the following:
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=
Running:
echo "rødgrød" | od -ctx1
produces this:
0000000   r 303 270   d   g   r 303 270   d  \n
         72  c3  b8  64  67  72  c3  b8  64  0a
0000012
Here is an example of what the bash script looks like:
echo "some data" > "Peter Sørensen.txt"
The letter ø looks like a dot when I look at the created file name in Windows.
In cygwin, running the ls command results in this message:
ls: cannot compare file names ‘Peter S\370rensen.txt’ and ‘test.sh’: Invalid or incomplete multibyte or wide character
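The byte values in the two outputs above can be compared directly. The sketch below shows that the pair c3 b8 (octal 303 270) printed by od is the UTF-8 encoding of ø, while the single byte \370 in the ls message is its ISO-8859-1 (Latin-1) encoding. That would be consistent with the script file itself being saved in a single-byte encoding, but this is an assumption; the question does not confirm it.

```python
letter = u"\u00f8"  # the letter ø

utf8_bytes = letter.encode("utf-8")      # two bytes in UTF-8
latin1_bytes = letter.encode("latin-1")  # one byte in ISO-8859-1

utf8_hex = utf8_bytes.hex()
utf8_octal = [format(b, "o") for b in utf8_bytes]
latin1_octal = format(latin1_bytes[0], "o")

# A lone Latin-1 byte is not valid UTF-8, which matches the kind of
# "invalid or incomplete multibyte" complaint that ls reports.
try:
    latin1_bytes.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
```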

Copying file in Windows 10 changes its size

I copied a large file to a new directory in Windows 10 by dragging the file from Explorer to a folder in Eclipse. The file size of the copied file changed even though fc shows the original and new files as identical. The original file has a size of 209,715,200 bytes (200 MiB):
c:\>dir c:\Users\GeoffAlexander\Documents\Python\200MiB.txt
Volume in drive C is Windows
Volume Serial Number is 0447-709A
Directory of c:\Users\GeoffAlexander\Documents\Python
08/13/2019 09:42 AM 209,715,200 200MiB.txt
1 File(s) 209,715,200 bytes
0 Dir(s) 268,331,835,392 bytes free
The new file has a size of 211,812,352 bytes:
c:\>dir c:\Users\GeoffAlexander\Desktop\200MiB.txt
Volume in drive C is Windows
Volume Serial Number is 0447-709A
Directory of c:\Users\GeoffAlexander\Desktop
08/15/2019 09:11 AM 211,812,352 200MiB.txt
1 File(s) 211,812,352 bytes
0 Dir(s) 268,232,798,208 bytes free
The fc command shows the files as being identical:
c:\>fc c:\Users\GeoffAlexander\Documents\Python\200MiB.txt c:\Users\GeoffAlexander\Desktop\200MiB.txt
Comparing files C:\USERS\GEOFFALEXANDER\DOCUMENTS\PYTHON\200MiB.txt and C:\USERS\GEOFFALEXANDER\DESKTOP\200MIB.TXT
FC: no differences encountered
Why does the copied file get a new size? How can two files with different sizes be identical? Is Windows 10 incorrectly reporting the size of the new file?
I'm running Windows 10 Enterprise Build 1809 (OS Build 17763.615) if that makes any difference.
It turns out the file size change wasn't due to the copying of the file. Rather the file size change occurred when checking in the file to RTC (Rational Team Concert). The RTC check in was converting existing LF line delimiters into CRLF line delimiters (Windows line delimiters). See RTC File content types and line delimiters for details.
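The sizes in the two dir listings fit this explanation. Each LF that becomes CRLF adds exactly one byte, so the size difference should equal the number of converted line endings. The following sketch works through that arithmetic and demonstrates the growth on a small sample; the helper lf_to_crlf is hypothetical and is not how RTC implements the conversion.

```python
original_size = 209715200    # bytes, from the first dir listing
checked_in_size = 211812352  # bytes, from the second dir listing

# One extra CR per line ending, so this is also the implied line count,
# assuming every extra byte is an added CR.
extra_bytes = checked_in_size - original_size
extra_mib = extra_bytes / (1024 * 1024)
bytes_per_line = original_size // extra_bytes  # average, including the LF


def lf_to_crlf(data):
    """Convert LF line endings to CRLF without doubling existing CRLFs."""
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


sample = b"line one\nline two\nline three\n"
converted = lf_to_crlf(sample)
growth = len(converted) - len(sample)  # one byte per line ending
```

Under that assumption the original file would contain 2,097,152 lines of 100 bytes each.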

Disassemble ELF file - debugging area where specific string of binary is loaded

I would like to disassemble / debug an ELF file. Is it somehow possible to track down the function where a specific string in the ELF file is used?
That is, I have a string that I know the program uses when searching a file. Is it possible, e.g. with gdb, to debug exactly that position in the executable?
Or is the position of the string in the ELF file somehow visible in the objdump -d output?
To do that you need a disassembler. objdump only dumps the information, which may not be enough, because some analysis is needed before you can tell where the string is used. What you need are the XREFs (cross-references) for the string you have in mind.
If you open your binary in the disassembler it will probably have the ability to show you strings that are present in the binary with the ability to jump to the place where the string is being used (it might be multiple places).
I'll showcase this using radare2.
Open the binary (I'll use ls here)
r2 -A /bin/ls
and then
iz
to display all the strings. There's a lot of them so here's an extract
000 0x00004af1 0x100004af1 7 8 (4.__TEXT.__cstring) ascii COLUMNS
001 0x00004af9 0x100004af9 39 40 (4.__TEXT.__cstring) ascii 1#ABCFGHLOPRSTUWabcdefghiklmnopqrstuvwx
002 0x00004b21 0x100004b21 6 7 (4.__TEXT.__cstring) ascii bin/ls
003 0x00004b28 0x100004b28 8 9 (4.__TEXT.__cstring) ascii Unix2003
004 0x00004b31 0x100004b31 8 9 (4.__TEXT.__cstring) ascii CLICOLOR
005 0x00004b3a 0x100004b3a 14 15 (4.__TEXT.__cstring) ascii CLICOLOR_FORCE
006 0x00004b49 0x100004b49 4 5 (4.__TEXT.__cstring) ascii TERM
007 0x00004b60 0x100004b60 8 9 (4.__TEXT.__cstring) ascii LSCOLORS
008 0x00004b69 0x100004b69 8 9 (4.__TEXT.__cstring) ascii fts_open
009 0x00004b72 0x100004b72 28 29 (4.__TEXT.__cstring) ascii %s: directory causes a cycle
Let's see where this last one is used. If we move to the location where it is defined, 0x100004b72, we can see this:
;-- str.s:_directory_causes_a_cycle:
; DATA XREF from 0x100001cbe (sub.fts_open_INODE64_b44 + 378)
And here we see where it's being referenced -> DATA XREF. We can move there (s 0x100001cbe) and there we see how it's being used.
⁝ 0x100001cbe 488d3dad2e00. lea rdi, str.s:_directory_causes_a_cycle ; 0x100004b72 ; "%s: directory causes a cycle"
⁝ 0x100001cc5 4c89ee mov rsi, r13
⁝ 0x100001cc8 e817290000 call sym.imp.warnx ;[1]
Having the location you can put a breakpoint there (r2 is also a debugger) or use it in gdb.
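The cross-reference exists because the lea instruction addresses the string relative to the instruction pointer: the bytes 48 8d 3d encode lea rdi, [rip + disp32], and the effective address is the address of the next instruction plus the displacement. The sketch below checks this against the addresses in the listing. It uses only the three displacement bytes that are visible there (ad 2e 00), since the listing truncates the last byte of the instruction.

```python
lea_address = 0x100001cbe     # address of the lea instruction
next_address = 0x100001cc5    # address of the following mov
string_address = 0x100004b72  # where the string is defined

# Length of the lea instruction, derived from the two addresses.
instruction_length = next_address - lea_address

# RIP-relative addressing: target = address of next instruction + disp32,
# so the displacement can be recovered from the two known addresses.
displacement = string_address - next_address

# The displacement bytes shown in the listing, stored little-endian.
visible_disp = int.from_bytes(bytes.fromhex("ad2e00"), "little")
```

A disassembler that resolves such displacements for every instruction can build the list of places that reference a given string, which is what the DATA XREF line reports.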

How can two 100% identical files have different sizes?

I have two 100% identical empty .sh shell script files on Mac:
encrypt.sh: 299 bytes
decrypt.sh: 13 bytes (Actually this size is correct, since I have 13 bytes: 11 character + two new line)
(Screenshots omitted: the contents and hexdump of encrypt.sh and decrypt.sh, and the file info window of each.)
They have the exact same hexdump, then how is it possible that they have different sizes?
The Mac OS X file system implements forks, so the larger file likely has something stored in its resource fork.
Use ls -l@ to get more details.

OS X screen command, '.screenrc', termcap

I need help in the conceptual area surrounding:
/usr/bin/screen,
~/.screenrc,
termcap
My Goal: to create a 'correctly' formatted log file via 'screen'.
Symptom: The log file contains hundreds of carriage-return bytes [i.e. (\015) or (\r) ]. I would like to replace every carriage-return byte with a linefeed byte [i.e. (\012) or (\n)].
My Approach: I have created the file: ~/.screenrc and added a 'termcap' line to it with the hope of intercepting the inbound bytes and translating the carriage-return bytes into linefeed bytes BEFORE they are written to the log file. I cycled through nine different syntactical forms of my request. None had the desired effect (see below for all nine forms).
My Questions:
Can my goal be accomplished with my approach?
If yes, what changes do I need to make to achieve my goal?
If no, what alternative should I implement?
Do I need to mix in the 'stty' command?
If yes, how?
Note: I can create a 'correctly' formatted file using the log file as input to 'tr':
$ /usr/bin/tr '\015' '\012' <screenlog.0 | head
<5 BAUD ADDRESS: FF>
<WAITING FOR 5 BAUD INIT>
<5 BAUD ADDRESS: 33>
<5 BAUD INIT: OK>
Rx: C233F1 01 00 # 254742 ms
Tx: 86F110 41 00 BE 1B 30 13 # 254753 ms
Tx: 86F118 41 00 88 18 00 10 # 254792 ms
Tx: 86F128 41 00 80 08 00 10 # 254831 ms
Rx: C133F0 3E # 255897 ms
Tx: 81F010 7E # 255903 ms
$
The 'screen' log file ( ~/screenlog.0 ) is created using the following command:
$ screen -L /dev/tty.usbserial-000014FA 115200
where:
$ ls -dl /dev/*usb*
crw-rw-rw- 1 root wheel 17, 25 Jul 21 19:50 /dev/cu.usbserial-000014FA
crw-rw-rw- 1 root wheel 17, 24 Jul 21 19:50 /dev/tty.usbserial-000014FA
$
$
$ ls -dl ~/.screenrc
-rw-r--r-- 1 scottsmith staff 684 Jul 22 12:28 /Users/scottsmith/.screenrc
$ cat ~/.screenrc
#termcap xterm* 'XC=B%,\015\012' # 01 no effect
#termcap xterm* 'XC=B%\E(B,\015\012' # 02 no effect
#termcap xterm* 'XC=B\E(%\E(B,\015\012' # 03 no effect
#terminfo xterm* 'XC=B%,\015\012' # 04 no effect
#terminfo xterm* 'XC=B%\E(B,\015\012' # 05 no effect
#terminfo xterm* 'XC=B\E(%\E(B,\015\012' # 06 no effect
#termcapinfo xterm* 'XC=B%,\015\012' # 07 no effect
#termcapinfo xterm* 'XC=B%\E(B,\015\012' # 08 no effect
termcapinfo xterm* 'XC=B\E(%\E(B,\015\012' # 09 no effect
$
$ echo $TERM
xterm-256color
$ echo $SCREENRC
$ ls -dl /usr/lib/terminfo/?/*
ls: /usr/lib/terminfo/?/*: No such file or directory
$ ls -dl /usr/lib/terminfo/*
ls: /usr/lib/terminfo/*: No such file or directory
$ ls -dl /etc/termcap
ls: /etc/termcap: No such file or directory
$ ls -dl /usr/local/etc/screenrc
ls: /usr/local/etc/screenrc: No such file or directory
$
System:
MacBook Pro (17-inch, Mid 2010)
Processor 2.53 GHz Intel Core i5
Memory 8 GB 1067 MHz DDR3
Graphics NVIDIA GeForce GT 330M 512 MB
OS X Yosemite Version 10.10.4
Screen(1) Mac OS X Manual Page: ( possible relevant content ):
CHARACTER TRANSLATION
Screen has a powerful mechanism to translate characters to arbitrary strings depending on the current font and terminal type. Use this feature if you want to work with a common standard character set (say ISO8859-latin1) even on terminals that scatter the more unusual characters over several national language font pages.
Syntax: XC=<charset-mapping>{,,<charset-mapping>}
<charset-mapping> := <designator><template>{,<mapping>}
<mapping> := <char-to-be-mapped><template-arg>
The things in braces may be repeated any number of times.
A <charset-mapping> tells screen how to map characters in font <designator> ('B': Ascii, 'A': UK, 'K': german, etc.) to strings. Every <mapping> describes to what string a single character will be translated. A template mechanism is used, as most of the time the codes have a lot in common (for example strings to switch to and from another charset). Each occurrence of '%' in <template> gets substituted with the <template-arg> specified together with the character. If your strings are not similar at all, then use '%' as a template and place the full string in <template-arg>. A quoting mechanism was added to make it possible to use a real '%'. The '\' character quotes the special characters '\', '%', and ','.
Here is an example:
termcap hp700 'XC=B\E(K%\E(B,\304[,\326\\,\334]'
This tells screen how to translate ISOlatin1 (charset 'B') upper case umlaut characters on a hp700 terminal that has a german charset. '\304' gets translated to '\E(K[\E(B' and so on. Note that this line gets parsed three times before the internal lookup table is built, therefore a lot of quoting is needed to create a single '\'.
Another extension was added to allow more emulation: If a mapping translates the unquoted '%' char, it will be sent to the terminal whenever screen switches to the corresponding <designator>. In this special case the template is assumed to be just '%' because the charset switch sequence and the character mappings normally haven't much in common.
This example shows one use of the extension:
termcap xterm 'XC=K%,%\E(B,[\304,\\\326,]\334'
Here, a part of the german ('K') charset is emulated on an xterm. If screen has to change to the 'K' charset, '\E(B' will be sent to the terminal, i.e. the ASCII charset is used instead. The template is just '%', so the mapping is straightforward: '[' to '\304', '\' to '\326', and ']' to '\334'.
The section on character translation describes a feature which is unrelated to logging. It tells screen how to use ISO-2022 control sequences to print special characters on the terminal. In the manual page's example
termcap xterm 'XC=K%,%\E(B,[\304,\\\\\326,]\334'
this tells screen to send escape(B (to pretend it is switching the terminal to character-set "K") when it has to print any of [, \ or ]. Offhand (referring to XTerm Control Sequences) the reasoning in the example seems obscure:
xterm handles character set "K" (German)
character set "B" is US-ASCII
assuming that character set "B" is actually rendered as ISO-8859-1, those three characters are Ä, Ö and Ü (which is a plausible use of German, to print some common umlauts).
Rather than being handled by this feature, screen's logging is expected to record the original characters sent to the terminal — before translation.
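Since the log holds the original bytes, one alternative is to post-process the log file, as the tr command in the question already does. The following is a minimal Python sketch of the same translation; the function names cr_to_lf and convert_log are hypothetical, and the paths passed to convert_log are whatever log and output files you choose.

```python
def cr_to_lf(data):
    """Replace every carriage return (\\015) with a linefeed (\\012)."""
    return data.replace(b"\r", b"\n")


def convert_log(src_path, dst_path):
    """Read a raw screen log and write a copy with CR translated to LF."""
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        dst.write(cr_to_lf(src.read()))


# Small sample shaped like the log lines shown in the question.
sample = b"<5 BAUD ADDRESS: FF>\r<WAITING FOR 5 BAUD INIT>\r"
cleaned = cr_to_lf(sample)
```

For example, convert_log("screenlog.0", "screenlog.txt") would write a translated copy next to the log created by screen -L.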
