Pandas read_csv internal memory usage

I have a 320 MB comma-separated (CSV) file.
To read it in, I use
pd.read_csv(loggerfile, header = 2)
I have 8 GB of RAM (5 GB are free), so how can this ever throw an error?
File "C:\Users\me\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.py", line 443, in parser_f
return _read(filepath_or_buffer, kwds)
File "C:\Users\me\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.py", line 235, in _read
return parser.read()
File "C:\Users\me\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.py", line 686, in read
ret = self._engine.read(nrows)
File "C:\Users\me\AppData\Local\Continuum\Anaconda\lib\site-packages\pandas\io\parsers.py", line 1130, in read
data = self._reader.read(nrows)
File "parser.pyx", line 727, in pandas.parser.TextReader.read (pandas\parser.c:7146)
File "parser.pyx", line 777, in pandas.parser.TextReader._read_low_memory (pandas\parser.c:7725)
File "parser.pyx", line 1788, in pandas.parser._concatenate_chunks (pandas\parser.c:21033)
MemoryError
EDIT:
Windows 7 Enterprise 64 Bit
Anaconda 2.0.1 x86 <- perhaps x86_64 would be better?
Still, the memory error occurs well before my memory cap is reached (as seen in Task Manager), even on a 32-bit machine with 3 GB of RAM.
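
The "x86" in the Anaconda build name suggests a 32-bit interpreter. A 32-bit process can only address a few GB no matter how much RAM is installed, which would be consistent with a MemoryError appearing long before Task Manager shows the machine as full. A minimal sketch to check which build is actually running (the function name is only illustrative):

```python
import struct
import sys


def interpreter_bits() -> int:
    """Return the pointer width of the running Python interpreter in bits."""
    return struct.calcsize("P") * 8


bits = interpreter_bits()
print(f"Python is a {bits}-bit build (sys.maxsize = {sys.maxsize})")
if bits == 32:
    # Assumption: the address-space limit, not free RAM, is what read_csv runs into.
    print("A 32-bit process cannot address all 8 GB of RAM.")
```

If this prints 32, trying the x86_64 build is probably the cheapest experiment.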

Related

Copying file in Windows 10 changes its size

I copied a large file to a new directory in Windows 10 by dragging the file from Explorer to a folder in Eclipse. The file size of the copied file changed even though fc shows the original and new files as identical. The original file has a size of 209,715,200 bytes (200 MiB):
c:\>dir c:\Users\GeoffAlexander\Documents\Python\200MiB.txt
Volume in drive C is Windows
Volume Serial Number is 0447-709A
Directory of c:\Users\GeoffAlexander\Documents\Python
08/13/2019 09:42 AM 209,715,200 200MiB.txt
1 File(s) 209,715,200 bytes
0 Dir(s) 268,331,835,392 bytes free
The new file has a size of 211,812,352 bytes:
c:\>dir c:\Users\GeoffAlexander\Desktop\200MiB.txt
Volume in drive C is Windows
Volume Serial Number is 0447-709A
Directory of c:\Users\GeoffAlexander\Desktop
08/15/2019 09:11 AM 211,812,352 200MiB.txt
1 File(s) 211,812,352 bytes
0 Dir(s) 268,232,798,208 bytes free
The fc command shows the files as being identical:
c:\>fc c:\Users\GeoffAlexander\Documents\Python\200MiB.txt c:\Users\GeoffAlexander\Desktop\200MiB.txt
Comparing files C:\USERS\GEOFFALEXANDER\DOCUMENTS\PYTHON\200MiB.txt and C:\USERS\GEOFFALEXANDER\DESKTOP\200MIB.TXT
FC: no differences encountered
Why does the copied file get a new size? How can two files with different sizes be identical? Is Windows 10 incorrectly reporting the size of the new file?
I'm running Windows 10 Enterprise Build 1809 (OS Build 17763.615) if that makes any difference.

It turns out the file size change wasn't due to copying the file. Rather, the change occurred when checking the file in to RTC (Rational Team Concert). The RTC check-in converted the existing LF line delimiters into CRLF line delimiters (Windows line delimiters). See RTC "File content types and line delimiters" for details.

macOS size command shows a really large number?

> size /bin/ls
__TEXT __DATA __OBJC others dec hex
20480 4096 0 4294983680 4295008256 10000a000
How could it be that ls is 4 GB? Is size not meant to be used on executables? I have 4 GB of RAM, so is it just showing me the amount of memory it can use?

On macOS, 64-bit apps have a 4GB page zero, by default. Page zero is a chunk of the address space starting at address 0 which allows no access. This is what causes access violations when a program dereferences a null pointer.
64-bit Mac programs use a 4GB page zero so that, should any valid pointer get accidentally truncated to 32 bits by a program bug (e.g. cast to int and back to a pointer), it will be invalid and cause a crash as soon as possible. That helps to find and fix such bugs.
The page zero segment in the Mach-O executable file doesn't actually use 4GB on disk. It's just a bit of metadata that tells the kernel and dynamic loader how much address space to reserve for it. It seems that size is including the virtual size of all segments, regardless of whether they take up space on disk or not.
Also, the page zero doesn't consume actual RAM when the program is loaded, either. Again, there's just some bookkeeping data to track the fact that the lower 4GB of the address space is reserved.
The size being reported for "others", 4294983680 bytes, is 0x100004000 in hex. That's the 4GB page zero (0x100000000) plus another 4 pages for some other segments.
You can use the -m option to size to get more detail:
$ size -m /bin/ls
Segment __PAGEZERO: 4294967296
Segment __TEXT: 20480
    Section __text: 13599
    Section __stubs: 456
    Section __stub_helper: 776
    Section __const: 504
    Section __cstring: 1150
    Section __unwind_info: 148
    total 16633
Segment __DATA: 4096
    Section __got: 40
    Section __nl_symbol_ptr: 16
    Section __la_symbol_ptr: 608
    Section __const: 552
    Section __data: 40
    Section __bss: 224
    Section __common: 140
    total 1620
Segment __LINKEDIT: 16384
total 4295008256
You can also use the command otool -lV /bin/ls to see the loader commands of the executable, including the one establishing the __PAGEZERO segment.
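
The segment sizes from size -m can be checked against the one-line output with a few lines of arithmetic. In this sketch, the assumption that "others" is __PAGEZERO plus __LINKEDIT is inferred from the numbers matching, not from the size documentation:

```python
# Segment sizes copied from the size -m output.
segments = {
    "__PAGEZERO": 4294967296,
    "__TEXT": 20480,
    "__DATA": 4096,
    "__LINKEDIT": 16384,
}

total = sum(segments.values())
# Assumption: "others" covers the segments that have no column of their own.
others = segments["__PAGEZERO"] + segments["__LINKEDIT"]

print(total, hex(total))               # 4295008256 0x10000a000
print(others, hex(others))             # 4294983680 0x100004000
print(segments["__LINKEDIT"] // 4096)  # 4 pages of 4096 bytes
```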

The size command reports information about the sections of a binary executable and how it is laid out when running. It does not report the size of the file. The 4 GB number might be (that is just my guess) related to the virtual address space needed to run it.
I don't have a MacOSX operating system (because it is proprietary and tied to hardware that I dislike and find too expensive). But on Linux (which is mostly POSIX, like MacOSX), size /bin/ls gives:
text data bss dec hex filename
124847 4672 4824 134343 20cc7 /bin/ls
while ls -l /bin/ls shows
-rwxr-xr-x 1 root root 138856 Feb 28 16:30 /bin/ls
Of course, when ls is running, it has some data (notably bss) which does not correspond to any part of the executable file.
Try man size on your system to get an explanation. For Linux, see size(1) (it gives info about sections of an ELF executable) and ls(1) (it gives the file size).
On MacOSX, executables follow the Mach-O format.
On Linux, if you try size on a non-executable file such as /etc/passwd, you get
size: /etc/passwd: file format not recognized
and I guess that you should have some error message on MacOSX if you try that.
Think of size giving executable size information. The name is historical and a bit misleading.

HDFS: Errno 22 on attempt to edit an existing file in mounted NFS volume

Summary: I mounted an HDFS NFS volume on OS X and it will not let me edit existing files. I can append to files and create files with content, but I cannot open an existing file with the write flag.
Originally, I asked about a particular problem with JupyterLab failing to save notebooks into nfs mounted volumes but while trying to dig down to the roots, I realized (hopefully right) that it's about editing existing files.
I mounted HDFS nfs on OSX and I can access the files, read and write and whatnot. JupyterLab though can do pretty much everything but can't really save notebooks.
I was able to identify the pattern for what's really happening, and the problem boils down to this: you can't open an existing file on the NFS volume for writing.
This will work with a new file:
with open("rand.txt", 'w') as f:
    f.write("random text")
But if you try to run it again (the file has been created now and the content is there), you'll get the following Exception:
---------------------------------------------------------------------------
OSError Traceback (most recent call last)
<ipython-input-15-94a46812fad4> in <module>()
----> 1 with open("rand.txt", 'w') as f:
2 f.write("random text")
OSError: [Errno 22] Invalid argument: 'rand.txt'
I am pretty sure the permissions and all are ok:
with open("seven.txt", 'w') as f:
    f.write("random text")
    f.writelines(["one","two","three"])
r = open("seven.txt", 'r')
print(r.read())
random textonetwothree
I can also append to files no problem:
aleksandrs-mbp:direct sasha$ echo "Another line of text" >> seven.txt && cat seven.txt
random textonetwothreeAnother line of text
I mount it with the following options:
aleksandrs-mbp:hadoop sasha$ mount -t nfs -o vers=3,proto=tcp,nolock,noacl,sync localhost:/ /srv/ti/jupyter-samples/~Hadoop
Apache documentation suggests that NFS gateway does not support random writes. I tried looking at mount documentation but could not find anything specific that points to enforcing sequential writing. I tried playing with different options but it doesn't seem to help much.
This is the exception I get from JupyterLab when it's trying to save the notebook:
[I 03:03:33.969 LabApp] Saving file at /~Hadoop/direct/One.ipynb
[E 03:03:33.980 LabApp] Error while saving file: ~Hadoop/direct/One.ipynb [Errno 22] Invalid argument: '/srv/ti/jupyter-samples/~Hadoop/direct/.~One.ipynb'
Traceback (most recent call last):
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/notebook/services/contents/filemanager.py", line 471, in save
self._save_notebook(os_path, nb)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/notebook/services/contents/fileio.py", line 293, in _save_notebook
with self.atomic_writing(os_path, encoding='utf-8') as f:
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/contextlib.py", line 82, in __enter__
return next(self.gen)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/notebook/services/contents/fileio.py", line 213, in atomic_writing
with atomic_writing(os_path, *args, log=self.log, **kwargs) as f:
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/contextlib.py", line 82, in __enter__
return next(self.gen)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/notebook/services/contents/fileio.py", line 103, in atomic_writing
copy2_safe(path, tmp_path, log=log)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/notebook/services/contents/fileio.py", line 51, in copy2_safe
shutil.copyfile(src, dst)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/shutil.py", line 115, in copyfile
with open(dst, 'wb') as fdst:
OSError: [Errno 22] Invalid argument: '/srv/ti/jupyter-samples/~Hadoop/direct/.~One.ipynb'
[W 03:03:33.981 LabApp] 500 PUT /api/contents/~Hadoop/direct/One.ipynb?1534835013966 (::1): Unexpected error while saving file: ~Hadoop/direct/One.ipynb [Errno 22] Invalid argument: '/srv/ti/jupyter-samples/~Hadoop/direct/.~One.ipynb'
[W 03:03:33.981 LabApp] Unexpected error while saving file: ~Hadoop/direct/One.ipynb [Errno 22] Invalid argument: '/srv/ti/jupyter-samples/~Hadoop/direct/.~One.ipynb'
This is what I see in the NFS logs at the same time:
2018-08-21 03:05:34,006 ERROR org.apache.hadoop.hdfs.nfs.nfs3.RpcProgramNfs3: Setting file size is not supported when setattr, fileId: 16417
2018-08-21 03:05:34,006 ERROR org.apache.hadoop.hdfs.nfs.nfs3.RpcProgramNfs3: Setting file size is not supported when setattr, fileId: 16417
I'm not exactly sure what this means, but if I understand the RFC correctly, it should be part of the implementation:
Servers must support extending the file size via SETATTR.
I understand the complexity behind mounting HDFS and letting clients write all they want while keeping these files distributed and maintaining integrity. Is there a compromise, though, that would enable writes via NFS?
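
The failing calls all open an existing file with mode 'w', which asks the server to truncate it to size 0, and that seems to be the "setting file size" request the gateway log rejects. One possible workaround, not tested against an HDFS NFS gateway, is to delete the file and create it fresh so that no truncation is ever requested. The helper name below is hypothetical:

```python
import os


def rewrite_without_truncate(path: str, text: str) -> None:
    """Replace a file's contents by deleting and recreating it.

    Assumption: the gateway rejects the truncate implied by mode 'w' on an
    existing file, but still allows delete followed by create.
    """
    try:
        os.remove(path)
    except FileNotFoundError:
        pass
    # Mode 'x' creates a new file and fails if one exists, so no truncate is sent.
    with open(path, "x", encoding="utf-8") as f:
        f.write(text)
```

This is not atomic: a crash between the delete and the write loses the old contents, so it is a compromise rather than a drop-in replacement for what JupyterLab's atomic_writing does.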

Floppy disk sector count

I am trying to understand why lseek() is used in this image creator. Why 5 bytes away from the start of the file? If I change that number, the OS won't boot.
The image creator creates a .img file with the bootloader.bin inside.
/* modify the sector count */
total_sector_number = file_size / 512;
lseek(disk_image_fd, 5, SEEK_SET);
write(disk_image_fd, &total_sector_number, 2);
write(disk_image_fd, &kernel_32_sector_number, 2);
//printf("%d\n", lawl);
printf("TOTAL_SECTOR_NUMBER : %d\n", total_sector_number);
printf("KERNEL_32_SECTOR_NUMBER : %d\n", kernel_32_sector_number);
The source code (image maker):
http://pastebin.com/raw.php?i=MuDpYP3Y
Bootloader:
http://pastebin.com/raw.php?i=kzw2ZaU1
The hexdump with lseek() and writing the number of sectors at byte offset 5:
Without lseek() OS does not boot correctly.

I only figured this out because of your previous post Bootloader memory location which contained different source code for the bootloader.
You mentioned the two unknown variables TOTALSECTORCOUNT and KERNEL32SECTORCOUNT. These variables were near the beginning of the file, and I guess when assembled they sit 5 bytes into the binary. Calling lseek with the SEEK_SET parameter moves the file pointer to 5 bytes after the beginning of the file. It then writes the two values which will overwrite the ones in the bootloader code.
When you remove the lseek it will instead append the two values to the end of the file. If you changed the offset parameter of lseek to zero it would overwrite the jmp command of the bootloader instead.
Notice in your hexdump.
00000000 00eb b8fa 02c0 0000 c000 e08e e88e 00b8
^ ^- kernel_32_sector_number is never initialized.
|-total_sector_number which was calculated in code before the write.
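
The same seek-then-overwrite pattern can be sketched in Python. This is an illustration rather than the original image maker: the function name is made up, and the values are assumed to be 16-bit little-endian, which is what writing 2 bytes of an int produces on x86.

```python
import struct


def patch_sector_counts(image_path: str, total_sectors: int,
                        kernel32_sectors: int, offset: int = 5) -> None:
    """Overwrite two 16-bit little-endian values in place at the given offset."""
    # "r+b" opens for update without truncating, so the rest of the image is kept.
    with open(image_path, "r+b") as image:
        image.seek(offset)  # absolute seek, like lseek(fd, 5, SEEK_SET)
        image.write(struct.pack("<HH", total_sectors, kernel32_sectors))
```

Bytes 5 to 8 are replaced and the file length stays the same. Without the seek, the write would land wherever the file position happens to be, which after copying the bootloader in is the end of the file.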

Tornado app halting regularly for few seconds with 100% CPU

I am trying to troubleshoot an app running on tornado 2.4 on Ubuntu 11.04 on EC2. It appears to hit 100% CPU regularly and halts on that request for a few seconds.
Any help on this is greatly appreciated.
Symptoms:
top shows 100% cpu just at the time it halts. Normally server is about 30-60% cpu utilization.
It halts every 2-5 minutes just for one request. I have checked that there are no cronjobs affecting this.
It halts for about 2 to 9 seconds. The problem goes away on restarting tornado and worsens with tornado uptime: the longer the server is up, the longer it halts.
Http requests for which the problem appears do not seem to have any pattern.
Interestingly, the duration logged for the next request sometimes matches the halting duration and sometimes does not. Example:
00:00:00 GET /some/request ()
00:00:09 GET /next/request (9000ms)
00:00:00 GET /some/request ()
00:00:09 GET /next/request (1ms)
# 9 seconds gap in requests is certainly not possible as clients are constantly polling.
Database (mongodb) shows no expensive or large number of queries. No page faults. Database is on the same machine - local disk.
vmstat shows no change in read/write sizes compared to last few minutes.
tornado is running behind nginx.
Sending SIGINT when it was most likely halting gives a different stack trace every time. Some of them are below:
Traceback (most recent call last):
File "chat/main.py", line 3396, in <module>
main()
File "chat/main.py", line 3392, in main
tornado.ioloop.IOLoop.instance().start()
File "/home/ubuntu/tornado/tornado/ioloop.py", line 515, in start
self._run_callback(callback)
File "/home/ubuntu/tornado/tornado/ioloop.py", line 370, in _run_callback
callback()
File "/home/ubuntu/tornado/tornado/stack_context.py", line 216, in wrapped
callback(*args, **kwargs)
File "/home/ubuntu/tornado/tornado/iostream.py", line 303, in wrapper
callback(*args)
File "/home/ubuntu/tornado/tornado/stack_context.py", line 216, in wrapped
callback(*args, **kwargs)
File "/home/ubuntu/tornado/tornado/httpserver.py", line 298, in _on_request_body
self.request_callback(self._request)
File "/home/ubuntu/tornado/tornado/web.py", line 1421, in __call__
handler = spec.handler_class(self, request, **spec.kwargs)
File "/home/ubuntu/tornado/tornado/web.py", line 126, in __init__
application.ui_modules.iteritems())
File "/home/ubuntu/tornado/tornado/web.py", line 125, in <genexpr>
self.ui["_modules"] = ObjectDict((n, self._ui_module(n, m)) for n, m in
File "/home/ubuntu/tornado/tornado/web.py", line 1114, in _ui_module
def _ui_module(self, name, module):
KeyboardInterrupt
Traceback (most recent call last):
File "chat/main.py", line 3398, in <module>
main()
File "chat/main.py", line 3394, in main
tornado.ioloop.IOLoop.instance().start()
File "/home/ubuntu/tornado/tornado/ioloop.py", line 515, in start
self._run_callback(callback)
File "/home/ubuntu/tornado/tornado/ioloop.py", line 370, in _run_callback
callback()
File "/home/ubuntu/tornado/tornado/stack_context.py", line 216, in wrapped
callback(*args, **kwargs)
File "/home/ubuntu/tornado/tornado/iostream.py", line 303, in wrapper
callback(*args)
File "/home/ubuntu/tornado/tornado/stack_context.py", line 216, in wrapped
callback(*args, **kwargs)
File "/home/ubuntu/tornado/tornado/httpserver.py", line 285, in _on_headers
self.request_callback(self._request)
File "/home/ubuntu/tornado/tornado/web.py", line 1408, in __call__
transforms = [t(request) for t in self.transforms]
File "/home/ubuntu/tornado/tornado/web.py", line 1811, in __init__
def __init__(self, request):
KeyboardInterrupt
Traceback (most recent call last):
File "chat/main.py", line 3351, in <module>
main()
File "chat/main.py", line 3347, in main
tornado.ioloop.IOLoop.instance().start()
File "/home/ubuntu/tornado/tornado/ioloop.py", line 571, in start
self._handlers[fd](fd, events)
File "/home/ubuntu/tornado/tornado/stack_context.py", line 216, in wrapped
callback(*args, **kwargs)
File "/home/ubuntu/tornado/tornado/netutil.py", line 342, in accept_handler
callback(connection, address)
File "/home/ubuntu/tornado/tornado/netutil.py", line 237, in _handle_connection
self.handle_stream(stream, address)
File "/home/ubuntu/tornado/tornado/httpserver.py", line 156, in handle_stream
self.no_keep_alive, self.xheaders, self.protocol)
File "/home/ubuntu/tornado/tornado/httpserver.py", line 183, in __init__
self.stream.read_until(b("\r\n\r\n"), self._header_callback)
File "/home/ubuntu/tornado/tornado/iostream.py", line 139, in read_until
self._try_inline_read()
File "/home/ubuntu/tornado/tornado/iostream.py", line 385, in _try_inline_read
if self._read_to_buffer() == 0:
File "/home/ubuntu/tornado/tornado/iostream.py", line 401, in _read_to_buffer
chunk = self.read_from_fd()
File "/home/ubuntu/tornado/tornado/iostream.py", line 632, in read_from_fd
chunk = self.socket.recv(self.read_chunk_size)
KeyboardInterrupt
Any tips on how to troubleshoot this are greatly appreciated.
Further observations:
strace -p, during the time it hangs, shows empty output.
ltrace -p during hang time shows only free() calls in large numbers:
free(0x6fa70080) =
free(0x1175f8060) =
free(0x117a5c370) =

It sounds like you're suffering from garbage collection (GC) storms. The behavior you've described is typical of that diagnosis, and the ltrace further supports the hypothesis.
Lots of objects are being allocated and disposed of in the main/event loops being exercised by your usage ... and the periodic flurries of calls to free() result from that.
One possible approach would be to profile your code (or libraries on which you are depending) and see if you can refactor it to use (and re-use) objects from pre-allocated pools.
Another possible mitigation would be to make your own, more frequent, calls to trigger the garbage collection --- more expensive in aggregate but possibly less costly at each call. (That would be a trade-off for more predictable throughput).
You might be able to use the Python gc module both for investigating the issue more deeply (using gc.set_debug()) and for a simple attempted mitigation (calls to gc.collect() after each transaction, for example). You might also try running your application with gc.disable() for a reasonable length of time to see if that further implicates the Python garbage collector. Note that disabling the garbage collector for an extended period of time will almost certainly cause paging/swapping, so use it only to validate the hypothesis and don't expect it to solve the problem in any meaningful way. It may just defer the problem until the whole system is thrashing and needs to be rebooted.
Here's an example of using gc.collect() in another SO thread on Tornado: SO: Tornado memory leak on dropped connections
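
As a rough sketch of how the hypothesis could be tested, the snippet below times every collector run and offers a cheap per-request collection. It assumes Python 3.3 or later, where gc.callbacks exists; on the Python 2 interpreter that tornado 2.4 would typically run under, only gc.collect(), gc.set_debug() and gc.disable() are available. The names gc_pauses and collect_young_generation are illustrative.

```python
import gc
import time

gc_pauses = []   # (generation, seconds) for every collector run observed
_started = {}


def _time_gc(phase, info):
    """gc callback: record how long each collection takes."""
    if phase == "start":
        _started["t"] = time.perf_counter()
    elif phase == "stop" and "t" in _started:
        elapsed = time.perf_counter() - _started.pop("t")
        gc_pauses.append((info["generation"], elapsed))


gc.callbacks.append(_time_gc)


def collect_young_generation() -> int:
    """Hypothetical per-request hook: collect only the youngest generation."""
    return gc.collect(0)
```

If the longest entries in gc_pauses line up with the 2 to 9 second stalls, the collector is implicated. Calling collect_young_generation() after each request trades a little throughput for shorter, more predictable pauses.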
