I'm using Kafka Streams 0.10.2.1 with a windowed RocksDB state store, and I'm seeing very strange behavior during state store initialization.
Inside each task's state store folder, Kafka Streams creates and deletes folders containing RocksDB files for 30 minutes.
If the state store is named XXX, then I see folders being created inside a folder named
State Folder/Task ID/XXX
with names such as
XXX-201710211345
containing RocksDB files. These folders are created, then deleted, and new folders with a different timestamp are created. This goes on for 30 minutes before message processing begins.
I'm guessing that RocksDB is reconstructing all the historical states from the state store's changelog topic, but I fail to understand for what purpose, as it eventually deletes all but the last one.
What is the reason that KafkaStreams is creating and deleting these folders?
How can I make KafkaStreams recreate only the latest state?
This is a stripped down version of my topology:
stream
    .map((key, value) -> KeyValue.pair(key, value))
    .through(Serdes.String(), serde, MY_TOPIC)
    .groupByKey(Serdes.String(), serde)
    .count(
        TimeWindows.of(TimeUnit.SECONDS.toMillis(windowDurationSec))
                   .until(TimeUnit.SECONDS.toMillis(windowDurationSec)
                          + TimeUnit.SECONDS.toMillis(lateEventGraceTimeSec)),
        "Hourly_Agg")
    .foreach((k, v) -> System.out.println(""));
And here's a (tiny part of) dump from strace:
6552 stat("/path/Prod/kafka-streams/Counter-V13/3_131/Hourly_Agg/Hourly_Agg-201710211230/000006.sst", {st_mode=S_IFREG|0644, st_size=3158, ...}) = 0
6552 unlink("/path/Prod/kafka-streams/Counter-V13/3_131/Hourly_Agg/Hourly_Agg-201710211230/000006.sst") = 0
6552 unlink("/path/Prod/kafka-streams/Counter-V13/3_131/Hourly_Agg/Hourly_Agg-201710211230") = -1 EISDIR (Is a directory)
6552 rmdir("/path/Prod/kafka-streams/Counter-V13/3_131/Hourly_Agg/Hourly_Agg-201710211230") = 0
6552 stat("/path/Prod/kafka-streams/Counter-V13/3_131/Hourly_Agg", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
6552 mkdir("/path/Prod/kafka-streams/Counter-V13/3_131/Hourly_Agg/Hourly_Agg-201710211500", 0755) = 0
6552 rename("/path/Prod/kafka-streams/Counter-V13/3_131/Hourly_Agg/Hourly_Agg-201710211500/LOG", "/path/Prod/kafka-streams/Counter-V13/3_131/Hourly_Agg/Hourly_Agg-201710211500/LOG.old.1508746634575191") = -1 ENOENT (No such file or directory)
Kafka Streams does recreate only the latest state; the behavior you see is by design.
For windowed stores, the window retention period is divided into so-called segments, and Streams uses one RocksDB instance per segment to store the corresponding data. This allows segments to be "rolled" as time progresses, and data older than the retention time to be deleted efficiently (i.e., a whole segment/RocksDB instance is simply dropped).
When state is recreated, we simply read the whole changelog topic and apply all of its updates to the store. Thus, you see the same segment-rolling behavior as during processing (just compressed into a much shorter time frame). It is not easily possible to "jump" to the last state, as there is not enough information up front; thus, blindly replaying the changelog is the best option.
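To illustrate the segment idea, here is a minimal Python sketch of the concept (not Kafka Streams' actual code; the interval and retention values are made up):
segment_interval_ms = 60 * 60 * 1000      # assumption: one segment covers one hour
retention_ms = 3 * segment_interval_ms    # assumption: retention spans three segments

segments = {}  # segment id -> per-segment store (one RocksDB instance each in Streams)

def put(timestamp_ms, key, value):
    # Each record lands in the segment covering its timestamp;
    # the segment id maps to folder names like Hourly_Agg-201710211230
    segment_id = timestamp_ms // segment_interval_ms
    segments.setdefault(segment_id, {})[key] = value
    # Drop whole segments that fell out of the retention window
    oldest_live = (timestamp_ms - retention_ms) // segment_interval_ms
    for sid in [s for s in segments if s < oldest_live]:
        del segments[sid]
During restoration the changelog is replayed in timestamp order, so old segments are created, filled, and dropped again as replay advances toward the present; that is exactly the folder churn visible in the strace dump.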
Related
I have a parquet file of position data for vehicles that is indexed by vehicle ID and sorted by timestamp. I want to read the parquet file, do some calculations on each partition (not aggregations) and then write the output directly to a new parquet file of similar size.
I organized my data and wrote my code (below) to use Dask's map_partitions, as I understood this would perform the operations one partition at a time, saving each result to disk sequentially and thereby minimizing memory usage. I was surprised to find it exceeding my available memory, and I found that if I instead loop over the partitions, running my code on one at a time and appending the output to the new parquet file (see second code block below), it easily fits within memory.
Is there something incorrect in the original way I used map_partitions? If not, why does it use so much more memory? What is the proper, most efficient way of achieving what I want?
Thanks in advance for any insight!!
Original (memory hungry) code:
ddf = dd.read_parquet(input_file)
meta_dict = ddf.dtypes.to_dict()
(
    ddf
    .map_partitions(my_function, meta = meta_dict)
    .to_parquet(
        output_file,
        append = False,
        overwrite = True,
        engine = 'fastparquet'
    )
)
Awkward looped (but more memory friendly) code:
ddf = dd.read_parquet(input_file)
for partition in range(0, ddf.npartitions, 1):
    partition_df = ddf.partitions[partition]
    (
        my_function(partition_df)
        .to_parquet(
            output_file,
            append = True,
            overwrite = False,
            engine = 'fastparquet'
        )
    )
More hardware and data details:
The total input parquet file is around 5GB and is split into 11 partitions of up to 900MB. It is indexed by ID with divisions so I can do vehicle grouped operations without working across partitions. The laptop I'm using has 16GB RAM and 19GB swap. The original code uses all of both, while the looped version fits within RAM.
As @MichaelDelgado pointed out, by default Dask will spin up multiple workers/threads according to what is available on the machine. With partitions of the size I have, this maxes out the available memory when using the map_partitions approach. To avoid this, I limited the number of workers and the number of threads per worker to prevent automatic parallelization, using the code below, and the task fit in memory.
from dask.distributed import Client, LocalCluster
cluster = LocalCluster(
    n_workers = 1,
    threads_per_worker = 1)
client = Client(cluster)
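With that client in place, the original map_partitions pipeline fits in memory, since only one partition is materialized at a time. A sketch reusing the placeholder names from the question (my_function, input_file, output_file):
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster

# One worker with one thread: partitions are processed strictly sequentially,
# so peak memory is roughly a single ~900MB partition plus Dask overhead
client = Client(LocalCluster(n_workers=1, threads_per_worker=1))

ddf = dd.read_parquet(input_file)
(
    ddf
    .map_partitions(my_function, meta=ddf.dtypes.to_dict())
    .to_parquet(output_file, append=False, overwrite=True, engine='fastparquet')
)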
I am writing larger-than-RAM data out from my Python application - basically dumping data from SQLAlchemy to Parquet. My solution was inspired by this question. Even after increasing the batch size as hinted there, I am facing these issues:
RAM usage grows heavily
The writer starts to slow down after a while (write throughput drops more than 5x)
My assumption is that this is because ParquetWriter metadata management becomes expensive as the number of rows increases. I am thinking that I should switch to datasets, which would allow the writer to close files in the middle of processing and flush out the metadata.
My questions are:
Is there an example of writing incremental datasets with Python and Parquet?
Are my assumptions correct, and would using datasets help to maintain the writer throughput?
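For context, the "datasets" idea I have in mind would look roughly like this: each flush becomes its own small, self-contained Parquet file under a common directory, so no single writer accumulates metadata forever (a sketch only; candles_dataset is a made-up path):
import pyarrow as pa
import pyarrow.parquet as pq

def flush_to_dataset(data, schema):
    # Each call writes its own small file under the dataset root,
    # so no single ParquetWriter accumulates metadata for all rows
    table = pa.Table.from_pydict(data, schema=schema)
    pq.write_to_dataset(table, root_path="candles_dataset", compression="snappy")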
My distilled code:
writer = pq.ParquetWriter(
    fname,
    Candle.to_pyarrow_schema(small_candles),
    compression='snappy',
    allow_truncated_timestamps=True,
    version='2.0',  # Highest available format version
    data_page_version='2.0',  # Highest available data page version
)
def writeout():
    nonlocal data
    duration = time.time() - stats["started"]
    throughput = stats["candles_processed"] / duration
    logger.info("Writing Parquet table for candle %s, throughput is %s",
                "{:,}".format(stats["candles_processed"]), throughput)
    writer.write_table(
        pa.Table.from_pydict(
            data,
            writer.schema
        )
    )
    data = dict.fromkeys(data.keys(), [])
    process = psutil.Process(os.getpid())
    logger.info("Flushed %s writer, the memory usage is %s", bucket, process.memory_info())
# Use a massive yield_per(), otherwise we leak memory
for item in query.yield_per(100_000):
    frame = construct_frame(row_type, item)
    for key, value in frame.items():
        data[key].append(value)
    stats["candles_processed"] += 1

    # Do regular checkpoints to avoid running out of memory
    # and to log the progress to the console.
    # For fine-tuning the Parquet writer see
    # https://issues.apache.org/jira/browse/ARROW-10052
    if stats["candles_processed"] % 100_000 == 0:
        writeout()
In this case, the reason was the incorrect use of Python lists and dicts as a working buffer, as pointed out by @0x26res.
After making sure the dictionary of lists is cleared correctly, the memory consumption issues became negligible.
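Concretely, the subtle part was the line that reset the buffer after each writeout(). A minimal sketch of the pitfall and the fix:
# Buggy: dict.fromkeys() binds every key to the *same* list object,
# so each append lands in one shared list, every column keeps growing
# with all values, and the buffer is never actually emptied
data = dict.fromkeys(data.keys(), [])

# Fixed: give each key its own fresh, empty list
data = {key: [] for key in data}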
Assume I have multiple processes writing large files (20GB+). Each process writes its own file, and assume that a process writes x MB at a time, then does some processing, writes x MB again, and so on.
What happens is that this write pattern causes the files to become heavily fragmented, since blocks belonging to the different files end up allocated next to each other (interleaved) on the disk.
Of course it is easy to work around this issue by using SetEndOfFile to "preallocate" the file when it is opened and then set the correct size before it is closed. But an application accessing these files remotely, which is able to parse these in-progress files, then sees zeroes at the end of the file and takes much longer to parse it.
I do not have control over this reading application, so I can't optimize it to take the zeros at the end into account.
Another dirty fix would be to run defragmentation more often, run Sysinternals' contig utility, or even implement a custom "defragmenter" which would process my files and consolidate their blocks.
A more drastic solution would be to implement a minifilter driver which would report a "fake" file size.
But obviously both solutions listed above are far from optimal. So I would like to know if there is a way to provide a file size hint to the filesystem so that it "reserves" consecutive space on the drive, while still reporting the correct file size to applications.
Obviously, writing larger chunks at a time also helps with fragmentation, but it does not solve the issue.
EDIT:
Since the usefulness of SetEndOfFile in my case seems to be disputed, I made a small test:
LARGE_INTEGER size;
LARGE_INTEGER a;
char buf='A';
DWORD written=0;
DWORD tstart;
std::cout << "creating file\n";
tstart = GetTickCount();
HANDLE f = CreateFileA("e:\\test.dat", GENERIC_ALL, FILE_SHARE_READ, NULL, CREATE_ALWAYS, 0, NULL);
size.QuadPart = 100000000LL;
SetFilePointerEx(f, size, &a, FILE_BEGIN);
SetEndOfFile(f);
printf("file extended, elapsed: %d\n",GetTickCount()-tstart);
getchar();
printf("writing 'A' at the end\n");
tstart = GetTickCount();
SetFilePointer(f, -1, NULL, FILE_END);
WriteFile(f, &buf,1,&written,NULL);
printf("written: %d bytes, elapsed: %d\n",written,GetTickCount()-tstart);
When the application is executed and waits for a keypress after SetEndOfFile, I examined the on-disk NTFS structures:
The image shows that NTFS has indeed allocated clusters for my file. However, the unnamed DATA attribute has StreamDataSize specified as 0.
Sysinternals DiskView also confirms that clusters were allocated
When pressing enter to allow the test to continue (and after waiting for quite some time, since the file was created on a slow USB stick), the StreamDataSize field was updated
Since I wrote 1 byte at the end, NTFS now really had to zero everything up to that offset, which is why the write took so long; but the clusters had already been allocated, so SetEndOfFile does indeed help with the fragmentation issue I am "fretting" about.
I would appreciate it very much if answers/comments also provided an official reference to back up the claims being made.
Oh and the test application outputs this in my case:
creating file
file extended, elapsed: 0
writing 'A' at the end
written: 1 bytes, elapsed: 21735
Also, for the sake of completeness, here is an example of how the DATA attribute looks when setting FileAllocationInfo (note that I created a new file for this picture)
Windows file systems maintain two public sizes for file data, which are reported in the FileStandardInformation:
AllocationSize - a file's allocation size in bytes, which is typically a multiple of the sector or cluster size.
EndOfFile - a file's absolute end of file position as a byte offset from the start of the file, which must be less than or equal to the allocation size.
Setting an end of file that exceeds the current allocation size implicitly extends the allocation. Setting an allocation size that's less than the current end of file implicitly truncates the end of file.
Starting with Windows Vista, we can manually extend the allocation size without modifying the end of file via SetFileInformationByHandle: FileAllocationInfo. You can use Sysinternals DiskView to verify that this allocates clusters for the file. When the file is closed, the allocation gets truncated to the current end of file.
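For example, here is a rough ctypes sketch of this call from Python (assumptions: Windows Vista or later, and the value 5 for FileAllocationInfo in FILE_INFO_BY_HANDLE_CLASS):
import ctypes
from ctypes import wintypes

kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
kernel32.CreateFileW.restype = wintypes.HANDLE

GENERIC_WRITE = 0x40000000
OPEN_ALWAYS = 4
FILE_ALLOCATION_INFO_CLASS = 5  # FileAllocationInfo
INVALID_HANDLE_VALUE = wintypes.HANDLE(-1).value

class FILE_ALLOCATION_INFO(ctypes.Structure):
    _fields_ = [("AllocationSize", wintypes.LARGE_INTEGER)]

def preallocate(path, nbytes):
    # Extends the allocation only; EndOfFile still reports the real size
    handle = kernel32.CreateFileW(path, GENERIC_WRITE, 0, None,
                                  OPEN_ALWAYS, 0, None)
    if handle == INVALID_HANDLE_VALUE:
        raise ctypes.WinError(ctypes.get_last_error())
    info = FILE_ALLOCATION_INFO(nbytes)
    if not kernel32.SetFileInformationByHandle(
            handle, FILE_ALLOCATION_INFO_CLASS,
            ctypes.byref(info), ctypes.sizeof(info)):
        raise ctypes.WinError(ctypes.get_last_error())
    kernel32.CloseHandle(handle)  # allocation is truncated back to EndOfFile here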
If you don't mind using the NT API directly, you can also call NtSetInformationFile: FileAllocationInformation. Or even set the allocation size at creation via NtCreateFile.
FYI, there's also an internal ValidDataLength size, which must be less than or equal to the end of file. As a file grows, the clusters on disk are lazily initialized. Reading beyond the valid region returns zeros. Writing beyond the valid region extends it by initializing all clusters up to the write offset with zeros. This is typically where we might observe a performance cost when extending a file with random writes. We can set the FileValidDataLengthInformation to get around this (e.g. SetFileValidData), but it exposes uninitialized disk data and thus requires SeManageVolumePrivilege. An application that utilizes this feature should take care to open the file exclusively and ensure the file is secure in case the application or system crashes.
I create one background thread B, and in B's function:
void func()
{
    system("gzip -f text-file"); // size of text-file is 100M
    xxx
}
I found that sometimes the sys CPU usage of one core (my server has more than one CPU core) hits 100%.
After stracing the process, I found that the clone syscall consumes more than 3 seconds, which is almost the entire execution time of gzip.
17:46:04.545159 clone(child_stack=0, flags=CLONE_PARENT_SETTID|SIGCHLD, parent_tidptr=0x418dba38) = 39169
17:46:07.432385 wait4(39169, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], 0, NULL) = 39169
So my questions are:
1. Does system("gzip -f text-file") lead to 100% sys CPU?
2. What is the root cause?
sys_clone without CLONE_VM makes a full copy of the parent's virtual memory mappings in the child process, according to https://www.kernel.org/doc/gorman/html/understand/understand021.html
343 Allocate a new mm
348-350 Copy the parent mm and initialise the process specific mm fields with init_mm()
352-353 Initialise the MMU context for architectures that do not automatically manage their MMU
355-357 Call dup_mmap() which is responsible for copying all the VMAs regions in use by the parent process
The VMA count for a process with 60GB in 2000 mmaps is high, and dup_mm may take a lot of time.
You want to do a small external run (gzip), but fork is not the best solution for such large programs. All the copied VMAs will be discarded by the exec anyway: http://landley.net/writing/memory-faq.txt
For example, the fork/exec combo creates transient virtual memory usage
spikes, which go away again almost immediately without ever breaking the
copy on write status of most of the pages in the forked page tables. Thus
if a large process forks off a smaller process, enormous physical memory
demands threaten to happen (as far as overcommit is concerned), but never
materialize.
So, it may be better for you to:
check the vfork+exec pair (aka posix_spawn), which will suspend your huge process for a short time, until the child calls exec or _exit (see the sketch after this list)
create a separate helper process before doing all the 60GB of mmaps; communicate with it using pipes/sockets/IPC/anything. The helper process is small and will sleep most of the time on IPC. When you need gzip, you just ask the helper to run it.
or integrate compression into your program. gzip and bzip2 both have good libraries (zlib and libbz2), and there are wrappers in Boost.
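To illustrate the first option, here is a Python sketch (the same applies to posix_spawn(3) from C): on glibc/Linux, posix_spawn is implemented with vfork()/CLONE_VM, so the parent's huge address space is not duplicated just to exec a small gzip binary.
import os

# Spawn gzip without copying the parent's VMAs (unlike fork()/system())
pid = os.posix_spawn("/bin/gzip", ["gzip", "-f", "text-file"], os.environ)

# Reap the child, as system() does internally
_, status = os.waitpid(pid, 0)
print("gzip exited with", os.WEXITSTATUS(status))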
I am working with a shared memory application, and to delete the segments I use the following command:
ipcrm -M 0x0000162e (this is the key)
But I do not know if I'm doing the right thing, because when I run ipcs I see the same segment but with the key 0x00000000. So is the memory segment really deleted? When I run my application several times, I see different memory segments with the key 0x00000000, like this:
key shmid owner perms bytes nattch status
0x00000000 65538 me 666 27 2 dest
0x00000000 98307 me 666 5 2 dest
0x00000000 131076 me 666 5 1 dest
0x00000000 163845 me 666 5 0
What is actually happening? Is the memory segment really deleted?
Edit: The problem was - as explained in the accepted answer below - that there were two processes using the shared memory; until all of those processes have exited, the memory segment will not disappear.
I vaguely remember from my UNIX days (AIX and HP-UX; I'll admit I've never used shared memory on Linux) that deletion simply marks the block as no longer attachable by new clients.
It will only be physically deleted at some point after there are no more processes attached to it.
This is the same as with regular files that are deleted: their directory information is removed, but the contents of the file only disappear after the last process closes it. This sometimes leads to log files that take up more and more space on the file system even after they're deleted, as processes are still writing to them; a consequence of the "detachment" between a file's name (the zero or more directory entries pointing to an inode) and the file's content (the inode itself).
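The file analogy is easy to demonstrate (a small sketch, UNIX/Linux semantics):
import os

f = open("demo.log", "w")
os.unlink("demo.log")        # the directory entry is gone: the file is "deleted"

f.write("still writable\n")  # but the inode lives on while the file is open
f.flush()                    # and its blocks still consume disk space

f.close()                    # last reference dropped: now the space is freed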
You can see from your ipcs output that three of the four segments still have attached processes, so they won't be going anywhere until those processes detach from the shared memory blocks. The other one is probably waiting for some 'sweep' function to clean it up, but that would, of course, depend on the shared memory implementation.
A well-written client of shared memory (or of log files, for that matter) should periodically re-attach (or roll over) to ensure this situation is transient and doesn't affect the operation of the software.
You said that you used the following command
ipcrm -M 0x0000162e (this is the key)
From the man page for ipcrm
-M shmkey
Mark the shared memory segment associated with key shmkey for
removal. This marked segment will be destroyed after the
last detach.
So the -M option does exactly what you observed, i.e. it marks the segment to be destroyed only after the last detach.