Nowadays even smartphones have 2 or (as newly announced) 4 cores, and AMD sells chips with 16. Indeed, the 3 main machines used to develop DataPath have 16, 32 and 48 cores respectively. If performance is the goal, multi-threading everything is the key. Multi-threading very uniform loads like numeric processing or graphics is a fairly well understood problem; extending that expertise to the heterogeneous loads that database systems face is tricky at best.
To make the discussion concrete, let's assume we have the TPC-H workload and we are trying to execute Q1 (a scan of most of lineitem with 8 aggregates and a simple GROUP BY). When you have fast disks (more about this in future blog posts) and many cores, you probably need to pay attention to multi-threading. For example, on our largest machine, data comes off the disk at 2-3 GB/s. The machine has 48 cores, and it needs a lot of them to keep up with the deluge of data from the disk.
When it comes to execution models, most database engines available today fall into two categories: multi-process and (weakly) multi-threaded. A third category, strongly multi-threaded, is far rarer.
The multi-process engines, like Postgres and its derivatives, do not use threads at all. Multiple backend processes must be launched and coordinated if the many cores are to be used. Aster Data and Greenplum made a business out of the art of accomplishing this, but it is no easy task. The main challenge is that system resources need to be partitioned between the multiple processes; it is harder to move resources from one process to another than between threads. A key ingredient of this solution, borrowed from MPP distributed databases, is an EXCHANGE operator that breaks the work between multiple engines (a discussion for a different post).
The multi-threaded engines of most other databases (MySQL and its derivatives, and most open source and commercial engines) use multiple threads to control multiple disks (for IO purposes) or to provide some parallelism in execution. Multiple queries running in parallel can be assigned to different threads. If the pipes that connect the operators in a traditional iterator-model implementation of a database engine are thread-safe, each operator can run in its own thread, allowing some level of parallelism. I say some since, for Q1 in TPC-H, only 2-3 threads can be used: for the selection, the aggregation/GROUP BY, and possibly the unpacking (depending on the IO model). On the 48-core machine, the rest of the cores would simply be idle. I will refer to this multi-threading model as weak multi-threading.
For machines with many cores, a strong version of multi-threading is needed: each operator must be multi-threaded and capable of using all the cores in the system. In the case of TPC-H Q1, most of the power needs to go to the aggregation operator; the selection is trivial (a filter on a date). Other queries, like Q6, spend most of their effort in selection if indexing is not used. Achieving strong multi-threading for selection and aggregation is nontrivial but not very complex, since both parallelize nicely. The join and related operators are far trickier. The main problem, in general, is avoiding any significant use of locking, since otherwise performance quickly drops below the single-threaded level. A story I hear very often when I talk to people about their multi-threading adventures is that, using 5 threads, their code is now slower than single-threaded.
The key to achieving strong multi-threading in DataPath is the notion of a Chunk. Chunks are bundles of roughly 2 million tuples that are processed as a unit. DataPath uses a fairly sophisticated task scheduler, but it can afford to do so since each decision covers 2M tuples, making the scheduling overhead virtually free. The use of Chunks as the unit of storage and execution is, in my opinion, by far the best decision we took in the design of DataPath. Chunks are built in parallel from the disk. Work on different Chunks can be performed in parallel by various operators and can use all the cores if necessary. In the Q1 example, we can have up to 48 Chunks processed simultaneously by the aggregate operator to keep up with the fast disk.
Chunks have many other benefits, which I will highlight in subsequent posts. Other software engineering techniques are also needed to design an engine that can keep track of many parallel sub-tasks and manage them efficiently. Long discussions are needed for all of these issues.
The most important point I can make on this issue is that strong multi-threading cannot be achieved without medium-coarse grained parallelism. If the parallelism is controlled too finely, the management overhead dominates the running time. If the task is not broken into parts smaller than an operator, large analytical queries like TPC-H Q1 cannot benefit. The Chunks in DataPath achieve precisely this medium-coarse granularity for tasks.