Capriccio: Scalable Threads for Internet Services

Authors: Rob von Behren, Jeremy Condit, Feng Zhou, George C. Necula, Eric Brewer
Presentation by: Will Hrudey

Introduction

Capriccio: “a spritely improvisational musical dance involving multiple voices”

Introduces a fast, scalable user-level thread package for thread management and synchronization

Motivation

Internet servers and databases:
– Have ever-increasing scalability needs
– Need to handle hundreds of thousands of simultaneous connections without significant degradation
– Need a programming model that makes it easy to build efficient, robust servers

Approach

Utilizes user-level threads to provide a natural abstraction for high-concurrency programming
– Prior work discussed threads versus events

Decouples the thread package from the OS to take advantage of:
– Cooperative threading
– New asynchronous I/O interfaces
– Compiler support

Provides 3 key features:
– Scalability
– Linked stacks
– Resource-aware scheduling

Goals

To allow high performance without high complexity
Support for existing thread APIs (POSIX)
Scalability to 100,000s of threads
Flexibility to address application-specific needs
Little or no modification of the application itself

User Level Threads

Provide performance & flexibility advantages
Provide a clean programming model with useful invariants and semantics
Decouple the thread package from the OS
– Hides both OS variation & kernel evolution
– Integrates compiler support
Can complicate preemption
Can interact badly with the kernel scheduler

User Level Threads

Flexibility
– Take advantage of new asynchronous I/O mechanisms
– Tailored scheduling
– Lightweight (scale to 100,000 threads)

Performance
– Reduced synchronization overhead on uniprocessors
– More efficient memory management

Disadvantages
– Blocking I/O requires a wrapper layer to translate blocking calls to non-blocking I/O
– Lightweight synchronization benefit is diminished on multiprocessors

User Level Threads

Implementation (user-level library for Linux)
– Context switches: coroutine library
– I/O: intercepts blocking I/O calls; uses epoll() for pollable file descriptors and Linux AIO for disk
– Scheduling: the main loop looks like an event-driven application, running threads and checking for I/O completions
– Synchronization: cooperative scheduling improves synchronization efficiency
– Efficiency: thread management functions have bounded worst-case running times

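As a rough illustration of the I/O interception, here is a minimal C sketch of what a wrapped read() might look like; the fd is assumed to be non-blocking, and thread_yield_until_readable() is an invented name for the scheduler hook that parks the thread until epoll reports readiness (not Capriccio's actual API):

    #include <unistd.h>
    #include <errno.h>

    void thread_yield_until_readable(int fd);  /* assumed scheduler hook */

    ssize_t capriccio_read(int fd, void *buf, size_t len)
    {
        for (;;) {
            ssize_t n = read(fd, buf, len);    /* fd opened O_NONBLOCK */
            if (n >= 0)
                return n;                      /* data (or EOF) available */
            if (errno != EAGAIN && errno != EWOULDBLOCK)
                return -1;                     /* real error: report it */
            thread_yield_until_readable(fd);   /* run other threads meanwhile */
        }
    }
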
User Level Threads

Microbenchmark
– Testbed: 2x 2.4 GHz Xeon / 1 GB RAM / 2x 10K RPM SCSI Ultra II disks / 3x Gigabit Ethernet / Linux 2.5.70
– Thread packages tested: Capriccio, LinuxThreads, NPTL

Efficient Stack Management

Optimizes stack allocation for many threads
– Reduces the size of VM dedicated to stacks

Small non-contiguous stack chunks
– Grow and shrink at run time

Compiler analysis and runtime checks
– Generate a weighted, directed call graph

Efficient Stack Management
Weighted Call Graph

Nodes are functions, weighted by maximum stack frame size
Edges indicate function calls between nodes
A path is a sequence of stack frames
Checkpoints are code inserted at call sites

Efficient Stack Management

Places a reasonable bound on the amount of stack space consumed by each thread

Checkpoints determine whether enough space is left to reach the next checkpoint without overflow
– If not, a new stack chunk is allocated and the stack pointer adjusted

Checkpoint placement
– Break cycles in the call graph
– Scan nodes to ensure every path stays within the desired bound

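A hedged sketch of what a compiler-inserted checkpoint could reduce to; the helper names stack_space_left() and new_stack_chunk() are invented for illustration, not Capriccio's real interface:

    #include <stddef.h>

    size_t stack_space_left(void);        /* assumed: bytes left in current chunk */
    void   new_stack_chunk(size_t need);  /* assumed: link a new chunk, adjust SP */

    /* Inserted at a call site whose longest path to the next
     * checkpoint needs at most `need` bytes of stack: */
    #define STACK_CHECKPOINT(need)            \
        do {                                  \
            if (stack_space_left() < (need))  \
                new_stack_chunk(need);        \
        } while (0)

When the function that triggered the allocation returns, the extra chunk would be unlinked and reclaimed.
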
Efficient Stack Management

Special cases
– Function pointers complicate analysis
– External function calls

Tuning to optimize memory usage
– MaxPath
– MinChunk

Linked stacks can improve paging behavior

Apache SPECweb99 results: 3-4% slowdown overall

Resource Aware Scheduling

Thread scheduling and admission control adapt to resource usage

Application viewed as a sequence of stages separated by blocking points

Dynamic scheduling decisions are finer grained

Blocking graphs generated at runtime
– Learn behavior dynamically to improve scheduling
– Determine the impact on resource utilization of scheduling a thread

Resource Aware Scheduling
Blocking Graph

Nodes are program locations where threads block
Edges reflect consecutive blocking points
Edges annotated with weighted averages reflecting resource usage
Nodes annotated with weighted averages of their outgoing edge values
Threads walk this graph independently

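A small C sketch of how an edge's annotations might be maintained; the paper keeps exponentially weighted averages of resource usage, but the struct layout and the smoothing factor here are illustrative assumptions:

    struct edge_stats {
        double cpu;   /* avg CPU time used between the two blocking points */
        double mem;   /* avg memory allocated while traversing this edge */
    };

    void update_edge(struct edge_stats *e, double cpu_sample, double mem_sample)
    {
        const double alpha = 0.05;  /* assumed smoothing factor */
        e->cpu = (1.0 - alpha) * e->cpu + alpha * cpu_sample;
        e->mem = (1.0 - alpha) * e->mem + alpha * mem_sample;
    }
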
Resource Aware Scheduling

Promote nodes that release resources and demote nodes that acquire resources

Dynamically prioritize nodes (and their threads) for scheduling

Responds to changes in resource consumption due to the type of work and offered load

Implemented using separate run queues for each node

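One way the per-node run queues could feed the scheduler, sketched under the assumption that priority() scores a node by how much it tends to free the scarcest resource; all types and helpers here are invented for illustration, not Capriccio's real interfaces:

    struct queue { void *head; };          /* stand-in thread queue */

    typedef struct node {
        struct node *next;                 /* list of all blocking-graph nodes */
        struct queue runq;                 /* runnable threads parked at this node */
    } node_t;

    extern node_t *all_nodes;
    int    queue_empty(struct queue *q);
    void  *queue_pop(struct queue *q);
    double priority(node_t *n);            /* higher if the node frees resources */

    void *pick_next_thread(void)
    {
        node_t *best = NULL;
        for (node_t *n = all_nodes; n != NULL; n = n->next) {
            if (queue_empty(&n->runq))
                continue;                  /* no runnable threads at this node */
            if (best == NULL || priority(n) > priority(best))
                best = n;
        }
        return best ? queue_pop(&best->runq) : NULL;
    }
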
Resource Aware Scheduling

Usage
– Drive each resource toward maximum capacity, then throttle back; coupled with hysteresis, this keeps the system at full throttle

Challenges
– Determining the maximum capacity of a resource is tricky
– Resources interact with one another
– Thrashing can be difficult to detect
– Application-specific resources, e.g., memory management

Performance

Evaluates a real-world web server workload
Testbed
– 4x 500 MHz Pentium / 2 GB RAM / Gigabit Ethernet
– Linux 2.4.20
– Kernel version doesn't support epoll or AIO (poll used instead)
– Client load generated by up to 16 machines of similar configuration
– 3.2 GB of static file data with various file sizes
– Clients repeatedly connect, issuing 5 requests spaced 20 ms apart
– Cache sizes for Haboob and Knot limited to 200 MB to force disk activity
– Request frequencies for each file and size based on SPECweb99

Performance

[Graph: web server bandwidth vs. number of concurrent clients]
– Apache on Capriccio: ~15% bandwidth increase over standard Apache
– Knot comparable to the event-based Haboob

Performance

Overhead involved in maintaining information about resources at each node
– Gathering and maintaining statistics:
  <2% for edges in Apache
  Statistics remained fairly steady in the tested workloads
  Sampling at a ratio of 1/20 reduces aggregate overhead to 0.1%
– Stack trace overhead is significant (8% for Apache, 36% for Knot)
  Could be reduced with compiler integration

Future Work

Incorporate multiprocessor support
Reduce kernel crossings under heavy load with a batching interface for asynchronous I/O
Improve thrashing detection
Improve stack analysis for function pointers (CCured)
Develop profiling tools to optimize tuning parameters
Generate the blocking graph at compile time
Implement blocking-point fairness strategies

Conclusion

The thread package was “fixed” to support scalable, high-concurrency Internet servers
The threading model is more useful for high-concurrency programming
The user-level thread package is decoupled from the OS
– Can benefit from new I/O mechanisms and compiler support
Linked stacks and the scheduler delivered significant improvements in scalability and performance compared with existing systems

Observations

External function call stack size doesn't scale
Offloads responsibility to compiler support
– “compiler technology will play an important role in the evolution of the techniques described in this paper”

Performance test
– Data not qualified: how many runs? Are the results repeatable?
– The kernel didn't have the same non-blocking call support, so comparison is difficult; are the results still meaningful?

Stated goal of scaling to 100,000s of threads is not explicitly demonstrated

Discussion
1. It seems as though using a graph to dynamically adjust the stack size (vs. a default large stack size) is a smart thing to do, especially if memory is a problem. I'm trying to figure out if this is a new era of more intelligent thread packages, or if this is an overly complex solution which has been avoided. So what is the expense (in terms of computation) of this intelligent stack management? Is it necessary for this application to succeed?
Discussion
2. Capriccio can scale to 100,000 threads; what about more than 100,000 threads? Will the system just crash? Is there no mechanism in place if that happens?

3. I was wondering whether the dynamic stack chunks are mapped contiguously in the virtual memory of the thread. If this were the case, how could they achieve adding a chunk of memory to the stack as small as half a page?
Discussion
4. In the experimental section there is no mention of how many tests were performed, and from the looks of it there was just one, since otherwise vanilla Apache seems to dip and then improve in bandwidth as more clients connect. Also, Knot seems to have approximately the same performance as Haboob, so I'm wondering how conclusive these tests really are.
Discussion
5. The authors continually refer to their program’s ‘event-driven behavior’ (pages 3, 8, 11). In this way, it is a similar implementation to SEDA (in that both event and thread behaviors are exhibited). What is the implied advantage of fixing threads to behave like events over fixing events to behave like (or use) threads?
Discussion
6. What the authors seem to be doing with the scheduling of the system is wrap an event-based behavior (for I/O) into a thread-based abstraction. Is this extra layer of abstraction really needed? How much does the extra layer of abstraction affect the performance of the system in general? Also, why is it that people don't accept the fact that events are better for this type of task and just use them as they are, as opposed to dressing them up in thread costumes?
Discussion
7. One assumption that the authors make is that resource usage is likely to be similar for many tasks at a blocking point. They say that this assumption *seems* to hold in practice, which is of course not too convincing. Is this actually a good assumption to make? Are there any systems where it does not hold, and what would the consequences be for this piece of work?
Discussion
8. The authors comment that resource-aware scheduling is completely adaptive, but also concede that the system suffers from several parameter-tuning problems, such as knowing the maximum capacity of each resource and adjusting the speed of adaptation (no reason is given for using exponentially weighted averages). Finding optimal parameters could be a huge amount of additional work and may be too hard to do by hand. Doesn't this make things more complicated or uncontrollable?
Discussion
9. One of the key features incorporated into Capriccio is a new method of stack management, linked stack management, whose goal is to improve performance by reducing the wasted stack space typical of other types of stack management. Their approach is contingent on compiler support. Is it realistic to expect to see the development of a compiler for this purpose?
Discussion
10. In the case study, the authors choose MaxPath and MinChunk, the two tuning parameters available with their linked stack management algorithm, based on profiling information. Is it reasonable to expect the programmer to supply this information? How sensitive is the algorithm to these parameters?
Discussion
11. Would it be possible to use something like NPTL under low load, since it performs better than Capriccio, and then switch to Capriccio under higher loads when it begins to outperform NPTL? This would give the best of both and consistently maintain good performance.
Discussion
12. In Section 3.1, the authors use whole-program analysis to determine the maximum amount of stack space that a single stack frame for a function will consume. What about dynamic memory allocation? If the code allocates varying amounts of memory at run time, how can the program estimate the maximum stack size (or do they just give a rough estimate)?