
An Overview of ASPLOS IX
Eduardo Pinheiro
Rutgers University
November 2000
Overview
• Keynote speech
• Quick statistics
• Papers
• Wild and crazy ideas session
• Conclusions
Statistics about ASPLOS
[Chart: share of papers from Industry, Academia, and Industry/Academia cooperation (0-100%), for ASPLOS '94, '96, '98, and '00]
More Statistics
[Chart: percentage of papers by topic (0-30%) for '94, '96, '98, and '00; topics include FS, Networking, Memory/Cache, Perf/Eval, Arch, Storage, Prof/Sampling, Compilers, Paging, Synch, DSM, Sched/SMT, and Others]
About the Papers in ASPLOS 2000
[Chart: number of papers (0-5) per institution: North Carolina State, Washington, UCSD, Transmeta, Colorado, Illinois, Lucas Digital, HP, Arizona, Intel, Michigan, UMass, Texas, Duke, Berkeley, Vrije Amsterdam, Cornell, Stanford, IBM, Wisconsin-Madison, Compaq, CMU, Rutgers?]
Average # of Authors and Acceptance Ratio
[Charts: average number of authors per paper (0-5) and acceptance ratio (18%-24%), for '94, '96, '98, and 2000]
Keynote Speech
• Speaker: Butler Lampson.
• Difficulties of building hardware.
• Example of register leaking in register renaming.
• Dark corners of x86: instructions that are never used and not really supported by all Intel clones.
• TCP bugs and interoperability: manuals on how to be compatible with the bugs of specific TCP versions.
• “In theory there is no difference between practice and theory. In practice, there is.”
• Networking people are very distant from the rest of the systems people, for no obvious reason.
Papers
• MEMS devices as disks, CMU
– New magnetic media format: sliding media squares with thousands of heads on top, actuated by electromagnetic fields and springs.
– Low latency, low power consumption; aimed at portable devices.
– Same as a disk with a thousand heads, only lower power?
• AlphaServer GS320, Compaq WRL
– 32-64 processors, due to lack of demand for larger SMPs and to the lack of scalable applications.
– Directory-based cache coherence protocol with tweaks (avoids some messages to/from the home node).
– Design dates from 1996. Only deployed in 2000!
Papers
• Timestamp Snooping, Wisconsin-Madison
– Enables scaling of SMPs through switched interconnection networks.
– Snooping on timestamps, with messages reordered at the destination (sketch below).
– Guarantees a global time order by delaying other nodes' virtual time in cascading fashion.
– Trade-off between speed and bandwidth consumption.
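A minimal sketch of the reordering idea in Python (my own simplification, not the paper's actual protocol): messages carry logical timestamps, the receiver buffers them in a priority queue, and draining the queue in timestamp order recovers one global order even over an unordered switched network.

import heapq

class Node:
    def __init__(self, name):
        self.name = name
        self.clock = 0        # local logical time
        self.pending = []     # min-heap keyed on (timestamp, sender)

    def send(self, dest, payload):
        self.clock += 1       # stamp the message before it leaves
        heapq.heappush(dest.pending, (self.clock, self.name, payload))

    def drain(self):
        # Apply buffered messages in global timestamp order, not
        # arrival order, pulling the local clock forward as we go.
        while self.pending:
            ts, sender, payload = heapq.heappop(self.pending)
            self.clock = max(self.clock, ts)
            print(f"{self.name} applies {payload!r} from {sender} at t={ts}")

a, b, c = Node("A"), Node("B"), Node("C")
a.send(c, "invalidate X")   # t=1 from A
a.send(c, "writeback X")    # t=2 from A
b.send(c, "read Y")         # t=1 from B
c.drain()                   # applied as (1,A), (1,B), (2,A)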
Papers
• MemorIES, IBM T.J. Watson
– Cache emulation with special hardware plugged into the PCI bus.
– Snoops memory accesses; only sees L2 misses.
– Emulates different cache policies and configurations in real time; collects traces non-intrusively.
• FLASH vs. Simulated FLASH, Stanford and Cornell
– Doubts about simulator accuracy; potential bugs.
– Absolute accuracy: impossible in simulation.
– Relative accuracy: shows trends; can be done in simulation.
– Several simulation comparisons with the real hardware.
– Conclusions: be honest, work harder, build the hardware.
Papers
• Meta-level compilation for FLASH, Stanford & Cornell
– Checking error conditions in an OS is hard. The kernel must verify special conditions (are user pointers valid?) and assert others.
– Solution: a meta-level language that tells the compiler to check these coding rules (see the sketch after this list).
– Found 34 errors in FLASH and another handful in Linux.
• Evaluating high-speed networks, Cornell and Vrije
– Reliability and multicast schemes on Myrinet.
– Question: who should forward packets, the host or the interface?
– And who should handle retransmissions?
– Trade-offs and comparisons among the proposed schemes.
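As a toy illustration of compiler-checked coding rules (not the paper's actual meta-level language; the rule and the event encoding are invented here), the Python sketch below walks a function's pointer operations with a tiny state machine and flags any dereference of a user pointer that was never validated:

# Rule: "a user pointer must be validated before it is dereferenced."
RULE = "dereference of unchecked user pointer"

def check(events):
    """events: list of (op, pointer) pairs in program order."""
    validated, errors = set(), []
    for i, (op, ptr) in enumerate(events):
        if op == "validate":                  # e.g. a range/mapping check
            validated.add(ptr)
        elif op == "deref" and ptr not in validated:
            errors.append((i, ptr, RULE))     # rule violation found
    return errors

buggy = [("deref", "ubuf"), ("validate", "ubuf")]   # validates too late
clean = [("validate", "ubuf"), ("deref", "ubuf")]
print(check(buggy))   # -> [(0, 'ubuf', 'dereference of unchecked user pointer')]
print(check(clean))   # -> []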
Papers
• Communication scheduling, Stanford
– Scheduling between functional units in a VLIW processor.
– Sharing register files is difficult because it demands careful scheduling techniques.
– Their technique minimizes communication on the interconnect between register files and functional units, requiring less power dissipation and die area at the same performance.
• Networked Sensors, Berkeley
– Ad hoc networks of small, low-power sensors. “Smart dust.”
– TinyOS is small (to fit on chip) and fast enough to communicate in real time via radio (sketch below).
– 176 bytes in size. Context switches in 6 memory accesses.
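A rough Python sketch of that event-driven structure (names and API are mine, not TinyOS's): handlers are plain functions that run to completion off a FIFO task queue, so a "context switch" is little more than a queue pop.

from collections import deque

tasks = deque()

def post(handler, *args):
    tasks.append((handler, args))        # scheduling = one queue append

def run():
    while tasks:
        handler, args = tasks.popleft()  # "context switch" = one queue pop
        handler(*args)                   # runs to completion, no preemption

def radio_packet(data):
    print("got packet:", data)
    post(blink_led)                      # handlers may post follow-up tasks

def blink_led():
    print("LED blink")

post(radio_packet, b"\x01\x02")
run()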
Papers
• Power-aware page allocation, Duke
– Typically, lower latencies imply more power consumption.
– Clusters accesses onto the same memory chips. Chips not being accessed can be put in standby, sleep, or power-down modes.
– Thresholds must be carefully chosen: transitioning too early or too late is bad, delaying computation or wasting power (sketch below).
– Allocation policies: random, first touch.
– Metric: energy*delay.
– Suggested thresholds for SPEC2000 apps and NT traces.
– Compared against ideal transition schemes (based on perfect knowledge of the future) and performed well.
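A back-of-the-envelope sketch of the threshold trade-off, with invented power and delay numbers (the paper's real parameters and policies differ): a chip idle past the threshold is dropped into a low-power mode, and waking it later costs extra delay.

POWER_MW = {"active": 300.0, "nap": 30.0}      # invented numbers
WAKE_NS = {"active": 0.0, "nap": 50.0}         # invented wake-up penalties

def energy_times_delay(idle_gaps_ns, threshold_ns, low="nap"):
    energy, delay = 0.0, 1e6                   # 1 ms baseline runtime (invented)
    for gap in idle_gaps_ns:
        if gap > threshold_ns:
            energy += POWER_MW["active"] * threshold_ns      # wait out the threshold
            energy += POWER_MW[low] * (gap - threshold_ns)   # then nap
            delay += WAKE_NS[low]                            # pay to wake up
        else:
            energy += POWER_MW["active"] * gap               # never transitioned
    return energy * delay                      # the paper's metric: energy*delay

gaps = [10, 2000, 40, 80000, 15]               # idle gaps between accesses, ns
for t in (100, 1000, 10000):
    print(f"threshold {t:>5} ns -> energy*delay {energy_times_delay(gaps, t):.3e}")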
Papers
• Hoard: Scalable Memory Allocator, Texas & UMass
– Memory allocation is a bottleneck for multithreaded apps due to false sharing.
– The simple fix, private heaps, doesn't work: memory consumption gets too high.
– Hoard: keep a shared pool of “superblocks” and grab blocks in chunks; free in chunks too, putting them back in the pool (see the sketch after this list).
• TLP for interactive applications, Intel and Michigan
– Focus on improving delays perceptible by humans, rather than maximizing end-to-end throughput.
– Measured the IPC of tasks with and without multiprocessing.
– Dual PCs improve mouse-click events by up to 36% (avg. of 22%) out of a maximum possible 50%.
– Subjective study. Experiments are not perfectly reproducible.
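A toy Python analogue of the superblock scheme (the constants and structure are simplified guesses, not Hoard's real design): each per-thread heap allocates out of whole superblocks and returns completely empty ones to a global, locked pool, which bounds how much memory any one thread can hoard.

import threading

SLOTS = 8                                   # objects per superblock (invented)

class Superblock:
    def __init__(self):
        self.free = list(range(SLOTS))      # free slot indices

class GlobalPool:
    def __init__(self):
        self.blocks, self.lock = [], threading.Lock()
    def get(self):
        with self.lock:                     # the only cross-thread lock
            return self.blocks.pop() if self.blocks else Superblock()
    def put(self, sb):
        with self.lock:
            self.blocks.append(sb)

class ThreadHeap:
    def __init__(self, pool):
        self.pool, self.blocks = pool, []
    def malloc(self):
        for sb in self.blocks:
            if sb.free:
                return sb, sb.free.pop()
        sb = self.pool.get()                # grab a whole superblock at once
        self.blocks.append(sb)
        return sb, sb.free.pop()
    def free(self, sb, slot):
        sb.free.append(slot)
        if len(sb.free) == SLOTS:           # superblock empty: recycle it
            self.blocks.remove(sb)
            self.pool.put(sb)

pool = GlobalPool()
heap = ThreadHeap(pool)
objs = [heap.malloc() for _ in range(10)]   # spills into two superblocks
for sb, slot in objs:
    heap.free(sb, slot)
print("superblocks back in the pool:", len(pool.blocks))   # -> 2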
Papers
• Null Pointer Check Elimination in Java, IBM Tokyo
– Null pointer checks must still be performed despite HW traps.
– HW traps only extend to one page; arrays and objects could have unbounded displacement.
– Move checks backwards, perform other optimizations, then move checks forward again. With high probability, the other optimizations cancel the need for the null check.
• Frequent Value Cache, Arizona
– Stores frequently used values in compressed form in a direct-mapped cache (sketch below).
– Small size. Low overhead.
– Reduces miss rates of SPECInt95 apps by 1% to 68%.
– Better than doubling the cache for some applications.
– How are frequent values discovered? Not clear…
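A toy model of the frequent-value idea (organization and parameters invented here): a tiny direct-mapped side cache stores only words whose values appear in a small frequent-value table, so each entry holds a 2-bit code rather than a full 32-bit word.

FREQUENT = [0, 1, -1, 255]                # assumed profiling results
CODE = {v: i for i, v in enumerate(FREQUENT)}
SETS = 4
cache = [None] * SETS                     # each entry: (tag, 2-bit code)

def read(addr):
    idx, tag = addr % SETS, addr // SETS
    entry = cache[idx]
    if entry and entry[0] == tag:
        return FREQUENT[entry[1]]         # hit: decode the short code
    return None                           # miss: fall back to the normal cache

def fill(addr, value):
    code = CODE.get(value)
    if code is not None:                  # only frequent values fit
        cache[addr % SETS] = (addr // SETS, code)

fill(0x10, 0)
print(read(0x10))    # -> 0, reconstructed from a 2-bit code
fill(0x14, 12345)    # not a frequent value: never enters this cache
print(read(0x14))    # -> None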
Papers
• Value Sampling, Compaq SRC
– Profiling tool for user code and the kernel. Extension of DCPI.
– Interprets instructions for a short period of time on interrupts.
– Low overhead (10% slowdown).
• XOM: eXecute-Only Memory, Stanford
– Nothing is secure, not even private memory. Only the chip is.
– A new processor with a built-in, on-chip private key and encryption/decryption mechanisms.
– Software is distributed by encrypting code with the receiving processor's public key; only that unique processor, which holds the matching private key, can run it (sketch below).
– Compartments are special memory regions, private to an application and encrypted.
– Unencrypted (null) compartments exist to enable IPC.
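Conceptually, the flow looks like the sketch below (a stand-in XOR "cipher" replaces XOM's real asymmetric cryptography, and every name here is mine): code is encrypted for one specific chip, and only that chip's internal key can turn it back into runnable instructions.

import os

def toy_cipher(data: bytes, key: bytes) -> bytes:
    # XOR stand-in: applying it twice with the same key round-trips.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

class XomChip:
    def __init__(self):
        self._secret = os.urandom(16)     # never leaves the chip
    def distribution_key(self):
        return self._secret               # toy: real XOM hands out a *public* key
    def run(self, encrypted_code: bytes):
        code = toy_cipher(encrypted_code, self._secret)   # decrypt on-chip only
        print("executing:", code.decode(errors="replace"))

chip_a, chip_b = XomChip(), XomChip()
program = toy_cipher(b"ADD R1, R2", chip_a.distribution_key())
chip_a.run(program)   # -> executing: ADD R1, R2
chip_b.run(program)   # -> garbage: wrong chip, wrong key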
Papers
• Fast symmetric-key cryptography, Michigan
– Crypto systems are slow and can't be parallelized.
– Suggests adding new instructions to general-purpose processors to speed up encryption/decryption.
– 59-74% speedup from adding these new instructions.
• OceanStore, Berkeley
– Megalomaniacal project on storage systems.
– Envisions mobility, persistence, availability, durability, protection, etc.
– All files reside on shared servers on the Internet.
– Big companies are responsible for moving data to where clients are (similar to cell phone service providers).
– Data is located through Plaxton's data structure (sketch below).
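A rough sketch of Plaxton-style location (the details differ from OceanStore's actual mechanism; the IDs below are invented): node and object IDs are digit strings, and each hop forwards to a node that matches the object's ID in one more trailing digit, so a lookup takes at most one hop per digit.

NODES = ["0213", "1330", "2113", "3120", "3213", "1213"]

def shared_suffix(a, b):
    n = 0
    while n < len(a) and a[-1 - n] == b[-1 - n]:
        n += 1
    return n

def route(start, obj_id):
    current, path = start, [start]
    while current != obj_id:
        match = shared_suffix(current, obj_id)
        # any node that extends the matched suffix by one more digit
        better = [n for n in NODES if shared_suffix(n, obj_id) > match]
        if not better:
            break                        # closest node holds the pointer
        current = better[0]
        path.append(current)
    return path

print(route("1330", "3213"))   # -> ['1330', '0213', '3213']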
Papers
• Software Profiling, HP Labs
– Trade-off between performance and profiling.
– Found that less profiling leads to more effective hot-path prediction.
• Design of the IA-64 Architecture, Intel, HP, Lucas Digital
– Idea: let the compiler and OS fine-tune the performance of the processor by giving them more parameters to play with.
– Control speculation is done by the compiler: speculative load (ld.s) and check (chk.s).
– Data speculation: advance loads (ld.a) and check (chk.a). Loads can be moved above stores, with the check placed after the stores; if a conflict happens, an exception is generated (sketch below).
– Stack optimizations.
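A Python pseudo-model of the ld.a/chk.a contract (heavily simplified; real IA-64 tracks advance loads in the ALAT hardware table): the advance load records its address, a conflicting store invalidates that record, and the check falls back to a reload only when a conflict actually occurred.

memory = {0x100: 7, 0x200: 9}
alat = {}                          # address -> advance load still valid?

def ld_a(addr):
    alat[addr] = True              # record the hoisted load's address
    return memory[addr]

def store(addr, value):
    memory[addr] = value
    alat.pop(addr, None)           # a conflicting store kills the entry

def chk_a(addr, speculative_value):
    if alat.get(addr):
        return speculative_value   # no conflict: speculation paid off
    return memory[addr]            # conflict: recovery code reloads

v = ld_a(0x100)                    # load hoisted above the store
store(0x100, 42)                   # turns out to conflict
print(chk_a(0x100, v))             # -> 42 (recovered), not the stale 7

v = ld_a(0x200)
store(0x300, 1)                    # unrelated store: entry survives
print(chk_a(0x200, v))             # -> 9, speculation succeeded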
Papers
• Computation reuse, Colorado, Illinois, Transmeta
– Reuse buffers store computations that are likely to happen again, obviating the need to redo the work (see the sketch after this list).
– Traditionally, value profiling was needed to help the compiler find and dynamically fill reuse buffers.
– Here, hardware detects potentially interesting reusable regions.
– Three components: reuse sentry, evaluation buffer, reuse monitoring system.
• Symbiotic Jobscheduling for SMT, UC San Diego
– Which jobs should be coscheduled to improve performance?
– Gets statistics from the hardware counters.
– Many policies: IPC balance, D-cache hits, diversity, etc.
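A software analogue of a reuse buffer (the paper's mechanism lives in hardware; this is just memoization keyed on a region's inputs): a hit on identical inputs skips re-executing the region entirely.

reuse_buffer = {}                     # inputs -> previously computed outputs
executions = 0

def region(a, b):
    """Stand-in for an expensive, side-effect-free code region."""
    global executions
    if (a, b) in reuse_buffer:        # reuse hit: no recomputation
        return reuse_buffer[(a, b)]
    executions += 1
    result = sum(i * a + b for i in range(100_000))
    reuse_buffer[(a, b)] = result     # fill the buffer for next time
    return result

for _ in range(5):
    region(3, 4)                      # same inputs every time
print("actual executions:", executions)   # -> 1 (four reuse hits)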
Papers
• Analysis of OS behavior on SMTs, Washington
– Simulates an Alpha SMT running Digital Unix on SimOS.
– Shows Apache speeds up 4x (but IPC is only 1.1).
– For SPECInt, the presence of the OS drops performance 5-15%.
– D-cache use is bad, due to kernel-kernel conflicts.
– Lots more data.
• Slipstream Processors, North Carolina State
– Goal: shorten instruction streams to get the same outputs, but faster (predict control flow and data results ahead of time).
– Shows that only 20% of the code of gcc is needed.
– Runs the shorter version together with the long version to validate it (sketch below).
– Partial fault tolerance via redundancy: HW faults can be detected.
– Improvements of 7% for SPECInt95.
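A crude sketch of the two-stream idea (the program and structure are invented for illustration): the A-stream runs a shortened program and hands its outcomes to the R-stream, which runs the full program and squashes/resynchronizes the A-stream on a mismatch.

def full_program(x):
    y = x * 2
    y += 0                        # "ineffectual" work the A-stream may skip
    return y + 1

def shortened_program(x):
    return x * 2 + 1              # same outcome, fewer instructions

predictions = [shortened_program(x) for x in range(5)]    # A-stream, runs ahead
for x, predicted in zip(range(5), predictions):           # R-stream validates
    actual = full_program(x)
    if predicted == actual:
        print(f"x={x}: prediction {predicted} confirmed; R-stream speeds up")
    else:
        print(f"x={x}: mismatch; squash A-stream, resync at {actual}")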