An Overview of ASPLOS IX
Eduardo Pinheiro
Rutgers University
November 2000

Overview
• Keynote speech
• Quick statistics
• Papers
• Wild and crazy ideas session
• Conclusions

Statistics about ASPLOS
[Chart: share of papers from industry, academia, and industry/academia cooperation at ASPLOS '94, '96, '98, and '00 (y-axis 0-100%)]

More Statistics
[Chart: distribution of paper topics per conference, '94-'00 (y-axis 0-30%); categories include Storage/FS, Prof/Sampling, Perf/Eval, Networking, Memory/Cache, Arch, Compilers, Messaging, Synch, DSM, Sched/SMP, Others]

About the Papers in ASPLOS 2000
[Chart: papers per institution (0-5): North Carolina State, Washington, UCSD, Transmeta, Colorado, Illinois, Lucas Digital, HP, Arizona, Intel, Michigan, UMass, Texas, Duke, Berkeley, Vrije Amsterdam, Cornell, Stanford, IBM, Wisconsin-Madison, Compaq, CMU, and Rutgers?]

Average # of Authors and Acceptance Ratio
[Charts: average number of authors per paper (axis 0-5) and acceptance ratio (axis 18-24%) for '94, '96, '98, and 2000]

Keynote Speech
• Speaker: Butler Lampson.
• Difficulties of building hardware.
• Example of register leaking on register renaming.
• Dark corners of x86: instructions never used and not really supported by all Intel clones.
• TCP bugs and interoperability: manuals on how to be compatible with the bugs of TCP versions.
• “In theory there is no difference between practice and theory. In practice, there is.”
• Networking people are very distant from the rest of the systems people, for no obvious reason.

Papers
• MEMS devices as disks, CMU
– New magnetic media format: sliding media squares with thousands of heads on top, actuated by electromagnetic fields and springs.
– Low-latency, low-power-consumption portable devices.
– Same as a disk with a thousand heads, only lower power?
• AlphaServer GS320, Compaq WRL
– 32-64 processors, due to the lack of demand for larger SMPs and the lack of scalable applications.
– Directory-based cache coherence protocol with tweaks (avoids some messages to/from the home node).
– Design dates from 1996. Only deployed in 2000!

Papers
• Timestamp Snooping, Wisconsin-Madison
– Enables scaling of SMPs through switched interconnection networks.
– Snoop on timestamps; messages are reordered at the destination (a minimal software analogue of this reordering is sketched at the end of this page).
– Guarantees global time by cascading delays of other nodes' virtual time.
– Trade-off between speed and bandwidth consumption.

Papers
• MemorIES, IBM T.J. Watson
– Cache emulation with special hardware plugged into the PCI bus.
– Snoops memory accesses. Only sees L2 misses.
– Emulates different cache policies and configurations in real time and collects traces non-intrusively.
• FLASH vs. Simulated FLASH, Stanford and Cornell
– Doubts on simulator accuracy. Potential bugs.
– Absolute accuracy: impossible in simulation.
– Relative accuracy: shows trends. Can be done in simulation.
– Several comparisons of simulations against the real hardware.
– Conclusions: be honest, work harder, build the hardware.

Papers
• Meta-level compilation for FLASH, Stanford & Cornell
– Checking error conditions in an OS is hard. The kernel must verify special conditions (valid user pointers?) and assert others.
– Solution: a meta-level language that tells the compiler to check these coding rules.
– Found 34 errors in FLASH and another handful in Linux.
• Evaluating high-speed networks, Cornell and Vrije
– Reliability and multicast schemes on Myrinet.
– Question: who should forward packets? Host or interface?
– And retransmissions?
– Trade-offs and comparisons among the proposed schemes.
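The destination-side reordering in Timestamp Snooping can be illustrated with a small software analogue. The sketch below is hypothetical, not the paper's mechanism: it assumes FIFO delivery and strictly increasing timestamps on each source's link, and treats the minimum timestamp seen across all sources as the proof that no earlier-stamped request can still arrive. All names (receive, drain, safe_time, NSRC) are invented for illustration.

/*
 * Hypothetical sketch of destination-side timestamp reordering:
 * coherence requests carry logical timestamps, the switched network
 * may deliver them out of order, and each node buffers requests until
 * it can prove no earlier-stamped request can still arrive.
 * Assumes FIFO, strictly-increasing timestamps per source link.
 */
#include <stdio.h>

#define NSRC 4           /* number of source nodes (assumed)    */
#define QCAP 64          /* reorder buffer capacity (assumed)   */

typedef struct { int ts, src, payload; } Msg;

static Msg buf[QCAP];    /* pending, not-yet-safe messages      */
static int nbuf = 0;
static int last_ts[NSRC];/* newest timestamp seen per source    */

static void process(Msg m) {   /* act on a request, in order    */
    printf("process ts=%d src=%d payload=%d\n", m.ts, m.src, m.payload);
}

/* A message with ts <= min(last_ts[*]) is safe: with FIFO links and
 * increasing timestamps, every source has already sent past it.     */
static int safe_time(void) {
    int t = last_ts[0];
    for (int s = 1; s < NSRC; s++)
        if (last_ts[s] < t) t = last_ts[s];
    return t;
}

static void drain(void) {
    int t = safe_time();
    for (;;) {           /* naive min-selection; hardware would  */
        int best = -1;   /* use a priority structure             */
        for (int i = 0; i < nbuf; i++)
            if (buf[i].ts <= t && (best < 0 || buf[i].ts < buf[best].ts))
                best = i;
        if (best < 0) break;
        process(buf[best]);
        buf[best] = buf[--nbuf];
    }
}

void receive(Msg m) {    /* called on each network arrival       */
    last_ts[m.src] = m.ts;
    if (nbuf < QCAP) buf[nbuf++] = m;
    drain();
}

int main(void) {
    /* arrivals out of timestamp order over the switched fabric  */
    receive((Msg){ .ts = 3, .src = 0, .payload = 30 });
    receive((Msg){ .ts = 1, .src = 1, .payload = 10 });
    receive((Msg){ .ts = 2, .src = 2, .payload = 20 });
    receive((Msg){ .ts = 4, .src = 3, .payload = 40 }); /* safe time 1 */
    receive((Msg){ .ts = 5, .src = 0, .payload = 50 });
    receive((Msg){ .ts = 5, .src = 1, .payload = 51 }); /* safe time 2 */
    receive((Msg){ .ts = 5, .src = 2, .payload = 52 }); /* safe time 4 */
    /* processed in timestamp order 1,2,3,4 despite arriving 3,1,2,4;
     * the ts=5 messages stay buffered until src 3 passes time 5      */
    return 0;
}

Delaying buffered requests until the safe time advances is also a crude way to see the speed/bandwidth trade-off the paper mentions: a slow source stalls everyone until its timestamp catches up.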
Papers
• Communication scheduling, Stanford
– Scheduling between functional units in a VLIW processor.
– Sharing register files is difficult without proper scheduling techniques.
– Their technique minimizes communication on the interconnect between register files and functional units, requiring less power dissipation and die area at the same performance.
• Networked Sensors, Berkeley
– Ad hoc networks of small, low-power sensors. “Smart dust.”
– TinyOS is small (to fit on chip) and fast enough to communicate in real time via radio waves.
– 176 bytes in size. Context switches in 6 memory accesses.

Papers
• Power-aware page allocation, Duke
– Typically, lower latencies imply more power consumption.
– Cluster accesses onto the same memory chips; chips not being accessed can be put into standby, sleep, or power-down modes.
– Thresholds must be carefully chosen: transitioning too early or too late is bad (it delays computation or wastes power). A toy model of this trade-off is sketched at the end of this page.
– Allocation policies: random, first touch.
– Metric: energy*delay.
– Suggested thresholds for SPEC2000 and NT-trace apps.
– Compared against ideal transition schemes (based on perfect knowledge of future accesses) and performed well.

Papers
• Hoard: Scalable Memory Allocator, Texas & UMass
– Memory allocation is a bottleneck for multithreaded apps due to false sharing.
– The simple fix, private heaps, doesn't work: memory consumption is too high.
– Hoard: keep a shared pool of “superblocks” and grab blocks in chunks; free in chunks too, putting them back in the pool.
• TLP for Interactive Applications, Intel and Michigan
– Focus on improving the performance of delays perceptible by humans rather than maximizing end-to-end throughput.
– Measured the IPC of tasks with and without multiprocessing.
– Dual PCs improve mouse-click events by up to 36% (avg. of 22%) of a maximum possible 50%.
– Subjective study. Experiments not perfectly reproducible.

Papers
• Null Pointer Check Elimination in Java, IBM Tokyo
– Null-pointer checks have to be performed despite HW traps.
– HW traps only extend to one page; arrays and objects can have unbounded displacement.
– Move checks backwards, perform other optimizations, move checks forward again. With high probability, the other optimizations cancel the need for the null check.
• Frequent Value Cache, Arizona
– Stores frequently used values in compressed form in a direct-mapped cache.
– Small size. Low overhead.
– Reduces miss rates of SPECInt95 apps by 1% to 68%.
– Better than doubling the cache for some applications.
– How are frequent values discovered? Not clear…

Papers
• Value Sampling, Compaq SRC
– Profiling tool for user code and the kernel. Extension of DCPI.
– Interprets instructions for a short period of time on interrupts.
– Low overhead (10% slowdown).
• XOM: eXecute-Only Memory, Stanford
– Nothing is secure, not even private memory. Only the chip is.
– A new processor with a built-in, on-chip private key and encryption/decryption mechanisms.
– Software is distributed encrypted for the receiver's on-chip key, so only that unique processor can run it.
– Compartments are special memory, private to an application and encrypted.
– Unencrypted (null) compartments exist to enable IPC.
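The threshold trade-off in the Duke paper is easy to see in a toy model. The sketch below is hypothetical, not the paper's policy, power states, or numbers: one memory chip, two power states, a fixed idle threshold, a wake-up penalty, and an energy*delay-style figure of merit. All names and constants (ACTIVE_PW, STANDBY_PW, WAKEUP_TICKS, tick) are made up.

/*
 * Toy model of threshold-based memory power management, illustrating
 * the too-early/too-late trade-off.  Hypothetical policy and numbers:
 * one chip, active vs. standby, fixed idle threshold, wake-up penalty.
 */
#include <stdio.h>

#define ACTIVE_PW    300e-3  /* watts while active (assumed)       */
#define STANDBY_PW    20e-3  /* watts in standby (assumed)         */
#define WAKEUP_TICKS 10      /* delay to leave standby (assumed)   */

typedef struct {
    int    standby;  /* current power state                        */
    int    idle;     /* ticks since last access                    */
    double energy;   /* accumulated energy (watt*ticks)            */
    long   delay;    /* extra ticks spent waking the chip up       */
} Chip;

/* Advance one tick; accessed != 0 means the chip is referenced.    */
static void tick(Chip *c, int accessed, int idle_threshold) {
    if (accessed) {
        if (c->standby) {             /* too-early transitions pay  */
            c->delay += WAKEUP_TICKS; /* the wake-up latency        */
            c->standby = 0;
        }
        c->idle = 0;
    } else if (++c->idle >= idle_threshold) {
        c->standby = 1;               /* too-late transitions waste */
    }                                 /* active power while idle    */
    c->energy += c->standby ? STANDBY_PW : ACTIVE_PW;
}

int main(void) {
    /* a bursty pattern: 5 accesses, then 40 idle ticks, repeated   */
    for (int thr = 1; thr <= 64; thr *= 4) {
        Chip c = {0};
        for (int rep = 0; rep < 100; rep++) {
            for (int i = 0; i < 5;  i++) tick(&c, 1, thr);
            for (int i = 0; i < 40; i++) tick(&c, 0, thr);
        }
        /* energy*delay-style figure of merit: lower is better      */
        printf("threshold=%2d energy=%.1f delay=%ld e*d=%.0f\n",
               thr, c.energy, c.delay, c.energy * (4500 + c.delay));
    }
    return 0;
}

Sweeping the threshold shows the paper's point: a tiny threshold saves energy but pays the wake-up delay on every burst, a huge one wastes active power while idle, and the sweet spot depends on the access pattern.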
Papers
• Fast symmetric-key cryptography, Michigan
– Crypto systems are slow and can't be parallelized.
– Suggested adding new instructions to general-purpose processors to speed up encryption/decryption.
– 59-74% speedup from adding these new instructions.
• OceanStore, Berkeley
– Megalomaniacal project on storage systems.
– Envisions mobility, persistence, availability, durability, protection, etc.
– All files reside on shared servers on the Internet.
– Big companies are responsible for moving data around to where the clients are (similar to cell-phone service providers).
– Localization of data through Plaxton's data structure.

Papers
• Software Profiling, HP Labs
– Trade-off between performance and profiling.
– Found that less profiling leads to more effective hot-path prediction.
• Design of the IA-64 Architecture, Intel, HP, Lucas Digital
– Idea: let the compiler and OS fine-tune the performance of the processor by giving them more parameters to play with.
– Control speculation is done by the compiler: speculative load (ld.s) and check (chk.s).
– Data speculation: advanced loads (ld.a) and check (chk.a). Loads can be moved before stores, and the check is placed after the stores; if a conflict happens, an exception is generated.
– Stack optimizations.

Papers
• Computation reuse, Colorado, Illinois, Transmeta
– Reuse buffers: store computations that are likely to happen again, obviating the need to redo the work (a minimal software analogue is sketched at the end of these notes).
– Traditionally, value profiling was needed to help the compiler find and dynamically fill reuse buffers.
– Here: hardware that detects potentially interesting reusable regions.
– Three components: reuse sentry, evaluation buffer, reuse monitoring system.
• Symbiotic Jobscheduling for SMT, UC San Diego
– What jobs should be coscheduled to improve performance?
– Gets statistics from the hardware counters.
– Many policies: IPC balance, D-cache hits, diversity, etc.

Papers
• Analysis of OS behavior on SMTs, Washington
– Simulated an Alpha SMT running Digital Unix with SimOS.
– Shows Apache speeding up 4x (but IPC is only 1.1).
– For SPECInt, the presence of the OS drops performance 5-15%.
– D-cache use is bad, due to kernel-kernel conflicts.
– Lots more data in the paper.
• Slipstream Processors, North Carolina State
– Goal: shorten instruction streams to get the same outputs, but faster (predict control flow and data results ahead of time).
– Show that only 20% of the code of gcc is needed.
– Run the shorter version together with the full version to validate it.
– Partial fault tolerance through redundancy: the two copies can detect HW faults.
– Improvements of 7% for SPECInt95.
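The reuse-buffer idea is essentially hardware memoization, which can be mimicked in software. This is a hypothetical analogue, not the paper's design: a small direct-mapped table maps (region id, input operands) to a previously computed result and skips the computation on a hit; the reuse sentry, evaluation buffer, and monitoring system are not modeled, and all names (reuse, rb_index, RB_SIZE) are invented.

/*
 * Hypothetical software analogue of a computation reuse buffer:
 * a direct-mapped table keyed by (region id, inputs) that returns
 * a memoized result on a hit, skipping the computation.
 */
#include <stdint.h>
#include <stdio.h>

#define RB_SIZE 256               /* entries, power of two (assumed) */

typedef struct {
    int      valid;
    uint32_t region;              /* id of the reusable code region  */
    int64_t  in1, in2;            /* captured input operands         */
    int64_t  result;              /* memoized output                 */
} RBEntry;

static RBEntry rb[RB_SIZE];
static long hits, misses;

static unsigned rb_index(uint32_t region, int64_t a, int64_t b) {
    return (region * 31u + (uint32_t)a * 17u + (uint32_t)b)
           & (RB_SIZE - 1);
}

/* Look up (region, a, b); on a miss, run fn and fill the entry.     */
static int64_t reuse(uint32_t region, int64_t a, int64_t b,
                     int64_t (*fn)(int64_t, int64_t)) {
    RBEntry *e = &rb[rb_index(region, a, b)];
    if (e->valid && e->region == region && e->in1 == a && e->in2 == b) {
        hits++;
        return e->result;         /* reuse: skip the computation     */
    }
    misses++;
    *e = (RBEntry){ 1, region, a, b, fn(a, b) };
    return e->result;
}

/* stand-in for an expensive, side-effect-free region                */
static int64_t slow_mul(int64_t a, int64_t b) {
    int64_t r = 0;
    for (int64_t i = 0; i < b; i++) r += a;
    return r;
}

int main(void) {
    int64_t sum = 0;
    for (int i = 0; i < 1000; i++)
        sum += reuse(1, i % 8, 1000, slow_mul);  /* repeating inputs */
    printf("sum=%lld hits=%ld misses=%ld\n", (long long)sum, hits, misses);
    return 0;
}

As in the hardware proposal, this only pays off when the region is expensive, side-effect-free, and its inputs actually repeat, which is exactly what the reuse monitoring hardware is there to detect.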