• US Patent 7810093, Parallel-Aware Dedicated Job Co-Scheduling Within/Across Symmetric Multiprocessing Nodes
Date of Patent: October 5, 2010
Filed: November 15, 2004
Abstract: This patent
presents a parallel-aware co-scheduling method and system for improving the
performance and scalability of a dedicated parallel job having synchronizing collective operations in a parallel computing
environment comprised of a network of SMP nodes each having at least one
processor. The method uses a global co-scheduler and an operating system kernel
adapted to coordinate interfering system and daemon activities on a node and
across nodes to promote intra-node and inter-node overlap of said interfering
system and daemon activities as well as intra-node and inter-node overlap of
said synchronizing collective operations.
[1] IEEE 1588-2019, IEEE Standard for a Precision Clock Synchronization
Protocol for Networked Measurement and Control Systems, IEEE,
16 June 2020. Available online.
[2] Marcelo Barrios, Terry Jones, Scott Kinnane, Mathis Landzettel, Safran Al-Safran, Jerry Stevens, Christopher
Stone, Chris Thomas, and Ulf Troppens, Sizing and Tuning GPFS, Poughkeepsie, NY: IBM Corporation,
September 1999. Available online.
[3] MPI 2.0, MPI-2: Extensions to the Message-Passing Interface, The Message Passing Interface
Forum, July 1997. Available online.
[1] M. Ben Olson, Brandon Kammerdiener,
Michael R. Jantz, Kshitij A. Doshi, and Terry Jones. ACM Transactions on Architecture and Code Optimization, Volume 19, Issue 3, September 2022, Article No. 45, pp. 1–27. https://doi.org/10.1145/3533855.
[2] Terry Jones, Doug Arnold, Frank Tuffner,
Rodney Cummings, and Kang Lee. “Recent Advances in Precision Clock
Synchronization Protocols for Power Grid Control Systems.” Energies 14, no. 17 (2021): 5303. https://doi.org/10.3390/en14175303.
[3]
Elvis Rojas, Esteban Meneses,
Terry Jones, and Don Maxwell. Understanding failures through the lifetime of a
top-level supercomputer. Journal of Parallel and Distributed
Computing, 154, pp.27-41. August, 2021. https://doi.org/10.1016/j.jpdc.2021.04.001.
[4] Terry Jones, George Ostrouchov, Gregory A. Koenig, Oscar H. Mondragon, and Patrick
G. Bridges. An Evaluation of the state of time synchronization on leadership
class supercomputers. Journal of Concurrency and Computation: Practice and
Experience. 09 October 2017. https://doi.org/10.1002/cpe.4341
[5] Yoav Tock, Benjamin Mandler, José Moreira, and
Terry Jones. “Design and Implementation of a Scalable Membership Service
for Supercomputer Resiliency-Aware Runtime.” Lecture Notes in Computer Science Volume 8097,
2013, pp 354-366. Print ISBN 978-3-642-40046-9. doi: https://doi.org/10.1007/978-3-642-40047-6_37
[6] Terry Jones and Gregory Koenig. “Clock Synchronization in High-end Computing
Environments: A Strategy for Minimizing Clock Variance at Runtime.” Journal of
Concurrency and Computation: Practice & Experience, Volume 25, Issue 6, pages 881-897, April 25, 2013. https://doi.org/10.1002/cpe.2868
[7]
Terry
Jones, “Linux Kernel Co-Scheduling and Bulk Synchronous Parallelism.” The
International Journal of High Performance Computing
Applications (IJHPCA). Vol. 26, Issue 2, May 2012. doi: https://doi.org/10.1177/1094342011433523
[8]
Sayantan Chakravorty, Celso L. Mendes, Laxmikant
V. Kale, Terry Jones, Andrew Tauferner, Todd Inglett,
and Jose Moreira, “HPC-Colony: Services and Interfaces for Very Large Systems.” in Operating
Systems Review, vol. 40, no. 2, pp. 43-49, 2006. doi:
https://doi.org/10.1145/1131322.1131334
[1]
Elvis
Rojas, Fabricio Quirós-Corella, Terry Jones, and Esteban Meneses.
Large-Scale Distributed Deep Learning: A Study
of Mechanisms and Trade-Offs with PyTorch. In Latin
American High Performance Computing Conference.
Springer CCIS: Communications In Computer and
Information Science, vol 1540. 2022. pp. 177-192.
https://doi.org/10.1007/978-3-031-04209-6
[2]
Suren Byna, Stratos
Idreos, Terry Jones,
Kathryn Mohror, Rob Ross, and Florin Rusu. The Management and Storage of Scientific Data. United States,
2022. https://doi.org/10.2172/1845705 (brochure)
[3]
Rojas, Elvis, Fabricio Quirós-Corella, Terry
Jones, and Esteban Meneses. Large-Scale Distributed
Deep Learning: A Study of Mechanisms and Trade-Offs with PyTorch.
In proceedings of the Latin American High Performance Computing
Conference. Guadalajara, Mexico (Virtual). October 4-8, 2021. In: Gitler, I., Barrios
Hernández, C.J., Meneses, E. (eds) High Performance
Computing. CARLA 2021. Communications in Computer and Information Science, vol
1540. Springer, Cham. https://doi.org/10.1007/978-3-031-04209-6_13
[4] Elvis Rojas, Diego Perez, Jon Calhoun, Leonardo Bautista Gomez,
Terry Jones and Esteban Meneses. “Understanding Soft
Error Sensitivity of Deep Learning Models and Frameworks through Checkpoint
Alteration.” In proceedings of the 23rd IEEE Cluster conference
(Cluster 2021). Portland, OR (Virtual). Sept 7-10, 2021. https://doi.org/10.1109/Cluster48925.2021.00045
[5] Terry Jones, David Schoenwald, and Frank Tuffner. Challenges To Updating Timing In The Power Grid. The 30th Workshop on Synchronization and Timing Systems (WSTS ’21). Virtual. March 30-April 1st, 2021. Available online (accepted talk).
[6] Elvis Rojas, Esteban Meneses,
Terry Jones and Don Maxwell. “Towards a Model to Estimate the
Reliability of Large-scale Hybrid Supercomputers”.
In proceedings of The 26th International European Conference on
Parallel and Distributed Computing (Euro-Par 2020). Warsaw, Poland. Aug.
24-28, 2020. https://doi.org/10.1007/978-3-030-57675-2_3
[7] T. Chad Effler, Michael R.
Jantz, and Terry Jones. “Performance Potential of Mixed Data Management Modes
for Heterogeneous Memory Systems”. In proceedings of Workshop on Memory Centric
High Performance Computing (MCHPC 2020). Atlanta, GA.
Nov 11, 2020. https://doi.org/10.1109/MCHPC51950.2020.00007
[8] Elvis Rojas, Esteban Meneses,
Terry Jones and Don Maxwell. “Analyzing a Five-year Failure Record of a
Leadership-class Supercomputer”. In proceedings of The 33rd
International Symposium on Computer Architecture and High Performance Computing
(SBAC-PAD 2019). Campo Grande, Brazil. Oct. 15-18, 2019. https://doi.org/10.1109/SBAC-PAD.2019.00040
[9] M. Ben Olson, Brandon Kammerdiener,
Michael R. Jantz, Kshitij A. Doshi, and Terry Jones.
“Portable Application Guidance for Complex Memory Systems”. In Proceedings of 5th International Symposium on Memory Systems. Washington, DC.
Sept. 30 – Oct 3, 2019. https://doi.org/10.1145/3357526.3357575
[10] T. Chad Effler, Brandon Kammerdiener, Michael R. Jantz, Saikat
Sengupta, Prasad A. Kulkarni, Kshitij A. Doshi, and
Terry Jones. “Evaluating the Effectiveness of Program Data Features for Guiding
Memory Management”. In Proceedings of 5th International Symposium on Memory Systems. Washington, DC.
Sept. 30 – Oct 3, 2019. https://doi.org/10.1145/3357526.3357537
[11]
Terry
Jones, Michael J. Brim, Geoffroy Vallee, Benjamin Mayer, Aaron Welch, Tonglin Li, Michael Lang, Latchesar
Ionkov, Douglas Otstott, Ada Gavrilovska, Greg Eisenhauer, Thaleia
Doudali, and Pradeep Fernando. "UNITY: Unified
Memory and File Space" In Proceedings of the 7th International Workshop on
Runtime and Operating Systems for Supercomputers (ROSS'17). Washington, DC,
Jun. 2017. https://dl.acm.org/doi/10.1145/3095770.3095776
[12] Shakar Achanta, Magnus Danielson, Phil Evans, Terry Jones,
Harold Kirkham, Ya-Shian Li-Baboud,
Robert Orndorff, Alison Silverstein, Kyle Thomas, Gerardo Trevino, Frank Tuffner, and Marc Weiss. Time Synchronization in the
Electric Power System. Technical Report of the North American SynchroPhasor Initiative,
NASPI-2017-TR-001. NASPI, 60 pp. 2017. Available online.
[13]
Chris Kelley, Joan Pellegrino and Emmanuel
Taylor (eds.). Time Distribution Alternatives for the Smart Grid Workshop
Report. Technical Report NIST.SP.1500-12. National Institute of Standards and
Technology, Gaithersburg, MD, USA, 20899. 33 pp. 2017. https://doi.org/10.6028/NIST.SP.1500-12
[14]
Chris Kelley, Joan Pellegrino and Emmanuel
Taylor (eds.). Advanced Electrical Power System Sensors Workshop Report.
Technical Report NIST.SP.1500-11. National Institute of Standards and
Technology, Gaithersburg, MD, USA, 20899. 32 pp. 2017. https://doi.org/10.6028/NIST.SP.1500-11
[15] Terry Jones, Phil Evans. Is NTP A Suitable Timing Source
for Grid Applications? North American SynchroPhasor
Initiative (NASPI) ’17. Gaithersburg, MD. March 21-23, 2017. Available online (accepted talk).
[16]
E. Wes Bethel and Martin Greenwald (eds.).
Report of the DOE Workshop on Management, Analysis, and Visualization of
Experimental and Observational Data: The Convergence of Data and Computing.
Technical Report LBNL-1005155. Lawrence Berkeley National Laboratory, Berkeley,
CA, USA, 94720. 227 pp. 2016. https://doi.org/10.1109/eScience.2016.7870902
[17]
Terry Jones,
"Precision Time, Supercomputing and the Nation's Power Grid", i-PCGRID 2016, San Francisco, CA, Mar 2016. Available online (invited talk)
[18] Oscar Mondragon, Patrick Bridges, and Terry Jones. Quantifying Scheduling Challenges for Exascale System Software. International Workshop on Runtime and Operating Systems for Supercomputers (ROSS 15). Portland, OR. June, 2015. https://doi.org/10.1145/2768405.2768413
[19]
Brian
Kocoloski, John Lange, Hasan Abbasi, David Bernholdt, Terry Jones, Jai Dayal,
Noah Evans, Michael Lang, Jay Lofstead, Kevin Pedretti, and Patrick Bridges. System-Level Support for
Composition of Applications. International Workshop on Runtime and Operating
Systems for Supercomputers (ROSS 15). Portland, OR. June,
2015. https://doi.org/10.1145/2768405.2768412
[20] Esteban Meneses, Xiang
Ni, Terry R. Jones, and Don E. Maxwell. “Analyzing the Interplay of Failures and Workload on a
Leadership-Class Supercomputer”. Conference: CUG 2015, Chicago, IL, USA, April 2015. Available online.
[21] Terry Jones,
Bradley Settlemyer. “Fan-in Communication On A Cray Gemini Interconnect.”
CUG 2014, Lugano, Switzerland. May 2014. Available online.
[22]
Thomas
Ilsche, Joseph Schuchart,
Joseph Cope, Dries Kimpe, Terry Jones, Andreas Knuepfer, Kamil Iskra, Robert
Ross, Wolfgang E. Nagel, and Stephen W. Poole. “Optimizing I/O Forwarding Techniques for Extreme-Scale Event Tracing.”
Cluster Computing. March 2014, Volume 17, Issue 1, pp 1-18. https://doi.org/10.1007/s10586-013-0272-9.
[23]
Terry
Jones, Laxmikant Kalé, José
Moreira. “Project Final Report: HPC-Colony II.” Technical Report
ORNL/TM-2013/553, Oak Ridge National Laboratory, September,
2013. https://doi.org/10.2172/1105948
[24] Terry Jones, Douglas Fuller, and Sudharshan Vazhkudai. “Digital Object Identifiers for OLCF.”
Technical Report ORNL/TM-2013/370, Oak Ridge National Laboratory, September, 2013. Available online.
[25] Terry Jones, Sudharshan
Vazhkudai, Doug Fuller. “DOIs and Supercomputing.” DataCite 2013 Meeting. Washington, DC. September 2013.
Available online (invited talk)
[26]
Alex
Woodie. Vampir Rises To The
Occasion. HPC Wire. July-31-2013. Available online (Ezine Article)
[27]
Yanhua Sun, Gengbin Zheng, Chao Mei, Eric Bohm, James Phillips, Laxmikant Kale, and Terry Jones. “Optimizing Fine-grained Communication
in a Biomolecular Simulation Application on Cray XK6.” The 25th International
Conference for High Performance Computing, Networking, Storage and Analysis
(SC'12). Salt Lake City, UT. November 2012. https://doi.org/10.1109/SC.2012.87
[28]
Ziming Zheng, Zhiling Lan, Li Yu, and Terry Jones. “3-Dimensional Root
Cause via Co-analysis.” 9th International Conference on Autonomic Computing.
San Jose, CA. September 2012. https://doi.org/10.1145/2371536.2371571
[29] Thomas Ilsche, Joseph Schuchart,
Joseph Cope, Dries Kimpe, Terry Jones, Andreas Knuepfer, Kamil Iskra, Robert
Ross, Wolfgang Nagel and Stephen Poole. “Enabling Event Tracing at
Leadership-Class Scale through I/O Forwarding Middleware.” The 21st
International ACM Symposium on High-Performance Parallel and Distributed
Computing (HPDC 2012). Delft, Netherlands. June, 2012.
https://doi.org/10.1145/2287076.2287085
[30]
Li
Yu, Ziming Zheng, Zhiling
Lan, Terry Jones, Jim Brandt, and Ann Gentile. “Filtering log data: Finding the
Needles in the Haystack.” The 42nd IEEE/IFIP International Conference on
Dependable Systems and Networks (DSN 2012). Boston, MA. June, 2012. https://doi.org/10.1109/DSN.2012.6263948
[31]
Jonathan Lifflander, Phil
Miller, Ramprasad Venkataraman, Anshu Arya, Terry
Jones, Laxmikant Kale. “Mapping Dense LU
Factorization on Multicore Supercomputer Nodes.” 26th IEEE International
Parallel & Distributed Processing Symposium (IEEE IPDPS 2012). Shanghai,
China. May, 2012.
https://doi.org/10.1109/IPDPS.2012.61
[32] Yanhua Sun, Gengbin Zheng, Ryan Olson, Terry Jones, and Laxmikant Kale. “A uGNI-Based
Asynchronous Message-driven Runtime System for Cray Supercomputers with Gemini
Interconnect.” 26th IEEE International Parallel & Distributed Processing
Symposium (IEEE IPDPS 2012). Shanghai, China. May,
2012. https://doi.org/10.1109/IPDPS.2012.127
[33]
Y Tock, B Mandler, J Moreira, T Jones. “Poster: scalable infrastructure
to support supercomputer resiliency-aware applications and load balancing.” The 24th International Conference for High Performance Computing,
Networking, Storage and Analysis (SC'11). Seattle, WA. November 2011. https://doi.org/10.1145/2148600.2148606
(poster)
[34]
Jonathan
Lifflander, Phil Miller, Ramprasad Venkataraman, Anshu Arya, Terry Jones, Laxmikant
Kale. “Exploring Partial Synchrony in an Asynchronous Environment Using Dense
LU.” University of Illinois at Urbana-Champaign Technical Report PPL Technical
Report 11-34, August 2011. Available online.
[35]
Terry
Jones. “HPC Colony II Consolidated Annual Report: July-2010 to June-2011.”
Technical Report ORNL/LTR-2011/154, Oak Ridge National Laboratory, June, 2011. Available online.
[36]
Terry
Jones. “Linux Kernel Co-Scheduling For Bulk
Synchronous Parallel Applications.” International Workshop on Runtime and
Operating Systems for Supercomputers (ROSS 2011), Tucson, Arizona, USA. May
2011. https://doi.org/10.1145/1988796.1988805
[37]
Terry
Jones and Gregory Koenig. “Providing Runtime Clock Synchronization With Minimal Node-to-Node Time Deviation on XT4s and XT5s.”
CUG 2011, Fairbanks, AK, USA. June 2011. Available online.
[38] Terry Jones and Gregory
Koenig. “A Clock Synchronization Strategy for Minimizing Clock Variance at Runtime in High-end Computing Environments.” 22nd
International Symposium on Computer Architecture and High
Performance Computing (SBAC 2010), Rio de Janeiro, Brazil. October
2010. https://doi.org/10.1109/SBAC-PAD.2010.33
[39] Joshua Thompson, David W. Dreisigmeyer, Terry Jones, Michael
Kirby, and Joshua Ladd. “Accurate Fault Prediction of BlueGene/P
RAS Logs Via Geometric Reduction.” 1st International Workshop on Fault
Tolerance for HPC at Extreme Scale (FTXS 2010), Chicago, IL, June 2010. https://doi.org/10.1109/DSNW.2010.5542626
[40]
Terry
Jones, Andrew Tauferner, and Todd Inglett. “Linux OS
Jitter Measurements at Large Node Counts using a BlueGene/L.” Technical Report ORNL/TM-2009/303, Oak Ridge
National Laboratory, November, 2009. https://doi.org/10.2172/971232
[41]
Jeff Keasler, Terry Jones, and Dan Quinlan, “TALC: A
Simple C Language Extension For Improved Performance
and Code Maintainability.” 9th LCI International
Conference on High-Performance Clustered Computing, Urbana, IL, May 2008. Available online.
[42] Matthew Koop, Terry Jones, and Dhabaleswar Panda, “MVAPICH-Aptus: Scalable High-Performance Multi-Transport
MPI over InfiniBand.” 22nd IEEE
International Parallel and Distributed Processing Symposium (IPDPS 2008), Miami, FL,
May 2008. https://doi.org/10.1109/IPDPS.2008.4536283
[43]
M.
Koop, T. Jones and D. K. Panda, “Reducing Connection Memory Requirements of MPI
for InfiniBand Clusters: A Message Coalescing Approach.” 7th IEEE Int'l Symposium on Cluster Computing and the
Grid (CCGrid07), Rio de Janeiro, Brazil, May 2007. https://doi.org/10.1109/CCGRID.2007.92
[44]
Terry Jones, Andrew Tauferner, and
Todd Inglett, “HPC System Call Usage Trends.” 8th LCI International
Conference on High-Performance Clustered Computing - South Lake Tahoe, CA, May 2007.
Available online.
[45]
Terry
Jones, Laxmikant Kalé, José
Moreira. “HPC-Colony: Services and Interfaces to Support Systems With Very Large Numbers of Processors / 2006 Annual Report.”
Urbana, IL. 2007. Available online.
[46] T Jones, L
Kale, J Moreira, C Mendes, S Chakravorty, A Tauferner, T Inglett. “FY 2006 Accomplishment Colony: Services and Interfaces to Support Large
Numbers of Processors.” Technical
Report UCRL-TR-222559, Lawrence Livermore National Laboratory, July, 2006. https://doi.org/10.2172/929165
[47]
Terry
Jones, “Purple L1 Milestone Review Panel-MPI.” Technical Report UCRL-TR-226719,
Lawrence Livermore National Laboratory, June, 2006.
Available online.
[48]
T
Jones, L Kale, J Moreira, C Mendes, S Chakravorty, A Tauferner, T Inglett. “FY 2005 Accomplishment Colony: Services and Interfaces to Support Large
Numbers of Processors.” Technical Report UCRL-TR-213924,
Lawrence Livermore National Laboratory, July, 2005.
Available online.
[49] Terry Jones, “Reducing the Impact of Operating System Interference
on Scientific Applications.” ScicomP 11, Edinburgh, Scotland, June 3, 2005.
Available online (accepted talk)
[50]
Daniel
Han, Terry Jones, “MPI Profiling.” Technical Report UCRL-MI-209658, Lawrence
Livermore National Laboratory, August, 2004. https://doi.org/10.2172/15014654
[51] Daniel Han, Terry Jones, “Survey of MPI Call Usage.” ScicomP
2004, Austin, TX, August, 2004. Available online (accepted
talk).
[52]
Terry
Jones, Shawn Dawson, Rob Neely, William Tuel, Larry
Brenner, Jeff Fier, Robert Blackmore, Pat Caffrey,
Brian Maskell, Paul Tomlinson, and Mark Roberts, “Improving the Scalability of Parallel Jobs by adding
Parallel Awareness to the Operating System.” In The 16th International
Conference for High Performance Computing, Networking, Storage and Analysis
(SC'03). Phoenix, AZ, November 2003. https://doi.org/10.1109/SC.2003.10024
[53]
Terry
Jones, Jeff Fier, and Larry Brenner, “Impacts of
Operating Systems on the Scalability of Applications.” Technical Report
UCRL-MI-202629, Lawrence Livermore National Laboratory, March 5, 2003. Available online.
[54]
P.
Vaidyanathan, T. M. Madhyastha,
and T. R. Jones, “Input/Output Scalability of Genomic Alignment: How to
Configure a Computational Biology Cluster.” Technical Report UCRL-JC-145770,
Lawrence Livermore National Laboratory, October 23, 2001. Available online.
[55] Terry Jones and Linda Stanberry, “MPI On-node and Large Processor
Count Scaling Performance.” Fourth ScicomP meeting, Knoxville, TN, October 2001. Available online (accepted talk).
[56]
Terry Jones, “Using GPFS.” ScicomP 2000,
San Diego, CA, August 15, 2000. Available online (invited
talk).
[57]
Richard
Hedges, Terry Jones, John May, and R. Kim Yates, “Performance of an MPI-IO
Implementation using third-party Transfer.” Proceedings of the
Eighth NASA GSFC Storage Systems and Technology Conference, College Park,
MD, March 2000. Available online.
[58]
Terry
Jones, Alice Koniges, and R. Kim Yates, “Performance
of the IBM General Parallel File System.” Proceedings of the
International Parallel and Distributed Processing Symposium (IPDPS
2000),
Cancun, Mexico, May 2000. Available online.
[59]
Terry
Jones, Richard Mark, Jeanne Martin, John May, Elsie Pierce, and Linda
Stanberry, “An MPI-IO Interface to HPSS.” Proceedings of the Fifth
NASA GSFC Storage Systems and Technology Conference, College Park,
MD, September 1996. Available online.
[60]
T.R.
Jones, R. Mark, J. Martin, E. Pierce, and L.C. Stanberry, “Parallel data
transfer using MPI-IO.” Technical Report UCRL-JC-122386, Lawrence Livermore
National Laboratory, October, 1995. Available online.
[61] J.A. Rathkopf, T.R. Jones, and L.C. Stanberry,
“Parallelizing Monte Carlo with PMC.”
Proceedings of 1994 Symposium on Distributed Computing and Massively Parallel
Processing, Livermore, CA, June 1994. Available online.
Last modified: Sept 12, 2022 by Terry Jones
You can contact me at
This page is: http://xenon.stanford.edu/~trj/publications.html