Terry Jones’ Publications

Patents

  US Patent 7810093, Parallel-Aware Dedicated Job Co-Scheduling Within/Across Symmetric Multiprocessing Nodes

Date of Patent: October 5, 2010

Filed: November 15, 2004

 

Abstract:

This patent presents a parallel-aware co-scheduling method and system for improving the performance and scalability of a dedicated parallel job having synchronizing collective operations in a parallel computing environment comprised of a network of SMP nodes each having at least one processor. The method uses a global co-scheduler and an operating system kernel adapted to coordinate interfering system and daemon activities on a node and across nodes to promote intra-node and inter-node overlap of said interfering system and daemon activities as well as intra-node and inter-node overlap of said synchronizing collective operations.

 

Books & Major Specifications

[1]       IEEE 1588-2019, IEEE Standard for a Precision Clock Synchronization Protocol for Networked Measurement and Control Systems, IEEE, 16 June 2020. Available online.

[2]       Marcelo Barrios, Terry Jones, Scott Kinnane, Mathis Landzettel, Safran Al-Safran, Jerry Stevens, Christopher Stone, Chris Thomas, and Ulf Troppens, Sizing and Tuning GPFS, Poughkeepsie, NY: IBM Corporation, September 1999. Available online.

[3]       MPI 2.0, MPI-2: Extensions to the Message-Passing Interface, The Message Passing Interface Forum, July, 1997.  Available online.

 

Journal Articles

[1]       M. Ben Olson, Brandon Kammerdiener, Michael R. Jantz, Kshitij A. Doshi and Terry Jones. ACM Transactions on Architecture and Code Optimization, Volume 19, Issue 3, Article No. 45, pp. 1–27. https://doi.org/10.1145/3533855.

[2]       Terry Jones, Doug Arnold, Frank Tuffner, Rodney Cummings, and Kang Lee. “Recent Advances in Precision Clock Synchronization Protocols for Power Grid Control Systems.” Energies 14, no. 17 (2021): 5303. https://doi.org/10.3390/en14175303.

[3]       Elvis Rojas, Esteban Meneses, Terry Jones, and Don Maxwell. Understanding failures through the lifetime of a top-level supercomputer. Journal of Parallel and Distributed Computing, 154, pp.27-41. August, 2021. https://doi.org/10.1016/j.jpdc.2021.04.001.

[4]       Terry Jones, George Ostrouchov, Gregory A. Koenig, Oscar H. Mondragon, and Patrick G. Bridges. An Evaluation of the state of time synchronization on leadership class supercomputers. Journal of Concurrency and Computation: Practice and Experience. 09 October 2017. https://doi.org/10.1002/cpe.4341

[5]       Yoav Tock, Benjamin Mandler, José Moreira, and Terry Jones. “Design and Implementation of a Scalable Membership Service for Supercomputer Resiliency-Aware Runtime.” Lecture Notes in Computer Science Volume 8097, 2013, pp 354-366. Print ISBN 978-3-642-40046-9. doi: https://doi.org/10.1007/978-3-642-40047-6_37

[6]       Terry Jones and Gregory Koenig. “Clock Synchronization in High-end Computing Environments: A Strategy for Minimizing Clock Variance at Runtime.” Journal of Concurrency and Computation: Practice & Experience, Volume 25, Issue 6, pages 881-897, April 25, 2013. https://doi.org/10.1002/cpe.2868

[7]       Terry Jones, “Linux Kernel Co-Scheduling and Bulk Synchronous Parallelism.” The International Journal of High Performance Computing Applications (IJHPCA). Vol. 26, Issue 2, May 2012. doi: https://doi.org/10.1177/1094342011433523

[8]       Sayantan Chakravorty, Celso L. Mendes, Laxmikant V. Kale, Terry Jones, Andrew Tauferner, Todd Inglett, and Jose Moreira, “HPC-Colony: Services and Interfaces for Very Large Systems.” in Operating Systems Review, vol. 40, no. 2, pp. 43-49, 2006. doi: https://doi.org/10.1145/1131322.1131334

 

Conference Publications & Research Artifacts with DOIs or Available Online

[1]       Elvis Rojas, Fabricio Quirós-Corella, Terry Jones, and Esteban Meneses. Large-Scale Distributed Deep Learning: A Study of Mechanisms and Trade-Offs with PyTorch. In Latin American High Performance Computing Conference. Springer CCIS: Communications In Computer and Information Science, vol 1540. 2022. pp. 177-192. https://doi.org/10.1007/978-3-031-04209-6

[2]       Suren Byna, Stratos Idreos, Terry Jones, Kathryn Mohror, Rob Ross, and Florin Rusu. The Management and Storage of Scientific Data. United States, 2022. https://doi.org/10.2172/1845705   (brochure)

[3]       Rojas, Elvis, Fabricio Quirós-Corella, Terry Jones, and Esteban Meneses. Large-Scale Distributed Deep Learning: A Study of Mechanisms and Trade-Offs with PyTorch. In proceedings of the Latin American High Performance Computing Conference. Guadalajara, Mexico (Virtual). October 4-8, 2021. In: Gitler, I., Barrios Hernández, C.J., Meneses, E. (eds) High Performance Computing. CARLA 2021. Communications in Computer and Information Science, vol 1540. Springer, Cham. https://doi.org/10.1007/978-3-031-04209-6_13

[4]       Elvis Rojas, Diego Perez, Jon Calhoun, Leonardo Bautista Gomez, Terry Jones and Esteban Meneses. “Understanding Soft Error Sensitivity of Deep Learning Models and Frameworks through Checkpoint Alteration.” In proceedings of the 23rd IEEE Cluster conference (Cluster 2021). Portland, OR (Virtual). Sept 7-10, 2021. https://doi.org/10.1109/Cluster48925.2021.00045

[5]       Terry Jones, David Schoenwald, and Frank Tuffner. Challenges To Updating Timing In The Power Grid. The 30th Workshop on Synchronization and Timing Systems (WSTS ’21). Virtual. March 30-April 1st, 2021. Available online (accepted talk).

[6]       Elvis Rojas, Esteban Meneses, Terry Jones and Don Maxwell. “Towards a Model to Estimate the Reliability of Large-scale Hybrid Supercomputers”. In proceedings of The 26th International European Conference on Parallel and Distributed Computing (Euro-Par 2020). Warsaw, Poland. Aug. 24-28, 2020. https://doi.org/10.1007/978-3-030-57675-2_3

[7]       T. Chad Effler, Michael R. Jantz, and Terry Jones. “Performance Potential of Mixed Data Management Modes for Heterogeneous Memory Systems”. In proceedings of Workshop on Memory Centric High Performance Computing (MCHPC 2020). Atlanta, GA. Nov 11, 2020. https://doi.org/10.1109/MCHPC51950.2020.00007

[8]       Elvis Rojas, Esteban Meneses, Terry Jones and Don Maxwell. “Analyzing a Five-year Failure Record of a Leadership-class Supercomputer”. In proceedings of The 33rd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2019). Campo Grande, Brazil. Oct. 15-18, 2019. https://doi.org/10.1109/SBAC-PAD.2019.00040

[9]       M. Ben Olson, Brandon Kammerdiener, Michael R. Jantz, Kshitij A. Doshi, and Terry Jones. “Portable Application Guidance for Complex Memory Systems”. In Proceedings of 5th International Symposium on Memory Systems. Washington, DC. Sept. 30 – Oct 3, 2019.  https://doi.org/10.1145/3357526.3357575

[10]    T. Chad Effler, Brandon Kammerdiener, Michael R. Jantz, Saikat Sengupta, Prasad A. Kulkarni, Kshitij A. Doshi, and Terry Jones. “Evaluating the Effectiveness of Program Data Features for Guiding Memory Management”. In Proceedings of 5th International Symposium on Memory Systems. Washington, DC. Sept. 30 – Oct 3, 2019. https://doi.org/10.1145/3357526.3357537

[11]    Terry Jones, Michael J. Brim, Geoffroy Vallee, Benjamin Mayer, Aaron Welch, Tonglin Li, Michael Lang, Latchesar Ionkov, Douglas Otstott, Ada Gavrilovska, Greg Eisenhauer, Thaleia Doudali, and Pradeep Fernando. “UNITY: Unified Memory and File Space.” In Proceedings of the 7th International Workshop on Runtime and Operating Systems for Supercomputers (ROSS'17). Washington, DC, Jun. 2017. https://dl.acm.org/doi/10.1145/3095770.3095776

[12]    Shakar Achanta, Magnus Danielson, Phil Evans, Terry Jones, Harold Kirkham, Ya-Shian Li-Baboud, Robert Orndorff, Alison Silverstein, Kyle Thomas, Gerardo Trevino, Frank Tuffner, and Marc Weiss. Time Synchronization in the Electric Power System. Technical Report of the North American SynchroPhasor Initiative, NASPI-2017-TR-001. NASPI, 60 pp. 2017. Available online.

[13]    Chris Kelley, Joan Pellegrino and Emmanuel Taylor (eds.). Time Distribution Alternatives for the Smart Grid Workshop Report. Technical Report NIST.SP.1500-12. National Institute of Standards and Technology, Gaithersburg, MD, USA, 20899. 33 pp. 2017. https://doi.org/10.6028/NIST.SP.1500-12

[14]    Chris Kelley, Joan Pellegrino and Emmanuel Taylor (eds.). Advanced Electrical Power System Sensors Workshop Report. Technical Report NIST.SP.1500-11. National Institute of Standards and Technology, Gaithersburg, MD, USA, 20899. 32 pp. 2017. https://doi.org/10.6028/NIST.SP.1500-11

[15]    Terry Jones, Phil Evans. Is NTP A Suitable Timing Source for Grid Applications? North American SynchroPhasor Initiative (NASPI) ’17. Gaithersburg, MD. March 21-23, 2017. Available online (accepted talk).

[16]    E. Wes Bethel and Martin Greenwald (eds.). Report of the DOE Workshop on Management, Analysis, and Visualization of Experimental and Observational Data: The Convergence of Data and Computing. Technical Report LBNL-1005155. Lawrence Berkeley National Laboratory, Berkeley, CA, USA, 94720. 227 pp. 2016. https://doi.org/10.1109/eScience.2016.7870902

[17]    Terry Jones, "Precision Time, Supercomputing and the Nation's Power Grid", i-PCGRID 2016, San Francisco, CA, Mar 2016. Available online (invited talk)

[18]    Oscar Mondragon, Patrick Bridges, and Terry Jones. Quantifying Scheduling Challenges for Exascale System Software. International Workshop on Runtime and Operating Systems for Supercomputers (ROSS 15). Portland, OR. June, 2015. https://doi.org/10.1145/2768405.2768413

[19]    Brian Kocoloski, John Lange, Hasan Abbasi, David Bernholdt, Terry Jones, Jai Dayal, Noah Evans, Michael Lang, Jay Lofstead, Kevin Pedretti, and Patrick Bridges. System-Level Support for Composition of Applications. International Workshop on Runtime and Operating Systems for Supercomputers (ROSS 15). Portland, OR. June, 2015. https://doi.org/10.1145/2768405.2768412

[20]    Esteban Meneses, Xiang Ni, Terry R. Jones, and Don E. Maxwell. Analyzing the Interplay of Failures and Workload on a Leadership-Class Supercomputer. Conference: CUG 2015, Chicago, IL, USA, April 2015. Available online.

[21]    Terry Jones, Bradley Settlemyer. “Fan-in Communication On A Cray Gemini Interconnect.” CUG 2014, Lugano, Switzerland. May 2014. Available online.

[22]    Thomas Ilsche, Joseph Schuchart, Joseph Cope, Dries Kimpe, Terry Jones, Andreas Knuepfer, Kamil Iskra, Robert Ross, Wolfgang E. Nagel, and Stephen W. Poole. “Optimizing I/O Forwarding Techniques for Extreme-Scale Event Tracing.” Cluster Computing. March 2014, Volume 17, Issue 1, pp 1-18. https://doi.org/10.1007/s10586-013-0272-9.

[23]    Terry Jones, Laxmikant Kalé, José Moreira. “Project Final Report: HPC-Colony II.” Technical Report ORNL/TM-2013/553, Oak Ridge National Laboratory, September, 2013.  https://doi.org/10.2172/1105948

[24]    Terry Jones, Douglas Fuller, and Sudharshan Vazhkudai. “Digital Object Identifiers for OLCF.” Technical Report ORNL/TM-2013/370, Oak Ridge National Laboratory, September, 2013. Available online.

[25]    Terry Jones, Sudharshan Vazhkudai, Doug Fuller. “DOIs and Supercomputing.” DataCite 2013 Meeting. Washington, DC. September 2013. Available online (invited talk)

[26]    Alex Woodie. Vampir Rises To The Occasion. HPC Wire. July 31, 2013. Available online (Ezine Article)

[27]    Yanhua Sun, Gengbin Zheng, Chao Mei, Eric Bohm, James Phillips, Laxmikant Kale, and Terry Jones. “Optimizing Fine-grained Communication in a Biomolecular Simulation Application on Cray XK6.” The 25th International Conference for High Performance Computing, Networking, Storage and Analysis (SC'12). Salt Lake City, UT. November 2012.  https://doi.org/10.1109/SC.2012.87

[28]    Ziming Zheng, Zhiling Lan, Li Yu, and Terry Jones. “3-Dimensional Root Cause via Co-analysis.” 9th International Conference on Autonomic Computing. San Jose, CA. September 2012.   https://doi.org/10.1145/2371536.2371571

[29]    Thomas Ilsche, Joseph Schuchart, Joseph Cope, Dries Kimpe, Terry Jones, Andreas Knuepfer, Kamil Iskra, Robert Ross, Wolfgang Nagel and Stephen Poole. “Enabling Event Tracing at Leadership-Class Scale through I/O Forwarding Middleware.” The 21st International ACM Symposium on High-Performance Parallel and Distributed Computing (HPDC 2012). Delft, Netherlands. June, 2012. https://doi.org/10.1145/2287076.2287085

[30]    Li Yu, Ziming Zheng, Zhiling Lan, Terry Jones, Jim Brandt, and Ann Gentile. “Filtering log data: Finding the Needles in the Haystack.” The 42nd IEEE/IFIP International Conference on Dependable Systems and Networks (DSN 2012). Boston, MA. June,  2012.  https://doi.org/10.1109/DSN.2012.6263948

[31]    Jonathan Lifflander, Phil Miller, Ramprasad Venkataraman, Anshu Arya, Terry Jones, Laxmikant Kale. “Mapping Dense LU Factorization on Multicore Supercomputer Nodes.” 26th IEEE International Parallel & Distributed Processing Symposium (IEEE IPDPS 2012). Shanghai, China. May, 2012.  https://doi.org/10.1109/IPDPS.2012.61

[32]    Yanhua Sun, Gengbin Zheng, Ryan Olson, Terry Jones, and Laxmikant Kale. “A uGNI-Based Asynchronous Message-driven Runtime System for Cray Supercomputers with Gemini Interconnect.” 26th IEEE International Parallel & Distributed Processing Symposium (IEEE IPDPS 2012). Shanghai, China. May, 2012. https://doi.org/10.1109/IPDPS.2012.127

[33]    Y Tock, B Mandler, J Moreira, T Jones. “Poster: scalable infrastructure to support supercomputer resiliency-aware applications and load balancing.” The 24th International Conference for High Performance Computing, Networking, Storage and Analysis (SC'11). Seattle, WA. November 2011. https://doi.org/10.1145/2148600.2148606 (poster) 

[34]    Jonathan Lifflander, Phil Miller, Ramprasad Venkataraman, Anshu Arya, Terry Jones, Laxmikant Kale. “Exploring Partial Synchrony in an Asynchronous Environment Using Dense LU.” University of Illinois at Urbana-Champaign Technical Report PPL Technical Report 11-34, August 2011. Available online.

[35]    Terry Jones. “HPC Colony II Consolidated Annual Report: July-2010 to June-2011.” Technical Report ORNL/LTR-2011/154, Oak Ridge National Laboratory, June, 2011. Available online.

[36]    Terry Jones. “Linux Kernel Co-Scheduling For Bulk Synchronous Parallel Applications.” International Workshop on Runtime and Operating Systems for Supercomputers (ROSS 2011), Tucson, Arizona, USA. May 2011.  https://doi.org/10.1145/1988796.1988805

[37]    Terry Jones and Gregory Koenig. “Providing Runtime Clock Synchronization With Minimal Node-to-Node Time Deviation on XT4s and XT5s.” CUG 2011, Fairbanks, AK, USA. June 2011. Available online.

[38]    Terry Jones and Gregory Koenig. “A Clock Synchronization Strategy for Minimizing Clock Variance at Runtime in High-end Computing Environments.” 22nd International Symposium on Computer Architecture and High Performance Computing (SBAC 2010), Rio de Janeiro, Brazil. October 2010.  https://doi.org/10.1109/SBAC-PAD.2010.33

[39]    Joshua Thompson, David W. Dreisigmeyer, Terry Jones, Michael Kirby, and Joshua Ladd. “Accurate Fault Prediction of BlueGene/P RAS Logs Via Geometric Reduction.” 1st International Workshop on Fault Tolerance for HPC at Extreme Scale (FTXS 2010), Chicago, IL, June 2010.  https://doi.org/10.1109/DSNW.2010.5542626

[40]    Terry Jones, Andrew Tauferner, and Todd Inglett. “Linux OS Jitter Measurements at Large Node Counts using a BlueGene/L.” Technical Report ORNL/TM-2009/303, Oak Ridge National Laboratory, November, 2009.   https://doi.org/10.2172/971232

[41]    Jeff Keasler, Terry Jones, and Dan Quinlan, “TALC: A Simple C Language Extension For Improved Performance and Code Maintainability.” 9th LCI International Conference on High-Performance Clustered Computing, Urbana, IL, May 2008. Available online.

[42]    Matthew Koop, Terry Jones, and Dhabaleswar Panda, “MVAPICH-Aptus: Scalable High-Performance Multi-Transport MPI over InfiniBand.” 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2008), Miami, FL, May 2008.  https://doi.org/10.1109/IPDPS.2008.4536283

[43]    M. Koop, T. Jones and D. K. Panda, “Reducing Connection Memory Requirements of MPI for InfiniBand Clusters: A Message Coalescing Approach.” 7th IEEE Int'l Symposium on Cluster Computing and the Grid (CCGrid07), Rio de Janeiro, Brazil, May 2007.  https://doi.org/10.1109/CCGRID.2007.92

[44]    Terry Jones, Andrew Tauferner, and Todd Inglett, “HPC System Call Usage Trends.” 8th LCI International Conference on High-Performance Clustered Computing - South Lake Tahoe, CA, May 2007.  Available online.

[45]    Terry Jones, Laxmikant Kalé, José Moreira. “HPC-Colony: Services and Interfaces to Support Systems With Very Large Numbers of Processors / 2006 Annual Report.” Urbana, IL. 2007. Available online.

[46]    T Jones, L Kale, J Moreira, C Mendes, S Chakravorty, A Tauferner, T Inglett. “FY 2006 Accomplishment Colony: Services and Interfaces to Support Large Numbers of Processors.” Technical Report UCRL-TR-222559, Lawrence Livermore National Laboratory, July, 2006.  https://doi.org/10.2172/929165

[47]    Terry Jones, “Purple L1 Milestone Review Panel-MPI.” Technical Report UCRL-TR-226719, Lawrence Livermore National Laboratory, June, 2006. Available online.

[48]    T Jones, L Kale, J Moreira, C Mendes, S Chakravorty, A Tauferner, T Inglett. “FY 2005 Accomplishment Colony: Services and Interfaces to Support Large Numbers of Processors.” Technical Report UCRL-TR-213924, Lawrence Livermore National Laboratory, July, 2005. Available online.

[49]    Terry Jones, “Reducing the Impact of Operating System Interference on Scientific Applications.” ScicomP 11, Edinburgh, Scotland, June 3, 2005. Available online (accepted talk)

[50]    Daniel Han, Terry Jones, “MPI Profiling.” Technical Report UCRL-MI-209658, Lawrence Livermore National Laboratory, August, 2004. https://doi.org/10.2172/15014654

[51]    Daniel Han, Terry Jones, “Survey of MPI Call Usage.” Scicomp 2004, Austin, TX, August, 2004. Available online (accepted talk).

[52]    Terry Jones, Shawn Dawson, Rob Neely, William Tuel, Larry Brenner, Jeff Fier, Robert Blackmore, Pat Caffrey, Brian Maskell, Paul Tomlinson, and Mark Roberts, “Improving the Scalability of Parallel Jobs by adding Parallel Awareness to the Operating System.” In The 16th International Conference for High Performance Computing, Networking, Storage and Analysis (SC'03). Phoenix, AZ, November 2003.  https://doi.org/10.1109/SC.2003.10024

[53]    Terry Jones, Jeff Fier, and Larry Brenner, “Impacts of Operating Systems on the Scalability of Applications.” Technical Report UCRL-MI-202629, Lawrence Livermore National Laboratory, March 5, 2003. Available online.

[54]    P. Vaidyanathan, T. M. Madhyastha, and T. R. Jones, “Input/Output Scalability of Genomic Alignment: How to Configure a Computational Biology Cluster.” Technical Report UCRL-JC-145770, Lawrence Livermore National Laboratory, October 23, 2001. Available online.

[55]    Terry Jones and Linda Stanberry, “MPI On-node and Large Processor Count Scaling Performance.”  Fourth Scicomp meeting, Knoxville, TN, October 2001.  Available online (accepted talk).

[56]    Terry Jones, “Using GPFS.” ScicomP 2000, San Diego, CA, August 15, 2000. Available online (invited talk).

[57]    Richard Hedges, Terry Jones, John May, and R. Kim Yates, “Performance of an MPI-IO Implementation using third-party Transfer.” Proceedings of the Eighth NASA GSFC Storage Systems and Technology Conference, College Park, MD, March 2000. Available online.

[58]    Terry Jones, Alice Koniges, and R. Kim Yates, “Performance of the IBM General Parallel File System.” Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS 2000), Cancun, Mexico, May 2000. Available online.

[59]    Terry Jones, Richard Mark, Jeanne Martin, John May, Elsie Pierce, and Linda Stanberry, “An MPI-IO Interface to HPSS.” Proceedings of the Fifth NASA GSFC Storage Systems and Technology Conference, College Park, MD, September 1996. Available online.

[60]    T.R. Jones, R. Mark, J. Martin, E. Pierce, and L.C. Stanberry, “Parallel data transfer using MPI-IO.” Technical Report UCRL-JC-122386, Lawrence Livermore National Laboratory, October, 1995. Available online.

[61]    J.A. Rathkopf, T.R. Jones, and L.C. Stanberry, “Parallelizing Monte Carlo with PMC.” Proceedings of 1994 Symposium on Distributed Computing and Massively Parallel Processing, Livermore, CA, June 1994. Available online.


Last modified: Sept 12, 2022 by Terry Jones
You can contact me at trj@cs.stanford.edu

This page is: http://xenon.stanford.edu/~trj/publications.html