Part of the book series: Lecture Notes in Computer Science (LNCS, volume 4331)

Abstract

Non-blocking collective operations for MPI have been under discussion for a long time. We want to contribute to this discussion by giving a rationale for the use of these operations and assessing their possible benefits. We provide a LogGP model for the CPU overhead of collective algorithms and a benchmark that measures it; both show a large potential to overlap communication and computation. We show that non-blocking collective operations can provide at least the same benefits as non-blocking point-to-point operations already do. Our claim is that the actual CPU overhead of non-blocking collective operations depends on the message size and the communicator size, and that highly scalable applications with huge communicators benefit the most. We show that the overhead's share of the overall communication time of current blocking collective operations shrinks as communicators grow and messages get larger, and that the user-level CPU overhead is less than 10% for MPICH2 and LAM/MPI using TCP/IP communication. This leads us to the conclusion that, by using non-blocking collective communication, ideally 90% of the currently idle CPU time can be freed for the application.
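The overhead argument above can be sketched numerically in the LogGP model. The following is a minimal illustration, not the paper's actual model: it assumes a binomial-tree broadcast with ceil(log2(P)) rounds on the critical path, and the parameter values (latency L, per-message CPU overhead o, per-byte gap G) are hypothetical, chosen to resemble a TCP/IP-class network. The CPU-overhead terms are the only part that must occupy the host processor; the rest could in principle be overlapped with computation by a non-blocking collective.

```python
import math

def point_to_point_time(k, L, o, G):
    """LogGP end-to-end time for one k-byte message:
    sender overhead + per-byte gaps + wire latency + receiver overhead."""
    return o + (k - 1) * G + L + o

def binomial_bcast(P, k, L, o, G):
    """Rough estimate for a binomial-tree broadcast over P processes.

    Returns (total_time, cpu_overhead): total critical-path time and the
    portion of it (2*o per round) that the CPU itself must spend; the
    remainder is candidate time for communication/computation overlap."""
    rounds = math.ceil(math.log2(P))
    total = rounds * point_to_point_time(k, L, o, G)
    cpu = rounds * 2 * o
    return total, cpu

# Hypothetical parameters (seconds, seconds/byte) for a 1 MiB broadcast:
total, cpu = binomial_bcast(P=64, k=1 << 20, L=50e-6, o=5e-6, G=1e-9)
print(f"CPU overhead share: {cpu / total:.1%}")  # small share for large messages
```

Consistent with the abstract's claim, the overhead share in this sketch shrinks as the message grows (the (k-1)G term dominates), leaving most of the communication time available for overlap.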




Copyright information

© 2006 Springer-Verlag Berlin Heidelberg


Cite this paper

Hoefler, T., Squyres, J.M., Rehm, W., Lumsdaine, A. (2006). A Case for Non-blocking Collective Operations. In: Min, G., Di Martino, B., Yang, L.T., Guo, M., Rünger, G. (eds) Frontiers of High Performance Computing and Networking – ISPA 2006 Workshops. ISPA 2006. Lecture Notes in Computer Science, vol 4331. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11942634_17

  • DOI: https://doi.org/10.1007/11942634_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-49860-5

  • Online ISBN: 978-3-540-49862-9

  • eBook Packages: Computer Science (R0)
