Part of the book series: Lecture Notes in Computer Science (LNCS, volume 4331)

Abstract

Non-blocking collective operations for MPI have been under discussion for a long time. We want to contribute to this discussion by giving a rationale for the use of these operations and assessing their possible benefits. We provide a LogGP model for the CPU overhead of collective algorithms and a benchmark that measures it; both show a large potential to overlap communication and computation. We show that non-blocking collective operations can provide at least the same benefits as non-blocking point-to-point operations already do. Our claim is that the actual CPU overhead of non-blocking collective operations depends on the message size and the communicator size, and that highly scalable applications with huge communicators benefit the most. We show that the overhead's share of the overall communication time of current blocking collective operations shrinks as communicators grow and messages get larger, and that the user-level CPU overhead is less than 10% for MPICH2 and LAM/MPI using TCP/IP communication. This leads us to the conclusion that, by using non-blocking collective communication, ideally 90% of the currently idle CPU time can be freed for the application.
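The overhead argument above can be sketched numerically in the LogGP model. The following is a minimal illustration, not the paper's actual model: it assumes a binomial-tree broadcast with ceil(log2(P)) rounds on the critical path, and the parameter values (latency L, per-message CPU overhead o, per-byte gap G) are hypothetical, chosen to resemble a TCP/IP-class network. The CPU-overhead terms are the only part that must occupy the host processor; the rest could in principle be overlapped with computation by a non-blocking collective.

```python
import math

def point_to_point_time(k, L, o, G):
    """LogGP end-to-end time for one k-byte message:
    sender overhead + per-byte gaps + wire latency + receiver overhead."""
    return o + (k - 1) * G + L + o

def binomial_bcast(P, k, L, o, G):
    """Rough estimate for a binomial-tree broadcast over P processes.

    Returns (total_time, cpu_overhead): total critical-path time and the
    portion of it (2*o per round) that the CPU itself must spend; the
    remainder is candidate time for communication/computation overlap."""
    rounds = math.ceil(math.log2(P))
    total = rounds * point_to_point_time(k, L, o, G)
    cpu = rounds * 2 * o
    return total, cpu

# Hypothetical parameters (seconds, seconds/byte) for a 1 MiB broadcast:
total, cpu = binomial_bcast(P=64, k=1 << 20, L=50e-6, o=5e-6, G=1e-9)
print(f"CPU overhead share: {cpu / total:.1%}")  # small share for large messages
```

Consistent with the abstract's claim, the overhead share in this sketch shrinks as the message grows (the (k-1)G term dominates), leaving most of the communication time available for overlap.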




Copyright information

© 2006 Springer-Verlag Berlin Heidelberg


Cite this paper

Hoefler, T., Squyres, J.M., Rehm, W., Lumsdaine, A. (2006). A Case for Non-blocking Collective Operations. In: Min, G., Di Martino, B., Yang, L.T., Guo, M., Rünger, G. (eds) Frontiers of High Performance Computing and Networking – ISPA 2006 Workshops. ISPA 2006. Lecture Notes in Computer Science, vol 4331. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11942634_17

  • DOI: https://doi.org/10.1007/11942634_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-49860-5

  • Online ISBN: 978-3-540-49862-9

  • eBook Packages: Computer Science (R0)
