Programming Massively Parallel Processors (English-Language Edition, Original 3rd Edition) PDF Download

About the Book

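This book introduces the fundamental concepts of parallel programming and GPU architecture and explores in detail the techniques used to build parallel programs, covering topics such as performance, floating-point formats, parallel patterns, and dynamic parallelism. It is suitable for both professionals and students. Case studies demonstrate the development process, starting from the details of computational thinking and ending with examples of efficient parallel programs. The new edition updates the discussion of CUDA to include newer libraries such as cuDNN, and moves material that is no longer essential to the appendices. It also adds two new chapters on parallel patterns and updates the case studies to reflect current industry practice.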

About the Authors

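David B. Kirk is a member of the U.S. National Academy of Engineering, an NVIDIA Fellow, and a former Chief Scientist of NVIDIA. He led the development of NVIDIA's graphics technology and is one of the founders of CUDA technology. In 2002 he received the ACM SIGGRAPH Computer Graphics Achievement Award in recognition of his outstanding contributions to bringing high-performance computer graphics systems to the mass market. He holds a PhD in computer science from the California Institute of Technology.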

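Wen-mei W. Hwu is the AMD Jerry Sanders Chair Professor in the Department of Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign and chief scientist of the Parallel Computing Institute, where he leads the research of the IMPACT group and the CUDA Center of Excellence. He has made distinguished contributions to compiler design, computer architecture, microarchitecture, and parallel computing. He is an IEEE Fellow and an ACM Fellow, and has received numerous awards, including the ACM SIGARCH Maurice Wilkes Award. He is also a co-founder and CTO of MulticoreWare. He holds a PhD in computer science from the University of California, Berkeley.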

Preface
Acknowledgements
CHAPTER 1 Introduction
1.1 Heterogeneous Parallel Computing
1.2 Architecture of a Modern GPU
1.3 Why More Speed or Parallelism?
1.4 Speeding Up Real Applications
1.5 Challenges in Parallel Programming
1.6 Parallel Programming Languages and Models
1.7 Overarching Goals
1.8 Organization of the Book
References
CHAPTER 2 Data Parallel Computing
2.1 Data Parallelism
2.2 CUDA C Program Structure
2.3 A Vector Addition Kernel
2.4 Device Global Memory and Data Transfer
2.5 Kernel Functions and Threading
2.6 Kernel Launch
2.7 Summary
    Function Declarations
    Kernel Launch
    Built-in (Predefined) Variables
    Run-time API
2.8 Exercises
References
CHAPTER 3 Scalable Parallel Execution
3.1 CUDA Thread Organization
3.2 Mapping Threads to Multidimensional Data
3.3 Image Blur: A More Complex Kernel
3.4 Synchronization and Transparent Scalability
3.5 Resource Assignment
3.6 Querying Device Properties
3.7 Thread Scheduling and Latency Tolerance
3.8 Summary
3.9 Exercises
CHAPTER 4 Memory and Data Locality
4.1 Importance of Memory Access Efficiency
4.2 Matrix Multiplication
4.3 CUDA Memory Types
4.4 Tiling for Reduced Memory Traffic
4.5 A Tiled Matrix Multiplication Kernel
4.6 Boundary Checks
4.7 Memory as a Limiting Factor to Parallelism
4.8 Summary
4.9 Exercises