Optimizing dynamic programming on graphics processing units via data reuse and data prefetch with inter-block barrier synchronization

Chao Chin Wu, Kai Cheng Wei, Ting Hong Lin

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Citations (Scopus)

Abstract

Our previous study focused on accelerating an important category of DP problems, called nonserial polyadic dynamic programming (NPDP), on a graphics processing unit (GPU). In NPDP applications, the degree of parallelism varies significantly across the stages of computation, making it difficult to fully utilize the compute power of the hundreds of processing cores in a GPU. To address this challenge, we proposed a methodology that adaptively adjusts the thread-level parallelism when mapping an NPDP problem onto the GPU, thus providing a sufficient and steady degree of parallelism across the different compute stages. This work aims to further improve the performance of NPDP problems. Subproblems and data are tiled so that small data regions fit into shared memory and the buffered data can be reused for each tile of subproblems, reducing the amount of global memory access. However, we found that invoking the same kernel many times, as required to enforce data consistency across stages, makes it impossible to reuse the tiled data in shared memory once the kernel is invoked again. Fortunately, the inter-block synchronization technique allows us to invoke the kernel exactly once, with the restriction that the maximum number of blocks equals the total number of streaming multiprocessors. In addition to enabling data reuse, invoking the kernel only once also lets us prefetch data into shared memory across inter-block synchronization points, which improves performance even more than data reuse does. We realize our approach in a real-world NPDP application, the optimal matrix parenthesization problem. Experimental results demonstrate that invoking a kernel only once cannot guarantee a performance improvement unless we also reuse and prefetch data across barrier synchronization points.
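
The sketch below is a minimal, self-contained CUDA illustration of the two ideas the abstract names, not the authors' implementation: a kernel launched exactly once with at most one block per streaming multiprocessor, and an atomic-counter inter-block barrier between the stages of the matrix parenthesization (matrix-chain) recurrence. The paper's tiling and prefetch scheme is only hinted at here by caching the dimension vector in shared memory once and reusing it in every stage; the names chainDP and gpu_barrier, the problem size N, and the matrix dimensions are illustrative assumptions.

#include <cstdio>
#include <cuda_runtime.h>

#define N 512                      // number of matrices in the chain (illustrative)
#define INF 0x3f3f3f3f

__device__ unsigned int g_arrived = 0;     // blocks that have reached the barrier
__device__ unsigned int g_generation = 0;  // barrier generation counter

// Grid-wide barrier built from atomic counters. It is safe only because the
// kernel is launched with no more blocks than can be resident at once
// (here: one block per streaming multiprocessor).
__device__ void gpu_barrier()
{
    __threadfence();                       // publish this thread's global writes
    __syncthreads();                       // whole block arrives together
    if (threadIdx.x == 0) {
        unsigned int gen = atomicAdd(&g_generation, 0u);
        if (atomicAdd(&g_arrived, 1u) == gridDim.x - 1) {
            g_arrived = 0;                 // last block resets the counter...
            __threadfence();
            atomicAdd(&g_generation, 1u);  // ...and releases everyone else
        } else {
            while (atomicAdd(&g_generation, 0u) == gen) { /* spin */ }
        }
    }
    __syncthreads();                       // whole block leaves together
}

// cost is an (N+1) x (N+1) table, p holds the N+1 chain dimensions:
// matrix i is p[i-1] x p[i], and cost[1][N] is the final answer.
__global__ void chainDP(int *cost, const int *p)
{
    __shared__ int sp[N + 1];              // prefetched once, reused in every stage
    for (int i = threadIdx.x; i <= N; i += blockDim.x)
        sp[i] = p[i];
    __syncthreads();

    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int nthreads = gridDim.x * blockDim.x;

    for (int i = 1 + tid; i <= N; i += nthreads)
        cost[i * (N + 1) + i] = 0;         // chains of length 1 cost nothing
    gpu_barrier();

    // Stage l computes every subchain of length l. The grid-wide barrier between
    // stages replaces the per-stage kernel relaunch that would otherwise be needed.
    for (int l = 2; l <= N; ++l) {
        for (int i = 1 + tid; i + l - 1 <= N; i += nthreads) {
            int j = i + l - 1;
            int best = INF;
            for (int k = i; k < j; ++k) {
                int c = cost[i * (N + 1) + k] + cost[(k + 1) * (N + 1) + j]
                      + sp[i - 1] * sp[k] * sp[j];
                if (c < best) best = c;
            }
            cost[i * (N + 1) + j] = best;
        }
        gpu_barrier();                     // stage-l results visible before stage l+1
    }
}

int main()
{
    int h_p[N + 1];
    for (int i = 0; i <= N; ++i) h_p[i] = 16 + (i % 7);  // arbitrary dimensions

    int *d_cost, *d_p;
    cudaMalloc((void **)&d_cost, (N + 1) * (N + 1) * sizeof(int));
    cudaMalloc((void **)&d_p, (N + 1) * sizeof(int));
    cudaMemcpy(d_p, h_p, (N + 1) * sizeof(int), cudaMemcpyHostToDevice);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    // Launch the kernel exactly once, with at most one block per SM, since the
    // inter-block barrier requires every block to be resident simultaneously.
    chainDP<<<prop.multiProcessorCount, 256>>>(d_cost, d_p);
    cudaDeviceSynchronize();

    int best;
    cudaMemcpy(&best, d_cost + 1 * (N + 1) + N, sizeof(int), cudaMemcpyDeviceToHost);
    printf("minimum multiplication cost: %d\n", best);

    cudaFree(d_cost);
    cudaFree(d_p);
    return 0;
}

Capping the grid at the SM count is what makes the spin-wait barrier deadlock-free; on current GPUs the same single-launch pattern can also be expressed with cooperative groups and grid-wide synchronization, but the atomic-counter barrier above reflects the technique available at the time of this work.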

Original language: English
Title of host publication: Proceedings of the 2012 IEEE 18th International Conference on Parallel and Distributed Systems, ICPADS 2012
Pages: 45-52
Number of pages: 8
DOIs: https://doi.org/10.1109/ICPADS.2012.17
Publication status: Published - 2012 Dec 1
Event: 18th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2012 - Singapore, Singapore
Duration: 2012 Dec 17 - 2012 Dec 19

Publication series

Name: Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS
ISSN (Print): 1521-9097

Other

Other: 18th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2012
Country: Singapore
City: Singapore
Period: 12-12-17 to 12-12-19

Fingerprint

Dynamic programming
Synchronization
Data storage equipment
Tile
Graphics processing unit
Processing

All Science Journal Classification (ASJC) codes

  • Hardware and Architecture

Cite this

Wu, C. C., Wei, K. C., & Lin, T. H. (2012). Optimizing dynamic programming on graphics processing units via data reuse and data prefetch with inter-block barrier synchronization. In Proceedings of the 2012 IEEE 18th International Conference on Parallel and Distributed Systems, ICPADS 2012 (pp. 45-52). [6413552] (Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS). https://doi.org/10.1109/ICPADS.2012.17
Wu, Chao Chin ; Wei, Kai Cheng ; Lin, Ting Hong. / Optimizing dynamic programming on graphics processing units via data reuse and data prefetch with inter-block barrier synchronization. Proceedings of the 2012 IEEE 18th International Conference on Parallel and Distributed Systems, ICPADS 2012. 2012. pp. 45-52 (Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS).
@inproceedings{b8de2f96cdf94e93a4451f500f8263d4,
title = "Optimizing dynamic programming on graphics processing units via data reuse and data prefetch with inter-block barrier synchronization",
abstract = "Our previous study focused on accelerating an important category of DP problems, called nonserial polyadic dynamic programming (NPDP), on a graphics processing unit (GPU). In NPDP applications, the degree of parallelism varies significantly in different stages of computation, making it difficult to fully utilize the compute power of hundreds of processing cores in a GPU. To address this challenge, we proposed a methodology that can adaptively adjust the threadlevel parallelism in mapping a NPDP problem onto the GPU, thus providing sufficient and steady degrees of parallelism across different compute stages. This work aims at further improving the performance of NPDP problems. Subproblems and data are tiled to make it possible to fit small data regions into shared memory and reuse the buffered data for each tile of subproblems, thus reducing the amount of global memory access. However, we found invoking the same kernel many times, due to data consistency enforcement across different stages, makes it impossible to reuse the tiled data in shared memory after the kernel is invoked again. Fortunately, the inter-block synchronization technique allows us to invoke the kernel exactly one time with the restriction that the maximum number of blocks is equal to the total number of streaming multiprocessors. In addition to data reuse, invoking the kernel only one time also enables us to prefetch data to shared memory across inter-block synchronization point, which improves the performance more than data reuse. We realize our approach in a real-world NPDP application - the optimal matrix parenthesization problem. Experimental results demonstrate invoking a kernel only one time cannot guarantee performance improvement unless we also reuse and prefetch data across barrier synchronization points.",
author = "Wu, {Chao Chin} and Wei, {Kai Cheng} and Lin, {Ting Hong}",
year = "2012",
month = "12",
day = "1",
doi = "10.1109/ICPADS.2012.17",
language = "English",
isbn = "9780769549033",
series = "Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS",
pages = "45--52",
booktitle = "Proceedings of the 2012 IEEE 18th International Conference on Parallel and Distributed Systems, ICPADS 2012",

}

Wu, CC, Wei, KC & Lin, TH 2012, Optimizing dynamic programming on graphics processing units via data reuse and data prefetch with inter-block barrier synchronization. in Proceedings of the 2012 IEEE 18th International Conference on Parallel and Distributed Systems, ICPADS 2012., 6413552, Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS, pp. 45-52, 18th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2012, Singapore, Singapore, 12-12-17. https://doi.org/10.1109/ICPADS.2012.17

Optimizing dynamic programming on graphics processing units via data reuse and data prefetch with inter-block barrier synchronization. / Wu, Chao Chin; Wei, Kai Cheng; Lin, Ting Hong.

Proceedings of the 2012 IEEE 18th International Conference on Parallel and Distributed Systems, ICPADS 2012. 2012. p. 45-52 6413552 (Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS).

Research output: Chapter in Book/Report/Conference proceedingConference contribution

TY - GEN

T1 - Optimizing dynamic programming on graphics processing units via data reuse and data prefetch with inter-block barrier synchronization

AU - Wu, Chao Chin

AU - Wei, Kai Cheng

AU - Lin, Ting Hong

PY - 2012/12/1

Y1 - 2012/12/1

UR - http://www.scopus.com/inward/record.url?scp=84874098640&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84874098640&partnerID=8YFLogxK

U2 - 10.1109/ICPADS.2012.17

DO - 10.1109/ICPADS.2012.17

M3 - Conference contribution

AN - SCOPUS:84874098640

SN - 9780769549033

T3 - Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS

SP - 45

EP - 52

BT - Proceedings of the 2012 IEEE 18th International Conference on Parallel and Distributed Systems, ICPADS 2012

ER -

Wu CC, Wei KC, Lin TH. Optimizing dynamic programming on graphics processing units via data reuse and data prefetch with inter-block barrier synchronization. In Proceedings of the 2012 IEEE 18th International Conference on Parallel and Distributed Systems, ICPADS 2012. 2012. p. 45-52. 6413552. (Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS). https://doi.org/10.1109/ICPADS.2012.17