Evaluating impacts of micro-architectural metrics on error resilience and performance of general purpose GPU applications

Topçu, Burak

Please use this identifier to cite or link to this item: https://hdl.handle.net/11147/13860

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Öz, Işıl	en_US
dc.contributor.author	Topçu, Burak	en_US
dc.date.accessioned	2023-10-05T07:42:34Z	-
dc.date.available	2023-10-05T07:42:34Z	-
dc.date.issued	2023-07	en_US
dc.identifier.uri	https://hdl.handle.net/11147/13860	-
dc.description	Thesis (Master)--İzmir Institute of Technology, Computer Engineering, Izmir, 2023	en_US
dc.description	Includes bibliographical references (leaves 88-99)	en_US
dc.description	Text in English; Abstract: Turkish and English	en_US
dc.description.abstract	Rapidly growing data processing tasks require powerful and energy-efficient heterogeneous computing systems, and GPUs play a significant role in those systems by accelerating heavy workloads through the concurrent execution of many parallel tasks. The increasing architectural complexity and widening deployment of GPUs raise error resilience concerns for safety-critical applications. Furthermore, approaches that enhance performance and reduce energy dissipation address error resilience on GPUs through approximate computing solutions. Evaluating error resilience, whether by identifying the error proneness of a system or by investigating approximations that do not significantly disturb the output, necessitates robust knowledge about the execution of a program on a device. In this thesis, we develop a runtime performance and power monitoring tool that visualizes the execution with detailed micro-architectural metrics. Using the tool, we gain several fundamental insights into runtime performance bottlenecks and how perturbations affect output quality. Afterward, we propose a framework that predicts fault vulnerability for error-resilient GPU applications. The framework can accurately estimate error tolerance and avoids the significant effort otherwise required to analyze fault occurrence probability. Based on the performance bottlenecks observed with the tool and the error propagation behavior observed during the prediction experiments, we introduce a hardware-based approximate computing approach that aims to improve the performance and power consumption of GPU programs, especially memory-bound ones. The approximation method, which resolves memory utilization bottlenecks at runtime, improves performance by 1.49× (up to 2.1×) and reduces energy consumption by 28.4% (up to 52.6%) while keeping output accuracy above 98%.	en_US
dc.description.abstract	Rapidly growing data processing tasks require powerful and energy-efficient heterogeneous computing environments, and GPU devices take on an important role in accelerating the heavy workloads in these systems by running many tasks in parallel. Increasing architectural complexity and the widespread use of GPU devices raise concerns about error resilience for safety-critical applications. In addition, approaches that aim to reduce energy consumption while increasing performance address error resilience from the standpoint of making and exploiting approximations. Evaluating error resilience, susceptibility to errors, or approximations that do not greatly degrade the output requires comprehensive knowledge of how a program executes on the device. In this thesis, we developed a performance and power monitoring tool that presents and visualizes real-time execution on the GPU through micro-architectural metrics. With this tool, we gained several fundamental insights into runtime performance bottlenecks and how the errors that occur affect output quality. We then propose a framework that predicts fault vulnerability for GPU applications. This framework enables accurate estimation of error tolerance and avoids analyzing the fault occurrence probability, which requires significant effort. Based on the performance bottlenecks observed with the monitoring tool and the error propagation observations obtained during the prediction experiments, we present a hardware-based approximation tool that aims to improve the performance and power of GPU programs, particularly those limited by memory usage. The approximation method, which resolves memory utilization bottlenecks at runtime, increases performance by 1.49× (up to 2.1×) and reduces energy consumption by 28.4% (up to 52.6%) while keeping output accuracy above 98%.	en_US
dc.format.extent	xi, 99 leaves	en_US
dc.language.iso	en	en_US
dc.publisher	01. Izmir Institute of Technology	en_US
dc.rights	info:eu-repo/semantics/openAccess	en_US
dc.subject	GPU applications	en_US
dc.subject	Micro-architectural metrics	en_US
dc.subject	Error resilience	en_US
dc.subject	CUDA	en_US
dc.title	Evaluating impacts of micro-architectural metrics on error resilience and performance of general purpose GPU applications	en_US
dc.title.alternative	Mikro-mimari metriklerin genel amaçlı GPU uygulama hatalarına ve performansına etkilerinin değerlendirilmesi	en_US
dc.type	Master Thesis	en_US
dc.authorid	0000-0002-2462-0509	en_US
dc.department	Thesis (Master)--İzmir Institute of Technology, Computer Engineering	en_US
dc.relation.publicationcategory	Tez	en_US
dc.identifier.yoktezid	813983	en_US
item.openairetype	Master Thesis	-
item.openairecristype	http://purl.org/coar/resource_type/c_18cf	-
item.grantfulltext	open	-
item.languageiso639-1	en	-
item.cerifentitytype	Publications	-
item.fulltext	With Fulltext	-
crisitem.author.dept	01. Izmir Institute of Technology	-
Appears in Collections:	Master Degree / Yüksek Lisans Tezleri

Files in This Item:

File	Description	Size	Format
10560868.pdf	Master Thesis	6.03 MB	Adobe PDF

Page view(s): 50 (checked on Aug 5, 2024)
Download(s): 36 (checked on Aug 5, 2024)
