# PkMin: Peak Power Minimization for Multi-Threaded Many-Core Applications

## Abstract

## 1. Introduction

## 2. Motivational Example

## 3. Related Work

## 4. Convex Optimization Sub-Routine for Solving Serialized DAG

**System Model:** we need to execute an application with $M\in \mathbb{N}$ multi-threaded tasks, indexed by $i$, on a many-core with $N\in \mathbb{N}$ cores. Tasks execute serially in the order given by the serialized DAG. ${N}_{i}$ is the maximum number of cores that can be allocated to task $i$. Each task $i$ executes with ${C}_{i}\in {\mathbb{R}}_{+}$ cores, subject to the domain constraint $1\le {C}_{i}\le {N}_{i}$. In practice, the number of cores ${C}_{i}$ allocated to task $i$ must be an integer. However, we first relax it to a real value greater than or equal to 1 and less than or equal to ${N}_{i}$ for tractability.

**Execution Model:** the total execution time of the application $\tau :{\mathbb{R}}_{+}^{M}\to {\mathbb{R}}_{+}$ with configuration (core allocation vector) $\overrightarrow{C}=\langle {C}_{1},{C}_{2},\dots ,{C}_{M}\rangle \in {\mathbb{R}}_{+}^{M}$ is the sum of the execution times of the individual tasks.

**Peak Power Model:** the peak power of the application $\widehat{\rho}:{\mathbb{R}}_{+}^{M}\to {\mathbb{R}}_{+}$ with configuration $\overrightarrow{C}$ is given by the task with the maximum power among all of the tasks.
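Writing ${\tau}_{i}({C}_{i})$ and ${\rho}_{i}({C}_{i})$ for the execution time and the power of task $i$ when it runs on ${C}_{i}$ cores (these per-task symbols are assumed notation introduced here for illustration), the two models above can be sketched as:

$$\tau(\overrightarrow{C}) = \sum_{i=1}^{M} {\tau}_{i}({C}_{i}), \qquad \widehat{\rho}(\overrightarrow{C}) = \max_{1\le i\le M} {\rho}_{i}({C}_{i}).$$

Because the tasks of a serialized DAG never overlap in time, the application's peak is set by the single most power-hungry task rather than by a sum over tasks.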

**Deadline Model:** the peak power function $\widehat{\rho}(\overrightarrow{C})$ attains its lowest value when all of the tasks execute with the bare minimum number of cores ($\forall i\; {C}_{i}={n}_{i}$), but this is only permitted when there is no constraint on the execution time. The application's hard deadline $\widehat{\tau}\in {\mathbb{R}}_{+}$ puts a constraint $\tau (\overrightarrow{C})\le \widehat{\tau}$ on its execution time. The deadline $\widehat{\tau}$ divides the domain over which the peak power function $\widehat{\rho}(\overrightarrow{C})$ is minimized into feasible and infeasible regions.

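**Solution:** our problem reduces to minimizing the convex peak power function $\widehat{\rho}(\overrightarrow{C})$ over the convex feasible set $F$. Formally, the problem is to minimize $\widehat{\rho}(\overrightarrow{C})$ subject to the deadline constraint $\tau (\overrightarrow{C})\le \widehat{\tau}$ and the domain constraints $1\le {C}_{i}\le {N}_{i}$ for every task $i$.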

**Solution Discretization:** when we modify the constraint in Equation (4) to force the number of cores allocated to each task to be an integer, i.e., $\forall i\; {C}_{i}\in {\mathbb{Z}}_{+}$ with $1\le {C}_{i}\le {N}_{i}$, instead of a real number, the problem becomes an NP-Hard Convex Mixed Integer Non-Linear Programming (CMINLP) problem [27]. Still, we can expect local optima in the discrete domain to be close to the discrete global optimum because, unlike an arbitrary optimization problem, this one is tractable using convex programming in its relaxed continuous domain [28].
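The relaxed problem and a simple discretization can be illustrated with a minimal sketch. It is not the NLopt-based routine used by PkMin: it assumes hypothetical per-task models (execution time `serial + work / c`, power `static + dynamic * c`), swaps in a bisection on the peak-power cap for the convex solver, and rounds core counts up, which keeps the deadline satisfied at the cost of a possibly higher peak. All names and parameters below are assumptions made for illustration.

```python
import math
from collections import namedtuple

# Hypothetical per-task model (an assumption, not taken from the paper):
#   time(c)  = serial + work / c      (convex, decreasing in c)
#   power(c) = static + dynamic * c   (convex, increasing in c)
Task = namedtuple("Task", "work serial static dynamic max_cores")


def exec_time(task, cores):
    return task.serial + task.work / cores


def power(task, cores):
    return task.static + task.dynamic * cores


def cores_under_cap(task, cap):
    """Largest (real-valued) core count whose power stays within the cap."""
    return min(task.max_cores, (cap - task.static) / task.dynamic)


def total_time(tasks, cap):
    """Serial execution time when every task uses as many cores as the cap allows."""
    return sum(exec_time(t, cores_under_cap(t, cap)) for t in tasks)


def solve_relaxed(tasks, deadline, tol=1e-9):
    """Bisection on the peak-power cap; returns (peak, real-valued cores)."""
    lo = max(power(t, 1) for t in tasks)            # every task needs >= 1 core
    hi = max(power(t, t.max_cores) for t in tasks)  # every task at its maximum
    if total_time(tasks, hi) > deadline:
        raise ValueError("deadline is infeasible even with all cores")
    if total_time(tasks, lo) <= deadline:
        hi = lo                                     # minimum cores already suffice
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if total_time(tasks, mid) <= deadline:
            hi = mid                                # cap is feasible, try lower
        else:
            lo = mid                                # cap is too tight
    return hi, [cores_under_cap(t, hi) for t in tasks]


def round_up(tasks, cores):
    """Integer allocation: rounding up never lengthens the execution time."""
    return [min(t.max_cores, math.ceil(c - 1e-6)) for t, c in zip(tasks, cores)]


tasks = [Task(work=8, serial=0, static=1, dynamic=1, max_cores=8)] * 2
peak, relaxed = solve_relaxed(tasks, deadline=4)
discrete = round_up(tasks, relaxed)
```

In this symmetric two-task example, both tasks end up with the same allocation, which mirrors the intuition that the relaxed optimum equalizes power across non-overlapping tasks.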

## 5. Peak Power Minimization with PkMin

- The new configuration yields a higher peak power than the previous configuration, i.e., a local minimum is reached.
- There are no more candidate task pairs that can be parallelized.

#### Working Example

## 6. Experimental Evaluation

**Experimental Setup:** we use the Sniper simulator [32] to simulate the execution of multi-threaded many-core applications. The simulated multi-core is composed of eight tiles, with two cores each, arranged in a 4 × 2 grid and connected using a Network on Chip (NoC) with a hop latency of four cycles and a link bandwidth of 256 bits. The two cores within a tile share a 1 MB L2 cache. Cores implement the Intel x86 Instruction Set Architecture (ISA) and run at a frequency of 4 GHz, with each core holding private 32 KB L1 data and instruction caches. The many-core's power consumption is provided by the integrated McPAT [14], assuming a 22 nm technology node.

**Application Task Graphs:** we use a set of five benchmarks (CilkSort, DFS, Fibonacci, Pi, and Queens) from the Lace benchmark suite [33] to create our tasks. To generate a random DAG of size N, we first sample N tasks with replacement from the benchmark set. Thereafter, with a probability p, we add an edge between a selected pair of nodes, such that the acyclic property of the resulting directed graph is preserved. This setup allows us to thoroughly evaluate PkMin with an arbitrarily large number of tasks, while also generating a large number of randomized applications for a given number of tasks.
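The generation procedure described above can be sketched as follows. The text does not say how acyclicity is enforced; this sketch assumes one common approach, namely only adding edges from a lower-indexed node to a higher-indexed node, so that the node index is itself a valid topological order. The function name `random_dag` and its parameters are hypothetical.

```python
import random

BENCHMARKS = ["CilkSort", "DFS", "Fibonacci", "Pi", "Queens"]


def random_dag(num_tasks, edge_prob, seed=None):
    """Return (tasks, edges) for a random task graph.

    tasks[i] is the benchmark assigned to node i, sampled with replacement.
    edges holds (u, v) pairs with u < v, so no cycle can ever form
    (an assumed way of preserving the acyclic property).
    """
    rng = random.Random(seed)
    tasks = [rng.choice(BENCHMARKS) for _ in range(num_tasks)]
    edges = [
        (u, v)
        for u in range(num_tasks)
        for v in range(u + 1, num_tasks)
        if rng.random() < edge_prob
    ]
    return tasks, edges


tasks, edges = random_dag(num_tasks=20, edge_prob=0.3, seed=1)
```

Fixing the seed makes a generated application reproducible, while varying it yields many different applications for the same number of tasks.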

**Application Deadline:** setting an arbitrarily short deadline renders application execution infeasible. To set a feasible deadline, we first note the minimum execution time that is achievable by each of the benchmarks, as illustrated, for example, in Figure 1a for DFS and CilkSort. Let B be the worst (longest) of these minimum execution times among all of the benchmarks considered. We then set the deadline to $B\cdot N$ for an application task graph with N tasks. This ensures the existence of a feasible solution. It is also a fairly tight deadline, because all of the tasks are forced to execute with the maximum available cores if they execute one after the other in a serial fashion. If the application deadline is relaxed further, then other execution configurations with far fewer cores (and hence lower peak power) may become feasible.
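As a small worked example of this rule, the sketch below computes the deadline $B\cdot N$ and converts it from clock cycles to milliseconds at the 4 GHz core frequency of the setup. The text does not state B directly; the value of 17 million cycles is a placeholder chosen so that a 100-task graph yields the 1700 million cycle deadline quoted for the 100-task experiment.

```python
CLOCK_HZ = 4e9  # core frequency from the experimental setup


def deadline_cycles(worst_min_cycles, num_tasks):
    """Deadline B * N: the worst per-benchmark minimum time, once per task."""
    return worst_min_cycles * num_tasks


def cycles_to_ms(cycles, clock_hz=CLOCK_HZ):
    return cycles / clock_hz * 1e3


B = 17_000_000  # placeholder value, an assumption (see the lead-in)
deadline = deadline_cycles(B, num_tasks=100)   # 1_700_000_000 cycles
deadline_ms = cycles_to_ms(deadline)           # approximately 425 ms
```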

**Baseline:** we are unaware of any other work that solves the problem of peak power minimization for multi-threaded many-core applications with DAG under deadline constraints. The authors of [12] propose a framework, called D&C, which uses a divide and conquer algorithm to minimize execution time for multi-threaded many-core applications with DAG under a peak power constraint. D&C therefore solves the dual of the problem solved by PkMin. We modify D&C into DCPace, which solves the same problem as PkMin, by changing the constraint from peak power to deadline and the objective from minimizing execution time to minimizing peak power. The modification keeps the underlying algorithm's ethos intact. DCPace thus acts as a suitable baseline for PkMin.

**Power and Energy Consumption Analyses:** Figure 7 illustrates the working of the DCPace and PkMin algorithms. In this experiment, we use a task graph with 100 tasks and set the deadline to 1700 million clock cycles, or 425 ms. DCPace chooses the most energy-efficient core allocation to execute each task, which is not changed thereafter. Given the task execution time and task peak power characteristics shown in Figure 1, the minimum energy allocation can only occur either when all of the cores are allocated or when the minimum number of cores is allocated to each task. Figure 7b shows the variation in the total cores allocated as the application execution proceeds in time. Both DCPace and PkMin show considerable variation in the total cores allocated, although the former varies exclusively between the maximum and the minimum possible allocations. Because the sole goal of PkMin is to reduce peak power, it allocates cores such that any two non-overlapping tasks have nearly equal peak power consumption, unlike under DCPace. PkMin exploits the convexity of the task characteristics in order to achieve these "equivalent power" allocations. As a result, the power trace of application execution in Figure 7a has almost no peaks and troughs under PkMin as compared to DCPace.

**Performance Evaluation:** we evaluate the efficacy of PkMin in minimizing peak power for applications with an increasing number of tasks. We also evaluate the same applications using DCPace to put the performance of PkMin in context.

**Scalability:** PkMin uses NLopt internally, which has a low polynomial-time computational complexity. It invokes NLopt at most $|M|$ times, where $|M|$ is the number of tasks, which keeps the overall computational complexity polynomial. PkMin also uses topological sort and transitive closure graph algorithms, which have worst-case polynomial computational complexities of $O(|M|)$ and $O(|M|^{2})$, respectively. This low polynomial-time computational complexity makes PkMin highly scalable. Figure 11 shows the increase in the worst-case problem-solving time required under PkMin with an increase in the number of tasks in applications. For a 100-task application, PkMin requires $1.3$ s to compute the near-optimal configuration.
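The two graph routines mentioned above can be sketched in a few lines. This is an illustrative version only: it uses Kahn's algorithm for the topological sort and a set-based closure computed in reverse topological order, and it makes no claim to match the implementation or the complexity bounds quoted for PkMin. Treating task pairs with no path between them as the candidates for parallel execution is likewise an assumption made for illustration; all function names are hypothetical.

```python
from collections import deque


def topological_order(num_tasks, edges):
    """Kahn's algorithm; returns (order, successor lists)."""
    succ = [[] for _ in range(num_tasks)]
    indeg = [0] * num_tasks
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
    ready = deque(i for i in range(num_tasks) if indeg[i] == 0)
    order = []
    while ready:
        u = ready.popleft()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    if len(order) != num_tasks:
        raise ValueError("graph contains a cycle")
    return order, succ


def transitive_closure(num_tasks, edges):
    """reach[u] is the set of all tasks that must run after task u."""
    order, succ = topological_order(num_tasks, edges)
    reach = [set() for _ in range(num_tasks)]
    for u in reversed(order):          # successors are finalized before u
        for v in succ[u]:
            reach[u].add(v)
            reach[u] |= reach[v]
    return reach


def independent_pairs(num_tasks, edges):
    """Task pairs with no dependency path in either direction."""
    reach = transitive_closure(num_tasks, edges)
    return [
        (a, b)
        for a in range(num_tasks)
        for b in range(a + 1, num_tasks)
        if b not in reach[a] and a not in reach[b]
    ]


diamond = [(0, 1), (0, 2), (1, 3), (2, 3)]
order, _ = topological_order(4, diamond)
pairs = independent_pairs(4, diamond)
```

In the diamond-shaped example, tasks 1 and 2 both depend on task 0 and both precede task 3, so they form the only pair without a dependency path between them.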

## 7. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## References

1. Singh, A.K.; Jigang, W.; Kumar, A.; Srikanthan, T. Run-time Mapping of Multiple Communicating Tasks on MPSoC Platforms. Procedia Comput. Sci. **2010**, 1, 1019–1026.
2. Kriebel, F.; Shafique, M.; Rehman, S.; Henkel, J.; Garg, S. Variability and Reliability Awareness in the Age of Dark Silicon. IEEE Des. Test **2015**, 33, 59–67.
3. Salehi, M.; Shafique, M.; Kriebel, F.; Rehman, S.; Tavana, M.K.; Ejlali, A.; Henkel, J. dsReliM: Power-constrained Reliability Management in Dark-Silicon Many-Core Chips under Process Variations. In Proceedings of the 2015 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS), Amsterdam, The Netherlands, 4–9 October 2015.
4. Ma, Y.; Chantem, T.; Dick, R.P.; Hu, X.S. Improving System-Level Lifetime Reliability of Multicore Soft Real-Time Systems. IEEE Trans. Very Large Scale Integr. Syst. **2017**, 25, 1895–1905.
5. Pagani, S.; Bauer, L.; Chen, Q.; Glocker, E.; Hannig, F.; Herkersdorf, A.; Khdr, H.; Pathania, A.; Schlichtmann, U.; Schmitt-Landsiedel, D.; et al. Dark Silicon Management: An Integrated and Coordinated Cross-Layer Approach. Inf. Technol. **2016**, 58.
6. Pathania, A.; Khdr, H.; Shafique, M.; Mitra, T.; Henkel, J. QoS-Aware Stochastic Power Management for Many-Cores. In Proceedings of the 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 24–28 June 2018.
7. Pathania, A.; Khdr, H.; Shafique, M.; Mitra, T.; Henkel, J. Scalable Probabilistic Power Budgeting for Many-Cores. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), Lausanne, Switzerland, 27–31 March 2017.
8. Pathania, A.; Venkataramani, V.; Shafique, M.; Mitra, T.; Henkel, J. Distributed Scheduling for Many-Cores Using Cooperative Game Theory. In Proceedings of the 2016 53rd ACM/EDAC/IEEE Design Automation Conference (DAC), Austin, TX, USA, 5–9 June 2016.
9. Pathania, A.; Venkataramani, V.; Shafique, M.; Mitra, T.; Henkel, J. Distributed Fair Scheduling for Many-Cores. In Proceedings of the 2016 Design, Automation & Test in Europe Conference & Exhibition (DATE), Dresden, Germany, 14–18 March 2016.
10. Pathania, A.; Venkataramani, V.; Shafique, M.; Mitra, T.; Henkel, J. Optimal Greedy Algorithm for Many-Core Scheduling. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. **2017**, 36, 1054–1058.
11. Venkataramani, V.; Pathania, A.; Shafique, M.; Mitra, T.; Henkel, J. Scalable Dynamic Task Scheduling on Adaptive Many-Core. In Proceedings of the 2018 IEEE 12th International Symposium on Embedded Multicore/Many-Core Systems-on-Chip (MCSoC), Hanoi, Vietnam, 12–14 September 2018.
12. Demirci, G.; Marincic, I.; Hoffmann, H. A Divide and Conquer Algorithm for DAG Scheduling Under Power Constraints. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, Dallas, TX, USA, 11–16 November 2018.
13. Carlson, T.E.; Heirman, W.; Eeckhout, L. Sniper: Exploring the Level of Abstraction for Scalable and Accurate Parallel Multi-Core Simulation. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), Seattle, WA, USA, 12–18 November 2011.
14. Li, S.; Ahn, J.H.; Strong, R.D.; Brockman, J.B.; Tullsen, D.M.; Jouppi, N.P. McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures. In Proceedings of the 2009 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), New York, NY, USA, 12–16 December 2009.
15. Singh, A.K.; Shafique, M.; Kumar, A.; Henkel, J. Mapping on multi/many-core systems: Survey of current and emerging trends. In Proceedings of the 2013 50th ACM/EDAC/IEEE Design Automation Conference (DAC), Austin, TX, USA, 29 May–7 June 2013.
16. Rapp, M.; Pathania, A.; Henkel, J. Pareto-optimal power- and cache-aware task mapping for many-cores with distributed shared last-level cache. In Proceedings of the International Symposium on Low Power Electronics and Design (ISLPED), Seattle, WA, USA, 23–25 July 2018.
17. Rapp, M.; Pathania, A.; Mitra, T.; Henkel, J. Prediction-Based Task Migration on S-NUCA Many-Cores. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), Florence, Italy, 25–29 March 2019.
18. Rapp, M.; Sagi, M.; Pathania, A.; Herkersdorf, A.; Henkel, J. Power- and Cache-Aware Task Mapping with Dynamic Power Budgeting for Many-Cores. IEEE Trans. Comput. **2019**, 69, 1–13.
19. Bartolini, A.; Borghesi, A.; Libri, A.; Beneventi, F.; Gregori, D.; Tinti, S.; Gianfreda, C.; Altoè, P. The DAVIDE Big-Data-Powered Fine-Grain Power and Performance Monitoring Support. In Proceedings of the 15th ACM International Conference on Computing Frontiers, Ischia, Italy, 8–10 May 2018.
20. Oleynik, Y.; Gerndt, M.; Schuchart, J.; Kjeldsberg, P.G.; Nagel, W.E. Run-Time Exploitation of Application Dynamism for Energy-Efficient Exascale Computing (READEX). In Proceedings of the 2015 IEEE 18th International Conference on Computational Science and Engineering, Porto, Portugal, 21–23 October 2015.
21. Lee, B.; Kim, J.; Jeung, Y.; Chong, J. Peak Power Reduction Methodology for Multi-Core Systems. In Proceedings of the International SoC Design Conference (ISOCC), Seoul, Korea, 22–23 November 2010.
22. Lee, J.; Yun, B.; Shin, K.G. Reducing Peak Power Consumption in Multi-Core Systems without Violating Real-Time Constraints. IEEE Trans. Parallel Distrib. Syst. **2014**, 25, 1024–1033.
23. Munawar, W.; Khdr, H.; Pagani, S.; Shafique, M.; Chen, J.J.; Henkel, J. Peak Power Management for Scheduling Real-Time Tasks on Heterogeneous Many-Core Systems. In Proceedings of the 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS), Hsinchu, Taiwan, 16–19 December 2014.
24. Ansari, M.; Yeganeh-Khaksar, A.; Safari, S.; Ejlali, A. Peak-Power-Aware Energy Management for Periodic Real-Time Applications. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. **2019**, 39, 779–788.
25. Johnson, S.G. The NLopt Nonlinear-Optimization Package. Available online: https://github.com/stevengj/nlopt (accessed on 25 September 2020).
26. Svanberg, K. A Class of Globally Convergent Optimization Methods Based on Conservative Convex Separable Approximations. SIAM J. Optim. **2002**, 12, 555–573.
27. Bonami, P.; Kilinç, M.; Linderoth, J. Algorithms and Software for Convex Mixed Integer Nonlinear Programs. In Mixed Integer Nonlinear Programming; Springer: Berlin/Heidelberg, Germany, 2012.
28. Moriguchi, S.; Tsuchimura, N. Discrete L-Convex Function Minimization Based on Continuous Relaxation. Pac. J. Optim. **2009**, 5, 227–236.
29. Chekuri, C.; Khanna, S. On Multidimensional Packing Problems. SIAM J. Comput. **2004**, 33, 837–851.
30. Pearce, D.J.; Kelly, P.H. A Dynamic Topological Sort Algorithm for Directed Acyclic Graphs. J. Exp. Algorithmics **2007**.
31. Ioannidis, Y.E.; Ramakrishnan, R. Efficient Transitive Closure Algorithms. In Proceedings of the 1988 VLDB Conference: 14th International Conference on Very Large Data Bases, Los Angeles, CA, USA, 29 August–1 September 1988.
32. Pathania, A.; Henkel, J. HotSniper: Sniper-based toolchain for many-core thermal simulations in open systems. IEEE Embed. Syst. Lett. **2018**, 11, 54–57.
33. Van Dijk, T.; van de Pol, J.C. Lace: Non-Blocking Split Deque for Work-Stealing. In European Conference on Parallel Processing (Euro-Par); Springer: Berlin/Heidelberg, Germany, 2014.
34. Coffman, E.G., Jr.; Garey, M.R.; Johnson, D.S.; Tarjan, R.E. Performance Bounds for Level-Oriented Two-Dimensional Packing Algorithms. SIAM J. Comput. **1980**, 9, 808–826.

**Figure 2.** Motivational example for peak power minimization of multi-threaded many-core applications with Directed Acyclic Graph (DAG) under a deadline constraint.

**Figure 3.** Characteristics for a two-task application (with a serialized DAG) with different core allocations for each task in the continuous domain.

**Figure 4.** Feasible region for a two-task application (with a serialized DAG) with a different number of cores allocated to each task, given a hard deadline.

**Figure 6.** Working example for peak power minimization of the motivational example shown in Figure 2 using PkMin.

**Figure 8.** Application performance under PkMin normalized with respect to DCPace. Application size varies from 10 tasks to 100 tasks.

**Figure 9.** Application performance under PkMin for 100-task applications, normalized against their performance under DCPace.

**Figure 10.** Peak power under PkMin for a 100-task application with different deadlines, normalized against its peak power under DCPace.

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Maity, A.; Pathania, A.; Mitra, T.
PkMin: Peak Power Minimization for Multi-Threaded Many-Core Applications. *J. Low Power Electron. Appl.* **2020**, *10*, 31.
https://doi.org/10.3390/jlpea10040031

**Chicago/Turabian Style**

Maity, Arka, Anuj Pathania, and Tulika Mitra.
2020. "PkMin: Peak Power Minimization for Multi-Threaded Many-Core Applications" *Journal of Low Power Electronics and Applications* 10, no. 4: 31.
https://doi.org/10.3390/jlpea10040031