

Interconnect-aware High-level Synthesis for Low Power

We have implemented our techniques for interconnect-aware power optimization on top of a low power high-level synthesis tool, SCALP [10]. The implementations constitute about 4000 lines of additional C++ code over about 20,000 lines in SCALP. An overview of the new system is shown in Fig. 13.
Figure 13: The framework of the interconnect-aware high-level synthesis tool for low power
\begin{figure}
\centering\epsfig{file=iscalp.eps,height=3.6in}
\end{figure}
First, the CDFG is simulated with typical input traces in order to profile operations and data transfers. The profiling information, combined with the RTL design library (and its associated power macro-models), is used to evaluate the RTL circuit in terms of power, area, and performance. The initial solution is a fully parallel implementation in which each CDFG node is bound separately to the fastest functional unit in the library that can implement it, and every CDFG edge is bound to a separate register. The initial schedule is as-soon-as-possible.

The iterative improvement engine then optimizes the RTL architecture for different objectives (e.g., area, power) under performance constraints. The moves used for this purpose are functional unit selection, resource sharing, and resource splitting. Functional unit selection chooses an appropriate functional unit for a given CDFG node among the many available, e.g., a carry-lookahead adder vs. a ripple-carry adder. Resource sharing merges functional units or registers; resource splitting is the reverse. Functional unit selection and resource sharing may necessitate rescheduling. Resource splitting may adversely affect the targeted objective, but it enables the algorithm to escape local minima. Physical and interconnect information is used for cost gain computation and move identification with the techniques proposed in Section IV; neighborhood crowd checking and the communication cost gain influence this step. Initially, a series of moves is identified based on cost gain without floorplan information and temporarily implemented. If these moves cause any scheduling conflict, the behavior is rescheduled under the binding constraints to resolve the conflict. The temporary solution is then floorplanned to determine its cost for the targeted objective.
If the cost is reduced, the temporary solution is accepted as the starting point for the next iteration (note that individual moves in the series can have a negative impact on the objective, but the series as a whole should not). Otherwise, it is rejected, and the next iteration tries those series of moves that have not yet been temporarily implemented. The algorithm terminates when there is no improvement or a pre-specified maximum number of iterations has been reached, and the best RTL architecture solution seen so far is output. This architecture is post-processed using the SSA reduction techniques to obtain the final RTL architecture; the post-processing includes control signal generation, gating logic insertion, and controller respecification [37].

Viewed within the synthesis flow, the interconnect-aware binding techniques accelerate the iterative improvement and help it escape local minima by finding better moves; the floorplanner and wire power model improve power estimation accuracy by incorporating physical-level information; SSA reduction is simply a post-processing step. As illustrated by Fig. 13, almost every module of the tool can be implemented independently of the others as long as their interfaces are preserved. For example, the rescheduler, the wire power model, the RTL design library, and the floorplanner can all be implemented and upgraded independently. This adds flexibility and scalability to the tool, and it keeps our interconnect-aware techniques orthogonal to other problems. Another feature of SCALP, and therefore of our augmented SCALP, is that it can control the performance degradation while optimizing for power or area. It first calculates the minimum execution time a behavior may need, based on its CDFG. In our implementation, the initial solution is the fastest one. The tool takes as input a number, called the performance constraint, which specifies how much performance degradation is permitted.
For example, a performance constraint of 1.3 means that a 30% increase in execution time over the initial solution is permitted in order to optimize for power or area. In other words, the tool tries to find the best power- or area-optimized solution within a 30% performance degradation.
Lin Zhong 2003-10-11