Next: Experimental Results
Up: Interconnect-aware Low Power High-level
Previous: SSA in buffers
Interconnect-aware High-level Synthesis for Low Power
We have implemented our techniques for
interconnect-aware power optimization on top of a low power
high-level synthesis tool, SCALP [10]. The implementations
constitute about 4000 lines of additional C++ code over about
20,000 lines in SCALP. An overview of the new system is shown in
Fig. 13.
Figure 13: The framework of the interconnect-aware high-level synthesis tool for low power
First, the CDFG is simulated with typical input traces in order to
profile operations and data transfers. The profiling information
combined with the RTL design library (with its associated power
macro-models) is used to evaluate the RTL circuit in terms of
power, area and performance. The initial solution consists of a
fully parallel implementation in which each CDFG node is bound
separately to the fastest functional unit in the library that can
implement it, and every CDFG edge is bound to a separate register.
The initial schedule is as-soon-as-possible.
Then the iterative improvement engine is used to optimize the RTL
architecture for different objectives (e.g., area, power) under
performance constraints. The different moves used for this purpose
consist of functional unit selection,
resource sharing and resource
splitting. Functional unit selection involves choosing an
appropriate functional unit for a given CDFG node among the many
available, e.g., choosing a carry-lookahead adder vs. a
ripple-carry adder. Resource sharing involves merging
functional units or registers; resource splitting is the reverse.
Functional unit selection and resource sharing may necessitate
rescheduling. Resource splitting may adversely affect the targeted
objective. However, it enables the algorithm to escape local
minima. Physical and interconnect information is used for cost
gain computation and move identification using techniques proposed
in Section IV. The neighborhood crowd checking and
communication cost gain influence this step. Initially, a series
of moves are identified based on their cost gain without floorplan
information, and temporarily implemented. If these moves cause any
scheduling conflict, the behavior is rescheduled with the binding
constraints to solve the conflict. Then this temporary solution is
floorplanned to determine the cost of the solution for the
targeted objective. If there is a cost reduction, the temporary
solution is accepted as the new starting point for the next
iteration (note that individual moves in the series can have a
negative impact on the objective;
however, the whole series should not). Otherwise, it is rejected,
and a new iteration tries a series of moves that has not yet
been attempted. The algorithm terminates when there is no further
improvement or a pre-specified maximum number of iterations
has been reached. Then the best RTL architecture solution seen so
far is output. This architecture is post-processed using the SSA
reduction techniques to obtain the final RTL architecture. The
post-processing includes control signal generation, gating logic
insertion and controller respecification [37]. Within the
synthesis flow, the interconnect-aware binding techniques
accelerate the iterative improvement and help it escape
local minima by finding better moves; the floorplanner and
wire power model improve power-estimation accuracy by
incorporating physical-level information; SSA reduction is simply
a post-processing step.
As illustrated by Fig. 13, almost every module of the
tool can be implemented independently of the others, provided
their interfaces are preserved. For example, the rescheduler, the wire
power model, the RTL design library, and the floorplanner can all
be implemented and upgraded independently. This adds flexibility
and scalability to the tool, and keeps our interconnect-aware
techniques orthogonal to other problems.
Another feature of SCALP, and therefore of our augmented SCALP, is
that it can control performance degradation while optimizing
for power or area. It first calculates the least execution time a
behavior may need, based on its CDFG. In our implementation, the
initial solution is the fastest one. The tool takes as input a
number, called the performance constraint, which specifies how
much performance degradation is permitted. For example, a
performance constraint of 1.3 means that a 30% increase in
execution time over the initial solution is permitted in order to
optimize for power or area. In other words, the tool will try to
find the best power- or area-optimized solution within 30%
performance degradation.
Lin Zhong
2003-10-11