

# Real Time Parallel Architecture in VLSI using Microcontroller

Dr. Ramanpreet Kaur

Chandigarh Engineering College, Department of Electronics and Communication Engineering, Chandigarh Group of Colleges, Chandigarh cgcpapers@gmail.com

Article Info Volume 82 Page Number: 2493 - 2497 Publication Issue: January-February 2020

Article History Article Received: 14 March 2019 Revised: 27 May 2019 Accepted: 16 October 2019 Publication: 18 January 2020

Abstract- This paper proposes a novel architecture for high-speed arithmetic based on a "multiplier-and-accumulator (MAC)". Multiplication and accumulation are combined, and a hybrid type of carry save adder (CSA) is developed. The accumulator, which has the largest delay in the MAC, is merged into the CSA, which improves overall performance. The proposed CSA tree uses a "1's-complement-based radix-2 modified Booth's algorithm (MBA)", and its array is modified for sign extension to increase the bit density of the operands. The CSA propagates carries to the LSBs of the partial products and generates the LSBs, reducing the number of input bits of the final adder. The proposed MAC accumulates intermediate results as sum and carry bits instead of as the final adder output, which allows the pipeline scheme to be optimized for better performance. The proposed architecture was synthesized using standard CMOS libraries of 90, 130, 180 and 250 nm. The results are analysed through theoretical and experimental estimation of delay, pipelining scheme and hardware resources. Delay is modelled using Sakurai's alpha power law. The proposed MAC has superior properties compared with the standard design, delivers about twice the performance of previous work at the same clock frequency, and could be applied to high-performance requirements.

Keywords: MAC, carry save adder, Booth's algorithm, partial products.

#### I. INTRODUCTION

The demand for real-time processing of signals such as audio, video/image and other large-capacity data is increasing with advances in communication and multimedia systems. The essential elements of DSP are the multiplier and the "multiplier-and-accumulator (MAC)" [1], which perform inner products, filtering and convolution. Most digital signal processing methods rely on non-linear functions such as the "Discrete Cosine Transform (DCT)" [2] and the "Discrete Wavelet Transform (DWT)" [3]. Because these are built from repeated multiplication and addition, the speed of multiplication and addition determines the execution speed of the entire calculation. The multiplier usually sets the critical path of a digital system, since it has the largest delay among the basic operational blocks. A "modified radix-4 Booth's algorithm (MBA)" [4] is commonly used in high-speed multiplication. However, it does not by itself solve the problem of a long critical path [5].

A multiplier generally uses Booth's algorithm [3] with an array of full adders (FAs), or a Wallace tree [5] in place of the full-adder array. Such a multiplier has three parts: a Booth encoder, a Wallace tree and a final adder [6][7]. The Wallace tree adds the encoder's partial products in parallel, and its operation time grows with the logarithm of the number of inputs. It exploits the fact that counting the number of 1's among the inputs reduces the number of outputs. The number of outputs is reduced in each pipeline step by using (3:2) and (7:3) counters. Because the partial products are summed in a series of additions, reducing the number of partial products is an effective way to speed up a multiplier. The MBA reduces the number of partial products to be computed, while the Wallace tree speeds up their addition. Various parallel architectures have been studied to increase the speed of the MBA [8][9]. Among these, architectures based on the "Baugh-Wooley algorithm (BWA)" were reported to be the best.
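The interplay of Booth recoding, (3:2) counters and carry-save accumulation can be illustrated with a small behavioural sketch. It is not the synthesized architecture: it assumes radix-4 recoding, treats Python integers as unbounded two's-complement words, and the names `booth_radix4_digits`, `csa` and `mac_carry_save` are hypothetical helpers introduced only for this illustration.

```python
def booth_radix4_digits(y, n_bits):
    """Recode an n_bits-wide two's-complement multiplier into radix-4
    Booth digits, each in {-2, -1, 0, 1, 2} (assumed recoding scheme)."""
    assert n_bits % 2 == 0, "radix-4 recoding consumes two bits per digit"
    y &= (1 << n_bits) - 1            # view the multiplier as raw bits
    digits = []
    prev = 0                          # implicit bit y[-1] = 0
    for j in range(0, n_bits, 2):
        b0 = (y >> j) & 1
        b1 = (y >> (j + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def csa(a, b, c):
    """Bitwise (3:2) counter: three operands in, a sum word and a carry
    word out, with a + b + c == sum + carry and no carry propagation."""
    s = a ^ b ^ c
    carry = ((a & b) | (b & c) | (a & c)) << 1
    return s, carry


def mac_carry_save(pairs, n_bits=8):
    """Accumulate sum(x * y) while keeping the running result as separate
    sum and carry words; the carry-propagating add happens only once."""
    s, c = 0, 0
    for x, y in pairs:
        for j, d in enumerate(booth_radix4_digits(y, n_bits)):
            partial_product = (d * x) << (2 * j)
            s, c = csa(s, c, partial_product)
    return s + c                      # final adder


if __name__ == "__main__":
    data = [(3, -5), (-7, 6), (12, 11)]
    print(mac_carry_save(data), sum(x * y for x, y in data))
```

An 8-bit multiplier yields four partial products here instead of eight, and the accumulator never waits for a carry chain until the final addition.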

#### II. VLSI ARCHITECTURE

The scaling of VLSI technology was anticipated to lead to regular, programmable machines that exploit concurrency and locality. Columns 2 and 3 of Table 1 show a 1000-fold increase in the number of grids expected over a period of twenty years, which makes it economical to fabricate far more devices on a chip. This larger device count can be turned into performance through explicit concurrency (parallelism). Locality is required because the wire bandwidth at a module's periphery scales only as the square root of the device count, which is slower than the two-thirds power demanded by Rent's rule. It was already seen in 1979 that wires limit the performance, power and area of many modules. Programmability and regularity are motivated by the complexity of design [10].
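The size of this mismatch can be estimated with a short worked calculation. The symbols $N$ (device count), $B_{\text{wire}}$ (available periphery bandwidth) and $B_{\text{Rent}}$ (bandwidth demanded by Rent's rule) are introduced here only for illustration, and constant factors are ignored.

$$B_{\text{wire}} \propto N^{1/2}, \qquad B_{\text{Rent}} \propto N^{2/3}, \qquad \frac{B_{\text{Rent}}}{B_{\text{wire}}} \propto N^{2/3 - 1/2} = N^{1/6}$$

For a 1000-fold increase in device count, the shortfall under these assumptions grows by about $1000^{1/6} = 10^{1/2} \approx 3.2$, which is why designs that keep communication local scale better.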

It is easier to design an array of simple, identical processing nodes than a complex multi-million-transistor processor. A programmable design amortizes the mounting design costs over a wide variety of applications. Efficient networks for connecting arrays of processors are well understood [Dally92, DYN97][3]. Mechanisms that allow processors to communicate and synchronize quickly over these networks have been developed [LDK+98]. The implementation of coherent and efficient shared-memory systems is understood [ASHH88]. Several methods for programming parallel machines have also been developed [11]. The technology has been demonstrated by constructing research machines and parallel software research platforms. A general-purpose network such as a 3-D torus can outperform a network whose topology matches the problem of interest, such as a tree for divide-and-conquer problems. It is therefore advisable to provide a general-purpose set of mechanisms rather than a machine specialized for a single model of computation.

Despite this success at the high end, parallel VLSI architectures have had little impact on the mainstream computer industry. Departmental servers contain a few tens of processors, and desktop machines are uniprocessors. A microprocessor chip could hold on the order of 1000 8086s, yet its area is still used to implement a single processor. There are three major reasons for this course of events: there was considerable opportunity to apply additional grids to improve the performance of sequential processors, software compatibility favoured sequential machines, and the mechanisms used in parallel machines encouraged a coarse granularity of software and hardware. In 1979, the performance gap between the best microprocessors and high-end CPUs such as those used in the Cray 1 or IBM 370 was more than a factor of 100. The difference in gate delay between MOS and bipolar technology accounted for only a small part of this, about a factor of 3. The rest came from the larger gate count of the high-end CPUs, which was used to pipeline execution aggressively and to exploit parallelism. Microprocessors closed the gap between 1979 and 1999 by incorporating most of these once unconventional features. Table 1 illustrates an 80-fold increase in clock frequency: a factor of 20 comes from reduced gate delay and a factor of 4 from fewer gates per clock. The number of clocks per instruction is reduced by a factor of 12.5, starting from 10.
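The figures quoted above combine as follows. The combined speed-up is an inference that assumes the clock-frequency and clocks-per-instruction gains multiply independently; it is not a number reported in the text.

$$\underbrace{20}_{\text{gate delay}} \times \underbrace{4}_{\text{gates per clock}} = 80, \qquad \text{CPI}_{\text{new}} = \frac{10}{12.5} = 0.8, \qquad 80 \times 12.5 = 1000$$

Under that assumption, instruction throughput improves by roughly three orders of magnitude.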

#### III. MIMO SYSTEM MODEL

MIMO communication systems use multiple antennas at both the transmitter and the receiver. Figure 1 illustrates spatial multiplexing: M transmit antennas send M symbols, and each symbol is received by all N receive antennas. Multiple channel paths are therefore created, because every transmitted symbol arrives at every receive antenna. Combining these paths forms a matrix of channel elements. Each symbol creates N channel paths, one to each of the N receive antennas, and because M symbols are transmitted simultaneously, an $N \times M$ matrix is formed.

$$s = (s_1, s_2, \dots, s_M)^T \quad \text{and} \quad r = (r_1, r_2, \dots, r_N)^T$$

Here, s denotes the transmitted symbol vector, r denotes the received signal vector, and H is the $N \times M$ channel matrix between the transmit and receive antenna arrays, so that $r = Hs + y$. The vector y denotes the noise, whose entries are independent and identically distributed AWGN.
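A minimal NumPy sketch of this system model is given below. The QPSK constellation, the Rayleigh-fading channel statistics and the function name `simulate_mimo` are assumptions chosen for illustration; the paper does not specify them.

```python
import numpy as np

# Unit-energy QPSK alphabet (an assumed modulation for this sketch).
QPSK = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)


def simulate_mimo(n_tx, n_rx, noise_std, rng):
    """Draw one channel use of r = H s + y.

    n_tx is M (transmit antennas), n_rx is N (receive antennas).
    Returns the symbol vector s, channel H (N x M), noise y and received r.
    """
    s = rng.choice(QPSK, size=n_tx)
    # i.i.d. complex Gaussian channel entries with unit variance (assumed).
    H = (rng.standard_normal((n_rx, n_tx))
         + 1j * rng.standard_normal((n_rx, n_tx))) / np.sqrt(2)
    # i.i.d. complex AWGN with standard deviation noise_std per antenna.
    y = noise_std * (rng.standard_normal(n_rx)
                     + 1j * rng.standard_normal(n_rx)) / np.sqrt(2)
    r = H @ s + y
    return s, H, y, r


if __name__ == "__main__":
    s, H, y, r = simulate_mimo(n_tx=3, n_rx=4, noise_std=0.1,
                               rng=np.random.default_rng(1))
    print(H.shape, r.shape)
```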



Fig. 1 System model of MIMO

##### A. Square Root Algorithm for V-BLAST

V-BLAST detects the transmitted symbols by nulling and successive cancellation. The symbol with the highest post-detection "Signal to Noise ratio (SNR)" is found by inverting and reordering the channel matrix. It corresponds to the row of the inverted channel matrix with the smallest Euclidean norm. Once a symbol has been detected, its contribution is subtracted from the received vector. The corresponding column of H is zeroed, and the process is repeated with the deflated channel matrix until all symbols have been detected. This paper uses MMSE for channel inversion. The pseudo-inverse of a generic matrix H is:

$$H^{+} = (H^{*}H)^{-1}H^{*} = R^{-1}Q^{*} \tag{3}$$
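The nulling and cancellation loop described above can be sketched in NumPy as follows. This is a plain reference version, not the paper's hardware flow: for readability it recomputes the nulling matrix for every deflated channel matrix, and the function name `vblast_detect` and the regularization parameter `alpha` (0 gives zero-forcing, a positive value gives an MMSE-style filter) are assumptions made for the example.

```python
import numpy as np


def vblast_detect(H, r, constellation, alpha=0.0):
    """Ordered nulling and successive interference cancellation.

    H: N x M channel matrix, r: length-N received vector,
    constellation: 1-D array of candidate symbols.
    """
    H = np.asarray(H, dtype=complex)
    r = np.asarray(r, dtype=complex).copy()
    n_tx = H.shape[1]
    remaining = list(range(n_tx))        # columns not yet detected
    s_hat = np.zeros(n_tx, dtype=complex)

    while remaining:
        Hd = H[:, remaining]             # deflated channel matrix
        gram = Hd.conj().T @ Hd + alpha * np.eye(len(remaining))
        G = np.linalg.solve(gram, Hd.conj().T)   # nulling matrix
        # The row with the smallest Euclidean norm has the highest
        # post-detection SNR, so it is detected first.
        k = int(np.argmin(np.sum(np.abs(G) ** 2, axis=1)))
        z = G[k] @ r                     # nulling
        symbol = constellation[np.argmin(np.abs(constellation - z))]
        col = remaining.pop(k)
        s_hat[col] = symbol
        r = r - H[:, col] * symbol       # cancellation
    return s_hat


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    qpsk = np.array([1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]) / np.sqrt(2)
    H = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
    s = rng.choice(qpsk, size=3)
    print(np.allclose(vblast_detect(H, H @ s, qpsk), s))
```

Recomputing the filter in every iteration is exactly the cost that a square-root formulation is meant to avoid.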

Either QR decomposition or "singular value decomposition (SVD)" can be used to compute the pseudo-inverse. The MMSE-VBLAST solution is computed by the "square root algorithm" [3] from the QR decomposition of the augmented channel matrix:

$$\begin{bmatrix} H_{N \times M} \\ \sqrt{\alpha}\, I_{M \times M} \end{bmatrix} = QR = \begin{bmatrix} Q_{a,\,N \times M} \\ \times \end{bmatrix} R_{M \times M} \tag{4}$$

Irrelevant entries are denoted by ×. The augmented channel matrix is first decomposed into its QR factors, and then $P^{1/2} = R^{-1}$ is computed. Once $Q_a$ and $P^{1/2}$ have been computed, repeated computation of the pseudo-inverse can be avoided.
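The relationship between the augmented QR factors and the MMSE filter can be checked numerically. The sketch below assumes the augmented matrix stacks $\sqrt{\alpha}\,I$ beneath $H$, as in the square-root algorithm of [3]; the function name `sqrt_mmse_factors` is hypothetical. It verifies that $P^{1/2} Q_a^{*}$ equals the directly computed $(H^{*}H + \alpha I)^{-1} H^{*}$.

```python
import numpy as np


def sqrt_mmse_factors(H, alpha):
    """Return (Q_a, P_half) from the QR decomposition of [H; sqrt(alpha) I]."""
    n_rx, n_tx = H.shape
    H_aug = np.vstack([H, np.sqrt(alpha) * np.eye(n_tx)])
    Q, R = np.linalg.qr(H_aug)     # reduced QR: Q is (N+M) x M, R is M x M
    Q_a = Q[:n_rx, :]              # top N rows; the rest is not needed
    P_half = np.linalg.inv(R)      # P^(1/2) = R^(-1)
    return Q_a, P_half


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H = rng.standard_normal((4, 3)) + 1j * rng.standard_normal((4, 3))
    alpha = 0.1

    Q_a, P_half = sqrt_mmse_factors(H, alpha)
    G_sqrt = P_half @ Q_a.conj().T
    G_direct = np.linalg.solve(H.conj().T @ H + alpha * np.eye(3),
                               H.conj().T)
    print(np.allclose(G_sqrt, G_direct))
```

The product $P^{1/2}(P^{1/2})^{*}$ also reproduces $(H^{*}H + \alpha I)^{-1}$, so both quantities needed for ordering and nulling come from a single decomposition.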

#### IV. CORDIC

CORDIC accomplishes rotation efficiently in hardware by implementing the rotation equations:

$$x' = \cos\theta\,(x - y\tan\theta), \qquad y' = \cos\theta\,(y + x\tan\theta) \tag{7}$$

The angles are selected such that

$$\tan\theta = 2^{-i} \tag{8}$$

so that multiplication by $\tan\theta$ reduces to a right shift. Several CORDIC processing elements are used together, and a rotation by an arbitrary angle is obtained by combining the allowed angles:

$$\theta = \tan^{-1} 2^{-i} \tag{9}$$

For a fixed number of iterations, the $\cos\theta$ terms multiply to a constant. Up to 15 iterations, a constant scaling value is observed [12]. This design requires a CORDIC that rotates a vector onto an axis, nulling one of its components, and then rotates the following vectors through the same angle. These modes of operation are called vectoring and rotation, respectively. In this CORDIC design [9], the rotation equations are implemented under the angle constraint above in such a way that $y'$ is nulled. The CORDIC must therefore be able to operate in both rotation and vectoring mode. The equations are implemented with adders and shifters, which do the bulk of the work. Every processing element receives two input vectors and determines their signs [13]. The decision to rotate up or down is made based on these signs. The design was built in SIMULINK, which enables the creation of high-level block diagrams that are used for emulation, hardware description and simulation. This design used the Xilinx block set. The architecture of the "pseudo inverse" is designed using the Xilinx block set [14].
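The two operating modes can be sketched in a few lines of Python. This is a floating-point behavioural model rather than the fixed-point Xilinx implementation: multiplications by `2.0 ** -i` stand in for hardware right shifts, and the iteration count `N_ITER = 16` and the function name `cordic` are assumptions made for the example.

```python
import math

N_ITER = 16
# Allowed micro-rotation angles: theta_i = atan(2^-i).
ANGLES = [math.atan(2.0 ** -i) for i in range(N_ITER)]
# Product of cos(theta_i); constant once the iteration count is fixed.
GAIN = 1.0
for angle in ANGLES:
    GAIN *= math.cos(angle)


def cordic(x, y, z, vectoring):
    """Run N_ITER shift-and-add micro-rotations.

    Rotation mode (vectoring=False): rotates (x, y) by the angle z.
    Vectoring mode (vectoring=True): rotates (x, y) onto the x axis,
    nulling y, and returns the accumulated angle in z.
    """
    for i in range(N_ITER):
        if vectoring:
            d = -1.0 if y > 0 else 1.0    # choose direction that shrinks |y|
        else:
            d = 1.0 if z >= 0 else -1.0   # choose direction that shrinks |z|
        shift = 2.0 ** -i                 # a right shift by i in hardware
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * ANGLES[i]
    # Apply the constant scale factor once, after all iterations.
    return x * GAIN, y * GAIN, z


if __name__ == "__main__":
    print(cordic(1.0, 0.0, math.pi / 6, vectoring=False))
    print(cordic(3.0, 4.0, 0.0, vectoring=True))
```

In rotation mode the first call approximates $(\cos 30^\circ, \sin 30^\circ)$. In vectoring mode the second call drives $y$ towards zero, leaving the vector magnitude in $x$ and the rotation angle in $z$.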



Fig. 2 Architecture for pseudo inverse in VLSI



#### V. RESULTS

Every block was simulated extensively. To test a block, the algorithm was run in Xilinx ISE, System Generator and MATLAB to obtain the expected values for given input data. The same inputs were then applied to the blocks in MATLAB, the simulation was run, and the block outputs were reviewed [15]. The Pseudo Inverse module consumed a total power of about 239 mW.

| Supply Source | Voltage (V) | Total Current (A) | Dynamic Current (A) | Quiescent Current (A) |
|---------------|-------------|-------------------|---------------------|-----------------------|
| Vccint        | 1.200       | 0.110             | 0.093               | 0.017                 |
| Vccaux        | 2.500       | 0.016             | 0.001               | 0.015                 |
| Vcco25        | 2.500       | 0.027             | 0.025               | 0.002                 |

| Supply Power (W) | Total | Dynamic | Quiescent |
|------------------|-------|---------|-----------|
|                  | 0.239 | 0.178   | 0.061     |

Fig. 4 PINV module power analysis

| Device setting   | Value                    |
|------------------|--------------------------|
| Family           | Spartan3                 |
| Part             | xc3s400                  |
| Package          | ft256                    |
| Grade            | Commercial               |
| Process          | Typical                  |
| Speed Grade      | -4                       |
| Ambient Temp (C) | 25.0                     |
| Use custom TJA?  | No                       |
| Custom TJA (C/W) | NA                       |
| Airflow (LFM)    | 0                        |
| Characterization | PRODUCTION v1.2,06-25-09 |

| On-Chip | Power (W) | Used | Available | Utilization (%) |
|---------|-----------|------|-----------|-----------------|
| Clocks  | 0.009     | 1    |           |                 |
| Logic   | 0.037     | 1911 | 7168      | 26.7            |
| Signals | 0.061     | 3136 |           |                 |
| IOs     | 0.068     | 57   | 173       | 32.9            |
| BRAMs   | 0.002     | 1    | 16        | 6.3             |
| MULTs   | 0.001     | 4    | 16        | 25.0            |
| Leakage | 0.061     |      |           |                 |
| Total   | 0.239     |      |           |                 |

| Thermal Properties | Effective TJA (C/W) | Max Ambient (C) | Junction Temp (C) |
|--------------------|---------------------|-----------------|-------------------|
|                    | 27.9                | 78.3            | 31.7              |

Fig. 5 Device family used for Pseudo Inverse module
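The reported power and thermal figures can be cross-checked with simple arithmetic. The sketch below assumes that supply power is the sum of voltage times total current over the three rails and that junction temperature follows the usual relation $T_J = T_A + P \cdot \theta_{JA}$; the rail names use the standard Xilinx spellings, and the helper names are hypothetical.

```python
# (voltage in V, total current in A) per supply rail, as reported.
RAILS = {
    "Vccint": (1.200, 0.110),
    "Vccaux": (2.500, 0.016),
    "Vcco25": (2.500, 0.027),
}


def supply_power(rails):
    """Total supply power in watts: sum of V * I over all rails."""
    return sum(voltage * current for voltage, current in rails.values())


def junction_temp(ambient_c, power_w, theta_ja_c_per_w):
    """Junction temperature in degrees C from ambient, power and theta_JA."""
    return ambient_c + power_w * theta_ja_c_per_w


if __name__ == "__main__":
    print(round(supply_power(RAILS), 4))
    print(round(junction_temp(25.0, 0.239, 27.9), 1))
```

Both results agree with the reported values to within rounding: about 0.24 W of supply power and a junction temperature of about 31.7 C.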

Table 1 Summary of device utilization

| Logic Utilization                              | Used  | Available | Utilization |
|------------------------------------------------|-------|-----------|-------------|
| Number of Slice Flip Flops                     | 1,703 | 7,168     | 23%         |
| Number of 4 input LUTs                         | 1,714 | 7,168     | 23%         |
| Number of occupied Slices                      | 1,184 | 3,584     | 33%         |
| Number of Slices containing only related logic | 1,184 | 1,184     | 100%        |
| Number of Slices containing unrelated logic    | 0     | 1,184     | 0%          |
| Total Number of 4 input LUTs                   | 1,911 | 7,168     | 26%         |
| Number used as logic                           | 1,639 |           |             |
| Number used as a route-thru                    | 197   |           |             |
| Number used as Shift registers                 | 75    |           |             |
| Number of bonded IOBs                          | 57    | 173       | 32%         |
| Number of RAMB16s                              | 1     | 16        | 6%          |
| Number of MULT18X18s                           | 4     | 16        | 25%         |
| Number of BUFGMUXs                             | 1     | 8         | 12%         |
| Average Fanout of Non-Clock Nets               | 1.94  |           |             |



Fig. 6 Layout of pseudo inverse module

#### VI. CONCLUSION

A VLSI architecture that uses a single processor, instead of a QR triangular array built from a large number of processors, is used for V-BLAST detection. The tradeoff between performance and hardware complexity is considered, and a quantization scheme for the square root algorithm is presented for V-BLAST detection. The proposed architecture was implemented in SIMULINK using dedicated XILINX block sets. The information stream is subsequently decoded in the V-BLAST architecture, and SIC is easily performed once $P^{1/2}$ has been found. Future work involves the design and implementation of the separate NULL and SORT modules with respect to area utilization and power.

#### REFERENCES

- [1] U. Cini and O. Kurt, "A MAC unit with double carry-save scheme suitable for 6-input LUT based reconfigurable systems," in *Proceedings of the IEEE International Conference on Electronics, Circuits, and Systems*, 2016.
- [2] R. Andraka, "A survey of CORDIC algorithms for FPGA based computers," 2004.
- [3] B. Hassibi, "An efficient square-root algorithm for BLAST," in *ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings*, 2000.
- [4] L. Lu, G. Y. Li, A. L. Swindlehurst, A. Ashikhmin, and R. Zhang, "An overview of massive MIMO: Benefits and challenges," *IEEE Journal on Selected Topics in Signal Processing*, 2014.
- [5] *CMOS IC Layout*, 2016.
- [6] B. Lakshmi and A. S. Dhar, "CORDIC architectures: A survey," *VLSI Design*, 2010.
- [7] Z. Guo and P. Nilsson, "A VLSI implementation of MIMO detection for future wireless communications," in *IEEE International Symposium on Personal, Indoor and Mobile Radio Communications, PIMRC*, 2003.
- [8] D. Wu, J. Eilert, R. Asghar, and D. Liu, "VLSI implementation of a fixed-complexity soft-output MIMO detector for high-speed wireless," *EURASIP J. Wirel. Commun. Netw.*, 2010.
- [9] Z. Khan, T. Arslan, J. S. Thompson, and A. T. Erdogan, "Area & power efficient VLSI architecture for computing pseudo inverse of channel matrix in a MIMO wireless system," in *Proceedings of the IEEE International Conference on VLSI Design*, 2006.
- [10] J. Mack, S. Bellestri, and D. Llamocca, "Floating point CORDIC-based architecture for powering computation," in *2015 International Conference on ReConFigurable Computing and FPGAs, ReConFig 2015*, 2016.
- [11] S. K. V and Vanathi A, "A New Robust Scan Technique for Secured Advanced Encryption Standards (AES) Against Differential Cryptanalysis Attacks," *Int. J. Adv. Res. Trends Eng. Technol.*, 2015.
- [12] O. Durmaz Incel, "Multi-channel wireless sensor networks: protocols, design and evaluation," 2009.
- [13] S. Kurlekar, P. Mali, M. Sachane, and S. Ghorpade, "CORDIC based Trigonometric Computing using 'C' Language," *Int. J. Innov. Adv. Comput. Sci.*, 2016.
- [14] S. Hussein, H. Noura, S. Martin, L. Boukhatem, and K. Al Agha, "ERCA: Efficient and robust cipher algorithm for LTE data confidentiality," in *MSWiM 2013 - Proceedings of the 16th ACM International Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems*, 2013.
- [15] B. Wahlberg and P. Stoica, "New square-root factorization of inverse Toeplitz matrices," *IEEE Signal Process. Lett.*, vol. 17, no. 2, pp. 137–140, 2010.