vlsi processor architectures: Topics by WorldWideScience.org

Sample records for vlsi processor architectures

Opto-VLSI-based reconfigurable free-space optical interconnects architecture

DEFF Research Database (Denmark)

Aljada, Muhsen; Alameh, Kamal; Chung, Il-Sug

2007-01-01

is the Opto-VLSI processor which can be driven by digital phase steering and multicasting holograms that reconfigure the optical interconnects between the input and output ports. The optical interconnects architecture is experimentally demonstrated at 2.5 Gbps using high-speed 1×3 VCSEL array and 1......×3 photoreceiver array in conjunction with two 1×4096 pixel Opto-VLSI processors. The minimisation of the crosstalk between the output ports is achieved by appropriately aligning the VCSEL and PD elements with respect to the Opto-VLSI processors and driving the latter with optimal steering phase holograms....
Wavelength-encoded OCDMA system using opto-VLSI processors.

Science.gov (United States)

Aljada, Muhsen; Alameh, Kamal

2007-07-01

We propose and experimentally demonstrate a 2.5 Gbits/sper user wavelength-encoded optical code-division multiple-access encoder-decoder structure based on opto-VLSI processing. Each encoder and decoder is constructed using a single 1D opto-very-large-scale-integrated (VLSI) processor in conjunction with a fiber Bragg grating (FBG) array of different Bragg wavelengths. The FBG array spectrally and temporally slices the broadband input pulse into several components and the opto-VLSI processor generates codewords using digital phase holograms. System performance is measured in terms of the autocorrelation and cross-correlation functions as well as the eye diagram.
Wavelength-encoded OCDMA system using opto-VLSI processors

Science.gov (United States)

Aljada, Muhsen; Alameh, Kamal

2007-07-01

We propose and experimentally demonstrate a 2.5 Gbits/sper user wavelength-encoded optical code-division multiple-access encoder-decoder structure based on opto-VLSI processing. Each encoder and decoder is constructed using a single 1D opto-very-large-scale-integrated (VLSI) processor in conjunction with a fiber Bragg grating (FBG) array of different Bragg wavelengths. The FBG array spectrally and temporally slices the broadband input pulse into several components and the opto-VLSI processor generates codewords using digital phase holograms. System performance is measured in terms of the autocorrelation and cross-correlation functions as well as the eye diagram.
Development methods for VLSI-processors

International Nuclear Information System (INIS)

Horninger, K.; Sandweg, G.

1982-01-01

The aim of this project, which was originally planed for 3 years, was the development of modern system and circuit concepts, for VLSI-processors having a 32 bit wide data path. The result of this first years work is the concept of a general purpose processor. This processor is not only logically but also physically (on the chip) divided into four functional units: a microprogrammable instruction unit, an execution unit in slice technique, a fully associative cache memory and an I/O unit. For the ALU of the execution unit circuits in PLA and slice techniques have been realized. On the basis of regularity, area consumption and achievable performance the slice technique has been prefered. The designs utilize selftesting circuitry. (orig.) [de
Embedded Processor Based Automatic Temperature Control of VLSI Chips

Directory of Open Access Journals (Sweden)

Narasimha Murthy Yayavaram

2009-01-01

Full Text Available This paper presents embedded processor based automatic temperature control of VLSI chips, using temperature sensor LM35 and ARM processor LPC2378. Due to the very high packing density, VLSI chips get heated very soon and if not cooled properly, the performance is very much affected. In the present work, the sensor which is kept very near proximity to the IC will sense the temperature and the speed of the fan arranged near to the IC is controlled based on the PWM signal generated by the ARM processor. A buzzer is also provided with the hardware, to indicate either the failure of the fan or overheating of the IC. The entire process is achieved by developing a suitable embedded C program.
Hybrid VLSI/QCA Architecture for Computing FFTs

Science.gov (United States)

Fijany, Amir; Toomarian, Nikzad; Modarres, Katayoon; Spotnitz, Matthew

2003-01-01

A data-processor architecture that would incorporate elements of both conventional very-large-scale integrated (VLSI) circuitry and quantum-dot cellular automata (QCA) has been proposed to enable the highly parallel and systolic computation of fast Fourier transforms (FFTs). The proposed circuit would complement the QCA-based circuits described in several prior NASA Tech Briefs articles, namely Implementing Permutation Matrices by Use of Quantum Dots (NPO-20801), Vol. 25, No. 10 (October 2001), page 42; Compact Interconnection Networks Based on Quantum Dots (NPO-20855) Vol. 27, No. 1 (January 2003), page 32; and Bit-Serial Adder Based on Quantum Dots (NPO-20869), Vol. 27, No. 1 (January 2003), page 35. The cited prior articles described the limitations of very-large-scale integrated (VLSI) circuitry and the major potential advantage afforded by QCA. To recapitulate: In a VLSI circuit, signal paths that are required not to interact with each other must not cross in the same plane. In contrast, for reasons too complex to describe in the limited space available for this article, suitably designed and operated QCAbased signal paths that are required not to interact with each other can nevertheless be allowed to cross each other in the same plane without adverse effect. In principle, this characteristic could be exploited to design compact, coplanar, simple (relative to VLSI) QCA-based networks to implement complex, advanced interconnection schemes.
A VLSI image processor via pseudo-mersenne transforms

International Nuclear Information System (INIS)

Sei, W.J.; Jagadeesh, J.M.

1986-01-01

The computational burden on image processing in medical fields where a large amount of information must be processed quickly and accurately has led to consideration of special-purpose image processor chip design for some time. The very large scale integration (VLSI) resolution has made it cost-effective and feasible to consider the design of special purpose chips for medical imaging fields. This paper describes a VLSI CMOS chip suitable for parallel implementation of image processing algorithms and cyclic convolutions by using Pseudo-Mersenne Number Transform (PMNT). The main advantages of the PMNT over the Fast Fourier Transform (FFT) are: (1) no multiplications are required; (2) integer arithmetic is used. The design and development of this processor, which operates on 32-point convolution or 5 x 5 window image, are described
A High Performance VLSI Computer Architecture For Computer Graphics

Science.gov (United States)

Chin, Chi-Yuan; Lin, Wen-Tai

1988-10-01

A VLSI computer architecture, consisting of multiple processors, is presented in this paper to satisfy the modern computer graphics demands, e.g. high resolution, realistic animation, real-time display etc.. All processors share a global memory which are partitioned into multiple banks. Through a crossbar network, data from one memory bank can be broadcasted to many processors. Processors are physically interconnected through a hyper-crossbar network (a crossbar-like network). By programming the network, the topology of communication links among processors can be reconfigurated to satisfy specific dataflows of different applications. Each processor consists of a controller, arithmetic operators, local memory, a local crossbar network, and I/O ports to communicate with other processors, memory banks, and a system controller. Operations in each processor are characterized into two modes, i.e. object domain and space domain, to fully utilize the data-independency characteristics of graphics processing. Special graphics features such as 3D-to-2D conversion, shadow generation, texturing, and reflection, can be easily handled. With the current high density interconnection (MI) technology, it is feasible to implement a 64-processor system to achieve 2.5 billion operations per second, a performance needed in most advanced graphics applications.
VLSI Architectures for Computing DFT's

Science.gov (United States)

Truong, T. K.; Chang, J. J.; Hsu, I. S.; Reed, I. S.; Pei, D. Y.

1986-01-01

Simplifications result from use of residue Fermat number systems. System of finite arithmetic over residue Fermat number systems enables calculation of discrete Fourier transform (DFT) of series of complex numbers with reduced number of multiplications. Computer architectures based on approach suitable for design of very-large-scale integrated (VLSI) circuits for computing DFT's. General approach not limited to DFT's; Applicable to decoding of error-correcting codes and other transform calculations. System readily implemented in VLSI.
Design of 10Gbps optical encoder/decoder structure for FE-OCDMA system using SOA and opto-VLSI processors.

Science.gov (United States)

Aljada, Muhsen; Hwang, Seow; Alameh, Kamal

2008-01-21

In this paper we propose and experimentally demonstrate a reconfigurable 10Gbps frequency-encoded (1D) encoder/decoder structure for optical code division multiple access (OCDMA). The encoder is constructed using a single semiconductor optical amplifier (SOA) and 1D reflective Opto-VLSI processor. The SOA generates broadband amplified spontaneous emission that is dynamically sliced using digital phase holograms loaded onto the Opto-VLSI processor to generate 1D codewords. The selected wavelengths are injected back into the same SOA for amplifications. The decoder is constructed using single Opto-VLSI processor only. The encoded signal can successfully be retrieved at the decoder side only when the digital phase holograms of the encoder and the decoder are matched. The system performance is measured in terms of the auto-correlation and cross-correlation functions as well as the eye diagram.
Optimizing Vector-Quantization Processor Architecture for Intelligent Query-Search Applications

Science.gov (United States)

Xu, Huaiyu; Mita, Yoshio; Shibata, Tadashi

2002-04-01

The architecture of a very large scale integration (VLSI) vector-quantization processor (VQP) has been optimized to develop a general-purpose intelligent query-search agent. The agent performs a similarity-based search in a large-volume database. Although similarity-based search processing is computationally very expensive, latency-free searches have become possible due to the highly parallel maximum-likelihood search architecture of the VQP chip. Three architectures of the VQP chip have been studied and their performances are compared. In order to give reasonable searching results according to the different policies, the concept of penalty function has been introduced into the VQP. An E-commerce real-estate agency system has been developed using the VQP chip implemented in a field-programmable gate array (FPGA) and the effectiveness of such an agency system has been demonstrated.
VLSI Design of a Variable-Length FFT/IFFT Processor for OFDM-Based Communication Systems

Directory of Open Access Journals (Sweden)

Jen-Chih Kuo

2003-12-01

Full Text Available The technique of {orthogonal frequency division multiplexing (OFDM} is famous for its robustness against frequency-selective fading channel. This technique has been widely used in many wired and wireless communication systems. In general, the {fast Fourier transform (FFT} and {inverse FFT (IFFT} operations are used as the modulation/demodulation kernel in the OFDM systems, and the sizes of FFT/IFFT operations are varied in different applications of OFDM systems. In this paper, we design and implement a variable-length prototype FFT/IFFT processor to cover different specifications of OFDM applications. The cached-memory FFT architecture is our suggested VLSI system architecture to design the prototype FFT/IFFT processor for the consideration of low-power consumption. We also implement the twiddle factor butterfly {processing element (PE} based on the {{coordinate} rotation digital computer (CORDIC} algorithm, which avoids the use of conventional multiplication-and-accumulation unit, but evaluates the trigonometric functions using only add-and-shift operations. Finally, we implement a variable-length prototype FFT/IFFT processor with TSMC 0.35Ã¢Â€Â‰ÃŽÂ¼m 1P4M CMOS technology. The simulations results show that the chip can perform (64-2048-point FFT/IFFT operations up to 80Ã¢Â€Â‰MHz operating frequency which can meet the speed requirement of most OFDM standards such as WLAN, ADSL, VDSL (256Ã¢ÂˆÂ¼2K, DAB, and 2K-mode DVB.
VLSI and system architecture-the new development of system 5G

Energy Technology Data Exchange (ETDEWEB)

Sakamura, K.; Sekino, A.; Kodaka, T.; Uehara, T.; Aiso, H.

1982-01-01

A research and development proposal is presented for VLSI CAD systems and for a hardware environment called system 5G on which the VLSI CAD systems run. The proposed CAD systems use a hierarchically organized design language to enable design of anything from basic architectures of VLSI to VLSI mask patterns in a uniform manner. The cad systems will eventually become intelligent cad systems that acquire design knowledge and perform automatic design of VLSI chips when the characteristic requirements of VLSI chip is given. System 5G will consist of superinference machines and the 5G communication network. The superinference machine will be built based on a functionally distributed architecture connecting inferommunication network. The superinference machine will be built based on a functionally distributed architecture connecting inference machines and relational data base machines via a high-speed local network. The transfer rate of the local network will be 100 mbps at the first stage of the project and will be improved to 1 gbps. Remote access to the superinference machine will be possible through the 5G communication network. Access to system 5G will use the 5G network architecture protocol. The users will access the system 5G using standardized 5G personal computers. 5G personal logic programming stations, very high intelligent terminals providing an instruction set that supports predicate logic and input/output facilities for audio and graphical information.
VLSI design

CERN Document Server

Einspruch, Norman G

1986-01-01

VLSI Electronics Microstructure Science, Volume 14: VLSI Design presents a comprehensive exposition and assessment of the developments and trends in VLSI (Very Large Scale Integration) electronics. This volume covers topics that range from microscopic aspects of materials behavior and device performance to the comprehension of VLSI in systems applications. Each article is prepared by a recognized authority. The subjects discussed in this book include VLSI processor design methodology; the RISC (Reduced Instruction Set Computer); the VLSI testing program; silicon compilers for VLSI; and special
An efficient interpolation filter VLSI architecture for HEVC standard

Science.gov (United States)

Zhou, Wei; Zhou, Xin; Lian, Xiaocong; Liu, Zhenyu; Liu, Xiaoxiang

2015-12-01

The next-generation video coding standard of High-Efficiency Video Coding (HEVC) is especially efficient for coding high-resolution video such as 8K-ultra-high-definition (UHD) video. Fractional motion estimation in HEVC presents a significant challenge in clock latency and area cost as it consumes more than 40 % of the total encoding time and thus results in high computational complexity. With aims at supporting 8K-UHD video applications, an efficient interpolation filter VLSI architecture for HEVC is proposed in this paper. Firstly, a new interpolation filter algorithm based on the 8-pixel interpolation unit is proposed in this paper. It can save 19.7 % processing time on average with acceptable coding quality degradation. Based on the proposed algorithm, an efficient interpolation filter VLSI architecture, composed of a reused data path of interpolation, an efficient memory organization, and a reconfigurable pipeline interpolation filter engine, is presented to reduce the implement hardware area and achieve high throughput. The final VLSI implementation only requires 37.2k gates in a standard 90-nm CMOS technology at an operating frequency of 240 MHz. The proposed architecture can be reused for either half-pixel interpolation or quarter-pixel interpolation, which can reduce the area cost for about 131,040 bits RAM. The processing latency of our proposed VLSI architecture can support the real-time processing of 4:2:0 format 7680 × 4320@78fps video sequences.
A novel configurable VLSI architecture design of window-based image processing method

Science.gov (United States)

Zhao, Hui; Sang, Hongshi; Shen, Xubang

2018-03-01

Most window-based image processing architecture can only achieve a certain kind of specific algorithms, such as 2D convolution, and therefore lack the flexibility and breadth of application. In addition, improper handling of the image boundary can cause loss of accuracy, or consume more logic resources. For the above problems, this paper proposes a new VLSI architecture of window-based image processing operations, which is configurable and based on consideration of the image boundary. An efficient technique is explored to manage the image borders by overlapping and flushing phases at the end of row and the end of frame, which does not produce new delay and reduce the overhead in real-time applications. Maximize the reuse of the on-chip memory data, in order to reduce the hardware complexity and external bandwidth requirements. To perform different scalar function and reduction function operations in pipeline, this can support a variety of applications of window-based image processing. Compared with the performance of other reported structures, the performance of the new structure has some similarities to some of the structures, but also superior to some other structures. Especially when compared with a systolic array processor CWP, this structure at the same frequency of approximately 12.9% of the speed increases. The proposed parallel VLSI architecture was implemented with SIMC 0.18-μm CMOS technology, and the maximum clock frequency, power consumption, and area are 125Mhz, 57mW, 104.8K Gates, respectively, furthermore the processing time is independent of the different window-based algorithms mapped to the structure
Point DCT VLSI Architecture for Emerging HEVC Standard

OpenAIRE

Ahmed, Ashfaq; Shahid, Muhammad Usman; Rehman, Ata ur

2012-01-01

This work presents a flexible VLSI architecture to compute the -point DCT. Since HEVC supports different block sizes for the computation of the DCT, that is, 4 × 4 up to 3 2 × 3 2 , the design of a flexible architecture to support them helps reducing the area overhead of hardware implementations. The hardware proposed in this work is partially folded to save area and to get speed for large video sequences sizes. The proposed architecture relies on the decomposition of the DCT matrices into ...
NASA Space Engineering Research Center for VLSI systems design

Science.gov (United States)

1991-01-01

This annual review reports the center's activities and findings on very large scale integration (VLSI) systems design for 1990, including project status, financial support, publications, the NASA Space Engineering Research Center (SERC) Symposium on VLSI Design, research results, and outreach programs. Processor chips completed or under development are listed. Research results summarized include a design technique to harden complementary metal oxide semiconductors (CMOS) memory circuits against single event upset (SEU); improved circuit design procedures; and advances in computer aided design (CAD), communications, computer architectures, and reliability design. Also described is a high school teacher program that exposes teachers to the fundamentals of digital logic design.
Point DCT VLSI Architecture for Emerging HEVC Standard

Directory of Open Access Journals (Sweden)

Ashfaq Ahmed

2012-01-01

Full Text Available This work presents a flexible VLSI architecture to compute the -point DCT. Since HEVC supports different block sizes for the computation of the DCT, that is, 4×4 up to 32×32, the design of a flexible architecture to support them helps reducing the area overhead of hardware implementations. The hardware proposed in this work is partially folded to save area and to get speed for large video sequences sizes. The proposed architecture relies on the decomposition of the DCT matrices into sparse submatrices in order to reduce the multiplications. Finally, multiplications are completely eliminated using the lifting scheme. The proposed architecture sustains real-time processing of 1080P HD video codec running at 150 MHz.
Periodic Application of Concurrent Error Detection in Processor Array Architectures. PhD. Thesis -

Science.gov (United States)

Chen, Paul Peichuan

1993-01-01

Processor arrays can provide an attractive architecture for some applications. Featuring modularity, regular interconnection and high parallelism, such arrays are well-suited for VLSI/WSI implementations, and applications with high computational requirements, such as real-time signal processing. Preserving the integrity of results can be of paramount importance for certain applications. In these cases, fault tolerance should be used to ensure reliable delivery of a system's service. One aspect of fault tolerance is the detection of errors caused by faults. Concurrent error detection (CED) techniques offer the advantage that transient and intermittent faults may be detected with greater probability than with off-line diagnostic tests. Applying time-redundant CED techniques can reduce hardware redundancy costs. However, most time-redundant CED techniques degrade a system's performance.

Architectures for single-chip image computing

Science.gov (United States)

Gove, Robert J.

1992-04-01

This paper will focus on the architectures of VLSI programmable processing components for image computing applications. TI, the maker of industry-leading RISC, DSP, and graphics components, has developed an architecture for a new-generation of image processors capable of implementing a plurality of image, graphics, video, and audio computing functions. We will show that the use of a single-chip heterogeneous MIMD parallel architecture best suits this class of processors--those which will dominate the desktop multimedia, document imaging, computer graphics, and visualization systems of this decade.
Array processor architecture

Science.gov (United States)

Barnes, George H. (Inventor); Lundstrom, Stephen F. (Inventor); Shafer, Philip E. (Inventor)

1983-01-01

A high speed parallel array data processing architecture fashioned under a computational envelope approach includes a data base memory for secondary storage of programs and data, and a plurality of memory modules interconnected to a plurality of processing modules by a connection network of the Omega gender. Programs and data are fed from the data base memory to the plurality of memory modules and from hence the programs are fed through the connection network to the array of processors (one copy of each program for each processor). Execution of the programs occur with the processors operating normally quite independently of each other in a multiprocessing fashion. For data dependent operations and other suitable operations, all processors are instructed to finish one given task or program branch before all are instructed to proceed in parallel processing fashion on the next instruction. Even when functioning in the parallel processing mode however, the processors are not locked-step but execute their own copy of the program individually unless or until another overall processor array synchronization instruction is issued.
VLSI architecture of a K-best detector for MIMO-OFDM wireless communication systems

International Nuclear Information System (INIS)

Jian Haifang; Shi Yin

2009-01-01

The K-best detector is considered as a promising technique in the MIMO-OFDM detection because of its good performance and low complexity. In this paper, a new K-best VLSI architecture is presented. In the proposed architecture, the metric computation units (MCUs) expand each surviving path only to its partial branches, based on the novel expansion scheme, which can predetermine the branches' ascending order by their local distances. Then a distributed sorter sorts out the new K surviving paths from the expanded branches in pipelines. Compared to the conventional K-best scheme, the proposed architecture can approximately reduce fundamental operations by 50% and 75% for the 16-QAM and the 64-QAM cases, respectively, and, consequently, lower the demand on the hardware resource significantly. Simulation results prove that the proposed architecture can achieve a performance very similar to conventional K-best detectors. Hence, it is an efficient solution to the K-best detector's VLSI implementation for high-throughput MIMO-OFDM systems.
VLSI architecture of a K-best detector for MIMO-OFDM wireless communication systems

Energy Technology Data Exchange (ETDEWEB)

Jian Haifang; Shi Yin, E-mail: jhf@semi.ac.c [Institute of Semiconductors, Chinese Academy of Sciences, Beijing 100083 (China)

2009-07-15

The K-best detector is considered as a promising technique in the MIMO-OFDM detection because of its good performance and low complexity. In this paper, a new K-best VLSI architecture is presented. In the proposed architecture, the metric computation units (MCUs) expand each surviving path only to its partial branches, based on the novel expansion scheme, which can predetermine the branches' ascending order by their local distances. Then a distributed sorter sorts out the new K surviving paths from the expanded branches in pipelines. Compared to the conventional K-best scheme, the proposed architecture can approximately reduce fundamental operations by 50% and 75% for the 16-QAM and the 64-QAM cases, respectively, and, consequently, lower the demand on the hardware resource significantly. Simulation results prove that the proposed architecture can achieve a performance very similar to conventional K-best detectors. Hence, it is an efficient solution to the K-best detector's VLSI implementation for high-throughput MIMO-OFDM systems.
Architectural design and analysis of a programmable image processor

International Nuclear Information System (INIS)

Siyal, M.Y.; Chowdhry, B.S.; Rajput, A.Q.K.

2003-01-01

In this paper we present an architectural design and analysis of a programmable image processor, nicknamed Snake. The processor was designed with a high degree of parallelism to speed up a range of image processing operations. Data parallelism found in array processors has been included into the architecture of the proposed processor. The implementation of commonly used image processing algorithms and their performance evaluation are also discussed. The performance of Snake is also compared with other types of processor architectures. (author)
Associative Memory Design for the Fast TracKer Processor (FTK)at ATLAS

CERN Document Server

Annovi, A; The ATLAS collaboration; Beretta, M; Bossini, E; Crescioli, F; Dell'Orso, M; Giannetti, P; Hoff, J; Liu, T; Liberali, V; Sacco, I; Schoening, A; Soltveit, H K; Stabile, A; Tripiccione, R

2011-01-01

We describe a VLSI processor for pattern recognition based on Content Addressable Memory (CAM) architecture, optimized for on-line track finding in high-energy physics experiments. A large CAM bank stores all trajectories of interest and extracts the ones compatible with a given event. This task is naturally parallelized by a CAM architecture able to output identified trajectories, recognized among a huge amount of possible combinations, in just a few 100 MHz clock cycles. We have developed this device (called the AMchip03 processor), using 180 nm technology, for the Silicon Vertex Trigger (SVT) upgrade at CDF [1] using a standard-cell VLSI design methodology. We propose a new design that introduces a full custom CAM cell and takes advantage of 65 nm technology. The customized design maximizes the pattern density, minimizes the power consumption and implements the functionalities needed for the planned Fast Tracker (FTK) [2], an ATLAS trigger upgrade project at LHC. We introduce a new variable resolution patt...
VLSI Architecture and Design

OpenAIRE

Johnsson, Lennart

1980-01-01

Integrated circuit technology is rapidly approaching a state where feature sizes of one micron or less are tractable. Chip sizes are increasing slowly. These two developments result in considerably increased complexity in chip design. The physical characteristics of integrated circuit technology are also changing. The cost of communication will be dominating making new architectures and algorithms both feasible and desirable. A large number of processors on a single chip will be possible....
FY1995 study of design methodology and environment of high-performance processor architectures; 1995 nendo koseino processor architecture sekkeiho to sekkei kankyo no kenkyu

Energy Technology Data Exchange (ETDEWEB)

NONE

1997-03-01

The aim of our project is to develop high-performance processor architectures for both general purpose and application-specific purpose. We also plan to develop basic softwares, such as compliers, and various design aid tools for those architectures. We are particularly interested in performance evaluation at architecture design phase, design optimization, automatic generation of compliers from processor designs, and architecture design methodologies combined with circuit layout. We have investigated both microprocessor architectures and design methodologies / environments for the processors. Our goal is to establish design technologies for high-performance, low-power, low-cost and highly-reliable systems in system-on-silicon era. We have proposed PPRAM architecture for high-performance system using DRAM and logic mixture technology, Softcore processor architecture for special purpose processors in embedded systems, and Power-Pro architecture for low power systems. We also developed design methodologies and design environments for the above architectures as well as a new method for design verification of microprocessors. (NEDO)
Power efficient and high performance VLSI architecture for AES algorithm

Directory of Open Access Journals (Sweden)

K. Kalaiselvi

2015-09-01

Full Text Available Advanced encryption standard (AES algorithm has been widely deployed in cryptographic applications. This work proposes a low power and high throughput implementation of AES algorithm using key expansion approach. We minimize the power consumption and critical path delay using the proposed high performance architecture. It supports both encryption and decryption using 256-bit keys with a throughput of 0.06 Gbps. The VHDL language is utilized for simulating the design and an FPGA chip has been used for the hardware implementations. Experimental results reveal that the proposed AES architectures offer superior performance than the existing VLSI architectures in terms of power, throughput and critical path delay.
VLSI electronics microstructure science

CERN Document Server

1981-01-01

VLSI Electronics: Microstructure Science, Volume 3 evaluates trends for the future of very large scale integration (VLSI) electronics and the scientific base that supports its development.This book discusses the impact of VLSI on computer architectures; VLSI design and design aid requirements; and design, fabrication, and performance of CCD imagers. The approaches, potential, and progress of ultra-high-speed GaAs VLSI; computer modeling of MOSFETs; and numerical physics of micron-length and submicron-length semiconductor devices are also elaborated. This text likewise covers the optical linewi
VLSI Architecture for Configurable and Low-Complexity Design of Hard-Decision Viterbi Decoding Algorithm

Directory of Open Access Journals (Sweden)

Rachmad Vidya Wicaksana Putra

2016-06-01

Full Text Available Convolutional encoding and data decoding are fundamental processes in convolutional error correction. One of the most popular error correction methods in decoding is the Viterbi algorithm. It is extensively implemented in many digital communication applications. Its VLSI design challenges are about area, speed, power, complexity and configurability. In this research, we specifically propose a VLSI architecture for a configurable and low-complexity design of a hard-decision Viterbi decoding algorithm. The configurable and low-complexity design is achieved by designing a generic VLSI architecture, optimizing each processing element (PE at the logical operation level and designing a conditional adapter. The proposed design can be configured for any predefined number of trace-backs, only by changing the trace-back parameter value. Its computational process only needs N + 2 clock cycles latency, with N is the number of trace-backs. Its configurability function has been proven for N = 8, N = 16, N = 32 and N = 64. Furthermore, the proposed design was synthesized and evaluated in Xilinx and Altera FPGA target boards for area consumption and speed performance.
VLSI Architectures for Sliding-Window-Based Space-Time Turbo Trellis Code Decoders

Directory of Open Access Journals (Sweden)

Georgios Passas

2012-01-01

Full Text Available The VLSI implementation of SISO-MAP decoders used for traditional iterative turbo coding has been investigated in the literature. In this paper, a complete architectural model of a space-time turbo code receiver that includes elementary decoders is presented. These architectures are based on newly proposed building blocks such as a recursive add-compare-select-offset (ACSO unit, A-, B-, Γ-, and LLR output calculation modules. Measurements of complexity and decoding delay of several sliding-window-technique-based MAP decoder architectures and a proposed parameter set lead to defining equations and comparison between those architectures.
Parallel VLSI Architecture

Science.gov (United States)

Truong, T. K.; Reed, I.; Yeh, C.; Shao, H.

1985-01-01

Fermat number transformation convolutes two digital data sequences. Very-large-scale integration (VLSI) applications, such as image and radar signal processing, X-ray reconstruction, and spectrum shaping, linear convolution of two digital data sequences of arbitrary lenghts accomplished using Fermat number transform (ENT).
VLSI System Implementation of 200 MHz, 8-bit, 90nm CMOS Arithmetic and Logic Unit (ALU Processor Controller

Directory of Open Access Journals (Sweden)

Fazal NOORBASHA

2012-08-01

Full Text Available In this present study includes the Very Large Scale Integration (VLSI system implementation of 200MHz, 8-bit, 90nm Complementary Metal Oxide Semiconductor (CMOS Arithmetic and Logic Unit (ALU processor control with logic gate design style and 0.12µm six metal 90nm CMOS fabrication technology. The system blocks and the behaviour are defined and the logical design is implemented in gate level in the design phase. Then, the logic circuits are simulated and the subunits are converted in to 90nm CMOS layout. Finally, in order to construct the VLSI system these units are placed in the floor plan and simulated with analog and digital, logic and switch level simulators. The results of the simulations indicates that the VLSI system can control different instructions which can divided into sub groups: transfer instructions, arithmetic and logic instructions, rotate and shift instructions, branch instructions, input/output instructions, control instructions. The data bus of the system is 16-bit. It runs at 200MHz, and operating power is 1.2V. In this paper, the parametric analysis of the system, the design steps and obtained results are explained.
Acoustooptic linear algebra processors - Architectures, algorithms, and applications

Science.gov (United States)

Casasent, D.

1984-01-01

Architectures, algorithms, and applications for systolic processors are described with attention to the realization of parallel algorithms on various optical systolic array processors. Systolic processors for matrices with special structure and matrices of general structure, and the realization of matrix-vector, matrix-matrix, and triple-matrix products and such architectures are described. Parallel algorithms for direct and indirect solutions to systems of linear algebraic equations and their implementation on optical systolic processors are detailed with attention to the pipelining and flow of data and operations. Parallel algorithms and their optical realization for LU and QR matrix decomposition are specifically detailed. These represent the fundamental operations necessary in the implementation of least squares, eigenvalue, and SVD solutions. Specific applications (e.g., the solution of partial differential equations, adaptive noise cancellation, and optimal control) are described to typify the use of matrix processors in modern advanced signal processing.
VLSI 'smart' I/O module development

Science.gov (United States)

Kirk, Dan

The developmental history, design, and operation of the MIL-STD-1553A/B discrete and serial module (DSM) for the U.S. Navy AN/AYK-14(V) avionics computer are described and illustrated with diagrams. The ongoing preplanned product improvement for the AN/AYK-14(V) includes five dual-redundant MIL-STD-1553 channels based on DSMs. The DSM is a front-end processor for transferring data to and from a common memory, sharing memory with a host processor to provide improved 'smart' input/output performance. Each DSM comprises three hardware sections: three VLSI-6000 semicustomized CMOS arrays, memory units to support the arrays, and buffers and resynchronization circuits. The DSM hardware module design, VLSI-6000 design tools, controlware and test software, and checkout procedures (using a hardware simulator) are characterized in detail.
VLSI implementations for image communications

CERN Document Server

Pirsch, P

1993-01-01

The past few years have seen a rapid growth in image processing and image communication technologies. New video services and multimedia applications are continuously being designed. Essential for all these applications are image and video compression techniques. The purpose of this book is to report on recent advances in VLSI architectures and their implementation for video signal processing applications with emphasis on video coding for bit rate reduction. Efficient VLSI implementation for video signal processing spans a broad range of disciplines involving algorithms, architectures, circuits
Optical linear algebra processors - Architectures and algorithms

Science.gov (United States)

Casasent, David

1986-01-01

Attention is given to the component design and optical configuration features of a generic optical linear algebra processor (OLAP) architecture, as well as the large number of OLAP architectures, number representations, algorithms and applications encountered in current literature. Number-representation issues associated with bipolar and complex-valued data representations, high-accuracy (including floating point) performance, and the base or radix to be employed, are discussed, together with case studies on a space-integrating frequency-multiplexed architecture and a hybrid space-integrating and time-integrating multichannel architecture.
FILTRES: a 128 channels VLSI mixed front-end readout electronic development for microstrip detectors

International Nuclear Information System (INIS)

Anstotz, F.; Hu, Y.; Michel, J.; Sohler, J.L.; Lachartre, D.

1998-01-01

We present a VLSI digital-analog readout electronic chain for silicon microstrip detectors. The characteristics of this circuit have been optimized for the high resolution tracker of the CERN CMS experiment. This chip consists of 128 channels at 50 μm pitch. Each channel is composed by a charge amplifier, a CR-RC shaper, an analog memory, an analog processor, an output FIFO read out serially by a multiplexer. This chip has been processed in the radiation hard technology DMILL. This paper describes the architecture of the circuit and presents test results of the 128 channel full chain chip. (orig.)
Vlsi implementation of flexible architecture for decision tree classification in data mining

Science.gov (United States)

Sharma, K. Venkatesh; Shewandagn, Behailu; Bhukya, Shankar Nayak

2017-07-01

The Data mining algorithms have become vital to researchers in science, engineering, medicine, business, search and security domains. In recent years, there has been a terrific raise in the size of the data being collected and analyzed. Classification is the main difficulty faced in data mining. In a number of the solutions developed for this problem, most accepted one is Decision Tree Classification (DTC) that gives high precision while handling very large amount of data. This paper presents VLSI implementation of flexible architecture for Decision Tree classification in data mining using c4.5 algorithm.

Positron emission tomographic images and expectation maximization: A VLSI architecture for multiple iterations per second

International Nuclear Information System (INIS)

Jones, W.F.; Byars, L.G.; Casey, M.E.

1988-01-01

A digital electronic architecture for parallel processing of the expectation maximization (EM) algorithm for Positron Emission tomography (PET) image reconstruction is proposed. Rapid (0.2 second) EM iterations on high resolution (256 x 256) images are supported. Arrays of two very large scale integration (VLSI) chips perform forward and back projection calculations. A description of the architecture is given, including data flow and partitioning relevant to EM and parallel processing. EM images shown are produced with software simulating the proposed hardware reconstruction algorithm. Projected cost of the system is estimated to be small in comparison to the cost of current PET scanners
Tinuso: A processor architecture for a multi-core hardware simulation platform

DEFF Research Database (Denmark)

Schleuniger, Pascal; Karlsson, Sven

2010-01-01

Multi-core systems have the potential to improve performance, energy and cost properties of embedded systems but also require new design methods and tools to take advantage of the new architectures. Due to the limited accuracy and performance of pure software simulators, we are working on a cycle...... accurate hardware simulation platform. We have developed the Tinuso processor architecture for this platform. Tinuso is a processor architecture optimized for FPGA implementation. The instruction set makes use of predicated instructions and supports C/C++ and assembly language programming. It is designed...... to be easy extendable to maintain the exibility required for the research on multi-core systems. Tinuso contains a co-processor interface to connect to a network interface. This interface allow for communication over an on-chip network. A clock frequency estimation study on a deeply pipelined Tinuso...
Reducing Competitive Cache Misses in Modern Processor Architectures

OpenAIRE

Prisagjanec, Milcho; Mitrevski, Pece

2017-01-01

The increasing number of threads inside the cores of a multicore processor, and competitive access to the shared cache memory, become the main reasons for an increased number of competitive cache misses and performance decline. Inevitably, the development of modern processor architectures leads to an increased number of cache misses. In this paper, we make an attempt to implement a technique for decreasing the number of competitive cache misses in the first level of cache memory. This tec...
Recovery Act - CAREER: Sustainable Silicon -- Energy-Efficient VLSI Interconnect for Extreme-Scale Computing

Energy Technology Data Exchange (ETDEWEB)

Chiang, Patrick [Oregon State Univ., Corvallis, OR (United States)

2014-01-31

The research goal of this CAREER proposal is to develop energy-efficient, VLSI interconnect circuits and systems that will facilitate future massively-parallel, high-performance computing. Extreme-scale computing will exhibit massive parallelism on multiple vertical levels, from thou sands of computational units on a single processor to thousands of processors in a single data center. Unfortunately, the energy required to communicate between these units at every level (on chip, off-chip, off-rack) will be the critical limitation to energy efficiency. Therefore, the PI's career goal is to become a leading researcher in the design of energy-efficient VLSI interconnect for future computing systems.
Towards an Analogue Neuromorphic VLSI Instrument for the Sensing of Complex Odours

Science.gov (United States)

Ab Aziz, Muhammad Fazli; Harun, Fauzan Khairi Che; Covington, James A.; Gardner, Julian W.

2011-09-01

Almost all electronic nose instruments reported today employ pattern recognition algorithms written in software and run on digital processors, e.g. micro-processors, microcontrollers or FPGAs. Conversely, in this paper we describe the analogue VLSI implementation of an electronic nose through the design of a neuromorphic olfactory chip. The modelling, design and fabrication of the chip have already been reported. Here a smart interface has been designed and characterised for thisneuromorphic chip. Thus we can demonstrate the functionality of the a VLSI neuromorphic chip, producing differing principal neuron firing patterns to real sensor response data. Further work is directed towards integrating 9 separate neuromorphic chips to create a large neuronal network to solve more complex olfactory problems.
An Efficient VLSI Architecture for Multi-Channel Spike Sorting Using a Generalized Hebbian Algorithm

Directory of Open Access Journals (Sweden)

Ying-Lun Chen

2015-08-01

Full Text Available A novel VLSI architecture for multi-channel online spike sorting is presented in this paper. In the architecture, the spike detection is based on nonlinear energy operator (NEO, and the feature extraction is carried out by the generalized Hebbian algorithm (GHA. To lower the power consumption and area costs of the circuits, all of the channels share the same core for spike detection and feature extraction operations. Each channel has dedicated buffers for storing the detected spikes and the principal components of that channel. The proposed circuit also contains a clock gating system supplying the clock to only the buffers of channels currently using the computation core to further reduce the power consumption. The architecture has been implemented by an application-specific integrated circuit (ASIC with 90-nm technology. Comparisons to the existing works show that the proposed architecture has lower power consumption and hardware area costs for real-time multi-channel spike detection and feature extraction.
An Efficient VLSI Architecture for Multi-Channel Spike Sorting Using a Generalized Hebbian Algorithm.

Science.gov (United States)

Chen, Ying-Lun; Hwang, Wen-Jyi; Ke, Chi-En

2015-08-13

A novel VLSI architecture for multi-channel online spike sorting is presented in this paper. In the architecture, the spike detection is based on nonlinear energy operator (NEO), and the feature extraction is carried out by the generalized Hebbian algorithm (GHA). To lower the power consumption and area costs of the circuits, all of the channels share the same core for spike detection and feature extraction operations. Each channel has dedicated buffers for storing the detected spikes and the principal components of that channel. The proposed circuit also contains a clock gating system supplying the clock to only the buffers of channels currently using the computation core to further reduce the power consumption. The architecture has been implemented by an application-specific integrated circuit (ASIC) with 90-nm technology. Comparisons to the existing works show that the proposed architecture has lower power consumption and hardware area costs for real-time multi-channel spike detection and feature extraction.
An Efficient VLSI Architecture for Multi-Channel Spike Sorting Using a Generalized Hebbian Algorithm

Science.gov (United States)

Chen, Ying-Lun; Hwang, Wen-Jyi; Ke, Chi-En

2015-01-01

A novel VLSI architecture for multi-channel online spike sorting is presented in this paper. In the architecture, the spike detection is based on nonlinear energy operator (NEO), and the feature extraction is carried out by the generalized Hebbian algorithm (GHA). To lower the power consumption and area costs of the circuits, all of the channels share the same core for spike detection and feature extraction operations. Each channel has dedicated buffers for storing the detected spikes and the principal components of that channel. The proposed circuit also contains a clock gating system supplying the clock to only the buffers of channels currently using the computation core to further reduce the power consumption. The architecture has been implemented by an application-specific integrated circuit (ASIC) with 90-nm technology. Comparisons to the existing works show that the proposed architecture has lower power consumption and hardware area costs for real-time multi-channel spike detection and feature extraction. PMID:26287193
A parallel VLSI architecture for a digital filter of arbitrary length using Fermat number transforms

Science.gov (United States)

Truong, T. K.; Reed, I. S.; Yeh, C. S.; Shao, H. M.

1982-01-01

A parallel architecture for computation of the linear convolution of two sequences of arbitrary lengths using the Fermat number transform (FNT) is described. In particular a pipeline structure is designed to compute a 128-point FNT. In this FNT, only additions and bit rotations are required. A standard barrel shifter circuit is modified so that it performs the required bit rotation operation. The overlap-save method is generalized for the FNT to compute a linear convolution of arbitrary length. A parallel architecture is developed to realize this type of overlap-save method using one FNT and several inverse FNTs of 128 points. The generalized overlap save method alleviates the usual dynamic range limitation in FNTs of long transform lengths. Its architecture is regular, simple, and expandable, and therefore naturally suitable for VLSI implementation.
Efficient Algorithm and Architecture of Critical-Band Transform for Low-Power Speech Applications

Directory of Open Access Journals (Sweden)

Gan Woon-Seng

2007-01-01

Full Text Available An efficient algorithm and its corresponding VLSI architecture for the critical-band transform (CBT are developed to approximate the critical-band filtering of the human ear. The CBT consists of a constant-bandwidth transform in the lower frequency range and a Brown constant- transform (CQT in the higher frequency range. The corresponding VLSI architecture is proposed to achieve significant power efficiency by reducing the computational complexity, using pipeline and parallel processing, and applying the supply voltage scaling technique. A 21-band Bark scale CBT processor with a sampling rate of 16 kHz is designed and simulated. Simulation results verify its suitability for performing short-time spectral analysis on speech. It has a better fitting on the human ear critical-band analysis, significantly fewer computations, and therefore is more energy-efficient than other methods. With a 0.35 m CMOS technology, it calculates a 160-point speech in 4.99 milliseconds at 234 kHz. The power dissipation is 15.6 W at 1.1 V. It achieves 82.1 power reduction as compared to a benchmark 256-point FFT processor.
Optical chirp z-transform processor with a simplified architecture.

Science.gov (United States)

Ngo, Nam Quoc

2014-12-29

Using a simplified chirp z-transform (CZT) algorithm based on the discrete-time convolution method, this paper presents the synthesis of a simplified architecture of a reconfigurable optical chirp z-transform (OCZT) processor based on the silica-based planar lightwave circuit (PLC) technology. In the simplified architecture of the reconfigurable OCZT, the required number of optical components is small and there are no waveguide crossings which make fabrication easy. The design of a novel type of optical discrete Fourier transform (ODFT) processor as a special case of the synthesized OCZT is then presented to demonstrate its effectiveness. The designed ODFT can be potentially used as an optical demultiplexer at the receiver of an optical fiber orthogonal frequency division multiplexing (OFDM) transmission system.
VLSI signal processing technology

CERN Document Server

Swartzlander, Earl

1994-01-01

This book is the first in a set of forthcoming books focussed on state-of-the-art development in the VLSI Signal Processing area. It is a response to the tremendous research activities taking place in that field. These activities have been driven by two factors: the dramatic increase in demand for high speed signal processing, especially in consumer elec tronics, and the evolving microelectronic technologies. The available technology has always been one of the main factors in determining al gorithms, architectures, and design strategies to be followed. With every new technology, signal processing systems go through many changes in concepts, design methods, and implementation. The goal of this book is to introduce the reader to the main features of VLSI Signal Processing and the ongoing developments in this area. The focus of this book is on: • Current developments in Digital Signal Processing (DSP) pro cessors and architectures - several examples and case studies of existing DSP chips are discussed in...
Supercomputers and parallel computation. Based on the proceedings of a workshop on progress in the use of vector and array processors organised by the Institute of Mathematics and its Applications and held in Bristol, 2-3 September 1982

International Nuclear Information System (INIS)

Paddon, D.J.

1984-01-01

This book is based on the proceedings of a conference on parallel computing held in 1982. There are 18 papers which cover the following topics: VLSI parallel architectures, the theory of parallel computing and vector and array processor computing. One paper on 'Tough Problems in Reactor Design' is indexed separately. All the contributions are on research done in the United Kingdom. Although much of the experience in array processor computing is associated with the ICL distributed array processor (DAP) and this is reflected in the contributions, the research relating to the ICL DAP is relevant to all types of array processors. (UK)
Memory Based Machine Intelligence Techniques in VLSI hardware

OpenAIRE

James, Alex Pappachen

2012-01-01

We briefly introduce the memory based approaches to emulate machine intelligence in VLSI hardware, describing the challenges and advantages. Implementation of artificial intelligence techniques in VLSI hardware is a practical and difficult problem. Deep architectures, hierarchical temporal memories and memory networks are some of the contemporary approaches in this area of research. The techniques attempt to emulate low level intelligence tasks and aim at providing scalable solutions to high ...
A Workload-Adaptive and Reconfigurable Bus Architecture for Multicore Processors

Directory of Open Access Journals (Sweden)

Shoaib Akram

2010-01-01

Full Text Available Interconnection networks for multicore processors are traditionally designed to serve a diversity of workloads. However, different workloads or even different execution phases of the same workload may benefit from different interconnect configurations. In this paper, we first motivate the need for workload-adaptive interconnection networks. Subsequently, we describe an interconnection network framework based on reconfigurable switches for use in medium-scale (up to 32 cores shared memory multicore processors. Our cost-effective reconfigurable interconnection network is implemented on a traditional shared bus interconnect with snoopy-based coherence, and it enables improved multicore performance. The proposed interconnect architecture distributes the cores of the processor into clusters with reconfigurable logic between clusters to support workload-adaptive policies for inter-cluster communication. Our interconnection scheme is complemented by interconnect-aware scheduling and additional interconnect optimizations which help boost the performance of multiprogramming and multithreaded workloads. We provide experimental results that show that the overall throughput of multiprogramming workloads (consisting of two and four programs can be improved by up to 60% with our configurable bus architecture. Similar gains can be achieved also for multithreaded applications as shown by further experiments. Finally, we present the performance sensitivity of the proposed interconnect architecture on shared memory bandwidth availability.
Behavioral Simulation and Performance Evaluation of Multi-Processor Architectures

Directory of Open Access Journals (Sweden)

Ausif Mahmood

1996-01-01

Full Text Available The development of multi-processor architectures requires extensive behavioral simulations to verify the correctness of design and to evaluate its performance. A high level language can provide maximum flexibility in this respect if the constructs for handling concurrent processes and a time mapping mechanism are added. This paper describes a novel technique for emulating hardware processes involved in a parallel architecture such that an object-oriented description of the design is maintained. The communication and synchronization between hardware processes is handled by splitting the processes into their equivalent subprograms at the entry points. The proper scheduling of these subprograms is coordinated by a timing wheel which provides a time mapping mechanism. Finally, a high level language pre-processor is proposed so that the timing wheel and the process emulation details can be made transparent to the user.
VLSI design

CERN Document Server

Basu, D K

2014-01-01

Very Large Scale Integrated Circuits (VLSI) design has moved from costly curiosity to an everyday necessity, especially with the proliferated applications of embedded computing devices in communications, entertainment and household gadgets. As a result, more and more knowledge on various aspects of VLSI design technologies is becoming a necessity for the engineering/technology students of various disciplines. With this goal in mind the course material of this book has been designed to cover the various fundamental aspects of VLSI design, like Categorization and comparison between various technologies used for VLSI design Basic fabrication processes involved in VLSI design Design of MOS, CMOS and Bi CMOS circuits used in VLSI Structured design of VLSI Introduction to VHDL for VLSI design Automated design for placement and routing of VLSI systems VLSI testing and testability The various topics of the book have been discussed lucidly with analysis, when required, examples, figures and adequate analytical and the...
Scalable architecture for a room temperature solid-state quantum information processor.

Science.gov (United States)

Yao, N Y; Jiang, L; Gorshkov, A V; Maurer, P C; Giedke, G; Cirac, J I; Lukin, M D

2012-04-24

The realization of a scalable quantum information processor has emerged over the past decade as one of the central challenges at the interface of fundamental science and engineering. Here we propose and analyse an architecture for a scalable, solid-state quantum information processor capable of operating at room temperature. Our approach is based on recent experimental advances involving nitrogen-vacancy colour centres in diamond. In particular, we demonstrate that the multiple challenges associated with operation at ambient temperature, individual addressing at the nanoscale, strong qubit coupling, robustness against disorder and low decoherence rates can be simultaneously achieved under realistic, experimentally relevant conditions. The architecture uses a novel approach to quantum information transfer and includes a hierarchy of control at successive length scales. Moreover, it alleviates the stringent constraints currently limiting the realization of scalable quantum processors and will provide fundamental insights into the physics of non-equilibrium many-body quantum systems.
Applications of VLSI circuits to medical imaging

International Nuclear Information System (INIS)

O'Donnell, M.

1988-01-01

In this paper the application of advanced VLSI circuits to medical imaging is explored. The relationship of both general purpose signal processing chips and custom devices to medical imaging is discussed using examples of fabricated chips. In addition, advanced CAD tools for silicon compilation are presented. Devices built with these tools represent a possible alternative to custom devices and general purpose signal processors for the next generation of medical imaging systems
Array processors: an introduction to their architecture, software, and applications in nuclear medicine

International Nuclear Information System (INIS)

King, M.A.; Doherty, P.W.; Rosenberg, R.J.; Cool, S.L.

1983-01-01

Array processors are ''number crunchers'' that dramatically enhance the processing power of nuclear medicine computer systems for applicatons dealing with the repetitive operations involved in digital image processing of large segments of data. The general architecture and the programming of array processors are introduced, along with some applications of array processors to the reconstruction of emission tomographic images, digital image enhancement, and functional image formation

A Low Cost VLSI Architecture for Spike Sorting Based on Feature Extraction with Peak Search.

Science.gov (United States)

Chang, Yuan-Jyun; Hwang, Wen-Jyi; Chen, Chih-Chang

2016-12-07

The goal of this paper is to present a novel VLSI architecture for spike sorting with high classification accuracy, low area costs and low power consumption. A novel feature extraction algorithm with low computational complexities is proposed for the design of the architecture. In the feature extraction algorithm, a spike is separated into two portions based on its peak value. The area of each portion is then used as a feature. The algorithm is simple to implement and less susceptible to noise interference. Based on the algorithm, a novel architecture capable of identifying peak values and computing spike areas concurrently is proposed. To further accelerate the computation, a spike can be divided into a number of segments for the local feature computation. The local features are subsequently merged with the global ones by a simple hardware circuit. The architecture can also be easily operated in conjunction with the circuits for commonly-used spike detection algorithms, such as the Non-linear Energy Operator (NEO). The architecture has been implemented by an Application-Specific Integrated Circuit (ASIC) with 90-nm technology. Comparisons to the existing works show that the proposed architecture is well suited for real-time multi-channel spike detection and feature extraction requiring low hardware area costs, low power consumption and high classification accuracy.
vPELS: An E-Learning Social Environment for VLSI Design with Content Security Using DRM

Science.gov (United States)

Dewan, Jahangir; Chowdhury, Morshed; Batten, Lynn

2014-01-01

This article provides a proposal for personal e-learning system (vPELS [where "v" stands for VLSI: very large scale integrated circuit])) architecture in the context of social network environment for VLSI Design. The main objective of vPELS is to develop individual skills on a specific subject--say, VLSI--and share resources with peers.…
ARTiS, an Asymmetric Real-Time Scheduler for Linux on Multi-Processor Architectures

OpenAIRE

Piel , Éric; Marquet , Philippe; Soula , Julien; Osuna , Christophe; Dekeyser , Jean-Luc

2005-01-01

The ARTiS system is a real-time extension of the GNU/Linux scheduler dedicated to SMP (Symmetric Multi-Processors) systems. It allows to mix High Performance Computing and real-time. ARTiS exploits the SMP architecture to guarantee the preemption of a processor when the system has to schedule a real-time task. The implementation is available as a modification of the Linux kernel, especially focusing (but not restricted to) IA-64 architecture. The basic idea of ARTiS is to assign a selected se...
Advanced Avionics and Processor Systems for a Flexible Space Exploration Architecture

Science.gov (United States)

Keys, Andrew S.; Adams, James H.; Smith, Leigh M.; Johnson, Michael A.; Cressler, John D.

2010-01-01

The Advanced Avionics and Processor Systems (AAPS) project, formerly known as the Radiation Hardened Electronics for Space Environments (RHESE) project, endeavors to develop advanced avionic and processor technologies anticipated to be used by NASA s currently evolving space exploration architectures. The AAPS project is a part of the Exploration Technology Development Program, which funds an entire suite of technologies that are aimed at enabling NASA s ability to explore beyond low earth orbit. NASA s Marshall Space Flight Center (MSFC) manages the AAPS project. AAPS uses a broad-scoped approach to developing avionic and processor systems. Investment areas include advanced electronic designs and technologies capable of providing environmental hardness, reconfigurable computing techniques, software tools for radiation effects assessment, and radiation environment modeling tools. Near-term emphasis within the multiple AAPS tasks focuses on developing prototype components using semiconductor processes and materials (such as Silicon-Germanium (SiGe)) to enhance a device s tolerance to radiation events and low temperature environments. As the SiGe technology will culminate in a delivered prototype this fiscal year, the project emphasis shifts its focus to developing low-power, high efficiency total processor hardening techniques. In addition to processor development, the project endeavors to demonstrate techniques applicable to reconfigurable computing and partially reconfigurable Field Programmable Gate Arrays (FPGAs). This capability enables avionic architectures the ability to develop FPGA-based, radiation tolerant processor boards that can serve in multiple physical locations throughout the spacecraft and perform multiple functions during the course of the mission. The individual tasks that comprise AAPS are diverse, yet united in the common endeavor to develop electronics capable of operating within the harsh environment of space. Specifically, the AAPS tasks for
Microfluidic very large scale integration (VLSI) modeling, simulation, testing, compilation and physical synthesis

CERN Document Server

Pop, Paul; Madsen, Jan

2016-01-01

This book presents the state-of-the-art techniques for the modeling, simulation, testing, compilation and physical synthesis of mVLSI biochips. The authors describe a top-down modeling and synthesis methodology for the mVLSI biochips, inspired by microelectronics VLSI methodologies. They introduce a modeling framework for the components and the biochip architecture, and a high-level microfluidic protocol language. Coverage includes a topology graph-based model for the biochip architecture, and a sequencing graph to model for biochemical application, showing how the application model can be obtained from the protocol language. The techniques described facilitate programmability and automation, enabling developers in the emerging, large biochip market. · Presents the current models used for the research on compilation and synthesis techniques of mVLSI biochips in a tutorial fashion; · Includes a set of "benchmarks", that are presented in great detail and includes the source code of several of the techniques p...
FPGA Based Intelligent Co-operative Processor in Memory Architecture

DEFF Research Database (Denmark)

Ahmed, Zaki; Sotudeh, Reza; Hussain, Dil Muhammad Akbar

2011-01-01

benefits of PIM, a concept of Co-operative Intelligent Memory (CIM) was developed by the intelligent system group of University of Hertfordshire, based on the previously developed Co-operative Pseudo Intelligent Memory (CPIM). This paper provides an overview on previous works (CPIM, CIM) and realization......In a continuing effort to improve computer system performance, Processor-In-Memory (PIM) architecture has emerged as an alternative solution. PIM architecture incorporates computational units and control logic directly on the memory to provide immediate access to the data. To exploit the potential...
A Low Cost VLSI Architecture for Spike Sorting Based on Feature Extraction with Peak Search

Directory of Open Access Journals (Sweden)

Yuan-Jyun Chang

2016-12-01

Full Text Available The goal of this paper is to present a novel VLSI architecture for spike sorting with high classification accuracy, low area costs and low power consumption. A novel feature extraction algorithm with low computational complexities is proposed for the design of the architecture. In the feature extraction algorithm, a spike is separated into two portions based on its peak value. The area of each portion is then used as a feature. The algorithm is simple to implement and less susceptible to noise interference. Based on the algorithm, a novel architecture capable of identifying peak values and computing spike areas concurrently is proposed. To further accelerate the computation, a spike can be divided into a number of segments for the local feature computation. The local features are subsequently merged with the global ones by a simple hardware circuit. The architecture can also be easily operated in conjunction with the circuits for commonly-used spike detection algorithms, such as the Non-linear Energy Operator (NEO. The architecture has been implemented by an Application-Specific Integrated Circuit (ASIC with 90-nm technology. Comparisons to the existing works show that the proposed architecture is well suited for real-time multi-channel spike detection and feature extraction requiring low hardware area costs, low power consumption and high classification accuracy.
The architecture of a video image processor for the space station

Science.gov (United States)

Yalamanchili, S.; Lee, D.; Fritze, K.; Carpenter, T.; Hoyme, K.; Murray, N.

1987-01-01

The architecture of a video image processor for space station applications is described. The architecture was derived from a study of the requirements of algorithms that are necessary to produce the desired functionality of many of these applications. Architectural options were selected based on a simulation of the execution of these algorithms on various architectural organizations. A great deal of emphasis was placed on the ability of the system to evolve and grow over the lifetime of the space station. The result is a hierarchical parallel architecture that is characterized by high level language programmability, modularity, extensibility and can meet the required performance goals.
A memory-array architecture for computer vision

Energy Technology Data Exchange (ETDEWEB)

Balsara, P.T.

1989-01-01

With the fast advances in the area of computer vision and robotics there is a growing need for machines that can understand images at a very high speed. A conventional von Neumann computer is not suited for this purpose because it takes a tremendous amount of time to solve most typical image processing problems. Exploiting the inherent parallelism present in various vision tasks can significantly reduce the processing time. Fortunately, parallelism is increasingly affordable as hardware gets cheaper. Thus it is now imperative to study computer vision in a parallel processing framework. The author should first design a computational structure which is well suited for a wide range of vision tasks and then develop parallel algorithms which can run efficiently on this structure. Recent advances in VLSI technology have led to several proposals for parallel architectures for computer vision. In this thesis he demonstrates that a memory array architecture with efficient local and global communication capabilities can be used for high speed execution of a wide range of computer vision tasks. This architecture, called the Access Constrained Memory Array Architecture (ACMAA), is efficient for VLSI implementation because of its modular structure, simple interconnect and limited global control. Several parallel vision algorithms have been designed for this architecture. The choice of vision problems demonstrates the versatility of ACMAA for a wide range of vision tasks. These algorithms were simulated on a high level ACMAA simulator running on the Intel iPSC/2 hypercube, a parallel architecture. The results of this simulation are compared with those of sequential algorithms running on a single hypercube node. Details of the ACMAA processor architecture are also presented.
VLSI Architectures for the Multiplication of Integers Modulo a Fermat Number

Science.gov (United States)

Chang, J. J.; Truong, T. K.; Reed, I. S.; Hsu, I. S.

1984-01-01

Multiplication is central in the implementation of Fermat number transforms and other residue number algorithms. There is need for a good multiplication algorithm that can be realized easily on a very large scale integration (VLSI) chip. The Leibowitz multiplier is modified to realize multiplication in the ring of integers modulo a Fermat number. This new algorithm requires only a sequence of cyclic shifts and additions. The designs developed for this new multiplier are regular, simple, expandable, and, therefore, suitable for VLSI implementation.
An efficient ASIC implementation of 16-channel on-line recursive ICA processor for real-time EEG system.

Science.gov (United States)

Fang, Wai-Chi; Huang, Kuan-Ju; Chou, Chia-Ching; Chang, Jui-Chung; Cauwenberghs, Gert; Jung, Tzyy-Ping

2014-01-01

This is a proposal for an efficient very-large-scale integration (VLSI) design, 16-channel on-line recursive independent component analysis (ORICA) processor ASIC for real-time EEG system, implemented with TSMC 40 nm CMOS technology. ORICA is appropriate to be used in real-time EEG system to separate artifacts because of its highly efficient and real-time process features. The proposed ORICA processor is composed of an ORICA processing unit and a singular value decomposition (SVD) processing unit. Compared with previous work [1], this proposed ORICA processor has enhanced effectiveness and reduced hardware complexity by utilizing a deeper pipeline architecture, shared arithmetic processing unit, and shared registers. The 16-channel random signals which contain 8-channel super-Gaussian and 8-channel sub-Gaussian components are used to analyze the dependence of the source components, and the average correlation coefficient is 0.95452 between the original source signals and extracted ORICA signals. Finally, the proposed ORICA processor ASIC is implemented with TSMC 40 nm CMOS technology, and it consumes 15.72 mW at 100 MHz operating frequency.
Compilation Techniques Specific for a Hardware Cryptography-Embedded Multimedia Mobile Processor

Directory of Open Access Journals (Sweden)

Masa-aki FUKASE

2007-12-01

Full Text Available The development of single chip VLSI processors is the key technology of ever growing pervasive computing to answer overall demands for usability, mobility, speed, security, etc. We have so far developed a hardware cryptography-embedded multimedia mobile processor architecture, HCgorilla. Since HCgorilla integrates a wide range of techniques from architectures to applications and languages, one-sided design approach is not always useful. HCgorilla needs more complicated strategy, that is, hardware/software (H/S codesign. Thus, we exploit the software support of HCgorilla composed of a Java interface and parallelizing compilers. They are assumed to be installed in servers in order to reduce the load and increase the performance of HCgorilla-embedded clients. Since compilers are the essence of software's responsibility, we focus in this article on our recent results about the design, specifications, and prototyping of parallelizing compilers for HCgorilla. The parallelizing compilers are composed of a multicore compiler and a LIW compiler. They are specified to abstract parallelism from executable serial codes or the Java interface output and output the codes executable in parallel by HCgorilla. The prototyping compilers are written in Java. The evaluation by using an arithmetic test program shows the reasonability of the prototyping compilers compared with hand compilers.
VLSI ARCHITECTURE FOR IMAGE COMPRESSION THROUGH ADDER MINIMIZATION TECHNIQUE AT DCT STRUCTURE

Directory of Open Access Journals (Sweden)

N.R. Divya

2014-08-01

Full Text Available Data compression plays a vital role in multimedia devices to present the information in a succinct frame. Initially, the DCT structure is used for Image compression, which has lesser complexity and area efficient. Similarly, 2D DCT also has provided reasonable data compression, but implementation concern, it calls more multipliers and adders thus its lead to acquire more area and high power consumption. To contain an account of all, this paper has been dealt with VLSI architecture for image compression using Rom free DA based DCT (Discrete Cosine Transform structure. This technique provides high-throughput and most suitable for real-time implementation. In order to achieve this image matrix is subdivided into odd and even terms then the multiplication functions are removed by shift and add approach. Kogge_Stone_Adder techniques are proposed for obtaining a bit-wise image quality which determines the new trade-off levels as compared to the previous techniques. Overall the proposed architecture produces reduced memory, low power consumption and high throughput. MATLAB is used as a funding tool for receiving an input pixel and obtaining output image. Verilog HDL is used for implementing the design, Model Sim for simulation, Quatres II is used to synthesize and obtain details about power and area.
VLSI architectures for modern error-correcting codes

CERN Document Server

Zhang, Xinmiao

2015-01-01

Error-correcting codes are ubiquitous. They are adopted in almost every modern digital communication and storage system, such as wireless communications, optical communications, Flash memories, computer hard drives, sensor networks, and deep-space probing. New-generation and emerging applications demand codes with better error-correcting capability. On the other hand, the design and implementation of those high-gain error-correcting codes pose many challenges. They usually involve complex mathematical computations, and mapping them directly to hardware often leads to very high complexity. VLSI
Clock generators for SOC processors circuits and architectures

CERN Document Server

Fahim, Amr

2004-01-01

This book explores the design of fully-integrated frequency synthesizers suitable for system-on-a-chip (SOC) processors. The text takes a more global design perspective in jointly examining the design space at the circuit level as well as at the architectural level. The comprehensive coverage includes summary chapters on circuit theory as well as feedback control theory relevant to the operation of phase locked loops (PLLs). On the circuit level, the discussion includes low-voltage analog design in deep submicron digital CMOS processes, effects of supply noise, substrate noise, as well device noise. On the architectural level, the discussion includes PLL analysis using continuous-time as well as discrete-time models, linear and nonlinear effects of PLL performance, and detailed analysis of locking behavior. The book provides numerous real world applications, as well as practical rules-of-thumb for modern designers to use at the system, architectural, as well as the circuit level.
Comparison between research data processing capabilities of AMD and NVIDIA architecture-based graphic processors

International Nuclear Information System (INIS)

Dudnik, V.A.; Kudryavtsev, V.I.; Us, S.A.; Shestakov, M.V.

2015-01-01

A comparative analysis has been made to describe the potentialities of hardware and software tools of two most widely used modern architectures of graphic processors (AMD and NVIDIA). Special features and differences of GPU architectures are exemplified by fragments of GPGPU programs. Time consumption for the program development has been estimated. Some pieces of advice are given as to the optimum choice of the GPU type for speeding up the processing of scientific research results. Recommendations are formulated for the use of software tools that reduce the time of GPGPU application programming for the given types of graphic processors
VLSI design

CERN Document Server

Chandrasetty, Vikram Arkalgud

2011-01-01

This book provides insight into the practical design of VLSI circuits. It is aimed at novice VLSI designers and other enthusiasts who would like to understand VLSI design flows. Coverage includes key concepts in CMOS digital design, design of DSP and communication blocks on FPGAs, ASIC front end and physical design, and analog and mixed signal design. The approach is designed to focus on practical implementation of key elements of the VLSI design process, in order to make the topic accessible to novices. The design concepts are demonstrated using software from Mathworks, Xilinx, Mentor Graphic
Considerations for control system software verification and validation specific to implementations using distributed processor architectures

International Nuclear Information System (INIS)

Munro, J.K. Jr.

1993-01-01

Until recently, digital control systems have been implemented on centralized processing systems to function in one of several ways: (1) as a single processor control system; (2) as a supervisor at the top of a hierarchical network of multiple processors; or (3) in a client-server mode. Each of these architectures uses a very different set of communication protocols. The latter two architectures also belong to the category of distributed control systems. Distributed control systems can have a central focus, as in the cases just cited, or be quite decentralized in a loosely coupled, shared responsibility arrangement. This last architecture is analogous to autonomous hosts on a local area network. Each of the architectures identified above will have a different set of architecture-associated issues to be addressed in the verification and validation activities during software development. This paper summarizes results of efforts to identify, describe, contrast, and compare these issues
VLSI in medicine

CERN Document Server

Einspruch, Norman G

1989-01-01

VLSI Electronics Microstructure Science, Volume 17: VLSI in Medicine deals with the more important applications of VLSI in medical devices and instruments.This volume is comprised of 11 chapters. It begins with an article about medical electronics. The following three chapters cover diagnostic imaging, focusing on such medical devices as magnetic resonance imaging, neurometric analyzer, and ultrasound. Chapters 5, 6, and 7 present the impact of VLSI in cardiology. The electrocardiograph, implantable cardiac pacemaker, and the use of VLSI in Holter monitoring are detailed in these chapters. The
Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors

OpenAIRE

Catalán, Sandra; Igual, Francisco D.; Mayo, Rafael; Rodríguez-Sánchez, Rafael; Quintana-Ortí, Enrique S.

2015-01-01

Asymmetric multicore processors (AMPs) have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest for low-power high performance computing, this type of architectures is also being investigated as a means to improve the throughput-per-Watt of complex scientific applications. In this paper, we design and embed several architecture-aware ...

Processors and systems (picture processing)

Energy Technology Data Exchange (ETDEWEB)

Gemmar, P

1983-01-01

Automatic picture processing requires high performance computers and high transmission capacities in the processor units. The author examines the possibilities of operating processors in parallel in order to accelerate the processing of pictures. He therefore discusses a number of available processors and systems for picture processing and illustrates their capacities for special types of picture processing. He stresses the fact that the amount of storage required for picture processing is exceptionally high. The author concludes that it is as yet difficult to decide whether very large groups of simple processors or highly complex multiprocessor systems will provide the best solution. Both methods will be aided by the development of VLSI. New solutions have already been offered (systolic arrays and 3-d processing structures) but they also are subject to losses caused by inherently parallel algorithms. Greater efforts must be made to produce suitable software for multiprocessor systems. Some possibilities for future picture processing systems are discussed. 33 references.
VLSI electronics microstructure science

CERN Document Server

1982-01-01

VLSI Electronics: Microstructure Science, Volume 4 reviews trends for the future of very large scale integration (VLSI) electronics and the scientific base that supports its development.This book discusses the silicon-on-insulator for VLSI and VHSIC, X-ray lithography, and transient response of electron transport in GaAs using the Monte Carlo method. The technology and manufacturing of high-density magnetic-bubble memories, metallic superlattices, challenge of education for VLSI, and impact of VLSI on medical signal processing are also elaborated. This text likewise covers the impact of VLSI t
The language parallel Pascal and other aspects of the massively parallel processor

Science.gov (United States)

Reeves, A. P.; Bruner, J. D.

1982-01-01

A high level language for the Massively Parallel Processor (MPP) was designed. This language, called Parallel Pascal, is described in detail. A description of the language design, a description of the intermediate language, Parallel P-Code, and details for the MPP implementation are included. Formal descriptions of Parallel Pascal and Parallel P-Code are given. A compiler was developed which converts programs in Parallel Pascal into the intermediate Parallel P-Code language. The code generator to complete the compiler for the MPP is being developed independently. A Parallel Pascal to Pascal translator was also developed. The architecture design for a VLSI version of the MPP was completed with a description of fault tolerant interconnection networks. The memory arrangement aspects of the MPP are discussed and a survey of other high level languages is given.
Extending and implementing the Self-adaptive Virtual Processor for distributed memory architectures

NARCIS (Netherlands)

van Tol, M.W.; Koivisto, J.

2011-01-01

Many-core architectures of the future are likely to have distributed memory organizations and need fine grained concurrency management to be used effectively. The Self-adaptive Virtual Processor (SVP) is an abstract concurrent programming model which can provide this, but the model and its current
Performance evaluation of throughput computing workloads using multi-core processors and graphics processors

Science.gov (United States)

Dave, Gaurav P.; Sureshkumar, N.; Blessy Trencia Lincy, S. S.

2017-11-01

Current trend in processor manufacturing focuses on multi-core architectures rather than increasing the clock speed for performance improvement. Graphic processors have become as commodity hardware for providing fast co-processing in computer systems. Developments in IoT, social networking web applications, big data created huge demand for data processing activities and such kind of throughput intensive applications inherently contains data level parallelism which is more suited for SIMD architecture based GPU. This paper reviews the architectural aspects of multi/many core processors and graphics processors. Different case studies are taken to compare performance of throughput computing applications using shared memory programming in OpenMP and CUDA API based programming.
A scalable single-chip multi-processor architecture with on-chip RTOS kernel

NARCIS (Netherlands)

Theelen, B.D.; Verschueren, A.C.; Reyes Suarez, V.V.; Stevens, M.P.J.; Nunez, A.

2003-01-01

Now that system-on-chip technology is emerging, single-chip multi-processors are becoming feasible. A key problem of designing such systems is the complexity of their on-chip interconnects and memory architecture. It is furthermore unclear at what level software should be integrated. An example of a
Parallel computation of nondeterministic algorithms in VLSI

Energy Technology Data Exchange (ETDEWEB)

Hortensius, P D

1987-01-01

This work examines parallel VLSI implementations of nondeterministic algorithms. It is demonstrated that conventional pseudorandom number generators are unsuitable for highly parallel applications. Efficient parallel pseudorandom sequence generation can be accomplished using certain classes of elementary one-dimensional cellular automata. The pseudorandom numbers appear in parallel on each clock cycle. Extensive study of the properties of these new pseudorandom number generators is made using standard empirical random number tests, cycle length tests, and implementation considerations. Furthermore, it is shown these particular cellular automata can form the basis of efficient VLSI architectures for computations involved in the Monte Carlo simulation of both the percolation and Ising models from statistical mechanics. Finally, a variation on a Built-In Self-Test technique based upon cellular automata is presented. These Cellular Automata-Logic-Block-Observation (CALBO) circuits improve upon conventional design for testability circuitry.
A pipelined architecture for real time correction of non-uniformity in infrared focal plane arrays imaging system using multiprocessors

Science.gov (United States)

Zou, Liang; Fu, Zhuang; Zhao, YanZheng; Yang, JunYan

2010-07-01

This paper proposes a kind of pipelined electric circuit architecture implemented in FPGA, a very large scale integrated circuit (VLSI), which efficiently deals with the real time non-uniformity correction (NUC) algorithm for infrared focal plane arrays (IRFPA). Dual Nios II soft-core processors and a DSP with a 64+ core together constitute this image system. Each processor undertakes own systematic task, coordinating its work with each other's. The system on programmable chip (SOPC) in FPGA works steadily under the global clock frequency of 96Mhz. Adequate time allowance makes FPGA perform NUC image pre-processing algorithm with ease, which has offered favorable guarantee for the work of post image processing in DSP. And at the meantime, this paper presents a hardware (HW) and software (SW) co-design in FPGA. Thus, this systematic architecture yields an image processing system with multiprocessor, and a smart solution to the satisfaction with the performance of the system.
Prototype architecture for a VLSI level zero processing system. [Space Station Freedom

Science.gov (United States)

Shi, Jianfei; Grebowsky, Gerald J.; Horner, Ward P.; Chesney, James R.

1989-01-01

The prototype architecture and implementation of a high-speed level zero processing (LZP) system are discussed. Due to the new processing algorithm and VLSI technology, the prototype LZP system features compact size, low cost, high processing throughput, and easy maintainability and increased reliability. Though extensive control functions have been done by hardware, the programmability of processing tasks makes it possible to adapt the system to different data formats and processing requirements. It is noted that the LZP system can handle up to 8 virtual channels and 24 sources with combined data volume of 15 Gbytes per orbit. For greater demands, multiple LZP systems can be configured in parallel, each called a processing channel and assigned a subset of virtual channels. The telemetry data stream will be steered into different processing channels in accordance with their virtual channel IDs. This super system can cope with a virtually unlimited number of virtual channels and sources. In the near future, it is expected that new disk farms with data rate exceeding 150 Mbps will be available from commercial vendors due to the advance in disk drive technology.
VLSI design of an RSA encryption/decryption chip using systolic array based architecture

Science.gov (United States)

Sun, Chi-Chia; Lin, Bor-Shing; Jan, Gene Eu; Lin, Jheng-Yi

2016-09-01

This article presents the VLSI design of a configurable RSA public key cryptosystem supporting the 512-bit, 1024-bit and 2048-bit based on Montgomery algorithm achieving comparable clock cycles of current relevant works but with smaller die size. We use binary method for the modular exponentiation and adopt Montgomery algorithm for the modular multiplication to simplify computational complexity, which, together with the systolic array concept for electric circuit designs effectively, lower the die size. The main architecture of the chip consists of four functional blocks, namely input/output modules, registers module, arithmetic module and control module. We applied the concept of systolic array to design the RSA encryption/decryption chip by using VHDL hardware language and verified using the TSMC/CIC 0.35 m 1P4 M technology. The die area of the 2048-bit RSA chip without the DFT is 3.9 × 3.9 mm2 (4.58 × 4.58 mm2 with DFT). Its average baud rate can reach 10.84 kbps under a 100 MHz clock.
VLSI structures for track finding

International Nuclear Information System (INIS)

Dell'Orso, M.

1989-01-01

We discuss the architecture of a device based on the concept of associative memory designed to solve the track finding problem, typical of high energy physics experiments, in a time span of a few microseconds even for very high multiplicity events. This ''machine'' is implemented as a large array of custom VLSI chips. All the chips are equal and each of them stores a number of ''patterns''. All the patterns in all the chips are compared in parallel to the data coming from the detector while the detector is being read out. (orig.)
Green Secure Processors: Towards Power-Efficient Secure Processor Design

Science.gov (United States)

Chhabra, Siddhartha; Solihin, Yan

With the increasing wealth of digital information stored on computer systems today, security issues have become increasingly important. In addition to attacks targeting the software stack of a system, hardware attacks have become equally likely. Researchers have proposed Secure Processor Architectures which utilize hardware mechanisms for memory encryption and integrity verification to protect the confidentiality and integrity of data and computation, even from sophisticated hardware attacks. While there have been many works addressing performance and other system level issues in secure processor design, power issues have largely been ignored. In this paper, we first analyze the sources of power (energy) increase in different secure processor architectures. We then present a power analysis of various secure processor architectures in terms of their increase in power consumption over a base system with no protection and then provide recommendations for designs that offer the best balance between performance and power without compromising security. We extend our study to the embedded domain as well. We also outline the design of a novel hybrid cryptographic engine that can be used to minimize the power consumption for a secure processor. We believe that if secure processors are to be adopted in future systems (general purpose or embedded), it is critically important that power issues are considered in addition to performance and other system level issues. To the best of our knowledge, this is the first work to examine the power implications of providing hardware mechanisms for security.
Very Long Instruction Word Processors

Indian Academy of Sciences (India)

Pentium Processor have modified the processor architecture to exploit parallelism in a program. .... The type of operation itself is encoded using 14 bits. .... text of designing simple architectures with low power consump- tion and execute x86 ...
VLSI architecture and design for the Fermat Number Transform implementation

Energy Technology Data Exchange (ETDEWEB)

Pajayakrit, A.

1987-01-01

A new technique of sectioning a pipelined transformer, using the Fermat Number Transform (FNT), is introduced. Also, a novel VLSI design which overcomes the problems of implementing FNTs, for use in fast convolution/correlation, is described. The design comprises one complete section of a pipelined transformer and may be programmed to function at any point in a forward or inverse pipeline, so allowing the construction of a pipelined convolver or correlator using identical chips, thus the favorable properties of the transform can be exploited. This overcomes the difficulty of fitting a complete pipeline onto one chip without resorting to the use of several different designs. The implementation of high-speed convolver/correlator using the VLSI chips has been successfully developed and tested. For impulse response lengths of up to 16 points the sampling rates of 0.5 MHz can be achieved. Finally, the filter speed performance using the FNT chips is compared to other designs and conclusions drawn on the merits of the FNT for this application. Also, the advantages and limitations of the FNT are analyzed, with respect to the more conventional FFT, and the results are provided.
Real time image synthesis on a SIMD linear array processor: algorithms and architectures

International Nuclear Information System (INIS)

Letellier, Laurent

1993-01-01

Nowadays, image synthesis has become a widely used technique. The impressive computing power required for real time applications necessitates the use of parallel architectures. In this context, we evaluate an SIMD linear parallel architecture, SYMPATI2, dedicated to image processing. The objective of this study is to propose a cost-effective graphics accelerator relying on SYMPATI2's modular and programmable structure. The parallelization of basic image synthesis algorithms on SYMPATI2 enables us to determine its limits in this application field. These limits lead us to evaluate a new structure with a fast intercommunication network between processors, but processors have to support the message consistency, which brings about a strong decrease in performance. To solve this problem, we suggest a simple network whose access priorities are represented by tokens. The simulations of this new architecture indicate that the SIMD mode causes a drastic cut in parallelism. To cope with this drawback, we propose a context switching procedure which reduces the SIMD rigidity and increases the parallelism rate significantly. Then, the graphics accelerator we propose is compared with existing graphics workstations. This comparison indicates that our structure, which is able to accelerate both image synthesis and image processing, is competitive and well-suited for multimedia applications. (author) [fr
A Compact VLSI System for Bio-Inspired Visual Motion Estimation.

Science.gov (United States)

Shi, Cong; Luo, Gang

2018-04-01

This paper proposes a bio-inspired visual motion estimation algorithm based on motion energy, along with its compact very-large-scale integration (VLSI) architecture using low-cost embedded systems. The algorithm mimics motion perception functions of retina, V1, and MT neurons in a primate visual system. It involves operations of ternary edge extraction, spatiotemporal filtering, motion energy extraction, and velocity integration. Moreover, we propose the concept of confidence map to indicate the reliability of estimation results on each probing location. Our algorithm involves only additions and multiplications during runtime, which is suitable for low-cost hardware implementation. The proposed VLSI architecture employs multiple (frame, pixel, and operation) levels of pipeline and massively parallel processing arrays to boost the system performance. The array unit circuits are optimized to minimize hardware resource consumption. We have prototyped the proposed architecture on a low-cost field-programmable gate array platform (Zynq 7020) running at 53-MHz clock frequency. It achieved 30-frame/s real-time performance for velocity estimation on 160 × 120 probing locations. A comprehensive evaluation experiment showed that the estimated velocity by our prototype has relatively small errors (average endpoint error < 0.5 pixel and angular error < 10°) for most motion cases.
Memory Efficient VLSI Implementation of Real-Time Motion Detection System Using FPGA Platform

Directory of Open Access Journals (Sweden)

Sanjay Singh

2017-06-01

Full Text Available Motion detection is the heart of a potentially complex automated video surveillance system, intended to be used as a standalone system. Therefore, in addition to being accurate and robust, a successful motion detection technique must also be economical in the use of computational resources on selected FPGA development platform. This is because many other complex algorithms of an automated video surveillance system also run on the same platform. Keeping this key requirement as main focus, a memory efficient VLSI architecture for real-time motion detection and its implementation on FPGA platform is presented in this paper. This is accomplished by proposing a new memory efficient motion detection scheme and designing its VLSI architecture. The complete real-time motion detection system using the proposed memory efficient architecture along with proper input/output interfaces is implemented on Xilinx ML510 (Virtex-5 FX130T FPGA development platform and is capable of operating at 154.55 MHz clock frequency. Memory requirement of the proposed architecture is reduced by 41% compared to the standard clustering based motion detection architecture. The new memory efficient system robustly and automatically detects motion in real-world scenarios (both for the static backgrounds and the pseudo-stationary backgrounds in real-time for standard PAL (720 × 576 size color video.
Synthesis algorithm of VLSI multipliers for ASIC

Science.gov (United States)

Chua, O. H.; Eldin, A. G.

1993-01-01

Multipliers are critical sub-blocks in ASIC design, especially for digital signal processing and communications applications. A flexible multiplier synthesis tool is developed which is capable of generating multiplier blocks for word size in the range of 4 to 256 bits. A comparison of existing multiplier algorithms is made in terms of speed, silicon area, and suitability for automated synthesis and verification of its VLSI implementation. The algorithm divides the range of supported word sizes into sub-ranges and provides each sub-range with a specific multiplier architecture for optimal speed and area. The algorithm of the synthesis tool and the multiplier architectures are presented. Circuit implementation and the automated synthesis methodology are discussed.
mAgic-FPU and MADE: A customizable VLIW core and the modular VLIW processor architecture description environment

Science.gov (United States)

Paolucci, Pier S.; Kajfasz, Philippe; Bonnot, Philippe; Candaele, Bernard; Maufroid, Daniel; Pastorelli, Elena; Ricciardi, Andrea; Fusella, Yves; Guarino, Eugenio

2001-09-01

mAgic-FPU is the architecture of a family of VLIW cores for configurable system level integration of floating and fixed point computing power. mAgic customization permits the designer to tune basic parameters, such as the computing power/memory access ratio of the core processor, the number of available arithmetic operation per cycle, the register file size and number of port, as well as of the number of arithmetic operators. The reconfiguration (e.g., of register file size and number of port, as well as of the number of arithmetic operators) is supported by the software environment MADE (Modular VLIW processor Architecture and Assembler Description Environment). MADE reads an architecture description file and produces a customized assembler-scheduler for the target VLIW architecture, configuring a general purpose VLIW optimizer-scheduler engine. The mAgic-FPU core architecture satisfies the requisite of portability among silicon foundries. The first members of the mAgic FPU core family architecture fit the requirements of 'Smart Antenna for Adaptive Beam-Forming processing' and 'Physical Sound Synthesis'. The first 1 GigaFlops mAgic core will run at 100 MHz within an area of 40 mm 2 in 0.25 μm ATMEL CMOS technology in first half 2002.
A novel VLSI processor for high-rate, high resolution spectroscopy

CERN Document Server

Pullia, Antonio; Gatti, E; Longoni, A; Buttler, W

2000-01-01

A novel time-variant VLSI shaper amplifier, suitable for multi-anode Silicon Drift Detectors or other multi-element solid-state X-ray detection systems, is proposed. The new read-out scheme has been conceived for demanding applications with synchrotron light sources, such as X-ray holography or EXAFS, where both high count-rates and high-energy resolutions are required. The circuit is of the linear time-variant class, accepts randomly distributed events and features: a finite-width (1-10 mu s) quasi-optimal weight function, an ultra-low-level energy discrimination (approx 150 eV), and a full compatibility for monolithic integration in CMOS technology. Its impulse response has a staircase-like shape, but the weight function (which is in general different from the impulse response in time-variant systems) is quasi trapezoidal. The operation principles of the new scheme as well as the first experimental results obtained with a prototype of the circuit are presented and discussed in the work.

DART: A Functional-Level Reconfigurable Architecture for High Energy Efficiency

Directory of Open Access Journals (Sweden)

David Raphaël

2008-01-01

Full Text Available Abstract Flexibility becomes a major concern for the development of multimedia and mobile communication systems, as well as classical high-performance and low-energy consumption constraints. The use of general-purpose processors solves flexibility problems but fails to cope with the increasing demand for energy efficiency. This paper presents the DART architecture based on the functional-level reconfiguration paradigm which allows a significant improvement in energy efficiency. DART is built around a hierarchical interconnection network allowing high flexibility while keeping the power overhead low. To enable specific optimizations, DART supports two modes of reconfiguration. The compilation framework is built using compilation and high-level synthesis techniques. A 3G mobile communication application has been implemented as a proof of concept. The energy distribution within the architecture and the physical implementation are also discussed. Finally, the VLSI design of a 0.13 m CMOS SoC implementing a specialized DART cluster is presented.
DART: A Functional-Level Reconfigurable Architecture for High Energy Efficiency

Directory of Open Access Journals (Sweden)

Sébastien Pillement

2007-12-01

Full Text Available Flexibility becomes a major concern for the development of multimedia and mobile communication systems, as well as classical high-performance and low-energy consumption constraints. The use of general-purpose processors solves flexibility problems but fails to cope with the increasing demand for energy efficiency. This paper presents the DART architecture based on the functional-level reconfiguration paradigm which allows a significant improvement in energy efficiency. DART is built around a hierarchical interconnection network allowing high flexibility while keeping the power overhead low. To enable specific optimizations, DART supports two modes of reconfiguration. The compilation framework is built using compilation and high-level synthesis techniques. A 3G mobile communication application has been implemented as a proof of concept. The energy distribution within the architecture and the physical implementation are also discussed. Finally, the VLSI design of a 0.13Ã¢Â€Â‰ÃŽÂ¼m CMOS SoC implementing a specialized DART cluster is presented.
Multiple Embedded Processors for Fault-Tolerant Computing

Science.gov (United States)

Bolotin, Gary; Watson, Robert; Katanyoutanant, Sunant; Burke, Gary; Wang, Mandy

2005-01-01

A fault-tolerant computer architecture has been conceived in an effort to reduce vulnerability to single-event upsets (spurious bit flips caused by impingement of energetic ionizing particles or photons). As in some prior fault-tolerant architectures, the redundancy needed for fault tolerance is obtained by use of multiple processors in one computer. Unlike prior architectures, the multiple processors are embedded in a single field-programmable gate array (FPGA). What makes this new approach practical is the recent commercial availability of FPGAs that are capable of having multiple embedded processors. A working prototype (see figure) consists of two embedded IBM PowerPC 405 processor cores and a comparator built on a Xilinx Virtex-II Pro FPGA. This relatively simple instantiation of the architecture implements an error-detection scheme. A planned future version, incorporating four processors and two comparators, would correct some errors in addition to detecting them.
Design of a Low-Power VLSI Macrocell for Nonlinear Adaptive Video Noise Reduction

Directory of Open Access Journals (Sweden)

Sergio Saponara

2004-09-01

Full Text Available A VLSI macrocell for edge-preserving video noise reduction is proposed in the paper. It is based on a nonlinear rational filter enhanced by a noise estimator for blind and dynamic adaptation of the filtering parameters to the input signal statistics. The VLSI filter features a modular architecture allowing the extension of both mask size and filtering directions. Both spatial and spatiotemporal algorithms are supported. Simulation results with monochrome test videos prove its efficiency for many noise distributions with PSNR improvements up to 3.8 dB with respect to a nonadaptive solution. The VLSI macrocell has been realized in a 0.18 ÃŽÂ¼m CMOS technology using a standard-cells library; it allows for real-time processing of main video formats, up to 30 fps (frames per second 4CIF, with a power consumption in the order of few mW.
Directions in parallel processor architecture, and GPUs too

CERN Multimedia

CERN. Geneva

2014-01-01

Modern computing is power-limited in every domain of computing. Performance increments extracted from instruction-level parallelism (ILP) are no longer power-efficient; they haven't been for some time. Thread-level parallelism (TLP) is a more easily exploited form of parallelism, at the expense of programmer effort to expose it in the program. In this talk, I will introduce you to disparate topics in parallel processor architecture that will impact programming models (and you) in both the near and far future. About the speaker Olivier is a senior GPU (SM) architect at NVIDIA and an active participant in the concurrency working group of the ISO C++ committee. He has also worked on very large diesel engines as a mechanical engineer, and taught at McGill University (Canada) as a faculty instructor.
The Molen Polymorphic Media Processor

NARCIS (Netherlands)

Kuzmanov, G.K.

2004-01-01

In this dissertation, we address high performance media processing based on a tightly coupled co-processor architectural paradigm. More specifically, we introduce a reconfigurable media augmentation of a general purpose processor and implement it into a fully operational processor prototype. The
Plasma processing for VLSI

CERN Document Server

Einspruch, Norman G

1984-01-01

VLSI Electronics: Microstructure Science, Volume 8: Plasma Processing for VLSI (Very Large Scale Integration) discusses the utilization of plasmas for general semiconductor processing. It also includes expositions on advanced deposition of materials for metallization, lithographic methods that use plasmas as exposure sources and for multiple resist patterning, and device structures made possible by anisotropic etching.This volume is divided into four sections. It begins with the history of plasma processing, a discussion of some of the early developments and trends for VLSI. The second section
CASPER: Embedding Power Estimation and Hardware-Controlled Power Management in a Cycle-Accurate Micro-Architecture Simulation Platform for Many-Core Multi-Threading Heterogeneous Processors

Directory of Open Access Journals (Sweden)

Arun Ravindran

2012-02-01

Full Text Available Despite the promising performance improvement observed in emerging many-core architectures in high performance processors, high power consumption prohibitively affects their use and marketability in the low-energy sectors, such as embedded processors, network processors and application specific instruction processors (ASIPs. While most chip architects design power-efficient processors by finding an optimal power-performance balance in their design, some use sophisticated on-chip autonomous power management units, which dynamically reduce the voltage or frequencies of idle cores and hence extend battery life and reduce operating costs. For large scale designs of many-core processors, a holistic approach integrating both these techniques at different levels of abstraction can potentially achieve maximal power savings. In this paper we present CASPER, a robust instruction trace driven cycle-accurate many-core multi-threading micro-architecture simulation platform where we have incorporated power estimation models of a wide variety of tunable many-core micro-architectural design parameters, thus enabling processor architects to explore a sufficiently large design space and achieve power-efficient designs. Additionally CASPER is designed to accommodate cycle-accurate models of hardware controlled power management units, enabling architects to experiment with and evaluate different autonomous power-saving mechanisms to study the run-time power-performance trade-offs in embedded many-core processors. We have implemented two such techniques in CASPER–Chipwide Dynamic Voltage and Frequency Scaling, and Performance Aware Core-Specific Frequency Scaling, which show average power savings of 35.9% and 26.2% on a baseline 4-core SPARC based architecture respectively. This power saving data accounts for the power consumption of the power management units themselves. The CASPER simulation platform also provides users with complete support of SPARCV9
Mapping of H.264 decoding on a multiprocessor architecture

Science.gov (United States)

van der Tol, Erik B.; Jaspers, Egbert G.; Gelderblom, Rob H.

2003-05-01

Due to the increasing significance of development costs in the competitive domain of high-volume consumer electronics, generic solutions are required to enable reuse of the design effort and to increase the potential market volume. As a result from this, Systems-on-Chip (SoCs) contain a growing amount of fully programmable media processing devices as opposed to application-specific systems, which offered the most attractive solutions due to a high performance density. The following motivates this trend. First, SoCs are increasingly dominated by their communication infrastructure and embedded memory, thereby making the cost of the functional units less significant. Moreover, the continuously growing design costs require generic solutions that can be applied over a broad product range. Hence, powerful programmable SoCs are becoming increasingly attractive. However, to enable power-efficient designs, that are also scalable over the advancing VLSI technology, parallelism should be fully exploited. Both task-level and instruction-level parallelism can be provided by means of e.g. a VLIW multiprocessor architecture. To provide the above-mentioned scalability, we propose to partition the data over the processors, instead of traditional functional partitioning. An advantage of this approach is the inherent locality of data, which is extremely important for communication-efficient software implementations. Consequently, a software implementation is discussed, enabling e.g. SD resolution H.264 decoding with a two-processor architecture, whereas High-Definition (HD) decoding can be achieved with an eight-processor system, executing the same software. Experimental results show that the data communication considerably reduces up to 65% directly improving the overall performance. Apart from considerable improvement in memory bandwidth, this novel concept of partitioning offers a natural approach for optimally balancing the load of all processors, thereby further improving the
Dual-core Itanium Processor

CERN Multimedia

2006-01-01

Intel’s first dual-core Itanium processor, code-named "Montecito" is a major release of Intel's Itanium 2 Processor Family, which implements the Intel Itanium architecture on a dual-core processor with two cores per die (integrated circuit). Itanium 2 is much more powerful than its predecessor. It has lower power consumption and thermal dissipation.
Examining the volume efficiency of the cortical architecture in a multi-processor network model.

Science.gov (United States)

Ruppin, E; Schwartz, E L; Yeshurun, Y

1993-01-01

The convoluted form of the sheet-like mammalian cortex naturally raises the question whether there is a simple geometrical reason for the prevalence of cortical architecture in the brains of higher vertebrates. Addressing this question, we present a formal analysis of the volume occupied by a massively connected network or processors (neurons) and then consider the pertaining cortical data. Three gross macroscopic features of cortical organization are examined: the segregation of white and gray matter, the circumferential organization of the gray matter around the white matter, and the folded cortical structure. Our results testify to the efficiency of cortical architecture.
Parallel algorithms and architecture for computation of manipulator forward dynamics

Science.gov (United States)

Fijany, Amir; Bejczy, Antal K.

1989-01-01

Parallel computation of manipulator forward dynamics is investigated. Considering three classes of algorithms for the solution of the problem, that is, the O(n), the O(n exp 2), and the O(n exp 3) algorithms, parallelism in the problem is analyzed. It is shown that the problem belongs to the class of NC and that the time and processors bounds are of O(log2/2n) and O(n exp 4), respectively. However, the fastest stable parallel algorithms achieve the computation time of O(n) and can be derived by parallelization of the O(n exp 3) serial algorithms. Parallel computation of the O(n exp 3) algorithms requires the development of parallel algorithms for a set of fundamentally different problems, that is, the Newton-Euler formulation, the computation of the inertia matrix, decomposition of the symmetric, positive definite matrix, and the solution of triangular systems. Parallel algorithms for this set of problems are developed which can be efficiently implemented on a unique architecture, a triangular array of n(n+2)/2 processors with a simple nearest-neighbor interconnection. This architecture is particularly suitable for VLSI and WSI implementations. The developed parallel algorithm, compared to the best serial O(n) algorithm, achieves an asymptotic speedup of more than two orders-of-magnitude in the computation the forward dynamics.
Making CSB + -Trees Processor Conscious

DEFF Research Database (Denmark)

Samuel, Michael; Pedersen, Anders Uhl; Bonnet, Philippe

2005-01-01

of the CSB+-tree. We argue that it is necessary to consider a larger group of parameters in order to adapt CSB+-tree to processor architectures as different as Pentium and Itanium. We identify this group of parameters and study how it impacts the performance of CSB+-tree on Itanium 2. Finally, we propose......Cache-conscious indexes, such as CSB+-tree, are sensitive to the underlying processor architecture. In this paper, we focus on how to adapt the CSB+-tree so that it performs well on a range of different processor architectures. Previous work has focused on the impact of node size on the performance...... a systematic method for adapting CSB+-tree to new platforms. This work is a first step towards integrating CSB+-tree in MySQL’s heap storage manager....
DESIGN AND IMPLEMENTATION OF A VHDL PROCESSOR FOR DCT BASED IMAGE COMPRESSION

Directory of Open Access Journals (Sweden)

Md. Shabiul Islam

2017-11-01

Full Text Available This paper describes the design and implementation of a VHDL processor meant for performing 2D-Discrete Cosine Transform (DCT to use in image compression applications. The design flow starts from the system specification to implementation on silicon and the entire process is carried out using an advanced workstation based design environment for digital signal processing. The software allows the bit-true analysis to ensure that the designed VLSI processor satisfies the required specifications. The bit-true analysis is performed on all levels of abstraction (behavior, VHDL etc.. The motivation behind the work is smaller size chip area, faster processing, reducing the cost of the chip
Algorithmically specialized parallel computers

CERN Document Server

Snyder, Lawrence; Gannon, Dennis B

1985-01-01

Algorithmically Specialized Parallel Computers focuses on the concept and characteristics of an algorithmically specialized computer.This book discusses the algorithmically specialized computers, algorithmic specialization using VLSI, and innovative architectures. The architectures and algorithms for digital signal, speech, and image processing and specialized architectures for numerical computations are also elaborated. Other topics include the model for analyzing generalized inter-processor, pipelined architecture for search tree maintenance, and specialized computer organization for raster
Architecture and VHDL behavioural validation of a parallel processor dedicated to computer vision

International Nuclear Information System (INIS)

Collette, Thierry

1992-01-01

Speeding up image processing is mainly obtained using parallel computers; SIMD processors (single instruction stream, multiple data stream) have been developed, and have proven highly efficient regarding low-level image processing operations. Nevertheless, their performances drop for most intermediate of high level operations, mainly when random data reorganisations in processor memories are involved. The aim of this thesis was to extend the SIMD computer capabilities to allow it to perform more efficiently at the image processing intermediate level. The study of some representative algorithms of this class, points out the limits of this computer. Nevertheless, these limits can be erased by architectural modifications. This leads us to propose SYMPATIX, a new SIMD parallel computer. To valid its new concept, a behavioural model written in VHDL - Hardware Description Language - has been elaborated. With this model, the new computer performances have been estimated running image processing algorithm simulations. VHDL modeling approach allows to perform the system top down electronic design giving an easy coupling between system architectural modifications and their electronic cost. The obtained results show SYMPATIX to be an efficient computer for low and intermediate level image processing. It can be connected to a high level computer, opening up the development of new computer vision applications. This thesis also presents, a top down design method, based on the VHDL, intended for electronic system architects. (author) [fr
Lithography for VLSI

CERN Document Server

Einspruch, Norman G

1987-01-01

VLSI Electronics Microstructure Science, Volume 16: Lithography for VLSI treats special topics from each branch of lithography, and also contains general discussion of some lithographic methods.This volume contains 8 chapters that discuss the various aspects of lithography. Chapters 1 and 2 are devoted to optical lithography. Chapter 3 covers electron lithography in general, and Chapter 4 discusses electron resist exposure modeling. Chapter 5 presents the fundamentals of ion-beam lithography. Mask/wafer alignment for x-ray proximity printing and for optical lithography is tackled in Chapter 6.
Median and Morphological Specialized Processors for a Real-Time Image Data Processing

Directory of Open Access Journals (Sweden)

Kazimierz Wiatr

2002-01-01

Full Text Available This paper presents the considerations on selecting a multiprocessor MISD architecture for fast implementation of the vision image processing. Using the authorÃ¢Â€Â²s earlier experience with real-time systems, implementing of specialized hardware processors based on the programmable FPGA systems has been proposed in the pipeline architecture. In particular, the following processors are presented: median filter and morphological processor. The structure of a universal reconfigurable processor developed has been proposed as well. Experimental results are presented as delays on LCA level implementation for median filter, morphological processor, convolution processor, look-up-table processor, logic processor and histogram processor. These times compare with delays in general purpose processor and DSP processor.
Parallel computation for distributed parameter system-from vector processors to Adena computer

Energy Technology Data Exchange (ETDEWEB)

Nogi, T

1983-04-01

Research on advanced parallel hardware and software architectures for very high-speed computation deserves and needs more support and attention to fulfil its promise. Novel architectures for parallel processing are being made ready. Architectures for parallel processing can be roughly divided into two groups. One is a vector processor in which a single central processing unit involves multiple vector-arithmetic registers. The other is a processor array in which slave processors are connected to a host processor to perform parallel computation. In this review, the concept and data structure of the Adena (alternating-direction edition nexus array) architecture, which is conformable to distributed-parameter simulation algorithms, are described. 5 references.
Motion-sensor fusion-based gesture recognition and its VLSI architecture design for mobile devices

Science.gov (United States)

Zhu, Wenping; Liu, Leibo; Yin, Shouyi; Hu, Siqi; Tang, Eugene Y.; Wei, Shaojun

2014-05-01

With the rapid proliferation of smartphones and tablets, various embedded sensors are incorporated into these platforms to enable multimodal human-computer interfaces. Gesture recognition, as an intuitive interaction approach, has been extensively explored in the mobile computing community. However, most gesture recognition implementations by now are all user-dependent and only rely on accelerometer. In order to achieve competitive accuracy, users are required to hold the devices in predefined manner during the operation. In this paper, a high-accuracy human gesture recognition system is proposed based on multiple motion sensor fusion. Furthermore, to reduce the energy overhead resulted from frequent sensor sampling and data processing, a high energy-efficient VLSI architecture implemented on a Xilinx Virtex-5 FPGA board is also proposed. Compared with the pure software implementation, approximately 45 times speed-up is achieved while operating at 20 MHz. The experiments show that the average accuracy for 10 gestures achieves 93.98% for user-independent case and 96.14% for user-dependent case when subjects hold the device randomly during completing the specified gestures. Although a few percent lower than the conventional best result, it still provides competitive accuracy acceptable for practical usage. Most importantly, the proposed system allows users to hold the device randomly during operating the predefined gestures, which substantially enhances the user experience.

Surface and interface effects in VLSI

CERN Document Server

Einspruch, Norman G

1985-01-01

VLSI Electronics Microstructure Science, Volume 10: Surface and Interface Effects in VLSI provides the advances made in the science of semiconductor surface and interface as they relate to electronics. This volume aims to provide a better understanding and control of surface and interface related properties. The book begins with an introductory chapter on the intimate link between interfaces and devices. The book is then divided into two parts. The first part covers the chemical and geometric structures of prototypical VLSI interfaces. Subjects detailed include, the technologically most import
Hardware trigger processor for the MDT system

CERN Document Server

AUTHOR|(SzGeCERN)757787; The ATLAS collaboration; Hazen, Eric; Butler, John; Black, Kevin; Gastler, Daniel Edward; Ntekas, Konstantinos; Taffard, Anyes; Martinez Outschoorn, Verena; Ishino, Masaya; Okumura, Yasuyuki

2017-01-01

We are developing a low-latency hardware trigger processor for the Monitored Drift Tube system in the Muon spectrometer. The processor will fit candidate Muon tracks in the drift tubes in real time, improving significantly the momentum resolution provided by the dedicated trigger chambers. We present a novel pure-FPGA implementation of a Legendre transform segment finder, an associative-memory alternative implementation, an ARM (Zynq) processor-based track fitter, and compact ATCA carrier board architecture. The ATCA architecture is designed to allow a modular, staged approach to deployment of the system and exploration of alternative technologies.
Design Principles for Synthesizable Processor Cores

DEFF Research Database (Denmark)

Schleuniger, Pascal; McKee, Sally A.; Karlsson, Sven

2012-01-01

As FPGAs get more competitive, synthesizable processor cores become an attractive choice for embedded computing. Currently popular commercial processor cores do not fully exploit current FPGA architectures. In this paper, we propose general design principles to increase instruction throughput...
Novel memory architecture for video signal processor

Science.gov (United States)

Hung, Jen-Sheng; Lin, Chia-Hsing; Jen, Chein-Wei

1993-11-01

An on-chip memory architecture for video signal processor (VSP) is proposed. This memory structure is a two-level design for the different data locality in video applications. The upper level--Memory A provides enough storage capacity to reduce the impact on the limitation of chip I/O bandwidth, and the lower level--Memory B provides enough data parallelism and flexibility to meet the requirements of multiple reconfigurable pipeline function units in a single VSP chip. The needed memory size is decided by the memory usage analysis for video algorithms and the number of function units. Both levels of memory adopted a dual-port memory scheme to sustain the simultaneous read and write operations. Especially, Memory B uses multiple one-read-one-write memory banks to emulate the real multiport memory. Therefore, one can change the configuration of Memory B to several sets of memories with variable read/write ports by adjusting the bus switches. Then the numbers of read ports and write ports in proposed memory can meet requirement of data flow patterns in different video coding algorithms. We have finished the design of a prototype memory design using 1.2- micrometers SPDM SRAM technology and will fabricated it through TSMC, in Taiwan.
Time-Predictable Computer Architecture

Directory of Open Access Journals (Sweden)

Schoeberl Martin

2009-01-01

Full Text Available Today's general-purpose processors are optimized for maximum throughput. Real-time systems need a processor with both a reasonable and a known worst-case execution time (WCET. Features such as pipelines with instruction dependencies, caches, branch prediction, and out-of-order execution complicate WCET analysis and lead to very conservative estimates. In this paper, we evaluate the issues of current architectures with respect to WCET analysis. Then, we propose solutions for a time-predictable computer architecture. The proposed architecture is evaluated with implementation of some features in a Java processor. The resulting processor is a good target for WCET analysis and still performs well in the average case.
Rapid prototyping and evaluation of programmable SIMD SDR processors in LISA

Science.gov (United States)

Chen, Ting; Liu, Hengzhu; Zhang, Botao; Liu, Dongpei

2013-03-01

With the development of international wireless communication standards, there is an increase in computational requirement for baseband signal processors. Time-to-market pressure makes it impossible to completely redesign new processors for the evolving standards. Due to its high flexibility and low power, software defined radio (SDR) digital signal processors have been proposed as promising technology to replace traditional ASIC and FPGA fashions. In addition, there are large numbers of parallel data processed in computation-intensive functions, which fosters the development of single instruction multiple data (SIMD) architecture in SDR platform. So a new way must be found to prototype the SDR processors efficiently. In this paper we present a bit-and-cycle accurate model of programmable SIMD SDR processors in a machine description language LISA. LISA is a language for instruction set architecture which can gain rapid model at architectural level. In order to evaluate the availability of our proposed processor, three common baseband functions, FFT, FIR digital filter and matrix multiplication have been mapped on the SDR platform. Analytical results showed that the SDR processor achieved the maximum of 47.1% performance boost relative to the opponent processor.
Discussion paper for a highly parallel array processor-based machine

International Nuclear Information System (INIS)

Hagstrom, R.; Bolotin, G.; Dawson, J.

1984-01-01

The architectural plant for a quickly realizable implementation of a highly parallel special-purpose computer system with peak performance in the range of 6 billion floating point operations per second is discussed. The architecture is suitable to Lattice Gauge theoretical computations of fundamental physics interest and may be applicable to a range of other problems which deal with numerically intensive computational problems. The plan is quickly realizable because it employs a maximum of commercially available hardware subsystems and because the architecture is software-transparent to the individual processors, allowing straightforward re-use of whatever commercially available operating-systems and support software that is suitable to run on the commercially-produced processors. A tiny prototype instrument, designed along this architecture has already operated. A few elementary examples of programs which can run efficiently are presented. The large machine which the authors would propose to build would be based upon a highly competent array-processor, the ST-100 Array Processor, and specific design possibilities are discussed. The first step toward realizing this plan practically is to install a single ST-100 to allow algorithm development to proceed while a demonstration unit is built using two of the ST-100 Array Processors
Lipsi: Probably the Smallest Processor in the World

DEFF Research Database (Denmark)

Schoeberl, Martin

2018-01-01

While research on high-performance processors is important, it is also interesting to explore processor architectures at the other end of the spectrum: tiny processor cores for auxiliary functions. While it is common to implement small circuits for such functions, such as a serial port, in dedica...... at a minimal cost....
High-throughput sample adaptive offset hardware architecture for high-efficiency video coding

Science.gov (United States)

Zhou, Wei; Yan, Chang; Zhang, Jingzhi; Zhou, Xin

2018-03-01

A high-throughput hardware architecture for a sample adaptive offset (SAO) filter in the high-efficiency video coding video coding standard is presented. First, an implementation-friendly and simplified bitrate estimation method of rate-distortion cost calculation is proposed to reduce the computational complexity in the mode decision of SAO. Then, a high-throughput VLSI architecture for SAO is presented based on the proposed bitrate estimation method. Furthermore, multiparallel VLSI architecture for in-loop filters, which integrates both deblocking filter and SAO filter, is proposed. Six parallel strategies are applied in the proposed in-loop filters architecture to improve the system throughput and filtering speed. Experimental results show that the proposed in-loop filters architecture can achieve up to 48% higher throughput in comparison with prior work. The proposed architecture can reach a high-operating clock frequency of 297 MHz with TSMC 65-nm library and meet the real-time requirement of the in-loop filters for 8 K × 4 K video format at 132 fps.
An analog VLSI real time optical character recognition system based on a neural architecture

International Nuclear Information System (INIS)

Bo, G.; Caviglia, D.; Valle, M.

1999-01-01

In this paper a real time Optical Character Recognition system is presented: it is based on a feature extraction module and a neural network classifier which have been designed and fabricated in analog VLSI technology. Experimental results validate the circuit functionality. The results obtained from a validation based on a mixed approach (i.e., an approach based on both experimental and simulation results) confirm the soundness and reliability of the system
An analog VLSI real time optical character recognition system based on a neural architecture

Energy Technology Data Exchange (ETDEWEB)

Bo, G.; Caviglia, D.; Valle, M. [Genoa Univ. (Italy). Dip. of Biophysical and Electronic Engineering

1999-03-01

In this paper a real time Optical Character Recognition system is presented: it is based on a feature extraction module and a neural network classifier which have been designed and fabricated in analog VLSI technology. Experimental results validate the circuit functionality. The results obtained from a validation based on a mixed approach (i.e., an approach based on both experimental and simulation results) confirm the soundness and reliability of the system.
Functional Verification of Enhanced RISC Processor

OpenAIRE

SHANKER NILANGI; SOWMYA L

2013-01-01

This paper presents design and verification of a 32-bit enhanced RISC processor core having floating point computations integrated within the core, has been designed to reduce the cost and complexity. The designed 3 stage pipelined 32-bit RISC processor is based on the ARM7 processor architecture with single precision floating point multiplier, floating point adder/subtractor for floating point operations and 32 x 32 booths multiplier added to the integer core of ARM7. The binary representati...
Electro-optic techniques for VLSI interconnect

Science.gov (United States)

Neff, J. A.

1985-03-01

A major limitation to achieving significant speed increases in very large scale integration (VLSI) lies in the metallic interconnects. They are costly not only from the charge transport standpoint but also from capacitive loading effects. The Defense Advanced Research Projects Agency, in pursuit of the fifth generation supercomputer, is investigating alternatives to the VLSI metallic interconnects, especially the use of optical techniques to transport the information either inter or intrachip. As the on chip performance of VLSI continues to improve via the scale down of the logic elements, the problems associated with transferring data off and onto the chip become more severe. The use of optical carriers to transfer the information within the computer is very appealing from several viewpoints. Besides the potential for gigabit propagation rates, the conversion from electronics to optics conveniently provides a decoupling of the various circuits from one another. Significant gains will also be realized in reducing cross talk between the metallic routings, and the interconnects need no longer be constrained to the plane of a thin film on the VLSI chip. In addition, optics can offer an increased programming flexibility for restructuring the interconnect network.
High-speed packet filtering utilizing stream processors

Science.gov (United States)

Hummel, Richard J.; Fulp, Errin W.

2009-04-01

Parallel firewalls offer a scalable architecture for the next generation of high-speed networks. While these parallel systems can be implemented using multiple firewalls, the latest generation of stream processors can provide similar benefits with a significantly reduced latency due to locality. This paper describes how the Cell Broadband Engine (CBE), a popular stream processor, can be used as a high-speed packet filter. Results show the CBE can potentially process packets arriving at a rate of 1 Gbps with a latency less than 82 μ-seconds. Performance depends on how well the packet filtering process is translated to the unique stream processor architecture. For example the method used for transmitting data and control messages among the pseudo-independent processor cores has a significant impact on performance. Experimental results will also show the current limitations of a CBE operating system when used to process packets. Possible solutions to these issues will be discussed.
Invasive tightly coupled processor arrays

CERN Document Server

LARI, VAHID

2016-01-01

This book introduces new massively parallel computer (MPSoC) architectures called invasive tightly coupled processor arrays. It proposes strategies, architecture designs, and programming interfaces for invasive TCPAs that allow invading and subsequently executing loop programs with strict requirements or guarantees of non-functional execution qualities such as performance, power consumption, and reliability. For the first time, such a configurable processor array architecture consisting of locally interconnected VLIW processing elements can be claimed by programs, either in full or in part, using the principle of invasive computing. Invasive TCPAs provide unprecedented energy efficiency for the parallel execution of nested loop programs by avoiding any global memory access such as GPUs and may even support loops with complex dependencies such as loop-carried dependencies that are not amenable to parallel execution on GPUs. For this purpose, the book proposes different invasion strategies for claiming a desire...
APRON: A Cellular Processor Array Simulation and Hardware Design Tool

Science.gov (United States)

Barr, David R. W.; Dudek, Piotr

2009-12-01

We present a software environment for the efficient simulation of cellular processor arrays (CPAs). This software (APRON) is used to explore algorithms that are designed for massively parallel fine-grained processor arrays, topographic multilayer neural networks, vision chips with SIMD processor arrays, and related architectures. The software uses a highly optimised core combined with a flexible compiler to provide the user with tools for the design of new processor array hardware architectures and the emulation of existing devices. We present performance benchmarks for the software processor array implemented on standard commodity microprocessors. APRON can be configured to use additional processing hardware if necessary and can be used as a complete graphical user interface and development environment for new or existing CPA systems, allowing more users to develop algorithms for CPA systems.
APRON: A Cellular Processor Array Simulation and Hardware Design Tool

Directory of Open Access Journals (Sweden)

David R. W. Barr

2009-01-01

Full Text Available We present a software environment for the efficient simulation of cellular processor arrays (CPAs. This software (APRON is used to explore algorithms that are designed for massively parallel fine-grained processor arrays, topographic multilayer neural networks, vision chips with SIMD processor arrays, and related architectures. The software uses a highly optimised core combined with a flexible compiler to provide the user with tools for the design of new processor array hardware architectures and the emulation of existing devices. We present performance benchmarks for the software processor array implemented on standard commodity microprocessors. APRON can be configured to use additional processing hardware if necessary and can be used as a complete graphical user interface and development environment for new or existing CPA systems, allowing more users to develop algorithms for CPA systems.
Multi-valued LSI/VLSI logic design

Science.gov (United States)

Santrakul, K.

A procedure for synthesizing any large complex logic system, such as LSI and VLSI integrated circuits is described. This scheme uses Multi-Valued Multi-plexers (MVMUX) as the basic building blocks and the tree as the structure of the circuit realization. Simple built-in test circuits included in the network (the main current), provide a thorough functional checking of the network at any time. In brief, four major contributions are made: (1) multi-valued Algorithmic State Machine (ASM) chart for describing an LSI/VLSI behavior; (2) a tree-structured multi-valued multiplexer network which can be obtained directly from an ASM chart; (3) a heuristic tree-structured synthesis method for realizing any combinational logic with minimal or nearly-minimal MVMUX; and (4) a hierarchical design of LSI/VLSI with built-in parallel testing capability.
Technology computer aided design simulation for VLSI MOSFET

CERN Document Server

Sarkar, Chandan Kumar

2013-01-01

Responding to recent developments and a growing VLSI circuit manufacturing market, Technology Computer Aided Design: Simulation for VLSI MOSFET examines advanced MOSFET processes and devices through TCAD numerical simulations. The book provides a balanced summary of TCAD and MOSFET basic concepts, equations, physics, and new technologies related to TCAD and MOSFET. A firm grasp of these concepts allows for the design of better models, thus streamlining the design process, saving time and money. This book places emphasis on the importance of modeling and simulations of VLSI MOS transistors and
Token-Aware Completion Functions for Elastic Processor Verification

Directory of Open Access Journals (Sweden)

Sudarshan K. Srinivasan

2009-01-01

Full Text Available We develop a formal verification procedure to check that elastic pipelined processor designs correctly implement their instruction set architecture (ISA specifications. The notion of correctness we use is based on refinement. Refinement proofs are based on refinement maps, which—in the context of this problem—are functions that map elastic processor states to states of the ISA specification model. Data flow in elastic architectures is complicated by the insertion of any number of buffers in any place in the design, making it hard to construct refinement maps for elastic systems in a systematic manner. We introduce token-aware completion functions, which incorporate a mechanism to track the flow of data in elastic pipelines, as a highly automated and systematic approach to construct refinement maps. We demonstrate the efficiency of the overall verification procedure based on token-aware completion functions using six elastic pipelined processor models based on the DLX architecture.

Application of Advanced Multi-Core Processor Technologies to Oceanographic Research

Science.gov (United States)

2013-09-30

1 DISTRIBUTION STATEMENT A. Approved for public release; distribution is unlimited. Application of Advanced Multi-Core Processor Technologies...STM32 NXP LPC series No Proprietary Microchip PIC32/DSPIC No > 500 mW; < 5 W ARM Cortex TI OMAP TI Sitara Broadcom BCM2835 Varies FPGA...state-of-the-art information processing architectures. OBJECTIVES Next-generation processor architectures (multi-core, multi-threaded) hold the
Low power signal processing research at Stanford

Science.gov (United States)

Burr, J.; Williamson, P. R.; Peterson, A.

1991-01-01

This paper gives an overview of the research being conducted at Stanford University's Space, Telecommunications, and Radioscience Laboratory in the area of low energy computation. It discusses the work we are doing in large scale digital VLSI neural networks, interleaved processor and pipelined memory architectures, energy estimation and optimization, multichip module packaging, and low voltage digital logic.
A Scalable Multicore Architecture With Heterogeneous Memory Structures for Dynamic Neuromorphic Asynchronous Processors (DYNAPs).

Science.gov (United States)

Moradi, Saber; Qiao, Ning; Stefanini, Fabio; Indiveri, Giacomo

2018-02-01

Neuromorphic computing systems comprise networks of neurons that use asynchronous events for both computation and communication. This type of representation offers several advantages in terms of bandwidth and power consumption in neuromorphic electronic systems. However, managing the traffic of asynchronous events in large scale systems is a daunting task, both in terms of circuit complexity and memory requirements. Here, we present a novel routing methodology that employs both hierarchical and mesh routing strategies and combines heterogeneous memory structures for minimizing both memory requirements and latency, while maximizing programming flexibility to support a wide range of event-based neural network architectures, through parameter configuration. We validated the proposed scheme in a prototype multicore neuromorphic processor chip that employs hybrid analog/digital circuits for emulating synapse and neuron dynamics together with asynchronous digital circuits for managing the address-event traffic. We present a theoretical analysis of the proposed connectivity scheme, describe the methods and circuits used to implement such scheme, and characterize the prototype chip. Finally, we demonstrate the use of the neuromorphic processor with a convolutional neural network for the real-time classification of visual symbols being flashed to a dynamic vision sensor (DVS) at high speed.
Design and evaluation of an architecture for a digital signal processor for instrumentation applications

Science.gov (United States)

Fellman, Ronald D.; Kaneshiro, Ronald T.; Konstantinides, Konstantinos

1990-03-01

The authors present the design and evaluation of an architecture for a monolithic, programmable, floating-point digital signal processor (DSP) for instrumentation applications. An investigation of the most commonly used algorithms in instrumentation led to a design that satisfies the requirements for high computational and I/O (input/output) throughput. In the arithmetic unit, a 16- x 16-bit multiplier and a 32-bit accumulator provide the capability for single-cycle multiply/accumulate operations, and three format adjusters automatically adjust the data format for increased accuracy and dynamic range. An on-chip I/O unit is capable of handling data block transfers through a direct memory access port and real-time data streams through a pair of parallel I/O ports. I/O operations and program execution are performed in parallel. In addition, the processor includes two data memories with independent addressing units, a microsequencer with instruction RAM, and multiplexers for internal data redirection. The authors also present the structure and implementation of a design environment suitable for the algorithmic, behavioral, and timing simulation of a complete DSP system. Various benchmarking results are reported.
Hardware Synchronization for Embedded Multi-Core Processors

DEFF Research Database (Denmark)

Stoif, Christian; Schoeberl, Martin; Liccardi, Benito

2011-01-01

Multi-core processors are about to conquer embedded systems — it is not the question of whether they are coming but how the architectures of the microcontrollers should look with respect to the strict requirements in the field. We present the step from one to multiple cores in this paper, establi......Multi-core processors are about to conquer embedded systems — it is not the question of whether they are coming but how the architectures of the microcontrollers should look with respect to the strict requirements in the field. We present the step from one to multiple cores in this paper...
Many - body simulations using an array processor

International Nuclear Information System (INIS)

Rapaport, D.C.

1985-01-01

Simulations of microscopic models of water and polypeptides using molecular dynamics and Monte Carlo techniques have been carried out with the aid of an FPS array processor. The computational techniques are discussed, with emphasis on the development and optimization of the software to take account of the special features of the processor. The computing requirements of these simulations exceed what could be reasonably carried out on a normal 'scientific' computer. While the FPS processor is highly suited to the kinds of models described, several other computationally intensive problems in statistical mechanics are outlined for which alternative processor architectures are more appropriate
A high-accuracy optical linear algebra processor for finite element applications

Science.gov (United States)

Casasent, D.; Taylor, B. K.

1984-01-01

Optical linear processors are computationally efficient computers for solving matrix-matrix and matrix-vector oriented problems. Optical system errors limit their dynamic range to 30-40 dB, which limits their accuray to 9-12 bits. Large problems, such as the finite element problem in structural mechanics (with tens or hundreds of thousands of variables) which can exploit the speed of optical processors, require the 32 bit accuracy obtainable from digital machines. To obtain this required 32 bit accuracy with an optical processor, the data can be digitally encoded, thereby reducing the dynamic range requirements of the optical system (i.e., decreasing the effect of optical errors on the data) while providing increased accuracy. This report describes a new digitally encoded optical linear algebra processor architecture for solving finite element and banded matrix-vector problems. A linear static plate bending case study is described which quantities the processor requirements. Multiplication by digital convolution is explained, and the digitally encoded optical processor architecture is advanced.
Techniques for Computing the DFT Using the Residue Fermat Number Systems and VLSI

Science.gov (United States)

Truong, T. K.; Chang, J. J.; Hsu, I. S.; Pei, D. Y.; Reed, I. S.

1985-01-01

The integer complex multiplier and adder over the direct sum of two copies of a finite field is specialized to the direct sum of the rings of integers modulo Fermat numbers. Such multiplications and additions can be used in the implementation of a discrete Fourier transform (DFT) of a sequence of complex numbers. The advantage of the present approach is that the number of multiplications needed for the DFT can be reduced substantially over the previous approach. The architectural designs using this approach are regular, simple, expandable and, therefore, naturally suitable for VLSI implementation.
Integrated optical circuits for numerical computation

Science.gov (United States)

Verber, C. M.; Kenan, R. P.

1983-01-01

The development of integrated optical circuits (IOC) for numerical-computation applications is reviewed, with a focus on the use of systolic architectures. The basic architecture criteria for optical processors are shown to be the same as those proposed by Kung (1982) for VLSI design, and the advantages of IOCs over bulk techniques are indicated. The operation and fabrication of electrooptic grating structures are outlined, and the application of IOCs of this type to an existing 32-bit, 32-Mbit/sec digital correlator, a proposed matrix multiplier, and a proposed pipeline processor for polynomial evaluation is discussed. The problems arising from the inherent nonlinearity of electrooptic gratings are considered. Diagrams and drawings of the application concepts are provided.
Accuracy Limitations in Optical Linear Algebra Processors

Science.gov (United States)

Batsell, Stephen Gordon

1990-01-01

One of the limiting factors in applying optical linear algebra processors (OLAPs) to real-world problems has been the poor achievable accuracy of these processors. Little previous research has been done on determining noise sources from a systems perspective which would include noise generated in the multiplication and addition operations, noise from spatial variations across arrays, and from crosstalk. In this dissertation, we propose a second-order statistical model for an OLAP which incorporates all these system noise sources. We now apply this knowledge to determining upper and lower bounds on the achievable accuracy. This is accomplished by first translating the standard definition of accuracy used in electronic digital processors to analog optical processors. We then employ our second-order statistical model. Having determined a general accuracy equation, we consider limiting cases such as for ideal and noisy components. From the ideal case, we find the fundamental limitations on improving analog processor accuracy. From the noisy case, we determine the practical limitations based on both device and system noise sources. These bounds allow system trade-offs to be made both in the choice of architecture and in individual components in such a way as to maximize the accuracy of the processor. Finally, by determining the fundamental limitations, we show the system engineer when the accuracy desired can be achieved from hardware or architecture improvements and when it must come from signal pre-processing and/or post-processing techniques.
Algorithms, architectures and information systems security

CERN Document Server

Sur-Kolay, Susmita; Nandy, Subhas C; Bagchi, Aditya

2008-01-01

This volume contains articles written by leading researchers in the fields of algorithms, architectures, and information systems security. The first five chapters address several challenging geometric problems and related algorithms. These topics have major applications in pattern recognition, image analysis, digital geometry, surface reconstruction, computer vision and in robotics. The next five chapters focus on various optimization issues in VLSI design and test architectures, and in wireless networks. The last six chapters comprise scholarly articles on information systems security coverin
Compact MOSFET models for VLSI design

CERN Document Server

Bhattacharyya, A B

2009-01-01

Practicing designers, students, and educators in the semiconductor field face an ever expanding portfolio of MOSFET models. In Compact MOSFET Models for VLSI Design , A.B. Bhattacharyya presents a unified perspective on the topic, allowing the practitioner to view and interpret device phenomena concurrently using different modeling strategies. Readers will learn to link device physics with model parameters, helping to close the gap between device understanding and its use for optimal circuit performance. Bhattacharyya also lays bare the core physical concepts that will drive the future of VLSI.
Optical Array Processor: Laboratory Results

Science.gov (United States)

Casasent, David; Jackson, James; Vaerewyck, Gerard

1987-01-01

A Space Integrating (SI) Optical Linear Algebra Processor (OLAP) is described and laboratory results on its performance in several practical engineering problems are presented. The applications include its use in the solution of a nonlinear matrix equation for optimal control and a parabolic Partial Differential Equation (PDE), the transient diffusion equation with two spatial variables. Frequency-multiplexed, analog and high accuracy non-base-two data encoding are used and discussed. A multi-processor OLAP architecture is described and partitioning and data flow issues are addressed.
Code compression for VLIW embedded processors

Science.gov (United States)

Piccinelli, Emiliano; Sannino, Roberto

2004-04-01

The implementation of processors for embedded systems implies various issues: main constraints are cost, power dissipation and die area. On the other side, new terminals perform functions that require more computational flexibility and effort. Long code streams must be loaded into memories, which are expensive and power consuming, to run on DSPs or CPUs. To overcome this issue, the "SlimCode" proprietary algorithm presented in this paper (patent pending technology) can reduce the dimensions of the program memory. It can run offline and work directly on the binary code the compiler generates, by compressing it and creating a new binary file, about 40% smaller than the original one, to be loaded into the program memory of the processor. The decompression unit will be a small ASIC, placed between the Memory Controller and the System bus of the processor, keeping unchanged the internal CPU architecture: this implies that the methodology is completely transparent to the core. We present comparisons versus the state-of-the-art IBM Codepack algorithm, along with its architectural implementation into the ST200 VLIW family core.
Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures

Science.gov (United States)

Manolakos, Elias S.

2015-01-01

Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub. PMID:26605332
Efficient Multicriteria Protein Structure Comparison on Modern Processor Architectures.

Science.gov (United States)

Sharma, Anuj; Manolakos, Elias S

2015-01-01

Fast increasing computational demand for all-to-all protein structures comparison (PSC) is a result of three confounding factors: rapidly expanding structural proteomics databases, high computational complexity of pairwise protein comparison algorithms, and the trend in the domain towards using multiple criteria for protein structures comparison (MCPSC) and combining results. We have developed a software framework that exploits many-core and multicore CPUs to implement efficient parallel MCPSC in modern processors based on three popular PSC methods, namely, TMalign, CE, and USM. We evaluate and compare the performance and efficiency of the two parallel MCPSC implementations using Intel's experimental many-core Single-Chip Cloud Computer (SCC) as well as Intel's Core i7 multicore processor. We show that the 48-core SCC is more efficient than the latest generation Core i7, achieving a speedup factor of 42 (efficiency of 0.9), making many-core processors an exciting emerging technology for large-scale structural proteomics. We compare and contrast the performance of the two processors on several datasets and also show that MCPSC outperforms its component methods in grouping related domains, achieving a high F-measure of 0.91 on the benchmark CK34 dataset. The software implementation for protein structure comparison using the three methods and combined MCPSC, along with the developed underlying rckskel algorithmic skeletons library, is available via GitHub.
Macrocell Builder: IP-Block-Based Design Environment for High-Throughput VLSI Dedicated Digital Signal Processing Systems

Directory of Open Access Journals (Sweden)

Urard Pascal

2006-01-01

Full Text Available We propose an efficient IP-block-based design environment for high-throughput VLSI systems. The flow generates SystemC register-transfer-level (RTL architecture, starting from a Matlab functional model described as a netlist of functional IP. The refinement model inserts automatically control structures to manage delays induced by the use of RTL IPs. It also inserts a control structure to coordinate the execution of parallel clocked IP. The delays may be managed by registers or by counters included in the control structure. The flow has been used successfully in three real-world DSP systems. The experimentations show that the approach can produce efficient RTL architecture and allows to save huge amount of time.
Artificial immune system algorithm in VLSI circuit configuration

Science.gov (United States)

Mansor, Mohd. Asyraf; Sathasivam, Saratha; Kasihmuddin, Mohd Shareduwan Mohd

2017-08-01

In artificial intelligence, the artificial immune system is a robust bio-inspired heuristic method, extensively used in solving many constraint optimization problems, anomaly detection, and pattern recognition. This paper discusses the implementation and performance of artificial immune system (AIS) algorithm integrated with Hopfield neural networks for VLSI circuit configuration based on 3-Satisfiability problems. Specifically, we emphasized on the clonal selection technique in our binary artificial immune system algorithm. We restrict our logic construction to 3-Satisfiability (3-SAT) clauses in order to outfit with the transistor configuration in VLSI circuit. The core impetus of this research is to find an ideal hybrid model to assist in the VLSI circuit configuration. In this paper, we compared the artificial immune system (AIS) algorithm (HNN-3SATAIS) with the brute force algorithm incorporated with Hopfield neural network (HNN-3SATBF). Microsoft Visual C++ 2013 was used as a platform for training, simulating and validating the performances of the proposed network. The results depict that the HNN-3SATAIS outperformed HNN-3SATBF in terms of circuit accuracy and CPU time. Thus, HNN-3SATAIS can be used to detect an early error in the VLSI circuit design.
Dual-scale topology optoelectronic processor.

Science.gov (United States)

Marsden, G C; Krishnamoorthy, A V; Esener, S C; Lee, S H

1991-12-15

The dual-scale topology optoelectronic processor (D-STOP) is a parallel optoelectronic architecture for matrix algebraic processing. The architecture can be used for matrix-vector multiplication and two types of vector outer product. The computations are performed electronically, which allows multiplication and summation concepts in linear algebra to be generalized to various nonlinear or symbolic operations. This generalization permits the application of D-STOP to many computational problems. The architecture uses a minimum number of optical transmitters, which thereby reduces fabrication requirements while maintaining area-efficient electronics. The necessary optical interconnections are space invariant, minimizing space-bandwidth requirements.
Raexplore: Enabling Rapid, Automated Architecture Exploration for Full Applications

Energy Technology Data Exchange (ETDEWEB)

Zhang, Yao [Argonne National Lab. (ANL), Argonne, IL (United States); Balaprakash, Prasanna [Argonne National Lab. (ANL), Argonne, IL (United States); Meng, Jiayuan [Argonne National Lab. (ANL), Argonne, IL (United States); Morozov, Vitali [Argonne National Lab. (ANL), Argonne, IL (United States); Parker, Scott [Argonne National Lab. (ANL), Argonne, IL (United States); Kumaran, Kalyan [Argonne National Lab. (ANL), Argonne, IL (United States)

2014-12-01

We present Raexplore, a performance modeling framework for architecture exploration. Raexplore enables rapid, automated, and systematic search of architecture design space by combining hardware counter-based performance characterization and analytical performance modeling. We demonstrate Raexplore for two recent manycore processors IBM Blue- Gene/Q compute chip and Intel Xeon Phi, targeting a set of scientific applications. Our framework is able to capture complex interactions between architectural components including instruction pipeline, cache, and memory, and to achieve a 3–22% error for same-architecture and cross-architecture performance predictions. Furthermore, we apply our framework to assess the two processors, and discover and evaluate a list of architectural scaling options for future processor designs.

Computer Architecture A Quantitative Approach

CERN Document Server

Hennessy, John L

2007-01-01

The era of seemingly unlimited growth in processor performance is over: single chip architectures can no longer overcome the performance limitations imposed by the power they consume and the heat they generate. Today, Intel and other semiconductor firms are abandoning the single fast processor model in favor of multi-core microprocessors--chips that combine two or more processors in a single package. In the fourth edition of Computer Architecture, the authors focus on this historic shift, increasing their coverage of multiprocessors and exploring the most effective ways of achieving parallelis
An Asynchronous Low Power and High Performance VLSI Architecture for Viterbi Decoder Implemented with Quasi Delay Insensitive Templates

Directory of Open Access Journals (Sweden)

T. Kalavathi Devi

2015-01-01

Full Text Available Convolutional codes are comprehensively used as Forward Error Correction (FEC codes in digital communication systems. For decoding of convolutional codes at the receiver end, Viterbi decoder is often used to have high priority. This decoder meets the demand of high speed and low power. At present, the design of a competent system in Very Large Scale Integration (VLSI technology requires these VLSI parameters to be finely defined. The proposed asynchronous method focuses on reducing the power consumption of Viterbi decoder for various constraint lengths using asynchronous modules. The asynchronous designs are based on commonly used Quasi Delay Insensitive (QDI templates, namely, Precharge Half Buffer (PCHB and Weak Conditioned Half Buffer (WCHB. The functionality of the proposed asynchronous design is simulated and verified using Tanner Spice (TSPICE in 0.25 µm, 65 nm, and 180 nm technologies of Taiwan Semiconductor Manufacture Company (TSMC. The simulation result illustrates that the asynchronous design techniques have 25.21% of power reduction compared to synchronous design and work at a speed of 475 MHz.
Soft-core dataflow processor architecture optimised for radar signal processing: Article

CSIR Research Space (South Africa)

Broich, R

2014-10-01

Full Text Available Current radar signal processors lack either performance or flexibility. Custom soft-core processors exhibit potential in high-performance signal processing applications, yet remain relatively unexplored in research literature. In this paper, we use...
Spike Neuromorphic VLSI-Based Bat Echolocation for Micro-Aerial Vehicle Guidance

National Research Council Canada - National Science Library

Horiuchi, Timothy K; Krishnaprasad, P. S

2007-01-01

.... This includes multiple efforts related to a VLSI-based echolocation system being developed in one of our laboratories from algorithm development, bat flight data analysis, to VLSI circuit design...
Real-time FPGA architectures for computer vision

Science.gov (United States)

Arias-Estrada, Miguel; Torres-Huitzil, Cesar

2000-03-01

This paper presents an architecture for real-time generic convolution of a mask and an image. The architecture is intended for fast low level image processing. The FPGA-based architecture takes advantage of the availability of registers in FPGAs to implement an efficient and compact module to process the convolutions. The architecture is designed to minimize the number of accesses to the image memory and is based on parallel modules with internal pipeline operation in order to improve its performance. The architecture is prototyped in a FPGA, but it can be implemented on a dedicated VLSI to reach higher clock frequencies. Complexity issues, FPGA resources utilization, FPGA limitations, and real time performance are discussed. Some results are presented and discussed.
Three-dimensional design methodologies for tree-based FPGA architecture

CERN Document Server

Pangracious, Vinod; Mehrez, Habib

2015-01-01

This book focuses on the development of 3D design and implementation methodologies for Tree-based FPGA architecture. It also stresses the needs for new and augmented 3D CAD tools to support designs such as, the design for 3D, to manufacture high performance 3D integrated circuits and reconfigurable FPGA-based systems. This book was written as a text that covers the foundations of 3D integrated system design and FPGA architecture design. It was written for the use in an elective or core course at the graduate level in field of Electrical Engineering, Computer Engineering and Doctoral Research programs. No previous background on 3D integration is required, nevertheless fundamental understanding of 2D CMOS VLSI design is required. It is assumed that reader has taken the core curriculum in Electrical Engineering or Computer Engineering, with courses like CMOS VLSI design, Digital System Design and Microelectronics Circuits being the most important. It is accessible for self-study by both senior students and profe...
Practical, redundant, failure-tolerant, self-reconfiguring embedded system architecture

Science.gov (United States)

Klarer, Paul R.; Hayward, David R.; Amai, Wendy A.

2006-10-03

This invention relates to system architectures, specifically failure-tolerant and self-reconfiguring embedded system architectures. The invention provides both a method and architecture for redundancy. There can be redundancy in both software and hardware for multiple levels of redundancy. The invention provides a self-reconfiguring architecture for activating redundant modules whenever other modules fail. The architecture comprises: a communication backbone connected to two or more processors and software modules running on each of the processors. Each software module runs on one processor and resides on one or more of the other processors to be available as a backup module in the event of failure. Each module and backup module reports its status over the communication backbone. If a primary module does not report, its backup module takes over its function. If the primary module becomes available again, the backup module returns to its backup status.
Programmable level-1 trigger with 3D-Flow processor array

International Nuclear Information System (INIS)

Crosetto, D.

1994-01-01

The 3D-Flow parallel processing system is a new concept in processor architecture, system architecture, and assembly architecture. Compared to the electronics used in present systems, this approach reduces the cost and complexity of the hardware and allows easy assembly, disassembly, incremental upgrading, and maintenance of different interconnection topologies. The 3D-Flow parallel-processing system benefits high energy physics (HEP) by allowing: (1) common less costly hardware to be used in different experiments. (2) new uses of existing installations. (3) tuning of trigger based on the first analyzed data, and (4) selection of desired events directly from raw data. The goal of this parallel-processing architecture is to acquire multiple data in parallel (up to 100 million frames per second) and to process them at high speed, accomplishing digital filtering on the input data, pattern recognition (particle identification), data moving, and data formatting. The main features of the system are its programmability, scalability, high-speed communication, and low cost. The compactness of the 3D-Flow parallel-processing system in concert with the processor architecture allows processor interconnections to be mapped into the geometry of sensors (detectors in HEP) without large interconnection signal delay, enabling real-time pattern recognition. The overall 3D-Flow project has passed a major design review at Fermilab (Reviewers included experts in computers, triggering, system assembly, and electronics)
Real time processor for array speckle interferometry

Science.gov (United States)

Chin, Gordon; Florez, Jose; Borelli, Renan; Fong, Wai; Miko, Joseph; Trujillo, Carlos

1989-02-01

The authors are constructing a real-time processor to acquire image frames, perform array flat-fielding, execute a 64 x 64 element two-dimensional complex FFT (fast Fourier transform) and average the power spectrum, all within the 25 ms coherence time for speckles at near-IR (infrared) wavelength. The processor will be a compact unit controlled by a PC with real-time display and data storage capability. This will provide the ability to optimize observations and obtain results on the telescope rather than waiting several weeks before the data can be analyzed and viewed with offline methods. The image acquisition and processing, design criteria, and processor architecture are described.
Nano lasers in photonic VLSI

NARCIS (Netherlands)

Hill, M.T.; Oei, Y.S.; Smit, M.K.

2007-01-01

We examine the use of micro and nano lasers to form digital photonic VLSI building blocks. Problems such as isolation and cascading of building blocks are addressed, and the potential of future nano lasers explored.
Accelerating molecular dynamic simulation on the cell processor and Playstation 3.

Science.gov (United States)

Luttmann, Edgar; Ensign, Daniel L; Vaidyanathan, Vishal; Houston, Mike; Rimon, Noam; Øland, Jeppe; Jayachandran, Guha; Friedrichs, Mark; Pande, Vijay S

2009-01-30

Implementation of molecular dynamics (MD) calculations on novel architectures will vastly increase its power to calculate the physical properties of complex systems. Herein, we detail algorithmic advances developed to accelerate MD simulations on the Cell processor, a commodity processor found in PlayStation 3 (PS3). In particular, we discuss issues regarding memory access versus computation and the types of calculations which are best suited for streaming processors such as the Cell, focusing on implicit solvation models. We conclude with a comparison of improved performance on the PS3's Cell processor over more traditional processors. (c) 2008 Wiley Periodicals, Inc.
Software-defined reconfigurable microwave photonics processor.

Science.gov (United States)

Pérez, Daniel; Gasulla, Ivana; Capmany, José

2015-06-01

We propose, for the first time to our knowledge, a software-defined reconfigurable microwave photonics signal processor architecture that can be integrated on a chip and is capable of performing all the main functionalities by suitable programming of its control signals. The basic configuration is presented and a thorough end-to-end design model derived that accounts for the performance of the overall processor taking into consideration the impact and interdependencies of both its photonic and RF parts. We demonstrate the model versatility by applying it to several relevant application examples.
Monte Carlo simulations on SIMD computer architectures

International Nuclear Information System (INIS)

Burmester, C.P.; Gronsky, R.; Wille, L.T.

1992-01-01

In this paper algorithmic considerations regarding the implementation of various materials science applications of the Monte Carlo technique to single instruction multiple data (SIMD) computer architectures are presented. In particular, implementation of the Ising model with nearest, next nearest, and long range screened Coulomb interactions on the SIMD architecture MasPar MP-1 (DEC mpp-12000) series of massively parallel computers is demonstrated. Methods of code development which optimize processor array use and minimize inter-processor communication are presented including lattice partitioning and the use of processor array spanning tree structures for data reduction. Both geometric and algorithmic parallel approaches are utilized. Benchmarks in terms of Monte Carl updates per second for the MasPar architecture are presented and compared to values reported in the literature from comparable studies on other architectures
Microprocessor architectures RISC, CISC and DSP

CERN Document Server

Heath, Steve

1995-01-01

'Why are there all these different processor architectures and what do they all mean? Which processor will I use? How should I choose it?' Given the task of selecting an architecture or design approach, both engineers and managers require a knowledge of the whole system and an explanation of the design tradeoffs and their effects. This is information that rarely appears in data sheets or user manuals. This book fills that knowledge gap.Section 1 provides a primer and history of the three basic microprocessor architectures. Section 2 describes the ways in which the architectures react with the
Scalable Motion Estimation Processor Core for Multimedia System-on-Chip Applications

Science.gov (United States)

Lai, Yeong-Kang; Hsieh, Tian-En; Chen, Lien-Fei

2007-04-01

In this paper, we describe a high-throughput and scalable motion estimation processor architecture for multimedia system-on-chip applications. The number of processing elements (PEs) is scalable according to the variable algorithm parameters and the performance required for different applications. Using the PE rings efficiently and an intelligent memory-interleaving organization, the efficiency of the architecture can be increased. Moreover, using efficient on-chip memories and a data management technique can effectively decrease the power consumption and memory bandwidth. Techniques for reducing the number of interconnections and external memory accesses are also presented. Our results demonstrate that the proposed scalable PE-ringed architecture is a flexible and high-performance processor core in multimedia system-on-chip applications.
Turbo decoder architecture for beyond-4G applications

CERN Document Server

Wong, Cheng-Chi

2013-01-01

This book describes the most recent techniques for turbo decoder implementation, especially for 4G and beyond 4G applications. The authors reveal techniques for the design of high-throughput decoders for future telecommunication systems, enabling designers to reduce hardware cost and shorten processing time. Coverage includes an explanation of VLSI implementation of the turbo decoder, from basic functional units to advanced parallel architecture. The authors discuss both hardware architecture techniques and experimental results, showing the variations in area/throughput/performance with respec
A multichip aVLSI system emulating orientation selectivity of primary visual cortical cells.

Science.gov (United States)

Shimonomura, Kazuhiro; Yagi, Tetsuya

2005-07-01

In this paper, we designed and fabricated a multichip neuromorphic analog very large scale integrated (aVLSI) system, which emulates the orientation selective response of the simple cell in the primary visual cortex. The system consists of a silicon retina and an orientation chip. An image, which is filtered by a concentric center-surround (CS) antagonistic receptive field of the silicon retina, is transferred to the orientation chip. The image transfer from the silicon retina to the orientation chip is carried out with analog signals. The orientation chip selectively aggregates multiple pixels of the silicon retina, mimicking the feedforward model proposed by Hubel and Wiesel. The chip provides the orientation-selective (OS) outputs which are tuned to 0 degrees, 60 degrees, and 120 degrees. The feed-forward aggregation reduces the fixed pattern noise that is due to the mismatch of the transistors in the orientation chip. The spatial properties of the orientation selective response were examined in terms of the adjustable parameters of the chip, i.e., the number of aggregated pixels and size of the receptive field of the silicon retina. The multichip aVLSI architecture used in the present study can be applied to implement higher order cells such as the complex cell of the primary visual cortex.
An Energy-Efficient and Scalable Deep Learning/Inference Processor With Tetra-Parallel MIMD Architecture for Big Data Applications.

Science.gov (United States)

Park, Seong-Wook; Park, Junyoung; Bong, Kyeongryeol; Shin, Dongjoo; Lee, Jinmook; Choi, Sungpill; Yoo, Hoi-Jun

2015-12-01

Deep Learning algorithm is widely used for various pattern recognition applications such as text recognition, object recognition and action recognition because of its best-in-class recognition accuracy compared to hand-crafted algorithm and shallow learning based algorithms. Long learning time caused by its complex structure, however, limits its usage only in high-cost servers or many-core GPU platforms so far. On the other hand, the demand on customized pattern recognition within personal devices will grow gradually as more deep learning applications will be developed. This paper presents a SoC implementation to enable deep learning applications to run with low cost platforms such as mobile or portable devices. Different from conventional works which have adopted massively-parallel architecture, this work adopts task-flexible architecture and exploits multiple parallelism to cover complex functions of convolutional deep belief network which is one of popular deep learning/inference algorithms. In this paper, we implement the most energy-efficient deep learning and inference processor for wearable system. The implemented 2.5 mm × 4.0 mm deep learning/inference processor is fabricated using 65 nm 8-metal CMOS technology for a battery-powered platform with real-time deep inference and deep learning operation. It consumes 185 mW average power, and 213.1 mW peak power at 200 MHz operating frequency and 1.2 V supply voltage. It achieves 411.3 GOPS peak performance and 1.93 TOPS/W energy efficiency, which is 2.07× higher than the state-of-the-art.
XL-100S microprogrammable processor

International Nuclear Information System (INIS)

Gorbunov, N.V.; Guzik, Z.; Sutulin, V.A.; Forytski, A.

1983-01-01

The XL-100S microprogrammable processor providing the multiprocessor operation mode in the XL system crate is described. The processor meets the EUR 6500 CAMAC standards, address up to 4 Mbyte memory, and interacts with 7 CAMAC branchas. Eight external requests initiate operations preset by a sequence of microcommands in a memory of the capacity up to 64 kwords of 32-Git. The microprocessor architecture allows one to emulate commands of the majority of mini- or micro-computers, including floating point operations. The XL-100S processor may be used in various branches of experimental physics: for physical experiment apparatus control, fast selection of useful physical events, organization of the of input/output operations, organization of direct assess to memory included, etc. The Am2900 microprocessor set is used as an elementary base. The device is made in the form of a single width CAMAC module
Performance of Artificial Intelligence Workloads on the Intel Core 2 Duo Series Desktop Processors

OpenAIRE

Abdul Kareem PARCHUR; Kuppangari Krishna RAO; Fazal NOORBASHA; Ram Asaray SINGH

2010-01-01

As the processor architecture becomes more advanced, Intel introduced its Intel Core 2 Duo series processors. Performance impact on Intel Core 2 Duo processors are analyzed using SPEC CPU INT 2006 performance numbers. This paper studied the behavior of Artificial Intelligence (AI) benchmarks on Intel Core 2 Duo series processors. Moreover, we estimated the task completion time (TCT) @1 GHz, @2 GHz and @3 GHz Intel Core 2 Duo series processors frequency. Our results show the performance scalab...

Coordinated Energy Management in Heterogeneous Processors

Directory of Open Access Journals (Sweden)

Indrani Paul

2014-01-01

Full Text Available This paper examines energy management in a heterogeneous processor consisting of an integrated CPU–GPU for high-performance computing (HPC applications. Energy management for HPC applications is challenged by their uncompromising performance requirements and complicated by the need for coordinating energy management across distinct core types – a new and less understood problem. We examine the intra-node CPU–GPU frequency sensitivity of HPC applications on tightly coupled CPU–GPU architectures as the first step in understanding power and performance optimization for a heterogeneous multi-node HPC system. The insights from this analysis form the basis of a coordinated energy management scheme, called DynaCo, for integrated CPU–GPU architectures. We implement DynaCo on a modern heterogeneous processor and compare its performance to a state-of-the-art power- and performance-management algorithm. DynaCo improves measured average energy-delay squared (ED2 product by up to 30% with less than 2% average performance loss across several exascale and other HPC workloads.
An Analogue VLSI Implementation of the Meddis Inner Hair Cell Model

Science.gov (United States)

McEwan, Alistair; van Schaik, André

2003-12-01

The Meddis inner hair cell model is a widely accepted, but computationally intensive computer model of mammalian inner hair cell function. We have produced an analogue VLSI implementation of this model that operates in real time in the current domain by using translinear and log-domain circuits. The circuit has been fabricated on a chip and tested against the Meddis model for (a) rate level functions for onset and steady-state response, (b) recovery after masking, (c) additivity, (d) two-component adaptation, (e) phase locking, (f) recovery of spontaneous activity, and (g) computational efficiency. The advantage of this circuit, over other electronic inner hair cell models, is its nearly exact implementation of the Meddis model which can be tuned to behave similarly to the biological inner hair cell. This has important implications on our ability to simulate the auditory system in real time. Furthermore, the technique of mapping a mathematical model of first-order differential equations to a circuit of log-domain filters allows us to implement real-time neuromorphic signal processors for a host of models using the same approach.
Las Vegas is better than determinism in VLSI and distributed computing

DEFF Research Database (Denmark)

Mehlhorn, Kurt; Schmidt, Erik Meineche

1982-01-01

In this paper we describe a new method for proving lower bounds on the complexity of VLSI - computations and more generally distributed computations. Lipton and Sedgewick observed that the crossing sequence arguments used to prove lower bounds in VLSI (or TM or distributed computing) apply to (ac...
Nonlinear Wave Simulation on the Xeon Phi Knights Landing Processor

Science.gov (United States)

Hristov, Ivan; Goranov, Goran; Hristova, Radoslava

2018-02-01

We consider an interesting from computational point of view standing wave simulation by solving coupled 2D perturbed Sine-Gordon equations. We make an OpenMP realization which explores both thread and SIMD levels of parallelism. We test the OpenMP program on two different energy equivalent Intel architectures: 2× Xeon E5-2695 v2 processors, (code-named "Ivy Bridge-EP") in the Hybrilit cluster, and Xeon Phi 7250 processor (code-named "Knights Landing" (KNL). The results show 2 times better performance on KNL processor.
Architecture-Aware Optimization of an HEVC decoder on Asymmetric Multicore Processors

OpenAIRE

Rodríguez-Sánchez, Rafael; Quintana-Ortí, Enrique S.

2016-01-01

Low-power asymmetric multicore processors (AMPs) attract considerable attention due to their appealing performance-power ratio for energy-constrained environments. However, these processors pose a significant programming challenge due to the integration of cores with different performance capabilities, asking for an asymmetry-aware scheduling solution that carefully distributes the workload. The recent HEVC standard, which offers several high-level parallelization strategies, is an important ...
16-Bit RISC Processor Design for Convolution Application

OpenAIRE

Anand Nandakumar Shardul

2013-01-01

In this project, we propose a 16-bit non-pipelined RISC processor, which is used for signal processing applications. The processor consists of the blocks, namely, program counter, clock control unit, ALU, IDU and registers. Advantageous architectural modifications have been made in the incremented circuit used in program counter and carry select adder unit of the ALU in the RISC CPU core. Furthermore, a high speed and low power modified modifies multiplier has been designed and introduced in ...
Lithography requirements in complex VLSI device fabrication

International Nuclear Information System (INIS)

Wilson, A.D.

1985-01-01

Fabrication of complex very large scale integration (VLSI) circuits requires continual advances in lithography to satisfy: decreasing minimum linewidths, larger chip sizes, tighter linewidth and overlay control, increasing topography to linewidth ratios, higher yield demands, increased throughput, harsher device processing, lower lithography cost, and a larger part number set with quick turn-around time. Where optical, electron beam, x-ray, and ion beam lithography can be applied to judiciously satisfy the complex VLSI circuit fabrication requirements is discussed and those areas that are in need of major further advances are addressed. Emphasis will be placed on advanced electron beam and storage ring x-ray lithography
Space and frequency-multiplexed optical linear algebra processor - Fabrication and initial tests

Science.gov (United States)

Casasent, D.; Jackson, J.

1986-01-01

A new optical linear algebra processor architecture is described. Space and frequency-multiplexing are used to accommodate bipolar and complex-valued data. A fabricated laboratory version of this processor is described, the electronic support system used is discussed, and initial test data obtained on it are presented.
Unified and Modular Modeling and Functional Verification Framework of Real-Time Image Signal Processors

Directory of Open Access Journals (Sweden)

Abhishek Jain

2016-01-01

Full Text Available In VLSI industry, image signal processing algorithms are developed and evaluated using software models before implementation of RTL and firmware. After the finalization of the algorithm, software models are used as a golden reference model for the image signal processor (ISP RTL and firmware development. In this paper, we are describing the unified and modular modeling framework of image signal processing algorithms used for different applications such as ISP algorithms development, reference for hardware (HW implementation, reference for firmware (FW implementation, and bit-true certification. The universal verification methodology- (UVM- based functional verification framework of image signal processors using software reference models is described. Further, IP-XACT based tools for automatic generation of functional verification environment files and model map files are described. The proposed framework is developed both with host interface and with core using virtual register interface (VRI approach. This modeling and functional verification framework is used in real-time image signal processing applications including cellphone, smart cameras, and image compression. The main motivation behind this work is to propose the best efficient, reusable, and automated framework for modeling and verification of image signal processor (ISP designs. The proposed framework shows better results and significant improvement is observed in product verification time, verification cost, and quality of the designs.
OpenCL code generation for low energy wide SIMD architectures with explicit datapath.

NARCIS (Netherlands)

She, D.; He, Y.; Waeijen, L.J.W.; Corporaal, H.; Jeschke, H.; Silvén, O.

2013-01-01

Energy efficiency is one of the most important aspects in designing embedded processors. The use of a wide SIMD processor architecture is a promising approach to build energy-efficient high performance embedded processors. In this paper, we propose a configurable wide SIMD architecture that utilizes
Handbook of VLSI chip design and expert systems

CERN Document Server

Schwarz, A F

1993-01-01

Handbook of VLSI Chip Design and Expert Systems provides information pertinent to the fundamental aspects of expert systems, which provides a knowledge-based approach to problem solving. This book discusses the use of expert systems in every possible subtask of VLSI chip design as well as in the interrelations between the subtasks.Organized into nine chapters, this book begins with an overview of design automation, which can be identified as Computer-Aided Design of Circuits and Systems (CADCAS). This text then presents the progress in artificial intelligence, with emphasis on expert systems.
VLSI micro- and nanophotonics science, technology, and applications

CERN Document Server

Lee, El-Hang; Razeghi, Manijeh; Jagadish, Chennupati

2011-01-01

Addressing the growing demand for larger capacity in information technology, VLSI Micro- and Nanophotonics: Science, Technology, and Applications explores issues of science and technology of micro/nano-scale photonics and integration for broad-scale and chip-scale Very Large Scale Integration photonics. This book is a game-changer in the sense that it is quite possibly the first to focus on ""VLSI Photonics"". Very little effort has been made to develop integration technologies for micro/nanoscale photonic devices and applications, so this reference is an important and necessary early-stage pe
Rational calculation accuracy in acousto-optical matrix-vector processor

Science.gov (United States)

Oparin, V. V.; Tigin, Dmitry V.

1994-01-01

The high speed of parallel computations for a comparatively small-size processor and acceptable power consumption makes the usage of acousto-optic matrix-vector multiplier (AOMVM) attractive for processing of large amounts of information in real time. The limited accuracy of computations is an essential disadvantage of such a processor. The reduced accuracy requirements allow for considerable simplification of the AOMVM architecture and the reduction of the demands on its components.
Pursuit, Avoidance, and Cohesion in Flight: Multi-Purpose Control Laws and Neuromorphic VLSI

Science.gov (United States)

2010-10-01

spatial navigation in mammals. We have designed, fabricated, and are now testing a neuromorphic VLSI chip that implements a spike-based, attractor...Control Laws and Neuromorphic VLSI 5a. CONTRACT NUMBER 070402-7705 5b. GRANT NUMBER FA9550-07-1-0446 5c. PROGRAM ELEMENT NUMBER 6. AUTHOR(S...implementations (custom Neuromorphic VLSI and robotics) we will apply important practical constraints that can lead to deeper insight into how and why efficient
Performances of multiprocessor multidisk architectures for continuous media storage

Science.gov (United States)

Gennart, Benoit A.; Messerli, Vincent; Hersch, Roger D.

1996-03-01

Multimedia interfaces increase the need for large image databases, capable of storing and reading streams of data with strict synchronicity and isochronicity requirements. In order to fulfill these requirements, we consider a parallel image server architecture which relies on arrays of intelligent disk nodes, each disk node being composed of one processor and one or more disks. This contribution analyzes through bottleneck performance evaluation and simulation the behavior of two multi-processor multi-disk architectures: a point-to-point architecture and a shared-bus architecture similar to current multiprocessor workstation architectures. We compare the two architectures on the basis of two multimedia algorithms: the compute-bound frame resizing by resampling and the data-bound disk-to-client stream transfer. The results suggest that the shared bus is a potential bottleneck despite its very high hardware throughput (400Mbytes/s) and that an architecture with addressable local memories located closely to their respective processors could partially remove this bottleneck. The point- to-point architecture is scalable and able to sustain high throughputs for simultaneous compute- bound and data-bound operations.
Scientific programming on massively parallel processor CP-PACS

International Nuclear Information System (INIS)

Boku, Taisuke

1998-01-01

The massively parallel processor CP-PACS takes various problems of calculation physics as the object, and it has been designed so that its architecture has been devised to do various numerical processings. In this report, the outline of the CP-PACS and the example of programming in the Kernel CG benchmark in NAS Parallel Benchmarks, version 1, are shown, and the pseudo vector processing mechanism and the parallel processing tuning of scientific and technical computation utilizing the three-dimensional hyper crossbar net, which are two great features of the architecture of the CP-PACS are described. As for the CP-PACS, the PUs based on RISC processor and added with pseudo vector processor are used. Pseudo vector processing is realized as the loop processing by scalar command. The features of the connection net of PUs are explained. The algorithm of the NPB version 1 Kernel CG is shown. The part that takes the time for processing most in the main loop is the product of matrix and vector (matvec), and the parallel processing of the matvec is explained. The time for the computation by the CPU is determined. As the evaluation of the performance, the evaluation of the time for execution, the short vector processing of pseudo vector processor based on slide window, and the comparison with other parallel computers are reported. (K.I.)
Nonlinear Wave Simulation on the Xeon Phi Knights Landing Processor

Directory of Open Access Journals (Sweden)

Hristov Ivan

2018-01-01

Full Text Available We consider an interesting from computational point of view standing wave simulation by solving coupled 2D perturbed Sine-Gordon equations. We make an OpenMP realization which explores both thread and SIMD levels of parallelism. We test the OpenMP program on two different energy equivalent Intel architectures: 2× Xeon E5-2695 v2 processors, (code-named “Ivy Bridge-EP” in the Hybrilit cluster, and Xeon Phi 7250 processor (code-named “Knights Landing” (KNL. The results show 2 times better performance on KNL processor.
Bio-Inspired Microsystem for Robust Genetic Assay Recognition

Directory of Open Access Journals (Sweden)

Jaw-Chyng Lue

2008-01-01

Full Text Available A compact integrated system-on-chip (SoC architecture solution for robust, real-time, and on-site genetic analysis has been proposed. This microsystem solution is noise-tolerable and suitable for analyzing the weak fluorescence patterns from a PCR prepared dual-labeled DNA microchip assay. In the architecture, a preceding VLSI differential logarithm microchip is designed for effectively computing the logarithm of the normalized input fluorescence signals. A posterior VLSI artificial neural network (ANN processor chip is used for analyzing the processed signals from the differential logarithm stage. A single-channel logarithmic circuit was fabricated and characterized. A prototype ANN chip with unsupervised winner-take-all (WTA function was designed, fabricated, and tested. An ANN learning algorithm using a novel sigmoid-logarithmic transfer function based on the supervised backpropagation (BP algorithm is proposed for robustly recognizing low-intensity patterns. Our results show that the trained new ANN can recognize low-fluorescence patterns better than an ANN using the conventional sigmoid function.
Supertracker: A Programmable Parallel Pipeline Arithmetic Processor For Auto-Cueing Target Processing

Science.gov (United States)

Mack, Harold; Reddi, S. S.

1980-04-01

Supertracker represents a programmable parallel pipeline computer architecture that has been designed to meet the real time image processing requirements of auto-cueing target data processing. The prototype bread-board currently under development will be designed to perform input video preprocessing and processing for 525-line and 875-line TV formats FLIR video, automatic display gain and contrast control, and automatic target cueing, classification, and tracking. The video preprocessor is capable of performing operations full frames of video data in real time, e.g., frame integration, storage, 3 x 3 convolution, and neighborhood processing. The processor architecture is being implemented using bit-slice microprogrammable arithmetic processors, operating in parallel. Each processor is capable of up to 20 million operations per second. Multiple frame memories are used for additional flexibility.
The Chameleon Architecture for Streaming DSP Applications

Directory of Open Access Journals (Sweden)

André B. J. Kokkeler

2007-02-01

Full Text Available We focus on architectures for streaming DSP applications such as wireless baseband processing and image processing. We aim at a single generic architecture that is capable of dealing with different DSP applications. This architecture has to be energy efficient and fault tolerant. We introduce a heterogeneous tiled architecture and present the details of a domain-specific reconfigurable tile processor called Montium. This reconfigurable processor has a small footprint (1.8 mm2 in a 130 nm process, is power efficient and exploits the locality of reference principle. Reconfiguring the device is very fast, for example, loading the coefficients for a 200 tap FIR filter is done within 80 clock cycles. The tiles on the tiled architecture are connected to a Network-on-Chip (NoC via a network interface (NI. Two NoCs have been developed: a packet-switched and a circuit-switched version. Both provide two types of services: guaranteed throughput (GT and best effort (BE. For both NoCs estimates of power consumption are presented. The NI synchronizes data transfers, configures and starts/stops the tile processor. For dynamically mapping applications onto the tiled architecture, we introduce a run-time mapping tool.

The Chameleon Architecture for Streaming DSP Applications

Directory of Open Access Journals (Sweden)

Heysters PaulM

2007-01-01

Full Text Available We focus on architectures for streaming DSP applications such as wireless baseband processing and image processing. We aim at a single generic architecture that is capable of dealing with different DSP applications. This architecture has to be energy efficient and fault tolerant. We introduce a heterogeneous tiled architecture and present the details of a domain-specific reconfigurable tile processor called Montium. This reconfigurable processor has a small footprint (1.8 mm2 in a 130 nm process, is power efficient and exploits the locality of reference principle. Reconfiguring the device is very fast, for example, loading the coefficients for a 200 tap FIR filter is done within 80 clock cycles. The tiles on the tiled architecture are connected to a Network-on-Chip (NoC via a network interface (NI. Two NoCs have been developed: a packet-switched and a circuit-switched version. Both provide two types of services: guaranteed throughput (GT and best effort (BE. For both NoCs estimates of power consumption are presented. The NI synchronizes data transfers, configures and starts/stops the tile processor. For dynamically mapping applications onto the tiled architecture, we introduce a run-time mapping tool.
Dynamic Neural Fields as a Step Towards Cognitive Neuromorphic Architectures

Directory of Open Access Journals (Sweden)

Yulia eSandamirskaya

2014-01-01

Full Text Available Dynamic Field Theory (DFT is an established framework for modelling embodied cognition. In DFT, elementary cognitive functions such as memory formation, formation of grounded representations, attentional processes, decision making, adaptation, and learning emerge from neuronal dynamics. The basic computational element of this framework is a Dynamic Neural Field (DNF. Under constraints on the time-scale of the dynamics, the DNF is computationally equivalent to a soft winner-take-all (WTA network, which is considered one of the basic computational units in neuronal processing. Recently, it has been shown how a WTA network may be implemented in neuromorphic hardware, such as analogue Very Large Scale Integration (VLSI device. This paper leverages the relationship between DFT and soft WTA networks to systematically revise and integrate established DFT mechanisms that have previously been spread among different architectures. In addition, I also identify some novel computational and architectural mechanisms of DFT which may be implemented in neuromorphic VLSI devices using WTA networks as an intermediate computational layer. These specific mechanisms include the stabilization of working memory, the coupling of sensory systems to motor dynamics, intentionality, and autonomous learning. I further demonstrate how all these elements may be integrated into a unified architecture to generate behavior and autonomous learning.
Huffman-based code compression techniques for embedded processors

KAUST Repository

Bonny, Mohamed Talal; Henkel, Jö rg

2010-01-01

% for ARM and MIPS, respectively. In our compression technique, we have conducted evaluations using a representative set of applications and we have applied each technique to two major embedded processor architectures, namely ARM and MIPS. © 2010 ACM.
Very wide register : an asymmetric register file organization for low power embedded processors.

NARCIS (Netherlands)

Raghavan, P.; Lambrechts, A.; Jayapala, M.; Catthoor, F.; Verkest, D.T.M.L.; Corporaal, H.

2007-01-01

In current embedded systems processors, multi-ported register files are one of the most power hungry parts of the processor, even when they are clustered. This paper presents a novel register file architecture, which has single ported cells and asymmetric interfaces to the memory and to the
Processor-in-memory-and-storage architecture

Science.gov (United States)

DeBenedictis, Erik

2018-01-02

A method and apparatus for performing reliable general-purpose computing. Each sub-core of a plurality of sub-cores of a processor core processes a same instruction at a same time. A code analyzer receives a plurality of residues that represents a code word corresponding to the same instruction and an indication of whether the code word is a memory address code or a data code from the plurality of sub-cores. The code analyzer determines whether the plurality of residues are consistent or inconsistent. The code analyzer and the plurality of sub-cores perform a set of operations based on whether the code word is a memory address code or a data code and a determination of whether the plurality of residues are consistent or inconsistent.
Emulation of Neural Networks on a Nanoscale Architecture

International Nuclear Information System (INIS)

Eshaghian-Wilner, Mary M; Friesz, Aaron; Khitun, Alex; Navab, Shiva; Parker, Alice C; Wang, Kang L; Zhou, Chongwu

2007-01-01

In this paper, we propose using a nanoscale spin-wave-based architecture for implementing neural networks. We show that this architecture can efficiently realize highly interconnected neural network models such as the Hopfield model. In our proposed architecture, no point-to-point interconnection is required, so unlike standard VLSI design, no fan-in/fan-out constraint limits the interconnectivity. Using spin-waves, each neuron could broadcast to all other neurons simultaneously and similarly a neuron could concurrently receive and process multiple data. Therefore in this architecture, the total weighted sum to each neuron can be computed by the sum of the values from all the incoming waves to that neuron. In addition, using the superposition property of waves, this computation can be done in O(1) time, and neurons can update their states quite rapidly
ORGANIZATION OF GRAPHIC INFORMATION FOR VIEWING THE MULTILAYER VLSI TOPOLOGY

Directory of Open Access Journals (Sweden)

V. I. Romanov

2016-01-01

Full Text Available One of the possible ways to reorganize of graphical information describing the set of topology layers of modern VLSI. The method is directed on the use in the conditions of the bounded size of video card memory. An additional effect, providing high performance of forming multi- image layout a multi-layer topology of modern VLSI, is achieved by preloading the required texture by means of auxiliary background process.
An orthogonal wavelet division multiple-access processor architecture for LTE-advanced wireless/radio-over-fiber systems over heterogeneous networks

Science.gov (United States)

Mahapatra, Chinmaya; Leung, Victor CM; Stouraitis, Thanos

2014-12-01

The increase in internet traffic, number of users, and availability of mobile devices poses a challenge to wireless technologies. In long-term evolution (LTE) advanced system, heterogeneous networks (HetNet) using centralized coordinated multipoint (CoMP) transmitting radio over optical fibers (LTE A-ROF) have provided a feasible way of satisfying user demands. In this paper, an orthogonal wavelet division multiple-access (OWDMA) processor architecture is proposed, which is shown to be better suited to LTE advanced systems as compared to orthogonal frequency division multiple access (OFDMA) as in LTE systems 3GPP rel.8 (3GPP, http://www.3gpp.org/DynaReport/36300.htm). ROF systems are a viable alternative to satisfy large data demands; hence, the performance in ROF systems is also evaluated. To validate the architecture, the circuit is designed and synthesized on a Xilinx vertex-6 field-programmable gate array (FPGA). The synthesis results show that the circuit performs with a clock period as short as 7.036 ns (i.e., a maximum clock frequency of 142.13 MHz) for transform size of 512. A pipelined version of the architecture reduces the power consumption by approximately 89%. We compare our architecture with similar available architectures for resource utilization and timing and provide performance comparison with OFDMA systems for various quality metrics of communication systems. The OWDMA architecture is found to perform better than OFDMA for bit error rate (BER) performance versus signal-to-noise ratio (SNR) in wireless channel as well as ROF media. It also gives higher throughput and mitigates the bad effect of peak-to-average-power ratio (PAPR).
Design of RISC Processor Using VHDL and Cadence

Science.gov (United States)

Moslehpour, Saeid; Puliroju, Chandrasekhar; Abu-Aisheh, Akram

The project deals about development of a basic RISC processor. The processor is designed with basic architecture consisting of internal modules like clock generator, memory, program counter, instruction register, accumulator, arithmetic and logic unit and decoder. This processor is mainly used for simple general purpose like arithmetic operations and which can be further developed for general purpose processor by increasing the size of the instruction register. The processor is designed in VHDL by using Xilinx 8.1i version. The present project also serves as an application of the knowledge gained from past studies of the PSPICE program. The study will show how PSPICE can be used to simplify massive complex circuits designed in VHDL Synthesis. The purpose of the project is to explore the designed RISC model piece by piece, examine and understand the Input/ Output pins, and to show how the VHDL synthesis code can be converted to a simplified PSPICE model. The project will also serve as a collection of various research materials about the pieces of the circuit.
Noise limitations in optical linear algebra processors.

Science.gov (United States)

Batsell, S G; Jong, T L; Walkup, J F; Krile, T F

1990-05-10

A general statistical noise model is presented for optical linear algebra processors. A statistical analysis which includes device noise, the multiplication process, and the addition operation is undertaken. We focus on those processes which are architecturally independent. Finally, experimental results which verify the analytical predictions are also presented.
A programmable systolic array correlator as a trigger processor for electron pairs in rich (ring image Cherenkov) counters

Science.gov (United States)

Männer, R.

1989-12-01

This paper describes a systolic array processor for a ring image Cherenkov counter which is capable of identifying pairs of electron circles with a known radius and a certain minimum distance within 15 μs. The processor is a very flexible and fast device. It consists of 128 x 128 processing elements (PEs), where one PE is assigned to each pixel of the image. All PEs run synchronously at 40 MHz. The identification of electron circles is done by correlating the detector image with the proper circle circumference. Circle centers are found by peak detection in the correlation result. A second correlation with a circle disc allows circles of closed electron pairs to be rejected. The trigger decision is generated if a pseudo adder detects at least two remaining circles. The device is controlled by a freely programmable sequencer. A VLSI chip containing 8 x 8 PEs is being developed using a VENUS design system and will be produced in 2μ CMOS technology.
A programmable systolic array correlator as a trigger processor for electron pairs in RICH (ring image Cherenkov) counters

International Nuclear Information System (INIS)

Maenner, R.

1989-01-01

This paper describes a systolic array processor for a ring image Cherenkov counter which is capable of identifying pairs of electron circles with a known radius and a certain minimum distance within 15 μs. The processor is a very flexible and fast device. It consists of 128x128 processing elements (PEs), where one PE is assigned to each pixel of the image. All PEs run synchronously at 40 MHz. The identification of electron circles is done by correlating the detector image with the proper circle circumference. Circle centers are found by peak detection in the correlation result. A second correlation with a circle disc allows circles of closed electron pairs to be rejected. The trigger decision is generated if a pseudo adder detects at least two remaining circles. The device is controlled by a freely programmable sequencer. A VLSI chip containing 8x8 PEs is being developed using a VENUS design system and will be produced in 2μ CMOS technology. (orig.)
Application of evolutionary algorithms for multi-objective optimization in VLSI and embedded systems

CERN Document Server

2015-01-01

This book describes how evolutionary algorithms (EA), including genetic algorithms (GA) and particle swarm optimization (PSO) can be utilized for solving multi-objective optimization problems in the area of embedded and VLSI system design. Many complex engineering optimization problems can be modelled as multi-objective formulations. This book provides an introduction to multi-objective optimization using meta-heuristic algorithms, GA and PSO, and how they can be applied to problems like hardware/software partitioning in embedded systems, circuit partitioning in VLSI, design of operational amplifiers in analog VLSI, design space exploration in high-level synthesis, delay fault testing in VLSI testing, and scheduling in heterogeneous distributed systems. It is shown how, in each case, the various aspects of the EA, namely its representation, and operators like crossover, mutation, etc. can be separately formulated to solve these problems. This book is intended for design engineers and researchers in the field ...
Control structures for high speed processors

Science.gov (United States)

Maki, G. K.; Mankin, R.; Owsley, P. A.; Kim, G. M.

1982-01-01

A special processor was designed to function as a Reed Solomon decoder with throughput data rate in the Mhz range. This data rate is significantly greater than is possible with conventional digital architectures. To achieve this rate, the processor design includes sequential, pipelined, distributed, and parallel processing. The processor was designed using a high level language register transfer language. The RTL can be used to describe how the different processes are implemented by the hardware. One problem of special interest was the development of dependent processes which are analogous to software subroutines. For greater flexibility, the RTL control structure was implemented in ROM. The special purpose hardware required approximately 1000 SSI and MSI components. The data rate throughput is 2.5 megabits/second. This data rate is achieved through the use of pipelined and distributed processing. This data rate can be compared with 800 kilobits/second in a recently proposed very large scale integration design of a Reed Solomon encoder.
An Analogue VLSI Implementation of the Meddis Inner Hair Cell Model

Directory of Open Access Journals (Sweden)

Alistair McEwan

2003-06-01

Full Text Available The Meddis inner hair cell model is a widely accepted, but computationally intensive computer model of mammalian inner hair cell function. We have produced an analogue VLSI implementation of this model that operates in real time in the current domain by using translinear and log-domain circuits. The circuit has been fabricated on a chip and tested against the Meddis model for (a rate level functions for onset and steady-state response, (b recovery after masking, (c additivity, (d two-component adaptation, (e phase locking, (f recovery of spontaneous activity, and (g computational efficiency. The advantage of this circuit, over other electronic inner hair cell models, is its nearly exact implementation of the Meddis model which can be tuned to behave similarly to the biological inner hair cell. This has important implications on our ability to simulate the auditory system in real time. Furthermore, the technique of mapping a mathematical model of first-order differential equations to a circuit of log-domain filters allows us to implement real-time neuromorphic signal processors for a host of models using the same approach.
MAP3D: a media processor approach for high-end 3D graphics

Science.gov (United States)

Darsa, Lucia; Stadnicki, Steven; Basoglu, Chris

1999-12-01

Equator Technologies, Inc. has used a software-first approach to produce several programmable and advanced VLIW processor architectures that have the flexibility to run both traditional systems tasks and an array of media-rich applications. For example, Equator's MAP1000A is the world's fastest single-chip programmable signal and image processor targeted for digital consumer and office automation markets. The Equator MAP3D is a proposal for the architecture of the next generation of the Equator MAP family. The MAP3D is designed to achieve high-end 3D performance and a variety of customizable special effects by combining special graphics features with high performance floating-point and media processor architecture. As a programmable media processor, it offers the advantages of a completely configurable 3D pipeline--allowing developers to experiment with different algorithms and to tailor their pipeline to achieve the highest performance for a particular application. With the support of Equator's advanced C compiler and toolkit, MAP3D programs can be written in a high-level language. This allows the compiler to successfully find and exploit any parallelism in a programmer's code, thus decreasing the time to market of a given applications. The ability to run an operating system makes it possible to run concurrent applications in the MAP3D chip, such as video decoding while executing the 3D pipelines, so that integration of applications is easily achieved--using real-time decoded imagery for texturing 3D objects, for instance. This novel architecture enables an affordable, integrated solution for high performance 3D graphics.
High performance graphics processors for medical imaging applications

International Nuclear Information System (INIS)

Goldwasser, S.M.; Reynolds, R.A.; Talton, D.A.; Walsh, E.S.

1989-01-01

This paper describes a family of high- performance graphics processors with special hardware for interactive visualization of 3D human anatomy. The basic architecture expands to multiple parallel processors, each processor using pipelined arithmetic and logical units for high-speed rendering of Computed Tomography (CT), Magnetic Resonance (MR) and Positron Emission Tomography (PET) data. User-selectable display alternatives include multiple 2D axial slices, reformatted images in sagittal or coronal planes and shaded 3D views. Special facilities support applications requiring color-coded display of multiple datasets (such as radiation therapy planning), or dynamic replay of time- varying volumetric data (such as cine-CT or gated MR studies of the beating heart). The current implementation is a single processor system which generates reformatted images in true real time (30 frames per second), and shaded 3D views in a few seconds per frame. It accepts full scale medical datasets in their native formats, so that minimal preprocessing delay exists between data acquisition and display
A fast band–Krylov eigensolver for macromolecular functional motion simulation on multicore architectures and graphics processors

Energy Technology Data Exchange (ETDEWEB)

Aliaga, José I., E-mail: aliaga@uji.es [Depto. Ingeniería y Ciencia de Computadores, Universitat Jaume I, Castellón (Spain); Alonso, Pedro [Departamento de Sistemas Informáticos y Computación, Universitat Politècnica de València (Spain); Badía, José M. [Depto. Ingeniería y Ciencia de Computadores, Universitat Jaume I, Castellón (Spain); Chacón, Pablo [Dept. Biological Chemical Physics, Rocasolano Physics and Chemistry Institute, CSIC, Madrid (Spain); Davidović, Davor [Rudjer Bošković Institute, Centar za Informatiku i Računarstvo – CIR, Zagreb (Croatia); López-Blanco, José R. [Dept. Biological Chemical Physics, Rocasolano Physics and Chemistry Institute, CSIC, Madrid (Spain); Quintana-Ortí, Enrique S. [Depto. Ingeniería y Ciencia de Computadores, Universitat Jaume I, Castellón (Spain)

2016-03-15

We introduce a new iterative Krylov subspace-based eigensolver for the simulation of macromolecular motions on desktop multithreaded platforms equipped with multicore processors and, possibly, a graphics accelerator (GPU). The method consists of two stages, with the original problem first reduced into a simpler band-structured form by means of a high-performance compute-intensive procedure. This is followed by a memory-intensive but low-cost Krylov iteration, which is off-loaded to be computed on the GPU by means of an efficient data-parallel kernel. The experimental results reveal the performance of the new eigensolver. Concretely, when applied to the simulation of macromolecules with a few thousands degrees of freedom and the number of eigenpairs to be computed is small to moderate, the new solver outperforms other methods implemented as part of high-performance numerical linear algebra packages for multithreaded architectures.
A fast band–Krylov eigensolver for macromolecular functional motion simulation on multicore architectures and graphics processors

International Nuclear Information System (INIS)

Aliaga, José I.; Alonso, Pedro; Badía, José M.; Chacón, Pablo; Davidović, Davor; López-Blanco, José R.; Quintana-Ortí, Enrique S.

2016-01-01

We introduce a new iterative Krylov subspace-based eigensolver for the simulation of macromolecular motions on desktop multithreaded platforms equipped with multicore processors and, possibly, a graphics accelerator (GPU). The method consists of two stages, with the original problem first reduced into a simpler band-structured form by means of a high-performance compute-intensive procedure. This is followed by a memory-intensive but low-cost Krylov iteration, which is off-loaded to be computed on the GPU by means of an efficient data-parallel kernel. The experimental results reveal the performance of the new eigensolver. Concretely, when applied to the simulation of macromolecules with a few thousands degrees of freedom and the number of eigenpairs to be computed is small to moderate, the new solver outperforms other methods implemented as part of high-performance numerical linear algebra packages for multithreaded architectures.
Multicore technology architecture, reconfiguration, and modeling

CERN Document Server

Qadri, Muhammad Yasir

2013-01-01

The saturation of design complexity and clock frequencies for single-core processors has resulted in the emergence of multicore architectures as an alternative design paradigm. Nowadays, multicore/multithreaded computing systems are not only a de-facto standard for high-end applications, they are also gaining popularity in the field of embedded computing. The start of the multicore era has altered the concepts relating to almost all of the areas of computer architecture design, including core design, memory management, thread scheduling, application support, inter-processor communication, debu

The VLSI handbook

CERN Document Server

Chen, Wai-Kai

2007-01-01

Written by a stellar international panel of expert contributors, this handbook remains the most up-to-date, reliable, and comprehensive source for real answers to practical problems. In addition to updated information in most chapters, this edition features several heavily revised and completely rewritten chapters, new chapters on such topics as CMOS fabrication and high-speed circuit design, heavily revised sections on testing of digital systems and design languages, and two entirely new sections on low-power electronics and VLSI signal processing. An updated compendium of references and othe
High-level language computer architecture

CERN Document Server

Chu, Yaohan

1975-01-01

High-Level Language Computer Architecture offers a tutorial on high-level language computer architecture, including von Neumann architecture and syntax-oriented architecture as well as direct and indirect execution architecture. Design concepts of Japanese-language data processing systems are discussed, along with the architecture of stack machines and the SYMBOL computer system. The conceptual design of a direct high-level language processor is also described.Comprised of seven chapters, this book first presents a classification of high-level language computer architecture according to the pr
Graphics processor efficiency for realization of rapid tabular computations

International Nuclear Information System (INIS)

Dudnik, V.A.; Kudryavtsev, V.I.; Us, S.A.; Shestakov, M.V.

2016-01-01

Capabilities of graphics processing units (GPU) and central processing units (CPU) have been investigated for realization of fast-calculation algorithms with the use of tabulated functions. The realization of tabulated functions is exemplified by the GPU/CPU architecture-based processors. Comparison is made between the operating efficiencies of GPU and CPU, employed for tabular calculations at different conditions of use. Recommendations are formulated for the use of graphical and central processors to speed up scientific and engineering computations through the use of tabulated functions
Multi-mode sensor processing on a dynamically reconfigurable massively parallel processor array

Science.gov (United States)

Chen, Paul; Butts, Mike; Budlong, Brad; Wasson, Paul

2008-04-01

This paper introduces a novel computing architecture that can be reconfigured in real time to adapt on demand to multi-mode sensor platforms' dynamic computational and functional requirements. This 1 teraOPS reconfigurable Massively Parallel Processor Array (MPPA) has 336 32-bit processors. The programmable 32-bit communication fabric provides streamlined inter-processor connections with deterministically high performance. Software programmability, scalability, ease of use, and fast reconfiguration time (ranging from microseconds to milliseconds) are the most significant advantages over FPGAs and DSPs. This paper introduces the MPPA architecture, its programming model, and methods of reconfigurability. An MPPA platform for reconfigurable computing is based on a structural object programming model. Objects are software programs running concurrently on hundreds of 32-bit RISC processors and memories. They exchange data and control through a network of self-synchronizing channels. A common application design pattern on this platform, called a work farm, is a parallel set of worker objects, with one input and one output stream. Statically configured work farms with homogeneous and heterogeneous sets of workers have been used in video compression and decompression, network processing, and graphics applications.
Spike Neuromorphic VLSI-Based Bat Echolocation for Micro-Aerial Vehicle Guidance

Science.gov (United States)

2007-03-31

IFinal 03/01/04 - 02/28/07 4. TITLE AND SUBTITLE 5a. CONTRACT NUMBER Neuromorphic VLSI-based Bat Echolocation for Micro-aerial 5b.GRANTNUMBER Vehicle...uncovered interesting new issues in our choice for representing the intensity of signals. We have just finished testing the first chip version of an echo...timing-based algorithm (’openspace’) for sonar-guided navigation amidst multiple obstacles. 15. SUBJECT TERMS Neuromorphic VLSI, bat echolocation
A parallel architecture for digital filtering using Fermat number transforms

Science.gov (United States)

Truong, T. K.; Reed, I. S.; Yeh, C.-S.; Shao, H. M.

1983-01-01

In this correspondence, a parallel architecture is developed to compute the linear convolution of two sequences of arbitrary lengths using the Fermat number transform (FNT). In particular, a pipeline structure is designed to compute a 128-point FNT. In this FNT, only additions and bit rotations are required. The overlap-save method is generalized for the FNT to realize a digital filter of arbitrary length. The generalized overlap-save method alleviates the usual dynamic range limitation of FNT's of long transform lengths. A parallel architecture is developed to realize this type of overlap-save method using one FNT and several inverse FNT's of 128 points. Its architecture is regular, simple, and flexible, and therefore naturally suitable for VLSI implementation.
UNIBUS processor interface for a FASTBUS data acquisition system

International Nuclear Information System (INIS)

Larwill, M.; Lagerlund, T.D.; Barsotti, E.; Taff, L.M.; Franzen, J.

1981-01-01

Current work on a FASTBUS data acquisition system at Fermilab is described. The system will consist of three pieces of FASTBUS hardware: a UNIBUS processor interface (UPI), a dual-ported bulk memory, and a FASTBUS ''event builder'' (i.e., data acquisition processor). Primary efforts have been on specifying and constructing a UPI. The present specification includes capability for all basic FASTBUS operations, including list processing of consecutive FASTBUS operations. Some possible FASTBUS data acquisition system architectures employing the UPI are discussed along with some detailed specifications of the UPI itself
Development of superconductor electronics technology for high-end computing

Energy Technology Data Exchange (ETDEWEB)

Silver, A [Jet Propulsion Laboratory, 4800 Oak Grove Drive, Pasadena, CA 91109-8099 (United States); Kleinsasser, A [Jet Propulsion Laboratory, 4800 Oak Grove Drive, Pasadena, CA 91109-8099 (United States); Kerber, G [Northrop Grumman Space Technology, One Space Park, Redondo Beach, CA 90278 (United States); Herr, Q [Northrop Grumman Space Technology, One Space Park, Redondo Beach, CA 90278 (United States); Dorojevets, M [Department of Electrical and Computer Engineering, SUNY-Stony Brook, NY 11794-2350 (United States); Bunyk, P [Northrop Grumman Space Technology, One Space Park, Redondo Beach, CA 90278 (United States); Abelson, L [Northrop Grumman Space Technology, One Space Park, Redondo Beach, CA 90278 (United States)

2003-12-01

This paper describes our programme to develop and demonstrate ultra-high performance single flux quantum (SFQ) VLSI technology that will enable superconducting digital processors for petaFLOPS-scale computing. In the hybrid technology, multi-threaded architecture, the computational engine to power a petaFLOPS machine at affordable power will consist of 4096 SFQ multi-chip processors, with 50 to 100 GHz clock frequency and associated cryogenic RAM. We present the superconducting technology requirements, progress to date and our plan to meet these requirements. We improved SFQ Nb VLSI by two generations, to a 8 kA cm{sup -2}, 1.25 {mu}m junction process, incorporated new CAD tools into our methodology, demonstrated methods for recycling the bias current and data communication at speeds up to 60 Gb s{sup -1}, both on and between chips through passive transmission lines. FLUX-1 is the most ambitious project implemented in SFQ technology to date, a prototype general-purpose 8 bit microprocessor chip. We are testing the FLUX-1 chip (5K gates, 20 GHz clock) and designing a 32 bit floating-point SFQ multiplier with vector-register memory. We report correct operation of the complete stripline-connected gate library with large bias margins, as well as several larger functional units used in FLUX-1. The next stage will be an SFQ multi-processor machine. Important challenges include further reducing chip supply current and on-chip power dissipation, developing at least 64 kbit, sub-nanosecond cryogenic RAM chips, developing thermally and electrically efficient high data rate cryogenic-to-ambient input/output technology and improving Nb VLSI to increase gate density.
Development of superconductor electronics technology for high-end computing

International Nuclear Information System (INIS)

Silver, A; Kleinsasser, A; Kerber, G; Herr, Q; Dorojevets, M; Bunyk, P; Abelson, L

2003-01-01

This paper describes our programme to develop and demonstrate ultra-high performance single flux quantum (SFQ) VLSI technology that will enable superconducting digital processors for petaFLOPS-scale computing. In the hybrid technology, multi-threaded architecture, the computational engine to power a petaFLOPS machine at affordable power will consist of 4096 SFQ multi-chip processors, with 50 to 100 GHz clock frequency and associated cryogenic RAM. We present the superconducting technology requirements, progress to date and our plan to meet these requirements. We improved SFQ Nb VLSI by two generations, to a 8 kA cm -2 , 1.25 μm junction process, incorporated new CAD tools into our methodology, demonstrated methods for recycling the bias current and data communication at speeds up to 60 Gb s -1 , both on and between chips through passive transmission lines. FLUX-1 is the most ambitious project implemented in SFQ technology to date, a prototype general-purpose 8 bit microprocessor chip. We are testing the FLUX-1 chip (5K gates, 20 GHz clock) and designing a 32 bit floating-point SFQ multiplier with vector-register memory. We report correct operation of the complete stripline-connected gate library with large bias margins, as well as several larger functional units used in FLUX-1. The next stage will be an SFQ multi-processor machine. Important challenges include further reducing chip supply current and on-chip power dissipation, developing at least 64 kbit, sub-nanosecond cryogenic RAM chips, developing thermally and electrically efficient high data rate cryogenic-to-ambient input/output technology and improving Nb VLSI to increase gate density
A 600-µW ultra-low-power associative processor for image pattern recognition employing magnetic tunnel junction-based nonvolatile memories with autonomic intelligent power-gating scheme

Science.gov (United States)

Ma, Yitao; Miura, Sadahiko; Honjo, Hiroaki; Ikeda, Shoji; Hanyu, Takahiro; Ohno, Hideo; Endoh, Tetsuo

2016-04-01

A novel associative processor using magnetic tunnel junction (MTJ)-based nonvolatile memories has been proposed and fabricated under a 90 nm CMOS/70 nm perpendicular-MTJ (p-MTJ) hybrid process for achieving the exceptionally low-power performance of image pattern recognition. A four-transistor 2-MTJ (4T-2MTJ) spin transfer torque magnetoresistive random access memory was adopted to completely eliminate the standby power. A self-directed intelligent power-gating (IPG) scheme specialized for this associative processor is employed to optimize the operation power by only autonomously activating currently accessed memory cells. The operations of a prototype chip at 20 MHz are demonstrated by measurement. The proposed processor can successfully carry out single texture pattern matching within 6.5 µs using 128-dimension bag-of-feature patterns, and the measured average operation power of the entire processor core is only 600 µW. Compared with the twin chip designed with 6T static random access memory, 91.2% power reductions are achieved. More than 88.0% power reductions are obtained compared with the latest associative memories. The further power performance analysis is discussed in detail, which verifies the special superiority of the proposed processor in power consumption for large-capacity memory-based VLSI systems.
A single chip pulse processor for nuclear spectroscopy

International Nuclear Information System (INIS)

Hilsenrath, F.; Bakke, J.C.; Voss, H.D.

1985-01-01

A high performance digital pulse processor, integrated into a single gate array microcircuit, has been developed for spaceflight applications. The new approach takes advantage of the latest CMOS high speed A/D flash converters and low-power gated logic arrays. The pulse processor measures pulse height, pulse area and the required timing information (e.g. multi detector coincidence and pulse pile-up detection). The pulse processor features high throughput rate (e.g. 0.5 Mhz for 2 usec gausssian pulses) and improved differential linearity (e.g. + or - 0.2 LSB for a + or - 1 LSB A/D). Because of the parallel digital architecture of the device, the interface is microprocessor bus compatible. A satellite flight application of this module is presented for use in the X-ray imager and high energy particle spectrometers of the PEM experiment on the Upper Atmospheric Research Satellite
A second generation 50 Mbps VLSI level zero processing system prototype

Science.gov (United States)

Harris, Jonathan C.; Shi, Jeff; Speciale, Nick; Bennett, Toby

1994-01-01

Level Zero Processing (LZP) generally refers to telemetry data processing functions performed at ground facilities to remove all communication artifacts from instrument data. These functions typically include frame synchronization, error detection and correction, packet reassembly and sorting, playback reversal, merging, time-ordering, overlap deletion, and production of annotated data sets. The Data Systems Technologies Division (DSTD) at Goddard Space Flight Center (GSFC) has been developing high-performance Very Large Scale Integration Level Zero Processing Systems (VLSI LZPS) since 1989. The first VLSI LZPS prototype demonstrated 20 Megabits per second (Mbp's) capability in 1992. With a new generation of high-density Application-specific Integrated Circuits (ASIC) and a Mass Storage System (MSS) based on the High-performance Parallel Peripheral Interface (HiPPI), a second prototype has been built that achieves full 50 Mbp's performance. This paper describes the second generation LZPS prototype based upon VLSI technologies.
Optimizing the performance of streaming numerical kernels on the IBM Blue Gene/P PowerPC 450 processor

KAUST Repository

Malas, Tareq Majed Yasin; Ahmadia, Aron; Brown, Jed; Gunnels, John A.; Keyes, David E.

2012-01-01

Several emerging petascale architectures use energy-efficient processors with vectorized computational units and in-order thread processing. On these architectures the sustained performance of streaming numerical kernels, ubiquitous in the solution
Multi-core System Architecture for Safety-critical Control Applications

DEFF Research Database (Denmark)

Li, Gang

and size, and high power consumption. Increasing the frequency of a processor is becoming painful now due to the explosive power consumption. Furthermore, components integrated into a single-core processor have to be certified to the highest SIL, due to that no isolation is provided in a traditional single...... certification cost. Meanwhile, hardware platforms with improved processing power are required to execute the applications of larger size. To tackle the two issues mentioned above, the state of the art approaches are using more Electronic Control Units (ECU) in a federated architecture or increasing......-core processor. A promising alternative to improve processing power and provide isolation is to adopt a multi-core architecture with on-chip isolation. In general, a specific multi-core architecture can facilitate the development and certification of safety-related systems, due to its physical isolation between...
Confabulation Based Real-time Anomaly Detection for Wide-area Surveillance Using Heterogeneous High Performance Computing Architecture

Science.gov (United States)

2015-06-01

CONFABULATION BASED REAL-TIME ANOMALY DETECTION FOR WIDE-AREA SURVEILLANCE USING HETEROGENEOUS HIGH PERFORMANCE COMPUTING ARCHITECTURE SYRACUSE...DETECTION FOR WIDE-AREA SURVEILLANCE USING HETEROGENEOUS HIGH PERFORMANCE COMPUTING ARCHITECTURE 5a. CONTRACT NUMBER FA8750-12-1-0251 5b. GRANT...processors including graphic processor units (GPUs) and Intel Xeon Phi processors. Experimental results showed significant speedups, which can enable
Nonlinear Wave Simulation on the Xeon Phi Knights Landing Processor

OpenAIRE

Hristov Ivan; Goranov Goran; Hristova Radoslava

2018-01-01

We consider an interesting from computational point of view standing wave simulation by solving coupled 2D perturbed Sine-Gordon equations. We make an OpenMP realization which explores both thread and SIMD levels of parallelism. We test the OpenMP program on two different energy equivalent Intel architectures: 2× Xeon E5-2695 v2 processors, (code-named “Ivy Bridge-EP”) in the Hybrilit cluster, and Xeon Phi 7250 processor (code-named “Knights Landing” (KNL). The results show 2 times better per...
Motion estimation for video coding efficient algorithms and architectures

CERN Document Server

Chakrabarti, Indrajit; Chatterjee, Sumit Kumar

2015-01-01

The need of video compression in the modern age of visual communication cannot be over-emphasized. This monograph will provide useful information to the postgraduate students and researchers who wish to work in the domain of VLSI design for video processing applications. In this book, one can find an in-depth discussion of several motion estimation algorithms and their VLSI implementation as conceived and developed by the authors. It records an account of research done involving fast three step search, successive elimination, one-bit transformation and its effective combination with diamond search and dynamic pixel truncation techniques. Two appendices provide a number of instances of proof of concept through Matlab and Verilog program segments. In this aspect, the book can be considered as first of its kind. The architectures have been developed with an eye to their applicability in everyday low-power handheld appliances including video camcorders and smartphones.
Software and DVFS Tuning for Performance and Energy-Efficiency on Intel KNL Processors

Directory of Open Access Journals (Sweden)

Enrico Calore

2018-06-01

Full Text Available Energy consumption of processors and memories is quickly becoming a limiting factor in the deployment of large computing systems. For this reason, it is important to understand the energy performance of these processors and to study strategies allowing their use in the most efficient way. In this work, we focus on the computing and energy performance of the Knights Landing Xeon Phi, the latest Intel many-core architecture processor for HPC applications. We consider the 64-core Xeon Phi 7230 and profile its performance and energy efficiency using both its on-chip MCDRAM and the off-chip DDR4 memory as the main storage for application data. As a benchmark application, we use a lattice Boltzmann code heavily optimized for this architecture and implemented using several different arrangements of the application data in memory (data-layouts, in short. We also assess the dependence of energy consumption on data-layouts, memory configurations (DDR4 or MCDRAM and the number of threads per core. We finally consider possible trade-offs between computing performance and energy efficiency, tuning the clock frequency of the processor using the Dynamic Voltage and Frequency Scaling (DVFS technique.
Locality-Aware Task Scheduling and Data Distribution for OpenMP Programs on NUMA Systems and Manycore Processors

Directory of Open Access Journals (Sweden)

Ananya Muddukrishna

2015-01-01

Full Text Available Performance degradation due to nonuniform data access latencies has worsened on NUMA systems and can now be felt on-chip in manycore processors. Distributing data across NUMA nodes and manycore processor caches is necessary to reduce the impact of nonuniform latencies. However, techniques for distributing data are error-prone and fragile and require low-level architectural knowledge. Existing task scheduling policies favor quick load-balancing at the expense of locality and ignore NUMA node/manycore cache access latencies while scheduling. Locality-aware scheduling, in conjunction with or as a replacement for existing scheduling, is necessary to minimize NUMA effects and sustain performance. We present a data distribution and locality-aware scheduling technique for task-based OpenMP programs executing on NUMA systems and manycore processors. Our technique relieves the programmer from thinking of NUMA system/manycore processor architecture details by delegating data distribution to the runtime system and uses task data dependence information to guide the scheduling of OpenMP tasks to reduce data stall times. We demonstrate our technique on a four-socket AMD Opteron machine with eight NUMA nodes and on the TILEPro64 processor and identify that data distribution and locality-aware task scheduling improve performance up to 69% for scientific benchmarks compared to default policies and yet provide an architecture-oblivious approach for programmers.
Bulk-memory processor for data acquisition

International Nuclear Information System (INIS)

Nelson, R.O.; McMillan, D.E.; Sunier, J.W.; Meier, M.; Poore, R.V.

1981-01-01

To meet the diverse needs and data rate requirements at the Van de Graaff and Weapons Neutron Research (WNR) facilities, a bulk memory system has been implemented which includes a fast and flexible processor. This bulk memory processor (BMP) utilizes bit slice and microcode techniques and features a 24 bit wide internal architecture allowing direct addressing of up to 16 megawords of memory and histogramming up to 16 million counts per channel without overflow. The BMP is interfaced to the MOSTEK MK 8000 bulk memory system and to the standard MODCOMP computer I/O bus. Coding for the BMP both at the microcode level and with macro instructions is supported. The generalized data acquisition system has been extended to support the BMP in a manner transparent to the user

Synthesis of on-chip control circuits for mVLSI biochips

DEFF Research Database (Denmark)

Potluri, Seetal; Schneider, Alexander Rüdiger; Hørslev-Petersen, Martin

2017-01-01

them to laboratory environments. To address this issue, researchers have proposed methods to reduce the number of offchip pressure sources, through integration of on-chip pneumatic control logic circuits fabricated using three-layer monolithic membrane valve technology. Traditionally, mVLSI biochip......-chip control circuit design and (iii) the integration of on-chip control in the placement and routing design tasks. In this paper we present a design methodology for logic synthesis and physical synthesis of mVLSI biochips that use on-chip control. We show how the proposed methodology can be successfully...... applied to generate biochip layouts with integrated on-chip pneumatic control....
Performance of Artificial Intelligence Workloads on the Intel Core 2 Duo Series Desktop Processors

Directory of Open Access Journals (Sweden)

Abdul Kareem PARCHUR

2010-12-01

Full Text Available As the processor architecture becomes more advanced, Intel introduced its Intel Core 2 Duo series processors. Performance impact on Intel Core 2 Duo processors are analyzed using SPEC CPU INT 2006 performance numbers. This paper studied the behavior of Artificial Intelligence (AI benchmarks on Intel Core 2 Duo series processors. Moreover, we estimated the task completion time (TCT @1 GHz, @2 GHz and @3 GHz Intel Core 2 Duo series processors frequency. Our results show the performance scalability in Intel Core 2 Duo series processors. Even though AI benchmarks have similar execution time, they have dissimilar characteristics which are identified using principal component analysis and dendogram. As the processor frequency increased from 1.8 GHz to 3.167 GHz the execution time is decreased by ~370 sec for AI workloads. In the case of Physics/Quantum Computing programs it was ~940 sec.
Emerging Applications for High K Materials in VLSI Technology

Science.gov (United States)

Clark, Robert D.

2014-01-01

The current status of High K dielectrics in Very Large Scale Integrated circuit (VLSI) manufacturing for leading edge Dynamic Random Access Memory (DRAM) and Complementary Metal Oxide Semiconductor (CMOS) applications is summarized along with the deposition methods and general equipment types employed. Emerging applications for High K dielectrics in future CMOS are described as well for implementations in 10 nm and beyond nodes. Additional emerging applications for High K dielectrics include Resistive RAM memories, Metal-Insulator-Metal (MIM) diodes, Ferroelectric logic and memory devices, and as mask layers for patterning. Atomic Layer Deposition (ALD) is a common and proven deposition method for all of the applications discussed for use in future VLSI manufacturing. PMID:28788599
Emerging Applications for High K Materials in VLSI Technology

Directory of Open Access Journals (Sweden)

Robert D. Clark

2014-04-01

Full Text Available The current status of High K dielectrics in Very Large Scale Integrated circuit (VLSI manufacturing for leading edge Dynamic Random Access Memory (DRAM and Complementary Metal Oxide Semiconductor (CMOS applications is summarized along with the deposition methods and general equipment types employed. Emerging applications for High K dielectrics in future CMOS are described as well for implementations in 10 nm and beyond nodes. Additional emerging applications for High K dielectrics include Resistive RAM memories, Metal-Insulator-Metal (MIM diodes, Ferroelectric logic and memory devices, and as mask layers for patterning. Atomic Layer Deposition (ALD is a common and proven deposition method for all of the applications discussed for use in future VLSI manufacturing.
Modal Processor Effects Inspired by Hammond Tonewheel Organs

Directory of Open Access Journals (Sweden)

Kurt James Werner

2016-06-01

Full Text Available In this design study, we introduce a novel class of digital audio effects that extend the recently introduced modal processor approach to artificial reverberation and effects processing. These pitch and distortion processing effects mimic the design and sonics of a classic additive-synthesis-based electromechanical musical instrument, the Hammond tonewheel organ. As a reverb effect, the modal processor simulates a room response as the sum of resonant filter responses. This architecture provides precise, interactive control over the frequency, damping, and complex amplitude of each mode. Into this framework, we introduce two types of processing effects: pitch effects inspired by the Hammond organ’s equal tempered “tonewheels”, “drawbar” tone controls, vibrato/chorus circuit, and distortion effects inspired by the pseudo-sinusoidal shape of its tonewheels and electromagnetic pickup distortion. The result is an effects processor that imprints the Hammond organ’s sonics onto any audio input.
Stepping motor control processor reference manual. Volume I

International Nuclear Information System (INIS)

Holloway, F.W.; VanArsdall, P.J.; Suski, G.J.; Gant, R.G.; Rash, M.

1980-01-01

This manual is intended to serve several purposes. The first goal is to describe the capabilities and operation of the SMC processor package from an operator or user point of view. Secondly, the manual will describe in some detail the basic hardware elements and how they can be used effectively to implement a step motor control system. Practical information on the use, installation and checkout of the hardware set is presented in the following sections along with programming suggestions. Available related system software is described in this manual for reference and as an aid in understanding the system architecture. Section two presents an overview and operations manual of the SMC processor describing its composition and functional capabilities. Section three contains hardware descriptions in some detail for the LLL-designed hardware used in the SMC processor. Basic theory of operation and important features are explained
HONEI: A collection of libraries for numerical computations targeting multiple processor architectures

Science.gov (United States)

van Dyk, Danny; Geveler, Markus; Mallach, Sven; Ribbrock, Dirk; Göddeke, Dominik; Gutwenger, Carsten

2009-12-01

We present HONEI, an open-source collection of libraries offering a hardware oriented approach to numerical calculations. HONEI abstracts the hardware, and applications written on top of HONEI can be executed on a wide range of computer architectures such as CPUs, GPUs and the Cell processor. We demonstrate the flexibility and performance of our approach with two test applications, a Finite Element multigrid solver for the Poisson problem and a robust and fast simulation of shallow water waves. By linking against HONEI's libraries, we achieve a two-fold speedup over straight forward C++ code using HONEI's SSE backend, and additional 3-4 and 4-16 times faster execution on the Cell and a GPU. A second important aspect of our approach is that the full performance capabilities of the hardware under consideration can be exploited by adding optimised application-specific operations to the HONEI libraries. HONEI provides all necessary infrastructure for development and evaluation of such kernels, significantly simplifying their development. Program summaryProgram title: HONEI Catalogue identifier: AEDW_v1_0 Program summary URL:http://cpc.cs.qub.ac.uk/summaries/AEDW_v1_0.html Program obtainable from: CPC Program Library, Queen's University, Belfast, N. Ireland Licensing provisions: GPLv2 No. of lines in distributed program, including test data, etc.: 216 180 No. of bytes in distributed program, including test data, etc.: 1 270 140 Distribution format: tar.gz Programming language: C++ Computer: x86, x86_64, NVIDIA CUDA GPUs, Cell blades and PlayStation 3 Operating system: Linux RAM: at least 500 MB free Classification: 4.8, 4.3, 6.1 External routines: SSE: none; [1] for GPU, [2] for Cell backend Nature of problem: Computational science in general and numerical simulation in particular have reached a turning point. The revolution developers are facing is not primarily driven by a change in (problem-specific) methodology, but rather by the fundamental paradigm shift of the
Harnessing VLSI System Design with EDA Tools

CERN Document Server

Kamat, Rajanish K; Gaikwad, Pawan K; Guhilot, Hansraj

2012-01-01

This book explores various dimensions of EDA technologies for achieving different goals in VLSI system design. Although the scope of EDA is very broad and comprises diversified hardware and software tools to accomplish different phases of VLSI system design, such as design, layout, simulation, testability, prototyping and implementation, this book focuses only on demystifying the code, a.k.a. firmware development and its implementation with FPGAs. Since there are a variety of languages for system design, this book covers various issues related to VHDL, Verilog and System C synergized with EDA tools, using a variety of case studies such as testability, verification and power consumption. * Covers aspects of VHDL, Verilog and Handel C in one text; * Enables designers to judge the appropriateness of each EDA tool for relevant applications; * Omits discussion of design platforms and focuses on design case studies; * Uses design case studies from diversified application domains such as network on chip, hospital on...
Power estimation on functional level for programmable processors

Directory of Open Access Journals (Sweden)

M. Schneider

2004-01-01

Full Text Available In diesem Beitrag werden verschiedene Ansätze zur Verlustleistungsschätzung von programmierbaren Prozessoren vorgestellt und bezüglich ihrer Übertragbarkeit auf moderne Prozessor-Architekturen wie beispielsweise Very Long Instruction Word (VLIW-Architekturen bewertet. Besonderes Augenmerk liegt hierbei auf dem Konzept der sogenannten Functional-Level Power Analysis (FLPA. Dieser Ansatz basiert auf der Einteilung der Prozessor-Architektur in funktionale Blöcke wie beispielsweise Processing-Unit, Clock-Netzwerk, interner Speicher und andere. Die Verlustleistungsaufnahme dieser Bl¨ocke wird parameterabhängig durch arithmetische Modellfunktionen beschrieben. Durch automatisierte Analyse von Assemblercodes des zu schätzenden Systems mittels eines Parsers können die Eingangsparameter wie beispielsweise der erzielte Parallelitätsgrad oder die Art des Speicherzugriffs gewonnen werden. Dieser Ansatz wird am Beispiel zweier moderner digitaler Signalprozessoren durch eine Vielzahl von Basis-Algorithmen der digitalen Signalverarbeitung evaluiert. Die ermittelten Schätzwerte für die einzelnen Algorithmen werden dabei mit physikalisch gemessenen Werten verglichen. Es ergibt sich ein sehr kleiner maximaler Schätzfehler von 3%. In this contribution different approaches for power estimation for programmable processors are presented and evaluated concerning their capability to be applied to modern digital signal processor architectures like e.g. Very Long InstructionWord (VLIW -architectures. Special emphasis will be laid on the concept of so-called Functional-Level Power Analysis (FLPA. This approach is based on the separation of the processor architecture into functional blocks like e.g. processing unit, clock network, internal memory and others. The power consumption of these blocks is described by parameter dependent arithmetic model functions. By application of a parser based automized analysis of assembler codes of the systems to be estimated
Power estimation on functional level for programmable processors

Science.gov (United States)

Schneider, M.; Blume, H.; Noll, T. G.

2004-05-01

In diesem Beitrag werden verschiedene Ansätze zur Verlustleistungsschätzung von programmierbaren Prozessoren vorgestellt und bezüglich ihrer Übertragbarkeit auf moderne Prozessor-Architekturen wie beispielsweise Very Long Instruction Word (VLIW)-Architekturen bewertet. Besonderes Augenmerk liegt hierbei auf dem Konzept der sogenannten Functional-Level Power Analysis (FLPA). Dieser Ansatz basiert auf der Einteilung der Prozessor-Architektur in funktionale Blöcke wie beispielsweise Processing-Unit, Clock-Netzwerk, interner Speicher und andere. Die Verlustleistungsaufnahme dieser Bl¨ocke wird parameterabhängig durch arithmetische Modellfunktionen beschrieben. Durch automatisierte Analyse von Assemblercodes des zu schätzenden Systems mittels eines Parsers können die Eingangsparameter wie beispielsweise der erzielte Parallelitätsgrad oder die Art des Speicherzugriffs gewonnen werden. Dieser Ansatz wird am Beispiel zweier moderner digitaler Signalprozessoren durch eine Vielzahl von Basis-Algorithmen der digitalen Signalverarbeitung evaluiert. Die ermittelten Schätzwerte für die einzelnen Algorithmen werden dabei mit physikalisch gemessenen Werten verglichen. Es ergibt sich ein sehr kleiner maximaler Schätzfehler von 3%. In this contribution different approaches for power estimation for programmable processors are presented and evaluated concerning their capability to be applied to modern digital signal processor architectures like e.g. Very Long InstructionWord (VLIW) -architectures. Special emphasis will be laid on the concept of so-called Functional-Level Power Analysis (FLPA). This approach is based on the separation of the processor architecture into functional blocks like e.g. processing unit, clock network, internal memory and others. The power consumption of these blocks is described by parameter dependent arithmetic model functions. By application of a parser based automized analysis of assembler codes of the systems to be estimated the input
CMOS VLSI Active-Pixel Sensor for Tracking

Science.gov (United States)

Pain, Bedabrata; Sun, Chao; Yang, Guang; Heynssens, Julie

2004-01-01

An architecture for a proposed active-pixel sensor (APS) and a design to implement the architecture in a complementary metal oxide semiconductor (CMOS) very-large-scale integrated (VLSI) circuit provide for some advanced features that are expected to be especially desirable for tracking pointlike features of stars. The architecture would also make this APS suitable for robotic- vision and general pointing and tracking applications. CMOS imagers in general are well suited for pointing and tracking because they can be configured for random access to selected pixels and to provide readout from windows of interest within their fields of view. However, until now, the architectures of CMOS imagers have not supported multiwindow operation or low-noise data collection. Moreover, smearing and motion artifacts in collected images have made prior CMOS imagers unsuitable for tracking applications. The proposed CMOS imager (see figure) would include an array of 1,024 by 1,024 pixels containing high-performance photodiode-based APS circuitry. The pixel pitch would be 9 m. The operations of the pixel circuits would be sequenced and otherwise controlled by an on-chip timing and control block, which would enable the collection of image data, during a single frame period, from either the full frame (that is, all 1,024 1,024 pixels) or from within as many as 8 different arbitrarily placed windows as large as 8 by 8 pixels each. A typical prior CMOS APS operates in a row-at-a-time ( grolling-shutter h) readout mode, which gives rise to exposure skew. In contrast, the proposed APS would operate in a sample-first/readlater mode, suppressing rolling-shutter effects. In this mode, the analog readout signals from the pixels corresponding to the windows of the interest (which windows, in the star-tracking application, would presumably contain guide stars) would be sampled rapidly by routing them through a programmable diagonal switch array to an on-chip parallel analog memory array. The
'Iconic' tracking algorithms for high energy physics using the TRAX-I massively parallel processor

International Nuclear Information System (INIS)

Vesztergombi, G.

1989-01-01

TRAX-I, a cost-effective parallel microcomputer, applying associative string processor (ASP) architecture with 16 K parallel processing elements, is being built by Aspex Microsystems Ltd. (UK). When applied to the tracking problem of very complex events with several hundred tracks, the large number of processors allows one to dedicate one or more processors to each wire (in MWPC), each pixel (in digitized images from streamer chambers or other visual detectors), or each pad (in TPC) to perform very efficient pattern recognition. Some linear tracking algorithms based on this ''ionic'' representation are presented. (orig.)
'Iconic' tracking algorithms for high energy physics using the TRAX-I massively parallel processor

International Nuclear Information System (INIS)

Vestergombi, G.

1989-11-01

TRAX-I, a cost-effective parallel microcomputer, applying Associative String Processor (ASP) architecture with 16 K parallel processing elements, is being built by Aspex Microsystems Ltd. (UK). When applied to the tracking problem of very complex events with several hundred tracks, the large number of processors allows one to dedicate one or more processors to each wire (in MWPC), each pixel (in digitized images from streamer chambers or other visual detectors), or each pad (in TPC) to perform very efficient pattern recognition. Some linear tracking algorithms based on this 'iconic' representation are presented. (orig.)
gFEX, the ATLAS Calorimeter Level-1 Real Time Processor

CERN Document Server

AUTHOR|(SzGeCERN)759889; The ATLAS collaboration; Begel, Michael; Chen, Hucheng; Lanni, Francesco; Takai, Helio; Wu, Weihao

2016-01-01

The global feature extractor (gFEX) is a component of the Level-1 Calorimeter trigger Phase-I upgrade for the ATLAS experiment. It is intended to identify patterns of energy associated with the hadronic decays of high momentum Higgs, W, & Z bosons, top quarks, and exotic particles in real time at the LHC crossing rate. The single processor board will be packaged in an Advanced Telecommunications Computing Architecture (ATCA) module and implemented as a fast reconfigurable processor based on three Xilinx Vertex Ultra-scale FPGAs. The board will receive coarse-granularity information from all the ATLAS calorimeters on 276 optical fibers with the data transferred at the 40 MHz Large Hadron Collider (LHC) clock frequency. The gFEX will be controlled by a single system-on-chip processor, ZYNQ, that will be used to configure all the processor Field-Programmable Gate Array (FPGAs), monitor board health, and interface to external signals. Now, the pre-prototype board which includes one ZYNQ and one Vertex-7 FPGA ...
gFEX, the ATLAS Calorimeter Level 1 Real Time Processor

CERN Document Server

Tang, Shaochun; The ATLAS collaboration

2015-01-01

The global feature extractor (gFEX) is a component of the Level-1Calorimeter trigger Phase-I upgrade for the ATLAS experiment. It is intended to identify patterns of energy associated with the hadronic decays of high momentum Higgs, W, & Z bosons, top quarks, and exotic particles in real time at the LHC crossing rate. The single processor board will be packaged in an Advanced Telecommunications Computing Architecture (ATCA) module and implemented as a fast reconfigurable processor based on three Xilinx Ultra-scale FPGAs. The board will receive coarse-granularity information from all the ATLAS calorimeters on 264 optical fibers with the data transferred at the 40 MHz LHC clock frequency. The gFEX will be controlled by a single system-on-chip processor, ZYNQ, that will be used to configure all the processor FPGAs, monitor board health, and interface to external signals. Now, the pre-prototype board which includes one ZYNQ and one Vertex-7 FPGA has been designed for testing and verification. The performance ...
The hardware track finder processor in CMS at CERN

CERN Document Server

Kluge, A

1997-01-01

The work covers the design of the Track Finder Processor in the high energy experiment CMS (Compact Muon Solenoid, planned for 2005) at CERN/Geneva. The task of this processor is to identify muons and measure their transverse momentum. The track finder processor makes it possible to determine the physical relevance of each high energetic collision and to forward only interesting data to the data an alysis units. Data of more than two hundred thousand detector cells are used to determine the location of muons and measure their transverse momentum. Each 25 ns a new data set is generated. Measurem ent of location and transverse momentum of the muons can be terminated within 350 ns by using an ASIC (Application Specific Integrated Circuit). A pipeline architecture processes new data sets with th e required data rate of 40 MHz to ensure dead time free operation. In the framework of this study specifications and the overall concept of the track finder processor were worked out in detail. Simul ations were performed...
3D-Flow processor for a programmable Level-1 trigger (feasibility study)

International Nuclear Information System (INIS)

Crosetto, D.

1992-10-01

A feasibility study has been made to use the 3D-Flow processor in a pipelined programmable parallel processing architecture to identify particles such as electrons, jets, muons, etc., in high-energy physics experiments
Initial explorations of ARM processors for scientific computing

International Nuclear Information System (INIS)

Abdurachmanov, David; Elmer, Peter; Eulisse, Giulio; Muzaffar, Shahzad

2014-01-01

Power efficiency is becoming an ever more important metric for both high performance and high throughput computing. Over the course of next decade it is expected that flops/watt will be a major driver for the evolution of computer architecture. Servers with large numbers of ARM processors, already ubiquitous in mobile computing, are a promising alternative to traditional x86-64 computing. We present the results of our initial investigations into the use of ARM processors for scientific computing applications. In particular we report the results from our work with a current generation ARMv7 development board to explore ARM-specific issues regarding the software development environment, operating system, performance benchmarks and issues for porting High Energy Physics software
Efficient Sorting on the Tilera Manycore Architecture

Energy Technology Data Exchange (ETDEWEB)

Morari, Alessandro; Tumeo, Antonino; Villa, Oreste; Secchi, Simone; Valero, Mateo

2012-10-24

e present an efficient implementation of the radix sort algo- rithm for the Tilera TILEPro64 processor. The TILEPro64 is one of the first successful commercial manycore processors. It is com- posed of 64 tiles interconnected through multiple fast Networks- on-chip and features a fully coherent, shared distributed cache. The architecture has a large degree of flexibility, and allows various optimization strategies. We describe how we mapped the algorithm to this architecture. We present an in-depth analysis of the optimizations for each phase of the algorithm with respect to the processor’s sustained performance. We discuss the overall throughput reached by our radix sort implementation (up to 132 MK/s) and show that it provides comparable or better performance-per-watt with respect to state-of-the art implemen- tations on x86 processors and graphic processing units.
A discussion of tools and techniques for distributed processor based control systems using CAMAC

International Nuclear Information System (INIS)

Tippie, J.W.; Scandora, A.E.

1985-01-01

This paper describes and analyzes various distributed processor architectures using commercially available CAMAC components. The general orientation is toward distributed control systems using Digital Equipment Corporation LSI11 processors in a CAMAC environment. The paper describes in detail software tools available to simplify the development of applications software and to provide a high-level runtime environment both at the host and the remote processors. Discussion focuses on techniques for downloading of operating systems from a large host and applications tasks written in high-level languages. It also discusses software tools which enable tasks in the remote processors to exchange messages and data with tasks in the host in a simple and elegant way

Reconfigurable signal processor designs for advanced digital array radar systems

Science.gov (United States)

Suarez, Hernan; Zhang, Yan (Rockee); Yu, Xining

2017-05-01

The new challenges originated from Digital Array Radar (DAR) demands a new generation of reconfigurable backend processor in the system. The new FPGA devices can support much higher speed, more bandwidth and processing capabilities for the need of digital Line Replaceable Unit (LRU). This study focuses on using the latest Altera and Xilinx devices in an adaptive beamforming processor. The field reprogrammable RF devices from Analog Devices are used as analog front end transceivers. Different from other existing Software-Defined Radio transceivers on the market, this processor is designed for distributed adaptive beamforming in a networked environment. The following aspects of the novel radar processor will be presented: (1) A new system-on-chip architecture based on Altera's devices and adaptive processing module, especially for the adaptive beamforming and pulse compression, will be introduced, (2) Successful implementation of generation 2 serial RapidIO data links on FPGA, which supports VITA-49 radio packet format for large distributed DAR processing. (3) Demonstration of the feasibility and capabilities of the processor in a Micro-TCA based, SRIO switching backplane to support multichannel beamforming in real-time. (4) Application of this processor in ongoing radar system development projects, including OU's dual-polarized digital array radar, the planned new cylindrical array radars, and future airborne radars.
Study and simulation of a parallel numerical processing machine

International Nuclear Information System (INIS)

Bel Hadj, Slaheddine

1981-12-01

This study has been carried out in the perspective of the implementation on a minicomputer of the NEPTUNIX package (software for the resolution of very large algebra-differential equation systems). Aiming at increasing the system performance, a previous research work has shown the necessity of reducing the execution time of certain numerical computation tasks, which are of frequent use. It has also demonstrated the feasibility of handling these tasks with efficient algorithms of parallel type. The present work deals with the study and simulation of a parallel architecture processor adapted to the fast execution of these algorithms. A minicomputer fitted with a connection to such a parallel processor, has a greatly extended computing power. Then the architecture of a parallel numerical processor, based on the use of VLSI microprocessors and co-processors, is described. Its design aims at the best cost / performance ratio. The last part deals with the simulation processor with the 'CHAMBOR' program. Results show an increasing factor of 30 in speed, in comparison with the execution on a MITRA 15 minicomputer. Moreover the conflicts importance, mainly at the level of access to a shared resource is evaluated. Although this implementation has been designed having in mind a dedicated application, other uses could be envisaged, particularly for the simulation of nuclear reactors: operator guiding system, the behavioural study under accidental circumstances, etc. (author) [fr
Design of an ultra-low-power digital processor for passive UHF RFID tags

Energy Technology Data Exchange (ETDEWEB)

Shi Wanggen; Zhuang Yiqi; Li Xiaoming; Wang Xianghua; Jin Zhao; Wang Dan, E-mail: wanggen_shi@163.co [Key Laboratory of the Ministry of Education for Wide Band-Gap Semiconductor Materials and Devices, Institute of Microelectronics, Xidian University, Xi' an 710071 (China)

2009-04-15

A new architecture of digital processors for passive UHF radio-frequency identification tags is proposed. This architecture is based on ISO/IEC 18000-6C and targeted at ultra-low power consumption. By applying methods like system-level power management, global clock gating and low voltage implementation, the total power of the design is reduced to a few microwatts. In addition, an innovative way for the design of a true RNG is presented, which contributes to both low power and secure data transaction. The digital processor is verified by an integrated FPGA platform and implemented by the Synopsys design kit for ASIC flows. The design fits different CMOS technologies and has been taped out using the 2P4M 0.35 mum process of Chartered Semiconductor.
Parallel Processor for 3D Recovery from Optical Flow

Directory of Open Access Journals (Sweden)

Jose Hugo Barron-Zambrano

2009-01-01

Full Text Available 3D recovery from motion has received a major effort in computer vision systems in the recent years. The main problem lies in the number of operations and memory accesses to be performed by the majority of the existing techniques when translated to hardware or software implementations. This paper proposes a parallel processor for 3D recovery from optical flow. Its main feature is the maximum reuse of data and the low number of clock cycles to calculate the optical flow, along with the precision with which 3D recovery is achieved. The results of the proposed architecture as well as those from processor synthesis are presented.
Adaptive WTA with an analog VLSI neuromorphic learning chip.

Science.gov (United States)

Häfliger, Philipp

2007-03-01

In this paper, we demonstrate how a particular spike-based learning rule (where exact temporal relations between input and output spikes of a spiking model neuron determine the changes of the synaptic weights) can be tuned to express rate-based classical Hebbian learning behavior (where the average input and output spike rates are sufficient to describe the synaptic changes). This shift in behavior is controlled by the input statistic and by a single time constant. The learning rule has been implemented in a neuromorphic very large scale integration (VLSI) chip as part of a neurally inspired spike signal image processing system. The latter is the result of the European Union research project Convolution AER Vision Architecture for Real-Time (CAVIAR). Since it is implemented as a spike-based learning rule (which is most convenient in the overall spike-based system), even if it is tuned to show rate behavior, no explicit long-term average signals are computed on the chip. We show the rule's rate-based Hebbian learning ability in a classification task in both simulation and chip experiment, first with artificial stimuli and then with sensor input from the CAVIAR system.
The Square Kilometre Array Science Data Processor. Preliminary compute platform design

International Nuclear Information System (INIS)

Broekema, P.C.; Nieuwpoort, R.V. van; Bal, H.E.

2015-01-01

The Square Kilometre Array is a next-generation radio-telescope, to be built in South Africa and Western Australia. It is currently in its detailed design phase, with procurement and construction scheduled to start in 2017. The SKA Science Data Processor is the high-performance computing element of the instrument, responsible for producing science-ready data. This is a major IT project, with the Science Data Processor expected to challenge the computing state-of-the art even in 2020. In this paper we introduce the preliminary Science Data Processor design and the principles that guide the design process, as well as the constraints to the design. We introduce a highly scalable and flexible system architecture capable of handling the SDP workload
Computer architecture fundamentals and principles of computer design

CERN Document Server

Dumas II, Joseph D

2005-01-01

Introduction to Computer ArchitectureWhat is Computer Architecture?Architecture vs. ImplementationBrief History of Computer SystemsThe First GenerationThe Second GenerationThe Third GenerationThe Fourth GenerationModern Computers - The Fifth GenerationTypes of Computer SystemsSingle Processor SystemsParallel Processing SystemsSpecial ArchitecturesQuality of Computer SystemsGenerality and ApplicabilityEase of UseExpandabilityCompatibilityReliabilitySuccess and Failure of Computer Architectures and ImplementationsQuality and the Perception of QualityCost IssuesArchitectural Openness, Market Timi
Trace-based post-silicon validation for VLSI circuits

CERN Document Server

Liu, Xiao

2014-01-01

This book first provides a comprehensive coverage of state-of-the-art validation solutions based on real-time signal tracing to guarantee the correctness of VLSI circuits. The authors discuss several key challenges in post-silicon validation and provide automated solutions that are systematic and cost-effective. A series of automatic tracing solutions and innovative design for debug (DfD) techniques are described, including techniques for trace signal selection for enhancing visibility of functional errors, a multiplexed signal tracing strategy for improving functional error detection, a tracing solution for debugging electrical errors, an interconnection fabric for increasing data bandwidth and supporting multi-core debug, an interconnection fabric design and optimization technique to increase transfer flexibility and a DfD design and associated tracing solution for improving debug efficiency and expanding tracing window. The solutions presented in this book improve the validation quality of VLSI circuit...
Quantum chemistry on a superconducting quantum processor

Energy Technology Data Exchange (ETDEWEB)

Kaicher, Michael P.; Wilhelm, Frank K. [Theoretical Physics, Saarland University, 66123 Saarbruecken (Germany); Love, Peter J. [Department of Physics and Astronomy, Tufts University, Medford, MA 02155 (United States)

2016-07-01

Quantum chemistry is the most promising civilian application for quantum processors to date. We study its adaptation to superconducting (sc) quantum systems, computing the ground state energy of LiH through a variational hybrid quantum classical algorithm. We demonstrate how interactions native to sc qubits further reduce the amount of quantum resources needed, pushing sc architectures as a near-term candidate for simulations of more complex atoms/molecules.
Interference control by best-effort process duty-cycling in chip multi-processor systems for real-time medical image processing

NARCIS (Netherlands)

Westmijze, M.; Bekooij, Marco Jan Gerrit; Smit, Gerardus Johannes Maria

2013-01-01

Systems with chip multi-processors are currently used for several applications that have real-time requirements. In chip multi-processor architectures, many hardware resources such as parts of the cache hierarchy are shared between cores and by using such resources, applications can significantly
Evaluation of existing and proposed computer architectures for future ground-based systems

Science.gov (United States)

Schulbach, C.

1985-01-01

Parallel processing architectures and techniques used in current supercomputers are described and projections are made of future advances. Presently, the von Neumann sequential processing pattern has been accelerated by having separate I/O processors, interleaved memories, wide memories, independent functional units and pipelining. Recent supercomputers have featured single-input, multiple data stream architectures, which have different processors for performing various operations (vector or pipeline processors). Multiple input, multiple data stream machines have also been developed. Data flow techniques, wherein program instructions are activated only when data are available, are expected to play a large role in future supercomputers, along with increased parallel processor arrays. The enhanced operational speeds are essential for adequately treating data from future spacecraft remote sensing instruments such as the Thematic Mapper.
Computing on Knights and Kepler Architectures

International Nuclear Information System (INIS)

Bortolotti, G; Caberletti, M; Ferraro, A; Giacomini, F; Manzali, M; Maron, G; Salomoni, D; Crimi, G; Zanella, M

2014-01-01

A recent trend in scientific computing is the increasingly important role of co-processors, originally built to accelerate graphics rendering, and now used for general high-performance computing. The INFN Computing On Knights and Kepler Architectures (COKA) project focuses on assessing the suitability of co-processor boards for scientific computing in a wide range of physics applications, and on studying the best programming methodologies for these systems. Here we present in a comparative way our results in porting a Lattice Boltzmann code on two state-of-the-art accelerators: the NVIDIA K20X, and the Intel Xeon-Phi. We describe our implementations, analyze results and compare with a baseline architecture adopting Intel Sandy Bridge CPUs.
A High-Speed and Low-Energy-Consumption Processor for SVD-MIMO-OFDM Systems

Directory of Open Access Journals (Sweden)

Hiroki Iwaizumi

2013-01-01

Full Text Available A processor design for singular value decomposition (SVD and compression/decompression of feedback matrices, which are mandatory operations for SVD multiple-input multiple-output orthogonal frequency-division multiplexing (MIMO-OFDM systems, is proposed and evaluated. SVD-MIMO is a transmission method for suppressing multistream interference and improving communication quality by beamforming. An application specific instruction-set processor (ASIP architecture is adopted to achieve flexibility in terms of operations and matrix size. The proposed processor realizes a high-speed/low-power design and real-time processing by the parallelization of floating-point units (FPUs and arithmetic instructions specialized in complex matrix operations.
A Real-Time Sound Field Rendering Processor

Directory of Open Access Journals (Sweden)

Tan Yiyu

2017-12-01

Full Text Available Real-time sound field renderings are computationally intensive and memory-intensive. Traditional rendering systems based on computer simulations suffer from memory bandwidth and arithmetic units. The computation is time-consuming, and the sample rate of the output sound is low because of the long computation time at each time step. In this work, a processor with a hybrid architecture is proposed to speed up computation and improve the sample rate of the output sound, and an interface is developed for system scalability through simply cascading many chips to enlarge the simulated area. To render a three-minute Beethoven wave sound in a small shoe-box room with dimensions of 1.28 m × 1.28 m × 0.64 m, the field programming gate array (FPGA-based prototype machine with the proposed architecture carries out the sound rendering at run-time while the software simulation with the OpenMP parallelization takes about 12.70 min on a personal computer (PC with 32 GB random access memory (RAM and an Intel i7-6800K six-core processor running at 3.4 GHz. The throughput in the software simulation is about 194 M grids/s while it is 51.2 G grids/s in the prototype machine even if the clock frequency of the prototype machine is much lower than that of the PC. The rendering processor with a processing element (PE and interfaces consumes about 238,515 gates after fabricated by the 0.18 µm processing technology from the ROHM semiconductor Co., Ltd. (Kyoto Japan, and the power consumption is about 143.8 mW.
Analysis of the computational requirements of a pulse-doppler radar signal processor

CSIR Research Space (South Africa)

Broich, R

2012-05-01

Full Text Available In an attempt to find an optimal processing architecture for radar signal processing applications, the different algorithms that are typically used in a pulse-Doppler radar signal processor are investigated. Radar algorithms are broken down...
Efficient Programming for Multicore Processor Heterogeneity: OpenMP versus OmpSs

OpenAIRE

Butko , Anastasiia; Bruguier , Florent; Gamatié , Abdoulaye; Sassatelli , Gilles

2017-01-01

International audience; ARM single-ISA heterogeneous multicore processors combine high-performance big cores with power-efficient small cores. They aim at achieving a suitable balance between performance and energy. How- ever, a main challenge is to program such architectures so as to efficiently exploit their features. In this paper, we study the impact on performance and energy trade-offs of single-ISA architecture according to OpenMP 3.0 and the OmpSs programming models. We consider differ...
Applying the roofline performance model to the intel xeon phi knights landing processor

OpenAIRE

Doerfler, D; Deslippe, J; Williams, S; Oliker, L; Cook, B; Kurth, T; Lobet, M; Malas, T; Vay, JL; Vincenti, H

2016-01-01

ï¿½ Springer International Publishing AG 2016. The Roofline Performance Model is a visually intuitive method used to bound the sustained peak floating-point performance of any given arithmetic kernel on any given processor architecture. In the Roofline, performance is nominally measured in floating-point operations per second as a function of arithmetic intensity (operations per byte of data). In this study we determine the Roofline for the Intel Knights Landing (KNL) processor, determining t...
A Knowledge Based Approach to VLSI CAD

Science.gov (United States)

1983-09-01

Avail-and/or Dist ISpecial L| OI. SEICURITY CLASIIrCATION OP THIS IPA.lErllm S Daene." A KNOwLEDE BASED APPROACH TO VLSI CAD’ Louis L Steinberg and...major issues lies in building up and managing the knowledge base of oesign expertise. We expect that, as with many recent expert systems, in order to
Design of an ultra-low-power digital processor for passive UHF RFID tags

International Nuclear Information System (INIS)

Shi Wanggen; Zhuang Yiqi; Li Xiaoming; Wang Xianghua; Jin Zhao; Wang Dan

2009-01-01

A new architecture of digital processors for passive UHF radio-frequency identification tags is proposed. This architecture is based on ISO/IEC 18000-6C and targeted at ultra-low power consumption. By applying methods like system-level power management, global clock gating and low voltage implementation, the total power of the design is reduced to a few microwatts. In addition, an innovative way for the design of a true RNG is presented, which contributes to both low power and secure data transaction. The digital processor is verified by an integrated FPGA platform and implemented by the Synopsys design kit for ASIC flows. The design fits different CMOS technologies and has been taped out using the 2P4M 0.35 μm process of Chartered Semiconductor.
Utilizing a multiprocessor architecture - The performance of MIDAS

International Nuclear Information System (INIS)

Maples, C.; Logan, D.; Meng, J.; Rathbun, W.; Weaver, D.

1983-01-01

The MIDAS architecture organizes multiple CPUs into clusters called distributed subsystems. Each subsystem consists of an array of processors controlled by a supervisory CPU. The multiprocessor array is composed of commercial CPUs (with floating point hardware) and specialized processing elements. Interprocessor communication within the array may occur either through switched memory modules or common shared memory. The architecture permits multiple processors to be focused on single problems. A distributed subsystem has been constructed and tested. It currently consists of a supervisor CPU; 16 blocks of independently switchable memory; 9 general purpose, VAX-class CPUs; and 2 specialized pipelined processors to handle I/O. Results on a variety of problems indicate that the subsystem performs 8 to 15 times faster than a standard computer with an identical CPU. The difference in performance represents the effect of differing CPU and I/O requirements

Processor tradeoffs in distributed real-time systems

Science.gov (United States)

Krishna, C. M.; Shin, Kang G.; Bhandari, Inderpal S.

1987-01-01

The problem of the optimization of the design of real-time distributed systems is examined with reference to a class of computer architectures similar to the continuously reconfigurable multiprocessor flight control system structure, CM2FCS. Particular attention is given to the impact of processor replacement and the burn-in time on the probability of dynamic failure and mean cost. The solution is obtained numerically and interpreted in the context of real-time applications.
A neuromorphic VLSI device for implementing 2-D selective attention systems.

Science.gov (United States)

Indiveri, G

2001-01-01

Selective attention is a mechanism used to sequentially select and process salient subregions of the input space, while suppressing inputs arriving from nonsalient regions. By processing small amounts of sensory information in a serial fashion, rather than attempting to process all the sensory data in parallel, this mechanism overcomes the problem of flooding limited processing capacity systems with sensory inputs. It is found in many biological systems and can be a useful engineering tool for developing artificial systems that need to process in real-time sensory data. In this paper we present a neuromorphic hardware model of a selective attention mechanism implemented on a very large scale integration (VLSI) chip, using analog circuits. The chip makes use of a spike-based representation for receiving input signals, transmitting output signals and for shifting the selection of the attended input stimulus over time. It can be interfaced to neuromorphic sensors and actuators, for implementing multichip selective attention systems. We describe the characteristics of the circuits used in the architecture and present experimental data measured from the system.
VLSI scaling methods and low power CMOS buffer circuit

International Nuclear Information System (INIS)

Sharma Vijay Kumar; Pattanaik Manisha

2013-01-01

Device scaling is an important part of the very large scale integration (VLSI) design to boost up the success path of VLSI industry, which results in denser and faster integration of the devices. As technology node moves towards the very deep submicron region, leakage current and circuit reliability become the key issues. Both are increasing with the new technology generation and affecting the performance of the overall logic circuit. The VLSI designers must keep the balance in power dissipation and the circuit's performance with scaling of the devices. In this paper, different scaling methods are studied first. These scaling methods are used to identify the effects of those scaling methods on the power dissipation and propagation delay of the CMOS buffer circuit. For mitigating the power dissipation in scaled devices, we have proposed a reliable leakage reduction low power transmission gate (LPTG) approach and tested it on complementary metal oxide semiconductor (CMOS) buffer circuit. All simulation results are taken on HSPICE tool with Berkeley predictive technology model (BPTM) BSIM4 bulk CMOS files. The LPTG CMOS buffer reduces 95.16% power dissipation with 84.20% improvement in figure of merit at 32 nm technology node. Various process, voltage and temperature variations are analyzed for proving the robustness of the proposed approach. Leakage current uncertainty decreases from 0.91 to 0.43 in the CMOS buffer circuit that causes large circuit reliability. (semiconductor integrated circuits)
Towards a Systematic Exploration of the Optimization Space for Many-Core Processors

NARCIS (Netherlands)

Fang, J.

2014-01-01

The architecture diversity of many-core processors - with their different types of cores, and memory hierarchies - makes the old model of reprogramming every application for every platform infeasible. Therefore, inter-platform portability has become a desirable feature of programming models. While
A Time-Composable Operating System for the Patmos Processor

DEFF Research Database (Denmark)

Ziccardi, Marco; Schoeberl, Martin; Vardanega, Tullio

2015-01-01

-composable operating system, on top of a time-composable processor, facilitates incremental development, which is highly desirable for industry. This paper makes a twofold contribution. First, we present enhancements to the Patmos processor to allow achieving time composability at the operating system level. Second......, we extend an existing time-composable operating system, TiCOS, to make best use of advanced Patmos hardware features in the pursuit of time composability.......In the last couple of decades we have witnessed a steady growth in the complexity and widespread of real-time systems. In order to master the rising complexity in the timing behaviour of those systems, rightful attention has been given to the development of time-predictable computer architectures...
Advanced symbolic analysis for VLSI systems methods and applications

CERN Document Server

Shi, Guoyong; Tlelo Cuautle, Esteban

2014-01-01

This book provides comprehensive coverage of the recent advances in symbolic analysis techniques for design automation of nanometer VLSI systems. The presentation is organized in parts of fundamentals, basic implementation methods and applications for VLSI design. Topics emphasized include statistical timing and crosstalk analysis, statistical and parallel analysis, performance bound analysis and behavioral modeling for analog integrated circuits . Among the recent advances, the Binary Decision Diagram (BDD) based approaches are studied in depth. The BDD-based hierarchical symbolic analysis approaches, have essentially broken the analog circuit size barrier. In particular, this book • Provides an overview of classical symbolic analysis methods and a comprehensive presentation on the modern BDD-based symbolic analysis techniques; • Describes detailed implementation strategies for BDD-based algorithms, including the principles of zero-suppression, variable ordering and canonical reduction; • Int...
Efficient Backprojection-Based Synthetic Aperture Radar Computation with Many-Core Processors

Directory of Open Access Journals (Sweden)

Jongsoo Park

2013-01-01

Full Text Available Tackling computationally challenging problems with high efficiency often requires the combination of algorithmic innovation, advanced architecture, and thorough exploitation of parallelism. We demonstrate this synergy through synthetic aperture radar (SAR via backprojection, an image reconstruction method that can require hundreds of TFLOPS. Computation cost is significantly reduced by our new algorithm of approximate strength reduction; data movement cost is economized by software locality optimizations facilitated by advanced architecture support; parallelism is fully harnessed in various patterns and granularities. We deliver over 35 billion backprojections per second throughput per compute node on an Intel® Xeon® processor E5-2670-based cluster, equipped with Intel® Xeon Phi™ coprocessors. This corresponds to processing a 3K×3K image within a second using a single node. Our study can be extended to other settings: backprojection is applicable elsewhere including medical imaging, approximate strength reduction is a general code transformation technique, and many-core processors are emerging as a solution to energy-efficient computing.
UW VLSI chip tester

Science.gov (United States)

McKenzie, Neil

1989-12-01

We present a design for a low-cost, functional VLSI chip tester. It is based on the Apple MacIntosh II personal computer. It tests chips that have up to 128 pins. All pin drivers of the tester are bidirectional; each pin is programmed independently as an input or an output. The tester can test both static and dynamic chips. Rudimentary speed testing is provided. Chips are tested by executing C programs written by the user. A software library is provided for program development. Tests run under both the Mac Operating System and A/UX. The design is implemented using Xilinx Logic Cell Arrays. Price/performance tradeoffs are discussed.
Using of new possibilities of Fermi architecture by development og GPGPU programs

International Nuclear Information System (INIS)

Dudnik, V.A.; Kudryavtsev, V.I.; Us, S.A.; Shestakov, M.V.

2013-01-01

Description of additional functions of hardware and software, which are presented in the structure of new architecture of FERMI graphic processors made by company NVIDIA, was given. Recommendations of their use within the realization of algorithms of scientific and technical calculations by means of the graphic processors were given. Application of the new possibilities of FERMI architecture and CUDA technologies (Compute Unified Device Architecture - unified hardware-software decision for parallel calculations on GPU) of NVIDIA Company was described. It was done for time reduction of applications' development which is using possibilities of GPGPU for acceleration of data processing
ATLAS FTK: Fast Track Trigger

CERN Document Server

Volpi, Guido; The ATLAS collaboration

2015-01-01

An overview of the ATLAS Fast Tracker processor is presented, reporting the design of the system, its expected performance, and the integration status. The next LHC runs, with a significant increase in instantaneous luminosity, will provide a big challenge to the trigger and data acquisition systems of all the experiments. An intensive use of the tracking information at the trigger level will be important to keep high efficiency in interesting events, despite the increase in multiple p-p collisions per bunch crossing (pile-up). In order to increase the use of tracks within the High Level Trigger (HLT), the ATLAS experiment planned the installation of an hardware processor dedicated to tracking: the Fast TracKer (FTK) processor. The FTK is designed to perform full scan track reconstruction at every Level-1 accept. To achieve this goal, the FTK uses a fully parallel architecture, with algorithms designed to exploit the computing power of custom VLSI chips, the Associative Memory, as well as modern FPGAs. The FT...
Applications of the scalable coherent interface to data acquisition at LHC

CERN Document Server

Bogaerts, A; Divià, R; Müller, H; Parkman, C; Ponting, P J; Skaali, B; Midttun, G; Wormald, D; Wikne, J; Falciano, S; Cesaroni, F; Vinogradov, V I; Kristiansen, E H; Solberg, B; Guglielmi, A M; Worm, F H; Bovier, J; Davis, C; CERN. Geneva. Detector Research and Development Committee

1991-01-01

We propose to use the Scalable Coherent Interface (SCI) as a very high speed interconnect between LHC detector data buffers and farms of commercial trigger processors. Both the global second and third level trigger can be based on SCI as a reconfigurable and scalable system. SCI is a proposed IEEE standard which uses fast point-to-point links to provide computer-bus like services. It can connect a maximum of 65 536 nodes (memories or processors), providing data transfer rates of up to 1 Gbyte/s. Scalable data acquisition systems can be built using either simple SCI rings or complex switches. The interconnections may be flat cables, coaxial cables, or optical fibres. SCI protocols have been entirely implemented in VLSI, resulting in a significant simplification of data acquisition software. Novel SCI features allow efficient implementation of both data and processor driven readout architectures. In particular, a very efficient implementation of the third level trigger can be achieved by combining SCI's shared ...
Digital VLSI design with Verilog a textbook from Silicon Valley Technical Institute

CERN Document Server

Williams, John

2008-01-01

This unique textbook is structured as a step-by-step course of study along the lines of a VLSI IC design project. In a nominal schedule of 12 weeks, two days and about 10 hours per week, the entire verilog language is presented, from the basics to everything necessary for synthesis of an entire 70,000 transistor, full-duplex serializer - deserializer, including synthesizable PLLs. Digital VLSI Design With Verilog is all an engineer needs for in-depth understanding of the verilog language: Syntax, synthesis semantics, simulation, and test. Complete solutions for the 27 labs are provided on the
System Level Design of Reconfigurable Server Farms Using Elliptic Curve Cryptography Processor Engines

Directory of Open Access Journals (Sweden)

Sangook Moon

2014-01-01

Full Text Available As today’s hardware architecture becomes more and more complicated, it is getting harder to modify or improve the microarchitecture of a design in register transfer level (RTL. Consequently, traditional methods we have used to develop a design are not capable of coping with complex designs. In this paper, we suggest a way of designing complex digital logic circuits with a soft and advanced type of SystemVerilog at an electronic system level. We apply the concept of design-and-reuse with a high level of abstraction to implement elliptic curve crypto-processor server farms. With the concept of the superior level of abstraction to the RTL used with the traditional HDL design, we successfully achieved the soft implementation of the crypto-processor server farms as well as robust test bench code with trivial effort in the same simulation environment. Otherwise, it could have required error-prone Verilog simulations for the hardware IPs and other time-consuming jobs such as C/SystemC verification for the software, sacrificing more time and effort. In the design of the elliptic curve cryptography processor engine, we propose a 3X faster GF(2m serial multiplication architecture.
Preliminary design of an advanced programmable digital filter network for large passive acoustic ASW systems. [Parallel processor

Energy Technology Data Exchange (ETDEWEB)

McWilliams, T.; Widdoes, Jr., L. C.; Wood, L.

1976-09-30

The design of an extremely high performance programmable digital filter of novel architecture, the LLL Programmable Digital Filter, is described. The digital filter is a high-performance multiprocessor having general purpose applicability and high programmability; it is extremely cost effective either in a uniprocessor or a multiprocessor configuration. The architecture and instruction set of the individual processor was optimized with regard to the multiple processor configuration. The optimal structure of a parallel processing system was determined for addressing the specific Navy application centering on the advanced digital filtering of passive acoustic ASW data of the type obtained from the SOSUS net. 148 figures. (RWR)
Real-time tracking with a 3D-Flow processor array

International Nuclear Information System (INIS)

Crosetto, D.

1993-06-01

The problem of real-time track-finding has been performed to date with CAM (Content Addressable Memories) or with fast coincidence logic, because the processing scheme was thought to have much slower performance. Advances in technology together with a new architectural approach make it feasible to also explore the computing technique for real-time track finding thus giving the advantages of implementing algorithms that can find more parameters such as calculate the sagitta, curvature, pt, etc., with respect to the CAM approach. The report describes real-time track finding using new computing approach technique based on the 3D-Flow array processor system. This system consists of a fixed interconnection architecture scheme, allowing flexible algorithm implementation on a scalable platform. The 3D-Flow parallel processing system for track finding is scalable in size and performance by either increasing the number of processors, or increasing the speed or else the number of pipelined stages. The present article describes the conceptual idea and the design stage of the project
Digital design and computer architecture

CERN Document Server

Harris, David

2010-01-01

Digital Design and Computer Architecture is designed for courses that combine digital logic design with computer organization/architecture or that teach these subjects as a two-course sequence. Digital Design and Computer Architecture begins with a modern approach by rigorously covering the fundamentals of digital logic design and then introducing Hardware Description Languages (HDLs). Featuring examples of the two most widely-used HDLs, VHDL and Verilog, the first half of the text prepares the reader for what follows in the second: the design of a MIPS Processor. By the end of D
Development of an integrated circuit VLSI used for time measurement and selective read out in the front end electronics of the DIRC for the Babar experience at SLAC; Developpement d'un circuit integre VLSI assurant mesure de temps et lecture selective dans l'electronique frontale du compteur DIRC de l'experience babar a slac

Energy Technology Data Exchange (ETDEWEB)

Zhang, B

1999-07-01

This thesis deals with the design the development and the tests of an integrated circuit VLSI, supplying selective read and time measure for 16 channels. This circuit has been developed for a experiment of particles physics, BABAR, that will take place at SLAC (Stanford Linear Accelerator Center). A first part describes the physical stakes of the experiment, the electronic architecture and the place of the developed circuit in the research program. The second part presents the technical drawings of the circuit, the prototypes leading to the final design and the validity tests. (A.L.B.)
Numerical analysis of electromigration in thin film VLSI interconnections

NARCIS (Netherlands)

Petrescu, V.; Mouthaan, A.J.; Schoenmaker, W.; Angelescu, S.; Vissarion, R.; Dima, G.; Wallinga, Hans; Profirescu, M.D.

1995-01-01

Due to the continuing downscaling of the dimensions in VLSI circuits, electromigration is becoming a serious reliability hazard. A software tool based on finite element analysis has been developed to solve the two partial differential equations of the two particle vacancy/imperfection model.
Heavy ion tests on programmable VLSI

International Nuclear Information System (INIS)

Provost-Grellier, A.

1989-11-01

The radiation from space environment induces operation damages in onboard computers systems. The definition of a strategy, for the Very Large Scale Integrated Circuitry (VLSI) qualification and choice, is needed. The 'upset' phenomena is known to be the most critical integrated circuit radiation effect. The strategies for testing integrated circuits are reviewed. A method and a test device were developed and applied to space applications candidate circuits. Cyclotron, synchrotron and Californium source experiments were carried out [fr
Comparison of Three Smart Camera Architectures for Real-Time Machine Vision System

Directory of Open Access Journals (Sweden)

Abdul Waheed Malik

2013-12-01

Full Text Available This paper presents a machine vision system for real-time computation of distance and angle of a camera from a set of reference points located on a target board. Three different smart camera architectures were explored to compare performance parameters such as power consumption, frame speed and latency. Architecture 1 consists of hardware machine vision modules modeled at Register Transfer (RT level and a soft-core processor on a single FPGA chip. Architecture 2 is commercially available software based smart camera, Matrox Iris GT. Architecture 3 is a two-chip solution composed of hardware machine vision modules on FPGA and an external microcontroller. Results from a performance comparison show that Architecture 2 has higher latency and consumes much more power than Architecture 1 and 3. However, Architecture 2 benefits from an easy programming model. Smart camera system with FPGA and external microcontroller has lower latency and consumes less power as compared to single FPGA chip having hardware modules and soft-core processor.

The GLUEchip: A custom VLSI chip for detectors readout and associative memories circuits

International Nuclear Information System (INIS)

Amendolia, S.R.; Galeotti, S.; Morsani, F.; Passuello, D.; Ristori, L.; Turini, N.

1993-01-01

An associative memory full-custom VLSI chip for pattern recognition has been designed and tested in the past years. It's the AMchip, that contains 128 patterns of 60 bits each. To expand the pattern capacity of an Associative Memory bank, the custom VLSI GLUEchip has been developed. The GLUEchip allows the interconnection of up to 16 AMchips or up to 16 GLUEchips: the resulting tree-like structure works like a single AMchip with an output pipelined structure and a pattern capacity increased by a factor 16 for each GLUEchip used
In-Network Adaptation of Video Streams Using Network Processors

Directory of Open Access Journals (Sweden)

Mohammad Shorfuzzaman

2009-01-01

problem can be addressed, near the network edge, by applying dynamic, in-network adaptation (e.g., transcoding of video streams to meet available connection bandwidth, machine characteristics, and client preferences. In this paper, we extrapolate from earlier work of Shorfuzzaman et al. 2006 in which we implemented and assessed an MPEG-1 transcoding system on the Intel IXP1200 network processor to consider the feasibility of in-network transcoding for other video formats and network processor architectures. The use of “on-the-fly” video adaptation near the edge of the network offers the promise of simpler support for a wide range of end devices with different display, and so forth, characteristics that can be used in different types of environments.
Parallel processor programs in the Federal Government

Science.gov (United States)

Schneck, P. B.; Austin, D.; Squires, S. L.; Lehmann, J.; Mizell, D.; Wallgren, K.

1985-01-01

In 1982, a report dealing with the nation's research needs in high-speed computing called for increased access to supercomputing resources for the research community, research in computational mathematics, and increased research in the technology base needed for the next generation of supercomputers. Since that time a number of programs addressing future generations of computers, particularly parallel processors, have been started by U.S. government agencies. The present paper provides a description of the largest government programs in parallel processing. Established in fiscal year 1985 by the Institute for Defense Analyses for the National Security Agency, the Supercomputing Research Center will pursue research to advance the state of the art in supercomputing. Attention is also given to the DOE applied mathematical sciences research program, the NYU Ultracomputer project, the DARPA multiprocessor system architectures program, NSF research on multiprocessor systems, ONR activities in parallel computing, and NASA parallel processor projects.
Microlens array processor with programmable weight mask and direct optical input

Science.gov (United States)

Schmid, Volker R.; Lueder, Ernst H.; Bader, Gerhard; Maier, Gert; Siegordner, Jochen

1999-03-01

We present an optical feature extraction system with a microlens array processor. The system is suitable for online implementation of a variety of transforms such as the Walsh transform and DCT. Operating with incoherent light, our processor accepts direct optical input. Employing a sandwich- like architecture, we obtain a very compact design of the optical system. The key elements of the microlens array processor are a square array of 15 X 15 spherical microlenses on acrylic substrate and a spatial light modulator as transmissive mask. The light distribution behind the mask is imaged onto the pixels of a customized a-Si image sensor with adjustable gain. We obtain one output sample for each microlens image and its corresponding weight mask area as summation of the transmitted intensity within one sensor pixel. The resulting architecture is very compact and robust like a conventional camera lens while incorporating a high degree of parallelism. We successfully demonstrate a Walsh transform into the spatial frequency domain as well as the implementation of a discrete cosine transform with digitized gray values. We provide results showing the transformation performance for both synthetic image patterns and images of natural texture samples. The extracted frequency features are suitable for neural classification of the input image. Other transforms and correlations can be implemented in real-time allowing adaptive optical signal processing.
Design of two easily-testable VLSI array multipliers

Energy Technology Data Exchange (ETDEWEB)

Ferguson, J.; Shen, J.P.

1983-01-01

Array multipliers are well-suited to VLSI implementation because of the regularity in their iterative structure. However, most VLSI circuits are very difficult to test. This paper shows that, with appropriate cell design, array multipliers can be designed to be very easily testable. An array multiplier is called c-testable if all its adder cells can be exhaustively tested while requiring only a constant number of test patterns. The testability of two well-known array multiplier structures are studied. The conventional design of the carry-save array multipler is shown to be not c-testable. However, a modified design, using a modified adder cell, is generated and shown to be c-testable and requires only 16 test patterns. Similar results are obtained for the baugh-wooley two's complement array multiplier. A modified design of the baugh-wooley array multiplier is shown to be c-testable and requires 55 test patterns. The implementation of a practical c-testable 16*16 array multiplier is also presented. 10 references.
Satellite on-board real-time SAR processor prototype

Science.gov (United States)

Bergeron, Alain; Doucet, Michel; Harnisch, Bernd; Suess, Martin; Marchese, Linda; Bourqui, Pascal; Desnoyers, Nicholas; Legros, Mathieu; Guillot, Ludovic; Mercier, Luc; Châteauneuf, François

2017-11-01

A Compact Real-Time Optronic SAR Processor has been successfully developed and tested up to a Technology Readiness Level of 4 (TRL4), the breadboard validation in a laboratory environment. SAR, or Synthetic Aperture Radar, is an active system allowing day and night imaging independent of the cloud coverage of the planet. The SAR raw data is a set of complex data for range and azimuth, which cannot be compressed. Specifically, for planetary missions and unmanned aerial vehicle (UAV) systems with limited communication data rates this is a clear disadvantage. SAR images are typically processed electronically applying dedicated Fourier transformations. This, however, can also be performed optically in real-time. Originally the first SAR images were optically processed. The optical Fourier processor architecture provides inherent parallel computing capabilities allowing real-time SAR data processing and thus the ability for compression and strongly reduced communication bandwidth requirements for the satellite. SAR signal return data are in general complex data. Both amplitude and phase must be combined optically in the SAR processor for each range and azimuth pixel. Amplitude and phase are generated by dedicated spatial light modulators and superimposed by an optical relay set-up. The spatial light modulators display the full complex raw data information over a two-dimensional format, one for the azimuth and one for the range. Since the entire signal history is displayed at once, the processor operates in parallel yielding real-time performances, i.e. without resulting bottleneck. Processing of both azimuth and range information is performed in a single pass. This paper focuses on the onboard capabilities of the compact optical SAR processor prototype that allows in-orbit processing of SAR images. Examples of processed ENVISAT ASAR images are presented. Various SAR processor parameters such as processing capabilities, image quality (point target analysis), weight and
Direct kinematics solution architectures for industrial robot manipulators: Bit-serial versus parallel

Science.gov (United States)

Lee, J.; Kim, K.

1991-01-01

A Very Large Scale Integration (VLSI) architecture for robot direct kinematic computation suitable for industrial robot manipulators was investigated. The Denavit-Hartenberg transformations are reviewed to exploit a proper processing element, namely an augmented CORDIC. Specifically, two distinct implementations are elaborated on, such as the bit-serial and parallel. Performance of each scheme is analyzed with respect to the time to compute one location of the end-effector of a 6-links manipulator, and the number of transistors required.
Direct kinematics solution architectures for industrial robot manipulators: Bit-serial versus parallel

Science.gov (United States)

Lee, J.; Kim, K.

A Very Large Scale Integration (VLSI) architecture for robot direct kinematic computation suitable for industrial robot manipulators was investigated. The Denavit-Hartenberg transformations are reviewed to exploit a proper processing element, namely an augmented CORDIC. Specifically, two distinct implementations are elaborated on, such as the bit-serial and parallel. Performance of each scheme is analyzed with respect to the time to compute one location of the end-effector of a 6-links manipulator, and the number of transistors required.
An FPGA design flow for reconfigurable network-based multi-processor systems on chip

NARCIS (Netherlands)

Kumar, A.; Hansson, M.A; Huisken, J.; Corporaal, H.

2007-01-01

Multi-processor systems on chip (MPSoC) platforms are becoming increasingly more heterogeneous and are shifting towards a more communication-centric methodology. Networks on chip (NoC) have emerged as the design paradigm for scalable on-chip communication architectures. As the system complexity
Demonstration of two-qubit algorithms with a superconducting quantum processor.

Science.gov (United States)

DiCarlo, L; Chow, J M; Gambetta, J M; Bishop, Lev S; Johnson, B R; Schuster, D I; Majer, J; Blais, A; Frunzio, L; Girvin, S M; Schoelkopf, R J

2009-07-09

Quantum computers, which harness the superposition and entanglement of physical states, could outperform their classical counterparts in solving problems with technological impact-such as factoring large numbers and searching databases. A quantum processor executes algorithms by applying a programmable sequence of gates to an initialized register of qubits, which coherently evolves into a final state containing the result of the computation. Building a quantum processor is challenging because of the need to meet simultaneously requirements that are in conflict: state preparation, long coherence times, universal gate operations and qubit readout. Processors based on a few qubits have been demonstrated using nuclear magnetic resonance, cold ion trap and optical systems, but a solid-state realization has remained an outstanding challenge. Here we demonstrate a two-qubit superconducting processor and the implementation of the Grover search and Deutsch-Jozsa quantum algorithms. We use a two-qubit interaction, tunable in strength by two orders of magnitude on nanosecond timescales, which is mediated by a cavity bus in a circuit quantum electrodynamics architecture. This interaction allows the generation of highly entangled states with concurrence up to 94 per cent. Although this processor constitutes an important step in quantum computing with integrated circuits, continuing efforts to increase qubit coherence times, gate performance and register size will be required to fulfil the promise of a scalable technology.
Initial validation of ATLAS software on the ARM architecture

Energy Technology Data Exchange (ETDEWEB)

Kawamura, Gen; Quadt, Arnulf; Smith, Joshua Wyatt [II. Physikalisches Institut, Georg-August Universitaet Goettingen (Germany); Seuster, Rolf [TRIUMF (Canada); Stewart, Graeme [University of Glasgow (United Kingdom)

2016-07-01

In the early 2000's the introduction of the multi-core era of computing helped industry and experiments such as ATLAS realize even more computing power. This was necessary as the limits of what a single-core processor could do where quickly being reached. Our current model of computing is to increase the number of multi-core nodes in a server farm in order to handle the increased influx of data. As power costs and our need for more computing power increase, this model will eventually become non-realistic. Once again a paradigm shift has to take place. One such option is to look at alternative architectures for large scale server farms. ARM processors are such an example. Making up approximately 95 % of the smartphone and tablet market these processors are widely available, very power conservative and constantly becoming faster. The ATLAS software code base (Athena) is extremely complex comprising of more than 6.5 million lines of code. It has very recently been ported to the ARM 64-bit architecture. The process of our port as well as the first validation plots are presented and compared to the traditional x86 architecture.
Reversible machine code and its abstract processor architecture

DEFF Research Database (Denmark)

Axelsen, Holger Bock; Glück, Robert; Yokoyama, Tetsuo

2007-01-01

A reversible abstract machine architecture and its reversible machine code are presented and formalized. For machine code to be reversible, both the underlying control logic and each instruction must be reversible. A general class of machine instruction sets was proven to be reversible, building...
A general model of concurrency and its implementation as many-core dynamic RISC processors

NARCIS (Netherlands)

Bernard, T.; Bousias, K.; Guang, L.; Jesshope, C.R.; Lankamp, M.; van Tol, M.W.; Zhang, L.

2008-01-01

This paper presents a concurrent execution model and its micro-architecture based on in-order RISC processors, which schedules instructions from large pools of contextualised threads. The model admits a strategy for programming chip multiprocessors using parallelising compilers based on existing
Architecture and program structures for a special purpose finite element computer

Energy Technology Data Exchange (ETDEWEB)

Norrie, D.H.; Norrie, C.W.

1983-01-01

The development of very large scale integration (VLSI) has made special-purpose computers economically possible. With such a machine, the loss of flexibility compared with a general-purpose computer can be offset by the increased speed which can be obtained by tailoring the architecture to the particular problem or class of problem. The first kind of special-purpose machine has its architecture modelled on the physical structure of the problem and the second kind has its design tailored to the computational algorithm used. The parallel finite element machine (PARFEM) being designed at the University of Calgary for the solution of finite element problems is of the second kind. Its conceptual design is described and progress to date outlined. 14 references.
HTGR core seismic analysis using an array processor

International Nuclear Information System (INIS)

Shatoff, H.; Charman, C.M.

1983-01-01

A Floating Point Systems array processor performs nonlinear dynamic analysis of the high-temperature gas-cooled reactor (HTGR) core with significant time and cost savings. The graphite HTGR core consists of approximately 8000 blocks of various shapes which are subject to motion and impact during a seismic event. Two-dimensional computer programs (CRUNCH2D, MCOCO) can perform explicit step-by-step dynamic analyses of up to 600 blocks for time-history motions. However, use of two-dimensional codes was limited by the large cost and run times required. Three-dimensional analysis of the entire core, or even a large part of it, had been considered totally impractical. Because of the needs of the HTGR core seismic program, a Floating Point Systems array processor was used to enhance computer performance of the two-dimensional core seismic computer programs, MCOCO and CRUNCH2D. This effort began by converting the computational algorithms used in the codes to a form which takes maximum advantage of the parallel and pipeline processors offered by the architecture of the Floating Point Systems array processor. The subsequent conversion of the vectorized FORTRAN coding to the array processor required a significant programming effort to make the system work on the General Atomic (GA) UNIVAC 1100/82 host. These efforts were quite rewarding, however, since the cost of running the codes has been reduced approximately 50-fold and the time threefold. The core seismic analysis with large two-dimensional models has now become routine and extension to three-dimensional analysis is feasible. These codes simulate the one-fifth-scale full-array HTGR core model. This paper compares the analysis with the test results for sine-sweep motion
High performance VLSI telemetry data systems

Science.gov (United States)

Chesney, J.; Speciale, N.; Horner, W.; Sabia, S.

1990-01-01

NASA's deployment of major space complexes such as Space Station Freedom (SSF) and the Earth Observing System (EOS) will demand increased functionality and performance from ground based telemetry acquisition systems well above current system capabilities. Adaptation of space telemetry data transport and processing standards such as those specified by the Consultative Committee for Space Data Systems (CCSDS) standards and those required for commercial ground distribution of telemetry data, will drive these functional and performance requirements. In addition, budget limitations will force the requirement for higher modularity, flexibility, and interchangeability at lower cost in new ground telemetry data system elements. At NASA's Goddard Space Flight Center (GSFC), the design and development of generic ground telemetry data system elements, over the last five years, has resulted in significant solutions to these problems. This solution, referred to as the functional components approach includes both hardware and software components ready for end user application. The hardware functional components consist of modern data flow architectures utilizing Application Specific Integrated Circuits (ASIC's) developed specifically to support NASA's telemetry data systems needs and designed to meet a range of data rate requirements up to 300 Mbps. Real-time operating system software components support both embedded local software intelligence, and overall system control, status, processing, and interface requirements. These components, hardware and software, form the superstructure upon which project specific elements are added to complete a telemetry ground data system installation. This paper describes the functional components approach, some specific component examples, and a project example of the evolution from VLSI component, to basic board level functional component, to integrated telemetry data system.
Synchronization of faulty processors in coarse-grained TMR protected partially reconfigurable FPGA designs

International Nuclear Information System (INIS)

Kretzschmar, U.; Gomez-Cornejo, J.; Astarloa, A.; Bidarte, U.; Ser, J. Del

2016-01-01

The expansion of FPGA technology in numerous application fields is a fact. Single Event Effects (SEE) are a critical factor for the reliability of FPGA based systems. For this reason, a number of researches have been studying fault tolerance techniques to harden different elements of FPGA designs. Using Partial Reconfiguration (PR) in conjunction with Triple Modular Redundancy (TMR) is an emerging approach in recent publications dealing with the implementation of fault tolerant processors on SRAM-based FPGAs. While these works pay great attention to the repair of erroneous instances by means of reconfiguration, the essential step of synchronizing the repaired processors is insufficiently addressed. In this context, this paper poses four different synchronization approaches for soft core processors, which balance differently the trade-off between synchronization speed and hardware overhead. All approaches are assessed in practice by synchronizing TMR protected PicoBlaze processors implemented on a Virtex-5 FPGA. Nevertheless all methods are of a general nature and can be applied for different processor architectures in a straightforward fashion. - Highlights: • Four different synchronization methods for faulty processors are proposed. • The methods balance between synchronization speed and hardware overhead. • They can be applied to TMR-protected reconfigurable FPGA designs. • The proposed schemes are implemented and tested in real hardware.
System on chip module configured for event-driven architecture

Science.gov (United States)

Robbins, Kevin; Brady, Charles E.; Ashlock, Tad A.

2017-10-17

A system on chip (SoC) module is described herein, wherein the SoC modules comprise a processor subsystem and a hardware logic subsystem. The processor subsystem and hardware logic subsystem are in communication with one another, and transmit event messages between one another. The processor subsystem executes software actors, while the hardware logic subsystem includes hardware actors, the software actors and hardware actors conform to an event-driven architecture, such that the software actors receive and generate event messages and the hardware actors receive and generate event messages.
SAD PROCESSOR FOR MULTIPLE MACROBLOCK MATCHING IN FAST SEARCH VIDEO MOTION ESTIMATION

Directory of Open Access Journals (Sweden)

Nehal N. Shah

2015-02-01

Full Text Available Motion estimation is a very important but computationally complex task in video coding. Process of determining motion vectors based on the temporal correlation of consecutive frame is used for video compression. In order to reduce the computational complexity of motion estimation and maintain the quality of encoding during motion compensation, different fast search techniques are available. These block based motion estimation algorithms use the sum of absolute difference (SAD between corresponding macroblock in current frame and all the candidate macroblocks in the reference frame to identify best match. Existing implementations can perform SAD between two blocks using sequential or pipeline approach but performing multi operand SAD in single clock cycle with optimized recourses is state of art. In this paper various parallel architectures for computation of the fixed block size SAD is evaluated and fast parallel SAD architecture is proposed with optimized resources. Further SAD processor is described with 9 processing elements which can be configured for any existing fast search block matching algorithm. Proposed SAD processor consumes 7% fewer adders compared to existing implementation for one processing elements. Using nine PE it can process 84 HD frames per second in worse case which is good outcome for real time implementation. In average case architecture process 325 HD frames per second.
Onboard Data Processors for Planetary Ice-Penetrating Sounding Radars

Science.gov (United States)

Tan, I. L.; Friesenhahn, R.; Gim, Y.; Wu, X.; Jordan, R.; Wang, C.; Clark, D.; Le, M.; Hand, K. P.; Plaut, J. J.

2011-12-01

Among the many concerns faced by outer planetary missions, science data storage and transmission hold special significance. Such missions must contend with limited onboard storage, brief data downlink windows, and low downlink bandwidths. A potential solution to these issues lies in employing onboard data processors (OBPs) to convert raw data into products that are smaller and closely capture relevant scientific phenomena. In this paper, we present the implementation of two OBP architectures for ice-penetrating sounding radars tasked with exploring Europa and Ganymede. Our first architecture utilizes an unfocused processing algorithm extended from the Mars Advanced Radar for Subsurface and Ionosphere Sounding (MARSIS, Jordan et. al. 2009). Compared to downlinking raw data, we are able to reduce data volume by approximately 100 times through OBP usage. To ensure the viability of our approach, we have implemented, simulated, and synthesized this architecture using both VHDL and Matlab models (with fixed-point and floating-point arithmetic) in conjunction with Modelsim. Creation of a VHDL model of our processor is the principle step in transitioning to actual digital hardware, whether in a FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit), and successful simulation and synthesis strongly indicate feasibility. In addition, we examined the tradeoffs faced in the OBP between fixed-point accuracy, resource consumption, and data product fidelity. Our second architecture is based upon a focused fast back projection (FBP) algorithm that requires a modest amount of computing power and on-board memory while yielding high along-track resolution and improved slope detection capability. We present an overview of the algorithm and details of our implementation, also in VHDL. With the appropriate tradeoffs, the use of OBPs can significantly reduce data downlink requirements without sacrificing data product fidelity. Through the development

VLSI top-down design based on the separation of hierarchies

NARCIS (Netherlands)

Spaanenburg, L.; Broekema, A.; Leenstra, J.; Huys, C.

1986-01-01

Despite the presence of structure, interactions between the three views on VLSI design still lead to lengthy iterations. By separating the hierarchies for the respective views, the interactions are reduced. This separated hierarchy allows top-down design with functional abstractions as exemplified
VLSI Design with Alliance Free CAD Tools: an Implementation Example

Directory of Open Access Journals (Sweden)

Chávez-Bracamontes Ramón

2015-07-01

Full Text Available This paper presents the methodology used for a digital integrated circuit design that implements the communication protocol known as Serial Peripheral Interface, using the Alliance CAD System. The aim of this paper is to show how the work of VLSI design can be done by graduate and undergraduate students with minimal resources and experience. The physical design was sent to be fabricated using the CMOS AMI C5 process that features 0.5 micrometer in transistor size, sponsored by the MOSIS Educational Program. Tests were made on a platform that transfers data from inertial sensor measurements to the designed SPI chip, which in turn sends the data back on a parallel bus to a common microcontroller. The results show the efficiency of the employed methodology in VLSI design, as well as the feasibility of ICs manufacturing from school projects that have insufficient or no source of funding
High-speed architecture for the decoding of trellis-coded modulation

Science.gov (United States)

Osborne, William P.

1992-01-01

Since 1971, when the Viterbi Algorithm was introduced as the optimal method of decoding convolutional codes, improvements in circuit technology, especially VLSI, have steadily increased its speed and practicality. Trellis-Coded Modulation (TCM) combines convolutional coding with higher level modulation (non-binary source alphabet) to provide forward error correction and spectral efficiency. For binary codes, the current stare-of-the-art is a 64-state Viterbi decoder on a single CMOS chip, operating at a data rate of 25 Mbps. Recently, there has been an interest in increasing the speed of the Viterbi Algorithm by improving the decoder architecture, or by reducing the algorithm itself. Designs employing new architectural techniques are now in existence, however these techniques are currently applied to simpler binary codes, not to TCM. The purpose of this report is to discuss TCM architectural considerations in general, and to present the design, at the logic gate level, or a specific TCM decoder which applies these considerations to achieve high-speed decoding.
GPU: the biggest key processor for AI and parallel processing

Science.gov (United States)

Baji, Toru

2017-07-01

Two types of processors exist in the market. One is the conventional CPU and the other is Graphic Processor Unit (GPU). Typical CPU is composed of 1 to 8 cores while GPU has thousands of cores. CPU is good for sequential processing, while GPU is good to accelerate software with heavy parallel executions. GPU was initially dedicated for 3D graphics. However from 2006, when GPU started to apply general-purpose cores, it was noticed that this architecture can be used as a general purpose massive-parallel processor. NVIDIA developed a software framework Compute Unified Device Architecture (CUDA) that make it possible to easily program the GPU for these application. With CUDA, GPU started to be used in workstations and supercomputers widely. Recently two key technologies are highlighted in the industry. The Artificial Intelligence (AI) and Autonomous Driving Cars. AI requires a massive parallel operation to train many-layers of neural networks. With CPU alone, it was impossible to finish the training in a practical time. The latest multi-GPU system with P100 makes it possible to finish the training in a few hours. For the autonomous driving cars, TOPS class of performance is required to implement perception, localization, path planning processing and again SoC with integrated GPU will play a key role there. In this paper, the evolution of the GPU which is one of the biggest commercial devices requiring state-of-the-art fabrication technology will be introduced. Also overview of the GPU demanding key application like the ones described above will be introduced.
Monte Carlo dose calculation using a cell processor based PlayStation 3 system

International Nuclear Information System (INIS)

Chow, James C L; Lam, Phil; Jaffray, David A

2012-01-01

This study investigates the performance of the EGSnrc computer code coupled with a Cell-based hardware in Monte Carlo simulation of radiation dose in radiotherapy. Performance evaluations of two processor-intensive functions namely, HOWNEAR and RANMAR G ET in the EGSnrc code were carried out basing on the 20-80 rule (Pareto principle). The execution speeds of the two functions were measured by the profiler gprof specifying the number of executions and total time spent on the functions. A testing architecture designed for Cell processor was implemented in the evaluation using a PlayStation3 (PS3) system. The evaluation results show that the algorithms examined are readily parallelizable on the Cell platform, provided that an architectural change of the EGSnrc was made. However, as the EGSnrc performance was limited by the PowerPC Processing Element in the PS3, PC coupled with graphics processing units or GPCPU may provide a more viable avenue for acceleration.
Monte Carlo dose calculation using a cell processor based PlayStation 3 system

Science.gov (United States)

Chow, James C. L.; Lam, Phil; Jaffray, David A.

2012-02-01

This study investigates the performance of the EGSnrc computer code coupled with a Cell-based hardware in Monte Carlo simulation of radiation dose in radiotherapy. Performance evaluations of two processor-intensive functions namely, HOWNEAR and RANMAR_GET in the EGSnrc code were carried out basing on the 20-80 rule (Pareto principle). The execution speeds of the two functions were measured by the profiler gprof specifying the number of executions and total time spent on the functions. A testing architecture designed for Cell processor was implemented in the evaluation using a PlayStation3 (PS3) system. The evaluation results show that the algorithms examined are readily parallelizable on the Cell platform, provided that an architectural change of the EGSnrc was made. However, as the EGSnrc performance was limited by the PowerPC Processing Element in the PS3, PC coupled with graphics processing units or GPCPU may provide a more viable avenue for acceleration.
Monte Carlo dose calculation using a cell processor based PlayStation 3 system

Energy Technology Data Exchange (ETDEWEB)

Chow, James C L; Lam, Phil; Jaffray, David A, E-mail: james.chow@rmp.uhn.on.ca [Department of Radiation Oncology, University of Toronto and Radiation Medicine Program, Princess Margaret Hospital, University Health Network, Toronto, Ontario M5G 2M9 (Canada)

2012-02-09

This study investigates the performance of the EGSnrc computer code coupled with a Cell-based hardware in Monte Carlo simulation of radiation dose in radiotherapy. Performance evaluations of two processor-intensive functions namely, HOWNEAR and RANMAR{sub G}ET in the EGSnrc code were carried out basing on the 20-80 rule (Pareto principle). The execution speeds of the two functions were measured by the profiler gprof specifying the number of executions and total time spent on the functions. A testing architecture designed for Cell processor was implemented in the evaluation using a PlayStation3 (PS3) system. The evaluation results show that the algorithms examined are readily parallelizable on the Cell platform, provided that an architectural change of the EGSnrc was made. However, as the EGSnrc performance was limited by the PowerPC Processing Element in the PS3, PC coupled with graphics processing units or GPCPU may provide a more viable avenue for acceleration.
An energy-efficient high-performance processor with reconfigurable data-paths using RSFQ circuits

International Nuclear Information System (INIS)

Takagi, Naofumi

2013-01-01

Highlights: ► An idea of a high-performance computer using RSFQ circuits is shown. ► An outline of processor with reconfigurable data-paths (RDPs) is shown. ► Architectural details of an SFQ-RDP are described. -- Abstract: We show recent progress in our research on an energy-efficient high-performance processor with reconfigurable data-paths (RDPs) using rapid single-flux-quantum (RSFQ) circuits. We mainly describe the architectural details of an RDP implemented using RSFQ circuits. An RDP consists of a lot of floating-point units (FPUs) and operand routing networks (ORNs) which connect the FPUs. We reconfigure the RDP to fit a computation, i.e., a group of floating-point operations, appearing in a ‘for’ loop of programs for numerical computations by setting the route in ORNs before the execution of the loop. In the RDP, a lot of FPUs work in parallel with pipelined fashion, and hence, very high-performance computation is achieved
Development of Radhard VLSI electronics for SSC calorimeters

International Nuclear Information System (INIS)

Dawson, J.W.; Nodulman, L.J.

1989-01-01

A new program of development of integrated electronics for liquid argon calorimeters in the SSC detector environment is being started at Argonne National Laboratory. Scientists from Brookhaven National Laboratory and Vanderbilt University together with an industrial participants are expected to collaborate in this work. Interaction rates, segmentation, and the radiation environment dictate that front-end electronics of SSC calorimeters must be implemented in the form of highly integrated, radhard, analog, low noise, VLSI custom monolithic devices. Important considerations are power dissipation, choice of functions integrated on the front-end chips, and cabling requirements. An extensive level of expertise in radhard electronics exists within the industrial community, and a primary objective of this work is to bring that expertise to bear on the problems of SSC detector design. Radiation hardness measurements and requirements as well as calorimeter design will be primarily the responsibility of Argonne scientists and our Brookhaven and Vanderbilt colleagues. Radhard VLSI design and fabrication will be primarily the industrial participant's responsibility. The rapid-cycling synchrotron at Argonne will be used for radiation damage studies involving response to neutrons and charged particles, while damage from gammas will be investigated at Brookhaven. 10 refs., 6 figs., 2 tabs
The test of VLSI circuits

Science.gov (United States)

Baviere, Ph.

Tests which have proven effective for evaluating VLSI circuits for space applications are described. It is recommended that circuits be examined after each manfacturing step to gain fast feedback on inadequacies in the production system. Data from failure modes which occur during operational lifetimes of circuits also permit redefinition of the manufacturing and quality control process to eliminate the defects identified. Other tests include determination of the operational envelope of the circuits, examination of the circuit response to controlled inputs, and the performance and functional speeds of ROM and RAM memories. Finally, it is desirable that all new circuits be designed with testing in mind.
A soft-core processor architecture optimised for radar signal processing applications

CSIR Research Space (South Africa)

Broich, R

2013-12-01

Full Text Available -performance soft-core processing architecture is proposed. To develop such a processing architecture, data and signal-flow characteristics of common radar signal processing algorithms are analysed. Each algorithm is broken down into signal processing...
The hardware implementation of the CERN SPS ultrafast feedback processor demonstrator

CERN Document Server

Dusakto, J E; Fox, J D; Olsen, J; Rivetta, C H; Höfle, W

2013-01-01

An ultrafast 4GSa/s transverse feedback processor has been developed for proof-of-concept studies of feedback control of e-cloud driven and transverse mode coupled intra-bunch instabilities in the CERN SPS. This system consists of a high-speed ADC on the front end and equally fast DAC on the back end. All control and signal processing is implemented in FPGA logic. This system is capable of taking up to 16 sample slices across a single SPS bunch and processing each slice individually within a reconfigurable signal processor. This demonstrator system is a rapidly developed prototype, consisting of both commercial and custom-design components. It can stabilize the motion of a single particle bunch using closed loop feedback. The system can also run open loop as a high-speed arbitrary waveform generator and contains diagnostic features including a special ADC snapshot capture memory. This paper describes the overall system, the feedback processor and focuses on the hardware architecture, design ...
NeuroFlow: A General Purpose Spiking Neural Network Simulation Platform using Customizable Processors.

Science.gov (United States)

Cheung, Kit; Schultz, Simon R; Luk, Wayne

2015-01-01

NeuroFlow is a scalable spiking neural network simulation platform for off-the-shelf high performance computing systems using customizable hardware processors such as Field-Programmable Gate Arrays (FPGAs). Unlike multi-core processors and application-specific integrated circuits, the processor architecture of NeuroFlow can be redesigned and reconfigured to suit a particular simulation to deliver optimized performance, such as the degree of parallelism to employ. The compilation process supports using PyNN, a simulator-independent neural network description language, to configure the processor. NeuroFlow supports a number of commonly used current or conductance based neuronal models such as integrate-and-fire and Izhikevich models, and the spike-timing-dependent plasticity (STDP) rule for learning. A 6-FPGA system can simulate a network of up to ~600,000 neurons and can achieve a real-time performance of 400,000 neurons. Using one FPGA, NeuroFlow delivers a speedup of up to 33.6 times the speed of an 8-core processor, or 2.83 times the speed of GPU-based platforms. With high flexibility and throughput, NeuroFlow provides a viable environment for large-scale neural network simulation.
HEP - A semaphore-synchronized multiprocessor with central control. [Heterogeneous Element Processor

Science.gov (United States)

Gilliland, M. C.; Smith, B. J.; Calvert, W.

1976-01-01

The paper describes the design concept of the Heterogeneous Element Processor (HEP), a system tailored to the special needs of scientific simulation. In order to achieve high-speed computation required by simulation, HEP features a hierarchy of processes executing in parallel on a number of processors, with synchronization being largely accomplished by hardware. A full-empty-reserve scheme of synchronization is realized by zero-one-valued hardware semaphores. A typical system has, besides the control computer and the scheduler, an algebraic module, a memory module, a first-in first-out (FIFO) module, an integrator module, and an I/O module. The architecture of the scheduler and the algebraic module is examined in detail.
Nested dissection on a mesh-connected processor array

International Nuclear Information System (INIS)

Worley, P.H.; Schreiber, R.

1986-01-01

The authors present a parallel implementation of Gaussian elimination without pivoting using the nested dissection ordering for solving Ax=b where A is an N x N symmetric positive definite matrix. If the graph of A is a √N x √N finite element mesh then a parallel complexity of O(√N) can be achieved for Gaussian elimination with the nested dissection ordering. The authors' implementation achieves this parallel complexity on a two dimensional MIMD processor array with N processors and nearest neighbors interconnections. Thus nested dissection is a near optimal algorithm for this problem on this interconnection topology. The parallel implementation on this architecture requires 158√N + O(log/sub 2/(√N)) parallel floating point multiplications. It is faster than a Kung-Leiserson systolic array for banded matrices for N≥961, and faster than a serial implementation for N as small as 9
Highly Parallel Computing Architectures by using Arrays of Quantum-dot Cellular Automata (QCA): Opportunities, Challenges, and Recent Results

Science.gov (United States)

Fijany, Amir; Toomarian, Benny N.

2000-01-01

There has been significant improvement in the performance of VLSI devices, in terms of size, power consumption, and speed, in recent years and this trend may also continue for some near future. However, it is a well known fact that there are major obstacles, i.e., physical limitation of feature size reduction and ever increasing cost of foundry, that would prevent the long term continuation of this trend. This has motivated the exploration of some fundamentally new technologies that are not dependent on the conventional feature size approach. Such technologies are expected to enable scaling to continue to the ultimate level, i.e., molecular and atomistic size. Quantum computing, quantum dot-based computing, DNA based computing, biologically inspired computing, etc., are examples of such new technologies. In particular, quantum-dots based computing by using Quantum-dot Cellular Automata (QCA) has recently been intensely investigated as a promising new technology capable of offering significant improvement over conventional VLSI in terms of reduction of feature size (and hence increase in integration level), reduction of power consumption, and increase of switching speed. Quantum dot-based computing and memory in general and QCA specifically, are intriguing to NASA due to their high packing density (10(exp 11) - 10(exp 12) per square cm ) and low power consumption (no transfer of current) and potentially higher radiation tolerant. Under Revolutionary Computing Technology (RTC) Program at the NASA/JPL Center for Integrated Space Microelectronics (CISM), we have been investigating the potential applications of QCA for the space program. To this end, exploiting the intrinsic features of QCA, we have designed novel QCA-based circuits for co-planner (i.e., single layer) and compact implementation of a class of data permutation matrices, a class of interconnection networks, and a bit-serial processor. Building upon these circuits, we have developed novel algorithms and QCA
A Single Chip VLSI Implementation of a QPSK/SQPSK Demodulator for a VSAT Receiver Station

Science.gov (United States)

Kwatra, S. C.; King, Brent

1995-01-01

This thesis presents a VLSI implementation of a QPSK/SQPSK demodulator. It is designed to be employed in a VSAT earth station that utilizes the FDMA/TDM link. A single chip architecture is used to enable this chip to be easily employed in the VSAT system. This demodulator contains lowpass filters, integrate and dump units, unique word detectors, a timing recovery unit, a phase recovery unit and a down conversion unit. The design stages start with a functional representation of the system by using the C programming language. Then it progresses into a register based representation using the VHDL language. The layout components are designed based on these VHDL models and simulated. Component generators are developed for the adder, multiplier, read-only memory and serial access memory in order to shorten the design time. These sub-components are then block routed to form the main components of the system. The main components are block routed to form the final demodulator.
Self-Organizing Maps on the Cell Broadband Engine Architecture

International Nuclear Information System (INIS)

McConnell, Sabine M

2010-01-01

We present and evaluate novel parallel implementations of Self-Organizing Maps for the Cell Broadband Engine Architecture. Motivated by the interactive nature of the data-mining process, we evaluate the scalability of the implementations on two clusters using different network characteristics and incarnations (PS3 TM console and PowerXCell 8i) of the architecture. Our implementations use varying combinations of the Power Processing Elements (PPEs) and Synergistic Processing Elements (SPEs) found in the Cell architecture. For a single processor, our implementation scaled well with the number of SPEs regardless of the incarnation. When combining multiple PS3 TM consoles, the synchronization over the slower network resulted in poor speedups and demonstrated that the use of such a low-cost cluster may be severely restricted, even without the use of SPEs. When using multiple SPEs for the PowerXCell 8i cluster, the speedup grew linearly with increasing number of SPEs for a given number of processors, and linear up to a maximum with the number of processors for a given number of SPEs. Our implementation achieved a worst-case efficiency of 67% for the maximum number of processing elements involved in the computation, but consistently higher values for smaller numbers of processing elements with speedups of up to 70.
JIST: Just-In-Time Scheduling Translation for Parallel Processors

Directory of Open Access Journals (Sweden)

Giovanni Agosta

2005-01-01

Full Text Available The application fields of bytecode virtual machines and VLIW processors overlap in the area of embedded and mobile systems, where the two technologies offer different benefits, namely high code portability, low power consumption and reduced hardware cost. Dynamic compilation makes it possible to bridge the gap between the two technologies, but special attention must be paid to software instruction scheduling, a must for the VLIW architectures. We have implemented JIST, a Virtual Machine and JIT compiler for Java Bytecode targeted to a VLIW processor. We show the impact of various optimizations on the performance of code compiled with JIST through the experimental study on a set of benchmark programs. We report significant speedups, and increments in the number of instructions issued per cycle up to 50% with respect to the non-scheduling version of the JITcompiler. Further optimizations are discussed.
Keystone Business Models for Network Security Processors

Directory of Open Access Journals (Sweden)

Arthur Low

2013-07-01

Full Text Available Network security processors are critical components of high-performance systems built for cybersecurity. Development of a network security processor requires multi-domain experience in semiconductors and complex software security applications, and multiple iterations of both software and hardware implementations. Limited by the business models in use today, such an arduous task can be undertaken only by large incumbent companies and government organizations. Neither the “fabless semiconductor” models nor the silicon intellectual-property licensing (“IP-licensing” models allow small technology companies to successfully compete. This article describes an alternative approach that produces an ongoing stream of novel network security processors for niche markets through continuous innovation by both large and small companies. This approach, referred to here as the "business ecosystem model for network security processors", includes a flexible and reconfigurable technology platform, a “keystone” business model for the company that maintains the platform architecture, and an extended ecosystem of companies that both contribute and share in the value created by innovation. New opportunities for business model innovation by participating companies are made possible by the ecosystem model. This ecosystem model builds on: i the lessons learned from the experience of the first author as a senior integrated circuit architect for providers of public-key cryptography solutions and as the owner of a semiconductor startup, and ii the latest scholarly research on technology entrepreneurship, business models, platforms, and business ecosystems. This article will be of interest to all technology entrepreneurs, but it will be of particular interest to owners of small companies that provide security solutions and to specialized security professionals seeking to launch their own companies.

Implementation of the DPM Monte Carlo code on a parallel architecture for treatment planning applications.

Science.gov (United States)

Tyagi, Neelam; Bose, Abhijit; Chetty, Indrin J

2004-09-01

We have parallelized the Dose Planning Method (DPM), a Monte Carlo code optimized for radiotherapy class problems, on distributed-memory processor architectures using the Message Passing Interface (MPI). Parallelization has been investigated on a variety of parallel computing architectures at the University of Michigan-Center for Advanced Computing, with respect to efficiency and speedup as a function of the number of processors. We have integrated the parallel pseudo random number generator from the Scalable Parallel Pseudo-Random Number Generator (SPRNG) library to run with the parallel DPM. The Intel cluster consisting of 800 MHz Intel Pentium III processor shows an almost linear speedup up to 32 processors for simulating 1 x 10(8) or more particles. The speedup results are nearly linear on an Athlon cluster (up to 24 processors based on availability) which consists of 1.8 GHz+ Advanced Micro Devices (AMD) Athlon processors on increasing the problem size up to 8 x 10(8) histories. For a smaller number of histories (1 x 10(8)) the reduction of efficiency with the Athlon cluster (down to 83.9% with 24 processors) occurs because the processing time required to simulate 1 x 10(8) histories is less than the time associated with interprocessor communication. A similar trend was seen with the Opteron Cluster (consisting of 1400 MHz, 64-bit AMD Opteron processors) on increasing the problem size. Because of the 64-bit architecture Opteron processors are capable of storing and processing instructions at a faster rate and hence are faster as compared to the 32-bit Athlon processors. We have validated our implementation with an in-phantom dose calculation study using a parallel pencil monoenergetic electron beam of 20 MeV energy. The phantom consists of layers of water, lung, bone, aluminum, and titanium. The agreement in the central axis depth dose curves and profiles at different depths shows that the serial and parallel codes are equivalent in accuracy.
Implementation of the DPM Monte Carlo code on a parallel architecture for treatment planning applications

International Nuclear Information System (INIS)

Tyagi, Neelam; Bose, Abhijit; Chetty, Indrin J.

2004-01-01

We have parallelized the Dose Planning Method (DPM), a Monte Carlo code optimized for radiotherapy class problems, on distributed-memory processor architectures using the Message Passing Interface (MPI). Parallelization has been investigated on a variety of parallel computing architectures at the University of Michigan-Center for Advanced Computing, with respect to efficiency and speedup as a function of the number of processors. We have integrated the parallel pseudo random number generator from the Scalable Parallel Pseudo-Random Number Generator (SPRNG) library to run with the parallel DPM. The Intel cluster consisting of 800 MHz Intel Pentium III processor shows an almost linear speedup up to 32 processors for simulating 1x10 8 or more particles. The speedup results are nearly linear on an Athlon cluster (up to 24 processors based on availability) which consists of 1.8 GHz+ Advanced Micro Devices (AMD) Athlon processors on increasing the problem size up to 8x10 8 histories. For a smaller number of histories (1x10 8 ) the reduction of efficiency with the Athlon cluster (down to 83.9% with 24 processors) occurs because the processing time required to simulate 1x10 8 histories is less than the time associated with interprocessor communication. A similar trend was seen with the Opteron Cluster (consisting of 1400 MHz, 64-bit AMD Opteron processors) on increasing the problem size. Because of the 64-bit architecture Opteron processors are capable of storing and processing instructions at a faster rate and hence are faster as compared to the 32-bit Athlon processors. We have validated our implementation with an in-phantom dose calculation study using a parallel pencil monoenergetic electron beam of 20 MeV energy. The phantom consists of layers of water, lung, bone, aluminum, and titanium. The agreement in the central axis depth dose curves and profiles at different depths shows that the serial and parallel codes are equivalent in accuracy
Computer architecture a quantitative approach

CERN Document Server

Hennessy, John L

2019-01-01

Computer Architecture: A Quantitative Approach, Sixth Edition has been considered essential reading by instructors, students and practitioners of computer design for over 20 years. The sixth edition of this classic textbook is fully revised with the latest developments in processor and system architecture. It now features examples from the RISC-V (RISC Five) instruction set architecture, a modern RISC instruction set developed and designed to be a free and openly adoptable standard. It also includes a new chapter on domain-specific architectures and an updated chapter on warehouse-scale computing that features the first public information on Google's newest WSC. True to its original mission of demystifying computer architecture, this edition continues the longstanding tradition of focusing on areas where the most exciting computing innovation is happening, while always keeping an emphasis on good engineering design.
VLSI Design of Trusted Virtual Sensors

Directory of Open Access Journals (Sweden)

Macarena C. Martínez-Rodríguez

2018-01-01

Full Text Available This work presents a Very Large Scale Integration (VLSI design of trusted virtual sensors providing a minimum unitary cost and very good figures of size, speed and power consumption. The sensed variable is estimated by a virtual sensor based on a configurable and programmable PieceWise-Affine hyper-Rectangular (PWAR model. An algorithm is presented to find the best values of the programmable parameters given a set of (empirical or simulated input-output data. The VLSI design of the trusted virtual sensor uses the fast authenticated encryption algorithm, AEGIS, to ensure the integrity of the provided virtual measurement and to encrypt it, and a Physical Unclonable Function (PUF based on a Static Random Access Memory (SRAM to ensure the integrity of the sensor itself. Implementation results of a prototype designed in a 90-nm Complementary Metal Oxide Semiconductor (CMOS technology show that the active silicon area of the trusted virtual sensor is 0.86 mm 2 and its power consumption when trusted sensing at 50 MHz is 7.12 mW. The maximum operation frequency is 85 MHz, which allows response times lower than 0.25 μ s. As application example, the designed prototype was programmed to estimate the yaw rate in a vehicle, obtaining root mean square errors lower than 1.1%. Experimental results of the employed PUF show the robustness of the trusted sensing against aging and variations of the operation conditions, namely, temperature and power supply voltage (final value as well as ramp-up time.
VLSI Design of Trusted Virtual Sensors.

Science.gov (United States)

Martínez-Rodríguez, Macarena C; Prada-Delgado, Miguel A; Brox, Piedad; Baturone, Iluminada

2018-01-25

This work presents a Very Large Scale Integration (VLSI) design of trusted virtual sensors providing a minimum unitary cost and very good figures of size, speed and power consumption. The sensed variable is estimated by a virtual sensor based on a configurable and programmable PieceWise-Affine hyper-Rectangular (PWAR) model. An algorithm is presented to find the best values of the programmable parameters given a set of (empirical or simulated) input-output data. The VLSI design of the trusted virtual sensor uses the fast authenticated encryption algorithm, AEGIS, to ensure the integrity of the provided virtual measurement and to encrypt it, and a Physical Unclonable Function (PUF) based on a Static Random Access Memory (SRAM) to ensure the integrity of the sensor itself. Implementation results of a prototype designed in a 90-nm Complementary Metal Oxide Semiconductor (CMOS) technology show that the active silicon area of the trusted virtual sensor is 0.86 mm 2 and its power consumption when trusted sensing at 50 MHz is 7.12 mW. The maximum operation frequency is 85 MHz, which allows response times lower than 0.25 μ s. As application example, the designed prototype was programmed to estimate the yaw rate in a vehicle, obtaining root mean square errors lower than 1.1%. Experimental results of the employed PUF show the robustness of the trusted sensing against aging and variations of the operation conditions, namely, temperature and power supply voltage (final value as well as ramp-up time).
Design of an Elliptic Curve Cryptography processor for RFID tag chips.

Science.gov (United States)

Liu, Zilong; Liu, Dongsheng; Zou, Xuecheng; Lin, Hui; Cheng, Jian

2014-09-26

Radio Frequency Identification (RFID) is an important technique for wireless sensor networks and the Internet of Things. Recently, considerable research has been performed in the combination of public key cryptography and RFID. In this paper, an efficient architecture of Elliptic Curve Cryptography (ECC) Processor for RFID tag chip is presented. We adopt a new inversion algorithm which requires fewer registers to store variables than the traditional schemes. A new method for coordinate swapping is proposed, which can reduce the complexity of the controller and shorten the time of iterative calculation effectively. A modified circular shift register architecture is presented in this paper, which is an effective way to reduce the area of register files. Clock gating and asynchronous counter are exploited to reduce the power consumption. The simulation and synthesis results show that the time needed for one elliptic curve scalar point multiplication over GF(2163) is 176.7 K clock cycles and the gate area is 13.8 K with UMC 0.13 μm Complementary Metal Oxide Semiconductor (CMOS) technology. Moreover, the low power and low cost consumption make the Elliptic Curve Cryptography Processor (ECP) a prospective candidate for application in the RFID tag chip.
Programming massively parallel processors a hands-on approach

CERN Document Server

Kirk, David B

2010-01-01

Programming Massively Parallel Processors discusses basic concepts about parallel programming and GPU architecture. ""Massively parallel"" refers to the use of a large number of processors to perform a set of computations in a coordinated parallel way. The book details various techniques for constructing parallel programs. It also discusses the development process, performance level, floating-point format, parallel patterns, and dynamic parallelism. The book serves as a teaching guide where parallel programming is the main topic of the course. It builds on the basics of C programming for CUDA, a parallel programming environment that is supported on NVI- DIA GPUs. Composed of 12 chapters, the book begins with basic information about the GPU as a parallel computer source. It also explains the main concepts of CUDA, data parallelism, and the importance of memory access efficiency using CUDA. The target audience of the book is graduate and undergraduate students from all science and engineering disciplines who ...
Point and track-finding processors for multiwire chambers

CERN Document Server

Hansroul, M

1973-01-01

The hardware processors described below are designed to be used in conjunction with multi-wire chambers. They have the characteristic of being based on computational methods in contrast to analogue procedures. In a sense, they are hardware implementations of computer programs. But, being specially designed for their purpose, they are free of the restrictions imposed by the architecture of the computer on which the equivalent program is to run. The parallelism inherent in the algorithms can thus be fully exploited. Combined with the use of fast access scratch-pad memories and the non-sequential nature of the control program, the parallelism accounts for the fact that these processors are expected to execute 2-3 orders of magnitude faster than the equivalent Fortran programs on a CDC 7600 or 6600. As a consequence, methods which are simple and straightforward, but which are impractical because they require an exorbitant amount of computer time can on the contrary be very attractive for hardware implementation. ...
Description and Simulation of a Fast Packet Switch Architecture for Communication Satellites

Science.gov (United States)

Quintana, Jorge A.; Lizanich, Paul J.

1995-01-01

The NASA Lewis Research Center has been developing the architecture for a multichannel communications signal processing satellite (MCSPS) as part of a flexible, low-cost meshed-VSAT (very small aperture terminal) network. The MCSPS architecture is based on a multifrequency, time-division-multiple-access (MF-TDMA) uplink and a time-division multiplex (TDM) downlink. There are eight uplink MF-TDMA beams, and eight downlink TDM beams, with eight downlink dwells per beam. The information-switching processor, which decodes, stores, and transmits each packet of user data to the appropriate downlink dwell onboard the satellite, has been fully described by using VHSIC (Very High Speed Integrated-Circuit) Hardware Description Language (VHDL). This VHDL code, which was developed in-house to simulate the information switching processor, showed that the architecture is both feasible and viable. This paper describes a shared-memory-per-beam architecture, its VHDL implementation, and the simulation efforts.
Architecture of high reliable control systems using complex software

International Nuclear Information System (INIS)

Tallec, M.

1990-01-01

The problems involved by the use of complex softwares in control systems that must insure a very high level of safety are examined. The first part makes a brief description of the prototype of PROSPER system. PROSPER means protection system for nuclear reactor with high performances. It has been installed on a French nuclear power plant at the beginnning of 1987 and has been continually working since that time. This prototype is realized on a multi-processors system. The processors communicate between themselves using interruptions and protected shared memories. On each processor, one or more protection algorithms are implemented. Those algorithms use data coming directly from the plant and, eventually, data computed by the other protection algorithms. Each processor makes its own acquisitions from the process and sends warning messages if some operating anomaly is detected. All algorithms are activated concurrently on an asynchronous way. The results are presented and the safety related problems are detailed. - The second part is about measurements' validation. First, we describe how the sensors' measurements will be used in a protection system. Then, a proposal for a method based on the techniques of artificial intelligence (expert systems and neural networks) is presented. - The last part is about the problems of architectures of systems including hardware and software: the different types of redundancies used till now and a proposition of a multi-processors architecture which uses an operating system that is able to manage several tasks implemented on different processors, which verifies the good operating of each of those tasks and of the related processors and which allows to carry on the operation of the system, even in a degraded manner when a failure has been detected are detailed [fr
Trade-Off Exploration for Target Tracking Application in a Customized Multiprocessor Architecture

Directory of Open Access Journals (Sweden)

Yassin El-Hillali

2009-01-01

Full Text Available This paper presents the design of an FPGA-based multiprocessor-system-on-chip (MPSoC architecture optimized for Multiple Target Tracking (MTT in automotive applications. An MTT system uses an automotive radar to track the speed and relative position of all the vehicles (targets within its field of view. As the number of targets increases, the computational needs of the MTT system also increase making it difficult for a single processor to handle it alone. Our implementation distributes the computational load among multiple soft processor cores optimized for executing specific computational tasks. The paper explains how we designed and profiled the MTT application to partition it among different processors. It also explains how we applied different optimizations to customize the individual processor cores to their assigned tasks and to assess their impact on performance and FPGA resource utilization. The result is a complete MTT application running on an optimized MPSoC architecture that fits in a contemporary medium-sized FPGA and that meets the application's real-time constraints.
Power gating of VLSI circuits using MEMS switches in low power applications

KAUST Repository

Shobak, Hosam

2011-12-01

Power dissipation poses a great challenge for VLSI designers. With the intense down-scaling of technology, the total power consumption of the chip is made up primarily of leakage power dissipation. This paper proposes combining a custom-designed MEMS switch to power gate VLSI circuits, such that leakage power is efficiently reduced while accounting for performance and reliability. The designed MEMS switch is characterized by an 0.1876 ? ON resistance and requires 4.5 V to switch. As a result of implementing this novel power gating technique, a standby leakage power reduction of 99% and energy savings of 33.3% are achieved. Finally the possible effects of surge currents and ground bounce noise are studied. These findings allow longer operation times for battery-operated systems characterized by long standby periods. © 2011 IEEE.
Design and Implementation of a Sort-Free K-Best Sphere Decoder

KAUST Repository

Mondal, Sudip

2012-10-18

This paper describes the design and VLSI architecture for a 4x4 breadth first K-Best MIMO decoder using a 64 QAM scheme. A novel sort free approach to path extension, as well as quantized metrics result in a high throughput VLSI architecture with lower power and area consumption compared to state of the art published systems. Functionality is confirmed via an FPGA implementation on a Xilinx Virtex II Pro FPGA. Comparison of simulation and measurements are given and FPGA utilization figures are provided. Finally, VLSI architectural tradeoffs are explored for a synthesized ASIC implementation in a 65nm CMOS technology.
Fast Optimal Replica Placement with Exhaustive Search Using Dynamically Reconfigurable Processor

Directory of Open Access Journals (Sweden)

Hidetoshi Takeshita

2011-01-01

Full Text Available This paper proposes a new replica placement algorithm that expands the exhaustive search limit with reasonable calculation time. It combines a new type of parallel data-flow processor with an architecture tuned for fast calculation. The replica placement problem is to find a replica-server set satisfying service constraints in a content delivery network (CDN. It is derived from the set cover problem which is known to be NP-hard. It is impractical to use exhaustive search to obtain optimal replica placement in large-scale networks, because calculation time increases with the number of combinations. To reduce calculation time, heuristic algorithms have been proposed, but it is known that no heuristic algorithm is assured of finding the optimal solution. The proposed algorithm suits parallel processing and pipeline execution and is implemented on DAPDNA-2, a dynamically reconfigurable processor. Experiments show that the proposed algorithm expands the exhaustive search limit by the factor of 18.8 compared to the conventional algorithm search limit running on a Neumann-type processor.
Modeling, realization and evaluation of a parallel architecture for the data acquisition in multidetectors

International Nuclear Information System (INIS)

Guirande, Ph.; Aleonard, M-M.; Dien, Q-T.; Pedroza, J-L.

1997-01-01

The efficiency increasing in four π (EUROGAM, EUROBALL, DIAMANT) is achieved by an increase in the granularity, hence in the event counting rate in the acquisition system. Consequently, an evolution of the architecture of readout systems, coding and software is necessary. To achieve the required evaluation we have implemented a parallel architecture to check the quality of the events. The first application of this architecture was to make available an improved data acquisition system for the DIAMANT multidetector. The data acquisition system of DIAMANT is based on an ensemble of VME cards which must manage: the event readout, their salvation on magnetic support and histogram construction. The ensemble consists of processors distributed in a net, a workstation to control the experiment and a display system for spectra and arrays. In such architecture the task of VME bus becomes quickly a limitation for performances not only for the data transfer but also for coordination of different processors. The parallel architecture used makes the VME bus operation easy. It is based on three DSP C40 (Digital Signal Processor) implanted in a commercial (LSI) VME. It is provided with an external bus used to read the raw data from an interface card (ROCVI) between the 32 bit ECL bus reading the real time VME-based encoders. The performed tests have evidenced jamming after data exchanges between the processors using two communication lines. The analysis of this problem has indicated the necessity of dynamical changes of tasks to avoid this blocking. Intrinsic evaluation (i.e. without transfer on the VME bus) has been carried out for two parallel topologies (processor farm and tree). The simulation software permitted the generation of event packets. The obtained rates are sensibly equivalent (6 Mo/s) independent of topology. The farm topology has been chosen because it is simple to implant. The charge evaluation has reduced the rate in 'simplex' communication mode to 5.3 Mo/s and
Multi-Softcore Architecture on FPGA

Directory of Open Access Journals (Sweden)

Mouna Baklouti

2014-01-01

Full Text Available To meet the high performance demands of embedded multimedia applications, embedded systems are integrating multiple processing units. However, they are mostly based on custom-logic design methodology. Designing parallel multicore systems using available standards intellectual properties yet maintaining high performance is also a challenging issue. Softcore processors and field programmable gate arrays (FPGAs are a cheap and fast option to develop and test such systems. This paper describes a FPGA-based design methodology to implement a rapid prototype of parametric multicore systems. A study of the viability of making the SoC using the NIOS II soft-processor core from Altera is also presented. The NIOS II features a general-purpose RISC CPU architecture designed to address a wide range of applications. The performance of the implemented architecture is discussed, and also some parallel applications are used for testing speedup and efficiency of the system. Experimental results demonstrate the performance of the proposed multicore system, which achieves better speedup than the GPU (29.5% faster for the FIR filter and 23.6% faster for the matrix-matrix multiplication.
Design concepts for a virtualizable embedded MPSoC architecture enabling virtualization in embedded multi-processor systems

CERN Document Server

Biedermann, Alexander

2014-01-01

Alexander Biedermann presents a generic hardware-based virtualization approach, which may transform an array of any off-the-shelf embedded processors into a multi-processor system with high execution dynamism. Based on this approach, he highlights concepts for the design of energy aware systems, self-healing systems as well as parallelized systems. For the latter, the novel so-called Agile Processing scheme is introduced by the author, which enables a seamless transition between sequential and parallel execution schemes. The design of such virtualizable systems is further aided by introduction
Heterogeneous reconfigurable processors for real-time baseband processing from algorithm to architecture

CERN Document Server

Zhang, Chenxin; Öwall, Viktor

2016-01-01

This book focuses on domain-specific heterogeneous reconfigurable architectures, demonstrating for readers a computing platform which is flexible enough to support multiple standards, multiple modes, and multiple algorithms. The content is multi-disciplinary, covering areas of wireless communication, computing architecture, and circuit design. The platform described provides real-time processing capability with reasonable implementation cost, achieving balanced trade-offs among flexibility, performance, and hardware costs. The authors discuss efficient design methods for wireless communication processing platforms, from both an algorithm and architecture design perspective. Coverage also includes computing platforms for different wireless technologies and standards, including MIMO, OFDM, Massive MIMO, DVB, WLAN, LTE/LTE-A, and 5G. •Discusses reconfigurable architectures, including hardware building blocks such as processing elements, memory sub-systems, Network-on-Chip (NoC), and dynamic hardware reconfigur...
Simulating Hydrologic Flow and Reactive Transport with PFLOTRAN and PETSc on Emerging Fine-Grained Parallel Computer Architectures

Science.gov (United States)

Mills, R. T.; Rupp, K.; Smith, B. F.; Brown, J.; Knepley, M.; Zhang, H.; Adams, M.; Hammond, G. E.

2017-12-01

As the high-performance computing community pushes towards the exascale horizon, power and heat considerations have driven the increasing importance and prevalence of fine-grained parallelism in new computer architectures. High-performance computing centers have become increasingly reliant on GPGPU accelerators and "manycore" processors such as the Intel Xeon Phi line, and 512-bit SIMD registers have even been introduced in the latest generation of Intel's mainstream Xeon server processors. The high degree of fine-grained parallelism and more complicated memory hierarchy considerations of such "manycore" processors present several challenges to existing scientific software. Here, we consider how the massively parallel, open-source hydrologic flow and reactive transport code PFLOTRAN - and the underlying Portable, Extensible Toolkit for Scientific Computation (PETSc) library on which it is built - can best take advantage of such architectures. We will discuss some key features of these novel architectures and our code optimizations and algorithmic developments targeted at them, and present experiences drawn from working with a wide range of PFLOTRAN benchmark problems on these architectures.
The Potential of the Cell Processor for Scientific Computing

Energy Technology Data Exchange (ETDEWEB)

Williams, Samuel; Shalf, John; Oliker, Leonid; Husbands, Parry; Kamil, Shoaib; Yelick, Katherine

2005-10-14

The slowing pace of commodity microprocessor performance improvements combined with ever-increasing chip power demands has become of utmost concern to computational scientists. As a result, the high performance computing community is examining alternative architectures that address the limitations of modern cache-based designs. In this work, we examine the potential of the using the forth coming STI Cell processor as a building block for future high-end computing systems. Our work contains several novel contributions. We are the first to present quantitative Cell performance data on scientific kernels and show direct comparisons against leading superscalar (AMD Opteron), VLIW (IntelItanium2), and vector (Cray X1) architectures. Since neither Cell hardware nor cycle-accurate simulators are currently publicly available, we develop both analytical models and simulators to predict kernel performance. Our work also explores the complexity of mapping several important scientific algorithms onto the Cells unique architecture. Additionally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.

Electromagnetic Physics Models for Parallel Computing Architectures

Science.gov (United States)

Amadio, G.; Ananya, A.; Apostolakis, J.; Aurora, A.; Bandieramonte, M.; Bhattacharyya, A.; Bianchini, C.; Brun, R.; Canal, P.; Carminati, F.; Duhem, L.; Elvira, D.; Gheata, A.; Gheata, M.; Goulas, I.; Iope, R.; Jun, S. Y.; Lima, G.; Mohanty, A.; Nikitina, T.; Novak, M.; Pokorski, W.; Ribon, A.; Seghal, R.; Shadura, O.; Vallecorsa, S.; Wenzel, S.; Zhang, Y.

2016-10-01

The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe implementation of electromagnetic physics models developed for parallel computing architectures as a part of the GeantV project. Results of preliminary performance evaluation and physics validation are presented as well.
The TMS34010 graphic processor - an architecture for image visualization in NMR tomography

International Nuclear Information System (INIS)

Slaets, Jan Frans Willem; Paiva, Maria Stela Veludo de; Almeida, Lirio O.B.

1989-01-01

This abstract presents a description of the minimum system implemented with the graphic processor TMS34010, which will be used in the reconstruction, treatment and interpretation f images obtained by NMR tomography. The project is being developed in the LIE (Electronic Instrumentation Laboratory), of the Sao Carlos Chemistry and Physical Institute, S P, Brazil and is already in operation
Performance analysis of general purpose and digital signal processor kernels for heterogeneous systems-on-chip

Directory of Open Access Journals (Sweden)

T. von Sydow

2003-01-01

Full Text Available Various reasons like technology progress, flexibility demands, shortened product cycle time and shortened time to market have brought up the possibility and necessity to integrate different architecture blocks on one heterogeneous System-on-Chip (SoC. Architecture blocks like programmable processor cores (DSP- and GPP-kernels, embedded FPGAs as well as dedicated macros will be integral parts of such a SoC. Especially programmable architecture blocks and associated optimization techniques are discussed in this contribution. Design space exploration and thus the choice which architecture blocks should be integrated in a SoC is a challenging task. Crucial to this exploration is the evaluation of the application domain characteristics and the costs caused by individual architecture blocks integrated on a SoC. An ATE-cost function has been applied to examine the performance of the aforementioned programmable architecture blocks. Therefore, representative discrete devices have been analyzed. Furthermore, several architecture dependent optimization steps and their effects on the cost ratios are presented.
High performance deformable image registration algorithms for manycore processors

CERN Document Server

Shackleford, James; Sharp, Gregory

2013-01-01

High Performance Deformable Image Registration Algorithms for Manycore Processors develops highly data-parallel image registration algorithms suitable for use on modern multi-core architectures, including graphics processing units (GPUs). Focusing on deformable registration, we show how to develop data-parallel versions of the registration algorithm suitable for execution on the GPU. Image registration is the process of aligning two or more images into a common coordinate frame and is a fundamental step to be able to compare or fuse data obtained from different sensor measurements. E
Probabilistic programmable quantum processors

International Nuclear Information System (INIS)

Buzek, V.; Ziman, M.; Hillery, M.

2004-01-01

We analyze how to improve performance of probabilistic programmable quantum processors. We show how the probability of success of the probabilistic processor can be enhanced by using the processor in loops. In addition, we show that an arbitrary SU(2) transformations of qubits can be encoded in program state of a universal programmable probabilistic quantum processor. The probability of success of this processor can be enhanced by a systematic correction of errors via conditional loops. Finally, we show that all our results can be generalized also for qudits. (Abstract Copyright [2004], Wiley Periodicals, Inc.)
Assimilation of Biophysical Neuronal Dynamics in Neuromorphic VLSI.

Science.gov (United States)

Wang, Jun; Breen, Daniel; Akinin, Abraham; Broccard, Frederic; Abarbanel, Henry D I; Cauwenberghs, Gert

2017-12-01

Representing the biophysics of neuronal dynamics and behavior offers a principled analysis-by-synthesis approach toward understanding mechanisms of nervous system functions. We report on a set of procedures assimilating and emulating neurobiological data on a neuromorphic very large scale integrated (VLSI) circuit. The analog VLSI chip, NeuroDyn, features 384 digitally programmable parameters specifying for 4 generalized Hodgkin-Huxley neurons coupled through 12 conductance-based chemical synapses. The parameters also describe reversal potentials, maximal conductances, and spline regressed kinetic functions for ion channel gating variables. In one set of experiments, we assimilated membrane potential recorded from one of the neurons on the chip to the model structure upon which NeuroDyn was designed using the known current input sequence. We arrived at the programmed parameters except for model errors due to analog imperfections in the chip fabrication. In a related set of experiments, we replicated songbird individual neuron dynamics on NeuroDyn by estimating and configuring parameters extracted using data assimilation from intracellular neural recordings. Faithful emulation of detailed biophysical neural dynamics will enable the use of NeuroDyn as a tool to probe electrical and molecular properties of functional neural circuits. Neuroscience applications include studying the relationship between molecular properties of neurons and the emergence of different spike patterns or different brain behaviors. Clinical applications include studying and predicting effects of neuromodulators or neurodegenerative diseases on ion channel kinetics.
Comparison of Processor Performance of SPECint2006 Benchmarks of some Intel Xeon Processors

OpenAIRE

Abdul Kareem PARCHUR; Ram Asaray SINGH

2012-01-01

High performance is a critical requirement to all microprocessors manufacturers. The present paper describes the comparison of performance in two main Intel Xeon series processors (Type A: Intel Xeon X5260, X5460, E5450 and L5320 and Type B: Intel Xeon X5140, 5130, 5120 and E5310). The microarchitecture of these processors is implemented using the basis of a new family of processors from Intel starting with the Pentium 4 processor. These processors can provide a performance boost for many ke...
The LASS hardware processor

International Nuclear Information System (INIS)

Kunz, P.F.

1976-01-01

The problems of data analysis with hardware processors are reviewed and a description is given of a programmable processor. This processor, the 168/E, has been designed for use in the LASS multi-processor system; it has an execution speed comparable to the IBM 370/168 and uses the subset of IBM 370 instructions appropriate to the LASS analysis task. (Auth.)
Implementation of an EPICS IOC on an Embedded Soft Core Processor Using Field Programmable Gate Arrays

International Nuclear Information System (INIS)

Douglas Curry; Alicia Hofler; Hai Dong; Trent Allison; J. Hovater; Kelly Mahoney

2005-01-01

At Jefferson Lab, we have been evaluating soft core processors running an EPICS IOC over μClinux on our custom hardware. A soft core processor is a flexible CPU architecture that is configured in the FPGA as opposed to a hard core processor which is fixed in silicon. Combined with an on-board Ethernet port, the technology incorporates the IOC and digital control hardware within a single FPGA. By eliminating the general purpose computer IOC, the designer is no longer tied to a specific platform, e.g. PC, VME, or VXI, to serve as the intermediary between the high level controls and the field hardware. This paper will discuss the design and development process as well as specific applications for JLab's next generation low-level RF controls and Machine Protection Systems
Optical RAM-enabled cache memory and optical routing for chip multiprocessors: technologies and architectures

Science.gov (United States)

Pleros, Nikos; Maniotis, Pavlos; Alexoudi, Theonitsa; Fitsios, Dimitris; Vagionas, Christos; Papaioannou, Sotiris; Vyrsokinos, K.; Kanellos, George T.

2014-03-01

The processor-memory performance gap, commonly referred to as "Memory Wall" problem, owes to the speed mismatch between processor and electronic RAM clock frequencies, forcing current Chip Multiprocessor (CMP) configurations to consume more than 50% of the chip real-estate for caching purposes. In this article, we present our recent work spanning from Si-based integrated optical RAM cell architectures up to complete optical cache memory architectures for Chip Multiprocessor configurations. Moreover, we discuss on e/o router subsystems with up to Tb/s routing capacity for cache interconnection purposes within CMP configurations, currently pursued within the FP7 PhoxTrot project.
Floating point only SIMD instruction set architecture including compare, select, Boolean, and alignment operations

Science.gov (United States)

Gschwind, Michael K [Chappaqua, NY

2011-03-01

Mechanisms for implementing a floating point only single instruction multiple data instruction set architecture are provided. A processor is provided that comprises an issue unit, an execution unit coupled to the issue unit, and a vector register file coupled to the execution unit. The execution unit has logic that implements a floating point (FP) only single instruction multiple data (SIMD) instruction set architecture (ISA). The floating point vector registers of the vector register file store both scalar and floating point values as vectors having a plurality of vector elements. The processor may be part of a data processing system.
Power gating of VLSI circuits using MEMS switches in low power applications

KAUST Repository

Shobak, Hosam; Ghoneim, Mohamed T.; El Boghdady, Nawal; Halawa, Sarah; Iskander, Sophinese M.; Anis, Mohab H.

2011-01-01

-designed MEMS switch to power gate VLSI circuits, such that leakage power is efficiently reduced while accounting for performance and reliability. The designed MEMS switch is characterized by an 0.1876 ? ON resistance and requires 4.5 V to switch. As a result
Drift chamber tracking with a VLSI neural network

International Nuclear Information System (INIS)

Lindsey, C.S.; Denby, B.; Haggerty, H.; Johns, K.

1992-10-01

We have tested a commercial analog VLSI neural network chip for finding in real time the intercept and slope of charged particles traversing a drift chamber. Voltages proportional to the drift times were input to the Intel ETANN chip and the outputs were recorded and later compared off line to conventional track fits. We will discuss the chamber and test setup, the chip specifications, and results of recent tests. We'll briefly discuss possible applications in high energy physics detector triggers
Adaptive Backoff Synchronization Techniques

Science.gov (United States)

1989-07-01

Percentage of synchronization and non- synchronisation references that cause invalidations in directory schemes with 2, 3, 4, 5, and 64 pointers...processors to arrive. The slight relative increase of synchronisation overhead in all cases when going from two to five pointers is because synchronization ...MASSACHUSETTS INSTITUTE OF TECHNOLOGY VLSI PUBLICATIONS q~JU VLSI Memo No. 89-547 It July 1989 Adaptive Backoff Synchronization Techniques Anant
Heterogeneous System Architectures from APUs to discrete GPUs

CERN Multimedia

CERN. Geneva

2013-01-01

We will present the Heterogeneous Systems Architectures that new AMD processors are bringing with the new GCN based GPUs and the new APUs. We will show how together they represent a huge step forward for programming flexibility and performance efficiently for Compute.
A Real-Time Marker-Based Visual Sensor Based on a FPGA and a Soft Core Processor.

Science.gov (United States)

Tayara, Hilal; Ham, Woonchul; Chong, Kil To

2016-12-15

This paper introduces a real-time marker-based visual sensor architecture for mobile robot localization and navigation. A hardware acceleration architecture for post video processing system was implemented on a field-programmable gate array (FPGA). The pose calculation algorithm was implemented in a System on Chip (SoC) with an Altera Nios II soft-core processor. For every frame, single pass image segmentation and Feature Accelerated Segment Test (FAST) corner detection were used for extracting the predefined markers with known geometries in FPGA. Coplanar PosIT algorithm was implemented on the Nios II soft-core processor supplied with floating point hardware for accelerating floating point operations. Trigonometric functions have been approximated using Taylor series and cubic approximation using Lagrange polynomials. Inverse square root method has been implemented for approximating square root computations. Real time results have been achieved and pixel streams have been processed on the fly without any need to buffer the input frame for further implementation.
The breaking point of modern processor and platform technology

CERN Document Server

Nowak, A; Lazzaro, A; Leduc, J

2011-01-01

This work is an overview of state of the art processors used in High Energy Physics, their architecture and an extensive outline of the forthcoming technologies. Silicon process science and hardware design are making constant and rapid progress, and a solid grasp of these developments is imperative to the understanding of their possible future applications, which might include software strategy, optimizations, computing center operations and hardware acquisitions. In particular, the current issue of software and platform scalability is becoming more and more noticeable, and will develop in the near future with the growing core count of single chips and the approach of certain x86 architectural limits. Other topics brought forward include the hard, physical limits of innovation, the applicability of tried and tested computing formulas to modern technologies, as well as an analysis of viable alternate choices for continued development.
Electromagnetic Physics Models for Parallel Computing Architectures

International Nuclear Information System (INIS)

Amadio, G; Bianchini, C; Iope, R; Ananya, A; Apostolakis, J; Aurora, A; Bandieramonte, M; Brun, R; Carminati, F; Gheata, A; Gheata, M; Goulas, I; Nikitina, T; Bhattacharyya, A; Mohanty, A; Canal, P; Elvira, D; Jun, S Y; Lima, G; Duhem, L

2016-01-01

The recent emergence of hardware architectures characterized by many-core or accelerated processors has opened new opportunities for concurrent programming models taking advantage of both SIMD and SIMT architectures. GeantV, a next generation detector simulation, has been designed to exploit both the vector capability of mainstream CPUs and multi-threading capabilities of coprocessors including NVidia GPUs and Intel Xeon Phi. The characteristics of these architectures are very different in terms of the vectorization depth and type of parallelization needed to achieve optimal performance. In this paper we describe implementation of electromagnetic physics models developed for parallel computing architectures as a part of the GeantV project. Results of preliminary performance evaluation and physics validation are presented as well. (paper)
Optimizations of Unstructured Aerodynamics Computations for Many-core Architectures

KAUST Repository

Al Farhan, Mohammed Ahmed

2018-04-13

We investigate several state-of-the-practice shared-memory optimization techniques applied to key routines of an unstructured computational aerodynamics application with irregular memory accesses. We illustrate for the Intel KNL processor, as a representative of the processors in contemporary leading supercomputers, identifying and addressing performance challenges without compromising the floating point numerics of the original code. We employ low and high-level architecture-specific code optimizations involving thread and data-level parallelism. Our approach is based upon a multi-level hierarchical distribution of work and data across both the threads and the SIMD units within every hardware core. On a 64-core KNL chip, we achieve nearly 2.9x speedup of the dominant routines relative to the baseline. These exhibit almost linear strong scalability up to 64 threads, and thereafter some improvement with hyperthreading. At substantially fewer Watts, we achieve up to 1.7x speedup relative to the performance of 72 threads of a 36-core Haswell CPU and roughly equivalent performance to 112 threads of a 56-core Skylake scalable processor. These optimizations are expected to be of value for many other unstructured mesh PDE-based scientific applications as multi and many-core architecture evolves.
High-energy heavy ion testing of VLSI devices for single event ...

Indian Academy of Sciences (India)

Unknown

per describes the high-energy heavy ion radiation testing of VLSI devices for single event upset (SEU) ... The experimental set up employed to produce low flux of heavy ions viz. silicon ... through which they pass, leaving behind a wake of elec- ... for use in Bus Management Unit (BMU) and bulk CMOS ... was scheduled.

Addressing Thermal and Performance Variability Issues in Dynamic Processors

Energy Technology Data Exchange (ETDEWEB)

Yoshii, Kazutomo [Argonne National Lab. (ANL), Argonne, IL (United States); Llopis, Pablo [Univ. Carlos III de Madrid (Spain); Zhang, Kaicheng [Northwestern Univ., Evanston, IL (United States); Luo, Yingyi [Northwestern Univ., Evanston, IL (United States); Ogrenci-Memik, Seda [Northwestern Univ., Evanston, IL (United States); Memik, Gokhan [Northwestern Univ., Evanston, IL (United States); Sankaran, Rajesh [Argonne National Lab. (ANL), Argonne, IL (United States); Beckman, Pete [Argonne National Lab. (ANL), Argonne, IL (United States)

2017-03-01

As CMOS scaling nears its end, parameter variations (process, temperature and voltage) are becoming a major concern. To overcome parameter variations and provide stability, modern processors are becoming dynamic, opportunistically adjusting voltage and frequency based on thermal and energy constraints, which negatively impacts traditional bulk-synchronous parallelism-minded hardware and software designs. As node-level architecture is growing in complexity, implementing variation control mechanisms only with hardware can be a challenging task. In this paper we investigate a software strategy to manage hardwareinduced variations, leveraging low-level monitoring/controlling mechanisms.
Matrix-Vector Based Fast Fourier Transformations on SDR Architectures

Directory of Open Access Journals (Sweden)

Y. He

2008-05-01

Full Text Available Today Discrete Fourier Transforms (DFTs are applied in various radio standards based on OFDM (Orthogonal Frequency Division Multiplex. It is important to gain a fast computational speed for the DFT, which is usually achieved by using specialized Fast Fourier Transform (FFT engines. However, in face of the Software Defined Radio (SDR development, more general (parallel processor architectures are often desirable, which are not tailored to FFT computations. Therefore, alternative approaches are required to reduce the complexity of the DFT. Starting from a matrix-vector based description of the FFT idea, we will present different factorizations of the DFT matrix, which allow a reduction of the complexity that lies between the original DFT and the minimum FFT complexity. The computational complexities of these factorizations and their suitability for implementation on different processor architectures are investigated.
Design and simulation of parallel and distributed architectures for images processing

International Nuclear Information System (INIS)

Pirson, Alain

1990-01-01

The exploitation of visual information requires special computers. The diversity of operations and the Computing power involved bring about structures founded on the concepts of concurrency and distributed processing. This work identifies a vision computer with an association of dedicated intelligent entities, exchanging messages according to the model of parallelism introduced by the language Occam. It puts forward an architecture of the 'enriched processor network' type. It consists of a classical multiprocessor structure where each node is provided with specific devices. These devices perform processing tasks as well as inter-nodes dialogues. Such an architecture benefits from the homogeneity of multiprocessor networks and the power of dedicated resources. Its implementation corresponds to that of a distributed structure, tasks being allocated to each Computing element. This approach culminates in an original architecture called ATILA. This modular structure is based on a transputer network supplied with vision dedicated co-processors and powerful communication devices. (author) [fr
Implementing An Image Understanding System Architecture Using Pipe

Science.gov (United States)

Luck, Randall L.

1988-03-01

This paper will describe PIPE and how it can be used to implement an image understanding system. Image understanding is the process of developing a description of an image in order to make decisions about its contents. The tasks of image understanding are generally split into low level vision and high level vision. Low level vision is performed by PIPE -a high performance parallel processor with an architecture specifically designed for processing video images at up to 60 fields per second. High level vision is performed by one of several types of serial or parallel computers - depending on the application. An additional processor called ISMAP performs the conversion from iconic image space to symbolic feature space. ISMAP plugs into one of PIPE's slots and is memory mapped into the high level processor. Thus it forms the high speed link between the low and high level vision processors. The mechanisms for bottom-up, data driven processing and top-down, model driven processing are discussed.
A Versatile Image Processor For Digital Diagnostic Imaging And Its Application In Computed Radiography

Science.gov (United States)

Blume, H.; Alexandru, R.; Applegate, R.; Giordano, T.; Kamiya, K.; Kresina, R.

1986-06-01

In a digital diagnostic imaging department, the majority of operations for handling and processing of images can be grouped into a small set of basic operations, such as image data buffering and storage, image processing and analysis, image display, image data transmission and image data compression. These operations occur in almost all nodes of the diagnostic imaging communications network of the department. An image processor architecture was developed in which each of these functions has been mapped into hardware and software modules. The modular approach has advantages in terms of economics, service, expandability and upgradeability. The architectural design is based on the principles of hierarchical functionality, distributed and parallel processing and aims at real time response. Parallel processing and real time response is facilitated in part by a dual bus system: a VME control bus and a high speed image data bus, consisting of 8 independent parallel 16-bit busses, capable of handling combined up to 144 MBytes/sec. The presented image processor is versatile enough to meet the video rate processing needs of digital subtraction angiography, the large pixel matrix processing requirements of static projection radiography, or the broad range of manipulation and display needs of a multi-modality diagnostic work station. Several hardware modules are described in detail. For illustrating the capabilities of the image processor, processed 2000 x 2000 pixel computed radiographs are shown and estimated computation times for executing the processing opera-tions are presented.
The AMchip: A VLSI associative memory for track finding

International Nuclear Information System (INIS)

Morsani, F.; Galeotti, S.; Passuello, D.; Amendolia, S.R.; Ristori, L.; Turini, N.

1992-01-01

An associative memory to be used for super-fast track finding in future high energy physics experiments, has been implemented on silicon as a full-custom CMOS VLSI chip (the AMchip). The first prototype has been designed and successfully tested at INFN in Pisa. It is implemented in 1.6 μm, double metal, silicon gate CMOS technology and contains about 140 000 MOS transistors on a 1x1 cm 2 silicon chip. (orig.)
Consumer Electronics Processors for Critical Real-Time Systems: a (Failed) Practical Experience

OpenAIRE

Fernandez , Gabriel; Cazorla , Francisco; Abella , Jaume

2018-01-01

International audience; The convergence between consumer electronics and critical real-time markets has increased the need for hardware platforms able to deliver high performance as well as high (sustainable) performance guarantees. Using the ARM big.LITTLE architecture as example of those platforms, in this paper we report our experience with one of its implementations (the Qualcomm SnapDragon 810 processor) to derive performance bounds with measurement-based techniques. Our theoretical and ...
A portable modular architecture for robotic manipulator control

International Nuclear Information System (INIS)

Butler, P.L.

1993-01-01

A control architecture has been developed to provide a framework for robotic manipulator control. This architecture, called the Modular Integrated Control Architecture (MICA), has been successfully applied to two different manipulator systems. MICA is a portable system in two respects. First, it can be used for the control of different types of manipulator systems. Second, the MICA code is portable across several operating environments. This portability allows the sharing of common control code among various systems. A major portion of MICA is the precise control of multiple processors that have to be coordinated to control a manipulator system. By having NUCA control the processor synchronization, the system developer can concentrate on the specific aspects of a new manipulator system. MICA also provides standard functions for trajectory generation that can be used for most manipulators. Custom trajectory generators can be easily added to suit the needs of a particular robotic control system. Another facility that MICA provides is a simulation of the manipulator, allowing the control code to be simulated before trying it on a manipulator system. Using this technique, one can develop code for a manipulator system without risking damage to the arm during development
Architecture Of High Speed Image Processing System

Science.gov (United States)

Konishi, Toshio; Hayashi, Hiroshi; Ohki, Tohru

1988-01-01

One of architectures for a high speed image processing system which corresponds to a new algorithm for a shape understanding is proposed. And the hardware system which is based on the archtecture was developed. Consideration points of the architecture are mainly that using processors should match with the processing sequence of the target image and that the developed system should be used practically in an industry. As the result, it was possible to perform each processing at a speed of 80 nano-seconds a pixel.
Phase space simulation of collisionless stellar systems on the massively parallel processor

International Nuclear Information System (INIS)

White, R.L.

1987-01-01

A numerical technique for solving the collisionless Boltzmann equation describing the time evolution of a self gravitating fluid in phase space was implemented on the Massively Parallel Processor (MPP). The code performs calculations for a two dimensional phase space grid (with one space and one velocity dimension). Some results from calculations are presented. The execution speed of the code is comparable to the speed of a single processor of a Cray-XMP. Advantages and disadvantages of the MPP architecture for this type of problem are discussed. The nearest neighbor connectivity of the MPP array does not pose a significant obstacle. Future MPP-like machines should have much more local memory and easier access to staging memory and disks in order to be effective for this type of problem
Measurements of the LHCb software stack on the ARM architecture

International Nuclear Information System (INIS)

Kartik, S Vijay; Couturier, Ben; Clemencic, Marco; Neufeld, Niko

2014-01-01

The ARM architecture is a power-efficient design that is used in most processors in mobile devices all around the world today since they provide reasonable compute performance per watt. The current LHCb software stack is designed (and thus expected) to build and run on machines with the x86/x86 6 4 architecture. This paper outlines the process of measuring the performance of the LHCb software stack on the ARM architecture – specifically, the ARMv7 architecture on Cortex-A9 processors from NVIDIA and on full-fledged ARM servers with chipsets from Calxeda – and makes comparisons with the performance on x86 6 4 architectures on the Intel Xeon L5520/X5650 and AMD Opteron 6272. The paper emphasises the aspects of performance per core with respect to the power drawn by the compute nodes for the given performance – this ensures a fair real-world comparison with much more 'powerful' Intel/AMD processors. The comparisons of these real workloads in the context of LHCb are also complemented with the standard synthetic benchmarks HEPSPEC and Coremark. The pitfalls and solutions for the non-trivial task of porting the source code to build for the ARMv7 instruction set are presented. The specific changes in the build process needed for ARM-specific portions of the software stack are described, to serve as pointers for further attempts taken up by other groups in this direction. Cases where architecture-specific tweaks at the assembler lever (both in ROOT and the LHCb software stack) were needed for a successful compile are detailed – these cases are good indicators of where/how the software stack as well as the build system can be made more portable and multi-arch friendly. The experience gained from the tasks described in this paper are intended to i) assist in making an informed choice about ARM-based server solutions as a feasible low-power alternative to the current compute nodes, and ii) revisit the software design and build system for portability and
Emergent Auditory Feature Tuning in a Real-Time Neuromorphic VLSI System.

Science.gov (United States)

Sheik, Sadique; Coath, Martin; Indiveri, Giacomo; Denham, Susan L; Wennekers, Thomas; Chicca, Elisabetta

2012-01-01

Many sounds of ecological importance, such as communication calls, are characterized by time-varying spectra. However, most neuromorphic auditory models to date have focused on distinguishing mainly static patterns, under the assumption that dynamic patterns can be learned as sequences of static ones. In contrast, the emergence of dynamic feature sensitivity through exposure to formative stimuli has been recently modeled in a network of spiking neurons based on the thalamo-cortical architecture. The proposed network models the effect of lateral and recurrent connections between cortical layers, distance-dependent axonal transmission delays, and learning in the form of Spike Timing Dependent Plasticity (STDP), which effects stimulus-driven changes in the pattern of network connectivity. In this paper we demonstrate how these principles can be efficiently implemented in neuromorphic hardware. In doing so we address two principle problems in the design of neuromorphic systems: real-time event-based asynchronous communication in multi-chip systems, and the realization in hybrid analog/digital VLSI technology of neural computational principles that we propose underlie plasticity in neural processing of dynamic stimuli. The result is a hardware neural network that learns in real-time and shows preferential responses, after exposure, to stimuli exhibiting particular spectro-temporal patterns. The availability of hardware on which the model can be implemented, makes this a significant step toward the development of adaptive, neurobiologically plausible, spike-based, artificial sensory systems.
Emergent auditory feature tuning in a real-time neuromorphic VLSI system

Directory of Open Access Journals (Sweden)

Sadique eSheik

2012-02-01

Full Text Available Many sounds of ecological importance, such as communication calls, are characterised by time-varying spectra. However, most neuromorphic auditory models to date have focused on distinguishing mainly static patterns, under the assumption that dynamic patterns can be learned as sequences of static ones. In contrast, the emergence of dynamic feature sensitivity through exposure to formative stimuli has been recently modeled in a network of spiking neurons based on the thalamocortical architecture. The proposed network models the effect of lateral and recurrent connections between cortical layers, distance-dependent axonal transmission delays, and learning in the form of Spike Timing Dependent Plasticity (STDP, which effects stimulus-driven changes in the pattern of network connectivity. In this paper we demonstrate how these principles can be efficiently implemented in neuromorphic hardware. In doing so we address two principle problems in the design of neuromorphic systems: real-time event-based asynchronous communication in multi-chip systems, and the realization in hybrid analog/digital VLSI technology of neural computational principles that we propose underlie plasticity in neural processing of dynamic stimuli. The result is a hardware neural network that learns in real-time and shows preferential responses, after exposure, to stimuli exhibiting particular spectrotemporal patterns. The availability of hardware on which the model can be implemented, makes this a significant step towards the development of adaptive, neurobiologically plausible, spike-based, artificial sensory systems.
Comparison of Processor Performance of SPECint2006 Benchmarks of some Intel Xeon Processors

Directory of Open Access Journals (Sweden)

Abdul Kareem PARCHUR

2012-08-01

Full Text Available High performance is a critical requirement to all microprocessors manufacturers. The present paper describes the comparison of performance in two main Intel Xeon series processors (Type A: Intel Xeon X5260, X5460, E5450 and L5320 and Type B: Intel Xeon X5140, 5130, 5120 and E5310. The microarchitecture of these processors is implemented using the basis of a new family of processors from Intel starting with the Pentium 4 processor. These processors can provide a performance boost for many key application areas in modern generation. The scaling of performance in two major series of Intel Xeon processors (Type A: Intel Xeon X5260, X5460, E5450 and L5320 and Type B: Intel Xeon X5140, 5130, 5120 and E5310 has been analyzed using the performance numbers of 12 CPU2006 integer benchmarks, performance numbers that exhibit significant differences in performance. The results and analysis can be used by performance engineers, scientists and developers to better understand the performance scaling in modern generation processors.
Simulation of a parallel processor on a serial processor: The neutron diffusion equation

International Nuclear Information System (INIS)

Honeck, H.C.

1981-01-01

Parallel processors could provide the nuclear industry with very high computing power at a very moderate cost. Will we be able to make effective use of this power. This paper explores the use of a very simple parallel processor for solving the neutron diffusion equation to predict power distributions in a nuclear reactor. We first describe a simple parallel processor and estimate its theoretical performance based on the current hardware technology. Next, we show how the parallel processor could be used to solve the neutron diffusion equation. We then present the results of some simulations of a parallel processor run on a serial processor and measure some of the expected inefficiencies. Finally we extrapolate the results to estimate how actual design codes would perform. We find that the standard numerical methods for solving the neutron diffusion equation are still applicable when used on a parallel processor. However, some simple modifications to these methods will be necessary if we are to achieve the full power of these new computers. (orig.) [de
Design and implementation of a high performance network security processor

Science.gov (United States)

Wang, Haixin; Bai, Guoqiang; Chen, Hongyi

2010-03-01

The last few years have seen many significant progresses in the field of application-specific processors. One example is network security processors (NSPs) that perform various cryptographic operations specified by network security protocols and help to offload the computation intensive burdens from network processors (NPs). This article presents a high performance NSP system architecture implementation intended for both internet protocol security (IPSec) and secure socket layer (SSL) protocol acceleration, which are widely employed in virtual private network (VPN) and e-commerce applications. The efficient dual one-way pipelined data transfer skeleton and optimised integration scheme of the heterogenous parallel crypto engine arrays lead to a Gbps rate NSP, which is programmable with domain specific descriptor-based instructions. The descriptor-based control flow fragments large data packets and distributes them to the crypto engine arrays, which fully utilises the parallel computation resources and improves the overall system data throughput. A prototyping platform for this NSP design is implemented with a Xilinx XC3S5000 based FPGA chip set. Results show that the design gives a peak throughput for the IPSec ESP tunnel mode of 2.85 Gbps with over 2100 full SSL handshakes per second at a clock rate of 95 MHz.
The prefrontal landscape: implications of functional architecture for understanding human mentation and the central executive.

Science.gov (United States)

Goldman-Rakic, P S

1996-10-29

The functional architecture of prefrontal cortex is central to our understanding of human mentation and cognitive prowess. This region of the brain is often treated as an undifferentiated structure, on the one hand, or as a mosaic of psychological faculties, on the other. This paper focuses on the working memory processor as a specialization of prefrontal cortex and argues that the different areas within prefrontal cortex represent iterations of this function for different information domains, including spatial cognition, object cognition and additionally, in humans, semantic processing. According to this parallel processing architecture, the 'central executive' could be considered an emergent property of multiple domain-specific processors operating interactively. These processors are specializations of different prefrontal cortical areas, each interconnected both with the domain-relevant long-term storage sites in posterior regions of the cortex and with appropriate output pathways.
Using Software Technology to Specify Abstract Interfaces in VLSI Design.

Science.gov (United States)

1985-01-01

with the complexity lev- els inherent in VLSI design, in that they can capitalize on their foundations in discrete mathemat- ics and the theory of...basis, rather than globally. Such a partitioning of module semantics makes the specification easier to construct and verify intelectual !y; it also...access function definitions. A standard language improves executability characteristics by capitalizing on portable, optimized system software developed
Functional unit for a processor

NARCIS (Netherlands)

Rohani, A.; Kerkhoff, Hans G.

2013-01-01

The invention relates to a functional unit for a processor, such as a Very Large Instruction Word Processor. The invention further relates to a processor comprising at least one such functional unit. The invention further relates to a functional unit and processor capable of mitigating the effect of
A novel low-voltage low-power analogue VLSI implementation of neural networks with on-chip back-propagation learning

Science.gov (United States)

Carrasco, Manuel; Garde, Andres; Murillo, Pilar; Serrano, Luis

2005-06-01

In this paper a novel design and implementation of a VLSI Analogue Neural Net based on Multi-Layer Perceptron (MLP) with on-chip Back Propagation (BP) learning algorithm suitable for the resolution of classification problems is described. In order to implement a general and programmable analogue architecture, the design has been carried out in a hierarchical way. In this way the net has been divided in synapsis-blocks and neuron-blocks providing an easy method for the analysis. These blocks basically consist on simple cells, which are mainly, the activation functions (NAF), derivatives (DNAF), multipliers and weight update circuits. The analogue design is based on current-mode translinear techniques using MOS transistors working in the weak inversion region in order to reduce both the voltage supply and the power consumption. Moreover, with the purpose of minimizing the noise, offset and distortion of even order, the topologies are fully-differential and balanced. The circuit, named ANNE (Analogue Neural NEt), has been prototyped and characterized as a proof of concept on CMOS AMI-0.5A technology occupying a total area of 2.7mm2. The chip includes two versions of neural nets with on-chip BP learning algorithm, which are respectively a 2-1 and a 2-2-1 implementations. The proposed nets have been experimentally tested using supply voltages from 2.5V to 1.8V, which is suitable for single cell lithium-ion battery supply applications. Experimental results of both implementations included in ANNE exhibit a good performance on solving classification problems. These results have been compared with other proposed Analogue VLSI implementations of Neural Nets published in the literature demonstrating that our proposal is very efficient in terms of occupied area and power consumption.

Adaptive signal processor

Energy Technology Data Exchange (ETDEWEB)

Walz, H.V.

1980-07-01

An experimental, general purpose adaptive signal processor system has been developed, utilizing a quantized (clipped) version of the Widrow-Hoff least-mean-square adaptive algorithm developed by Moschner. The system accommodates 64 adaptive weight channels with 8-bit resolution for each weight. Internal weight update arithmetic is performed with 16-bit resolution, and the system error signal is measured with 12-bit resolution. An adapt cycle of adjusting all 64 weight channels is accomplished in 8 ..mu..sec. Hardware of the signal processor utilizes primarily Schottky-TTL type integrated circuits. A prototype system with 24 weight channels has been constructed and tested. This report presents details of the system design and describes basic experiments performed with the prototype signal processor. Finally some system configurations and applications for this adaptive signal processor are discussed.
Adaptive signal processor

International Nuclear Information System (INIS)

Walz, H.V.

1980-07-01

An experimental, general purpose adaptive signal processor system has been developed, utilizing a quantized (clipped) version of the Widrow-Hoff least-mean-square adaptive algorithm developed by Moschner. The system accommodates 64 adaptive weight channels with 8-bit resolution for each weight. Internal weight update arithmetic is performed with 16-bit resolution, and the system error signal is measured with 12-bit resolution. An adapt cycle of adjusting all 64 weight channels is accomplished in 8 μsec. Hardware of the signal processor utilizes primarily Schottky-TTL type integrated circuits. A prototype system with 24 weight channels has been constructed and tested. This report presents details of the system design and describes basic experiments performed with the prototype signal processor. Finally some system configurations and applications for this adaptive signal processor are discussed
Design and Test Space Exploration of Transport-Triggered Architectures

NARCIS (Netherlands)

Zivkovic, V.; Tangelder, R.J.W.T.; Kerkhoff, Hans G.

2000-01-01

This paper describes a new approach in the high level design and test of transport-triggered architectures (TTA), a special type of application specific instruction processors (ASIP). The proposed method introduces the test as an additional constraint, besides throughput and circuit area. The
Multi-Threaded Dense Linear Algebra Libraries for Low-Power Asymmetric Multicore Processors

OpenAIRE

Catalán, Sandra; Herrero, José R.; Igual, Francisco D.; Rodríguez-Sánchez, Rafael; Quintana-Ortí, Enrique S.

2015-01-01

Dense linear algebra libraries, such as BLAS and LAPACK, provide a relevant collection of numerical tools for many scientific and engineering applications. While there exist high performance implementations of the BLAS (and LAPACK) functionality for many current multi-threaded architectures,the adaption of these libraries for asymmetric multicore processors (AMPs)is still pending. In this paper we address this challenge by developing an asymmetry-aware implementation of the BLAS, based on the...
Multithreading in vector processors

Science.gov (United States)

Evangelinos, Constantinos; Kim, Changhoan; Nair, Ravi

2018-01-16

In one embodiment, a system includes a processor having a vector processing mode and a multithreading mode. The processor is configured to operate on one thread per cycle in the multithreading mode. The processor includes a program counter register having a plurality of program counters, and the program counter register is vectorized. Each program counter in the program counter register represents a distinct corresponding thread of a plurality of threads. The processor is configured to execute the plurality of threads by activating the plurality of program counters in a round robin cycle.
One-Chip Solution to Intelligent Robot Control: Implementing Hexapod Subsumption Architecture Using a Contemporary Microprocessor

Directory of Open Access Journals (Sweden)

Nikita Pashenkov

2008-11-01

Full Text Available This paper introduces a six-legged autonomous robot managed by a single controller and a software core modeled on subsumption architecture. We begin by discussing the features and capabilities of IsoPod, a new processor for robotics which has enabled a streamlined implementation of our project. We argue that this processor offers a unique set of hardware and software features, making it a practical development platform for robotics in general and for subsumption-based control architectures in particular. Next, we summarize original ideas on subsumption architecture implementation for a six-legged robot, as presented by its inventor Rodney Brooks in 1980's. A comparison is then made to a more recent example of a hexapod control architecture based on subsumption. The merits of both systems are analyzed and a new subsumption architecture layout is formulated as a response. We conclude with some remarks regarding the development of this project as a hint at new potentials for intelligent robot design, opened up by a recent development in embedded controller market.
Scientific Computing Kernels on the Cell Processor

Energy Technology Data Exchange (ETDEWEB)

Williams, Samuel W.; Shalf, John; Oliker, Leonid; Kamil, Shoaib; Husbands, Parry; Yelick, Katherine

2007-04-04

The slowing pace of commodity microprocessor performance improvements combined with ever-increasing chip power demands has become of utmost concern to computational scientists. As a result, the high performance computing community is examining alternative architectures that address the limitations of modern cache-based designs. In this work, we examine the potential of using the recently-released STI Cell processor as a building block for future high-end computing systems. Our work contains several novel contributions. First, we introduce a performance model for Cell and apply it to several key scientific computing kernels: dense matrix multiply, sparse matrix vector multiply, stencil computations, and 1D/2D FFTs. The difficulty of programming Cell, which requires assembly level intrinsics for the best performance, makes this model useful as an initial step in algorithm design and evaluation. Next, we validate the accuracy of our model by comparing results against published hardware results, as well as our own implementations on a 3.2GHz Cell blade. Additionally, we compare Cell performance to benchmarks run on leading superscalar (AMD Opteron), VLIW (Intel Itanium2), and vector (Cray X1E) architectures. Our work also explores several different mappings of the kernels and demonstrates a simple and effective programming model for Cell's unique architecture. Finally, we propose modest microarchitectural modifications that could significantly increase the efficiency of double-precision calculations. Overall results demonstrate the tremendous potential of the Cell architecture for scientific computations in terms of both raw performance and power efficiency.
Atmel's New Rad-Hard Sparc V8 Processor 200Mhz & Low Power System on Chip

Science.gov (United States)

Ganry, Nicolas; Mantelet, Guy; Parkes, Steve; McClements, Chris

2014-08-01

The AT6981 is a new generation of processor designed for critical spaceflight applications, which combines a high-performance SPARC® V8 radiation hard processor, with enough on-chip memory for many aerospace applications and state-of-the-art SpaceWire networking technology from STAR- Dundee. The AT6981 is implemented in Atmel 90nm rad-hard technology, enabling 200 MHz operating speed for the processor with power consumption levels around 1W. This advanced technology allows strong system integration in a SoC with embedded peripherals like CAN, 1553, Ethernet, DDR and embedded memory with 1Mbytes SRAM. The device is ITAR- free and is developed in France by Atmel Aerospace having more than of 30years space experience. This paper describes this new SoC architecture and technical options considered to insure the best performances, the minimum power consumption and high reliability. This device will be available on the market in H2 2014 for evaluation with first flight models targeted end 2015.
3081/E processor

International Nuclear Information System (INIS)

Kunz, P.F.; Gravina, M.; Oxoby, G.

1984-04-01

The 3081/E project was formed to prepare a much improved IBM mainframe emulator for the future. Its design is based on a large amount of experience in using the 168/E processor to increase available CPU power in both online and offline environments. The processor will be at least equal to the execution speed of a 370/168 and up to 1.5 times faster for heavy floating point code. A single processor will thus be at least four times more powerful than the VAX 11/780, and five processors on a system would equal at least the performance of the IBM 3081K. With its large memory space and simple but flexible high speed interface, the 3081/E is well suited for the online and offline needs of high energy physics in the future
High-Efficient Parallel CAVLC Encoders on Heterogeneous Multicore Architectures

Directory of Open Access Journals (Sweden)

H. Y. Su

2012-04-01

Full Text Available This article presents two high-efficient parallel realizations of the context-based adaptive variable length coding (CAVLC based on heterogeneous multicore processors. By optimizing the architecture of the CAVLC encoder, three kinds of dependences are eliminated or weaken, including the context-based data dependence, the memory accessing dependence and the control dependence. The CAVLC pipeline is divided into three stages: two scans, coding, and lag packing, and be implemented on two typical heterogeneous multicore architectures. One is a block-based SIMD parallel CAVLC encoder on multicore stream processor STORM. The other is a component-oriented SIMT parallel encoder on massively parallel architecture GPU. Both of them exploited rich data-level parallelism. Experiments results show that compared with the CPU version, more than 70 times of speedup can be obtained for STORM and over 50 times for GPU. The implementation of encoder on STORM can make a real-time processing for 1080p @30fps and GPU-based version can satisfy the requirements for 720p real-time encoding. The throughput of the presented CAVLC encoders is more than 10 times higher than that of published software encoders on DSP and multicore platforms.
Does the Intel Xeon Phi processor fit HEP workloads?

Science.gov (United States)

Nowak, A.; Bitzes, G.; Dotti, A.; Lazzaro, A.; Jarp, S.; Szostek, P.; Valsan, L.; Botezatu, M.; Leduc, J.

2014-06-01

This paper summarizes the five years of CERN openlab's efforts focused on the Intel Xeon Phi co-processor, from the time of its inception to public release. We consider the architecture of the device vis a vis the characteristics of HEP software and identify key opportunities for HEP processing, as well as scaling limitations. We report on improvements and speedups linked to parallelization and vectorization on benchmarks involving software frameworks such as Geant4 and ROOT. Finally, we extrapolate current software and hardware trends and project them onto accelerators of the future, with the specifics of offline and online HEP processing in mind.
VLSI Technology for Cognitive Radio

Science.gov (United States)

VIJAYALAKSHMI, B.; SIDDAIAH, P.

2017-08-01

One of the most challenging tasks of cognitive radio is the efficiency in the spectrum sensing scheme to overcome the spectrum scarcity problem. The popular and widely used spectrum sensing technique is the energy detection scheme as it is very simple and doesn’t require any previous information related to the signal. We propose one such approach which is an optimised spectrum sensing scheme with reduced filter structure. The optimisation is done in terms of area and power performance of the spectrum. The simulations of the VLSI structure of the optimised flexible spectrum is done using verilog coding by using the XILINX ISE software. Our method produces performance with 13% reduction in area and 66% reduction in power consumption in comparison to the flexible spectrum sensing scheme. All the results are tabulated and comparisons are made. A new scheme for optimised and effective spectrum sensing opens up with our model.
Integrated fuel processor development

International Nuclear Information System (INIS)

Ahmed, S.; Pereira, C.; Lee, S. H. D.; Krumpelt, M.

2001-01-01

The Department of Energy's Office of Advanced Automotive Technologies has been supporting the development of fuel-flexible fuel processors at Argonne National Laboratory. These fuel processors will enable fuel cell vehicles to operate on fuels available through the existing infrastructure. The constraints of on-board space and weight require that these fuel processors be designed to be compact and lightweight, while meeting the performance targets for efficiency and gas quality needed for the fuel cell. This paper discusses the performance of a prototype fuel processor that has been designed and fabricated to operate with liquid fuels, such as gasoline, ethanol, methanol, etc. Rated for a capacity of 10 kWe (one-fifth of that needed for a car), the prototype fuel processor integrates the unit operations (vaporization, heat exchange, etc.) and processes (reforming, water-gas shift, preferential oxidation reactions, etc.) necessary to produce the hydrogen-rich gas (reformate) that will fuel the polymer electrolyte fuel cell stacks. The fuel processor work is being complemented by analytical and fundamental research. With the ultimate objective of meeting on-board fuel processor goals, these studies include: modeling fuel cell systems to identify design and operating features; evaluating alternative fuel processing options; and developing appropriate catalysts and materials. Issues and outstanding challenges that need to be overcome in order to develop practical, on-board devices are discussed
VLSI Implementation of a Fixed-Complexity Soft-Output MIMO Detector for High-Speed Wireless

Directory of Open Access Journals (Sweden)

Di Wu

2010-01-01

Full Text Available This paper presents a low-complexity MIMO symbol detector with close-Maximum a posteriori performance for the emerging multiantenna enhanced high-speed wireless communications. The VLSI implementation is based on a novel MIMO detection algorithm called Modified Fixed-Complexity Soft-Output (MFCSO detection, which achieves a good trade-off between performance and implementation cost compared to the referenced prior art. By including a microcode-controlled channel preprocessing unit and a pipelined detection unit, it is flexible enough to cover several different standards and transmission schemes. The flexibility allows adaptive detection to minimize power consumption without degradation in throughput. The VLSI implementation of the detector is presented to show that real-time MIMO symbol detection of 20 MHz bandwidth 3GPP LTE and 10 MHz WiMAX downlink physical channel is achievable at reasonable silicon cost.
TMS320C25 Digital Signal Processor For 2-Dimensional Fast Fourier Transform Computation

International Nuclear Information System (INIS)

Ardisasmita, M. Syamsa

1996-01-01

The Fourier transform is one of the most important mathematical tool in signal processing and analysis, which converts information from the time/spatial domain into the frequency domain. Even with implementation of the Fast Fourier Transform algorithms in imaging data, the discrete Fourier transform execution consume a lot of time. Digital signal processors are designed specifically to perform computation intensive digital signal processing algorithms. By taking advantage of the advanced architecture. parallel processing, and dedicated digital signal processing (DSP) instruction sets. This device can execute million of DSP operations per second. The device architecture, characteristics and feature suitable for fast Fourier transform application and speed-up are discussed
A proposed scalable parallel open architecture data acquisition system for low to high rate experiments, test beams and all SSC detectors

International Nuclear Information System (INIS)

Barsotti, E.; Booth, A.; Bowden, M.; Swoboda, C.; Lockyer, N.; Vanberg, R.

1990-01-01

A new era of high-energy physics research is beginning requiring accelerators with much higher luminosities and interaction rates in order to discover new elementary particles. As a consequence, both orders of magnitude higher data rates from the detector and online processing power, well beyond the capabilities of current high energy physics data acquisition systems, are required. This paper describes a proposed new data acquisition system architecture which draws heavily from the communications industry, is totally parallel (i.e., without any bottlenecks), is capable of data rates of hundreds of Gigabytes per second from the detector and into an array of online processors (i.e., processor farm), and uses an open systems architecture to guarantee compatibility with future commercially available online processor farms. The main features of the proposed Scalable Parallel Open Architecture data acquisition system are standard interface ICs to detector subsystems wherever possible, fiber optic digital data transmission from the near-detector electronics, a self-routing parallel event builder, and the use of industry-supported and high-level language programmable processors in the proposed BCD system for both triggers and online filters. A brief status report of an ongoing project at Fermilab to build a prototype of the proposed data acquisition system architecture is given in the paper. The major component of the system, a self-routing parallel event builder, is described in detail
An Efficient Reconfigurable Architecture for Fingerprint Recognition

Directory of Open Access Journals (Sweden)

Satish S. Bhairannawar

2016-01-01

Full Text Available The fingerprint identification is an efficient biometric technique to authenticate human beings in real-time Big Data Analytics. In this paper, we propose an efficient Finite State Machine (FSM based reconfigurable architecture for fingerprint recognition. The fingerprint image is resized, and Compound Linear Binary Pattern (CLBP is applied on fingerprint, followed by histogram to obtain histogram CLBP features. Discrete Wavelet Transform (DWT Level 2 features are obtained by the same methodology. The novel matching score of CLBP is computed using histogram CLBP features of test image and fingerprint images in the database. Similarly, the DWT matching score is computed using DWT features of test image and fingerprint images in the database. Further, the matching scores of CLBP and DWT are fused with arithmetic equation using improvement factor. The performance parameters such as TSR (Total Success Rate, FAR (False Acceptance Rate, and FRR (False Rejection Rate are computed using fusion scores with correlation matching technique for FVC2004 DB3 Database. The proposed fusion based VLSI architecture is synthesized on Virtex xc5vlx30T-3 FPGA board using Finite State Machine resulting in optimized parameters.
SABRE: a bio-inspired fault-tolerant electronic architecture

International Nuclear Information System (INIS)

Bremner, P; Samie, M; Dragffy, G; Pipe, A G; Liu, Y; Tempesti, G; Timmis, J; Tyrrell, A M

2013-01-01

As electronic devices become increasingly complex, ensuring their reliable, fault-free operation is becoming correspondingly more challenging. It can be observed that, in spite of their complexity, biological systems are highly reliable and fault tolerant. Hence, we are motivated to take inspiration for biological systems in the design of electronic ones. In SABRE (self-healing cellular architectures for biologically inspired highly reliable electronic systems), we have designed a bio-inspired fault-tolerant hierarchical architecture for this purpose. As in biology, the foundation for the whole system is cellular in nature, with each cell able to detect faults in its operation and trigger intra-cellular or extra-cellular repair as required. At the next level in the hierarchy, arrays of cells are configured and controlled as function units in a transport triggered architecture (TTA), which is able to perform partial-dynamic reconfiguration to rectify problems that cannot be solved at the cellular level. Each TTA is, in turn, part of a larger multi-processor system which employs coarser grain reconfiguration to tolerate faults that cause a processor to fail. In this paper, we describe the details of operation of each layer of the SABRE hierarchy, and how these layers interact to provide a high systemic level of fault tolerance. (paper)
Virtual Prototyping and Performance Analysis of Two Memory Architectures

Directory of Open Access Journals (Sweden)

Huda S. Muhammad

2009-01-01

Full Text Available The gap between CPU and memory speed has always been a critical concern that motivated researchers to study and analyze the performance of memory hierarchical architectures. In the early stages of the design cycle, performance evaluation methodologies can be used to leverage exploration at the architectural level and assist in making early design tradeoffs. In this paper, we use simulation platforms developed using the VisualSim tool to compare the performance of two memory architectures, namely, the Direct Connect architecture of the Opteron, and the Shared Bus of the Xeon multicore processors. Key variations exist between the two memory architectures and both design approaches provide rich platforms that call for the early use of virtual system prototyping and simulation techniques to assess performance at an early stage in the design cycle.
Video sensor architecture for surveillance applications.

Science.gov (United States)

Sánchez, Jordi; Benet, Ginés; Simó, José E

2012-01-01

This paper introduces a flexible hardware and software architecture for a smart video sensor. This sensor has been applied in a video surveillance application where some of these video sensors are deployed, constituting the sensory nodes of a distributed surveillance system. In this system, a video sensor node processes images locally in order to extract objects of interest, and classify them. The sensor node reports the processing results to other nodes in the cloud (a user or higher level software) in the form of an XML description. The hardware architecture of each sensor node has been developed using two DSP processors and an FPGA that controls, in a flexible way, the interconnection among processors and the image data flow. The developed node software is based on pluggable components and runs on a provided execution run-time. Some basic and application-specific software components have been developed, in particular: acquisition, segmentation, labeling, tracking, classification and feature extraction. Preliminary results demonstrate that the system can achieve up to 7.5 frames per second in the worst case, and the true positive rates in the classification of objects are better than 80%.

Video Sensor Architecture for Surveillance Applications

Directory of Open Access Journals (Sweden)

José E. Simó

2012-02-01

Full Text Available This paper introduces a flexible hardware and software architecture for a smart video sensor. This sensor has been applied in a video surveillance application where some of these video sensors are deployed, constituting the sensory nodes of a distributed surveillance system. In this system, a video sensor node processes images locally in order to extract objects of interest, and classify them. The sensor node reports the processing results to other nodes in the cloud (a user or higher level software in the form of an XML description. The hardware architecture of each sensor node has been developed using two DSP processors and an FPGA that controls, in a flexible way, the interconnection among processors and the image data flow. The developed node software is based on pluggable components and runs on a provided execution run-time. Some basic and application-specific software components have been developed, in particular: acquisition, segmentation, labeling, tracking, classification and feature extraction. Preliminary results demonstrate that the system can achieve up to 7.5 frames per second in the worst case, and the true positive rates in the classification of objects are better than 80%.
A multi coding technique to reduce transition activity in VLSI circuits

International Nuclear Information System (INIS)

Vithyalakshmi, N.; Rajaram, M.

2014-01-01

Advances in VLSI technology have enabled the implementation of complex digital circuits in a single chip, reducing system size and power consumption. In deep submicron low power CMOS VLSI design, the main cause of energy dissipation is charging and discharging of internal node capacitances due to transition activity. Transition activity is one of the major factors that also affect the dynamic power dissipation. This paper proposes power reduction analyzed through algorithm and logic circuit levels. In algorithm level the key aspect of reducing power dissipation is by minimizing transition activity and is achieved by introducing a data coding technique. So a novel multi coding technique is introduced to improve the efficiency of transition activity up to 52.3% on the bus lines, which will automatically reduce the dynamic power dissipation. In addition, 1 bit full adders are introduced in the Hamming distance estimator block, which reduces the device count. This coding method is implemented using Verilog HDL. The overall performance is analyzed by using Modelsim and Xilinx Tools. In total 38.2% power saving capability is achieved compared to other existing methods. (semiconductor technology)
VLSI Design of SVM-Based Seizure Detection System With On-Chip Learning Capability.

Science.gov (United States)

Feng, Lichen; Li, Zunchao; Wang, Yuanfa

2018-02-01

Portable automatic seizure detection system is very convenient for epilepsy patients to carry. In order to make the system on-chip trainable with high efficiency and attain high detection accuracy, this paper presents a very large scale integration (VLSI) design based on the nonlinear support vector machine (SVM). The proposed design mainly consists of a feature extraction (FE) module and an SVM module. The FE module performs the three-level Daubechies discrete wavelet transform to fit the physiological bands of the electroencephalogram (EEG) signal and extracts the time-frequency domain features reflecting the nonstationary signal properties. The SVM module integrates the modified sequential minimal optimization algorithm with the table-driven-based Gaussian kernel to enable efficient on-chip learning. The presented design is verified on an Altera Cyclone II field-programmable gate array and tested using the two publicly available EEG datasets. Experiment results show that the designed VLSI system improves the detection accuracy and training efficiency.
Design of a VLSI Decoder for Partially Structured LDPC Codes

Directory of Open Access Journals (Sweden)

Fabrizio Vacca

2008-01-01

of their parity matrix can be partitioned into two disjoint sets, namely, the structured and the random ones. For the proposed class of codes a constructive design method is provided. To assess the value of this method the constructed codes performance are presented. From these results, a novel decoding method called split decoding is introduced. Finally, to prove the effectiveness of the proposed approach a whole VLSI decoder is designed and characterized.
Communication and Memory Architecture Design of Application-Specific High-End Multiprocessors

Directory of Open Access Journals (Sweden)

Yahya Jan

2012-01-01

Full Text Available This paper is devoted to the design of communication and memory architectures of massively parallel hardware multiprocessors necessary for the implementation of highly demanding applications. We demonstrated that for the massively parallel hardware multiprocessors the traditionally used flat communication architectures and multi-port memories do not scale well, and the memory and communication network influence on both the throughput and circuit area dominates the processors influence. To resolve the problems and ensure scalability, we proposed to design highly optimized application-specific hierarchical and/or partitioned communication and memory architectures through exploring and exploiting the regularity and hierarchy of the actual data flows of a given application. Furthermore, we proposed some data distribution and related data mapping schemes in the shared (global partitioned memories with the aim to eliminate the memory access conflicts, as well as, to ensure that our communication design strategies will be applicable. We incorporated these architecture synthesis strategies into our quality-driven model-based multi-processor design method and related automated architecture exploration framework. Using this framework, we performed a large series of experiments that demonstrate many various important features of the synthesized memory and communication architectures. They also demonstrate that our method and related framework are able to efficiently synthesize well scalable memory and communication architectures even for the high-end multiprocessors. The gains as high as 12-times in performance and 25-times in area can be obtained when using the hierarchical communication networks instead of the flat networks. However, for the high parallelism levels only the partitioned approach ensures the scalability in performance.
Stream-processing pipelines: processing of streams on multiprocessor architecture

NARCIS (Netherlands)

Kavaldjiev, N.K.; Smit, Gerardus Johannes Maria; Jansen, P.G.

In this paper we study the timing aspects of the operation of stream-processing applications that run on a multiprocessor architecture. Dependencies are derived for the processing and communication times of the processors in such a system. Three cases of real-time constrained operation and four
Application of Raptor-M3G to reactor dosimetry problems on massively parallel architectures - 026

International Nuclear Information System (INIS)

Longoni, G.

2010-01-01

The solution of complex 3-D radiation transport problems requires significant resources both in terms of computation time and memory availability. Therefore, parallel algorithms and multi-processor architectures are required to solve efficiently large 3-D radiation transport problems. This paper presents the application of RAPTOR-M3G (Rapid Parallel Transport Of Radiation - Multiple 3D Geometries) to reactor dosimetry problems. RAPTOR-M3G is a newly developed parallel computer code designed to solve the discrete ordinates (SN) equations on multi-processor computer architectures. This paper presents the results for a reactor dosimetry problem using a 3-D model of a commercial 2-loop pressurized water reactor (PWR). The accuracy and performance of RAPTOR-M3G will be analyzed and the numerical results obtained from the calculation will be compared directly to measurements of the neutron field in the reactor cavity air gap. The parallel performance of RAPTOR-M3G on massively parallel architectures, where the number of computing nodes is in the order of hundreds, will be analyzed up to four hundred processors. The performance results will be presented based on two supercomputing architectures: the POPLE supercomputer operated by the Pittsburgh Supercomputing Center and the Westinghouse computer cluster. The Westinghouse computer cluster is equipped with a standard Ethernet network connection and an InfiniBand R interconnects capable of a bandwidth in excess of 20 GBit/sec. Therefore, the impact of the network architecture on RAPTOR-M3G performance will be analyzed as well. (authors)
Formal verification an essential toolkit for modern VLSI design

CERN Document Server

Seligman, Erik; Kumar, M V Achutha Kiran

2015-01-01

Formal Verification: An Essential Toolkit for Modern VLSI Design presents practical approaches for design and validation, with hands-on advice for working engineers integrating these techniques into their work. Building on a basic knowledge of System Verilog, this book demystifies FV and presents the practical applications that are bringing it into mainstream design and validation processes at Intel and other companies. The text prepares readers to effectively introduce FV in their organization and deploy FV techniques to increase design and validation productivity. Presents formal verific
DPL/Daedalus design environment (for VLSI)

Energy Technology Data Exchange (ETDEWEB)

Batali, J; Mayle, N; Shrobe, H; Sussman, G; Weise, D

1981-01-01

The DPL/Daedalus design environment is an interactive VLSI design system implemented at the MIT Artificial Intelligence Laboratory. The system consists of several components: a layout language called DPL (for design procedure language); an interactive graphics facility (Daedalus); and several special purpose design procedures for constructing complex artifacts such as PLAs and microprocessor data paths. Coordinating all of these is a generalized property list data base which contains both the data representing circuits and the procedures for constructing them. The authors first review the nature of the data base and then turn to DPL and Daedalus, the two most common ways of entering information into the data base. The next two sections review the specialized procedures for constructing PLAs and data paths; the final section describes a tool for hierarchical node extraction. 5 references.
Transportable GPU (General Processor Units) chip set technology for standard computer architectures

Science.gov (United States)

Fosdick, R. E.; Denison, H. C.

1982-11-01

The USAFR-developed GPU Chip Set has been utilized by Tracor to implement both USAF and Navy Standard 16-Bit Airborne Computer Architectures. Both configurations are currently being delivered into DOD full-scale development programs. Leadless Hermetic Chip Carrier packaging has facilitated implementation of both architectures on single 41/2 x 5 substrates. The CMOS and CMOS/SOS implementations of the GPU Chip Set have allowed both CPU implementations to use less than 3 watts of power each. Recent efforts by Tracor for USAF have included the definition of a next-generation GPU Chip Set that will retain the application-proven architecture of the current chip set while offering the added cost advantages of transportability across ISO-CMOS and CMOS/SOS processes and across numerous semiconductor manufacturers using a newly-defined set of common design rules. The Enhanced GPU Chip Set will increase speed by an approximate factor of 3 while significantly reducing chip counts and costs of standard CPU implementations.
A longitudinal multi-bunch feedback system using parallel digital signal processors

International Nuclear Information System (INIS)

Sapozhnikov, L.; Fox, J.D.; Olsen, J.J.; Oxoby, G.; Linscott, I.; Drago, A.; Serio, M.

1994-01-01

A programmable longitudinal feedback system based on four AT ampersand T 1610 digital signal processors has been developed as a component of the PEP-II R ampersand D program. This longitudinal quick prototype is a proof of concept for the PEP-II system and implements full-speed bunch-by-bunch signal processing for storage rings with bunch spacings of 4 ns. The design incorporates a phase-detector-based front end that digitizes the oscillation phases of bunches at the 250 MHz crossing rate, four programmable signal processors that compute correction signals, and a 250-MHz hold buffer/kicker driver stage that applies correction signals back on the beam. The design implements a general-purpose, table-driven downsampler that allows the system to be operated at several accelerator facilities. The hardware architecture of the signal processing is described, and the software algorithms used in the feedback signal computation are discussed. The system configuration used for tests at the LBL Advanced Light Source is presented
Programming the Linpack Benchmark for the IBM PowerXCell 8i Processor

Directory of Open Access Journals (Sweden)

Michael Kistler

2009-01-01

Full Text Available In this paper we present the design and implementation of the Linpack benchmark for the IBM BladeCenter QS22, which incorporates two IBM PowerXCell 8i1 processors. The PowerXCell 8i is a new implementation of the Cell Broadband Engine™2 architecture and contains a set of special-purpose processing cores known as Synergistic Processing Elements (SPEs. The SPEs can be used as computational accelerators to augment the main PowerPC processor. The added computational capability of the SPEs results in a peak double precision floating point capability of 108.8 GFLOPS. We explain how we modified the standard open source implementation of Linpack to accelerate key computational kernels using the SPEs of the PowerXCell 8i processors. We describe in detail the implementation and performance of the computational kernels and also explain how we employed the SPEs for high-speed data movement and reformatting. The result of these modifications is a Linpack benchmark optimized for the IBM PowerXCell 8i processor that achieves 170.7 GFLOPS on a BladeCenter QS22 with 32 GB of DDR2 SDRAM memory. Our implementation of Linpack also supports clusters of QS22s, and was used to achieve a result of 11.1 TFLOPS on a cluster of 84 QS22 blades. We compare our results on a single BladeCenter QS22 with the base Linpack implementation without SPE acceleration to illustrate the benefits of our optimizations.
Kalman filter tracking on parallel architectures

Science.gov (United States)

Cerati, G.; Elmer, P.; Krutelyov, S.; Lantz, S.; Lefebvre, M.; McDermott, K.; Riley, D.; Tadel, M.; Wittich, P.; Wurthwein, F.; Yagil, A.

2017-10-01

We report on the progress of our studies towards a Kalman filter track reconstruction algorithm with optimal performance on manycore architectures. The combinatorial structure of these algorithms is not immediately compatible with an efficient SIMD (or SIMT) implementation; the challenge for us is to recast the existing software so it can readily generate hundreds of shared-memory threads that exploit the underlying instruction set of modern processors. We show how the data and associated tasks can be organized in a way that is conducive to both multithreading and vectorization. We demonstrate very good performance on Intel Xeon and Xeon Phi architectures, as well as promising first results on Nvidia GPUs.
3D-SoftChip: A Novel Architecture for Next-Generation Adaptive Computing Systems

Directory of Open Access Journals (Sweden)

Lee Mike Myung-Ok

2006-01-01

Full Text Available This paper introduces a novel architecture for next-generation adaptive computing systems, which we term 3D-SoftChip. The 3D-SoftChip is a 3-dimensional (3D vertically integrated adaptive computing system combining state-of-the-art processing and 3D interconnection technology. It comprises the vertical integration of two chips (a configurable array processor and an intelligent configurable switch through an indium bump interconnection array (IBIA. The configurable array processor (CAP is an array of heterogeneous processing elements (PEs, while the intelligent configurable switch (ICS comprises a switch block, 32-bit dedicated RISC processor for control, on-chip program/data memory, data frame buffer, along with a direct memory access (DMA controller. This paper introduces the novel 3D-SoftChip architecture for real-time communication and multimedia signal processing as a next-generation computing system. The paper further describes the advanced HW/SW codesign and verification methodology, including high-level system modeling of the 3D-SoftChip using SystemC, being used to determine the optimum hardware specification in the early design stage.
An area-efficient path memory structure for VLSI Implementation of high speed Viterbi decoders

DEFF Research Database (Denmark)

Paaske, Erik; Pedersen, Steen; Sparsø, Jens

1991-01-01

Path storage and selection methods for Viterbi decoders are investigated with special emphasis on VLSI implementations. Two well-known algorithms, the register exchange, algorithm, REA, and the trace back algorithm, TBA, are considered. The REA requires the smallest number of storage elements...
Algorithms for computational fluid dynamics n parallel processors

International Nuclear Information System (INIS)

Van de Velde, E.F.

1986-01-01

A study of parallel algorithms for the numerical solution of partial differential equations arising in computational fluid dynamics is presented. The actual implementation on parallel processors of shared and nonshared memory design is discussed. The performance of these algorithms is analyzed in terms of machine efficiency, communication time, bottlenecks and software development costs. For elliptic equations, a parallel preconditioned conjugate gradient method is described, which has been used to solve pressure equations discretized with high order finite elements on irregular grids. A parallel full multigrid method and a parallel fast Poisson solver are also presented. Hyperbolic conservation laws were discretized with parallel versions of finite difference methods like the Lax-Wendroff scheme and with the Random Choice method. Techniques are developed for comparing the behavior of an algorithm on different architectures as a function of problem size and local computational effort. Effective use of these advanced architecture machines requires the use of machine dependent programming. It is shown that the portability problems can be minimized by introducing high level operations on vectors and matrices structured into program libraries
Motion camera based on a custom vision sensor and an FPGA architecture

Science.gov (United States)

Arias-Estrada, Miguel

1998-09-01

A digital camera for custom focal plane arrays was developed. The camera allows the test and development of analog or mixed-mode arrays for focal plane processing. The camera is used with a custom sensor for motion detection to implement a motion computation system. The custom focal plane sensor detects moving edges at the pixel level using analog VLSI techniques. The sensor communicates motion events using the event-address protocol associated to a temporal reference. In a second stage, a coprocessing architecture based on a field programmable gate array (FPGA) computes the time-of-travel between adjacent pixels. The FPGA allows rapid prototyping and flexible architecture development. Furthermore, the FPGA interfaces the sensor to a compact PC computer which is used for high level control and data communication to the local network. The camera could be used in applications such as self-guided vehicles, mobile robotics and smart surveillance systems. The programmability of the FPGA allows the exploration of further signal processing like spatial edge detection or image segmentation tasks. The article details the motion algorithm, the sensor architecture, the use of the event- address protocol for velocity vector computation and the FPGA architecture used in the motion camera system.
An Evaluation of an Ada Implementation of the Rete Algorithm for Embedded Flight Processors

Science.gov (United States)

1990-12-01

computers was desired. The VAX VMS operating system has many built-in methods for determining program performance (including VAX PCA), but these methods... overviev , of the target environment-- the MIL-STD-1750A VHSIC Avionic Modular Processor ( VA.IP, running under the Ada Avionics Real-Time Software (AARTS... computers . Mil-STD-1750A, the Air Force’s standard flight computer architecture, however, places severe constraints on applications software processing
First results from a silicon-strip detector with VLSI readout

International Nuclear Information System (INIS)

Anzivino, G.; Horisberger, R.; Hubbeling, L.; Hyams, B.; Parker, S.; Breakstone, A.; Litke, A.M.; Walker, J.T.; Bingefors, N.

1986-01-01

A 256-strip silicon detector with 25 μm strip pitch, connected to two 128-channel NMOS VLSI chips (Microplex), has been tested using straight-through tracks from a ruthenium beta source. The readout channels have a pitch of 47.5 μm. A single multiplexed output provides voltages proportional to the integrated charge from each strip. The most probable signal height from the beta traversals is approximately 14 times the rms noise in any single channel. (orig.)
Real-time tracking with a 3D-flow processor array

International Nuclear Information System (INIS)

Crosetto, D.

1993-01-01

The problem of real-time track-finding has been performed to date with CAM (Content Addressable Memories) or with fast coincidence logic, because the processing scheme was though to have much slower performance. Advances in technology together with a new architectural approach make it feasible to also explore the computing technique for real-time track finding thus giving the advantages of implementing algorithms that can find more parameters such as calculate the sagitta, curvature, pt, etc. with respect to the CAM approach. This report describes real-time track finding using a new computing approach technique based on the 3D-flow array processor system. This system consists of a fixed interconnection architexture scheme, allowing flexible algorithm implementation on a scalable platform. The 3D-Flow parallel processing system for track finding is scalable in size and performance by either increasing the number of processors, or increasing the speed or else the number of pipelined stages. The present article describes the conceptual idea and the design stage of the project

XOP: A second generation fast processor for on-line use in high energy physics experiments

International Nuclear Information System (INIS)

Lingjaerde, T.

1981-01-01

Processors for trigger calculations and data compression in high energy physics are characterized by a high data input capability combined with fas execution of relatively simple routines. In order to achieve the required performance it is advantageous to replace the classical computer instruction-set by microcoded instructions, the various fields of which control the internal subunits in parallel. The fast processor called ESOP is based on such a principle: the different operations are handled step by step by dedicated optimized modules under control of a central instruction unit. Thus, the arithmetic operations, address calculations, conditional checking, loop counts and next instruction evaluation all overlap in time. Based upon the experience from ESOP the architecture of a new processor 'XOP' is beginning to take shape which will be faster and easier to use. In this context the most important innovations are: easy handling of operands in the arithmetic unit by means of three data buses and large data files, a powerful data addressing unit for easy handling of vectors, as well as single operands, and a very flexible logic for conditional branching. Input/output will be made transparent through the introduction of internal fast processors which will be used in conjunction with powerful firmware as a software debugging aid. (orig.)
Does the Intel Xeon Phi processor fit HEP workloads?

International Nuclear Information System (INIS)

Nowak, A; Bitzes, G; Dotti, A; Lazzaro, A; Jarp, S; Szostek, P; Valsan, L; Botezatu, M; Leduc, J

2014-01-01

This paper summarizes the five years of CERN openlab's efforts focused on the Intel Xeon Phi co-processor, from the time of its inception to public release. We consider the architecture of the device vis a vis the characteristics of HEP software and identify key opportunities for HEP processing, as well as scaling limitations. We report on improvements and speedups linked to parallelization and vectorization on benchmarks involving software frameworks such as Geant4 and ROOT. Finally, we extrapolate current software and hardware trends and project them onto accelerators of the future, with the specifics of offline and online HEP processing in mind.
Database architecture evolution: Mammals flourished long before dinosaurs became extinct

NARCIS (Netherlands)

S. Manegold (Stefan); M.L. Kersten (Martin); P.A. Boncz (Peter)

2009-01-01

textabstractThe holy grail for database architecture research is to find a solution that is Scalable & Speedy, to run on anything from small ARM processors up to globally distributed compute clusters, Stable & Secure, to service a broad user community, Small & Simple, to be comprehensible to a small
Design Optimization of Mixed-Criticality Real-Time Applications on Cost-Constrained Partitioned Architectures

DEFF Research Database (Denmark)

Tamas-Selicean, Domitian; Pop, Paul

2011-01-01

In this paper we are interested to implement mixed-criticality hard real-time applications on a given heterogeneous distributed architecture. Applications have different criticality levels, captured by their Safety-Integrity Level (SIL), and are scheduled using static-cyclic scheduling. Mixed......-criticality tasks can be integrated onto the same architecture only if there is enough spatial and temporal separation among them. We consider that the separation is provided by partitioning, such that applications run in separate partitions, and each partition is allocated several time slots on a processor. Tasks...... slots on each processor and (iv) the schedule tables, such that all the applications are schedulable and the development costs are minimized. We have proposed a Tabu Search-based approach to solve this optimization problem. The proposed algorithm has been evaluated using several synthetic and real...
Single-chip serial channel enhances multi-processor systems

Energy Technology Data Exchange (ETDEWEB)

Millar, J.

1982-01-01

In this paper multiprocessor systems are described and explained. The impact that VLSI advancements are having on multiprocessor design is pointed out. The TMS 7041 single-chip microcomputer is described briefly, highlighting its multiprocessor communication capability. And finally, a typical multiprocessor system is shown, implementing the TMS 7041.
Fast-prototyping of VLSI

International Nuclear Information System (INIS)

Saucier, G.; Read, E.

1987-01-01

Fast-prototyping will be a reality in the very near future if both straightforward design methods and fast manufacturing facilities are available. This book focuses, first, on the motivation for fast-prototyping. Economic aspects and market considerations are analysed by European and Japanese companies. In the second chapter, new design methods are identified, mainly for full custom circuits. Of course, silicon compilers play a key role and the introduction of artificial intelligence techniques sheds a new light on the subject. At present, fast-prototyping on gate arrays or on standard cells is the most conventional technique and the third chapter updates the state-of-the art in this area. The fourth chapter concentrates specifically on the e-beam direct-writing for submicron IC technologies. In the fifth chapter, a strategic point in fast-prototyping, namely the test problem is addressed. The design for testability and the interface to the test equipment are mandatory to fulfill the test requirement for fast-prototyping. Finally, the last chapter deals with the subject of education when many people complain about the lack of use of fast-prototyping in higher education for VLSI
The Fermilab Advanced Computer Program multi-array processor system (ACPMAPS): A site oriented supercomputer for theoretical physics

International Nuclear Information System (INIS)

Nash, T.; Areti, H.; Atac, R.

1988-08-01

The ACP Multi-Array Processor System (ACPMAPS) is a highly cost effective, local memory parallel computer designed for floating point intensive grid based problems. The processing nodes of the system are single board array processors based on the FORTRAN and C programmable Weitek XL chip set. The nodes are connected by a network of very high bandwidth 16 port crossbar switches. The architecture is designed to achieve the highest possible cost effectiveness while maintaining a high level of programmability. The primary application of the machine at Fermilab will be lattice gauge theory. The hardware is supported by a transparent site oriented software system called CANOPY which shields theorist users from the underlying node structure. 4 refs., 2 figs
JPP: A Java Pre-Processor

OpenAIRE

Kiniry, Joseph R.; Cheong, Elaine

1998-01-01

The Java Pre-Processor, or JPP for short, is a parsing pre-processor for the Java programming language. Unlike its namesake (the C/C++ Pre-Processor, cpp), JPP provides functionality above and beyond simple textual substitution. JPP's capabilities include code beautification, code standard conformance checking, class and interface specification and testing, and documentation generation.
RISC. A new style in the design of architectures

International Nuclear Information System (INIS)

Cortadella, J.; Gonzalez, A.; Llaberia, J.M.

1988-01-01

In the 80's a new architecture design and implementation style has appeared: the RISC style. It proposes an overall view of the system were the processor is included. For each function, an extensive analysis has to be performed in order to evaluate the advantages and disadvantages that hardware and software introduce in the design. An optimum design involved an agreement between both levels and has to take into account cost, performance, and technological factors. In this paper, the main features of this new architecture design style are presented. (Author)
Fast Tracker Performance using the new ”Variable Resolution Associative Memory” for ATLAS

CERN Document Server

Iizawa, T; The ATLAS collaboration

2012-01-01

The Fast Tracker (FTK) for the ATLAS trigger is the only state-of-the-art online processor that tackles and solves the full track reconstruction problem at a hadron collider. We describe an important advancement for the Associative Memory device (AM). The AM is a VLSI processor for pattern recognition based on Content Addressable Memory (CAM) architecture. Pattern matching is carried out by finding track candidates in coarse resolution ”roads”. A large AM bank stores all trajectories of interest, called ”patterns”, for a given detector resolution. The AM extracts roads compatible with a given event during detector read-out. Two important variables characterize the quality of the AM bank: its ”coverage” and the level of fake roads. The coverage, which describes the geometric efficiency of a bank, is defined as the fraction of tracks that match at least one pattern in the bank. Given a certain road size, the coverage of the bank can be increased just adding patterns to the bank, while the number of ...
Key Technologies of Phone Storage Forensics Based on ARM Architecture

Science.gov (United States)

Zhang, Jianghan; Che, Shengbing

2018-03-01

Smart phones are mainly running Android, IOS and Windows Phone three mobile platform operating systems. The android smart phone has the best market shares and its processor chips are almost ARM software architecture. The chips memory address mapping mechanism of ARM software architecture is different with x86 software architecture. To forensics to android mart phone, we need to understand three key technologies: memory data acquisition, the conversion mechanism from virtual address to the physical address, and find the system’s key data. This article presents a viable solution which does not rely on the operating system API for a complete solution to these three issues.
Flexible feature-space-construction architecture and its VLSI implementation for multi-scale object detection

Science.gov (United States)

Luo, Aiwen; An, Fengwei; Zhang, Xiangyu; Chen, Lei; Huang, Zunkai; Jürgen Mattausch, Hans

2018-04-01

Feature extraction techniques are a cornerstone of object detection in computer-vision-based applications. The detection performance of vison-based detection systems is often degraded by, e.g., changes in the illumination intensity of the light source, foreground-background contrast variations or automatic gain control from the camera. In order to avoid such degradation effects, we present a block-based L1-norm-circuit architecture which is configurable for different image-cell sizes, cell-based feature descriptors and image resolutions according to customization parameters from the circuit input. The incorporated flexibility in both the image resolution and the cell size for multi-scale image pyramids leads to lower computational complexity and power consumption. Additionally, an object-detection prototype for performance evaluation in 65 nm CMOS implements the proposed L1-norm circuit together with a histogram of oriented gradients (HOG) descriptor and a support vector machine (SVM) classifier. The proposed parallel architecture with high hardware efficiency enables real-time processing, high detection robustness, small chip-core area as well as low power consumption for multi-scale object detection.
Producing chopped firewood with firewood processors

International Nuclear Information System (INIS)

Kaerhae, K.; Jouhiaho, A.

2009-01-01

The TTS Institute's research and development project studied both the productivity of new, chopped firewood processors (cross-cutting and splitting machines) suitable for professional and independent small-scale production, and the costs of the chopped firewood produced. Seven chopped firewood processors were tested in the research, six of which were sawing processors and one shearing processor. The chopping work was carried out using wood feeding racks and a wood lifter. The work was also carried out without any feeding appliances. Altogether 132.5 solid m 3 of wood were chopped in the time studies. The firewood processor used had the most significant impact on chopping work productivity. In addition to the firewood processor, the stem mid-diameter, the length of the raw material, and of the firewood were also found to affect productivity. The wood feeding systems also affected productivity. If there is a feeding rack and hydraulic grapple loader available for use in chopping firewood, then it is worth using the wood feeding rack. A wood lifter is only worth using with the largest stems (over 20 cm mid-diameter) if a feeding rack cannot be used. When producing chopped firewood from small-diameter wood, i.e. with a mid-diameter less than 10 cm, the costs of chopping work were over 10 EUR solid m -3 with sawing firewood processors. The shearing firewood processor with a guillotine blade achieved a cost level of 5 EUR solid m -3 when the mid-diameter of the chopped stem was 10 cm. In addition to the raw material, the cost-efficient chopping work also requires several hundred annual operating hours with a firewood processor, which is difficult for individual firewood entrepreneurs to achieve. The operating hours of firewood processors can be increased to the required level by the joint use of the processors by a number of firewood entrepreneurs. (author)
Implementation of a VLSI Level Zero Processing system utilizing the functional component approach

Science.gov (United States)

Shi, Jianfei; Horner, Ward P.; Grebowsky, Gerald J.; Chesney, James R.

1991-01-01

A high rate Level Zero Processing system is currently being prototyped at NASA/Goddard Space Flight Center (GSFC). Based on state-of-the-art VLSI technology and the functional component approach, the new system promises capabilities of handling multiple Virtual Channels and Applications with a combined data rate of up to 20 Megabits per second (Mbps) at low cost.
Toward an understanding of the building blocks: constructing programs for high processor count systems

International Nuclear Information System (INIS)

Reilly, M H

2008-01-01

Technology and industry trends have clearly shown that the future of technical computing lies in exploitation of more processors in larger multiprocessor systems. Exploitation of high processor count architectures demands a more thorough understanding of the underlying system dynamics and an accounting for them in the design of high-performance applications. Currently these dynamics are incompletely described by the widely adopted benchmarks and kernel metrics. Systems are most often characterized to allow comparisons and ranking. Often the characterizations are in the form of a scalar measure of some aspect of system performance that is a 'not to exceed' number: the maximum possible level of performance that could be attained. While such comparisons typically drive both system design and procurement, more useful characterizations can be used to drive application development and design. This paper explores a few of these measures and presents a few simple examples of their application. The first set of metrics addresses individual processor performance, specifically performance related to memory references. The second set of metrics attempts to describe the behavior of the message-passing system under load and across a range of conditions
Design of a real-time wind turbine simulator using a custom parallel architecture

Science.gov (United States)

Hoffman, John A.; Gluck, R.; Sridhar, S.

1995-01-01

The design of a new parallel-processing digital simulator is described. The new simulator has been developed specifically for analysis of wind energy systems in real time. The new processor has been named: the Wind Energy System Time-domain simulator, version 3 (WEST-3). Like previous WEST versions, WEST-3 performs many computations in parallel. The modules in WEST-3 are pure digital processors, however. These digital processors can be programmed individually and operated in concert to achieve real-time simulation of wind turbine systems. Because of this programmability, WEST-3 is very much more flexible and general than its two predecessors. The design features of WEST-3 are described to show how the system produces high-speed solutions of nonlinear time-domain equations. WEST-3 has two very fast Computational Units (CU's) that use minicomputer technology plus special architectural features that make them many times faster than a microcomputer. These CU's are needed to perform the complex computations associated with the wind turbine rotor system in real time. The parallel architecture of the CU causes several tasks to be done in each cycle, including an IO operation and the combination of a multiply, add, and store. The WEST-3 simulator can be expanded at any time for additional computational power. This is possible because the CU's interfaced to each other and to other portions of the simulation using special serial buses. These buses can be 'patched' together in essentially any configuration (in a manner very similar to the programming methods used in analog computation) to balance the input/ output requirements. CU's can be added in any number to share a given computational load. This flexible bus feature is very different from many other parallel processors which usually have a throughput limit because of rigid bus architecture.
Eigensolution of finite element problems in a completely connected parallel architecture

Science.gov (United States)

Akl, Fred A.; Morel, Michael R.

1989-01-01

A parallel algorithm for the solution of the generalized eigenproblem in linear elastic finite element analysis, (K)(phi)=(M)(phi)(omega), where (K) and (M) are of order N, and (omega) is of order q is presented. The parallel algorithm is based on a completely connected parallel architecture in which each processor is allowed to communicate with all other processors. The algorithm has been successfully implemented on a tightly coupled multiple-instruction-multiple-data (MIMD) parallel processing computer, Cray X-MP. A finite element model is divided into m domains each of which is assumed to process n elements. Each domain is then assigned to a processor, or to a logical processor (task) if the number of domains exceeds the number of physical processors. The macro-tasking library routines are used in mapping each domain to a user task. Computational speed-up and efficiency are used to determine the effectiveness of the algorithm. The effect of the number of domains, the number of degrees-of-freedom located along the global fronts and the dimension of the subspace on the performance of the algorithm are investigated. For a 64-element rectangular plate, speed-ups of 1.86, 3.13, 3.18 and 3.61 are achieved on two, four, six and eight processors, respectively.
Multi-net optimization of VLSI interconnect

CERN Document Server

Moiseev, Konstantin; Wimer, Shmuel

2015-01-01

This book covers layout design and layout migration methodologies for optimizing multi-net wire structures in advanced VLSI interconnects. Scaling-dependent models for interconnect power, interconnect delay and crosstalk noise are covered in depth, and several design optimization problems are addressed, such as minimization of interconnect power under delay constraints, or design for minimal delay in wire bundles within a given routing area. A handy reference or a guide for design methodologies and layout automation techniques, this book provides a foundation for physical design challenges of interconnect in advanced integrated circuits. • Describes the evolution of interconnect scaling and provides new techniques for layout migration and optimization, focusing on multi-net optimization; • Presents research results that provide a level of design optimization which does not exist in commercially-available design automation software tools; • Includes mathematical properties and conditions for optimal...
Multi-threaded ATLAS simulation on Intel Knights Landing processors

CERN Document Server

AUTHOR|(INSPIRE)INSPIRE-00014247; The ATLAS collaboration; Calafiura, Paolo; Leggett, Charles; Tsulaia, Vakhtang; Dotti, Andrea

2017-01-01

The Knights Landing (KNL) release of the Intel Many Integrated Core (MIC) Xeon Phi line of processors is a potential game changer for HEP computing. With 72 cores and deep vector registers, the KNL cards promise significant performance benefits for highly-parallel, compute-heavy applications. Cori, the newest supercomputer at the National Energy Research Scientific Computing Center (NERSC), was delivered to its users in two phases with the first phase online at the end of 2015 and the second phase now online at the end of 2016. Cori Phase 2 is based on the KNL architecture and contains over 9000 compute nodes with 96GB DDR4 memory. ATLAS simulation with the multithreaded Athena Framework (AthenaMT) is a good potential use-case for the KNL architecture and supercomputers like Cori. ATLAS simulation jobs have a high ratio of CPU computation to disk I/O and have been shown to scale well in multi-threading and across many nodes. In this paper we will give an overview of the ATLAS simulation application with detai...
Multi-threaded ATLAS Simulation on Intel Knights Landing Processors

CERN Document Server

Farrell, Steven; The ATLAS collaboration; Calafiura, Paolo; Leggett, Charles

2016-01-01

The Knights Landing (KNL) release of the Intel Many Integrated Core (MIC) Xeon Phi line of processors is a potential game changer for HEP computing. With 72 cores and deep vector registers, the KNL cards promise significant performance benefits for highly-parallel, compute-heavy applications. Cori, the newest supercomputer at the National Energy Research Scientific Computing Center (NERSC), will be delivered to its users in two phases with the first phase online now and the second phase expected in mid-2016. Cori Phase 2 will be based on the KNL architecture and will contain over 9000 compute nodes with 96GB DDR4 memory. ATLAS simulation with the multithreaded Athena Framework (AthenaMT) is a great use-case for the KNL architecture and supercomputers like Cori. Simulation jobs have a high ratio of CPU computation to disk I/O and have been shown to scale well in multi-threading and across many nodes. In this presentation we will give an overview of the ATLAS simulation application with details on its multi-thr...

Physico-topological methods of increasing stability of the VLSI circuit components to irradiation. Fiziko-topologhicheskie sposoby uluchsheniya radiatsionnoj stojkosti komponentov BIS

Energy Technology Data Exchange (ETDEWEB)

Pereshenkov, V S [MIFI, Moscow, (Russian Federation); Shishianu, F S; Rusanovskij, V I [S. Lazo KPI, Chisinau, (Moldova, Republic of)

1992-01-01

The paper presents the method used and the experimental results obtained for 8-bit microprocessor irradiated with [gamma]-rays and neutrons. The correlation between the electrical and technological parameters with the irradiation ones is revealed. The influence of leakage current between devices incorporated in VLSI circuits was studied. The obtained results create the possibility to determine the technological parameters necessary for designing the circuit able to work at predetermined doses. The necessary substrate doping concentration for isolation which eliminates the leakage current between devices prevents the VLSI circuit break down was determined. (Author).
A track reconstructing low-latency trigger processor for high-energy physics

International Nuclear Information System (INIS)

Cuveland, Jan de

2009-01-01

The detection and analysis of the large number of particles emerging from high-energy collisions between atomic nuclei is a major challenge in experimental heavy-ion physics. Efficient trigger systems help to focus the analysis on relevant events. A primary objective of the Transition Radiation Detector of the ALICE experiment at the LHC is to trigger on high-momentum electrons. In this thesis, a trigger processor is presented that employs massive parallelism to perform the required online event reconstruction within 2 μs to contribute to the Level-1 trigger decision. Its three-stage hierarchical architecture comprises 109 nodes based on FPGA technology. Ninety processing nodes receive data from the detector front-end at an aggregate net bandwidth of 2.16 Tbit/s via 1080 optical links. Using specifically developed components and interconnections, the system combines high bandwidth with minimum latency. The employed tracking algorithm three-dimensionally reassembles the track segments found in the detector's drift chambers based on explicit value comparisons, calculates the momentum of the originating particles from the course of the reconstructed tracks, and finally leads to a trigger decision. The architecture is capable of processing up to 20 000 track segments in less than 2 μs with high detection efficiency and reconstruction precision for high-momentum particles. As a result, this thesis shows how a trigger processor performing complex online track reconstruction within tight real-time requirements can be realized. The presented hardware has been built and is in continuous data taking operation in the ALICE experiment. (orig.)
A track reconstructing low-latency trigger processor for high-energy physics

Energy Technology Data Exchange (ETDEWEB)

Cuveland, Jan de

2009-09-17

The detection and analysis of the large number of particles emerging from high-energy collisions between atomic nuclei is a major challenge in experimental heavy-ion physics. Efficient trigger systems help to focus the analysis on relevant events. A primary objective of the Transition Radiation Detector of the ALICE experiment at the LHC is to trigger on high-momentum electrons. In this thesis, a trigger processor is presented that employs massive parallelism to perform the required online event reconstruction within 2 {mu}s to contribute to the Level-1 trigger decision. Its three-stage hierarchical architecture comprises 109 nodes based on FPGA technology. Ninety processing nodes receive data from the detector front-end at an aggregate net bandwidth of 2.16 Tbit/s via 1080 optical links. Using specifically developed components and interconnections, the system combines high bandwidth with minimum latency. The employed tracking algorithm three-dimensionally reassembles the track segments found in the detector's drift chambers based on explicit value comparisons, calculates the momentum of the originating particles from the course of the reconstructed tracks, and finally leads to a trigger decision. The architecture is capable of processing up to 20 000 track segments in less than 2 {mu}s with high detection efficiency and reconstruction precision for high-momentum particles. As a result, this thesis shows how a trigger processor performing complex online track reconstruction within tight real-time requirements can be realized. The presented hardware has been built and is in continuous data taking operation in the ALICE experiment. (orig.)
Accuracy-Energy Configurable Sensor Processor and IoT Device for Long-Term Activity Monitoring in Rare-Event Sensing Applications

Directory of Open Access Journals (Sweden)

Daejin Park

2014-01-01

Full Text Available A specially designed sensor processor used as a main processor in IoT (internet-of-thing device for the rare-event sensing applications is proposed. The IoT device including the proposed sensor processor performs the event-driven sensor data processing based on an accuracy-energy configurable event-quantization in architectural level. The received sensor signal is converted into a sequence of atomic events, which is extracted by the signal-to-atomic-event generator (AEG. Using an event signal processing unit (EPU as an accelerator, the extracted atomic events are analyzed to build the final event. Instead of the sampled raw data transmission via internet, the proposed method delays the communication with a host system until a semantic pattern of the signal is identified as a final event. The proposed processor is implemented on a single chip, which is tightly coupled in bus connection level with a microcontroller using a 0.18 μm CMOS embedded-flash process. For experimental results, we evaluated the proposed sensor processor by using an IR- (infrared radio- based signal reflection and sensor signal acquisition system. We successfully demonstrated that the expected power consumption is in the range of 20% to 50% compared to the result of the basement in case of allowing 10% accuracy error.
Fast robot kinematics modeling by using a parallel simulator (PSIM)

International Nuclear Information System (INIS)

El-Gazzar, H.M.; Ayad, N.M.A.

2002-01-01

High-speed computers are strongly needed not only for solving scientific and engineering problems, but also for numerous industrial applications. Such applications include computer-aided design, oil exploration, weather predication, space applications and safety of nuclear reactors. The rapid development in VLSI technology makes it possible to implement time consuming algorithms in real-time situations. Parallel processing approaches can now be used to reduce the processing-time for models of very high mathematical structure such as the kinematics molding of robot manipulator. This system is used to construct and evaluate the performance and cost effectiveness of several proposed methods to solve the Jacobian algorithm. Parallelism is introduced to the algorithms by using different task-allocations and dividing the whole job into sub tasks. Detailed analysis is performed and results are obtained for the case of six DOF (degree of freedom) robot arms (Stanford Arm). Execution times comparisons between Von Neumann (uni processor) and parallel processor architectures by using parallel simulator package (PSIM) are presented. The gained results are much in favour for the parallel techniques by at least fifty-percent improvements. Of course, further studies are needed to achieve the convenient and optimum number of processors has to be done
Fast robot kinematics modeling by using a parallel simulator (PSIM)

Energy Technology Data Exchange (ETDEWEB)

El-Gazzar, H M; Ayad, N M.A. [Atomic Energy Authority, Reactor Dept., Computer and Control Lab., P.O. Box no 13759 (Egypt)

2002-09-15

High-speed computers are strongly needed not only for solving scientific and engineering problems, but also for numerous industrial applications. Such applications include computer-aided design, oil exploration, weather predication, space applications and safety of nuclear reactors. The rapid development in VLSI technology makes it possible to implement time consuming algorithms in real-time situations. Parallel processing approaches can now be used to reduce the processing-time for models of very high mathematical structure such as the kinematics molding of robot manipulator. This system is used to construct and evaluate the performance and cost effectiveness of several proposed methods to solve the Jacobian algorithm. Parallelism is introduced to the algorithms by using different task-allocations and dividing the whole job into sub tasks. Detailed analysis is performed and results are obtained for the case of six DOF (degree of freedom) robot arms (Stanford Arm). Execution times comparisons between Von Neumann (uni processor) and parallel processor architectures by using parallel simulator package (PSIM) are presented. The gained results are much in favour for the parallel techniques by at least fifty-percent improvements. Of course, further studies are needed to achieve the convenient and optimum number of processors has to be done.
Optical Associative Processors For Visual Perception"

Science.gov (United States)

Casasent, David; Telfer, Brian

1988-05-01

We consider various associative processor modifications required to allow these systems to be used for visual perception, scene analysis, and object recognition. For these applications, decisions on the class of the objects present in the input image are required and thus heteroassociative memories are necessary (rather than the autoassociative memories that have been given most attention). We analyze the performance of both associative processors and note that there is considerable difference between heteroassociative and autoassociative memories. We describe associative processors suitable for realizing functions such as: distortion invariance (using linear discriminant function memory synthesis techniques), noise and image processing performance (using autoassociative memories in cascade with with a heteroassociative processor and with a finite number of autoassociative memory iterations employed), shift invariance (achieved through the use of associative processors operating on feature space data), and the analysis of multiple objects in high noise (which is achieved using associative processing of the output from symbolic correlators). We detail and provide initial demonstrations of the use of associative processors operating on iconic, feature space and symbolic data, as well as adaptive associative processors.
Application of parallelized software architecture to an autonomous ground vehicle

Science.gov (United States)

Shakya, Rahul; Wright, Adam; Shin, Young Ho; Momin, Orko; Petkovsek, Steven; Wortman, Paul; Gautam, Prasanna; Norton, Adam

2011-01-01

This paper presents improvements made to Q, an autonomous ground vehicle designed to participate in the Intelligent Ground Vehicle Competition (IGVC). For the 2010 IGVC, Q was upgraded with a new parallelized software architecture and a new vision processor. Improvements were made to the power system reducing the number of batteries required for operation from six to one. In previous years, a single state machine was used to execute the bulk of processing activities including sensor interfacing, data processing, path planning, navigation algorithms and motor control. This inefficient approach led to poor software performance and made it difficult to maintain or modify. For IGVC 2010, the team implemented a modular parallel architecture using the National Instruments (NI) LabVIEW programming language. The new architecture divides all the necessary tasks - motor control, navigation, sensor data collection, etc. into well-organized components that execute in parallel, providing considerable flexibility and facilitating efficient use of processing power. Computer vision is used to detect white lines on the ground and determine their location relative to the robot. With the new vision processor and some optimization of the image processing algorithm used last year, two frames can be acquired and processed in 70ms. With all these improvements, Q placed 2nd in the autonomous challenge.
AMD's 64-bit Opteron processor

CERN Multimedia

CERN. Geneva

2003-01-01

This talk concentrates on issues that relate to obtaining peak performance from the Opteron processor. Compiler options, memory layout, MPI issues in multi-processor configurations and the use of a NUMA kernel will be covered. A discussion of recent benchmarking projects and results will also be included.BiographiesDavid RichDavid directs AMD's efforts in high performance computing and also in the use of Opteron processors...
Composable processor virtualization for embedded systems

NARCIS (Netherlands)

Molnos, A.M.; Milutinovic, A.; She, D.; Goossens, K.G.W.

2010-01-01

Processor virtualization divides a physical processor's time among a set of virual machines, enabling efficient hardware utilization, application security and allowing co-existence of different operating systems on the same processor. Through initially intended for the server domain, virtualization
Benchmarking Data Analysis and Machine Learning Applications on the Intel KNL Many-Core Processor

OpenAIRE

Byun, Chansup; Kepner, Jeremy; Arcand, William; Bestor, David; Bergeron, Bill; Gadepally, Vijay; Houle, Michael; Hubbell, Matthew; Jones, Michael; Klein, Anna; Michaleas, Peter; Milechin, Lauren; Mullen, Julie; Prout, Andrew; Rosa, Antonio

2017-01-01

Knights Landing (KNL) is the code name for the second-generation Intel Xeon Phi product family. KNL has generated significant interest in the data analysis and machine learning communities because its new many-core architecture targets both of these workloads. The KNL many-core vector processor design enables it to exploit much higher levels of parallelism. At the Lincoln Laboratory Supercomputing Center (LLSC), the majority of users are running data analysis applications such as MATLAB and O...
International Conference on VLSI, Communication, Advanced Devices, Signals & Systems and Networking

CERN Document Server

Shirur, Yasha; Prasad, Rekha

2013-01-01

This book is a collection of papers presented by renowned researchers, keynote speakers and academicians in the International Conference on VLSI, Communication, Analog Designs, Signals and Systems, and Networking (VCASAN-2013), organized by B.N.M. Institute of Technology, Bangalore, India during July 17-19, 2013. The book provides global trends in cutting-edge technologies in electronics and communication engineering. The content of the book is useful to engineers, researchers and academicians as well as industry professionals.
Built-in self-repair of VLSI memories employing neural nets

Science.gov (United States)

Mazumder, Pinaki

1998-10-01

The decades of the Eighties and the Nineties have witnessed the spectacular growth of VLSI technology, when the chip size has increased from a few hundred devices to a staggering multi-millon transistors. This trend is expected to continue as the CMOS feature size progresses towards the nanometric dimension of 100 nm and less. SIA roadmap projects that, where as the DRAM chips will integrate over 20 billion devices in the next millennium, the future microprocessors may incorporate over 100 million transistors on a single chip. As the VLSI chip size increase, the limited accessibility of circuit components poses great difficulty for external diagnosis and replacement in the presence of faulty components. For this reason, extensive work has been done in built-in self-test techniques, but little research is known concerning built-in self-repair. Moreover, the extra hardware introduced by conventional fault-tolerance techniques is also likely to become faulty, therefore causing the circuit to be useless. This research demonstrates the feasibility of implementing electronic neural networks as intelligent hardware for memory array repair. Most importantly, we show that the neural network control possesses a robust and degradable computing capability under various fault conditions. Overall, a yield analysis performed on 64K DRAM's shows that the yield can be improved from as low as 20 percent to near 99 percent due to the self-repair design, with overhead no more than 7 percent.
Multiprocessor architecture: Synthesis and evaluation

Science.gov (United States)

Standley, Hilda M.

1990-01-01

Multiprocessor computed architecture evaluation for structural computations is the focus of the research effort described. Results obtained are expected to lead to more efficient use of existing architectures and to suggest designs for new, application specific, architectures. The brief descriptions given outline a number of related efforts directed toward this purpose. The difficulty is analyzing an existing architecture or in designing a new computer architecture lies in the fact that the performance of a particular architecture, within the context of a given application, is determined by a number of factors. These include, but are not limited to, the efficiency of the computation algorithm, the programming language and support environment, the quality of the program written in the programming language, the multiplicity of the processing elements, the characteristics of the individual processing elements, the interconnection network connecting processors and non-local memories, and the shared memory organization covering the spectrum from no shared memory (all local memory) to one global access memory. These performance determiners may be loosely classified as being software or hardware related. This distinction is not clear or even appropriate in many cases. The effect of the choice of algorithm is ignored by assuming that the algorithm is specified as given. Effort directed toward the removal of the effect of the programming language and program resulted in the design of a high-level parallel programming language. Two characteristics of the fundamental structure of the architecture (memory organization and interconnection network) are examined.
How to harness the performance potential of current multi-core processors

International Nuclear Information System (INIS)

Jarp, Sverre; Lazzaro, Alfio; Leduc, Julien; Nowak, Andrzej

2011-01-01

Leakage currents have put a stop to the semiconductor industry's ability to increase processor frequency in order to enhance the performance of new microprocessors. Instead, we observe a slew of changes inside the micro-architecture with an aim of enhancing the performance. Several of these changes, however, do not translate into automatic speed improvements for the software. This paper discusses the increased complexity of modern microprocessors by separating out into dimensions each feature that impacts performance and mentions briefly ways of improving software, in particular that of the High Energy Physics community, to take full advantage.
A scalable parallel open architecture data acquisition system for low to high rate experiments, test beams and all SSC [Superconducting Super Collider] detectors

International Nuclear Information System (INIS)

Barsotti, E.; Booth, A.; Bowden, M.; Swoboda, C.; Lockyer, N.; VanBerg, R.

1989-12-01

A new era of high-energy physics research is beginning requiring accelerators with much higher luminosities and interaction rates in order to discover new elementary particles. As a consequences, both orders of magnitude higher data rates from the detector and online processing power, well beyond the capabilities of current high energy physics data acquisition systems, are required. This paper describes a new data acquisition system architecture which draws heavily from the communications industry, is totally parallel (i.e., without any bottlenecks), is capable of data rates of hundreds of GigaBytes per second from the detector and into an array of online processors (i.e., processor farm), and uses an open systems architecture to guarantee compatibility with future commercially available online processor farms. The main features of the system architecture are standard interface ICs to detector subsystems wherever possible, fiber optic digital data transmission from the near-detector electronics, a self-routing parallel event builder, and the use of industry-supported and high-level language programmable processors in the proposed BCD system for both triggers and online filters. A brief status report of an ongoing project at Fermilab to build the self-routing parallel event builder will also be given in the paper. 3 figs., 1 tab
Deterministic chaos in the processor load

International Nuclear Information System (INIS)

Halbiniak, Zbigniew; Jozwiak, Ireneusz J.

2007-01-01

In this article we present the results of research whose purpose was to identify the phenomenon of deterministic chaos in the processor load. We analysed the time series of the processor load during efficiency tests of database software. Our research was done on a Sparc Alpha processor working on the UNIX Sun Solaris 5.7 operating system. The conducted analyses proved the presence of the deterministic chaos phenomenon in the processor load in this particular case
Adaptive Motion Estimation Processor for Autonomous Video Devices

Directory of Open Access Journals (Sweden)

Dias T

2007-01-01

Full Text Available Motion estimation is the most demanding operation of a video encoder, corresponding to at least 80% of the overall computational cost. As a consequence, with the proliferation of autonomous and portable handheld devices that support digital video coding, data-adaptive motion estimation algorithms have been required to dynamically configure the search pattern not only to avoid unnecessary computations and memory accesses but also to save energy. This paper proposes an application-specific instruction set processor (ASIP to implement data-adaptive motion estimation algorithms that is characterized by a specialized datapath and a minimum and optimized instruction set. Due to its low-power nature, this architecture is highly suitable to develop motion estimators for portable, mobile, and battery-supplied devices. Based on the proposed architecture and the considered adaptive algorithms, several motion estimators were synthesized both for a Virtex-II Pro XC2VP30 FPGA from Xilinx, integrated within an ML310 development platform, and using a StdCell library based on a 0.18 μm CMOS process. Experimental results show that the proposed architecture is able to estimate motion vectors in real time for QCIF and CIF video sequences with a very low-power consumption. Moreover, it is also able to adapt the operation to the available energy level in runtime. By adjusting the search pattern and setting up a more convenient operating frequency, it can change the power consumption in the interval between 1.6 mW and 15 mW.
Systolic trees and systolic language recognition by tree automata

Energy Technology Data Exchange (ETDEWEB)

Steinby, M

1983-01-01

K. Culik II, J. Gruska, A. Salomaa and D. Wood have studied the language recognition capabilities of certain types of systolically operating networks of processors (see research reports Cs-81-32, Cs-81-36 and Cs-82-01, Univ. of Waterloo, Ontario, Canada). In this paper, their model for systolic VLSI trees is formalised in terms of standard tree automaton theory, and the way in which some known facts about recognisable forests and tree transductions can be applied in VLSI tree theory is demonstrated. 13 references.
Advanced parallel processing with supercomputer architectures

International Nuclear Information System (INIS)

Hwang, K.

1987-01-01

This paper investigates advanced parallel processing techniques and innovative hardware/software architectures that can be applied to boost the performance of supercomputers. Critical issues on architectural choices, parallel languages, compiling techniques, resource management, concurrency control, programming environment, parallel algorithms, and performance enhancement methods are examined and the best answers are presented. The authors cover advanced processing techniques suitable for supercomputers, high-end mainframes, minisupers, and array processors. The coverage emphasizes vectorization, multitasking, multiprocessing, and distributed computing. In order to achieve these operation modes, parallel languages, smart compilers, synchronization mechanisms, load balancing methods, mapping parallel algorithms, operating system functions, application library, and multidiscipline interactions are investigated to ensure high performance. At the end, they assess the potentials of optical and neural technologies for developing future supercomputers

An efficient optical architecture for sparsely connected neural networks

Science.gov (United States)

Hine, Butler P., III; Downie, John D.; Reid, Max B.

1990-01-01

An architecture for general-purpose optical neural network processor is presented in which the interconnections and weights are formed by directing coherent beams holographically, thereby making use of the space-bandwidth products of the recording medium for sparsely interconnected networks more efficiently that the commonly used vector-matrix multiplier, since all of the hologram area is in use. An investigation is made of the use of computer-generated holograms recorded on such updatable media as thermoplastic materials, in order to define the interconnections and weights of a neural network processor; attention is given to limits on interconnection densities, diffraction efficiencies, and weighing accuracies possible with such an updatable thin film holographic device.
Multi-threaded ATLAS simulation on Intel Knights Landing processors

Science.gov (United States)

Farrell, Steven; Calafiura, Paolo; Leggett, Charles; Tsulaia, Vakhtang; Dotti, Andrea; ATLAS Collaboration

2017-10-01

The Knights Landing (KNL) release of the Intel Many Integrated Core (MIC) Xeon Phi line of processors is a potential game changer for HEP computing. With 72 cores and deep vector registers, the KNL cards promise significant performance benefits for highly-parallel, compute-heavy applications. Cori, the newest supercomputer at the National Energy Research Scientific Computing Center (NERSC), was delivered to its users in two phases with the first phase online at the end of 2015 and the second phase now online at the end of 2016. Cori Phase 2 is based on the KNL architecture and contains over 9000 compute nodes with 96GB DDR4 memory. ATLAS simulation with the multithreaded Athena Framework (AthenaMT) is a good potential use-case for the KNL architecture and supercomputers like Cori. ATLAS simulation jobs have a high ratio of CPU computation to disk I/O and have been shown to scale well in multi-threading and across many nodes. In this paper we will give an overview of the ATLAS simulation application with details on its multi-threaded design. Then, we will present a performance analysis of the application on KNL devices and compare it to a traditional x86 platform to demonstrate the capabilities of the architecture and evaluate the benefits of utilizing KNL platforms like Cori for ATLAS production.
Centaure: an heterogeneous parallel architecture for computer vision

International Nuclear Information System (INIS)

Peythieux, Marc

1997-01-01

This dissertation deals with the architecture of parallel computers dedicated to computer vision. In the first chapter, the problem to be solved is presented, as well as the architecture of the Sympati and Symphonie computers, on which this work is based. The second chapter is about the state of the art of computers and integrated processors that can execute computer vision and image processing codes. The third chapter contains a description of the architecture of Centaure. It has an heterogeneous structure: it is composed of a multiprocessor system based on Analog Devices ADSP21060 Sharc digital signal processor, and of a set of Symphonie computers working in a multi-SIMD fashion. Centaure also has a modular structure. Its basic node is composed of one Symphonie computer, tightly coupled to a Sharc thanks to a dual ported memory. The nodes of Centaure are linked together by the Sharc communication links. The last chapter deals with a performance validation of Centaure. The execution times on Symphonie and on Centaure of a benchmark which is typical of industrial vision, are presented and compared. In the first place, these results show that the basic node of Centaure allows a faster execution than Symphonie, and that increasing the size of the tested computer leads to a better speed-up with Centaure than with Symphonie. In the second place, these results validate the choice of running the low level structure of Centaure in a multi- SIMD fashion. (author) [fr
Parallel algorithms for placement and routing in VLSI design. Ph.D. Thesis

Science.gov (United States)

Brouwer, Randall Jay

1991-01-01

The computational requirements for high quality synthesis, analysis, and verification of very large scale integration (VLSI) designs have rapidly increased with the fast growing complexity of these designs. Research in the past has focused on the development of heuristic algorithms, special purpose hardware accelerators, or parallel algorithms for the numerous design tasks to decrease the time required for solution. Two new parallel algorithms are proposed for two VLSI synthesis tasks, standard cell placement and global routing. The first algorithm, a parallel algorithm for global routing, uses hierarchical techniques to decompose the routing problem into independent routing subproblems that are solved in parallel. Results are then presented which compare the routing quality to the results of other published global routers and which evaluate the speedups attained. The second algorithm, a parallel algorithm for cell placement and global routing, hierarchically integrates a quadrisection placement algorithm, a bisection placement algorithm, and the previous global routing algorithm. Unique partitioning techniques are used to decompose the various stages of the algorithm into independent tasks which can be evaluated in parallel. Finally, results are presented which evaluate the various algorithm alternatives and compare the algorithm performance to other placement programs. Measurements are presented on the parallel speedups available.
PLA realizations for VLSI state machines

Science.gov (United States)

Gopalakrishnan, S.; Whitaker, S.; Maki, G.; Liu, K.

1990-01-01

A major problem associated with state assignment procedures for VLSI controllers is obtaining an assignment that produces minimal or near minimal logic. The key item in Programmable Logic Array (PLA) area minimization is the number of unique product terms required by the design equations. This paper presents a state assignment algorithm for minimizing the number of product terms required to implement a finite state machine using a PLA. Partition algebra with predecessor state information is used to derive a near optimal state assignment. A maximum bound on the number of product terms required can be obtained by inspecting the predecessor state information. The state assignment algorithm presented is much simpler than existing procedures and leads to the same number of product terms or less. An area-efficient PLA structure implemented in a 1.0 micron CMOS process is presented along with a summary of the performance for a controller implemented using this design procedure.
Critical Path Driven Cosynthesis for Heterogeneous Target Architectures

DEFF Research Database (Denmark)

Bjørn-Jørgensen, Peter; Madsen, Jan

1997-01-01

This paper presents a critical path driven algorithm to produce a static schedule of a single-rate system onto a heterogeneous target architecture. Our algorithm is a list based scheduling algorithm which concurrently assigns tasks to processors and allocates nets to interprocessor communication........ Experimental results show that our algorithm is able to find good results, as compared to other methods, in small amount of CPU time....
A NEW OS ARCHITECTURE FOR IOT

Directory of Open Access Journals (Sweden)

Jean Y. Astier

2018-03-01

Full Text Available Current computer operating systems architectures are not well suited for the coming world of connected objects, known as the Internet of Things (IoT for multiple reasons: poor communication performances in both point-to-point and broadcast cases, poor operational reliability and network security, excessive requirements both in terms of processor power and memory size leading to excessive electrical power consumption. We introduce a new computer operating system architecture well adapted to IoT, from the most modest to the most complex, and more generally able to significantly raise the input/output capacities of any communicating computer. This architecture rests on the principles of the Von Neumann hardware model, and is composed of two types of asymmetric distributed containers, which communicate by message passing. We describe the sub-systems of both of these types of containers, where each sub-system has its own scheduler, and a dedicated execution level.
Neurovision processor for designing intelligent sensors

Science.gov (United States)

Gupta, Madan M.; Knopf, George K.

1992-03-01

A programmable multi-task neuro-vision processor, called the Positive-Negative (PN) neural processor, is proposed as a plausible hardware mechanism for constructing robust multi-task vision sensors. The computational operations performed by the PN neural processor are loosely based on the neural activity fields exhibited by certain nervous tissue layers situated in the brain. The neuro-vision processor can be programmed to generate diverse dynamic behavior that may be used for spatio-temporal stabilization (STS), short-term visual memory (STVM), spatio-temporal filtering (STF) and pulse frequency modulation (PFM). A multi- functional vision sensor that performs a variety of information processing operations on time- varying two-dimensional sensory images can be constructed from a parallel and hierarchical structure of numerous individually programmed PN neural processors.
An engineering methodology for implementing and testing VLSI (Very Large Scale Integrated) circuits

Science.gov (United States)

Corliss, Walter F., II

1989-03-01

The engineering methodology for producing a fully tested VLSI chip from a design layout is presented. A 16-bit correlator, NPS CORN88, that was previously designed, was used as a vehicle to demonstrate this methodology. The study of the design and simulation tools, MAGIC and MOSSIM II, was the focus of the design and validation process. The design was then implemented and the chip was fabricated by MOSIS. This fabricated chip was then used to develop a testing methodology for using the digital test facilities at NPS. NPS CORN88 was the first full custom VLSI chip, designed at NPS, to be tested with the NPS digital analysis system, Tektronix DAS 9100 series tester. The capabilities and limitations of these test facilities are examined. NPS CORN88 test results are included to demonstrate the capabilities of the digital test system. A translator, MOS2DAS, was developed to convert the MOSSIM II simulation program to the input files required by the DAS 9100 device verification software, 91DVS. Finally, a tutorial for using the digital test facilities, including the DAS 9100 and associated support equipments, is included as an appendix.
Hardware architecture for projective model calculation and false match refining using random sample consensus algorithm

Science.gov (United States)

Azimi, Ehsan; Behrad, Alireza; Ghaznavi-Ghoushchi, Mohammad Bagher; Shanbehzadeh, Jamshid

2016-11-01

The projective model is an important mapping function for the calculation of global transformation between two images. However, its hardware implementation is challenging because of a large number of coefficients with different required precisions for fixed point representation. A VLSI hardware architecture is proposed for the calculation of a global projective model between input and reference images and refining false matches using random sample consensus (RANSAC) algorithm. To make the hardware implementation feasible, it is proved that the calculation of the projective model can be divided into four submodels comprising two translations, an affine model and a simpler projective mapping. This approach makes the hardware implementation feasible and considerably reduces the required number of bits for fixed point representation of model coefficients and intermediate variables. The proposed hardware architecture for the calculation of a global projective model using the RANSAC algorithm was implemented using Verilog hardware description language and the functionality of the design was validated through several experiments. The proposed architecture was synthesized by using an application-specific integrated circuit digital design flow utilizing 180-nm CMOS technology as well as a Virtex-6 field programmable gate array. Experimental results confirm the efficiency of the proposed hardware architecture in comparison with software implementation.
Evaluation of the Intel iWarp parallel processor for space flight applications

Science.gov (United States)

Hine, Butler P., III; Fong, Terrence W.

1993-01-01

The potential of a DARPA-sponsored advanced processor, the Intel iWarp, for use in future SSF Data Management Systems (DMS) upgrades is evaluated through integration into the Ames DMS testbed and applications testing. The iWarp is a distributed, parallel computing system well suited for high performance computing applications such as matrix operations and image processing. The system architecture is modular, supports systolic and message-based computation, and is capable of providing massive computational power in a low-cost, low-power package. As a consequence, the iWarp offers significant potential for advanced space-based computing. This research seeks to determine the iWarp's suitability as a processing device for space missions. In particular, the project focuses on evaluating the ease of integrating the iWarp into the SSF DMS baseline architecture and the iWarp's ability to support computationally stressing applications representative of SSF tasks.
Vector and parallel processors in computational science. Proceedings

Energy Technology Data Exchange (ETDEWEB)

Duff, I S; Reid, J K

1985-01-01

This volume contains papers from most of the invited talks and from several of the contributed talks and poster sessions presented at VAPP II. The contents present an extensive coverage of all important aspects of vector and parallel processors, including hardware, languages, numerical algorithms and applications. The topics covered include descriptions of new machines (both research and commercial machines), languages and software aids, and general discussions of whole classes of machines and their uses. Numerical methods papers include Monte Carlo algorithms, iterative and direct methods for solving large systems, finite elements, optimization, random number generation and mathematical software. The specific applications covered include neutron diffusion calculations, molecular dynamics, weather forecasting, lattice gauge calculations, fluid dynamics, flight simulation, cartography, image processing and cryptography. Most machines and architecture types are being used for these applications. many refs.
VIRTUS: a multi-processor system in FASTBUS

International Nuclear Information System (INIS)

Ellett, J.; Jackson, R.; Ritter, R.; Schlein, P.; Yaeger, D.; Zweizig, J.

1986-01-01

VIRTUS is a system of parallel MC68000-based processors interconnected by FASTBUS that is used either on-line as an intelligent trigger component or off-line for full event processing. Each processor receives the complete set of data from one event. The host computer, a VAX 11/780, down-line loads all software to the processors, controls and monitors the functioning of all processors, and writes processed data to tape. Instructions, programs, and data are transferred among the processors and the host in the form of fixed format, variable length data blocks. (Auth.)
A Heterogeneous Multi-core Architecture with a Hardware Kernel for Control Systems

DEFF Research Database (Denmark)

Li, Gang; Guan, Wei; Sierszecki, Krzysztof

2012-01-01

Rapid industrialisation has resulted in a demand for improved embedded control systems with features such as predictability, high processing performance and low power consumption. Software kernel implementation on a single processor is becoming more difficult to satisfy those constraints. This pa......Rapid industrialisation has resulted in a demand for improved embedded control systems with features such as predictability, high processing performance and low power consumption. Software kernel implementation on a single processor is becoming more difficult to satisfy those constraints......). Second, a heterogeneous multi-core architecture is investigated, focusing on its performance in relation to hard real-time constraints and predictable behavior. Third, the hardware implementation of HARTEX is designated to support the heterogeneous multi-core architecture. This hardware kernel has...... several advantages over a similar kernel implemented in software: higher-speed processing capability, parallel computation, and separation between the kernel itself and the applications being run. A microbenchmark has been used to compare the hardware kernel with the software kernel, and compare...
Trust-Management, Intrusion-Tolerance, Accountability, and Reconstitution Architecture (TIARA)

Science.gov (United States)

2009-12-01

Tainting, tagged, metadata, architecture, hardware, processor, microkernel , zero-kernel, co-design 16. SECURITY CLASSIFICATION OF: 17. LIMITATION OF... microkernels (e.g., [27]) embraced the idea that it was beneficial to reduce the ker- nel, separating out services as separate processes isolated from...limited adoption. More recently Tanenbaum [72] notes the security virtues of microkernels and suggests the modern importance of security makes it
Design Implementation and Testing of a VLSI High Performance ASIC for Extracting the Phase of a Complex Signal

National Research Council Canada - National Science Library

Altmeyer, Ronald

2002-01-01

This thesis documents the research, circuit design, and simulation testing of a VLSI ASIC which extracts phase angle information from a complex sampled signal using the arctangent relationship: (phi=tan/-1 (Q/1...
A lock circuit for a multi-core processor

DEFF Research Database (Denmark)

2015-01-01

An integrated circuit comprising a multiple processor cores and a lock circuit that comprises a queue register with respective bits set or reset via respective, connections dedicated to respective processor cores, whereby the queue register identifies those among the multiple processor cores...... that are enqueued in the queue register. Furthermore, the integrated circuit comprises a current register and a selector circuit configured to select a processor core and identify that processor core by a value in the current register. A selected processor core is a prioritized processor core among the cores...... configured with an integrated circuit; and a silicon die configured with an integrated circuit....
The definitive guide to ARM Cortex-M3 and Cortex-M4 processors

CERN Document Server

Yiu, Joseph

2013-01-01

This book presents the background of the ARM architecture and outlines the features of the processors such as the instruction set, interrupt-handling and also demonstrates how to program and utilize the advanced features available such as the Memory Protection Unit (MPU). Chapters on getting started with IAR, Keil, gcc and CooCox CoIDE tools help beginners develop program codes. Coverage also includes the important areas of software development such as using the low power features, handling information input/output, mixed language projects with assembly and C, and other advanced topics. Tw
Design of Networks-on-Chip for Real-Time Multi-Processor Systems-on-Chip

DEFF Research Database (Denmark)

Sparsø, Jens

2012-01-01

This paper addresses the design of networks-on-chips for use in multi-processor systems-on-chips - the hardware platforms used in embedded systems. These platforms typically have to guarantee real-time properties, and as the network is a shared resource, it has to provide service guarantees...... (bandwidth and/or latency) to different communication flows. The paper reviews some past work in this field and the lessons learned, and the paper discusses ongoing research conducted as part of the project "Time-predictable Multi-Core Architecture for Embedded Systems" (T-CREST), supported by the European...
A Scalable Architecture for VoIP Conferencing

Directory of Open Access Journals (Sweden)

R Venkatesha Prasad

2003-10-01

Full Text Available Real-Time services are traditionally supported on circuit switched network. However, there is a need to port these services on packet switched network. Architecture for audio conferencing application over the Internet in the light of ITU-T H.323 recommendations is considered. In a conference, considering packets only from a set of selected clients can reduce speech quality degradation because mixing packets from all clients can lead to lack of speech clarity. A distributed algorithm and architecture for selecting clients for mixing is suggested here based on a new quantifier of the voice activity called "Loudness Number" (LN. The proposed system distributes the computation load and reduces the load on client terminals. The highlights of this architecture are scalability, bandwidth saving and speech quality enhancement. Client selection for playing out tries to mimic a physical conference where the most vocal participants attract more attention. The contributions of the paper are expected to aid H.323 recommendations implementations for Multipoint Processors (MP. A working prototype based on the proposed architecture is already functional.

Some links on this page may take you to non-federal websites. Their policies may differ from this site.